Methodology
Research Design and Data Source
This cross-sectional, quantitative study used secondary data analysis to pursue two main objectives. First, it estimated the causal effect of gender on mental health while adjusting for confounding factors with several statistical methods. Second, it developed and validated predictive machine learning models for the early detection of at-risk students. The data came from the Student Mental Health dataset, obtained from Kaggle (Shariful07, 2020), which comprises 101 university students and 11 variables. This publicly available dataset contains responses from students at various Malaysian universities, collected with structured self-report questionnaires that measure mental health status and demographic variables.
During the initial data quality evaluation, one observation with missing values in more than one variable was removed by listwise deletion, leaving a final analysis sample of 100 students. Although this sample is small by machine learning standards, it permitted rigorous application of the causal inference techniques, with adequate statistical power to detect moderate to large effect sizes at conventional significance levels.
Variables and Operational Definitions
Three binary mental health outcomes served as dependent variables. Depression status was operationalized as a student-reported clinical diagnosis of, or treatment for, depression. Anxiety status was likewise defined as a clinical diagnosis of, or treatment for, an anxiety disorder. Panic attack status indicated that the student had experienced panic attacks requiring clinical treatment. Although these binary operationalizations simplify the continuum of mental health symptomatology, they correspond to clinical decision-making thresholds and ease the interpretation of causal effects as risk differences and odds ratios.
Gender was the primary exposure variable in the causal analyses and was coded as a binary variable, with male as the index category and female as the reference category. Although gender is a multidimensional construct with biological, psychological, and social aspects, the dataset represents it as a dichotomous variable based on sex assigned at birth, which is a limitation of secondary data analysis.
Potential confounding variables were identified a priori from theory and previous empirical studies. Age, measured in years, is a fundamental demographic variable associated with both the gender distribution of university populations and mental health risk at different developmental stages. Year of study, coded categorically from first year to fourth year and above, reflects academic progression and its associated stressors. Cumulative grade point average (CGPA) measures academic performance on a standardized scale and can be both a cause and a consequence of mental health status. Marital status, dichotomized as single versus married or in a serious relationship, represents social support and life circumstances that may affect mental health. Course of study, coded categorically by academic subject, was examined as a possible instrumental variable because gender representation differs across academic subjects, while course is assumed to have no direct causal effect on mental health except through confounding mechanisms.
Data Preprocessing and Partitioning
Data were preprocessed systematically to ensure they were suitable for analysis. Quality checks covered the pattern of missing data, the presence of outliers assessed through distributional analysis, and logical consistency between related variables. Label encoding converted textual categories into numerical representations, with ordinal relationships preserved where necessary. Age was also transformed into categorical groups (less than 20 years, 20 to 22 years, and more than 22 years) to enable stratified analyses and to reduce reliance on parametric modeling assumptions.
Stratified random sampling divided the dataset into a training set of 80 observations and a test set of 20 observations. Stratification kept the distribution of the main outcome variable (depression) balanced across the training and test subsets, which reduces evaluation bias arising from differing outcome prevalence. A fixed random seed was set so that all analyses are reproducible and the reported results can be verified.
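The stratified 80/20 partition with a fixed seed can be illustrated with a dependency-free sketch. The function name `stratified_split`, the seed value, and the outcome counts below are hypothetical and chosen only for illustration; they are not taken from the study. In scikit-learn, `train_test_split` with its `stratify` and `random_state` arguments serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) that preserve the class proportions."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical outcome vector (illustrative only): 35 depressed, 65 not.
labels = [1] * 35 + [0] * 65
train_idx, test_idx = stratified_split(labels)
test_prevalence = sum(labels[i] for i in test_idx) / len(test_idx)
```

With these illustrative counts, the depression prevalence in the test set equals the prevalence in the full sample, which is the property the stratification is meant to secure.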
Causal Inference Framework
The causal inference component used five methodological approaches that rest on different identifying assumptions and offer different perspectives on confounding adjustment (He et al., 2024; Ku et al., 2024). Agreement among methods that rely on divergent assumptions provides evidence of robustness that does not depend on the limitations of any single approach (Akinkugbe et al., 2025).
Method 1 applied regression adjustment through multivariable logistic regression (Mishra & Kushwaha, 2024). The unadjusted model estimated the crude association between gender and depression with no covariates, giving a baseline that may be biased by confounding. The adjusted model added age, year of study, CGPA, and marital status as covariates to estimate the conditional association between gender and depression with the confounders held constant. This approach assumes that the functional form relating the covariates to the outcome is correctly specified and that there is no unmeasured confounding (Akinkugbe et al., 2025). Multicollinearity was assessed with variance inflation factors (VIF), with values below 2.5 taken to indicate acceptable collinearity (Pedregosa et al., 2011). Model discrimination, the ability of the model to distinguish depressed from non-depressed students, was measured by the area under the receiver operating characteristic curve (AUC-ROC) (Stapor et al., 2024). In addition, an ordinary least squares (OLS) regression that treated the binary depression indicator as a continuous dependent variable (a linear probability model) was estimated. Its coefficients can be read as absolute percentage point changes in the probability of depression, which makes them comparable to the risk differences produced by the propensity score techniques.
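The discrimination measure can be made concrete with a small sketch. AUC-ROC equals the probability that a randomly chosen depressed student receives a higher predicted score than a randomly chosen non-depressed student, with ties counted as one half. The function name `auc_roc` and the labels and scores below are hypothetical illustrations, not study data.

```python
def auc_roc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities (illustrative only).
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
auc = auc_roc(y_true, scores)
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.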
Method 2 used Mantel-Haenszel stratification to estimate odds ratios pooled across strata of a confounding variable (Mishra & Kushwaha, 2024). Students were stratified by age group, and stratum-specific odds ratios were calculated to measure the association between gender and depression within each age group. The Mantel-Haenszel pooled odds ratio is a weighted average of the stratum-specific odds ratios, with weights that reflect stratum size and the precision of each estimate. The method makes no parametric assumptions about functional form, but it requires adequate sample sizes within strata to produce stable results. Comparing the crude odds ratio with the Mantel-Haenszel adjusted odds ratio indicates how much confounding is attributable to the stratification variable (Akinkugbe et al., 2025).
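As a minimal sketch of the pooling step, the Mantel-Haenszel estimator sums a·d/n over strata in the numerator and b·c/n in the denominator, where each stratum is a 2×2 table of exposure by outcome. The function names and the cell counts below are hypothetical and do not come from the dataset.

```python
def mantel_haenszel_or(strata):
    """Pooled odds ratio from a list of (a, b, c, d) 2x2 tables.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined")
    return numerator / denominator


def crude_or(strata):
    """Odds ratio after collapsing all strata into a single 2x2 table."""
    a, b, c, d = (sum(cell) for cell in zip(*strata))
    return (a * d) / (b * c)


# Hypothetical counts for three age groups (illustrative only).
age_strata = [(4, 6, 2, 8), (6, 14, 3, 17), (5, 10, 4, 21)]
pooled = mantel_haenszel_or(age_strata)
crude = crude_or(age_strata)
```

When the pooled and crude values are close, as in this illustrative table, the stratification variable accounts for little confounding; a large gap would point the other way.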
Method 3 used direct standardization to obtain age-adjusted rates through established epidemiological procedures (Akinkugbe et al., 2025). Stratum-specific depression rates were calculated for each combination of age group and gender. These rates were then weighted by a standard population distribution (the overall age distribution of the sample) to produce standardized rates that remove the influence of differing age distributions between the genders. The standardized rate ratio compares depression rates between genders after adjusting for age structure and so estimates the relative risk on the rate scale.
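The standardization arithmetic can be sketched as follows: each gender's stratum-specific rates are averaged using the same standard age distribution as weights, and the two standardized rates are then divided. The function name, the age-group labels, and all rates and counts are hypothetical illustrations.

```python
def directly_standardized_rate(stratum_rates, standard_counts):
    """Weighted average of stratum rates using a standard population."""
    total = sum(standard_counts.values())
    return sum(
        stratum_rates[group] * count
        for group, count in standard_counts.items()
    ) / total


# Hypothetical standard population: overall age distribution of a sample.
standard = {"<20": 30, "20-22": 40, ">22": 30}

# Hypothetical stratum-specific depression rates (illustrative only).
female_rates = {"<20": 0.40, "20-22": 0.35, ">22": 0.30}
male_rates = {"<20": 0.20, "20-22": 0.25, ">22": 0.30}

female_std = directly_standardized_rate(female_rates, standard)
male_std = directly_standardized_rate(male_rates, standard)
standardized_rate_ratio = female_std / male_std
```

Because both genders are weighted by the same age distribution, any remaining difference between the standardized rates cannot be attributed to age structure.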
Method 4 applied propensity scores in three ways using current causal inference methods (Nakazawa et al., 2023; Ku et al., 2024). The propensity score, defined as the conditional probability of being female given the observed covariates, was estimated with a logistic regression model whose predictors were age, year of study, CGPA, marital status, and course. The distribution of propensity scores was examined to check common support (overlap) between the treatment groups, which is a requirement for valid causal inference (Nakazawa et al., 2023). First, one-to-one nearest neighbor matching without replacement paired each female student with the male student who had the closest propensity score, producing a balanced dataset with comparable covariate distributions across groups (Pedregosa et al., 2011). Balance was assessed with standardized mean differences (SMD), with values below 0.10 indicating acceptable balance. The treatment effect was estimated as the difference in mean depression between the matched treatment and control groups. Second, propensity score stratification divided the sample into five strata at the quintiles of the propensity score. Within-stratum treatment effects were estimated and combined across strata by precision-weighted averaging. Third, inverse probability of treatment weighting (IPTW) created a pseudo-population in which treatment assignment is independent of the measured confounders, by weighting each observation by the inverse of its propensity score (treated individuals) or by the inverse of one minus its propensity score (control individuals). Stabilized weights were computed to reduce variance, and weights were trimmed at the 99th percentile to limit the influence of extreme values. Weighted regression on the pseudo-population was then used to estimate treatment effects (Nakazawa et al., 2023).
Method 5 attempted instrumental variable (IV) estimation by two-stage least squares. Course of study was considered as a possible instrument for gender, which requires three assumptions to hold. The relevance assumption requires that the instrument predicts treatment assignment; it is assessed with the first-stage F-statistic, where values greater than 10 are taken to indicate sufficient instrument strength. The exclusion restriction requires that the instrument affects the outcome only through its influence on treatment, which cannot be tested empirically and must be supported by substantive reasoning. The independence assumption requires that the instrument is unrelated to unmeasured confounding factors. The first-stage regression estimated the effect of course on gender, and the second-stage regression predicted depression from the gender values fitted in the first stage. The linearmodels package provided a formal two-stage least squares implementation with correct standard error estimation.
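The stabilized weighting and trimming step can be sketched as below. The stabilized weight multiplies the inverse-probability weight by the marginal probability of the group actually observed. Two assumptions are made here and are not confirmed by the text: trimming is implemented as capping weights at the percentile value rather than dropping observations, and the percentile uses linear interpolation. Function names and all input values are hypothetical.

```python
import statistics


def stabilized_iptw(treatment, propensity):
    """Stabilized weights: P(T=1)/e for treated, P(T=0)/(1-e) for controls."""
    if any(not 0.0 < p < 1.0 for p in propensity):
        raise ValueError("propensity scores must lie strictly between 0 and 1")
    p_treated = sum(treatment) / len(treatment)  # marginal P(T=1)
    weights = []
    for t, p in zip(treatment, propensity):
        if t == 1:
            weights.append(p_treated / p)
        else:
            weights.append((1.0 - p_treated) / (1.0 - p))
    return weights


def cap_at_percentile(weights, percentile=99):
    """Cap weights at the given percentile (assumed meaning of trimming)."""
    cut_points = statistics.quantiles(weights, n=100, method="inclusive")
    cap = cut_points[percentile - 1]
    return [min(w, cap) for w in weights], cap


# Hypothetical treatment indicators (1 = female) and propensity scores.
treatment = [1, 1, 1, 0, 0, 0, 0, 0]
propensity = [0.8, 0.5, 0.2, 0.5, 0.2, 0.1, 0.4, 0.25]

weights = stabilized_iptw(treatment, propensity)
trimmed, cap = cap_at_percentile(weights)
```

In this illustration the treated student with a propensity score of 0.2 receives the largest weight, and capping pulls that single weight down to the 99th percentile value while leaving the others unchanged.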
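The two-stage logic and the first-stage F-statistic can be illustrated with a deliberately simplified sketch. It assumes a single binary instrument and no covariates, whereas course of study has several categories, so this is not the specification used in the study. All names and data are hypothetical. Running the second stage by hand reproduces the 2SLS point estimate but not valid standard errors, which is why a dedicated implementation is needed for inference.

```python
def ols_simple(x, y):
    """Intercept and slope from regressing y on a single regressor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def two_stage_least_squares(z, x, y):
    """2SLS point estimate of the effect of x on y using instrument z."""
    a1, b1 = ols_simple(z, x)               # first stage: x on z
    x_hat = [a1 + b1 * zi for zi in z]      # fitted treatment values
    _, effect = ols_simple(x_hat, y)        # second stage: y on fitted x
    return effect


def first_stage_f(z, x):
    """F-statistic for the single instrument in the first stage."""
    n = len(z)
    a1, b1 = ols_simple(z, x)
    mean_x = sum(x) / n
    ss_total = sum((xi - mean_x) ** 2 for xi in x)
    ss_resid = sum((xi - (a1 + b1 * zi)) ** 2 for zi, xi in zip(z, x))
    return (ss_total - ss_resid) / (ss_resid / (n - 2))


# Hypothetical data: z = course group, x = gender, y = depression.
z = [0, 0, 0, 0, 1, 1, 1, 1]
x = [0, 0, 0, 1, 0, 1, 1, 1]
y = [0, 0, 1, 0, 0, 1, 1, 0]

iv_effect = two_stage_least_squares(z, x, y)
f_stat = first_stage_f(z, x)
```

In this toy example the first-stage F-statistic falls well below 10, the situation in which the relevance assumption would be judged unmet and the IV estimate treated as unreliable.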
Machine Learning Classification Framework
The machine learning component trained three supervised classification algorithms that represent different modeling paradigms (Ahmad et al., 2023; Alkahtani et al., 2024). Logistic regression served as an interpretable baseline, modeling the log-odds of each mental health outcome as a linear function of the predictor variables (Pedregosa et al., 2011). This parametric method yields coefficients that can be interpreted directly as changes in log-odds (or, when exponentiated, as odds ratios), but it assumes that the relationships are linear and additive.
Random Forest is an ensemble learning method that builds many decision trees using bootstrap aggregation and random feature selection (Zafar & Wani, 2024; Al-Hakeim et al., 2024). Each tree is trained on a bootstrap sample of the training data, and a random subset of features is considered at each node split. This controlled randomness lowers the correlation between trees and improves generalization. Final predictions aggregate all trees by majority voting, which reduces variance without substantially increasing bias (Pedregosa et al., 2011). The implementation used 100 trees with default hyperparameters, including unrestricted maximum tree depth and a minimum of one sample per leaf, a configuration intended to balance model complexity against interpretability.
XGBoost (Extreme Gradient Boosting) implements gradient boosting, an iterative ensemble technique in which trees are added sequentially and each new tree is fitted to correct the errors of the trees already in the ensemble (Chen & Guestrin, 2016; Wang et al., 2024). The algorithm minimizes a regularized objective function that combines prediction error with model complexity penalties, which helps to avoid overfitting. XGBoost also includes several optimizations, such as second-order gradient information, efficient tree construction algorithms, and automatic handling of missing values (Zhu et al., 2022). The model was trained with 100 boosting rounds and a learning rate of 0.1, which controls how much each tree contributes to the final prediction and lets the model learn gradually. The maximum tree depth was set to six to balance model expressiveness against the risk of overfitting (Chen & Guestrin, 2016).
Models were trained on both the original dataset and the propensity score-matched dataset (Nakazawa et al., 2023). Training on the matched data made it possible to assess whether improved covariate balance, relative to the unmatched data, translates into better predictive performance, a potential gain that comes at the cost of a smaller and less informative sample. All models were implemented in Python with scikit-learn and XGBoost, which provide reproducible and well-tested implementations (Pedregosa et al., 2011; Chen & Guestrin, 2016).
Sensitivity Analysis and Robustness Assessment
The E-value was used to assess sensitivity to unmeasured confounding. The E-value is the minimum strength of association, on the risk ratio scale, that an unmeasured confounder would need to have with both the treatment and the outcome to fully explain away an observed association. Larger E-values indicate that a result is less likely to be invalidated by unmeasured confounding, because only a very strong unmeasured confounder could nullify it. E-values were computed for both the point estimates and the confidence interval bounds to provide best-case and worst-case sensitivity assessments. This analysis acknowledges the fundamental limitation of observational research, that unmeasured factors may bias causal effect estimates, and it quantifies how much concern is warranted.
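The E-value calculation can be sketched with the standard closed-form expression, E = RR + sqrt(RR × (RR − 1)), applied to the risk ratio or to its reciprocal when the ratio is below one. For a confidence interval, the bound closest to the null is used, and the E-value is 1 when the interval includes the null. The function names and the input values are hypothetical, and the sketch assumes the estimate is already on the risk ratio scale.

```python
import math


def e_value(risk_ratio):
    """E-value for a point estimate on the risk ratio scale."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


def e_value_for_ci(lower, upper):
    """E-value for the confidence limit closest to the null value of 1."""
    if lower <= 1.0 <= upper:
        return 1.0  # interval includes the null: no confounding needed
    closest = lower if lower > 1.0 else upper
    return e_value(closest)


# Hypothetical estimate and 95% interval (illustrative only).
point_e = e_value(2.0)
bound_e = e_value_for_ci(1.2, 3.3)
```

For the illustrative risk ratio of 2.0, an unmeasured confounder would need an association of roughly 3.4 with both gender and depression to explain the estimate away, while a much weaker confounder would suffice to move the interval bound to the null.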