I. Introduction
Estimating and understanding causal impacts and treatment efficacy is necessary for many commercial and policy decisions. In the past, the most common approach has been to analyze average impacts. However, thanks to the availability of huge datasets and recent developments in Machine Learning (ML) techniques, the causal-inference literature on the estimation of Heterogeneous Treatment Effects (HTEs) has gained a lot of attention [1,2,3,4,5]. The question of whether, and which, treatment to apply or allocate optimally is at the heart of this debate. Consequently, the assignment of an optimal treatment, where each treatment’s efficacy is assessed by its predicted treatment effect, is one of the main applications of detecting HTEs. In recent years, the economic literature on causal inference has grown markedly in popularity. Given the increased interest in patient-centered outcomes, HTE estimation and optimal treatment assignment have been a particular focus of the medical sector, but the potential for cost-effective use cases should not be overlooked in application areas such as staff management or the evaluation of governmental policies. Classical nonparametric techniques for estimating HTEs include kernel methods, nearest-neighbor matching, and series estimation. With few confounders these techniques predict well, but their performance deteriorates quickly as the number of covariates increases. This strengthens the case for ML techniques, which frequently require more observations for effective prediction performance yet generally outperform the conventional approaches when many covariates are involved. The literature on ML techniques for treatment effect estimation is expanding, raising the questions of which techniques work best for optimal treatment assignment and how to compare them on an actual dataset. This study therefore uses the counterfactual tree-based approach known as Virtual Twin Random Forest (VTRF) [6], the directly estimating tree-based approach known as Causal Forest (CF) [7], and the feed-forward neural-network-based approach known as Causal Net (CN) [8]. Additionally, we use a method based on [9] to assess how well these approaches predict heterogeneous treatment effects for optimal treatment assignment on an empirical dataset about the incentivization of manual work. A significant obstacle to assigning treatments optimally based on heterogeneous effect estimates is the winner’s curse. We therefore explain the idea of the winner’s curse, offer shrinkage techniques as a potential remedy, and assess how well these shrinkage techniques work when applied to the ML methods’ predictions on the empirical dataset.
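To make the winner’s curse concrete, consider the following toy simulation (our own illustration with arbitrary numbers, not part of the empirical analysis): even when every treatment-effect estimate is individually unbiased, the estimate belonging to the treatment that is selected for having the highest estimated effect is biased upward. Shrinking the estimates toward their common mean mitigates this bias; the shrinkage factor of 0.5 used here is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
true_effects = np.full(4, 1.0)  # four treatments, all equally effective
noise_sd = 0.5                  # estimation noise per treatment
n_sims = 10_000

selected_raw, selected_shrunk = [], []
for _ in range(n_sims):
    # unbiased but noisy effect estimates, one per treatment
    est = true_effects + rng.normal(0.0, noise_sd, size=4)
    # shrink each estimate halfway toward the grand mean
    shrunk = est.mean() + 0.5 * (est - est.mean())
    # "winner": the treatment with the highest estimated effect
    selected_raw.append(est.max())
    selected_shrunk.append(shrunk.max())

print(np.mean(selected_raw))     # ~1.51: overstates the true effect of 1.0
print(np.mean(selected_shrunk))  # ~1.26: shrinkage reduces the upward bias
```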
The remainder of this study is organized as follows: the background is presented in the following section, related works are listed in Section III, the materials and methods are covered in Section IV, the experimental analysis is carried out in Section V, and Section VI provides conclusions and plans for future research.
II. Background
This section outlines the background for estimating Conditional Average Treatment Effects¹ (CATEs) and HTEs using ML methods. Random-forest-based methods [10,11,12,13,14,15,16] are particularly effective for estimating treatment effects because they can capture complex interactions and nonlinearities in the data, which are common when analyzing HTEs. CF is advantageous because it leverages the flexibility and scalability of random forests while explicitly focusing on estimating treatment effects; by combining these estimates across multiple trees, CF provides an ensemble estimate of treatment effects. Similarly, CN offers flexibility and scalability in capturing complex relationships between covariates and outcomes, making it suitable for estimating HTEs in high-dimensional data. Unlike traditional neural networks that predict outcomes directly, CN jointly estimates treatment effects and outcomes using a feed-forward neural network with custom layers. This approach allows treatment effects to be estimated without explicitly observing untreated cases, which is often the situation in observational studies. The VTRF approach, in turn, is designed to estimate counterfactual outcomes: it constructs treatment effects by comparing observed outcomes for treated individuals with counterfactual outcomes predicted for untreated individuals. By creating synthetic “virtual twins”, this method enables researchers to approximate the counterfactual scenario in which everyone is untreated, facilitating the estimation of treatment effects. HTEs extend the notion of an average effect by emphasizing the diverse ways in which treatments affect individuals within a population; in essence, an HTE quantifies how the effect of a treatment varies depending on individual attributes. HTE analysis thereby recognizes that one-size-fits-all interventions may not be suitable, as treatments may have different impacts on different subgroups. Ensemble approaches additionally help to mitigate overfitting and improve generalization performance. By leveraging techniques such as regression trees, random forests, VTRF, CF, and CN, and by recognizing and accounting for individual differences in response to treatments, researchers can effectively estimate HTEs, tailor interventions to specific subgroups, and improve overall effectiveness.
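As a concrete illustration of the virtual-twins idea, the following minimal sketch (our own simplification, not the exact implementation of [6]) fits one random forest on the treated observations and one on the control observations, predicts both potential outcomes for every individual, and takes the difference as the estimated individual treatment effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def virtual_twin_effects(X, y, w, **rf_params):
    """Virtual-twins style CATE estimate for a binary treatment w.

    Predicts each individual's outcome under treatment and under
    control (the "virtual twin") and returns the difference.
    """
    rf_treated = RandomForestRegressor(**rf_params).fit(X[w == 1], y[w == 1])
    rf_control = RandomForestRegressor(**rf_params).fit(X[w == 0], y[w == 0])
    return rf_treated.predict(X) - rf_control.predict(X)
```

In [6], a second tree-based step is fitted on top of such estimates to identify subgroups; the sketch shows only the counterfactual-prediction step.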
IV. Materials and Methods
The empirical analysis that was done is described in this section. We will describe the model’s training and hyperparameter tuning, the dataset, and the method for comparing the best treatment assignment performance in terms of anticipated treatment impacts on an actual data set.
A. Data Analysis
The data used in this study come from [24], whose experiment was conducted in September 2021 over roughly two and a half weeks with a US-only sample recruited via Amazon MTurk. [24] investigated whether, in a follow-up MTurk study, an ML model that assigns individually optimal treatments can outperform one that simply assigns the overall best treatment. Here, we concentrate on the study’s initial phase and on the techniques used to contrast the various ML models. Prior to beginning the activity, participants were required to complete a lengthy survey about their social preferences, personality traits, and demographics. In addition to questions on age, gender, and educational attainment, it covered the Big 5 personality traits, risk aversion, competitiveness, altruism, positive reciprocity, risk preferences, and loss aversion. In the subsequent work task, participants could earn points over the course of 10 minutes by pressing the buttons “a” and “b”. A timer, their current score, and their current bonus were visible to them throughout. The incentive (or lack thereof) to press the buttons and accrue points was determined by randomly assigning each participant to one of six treatments or to the control group. The mean point values for the treated and untreated observations were 1898 and 1533, respectively. The scores for all treatments were distributed substantially above those of the control group (see Table 1; Wilcoxon rank-sum test at α < 1%). Figure 1 shows the average and median results. Interestingly, Wilcoxon rank-sum tests at the 5% level show that the points of treatments one, two, four, and five are distributed significantly above those of treatments three and six (see Table 1). Figure 2 shows the estimated probability density distributions of the points for the corresponding treatments. Since the reward in treatment three and in the control group was constant regardless of the participant’s point total, there is a density spike at 0; this spike also explains the sizable difference between the mean and the median. Participants who did not take part in the experiment in an orderly manner were removed from the dataset according to a predetermined set of criteria.
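The distributional comparisons above can be reproduced with a standard rank-sum test. The sketch below is illustrative only: it uses synthetic placeholder scores (the group means 1898 and 1533 are taken from the text; the spread, sample sizes, and variable names are invented) rather than the actual MTurk data, and the `alternative` argument requires SciPy 1.7 or newer.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
# placeholder score vectors; the real analysis would use each group's
# observed point totals
points_treated = rng.normal(1898, 500, size=300)
points_control = rng.normal(1533, 500, size=300)

# one-sided test: are treated scores distributed above control scores?
stat, p_value = ranksums(points_treated, points_control, alternative="greater")
print(f"rank-sum statistic = {stat:.2f}, p = {p_value:.4f}")  # compare to alpha = 0.01
```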
B. Training and Tuning
The idea here is to find the best hyperparameter settings for the models using a grid search with three-fold cross-validation: candidate settings are evaluated by splitting the data into three parts and validating on each held-out part in turn. For each method, a model is trained for each of the six treatments, using only the observations from the control group and from the group that received the respective treatment. Each model then predicts how effective its treatment is, and the treatment with the highest predicted effect is picked. Model selection is difficult when predicting treatment effects, because the true effects are never observed and there is no simple off-the-shelf performance measure. Therefore, [29] suggest using the so-called τ-risk to pick the best model; in their study, this criterion usually selected the models that made the smallest errors in estimating the treatment effect. Roughly, the τ-risk measures how well the predicted treatment effects explain the outcomes after both the outcomes and the treatment assignments have been residualized, that is, adjusted for a predicted mean outcome and for the predicted probability of receiving treatment. A Lasso estimator and a logistic regression are used to produce these auxiliary predictions (the mean outcome and the treatment probability, respectively), trained on the data to ensure they perform well. The τ-risk is used to pick the best model for both the CF and the CN methods. Picking the best model for the random forests in the VTRF method is simpler: Mean Squared Error (MSE) is used when tuning the two forests that predict the counterfactual outcomes [30,31,32,33,34,35], and MSE is used again when evaluating the difference between the predicted outcomes. In principle, the τ-risk could be used for the random forests too, but it would be too computationally expensive given the number of hyperparameter settings to try, so we stick with the faster MSE. This careful process helps ensure that the models are set up properly and can accurately predict how effective different treatments are, making the results of the analysis more trustworthy.
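As a minimal sketch of how such a τ-risk can be computed (our reading of the R-loss variant in [29]; the variable names and the out-of-fold scheme are our own choices), the auxiliary outcome predictions come from a Lasso and the treatment-probability predictions from a logistic regression, the two estimator families named above.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.model_selection import cross_val_predict

def tau_risk(X, y, w, tau_hat, cv=3):
    """R-loss style tau-risk of a candidate CATE prediction tau_hat.

    y: outcomes, w: binary treatment indicator (0/1),
    tau_hat: the model's predicted treatment effect per observation.
    """
    # out-of-fold conditional mean outcome m_hat(x) via a Lasso
    m_hat = cross_val_predict(LassoCV(cv=cv), X, y, cv=cv)
    # out-of-fold propensity score e_hat(x) via logistic regression
    e_hat = cross_val_predict(LogisticRegressionCV(cv=cv), X, w,
                              cv=cv, method="predict_proba")[:, 1]
    # smaller is better: tau_hat should explain the residualized outcome
    return np.mean(((y - m_hat) - (w - e_hat) * tau_hat) ** 2)
```

The hyperparameter configuration of the CF or the CN whose predictions minimize this quantity would then be selected.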
C. Comparison Method
To check how well the methods work on a real dataset, where the true treatment effects cannot be observed, we use a method similar to the one described in [9]. The idea is to look at the cases in which the model’s suggested treatment matches the treatment that was randomly assigned: if the average outcome of these matched cases is substantially better than the average outcome of all treated cases, the models are doing a good job of picking the best treatments. Concretely, the models are first trained on one subset of the data; a second subset is then used to assess how well they predict the effects of the different treatments. For each case in the second subset, the treatment with the highest predicted effect is picked, and cases in which this pick coincides with the randomly assigned treatment are called “matched” cases. The average outcomes of the matched cases indicate how well the models are doing: a model that is good at picking the best treatment will produce a higher matched average outcome than a weaker model. If the average outcome of the matched cases exceeds the average outcome of all treated cases, assigning treatments with the model is better than assigning treatments randomly; and if it exceeds the average outcome of the group that received a specific treatment, the model is better than giving everyone that specific treatment. To make sure the results are not driven by chance in how the data are split, this process is repeated 100 times with different splits, and the average outcomes are examined across all repeats; the procedure is sketched below. As mentioned earlier, treatments three and six did not perform as well as the others, so the models are likely to predict lower effects for them and pick them less often. We therefore also examine how well the models do when only treatments one, two, four, and five are considered, to make sure they do not look good merely by avoiding treatments three and six.
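The following schematic condenses this evaluation loop. It is our own sketch with hypothetical names: `fit_and_predict` stands in for any of the three methods and is assumed to return one predicted effect per treatment and test observation, the 70/30 split fraction is our assumption, and at least one matched case per split is assumed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def matched_average_outcome(X, y, w, fit_and_predict, n_repeats=100, seed=0):
    """Average outcome of "matched" test cases over repeated splits.

    w: randomly assigned treatment per observation, coded 1..K
    (0 for control). A test case is matched when its assigned
    treatment equals the model's recommended treatment.
    """
    rng = np.random.default_rng(seed)
    matched_means = []
    for _ in range(n_repeats):
        idx = np.arange(len(y))
        idx_train, idx_test = train_test_split(
            idx, test_size=0.3, random_state=int(rng.integers(1 << 31)))
        # predicted effects: one column per treatment 1..K
        effects = fit_and_predict(X[idx_train], y[idx_train], w[idx_train],
                                  X[idx_test])
        recommended = effects.argmax(axis=1) + 1
        matched = w[idx_test] == recommended
        matched_means.append(y[idx_test][matched].mean())
    return float(np.mean(matched_means))
```

This matched average would then be compared against the average outcome of all treated cases (the random-assignment benchmark) and against the average outcome of the single best treatment.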
V. Experimental Analysis
This section presents the findings of the empirical analysis described above. We first discuss the outcomes when all six treatments are used, and then the outcomes when only a subset of treatments (treatments one, two, four, and five) is used.
A. Full Treatment Set
This section presents the results of the analysis on the full treatment set. Table 2 shows the average outcomes of the matched observations for all six treatments, and Figure 3 displays how these outcomes vary across the one hundred repeats of the cross-validation process for the three methods. For all three methods, the average points of the matched observations are higher than the average points across all treatments, suggesting that each model would assign treatments better than random assignment. Among the models, the CF performs best with an average matched outcome of 1982, followed by the VTRF with 1959 and the CN with 1921. Only the CF has a matched average outcome higher than the average outcome of treatment four. However, for all models, and especially for the VTRF and the CF, the average outcome of the matched observations assigned to treatment four is lower than the actual average points of treatment four. For treatment five, the difference in outcomes between the VTRF and CF models is considerable, indicating that these models correctly assign the more competitively oriented participants to the real-time-feedback treatment. The CF beats the average outcome of treatment four far more often than the other two models, in 81 out of 100 repeats (compared to 28/100 for the VTRF and 5/100 for the CN). For the CF alone, then, the results suggest that assigning treatment using the model is better than simply assigning the best-performing treatment to everyone. The CN performs considerably worse than the tree-based models according to the introduced metrics. To understand why the VTRF performs worse than the CF, observations were also matched on the training sets the models were trained on (see Table 3): the VTRF achieved much higher average points of matched observations on the training set than the CF, indicating that it overfits the training set.
B. Treatment Subset
One challenge discussed previously is that treatments three and six had much lower outcomes than the other treatments, which could make the models seem to work well simply because they avoid assigning treatments three and six. To address this, the analysis was repeated with only treatments one, two, four, and five. The results for this subset of treatments are shown in Table 4, and their distributions over the repeats are in Figure 4. Once more, the CF model performed best, with an average matched outcome of 1980, exceeding the mean (now 1939, considering only the four remaining treatments) in all 100 repeats. It also beat treatment four in 75 out of 100 repeats, slightly fewer than before. The VTRF did better in this analysis than when all treatments were used: its average matched outcome over the hundred repeats is 1970, with 52 out of 100 repeats showing a higher average outcome than treatment four, although in 3 out of 100 repeats the average matched outcome was below the average outcome of treatments one, two, four, and five. The performance of the CN remained lower than that of the other two methods, with an average matched outcome of 1945, slightly higher than before but still below treatment four’s average outcome and only just above the mean outcome of the four treatments. Overall, the results for the subset of treatments are quite similar to those for all treatments, meaning that restricting attention to treatments with similar average outcomes does not change the findings much. One notable difference is that the VTRF performed better in this analysis, suggesting that using the VTRF for treatment assignment might be just as good as assigning only treatment four. One possible reason is that, in the full-treatment analysis, the VTRF wrongly assigned many observations to treatment six because it overestimated that treatment’s effect. This is supported by the fact that the VTRF had more matched assignments than the CF but with lower average outcomes, and by the fact that the VTRF’s average matched outcomes increased for all four treatments in the subset analysis while those of the CN and the CF decreased.
VI. Conclusions and Future Work
As causal inference becomes more popular, predicting the best treatment using ML methods is becoming important in many areas. Comparing three methods (VTRF, CN, and CF) on real data, the CF appears best at assigning the optimal treatment: picking the treatment with the highest CF-predicted effect leads to better outcomes than simply picking the best overall treatment. The VTRF and the CN are better than random assignment but not as good as the CF. On the smaller set of treatments, the VTRF does about as well as always assigning treatment four. Even though the VTRF does not perform as well overall, it should not be dismissed: it is easier to understand and use than the CF, especially for practitioners who are not familiar with causal inference. One reason the VTRF might underperform is its tendency to overfit: it becomes very good at predicting outcomes on the data it was trained on, so it does worse on new data. The CN performed worse than the other methods, and there are several possible reasons. The number of observations might have been too low for the neural networks to work well. Tuning neural networks is also harder than tuning tree-based methods like the CF, and the tuning procedures used for the CN might not have been the best fit for the data. Likewise, the hyperparameter grids used might not have been ideal, but expanding them would have been computationally infeasible for this study.
Funding
No funds, grants, or other support were received.
Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data Availability
Data will be made available on reasonable request.
Code Availability
Code will be made available on reasonable request.
References
- V. Kanaparthi, “Exploring the Impact of Blockchain, AI, and ML on Financial Accounting Efficiency and Transformation,” Jan. 2024, Accessed: Feb. 04, 2024. [Online]. Available: https://arxiv.org/abs/2401.15715v1.
- V. Kanaparthi, “Robustness Evaluation of LSTM-based Deep Learning Models for Bitcoin Price Prediction in the Presence of Random Disturbances,” Jan. 2024. [CrossRef]
- V. Kanaparthi, “Examining Natural Language Processing Techniques in the Education and Healthcare Fields,” International Journal of Engineering and Advanced Technology, vol. 12, no. 2, pp. 8–18, Dec. 2022. [CrossRef]
- V. Kanaparthi, “Credit Risk Prediction using Ensemble Machine Learning Algorithms,” in 6th International Conference on Inventive Computation Technologies, ICICT 2023 - Proceedings, 2023, pp. 41–47. [CrossRef]
- V. Kanaparthi, “AI-based Personalization and Trust in Digital Finance,” Jan. 2024, Accessed: Feb. 04, 2024. [Online]. Available: https://arxiv.org/abs/2401.15700v1.
- J. C. Foster, J. M. G. Taylor, and S. J. Ruberg, “Subgroup identification from randomized clinical trial data,” Statistics in Medicine, vol. 30, no. 24, pp. 2867–2880, Oct. 2011. [CrossRef]
- S. Wager and S. Athey, “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests,” Journal of the American Statistical Association, vol. 113, no. 523, pp. 1228–1242, Jul. 2018. [CrossRef]
- M. H. Farrell, T. Liang, and S. Misra, “Deep Neural Networks for Estimation and Inference,” Econometrica, vol. 89, no. 1, pp. 181–213, Jan. 2021. [CrossRef]
- G. J. Hitsch and S. Misra, “Heterogeneous Treatment Effects and Optimal Targeting Policy Evaluation,” SSRN Electronic Journal, Nov. 2018. [CrossRef]
- P. Kaur, G. S. Kashyap, A. Kumar, M. T. Nafis, S. Kumar, and V. Shokeen, “From Text to Transformation: A Comprehensive Review of Large Language Models’ Versatility,” Feb. 2024, Accessed: Mar. 21, 2024. [Online]. Available: https://arxiv.org/abs/2402.16142v1.
- G. S. Kashyap, K. Malik, S. Wazir, and R. Khan, “Using Machine Learning to Quantify the Multimedia Risk Due to Fuzzing,” Multimedia Tools and Applications, vol. 81, no. 25, pp. 36685–36698, Oct. 2022. [CrossRef]
- N. Marwah, V. K. Singh, G. S. Kashyap, and S. Wazir, “An analysis of the robustness of UAV agriculture field coverage using multi-agent reinforcement learning,” International Journal of Information Technology (Singapore), vol. 15, no. 4, pp. 2317–2327, May 2023. [CrossRef]
- G. S. Kashyap et al., “Revolutionizing Agriculture: A Comprehensive Review of Artificial Intelligence Techniques in Farming,” Feb. 2024. [CrossRef]
- M. Kanojia, P. Kamani, G. S. Kashyap, S. Naz, S. Wazir, and A. Chauhan, “Alternative Agriculture Land-Use Transformation Pathways by Partial-Equilibrium Agricultural Sector Model: A Mathematical Approach,” Aug. 2023, Accessed: Sep. 16, 2023. [Online]. Available: https://arxiv.org/abs/2308.11632v1.
- S. Naz and G. S. Kashyap, “Enhancing the predictive capability of a mathematical model for pseudomonas aeruginosa through artificial neural networks,” International Journal of Information Technology, pp. 1–10, Feb. 2024. [CrossRef]
- G. S. Kashyap, A. E. I. Brownlee, O. C. Phukan, K. Malik, and S. Wazir, “Roulette-Wheel Selection-Based PSO Algorithm for Solving the Vehicle Routing Problem with Time Windows,” Jun. 2023, Accessed: Jul. 04, 2023. [Online]. Available: https://arxiv.org/abs/2306.02308v1.
- J. Kleinberg, J. Ludwig, S. Mullainathan, and Z. Obermeyer, “Prediction policy problems,” in American Economic Review, May 2015, vol. 105, no. 5, pp. 491–495. [CrossRef]
- T. Kitagawa and A. Tetenov, “Who Should Be Treated? Empirical Welfare Maximization Methods for Treatment Choice,” Econometrica, vol. 86, no. 2, pp. 591–616, Mar. 2018. [CrossRef]
- C. F. Manski, “Statistical treatment rules for heterogeneous populations,” Econometrica, vol. 72, no. 4, pp. 1221–1246, Jul. 2004. [CrossRef]
- K. Hirano and J. R. Porter, “Asymptotics for Statistical Treatment Rules,” Econometrica, vol. 77, no. 5, pp. 1683–1701, Sep. 2009. [CrossRef]
- K. Imai and M. Ratkovic, “Estimating treatment effect heterogeneity in randomized program evaluation,” Annals of Applied Statistics, vol. 7, no. 1, pp. 443–470, Mar. 2013. [CrossRef]
- M. Taddy, M. Gardner, L. Chen, and D. Draper, “A Nonparametric Bayesian Analysis of Heterogenous Treatment Effects in Digital Experimentation,” Journal of Business and Economic Statistics, vol. 34, no. 4, pp. 661–672, Oct. 2016. [CrossRef]
- D. Simester, A. Timoshenko, and S. I. Zoumpoulis, “Efficiently evaluating targeting policies: Improving on champion vs. challenger experiments,” Management Science, vol. 66, no. 8, pp. 3412–3424, Jan. 2020. [CrossRef]
- S. Opitz, D. Sliwka, T. Vogelsang, and T. Zimmermann, “The Targeted Assignment of Incentive Schemes,” SSRN Electronic Journal, Oct. 2022. [CrossRef]
- V. K. Kanaparthi, “Navigating Uncertainty: Enhancing Markowitz Asset Allocation Strategies through Out-of-Sample Analysis,” Dec. 2023. [CrossRef]
- V. Kanaparthi, “Transformational application of Artificial Intelligence and Machine learning in Financial Technologies and Financial services: A bibliometric review,” Jan. 2024. [CrossRef]
- V. K. Kanaparthi, “Examining the Plausible Applications of Artificial Intelligence & Machine Learning in Accounts Payable Improvement,” FinTech, vol. 2, no. 3, pp. 461–474, Jul. 2023. [CrossRef]
- V. Kanaparthi, “Evaluating Financial Risk in the Transition from EONIA to ESTER: A TimeGAN Approach with Enhanced VaR Estimations,” Jan. 2024. [CrossRef]
- A. Schuler, M. Baiocchi, R. Tibshirani, and N. Shah, “A comparison of methods for model selection when estimating individual treatment effects,” Apr. 2018, Accessed: Apr. 28, 2024. [Online]. Available: https://arxiv.org/abs/1804.05146v2.
- G. S. Kashyap et al., “Detection of a facemask in real-time using deep learning methods: Prevention of Covid 19,” Jan. 2024, Accessed: Feb. 04, 2024. [Online]. Available: https://arxiv.org/abs/2401.15675v1.
- G. S. Kashyap, A. Siddiqui, R. Siddiqui, K. Malik, S. Wazir, and A. E. I. Brownlee, “Prediction of Suicidal Risk Using Machine Learning Models.” Dec. 25, 2021. Accessed: Feb. 04, 2024. [Online]. Available: https://papers.ssrn.com/abstract=4709789.
- S. Wazir, G. S. Kashyap, K. Malik, and A. E. I. Brownlee, “Predicting the Infection Level of COVID-19 Virus Using Normal Distribution-Based Approximation Model and PSO,” Springer, Cham, 2023, pp. 75–91. [CrossRef]
- H. Habib, G. S. Kashyap, N. Tabassum, and T. Nafis, “Stock Price Prediction Using Artificial Intelligence Based on LSTM–Deep Learning Model,” in Artificial Intelligence & Blockchain in Cyber Physical Systems: Technologies & Applications, CRC Press, 2023, pp. 93–99. [CrossRef]
- G. S. Kashyap, D. Mahajan, O. C. Phukan, A. Kumar, A. E. I. Brownlee, and J. Gao, “From Simulations to Reality: Enhancing Multi-Robot Exploration for Urban Search and Rescue,” Nov. 2023, Accessed: Dec. 03, 2023. [Online]. Available: https://arxiv.org/abs/2311.16958v1.
- S. Wazir, G. S. Kashyap, and P. Saxena, “MLOps: A Review,” Aug. 2023, Accessed: Sep. 16, 2023. [Online]. Available: https://arxiv.org/abs/2308.10908v1.
¹ CATE refers to the variation in treatment effects across different groups of individuals, often defined by their covariates or characteristics.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).