Submitted: 06 August 2025
Posted: 09 September 2025
Abstract
Keywords:
Introduction
Problem Statement
Research Significance
Research Questions and Hypotheses
- RQ1: Can we predict the revenue generated by a specific user journey path that ends in a purchase?
- H1: Marketing paths with greater diversity and more touchpoints will result in higher predicted revenue (Montgomery et al., 2004; Verbeke et al., 2012). This hypothesis is based on the idea that users who are shown diverse campaigns have better brand recall and purchase intent.
- RQ2: If a user has completed a partial journey without converting, which campaign is most likely to lead to a purchase in the end?
- H2: Campaigns that frequently appear just before conversion in historical data are more likely to convert users mid-journey (Xu et al., 2014; Ren et al., 2018). This assumes that certain campaigns have a higher last-touch impact due to their persuasive positioning.
- RQ3: If a user has completed a partial journey without converting, what is the best next campaign step to maximize revenue?
- H3: Machine learning models can simulate next-step recommendations that improve expected revenue compared to random selection (Tao et al., 2023). This presumes that structured journey data encodes learnable patterns about campaign sequencing and conversion outcomes.
Objectives
- To develop regression models capable of forecasting revenue and campaigns based on completed user journeys using marketing campaign user path data.
- To design classification models that can identify the most likely converting campaign for users who are already partway through a campaign journey.
- To simulate next-step campaign decisions using path truncation and evaluate their revenue-maximizing potential via predictive modeling.
- To compare model performance using metrics like RMSE, R², and classification accuracy, and translate these findings into actionable marketing strategy insights within e-commerce.
- To validate the utility of machine learning as a prescriptive tool for customer journey optimization in real-world campaign planning contexts (Fildes et al., 2008).
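The regression metrics named in the objectives can be written out directly. The following is a minimal sketch in plain Python, not the evaluation code used in the study; the revenue values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted revenue for four journeys.
observed = [20.0, 35.0, 50.0, 15.0]
predicted = [22.0, 30.0, 55.0, 13.0]

model_rmse = rmse(observed, predicted)
model_r2 = r_squared(observed, predicted)
```

RMSE is expressed in the same units as revenue, while R² is unitless, which is why the two are usually reported together.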
Limitations
Delimitations
Literature Review
Attribution Modeling and Its Shortcomings
Predictive Modeling in Digital Marketing and E-Commerce
Customer Journey and Sequential Modeling
Conversion Prediction and Mid-Journey Targeting
Research Gap and Contribution
Methodology
Data Source and Description
- Campaign: The marketing channel or tactic (e.g., Organic, Email, Display)
- Path.Step: The order of campaign impressions in the journey
- Total.Revenue: Final purchase amount from that user path
- Days.till.key.event: Time elapsed until conversion
- X.TP: Count of touchpoints
Feature Engineering
- Touchpoints: Represents the total number of campaign interactions in a full user journey. This feature captures journey depth and was used as a key predictor in revenue estimation (RQ1), based on the assumption that longer engagement correlates with higher customer value.
- Unique.Campaigns: Measures the count of distinct campaign types encountered in each path. It serves as a proxy for campaign diversity and reflects the range of exposures that may influence conversion and revenue outcomes. Prior studies suggest that campaign variety can impact user behavior (Montgomery et al., 2004).
- Campaign.Length: Mirrors the Touchpoints feature but was retained for clarity in interpretability analyses. It provides an intuitive label when comparing user journey lengths across different segments or visualizations.
- Truncated.Steps: Denotes the number of steps present in a partially observed path, used in both RQ2 and RQ3 to simulate users who have not yet made a purchase. It enables the modeling of mid-journey decision-making contexts.
- Touchpoints.So.Far: A subset count indicating how many interactions have occurred prior to the current prediction point in truncated paths. It adds sequential context to the model, helping distinguish between early-stage and late-stage journey states.
- Next.Campaign: A categorical feature used exclusively in RQ3 to represent each candidate campaign simulated as the potential next step in a truncated path. For every truncated journey, this feature was iteratively populated with all possible campaign options to evaluate their projected revenue impact.
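The count-based features above can be sketched as follows, assuming a journey is available as an ordered list of campaign names. The function names and the example path are hypothetical. The sketch also assumes that Truncated.Steps and Touchpoints.So.Far coincide when each step is one interaction, since the text does not specify how the two differ in the actual pipeline.

```python
def full_path_features(path):
    """Features for a completed journey given as an ordered list of campaigns."""
    return {
        "Touchpoints": len(path),            # journey depth
        "Unique.Campaigns": len(set(path)),  # campaign diversity
        "Campaign.Length": len(path),        # mirrors Touchpoints by definition
    }

def truncated_path_features(path, keep):
    """Features for the first `keep` steps, simulating a user who has not yet purchased."""
    if not 0 < keep < len(path):
        raise ValueError("keep must leave at least one step unobserved")
    prefix = path[:keep]
    return {
        "Truncated.Steps": len(prefix),
        # Assumption: one interaction per step, so both counts are equal here.
        "Touchpoints.So.Far": len(prefix),
    }

# Hypothetical journey built from campaign names that appear in the dataset.
example_path = ["(organic)", "May2024_MDW_V1", "(organic)", "June2024_Summer_V1"]

full_features = full_path_features(example_path)
partial_features = truncated_path_features(example_path, keep=2)
```

Because Campaign.Length duplicates Touchpoints, the two columns carry identical values in any table built this way.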
Modeling Approaches
RQ1: Revenue Prediction from Completed Paths
- Model Type: Supervised regression
- Models Used: Linear Regression (baseline), Random Forest Regressor, XGBoost Regressor
- Input Features: Touchpoints, Days.till.key.event, Campaign.Length, Unique.Campaigns
- Target Variable: Total.Revenue
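To illustrate the baseline idea, the sketch below fits an ordinary least squares line with a single predictor (Touchpoints) against Total.Revenue. It is a deliberate simplification: the study's baseline uses four input features and a library implementation, and the data points here are hypothetical.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical completed journeys: touchpoint counts and resulting revenue.
touchpoints = [1, 2, 3, 4]
total_revenue = [12.0, 18.0, 33.0, 37.0]

slope, intercept = fit_simple_ols(touchpoints, total_revenue)
predicted_for_five = intercept + slope * 5
```

A positive slope in such a fit would be consistent with the assumption behind H1 that longer journeys are associated with higher revenue, although a single-feature line says nothing about campaign diversity.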
RQ2: Predicting Conversion Campaign Mid-Journey
- Model Type: Multi-class classification
- Model Used: XGBoost Classifier
- Input Features: Truncated.Steps, Touchpoints.So.Far, Days.So.Far
- Target Variable: Last.Campaign (the campaign shown just before purchase in the real path)
RQ3: Identifying the Next-Best Campaign for Revenue Maximization
- Design Type: Quasi-experimental simulation
- Model: XGBoost Regressor
- Target Variable: Total.Revenue (from full path)
- Key Predictors: Truncated path features and one-hot encoded candidate next campaign
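The simulation loop for RQ3 can be sketched as below: each truncated journey is paired with every candidate campaign, the candidate is one-hot encoded, and the candidate with the highest predicted revenue is selected. The `toy_model` function is a hypothetical stand-in for the trained XGBoost regressor, and its weights, the feature values, and the candidate list are all illustrative assumptions.

```python
def one_hot(campaign, candidates):
    """Encode a campaign as a 0/1 vector over the candidate list."""
    return [1.0 if campaign == c else 0.0 for c in candidates]

def candidate_rows(truncated_features, candidates):
    """Build one model-input row per candidate Next.Campaign."""
    return [(c, list(truncated_features) + one_hot(c, candidates)) for c in candidates]

def recommend(truncated_features, candidates, predict_revenue):
    """Return the candidate with the highest predicted revenue, plus all scores."""
    scores = {
        campaign: predict_revenue(row)
        for campaign, row in candidate_rows(truncated_features, candidates)
    }
    best = max(scores, key=scores.get)
    return best, scores

def toy_model(row):
    """Hypothetical scorer standing in for a trained regressor (weights are made up)."""
    weights = [1.0, 0.5, 5.0, 8.0, 3.0]
    return sum(w * x for w, x in zip(weights, row))

candidates = ["(organic)", "May2024_MDW_V1", "Oct2024_Quilt_V1"]
truncated = [2.0, 2.0]  # assumed order: Truncated.Steps, Touchpoints.So.Far

best_campaign, revenue_scores = recommend(truncated, candidates, toy_model)
```

The scores produced this way rank candidates under the model's assumptions; they are predictions for hypothetical next steps, not measured outcomes.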
Analysis Results
RQ1: Predicting Revenue from Completed Paths


RQ2: Predicting Converting Campaign Mid-Journey
- Top-1 Accuracy: 71.64%
- Top-3 Accuracy: 85.17%
- Standard Deviation (Top-1 Accuracy): ±0.53%
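Top-k accuracy counts a prediction as correct when the true converting campaign is among the k highest-scored classes. A minimal sketch follows, assuming the classifier's output is available as a mapping from campaign to score for each journey; the labels and scores are hypothetical.

```python
def top_k_accuracy(true_labels, class_scores, k):
    """Share of cases whose true label is among the k highest-scored classes."""
    hits = 0
    for label, scores in zip(true_labels, class_scores):
        ranked = sorted(scores, key=scores.get, reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Hypothetical true Last.Campaign labels and per-class scores for four journeys.
true_labels = ["A", "B", "C", "A"]
class_scores = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.5, "B": 0.4, "C": 0.1},
    {"A": 0.5, "B": 0.3, "C": 0.2},
    {"A": 0.2, "B": 0.5, "C": 0.3},
]

top1 = top_k_accuracy(true_labels, class_scores, k=1)
top2 = top_k_accuracy(true_labels, class_scores, k=2)
top3 = top_k_accuracy(true_labels, class_scores, k=3)
```

Top-k accuracy can never decrease as k grows, which is why the reported Top-3 figure is higher than the Top-1 figure.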

| Seed | Accuracy |
|---|---|
| 27535007 | 0.7144524 |
| 75932782 | 0.7202852 |
| 62644871 | 0.7147116 |
| 32535588 | 0.7230071 |
| 624145 | 0.7157485 |
| 11151506 | 0.7154893 |
| 35632642 | 0.7241737 |
| 13293583 | 0.7052495 |
| 11151232 | 0.7145820 |
| 18742825 | 0.7161374 |
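The summary figures for Top-1 accuracy can be recomputed from the per-seed values in the table. The sketch below uses Python's `statistics` module and assumes the reported spread is the sample standard deviation (n − 1 denominator), which is the variant that matches ±0.53%.

```python
import statistics

# Per-seed Top-1 accuracies copied from the table above.
seed_accuracies = [
    0.7144524, 0.7202852, 0.7147116, 0.7230071, 0.7157485,
    0.7154893, 0.7241737, 0.7052495, 0.7145820, 0.7161374,
]

mean_accuracy = statistics.mean(seed_accuracies)
sample_std = statistics.stdev(seed_accuracies)  # sample standard deviation

mean_percent = round(mean_accuracy * 100, 2)  # 71.64
std_percent = round(sample_std * 100, 2)      # 0.53
```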

| Campaign | Occurrence | Percentage of Total |
|---|---|---|
| Merch Store US and CA \| Search [Do not Edit] | 392 | 5.08% |
| (organic) | 383 | 4.96% |
| Jan2024_ChromeDino_V1 | 299 | 3.87% |
| May2024_MDW_V1 | 283 | 3.66% |
| [Group 3 - Hats] Hats Search Campaign | 258 | 3.34% |
| June2024_Summer_V1 | 257 | 3.33% |
| July2024_GreenSummer_V2 | 256 | 3.31% |
| Oct2024_Quilt_V2 | 255 | 3.30% |
| Oct2024_Quilt_V1 | 250 | 3.24% |
| [Experiment Bug: 411232449] Merch Store US and CA \| Search [Do not Edit] | 243 | 3.15% |
RQ3: Recommending Next-Best Campaign


Discussion
Insights from RQ1: Revenue Prediction
Insights from RQ3: Next-Best Campaign Recommendation
Cross-Cutting Themes
Limitations and Future Work
Methodological Limitations
- Quasi-Experimental Limitation in RQ3: The main limitation lies in the quasi-experimental design of RQ3, where simulated “next campaign” steps are appended to truncated user journeys, and the revenue from the original full path is retained as the outcome label. Because these next campaigns were never actually served to users, the predicted revenue is a proxy outcome under hypothetical scenarios rather than a true causal effect. While this design enables directional comparisons between campaigns and reflects realistic mid-journey decision contexts, it does not establish causal attribution and should be interpreted as exploratory rather than definitive evidence of campaign impact.
- Simplified Assumptions About Campaign Impact: All campaign steps were treated equally without incorporating variables such as campaign cost, creative content, or user fatigue. This oversimplification may limit the model’s realism and operational accuracy.
- Exclusion of User Demographics and Contextual Data: Important variables such as device type, geolocation, referral source, or user segment were unavailable in the dataset. This limits the model’s personalization capabilities and generalizability.
- Static Modeling: The models assume a static user journey and do not account for evolving behaviors or feedback loops (Lemon & Verhoef, 2016). Real-time systems might require dynamic retraining and deployment mechanisms (Kumar & Reinartz, 2018).
Data Limitations
Future Research Opportunities
- Causal Modeling and Uplift Modeling: To strengthen the next-best campaign framework, researchers should apply counterfactual modeling methods or uplift modeling to directly estimate the incremental impact of showing a specific campaign (Radcliffe, 2007; Gutierrez & Gérardy, 2016).
- Incorporation of Reinforcement Learning: Next-best-action systems can be enhanced through reinforcement learning frameworks that adapt over time, taking into account evolving user responses and business objectives (Sutton & Barto, 2018).
- Deep Learning for Sequence Modeling: While this study favored explainable models like XGBoost, future research may explore LSTM, GRU, or transformer architectures to capture long-range temporal dependencies in user journeys (Hochreiter & Schmidhuber, 1997; Vaswani et al., 2017).
- Integration into Operational Systems: A practical next step is to evaluate how these models perform when deployed in live campaign environments. A/B testing and ongoing monitoring would be necessary to assess business impact.
- Cross-Industry Validation: Replicating the models across different verticals such as travel, insurance, or education would help confirm the generalizability of the framework.
Conclusion
References
- Abhishek, V., Fader, P. S., & Hosanagar, K. (2012). Media exposure through the funnel: A model of multi-stage attribution. Marketing Science, 31, 362–386.
- Anderl, E., Becker, I., von Wangenheim, F., & Schumann, J. H. (2015). Mapping the customer journey: Lessons learned from graph-based online attribution modeling. Journal of Interactive Marketing, 34, 1–16.
- Berman, R. (2018). Beyond the last touch: Attribution in online advertising. Marketing Science, 37, 771–792.
- Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
- Bucklin, R. E., & Sismeiro, C. (2009). Click here for Internet insight: Advances in clickstream data analysis in marketing. Journal of Interactive Marketing, 23, 35–48.
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
- Cramer-Flood, E. (2024, January 10). Worldwide digital ad spending forecast 2024: Growth accelerates—digital dominates. Insider Intelligence.
- Dalessandro, B., Perlich, C., Hook, R., & Provost, F. (2012). Evaluating online ad campaigns in a pipeline: Causal modeling and econometrics in practice. In Proceedings of the 18th ACM SIGKDD (pp. 281–289).
- Fildes, R., Goodwin, P., Lawrence, M., & Nikolopoulos, K. (2008). Effective forecasting and judgmental adjustments: An empirical evaluation and strategies for improvement in supply chain planning. International Journal of Forecasting, 24, 3–19.
- Google. (2024). Google Analytics Demo Account: Google Merchandise Store Dataset.
- Gutierrez, P., & Gérardy, J.-Y. (2016). Causal inference and uplift modeling: A review of the literature. JMLR Workshop & Conference Proceedings, 67, 1–13.
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.
- Kumar, V., & Reinartz, W. (2018). Customer relationship management: Concept, strategy, and tools (3rd ed.). Springer.
- Lemon, K. N., & Verhoef, P. C. (2016). Understanding customer experience throughout the customer journey. Journal of Marketing, 80(6), 69–96.
- Li, H., & Kannan, P. K. (2014). Attributing conversions in a multichannel online marketing environment: An empirical model and a field experiment. Journal of Marketing Research, 51, 40–56.
- Montgomery, A. L., Li, S., Srinivasan, K., & Liechty, J. C. (2004). Modeling online browsing and path analysis using clickstream data. Marketing Science, 23(4), 579–595.
- Pechmann, C., & Stewart, D. W. (1990). The effects of comparative advertising on attention, memory, and purchase intentions. Journal of Consumer Research, 17(2), 180–191.
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
- Radcliffe, N. (2007). Using control groups to target on predicted lift: Building and assessing uplift model. Direct Marketing Analytics Journal, 14–21.
- Ren, K., Fang, Y., Zhang, W., Liu, S., Li, J., Zhang, Y., Yu, Y., & Wang, J. (2018). Learning multi-touch conversion attribution with dual-attention mechanisms for online advertising. arXiv:1808.03737.
- Shao, X., & Li, L. (2011). Data-driven multi-touch attribution models. In Proceedings of the 17th ACM SIGKDD (pp. 258–264).
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
- Tao, J., Chen, Q., Snyder Jr., J. W., Kumar, A. S., Meisami, A., & Xue, L. (2023). A graphical point process framework for understanding removal effects in multi-touch attribution. arXiv preprint arXiv:2302.06075.
- Viloria, A., Lezama, O., Jaimes, A., & Pérez, J. (2019). Big data marketing during the period 2012–2019: A bibliometric review. Intelligent computing, information and control systems, Springer; 1039, 186–193.
- Verbeke, W., Martens, D., Mues, C., & Baesens, B. (2012). Building comprehensible customer churn prediction models with advanced rule induction techniques. Expert Systems with Applications, 38(3), 2354–2364.
- Xu, L., Duan, W., & Whinston, A. B. (2014). Path to purchase: A mutually exciting point process model for online advertising and conversion. Management Science, 60, 1392–1412.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).