Submitted: 17 June 2026
Posted: 22 June 2026
Abstract
Keywords:
1. Introduction
- We adapt Bayesian Persuasion to generative agents, recovering personalization as informational alignment between an agent’s generated content and the receiver-specific content implicit in the ground-truth sales outreach artifact.
- We construct SDR-Bench, a public corpus of 6,279 customer success stories from approximately 200 enterprises across 22 industries, each paired with the seller, prospect, product, and historical timestamp required to evaluate whether an agent can predict the strategic content of the deal-closing pitch.
- We release SDR-Arena, an evaluation framework that operationalizes the formalization on SDR-Bench and on proprietary sales artifacts; to prevent future data leakage, where an agent retrieves the very success story it is being asked to predict, SDR-Arena serves agents a frozen view of the public web at the historical timestamp of each evaluation instance.
- We apply the framework to proprietary sales-email and sales-transcript corpora from a Fortune 100 tech company and a mid-sized healthcare firm, comprising approximately 115,000 filtered outreach emails and 5,435 outreach calls by 124 SDRs, labeled by whether they induced a successful receiver action.
- We validate the framework through field deployment with 12 professional sales development representatives across our partner enterprises and a gold-standard exercise with senior SDRs from five enterprises.
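The leakage guard in the third contribution can be illustrated with a minimal sketch. It assumes that every retrievable page carries a publication date and that filtering on that date approximates the frozen view. The `Page` record, the `frozen_view` helper, and the example URLs are hypothetical and are not SDR-Arena's actual interface.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Page:
    """A retrievable web page with a known publication date (hypothetical schema)."""
    url: str
    published: date
    text: str


def frozen_view(pages: list[Page], cutoff: date) -> list[Page]:
    """Return only the pages published strictly before the evaluation cutoff."""
    return [p for p in pages if p.published < cutoff]


# Illustrative data: the success story is published on the cutoff date itself,
# so it must not be visible to the agent that is asked to predict it.
corpus = [
    Page("https://example.com/press-release", date(2023, 1, 10), "Prospect announces expansion."),
    Page("https://example.com/success-story", date(2023, 6, 1), "The story to be predicted."),
    Page("https://example.com/later-news", date(2024, 2, 3), "Published after the deal."),
]
visible = frozen_view(corpus, cutoff=date(2023, 6, 1))
print([p.url for p in visible])  # ['https://example.com/press-release']
```

Using a strict inequality excludes the target artifact itself, which is the case the contribution singles out.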

2. Problem Formulation
2.1. Sales Development Lifecycle

2.2. Bayesian Persuasion Formulation
2.3. Evaluation Methodology and Proxy
2.4. Dataset Construction
SDR-Bench Dataset
Private Enterprise Dataset
| Metric / Artifact Category | Healthcare | Tech |
|---|---|---|
| Number of SDRs | 3 | 124 |
| Total Emails Collected | 48,150 | 609,191 |
| Emails After Deduplication | 31,034 | 186,379 |
| Sales Outreach Emails | 24,506 | 90,809 |
| Sales Calls Scheduled | 354 | 13,236 |
| Golden Dataset (Handpicked) | 400 | 400 |
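The table can be read as a filtering funnel. The sketch below computes stage-to-stage retention from the counts above, treating each row as a subset of the row before it; that reading is an assumption, and the stage names in the code are shorthand introduced here. The handpicked golden dataset is left out because it is a fixed-size sample rather than a filtering stage.

```python
# Counts copied from the private enterprise dataset table.
funnel = {
    "Healthcare": [("collected", 48_150), ("deduplicated", 31_034),
                   ("outreach", 24_506), ("call scheduled", 354)],
    "Tech": [("collected", 609_191), ("deduplicated", 186_379),
             ("outreach", 90_809), ("call scheduled", 13_236)],
}


def stage_retention(stages):
    """Fraction of the previous stage that survives into each later stage."""
    return {
        name: count / prev_count
        for (_, prev_count), (name, count) in zip(stages, stages[1:])
    }


for company, stages in funnel.items():
    rates = stage_retention(stages)
    print(company, {name: f"{rate:.1%}" for name, rate in rates.items()})

# Combined outreach emails across both enterprises.
total_outreach = sum(dict(stages)["outreach"] for stages in funnel.values())
print("Total outreach emails:", total_outreach)  # 115315
```

The combined outreach count of 115,315 agrees with the figure of approximately 115,000 filtered outreach emails quoted in the Introduction.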
Human Personalization Strategies


3. SDR-Arena
- LLMs + Web Search: A baseline equipping frontier models with standard search tools to measure the marginal utility of agentic workflows against simpler tool-use capabilities.
- Deep Research Agents: Specialized agents that produce comprehensive research via multi-turn conversation and broad search retrieval over the internet (LangChain 2025; Shao et al. 2024).
3.1. Evaluation Framework
4. Results & Analysis
4.1. Discussion of Empirical Findings
| Model | Emails: Healthcare (<1B), Unsucc | Emails: Healthcare (<1B), Succ | Emails: Tech (>10B), Unsucc | Emails: Tech (>10B), Succ | Other: Sales Call Transcripts | Stories: Tech. (30) | Stories: Mfg. (30) | Stories: Energy (30) | Stories: IT (30) | Stories: Agg. (180) | Avg. Prompt Tokens per Query | Avg. Completion Tokens per Query | Avg. Cost per Query ($) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| STORM-QWEN-2.5 | 22.46 | 32.27 | 43.15 | 39.24 | 30.43 | 43.40 | 41.59 | 39.24 | 44.60 | 42.51 | ∼29k | ∼6.3k | ∼0.135 |
| ODR-QWEN-2.5 | 15.40 | 30.41 | 38.51 | 39.82 | 22.55 | 30.16 | 30.95 | 35.28 | 35.33 | 33.53 | ∼66k | ∼8.6k | ∼0.250 |
| QWEN2.5-72B (WEB) | 32.11 | 36.72 | 39.53 | 36.43 | 25.34 | 32.09 | 36.75 | 37.04 | 38.21 | 36.84 | ∼5.6k | ∼0.25k | ∼0.002 |
| Claude Sonnet 4.6 (WEB) | 59.89 | 60.89 | 64.08 | 61.33 | 34.52 | 52.6 | 60.5 | 54 | 54.9 | 55.8 | ∼79.2k | ∼2.2k | ∼0.270 |
| GPT-4o (WEB) | 35.71 | 40.62 | 47.44 | 45.17 | — | 33.50 | 38.93 | 32.26 | 36.01 | 35.42 | ∼9.2k | ∼0.4k | ∼0.027 |
| GPT-4o-mini (WEB) | 36.16 | 39.14 | 48.51 | 45.63 | — | 33.30 | 38.54 | 37.07 | 38.05 | 37.46 | ∼12.2k | ∼0.6k | ∼0.002 |
| GPT-5.4-mini (WEB) | 39.28 | 43.67 | 54.64 | 52.66 | — | 40.99 | 46.02 | 41.18 | 45.39 | 44.63 | ∼7.6k | ∼0.7k | ∼0.009 |
| GPT-5.4 (WEB) | 39.57 | 48.71 | 53.80 | 53.02 | — | 38.21 | 45.46 | 42.90 | 46.67 | 44.32 | ∼12.1k | ∼0.9k | ∼0.044 |
Pre-Training Leakage Is Not Driving WCS
4.2. Human Alignment and Validation
- Per-pitch usefulness: Twelve SDRs used GPT-4o + SDR-Arena to generate pitch points for new prospect companies inside their normal outbound pipeline, with each SDR auditing the model output on accounts they were actively working. For every generated pitch point, the SDR rated, on a binary criterion, whether it both (a) reflected genuine understanding of the prospect’s pain points and (b) was usable in outreach without rewriting; 48.2% of pitch points met both criteria. Roughly half of agent output was therefore expert-grade in live deployment, while the remaining points were factually accurate but strategically generic, consistent with the personalization plateau identified above.
- Gold-Standard Alignment: We asked senior SDRs ( years of industry experience and years at the firm) from five enterprises to independently author reference “gold-standard” strategies to pitch 30 products to 5 prospects each. The SDRs were not shown any model output during the exercise, so the reference strategies are an independent expert read of what should be pitched. We then computed the overlap between the gold-standard strategy and the outputs of the four benchmarked models, and correlated this expert-overlap score against the corresponding automated WCS on SDR-Bench. Across models, expert overlap and WCS track at Pearson . The WCS-based ranking therefore transfers to senior-SDR judgment without re-tuning the rubric per enterprise, supporting WCS as a calibrated proxy for whether an agent has identified the strategic content a domain expert would pitch.
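The correlation step in the gold-standard exercise can be sketched as follows. The function implements the standard Pearson coefficient; the per-model scores are hypothetical placeholders chosen for illustration and are not the study's measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores, for illustration only (not the paper's data).
expert_overlap = [0.30, 0.35, 0.41, 0.52]
automated_wcs = [0.33, 0.36, 0.37, 0.43]
print(f"Pearson r = {pearson_r(expert_overlap, automated_wcs):.3f}")
```

A coefficient near 1 would indicate that models ranked higher by WCS are also ranked higher by expert overlap.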
5. Conclusion
A. Appendix



A.1. SDR-Bench: Dataset Curation Details
| Filtration Criteria | Count |
|---|---|
| Domains Found for Companies with over $1B revenue | ∼30k |
| Domains Found for B2B Companies with over $1B revenue | 12,080 |
| Companies whose Sitemap could be found | 8,298 |
| Candidate Success Story URLs based on pattern matching | ∼117k |
| Count of Companies covering these 117k URLs | 1,772 |
| Exclude non-text formats (videos/pdfs) | ∼79k |
| URLs for which content could be collected | ∼31k |
| Qwen-based filtering using content to exclude listicle, parent, generic pages and pages with no publish date | ∼7.2k |
| Filtering out stories where the customer is not a specific business | 6,279 |
A.2. Sales Emails
A.2.1. Filtering & Analysis
- Language Filtering: We remove all non-English emails using language detection, yielding .
- Email Deduplication: We identify and remove duplicate email templates using a combination of exact matching and fuzzy string comparison, yielding
- Intent Classification: We employ an LLM-as-a-judge paradigm to classify emails into outreach versus non-outreach categories. Specifically, we filter out generic conversational emails, administrative correspondence, and non-sales communications. Let be the LLM judge function where indicates a valid sales outreach email. Our filtered corpus is thus:
- Strategy Frequency Distribution: The distribution over strategies reveals the current state of human personalization practices.
- Product-Conditional Strategies: The distribution identifies product-specific personalization patterns.
- Target Company: The recipient organization.
- Sender’s Company: The sender’s organization.
- Email: The content of the email.
- Timestamp: Date when the email was sent.
- Product: The solution being pitched.
- Strategy Labels: Personalization strategies used in the email.
- Pitch Points: Pitch points used in the email.
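The three filtering steps described in this subsection (language filtering, deduplication, intent classification) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: fuzzy matching uses the standard library's `difflib.SequenceMatcher` with an arbitrary 0.9 threshold, the language detector and the LLM judge are passed in as callables, and a judge output of 1 is assumed to mark a valid outreach email. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(emails: list[str], threshold: float = 0.9) -> list[str]:
    """Drop exact duplicates, then near-duplicates above a similarity threshold."""
    kept: list[str] = []
    seen: set[str] = set()
    for email in emails:
        key = normalize(email)
        if key in seen:
            continue  # exact duplicate after normalization
        if any(SequenceMatcher(None, key, normalize(k)).ratio() >= threshold for k in kept):
            continue  # fuzzy duplicate of an email already kept
        seen.add(key)
        kept.append(email)
    return kept


def filter_outreach(
    emails: list[str],
    is_english: Callable[[str], bool],
    judge: Callable[[str], int],
) -> list[str]:
    """Apply language filtering, deduplication, and intent classification in order."""
    english = [e for e in emails if is_english(e)]
    unique = deduplicate(english)
    return [e for e in unique if judge(e) == 1]


# Toy stand-ins for the language detector and the LLM judge (both hypothetical).
def toy_is_english(text: str) -> bool:
    return text.isascii()


def toy_judge(text: str) -> int:
    return 1 if "demo" in text.lower() else 0


emails = [
    "Hi Ana, could we book a demo of our analytics suite next week?",
    "Hi Ana,  could we book a demo of our analytics suite next week?",
    "Hi Ana, could we book a demo of our analytics suite next week!",
    "Out of office until Monday.",
    "Hola Ana, ¿podemos agendar una demostración?",
]
print(filter_outreach(emails, toy_is_english, toy_judge))
```

Only the first email survives: the second and third are exact and fuzzy duplicates of it, the fourth is rejected by the toy judge, and the fifth is removed by the toy language check. The pairwise comparison is quadratic in the number of kept emails, so a corpus of the size reported here would likely need hashing or blocking before fuzzy matching.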
A.2.2. Personalization Strategies
- Industry-based: References industry-specific trends, pain points, competitors, or case studies from the target company’s industry.
- Event-based: Leverages trigger events (funding rounds, M&A, product launches, earnings reports, news mentions) to identify timely business needs.
- Technology-based: References the recipient’s current tech stack to propose replacement, integration, or complementary solutions.
- Lead Activity-based: References direct actions by the specific lead (whitepaper downloads, webinar attendance, pricing page visits, demo interactions).
- Buying Group Activity-based: References collective actions by the lead’s team or buying committee.
- Geography-based: Utilizes physical location or regional regulatory context (e.g., GDPR, CCPA compliance requirements).
- Lead Persona-based: Explicitly maps the lead’s role, title, or job responsibilities to role-specific pain points.
- Firmographics-based: Leverages company-level metrics (headcount growth, revenue, department size) as personalization anchors.
- Relationship-based: References existing customer relationships, cross-sell or upsell opportunities.
- None: Generic outreach lacking recipient-specific context.
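Given emails labeled with this taxonomy, the strategy frequency distribution and the product-conditional distribution mentioned in A.2.1 can be computed as in the sketch below. It assumes an email may carry several strategy labels, so shares are fractions of emails and need not sum to one. The product names and records are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical labeled emails: (product, strategy labels). Not the paper's data.
labeled = [
    ("CRM", ["Industry-based", "Lead Persona-based"]),
    ("CRM", ["Event-based"]),
    ("Analytics", ["Technology-based", "Industry-based"]),
    ("Analytics", ["None"]),
]


def strategy_distribution(records):
    """Share of emails that use each strategy (an email may carry several labels)."""
    counts = Counter(label for _, labels in records for label in set(labels))
    return {label: n / len(records) for label, n in counts.items()}


def product_conditional(records):
    """Strategy distribution computed separately within each product."""
    by_product = defaultdict(list)
    for product, labels in records:
        by_product[product].append((product, labels))
    return {p: strategy_distribution(rs) for p, rs in by_product.items()}


overall = strategy_distribution(labeled)
per_product = product_conditional(labeled)
print(overall["Industry-based"])          # 0.5
print(per_product["CRM"]["Event-based"])  # 0.5
```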
A.3. How to Measure Personalization in an Ideal World?
A.4. Task Formulation Details
A.5. Alignment of LLM with Humans for Pitch Point Extraction
| TP | FP | FN | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| 138 | 11 | 3 | 0.92 | 0.97 | 0.95 |
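The metrics in the table follow from the confusion counts by the standard definitions, as this short check shows.

```python
tp, fp, fn = 138, 11, 3  # counts from the table above

precision = tp / (tp + fp)  # 138 / 149
recall = tp / (tp + fn)     # 138 / 141
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# precision=0.9262 recall=0.9787 f1=0.9517
```

The reported precision of 0.92 and recall of 0.97 match these values truncated to two decimals; rounding would instead give 0.93 and 0.98, while F1 is 0.95 either way.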
A.6. Token Usage vs Performance of Agents

| Model | Avg. Prompt Tokens | Avg. Completion Tokens | Avg. Inference Cost | WCS |
|---|---|---|---|---|
| STORM | ∼29k | ∼6.3k | ∼$0.135 | 42.51 |
| QWEN-2.5-72B | ∼5.6k | ∼250 | ∼$0.002 | 36.84 |
| GPT-4o | ∼9.2k | ∼427 | ∼$0.027 | 35.42 |
| GPT-4o-mini | ∼12.2k | ∼572 | ∼$0.002 | 37.46 |
| GPT-5.4-mini | ∼7.6k | ∼662 | ∼$0.009 | 44.63 |
| GPT-5.4 | ∼12.1k | ∼904 | ∼$0.044 | 44.32 |
| ODR | ∼66k | ∼8.6k | ∼$0.250 | 33.53 |
| Claude Sonnet-4.6 | ∼79k | ∼2.0k | ∼$0.270 | 55.80 |
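One way to read the table is to ask which agents are not dominated on both cost and WCS. The sketch below computes that Pareto front from the approximate figures above. The `pareto_front` helper is introduced here for illustration, and because the costs are rounded estimates, the result should be taken as indicative only.

```python
# (approximate cost per query in USD, aggregate WCS), copied from the table above.
agents = {
    "STORM": (0.135, 42.51),
    "QWEN-2.5-72B": (0.002, 36.84),
    "GPT-4o": (0.027, 35.42),
    "GPT-4o-mini": (0.002, 37.46),
    "GPT-5.4-mini": (0.009, 44.63),
    "GPT-5.4": (0.044, 44.32),
    "ODR": (0.250, 33.53),
    "Claude Sonnet-4.6": (0.270, 55.80),
}


def pareto_front(points):
    """Names of agents that no other agent beats or ties on both cost and WCS
    while being strictly better on at least one of the two."""
    front = []
    for name, (cost, wcs) in points.items():
        dominated = any(
            (c <= cost and w >= wcs) and (c < cost or w > wcs)
            for other, (c, w) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(agents))
# ['GPT-4o-mini', 'GPT-5.4-mini', 'Claude Sonnet-4.6']
```

On these figures the two deep research agents are dominated: STORM by GPT-5.4-mini, and ODR by every web-search baseline that is both cheaper and higher-scoring.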
A.7. Prompts Library



A.8. Human Study to Validate Pitch Point Extraction
A.9. Institutional Review Board Approval
A.10. Statistical Significance and Confidence Intervals
| Model | Mean WCS | 95% CI |
|---|---|---|
| STORM | 0.4246 | [0.4015, 0.4491] |
| ODR | 0.3358 | [0.3150, 0.3585] |
| GPT-4o | 0.3638 | [0.3422, 0.3860] |
| Qwen | 0.3692 | [0.3496, 0.3876] |
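The surviving text does not state how these intervals were computed. A percentile bootstrap over per-instance scores is one common choice, and the sketch below shows that procedure as an assumption rather than as the authors' method. The scores are synthetic, and the sample size of 180 simply mirrors the aggregate count in the results table.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, lo, hi


# Synthetic per-instance scores clipped to [0, 1], for illustration only.
rng = random.Random(42)
scores = [min(1.0, max(0.0, rng.gauss(0.4, 0.15))) for _ in range(180)]
mean, lo, hi = bootstrap_ci(scores)
print(f"mean={mean:.4f}, 95% CI=[{lo:.4f}, {hi:.4f}]")
```

Percentile intervals need not be symmetric about the mean, which is consistent with the slightly asymmetric intervals in the table.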
A.11. Broader Impacts and Ethical Considerations
Paradox of Measurement and Dual-Use Risks
Privacy and Data Stewardship
Economic Displacement and Human-AI Collaboration
Acceptable Use Policy
- Unsolicited High-Volume Outreach: Using the dataset to train agents for mass-spamming or harassment.
- Deceptive Practices: Generating content that masquerades as human correspondence without disclosure.
- Social Engineering: Leveraging the personalization metrics to craft targeted phishing attacks.
References
- Afzoon, Saleh, Zahra Jamali, Usman Naseem, and Amin Beheshti. 2024. PersoBench: Benchmarking personalized response generation in large language models. arXiv:2410.03198.
- Anthropic. 2026. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet (accessed on 2026-05-07).
- Bright Data. 2026. Bright Data SERP API. (accessed on 2026-05-07).
- Chen, Jin, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, and Xingmei Wang. 2024. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web 27, 4: 42.
- Choi, Hana, Carl Mela, Santiago Balseiro, and Adam Leary. 2020. Online display advertising markets: A literature review and future directions. Information Systems Research 31.
- Durmus, Esin, Liane Lovitt, Alex Tamkin, Stuart Ritchie, Jack Clark, and Deep Ganguli. 2024. Measuring the persuasiveness of language models.
- Hovland, C.I., I.L. Janis, and H.H. Kelley. 1953. Communication and Persuasion: Psychological Studies of Opinion Change. In Yale paperbound. Yale University Press.
- Jiang, Bowen, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, and Hanchao Yu. 2025. PersonaMem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory. arXiv:2512.06688.
- Kamenica, Emir, and Matthew Gentzkow. 2011. Bayesian persuasion. American Economic Review 101, 6: 2590–2615.
- LangChain. 2025. Open deep research. https://github.com/langchain-ai/open_deep_research (accessed on 2026-05-07).
- Lasswell, Harold D. 1948. The structure and function of communication in society. In The Communication of Ideas. Harper and Row: vol. 37, pp. 215–228.
- Li, Li, Peilin Cai, Ryan A Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, and Yuehan Qin. 2025. A personalized conversational benchmark: Towards simulating personalized conversations. arXiv:2505.14106.
- Matz, Sandra, S. Vaid, Heinrich Peters, Gabriella Harari, and M. Cerf. 2024. The potential of generative AI for personalized persuasion at scale. Scientific Reports 14.
- OpenAI. 2023. GPT-4 technical report. arXiv:2303.08774.
- Petty, Richard E., and John T. Cacioppo. 1986. The elaboration likelihood model of persuasion. In Advances in Experimental Social Psychology. Academic Press: Volume 19, pp. 123–205.
- Ricci, Francesco, Lior Rokach, and Bracha Shapira. 2010. Introduction to recommender systems handbook. In Recommender Systems Handbook. Springer: pp. 1–35.
- Salemi, Alireza, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2024. LaMP: When large language models meet personalization. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics Volume 1: 7370–7392.
- Shao, Yijia, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. 2024, June. Assisting in writing Wikipedia-like articles from scratch with large language models. In K. Duh, H. Gomez, and S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, pp. 6252–6278. Association for Computational Linguistics.
- Sharma, Sahil, Puneet Mittal, Mukesh Kumar, and Vivek Bhardwaj. 2025. The role of large language models in personalized learning: a systematic review of educational impact. Discover Sustainability 6, 1: 1–24.
- Tasdelen, Osman, and Daniel Bodemer. 2025. Generative AI in the classroom: effects of context-personalized learning material and tasks on motivation and performance. International Journal of Artificial Intelligence in Education, 1–22.
- Terho, Harri, Anna Salonen, and Meri Yrjänen. 2022a. Toward a contextualized understanding of inside sales: the role of sales development in effective lead funnel management. Journal of Business & Industrial Marketing 38, 2: 337–352.
- Terho, Harri, Anna Salonen, and Meri Yrjänen. 2022b. Toward a contextualized understanding of inside sales: the role of sales development in effective lead funnel management. Journal of Business and Industrial Marketing 38.
- Yang, An, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, and Haoran Wei. 2024. Qwen2.5 technical report. arXiv:2412.15115.
- Zhang, Saizheng, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018, July. Personalizing dialogue agents: I have a dog, do you have pets too? In I. Gurevych and Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2204–2213. Association for Computational Linguistics.
- Zhao, Zheng, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B Cohen, and Emine Yilmaz. 2025. PersonaLens: A benchmark for personalization evaluation in conversational AI assistants. Findings of the Association for Computational Linguistics: ACL 2025, 18023–18055.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).