Benchmarking the Personalization Capabilities of Large Language Models

Ashutosh Srivastava; Siddharth Yedlapati; Vinay Aggarwal; Yaman K Singla; Shashwat Dixit; Jitendra Ajmera

doi:10.20944/preprints202606.1214.v2

Submitted: 17 June 2026

Posted: 22 June 2026


Abstract

Personalization, the act of varying a message to induce action from a specific receiver while keeping sender, channel, and time fixed, has a long tradition in psychology and marketing as a two-party problem in which sender and receiver have independent objectives. Large language models remove the bounded-inventory constraint of classical retrieval-and-ranking approaches by generating a continuum of message variants conditioned on inferred receiver state, raising the question of how well current models perform personalization in the classical sense. Existing LLM personalization benchmarks measure sender-side adaptation, in which the receiver is the same user the model is serving. The two-party question, whether a generated message induces its intended action in a third party, has been investigated only through A/B tests and small-scale human studies that cannot be re-run against a new model on demand. We adapt the Bayesian Persuasion framework of Kamenica and Gentzkow (2011) to generative agents and instantiate the formulation in sales, where receiver actions are routinely logged against the outreach that induced them. We release SDR-Bench, a public corpus of 6,279 customer success stories spanning 22 industries and approximately 200 enterprises, served through a temporally constrained simulation that prevents future-data leakage. Across frontier LLMs and deep-research agents, we observe a consistent personalization plateau, and on a Fortune 100 tech cohort no model statistically separates successful from unsuccessful outreach. A field deployment with 12 professional sales representatives validates the framework, with 48 percent of model-generated content rated immediately useful and senior-expert agreement at Pearson 0.82. We release SDR-Arena and SDR-Bench publicly to support reproducible study of generative personalization at scale.

Keywords: large language models (LLMs); personalization; generative AI; Bayesian persuasion; sales development representatives (SDRs); benchmarking; persuasive communication; human-AI collaboration; agent evaluation; SDR-Bench; SDR-Arena; customer success stories; enterprise AI; information alignment; marketing and sales intelligence

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

Lasswell’s seminal model of communication (Lasswell (1948)) characterizes any communicative act in terms of who says what to whom, through which channel, with what effect, and at what time. Within this framework, personalization can be located as the act of varying the message, that is, the what, while conditioning on (and holding fixed) the speaker, receiver, channel, and time. The act has two parties whose interests need not initially align: a sender, who selects the message with the goal of inducing some action from the receiver, and a receiver, who has independent preferences and chooses whether to act. The study of how senders shape messages to induce receiver action has a long tradition in multiple fields, including psychology, economics, and marketing, beginning with the Yale Communication and Attitude Change program (Hovland et al. (1953)) and continuing through the Elaboration Likelihood Model (Petty and Cacioppo (1986)) and the formal signaling-game treatment of Bayesian Persuasion (Kamenica and Gentzkow (2011)).

So far, machine learning research on personalization has approached this problem as learning a policy to retrieve and rank items from a fixed inventory of candidates. Recommender systems rank items from a catalog against user-history signals (Ricci et al. (2010)); advertising platforms optimize the selection and targeting of ads among a fixed pool of pre-authored creatives (Choi et al. (2020)); persona-based dialogue systems condition response generation on explicit persona representations while remaining restricted to comparatively narrow conversational domains (Zhang et al. (2018)).

Large language models shift personalization from a retrieval-and-ranking problem to a generative one. LLMs can generate a continuum of message variants for a given (speaker, receiver, channel, time) tuple, conditioned on whatever attributes of the receiver can be inferred from the available context. A growing body of work has applied LLMs to generative personalization in marketing (Matz et al. (2024)), education (Sharma et al. (2025); Tasdelen and Bodemer (2025)), and human-AI interaction (Chen et al. (2024)). These applied results raise the question of how well current LLMs perform personalization in the sense the classical literature studies it: as a sender selecting what to say in order to induce a specific receiver action. Some work on measuring LLM personalization exists, but it measures a very different property from the personalization discussed in the psychology and economics literature (Kamenica and Gentzkow (2011); Lasswell (1948)). LaMP (Salemi et al. (2024)) evaluates personalized text generation conditioned on a user’s history of past interactions. PersoBench (Afzoon et al. (2024)) measures persona consistency in open-domain dialogue. PersonaConvBench (Li et al. (2025)) scores persona-grounded conversational quality. PersonaLens (Zhao et al. (2025)) evaluates assistant behavior under declared user preferences. PersonaMem (Jiang et al. (2025)) measures long-horizon recall of user attributes across sessions. In these works, the recipient of the LLM-generated message is the same user that the model is serving, and the optimization target is one-party: the alignment between the model’s output and the preferences of the user who issued the prompt, in the same sense that RLHF aligns an assistant to its user.

The two-party question, in which sender and receiver have independent objectives and the sender’s success is measured by whether the receiver acts, has primarily been investigated through randomized human-subject experiments in which participants are exposed to human- or LLM-generated persuasive messages and evaluated based on subsequent shifts in attitudes, agreement, or behavioral intentions (Durmus et al. (2024); Matz et al. (2024)). Such studies are tied to a specific methodology, require weeks of execution, and cannot be re-run against a new model on demand. Consequently, comparisons between LLMs and human experts for two-party personalization remain specific to individual studies and are not directly comparable across systems. There is therefore a need for a formal model of personalization that accounts for both the sender and the receiver of the message and is sufficient to support a reproducible, automated benchmark applicable to arbitrary generative systems.

This formulation requires an empirical setting in which receiver actions are observed and logged against the specific messages that induced them, messages are authored at the receiver level rather than at the segment level, and human-authored ground truth for known successful induced actions is available at scale. Sales outreach provides a good testbed because receiver actions (replying, scheduling a call, closing a deal) are routinely logged against the specific outreach that induced them (Terho et al. (2022a)). Sales outreach is drafted one-to-one by a Sales Development Representative (SDR) for a specific prospect. In an in-house study conducted with a Fortune 100 enterprise, we observed that personalized SDR outreach achieved approximately seven times the click-through rate of templated outreach when promoting the same products to comparable prospect groups. Furthermore, the sales funnel produces a layered set of human-authored artifacts at known successful transitions, namely outreach emails that secured a call, call transcripts that secured deal discussions, and post-deal customer success stories that document the content that closed the deal. Together, these properties make sales a natural empirical instantiation of the two-party formulation where each artifact in the funnel serves as ground truth at a different stage of the same (seller, prospect, product) tuple, enabling stage-specific evaluation of generative personalization.

We develop our framework, SDR-Arena (illustrated in Figure 1), on this empirical setting. Our contributions are as follows:

We adapt Bayesian Persuasion to generative agents, recovering personalization as informational alignment between an agent’s generated content and the receiver-specific content implicit in the ground-truth sales outreach artifact.
We construct SDR-Bench, a public corpus of 6,279 customer success stories from approximately 200 enterprises across 22 industries, each paired with the seller, prospect, product, and historical timestamp required to evaluate whether an agent can predict the strategic content of the deal-closing pitch.
We release SDR-Arena, an evaluation framework that operationalizes the formalization on SDR-Bench and on proprietary sales artifacts; to prevent future data leakage, where an agent retrieves the very success story it is being asked to predict, SDR-Arena serves agents a frozen view of the public web at the historical timestamp of each evaluation instance.
We apply the framework to proprietary sales-email and sales-transcript corpora from a Fortune 100 tech company and a mid-sized healthcare firm, comprising approximately 115,000 filtered outreach emails and 5,435 outreach calls by 124 SDRs, labeled by whether they induced a successful receiver action.
We validate the framework through field deployment with 12 professional sales development representatives across our partner enterprises and a gold-standard exercise with senior SDRs from five enterprises.

Figure 1. Overview of SDR-Arena showcasing how LLM generated output is compared with artifacts like Sales Emails, Transcripts & Success Stories to benchmark their personalization capability.

Across frontier LLMs and open-source deep-research agents, including STORM (Shao et al. (2024)), ODR (LangChain (2025)), GPT-4o (OpenAI (2023)), Claude Sonnet 4.6 (Anthropic (2026)), and Qwen-2.5 (Yang et al. (2024)), we observe a consistent personalization plateau. Alignment scores cluster in the 30 to 43 percent range, and on the tech-firm cohort no model statistically separates successful from unsuccessful outreach. Specialized agents such as STORM reach the upper end of the range, but at one to two orders of magnitude greater inference cost; standard LLMs with temporally constrained search occupy a more compute-efficient frontier.

2. Problem Formulation

We formalize sales outreach as a generative personalization problem by first describing the Sales Development Lifecycle and then casting personalization as a Bayesian Persuasion task in which the generated outreach aims to induce specific actions from recipients within the sales funnel.

2.1. Sales Development Lifecycle

A sales journey (illustrated in Figure 2) begins with Sales Development Representatives (SDRs) researching prospective accounts to identify needs and budget signals, then sending tailored outreach emails to schedule an initial call. The call further develops the prospect’s needs and progresses toward deal closure, with some opportunities materializing into deals and others not. Following a successful closure, the workflow often culminates in a Customer Success Story: a publicly documented case study published by the Seller company showcasing how their products helped a customer overcome key challenges. Released as web articles, these stories validate the partnership by documenting the transition from a ‘pain state’ to a ‘success state’ (Terho et al. (2022b)). Examples include success stories from Oracle, Salesforce, and Adobe.

Figure 2. Sales Journey: From Prospecting to Outreach to Call and eventual Deal Closure leading to Success Story publication.

2.2. Bayesian Persuasion Formulation

We adapt the Bayesian Persuasion framework of Kamenica and Gentzkow (2011) to model personalized outreach as a signaling game between a Sender (the SDR or the LLM agent that replaces the SDR) and a Receiver (the prospect). This framework is a natural fit for personalization because it acknowledges that the receiver enters the interaction with a prior belief about their own needs, and the sender’s role is to provide a signal, the personalized message, that updates the receiver’s posterior in favor of action.

The unobserved receiver state $\omega$ represents the latent compatibility between the receiver’s requirements and the sender’s product. We decompose $\omega$ at time $t$ as a tuple $\omega_{t} = \{n_{i}, w_{i}\}$, where $n$ denotes the receiver’s explicit needs (functional requirements, current pain points) and $w$ denotes their latent wants (strategic goals, avenues for value generation). The receiver chooses an action $a \in \{0, 1\}$, where $a = 1$ represents a successful transition to the next stage of the sales funnel (e.g., an email leading to a call, or a call leading to deal discussions), while $a = 0$ represents a failure to progress.

Following Kamenica and Gentzkow (2011), the Receiver is treated as a rational Bayesian agent with a prior belief $\mu$ over $\omega$. The Receiver takes action $a = 1$ if and only if their expected utility $u_{R}$, conditional on their belief, meets or exceeds a reservation threshold $\tau$:

$$\mathbb{E}\left[u_{R}(a = 1, \omega_{t}, \xi) \mid \mu\right] \geq \tau$$

The Sender’s utility $u_{S}$ is aligned with the Receiver taking action $a = 1$. Thus, the Sender’s goal is to deliver a signal that updates the receiver’s posterior belief such that the above condition is satisfied. Note that the utility function is also subject to exogenous factors $\xi$ (timing, organizational urgency, prior context, noise) that are independent of the sender’s signal but contribute to the receiver’s utility. The sender’s signal can shift $\mu$ in favor of acting; it cannot control $\xi$. Consequently, even an optimal signal is not guaranteed to induce $a = 1$, and the receiver’s action is best understood as a probabilistic outcome whose likelihood the sender attempts to maximize.
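The decision rule can be illustrated with a small numerical sketch. It assumes a discrete set of receiver states and an additive form for the exogenous term, neither of which is specified in the text; the state names, utility values, beliefs, and threshold are hypothetical. The posterior is simply given here, not derived from a signal structure.

```python
from typing import Dict


def expected_utility(belief: Dict[str, float], utility: Dict[str, float], xi: float = 0.0) -> float:
    """E[u_R(a=1, omega, xi) | belief], assuming u_R = utility[omega] + xi.

    The additive treatment of xi is an illustrative assumption.
    """
    if abs(sum(belief.values()) - 1.0) > 1e-9:
        raise ValueError("belief must sum to 1")
    return sum(p * (utility[state] + xi) for state, p in belief.items())


def receiver_acts(belief: Dict[str, float], utility: Dict[str, float], tau: float, xi: float = 0.0) -> bool:
    """The receiver chooses a = 1 iff expected utility meets the threshold tau."""
    return expected_utility(belief, utility, xi) >= tau


# Hypothetical two-state example.
utility = {"good_fit": 1.0, "poor_fit": -0.5}
prior = {"good_fit": 0.3, "poor_fit": 0.7}      # belief before the outreach
posterior = {"good_fit": 0.6, "poor_fit": 0.4}  # belief after a persuasive signal
tau = 0.2

acts_on_prior = receiver_acts(prior, utility, tau)
acts_on_posterior = receiver_acts(posterior, utility, tau)
# An unfavourable exogenous shock can block action even after a good signal.
acts_with_bad_timing = receiver_acts(posterior, utility, tau, xi=-0.3)
```

In this toy setting the receiver does not act on the prior, acts on the posterior, and stops acting once the exogenous term is sufficiently negative, which mirrors the point that the sender can shift the belief but not $\xi$.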

The key adaptation for our setting is that the sender does not observe $\omega_{t}$ directly. The sender (in our case an LLM agent $\Phi$) operates on an observable context $W_{t}$ (the state of the world at time $t$) consisting of public information available at time $t$, from which the latent state must be inferred. From this inference, the agent generates an outreach $O$, which we represent structurally as a list of pitch points. Each pitch point is a specific argument linking the seller’s product to one of the receiver’s inferred needs or wants through a particular value proposition. Formally, the agent’s policy is a mapping:

$$O = \{pp_{1}, pp_{2}, \dots, pp_{k}\}, \qquad \hat{O} = \{\hat{pp}_{1}, \dots, \hat{pp}_{n}\} = \Phi(W_{t}, P \mid \hat{\omega})$$

where $\hat{\omega}$ is the agent’s inferred receiver state, $\hat{O}$ is the outreach generated by agent $\Phi$, and $P$ is the product. The generated list serves as the informational signal intended to update the receiver’s posterior such that the probability of the desired stage transition ($a = 1$) is maximized.
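The shape of this mapping can be sketched as a typed interface. The class names, field names, and the toy `template_policy` below are hypothetical and stand in for an LLM agent; they only illustrate the inputs and outputs of the policy, not how any benchmarked agent is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class PitchPoint:
    """One argument linking the product to an inferred need or want."""
    product: str
    need_or_want: str
    value_proposition: str


@dataclass(frozen=True)
class InferredState:
    """The agent's estimate of the receiver state: explicit needs and latent wants."""
    needs: Sequence[str]
    wants: Sequence[str]


# Signature of the policy Phi: (context W_t, product P, inferred state) -> outreach.
Policy = Callable[[Sequence[str], str, InferredState], List[PitchPoint]]


def template_policy(context: Sequence[str], product: str, state: InferredState) -> List[PitchPoint]:
    """Toy stand-in for an LLM agent: one pitch point per inferred need or want.

    `context` is accepted only to match the Policy signature; a real agent
    would infer `state` from it instead of receiving the state ready-made.
    """
    targets = list(state.needs) + list(state.wants)
    return [
        PitchPoint(product, target, f"{product} helps with: {target}")
        for target in targets
    ]


# Hypothetical usage.
state = InferredState(needs=["slow onboarding"], wants=["expand into new regions"])
outreach = template_policy(context=[], product="ExampleCRM", state=state)
```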

2.3. Evaluation Methodology and Proxy

The ideal objective of the Sender is to generate an outreach $O$ that maximizes the Receiver’s expected utility:

$$O^{*} = \arg\max_{O} \, \mathbb{E}\left[u_{R}(a = 1, \omega) \mid O\right]$$

Direct evaluation of this objective is intractable because the receiver’s utility function $u_{R}$ and the true state $\omega$ are unobservable to the agent (and to the benchmark). We therefore replace the unobservable utility with an observable proxy: the informational overlap between the agent’s generated content and the content implicit in a human-authored message that is known to have induced the desired action.

We utilize a dataset of successful historical outreach attempts across different engagement stages (Emails, Calls, and Success Stories). Let $D = \{(C_{i}, \omega_{i}, O_{i}^{*})\}_{i = 1}^{N}$ be a set of ground truth examples where $O_{i}^{*}$ is a human-authored message that successfully induced action $a = 1$ (moving to the next funnel stage). Success in email is defined by an outreach $O_{i}^{*}$ that led to a call; a call success implies $O_{i}^{*}$ led to deal discussions; and a success story implies $O_{i}^{*}$ led to deal closure. Because each $O_{i}^{*}$ resulted in a positive outcome, it empirically satisfies the receiver’s utility threshold and can be treated as a sample from the set of utility-maximizing messages for the corresponding (sender, receiver, product, time) tuple. We extract a list of ground truth pitch points $V^{*}$ from $O_{i}^{*}$ and define the Weighted Coverage Score (WCS) as the semantic alignment between the predicted pitch points $\hat{O}$ and the ground truth points $V^{*}$:

$$\text{Weighted Coverage Score} = S(\hat{O}, V^{*})$$

where $S(\cdot, \cdot)$ is a semantic alignment scoring function (defined in Section 3.1).

This allows us to rigorously benchmark Agent performance by measuring the Relevance Alignment between the predicted pitch points (where the value proposition is embedded) against those extracted from the successful ground truth. This serves as a tractable proxy for the Bayesian persuasion objective: a higher matching score implies the Agent has successfully identified the winning strategy that induces the desired action.

WCS is, by construction, a lower bound on personalization quality. It captures the what component of personalization, that is, whether the agent has correctly inferred the receiver’s decision-relevant needs and wants and identified the value-generating arguments that historically induced the desired action, while abstracting away the how (style, tone, formatting). An agent that achieves high WCS has demonstrated that it can recover the receiver-specific strategic content of a known successful message; an agent that achieves low WCS has not, regardless of how well-written its output is.

2.4. Dataset Construction

To evaluate generative personalization in real-world settings, we construct two complementary datasets: (i) SDR-Bench, a large-scale public benchmark derived from enterprise sales artifacts, and (ii) a private enterprise dataset containing real sales outreach emails and downstream sales outcomes from two organizations. Together, these datasets provide both reproducible public evaluation and high-fidelity validation on real-world personalized communication.

SDR-Bench Dataset

To enable reproducible public benchmarking of generative personalization systems, we construct SDR-Bench, a large-scale corpus of publicly available enterprise sales narratives and customer success stories. We targeted approximately 12,000 global enterprises with revenues exceeding $1B, identifying sitemaps for 8,298 organizations and collecting 117,000 candidate URLs using heuristics tailored to common success-story paths (e.g., /customer-stories, /case-study).

A multi-stage filtering pipeline (Table 3) removed non-text formats, generic landing pages, articles lacking verifiable publication dates or identifiable product solutions, and anonymized stories where the customer organization was not explicitly named (e.g., “a large food products company”). This process yielded a final corpus of 6,279 success-story articles spanning 22 industries. Distributions of companies and stories by industry are shown in Figure 4 and Appendix Figure 6. Detailed construction steps are provided in Appendix A.1.
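The URL-collection heuristic described above can be sketched as a path filter. Only the two path fragments quoted in the text are used; the actual heuristics are presumably broader, and the function name and matching rule are illustrative assumptions.

```python
import re
from urllib.parse import urlparse

# Path fragments taken from the examples in the text. Requiring the fragment
# to be a whole path segment is an assumption made for this sketch.
STORY_PATH_PATTERN = re.compile(r"/(customer-stories|case-study)(/|$)", re.IGNORECASE)


def is_candidate_story_url(url: str) -> bool:
    """Return True if the URL path contains a success-story style segment."""
    path = urlparse(url).path
    return bool(STORY_PATH_PATTERN.search(path))


# Hypothetical sitemap entries.
sitemap_urls = [
    "https://example.com/customer-stories/acme-corp",
    "https://example.com/resources/case-study",
    "https://example.com/case-study-template",
    "https://example.com/pricing",
]
candidates = [u for u in sitemap_urls if is_candidate_story_url(u)]
```

A filter of this kind only produces candidates; the later pipeline stages (publication date, named customer, identifiable product) would still need to be applied to each page.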

Private Enterprise Dataset

To validate our theoretical proxy, we require settings where the ground-truth message $O_{i}^{*}$ and its successful outcome ($a = 1$) are explicitly observed. We therefore collaborated with two enterprises, a Fortune 100 technology company and a mid-sized healthcare firm, to collect real human-authored sales outreach paired with downstream prospect actions.

For the Fortune 100 company, we collected approximately 100k outreach emails authored by 124 SDRs and 5,435 sales call transcripts over a two-year period (2023–2025), identifying 13,236 instances in which the outreach successfully induced a sales call. For the healthcare firm, we analyzed 24,506 outreach emails, of which 354 resulted in a scheduled sales call. These successful outreach instances serve as observed realizations of optimal messages ($O^{*}$) in our relevance-alignment framework. Table 1 summarizes the dataset construction pipeline.

Table 1. Processing of Enterprise Sales Email Data.

Metric / Artifact Category	Healthcare	Tech
Number of SDRs	3	124
Total Emails Collected	48,150	609,191
Emails after Deduplication	31,034	186,379
Sales Outreach Emails	24,506	90,809
Sales Call scheduled	354	13,236
Golden dataset handpicked	400	400

Human Personalization Strategies

To characterize the qualitative structure of expert personalization, we analyzed the strategies employed by SDRs across both datasets (Figure 3). The three most common strategies were: (i) industry-based personalization, tailoring content to sector-specific trends and pain points; (ii) persona-based personalization, adapting the value proposition to the recipient’s organizational role; and (iii) activity-based personalization, leveraging behavioral signals such as webinar attendance or prior engagement.

Appendix Figure 5 provides qualitative examples showing how SDRs adapt the same product positioning differently across recipients with distinct inferred needs ($n_{i}$) and wants ($w_{i}$).

Figure 3. Distribution of count of strategies across a random subset of 34,000 emails.

Figure 4. Distribution of count of companies by Industry Type.

3. SDR-Arena

We introduce SDR-Arena, a scalable framework designed to systematically benchmark LLM-based agents on generative personalization over sales outreach artifacts. To ensure a rigorous and valid evaluation, the arena utilizes an isolated environment that provides agents access to a Historical Internet Simulator ($W_{t}$). The arena serves as a standardized testbed for comparing diverse agentic workflows, ranging from complex research pipelines to simple tool-use configurations. We evaluate two primary configurations on SDR-Bench and our Enterprise Dataset:

LLMs + Web Search: A baseline equipping frontier models with standard search tools to measure the marginal utility of agentic workflows against simpler tool-use capabilities.
Deep Research Agents: Specialized agents that produce comprehensive research via multi-turn conversation and broad search retrieval over the internet (LangChain (2025); Shao et al. (2024)).

Historical Internet Simulator: This environment prevents “future leakage” by enforcing a strict temporal boundary, ensuring agents only synthesize information that was publicly available at the simulated time of the sales interaction. The system enforces the $W_{t}$ boundary by passing search_start_date and search_end_date parameters to the BrightData SERP API (Bright Data (2026)). By restricting results to those available up to time $t$, we ensure that the generated pitch points are constructed solely from context that would have been accessible to a human researcher at the time of the original sales event. This prevents the agent from ‘cheating’ by retrieving the outcome of a deal that has not yet happened in the simulation.
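The temporal boundary can be sketched as follows. The parameter names `search_start_date` and `search_end_date` come from the text; the ISO date format, the `earliest` default, the `SearchResult` type, and the client-side `enforce_cutoff` check are assumptions for illustration and do not describe the actual SERP request or the SDR-Arena implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical search hit carrying a known publication date."""
    url: str
    published: date


def date_window_params(event_date: date, earliest: date = date(2000, 1, 1)) -> Dict[str, str]:
    """Build the date-window parameters for a search issued at time t.

    The ISO format and the `earliest` default are assumptions of this sketch.
    """
    return {
        "search_start_date": earliest.isoformat(),
        "search_end_date": event_date.isoformat(),
    }


def enforce_cutoff(results: List[SearchResult], event_date: date) -> List[SearchResult]:
    """Defensive client-side filter: keep only results published on or before t."""
    return [r for r in results if r.published <= event_date]


# Hypothetical usage: the success story published after t must not be visible.
t = date(2023, 6, 1)
hits = [
    SearchResult("https://example.com/press/2023-q1-earnings", date(2023, 4, 20)),
    SearchResult("https://example.com/customer-stories/acme-corp", date(2024, 2, 10)),
]
params = date_window_params(t)
visible = enforce_cutoff(hits, t)
```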

3.1. Evaluation Framework

We define each evaluation instance as a tuple $(S, C, P, t)$, where $S$ is the seller, $C$ is the prospect, $P$ represents the products, and $t$ is the historical timestamp. The tuple is extracted from each sales artifact individually. See Appendix A.4 for examples.

Implementation: The agent is prompted to act as a sales representative for $S$ pitching $P$ to $C$ using the time-restricted search tool. The resulting output $\hat{O} = \{\hat{pp}_{1}, \dots, \hat{pp}_{n}\}$ is a set of personalized pitch points intended to address the inferred needs and strategic goals of the prospect. We employ an LLM-based semantic judge to extract ground truth pitch points $V^{*}$ from the historical sales artifact. We use the raw content of the sales artifact and employ GPT-4o (OpenAI (2023)) to perform an ontological extraction of ‘Pitch Points.’ Each pitch point is required to follow a strict triad structure: Product/Service → Specific Pain Point → Value Proposition/Mechanism. To ensure the pitch points are grounded, the extraction model was instructed to provide and validate pitch points with exact ‘evidence quotes’ from the source text for every claim. An expert study verified the precision and recall of this extraction to be $0.92$ and $0.97$ respectively, showing strong alignment with expert judgment. See Appendix A.8 for more details.
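The grounding requirement and the precision/recall check can be sketched as below. The dictionary keys, the whitespace- and case-insensitive quote matching, and the use of identifier sets to compare extracted and expert pitch points are assumptions for illustration; the text does not specify how the expert study matched points.

```python
import re
from typing import Dict, Set, Tuple

TRIAD_FIELDS = ("product", "pain_point", "value_proposition", "evidence_quote")


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_grounded(pitch_point: Dict[str, str], source_text: str) -> bool:
    """True if all triad fields are present and the evidence quote occurs in the source."""
    if any(not pitch_point.get(field) for field in TRIAD_FIELDS):
        return False
    return _normalize(pitch_point["evidence_quote"]) in _normalize(source_text)


def precision_recall(extracted: Set[str], expert: Set[str]) -> Tuple[float, float]:
    """Precision and recall of extracted pitch-point ids against expert-marked ids."""
    true_positives = len(extracted & expert)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expert) if expert else 0.0
    return precision, recall


# Hypothetical artifact and extraction.
source = "After adopting the platform, Acme cut invoice   processing time by 40%."
point = {
    "product": "Example Platform",
    "pain_point": "slow invoice processing",
    "value_proposition": "automation shortens processing time",
    "evidence_quote": "cut invoice processing time by 40%",
}
grounded = is_grounded(point, source)
```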

For each $pp \in V^{*}$, the judge evaluates whether the agent’s output $\hat{O}$ successfully covered the point. Performance is measured by the Coverage Score, defined as the fraction of ground truth strategic value propositions successfully recovered by the agent. We employ a Coverage Judge that grades Sales Effectiveness and Factual Precision on a Likert-style scale from 0 to 5: 0 (Miss / Irrelevant), 1 (Marketing Fluff), 2 (Topic Match), 3 (Implied / Soft Match), 4 (Strong Sales Argument), and 5 (Strategic Bullseye), a perfect extraction that captures the exact pain point of the recipient and the specific mechanism the product provides to address it. The full scoring rubric and the judge prompt are in the Appendix.

Weighted Coverage Score (WCS): This metric normalizes the Likert-scale evaluations into a percentage representing the agent’s completeness in capturing the winning sales logic. For a given success story with $N$ ground truth pitch points, let $s_{i} \in \{0, \dots, 5\}$ be the score assigned by the judge for the $i$-th point. The WCS is calculated as:

$$\mathrm{WCS} = \left(\frac{\sum_{i = 1}^{N} s_{i}}{5N}\right) \times 100\%$$

A score of 100% implies that the agent successfully predicted every critical deal-winning argument with maximum specificity. It is important to note that this is a prediction task rather than a retrieval task: the ground truth serves as a future artifact, and agents must predict these winning points using only historical data available at time $t$. This metric measures the semantic alignment (Section 2.3) of agent outputs against these future artifacts, resulting in a realistic backtesting scenario.
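The WCS formula translates directly into a few lines of code. The function name and the example judge scores are hypothetical; the computation follows the definition above.

```python
from typing import Sequence

MAX_SCORE = 5  # top of the judge's 0-to-5 scale


def weighted_coverage_score(scores: Sequence[int]) -> float:
    """WCS in percent: sum of judge scores over the maximum attainable total."""
    if not scores:
        raise ValueError("at least one ground-truth pitch point is required")
    if any(s < 0 or s > MAX_SCORE for s in scores):
        raise ValueError("judge scores must lie in 0..5")
    return 100.0 * sum(scores) / (MAX_SCORE * len(scores))


# Hypothetical judge scores for four ground-truth pitch points:
# one bullseye, one soft match, one miss, one topic match.
example_wcs = weighted_coverage_score([5, 3, 0, 2])  # (5 + 3 + 0 + 2) / 20 = 50%
```

Because a miss contributes zero while still counting toward $N$, an agent cannot raise its WCS by covering a few points very well and ignoring the rest.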

4. Results & Analysis

We evaluate models across two categories: frontier LLMs augmented with the temporally restricted SDR-Arena web-search tool (Claude Sonnet 4.6, GPT-4o, GPT-4o-mini, GPT-5.4, GPT-5.4-mini, and QWEN-2.5-72B), and deep research agents (STORM and ODR), both built on QWEN-2.5-72B. These configurations are evaluated across three corpora. The first is a public corpus of curated customer success stories partitioned by industry: Technology, Manufacturing, Energy, and IT, with an aggregate set of 180 stories. The second is a corpus of transcripts of sales calls from a company exceeding $10B in revenue. The third is a corpus of human-authored enterprise sales emails, divided into two cohorts: a Healthcare company with under $1B in revenue and a Technology company exceeding $10B in revenue, each containing 200 successful and 200 unsuccessful emails.

4.1. Discussion of Empirical Findings

We observe several notable trends. First, Claude Sonnet 4.6 leads all models with an aggregate WCS of 55.8 on the public success story dataset, a meaningful gap above the next best model, GPT-5.4-mini at 44.63. Despite this, it recovers only roughly half of the strategic content of the human-authored success story, indicating a clear personalization plateau across all agent families. Second, frontier LLMs are a more cost-efficient alternative to deep research agents. Claude Sonnet 4.6 achieves the highest WCS at an inference cost comparable to ODR ($0.270 vs. $0.250), while surpassing it by more than 20 WCS points.

The Enterprise Sales Email cohorts (Table 2) reveal a sector-dependent pattern. In the Healthcare cohort, models more consistently assign higher scores to successful outreach than unsuccessful outreach (e.g., STORM: 32.27 successful vs. 22.46 unsuccessful), suggesting they capture personalization cues relevant to specialized, high-stakes sectors. In the Technology cohort, however, this pattern inverts or collapses: several models score unsuccessful emails comparably to or higher than successful ones (e.g., STORM: 43.15 unsuccessful vs. 39.24 successful), indicating that models generate coherent but strategically shallow content insufficient to drive real-world revenue in competitive markets.

Table 2. Ground truth artifact evaluation. Values reported as Coverage.

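Model	Emails: Healthcare (<$1B), Unsucc.	Emails: Healthcare (<$1B), Succ.	Emails: Tech (>$10B), Unsucc.	Emails: Tech (>$10B), Succ.	Sales Call Transcripts	Success Stories: Tech. (30)	Success Stories: Mfg. (30)	Success Stories: Energy (30)	Success Stories: IT (30)	Success Stories: Agg. (180)	Avg. Prompt Tokens per Query	Avg. Completion Tokens per Query	Avg. Cost per Query ($)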
STORM-QWEN-2.5	22.46	32.27	43.15	39.24	30.43	43.40	41.59	39.24	44.60	42.51	∼29k	∼6.3k	∼0.135
ODR-QWEN-2.5	15.40	30.41	38.51	39.82	22.55	30.16	30.95	35.28	35.33	33.53	∼66k	∼8.6k	∼0.250
QWEN2.5-72B (WEB)	32.11	36.72	39.53	36.43	25.34	32.09	36.75	37.04	38.21	36.84	∼5.6k	∼0.25k	∼0.002
Claude Sonnet 4.6 (WEB)	59.89	60.89	64.08	61.33	34.52	52.6	60.5	54	54.9	55.8	∼79.2k	∼2.2k	∼0.270
GPT-4o (WEB)	35.71	40.62	47.44	45.17	—	33.50	38.93	32.26	36.01	35.42	∼9.2k	∼0.4k	∼0.027
GPT-4o-mini (WEB)	36.16	39.14	48.51	45.63	—	33.30	38.54	37.07	38.05	37.46	∼12.2k	∼0.6k	∼0.002
GPT-5.4-mini (WEB)	39.28	43.67	54.64	52.66	—	40.99	46.02	41.18	45.39	44.63	∼7.6k	∼0.7k	∼0.009
GPT-5.4 (WEB)	39.57	48.71	53.80	53.02	—	38.21	45.46	42.90	46.67	44.32	∼12.1k	∼0.9k	∼0.044
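The sector-dependent pattern can be read off Table 2 by computing the successful-minus-unsuccessful WCS gap per model. The sketch below uses the email-cohort values from the table; the variable and function names are hypothetical, and the gaps are descriptive arithmetic only, not the statistical test behind the separation claim, which the text does not specify.

```python
# (unsuccessful, successful) WCS per cohort, copied from Table 2.
EMAIL_WCS = {
    "STORM-QWEN-2.5": {"healthcare": (22.46, 32.27), "tech": (43.15, 39.24)},
    "ODR-QWEN-2.5": {"healthcare": (15.40, 30.41), "tech": (38.51, 39.82)},
    "QWEN2.5-72B (WEB)": {"healthcare": (32.11, 36.72), "tech": (39.53, 36.43)},
    "Claude Sonnet 4.6 (WEB)": {"healthcare": (59.89, 60.89), "tech": (64.08, 61.33)},
    "GPT-4o (WEB)": {"healthcare": (35.71, 40.62), "tech": (47.44, 45.17)},
    "GPT-4o-mini (WEB)": {"healthcare": (36.16, 39.14), "tech": (48.51, 45.63)},
    "GPT-5.4-mini (WEB)": {"healthcare": (39.28, 43.67), "tech": (54.64, 52.66)},
    "GPT-5.4 (WEB)": {"healthcare": (39.57, 48.71), "tech": (53.80, 53.02)},
}


def success_gaps(cohort: str) -> dict:
    """Successful minus unsuccessful WCS for every model in one cohort."""
    return {
        model: round(cohorts[cohort][1] - cohorts[cohort][0], 2)
        for model, cohorts in EMAIL_WCS.items()
    }


healthcare_gaps = success_gaps("healthcare")
tech_gaps = success_gaps("tech")
models_with_inverted_tech_gap = [m for m, g in tech_gaps.items() if g < 0]
```

On these values every Healthcare gap is positive, while seven of the eight Tech gaps are negative (ODR is the exception).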

Pre-Training Leakage Is Not Driving WCS

A complementary concern is that publicly indexed success stories in SDR-Bench may have appeared in LLM pre-training corpora, inflating WCS through memorization rather than genuine inference. To probe this, we partition SDR-Bench by article publication date and re-evaluate on pre-2024 vs. post-2024 cohorts; since GPT-4o’s training cutoff sits between Q4-2023 and early 2024, post-2024 stories are unlikely to have been seen during pre-training. We observe negligible WCS differences across the split (STORM: $0.42$ vs. $0.43$; GPT-4o: $0.36$ vs. $0.36$). The absence of pre-training-era inflation indicates that performance on SDR-Bench reflects context-conditioned synthesis, not retrieval of memorized content.
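The split analysis amounts to partitioning instances by publication date and comparing mean WCS on each side. In the sketch below the exact cutoff date (1 January 2024), the function name, and the example records are assumptions for illustration, not the benchmark data.

```python
from datetime import date
from statistics import mean
from typing import Dict, Iterable, Tuple


def split_mean_wcs(instances: Iterable[Tuple[date, float]], cutoff: date) -> Dict[str, float]:
    """Mean WCS for stories published before vs. on or after the cutoff."""
    records = list(instances)
    pre = [wcs for published, wcs in records if published < cutoff]
    post = [wcs for published, wcs in records if published >= cutoff]
    if not pre or not post:
        raise ValueError("both sides of the split need at least one instance")
    pre_mean, post_mean = mean(pre), mean(post)
    return {"pre": pre_mean, "post": post_mean, "gap": post_mean - pre_mean}


# Hypothetical (publication date, WCS) records for one model.
records = [
    (date(2023, 5, 1), 0.40),
    (date(2023, 9, 1), 0.44),
    (date(2024, 3, 1), 0.43),
]
summary = split_mean_wcs(records, cutoff=date(2024, 1, 1))
```

A markedly higher pre-cutoff mean would point to memorization; a gap near zero, as reported above, does not.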

4.2. Human Alignment and Validation

To ensure that our automated metrics reliably reflect real-world quality, we conducted expert studies calibrating our Coverage Judge against independent human raters, and confirming practical utility through deployment with professional sales representatives.

To show that the Coverage Judge follows human judgement, we conducted a human study on 20 success stories (80 model responses across STORM, ODR, GPT, and Qwen). Three independent human annotators, blinded to model identities, scored coverage following the exact protocol of our LLM-based Coverage Judge.

The study yields three convergent signals on Judge fidelity. (i) The Judge tracks human scores with strong rank correlation (Spearman’s $\rho = 0.7435$, $p < 0.0001$), holding across models (ODR: $0.7768$; GPT-4o: $0.7575$; QWEN-2.5: $0.7330$; STORM: $0.6963$). (ii) The Judge preserves model ordering: both human- and Judge-graded WCS rank STORM > GPT-4o > ODR, so absolute-score differences do not distort comparative conclusions. (iii) The Judge is systematically more conservative than human raters (STORM: $48.06$ vs. $55.29$; GPT-4o: $37.94$ vs. $46.04$; ODR: $31.10$ vs. $41.69$), ruling out score inflation and establishing WCS as a rigorous lower bound that tracks human intuition at scale.
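Spearman’s rank correlation, used in signal (i), is the Pearson correlation of the rank-transformed scores. The sketch below is a dependency-free implementation with average ranks for ties; the function names are hypothetical. As a usage example it is applied to the three model-level means quoted in signal (iii), which checks ordering only and is not the per-response correlation of $0.7435$ reported above.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Model-level means from the text, in the order STORM, GPT-4o, ODR.
judge_wcs = [48.06, 37.94, 31.10]
human_wcs = [55.29, 46.04, 41.69]
ordering_rho = spearman(judge_wcs, human_wcs)
```

Since the Judge and the human raters order the three models identically, the model-level rank correlation is 1 even though the Judge’s absolute scores are lower.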

We show that the WCS-based ranking transfers to expert SDR judgment in two field studies with 12 senior SDRs from the partner enterprises whose data appears in this paper. The first measures per-pitch usefulness, i.e., whether any individual model-generated pitch point would be used verbatim in real outreach; the second measures strategy-level overlap between model output and SDR-authored gold standards. Together they probe the two granularities at which an automated score can mismatch expert judgment: per-point quality and overall strategic match.

Per-pitch usefulness: Twelve SDRs used GPT-4o + SDR-Arena to generate pitch points for 200+ new prospect companies inside their normal outbound pipeline, with each SDR auditing the model output on accounts they were actively working. For every generated pitch point, the SDR rated, on a binary criterion, whether it both (a) reflected genuine understanding of the prospect’s pain points and (b) was usable in outreach without rewriting; 48.2% of pitch points met both criteria. The field rate corresponds to roughly half of agent output being expert-grade in live deployment, with the remaining points being factually accurate but strategically generic, directly consistent with the personalization plateau identified above.
Gold-Standard Alignment: We asked senior SDRs ($\geq 10$ years of industry experience and $\geq 5$ years at the firm) from five enterprises to independently author reference “gold-standard” strategies to pitch 30 products to 5 prospects each. The SDRs were not shown any model output during the exercise, so the reference strategies are an independent expert read of what should be pitched. We then computed the overlap between the gold-standard strategy and the outputs of the four benchmarked models, and correlated this expert-overlap score against the corresponding automated WCS on SDR-Bench. Across models, expert overlap and WCS track at Pearson $r = 0.816$. The WCS-based ranking therefore transfers to senior-SDR judgment without re-tuning the rubric per enterprise, supporting WCS as a calibrated proxy for whether an agent has identified the strategic content a domain expert would pitch.
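The per-pitch criterion is a conjunction of two binary judgments, so the usefulness rate is the share of pitch points for which both hold. The sketch below illustrates that aggregation; the ratings are hypothetical and do not reproduce the field data.

```python
# One tuple per generated pitch point: (reflects_pain_points, usable_without_rewriting).
# Hypothetical ratings for illustration; not data from the field study.
ratings = [
    (True, True),
    (True, False),
    (False, False),
    (True, True),
    (False, True),
    (True, True),
]

def usefulness_rate(rows):
    """Share of pitch points that satisfy both criteria (a) and (b)."""
    useful = sum(1 for understands, usable in rows if understands and usable)
    return useful / len(rows)

rate = usefulness_rate(ratings)
print(f"{rate:.1%} of pitch points met both criteria")
```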

Together, these studies establish that our pipeline’s outputs are both factually grounded and meaningful in live sales contexts.

We also evaluate a closed-source deep research agent, GEMINI-2.5-PRO-DR, on a separate 25-story subset. Its Deep Research API does not expose temporal-restriction parameters, and its higher inference cost precludes broader evaluation. On this subset, it achieves a WCS of 62.63. Notably, the margin between this score and that of Claude Sonnet 4.6 with web search remains narrow, further underscoring that frontier LLMs with web search constitute a cost-effective alternative to deep research agents.

5. Conclusion

In this work, we introduced SDR-Arena, the first comprehensive framework for benchmarking the generative personalization capabilities of Large Language Models. By grounding our evaluation in the Bayesian Persuasion framework, we transitioned from subjective assessments of "quality" to a rigorous measure of Relevance Alignment. Our experiments utilize SDR-Bench—a novel, high-fidelity corpus of over 6,200 success stories—and a unique enterprise-scale dataset of successful sales outreach to quantify how effectively LLMs can synthesize winning strategic arguments.

Our findings reveal a significant “personalization plateau”: a substantial gap remains between AI-generated outreach and human-level strategic proficiency.

By releasing SDR-Arena, we provide the research community with the tools necessary to study autonomous personalization while strictly controlling for data leakage. As LLMs continue to move into high-stakes business operations, we hope this framework serves as a foundation for developing AI agents that are not only persuasive but verifiably aligned with the nuanced needs of their human recipients.

A. Appendix

Figure 5. Personalization in actual Sales Emails.

Figure 6. Distribution of count of Success Stories by Industry Type.

Figure 7. Qualitative Example: Ground truth pitch points scored against pitch points generated by the agent.

A.1. SDR-Bench: Dataset Curation Details

Table 3. Filtration Criteria and Counts for Scraped Public Data

Filtration Criteria	Count
Domains Found for Companies with over $1B revenue	∼30k
Domains Found for B2B Companies with over $1B revenue	12,080
Companies whose Sitemap could be found	8,298
Candidate Success Story URLs based on pattern matching	∼117k
Count of Companies covering these 117k URLs	1,772
Exclude non-text formats (videos/pdfs)	∼79k
URLs for which content could be collected	∼31k
Qwen based filtering using content to exclude listicle, parent, generic pages and pages with no publish date	∼7.2k
Filtering out stories where the customer is not a specific business	6,279

A.2. Sales Emails

A.2.1. Filtering & Analysis

Let $E = \{e_{1}, e_{2}, \dots, e_{N}\}$ denote the raw corpus of sales emails. We apply a three-stage filtering pipeline:

Language Filtering: We remove all non-English emails using language detection, yielding $E_{\text{en}} \subset E$.
Email Deduplication: We identify and remove duplicate email templates using a combination of exact matching and fuzzy string comparison, yielding $E_{\text{deduplicated}} \subset E_{\text{en}}$.
Intent Classification: We employ an LLM-as-a-judge paradigm to classify emails into outreach versus non-outreach categories. Specifically, we filter out generic conversational emails, administrative correspondence, and non-sales communications. Let $J : E_{\text{deduplicated}} \to \{0, 1\}$ be the LLM judge function, where $J(e) = 1$ indicates a valid sales outreach email. Our filtered corpus is thus:

$E_{\text{filtered}} = \{e \in E_{\text{deduplicated}} : J(e) = 1\}$
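The three stages compose as successive filters over the corpus. The following is a minimal sketch under loud assumptions: `is_english` is a crude ASCII placeholder for a real language detector, the 0.9 similarity threshold is an arbitrary illustrative value, and `judge` is a keyword stub standing in for the LLM judge. None of these reflect the implementation used in the paper.

```python
from difflib import SequenceMatcher

def is_english(text):
    """Placeholder language check (assumption): a real pipeline would call a language detector."""
    return all(ord(ch) < 128 for ch in text)

def is_near_duplicate(a, b, threshold=0.9):
    """Fuzzy match on character-level similarity; the threshold is an assumed value."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def judge(email):
    """Keyword stub standing in for the LLM judge: 1 for sales outreach, 0 otherwise."""
    text = email.lower()
    return 1 if "demo" in text or "pricing" in text else 0

def filter_corpus(emails):
    """Apply language filtering, deduplication, and intent classification in order."""
    english = [e for e in emails if is_english(e)]
    deduplicated = []
    for e in english:
        # Keep an email only if it is neither an exact nor a fuzzy match of one already kept.
        if not any(e == kept or is_near_duplicate(e, kept) for kept in deduplicated):
            deduplicated.append(e)
    return [e for e in deduplicated if judge(e) == 1]

corpus = [
    "Hi Ana, could we schedule a demo of our analytics suite next week?",
    "Hi Ana, could we schedule a demo of our analytics suite next week!",
    "Bonjour, pouvons-nous planifier une démonstration ?",
    "Thanks, the invoice has been received.",
]
filtered = filter_corpus(corpus)
print(filtered)
```

In this toy corpus the second email is dropped as a near-duplicate template, the third by the language filter, and the fourth by the judge stub.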

For each email $e \in E_{\text{filtered}}$, we use an LLM to extract the set of strategies employed in the email: $\mathrm{Strat}(e) \subseteq S$. This allows us to visualize the following patterns:

Strategy Frequency Distribution: The distribution $P(s)$ over strategies reveals the current state of human personalization practices.
Product-Conditional Strategies: The distribution $P(s \mid \text{Product}_{k})$ identifies product-specific personalization patterns.

These distributions provide interpretable insights into how human SDRs currently operationalize personalization.
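Both distributions can be estimated by counting strategy labels. The sketch below assumes hypothetical annotations and normalizes over strategy mentions, so that each distribution sums to one; normalizing per email instead is an equally plausible reading of the text.

```python
from collections import Counter, defaultdict

# Hypothetical annotations: (product, set of strategy labels found in one email).
emails = [
    ("Product A", {"Industry-based", "Event-based"}),
    ("Product A", {"Industry-based"}),
    ("Product B", {"Lead Persona-based"}),
    ("Product B", {"Industry-based", "Lead Persona-based"}),
]

def strategy_distribution(rows):
    """Estimate P(s) as the share of all strategy mentions that belong to strategy s."""
    counts = Counter(s for _, strategies in rows for s in strategies)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def product_conditional(rows):
    """Estimate P(s | product) by computing the same distribution within each product."""
    by_product = defaultdict(list)
    for product, strategies in rows:
        by_product[product].append((product, strategies))
    return {p: strategy_distribution(r) for p, r in by_product.items()}

p_s = strategy_distribution(emails)
p_s_given_product = product_conditional(emails)
print(p_s)
print(p_s_given_product)
```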

Beyond strategy classification, we extract fine-grained pitch points from each email using an LLM. For each email $e$, we extract $\mathrm{pp}(e) = \{\mathrm{pp}_{1}, \mathrm{pp}_{2}, \dots, \mathrm{pp}_{k}\}$, where each $\mathrm{pp}_{i}$ represents a discrete pitch point used in the outreach. These pitch points constitute the ground truth against which DR agent outputs are evaluated.

The email dataset comprises annotated outreach emails with the following attributes per sample:

Target Company $T_{i}$: The recipient organization.
Sender’s Company $S_{i}$: The sender’s organization.
Email $E_{i}$: The content of the email.
Timestamp $t$: Date when the email was sent.
Product $P$: The solution being pitched.
Strategy Labels $\mathrm{Strat}(e)$: Personalization strategies used in the email.
Pitch Points $\mathrm{pp}(e)$: Pitch points used in the email.
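One possible in-memory representation of a sample is an immutable record with one field per attribute. The class name, field names, and example values below are hypothetical; the private dataset's actual storage format is not described in the paper.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class OutreachEmail:
    """One annotated sample; field names are illustrative, not the dataset's schema."""
    target_company: str               # T_i: the recipient organization
    sender_company: str               # S_i: the sender's organization
    body: str                         # E_i: the content of the email
    timestamp: date                   # t: date when the email was sent
    product: str                      # P: the solution being pitched
    strategies: frozenset = frozenset()   # Strat(e): strategy labels
    pitch_points: tuple = ()              # pp(e): extracted pitch points

# Placeholder values for illustration only.
sample = OutreachEmail(
    target_company="ExampleCo",
    sender_company="VendorCo",
    body="Hi, I noticed your team recently expanded into a new region...",
    timestamp=date(2024, 3, 5),
    product="Example Analytics",
    strategies=frozenset({"Event-based", "Geography-based"}),
    pitch_points=("Regional compliance reporting", "Faster onboarding for new teams"),
)
print(sample.product, sorted(sample.strategies), len(sample.pitch_points))
```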

A.2.2. Personalization Strategies

To systematically characterize the personalization strategies used in the emails, we employed the following two-step pipeline:

First, we asked domain experts to manually annotate a seed set of emails to identify recurring personalization patterns. Second, we used an LLM to extract and cluster strategies from 500 randomly sampled emails, which were then reconciled with expert annotations to produce a unified taxonomy.

A personalization strategy $s \in S$ is a variable representing the primary information source leveraged to establish relevance between the seller’s value proposition and the buyer’s needs. We define the Personalization Strategy Space $S = \{s_{1}, s_{2}, \dots, s_{10}\}$ consisting of 10 categories:

Industry-based: References industry-specific trends, pain points, competitors, or case studies from the target company’s industry.
Event-based: Leverages trigger events (funding rounds, M&A, product launches, earnings reports, news mentions) to identify timely business needs.
Technology-based: References the recipient’s current tech stack to propose replacement, integration, or complementary solutions.
Lead Activity-based: References direct actions by the specific lead (whitepaper downloads, webinar attendance, pricing page visits, demo interactions).
Buying Group Activity-based: References collective actions by the lead’s team or buying committee.
Geography-based: Utilizes physical location or regional regulatory context (e.g., GDPR, CCPA compliance requirements).
Lead Persona-based: Explicitly maps the lead’s role, title, or job responsibilities to role-specific pain points.
Firmographics-based: Leverages company-level metrics (headcount growth, revenue, department size) as personalization anchors.
Relationship-based: References existing customer relationships, cross-sell or upsell opportunities.
None: Generic outreach lacking recipient-specific context.
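When strategy labels are produced by an LLM extractor, it can help to validate them against the closed taxonomy. The sketch below encodes the ten categories as an enumeration, with labels hyphenated uniformly; the `parse_labels` helper is a hypothetical name and is not part of the released framework.

```python
from enum import Enum

class Strategy(Enum):
    """The ten categories of the Personalization Strategy Space S."""
    INDUSTRY = "Industry-based"
    EVENT = "Event-based"
    TECHNOLOGY = "Technology-based"
    LEAD_ACTIVITY = "Lead Activity-based"
    BUYING_GROUP_ACTIVITY = "Buying Group Activity-based"
    GEOGRAPHY = "Geography-based"
    LEAD_PERSONA = "Lead Persona-based"
    FIRMOGRAPHICS = "Firmographics-based"
    RELATIONSHIP = "Relationship-based"
    NONE = "None"

def parse_labels(raw_labels):
    """Map label strings to Strategy members, rejecting labels outside the taxonomy."""
    by_value = {s.value.lower(): s for s in Strategy}
    parsed = set()
    for label in raw_labels:
        key = label.strip().lower()
        if key not in by_value:
            raise ValueError(f"unknown strategy label: {label!r}")
        parsed.add(by_value[key])
    return parsed

labels = parse_labels(["Event-based", " geography-based "])
print(sorted(s.name for s in labels))
```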

A.3. How to Measure Personalization in an Ideal World?

Ideally, one could evaluate personalization by observing how the same Receiver responds to multiple personalized signals $s_{i}$, where each $s_{i}$ is generated by a different LLM, effectively a multiverse of interventions. By comparing Receiver actions across these interventions, we could directly quantify the personalization abilities of different LLMs. Because such a multiverse is unavailable in practice, we construct an empirical benchmark using real-world sales interactions.

A.4. Task Formulation Details

For the success story of Salesforce, the tuple would be (S: Salesforce, C: Snapology of Lehi, P: Salesforce Starter, t: 26-05-2023).
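The tuple and its role as a temporal cutoff can be illustrated as follows. This is a sketch under assumptions: the `Task` class, the document layout, and the choice of a strict "before $t$" comparison are hypothetical, and the actual SDR-Arena interface may differ.

```python
from dataclasses import dataclass
from datetime import date, datetime

@dataclass(frozen=True)
class Task:
    seller: str      # S
    customer: str    # C
    product: str     # P
    cutoff: date     # t

def parse_task(seller, customer, product, t):
    """Build a Task from a DD-MM-YYYY date string, the format used in the example."""
    return Task(seller, customer, product, datetime.strptime(t, "%d-%m-%Y").date())

def visible_documents(task, documents):
    """Keep documents dated strictly before the cutoff (strictness is an assumption)."""
    return [d for d in documents if d["published"] < task.cutoff]

task = parse_task("Salesforce", "Snapology of Lehi", "Salesforce Starter", "26-05-2023")

# Hypothetical search results used to show the effect of the cutoff.
docs = [
    {"title": "earlier article", "published": date(2023, 1, 10)},
    {"title": "later article", "published": date(2023, 9, 1)},
]
visible = visible_documents(task, docs)
print(task.cutoff.isoformat(), [d["title"] for d in visible])
```

Filtering on the publication date is what keeps an agent from reading the success story itself, or anything written after it, while researching the prospect.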

A.5. Alignment of LLM with Humans for Pitch Point Extraction

Table 4. Comparative analysis between LLM extracted pitch points and human annotations on 30 customer success stories.

TP	FP	FN	Precision	Recall	F1 Score
138	11	3	0.92	0.97	0.95
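The three metrics follow directly from the confusion counts in Table 4. The short check below recomputes them; the function name is illustrative. The reported figures match the recomputed values when truncated, rather than rounded, to two decimal places.

```python
import math

def precision_recall_f1(tp, fp, fn):
    """Standard definitions from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from Table 4.
precision, recall, f1 = precision_recall_f1(tp=138, fp=11, fn=3)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

# Truncate to two decimals to compare against the reported values.
truncated = [math.floor(v * 100) / 100 for v in (precision, recall, f1)]
print(truncated)
```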

A.6. Token Usage vs Performance of Agents

Figure 8. Graph of token usage vs performance of various agents.

Table 5 reports the average per-outreach token consumption and inference cost for each agent configuration on the SDR-Bench evaluation set, alongside its WCS. Costs are computed using public list prices for the corresponding model API at the time of evaluation. Deep-research pipelines (STORM, ODR) consume one to two orders of magnitude more tokens than standard LLM-plus-search baselines, while only marginally improving WCS over the latter. QWEN-2.5-72B is the most cost-efficient configuration, achieving WCS within ∼5.7 points of STORM at ∼67× lower cost.

Table 5. Per-outreach inference cost vs. WCS on the SDR-Bench evaluation set.

Model	Avg. Prompt Tokens	Avg. Completion Tokens	Avg. Inference Cost	WCS
STORM	∼29k	∼6.3k	∼$0.135	42.51
QWEN-2.5-72B	∼5.6k	∼250	∼$0.002	36.84
GPT-4o	∼9.2k	∼427	∼$0.027	35.42
GPT-4o-mini	∼12.2k	∼572	∼$0.002	37.46
GPT-5.4-mini	∼7.6k	∼662	∼$0.009	44.63
GPT-5.4	∼12.1k	∼904	∼$0.044	44.32
ODR	∼66k	∼8.6k	∼$0.250	33.53
Claude Sonnet-4.6	∼79k	∼2.0k	∼$0.270	55.80
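The gap and cost ratio quoted for QWEN-2.5-72B relative to STORM can be recomputed from the table. The sketch below copies the cost and WCS columns of Table 5; because the listed costs are rounded, the resulting ratios are approximate, and the `compare` helper is an illustrative name.

```python
# (average inference cost in USD, WCS) per outreach, copied from Table 5.
TABLE_5 = {
    "STORM": (0.135, 42.51),
    "QWEN-2.5-72B": (0.002, 36.84),
    "GPT-4o": (0.027, 35.42),
    "GPT-4o-mini": (0.002, 37.46),
    "GPT-5.4-mini": (0.009, 44.63),
    "GPT-5.4": (0.044, 44.32),
    "ODR": (0.250, 33.53),
    "Claude Sonnet-4.6": (0.270, 55.80),
}

def compare(reference, candidate, table=TABLE_5):
    """Return (WCS gap, cost ratio) of the reference model relative to the candidate."""
    ref_cost, ref_wcs = table[reference]
    cand_cost, cand_wcs = table[candidate]
    return ref_wcs - cand_wcs, ref_cost / cand_cost

gap, ratio = compare("STORM", "QWEN-2.5-72B")
print(f"WCS gap: {gap:.2f} points, cost ratio: {ratio:.1f}x")
```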

A.7. Prompts Library

A.8. Human Study to Validate Pitch Point Extraction

To validate the LLM’s accuracy and exhaustiveness in extracting pitch points, we conducted a human study on a random sample of 30 customer success stories. Annotators evaluated each story against the LLM-extracted pitch points along two dimensions: (1) Precision — verifying factual consistency and flagging hallucinations, and (2) Recall — identifying any pitch points the LLM missed. The LLM achieved a precision of 0.92, recall of 0.97, and an F1-score of 0.95, validating its use as a robust, scalable proxy for ground truth extraction.

A.9. Institutional Review Board Approval

The human evaluation studies, including the per-pitch usefulness field deployment with 12 sales development representatives and the gold-standard alignment exercise with senior SDRs from five enterprises, were reviewed and approved by the Institutional Review Board. All participants were informed of the study’s purpose and provided consent prior to participation. No sensitive personal data was collected beyond professional judgments on model-generated sales content, and all responses were anonymized prior to analysis.

A.10. Statistical Significance and Confidence Intervals

We acknowledge that the evaluation is conducted on a limited subset due to the high computational cost of deep research agents. To ensure robustness, we perform bootstrapping (1,000 iterations) to compute 95% confidence intervals (CIs) for the Weighted Coverage Score (WCS) across the SDR-Bench evaluation set.
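A bootstrap interval of this kind can be computed as sketched below. The percentile method, the fixed seed, and the synthetic scores are assumptions made for illustration; the paper does not state which bootstrap variant was used.

```python
import random
from statistics import mean

def bootstrap_ci(scores, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (the percentile method is an assumption)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample n scores with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(iterations))
    lower = means[round(alpha / 2 * iterations)]
    upper = means[round((1 - alpha / 2) * iterations) - 1]
    return mean(scores), lower, upper

# Synthetic per-interaction WCS values in [0, 1); not taken from the benchmark.
scores = [((i * 37) % 100) / 100 for i in range(60)]
point, lower, upper = bootstrap_ci(scores)
print(f"mean={point:.4f}  95% CI=[{lower:.4f}, {upper:.4f}]")
```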

Table 6. Bootstrap Estimates of WCS on SDR-Bench Evaluation Set.

Model	Mean WCS	95% CI
STORM	0.4246	[0.4015, 0.4491]
ODR	0.3358	[0.3150, 0.3585]
GPT-4o	0.3638	[0.3422, 0.3860]
Qwen	0.3692	[0.3496, 0.3876]

Validation of the Personalization Plateau. The 95% CIs for GPT-4o [0.3422, 0.3860] and Qwen [0.3496, 0.3876] exhibit substantial overlap, indicating no statistically significant difference in performance. This supports the existence of a personalization plateau, where different model architectures converge to a similar performance ceiling under our evaluation framework.

Significance of STORM. In contrast, the CI for STORM [0.4015, 0.4491] does not overlap with those of other models, indicating a statistically significant performance improvement.
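The overlap statements above can be checked mechanically from the intervals in Table 6. The sketch below lists every pair of models whose intervals are disjoint; note that interval overlap is a conservative heuristic rather than a formal hypothesis test.

```python
from itertools import combinations

# 95% bootstrap confidence intervals copied from Table 6.
INTERVALS = {
    "STORM": (0.4015, 0.4491),
    "ODR": (0.3150, 0.3585),
    "GPT-4o": (0.3422, 0.3860),
    "Qwen": (0.3496, 0.3876),
}

def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Pairs of models whose intervals do not overlap.
separated = [
    (m1, m2)
    for m1, m2 in combinations(INTERVALS, 2)
    if not overlaps(INTERVALS[m1], INTERVALS[m2])
]
print(separated)
```

Only the pairs involving STORM are disjoint; the ODR, GPT-4o, and Qwen intervals all overlap one another.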

Stability of Estimates. The relatively narrow width of the confidence intervals suggests stable estimates despite the limited sample size. The evaluation set comprises approximately 720 agent–environment interactions, providing a sufficiently representative estimate of model performance under the SDR-Arena setup.

A.11. Broader Impacts and Ethical Considerations

Our work raises important societal and ethical considerations, which we address below.

Paradox of Measurement and Dual-Use Risks

Benchmarking personalization creates an inherent tension: quantifying what makes sales outreach effective risks providing a blueprint for scalable, manipulative content. We argue, however, that the absence of transparent evaluation standards poses a greater risk by allowing opaque commercial systems to operate unchecked. SDR-Arena provides the transparency needed to distinguish context-aware assistance from hallucinatory or manipulative outreach.

Privacy and Data Stewardship

Our dataset curation followed strict ethical guidelines. The proprietary email dataset was processed in a secure, access-controlled environment with all PII anonymized or redacted, and is not included in our public release. The public SDR-Bench is limited to already-published customer success stories, further filtered to enterprise entities to minimize individual exposure.

Economic Displacement and Human-AI Collaboration

Our findings reveal a “personalization plateau,” suggesting LLMs currently lag behind human experts in identifying nuanced, strategic revenue drivers. This supports a Human-in-the-Loop paradigm: our benchmark should guide assistants that reduce research drudgery for humans, not autonomous systems that replace human judgment.

Acceptable Use Policy

To mitigate the risks of misuse, the release of our framework and the SDR-Bench dataset will be accompanied by a restrictive Acceptable Use Policy. This policy explicitly prohibits the use of our artifacts or fine-tuned models for:

Unsolicited High-Volume Outreach: Using the dataset to train agents for mass-spamming or harassment.
Deceptive Practices: Generating content that masquerades as human correspondence without disclosure.
Social Engineering: Leveraging the personalization metrics to craft targeted phishing attacks.

By bringing scientific rigor to sales agent evaluation, we aim to steer the field toward personalization that respects user context and delivers genuine value, rather than optimizing for engagement at the expense of user trust.

References

Afzoon, Saleh, Zahra Jamali, Usman Naseem, and Amin Beheshti. 2024. Persobench: Benchmarking personalized response generation in large language models. arXiv:2410.03198.
Anthropic. 2026. Claude sonnet 4.6. https://www.anthropic.com/claude/sonnet (accessed on 2026-05-07).
Bright Data. 2026. Bright data serp api. (accessed on 2026-05-07).
Chen, Jin, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, and Xingmei Wang. 2024. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web 27, 4: 42.
Choi, Hana, Carl Mela, Santiago Balseiro, and Adam Leary. 2020. Online display advertising markets: A literature review and future directions. Information Systems Research 31.
Durmus, Esin, Liane Lovitt, Alex Tamkin, Stuart Ritchie, Jack Clark, and Deep Ganguli. 2024. Measuring the persuasiveness of language models.
Hovland, C.I., I.L. Janis, and H.H. Kelley. 1953. Communication and Persuasion: Psychological Studies of Opinion Change. In Yale paperbound. Yale University Press.
Jiang, Bowen, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, and Hanchao Yu. 2025. Personamem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory. arXiv:2512.06688.
Kamenica, Emir, and Matthew Gentzkow. 2011. Bayesian persuasion. American Economic Review 101, 6: 2590–2615.
LangChain. 2025. Open deep research. https://github.com/langchain-ai/open_deep_research (accessed on 2026-05-07).
Lasswell, Harold D. 1948. The structure and function of communication in society. In The communication of ideas. Harper and Row: vol. 37, pp. 215–228.
Li, Li, Peilin Cai, Ryan A Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, and Yuehan Qin. 2025. A personalized conversational benchmark: Towards simulating personalized conversations. arXiv:2505.14106.
Matz, Sandra, S. Vaid, Heinrich Peters, Gabriella Harari, and M. Cerf. 2024. The potential of generative ai for personalized persuasion at scale. Scientific Reports 14.
OpenAI. 2023. Gpt-4 technical report. arXiv:2303.08774.
Petty, Richard E., and John T. Cacioppo. 1986. The elaboration likelihood model of persuasion. In Advances in Experimental Social Psychology. Academic Press: Volume 19, pp. 123–205.
Ricci, Francesco, Lior Rokach, and Bracha Shapira. 2010. Introduction to recommender systems handbook. In Recommender systems handbook. Springer: pp. 1–35.
Salemi, Alireza, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2024. Lamp: When large language models meet personalization. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics Volume 1: 7370–7392.
Shao, Yijia, Yucheng Jiang, Theodore Kanell, Peter Xu, Omar Khattab, and Monica Lam. 2024, June. Assisting in writing Wikipedia-like articles from scratch with large language models. In K. Duh, H. Gomez, and S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, pp. 6252–6278. Association for Computational Linguistics.
Sharma, Sahil, Puneet Mittal, Mukesh Kumar, and Vivek Bhardwaj. 2025. The role of large language models in personalized learning: a systematic review of educational impact. Discover Sustainability 6, 1: 1–24.
Tasdelen, Osman, and Daniel Bodemer. 2025. Generative ai in the classroom: effects of context-personalized learning material and tasks on motivation and performance. International Journal of Artificial Intelligence in Education, 1–22.
Terho, Harri, Anna Salonen, and Meri Yrjänen. 2022a. Toward a contextualized understanding of inside sales: the role of sales development in effective lead funnel management. Journal of Business & Industrial Marketing 38, 2: 337–352.
Terho, Harri, Anna Salonen, and Meri Yrjänen. 2022b. Toward a contextualized understanding of inside sales: the role of sales development in effective lead funnel management. Journal of Business and Industrial Marketing 38.
Yang, An, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, and Haoran Wei. 2024. Qwen2.5 technical report. arXiv:2412.15115.
Zhang, Saizheng, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018, July. Personalizing dialogue agents: I have a dog, do you have pets too? In I. Gurevych and Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2204–2213. Association for Computational Linguistics.
Zhao, Zheng, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B Cohen, and Emine Yilmaz. 2025. Personalens: A benchmark for personalization evaluation in conversational ai assistants. Findings of the Association for Computational Linguistics: ACL 2025, 18023–18055.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
