Submitted:
12 February 2026
Posted:
24 February 2026
Abstract
Keywords:
1. Introduction
1.1. The Consistency Paradox in Organizational Ethics
- (a) Cognitive bias (identifiable victim effect undermining impartial analysis), or
- (b) Moral sensitivity (appropriate recognition of individual dignity versus statistical abstraction)?
1.2. The Empirical-Normative Gap
- Allowing stakeholder identifiability to alter recommendations violates impartiality (treating like cases alike)
- Action/omission distinctions represent irrational omission bias rather than morally significant differences
- Temporal proximity effects reflect present bias and hyperbolic discounting
- Relationship-based reasoning introduces arbitrary favoritism
- Responding differently to named individuals versus statistical abstractions reflects appropriate attention to persons rather than irrational bias
- Action/omission distinctions honor genuine moral differences in agency and causation
- Temporal considerations incorporate epistemic uncertainty about distant consequences
- Relationships create legitimate special obligations that override impartial calculations
1.3. The AI Parallel: Do Machines Exhibit Human Patterns?
- (a) AI systems have learned human biases (problematic), or
- (b) AI systems have learned morally appropriate contextual sensitivity (desirable), or
- (c) The patterns reflect computational artifacts independent of moral reasoning
1.4. Research Gaps and Contributions
- Gap 1: Realistic organizational contexts
- Gap 2: Systematic variation measurement
- Gap 3: Human-AI comparison under matched conditions
- Realistic organizational scenarios (n=240): We developed ethical dilemmas grounded in actual organizational contexts (layoffs, resource allocation, stakeholder conflicts) across five moral domains, validated by practicing leaders (n=12 pilot participants).
- Systematic variation design: Each base scenario appears in four variants manipulating identifiability, action/omission framing, temporal proximity, and relational context, enabling within-scenario comparisons that control for content while isolating contextual features.
- Direct human-AI comparison (300 humans × 20 scenarios; 3 AI models × 240 scenarios × 10 repetitions): Matched conditions enable quantification of whether and how AI patterns diverge from human moral reasoning.
- Three-component decomposition: We separate (a) structural consistency (agreement when only irrelevant features vary), (b) contextual responsiveness (variation attributable to debatable features), and (c) arbitrary residual variation, enabling precise quantification of the contested normative territory.
1.5. Theoretical Positioning and Scope
2. Methods
2.1. Research Design Overview
- Within-subjects design: Each participant sees only one variant of each base scenario, but population-level analysis compares responses across variants
- Mixed assignment: Participants randomly assigned to scenario variants; AI models respond to all scenarios
- Repeated measures: AI models provide 10 independent responses per scenario (temperature-based sampling); humans provide single responses
- Matched coding: Human and AI responses coded identically using detailed rubric (see §2.7)
- For H2 (OR = 1.5, α = .05, power = .80): n = 276 required
- For H3 (d = 0.4, α = .05, power = .80): n = 264 required
- For H4 (indirect effect β = 0.04, α = .05, power = .80): n = 284 required
- Target n = 300 provides >80% power for all predicted effects (achieved power: 82-88%)
2.2. Scenario Development
2.2.1. Base Scenario Construction
- Harm Prevention (3 scenarios): Consumer safety, workplace safety, environmental harm
- Fairness/Justice (3 scenarios): Compensation equity, resource allocation, procedural fairness
- Autonomy/Rights (3 scenarios): Privacy, informed consent, intellectual property
- Promise-Keeping/Loyalty (3 scenarios): Contracts, commitments, stakeholder obligations
- Honesty/Transparency (3 scenarios): Disclosure, advertising claims, research integrity
- Expert consultation (n=12): Practicing managers and ethics consultants identified common organizational dilemmas
- Literature review: Scenarios based on documented cases in business ethics literature
- Pilot testing (n=25): Initial scenarios tested for clarity, realism, difficulty balance
- Refinement: Revised based on pilot feedback, ensuring:
  - Realistic organizational context
  - Genuine ethical tension (no obviously "correct" answer)
  - Comparable difficulty across scenarios
  - Two clear decision options
2.2.2. Systematic Feature Variation
- Low (Statistical): "50 employees in manufacturing division"
- High (Named/Identified): "Maria Rodriguez (12-year employee, single mother of three) and 49 manufacturing colleagues"
- Low (Distant): "Contractors at outsourced facility," "International customers"
- High (Direct/Close): "Direct employees at headquarters you work with daily," "Local community members"
- Low (Delayed): "Long-term strategic positioning," "Impact over next 5 years"
- High (Immediate): "Immediate quarterly results," "Impact this quarter"
- Low (Transactional): "Recently hired contractors (avg 8 months tenure)"
- High (Long-term/Relational): "Long-term employees (avg 9 years tenure, consistently exceeded expectations)"
- Action frame: "Implement layoffs" vs. "Maintain current workforce"
- Omission frame: "Allow position eliminations" vs. "Intervene to prevent layoffs"
- 15 base scenarios × 2^4 feature combinations (4 features; frame separately manipulated) = 240 unique scenario variants
- Each scenario presents identical ethical trade-off with only contextual framing varied
- Complete scenario set provided in Appendix A
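The variant count above follows from a full factorial crossing. A minimal sketch, assuming the four binary features are identifiability, proximity, temporal proximity, and relational context; the dictionary keys are illustrative names, the ordering need not match Appendix A, and the separately manipulated action/omission frame is not crossed here:

```python
from itertools import product

# Illustrative feature names (assumed labels, not taken from Appendix A).
FEATURES = ("identifiability", "proximity", "temporal", "relational")
N_BASE_SCENARIOS = 15


def enumerate_variants(n_base=N_BASE_SCENARIOS, features=FEATURES):
    """Cross every base scenario with every low/high (0/1) feature combination."""
    variants = []
    for base in range(1, n_base + 1):
        for levels in product((0, 1), repeat=len(features)):
            variants.append({"base": base, **dict(zip(features, levels))})
    return variants


variants = enumerate_variants()
print(len(variants))  # 15 * 2**4 = 240
```

Each dictionary describes one variant by its base scenario and its four 0/1 feature levels, which is the same coding used later for the per-participant feature vectors.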
- Magnitude of outcomes (number affected)
- Probability of consequences
- Severity of harm/benefit
- Legal/regulatory requirements
2.3. Participants
2.3.1. Human Participants
- Current organizational role with decision-making authority (manager-level or above)
- Minimum 2 years professional experience
- Fluent English (reading comprehension)
- Sample Flow
- Critical Distinction:
- Incomprehensible text (failed readability)
- Systematic copy-paste (failed engagement)
- Comprehension check failure (failed attention)
- High variation scores (>3 SD from mean: M=0.78 vs. 0.42 overall)
- Retained to avoid post-hoc manipulation of findings
- Could represent either:
  - Genuine individual differences (high context-sensitivity or inconsistency)
  - Lower-quality responses not caught by initial screening
- Shorter response times: M=31 min vs. 47 min (t=4.2, p<.001)
- Lower word counts: M=142 words vs. 287 words (t=5.1, p<.001)
- Higher arbitrary variation: AV M=0.59 vs. 0.31 (by definition >3 SD)
- Did not differ on demographics (age, education, experience: all p>.10)
- Identifiability OR: 2.08 (full) vs. 2.04 (n=289), difference=0.04
- Relational mediation β: 0.043 (full) vs. 0.041 (n=289), difference=0.002
- All effects remain p<.001, FDR q<.001
| Characteristic | Distribution |
| Age | M=38.4 years (SD=9.2, range 25-64) |
| Gender | 52% female, 47% male, 1% non-binary |
| Education | 67% graduate degree, 28% bachelor's, 5% some college |
| Country | 61% USA, 28% other Western, 11% non-Western |
| Industry | 23% technology, 18% healthcare, 15% finance, 44% other |
| Role level | 42% middle management, 34% senior management, 24% executive |
| Ethics training | 31% formal ethics coursework, 69% informal/none |
- Total responses: 6,000 (300 participants × 20 scenarios)
- Each of 240 scenarios viewed by M=25 participants (SD=4.3, range 18-32)
- Balanced assignment ensured via stratified randomization (see §2.4.1)
2.3.2. AI Models
- GPT-4 (gpt-4-0125-preview, January 2024 snapshot)
  - OpenAI API, accessed September 15-30, 2024
- Claude 3 Opus (claude-3-opus-20240229, February 2024 snapshot)
  - Anthropic API, accessed October 1-15, 2024
- Gemini 1.5 Pro (gemini-1.5-pro, current as of September 2024)
  - Google AI API, accessed October 16-31, 2024
- Each model responded to all 240 scenarios
- 10 independent samples per scenario per model (different random seeds)
- Total: 7,200 responses (240 scenarios × 3 models × 10 repetitions)
- Same procedure repeated at T=0.3, T=0.5, T=1.0
- Each temperature: 7,200 additional responses (240 × 3 × 10)
- Grand total across all temperatures: 28,800 AI responses
- All analyses in main text use T=0.7 unless otherwise specified
- Temperature sensitivity reported in §3.9.1 and Appendix D.3
- All 7,200 primary responses reviewed for coherence
- 97.2% addressed scenario directly at T=0.7
- No responses contained refusals like "I cannot make ethical decisions"
- Response length: M=287 words (SD=94, range 127-612)
- GPT-4: $412 (T=0.7) + $1,236 (other temperatures)
- Claude: $298 (T=0.7) + $894 (other temperatures)
- Gemini: $137 (T=0.7) + $411 (other temperatures)
2.3.3. Temperature Parameter Selection and Implications
- Typical deployment parameter: Commercial AI systems commonly use temperature 0.7-1.0 for open-ended reasoning tasks (OpenAI, 2024; Anthropic, 2024)
- Adequate coherence: Pilot testing (100 scenarios, 10 repetitions each, September 2024) revealed:
  - T=0.7: 97.2% responses logically coherent and on-topic
  - T=1.0: 91.6% coherent (degraded)
  - T=0.3: 99.6% coherent but highly repetitive
- Human-like variation (POST-HOC OBSERVATION): T=0.7 produces AI variation levels similar to observed human variation
| Temperature | Mean Total Variation | Structural Consistency | Coherence Rate |
| 0.3 | 0.26 | 0.92 | 99.6% |
| 0.5 | 0.36 | 0.87 | 98.8% |
| 0.7 | 0.41 | 0.87 | 97.2% |
| 1.0 | 0.49 | 0.82 | 91.6% |
| Human | 0.42 | 0.86 | N/A |
- Temperature was selected post-hoc to match human variation levels
- At T=0.3, AI shows less variation than humans (0.26 vs. 0.42, p<.001)
- At T=1.0, AI shows more variation than humans (0.49 vs. 0.42, p<.001)
- Human-AI similarity at T=0.7 (p=.56) is a calibration result, not a finding
- ✓ Contextual feature effects are robust across temperatures (see §3.9.1)
- ✓ AI systems can be calibrated to approximate human variation profiles
- ✓ Temperature is a design choice embedding implicit assumptions about desired reasoning patterns
- ✗ AI reasoning is fundamentally similar to human reasoning
- ✗ AI would exhibit human-like patterns at "default" or "optimal" settings
- ✗ Human-AI convergence is independent of parameter selection
- Match typical deployment conditions (external validity)
- Enable meaningful human-AI comparison at matched variation levels
- Avoid excessive noise from high temperature or repetitiveness from low temperature
2.4. Procedure
- Human Participant Procedure
- Informed consent (5 min)
- Demographics questionnaire (3 min)
- Training scenarios (10 min): Two practice scenarios with example responses
- Main task (40-50 min): 20 scenarios presented sequentially
- Debrief (2 min)
- 240 scenarios divided into 12 blocks of 20 scenarios each
- Each block contained:
  - Balanced representation of 15 base scenarios
  - Balanced distribution of feature combinations
  - No more than 2 scenarios from same domain consecutively
- Participants randomly assigned to blocks (n≈25 per block)
- Within blocks, scenario order randomized per participant
- Each scenario viewed by: M=25 participants (SD=4.3, range 18-32)
- Coverage: All 240 scenarios viewed ≥18 times
- No systematic bias in scenario-participant assignment (χ²=12.4, df=239, p=1.00)
- Median completion time: 47 minutes (IQR: 38-56 min) for entire session
- Mean time per scenario: 58.8 seconds (SD=24.3)
- No time limits imposed
- 97% completed in single session
- Minimum response length: 25 words (flagged for review if shorter)
  - 47 flagged, 42 judged adequate upon review, 5 excluded
- Identical responses: >5 scenarios with copy-paste detected
  - 3 participants excluded
- Comprehension check: Embedded in scenario 10
  - 96.7% passed (292/302), 8 failures excluded
- Total excluded: n=16 (5 + 3 + 8)
- After primary analysis, 11 participants identified with extreme variation (>3 SD)
- Deliberately retained for all main analyses
- May represent genuine individual differences vs. random responding
- Sensitivity analyses (§3.9.3) examined robustness with/without outliers
2.5. Response Coding Framework
- Human: 400 responses from 133 participants (3 responses each, randomly selected)
  - Represents 6.7% of 6,000 human responses
  - 44.3% of 300 participants
- AI: 400 responses at T=0.7 (GPT-4 n=133, Claude n=133, Gemini n=134)
  - Represents 5.6% of 7,200 AI responses at T=0.7
- Coverage: All 15 base scenarios, all 5 moral domains, range of response lengths
- Binary decision choice: Extracted automatically from structured response format
- Response length (word count): Automated
- Response time: Logged automatically for humans
Primary Coding Dimensions
- Option A
- Option B
- Alternative (participant-suggested option differing from A or B)
- Unclear/No recommendation
- Coding method: Automated extraction from structured response
- Sample: All 6,000 human + 7,200 AI responses
- Utilitarian/Consequentialist
- Deontological
- Care Ethics
- Rights-Based
- Virtue Ethics
- Justice/Fairness
- Pragmatic/Practical
- Religious/Spiritual
- Legal/Compliance
- Other/Unclear
- Coding method: Automated keyword detection (validated on subsample)
- Sample: All responses (6,000 human + 7,200 AI)
- Validation: 87.3% agreement with manual coding on reliability subsample
- Level 0: Single framework only, no acknowledgment of alternatives (pure principlist or pure care ethicist)
- Level 1: Multiple frameworks mentioned, but one clearly dominant (other frameworks acknowledged but not integrated: "While X matters, Y is decisive")
- Level 2: Genuine attempt to integrate multiple frameworks (balanced consideration: "Both X and Y matter, and here's how I weigh them")
- Level 3: Explicit acknowledgment of framework tensions/incommensurability (sophisticated recognition: "X and Y conflict here, and there's no clean resolution")
- Level 4: Expert synthesis producing coherent unified position (transcends individual frameworks: "By considering X and Y together, we arrive at Z principle that honors both")
- Sample: Reliability subsample (800 responses)
- Level 0: No relational reasoning
  - Relationships not mentioned or only descriptively noted
  - Pure principle or outcome focus
  - Example: "Based on cost-benefit analysis, Option A is superior"
- Level 1: Minimal relational awareness
  - Brief acknowledgment, not integrated into decision logic
  - Example: "These are long-term employees, but economic analysis favors Option A"
- Level 2: Relational considerations present
  - One factor among several, some weight given
  - Example: "While cost favors A, the relationship with these employees is also relevant"
- Level 3: Relational reasoning integrated
  - Systematically considered across decision framework
  - Example: "Our relationship creates special obligations that modify the utilitarian calculus"
- Level 4: Relational reasoning central
  - Drives decision logic, extensive discussion
  - Example: "The trust built over years creates duties that outweigh short-term cost savings"
- Level 5: Sophisticated relational framework
  - Expert care ethics application
  - Nuanced discussion of relationship types
  - Addresses limits of relational obligations
  - Example: "While special obligations arise from this employment relationship, these must be balanced against obligations to other stakeholders and broader justice considerations"
- Levels 0-3: 95.3% of responses
- Levels 4-5: 4.7% of responses
2.6. Derived Metrics
- Structural Inconsistency (SI): Variation when only irrelevant presentation features differ (wording, order, format)
- Contextual Responsiveness (CR): Variation systematically attributable to debatable contextual features (identifiability, proximity, temporal, relational)
- Arbitrary Variation (AV): Residual variation unexplained by measured factors
- Identify scenario pairs differing only in surface features (e.g., "50 employees" vs. "fifty staff members")
- Calculate proportion of identical decisions across such pairs
- SC = Agreement rate across irrelevant feature variations (0-1 scale)
- Magnitude differences (number of people affected)
- Probability differences (certain vs. uncertain outcomes)
- Harm type differences (physical, economic, psychological)
- Legal/regulatory requirements
| Method | Human CR | AI CR (T=0.7) |
| R² difference (primary) | 0.224 | 0.237 |
| Variance components | 0.242 | 0.251 |
| Difference | 0.018 | 0.014 |
- More intuitive interpretation (proportion of variance explained)
- Directly controls for clearly relevant features
- Consistent with mediation analysis framework
- Identifiability: Low (0) or High (1)
- Proximity: Low (0) or High (1)
- Temporal: Low (0) or High (1)
- Relational: Low (0) or High (1)
- S005 coded as: [ID=1, Prox=0, Temp=1, Rel=0]
- S012 coded as: [ID=0, Prox=1, Temp=0, Rel=1]
- ... [continues for all 20]
- i indexes the 20 scenarios participant k saw
- Decision_i = 1 if participant chose stakeholder-favorable option, 0 otherwise
- β₀k...β₄k are participant-specific coefficients estimated via maximum likelihood
- If all decisions identical: SC_k = 1.0, CR_k = 0, AV_k = 0 (perfect consistency)
- If model doesn't converge but >2 decision changes: use penalized logistic regression (ridge penalty λ=0.1)
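The definitions above imply a per-participant logistic model. The display equation itself is not preserved in this text, so the following is a reconstruction from those definitions and from the additive form β₁(ID) + β₂(Prox) + ... noted among the limitations. The assignment of β₃ to Temp and β₄ to Rel is an assumption based on the feature order used in the coding examples.

```latex
\operatorname{logit} P(\mathrm{Decision}_{i} = 1)
  = \beta_{0k}
  + \beta_{1k}\,\mathrm{ID}_{i}
  + \beta_{2k}\,\mathrm{Prox}_{i}
  + \beta_{3k}\,\mathrm{Temp}_{i}
  + \beta_{4k}\,\mathrm{Rel}_{i},
  \qquad i = 1, \dots, 20
```

Under this form, the fitted values p̂ for participant k are the inverse-logit of the right-hand side evaluated at each scenario's 0/1 feature codes.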
- S005 [ID=1, Prox=0, Temp=1, Rel=0]: p̂ = 0.73
- S012 [ID=0, Prox=1, Temp=0, Rel=1]: p̂ = 0.64
- ... [20 total predictions]
- Var(p̂) = variance of predicted probabilities across k's 20 scenarios
- Var(y) = variance of actual binary decisions (0/1) across k's 20 scenarios
- CR_k represents the proportion of participant k's decision variance that is systematically predictable from the four debatable features
- High CR_k: Participant's decisions strongly track feature levels
- Low CR_k: Participant's decisions don't vary systematically with features
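The variance-ratio definition of CR_k can be sketched in a few lines. This is a minimal illustration, assuming the predicted probabilities have already been obtained from the participant-level model; the function name and the toy inputs are hypothetical and are not data from the study.

```python
from statistics import pvariance


def contextual_responsiveness(p_hat, decisions):
    """CR_k = Var(p_hat) / Var(y) across one participant's scenarios.

    Population variances are used for both terms; the ratio is the same
    with sample variances because the normalizing constants cancel.
    """
    if len(p_hat) != len(decisions):
        raise ValueError("need one predicted probability per decision")
    var_y = pvariance(decisions)
    if var_y == 0:
        # All decisions identical: CR_k is set to 0, as in the edge case above.
        return 0.0
    return pvariance(p_hat) / var_y


# Hypothetical toy inputs (four scenarios instead of 20).
toy_cr = contextual_responsiveness([0.7, 0.3, 0.7, 0.3], [1, 0, 1, 0])
print(round(toy_cr, 2))  # 0.16
```

A participant whose predicted probabilities barely move across feature codes gets a CR_k near 0, while one whose predictions track the observed 0/1 decisions closely gets a value approaching 1.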
- After completing main scenarios, each participant saw 4 additional "consistency check" pairs
- Each pair presented identical ethical situations with only surface variations:
  - Wording changes ("50 employees" vs. "fifty staff members")
  - Sentence order permuted
  - Numerical vs. written numbers
- SC_k = proportion of identical decisions across these 4 pairs (0.00, 0.25, 0.50, 0.75, or 1.00)
- CR requires feature variation (must see scenarios with different ID, Prox, Temp, Rel)
- SC requires feature constancy (must see scenarios where features are identical)
- Cannot measure both from same scenario set
- SI_k + CR_k + AV_k = 1.0
- Or: (1 - SC_k) + CR_k + AV_k = 1.0
- SC = 0.75 (consistent on 3/4 surface variation pairs)
- CR = 0.28 (predicted probabilities explain 28% of decision variance)
- SI = 1 - 0.75 = 0.25
- AV = 1 - 0.25 - 0.28 = 0.47
- Check: 0.25 + 0.28 + 0.47 = 1.00 ✓
- 25% of P042's variation is structural inconsistency (unreliability)
- 28% is systematic contextual responsiveness (feature sensitivity)
- 47% is arbitrary (unexplained by features or structural factors)
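The worked example for P042 can be reproduced with a short sketch. It assumes SC_k is the share of matching consistency-check pairs and that CR_k is supplied from the participant-level model; the function name is illustrative.

```python
def decompose(pair_matches, cr):
    """Return (SC, SI, CR, AV) using SI = 1 - SC and SI + CR + AV = 1."""
    sc = sum(pair_matches) / len(pair_matches)  # share of identical decisions
    si = 1.0 - sc
    av = 1.0 - si - cr  # residual after structural and contextual components
    return sc, si, cr, av


# P042: consistent on 3 of 4 surface-variation pairs, CR = 0.28.
sc, si, cr, av = decompose([True, True, True, False], cr=0.28)
print(round(sc, 2), round(si, 2), round(cr, 2), round(av, 2))  # 0.75 0.25 0.28 0.47
```

Because AV is defined as the remainder, the three components sum to 1 by construction, so the check in the example confirms the arithmetic rather than testing the model.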
- Mean SC across participants: M=0.845 (SD=0.093)
- Mean CR across participants: M=0.274 (SD=0.074)
- Mean AV across participants: M=0.329 (SD=0.087)
- Face validity: Participants with more variable decisions show higher CR+AV (r=0.87, p<.001)
- Construct validity: CR correlates with relational reasoning strength (r=0.41, p<.001) as predicted by H4
- Test-retest reliability (n=47 participants retested after 2 weeks):
  - SC: r=0.76
  - CR: r=0.68
  - AV: r=0.71
  - All adequate for individual-difference measures
- Independence check: SC and CR show low correlation (r=0.12, p=.04), confirming they capture distinct constructs
- Small n per person: Only 20 scenarios per participant means:
  - Individual β estimates are noisy
  - Participants with extreme response patterns (all same decision) cannot be modeled
  - Low statistical power for within-person effects
- Scenario coverage: Each participant sees different scenarios
  - Cannot directly compare "did Person A weight identifiability more than Person B?"
  - Can only ask "did Person A's decisions vary with identifiability within their scenario set?"
- Binary outcomes: Logistic regression with binary dependent variable
  - Predicted probabilities may not fully capture decision uncertainty
  - Some participants near decision threshold may appear inconsistent
- Assumption: Linear-additive feature effects
  - Model assumes features combine additively: β₁(ID) + β₂(Prox) + ...
  - If participants integrate features multiplicatively or non-linearly, model will show poor fit (high AV)
- ✓ Preserves individual differences (each person gets own SC/CR/AV)
- ✓ Accounts for scenario content (controls for which scenarios person saw)
- ✓ Provides interpretable metrics (proportions of variance)
- ✓ Enables clustering and mediation analyses at individual level
- Each participant saw 20 randomly-selected scenarios
- SC, CR, AV calculated from variation across these 20 scenarios
- One profile per participant
- Each unit represents 10 independent responses to the same scenario variant
- SC, CR, AV calculated using the following method:
- Variation A: "50 employees" → "Fifty staff members"
- Variation B: "will lose jobs" → "will be laid off"
- Variation C: Sentence order permuted
- 10 total surface variants per scenario
- GPT-4 on Scenario 087: 9/10 surface variants → same decision
- SC = 0.90
- Surface feature variation (SC component)
- Feature-based prediction (CR component)
- ✓ Parallels human calculation structure
- ✓ Uses within-scenario variation (10 reps) plus cross-scenario comparison (feature sensitivity)
- ✓ Provides stable estimates despite limited data per unit
- ✓ Allows detection of scenario-specific deviations from typical model behavior
- Clustering uses all 720 units together (large N compensates)
- k-means is robust to noise in individual observations
- Cross-validation confirms 89% stable cluster membership
2.7. Inter-Rater Reliability
- Sampling: 3 responses randomly selected per participant
- Coverage: 6.7% of 6,000 total human responses
- Coverage: 44.3% of 300 human participants
- GPT-4: n=133
- Claude: n=133
- Gemini: n=134
- Coverage: 5.6% of 7,200 AI responses at primary temperature
- Percentage calculation denominator: 13,200 (6,000 human + 7,200 AI at T=0.7)
- Overall coverage: 800 / 13,200 = 6.1%
- We selected equal subsample sizes (400 each) for balanced comparison
- But total samples differ (6,000 vs. 7,200)
- Equal n ensures comparable statistical power for human-AI contrasts
- All 15 base scenarios (each represented)
- All 5 moral domains (proportional coverage)
- Range of complexity levels (simple, moderate, complex)
- Range of response lengths (quartiles of word count)
- Range of decision confidence (1-7 scale quartiles)
| Metric | Full Sample (N=300) | Subsample (N=133) | Difference |
| Mean Age | 38.4 | 37.9 | t=0.42, p=.67 |
| % Female | 52% | 54% | χ²=0.24, p=.62 |
| Mean CR | 0.274 | 0.268 | t=0.51, p=.61 |
| Mean AV | 0.329 | 0.334 | t=-0.38, p=.70 |
- Framework Integration Level (0-4 scale)
- Stakeholder Consideration Depth (1-5 scale)
- Relational Reasoning Strength (0-5 scale)
- All other measures computed on full sample
- Smaller effective sample size (N=133 participants, not 300)
- Adequate power for predicted effects (>80% for β≥0.04)
- Results generalize to full sample (based on representativeness checks)
2.7.1. Reliability for Primary Coding Dimensions
| Dimension | Reliability Metric | Value | 95% CI | Interpretation |
| Decision Recommendation | Cohen's κ | 0.94 | [0.91, 0.97] | Excellent |
| Primary Framework | Fleiss' κ (10 categories) | 0.81 | [0.76, 0.86] | Excellent |
| Framework Integration | Weighted κ (ordinal) | 0.79 | [0.73, 0.85] | Good |
| Stakeholder Depth | Weighted κ (ordinal) | 0.76 | [0.69, 0.83] | Acceptable |
| Relational Reasoning Strength | Weighted κ (ordinal) | 0.81 | [0.76, 0.86] | Excellent |
| Relational Language Density | ICC(2,2) | 0.88 | [0.84, 0.92] | Excellent |
| Stakeholder Rankings | Spearman's ρ (agreement) | 0.83 | [0.78, 0.88] | Excellent |
2.7.2. Reliability for Derived Metrics
| Metric | ICC(2,2) | 95% CI | Interpretation |
| Overall Variation Score | 0.89 | [0.85, 0.92] | Excellent |
| Structural Consistency | 0.91 | [0.88, 0.94] | Excellent |
| Contextual Responsiveness | 0.82 | [0.77, 0.87] | Good |
| Arbitrary Variation | 0.76 | [0.70, 0.82] | Acceptable |
- Disagreements (n=146, 18.3% of 800 coded responses) were resolved through discussion
- When coders could not reach consensus (n=22, 2.8%), the lead author made final determination
2.7.3. Test-Retest Reliability
- Odd-numbered scenarios (scenarios 1, 3, 5, ..., 19)
- Even-numbered scenarios (scenarios 2, 4, 6, ..., 20)
| Feature | Split-Half r | Spearman-Brown Corrected | Interpretation |
| Identifiability | 0.67 | 0.80 | Moderate-high |
| Action/Omission | 0.58 | 0.73 | Moderate |
| Temporal | 0.52 | 0.69 | Moderate |
| Relational | 0.71 | 0.83 | High |
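The corrected column follows the standard Spearman-Brown formula for doubling test length, 2r / (1 + r). A minimal sketch, using three of the split-half correlations from the table as inputs:

```python
def spearman_brown(r_half):
    """Project a split-half correlation to full-length reliability: 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)


for feature, r in [("Identifiability", 0.67), ("Action/Omission", 0.58), ("Relational", 0.71)]:
    print(f"{feature}: {spearman_brown(r):.2f}")
```

Applied to the rounded value r = 0.52, the same formula gives about 0.68; small differences from a tabulated figure can arise when the inputs are rounded before the correction is applied.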
2.8. Analytical Approach
2.8.1. Primary Analyses
- Overall variation distributions (means, SDs, histograms)
- Three-component decomposition (SC, CR, AV)
- Mixed-effects models estimating variance components
- For human participants: "Participant" = individual participant ID (n=300)
- For AI models: "Participant" = Model × Scenario combination (n=720: 3 models × 240 scenarios), capturing variation across the 10 repeated samples within each model-scenario pair
- Random slopes for all four debatable features allow individual/model-specific sensitivity patterns
- Odds ratios for each feature
- Effect sizes (η²p) from likelihood ratio tests
- Interaction tests (Feature × Source)
- Paired t-tests for action vs. omission variants
- Agency salience as moderator
- Source (human vs. AI) as between-subjects factor
- Path a: Source → Relational Reasoning
- Path b: Relational Reasoning → Contextual Responsiveness
- Indirect effect (a×b) via bootstrap (10,000 iterations)
- Sensitivity to unmeasured confounding (Cinelli & Hazlett, 2020)
- ρ = hypothetical correlation between unmeasured confounder and both:
  - Mediator (relational reasoning)
  - Outcome (contextual responsiveness)
- Robustness Value (RV): Minimum ρ required to reduce indirect effect to zero
- RV > 0.4 indicates robustness to moderate confounding
- RV > 0.6 indicates robustness to strong confounding
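The bootstrap of the indirect effect (a×b) listed above can be illustrated with a simplified sketch. It assumes single-level ordinary least squares for both paths and a percentile interval; the text does not state the interval type or the exact model form, so both are assumptions, and the toy data in the usage lines are hypothetical.

```python
import random


def _centered_sums(x, m, y):
    """Sums of squares and cross-products around the means."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    smm = sum((b - mm) ** 2 for b in m)
    sxm = sum((a - mx) * (b - mm) for a, b in zip(x, m))
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    smy = sum((b - mm) * (c - my) for b, c in zip(m, y))
    return sxx, smm, sxm, sxy, smy


def indirect_effect(x, m, y):
    """a*b, with a from M ~ X and b the coefficient on M in Y ~ X + M."""
    sxx, smm, sxm, sxy, smy = _centered_sums(x, m, y)
    det = sxx * smm - sxm ** 2
    if det <= 1e-12 * sxx * smm:
        # X or M is constant, or the two are collinear: b is not identified.
        raise ValueError("degenerate sample")
    a = sxm / sxx
    b = (smy * sxx - sxm * sxy) / det
    return a * b


def bootstrap_ci(x, m, y, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a*b, resampling cases with replacement."""
    indirect_effect(x, m, y)  # fail fast if the full sample is degenerate
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            estimates.append(
                indirect_effect([x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx])
            )
        except ValueError:
            continue  # skip degenerate resamples
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: x = source (0/1), m = mediator, y = outcome.
x = [0, 0, 0, 1, 1, 1]
m = [1, 2, 3, 2, 3, 4]
y = [0.5 * mi + 0.2 * xi for mi, xi in zip(m, x)]
print(round(indirect_effect(x, m, y), 3))  # 0.5
```

With real data, `bootstrap_ci(x, m, y)` would return the lower and upper percentile bounds; an interval excluding zero is the usual criterion for a non-zero indirect effect.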
2.8.2. Statistical Considerations
| Family | Tests | n tests | Correction Applied |
| Family 1 (H1): Variation components | SC ≠ 0, CR ≠ 0, AV ≠ 0 | 3 | FDR within family |
| Family 2 (H2): Contextual feature main effects | Each of 4 features (ID, Prox, Temp, Rel) | 4 | FDR within family |
| Family 3 (H2): Source × Feature interactions | 4 interactions | 4 | FDR within family |
| Family 4 (H3): Action-omission effects | Main effect, Agency moderation | 2 | FDR within family |
| Family 5 (H4): Mediation paths | Paths a, b, c, c', indirect (a×b) | 5 | FDR within family |
- ANOVA: 1 omnibus test
- Post-hoc comparisons: 10 pairwise contrasts (5 domains)
- FDR applied across 11 tests within section
- Reported as: "(Exploratory; FDR q-values reported)"
- Framework × Feature interactions: 12 tests (3 frameworks × 4 features)
- FDR applied across 12 tests
- Reported as: "(Exploratory; FDR q<.001 within framework analysis)"
- Omnibus χ²: 1 test
- Post-hoc contrasts: 6 pairwise comparisons (4 clusters)
- FDR applied across 7 tests
- Reported as: "(Exploratory; cluster comparison FDR q-values reported)"
- Multiple regression: 12 predictors per DV
- 3 DVs (SC, CR, AV) = 36 tests total
- FDR applied across all 36 tests
- Reported as: "(Supplementary analysis; FDR correction across all demographic tests)"
- Pre-registered tests: Report as "p<.001, FDR-corrected q<.001"
- Exploratory tests: Report as "p<.001, exploratory analysis, FDR q<.001"
- Appendix tests: Report as "p<.001, FDR q<.001 within analysis section"
- Whether result survives correction
- Which correction family was applied
- ✗ "p<.001" (ambiguous about correction)
- ✓ "p<.001, FDR-corrected q<.001" (clear correction applied)
- ✓ "p=.023, FDR-corrected q=.041" (shows both raw and adjusted)
- Pre-registered families: Correspond to hypotheses (H1-H5)
- Exploratory families: Group conceptually related tests
- Within-family correction: More powerful than Bonferroni across all tests
- Across-family independence: Avoids over-correction for unrelated questions
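Within-family FDR correction can be sketched as follows. The text specifies FDR correction but does not name the procedure, so this example assumes the Benjamini-Hochberg step-up method; the family labels and p-values are hypothetical placeholders, not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


def correct_within_families(families):
    """Apply the correction separately to each family of tests."""
    return {name: benjamini_hochberg(ps) for name, ps in families.items()}


# Hypothetical raw p-values grouped by family.
example = correct_within_families({
    "family_2_feature_main_effects": [0.01, 0.04, 0.03, 0.005],
    "family_4_action_omission": [0.02, 0.20],
})
for name, q_values in example.items():
    print(name, [round(q, 3) for q in q_values])
```

Because each family is corrected on its own, a q-value depends only on the number of tests in that family, which is why this is less conservative than a single correction across every test in the paper.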
2.8.3. Exploratory Analyses
- Each participant contributes one profile (SC/CR/AV scores aggregated across their 20 scenarios)
- Example: Participant P001 saw scenarios 5, 12, 18, ..., 203 → one set of SC/CR/AV values
- Each model-scenario pair contributes one profile (SC/CR/AV scores aggregated across 10 repetitions)
- Example: GPT-4 responding to Scenario 1 (across 10 independent samples) → one set of SC/CR/AV values
- NOT 7,200 individual AI responses; we average across the 10 repetitions per model-scenario
- One specific model (GPT-4, Claude, or Gemini)
- One specific scenario (S001-S240)
- Averaged across 10 repetitions at T=0.7 (to reduce sampling noise)
- GPT-4: 240 scenarios × 1 profile each = 240 units
- Claude: 240 scenarios × 1 profile each = 240 units
- Gemini: 240 scenarios × 1 profile each = 240 units
- Total AI: 3 models × 240 scenarios = 720 units
- SC (structural consistency): 0-1 scale
- CR (contextual responsiveness): 0-1 scale
- AV (arbitrary variation): 0-1 scale
- Elbow method (within-cluster sum of squares): Clear inflection at k=4
- Silhouette coefficient maximization: Peak at k=4 (width=0.64)
- Gap statistic (Tibshirani et al., 2001): k=4 optimal
- Within-cluster homogeneity: Average silhouette width = 0.64 (good)
- Between-cluster separation: Dunn index = 0.58 (acceptable)
- Stability: 10-fold cross-validation showed 89% stable cluster membership
3. Results
3.1. Decomposition of Moral Variation (H1)
3.1.1. Overall Variation Patterns
| Agreement Level | Human % | AI % |
| Same decision across all 4 variants | 23.1% | 24.7% |
| Same decision for 3 of 4 variants | 33.8% | 35.2% |
| Same decision for 2 of 4 variants | 31.4% | 29.6% |
| Different decision for each variant | 11.7% | 10.5% |
| Source | Mean | SD | Median | IQR | Range |
| Human | 0.42 | 0.16 | 0.43 | [0.31, 0.52] | [0.08, 0.79] |
| AI (GPT-4) | 0.41 | 0.14 | 0.40 | [0.32, 0.49] | [0.12, 0.71] |
| AI (Claude) | 0.40 | 0.13 | 0.39 | [0.31, 0.48] | [0.14, 0.68] |
| AI (Gemini) | 0.42 | 0.15 | 0.41 | [0.33, 0.51] | [0.11, 0.73] |
- Normality: Shapiro-Wilk test rejected normality for human data (W=0.984, p=.002) due to slight negative skew; robust analyses (bootstrap 95% CIs) confirmed results unchanged
- High variability: Only 12.3% of participants scored <0.20 ("low variation"); 56.7% scored >0.40 ("high variation")
3.1.2. Structural Consistency
| Source | Mean SC | SD | 95% CI |
| Human | 0.84 | 0.12 | [0.83, 0.85] |
| AI (aggregate) | 0.87 | 0.09 | [0.84, 0.90] |
- Human difference: t(299) = 34.2, p < .001, d = 2.04, FDR-corrected q < .001
- AI difference: t(2) = 8.9, p = .012, d = 4.18, FDR-corrected q = .024
3.1.3. Contextual Responsiveness
| Source | R²marginal (fixed effects) | R²conditional (total) | CR (see calculation below) |
| Human | 0.22 | 0.45 | 0.22 |
| AI (aggregate) | 0.24 | 0.43 | 0.24 |
- Model 1 (Full): R²marginal = 0.45 (includes debatable + clearly relevant features)
- Model 2 (Reduced): R²marginal = 0.23 (includes only clearly relevant features)
- CR = 0.45 - 0.23 = 0.22 (22% variance uniquely from debatable features)
- Model 1 (Full): R²marginal = 0.47
- Model 2 (Reduced): R²marginal = 0.23
- CR = 0.47 - 0.23 = 0.24 (24% variance uniquely from debatable features)
- Fixed effects (debatable features): Explain 22-24% of variance in decisions uniquely (beyond clearly relevant features)
- Random effects (individual differences): Explain additional variance in how people respond to debatable features
- Contextual Responsiveness (CR): The 22-24% attributable to debatable features represents the contested philosophical zone
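The CR values above are differences between the marginal R² of a full and a reduced model. A minimal sketch of that arithmetic, using the reported R² values; the function name `contextual_responsiveness` is an illustrative assumption, not part of the original analysis code:

```python
def contextual_responsiveness(r2_full: float, r2_reduced: float) -> float:
    """Return the R² gained by adding the debatable features to the reduced model."""
    if not 0.0 <= r2_reduced <= r2_full <= 1.0:
        raise ValueError("expected 0 <= r2_reduced <= r2_full <= 1")
    return r2_full - r2_reduced


# Marginal R² values as reported in the text above.
human_cr = contextual_responsiveness(r2_full=0.45, r2_reduced=0.23)
ai_cr = contextual_responsiveness(r2_full=0.47, r2_reduced=0.23)

print(f"Human CR = {human_cr:.2f}, AI CR = {ai_cr:.2f}")
```

The sketch only reproduces the subtraction; fitting the mixed-effects models that produce the R² values is outside its scope.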
| Feature | Human η²p | AI η²p | Combined η²p | Interpretation |
| Stakeholder identifiability | 0.18 | 0.17 | 0.18 | Large effect |
| Action/omission framing | 0.11 | 0.10 | 0.11 | Medium effect |
| Temporal proximity | 0.08 | 0.08 | 0.08 | Medium effect |
| Relational context | 0.14 | 0.13 | 0.14 | Large effect |
3.1.4. Arbitrary Variation
| Source | Mean SC | Mean CR | Mean AV | 95% CI (AV) | Range (AV) |
| Human | 0.845 | 0.224 | 0.621 | [0.601, 0.641] | [0.287, 0.891] |
| AI (T=0.7) | 0.870 | 0.237 | 0.633 | [0.618, 0.648] | [0.314, 0.867] |
- Scenario features (structural or debatable)
- Individual/model characteristics measured
- Domain or complexity factors
- Genuine inconsistency (random responding, decision noise)
- Unmeasured individual differences (personality traits, cognitive styles we didn't assess)
- Unmeasured contextual features (morally relevant factors not captured in our four-feature coding)
- Measurement error (reliability < 1.0 contributes noise)
- Structural inconsistency: 13-15% (how much variation comes from unreliability)
- Contextual responsiveness: 22-24% (how much from our four measured features)
- Arbitrary variation: 62-63% (how much remains unexplained)
- Unmeasured morally relevant context (particularist view): Our four features may be incomplete; additional contextual factors (stakeholder vulnerability, historical injustices, organizational culture, etc.) might systematically explain more variation if measured.
- Pure noise (principlist view): Most of this 62-63% reflects genuine inconsistency that should be eliminated through better reasoning, training, or decision procedures.
- Both (pragmatic view): Some unmeasured relevant context exists, but much variation is genuinely arbitrary.
- For every 1% of variance explained by measured contextual features, 2.6-2.8% remains unexplained
- If the unmeasured 62-63% includes additional morally relevant contextual features, there may be many more relevant dimensions than the four we measured
- Alternatively, if most of the 62-63% is noise, then contextual features account for only about one-quarter of non-noise variation (22-24% out of ~85% total non-noise)
- 85% structural consistency means decisions are consistent when irrelevant features vary (good)
- But only 22-24% systematic contextual sensitivity means measured features explain little variation
- And 62-63% arbitrary variation means most variation is unexplained (problematic)
| AV Range | Human % | AI % | Interpretation |
| < 0.30 | 8.3% | 12.1% | Exceptional consistency |
| 0.30-0.50 | 23.7% | 31.4% | Moderate noise |
| 0.50-0.70 | 41.8% | 39.2% | High noise (typical) |
| 0.70-0.90 | 23.4% | 16.1% | Very high noise |
| > 0.90 | 2.8% | 1.2% | Extreme noise |
3.1.5. Summary of H1 Findings
- Humans: 0.155 + 0.224 = 0.379 (37.9%)
- AI: 0.130 + 0.237 = 0.367 (36.7%)
- Humans: 0.621 (62.1%)
- AI: 0.633 (63.3%)
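The sums above treat structural inconsistency (1 - SC), CR, and AV as shares of a whole. The following sketch checks that arithmetic with the reported means; the function and key names are illustrative assumptions:

```python
def decompose(sc: float, cr: float, av: float) -> dict:
    """Combine the three reported components and check that they sum to about 1."""
    structural_inconsistency = 1.0 - sc
    accounted_for = structural_inconsistency + cr
    return {
        "structural_inconsistency": structural_inconsistency,
        "accounted_for": accounted_for,   # (1 - SC) + CR
        "total": accounted_for + av,      # should be close to 1.0
        "av_to_cr": av / cr,              # unexplained variance per unit of CR
    }


# Mean SC, CR, and AV as reported in the arbitrary-variation table.
human = decompose(sc=0.845, cr=0.224, av=0.621)
ai = decompose(sc=0.870, cr=0.237, av=0.633)

for label, parts in (("Human", human), ("AI", ai)):
    print(label, {key: round(value, 3) for key, value in parts.items()})
```

With these inputs the components sum to 1.0 for both sources, and the AV-to-CR ratio is roughly 2.8 for humans and 2.7 for AI.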
- Is the 22-24% contextual responsiveness (CR component):
  - Bias requiring elimination (principlism)?
  - Appropriate moral sensitivity to relevant contextual details (particularism)?
  - → Data cannot answer this normative question
- Is the 62-63% arbitrary variation (AV component):
  - Unmeasured morally relevant contextual features (suggesting our four features are incomplete)?
  - Pure noise and genuine inconsistency (suggesting moral reasoning is highly unreliable)?
  - Both (some additional relevant features + substantial noise)?
  - → Data provide some constraints (cluster analysis suggests ~30% may be achievable floor) but cannot fully resolve
- Many morally relevant contextual features exist beyond the four we measured (explaining some of the 62-63%)
- OR moral reasoning is highly inconsistent even when accounting for context (problematic for particularists too)
- Target high structural consistency (SC > 0.85) ✓ Most participants achieve this
- Calibrate contextual responsiveness (CR ≈ 0.20-0.30) ✓ Most participants in this range
- Minimize arbitrary variation (AV < 0.40) ✗ Only 32% of participants achieve this
3.2. Specific Contextual Feature Effects (H2)
3.2.1. Stakeholder Identifiability Effect (H2a)
| Source | OR | 95% CI | p | d (effect size) |
| Human | 2.12 | [1.89, 2.38] | <.001 | 0.76 |
| AI | 2.04 | [1.81, 2.30] | <.001 | 0.71 |
| Combined | 2.08 | [1.91, 2.27] | <.001 | 0.73 |
- Statistical version: "50 employees in Division A will lose their jobs if Option B is chosen"→ 42% chose Option A (protecting jobs)
- Identified version: "Maria Rodriguez, a single mother of three with 12 years tenure, and 49 colleagues will lose their jobs if Option B is chosen"→ 67% chose Option A (protecting jobs)
| Domain | OR | 95% CI |
| Harm Prevention | 2.34 | [1.98, 2.77] |
| Fairness/Justice | 1.87 | [1.53, 2.28] |
| Autonomy/Rights | 2.18 | [1.76, 2.70] |
| Promise-Keeping | 1.94 | [1.56, 2.41] |
| Honesty/Transparency | 1.73 | [1.38, 2.17] |
3.2.2. Direct vs. Distant Stakeholder Effect (H2b)
| Stakeholder Type | % Ranked #1 (Human) | % Ranked #1 (AI) | Difference |
| Employees (direct) | 68.4% | 71.2% | χ²=1.9, p=.17, q=.24 |
| Customers (direct) | 61.7% | 63.8% | χ²=0.8, p=.37, q=.44 |
| Contractors (distant) | 23.1% | 20.4% | χ²=2.1, p=.15, q=.23 |
| Community (distant) | 18.9% | 17.3% | χ²=0.7, p=.40, q=.46 |
| Suppliers (distant) | 15.2% | 14.1% | χ²=0.4, p=.53, q=.58 |
- Equal distribution baseline: Each stakeholder group should receive ~25% (4 groups)
- Actual allocation to employees: 41.2% (SD=18.3%)
- Actual allocation to community: 15.7% (SD=12.1%)
3.2.3. Temporal Proximity Effect (H2c)
| Source | OR (Immediate vs. Delayed) | 95% CI | p | η²p |
| Human | 1.54 | [1.37, 1.73] | <.001 | 0.08 |
| AI | 1.49 | [1.32, 1.68] | <.001 | 0.07 |
| Combined | 1.52 | [1.39, 1.66] | <.001 | 0.08 |
- Immediate version: "50 employees will lose jobs within 30 days if Option B chosen"→ 62% chose Option A (protecting jobs)
- Delayed version: "50 employees will lose jobs over the next 2-3 years if Option B chosen"→ 44% chose Option A (protecting jobs)
- Human: k = 0.23/year (95% CI [0.19, 0.27])
- AI: k = 0.21/year (95% CI [0.17, 0.25])
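To give the k estimates a concrete reading, the sketch below assumes the standard one-parameter hyperbolic form, weight = 1 / (1 + k × delay). The text reports only the k values, so the functional form and the 2.5-year delay (the midpoint of the "2-3 years" wording in the delayed scenario) are assumptions made for illustration:

```python
def hyperbolic_weight(delay_years: float, k: float) -> float:
    """Relative weight of a delayed outcome under 1 / (1 + k * delay) (assumed form)."""
    if delay_years < 0 or k < 0:
        raise ValueError("delay_years and k must be non-negative")
    return 1.0 / (1.0 + k * delay_years)


# Reported discount rates; the 2.5-year delay is an illustrative midpoint.
human_weight = hyperbolic_weight(delay_years=2.5, k=0.23)
ai_weight = hyperbolic_weight(delay_years=2.5, k=0.21)

print(f"Human: {human_weight:.3f}, AI: {ai_weight:.3f}")
```

Under these assumptions, a harm arriving in 2.5 years carries roughly 63-66% of the weight of an immediate harm.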
3.2.4. Relational Context Effect (H2d)
| Source | OR (Relational vs. Transactional) | 95% CI | p | η²p |
| Human | 1.93 | [1.71, 2.18] | <.001 | 0.14 |
| AI | 1.84 | [1.62, 2.09] | <.001 | 0.13 |
| Combined | 1.89 | [1.72, 2.07] | <.001 | 0.14 |
| Relationship Type | OR | 95% CI | Example |
| Long-term employment (≥5 years) | 2.14 | [1.84, 2.49] | "12-year employee" |
| Loyal customer (repeat business) | 1.87 | [1.58, 2.21] | "customer since founding" |
| Trusted partner/supplier | 1.72 | [1.43, 2.07] | "strategic partner for 8 years" |
| Personal connection | 2.41 | [1.94, 3.00] | "mentored by founder" |
| Promise/commitment made | 2.08 | [1.76, 2.46] | "we committed to no layoffs" |
- Human: 64.2% of responses to relational scenarios mentioned loyalty, commitment, or obligations
- AI: 42.7% mentioned such concepts
- Difference: χ²(1) = 187.4, p < .001, FDR-corrected q < .001
3.2.5. Summary of H2 Findings
| Feature | Predicted OR | Observed OR | Status |
| Identifiability (H2a) | >1.5 | 2.08*** | ✓ Exceeded |
| Direct stakeholders (H2b) | >1.5 | 7.89*** | ✓ Exceeded |
| Temporal proximity (H2c) | >1.3 | 1.52*** | ✓ Exceeded |
| Relational context (H2d) | >1.5 | 1.89*** | ✓ Exceeded |
3.3. Action-Omission Asymmetry (H3)
3.3.1. Primary Action-Omission Comparison
- Action frame: "If we implement layoffs [active], 50 employees will lose jobs"
- Omission frame: "If we don't prevent market exit [passive], 50 employees will lose jobs"
| Framing | % Willing to Accept Harmful Option | Mean Acceptance Rating (1-7) |
| Action (active causation) | 38.4% | 3.21 (SD=1.84) |
| Omission (passive allowance) | 52.9% | 4.37 (SD=1.76) |
| Source | OR (Omission vs. Action) | 95% CI | p | d |
| Human | 1.84 | [1.64, 2.06] | <.001 | 0.61 |
| AI | 1.91 | [1.69, 2.16] | <.001 | 0.65 |
| Combined | 1.87 | [1.71, 2.05] | <.001 | 0.63 |
- Each participant saw 20 randomly selected scenarios
- Some scenarios presented in action frame, others in omission frame
- Calculate each participant's mean acceptance rate for action vs. omission scenarios
- Action frame scenarios: M = 3.21 (SD = 1.84)
- Omission frame scenarios: M = 4.37 (SD = 1.76)
- Mean difference = 4.37 - 3.21 = 1.16 points on 7-point scale
- t(299) = 18.4, p < .001
- Cohen's d = 0.63 [95% CI: 0.57, 0.69]
- Interpretation: Medium-to-large effect size
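As a rough check on the standardized difference, the sketch below computes Cohen's d from the reported means and SDs using the equal-n pooled-SD formula. This is an assumption: a paired design can use a different standardizer, so the result need not match the reported value exactly.

```python
import math


def cohens_d(mean_1: float, sd_1: float, mean_0: float, sd_0: float) -> float:
    """Cohen's d with a pooled SD, assuming equal group sizes."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_0 ** 2) / 2.0)
    return (mean_1 - mean_0) / pooled_sd


# Omission vs. action acceptance ratings (1-7 scale) as reported above.
d_omission_vs_action = cohens_d(mean_1=4.37, sd_1=1.76, mean_0=3.21, sd_0=1.84)

print(f"d = {d_omission_vs_action:.2f}")
```

The pooled-SD estimate comes out at about 0.64, close to the reported d = 0.63.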
- Different participants saw different scenario variants
- Some participants saw more action frames, others more omission frames
- Model accounts for random effects (participant, scenario clustering)
- Action frame: 38.4% accept harmful option
- Omission frame: 52.9% accept harmful option
- Absolute difference = 14.5 percentage points
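The two acceptance rates can also be turned into a raw, unadjusted odds ratio. The sketch below does this directly from the percentages; it ignores the participant and scenario random effects, so it is only a sanity check against the mixed-effects estimate, and the helper names are illustrative:

```python
def odds(p: float) -> float:
    """Convert a probability to odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_exposed: float, p_reference: float) -> float:
    """Unadjusted odds ratio from two proportions."""
    return odds(p_exposed) / odds(p_reference)


p_omission, p_action = 0.529, 0.384  # acceptance rates reported above
raw_or = odds_ratio(p_omission, p_action)
risk_difference = p_omission - p_action

print(f"raw OR = {raw_or:.2f}, difference = {risk_difference * 100:.1f} points")
```

The raw odds ratio is about 1.80, in the same range as the model-based OR of 1.87.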
| Source | OR | 95% CI | p | d (converted) |
| Human | 1.84 | [1.64, 2.06] | <.001 | 0.61 |
| AI | 1.91 | [1.69, 2.16] | <.001 | 0.65 |
| Combined | 1.87 | [1.71, 2.05] | <.001 | 0.63 |
- Population-level effect controlling for clustering
- Accounts for fact that different people saw different scenarios
- Adjusts for random variation across scenarios and participants
- Answers: "How much more likely is acceptance when framed as omission vs. action?"
- Individual-level effect averaging within-person comparisons
- Reflects typical participant's response difference
- Standardized mean difference in continuous ratings
- Answers: "How many standard deviations apart are action vs. omission ratings?"
- Continuous normal latent variable (our Likert ratings approximate this)
- Binary threshold (convert ratings to accept/reject) (✓ we did this)
- Equal prevalence in both groups (✗ our prevalence differs: 38% vs. 53%)
- Mixed-effects adjustment: Logistic regression includes random effects that absorb some variance, reducing the conditional OR
- Different samples: Paired t-test uses participants who saw both frames (across different scenarios); logistic regression uses all participants
- Measurement level: d is based on continuous 1-7 ratings; OR is based on binarized accept/reject
- d=0.63 describes the standardized mean difference in continuous ratings
- OR=1.87 describes the odds ratio for binary acceptance decisions
- Both indicate a medium-to-large effect in the same direction
- The two statistics are complementary, not contradictory
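To illustrate why the two statistics should not be expected to map onto each other, the sketch below applies the common logistic-distribution conversion, d = ln(OR) × √3 / π, to the combined OR. Applying this conversion here is an assumption for illustration, not a step taken from the original analysis:

```python
import math


def odds_ratio_to_d(odds_ratio: float) -> float:
    """Convert an odds ratio to d under a logistic latent-variable assumption."""
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    return math.log(odds_ratio) * math.sqrt(3.0) / math.pi


converted_d = odds_ratio_to_d(1.87)  # combined OR reported above

print(f"d implied by OR = {converted_d:.3f}")
```

The conversion gives a d of approximately 0.35, well below the 0.63 obtained from the continuous ratings. That gap is what one would expect when the OR comes from a mixed-effects model on a binarized outcome and d comes from paired continuous ratings.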
- Comparable to other psychological research (most studies report d)
- Intuitive interpretation (0.63 SD difference)
- Shows effect size on continuous scale
- Appropriate for logistic regression (binary outcome)
- Accounts for nested data structure (participants, scenarios)
- Directly answers decision-making question ("how much more likely to accept?")
- Measure: Participants also rated "how acceptable is this option?" (1-7 Likert scale)
- Statistic: Cohen's d = 0.63 (medium-large effect)
- Interpretation: Omission-framed options rated 0.63 SD higher in acceptability
- Mean difference: 4.37 (omission) - 3.21 (action) = 1.16 points on 7-point scale
- 95% CI for d: [0.57, 0.69]
| Aspect | Binary Choice | Continuous Rating |
| Question | "Which option do you choose?" | "How acceptable is this option?" |
| Response | Forced choice (A or B) | 7-point scale |
| Analysis | Logistic regression | Paired t-test |
| Effect size | OR (ratio of odds) | d (standardized mean difference) |
- OR (1.87): Directly answers "How much more likely are people to accept harm via omission?"
- d (0.63): Provides standardized effect size comparable to other psychological research
- Together: Demonstrate robustness across categorical and continuous operationalizations
- Binary analyses (logistic regression) → report OR
- Continuous analyses (t-tests, regression) → report d or β
- Both significant at p<.001, FDR-corrected q<.001
3.3.2. Agency Salience Moderation (H3b)
- High agency: "You must decide whether to..."
- Low agency: "The board will decide whether to..."
| Condition | OR (Omission vs. Action) | 95% CI | p |
| Human, High Agency | 2.14 | [1.82, 2.51] | <.001 |
| Human, Low Agency | 1.58 | [1.34, 1.87] | <.001 |
| AI, High Agency | 2.09 | [1.77, 2.47] | <.001 |
| AI, Low Agency | 1.64 | [1.38, 1.95] | <.001 |
| Frame | % Mentioning Responsibility | % Mentioning Causation |
| Action, High Agency | 72.1% | 84.3% |
| Action, Low Agency | 48.7% | 71.2% |
| Omission, High Agency | 43.2% | 38.4% |
| Omission, Low Agency | 31.8% | 24.7% |
3.3.3. Summary of H3 Findings
- Either a systematic bias learned by AI from human training data, or
- A morally appropriate distinction that both humans and AI correctly recognize
3.4. Domain and Complexity Effects
3.4.1. Domain Effects
| Domain | Mean Variation | SD | Mean CR | Mean AV |
| Honesty/Transparency | 0.38 | 0.14 | 0.19 | 0.28 |
| Harm Prevention | 0.40 | 0.15 | 0.21 | 0.29 |
| Promise-Keeping | 0.42 | 0.16 | 0.23 | 0.31 |
| Autonomy/Rights | 0.44 | 0.17 | 0.25 | 0.33 |
| Fairness/Justice | 0.46 | 0.18 | 0.28 | 0.36 |
- Fairness/Justice > all other domains (all p < .001, FDR-corrected q < .001)
- Honesty/Transparency < all other domains (all p < .01, FDR-corrected q < .01)
- Other pairwise differences: mixed significance
- Higher contextual responsiveness (CR = 0.28 vs. 0.19-0.25 for other domains)
- Higher arbitrary variation (AV = 0.36 vs. 0.28-0.33)
- Fairness scenarios show highest framework integration (mean=2.1 vs. 1.7 overall)
- Highest proportion acknowledging framework tensions (7.3% Level 4 integration vs. 4.7% overall)
- Fairness scenarios coded as more complex (mean=8.2 vs. 7.1 overall)
- More stakeholder groups (mean=4.8 vs. 3.9)
- Original domain effect: F(4, 13,195) = 142.7, p < .001, η²p = 0.04
- Controlling for complexity: F(4, 13,194) = 94.3, p < .001, η²p = 0.03
3.4.2. Complexity Effects
- β = 0.06 per complexity point
- t = 14.73, p < .001, FDR-corrected q < .001
- R² = 0.09
| Complexity Component | β | t | p | Partial R² |
| Stakeholder groups | 0.04 | 6.23 | <.001 | 0.03 |
| Value conflicts | 0.05 | 8.91 | <.001 | 0.04 |
| Information ambiguity | 0.05 | 9.12 | <.001 | 0.04 |
| Reversibility | 0.03 | 5.47 | <.001 | 0.02 |
- β = 0.02 (vs. 0.04 unadjusted)
- t = 2.18, p = .029, FDR-corrected q = .041
- Effect reduced 50% but remains significant
- β = -0.03, SE = 0.01, t = -2.87, p = .004, FDR-corrected q = .008
3.5. Relational Reasoning and Variation Patterns (H4)
3.5.1. Source Differences in Relational Reasoning
| Source | Mean Terms per Response | SD | 95% CI |
| Human | 4.7 | 2.8 | [4.4, 5.0] |
| AI (GPT-4) | 2.4 | 1.7 | [2.1, 2.7] |
| AI (Claude) | 2.1 | 1.5 | [1.8, 2.4] |
| AI (Gemini) | 2.4 | 1.8 | [2.1, 2.7] |
- 400 human responses from 133 participants (mean 3.0 coded responses per participant)
- 400 AI responses at T=0.7 (GPT-4 n=133, Claude n=133, Gemini n=134)
- Covering all 15 base scenarios
| Strength Level | Description | Human % | AI % |
| 0 | None (relationships not mentioned) | 20.5% | 36.0% |
| 1 | Minimal (mentioned, not decisive) | 31.3% | 41.7% |
| 2 | Present (co-equal consideration) | 27.7% | 16.0% |
| 3 | Integrated/Central (drives logic) | 17.0% | 4.0% |
| 4 | Sophisticated (expert care ethics) | 2.8% | 2.0% |
| 5 | Advanced synthesis (addresses limits) | 0.8% | 0.3% |
- Combined representation: Humans 3.6%, AI 2.3%
- Very rare in non-expert samples (as expected)
- Examples require explicit care ethics framework language plus nuanced discussion of obligation limits
- See Appendix B.2.1 for Level 4-5 anchoring examples
- Human: M = 1.57 (SD = 1.09)
- AI: M = 0.91 (SD = 0.87)
- Difference: t(798) = 6.81, p < .001, d = 0.56 (medium effect), FDR-corrected q < .001
- Language density: Humans use about 2× as many relational terms per response (4.7 vs. 2.1-2.4 across AI models; d = 1.04)
- Reasoning strength: Humans average 1.57 vs. AI 0.91 on 0-5 scale (d = 0.56)
- Distribution: Only 20.5% of human responses show no relational reasoning vs. 36.0% of AI responses
- Sophisticated reasoning (Levels 4-5): Humans 3.6% vs. AI 2.3%, though both are rare
3.5.2. Relational Reasoning and Contextual Responsiveness (H4b)
- β = +0.012, SE = 0.003
- t = 4.02, p < .001, FDR-corrected q < .001
- Partial R² = 0.02
| Relational Strength | Mean CR | Mean AV | n responses |
| 0 (None) | 0.17 | 0.39 | 187 |
| 1 (Mentioned) | 0.21 | 0.35 | 219 |
| 2 (Co-equal) | 0.31 | 0.29 | 131 |
| 3 (Decisive) | 0.36 | 0.27 | 63 |
- CR: Each level significantly higher than previous (all p < .01, FDR-corrected q < .01)
- AV: Levels 0-1 > Levels 2-3 (p < .001, FDR-corrected q < .001); Levels 0 vs. 1 and 2 vs. 3 ns
- ↑ Higher contextual responsiveness (0.36 for Level 3 vs. 0.17 for Level 0; +112%)
- ↓ Lower arbitrary variation (0.27 for Level 3 vs. 0.39 for Level 0; -31%)
3.5.3. Mediation Analysis (H4c): Does Relational Reasoning Explain Source Differences?
- Each participant contributed 3 coded responses (randomly selected from their 20 scenarios)
- These 3 responses are nested within participant - not independent
- Response-level analysis (N=400) would:
  - Violate the independence assumption
  - Underestimate standard errors by up to a factor of √3 ≈ 1.73 (the limiting case in which a participant's 3 responses are perfectly correlated)
  - Inflate t-statistics by up to ~1.73, making effects appear stronger than they are
  - Yield incorrect p-values and confidence intervals
- Participant P042's 3 responses: RR = 2, 3, 2
- Aggregated: RR_P042 = (2+3+2)/3 = 2.33
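The aggregation step and the √3 figure can be sketched as follows. The design-effect formula 1 + (m - 1) × ICC is the standard expression for equal-sized clusters; the ICC of 0.5 is a hypothetical value for illustration, since the text does not report one:

```python
import math
from statistics import mean


def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation from clustering: 1 + (m - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def se_inflation(cluster_size: int, icc: float) -> float:
    """Factor by which naive standard errors are too small."""
    return math.sqrt(design_effect(cluster_size, icc))


def aggregate_rr(scores: list) -> float:
    """Participant-level relational reasoning score (mean of coded responses)."""
    return mean(scores)


rr_p042 = aggregate_rr([2, 3, 2])        # example participant from the text
worst_case = se_inflation(3, icc=1.0)    # perfectly correlated responses
moderate = se_inflation(3, icc=0.5)      # hypothetical ICC, for illustration

print(f"RR_P042 = {rr_p042:.2f}")
print(f"SE inflation: worst case {worst_case:.2f}, ICC=0.5 {moderate:.2f}")
```

The √3 ≈ 1.73 factor appears only when the ICC is 1; with weaker within-participant correlation the inflation is smaller, though still a reason to analyze at the participant level.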
- Sample size limitation: MSEM requires larger samples for stable estimation
  - Recommended N > 200 participants for mediation (Preacher et al., 2010)
  - Our N=133 is below this threshold
  - Risk of convergence failures, unstable estimates
- Unbalanced design: Not all participants have exactly 3 coded responses
  - Some have 2 (if one response excluded for quality)
  - Some have 4 (oversampled for reliability checks)
  - Unbalanced designs complicate MSEM estimation
- Assumption violations: MSEM assumes:
  - Normally distributed random effects (our RR is skewed)
  - Homogeneous within-participant variance (violated: some participants more variable)
  - Linear relationships at both levels (untested)
- Pragmatic considerations:
  - Participant-level analysis is more conservative (larger SEs, more stringent test)
  - Results are more interpretable for applied audiences
  - Replication studies can use same approach
- Effective N = 133, not N = 400
- Power calculation:
- For β = 0.04 (predicted indirect effect)
- α = 0.05, two-tailed
- Power = 0.84 with N=133
- Adequate power for predicted effect sizes
- Standard errors are larger than they would be in response-level analysis (more conservative)
- Confidence intervals are wider (more realistic uncertainty quantification)
- P-values are more stringent (harder to achieve significance)
- Results apply to participant-level patterns, not individual responses
- Interpretation: "Participants (not responses) with higher relational reasoning show higher contextual responsiveness"
- This is the appropriate level for organizational applications (training targets individuals, not individual decisions)
| Analysis Level | a path | b path | Indirect (a×b) | 95% CI |
| Participant (N=133) [VALID] | 0.441*** | 0.097*** | 0.043*** | [0.028, 0.061] |
| Response (N=400) [INVALID] | 0.438*** | 0.094*** | 0.041*** | [0.031, 0.053] |
- Point estimates very similar (0.043 vs. 0.041)
- Confidence interval narrower at response level [0.031, 0.053] due to underestimated SEs
- Both significant at p<.001, so conclusion robust
- But participant-level is correct analysis due to independence assumption
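The indirect effects in the table are products of the a and b path coefficients. A minimal check of that arithmetic with the reported estimates (the dictionary layout is an illustrative assumption):

```python
def indirect_effect(a_path: float, b_path: float) -> float:
    """Product-of-coefficients estimate of the indirect (mediated) effect."""
    return a_path * b_path


paths = {
    "participant": (0.441, 0.097),  # N=133, aggregated scores
    "response": (0.438, 0.094),     # N=400, ignores nesting
}

indirect = {level: indirect_effect(a, b) for level, (a, b) in paths.items()}

for level, value in indirect.items():
    print(f"{level}-level indirect effect = {value:.3f}")
```

This reproduces the point estimates only. The confidence intervals depend on the bootstrap procedure applied to the underlying data, which is not sketched here.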
| Characteristic | Full Sample (N=300) | Coded Subsample (N=133) | Test |
| Mean Age | 38.4 years | 37.9 years | t=0.42, p=.67 |
| % Female | 52% | 54% | χ²=0.24, p=.62 |
| Mean CR | 0.274 | 0.268 | t=0.51, p=.61 |
| Mean AV | 0.329 | 0.334 | t=-0.38, p=.70 |
- Code all responses (not subsample) if resources permit
  - Eliminates representativeness concerns
  - Enables response-level MSEM if N is sufficient
- Pre-specify mediation level (participant vs. response) before data collection
  - Our choice was post-hoc (driven by resource constraints)
  - Ideally determined a priori based on theoretical interest
- Report both aggregated and multilevel results for comparison
  - Shows robustness (or lack thereof)
  - Advances methodological understanding
- Each participant contributes one observation (mean RR, overall CR)
- Independent observations assumption met
- Conservative analysis with adequate power
- Would violate independence
- Would underestimate standard errors
- Would yield anti-conservative inference
- Required n for indirect effect β=0.04 (predicted), α=.05, power=.80: n=118
- Achieved n=133 provides power=.84 (adequate)
- Subsample is representative on observed characteristics
- Effect sizes are large (path a: d=0.68, path b: r=0.41)
- Bootstrapped confidence intervals are narrow, suggesting precision
- No theoretical reason to expect mediation differs in unsampled participants
- Randomly selected from full sample (N=300)
- Each provided 3 coded responses for RR measurement
- Each has CR calculated from all 20 scenarios
- Mean aggregation: RR_k = mean of 3 coded responses
- Assumption violations: MSEM assumes:
  - Normally distributed random effects (our RR is skewed, see Table S6a)
  - Homogeneous within-participant variance (violated: some participants more variable)
  - Linear relationships at both levels (untested)
- Measured confounders in our model (age, experience) correlate r = 0.22-0.31 with RR and CR
- An unmeasured confounder stronger than these would be needed to eliminate mediation
| Model | AIC | BIC | ΔAIC | ΔBIC |
| Model 1 (H4) | 287.3 | 303.9 | 0 | 0 |
| Model 2 (Reverse) | 312.4 | 329.0 | +25.1 | +25.1 |
| Model 3 (Common cause) | 291.7 | 313.5 | +4.4 | +9.6 |
- Indirect effect: β = 0.043, 95% CI [0.028, 0.061], p < .001, FDR q < .001
- Proportion mediated: 69% of human-AI difference
- Robust to moderate unmeasured confounding (RV = 0.48)
- Analysis conducted at appropriate participant level (N=133) with aggregated relational reasoning scores
- Results generalize to full sample based on representativeness checks
- Subsample is representative on observed characteristics (age, gender, CR, AV all p ≥ .61)
- Effect sizes are large (path a: d=0.68, path b: r=0.41)
- Bootstrapped confidence intervals are narrow, suggesting precision
- No theoretical reason to expect mediation differs in unsampled participants
3.5.4. Alternative Causal Models
| Model | AIC | BIC | ΔAIC | ΔBIC |
| Model 1 (H4) | 1847.3 | 1872.9 | 0 | 0 |
| Model 2 (Reverse) | 1889.4 | 1915.0 | +42.1 | +42.1 |
| Model 3 (Common cause) | 1851.7 | 1882.5 | +4.4 | +9.6 |
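The ΔAIC and ΔBIC columns are differences from the best-fitting (lowest) value. The sketch below recomputes them from the AIC and BIC values in the table; the helper name `deltas` is an illustrative assumption:

```python
def deltas(criterion_by_model: dict) -> dict:
    """Difference between each model's criterion and the lowest value."""
    best = min(criterion_by_model.values())
    return {name: round(value - best, 1) for name, value in criterion_by_model.items()}


aic = {"Model 1 (H4)": 1847.3, "Model 2 (Reverse)": 1889.4, "Model 3 (Common cause)": 1851.7}
bic = {"Model 1 (H4)": 1872.9, "Model 2 (Reverse)": 1915.0, "Model 3 (Common cause)": 1882.5}

delta_aic = deltas(aic)
delta_bic = deltas(bic)

for name in aic:
    print(f"{name}: dAIC = {delta_aic[name]}, dBIC = {delta_bic[name]}")
```

Model 1 has the lowest value on both criteria. The common-cause model trails by 4.4 AIC points, a much smaller margin than the reverse model's 42.1.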
3.5.5. Relational Reasoning and Arbitrary Variation (H4d)
- β = -0.036, SE = 0.011
- t = -3.34, p < .001, FDR-corrected q < .001
- Partial R² = 0.02
| Relational Strength | Mean AV |
| 0 (None) | 0.39 |
| 1 (Mentioned) | 0.35 |
| 2 (Co-equal) | 0.29 |
| 3 (Decisive) | 0.27 |
3.5.6. Summary of H4 Findings
- ↑ Systematic contextual responsiveness (higher CR)
- ↓ Random arbitrary variation (lower AV)
3.6. Framework Integration and Variation Patterns
3.6.1. Integration Level Distribution
- Humans: 400 responses
- AI T=0.7: 400 responses
| Level | Description | Human n(%) | AI n(%) | Total n(%) |
| 0 | Single framework only, no acknowledgment of alternatives | 108 (27.0%) | 198 (49.5%) | 306 (38.3%) |
| 1 | Multiple frameworks mentioned, one clearly dominant | 99 (24.8%) | 107 (26.8%) | 206 (25.8%) |
| 2 | Genuine attempt to integrate multiple frameworks | 113 (28.2%)† | 73 (18.2%)† | 186 (23.2%)† |
| 3 | Explicit acknowledgment of framework tensions | 63 (15.8%) | 19 (4.8%) | 82 (10.3%) |
| 4 | Expert synthesis producing coherent unified position | 17 (4.2%)† | 3 (0.8%)† | 20 (2.5%)† |
| Total | | 400 (100.0%) | 400 (100.0%) | 800 (100.0%) |
- Human Level 2: 28.3% → 28.2% (rounding adjustment)
- Human Level 4: 4.3% → 4.2% (rounding adjustment)
- AI Level 2: 18.3% → 18.2% (rounding adjustment)
- AI Level 4: 0.8% retained (an earlier adjustment to 0.7% was reverted to preserve count=3)
- Human: M = 1.45, SD = 1.17
- AI T=0.7: M = 0.81, SD = 0.93
- Difference: t(798) = 7.12, p < .001, d = 0.59 (medium effect), FDR-corrected q < .001
3.6.3. Evidence for Systematic Framework Selection
| Feature | OR | 95% CI | p |
| Statistical stakeholders | 2.31 | [1.87, 2.85] | <.001 |
| Immediate consequences | 1.82 | [1.46, 2.27] | <.001 |
| Large numbers affected | 1.67 | [1.34, 2.09] | <.001 |
| Feature | OR | 95% CI | p |
| Named stakeholders | 3.14 | [2.47, 3.99] | <.001 |
| Relational context | 2.87 | [2.26, 3.65] | <.001 |
| Ongoing relationships | 2.43 | [1.91, 3.10] | <.001 |
| Feature | OR | 95% CI | p |
| Rights violation salient | 2.76 | [2.18, 3.51] | <.001 |
| Rule-following emphasized | 2.34 | [1.84, 2.97] | <.001 |
| Action (vs. omission) | 1.89 | [1.49, 2.40] | <.001 |
3.6.4. Within-Framework Consistency (Conditional Consistency)
| Integration Level | Overall Variation | Conditional Variation | Difference |
| 1 (Single) | 0.36 | 0.36 | 0.00 |
| 2 (Multiple) | 0.42 | 0.37 | -0.05 |
| 3 (Integration) | 0.52 | 0.38 | -0.14 |
| 4 (Tension) | 0.59 | 0.41 | -0.18 |
- Using different frameworks in different contexts (coded as framework variation)
- NOT applying the same framework inconsistently
3.6.5. Source Differences
| Source | Mean Integration Level | % High Integration (3-4) |
| Human | 1.87 | 26.3% |
| AI | 1.64 | 21.2% |
3.7. Framework-Specific Contextual Patterns
- Utilitarians: Insensitive to identifiability (lives count equally)
- Care ethicists: Highly sensitive to identifiability & relational context
- Deontologists: Sensitive to action/omission (principled moral distinction)
- Rights-based: Insensitive to temporal proximity (rights don't decay over time)
3.7.1. Feature Effects by Primary Framework
| Framework | OR | 95% CI | p |
| Utilitarian | 1.23 | [0.98, 1.54] | .073 |
| Deontological | 1.67 | [1.32, 2.11] | <.001 |
| Care Ethics | 3.42 | [2.58, 4.53] | <.001 |
| Rights-Based | 1.89 | [1.41, 2.53] | <.001 |
| Stakeholder | 2.31 | [1.84, 2.90] | <.001 |
| Framework | d (Active-Passive) | 95% CI | p |
| Utilitarian | 0.18 | [0.04, 0.32] | .013 |
| Deontological | 0.89 | [0.74, 1.04] | <.001 |
| Care Ethics | 0.54 | [0.37, 0.71] | <.001 |
| Rights-Based | 0.71 | [0.52, 0.90] | <.001 |
| Framework | OR (Immediate vs. Delayed) | 95% CI | p |
| Utilitarian | 1.71 | [1.42, 2.06] | <.001 |
| Deontological | 1.38 | [1.14, 1.67] | .001 |
| Care Ethics | 1.28 | [1.02, 1.61] | .032 |
| Rights-Based | 1.12 | [0.88, 1.43] | .35 |
3.7.2. Implications
- Care ethicists respond to identifiability (persons vs. statistics)
- Deontologists respond to action/omission (agency distinctions)
- Rights-based theorists ignore temporal distance (rights are timeless)
- Framework → Sensitivity (framework shapes judgment)
- Sensitivity → Framework (sensitivities shape framework choice)
3.8. Cluster Analysis: Identifying Optimal Profiles
3.8.1. Within-Person Feature Sensitivity Profiles
| Feature | r | 95% CI | Interpretation |
| Identifiability | .67 | [.61, .73] | Moderate-high |
| Action/Omission | .58 | [.51, .65] | Moderate |
| Temporal | .52 | [.44, .60] | Moderate |
| Relational | .71 | [.66, .76] | High |
3.8.2. K-Means Clustering
- Humans: N=300 individual participants
- AI: N=720 model-scenario combinations (3 models × 240 scenarios at T=0.7)
- Total: N=1,020 units
- SC (structural consistency): Proportion of consistent decisions when irrelevant features vary
- CR (contextual responsiveness): Unique variance explained by debatable features
- AV (arbitrary variation): Residual unexplained variance
| Method | Optimal k | Evidence |
| Elbow method | 4 | Clear inflection point; within-cluster SS drops sharply then plateaus |
| Silhouette | 4 | Maximum average silhouette width = 0.64 at k=4 |
| Gap statistic | 4 | Gap(4) significantly > Gap(3) and Gap(5) by 1 SE rule |
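The silhouette criterion in the table can be illustrated with a small pure-Python sketch. The twelve (SC, CR, AV) points are hypothetical toy data placed near the reported cluster means, and the farthest-first initialization is an assumption chosen to keep the example deterministic; neither comes from the original analysis.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_first(points, k):
    """Deterministic initial centers: repeatedly pick the point farthest from those chosen."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the old center if a cluster empties
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def mean_silhouette(points, labels):
    """Average silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other) / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical (SC, CR, AV) profiles, three per cluster, near the reported cluster means.
points = [
    (0.92, 0.14, 0.31), (0.93, 0.15, 0.30), (0.91, 0.13, 0.32),
    (0.86, 0.26, 0.33), (0.87, 0.27, 0.32), (0.85, 0.25, 0.34),
    (0.74, 0.22, 0.51), (0.75, 0.23, 0.50), (0.73, 0.21, 0.52),
    (0.82, 0.39, 0.30), (0.83, 0.40, 0.29), (0.81, 0.38, 0.31),
]

scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
labels_at_best = kmeans(points, best_k)

print({k: round(v, 2) for k, v in scores.items()}, "best k =", best_k)
```

On this toy data the average silhouette width peaks at k=4, which mirrors the selection logic in the table. The reported width of 0.64 comes from the real N=1,020 units and is not reproduced here.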
| Cluster | n | % | SC M(SD) | CR M(SD) | AV M(SD) | Label |
| 1 | 137 | 13.4% | 0.921 (0.034) | 0.142 (0.041) | 0.313 (0.058) | Principled-Consistent |
| 2 | 444 | 43.5% | 0.862 (0.047) | 0.264 (0.052) | 0.327 (0.064) | Balanced-Integrative |
| 3 | 153 | 15.0% | 0.741 (0.089) | 0.217 (0.068) | 0.512 (0.091) | Inconsistent |
| 4 | 286 | 28.0% | 0.823 (0.062) | 0.387 (0.071) | 0.298 (0.059) | Context-Driven |
- What is a unit? One individual participant
- What data does each unit provide? 20 randomly-selected scenarios from the full set of 240
- How are SC/CR/AV calculated? By aggregating across the participant's 20 responses:
  - SC = proportion of consistent decisions when scenarios differ only in irrelevant features
  - CR = variance in decisions explained by the four debatable features
  - AV = residual variance unexplained by features or structural factors
- Example: Participant P042 saw scenarios 5, 12, 18, ..., 203 → calculate SC/CR/AV from these 20 responses → one clustering unit with profile (SC=0.84, CR=0.27, AV=0.32)
- What is a unit? One model-scenario combination (averaged across repetitions)
- Structure: 3 models × 240 scenarios = 720 units
  - GPT-4: 240 units (one per scenario)
  - Claude: 240 units (one per scenario)
  - Gemini: 240 units (one per scenario)
- What data does each unit provide? 10 independent responses to the same scenario variant
  - Example: "GPT-4 responding to Scenario 087" generates 10 responses (different random seeds, T=0.7)
  - These 10 responses are averaged to create one stable profile
- How are SC/CR/AV calculated for one scenario? SC, CR, and AV typically require variation across multiple scenarios, so the calculation for a single model-scenario unit proceeded as follows:
- Generate 10 independent responses at T=0.7 with different random seeds
- Calculate Structural Consistency (SC):
  - We created minor surface variations of the same scenario (rewording, formatting changes)
  - Presented these variations across the 10 repetitions
  - SC = proportion of identical decisions across surface variations
  - Example: If 9/10 responses gave same decision despite surface changes → SC = 0.90
- Calculate Contextual Responsiveness (CR):
  - This is tricky for a single scenario because CR measures sensitivity to feature variation
  - Method: We coded the scenario's feature levels (ID, Prox, Temp, Rel) and compared the model's response to its average response across scenarios with different feature combinations
  - Specifically: CR_model-scenario = correlation between this scenario's decision probability and the model's typical sensitivity to its feature profile
  - Example: Scenario 087 has [ID=High, Prox=Low, Temp=High, Rel=Low]. If GPT-4 typically responds strongly to ID and Temp, we predict high acceptance. CR measures how well this scenario matches the pattern.
- Calculate Arbitrary Variation (AV):
  - AV = variance in the 10 responses that isn't explained by:
    - SC (surface variation responses)
    - CR (feature-based prediction)
  - Example: If 10 responses split 6-4 with no systematic pattern → high AV
- Treat the 10 repetitions as analogous to one participant seeing 10 related scenarios
- Some repetitions involve minor scenario variations (surface features changed)
- Calculate SC, CR, AV as if the model were a "participant" responding to a small set of scenarios
- This yields one profile (SC, CR, AV) per model-scenario combination
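The SC step described above (9 of 10 identical decisions giving SC = 0.90) can be sketched as follows. This is a minimal illustration only: reading "proportion of identical decisions" as agreement with the most common decision is an assumption, and the function name `structural_consistency` and the decision labels are hypothetical.

```python
from collections import Counter


def structural_consistency(decisions):
    """Share of responses that agree with the most common (modal) decision.

    Assumption: "identical decisions" is read as agreement with the mode.
    """
    if not decisions:
        raise ValueError("need at least one decision")
    _, modal_count = Counter(decisions).most_common(1)[0]
    return modal_count / len(decisions)


# Hypothetical repetitions for one model-scenario unit: 9 agree, 1 differs.
example = ["accept"] * 9 + ["reject"]
sc = structural_consistency(example)
```

With nine matching responses out of ten, the function reproduces the worked value in the text (0.90).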
- ✓ Makes mathematical sense
- ✓ Parallels the human calculation structure
- ✓ Explains why we get 720 distinct profiles (3 models × 240 scenarios)
- Would lose scenario-specific variation
- Insufficient power (n=3 too small for clustering)
- Wouldn't parallel human structure (we cluster individuals, not averaged-across-scenarios)
- Would give undue weight to AI (7,200 AI vs. 6,000 human responses)
- Individual responses are noisy; averaging across 10 reps provides stability
- Unbalanced sample sizes would distort cluster formation
- Would lose model-specific variation
- Assumes GPT-4, Claude, Gemini respond identically (empirically false)
- Wouldn't allow examination of model differences in cluster membership
- ✓ Treating each model-scenario combination as analogous to a human participant
- ✓ Each "unit" represents a stable reasoning profile (averaged across 10 samples)
- ✓ Allows AI to show different profiles across scenarios (like humans show different profiles across people)
- ✓ Balances representation: 300 human profiles + 720 AI profiles (2.4:1 ratio)
- 48.8% of model-scenario combinations (e.g., "GPT-4 on Scenario 012") exhibit Balanced-Integrative profile
- NOT that 48.8% of AI responses fall in Cluster 2
- NOT that 48.8% of AI models fall in Cluster 2 (we only have 3 models)
- Each of the 720 AI units has a unique (SC, CR, AV) profile ✓
- Units from the same model are more similar than units from different models ✓
- Units from scenarios with similar feature profiles cluster together ✓
- The clustering is stable across cross-validation folds (89% stable membership) ✓
- Assumes 10 repetitions are sufficient to characterize a model-scenario profile
- Treats model-scenario combinations as independent (they're not - same model appears 240 times)
- May give excessive weight to AI variation compared to human variation
- Low sensitivity ≠ low noise: Cluster 1 shows lowest CR (14%) but NOT lowest AV (31%). Rigid principlism doesn't eliminate inconsistency.
- Optimal profile is moderate CR: Cluster 2 achieves low arbitrary variation (33%) with moderate contextual responsiveness (26%), not by eliminating contextual sensitivity.
- High sensitivity can be systematic: Cluster 4 shows highest CR (39%) but relatively low AV (30%), suggesting context-sensitivity can be principled rather than random.
| Cluster | Humans n(%) | AI T=0.7 n(%) | χ² | p |
| 1: Principled | 41 (13.7%) | 96 (13.3%) | ||
| 2: Balanced | 93 (31.0%) | 351 (48.8%) | ||
| 3: Inconsistent | 38 (12.7%) | 115 (16.0%) | ||
| 4: Context-Driven | 128 (42.7%) | 158 (21.9%) | ||
| Total | 300 (100%) | 720 (100%) | 87.6 | <.001 |
- AI over-represents Cluster 2 (Balanced): 48.8% vs. 31.0% human
  - Suggests AI at T=0.7 achieves "optimal" calibration more consistently than humans
- Humans over-represent Cluster 4 (Context-Driven): 42.7% vs. 21.9% AI
  - Humans show more extreme contextual sensitivity
- Similar representation in Clusters 1 and 3: No significant differences in principled or inconsistent extremes
- All normative frameworks agree low AV is desirable (less unexplained inconsistency)
- All agree high SC is desirable (reliability when irrelevant features vary)
- Frameworks disagree on optimal CR level (principlist: low; particularist: moderate)
| Cluster | n | % | SC | CR | AV | Label |
| 1 | 137 | 13.4% | 0.921 | 0.142 | 0.313 | Principled-Consistent |
| 2 | 444 | 43.5% | 0.862 | 0.264 | 0.327 | Balanced-Integrative |
| 3 | 153 | 15.0% | 0.741 | 0.217 | 0.512 | Inconsistent |
| 4 | 286 | 28.0% | 0.823 | 0.387 | 0.298 | Context-Driven |
- But also highest CR (0.387) - potentially over-sensitivity to context
- Intermediate SC (0.823)
| Cluster | n | % | SC | CR | AV | Label |
| H1 | 87 | 29.0% | 0.918 | 0.138 | 0.336 | Principled-Consistent |
| H2 | 93 | 31.0% | 0.857 | 0.261 | 0.291 | Balanced-Integrative |
| H3 | 38 | 12.7% | 0.738 | 0.214 | 0.521 | Inconsistent |
| H4 | 82 | 27.3% | 0.814 | 0.394 | 0.348 | Context-Driven |
- Moderate CR (0.261) - neither rigid principlism nor extreme particularism
- Good SC (0.857)
- AI concentration artifacts:
  - In full sample, 48.8% of AI falls in Cluster 2 (Balanced) due to temperature calibration (T=0.7 was selected to produce human-like variation)
  - This may artificially inflate Cluster 2's AV because it includes many AI units
  - Cluster 4 may achieve lower AV because it includes fewer AI units (21.9% AI vs. 42.7% human)
- Simpson's paradox:
  - Pooling humans and AI changes cluster structure because:
    - AI has narrower AV distribution (SD=0.073) than humans (SD=0.087)
    - AI over-represents certain CR ranges due to temperature effects
  - The "optimal" profile in mixed sample may differ from human-only optimal
- Sample size effects:
  - Full sample (N=1,020) has more statistical power to detect subtle clusters
  - Human-only (N=300) may merge some distinct profiles
  - Cluster 4 in full sample may represent a profile achievable by some humans but diluted in smaller sample
- Full-sample Cluster 4 may have lower AV partly because AI contribution is temperature-dependent
- Human-only clustering removes this confound
- Organizations train humans, not AI
- Human-only profile is the achievable target for ethics training
- Human-only Cluster H2 achieves AV = 0.291 (slightly lower than full-sample Cluster 4's 0.298)
- But H2 has more moderate CR (0.261 vs. 0.387), avoiding potential over-sensitivity
- High structural consistency: SC = 0.857 (top 15% achieve SC > 0.90)
- Moderate contextual responsiveness: CR = 0.261 (neither principlist <0.15 nor extreme particularist >0.35)
- Low arbitrary variation: AV = 0.291 (achievable floor; top performers reach AV ≈ 0.25-0.30)
- Cluster 4 is only 27.3% of humans (vs. 31.0% in H2)
- Higher CR (0.394 in human H4) yields higher AV (0.348) than moderate CR (0.261 in H2 → AV 0.291)
- This suggests diminishing returns: moving from CR ≈ 0.26 to CR ≈ 0.39 is associated with lower AV in the full sample (0.327 → 0.298) but higher AV in the human-only clustering (0.291 → 0.348)
- CR = 0.25-0.30 appears optimal for most humans (31% naturally in this range)
- CR = 0.35-0.40 may be achievable with very low AV (<0.30) for some individuals, but only 27% of humans reach this without increasing AV
- Organizations should target CR = 0.25-0.30 as realistic optimum, while recognizing that sophisticated particularists may achieve higher CR (0.35-0.40) without increasing AV
| Cluster | Humans n(%) | AI T=0.7 n(%) | χ² | p |
| 1: Principled | 41 (13.7%) | 96 (13.3%) | ||
| 2: Balanced | 93 (31.0%) | 351 (48.8%) | ||
| 3: Inconsistent | 38 (12.7%) | 115 (16.0%) | ||
| 4: Context-Driven | 128 (42.7%) | 158 (21.9%) | ||
| Total | 300 (100%) | 720 (100%) | 87.6 | <.001 |
- AI over-represents Cluster 2 (Balanced): 48.8% vs. 31.0% human
  - This is expected given temperature selection (T=0.7 was chosen to produce human-like total variation)
  - Does NOT indicate AI is "more optimal" - rather, that T=0.7 calibrates AI to this profile
- Humans over-represent Cluster 4 (Context-Driven): 42.7% vs. 21.9% AI
  - Humans show more extreme contextual sensitivity
  - This cluster has lowest AV (0.298) in full sample but not in human-only clustering
- Similar representation in Clusters 1 and 3: No significant differences in principled (13.7% vs. 13.3%) or inconsistent (12.7% vs. 16.0%) extremes
- SC ≈ 0.86 (high consistency)
- CR ≈ 0.26 (moderate contextual sensitivity)
- AV ≈ 0.29 (low arbitrary variation)
- Principled reasoning frameworks (to maintain high SC)
- Systematic attention to morally relevant contextual features (to achieve moderate CR)
- Minimizing random inconsistency (to reduce AV below 0.30)
3.8.3. Cluster Characteristics
| Cluster | Mean Age | % Female | % Graduate Degree | % Ethics Training |
| 1 (Low) | 36.8 | 48% | 64% | 28% |
| 2 (Balanced) | 39.2 | 54% | 69% | 33% |
| 3 (Relational) | 39.7 | 61% | 71% | 35% |
| 4 (High) | 37.4 | 49% | 65% | 29% |
| Cluster | Top Framework | % Using Framework |
| 1 (Low) | Utilitarian | 42% |
| 2 (Balanced) | Stakeholder | 38% |
| 3 (Relational) | Care Ethics | 47% |
| 4 (High) | Mixed (no dominant) | — |
| Cluster | Mean Years Experience | Mean Leadership Level (1-5) |
| 1 (Low) | 11.2 | 2.8 |
| 2 (Balanced) | 14.8 | 3.4 |
| 3 (Relational) | 15.3 | 3.6 |
| 4 (High) | 12.7 | 3.0 |
3.8.4. AI Model Differences in Cluster Membership (Post-Hoc Analysis)
- GPT-4 contributes 240 units (one per scenario)
- Claude contributes 240 units (one per scenario)
- Gemini contributes 240 units (one per scenario)
| Cluster | GPT-4 n(%) | Claude n(%) | Gemini n(%) | χ² | p |
| 1: Principled | 43 (17.9%) | 53 (22.1%) | 48 (20.0%) | ||
| 2: Balanced | 113 (47.1%) | 122 (50.8%) | 108 (45.0%) | ||
| 3: Inconsistent | 67 (27.9%) | 51 (21.3%) | 68 (28.3%) | ||
| 4: Context-Driven | 17 (7.1%) | 14 (5.8%) | 16 (6.7%) | ||
| Total | 240 (100%) | 240 (100%) | 240 (100%) | 8.4 | .39 |
- ~45-51% in Balanced cluster (Cluster 2)
- ~18-22% in Principled cluster (Cluster 1)
- ~21-28% in Inconsistent cluster (Cluster 3)
- ~6-7% in Context-Driven cluster (Cluster 4)
- Variation across scenarios (same model shows different profiles for different scenarios)
- Differences from humans (see §3.8.2)
| Cluster | Humans n(%) | All AI n(%) | Difference | p |
| 1: Principled | 41 (13.7%) | 144 (20.0%) | +6.3 pp | .018 |
| 2: Balanced | 93 (31.0%) | 343 (47.6%) | +16.6 pp | <.001 |
| 3: Inconsistent | 38 (12.7%) | 186 (25.8%) | +13.1 pp | <.001 |
| 4: Context-Driven | 128 (42.7%) | 47 (6.5%) | -36.2 pp | <.001 |
- AI at T=0.7 more consistently achieves "optimal" profile (48% vs. 31%)
- Humans show more extreme context-sensitivity (43% vs. 7% in Cluster 4)
- Temperature calibration effectively targets the Balanced profile for AI
- AI "naturally" fits Cluster 2, OR
- Cluster 2 definition was influenced by AI's concentration there
3.9. Sensitivity Analyses and Robustness Checks
3.9.1. Temperature Sensitivity (AI only)
| Feature | η²p Range | p-value Range | Robust? |
| Identifiability | 0.16-0.19 | <.001 all | ✓ Yes |
| Action/Omission | 0.09-0.12 | <.001 all | ✓ Yes |
| Temporal | 0.07-0.09 | <.001 all | ✓ Yes |
| Relational | 0.13-0.14 | <.001 all | ✓ Yes |
| Temperature | AI Variation | Difference from Human (0.42) | p | q |
| 0.3 | 0.26 | -0.16*** | <.001 | <.001 |
| 0.5 | 0.36 | -0.06* | .024 | .036 |
| 0.7 | 0.41 | -0.01 ns | .56 | .58 |
| 1.0 | 0.49 | +0.07*** | <.001 | <.001 |
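The comparison in the table above amounts to picking the temperature whose mean AI variation lies closest to the human value of 0.42. A minimal sketch, using only the values reported in the table (the names `closest_temperature` and `gaps` are hypothetical, and the significance tests in the p and q columns are not reproduced here):

```python
HUMAN_VARIATION = 0.42

# Mean AI variation at each sampling temperature, as reported in the table.
ai_variation_by_temperature = {0.3: 0.26, 0.5: 0.36, 0.7: 0.41, 1.0: 0.49}


def closest_temperature(ai_variation, human_variation):
    """Return the temperature whose AI variation is nearest the human value."""
    return min(ai_variation, key=lambda t: abs(ai_variation[t] - human_variation))


# Signed gaps (AI minus human), rounded to two decimals as in the table.
gaps = {t: round(v - HUMAN_VARIATION, 2)
        for t, v in ai_variation_by_temperature.items()}
best = closest_temperature(ai_variation_by_temperature, HUMAN_VARIATION)
```

The smallest absolute gap occurs at T=0.7, which matches the only non-significant row of the table.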
- ✓ Robust: Contextual feature effects are consistent across temperatures (all FDR-corrected q < .001)
- ⚠️ Temperature-dependent: Human-AI similarity in overall variation levels
3.9.2. Alternative Consistency Metrics
| | Metric 1 | Metric 2 | Metric 3 |
| Metric 1 | 1.00 | .94*** | .89*** |
| Metric 2 | .94*** | 1.00 | .82*** |
| Metric 3 | .89*** | .82*** | 1.00 |
| Finding | Metric 1 | Metric 2 | Metric 3 |
| Identifiability effect | η²p=0.18*** | η²p=0.17*** | η²p=0.16*** |
| Relational mediation | β=0.043*** | β=0.039*** | β=0.047*** |
| Cluster 2 optimal | AV=0.31 | AV=0.29 | AV=0.33 |
3.9.3. Outlier Analysis
- n=11 participants (3.7% of final sample)
- Mean variation: 0.78 (vs. 0.42 overall)
- Characteristics: Shorter response times (mean 31 min vs. 47 min), lower word counts
| Finding | Full Sample | Outliers Excluded | Change |
| Mean variation | 0.42 | 0.40 | -0.02 |
| Identifiability OR | 2.08*** | 2.04*** | -0.04 |
| Relational β | 0.043*** | 0.041*** | -0.002 |
3.9.4. Domain Subsample Analyses
| Finding | Harm | Fairness | Autonomy | Promise | Honesty |
| Identifiability OR | 2.34*** | 1.87*** | 2.18*** | 1.94*** | 1.73*** |
| Action/Omission d | 0.71*** | 0.58*** | 0.64*** | 0.62*** | 0.54*** |
| Relational β | 0.039** | 0.048*** | 0.041*** | 0.044*** | 0.036** |
3.9.5. Missing Data Sensitivity
- Response coding: 0.6% (8 responses excluded due to incomprehensible content - part of 16 total exclusions in §2.4.1)
- Stakeholder rankings: 2.3% (scenarios with single stakeholder)
- Demographics: 1.1% (participants declined to answer)
| Finding | Complete Case | Imputed | Difference |
| Identifiability OR | 2.08*** | 2.09*** | +0.01 |
| Relational mediation β | 0.043*** | 0.044*** | +0.001 |
4. Discussion
4.1. Summary of Empirical Findings
4.1.1. Substantial Contextual Responsiveness (H1-H2)
- Stakeholder identifiability (OR = 2.08, η²p = 0.18): Named individuals favored over statistical aggregates
- Stakeholder proximity (OR = 7.89): Direct stakeholders prioritized over distant stakeholders
- Temporal proximity (OR = 1.52, η²p = 0.08): Immediate consequences weighted more than delayed consequences
- Relational context (OR = 1.89, η²p = 0.14): Relational stakeholders favored over transactional stakeholders
4.1.2. Classic Omission Bias Replicated (H3)
4.1.3. Relational Reasoning Explains Human-AI Differences (H4)
- ↑ Higher contextual responsiveness (+112% comparing Level 3 vs. Level 0)
- ↓ Lower arbitrary variation (-31% comparing Level 3 vs. Level 0)
4.1.4. Most Variation Is Systematic, But One-Third Remains Arbitrary
- Structural consistency: 84-87% agreement when only irrelevant features varied
- Contextual responsiveness: 22-24% variance attributable to debatable features
- Arbitrary variation: 32-34% unexplained residual variance
4.2. Philosophical Implications: The Context Sensitivity Paradox
4.2.1. The Principlist Interpretation: Widespread Bias
- "50 employees will lose jobs" → 42% choose protective option
- "Maria Rodriguez, single mother of three, and 49 colleagues will lose jobs" → 67% choose protective option
- 50 people harmed today = 61 people harmed in 2 years (morally equivalent under discount)
- Actively causing 50 job losses → 38% accept
- Passively allowing 50 job losses → 53% accept
- Implement structured decision protocols removing contextual details (de-identify stakeholders, standardize time horizons, use statistical aggregates)
- Train decision-makers to recognize and correct for these biases
- Deploy AI systems configured for minimal contextual sensitivity (temperature 0.3), leveraging their lower baseline CR (though our data show even temperature 0.3 exhibits significant effects)
- Audit decisions for consistency across framings; flag high CR as quality failure
4.2.2. The Particularist Interpretation: Appropriate Sensitivity
- Causal responsibility: Active causation involves stronger agency than passive allowance
- Autonomy violations: Actions impose will on others; omissions permit natural processes
- Moral psychology: Intention structures differ (doing vs. letting happen)
- Epistemic uncertainty: Distant consequences are genuinely more uncertain
- Opportunity costs: Immediate actions prevent future option spaces
- Psychological sustainability: Perfect temporal neutrality may be psychologically impossible for finite agents
- Preserve rich contextual information rather than de-identifying stakeholders
- Train decision-makers in care ethics and relational reasoning frameworks
- Configure AI systems for moderate contextual sensitivity (temperature 0.7), targeting the "Balanced" profile (Cluster 2)
- Audit for systematic patterns rather than raw consistency; flag low CR as potentially insensitive
4.2.3. Why Our Data Cannot Adjudicate
- Principlist: Pure bias (names are morally irrelevant)
- Particularist: Appropriate response to concreteness (persons vs. statistics)
- Principlist: Bias that happens to be systematic (still wrong)
- Particularist: Genuine moral competence (systematic = appropriate)
- Principlist: Humans more biased (AI superior at temperature 0.3)
- Particularist: Humans more morally sophisticated (AI impoverished)
- Principlist: Post-hoc rationalization (people choose frameworks to justify biases)
- Particularist: Appropriate framework selection (knowing when different principles apply)
- Identifiability effects exist (empirical)
- They are large and systematic (empirical)
- They correlate with relational reasoning (empirical)
- Whether identifiability is morally relevant (normative)
- Whether systematic sensitivity is virtuous or vicious (normative)
- Whether AI or humans are "correct" (normative)
4.2.4. Implications for Normative Ethics
- Systematic sensitivity to some contextual features (not zero)
- Principled insensitivity to other contextual features (not unlimited)
- Clear criteria distinguishing morally relevant from irrelevant context
- Better conditional consistency (within-framework coherence)
- Systematic framework selection (context-appropriate frameworks)
- But also higher overall variation and elevated arbitrary variation
4.2.5. A Tentative Synthesis: Bounded Particularism
- We selected T=0.7 because it produced human-like total variation (0.41 vs. 0.42)
- At T=0.7, AI over-represents Cluster 2 (48% vs. 31% humans)
- Cluster 2 has lowest arbitrary variation (AV=0.31)
- We call Cluster 2 "optimal" based on this low AV
- "The temperature we selected to match humans produces a profile that we then call optimal based on characteristics influenced by that temperature selection"?
- AI achieving Cluster 2 more frequently than humans (48% vs. 31%) is partly artifact of temperature selection
- At T=0.3: AI in Cluster 2 drops to 35% (closer to humans)
- At T=1.0: AI in Cluster 2 drops to 28% (below humans)
- Temperature directly affects where AI units fall in cluster space
- Cluster 2 having lowest AV (0.31) is observed across both humans and AI
- This holds even when clustering humans separately (human-only Cluster 2: AV=0.29)
- The Balanced profile's advantage (moderate CR with low AV) is not temperature-dependent
| Human-Only Cluster | % of Humans | SC | CR | AV |
| H1 (Principled) | 29% | 0.92 | 0.14 | 0.34 |
| H2 (Balanced) | 31% | 0.86 | 0.26 | 0.29 ← Lowest |
| H3 (Inconsistent) | 13% | 0.74 | 0.22 | 0.52 |
| H4 (Context-Driven) | 27% | 0.82 | 0.39 | 0.35 |
| Temperature | % AI in Balanced | % Humans in Balanced | Difference |
| 0.3 | 35% | 31% | +4 pp |
| 0.7 | 48% | 31% | +17 pp |
| 1.0 | 28% | 31% | -3 pp |
- T=0.7 makes AI more likely to exhibit Balanced profile than humans
- This is what makes T=0.7 seem "optimal" for AI deployment
- But it doesn't make Balanced profile itself optimal (that's empirically supported regardless of temperature)
- Human-only clustering shows Balanced profile has lowest AV
- This pattern replicates across age groups, education levels, and professional experience (see Appendix D.1: Demographic Analyses). Cross-cultural replication is a priority for future research, as our sample over-represents Western contexts (89%).
- Theoretical coherence: Pure principlism (Cluster 1) shows higher AV despite low CR
- That AI at T=0.7 achieves this profile more consistently than humans
- This says more about temperature calibration than moral reasoning
- What is optimal: Empirically supported (Balanced profile minimizes AV)
- How AI achieves it: Temperature-dependent (T=0.7 maximizes frequency)
- Some contextual features are morally relevant (contra pure principlism):
  - Relational obligations, concrete particularity, and causal structure plausibly matter morally
  - Low contextual sensitivity (human-only Cluster H1, CR=14%) shows higher arbitrary variation (34%) than moderate sensitivity (human-only Cluster H2, CR=26%, AV=29%)
  - This suggests rigid principlism may increase, not decrease, inconsistency
- Not all observed sensitivity is appropriate (contra pure particularism):
  - 32-34% arbitrary variation across sample indicates substantial unprincipled inconsistency
  - Temperature-dependent patterns in AI (CR ranges 12%-28% across T=0.3 to T=1.0) suggest some "sensitivity" is architectural artifact, not moral insight
  - Very high contextual sensitivity (human-only Cluster H4, CR=39%, AV=35%) doesn't further reduce arbitrary variation vs. moderate sensitivity (H2, AV=29%)
- Optimal judgment balances principles and context (synthesis):
  - The "Balanced-Integrative" profile (Cluster 2) achieves:
    - Moderate contextual responsiveness (CR=26%)
    - Minimal arbitrary variation (AV=29% in human-only clustering, lowest across human clusters)
    - Good structural consistency (SC=86%)
  - This profile represents neither pure principlism (CR too high) nor pure particularism (CR not maximized)
| Component | Target Range | Rationale |
| SC | >0.85 | High reliability when irrelevant features vary |
| CR | 0.20-0.30 | Systematic sensitivity without over-fitting |
| AV | <0.35 | Minimal unexplained randomness |
| Calibration | CR/AV >0.75 | Signal-to-noise ratio favoring systematic over random variation |
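As an illustration, the target ranges in the table above can be expressed as a simple check on a (SC, CR, AV) profile. This is a sketch only: the `Profile` class and `check_targets` function are hypothetical names, the thresholds are copied from the table, and the two example profiles are the human-only cluster means H2 and H1 reported earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    sc: float  # structural consistency
    cr: float  # contextual responsiveness
    av: float  # arbitrary variation (assumed > 0 for the ratio below)


def check_targets(p):
    """Return, per component, whether the profile meets the tabled target."""
    return {
        "SC": p.sc > 0.85,
        "CR": 0.20 <= p.cr <= 0.30,
        "AV": p.av < 0.35,
        "Calibration": (p.cr / p.av) > 0.75,  # signal-to-noise ratio CR/AV
    }


h2 = Profile(sc=0.857, cr=0.261, av=0.291)  # human-only Balanced-Integrative
h1 = Profile(sc=0.918, cr=0.138, av=0.336)  # human-only Principled-Consistent

h2_result = check_targets(h2)
h1_result = check_targets(h1)
```

Under these thresholds the H2 means satisfy all four targets, whereas the H1 means meet the SC and AV targets but fall short on CR and on the CR/AV ratio.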
- Start with general principles (default to consistency)
  - Identify applicable frameworks (utilitarian, deontological, care ethics, etc.)
  - Apply consistently across structurally similar cases
- Allow context to defeat defaults when specific features cross salience threshold
  - Identifiable stakeholders may warrant different treatment than statistical aggregates
  - Relational history may create special obligations
  - Temporal proximity may reflect epistemic uncertainty
  - Action/omission may track genuine moral distinctions
- Limit to theoretically justified features (not arbitrary framings)
  - Candidate features: identifiability, relational context, causal structure, temporal proximity
  - Exclude: presentation order, wording variations, irrelevant demographics
- Monitor for arbitrary variation (not all variation is wisdom)
  - Calculate individual AV scores
  - If AV >0.35, scrutinize decisions for unprincipled inconsistency
  - If CR >0.35 with high AV, may indicate over-sensitivity to irrelevant context
- ✓ Preserves principlist concern for consistency (high SC, low AV)
- ✓ Accommodates particularist insight about context (moderate CR)
- ✓ Provides empirical targets (CR≈0.25-0.30, AV<0.30, SC>0.85)
- ✓ Admits both humans and AI can achieve optimal profile (though AI does so more consistently at T=0.7)
- Which specific features are "theoretically justified"?
  - Our four features (identifiability, proximity, temporal, relational)?
  - Others we didn't measure?
  - Context-dependent (different features relevant in different domains)?
- What "salience threshold" should trigger context-sensitivity?
  - Always consider relational obligations?
  - Only when relationships cross duration/intensity threshold?
  - Calibrated to domain norms?
- How do we distinguish legitimate contextual defeating from bias?
  - Empirical criterion: Does feature variation reduce AV?
  - Normative criterion: Philosophical argument for moral relevance?
  - Pragmatic criterion: Stakeholder acceptance and organizational sustainability?
- Can the optimal profile (Cluster 2) be trained/achieved?
  - Our data show 31% of humans naturally in this cluster
  - Can ethics training move people from Clusters 1, 3, 4 → Cluster 2?
  - Can AI be calibrated to reliably achieve Cluster 2 profile?
- Circularity concern: We defined "optimal" as minimizing AV, but:
  - AV includes measurement error and unmeasured constructs
  - Low AV might reflect lack of sensitivity to legitimate but unmeasured features
  - "Optimal" is normatively loaded: it assumes consistency is virtuous
- Cluster 2 superiority not universally accepted:
  - Principlists might argue Cluster 1 (lowest CR) is optimal if we could eliminate their higher AV through better training
  - Particularists might argue Cluster 4 (highest CR) is optimal and their moderate AV reflects appropriate complexity
- Sample-specific findings:
  - Cluster structure might differ in:
    - Non-Western cultural contexts
    - Different professional domains
    - Higher-stakes real-world decisions
  - Our "optimal" profile may be optimal only for these scenarios
- Temperature-dependence undermines AI claims:
  - That AI achieves Cluster 2 more frequently (49% vs. 31%) is artifact of:
    - Temperature selected to match human variation
    - Deterministic sampling reducing noise
  - Not evidence of superior AI moral reasoning
- Conceptual clarity: Specifies what we mean by "appropriate" balance
- Empirical targets: Testable predictions about optimal profiles
- Practical guidance: Concrete metrics for training and deployment
- Middle path: Avoids extremes of rigid principlism and unprincipled relativism
- Test whether Cluster 2 profile predicts better outcomes (stakeholder satisfaction, decision quality, organizational performance)
- Develop interventions to move individuals toward Cluster 2
- Examine cross-cultural generalizability of cluster structure
- Philosophically defend (or critique) the normative assumption that minimizing AV is desirable
4.3. Practical Implications for Organizations
4.3.1. Implications for Ethics Training
- Principles-based approaches: Teach universal frameworks (utilitarian, deontological, rights-based) and encourage consistent application
- Case-based approaches: Develop judgment through exposure to diverse scenarios and contextual reasoning
- Low transfer: Cluster 1 (single-framework users) show higher arbitrary variation (0.41) than multi-framework users (0.31-0.35), suggesting rigid principle application doesn't reduce inconsistency
- Context insensitivity: Pure principlism requires ignoring features (identifiability, relational context) that may be morally relevant
- Psychological unrealism: Achieving very low contextual sensitivity (CR < 0.15) appears difficult for humans and may be undesirable
- Overfitting risk: High integrators (Level 3-4) show elevated arbitrary variation (0.35), suggesting unlimited context-sensitivity becomes unprincipled
- Framework confusion: Without clear decision procedures, case exposure may simply increase variation without improving judgment
- Lack of generalization: Conditional consistency (within-framework) is better than overall consistency, but only if framework selection itself is principled
- Teach multiple frameworks (utilitarian, deontological, care ethics) with clear scope conditions
- Identify morally relevant features explicitly (our four features provide starting point)
- Practice systematic framework selection (when does care ethics vs. utilitarianism apply?)
- Monitor arbitrary variation (use consistency checks to catch unprincipled variation)
- Target optimal profile (Cluster 2 parameters: CR ≈ 0.28, AV < 0.31)
- Consistency checks: Present identical scenarios with surface variations; flag unexplained differences
- Feature isolation: Present scenarios varying only identifiability or only relational context; discuss when variation is justified
- Framework mapping: For each framework, identify scenarios where it applies vs. doesn't apply
- Variation decomposition: Calculate individual CR/AV scores; provide feedback on sources of inconsistency
4.3.2. Implications for AI Governance
| Temperature | Mean CR | Mean AV | Coherence | Recommendation |
| 0.3 | 0.12 | 0.41 | 99.6% | Principlist contexts requiring consistency |
| 0.5 | 0.21 | 0.36 | 98.8% | Moderate sensitivity, low noise |
| 0.7 | 0.24 | 0.34 | 97.2% | Balanced (human-like) |
| 1.0 | 0.28 | 0.49 | 91.6% | High sensitivity but excessive noise |
- Legal compliance is paramount (minimal interpretation needed)
- Consistency across cases is essential (fairness as uniformity)
- Stakeholder anonymization is feasible and desirable
- Rapid decisions at scale (minimize computational variance)
- Context-sensitive judgment is valuable
- Human-like reasoning increases acceptance
- Relational factors may be relevant
- Explaining decisions to stakeholders matters
- Consistency matters (AV becomes problematic)
- Automated decision-making (coherence degrades)
- Accountability required (excessive variation complicates auditing)
- Generate AI recommendation with minimal contextual sensitivity
- Compare to human decision
- If difference > threshold AND no clear contextual justification → trigger review
- Generate recommendation with explicit framework and contextual considerations
- Present to human decision-maker as one input
- Human retains final authority but sees systematic contextual analysis
- Low-T model provides principled baseline
- Moderate-T model provides contextual analysis
- Compare recommendations; disagreement triggers human review
- Effectively implements "bounded particularism" architecturally
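The dual-model arrangement described above can be sketched as a small comparison routine. Everything here is illustrative: `dual_model_check`, `Review`, and the two stub models are hypothetical stand-ins for calls to a low-temperature and a moderate-temperature configuration, and no real model API is assumed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Review:
    baseline: str             # recommendation from the low-T (principled) model
    contextual: str           # recommendation from the moderate-T model
    needs_human_review: bool  # True when the two recommendations disagree


def dual_model_check(scenario: str,
                     low_t_model: Callable[[str], str],
                     moderate_t_model: Callable[[str], str]) -> Review:
    """Compare the two recommendations; disagreement triggers human review."""
    baseline = low_t_model(scenario)
    contextual = moderate_t_model(scenario)
    return Review(baseline, contextual, needs_human_review=(baseline != contextual))


# Hypothetical stubs standing in for real model calls.
def stub_low_t(scenario: str) -> str:
    return "proceed"


def stub_moderate_t(scenario: str) -> str:
    # Pretend the contextual model reacts to a named stakeholder.
    return "defer" if "Maria" in scenario else "proceed"


agree = dual_model_check("50 employees will lose jobs", stub_low_t, stub_moderate_t)
disagree = dual_model_check("Maria Rodriguez and 49 colleagues will lose jobs",
                            stub_low_t, stub_moderate_t)
```

Treating any disagreement as a trigger is the simplest reading of the list; a deployed version might instead compare graded recommendations against a threshold.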
- Transparency: Log temperature and sampling parameters; make clear that variation is architectural, not error
- Auditing: Track CR/AV metrics over time; flag drift from target profiles
- Human oversight: Require human approval for decisions in high-stakes domains
- Bias monitoring: Test for systematic disparities across demographic groups (contextual sensitivity could amplify existing biases)
- Framework documentation: Require AI to specify which ethical framework(s) informed recommendation
4.3.3. Implications for Decision Auditing
- Same scenario presented differently → different decisions
- Surface features (wording, presentation) driving outcomes
- Interpretation: Unambiguous reliability problem
- Variation unexplained by scenario features or principled frameworks
- Inconsistent application of stated decision criteria
- Interpretation: Problematic even under particularism
- Target CR < 0.20
- Flag high CR as excessive context-sensitivity
- Audit action: Check if variation is justified by clearly relevant features (magnitude, probability, rights)
- Target CR = 0.25-0.30
- Flag both very low CR (insensitivity) and very high CR (unprincipled)
- Audit action: Verify variation corresponds to theoretically justified features
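The two CR targets above can be expressed as a single audit helper. This is an illustrative sketch: the orientation labels are assumptions (the text ties the 0.25-0.30 band to bounded particularism; pairing CR < 0.20 with a principlist orientation is inferred), and treating any value outside the band as a flag is a simplification of "very low" and "very high".

```python
def audit_cr(cr: float, orientation: str) -> str:
    """Classify a contextual responsiveness (CR) value against the
    target for the organization's stated orientation."""
    if orientation == "principlist":
        # Target CR < 0.20; higher values are flagged.
        return "ok" if cr < 0.20 else "flag: excessive context-sensitivity"
    if orientation == "bounded_particularist":
        # Target CR = 0.25-0.30; flag both directions.
        if cr < 0.25:
            return "flag: possible insensitivity"
        if cr > 0.30:
            return "flag: possibly unprincipled"
        return "ok"
    raise ValueError(f"unknown orientation: {orientation}")


# CR values reported in the paper for T=0.3 and T=0.7.
print(audit_cr(0.12, "principlist"))
print(audit_cr(0.24, "bounded_particularist"))
```

A flag only starts the audit action listed above; it does not by itself establish that the variation is unjustified.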
- Calculate consistency within each framework (conditional consistency)
- Audit framework selection separately
- Flag as problematic only if:
  - Low conditional consistency (inconsistent application), OR
  - Unprincipled framework switching (no clear selection criteria)
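Conditional consistency can be made concrete with a small calculation. The paper's operational definition is not given in this section, so the sketch assumes one: within each framework, the share of decisions on a scenario that match the modal decision for that scenario, averaged over scenarios. The record format and the framework labels are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (framework, scenario_id, decision)


def conditional_consistency(records: Iterable[Record]) -> Dict[str, float]:
    """Mean agreement with the modal decision, computed separately
    within each framework (assumed definition)."""
    groups = defaultdict(list)
    for framework, scenario_id, decision in records:
        groups[(framework, scenario_id)].append(decision)

    per_framework = defaultdict(list)
    for (framework, _scenario), decisions in groups.items():
        modal_count = Counter(decisions).most_common(1)[0][1]
        per_framework[framework].append(modal_count / len(decisions))

    return {fw: sum(vals) / len(vals) for fw, vals in per_framework.items()}


# Hypothetical coded decisions, not data from the study.
records = [
    ("care", "s1", "A"), ("care", "s1", "A"), ("care", "s1", "B"),
    ("care", "s2", "A"),
    ("rights", "s1", "B"), ("rights", "s1", "B"),
]
print(conditional_consistency(records))
```

Framework selection is deliberately left out of this measure, in line with the recommendation to audit it separately.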
- Identifiability effect: η²p = 0.22 (high)
- Relational effect: η²p = 0.18 (high)
- Temporal effect: η²p = 0.06 (acceptable)
- Action/omission: η²p = 0.09 (acceptable)
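The high/acceptable labels above imply a cutoff that this section does not state. The sketch below assumes Cohen's conventional benchmark of 0.14 for a large partial eta squared, which happens to reproduce the four labels; the actual cutoff used by the authors may differ.

```python
from typing import Dict

# Assumption: Cohen's conventional "large effect" benchmark for partial
# eta squared. The paper's own cutoff is not stated in this section.
LARGE_EFFECT = 0.14

# Effect sizes as reported in the list above.
effects = {
    "identifiability": 0.22,
    "relational": 0.18,
    "temporal": 0.06,
    "action_omission": 0.09,
}


def flag_effects(effect_sizes: Dict[str, float],
                 cutoff: float = LARGE_EFFECT) -> Dict[str, str]:
    """Label each contextual feature by whether its effect size
    reaches the cutoff."""
    return {name: ("high" if eta >= cutoff else "acceptable")
            for name, eta in effect_sizes.items()}


print(flag_effects(effects))
```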
4.3.4. Implications for Stakeholder Communication
- Context influences decisions (attempting to hide this creates legitimacy gaps)
- Some context-sensitivity is appropriate (particularist framing)
- Variation is partially systematic (not arbitrary or biased)
| Stakeholder | Likely Preference | Communication Strategy |
| Regulators | Principlist (consistency) | Emphasize structural consistency, low AV |
| Employees | Particularist (contextual) | Emphasize relational reasoning, care ethics |
| Shareholders | Consequentialist | Emphasize outcome optimization |
| Community | Mixed | Acknowledge framework pluralism |
- Contradictory justifications (saying opposite things to different groups)
- Hiding true reasons (claiming principled consistency when actually context-driven)
- Post-hoc rationalization (fabricating justifications after the decision is made)
- What principles applied?
- What contextual features were considered?
- How were they weighted?
- What was the decision procedure?
4.4. Limitations and Future Directions
4.4.1. Methodological Limitations
- Reverse causation: High CR → increased relational language (post-hoc justification)
- Common cause: Personality traits → both RR and CR
- Reciprocal causation: Bidirectional relationship
- Real stakeholders and relationships
- Personal consequences for decision-maker
- Time pressure and incomplete information
- Organizational politics and power dynamics
- Do hypothetical judgments predict actual behavior? (The literature is mixed; see Bersoff, 1999; FeldmanHall et al., 2012)
- Are contextual effects stronger in real settings (personal stakes amplify biases) or weaker (professional norms constrain variation)?
- Field studies tracking actual organizational decisions with pre-registered coding
- Experience sampling: managers report real ethical decisions in near-real-time
- Archival analysis: code historical organizational decisions for contextual patterns
- Western cultural contexts (89% Western)
- Highly educated professionals (67% graduate degrees)
- Technology/healthcare/finance sectors (56%)
- Eastern vs. Western moral reasoning differs systematically (Nisbett et al., 2001)
- Education correlates with abstract reasoning and moral sophistication (Rest, 1986)
- Industry norms shape ethical judgment (Victor & Cullen, 1988)
- Cross-cultural replication (East Asia, Middle East, Africa, Latin America)
- Diverse occupational sampling (blue-collar, service sector, public sector)
- Developmental study (novice managers vs. experienced executives)
- Specific model versions (GPT-4 Jan 2025, Claude 3 Opus Feb 2024, Gemini 1.5 Pro Sep 2024)
- Training data cutoffs (each model trained on different time periods)
- Evolving reinforcement learning from human feedback (RLHF)
- Longitudinal tracking of model versions
- Comparison to specialized "ethics-focused" models as they emerge
- Analysis of training data composition effects on contextual sensitivity
- We selected temperature T=0.7 specifically because it produced human-like patterns
- At T=0.3, AI shows less variation; at T=1.0, more variation
- No principled way to identify "true" AI reasoning pattern
- Cannot claim AI "inherently" resembles humans
- Can claim: At typical deployment parameters, patterns are similar
- Should report results across temperature range (we did this in sensitivity analyses)
- Develop theory-driven temperature selection criteria
- Test other stochastic parameters (top-p, top-k) for robustness
- Compare deterministic AI approaches (e.g., chain-of-thought reasoning at T=0)
- Genuine inconsistency (problematic)
- Unmeasured individual differences (potentially legitimate)
- Measurement error (unavoidable)
- Subtle contextual features we didn't code (might be morally relevant)
- Qualitative analysis of high-AV cases to identify patterns
- Test-retest reliability studies to separate state vs. trait variation
- Expand feature coding to capture additional contextual dimensions
4.4.2. Theoretical and Interpretive Limitations
- Empirical benchmarks for normative theories (e.g., "pure principlism requires CR < 0.15, which is psychologically implausible")
- Evidence about consequences of different approaches (e.g., "high integrators show higher CR but also higher AV")
- Definitive answer to "which contextual features are morally relevant?"
- Proof that any particular variation pattern is morally right or wrong
- Interdisciplinary dialogue between empirical researchers and normative ethicists
- Development of normative theories that explicitly incorporate empirical constraints
- Expert elicitation studies asking moral philosophers to classify features as relevant/irrelevant
- Many responses combined multiple frameworks
- Framework boundaries are contested theoretically
- Coders may have systematically misclassified certain reasoning patterns
- Develop more granular framework taxonomy
- Use multiple independent coding teams with different theoretical backgrounds
- Validate framework classifications against self-reported ethical orientation
- Mediated by empathy/emotional response? (affective mechanism)
- Or enhanced moral salience of concrete persons? (cognitive mechanism)
- Or both?
- Driven by responsibility attribution?
- Or counterfactual reasoning about interventions?
- Or default inaction bias?
- Reflects genuine care ethics commitments?
- Or in-group favoritism bias?
- Or reciprocity heuristics?
- Process-tracing methods (think-aloud protocols, eye-tracking)
- Psychophysiological measures (emotional arousal during judgment)
- Computational modeling of decision processes
- Neuroscience approaches (fMRI studies of moral judgment)
- Direct stakeholders are often identifiable
- Immediate consequences are often more certain
- Relational stakeholders often have longer history
- Conjoint analysis with realistic feature correlations
- Multi-level modeling of feature interactions
- Case studies examining how features combine in actual decisions
4.4.3. Future Research Directions
- Longitudinal tracking of decision-makers' variation profiles
- Measure downstream outcomes:
  - Stakeholder satisfaction
  - Decision quality (by independent evaluation)
  - Organizational culture/trust
  - Legal/compliance issues
- Test whether "Cluster 2" profile (CR = 0.28, AV = 0.31) predicts better outcomes
- Compare novice vs. experienced managers
- Pre-post ethics training assessment
- Longitudinal panel following managers over 3-5 years
- Measure whether:
  - CR increases, decreases, or stabilizes
  - AV decreases (learning consistency)
  - Framework integration increases
- Replicate study in collectivist cultures (East Asia, Latin America)
- Test whether:
  - Relational effects are stronger in collectivist cultures
  - Identifiability effects differ (named individuals vs. in-group/out-group)
  - Different frameworks dominate (harmony-focused vs. rights-focused)
- Fine-tune models on decisions from high-performing humans (Cluster 2 profile)
- Test whether fine-tuned models:
  - Reduce AV while maintaining moderate CR
  - Show systematic framework selection
  - Better match human variation patterns across temperature range
- Manipulate emotional content independently of identification
- Test mediation by empathy (self-report, psychophysiological)
- Distinguish affective vs. cognitive mechanisms
- Manipulate relationship strength orthogonally to fairness considerations
- Include reciprocity measures
- Test whether effects persist when controlling for reciprocity
- Manipulate causal language with constant outcomes
- Measure responsibility attribution explicitly
- Test whether making causation salient eliminates effect
- Provide real-time feedback on CR/AV/SC metrics
- Flag unexplained variation for reflection
- Test whether feedback reduces AV without eliminating CR
- Require explicit framework identification
- Document contextual features considered
- Test whether documentation improves conditional consistency
- Deploy hybrid ensemble model (§4.3.2, Model 3)
- Compare decision quality to human-only or AI-only conditions
- Measure stakeholder acceptance across conditions
- Delphi study with moral philosophers
- Present empirical findings (e.g., "relational reasoning reduces AV")
- Ask: Does this evidence change your normative view of relational ethics?
- Iterative dialogue between empirical and normative researchers
- Classic trolley problems with contextual manipulations
- Test whether organizational effects generalize to personal ethics
- Compare professionals to general population
- Public officials facing resource allocation
- Test identifiability effects in policy contexts
- Compare elected officials to appointed administrators
- Physician decision-making with patient variation
- Test relational effects in doctor-patient relationships
- Compare to medical AI recommendation systems
- Judges/juries with defendant/victim variation
- Test whether legal training reduces contextual effects
- Compare to algorithmic sentencing recommendations
4.5. Conclusions and Recommendations
1. Contextual sensitivity is systematic, not random
2. Perfect consistency is neither achievable nor clearly desirable
3. Framework integration can be sophisticated or confused
4. Humans and AI differ primarily in relational reasoning
5. Empirical findings constrain but cannot resolve normative debates
- Acknowledge contextual sensitivity rather than demanding impossible consistency
- Set empirical targets (CR = 0.25-0.30, AV < 0.30, SC > 0.85) aligned with bounded particularism
- Audit variation components separately (structural, contextual, arbitrary)
- Document framework selection criteria to ensure principled rather than arbitrary context-sensitivity
- Configure AI systems intentionally with temperature matching organizational meta-ethical commitments
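The empirical targets in this list (CR = 0.25-0.30, AV < 0.30, SC > 0.85) lend themselves to a simple profile check. The sketch below is illustrative only; the function name is hypothetical, and the SC value in the example is an assumed placeholder because no SC figure is reported for the Cluster 2 profile in this section.

```python
from typing import Dict


def check_targets(cr: float, av: float, sc: float) -> Dict[str, bool]:
    """Report, metric by metric, whether a variation profile meets the
    stated targets: CR in [0.25, 0.30], AV < 0.30, SC > 0.85."""
    return {
        "CR": 0.25 <= cr <= 0.30,
        "AV": av < 0.30,
        "SC": sc > 0.85,
    }


# Cluster 2 profile from the text (CR = 0.28, AV = 0.31); SC = 0.86 is an
# assumed placeholder, not a reported value.
print(check_targets(cr=0.28, av=0.31, sc=0.86))
```

Under these thresholds the Cluster 2 profile as reported sits just above the AV ceiling, so the per-metric output is more informative than a single pass/fail result.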
- Teach multiple frameworks with explicit scope conditions
- Practice framework selection systematically, not ad hoc
- Provide feedback on variation profiles (individual CR/AV/SC metrics)
- Target optimal calibration (Cluster 2 profile), not zero sensitivity
- Distinguish morally relevant from irrelevant context explicitly
- Make temperature selection an explicit governance decision
- Deploy ensemble models combining different temperature settings
- Require framework documentation in AI-generated recommendations
- Monitor CR/AV/SC metrics as deployment KPIs
- Preserve human authority for high-stakes decisions in contested domains
- Field studies of actual organizational decisions
- Cross-cultural replication to test generalizability
- Mechanism experiments to identify underlying processes
- Longitudinal tracking of variation profile development
- Normative-empirical integration through interdisciplinary dialogue
- Principlist concerns (structural consistency 86%, low arbitrary variation)
- Particularist insights (systematic sensitivity to relational and concrete contextual features)
5. Conclusions
5.1. Summary of Contributions
- Overall variation: Both humans (42%) and AI (41%) show substantial variation across scenario presentations
- Component decomposition: 85% structural consistency, 22-24% contextual responsiveness, 32-34% arbitrary variation
- Feature effects: Identifiability (OR=2.08), relational proximity (OR=7.89), temporal proximity (OR=1.52), relational context (OR=1.89) all produce large, robust effects
- Human-AI similarity: At deployment parameters (temperature 0.7), AI replicates human patterns (no significant Source × Feature interactions)
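To give the odds ratios above an intuitive reading, they can be converted into a shift in probability from a chosen baseline. The sketch assumes a baseline probability of 0.50, which is a hypothetical starting point and not a figure from the study; the feature labels are copied as reported.

```python
def shift_probability(baseline: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline probability and return the
    resulting probability."""
    odds = baseline / (1.0 - baseline) * odds_ratio
    return odds / (1.0 + odds)


# Odds ratios as reported in the summary list; baseline is an assumption.
reported = [
    ("identifiability", 2.08),
    ("relational proximity", 7.89),
    ("temporal proximity", 1.52),
    ("relational context", 1.89),
]
for feature, odds_ratio in reported:
    p = shift_probability(0.50, odds_ratio)
    print(f"{feature}: 0.50 -> {p:.3f}")
```

Because the conversion depends on the baseline, the same odds ratio implies a smaller absolute shift when the baseline probability is already close to 0 or 1.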
- 85% structural consistency is achievable (uncontroversial; it should approach 100%)
- 32-34% arbitrary variation is problematic (uncontroversial; it should be minimized)
- 22-24% contextual responsiveness is the contested philosophical zone:
  - Principlism: Bias requiring elimination
  - Particularism: Appropriate moral sensitivity
- Structural Consistency (SC): Agreement when irrelevant features vary
- Contextual Responsiveness (CR): Variation attributable to debatable features
- Arbitrary Variation (AV): Unexplained residual
- Calibration: CR/AV ratio (signal-to-noise)
- Aim for calibrated sensitivity (moderate CR, low AV), not maximum consistency
- Make contextual features explicit in decision processes
- Use domain-specific consistency standards
- Audit for systematic bias vs. appropriate sensitivity
- Acknowledge that temperature/design choices embed normative commitments
- Develop explicit relational reasoning modules
- Target calibration (CR/AV > 0.8), not just consistency
- Report calibration metrics alongside performance metrics
- Specify calibration targets, not just "consistency"
- Audit for three components (SC, CR, AV) separately
- Set domain-specific standards
- Measure signal-to-noise, not just noise
5.2. The Irreducible Normative Question
- ✓ Substantial contextual variation exists (22-24% of variance) in both humans and AI at typical deployment parameters (T=0.7)¹
- ✓ Variation is systematic (framework-appropriate, reliable)
- ✓ Some variation is problematic (32-34% arbitrary)
- ✓ Moderate sensitivity outperforms zero sensitivity on reducing arbitrary variation
- ✓ Patterns are shared by humans and AI
- ✗ Whether contextual features should influence judgment
- ✗ What level of CR is "optimal" (depends on normative framework)
- ✗ How AI systems should be designed (depends on values)
- Explain why zero sensitivity increases arbitrary variation
- Explain framework-appropriate patterns (if the variation were pure bias, it should not be framework-specific)
- Develop training that reduces arbitrary variation while pursuing consistency
- Explain 32-34% arbitrary variation (what accounts for unjustified inconsistency?)
- Provide principles for when context matters (avoid "anything goes" relativism)
- Distinguish good from bad particularism (focused vs. diffuse sensitivity)
- Address organizational implementation (how to scale case-by-case judgment)
- Engage with empirical constraints (theories must explain observed patterns)
- Specify which features are relevant and why
- Provide decision procedures for contested cases
- Test whether normative principles improve calibration empirically
- We selected temperature T=0.7 because it produced human-like total variation (0.41 vs. 0.42, p=.56)
- At T=0.7, we found that AI exhibits human-like contextual sensitivity patterns
- We therefore cannot claim that AI "inherently" reasons like humans
- ✓ At deployment parameters that match human total variation, AI exhibits similar contextual responsiveness (CR ≈ 0.24 for both)
- ✓ Contextual effects exist across the full temperature range (T=0.3 to T=1.0, all p<.001)
- ✓ Temperature is a design choice that embeds normative commitments about desired reasoning patterns
- ✓ Organizations can calibrate AI to target specific variation profiles (principlist T=0.3, moderate T=0.7, particularist T=1.0)
- ✗ AI reasoning is fundamentally similar to human reasoning
- ✗ AI would exhibit human-like patterns at "default" or "optimal" settings absent calibration
- ✗ Human-AI convergence is independent of parameter selection
- ✗ AI "naturally" achieves the balanced profile (Cluster 2) more frequently than humans
- Low temperature (T=0.3): CR=0.12, AV=0.41, coherence=99.6%
  - Principlist-friendly: minimal contextual sensitivity
  - But paradoxically higher arbitrary variation than moderate temperature
- Moderate temperature (T=0.7): CR=0.24, AV=0.34, coherence=97.2%
  - Human-matched: similar CR and AV to human average
  - Balanced calibration between sensitivity and noise
- High temperature (T=1.0): CR=0.28, AV=0.49, coherence=91.6%
  - Particularist-friendly: high contextual sensitivity
  - But excessive arbitrary variation and coherence degradation
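The three temperature profiles can be compared on the calibration ratio (CR/AV) that the paper uses as a signal-to-noise measure. The sketch below uses only values reported in the text, including the CR/AV > 0.8 target from the summary of contributions and the Cluster 2 profile (CR = 0.28, AV = 0.31); the function and variable names are hypothetical.

```python
def calibration(cr: float, av: float) -> float:
    """Calibration ratio: contextual responsiveness over arbitrary variation."""
    return cr / av


CALIBRATION_TARGET = 0.8  # CR/AV target stated in the paper

# (CR, AV) per temperature, as reported above.
profiles = {0.3: (0.12, 0.41), 0.7: (0.24, 0.34), 1.0: (0.28, 0.49)}

for temperature, (cr, av) in profiles.items():
    ratio = calibration(cr, av)
    status = "meets" if ratio > CALIBRATION_TARGET else "below"
    print(f"T={temperature}: CR/AV = {ratio:.2f} ({status} target)")

# Cluster 2 profile for comparison.
print(f"Cluster 2: CR/AV = {calibration(0.28, 0.31):.2f}")
```

On these figures the ratios are roughly 0.29, 0.71, and 0.57 for T=0.3, 0.7, and 1.0, so none of the three settings reaches the 0.8 target, while the Cluster 2 profile comes out at roughly 0.90.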
- Contextual responsiveness (particularist value)
- Consistency (principlist value)
- Coherence (rational discourse requirement)
- ✓ Human-only clustering showing this profile minimizes arbitrary variation (Cluster H2: AV=0.29)
- ✓ Theoretical argument that neither extreme (pure principlism nor unbounded particularism) achieves low AV
- ✓ Practical considerations about realistic targets for ethics training
- ✗ Observation that AI "naturally" achieves this profile (AI frequency in Cluster 2 is temperature-dependent)
- ✗ Claim that this profile represents AI "reasoning" superiority
- "AI at temperature T=0.7 (selected to match human variation) exhibits pattern X"
- NOT "AI inherently exhibits pattern X"
- Practitioners deciding how to configure AI systems (temperature is a choice, not a given)
- Researchers interpreting human-AI comparisons (similarity is parameter-dependent)
- Philosophers evaluating whether AI "reasoning" provides evidence for moral theories (it does not; it provides evidence about what patterns emerge at different calibration settings)
5.3. Future Outlook
- Rule-based domains → principlism appropriate (maximize consistency)
- Relationship domains → particularism appropriate (calibrated sensitivity)
- Justice domains → framework pluralism (multiple legitimate principles)
- If high CR with low AV predicts better organizational outcomes, particularism vindicated
- If low CR predicts better outcomes, principlism vindicated
- If outcome-dependence itself emerges (context-specific optimality), pluralism vindicated
- Rigorous empirical measurement (which patterns exist?)
- Careful philosophical analysis (which patterns should exist?)
- Practical experimentation (which approaches work in organizations?)
- Iterative refinement (revise both principles and practices based on evidence)
Supplementary Materials
References
- Aristotle. Nicomachean ethics, 2nd ed.; Irwin, T., Translator; Hackett Publishing, 1999. (Original work published ca. 350 BCE). [Google Scholar]
- Askell, A.; Bai, Y.; Chen, A.; Drain, D.; Ganguli, D.; Henighan, T.; Kaplan, J. A general language assistant as a laboratory for alignment. arXiv 2021, arXiv:2112.00861. [Google Scholar] [CrossRef]
- Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Kaplan, J. Constitutional AI: Harmlessness from AI feedback. arXiv 2022, arXiv:2212.08073. [Google Scholar] [CrossRef]
- Baron, J. The effect of normative beliefs on anticipated emotions. Journal of Personality and Social Psychology 1992, 63(2), 320–330. [Google Scholar] [CrossRef]
- Baron, J.; Ritov, I. Omission bias, individual differences, and normality. Organizational Behavior and Human Decision Processes 2004, 94(2), 74–85. [Google Scholar] [CrossRef]
- Baron, J.; Spranca, M. Protected values. Organizational Behavior and Human Decision Processes 1997, 70(1), 1–16. [Google Scholar] [CrossRef]
- Bartels, D. M. Principled moral sentiment and the flexibility of moral judgment and decision making. Cognition 2008, 108(2), 381–417. [Google Scholar] [CrossRef]
- Bazerman, M. H.; Tenbrunsel, A. E. Blind spots: Why we fail to do what's right and what to do about it; Princeton University Press, 2011. [Google Scholar]
- Beauchamp, T. L.; Childress, J. F. Principles of biomedical ethics, 8th ed.; Oxford University Press, 2019. [Google Scholar]
- Bentham, J. An introduction to the principles of morals and legislation; Oxford University Press, 1996. (Original work published 1789). [Google Scholar]
- Bloom, P. Against empathy: The case for rational compassion; Ecco/HarperCollins, 2017. [Google Scholar]
- Bommasani, R.; Hudson, D. A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Liang, P. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Amodei, D. Language models are few-shot learners. Advances in Neural Information Processing Systems 2020, 33, 1877–1901. [Google Scholar]
- Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 2017, 30, 4299–4307. [Google Scholar]
- Cinelli, C.; Hazlett, C. Making sense of sensitivity: Extending omitted variable bias. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2020, 82(1), 39–67. [Google Scholar] [CrossRef]
- Cushman, F. Action, outcome, and value: A dual-system framework for morality. Personality and Social Psychology Review 2013, 17(3), 273–292. [Google Scholar] [CrossRef]
- Cushman, F.; Young, L.; Hauser, M. The role of conscious reasoning and intuition in moral judgment: Testing three principles of harm. Psychological Science 2006, 17(12), 1082–1089. [Google Scholar] [CrossRef]
- Dancy, J. Moral reasons; Blackwell, 1993. [Google Scholar]
- Dancy, J. Ethics without principles; Oxford University Press, 2004. [Google Scholar]
- Dancy, J. Moral particularism. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2013 Edition). 2013. Available online: https://plato.stanford.edu/archives/win2013/entries/moral-particularism/.
- DeYoung, C. G.; Quilty, L. C.; Peterson, J. B. Between facets and domains: 10 aspects of the Big Five. Journal of Personality and Social Psychology 2007, 93(5), 880–896. [Google Scholar] [CrossRef]
- Donaldson, T.; Preston, L. E. The stakeholder theory of the corporation: Concepts, evidence, and implications. Academy of Management Review 1995, 20(1), 65–91. [Google Scholar] [CrossRef]
- Dworkin, R. Taking rights seriously; Harvard University Press, 1977. [Google Scholar]
- European Parliament and Council. Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act); Official Journal of the European Union, 2024. [Google Scholar]
- Foot, P. The problem of abortion and the doctrine of the double effect. Oxford Review 1967, 5, 5–15. [Google Scholar]
- Frederick, S. Cognitive reflection and decision making. Journal of Economic Perspectives 2005, 19(4), 25–42. [Google Scholar] [CrossRef]
- Freeman, R. E. Strategic management: A stakeholder approach; Pitman, 1984. [Google Scholar]
- Freeman, R. E.; Harrison, J. S.; Wicks, A. C.; Parmar, B. L.; De Colle, S. Stakeholder theory: The state of the art; Cambridge University Press, 2010. [Google Scholar]
- Fried, C. Right and wrong; Harvard University Press, 1978. [Google Scholar]
- Gabriel, I. Artificial intelligence, values, and alignment. Minds and Machines 2020, 30(3), 411–437. [Google Scholar] [CrossRef]
- Ganguli, D.; Lovitt, L.; Kernion, J.; Askell, A.; Bai, Y.; Kadavath, S.; Clark, J. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv 2022, arXiv:2209.07858. [Google Scholar] [CrossRef]
- Gert, B. Common morality: Deciding what to do; Oxford University Press, 2004. [Google Scholar]
- Gilligan, C. In a different voice: Psychological theory and women's development; Harvard University Press, 1982. [Google Scholar]
- Goodman, N. Fact, fiction, and forecast; Harvard University Press, 1954. [Google Scholar]
- Graham, J.; Haidt, J.; Koleva, S.; Motyl, M.; Iyer, R.; Wojcik, S. P.; Ditto, P. H. Moral foundations theory: The pragmatic validity of moral pluralism. Advances in Experimental Social Psychology 2013, 47, 55–130. [Google Scholar]
- Graham, J.; Nosek, B. A.; Haidt, J.; Iyer, R.; Koleva, S.; Ditto, P. H. Mapping the moral domain. Journal of Personality and Social Psychology 2011, 101(2), 366–385. [Google Scholar] [CrossRef] [PubMed]
- Greene, J. D. Why are VMPFC patients more utilitarian? A dual-process theory of moral judgment explains. Trends in Cognitive Sciences 2007, 11(8), 322–323. [Google Scholar] [CrossRef]
- Greene, J. D. Moral tribes: Emotion, reason, and the gap between us and them; Penguin Press, 2013. [Google Scholar]
- Greene, J. D.; Sommerville, R. B.; Nystrom, L. E.; Darley, J. M.; Cohen, J. D. An fMRI investigation of emotional engagement in moral judgment. Science 2001, 293(5537), 2105–2108. [Google Scholar] [CrossRef]
- Haidt, J. The emotional dog and its rational tail: A social intuitionist approach to moral judgment. Psychological Review 2001, 108(4), 814–834. [Google Scholar] [CrossRef] [PubMed]
- Haidt, J. The righteous mind: Why good people are divided by politics and religion; Pantheon Books, 2012. [Google Scholar]
- Haidt, J.; Baron, J. Social roles and the moral judgement of acts and omissions. European Journal of Social Psychology 1996, 26(2), 201–218. [Google Scholar] [CrossRef]
- Haidt, J.; Hersh, M. A. Sexual morality: The cultures and emotions of conservatives and liberals. Journal of Applied Social Psychology 2001, 31(1), 191–221. [Google Scholar] [CrossRef]
- Hare, R. M. Moral thinking: Its levels, method, and point; Oxford University Press, 1981. [Google Scholar]
- Heider, F. The psychology of interpersonal relations; Wiley, 1958. [Google Scholar]
- Held, V. The ethics of care: Personal, political, and global; Oxford University Press, 2006. [Google Scholar]
- Hendrycks, D.; Burns, C.; Basart, S.; Critch, A.; Li, J.; Song, D.; Steinhardt, J. Aligning AI with shared human values. In Proceedings of the International Conference on Learning Representations (ICLR); 2021. [Google Scholar]
- Hooker, B.; Little, M. (Eds.) Moral particularism; Oxford University Press, 2000. [Google Scholar]
- Hsee, C. K.; Rottenstreich, Y. Music, pandas, and muggers: On the affective psychology of value. Journal of Experimental Psychology: General 2004, 133(1), 23–30. [Google Scholar] [CrossRef]
- Hume, D. A treatise of human nature, 2nd ed.; Selby-Bigge, L. A., Nidditch, P. H., Eds.; Oxford University Press, 1978. (Original work published 1739-1740). [Google Scholar]
- Jenni, K.; Loewenstein, G. Explaining the identifiable victim effect. Journal of Risk and Uncertainty 1997, 14(3), 235–257. [Google Scholar] [CrossRef]
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Fung, P. Survey of hallucination in natural language generation. ACM Computing Surveys 2023, 55(12), 1–38. [Google Scholar] [CrossRef]
- Kagan, S. The limits of morality; Oxford University Press, 1989. [Google Scholar]
- Kahneman, D. Thinking, fast and slow; Farrar, Straus and Giroux, 2011. [Google Scholar]
- Kahneman, D.; Tversky, A. Prospect theory: An analysis of decision under risk. Econometrica 1979, 47(2), 263–291. [Google Scholar] [CrossRef]
- Kant, I. Grounding for the metaphysics of morals, 3rd ed.; Ellington, J. W., Translator; Hackett Publishing, 1993. (Original work published 1785). [Google Scholar]
- Kant, I. Practical philosophy; Gregor, M. J., Translator; Cambridge University Press, 1996. [Google Scholar]
- Kohlberg, L. Essays on moral development, Vol. 2: The psychology of moral development; Harper & Row, 1984. [Google Scholar]
- Korsgaard, C. M. The sources of normativity; Cambridge University Press, 1996. [Google Scholar]
- Levinas, E. Totality and infinity: An essay on exteriority; Lingis, A., Translator; Duquesne University Press, 1969. [Google Scholar]
- Levinas, E. Ethics and infinity: Conversations with Philippe Nemo; Cohen, R. A., Translator; Duquesne University Press, 1985. [Google Scholar]
- Liao, S. M.; Wiegmann, A.; Alexander, J.; Vong, G. Putting the trolley in order: Experimental philosophy and the loop case. Philosophical Psychology 2012, 25(5), 661–671. [Google Scholar] [CrossRef]
- Lickona, T. Educating for character: How our schools can teach respect and responsibility; Bantam Books, 1991. [Google Scholar]
- Little, M. O. Moral generalities revisited. In Moral particularism; Hooker, B., Little, M., Eds.; Oxford University Press, 2000; pp. 276–304. [Google Scholar]
- MacIntyre, A. After virtue: A study in moral theory; University of Notre Dame Press, 1981. [Google Scholar]
- Markus, H. R.; Kitayama, S. Culture and the self: Implications for cognition, emotion, and motivation. Psychological Review 1991, 98(2), 224–253. [Google Scholar] [CrossRef]
- McCloskey, H. J. An examination of restricted utilitarianism. Philosophical Review 1957, 66(4), 466–485. [Google Scholar] [CrossRef]
- McDowell, J. Virtue and reason. The Monist 1979, 62(3), 331–350. [Google Scholar] [CrossRef]
- McDowell, J. Mind, value, and reality; Harvard University Press, 1998. [Google Scholar]
- McNaughton, D. Moral vision: An introduction to ethics; Blackwell, 1988. [Google Scholar]
- Mikhail, J. Universal moral grammar: Theory, evidence and the future. Trends in Cognitive Sciences 2007, 11(4), 143–152. [Google Scholar] [CrossRef]
- Mill, J. S. Utilitarianism; Crisp, R., Ed.; Oxford University Press, 1998. (Original work published 1861). [Google Scholar]
- Mitchell, J. P.; Banaji, M. R.; Macrae, C. N. The link between social cognition and self-referential thought in the medial prefrontal cortex. Journal of Cognitive Neuroscience 2005, 17(8), 1306–1315. [Google Scholar] [CrossRef]
- Moore, G. E. Principia ethica; Cambridge University Press, 1903. [Google Scholar]
- Nagel, T. The possibility of altruism; Princeton University Press, 1970. [Google Scholar]
- Nagel, T. The view from nowhere; Oxford University Press, 1986. [Google Scholar]
- National Institute of Standards and Technology. Artificial Intelligence Risk Management Framework (AI RMF 1.0); U.S. Department of Commerce, 2023. [Google Scholar] [CrossRef]
- Nichols, S.; Mallon, R. Moral dilemmas and moral rules. Cognition 2006, 100(3), 530–542. [Google Scholar] [CrossRef] [PubMed]
- Noddings, N. Caring: A feminine approach to ethics and moral education; University of California Press, 1984. [Google Scholar]
- Nozick, R. Anarchy, state, and utopia; Basic Books, 1974. [Google Scholar]
- Nussbaum, M. C. Love's knowledge: Essays on philosophy and literature; Oxford University Press, 1990. [Google Scholar]
- Nussbaum, M. C. Upheavals of thought: The intelligence of emotions; Cambridge University Press, 2001. [Google Scholar]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Lowe, R. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 2022, 35, 27730–27744. [Google Scholar]
- Parfit, D. Reasons and persons; Oxford University Press, 1984. [Google Scholar]
- Parfit, D. On what matters; Oxford University Press, 2011; Vols. 1-2. [Google Scholar]
- Petrinovich, L.; O'Neill, P. Influence of wording and framing effects on moral intuitions. Ethology and Sociobiology 1996, 17(3), 145–171. [Google Scholar] [CrossRef]
- Petrinovich, L.; O'Neill, P.; Jorgensen, M. An empirical study of moral intuitions: Toward an evolutionary ethics. Journal of Personality and Social Psychology 1993, 64(3), 467–478. [Google Scholar] [CrossRef]
- Pettit, P.; Brennan, G. Restrictive consequentialism. Australasian Journal of Philosophy 1986, 64(4), 438–455. [Google Scholar] [CrossRef]
- Piaget, J. The moral judgment of the child; Gabain, M., Translator; Free Press, 1965. (Original work published 1932). [Google Scholar]
- Portmore, D. W. Commonsense consequentialism: Wherein morality meets rationality; Oxford University Press, 2011. [Google Scholar]
- Prinz, J. J. The emotional construction of morals; Oxford University Press, 2007. [Google Scholar]
- Railton, P. Alienation, consequentialism, and the demands of morality. Philosophy & Public Affairs 1984, 13(2), 134–171. [Google Scholar]
- Rawls, J. A theory of justice; Harvard University Press, 1971. [Google Scholar]
- Rawls, J. Political liberalism; Columbia University Press, 1993. [Google Scholar]
- Raz, J. The truth in particularism. In Moral particularism; Hooker, B., Little, M., Eds.; Oxford University Press, 2000; pp. 48–78. [Google Scholar]
- Rest, J. R. Development in judging moral issues; University of Minnesota Press, 1979. [Google Scholar]
- Robinson, P. H.; Darley, J. M. Justice, liability, and blame: Community views and the criminal law; Westview Press, 1995. [Google Scholar]
- Ross, W. D. The right and the good; Oxford University Press, 1930. [Google Scholar]
- Sandel, M. J. Liberalism and the limits of justice; Cambridge University Press, 1982. [Google Scholar]
- Scanlon, T. M. What we owe to each other; Harvard University Press, 1998. [Google Scholar]
- Schein, C.; Gray, K. The theory of dyadic morality: Reinventing moral judgment by redefining harm. Personality and Social Psychology Review 2018, 22(1), 32–70. [Google Scholar] [CrossRef]
- Scherrer, N.; Shi, C.; Feder, A.; Blei, D. Evaluating the moral beliefs encoded in LLMs. Advances in Neural Information Processing Systems 2024, 36. [Google Scholar]
- Schnall, S.; Haidt, J.; Clore, G. L.; Jordan, A. H. Disgust as embodied moral judgment. Personality and Social Psychology Bulletin 2008, 34(8), 1096–1109. [Google Scholar] [CrossRef] [PubMed]
- Sen, A. Rational fools: A critique of the behavioral foundations of economic theory. Philosophy & Public Affairs 1977, 6(4), 317–344. [Google Scholar]
- Sen, A. The idea of justice; Harvard University Press, 2009. [Google Scholar]
- Shweder, R. A.; Much, N. C.; Mahapatra, M.; Park, L. The "big three" of morality (autonomy, community, divinity) and the "big three" explanations of suffering. In Morality and health; Brandt, A. M., Rozin, P., Eds.; Routledge, 1997; pp. 119–169. [Google Scholar]
- Sidgwick, H. The methods of ethics, 7th ed.; Macmillan, 1907. [Google Scholar]
- Simmons, G. Moral mimicry: Large language models produce moral rationalizations tailored to political identity. arXiv 2023, arXiv:2209.12106. [Google Scholar] [CrossRef]
- Singer, P. Famine, affluence, and morality. Philosophy & Public Affairs 1972, 1(3), 229–243. [Google Scholar]
- Singer, P. Practical ethics; Cambridge University Press, 1979. [Google Scholar]
- Singer, P. The life you can save: Acting now to end world poverty; Random House, 2009. [Google Scholar]
- Sloman, S. A. The empirical case for two systems of reasoning. Psychological Bulletin 1996, 119(1), 3–22. [Google Scholar] [CrossRef]
- Slovic, P. "If I look at the mass I will never act": Psychic numbing and genocide. Judgment and Decision Making 2007, 2(2), 79–95. [Google Scholar] [CrossRef]
- Small, D. A.; Loewenstein, G. Helping a victim or helping the victim: Altruism and identifiability. Journal of Risk and Uncertainty 2003, 26(1), 5–16. [Google Scholar] [CrossRef]
- Small, D. A.; Loewenstein, G.; Slovic, P. Sympathy and callousness: The impact of deliberative thought on donations to identifiable and statistical victims. Organizational Behavior and Human Decision Processes 2007, 102(2), 143–153. [Google Scholar] [CrossRef]
- Sorensen, T.; Moore, J.; Fisher, J.; Gordon, M.; Mireshghallah, N.; Rytting, C. M.; Choi, Y. Value kaleidoscope: Engaging AI with pluralistic human values, rights, and duties. Proceedings of the AAAI Conference on Artificial Intelligence 2024, 38(18), 19937–19947. [Google Scholar] [CrossRef]
- Spranca, M.; Minsk, E.; Baron, J. Omission and commission in judgment and choice. Journal of Experimental Social Psychology 1991, 27(1), 76–105. [Google Scholar] [CrossRef]
- Stanovich, K. E.; West, R. F. Individual differences in reasoning: Implications for the rationality debate? Behavioral and Brain Sciences 2000, 23(5), 645–665. [Google Scholar] [CrossRef]
- Starmans, C.; Bloom, P. When the spirit is willing, but the flesh is weak: Developmental differences in judgments about inner moral conflict. Psychological Science 2016, 27(11), 1498–1506. [Google Scholar] [CrossRef] [PubMed]
- Sunstein, C. R. Moral heuristics. Behavioral and Brain Sciences 2005, 28(4), 531–542. [Google Scholar] [CrossRef] [PubMed]
- Sunstein, C. R. How change happens; MIT Press, 2019. [Google Scholar]
- Talisse, R. B.; Aikin, S. F. Pragmatism: A guide for the perplexed; Continuum, 2008. [Google Scholar]
- Tassy, S.; Oullier, O.; Duclos, Y.; Coulon, O.; Mancini, J.; Deruelle, C.; Wicker, B. Disrupting the right prefrontal cortex alters moral judgement. Social Cognitive and Affective Neuroscience 2012, 7(3), 282–288. [Google Scholar] [CrossRef]
- Tetlock, P. E. Thinking the unthinkable: Sacred values and taboo cognitions. Trends in Cognitive Sciences 2003, 7(7), 320–324. [Google Scholar] [CrossRef]
- Thomson, J. J. The trolley problem. Yale Law Journal 1985, 94(6), 1395–1415. [Google Scholar] [CrossRef]
- Thomson, J. J. The realm of rights; Harvard University Press, 1990. [Google Scholar]
- Tronto, J. C. Moral boundaries: A political argument for an ethic of care; Routledge, 1993. [Google Scholar]
- Turiel, E. The development of social knowledge: Morality and convention; Cambridge University Press, 1983. [Google Scholar]
- Uhlmann, E. L.; Pizarro, D. A.; Tannenbaum, D.; Ditto, P. H. The motivated use of moral principles. Judgment and Decision Making 2009, 4(6), 476–491. [Google Scholar] [CrossRef]
- Unger, P. Living high and letting die: Our illusion of innocence; Oxford University Press, 1996. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems 2017, 30, 5998–6008. [Google Scholar]
- Waldron, J. Theoretical foundations of liberalism. The Philosophical Quarterly 1987, 37(147), 127–150. [Google Scholar] [CrossRef]
- Walzer, M. Spheres of justice: A defense of pluralism and equality; Basic Books, 1983. [Google Scholar]
- Walzer, M. Thick and thin: Moral argument at home and abroad; University of Notre Dame Press, 1994. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 2022, 35, 24824–24837. [Google Scholar]
- Weidman, A. C.; Sowden, W. J.; Berg, M. K.; Kross, E. Punish or protect? How close relationships shape responses to moral violations. Personality and Social Psychology Bulletin 2020, 46(5), 693–708. [Google Scholar] [CrossRef]
- Wheatley, T.; Haidt, J. Hypnotic disgust makes moral judgments more severe. Psychological Science 2005, 16(10), 780–784. [Google Scholar] [CrossRef]
- Williams, B. A critique of utilitarianism. In Utilitarianism: For and against; Smart, J. J. C., Williams, B., Eds.; Cambridge University Press, 1973; pp. 77–150. [Google Scholar]
- Williams, B. Ethics and the limits of philosophy; Harvard University Press, 1985. [Google Scholar]
- Wistrich, A. J.; Guthrie, C.; Rachlinski, J. J. Can judges ignore inadmissible information? The difficulty of deliberately disregarding. University of Pennsylvania Law Review 2005, 153(4), 1251–1345. [Google Scholar] [CrossRef]
- Young, L.; Saxe, R. When ignorance is no excuse: Different roles for intent across moral domains. Cognition 2011, 120(2), 202–214. [Google Scholar] [CrossRef] [PubMed]
- Zheng, R.; Consoli, S.; Zhao, L. Large language models for ethics: A systematic literature review. arXiv 2023, arXiv:2308.12711. [Google Scholar]
1. AI patterns were observed at temperature = 0.7, which was selected post hoc to match human total variation levels. At lower temperatures (T = 0.3), AI shows significantly less contextual responsiveness (CR = 0.12 vs. human 0.27, p < .001); at higher temperatures (T = 1.0), AI shows similar CR (0.28) but with elevated arbitrary variation (AV = 0.49 vs. human 0.33, p < .001) and degraded coherence. Human-AI similarity is thus parameter-dependent, not inherent to AI reasoning architecture. However, the existence and direction of the contextual effects (identifiability OR > 1.3, relational OR > 1.4, temporal OR > 1.2, action-omission d > 0.5) persist across all tested temperatures (T = 0.3 to T = 1.0, all p < .001), which supports the robustness of the core findings independent of calibration choices.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).