2. Methods
2.1. Source Corpus and Parsing
The source corpus comprises 67 remedy-level text files representing the full published scope of Hahnemann’s Materia Medica Pura. Each file encodes the proving record for a single remedy, including symptom entries numbered sequentially within the original text, authority attributions identifying the provers and observers credited for each symptom, source references indicating the edition from which the entry was drawn, preparation notes, and introductory prose. The files span remedies lettered A through V and reflect the internal structural heterogeneity characteristic of a work compiled and revised across multiple editions by different editorial hands.
Parsing proceeded in two stages: structural segmentation and metadata extraction. Structural segmentation identified symptom boundaries using authority-block-constrained extraction rules that distinguished symptom entries from embedded editorial commentary. Each symptom entry was assigned a sequential identifier within its remedy, preserving the original numbering for downstream validation. Source-reference strings were normalized to edition-like patterns. Raw authority strings embedded in symptom blocks were extracted without normalization at this stage, preserving the full variant range for the subsequent reconciliation workflow. Contamination cleanup removed residual formatting artifacts and source-reference fragments that had migrated into symptom text fields during extraction.
Metadata extraction recovered six fields per remedy: source, source_reference, common_name, preparation, introduction, and authorities. Field coverage across the 67-remedy corpus was: source and source_reference at 79.10% (53 of 67 remedies), common_name and preparation at 74.63% (50 of 67), introduction at 91.04% (61 of 67), and authorities at 79.10% (53 of 67). Coverage gaps reflect the variable completeness of source edition metadata rather than parsing failures; no remedies were missing from parsed output and no extra parsed files were produced without a corresponding source. Empty symptom texts were absent across the full corpus.
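The coverage figures above are simple populated-field ratios. A minimal sketch of that calculation follows, assuming each parsed remedy is available as a dictionary keyed by the six field names; the function name and the rule that whitespace-only values count as unpopulated are illustrative assumptions, not the pipeline's actual code.

```python
FIELDS = ["source", "source_reference", "common_name",
          "preparation", "introduction", "authorities"]


def field_coverage(remedies):
    """Return {field: (populated_count, percent)} for a list of remedy dicts.

    Assumption: a field counts as populated when its value is non-empty
    after whitespace stripping (empty lists and None count as missing).
    """
    total = len(remedies)
    coverage = {}
    for field in FIELDS:
        populated = sum(1 for r in remedies if str(r.get(field) or "").strip())
        percent = round(100.0 * populated / total, 2) if total else 0.0
        coverage[field] = (populated, percent)
    return coverage
```

Applied to a 67-remedy list in which 53 records carry a source value, this returns (53, 79.1) for that field, matching the reported 79.10%.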
The total canonical symptom count after parsing was 31,086 across 67 remedies, yielding a mean of 463.97 symptoms per remedy, a median of 343.00, and a range of 107 (Bismuthum) to 1,465 (Mercurius). Parsed outputs were produced in parallel JSON and Markdown formats, with both sets achieving full 67-of-67 parity against the source corpus.
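The descriptive statistics can be reproduced from a per-remedy count table. The sketch below assumes a dictionary mapping remedy name to canonical symptom count; the function name and output keys are hypothetical.

```python
from statistics import mean, median


def symptom_count_summary(counts):
    """Summarize per-remedy symptom counts given as {remedy: count}."""
    values = list(counts.values())
    smallest = min(counts, key=counts.get)
    largest = max(counts, key=counts.get)
    return {
        "total": sum(values),
        "mean": round(mean(values), 2),
        "median": median(values),
        "min": (smallest, counts[smallest]),
        "max": (largest, counts[largest]),
    }
```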
2.2. Canonicalization Audit
The canonicalization audit assessed symptom text uniqueness and cross-remedy repetition. From 31,086 symptom entries, 29,649 unique symptom texts were identified, with 436 texts occurring in two or more remedies. The ten most frequently repeated texts were: “Vertigo.” (23 remedies), “Nausea.” (18), “Diarrhoea.” (18), “Vomiting.” (15), “Palpitation of the heart.” (13), “Inclination to vomit.” (13), “Syncope.” (12), “Anxiety.” (12), “Tightness of the chest.” (11), and “Sleeplessness.” (11). Repetition at this frequency reflects shared pathophysiological territory across the tested substances rather than extraction error; the distribution is consistent with what would be expected of materia medica provings drawn from the same methodological tradition.
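A cross-remedy repetition inventory of this kind can be sketched as follows. The input shape (an iterable of remedy and symptom-text pairs) and the exact-match comparison after whitespace stripping are assumptions made for illustration; the audit itself may apply further text canonicalization.

```python
from collections import defaultdict


def repeated_texts(symptoms, min_remedies=2):
    """Return (text, remedy_count) pairs for texts found in several remedies.

    symptoms: iterable of (remedy, symptom_text) pairs.
    Results are ordered by descending remedy count, then alphabetically.
    """
    remedies_by_text = defaultdict(set)
    for remedy, text in symptoms:
        remedies_by_text[text.strip()].add(remedy)
    repeated = [(text, len(remedies))
                for text, remedies in remedies_by_text.items()
                if len(remedies) >= min_remedies]
    repeated.sort(key=lambda item: (-item[1], item[0]))
    return repeated
```

Counting distinct remedies rather than raw occurrences keeps a text that recurs within a single remedy from being reported as cross-remedy repetition.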
A snapshot-freeze mechanism was applied before and after candidate clustering to preserve a stable audit baseline. Repeated symptom texts were inventoried by remedy to support downstream disambiguation. Non-sequential symptom numbering was confirmed in 48 of 67 remedies, and one remedy (Ambra grisea) carried non-bracketed authority markers. Both conditions were logged as structural features of the source edition rather than parsing failures. The source_reference field was edition-like across all 53 remedies where it was populated, with zero non-edition-like exceptions.
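Non-sequential numbering can be detected with a simple pass over each remedy's original symptom numbers. The sketch below assumes that "sequential" means starting at 1 and increasing by exactly 1; that definition and the function names are illustrative.

```python
def numbering_breaks(numbers):
    """Return (previous, current) pairs where current is not previous + 1."""
    return [(prev, cur)
            for prev, cur in zip(numbers, numbers[1:])
            if cur != prev + 1]


def is_sequential(numbers):
    """True when numbering starts at 1 and increases by 1 with no breaks."""
    return bool(numbers) and numbers[0] == 1 and not numbering_breaks(numbers)
```

Returning the break positions, rather than only a flag, lets skipped and repeated numbers be logged as source features for later review.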
2.3. Section-Aware Similarity Modeling
Remedy similarity was computed at section level rather than at full-remedy level, to preserve the internal structural differentiation of the proving record. Collapsing all symptom entries for a remedy into a single vector would suppress the anatomical and functional specificity that distinguishes one remedy from another in clinical use; section-level comparison retains that specificity while enabling quantitative pairwise analysis.
Each remedy’s symptom corpus was partitioned into named anatomical and functional sections (including, for example, MAIN, HEAD, EYE, CHEST, STOMACH, and EXTREMITIES) using structural markers embedded in the source files. Section-level vectors were constructed using term-frequency representations of symptom text tokens within each section. Only sections with substantive symptom content were included in the similarity computation.
Section-layer conservation validation confirmed that the analytical expansion preserved canonical symptom totals exactly: the section-layer row count equalled 31,086, matching the canonical parsed count, with a mismatch count of zero and a conservation status of PASS. The section layer comprised 77 section units, covering every distinct section across the corpus where multi-section structure was present.
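A conservation check of this form compares per-remedy counts before and after section expansion. The following is a minimal sketch under assumed input shapes (a canonical count per remedy, and section-layer rows as remedy, section, symptom-identifier triples); the report keys are hypothetical.

```python
from collections import Counter


def conservation_check(canonical_counts, section_rows):
    """Compare canonical per-remedy counts with section-layer row counts.

    canonical_counts: {remedy: symptom_count}
    section_rows: iterable of (remedy, section, symptom_id) triples
    """
    section_counts = Counter(remedy for remedy, _section, _sid in section_rows)
    remedies = set(canonical_counts) | set(section_counts)
    mismatches = sorted(
        r for r in remedies
        if canonical_counts.get(r, 0) != section_counts.get(r, 0)
    )
    return {
        "canonical_total": sum(canonical_counts.values()),
        "section_rows": sum(section_counts.values()),
        "mismatch_count": len(mismatches),
        "status": "PASS" if not mismatches else "FAIL",
    }
```

Checking per remedy rather than only the grand total guards against offsetting errors, where one remedy gains a row and another loses one.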
Pairwise cosine similarity was computed across all 77 section units, yielding a 77 × 77 matrix of 5,929 cells; because the matrix is symmetric, it contains 2,926 unique unordered pairs once the 77 self-comparisons are excluded. The strongest similarity between distinct units was between Hyoscyamus Niger::MAIN and Stramonium::MAIN (cosine similarity 0.0259). The low absolute magnitude of the strongest pair reflects the lexical diversity characteristic of the Materia Medica Pura corpus and indicates that each proving record carries a substantially distinct symptom vocabulary. Neighborhood extraction identified the highest-similarity section partners for each unit, producing the comparative structures that support practitioner-facing differential profiles described in the companion manuscript (M2).
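For concreteness, term-frequency cosine similarity over section units can be sketched as below. The tokenizer (lowercased alphabetic runs), the absence of stop-word filtering or weighting, and the "Remedy::SECTION" key format used in the example are assumptions; the production vectorization may differ.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tf_vector(texts):
    """Build a raw term-frequency vector from a list of symptom texts."""
    tokens = []
    for text in texts:
        tokens.extend(re.findall(r"[a-z]+", text.lower()))
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarity(sections):
    """sections: {unit_name: [symptom_text, ...]} -> {(unit, unit): cosine}."""
    vectors = {name: tf_vector(texts) for name, texts in sections.items()}
    return {(x, y): cosine(vectors[x], vectors[y])
            for x, y in combinations(sorted(vectors), 2)}
```

For n section units, the full similarity matrix has n² cells, while this function returns only the n(n − 1)/2 unordered pairs of distinct units; with n = 77 those figures are 5,929 and 2,926 respectively.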
2.4. Authority and Provenance Graph Construction
Provenance graph construction proceeded through three steps: authority string normalization, entity catalog creation, and multi-layer edge construction.
Authority strings extracted during parsing were normalized to canonical entity records by resolving abbreviations, variant spellings, edition-specific shorthand, and initialisms used inconsistently across the source texts. Normalization was applied before entity deduplication to prevent the same prover or editor from appearing as multiple distinct nodes in the graph. The resulting authority entity catalog comprises 2,732 unique entities, representing individual provers, editors, translators, and cited clinical observers named within the Materia Medica Pura source texts.
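The normalize-then-deduplicate order can be illustrated with a small alias lookup. The alias table below is a two-entry illustrative stand-in, not the project's actual catalog, and the punctuation rules are assumptions.

```python
import re

# Illustrative alias table (assumption): normalized key -> canonical entity.
ALIASES = {
    "fr h-n": "Friedrich Hahnemann",
    "fr hahnemann": "Friedrich Hahnemann",
}


def normalize_key(raw):
    """Strip brackets, drop separator punctuation, collapse spaces, lowercase."""
    key = raw.strip().strip("[]()")
    key = re.sub(r"[.,;:]", " ", key)
    return re.sub(r"\s+", " ", key).strip().lower()


def canonical_entity(raw, aliases=ALIASES):
    """Map a raw authority string to its canonical entity name.

    Strings without a known alias fall back to their normalized key.
    """
    key = normalize_key(raw)
    return aliases.get(key, key)
```

Because both variant spellings resolve to one canonical name before deduplication, they yield a single graph node rather than two.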
Two classes of directed edges were then constructed. Remedy-authority edges link each remedy to the authority entities credited within its text; 1,682 such edges were generated. These edges support queries of the form: which authorities are cited in a given remedy? Remedy-symptom-authority edges link individual symptom records to the authority entities associated with them, providing line-level provenance traceability at the symptom grain; 16,764 such edges were generated. The disparity in scale between the two edge classes reflects the many-to-many nature of authority attribution within the corpus: a single authority entity may be linked to many individual symptoms across one or more remedies, while a single remedy carries a bounded set of distinct authority credits.
The base provenance graph, produced before footnote integration, comprised 18,446 rows, equal to the sum of the two edge classes (1,682 + 16,764). After footnote-node integration the graph reached 18,561 rows, an increase of 115. The combined graph supports three query patterns: authority-scoped symptom retrieval, remedy-scoped authority attribution, and symptom-scoped provenance tracing. All three patterns are surfaced in the practitioner-facing outputs described in the companion manuscript.
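The two edge classes and the three query patterns can be sketched with plain sets of tuples. This is an illustration only: the record shape and function names are hypothetical, and the sketch derives remedy-authority edges solely from symptom-level attributions, whereas the actual graph may also draw on authorities credited elsewhere in a remedy's text.

```python
def build_edges(records):
    """Build both edge classes from symptom records.

    records: iterable of (remedy, symptom_id, [authority, ...]).
    Returns (remedy_authority_edges, symptom_authority_edges) as sets.
    """
    remedy_authority = set()
    symptom_authority = set()
    for remedy, symptom_id, authorities in records:
        for authority in authorities:
            remedy_authority.add((remedy, authority))
            symptom_authority.add((remedy, symptom_id, authority))
    return remedy_authority, symptom_authority


def authorities_for_remedy(remedy_authority, remedy):
    """Remedy-scoped authority attribution."""
    return sorted(a for r, a in remedy_authority if r == remedy)


def symptoms_for_authority(symptom_authority, authority):
    """Authority-scoped symptom retrieval."""
    return sorted((r, s) for r, s, a in symptom_authority if a == authority)


def provenance_for_symptom(symptom_authority, remedy, symptom_id):
    """Symptom-scoped provenance tracing."""
    return sorted(a for r, s, a in symptom_authority
                  if (r, s) == (remedy, symptom_id))
```

The remedy-level set deduplicates repeated credits, which is why it stays small relative to the symptom-level set as attributions accumulate.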
2.5. Temporal Onset Signal Extraction
Hahnemann’s proving records contain explicit and relative temporal markers indicating when symptoms appeared during the course of a proving experiment. These markers encode information about onset latency and symptom sequence that is absent from the symptom text alone. Retaining them as a structured feature of the corpus rather than discarding them as prose incidentals enables timeline-stratified comparison across remedies and preserves the phase-sequencing signal embedded in the original proving records.
A dedicated extraction pass identified and classified onset language across the full corpus. Onset cue patterns included explicit time references (for example “after one hour,” “on the third day”), relative markers (“soon after,” “immediately,” “later”), and sequence-indicating language (“first,” “subsequently,” “at last”). Each extracted event was assigned to one of six buckets: five quantitative intervals (immediate to 1h, 1h to 12h, 12h to 48h, 2d to 7d, and over 7d) and one bucket for qualitative timing cues. The qualitative bucket captures onset language that specifies temporal sequence or relative order without providing a measurable interval; it was retained as a distinct category rather than discarded, preserving the signal value of imprecisely reported onset events.
The extraction yielded 8,607 records across the full corpus. The bucket distribution was: immediate to 1h, 1,376 records (15.99%); 1h to 12h, 2,486 (28.88%); 12h to 48h, 1,670 (19.40%); 2d to 7d, 656 (7.62%); over 7d, 187 (2.17%); qualitative timing cues, 2,232 (25.93%). The 1h to 12h bucket contained the largest share of quantitatively bounded events. Qualitative-only cues accounted for approximately one quarter of all onset records, reflecting the variable precision of temporal reporting across provers and substances.
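The bucket assignment and the share calculation can be sketched as follows. Several details are assumptions made for illustration: only "after N unit" phrasing is handled, the number-word table is partial, upper bucket boundaries are treated as inclusive, and every relative or sequence cue is routed to the qualitative bucket. The production cue patterns are broader.

```python
import re

# Partial number-word table (assumption); digits are parsed directly.
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4,
                "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
                "ten": 10, "twelve": 12}
UNIT_HOURS = {"minute": 1 / 60, "hour": 1.0, "day": 24.0, "week": 168.0}

EXPLICIT = re.compile(r"\bafter\s+(\d+|[a-z]+)\s+(minute|hour|day|week)s?\b",
                      re.IGNORECASE)
QUALITATIVE = re.compile(
    r"\b(immediately|soon after|later|first|subsequently|at last)\b",
    re.IGNORECASE)


def bucket_for_hours(hours):
    """Map an onset latency in hours to a bucket (upper bounds inclusive)."""
    if hours <= 1:
        return "immediate to 1h"
    if hours <= 12:
        return "1h to 12h"
    if hours <= 48:
        return "12h to 48h"
    if hours <= 168:
        return "2d to 7d"
    return "over 7d"


def classify_onset(text):
    """Return a bucket label for the first onset cue found, else None."""
    match = EXPLICIT.search(text)
    if match:
        raw, unit = match.group(1).lower(), match.group(2).lower()
        value = int(raw) if raw.isdigit() else NUMBER_WORDS.get(raw)
        if value is not None:
            return bucket_for_hours(value * UNIT_HOURS[unit])
    if QUALITATIVE.search(text):
        return "qualitative timing cues"
    return None


def bucket_shares(counts):
    """Convert {bucket: count} into {bucket: percent rounded to 2 places}."""
    total = sum(counts.values())
    return {bucket: round(100.0 * n / total, 2) for bucket, n in counts.items()}
```

Applying `bucket_shares` to the six reported counts reproduces the percentages given above, and the counts sum to the reported total of 8,607.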
Timed-event burden at the remedy level was highest for Nux Vomica (468 events), Pulsatilla (423), Belladonna (315), Ignatia (315), and China (290). These counts reflect both the total symptom richness of these remedies and the relative detail of their proving narratives.
2.6. Editorial Footnote Integration
The Materia Medica Pura source texts contain brace-delimited footnotes that carry editorial commentary, secondary action annotations, comparative clinical observations, and dose or preparation context attached to individual symptom entries. These footnote-prose blocks constitute an interpretive layer distinct from the primary proving record, requiring separate extraction logic and a governed promotion pathway before integration into the provenance graph.
Footnote extraction applied a two-tier approach. The strict tier targeted well-formed brace-delimited blocks with unambiguous editorial markers, extracting them with high confidence. The candidate tier captured blocks meeting structural brace-delimited criteria but requiring confidence-based evaluation before integration; candidate blocks below the strict threshold were scored and either promoted automatically or queued for manual adjudication. In the current production run, 62 editorial note blocks met the confidence threshold for automatic promotion. Manual promotion overrides were applied in 5 cases following adjudication review. The total number of editorial note blocks integrated into the provenance graph was 115.
Promoted footnote nodes were linked to their parent symptom records and to the remedies in which they appeared, extending the provenance graph from the base 18,446-row structure to the final 18,561-row integrated graph. The promotion-gated integration design prevents unreviewed or ambiguous footnote content from entering the graph as authoritative evidence, preserving the traceability and auditability of the corpus layer.
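The two-tier routing described in this section can be sketched as a small decision function. The threshold value, the confidence score, the editorial-marker flag, and all names below are hypothetical placeholders; the source does not specify how confidence is computed or what threshold was used.

```python
import re
from dataclasses import dataclass

# Matches non-nested brace-delimited blocks.
BRACED = re.compile(r"\{([^{}]+)\}")


@dataclass
class FootnoteBlock:
    block_id: str
    text: str
    has_editorial_marker: bool  # assumed output of an upstream detector
    score: float                # hypothetical confidence in [0, 1]


def extract_braced(symptom_text):
    """Return the contents of brace-delimited blocks in a symptom text."""
    return [body.strip() for body in BRACED.findall(symptom_text)]


def route(block, threshold=0.8, manual_overrides=frozenset()):
    """Assign a block to a promotion outcome.

    The 0.8 threshold is an illustrative assumption, not a reported value.
    """
    if block.has_editorial_marker:
        return "strict"
    if block.score >= threshold:
        return "auto_promoted"
    if block.block_id in manual_overrides:
        return "manually_promoted"
    return "queued"
```

Only blocks routed to "strict", "auto_promoted", or "manually_promoted" would be linked into the graph; "queued" blocks stay outside it until adjudicated.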
2.7. Potency Marker Normalization
A subset of symptom entries and footnote blocks contain potency-like markers using the notation “1.c.” or “l.c.”, the latter arising from optical character recognition errors in which the numeral “1” was misread as the letter “l”. Correctly classifying these markers matters for downstream interpretation: a potency notation carries dose-context information, while a citation-like use of the same character sequence would not.
A combined extraction pass applied two complementary mechanisms: a marker detector targeting the 1.c./l.c. string pattern and a phrase detector targeting explicit potency language in the surrounding textual context. The combined pass identified 169 events. Classification by local context distinguished 168 likely potency-context events (99.41%) from 1 citation-like marker event (0.59%). Within the marker-identified subset of three events, two were confirmed as OCR normalizations from “l. c.” to “1. c.”, and all three were classified as potency-context.
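The marker detector, OCR normalization, and context classification can be sketched together. The potency vocabulary, the 40-character context window, and the rule that a marker without nearby potency language is citation-like are illustrative assumptions; the phrase detector for explicit potency language without a marker is not shown.

```python
import re

# Matches "1.c.", "l.c.", "1. c.", and "l. c." as standalone tokens.
MARKER = re.compile(r"\b([1l])\.\s?c\.")
# Hypothetical potency vocabulary used to inspect the local context.
POTENCY_PHRASE = re.compile(r"\b(potency|dilution|attenuation)\b",
                            re.IGNORECASE)


def find_marker_events(text, window=40):
    """Detect markers, normalize OCR 'l' to '1', and label each by context."""
    events = []
    for match in MARKER.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window]
        label = ("potency-context" if POTENCY_PHRASE.search(context)
                 else "citation-like")
        events.append({
            "raw": match.group(0),
            "normalized": "1" + match.group(0)[1:],
            "ocr_fixed": match.group(1) == "l",
            "label": label,
        })
    return events
```

Keeping the raw string alongside the normalized form preserves an audit trail for each OCR correction.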
The near-total dominance of the potency-context class confirms that 1.c./l.c. notation in this corpus functions overwhelmingly as a potency reference rather than as a bibliographic citation marker. This classification governs downstream interpretation of dose-context signals embedded in the proving records and informs the potency-annotation layer described in the companion manuscript.