A Reproducible Computational Framework for Parsing, Section-Aware Similarity Modeling, and Provenance Analysis of Materia Medica Pura

Muhammad Sohail Latif

doi:10.20944/preprints202606.0264.v1

Submitted: 02 June 2026

Posted: 03 June 2026

Abstract

Background: Hahnemann’s Materia Medica Pura encodes the primary proving evidence for 67 classical remedies in prose structured for clinical reading rather than computational processing. Symptom boundaries, authority attributions, and editorial annotations remain embedded in unstructured text, making reproducible audit at the symptom and provenance level difficult without governed extraction logic.

Objective: We present a reproducible computational framework for parsing, section-aware similarity modeling, and provenance analysis of the full Materia Medica Pura corpus.

Methods: The pipeline implements seven analytical stages: corpus parsing with authority-block-constrained symptom extraction; canonicalization audit; section-aware similarity modeling with conservation validation; and authority/provenance graph construction with two-tier edge generation. Downstream stages add temporal onset signal extraction across six time buckets, promotion-gated editorial footnote integration, and potency marker normalization with OCR correction.

Results: The framework processed all 67 source remedy files, yielding 31,086 canonical symptoms in JSON and Markdown formats with full source-to-output parity and section conservation PASS at zero mismatch. Section-aware similarity across 77 units produced 5,929 matrix cells; the strongest pair was Hyoscyamus Niger::MAIN and Stramonium::MAIN (cosine similarity 0.025856843427458604). The provenance graph comprises 2,732 authority entities and 16,764 remedy-symptom-authority edges, with 115 promotion-gated footnote nodes integrated. Temporal onset extraction yielded 8,607 records; potency-signal classification identified 168 of 169 events (99.41%) as potency-context.

Conclusion: The framework delivers a verified, reproducible corpus layer for Materia Medica Pura supporting downstream similarity comparison, provenance tracing, and interpretive analysis.
This manuscript reports computational methods and text analytics only; no therapeutic claims are made.

Keywords: Materia Medica Pura; corpus construction; provenance graph; knowledge graph; historical medical text; symptom annotation; reproducibility; information retrieval

Subject: Computer Science and Mathematics - Data Structures, Algorithms and Complexity

Introduction

1.1. The Corpus and the Computational Challenge

Hahnemann’s Materia Medica Pura is the primary systematic proving record of classical homeopathic materia medica. Its 67 remedies, compiled and revised over multiple decades and editions, encode tens of thousands of symptom entries drawn from drug provings conducted by Hahnemann and his contemporaries. Each entry carries embedded evidential structure: a symptom text, one or more authority attributions identifying the prover or observer credited for the entry, source references linking the entry to a specific edition, and in many cases editorial footnotes providing secondary action annotations or comparative clinical observations. In the source texts, these layers are woven into prose that is readable by a trained scholar or clinician but is not queryable or auditable by a computational system without extraction logic that accounts for the formatting conventions of the proving tradition.

The problem this creates for reproducible scholarship is specific. Textual accessibility — having the corpus in digital form — does not confer computational tractability at the evidential level. A researcher can identify that a given symptom in Nux Vomica is attributed to a particular prover in a particular edition. Without machine-traceable provenance, that researcher cannot determine programmatically how many other symptoms share that attribution across the remedy, how the attribution pattern compares to other remedies in the corpus, or whether the symptom text appears in identical or variant form elsewhere. These are questions about the evidential structure of the corpus — about who attested to what, under which conditions, and in which edition — and they require a parsed, normalised, and linked representation of that structure rather than a digitised version of the original text.

1.2. The Gap in Existing Computational Approaches

Computational methods have been applied to historical medical corpora across a range of analytical tasks, from named entity recognition and topic modelling to text classification and information retrieval (Piotrowski 2012). Knowledge graph construction from historical sources has been approached through linked data frameworks and provenance-aware graph models, with formal provenance representation enabling traceable evidence chains across heterogeneous source material (Hogan et al. 2021; Moreau and Missier 2013). Information retrieval methods for structured symptom-text corpora are well established in the biomedical literature (Manning, Raghavan, and Schütze 2008).

Within this methodological context, the specific gap is not in text processing techniques generally; it is in their application to the layered evidential structure of proving records. Proving texts are not archival documents of uniform type. They are records in which symptom evidence, prover attribution, editorial annotation, and edition-specific source reference occupy the same prose space, and must be disentangled by extraction logic sensitive to the proving tradition’s own conventions. A framework that does not address this layered structure produces symptom records without traceable provenance — analytically usable at the surface level, but not independently verifiable at the evidential level where reproducibility matters most.

Reproducibility in computational research requires that results can be independently derived from the same inputs using documented methods (Sandve et al. 2013). For a historical corpus like Materia Medica Pura, that requirement extends to the extraction and normalisation decisions that precede analysis: which authority strings were resolved to which entity records, which editorial blocks were promoted into the graph, whether symptom totals at each analytical layer correspond exactly to those at the source. Without explicit governance and reporting of these decisions, analytical outputs are reproducible in principle but not in practice.

1.3. Non-Duplication Relative to Repertory Navigation

This work does not replicate repertory navigation. Repertory tools address the question of which remedies cover a presented symptom cluster — a function that operates over pre-structured indices built through successive layers of editorial judgment and that is essential to homeopathic clinical practice. The framework described here addresses a different and antecedent question: what is the evidential chain behind each symptom in the source proving text, who contributed it, under what authority and edition context, and how does its computational representation relate to corresponding entries across the corpus?

These are upstream governance questions. Answering them does not replace the repertory; it provides the auditable corpus layer on which downstream interfaces — including those that connect to or inform repertory use — can anchor their evidential claims. The relationship is sequential rather than substitutive: corpus governance and provenance normalisation precede comparative analysis, which precedes practitioner-facing interpretation. This manuscript addresses the first stage. The companion manuscript (M2) addresses the third, using the outputs produced here as its evidence foundation.

1.4. Objective and Scope Boundary

This manuscript presents a reproducible, governed computational corpus layer for Materia Medica Pura covering seven pipeline stages: corpus parsing with authority-block-constrained symptom extraction, canonicalization audit, section-aware similarity modeling with conservation validation, authority/provenance graph construction, temporal onset signal extraction, promotion-gated editorial footnote integration, and potency marker normalization with OCR correction. The framework produces structured outputs — parsed symptom records, canonicalization audit artifacts, section-aware similarity matrices, a provenance graph with symptom-level authority edges, temporal onset records, footnote-integrated graph extensions, and potency marker classifications — reported here as computational results and available to downstream interpretive analysis in the companion manuscript. All claims are methodological and traceability-based. No therapeutic efficacy claims are made.

Methods

2.1. Source Corpus and Parsing

The source corpus comprises 67 remedy-level text files representing the full published scope of Hahnemann’s Materia Medica Pura. Each file encodes the proving record for a single remedy, including symptom entries numbered sequentially within the original text, authority attributions identifying the provers and observers credited for each symptom, source references indicating the edition from which the entry was drawn, preparation notes, and introductory prose. The files span remedies lettered A through V and reflect the internal structural heterogeneity characteristic of a work compiled and revised across multiple editions by different editorial hands.

Parsing proceeded in two stages: structural segmentation and metadata extraction. Structural segmentation identified symptom boundaries using authority-block-constrained extraction rules that distinguished symptom entries from embedded editorial commentary. Each symptom entry was assigned a sequential identifier within its remedy, preserving the original numbering for downstream validation. Source-reference strings were normalized to edition-like patterns. Raw authority strings embedded in symptom blocks were extracted without normalization at this stage, preserving the full variant range for the subsequent reconciliation workflow. Contamination cleanup removed residual formatting artifacts and source-reference fragments that had migrated into symptom text fields during extraction.
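
The extraction rules themselves are not reproduced in this manuscript, so the following is only a minimal sketch of what authority-block-constrained segmentation might look like. It assumes a hypothetical line layout in which each symptom begins with its original number and any authority attribution appears in trailing square brackets. The pattern, the `parse_symptoms` helper, and the sample lines are illustrative assumptions, not the pipeline's actual logic.

```python
import re

# Hypothetical layout: "<number>. <symptom text> [<authority>]".
# The real source conventions are not specified in the manuscript.
SYMPTOM_RE = re.compile(
    r"^(?P<num>\d+)\.\s+(?P<text>.*?)(?:\s*\[(?P<auth>[^\]]+)\])?\s*$"
)

def parse_symptoms(lines):
    """Keep lines that look like symptom entries; skip everything else."""
    records = []
    for line in lines:
        m = SYMPTOM_RE.match(line)
        if m is None:
            continue  # e.g. embedded editorial commentary
        records.append({
            "seq_id": len(records) + 1,             # sequential id within the remedy
            "source_number": int(m.group("num")),   # original numbering, kept for validation
            "text": m.group("text").strip(),
            "raw_authority": m.group("auth"),       # left un-normalized at this stage
        })
    return records

sample = [
    "1. Vertigo.",
    "(Editorial commentary that is not a symptom entry.)",
    "3. Pressive headache in the forehead. [Hypothetical Prover]",
]
records = parse_symptoms(sample)
```

Keeping `source_number` alongside `seq_id` mirrors the stated goal of preserving the original numbering for later validation, while the raw authority string is carried forward untouched for the reconciliation step.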

Metadata extraction recovered six fields per remedy: source, source_reference, common_name, preparation, introduction, and authorities. Field coverage across the 67-remedy corpus was: source and source_reference at 79.10% (53 of 67 remedies), common_name and preparation at 74.63% (50 of 67), introduction at 91.04% (61 of 67), and authorities at 79.10% (53 of 67). Coverage gaps reflect the variable completeness of source edition metadata rather than parsing failures; no remedies were missing from parsed output and no extra parsed files were produced without a corresponding source. Empty symptom texts were absent across the full corpus.

The total canonical symptom count after parsing was 31,086 across 67 remedies, yielding a mean of 463.97 symptoms per remedy, a median of 343.00, and a range of 107 (Bismuthum) to 1,465 (Mercurius). Parsed outputs were produced in parallel JSON and Markdown formats, with both sets achieving full 67-of-67 parity against the source corpus.

2.2. Canonicalization Audit

The canonicalization audit assessed symptom text uniqueness and cross-remedy repetition. From 31,086 symptom entries, 29,649 unique symptom texts were identified, with 436 texts occurring in two or more remedies. The ten most frequently repeated texts were: “Vertigo.” (23 remedies), “Nausea.” (18), “Diarrhoea.” (18), “Vomiting.” (15), “Palpitation of the heart.” (13), “Inclination to vomit.” (13), “Syncope.” (12), “Anxiety.” (12), “Tightness of the chest.” (11), and “Sleeplessness.” (11). Repetition at this frequency reflects shared pathophysiological territory across the tested substances rather than extraction error; the distribution is consistent with the expectations for a corpus of materia medica provings drawn from the same methodological tradition.
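
As an illustration of how such an inventory could be derived, the sketch below groups symptom texts by the set of remedies in which they occur. It assumes exact string matching with no case or whitespace normalization, which may differ from the audit's actual rule; the function name and toy data are hypothetical. The final line recomputes the repetition rate from the two reported totals.

```python
from collections import defaultdict

def repetition_inventory(symptoms):
    """symptoms: iterable of (remedy, text) pairs.

    Returns the number of unique texts and the texts found in 2+ remedies.
    """
    remedies_by_text = defaultdict(set)
    for remedy, text in symptoms:
        remedies_by_text[text].add(remedy)
    repeated = {
        text: sorted(remedies)
        for text, remedies in remedies_by_text.items()
        if len(remedies) >= 2
    }
    return len(remedies_by_text), repeated

toy = [
    ("Remedy A", "Vertigo."),
    ("Remedy B", "Vertigo."),
    ("Remedy A", "Nausea."),
    ("Remedy A", "Nausea."),   # repeated within one remedy only
    ("Remedy C", "Anxiety."),
]
unique_count, repeated = repetition_inventory(toy)

# Rate implied by the reported totals: 436 cross-remedy texts of 29,649 unique texts.
rate_percent = round(100 * 436 / 29_649, 2)
```

Note that a text repeated only within a single remedy reduces the unique-text count but does not count as a cross-remedy repeat under this definition.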

A snapshot-freeze mechanism was applied before and after candidate clustering to preserve a stable audit baseline. Repeated symptom texts were inventoried by remedy to support downstream disambiguation. Non-sequential symptom numbering was confirmed in 48 of 67 remedies, and one remedy (Ambra grisea) carried non-bracketed authority markers. Both conditions were logged as structural features of the source edition rather than parsing failures. The source_reference field was edition-like across all 53 remedies where it was populated, with zero non-edition-like exceptions.
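
The manuscript does not define "non-sequential" precisely. One plausible reading, sketched below, treats a remedy as sequential only when its source numbers run 1, 2, 3, ... with no gaps, repeats, or reordering. The `is_sequential` helper and the toy numbering are assumptions for illustration.

```python
def is_sequential(numbers):
    """True when numbers run 1..n in order with no gaps or repeats."""
    numbers = list(numbers)
    return numbers == list(range(1, len(numbers) + 1))

numbering = {
    "Remedy A": [1, 2, 3, 4],
    "Remedy B": [1, 2, 4, 5],   # gap
    "Remedy C": [1, 2, 2, 3],   # repeat
}
non_sequential = sorted(
    remedy for remedy, nums in numbering.items() if not is_sequential(nums)
)

# Share implied by the reported counts: 48 of 67 remedies.
share_percent = round(100 * 48 / 67, 1)
```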

2.3. Section-Aware Similarity Modeling

Remedy similarity was computed at section level rather than at full-remedy level, to preserve the internal structural differentiation of the proving record. Collapsing all symptom entries for a remedy into a single vector would suppress the anatomical and functional specificity that distinguishes one remedy from another in clinical use; section-level comparison retains that specificity while enabling quantitative pairwise analysis.

Each remedy’s symptom corpus was partitioned into named anatomical and functional sections (including, for example, MAIN, HEAD, EYE, CHEST, STOMACH, and EXTREMITIES) using structural markers embedded in the source files. Section-level vectors were constructed using term-frequency representations of symptom text tokens within each section. Only sections with substantive symptom content were included in the similarity computation.

Section-layer conservation validation confirmed that the analytical expansion preserved canonical symptom totals exactly: the section-layer row count equalled 31,086, matching the canonical parsed count, with a mismatch count of zero and a conservation status of PASS. The section-layer section count was 77, representing all distinct section units across the corpus where multi-section structure was present.
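
A conservation check of this kind can be sketched as a comparison of per-remedy row counts before and after the section expansion. The granularity of the comparison (per remedy rather than per symptom identifier), the row schema, and the `conservation_check` name are assumptions; only the PASS/mismatch-count reporting pattern comes from the text above.

```python
from collections import Counter

def conservation_check(canonical_rows, section_rows):
    """Compare per-remedy symptom counts between the two layers."""
    canon = Counter(row["remedy"] for row in canonical_rows)
    sect = Counter(row["remedy"] for row in section_rows)
    mismatched = [r for r in canon.keys() | sect.keys() if canon[r] != sect[r]]
    return {
        "canonical_total": sum(canon.values()),
        "section_total": sum(sect.values()),
        "mismatch_count": len(mismatched),
        "status": "PASS" if not mismatched else "FAIL",
    }

canonical = [{"remedy": "Remedy A"}, {"remedy": "Remedy A"}, {"remedy": "Remedy B"}]
sectioned = [
    {"remedy": "Remedy A", "section": "HEAD"},
    {"remedy": "Remedy A", "section": "MAIN"},
    {"remedy": "Remedy B", "section": "MAIN"},
]
ok = conservation_check(canonical, sectioned)
broken = conservation_check(canonical, sectioned[:-1])  # simulate a dropped row
```

Comparing per-remedy counts rather than grand totals alone guards against offsetting errors, where a row lost in one remedy is masked by a duplicate in another.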

Pairwise cosine similarity was computed across all 77 section units, yielding a 77 × 77 matrix of 5,929 cells. This total counts self-pairs and both orderings of each pair; the number of distinct unordered pairs of different units is 77 × 76 / 2 = 2,926. The strongest similarity was between Hyoscyamus Niger::MAIN and Stramonium::MAIN (cosine similarity 0.025856843427458604). The low absolute magnitude of the strongest pair reflects the lexical diversity of the Materia Medica Pura corpus and is consistent with each proving record carrying a substantially distinct symptom vocabulary. Neighborhood extraction identified the highest-similarity section partners for each unit, producing the comparative structures that support practitioner-facing differential profiles described in the companion manuscript (M2).
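
The computation described in this subsection can be sketched in a few lines of standard-library Python. The tokenizer (lower-cased alphabetic runs), the absence of stop-word removal or weighting, and the toy section contents are all assumptions; the manuscript specifies only term-frequency vectors and cosine similarity.

```python
import math
import re
from collections import Counter

def tf_vector(texts):
    """Term-frequency vector over lower-cased alphabetic tokens."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0

sections = {
    "Remedy A::MAIN": ["Vertigo on rising.", "Delirium at night."],
    "Remedy B::MAIN": ["Delirium with loquacity.", "Vertigo."],
    "Remedy C::MAIN": ["Burning in the stomach."],
}
vectors = {name: tf_vector(texts) for name, texts in sections.items()}

# Full n x n matrix (includes self-pairs and both orderings).
matrix = {(a, b): cosine(vectors[a], vectors[b]) for a in vectors for b in vectors}

# Distinct unordered pairs of different units: n * (n - 1) / 2 of them.
off_diagonal = {pair: s for pair, s in matrix.items() if pair[0] < pair[1]}
strongest = max(off_diagonal, key=off_diagonal.get)
```

For n units the full matrix has n² cells, so 77 units give 5,929 cells, of which 2,926 are distinct unordered pairs of different units. Ranking should be done over the off-diagonal cells, since every unit has similarity 1 with itself.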

2.4. Authority and Provenance Graph Construction

Provenance graph construction proceeded through three steps: authority string normalization, entity catalog creation, and multi-layer edge construction.

Authority strings extracted during parsing were normalized to canonical entity records by resolving abbreviations, variant spellings, edition-specific shorthand, and initialisms used inconsistently across the source texts. Normalization was applied before entity deduplication to prevent the same prover or editor from appearing as multiple distinct nodes in the graph. The resulting authority entity catalog comprises 2,732 unique entities, representing individual provers, editors, translators, and cited clinical observers named within the Materia Medica Pura source texts.
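
A minimal sketch of the normalize-then-deduplicate order is shown below. The alias table is entirely hypothetical (the manuscript does not list its resolution rules), and unmatched strings are simply passed through; a production workflow would presumably log them for review.

```python
import re

# Hypothetical alias table mapping cleaned variants to a canonical entity name.
ALIASES = {
    "hahn": "Hahnemann",
    "hahnemann": "Hahnemann",
    "s hahnemann": "Hahnemann",
}

def normalize_authority(raw):
    """Lower-case, strip punctuation, collapse spaces, then look up the alias."""
    key = re.sub(r"[^a-z ]", "", raw.lower())
    key = re.sub(r"\s+", " ", key).strip()
    return ALIASES.get(key, raw.strip())

raw_strings = ["Hahn.", "HAHNEMANN", "S. Hahnemann", "Unlisted Name"]

# Deduplicate only after normalization, so variants collapse to one node.
catalog = sorted({normalize_authority(s) for s in raw_strings})
```

Deduplicating the raw strings first would have produced four entities here instead of two, which is the failure mode the ordering is meant to prevent.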

Two classes of directed edges were then constructed. Remedy-authority edges link each remedy to the authority entities credited within its text; 1,682 such edges were generated. These edges support queries of the form: which authorities are cited in a given remedy? Remedy-symptom-authority edges link individual symptom records to the authority entities associated with them, providing line-level provenance traceability at the symptom grain; 16,764 such edges were generated. The disparity in scale between the two edge classes reflects the many-to-many nature of authority attribution within the corpus: a single authority entity may be linked to many individual symptoms across one or more remedies, while a single remedy carries a bounded set of distinct authority credits.

The base provenance graph, produced before footnote integration, comprised 18,446 rows. After footnote-node integration the graph reached 18,561 rows. The combined graph supports three query patterns: authority-scoped symptom retrieval, remedy-scoped authority attribution, and symptom-scoped provenance tracing. All three patterns are surfaced in the practitioner-facing outputs described in the companion manuscript.
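
The two edge classes and the row-count reconciliation can be illustrated as follows. The record schema and the choice to derive remedy-level edges from symptom-level attributions are assumptions made for the sketch; the reported totals at the end are taken from the text.

```python
def build_edges(symptom_records):
    """Return (remedy-authority edges, remedy-symptom-authority edges)."""
    remedy_authority = set()
    remedy_symptom_authority = set()
    for rec in symptom_records:
        for authority in rec["authorities"]:
            remedy_authority.add((rec["remedy"], authority))
            remedy_symptom_authority.add((rec["remedy"], rec["symptom_id"], authority))
    return remedy_authority, remedy_symptom_authority

toy = [
    {"remedy": "Remedy A", "symptom_id": 1, "authorities": ["Prover X"]},
    {"remedy": "Remedy A", "symptom_id": 2, "authorities": ["Prover X", "Prover Y"]},
    {"remedy": "Remedy B", "symptom_id": 1, "authorities": ["Prover X"]},
    {"remedy": "Remedy B", "symptom_id": 2, "authorities": []},  # unattributed symptom
]
ra_edges, rsa_edges = build_edges(toy)
base_rows = len(ra_edges) + len(rsa_edges)

# Reconciliation of the reported totals.
reported_base = 1_682 + 16_764          # remedy-authority + remedy-symptom-authority
reported_final = reported_base + 115    # after footnote-node integration
```

Using sets makes each edge class duplicate-free by construction, so the base row count is simply the sum of the two class sizes.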

2.5. Temporal Onset Signal Extraction

Hahnemann’s proving records contain explicit and relative temporal markers indicating when symptoms appeared during the course of a proving experiment. These markers encode information about onset latency and symptom sequence that is absent from the symptom text alone. Retaining them as a structured feature of the corpus rather than discarding them as prose incidentals enables timeline-stratified comparison across remedies and preserves the phase-sequencing signal embedded in the original proving records.

A dedicated extraction pass identified and classified onset language across the full corpus. Onset cue patterns included explicit time references (for example “after one hour,” “on the third day”), relative markers (“soon after,” “immediately,” “later”), and sequence-indicating language (“first,” “subsequently,” “at last”). Each extracted event was assigned to one of six time buckets: immediate to 1h, 1h to 12h, 12h to 48h, 2d to 7d, over 7d, and qualitative timing cues. The qualitative bucket captures onset language that specifies temporal sequence or relative order without providing a measurable interval; it was retained as a distinct category rather than discarded, preserving the signal value of imprecisely reported onset events.
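
The bucket assignment can be sketched with two regular expressions. This is a simplified illustration under several assumptions: only digit-based "after N minutes/hours/days" cues are parsed (spelled-out numbers and phrases such as "on the third day" would need further rules), upper bucket bounds are treated as inclusive, and every non-numeric cue is routed to the qualitative bucket. None of these choices is confirmed by the manuscript.

```python
import re

UNIT_HOURS = {"minute": 1 / 60, "hour": 1.0, "day": 24.0}
TIMED_RE = re.compile(r"after\s+(\d+(?:\.\d+)?)\s+(minute|hour|day)s?", re.IGNORECASE)
QUALITATIVE_RE = re.compile(
    r"\b(immediately|soon after|later|first|subsequently|at last)\b", re.IGNORECASE
)

def onset_bucket(text):
    """Map a symptom text to one of six onset buckets, or None if no cue is found."""
    m = TIMED_RE.search(text)
    if m:
        hours = float(m.group(1)) * UNIT_HOURS[m.group(2).lower()]
        if hours <= 1:
            return "immediate to 1h"
        if hours <= 12:
            return "1h to 12h"
        if hours <= 48:
            return "12h to 48h"
        if hours <= 168:
            return "2d to 7d"
        return "over 7d"
    if QUALITATIVE_RE.search(text):
        return "qualitative timing cues"
    return None

examples = [
    "Headache (after 30 minutes).",
    "Nausea (after 2 hours).",
    "Itching (after 3 days).",
    "Vertigo soon after eating.",
    "Vertigo.",
]
buckets = [onset_bucket(t) for t in examples]
```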

The extraction yielded 8,607 records across the full corpus. The bucket distribution was: immediate to 1h, 1,376 records (15.99%); 1h to 12h, 2,486 (28.88%); 12h to 48h, 1,670 (19.40%); 2d to 7d, 656 (7.62%); over 7d, 187 (2.17%); qualitative timing cues, 2,232 (25.93%). The 1h to 12h bucket contained the largest share of quantitatively bounded events. Qualitative-only cues accounted for approximately one quarter of all onset records, reflecting the variable precision of temporal reporting across provers and substances.
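
The reported shares follow directly from the bucket counts, as the short check below shows. The counts are those given above; only the variable names are introduced here.

```python
bucket_counts = {
    "immediate to 1h": 1376,
    "1h to 12h": 2486,
    "12h to 48h": 1670,
    "2d to 7d": 656,
    "over 7d": 187,
    "qualitative timing cues": 2232,
}
total = sum(bucket_counts.values())
shares = {name: round(100 * n / total, 2) for name, n in bucket_counts.items()}

# Largest bucket among those with a measurable interval.
bounded = {k: v for k, v in bucket_counts.items() if k != "qualitative timing cues"}
largest_bounded = max(bounded, key=bounded.get)
```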

Timed-event burden at the remedy level was highest for Nux Vomica (468 events), Pulsatilla (423), Belladonna (315), Ignatia (315), and China (290). These counts reflect both the total symptom richness of these remedies and the relative detail of their proving narratives.

2.6. Editorial Footnote Integration

The Materia Medica Pura source texts contain brace-delimited footnotes that carry editorial commentary, secondary action annotations, comparative clinical observations, and dose or preparation context attached to individual symptom entries. These footnote-prose blocks constitute an interpretive layer distinct from the primary proving record, requiring separate extraction logic and a governed promotion pathway before integration into the provenance graph.

Footnote extraction applied a two-tier approach. The strict tier targeted well-formed brace-delimited blocks with unambiguous editorial markers, extracting them with high confidence. The candidate tier captured blocks meeting structural brace-delimited criteria but requiring confidence-based evaluation before integration; candidate blocks below the strict threshold were scored and either promoted automatically or queued for manual adjudication. In the current production run, 62 editorial note blocks met the confidence threshold for automatic promotion. Manual promotion overrides were applied in 5 cases following adjudication review. The total number of editorial note blocks integrated into the provenance graph was 115.
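
The promotion gate can be pictured as a small routing function. The sketch below is hypothetical throughout: the score scale, the threshold value of 0.8, the `FootnoteBlock` fields, and the decision labels are placeholders, since the manuscript reports neither the scoring method nor the threshold.

```python
from dataclasses import dataclass

@dataclass
class FootnoteBlock:
    block_id: str
    tier: str      # "strict" or "candidate"
    score: float   # hypothetical confidence score in [0, 1]

THRESHOLD = 0.8  # illustrative value only; the actual threshold is not reported

def route(block, manual_overrides=frozenset()):
    """Decide whether a footnote block enters the graph or waits for review."""
    if block.tier == "strict":
        return "promoted:strict"
    if block.score >= THRESHOLD:
        return "promoted:auto"
    if block.block_id in manual_overrides:
        return "promoted:manual"   # admitted after adjudication
    return "queued"                # held back pending manual review

blocks = [
    FootnoteBlock("f1", "strict", 1.0),
    FootnoteBlock("f2", "candidate", 0.9),
    FootnoteBlock("f3", "candidate", 0.4),
    FootnoteBlock("f4", "candidate", 0.3),
]
decisions = {b.block_id: route(b, manual_overrides={"f3"}) for b in blocks}
promoted = [bid for bid, d in decisions.items() if d.startswith("promoted")]
```

Recording the route taken for each block, rather than a bare promoted flag, keeps the override rate auditable after the fact.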

Promoted footnote nodes were linked to their parent symptom records and to the remedies in which they appeared, extending the provenance graph from the base 18,446-row structure to the final 18,561-row integrated graph. The promotion-gated integration design prevents unreviewed or ambiguous footnote content from entering the graph as authoritative evidence, preserving the traceability and auditability of the corpus layer.

2.7. Potency Marker Normalization

A subset of symptom entries and footnote blocks contain potency-like markers using the notation “1.c.” or “l.c.” — the latter arising from optical character recognition errors in which the numeral “1” was misread as the letter “l”. Correctly classifying these markers matters for downstream interpretation: a potency notation carries dose-context information, while a citation-like use of the same character sequence would not.

A combined extraction pass applied two complementary mechanisms: a marker detector targeting the 1.c./l.c. string pattern and a phrase detector targeting explicit potency language in the surrounding textual context. The combined pass identified 169 events. Classification by local context distinguished 168 likely potency-context events (99.41%) from 1 citation-like marker event (0.59%). Within the marker-identified subset of three events, two were confirmed as OCR normalizations from “l. c.” to “1. c.”, and all three were classified as potency-context.
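
A rough sketch of the two detectors follows. The marker pattern covers "1.c.", "1. c.", "l.c.", and "l. c."; the cue-word list for the phrase detector and the rule that a marker without nearby potency language is citation-like are assumptions introduced for illustration, not the classification logic used in the pipeline.

```python
import re

MARKER_RE = re.compile(r"\b([1l])\.\s?c\.")
# Hypothetical potency cue words for the phrase detector.
POTENCY_WORDS_RE = re.compile(r"\b(potency|dilution|attenuation)\b", re.IGNORECASE)

def normalize_marker(text):
    """Rewrite OCR-damaged 'l.c.' / 'l. c.' as '1.c.' / '1. c.'.

    Returns the corrected text and the number of corrections made.
    """
    corrected = sum(1 for m in MARKER_RE.finditer(text) if m.group(1) == "l")
    fixed = MARKER_RE.sub(lambda m: "1" + m.group(0)[1:], text)
    return fixed, corrected

def classify_event(text):
    """Label an event by its local context (illustrative rule only)."""
    if POTENCY_WORDS_RE.search(text):
        return "potency-context"
    if MARKER_RE.search(text):
        return "citation-like"
    return None

fixed, n_fixed = normalize_marker("given in the l. c. dilution")

# Share implied by the reported counts: 168 of 169 events.
potency_share = round(100 * 168 / 169, 2)
```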

The near-total dominance of the potency-context class (168 of 169 events, including all three marker-identified events) indicates that potency-like signals in this corpus function overwhelmingly as potency references rather than as bibliographic citation markers. This classification governs downstream interpretation of dose-context signals embedded in the proving records and informs the potency-annotation layer described in the companion manuscript.

Results

3.1. Corpus Coverage and Parsing Outcomes

The parsing pipeline processed all 67 remedy-level source texts and produced full JSON and Markdown output sets with 67-of-67 parity in both formats. No source remedies were absent from parsed output and no extra parsed files were generated. The total canonical symptom count across the corpus was 31,086. Symptom distribution across the 67 remedies was right-skewed: mean 463.97 symptoms per remedy, median 343.00, minimum 107 (Bismuthum), and maximum 1,465 (Mercurius). The 20 remedies with the highest symptom counts are shown in Figure 5. The top ten remedies by symptom count were Mercurius (1,465), Belladonna (1,398), Nux Vomica (1,266), China (1,116), Pulsatilla (1,112), Arsenicum Album (1,030), Rhus (942), Sulphur (783), Ignatia (771), and Bryonia Alba (751). At the lower end, Bismuthum (107), Sambucus (115), Euphrasia Officinalis (125), and Sarsaparilla (135) carried the fewest symptom entries, consistent with the shorter proving records for those substances in the source editions.

The corpus spans 18 alphabetical letters, with remedy counts per letter ranging from one to 15 (letter C). The B-letter group (3 remedies) had a mean of 752.00 symptoms per remedy and the P-letter group (2 remedies) a mean of 883.00, the higher of the two. The C-letter group, the largest alphabetically, comprised 15 remedies and 5,755 symptoms at a mean of 383.67 per remedy.

Symptom text length across the 31,086 entries showed a mean of 79.43 characters and a median of 65.00, with a range of 4 to 2,186 characters. The most frequent terminal character was a period, present in 16,284 entries (52.4%), followed by a closing parenthesis in 6,692 entries (21.5%). Empty symptom texts were absent across the full corpus.

Metadata field recovery across 67 remedies was: source and source_reference, 79.10% (53 remedies); common_name and preparation, 74.63% (50 remedies); introduction, 91.04% (61 remedies); authorities, 79.10% (53 remedies).

3.2. Canonicalization Audit Outcomes

From 31,086 symptom entries, the canonicalization audit identified 29,649 unique symptom texts, with 436 texts occurring in two or more remedies — a cross-remedy repetition rate of 1.47% across the unique text inventory. The ten most frequently repeated texts, shown in Figure 1, were: “Vertigo.” (23 remedies), “Nausea.” (18), “Diarrhoea.” (18), “Vomiting.” (15), “Palpitation of the heart.” (13), “Inclination to vomit.” (13), “Syncope.” (12), “Anxiety.” (12), “Tightness of the chest.” (11), and “Sleeplessness.” (11).

Structural audit of symptom numbering confirmed non-sequential numbering in 48 of 67 remedies (71.6%). One remedy (Ambra grisea) carried non-bracketed authority markers. Both were recorded as source edition features. The source_reference field was edition-like in all 53 remedies where it was populated, with zero non-edition-like exceptions.

3.3. Section-Aware Similarity Outcomes

Section-layer conservation validation confirmed that the transformation from canonical to section-layer representation preserved symptom totals exactly: the section-layer row count was 31,086, matching the canonical parsed count, with a mismatch count of zero and a conservation status of PASS. The section-layer comprised 77 distinct section units across the corpus. The count of 77 sections from 67 remedies indicates that some remedies contributed multiple named sections to the analytical layer, while remedies with undifferentiated structure contributed a single section. Conservation PASS is a precondition for valid section-level similarity and provenance analysis; a non-zero mismatch would invalidate the analytical expansion.

Pairwise cosine similarity across all 77 section units yielded 5,929 matrix cells (77 × 77, counting self-pairs and both orderings of each pair). The strongest section-level similarity was between Hyoscyamus Niger::MAIN and Stramonium::MAIN, with a cosine similarity score of 0.025856843427458604. The top remedy-section pairs by similarity score are shown in Figure 2.

3.4. Authority and Provenance Graph Outcomes

Authority string normalization resolved abbreviations, variant spellings, and edition-specific initialisms to a canonical entity catalog of 2,732 unique authority entities, representing individual provers, editors, translators, and cited clinical observers named within the Materia Medica Pura source texts. The top authority entities by edge count are shown in Figure 3.

Two classes of directed edges were constructed from the entity catalog. Remedy-authority edges totalled 1,682, linking each remedy to the authority entities credited within it — a mean of 25.1 edges per remedy. Remedy-symptom-authority edges totalled 16,764, linking individual symptom records to their associated authority entities at the symptom grain — a mean of 6.1 edges per authority entity. The base provenance graph produced before footnote-node integration comprised 18,446 rows. The combined edge count of 18,446 rows equals the sum of remedy-symptom-authority edges (16,764) and remedy-authority edges (1,682) exactly, confirming that the base graph was populated entirely from the two primary edge classes.

3.5. Editorial Footnote Integration Outcomes

The distribution of footnote-prose blocks across remedies is shown in Figure 4. The two-tier extraction process applied strict-tier criteria to well-formed blocks and scored candidate-tier blocks before integration. From the candidate tier, 62 blocks met the confidence threshold for automatic promotion; manual promotion overrides were applied in 5 cases following adjudication review. The total editorial note blocks integrated into the provenance graph was 115.

Integration of the 115 promoted nodes extended the provenance graph from 18,446 rows to 18,561 rows. The 115-row increment is exact, confirming that integration introduced no duplicate or spurious rows. The ratio of manual overrides to total promoted nodes was 5 of 115 (4.3%).

3.6. Temporal Onset Signal Outcomes

The temporal onset extraction pass identified 8,607 records carrying explicit or relative onset timing cues across the full corpus, a count equal to 27.7% of the 31,086 canonical symptom entries. Figure 6 shows the distribution across the six time buckets. The bucket breakdown was: immediate to 1h, 1,376 records (15.99%); 1h to 12h, 2,486 (28.88%); 12h to 48h, 1,670 (19.40%); 2d to 7d, 656 (7.62%); over 7d, 187 (2.17%); qualitative timing cues, 2,232 (25.93%). The 1h to 12h bucket held the largest share of quantitatively bounded onset records. The qualitative timing cue bucket accounted for 25.93% of all onset records, and retaining it as a distinct category preserved 2,232 onset records that would otherwise have been excluded.

Timed-event burden at the remedy level was highest for Nux Vomica (468 events), Pulsatilla (423), Belladonna (315), Ignatia (315), and China (290), as shown in Figure 7. These five remedies together account for 1,811 of the 8,607 onset records — 21.0% of all timed events from 7.5% of the remedy count.

3.7. Potency Marker Normalization Outcomes

The combined potency-signal extraction pass identified 169 events across the corpus. Of these, 168 were classified as likely potency-context events (99.41%) and 1 as a citation-like marker event (0.59%). Within the marker-identified subset of three events, two were OCR-normalized from “l. c.” to “1. c.”; all three were classified as potency-context. Figure 8 shows the distribution of potency-signal event classifications.

Discussion

4.1. Main Methodological Contribution

The central contribution of this framework is an auditable transformation chain from unstructured historical prose to machine-traceable, structured analytical entities — a layer that did not previously exist for Hahnemann’s Materia Medica Pura at this scale or with this degree of provenance resolution. Proving records in classical materia medica were composed and organised for clinical reading, not computational processing; their authority attributions, symptom boundaries, and editorial annotations are embedded in prose conventions that resist direct extraction without governed parsing logic. The pipeline described here makes those embedded structures reproducibly recoverable and independently verifiable.

The canonicalization audit establishes the starting point for that verification. A 1.47% cross-remedy repetition rate across the unique text inventory is low, but the repeated texts are not randomly distributed: all ten of the most frequently repeated entries are short-form, single-concept symptom texts — the class most susceptible to cross-remedy textual identity given the standardised recording conventions of the proving tradition. Their concentration at the top of the repetition distribution reflects the proving method’s shared vocabulary for common presentations rather than transcription error or parsing artifact, a distinction that matters for downstream disambiguation.

Section-layer conservation validation provides a second, mechanically enforced verification point. Conservation PASS — confirmed at zero mismatch between canonical and section-layer row counts — is not reported here as evidence of quality; it is a precondition for the validity of all downstream section-level computation. Making conservation status an explicit reported metric rather than an internal check converts it from an assumption into a verifiable claim.

4.2. Similarity Atlas: Interpretation

The strongest section-level similarity in the corpus — between Hyoscyamus Niger::MAIN and Stramonium::MAIN at cosine similarity 0.025856843427458604 — warrants interpretation on two grounds. First, the magnitude. The low absolute value of the strongest pair reflects the lexical diversity of the corpus: even the most similar remedy-section pair shares a small proportion of its symptom vocabulary with its nearest neighbour. This finding has implications for any downstream interface that uses similarity scores for comparative display: the scores are meaningful for ordering, not for asserting textual overlap above a substantive threshold.

Second, the identity of the pair. Hyoscyamus Niger and Stramonium are among the most consistently differentially discussed remedies in the classical materia medica literature. Nash groups Hyoscyamus, Stramonium, and Belladonna as a classical delirium triad, differentiating the three by the grade and alternation pattern of their respective deliria rather than by symptom presence alone (Nash 1913). That the section-level similarity computation recovers this classical pairing as the strongest in the corpus without encoding that relationship explicitly is a form of external validation.

The section-aware modelling choice directly determines what the similarity computation can and cannot detect. Collapsing all symptom entries for a remedy into a single vector would suppress the anatomical and functional specificity that distinguishes one remedy from another within individual body-system territories. Section-level comparison preserves that structure while enabling quantitative pairwise analysis.
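The section-aware comparison can be sketched as follows: one term vector per remedy::section unit, cosine similarity between every pair of units, and an n × n matrix whose cell count for 77 units is 5,929. The whitespace tokenization, the raw term counts, and the placeholder unit names and texts are assumptions for illustration; the weighting scheme used by the pipeline is not specified here.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def unit_vectors(rows):
    """Build one term-count vector per remedy::section unit.

    rows: iterable of (remedy, section, symptom_text) tuples.
    """
    vectors = {}
    for remedy, section, text in rows:
        unit = f"{remedy}::{section}"
        vectors.setdefault(unit, Counter()).update(text.lower().split())
    return vectors


def strongest_pair(vectors):
    """Return the pair of distinct units with the highest cosine similarity."""
    return max(
        combinations(sorted(vectors), 2),
        key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]),
    )


# Toy rows with placeholder tokens (not corpus data).
rows = [
    ("RemedyA", "MAIN", "alpha beta gamma"),
    ("RemedyB", "MAIN", "alpha beta delta"),
    ("RemedyC", "MAIN", "epsilon zeta"),
]
vectors = unit_vectors(rows)
best = strongest_pair(vectors)
matrix_cells = len(vectors) ** 2  # full matrix, including the diagonal
print(best, round(cosine(vectors[best[0]], vectors[best[1]]), 4), matrix_cells)
```

Keeping the section label in the unit key is what prevents a remedy's entries from collapsing into a single vector.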

4.3. Provenance Graph: Interpretation

The scale difference between remedy-authority edges (1,682) and remedy-symptom-authority edges (16,764) reflects the many-to-many relationship between authority entities and symptom entries. The consequence for provenance tracing is that authority attribution at the symptom grain carries substantially more resolution than authority attribution at the remedy grain. The 2,732-entity catalog and 16,764 symptom-level edges together represent a provenance resolution that would be impractical to achieve by manual review across a 67-remedy corpus of this scale.

The exact correspondence between the base graph row count (18,446) and the combined edge total (16,764 + 1,682 = 18,446) confirms that the base graph is structurally clean: no spurious rows were introduced during graph construction, and no edge class contributed duplicate records. This internal consistency check is reported explicitly because provenance graphs built from heterogeneous historical sources are particularly susceptible to row duplication during entity normalisation.
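The consistency check described above reduces to two conditions: the base graph row count equals the sum of the two edge classes, and no row appears twice. The sketch below verifies the reported totals arithmetically and applies the same test to toy edges. The tuple layouts and the function name are hypothetical.

```python
REPORTED_SYMPTOM_EDGES = 16_764  # remedy-symptom-authority edges
REPORTED_REMEDY_EDGES = 1_682    # remedy-authority edges
REPORTED_BASE_ROWS = 18_446      # base graph row count


def check_base_graph(symptom_edges, remedy_edges, base_rows):
    """Check that base rows equal the combined edges and contain no duplicates."""
    base_rows = list(base_rows)
    expected = len(symptom_edges) + len(remedy_edges)
    return {
        "totals_match": expected == len(base_rows),
        "duplicate_rows": len(base_rows) - len(set(base_rows)),
    }


# Toy edges with placeholder identifiers (not corpus data).
symptom_edges = [("RemedyA", "s1", "Auth1"), ("RemedyA", "s2", "Auth2")]
remedy_edges = [("RemedyA", "Auth1"), ("RemedyA", "Auth2")]
clean = check_base_graph(symptom_edges, remedy_edges, symptom_edges + remedy_edges)
dirty = check_base_graph(
    symptom_edges, remedy_edges, symptom_edges + remedy_edges + remedy_edges[:1]
)
print(REPORTED_SYMPTOM_EDGES + REPORTED_REMEDY_EDGES == REPORTED_BASE_ROWS, clean, dirty)
```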

4.4. Editorial Footnote Governance

The 4.3% ratio of manual overrides to total promoted editorial note blocks suggests that the confidence threshold for automatic promotion was reasonably calibrated to the corpus. The observed outcome (62 automatic promotions, 5 manual overrides, and zero blocks admitted without evaluation) is consistent with a threshold that routed genuinely uncertain cases to human review while processing high-confidence blocks without unnecessary manual load.
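A promotion gate of the kind described can be sketched as a routing function in which every block receives exactly one decision, so that no block enters the graph unevaluated. The threshold value, the confidence score, and all names below are hypothetical; the pipeline's actual scoring and threshold are not given in this manuscript.

```python
from dataclasses import dataclass

AUTO_PROMOTE_THRESHOLD = 0.80  # hypothetical value, for illustration only


@dataclass(frozen=True)
class NoteBlock:
    block_id: str
    confidence: float  # hypothetical extraction confidence in [0, 1]


def route(block, threshold=AUTO_PROMOTE_THRESHOLD):
    """Return 'auto_promote' or 'manual_review'; reject invalid scores."""
    if not 0.0 <= block.confidence <= 1.0:
        raise ValueError(f"confidence out of range for {block.block_id}")
    return "auto_promote" if block.confidence >= threshold else "manual_review"


def gate(blocks, threshold=AUTO_PROMOTE_THRESHOLD):
    """Route every block; the result has one decision per input block."""
    return {block.block_id: route(block, threshold) for block in blocks}


blocks = [NoteBlock("fn-001", 0.95), NoteBlock("fn-002", 0.41)]
decisions = gate(blocks)
print(decisions)
```

Because the gate returns a decision for each input block, the count of evaluated blocks can be compared directly against the count of candidate blocks as an audit figure.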

The 115 integrated footnote nodes are not marginal additions to the provenance graph. Brace-delimited footnotes in the Materia Medica Pura carry secondary action annotations, comparative clinical observations attributed to named authorities, and dose or preparation context that modifies the interpretation of the attached symptom. Their integration at 115 nodes extends the graph by 0.62% in row count while potentially providing interpretive context for a disproportionate share of the corpus’s evidentially significant symptom entries.

4.5. Temporal Onset and Potency Signal

The 8,607 onset records, representing 27.7% of all canonical symptom entries, show that timing information is recoverable for roughly a quarter of the corpus. Retaining the qualitative timing cue category (2,232 records, 25.93% of all onset records) preserved the second-largest onset bucket, behind only the 1h to 12h group. Qualitative cues carry sequence and phase information that locates a symptom within the proving timeline even without a measurable interval.

The potency-signal classification result — 168 of 169 events (99.41%) assigned to the potency-context class — resolves an ambiguity that affects any computational workflow applied to scanned historical medical texts. The OCR normalization finding reinforces the same point: character-level OCR errors in scanned editions are classifiable and correctable when local textual context is available.
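The proportions quoted in this subsection follow directly from the reported counts, and the arithmetic can be reproduced in a few lines. The variable and function names are illustrative; the counts are those reported in this manuscript.

```python
CANONICAL_SYMPTOMS = 31_086
ONSET_RECORDS = 8_607
QUALITATIVE_ONSET_RECORDS = 2_232
POTENCY_CONTEXT_EVENTS = 168
POTENCY_EVENTS = 169


def percent(part, whole, digits=2):
    """Return part/whole as a percentage rounded to the given digits."""
    return round(100 * part / whole, digits)


onset_coverage = percent(ONSET_RECORDS, CANONICAL_SYMPTOMS, digits=1)
qualitative_share = percent(QUALITATIVE_ONSET_RECORDS, ONSET_RECORDS)
potency_share = percent(POTENCY_CONTEXT_EVENTS, POTENCY_EVENTS)
print(onset_coverage, qualitative_share, potency_share)
```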

4.6. Non-Duplication and Complementarity with the Practitioner Track

This framework addresses upstream corpus governance, provenance normalization, and auditable transformation. It does not replicate repertory navigation, which operates over pre-structured symptom-to-remedy indices constructed through editorial judgment rather than computational derivation from proving text. Repertory tools answer “which remedies cover this symptom cluster?” This framework answers “what is the evidential chain behind a given symptom in the source proving text, who contributed it, and how does it relate computationally to corresponding entries in other remedies?” The two questions are complementary and sequential, not redundant.

This manuscript (M1) provides the methods foundation and evidence anchor for the companion manuscript (M2), which translates these computational outputs into practitioner-facing interpretive workflows. The boundary between the two manuscripts is maintained throughout: M1 contains no practitioner guidance, and M2 introduces no new computational methods.

4.7. Governance Implications

Three governance mechanisms operate across this pipeline and each addresses a distinct failure mode. Section-layer conservation validation detects silent analytical drift. Promotion-gated editorial integration prevents unreviewed or ambiguous footnote content from entering the evidence graph as authoritative material. PI-controlled publication review maintains human authority over the claims advanced in the final submitted manuscript. Together these mechanisms convert what would otherwise be implicit quality assumptions into explicit, reported, and verifiable states.

Conclusion

The framework presented here converts an unstructured historical proving corpus into a machine-traceable analytical structure with three independently verifiable properties: source-to-output parity confirmed at 67 of 67 remedy files, section-layer conservation confirmed at PASS with zero mismatch, and base-graph structural integrity confirmed by exact correspondence between edge totals and row count. These are not quality claims made post-hoc; they are preconditions that the pipeline enforces mechanically and reports explicitly.

The governance design (promotion-gated editorial integration, conservation validation, and PI-controlled publication review) responds to the particular exposure of computational work on historical corpora to silent analytical drift. Each mechanism turns an otherwise implicit quality assumption into an explicit, reported, and verifiable state.

The similarity and provenance outputs are substantive as well as structural. The recovery of Hyoscyamus Niger::MAIN and Stramonium::MAIN as the corpus’s strongest section-level pair offers a limited form of external validation that the computational method tracks meaningful signal in the source texts. Nash groups these two remedies with Belladonna as a classical delirium triad, differentiating them by delirium grade and alternation pattern (Nash 1913); the section-level computation recovers the Hyoscyamus and Stramonium pairing within that triad from unstructured proving prose. The provenance graph’s 2,732 authority entities and 16,764 symptom-level edges deliver prover attribution at a resolution that would be impractical to achieve by manual review at this corpus scale.

This work does not replicate repertory navigation. It addresses the upstream corpus layer — governance, provenance normalization, and auditable transformation — that comparative and practitioner-facing interfaces require before their evidential claims can be traced to source. The computational outputs reported here serve as the methods and evidence foundation for the companion manuscript (M2), which translates these outputs into practitioner-facing interpretive workflows.

This manuscript reports computational methods and text analytics only. No therapeutic claims are made.

Limitations

The framework is implemented against a single corpus — Hahnemann’s Materia Medica Pura — in a specific source edition configuration. Findings concerning corpus structure, symptom distribution, and authority attribution patterns are specific to this repository implementation and cannot be generalised without further analysis to other materia medica texts or to alternative editions of the same work. Edition boundaries and translation choices embedded in the source texts propagate through all downstream analytical layers.

Parsing and canonicalization operate under constraints inherent to the historical source texts. Authority strings in the proving tradition are inconsistently abbreviated and variably formatted; the normalisation workflow resolves the majority of these variants but cannot guarantee that all ambiguous authority strings have been resolved to the correct entity. Near-tie similarity rankings are sensitive to canonicalization decisions that affect which symptom texts are present in each section vector. Rankings within a narrow band should be treated as indicative rather than definitive orderings.
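One way to make the near-tie caveat operational is to group adjacent ranked pairs whose scores fall within a small tolerance of the group leader and to report those groups as unordered. The tolerance value, the pair labels, and the scores below are hypothetical and serve only to show the grouping logic.

```python
def near_tie_groups(scores, tolerance=1e-4):
    """Group ranked pairs whose scores lie within tolerance of the group leader.

    scores: dict mapping a pair label to its similarity score.
    Returns only groups with more than one member, in descending score order.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    groups, current = [], []
    for label, score in ranked:
        # Start a new group once the gap to the current leader exceeds tolerance.
        if current and current[0][1] - score > tolerance:
            groups.append(current)
            current = []
        current.append((label, score))
    if current:
        groups.append(current)
    return [group for group in groups if len(group) > 1]


# Toy scores with placeholder pair labels (not corpus data).
toy_scores = {"A|B": 0.02580, "A|C": 0.02579, "B|C": 0.01010}
ties = near_tie_groups(toy_scores)
print(ties)
```

Pairs returned in the same group would be displayed as tied rather than ranked, which matches the indicative reading recommended above.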

Temporal onset and footnote coverage are sparse relative to the total corpus. The 8,607 timed onset records represent 27.7% of canonical symptom entries; the remaining 72.3% of symptoms carry no recoverable onset timing information. Footnote-prose blocks are distributed unevenly across the 67 remedies. Analytical queries that depend on temporal or footnote-derived signals will be limited by this coverage rather than by pipeline design.

Similarity between remedy-section representations is a computational property of the corpus texts; it does not imply therapeutic equivalence or clinical substitutability.

Data Availability

Statistical outcomes, charts, and visualization outputs are fully accessible within the supplementary materials accompanying this publication. In accordance with the HomeoAnalytics data governance policy, underlying pipeline processing scripts, platform integration details, and raw analytical outputs are retained within the secure research environment to safeguard proprietary methodology.

Research access (application required): The full underlying dataset is subject to a module-specific phased release schedule. Access may be granted to verified academic researchers upon reasonable request for non-commercial validation. Applications should be submitted to slatif37@gmail.com and must include an institutional affiliation and a detailed statement of research purpose. The review period for applications is 7–10 business days.

Code availability: Analysis scripts were developed as part of the HomeoAnalytics research infrastructure and are retained within the project environment. They are not publicly distributed at this stage. Research outputs enabling result verification are provided in the associated supplementary materials.

AI Disclosure and Contribution Statement

AI tools assisted with drafting, restructuring, and consistency checking. The PI retained final scientific and editorial authority. The final submission text requires PI review, editing or rewriting as needed, and PI authorization.

References

Hogan, Aidan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, et al. Knowledge Graphs. San Rafael, CA: Morgan & Claypool, 2021.
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge: Cambridge University Press, 2008.
Moreau, Luc, and Paolo Missier, eds. PROV-DM: The PROV Data Model. W3C Recommendation. World Wide Web Consortium, 2013. https://www.w3.org/TR/prov-dm/.
Nash, E.B. Leaders in Homoeopathic Therapeutics. 4th ed. Philadelphia: Boericke & Tafel, 1913.
Piotrowski, Michael. Natural Language Processing for Historical Texts. San Rafael, CA: Morgan & Claypool, 2012.
Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig. “Ten Simple Rules for Reproducible Computational Research.” PLOS Computational Biology 9, no. 10 (2013): e1003285.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.