Preprint
Article

This version is not peer-reviewed.

Privacy-Preserving Structured Knowledge Extraction from Census-Style Records Using a Hierarchical Multi-Agent Open-Weight LLM Architecture

Submitted: 19 May 2026

Posted: 20 May 2026

Abstract
Census-style records contain sensitive personal information that must be transformed into structured fields before it can support record linkage, geocoding, duplicate detection, identity verification, and statistical processing. This task is challenging because addresses appear in many forms, including standard street addresses, apartment and unit addresses, university addresses, military APO/FPO/DPO addresses, rural routes, highway addresses, and attention-line records. Rule-based parsers are efficient and transparent but often fail on non-standard formats. Single-prompt Large Language Model (LLM) approaches improve generalization but can suffer from record skipping, field conflation, and long-context degradation when processing heterogeneous documents. In addition, privacy and governance requirements limit the use of cloud-hosted models for sensitive census-style data. This paper presents a three-stage hierarchical multi-agent architecture for structured knowledge extraction from census-style records using a locally deployed open-weight LLM. A Planner Agent analyzes the input document and formulates an extraction strategy. A Manager Agent converts this strategy into a dependency-aware task graph. A fleet of eight specialized Worker Agents performs extraction, validation, and formatting, while a bounded feedback loop supports limited autonomous recovery from extraction failures. The system runs on gpt-oss-20b, a 21-billion-parameter open-weight model deployed on local infrastructure, so input records do not need to be transmitted to an external model provider. The system is evaluated on 700 synthetically generated records across seven address categories. It achieves 95.7% component-level exact match accuracy, compared with 52.3% for a rule-based baseline and 80.9% for a single-prompt LLM baseline using the same model. The largest improvements occur on challenging non-standard categories, including highway addresses (93% vs. 11% rule-based), military addresses (91% vs. 28%), and attention-line records (90% vs. 22%). The results suggest that multi-agent decomposition can improve the robustness and completeness of open-weight LLM extraction while preserving the privacy advantages of on-premise deployment. The study should be interpreted as a prototype evaluation on synthetic data rather than a production-readiness claim.
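The control flow described in the abstract (a dependency-aware task graph executed by workers, with a bounded feedback loop) can be illustrated with a minimal sketch. This is not the authors' implementation: the three worker functions, the task names, the graph shape, and the retry bound `MAX_RETRIES` are all hypothetical stand-ins, and the LLM calls are replaced by simple string operations so the example runs offline.

```python
from graphlib import TopologicalSorter

MAX_RETRIES = 2  # assumed bound; the abstract does not state the actual limit


def extract(record, state):
    # Hypothetical worker standing in for an LLM extraction call.
    number, _, street = record.partition(" ")
    return {"number": number, "street": street}


def validate(record, state):
    # Hypothetical worker: reject extractions with a non-numeric house number.
    fields = state["extract"]
    if not fields["number"].isdigit():
        raise ValueError("house number is not numeric")
    return fields


def format_output(record, state):
    # Hypothetical worker: normalise the casing of the validated fields.
    return {key: value.upper() for key, value in state["validate"].items()}


WORKERS = {"extract": extract, "validate": validate, "format": format_output}

# Task graph as {task: prerequisite tasks}, the shape a manager step might emit.
TASK_GRAPH = {"extract": set(), "validate": {"extract"}, "format": {"validate"}}


def run_pipeline(record, graph=TASK_GRAPH, workers=WORKERS, max_retries=MAX_RETRIES):
    """Run tasks in dependency order, retrying each failed task a bounded number of times."""
    state, failures = {}, {}
    for task in TopologicalSorter(graph).static_order():
        # Do not run a task whose prerequisites failed.
        if any(dep in failures for dep in graph[task]):
            failures[task] = "skipped: upstream failure"
            continue
        for _attempt in range(max_retries + 1):
            try:
                state[task] = workers[task](record, state)
                break
            except ValueError as err:
                last_error = str(err)
        else:
            # Every attempt failed: record the error and stop retrying.
            failures[task] = last_error
    return state, failures
```

Calling `run_pipeline("123 Main St")` returns the formatted fields with no failures, while `run_pipeline("PO Box 12")` exhausts its retries in the validation step and marks the formatting step as skipped. Because these stand-in workers are deterministic, a retry cannot change the outcome; with a sampled LLM call or a revised prompt on each attempt, the same bounded loop would give the system a limited chance to recover before reporting the failure.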
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated