Preprint | Article | Version 1 | Preserved in Portico | This version is not peer-reviewed

A Bag-of-Words Approach for Information Extraction from Electricity Invoices

Version 1 : Received: 22 May 2024 / Approved: 23 May 2024 / Online: 24 May 2024 (09:33:54 CEST)

How to cite: Sánchez, J.; Cuervo-Londoño, G. A. A Bag-of-Words Approach for Information Extraction from Electricity Invoices. Preprints 2024, 2024051564. https://doi.org/10.20944/preprints202405.1564.v1

Abstract

In an era marked by digitization and automation, extracting relevant information from business documents such as electricity invoices remains a major challenge. Machine learning techniques can automate this process, reducing manual labor and minimizing errors. This work introduces a new method for extracting key values from this type of invoice, including customer personal data, the bill breakdown, electricity consumption, and marketer data. We evaluate several machine learning techniques, such as Naive Bayes, Logistic Regression, Random Forests, and Support Vector Machines. Our approach uses a bag-of-words strategy and custom-designed features tailored to electricity data. We tested the method on the IDSEM dataset, which includes 75,000 electricity invoices with eighty-six different fields. The method converts PDF invoices into text and processes each word separately using a context of eleven words. In our experiments, Support Vector Machines and Random Forests capture numerous values with high precision. The study also explores the advantages of our custom features and evaluates the performance of the model on unseen documents. The precision obtained with Support Vector Machines is 91.86%, peaking at 98.47% for one document template. These results demonstrate the effectiveness of our method in accurately extracting key values from invoices.
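
As a rough illustration of the per-word processing described in the abstract, the sketch below builds an eleven-word context window around each word and turns it into a bag-of-words count vector. It is not the authors' implementation: the symmetric window (five words on each side), the `<pad>` token, the helper names `context_windows` and `bag_of_words`, and the toy vocabulary are all assumptions made for the example.

```python
from collections import Counter

PAD = "<pad>"  # hypothetical padding token for positions outside the document


def context_windows(words, size=11):
    """Return one window of `size` tokens per word, centered on that word.

    Assumption: the context is symmetric (size // 2 words on each side);
    the abstract only states that a context of eleven words is used.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the window has a center")
    half = size // 2
    padded = [PAD] * half + list(words) + [PAD] * half
    return [padded[i:i + size] for i in range(len(words))]


def bag_of_words(window, vocabulary):
    """Count occurrences of each vocabulary term in the window (word order is discarded)."""
    counts = Counter(token.lower() for token in window)
    return [counts[term] for term in vocabulary]


# Toy example: the text and vocabulary are invented for illustration.
words = "Total amount due 45.30 EUR".split()
vocabulary = ["total", "amount", "due", "eur"]
windows = context_windows(words)
features = [bag_of_words(w, vocabulary) for w in windows]
```

Each row of `features` could then be combined with any additional hand-crafted features and passed to a classifier, such as a Support Vector Machine, that assigns a field label to the central word.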

Keywords

electricity invoice; information extraction; semi-structured document; machine learning; support vector machine; random forests; decision tree; logistic regression

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
