Preprint. Article. Version 3. Preserved in Portico. This version is not peer-reviewed.

Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference

Version 1 : Received: 15 February 2024 / Approved: 15 February 2024 / Online: 15 February 2024 (15:41:17 CET)
Version 2 : Received: 16 February 2024 / Approved: 16 February 2024 / Online: 16 February 2024 (11:04:53 CET)
Version 3 : Received: 16 February 2024 / Approved: 16 February 2024 / Online: 17 February 2024 (09:06:25 CET)
Version 4 : Received: 19 February 2024 / Approved: 20 February 2024 / Online: 20 February 2024 (09:09:43 CET)
Version 5 : Received: 22 February 2024 / Approved: 23 February 2024 / Online: 24 February 2024 (08:49:48 CET)
Version 6 : Received: 14 March 2024 / Approved: 17 March 2024 / Online: 18 March 2024 (07:25:57 CET)

A peer-reviewed article of this Preprint also exists.

García, Wilfredo. ‘Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference’. 1 Jan. 2024 : 1 – 18. DOI: 10.3233/DS-240062.

Abstract

The CSV format was devised for human-readable simplicity, and no standard strictly defines it. As a result, many dialect variants have proliferated. Consequently, exchanging information between data management systems, or between countries and regions, requires human intervention during data mining and cleansing. This has led to the development of various computational tools that aim to determine the dialect of a CSV file accurately, in order to avoid data loss when a system loads the data. However, current tools have limitations and rely on assumptions that need to be improved or extended. This paper proposes a method for determining CSV file dialects through table uniformity, a statistical approach that measures table consistency and record dispersion and detects the data type of each field. The new method achieves 100% accuracy on a dataset of 147 sample CSV files from a benchmark framework. Furthermore, the proposed method is accurate enough to determine dialects by reading only ten records; it requires more data only to disambiguate cases where the first records do not contain enough information to determine the dialect.
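
To make the general idea concrete, the following is a minimal sketch of scoring candidate delimiters by how uniform the resulting table is and by how many cells look typed. It is not the author's implementation: the candidate list, the regular expression, the scoring formula, and the names `score_delimiter` and `detect_delimiter` are all assumptions made for illustration, and the paper's actual uniformity and dispersion measures are not reproduced here.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical candidate set; the abstract does not list the candidates used.
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]

# Crude, assumed type patterns: numbers, ISO dates, and booleans.
TYPED_CELL = re.compile(
    r"^\s*(-?\d+([.,]\d+)?|\d{4}-\d{2}-\d{2}|true|false)\s*$",
    re.IGNORECASE,
)


def score_delimiter(text, delimiter, max_records=10):
    """Score a delimiter from the first `max_records` non-empty records."""
    rows = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if row:  # skip blank lines
            rows.append(row)
        if len(rows) >= max_records:
            break
    if not rows:
        return 0.0

    # Consistency: share of records that have the most common field count.
    widths = Counter(len(r) for r in rows)
    modal_width, modal_count = widths.most_common(1)[0]
    if modal_width < 2:
        return 0.0  # a single column means the delimiter never split a record
    consistency = modal_count / len(rows)

    # Type signal: share of cells that match a known data type pattern.
    cells = [cell for r in rows for cell in r]
    typed = sum(1 for cell in cells if TYPED_CELL.match(cell)) / len(cells)

    # Assumed combination rule, chosen only for this sketch.
    return consistency * (1.0 + typed)


def detect_delimiter(text, max_records=10):
    """Return the candidate delimiter with the highest score."""
    return max(
        CANDIDATE_DELIMITERS,
        key=lambda d: score_delimiter(text, d, max_records),
    )


sample = "id;price;date\n1;3,50;2024-01-01\n2;4,25;2024-01-02\n"
print(repr(detect_delimiter(sample)))  # prints ';'
```

In this sample the comma appears inside decimal values, so splitting on it yields ragged records with no typed cells, while the semicolon yields three uniform columns with typed values. A fuller treatment would also consider quote and escape characters, which this sketch leaves at the `csv` module defaults.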

Keywords

comma separated values; CSV dialect detection; data mining

Subject

Computer Science and Mathematics, Data Structures, Algorithms and Complexity
