Submitted:
19 February 2024
Posted:
20 February 2024
Read the latest preprint version here
Abstract
Keywords:
1. Introduction


2. Related work
3. Problem formulation
3.1.1. Potential dialect boundaries
4. Table uniformity
5. Type detection
- Time and date: matching regular dates and time format, as well stamped ones like MM/DD/YYYY[YYYY/MM/DD] HH:MM: SS +/- HH:MM
- Numeric: matching all numeric data supported by the implementation language selected.
- Percentage.
- Alphanumeric: matching numbers, ASCII letters and underscore.
- Currency
- Especial data: like “n/a” or empty strings
- Email.
- System paths.
- Structured scripts data types: matching JSON arrays and data delimited by parentheses, curly and square brackets.
- Numeric lists: matching fields with numeric values delimited with common separator character.
- URLs.
- IPv4.
6. Table scoring
7. Determining CSV file dialects
| Algorithm 1: Dialect Determination |
| Input: CSV content , expected number of records to import |
| Output: the dialect the that produces the more accurate table |
|
1. function determine (, ): 2. StartDialects () 3. for in do 4. ⊳Parsing 5. TScore () 6. return GetBestDialect () 7. end function |
dialect with this character. As the author points out, the CSV file is comma delimited, using double quotes as the quote and escape character, then the file is compliant with RFC-4180 specifications. When running dialect detection, CleverCSV gets the vertical bar "|" as the delimiter as this field pattern gets a score vs a from the patterns with the "," character as delimiter. This behavior is due to the fact that the logic used strongly weights the count of delimiters over the detected data types. This behavior is due to the fact that the logic used strongly weights the delimiter count over the types of data detected, where dialects containing the comma as delimiter obtain a type score of against the type score of obtained by dialects with the vertical bar as delimiter.| Algorithm 4: Table Uniformity |
| Input: CSV table with records containing fields. |
| Output: the table uniformity factors |
|
1. function TUniformity (): 2. AverageFields () 3. for to do 4. ⊳Deviations 5. if then 6. ⊳Sentinel 1 7. else 8. if then 9. ⊳Sentinel 2 10. if then 11. 12. 13. else 14. 15. if then 16. if then 17. 18. if then 19. 20. else 21. 22. 23. ⊳Range 24. if then 25. 26. 27. return 28. end function |
8. Experiments
8.1.Dialect detection accuracy
9. Discussion
9.1. Heuristic
9.2. CSV parser basis
10. Appendix: algorithms pseudocode
| Algorithm 2: Table Score |
| Input: CSV table with records, expected number of records to import |
| Output: the score for the given table |
|
1. function TScore (, ): 2. SumScore () 3. if then 4. TUniformity () 5. return 6. else 7. 8. return 9. end function |
| Algorithm 3: Sum of Records Score |
| Input: CSV table with records containing fields. |
| Output: the sum of records score for the given table |
|
1. function SumScore (): 2. for to do 3. for to do 4. if KnownDataType () then 5. 6. else 7. 8. ( ) 9. return 10. end function |
References
- Y. Shafranovich, “Common Format and MIME Type for Comma-Separated Values (CSV) Files,” IETF. Accessed: Jul. 23, 2021. [Online]. Available: https://datatracker.ietf.org/doc/rfc4180/.
- Library of Congress, “CSV, Comma Separated Values (RFC 4180),” LOC. [Online]. Available: https://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml.
- C. Sutton, T. Hobson, J. Geddes, and R. Caruana, “Data Diff: Interpretable, Executable Summaries of Changes in Distributions for Data Wrangling,” presented at the Knowledge Discovery and Data Mining Conference, London, United Kingdom, Aug. 2018. Accessed: Jul. 23, 2021. [Online]. Available: https://www.turing.ac.uk/research/publications/data-diff-interpretable-executable-summaries-changes-distributions-data.
- J. Mitlohner, S. Neumaier, J. Umbrich, and A. Polleres, “Characteristics of Open Data CSV Files,” in 2016 2nd International Conference on Open and Big Data (OBD), Vienna: IEEE, Aug. 2016, pp. 72–79. [CrossRef]
- G. J. J. van den Burg, A. Nazábal, and C. Sutton, “Wrangling messy CSV files by detecting row and type patterns,” Data Min. Knowl. Discov., vol. 33, no. 6, pp. 1799–1820, Nov. 2019. [CrossRef]
- T. Döhmen, H. Mühleisen, and P. Boncz, “Multi-Hypothesis CSV Parsing,” in Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago IL USA: ACM, Jun. 2017, pp. 1–12. [CrossRef]
- Alagiannis, R. Borovica-Gajic, M. Branco, S. Idreos, and A. Ailamaki, “NoDB: efficient query execution on raw data files,” Commun. ACM, vol. 58, no. 12, pp. 112–121, Nov. 2015. [CrossRef]
- M. Karpathiotakis, M. Branco, I. Alagiannis, and A. Ailamaki, “Adaptive query processing on RAW data,” Proc. VLDB Endow., vol. 7, no. 12, pp. 1119–1130, Aug. 2014. [CrossRef]
- S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki, “Here are my data files. Here are my queries. Where are my results?,” in Proceedings of 5th Biennial Conference on Innovative Data Systems Research, Asilomar, California, USA, Jan. 2011, pp. 57–68. Accessed: Jul. 24, 2021. [Online]. Available: https://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf.
- Dutch Stichting DuckDB Foundation, “DUCKDB.” Dutch Stichting DuckDB Foundation, Amsterdam NL, 13 2023. Accessed: Feb. 04, 2024. [Online]. Available: https://duckdb.org/docs/archive/0.9.2/.
- C. Christodoulakis, E. B. Munson, M. Gabel, A. D. Brown, and R. J. Miller, “Pytheas: pattern-based table discovery in CSV files,” Proc. VLDB Endow., vol. 13, no. 12, pp. 2075–2089, Aug. 2020. [CrossRef]
- L. Hübscher, L. Jiang, and F. Naumann, “ExtracTable: Extracting Tables from Raw Data Files,” 2023. [CrossRef]
- M. F. Al-Saleh and A. E. Yousif, “Properties of the Standard Deviation that are Rarely Mentioned in Classrooms,” Austrian J. Stat., vol. 38, no. 3, Apr. 2016. [CrossRef]
- G. Vitagliano, M. Hameed, L. Jiang, L. Reisener, E. Wu, and F. Naumann, “Pollock: A Data Loading Benchmark,” Proc. VLDB Endow., vol. 16, no. 8, pp. 1870–1882, Apr. 2023. [CrossRef]
- T. Petricek, G. J. J. V. D. Burg, A. Nazábal, T. Ceritli, E. Jiménez-Ruiz, and C. K. I. Williams, “AI Assistants: A Framework for Semi-Automated Data Wrangling,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 9, pp. 9295–9306, Sep. 2023. [CrossRef]
- Rufus Pollock, “Data Package (v1),” CSV Dialect. Accessed: May 10, 2023. [Online]. Available: https://specs.frictionlessdata.io/csv-dialect/.


| Method | Success | Erroneous |
| CleverCSV | 94.59% | 5.41% |
| Actual (10R) | 99.32% | 0.68% |
| Actual (25R) | 99.32% | 0.68% |
| Actual (50R) | 100% | 0.00% |
| 1 | An analysis of a 413 GB data body found CSV files available for download on 232 portals [4]. |
| 2 | In most applications the record delimiter (υ_r) is not considered, as modern systems handle new lines discrepancies internally. |
| 3 | Segmented mode refers to the use of segments of the sample, which are defined as the data undergoes dispersion. |
| 4 | |
| 5 | GitHub repositories: https://github.com/ws-garcia/CSVsniffer, https://github.com/ws-garcia/VBA-CSV-interface
|
| 6 | An open-source tool for working with messy data: https://openrefine.org/
|
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).