Submitted:
22 February 2024
Posted:
24 February 2024
Abstract
Keywords:
1. Introduction
2. Related work
3. Problem formulation
3.1. Potential dialect boundaries
4. Table uniformity
5. Type detection
- Time and date: matching regular date and time formats, as well as timestamped ones such as MM/DD/YYYY [YYYY/MM/DD] HH:MM:SS +/-HH:MM.
- Numeric: matching all numeric data supported by the selected implementation language.
- Percentage.
- Alphanumeric: matching numbers, ASCII letters, and underscores.
- Currency.
- Special data: such as “n/a” or empty strings.
- Email.
- System paths.
- Structured script data types: matching JSON arrays and data delimited by parentheses, curly braces, or square brackets.
- Numeric lists: matching fields with numeric values delimited by a common separator character.
- URLs.
- IPv4.
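As an illustration of how such a list of types can drive field classification, the following is a minimal sketch in Python. The regular expressions, the category names, the matching order, and the `detect_type` helper are all assumptions made for this example; they are not the expressions used by the implementation described in this paper.

```python
import re

# Hypothetical patterns, one per category listed above. Order matters:
# the first pattern that matches the whole field wins.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
NUMBER = r"[+-]?\d+(?:\.\d+)?"
PATTERNS = [
    ("special", re.compile(r"(?:n/a|null)?", re.IGNORECASE)),
    ("datetime", re.compile(
        r"(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}/\d{1,2}/\d{1,2})"
        r"(?: \d{2}:\d{2}:\d{2}(?: ?[+-]\d{2}:\d{2})?)?")),
    ("percentage", re.compile(NUMBER + r"\s?%")),
    ("currency", re.compile(r"[+-]?[$€£¥]\s?\d+(?:[.,]\d+)*")),
    ("numeric", re.compile(r"[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?")),
    ("ipv4", re.compile(OCTET + r"(?:\." + OCTET + r"){3}")),
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("url", re.compile(r"(?:https?|ftp)://\S+")),
    ("path", re.compile(r"(?:[A-Za-z]:\\|/|\.{1,2}/)\S*")),
    ("structured", re.compile(r"\[.*\]|\{.*\}|\(.*\)")),
    ("numeric_list", re.compile(NUMBER + r"(?:[;,| ]" + NUMBER + r")+")),
    ("alphanumeric", re.compile(r"[A-Za-z0-9_]+")),
]


def detect_type(field: str) -> str:
    """Return the first category whose pattern matches the whole field."""
    value = field.strip()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(value):
            return name
    return "text"  # fallback for fields that match no known type


if __name__ == "__main__":
    for sample in ["02/24/2024 10:15:00 +01:00", "12.5%", "n/a", "192.168.0.1"]:
        print(repr(sample), "->", detect_type(sample))
```

Because the categories overlap (for example, `42` is both numeric and alphanumeric), the order of the patterns decides the result; in this sketch the more specific types are tried first and the alphanumeric pattern last.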
6. Table scoring
7. Determining CSV file dialects
| Algorithm 1: Dialect Determination |

| Algorithm 4: Table Uniformity |
8. Experiments
8.1. Dialect detection accuracy
9. Discussion
9.1. Heuristic
9.2. CSV parser basis
10. Appendix: algorithms pseudocode
| Algorithm 2: Table Score |
| Algorithm 3: Sum of Records Score |
11. References
- Y. Shafranovich, “Common Format and MIME Type for Comma-Separated Values (CSV) Files,” IETF. Accessed: Jul. 23, 2021. [Online]. Available: https://datatracker.ietf.org/doc/rfc4180/.
- Library of Congress, “CSV, Comma Separated Values (RFC 4180),” LOC. [Online]. Available: https://www.loc.gov/preservation/digital/formats/fdd/fdd000323.shtml.
- C. Sutton, T. Hobson, J. Geddes, and R. Caruana, “Data Diff: Interpretable, Executable Summaries of Changes in Distributions for Data Wrangling,” presented at the Knowledge Discovery and Data Mining Conference, London, United Kingdom, Aug. 2018. Accessed: Jul. 23, 2021. [Online]. Available: https://www.turing.ac.uk/research/publications/data-diff-interpretable-executable-summaries-changes-distributions-data.
- J. Mitlohner, S. Neumaier, J. Umbrich, and A. Polleres, “Characteristics of Open Data CSV Files,” in 2016 2nd International Conference on Open and Big Data (OBD), Vienna: IEEE, Aug. 2016, pp. 72–79. doi: 10.1109/OBD.2016.18.
- G. J. J. van den Burg, A. Nazábal, and C. Sutton, “Wrangling messy CSV files by detecting row and type patterns,” Data Min. Knowl. Discov., vol. 33, no. 6, pp. 1799–1820, Nov. 2019, doi: 10.1007/s10618-019-00646-y.
- T. Döhmen, H. Mühleisen, and P. Boncz, “Multi-Hypothesis CSV Parsing,” in Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago IL USA: ACM, Jun. 2017, pp. 1–12. doi: 10.1145/3085504.3085520.
- I. Alagiannis, R. Borovica-Gajic, M. Branco, S. Idreos, and A. Ailamaki, “NoDB: efficient query execution on raw data files,” Commun. ACM, vol. 58, no. 12, pp. 112–121, Nov. 2015, doi: 10.1145/2830508.
- M. Karpathiotakis, M. Branco, I. Alagiannis, and A. Ailamaki, “Adaptive query processing on RAW data,” Proc. VLDB Endow., vol. 7, no. 12, pp. 1119–1130, Aug. 2014, doi: 10.14778/2732977.2732986.
- S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki, “Here are my data files. Here are my queries. Where are my results?,” in Proceedings of 5th Biennial Conference on Innovative Data Systems Research, Asilomar, California, USA, Jan. 2011, pp. 57–68. Accessed: Jul. 24, 2021. [Online]. Available: https://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf.
- Dutch Stichting DuckDB Foundation, “DUCKDB.” Dutch Stichting DuckDB Foundation, Amsterdam NL, 13 2023. Accessed: Feb. 04, 2024. [Online]. Available: https://duckdb.org/docs/archive/0.9.2/.
- C. Christodoulakis, E. B. Munson, M. Gabel, A. D. Brown, and R. J. Miller, “Pytheas: pattern-based table discovery in CSV files,” Proc. VLDB Endow., vol. 13, no. 12, pp. 2075–2089, Aug. 2020, doi: 10.14778/3407790.3407810.
- L. Hübscher, L. Jiang, and F. Naumann, “ExtracTable: Extracting Tables from Raw Data Files,” 2023, doi: 10.18420/BTW2023-20.
- M. F. Al-Saleh and A. E. Yousif, “Properties of the Standard Deviation that are Rarely Mentioned in Classrooms,” Austrian J. Stat., vol. 38, no. 3, Apr. 2016, doi: 10.17713/ajs.v38i3.272.
- G. Vitagliano, M. Hameed, L. Jiang, L. Reisener, E. Wu, and F. Naumann, “Pollock: A Data Loading Benchmark,” Proc. VLDB Endow., vol. 16, no. 8, pp. 1870–1882, Apr. 2023, doi: 10.14778/3594512.3594518.
- T. Petricek, G. J. J. van den Burg, A. Nazábal, T. Ceritli, E. Jiménez-Ruiz, and C. K. I. Williams, “AI Assistants: A Framework for Semi-Automated Data Wrangling,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 9, pp. 9295–9306, Sep. 2023, doi: 10.1109/TKDE.2022.3222538.
- Rufus Pollock, “Data Package (v1),” CSV Dialect. Accessed: May 10, 2023. [Online]. Available: https://specs.frictionlessdata.io/csv-dialect/.
| 1 | An analysis of a 413 GB data body found CSV files available for download on 232 portals [4]. |
| 2 | In most applications the record delimiter (𝜐r) is not considered, as modern systems handle newline discrepancies internally. |
| 3 | Segmented mode refers to the use of segments of the sample, which are defined as the data undergoes dispersion. |
| 4 | https://github.com/alan-turing-institute/CleverCSV/issues/99 |
| 5 | GitHub repositories: https://github.com/ws-garcia/CSVsniffer, https://github.com/ws-garcia/VBA-CSV-interface |
| 6 | An open-source tool for working with messy data: https://openrefine.org/ |
| 7 | https://clevercsv.readthedocs.io/en/latest/source/clevercsv.html#module-clevercsv.normal_form |




| Method | Success % | Erroneous % |
|---|---|---|
| Actual (10R) | 99.32 | 0.68 |
| Actual (25R) | 99.32 | 0.68 |
| Actual (50R) | 100.00 | 0.00 |
| CleverCSV | 94.59 | 5.41 |

| Method | Success % | Erroneous % |
|---|---|---|
| Actual (10R) | 88.83 | 11.17 |
| Actual (25R) | 89.39 | 10.61 |
| Actual (50R) | 88.83 | 11.17 |
| CleverCSV | 79.58 | 20.42 |

| Method | Success % | Erroneous % |
|---|---|---|
| Actual (10R) | 86.51 | 13.49 |
| Actual (25R) | 87.30 | 12.70 |
| Actual (50R) | 87.30 | 12.70 |
| CleverCSV | 76.98 | 23.02 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).



