Submitted: 25 March 2026
Posted: 27 March 2026
Abstract
Keywords:
1. Introduction
2. State of Knowledge
2.1. Datasets on agriculture
2.2. HuggingFace as open data space
2.3. Research Gap
3. Materials and Methods
3.1. Research Questions
1. Which agricultural datasets are available on HF?
2. What metadata is typical for these datasets?
3.2. Search and Screening Procedure
3.3. Data Extraction and Enrichment
- Author classification: The classification is based solely on information provided by the dataset authors themselves; this self-reported information was taken at face value.
- Thematic categorisation: Some datasets were difficult to categorise unambiguously. Datasets in languages other than German or English were sampled and translated using Google Translate; these translations were judged insufficiently informative to allow reliable categorisation of entire datasets. Where no clear determination of content could be made, the category “Unclear” was assigned (see Section 5.1).
- Primary language: Each dataset was assigned one primary language. Where no language signal was present (e.g., only an empty file in the dataset) or only formal indicators were available (e.g., English column headers with numeric content), a best-effort assignment was made. For translation datasets, the source language was designated as the primary language.
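The primary-language rule above can be read as an ordered decision procedure. The following is a minimal sketch under stated assumptions: the `LanguageEvidence` record, its field names, and the function name are hypothetical and do not come from the study, and the order of the checks is one plausible reading of the rule rather than the authors' documented procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageEvidence:
    """Hypothetical record of the language signals found in one dataset."""
    translation_source: Optional[str] = None  # source language of a translation dataset
    content_language: Optional[str] = None    # language of the actual text content
    formal_language: Optional[str] = None     # formal indicators only, e.g. column headers


def assign_primary_language(evidence: LanguageEvidence) -> Optional[str]:
    """Return one primary language, or None if no signal is available.

    None marks the cases that the text leaves to a manual best-effort decision.
    """
    # Translation datasets: the source language is the primary language.
    if evidence.translation_source:
        return evidence.translation_source
    # Otherwise prefer the language of the content itself.
    if evidence.content_language:
        return evidence.content_language
    # Fall back to formal indicators such as English column headers.
    return evidence.formal_language


# Illustrative inputs only; these are not records from the study.
print(assign_primary_language(
    LanguageEvidence(translation_source="English", content_language="Turkish")
))
print(assign_primary_language(LanguageEvidence(formal_language="English")))
```

Writing the rule down in this form makes the precedence explicit: a translation source overrides the content language, and formal indicators are used only when nothing else is available.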
3.4. Content Analysis of Crop-Category Datasets
3.5. Automated Metadata Extraction
4. Dataset Overview
5. Results
5.1. Thematic Categories

5.2. Metadata Characteristics
5.3. Lexical Analysis of Crop-Category Datasets
6. Discussion
7. Conclusions
- Prototype development and evaluation: Whether the existing datasets are usable in practice should be tested experimentally. The relatively small amount of material and the relatively high proportion of generic data speak against this. In favour, well-chosen application areas might not require a large corpus for a well-trained AI to support them. In our view, a purely theoretical assessment of whether this data basis is sufficient is not adequate.
- Quality metrics: Systematic quality assessment frameworks covering metadata completeness, format standardisation, and provenance documentation should be developed and applied to HF agricultural datasets before they are used in research or production systems.
- Curated repository pathways: High-quality datasets identified on HF could be transferred to established, editorially curated collections such as AgMIP or OpenAIRE. HF could thereby serve as a low-barrier entry point into the data-publication lifecycle, with curated repositories as the downstream destination for validated, reusable data.
- Text-based agricultural AI: The strong presence of NLP resources on HF—a decisive departure from the image-centric literature reviewed by Kamilaris and Prenafeta-Boldú [6]—identifies text-based agricultural AI as an underexplored but growing research frontier. The advisory and instructional character of the crop-category corpus in particular suggests practical applications in agricultural extension, farmer advisory services, and multilingual chatbot development, warranting dedicated investigation.
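As an illustration of the quality-metrics item above, metadata completeness could be expressed as the share of expected metadata fields that are actually filled in. The sketch below is a hypothetical example, not a proposed standard: the field list in `EXPECTED_FIELDS`, the function name, and the example record are assumptions chosen for illustration.

```python
from typing import Any, Mapping

# Hypothetical checklist of metadata fields; a real framework would define its own.
EXPECTED_FIELDS = ("license", "language", "description", "source", "format")


def metadata_completeness(metadata: Mapping[str, Any]) -> float:
    """Return the fraction of expected fields that carry a non-empty value."""
    filled = sum(
        1
        for field in EXPECTED_FIELDS
        if metadata.get(field) not in (None, "", [], {})
    )
    return filled / len(EXPECTED_FIELDS)


# Illustrative record: three of the five expected fields are filled in.
example = {
    "license": "cc-by-4.0",
    "language": "en",
    "description": "",
    "format": "parquet",
}
print(metadata_completeness(example))  # 0.6
```

A score of this kind covers only completeness; format standardisation and provenance documentation would need their own measures.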
Funding
Author Contributions: Conceptualization
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Heath, T.; Bizer, C. Linked data: Evolving the web into a global data space; Morgan & Claypool Publishers, 2011.
- Heider, N.; Gunreben, L.; Zürner, S.; Schieck, M. A survey of datasets for computer vision in agriculture. In Proceedings of the 45th GIL Annual Conference: Digital Infrastructures for a Sustainable Agricultural, Forestry, and Food Industry, Bonn, 2025; pp. 35–46.
- Lu, Y.; Young, S. A survey of public datasets for computer vision tasks in precision agriculture. Computers and Electronics in Agriculture 2020, 178, 105760.
- Ricciardi, V.; Ramankutty, N.; Mehrabi, Z.; Jarvis, L.; Chookolingo, B. An open-access dataset of crop production by farm size from agricultural censuses and surveys. Data in Brief 2018, 19, 1970–1988.
- Farjon, G.; Huijun, L.; Edan, Y. Deep-learning-based counting methods, datasets, and applications in agriculture: a review. Precision Agriculture 2023, 24, 1683–1711.
- Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Computers and Electronics in Agriculture 2018, 147, 70–90.
- Lhoest, Q.; Villanova del Moral, A.; Jernite, Y.; Thakur, A.; von Platen, P.; Patil, S.; Chaumond, J.; Drame, M.; Plu, J.; Tunstall, L.; et al. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Punta Cana, Dominican Republic, 2021; pp. 175–184.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020; Association for Computational Linguistics; pp. 38–45.
- Ait, A.; Cánovas Izquierdo, J.L.; Cabot, J. On the suitability of Hugging Face Hub for empirical studies. Empirical Software Engineering 2025, 30, 57.
- Jones, J.; Jiang, W.; Synovic, N.; Thiruvathukal, G.; Davis, J. What do we know about Hugging Face? A systematic literature review and quantitative validation of qualitative claims. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, New York, 2024; pp. 13–24.
- Chamorro-Padial, J.; García, R.; Gil, R. A systematic review of open data in agriculture. Computers and Electronics in Agriculture 2024, 219, 108775.
- Rachmann, A.; Poschmann, H.; Weißbeck, L. Hugging Face als Datenraum für landwirtschaftliche Datensets. 2026.
- Copyleft Cultivars. Copyleft Cultivars Nonprofit – Instagram Photos and Videos. 2025. Available online: https://www.instagram.com/copyleftcultivars.
- Caleb DeLeeuw, S. Solshine – Hugging Face Profile. 2025. Available online: https://huggingface.co/Solshine.

| Author | Dataset Name | Entry Counts |
|---|---|---|
| BAAI | IndustryCorpus2_agriculture_forestry_animal_husbandry_fishery | 10,000,000 |
| MikeGreen2710 | agriculture_forestry_4m1_5m6 | 1,527,729 |
| Vadim21221 | Agriculture_Vision | 1,000,000 |
| MikeGreen2710 | agriculture_forestry_100k_1m1 | 1,000,000 |
| MikeGreen2710 | agriculture_forestry_1m1_2m1 | 1,000,000 |
| MikeGreen2710 | agriculture_forestry_2m1_3m1 | 1,000,000 |
| MikeGreen2710 | agriculture_forestry_3m1_4m1 | 1,000,000 |
| BAAI | IndustryCorpus_agriculture | 1,000,000 |
| MikeGreen2710 | 26_04_tho_cu_remote_agriculture_forestry | 914,706 |
| yeniguno | turkish_agriculture_corpus | 205,179 |
| Mxode | Chinese-QA-Agriculture_Forestry_Animal_Husbandry_Fishery | 100,000 |
| meithnav | agriculture | 100,000 |
| MikeGreen2710 | first_100k_agriculture_forestry | 100,000 |
| Keshav022 | Agriculture-Dataset | 100,000 |
| coild-aikosh | Agriculture | 100,000 |
| adaboubvincent | Agriculture-QA-duplicat4 | 56,716 |
| adaboubvincent | Agriculture-QA-duplicat3 | 42,498 |
| muhammad-atif-ali | agriculture-dataset-for-falcon-7b-instruct | 36,063 |
| legacy107 | qa_wikipedia_sentence_transformer_negative_farming | 34,668 |
| ShuklaShreyansh | Agriculture-QA | 28,531 |
| adaboubvincent | Agriculture-QA-duplicat2 | 28,332 |
| adaboubvincent | Agriculture-QA | 28,326 |
| BekiTila | Testing_AgriCultureDataset | 24,716 |
| KisanVaani | agriculture-qa-english-only | 22,615 |
| Kobi-01 | tamil_agriculture_QA | 22,615 |
| muhammad-atif-ali | agriculture-qa-english-llama-2 | 22,615 |
| nieche | turkish_agriculture_QA_llama2_22.6k | 22,615 |
| adaboubvincent | Agriculture-QA-duplicat0 | 14,166 |
| recepbulbul | Law_Sustainability_Education_Agriculture_Turkish_Dataset | 10,000 |
| Apocalypse423 | Agriculture_jiangsu | 10,000 |
| Chhabi | Nepali-Agriculture-QA | 10,000 |
| Hemg | AgricultureLLM | 10,000 |
| Dharine | agriculture-10k | 10,000 |
| FrancophonIA | Termes_agriculture_sylviculture_peche_industrie_alimentaire | 10,000 |
| tahsinsoyak | agriculture-qa-turkish-translated | 10,000 |
| LLMcompe-Team-Watanabe | agriculture-qa-english-only_preprocess | 10,000 |
| legacy107 | qa_wikipedia_augmented_sentence_transformer_negative_farming | 7,183 |
| legacy107 | qa_wikipedia_augmented_sentence_transformer_negative_farming_128 | 7,183 |
| AnuradhaPoddar | agriculture_llama_6k | 5,883 |
| Dharine | agriculture-5k | 5,000 |
| sumukshashidhar-archive | yourbench_agriculture_single_shot_questions_farmer | 4,420 |
| argilla | farming | 1,695 |
| burtenshaw | farming | 1,695 |
| dvgodoy | argilla-farming-cleaned | 1,690 |
| berger123 | kerala_agriculture_dataset | 1,029 |
| Gan1108 | agriculture_upd | 1,000 |
| CGIAR | AgricultureVideosQnA | 1,000 |
| CGIAR | AgricultureVideosQnA2 | 1,000 |
| DigiGreen | AgricultureVideosQnA | 1,000 |
| electricsheepafrica | Africa-Agriculture-forestry-and-fishing-value-added-percentage-of-GDP | 1,000 |
| electricsheepafrica | Africa-Employment-in-agriculture-male-percentage-of-male-employment-modeled-ILO-estimate | 1,000 |
| electricsheepafrica | informal-employment-13th-icls-non-agriculture-for-african-countries | 1,000 |
| electricsheepafrica | informal-employment-19th-icls-agriculture-for-african-countries | 1,000 |
| EvanArlen194 | agriculture_instruct_indonesian | 1,000 |
| Govind222 | Farming-SFT | 1,000 |
| mmazzz | Agriculturetasks | 1,000 |
| ov1n | sinhala-agriculture-gce-alevel-2021 | 1,000 |
| electricsheepafrica | Africa-Employment-in-agriculture-female-percentage-of-female-employment-modeled-ILO-estimate | 1,000 |
| AnuradhaPoddar | AgricultureDataset | 1,000 |
| Bluelilyflower | agriculture_laws_and_regulations | 1,000 |
| caixukun0802 | Agriculture | 1,000 |
| CopyleftCultivars | Natural-Farming-Real-QandA-Conversations-Q1-2024-Update | 1,000 |
| CopyleftCultivars | SemiSynthetic_Data_For_Regenerative_Farming_Agriculture | 1,000 |
| CopyleftCultivars | Semisynthetic_Data_Natural_Farming_Fundamentals | 1,000 |
| hari7261 | agriculture_training_data | 1,000 |
| Harish-as-harry | Agriculture | 1,000 |
| Hercule66 | agriculture-dataset-for-falcon-7b-instruct-cleaned | 1,000 |
| huhucheck | farming | 1,000 |
| ignacioct | farming | 1,000 |
| kshubham | agriculture_data_5k | 1,000 |
| Mahesh2841 | Agriculture | 1,000 |
| muhammad-atif-ali | agricultureQnA-1k-unique-llama-2 | 1,000 |
| PRAKALP-PANDE | PSP-agricultureQnA-1k-unique | 1,000 |
| shahram-ali | Agriculture | 1,000 |
| Shraddhzz | synthetic-agriculture-groq | 1,000 |
| Sony | Hokkaido_Agriculture_Image_Dataset | 1,000 |
| sowmya14 | agriculture_QA | 1,000 |
| Tanishqgupta10 | agriculture | 1,000 |
| YuvrajSingh9886 | Agriculture-Irrigation-QA-Pairs-Dataset | 1,000 |
| YuvrajSingh9886 | Agriculture-Plan-Diseases-QA-Pairs-Dataset | 1,000 |
| YuvrajSingh9886 | Agriculture-Soil-QA-Pairs-Dataset | 1,000 |
| zomd | farming | 1,000 |
| yacinekay | my-agriculture-tips | 1,000 |
| CGIAR | AgricultureVideosTranscript | 1,000 |
| DigiGreen | AgricultureVideosTranscript | 1,000 |
| ahmedsamirio | farming | 1,000 |
| Solshine | Reflection-Tuning-Natural-Farming_Agricultural-Dataset | 1,000 |
| installs | Mahesh2841-Agriculture-Zh | 1,000 |
| Gabbbzzz | potato_farming | 1,000 |
| wali-2121 | agriculture | 1,000 |
| aidev08 | farming-sample | 1,000 |
| ChrisSMurphy | farming | 1,000 |
| jefferylovely | farming | 1,000 |
| roshaan-aq | agriculture_QA | 999 |
| sumukshashidhar-archive | yourbench_agriculture_multihop_questions_gpt4omini_v2 | 756 |
| aksakalalper | agriculture_dataset | 500 |
| ipranavks | agriculturevlm | 420 |
| distilabel-internal-testing | farming-research-v0.2 | 376 |
| Prasad12344321 | farming-preference-dataset-prep-small | 312 |
| FrancophonIA | Protection_of_culture_in_ecological_agriculture | 168 |
| Solshine | Arabic-Reflection-Tuning-Natural-Farming-Instruct2 | 111 |
| Solshine | Reflection-Tuning-Natural-Farming-Instruct2 | 58 |
| CopyleftCultivars | syntheticdata-distiset-farming-chemistry | 30 |
| burtenshaw | farming-dataset-synthetic-generator-classification | 10 |
| CopyleftCultivars | syntheticdatasample-distiset-farming-chemistry | 10 |
| transitionGap | farmingset | 4 |
| transitionGap | farming-preference-dataset-prep-small | 2 |
| Solshine | Arabic-Reflection-Tuning-Natural-Farming-Instruct-SINGLE_SEED_EXAMPLE | 1 |
| aidev08 | farming | 0 |
| shi-labs | Agriculture-Vision | 0 |
| 202Shiva | agriculture_data | 0 |
| aidev08 | farming-data | 0 |
| CLeach22 | green_farming | 0 |
| codybum | farming | 0 |
| ipranavks | agriculture_trivandrum | 0 |
| Solshine | Natural_Farming_Recipes_Datachunks | 0 |
| datasets-CNRS | vocabulaire_agriculture_et_systemes_elevage | 0 |
| SciKnowOrg | ontolearner-agriculture | 0 |
| Andresw | farming | 0 |
| globosetechnology12 | Smart-Agriculture-and-Crop-Monitoring | 0 |
| prithviraaj | agriculture | 0 |
| Somali-asr | Somali-Agriculture-ASR | 0 |
| tanujrai | rain_and_Agriculture_dataset | 0 |
| vibulan73 | Tamil-Agriculture-Data | 0 |
| VietDoanSotaTek | Agriculture | 0 |
| BrokenSoul | farming | 0 |

| Format | Share (%) | Count |
|---|---|---|
| Text | 4.0 | 5 |
| Unknown | 19.8 | 25 |
| Parquet | 42.9 | 54 |
| CSV | 12.7 | 16 |
| JSON | 20.6 | 26 |
| Total | 100.0 | 126 |
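The shares in the format table follow directly from the counts. As a small worked check, the sketch below recomputes them; the counts are copied from the table, while the variable and function names are hypothetical, and rounding to one decimal place is assumed to match the table.

```python
# Counts per file format, copied from the table above.
format_counts = {"Text": 5, "Unknown": 25, "Parquet": 54, "CSV": 16, "JSON": 26}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert absolute counts into percentage shares, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


format_shares = shares(format_counts)
print(sum(format_counts.values()))  # 126
print(format_shares["Parquet"])     # 42.9
```

The same function can be applied to the counts per primary language, which use the same total of 126 datasets.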

| Primary Language | Share (%) | Count |
|---|---|---|
| English | 71.4 | 90 |
| Not evaluable | 7.1 | 9 |
| Chinese | 4.8 | 6 |
| Turkish | 2.4 | 3 |
| Hindi | 2.4 | 3 |
| French | 2.4 | 3 |
| Arabic | 2.4 | 3 |
| Nepali | 1.6 | 2 |
| Vietnamese | 1.6 | 2 |
| Amharic | 0.8 | 1 |
| Indonesian | 0.8 | 1 |
| Ukrainian | 0.8 | 1 |
| Marathi | 0.8 | 1 |
| Tamil | 0.8 | 1 |
| Total | 100.0 | 126 |

| Author | Dataset Name | Attribute | Spec. Type | Gen. Type |
|---|---|---|---|---|
| 202Shiva | agriculture_data | state | string | Text |
| aksakalalper | agriculture_dataset | review | string | Text |
| AnuradhaPoddar | agriculture_llama_6k | question | string | Text |
| KisanVaani | agriculture-qa-english-only | answer | string | Text |
| aidev08 | farming-data | argilla_api_url | string | Text |
| 202Shiva | agriculture_data | min_price | float64 | Numeric |
| 202Shiva | agriculture_data | rainfall_annual | float64 | Numeric |
| BAAI | IndustryCorpus2_agriculture_forestry_animal_husbandry_fishery | alnum_ratio | float64 | Numeric |
| CopyleftCultivars | SemiSynthetic_Data_For_Regenerative_Farming_Agriculture | scenario_id | int64 | Numeric |
| electricsheepafrica | informal-employment-13th-icls-non-agriculture-for-african-countries | Angola | float64 | Numeric |
| ahmedsamirio | farming | examples | list | Complex |
| ahmedsamirio | farming | perspectives | sequence | Complex |
| CopyleftCultivars | SemiSynthetic_Data_For_Regenerative_Farming_Agriculture | actions | sequence | Complex |
| distilabel-internal-testing | farming-research-v0.2 | instructions | sequence | Complex |
| hari7261 | agriculture_training_data | crops | dict | Complex |
| shi-labs | Agriculture-Vision | png | image | Image |
| Sony | Hokkaido_Agriculture_Image_Dataset | image | image | Image |
| 202Shiva | agriculture_data | grade | null | Unclear |
| CGIAR | AgricultureVideosQnA | — | — | Unclear |
| datasets-CNRS | vocabulaire_agriculture_et_systemes_elevage | — | — | Unclear |
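The attribute table above groups specific column types into general types. A minimal sketch of such a mapping follows; the type pairs are read off the rows of the table, while the dictionary and function names are hypothetical, and treating every unlisted or missing type as "Unclear" is an assumption.

```python
from typing import Optional

# Specific type -> general type, as read off the rows of the table above.
GENERAL_TYPE = {
    "string": "Text",
    "float64": "Numeric",
    "int64": "Numeric",
    "list": "Complex",
    "sequence": "Complex",
    "dict": "Complex",
    "image": "Image",
}


def generalize_type(spec_type: Optional[str]) -> str:
    """Map a specific attribute type to its general type.

    Missing, null, or unlisted types fall back to "Unclear" (an assumption).
    """
    if spec_type is None:
        return "Unclear"
    return GENERAL_TYPE.get(spec_type.strip().lower(), "Unclear")


for spec in ("string", "float64", "sequence", "image", "null"):
    print(spec, "->", generalize_type(spec))
```

Keeping the mapping in a single dictionary makes it easy to extend when further specific types turn up in other datasets.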

| Rank | Noun | Domain |
|---|---|---|
| 1 | soil | domain-specific |
| 2 | crop | domain-specific |
| 3 | plant | domain-specific |
| 4 | farmer | domain-specific |
| 5 | disease | domain-specific |
| 6 | water | domain-specific |
| 7 | management | domain-specific |
| 8 | yield | domain-specific |
| 9 | seed | domain-specific |
| 10 | practice | domain-specific |
| 11 | fertilizer | domain-specific |
| 12 | growth | domain-specific |
| 13 | variety | domain-specific |
| 14 | field | domain-specific |
| 15 | production | domain-specific |
| 16 | bean | domain-specific |
| 17 | instruction | meta / dialogue |
| 18 | response | meta / dialogue |
| 19 | user | meta / dialogue |
| 20 | assistant | meta / dialogue |

| Rank | Adjective | Type |
|---|---|---|
| 1 | agricultural | domain-specific |
| 2 | organic | domain-specific |
| 3 | nutrient | domain-specific |
| 4 | resistant | domain-specific |
| 5 | natural | domain-specific |
| 6 | high | general modifier |
| 7 | good | general modifier |
| 8 | low | general modifier |
| 9 | proper | general modifier |
| 10 | important | general modifier |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).