1. Introduction
1.1. Relevance of the Problem
Modern agriculture is undergoing intensive integration of digital technologies into all aspects of business and production processes. In this context, computer vision and deep learning models play a crucial role, enabling highly accurate, non-invasive object detection — an essential component in livestock management. Automated intelligent animal monitoring systems make it possible not only to accurately count livestock but also to track their health, analyze behavior and growth rates, optimize feeding schedules, and detect diseases in time.
For example, computer vision systems can detect early signs of illness or injury — such as nasal discharge or eye problems — allowing farmers to take timely action and prevent disease outbreaks [1]. Monitoring animal movement and interactions helps identify signs of stress or discomfort [2]. Moreover, drones equipped with cameras provide effective aerial surveillance of livestock across large fields, helping track grazing patterns and locate missing animals [3].
For all these applications, detection capability is of critical importance, as it directly affects operational efficiency, economic viability, and animal welfare. However, this domain presents several unique challenges: significant variation in animal sizes (from small objects in drone footage to close-up large animals), frequent occlusions within herds, complex dynamic natural backgrounds, and a strict need for real-time processing for practical farm management decisions.
Despite the widespread use of deep learning models, there is still a noticeable lack of systematic comparative benchmarks focused on real-time commercial performance for domain-specific agricultural datasets. This gap is especially critical under the constraints of real-time operation and commercial deployment, which require balancing accuracy, inference speed, resource efficiency, and licensing considerations.
Thus, conducting a systematic benchmarking of modern open-source object detectors on a specialized animal dataset to evaluate detection performance (accuracy and inference speed) in real-time and commercial contexts is a highly relevant and timely research task.
1.2. Research Problem Statement
Systematic benchmarking plays a fundamental role in computer vision, as it enables objective evaluation of model performance and guides their practical deployment. However, general-purpose benchmarks such as COCO (Common Objects in Context, Microsoft) often prove insufficient for domain-specific applications [4]. This limitation arises from differences in object characteristics (e.g., number of classes, unique shapes), scene complexity, and performance requirements (e.g., strict timing constraints, accuracy thresholds) in specialized domains such as agriculture.
As mentioned above, in agricultural settings, objects may be very small or very large, frequently occluded, and appear against backgrounds that differ substantially from urban or domestic scenes represented in COCO. Therefore, directly applying models trained on general datasets without domain adaptation and specialized benchmarking can lead to inadequate performance in real-world agricultural conditions.
The goal of this study is to conduct a systematic benchmarking of nine modern open-source object detection architectures on a specialized cattle dataset, evaluating performance in terms of accuracy and inference speed under real-time and commercial conditions.
We hypothesize that the accuracy of detectors fine-tuned on our dataset will generally correlate with their known performance on public datasets (e.g., COCO), though deviations may occur due to the domain-specific nature of the cattle dataset. In contrast, inference speed is expected to remain consistent with previously reported values.
The study will analyze the performance of various architectural paradigms — two-stage, one-stage, and transformer-based — on a real, domain-specific dataset. Special attention is given to the trade-off between accuracy and inference speed, which is critical for practical real-time applications. Domain adaptation aspects affecting the generalization capability of models in agricultural environments are also considered.
The expected outcomes — a unique set of empirical benchmarks and practical recommendations — will provide tangible value for stakeholders in the agro-industrial sector, helping them make informed decisions when selecting suitable object detection models for specific needs. Detailed documentation of the experimental setup and methodology underscores the importance of research reproducibility.
3. Research Methodology
This section provides a detailed description of the research methodology, ensuring its reproducibility.
3.1. Data Preparation
For this study, several public datasets containing cattle images were collected and aggregated: “Cattle (Cows) Object Detection Dataset,” “Animals Detection Images Dataset,” “Cattle Breeds Dataset,” and “Bovine Cattle Images.” All redundant and irrelevant data (non-cow images) were removed, along with duplicates and mirrored duplicates. Manual label correction was performed to ensure high annotation accuracy.
As a result, 4,367 images containing 14,400 bounding box annotations were obtained.
To ensure dataset representativeness, the data were stratified by scene type (e.g., “small distant cows” and “dense herds”), allowing proportional representation of each group in training and validation sets.
All annotations were converted into COCO JSON format, compatible with MMDetection, with a strict validation of data integrity and file path consistency.
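The following is a minimal sketch of such an integrity check, assuming the aggregated annotations were stored as a single COCO JSON file; the file and directory names are hypothetical:

```python
# Sketch of a COCO JSON integrity check; file names are illustrative.
import json
from pathlib import Path

ANN_FILE = Path("annotations_train.json")   # hypothetical annotation file
IMG_DIR = Path("images/train")              # hypothetical image directory

coco = json.loads(ANN_FILE.read_text())
image_ids = {img["id"] for img in coco["images"]}

# Every annotation must reference a known image and have a positive-area box.
for ann in coco["annotations"]:
    assert ann["image_id"] in image_ids, f"orphan annotation {ann['id']}"
    w, h = ann["bbox"][2], ann["bbox"][3]
    assert w > 0 and h > 0, f"degenerate bbox in annotation {ann['id']}"

# Every referenced image file must exist on disk.
missing = [img["file_name"] for img in coco["images"]
           if not (IMG_DIR / img["file_name"]).exists()]
print(f"{len(coco['images'])} images, {len(coco['annotations'])} boxes, "
      f"{len(missing)} missing files")
```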
3.2. Hardware and Software Configuration
All experiments were conducted on the following documented hardware:
GPU: NVIDIA RTX 4090 (24 GB VRAM),
CPU: AMD Ryzen 9 9900X,
RAM: 64 GB.
The software environment was configured in an isolated Conda environment running Linux with:
NVIDIA driver version 570.86.10,
CUDA Toolkit 12.8,
PyTorch 2.4.1,
MMDetection 3.3.0.
All dependencies were recorded in a requirements.txt file to ensure reproducibility.
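As a complementary sketch (not the exact script used in this study), the key library and hardware versions can also be queried at run time to confirm the environment:

```python
# Sketch: print the library and hardware versions relevant to reproducibility.
import torch
import mmdet

print("PyTorch:", torch.__version__)           # expected 2.4.1
print("CUDA (build):", torch.version.cuda)     # expected 12.x
print("GPU:", torch.cuda.get_device_name(0))   # expected NVIDIA GeForce RTX 4090
print("MMDetection:", mmdet.__version__)       # expected 3.3.0
```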
3.3. Model Selection and Adaptation
Nine state-of-the-art object detection models were selected for comparative analysis:
Cascade R-CNN [22], RetinaNet [23], RTMDet [24], D-FINE [25], DETR [26], DINO [27], RT-DETR [28], Co-DETR [29], and YOLOv11 [30].
The main selection criterion was model size (20–70M parameters) — large enough for robust accuracy but lightweight enough for real-time and embedded applications.
MMDetection-based models (Cascade R-CNN, RetinaNet, RTMDet, DETR, DINO, RT-DETR, Co-DETR) were obtained from the official repository under the Apache-2.0 license, allowing free academic and commercial use.
YOLOv11 (from Ultralytics) is distributed under AGPL-3.0, requiring derivative works to remain open-source.
All configuration files were modified to match the single-class detection task (“cow”); a minimal configuration sketch is given after this list. The modifications included:
num_classes = 1,
input resolution fixed at 1280×1280,
unified logging and checkpoint storage (work_dirs).
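The sketch below illustrates this kind of override in MMDetection 3.x config style; the base config name and data paths are illustrative, and the exact key holding num_classes differs per architecture (e.g., roi_head for Cascade R-CNN versus bbox_head for one-stage detectors):

```python
# Illustrative MMDetection 3.x override for the single-class "cow" task.
_base_ = "./rtmdet_m_8xb32-300e_coco.py"     # hypothetical base config

metainfo = dict(classes=("cow",))

model = dict(bbox_head=dict(num_classes=1))  # two-stage models set this inside roi_head instead

train_dataloader = dict(dataset=dict(metainfo=metainfo,
                                     ann_file="annotations_train.json",
                                     data_prefix=dict(img="images/train/")))
val_dataloader = dict(dataset=dict(metainfo=metainfo,
                                   ann_file="annotations_val.json",
                                   data_prefix=dict(img="images/val/")))

# Fixed 1280x1280 input resolution is applied in the data pipeline (see the
# augmentation sketch in Section 3.3); logs and checkpoints share one location.
work_dir = "work_dirs/rtmdet_cow_1280"
```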
Unified augmentation strategy:
To ensure fairness, all model-specific augmentations were replaced with a shared minimal set (expressed as a single pipeline in the sketch after this list):
Horizontal flip (p = 0.5),
Random rotation (−10° to +10°),
Filtering out objects smaller than 16×16 px.
This minimized variability between models while maintaining comparability, although it may have slightly reduced the potential accuracy of certain models (since aggressive augmentations often improve generalization).
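As an illustration, the shared set can be expressed as one MMDetection 3.x training pipeline; the sketch below uses transform names from the MMDetection API, with parameters chosen to neutralize the non-rotation components of RandomAffine, and should be checked against the installed version:

```python
# Shared minimal augmentation pipeline (sketch).
train_pipeline = [
    dict(type="LoadImageFromFile"),
    dict(type="LoadAnnotations", with_bbox=True),
    dict(type="Resize", scale=(1280, 1280), keep_ratio=True),
    dict(type="RandomFlip", prob=0.5),                        # horizontal flip, p = 0.5
    dict(type="RandomAffine",                                 # rotation only, -10 to +10 degrees
         max_rotate_degree=10,
         max_translate_ratio=0.0,
         scaling_ratio_range=(1.0, 1.0),
         max_shear_degree=0.0),
    dict(type="FilterAnnotations", min_gt_bbox_wh=(16, 16)),  # drop boxes smaller than 16x16 px
    dict(type="PackDetInputs"),
]
```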
3.4. Training Protocol
Each model underwent fine-tuning using its pretrained weights and adapted configurations.
Training duration (number of epochs) was adjusted individually until convergence.
Learning rates (LR) and LR schedules were based on MMDetection defaults, with occasional manual scaling (e.g., ×0.1 reduction for fine-tuning).
Optimization process:
Different optimizers were tested per model, and the best-performing settings were retained.
Training progress was monitored with Weights & Biases (WandB) integrated into MMDetection, allowing visualization of loss curves and validation metrics.
The final evaluation used the checkpoint with the highest boxAP score.
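The two configuration fragments sketched below, assuming MMEngine's built-in WandB backend and checkpoint hook were used, illustrate how training curves can be streamed to WandB and how the checkpoint with the best boxAP can be retained automatically; the project name is illustrative:

```python
# Sketch: WandB logging and best-checkpoint selection in MMDetection 3.x.
vis_backends = [
    dict(type="LocalVisBackend"),
    dict(type="WandbVisBackend", init_kwargs=dict(project="cattle-benchmark")),
]
visualizer = dict(type="DetLocalVisualizer", vis_backends=vis_backends)

default_hooks = dict(
    checkpoint=dict(type="CheckpointHook",
                    interval=1,
                    save_best="coco/bbox_mAP",   # keep the checkpoint with the highest boxAP
                    rule="greater"),
)
```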
3.5. Model Performance Evaluation Protocol
Performance evaluation included two key aspects: accuracy and inference speed.
(1) Detection Accuracy (AP):
Each model’s best checkpoint was evaluated using the COCOeval script on a 569-image test set (input size 1280×1280, no Test-Time Augmentation); a minimal evaluation sketch is given after the metric list.
Metrics recorded:
AP@[IoU=0.50:0.95] (main metric),
AP@[IoU=0.50],
AP for small, medium, and large objects,
Average Recall (AR).
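A minimal sketch of such an evaluation with pycocotools is shown below; the file names are illustrative, and MMDetection's CocoMetric wraps the same COCOeval logic:

```python
# Sketch: COCO-style accuracy evaluation of exported detections.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations_test.json")              # ground truth, hypothetical path
coco_dt = coco_gt.loadRes("model_detections.json")   # detections in COCO result format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP@[0.50:0.95], AP@0.50, AP_s/m/l, and AR
```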
(2) Inference Speed (Latency):
A custom benchmarking script measured latency (ms/image) on an NVIDIA RTX 4090 (FP32 precision, batch size = 1); a condensed sketch of the timing loop is given at the end of this subsection.
Warm-up runs were performed to stabilize performance before timing.
Both preprocessing and postprocessing times (resize + normalization) were included to reflect real-world conditions.
Two measurement modes were implemented:
Reproducible mode:
cudnn.benchmark=False, deterministic=True.
Prioritizes consistency for academic comparison.
Production-optimized mode:
cudnn.benchmark=True, deterministic=False.
Reflects realistic deployment performance for industrial systems.
Both latency values are reported in results — the first ensuring reproducibility, the second representing operational speed.
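The sketch below condenses the timing loop under the assumption of a generic model(image) callable; the actual script additionally times resize and normalization, whereas this sketch only includes the host-to-device transfer:

```python
# Sketch: single-image latency measurement in the two modes described above.
import time
import torch

def measure_latency(model, images, reproducible=True, warmup=20):
    torch.backends.cudnn.benchmark = not reproducible     # True in production-optimized mode
    torch.backends.cudnn.deterministic = reproducible     # True in reproducible mode
    model.eval().cuda()

    with torch.no_grad():
        for img in images[:warmup]:                       # warm-up runs, not timed
            model(img.cuda())
        torch.cuda.synchronize()

        timings = []
        for img in images[warmup:]:
            start = time.perf_counter()
            model(img.cuda())
            torch.cuda.synchronize()                       # wait for the GPU before stopping the clock
            timings.append((time.perf_counter() - start) * 1000.0)

    mean_ms = sum(timings) / len(timings)
    return mean_ms, 1000.0 / mean_ms                       # ms/image and FPS
```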
4. Results
This section presents the outcomes of a systematic benchmarking of nine object detection models on a specialized cattle dataset.
The results include both accuracy metrics (Average Precision, AP) and inference speed metrics (latency and FPS) for each model, along with comparisons to their reported performance on the COCO dataset.
4.1. Evaluation of Detection Accuracy and Inference Speed
The final accuracy and inference speed results for each model are summarized in Table 1.
To enable a fair comparison, the total number of parameters for each model was also calculated, providing an additional measure of model complexity.
Accuracy was evaluated using Average Precision (AP) at multiple IoU thresholds, while inference speed was measured as the average time per image (ms/image) and frames per second (FPS) in two operation modes:
Reproducible (R): optimized for experimental reproducibility,
Production (P): optimized for real-world deployment performance.
4.2. Comparative Analysis
The analysis reveals that the D-FINE model achieved the highest accuracy among all tested detectors, reaching AP@[0.5:0.95] = 0.872 and AP@0.50 = 0.966.
This demonstrates its superior localization precision and high reliability in predictions.
Co-DETR also achieved strong results (AP@[0.5:0.95] = 0.851, AP@0.50 = 0.966), ranking second overall in terms of detection accuracy.
Other transformer-based models — DETR and DINO — showed good, though slightly lower, performance (AP@[0.5:0.95] = 0.777 and 0.819, respectively), likely due to their architectural characteristics and domain adaptation specifics.
In terms of inference speed under reproducible conditions, RTMDet demonstrated the lowest latency — 15.81 ms per image (63.24 FPS) — making it the fastest among all models.
YOLOv11 ranked second with 19.14 ms (52.24 FPS).
Both belong to the one-stage family, traditionally optimized for speed, and this is confirmed by the experimental results.
Despite its superior accuracy, D-FINE maintained impressive performance at 21.61 ms (46.28 FPS), making it particularly attractive for real-world applications requiring both precision and responsiveness.
Considering its balance of accuracy and speed, D-FINE emerges as the most well-rounded solution, combining top-tier precision with near real-time inference speed.
For applications where maximum speed is critical, RTMDet or YOLOv11 are preferable, as they provide acceptable accuracy with minimal latency.
Meanwhile, DETR and RetinaNet offer moderate inference times (~25–26 ms per image), suitable for use cases with less stringent real-time constraints.
However, Co-DETR (≈280.9 ms) and DINO (≈78.0 ms) were significantly slower, largely due to their more complex architectures and higher parameter counts, which reduce suitability for time-sensitive deployments.
4.3. Comparison with COCO Dataset
The proposed hypothesis suggested that the accuracy of object detectors on the specialized cattle dataset would correlate with their known performance on widely used benchmarks (e.g., COCO), with possible deviations due to domain specificity, while the inference speed should remain more stable.
When comparing the obtained AP@[0.5:0.95] values on the cattle dataset with the published AP metrics on COCO for the same model versions and checkpoints (COCO values, conventionally reported on a 0–100 scale, are converted here to the 0–1 scale used throughout this paper; “COCO AP” denotes accuracy on COCO and “Cow AP” accuracy on the domain-specific cattle dataset), the following pattern is observed:
Cascade R-CNN: COCO AP = 0.410 versus Cow AP = 0.772 — a substantial increase on the domain dataset;
RetinaNet: COCO AP = 0.395 versus Cow AP = 0.751 — likewise a substantial improvement;
RTMDet: COCO AP = 0.494 versus Cow AP = 0.773 — a notable increase;
D-FINE: COCO AP = 0.551 versus Cow AP = 0.872 — a strong improvement and the highest absolute AP on the cattle dataset;
DETR: COCO AP = 0.399 versus Cow AP = 0.777 — a significant improvement;
DINO: COCO AP = 0.501 versus Cow AP = 0.819 — a considerable improvement;
RT-DETR: COCO AP = 0.531 versus Cow AP = 0.722 — an improvement, though less pronounced than for the other transformer-based models;
Co-DETR: COCO AP = 0.548 versus Cow AP = 0.851 — a strong improvement;
YOLOv11: COCO AP = 0.534 versus Cow AP = 0.707 — an improvement, but relatively modest compared to the others.
Overall, a strong positive correlation is observed: models that demonstrated high accuracy on COCO generally achieved high results on the cattle dataset.
However, the absolute AP values on the cattle dataset are significantly higher than on COCO.
This confirms the first part of the hypothesis regarding correlation, while also highlighting noticeable deviations in absolute values.
These deviations can be explained by domain-specific factors:
The cattle dataset contains only one object class, simplifying the classification task compared to the multi-class COCO dataset.
Additionally, cattle images tend to have less diverse backgrounds or more uniform poses, making detection easier than in COCO’s complex and varied contexts.
Regarding inference speed, the “reproducible” and “production” modes produced closely consistent measurements, confirming the reliability of the benchmarking protocol.
Comparison with the published COCO inference speeds is difficult due to hardware differences used in the original benchmarks (e.g., D-FINE and RT-DETR were tested on T4 GPUs).
Nevertheless, the relative ranking of model speeds generally remains consistent:
One-stage detectors (RTMDet, YOLOv11) remain the fastest;
Some transformer-based models (DINO, Co-DETR) are the slowest.
4.4. Qualitative Analysis
A qualitative evaluation was performed through visual inspection of detection results across several test images.
This analysis revealed not only quantitative differences but also qualitative aspects such as successful detections, false positives, and missed objects.
Visualizations showed that high-accuracy models like D-FINE and Co-DETR generally exhibited:
More precise bounding box localization,
Fewer false positives,
Better performance in dense herds and complex background scenes.
For example, D-FINE successfully detected even small, distant cattle, a key advantage for aerial imagery or drone-based monitoring.
Conversely, faster models such as YOLOv11 and RT-DETR, while highly efficient, occasionally missed small or partially occluded animals, and sometimes produced false positives on background elements resembling cattle (e.g., rocks, shadows, or machinery parts).
In dense herd scenes, some detectors struggled to distinguish individual animals, leading to overlapping bounding boxes or missed detections.
False detections typically arose from background textures similar to cows, while missed detections occurred under strong lighting, motion blur, or occlusion conditions.
This qualitative analysis underscores that while quantitative metrics (e.g., AP) are crucial, visual assessment provides valuable insight into model robustness under real-world conditions and helps identify the types of domain-specific errors each model tends to make.
5. Discussion
The conducted benchmarking of nine open-source object detection models on a specialized cattle dataset revealed substantial performance differences, highlighting the importance of domain adaptation for real-world commercial applications in the agro-industrial sector.
5.1. Accuracy and Domain Adaptation
The results clearly demonstrate that all models achieved significantly higher accuracy (AP) on the cattle dataset compared to their reported performance on COCO, confirming the first part of the hypothesis — a strong correlation accompanied by noticeable deviations in absolute values.
This improvement is largely attributed to domain-specific characteristics:
The cattle dataset contains only one object class, simplifying classification compared to the multi-class COCO dataset;
The homogeneity of the objects (cows) and the less diverse visual context facilitate more effective fine-tuning.
This finding indicates that for narrowly focused agricultural tasks, even models that perform moderately on general-purpose datasets can achieve very high detection accuracy after appropriate domain adaptation.
5.2. Trade-off Between Accuracy and Speed
The analysis revealed diverse trade-offs between accuracy and inference speed across the evaluated models.
D-FINE, which achieved the highest accuracy, also demonstrated remarkably high speed, making it one of the most balanced models for applications that require both precision and real-time performance.
RTMDet and YOLOv11 confirmed their reputation as high-speed detectors, ideal for systems where throughput is the top priority — such as continuous livestock monitoring or drone-based video analytics.
Transformer-based models like DINO and Co-DETR, while highly accurate, were notably slower, limiting their use in strict real-time scenarios under current hardware conditions.
5.3. Architectural Insights
Two-stage models (Cascade R-CNN):
Demonstrated good accuracy but lagged behind one-stage and some transformer-based architectures in speed;
Their multi-step nature enhances localization precision but comes at a higher computational cost.
One-stage models (RetinaNet, RTMDet, YOLOv11):
Confirmed their advantage in speed;
In particular, RTMDet and YOLOv11 stand out as fast and sufficiently accurate, making them strong candidates for commercial deployment in agricultural monitoring systems.
Transformer-based models (DETR, DINO, RT-DETR, Co-DETR, D-FINE):
Demonstrated the potential for achieving the highest accuracy (notably D-FINE and Co-DETR), though with variable inference speed;
D-FINE and RT-DETR show that transformer architectures can be optimized for real-time performance, whereas DINO and Co-DETR remain relatively slow.
5.4. Stability of Inference Speed Across Experiments
The second part of the hypothesis — the stability of inference speed — was generally confirmed.
The relative speed ranking of models remained consistent across all experiments and both inference modes.
The differences between the “reproducible” and “production” settings were minimal, indicating that model speed is determined primarily by architecture design and hardware configuration, and is less sensitive to dataset domain characteristics than accuracy is.
This consistency supports the reliability of the benchmarking methodology and highlights the robustness of the tested implementations.
5.5. Study Limitations
Test Set Size:
Although the dataset was significantly expanded and stratified, the test set size (569 images) may not be sufficient to fully assess result stability, particularly for rare or edge-case scenarios.
Single-Class Detection:
The study focused solely on detecting one class (“cow”);
Broader agricultural applications — such as multi-species detection or equipment-based monitoring — would require additional testing and adaptation.
Specific Hardware Configuration:
All benchmarks were performed on a single GPU configuration (NVIDIA RTX 4090);
Model performance may vary across different hardware platforms, necessitating further benchmarking for edge devices and server-based deployments.
6. Conclusion
This study successfully conducted a systematic benchmarking of nine open-source object detection models on a specialized cattle dataset, providing critical empirical data for their potential commercial deployment in the agro-industrial sector.
It has been demonstrated that domain adaptation significantly increases the accuracy of all tested models compared to their reported performance on widely used benchmarks such as COCO.
6.1. Key Findings
D-FINE achieved the highest overall accuracy (AP@[0.5:0.95] = 0.872) and demonstrated strong inference speed (21.61 ms/image), making it the optimal choice for most commercial applications requiring high reliability and real-time detection.
RTMDet and YOLOv11 confirmed their leadership in speed (15.81 ms/image and 19.14 ms/image, respectively), making them preferable for scenarios where minimal image processing latency is critical, though they achieved slightly lower accuracy compared to D-FINE.
Transformer-based architectures showed the potential for the highest accuracy (notably D-FINE and Co-DETR), although DINO and Co-DETR remained speed-constrained despite their strong precision.
Inference speed across all models proved to be stable and predictable, confirming that performance is primarily determined by model architecture and hardware, rather than by dataset domain.
6.2. Directions for Future Work
Investigate data augmentation strategies specifically tailored for agricultural imagery and assess their impact on model performance.
Expand the dataset to include a broader range of scenarios, lighting conditions, and possibly other livestock species.
Evaluate models on diverse hardware platforms, including cost-effective GPUs and edge accelerators, to determine their suitability for different commercial deployment environments.
Integrate object detection with tracking systems to enable behavioral analytics and automated herd management.
Explore model quantization and distillation techniques to further enhance inference speed without substantial loss of accuracy.
6.3. Recommendations for the Agro-Industrial Sector
Prioritize D-FINE for precision-critical applications.
For tasks requiring both high accuracy and real-time performance — such as detailed health monitoring and behavior analysis — D-FINE offers the best balance of precision and speed.
Use RTMDet or YOLOv11 for high-speed monitoring systems.
In scenarios where processing speed is paramount (e.g., large-scale herd counting using drones, where a slight accuracy loss is acceptable), RTMDet or YOLOv11 are the most suitable choices.
Always perform domain adaptation.
Fine-tuning models on domain-specific datasets is essential, as it leads to a substantial increase in detection accuracy compared to directly applying models pre-trained on generic datasets.
Conduct pilot deployments before scaling.
Before large-scale rollout, it is critical to run pilot studies that consider specific hardware and network environments, ensuring that the desired performance is achieved under real-world operational conditions.