AgriPath: A Systematic Exploration of Crop Disease Classification
Overview
This project investigates how different model architectures behave under real-world domain shift in crop disease classification. I designed a unified evaluation framework to compare CNNs, contrastive vision–language models, and generative vision–language models across controlled (lab) and uncontrolled (field) conditions.
Rather than focusing only on accuracy, the work examines robustness, failure modes, and generalisation behaviour under realistic variability, where deployment conditions differ significantly from training data.
Problem
Many computer vision systems for agriculture achieve high accuracy on curated lab datasets but degrade sharply under real-world field conditions.
This raises a key question:
How do different model classes behave under domain shift, and which approaches are more reliable when deployed in less controlled environments?
To answer this, the project evaluates multiple architectural paradigms under explicit lab vs field separation, analysing performance degradation, generalisation gaps, and robustness across domains.
Dataset
AgriPath-LF16
- 111,307 images
- 16 crops
- 41 diseases
- 65 crop–disease pairs
- Explicit Lab vs Field source separation
Available on HuggingFace:
- Full dataset: https://huggingface.co/datasets/hamzamooraj99/AgriPath-LF16
- Balanced subset (used in experiments): https://huggingface.co/datasets/hamzamooraj99/AgriPath-LF16-30k
The 30k subset preserves all classes and supports fair evaluation across domain conditions.
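For reference, the balanced subset can be pulled straight from the Hub with the `datasets` library. This is a minimal sketch; the split and column names are assumptions, so check the dataset card for the actual schema:

```python
from datasets import load_dataset

# Load the balanced 30k subset from the HuggingFace Hub.
# NOTE: the split name ("train") and field names are assumed here;
# consult the dataset card for the actual schema.
ds = load_dataset("hamzamooraj99/AgriPath-LF16-30k", split="train")
print(ds)            # dataset summary: features and number of rows
print(ds[0].keys())  # inspect the fields of the first example
```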
Methodology
- Domain-aware dataset with explicit lab/field separation
- Unified evaluation across:
- ResNet-50 (transfer learning baseline)
- Contrastive VLMs (CLIP, SigLIP)
- Generative VLMs (Qwen2.5-VL, SmolVLM)
- LoRA-based parameter-efficient fine-tuning (sketched after this list)
- Deterministic decoding for generative models
- Structured output parsing with invalid-generation penalties
- Macro F1 as the primary metric
- Separate evaluation across Lab-only, Field-only, and combined regimes
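To make the fine-tuning setup concrete, below is a minimal LoRA sketch using PEFT on a CLIP backbone. The rank, alpha, dropout, and target-module names are illustrative assumptions, not the project's exact hyperparameters:

```python
from peft import LoraConfig, get_peft_model
from transformers import CLIPModel

# Contrastive VLM backbone (CLIP shown; SigLIP is analogous).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Representative LoRA configuration; the values here are illustrative
# assumptions and vary per backbone.
lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in CLIP blocks
    bias="none",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

This keeps the frozen backbone shared across experiments while adapting only a small set of parameters per model and domain regime.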
Key Findings
- Performance drops significantly from lab to field conditions, demonstrating strong sensitivity to domain shift.
- CNNs achieve high accuracy in controlled settings but show limited robustness under real-world variability.
- Contrastive VLMs generalise better across domains, suggesting stronger robustness to visual variation.
- Generative VLMs introduce new failure modes, including invalid or unparsable outputs, requiring structured evaluation via a Parse Success Rate (sketched below).
- High parse success does not guarantee semantic correctness, highlighting a gap between format compliance and task performance.
- Deterministic decoding and output validation are necessary for reliable evaluation of generative systems.
These results show that model architecture directly affects reliability under deployment-like conditions, and that evaluation must account for both predictive accuracy and failure behaviour.
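The parse-then-score evaluation mentioned above can be sketched as follows. The answer format, regex, and class names are hypothetical stand-ins for the project's actual output spec; the key idea is that unparsable generations map to a sentinel that is guaranteed to score as an error:

```python
import re
from sklearn.metrics import f1_score

# Assumed output protocol: the model is prompted to answer as "label: <class>".
CLASSES = {"tomato_early_blight", "tomato_healthy", "maize_rust"}  # illustrative subset
PATTERN = re.compile(r"label:\s*([a-z0-9_]+)", re.IGNORECASE)
INVALID = "<invalid>"  # sentinel that never matches a true label -> counted as wrong

def parse_prediction(generated_text: str) -> str:
    """Extract a known class name; fall back to the invalid sentinel (the penalty)."""
    match = PATTERN.search(generated_text)
    if match and match.group(1).lower() in CLASSES:
        return match.group(1).lower()
    return INVALID

def evaluate(outputs: list[str], gold: list[str]) -> dict:
    preds = [parse_prediction(o) for o in outputs]
    parse_rate = sum(p != INVALID for p in preds) / len(preds)
    # Macro F1 over the true label set; invalid predictions count as errors.
    macro_f1 = f1_score(gold, preds, labels=sorted(CLASSES),
                        average="macro", zero_division=0)
    return {"parse_success_rate": parse_rate, "macro_f1": macro_f1}

print(evaluate(["label: tomato_healthy", "It might be rust?"],
               ["tomato_healthy", "maize_rust"]))
```

Reporting Parse Success Rate separately from macro F1 is what exposes the format-versus-semantics gap noted above.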
Engineering & Infrastructure
- Built modular pipelines supporting CNN, contrastive, and generative VLMs under a unified interface
- Implemented LoRA fine-tuning for efficient adaptation
- Optimised training on RTX 4090 and A100 with mixed precision (bf16)
- Developed automated evaluation scripts for multi-split benchmarking and output validation
- Integrated experiment tracking with Weights & Biases
- Enforced deterministic inference to eliminate stochastic evaluation variance
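As a concrete illustration of the deterministic-inference point, here is a minimal sketch using a tiny text-only model as a stand-in for the generative VLMs; the same `generate()` flags (greedy decoding, no sampling) apply to the VLM pipelines:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(42)  # seeds Python, NumPy, and torch RNGs in one call
torch.use_deterministic_algorithms(True, warn_only=True)  # flag nondeterministic kernels

# Tiny stand-in checkpoint; swap in the actual VLM in practice.
tok = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

inputs = tok("The leaf shows", return_tensors="pt")
out = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=8)
print(tok.decode(out[0]))  # identical output on every run
```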
Limitations & Future Work
- Current setup focuses on classification; extending to sequential or agent-based settings would better reflect real deployment conditions.
- Parsing-based evaluation for generative models does not fully capture semantic correctness; richer evaluation methods are needed.
- Dataset scope is limited to 16 crops, and regional provenance is unknown; broader coverage and region metadata would strengthen the generalisation analysis and deployability.
Technologies
PyTorch, Transformers, PEFT, Unsloth, PyTorch Lightning, Weights & Biases