DistilBERT Model for Crop Recommendation Based on Environmental Parameters

This repository contains a DistilBERT model fine-tuned for crop recommendation on structured agricultural data. Numerical environmental features are serialized as text so that a transformer-based sequence classifier can predict the most suitable crop type.

🌾 Problem Statement

The goal is to recommend the best crop to cultivate based on parameters such as soil nutrients and weather conditions. Traditional ML models treat this as a tabular classification problem. Here, we instead apply an NLP model (DistilBERT) to serialized tabular data.

📊 Dataset

Source: Crop Recommendation Dataset
Features:
- N: Nitrogen content in soil
- P: Phosphorus content in soil
- K: Potassium content in soil
- Temperature: in Celsius
- Humidity: %
- pH: Acidity of soil
- Rainfall: mm
Target: Crop label (22 crop types)

The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making it suitable for transformer-based tokenization.
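
The serialization step described above can be sketched in plain Python. This is a minimal illustration, not the repository's preprocessing code: the column names in `FEATURE_ORDER` and the helper `serialize_row` are assumptions, and the order must match whatever order was used during training.

```python
# Assumed column names and order; adjust to match the training data.
FEATURE_ORDER = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]


def serialize_row(row):
    """Join the numeric features of one record into a space-separated string."""
    return " ".join(str(row[name]) for name in FEATURE_ORDER)


# One record, using the same values as the sample input in this README.
example = {
    "N": 90, "P": 42, "K": 43,
    "temperature": 20.879744, "humidity": 82.002744,
    "ph": 6.502985, "rainfall": 202.935536,
}
print(serialize_row(example))
```

The resulting string is what the tokenizer receives, so any change in feature order or number formatting at inference time changes the model input.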

🧠 Model Details

Architecture: DistilBERT
Tokenizer: DistilBertTokenizerFast
Model: DistilBertForSequenceClassification
Task Type: Multi-Class Classification (22 classes)

🔧 Installation

pip install transformers datasets pandas scikit-learn torch

Loading the Model

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch

# Load model and tokenizer
model_path = "model_fp32_dir"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
model = DistilBertForSequenceClassification.from_pretrained(model_path)

# Sample input
sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536"
inputs = tokenizer(sample_text, return_tensors="pt")

# Predict
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
print("Predicted class index:", predicted_class)
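
A bare class index is hard to interpret, so it can help to rank the classes by probability and map indices back to crop names. The sketch below does this in plain Python so it runs without the model. The logits and the `id2label` list are made-up placeholders; in practice the logits would come from `outputs.logits[0].tolist()` and the label list from the factorization used during training.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical values for illustration only (4 classes instead of 22).
logits = [2.0, 0.5, -1.0, 3.0]
id2label = ["crop_0", "crop_1", "crop_2", "crop_3"]
for label, prob in top_k(logits, id2label):
    print(f"{label}: {prob:.3f}")
```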

📈 Performance Metrics

Accuracy: 0.7636
Precision: 0.7738
Recall: 0.7636
F1 Score: 0.7343
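
The README does not state how precision, recall, and F1 were averaged across the 22 classes. Recall equal to accuracy is what support-weighted averaging produces, so the sketch below assumes that scheme. It is a plain-Python illustration on toy labels, not the evaluation code used for the numbers above.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for cls, count in support.items():
        p = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        r = correct[cls] / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # each class weighted by its share of y_true
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, not the model's real predictions.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(weighted_metrics(y_true, y_pred))
```

With weighted averaging, recall always equals accuracy, while precision and F1 can differ from it.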

🏋️ Fine-Tuning Details

📚 Dataset

The dataset is sourced from the publicly available Crop Recommendation Dataset. It consists of the following structured features:

- Nitrogen (N)
- Phosphorus (P)
- Potassium (K)
- Temperature (°C)
- Humidity (%)
- pH
- Rainfall (mm)

All numerical features were converted into a single textual input string to be used with the DistilBERT tokenizer. Labels were factorized into class indices for training.
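
Label factorization can be sketched as follows. The helper assigns indices in order of first appearance, which mirrors the default behaviour of `pandas.factorize`; whether the original training used that function is an assumption, and the label names are placeholders.

```python
def factorize(labels):
    """Assign each distinct label an integer index in order of first appearance."""
    label2id = {}
    ids = []
    for label in labels:
        if label not in label2id:
            label2id[label] = len(label2id)
        ids.append(label2id[label])
    id2label = list(label2id)  # position i holds the label for class index i
    return ids, id2label


# Placeholder label names, not taken from the dataset.
ids, id2label = factorize(["crop_a", "crop_b", "crop_a", "crop_c"])
print(ids)
print(id2label)
```

Keeping `id2label` alongside the model is what makes a predicted class index translatable back into a crop name.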

The dataset was split using an 80/20 ratio for training and testing.
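
A minimal sketch of an 80/20 split, assuming a simple seeded shuffle. The seed, the helper name `split_80_20`, and the absence of stratification are assumptions; the README does not say how the split was drawn.

```python
import random


def split_80_20(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_80_20(range(100))
print(len(train), len(test))
```

With 22 classes, a stratified split (for example via scikit-learn's `train_test_split` with its `stratify` argument) would keep class proportions equal in both parts.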

🔧 Training Configuration

Epochs: 3
Batch size: 8
Learning rate: 2e-5
Evaluation strategy: epoch
Model Base: DistilBERT (distilbert-base-uncased)
Framework: Hugging Face Transformers + PyTorch
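
The hyperparameters above can be collected in one place, which also makes the training length easy to estimate. The `FineTuneConfig` class and `total_optimizer_steps` helper are hypothetical, and the step count assumes a single device, no gradient accumulation, and that the last partial batch is kept. The training-set size of 1000 is invented for the arithmetic.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters listed in this README; the class itself is hypothetical."""
    base_model: str = "distilbert-base-uncased"
    epochs: int = 3
    batch_size: int = 8
    learning_rate: float = 2e-5
    num_labels: int = 22


def total_optimizer_steps(n_train_examples, config):
    """Batches per epoch (last partial batch included) times number of epochs."""
    steps_per_epoch = math.ceil(n_train_examples / config.batch_size)
    return steps_per_epoch * config.epochs


config = FineTuneConfig()
# 1000 is a made-up training-set size used only to illustrate the arithmetic.
print(total_optimizer_steps(1000, config))
```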

🔄 Quantization

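Post-training quantization here means casting the model weights to half precision (FP16) with PyTorch’s half().
Each weight shrinks from 32 to 16 bits, which roughly halves the model size. Inference is typically faster on hardware with native FP16 support, and the impact on accuracy is expected to be small.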

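The quantized model can be loaded as follows (replace the path with the directory that holds the model files, shown as quantized-model/ in the repository structure):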

model = DistilBertForSequenceClassification.from_pretrained("quantized_model_fp16", torch_dtype=torch.float16)
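
The effect of FP16 on size and precision can be checked with the standard library alone. This is a rough sketch: the parameter count is the approximate figure for DistilBERT base (the classification head adds a little more), and the helper names are hypothetical.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def weights_size_mb(n_params, bytes_per_param):
    """Approximate storage for the weights alone, in megabytes (10^6 bytes)."""
    return n_params * bytes_per_param / 1e6


# Approximate parameter count of DistilBERT base; an assumption for illustration.
n_params = 66_000_000
print(weights_size_mb(n_params, 4))  # FP32 stores 4 bytes per weight
print(weights_size_mb(n_params, 2))  # FP16 stores 2 bytes per weight

# Rounding error introduced by storing one example weight in FP16.
weight = 0.1
print(to_fp16(weight), abs(to_fp16(weight) - weight))
```

The per-weight rounding error is small, but it is not zero, which is why accuracy should be re-measured on the test set after conversion.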

Repository Structure

.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Model documentation

Limitations

Serializing tabular data as text may miss deeper feature interactions, and the tokenizer splits each number into subword pieces, so numeric magnitude is not represented directly.
Trained on a single dataset; may not generalize to other regions or conditions.
FP16 quantization may slightly reduce accuracy in rare cases.

Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.