DistilBERT Model for Crop Recommendation Based on Environmental Parameters
This repository contains a fine-tuned DistilBERT model trained for crop recommendation using structured agricultural data. By converting numerical environmental features into text format, the model leverages transformer-based NLP techniques to classify the most suitable crop type.
Problem Statement
The goal is to recommend the most suitable crop to cultivate given soil nutrient levels and weather conditions. Traditional ML models treat this as a tabular classification problem. Here, we instead apply an NLP model (DistilBERT) to tabular data serialized as text.
Dataset
Source: Crop Recommendation Dataset
Features:
- N: Nitrogen content in soil
- P: Phosphorus content in soil
- K: Potassium content in soil
- Temperature: in Celsius
- Humidity: %
- pH: Acidity of soil
- Rainfall: mm
Target: Crop label (22 crop types)
The dataset is preprocessed by concatenating all numeric features into a single space-separated string, making it suitable for transformer-based tokenization.
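The serialization step can be sketched as follows. This is a minimal illustration, not the original preprocessing script: the column names and the feature order are assumptions inferred from the sample input used in the loading example of this README, and the exact number formatting used during training may differ.

```python
# Assumed column order (hypothetical names), matching the sample input string
FEATURE_ORDER = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]


def serialize_row(row):
    """Join the numeric features of one record into a space-separated string."""
    return " ".join(str(row[name]) for name in FEATURE_ORDER)


example = {
    "N": 90,
    "P": 42,
    "K": 43,
    "temperature": 20.879744,
    "humidity": 82.002744,
    "ph": 6.502985,
    "rainfall": 202.935536,
}

text = serialize_row(example)
print(text)
```

Keeping the feature order fixed matters: the model only sees token positions, so inference inputs must be serialized in the same order as the training data.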
Model Details
- Architecture: DistilBERT
- Tokenizer: DistilBertTokenizerFast
- Model: DistilBertForSequenceClassification
- Task Type: Multi-Class Classification (22 classes)
Installation
pip install transformers datasets pandas scikit-learn torch
Loading the Model
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
import torch
# Load model and tokenizer
model_path = "model_fp32_dir"
tokenizer = DistilBertTokenizerFast.from_pretrained(model_path)
model = DistilBertForSequenceClassification.from_pretrained(model_path)
# Sample input
sample_text = "90 42 43 20.879744 82.002744 6.502985 202.935536"
inputs = tokenizer(sample_text, return_tensors="pt")
# Predict
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = torch.argmax(outputs.logits, dim=1).item()
print("Predicted class index:", predicted_class)
Performance Metrics
- Accuracy: 0.7636
- Precision: 0.7738
- Recall: 0.7636
- F1 Score: 0.7343
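The averaging mode behind these scores is not stated here. The recall being identical to the accuracy is consistent with support-weighted averaging across classes, so the sketch below assumes that mode. It is a dependency-free illustration on made-up labels, not the evaluation code used for this model; scikit-learn's precision_recall_fscore_support with average="weighted" computes the same quantities.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / n  # classes are weighted by their share of true samples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return accuracy, precision, recall, f1


# Toy labels for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

With this weighting, recall always equals accuracy, while precision and F1 can differ from it.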
Fine-Tuning Details
Dataset
The dataset is sourced from the publicly available Crop Recommendation Dataset. It consists of structured features such as:
- Nitrogen (N)
- Phosphorus (P)
- Potassium (K)
- Temperature (°C)
- Humidity (%)
- pH
- Rainfall (mm)
All numerical features were converted into a single textual input string to be used with the DistilBERT tokenizer. Labels were factorized into class indices for training.
The dataset was split using an 80/20 ratio for training and testing.
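Label factorization and the 80/20 split can be sketched without external dependencies as below. The crop names, the seed, and the use of shuffling are assumptions made for illustration; in practice, pandas.factorize and scikit-learn's train_test_split are the usual tools, and the original split may have been produced differently.

```python
import random


def factorize(labels):
    """Map each distinct label to an integer index in order of first appearance."""
    label2id = {}
    codes = []
    for label in labels:
        if label not in label2id:
            label2id[label] = len(label2id)
        codes.append(label2id[label])
    return codes, label2id


def split_80_20(items, seed=42):
    """Shuffle with a fixed (hypothetical) seed, then cut at 80%."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(0.8 * len(indices))
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test


# Toy labels for illustration only
labels = ["rice", "maize", "rice", "cotton", "maize",
          "rice", "jute", "cotton", "rice", "maize"]
codes, label2id = factorize(labels)
id2label = {index: label for label, index in label2id.items()}

train, test = split_80_20(list(zip(labels, codes)))
print(len(train), len(test), id2label)
```

Keeping the inverse mapping (id2label here) makes it possible to turn a predicted class index back into a crop name at inference time.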
Training Configuration
- Epochs: 3
- Batch size: 8
- Learning rate: 2e-5
- Evaluation strategy: epoch
- Model Base: DistilBERT (distilbert-base-uncased)
- Framework: Hugging Face Transformers + PyTorch
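Expressed with the Hugging Face Trainer API, the settings above might look like the following configuration fragment. It is a sketch, not the original training script: the output directory and the evaluation batch size are assumptions, and any setting not listed above is left at the library default.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",            # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,      # assumption: same as the training batch size
    learning_rate=2e-5,
    eval_strategy="epoch",             # named evaluation_strategy in older transformers releases
)
```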
Quantization
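Post-training quantization to half precision (FP16) was applied using PyTorch's half() method. Storing weights in 16 bits instead of 32 roughly halves the model size and can speed up inference, particularly on GPUs with native FP16 support, with minimal impact on predictive performance.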
The quantized model can be loaded with:
model = DistilBertForSequenceClassification.from_pretrained("quantized_model_fp16", torch_dtype=torch.float16)
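To give a feel for what half precision trades away, the sketch below uses only Python's standard struct module, independent of PyTorch, to compare the storage cost of FP32 and FP16 and to measure the rounding error FP16 introduces. The weight values are made up for illustration and are not taken from the model.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Made-up values standing in for model weights
weights = [0.0123, -0.75, 0.3337, 1.5]

# Packed size in bytes: 4 per value in FP32 ("f"), 2 per value in FP16 ("e")
bytes_fp32 = len(struct.pack(f"<{len(weights)}f", *weights))
bytes_fp16 = len(struct.pack(f"<{len(weights)}e", *weights))

# Largest relative rounding error introduced by the FP16 round trip
max_rel_error = max(abs(to_fp16(w) - w) / abs(w) for w in weights)

print(bytes_fp32, bytes_fp16, max_rel_error)
```

The storage halves exactly, while each value keeps only about three significant decimal digits, which is why the accuracy impact is usually small but not guaranteed to be zero.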
Repository Structure
.
├── quantized-model/              # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                     # Model documentation
Limitations
- Uses text conversion of tabular data, which may miss deeper feature interactions.
- Trained on a specific dataset; may not generalize to different regions or conditions.
- FP16 quantization may slightly reduce accuracy in rare cases.
Contributing
Feel free to open issues or submit pull requests to improve the model or documentation.