Optimizing Crop Yield Prediction: Data-Driven Analysis and Machine Learning Modeling using USDA Datasets

This research uses a variety of machine learning models and exploratory data analysis (EDA) to forecast crop yields from USDA data covering 2003 to 2013, in support of precision agriculture. We wanted not only to predict agricultural output but also to identify the underlying factors that affect yield. Through EDA that encompassed a wide range of agricultural data, including weather patterns and USDA-sourced soil composition, we gained insight into the variables that drive differences in crop output. This investigation served as the basis for our machine learning modelling. We assessed and compared the performance of several machine learning algorithms, including Bagging Regressor, KNN, Decision Trees, Gradient Boost, Random Forest, and Linear Regression. Accuracy varied noticeably across models: Random Forest, Decision Trees, and Bagging Regressor achieved high accuracy, with respective values of 98.56%, 97.62%, and 98.59%. Conversely, KNN and Linear Regression showed lower accuracy, indicating their limits in this setting. Applying k-fold cross-validation further improved the robustness of our results and highlighted the significance of model validation in crop yield prediction. Some models showed changes in accuracy during cross-validation, which revealed more about their dependability. In addition to investigating the variables that affect agricultural productivity, this study highlights the differing forecasting power of machine learning models. Our findings provide a path for well-informed agricultural decision-making by using technology to optimize crop production estimates. The ultimate goal of this research is to support stakeholders in optimizing agricultural productivity and to enable sustainable practices.


Introduction
As the primary source of human nutrition, agriculture is always looking for new and creative ways to increase output and guarantee food security. The combination of technology, data analytics, and machine learning has emerged as a transformative approach in this endeavour. Using extensive datasets from the United States Department of Agriculture (USDA) covering the years 2003 to 2013, this study aims to estimate crop yields by leveraging these techniques. Crop yield prediction is essential to agricultural decision-making because it helps farmers, policymakers, and other stakeholders plan ahead and anticipate changes in agricultural productivity. It is crucial for this endeavour to understand the many factors that affect agricultural yield, ranging from environmental parameters to soil composition and weather patterns. Consequently, the present study begins with a comprehensive Exploratory Data Analysis (EDA) that explores a wide range of agricultural variables in an attempt to unravel the complex interplay between these variables and crop yields. The core of this study is the subsequent application of several machine learning models to estimate crop output precisely. This study assesses the effectiveness of several algorithms, including Linear Regression, Random Forest, Gradient Boost, XGBoost, KNN, Decision Trees, and Bagging Regressor, in predicting agricultural production. We want to determine the most accurate predictors of crop output fluctuations by evaluating and comparing the performance of the different models, which will enable more exact and better-informed agricultural estimates. This research aims both to simplify the processes involved in crop yield prediction and to open the door to more precise and useful agricultural decision-making by combining data-driven insights with machine learning capabilities.

Objective
The main goal of this study is to use USDA data from 2003 to 2013 and a variety of machine learning models, such as Random Forest, XGBoost, and others, to estimate crop yields accurately. Using measures such as accuracy and Mean Squared Error (MSE), the study seeks to systematically assess and compare the performance of these algorithms. Furthermore, through extensive exploratory data analysis (EDA), the research aims to pinpoint the important variables that affect agricultural output. The study examines model dependability using k-fold cross-validation to support the robustness of the results. In the end, the research aims to offer information that can improve crop output forecasts, aid agricultural decision-making, and help advance sustainable farming methods.
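As a minimal sketch of the two measures named above, the snippet below computes MSE and the coefficient of determination (R^2) in plain Python. It assumes that the "accuracy" reported for the regression models corresponds to R^2 expressed as a percentage, which the text does not state explicitly. The yield values are made up for illustration and are not USDA data.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical yields in hg/ha (illustrative values only).
actual = [30000.0, 45000.0, 52000.0, 61000.0]
predicted = [31000.0, 44000.0, 50000.0, 63000.0]

mse = mean_squared_error(actual, predicted)
r2 = r2_score(actual, predicted)
print(f"MSE = {mse:.1f}, R^2 = {r2 * 100:.2f}%")
```

Because MSE is expressed in squared yield units while R^2 is unitless, reporting both makes it easier to compare models across crops with very different yield scales.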

Hypothesis
We postulate that combining USDA data spanning 2003 to 2013 with a variety of machine learning models and extensive exploratory data analysis (EDA) will substantially improve crop yield forecasts. Certain algorithms, including Random Forest, Decision Trees, and Bagging Regressor, are expected to perform better than KNN and Linear Regression. The underlying factors that EDA identifies are expected to significantly enhance the predictive ability of these models. Furthermore, k-fold cross-validation is expected to improve the robustness of the models. The research seeks to offer information that helps optimise agricultural production, supports sound decision-making, and advances sustainable practices.

Literature
The ultimate goal of our efforts is to improve the productivity and sustainability of farming methods in order to provide a more robust and fruitful future for the world's food systems.
One study uses EDA as a first step in exploring and understanding the data and LDA to group or categorise the data efficiently. It analyses and forecasts wheat production as a function of environmental conditions by combining these techniques with predictive models such as decision trees and random forest regression. In addition, several models are combined through ensemble learning to improve prediction accuracy and to gain an understanding of model performance [1].
Another study used a Systematic Literature Review (SLR) to collect and synthesise data about the algorithms and features used in agricultural yield prediction studies. The first search of six online databases found 567 pertinent studies, of which fifty met the predetermined inclusion and exclusion criteria and were chosen for further analysis [2].
The Random Forest method in particular proves useful in producing very accurate machine learning predictions. Its accurate crop projections assist farmers in choosing the best crop to plant in light of the current environmental conditions [3].
The usefulness of Support Vector Machines (SVM), Single-Layer Artificial Neural Networks (ANN), Deep Neural Networks (DNN), and Extreme Gradient Boosting (XGBoost) models in forecasting daily temperatures for summer maize production in Northwest China has also been examined [4].
Given a set of parameters, machine learning techniques, both supervised and unsupervised, allow results to be predicted. The aim is to build a useful mapping between the input variables and the intended output parameter. To improve crop yield forecast accuracy, one work uses an ensemble of two machine learning algorithms. After a thorough search across several databases, the study identified around seven relevant features. The researchers then assembled and examined a dataset of 28,242 instances. Analysing these features and comparing different algorithms produced informative findings, and the study examined the efficacy of machine learning algorithms and suggested directions for further research in this field [5].
Another work underscores the importance of clustering approaches in identifying patterns within agricultural data, which reduces the difficulties associated with sparse data when estimating crop productivity. K-Fold validation, a robust cross-validation method, is used to examine different prediction models thoroughly: the data is divided into K subsets, and each model is tested on the various folds [6].
K-Fold validation is likewise employed elsewhere to assess different prediction models by dividing the data into K subsets and testing each model on the various folds. It confirms the generalizability of a multi-model ensemble strategy that enhances crop production predictions [7].
The rapid evolution of big data applications in agriculture is driven by an increasing accumulation of experience, growing applications, the emergence of best practices, and enhanced computational power. Despite this progress, actual implementations addressing real-life problems are limited. One review therefore asks what defines the process of adapting big data challenges to solutions, and to what degree the two are aligned [8].
A Systematic Literature Review (SLR) was conducted to extract and synthesise the algorithms and features employed in studies related to crop yield prediction. Using predefined search criteria, a total of 567 pertinent studies were retrieved from six electronic databases, and 50 of them were selected for in-depth analysis based on inclusion and exclusion criteria. The selected studies were examined for their methodologies and features, yielding insights and recommendations for future research directions. The analysis identified temperature, rainfall, and soil type as the most commonly used features, with Artificial Neural Networks emerging as the most commonly applied algorithm in these predictive models [9].
Deep Learning (DL) techniques have recently seen growing application in the analysis of dense scenes, notably dense agricultural scenes. One review surveys these applications: it first outlines the types of dense scenes encountered in agricultural settings and the associated challenges, then gives an overview of widely used deep neural networks tailored to analysing such scenes, and finally covers the applications of these network structures across agricultural tasks such as recognition and classification, detection, counting, and yield estimation [10].
Sensors and biosensors that can perceive changes in plant health and forecast the progression of both morphology and physiology have emerged as a valuable approach for enhancing crop yields. The advent of flexible sensors and nanomaterials has sparked innovations in wearable and portable devices designed for on-plant use. These devices offer continuous and precise long-term sensing, capturing morphological, physiological, biochemical, and environmental parameters. One review explores cutting-edge plant sensing technologies, examining wearable and integrated devices designed to engineer and monitor morphological traits, physiological processes, and interactions between plants and their environment [11].
The heat map shows negative correlations between Area and pesticides_tonnes (-0.35) and between Area and average rainfall (-0.26). The correlation between the crop type ("Item") and the crop production yield ("hg/ha_yield") is -0.22. The negative sign indicates an inverse link: the yield tends to shift somewhat in the opposite direction when the crop type changes. Degree of correlation strength: a value of -0.22 indicates a weak negative association, which suggests that the crop type is likely to have only a minor impact on the final yield per hectare. Impact of crop selection: given the negative association, it is possible that some crop varieties marginally affect the final output.
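The correlations read off the heat map can be reproduced in principle with a Pearson coefficient. The sketch below is a plain-Python illustration under two assumptions that the text does not confirm: that the heat map uses Pearson correlation, and that the categorical "Item" column was converted to integer codes (label encoding) beforehand. The rows are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


# Hypothetical rows (illustrative only).
items = ["Potatoes", "Maize", "Wheat", "Potatoes", "Maize", "Wheat"]
yields = [250000, 40000, 30000, 240000, 35000, 28000]

item_codes = label_encode(items)  # Maize=0, Potatoes=1, Wheat=2
r = pearson(item_codes, yields)
print(f"corr(Item, hg/ha_yield) = {r:.2f}")
```

Note that the sign and size of a correlation with a label-encoded crop type depend on the arbitrary order of the integer codes, so such a value is best read as a rough indicator rather than a measure of how strongly crop choice drives yield.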

Proposed Methodology
There is a slight tendency for the yield per hectare to vary inversely with crop type selection. From the above figure we concluded that Australia yielded the most when harvesting potatoes, whereas Angola yielded the least when harvesting maize, sorghum, and soybeans. Ecuador had difficulty harvesting wheat, while Egypt generated the highest output in this group by cultivating sweet potatoes and potatoes. Honduras fared poorly with wheat, while France and Germany led in potato yield. Madagascar failed to cultivate soybeans and sorghum, while India excelled in producing cassava and Japan was the best at growing potatoes. Niger did not produce much wheat, but Morocco and Mexico did well in producing potatoes. Saudi Arabia, South Africa, and Spain excelled in sorghum cultivation, but Pakistan struggled.
Maximum Yield: the highest yield observed for Maize was 10250.87 hectograms per hectare (Cameroon).
Minimum Yield: the lowest yield observed for Soybeans was 941.75 hectograms per hectare (Tajikistan).
Average Yield: mean Sorghum yields range from around 2500 to over 10,000 hectograms per hectare across countries. Soybeans, Maize, and Wheat also show considerable yield variation across regions.
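The per-crop maximum, minimum, and average yields discussed above amount to a group-by aggregation. The following is a minimal sketch in plain Python; the country names, crops, and values are hypothetical placeholders, and the tuple layout is an assumption made only for this illustration.

```python
from collections import defaultdict

# Hypothetical (country, crop, yield in hg/ha) records for illustration only.
records = [
    ("Country A", "Maize", 36000),
    ("Country B", "Maize", 12000),
    ("Country A", "Potatoes", 250000),
    ("Country C", "Potatoes", 180000),
    ("Country B", "Soybeans", 9000),
    ("Country C", "Soybeans", 21000),
]

# Group the (yield, country) pairs by crop type.
by_crop = defaultdict(list)
for country, crop, yield_hg_ha in records:
    by_crop[crop].append((yield_hg_ha, country))

# For each crop, keep the best and worst (yield, country) pair and the mean.
summary = {}
for crop, pairs in by_crop.items():
    values = [y for y, _ in pairs]
    summary[crop] = {
        "max": max(pairs),
        "min": min(pairs),
        "mean": sum(values) / len(values),
    }

for crop, stats in summary.items():
    print(crop, stats)
```

Keeping the country alongside each yield makes it possible to report which country achieved the extreme value as well as the value itself.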

Fig. 5: Rainfall analysis across various countries
We have a total of seven such figures for the rainfall analysis, and we concluded the following from the bar graphs. Top nations for rainfall: Papua New Guinea, Ecuador, Suriname, Bangladesh, Colombia, Guyana, Indonesia, and Nicaragua receive more than 2000 mm of rain annually.
Saudi Arabia, Pakistan, South Africa, Mali, Mauritania, Morocco, Niger, Libya, Iraq, Egypt, Azerbaijan, and Algeria are among the nations with the least amount of rainfall, with an average of less than 500 mm.
Top nations that use pesticides: Argentina, Brazil, and Italy. France: using more than 80,000 tonnes of pesticides might be detrimental for a nation that produces excellent yields. Japan is a high-producing nation that uses more than 60,000 tonnes of pesticides. The nations that use the fewest pesticides include Algeria, Angola, Azerbaijan, Bulgaria, Burkina Faso, Burundi, Cameroon, Central African Republic, Croatia, Egypt, El Salvador, Greece, Guinea, Guyana, Haiti, Honduras, Hungary, Indonesia, Iraq, Jamaica, Kenya, Kazakhstan, Libya, Madagascar, Malawi, Mali, Mauritania, Mauritius, Mozambique, Nepal, Niger, Papua New Guinea, Rwanda, Senegal, Saudi Arabia, Sri Lanka, Suriname, Tajikistan, Uganda, Zambia, Zimbabwe, and Uruguay. The countries that use the least pesticides include all the low-yielding nations. The top-producing nations are the United Kingdom, Australia, and Germany, which use an average of around 30,000 tonnes of pesticides. The figure shows that Brazil's output began with modest yields and increased as it applied more pesticides. However, Argentina, Australia, and Algeria achieved higher output than Brazil ever did, despite using less pesticide overall.

Parameter Tuning
Adjust model parameters or perform hyperparameter tuning to optimize the efficiency of each model.
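One common way to carry out such tuning is an exhaustive grid search with cross-validation. The sketch below uses scikit-learn's GridSearchCV on a Random Forest. The synthetic data, the parameter grid, and the choice of MSE as the scoring measure are all assumptions made for illustration; the study does not list the values it would search over.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the feature matrix and the yield target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(120, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=120)

# Hypothetical search grid (not taken from the study).
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                              # 3-fold cross-validation per candidate
    scoring="neg_mean_squared_error",  # scikit-learn maximises, hence the negation
)
search.fit(X, y)

best_params = search.best_params_
best_mse = -search.best_score_  # flip the sign back to a plain MSE
print(best_params, round(best_mse, 4))
```

For larger grids, a randomized search over the same parameter ranges is a cheaper alternative that often finds comparable settings.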

Ensemble Methods
Leverage ensemble methods to combine the strengths of multiple models for improved overall performance.
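As one possible illustration, the sketch below averages the predictions of three regressors with scikit-learn's VotingRegressor. The choice of voting (rather than stacking or another scheme), the member models, and the synthetic data are assumptions for this example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the feature matrix and the yield target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The ensemble prediction is the unweighted mean of the members' predictions.
ensemble = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("dt", DecisionTreeRegressor(max_depth=5, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)
ensemble_mse = mean_squared_error(y_test, y_pred)
print(f"Ensemble test MSE: {ensemble_mse:.4f}")
```

Averaging tends to help most when the member models make different kinds of errors, so combining a linear model with tree-based models is a reasonable starting point.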

More Data
Consider expanding the dataset, as a larger and more diverse dataset can lead to improved model generalization.

Conclusion
The study concludes by highlighting the strong predictive power of the Decision Tree, Random Forest, XGBoost, and Bagging Regressor models in predicting agricultural yields. These models consistently outperform the others, exhibiting lower error rates and higher accuracy. The research also suggests potential strategies for enhancing model performance, including feature engineering, parameter tuning, ensemble techniques, and expanding the dataset.
By putting these improvements into practice, we can raise the accuracy of the models further and help provide more accurate and precise projections of crop production. This is in line with the overarching objective of improving agricultural decision-making and encouraging sustainable methods.

Fig. 2: Heat map of the numerical columns

Fig. 11: Accuracy of different regression models for actual vs predicted values

Table 1 represents the summary of the numerical columns present in the data set. Some inferences drawn from Table 1: the average annual rainfall is roughly 1149 mm, with a minimum of 51 and a maximum of 3240; the average amount of pesticides used is 37077 tonnes, with a low of 0.04 and a high of 367778 tonnes. hg/ha_yield: Crop output yields range

Table 3: Model accuracy after applying k-fold cross-validation
The analysis shows that while KNN and Linear Regression perform poorly, Random Forest, XGBoost, Decision Tree, and Bagging Regressor consistently produce excellent results. Gradient Boost performs well but does not quite match the best models.
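A minimal sketch of how k-fold scores of this kind can be obtained with scikit-learn is shown below. The number of folds (5), the use of R^2 as the score, the two models compared, and the synthetic data are assumptions for illustration; they are not the settings or results of the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, deliberately non-linear stand-in for the yield data.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(150, 3))
y = np.sin(4.0 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 0.05, size=150)

# Split the data into 5 folds; each fold serves once as the held-out test set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean helps show whether a model's score is stable or depends heavily on which rows end up in the test fold.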