HPA-2023 Paper 20

CEREAL CROP YIELD PREDICTION USING MACHINE LEARNING
TECHNIQUES
Belete Asmare
Addis Ababa Science and Technology University
beleteasmare10@gmail.com
Sudhir Kumar Mohapatra

Faculty of Emerging Technologies, Sri Sri University, Cuttack, India
sudhir.mohapatra@srisriuniversity.edu.in
Kula Kekeba
Center of Excellence for HPC and Big Data Analytics
Addis Ababa Science and Technology University
kula.kakeba@aastu.edu.et
Semahegn Getnet Molla
Plant Science, Gode Polytechnic College,
Somali Region, Ethiopia
sem.getnet@gmail.com
Abstract: Agriculture plays an important role in Ethiopia's economy. About
85% of the population lives in rural areas, and their livelihood depends largely on crop productivity. Crop
selection depends on several parameters such as market price, production rate, climate data, chemicals,
and government policies. Prediction of crop yields is important for planning and for making various
policy decisions. Many countries whose economies depend on agriculture, Ethiopia among them, still use
conventional data collection techniques for crop monitoring and yield prediction. The purpose of this study
is to develop a cereal crop yield prediction model based on agricultural input data. To this end, appropriate
machine learning techniques were identified and applied to predict cereal crop yields from agricultural
inputs. To build the prediction model, the collected raw data were pre-processed and merged
on their common features. After merging, the final dataset contains the year, the item (crop), the yield
value, the average rainfall, the pesticides, and the average temperature. The data initially had a size of
20 kilobytes and 12 features. After feature importance analysis, the data were reduced to 7 features and
8 kilobytes to develop the prediction model. For the experimental analysis we used Gradient Boosting
Regression, Random Forest Regression, Support Vector Machine, and Decision Tree Regression. We
compared the performance of each algorithm at different train/test split
levels. Among the listed algorithms, Gradient Boosting Regression outperforms the other standard
algorithms, showing 93% accuracy in crop yield prediction.
Keywords: Cereal crop, regression algorithm, machine learning, dataset, yield prediction.
Introduction:
Ethiopia is one of the countries most vulnerable to climate variability. The agricultural sector, which
contributes over 45% of GDP, 80% of the workforce, and 85% of foreign exchange earnings, is very sensitive
to climate change [1]. Over 95% of rain-fed agricultural production comes from
smallholders and subsistence farmers, who have less capacity to adapt to climate change [2]. The
crops produced include food crops, cash crops, fruits, and vegetables. Crop production accounts for the
largest share of the country's GDP and export earnings when compared to livestock production.
Agriculture has been the leading sector for centuries and remains so at present. It is therefore expected
to remain the determinant sector, playing a dominant role in bringing about overall sustainable economic
growth for the country in the years to come, provided that strenuous efforts are made by the government
and the stakeholders involved, such as farmers, to enhance productivity through increased use of farm
inputs such as improved seed and fertilizers, and to modernize farming through increased use
of modern and improved farm implements and farming systems, as well as through the introduction of
modern farming technology to the sector as a whole.
In Ethiopia, cereal production is the dominant form of agricultural practice compared with other types of
crop production. According to the 2019 CSA report, the crop shares are cereals
(71.57%), legumes (11.20%), oats (5.17%), vegetables (1.67%), root crops (1.60%), fruit crops (0.83%),
and coffee (5.28%) of the total crop production area. Among the regional states of the country, Oromia
ranks first both in terms of land area allocation (45.41% of the country-wide crop production area) and
crop production (49.24% of country-wide crop production) [3].
In 2018, the cereal yield for Ethiopia was 2,395 kg per hectare. Although Ethiopia's cereal yield has
fluctuated appreciably in recent years, it tended to increase over the 1969-2018 period, ending at 2,395 kg
per hectare in 2018 [4]. Cereal crop production is an important source of livelihood for smallholder
farmers in the country, and thus smallholder farmers' food security and welfare depend on the extent
of development in this subsector. Cereals such as sorghum, wheat, maize, and rice are major staple foods
for most of the population.
Crop yield is the most important indicator in agriculture and has many connections with human society.
Because of the complexity of the data, crop production forecasting is a difficult task for
policy makers. Researchers in agriculture and agro-economics are interested in developing new
mathematical methods that can make better predictions using existing metrics. Research in
this direction is concerned with establishing a link between the agricultural environment and crop
production, taking into account local variables, soil quality, irrigation, and land use. These models are
based on the laws of measurement [5].
Crop yield prediction is one of the most important and well-known topics in agriculture, along with crop
mapping and estimation, matching crop supply with demand, and crop management. Modern approaches go
far beyond simple predictions based on historical data and include computer vision technologies to provide
information on overall crop, weather, and economic conditions [6].
The challenge begins when one realizes that it is not possible to produce such information for a specific
expert system. Manual surveys and remote sensing data are used to predict crop yields. Manual surveys
combined with observations from past years are useful for a small
area but are difficult to compare across regions and countries. Recent advances in crop simulation models
have overcome these problems [7].
Crop yield predictions are valuable to many stakeholders in the agri-food chain, including farmers,
agronomists, commodity traders, and agricultural policymakers [8]. Crop yield is influenced by many
crop-specific parameters, environmental conditions, and management decisions, and it is difficult to
construct a reliable and explainable prediction model [9].
Machine learning is an artificial intelligence application in which a computer or system learns from
past experience (input data) and makes predictions about the future. The performance of such a
system should be at least at human level. Machine learning is present throughout the
growing and harvesting cycle. It begins with the seed being planted in the ground, covering
soil preparation, seed selection, and water supply measurement, and ends with robots
harvesting the crop after determining its maturity using computer vision [10].
From an engineering perspective, an ML task is a software system that has one or more components
that learn from data. This involves the collection and pre-processing of data, the training of an ML
model, the deployment of the trained model to carry out inference, and the software engineering
of the surrounding system that sends new input data to the model to get answers. Machine
learning is usually classified into three types: supervised learning, unsupervised learning, and
reinforcement learning [11].
Data processing is the basis of the whole agricultural data cycle and has to deal with many problems in
agriculture, including food security, soil conservation, irrigation, pest identification and prevention,
soil health, and agricultural land use. Traditional analysis techniques such as data mining, machine
learning, and statistical analysis are not directly applicable to large-scale data
processing in agriculture. Years of research and development in data mining, machine learning,
statistical analysis, and other data analysis methods have nevertheless had a large impact on how such data
are handled. Depending on the characteristics of agricultural data, timeliness can be used as a measure.
Research data management poses numerous technological challenges. These are associated with environmental
modeling, ranging from metadata-based data retrieval to data mining and
data integration. Analysts are used to verify large-scale agricultural data
algorithms, which makes it possible to assess the effectiveness of the algorithms to some degree and to
estimate the reliability of the results. Predictive analysis is the branch of data analysis that is mainly
used to predict future events or outcomes. The process of predictive analysis can be described
diagrammatically as shown below [12].
i. Research Questions
The study is basically designed with these research questions in mind:
1. Which machine-learning model can be applied for cereal crop yield prediction?
2. How effective is the developed crop yield prediction model?
ii. Contribution of the Study

This work contributes to the scientific community and practice in multiple ways:
• Different factors that affect crop production, specifically of cereal crops, were investigated; these
have different applications for agricultural policies.
• A study of the various solutions proposed or used in crop yield prediction, and of the effect of the
various parameters that influence their results.
• A comparative analysis of standard algorithms for crop yield prediction, identifying the most suitable
algorithm for a generic set of crops.
• An evaluation of the various parameters that affect crop yield, ranking them according to their impact.
• A cereal crop dataset for Ethiopia has been developed and published for researchers and other users.
Related work:
Recently, the research community has given more attention to topics related to agriculture and
its contribution to economic growth. Different approaches have been used to study
agricultural gross production forecasting. Below we review some research works
related to the use of machine learning techniques in agriculture, especially for prediction and forecasting.
Gopal and Bhargavi developed a novel hybrid model to predict paddy crop yield, based on multiple
linear regression (MLR) and Artificial Neural Networks (ANN). In this model, the initial weights of the
neural network are derived from the MLR coefficients. The paddy data is used to train the backpropagation
network, and its performance is compared with that of other machine learning models. The hybrid MLR-ANN
model achieved better precision than the other models [13].
Shastry and Sanjay created a new cloud-based framework to classify soil and to
predict crop yield. The proposed soil classification is based on
the hybrid kernel Support Vector Machine (SVM) method, and the SVM kernel parameters are derived from
GA. The crop yield prediction model was developed based on Artificial Neural Networks (ANN),
and the parameters of the ANN, such as the hidden layers, neurons, and learning rate,
are customized. The proposed cloud-based framework performs better than other
models in soil classification and crop yield prediction [14].
Pavan Patil, Virendra Panpatil, and Prof. Shrikant Kokate proposed a system and discussed improving its
results by adding more attributes. A combination of Naive Bayes and decision tree algorithms is
used. The decision tree shows poor performance on the given dataset and has more variation, whereas Naive
Bayes provides better results than the decision tree for such datasets. The combined classification
algorithm of Naive Bayes and decision tree classifiers performs better than a single classifier
model. The parameters include soil type, soil pH value, humidity, temperature, wind, and rainfall [15].
Islam, T., Chisty, T. A., & Chakrabarty used a deep neural network (DNN) model to predict
the yield of crops such as rice, jute, wheat, and potato using weather, soil, and fertilizer data. The
newly developed DNN model is compared with other machine learning models, namely Random Forest,
Support Vector Machine, and Linear Regression. The DNN model gives higher precision in prediction than
the other models [16].
Feng et al. exploited the power of machine learning and regression models in the prediction process. In
their research work, they compared a cross-validated Random Forest (RF) model with a multiple linear
regression (MLR) model and also established the correlation between climate and rainfall parameters. This
correlation shows how wheat yield decreases when rainfall is low. In
prediction, the RF outperforms MLR [17].
Prakash, S., Sharma, A., & Sahu explored a better way to predict soil moisture with the help of machine
learning models such as Support Vector Machine (SVM) and RNN, and the statistical model multiple linear
regression (MLR). The predicted outcomes of the models are compared against each other.
The authors suggested that, in short-term moisture prediction, MLR has better predictive power than the
machine learning models [18].
Giritharan and Koteeshwari suggested using one of the most effective tools, the Artificial
Neural Network (ANN), for modeling and prediction. To implement the ANN, the Feedforward and
Back Propagation networks are combined. The suggested system is an easy-to-use Android
application [19].
Snehal S. Dahikar and Sandeep V. Rode used Artificial Neural Network technology for estimating long-term
or short-term crop production because it provides a versatile solution for the cumbersome problems
in agricultural research. This research work presented only the ANN, to minimize losses when
conditions are not suitable, predicting crop yield from parameters such as soil, weather, guaranteed
price, and cultivation area [20].
Singh and Prabhat Kumar concluded that their paper would help improve crop yields by applying
classification methods and comparing metrics. Crops can also be analyzed and predicted using
Bayesian algorithms. The Bayesian algorithm, K-means algorithm, clustering algorithm, and Support Vector
Machine algorithm were used. The disadvantage is that the paper does not report the accuracy and
performance of the suggested algorithms [21].
Arun Kumar, Naveen Kumar, and Vishal Vats proposed a system to predict crop yield by
analyzing past soil, rainfall, and yield datasets. The prediction was done using the K-Nearest
Neighbor, Support Vector Machine, and Least Squares algorithms [22]. They performed crop
prediction using weather forecasts, the pesticides and fertilizers to be used, and past revenue as input
data. Multilinear principal component analysis (MPCA) was used for feature reduction. In addition to the
forecast, they take into account prerequisites and feature reduction [23].
A few research works on sugarcane yield prediction can be related to our work.
A sugarcane yield prediction technique using Random Forest [17] was proposed in one
study; the features used in this study consist of biomass index, climate statistics (e.g., rainfall), and
yields from previous years. Two predictive tasks are presented in [24]: (i) the classification problem of
predicting whether the yield will be above or below the observed median yield, and (ii) the regression
problem of predicting the yield estimates in two distinct time intervals. In addition, a support vector
machine for rice crop yield prediction was proposed; the dataset used in this method includes precipitation,
minimum, maximum, and average temperature, area, evapotranspiration, and production. The sequential minimal
optimization classifier is applied to the dataset [25].
Merin Mary Saji, Kevin Tom, Varsha S, Lisha Varghese, and Er. Jinu Thomas proposed a paper that aims to
solve agricultural problems by examining the agricultural region on the basis of soil properties. It
recommends the most suitable crop to farmers, thereby helping them to increase productivity and
reduce losses. The paper compares several algorithms to determine which
is best for this crop prediction. The algorithms tested are KNN, KNN with
Cross-Validation, Decision Tree, Naive Bayes, and SVM. The accuracies obtained were 85%, 88%,
81%, 82%, and 78%, respectively. KNN with cross-validation has the highest accuracy and, as a result,
can be used for implementation in the final system [26].
The dataset is processed through the WEKA tool to build the rules on the current dataset. The results
were generated in Python using the SVM algorithm. Based on the C4.5 algorithm, decision trees and
decision rules have been developed. In their study, the authors developed a website called Crop Advisor,
an interactive website for discovering the effect of weather on crop production using the C4.5
algorithm [27]. This gives an idea of how different climatic parameters affect the growth of the crop. The
selections were made based on the area under the chosen crop. Information on the associated
year's climatic parameters, such as rainfall, high and low temperature, and wet day frequency, was
collected. The ID3 algorithm was used to obtain good quality and improved tomato crop yield; it is
implemented on the PHP platform and uses CSV files as datasets. The features used in this study include
area, production of the tomato crop, temperature, and humidity [28].
A decision tree classifier for agricultural data was proposed [29]. This new classifier uses a
new data representation and can handle both complete and incomplete data. In the experiment, 10-fold
cross-validation is used to test the dataset, the horse-colic dataset, and the soybean dataset. Their
results showed that the proposed decision tree is capable of classifying all kinds of agricultural data.
A yield prediction model was proposed in one study that uses data mining
techniques for classification and prediction. This model takes crop name, topography, soil type, soil pH,
pest information, climate, water level, and seed type as inputs; it predicted plant growth and plant
diseases and therefore made it possible to select the best crop based on climate information and required
parameters [30,31]. Researchers have used deep learning on augmented images, which can be used in crop yield
prediction [32,33,34]. A systematic literature review has been done on potato disease detection using
computer vision [35]. Machine learning and transfer learning have been used for plant disease detection
[36,37].
Studying the previous research by the various scholars above yields many techniques and ideas
that can help in solving the problems addressed here. Machine learning algorithms
can make prediction more efficient, and there
are several approaches to crop yield prediction. As a step forward, we aim to apply regression techniques
to the numerical values of the dataset. Because the values in the dataset are numerical, they are suited
to regression. The table below summarizes the works of other researchers, scholars, and contributors in
the domain of crop yield prediction, the algorithms that they used, the purpose of their studies, and
their findings.
Table 1 Summary of studies and their findings
Year | Author | Purpose | Model used | Findings
2019 | Gopal and Bhargavi | Proposed model to predict the accurate crop yield | Hybrid MLR-ANN model | Hybrid MLR-ANN model gives better prediction accuracy than other models for the same agricultural dataset
2019 | Shastry and Sanjay | Categorize soil and predict crop yield | Hybrid kernel Support Vector Machine (SVM) | Cloud-based framework to categorize soil and predict crop yield
2020 | Pavan Patil, Virendra Panpatil, Prof. Shrikant Kokate | Crop Prediction System using Machine Learning Algorithms | Decision tree and Naïve Bayes | The combined classification algorithm of naïve Bayes and decision tree classifiers performs better than a single classifier model
2018 | Islam, T., Chisty, T. A., & Chakrabarty | Envisage varieties of crop yield like rice, jute, wheat and potato | Deep learning neural network model (DNN) | DNN model has higher precision than RF, SVM and linear regression in prediction
2018 | Feng, P., Wang | Wheat yield prediction based on rainfall | Random forest (RF) model and multiple linear regression model | RF performs better than the linear regression model after comparison
2018 | Prakash, S., Sharma, A., & Sahu | To predict future soil moisture | Support Vector Machines (SVM), Random Forest (RF), Extremely Randomized Trees (ET), Gradient Boosting Machines (GBM), and Deep Feedforward Network (DFN) | GBM model showed the lowest prediction error
2016 | Giritharan and Koteeshwari | To develop crop predictor and advisor using ANN | Artificial Neural Network | Develop crop predictor and advisor application for smartphones
2014 | Snehal S. Dahikar and Sandeep V. Rode | To develop crop prediction by sensing various parameters of soil | Artificial Neural Network | Developed powerful tools for modeling and prediction of crop based on soil
2015 | J.P. Singh, Rakesh Kumar, M.P. Singh and Prabhat Kumar | To improve the yield rate of crops | Bayesian algorithm, K-means Algorithm, Clustering Algorithm, SVM | Analyzing crop prediction using those models, but they did not show proper accuracy error
2018 | Arun Kumar, Naveen Kumar and Vishal Vats | Efficient Crop Yield Prediction Using Machine Learning Algorithms | SVM and Least Squares algorithms | It shows that SVM is better here compared to the complexity
2020 | Kevin Tom, Varsha S, Merin Mary Saji, Lisha Varghese, Er. Jinu Thomas | Crop Prediction Using Machine Learning | KNN, Decision Tree, Naive Bayes, KNN with Cross Validation, and SVM | The accuracies obtained here are 85%, 88%, 81%, 82% and 78% respectively. KNN with cross validation has the highest accuracy for this paper.
Based on the performance reported in the above literature, we selected four machine
learning regression algorithms for crop yield prediction.
An analysis of the gaps in previous research shows that the data used and the scope
are limited to specific areas; this study fills that gap for Ethiopia and specifically for cereal crops.
Proposed Model:
The first task in this research is understanding the problem domain. This step includes an overview
of agriculture and of the determinants of cereal crop yield. The data understanding step covers
domain-specific terminology, data description, and attribute selection. In the data preparation step, data
cleaning, data integration, and data reduction are applied. The next step is building the model based
on the selected algorithm, which is a regression algorithm.
Figure 1 Block diagram for model design
The model in Figure 1 shows how the components of the system communicate among themselves,
starting from the pre-processing of data. The proposed model is able to predict crop yield. It
gives a clear picture of how the large amount of data is captured and pre-processed to remove unwanted
entries such as NULL values. During the pre-processing step, we split the dataset into training
and testing datasets. The training dataset is used to learn the crop yield using appropriate
supervised learning algorithms. The trained machine learning models are then used to predict crop
yield for any new data. After data acquisition, suitable machine learning
algorithms must be applied to assess the efficiency and capability of the model; here we applied
various machine learning algorithms: random forest regression, SVR, decision tree regression, and gradient
boosting regression. Measurements such as accuracy are calculated for the proposed model. The
system architecture focuses on three parts: the data flow, the machine learning techniques, and the modules
for predicting crop yield and selecting features.
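The split-and-evaluate step described above can be sketched in plain Python. This is a minimal illustration and not the code used in the study: the helper names `train_test_split_rows` and `r2_score`, the fixed seed, the split fractions, and the example predictions are all assumptions, and the sketch presumes that the "accuracy" reported for a regression model is the coefficient of determination R². The wheat values are taken from the sample in Table 2.

```python
import random


def train_test_split_rows(rows, test_fraction, seed=42):
    """Shuffle rows reproducibly and cut them into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# (year, yield in hg/ha) pairs for wheat, copied from the Table 2 sample.
wheat = [(1961, 7127), (1962, 7127), (1963, 7123), (1964, 7100), (1965, 7200),
         (1966, 7300), (1967, 7327), (1968, 7389), (1969, 7453)]

# Assumed split levels; the text only says that several levels were compared.
for test_fraction in (0.2, 0.3):
    train, test = train_test_split_rows(wheat, test_fraction)
    print(f"test={test_fraction:.0%}: {len(train)} train rows, {len(test)} test rows")

# Hypothetical predictions for three held-out years, only to show the metric.
actual = [7300, 7327, 7389]
predicted = [7310, 7320, 7400]
print("R^2 on the hypothetical predictions:", round(r2_score(actual, predicted), 3))
```

An R² of 1.0 means the predictions match the observed yields exactly, while 0.0 means the model does no better than always predicting the mean yield.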
After cleaning and exploring the relationships among the features, the final data frame carries the
following features used in our model:
• Area: country of production.
• Item: type of crop.
• Year: year of production.
• Average_rain_fall_mm_per_year: average amount of rain recorded that year.
• Hg/ha_yield: the country's yearly yield of the crop that year, in hectograms per hectare.
• Pesticides_tonnes: amount of pesticides used on the crop that year.
• Avg_temp: average temperature recorded for that year.
Research on crop yield prediction needs multiple factors of production and different algorithms. Some
of the algorithms are used to find the best feature subset for better prediction, and others
are used for the prediction itself. Multiple algorithms were used in the current study so that their
performance could be compared. It has long been recognized that the generation of empirical models to
estimate crop yield is an important responsibility for the remote sensing community [31].
Machine learning is an essential decision support tool for crop yield prediction, including supporting
decisions on what crops to grow and what to do during the growing season. Regression
is a supervised machine learning task that is important for prediction on labeled data. It
predicts continuous values, which makes it suitable for crop yield prediction. Many machine
learning algorithms have been used for crop yield prediction by numerous researchers. Commonly used
models for crop yield prediction are random forest regression, decision tree regression, support vector
machine (SVM), and gradient boosting regression.
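To make the idea behind gradient boosting regression concrete, the following toy sketch fits an additive model of one-split decision trees (stumps) to the residuals of the previous rounds, using squared error. It is an illustration under stated assumptions, not the implementation used in the study: the function names, the single feature (year), the number of rounds, and the learning rate are all chosen here for demonstration. The wheat yields come from the sample in Table 2.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor on one feature (least squared error)."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Wheat yield (hg/ha) by year, copied from the Table 2 sample.
years = [1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969]
wheat_yield = [7127, 7127, 7123, 7100, 7200, 7300, 7327, 7389, 7453]

model = fit_gradient_boosting(years, wheat_yield)
mean_yield = sum(wheat_yield) / len(wheat_yield)
baseline_mse = mse(wheat_yield, [mean_yield] * len(wheat_yield))
boosted_mse = mse(wheat_yield, [predict(model, x) for x in years])
print(f"training MSE: mean baseline {baseline_mse:.1f}, boosted {boosted_mse:.1f}")
```

Each round shrinks the remaining training error a little, which is why many weak learners combined with a small learning rate can fit the data well. The errors printed here are training errors only; a real comparison would be made on held-out data.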
Experiment and result:

In this research, various hardware and software resources were used to test the proposed
algorithms. A personal computer with an Intel® Pentium B960 CPU at 2.2 GHz, 2.00 GB of memory, and a 300 GB
hard drive was used, running Microsoft Windows 10 Ultimate. Microsoft® Excel® 2016
was used for statistical analysis (calculating minimum, maximum, average, and standard deviation) at the
dataset analysis stage.
The Jupyter Notebook is an open-source web application that lets you create and share documents that
include live code, equations, visualizations, and narrative text. Uses include data cleaning and
transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much
more. The Jupyter Notebook project is the evolution of the IPython Notebook, which was
developed primarily to enhance the default Python interactive console by enabling scientific operations
and advanced data analytics capabilities through sharable web documents. Jupyter Notebooks work with
what is referred to as a two-process model based on a kernel-client infrastructure. This
model applies a similar idea to the Read-Evaluate-Print Loop (REPL) programming environment, which
takes a single user's inputs, evaluates them, and returns the result to the user.
i. Data Gathering and Cleaning

The science of training machines to learn and produce models for future predictions is
widely used, and not for nothing. Agriculture plays a vital role in the global
economy. With the continuing growth of the human population, understanding crop yield is central to
addressing food security challenges and reducing the impacts of climate change. Crop yield
prediction is an important agricultural problem.
Agricultural yield depends primarily on climate conditions (rain and temperature) and on pesticides, and
accurate information about the history of crop yield is important for making decisions related
to agricultural risk management and future predictions. The staple foods that sustain people
are similar. In this study, the yields of the top four cereal crops are predicted by applying different
machine learning techniques. These crops are maize, rice, sorghum, and wheat.
ii. Crops Yield Data

The cereal crop yields of the four most consumed crops in the country were downloaded from the FAO website
after importing the required libraries. The collected data include the item, the year (from 1961 to 2016),
and the yield value.
Table 2 Sample Crop yield Dataset
Item Code | Item | Year | Unit | Value
56 | Maize | 1961 | hg/ha | 9629
56 | Maize | 1962 | hg/ha | 9610
56 | Maize | 1963 | hg/ha | 9000
56 | Maize | 1964 | hg/ha | 9700
56 | Maize | 1965 | hg/ha | 9850
56 | Maize | 1966 | hg/ha | 10000
56 | Maize | 1967 | hg/ha | 10078
56 | Maize | 1970 | hg/ha | 10731
27 | Rice, paddy | 1993 | hg/ha | 18519
27 | Rice, paddy | 1994 | hg/ha | 18275
27 | Rice, paddy | 1995 | hg/ha | 18644
27 | Rice, paddy | 1996 | hg/ha | 18372
27 | Rice, paddy | 1997 | hg/ha | 18462
27 | Rice, paddy | 1998 | hg/ha | 18571
27 | Rice, paddy | 1999 | hg/ha | 18667
27 | Rice, paddy | 2000 | hg/ha | 18293
27 | Rice, paddy | 2003 | hg/ha | 18056
83 | Sorghum | 1961 | hg/ha | 7930
15 | Wheat | 1961 | hg/ha | 7127
15 | Wheat | 1962 | hg/ha | 7127
15 | Wheat | 1963 | hg/ha | 7123
15 | Wheat | 1964 | hg/ha | 7100
15 | Wheat | 1965 | hg/ha | 7200
15 | Wheat | 1966 | hg/ha | 7300
15 | Wheat | 1967 | hg/ha | 7327
15 | Wheat | 1968 | hg/ha | 7389
15 | Wheat | 1969 | hg/ha | 7453
The above table displays a small part of the crop yield dataset for different types of crop in different
years.
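The yield values in Table 2 are in hectograms per hectare (hg/ha), whereas national figures are often quoted in kilograms per hectare. Since one kilogram is ten hectograms, the conversion is a division by ten. The short sketch below shows it; the function name is chosen here for illustration.

```python
HG_PER_KG = 10  # 1 kilogram = 10 hectograms


def hg_per_ha_to_kg_per_ha(value_hg_per_ha):
    """Convert a yield from hectograms per hectare to kilograms per hectare."""
    return value_hg_per_ha / HG_PER_KG


# Values copied from the Table 2 sample.
print(hg_per_ha_to_kg_per_ha(9629))   # maize, 1961 -> 962.9 kg/ha
print(hg_per_ha_to_kg_per_ha(18519))  # rice, 1993 -> 1851.9 kg/ha
```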
Climate Data
The climatic factors include rainfall and temperature. Together with pesticides and soil, they are abiotic
components of the environment that influence plant growth and development. Rainfall has a dramatic
effect on agriculture. For this project, rainfall per year was gathered from the World Data Bank
repository.
Table 3 Sample Rainfall Data set
Year Average_rain_fall_mm_per_year
1963 910.08
1964 943.97
1965 749.42
1966 847.97
1967 1082.36
1968 892.22
1969 777.65
1970 817.73
1971 821.9
1972 842.86
1973 755.97
1974 794.31
1975 901.29
1976 896.37
1977 964.92
1978 772.47
1979 772.21
1980 704.63
1981 794.49
1982 928.02
1983 837.77
1984 629.57
The average temperature for each country was collected from the World Data Bank repository. The average
temperature series starts in 1901 and ends in 2020, with some empty rows that have to be dropped.
Table 4 Sample temperature Data set
Year Average Temperature
1961 21.98
1962 22.03
1963 22.23
1964 21.82
1965 22.21
1966 22.31
1967 21.82
1968 21.83
1969 22.55
1970 22.48
1971 21.97
1972 22.42
1973 22.8
1974 22.16
1975 22.28
1976 22.59
1977 22.58
1978 22.61
1979 22.75
1980 23.04
1981 22.39
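Dropping the empty rows of the temperature series can be sketched as follows. This is an illustration only: the CSV layout, the column names, the helper name `load_temperature`, and the position of the blank rows are assumptions made for the example, while the four filled rows reuse values from Table 4.

```python
import csv
import io

# Hypothetical raw export: the two "," lines stand for empty rows.
raw_csv = """Year,Average Temperature
1961,21.98
1962,22.03
,
1963,22.23
,
1964,21.82
"""


def load_temperature(text):
    """Parse the CSV text and keep only rows where both fields are filled in."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        year = row["Year"].strip()
        temperature = row["Average Temperature"].strip()
        if year and temperature:
            cleaned.append((int(year), float(temperature)))
    return cleaned


temperature = load_temperature(raw_csv)
print(len(temperature), "rows kept:", temperature)
```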
iii. Pesticides Data

The pesticides used for each item and country were also collected from the FAO database.
Table 5 Sample Pesticides data set
Year | Unit | Value
1993 | tonnes of active ingredients | 242
1994 | tonnes of active ingredients | 242
1995 | tonnes of active ingredients | 242
1996 | tonnes of active ingredients | 383
1997 | tonnes of active ingredients | 383
1998 | tonnes of active ingredients | 383
1999 | tonnes of active ingredients | 492.5
2000 | tonnes of active ingredients | 602
2001 | tonnes of active ingredients | 630
2002 | tonnes of active ingredients | 822.67
2003 | tonnes of active ingredients | 1015.4
2004 | tonnes of active ingredients | 1208
2005 | tonnes of active ingredients | 1400.7
2006 | tonnes of active ingredients | 2603.1
2007 | tonnes of active ingredients | 2593.9
2008 | tonnes of active ingredients | 2270.6
2009 | tonnes of active ingredients | 3699.2
2010 | tonnes of active ingredients | 4128.1
2011 | tonnes of active ingredients | 4128.1
2012 | tonnes of active ingredients | 4128.1
2013 | tonnes of active ingredients | 4128.1
2014 | tonnes of active ingredients | 4128.1
2015 | tonnes of active ingredients | 4128.1
2016 | tonnes of active ingredients | 4128.1
iv. Data Exploration

The final dataframe was obtained by joining four different dataframes, built from FAO and World Data Bank
data, to collect all the needed features. After cleaning the data and transforming it into a standardized form,
we merged the four dataframes into the final dataframe yield_df in order to understand the relationships
between these parameters. After merging, the final dataframe starts in 1993 and ends in 2016.
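Why the merged data only starts in 1993 can be illustrated with a small join on the year column. The sketch below assumes an inner join (the paper does not state the join type) and uses only sample values copied from the tables in this section; the helper name `inner_join_on_year` and the column names are hypothetical.

```python
# Sample rows copied from the tables in this paper.
wheat_yield = {1963: 7123, 1964: 7100, 1965: 7200}      # hg/ha
rainfall = {1963: 910.08, 1964: 943.97, 1965: 749.42}   # mm per year
temperature = {1963: 22.23, 1964: 21.82, 1965: 22.21}   # average temperature
pesticides = {1993: 242, 1994: 242, 1995: 242}          # tonnes of active ingredients


def inner_join_on_year(**tables):
    """Keep only the years present in every table and combine their values."""
    common_years = set.intersection(*(set(table) for table in tables.values()))
    return [
        {"year": year, **{name: table[year] for name, table in tables.items()}}
        for year in sorted(common_years)
    ]


# Joining yield, rainfall, and temperature keeps all three sample years.
climate_rows = inner_join_on_year(
    yield_hg_ha=wheat_yield, rain_mm=rainfall, avg_temp=temperature
)

# Adding pesticides, whose records begin in 1993, leaves no row for 1963-1965.
all_rows = inner_join_on_year(
    yield_hg_ha=wheat_yield,
    rain_mm=rainfall,
    avg_temp=temperature,
    pesticides_tonnes=pesticides,
)
```

Under this assumption, the table with the shortest time span (pesticides, 1993 to 2016) determines the span of the merged data.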
To explore the relationships between the columns of the dataframe, a quick check is to view the correlation
matrix as a heatmap. The correlation between all the features was calculated and is illustrated with a
diverging-color heatmap.
Figure 2 Correlation map in the dataframe
It is evident from the heatmap above that the variables are largely independent of each other, with no
notable correlation between any of the columns in the dataframe.
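The numbers behind such a heatmap are pairwise Pearson correlation coefficients. The following sketch computes a small correlation matrix with the standard library only; it uses the 1963 to 1969 sample values from Tables 3 and 4 rather than the full merged dataframe, and the function and column names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# 1963-1969 values copied from the sample rainfall and temperature tables.
columns = {
    "rain_mm": [910.08, 943.97, 749.42, 847.97, 1082.36, 892.22, 777.65],
    "avg_temp": [22.23, 21.82, 22.21, 22.31, 21.82, 21.83, 22.55],
}

# Correlation matrix as a nested dict: matrix[a][b] is the correlation of a and b.
matrix = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}
```

Each coefficient lies between -1 and 1, and values near 0 are what the heatmap renders in its neutral color. Seven sample years are far too few to draw conclusions, so the sketch only shows the mechanics.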
v. Model Comparison & Selection

Before deciding on the algorithm to use, we must first evaluate, compare, and select the one that is best
suited to this particular dataset. Usually, when working on a machine learning problem with a given dataset,
we try different models and techniques to solve the optimization problem and then adopt the most
appropriate model, that is, one that does not overfit the data. We compare the following models for this
project:
• Gradient Boosting Regressor
• Random Forest Regressor
• Support vector machines (SVM)
• Decision Tree Regressor
Result of Evaluation Metrics

The evaluation metric is the R² (coefficient of determination) regression score, which represents the
proportion of the variance in the yield of the items (crops) that is explained by the regression model. The R²
score shows how well the data points fit a curve or line. R² is a statistical measure, typically between 0 and
1, of how close the data are to the fitted regression line. If it is 1, the model explains 100% of the variance in
the data; if it is 0, the model explains none of the variance.
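The R² score described above can be computed directly from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a plain standard-library version for illustration; the wheat yields are the 1963 to 1966 sample values from the yield table, and the predictions are constructed cases rather than model output.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Wheat yields (hg/ha) for 1963-1966 from the sample yield table.
actual = [7123, 7100, 7200, 7300]
mean_yield = sum(actual) / len(actual)

perfect = r2_score(actual, actual)                       # every point predicted exactly
baseline = r2_score(actual, [mean_yield] * len(actual))  # always predict the mean
worse = r2_score(actual, [0] * len(actual))              # far worse than the mean
```

A perfect prediction gives 1, always predicting the mean gives 0, and a model that does worse than the mean gives a negative score, which is why the range 0 to 1 is typical rather than guaranteed on held-out data.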
In this study, the data is split into training and test sets at different ratios. The following table compares the
machine learning models based on the value of R².
Table 6 R² result summary of different train/test values
R² result (%) by train/test split
Models                        90/10  80/20  70/30  60/40
Gradient Boosting regression  92     93     87     88
Random forest regression      76     76     69     72
SVR                           73     71     73     71
Decision tree regression      75     88     83     8
From the results above, for the 80/20 train/test data split the Gradient Boosting Regressor has the highest
R² score of 93.2%; Decision Tree regression comes second.
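The four train/test ratios compared in Table 6 can be produced with a simple shuffled split. This is a minimal sketch, assuming a random split with a fixed seed; the paper does not say how the split was drawn, and the dataset size of 100 rows and the function name are hypothetical.

```python
import random


def train_test_split(rows, test_fraction, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 dataset rows (hypothetical size)

split_sizes = {}
for test_fraction in (0.10, 0.20, 0.30, 0.40):
    train, test = train_test_split(rows, test_fraction)
    split_sizes[test_fraction] = (len(train), len(test))
```

Fixing the seed keeps each split reproducible, so differences in R² between models come from the models rather than from a different shuffle.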
The comparison of the models is shown graphically below using the 80/20 train/test data split.
Figure 3 Model Comparison
We also calculate the adjusted R², which indicates how well terms fit a curve or line but adjusts for the
number of terms in the model. If more and more useless variables are added to the model, the adjusted R²
will decrease; if more useful variables are added, it will increase. The adjusted R² is always less than or
equal to R².
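The adjustment uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. The sketch below applies it to the best R² reported in this paper with the 7 retained features; the sample count of 100 is a hypothetical placeholder, since the paper does not report the number of rows.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R²: 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# R² = 0.932 is the best score reported in this paper and 7 is the number of
# features kept; the sample count of 100 is a hypothetical placeholder.
example = adjusted_r2(0.932, n_samples=100, n_features=7)
```

Because (n - 1)/(n - p - 1) is at least 1 whenever p is at least 1, the penalty can only lower the score, which matches the statement that the adjusted R² never exceeds R².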
Figure 4 Actual Vs Predicted Yield

The figure above shows how closely the predicted values match the actual values. It can be seen that the
R² score is excellent. This means that we have found a well-fitting model to predict the crop yield value for
each year.
vi. Model Results

The research work was carried out using a crop dataset collected from the FAO and World Data Bank
repositories. It contains various staple cereal crops, for example wheat, rice, maize, and sorghum. It
includes prediction parameters such as temperature, rainfall, pesticides, and year of harvest. For a
predictive model, a machine learning system needs two kinds of data, namely a training set and a test set.
The training data is the survey data collected from past events, while the current survey data is the test
data.
The most common interpretation of R² is how well the regression model fits the observed data. For
example, an R² of 60% indicates that 60% of the variance in the data is explained by the regression model.
Generally, a higher R² indicates a better fit for the model. From the results obtained, it is clear that the
model fits the data to an excellent degree of 93.2%. Feature importance is calculated as the decrease in
node impurity weighted by the probability of reaching that node. The node probability is calculated as the
number of samples that reach the node divided by the total number of samples. The higher the value, the
more important the feature. The 7 most important features for the model are shown below.
Figure 5 Level of Feature importance
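The impurity-based calculation described above can be sketched for a single tree split. The example assumes variance as the impurity measure, a common choice for regression trees, and uses made-up target values; the function name is hypothetical, and a full feature importance would sum this quantity over every node that splits on the feature and then normalize across features.

```python
from statistics import pvariance


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the node probability.

    Impurity is the population variance of the target values in a node,
    and the node probability is len(parent) / n_total.
    """
    n_node = len(parent)
    node_probability = n_node / n_total
    children_impurity = (
        len(left) / n_node * pvariance(left)
        + len(right) / n_node * pvariance(right)
    )
    return node_probability * (pvariance(parent) - children_impurity)


# Toy target values (hypothetical): the split separates low from high yields.
parent = [1.0, 1.0, 3.0, 3.0]
left, right = [1.0, 1.0], [3.0, 3.0]

at_root = weighted_impurity_decrease(parent, left, right, n_total=4)
deeper = weighted_impurity_decrease(parent, left, right, n_total=8)
```

The same split contributes half as much when only half of the samples reach the node, which is the weighting by node probability that the text refers to.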
The crop being maize has the highest importance in the model's decision-making, as maize is the
highest-yielding crop in the dataset. Rice comes next; then, as expected, we see the effect of pesticides,
followed by rainfall and temperature. The initial assumption about these features was correct: all of them
significantly impact the expected crop yield in the model. The boxplot shows the yield for each item: maize
is the highest, followed by rice, wheat, and sorghum.
Figure 6 Yield for each item.
Conclusion
Agricultural research is an area to which the government gives considerable attention because the
Ethiopian economy is highly dependent on it. Since cereal crop production dominates other types of crop
production, contributing more than 71% of total crop production, this paper focuses on cereal crop yield
and the effects of different parameters on the production of such crops. To improve crop yield prediction,
machine learning techniques were implemented and analyzed for cereal crop yield prediction in the case of
Ethiopia. Predicting the size of the crop can influence on-farm decisions, such as how much pesticide is
needed, and can help farmers carefully plan maintenance and labor schedules to be ready for the start of the
harvesting season. For crop yield prediction, the climate factors (temperature and rainfall) and the amount
of pesticides used had different impacts. Developing accurate models for cereal crop yield estimation using
machine learning techniques may help farmers and other stakeholders improve decision-making in relation
to national food revenue and food security. The purpose of this study is to address problems such as the
accuracy of crop yield prediction by farmers and governments. For the experiments in this study, the
dataset was collected from FAO and the World Data Bank. The data were preprocessed to make them
easier to understand and were then used to build the machine learning models. There are four sets of data:
temperature, rainfall, pesticide, and crop yield datasets. Based on our dataset, the model was developed
using four data preprocessing techniques. The prediction of cereal crop yield is based primarily on the
dataset and the implementation of the algorithms. Each dataset was analyzed according to the parameters
that affect crop yield prediction.
Reference:
[1] Central Statistical Agency(CSA), "Agriculture Sample Survey," Central Statistical Agency(CSA), Addis
Ababa,Ethiopia, 2011.
[2] MoFED (Ministry of Finance and Economic Development) , "Survey of the Ethiopian economy," ,Addis
Ababa, Ethiopia,, 2006.
[3] Central Statistical Agency (CSA), "Agricultural sample survey: Report on area and production of major
crops," Addis Ababa, 2019.
[4] (CSA), Central Statistical Agency, "Agricultural sample survey: Report on area and," Addis Ababa, 2017.
[5] Zhong L. Hu L. Zhou H., "Deep learning based multi-temporal crop classification," Remote Sens. Environ,
vol. 221, p. 430–443, 2019.
[6] Rossana MC, L. D., "Prediction Model Framework for Crop Yield Prediction," in Asia Pacific Industrial
Engineering and Management Society Conference Proceedings Cebu, Phillipines, 2013.
[7] You, J., Li, X., Low, M., Lobell, D., Ermon, S., "Deep Gaussian process for crop yield prediction based on
remote sensing data," in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[8] Basso, B., Liu, L., "Seasonal crop yield forecast: methods, applications, and accuracies," Elsevier, vol. 154,
no. Advances in Agronomy , p. 201–255, 2019.
[9] Chipanshi, A., Zhang, Y., Kouadio, L., Newlands, N., Davidson, A., Hill, H., Warren,R., Qian, B.,
Daneshfar, B., Bedard, F., et al, "Evaluation of the integrated Canadian crop yield forecaster (ICCYF)
model for in-season prediction of crop yield across the Canadian agricultural landscape," vol. 206, no. Agri-
cultural and Forest Meteorology, p. 137–150, 2015.
[10] Fischer, R., "Definitions and determination of crop yield, yield gaps, and of rates of change.," vol. 182,
no. Field Crop Res, p. 9–18, 2015.
[11] C. Ozer., "Research on Machine Learning Methods and Its Applications," Real-World Applications and
Research, no. Machine Learning: Algorithms, 2018.
[12] Lee JY, Ahn S, Kim D., "Deep learning-based prediction of future growth potential of technologies,"
PLoS ONE, 2021.
[13] Gopal, P. M., & Bhargavi, R., " A novel approach for efficient crop yield prediction," no. Computers
and Electronics in Agriculture, 2019.
[14] Shastry, K. A., & Sanjay, H. A., "Cloud-Based Agricultural Framework for Soil Classification and Crop
Yield Prediction as a Service," no. Emerging Research in Computing, Information, Communication and
Applications, pp. 685-696, 2019.
[15] Pavan Patil, Virendra Panpatil, Prof. Shrikant Kokate, "Crop Prediction System using Machine Learning
Algorithms," International Research Journal of Engineering and Technology (IRJET) , vol. 07 , no. 02,
2020.
[16] Islam, T., Chisty, T. A., & Chakrabarty, A., "A Deep Neural Network Approach for Crop Selection and
Yield Prediction in Bangladesh," in IEEE Region 10, Bangladesh, 2018.
[17] Feng, P., Wang, B., Li Liu, D., Xing, H., Ji, F., Macadam, I., ... & Yu, Q. , "Impacts of rainfall extremes
on wheat yield in semi-arid cropping systems in eastern Australia," Vols. 147(3-4), no. Climatic change,
pp. 555-569, 2018.
[18] Prakash, S., Sharma, A., & Sahu, S. S., "Soil Moisture Prediction Using Machine Learning.," in 2018
Second International Conference on Inventive Communication and Computational Technologies
(ICICCT),, 2018.
[19] Giritharan Ravichandran, Koteeshwari R S., "Agricultural Crop Predictor and Advisor using ANN for
Smart phones," IEEE, 2016.
[20] Snehal S.Dahikar, Dr.Sandeep V.Rode, " Agricultural Crop Yield Prediction Using Artificial Neural
Network Approach," International Journal Of Innovative Research In Electrical, Electronics,
Instrumentation And Control Engineering, vol. 2(1), pp. 683-686., 2014.
[21] Rakesh Kumar, M.P. Singh, Prabhat Kumar, J.P. Singh,, "Crop Selection Method to Maximize Crop
Yield Rate using Machine Learning Technique," in International Conference on Smart Technologies and
Management for Computing Communication, Controls, Energy and Materials (ICSTM), Vel Tech
Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, T.N., India, May 2015.
[22] Arun Kumar, Naveen Kumar and Vishal Vats, "Efficient crop yield prediction using machine learning
algorithms," International Research Journal of Engineering and Technology (IRJET), vol. 05, pp. ISSN:
2395-0072, 2018.
[23] Aakunuri Manjula and Dr. G.Narsimha, "Crop Yield Prediction with Aid of Optimal Neural Network in
Spatial Data Mining," New Approaches, International Journal of Information & Computation Technology
ISSN 09742239 , vol. 6(1), pp. 25-33, 2016.
[24] Y. Everingham, J. Sexton, D. Skocaj, and G. Inman-Bamber. , "Accurate prediction of sugarcane yield
using a random forest algorithm," vol. 36(2) , no. Agronomy for Sustainable, 2016.
[25] N. Gandhi, L. J. Armstrong, O. Petkar and A. K., "Tripathy, Rice crop yield prediction in India using
support vector machines," in 13th International Joint Conference on Computer Science and Software
Engineering (JCSSE), Khon Kaen, 2016 .
[26] M Kalimuthu ,P.Vaishnavi, M.Kishore, "Crop Prediction using Machine Learning," in Proceedings of
the Third International Conference on Smart Systems and Inventive Technology (ICSSIT2020), 2020.
[27] S. Veenadhari, B. Misra and C. Singh, " Machine learning approach for forecasting crop yield based on
climatic parameters," in International Conference on Computer Communication and Informatics,
Coimbatore, 2014.
[28] CH. Vishnu Vardhan chowdary, Dr.K.Venkataramana, "Tomato Crop Yield Prediction using ID3,"
IJIRT, vol. 4, no. 10 , pp. 663-62, March 2018.
[29] Jun Wu, Anastasiya Olesnikova, Chi- Hwa Song, Won Don Lee, "The Development and Application of
Decision Tree for Agriculture Dat," IITSI, pp. 6-20, 2009.
[30] R. Sujatha and P. Isakki, " A study on crop yield forecasting using classification techniques," in 2016
International Conference on Computing Technologies and Intelligent Data Engineering (ICCTIDE'16),
Kovilpatti, 2016.
[31] Ahmad, F. K., et al, " Daily stream flow prediction on time series forecasting.," Journal of Theoretical
and Applied Information Technology, vol. 95(4), no. ISSN: 1992-8645 and E-ISSN: 1817-3195, 28th
February 2017.
[32] Maya Gopal, P.S., Bhargavi, R., " Optimum Feature subset for optimizing crop yield prediction using
filter and wrapper approaches," Appl. Eng. Agri., vol. 35 (1), pp. 9-14, 2019a.
[33] Kind, M.C., Brunner, R.J., TPZ, "Photometric redshift PDFs and ancillary information by using
prediction trees and random forests," Monthly Notices of the Royal Astronomical Society, 2013.
[34] Mohapatra, Sudhir Kumar. "Automatic Lung Tuberculosis Detection Model Using Thorax Radiography
Image." Deep Learning Applications in Medical Imaging. IGI Global, 2021. 223-242.
[35] Sinshaw, Natnael Tilahun, et al. "Applications of Computer Vision on Automatic Potato Plant Disease
Detection: A Systematic Literature Review." Computational Intelligence and Neuroscience 2022 (2022).
[36] Sinshaw, Natnael Tilahun, Beakal Gizachew Assefa, and Sudhir Kumar Mohapatra. "Transfer Learning
and Data Augmentation Based CNN Model for Potato Late Blight Disease Detection." 2021 International
Conference on Information and Communication Technology for Development for Africa (ICT4DA). IEEE,
2021.
[37] Mohapatra, Sudhir Kumar, Srinivas Prasad, and Sarat Chandra Nayak. "Wheat Rust Disease Detection
Using Deep Learning." Data Science and Data Analytics: Opportunities and Challenges (2021): 191.
