
A Major Project Phase Report

On

Crop Yield Prediction Using Machine Learning


Algorithms
submitted in partial fulfillment of the requirements for the award of degree of

BACHELOR OF TECHNOLOGY
in

COMPUTER SCIENCE & ENGINEERING


by
Vemula Dayakar (18211A05V8)
Yervala Vamsi Krishna Reddy (18211A05X0)
Yalamanchili Jotsna Sri (18211A05W2)

Under the guidance of

Mrs. B. Usha Sri


Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING


B.V.RAJU INSTITUTE OF TECHNOLOGY
(UGC Autonomous, Accredited by NBA & NAAC)

Vishnupur, Narsapur, Medak (Dist.), Telangana State, India - 502313

2021 - 2022
B. V. Raju Institute of Technology
(UGC Autonomous, Accredited By NBA & NAAC)
Vishnupur, Narsapur, Medak (Dist.),
Telangana State, India – 502313
_________________________________________________________

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


CERTIFICATE

This is to certify that the Major Project entitled “Crop Yield Prediction
Using Machine Learning Algorithms”, being submitted by

Vemula Dayakar (18211A05V8)

Yervala Vamsi Krishna Reddy (18211A05X0)

Yalamanchili Jotsna Sri (18211A05W2)

in partial fulfillment of the requirements for the award of the degree of


BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING
to B. V. RAJU INSTITUTE OF TECHNOLOGY is a record of bonafide work
carried out during a period from January 2022 to June 2022 by them
under the guidance of Mrs. B. Usha Sri, Assistant Professor, CSE
Department.

This is to certify that the above statement made by the students is
correct to the best of my knowledge.

Mrs. B. Usha Sri


Assistant Professor

The Project Viva-Voce Examination of this team has been held on


___________________.

Dr. Ch. Madhu Babu


EXTERNAL EXAMINER Professor & HoD-CSE

CANDIDATE’S DECLARATION

We hereby certify that the work which is being presented in the


project entitled “Crop Yield Prediction Using Machine Learning
Algorithms” in partial fulfillment of the requirements for the award of
Degree of Bachelor of Technology and submitted in the Department of
Computer Science and Engineering, B. V. Raju Institute of Technology,
Narsapur is an authentic record of our own work carried out during a period
from August 2021 to January 2022 under the guidance of Mrs. B. Usha
Sri, Assistant Professor. The work presented in this project report has not
been submitted by us for the award of any other degree of this or any other
Institute/University.

Vemula Dayakar (18211A05V8)


Yervala Vamsi Krishna Reddy (18211A05X0)
Yalamanchili Jotsna Sri (18211A05W2)

ACKNOWLEDGEMENT
The success and final outcome of this project required a great deal of
guidance and assistance from many people, and we are extremely fortunate to
have received it throughout the project's completion. Whatever we have
accomplished is due to such guidance and assistance, and we would like to
thank them.

We thank Mrs. B. Usha Sri for guiding us and providing all the support
in completing this project. We are thankful to Mrs. T. Shilpa, our section
project coordinator, for supporting us in doing this project. We also extend
our utmost gratitude to Dr. Ch. Madhu Babu, Head of the CSE Department.

We are thankful to and fortunate enough to get constant encouragement,
support, and guidance from all the staff members of the CSE Department.

Vemula Dayakar (18211A05V8)


Yervala Vamsi Krishna Reddy (18211A05X0)
Yalamanchili Jotsna Sri (18211A05W2)

Crop Yield Prediction Using Machine Learning
Algorithms
ABSTRACT

In India, agriculture is a major contributor to the economy. Traditionally,
farmers cultivate specific crop types such as wheat, rice, mango, or
muskmelon based on conventional knowledge of their fields. But due to bad
weather conditions, farmers sometimes obtain a poor crop yield, and
sometimes they lose the entire crop. In the traditional process, only
experts can predict a suitable crop type from previous knowledge, and even
these predictions are sometimes wrong. Machine learning methodologies can
improve the accuracy of crop prediction, since modern machine learning
methods provide good results for such problems. We therefore implement ML
classification techniques such as KNN, RF, DT, SVM, GNB, GB, XGBoost, and a
Voting ensemble classifier to predict the crop type. For this work, we
collected a crop dataset with parameters such as temperature, humidity,
rainfall, and pH, which was used to train all the ML algorithms to obtain
the best-accuracy model. The experimental results show that the Voting
ensemble classifier provides the best accuracy, 94%, compared to the other
classifiers.

Key Words: KNN, RF, DT, SVM, GNB, GB, XGBoost, and Voting ensemble
classifiers

CONTENTS

Candidate’s Declaration i
Acknowledgement ii
Abstract iii
Contents
1. INTRODUCTION
1.1 Motivation 1
1.2 Problem Definition 2
1.3 Objective of Project 2
1.4 Limitations of Project 2
1.5 Organization of Documentation 3
2. LITERATURE SURVEY
2.1 Introduction 4
2.2 Existing System 5
2.3 Disadvantages of Existing system 5
2.4 Proposed System 5
3. ANALYSIS
3.1 Introduction 7
3.2 Software Requirement Specification 7
3.2.1 User requirements 7
3.2.2 Software requirements 8
3.2.3 Hardware requirements 10
3.3 Content Diagrams of Project 10
3.4 Algorithms and Flowcharts 10
4. DESIGN
4.1 Introduction 16
4.2 DFD / ER / UML diagram (any other project diagrams) 17
4.3 Module design and organization 23
5. IMPLEMENTATION & RESULTS
5.1 Introduction 26
5.2 Explanation of Key functions 26
5.3 Method of Implementation 28
5.3.1 Forms 28
5.3.2 Output Screens 29
5.3.3 Result Analysis 38
6. TESTING & VALIDATION
6.1 Introduction 45
6.2 Design of test cases and scenarios 45
6.3 Validation 47
7. CONCLUSION & FUTURE WORK 48
8. REFERENCES 49
Chapter 1
INTRODUCTION
Achieving maximum yield with limited land resources is the goal of
agricultural planning in an agro-based country. Earlier, farming predictions
were made based on a farmer's past experience with particular crops in a
particular field. Nowadays, as conditions change, there is a need for
advancement in farming activities, and farmers in rural areas are often not
aware of new crops and their benefits. The proposed system applies machine
learning and prediction algorithms to suggest the most suitable crops to
farmers. The aim of the system is to reduce losses due to drastic climatic
changes and to increase the yield rates of crops. The system integrates data
obtained from past predictions with the current weather and soil conditions,
so farmers get an idea of the list of crops that can be cultivated. Machine
learning methods such as SVM (Support Vector Machine) and linear regression
are widely used in such prediction techniques; they return the best crop for
cultivation under the current environmental conditions. The proposed system
considers past, current, and forecast rainfall amounts, as well as the type
of soil the farmer has. Based on these parameters, the suitable crops for
the given conditions are predicted using machine learning algorithms,
producing more accurate prediction results.

1.1 Motivation

In recent years, India has been shaken by economic and social forces related
to higher suicide rates among small and marginal farmers. Our aim is to
provide help and tools to assist such farmers and communities and to address
these problems. Generally, they face challenges in accessing and trusting
educational outreach and training that would help them better understand how
to increase crop yields and improve their financial standing.

1.2 Problem Definition

The problem statement revolves around the prediction of crop yield using
machine learning techniques. The goal of the project is to help users choose
a suitable crop to grow in order to maximize the yield and hence the profit.

1.3 Objective of Project

This project aims at predicting the crop yield under particular weather
conditions and thereby recommending suitable crops for that field. It
involves the following steps.
 Collect the weather data, crop yield data, soil type data and the
rainfall data and merge these datasets in a structured form and
clean the data.
 Perform Exploratory Data Analysis (EDA) that helps in analyzing the
complete dataset and summarizing the main characteristics.
 Divide the analysed crop data into training and testing sets and train
the model using the training data to predict the crop yield for given
inputs.
 Compare various Algorithms by passing the analysed dataset
through them and calculating the error rate and accuracy for each.
 Implement a system in the form of a mobile application and integrate
the algorithm at the back end.
 Test the implemented system to check for accuracy and failures.

1.4 Limitations of Project


The system's accuracy depends heavily on historical data: previous years'
production data is an essential input for predicting the current yield, so
predictions suffer where such records are missing or unreliable. That said,
integrating farming and machine learning can lead to further advancements in
agriculture by maximizing yield and optimizing the use of the resources
involved.

1.5 Organization of Documentation
Chapter 1: This chapter discusses the project's motivation, problem
definition, objective, and limitations, as well as the organization of this
document.

Chapter 2: This chapter reviews the existing system, explains its
shortcomings, and describes how those disadvantages are removed in the
proposed system.

Chapter 3: This chapter deals with the software requirements, algorithms,
and flowcharts related to the project.

Chapter 4: This chapter focuses on expressing the project's design
diagrammatically using UML and ER diagrams.

Chapter 5: This chapter describes a few major functions involved in the
project's implementation process, along with a brief explanation of the
implementation itself.

Chapter 6: This chapter covers the testing of test cases as well as the
project validation.

Chapter 7: This chapter discusses the project's conclusion and anticipated
future work.

Chapter 8: This chapter provides all of the references consulted during the
project's execution.

Chapter 2
LITERATURE SURVEY

2.1 Introduction
Monali Paul, Santosh K Vishwakarma, Ashok Verma [1]
This paper predicts the yield of crops by analyzing and categorizing them,
with the categorization performed using data mining algorithms such as KNN
and Naïve Bayes. This is helpful for developing our project with the
assistance of data mining.
Abdullah sodium, William patriarch, Ekaram Khan [2]
This paper presents a smartphone-based application that measures the pH
value, temperature, and moisture of soil in real time, assisting in the
remote analysis of soil through various techniques.

S. Nagini, Dr. T.V. Rajnikanth, B.V. Kiranmayee [3]

This paper presents an exploratory data analysis concerning the design of
various prediction models. Various regression techniques, such as linear
regression, are used to predict suitable crops; using various machine
learning algorithms, predictions are made for the crops most suitable for
the farmer.

Awanit Kumar, knife Kumar [4]

The paper proposed techniques for predicting the current year's crop
production. The system uses a rule-based prediction mechanism in which a set
of rules about the land, rainfall, and crops is applied to the farming data,
and K-means clustering can be used to analyze the dataset obtained.

Pooja More, Sachi Nene [5]
This paper uses advanced artificial neural network technology in conjunction
with machine learning algorithms such as SVM and linear regression to
predict the best-suited crops.

Rakesh Kumar I, M.P. Singh [6]

The paper proposed techniques for appropriate crop selection using CSM
(Crop Selection Method) and machine learning.

2.2 Existing System


Farmers traditionally predict the climate and rainfall on their own and
guess the yield, which may be right or wrong. Predictions made by machine
learning algorithms can help farmers come to a decision on which crop to
grow to get the most yield, considering factors like temperature, rainfall,
area, etc.

2.3 Disadvantages of Existing System


• It does not give accurate results.
• It does not help in making a decision about which crop to grow.
• It requires more time for processing the data.

2.4 Proposed System


Our proposed system is a mobile application that predicts the name of the
crop as well as calculating its corresponding yield. The name of the crop is
determined by features such as temperature, humidity, wind speed, and
precipitation, and the yield is determined by the area and production. In
this project, the Random Forest classifier is employed for prediction, along
with KNN, SVM, and XGBoost, in order to attain crop prediction with the most
accurate values.

Advantages:

1. Predicting the productivity of crops in various weather conditions can
   help farmers and other stakeholders in critical decision-making with
   regard to scientific agriculture and commodity choices.
2. This model can be used to choose the most suitable crops for the region,
   and also their yield, thereby improving the value and profit of farming.

Chapter 3
ANALYSIS

3.1 Introduction
Achieving maximum yield with limited land resources is the goal of
agricultural planning in an agro-based country. Earlier, farming predictions
were made based on a farmer's past experience with particular crops in a
particular field. Nowadays, as conditions change, there is a need for
advancement in farming activities, and farmers in rural areas are often not
aware of new crops and their benefits. The proposed system applies machine
learning and prediction algorithms to suggest the most suitable crops to
farmers. The aim of the system is to reduce losses due to drastic climatic
changes and to increase the yield rates of crops. The system integrates data
obtained from past predictions with the current weather and soil conditions,
so farmers get an idea of the list of crops that can be cultivated. Machine
learning methods such as SVM (Support Vector Machine) and linear regression
are widely used in such prediction techniques; they return the best crop for
cultivation under the current environmental conditions. The proposed system
considers past, current, and forecast rainfall amounts, as well as the type
of soil the farmer has. Based on these parameters, the suitable crops for
the given conditions are predicted using machine learning algorithms,
producing more accurate prediction results.

3.2 Software Requirements Specifications


3.2.1 User requirements
The user must possess strong cognitive abilities and be able to
comprehend the various elements involved in the data mining process.
Operating System: Windows
Language: Python 3

3.2.2 Software requirements
To demonstrate the proposed system, we built a web-based
application, chosen to give a better visual experience. The major
development goals of this system are to implement the feature
analysis results and the classification analysis. To fulfill these
requirements we chose the Python programming language, because it
offers extensive API support for developing machine learning and
deep learning algorithms, and many web framework tools are also
built on top of it. We chose the Flask web framework for the
application layer. For storing and handling the
application-generated data, the application is integrated with a
database server; for this we chose the MySQL database server. The
software set up for developing the Flask web application is
described below.
3.2.2.1 Python
Python is an open-source, general-purpose language. Using Python we
can develop machine learning and artificial-intelligence-based
applications with or without a GUI, and operating-system-based,
mobile-based, and web-based applications can also be developed. In
this system there were two major reasons to choose the Python
programming language: first, the need to implement the
classification algorithms, and second, the need to implement the
web-based application. Python has strong API support for
implementing classification algorithms. Python also has two widely
used third-party web frameworks: Django, which follows the Model
View Template (MVT) design pattern (similar to MVC, Model View
Controller), and Flask, a lightweight micro-framework, which is the
one used here. We installed the Python 3.8 software to get the
Python environment on a Windows 10 operating system. Python is an
object-oriented language with straightforward syntax, so it can be
learned easily and quickly.
3.2.2.2 MySQL
All web frameworks support connecting to a database server,
typically through internal interfaces. Communication between the
web server and the database server happens with SQL queries, and
using raw SQL queries makes the data-access code dependent on the
particular database server. Flask, through extensions such as
SQLAlchemy, supports Object-Relational Mapping (ORM) for database
management. The ORM takes care of converting data between the RDBMS
and a programming language like Python, so developers need not
focus on writing low-level SQL queries; using object-oriented
concepts such as methods, constructors, and objects, data is stored
and manipulated in the database server. ORM libraries support the
major database servers, such as MySQL 8.0 and PostgreSQL. From
these, we chose the MySQL 8.0 server to integrate with the Flask
application for storing and manipulating the data.
3.2.2.3 PyCharm IDE
For a better code-writing experience, we need full Integrated
Development Environment (IDE) software that understands Python
syntax. Depending on the programming language in use, plug-ins may
need to be installed to get services such as identification of
compile errors, suggestions, and execution. For this project we
chose the PyCharm IDE because its Community edition is open-source,
lightweight software well suited to Python development, and we
needed to write Python code as well as HTML code.

3.2.3 Hardware requirements

• Processor: any processor above 500 MHz
• RAM: 512 MB and above
• Hard disk: 100 GB and above
• Input devices: standard keyboard and mouse
• Output device: VGA or high-resolution monitor

3.3 Content Diagram of Project

Figure-2 Content Diagram

3.4 Algorithms and FlowCharts


Algorithms used:

 Support Vector Machine


 Random Forest
 Naïve Bayes
 Decision Tree
 K-Nearest Neighbor
 Logistic Regression

Support Vector Machine (SVM):

Support Vector Machine is an extremely popular supervised machine learning
technique (having a pre-defined target variable) which can be used as a
classifier as well as a predictor. For classification, it finds a hyper-plane in the
feature space that differentiates between the classes. An SVM model
represents the training data points as points in the feature space, mapped in
such a way that points belonging to separate classes are segregated by a
margin as wide as possible. The test data points are then mapped into that
same space and are classified based on which side of the margin they fall.
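As a minimal sketch of the idea above, the following example trains scikit-learn's SVC on a tiny made-up dataset (the feature values and labels here are hypothetical, not taken from the project's real data):

```python
from sklearn.svm import SVC

# Hypothetical samples: [temperature, humidity, pH, rainfall]
X_train = [[25.0, 80.0, 6.5, 200.0],
           [30.0, 40.0, 7.5, 60.0],
           [22.0, 85.0, 6.0, 250.0],
           [33.0, 35.0, 8.0, 40.0]]
y_train = ["rice", "wheat", "rice", "wheat"]

# A linear kernel finds a separating hyper-plane with the widest margin
model = SVC(kernel="linear")
model.fit(X_train, y_train)

# A test point is mapped into the same feature space and classified
# by which side of the margin it falls on
print(model.predict([[24.0, 82.0, 6.3, 230.0]]))
```
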

Random Forest:

In our experiment, we use random forest as a classifier. The popularity of
decision tree models in data mining is owed to their algorithmic simplicity
and their flexibility in handling different data attribute types. However, a
single-tree model is sensitive to the specific training data and easy to
overfit. Ensemble methods solve these problems by combining a group of
individual decisions in some way, and are more accurate than single
classifiers. Random forest, one of the ensemble methods, is a combination of
multiple tree predictors such that each tree depends on a random independent
dataset and all trees in the forest are of the same distribution. The
capacity of a random forest depends not only on the strength of the
individual trees but also on the correlation between them: the stronger the
individual trees and the lower the correlation between different trees, the
better the performance of the random forest. The variation between trees
comes from their randomness, which involves bootstrapped samples and
randomly selected subsets of the data attributes.

Below is a step-by-step outline of a Python implementation:
Step 1: Import and print the dataset.
Step 2: Select the feature columns of the dataset as x and the target column as y.
Step 3: Fit the random forest model to the dataset.
Step 4: Predict a new result.
Step 5: Visualise the result.
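The steps above can be sketched with scikit-learn. This is an illustrative example on made-up data; since the project uses random forest as a classifier, RandomForestClassifier stands in here, and step 5 inspects feature importances rather than plotting:

```python
from sklearn.ensemble import RandomForestClassifier

# Steps 1-2: a tiny hypothetical dataset; features x, target labels y
# (feature order assumed: temperature, humidity, pH, rainfall)
x = [[20.0, 90.0, 5.5, 300.0],
     [35.0, 30.0, 8.0, 50.0],
     [21.0, 88.0, 5.8, 280.0],
     [34.0, 32.0, 7.8, 45.0]]
y = ["rice", "millet", "rice", "millet"]

# Step 3: fit the forest (100 bootstrapped trees)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(x, y)

# Step 4: predict a new result
print(clf.predict([[22.0, 87.0, 5.7, 290.0]]))

# Step 5: inspect which features drove the trees' splits
print(clf.feature_importances_)
```
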

NAÏVE BAYES ALGORITHM:

The Naïve Bayes classifier is based on Bayes' theorem with a strong
independence assumption; it is also known as an independent-feature model.
It assumes that the presence or absence of a particular feature of a class
is unrelated to the presence or absence of any other feature in that class.
A Naïve Bayes classifier can be trained in a supervised learning setting and
uses the method of maximum likelihood. It works in complex real-world
situations and requires only a small amount of training data to estimate the
parameters needed for classification: only the variance of each variable per
class needs to be determined, not the entire covariance matrix. Naïve Bayes
is mainly used when the number of inputs is high, and the probability of
each input attribute is computed for the predicted state. Many machine
learning and data mining methods are based on Naïve Bayes classification.
Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)

 where P(H|X) is the posterior probability of H conditioned on X
 P(X|H) is the likelihood of X conditioned on H
 P(H) is the prior probability of H, and P(X) is the prior probability of X
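A quick numeric illustration of the theorem, with made-up probabilities (the values are purely illustrative):

```python
# Bayes' theorem with illustrative numbers, not taken from the project's data
p_h = 0.3          # prior probability P(H)
p_x_given_h = 0.8  # likelihood P(X|H)
p_x = 0.5          # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x  # posterior P(H|X), approx. 0.48
print(p_h_given_x)
```
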

Decision Tree:
The decision tree algorithm belongs to the family of supervised learning
algorithms. Unlike some other supervised learning algorithms, the decision
tree algorithm can be used for solving both regression and classification
problems. The general motive of using a decision tree is to create a
training model which can be used to predict the class or value of target
variables by learning decision rules inferred from prior data (training
data).
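A minimal sketch of learning such decision rules with scikit-learn, using one hypothetical feature (rainfall) so the learned rules stay readable:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: rainfall only, with made-up labels
X = [[150.0], [50.0], [160.0], [40.0]]
y = ["rice", "gram", "rice", "gram"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The decision rules inferred from the training data
print(export_text(tree, feature_names=["rainfall"]))
print(tree.predict([[155.0]]))
```
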

K-Nearest Neighbour (KNN):

KNN is a lazy supervised learning algorithm: training is trivial, but it
takes more time at classification. Like other algorithms, classification is
divided into two steps, training from data and testing on a new instance.
The working principle of K-Nearest Neighbour is based on assigning a weight
to each data point, which is called a neighbour. In KNN, the distance from
the query point to each point in the training dataset is calculated, and
classification is then done on the basis of a majority vote among the K
nearest data points. Three types of distance can be measured in KNN
(Euclidean, Manhattan, and Minkowski), of which Euclidean is the most
commonly used; the following formula is used to calculate the distance.

The algorithm for KNN is defined in the steps given below:

1. D represents the samples used in training, and k denotes the number of
   nearest neighbours.
2. Create a super class for each sample class.
3. Compute the Euclidean distance to every training sample.
4. Based on the majority class among the neighbours, classify the sample.

K- Nearest Neighbour
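The steps above can be sketched directly in Python with a small self-contained implementation (toy data with hypothetical feature values; distance is Euclidean and classification is by majority vote):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote of its k nearest training points."""
    # Step 3: Euclidean distance from x_new to every training sample
    d = np.linalg.norm(np.asarray(X_train, dtype=float)
                       - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]                   # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)  # Step 4: majority vote
    return votes.most_common(1)[0][0]

# Hypothetical samples: [temperature, humidity]
X = [[25, 80], [26, 82], [35, 30], [34, 28]]
y = ["rice", "rice", "wheat", "wheat"]
print(knn_predict(X, y, [24, 79], k=3))
```
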

Logistic regression:

Logistic regression is a statistical method for analyzing a data set in
which there are one or more independent variables that determine an
outcome. The outcome is measured with a dichotomous variable (one with only
two possible outcomes). For cases with more than two labels, a strategy
called "one versus all" is used: every category is binary-classified against
its inverse (a fictional category stating that the example does not belong
to the current category), and the category with the highest score is picked
as the result of classification. Logistic regression is one of the simplest
machine learning techniques. It is easy to implement and easy to interpret,
so it is usually a good idea to implement a logistic regression classifier
before proceeding with a more complex approach, because it gives an estimate
of how well machine learning algorithms will perform on the specific task.
It also helps to eliminate some basic implementation bugs regarding data set
treatment.
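A hedged sketch of the "one versus all" strategy, wrapping a binary logistic regression with scikit-learn's OneVsRestClassifier (the 2-D points and labels are placeholders):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 2-D points forming three well-separated groups
X = [[0.0, 0.0], [1.0, 0.0],    # class "a"
     [9.0, 0.0], [10.0, 0.0],   # class "b"
     [5.0, 8.0], [5.0, 9.0]]    # class "c"
y = ["a", "a", "b", "b", "c", "c"]

# One binary logistic regression per class, scored against "the rest";
# the class whose classifier gives the highest score wins
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict([[0.5, 0.0], [9.5, 0.0], [5.0, 8.5]]))
```
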

Flowcharts

Figure-3 Flow Chart

Chapter 4
DESIGN

4.1 Introduction
Design is concerned with identifying software components and defining their
interactions, defining the structure of the solution and providing a plan
for it. One of the desirable qualities of large systems is modularity. It denotes
that the system is separated into multiple components. The interaction
between pieces is minimally stated in this manner. The design phase's goal is
to devise a strategy for resolving the problem identified in the requirement
document. This is the initial stage in transitioning from the problem to the
solution domain. The design of a system is the most important aspect
impacting software quality, and it has a significant impact on following
phases, especially testing and maintenance. The design document is the
result of this step. This document serves as a solution's blueprint or plan, and
it will be used during implementation, testing, and maintenance.

The design process is frequently split into two phases: system design and
detail design. The goal of system design, also known as top-level design, is to
identify the modules that should be included in the system, their
specifications, and how they interact with one another to create the intended
results. All of the primary data structures, file formats, and output formats,
as well as the major modules in the system and their specifications, are
decided at the end of system design.

A module's logic is frequently specified in a high-level design description


language that is unrelated to the target language in which the software will
be built. The focus of the system design is on identifying the modules, whereas
the focus of the detailed design is on defining the logic for each of the modules.
In other words, system design focuses on what components are required,
whereas detailed design focuses on how the components can be implemented
in software.

4.2 DFD/ER/UML diagram (any other project
diagrams)

UML DIAGRAMS
This chapter describes the proposed system's functionalities and
responsibilities with a use case diagram. A use case diagram is a graphical
representation of the interactions of the actors with the system. The
project's use case diagram is visually described in Figure-4. In this
system, two actors are defined. Those are,
1. Admin
2. User
Let us discuss the actors, their responsibilities, and their operations here.
Actor: Admin
Name Admin of the system

Actor Admin
Description Admin is a super user of the system, after successful login
admin can perform various operations in the admin
portal. Mainly, upload the dataset, Features calculations,
implementation of classification algorithms on features
dataset, and evaluation. Based on the dataset which is
uploaded by the admin, users can see the product
details.

Table 4.1: Description of Admin actor

Actor: User
Name User of the system

Actor User
Description The end user will create an account and log in to the
application using his/her login credentials to predict the
crop based on the trained model.

Table 4.2: Description of User actor

Use Case Diagram


The primary form of system/software requirements for an undeveloped
software programme is a UML use case diagram. Use cases specify the intended
behaviour (what), not the actual technique of achieving it (how). Once
defined, use cases can be represented both textually and visually (i.e., as
a use case diagram). A core aspect of use case modelling is that it aids in
the design of a system from the standpoint of the end user. It is a good way
to communicate system behaviour to users in their own words by defining
every externally apparent system activity.

Figure-4 Use Case Diagram

Data Flow Diagram:

1. A bubble chart is another name for a DFD. It is a basic graphical
   formalism that can be used to depict a system in terms of data intake,
   data processing, and data output, at any level of abstraction.

2. The data flow diagram (DFD) is one of the most essential modelling
   tools. It is used to represent the many components of the system: the
   system's processes, the data that the processes use, the external
   entities that interact with the system, and the information flows within
   the system.

3. The DFD depicts the flow of data through the system and how it is
   altered by a series of transformations. It is a graphical representation
   of data flow and the transformations that occur as data goes from input
   to output.

4. A DFD can be divided into levels, each representing a different level of
   information flow and functional detail.

[Diagram: after Login and Verification (valid/invalid), the Admin can Upload
Dataset, View Dataset, and run Model and Performance Evaluations; the User
can run Crop Prediction and Logout.]

Figure-5 Data Flow Diagram

Class diagram:

The class diagram is used to further enhance the use case diagram and
establish the system's detailed design. The class diagram divides the players
in the use case diagram into a group of related classes. The relationship
between the classes might be either "is-a" or "has-a." Each class in the class
diagram may be able to perform specific functions. The class's functions are
referred to as "methods." Aside from that, each class may have certain
"attributes" that help to distinguish it from others.

Figure-6 Class Diagram

Collaboration diagram:

A collaboration diagram groups together the interactions between different
objects. The interactions are listed as numbered interactions that help to
trace the sequence of the interactions. The collaboration diagram helps to
identify all the possible interactions that each object has with other
objects.

Figure-7 Collaboration Diagram

4.3 Module design and organization
Dataset Collection:
In this system, we use a crop-type prediction dataset imported from internet
resources. This dataset contains four independent attributes (temperature,
humidity, pH, and rainfall) and one dependent or target attribute, the crop
label (rice, papaya, coconut, groundnut, etc.). Figure 3.2 depicts the crop
dataset, which contains 5 columns and 3100 records; the target class (crop
type) has 31 labels.

Figure.3.2 Crop dataset Sample

Table.1 shows the 31 predicted crop types:

SI.NO  Crop Type     SI.NO  Crop Type     SI.NO  Crop Type
1      Adzuki Beans  14     Maize         27     Sugarcane
2      Apple         15     Mango         28     Tea
3      Banana        16     Millet        29     Tobacco
4      Black Gram    17     Moth Beans    30     Watermelon
5      Chickpea      18     Mung Bean     31     Wheat
6      Coconut       19     Muskmelon
7      Coffee        20     Orange
8      Cotton        21     Papaya
9      Grapes        22     Peas
10     Ground Nut    23     Pigeon Peas
11     Jute          24     Pomegranate
12     Kidney Beans  25     Rice
13     Lentil        26     Rubber

Data Preprocessing:
The crop dataset arrives as raw data that ML algorithms cannot consume directly. In the data preprocessing stage, the system reads the data from the .csv file and converts it into a data frame. It then checks whether the dataset contains any missing values such as question marks, special characters, or nulls. The selected dataset does not contain any missing values.
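The loading and missing-value check described above can be sketched as follows; the inline sample data and column names are stand-ins for illustration, not the project's actual file:

```python
import io

import pandas as pd

# A small stand-in for the crop .csv file (the real dataset has 3100 records).
csv_text = """temperature,humidity,ph,rainfall,label
23.0,82.3,6.5,202.9,rice
26.5,80.1,6.8,242.8,rice
27.0,94.0,6.2,150.0,coconut
"""

# Read the raw .csv into a DataFrame, as the preprocessing stage does.
df = pd.read_csv(io.StringIO(csv_text))

# Check for missing values; true nulls are reported by isnull(),
# while question marks or special characters would appear as strings.
missing_per_column = df.isnull().sum()
print(missing_per_column.sum())  # 0 — no missing values in this sample
```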
Train and Test Split:
After preprocessing, the crop dataset is split in an 80:20 ratio. Using the train_test_split() method, the system produces a training set of 2480 records (80%) and a testing set of 620 records (20%).
Training the models:
After splitting the dataset into training and testing sets, the system trains the ML models by invoking the fit() method with the independent variables x_train and the target column values y_train of the training dataset as input parameters.
Predicting the model:
The trained model then predicts the crop names via the predict() method, which takes the testing data x_test as input and returns the list of predicted crop names.
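The split, fit, and predict steps above can be sketched with scikit-learn; the synthetic feature values and the RandomForestClassifier stand-in are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the crop dataset:
# columns = temperature, humidity, pH, rainfall.
X = rng.uniform(low=[10, 20, 4, 50], high=[40, 100, 9, 300], size=(100, 4))
y = rng.choice(["rice", "coconut", "wheat"], size=100)

# 80:20 split, as in the report (3100 -> 2480 train / 620 test).
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(x_train, y_train)        # train on x_train / y_train
predicted = model.predict(x_test)  # array of predicted crop names

print(len(x_train), len(x_test))   # 80 20
```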
ML Evaluations:
In the ML evaluations, the system compares the performance of all the ML techniques. Taking the predicted values and the actual values as input parameters, it calculates the performance metrics: accuracy score, precision, recall, F1-score, MCC, and Kappa score.
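The metric computations can be sketched with scikit-learn's metrics module; the toy label vectors below are assumptions for illustration:

```python
from sklearn.metrics import (
    accuracy_score, cohen_kappa_score, f1_score,
    matthews_corrcoef, precision_score, recall_score,
)

# Toy actual vs. predicted crop labels for illustration.
actual    = ["rice", "wheat", "rice", "coconut", "wheat", "rice"]
predicted = ["rice", "wheat", "wheat", "coconut", "wheat", "rice"]

metrics = {
    "accuracy":  accuracy_score(actual, predicted),
    "precision": precision_score(actual, predicted, average="macro"),
    "recall":    recall_score(actual, predicted, average="macro"),
    "f1":        f1_score(actual, predicted, average="macro"),
    "mcc":       matthews_corrcoef(actual, predicted),
    "kappa":     cohen_kappa_score(actual, predicted),
}
print(round(metrics["accuracy"], 3))  # 0.833 (5 of 6 correct)
```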
Crop Prediction:
From the ML evaluations, the system selects the best classification model based on the performance metrics. According to the experimental results, the voting classifier was selected as the best model, predicting with 94.67% accuracy. When the user enters testing data such as temperature, humidity, pH, and rainfall, the system predicts a crop name such as rice, papaya, or watermelon for the given data point.
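A soft-voting ensemble of this kind could be assembled as sketched below; the member estimators shown are assumptions, and the report's actual ensemble composition may differ:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.uniform(low=[10, 20, 4, 50], high=[40, 100, 9, 300], size=(60, 4))
y = rng.choice(["rice", "papaya", "watermelon"], size=60)

# Soft voting averages the predicted class probabilities of the members.
vtc = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("gnb", GaussianNB()),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
    ],
    voting="soft",
)
vtc.fit(X, y)

# A single user-entered data point: temperature, humidity, pH, rainfall.
sample = [[27.0, 85.0, 6.5, 210.0]]
print(vtc.predict(sample)[0])
```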

Chapter 5
IMPLEMENTATION & RESULTS

5.1 Introduction
Our initiative primarily aims to determine which crop is best suited to the given field conditions. Machine learning algorithms play a critical role in building the prediction model, since they can tackle complicated and non-linear problems. A crop prediction model requires variables that can be selected from diverse datasets and that describe the growing conditions as precisely as feasible. The data will be divided into two categories: test data and train data. We then read the files, import the modules, clean the data, visualize it, and run machine learning algorithms on it to make the appropriate predictions.

5.2 Explanation of Key functions


Numpy
Numpy is an array-processing package that can be used for a variety of tasks. It includes a high-performance multidimensional array object as well as utilities for manipulating such arrays.
It is the most important Python package for scientific computing. Its features include:
• Useful linear algebra, Fourier transform, and random number capabilities
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
Beyond its scientific applications, Numpy can be used as an efficient multi-dimensional container of generic data. Numpy allows arbitrary data types to be defined, letting it connect with a wide range of databases seamlessly and quickly.
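The broadcasting feature mentioned above can be shown in a few lines:

```python
import numpy as np

# A 3x1 column broadcast against a length-4 row yields a 3x4 grid
# without any explicit loops.
col = np.array([[1], [2], [3]])
row = np.array([10, 20, 30, 40])
grid = col + row

print(grid.shape)   # (3, 4)
print(grid[2, 3])   # 3 + 40 = 43
```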

Pandas
Pandas is a free Python library that uses powerful data structures to provide high-performance data manipulation and analysis. Python had previously been used mainly for data preprocessing and munging and contributed little to data analysis itself; Pandas solved this problem. Regardless of the source of the data, we can use Pandas to complete the five common phases of data processing and analysis: load, prepare, manipulate, model, and analyse. Python and Pandas are used in a variety of sectors, including academic and business domains such as finance, economics, statistics, and analytics.

Matplotlib

Matplotlib is a Python 2D plotting package that generates high-quality figures in a range of hardcopy and interactive formats across platforms. It may be used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web application servers, and graphical user interface toolkits. Matplotlib aims to make simple things simple and difficult things feasible. With just a few lines of code, you can create plots, histograms, power spectra, bar charts, error charts, scatter plots, and more. See the sample plots and thumbnail galleries for examples.

The pyplot package, especially when paired with IPython, provides a MATLAB-like interface for easy graphing. Power users may adjust line styles, font settings, axis properties, and more via an object-oriented interface or a command-line interface.
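As a sketch, an accuracy-comparison bar chart like Figure 7 could be produced from the values in Table 5.3.3.1 with a few lines of pyplot:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, not a window
import matplotlib.pyplot as plt

# Accuracy values taken from Table 5.3.3.1.
models = ["RF", "DTC", "K-NN", "SVM", "XGBoost", "GB", "GNB", "VTC"]
accuracy = [82.90, 83.87, 85.16, 67.58, 93.87, 92.74, 93.38, 94.67]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(models, accuracy)
ax.set_xlabel("Model")
ax.set_ylabel("Accuracy (%)")
ax.set_title("Accuracy of all models")
fig.savefig("accuracy_all_models.png")
```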

Scikit – learn
Scikit-learn is a Python library that offers a uniform interface to a variety of supervised and unsupervised learning techniques. It is licensed under a permissive simplified BSD licence and is shipped with several Linux distributions, making it suitable for academic and commercial use.

df.isnull(): The isnull() function detects missing values in the given series object. It returns a boolean same-sized object indicating whether the values are NA. Missing values get mapped to True and non-missing values to False.

df.isnull().sum(): Detects missing values in the given series object and displays the per-column count of missing values.

df.shape: Gives the number of rows and columns.

fillna(): The fillna() function is used to fill NA/NaN values using the specified method. The value used to fill holes (e.g. 0) may alternately be a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame).

fit(x, y): Trains an estimator on the feature matrix x and the target vector y.

print(): Prints the given statement or value to the console.
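The helpers above can be exercised on a tiny frame; the column names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temperature": [23.0, np.nan, 27.0],
    "rainfall":    [202.9, 242.8, np.nan],
})

print(df.shape)                 # (3, 2)
print(df.isnull().sum().sum())  # 2 missing values in total

# Fill the holes with 0, as fillna() allows.
filled = df.fillna(0)
print(filled.isnull().sum().sum())  # 0
```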

5.3 Method of Implementation
5.3.1 Forms
Python
Python is an interpreted high-level programming language for general-
purpose programming. Created by Guido van Rossum and first released
in 1991, Python has a design philosophy that emphasizes code readability,
notably using significant whitespace.
Python features a dynamic type system and automatic memory
management. It supports multiple programming paradigms, including
object-oriented, imperative, functional and procedural, and has a large
and comprehensive standard library.
Python is Interpreted − Python is processed at runtime by the interpreter.
You do not need to compile your program before executing it. This is
similar to PERL and PHP.
Python is Interactive − you can actually sit at a Python prompt and interact
with the interpreter directly to write your programs.
Python also acknowledges that speed of development is important. Readable and terse code is part of this, and so is access to powerful constructs that avoid tedious repetition of code. Maintainability also ties into this: lines of code may be an all but useless metric, but it does say something about how much code you have to scan, read and/or understand to troubleshoot problems or tweak behaviours.
This speed of development, the ease with which a programmer of other languages can pick up basic Python skills, and the huge standard library are key to another area where Python excels.

Python's Benefits Compared to Other Programming Languages


1. Coding is simplified
When compared to other languages, almost all actions performed in
Python require far less coding. Python also has excellent standard library
support, so you won't need to rely on third-party libraries to complete your
task. Many people recommend Python to beginners because of this.
2. It is inexpensive.
Python is free, therefore anyone, whether an individual, a small business,
or a large corporation, can use it to create applications. Python is

extensively used and popular, therefore you'll get more community
support.
All its tools are quick to implement and save a lot of time; several of them have later been patched and updated by people with no Python background, without breaking.
3. Python is a Language for Everyone
Python code may execute on any platform, including Linux, Mac OS X,
and Windows. Programmers must learn many languages for various roles,
but Python allows you to create sophisticated web apps, perform data
analysis and machine learning, automate tasks, scrape the web, and
create games and amazing visualizations. It is a programming language
that can be used in a variety of situations.

5.3.2 Output Screens


Upload Dataset:
Here we collected the dataset from different resources, chose the best dataset for the prediction, and uploaded the dataset as the initial step.

Preprocessing Dataset:
Once the dataset is uploaded, we preprocess it so that all the null values and missing values are removed.

After preprocessing, we trained the dataset with different algorithms.

SVM Algorithm
Steps:
 Explore the data to figure out what they look like
 Pre-process the data
 Split the data into attributes and labels
 Divide the data into training and testing sets
 Train the SVM algorithm
 Make some predictions
 Evaluate the results of the algorithm
The linear kernel is one of the simplest kernel functions.
Because the majority of text classification problems are linearly separable, linear kernels are frequently recommended for text classification.
When there are many features, as in text classification problems, the linear kernel performs exceptionally well. Linear kernel functions are faster than most others, and there are fewer parameters to tune.
The linear kernel's decision function is:
f(X) = w^T * X + b
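A minimal linear-kernel SVM following the steps above can be sketched as follows (the actual SVM.py snippet appears as an image in the report; the synthetic data and labelling rule are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(2)

# Synthetic features: temperature, humidity, pH, rainfall.
X = rng.uniform(low=[10, 20, 4, 50], high=[40, 100, 9, 300], size=(80, 4))
# A simple linearly separable rule for illustration: high rainfall -> "rice".
y = np.where(X[:, 3] > 175, "rice", "wheat")

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Linear kernel: f(X) = w^T * X + b
svm = SVC(kernel="linear")
svm.fit(x_train, y_train)
predictions = svm.predict(x_test)
print(svm.score(x_test, y_test))
```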

SVM.py

Code Snippet of SVM


KNN Algorithm:
Steps:
 Step 1: We've decided on the number K of neighbours.
 Step 2: We then estimated the Euclidean distance between the K neighbours.
 Step 3: Using the obtained Euclidean distance, find the K = 3 closest neighbours.
 Step 4: We counted the number of data points in each category among these K neighbours.
 Step 5: Finally, allocate the new data points to the category with the highest number of neighbours.
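The five steps can be sketched with scikit-learn's KNeighborsClassifier on a tiny toy set (the 2-D points and labels are assumptions; K = 3 as in Step 3):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny 2-D toy set: two clusters of labelled points.
X = np.array([[1, 1], [1, 2], [2, 1],      # crop "A"
              [8, 8], [8, 9], [9, 8]])     # crop "B"
y = np.array(["A", "A", "A", "B", "B", "B"])

# Step 1: choose K = 3 neighbours; Steps 2-5 (Euclidean distance,
# nearest neighbours, majority vote) happen inside fit/predict.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X, y)

new_point = [[2, 2]]
print(knn.predict(new_point)[0])  # "A" — closest to the first cluster
```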

KNN.py

Code Snippet of KNeighbors Classifier

Decision Tree Algorithm

Steps:
 Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
 Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
 Step-3: Divide S into subsets that contain possible values for the best attributes.
 Step-4: Generate the decision tree node, which contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is called a leaf node.
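The recursive procedure above is what scikit-learn's DecisionTreeClassifier performs internally; here Gini impurity serves as the attribute selection measure, and the synthetic data and labelling rule are assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.uniform(low=[10, 20, 4, 50], high=[40, 100, 9, 300], size=(80, 4))
y = np.where(X[:, 0] > 25, "maize", "wheat")  # simple rule for illustration

# criterion="gini" is one common Attribute Selection Measure (ASM).
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

print(tree.get_depth())  # at least 1: the tree actually split
print(tree.score(X, y))  # 1.0 on this separable toy data
```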

Database Connection Code Snippet:

DBConfig.py

Code Snippet of GaussianNB:
GNB.py
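GNB.py appears as an image in the report; a hedged sketch of what a Gaussian Naive Bayes training step typically looks like, on synthetic crop features:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(4)
# Synthetic features: temperature, humidity, pH, rainfall.
X = rng.uniform(low=[10, 20, 4, 50], high=[40, 100, 9, 300], size=(60, 4))
y = np.where(X[:, 1] > 60, "banana", "chickpea")  # illustrative rule

# GaussianNB models each feature as normally distributed per class.
gnb = GaussianNB()
gnb.fit(X, y)
preds = gnb.predict(X)
print(len(preds))  # 60
```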

Code Snippet of Random Forest Classifier:

RFC.py
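RFC.py is likewise an image in the report; a sketch of a Random Forest training step with per-feature importances, where the parameter values, data, and labelling rule are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
features = ["temperature", "humidity", "ph", "rainfall"]
X = rng.uniform(low=[10, 20, 4, 50], high=[40, 100, 9, 300], size=(80, 4))
y = np.where(X[:, 3] > 175, "jute", "millet")  # illustrative rule

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X, y)

# Importances sum to 1; rainfall should dominate under this toy rule.
importances = dict(zip(features, rfc.feature_importances_))
print(max(importances, key=importances.get))  # rainfall
```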

Figure 1: Home Screen of the application

Figure 2: Admin Login Screen of the application

Figure 3: Dataset Upload Screen in the application

Figure 4: Enter Input Attributes Screen in the application

Figure 5: Screen with Input Attributes in the application

Figure 6: Result Screen

5.3.3 Result Analysis


Table.5.3.3.1 Performance metrics of the ensemble machine learning classifiers
Algorithm Accuracy Precision Recall F1-Score MCC Kappa
RF 82.90 84.47 81.07 79.11 82.43 82.29
DTC 83.87 79.62 81.82 79.30 83.46 83.29
K-NN 85.16 85.34 83.99 83.40 84.71 84.64
SVM 67.58 68.93 66.24 62.23 66.74 66.46
XGBoost 93.87 93.33 93.67 93.35 93.66 93.65
GB 92.74 92.39 92.47 92.22 92.50 92.48
GNB 93.38 93.08 93.16 92.70 93.18 93.15
VTC 94.67 94.17 94.26 93.98 94.50 94.49

From Table.5.3.3.1, the Voting ensemble classifier gave the best accuracy, 94.67 percent, compared to the other ML classifiers.

Figure 7: Accuracy of all models

From Figure 7, the voting classifier gave the highest accuracy, 94.67 percent, compared to the other ML models.

Figure.8 Precision of all models

From Figure.8, the voting classifier gave the highest precision, 94.17 percent, compared to the other ML models.

Figure.9 Recall of all models

From Figure.9, the voting classifier gave the highest recall score, 94.26 percent, compared to the other ML models.

Figure.10 F1-Score of all models

From Figure.10, the voting classifier gave the highest F1-score, 93.98 percent, compared to the other ML models.

Figure.11 MCC Score of all models

From Figure.11, the voting classifier gave the highest MCC score, 94.50 percent, compared to the other ML models.

Figure 12 Kappa Score of all models

From Figure 12, the voting classifier gave the highest Kappa score, 94.49 percent, compared to the other ML models.

Chapter 6

TESTING & VALIDATION

6.1 Introduction
The goal of testing is to find mistakes. Testing is the practise of attempting
to find all possible flaws or weaknesses in a work product. It allows you to
test the functionality of individual components, subassemblies, assemblies,
and/or a finished product. It is the process of testing software to ensure
that it meets its requirements and meets user expectations, and that it does
not fail in an unacceptable way. There are many different types of tests.
Each test type is designed to fulfil a distinct testing need.

6.2 Design of test cases and scenarios


Unit Testing
Unit testing is usually conducted as part of a combined code and unit test phase of the software lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct phases. Unit testing entails the creation of test cases to ensure that the program's underlying logic is working properly and that programme inputs result in legitimate outputs. Validation should be performed on all decision branches as well as the internal code flow. It is the testing of the application's separate software components, done after each individual unit is completed and before integration. This is an invasive structural test that requires knowledge of the code's construction. Unit tests are used to test a business process, application, or system configuration at the component level. Unit tests guarantee that each step of a business process follows the published specifications and has clearly defined inputs and outputs.
Functional test:
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key functions, or special test cases. In addition, systematic coverage pertaining to identified business process flows, data fields, predefined processes, and successive processes must be considered for testing. Before functional testing is complete, additional tests are identified and the effective value of the current tests is determined.

System Test

System testing guarantees that the complete integrated software system complies with specifications. It checks a configuration to guarantee that the results are known and predictable. Configuration-oriented system integration testing is a kind of system testing. System testing is based on process descriptions and flows, with an emphasis on pre-driven process connections and integration points.

White Box Testing

White Box Testing is a type of software testing in which the software tester is familiar with the software's inner workings, structure, and language, or at the very least its purpose. It is used to test regions that are not accessible at the black box level.

Black Box Testing

Testing software without knowing the inner workings, structure, or language of the module being tested is known as black box testing. Black box tests, like most other types of tests, require a definite source document, such as a specification or requirements document. It is a type of testing in which the programme being tested is treated as a black box: it is impossible to "look" into it. The test supplies inputs and checks outputs without taking into account how the software functions.

Integration testing

Integration tests are used to see if two or more software components can work together as a single application. Testing is event-driven, with a focus on the basic consequence of screens or fields. Integration tests show that, while the components were individually satisfactory (as demonstrated by successful unit testing), the combination of components is proper and consistent. Integration testing focuses on uncovering issues that arise from the integration of components.

6.3 Validation
Test strategy and approach

Field testing will be performed manually, and functional tests will be written
in detail.

Test objectives
 All field entries must work properly.
 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.

Features to be tested
 Verify that the entries are of the correct format
 No duplicate entries should be allowed
 All links should take the user to the correct page.

Integration Testing:

Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects.

The task of the integration test is to check that components or software applications (e.g. components in a software system or, one step up, software applications at the company level) interact without error.

Test Results: All the test cases mentioned above passed successfully. No
defects encountered.

Acceptance Testing
User acceptance testing is an important aspect of any project and necessitates active engagement from the end user. It also guarantees that the system satisfies the functional specifications.

Test Results: All the test cases mentioned above passed successfully. No
defects encountered.

Chapter 7
CONCLUSION & FUTURE WORK
CONCLUSION
This paper focuses on crop prediction with several machine learning algorithms: K-NN, RF, DTC, GNB, SVM, GB, XGBoost, and a Voting classifier. Based on the crop dataset, we calculated the accuracy of all models. The experimental results concluded that the Voting classifier gave the highest accuracy, 94.67 percent, compared to the other algorithms. Therefore, this system is useful for users such as farmers to predict the crop name or type to cultivate in various agricultural fields. In the future scope, we will implement several machine learning algorithms for the prediction of crop yields.

FUTURE WORK
In the coming years, we will strive to make the system data-independent: whatever the format of the input, the system should work with the same accuracy. Integrating soil details into the system would be a plus, as information on soil is also a parameter for the choice of crops. Proper irrigation is likewise a required feature of crop cultivation; in relation to rainfall, the system could indicate whether or not additional water availability is required. This research work can be enhanced to a higher level by making it available to the whole of India.

48
Chapter 8
REFERENCES

[1] Arun Kumar, Naveen Kumar, Vishal Vats, "Efficient Crop Yield Prediction Using Machine Learning Algorithms", International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, p-ISSN: 2395-0072, Volume 05, Issue 06, June 2018.
[2] Aakash Parmar & Mithila Sompura, "Rainfall Prediction using Machine Learning", 2017 International Conference on (ICIIECS), Coimbatore, Volume 3, March 2017.
[3] Vinita Shah & Prachi Shah, "Groundnut Prediction Using Machine Learning Techniques", published in IJSRCSEIT, UGC Journal No: 64718, March 2020.
[4] Prof. D. S. Zingade, Omkar Buchade, Nilesh Mehta, Shubham Ghodekar, Chandan Mehta, "Machine Learning-based Crop Prediction System Using Multi-Linear Regression", International Journal of Emerging Technology and Computer Science (IJETCS), Volume 3, Issue 2, April 2018.
[5] Priya, P., Muthaiah, U., Balamurugan, M., "Predicting Yield of the Crop Using Machine Learning Algorithm", 2015.
[6] Mishra, S., Mishra, D., Santra, G. H., "Applications of machine learning techniques in agricultural crop production", 2016.
[7] Ramesh Medar & Anand M. Ambekar, "Sugarcane Crop Prediction Using Supervised Machine Learning", published in International Journal of Intelligent Systems and Applications, Volume 3, August 2019.
[8] Nithin Singh & Saurabh Chaturvedi, "Weather Forecasting Using Machine Learning", 2019 International Conference on Signal Processing and Communication (ICSC), Volume 05, December 2019.
