
Classification of Input Document or Text in Different Indian IT Laws Using Machine Learning Techniques

Akshay Paliwal, Rupam Saxena and Tamanna Sharma
Jaypee Institute of Information Technology

Abstract - In this paper, we study how section classification can help individuals understand the various acts and laws that may apply to a legal document, using an n-gram model. Due to the exponential growth of information, there is a need for automated tools that can process and understand such texts. Our model will allow users to take a legal document such as an FIR and find the legal sections broken by the accused, thus helping a layperson understand the law.

Index Terms - Document classification; Information extraction; Classifiers

INTRODUCTION

Classifying legal documents is an important problem faced by organizations and individuals, and it necessitates mechanisms that classify documents intelligently. Text classification is an example of a supervised machine learning task, since a labelled data set containing text documents and their labels is used to train a classifier. An end-to-end text classification pipeline is composed of the following main components:

1. Data set Preparation: The first step is Data set Preparation, which includes loading a data set and performing basic preprocessing. The data set is then split into train and validation sets.

2. Feature Engineering: Next, the raw data set is transformed into flat features that can be used in a machine learning model. This step also includes creating new features from the existing data.

3. Model Training: The final step is Model Building, in which a machine learning model is trained on the labelled data set.

4. Improve Performance of Text Classifier: In this paper, we also look at different ways to improve the performance of text classifiers.

Our model will take a legal text as input, classify the different Indian IT laws applicable to it using machine learning, and compare the accuracy of the different algorithms used for classifying the legal text. We will focus on classifying the sections of law applicable to a document using Random Forest and a neural network.
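The pipeline steps above (data set preparation, feature engineering, model training) can be sketched as follows. This is a minimal sketch assuming scikit-learn; the toy corpus and section labels are invented for illustration, and the character n-gram features anticipate the feature design described later in the paper.

```python
# Sketch of the pipeline: prepare data, engineer features, train a model.
# Assumes scikit-learn; the toy corpus and section labels are invented.
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# 1. Data set preparation: a labelled corpus, split into train/validation sets.
texts = ["accused hacked into the company server",
         "source code was copied without consent",
         "obscene material was published online",
         "the protected system was accessed unlawfully"]
labels = ["66", "65", "67", "66"]   # hypothetical IT Act section labels

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# 2. Feature engineering: character n-grams turned into a flat TF-IDF matrix.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
X_train_feats = vectorizer.fit_transform(X_train)
X_val_feats = vectorizer.transform(X_val)

# 3. Model training: fit a classifier on the labelled training features.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train_feats, y_train)
predictions = clf.predict(X_val_feats)
```

In practice the real data set scraped for this task would replace the toy corpus, but the three stages keep the same shape.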
PROBLEM STATEMENT

Classification of an input document or text file into different Indian IT laws using machine learning techniques: an input legal document will be classified into the different sections of IT law it comes under.

METHODOLOGY

This section describes the methodology created to extract the required information from the PDF files, perform data extraction, and classify the documents.

I. DATA EXTRACTION

Data extraction is used for extracting data from the web. Here we use data extraction to create a data set upon which we will train our model for the prediction of laws. All the data used in this research is available on the site "INDIAN KANOON.ORG"; we create a data set for our classifier by scraping data from this site.

II. DATA PREPROCESSING

The text data cannot be input directly into our proposed model, so the data obtained by web scraping needs to be preprocessed first. Data preprocessing includes removing punctuation, extra white space and special characters present in the text, leaving only alphanumeric values. All data is also converted to lower case.

III. FEATURE EXTRACTION

Before training the model we need to design the features used to score the different text present in a legal document. As all legal documents have a fixed format, it is possible for our model to classify under which sections the legal document falls. Features are based on the sections in the constitution.

N-grams - Using this method, text documents are divided into character sets (substrings) of length n. The first substring contains all the characters of the document from the 1st to the n-th inclusive; the second substring contains all characters from the 2nd to the (n+1)-th inclusive. This principle is applied throughout the whole text document, the last substring containing the characters from the (k−n+1)-th to the k-th, where k is the number of characters in the text document. This process is applied to each text document, and a dictionary of unique substrings of length n (n-grams, considered as terms) is generated. Character sets are one of several ways to use n-grams.

IV. TEXT CLASSIFIERS

We use two algorithms to classify the legal text and hence identify which section of law it comes under.

1. Random forest algorithm - Random forest is a supervised classification algorithm. As the name suggests, it creates a forest of trees and makes it random. There is a direct relationship between the number of trees in the forest and the results it can achieve: the larger the number of trees, the more accurate the result. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance.

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. This bootstrapping procedure leads to better model performance because it decreases the variance of the model without increasing the bias. While the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees (or even the same tree many times, if the training algorithm is deterministic); bootstrap sampling is a way of de-correlating the trees by showing them different training sets. Random forests differ in only one way from this general scheme: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features.
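The bagging procedure described above can be sketched by hand. The stub "tree" below, which just predicts the mean target of its resample, stands in for a real regression tree learner; the numbers are invented.

```python
import random

def mean_tree(sample):
    """Stub tree learner: a depth-0 'tree' that predicts the mean target
    of the resample it was grown on (a real tree learner would go here)."""
    mean = sum(y for _x, y in sample) / len(sample)
    return lambda x0: mean

def bagged_predict(calibration, x0, grow_tree, M=25, seed=0):
    """Draw M bootstrap resamples of the calibration set (sampling with
    replacement), grow one tree per resample, and average their
    predictions at the test point x0."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(M):
        resample = [rng.choice(calibration) for _ in calibration]
        tree = grow_tree(resample)
        predictions.append(tree(x0))
    return sum(predictions) / M   # average of the individual trees

data = [(1, 2.0), (2, 4.0), (3, 6.0)]
estimate = bagged_predict(data, x0=2, grow_tree=mean_tree)
```

Because each tree sees a different resample, the individual predictions vary, and the average smooths that variance out.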

Suppose we have a calibration set C = {C1, …, Cn} with Ci ≡ (xi, yi) and an independent test case C0 with predictor x0. The following steps can then be carried out:

1. Sample the calibration set C with replacement to generate bootstrap resamples B1, …, BM.

2. For each resample Bm, m = 1, …, M, grow a regression tree Tm.

For predicting the test case C0 with covariate x0, the predicted value of the whole random forest is obtained by combining the results given by the individual trees; it can be written as

(1/M) ∑_{m=1}^{M} f̂*_m(x0),

where f̂*_m is the prediction of the m-th tree.

2. Neural network - The proposed neural network follows the Perceptron in that synaptic weights are connected directly between the input layer and the output layer, and the weights are updated only when a training example is misclassified. The learning layer, given as an additional layer between the input and the output layer, differs from the hidden layer of back propagation with respect to its role: it determines the synaptic weights between the input and the output layer by referring to the tables owned by the learning nodes. The learning of the neural network classifier refers to the process of optimizing the weights stored in these tables.

Figure 1 shows the architecture of the neural network classifier. It consists of three layers: the input layer, the output layer, and the learning layer. The input layer receives an input vector given as a string vector. The learning layer determines the weights between the input and the output layer corresponding to the words of the given input vector by looking them up in the tables owned by the learning nodes. The output layer generates, as the output, the categorical scores indicating the memberships of the string vector in the categories.

Figure 1. The architecture of the neural network classifier

For text categorization, the network is defined as follows:

1. The number of input nodes should be identical to the dimension of the string vectors representing documents. This layer receives an input vector given as a string vector, so each node corresponds to a word in the string vector.

2. The number of learning nodes should be identical to the number of predefined categories. Nodes of this layer own tables corresponding to the predefined categories, and determine the weights between the input and output layer for each word in the input vector.

3. The number of output nodes should be identical to the number of predefined categories. This layer generates the categorical scores as the output, and they correspond to the predefined categories.
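A minimal sketch of this perceptron-style classifier follows: one weight table per predefined category, scores summed over the words of the input string vector, and weights updated only when a training example is misclassified. The toy documents and category labels are invented.

```python
def classify(tables, words):
    """Output layer: categorical score per category = sum of table weights
    for the words present in the input string vector."""
    scores = {c: sum(t.get(w, 0.0) for w in words) for c, t in tables.items()}
    return max(sorted(scores), key=lambda c: scores[c])

def train(samples, categories, epochs=10, lr=1.0):
    """Learning layer: one weight table (word -> weight) per category,
    updated only when a training example is misclassified."""
    tables = {c: {} for c in categories}
    for _ in range(epochs):
        for words, label in samples:
            pred = classify(tables, words)
            if pred != label:   # update only on misclassification
                for w in words:
                    tables[label][w] = tables[label].get(w, 0.0) + lr
                    tables[pred][w] = tables[pred].get(w, 0.0) - lr
    return tables

samples = [(["hacking", "system"], "66"), (["obscene", "material"], "67")]
model = train(samples, ["66", "67"])
print(classify(model, ["obscene"]))   # → 67
```

The word-indexed tables play the role of the learning nodes: looking up a word returns its weight toward each category, and the argmax over summed scores is the predicted category.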
MODEL DESIGN

The classification process consists of the five steps shown in Fig. 2:

1. Web Scraping for Desired Data: Creating a data set for our classifier by scraping data from the site "INDIAN KANOON.ORG".

2. Cleaning and Preprocessing Data: Data is cleaned and preprocessed, that is, all symbols, extra spaces and unwanted signs are removed.

3. Text Selection: All the useful text is selected from the document.

4. Tokenization: Data is tokenized, filtered and stemmed in order to create a unique list of stems and their frequencies.

5. Model Training: Training the data using various classification techniques.

Figure 2. Process of text classification

CRITICAL REVIEW OF RESEARCH PAPERS

In order to implement "Classification of Input document or text in different Indian IT laws using Machine Learning Techniques" effectively, we studied various research papers that matched the requirements of our topic. We extensively covered papers on data cleaning and preprocessing and on different types of text classifiers. We now understand that one algorithm alone cannot fulfil our requirement; at different levels we use multiple algorithms to get more refined and accurate results. To match strings we use the LCS (longest common subsequence) algorithm, and then apply the n-gram algorithm to design the features used to train our model. We gathered deeper knowledge of the already known classification techniques (KNN and Naïve Bayes) and found that moving to more advanced classification algorithms gives better results. We use two algorithms: first the Random Forest classifier, and second a convolutional neural network.

Furthermore, we plan to work on improving the results and accuracy of these algorithms by studying the new models and advancements happening in the fields of these algorithms.
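The LCS string matching mentioned above can be sketched with the standard dynamic programme; the sample strings below are invented.

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence: dp[i][j]
    holds the LCS length of the first i characters of a and the first j
    characters of b. Usable for matching extracted strings against
    section names despite small differences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

print(lcs_length("section 66 it act", "sec 66 of it act"))
```

A high LCS length relative to the string lengths signals that two strings refer to the same section even when the wording differs slightly.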

CONCLUSION

Finally, we developed a model which can classify a legal document and identify the different legal sections applicable to it. Classification of legal text will help decrease the human effort of scrolling through each legal document to find the constitutional sections that apply to it. It will be of great help to lawyers, policemen (for filing charge sheets) and laymen (to cross-check that what their legal advisers say is correct). The main idea is to implement a commercially available legal tool to help classify documents according to the constitutional section.

ACKNOWLEDGMENT

The authors would like to thank Dr. Bharat Gupta, our project supervisor, for fruitful discussion on information extraction and research methods which helped us to complete the project.

AUTHOR INFORMATION

Akshay Paliwal, student of Jaypee Institute of Information Technology – Computer Science Engineering, 15103006.
Rupam Saxena, student of Jaypee Institute of Information Technology – Computer Science Engineering, 15103019.
Tamanna Sharma, student of Jaypee Institute of Information Technology – Computer Science Engineering, 15103027.
