
Automatic Classification of Patent Documents

PAPER WITHIN Product development


AUTHOR: Nala Yehe
TUTOR: Rachid Oucheikh, Lirandë Pira
JÖNKÖPING June 2020
This exam work has been carried out at the School of Engineering in
Jönköping in the subject data analysis. The work is part of the two-year
university Master of Software Product Engineering programme.

Examiner: Anders Adlemo

Supervisor: Rachid Oucheikh, Lirandë Pira

Scope: 30 credits

Date: June 2020


Abstract
Patents have great research value and are also beneficial to the industrial,
commercial, legal, and policymaking communities. Effective analysis of patent literature
can reveal important technical details and relationships, and it can also explain
business trends, propose novel industrial solutions, and support crucial investment
decisions. Therefore, we should carefully analyze patent documents and exploit the value
of patents. Generally, patent analysts need to have a certain degree of expertise in
various research fields, including information retrieval, data processing, text mining,
field-specific technology, and business intelligence. In practice, it is difficult to find and
train such an analyst within a relatively short period of time who can meet the
requirements of multiple disciplines.

Patent classification is also crucial in processing patent applications because it
empowers people to manage and maintain patent texts better and more flexibly.
In recent years, the number of patents worldwide has increased dramatically,
which makes it very important to design an automatic patent classification system.
Such a system can replace time-consuming manual classification, thus providing
patent analysis managers with an effective method of managing patent texts. This
thesis designs a patent classification system based on data mining methods and
machine learning techniques and uses KNIME software to conduct a comparative
analysis. The research is carried out using different machine learning methods and
different parts of a patent.

The purpose of this thesis is to use text data processing methods and machine learning
techniques to classify patents automatically. It mainly includes two parts: the first is
data preprocessing and the second is the application of machine learning techniques.
The research questions include: Which part of a patent as input data performs best in
relation to automatic classification? And which of the implemented machine learning
algorithms performs best regarding the classification of IPC keywords?

This thesis uses design science research as the method to investigate and analyze this
topic. It uses the KNIME platform to apply the machine learning techniques, which
include decision tree, XGBoost linear, XGBoost tree, SVM, and random forest. The
implementation part includes data collection, data preprocessing, feature word
extraction, and applying classification techniques. A patent document consists of
many parts such as description, abstract, and claims. In this thesis, we feed these
three groups of input data separately to our models. Then, we compare the
performance of those three different parts.

Based on the results obtained from these three experiments and their comparison,
we suggest using the description part as input data in the classification system
because it shows the best performance in English patent text classification. The
abstract can serve as an auxiliary standard for classification. However, the classification
based on the claims part proposed by some scholars did not achieve good
performance in our research. Besides, the BoW and TF/IDF methods can be used
together to efficiently extract the feature words in our research. In addition, we found
that the SVM and XGBoost techniques have better performance in the automatic
patent classification system in our research.

Keywords
XGBoost; support vector machine (SVM); random forest; decision tree; machine
learning; text data mining; patent classification; IPC

Contents

1 Introduction............................................................................... 7
1.1 BACKGROUND ................................................................................................. 7

1.2 PURPOSE AND RESEARCH QUESTIONS .............................................................. 7

1.3 DELIMITATIONS .............................................................................................. 9

1.4 OUTLINE ......................................................................................................... 9

2 Theoretical background ......................................................... 11


2.1 PATENT DOCUMENTS .................................................................................... 11

2.1.1 PATENTS ............................................................................................... 11


2.1.2 TEXT DATA MINING ............................................................................... 11
2.1.3 CLASSIFICATION STANDARDS................................................................ 11
2.1.4 PATENTS DOCUMENT CLASSIFICATION RESEARCH DIRECTION ............... 11
2.2 CLASSIFICATION PROCEDURE ........................................................................ 12

2.3 DATA PROCESSING........................................................................................ 13

2.3.1 DATA PREPROCESSING .......................................................................... 13


2.3.2 FEATURE EXTRACTION (TF/IDF METHOD) ............................................ 14
2.4 CLASSIFICATION TECHNIQUES ...................................................................... 14

2.4.1 SUPPORT VECTOR MACHINE (SVM) ..................................................... 15


2.4.2 DECISION TREE (DT) ............................................................................. 15
2.4.3 RANDOM FOREST (RF) .......................................................................... 15
2.4.4 XGBOOST ............................................................................................. 15
2.5 RESULTS ANALYSIS ...................................................................................... 16

2.6 TOOLS ........................................................................................................... 16

2.6.1 PYCHARM ............................................................................................. 16


2.6.2 KNIME ................................................................................................ 16
3 Methodology ............................................................................ 18
3.1 LITERATURE REVIEW AND DATA COLLECTION............................................... 18

3.2 DESIGN SCIENCE RESEARCH.......................................................................... 19

3.2.1 DESIGN SCIENCE RESEARCH METHOD ................................................... 19


3.2.2 RESULTS COMPARISON .......................................................................... 20


4 Research procedure and implementation............................ 21
4.1 DATA COLLECTION........................................................................................ 22

4.2 DATA PREPROCESSING .................................................................................. 23

4.3 FEATURE WORDS EXTRACTION...................................................................... 24

4.4 CLASSIFICATION TECHNIQUES ....................................................................... 25

5 Discussion and conclusions .................................................... 28


5.1 DISCUSSION OF METHOD ............................................................................... 28

5.2 DISCUSSION OF FINDINGS .............................................................................. 28

5.2.1 RESULTS AND ANALYSIS ....................................................................... 28


5.2.2 COMPARISON RESULTS .......................................................................... 30
5.2.3 SUMMARY OF RESULTS ......................................................................... 31
5.2.4 COMPARISON WITH OTHER STUDIES ...................................................... 32
5.3 CONCLUSION AND PERSPECTIVE.................................................................... 32

6 References ............................................................................... 34


List of figures
Figure 1. Research procedure. ....................................................................................... 13
Figure 2. Medicine as the search string. ....................................................................... 19
Figure 3. Program running for patent data collection.................................................. 19
Figure 4. Sample of patent txt file. ................................................................................ 19
Figure 5. Patent classification system. ..........................................................................22
Figure 6. Processing data workflow. (Thiel, 2014) .......................................................24
Figure 7. Feature words extraction workflow. (Thiel, 2014) ........................................ 25
Figure 8. SVM technique workflow. (Berthold & Thiel, 2012) .....................................26
Figure 9. XGBoost linear technique workflow. (Berthold & Thiel, 2012) ....................26
Figure 10. XGBoost tree technique application. (Berthold & Thiel, 2012) .................. 27
Figure 11. Decision tree technique. (Berthold & Thiel, 2012) ...................................... 27
Figure 12. Random forest technique. (Berthold & Thiel, 2012) ................................... 27


List of tables
Table 1. Classification results using only description data .......................................... 28
Table 2. Classification results using only abstract data................................................29
Table 3. Classification results using only claims data text ...........................................29
Table 4. Comparison of accuracy ................................................................................. 30
Table 5. Comparison of recall ...................................................................................... 30
Table 6. Comparison of precision ................................................................................. 31
Table 7. Comparison of performance............................................................................ 31


1 Introduction
This thesis aims to use machine learning techniques to automatically classify patent
documents. Currently, most patent documents are classified manually
(PRV, 2019). This report introduces how to obtain patent text data, clean it, find the
feature words of a text in a patent document, and then how to classify it. Besides these
goals, this report also introduces some methods for data processing in
chapter three and presents the technological background in chapter two. This
thesis also includes the experiment procedure, results, and comparison.

1.1 Background

A patent document describes the rights of its holder for a limited period of time.
As Olsson & Söderström (2019) describe, a patent prevents others from commercially
exploiting the invention within a specified time and prescribed area, for example by
selling, importing, or distributing it. Patents can also work as a measure of
technological innovation and development and can promote socio-economic and
technological progress (PRV, 2019).

In recent years, the number of patent documents has increased dramatically (WIPO IP
Services, 2019). As the number of applications continues to increase, it is important to
quickly categorize and retrieve patent documents so that the patents can be used in
many areas such as patent analysis (Aristodemou, Tietze, Athanassopoulou, &
Minshall, 2017).

According to statistical research by the World Intellectual Property Organization
(WIPO), 90%-95% of inventions apply for patents every year (WIPO IP Services, 2019).
Patent research has a positive impact on product sales, company performance, and
stocks. Patent-holding companies also offer better pay; studies have shown that such
companies can pay 26% higher wages than other companies (WIPO IP Services,
2019). It can be seen that patents contain rich scientific and technological
information, reflecting the development level and trends of science and technology.
Therefore, how to mine patents and obtain useful patent information has become the
focus of research by relevant experts and scholars (Olsson & Söderström, 2019).

Besides, some people work on transforming patent documents, aiming to present
patents in another way that makes them more interesting and attractive for the
public. This also demonstrates that research on patent documents has significant
meaning. (IPscreener, 2019)

The patent classification system uses the International Patent Classification (IPC) as
the main classification standard, but there are other classification standards in Europe,
such as CPC and DPK. In the future, CPC may become the most popular classification
standard (Lee & Hsiang, 2019).

1.2 Purpose and research questions

At present, the classification of patent texts is mainly based on manual work. It is
common that patent examiners of the intellectual property office or experts in
related fields are required to assign classification categories to newly filed patent
texts. This manual approach requires a lot of time-consuming work to complete the
classification, and it may have some drawbacks (PRV, 2019).


With the advancement of computer technology, automatic classification of patent texts
can be used as an aid. Automated or semi-automatic classification assistance by
computer technology can reduce the uncertainty of manual classification and
classification errors. At the same time, it can also reduce the workload of the examiner
and improve the efficiency of classification (Wang, et al., 2020). However, based on
the current literature review, relevant research is still in the experimental stage. Some
researchers prefer to use the abstract or a part of patents for analysis and
classification (Lee & Hsiang, 2019). In addition, the majority of concerned researchers
are still working on developing a suitable method to classify them automatically
(IPscreener, 2019). Certainly, the accuracy has not yet reached acceptable levels, and
large-scale automatic classification of patent texts has not been achieved. Therefore,
studying the application of machine learning to the automatic classification of patent
texts, which can automatically divide a large number of patent texts according to their
semantic features, can better help people grasp the rich technical information
contained in the text, so this study has practical significance (Li, Hu, Cui, & Hu, 2018).

The purpose of this thesis is to use text data processing methods and machine learning
techniques to classify patents automatically (Tran & Kavuluru, 2017). It mainly
includes two parts: the first is data preprocessing and the second is the application of
machine learning techniques. Besides, this thesis investigates the use of different
machine learning methods and different parts of a patent. Some researchers
mention that the SVM technique shows the best performance in automatic patent
classification (Lee & Hsiang, 2019). Others report that the XGBoost technique achieves
state-of-the-art results in many machine learning competitions (Wang, et al., 2020).
Decision tree and random forest are two techniques researchers usually choose when
they classify text data (Lee, Kwon, Myeongjung, & Kwon, 2018). Therefore, the
machine learning techniques used in this thesis include decision tree, XGBoost,
SVM, and random forest. On the other hand, some researchers mention that the
claims can be used as input data for patent classification (Suominen, Toivanen, &
Seppänen, 2017). Jieh-Sheng Lee and Jieh Hsiang mention that the claims part is
sufficient for patent classification (Lee & Hsiang, 2019). Most researchers focus on
improving the performance of patent classification by using the abstract and title of a
patent (Christopher, Lin, & Spieckermann, 2011). Besides, some researchers
mention that the description part usually contains detailed information about a
patent, which might be useful for patent classification (Suominen, Toivanen, & Seppänen,
2017). So, in this thesis, we feed these three groups of input data, namely abstract,
description, and claims, separately to our models. Then, we compare the
performance of those three different parts to suggest which part is more suitable
for automatic patent classification and which machine learning techniques should
be used. Therefore, the main contribution of our research is summarized as follows:

• We use XGBoost techniques in automatic patent classification.

• We use abstract, description, and claims as three different input data sets for the patent
classifiers and compare their performance.
• We use SVM, XGBoost, DT, and RF techniques with the TF/IDF method to build classifiers
and compare their performance.

The aim of the thesis is to develop a framework that automatically classifies patent
documents.


An important aspect that needs to be evaluated in the framework relates to the different
parts of a patent as input data that could be applied and their individual performance.
The exact meaning of performance in this thesis is explained in section 2.5. This leads
to the first question:

Which part of a patent as input data performs best in relation to automatic


classification?

Another important part of the framework is the evaluation of different machine


learning algorithms, to identify the algorithm with the best performance.

This leads to the second research question:

Which of the implemented machine learning algorithms performs best regarding the
classification of IPC keywords?

1.3 Delimitations

Since the total number of patent documents is large, and there are many small
classifications, there are about tens of thousands of categories (USPTO, 2020).
Moreover, in a general category, a patent document may be classified into several
different subcategories. Considering the operability of the experiment, a small range of
data will be taken, and experiments will only be conducted for several small categories
in one large category. The goal is to complete the experiment completely.

In this thesis, we choose one hundred documents from five different categories as
the test data. We have used design, medicine, human, plant, and Swedish as the
keywords to download patents from the US patent website. We randomly choose
twenty patents from each category as the test data. Therefore, the data set is not big,
and the five categories were chosen randomly. Besides, we choose the SVM, decision
tree, random forest, and XGBoost machine learning techniques and the TF/IDF method
to build the classifiers. These are the limitations of our research.

1.4 Outline

The rest of the report is structured as follows:

The second section is the theoretical background, which will mainly introduce the
detailed process of the automatic classification of patent documents and also gives a
brief explanation of patents. In this chapter, we also include the data mining methods,
machine learning techniques, and the comparison elements explanation. Different
methods of building classifiers will be sorted out and summarized.

The third section is the methodology. This chapter will show the methods which are
used in this research. It also includes a research design, implementation plan, and data
collection.

The fourth section describes the research procedure and implementation, including
data collection, data preprocessing, feature words extraction, and the application of the
classification techniques.


The fifth section contains the discussion and conclusions of this study. The experimental
analysis, results comparison, and discussion are detailed there. It sums up the work and
its achievements, covers the limitations of the thesis, and offers personal opinions for
potential future work.

The last section is the references, a list of all references used in this thesis.


2 Theoretical background
This chapter introduces the theoretical background. It includes a description of patent
documents, text data mining methods, machine learning techniques for classification,
and the elements used to evaluate the results.

2.1 Patent Documents


2.1.1 Patents

A patent document is a document that describes a patent application. The structure of a
patent document includes basic information, background introduction, description, and
claims. The basic information includes the patent number, title, CPC/IPC code,
references, drawings, etc. (Olsson & Söderström, 2019).

When the document is approved and ready to be published to the public, the document
should carry one patent classification code, or two or even more codes. This makes it
easy to search for the patent in a specific category. There are different codes for different
classification standards in different patent authorities. (Li, Hu, Cui, & Hu, 2018)

Patents can serve as a measure of technological innovation and development for a country
and can also promote socio-economic and technological progress. Besides, patent
documents can be used to predict future professional technology trends and detect
infringement. Automatic patent classification allows patent experts to reduce the
amount of work involved, including manual analysis of patent documents and
determination of patent quality. (Wu, Chang, Tsao, & Fan, 2016) Manual classification
requires a higher cost, such as the need for professional experts to participate, and it
also comes with uncertainties and misclassifications. (PRV, 2019)

2.1.2 Text data mining

Data mining is the process of extracting previously unknown, potentially useful information
and knowledge from large amounts of data, which may be incomplete, noisy, fuzzy, or
random. When the object of data mining is text, that is, when the data type is text data,
this process is called text mining. (Aggarwal C & Zhai, 2012)

2.1.3 Classification standards

The classification criteria for patent documents differ between regions. Currently, three
standards are mostly used: IPC (International Patent Classification), CPC (Cooperative
Patent Classification), and DPK (German patent classification). (PRV, 2019)

In fact, for the classification of patent documents, Europe and the United States have
high-quality classification systems with different patent classification standards. The
CPC is produced jointly by the European Patent Office and the US Patent Office. It is
derived from the existing European patent document classification standard and
complies with the International Patent Classification (IPC), while including more details
and sub-categories. (European patent office & United States patent and trademark
office, 2010) In the future, CPC might become the most used standard
(Lee & Hsiang, 2019).

2.1.4 Patents document classification research direction


For patent documents, text data preprocessing, feature extraction methods, and the
selection and improvement of various machine learning algorithms are the main
research directions in this field. Most of the major research concerns the selection and
improvement of machine learning algorithms. For each step of the automatic
classification process of patent texts, relevant experts have conducted in-depth research
and achieved great results. Moreover, the relevant literature shows that any breakthrough
in the preprocessing for automatic patent text classification, in feature selection, or in
machine learning algorithm selection and improvement can greatly improve the
classification results. (Hongjie , et al., 2018)

Most researchers classify patents using the whole text data, such as the abstract and
title. (Xia, LI, & Lv, 2016) Recently, some have proposed a new direction: using only the
claims part of a patent document is enough for classification. (Lee & Hsiang, 2019) The
claims section covers the subject matter of patent documents or applications. It defines
the scope of legal protection sought or granted (Olsson & Söderström, 2019).

2.2 Classification procedure

Text classification is an effective way to extract useful and meaningful information
from text data (Wang, et al., 2020). It is used in information extraction, text retrieval,
dynamic summarization, and other fields. A text classification system usually consists of
three steps: text processing, model training, and prediction. Text processing
includes cleaning the training data, expressing it as vectors, and selecting feature words
(Aggarwal C & Zhai, 2012). The model training step uses machine learning
techniques to build the training model. Prediction means that the same preprocessing
is performed to obtain the document vector, and the category is then predicted by the
machine learning technique (Wang, et al., 2020). To make this research clearer,
Figure 1 shows the procedure of the text classification system.

First, the text data set is randomly divided into a test set and a training set. Then, the
training set and the test set are preprocessed separately. The quality of preprocessing
will also affect the accuracy of a classification. The preprocessing process includes
removing stop words and low frequency words. In some cases, low frequency words
have an important influence on text classification, and there is no need to remove them.
(Wang, et al., 2020)

Because there are too many feature words in the text, the dimension of the constructed
model will be too large, which will affect the accuracy of a classification and the
performance of calculation (Tseng & Lin, 2007). Therefore, it is necessary to select
features based on the importance of feature words for text classification.

Finally, machine learning techniques are used to construct a classifier, which then
processes and classifies the data and is evaluated on the test data set (Christopher, Lin,
& Spieckermann, 2011). The classification and evaluation step in Figure 1 builds on the
above preprocessing, feature selection, and weight value calculation. In the KNIME
workflow, the quality of the preprocessing process, the feature selection method, and the
text representation model is evaluated based on the results of the classification (AG,
2020).


Figure 1. Research procedure.

Once a classifier has been built, the input data can be a patent document whose patent
code is unknown. After the same processing, the result should suggest a category for
the patent. (Li, Hu, Cui, & Hu, 2018)
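To make this procedure concrete, the following is a minimal Python sketch of such a classifier using scikit-learn; this is an assumption for illustration only, since the thesis itself builds the classifier with KNIME nodes, and the example texts and category codes below are invented.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical training data: raw patent texts with known category codes
train_texts = [
    "a loudspeaker with an improved diaphragm and acoustic chamber",
    "a method for analyzing the chemical properties of a material sample",
]
train_labels = ["1053", "1052"]

# Text processing (TF/IDF vectors) followed by model training, as in Figure 1
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", SVC(kernel="linear")),
])
model.fit(train_texts, train_labels)

# Prediction: an unlabeled patent goes through the same preprocessing,
# and the classifier suggests a category for it
new_patent = "an acoustic electromechanical transducer for deaf-aid sets"
print(model.predict([new_patent])[0])  # e.g. "1053"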

2.3 Data Processing

The text classification task is to find a category or a class quickly and automatically.
Nowadays, with the improvement of technology and the economy, more and more useful
data is hidden in chaotic data (Wang, et al., 2020). So, how to find useful information or
needed data is an interesting research area. For these reasons, methods such as SVM,
KNN, CNN, and K-means can be used (Li, Hu, Cui, & Hu, 2018). Some research shows
that SVM has advantages such as handling high-dimensional input spaces, few irrelevant
features, sparse document vectors, and the fact that most text classification problems
are linearly separable (Li, Hu, Cui, & Hu, 2018). Comparing different methods also
shows that this method has better classification accuracy (Tran & Kavuluru, 2017).
Patent documents are also text, so patent classification falls within the scope of text
classification, and therefore machine learning and text mining can be used. SVM could
be the first choice. Besides, the XGBoost method is usually used in the text classification
field with good performance (Wang, et al., 2020). Therefore, it could be the second choice.

2.3.1 Data preprocessing


Data preprocessing is the conversion of text into usable data. This step aims to remove
the information that is not needed in the document, such as noisy words, document
formats, punctuation, symbols, and special characters. It might also include removing
some meaningless words like articles, pronouns, prepositions, conjunctions, and
auxiliary words in the document (Xia, LI, & Lv, 2016).

2.3.2 Feature extraction (TF/IDF method)

The text of each patent document contains many different feature words, and extracting
them is the key first step in the classification. Due to the large volume of patent texts, the
vector space dimension obtained from the processed text data is particularly high. There
may be some highly relevant text in a patent, and some of these words are interrelated.
There may also be words that appear in different categories with different meanings.
Therefore, it is important to extract a suitable feature vocabulary for the classification
experiments. (Tseng & Lin, 2007)

Currently, there are several methods that can be used to select feature words
(Aggarwal C & Zhai, 2012). Although all of these methods could be used in text data
mining, most researchers choose the TF/IDF method to extract feature words. Some of
them mention that TF/IDF shows the best performance in feature word extraction for
patent classification. (Li, Hu, Cui, & Hu, 2018) Based on the KNIME tutorial, we find
that TF/IDF is the common choice for text classification. (Berthold & Thiel, 2012)
Therefore, we choose TF/IDF as the method to extract feature words.

According to the literature review, TF/IDF is a popular and common method in the text
classification field (Li, Hu, Cui, & Hu, 2018). This method calculates the weight value of
words and is usually considered the first choice for text classification. The TF/IDF
weighting method is based on the idea that words with a high frequency in the concerned
document and a low frequency in other documents have a great influence on text
classification. It is determined by two values, the TF value and the IDF value. TF means
term frequency, and in the TF/IDF weighting formula, the higher the TF value, the greater
its influence on the classification of the text. IDF means inverse document frequency. A
high IDF value indicates that the feature word is less likely to appear in other documents,
which means that this word is better able to distinguish between text categories.
(Aggarwal C & Zhai, 2012) Here is the formula used to calculate the weight:

W(t, d) = \frac{tf(t, d) \cdot \log\left(\frac{N}{n(t)} + a\right)}{\sqrt{\sum_{t \in d} \left[ tf(t, d) \cdot \log\left(\frac{N}{n(t)} + a\right) \right]^{2}}}

Here, W(t, d) is the weight of the feature word t in the text d, tf(t, d) represents the word
frequency of the feature word t in d, N is the total number of texts, n(t) is the number of
texts containing the feature word t, a is a small positive value, and log(N/n(t) + a) is the
inverse text frequency function. (Aggarwal C & Zhai, 2012)
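To make this weighting concrete, the following is a minimal Python sketch of the TF/IDF scheme as reconstructed above; the function name, the example documents, and the default value chosen for the small constant a are illustrative assumptions rather than the thesis implementation.

import math
from collections import Counter

def tfidf_weights(documents, a=0.01):
    """Compute normalized TF/IDF weights W(t, d) for tokenized documents.

    documents: list of token lists; a: small positive smoothing value (assumed).
    Returns a list of {term: weight} dictionaries, one per document.
    """
    N = len(documents)
    # n(t): number of documents containing term t
    doc_freq = Counter(t for doc in documents for t in set(doc))

    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency tf(t, d)
        raw = {t: tf[t] * math.log(N / doc_freq[t] + a) for t in tf}
        # denominator of W(t, d): Euclidean norm of the raw weights
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({t: v / norm for t, v in raw.items()})
    return weights

# Example with three tiny tokenized "documents"
docs = [["patent", "claim", "machine"], ["patent", "device"], ["machine", "learning", "patent"]]
print(tfidf_weights(docs)[0])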

2.4 Classification Techniques

Classification algorithms are an important part of text classification and machine
learning. Classification can actually be seen as a prediction task: the process of mapping
test data to specific targets based on the model obtained from the training data. Current
classification algorithms are mainly based on statistics, such as Naive Bayes, the K
nearest neighbor algorithm, support vector machines, the maximum entropy model,
etc.; on neural connections, such as artificial neural networks; or on classification rules,
such as decision trees. (Taher, Jisan, & Rahman, 2019)

2.4.1 Support Vector Machine (SVM)

The principle of the support vector machine algorithm is to find a separating hyperplane
that separates the different categories. The specific training process is to find this
hyperplane, with the positive and negative examples falling on either side of it. The best
hyperplane is the one that maximizes the margin between the positive and negative
examples and is located at an equal distance from the nearest positive and negative
examples. For a patent text of unknown category, the side of the hyperplane on which it
falls determines the category to which it belongs. (Aggarwal C & Zhai, 2012)

The support vector machine algorithm has some advantages over other methods: for
example, it handles high-dimensional problems properly, is insensitive to correlations
between text features, and achieves high accuracy. (Araghinejad & Modaresi, 2014)

2.4.2 Decision tree (DT)

The decision tree classification algorithm is an example-based inductive learning method.
It can extract a tree-like classification model from given, unordered training samples.
Compared with other machine learning classification algorithms, the decision tree
classification algorithm is relatively simple, as long as the training sample set can be
expressed using feature vectors and categories. (Abdelaal, Ahmed, Ghribi, &
Alansary, 2019)

2.4.3 Random forest (RF)

A random forest is composed of many decision trees, and there is no correlation between
the different decision trees. When we perform a classification task and new input samples
are entered, each decision tree in the forest judges and classifies them separately, and
each decision tree produces its own classification result. Once the forest is built
successfully, the classifier selects the most voted category by summarizing all the
decision tree results. RF is very efficient at processing large data sets. (Lee, Kwon,
Myeongjung, & Kwon, 2018)

2.4.4 XGBoost

XGBoost provides parallel tree boosting, which scales quickly and accurately. XGBoost
is an optimized distributed gradient boosting library designed to be efficient, flexible,
and portable. The same code runs on major distributed environments and can solve
problems with billions of samples. It is also a common machine learning technique for
text data processing. (Wang, et al., 2020)
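As an illustration of how these classifier families can be instantiated in code, here is a brief Python sketch using scikit-learn and the xgboost package; this is an assumption for illustration, since the thesis applies the techniques through KNIME nodes, and the parameter values below are only examples, not the settings used in our experiments.

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# The classifier families compared in this thesis, with illustrative settings
classifiers = {
    "SVM": SVC(kernel="linear", C=1.0),
    "Decision tree": DecisionTreeClassifier(max_depth=10),
    "Random forest": RandomForestClassifier(n_estimators=100),
    "XGBoost linear": XGBClassifier(booster="gblinear"),
    "XGBoost tree": XGBClassifier(booster="gbtree", n_estimators=100),
}

# X_train and y_train would be the TF/IDF feature vectors and the category labels
# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)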


2.5 Results Analysis

Several indicators are needed to compare the results of the experiments. On the KNIME
platform, accuracy, recall, and precision are shown in the results (Berthold & Thiel, 2012).
The accuracy rate refers to the proportion of correct predictions in the text classification.
The recall rate refers to the ratio of correctly classified documents of a certain category
to all documents of this category in the data set. The precision value is the ratio of the
number of retrieved relevant documents to the total number of retrieved documents; it
measures the accuracy of the retrieval system. The closer the value is to 1, the better the
performance. Usually, we use the accuracy to measure the effect of classification. (Lee,
Kwon, Myeongjung, & Kwon, 2018)

Here are the expressions:

• Accuracy = (number of correctly classified patents) / (all the patent documents
in this experiment)
• Recall = (number of correctly classified patents) / (number of patents that should be
in this category)
• Precision = (number of correctly classified patents) / (number of patents
classified in this category)
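The following small Python sketch computes these three indicators from true and predicted categories; the example labels are invented for illustration.

def evaluate(y_true, y_pred, category):
    """Accuracy over all patents, plus recall and precision for one category."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    # Recall: correctly classified patents of this category / patents that should be in it
    preds_where_true = [p for t, p in zip(y_true, y_pred) if t == category]
    recall = sum(p == category for p in preds_where_true) / max(len(preds_where_true), 1)

    # Precision: correctly classified patents of this category / patents classified into it
    trues_where_pred = [t for t, p in zip(y_true, y_pred) if p == category]
    precision = sum(t == category for t in trues_where_pred) / max(len(trues_where_pred), 1)
    return accuracy, recall, precision

# Invented example with the category "1052"
y_true = ["1052", "1052", "PP313", "1053", "1052"]
y_pred = ["1052", "PP313", "PP313", "1053", "1052"]
print(evaluate(y_true, y_pred, "1052"))  # accuracy 0.8, recall about 0.67, precision 1.0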

2.6 Tools
2.6.1 Pycharm

PyCharm is an IDE with a set of tools that help developers improve their efficiency
when developing in the Python language. It has many functions such as debugging,
syntax highlighting, smart prompts, unit testing, and so on. In addition, the IDE
provides some advanced features to support professional web development with
some frameworks. (JetBrains, 2020)

2.6.2 KNIME

KNIME is an open source data mining tool based on Eclipse. It carries out the data
extraction, transformation, and loading operations of a data warehouse and data
mining through a workflow. The workflow is built from nodes with convenient functions.
The nodes are independent of each other, can be executed separately, and pass the
executed data to the next node. (AG, 2020)

A workflow is formed by dragging nodes from the Node Repository area in the lower left
corner into the Workflow Editor in the middle. A node has three statuses. When the node
has just been dragged into the work area, a red light indicates that data cannot pass
through; the node needs to be configured before it can be executed. When the
configuration is complete and correct, a yellow light indicates that the data is ready to be
passed. When the execute command is selected to run the node, a green light indicates
that the node has been successfully executed and the data has been passed to the next
node. (AG, 2020)

KNIME includes several types of nodes. IO nodes are used for input and output
operations on files, tables, and data models. Database operation nodes operate
databases through the JDBC driver. Data operation nodes filter, transform, and perform
simple statistical calculations on the data passed in from the previous node. Data view
nodes provide the tables and graphics most commonly used in data mining, including
box charts, pie charts, histograms, data curves, etc. Statistical model nodes encapsulate
statistical algorithms such as linear regression, polynomial regression, etc. Data mining
model nodes provide Bayesian analysis, cluster analysis, decision trees, neural networks,
and other major data mining classification models with corresponding predictors.
(Berthold & Thiel, 2012)


3 Methodology
3.1 Literature review and data collection

To carry out a scientific research project, one must start with a literature review to get
knowledge of the state of the art. It is an important task in which we look for information
in reliable sources such as published papers, the JU library, Google Scholar, IEEE,
Elsevier, etc. This surely helps to get good knowledge of the current status of automatic
classification of patent documents.

Data collection: the needed data is obtained directly from the official US patent and
trademark website, which has patent documents from all over the world (Li, Hu, Cui, &
Hu, 2018). The data is saved in txt format. It also includes the original full-text pdf
document, which means that the patent document includes all the text and picture data
and needs to be cleaned.

The International Patent Classification (IPC) is a complex patent classification system
that includes large classifications and small subcategories. "The latest version of the
IPC contains eight parts, about 120 classes, about 630 subclasses, and about 69,000
groups" (USPTO, 2020). For example, in the IPC classification, "A is human necessities,
B is performing operations, transportation, C is chemistry, metallurgy, D is textiles,
paper, E is fixed structure, F is mechanical engineering, lighting, heating, weapons,
blasting, G is physics, and H is electricity" (USPTO, 2020). Each part is subdivided into
categories, and a category symbol consists of the part symbol followed by two digits,
such as D01. Similarly, each category is divided into several sub-categories, whose
symbols are composed of the symbol of the larger category and capital letters, such as
A01B or D8724 (USPTO, 2020). Because there are many patent files under the IPC
classification and there are many sub-categories, this paper selects five categories in
the crawled data as the research object. According to the patent code, the first four
characters and numbers are selected as the category of the patent in this paper.

During the data collection, we randomly chose plant, Swedish, design, human, and
medicine as search strings in our program and downloaded these patents automatically
from the US patent website. When we chose design as the keyword, we downloaded
patents from D8724 and D8723, which belong to the "washing, cleaning, or drying
machine" category. When we chose medicine as the keyword, we downloaded patents
from category 1052, which is the "investigating or analyzing materials by determining
their chemical or physical properties" category. When we chose human as the keyword,
we downloaded patents from category 1053, which is the "loudspeakers, microphones,
gramophone pick-ups or like acoustic electromechanical transducers; deaf-aid sets;
public address systems" category. When we chose plant as the keyword, we downloaded
patents from category PP313, which is the "new plants or processes for obtaining them;
plant reproduction by tissue culture techniques" category. When we chose Swedish as a
keyword, we downloaded patents from different categories. In order to choose patents
randomly for our research, we decided to choose US patent codes starting with 104,
which belong to the "bioinformatics" category, as our input data (USPTO, 2020). This
experiment processed about 1400 patents in the fields of medicine, design, and others,
including 182 documents in the PP313 category, 442 documents in the 1052 category,
278 documents in the Swedish category, etc. For this research, 100 texts from those
categories were selected for test classification, because we only had one laptop for the
experiment and KNIME could not process more files at one time. Therefore, we chose
twenty patents from each category, for a total of one hundred.


Here is the detailed procedure. For example, when we choose medicine as the search
string in our program, the code and program run as shown in Figure 2.

Figure 2. Medicine as the search string.

Running the program as in Figure 3 downloads the documents automatically.

Figure 3. Program running for patent data collection.

Figure 4 shows an example of a data txt file. It includes the patent code number and
all the text description from the full pdf document.

Figure 4. Sample of patent txt file.

After that, the data is ready for the classification tasks.

3.2 Design Science research


3.2.1 Design Science research method

This thesis uses the design science method. It uses existing machine learning
techniques to design an experiment and compare results. Actually, in this thesis, what
we do is to develop an artifact to measure the accuracy of different machine learning
techniques. Therefore, we think the design science method is more suitable than an
experiment (Wieringa, 2014). The experimental research method involves the
distinction between two basic conditions, exposed and unexposed independent
variables (Tanner, 2002). That is, there are an experimental group and a control group,
and there can be multiple experimental conditions and control conditions in an
experiment (Tanner, 2002). The research in this thesis is based on the classification
application and discussion of patents, so it is not suitable to use an experiment as the
research method.

3.2.2 Results comparison

In the training model, the data set is read, processed, fed to the machine learning
techniques, and the results are analyzed. Comparing the results helps to conclude
which machine learning techniques and which part of a patent have a better
performance.


4 Research procedure and implementation


The automatic classification of patent text falls within the general field of text
classification, so this paper combines text classification techniques and uses machine
learning algorithms to automatically classify the patent text. Figure 5 shows the structure
of the automatic patent text classification procedure. The workflow of the procedure is
as follows:

Firstly, this experiment needs a test data set. This paper analyzes the US patent official
website, implements a patent collection program, and downloads more than 1400
patent texts in the fields of design, medicine, plant, Swedish, and human. We chose
these five words as search strings in our program to automatically download patents.
We chose them because the IPC standard has eight categories and we just randomly
picked plant, human, medicine, and design; because we do our research in Sweden,
we added Swedish as the fifth search string. After downloading, there are about
1400 patents.

Secondly, we analyze the downloaded patent documents and extract the main data,
including patent title, classification code, patent content, etc. The processed data set is
randomly divided into a training set and a test set according to a ratio.

Thirdly, we preprocess the training set and the test set obtained by dividing the patent
texts separately. The preprocessing task includes stop word removal, feature extraction,
and so on.

Then, for the patent text training set, we use the decision tree, random forest,
XGBoost, and SVM techniques together with the TF/IDF method to construct the text
vectors and train the classification models. TF/IDF shows the best performance for
feature word extraction (Berthold & Thiel, 2012). SVM and XGBoost have a better
performance in classification (Lee, Kwon, Myeongjung, & Kwon, 2018) (Wang, et al.,
2020). Decision tree and random forest are usually used in classification (Lee & Hsiang,
2019). Therefore, we choose these techniques in our research.

Finally, we build the classifiers and use the patent text test set to evaluate the
classification accuracy of the models.


Figure 5. Patent classification system.

4.1 Data collection

The automatic classification of patent text requires a patent text dataset, and there is
currently no open patent text data set ready for use. Through research and analysis of
the US patent website, a corresponding automatic collection program for patent texts
was developed to request several categories of patents within a defined field, and a
large number of patent texts were collected in this field. Crawler technology is needed
to perform this patent text collection. Web crawling is the process of downloading the
content corresponding to a URL from the server to the local machine, similar to the use
of a browser: when we send the URL and HTTP request to the server, the response
returned by the server is displayed to the user after interpretation by the browser.
(Vadivel, Shaila, Mahalakshmi, & Karthika, 2012)

We randomly chose plant, Swedish, design, human, and medicine as search strings
and downloaded about 1400 patents from the US patent website. For this research, we
randomly chose 100 patents because KNIME could not process more than 100 patents
at one time on our laptop. Therefore, we chose twenty patents from each category, for
a total of one hundred. The patent collection program was implemented to collect the
patent texts in different fields. The patent content includes drawings, patent code,
application date, publication date, IPC classification number, applicant, inventor,
abstract, description, and other information. In our patent text classification program,
only some of this information is needed. Therefore, we use Beautiful Soup to parse some
pages (Richardson, 2020). We get the description information, the category information
we need, and the title of the invention. In short, there are three main steps to obtain the
text data. The first step is to obtain the relevant patent names according to the keyword
search and obtain the patent code numbers from the patent document names. The
second step is to find the corresponding single-page preview pdf link, which can be
found from the patent code number. The last step is to get a link to the full pdf document
from the single-page pdf preview page. In summary, we use the Beautiful Soup library
in PyCharm to carry out the data collection step. We did not create an interface for
entering a search string; we need to hard-code the keywords as input data and then run
the program to get the patent files. This is the first step in our research, and then the
data can be used in the classifiers.
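The collection program itself is not reproduced here. As a rough illustration of the approach described above, the following Python sketch uses requests and Beautiful Soup; the search URL, query parameter, and CSS selectors are placeholders (assumptions), since the real USPTO pages require their own parsing rules.

import requests
from bs4 import BeautifulSoup

# NOTE: the search URL, parameters, and selectors below are placeholders;
# the actual pages used in this thesis differ and need their own parsing rules.
SEARCH_URL = "https://example-patent-search.example/search"

def collect_patents(keyword, limit=20):
    """Search for a keyword, parse the result page, and save patent texts to .txt files."""
    page = requests.get(SEARCH_URL, params={"q": keyword}, timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")

    for i, link in enumerate(soup.select("a.result-link")[:limit]):       # hypothetical selector
        detail = requests.get(link["href"], timeout=30)
        detail_soup = BeautifulSoup(detail.text, "html.parser")
        code = detail_soup.select_one(".patent-code").get_text(strip=True)  # hypothetical selector
        text = detail_soup.get_text(" ", strip=True)
        with open(f"{keyword}_{i}_{code}.txt", "w", encoding="utf-8") as f:
            f.write(text)

collect_patents("medicine")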


4.2 Data preprocessing

This part aims at removing the noisy words and useless data in patents. It intends to
obtain a group of vocabulary terms that are valuable for feature word extraction. From
this step on, this research uses the KNIME application as the analysis tool. The workflow
of data import and preprocessing is shown in Figure 6. The data preprocessing step
loads, cleans, and prepares the data for classification. We use the Excel Reader node to
read the data and then the Strings To Document node to convert the specified string in
the read file into a document. For each line, a document is created and attached to the
line. Through this node, we confirm the category that the text data should belong to and
then put the category and the corresponding document together in one line.

In this part, data preprocessing is required because the corpus data was collected by a
crawler: there are some HTML tags in the crawled content, and the non-text parts of
the text data also need to be removed. There are six detailed explanations of the
preprocessing steps in the workflow.

• Punctuation Erasure

The first node in the preprocessing step removes all punctuation marks in the data.
Deleting punctuation marks allows the machine to process the text data better.
(Berthold & Thiel, 2012)

• N Chars Filter node

The second node in the preprocessing step removes all words with fewer than N
characters from the text data, which also makes it easier to process (Berthold & Thiel,
2012). The N value set in this thesis is 3, which means that short words such as is, too,
are, am, and, but are deleted.

• Number filter

The third node mainly removes the numbers in the text data, including integers,
decimals, and negative numbers. (Berthold & Thiel, 2012)

• Case Converter

The fourth node concerns the proper nouns and capitalized vocabulary in a patent text.
Through this node, all characters in the text data are changed to lowercase in order to
unify and process the data. (Berthold & Thiel, 2012)

• Removal of stop words

The fifth node, called the Stop Word Filter, removes English stop words (Berthold &
Thiel, 2012). Stop words are common words that are not representative. This step and
the second step (N Chars Filter) achieve basically similar goals; however, after the
second step, this step is faster and more comprehensive. The text data processed
through these two steps is more convenient for the subsequent data classification.
(AG, 2020)

• Stemming


The sixth node, called the Snowball Stemmer, finds the original form of each word.
English words have several different forms, such as singular, plural, or different tenses.
The stemmer node reduces different forms of the same word to a common stem to
facilitate subsequent data processing. (Berthold & Thiel, 2012)

Figure 6. Processing data workflow.


Source: Thiel, 2014
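For readers who prefer code to the graphical workflow in Figure 6, the following is a rough Python equivalent of the six preprocessing nodes, using NLTK's stop word list and Snowball stemmer; this is an assumption for illustration, since in the thesis these steps are performed by KNIME nodes.

import re
import string
from nltk.corpus import stopwords          # requires: nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = SnowballStemmer("english")

def preprocess(text, min_chars=3):
    text = text.translate(str.maketrans("", "", string.punctuation))       # 1. punctuation erasure
    tokens = text.split()
    tokens = [t for t in tokens if len(t) >= min_chars]                    # 2. N chars filter (N = 3)
    tokens = [t for t in tokens if not re.fullmatch(r"\d+", t)]            # 3. number filter
    tokens = [t.lower() for t in tokens]                                   # 4. case converter
    tokens = [t for t in tokens if t not in STOP_WORDS]                    # 5. stop word removal
    tokens = [STEMMER.stem(t) for t in tokens]                             # 6. stemming
    return tokens

print(preprocess("The 3 loudspeakers were Tested, and improvements were measured."))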

4.3 Feature words extraction

After the feature words in the patent text are obtained, the computer cannot directly
process them; the workflow is shown in Figure 7. Therefore, they need to be converted
into a format that the computer can recognize. (Wang, et al., 2020)

In this phase of feature word processing, we first use the Bag of Words node (Bag of
Words, BoW for short), shown in Figure 7. BoW does not consider the contextual
relationship between words in the text but focuses only on the weights of all words
(Berthold & Thiel, 2012). The weight is related to the frequency with which words appear
in the text. This part consists of three steps: tokenizing, statistically revising word feature
values, and normalizing (Thiel, 2014). The bag-of-words model first performs word
segmentation.

After word segmentation, by counting the number of occurrences of each word in the
text, we can get the word-based features of the text. If these words of each text sample
are put together with their corresponding word frequencies, the text is vectorized. After
the vectorization is completed, the TF-IDF method can be used to correct the weights of
the features, which are then normalized, and the data is then fed into the machine
learning techniques for classification. (Aggarwal C & Zhai, 2012) (Berthold & Thiel,
2012)

TF-IDF is the abbreviation of Term Frequency-Inverse Document Frequency, namely
"word frequency-inverse text frequency" (Aggarwal C & Zhai, 2012). It consists of two
parts, TF and IDF, whose technical description is given in chapter two of this thesis.

TF is the word frequency; vectorization means that the frequency of each word in the
text is counted and used as a text feature. IDF means inverse text frequency. For
example, although the word frequency of "to" is high, it appears in almost all texts and
thus does not help in classification. Its importance should be lower than that of a word
related to the field of the patent, for example, which has a lower word frequency. IDF
helps to reflect the importance of such a word and is used to modify the word feature
value expressed only by the word frequency. IDF reflects the frequency of a word
appearing in all texts. If a word appears in many texts, its IDF value should be low.
Conversely, if a word appears in relatively few texts, its IDF value should be high; for
example, professional terms such as "machine learning" should have a high IDF value.
In an extreme case, if a word appears in all texts, its IDF value should be zero. The
workflow for extracting feature words is shown below. (Guo & Yang, 2016)

Figure 7. Feature words extraction workflow.


Source: Thiel, 2014
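The same bag-of-words and TF-IDF steps can also be written compactly in Python with scikit-learn, as a sketch of what the KNIME nodes in Figure 7 compute; the example documents are invented, and this is an illustrative assumption rather than the thesis workflow itself.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = [
    "plant reproduction by tissue culture techniques",
    "loudspeakers and acoustic electromechanical transducers",
    "analyzing materials by determining their chemical properties",
]

# Bag of words: tokenize and count term occurrences per document
bow = CountVectorizer()
counts = bow.fit_transform(docs)

# TF-IDF: re-weight the counts and normalize each document vector
tfidf = TfidfTransformer(norm="l2")
vectors = tfidf.fit_transform(counts)

print(vectors.shape)               # (number of documents, vocabulary size)
print(sorted(bow.vocabulary_)[:5])  # first few feature words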

4.4 Classification techniques

Reviewing the research reports of the past decades, the text classification problem has
been extensively studied and solved in many practical applications. Especially with the
deepening of research on natural language processing (NLP) and text mining, many
researchers are currently interested in developing applications that utilize text
classification methods (IPscreener, 2019). Most text classification and document
classification systems can be decomposed into four stages: feature extraction,
dimensionality reduction, classifier selection, and evaluation. Patent documents are one
direction in text classification research. Because the cost of manual patent classification
is too high, it is also a meaningful issue for machine classification to replace or assist
manual classification (Wang, et al., 2020). After extracting the feature words, we use
KNIME to build the workflows and apply them. We start with the SVM method to
perform the classification. The workflow is depicted in Figure 8.


Figure 8. SVM technique workflow.


Source: Berthold & Thiel, 2012

Then, we use the second technique, which is the XGBoost algorithm. The workflows are
shown in Figure 9 and Figure 10.

Figure 9. XGBoost linear technique workflow.


Source: Berthold & Thiel, 2012


Figure 10. XGBoost tree technique application.


Source: Berthold & Thiel, 2012

Then, we use the decision tree and random forest techniques. The workflows are shown
in Figure 11 and Figure 12.

Figure 11. Decision tree technique.


Source: Berthold & Thiel, 2012

Figure 12. Random forest technique.


Source: Berthold & Thiel, 2012
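The KNIME workflows in Figures 8 to 12 all follow the same pattern: partition the data, learn a model, predict the test partition, and score the result. The following Python sketch shows a roughly equivalent flow as an assumption for illustration; X would be the TF/IDF feature matrix from section 4.3 and y the category labels, and the parameter values are not the KNIME settings used in our experiments.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def compare_classifiers(X, y, test_size=0.3, seed=0):
    """Partition, learn, predict, and score each classifier on the same split."""
    y = LabelEncoder().fit_transform(y)  # xgboost expects integer class labels
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    learners = {
        "SVM": SVC(kernel="linear"),
        "XGBoost linear": XGBClassifier(booster="gblinear"),
        "XGBoost tree": XGBClassifier(booster="gbtree"),
        "Decision tree": DecisionTreeClassifier(),
        "Random forest": RandomForestClassifier(n_estimators=100),
    }
    for name, clf in learners.items():
        clf.fit(X_train, y_train)
        print(name, "accuracy:", accuracy_score(y_test, clf.predict(X_test)))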


5 Discussion and conclusions


This section includes a discussion of the method, a discussion of the findings, and a
comparison of the results.

5.1 Discussion of method

Some researchers used more than 600,000 patents for classification and obtained a
highest precision of 83.98 when using only the abstract and title as input data (Lee &
Hsiang, 2019). We used 100 patents as input data and obtained a highest precision of
84.5 when using the abstract as input data. This suggests that our results are reliable.
Besides, we reached a similar conclusion that the SVM technique shows the best
performance, which supports the validity of our research. However, we used a small
group of data in our research because of the limitations of the laptop. Therefore, we
could improve on that and use a larger-scale dataset in the classifier to get more
reliable results.

In our research, we combine the DT, RF, SVM, and XGBoost techniques with the TF-IDF method in the classifiers. The results offer a comparison of the different techniques. Since different scholars use different parts of a patent to study classification, this thesis uses the abstract, description, and claims as input data. The abstract is the most common choice and the one most scholars prefer (Christopher, Lin, & Spieckermann, 2011). A small number of scholars mention that the claims part can be used as well (Lee & Hsiang, 2019). Besides, some researchers note that the description part as input data might achieve a good patent classification performance (Suominen, Toivanen, & Seppänen, 2017). Therefore, we use the abstract, claims, and description of a patent as the three research objects to conduct the classification research and comparison. The results show which part is most suitable for patent classification. In short, we consider our method suitable for this research and the results reliable.

5.2 Discussion of findings


5.2.1 Results and analysis

This part shows the results of the experiments, in which recall and precision are reported only for one category, called 1052. Result data is available for five different categories; to keep the presentation clear and simple, one category was randomly chosen to show these results.

Table 1. Classification results using only description data


Technique                             Accuracy    Recall on 1052    Precision on 1052
1. SVM                                92.86%      0.85              0.75
2. XGBoost linear                     88.89%      0.875             0.778
3. XGBoost tree                       78.57%      0.724             0.714
4. Decision tree (Gradient Boosted)   87.5%       0.8               0.75
5. Random forest                      85.75%      0.875             0.875


Table 1 gives a clear comparison of the accuracy values. The accuracy of the SVM technique is 92.86%, so it shows better performance than the other methods. The XGBoost linear technique has 88.89% accuracy, which is the second-best performance. Therefore, the SVM and XGBoost linear techniques have higher accuracy, which means they perform better than the other techniques when the input data is the patent description. The decision tree and random forest techniques have accuracy similar to the XGBoost linear technique, namely 87.5% and 85.75% respectively. The XGBoost tree technique performs poorly when only the description is used as the text data; its accuracy is only 78.57%.

Table 2 shows the results when only the abstract is used as the text data processing object.

Table 2. Classification results using only abstract data


Technique                             Accuracy    Recall on 1052    Precision on 1052
1. SVM                                89.81%      0.85              0.667
2. XGBoost linear                     87.75%      0.875             0.778
3. XGBoost tree                       90.90%      0.764             0.780
4. Decision tree (Gradient Boosted)   83.83%      0.875             0.75
5. Random forest                      88.91%      0.809             0.845

The XGBoost tree technique shows the best performance compared to the other methods when only the abstract is used as the text data, because it has the highest accuracy. The SVM technique also performs well, with an accuracy of 89.81%. The XGBoost linear and random forest techniques show similar performance, with accuracies of 87.75% and 88.91% respectively. The decision tree technique shows the worst performance, with an accuracy of only 83.83%. Therefore, the XGBoost tree and SVM techniques perform best when only the abstract is used as the data processing object.

Table 3 shows the results of using only the claims part as the patent text data.

Table 3. Classification results using only claims data


Technique                             Accuracy    Recall on 1052    Precision on 1052
1. SVM                                63.64%      0.667             0.65
2. XGBoost linear                     62.23%      0.667             0.667
3. XGBoost tree                       72.73%      0.75              0.875
4. Decision tree (Gradient Boosted)   63.64%      0.667             0.65
5. Random forest                      61.76%      0.818             0.55

Based on the result table, the XGBoost tree technique shows the best performance because it has the highest accuracy. The SVM and decision tree techniques show the same accuracy, 63.64%. The XGBoost linear and random forest techniques do not show high accuracy. When comparing these five techniques, only the XGBoost tree technique shows an acceptable performance, which means that only the XGBoost tree could be applied when the claims are used as the patent text data. In general, when we feed only the claims part to our models, the results do not show good performance.

5.2.2 Comparison results

Three different experiments were compared to find which part of a patent is best suited as the object for classification. When using the description, the automatic classification of the patents works best; the accuracy is more than 92% with the SVM technique. Using the abstract data is also relatively good, with an accuracy of around 90%. The claims part, however, does not perform well for classification, with an accuracy below 73%. Therefore, it is recommended to use the description for automatic patent classification and to feed it as input data to the classifier.

Here is a comparison of accuracy in Table 4.

Table 4. Comparison of accuracy


Technique                             Description    Claims     Abstract
SVM                                   92.86%         63.64%     89.81%
XGBoost linear                        88.89%         62.23%     87.75%
XGBoost tree                          78.57%         72.73%     90.90%
Decision tree (Gradient Boosted)      87.5%          63.64%     83.83%
Random forest                         85.75%         61.76%     88.91%

As can be seen from the comparison of the different machine learning algorithms, the SVM method performs best, followed by the XGBoost tree method. They perform better than the other techniques, especially when the text data object is the description part. The random forest and decision tree algorithms, on the other hand, are not very suitable for automatic patent classification, particularly when only a small dataset is available.

The second performance metric we use for the comparison and analysis of techniques
is the recall value. Table 5 shows the results of recall value.

Table 5. Comparison of recall


Technique                             Description    Claims     Abstract
SVM                                   0.85           0.667      0.85
XGBoost linear                        0.875          0.667      0.875
XGBoost tree                          0.724          0.75       0.764
Decision tree (Gradient Boosted)      0.8            0.667      0.875
Random forest                         0.875          0.818      0.809

The recall value is the ratio of the number of retrieved relevant documents to the number of all relevant documents in the document collection, i.e. the recall rate of the retrieval system. The closer the recall value is to 1, the better the performance. Based on the table, the XGBoost linear and random forest techniques perform better than the others when the description is used as the text data input. The random forest technique shows the best performance when the claims are used as input data. The XGBoost linear and decision tree techniques perform better than the others when the abstract is chosen as input data.
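
Expressed in classification terms for a single category such as 1052 (using the standard confusion-matrix counts, which is an assumption about how the scorer reports its results), recall and precision can be written as

\[
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad \mathrm{precision} = \frac{TP}{TP + FP}
\]

where TP, FP, and FN are the true positives, false positives, and false negatives for that category, while the accuracy reported earlier is the share of correctly classified patents over all categories.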

Table 6 illustrates the precision values for category 1052.

Table 6. Comparison of precision


Technique                             Description    Claims     Abstract
SVM                                   0.75           0.65       0.667
XGBoost linear                        0.778          0.667      0.778
XGBoost tree                          0.714          0.875      0.780
Decision tree (Gradient Boosted)      0.75           0.65       0.75
Random forest                         0.875          0.55       0.845

According to Table 6, the random forest and XGBoost linear techniques perform better when the description is used as text data. The XGBoost tree and random forest techniques perform better when the abstract is used as the processing data. The XGBoost tree technique performs better when the claims are used as the data input.

5.2.3 Summary of results

Table 7 shows the comparison results from accuracy, recall value, and precision value.

Table 7. Comparison of performance


            Description                            Claims                         Abstract
Accuracy    SVM                                    XGBoost tree                   SVM, XGBoost tree
Recall      SVM, XGBoost linear, Random forest     XGBoost tree, Random forest    SVM, XGBoost linear, Decision tree
Precision   XGBoost linear, Random forest          XGBoost tree                   XGBoost tree, XGBoost linear, Random forest


Table 7 serves as a summary to answer the research questions of this thesis. There are two research questions to be answered. The first one is: which part of a patent as input data performs best in relation to automatic classification? For this question, the description part shows the best accuracy. So, we can affirm that the description of a patent can be used as the main data part to effectively accomplish automatic classification of patent documents.

The second research question is: which of the implemented machine learning algorithms performs best regarding the classification of IPC keywords? To give an appropriate answer to this question, we should tackle it from the perspective of the data input. For instance, the SVM is the most suitable technique for classification when the used data is the patent description text. On the other hand, the XGBoost tree technique should be used when the abstract or claims text is used as the main input data.

5.2.4 Comparison with other studies

Most researchers mention that the SVM technique shows the best patent classification performance, and most of them use the abstract or title as input data. We obtained similar results: the SVM technique shows the best performance when we use the description as input data. However, when we use the abstract as input data, the XGBoost tree technique achieves the best performance. We think the first reason is that those studies did not include XGBoost when they used the abstract for classification. The second reason might be that the description of a patent performs better than the abstract as input data.

However, some researchers state that "using patent claims alone is sufficient for classification" (Lee & Hsiang, 2019). We obtained a different result: in our research, we found that the patent claims are not suitable for patent classification. The reason might be that our test data is smaller than theirs, so the accuracy when we use the claims as input data is lower than in their research. Besides, they compared the claims with the abstract and title and concluded that the patent claims perform better. If they had also compared the results with the description, the description might have shown better performance.

5.3 Conclusion and perspective

Generally, in a text classification task the text document is represented as a fixed-length feature vector. Then, machine learning methods are used to train the model and make predictions based on these feature vectors. For the text representation, the simplest and most common method is the bag-of-words model, and this is the method we preferred to use in this thesis. It counts the words in each document to represent the feature vectors. The weights of these feature vectors are usually based on the TF-IDF weighting technique. The advantage of the TF-IDF feature weighting method is that it can make full use of the distinguishing words in a document. However, this method also has some shortcomings: simple word-counting statistics cannot capture the deep semantic information of words and may cause high-dimensionality problems (Tseng & Lin, 2007). Nevertheless, the performance in this thesis is still relatively good.

To better answer the research questions of this thesis, we performed experiments on three different data inputs, namely the abstract, description, and claims. According to the results obtained from these three experiments and their comparison, we affirm that the description part is the most important data, with which we can achieve the best performance in English patent text classification. The description text can be used as the main classification input, while the abstract can serve as an auxiliary standard for classification. However, the classification based on the claims part, proposed by some scholars, has not achieved good performance in our research.

In this study, the classification of patent texts is based on only a small sample dataset, and the text corpus includes only the patent abstract, description, and claims, which are studied separately. Other patent-related information has not been fully utilized. The classification still needs to be extended with more data, further research, and verification, and the classification results can then be applied to patent analysis.

On the other hand, with the development of technology and the combination of deep neural networks and natural language processing, classification methods based on word vectors and convolutional neural networks have also proved their efficiency. In the future, we can continue to study the technical methods in this area and focus on them to carry out further research on the automatic classification of patent texts.

6 References
Abdelaal, H., Ahmed, A., Ghribi, W., & Alansary, H. Y. (26 September 2019). Knowledge Discovery in the Hadith According to the Reliability and Memory of the Reporters Using Machine Learning Techniques. IEEE Access, 7, pp. 41-55.

AG, K. (2020). KNIME Home page. Retrieved from KNIME: https://www.knime.com

Aggarwal, C. C., & Zhai, C. X. (2012). Mining Text Data. New York: Springer Science+Business Media.

Araghinejad, S., & Modaresi, F. (2014). A Comparative Assessment of Support Vector Machines, Probabilistic Neural Networks, and K-Nearest Neighbor Algorithms for Water Quality Classification. Springer Science+Business Media Dordrecht.

Aristodemou, L., Tietze, F., Athanassopoulou, N., & Minshall, T. (2017). Exploring the Future of Patent Analytics: A Technology Roadmapping Approach. Centre for Technology Management working paper series.

Berthold, D., & Thiel, D. (2012). Technical Report: The KNIME Text Processing Feature: An Introduction. Retrieved from KNIME: https://www.knime.com/solutions

Bulbul, H. I., & Unsal, Ö. (2011). Comparison of Classification Techniques used in Machine Learning as Applied on Vocational Guidance Data. 2011 10th International Conference on Machine Learning and Applications and Workshops. Honolulu: IEEE.

Cheng, G., Huang, Y. Q., & Kyebambe, N. M. (2017). Forecasting emerging technologies: A supervised learning approach through patent analysis. Technological Forecasting & Social Change, pp. 236-244.

Christopher, I., Lin, S., & Spieckermann, S. (2011). Automated Patent Classification. Retrieved from WIPO: https://www.wipo.int/portal/en/index.html

European Patent Office, & United States Patent and Trademark Office. (2010). Cooperative patent classification. Retrieved from Objectives: https://www.cooperativepatentclassification.org/obj

Guo, A., & Yang, T. (2016). Research and improvement of feature words weight based on TFIDF algorithm. 2016 IEEE Information Technology, Networking, Electronic and Automation Control Conference. Chongqing: IEEE.

Hongjie, L., Wang, H., Du, D., Wu, H., Chang, B., & Chen, E. (2018). Patent Quality Valuation with Deep Learning Models. Springer International Publishing AG.

IPscreener. (2019). IPscreener by autoMatch Technology. Retrieved from IPscreener: https://ipscreener.com

Jae, H., & Key, S. (2007). Patent document categorization based on semantic structural information. Elsevier.

JetBrains. (2020). PyCharm. Retrieved from JetBrains: https://www.jetbrains.com/pycharm/

Kamran Kowsari, K. J. (2019). Text Classification Algorithms: A Survey. Charlottesville: Department of Systems and Information Engineering, University of Virginia.

Lee, C. Y., Kwon, O., Myeongjung, K., & Kwon, D. (2018). Early identification of emerging technologies: A machine learning approach using multiple patent indicators. Elsevier, 291-303.

Lee, J. S., & Hsiang, J. (2019). Patent Classification by Fine-Tuning BERT Language Model. Taiwan: Department of Computer Science and Information Engineering, National Taiwan University.

Li, S., Hu, J., Cui, Y., & Hu, J. (2018). DeepPatent: patent classification with convolutional neural networks and word embedding. Springer, 721-744.

Lin, H. J., Wang, H., Du, D. F., Wu, H., Chang, B., & Chen, E. H. (2018). Patent Quality Valuation with Deep Learning Models. Springer International Publishing AG, pp. 474-490.

Lupu, M. (2017). Information retrieval, machine learning, and Natural Language Processing for intellectual property information. World Patent Information, 49, pp. A1-A3.

Olsson, E., & Söderström, A. (2019). Data-Driven Decisions in Mergers & Acquisitions. Chalmers digitaltryck.

PRV. (2020). PRV. Retrieved from prv.se: https://www.prv.se/en/about-us/

Richardson, L. (17 May 2020). Beautiful Soup. Retrieved from Crummy: https://www.crummy.com/software/BeautifulSoup/#Download

Suominen, A., Toivanen, H., & Seppänen, M. (2017). Firms' knowledge profiles: Mapping patent data with unsupervised learning. Technological Forecasting & Social Change, 115, pp. 131-142.

Taher, K. A., Jisan, B. M., & Rahman, M. (2019). Network Intrusion Detection using Supervised Machine Learning Technique with Feature Selection. 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST). Dhaka: IEEE.

Tanner, K. (2002). Chapter 7: Experimental research designs. In K. Williamson, Research methods for students, academics and professionals: Information management and systems. Centre for Information Studies, Charles Sturt University.

Thiel, K. (2014). Sentiment Analysis. Retrieved from KNIME: https://www.knime.com/blog/sentiment-analysis

Tran, T., & Kavuluru, R. (2017). Supervised Approaches to Assign Cooperative Patent Classification (CPC) Codes to Patents. Springer International Publishing AG, 107, 22-34.

Tseng, Y.-H., & Lin, C.-J. (2007). Text mining techniques for patent analysis. ResearchGate, 1216-1247.

USPTO. (2020). Patents. (Office of the Chief Communications Officer) Retrieved from United States Patent and Trademark Office: https://www.uspto.gov

Vadivel, A., Shaila, S., Mahalakshmi, R., & Karthika, J. (2012). Component based effective web crawler and indexer using web services. IEEE International Conference On Advances In Engineering, Science And Management (ICAESM-2012). Nagapattinam, Tamil Nadu: IEEE.

WIPO IP Services. (2019). Innovators File Record Number of International Patent Applications, With Asia Now Leading. Retrieved from WIPO 2018 IP Services: https://www.wipo.int/pressroom/en/articles/2019/article_0004.html

Wu, J.-L., Chang, P.-C., Tsao, C.-C., & Fan, C.-Y. (2016). A patent quality analysis and classification system using self-organizing maps with support vector machine. Elsevier B.V., 305-316.

Wang, P., Yan, Y., Si, Y., Zhu, G., Zhan, X., Wang, J., & Pan, R. (2020). Classification of Proactive Personality: Text Mining Based on Weibo Text and Short-Answer Questions Text. IEEE Access, 8, pp. 70-82.

Xia, B., Li, B., & Lv, X. (2016). Research on Patent Document Classification Based on Deep Learning. Atlantis Press, 308-311.

Wieringa, R. J. (2014). Design Science Methodology for Information Systems and Software Engineering. Springer.
