Research On High-Value Patent Identification Model
Research Article
Keywords: patent transfer, high-value patent identification, imbalanced data, ensemble learning
DOI: https://doi.org/10.21203/rs.3.rs-4239996/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Ying Li
College of Economics and Management, China Jiliang University
No. 258, Xueyuan Street, Hangzhou, Zhejiang Province, 310018, P.R. China
2577325924@qq.com
Xiangli Han
College of Economics and Management, China Jiliang University
No. 258, Xueyuan Street, Hangzhou, Zhejiang Province, 310018, P.R. China
1768103222@qq.com
Bin He
Syney Elevator (Hangzhou) Co., Ltd
No. 31, Yangcheng Street, Hangzhou, Zhejiang Province, 310000, P.R. China
hebin@syney.net
RESEARCH ON HIGH-VALUE PATENT IDENTIFICATION MODEL FROM
PERSPECTIVE OF PATENT TRANSFER
ABSTRACT
Accurately identifying high-value patents has become difficult as the number of patent applications has risen
dramatically, which contributes to a low commercialization rate of patented achievements. Whether a patent is
transferred is an important reflection of its value. To address these problems, we propose a high-value
patent identification model that combines a hybrid sampling technique with an ensemble learning algorithm. First, we add
the technical capacity of patentees to the traditional high-value patent identification indicators to reconstruct the
indicator system, and then reduce the indicator system to eliminate redundant indicators. Second, we use Adaptive
Synthetic Sampling with Local Outlier Factor (ADASYN-LOF) to expand the minority samples and balance the data.
Finally, we use a Genetic Algorithm (GA) to optimise the parameters of AdaBoost. For clarity, this model is called the
ADASYN-LOF-GA-AdaBoost model. To test its effectiveness, we use patent data from the field of scientific
instruments. The results demonstrate that the proposed model achieves an ACC of 94.47%, an AUC of 94.87%, a recall
of 97.54%, and an F1-score of 95.23%, and that it performs better than the other models compared. Therefore, this
model can effectively identify high-value patents.
Keywords: patent transfer; high-value patent identification; imbalanced data; ensemble learning
1. Introduction
With the rapid advancement of technology and increasing global economic competition, the importance of
intellectual property has become more and more prominent. In the era of the knowledge economy, patents as an
important form of intellectual property play a crucial role in protecting innovation and driving economic development.
Furthermore, patents serve as essential carriers of core technologies and reflect a nation's scientific and technological
strength. Consequently, supported by strategic initiatives and policy encouragement from various countries, the global
patent application volume has been increasing year by year. According to a report released by the World Intellectual
Property Organization, the number of patent applications submitted by over 150 countries and regions worldwide
reached 3.4 million in 2022 (Kong et al. 2023). However, as the volume of patent applications continues to rise, the
quality of patents varies greatly. Relevant studies have shown that half of the patent value in the world comes from
only 5% of high-value patents, and only a minority of patents can truly achieve the transfer and transformation of
patent achievements (Zhou et al. 2021). High-value patents can help enterprises make targeted R&D investments,
avoid financial risks, enhance technological competitive advantages, and increase market share. For countries, they
can strengthen national scientific and technological strength (Huang et al. 2021). Therefore, the development of
accurate and efficient methods for identifying high-value patents has become a hot topic of academic research.
The research on high-value patent identification mainly focuses on the construction of patent identification
indicator systems and the improvement of identification methods. Traditional high-value patent identification
indicator systems are primarily constructed along market, legal, and technological dimensions. For example, Wang et al. (2019) used
five indicators such as the number of cited patents, the number of patent families, patent coverage, the number of
claims per patent, and the number of patent litigations to identify leading technologies in sustainable energy. Hu et al.
(2023) confirmed that indicators such as patent family size and first citation speed are significant in measuring patent
value when identifying high-value patents in the integrated circuit field. However, most of these indicators have a lag
and are difficult to use for identifying newly authorized patents. In terms of identification methods, scholars calculate
patent value by determining the weights of indicators based on the constructed indicator system (Huang et al. 2022).
However, this approach is time-consuming and labor-intensive, making it challenging to handle large amounts of data.
Therefore, scholars have begun to use machine learning to identify high-value patents. Yang et al. (2022) measured
the technological novelty value of each technology and used K-means clustering to identify technological
opportunities in the drone field. Liu et al. (2023) comprehensively considered the patent value evaluation model and
the credit evaluation model for small and medium-sized enterprises, constructed 17 measurement indicators, and
finally used RF to screen out high-value underlying assets. Although these methods have made certain contributions
to the identification of high-value patents, they overlook the imbalance of patent data, which means that the number
of high-value patents is relatively small compared to non-high-value patents. Most machine learning algorithms are
based on balanced datasets for classification. Directly training on imbalanced datasets can lead to model distortion
Given this, this study aims to enhance the accuracy of high-value patent identification by constructing the
following framework: Firstly, high-value patents are defined based on whether patent transfer has occurred. Secondly,
to effectively identify newly granted patents, the traditional dimensions of technology, legal, and market are
supplemented with technical capacity of patentees dimension to optimize the indicator system. Thirdly, addressing
the issue of imbalanced data, a combined model of ADASYN-LOF-GA-AdaBoost is constructed. This involves using
the ADASYN-LOF algorithm to balance the samples and utilizing GA to find the optimal parameter combination for
AdaBoost, further enhancing the classification performance of the model. Finally, this model is compared with other
2. Literature Review
The research on high-value patents has always received considerable attention. Its connotation is closely related
to the development of national strategies and socio-economic development. Currently, there is no consensus on the
definition of high-value patents both domestically and internationally. However, most scholars agree that the
formation of patent value cannot be separated from its efficient transformation and utilization (Fischer et al. 2014).
Patents have broad prospects for industrialization and market application, as well as strong policy adaptability. By
converting and applying patents through various means such as possession, use, transfer, and mortgage, higher
The definition of high-value patents involves multiple disciplines and perspectives, primarily focusing on patent
lifespan, patent protection strength, and the degree of patent innovativeness. Odasso et al. (2015), starting from different
types of buyers and sellers, argued that the transaction price of a patent depends on its remaining lifespan. Lee (2008)
found in his research that the quality and value of a patent can be proxied by its lifespan, and factors such as the scale
of invention, the number of claims, and the number of self-citations can influence the patent lifespan. Although using
patent lifespan to measure high-value patents has achieved certain results, it is difficult to predict the value of some
expired patents from the perspective of patent lifespan, which may lead to the loss of important information (Mansfield
et al. 1981). Therefore, scholars have begun to use patent protection strength to measure patent value. Klemperer
(1990) was the first to introduce the concept of patent breadth. He believed that the strength of patent protection
depends on the breadth of the patent, and the larger the scope of protection, the higher the cost for competitors to
imitate. Ultimately, the larger the profits allocated to the patent holder, the higher the patent value will be. However,
patent breadth is usually specific to a particular technical field and lacks universality for cross-field composite
technologies. Currently, some scholars are attempting to measure patent value from the perspective of patent
innovativeness by using patent citation information (Miao et al. 2021). They believe that the number of citations a
patent receives can be used to assess the competitiveness and influence of a technological invention. Patents with a
high number of citations are likely to be high-value patents. Yang et al. (2015) proposed a patent value evaluation
method based on a comprehensive patent citation network and used this method to identify more high-value patents
in the field of optical disk technology. Danish et al. (2023) used the number of citations as a measure of high-value
patents to evaluate the value of Indian patents from the perspectives of inventors, companies, and technology.
Measuring patent value based on the number of citations is simple, straightforward, and widely applicable. However,
this indicator has a lag effect and a time dilation effect, and its ability to explain differences in patent value is limited
(Huang et al. 2020).
With the deepening of research, scholars have found a connection between the transferability of patents and their
value. The transferability of patents refers to the possibility of realizing the potential value of patents through
transactions, that is, whether patents can be transferred or sold (Ko 2019). Patent value refers to the economic benefits
brought to enterprises by patents in the process of operation and the contributions to the enterprise's development
strategy under actual market conditions. The transferability of patents can be evaluated by examining whether the
patent has been transferred. The value of a patent has an impact on its transferability. Patents with higher value are
usually easier to be transferred because buyers are more willing to invest in patents with potential economic returns.
The construction of a high-value patent identification indicator system directly affects the accuracy of the
identification. Existing research on the construction of indicator system mainly falls into three major dimensions:
legal, market, and technological. Change et al. (2014) added patent family depth and revenue plan ratio to the indicator
system based on features such as the number of claims and patent citations, and used logistic regression to identify
and analyze high-value patents in the LED industry. The results showed that the selected indicators were positively
correlated with patent value. Danish et al. (2020) evaluated patent value based on indicators such as patent family size,
technological scope, and patent grant. Chiu et al. (2015) constructed a high-value patent evaluation indicator system
using references, patent citation counts, non-patent references, and other indicators, and then applied it to evaluate the
value of multinational patents. In current research on high-value patent identification indicator systems, scholars tend
to construct a system that identifies high-value patents based on three dimensions: legal, technological, and market.
However, for some newly granted patents in their early stages, their market value may not be fully realized and their
lifespan may be relatively short. Therefore, relying solely on these three dimensions for value assessment can make it
Existing research has shown that factors related to the patentee are positively correlated with patent value. To
better identify high-value patents, the construction of an indicator system needs to take into account the technical
proficiency of the patentee. Lee et al. (2018) demonstrated through research that incorporating patent ownership
indicators can effectively identify emerging technologies, which have a significant impact on patent value. Caviggioli
et al. (2016) further discovered a significant positive correlation between patent ownership-related indicators and
patent value. Chung et al. (2021) identified competitive partners by assessing the technical capabilities, concentration,
and scale of inventor groups in order to enhance the conversion value of patents.
Due to the uncertainty of high-value patents and the complexity of patent data, early research methods for
identifying high-value patents primarily relied on market-based criteria to evaluate patent value. These methods can
be categorized into four major types: cost approach, market approach, income approach, and real options approach.
Scholars estimate the costs invested in patent technology during the research and development stage through the cost
approach to predict the future value of the patent. Moreno et al. (2016) conducted an analysis of future pricing in the
pharmaceutical industry based on cost-benefit analysis, while also considering non-patent pricing and future patients.
However, there is no clear correlation between the R&D investment of most patents and the subsequent returns,
making it difficult to accurately estimate the value of patents. Some scholars also use the market approach to determine
the value of a patent to be evaluated based on the transaction prices of similar patents in the market. Hsu et al. [26]
used the market approach to match university patents with those granted to listed companies with similar
characteristics and estimated the potential value of university patents based on the stock market's response to the
matched patents. The implementation of the market approach is challenging and relies on comprehensive transaction
data platforms. The income approach involves estimating and discounting the future market prospects of patents. Oh
et al. (2022) used the income approach to estimate the market value of future innovative technologies in order to gain
an early understanding. In a highly competitive market environment, the uncertainty of valuation using the income
approach is extremely high. Chung et al. (2019) developed a new theoretical framework to assess the value of software
using real option theory. Although this method is practical, the models used are complex and difficult to understand,
making it challenging to address the uncertainty of variables in the future and exhibiting significant randomness.
Methods based on market criteria approach patent value assessment from a static perspective and can only roughly
calculate patent value indicators, making it difficult to handle large amounts of patent data (Chen et al. 2016).
Over time, scholars have begun to develop a set of high-value patent evaluation indicator systems in order to
address the inapplicability of market-based criteria methods for assessing value. A commonly used approach is the
comprehensive evaluation method, which primarily includes analytic hierarchy process (AHP), fuzzy comprehensive
evaluation method, entropy weight method, and others. Huang et al. (2022) evaluated patent value by calculating the
weights of patent evaluation indicators using the AHP. Wang et al. (2015) used the fuzzy comprehensive evaluation
method to assess patent value. Yuan et al. (2022) combined the entropy weight method with TOPSIS to improve the
rationality of indicator weights and applied this method to evaluate patents in the field of solar cell technology.
Research on the evaluation of high-value patents based on the comprehensive evaluation method has become relatively
mature. However, when assigning weights to indicators, subjective value judgments play a significant role, which can
be time-consuming and labor-intensive. For patents with different uses, the factors influencing value vary
significantly, making it difficult to construct a universal patent value indicator system using this method. Additionally,
With the development of artificial intelligence technology, an increasing number of scholars have begun to apply
machine learning techniques to the identification of high-value patents. Erdogan et al. (2022) believe that, against the
backdrop of a rapid increase in patent applications, identifying high-value patents is crucial for enterprises to make
precise investments. Therefore, they constructed a predictive model combining supervised algorithms with the analytic
hierarchy process to identify high-value patents. Kwon et al. (2020) established a multi-dimensional indicator system
and constructed six machine learning models, including DT, XGBoost, and RF, to demonstrate the effectiveness of
this method. Han et al. (2022) used support vector machines (SVM) to identify cutting-edge technologies in the field
of electric vehicles and verified that this method performs well in terms of prediction accuracy and generalization
ability. Lee et al. (2018) extracted quantitative indicators from patent data and used AdaBoost to predict sustainable
technology transfers, concluding that the algorithm exhibits good classification performance in technology transfer
predictions. Machine learning algorithms can not only easily handle massive amounts of data but also automatically
learn the importance of high-value patent indicators for weight assignment, making identification more efficient.
However, when machine learning is used for classification, imbalanced data can degrade classification performance.
A review of existing research shows that scholars have rarely considered data imbalance in machine-learning-based
high-value patent identification, resulting in relatively low identification accuracy.
3. Methodology
This paper proposes a high-value patent identification model based on machine learning. The specific approach
is as follows: Firstly, a high-value patent identification indicator system is constructed from the dimensions of legal,
market, technological, and technical capacity of patentees. The required data is obtained from the PatSnap patent
database. Secondly, the indicators are reduced to eliminate redundant ones. Then, the ADASYN-LOF method is used
to expand the dataset of the retained indicators, ensuring a balanced distribution of the data. AdaBoost is subsequently
employed for training and testing, and the accuracy of high-value patent identification is further improved by
incorporating a genetic algorithm (GA). Finally, classification evaluation metrics are used to assess the performance
of the machine learning model, ultimately leading to the development of a high-value patent identification model
3.1 The construction and reduction of high-value patent identification indicator system
Based on existing research, this paper constructs a high-value patent indicator system suitable for machine
learning. The selected indicators cover four dimensions: legal, market, technological, and technical capacity of
patentees. Some time-lag indicators have been excluded, and the specific indicators are listed in Table 1.
(1) The technological dimension reflects the technological level of the patent itself. The number of IPC
classification codes reflects the technical coverage of the patent, indicating the connotation of the technology, which
is measured by the number of four-digit IPC subcategories of the patent. The number of cited patents refers to the
number of other patents cited by the target patent, reflecting the technological foundation of the target patent. The
number of non-patent citations refers to the number of scientific papers cited by the patent. The more citations there
are, the more likely it is that the patent is based on a larger number of research results, and its technological level is
likely to be higher. The number of pages in the document can be used to describe the structure, technical essentials,
and usage methods of a certain patent technology. The more pages there are, the higher the complexity of the patent
technology.
(2) The patent legal dimension primarily measures the statutory value of a patent from the perspectives of patent
application process, application cost, maintenance cost, and scope of protection. The number of claims refers to the
scope of protection claimed by the target patent. The more claims there are, the more technical features the patent has,
indicating stronger innovation capabilities and higher advancement of the patent. The number of litigation cases
reflects the legal effectiveness of a patent. Patents with higher technological content and stronger novelty are more
likely to encounter litigation. The number of independent claims reflects the innovation and practicality of a patent in
solving technical problems. The higher the technological innovation and practicality, the higher the patent value. The
examination duration refers to the time span from the patent application date to the patent grant date. Patents with
longer examination durations indicate more advanced patent technology. The duration of maintenance refers to the
period between the patent grant date and the estimated expiration date of the patent. The longer the duration of
(3) The patent market dimension primarily measures patent value based on the scope of patent protection and
patent type. The number of simple patent families refers to the count of patents within a patent family that share the
same priority right. A larger patent family size indicates a stronger patent protection network, a more complete
technical portfolio layout, and a higher value of the patent family. The number of simple family members represents
the number of countries where patent applications are filed, reflecting the international competitiveness of the patent.
Patents that are applied for protection in multiple countries generally have higher value. Whether a patent belongs to
a strategic emerging industry or a national economic industry also reflects the degree of patent value to a certain extent.
(4) The technical capacity of patentees dimension primarily reflects the technical strength of patent-related
entities. The number of current patentees and inventors is used to reflect the cooperation situation of patents. The more
inventors there are, the more knowledge and experience contributed by different inventors, leading to a stronger
knowledge base and a higher level of value. Different types of applicants have varying tendencies towards patent
transfer. Scientific research institutions often undertake the work of technology research and development, while
enterprises focus on the market operation of technology. In this paper, the types of patentees are divided into
companies, scientific research institutions, individuals, government agencies and others, and are labeled as 0, 1, 2, 3,
4 accordingly. Technical influence refers to the total number of patents published by the patent holder in the field.
Generally, the more patents an inventor publishes in the field, the more thoroughly they understand the knowledge in
that field, and the more likely they are to create high-value patents. Overall technical strength refers to the total number
of patents invented by the patent inventor, reflecting their technological innovation capabilities and strength.
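Several of the indicators above are simple derivations from raw patent records. The following is a minimal sketch of three of them (the count of four-character IPC subclasses, the examination duration, and the 0 to 4 coding of patentee types), assuming hypothetical field names such as `ipc_codes` and `application_date` that are not taken from PatSnap or from this paper.

```python
from datetime import date

# Label encoding for patentee type, following the 0-4 coding described in the text.
PATENTEE_TYPE_CODES = {
    "company": 0,
    "scientific research institution": 1,
    "individual": 2,
    "government agency": 3,
    "other": 4,
}

def count_ipc_subclasses(ipc_codes):
    """Count distinct four-character IPC subclasses, e.g. 'G01N 21/64' -> 'G01N'."""
    return len({code.strip()[:4].upper() for code in ipc_codes if code.strip()})

def examination_duration_days(application_date, grant_date):
    """Days from the application date to the grant date."""
    return (grant_date - application_date).days

def encode_patent(record):
    """Turn one raw record (hypothetical field names) into numeric indicators."""
    return {
        "ipc_count": count_ipc_subclasses(record["ipc_codes"]),
        "examination_days": examination_duration_days(
            record["application_date"], record["grant_date"]
        ),
        # Unknown patentee types fall back to the "other" code.
        "patentee_type": PATENTEE_TYPE_CODES.get(record["patentee_type"], 4),
        "inventor_count": len(record["inventors"]),
    }

# Made-up record used only to exercise the functions.
example = {
    "ipc_codes": ["G01N 21/64", "G01N 33/50", "G01B 11/00"],
    "application_date": date(2015, 3, 1),
    "grant_date": date(2017, 3, 1),
    "patentee_type": "scientific research institution",
    "inventors": ["A", "B", "C"],
}
features = encode_patent(example)
```

Integer codes for patentee type imply an ordering that the categories do not have; tree-based learners such as the decision stumps commonly used in AdaBoost tolerate this, whereas other learners might need one-hot encoding.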
After establishing the identification indicator system for high-value patents, this paper reduces the constructed
indicators to eliminate redundancy and improve the prediction accuracy of the model. Consequently, an optimized
Firstly, we divide the dataset. To ensure the validity of the final test results, we employ the ten-fold cross-
validation method for our experiments. Specifically, the dataset is randomly divided into ten parts, with nine parts
serving as the training set and one part serving as the test set in rotation. This process of training and testing is repeated
ten times. The model's performance is then evaluated based on the combined results of these ten iterations.
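The ten-fold procedure described above can be sketched in a few lines. This is an illustrative outline only: `evaluate` is a hypothetical callback that would train a classifier on the training indices and return its score on the test indices, and the toy evaluator below merely reports the training share.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Mean score over k rounds; `evaluate(train_idx, test_idx)` returns one score."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # The other k-1 folds together form the training set for this round.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

folds = k_fold_indices(95, k=10)
# Toy evaluator: reports the share of samples that ended up in the training part.
mean_train_share = cross_validate(95, lambda tr, te: len(tr) / (len(tr) + len(te)))
```

Each sample appears in exactly one test fold, so every patent contributes to the reported mean exactly once.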
Secondly, we obtain the initial mean prediction accuracy 𝑄̅0 of the model. Imbalanced data can lead to poor
classification performance of the model. To make the reduction results more effective, we employ the ADASYN-LOF
method to balance the dataset before reducing the indicators. Subsequently, we use the AdaBoost to train and test the
dataset containing the original indicators, obtaining the identification accuracy 𝑄̅0 .
$$I_i = \bar{Q}_i - \bar{Q}_0 \tag{1}$$
$\bar{Q}_0$ represents the mean accuracy of the initial model, and $\bar{Q}_i$ represents the mean accuracy of the model after the
removal of the ith indicator. By iteratively removing features with replacement from the original set of indicators, we
calculate the importance indices for each indicator. Indicators are then ranked in ascending order according to their
importance indices, meaning that the higher the ranking, the greater the impact of the indicator on improving the
After obtaining the ranking of indicator importance indices, we proceed with indicator reduction. Using a
recursive method, we extract one indicator at a time from the ranked list of indicator importance indices and calculate
its mean model identification accuracy. If the mean accuracy of the model improves, we retain that indicator.
Otherwise, we continue by attempting to add the next indicator from the list. This process continues until all indicators
in the list have been considered. The algorithm stops when no further improvement is achieved, and we are left with
After indicator reduction, we use the dataset with the retained indicators for high-value patent identification by the
ADASYN-LOF-GA-AdaBoost combined model. The workflow is illustrated in Figure 1. It primarily includes two
steps: First, data balancing treatment based on the ADASYN-LOF model. Second, classification processing based on
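[Figure 1. Workflow of the ADASYN-LOF-GA-AdaBoost model. Recoverable flowchart labels: start; training set (90%); testing set (10%); optimize AdaBoost by GA; ten-fold cross-validation, repeated 10 times (NO/YES decision); model classification results; end.]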
ADASYN is an oversampling method proposed by He [36] and its core principle is to use density distribution
parameters as the distribution standard. Based on the varying degrees of difficulty in learning different minority class
samples, ADASYN applies a weighted distribution, generating more synthetic samples for those minority class
samples that are harder to learn compared to those that are easier to learn. The ADASYN algorithm improves learning
in two ways: (1) Reducing biases caused by imbalanced categorical data. (2) Adaptively shifting the classification
decision boundary towards difficult sample instances. Most outlier detection methods rely on density, angles,
distances, and other factors to delineate hyperplanes and identify anomalous points. These methods are based on the
similarity of data points. However, the Local Outlier Factor (LOF) is a detection method that starts from the data
density surrounding a sample point. It assigns a local reachability density to each sample point and analyzes the outlier
degree of the sample based on its outlier factor derived from this reachability density, determining whether it is an
outlier or not. The LOF algorithm is simple and intuitive, considering both local and global attributes of the dataset.
ADASYN-LOF first oversamples the original data with ADASYN, and the resulting data inevitably contain noise
samples. LOF is then applied to remove this noise, resulting in a balanced dataset
that is more conducive to classification processing. The specific training process is outlined in Table 2.
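The two stages described above can be sketched in plain Python. This is a simplified illustration rather than the procedure in Table 2: the neighbourhood size `k=3`, the balance level `beta=1.0`, the LOF cut-off `threshold=1.5`, and the choice to compute LOF over the original minority samples plus the synthetic ones are all assumptions made here.

```python
import math
import random

def neighbors(i, points, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def adasyn(minority, majority, k=3, beta=1.0, rng=None):
    """Generate synthetic minority samples, more of them near the class boundary."""
    rng = rng or random.Random(0)
    points = minority + majority            # minority samples come first
    n_min = len(minority)
    # r[i]: share of majority samples among the k neighbours of minority sample i
    r = [sum(j >= n_min for j in neighbors(i, points, k)) / k for i in range(n_min)]
    total = sum(r)
    if total == 0:                          # no minority sample is "hard to learn"
        return []
    n_new = (len(majority) - n_min) * beta  # samples needed to reach the balance level
    synthetic = []
    for i in range(n_min):
        g_i = round(r[i] / total * n_new)   # harder samples get more synthetic points
        for _ in range(g_i):
            z = rng.choice(neighbors(i, minority, min(k, n_min - 1)))
            lam = rng.random()              # interpolate between sample i and neighbour z
            synthetic.append(tuple(a + lam * (b - a)
                                   for a, b in zip(minority[i], minority[z])))
    return synthetic

def lof_scores(points, k=3):
    """Local Outlier Factor of every point; values well above 1 suggest outliers."""
    n = len(points)
    nbrs = [neighbors(i, points, k) for i in range(n)]
    k_dist = [math.dist(points[i], points[nbrs[i][-1]]) for i in range(n)]

    def lrd(i):                             # local reachability density
        reach = [max(k_dist[j], math.dist(points[i], points[j])) for j in nbrs[i]]
        return len(reach) / max(sum(reach), 1e-12)

    dens = [lrd(i) for i in range(n)]
    return [sum(dens[j] for j in nbrs[i]) / (len(nbrs[i]) * dens[i]) for i in range(n)]

def adasyn_lof(minority, majority, k=3, threshold=1.5, rng=None):
    """Oversample with ADASYN, then drop synthetic points whose LOF exceeds threshold."""
    synthetic = adasyn(minority, majority, k=k, rng=rng)
    scores = lof_scores(minority + synthetic, k=k)
    kept = [s for s, score in zip(synthetic, scores[len(minority):]) if score <= threshold]
    return minority + kept

# Toy data: one minority sample sits next to the majority class.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0)]
majority = [(1.1, 1.0), (1.0, 1.2), (1.2, 1.1), (2.0, 2.0), (2.1, 1.9),
            (1.9, 2.2), (2.2, 2.1), (3.0, 3.0), (3.1, 2.9), (2.9, 3.1)]
synthetic = adasyn(minority, majority, rng=random.Random(0))
balanced_minority = adasyn_lof(minority, majority, rng=random.Random(0))
```

In the toy data only the minority sample at (1.0, 1.0) has majority neighbours, so all synthetic points are generated around it; the LOF pass then discards any of them that land in a sparse region.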
The AdaBoost (Adaptive Boosting) algorithm was proposed by Freund and Schapire [35]. The basic idea of the algorithm
is as follows: Initially, each of the 𝐺 samples in the dataset is assigned the same weight of 1/𝐺. During the training
process, samples that are misclassified are given higher weights, allowing the classifier to focus on learning from these
incorrectly classified samples in the next iteration, resulting in a new sample distribution. The algorithm generates a
weak classifier in each round of learning and increases the weight of the classifier with higher accuracy. Finally,
multiple classifiers are combined to form a strong classifier. The training process of the algorithm is as follows:
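Input: original dataset $Q = \{(x_1, y_1), (x_2, y_2), \dots, (x_G, y_G)\}$, $x_i \in X \subseteq \mathbb{R}^n$, $y_i \in Y = \{-1, +1\}$; $H$ represents the base
classifier.
1) Initialize the weight distribution of the samples:
$$T_1 = (w_{11}, \dots, w_{1i}, \dots, w_{1G}), \quad w_{1i} = \frac{1}{G}, \quad i = 1, 2, \dots, G \tag{2}$$
2) For $m = 1$ to $M$:
a. Use the dataset with weight distribution $T_m$ to learn and obtain the weak classifier $H_m(x)$.
b. Calculate the classification error rate $e_m$ of $H_m(x)$. If $e_m > 50\%$, discard the weak classifier.
The weight $\alpha_m$ of the weak classifier $H_m(x)$ is computed as:
$$\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m} \tag{5}$$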
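The training loop above can be illustrated with a compact AdaBoost that uses one-feature threshold stumps as weak classifiers. This is a sketch under assumptions: the stump learner, the exponential weight update, and the decision to stop once a stump is worse than chance are standard textbook choices made here, not details confirmed by the source.

```python
import math

def train_stump(X, y, w):
    """Best threshold stump (error, feature, threshold, sign) under sample weights w."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] <= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def stump_predict(stump, row):
    _, f, thr, sign = stump
    return sign if row[f] <= thr else -sign

def adaboost_fit(X, y, M=10):
    """Train up to M weak classifiers; returns a list of (alpha_m, stump)."""
    G = len(X)
    w = [1.0 / G] * G                            # Eq. (2): uniform initial weights
    ensemble = []
    for _ in range(M):
        stump = train_stump(X, y, w)
        e_m = stump[0]
        if e_m > 0.5:                            # worse than chance: discard and stop,
            break                                # since the weights would not change
        e_m = min(max(e_m, 1e-10), 1 - 1e-10)    # guard against log(0) and division by 0
        alpha = 0.5 * math.log((1 - e_m) / e_m)  # Eq. (5)
        ensemble.append((alpha, stump))
        # Raise the weights of misclassified samples, lower the rest, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy one-dimensional data that no single stump can separate.
X = [[x] for x in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost_fit(X, y, M=3)
predictions = [adaboost_predict(model, row) for row in X]
```

On this toy set the best single stump misclassifies three of ten samples, while the weighted vote of three stumps classifies all ten correctly, which is the behaviour the weight update is designed to produce.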
The strong classifier integrated by the AdaBoost exhibits higher stability and classification accuracy. However,
it is challenging to determine the most suitable number of iterations during the iterative process, and it is difficult to
identify the optimal combination of weights for all weak classifiers. In GA, a vector of $n$ decision variables,
$X = [x_1, x_2, \dots, x_n]$, represents a point in the decision space. Each component $x_i$ is treated as a gene, and the vector $X$
represents a feasible solution to the problem. The optimization problem is thus transformed into a search for $X$.
The greatest advantage of GA is its ability to simulate the process of biological evolution. Through population
selection, crossover, mutation, and other operations, it screens for the best individuals and determines the global
optimal parameters. The specific iterative process is shown in Figure 2. By combining GA with AdaBoost, GA can
be used to tune the hyperparameters of the AdaBoost model, obtaining the optimal number of iterations and optimizing
[Figure 2. Iterative process of the GA. Recoverable flowchart labels: start; selection; crossover; mutation; termination condition met? (NO: repeat; YES: end).]
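The selection, crossover, and mutation loop can be sketched as follows. The two tuned hyperparameters, their ranges, the tournament selection scheme, and the elitism step are assumptions made for illustration, and `toy_fitness` is a hypothetical stand-in for the mean cross-validated accuracy that a real run would obtain by training AdaBoost.

```python
import random

# Search ranges for two AdaBoost hyperparameters (assumed for illustration).
BOUNDS = {"n_estimators": (10, 300), "learning_rate": (0.01, 2.0)}

def random_individual(rng):
    return {
        "n_estimators": rng.randint(*BOUNDS["n_estimators"]),
        "learning_rate": rng.uniform(*BOUNDS["learning_rate"]),
    }

def toy_fitness(ind):
    """Stand-in for mean cross-validated accuracy; peaks at (120, 0.5)."""
    return (1.0 - ((ind["n_estimators"] - 120) / 300) ** 2
                - ((ind["learning_rate"] - 0.5) / 2.0) ** 2)

def select(population, fitness, rng, size=3):
    """Tournament selection: best of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)

def crossover(a, b, rng):
    """Uniform crossover: each gene comes from either parent."""
    return {key: rng.choice((a[key], b[key])) for key in a}

def mutate(ind, rng, rate=0.2):
    """With probability `rate`, resample a gene inside its bounds."""
    fresh = random_individual(rng)
    return {key: fresh[key] if rng.random() < rate else ind[key] for key in ind}

def genetic_search(fitness, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):           # termination: fixed generation budget
        best = max(population, key=fitness)
        history.append(fitness(best))
        children = [best]                  # elitism keeps the best individual
        while len(children) < pop_size:
            child = crossover(select(population, fitness, rng),
                              select(population, fitness, rng), rng)
            children.append(mutate(child, rng))
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best_params, history = genetic_search(toy_fitness)
```

Because the best individual is always carried over, the best fitness never decreases from one generation to the next; in practice the fitness call dominates the cost, since each evaluation would involve a full cross-validated training run.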
4. Empirical analysis
The data for this study come from PatSnap. The database deeply integrates patent data from 116 countries and
regions around the world, dating back to 1790, encompassing over 140 million patent records. By reviewing the
literature related to the field of scientific instruments, we conducted a search using query TAC:("measurement" OR
"metrology") AND ("instrument" OR "device" OR "system") AND ("sensor" OR "detector" OR "transducer") AND
("accuracy" OR "precision" OR "calibration") ISD: [ 2013 TO 2020] in the PatSnap database retrieving a total of
The specific steps for performing indicator reduction using the original dataset are as follows:
Step 1: calculate the mean identification accuracy of the original indicator dataset.
ADASYN is used to augment the dataset to obtain a balanced dataset, and LOF is then applied to the newly generated
samples to remove noise, resulting in the final balanced dataset. AdaBoost is used to train and test on this
dataset, with the results shown in Table 3. The accuracy of AdaBoost on the original data is only
70.68%, but after balancing the data with ADASYN-LOF, the accuracy reaches 89.18% and exhibits good stability.
Through Step 1, the mean ACC of AdaBoost on the dataset containing all indicators is 89.18%, denoted as $\bar{Q}_0$.
Then, an AdaBoost model is established using the dataset with indicator $i$ removed, and the resulting ACC is denoted
as $\bar{Q}_i$. The importance index of each indicator is calculated as $I_i = \bar{Q}_i - \bar{Q}_0$, and the results are sorted in ascending order.
Based on the sorted list of indicator importance indices obtained in Step 2, indicators are selected in sequence.
If adding a selected indicator during training improves the identification accuracy of the model, that
indicator is confirmed and retained. If the identification accuracy does not improve, the indicator is discarded and
the next indicator in the list is tried. The algorithm ends once all indicators in the sorted list have been tried.
The final list of retained indicators is shown in Table 5.
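The three reduction steps can be sketched as follows. This is an illustrative outline rather than the authors' code: `reduce_indicators` and `evaluate` are names introduced here, `evaluate(subset)` stands for "mean ACC of AdaBoost trained on this subset of indicators", and the toy scoring table `TOY_GAIN` is invented so that the example runs without a dataset.

```python
def reduce_indicators(indicators, evaluate):
    """Rank indicators by leave-one-out importance, then add them back greedily."""
    # Step 1: baseline accuracy Q_0 with all indicators.
    q0 = evaluate(list(indicators))
    # Step 2: importance index I_i = Q_i - Q_0, where Q_i omits indicator i.
    importance = {
        i: evaluate([j for j in indicators if j != i]) - q0 for i in indicators
    }
    ranked = sorted(indicators, key=lambda i: importance[i])  # ascending order
    # Step 3: add indicators in ranked order, keeping only those that help.
    retained, best_acc = [], float("-inf")
    for ind in ranked:
        acc = evaluate(retained + [ind])
        if acc > best_acc:
            retained.append(ind)
            best_acc = acc
    return ranked, retained, best_acc


# Hypothetical contribution of each indicator to accuracy (illustration only).
TOY_GAIN = {"ind_a": 0.20, "ind_b": 0.10, "ind_c": 0.00, "ind_d": -0.05}


def toy_accuracy(subset):
    # Stand-in for the mean ACC of a model trained on `subset`.
    return 0.5 + sum(TOY_GAIN[name] for name in subset)


ranked, retained, best_acc = reduce_indicators(list(TOY_GAIN), toy_accuracy)
print(ranked, retained, best_acc)
```

Because $I_i$ is most negative for the indicator whose removal hurts accuracy most, ascending order places the most important indicators first. In the toy run the redundant `ind_c` and the harmful `ind_d` are both dropped.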
No.  Indicator                                              ACC
3    whether it belongs to a strategic emerging industry    0.738250533
4    number of non-patent citation documents                0.848887534
5    number of claims                                       0.879609875
6    maintenance duration                                   0.88381591
7    technological influence                                0.895736056
8    number of citation patents                             0.907930509
9    overall technological strength                         0.910947882
10   examination duration                                   0.914660165
[Figure: ACC, vertical axis from 0 to 1.]
patentee, two are from the technology dimension, three are from the legal dimension, and one is from the market
dimension. According to Table 6, the final model identification accuracy is 91.46%, which is 2.28 percentage points
higher than the average identification accuracy of 89.18% obtained using all indicators. Therefore, reducing the
indicators and eliminating redundant ones improves the identification accuracy of the model.
This study focuses on a binary classification problem, thus four evaluation metrics, namely Accuracy (ACC),
AUC, Recall, and F1-score, are employed to assess the performance of the model. In binary classification, the actual
categories in the dataset can be combined with the predicted categories by the classifier, resulting in four categories,
(1) ACC
ACC represents the proportion of correctly classified samples to the total number of samples. In this study, it
is the ratio of correctly classified patents (high-value and non-high-value alike) to the total number of patents. It is a commonly used
performance metric in classification tasks with imbalanced data. The ACC can be expressed in terms of the binary confusion matrix as:
$$ACC = \frac{TP + TN}{TP + FP + TN + FN} \tag{8}$$
(2) AUC
The Receiver Operating Characteristic (ROC) curve is created by plotting the true positive rate against the false
positive rate at various threshold settings, based on the predicted results of the classifier. The Area Under the Curve
(AUC) is the area beneath the ROC curve. If the ROC curves intersect, it can be difficult to judge the superiority of
the models. In such cases, using the AUC can effectively avoid this problem. Generally, a higher AUC value indicates
(3) Recall
Recall, also known as sensitivity, represents the proportion of actual positive samples that are predicted correctly.
A higher recall indicates that fewer minority class samples are misclassified:
$$Recall = \frac{TP}{TP + FN} \tag{9}$$
(4) F1-score
The F1 score is the harmonic mean of precision and recall, where precision is $TP / (TP + FP)$. It is
high only if both recall and precision are relatively large. Therefore, the F1 score reflects how well the
algorithm identifies positive samples while limiting false positives.
$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{10}$$
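As a worked illustration of the four metrics, the sketch below computes ACC, Recall, and F1-score from confusion-matrix counts following Eqs. (8) to (10), and computes AUC through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one). The function names and the eight-sample toy labels and scores are assumptions made for this example and are not data from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    acc = (tp + tn) / (tp + fp + tn + fn)                   # Eq. (8)
    recall = tp / (tp + fn) if tp + fn else 0.0             # Eq. (9)
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0   # Eq. (10)
    return {"ACC": acc, "Recall": recall, "Precision": precision, "F1": f1}


def auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical): 3 high-value patents (1) and 5 others (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```

Here TP = 2, FP = 1, TN = 4, and FN = 1, so ACC = 6/8 = 0.75 while Recall, Precision, and F1 all equal 2/3. The gap between ACC and Recall shows why accuracy alone can look favourable when the positive class is small.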
To further enhance the classification performance of AdaBoost, we introduced the GA for parameter
optimization. Through multiple iterations, we identified the optimal combination, where the number of base classifiers
in the model was 318, the learning rate was 1.069, and the maximum tree depth was 16. This improved the convergence
speed and accuracy of the model. Furthermore, to validate the effectiveness of our model, we applied the GA-
AdaBoost model to a dataset with retained indicators that had been balanced using ADASYN-LOF. We compared the
classification performance of the GA-AdaBoost model with that of BP neural networks, DT, RF, and AdaBoost. After
ten-fold cross-validation, we obtained the mean values of ACC, AUC, Recall, and F1-score for these models on the
indicating that the classification performance of the ensemble algorithm is better than that of a single algorithm. In
addition, the performance of the ADASYN-LOF-AdaBoost model after mixed sampling processing is significantly
incorporates GA optimization on the basis of mixed sampling, is further enhanced. By comparing the ACC, it can be
observed that the ACC of the ADASYN-LOF-GA-AdaBoost is 94.47%, which is higher than that of other models,
indicating that the model has strong discrimination ability and can accurately identify high-value patents. Regarding
the AUC metric, the AUC of the BP neural network is only 54.45%, and the AUC of DT is 56.74%. However, the
AUC of the ADASYN-LOF-GA-AdaBoost reaches 94.87%. The Recall of the ADASYN-LOF-GA-AdaBoost
reaches 97.54%, indicating that the model can identify more high-value patents. In terms of the F1-score, the F1 of
the BP neural network is only 49.76%, the mean F1 of DT is 75.54%, the mean F1 of AdaBoost is 77.86%, and the
mean F1 of the ADASYN-LOF-AdaBoost model is 91.23%. However, the mean F1 of the ADASYN-LOF-GA-
AdaBoost model reaches 95.23%, demonstrating that the overall performance of this model is superior to the other
models.
5. Discussion
The main contributions of this paper are as follows. Firstly, this study optimizes the construction of the indicator
system. Based on a review of previous studies, we add the technical capacity of patentees to improve identification
accuracy. Secondly, to address the low identification accuracy of high-value patents caused by data
imbalance, we introduce ADASYN to expand the minority class and use LOF to remove noise samples, obtaining a balanced
dataset. Thirdly, for AdaBoost, it is difficult to determine the most appropriate number of iterations and the best
combination of weights for the weak classifiers. We therefore use
GA to optimize the parameters of AdaBoost and further improve the classification performance of the model. These
6. Conclusion
How to quickly and accurately identify high-value patents with transfer potential among massive patent data is
of great significance for promoting the transformation of patent achievements. Addressing the shortcomings in
existing research, this paper proposes a high-value patent identification method based on machine learning. Firstly, in
response to the inadequacies in the construction of indicator systems in current studies, we incorporate the dimension
of patentee to reconstruct the indicator system. Secondly, in view of the imbalance in high-value patent data, an
ADASYN-LOF-GA-AdaBoost model is proposed. This model utilizes ADASYN to expand minority class samples
and applies LOF for noise reduction to mitigate the degree of data imbalance. Finally, the GA-optimized AdaBoost is
employed for classification to further enhance the classification performance of the model.
Based on the empirical analysis in this paper, the following two conclusions can be drawn: (1) It is effective to
incorporate the dimension of patentee into the indicators for high-value patent identification. In the ranking of the
importance indices of indicators, it is found that the indicators related to the dimension of patentee introduced in this
paper are all ranked at the top, and they can still be retained after indicator reduction. This further proves the rationality
of incorporating such indicators. (2) A combined model can enhance classification performance. This paper proposes
the ADASYN-LOF-GA-AdaBoost model, which uses ADASYN-LOF to balance the data at the data level
and GA to optimize AdaBoost. Compared with other models, this combined model achieves the best
classification results, demonstrating the effectiveness of the high-value patent identification method proposed in this
paper.
This study also has some limitations, which future research could address. First, the data
used in this paper are structured and include only patent data. Future work could use multi-source
data, for example by combining patents and papers, and could apply text mining methods to further
identify high-value patents. Second, this paper relies on conventional machine learning algorithms. If text features are added
in future research, deep learning algorithms could be considered. Finally, for reasons of database availability, this
paper uses the PatSnap database; the more authoritative Derwent or Incopat databases could be used in the future.
Acknowledgements: This article is supported by Zhejiang Provincial Philosophy and Social Sciences Planning
Project (grant number 24NDJC215YB), the Key project of the Zhejiang Provincial Soft Science Research Program
(grant number 2024C25010), the Key Program of Zhejiang Province (grant number 2021C01027), and Special Project
for the Alliance of high-level Universities in the Changjiang Delta (grant number CSJYB202312).
Author contributions: Z.W. proposed the framework and wrote the manuscript. Y.L. collected the patent data and
conducted empirical analysis. X.H. provided suggestions for modification of the manuscript. B.H. revised the
Data availability: All data generated or analyzed during this study are included in this published article.
REFERENCES
[1] Chang K C, Hao J, Chen C, et al. The relationships between the patent deployment strategy and patent
[2] Chiu C C, Su H N. What is the value of internationalized patent? [C]//2015 Portland International Conference on
[3] Chen Y M, Liu H H, Liu Y S, et al. A preemptive power to offensive patent litigation strategy: Value creation,
transaction costs and organizational slack[J]. Journal of Business Research, 2016, 69(5):1634-1638.
[4] Caviggioli F, Ughetto E. Buyers in the patent auction market: Opening the black box of patent acquisitions by non-
practicing entities[J]. Technological Forecasting and Social Change, 2016, 104: 122-132.
[5] Chung S, Animesh A, Han K, et al. Software patents and firm value: A real options perspective on the role of
innovation orientation and environmental uncertainty[J]. Information Systems Research, 2019, 30(3): 1073-1097.
Page 21
[6] Chung J, Ko N, Yoon J. Inventor group identification approach for selecting university-industry collaboration
[7] Danish M S, Ranjan P, Sharma R. Valuation of patents in emerging economies: a renewal model-based study of
Indian patents[J]. Technology analysis & strategic management, 2020, 32(4): 457-473.
[8] Danish M, Sharma R. The value of Indian patents: an empirical analysis using citation lags approach[J]. Economics
[9] Erdogan Z, Altuntas S, Dereli T. Predicting Patent Quality Based on Machine Learning Approach[J]. IEEE
[10] Freund Y, Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting[J].
[11] Fischer T, Leidinger J. Testing patent value indicators on directly observed patent value—An empirical analysis
[12] He H, Yang B, Garcia E A, et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning[C]//
Proceeding of the 2008 International Joint Conference on Neural Networks. Piscataway: IEEE, 2008:1322-1328.
[13] Huang Y, Chen L, Zhang L. Patent citation inflation: The phenomenon, its measurement, and relative indicators
[14] Huang K G L, Huang C, Shen H, et al. Assessing the value of China's patented inventions[J]. Technological
[15] Hsu D H, Hsu P H, Zhou T, et al. Benchmarking US university patent value and commercialization efforts: A
[16] Huang Z, Li J, Yue H. Study on comprehensive evaluation based on AHP-MADM model for patent value of
[17] Han F, Zhang S, Yuan J, et al. Assessing future technological impacts of patents based on the classification
algorithms in machine learning: The case of electric vehicle domain[J]. Plos one, 2022, 17(12): e0278523.
[18] Hu Z, Zhou X, Lin A. Evaluation and identification of potential high-value patents in the field of integrated
circuits using a multidimensional patent indicators pre-screening strategy and machine learning approaches[J].
Page 22
[19] Klemperer P. How broad should the scope of patent protection be? [J]. RAND Journal of
Economics,1990,21(1):113-130.
[20] Ko N, Jeong B, Seo W, et al. A transferability evaluation model for intellectual property[J]. Computers &
[21] Kwon U, Geum Y. Identification of promising inventions considering the quality of knowledge accumulation: A
[22] Kong J, Zhang J, Deng S, et al. Knowledge convergence of science and technology in patent inventions[J]. Journal
[23] Lee Y G. Patent licensability and life: A study of US patents registered by South Korean public research
[24] Lee C, Kwon O, Kim M, et al. Early identification of emerging technologies: A machine learning approach using
multiple patent indicators[J]. Technological Forecasting and Social Change, 2018, 127: 291-303.
[25] Lee J, Kang J H, Jun S, et al. Ensemble modeling for sustainable technology transfer[J]. Sustainability, 2018,
10(7): 2278.
[26] Liu C, Shi Y, et al. A novel approach to screening patents for securitization: a machine learning-based predictive
[27] Mansfield E, Schwartz M, Wagner S. Imitation costs and patents: An empirical study [J]. Economic
Journal,1981,91:907-918.
[28] Moreno S G, Ray J A. The value of innovation under value-based pricing[J]. Journal of market access & health
[29] Miao Y Z, Salomon R M, Song J. L-earning from technologically successful peers: the convergence of Asian
[30] Odasso C, Scellato G, Ughetto E. Selling patents at auction: an empirical analysis of patent value[J]. Industrial
[31] Oh J W, Park H W. Income approach to technology valuation for innovations[J]. International Journal of
[32] Wang M H, Hsiao Y C, Tsai B H, et al. Fuzzy markup language with genetic learning mechanism for invention
patent quality evaluation[C]//2015 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2015: 251-258.
Page 23
[33] Wang H, Sun B, Wang P. Dominant technology identification model based on patent information toward
[34] Yang G C, Li G, Li C Y, et al. Using the comprehensive patent citation network (CPC) to evaluate patent value[J].
[35] Yang W, Cao G, Peng Q, et al. Effective Identification of Technological Opportunities for Radical Inventions
Using International Patent Classification: Application of Patent Data Mining[J]. Applied Sciences, 2022, 12(13):
6755.
[36] Zhou Y, Dong F, Liu Y, et al. A deep learning framework to early identify emerging technologies in large-scale
outlier patents: An empirical study of CNC machine tool[J]. Scientometrics, 2021, 126: 969-994.
[37] Yuan X, Song W. Evaluating technology innovation capabilities of companies based on entropy-TOPSIS: the
case of solar cell companies[J]. Information Technology and Management, 2022, 23(2): 65-76.
Page 24