Doceng2012 Submission 19

Conceptualization Effects on MEDLINE Documents
Classification Using Rocchio Method

Shereen Albitar, Sebastien Fournier, Bernard Espinasse
Aix Marseille Université, LSIS UMR CNRS 7296
Domaine Universitaire de St Jérôme, 13397 Marseille Cedex 20, France
shereen.albitar@lsis.org, sebastien.fournier@lsis.org, bernard.espinasse@lsis.org
ABSTRACT
The aim of this paper is to propose an efficient supervised classification method for Web pages. Web page classification can be considered a particular case of text classification, for which various methods have been proposed. In this paper, a text classification method, Rocchio, has been chosen for its efficiency and extensibility. This method is first tested using several similarity measures on the Ohsumed text corpus, composed of abstracts of biomedical articles retrieved from the MEDLINE database. Analyzing the statistical results, many limitations are identified and discussed. In order to overcome these limitations, this paper proposes to integrate semantic aspects into the Rocchio classification through a conceptualization task. This conceptualization is applied to the Ohsumed corpus, mapping terms in text to their corresponding concepts in the UMLS Metathesaurus in order to take meaning into consideration during text classification. Conceptualization effects on Rocchio classification using different standard similarity measures and conceptualization strategies are discussed.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering, H.3.1 [Content Analysis and Indexing]: Linguistic processing, H.3.4 [Systems and Software]: Performance evaluation, Semantic Web. J.3 [LIFE AND MEDICAL SCIENCES]: Medical information systems, MEDLINE. I.7.1 [Document and Text Editing]: Document management. I.2.7 [Natural Language Processing]: Text analysis.

General Terms
Algorithms, Measurement, Performance, Experimentation.

Keywords
Web page Classification, Semantic classification, Information retrieval, Rocchio, Similarity measures, conceptualization.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Conference’10, Month 1–2, 2010, City, State, Country.
Copyright 2010 ACM 1-58113-000-0/00/0010…$10.00.

1. INTRODUCTION
Nowadays, due to the explosive increase in information published on the Web, existing search engines seem unable to respond efficiently to user requests. This is often related to traditional keyword-based indexing techniques that neglect the search context [1]. Aiming at more efficient and less time expensive search on the Web, it seems adequate to involve classification techniques in order to consider the contents of search engine answers, applying thorough filtering and ranking. Web page classification is currently a challenging research topic, particularly in areas such as information retrieval, recommendation, personalization, user profiles, etc.

Compared with plain text documents, Web pages have a heterogeneous structure: many features can be extracted from different parts of a Web page's HTML code (title, metadata, header, URL, …) in addition to its contents [2, 3]. Web page classification can therefore be considered a particular case of text classification. Despite these differences, the principles of plain text classification also apply to Web page classification.

At first, text classification was an entirely manual task performed by experts. It was then semi-automated thanks to rules, which generally bind the occurrence of certain keywords or "features" in a document to certain classes. However, rule implementation and maintenance is a complex and time-consuming task, and rules adapt poorly to changes in their original context and to new contexts [2].

Thus, new classification methods based on learning techniques appeared. Commonly based on supervised learning using a training corpus, these methods learn decision criteria, often crystallized in induced rules or statistical estimations, in order to discriminate between classes. The training corpus is usually prepared through manual tagging, which associates training documents with their relevant classes. This preparation effort is significant; nevertheless, it is less considerable than rule implementation [4]. The most popular text classification methods are: Naïve Bayes Classifier (NB), Support Vector Machines (SVMs), Rocchio, and K Nearest Neighbor (KNN).

The NB classifier [5], also called "The Binary Independence Model", is based on the independence hypothesis, considering each feature independently when calculating the class prototype during the training phase. This unrealistic hypothesis, despite its simplicity, has critical weaknesses. SVMs [6-8] are efficient methods for classification; nevertheless, their learning complexity is high. This complexity is related to the number of features characterizing documents, so feature selection is indispensable for eliminating noisy and irrelevant features [4]. KNN [6] is also sensitive to noisy examples in the training set, in addition to its slow classification when using a large corpus [4].

Concerning Rocchio, or centroïd-based classifier [7], the centroïd vectors of classes, learned during training, represent a classification model that summarizes the characteristics that occur in training documents. This summarization is relatively absent in other classification methods, except for NB that summarizes terms
occurrences in different classes in the learned term-probability distribution functions. Moreover, Rocchio takes class summarization into account in classification, as each test document is compared to the classes' centroïds using similarity measures. NB also uses the learned probability distributions during classification to estimate the probability of occurrence of each term independently, neglecting all term co-occurrences.

The vector-based (binary or TF/IDF) representation used by the preceding methods permits semantic integration, or "Conceptualization", which enriches the document representation model using a background knowledge base [8]. In addition, both KNN and Rocchio make it possible to involve knowledge bases in decision making through semantic similarity functions [9].

In conclusion, we consider Rocchio an adequate baseline classifier for text and Web page classification. Its efficiency, simplicity, and extensibility with semantic resources, in addition to other advantages, lead us to choose Rocchio for the rest of this work.

The next section presents experimentations realized using the Rocchio method with several similarity measures on the Ohsumed corpus, which is composed of abstracts of biomedical articles retrieved from the MEDLINE database. Analyzing the statistical results, many limitations are identified and discussed in order to propose appropriate solutions in the third section through the use of semantics in classification. In the fourth section we apply the Rocchio method to a conceptualized Ohsumed corpus using the UMLS (Unified Medical Language System) Metathesaurus and the MetaMap tool. The fifth section presents some preliminary results using these conceptualized documents. Conceptualization effects on Rocchio classification according to different standard similarity measures and conceptualization strategies are discussed. Finally, we conclude with an assessment of our work, followed by different research perspectives.

2. EVALUATING ROCCHIO CLASSIFIER USING DIFFERENT SIMILARITY MEASURES
In the previous section, the Rocchio classifier was chosen for its efficiency and semantic extensibility. This section presents an experimental study of Rocchio-based classification on text documents, referring to some implementation details. Using five frequently used similarity measures [10] (Cosine, Jaccard, Pearson, Kullback-Leibler, and Levenshtein) separately in the experimentations enables us to evaluate Rocchio's performance independently of the similarity calculation used in decision making. Afterwards, results of different system configurations applied to the Ohsumed corpus are compared and analyzed. Finally, we discuss certain limitations in these results, intending to overcome them by integrating semantic aspects into the classification process.

2.1 Rocchio Implementation details
Rocchio, or centroïd-based classification [7], for text documents is widely used in Information Retrieval tasks, in particular for relevance feedback [11]. The training and classification phases of Rocchio are illustrated in Figure 1.

Figure 1. Rocchio classifier in detail

2.1.1 Training Phase
The Training Phase exploits a training corpus of documents and is composed of two tasks: the Vectorization task and the Rocchio Centroïd Computing task.

2.1.1.1 Vectorization Task
This task transforms the text documents of the training corpus into vectors according to the Vector Space Model (VSM). It is realized through four steps (Tokenization, Stemming, Stopword Removal and Weighting) without applying any feature selection technique. Multiple weighting schemes might be used to represent the importance of each token in a document [12], such as idf, idf-prob, Odds Ratio, χ², etc. According to TF/IDF, the most popular scheme, the score of a term $t_j$ in document $d_i$ is estimated as follows:

$$ w_{ij} = tf_{ij} \cdot \log\left(\frac{N}{df_j}\right) \tag{1} $$

$tf_{ij}$: Frequency of term $t_j$ in document $d_i$.
$N$: Number of documents.
$df_j$: Number of documents that contain term $t_j$.

The result of applying the vector space model (VSM) to a text document is a weighted vector of features:

$$ d_i = (w_{i1}, w_{i2}, w_{i3}, \ldots, w_{in}) \tag{2} $$

2.1.1.2 Rocchio Centroïd Computing task
Based on the weighted vectors obtained by applying the preceding task to the training corpus, this task calculates the classes' centroïds according to the Rocchio method. Each class is represented by a vector positioned at the center of the sphere delimited by the training documents related to this class. This vector is called the class's centroïd, as it summarizes all features of the class as collected during the learning phase. These features result from applying VSM to the training documents as detailed earlier. Having n classes in the training corpus, n centroïd vectors {C1, C2, …, Cn} are calculated throughout the training phase.
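The two training tasks can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stopword list and the four toy documents are hypothetical, stemming is omitted for brevity, and the centroïd is assumed here to be the plain average of the TF/IDF vectors of a class, which is one common reading of "the center of the sphere delimited by training documents".

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, tiny stopword list used only for this illustration.
STOPWORDS = {"the", "of", "and", "in", "a", "is", "to"}


def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords (stemming is omitted)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(token_lists):
    """Weight every document with w_ij = tf_ij * log(N / df_j), equation (1)."""
    n_docs = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # each document counts once per term
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def centroids(vectors, labels):
    """Assumption: a class centroid is the average of its document vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            sums[label][term] += weight
    return {
        label: {term: total / counts[label] for term, total in terms.items()}
        for label, terms in sums.items()
    }


# Toy training corpus (hypothetical documents, labels borrowed from Table 1).
docs = [
    "tumor growth in the liver",
    "liver tumor biopsy",
    "heart failure and heart rhythm",
    "heart valve surgery",
]
labels = ["C04", "C04", "C14", "C14"]
vectors = tfidf_vectors([tokenize(d) for d in docs])
class_centroids = centroids(vectors, labels)
```

Sparse dictionaries are used instead of dense arrays because most terms do not occur in most documents; a term that is absent from a document simply contributes zero to the class average.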
2.1.2 Classification Phase
To classify a new document, the Classification Phase uses the centroïds computed during the previous Training Phase and is composed of two tasks: a Vectorization task and a Similarity Computing task.

2.1.2.1 Vectorization task
The same four vectorization steps already applied to the training corpus are applied here to the new document $d_x$ in order to classify it.

2.1.2.2 Similarity Computing task
The document vector resulting from the preceding task is compared to the centroïds of all n candidate classes using a similarity measure. The class of the document $d_x$ is the one represented by the most similar centroïd, i.e. the one having the maximum similarity value according to the following function:

$$ \arg\max_{i = 1, 2, \ldots, n} SimMeasure(d_x, C_i) \tag{3} $$

2.2 Similarity Measures
Many similarity measures have been used for both document classification and document clustering [13] to estimate the similarity between a document and a class prototype. Using VSM, this similarity is calculated by comparing a document vector with the vector representing a class, i.e. its centroïd. The five similarity measures used in the following experimentations on Rocchio (Cosine, Jaccard, Pearson, Kullback-Leibler, and Levenshtein) are introduced next.

2.2.1 Cosine similarity measure
Cosine is the most popular similarity measure and is widely used in the information retrieval, document clustering, and document classification research domains.

Given two vectors $A(a_1, a_2, \ldots, a_n)$ and $B(b_1, b_2, \ldots, b_n)$, the similarity between these vectors is estimated using the cosine of the angle they delimit:

$$ Sim_{Cosine}(A, B) = \frac{A \cdot B}{|A| \, |B|} \tag{4} $$

Where:
$A \cdot B = \sum_i a_i b_i$
$|A|^2 = \sum_i a_i^2$
$i \in [1, n]$; n: the number of features in the vector space.

In systems using this similarity measure, changing the documents' length has no influence on the result, as the angle they delimit remains the same.

2.2.2 Jaccard similarity measure
Jaccard (sometimes called Tanimoto) estimates the similarity as the intersection divided by the union of the two compared entities. Given two vectors $A(a_1, a_2, \ldots, a_n)$ and $B(b_1, b_2, \ldots, b_n)$, the Jaccard similarity between A and B is by definition:

$$ Sim_{Jaccard}(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} \tag{5} $$

Where:
$A \cdot B = \sum_i a_i b_i$
$|A|^2 = \sum_i a_i^2$
and $i \in [1, n]$; n: the number of features in the vector space.

2.2.3 Pearson similarity measure
Given two vectors $A(a_1, a_2, \ldots, a_n)$ and $B(b_1, b_2, \ldots, b_n)$, Pearson calculates the correlation between these vectors. Their centred vectors are derived as $A(a_1 - \bar{a}, \ldots, a_n - \bar{a})$ and $B(b_1 - \bar{b}, \ldots, b_n - \bar{b})$.

Where: $\bar{a}$ is the average of all A's features, and $\bar{b}$ is the average of all B's features.

The Pearson correlation coefficient, also called the Pearson similarity measure, is by definition the cosine of the angle α between the centred vectors:

$$ Sim_{Pearson}(A, B) = \frac{n \sum a_i b_i - \sum a_i \sum b_i}{\sqrt{\left[n \sum a_i^2 - \left(\sum a_i\right)^2\right] \left[n \sum b_i^2 - \left(\sum b_i\right)^2\right]}} \tag{6} $$

2.2.4 Kullback-Leibler similarity measure
In probability and information theory, the Kullback-Leibler divergence is a measure estimating the dissimilarity between two probability distributions. In the particular case of text processing, this measure calculates the divergence between feature distributions in documents. Given the vector representations of their feature distributions $A(a_1, a_2, \ldots, a_n)$ and $B(b_1, b_2, \ldots, b_n)$, the divergence, also used for calculating similarities, is calculated as follows:

$$ Sim_{Kullback} = D_{AvgKL}(\vec{t_a} \,\|\, \vec{t_b}) = \sum_{t=1}^{n} \left( \pi_1 \cdot D(w_{t,a} \,\|\, w_t) + \pi_2 \cdot D(w_{t,b} \,\|\, w_t) \right) \tag{7} $$

Where:
$$ \pi_1 = \frac{w_{t,a}}{w_{t,a} + w_{t,b}}, \qquad \pi_2 = \frac{w_{t,b}}{w_{t,a} + w_{t,b}}, \qquad w_t = \pi_1 \cdot w_{t,a} + \pi_2 \cdot w_{t,b} $$

2.2.5 Levenshtein similarity measure
Levenshtein is used to compare two strings. A possible extension for vector comparison can be derived as the following equation:

$$ Sim_{Levenshtein}(A, B) = 1 - \frac{Distance(A, B)}{Max(A, B)} \tag{8} $$

Where:
$$ Distance(A, B) = \sum_i |a_i - b_i| $$
$$ Max(A, B) = \sum_i \max(a_i, b_i) $$
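As an illustration, equations (3), (4), (5), (6) and (8) can be written as a short Python sketch over dense lists of weights. The function names and the toy centroïds are hypothetical, non-zero and non-constant vectors are assumed (otherwise a denominator becomes zero), and the Kullback-Leibler measure is left out because the text does not spell out the term $D(\cdot \,\|\, \cdot)$.

```python
import math


def dot(a, b):
    """Inner product A.B = sum of a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))


def sim_cosine(a, b):
    """Equation (4): A.B / (|A| * |B|)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def sim_jaccard(a, b):
    """Equation (5): A.B / (|A|^2 + |B|^2 - A.B)."""
    ab = dot(a, b)
    return ab / (dot(a, a) + dot(b, b) - ab)


def sim_pearson(a, b):
    """Equation (6): cosine of the angle between the centred vectors."""
    n = len(a)
    sa, sb = sum(a), sum(b)
    numerator = n * dot(a, b) - sa * sb
    denominator = math.sqrt((n * dot(a, a) - sa ** 2) * (n * dot(b, b) - sb ** 2))
    return numerator / denominator


def sim_levenshtein(a, b):
    """Equation (8): 1 - sum|a_i - b_i| / sum max(a_i, b_i)."""
    distance = sum(abs(x - y) for x, y in zip(a, b))
    maximum = sum(max(x, y) for x, y in zip(a, b))
    return 1 - distance / maximum


def classify(doc, centroids, measure):
    """Equation (3): the class whose centroid is most similar to the document."""
    return max(centroids, key=lambda label: measure(doc, centroids[label]))


# Hypothetical three-feature centroids and one document vector.
centroids = {"C04": [2.0, 1.0, 0.0], "C14": [0.0, 1.0, 3.0]}
doc = [1.0, 1.0, 0.0]
predicted = {
    m.__name__: classify(doc, centroids, m)
    for m in (sim_cosine, sim_jaccard, sim_pearson, sim_levenshtein)
}
```

Because `classify` takes the measure as a parameter, the same decision rule can be evaluated with each similarity measure separately, which mirrors the experimental setup described in this section.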
2.3 Experimentations and results
In these experimentations, the Ohsumed [14] corpus is used. Ohsumed is composed of abstracts of biomedical articles of the year 1991 retrieved from the MEDLINE database, indexed using MeSH (Medical Subject Headings) [15]. The first 20000 documents of this database were selected and categorized using 23 sub-concepts of the MeSH concept "Disease". In this work, we restricted this corpus to the five most frequent classes [13]. Training is realized on the corpus, so five class centroïds are calculated, one for each of the classes listed in Table 1.

As for testing, five experimentations are executed applying the similarity measures shown in the previous section. For most classification tasks, the classifier's accuracy [16] exceeded 80%. Thus, we consider that it is more relevant to use the F1-measure, recall and precision measures [16] for performance comparison. These measures are given through the following equations:

$$ Precision = \frac{t_p}{t_p + f_p} \tag{9} $$
$$ Recall = \frac{t_p}{t_p + f_n} \tag{10} $$
$$ F1\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{11} $$
$$ Accuracy = \frac{t_p + t_n}{t_p + t_n + f_p + f_n} \tag{12} $$

Considering a specific category C:
$t_p$: The number of correctly classified documents in C.
$t_n$: The number of correctly classified documents in other categories.
$f_p$: The number of documents classified in C while they belong to other categories.
$f_n$: The number of documents classified in other categories while they belong to C.

Table 1. Ohsumed Corpus.
Category  Description
C04       Neoplasms
C23       Pathological Conditions, Signs and Symptoms
C06       Digestive System Diseases
C14       Cardiovascular Diseases
C20       Immune System Diseases

The F1-measure values obtained for the five document classes with each similarity measure are given in Table 2. The last two lines present the macro- and micro-average F-measures obtained for each similarity measure. The micro-average F-measure is computed globally over all category documents, whereas the macro-average F-measure is equal to the average of the F-measures calculated locally for each class.

Table 2. F1-measure values applying Rocchio to Ohsumed
Category  Cosine  Jaccard  Kullback  Levenshtein  Pearson
C06       55,61%  48,81%   54,19%    48,95%       55,70%
C23       52,13%  7,82%    18,97%    4,71%        52,60%
C04       69,67%  70,30%   69,56%    70,86%       69,31%
C14       70,00%  67,99%   68,92%    66,55%       69,63%
C20       57,33%  55,23%   55,52%    52,72%       57,36%
Macro     60,95%  50,03%   53,43%    48,76%       60,92%
Micro     61,14%  52,74%   55,01%    51,82%       61,06%

In Figure 2, we can observe performance variations among the similarity measures, especially for "C23", whose pathology documents seem difficult to distinguish from other classes. In fact, this class is very broad compared to the others; in other words, its documents can also be related to other classes, since pathologies can affect, for instance, the digestive and the cardiovascular systems ("C06" and "C14" respectively). As a result, low F-measure values were observed for this class, in the range [0,05-0,5]. The other classes have quite similar results, in the range [0,5-0,7].

Figure 2. F1-measure values applying Rocchio to Ohsumed

In order to analyze the preceding results in detail, Figure 3 summarizes the recall values obtained in the previously described experimentations. When treating the class "C23", the classifier cannot easily identify its documents. Consequently, most of its documents are attributed to other classes (false negatives), whereas only a few documents are correctly categorized in "C23", which explains the low values of both the recall and the F1-measure for this class.

Figure 3. Recall values applying Rocchio to Ohsumed

On the contrary, Figure 4 shows a relatively high precision value for "C23" compared to other classes, especially "C06" and "C20". In fact, when the classifier labels a document as "C23", it is correct most of the time, which explains the high precision value. On the other hand, the classifier misses most "C23" documents, so many of them are assigned to other classes, lowering the precision of those classes. In other words, the low precision of these classes stems from errors related to the breadth of "C23".

Figure 4. Precision values applying Rocchio to Ohsumed

2.4 Conclusion
Throughout the previous experimentations, one limitation of Rocchio is exhibited with all tested similarity measures: the general or wide class issue. Some other limitations seem to affect Rocchio's performance, particularly in dealing with similarities among classes and with heterogeneous classes. These limitations are mainly related to class representation and similarity calculations. For example, limitations observed with similar classes may be overcome by means of semantic resources. Indeed, redefining centroïds using concepts instead of terms and using semantic measures might limit the intersections between the spheres of similar classes in concept space. The next sections present in detail how semantics can be introduced into vectors. Some preliminary results are then presented.

3. TOWARDS A ROCCHIO-BASED SEMANTIC CLASSIFICATION
In spite of being considered the most popular text representation method, the VSM representation suffers from certain limitations [17, 18], especially in processing compound words, synonyms, polysemy, etc. In order to overcome these limitations, semantic resources (like thesauri and ontologies) can be used to replace the term-based representation by a concept-based one. Text classification using conceptualized vectors is thus called "Semantic Classification".

This section first presents different strategies for vector conceptualization. Then, the use of conceptualized vectors during the classification process is discussed according to the chosen strategy. These two steps constitute a semantic extension to the original Rocchio, thanks to semantic resources.

3.1 Vector Conceptualization
Conceptualization is the process of moving from the terms literally occurring in the treated text to their semantically corresponding concepts or senses in semantic resources, which might permit better classification results. Examples of semantic resources that might be used for conceptualization include WordNet, Wikipedia and other domain-specific resources, usually called domain ontologies, such as the UMLS thesaurus in the medical domain.

In general, vector conceptualization is realized in two steps: (i) the search for concepts corresponding to the vector's terms and then (ii) the integration of these concepts into the vector, producing the final conceptualized vector. Three different strategies have been proposed for conceptualization [8]:
• Adding Concepts: the original vector is extended and the corresponding concepts are added.
• Partial Conceptualization: terms are substituted by their corresponding concepts. Terms having no related concepts are kept in the vector.
• Complete Conceptualization: as in Partial Conceptualization, terms are substituted by concepts, whereas the remaining terms are eliminated from the final vector.
Integrated concepts are assigned new scores derived from their related terms' frequencies.

The second strategy seems to be the most appropriate one, as it removes no term without replacing it with a related concept, so no original feature is lost from the vector (compared to the third one) and no extra feature is added (compared to the first one), which minimizes the effect on efficiency. Yet, the classifier has to be adapted to a hybrid (concepts + terms) representation.

When searching semantic resources for the concepts corresponding to a polysemic term, multiple matches are detected, introducing some ambiguities into the final document representation. For example, the term "Book" in English signifies a book and also a reservation (Ticket, accommodations…). According to [8], three strategies for disambiguation deploying WordNet can be used:
• All: accept all candidate concepts as matches for the considered term.
• First: accept the most frequently used concept among the candidates, using document language statistics.
• Context: accept the candidate concepts whose semantic context is most similar to the term's context in the document.

The first strategy, despite being the simplest, is the least reliable, as it accepts all candidate concepts without choosing a specific sense of the term. In cases where a term is used in the document in one of its rarely used senses, the second strategy makes a wrong decision. Despite its complexity, the last strategy seems to be more accurate, as the concept's context can be derived from semantic resources using the concept's definition, its descriptive terms, or a text corpus.

3.2 Using Conceptualized Vectors in Rocchio
Vectors resulting from the conceptualization step are submitted to the decision making criteria in order to classify the corresponding documents. Most semantic classification systems neglect semantic relations between relevant concepts and apply the same purely mathematical similarity measures for decision making. Indeed, the exact concept must occur in both compared vectors to influence decision making through its score. Similar or related concepts are rarely taken into consideration.

Taking advantage of semantic resources, it is important to incorporate related concepts to those already adopted through
conceptualization step enriching document representation and eventually improving classification results. For example, some works propose integrating super-concepts into the conceptualized vector and demonstrate some improvements in classification results as multiple superior ontology levels are considered [8].

New semantic similarity measures considering semantic relations between ontology concepts are also being developed, making further advances towards semantic classification. These measures permit similar concepts to be compared and to contribute, in addition to common concepts, to vector comparison [9]. For an overview of different semantic similarity measures see [19].

Integrating similarity measures into semantic classification depends on the adopted conceptualization strategy. The preceding measures can be directly applied to vectors resulting from "Complete Conceptualization". For the other conceptualization strategies, which produce a hybrid (concepts + terms) document representation, both mathematical and semantic similarity measures must be applied, to terms and concepts respectively.

New representation models using parts of the ontology hierarchy have also been proposed. These parts constitute semantic trees or forests where each concept is assigned an importance score. As for decision making in systems using these models, the similarity between semantic hierarchies is considered as the accumulation of the pairwise similarities between their concepts [20]. The semantic similarity between two concepts is related to their scores and their positions in the hierarchy.

4. APPLYING ROCCHIO TO CONCEPTUALIZED CORPUS
This section presents details concerning the integration of the conceptualization process into the Rocchio classifier (Figure 5). Two resources, a knowledge base and a text-to-concept mapping utility, are needed in order to transform a plain text into a conceptualized one. We chose to use the UMLS Metathesaurus and the MetaMap tool to map Ohsumed text to UMLS concepts.

After having introduced these resources, we present different conceptualization strategies, introducing the new classification process; then we present the new Rocchio classification results obtained by integrating this conceptualization. Finally, conceptualization effects on Rocchio classification according to different standard similarity measures and conceptualization strategies are discussed.

4.1 Resources
In this section we present UMLS and MetaMap, both used in our system during the conceptualization process. Both were developed by the National Library of Medicine (NLM), aiming at facilitating the development of sophisticated medical information systems.

4.1.1 Unified Medical Language System (UMLS)
The Unified Medical Language System (UMLS) [21] was developed in order to model the language of biomedicine and health. UMLS' knowledge sources enhance the development of information systems in the biomedical domain.

The UMLS knowledge base consists of three main resources: the Metathesaurus, the Semantic Network and the SPECIALIST Lexicon. The Metathesaurus is a multilingual vocabulary database of biomedical concepts, their names, their attributes and the relations among them. This database organizes the concepts of the various source vocabularies (like MeSH, SNOMED-CT, etc.) according to their senses, grouping common concepts together. Concepts and the relations among them are assigned at least one type from the Semantic Network. Indeed, the Semantic Network provides a higher level of abstraction through the categorization of concepts and relations into inter-related types, constituting a network of 133 semantic types and 54 relationships. The SPECIALIST Lexicon contains a large variety of general as well as medical terms and words.

4.1.2 MetaMap tool
In addition to the UMLS knowledge resources, many tools are developed and provided by NLM in order to facilitate the deployment of these sources by medical information system developers. In this work we are particularly interested in MetaMap [22]. The major goal of the MetaMap developers was to improve biomedical text retrieval using UMLS sources. Indeed, MetaMap can discover the links between biomedical text and the knowledge in the Metathesaurus. This mapping is the result of a rigorous linguistic analysis of each phrase of the text. First, the text is tokenized and phrase boundaries are identified, then part-of-speech tags are added. Second, the SPECIALIST Lexicon and the shallow parser are used to analyze these phrases syntactically. Finally, different candidates are identified in the Metathesaurus, and the final mappings combining these candidates are evaluated, resulting in confidence scores for each mapping. In cases where ambiguities are detected, MetaMap keeps the mappings most semantically similar to the surrounding text, following the context strategy (section 3.1).

4.2 New Classification Process
Figure 5 illustrates the new classification process, which is also realized through two phases: a Training Phase and a Classification Phase. A new Conceptualization task is introduced before the Vectorization task, using both MetaMap and UMLS.

4.2.1 Training Phase
During the training phase, class centroïds are learned. Through the Conceptualization task, all documents of the training corpus are conceptualized using UMLS through the MetaMap tool; the system transforms the whole training corpus into conceptualized text documents. This Conceptualization task is realized on the text documents directly, and not on their vectors, in order to keep all the information in the text needed by MetaMap, permitting better text-to-concept mappings. Inspired by the vector conceptualization strategies described in the previous section, text conceptualization can also be done using several strategies.

Then, the Vectorization task converts these conceptualized documents into vectors in the same way as in the previous classification process without conceptualization. Finally, the Rocchio Centroïd Computing task computes the classes' centroïds using the vectors resulting from the vectorization of the conceptualized training corpus.

4.2.2 Classification Phase
In this phase, the new document to classify is first conceptualized and then vectorized through similar Conceptualization and Vectorization tasks. Then, through a Similarity Computing task, the similarity between the vector of the conceptualized document and the learned centroïds is estimated, using one of the similarity measures (cosine, Jaccard, etc.). The resulting class is the one having the most similar centroïd.

4.3 Experimentation and Results
These experimentations were realized in order to evaluate the effect of MetaMap integration into the classification process. This
integration enables mapping text to UMLS concepts. Conceptualization is realized as a first step before vectorization. The rest of the system is the same Rocchio system detailed in section 2.1. In our tests we used the same Ohsumed corpus as for the original system, in order to compare results before and after conceptualization.

4.3.1 Conceptualization step
During the conceptualization task, different strategies can be implemented as described in section 3.1 (adding concepts, partial conceptualization and complete conceptualization). According to the MetaMap text-to-concept matching results, we can choose:
• Best concept. The best concept among the several candidate concepts matched to the text. This depends on a matching score computed by MetaMap [22].
• All concepts. All candidate concepts are kept.

Candidates resulting from matching have many properties. In this work we choose to use either the concept name or the concept ID. During the tokenization step, the concept ID is considered as a single token, so it stays intact. Concept names, being sometimes compound words, are split during tokenization when the text is conceptualized using the concept name strategy.

In this work, conceptualization is done using all combinations of the different strategies (3 conceptualization strategies × 2 candidate selections × 2 concept properties = 12 combinations). The classification step is executed using the five different similarity measures described previously. The next section shows a selection of the most relevant results using the F-measure.

Figure 5. Conceptualization process: Rocchio applied to conceptualized corpus
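The twelve combinations can be enumerated with a small Python sketch. It does not call MetaMap: the `CANDIDATES` dictionary is a hypothetical stand-in for MetaMap's output, the concept IDs and names in it are invented placeholders, and candidates are assumed to be sorted by decreasing matching score so that the first one is the "best concept".

```python
from itertools import product

# Hypothetical stand-in for MetaMap output: each term maps to candidate
# concepts as (concept_id, concept_name, score), best score first.
CANDIDATES = {
    "tumor": [("CUI_A", "neoplasm", 0.9), ("CUI_B", "swelling", 0.6)],
    "liver": [("CUI_C", "liver", 1.0)],
}

STRATEGIES = ("add", "partial", "complete")   # section 3.1
SELECTIONS = ("best", "all")                  # best concept / all concepts
PROPERTIES = ("id", "name")                   # concept ID / concept name


def conceptualize(tokens, strategy, selection, prop):
    """Rewrite a token list according to one of the 12 combinations."""
    out = []
    for token in tokens:
        candidates = CANDIDATES.get(token, [])
        if selection == "best":
            candidates = candidates[:1]
        concepts = [c[0] if prop == "id" else c[1] for c in candidates]
        if strategy == "add":
            out.append(token)        # keep the term, then add its concepts
            out.extend(concepts)
        elif concepts:
            out.extend(concepts)     # partial/complete: concepts replace the term
        elif strategy == "partial":
            out.append(token)        # partial keeps terms without any concept
        # complete: terms without any concept are dropped
    return out


tokens = ["large", "tumor", "liver"]
combinations = list(product(STRATEGIES, SELECTIONS, PROPERTIES))
results = {combo: conceptualize(tokens, *combo) for combo in combinations}
```

With these toy inputs, `results[("complete", "best", "id")]` keeps only the two concept IDs, whereas `results[("add", "best", "id")]` keeps every original term and appends the best concept ID after each matched term. The output token list would then go through the usual Vectorization task.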
4.3.2 Results analysis
Table 3 shows the F1-measure obtained by applying the Rocchio method
to the conceptualized Ohsumed corpus, for each of the five categories
{C06, C23, C04, C14, C20}. For each of the five similarity measures,
the first row (Original) gives the results already obtained on the
original Ohsumed corpus without conceptualization. The classifier is
then tested, for each similarity measure, with each of the twelve
conceptualization strategies previously introduced. The last two
columns present the macro- and micro-averaged F-measure obtained for
each similarity measure and conceptualization strategy. As explained
previously, in micro-averaging the F-measure is computed globally
over all category documents, whereas in macro-averaging it is equal
to the average of the locally calculated F-measures of the classes.

As illustrated in the table, in most cases the original system
outperforms the new system integrating the conceptualization phase.
In fact, this applies to approximately 70% of the results evaluated
using the micro-averaged F-measure. However, a closer look at the
results shows that the system using the Kullback-Leibler similarity
measure benefits from conceptualization: its results are improved in
two thirds of the cases. Considering the results class by class, we
can observe that conceptualization improves the outcome in about 70%
of cases for the class "C23", the class on which the original system
showed its worst performance.
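The difference between the two averages can be made concrete with a small sketch. The per-class counts below (true positives, false positives, false negatives) are invented for illustration and are not taken from the Ohsumed experiments; the function names are likewise hypothetical.

```python
def f1(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there is nothing to score.
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(counts):
    # Average of the F1 values computed locally for each class.
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    # F1 computed once, globally, from the counts summed over all classes.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


# (tp, fp, fn) per class: one well-recognized class, one poorly recognized.
counts = {"class_a": (8, 2, 2), "class_b": (1, 1, 3)}
```

With these counts the per-class values are 0.8 and 1/3, so the macro average is pulled down by the weak class more strongly than the micro average, which pools the decisions of both classes before computing a single score.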
Table 3. Results of applying Rocchio to conceptualized Ohsumed (F1-measure; each cell gives the value followed by its difference from the Original row of the same similarity measure)
Similarity | Strategy | Concepts | C06 | C23 | C04 | C14 | C20 | Macro | Micro
Cosine | Original | - | 55,61% | 52,13% | 69,67% | 70,00% | 57,33% | 60,95% | 61,14%
Cosine | AddConcept | AllConcepts | 51,94% -3,68% | 49,71% -2,42% | 67,10% -2,57% | 67,68% -2,32% | 53,20% -4,12% | 57,92% -3,02% | 58,58% -2,56%
Cosine | AddConcept | AllConceptsID | 53,68% -1,93% | 52,22% +0,10% | 68,87% -0,80% | 69,45% -0,55% | 57,79% +0,46% | 60,40% -0,54% | 60,85% -0,29%
Cosine | AddConcept | BestConcept | 55,74% +0,12% | 52,68% +0,55% | 69,71% +0,03% | 70,41% +0,41% | 57,07% -0,26% | 61,12% +0,17% | 61,38% +0,24%
Cosine | AddConcept | BestConceptID | 55,17% -0,44% | 53,35% +1,22% | 69,50% -0,17% | 69,71% -0,29% | 58,60% +1,27% | 61,27% +0,32% | 61,51% +0,37%
Cosine | Partial | AllConcepts | 50,23% -5,38% | 49,10% -3,02% | 66,26% -3,41% | 66,62% -3,38% | 52,78% -4,55% | 57,00% -3,95% | 57,75% -3,39%
Cosine | Partial | AllConceptsID | 52,89% -2,72% | 51,90% -0,23% | 68,44% -1,23% | 68,97% -1,03% | 57,70% +0,37% | 59,98% -0,97% | 60,47% -0,67%
Cosine | Partial | BestConcept | 54,36% -1,26% | 53,37% +1,24% | 69,29% -0,38% | 69,33% -0,67% | 55,90% -1,43% | 60,45% -0,50% | 60,98% -0,16%
Cosine | Partial | BestConceptID | 45,84% -9,77% | 53,38% +1,26% | 62,89% -6,78% | 61,94% -8,06% | 58,25% +0,93% | 56,46% -4,49% | 57,27% -3,87%
Cosine | Complete | AllConcepts | 50,03% -5,58% | 49,05% -3,08% | 66,29% -3,38% | 66,52% -3,48% | 52,92% -4,40% | 56,96% -3,98% | 57,71% -3,43%
Cosine | Complete | AllConceptsID | 52,68% -2,93% | 51,82% -0,30% | 68,44% -1,23% | 68,99% -1,01% | 57,63% +0,31% | 59,91% -1,03% | 60,42% -0,72%
Cosine | Complete | BestConcept | 54,38% -1,23% | 53,38% +1,25% | 69,33% -0,34% | 69,36% -0,64% | 55,99% -1,33% | 60,49% -0,46% | 61,01% -0,13%
Cosine | Complete | BestConceptID | 47,17% -8,44% | 53,05% +0,92% | 62,73% -6,94% | 62,18% -7,82% | 57,19% -0,14% | 56,46% -4,48% | 57,15% -3,99%
Jaccard | Original | - | 48,81% | 7,82% | 70,30% | 67,99% | 55,23% | 50,03% | 52,74%
Jaccard | AddConcept | AllConcepts | 47,58% -1,23% | 10,99% +3,18% | 67,34% -2,96% | 64,68% -3,31% | 51,34% -3,90% | 48,39% -1,64% | 51,01% -1,73%
Jaccard | AddConcept | AllConceptsID | 48,65% -0,16% | 11,30% +3,48% | 69,32% -0,98% | 67,32% -0,67% | 53,73% -1,50% | 50,06% +0,03% | 52,48% -0,26%
Jaccard | AddConcept | BestConcept | 50,69% +1,88% | 8,40% +0,59% | 70,62% +0,31% | 67,52% -0,47% | 55,58% +0,35% | 50,56% +0,53% | 53,20% +0,46%
Jaccard | AddConcept | BestConceptID | 48,26% -0,55% | 5,84% -1,98% | 69,99% -0,31% | 67,18% -0,81% | 56,63% +1,39% | 49,58% -0,45% | 52,34% -0,40%
Jaccard | Partial | AllConcepts | 46,47% -2,34% | 12,33% +4,52% | 66,54% -3,77% | 63,74% -4,25% | 50,68% -4,55% | 47,95% -2,08% | 50,50% -2,24%
Jaccard | Partial | AllConceptsID | 48,03% -0,78% | 12,33% +4,51% | 68,62% -1,68% | 67,04% -0,95% | 53,29% -1,94% | 49,86% -0,17% | 52,16% -0,58%
Jaccard | Partial | BestConcept | 50,90% +2,09% | 8,98% +1,16% | 69,29% -1,02% | 66,86% -1,13% | 54,86% -0,37% | 50,18% +0,15% | 52,75% +0,02%
Jaccard | Partial | BestConceptID | 40,30% -8,52% | 3,63% -4,19% | 60,62% -9,69% | 61,47% -6,51% | 56,54% +1,30% | 44,51% -5,52% | 46,48% -6,26%
Jaccard | Complete | AllConcepts | 46,45% -2,36% | 12,25% +4,44% | 66,58% -3,72% | 63,72% -4,27% | 50,63% -4,60% | 47,92% -2,11% | 50,46% -2,27%
Jaccard | Complete | AllConceptsID | 48,24% -0,57% | 12,56% +4,75% | 68,49% -1,81% | 66,93% -1,06% | 53,35% -1,88% | 49,92% -0,11% | 52,18% -0,56%
Jaccard | Complete | BestConcept | 51,27% +2,46% | 8,98% +1,17% | 69,28% -1,03% | 66,65% -1,34% | 54,69% -0,54% | 50,17% +0,14% | 52,75% +0,02%
Jaccard | Complete | BestConceptID | 41,23% -7,58% | 3,80% -4,01% | 60,79% -9,51% | 60,84% -7,14% | 56,73% +1,50% | 44,68% -5,35% | 46,70% -6,03%
Kullback | Original | - | 54,19% | 18,97% | 69,56% | 68,92% | 55,52% | 53,43% | 55,01%
Kullback | AddConcept | AllConcepts | 53,41% -0,79% | 27,91% +8,94% | 69,15% -0,40% | 69,65% +0,73% | 54,88% -0,64% | 55,00% +1,57% | 56,07% +1,06%
Kullback | AddConcept | AllConceptsID | 53,80% -0,39% | 25,86% +6,89% | 68,94% -0,61% | 69,13% +0,21% | 55,00% -0,52% | 54,55% +1,12% | 55,62% +0,61%
Kullback | AddConcept | BestConcept | 54,49% +0,30% | 20,63% +1,67% | 69,93% +0,38% | 68,79% -0,13% | 56,46% +0,95% | 54,06% +0,63% | 55,55% +0,54%
Kullback | AddConcept | BestConceptID | 53,12% -1,07% | 16,83% -2,14% | 69,52% -0,04% | 68,51% -0,42% | 55,86% +0,35% | 52,77% -0,66% | 54,61% -0,40%
Kullback | Partial | AllConcepts | 52,96% -1,23% | 29,50% +10,54% | 68,82% -0,73% | 69,13% +0,21% | 54,47% -1,05% | 54,98% +1,55% | 55,92% +0,91%
Kullback | Partial | AllConceptsID | 53,00% -1,19% | 26,68% +7,72% | 68,48% -1,07% | 68,84% -0,08% | 53,92% -1,60% | 54,19% +0,75% | 55,22% +0,21%
Kullback | Partial | BestConcept | 53,44% -0,75% | 19,02% +0,05% | 68,97% -0,59% | 68,11% -0,81% | 56,24% +0,73% | 53,16% -0,27% | 54,85% -0,16%
Kullback | Partial | BestConceptID | 45,92% -8,27% | 10,70% -8,26% | 64,12% -5,44% | 64,59% -4,33% | 54,90% -0,62% | 48,05% -5,38% | 50,14% -4,87%
Kullback | Complete | AllConcepts | 53,14% -1,05% | 29,64% +10,67% | 68,65% -0,91% | 69,03% +0,10% | 54,43% -1,08% | 54,98% +1,55% | 55,87% +0,86%
Kullback | Complete | AllConceptsID | 53,13% -1,06% | 27,30% +8,34% | 68,31% -1,25% | 68,89% -0,03% | 53,80% -1,71% | 54,29% +0,86% | 55,28% +0,27%
Kullback | Complete | BestConcept | 53,91% -0,28% | 20,75% +1,78% | 68,94% -0,62% | 68,21% -0,71% | 56,26% +0,74% | 53,61% +0,18% | 55,19% +0,18%
Kullback | Complete | BestConceptID | 46,32% -7,87% | 12,13% -6,83% | 64,25% -5,31% | 64,91% -4,01% | 54,39% -1,12% | 48,40% -5,03% | 50,46% -4,55%
Levenshtein | Original | - | 48,95% | 4,71% | 70,86% | 66,55% | 52,72% | 48,76% | 51,82%
Levenshtein | AddConcept | AllConcepts | 49,97% +1,02% | 7,14% +2,43% | 67,54% -3,33% | 65,86% -0,69% | 49,64% -3,08% | 48,03% -0,73% | 50,74% +0,89%
Levenshtein | AddConcept | AllConceptsID | 47,86% -1,09% | 6,70% +1,99% | 67,94% -2,92% | 66,39% -0,16% | 50,44% -2,28% | 47,87% -0,89% | 50,62% -1,20%
Levenshtein | AddConcept | BestConcept | 50,18% +1,23% | 4,88% +0,18% | 70,63% -0,24% | 65,51% -1,04% | 52,41% -0,31% | 48,72% -0,04% | 51,81% -0,02%
Levenshtein | AddConcept | BestConceptID | 48,12% -0,83% | 3,28% -1,43% | 69,98% -0,88% | 65,65% -0,90% | 54,23% +1,52% | 48,25% -0,51% | 51,31% -0,51%
Levenshtein | Partial | AllConcepts | 48,76% -0,19% | 8,31% +3,61% | 66,87% -3,99% | 65,61% -0,94% | 49,32% -3,40% | 47,78% -0,98% | 50,42% -1,41%
Levenshtein | Partial | AllConceptsID | 46,23% -2,72% | 6,78% +2,07% | 66,62% -4,24% | 65,93% -0,62% | 49,70% -3,02% | 47,05% -1,71% | 49,74% -2,08%
Levenshtein | Partial | BestConcept | 49,34% +0,40% | 5,04% +0,34% | 68,46% -2,40% | 64,88% -1,67% | 52,51% -0,21% | 48,05% -0,71% | 51,02% -0,80%
Levenshtein | Partial | BestConceptID | 39,64% -9,30% | 2,28% -2,42% | 62,03% -8,83% | 60,60% -5,95% | 54,82% +2,11% | 43,88% -4,88% | 46,13% -5,70%
Levenshtein | Complete | AllConcepts | 48,88% -0,07% | 8,73% +4,02% | 66,83% -4,03% | 65,63% -0,92% | 49,30% -3,42% | 47,87% -0,88% | 50,46% -1,36%
Levenshtein | Complete | AllConceptsID | 46,25% -2,70% | 6,86% +2,15% | 66,60% -4,27% | 65,89% -0,66% | 49,43% -3,29% | 47,00% -1,75% | 49,68% -2,14%
Levenshtein | Complete | BestConcept | 49,47% +0,53% | 5,13% +0,42% | 68,41% -2,46% | 64,59% -1,96% | 52,65% -0,07% | 48,05% -0,71% | 51,02% -0,80%
Levenshtein | Complete | BestConceptID | 40,85% -8,10% | 2,46% -2,24% | 62,27% -8,59% | 60,71% -5,84% | 54,37% +1,65% | 44,13% -4,62% | 46,51% -5,31%
Pearson | Original | - | 55,70% | 52,60% | 69,31% | 69,63% | 57,36% | 60,92% | 61,06%
Pearson | AddConcept | AllConcepts | 52,21% -3,49% | 49,93% -2,67% | 67,03% -2,28% | 67,56% -2,07% | 52,94% -4,42% | 57,93% -2,99% | 58,59% -2,46%
Pearson | AddConcept | AllConceptsID | 53,70% -2,00% | 52,77% +0,17% | 68,88% -0,43% | 69,00% -0,64% | 57,87% +0,51% | 60,44% -0,48% | 60,87% -0,19%
Pearson | AddConcept | BestConcept | 55,00% -0,71% | 52,94% +0,34% | 69,51% +0,20% | 69,92% +0,29% | 57,14% -0,22% | 60,90% -0,02% | 61,19% +0,13%
Pearson | AddConcept | BestConceptID | 55,26% -0,45% | 53,44% +0,84% | 69,20% -0,11% | 69,31% -0,32% | 58,46% +1,09% | 61,13% +0,21% | 61,32% +0,26%
Pearson | Partial | AllConcepts | 50,33% -5,38% | 49,07% -3,54% | 66,35% -2,96% | 66,47% -3,16% | 52,74% -4,63% | 56,99% -3,93% | 57,73% -3,33%
Pearson | Partial | AllConceptsID | 52,31% -3,40% | 52,35% -0,25% | 68,34% -0,97% | 68,59% -1,04% | 57,62% +0,25% | 59,84% -1,08% | 60,37% -0,69%
Pearson | Partial | BestConcept | 54,59% -1,12% | 53,53% +0,93% | 69,16% -0,15% | 68,92% -0,72% | 55,82% -1,55% | 60,40% -0,52% | 59,47% -1,59%
Pearson | Partial | BestConceptID | 45,97% -9,73% | 53,74% +1,14% | 62,51% -6,80% | 61,07% -8,57% | 57,51% +0,15% | 56,16% -4,76% | 57,01% -4,05%
Pearson | Complete | AllConcepts | 50,20% -5,51% | 49,04% -3,56% | 66,31% -3,00% | 66,57% -3,06% | 52,70% -4,66% | 56,96% -3,96% | 57,71% -3,35%
Pearson | Complete | AllConceptsID | 52,13% -3,57% | 52,39% -0,21% | 68,27% -1,04% | 68,62% -1,02% | 57,66% +0,29% | 59,81% -1,11% | 60,36% -0,70%
Pearson | Complete | BestConcept | 54,14% -1,57% | 53,75% +1,15% | 69,20% -0,11% | 68,91% -0,72% | 55,69% -1,67% | 60,34% -0,59% | 60,90% -0,16%
Pearson | Complete | BestConceptID | 54,14% -1,57% | 53,75% +1,15% | 69,20% -0,11% | 68,91% -0,72% | 55,69% -1,67% | 60,34% -0,59% | 60,90% -0,16%
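The proportions quoted in the results analysis can be recomputed from the difference values in Table 3. The sketch below does so for the Micro column of the twelve Kullback rows. The helper names are hypothetical; the twelve values are copied from the table, in the order AddConcept, Partial, Complete.

```python
def parse_delta(cell):
    # The table uses a decimal comma and a trailing percent sign,
    # e.g. "+1,06%" or "-4,87%".
    return float(cell.rstrip("%").replace(",", "."))


def share_improved(cells):
    # Number of configurations with a strictly positive difference,
    # together with the number of configurations examined.
    deltas = [parse_delta(cell) for cell in cells]
    improved = sum(1 for delta in deltas if delta > 0)
    return improved, len(deltas)


# Micro-averaged F-measure differences of the Kullback rows of Table 3.
kullback_micro_deltas = [
    "+1,06%", "+0,61%", "+0,54%", "-0,40%",   # AddConcept
    "+0,91%", "+0,21%", "-0,16%", "-4,87%",   # Partial
    "+0,86%", "+0,27%", "+0,18%", "-4,55%",   # Complete
]

improved, total = share_improved(kullback_micro_deltas)
```

Here `improved` is 8 and `total` is 12, which corresponds to the two thirds of cases reported for this similarity measure.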
Concerning conceptualization strategies, one of the 12
conceptualization strategies tested in these experiments
seems to provide the maximal improvement. Indeed, the Best Concept
strategy outperforms the others as it retrieves the best candidate
concept provided by MetaMap as the mapping result. Best concept
names rather than IDs are used in text conceptualization. However,
this strategy does not present a significant improvement over the
original system without conceptualization: the gain is less than 1%
each time. The highest increases are obtained when using
conceptualization strategies that take into account all the candidates
found by MetaMap and not only the best one. In some cases this
increase exceeds 10%. Furthermore, the largest increases in the
micro-averaged F-measure values are attained using the strategy of
adding concepts applied to the classifier with the Kullback-Leibler
similarity measure.

Concerning similarity measures, the least improvement from
conceptualization can be observed for the classifier using the
Levenshtein similarity measure. In fact, among all micro-averaged
F-measure values for this classifier, only one surpasses the value
obtained using Levenshtein without conceptualization. As we noticed
during experiments on the original Ohsumed corpus, the Rocchio
classifiers that use Cosine and Pearson behave in a relatively
similar way. Indeed, both strategies that add the names and the IDs
of the best concepts seem to improve classification results using
both similarity measures. Since the micro-averaged F-measure
increases are in all cases less than 0.5%, this increase does not
seem very significant. Rocchio using the Jaccard similarity measure
outperforms the original system when text is conceptualized using the
best concept name according to three different strategies:
AddConcepts, Partial and Complete conceptualization.

4.4 Discussion
According to the results presented in the preceding section, here
we list some remarks. First of all, the lowest results are observed
when terms are replaced by the IDs of their corresponding concepts in
the UMLS. This performance degradation might be principally
related to replacing all terms corresponding to a concept by its ID:
only the IDs of concepts can participate in vectorization. Terms
that are shared among concepts with different IDs are then excluded
from the vectors even if they had a high importance.

Second, when the system performance has a good F1-measure
value (i.e. exceeds 60%), no significant effect can be observed for
the integration of the conceptualization task into the system. In
fact, as the same similarity measures are used in both cases,
with and without conceptualization, the improvement in results was
limited.

Third, when the system performance using a specific similarity
measure has a low F1-measure value, as is the case for the class
"C23", introducing conceptualization can significantly improve
this value, with a maximum gain reaching about 10% in some cases.
Indeed, the class "C23" is very large compared to the others, and so
enriching the class representation with semantics might result in a
better identification of this class and also in better results.

Fourth, the best conceptualization strategy is Add concept, adding
the Best concept among mapped candidates into the text. In fact, the
best mappings retrieved by MetaMap are added into the text in order
to enrich it with semantics while avoiding any information loss.

Finally, even if the results are still preliminary, it seems useful to
introduce semantic enrichments to the Rocchio method in order to
improve classification results. Nevertheless, the exploitation of
semantic resources was limited in this work, ignoring all relations
(like Subsumption and Transversal relations) among concepts
used in the conceptualization task. Thus, it seems adequate to
deploy these relations in the classification process through the
introduction of new semantic similarity measures. These measures
can be coupled with the traditional similarity measures used in this
work.

5. CONCLUSION AND PERSPECTIVES
Due to the explosive growth of the Web, many search engines
demonstrate limited performance in meeting the needs of internet
users. This leads to a challenging need for efficient filtering and
ranking techniques. This paper concerns Web page classification,
presenting some traditional methods for text classification that
apply to Web page classification as well.

We chose Rocchio, which demonstrates good performance
relative to its low complexity. In addition, it can
provide relevance feedback on classification results, permitting
better understanding of results and so potential improvements in
classification. We used five different similarity measures for
testing the Rocchio method on one corpus (Ohsumed). Yet, several
limitations were identified in the classification results. These
limitations could be overcome by means of semantic resources taking
meaning into consideration in text classification. These extensions
seem to be promising.

Thus, in this research we have proposed to extend the Rocchio
text classification method in order to improve its performance
using semantic resources. Indeed, semantics can be integrated into
a Rocchio classifier through conceptualization and also during
decision making through different semantic similarity measures.
Together with the information stored in HTML tags, document
conceptualization using a knowledge base helps to complete the
VSM approach with semantics. We carried out some experiments using
different conceptualization strategies on the Ohsumed corpus with
standard similarity measures according to the Rocchio
classification process. These experiments show, in some cases,
considerable performance improvements. However, we expect
better improvements through deploying semantic similarity
measures that can be calculated with the aid of semantic resources.
These measures could be combined with the standard similarity
measures already used in this work, permitting the development of
efficient semantic classification methods for Web pages.

6. REFERENCES
[1] Asirvatham, A. P. and Ravi, K. K. Web page classification
based on document structure. Awarded second prize in
National Level Student Paper Contest conducted by IEEE
India Council, 2001.
[2] Pierre, J. M. On the Automated Classification of Web Sites.
Linköping Electronic Articles in Computer and Information
Science, 6, 1 (2001).
[3] Qi, X. and Davison, B. D. Web page classification: Features
and algorithms. ACM Comput. Surv., 41, 2 (2009), 1-31.
[4] Manning, C. D., Raghavan, P. and Schütze, H. Introduction to
Information Retrieval. Cambridge University Press, New
York, NY, USA, 2008.
[5] Lewis, D. D. Naive (Bayes) at Forty: The Independence
Assumption in Information Retrieval. In Proceedings of the
10th European Conference on Machine
Learning (1998). Springer-Verlag.
[6] Soucy, P. and Mineau, G. W. A Simple KNN Algorithm for
Text Categorization. In Proceedings of the
2001 IEEE International Conference on Data Mining (2001).
IEEE Computer Society.
[7] Han, E.-H. and Karypis, G. Centroid-Based Document
Classification: Analysis and Experimental Results. In
Proceedings of the 4th European
Conference on Principles of Data Mining and Knowledge
Discovery (2000). Springer-Verlag.
[8] Hotho, A., Staab, S. and Stumme, G. Text clustering based on
background knowledge. 2003.
[9] Guisse, A., Khelif, K. and Collard, M. PatClust : une
plateforme pour la classification sémantique des brevets. In
Proceedings of the Conférence d’Ingénierie des
connaissances (Hammamet, Tunisie, 2009).
[10] Huang, A. Similarity measures for text document clustering.
Proceedings of the Sixth New Zealand Computer Science
Research Student Conference (NZCSRSC2008), Christchurch,
New Zealand (2008), 49-56.
[11] Salton, G. The SMART Retrieval System-Experiments in
Automatic Document Processing. Prentice-Hall, Inc., 1971.
[12] Lan, M., Tan, C. L., Su, J. and Lu, Y. Supervised and
Traditional Term Weighting Methods for Automatic Text
Categorization. IEEE Trans. Pattern Anal. Mach. Intell., 31, 4
(2009), 721-735.
[13] Yi, K. and Beheshti, J. A hidden Markov model-based text
classification of medical documents. J. Inf. Sci., 35, 1 (2009),
67-81.
[14] Hersh, W., Buckley, C., Leone, T. J. and Hickam, D.
OHSUMED: an interactive retrieval evaluation and new large
test collection for research. In Proceedings
of the 17th annual international ACM SIGIR conference on
Research and development in information retrieval (Dublin,
Ireland, 1994). Springer-Verlag New York, Inc.
[15] Medical Subject Headings (MeSH®). Available from:
http://www.nlm.nih.gov/pubs/factsheets/mesh.html.
[16] Sokolova, M. and Lapalme, G. A systematic analysis of
performance measures for classification tasks. Information
Processing & Management, 45, 4 (2009), 427-437.
[17] Bloehdorn, S. and Hotho, A. Boosting for text classification
with semantic features. In Proceedings of
the 6th international conference on Knowledge Discovery on
the Web: advances in Web Mining and Web Usage Analysis
(Seattle, WA, 2006). Springer-Verlag.
[18] Wang, P. and Domeniconi, C. Building semantic kernels for
text classification using wikipedia. In Proceedings of the
14th ACM SIGKDD international
conference on Knowledge discovery and data mining (Las
Vegas, Nevada, USA, 2008). ACM.
[19] Pirro, G. A semantic similarity metric combining features
and intrinsic information content. Data Knowl. Eng., 68, 11
(2009), 1289-1308.
[20] Ruiz, M. E. Combining image features, case descriptions and
UMLS concepts to improve retrieval of medical images.
AMIA ... Annual Symposium proceedings / AMIA Symposium.
AMIA Symposium (2006), 674-678.
[21] Unified Medical Language System (UMLS®). Available
from: http://www.nlm.nih.gov/research/umls/.
[22] Aronson, A. R. and Lang, F. M. An overview of MetaMap:
historical perspective and recent advances. Journal of the
American Medical Informatics Association : JAMIA, 17, 3
(May-Jun 2010), 229-236.
