LibShortText: A Library for Short-text Classification and Analysis
Hsiang-Fu Yu rofuyu@cs.utexas.edu
Department of Computer Science, University of Texas at Austin, Austin, TX 78712 USA
Chia-Hua Ho b95082@csie.ntu.edu.tw
Yu-Chin Juan r01922136@csie.ntu.edu.tw
Chih-Jen Lin cjlin@csie.ntu.edu.tw
Department of Computer Science, National Taiwan University, Taipei 106, Taiwan
Abstract
LibShortText is an open source library for short-text classification and analysis. Properties
of short texts are considered in its design and implementation. The package supports
effective text pre-processing and fast training/prediction procedures. We provide an
interactive tool to perform error analysis. Because of the short length, details of each
short text can be investigated easily. In addition, the package is designed so that users
can conveniently make extensions. The ease of use, efficiency, and extensibility of
LibShortText make it very useful for practitioners working on short-text classification and
analysis. The package is available at http://www.csie.ntu.edu.tw/~cjlin/libshorttext
Keywords: short-text classification, interactive error analysis, open source, linear classification, machine learning
1. Introduction
Recently, there has been a growing interest in short-text classification and analysis.
Application domains include titles (e.g., Shen et al., 2012), questions (e.g., Zhang and Lee,
2003; Qu et al., 2012), sentences (e.g., Khoo et al., 2006), and short messages. Existing
text-classification packages may not be suitable for analyzing short texts because they do
not consider the special properties of such texts. For example, while it is difficult to
inspect a long text containing many words, the details of training and predicting short
texts can be investigated easily. Unlike general texts such as web pages and emails, short
texts in a corpus tend to have similar lengths, and the words in a short text are often
distinct. In addition, practitioners wonder whether the standard procedures for text
classification must be applied when experimenting with short texts.
We develop LibShortText as an open source tool (under the BSD license1) for short-text
classification and analysis. It has the following features.
1. For large-scale short-text classification, it is more efficient than general text-mining tools.
2. Based on our study (Yu et al., 2012), we carefully select the default options for LibShortText.
Users need not try many options to get the best performance for their applications.
3. We provide an interactive tool to perform error analysis. In particular, because of short
lengths, details of each text can be investigated.
1. The New BSD license, approved by the Open Source Initiative.
4. The package is mainly written in Python for simplicity and extensibility, while
time-consuming operations are implemented in C/C++ for efficiency.
5. We provide full documentation including class references and examples.
2. The Software Package
For general texts, an option suitable for some texts may not be suitable for others. In
contrast, it is easier to select options based on the properties of short texts. For example,
because the words in a short text are generally distinct, Yu et al. (2012) show that the
SVM models obtained by binary and TF-IDF feature representations give similar performance.
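The effect of distinct words can be seen directly: when no word repeats, the term-frequency vector is already binary, so the two representations differ only by IDF scaling. A minimal sketch of this observation (not part of the package):

```python
from collections import Counter

def tf_vector(text):
    """Term-frequency representation: word -> count."""
    return dict(Counter(text.split()))

def binary_vector(text):
    """Binary representation: word -> 1 if the word is present."""
    return {w: 1 for w in text.split()}

# Words in a short title are typically distinct, so both
# representations produce exactly the same feature vector.
title = "mickey mouse pot stake"
print(tf_vector(title) == binary_vector(title))  # True
```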
Because users may train with the same data several times after the pre-processing phase,
we provide text2svm.py to convert short texts to sparse feature vectors of term frequency.
Then text-train.py can directly train a model on these sparse vectors.
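The conversion from raw texts to sparse term-frequency vectors can be pictured as follows. This is only a simplified sketch of the idea, not the package's implementation; the function name and the 1-based feature indexing are assumptions:

```python
def texts_to_svm(texts, labels):
    """Map each word to a feature index and emit LIBSVM-format lines
    of the form 'label index:term_frequency ...'."""
    vocab = {}  # word -> 1-based feature index, in order of first appearance
    lines = []
    for label, text in zip(labels, texts):
        counts = {}
        for word in text.split():
            idx = vocab.setdefault(word, len(vocab) + 1)
            counts[idx] = counts.get(idx, 0) + 1
        feats = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        lines.append(f"{label} {feats}")
    return lines

print(texts_to_svm(["MICKEY MOUSE POT STAKE"], [1])[0])
# 1 1:1 2:1 3:1 4:1
```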
LibShortText supports interactive error analysis. First, we enter Python, import the module,
load the prediction results, and create an 'Analyzer' object by reading a model.
$ python
>>> from libshorttext.analyzer import *
>>> predict_result = InstanceSet('predict_result')
>>> analyzer = Analyzer('train_file.model')
We randomly select 1000 instances with two labels ‘Books’ and ‘Music’ and generate a
confusion table.
           Books  Music
  Books      425      3
  Music        9    563
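A confusion table of this kind is computed from the true and predicted labels of the selected instances; a self-contained sketch (not the analyzer's actual code):

```python
from collections import defaultdict

def confusion_table(true_labels, predicted_labels):
    """table[t][p] = number of instances whose true label is t
    and whose predicted label is p."""
    table = defaultdict(lambda: defaultdict(int))
    for t, p in zip(true_labels, predicted_labels):
        table[t][p] += 1
    return table

truth = ["Books", "Books", "Music", "Music"]
pred  = ["Books", "Music", "Music", "Music"]
table = confusion_table(truth, pred)
print(table["Books"]["Music"])  # 1: one Book misclassified as Music
```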
Next, to see how the short text "MICKEY MOUSE POT STAKE" is predicted by our model,
an analyzer operation gives the feature weights of the top three classes. For the multi-class
SVM and logistic regression used in LibShortText, each class in the trained model has a
corresponding weight vector, and the decision value of a class is the inner product between
its weight vector and the test instance. For this example, the class "Home & Garden" is
correctly predicted because it has the highest decision value. From the weights, we identify
"stake" as an important word for the "Home & Garden" class.
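The prediction rule described above, one weight vector per class with decision values given by inner products, can be sketched as follows; the weight values here are invented for illustration only:

```python
def decision_values(weights, instance):
    """weights: class -> {feature: weight}; instance: {feature: value}.
    The decision value of each class is the inner product of its
    weight vector with the instance; the predicted class is the argmax."""
    return {c: sum(w.get(f, 0.0) * v for f, v in instance.items())
            for c, w in weights.items()}

weights = {                        # hypothetical trained weights
    "Home & Garden": {"stake": 1.2, "pot": 0.8, "mickey": 0.1},
    "Toys": {"mickey": 0.9, "mouse": 0.7},
}
x = {"mickey": 1, "mouse": 1, "pot": 1, "stake": 1}  # binary features
vals = decision_values(weights, x)
print(max(vals, key=vals.get))  # Home & Garden (2.1 vs. 1.6)
```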
We provide more operations for error analysis; details are in the documentation.
3. Experiments
To demonstrate the efficiency and effectiveness of LibShortText, we crawled 20 million eBay
product titles in 34 classes. The first 10 million titles are for training, while the next 10
million are for testing.2 On a machine with Intel Xeon 2.4GHz CPU (E5620), 12MB Cache,
and 64GB RAM, running text-train.py with default options takes about 37 minutes,
where 21 minutes are for generating features from texts and 16 minutes are for training.
For prediction, text-predict.py takes 28 minutes and achieves 90.7873% accuracy. In
contrast, a general text-mining tool such as RapidMiner (Mierswa et al., 2006) fails to
even pre-process the data because of insufficient memory. On a subset of the data with
100K instances, LibShortText takes only 17 seconds and 240 MB of memory, while RapidMiner
requires 10 minutes and 3 GB of memory.
4. Conclusions
LibShortText is an easy-to-use, efficient, and extensible tool for short-text classification and
analysis. Its design and implementation incorporate properties of short texts. The package
is continuously improved based on new research results and users' needs.
2. We cannot redistribute the data; see http://developer.ebay.com/join/licenses/individual.
References
L. Bottou, C. Cortes, J. Denker, H. Drucker, I. Guyon, L.D. Jackel, Y. LeCun, U.A. Muller,
E. Sackinger, P. Simard, and V. Vapnik. Comparison of classifier methods: a case study
in handwriting digit recognition. In International Conference on Pattern Recognition,
pages 77–87. IEEE Computer Society Press, 1994.
Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based
vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR:
A library for large linear classification. Journal of Machine Learning Research, 9:
1871–1874, 2008. URL http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf.
Anthony Khoo, Yuval Marom, and David Albrecht. Experiments with sentence classification.
In Australasian Language Technology Workshop, 2006.
Ingo Mierswa, Michael Wurst, Ralf Klinkenberg, Martin Scholz, and Timm Euler. YALE:
Rapid prototyping for complex data mining tasks. In Proceedings of the 12th ACM
SIGKDD international conference on Knowledge discovery and data mining (KDD),
pages 935–940, 2006.
Bo Qu, Gao Cong, Cuiping Li, Aixin Sun, and Hong Chen. An evaluation of classification
models for question topic categorization. Journal of the American Society for Information
Science and Technology, 63:889–903, 2012.
Dan Shen, Jean-David Ruvini, and Badrul Sarwar. Large-scale item categorization for
e-commerce. In Proceedings of the 21st ACM international conference on Information and
knowledge management (CIKM), pages 595–604, 2012.
Hsiang-Fu Yu, Chia-Hua Ho, Prakash Arunachalam, Manas Somaiya, and Chih-Jen Lin.
Product title classification versus text classification. Technical report, 2012. URL
http://www.csie.ntu.edu.tw/~cjlin/papers/title.pdf.
Dell Zhang and Wee Sun Lee. Question classification using support vector machines. In
Proceedings of the 26th annual international ACM SIGIR conference on Research and
development in information retrieval, pages 26–32, 2003.