Extracting Parallel Texts From The Web: 2010 Second International Conference On Knowledge and Systems Engineering
Abstract—A parallel corpus is a valuable resource for important applications of natural language processing such as statistical machine translation, dictionary construction, and cross-language information retrieval. The Web is a huge resource of knowledge which partly contains bilingual information in various kinds of web pages, and it currently attracts many studies on building parallel corpora from Internet resources. However, obtaining a parallel corpus with high accuracy is still a challenge. This paper focuses on extracting parallel texts from bilingual web sites for the English-Vietnamese language pair. We first propose a new way of designing content-based features, and then combine them with structural features under a machine learning framework. In our experiments we obtain 88.2% precision for the extracted parallel texts.

I. INTRODUCTION

Parallel corpora have been used in many research areas of natural language processing. For example, parallel texts are used to connect vocabularies in cross-language information retrieval [5], [6], [9]. Moreover, extracting semantically equivalent components of parallel texts, such as words, phrases, and sentences, is useful for bilingual dictionary construction [1] and statistical machine translation [2], [7]. However, the available parallel corpora are not only relatively small but also unbalanced [15], even for the major languages. With the development of the Internet, the World Wide Web has become a huge database of multilingual documents, and it is therefore useful for bilingual text processing.

Up to now, several systems have been built for mining parallel corpora. These studies can be divided into three main kinds: content-based (CB), structure-based (SB), and combinations of both methods. In the CB approach, [3], [13], [14] use a bilingual dictionary to match word pairs in the two languages. Meanwhile, the SB approach [11], [12] relies on analyzing the HTML structure of pages. Other studies, such as [4], [10], have combined the two methods to improve the performance of their systems.

Parallel web pages on a site generally have comparable structures and contents. Therefore, a large number of these studies focus on finding characteristics of HTML structures such as URL links, filenames, and HTML tags. PTMiner [3] extracts bilingual English-Chinese documents. This system uses search engines to locate hosts containing parallel web pages. To generate candidate pairs, PTMiner uses a URL-matching process (e.g., the Chinese translation of a URL such as "http://www.foo.ca/english-index.html" might be "http://www.foo.ca/chinese-index.html") and other features such as size, date, etc. Note that this criterion does not apply to most bilingual English-Vietnamese web sites. STRAND [10] takes a similar approach to PTMiner, except that it handles the case where URL matching requires multiple substitutions. This system also proposes a new method that combines content-based and structural matching by using a cross-language similarity score as an additional parameter of the structure-based method.

To our knowledge, there are few studies in this field related to Vietnamese. [14] built an English-Vietnamese parallel corpus based on content-based matching. First, candidate web page pairs are found using sentence-length and date features. Then, the similarity of content is measured using a bilingual English-Vietnamese dictionary, and the decision whether two pages are parallel is made using thresholds over this measure. Note that this system only searches for parallel pages that are good translations of each other and are written in the same style. Moreover, word-to-word translation causes much ambiguity. This approach is therefore difficult to extend as the data grow, or to apply to bilingual web sites with various styles.

In this paper, we aim to automatically extract English-Vietnamese parallel texts from bilingual news web sites. Encouraged by [10], we formulate this problem as a classification problem in order to exploit as much as possible both structural information and the similarity of content. It is worth emphasizing that, unlike previous studies [3], [10], we use cognate information in place of word-to-word translation. From our observation, when a text is translated from one language to another, some special parts are kept unchanged or changed only a little; these parts are usually abbreviations, proper nouns, and numbers. In addition, we also use other content-based features, such as token lengths and paragraph lengths, which likewise do not require any linguistic analysis. It is worth noting that with this approach we do not need any dictionary, so we believe it can be applied to other language pairs. Our experiments are conducted on web sites containing English-Vietnamese documents, including BBC (http://www.bbc.co.uk), VietnamPlus (http://www.vietnamplus.vn), and VOA (http://www.voanews.com).

The rest of this paper is organized as follows. Section II presents our proposed model, including its general architecture and how the structural and content-based features are designed and computed. Section III presents the experimental results.
According to our observations, units of numbers in English are often retained when translated into Vietnamese, so we do not consider the case where the units differ (e.g., inch vs. cm, pound vs. kg, USD vs. VND, etc.).

We use a list containing corresponding names between English and Vietnamese, including names of countries, continents, dates, etc. However, the names of countries in English can be translated into Vietnamese in different ways. Therefore, we only consider those English names whose corresponding Vietnamese names have been published on the Wikipedia site. For example, "Mexico" in English and "Mêhicô" or "Mễ Tây Cơ" in Vietnamese are treated as the same.

The following is an example of two corresponding English and Vietnamese texts:

Vietnam and Italy through three cooperation programmes beginning in 1998 have so far signed more than 60 projects on joint scientific research. Of the figure, 40 projects have been carried out and brought good results.

Từ 1998 đến nay, Việt Nam và Italy ký kết hơn 60 dự án hợp tác nghiên cứu chung, có khoảng 40 dự án được triển khai thực hiện và đạt được kết quả tích cực.

Scanning these texts, we obtain T1 = {"Vietnam", "Italy", "1998", "60", "40"} and T2 = {"1998", "Việt Nam", "Italy", "60", "40"}. We measure the similarity of cognates between A and B using the algorithm presented in figure 2.

...reasonable threshold of the rate between the two texts so that it will keep potential candidates.

The third feature estimates the ratio of paragraphs in the two texts. In our opinion, two parallel texts often have similar numbers of paragraphs, so a feature representing this criterion is necessary.

C. Structure Analysis Module

Besides finding candidate pairs based on the content of the texts, the similarity of the structure of the HTML pages also provides useful information for determining whether a pair of web pages is a mutual translation. This method relies on the hypothesis that parallel web pages are presented with similar structures. Note that this approach does not require linguistic knowledge. For the structural features we follow the approach presented in [10]. The structure analysis module is implemented in two steps. In the first step, both documents of the candidate pair are analyzed by a markup analyzer that acts as a transducer, producing a linear sequence containing three kinds of tokens [10]: [START:element_label], [END:element_label], and [Chunk:length]. In the second step, we align the linearized sequences using a dynamic programming algorithm.

After alignment, we compute four scalar values that characterize the quality of the alignment [10]:

dp  The difference percentage, indicating non-shared material.
n   The number of aligned non-markup text chunks of unequal length.
r   The correlation of lengths of the aligned non-markup chunks.
p   The significance level of the correlation r.
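To make the cognate-based similarity concrete, the following sketch extracts number, abbreviation, and capitalized tokens and scores their overlap. This is a simplified illustration only: the exact matching algorithm of figure 2 and the published name-correspondence list are not reproduced here, and the Dice-style score is one plausible choice rather than the paper's definition. The Vietnamese sample is written without diacritics for simplicity.

```python
import re

def extract_cognates(text):
    """Collect tokens that tend to survive translation unchanged:
    numbers, all-caps abbreviations, and capitalized tokens
    (a rough stand-in for proper nouns)."""
    cognates = []
    for tok in re.findall(r"[A-Za-z0-9]+", text):
        if tok.isdigit():                     # numbers, e.g. "1998", "60"
            cognates.append(tok)
        elif tok.isupper() and len(tok) > 1:  # abbreviations, e.g. "BBC"
            cognates.append(tok)
        elif tok[0].isupper():                # capitalized, proper-noun-like
            cognates.append(tok)
    return cognates

def sim_cognates(text_a, text_b):
    """Dice-style overlap of the two cognate sets (an assumption:
    the paper does not spell out the score at this point)."""
    a, b = set(extract_cognates(text_a)), set(extract_cognates(text_b))
    if not a or not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

en = ("Vietnam and Italy through three cooperation programmes "
      "beginning in 1998 have so far signed more than 60 projects.")
vi = "Tu 1998 den nay, Viet Nam va Italy ky ket hon 60 du an."
print(sim_cognates(en, vi))  # shared cognates: "1998", "Italy", "60"
```

Note that "Vietnam" and "Viet Nam" do not match under this naive tokenization, which is exactly the kind of case the paper's name-correspondence list is meant to handle.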
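The two-step structure analysis can be sketched as follows. This is a minimal illustration of the [10]-style linearization and alignment under simplifying assumptions: a plain edit-distance dynamic program in which any two Chunk tokens may align, and only a rough difference-percentage value is derived; the quantities n, r, and p are omitted.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Flatten an HTML page into the token sequence of [10]:
    [START:tag], [END:tag], and [Chunk:length] for text runs."""
    def __init__(self):
        super().__init__()
        self.seq = []
    def handle_starttag(self, tag, attrs):
        self.seq.append(("START", tag))
    def handle_endtag(self, tag):
        self.seq.append(("END", tag))
    def handle_data(self, data):
        if data.strip():
            self.seq.append(("Chunk", len(data.strip())))

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    return parser.seq

def align(a, b):
    """Edit-distance alignment of two linearized sequences.
    Markup tokens match only exactly; two Chunk tokens always
    align (their lengths would feed the correlation r).
    Returns the number of unaligned (inserted/deleted) tokens."""
    n, m = len(a), len(b)
    d = [[i + j if i == 0 or j == 0 else 0 for j in range(m + 1)]
         for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(d[i - 1][j] + 1, d[i][j - 1] + 1)  # delete / insert
            x, y = a[i - 1], b[j - 1]
            if x == y or (x[0] == "Chunk" and y[0] == "Chunk"):
                best = min(best, d[i - 1][j - 1])          # aligned pair
            d[i][j] = best
    return d[n][m]

page_en = "<html><body><p>Hello world</p><p>Bye</p></body></html>"
page_vi = "<html><body><p>Chao the gioi</p></body></html>"
sa, sb = linearize(page_en), linearize(page_vi)
unaligned = align(sa, sb)
dp = unaligned / (len(sa) + len(sb))  # rough difference percentage
```

The second paragraph of the English page has no counterpart, so its three tokens stay unaligned and raise dp, exactly the signal the classifier uses.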
represented by a vector of these features. We label each pair 1 or 0 according to whether it is parallel or not; in this way we obtain the training data. In our system, we use a support vector machine algorithm to train the classifier. For a new pair of web pages, we first extract features to represent the pair as a vector. This vector goes through the classifier, which outputs 1 or 0, giving the answer as to whether the pair is parallel.

III. EXPERIMENT

A. Data preparation

We have explored several bilingual English-Vietnamese news sites on the World Wide Web; only a few sites have high translation quality. In this system, we experiment with 94,323 pages downloaded from three web sites: 37,665 pages from BBC (http://news.bbc.co.uk); 12,553 pages from VietnamPlus (http://en.vietnamplus.vn, http://www.vietnamplus.vn); and 44,105 pages from VOA News (http://www.voanews.com).

First, we perform a host crawl on the specified domains, and the HTML pages are downloaded. All collected data are analyzed by the CB modules to filter candidate pairs, using the following thresholds: sim_cognates(A, B) > 0.5, and a difference of publication dates of at most 1. As a result, over 90% of the pairs are excluded as non-candidates, and we obtain 1,170 candidate pairs for deciding whether each of them is parallel. Next, all data obtained from the content filter module go into the structure module to extract the designed features. We then label each candidate pair: a pair is labeled 1 if it is parallel and 0 otherwise. Of these 1,170 candidate pairs, 433 are labeled 1 and 737 are labeled 0. After that, we put the data into the format <label> <index1>:<value1> <index2>:<value2> ..., which is suitable for the LIBSVM tool (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).

B. Experimental results

We conduct a 5-fold cross-validation experiment; each fold has 234 test items and 936 training items. To investigate the effectiveness of the different kinds of features, we design three feature sets: the set containing only content-based (CB) features; the set containing only structure-based (SB) features; and the set including both kinds (CB and SB).

We also use the three well-known evaluation measures, defined as follows:

Precision = (No. of pairs correctly labeled 1) / (Total no. of pairs labeled 1 in the output)   (2)
Recall = (No. of pairs correctly labeled 1) / (Total no. of pairs labeled 1 in the test data)   (3)
F-Score = 2 * Precision * Recall / (Precision + Recall)   (4)

It is worth noting that, to compare our approach with previous approaches using content-based features, we also conduct an experiment as in [15]. That study measures the similarity of content by aligning word translations of the two texts. Here, we use a bilingual English-Vietnamese dictionary to compute a content-based similarity score. For each pair of texts (or web pages), the similarity score is defined as:

sim(A, B) = (Number of translation token pairs) / (Number of tokens in text A)   (5)

With this experiment we obtained the results shown in Table I.

TABLE I
EVALUATING CONTENT-BASED MATCHING (USING THE BILINGUAL DICTIONARY TO MATCH WORD-WORD PAIRS)

         Precision  Recall  F-Score
Fold 1   0.688      0.484   0.568
Fold 2   0.647      0.478   0.550
Fold 3   0.643      0.548   0.591
Fold 4   0.601      0.569   0.584
Fold 5   0.682      0.528   0.595
Average  0.652      0.521   0.578

Table II shows our experimental results using the content-based features, Table III shows the results using the structure-based features, and Table IV contains the results obtained by combining both kinds of features.

TABLE II
EVALUATING CONTENT-BASED MATCHING (USING FEATURES EXTRACTED BY THE CONTENT-BASED FILTERING MODULE)

         Precision  Recall  F-Score
Fold 1   0.831      0.907   0.867
Fold 2   0.823      0.864   0.843
Fold 3   0.810      0.836   0.823
Fold 4   0.878      0.765   0.818
Fold 5   0.931      0.803   0.862
Average  0.855      0.835   0.843

TABLE III
EVALUATING STRUCTURE-BASED MATCHING

         Precision  Recall  F-Score
Fold 1   0.409      0.620   0.493
Fold 2   0.518      0.614   0.562
Fold 3   0.397      0.614   0.482
Fold 4   0.451      0.763   0.567
Fold 5   0.444      0.654   0.529
Average  0.444      0.653   0.529

It is worth noting that in this task (extracting a parallel corpus), precision is the most important criterion for evaluating the effectiveness of the system. From the results shown in the above tables we can see that the precision of the CB feature set (85.5%) is much higher than that of the
SB feature set (44.4%). Our approach to extracting CB features is also much better than the approach of [15], which obtains only 65.2% precision. The combination of both kinds of features gives the best result, with a precision of 88.2%.

TABLE IV
COMBINING STRUCTURAL AND CONTENT-BASED MATCHING

         Precision  Recall  F-Score
Fold 1   0.873      0.817   0.844
Fold 2   0.862      0.842   0.852
Fold 3   0.869      0.879   0.874
Fold 4   0.904      0.817   0.858
Fold 5   0.904      0.733   0.810
Average  0.882      0.817   0.848

These results show that the CB features we propose are very effective. They also suggest that when we are not sure about the structural correspondence between two web pages, we can use the content-based features alone.

IV. CONCLUSION

This paper presents our work on extracting a parallel corpus from the Web for the English-Vietnamese language pair. We have proposed a new approach for measuring the similarity of content of two pages that does not require deep linguistic analysis. We have utilized both structural features and content-based features under a machine learning framework. The obtained results show that the proposed content-based features are the major information for determining whether a pair of web pages is parallel. In addition, our approach can be applied to other language pairs because the features used in the proposed model are language-independent. In the future we will extend our work to extracting smaller parallel components such as paragraphs, sentences, or phrases. This will also be interesting in cases where the translation quality between bilingual web pages is not good.

ACKNOWLEDGMENT

This work is supported by NAFOSTED (Vietnam's National Foundation for Science and Technology Development).

REFERENCES

[1] Akira Kumano and Hideki Hirakawa. 1994. Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proc. 15th COLING, pages 76-81.
[2] Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R., Roosin, P. 1990. "A statistical approach to machine translation". Computational Linguistics, 16(2), 79-85.
[3] Chen, J., Nie, J.Y. 2000. "Automatic construction of parallel English-Chinese corpus for cross-language information retrieval". Proc. ANLP, pp. 21-28, Seattle.
[4] Chen, J., Chau, R., Yeh, C.-H. 2004. Discovering Parallel Text from the World Wide Web. In Proc. Australasian Workshop on Data Mining and Web Intelligence (DMWI2004).
[5] Davis, M., Dunning, T. 1995. "A TREC evaluation of query translation methods for multi-lingual text retrieval". Fourth Text Retrieval Conference (TREC-4). NIST.
[6] Martin Volk, Spela Vintar, Paul Buitelaar. 2003. "Ontologies in Cross-Language Information Retrieval". Wissensmanagement 2003: 43-50.
[7] Melamed, I. D. 1998. "Word-to-word models of translation equivalence". IRCS technical report 98-08, University of Pennsylvania.
[8] Michel Simard, George F. Foster, Pierre Isabelle. "Using Cognates to Align Sentences in Bilingual Corpora".
[9] Oard, D. W. 1997. "Cross-language text retrieval research in the USA". Third DELOS Workshop. European Research Consortium for Informatics and Mathematics.
[10] P. Resnik and N. A. Smith. 2003. "The Web as a Parallel Corpus". Computational Linguistics, 29(3):349-380.
[11] Resnik, Philip. 1998. Parallel strands: A preliminary investigation into mining the Web for bilingual text. In Proceedings of the Third Conference of the Association for Machine Translation in the Americas (AMTA-98), Langhorne, PA, October 28-31.
[12] Resnik, Philip. 1999. Mining the Web for bilingual text. In Proceedings of the 37th Annual Meeting of the ACL, pages 527-534, College Park, MD, June.
[13] Takehito Utsuro, Hiroshi Ikeda, Masaya Yamane, Yuji Matsumoto, and Makoto Nagao. 1994. Bilingual text matching using bilingual dictionary and statistics. In Proc. 15th COLING, pages 1076-1082.
[14] Van B. Dang, Ho Bao-Quoc. 2007. Automatic Construction of an English-Vietnamese Parallel Corpus through Web Mining. In Proceedings of the 5th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF2007), Hanoi, Vietnam.
[15] Xiaoyi Ma, Mark Liberman. 1999. BITS: A method for bilingual text search over the Web. Machine Translation Summit VII, September 1999.