
2010 Second International Conference on Knowledge and Systems Engineering

Extracting Parallel Texts from the Web


Le Quang Hung
Faculty of Information Technology
Quynhon University, Vietnam
Email: hungqnu@gmail.com

Le Anh Cuong
University of Engineering and Technology
Vietnam National University, Hanoi
Email: cuongla@vnu.edu.vn

Abstract—A parallel corpus is a valuable resource for important applications of natural language processing such as statistical machine translation, dictionary construction, and cross-language information retrieval. The Web is a huge resource of knowledge that partly contains bilingual information in various kinds of web pages, and it currently attracts many studies on building parallel corpora from Internet resources. However, obtaining a parallel corpus with high accuracy is still a challenge. This paper focuses on extracting parallel texts from bilingual web sites for the English-Vietnamese language pair. We first propose a new way of designing content-based features, and then combine them with structural features in a machine learning framework. In our experiments we obtain 88.2% precision for the extracted parallel texts.

I. INTRODUCTION

Parallel corpora have been used in many research areas of natural language processing. For example, parallel texts are used to connect vocabularies in cross-language information retrieval [5], [6], [9]. Moreover, extracting semantically equivalent components of parallel texts, such as words, phrases, and sentences, is useful for bilingual dictionary construction [1] and statistical machine translation [2], [7]. However, the available parallel corpora are not only relatively small but also unbalanced [15], even for the major languages. With the development of the Internet, the World Wide Web has become a huge database of multi-language documents, and it is therefore useful for bilingual text processing.

Up to now, several systems have been built for mining parallel corpora. These studies can be divided into three main kinds: content-based (CB), structure-based (SB), and combinations of both methods. In the CB approach, [3], [13], [14] use a bilingual dictionary to match word pairs between the two languages. Meanwhile, the SB approach [11], [12] relies on analyzing the HTML structure of the pages. Other studies, such as [4], [10], have combined the two methods to improve the performance of their systems.

Parallel web pages on a site generally have comparable structures and contents. Therefore, many of these studies focus on finding characteristics of HTML structures such as URL links, filenames, and HTML tags. PTMiner [3] extracts bilingual English-Chinese documents. This system uses search engines to locate hosts containing parallel web pages. To generate candidate pairs, PTMiner uses a URL-matching process (e.g., the Chinese translation of a URL such as "http://www.foo.ca/english-index.html" might be "http://www.foo.ca/chinese-index.html") together with other features such as size and date. Note that this URL criterion does not appear in most bilingual English-Vietnamese web sites. STRAND [10] takes an approach similar to PTMiner, except that it handles the case where URL matching requires multiple substitutions. This system also proposes a method that combines content-based and structure-based matching by using a cross-language similarity score as an additional parameter of the structure-based method.

To our knowledge, there are few studies in this field related to Vietnamese. [14] built an English-Vietnamese parallel corpus based on content-based matching. First, candidate web page pairs are found using sentence-length and date features. Then, the similarity of content is measured with a bilingual English-Vietnamese dictionary, and the decision as to whether two pages are parallel is made by thresholding this measure. Note that this system only finds parallel pages that are good translations of each other and are written in the same style. Moreover, word-by-word translation causes much ambiguity. This approach is therefore difficult to scale as the data grow, or to apply to bilingual web sites with various styles.

In this paper, we aim to automatically extract English-Vietnamese parallel texts from bilingual news web sites. Encouraged by [10], we formulate this problem as a classification problem in order to exploit, as far as possible, both structural information and content similarity. It is worth emphasizing that, unlike previous studies [3], [10], we use cognate information in place of word-by-word translation. From our observation, when a text is translated from one language to another, some special parts are kept unchanged or changed only slightly. These parts are usually abbreviations, proper nouns, and numbers. In addition, we use other content-based features, such as token counts and paragraph counts, which likewise require no linguistic analysis. It is worth noting that this approach needs no dictionary, so we believe it can be applied to other language pairs. Our experiments are conducted on web sites containing English-Vietnamese documents: BBC (http://www.bbc.co.uk), VietnamPlus (http://www.vietnamplus.vn), and VOA (http://www.voanews.com).

The rest of this paper is organized as follows. Section II presents our proposed model, including its general architecture and how the structural and content-based features are designed and computed. Section III presents the experiments, in which we evaluate different feature sets. Finally, conclusions are drawn in Section IV.
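As a concrete illustration, the PTMiner-style URL-matching step quoted above could be sketched as follows; the substitution table is our own illustrative assumption, not PTMiner's actual rule set.

```python
def candidate_urls(url, substitutions=(("english", "chinese"), ("/en/", "/zh/"))):
    """Generate candidate translation URLs by substituting language
    markers in the original URL (a sketch of PTMiner-style URL matching)."""
    candidates = []
    for src, dst in substitutions:
        if src in url:
            candidates.append(url.replace(src, dst))
    return candidates

print(candidate_urls("http://www.foo.ca/english-index.html"))
# -> ['http://www.foo.ca/chinese-index.html']
```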

978-0-7695-4213-3/10 $25.00 © 2010 IEEE


DOI 10.1109/KSE.2010.14
II. THE PROPOSED MODEL

In this paper, we follow an approach that combines content-based and structure-based features of HTML pages to extract parallel texts from the Web using machine learning [10]. The machine learning algorithm used here is the Support Vector Machine (SVM). Figure 1 illustrates the general architecture of our proposed model. It includes the following tasks:

- First, we use a crawler on the specified domains to extract bilingual English-Vietnamese pages, which we call the raw data.
- Second, from the raw data, we create candidate pairs of parallel web pages by thresholding some extracted features (the content-based features and the publication-date feature).
- Third, we manually label these candidates to obtain training data: pairs of parallel web pages are assigned label 1, and non-parallel pairs are assigned label 0 (the details of this task are presented in the experiment section).
- Fourth, we extract the structural and content-based features, so that each web page pair is represented as a vector of these features. This representation is required to fit a classification model.
- Finally, we use an SVM tool to train a classification system on the training data. Given a new pair of English-Vietnamese web pages, the trained classifier decides whether it is parallel or not.

A. Host crawling

Bilingual English-Vietnamese web pages are collected by crawling the Web with a Web spider, as in [4]. A Web spider is a software tool that traverses a site to gather web pages by following the hyperlinks appearing in them. For this process, our system uses Teleport Pro to retrieve web pages from remote web sites. Teleport Pro is a tool designed to download documents from the Web via the HTTP and FTP protocols and store the extracted data on disk [15]. Note that we select the URLs on the specified hosts of the three news sites: BBC, VietnamPlus, and VOA News. For example, on the BBC site the URL for English is "http://news.bbc.co.uk/english/" and for Vietnamese is "http://www.bbc.co.uk/vietnamese/". We then use Teleport Pro to download the HTML pages and obtain the candidate web pages.

B. Content-based Filtering Module

Using content-based features, we want to determine whether two pages are mutual translations. As [15] pointed out, not all translators create translated pages that look like the original page. Moreover, SB matching is applicable only to corpora that include markup, and there are certainly multilingual collections on the Web and elsewhere that contain parallel text without structural tags [10]. Many studies have used the content-based approach to build a parallel corpus from the Web, such as [4], [14]. They use a bilingual dictionary to measure the similarity of the contents of two texts. However, this method can cause much ambiguity, because a word has many possible translations; for English-Vietnamese, one word in English can correspond to multiple words in Vietnamese. In this paper, we propose a different approach that is cheap in terms of resources. It is based on the observation that a document usually contains some cognates, and that if two documents are mutual translations then the cognates are usually kept the same in both of them (in linguistics, cognates are words that have a common etymological origin; see http://en.wikipedia.org/wiki/Cognate). Note that [8] also used cognates, but for sentence alignment. From our observation, we divide the tokens considered as cognates into three kinds:

1) Abbreviations (e.g., "EU", "WTO")
2) Proper nouns in English (e.g., "Vietnam", "Paris")
3) Numbers (e.g., "10", "2010")

Now we can design a feature for measuring content similarity based on cognates. This feature is computed as the ratio between the number of corresponding cognates in the two texts and the number of tokens in one text (e.g., the English text).

Given a pair of texts (A, B), where A is English and B is Vietnamese, we obtain the cognate token sets T1 and T2 from A and B, respectively.
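To make the feature concrete, the three kinds of cognate tokens and the cognate ratio described above might be collected and compared as in the following sketch. The tokenization and surface patterns are our own illustrative assumptions; the paper does not give an implementation.

```python
import re
from collections import Counter

def extract_cognates(tokens):
    """Collect the three kinds of cognate tokens described above:
    abbreviations, English proper nouns, and numbers (illustrative patterns)."""
    kinds = [
        r"[A-Z]{2,}",         # abbreviations: "EU", "WTO"
        r"[A-Z][a-z]+",       # proper nouns: "Vietnam", "Paris"
        r"\d+(?:[.,]\d+)?",   # numbers: "10", "2010"
    ]
    return [t for t in tokens if any(re.fullmatch(k, t) for k in kinds)]

def sim_cognates(t1, t2):
    """Ratio of corresponding cognates to the number of tokens in T1
    (the English side); multiset matching is our reading of the paper."""
    if not t1:
        return 0.0
    return sum((Counter(t1) & Counter(t2)).values()) / len(t1)

T1 = extract_cognates("Vietnam and Italy signed more than 60 projects in 1998".split())
print(T1)                                               # ['Vietnam', 'Italy', '60', '1998']
print(sim_cognates(T1, ["1998", "Italy", "60", "40"]))  # 3 of 4 matched -> 0.75
```

In a full system the token normalizations described next (number words, name lists) would be applied before the comparison.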

For robust matching between cognates, we make some modifications to the original tokens:

- A number written as a sequence of letters of the English alphabet is converted into a real number. According to our observations, the units of numbers in English are often retained when translated into Vietnamese, so we do not consider cases where the units differ (e.g., inch vs. cm, pound vs. kg, USD vs. VND, etc.).
- We use a list of corresponding names between English and Vietnamese, including names of countries, continents, dates, etc. However, the names of countries in English can be translated into Vietnamese in different ways. Therefore, we only consider those English names whose corresponding Vietnamese names have been published on Wikipedia (http://vi.wikipedia.org/wiki/Danh_sách_quốc_gia). For example, "Mexico" in English and "Mêhicô" or "Mễ Tây Cơ" in Vietnamese are treated as the same.

The following is an example of two corresponding English and Vietnamese texts:

"Vietnam and Italy through three cooperation programmes beginning in 1998 have so far signed more than 60 projects on joint scientific research. Of the figure, 40 projects have been carried out and brought good results."

"Từ 1998 đến nay, Việt Nam và Italy ký kết hơn 60 dự án hợp tác nghiên cứu chung, có khoảng 40 dự án được triển khai thực hiện và đạt được kết quả tích cực."

Scanning these texts, we obtain T1 = {"Vietnam", "Italy", "1998", "60", "40"} and T2 = {"1998", "Việt Nam", "Italy", "60", "40"}. We measure the similarity of cognates between A and B using the algorithm presented in Figure 2. If sim_cognates(A, B) is greater than a threshold, then the pair (A, B) is a candidate. sim_cognates(A, B) is calculated as in formula (1):

sim_cognates(A, B) = Number of corresponding cognates between T1 and T2 / Number of tokens in T1   (1)

In addition to cognates, we observe that text length and the number of paragraphs also provide evidence for measuring the content similarity of two texts: parallel texts usually have similar text lengths and similar numbers of paragraphs. Therefore, given a pair of texts, we design three features as follows:

- The first feature estimates the cognate-based similarity. It is computed by formula (1).
- The second feature estimates the similarity of text lengths. A way to filter out wrong pairs is to compare the lengths of the two texts in characters. We set a reasonable threshold on the length ratio of the two texts so that potential candidates are kept.
- The third feature estimates the ratio of the numbers of paragraphs of the two texts. In our opinion, two parallel texts often have similar numbers of paragraphs, so a feature representing this criterion is necessary.

C. Structure Analysis Module

Besides finding candidate pairs based on the content of the texts, the similarity of the structure of the HTML pages also provides useful information for determining whether a pair of web pages is a mutual translation. This method rests on the hypothesis that parallel web pages are presented with similar structures; note that it requires no linguistic knowledge. For the structural features, we follow the approach presented in [10]. The structure analysis module is implemented in two steps. In the first step, both documents of a candidate pair are analyzed by a markup analyzer that acts as a transducer, producing a linear sequence containing three kinds of token [10]: [START:element_label], [END:element_label], and [Chunk:length]. In the second step, we align the linearized sequences using a dynamic programming algorithm.

After alignment, we compute four scalar values that characterize the quality of the alignment [10]:

- dp: the difference percentage, indicating non-shared material;
- n: the number of aligned non-markup text chunks of unequal length;
- r: the correlation of the lengths of the aligned non-markup chunks;
- p: the significance level of the correlation r.

In addition, we observe that on a bilingual news site the translated page is usually created within a short period after the original page is published. Therefore, this feature can eliminate many pairs that are not parallel. For example, on bilingual news sites such as BBC and VOA, the Vietnamese pages are published on the same day or one day later than the corresponding English pages [14]. To extract this information, we analyzed the HTML tags and grouped this feature (publication date) into the structural feature set. Note that this information is extracted from different HTML formats (e.g., a META tag on the BBC site: <META name="OriginalPublicationDate" content="2009/11/22 12:29:48"/>; a SPAN tag on the VietnamPlus site: <SPAN id="ctl00_mContent_lbDate" class="timeDate">10/04/2009</SPAN>; etc.).

D. Classification modeling

From the content-based and structure analysis of each pair of web pages, we obtain features of two categories: content features and structural features. The content features are sim_cognates(A, B), text length, and number of paragraphs. The structural features are dp, n, r, p, and publication date. It is now easy to formulate the task as a classification problem.
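To make the structural step of Section II-C concrete, the linearization can be approximated with Python's standard HTML parser. This is a minimal sketch of the transducer of [10], not the authors' implementation; details such as attribute handling may differ.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Produce the [START:label], [END:label], [Chunk:length] sequence
    described above (a sketch; the transducer in [10] may differ)."""
    def __init__(self):
        super().__init__()
        self.seq = []

    def handle_starttag(self, tag, attrs):
        self.seq.append(("START", tag))

    def handle_endtag(self, tag):
        self.seq.append(("END", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only chunks between tags
            self.seq.append(("Chunk", len(text)))

lin = Linearizer()
lin.feed("<body><p>Hello world</p></body>")
print(lin.seq)
# -> [('START', 'body'), ('START', 'p'), ('Chunk', 11), ('END', 'p'), ('END', 'body')]
```

Aligning two such sequences with a dynamic programming algorithm (in the style of edit-distance alignment) then yields the dp, n, r, and p statistics.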

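The publication-date feature of Section II-C can likewise be sketched with simple patterns over the two tag formats quoted above. The regular expressions below are our own assumptions tailored to those two examples; real pages may use other formats.

```python
import re

# Hypothetical patterns for the two date formats quoted in the text
# (BBC META tag, VietnamPlus SPAN tag).
META_RE = re.compile(r'name\s*=\s*"OriginalPublicationDate"\s+content\s*=\s*"([\d/: ]+)"')
SPAN_RE = re.compile(r'id\s*=\s*"ctl00_mContent_lbDate"[^>]*>([\d/]+)<')

def publication_date(html):
    """Return the raw publication-date string found in either format, or None."""
    for pattern in (META_RE, SPAN_RE):
        m = pattern.search(html)
        if m:
            return m.group(1)
    return None

print(publication_date('<META name = "OriginalPublicationDate" content = "2009/11/22 12:29:48"/>'))
# -> 2009/11/22 12:29:48
```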
Each candidate pair of web pages is represented by a vector of these features. We label a pair 1 if it is parallel and 0 otherwise, and in this way we obtain the training data. In our system, we use a support vector machine algorithm to train a classification system. For a new pair of web pages, we first extract the features to obtain its vector representation; this vector is then passed through the classifier, which outputs 1 or 0, i.e., the answer as to whether the pair is parallel or not.

III. EXPERIMENT

A. Data preparation

We have explored several bilingual English-Vietnamese news sites on the World Wide Web; only a few have high translation quality. In this system, we experiment with 94,323 pages downloaded from three web sites: 37,665 pages from BBC (http://news.bbc.co.uk), 12,553 pages from VietnamPlus (http://en.vietnamplus.vn, http://www.vietnamplus.vn), and 44,105 pages from VOA News (http://www.voanews.com).

First, we perform host crawling on the specified domains and download the HTML pages. All collected data are analyzed by the CB modules to filter candidate pairs. We used the following thresholds:

- sim_cognates(A, B) > 0.5;
- the difference between publication dates is at most one day.

As a result, over 90% of the pairs were excluded as non-candidates, leaving 1,170 candidate pairs for the decision of whether each of them is parallel. Next, all data obtained from the content filter module go into the structure module to extract the designed features. We then labeled each candidate pair: a pair is labeled 1 if it is parallel, and 0 otherwise. Of these 1,170 candidate pairs, 433 are labeled 1 and 737 are labeled 0. After that, we put the data into the format <label> <index1>:<value1> <index2>:<value2> ..., which is suitable for the LIBSVM tool (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).

B. Experimental results

We conduct a 5-fold cross-validation experiment; each fold has 234 test items and 936 training items. To investigate the effectiveness of different kinds of features, we design three feature sets: a set containing only content-based (CB) features; a set containing only structure-based (SB) features; and a set including both kinds of features (CB and SB). We use the three well-known evaluation measures:

Precision = No. of pairs correctly labeled 1 / Total no. of pairs labeled 1 in the output   (2)

Recall = No. of pairs correctly labeled 1 / Total no. of pairs labeled 1 in the test data   (3)

F-Score = 2 * Precision * Recall / (Precision + Recall)   (4)

It is worth noting that, to compare our content-based features with previous approaches, we also conduct an experiment like that in [15], which measures content similarity by aligning word translations of the two texts. Here, we use a bilingual English-Vietnamese dictionary to compute a content-based similarity score. For each pair of texts (or web pages), the similarity score is defined as:

sim(A, B) = Number of translation token pairs / Number of tokens in text A   (5)

The results of this experiment are shown in Table I.

TABLE I. EVALUATING CONTENT-BASED MATCHING (USING THE BILINGUAL DICTIONARY TO MATCH WORD-WORD PAIRS)

         Precision  Recall  F-Score
Fold 1   0.688      0.484   0.568
Fold 2   0.647      0.478   0.550
Fold 3   0.643      0.548   0.591
Fold 4   0.601      0.569   0.584
Fold 5   0.682      0.528   0.595
Average  0.652      0.521   0.578

Table II shows our experimental results using the content-based features, Table III shows the results using the structure-based features, and Table IV contains the results obtained by combining both kinds of features.

TABLE II. EVALUATING CONTENT-BASED MATCHING (USING THE FEATURES EXTRACTED BY THE CONTENT-BASED FILTERING MODULE)

         Precision  Recall  F-Score
Fold 1   0.831      0.907   0.867
Fold 2   0.823      0.864   0.843
Fold 3   0.810      0.836   0.823
Fold 4   0.878      0.765   0.818
Fold 5   0.931      0.803   0.862
Average  0.855      0.835   0.843

TABLE III. EVALUATING STRUCTURE-BASED MATCHING

         Precision  Recall  F-Score
Fold 1   0.409      0.620   0.493
Fold 2   0.518      0.614   0.562
Fold 3   0.397      0.614   0.482
Fold 4   0.451      0.763   0.567
Fold 5   0.444      0.654   0.529
Average  0.444      0.653   0.529

It is worth noting that in this task (extracting a parallel corpus), precision is the most important criterion for evaluating the effectiveness of the system.
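For reference, the measures of formulas (2)-(4) over 0/1 labels can be sketched as:

```python
def evaluate(gold, predicted):
    """Precision, recall, and F-score over 0/1 labels, per formulas (2)-(4)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    pred_pos = sum(predicted)   # pairs labeled 1 in the output
    gold_pos = sum(gold)        # pairs labeled 1 in the test data
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

p, r, f = evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(p, r, f)  # 2 of 3 predicted positives are correct; 2 of 3 gold positives found
```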

TABLE IV. COMBINING STRUCTURAL AND CONTENT-BASED MATCHING

         Precision  Recall  F-Score
Fold 1   0.873      0.817   0.844
Fold 2   0.862      0.842   0.852
Fold 3   0.869      0.879   0.874
Fold 4   0.904      0.817   0.858
Fold 5   0.904      0.733   0.810
Average  0.882      0.817   0.848

According to the results shown in the above tables, the precision obtained with the CB feature set (85.5%) is much higher than that obtained with the SB feature set (44.4%). Our approach to extracting CB features is also much better than the approach of [15], which obtains only 65.2% precision. The combination of both kinds of features gives the best result, with a precision of 88.2%.

These results show that the CB features we propose are very effective. They also suggest that when we are not sure about the structural correspondence between two web pages, we can use the content-based features alone.

IV. CONCLUSION

This paper has presented our work on extracting a parallel corpus from the Web for the English-Vietnamese language pair. We have proposed a new approach to measuring the content similarity of two pages that requires no deep linguistic analysis, and we have utilized both structural and content-based features in a machine learning framework. The results show that the proposed content-based features are the major source of information for determining whether a pair of web pages is parallel. In addition, our approach can be applied to other language pairs, because the features used in the proposed model are language-independent. In the future, we will extend our work to extracting smaller parallel components such as paragraphs, sentences, or phrases. This will also be interesting in cases where the translation quality between bilingual web pages is not good.

ACKNOWLEDGMENT

This work is supported by NAFOSTED (Vietnam's National Foundation for Science and Technology Development).

REFERENCES

[1] Akira Kumano and Hideki Hirakawa. 1994. Building an MT dictionary from parallel texts based on linguistic and statistical information. In Proc. 15th COLING, pages 76-81.
[2] Brown, P., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R., Roossin, P. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85.
[3] Chen, J., Nie, J.Y. 2000. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. In Proc. ANLP, pages 21-28, Seattle.
[4] Chen, J., Chau, R., Yeh, C.-H. 2004. Discovering parallel text from the World Wide Web. In Proc. Australasian Workshop on Data Mining and Web Intelligence (DMWI2004).
[5] Davis, M., Dunning, T. 1995. A TREC evaluation of query translation methods for multi-lingual text retrieval. Fourth Text Retrieval Conference (TREC-4), NIST.
[6] Martin Volk, Spela Vintar, Paul Buitelaar. 2003. Ontologies in cross-language information retrieval. Wissensmanagement 2003, pages 43-50.
[7] Melamed, I. D. 1998. Word-to-word models of translation equivalence. IRCS Technical Report 98-08, University of Pennsylvania.
[8] Michel Simard, George F. Foster, Pierre Isabelle. Using cognates to align sentences in bilingual corpora.
[9] Oard, D. W. 1997. Cross-language text retrieval research in the USA. Third DELOS Workshop, European Research Consortium for Informatics and Mathematics.
[10] P. Resnik and N. A. Smith. 2003. The Web as a parallel corpus. Computational Linguistics, 29(3):349-380.
[11] Resnik, Philip. 1998. Parallel strands: A preliminary investigation into mining the Web for bilingual text. In Proc. Third Conference of the Association for Machine Translation in the Americas (AMTA-98), Langhorne, PA, October 28-31.
[12] Resnik, Philip. 1999. Mining the Web for bilingual text. In Proc. 37th Annual Meeting of the ACL, pages 527-534, College Park, MD, June.
[13] Takehito Utsuro, Hiroshi Ikeda, Masaya Yamane, Yuji Matsumoto, and Makoto Nagao. 1994. Bilingual text matching using bilingual dictionary and statistics. In Proc. 15th COLING, pages 1076-1082.
[14] Van B. Dang, Ho Bao-Quoc. 2007. Automatic construction of English-Vietnamese parallel corpus through web mining. In Proc. 5th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF 2007), Hanoi, Vietnam.
[15] Xiaoyi Ma, Mark Liberman. 1999. BITS: A method for bilingual text search over the Web. Machine Translation Summit VII, September 1999.
