Professional Documents
Culture Documents
Text Mining of English Materials For Business Management: Hiromi Ban, Haruhiko Kimura, Takashi Oyabu
Text Mining of English Materials For Business Management: Hiromi Ban, Haruhiko Kimura, Takashi Oyabu
238 www.erpublication.org
Text Mining of English Materials for Business Management
From this function, we are able to derive coefficients c and b Although we cannot see a positive correlation between
[3]. The distribution of coefficients c and b extracted from coefficients c and b such as in the case of
each material is shown in Figure 1. character-appearance, the values for Materials 1 to 4 are
relatively similar and we might be able to regard them as a
0.14
cluster.
Characters 2. Competitive As a method of featuring words used in writing, a
0.13
statistician named Udny Yule suggested an index called the
K-characteristic in 1944 [5]. This can express the richness
Coefficient b
0.1378 (Material 4). On the other hand, in the case of the (63.772)
239 www.erpublication.org
International Journal of Engineering and Technical Research (IJETR)
ISSN: 2321-0869 (O) 2454-4698 (P), Volume-3, Issue-8, August 2015
basic word. Thus, we can calculate how many required or 1) Average of word length
basic words are not contained in each piece of material in As for the average of word length for the four materials
terms of word-sort and frequency. for business management, it varies from 6.071 letters for
Thus, we calculated the values of both Dws and Dwn to show Material 1 to 6.378 letters for Material 4. They are a little
how difficult the materials are for readers, and to show at longer than Computing Essentials (5.808 letters) and
which level of English the materials are compared with other journalism (5.853 to 5.980 letters). It seems that this is
materials. In order to make the judgments of difficulty easier because the materials for business management contain many
for the general public, we derived one difficulty parameter long-length technical terms for management such as
from Dws and Dwn using the following principal component MARKETING and ACCOUNTING.
analysis:
2) Number of words per sentence
z = a1 * Dws + a2 * Dwn (5) The number of words per sentence for Material 2 is
where a1 and a2 are the weights used to combine Dws and Dwn. 27.096 words, which is the most of the eight materials, and
Using the variance-covariance matrix, the 1st principal approximately 10 words more than BusinessWeek (17.878
component z was extracted: z = (0.5672 * Dws + 0.8236 * words), which is the fewest. From this point of view, the
Dwn)for the required vocabulary, and z = (0.4636 * Dws + Material 2 seems to be rather difficult to read. In the case of
0.8861 * Dwn) for the basic vocabulary, from which we other three materials for business management, it is 19.002
calculated the principal component scores. The results are (Material 4) to 22.537 (Material 3) words, which are a little
shown in Figure 4. fewer than TIME (24.931 words) and almost the same as
Computing Essentials (19.546 words) and The Economist
(21.682 words).
(Grade of difficulty )
0.05
0.04
3) Number of commas per sentence
0.03 The number of commas per sentence for Materials 1 to 4
0.02
is from 1.062 (Material 4) to 1.376 (Material 1), which is
Japan almost the same as the three journalism (1.122 to 1.389).
0.01 508
words)
0.00 U.S.
4) Frequency of auxiliaries
798
-0.01 words There are two kinds of auxiliaries in a broad sense. One
-0.02 expresses the tense and voice, such as BE which makes up the
-0.03
progressive form and the passive form, the perfect tense
HAVE, and DO in interrogative sentences or negative
-0.04
sentences. The other is a modal auxiliary, such as WILL or
-0.05
ce itiv
e
n ci
al
etin
g E ist eek tin
g CAN which expresses the mood or attitude of the speaker [8].
llen pet ina ark TIM on
om sW pu
xce om Ec nes m
1. E 2. C 3. F 4. M Bu
si Co In this study, we targeted only modal auxiliaries. As for the
result, the frequency of auxiliaries is highest in Material 2
Figure 4: Principal component scores of difficulty shown in (2.438%), which is more than three times of Material 1
one-dimension. (0.801%) and twice of TIME (1.125%). Therefore, it might
be said that while the writer of Material 2 tends to
According to Figure 4, the difficulty level increases in the communicate his subtle thoughts and feelings with auxiliary
order of Material 1, Material 2, Material 3 and Material 4. verbs, the style of Material 1 and TIME can be called more
The difficulty of these four materials much varies: while the assertive.
easiest Material 1 is a little more difficult than Computing
Essentials, which is the easiest of the eight materials, because E. Characteristics of Preposition, Relative, Auxiliary, and
it is an introductory book, the most difficult Material 4 is more Personal Pronoun Appearance
difficult than TIME and The Economist. On the other hand, in Next, we examined in detail the prepositions, relatives,
the case of the basic vocabulary, Material 3 is a little more modal auxiliaries, and personal pronouns of each
difficult than Material 4. We can judge that the three material. We valued each part of speech used in each material
materials for business management, that is, Materials 2, 3 and at 100%, and checked the kind of words and its frequency. As
4 are more difficult than TIME and The Economist, and easier for Relatives, WHICH and HOW are frequently used for
than BusinessWeek, which is the most difficult of the eight Materials 1 to 4: WHICH is the 2nd to 5th, and HOW is the 4th
materials. to 6th most frequently used. These days, THAT has been
taking place of WHICH [9]. Therefore, the literally style of
D. Other Characteristics
the materials for business management might be older. HOW
Other metrical characteristics of each material were is also frequently used. This seems to be because the contents
compared. The results of the average of word length, the of these materials are mainly about consideration of some
number of words per sentence, etc. are shown together in methods for solving a problem. In the case of Auxiliaries, the
Table 1. Although we counted the frequency of relatives, frequency of CAN, which often means possibility of
the frequency of modal auxiliaries, etc., some of the words
counted might be used as other parts of speech because we
didnt check the meaning of each word.
240 www.erpublication.org
Text Mining of English Materials for Business Management
something, is high: it is the 1st or 2nd in the four materials for Therefore, we can say that the variation of the word-length for
management. As for Personal Pronouns, ITS and WE are used the materials on management is bigger than that for
frequently: while ITS is the most or the 2nd most frequently journalism.
used in Materials 2 to 4, WE is the 1st to 6th in the four
materials for management. Table 3: Coefficients of variation for word-length distribution of
Next, the frequencies of the most frequently used words, the top 100 words.
that is, the top 44 for Prepositions, 9 for Relatives, 8 for Average of Standard cv (%)
Material Total words Variance
Auxiliaries, and 14 for Personal Pronouns in each material word length Deviation ( / x * 100)
1. Search of Excellence 7,692 3.905 3.669 1.916 49.065
were plotted on a descending scale. The vertical shaft was
2. Competitive Strategy 7,502 4.753 6.918 2.630 55.333
scaled with a logarithm. Each characteristic curve was 3. Financial Management 8,095 4.636 5.888 2.427 52.351
approximated by the exponential function: [y = c * exp(-bx)]. 4. Marketing Management 12,062 4.798 5.794 2.407 50.167
We derived coefficients c and b for each part of speech. The TIME 6,844 3.426 1.171 1.082 31.582
results are shown in Table 2. As a result, in the case of Economist 12,556 3.687 2.427 1.558 42.257
BusinessWeek 10,768 3.935 2.532 1.591 40.432
Relatives, the value of c is high for the four materials on
Computing Essentials 4,686 4.547 5.153 2.270 49.065
management as a whole: it is 23.809 (Material 3) to 52.564
(Material 2). On the other hand, in the case of Auxiliaries, as
for the three materials for management except for Material 2, Next, the results of the word-length distribution of the most
the value of c is 30.643 (Material 4) to 32.581 (Material 3) frequently used 100 words of Material 2, Material 4, TIME
and b is 0.2349 (Material 4) to 0.2638 (Material 1), both of and The Economist are shown in Figure 5. As a result, we can
which are lower than other materials. This means that more see that while the distribution for journalism such as TIME
kinds of auxiliaries are used in the materials for management. and The Economist corresponds to the normal distribution,
the distribution for the books on management such as
F. Word-length Distribution of the Top 100 Words
Materials 2 and 4 corresponds to the Poisson distribution.
We examined the word-length distribution of the most Moreover, we inquired into the coefficient of variation for
frequently used 100 words of each material. Then, we the word-length distribution of the most frequently used 100
calculated the variance, standard deviation and coefficient of words except for articles and prepositions. The results are
variation for the distribution. The results are shown in Table shown in Table 4. In this case, the coefficients of variation for
3. As a result, the coefficients of variation for the four the four materials for management are 32.512 (Material 2) to
materials for management are 49.065 (Material 1) to 55.333 36.125 (Material 3), which are lower than three journalism
(Material 2), which are higher than three journalism materials, materials, which are 36.886 (The Economist) to 40.532
which are 31.582 (TIME) to 42.257 (The Economist).
241 www.erpublication.org
International Journal of Engineering and Technical Research (IJETR)
ISSN: 2321-0869 (O) 2454-4698 (P), Volume-3, Issue-8, August 2015
(BusinessWeek). This means that the variation of the
word-length for the materials on management is less than that
Table 5: High-frequency technical terms for management and
for journalism. their percentages for each material.
V. CONCLUSIONS
We investigated some characteristics of character- and
Figure 5: Word-length distribution of the top 100 words. word-appearance of some famous English books on
management, comparing these with English journalism and a
computer book. In this analysis, we used an approximate
Table 4: Coefficients of variation for word-length distribution of equation of an exponential function to extract the
the top 100 words except for articles and prepositions. characteristics of each material using coefficients c and b of
Average of Standard cv (%)
the equation. Moreover, we calculated the percentage of
Material Total words Variance
word length Deviation ( / x * 100) Japanese junior high school required vocabulary and
1. Search of Excellence 8,005 2.333 0.624 0.789 33.819 American basic vocabulary to obtain the difficulty-level as
2. Competitive Strategy 4,862 2.353 0.585 0.765 32.512
5,848 2.364 0.729 0.854 36.125
well as the K-characteristic. As a result, English materials for
3. Financial Management
4. Marketing Management 5,881 2.392 0.612 0.783 32.734 management have the same tendency as English literature in
TIME 5,921 2.401 0.814 0.902 37.568 the character-appearance. The values of the K-characteristic
Economist 11,273 2.402 0.785 0.886 36.886 for the materials on management are high, compared with the
BusinessWeek 9,761 2.482 1.013 1.006 40.532
journalism. Moreover, the books on management are easier
Computing Essentials 3,292 2.272 0.612 0.782 34.419
to read than BusinessWeek. Besides, we inquired into the
word-length distribution of the most frequently used 100
words. It has been cleared that while the distribution for
IV. APPLICATION TO EDUCATION journalism corresponds to the normal distribution, the
Using the three dictionaries of accounting terms, we distribution for the books on management corresponds to the
checked what technical terms for management are included in Poisson distribution.
each material. The top 20 nouns and their percentages for In the future, we plan to apply these results to education.
Material 2 and Material 3 are shown in Table 5. While the For example, we would like to measure the effectiveness of
frequencies of INDUSTRY, COST and FIRM, including both teaching the 100 most frequently used words in a certain
singular and plural forms, are 1.058%, 0.940% and 0.881% material beforehand.
respectively of all the words used in Material 2, the
frequencies of CASH, COMPANY and ASSET are 0.747%, REFERENCES
0.971% and 0.729% respectively in Material 3. [1] H. Ban, T. Dederick, H. Nambo, and T. Oyabu, Metrical Comparison
of English Materials for Business Management and Information
Technology, Proceedings of the 5th Asia-Pacific Industrial
As for Materials 2 and 3, the top 20 technical terms occupy Engineering and Management Systems Conference 2004,
as much as 6.897% and 6.786% respectively of all words. In pp.33.4.1-33.4.10, 2004.
the case of Material 1 and 4, the percentage is 3.039% and [2] H. Ban, T. Dederick, and T. Oyabu, Linguistical Characteristics of
7.602% respectively. If we teach beforehand these technical Eliyahu M. Goldratts The Goal, Proceedings of the 4th
Asia-Pacific Conference on Industrial Engineering and Management
terms for management to students, reading of the texts will Systems, pp.1221-1225, 2002.
become easier. [3] H. Ban, T. Dederick, and T. Oyabu, Metrical Comparison of
Singapore English Newspapers and Other English Journalism,
Proceedings of the 6th International Conference on Engineering
Design and Automation, pp.717-722, 2002.
242 www.erpublication.org
Text Mining of English Materials for Business Management
Hiromi Ban received her B.A. and M.A. from Japan Womens
University. She joined Toyama University of
International Studies, Toyama, Japan as a lecturer in
1993. Currently, she is a professor at Graduate School
of Engineering, Nagaoka University of Technology.
She is engaged in researches on computational
linguistics and text mining. She is a member of IEICE,
SOFT (Japan), and JSKE.
Takashi Oyabu was born in 1949. He received his B.E. and M.E. in
1971 and 1973 from Kogakuin University and also
received B.A. in 1975 from Waseda University,
Tokyo, Japan. In 1984, he received the Dr of
Engineering from Kogakuin University. He joined
the Electronics Research Laboratory of Denki
Onkyo Co. Ltd., Tokyo in 1973. From 1991-1998
he was a Professor of Toyama University of
International Studies. He joined Kanazawa Seiryo University, Kanazawa,
Japan as a Professor of Graduate School of Regional Economic Systems in
1998. He is the President of Society for Tourism and Informatics of Japan
and the academic President of Kanazawa Institute of Tourism. His current
research activities are on tourism with advanced information technology and
its applications to welfare fields.
243 www.erpublication.org