Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/236951303

E-mail Address Categorization based on Semantics of Surnames

Conference Paper · April 2013


DOI: 10.1109/CIDM.2013.6597240

CITATIONS READS
5 1,034

5 authors, including:

Suresh Veluru Yogachandran Rahulamathavn BSc(Hons), PhD


United Technologies Research Center Loughborough University
28 PUBLICATIONS   465 CITATIONS    82 PUBLICATIONS   1,202 CITATIONS   

SEE PROFILE SEE PROFILE

Viswanath Pulabaigari Muttukrishnan Rajarajan


IIIT Sri City Chittoor City, University of London
82 PUBLICATIONS   904 CITATIONS    282 PUBLICATIONS   5,064 CITATIONS   

SEE PROFILE SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Privacy-preserving Techniques View project

Subtle Emotions View project

All content following this page was uploaded by Suresh Veluru on 21 May 2014.

The user has requested enhancement of the downloaded file.


E-mail Address Categorization based on Semantics
of Surnames
Suresh Veluru∗ , Yogachandran Rahulamathavan∗ , P. Viswanath† , Paul Longley‡ , and Muttukrishnan Rajarajan∗
∗ InformationSecurity Group, School of Engineering and Mathematical Sciences,
City University London, London, EC1V 0HB, United Kingdom.
E-mail:{Suresh.Veluru.1, Yogachandran.Rahulamathavan.1, R. Muttukrishnan}@city.ac.uk
† Department of CSE, R.G.M.C.E.T, Nandyal, Andhra Pradesh, India

E-mail: viswanath.pulabaigari@gmail.com
‡ Department of Geography, University College London, London, WC1E 6BT, United Kingdom.
E-mail: p.longley@ucl.ac.uk

Abstract—Surname (family name) analysis is used in geog- names of migrants retain semantic similarity to surnames of
raphy to understand population origins, migration, identity, the people at their original locations.
social norms and cultural customs. Some of these are suppos- In future, an e-mail address can be used as a form of
edly evolved over generations. Surnames exhibit good statistical
properties that can be used to extract information in names digital identify of an individual that often holds surname as a
data set such as automatic detection of ethnic or community substring. Thus, a methodology is important to categorize an
groups in names. An e-mail address, often contains surname as e-mail address based on the knowledge extracted from names
a substring. This containment may be full or partial. An e-mail data set. Hence, association among people can be predicted
address categorization based on semantics of surnames is the from the e-mail addresses. The objective of this paper is to
objective of this paper. This is achieved in two phases. First
phase deals with surname representation and clustering. Here, a extract information in names data set which can be used in
vector space model is proposed where latent semantic analysis is classifying an e-mail address. Knowledge discovery in names
performed. Clustering is done using the method called average- data set involves identifying relationship or association among
linkage method. In the second phase, an email is categorized groups of people (surnames).
as belonging to one of the categories (discovered in first phase). In text mining, latent semantic analysis (LSA) is used in
For this, substring matching is required, which is done in an
efficient way by using suffix tree data structure. We perform finding semantic similarity between terms (words) across doc-
experimental evaluation for the 500 most frequently occurring uments [15]. Here, a document is seen as a bag of words where
surnames in India and United Kingdom. Also, we categorize the the lower level structure (like phrases, sentences which shows
e-mail addresses that have these surnames as substrings. a definite relationship between words) present in the document
Index Terms—Vector space model; latent semantic analysis; is neglected. It is shown that the phrase based approaches does
surnames; average link clustering method; suffix tree;
not perform well since phrases do not repeat as the terms
repeat in a set of documents. Hence, phrase based approaches
I. I NTRODUCTION
do not capture good statistical information [14]. Vector space
Due to rapid growth of digital data, knowledge discovery model is used popularly in text mining to represent documents.
and data mining have great potential which would turn data Similarly, surnames provide good statistical information at
into useful information and knowledge. Text mining (some- several location to extract knowledge from names data set
times called ‘mining from text documents’) is to extract knowl- using vector space model. Several surname analysis techniques
edge from a set of text documents [6]. One such knowledge have been developed in [3], [10], and [9], but do not explicitly
is to discover the clustering structure present in the data, i.e., use the vector space model to extract knowledge.
to find groups of documents [11]. This can be later used to Our contribution: To the best of our knowledge, this is
categorize a new document in to one of these pre-discovered the first paper that represents surnames at different locations
classes (clusters) or as an outlier (saying ‘does not belong to in a vector space model and applies classical text mining
any of these groups’). Similar to this, names, like first names, techniques such as latent semantic analysis (LSA), average-
family names of individuals can be clustered to find inherent link clustering method, and suffix tree data structure appropri-
structure present in them which later can be used to classify ately to perform categorization of e-mail addresses based on
a new name. This knowelge is shown to have importance in semantics of surnames.
geography [5]. The proposed method in this paper has two phases. In the
Broadly speaking, family names (surnames) represent eth- first phase, it represents surnames in a vector space model
nic, geographic, cultural and genetic structures that have been and applies LSA and an average link clustering method in
developed in human populations. It is a well known fact that order to cluster surnames which co-occur together in several
people migrate from one location to other due to job prospects, locations. In the second phase, it constructs the suffix tree of an
economic prosperity, political unrest, etc. However, the sur- e-mail address which compactly represents all of the suffixes

978-1-4673-5895-8/13/$31.00 2013
c IEEE 222
of the e-mail address. Further, it performs a substring matching B. Literature review
technique such that if any surname is present as a substring Surname analysis have been developed in geography such
in the e-mail address then the e-mail address is assigned into as identifying spatial concentration of surnames [3], migrant
the cluster to which the surname belongs. This means that if surname analysis [8], uncertainty in the analysis of ethnicity
two surnames that are in the same cluster are substrings of two classification [10], and ethnicity and population structure anal-
different e-mail addresses then these two e-mail addresses will ysis [9]. However, the degree of similarity between surname
also be assigned into the same cluster. mixes has been developed by comparing relative frequencies of
This paper is organized as follows. Section II sets out surnames at different locations such as isonymy [7] and Lasker
the background and literature review for the proposed work. distance [13]. These measures are complementary measures
Section III describes proposed method for e-mail address such that the inverse natural logarithm of the isonymy creates
categorization. Section IV presents the experimental results a more intuitive measure called Lasker distance. These are
and finally Section V presents conclusion and future work of applicable to study inbreeding between marital partners or
the paper. social groups, but do not explicitly address the semantic
similarity between surnames. E-mail address categorization
II. BACKGROUND AND LITERATURE REVIEW
based on semantics of surnames is proposed in the following
This section briefly explains some of the background tech- section.
niques and literature review that are used in this paper to
develop the proposed method. III. E- MAIL ADDRESS CATEGORIZATION
This section describes the proposed e-mail address catego-
A. Background rization method. Figure 1 presents a block diagram for an
Vector space model is popularly used to extract information e-mail address categorization technique which has two phases
in text documents. Consider if a document contains a bag of represented using dotted lines. Figure 1 also documents each
words then each document could be represented with a 𝑑- phase as follows. In the first phase, the semantics of surnames
dimensional vector where 𝑑- represents 𝑑 most frequent terms are identified by representing a set of names at each location
(or words) in a set of documents. Each element of the vector using a vector space model followed by latent semantic
represents either term frequency multiplied by the inverse analysis as shown in the three blocks and as explained in
document frequency (TF-IDF1 ) if the term is present in the Subsection A. Further, clustering of surnames is shown as an
document or 0 if the term is not present in the document. average-link clustering method and is explained in Subsection
A set of documents are represented using a set of vectors in B. In the second phase, suffix tree construction of an e-mail
the form of a term-document matrix and techniques such as address is shown in two blocks and is explained in Subsection
latent semantic analysis (LSA), clustering and classification of C. Surname identification in an e-mail address is shown in
documents can be performed on the term-document matrix [2]. matching algorithm and is explained in Subsection D.
The LSA method computes the semantic similarity among
A. Semantics of surnames
words in the term-document matrix. It performs corpus based
statistical analysis that finds words which co-occur together in We adapt methods used in information retrieval in order to
several documents [15]. LSA represents a vector for each word represent each location which contains a bag of surnames as a
and hence the cosine similarity between two vectors can be vector, and this is used to identify the semantics of surnames.
used to measure semantic similarity between corresponding Consider the location space of a region or a country
words. Popular clustering methods can be applied to group consisting of a set of locations where each location has a
words that are semantically similar. Clustering methods can be bag of surnames. Let there be 𝑛 locations that are represented
divided into two types, i.e., hierarchical and partitional clus- as ℒ1 , . . . , ℒ𝑛 . A typical vector space model represents each
tering methods. Hierarchical methods represent clusters and location with a vector consisting of 𝑚 entries where 𝑚
subclusters in a hierarchy. If 𝜋𝑖 and 𝜋𝑖+1 are two successive represents the top 𝑚 frequently occurring surnames in a region
levels, then normally, either 𝜋𝑖 is a refinement of 𝜋𝑖+1 or 𝜋𝑖+1 or a country. Let these top 𝑚 frequently occurring surnames
is a refinement of 𝜋𝑖 . Single-link, complete-link and average- be 𝒮 = 𝑠1 , . . . , 𝑠𝑚 . The vector space model for each location
link clustering methods are the most widely used methods of ℒ𝑖 is represented with a 𝑚-dimensional vector, for 𝑖 = 1 to 𝑛
this category which produce arbitrarily shaped clusters when is given in (1) where 𝑤𝑠𝑖 𝑗 represents the weight of the surname
compared with partitional clustering. The single-link clustering 𝑠𝑗 in location ℒ𝑖 .
method is sensitive to noisy patterns and may merge two
clusters if they are connected by a chain of noisy patterns [1]. ℒ𝑖 =< 𝑤𝑠𝑖 1 , 𝑤𝑠𝑖 2 , . . . , 𝑤𝑠𝑖 𝑚 > (1)
In this sense, average link clustering method can be used to We assign weight 𝑤𝑠𝑖 𝑗 that represent the weight of surname
find good clusters. 𝑠𝑗 in location ℒ𝑖 . The weight depends upon the number of
1 A popular representation scheme in information retrieval weights each
occurrences of surname 𝑠𝑗 in location ℒ𝑖 called surname
term using a global weight IDF which is inversely proportionate to the number frequency and a global weight for each surname 𝑠𝑗 called
of documents that contains the term inverse location frequency (ILF). The weight 𝑤𝑠𝑖 𝑗 is given

2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) 223
Fig. 1. E-mail address categorization method corresponds to surnames that have 𝑘 columns represented with
PHASE I 𝑑𝑖𝑚1 , . . . , 𝑑𝑖𝑚𝑘 . This means that each surname 𝑠𝑖 is a vector
NAMES DATASET VECTOR SPACE MODEL
of 𝑘 dimensions such that 𝑠𝑖 =< 𝑢𝑖1 , . . . , 𝑢𝑖𝑘 >, for 𝑖 = 1 to
1
𝑚 which is given as below.
𝐿𝑜𝑐𝑎𝑡𝑖𝑜𝑛 : Names < 𝑤11 , ..., 𝑤𝑚
1
>
Latent Semantic
. .
𝑑𝑖𝑚1 . . 𝑑𝑖𝑚𝑘
.
.
.
.
Analysis
⎛ ⎞
𝑈𝑉 𝑊 𝑇
𝑠1 𝑢11 . . 𝑢1𝑘
𝐿𝑜𝑐𝑎𝑡𝑖𝑜𝑛𝑛 : Names < 𝑤1𝑛 , ..., 𝑤𝑚
𝑛
>
. ⎜⎜ . . . . ⎟⎟
𝑈𝑚×𝑘 =
. ⎝ . . . . ⎠
Email address Average link
Clustering method
𝑠𝑚 𝑢𝑚1 . . 𝑢𝑚𝑘
Categorization
Output
𝐶1 .. 𝐶
𝑙 𝐶1 ... 𝐶𝑙 The semantic similarity between two surnames 𝑠𝑖 and 𝑠𝑗 is a
𝑙-clusters
cosine similarity between two vectors 𝑠𝑖 and 𝑠𝑗 which is given
by (4). Further, the clustering of surnames can be performed
using the semantic similarity given by (4) to identify semantic
Matching clusters of surnames which is explained in the following
E−mail address Algorighm subsection.
Suffix Tree 𝑘

𝑢𝑖𝑡 × 𝑢𝑗𝑡
PHASE II
𝑡=1
𝐶𝑂𝑆(𝑠𝑖 , 𝑠𝑗 ) = 𝑘 𝑘
(4)
√ ∑ √ ∑
( 𝑢𝑖𝑡 × 𝑢𝑖𝑡 ). ( 𝑢𝑗𝑡 × 𝑢𝑗𝑡 )
𝑡=1 𝑡=1
in (2). Here 𝑓𝑠𝑖𝑗 is the frequency of surname 𝑠𝑗 at location
ℒ𝑖 and 𝐼𝐿𝐹 (𝑠𝑗 ) is the inverse location frequency. B. Clustering of surnames
{ 𝑖 We used average-link clustering method to develop a good
𝑓𝑠𝑗 ∗ 𝐼𝐿𝐹 (𝑠𝑗 ) if surname 𝑠𝑗 in location ℒ𝑖
𝑤𝑠𝑖 𝑗 = semantic clusters of surnames.
0 if no surname 𝑠𝑗 in location ℒ𝑖 Average-link clustering produces a hierarchy of clusters.
(2) Let 𝒞𝑖 , 𝒞𝑗 be two clusters of surnames then the average-
𝐼𝐿𝐹 (𝑠𝑗 ) provides the importance of surname 𝑠𝑗 that retrieve link similarity (𝐴𝑣𝑔𝑆𝑖𝑚) between two clusters of surnames is
locations using surname 𝑠𝑗 . If surname 𝑠𝑗 appears only in defined by (5). Here ∣𝒞𝑖 ∣ is number of surnames in the cluster
a particular location then it is easy to retrieve that location 𝒞𝑖 .
given the surname 𝑠𝑗 . If a surname appears in one location ∑
1
then it is of greater importance than a surname that appears 𝐴𝑣𝑔𝑆𝑖𝑚(𝒞𝑖 , 𝒞𝑗 ) = 𝐶𝑂𝑆(𝑠𝑝 , 𝑠𝑞 ) (5)
in several locations. If 𝑛𝑗 is the number of locations in which ∣𝒞𝑖 ∣∣𝒞𝑗 ∣
𝑠𝑝 ∈𝒞𝑖 ,𝑠𝑞 ∈𝒞𝑗
the surname 𝑠𝑗 appears and 𝑛 is the total number of locations Initially, the average-link clustering method assumes that each
then 𝐼𝐿𝐹 (𝑠𝑗 ) is given in (3). surname 𝑠𝑖 ∈ 𝒮 is a separate cluster and proceeds by merging
𝑛 two clusters at each iteration of the clustering process. If the
𝐼𝐿𝐹 (𝑠𝑗 ) = 𝑙𝑜𝑔2 ( ) (3)
𝑛𝑗 average-link clustering method reaches the desired number
Let the location space which contains a set of locations be ℒ, of clusters then the merging process ceases. Otherwise, at
represented by a matrix consisting of location-surnames. For each iteration, it finds two clusters such that the average link
our convenience, let ℒ𝑇 be a transpose matrix of ℒ having 𝑚 similarity (𝐴𝑣𝑔𝑆𝑖𝑚) between these two clusters is a maximum
rows and 𝑛 columns given below. and merges them into a single cluster. The algorithm for
average-link clustering method is given in Algorithm 1.
ℒ1 . . ℒ𝑛
⎛ ⎞ C. Suffix tree construction method
𝑠1 𝑤𝑠11 . . 𝑤𝑠𝑛1
. ⎜⎜ . . . . ⎟⎟
A suffix tree is a versatile data structure that stores all
ℒ𝑇 =
. ⎝ . . . . ⎠ suffixes of a given string that can be constructed in linear
𝑠𝑚 𝑤𝑠1𝑚 . . 𝑤𝑠𝑛𝑚 time [12]. It has been used in many applications [16], [4].
Given a string 𝑧, an enhanced string is represented as 𝑧$ to
We apply LSA to ℒ𝑇 which in-turn applies a SVD technique make sure that every suffix is unique. The suffix tree of the
that decomposes ℒ𝑇 into three matrices 𝑈 , 𝑉 and 𝑊 𝑇 such enhanced string is represented as Γ(𝑧). Each node represents
that ℒ𝑇𝑚×𝑛 = 𝑈𝑚×𝑘 𝑉𝑘×𝑘 (𝑊𝑛×𝑘 )𝑇 . The matrices 𝑈𝑚×𝑘 and 𝑤 which denotes a string 𝑤 that is the path from root to
(𝑊𝑛×𝑘 )𝑇 correspond to surnames and locations respectively the corresponding node. Each edge in suffix tree Γ(𝑧) is a
which consist of orthonormal columns. The matrix 𝑉𝑘×𝑘 is a substring of 𝑧$. Let 𝑇𝑤 represents the subtree rooted at node
diagonal matrix that containing the singular values in descend- 𝑤. The root of suffix tree is denoted as 𝑟𝑜𝑜𝑡(Γ(𝑧)).
ing order where the 𝑖𝑡ℎ singular value indicates the amount of A suffix link is an auxiliary unlabeled edge between two
variation along the 𝑖𝑡ℎ axis. We focus on matrix 𝑈𝑚×𝑘 which nodes 𝑧𝑖 𝑤, 𝑤 such that 𝑧𝑖 𝑤 → 𝑤 where 𝑧𝑖 is a character.

224 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)
Algorithm 1 Average-link(𝒮, 𝑑) Algorithm 2 SurnameMatching(𝑠𝑖 , Γ(𝑧), 𝑚𝑎𝑡𝑐ℎ)
{𝒮 is a set of surnames, each 𝑠𝑖 ∈ 𝒮 is a vector of 𝑘 {𝑠𝑖 ∈ 𝒮 is a surname 𝑖 in a set of surnames 𝒮. Let ∣𝑠𝑖 ∣ be
dimensions, 𝑑 is a desired number of clusters} number of characters in surname 𝑠𝑖 . Let Γ(𝑧) be suffix tree
Place each surname 𝑠𝑖 ∈ 𝒮 in a separate cluster. Let it be of an e-mail address. Let 𝑚𝑎𝑡𝑐ℎ be the string matched with
𝜋𝑗 = {𝒞1 , 𝒞2 , . . . , 𝒞𝑚 } and 𝑗 = 1. the surname in the e-mail address and it is empty initially.}
{ Let ∣𝜋𝑗 ∣ be the number of clusters at iteration 𝑗} Let string temp=𝜙;
while ∣𝜋𝑗 ∣ > 𝑑 do {let 𝑇 be next child of 𝑟𝑜𝑜𝑡(Γ(𝑧)) and 𝑇.𝑒𝑑𝑔𝑒 be it’s edge.
Select two closest clusters 𝒞𝑝 , 𝒞𝑞 ∈ 𝜋𝑗 such that let 𝑠𝑖 [𝑗] be a character at position 𝑗 of string 𝑠𝑖 }
𝐴𝑣𝑔𝑆𝑖𝑚(𝒞𝑝 , 𝒞𝑞 ) is a maximum. while 𝑟𝑜𝑜𝑡(Γ(𝑧)) has next child do
Form a new cluster 𝒞 = 𝒞𝑝 ∪ 𝒞𝑞 . k=0;
Next set of clusters is 𝜋𝑗+1 = 𝜋𝑗 ∪ {𝒞}∖{𝒞𝑝 , 𝒞𝑞 }. while 𝑘 < ∣𝑇.𝑒𝑑𝑔𝑒∣ & 𝑘 < ∣𝑠𝑖 ∣ & 𝑇.𝑒𝑑𝑔𝑒[𝑘]=𝑠𝑖 [𝑘] do
j=j+1; k++;
end while end while
Output final clustering 𝜋𝑗 if k ∕= 0 & k=∣𝑠𝑖 ∣ then
𝑚𝑎𝑡𝑐ℎ = 𝑚𝑎𝑡𝑐ℎ + 𝑠𝑖 ;
return 𝑚𝑎𝑡𝑐ℎ;
Fig. 2. Suffix Tree for an email address aamalam$ (e.g aa-
malam@yahoo.com). The surname is alam$, it is a substring in the e-mail else
and it has been shown at a leaf node if k ∕= 0 & i=∣𝑇.𝑒𝑑𝑔𝑒∣ then
root {let 𝑠𝑖 [𝑙, 𝑚] be a substring between position 𝑙 to 𝑚
suffix link
of 𝑠𝑖 }
m 𝑚𝑎𝑡𝑐ℎ = 𝑚𝑎𝑡𝑐ℎ + 𝑠𝑖 [0, k];
lam$
a
𝑠𝑖 =𝑠𝑖 [k+1,length(𝑠𝑖 )];
$ alam$
return SurnameMatching(𝑠𝑖 ,𝑇 ,𝑚𝑎𝑡𝑐ℎ);
$
else
amalam$ return temp;
lam$
m end if
end if
$
alam$ end while
This node represents return temp;
surname ’alam’

Suffix links are used significantly to speedup the insertion of assigned to the cluster to which the surname belongs. Since
each new suffix. Each suffix shares the prefix of previous suffix the e-mail address is represented in a compact trie of suffixes,
and suffix links are useful to jump quickly to another node in the proposed method is a fast one which verifies against all
the suffix tree and hence suffix tree construction algorithm is surnames to identify which surname is present as substring in
linear. Each non-leaf node of a suffix tree Γ(𝑧) has a suffix it. If there are two surnames present as substrings in an e-mail
link [12] and the suffix link for a root is root itself. If the address then it is categorized into any one of the two surname’s
set of all non-empty strings 𝑢 such that 𝑢𝑣 belongs to nodes clusters. In general, it is unusual, however, in such cases it
in the suffix tree for some string 𝑣 (possibly empty) then the categorizes the e-mail address into the cluster of surname that
set contains all possible substrings of 𝑧$. The suffix tree data occurs first.
structure is useful for computations on substrings of a string. Surname matching method in an e-mail address identifies
Each leaf node represents a suffix of the given string and the whether or not the surname present as substring in the e-
dotted lines represent the suffix links. mail address. The algorithm SurnameMatching takes surname
𝑠𝑖 , suffix tree of an e-mail address Γ(𝑧), and empty string
D. Surname matching method in an e-mail address match which represents the matching part of surname in the
The proposed e-mail address categorization method uses e-mail address. The SurnameMatching algorithm compares the
surname matching method and semantic clusters of surnames surname 𝑠𝑖 with the string associated to the edge of each child
which is proposed in phase one. It constructs a suffix tree for of the root node. If the surname 𝑠𝑖 matches with prefix of the
an e-mail address. Further, it identifies any substring in the e- edge then it returns surname which is identified in the e-mail
mail address that matches with a surname. If it finds a surname address. If there is no edge that matches with the prefix of 𝑠𝑖
as a substring in the e-mail address then the e-mail address then it returns a null (It says there is no substring present).
is assigned to the cluster to which the surname belongs. If it Otherwise, if a prefix of surname 𝑠𝑖 is matched then the prefix
does not find any surname then it returns a null. Similarly, is copied into the match string, eliminates the prefix from
it checks for each surname and if any surname matches as surname, and calls SurnameMatching algorithm at child node
a substring in the e-mail address then the e-mail address is to check whether or not the remaining surname as substring in

2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) 225
the e-mail address recursively.The detailed algorithm is given Semantic clusters of surnames for India names data set and
in 2 . United Kingdom names data set are plotted in Figure 5 and 6
Figure 2 denotes the suffix tree for an e-mail address respectively. We have calculated 𝐿𝑄 values of each surname
aamalam$ and alam$ is the surname. Given a surname 𝑠𝑖 and at all locations and taken the location that has the maximum
a suffix tree Γ(𝑧) where 𝑧 is an e-mail address, the proposed 𝐿𝑄 value 2 . For a surname, if the 𝐿𝑄 value in a location is a
method finds weather 𝑠𝑖 is a substring or not in 𝑂(∣𝑠𝑖 ∣) time. maximum means the surname concentration in that location is
In the example, the edge ’a’ is matched with the prefix of the highest. Also, we have eliminated a few surnames that have
surname alam$ and the algorithm finds a child node attached relatively low maximum 𝐿𝑄 value. Hence, we have plotted
to ’a’ and traverses that child node. It finds the edge lam$ that 123 surnames for India names data set and 118 surnames for
matches with the remaining characters of the surname (i.e., United Kingdom names data set in which the horizontal axis
lam$) and hence assigns the cluster to which the surname represents surnames and the vertical axis represents locations
belongs. If there are 𝑚 surnames then the time complexity where the surname concentration is a maximum. The size of
of the proposed method to categorize an e-mail address is the circle represents the 𝐿𝑄 value and the number represents
𝑂(𝑚 × max {∣𝑠𝑖 ∣}𝑚 𝑖=1 ). the semantic cluster number to each surname from 1 to 30.
From Figure 5, it is clear that clusters 3, 6, 11, and 28
IV. E XPERIMENTAL R ESULTS contain surnames that are each heavily concentrated in a single
This section describes experimental results. We have two province. It can be observed that surnames in clusters 21,
countries names and e-mail addresses data sets, viz., India 28 belong to a single community. Many of the surnames in
and United Kingdom. India corpus has 17.4 million names cluster 29 are highly concentrated in West Bengal and many
and 14.9 million e-mail addresses collected over 277 locations of the surnames in clusters 9 and 25 are highly concentrated
which covered 28 provinces and 6 union territories. United in Goa, Maharashtra, Dadra & Nagar Haweli, Daman & Diu
Kingdom corpus has 0.924 million names and 1.048 million and Andaman & Nicobar, but are split between two clusters.
e-mail addresses collected over 115 locations in United King- Hence, it can be concluded that surnames found in cluster 9
dom. Location information is not used in e-mail address, per- and 25 are the result of migration between Goa, Maharashtra,
haps, it is used to identify semantics of surnames which in-turn Dadra & Nagar Haweli, Daman & Diu and Andaman &
used to categorize e-mail addresses. In Figures 3 and 4, the Nicobar, but, highly concentrated in Goa.
horizontal axis represents surname or e-mail address domain From Figure 6, since the 𝐿𝑄 values are measured at each
and vertical axis represents frequency of surnames or e-mail location and hence all clusters are heavily concentrated in
addresses in log scale. The frequency of 40 most frequent two and more locations which are limited to a few locations
surnames and 40 most frequent e-mail address domains for for some clusters. For example, surnames in cluster 28 are
India and United Kingdom data sets are given in Figures 3 heavily concentrated in Zetland, Belfast, and Uxbridge. It can
and 4 respectively. be observed that surnames in clusters 18 and 10 belong to a
In phase 1, we extracted the 500 most frequently occurring single community which are heavily concentrated in a single
surnames which are represented in a vector space model location, viz., Bradford and Uxbridge.
for India and United Kingdom names data set. For India
TABLE I
names data set, 277 vectors were generated correspond to C ATEGORIZATION OF E - MAIL ADDRESSES FOR India DATA SET
277 locations and for United Kingdom names data set, 115
vectors were generated correspond to 115 locations. After No E-mail address surname category
1 anal.chatterjee@domain1.com chatterjee 29
applying LSA, we chose 60 dimensions for each surname in a 2 anshukataria@domain1.com kataria 8
decomposed matrix which corresponds to surnames in order to 3 anwesha.bakshi@domain1.com bakshi 4
find semantic similarity among surnames and clustered them 4 arnabghoshd@domain1.com ghosh 29
5 binitmishra8@domain1.com mishra 12
into 30 groups. 6 chawlaarvinder@domain1.com chawla 8
Analysis of the spatial concentration of surnames has been 7 eesatish.kumar@domain1.com kumar 1
developed in Great Britain [5] using the Location Quotient 8 feroj khan@domain2.com khan 1
9 bedgautam@domain2.com gautam 23
(LQ) to measure the concentration of any surname at different
locations. Let 𝒫𝑖𝑗 be the frequency of surname 𝑖 in location
𝑗 and let 𝒬𝑖 be the frequency of surname 𝑖 in Great Britain. In the second phase, we analysed 14.9 million e-mail
Let 𝑚 be the total number of surnames then the LQ is defined addresses and found that 3.7 million e-mail addresses have 500
by (6) most frequent surnames as substrings for India e-mail address
𝑗
∑𝑚𝒫𝑖 𝑗 data set. We categorized these 3.7 million e-mail addresses
𝒫𝑖 into 30 groups based on the clusters of surnames obtained
𝐿𝑄𝑗𝑖 = 𝑖=1
(6)
∑𝑚𝒬𝑖 in the phase 1 of the method. We analysed 1.048 million e-
𝒬𝑖
𝑖=1 mail addresses and found 318,867 of e-mail addresses have
For each 𝑖, we represented surname 𝑖 in the location 𝑗 which 2 For India names data set, the provinces are considered to calculate 𝐿𝑄
has maximum 𝐿𝑄𝑗𝑖 value in order to analyse the semantic value whereas for United Kingdom names data set, the locations themselves
clusters of surnames. considered to calculate 𝐿𝑄 values

226 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)
Fig. 3. Frequency of 40 most frequent surnames and 40 most frequent Fig. 4. Frequency of 40 most frequent surnames and 40 most frequent
e-mail address domains for India data set e-mail address domains for United Kingdom data set
Number of e-mail addresses vs damain name statistics for India data Number of e-mail addresses vs damain name statistics for UK data
262144 262144

131072
Number of e-mail addresses in domain 131072
Number of e-mail addresses in domain
65536
65536
Number of e-mail addresses

Number of e-mail addresses


32768
32768

16384

16384

8192

8192

4096

4096
2048

2048
1024

1024
512

256 512

H
H M
YA M L.C
AO OO L.C M
BT .C CO .UK
YA NTE M K
N OO NE
G WO CO .CO
LI AIL LD
M .C OM O
G N.C .UK
BL OG M
TI EY M
SK CA ND L.C
TA .C .CO R.C M
FS KT M K .U
BT AI LK.
VI PE NE ET
LI IN WO
TE .C ET LD
TA CO M
YM K2 ET
O IL CO
BT O M
M ON K
SU LIN EC
R AN TO .CO
LI KE T.C .CO
M EO MA M
AO LIN .N .C
O .C TO T M
H ETE .UK 2.C
TY E .CO
YA DD LL
G OO OM O.
W X.C CO
M .PL .UK
R .C
VI IF M
G GIN AI
gm
ya ail.
ho oo m
ya ma om
re oo .com
ao ffm .in
vs .co il.c
liv l.co
ya co
ym oo
si ail.c n
m .co m
sa n.c
m ch m
in il.co rne
sa iatim in
ro yam es.
go ket .ne om
re gle ail. in
go iff.c ai om
fre gle m om
ni wh rou
in in le. s.c
ga om
da ab
bo on m
ey e in
sa pa in
ly sa te.c
16 s.c nic m
ya .co m n
si oo
em ndi ail.
bo ail. .co om
ju 3. m
bt o.c snl.
ro ter m et.i
ai ers et.c
w .co om m
bs hp

O
O AI

TL . T

O E R M

O L

ED O
M R M M

O O

2. .C M

N O R

M .

M M L
VE .C .C
S O

VE .N R

AI N

N T O M
AI NE IL

AC
fy o

co d. o

si m

as m

S O AI
d m t.

c. o p
.c

n v

m .c o

R N T

P O .IN

R FM
s m

a a

in
di .co

d m c

g n
e. m
n m om

U LE
h co

h il

h m

n o

e t.
n s

h m

nl os

I O .U

L O .U O
M A
O L. N

S O
L .N

C .U

L CA M
t .c

l a

o m t.

o o l.c

w
ta .co
l.n e.

3 o .i

m co m

H AI O

H R

Y LI E O

H .C .C
e g

P A T
L . O

L A E O
T
T

A 1.

M
C

X. E .C
G
.in t.

C D O
o
.i

a c

O IA M
O

M .C
co
co o
c

m
m m
Domain names Domain names

O
O
n

.C

M
M
O
M
K
Surnames statistics for India data Surnames statistics for UK data
1.04858e+06 16384

524288

8192
Number of e-mail addresses in domain

Number of surnames
Number of surnames

262144
Number of e-mail addresses in domain
4096

131072

2048

65536

32768 1024
SM
JO TH
W ES
BR LIA
TA W S
D LO
W IE
EV SO
JO NS
TH NS
R M
W ER
TH IGH S
W MP
R KE ON
W IN
H ITE ON
G H
ED EN
LE A
H IS S
C L
M RK
H TI
JA RI
W KS
C OD N
SC RK
TU TT
M NE
PA OR
H EL
C L
W P
W RD
AN SO
M ER
KI RR ON
JA G
H ES
ku
si ar
sh gh
gu rm
sh ta
kh h
ja
ya
sh av
pa ikh
pa il
m el
ra hra
pr
m ad
de hta
da i
la
jo
ve hi
re ma
pa dy
ra de
ag
pa rw
sh
tiw ty
ja ari
na av
al r
ah
ar ed
jh ra
ch
pr nd
sa kas
th an
pa kur
sh ar
de nde

AV R

O AS

O R

AL
LA

AR N

LA

IL
O

AR
R ES

AR E

O R

O S
IL

IL S

R T

AL S

H S

O O

A ER
AT

N IS
C S

M
W RD
N

Y N

H
n

A N

O ON

O T

T E
in

l
s

dh

O M

D N
is

a t

G
o
as

m y

O
m

a
an

d
a

et

w h

i
p a

t
t

v
s

a
l al

sa

R
IS
i

Surnames Surnames

O
N
TABLE III
N UMBER OF E - MAIL ADDRESSES CATEGORIZED INTO DIFFERENT GROUPS FOR India AND United Kingdom DATA SETS

CNo Country Name CNo Country Name CNo Country Name


India United Kingdom India United Kingdom India United Kingdom
1 403485 34022 11 35941 16833 21 99292 886
2 132674 35843 12 109691 501 22 60759 2969
3 75559 5094 13 133299 3814 23 79369 32415
4 249669 18088 14 96657 14422 24 39220 297
5 52414 27827 15 21952 227 25 56875 13942
6 12742 4093 16 407543 9003 26 16194 133
7 28689 829 17 103426 5152 27 37509 8796
8 117376 13908 18 82564 9655 28 70889 13611
9 94766 13245 19 19912 9827 29 475388 4892
10 35499 1439 20 602131 16347 30 38975 757

TABLE II
500 most frequent surnames as substrings for United Kingdom C ATEGORIZATION OF E - MAIL ADDRESSES FOR United Kingdom DATA SET
e-mail address data set. Table I and II present the sample
results of 9 e-mail addresses and their categories based on No E-mail address surname category
the semantics of their parent surnames for India and United 1 glennis.middleton@domin1.com middleton 1
2 emily.curtis1@domin1.com curtis 11
Kingdom data sets respectively. For example, e-mail addresses 3 amanda.francis@domin1.com francis 11
1 and 4 suggest surnames chatterjee, ghosh respectively which 4 darrenmbates@domain2.com bates 1
belong to the same cluster 29 and hence these two e-mail 5 georgeamos44@domin3.com george 23
6 johnnysingh1971@domain2.com singh 30
addresses are assigned to group 29. Similarly, we can see 7 michaelburton1983@domin3.com burton 23
that e-mail address 7 and 8 assigned to group 1. Table III 8 manishakaur2000@domin4.com kaur 30
presents the categorization of 3.7 million and 318,867 of 9 emmasjones2@domin2.com jones 2
e-mail addresses for India and United Kingdom data sets
respectively and the number of e-mail addresses belonging
to each category is presented. V. C ONCLUSION AND FUTURE WORK
In general, an e-mail address is created to reflect the identity
of an entity and it is common to see surnames as substring in

2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) 227
Fig. 5. Semantics of Surnames for India Data set A) Surnames from 1 to 62, B) Surnames from 63 to 123
A) Semantics of surnames from 1 to 62

TRIPURA Surname concentration with cluster label


17 17

MADHYAPRADESH
12 14
DAMANDIU
9
GOA
SIKKIM 6 6 6 6 6 6 6 6
HIMACHALPRADESH 7 12
CHHATTISGARH 4 4
ORISSA 4
MEGHALAYA 4 7 7 7 7 7 7 7 7 7 7 7 7 7
RAJASTHAN 4
PUNJAB 3 3 3 14 14
GUJRAT 2 2 4
11 11
ANDAMANNICOBAR
8
LAKSHDWEEP
1 8 16
MANIPUR
16
JAMMUANDKASHMIR
UTTARANCHAL 1 2 4 4 4
ASSAM 14
JHARKAHND 17 17 17 17 17
BIHAR 1
1 1
SI
R HA
KUNJ
AK MA N
IQ HT I
KA AL R
G R
M L
KH HA
PA AN AN
M RE EL
BH N K A
R AT
KA INA
D UL
SA AR
W HU
KA LIA
SOUS
N OD AL
N ND
KA K
FE MA
D RN
PE S ND
R RE ES
SH DR RA
PE ET GU
SADN E S
M HO KA
SWHA
R AI NTY
M UT
N LL
D YA K
PA H
JE TR
M NA
PR HA
PA AD PAT
BEND AN RA
PAHE
TA TN A
SA E K
LA CH A
PAD EV
PRTE
R J
SHI PA
M RIV I
C TH AST
PA OU R V
R NW AN
D TH AR
SA VI RE
D HA
N S I
G TH
IS SW
SA AM AM
C RM
D OU
A

A
AI A

IA A

O I

O N

A IC
AS K

A A

H U A

A
E O

A N
A

H A
U D
IL

O
A

EE E W

O O R

A
N

L
B A

N
U

TT H
A

L
Y E

AI
J

A UR
H

R
D

H
A

D
T
R

H
T

Y
L

A
Surnames

B) Semantics of surnames from 63 to 123

KERALA
PONDICHERRY 20 21 21 21 28 28 28 28 28 28 28
NAGALAND 20 20
19
MIZORAM
TRIPURA 20
WESTBENGAL 21 29 29 29 29 29 29 29 29 29 29
29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29 29

GOA
25

GUJRAT
22 22 22 22 22 22
ANDAMANNICOBAR
20
LAKSHDWEEP Surname concentration with cluster label
20 24
MANIPUR
UTTARPRADESH 20
30
UTTARANCHAL
ASSAM 19 19 19 19
17
R
R HM
N WA N
TH GI
BI AP
PA HT
N NT
SU H
AHND N
SHAM RA
SE N D
C KA UG
KA N
PANN RA
ANUL N
JO TO
VAHN Y
JA RG
D I ES
BHVE
R AV
PA A AR
PARM
SHND R
SO K A
TH UZ R
M OM
AB TH S
G A W
JA OR AM
JO O
JOSE
R SE H
SA
C RK
D AK R
D Y AB
BH
SA AT
M HA AC
D U
KA TT DE
GR
BI OS
C A
M OW S
SH RA DH
KUAW
BHND
BA T
D SU AC
BOSG
SE SE PT
G NG
SE A PT
BAN
M NE
C KH JE
G AT RJ
M NG ER E
SR ND LY EE
A
A A
E T

AT

H R A

AV S

H A
E R
EB

A M

H E E
E H

U U

A T E
A A

AJ

IT

U R

O U J
S A

SW H
N H

C GE
R
A

IV AL
A

AS
P

A R
B

T
A

A
M

A N

A
A

U
E M

TA
U
H

H
O

A
E

VA
R
AR

AR
R
M

Y
TY

JE

YA
E

Surnames

the e-mail address of an identifiable individuals. In this paper, random. From India and United Kingdom data sets, this is
we have analysed statistical relationships among surnames and clearly the case from the results of our analysis of the 500
clustered them into several groups using a vector space model most frequently occurring surnames and the assignment of 3.7
in the first phase. We used latent semantic analyses to identify million and 318,867 corresponding e-mail addresses into 30
semantic similarity among surnames and used the average-link groups for two data sets.
clustering method to allocate surnames between 30 clusters. The future directions of this work will include i) finding an
In the second phase, the categorization of an e-mail address optimal number of clusters using the average-link clustering
has been carried out. If the e-mail address contains a surname method; and ii) developing an efficient and fast approach for
identifiable as a substring and can thus be assigned to one e-mail address database mining in order to find frequent sub-
of the surname clusters. This is done efficiently by using the patterns that occur in the e-mail addresses.
suffix tree of an e-mail address.
Through the experimental evaluations it is shown here that
the surnames present as substring in an e-mail address can be ACKNOWLEDGEMENTS
retrieved which can be useful in the future to link individuals
multiple digital identities to their physical identities. The e- This work was supported through the EPSRC Grant- “The
mail addresses can then be assigned to locations because Uncertainty of Identity: Linking Spatiotemporal Information
the geographic distributions of most surnames are far from Between Virtual and Real Worlds” (EP/J005266/1).

228 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)
Fig. 6. Semantics of Surnames for UK Data set A) Surnames from 1 to 59, B) Surnames from 60 to 118
A) Semantics of surnames from 1 to 59

HALIFAX
INVERNESS 9

MOTHERWELL 8

BELFAST 8

OUTERHEBRIDES 8 8
Surname concentration with cluster label 8 8 8
PAISLEY
HARROW 8

KIRKCALDY 7

FALKIRKANDSTIRLING 4

DUMFRIES 4 4

KILMARNOCK 4 4 4 4 4 8

KIRKWALL 4 4

WORCESTER 3 4 4 4 8

HEREFORD 3

SWANSEA 2

SHREWSBURY 2 2 2 2 2

LLANDUDNO 2

GALASHIELS 2 2 2 2 2 2

LLANDRINDODWELLS 1 4 4 4 8

SUNDERLAND 1 2 2 5

HARROGATE 1

CARLISLE 1

ZETLAND 1
1 3 3 4 4 8 9
W
H LLI
AT DG MS
STKIN ON N
SHEP ON
H R EN
JO GG
R NE
H E
O GH TS
PR EN S
PA ITC
PURR AR
D GH
TH IE
G OM
R IFF S
JO S H
BOHN
PR WE
W E
R TK
H ND N S
FA PE LL
SPRM
M EN R
KE IR E
BO R
C YD
H IRN
R ND S
TH BE RS
PAOM TS N
AL TE ON N
M A N SO
D INT
KA KS YR
M E N
D LA
M UG
JO W AS
PAHN ELL
LI RK TO
H TL
PA E
M TE
O MI
M ON LAN
C CD EL
SI ME NA
D CL ONLD
M N N IR
C LA LL
R MP GH
M LL EL IN
LI RR
M DS Y
ST INT Y
M UA OS
O A

O P S

O S
U R

AV

EE IT

A I
O A

A
E
O E

IC

O R

O E

A O L

O A

A U Y
EI B L
W E

R A

D L
U C

IL

AX L

C L

A N

C E

U Y L
N A
C A

AR RT H
I

N R
L
R

N
A

I
B

W
C

SD
H

S
Y D

R O
S

S O
S O

O E

EN
S

N
O

E
N

Surnames

B) Semantics of surnames from 60 to 118

OLDHAM
PERTH 29
ABERDEEN 28
PRESTON 28 28 28 28 28
BRADFORD 25
WAKEFIELD 16 18 18 18 24
BLACKBURN 16
ILFORD 16 16
UXBRIDGE 15
DUDLEY 15 27 30 30
HUDDERSFIELD 10 12 12
HALIFAX 9 29
INVERNESS 25
28 28 28 28 28
BELFAST
OUTERHEBRIDES 27 27 27
10 16 26 28 28
HARROW
21

DUMFRIES
13
KIRKWALL
23 28 28 28

Surname concentration with cluster label

GALASHIELS
LLANDRINDODWELLS 13
SUNDERLAND 11 23
HARROGATE 13 13
22
ZETLAND
10 13 22 27 27 28 28
SY
C KE
D RT
FL DD RI
VA M
PAUG G
W RK AN
R ITE S
D B S HO
TU IS N SE
PARN N
ST TTE UL
SH R RS
M AR Y N
D IK A
H YL
BL LD
H ACK N
H RT BU
KH SS Y N
M AN IN
IQ HM
SH L O
H AH
JO NR
FI EP
SL LD H
AKATE
PAHT R
G RK R
M EE SO
JO L WO
BL HN OD OD
SMACK TO
D YT
M HE
KU CA TY
D MA N
R ID
C CH ON
D RIS E
FO NC IE
M RB N
M RR S
H NE O
BRY
W UC
G T
R AN
FR SS
SUAS
M TH R
M CK RL
M K NZ D
M KE Y E
LE NR ZIE
SCES
KA HO
SI R EL
A S
O W

O
AV O U

O
O E

A
U LE R

O H

AV R
IT S
H I
U T

O T
R IN

R
AL M

AC N N

C R

O E
IL IS

A E
AC E AN
C A I
U N
E
H E

AT E

N
BA O
E

U
G
H
IN

E
S
B

E O

E
H

FI
N
O

O
D
G

D
H
T

Surnames

R EFERENCES [10] P Mateos, A Singleton, and P A Longley. Uncertainty in the analysis


of ethnicity classifications: Issues of extent and aggregation of ethnic
[1] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review.
groups. Journal of Ethnic and Migration Studies, 35(9):1437–1460,
ACM Computing Surveys, 31(3):264–323, 1999.
2009.
[2] C. D. Manning, and H. Schutze. Foundations of Statistical Natural
[11] P. Viswanath, B. K. Patra, and V. Suresh Babu. Document clustering
Language Processing. The MIT Press, 2003.
methods: Recent trends along with some efficient and fast approaches. In
[3] James A. Cheshire and Paul A. Longley. Identifying spatial concentra-
Handbook of Research on Text and Web Mining Technologies, volume 1,
tions of surnames. International Journal of Geographical Information
pages 181–188. IGI Global, 2008.
Science, DOI:10.1080/13658816.2011.591291:1–17, 2011.
[12] R. Giegerich and S. Kurtz. From ukkonen to mccreight and weiner:
[4] F. Rasheed, M. Alshalalfa, and R. Alhajj. Efficient periodicity mining in
A unifying view of linear-time suffix tree construction. Algorithmica,
time series databases using suffix trees. IEEE Transactions on Kowledge
19:331–353, 1997.
and Data Engineering, 23:79–94, 2011.
[13] A Rodriguez-Larralde, A. Pavesi, G. Siri, and I. Barrai. Isonamy and
[5] J. Burt, G. Barder, and D. Rigby. Elementary statistics for geographers.
the genetic structure of sicily. Journal of Biosocial Science, 26:9–24,
Guilford Press, 2009.
1994.
[6] K. Aas and L. Eikvil. Text categorisation: A survey. In Technical Report
[14] Fabrizio Sebastiani. Machine learning in automated text categorization.
941. Norwegian Computing Center, 1999.
ACM Computing Surveys, 34(1):1 – 47, 2002.
[7] G. Lasker. Using surnames to analyse population structure. Naming,
[15] T. K. Landauer, P. Foltz, and D. Laham. An introduction to latent
Society and Regional Identity, pages 3–24, 2002.
semantic analysis. Discourse Processes, 25:259–284, 1998.
[8] Paul A. Longley, James A. Cheshire, and P Mateos. Creating a regional
[16] Choon Hui Teo and S. V. N. Vishwanathan. Fast and space efficient
geography of britain through the spatial analysis of surnames. Geoforum,
string kernels using suffix arrays. In Proceedings of the 23 rd Interna-
doi:10.1016, 2011.
tional Conference on Machine Learning, pages 929–936, 2006.
[9] P Mateos, Paul A. Longley, and David O’Sullivan. Ethnicity and
population structure in personal naming networks. PLoS ONE, 6(9):1–
12, 2011.

2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) 229

View publication stats

You might also like