Professional Documents
Culture Documents
Measuring The Correlation of Personal Identity Documents in Structured Format
Measuring The Correlation of Personal Identity Documents in Structured Format
The name of the attribute and the nested structure solely depend Steps for measuring the correlation between the name values:
on the document structure. After analyzing a set of 20 personal
1. Obtain a single value by concatenating the values in the
documents, we identified the base attribute names and the
document according to the order of the attributes in the
possible ordering of the attributes for the name as follows [2].
attribute set.
It was observed in the experiment that the name segments 2. Convert each name segment to its phonetic representation
and the order of the name segments represented in the using Soundex [14].
documents vary with each document structure and the optimum 3. For each pair of documents d" and d#
results were not obtained when the name segments were 3.1. If <initials> is available in one of the two documents,
concatenated in the given order of the document [3]. Therefore, compare the <initials> with the first letters of the
we came up with the order of name segments as presented in the names of the other document
attribute set which was used to concatenate the name segments. 3.2. Identify overlapping name segments by comparing
If a document contained an attribute that is not available in our the phonetic representations
attribute set, the concatenation was done according to the order 3.3. Calculate Order Similarity (OS'( ,'* ,+,-. ) by
presented in the document. calculating the Jaccard Coefficient [23].
The attribute set = {<initials>, <first_name>, <middle_name>, 12.34,55"+6 7.6-.+87 "+ 89. 7,-. :3'.3
<name>, <full_name>, <surname>, <other_names>, OS'(,'* ,+,-. = (1)
;+"<=. +,-. 7.6-.+87 "+ '( ,+' '*
<last_name>}
3.4. Calculate String Similarity SS'(,'* ,+,-. of the name
[2] – The analyzed documents are passport, national identity card, birth segments with similar phonetic codes using
certificate, driving license, marriage certificate, the university ID card, motor
vehicle registration certificate, vehicle insurance card, mortgage deed,
Levenshtein distance
university transcript, university degree certificate, will (testament), bank ∑ @.2.+798."+ 7"-"4,3"8A
statement, credit card statement, divorce papers, income tax sheets. 𝑆S'(,'* ,+,-. = ;+"<=. +,-. 7.6-.+87 "+ ' ,+' ' (2)
( *
[3] – Eg. Passport contains <surname> <other_names> and personal
medical record contains <last_name> <first_name> which gives different 3.5. Calculate the pair-wise correlation score of name
values for the same name when concatenated in the order of attributes presented
in the document.
1EF ,F ,GHIJ KEEF ,F ,GHIJ 1.2. If <country> matched and if all documents contain
( * ( *,
BCS'(,'* ,+,-. D = (3)
L <province>,
4. Calculate the correlation score for name for dM Repeat Step 1.1 for <province>
1.3. Repeat Step 1.2 for <state>, <zipcode> and <city>.
U
𝐶𝑆OP,QRST = BQVUD ∑QXYZ 𝐶𝑆OP,OW,QRST (4)
X[\ 2. If at least one document contains the value of the address
where dM is the k-th document and 𝑛 is the number of represented as a single value with no child attributes,
documents that includes <name> 2.1. The address values of the remaining documents are
concatenated based on the order of the values
B. Date of Birth represented in the document.
2.2. Correlation score for the address is calculated using
The attribute set = {<date_of_birth>, <dob>, <birth_date>} the method in Section IV Part A, from Step 2
Child attribute set = {<date>, <day>, <d>, <month>, <m>, onwards.
<year>, <y>}
E. National Identification Code
The date of birth was analyzed by checking the exact
matching values for the date, month and year. The attribute set = {<nic>}
If different values were obtained, a candidate value The similarity national identification code was measured by
(Candidate':c ) will be selected based on majority voting and checking for exact matching values. If different values were
the score will be calculated against the candidate. obtained, the candidate method was used as in Section IV Part
B.
1, DOB'e = Candidate':c
CS'e,':c = f (5) 1, NIC'e = Candidate+"u
0, otherwise CS'e,+"u = f (7)
0, otherwise
C. Gender F. Correlation score of a document
The attribute space = {<gender>, <sex>} The final correlation score is the average of the scores of all
For the scope of this paper, we assumed only two gender values attributes.
(male and female). The final correlation score of a document dM :
The similarity was measured by classifying the value into U
Q
𝐶𝑆OP = Q ∑X | 𝐶𝑆OP,} (8)
one of the two target values and then checking if the values are |
of the same class. The classification process was based on where 𝐴 is the attribute and 𝑛} is the number of attributes
checking textual similarity with a target set of values. available.
Target value set for class 1 = {<f>, <female>} V. CONCLUSION AND FUTURE WORK
Target value set for class 2 = {<m>, <male>} The need of static and automated correlation calculation is
gradually increasing in a world where data processes are rapidly
If different values were obtained, the candidate method was replacing the human intervention with automated computations.
used as in Section IV Part B. Correlation between personal identity documents is typically
1, GENDER 'e = Candidate6.+'.3 measured by a human by comparing and analyzing a set of
CS'e,6.+'.3 = f (6) documents and identifying whether they belong to the same
0, otherwise
individual or not. In this paper, we have proposed and
D. Address successfully implemented a solution that automates the
correlation calculation and provides a normalized score that
The attribute set = {<address>, <line1>, <line2>, <city>, determines the correlation of a personal identity document
<zipcode>, <state>, <province>, <country>} within a set of documents.
Steps for measuring correlation of address: The algorithm to calculate the correlation score of “name”
1. If all the documents in the input set have child attributes, and “address” attributes (𝐴) in a document dM is:
1.1. If all documents contain a <country>, U
A candidate <country> value 𝐶𝑆OP,} = BQVUD ∑QXYZ 𝐶𝑆OP,OW ,} (9)
X[\
tCandidateu:=+83A v is selected based on
majority voting. The selection process involves The algorithm to calculate the correlation score of “date of
textual comparison and country code birth”, “gender” and “nic” attributes (𝐴) in a document dM is:
comparison 1, Candidate•
1.1.1. COUNTRY'e ≠ Candidateu:=+83A CS'e,• = f (10)
0, otherwise
1.1.1.1. CS'e,,''3.77 = 0
The overall correlation score of a document dM is the [8] B. Lisbach and V. Meyer, Linguistic identity matching. Springer, 2013.
[9] M. Ben Fleischman and E. Hovy, “Multi-Document Person Name
average of the attribute scores: Resolution,” ACL 2004 Work. Ref. Resolut. its Appl., pp. 1–8, 2004.
U
Q [10] C. Niu, W. Li, and R. K. Srihari, “Weakly Supervised Learning for
𝐶𝑆OP = Q ∑X | 𝐶𝑆OP,} (11) Cross-document Person Name Disambiguation Supported by
|
Information Extraction,” Proc. 42nd Meet. Assoc. Comput. Linguist.
where 𝐴 is the attribute and 𝑛} is the number of attributes (ACL’04), Main Vol., pp. 597–604, 2004.
available. [11] V. I. Torvik, M. Weeber, D. R. Swanson, and N. R. Smalheiser, “A
probabilistic similarity metric for medline records: A model for author
The algorithm is successfully implemented in IDStack, the name disambiguation,” J. Am. Soc. Inf. Sci. Technol., vol. 56, no. 2, pp.
140–158, 2005.
common protocol for document verification built of digital [12] J. Euzenat and P. Valtchev, “Similarity-based ontology alignment in
signatures [22]. IDStack is a protocol that facilitates three OWL-Lite,” Processing, vol. 16, no. C, pp. 333–337, 2004.
modules: data extraction, data validation and scoring. Our [13] R. Nagaraj and V. Thiagarasu, “Correlation similarity measure based
algorithm is used in the score module which is used by a relying document clustering with directed ridge regression,” Indian J. Sci.
Technol., vol. 7, no. 5, pp. 692–697, 2014.
party to evaluate a set of documents sent by the owner of the
[14] "Soundex System", National Archives, 2017. [Online]. Available:
documents. https://www.archives.gov/research/census/soundex.html.
[15] D. Holmes and M. C. McCabe, “Improving precision and recall for
The future contributions that can take this algorithm to the Soundex retrieval,” Proc. - Int. Conf. Inf. Technol. Coding Comput. ITCC
next level include fine-tuning the phonetic representation code 2002, pp. 22–26, 2002.
and adding language extensions. We currently use Soundex to [16] H. Raghavan and J. Allan, “Using soundex codes for indexing names in
retrieve the phonetic representation of name segments and ASR documents,” Proc. Work. Interdiscip. Approaches to Speech Index.
Retr. HLT-NAACL 2004., pp. 22–27, 2004.
address segments and the traditional Soundex fail to [17] M. Wieling, E. Margaretha, and J. Nerbonne, “Inducing a measure of
differentiate names that end in differently pronounced vowels phonetic similarity from pronunciation variation,” J. Phon., vol. 40, no.
(Eg. Both Kasun and Kasuni have the same Soundex code even 2, pp. 307–314, 2012.
though the two names are pronounced differently). The current [18] G. Kondrak, “Phonetic alignment and similarity,” Comput. Hum., vol.
37, no. 3, pp. 273–291, 2003.
algorithm supports only the documents in English language [19] W. Heeringa, “Measuring dialect pronunciation differences using
which can be improved in the future to support multiple Levenshtein distance,” Dissertations.Ub.Rug.Nl, 2004.
languages. [20] G. Kondrak, “N -Gram Similarity and Distance,” Lect. Notes Comput.
Sci., vol. 3772, pp. 115–126, 2005.
REFERENCES [21] S. Banerjee and T. Pedersen, “The design, implementation, and use of
the Ngram statistics package,” Lect. Notes Comput. Sci. (including
[1] D. Dessimoz and P. C. Champod, “Multimodal Biometrics for Identity Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 2588,
Documents 1 State-of-the-Art Research Report,” Most, no. September, pp. 370–381, 2003.
2005. [22] "IDStack — The common protocol for document verification built on
[2] R. Clarke, "Roger Clarke's 'Id and Authentication digital signatures - IEEE Conference Publication", Ieeexplore.ieee.org,
Basics'", Rogerclarke.com, 2017. [Online]. Available: 2018. [Online]. Available:
http://www.rogerclarke.com/DV/IdAuthFundas.html. https://ieeexplore.ieee.org/document/8285654/.
[3] B. Miller, “Vital signs of identity [biometrics],” IEEE Spectr., vol. 31, [23] P. Achananuparp, X. Hu, and X. Shen, “The evaluation of sentence
no. 2, pp. 22–30, 1994. similarity measures,” Lect. Notes Comput. Sci. (including Subser. Lect.
[4] J. D. Woodward, C. Horn, J. Gatune, and A. Thomas, Biometrics: A Look Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 5182 LNCS, pp.
at Facial Recognition. 2002. 305–316, 2008.
[5] J. Mathyer, “Quelques remarques sur le problemes de la sécurité des [24] W. W. Cohen, P. Ravikumar, and S. E. Fienberg, “A comparison of string
pieces didentité et des pieces de légitimation: une solution intéressante,” metrics for matching names and records,” KDD Work. Data Clean.
Rev. Int. Police Criminelle, vol. 35, no. 336, pp. 66–79, 1980. Object Consol., vol. 3, pp. 73–78, 2003.
[6] "Willkommen bei den Ausweisschriften für Schweizer [25] A. Islam and D. Inkpen, “Semantic text similarity using corpus-based
Staatsangehörige", Schweizerpass.admin.ch, 2017. [Online]. Available: word similarity and string similarity,” ACM Trans. Knowl. Discov. Data,
https://www.schweizerpass.admin.ch/pass/de/home.html. vol. 2, no. 2, pp. 1–25, 2008.
[7] J. W. M. Campbell, “The role of biometrics in ID document issuance,”
Keesing’s Journal of Documents & Identity, no. 4, pp. 6–8, 2004.