(12) United States Patent
Mugali, Jr. et al.

(10) Patent No.: US 9,218,334 B2
(45) Date of Patent: Dec. 22, 2015

(54) PRONOUNCEABLE DOMAIN NAMES
(71) Applicant: VeriSign, Inc., Reston, VA (US)
(72) Inventors: Aditya Anand Mugali, Jr., Reston, VA (US); Andrew W. Simpson, Sterling, VA (US); Scott King Walker, Purcellville, VA (US)
(73) Assignee: VERISIGN, INC., Reston, VA (US)
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 518 days.
(21) Appl. No.: 13/672,226
(22) Filed: Nov. 8, 2012
(65) Prior Publication Data: US 2013/0117013 A1, May 9, 2013
Related U.S. Application Data
(60) Provisional application No. 61/557,248, filed on Nov. 8, 2011.
(51) Int. Cl.: G06F 17/27 (2006.01); G06F 17/28 (2006.01); H04L 29/12 (2006.01)
(52) U.S. Cl.: CPC G06F 17/274 (2013.01); G06F 17/27 (2013.01); G06F 17/28 (2013.01); G06F 17/2881 (2013.01); H04L 61/3025 (2013.01)
(58) Field of Classification Search: CPC G06F 17/274; USPC 704; 713/188. See application file for complete search history.
Primary Examiner: Shaun Roberts
(74) Attorney, Agent, or Firm: MH2 Technology Law Group, LLP

(57) ABSTRACT

Embodiments of the present teachings relate to systems and methods for generating pronounceable domain names. The method includes providing a list of character strings; filtering the list of character strings through a first filter based on a phonetic model to produce a first filtered list of character strings; filtering the list of character strings through a second filter based on a character order model to produce a second filtered list of character strings; and generating, by a processor, a list of pronounceable domain names based on the first filtered list of character strings and the second filtered list of character strings.

27 Claims, 4 Drawing Sheets

[Sheet 1 of 4. FIG. 1: flowchart of method 100 with steps 110, 120, 130, and 140, ending in "Pronounceable Domain Names". FIG. 3: flowchart with steps 310 "Build ARFF File", 320 "Populate Attributes for Pronounceable Source Words", 330 "Populate Attributes for non-Pronounceable Source Words", and 340 "Generate Character Order Model".]

[Sheet 2 of 4. FIG. 2: flowchart with steps 220 "Create Data Entries for Pronounceable Source Words", 230 "Create Data Entries for non-Pronounceable Source Words", 222 and 232 "Calculate n-gram Frequencies", 224 and 234 "Generate Double Metaphone", 226 "Attribute Class = Y", 236, and 250 "Generate Phonetic Model".]

[Sheet 3 of 4. FIG. 4: block diagram of a computing system; legible labels include processor, memory, removable storage, display, keyboard, and mouse.]
[Sheet 4 of 4. FIG. 5: exemplary user interface showing a list of generated names.]

PRONOUNCEABLE DOMAIN NAMES

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority from U.S. Provisional Patent Application Ser. No. 61/557,248, filed Nov. 8, 2011, which is hereby incorporated by reference in its entirety.

FIELD

This disclosure relates generally to systems and methods for building phonetic and character order models used to identify pronounceable domain names, and generating pronounceable domain names using the phonetic and character order models.

DESCRIPTION OF THE RELATED ART

Short domain names are popular because they can be catchy and easier to remember than longer domain names. However, when seeking to register a new domain name, many people argue that the .COM zone is full, and that there are no good domain names left. This is far from the truth. Although many domain names that contain dictionary words may be registered, statistically there remain a large number of unregistered and available domain names of different character lengths, many of which may be pronounceable. While some of these unregistered domain names may be nonsensical strings of characters, many of them may be easy to use and potentially could be popular and successful domain names if someone were to put marketing and a brand name behind them. One example of this is Google.com. While the word "google" was not previously a word, it has now become a word.

One goal of short domain names may be to be memorable and communicable.
That is, when viewing the domain name, individuals should be able to pronounce the domain name, remember the domain name, and, when telling the domain name to others, the other individuals should be able to remember and easily find the domain name. For example, if fifteen people were to look at a domain name (that is not a dictionary word) and ten to fifteen of those people were able to pronounce the domain name the same way and tell it to their friends who could then easily find and visit the domain, the domain name may be a good choice, despite not being a real word.

More specifically, 5 and 6 character domain names are very popular. However, a very large percentage of 5 and 6 character domain names in the .COM zone have already been registered. As discussed above, however, there are many unregistered 5 and 6 character domain names that do not have any specific meaning in the English Latin script, yet may be pronounceable.

It is accordingly an object of the disclosure to build models that may be used to identify pronounceable domain names. It is another object of the disclosure to generate pronounceable domain names using the pronounceable domain name models. The systems and methods according to embodiments of the disclosure may be used to generate pronounceable company names or websites based on input words relevant to the particular business.
These objects may be achieved by using a dictionary set to learn what words are pronounceable. A combination of Bayesian networks to learn the composition of phonetics in pronounceable words and decision trees to learn the order of characters in a pronounceable word may be used, along with n-gram scoring heuristics, to build models that can predict if a given word is pronounceable based on the learned models.

SUMMARY

In accordance with the disclosure, systems and methods for building phonetic and character order models used to identify pronounceable domain names, and generating pronounceable domain names using the phonetic and character order models are provided.

In one embodiment, a method of generating pronounceable domain names may include: (1) building a phonetic model representing phonetic characteristics of pronounceable words; (2) building a character order model representing character order characteristics of pronounceable words; and (3) generating a list of pronounceable domain names by generating a list of character strings, and passing the list of character strings through a first filter based on the phonetic model and a second filter based on the character order model.

In implementations, a computer-implemented method of generating pronounceable domain names is disclosed.
The computer-implemented method can comprise providing a list of character strings; filtering the list of character strings through a first filter based on a phonetic model to produce a first filtered list of character strings; filtering the list of character strings through a second filter based on a character order model to produce a second filtered list of character strings; and generating, by a processor, a list of pronounceable domain names based on the first filtered list of character strings and the second filtered list of character strings.

In implementations, the computer-implemented method can further comprise building the phonetic model representing phonetic characteristics of pronounceable words.

In implementations, the computer-implemented method can further comprise building the character order model representing character order characteristics of pronounceable words.

In implementations, in the building of the phonetic model, the method can further comprise preparing a library of source words comprising pronounceable words and non-pronounceable words; and providing the library of source words to a learning model algorithm to train the learning model algorithm to determine characteristics of pronounceable and characteristics of non-pronounceable words.

In implementations, the learning model algorithm can comprise a Bayesian network.

In implementations, in the preparing the library, the method can comprise building an attribute relationship file format (ARFF) based on the library of source words; and associating one or more attributes of pronounceable words and non-pronounceable words with the ARFF.

In implementations, at least one attribute of the one or more attributes of pronounceable words and non-pronounceable words can comprise an n-gram score.
In implementations, the computer-implemented method can further comprise calculating a first n-gram score for the pronounceable words in the library; and calculating a second n-gram score for the non-pronounceable words in the library.

In implementations, an attribute of the one or more attributes of pronounceable words and non-pronounceable words can comprise a phonetic representation.

In implementations, the computer-implemented method can further comprise determining the phonetic representation for each pronounceable word and each non-pronounceable word in the library.

In implementations, the phonetic representation can comprise a Metaphone representation or a Double Metaphone representation.

In implementations, the building the character order model can further comprise preparing a library of source words comprising pronounceable and non-pronounceable words and associating attributes of pronounceable source words and attributes of non-pronounceable source words with words in the library of source words.

In implementations, the phonetic model can be operable to determine a probability that an input character string of the list of character strings is pronounceable.

In implementations, the character order model can be operable to determine a probability that an input character string of the list of character strings is pronounceable.

In implementations, the first filtered list of character strings can be provided to the second filter to produce a second filtered list of character strings.

In implementations, the second filtered list of character strings can be provided to the first filter to produce a first filtered list of character strings.

In implementations, the computer-implemented method can further comprise determining a first probability,
using the phonetic model, that an input character string of the list of character strings is pronounceable; determining a second probability, using the character order model, that the input character string is pronounceable; determining a combined probability based on the first probability and the second probability that the input character string is pronounceable; comparing the combined probability with a pronounceability threshold to determine whether the input character string is likely to be pronounceable; and providing the input character string as the list of pronounceable domain names.

In implementations, the computer-implemented method can further comprise determining if a pronounceable domain name in the list of pronounceable domain names is registered; and providing a list of alternative suggestions of unregistered pronounceable domain names from the list of pronounceable domain names if the pronounceable domain name is determined to be registered or is unregistered.

In implementations, the computer-implemented method can further comprise ordering the list of alternative suggestions of unregistered pronounceable domain names based on one or more of the following: a primitive distance between the pronounceable domain name and the alternative suggestions of unregistered pronounceable domain names, a degree of similarity between the pronounceable domain name and the alternative suggestions of unregistered pronounceable domain names, an amount of traffic the pronounceable domain name is receiving, or combinations thereof.

In implementations, a device is disclosed that can comprise at least one processor; and a non-transitory computer readable medium comprising instructions that cause the at least one processor to perform a method comprising: providing a list of character strings; filtering the list of character strings through a first filter based on a phonetic model to produce a first filtered list of character strings; filtering the list of character strings through a
second filter based on a character order model to produce a second filtered list of character strings; and generating, by a processor, a list of pronounceable domain names based on the first filtered list of character strings and the second filtered list of character strings.

In implementations, a computer readable storage medium is disclosed that can comprise instructions that cause one or more processors to perform a method comprising: providing a list of character strings; filtering the list of character strings through a first filter based on a phonetic model to produce a first filtered list of character strings; filtering the list of character strings through a second filter based on a character order model to produce a second filtered list of character strings; and generating, by a processor, a list of pronounceable domain names based on the first filtered list of character strings and the second filtered list of character strings.

Additional objects and advantages of the embodiments of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the embodiments. The objects and advantages of the embodiments will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the embodiments, as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

Various features of the embodiments can be more fully appreciated, as the same become better understood with reference to the following detailed description of the embodiments when considered in connection with the accompanying figures, in which:

FIG. 1 illustrates an exemplary flowchart for a method according to embodiments of the disclosure.

FIG.
2 illustrates an exemplary flowchart for a method of creating a phonetic model according to embodiments of the disclosure.

FIG. 3 illustrates an exemplary flowchart for a method of creating a character order model according to embodiments of the disclosure.

FIG. 4 is a block diagram of an exemplary computing system, according to various embodiments.

FIG. 5 illustrates an exemplary user interface according to embodiments of the disclosure.

DETAILED DESCRIPTION

For simplicity and illustrative purposes, the principles of the present teachings are described by referring mainly to exemplary embodiments thereof. However, one of ordinary skill in the art would readily recognize that the same principles are equally applicable to, and can be implemented in, all types of information and systems, and that any such variations do not depart from the true spirit and scope of the present teachings. Moreover, in the following detailed description, references are made to the accompanying figures, which illustrate specific exemplary embodiments. Electrical, mechanical, logical and structural changes may be made to the exemplary embodiments without departing from the spirit and scope of the present teachings. The following detailed description is, therefore, not to be taken in a limiting sense and the scope of the present teachings is defined by the appended claims and their equivalents.

FIG. 1 illustrates an exemplary flowchart for a method 100 for building phonetic and character order models used to identify pronounceable domain names, and generating pronounceable domain names using the phonetic and character order models. While FIG. 1 illustrates various processes that can be performed, one skilled in the art will realize that any of the processes and stages of the processes can be performed by other components of a computing device.
Likewise, one skilled in the art will realize that the illustrated stages of the processes are exemplary and that any of the illustrated stages can be removed, additional stages can be added, and the order of the illustrated stages can be changed.

The method may be carried out for domain names of a specific length (e.g., 5 characters), or for domain names of multiple lengths (e.g., all domain names having 3-10 characters). Throughout this disclosure, embodiments will be described for generating pronounceable domain names having a length of 5 characters. However, the exemplary embodiments having 5 character domain names are not intended to limit the scope of this disclosure, but are provided as an illustration of only a subset of possible embodiments of the disclosure.

As shown in FIG. 1, source data is gathered and prepared in 110. The source data may include both pronounceable and non-pronounceable words. Throughout this disclosure, the term "words" is used in accordance with its ordinary usage, and also to generally denote character strings, whether or not the character strings form "words" in the ordinary sense.

The pronounceable words in the source data may be taken from an English language dictionary. For purposes of building the models, it is assumed that all dictionary words are pronounceable. Throughout this disclosure, the domain names are discussed with respect to English words and pronounceability. However, the systems and methods disclosed could also be used in connection with domain names and pronounceability of other languages. For example, to generate pronounceable domain names in Spanish, a Spanish language dictionary could be used in place of an English language dictionary.
Other sources of pronounceable words could also be used for the source data, or a subset of dictionary words could be used.

The non-pronounceable words in the source data may be generated using a random string generator. For example, if 5 character domain names are being generated, a random string generator may generate random 5 character strings as non-pronounceable words for the source data. Alternatively, random character strings of varying lengths may be generated. For purposes of building the models, it is assumed that the randomly generated strings are non-pronounceable. In other embodiments, the randomly generated strings may be cross-checked against a list of dictionary words to remove any known pronounceable words from the randomly generated strings.

The source data includes both pronounceable and non-pronounceable words so that the learning models have learning information about characteristics of both pronounceable and non-pronounceable words. In one embodiment, the number of pronounceable words in the source data is substantially equal to the number of non-pronounceable words in order to equally weight the source data. Although FIG. 1 illustrates only one set of source data being used for both the phonetic model and the character order model, in other embodiments, separate source data sets may be provided for the separate models.

In 120, a phonetic model is built using the source data prepared and obtained in 110. More specific details regarding building the phonetic model according to an embodiment are described below with reference to FIG. 2.

In 130, a character order model is built using the source data prepared and obtained in 110. More specific details regarding building the character order model according to an embodiment are described below with reference to FIG. 3.
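The source-data preparation of step 110 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the fixed random seed, and the restriction to lowercase ASCII letters are assumptions made for the example. It keeps dictionary words of the target length as class "Y", generates an equal number of random strings as class "N", and cross-checks the random strings against the dictionary as described above.

```python
import random
import string


def random_words(count, length, dictionary, seed=0):
    """Return `count` distinct random lowercase strings that are not dictionary words."""
    rng = random.Random(seed)  # fixed seed is an assumption, used only for repeatability
    known = set(dictionary)
    generated = set()
    while len(generated) < count:
        candidate = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if candidate not in known:  # cross-check against known pronounceable words
            generated.add(candidate)
    return sorted(generated)


def build_source_data(dictionary, length=5, seed=0):
    """Return (word, label) pairs with equal numbers of "Y" and "N" entries."""
    pronounceable = sorted(
        {w.lower() for w in dictionary if len(w) == length and w.isalpha()}
    )
    non_pronounceable = random_words(len(pronounceable), length, pronounceable, seed)
    return [(w, "Y") for w in pronounceable] + [(w, "N") for w in non_pronounceable]


source_data = build_source_data(["hello", "world", "green", "cat"])
```

Here "cat" is dropped because it is not 5 characters long, so the result holds three "Y" entries and three "N" entries.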
In 140, pronounceable domain names are generated using the phonetic model and the character order model. Generating the pronounceable domain names may include creating a list of n-character input words (where n is the desired domain name length), and filtering the list of input words through the phonetic and character order models. The list of n-character input words may be limited to a single n, for example, only 5 character input words, or may include several different lengths. In some embodiments, the input words are intended to be used as domain names, so only the characters 0-9, A-Z, and hyphen are included as possible characters. However, in other embodiments where the input words may be intended for use in other applications, other restrictions (or no restrictions) may be placed on the character set from which the n-character input words are formed.

Additionally, in one embodiment, the list of n-character input words may include all possible n-character words. In another embodiment, the list of n-character input words may include an appropriate subset of all possible n-character words.

For example, each of the models may return a probability that the input word is pronounceable, represented by an output from 0.0 to 1.0. The list of input words may be first filtered by the phonetic model. Then, the input to the character order model may be limited to only those input words with a probability value from the phonetic model that exceeds a predetermined threshold, for example, 90% or 0.9. The filtered input words may then be filtered by the character order model. After passing the filtered input words through the character order model, the final list of pronounceable domain names may be limited to only the input words with a probability value from the character order model that exceeds a predetermined threshold, for example, 95% or 0.95.
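The two-stage filtering just described can be sketched as below. The sketch assumes lowercase letters stand in for A-Z, and the two scoring functions (toy_phonetic and toy_order) are hypothetical stand-ins for the trained phonetic and character order models; only the thresholds 0.9 and 0.95 and the filter order come from the description.

```python
from itertools import product
import string

# Characters allowed in a domain label per the description: letters, digits, hyphen.
DOMAIN_CHARS = string.ascii_lowercase + string.digits + "-"


def candidates(n, alphabet=DOMAIN_CHARS):
    """Lazily yield every n-character string over the alphabet."""
    for chars in product(alphabet, repeat=n):
        yield "".join(chars)


def two_stage_filter(words, phonetic_prob, order_prob, t1=0.9, t2=0.95):
    """Keep words whose phonetic score exceeds t1, then whose order score exceeds t2."""
    stage1 = [w for w in words if phonetic_prob(w) > t1]
    return [w for w in stage1 if order_prob(w) > t2]


def toy_phonetic(word):
    """Hypothetical stand-in for the phonetic model: favors words containing a vowel."""
    return 0.99 if any(c in "aeiou" for c in word) else 0.1


def toy_order(word):
    """Hypothetical stand-in for the character order model: favors all-letter words."""
    return 0.99 if word.isalpha() else 0.2


names = two_stage_filter(candidates(2, "ab1"), toy_phonetic, toy_order)
print(names)  # ['aa', 'ab', 'ba']
```

Because the candidate generator is lazy, the full 37-character alphabet can be enumerated without holding every string in memory, although the number of candidates still grows as 37 to the power n.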
The thresholds may be set at other probability values as desired or determined to return optimum results.

In other embodiments, the order of the filtering may be reversed, with the input words first being filtered by the character order model, then filtered by the phonetic model. In other embodiments, all of the input words may be filtered by both models, and the final list of pronounceable domain names may be determined based on a combination of the two probability values returned by the models. For example, in one embodiment, only input words that received a predetermined probability (e.g., at least 90% or 0.9) in both models may be included in the final list of pronounceable words or domain names. In another embodiment, the calculated probabilities may be combined or multiplied, then compared to a threshold to determine whether each of the input words is sufficiently likely to be pronounceable as to be included in the final list of pronounceable domain names.

It should be noted that, while FIG. 1 illustrates the phonetic model and character order models being built in parallel, in other embodiments, the models may be built consecutively or in any other order.

FIG. 2 illustrates an exemplary flowchart for a method 200 of creating a phonetic model according to embodiments of the disclosure. Although the embodiment illustrated in FIG. 2 and discussed below specifically uses the open source WEKA program and a Bayesian network to learn, develop and generate the phonetic model, other programs and learning models may be used in other embodiments. While FIG. 2 illustrates various processes that can be performed, one skilled in the art will realize that any of the processes and stages of the processes can be performed by other components of a computing device. Likewise, one skilled in the art will realize that the illustrated stages of the processes are exemplary and that any of the illustrated stages can be removed, additional stages can be added, and the order of the illustrated stages can be changed.
As shown in FIG. 2, an attribute relationship file format (ARFF) file is built using source data in 210. The source data may be, for example, source data such as is prepared in 110 above. In one embodiment, the number of pronounceable words in the source data is substantially equal to the number of non-pronounceable words, in order to equally weight the source data. The ARFF is a text file format used by WEKA to store data in a database. The ARFF file contains attributes about the source data. The attributes for the source data that contains pronounceable words may be generated separately from the attributes for the source data that contains non-pronounceable words.

In 220, data entries for pronounceable source words may be created. Similarly, in 230, data entries for non-pronounceable source words may be created.

In 222, n-grams over the set of pronounceable source words may be calculated. For example, if the domain names that are ultimately to be generated are 5 character strings, 2-grams, 3-grams, and 4-grams may be calculated over the set of pronounceable source words. An n-gram's value is the total number of occurrences of the n-gram in the set of source words. As a brief example, if the source includes only the words "attack, daily, data, fail, fill," then the n-gram score for "ai" is 2, for "ail" is 2, for "il" is 3, for "at" is 2, and for "da" is 2. Other n-grams in this data set, such as "ata", occur only once, for a score of 1.

The n-gram scores form attributes of the ARFF file for each pronounceable source word. In the above brief example, the 2-gram attribute of "daily" would be the sum of the 2-gram scores for the 2-grams within the word daily: da, ai, il, and ly, which are 2, 2, 3, and 1, for a total 2-gram attribute score of 8. The 3-gram attribute of "daily" would be the sum of the 3-gram scores for the 3-grams within the word daily: dai, ail, and ily, which are 1, 2, and 1, for a total 3-gram attribute score of 4.
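The n-gram attribute calculation in the brief example above can be checked with a short sketch. It assumes the five source words are attack, daily, data, fail, and fill, a reading that reproduces every score quoted in the example; the function names are illustrative only.

```python
from collections import Counter


def ngrams(word, n):
    """Return the list of contiguous n-character substrings of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_counts(words, n):
    """Count every occurrence of each n-gram across the whole set of source words."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts


def ngram_attribute(word, counts, n):
    """Sum the corpus-wide scores of the n-grams that occur within one word."""
    return sum(counts[gram] for gram in ngrams(word, n))


SOURCE_WORDS = ["attack", "daily", "data", "fail", "fill"]  # assumed example set
two_grams = ngram_counts(SOURCE_WORDS, 2)
three_grams = ngram_counts(SOURCE_WORDS, 3)

print(ngram_attribute("daily", two_grams, 2))    # 8  (da=2, ai=2, il=3, ly=1)
print(ngram_attribute("daily", three_grams, 3))  # 4  (dai=1, ail=2, ily=1)
```

The same functions apply unchanged to 4-grams or to a set of non-pronounceable source words.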
The process for calculating the n-gram attribute for each value of n continues in this way.

In 232, n-grams over the set of non-pronounceable source words may be calculated, and n-gram attributes generated, in a manner similar to that described above with reference to 222. In other embodiments, the calculated n-gram frequencies from the pronounceable source words may be used to generate the n-gram attributes for the non-pronounceable source words, without calculating n-gram frequencies over the non-pronounceable source words.

In 224, a Double Metaphone attribute is defined for each of the pronounceable source words. The Double Metaphone attribute is based on the Double Metaphone representation of the source words. The Double Metaphone is a standard phonetic representation of words. In other embodiments, other phonetic representations of the source words may be used, such as a Metaphone representation or any future version or variant of such. Each character in the Double Metaphone representation of the source words may form an attribute.

In 234, a Double Metaphone attribute is defined for each of the non-pronounceable source words, similar to the method described in 224 with respect to the pronounceable source words.

In 226, a class attribute is defined for each of the pronounceable source words. The class attribute identifies whether the source word is pronounceable. Thus, each of the pronounceable source words receives a class attribute "Y", or other affirmative representation. Similarly, in 236, each of the non-pronounceable source words receives a class attribute "N", or other negative representation.

In 240, the attributes of the pronounceable source words and the attributes of the non-pronounceable source words are combined into the ARFF file. In other embodiments, the pronounceable and non-pronounceable source words are not separately processed (as illustrated in FIG.
2), but are instead processed together, in which case there may be no need to combine the attributes into the ARFF file, because they will already be present in the ARFF file.

In 250, a phonetic model is generated based on the attributes of the pronounceable source words and non-pronounceable source words included in the ARFF file. In one embodiment, this is accomplished using the Bayesian Network algorithm in WEKA. The resulting Bayesian network model, or phonetic model, can be used to predict a probability that a given input string will be pronounceable. While a Bayesian network has been described as used to generate the phonetic model, other appropriate machine learning models may be used in other embodiments.

FIG. 3 illustrates an exemplary flowchart for a method 300 of creating a character order model according to embodiments of the disclosure. The character order model may determine the probability that an input word is pronounceable based on information learned by and stored in the model relating to the chain of order of letters in pronounceable words. While FIG. 3 illustrates various processes that can be performed, one skilled in the art will realize that any of the processes and stages of the processes can be performed by other components of a computing device. Likewise, one skilled in the art will realize that the illustrated stages of the processes are exemplary and that any of the illustrated stages can be removed, additional stages can be added, and the order of the illustrated stages can be changed.

As shown in FIG. 3, an ARFF file is built using source data in 310. The source data may be, for example, source data such as is prepared in 110 above, and/or source data that was used in the method 200 described above. In one embodiment, the number of pronounceable words in the source data is substantially equal to the number of non-pronounceable words, in order to equally weight the source data.

In 320, the ARFF file is populated with attributes of the pronounceable source words.
The attributes of the ARFF file for the character order model include the characters of the source words, and a class attribute that identifies whether the word is pronounceable. Thus, in 320, the ARFF file is populated with the character attributes of the pronounceable source words, and a class attribute "Y".

In 330, the ARFF file is populated with attributes of the non-pronounceable source words. The attributes of the non-pronounceable source words are populated in a similar manner as described above with respect to the pronounceable source words, except that the class attribute for the non-pronounceable source words is "N".

In 340, the character order model is generated based on the attributes of the pronounceable and non-pronounceable source words stored in the ARFF file. This may be accomplished using the J48 decision tree algorithm in WEKA. The resulting character order model can be used to predict a probability that a given input string will be pronounceable. While the J48 decision tree algorithm has been described as used to generate the character order model, other appropriate machine learning models may be used in other embodiments.

After building the phonetic and character order models based on source data, new input words may be processed using the phonetic and character order models to determine the probability that the new input words are pronounceable. If it is determined with a sufficient degree of certainty that the new input words are pronounceable, the new input words may be output to a user, or otherwise stored on a storage device.

This may be useful, for example, in suggesting domain names to users seeking to register a domain name. For example, if a user requests a domain name that is already registered, the systems and methods described in the disclosure may be used to suggest alternative, pronounceable domain names based on the originally requested domain name.
The suggestions may be based on the requested domain name, or a user may enter several relevant key words, and the suggestions may be based on the key words rather than a specific requested domain name. A primitive distance may be used to correlate the search terms with the possible suggestions from the pronounceable domain names system. That is, any term that is within a certain distance of the search term being passed to the system may be returned, and the system could perform further iterations to improve matches between the search terms and suggestions for even more relevant pronounceable domain name suggestions.

The output pronounceable words may be prioritized. The prioritized output words may be stored or displayed in the prioritized order. The output words may be prioritized, for example, based on the primitive distance between the output words and the input request. In other embodiments, the output words may be prioritized based on a degree of similarity of the output word with the input by the user. In another embodiment, the output words may be prioritized based on the amount of traffic the word or string is receiving. If the output word or string is not a registered domain name, NXD traffic information may be used to determine traffic volume for the non-existing domain.

By way of a non-limiting example, consider a learning task for filter 1: create 2-, 3-, and 4-gram scores from a dictionary. As an example, take a three-entry dictionary consisting of the following entries: foo, food, and world. The 2-grams are fo, oo, fo, oo, od, wo, or, rl, ld.
The 3-grams are foo, foo, ood, wor, orl, rld. The 4-grams are food, worl, orld. The scores across n-grams are: fo:2; oo:2; od:1; wo:1; or:1; rl:1; ld:1; foo:2; ood:1; wor:1; orl:1; rld:1; food:1; worl:1; orld:1. Double metaphone representations are then created of these words as shown in Table 1.

[TABLE 1: double metaphone representations of the dictionary words]

The same is done for randomly generated words that are not pronounceable, to build the "learning" part of the ARFF for filter 1.

[TABLE 2: double metaphone representations of the randomly generated, non-pronounceable words]

The ARFF can be represented as shown below:

[ARFF listing for filter 1]

The ARFF is then run through a Bayesian network to learn the model for filter 1.

An ARFF is then created for filter 2. This ARFF has to be created specific to a use case, so consider the filters built for 5-letter words; instead of foo, food and world, assume that we used hello, world and green. The ARFF can be represented as shown below:

@attribute m1 {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}
@attribute m2 {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}
@attribute m3 {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}
@attribute m4 {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}
@attribute m5 {a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}

This ARFF is then run through the J48 decision tree algorithm to build the model for filter 2.

The processing steps can include the following: (1) create a set of n-character names to be filtered; (2) pass the names through filter 1, which generates a probability for a given word to be pronounceable; (3) filter names that are below a given threshold; and (4) follow the same steps for filter 2.

The methods described throughout the description of embodiments of the disclosure may be implemented or performed by a system that includes a processor and a memory. An exemplary system for generating pronounceable domain names may include a processor, storage, a memory, and input/output (I/O) devices. The system may be implemented in various ways. For example, the system may be embodied in a general purpose computer, a server, a mainframe computer, or any combination of these components.
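The n-gram scores in the worked example above (dictionary entries foo, food, and world) can be reproduced with a short Python sketch. The function name ngram_counts and its sizes parameter are assumptions for illustration. The sketch only counts n-gram occurrences across the dictionary; the source does not specify how these counts are combined into a per-word attribute, so no such step is shown.

```python
from collections import Counter
from typing import Iterable, Tuple


def ngram_counts(words: Iterable[str], sizes: Tuple[int, ...] = (2, 3, 4)) -> Counter:
    """Count every contiguous n-gram of the given sizes across all words."""
    counts: Counter = Counter()
    for word in words:
        for n in sizes:
            # A word shorter than n contributes no n-grams of that size.
            for start in range(len(word) - n + 1):
                counts[word[start:start + n]] += 1
    return counts


counts = ngram_counts(["foo", "food", "world"])

# Print the scores in the same "ngram:count" style used in the text.
print("; ".join(f"{gram}:{count}" for gram, count in counts.items()))
```

With this dictionary, fo, oo, and foo each occur twice and every other n-gram occurs once, which matches the scores listed in the example.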
The system may be standalone, or it may be part of a subsystem, which may, in turn, be part of a larger system. Further, the components of the system may be separated, or integrated into a single system.

The processor may include one or more known processing devices, such as a microprocessor from the Pentium™ or Xeon™ family manufactured by Intel™, the Turion™ family manufactured by AMD™, or any of various processors manufactured by Sun Microsystems. The memory may include one or more storage devices configured to store information used by the processor to perform certain functions related to disclosed embodiments. The storage may include a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other type of computer-readable medium used as a storage device.

In one embodiment, memory may include one or more programs or subprograms that may be loaded from storage or elsewhere that, when executed by the processor, perform various procedures, operations, or processes consistent with disclosed embodiments.

While the above disclosure has referred specifically to the pronounceability of domain names, the disclosed systems and methods may also be operable for generating other pronounceable words or character strings, for example, email addresses, gamertags, online identities, company or store names, etc. Also, the above disclosure is not limited to character strings of a specific length, but may be adapted as necessary to accommodate different lengths of character strings. Additionally, while the above disclosure refers to pronounceability in the English language, appropriate modifications may be made to accommodate other languages without departing from the spirit and scope of the invention.

Further, while the source of the data has been described in embodiments as being a dictionary, other data sources may be used for obtaining and generating a list of pronounceable words.
For example, domain names that individuals have tried to register, or Web addresses that are frequently typed but do not correspond to any registered domain name, may be used to create a control set of pronounceable words.

FIG. 4 illustrates an exemplary block diagram of a computing system 400 which can be implemented to perform the various processes of FIGS. 1-3 according to various embodiments. While FIG. 4 illustrates various components of computing system 400, one skilled in the art will realize that existing components can be removed or additional components can be added.

As shown in FIG. 4, computing system 400 can include one or more processors, such as processor 402, that provide an execution platform for embodiments of security tool 102. Commands and data from processor 402 are communicated over communication bus 404. Computing system 400 can also include main memory 406, for example, one or more computer readable storage media such as a Random Access Memory (RAM), where security tool 102, and/or other application programs, such as an operating system (OS), can be executed during runtime, and can include secondary memory 408. Secondary memory 408 can include, for example, one or more computer readable storage media or devices such as hard disk drive 410 and/or removable storage drive 412, representing a floppy diskette drive, a magnetic tape drive, a compact disk drive, etc., where a copy of an application program embodiment for security tool 102 can be stored. Removable storage drive 412 reads from and/or writes to removable storage unit 414 in a well-known manner. The computing system 400 can also include a network interface 416 in order to connect with the one or more networks 110.

In embodiments, a user can interface with computing system 400 and operate security tool 102 with keyboard 418, mouse 420, and display 422. To provide information from computing system 400 and data from security tool 102, the computing system 400 can include display adapter 424.
Display adapter 424 can interface with communication bus 404 and display 422. Display adapter 424 can receive display data from processor 402 and convert the display data into display commands for display 422.

Certain embodiments may be performed as a computer application or program. The computer program may exist in a variety of forms both active and inactive. For example, the computer program can exist as software program(s) comprised of program instructions in source code, object code, executable code or other formats; firmware program(s); or hardware description language (HDL) files. Any of the above can be embodied on a computer readable medium, which include computer readable storage devices and media, and signals, in compressed or uncompressed form. Exemplary computer readable storage devices and media include conventional computer system RAM (random access memory), ROM (read-only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes. Exemplary computer readable signals, whether modulated using a carrier or not, are signals that a computer system hosting or running the present teachings can be configured to access, including signals downloaded through the Internet or other networks. Concrete examples of the foregoing include distribution of executable software program(s) of the computer program on a CD-ROM or via Internet download. In a sense, the Internet itself, as an abstract entity, is a computer readable medium. The same is true of computer networks in general.

FIG. 5A illustrates an exemplary user interface according to embodiments of the disclosure. While FIG. 5A illustrates various aspects, fields, or attributes of the exemplary user interface, one skilled in the art will realize that any of the aspects, fields, or attributes can be removed, additional aspects,
fields, or attributes can be added, and the order of the illustrated aspects, fields, or attributes can be changed.

The user interface 505 can be any type of user interface that allows the user to enter, view, and interact with the pronounceability service in relation to the processes discussed in relation to FIGS. 1-3. The pronounceability service can provide the user interface 505 to the user via an output device, such as a display. Field 505 is a field that allows the user to enter one or more keywords, such as domain names, to be analyzed according to the processes of FIGS. 1-3. Field 510 is a field that allows the user to enter one or more characters that the one or more keywords can begin with. Field 515 is a field that allows the user to enter one or more characters that the one or more keywords can end with. Field 520 is a field that allows the user to view previously entered keywords. Field 525 is a field that allows the user to view results based on the entered keyword and the processes of FIGS. 1-3.

In the above examples, WEKA and ARFF are used to illustrate various implementations in which aspects of the present disclosure can be performed. For example, WEKA is one of a variety of programs that can be used for modeling, and it uses ARFF as its input file format. In general, ARFF is a format useful for characterizing a feature vector that can be used to train a machine learning model. However, different applications may leverage different formats to represent feature vectors.

While the teachings have been described with reference to the exemplary embodiments thereof, those skilled in the art will be able to make various modifications to the described embodiments without departing from the true spirit and scope. The terms and descriptions used herein are set forth by way of illustration only and are not meant as limitations. In particular, although the method has been described by examples, the steps of the method may be performed in a different order than illustrated or simultaneously.
Furthermore, to the extent that the terms "including", "includes", "having", "has", "with", or variants thereof are used in either the detailed description and the claims, such terms are intended to be inclusive in a manner similar to the term "comprising." As used herein, the term "one or more of" with respect to a listing of items such as, for example, A and B, means A alone, B alone, or A and B. Those skilled in the art will recognize that these and other variations are possible within the spirit and scope as defined in the following claims and their equivalents.

What is claimed is:

1. A computer-implemented method of generating pronounceable domain names, comprising: providing a list of character strings; determining a first probability that a character string in the list of character strings is pronounceable based on a phonetic model; determining a second probability that a character string in the list of character strings is pronounceable based on a character order model; filtering the list of character strings through a first filter based on the first probability to produce a first filtered list of character strings; filtering the list of character strings through a second filter based on the second probability to produce a second filtered list of character strings; and generating, by a processor, a list of pronounceable domain names based on the first filtered list of character strings and the second filtered list of character strings.

2. The computer-implemented method of claim 1, further comprising building the phonetic model representing phonetic characteristics of pronounceable words.

3. The computer-implemented method of claim 1, further comprising building the character order model representing character order characteristics of pronounceable words.

4.
The computer-implemented method of claim 2, wherein building the phonetic model further comprises: preparing a library of source words comprising pronounceable words and non-pronounceable words; and providing the library of source words to a learning model algorithm to train the learning model algorithm to determine characteristics of pronounceable and characteristics of non-pronounceable words.

5. The computer-implemented method of claim 4, wherein the learning model algorithm comprises a Bayesian network.

6. The computer-implemented method of claim 4, wherein preparing the library comprises: building an attribute relationship file format (ARFF) based on the library of source words; and associating one or more attributes of pronounceable words and non-pronounceable words with the ARFF.

7. The computer-implemented method of claim 6, wherein at least one attribute of the one or more attributes of pronounceable words and non-pronounceable words comprises an n-gram score.

8. The computer-implemented method of claim 7, further comprising: calculating a first n-gram score for the pronounceable word in the library; and calculating a second n-gram score for the non-pronounceable words in the library.

9. The computer-implemented method of claim 6, wherein an attribute of the one or more attributes of pronounceable words and non-pronounceable words comprises a phonetic representation.

10. The computer-implemented method of claim 9, further comprising:
