Article

Social Science Computer Review
1-19
© The Author(s) 2017
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0894439317730296
journals.sagepub.com/home/ssc

Profiling Cybercriminals: Topic Model Clustering of Carding Forum Member Comment Histories

Alex Kigerl1

Abstract
Cybercrime has become a growing business. The marketplaces for such businesses tend to be online
forums. Much of the research on carding forums has been qualitative, but there have been quantitative
analyses as well. One such type of analysis is topic modeling, a clustering technique that
groups forum users according to the textual comments they leave. However, this type of research
so far has been exclusively quantitative, without qualitatively examining the topics. The following
study attempts to add to this research by analyzing the comment histories from 30,469 users from
three carding forums. The results have revealed that users belong to one or more of 21 different
topics. The topics are grouped into six broader categories, consisting of a customer base, identity
fraud market, crimeware market, free content market, and two others. Descriptives are provided
displaying how the topics are distributed across the three websites and directions for future
research are discussed.

Keywords
cybercrime, carding forums, topic modeling, fraud, malware

Today cybercrime is a business rather than a hobby for most participants. Because cybercrime is
digital, participation in cybercrime markets takes place primarily in online communities.
Many of these online communities are organized, often with a hierarchy, division of labor, and
system of promotion and policy enforcement (Holt, 2013). In the Western world, the majority of
these cybercrime websites take the form of online forums, which are boards organized according to
different themes, under which users can post messages, often offering to buy or sell a certain illicit
product (Yip, 2010).
The forums allow both buyers and sellers to find what they want. Identity fraud and crimeware
(malware and other illicit software) products and services tend to be popular markets for visitors.
Offenders who have accumulated a surplus stock of stolen credit card credentials, too many to
cash out all by themselves, can offer to sell them to buyers on such forums, often for as little as a few
dollars per stolen identity (Holt & Lampke, 2010). Credit card information is priced according to
how much information about the victim was stolen, which may be just enough to shop online for free
on the low end and enough to clone the card outright and make ATM withdrawals on the high end
(Holt, 2013).

1 Washington State University, Spokane, WA, USA

Corresponding Author:
Alex Kigerl, Washington State University, 1017 W Maxwell Ave., 2, Spokane, WA 99201, USA.
Email: alex.kigerl@wsu.edu

Crimeware is also both a popular product and a service (Gaspareniene & Remeikiene, 2015).
Crimeware is software used to facilitate a crime. Buyers can purchase their own malware, often with
intuitive and user-friendly user interfaces, to infect computer resources or scam victims out of their
money. Buyers can also simply rent or buy already hacked and compromised computers or services,
which they can use for their own purposes. Crimeware can be used to gain unauthorized access to
electronic resources, send spam, install spyware, or even take websites and servers offline, making
them unavailable.
Much of the research on cybercrime carding forums has been qualitative and observational (Garg,
Afroz, Overdorf, & Greenstadt, 2015; Gaspareniene & Remeikiene, 2015; Holt, 2013; Holt &
Lampke, 2010; Mikhaylov & Frank, 2016; Soudijn & Zegers, 2012; Yip, Shadbolt, & Webber,
2013; Yip, Webber, & Shadbolt, 2013). There have also been some quantitative analyses of carding
forums using network analysis (Motoyama, McCoy, Levchenko, Savage, & Voelker, 2011) and
predictive models to identify top sellers (Benjamin, Li, Holt, & Chen, 2015; Li & Chen, 2014).
There have also been quantitative methods used for clustering of carding forum members into
categories (Li, Chen, & Nunamaker, 2016; Li, Yin, & Chen, 2016). The methods used are termed
“topic modeling,” which groups text-based documents into distinct groups based on their similarity
to each other and differences from other groups. Topic modeling has been used to group carding forum users
according to the contents of the messages they leave. However, the topic modeling of carding
forums so far has been exclusively quantitative, attempting to identify top sellers and differentiate
categories of both communities and participants. Little time is devoted to qualitatively describing
the different topics once they are computed, and less time is devoted to addressing topics for users
who are not top sellers.
The present research seeks to fill this gap by combining qualitative and descriptive research
goals of the past with quantitative clustering methods previously used for instrumental purposes.
Past research has been almost exclusively qualitative and observational or quantitative and instrumental.
The present research intends to combine the two by conducting a topic model analysis of
the merged samples from three different carding forum websites, consisting of 30,469 users. Topic
modeling will be applied in order to assign all users to separate but nonmutually exclusive cluster
categories. Following cluster assignment, each topic will be carefully examined to produce qualitative
descriptions and observations about each cybercriminal archetype. The purpose is
an exploratory approach with both quantitative and qualitative elements to further shed light on
these underground communities. The results should illuminate the different characteristics and
themes based on analysis of every comment posted for the entire population of all users from all
three carding sites.

Literature Review
Cybercrime has become a lucrative economic venture, attracting many participants in a growing
underground marketplace. Gone are the days of curious hackers and hobbyists engaging in both
white hat and black hat cybercrimes. In the Western world, many of these burgeoning economies
exist on online forums, often with a hierarchy, division of labor, and system of promotion and policy
enforcement (Holt, 2013). In many countries in the East, including China, online cybercrime markets
take different forms, such as social networks. Forums, on the other hand, are a type of website
where registrants can create and reply to publicly visible posts, offering to buy goods, sell products
or services, or collaborate in mutually beneficial business ventures. Visitors can then reply underneath
a given posting that interests them, starting a conversation thread. Posts are sorted on the web
page in descending order according to the recency of post activity within each thread.
Why is it that forums are such popular marketplaces for participating in cybercrime? Among the
reasons are mechanisms of formal control and coordination, social networking and building
contacts, identity uncertainty mitigation, and quality uncertainty mitigation (Yip et al., 2013). Such
forums are hierarchically structured, with administrators on top who own the website, moderators
tasked with running specific boards assigned to them, verified and trusted sellers who have earned
more trust and therefore more business, lower tier sellers, and of course, general customers.
Cybercrime markets do not have to be organized like this. In China, cyber-offenders rely on a
more decentralized network of websites where anyone can post and reply with little risk of being
banned outright by moderators (Yip, 2010). These types of underground economies offer fewer
protections to would-be buyers against exploitation but also offer better protection against
law enforcement shutting the website down entirely. However, in the West,
forums dominate.
A popular mechanism cybercrime forums offer is the transparency of economic exchanges and a
regulated review system. Sellers will often create their own post describing their stock and pricing
(Holt, 2013). Interested buyers will exchange contact information and finalize transactions in private.
Buyers are encouraged by the community to leave reviews for sellers, usually in the originating
thread that the seller created. Quality sellers are given good reviews from their customers, and so can
acquire trusted vendor status and more business due to their high status within the community.
Not all sellers are to be trusted, however. Many sellers do not intend to sell anything at all and
attempt to gain an advance payment from a buyer for a product or service which they have no
intention of delivering (Yip et al., 2013). These fraudsters are termed “rippers” by the community,
and they are an example of cybercriminals preying on fellow cybercriminals. Negative
reviews and reports on “rippers” eventually result in their termination from the forum. They can,
however, reregister under a new username, so buyers and moderators must remain vigilant
at all times.
The transactions usually rely on electronic money wiring services and electronic currency (Garg
et al., 2015). The forum communities will also vary in size considerably. Smaller forums tend to be
more specialized in a narrower market, such as the sale of stolen credit cards. Larger forums will
have many boards, some of which are not intended to facilitate an illegal market but rather to discuss
the news or world events, providing a supplement to the usual illegal business offerings available.
Among the goods traded are identity fraud-related products (Holt & Lampke, 2010).
Sellers may offer stolen credit cards with varying degrees of information about the victim credit card
holder, which are priced accordingly. The minimum amount of credit card information
would be enough to make purchases online. More detail might be offered about the victim, such as
date of birth and zip code, which can be used to take out credit lines in the victim’s name. There
could also be enough data to clone the card outright, which can be used in restaurants or at ATMs
(Holt & Lampke, 2010).
Crimeware products and services are also a valued commodity within cybercrime markets
(Gaspareniene & Remeikiene, 2015). Crimeware is software used to facilitate illicit behaviors, such as
malware or spyware. Sellers might offer the source code for a malware product itself, which the
buyer can modify himself or herself as needed, or compiled binaries of ready-to-use malware with
intuitive graphical user interfaces (GUIs), making them more user-friendly. Crimeware can also be
used for services, such as offering to deliver distributed denial of service (DDoS) attacks against a
website of the buyer’s choosing, taking it offline.
Other services are available to convert illicit goods into cash, termed cashout services (Mikhaylov
& Frank, 2016). Participants who possess stolen credit card data can transmit the information to a
contact in a country with low enforcement against cybercrime who can make bank withdrawals or
online purchases. The funds or products are then liquidated and a portion sent to the accomplice
overseas. Money can also be laundered via online gambling websites, as any winnings can be
claimed as gambling income (Mikhaylov & Frank, 2016). Sometimes, the cashout service participant
is a victim themselves, not realizing the money they are relaying back to the offender is illegally
acquired (Soudijn & Zegers, 2012). These victims are termed “mules,” as they take on all of the legal
risks and make little if any profit.
Much of our knowledge on carding forums comes from qualitative investigations and observations
of the user activity and posts publicly available on such forums, which are manually read and
interpreted. However, there is a growing number of quantitative analyses of carding forum data.
Much of this research is applied and practical, intended to identify top sellers for law enforcement
assistance purposes. One such method has relied on a web crawler to extract user comment data from
online cybercrime forums for analysis (Benjamin, Li, Holt, & Chen, 2015). Web crawlers are
software programs designed to scan web pages and extract useful information from them. Benjamin
et al. (2015) have used such a crawler to extract user posts from carding forums and identify top key
words used by higher status malware and identity fraud sellers.
Another application to identify top sellers was conducted by Li and Chen (2014), who designed a
machine learning algorithm to automatically apply the judgment of a user’s posts in place of a
human coder. The authors used both supervised machine learning and sentiment analysis. Supervised
machine learning is the use of already labeled data (labeled by a human rater) which an
algorithm learns from so that it can apply its own ratings on unlabeled data, such as to identify new
top cybercrime sellers. Sentiment analysis was also used to measure positive feedback that buyers
leave, to further tease out top sellers.
There have been some inferential quantitative analyses of carding forums as well, in addition to
applied research. The research relies on network analysis, which explores how cybercriminals form
networks on carding forums measured by mutual posts in each other’s threads and private messages
exchanged between members (Motoyama et al., 2011). Via these techniques, the relationships
offenders form in cybercrime communities are illuminated.
In addition to examining the connections that link offenders together, researchers have used
quantitative methods to distinguish cybercriminals from each other by assigning them to separate
categories. The technique is termed clustering, which is an unsupervised machine learning technique
that groups cases into categories based on their similarity with other members of the category and
separateness from other categories. The methods are referred to as unsupervised because the
researchers do not know a priori what the correct labels are; they are to be discovered in the data.
Clustering has been used to categorize countries according to cybercrime activity (Kigerl, 2016),
but has also been applied to forums. One such study used X-means clustering to categorize
users of hacking forums (Abbasi, Benjamin, Hu, & Chen, 2014). X-means clustering is hard clustering,
which assigns cases to mutually exclusive categories. The results found four groupings of users:
black market activists (those discussing cybercrimes), founding members, technical enthusiasts
(highly skilled at hacking), and average users (novice and nonhackers).
Another clustering technique is latent Dirichlet allocation (LDA), a type of topic modeling, which
is soft clustering assigning cases to nonmutually exclusive categories. Topic modeling has been used
to classify malware and tutorial related content on hacking websites (Samtani, Chinn, & Chen,
2015). Tutorials are instructions that users write for other forum members, usually on how to engage
in certain cybercrime activities. The researchers identified 13 different categories for these types of
hacking forum resources.
Topic modeling has been applied to carding forums as well, in addition to hacking forums (Li,
Chen et al., 2016; Li, Yin et al., 2016). However, this research has been instrumental research
intended to develop tools to identify top sellers. The research has not attempted to explore the
qualitative differences between the topics after the quantitative topic modeling is applied. For the
purposes of this research, quantitative topic modeling methods will be applied to all carding forum
member data. Following that, however, qualitative interpretation and examination of the separate
topics will be explored to facilitate a better intuitive understanding of the different archetypes found
on carding forum websites. The purpose will be to use formerly instrumental/quantitative methods in
order to conduct qualitative and observational analyses to expand the existing literature.

Method
User posts and comments from three carding forum websites were downloaded and aggregated by
unique username. The websites selected were CSU.su, CardersForum.se, and BitsHacking.com.
Most users left more than one comment, and so multiple comments were concatenated per user and
stored as a single text-based column. The analysis chosen was topic modeling, which is used to
cluster and classify documents into nonmutually exclusive categories. The documents, in this case,
are all comments posted per user among the three carding forum websites. Following the exploratory
cluster category assignment, individual example users from each topic were examined to
identify the unique themes and typologies captured by the software.

Data and Sample


Data were acquired from three online carding forums. Forums were located via a series of web
searches using key terms commonly used by such forum members (cvv, dumps, fullz, drops, ripper,
etc.). Of the search results, only forums were examined to assess whether they met the inclusion
criteria. While there are other mediums where cyberoffenders congregate, only forums were
sampled, as such websites present uniform and structured data that are easy to acquire via a web
crawler. Among the inclusion criteria for using the forum data was a size requirement: each forum
must have possessed at least one thousand registered members. Forums often have a members
section with an index to browse all registered members.
Each forum was also required to have active sections where users can both buy and sell stolen
credential goods (credit cards, bank logins, etc.). Each forum must have also had other cybercrime
business outlets such as botnet rentals, spam services, and malware dealers, in addition to
credential goods sales. Finally, the primary language of each forum was required to be
English, as the topic modeling software was designed to remove stop words and perform word
stemming based on the English language (Hornik & Grün, 2011).
Three forum websites were selected and custom Python web crawlers were written to extract data
from each of them. Each crawler was designed to start at the main forum portal where all boards/
sections were listed. The crawler extracted all board links, saving them to a list, and accessing each
one in sequence. Each board contained an index listing all pages of posts users had submitted. The
crawler extracted the index length, so that it could crawl each page of posts one-by-one. On each
page of posts, all posts on the page were crawled. Posts may be composed of long threads consisting of
many pages of user comments, also with an index listing the length of thread pages. Each of these
pages was also crawled.
For each page of posts, the crawler identified the number of user comments and extracted the date
of each comment, the username of the commenter, and the text-based comment posted. These values
were inserted into a spreadsheet as a single row, one row per each comment in every thread on the
website.
The three crawlers were deployed at slightly different times. The crawler for CSU was deployed
on March 31, 2016. Crawler 2 was launched on April 24, 2016. A third and final crawler was run on
December 19, 2016. Crawlers 1 and 2 ran for approximately 1 week before completion. Crawler 3
took about 2 weeks to crawl all pages on BitsHacking.
Three data sets were produced, one for each website, the unit of analysis being a user comment.
Registered members of each forum could, of course, continue to make posts during the time the
websites were being crawled, which might shift some page index positions and thread orderings
during this time. The result would be some duplicated user comments. Each data set was deduplicated
to remove repeat user post entries in the data sets.
For the planned analyses, all three website data sets were merged and then aggregated by forum
username. There were 200,105 user posts in total. All posts made by each user were concatenated
into a single string, separated by spaces, and included in a column, one per each unique username.
There were 30,469 users total, 4,095 from CSU, 2,934 from CardersForum, and 23,440 from
BitsHacking. Since most of the users in the total combined sample were from BitsHacking, analyses
would be biased somewhat toward this larger website.
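The deduplication and per-user aggregation steps described above can be sketched roughly as follows. This is a minimal illustration, not the study's own script: the row layout (date, username, comment), the rule that a duplicate is an exact repeat of all three fields, and the sample usernames are all assumptions made for the example.

```python
from collections import defaultdict

def aggregate_by_user(rows):
    """Drop exact duplicate rows, then join each user's comments with spaces."""
    seen = set()
    comments = defaultdict(list)
    for date, username, text in rows:
        key = (date, username, text)
        if key in seen:  # repeat entry, e.g. from a page index that shifted mid-crawl
            continue
        seen.add(key)
        comments[username].append(text)
    # One concatenated "document" per unique username.
    return {user: " ".join(texts) for user, texts in comments.items()}

# Hypothetical rows: (date, username, comment text)
rows = [
    ("2016-03-31", "alice", "selling fresh cvv"),
    ("2016-03-31", "bob", "need help please"),
    ("2016-03-31", "alice", "selling fresh cvv"),  # duplicate of the first row
    ("2016-04-01", "alice", "icq me"),
]
documents = aggregate_by_user(rows)
```

Here `documents` maps each username to a single string, which is the unit the topic model treats as one document.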
The average number of posts per user was 6.57. It should be noted that this number reflects users
who left at least one comment only. The crawlers did not account for registered forum members who
never posted a comment, as the member indices were not crawled, only user threads.

Measures
Topic modeling is a text mining procedure that can analyze the raw key word frequencies in the data
it is meant to process. Therefore, instead of requiring a short list of intuitive, theory-driven variables,
each word appearing at least once in any comment left by members of the sample became a variable,
represented as the frequency count of the number of times the word was used by a given user. This
transformation of the data is considered a bag-of-words method, converting textual data sources into
a term frequency matrix, where a row is a unique user, and a column represents the frequency with
which each user has used a given word, with one column per word.
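As an illustration of the bag-of-words transformation, the sketch below builds a small term frequency matrix with one row per document and one column per word. The function name and the two toy documents are hypothetical, and whitespace tokenization is an assumption; the study itself built its matrix in R.

```python
from collections import Counter

def term_frequency_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    counts = [Counter(text.split()) for text in documents]
    # Every word that appears at least once becomes a column.
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

# Two toy "users", each represented by their concatenated comments.
docs = ["sell cvv sell dumps", "need cvv"]
vocabulary, matrix = term_frequency_matrix(docs)
# vocabulary is ['cvv', 'dumps', 'need', 'sell']
# matrix is [[1, 1, 0, 2], [1, 0, 1, 0]]
```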
Prior to conversion into this matrix, stop words were removed from the data set. These include
common function words (but, and, for, etc.) that would no doubt have a high frequency of occurrence
across all groups but would not likely distinguish differences among users. Words were also stemmed,
reducing them to their root. This process removes suffix characters from the ends of words with
more than one morphology, such that words such as “offending” or “offended” become “offend.”
Punctuation was also removed from the training corpus of user comments.
Finally, any hypertext markup language (HTML) was removed from the user comments. Users
can apply rich-text based formatting and hyperlinking to the content of their comments, which the
crawler would perceive as HTML. HTML is one of the languages used to design web-based content,
represented as tags contained by angle brackets (<br/>, <p>, etc.). Anything appearing in the
comments between an opening angle bracket (<) and a closing angle
bracket (>) was removed from the text.
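The cleaning steps above (tag removal, punctuation removal, stop word removal) might look something like the following sketch. Several details are assumptions rather than the study's settings: the stop word list is a tiny illustrative one, text is lowercased, and tags are replaced with a space so adjacent words do not run together. Stemming is left out because it would need a dedicated stemmer.

```python
import re
import string

# Tiny illustrative stop word list; the list actually used is not given in the text.
STOP_WORDS = {"but", "and", "for", "the", "a", "to"}

def clean_comment(text):
    """Strip tag-like spans, lowercase, drop punctuation and stop words."""
    text = re.sub(r"<[^>]*>", " ", text)  # anything between < and >
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = clean_comment("Fresh <b>CVV</b> for sale,<br/>contact me!")
# tokens is ['fresh', 'cvv', 'sale', 'contact', 'me']
```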
After the cleaning process, the term matrix was built, yielding a total number of 30,469 rows
(users) and 113,657 columns. Each column represented a unique word, which was found at
least once in the set of all comments. Columns represent the frequency with which each user
uses the word.

Analytic Plan
The analysis of the data chosen was a document clustering task. Documents in the context of this
analysis refer to users of carding forums, where the text associated with each user includes all
comments made by the user at the time data were downloaded from the website. The goal of
document clustering is to assign documents to categories based on the type of key words appearing
in each document/used by each user. Clustering is an unsupervised technique, meaning that the
researcher does not know beforehand what the correct cluster categories are. Cluster categories
are discovered in the data based on how similar users are to other users assigned to the same
cluster category.
The specific method used was LDA with a Gibbs sampler (Blei, 2012). LDA performs soft
clustering, meaning that cluster categories assigned are not mutually exclusive. That means one
user can be a member of more than one category. Category assignment is probabilistic, with each
user assigned to each of the k categories in such a way that the probabilities sum to 1.0 for each user.
LDA is titled latent Dirichlet allocation because it estimates latent constructs (the topics) while
assuming the category probabilities follow a Dirichlet distribution, which is a distribution over
distributions (documents distributed over topics, with topics distributed over words).
LDA works iteratively to construct two estimates: the distribution of words over topics and the
distribution of topics over documents (Blei, 2012). The model begins by randomly assigning documents
to a fixed K number of topics using a bootstrapping technique. This first assignment is based
on a random guess. After assignment, the probability that each word belongs to a given topic is
calculated. The probability is calculated based on how frequently the word appears in one topic from
the documents assigned to it and also based on how infrequent the same word is in other, competing
topics. These observations are then used to update the probability of documents belonging to topics
such that the next iteration of sampling documents into topics is not random but based on what was
learned from the preceding iteration. This process is repeated, updating the probabilities each time
until the algorithm arrives at convergence. Convergence is determined once probabilities cease to
improve sufficiently with repeated iterations. This process is referred to as Bayesian inference.
Analysis using these techniques was performed in R using the “topicmodels” software package
(Hornik & Grün, 2011).
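For readers who want to see the iterative updating in code, below is a minimal sketch of a collapsed Gibbs sampler for LDA. It is not the "topicmodels" implementation. In this common formulation each word token, rather than each whole document, is given a topic, and the document-level topic probabilities are derived from those token assignments. The hyperparameters `alpha` and `beta`, the fixed iteration count (used here in place of a convergence check), and the toy documents are all assumptions.

```python
import random

def lda_gibbs(docs, k, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA on docs given as lists of word ids.

    Returns theta, a list of per-document topic probability rows.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * k for _ in docs]                # tokens in doc d given topic t
    topic_word = [[0] * vocab_size for _ in range(k)]  # times word w was given topic t
    topic_total = [0] * k                              # tokens given topic t overall
    assign = []
    # Start from a random topic for every word token.
    for d, doc in enumerate(docs):
        z = [rng.randrange(k) for _ in doc]
        assign.append(z)
        for w, t in zip(doc, z):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight each topic by how common it is in this document and
                # how common this word is within the topic.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta) / (topic_total[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    # Smoothed per-document topic probabilities; each row sums to 1.
    return [
        [(n + alpha) / (len(doc) + k * alpha) for n in row]
        for row, doc in zip(doc_topic, docs)
    ]

# Three toy documents over a four-word vocabulary (word ids 0-3).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta = lda_gibbs(docs, k=2, vocab_size=4)
```

Each row of `theta` plays the role of one user's row in the user-by-topic probability matrix.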
LDA does not identify the correct number of topics automatically. The method uses a fixed,
prespecified number of topics on which to estimate. In order to determine the optimum number of
topics to use, multiple models must be run starting from a minimum of two topics to some maximum
number of topics to explore. Measures of model fit and performance would then be calculated for
each model and the number of topics from the highest performing model would be selected.
Five measures of fit were used in this analysis to determine topic size. Four of the five fit metrics
use internal fit methods, assessing the similarity of documents assigned to the same topics and the
separateness and distance of each topic from all the alternate topics (Arun, Suresh, Madhavan, &
Murthy, 2010; Cao, Xia, Zhang, & Tang, 2009; Deveaud, SanJuan, & Bellot, 2014; Griffiths &
Steyvers, 2004). The fifth metric is a cross-classification technique termed perplexity. Perplexity
measures how well a set of probabilities recreate or fit an existing data source (Brown, Pietra,
Mercer, Pietra, & Lai, 1992). A topic model is generated based on one portion of the data set, which
yields key word and topic probabilities that are used to classify documents from a different portion
of the data set. A low perplexity score indicates that the pretrained probabilities recreate the original
topics well.
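Perplexity is conventionally defined as the exponential of the negative average log-likelihood per word token, and that definition can be sketched as below. The argument names `theta` (topic probabilities per document) and `phi` (word probabilities per topic) are labels chosen for this example, and the step of estimating `theta` for held-out documents is omitted; the R package's own calculation may differ in detail.

```python
import math

def perplexity(docs, theta, phi):
    """Exponential of the negative mean log-likelihood per word token.

    docs: lists of word ids; theta[d][t]: P(topic t | doc d);
    phi[t][w]: P(word w | topic t).
    """
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Probability of the word under the document's mixture of topics.
            p = sum(theta[d][t] * phi[t][w] for t in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

# A model that spreads probability evenly over 4 words is as "perplexed"
# as a uniform guess among 4 options.
uniform_phi = [[0.25] * 4, [0.25] * 4]
score = perplexity([[0, 1, 2, 3]], [[0.5, 0.5]], uniform_phi)
```

A model that concentrates probability on the words that actually occur scores lower than this uniform baseline.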
Perplexity was calculated via 5-fold cross-validation for each model. Fivefold cross-validation
partitions the data set into five equal parts and trains a topic model on all but one part (Kohavi,
1995). The partition that was excluded is then used for testing and is the data set from which the
perplexity score is calculated. This process is repeated 5 times, each time leaving out a different
partition for testing. The five perplexity scores are then averaged per each of the models.
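The partitioning logic of k-fold cross-validation can be sketched as follows. The shuffling step and the seed are assumptions (the text does not say how documents were allocated to partitions), and the helper name is hypothetical.

```python
import random

def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists; each item is held out exactly once."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k near-equal partitions
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

splits = list(k_fold_indices(10, k=5))
```

In the procedure described above, a topic model would be fit on each `train` set, perplexity computed on the matching `test` set, and the five scores averaged.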

Results
LDA modeling was conducted to distribute topic category probabilities of the 30,469 user comment
documents according to 20 separate topics. Manual inspection and reading of example user posts
based on topic assignment was carried out to define the semantic meaning of each topic and how to
differentiate them from one another. A number of users were not assigned a category by the
algorithm, and these cases were manually examined as well, as they were considered to be their
own type of topic. Results indicate users could be broken out into categories based on buyer and
seller markets as well as revealing a free content market section and some additional miscellaneous
attributes of cybercrime forum participants.

Model K Selection and Fit Testing


The number of K topics to cluster on must be decided prior to running the final model. Fit testing and
cross-validation were performed for 34 separate model specifications, starting with an initial model
of two topics, and monotonically increasing up to a maximum of 35 topics into which to classify users.
Five fit metrics were applied to each of the 34 models to assess the top performing specification for
K. The results of the fit metrics can be located in the Appendix in Table A1.
Two maximization fit metrics were used: Griffiths and Steyvers (2004) and Deveaud et al. (2014).
Maximization means that a higher numerical fit score implies a better fit. Three minimization fit
metrics were also included: Cao et al. (2009), Arun et al. (2010), and perplexity. For these metrics,
lower scores indicate a better fit to the data. For the maximization metrics, Deveaud et al. (2014) peaks at
topic number nine, then gradually declines. Griffiths and Steyvers (2004) begins to slow in its ascent
at around topic number 18. For the minimization metrics, Cao et al. (2009) becomes flat at K = 19,
and Arun et al. (2010) becomes flat at K = 24. For perplexity, improvements show diminishing
returns after 20 topics. Based on these observations, a K of 20 topics was selected for the subsequent
final topic model.

Topic Classification and Interpretation


An LDA model using a Gibbs sampler was applied to the 30,469 user comment documents, with a
K of 20 topics specified. The results yielded two different data sets: a user by topic matrix
indicating the probability that each user belonged to each of the 20 topics, and a matrix of
20 columns of key word probabilities, 1 column per each topic, containing the top most likely
words pertaining to each topic.
The matrix of user-topic probabilities consists of rows representing users and 20 columns
representing the topics. Cell values are probabilities that a user belongs to the given topic. Probabilities
are assigned such that rows sum to 1.0. If a user’s comment history is not indicative of any given
topic, the probability of each of the 20 topics due to chance would be .05. To create a threshold of
certainty from which to assign a user to a topic, a cut-off value of .1 was selected, which would be
twice the probability due to chance for belonging to a topic.
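The cut-off rule can be illustrated with a short sketch. The example probability row is invented, and treating the cut-off as inclusive (a probability of exactly .1 counts) is an assumption, since the text does not say how ties at the threshold were handled.

```python
def assign_topics(prob_row, cutoff=0.1):
    """Return 1-based topic numbers whose probability meets the cutoff."""
    return [t + 1 for t, p in enumerate(prob_row) if p >= cutoff]

k = 20
chance = 1 / k  # 0.05 when a comment history favors no topic

# Hypothetical user: strong on topics 5 and 7, the rest spread evenly.
row = [0.02] * k
row[4], row[6] = 0.40, 0.24
assigned = assign_topics(row)        # [5, 7]

# A user whose probabilities all sit at chance is assigned to no topic.
unassigned = assign_topics([chance] * k)  # []
```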
Based on this cut point, the highest number of topics assigned to a given user was six, and the
lowest was, of course, zero. With categories assigned, the next step was to identify the intuitive
meaning separating members of one topic from all the rest. This step required a qualitative approach
to reading the top key words per topic as well as the comment history and context of users belonging
to each topic. Users that could not be assigned to any topic were also examined, as this could be
considered an important grouping by itself.
The results of this qualitative approach can be found in Table 1. There were 21 categories of
users, 20 according to the assigned topics and 1 including nonassigned users who themselves
appeared to belong to their own topic. The 21 categories were qualitatively interpreted to belong
to 6 larger groupings, which consisted of a cybercrime market customer base of 4 topics, an identity
fraud market consisting of 5 topics, a crimeware market of 3 topics, an additional market of 2 topics,
a free content market of 3 topics, and an additional remaining grouping of other topics, with 4
categories of user attributes. Table 1 displays the topic classifications and their associated top 10 key
words per topic.

Table 1. Carding Forum Member Topics.

Topic                                       Top 10 Key Words

Customer base
1. General customers                        can, need, help, want, site, please, icq, cards, plz, contact
2. Satisfied customers                      good, try, work, lol, give, great, really, love, hope, mate
3. Dissatisfied customers                   just, know, get, card, dont, will, like, shit, people, waiting
4. Exploited customers                      dont, guy, ripper, see, will, forum, now, icq, just, sent
Identity fraud market
5. Credit card credentials vendors          time, fresh, cvv, country, sell, dob, card, bins, city, fullz
6. Cloned credit card vendors               dumps, good, sell, give, western, union, need, pcs, amex, 100 pcs
7. Banking credentials vendors              can, need, bank, icq, transfer, account, contact, cashout, interested, accounts
8. Region-specific vendors                  declined, working, USA, Richmond, card, bank, Virginia, capital, Delaware, Wilmington
9. Document forgery services                admin, need, support, like, quality, best, passport, psd, templates, free
Crimeware market
10. Spam services and products dealers      code, scam, rdp, email, mailer, yahoo, stock, smtp, cpanel, icq
11. Malware services and products dealers   use, account, paypal, can, software, card, get, will, fud, vpn
12. Botnet services and products dealers    icq, seller, dumps, logs, code, contact, zeus, USA, botnet, sell
Other markets
13. Proxy vendors                           live, checked, blacklist, France, yes, unknown, states, united, delete, socks
14. Cashout services                        will, can, escrow, card, buy, price, ship, selling, drop, payment
Free content market
15. Free content distributors               posted, method, originally, will, ebook, email, free, download, comment, rep
16. Free credit card consumers              thank, much, great, info, password, information, fullz, can, thanks, brother
17. Free tutorial and crimeware consumers   thanks, bro, nice, share, man, sharing, dude, working, tut, thanx
Additional topics
18. Non-native English speakers             United States, dude, newbie, personal, USD, gan, mail, Salam Bank
19. Ongoing thread participant              posted, share, thanks, can, thread, guys, bro, bump, welcome, enjoy
20. Private message exchange                send, please, repped, thanks, done, added, thx, pls, waiting, link, email
21. Transient users                         No topic identified

Customer Base
The customer base consists of four topics: general customers, satisfied customers, dissatisfied
customers, and exploited customers. These are the users that predominantly buy goods and services
from other members on the forum. The first category of general customers, or Topic 1, consists of
mostly impartial requests for products or the exchange of contact information in order to facilitate a
transaction. Forums often have specified sections for buyer requests. The key word “icq” appears in
the top 10 key word list for this topic. ICQ is an instant messaging program which forum members
can use to communicate with each other without leaving a record of their conversation publicly on
the forum itself.
Topic 2 consists of satisfied customers. Customers participating in these forums are encouraged
to leave feedback on each seller they do business with. Sellers who deliver quality products and
timely service can gain a reputation as a trusted seller from positive customer reviews. Sellers will
often create a new post on the forum, stating the availability of their wares. Potential customers
exchange contact information and do business off the forum record. If buyers like the product, they
will return to the original thread and post their positive review in it, which both increases the
reputation of the seller and keeps the thread alive. Because threads are sorted according to how
recently a new post was made, this further advertises the seller’s product line.
Topic 3 includes dissatisfied customers who leave negative reviews of the seller’s products or
services. Buyers may be dissatisfied with a number of issues associated with the seller. The seller
may have a replacement policy that they do not honor in a timely manner. Many sellers will offer to
replace, free of charge, stolen credit card credentials a buyer purchases if the card is canceled before
the buyer can use it to make purchases or withdraw money. Buyers may also receive many canceled cards
that are no longer valid, even though sellers are expected to check the cards before selling them, often by
making a small online purchase of less than one dollar (a digital song, for example). If the
transaction is successful, the card is considered valid as a product. Top sellers are also expected
to be available on ICQ at almost all times of the day, or at least respond very quickly, regardless of
differing time zones.
Topic 4 includes exploited customers. These are also dissatisfied customers, but unlike the Topic
3 dissatisfied customers above, these customers leave negative reviews for illegitimate sellers,
known as “rippers.” Rippers are not sellers at all but rather fraudsters who prey on other members.
The negative reviews these users leave against rippers eventually result in the rippers being banned.
Rippers can be a serious problem for forum communities, so there are often substantial efforts to
moderate, identify, and remove users who disobey forum policy. There are often many discussion
threads debating ways of better mitigating the disruption that rippers impose on the community.

Identity Fraud Market


From the seller’s side, there were five categories of sellers in the identity fraud market. They
included credit card credentials vendors, cloned credit card vendors, banking credentials vendors,
region-specific vendors, and document forgery services. For credit card credentials vendors (Topic 5), the top
10 key words include terms like “cvv,” which refers to basic credit card information that can be used
to make online purchases. The key word “fullz” also appears, which refers to cvv information plus
personal information about the victim, such as date of birth, social security number, and other data
that can be used to create credit accounts in the victim’s name.
Topic 6 consists of cloned credit card vendors, from which buyers can purchase enough infor-
mation from a skimmed credit card to create their own cloned card, or order a premade plastic card
from the vendor himself or herself. These can be used for ATM withdrawals or to go shopping in
physical stores. “Dumps” and “pcs” are the slang terms for these kinds of products, and “100pcs”
represents 100 credit cards for which the seller is naming a price. Western Union, a money wire
transfer service used as a method of payment, is also mentioned.
Topic 7 includes banking credentials vendors, which can provide either stolen online bank
account logins, so the buyer can attempt to withdraw money on their own, or cashout services,
where the seller offers to turn existing logins, or “logs,” into cash, which is split between the two
parties. Topic 8 includes region-specific vendors who can sell any of the products listed above but
tailor their products to specific locations, in case the buyer needs a credit card of a victim in a
specific region, country, or state.
The final identity fraud market seller is Topic 9, which consists of those providing
document forgery services. The documents are often either state IDs or passports. Sellers may
merely sell the template of a document, such as a Washington State driver’s license template, so the
buyer can forge the remainder. The sellers may also forge the documents themselves.
The key word “psd” is a Photoshop document file extension, which is the form templates can take,
where the template can be further edited in Photoshop.

Crimeware Market
There were three types of dealers identified in the crimeware market. Crimeware consists of soft-
ware used to facilitate a crime. The three categories of dealers included spam dealers, malware
dealers, and botnet dealers. Topic 10 was the spam services and products dealers group, which were
sellers who offered either their services of sending spam to advertise on a business partner’s behalf
or the tools and resources which buyers could use to send their own spam. “Mailer” refers to
software or a server which can be used to send high volume email. “Stock” refers to how much
of a product a seller has available (such as spam servers). “SMTP” and “cpanel” refer to servers from
which to send spam, and often the servers are hacked and illegally acquired. “Yahoo” refers to
compromised Yahoo email accounts which contain contacts the spammer can mail, posing as the
legitimate Yahoo email account owner. “RDP” stands for remote desktop protocol, which can be
used as a type of spyware but can also be used to manage servers.
Topic 11 includes malware services and products dealers. These sellers can offer exploit packs,
which are applications that can exploit new and existing vulnerabilities in computer resources to
gain unauthorized access. Sellers can offer hacking services they provide themselves, but they can
also sell existing malware, keyloggers, password stealers, and “fud” viruses, which stands for fully
undetectable. Undetectable refers to new malware that cannot be identified by antivirus programs
yet. Achieving FUD status usually requires confirming that the malware is undetectable by dozens of
different AV programs, often by use of online services which scan files in such a way, such as
VirusTotal (https://www.virustotal.com). These sellers also offer DDoS software and services,
which can be used to bring servers and websites offline.
The final crimeware seller is described by Topic 12, representing botnet services and products
dealers. Botnets are clusters of hacked computers or servers, illegally acquired, and remotely
controlled by the botmaster who issues commands to them individually or as a unified collective.
Sellers in this field can rent out or sell some of their bots or sell the software which installs or
manages a botnet. Zeus is a popular botnet software application that is sold in such
markets, known for its user-friendly and easy-to-understand GUI.

Other Markets
Two remaining topics belong to a category on their own or are otherwise unclassified. These include
proxy vendors and cashout services. Topic 13 consists of proxy vendors who are users that sell
access to proxy servers. Internet traffic can be routed through proxy servers so that an attack
or threat appears to originate from the proxy rather than the perpetrator’s own host or Internet
service provider address. Vendors present the servers available, their geographic locations, and their
status. “Live,” “checked,” “blacklist,” “delete,” and “unknown” refer to the status of listed servers. Vendors
will make an initial post listing their products and will continually edit the post to indicate if a server
is still available online, or whether it is offline or blacklisted by antivirus programs. “Socks” is the
slang term for these proxies.
Topic 14 includes cashout services, which are sellers offering to convert credential goods into
cash, splitting the proceeds between buyer and seller. “Drop” or “drops” is the slang term referring to
these sellers. Those providing drops often live in countries with relaxed anti-cybercrime laws and
low international cooperation and enforcement tendencies. Sellers will receive the credentials from
the buyer, perhaps stolen credit cards, convert them into cash or products (electronics purchased
online), and then forward part of the money or ship the products to the buyer.

Free Content Market


Most large carding forums have free content sections where users can post free products (crimeware
and credential goods) and information on how to commit various crimes such as how to perpetrate
fraud or successfully hack computers (often termed tutorials or “tuts”). There were three groups of
users in this market: free content distributors, free credit card consumers, and free tutorial and
crimeware consumers. Topic 15 described the free content distributors. These are users who spend
a lot of time producing or providing free content. Some of their content might be in the format of a
short e-book they have written. “Rep” refers to distributors asking consumers for positive reviews or
thanks, which are used to build the distributor’s reputation.
Topic 16 describes free credit card consumers. These are users who focus on using the free
samples and batches of credential goods posted by distributors, which can include any type of digital
credential good (cvvs, fullz, and logs). In this case, “password” refers to online banking passwords.
Finally, Topic 17 describes the second category of consumer, the free tutorial and crimeware
consumers. These users read the many tutorials and e-books available in order to learn how to
commit their own cybercrimes. They also download any free malware and other crimeware that
they might use for themselves.

Miscellaneous Topics
There were four additional categories of users that did not belong to any of the obvious markets
above. These consisted of non-native English speakers, ongoing thread participants, private message
exchange participants, and transient users. Topic 18 was the non-native English speaking user group.
These users were often either buyers or sellers or both, as non-native English speaking status is an
attribute rather than a role. These users were obviously participating in English speaking forums, so
most of the top 10 key words identified for this group are words that are in English. Yet, the
algorithm was able to isolate them regardless. These posters would occasionally say words in their
native languages, which tended to be Turkish, Malay, Indonesian, or Russian.
The next two topics, Topics 19 and 20, tended to capture a type of dialog used to facilitate certain activities
rather than the words used by a specific group of users. These included ongoing thread participants
and private message exchange participants. Ongoing thread participants were participants in very
long threads, both long in terms of the number of posts and long in terms of the age of the thread. The
top 10 key words describe the type of language used for keeping threads going for so long.
Long threads can be any type: buyers wanting to buy, sellers wanting to sell, or popular leaks of
free content that pull in high volumes of traffic. The term “bump” refers to posting the word “bump”
in the thread, the sole purpose of which is to keep the thread alive by bringing it to the top of user
feeds because it has a recent post in it. Topic 20 is another type of dialog used for moving a business
arrangement away from the public view of the forum and through instant messages, forum private
messaging, or email. Users will post that they have added a buyer or seller to their contacts list on
ICQ to establish communication or that they have sent them a private message and are now waiting
for a response. This language is what is available on the forums themselves. The algorithm, of
course, was not able to read the private messages themselves.
The final category is numbered at 21, representing transient users. These are users that the topic
model was unable to classify. The reason becomes obvious once these users’ comments are
inspected. Transient users register an account on the forum, make one or two posts, and then do not
return again. They tend to ask questions, ask for advice, or ask where certain products can be
purchased, likely many of them inexperienced in the cybercrime communities. But they are present
on the forums only briefly.

Table 2. Descriptives by Topic for Total Sample.

                                           Users             Posts
Topic                                   n       %    Mean      SD

General customers                     592    1.94    8.77   12.13
Satisfied customers                 2,114    6.94   16.91   22.04
Dissatisfied customers              2,413    7.92   19.23   31.75
Exploited customers                   898    2.95    21.9   37.92
Credit card credentials vendors       711    2.33    9.88   17.65
Cloned credit card vendors            271    0.89    8.82   19.02
Banking credentials vendors         2,237    7.34   12.05   20.72
Regional-specific vendors              26    0.09   60.42  178.97
Document forgery services             364    1.19   13.37   24.82
Spam product/service dealers          449    1.47   12.48   38.17
Malware product/service dealer      1,353    4.44   14.12   21.86
Botnet product/service dealers        395     1.3   19.86   36.28
Proxy vendors                          65    0.21   22.91   93.36
Cashout services                    1,404    4.61    19.1   32.44
Free content distributors           2,311    7.58   13.68   19.19
Free credit card consumers            596    1.96   14.39   43.43
Free tut./malware consumers         3,425   11.24   13.67   19.31
Non-native English speakers           617    2.03   10.31   17.63
Ongoing thread participants         1,323    4.34   21.37    77.6
Private message exchanges           3,870    12.7   10.97   16.55
Transient users                    14,457   47.45     1.7    1.21
Total                              30,469     100    6.57    23.3

Note. n = 30,469.

Descriptive Statistics
Descriptive statistics were performed on the sample, deriving frequencies for each of the 21 cate-
gories as well as the average number of posts made by users from each group. These results are
presented in Table 2. It should be mentioned that users can be assigned to more than one topic, so the
sum of all category frequencies will be greater than the total sample size. Not surprisingly, the most
common category of user was the transient users, making up 47% of all classifications. They make
an average of 1.7 posts, with a standard deviation of 1.21.
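Because users can belong to several topics at once, per-topic descriptives have to be computed column by column rather than by a single grouping variable. The sketch below shows one way to do this on made-up data; the function name `topic_descriptives` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, as the text does not state which estimator was used.

```python
import numpy as np

def topic_descriptives(member, posts):
    """Per-topic user count, percent of all users, and mean/SD of posts per user."""
    n_users = member.shape[0]
    rows = []
    for k in range(member.shape[1]):
        p = posts[member[:, k]]   # post counts of users assigned to topic k
        rows.append({
            "topic": k,
            "n": int(p.size),
            "pct": 100.0 * p.size / n_users,
            "mean_posts": float(p.mean()) if p.size else float("nan"),
            # Sample SD is an assumption; undefined for fewer than two users.
            "sd_posts": float(p.std(ddof=1)) if p.size > 1 else float("nan"),
        })
    return rows

# Hypothetical membership matrix (4 users x 2 topics) and post counts per user.
member = np.array([[True, True], [True, True], [False, True], [False, False]])
posts = np.array([10, 4, 2, 1])

table = topic_descriptives(member, posts)
total_classifications = sum(row["n"] for row in table)
```

Here the two topic counts sum to five even though there are only four users, which mirrors why the category frequencies add up to more than the total sample size.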
The percentage of the classifications belonging to the customer base is generally higher than the
percent of the classifications belonging to the different seller groups, such as those selling in the
identity fraud, crimeware, and other markets. This is also not surprising, as there would be more
buyers than sellers to keep the markets alive. The exception to this appears to be the free content
distributor class, making up 7.6% of the classifications. Of course, free content distributors could
belong to one or more of any of the submarkets.
The two most common groups after transient users would be those engaging in private message
exchanges (12.7%) and free tutorial and malware consumers (11.2%). Free content seems to draw a
larger crowd. Private message exchanges are also probably used for any number of the different
markets, free or otherwise, from buyers or sellers, which would explain why they are more common.

Table 3. Descriptives by Topic by Website (n = 30,469).

                                               CSU                         CardersForum                     BitsHacking
                                           Users           Posts           Users           Posts           Users           Posts
Topic                                   n       %    Mean      SD       n       %    Mean      SD       n       %    Mean      SD

General customers                     231    5.64   11.64   14.47     109    3.72    4.49    3.87     252    1.08    7.98   11.48
Satisfied customers                   157    3.83   21.31   32.57     127    4.33       4    3.46   1,830    7.81   17.42   21.36
Dissatisfied customers              1,002   24.47   24.64   39.18     299   10.19    3.57    3.79   1,112    4.74   18.57   26.62
Exploited customers                   574   14.02   26.13   41.11      82    2.79    4.84    7.62     242    1.03   17.64   33.96
Credit card credentials vendors       359    8.77   12.84   21.46     139    4.74    4.03    4.44     213    0.91    8.72   14.63
Cloned credit card vendors             73    1.78   14.45      31      21    0.72    3.24    2.66     177    0.76    7.16   11.89
Banking credentials vendors           991    24.2   15.87    24.8     362   12.34    4.16    3.93     884    3.77      11   18.66
Regional-specific vendors              13    0.32  114.69  245.53       2    0.07       2       0      11    0.05    6.91    8.65
Document forgery services             115    2.81   17.97   34.71      25    0.85    6.12   11.93     224    0.96   11.81   18.77
Spam product/service dealers           90     2.2   17.39   70.42      27    0.92    5.11   12.39     332    1.42   11.74   24.81
Malware product/service dealer        336    8.21   14.72   25.45     140    4.77    3.64    4.03     877    3.74   15.57   21.63
Botnet product/service dealers        303     7.4   23.34   40.15      16    0.55    3.38    1.78      76    0.32    9.46   14.43
Proxy vendors                          17    0.42      69  177.13      15    0.51    5.67    8.01      33    0.14       7   13.71
Cashout services                    1,005   24.54   23.14   36.32     148    5.04    3.54    3.49     251    1.07   12.12   18.49
Free content distributors              38    0.93   10.08    9.89      35    1.19    3.43    2.87   2,238    9.55   13.91   19.41
Free credit card consumers             60    1.47   37.35   127.3      33    1.12    3.42    2.56     503    2.15   12.37   15.99
Free tut./malware consumers            36    0.88   17.03   20.77      75    2.56    5.03    4.37   3,314   14.14   13.83   19.46
Non-native English speakers            64    1.56    11.8   16.62      97    3.31    2.96     3.3     456    1.95   11.66   19.15
Ongoing thread participants            73    1.78   28.96   45.97      22    0.75    5.09     6.2   1,228    5.24   21.21   79.71
Private message exchanges              35    0.85    11.4   12.19      10    0.34     3.9    2.23   3,825   16.32   10.99    16.6
Transient users                     1,155   28.21    1.84    1.58   1,702   58.01    1.49    1.05  11,600   49.49    1.71    1.18
Total                               4,095   14.43    13.6   37.88   2,934    9.63    2.32    2.98  23,440   76.93    5.87   21.04

The descriptives were further broken down among the three websites and can be found in Table 3.
For all three websites, most users were still transient. There are some differences between the
websites, however. CSU, one of the smaller websites at 4,095 users, had the highest proportion
of dissatisfied and exploited customers. CardersForum, the smallest website at 2,934 users, also had a
higher proportion of dissatisfied customers than BitsHacking. BitsHacking is the largest website and has a
higher proportion of satisfied customers than the other two. Larger websites might simply be better run by
their moderators and administrators.
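The website comparison rests on within-site percentages, that is, each topic count divided by the number of sampled users on that website. A short sketch using counts copied from Table 3 shows the arithmetic; the function name `site_percentages` is hypothetical.

```python
# Counts copied from Table 3: sampled users per site and users assigned to selected topics.
site_totals = {"CSU": 4095, "CardersForum": 2934, "BitsHacking": 23440}
topic_counts = {
    "Satisfied customers": {"CSU": 157, "CardersForum": 127, "BitsHacking": 1830},
    "Dissatisfied customers": {"CSU": 1002, "CardersForum": 299, "BitsHacking": 1112},
    "Exploited customers": {"CSU": 574, "CardersForum": 82, "BitsHacking": 242},
}

def site_percentages(counts, totals):
    """Express each topic count as a percentage of that site's sampled users."""
    return {
        topic: {site: round(100.0 * n / totals[site], 2) for site, n in by_site.items()}
        for topic, by_site in counts.items()
    }

pct = site_percentages(topic_counts, site_totals)

# Site with the largest share of users in each topic.
leading_site = {topic: max(shares, key=shares.get) for topic, shares in pct.items()}
```

The resulting shares reproduce the percentages printed in Table 3 for these rows, with CSU leading on dissatisfied and exploited customers and BitsHacking leading on satisfied customers.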

Discussion
The present study applied topic modeling to group carding forum members into categories
based on their comment histories. After examination of the model results, 21 categories of users were
identified. The 21 labels described 30,469 users across three carding websites after each
of 200,105 comments was analyzed programmatically. The result is a better understanding of the types of dialog
found on such forums as well as a data set with a rich set of 21 nonmutually exclusive binary variables
describing the large sample.
The results highlighted many topics that were in line with the previous literature but also some
aspects of cybercriminal roles that were novel given the extant research. The customer base was
consistent with our current understanding of carding forum markets (Holt, 2013; Yip et al., 2013),
although this research further refined the concepts into four categories of user posts and themes. The
top key words and language used for providing user reviews were identified. It is also noteworthy
that satisfied, dissatisfied, and exploited customers were found to be associated with three separate
topics. This could indicate that more negative buyers were simply novices who were less able to vet
reputable sellers, but it could also reflect different personalities of the buyers independent of their
actual experiences. Some users might just be more negative.
The first three types of vendors found to compose the identity fraud market were already well
described in the literature (Holt & Lampke, 2010). These include credit card credentials,
cloned credit card, and banking credentials vendors. However, the regional-specific vendors and
document forgery service providers have been given less attention in past research.
The regional-specific vendor comments and user responses could themselves be used in
future research to estimate geographic locations of buyers based on the regional-specific products
they purchase. Adding geographic information to the data could add new dimensions to future
research findings using the same or similar data set.
The crimeware market, which consisted of spam, malware, and botnet services and products
dealers, was consistent with the prior research (Gaspareniene & Remeikiene, 2015).
It is interesting to note, however, that spam and botnet services and products dealers were treated
as separate categories by the software. Botnets are often used to send spam, so further investigation
into how the two are distinct is warranted.
The remaining cybercrime market included cashout services but also a type of proxy vendor.
Proxy vendors tended to only provide proxies, although there are other means available of masking a
user’s identity over the Internet. Other methods include VPNs, or virtual private networks, which
route your traffic through a separate network instead of an individual server. However, VPNs were
more likely to be discussed in either the free content market or the crimeware markets. There was
also some overlap between proxy vendors and regional-specific vendors. It appears that regional-
specific proxies were in high demand, as buyers wanted control over where they appear to be
originating from.
Although the free content markets that are often available on larger cybercrime forums have been
described in the previous literature (Garg et al., 2015), they have been given less attention by researchers.
The present research explored this aspect of carding forums further. The topic model identified three
themes of posts in the free content sections: free content distributors, free credit card consumers, and
free tutorial and crimeware consumers. It seemed interesting that there appeared to be only one topic
for distributors, but two topics for consumers. Either the language used by all types of distributors
was similar, or they were more versatile in the types of goods they could create or provide.
The additional topics highlighted two types of dialog in addition to the different roles of each
user. Private message exchanges were well documented by prior research (Holt, 2013), yet ongoing
thread participation was also identified by the results. The ongoing thread participant topic revealed
the language used in order to keep popular threads alive for so long, capturing the posts of both
buyers and sellers that consistently “bump” threads to the top of the queue.

Limitations
There are some limitations with the data used that need to be mentioned. The disparity in size
of the three different websites will have skewed findings more toward the larger BitsHacking,
which consisted of 23,440 users. It is known that there are differences between smaller and
larger carding forums. However, what is not known is which types of forums cybercriminals
are more likely to frequent. While larger forums contain a larger user base, there are fewer of
them and many smaller forums.
There is also the problem of dependency both across and within websites. Some users might visit
more than one website in the sample and register under a different username. If this were to happen,
the software would code each user as a separate person. Dependency within websites would be
represented by users registering with the same forum multiple times under a different name. This
would reflect the behavior of rippers, but users could also do the same for other reasons.
The selection criteria for forum inclusion in the sample could be considered narrow, capturing a
small slice of the cybercrime communities on the Internet. Slightly larger forums were preferred, as
a user base of 1,000 participants minimum was required. Smaller communities might have distinct
differences from the ones included. Also, only forums were considered, even though cybercrime
communities exist on other media, such as carding shops, Internet relay chat (Benjamin, Li, Holt,
& Chen, 2015), and social networking sites (Yip, 2010). There could also have been biases in the
search terms used to locate the forums. Specific terms used to facilitate cybercrime business
transactions were used in search engine queries. As the cultural language, as well as the technology of
cyberoffenders, is continually changing, the chosen search terms may not fully capture all types of
underground cybercrime forums. In addition, exclusively English-speaking forums were selected,
biasing the findings toward cybercrime activities in the West.
Finally, the research was not hypothesis driven or inferential but rather descriptive in its approach
to the data. Little causal inference can be gleaned from the results, but instead, the methods yield a
better qualitative understanding of prior typologies of offenders confirmed by the current analysis as
well as newer categories identified via the quantitative methods. The results serve as a stepping stone
toward future multivariate investigations that link the various new measures together. Such research
could highlight dimensions of carding forums not yet described.

Future Research and Conclusion


The results of the topic model analysis produced a large data set consisting of 21 categories of
cybercrime user. While the qualitative description of the topics affords its own analysis, the
approach was mostly exploratory and investigative, and further quantitative analysis of the topic
data can be conducted. Additional variables can also be acquired from the carding forums, as not all
information was scraped by the crawlers. Each forum also included a profile page for each registrant,
with personal information fields users could optionally fill out. Profile pages include details such as
age, gender, occupation, location, titles within the forum, and other information. Combined with the
21 topics, as well as post activity, further analysis can be conducted.
The research only examined publicly available user posts on the forums. The research was
not able to capture private messages that occurred behind the scenes to facilitate transactions
and partnerships. However, there are some noteworthy hacks of previous carding forums, in
which the entire website databases have been disclosed online by the attackers. These data-
bases include private messaging conversations provided by the forum software. Topic mod-
eling could be applied to these conversations as well, offering further insight into the world
of carding forums.

Appendix

Table A1. Model Fit and K Topic Number Selection Results.

        Maximization                         Minimization
Topics  Griffiths and     Deveaud, SanJuan,  Cao, Xia, Zhang,  Arun, Suresh, Madhavan,  Perplexity
        Steyvers (2004)   and Bellot (2014)  and Tang (2009)   and Murthy (2010)

2       -4.7E+07          2.93               .11               116.29                   1,756.13
3       -4.6E+07          3.07               .32               106.54                   1,708.82
4       -4.5E+07          3.32               .15               102.45                   1,564.45
5       -4.4E+07          3.37               .19               98.99                    1,323.64
6       -4.4E+07          3.50               .14               90.24                    1,438.53
7       -4.2E+07          3.71               .08               83.37                    1,253.90
8       -4.2E+07          3.54               .12               87.06                    1,105.08
9       -4.1E+07          3.74               .10               79.10                    1,270.51
10      -4.1E+07          3.66               .10               80.72                    1,113.27
11      -4.1E+07          3.62               .09               75.47                    1,106.57
12      -4.1E+07          3.57               .10               79.79                    1,207.94
13      -4E+07            3.52               .09               73.55                    1,205.24
14      -4E+07            3.43               .12               74.46                    988.50
15      -4E+07            3.48               .10               68.30                    991.05
16      -4E+07            3.56               .11               65.53                    1,107.84
17      -4E+07            3.39               .12               71.83                    997.78
18      -3.9E+07          3.49               .11               66.75                    1,123.40
19      -3.9E+07          3.40               .10               62.82                    1,126.29
20      -3.9E+07          3.25               .12               65.46                    1,076.93
21      -3.9E+07          3.31               .13               62.81                    1,024.84
22      -3.9E+07          3.39               .11               61.59                    1,226.59
23      -3.9E+07          3.27               .12               62.22                    1,089.12
24      -3.9E+07          3.27               .11               61.58                    1,137.98
25      -3.9E+07          3.25               .11               60.55                    1,042.53
26      -3.9E+07          3.15               .12               60.39                    1,014.92
27      -3.9E+07          3.20               .11               55.77                    1,042.90
28      -3.8E+07          3.17               .11               58.96                    1,052.00
29      -3.9E+07          3.10               .10               62.44                    982.58
30      -3.8E+07          3.12               .10               56.99                    1,067.45
31      -3.9E+07          3.09               .09               57.80                    1,013.85
32      -3.9E+07          3.05               .11               58.90                    1,186.05
33      -3.9E+07          3.07               .09               55.51                    1,056.28
34      -3.8E+07          2.94               .10               60.33                    943.80
35      -3.8E+07          2.92               .10               51.10                    1,013.49
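To illustrate how criteria of this kind are read, the sketch below picks, for three of the metrics, the number of topics that is best among a handful of rows copied from Table A1 (every fifth value of K). It is only an illustration of the maximize/minimize logic and is not the procedure used in the study to settle on 20 topics; the names `fit`, `DIRECTION`, `COLUMN`, and `best_k` are hypothetical.

```python
# Selected rows copied from Table A1: K -> (Deveaud et al. 2014, Arun et al. 2010, perplexity).
fit = {
    5: (3.37, 98.99, 1323.64),
    10: (3.66, 80.72, 1113.27),
    15: (3.48, 68.30, 991.05),
    20: (3.25, 65.46, 1076.93),
    25: (3.25, 60.55, 1042.53),
    30: (3.12, 56.99, 1067.45),
    35: (2.92, 51.10, 1013.49),
}

# Whether larger or smaller values are better, following the table's column groupings.
DIRECTION = {"deveaud": max, "arun": min, "perplexity": min}
COLUMN = {"deveaud": 0, "arun": 1, "perplexity": 2}

def best_k(fit, metric):
    """Return the K whose value for the given metric is best among the rows supplied."""
    pick = DIRECTION[metric]
    col = COLUMN[metric]
    return pick(fit, key=lambda k: fit[k][col])

choices = {metric: best_k(fit, metric) for metric in DIRECTION}
```

Even on this subset the criteria point to different values of K, which is why such metrics are usually weighed together with the interpretability of the resulting topics rather than applied mechanically.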
Declaration of Conflicting Interests


The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication
of this article.

Funding
The author received no financial support for the research, authorship, and/or publication of this article.

References
Abbasi, A., Li, W., Benjamin, V. A., Hu, S., & Chen, H. (2014, September). Descriptive analytics: Examining
expert hackers in web forums (pp. 56–63). The Hague, Netherlands: JISIC.
Arun, R., Suresh, V., Madhavan, C. V., & Murthy, M. N. (2010). On finding the natural number of topics with
latent dirichlet allocation: Some observations. In M. J. Zaki, J. X. Yu, B. Ravindran, & V. Pudi (Eds.),
Advances in knowledge discovery and data mining (pp. 391–402). Berlin, Germany: Springer.
Benjamin, V., Li, W., Holt, T., & Chen, H. (2015, May). Exploring threats and vulnerabilities in hacker web:
Forums, IRC and carding shops. In Intelligence and Security Informatics (ISI), 2015 IEEE International
Conference on (pp. 85–90). IEEE.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55, 77–84.
Brown, P. F., Pietra, V. J. D., Mercer, R. L., Pietra, S. A. D., & Lai, J. C. (1992). An estimate of an upper bound
for the entropy of English. Computational Linguistics, 18, 31–40.
Cao, J., Xia, T., Li, J., Zhang, Y., & Tang, S. (2009). A density-based method for adaptive LDA model
selection. Neurocomputing, 72, 1775–1781.
Deveaud, R., SanJuan, E., & Bellot, P. (2014). Accurate and effective latent concept modeling for ad hoc
information retrieval. Document Numérique, 17, 61–84.
Garg, V., Afroz, S., Overdorf, R., & Greenstadt, R. (2015). Computer-supported cooperative crime. In Jens
Grossklags & Bart Preneel (Eds.), Financial cryptography and data security (pp. 32–43). Berlin, Germany:
Springer.
Gaspareniene, L., & Remeikiene, R. (2015). Digital shadow economy: A critical review of the literature.
Mediterranean Journal of Social Sciences, 6, 402.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of
Sciences, 101, 5228–5235.
Holt, T. J. (2013). Examining the forces shaping cybercrime markets online. Social Science Computer Review,
31, 165–177.
Holt, T. J., & Lampke, E. (2010). Exploring stolen data markets online: Products and market forces. Criminal
Justice Studies, 23, 33–50.
Hornik, K., & Grün, B. (2011). Topicmodels: An R package for fitting topic models. Journal of Statistical
Software, 40, 1–30.
Kigerl, A. (2016). Cyber crime nation typologies: K-means clustering of countries based on cyber crime rates.
International Journal of Cyber Criminology, 10, 147.
Kohavi, R. (1995, August). A study of cross-validation and bootstrap for accuracy estimation and model
selection (Vol. 14, No. 2, pp. 1137–1145). Stanford, CA: IJCAI.
Li, W., & Chen, H. (2014, September). Identifying top sellers in underground economy using deep learning-
based sentiment analysis. In Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint,
The Hague, Netherlands: IEEE (pp. 64–67).
Li, W., Chen, H., & Nunamaker, J. F. Jr. (2016). Identifying and profiling key sellers in cyber carding
community: AZSecure text mining system. Journal of Management Information Systems, 33, 1059–1086.
Li, W., Yin, J., & Chen, H. (2016). Identifying high quality carding services in underground economy using
nonparametric supervised topic model. International Conference on Information Systems. Dublin, Republic
of Ireland.
Mikhaylov, A., & Frank, R. (2016, August). Cards, money and two hacking forums: An analysis of online
money laundering schemes. In Intelligence and Security Informatics Conference (EISIC), 2016 European,
Uppsala, Sweden: IEEE (pp. 80–83).
Motoyama, M., McCoy, D., Levchenko, K., Savage, S., & Voelker, G. M. (2011, November). An analysis of
underground forums. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement
Conference, ACM, Berlin, Germany (pp. 71–80).
Samtani, S., Chinn, R., & Chen, H. (2015, May). Exploring hacker assets in underground forums. In
Intelligence and Security Informatics (ISI), 2015 IEEE International Conference on, Baltimore, MD, USA: IEEE
(pp. 31–36).
Soudijn, M. R., & Zegers, B. C. T. (2012). Cybercrime and virtual offender convergence settings. Trends in
Organized Crime, 15, 111–129.
Yip, M. (2010). An investigation into Chinese cybercrime and the underground economy in comparison with
the West (Doctoral dissertation), University of Southampton, Southampton.
Yip, M., Shadbolt, N., & Webber, C. (2013, May). Why forums? An empirical analysis into the facilitating
factors of carding forums. In Proceedings of the 5th Annual ACM Web Science Conference, ACM, Paris,
France (pp. 453–462).
Yip, M., Webber, C., & Shadbolt, N. (2013). Trust among cybercriminals? Carding forums, uncertainty and
implications for policing. Policing and Society, 23, 516–539.

Author Biography
Alex Kigerl, PhD, is an assistant research professor of criminal justice and criminology at Washington State
University (WSU) and data scientist at the Washington State Institute for Criminal Justice (WSICJ). His
research focus is on cybercrime and correctional risk assessment development.
