Let’s Talk about TikTok - A Web Scraping Tool for Social Science Research

Bachelor’s Thesis
by

Yannick Zelle
born in
Neuss

submitted to

Professorship for Computer Networks


Prof. Dr. Martin Mauve
Heinrich-Heine-University Düsseldorf

March 2023

Supervisors:
Prof. Dr. Virginie Julliard
Marc Feger, M. Sc.
Thibault Grison, M.Art
Abstract
The emergence of social media has provided social science researchers with new avenues to gain a macroscopic view of social phenomena. However, accessing this data can be technically challenging. While some social media platforms offer researchers access to their data through an API, accessing data from TikTok is currently not a straightforward process. Despite its growing importance in public and scientific discourse, TikTok remains an underutilized resource for social science research.

This work addresses the issue by presenting a tool that allows researchers to scrape TikTok
data and integrate it into a theoretical model. The model aligns with common methods and
theories in social science while incorporating the rigorous and formal nature of computer
science. Through this approach, the work aims to facilitate interdisciplinary collaboration
between social science and computer science researchers and provide social scientists with a
tool to collect, explore, and analyze data from TikTok.

Acknowledgments
First and foremost, I would like to express my gratitude to all my supervisors for their open-
ness to constructing this bachelor’s thesis in an interdisciplinary and intercultural context. I
believe that this work has greatly benefited from the diversity of feedback, inspiration, and
recommendations I received from these different perspectives, and I am tremendously grate-
ful to everyone who was involved and willing to participate in this project.

Secondly, I would like to express my gratitude to the CERES team at Sorbonne University for
always having an open door and an open ear for me, and for supporting and motivating me
throughout the process of this work. In addition, I would like to thank Gianluca Manzo for
enriching me with the perspective of Analytical Sociology, which has shaped this work.

Finally, I would like to express my heartfelt appreciation to my friends and parents who have
provided unwavering support throughout this journey and have always been willing to read
my work and share their opinions. Specifically, I would like to thank my father for his metic-
ulous attention to detail, Gesche Nehmelmann and Philipp Breer for their constructive ideas
and diverse perspectives, Daniel Pinilla Ramírez for the enriching discussions and support
with the figures, and Irene Lopez Abarca for always lending a listening ear and enriching me
with her creativity and empathy.

Contents

List of Figures vi

List of Tables vii

List of Listings viii

1 Introduction 1
1.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Fundamentals 6
2.1 Social Science, Computer Science and Social Media . . . . . . . . . . . . . 6
2.2 TikTok . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Technical Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Theoretical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Implementation 30
3.1 Application Programming Interface . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Scrapy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Neo4j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Docker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Algorithmic Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4 Analysis 40
4.1 Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Descriptive Analysis of the Collected Data . . . . . . . . . . . . . . . . . . . 42


5 Discussion & Conclusion 49


5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Related Work & Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Bibliography 54

List of Figures

2.1 Coleman’s General Micro-Macro Model [22] . . . . . . . . . . . . . . . . 10


2.2 Application of the Micro-Macro Model . . . . . . . . . . . . . . . . . . . . 10
2.3 TikTok’s ’For You Page’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Anatomy of a TikTok Post . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 TikTok’s Robots.txt File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Example for a Model of a Fictive TikTok Interaction . . . . . . . . . . . . . . 22
2.7 Simplification of the Fictive TikTok Interaction . . . . . . . . . . . . . . . . . 23

3.1 Scrapy Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34


3.2 Comparison of Scrapy, BeautifulSoup, and Selenium [40] . . . . . . . . . . . 35

4.1 Collected Posts per Hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


4.2 Accumulated Collected Posts over Time . . . . . . . . . . . . . . . . . . . 41
4.3 Structure of collected data . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Portions of Nodes by Nodetype . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.5 Summed Degrees per Nodetype . . . . . . . . . . . . . . . . . . . . . . . . 43
4.6 Average Degree per Nodetype . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.7 Degree Distribution per Nodetype . . . . . . . . . . . . . . . . . . . . . . . 45
4.8 Maximal Degree per Nodetype . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.9 Visualization of Graph of Collected Data . . . . . . . . . . . . . . . . . . . . 46
4.10 Visualization of Graph of Collected Data with Ring View . . . . . . . . . . . 47
4.11 Visualization of the Scraping Process . . . . . . . . . . . . . . . . . . . . . . 48

List of Tables

3.1 Collected Data in the Parse Function . . . . . . . . . . . . . . . . . . . . . . 39


3.2 Collected Data in the Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 39

List of Listings

3.1 Pseudo Code of the Scraping Algorithm . . . . . . . . . . . . . . . . . . . . 38

Chapter 1

Introduction

Computer Science is no more the science of computers than astronomy is that of telescopes.

Alan Perlis

The above quote, attributed among others to Alan Perlis, the first recipient of the Turing Award1, highlights the misleading nature of the term ’computer science’. Perlis emphasizes that the focus of the field is not on the study of computers themselves, but rather on the problems that can be understood and solved through the lens of computation. Moreover, computer science is concerned with understanding information transmission, as computation is inherently tied to this process. This point is echoed by many other figures in the field, who stress that the computer is not the main focus of the discipline [1]–[3].

One collection of fields in which information flow plays an important role is the social sciences. The social sciences study the social behavior and functioning of various species, including humans, to understand the organization of society [4]. This thesis posits that the human species has a unique quality that sets it apart from other species: the utilization of technology. Humans use technology not only to enhance their personal lives, but also to communicate with each other, and technology plays a crucial role in this process. The examination of information flow within human groups and how it relates to their organization
1 The Turing Award, named after Alan Turing, is widely considered one of the most prestigious awards in the field of computer science, comparable in prestige and recognition to the Fields Medal in mathematics.


is therefore of great importance to the social sciences, especially in the age of social media, which has dramatically transformed human communication. As a result, research on human communication through social media has become increasingly relevant and has led to increased collaboration between the social sciences and computer science [5]–[7].

In this context, the social media platform TikTok has not received significant research attention despite its widespread usage, particularly among the young population [8], [9]. Modern technology is complex, and although social scientists are interested in understanding how social media works, there is a high barrier to entry for researching this field. As someone who holds a degree in social science and who is writing this thesis to finish my bachelor’s degree in computer science, I have experienced firsthand the benefits of applying computer science methods to answer social science research questions. However, the different languages spoken in these fields can make collaboration a difficult task. The objective of this thesis is to facilitate collaboration between the two disciplines by developing a tool to scrape data from the increasingly relevant platform TikTok and providing a clear theoretical framework for interpreting the collected data.2

2 While drafting this work, I utilized language tools such as ChatGPT, a large language model, and Grammarly, a grammar correction tool. However, I would like to emphasize that I employed ChatGPT for text revision purposes only, and did not utilize it as a tool to generate new information.


1.1 Problem

This work is situated in a broader understanding of the social world as embedded in a technical context. Several disciplines have formed to study this topic, with each social science subdiscipline having a corresponding computational subdiscipline that focuses on related problems.

This thesis addresses the issue of limited research on certain social media platforms. Most studies focus on popular platforms such as Twitter and Reddit, as they offer accessible data through an Application Programming Interface (API)3. However, this approach can lead to biased knowledge about social media, as findings from studies on these platforms are often generalized to social media as a whole.

TikTok, for instance, is a popular social media platform that has received limited research attention despite concerns raised by experts regarding its vulnerability to misinformation during the COVID-19 pandemic. Few studies have utilized data from TikTok, possibly due to the challenges in accessing data on the platform or the fact that it is more commonly used among the younger generation, resulting in unfamiliarity among researchers.

This study also aims to reinforce collaboration between social science and computer science-related disciplines in researching social media. The social sciences tend to have a different approach to theory and methodology compared to disciplines with a natural science orientation [10], which makes collaboration between social science disciplines and fields that are primarily marked by a natural science perspective difficult. For instance, the field of complex systems aims to understand the behavior of systems composed of many interacting components that can give rise to emergent phenomena. Although social systems are one such example, the field of complex systems is largely influenced by mathematics, physics, and computer science and has limited communication with the social sciences due to a lack of shared language. Therefore, interdisciplinary studies that bridge the gap between natural and social science approaches, such as this one, have the potential to promote knowledge exchange and collaboration between the two fields.

3 An API is a program utilized by other programs to access data. Further explanation of this concept can be found in section 3.1.


1.2 Objectives

The following outlines the objectives of this thesis, which aims to address the lack of access to TikTok data and thereby to support collaboration between the social sciences and computer science-related disciplines.

The first objective is to develop a scraper that enables researchers to access data from the social media platform TikTok. This is a crucial step to address the lack of research on TikTok due to the unavailability of an easy-to-access data source. However, scraping TikTok data presents technical challenges as well as ethical and legal concerns, as the transparency and legality of accessing data need to be ensured.

The second objective is to design the scraper and this thesis in a way that promotes interdisciplinary collaboration between the social sciences and computer sciences. This is achieved through two sub-objectives:

• to make the scraper user-friendly by reducing the knowledge required to acquire data,

• to embed the tool in a theoretical model which aligns with common methods and the-
ories in social science while incorporating the rigorous and formal nature of computer
science.

Additionally, the thesis endeavors to present technical information in a way that is comprehensible to both scientific domains, with explanatory material included in the footnotes where necessary.

In conclusion, the objectives of this thesis are to:

1. develop a theoretical model of TikTok as a social medium that is accessible from both
a social science and a computer science perspective.

2. implement a web scraping tool adapted to this theoretical model.

3. make the tool as accessible and user-friendly as possible for researchers from both
fields.


1.3 Structure

This thesis will be structured as follows. In chapter 2, I will provide a theoretical discussion
on social media in general and TikTok in particular, taking into account both the social science
and computer science perspectives. The aim is to establish a common ground on the topic
and to develop an accessible theoretical model that is mirrored in the data structure.

Chapter 3 focuses on the practical solution I develop in this work. The program will be explained component-wise, starting with an overview of the concept of an Application Programming Interface (API) and distinguishing that concept from the program developed in this work. I will also discuss the role of graphs in the implementation and the use of existing tools such as Neo4j, Scrapy, and Docker. Finally, I will conclude the chapter by providing an explanation of the algorithmic realization of the program.

In chapter 4, I will evaluate the tool and the collected data. The analysis of the tool will focus on its capability and performance in collecting data from TikTok, while the analysis of the collected data will be a descriptive showcase of its potential applications.

In chapter 5, I will conclude this work by discussing the results and their relevance to the current state of the art. I will also address the limitations of the tool and provide an outlook on potential future extensions. Furthermore, I will give my opinion on which use cases would benefit the most from the tool.

Chapter 2

Fundamentals

The purpose of this chapter is to build a foundation for addressing the issue presented in the previous chapter. The aim is to clarify the relevant terms in alignment with current theories and methodologies in both social science and computer science. To achieve this, a common understanding must be established. Thus, in the first section, I will examine the interdisciplinary fields of social science and computer science.

Building on this foundation, I will then explore social media in general, with a particular focus on TikTok, from both a social science and computer science perspective. This discussion will be followed by an examination of the technical context of conducting research on social behavior on TikTok. It is important to understand the available data in order to tailor the theoretical model accordingly. Finally, this chapter will conclude by formulating definitions that bridge the social science and computer science perspectives on social media and by proposing a theoretical model that builds on the introduced definitions.

2.1 Social Science, Computer Science and Social Media

This section intends to demonstrate the significance of TikTok data for both computer science and social science by providing a brief overview of each discipline and highlighting its relationship with social media. During the discussion of social science, I will also give a concise explanation of prevalent social theories and commonly used methodologies, to ensure that the definitions and, subsequently, the scraped data align with these.


The Essence of the Social Sciences and the ’Methodenstreit’

The social sciences encompass a diverse range of disciplines, including among others psychology, communication science, anthropology, and sociology, which all study human social behavior [11]. However, defining the concept of ’social’ is challenging, as the word can have different meanings within the social sciences and in common usage. For example, the Cambridge dictionary defines social as related to "the way people live together or to the rank a person has in a society" [12]. However, in common language the term can also describe the behavior of certain animals, not just humans.

The social sciences emerged as a distinct field in the 19th century. It was believed that human behavior was more complex than other natural phenomena and that the researcher’s own perspective should be included in the study of society. This distinction between the social and natural sciences was not universally accepted and has been the subject of ongoing debate, known as the ’Methodenstreit’1. This debate centers on the question of whether there are universal laws that govern the behavior of society [13].

Émile Durkheim’s study on suicide was groundbreaking for the social sciences. Previously,
it was difficult to gather data related to human collective behavior, but Durkheim’s work
showed the potential of using data and statistics to understand social phenomena. He found
that despite suicide being a personal decision, the rates of suicide were surprisingly stable and
showed patterns based on factors such as religious community. Durkheim’s conclusion was
that human behavior is not completely individual and isolated, but is influenced by social
factors such as norms and regulations [14]. Durkheim demonstrated that social behavior
could be studied scientifically, and opened the door for further research in the field.

Durkheim’s study also highlighted the importance of data collection and analysis in order to
understand social phenomena. His study demonstrated that the ability to gather and analyze
data can lead to new insights and theories about human behavior.

Currently, the social sciences are characterized by a diverse array of methodologies and perspectives. Despite this diversity, there is a persistent challenge in terms of communication and collaboration among researchers within the field. In [15], the authors argue that incorporating social science research within interdisciplinary collaboration could lead to a deeper
1 ’Methodenstreit’ is German and means ’dispute over methods’.


understanding of collective human behavior through analytical methods. They conclude that
new tools could be an important factor in achieving this goal.

In conclusion, the idea that human behavior is too complex to be studied by means of statistical methods is a fallacy. While it is true that human behavior is complex, this does not mean that patterns and regularities cannot be found and studied. Durkheim’s study of suicide is an example of how social research can identify patterns and regularities in human behavior and make meaningful contributions to our understanding of society. The advances in data collection and analysis have opened new avenues for social research and have enabled researchers to better understand human behavior and the forces that shape it. The ’Methodenstreit’ should be seen as an ongoing dialogue about the best methods to study human behavior, not as a barrier to scientific inquiry.

Social Science Theory

The Methodenstreit helps to shed light on the ongoing debate in the social sciences regarding
the definition of the term ’theory’. Some researchers view theory in the same way as it is
understood in natural sciences, as a coherent set of clear and unambiguous statements without
contradictions. On the other hand, others understand theory as a paradigm of thought. The
distinction between these two forms of theories is often blurred, resulting in a wide range of
theories with varying angles, applications, and methodologies.

It would be beyond the scope of this section to provide a comprehensive overview of all
the influential theories in the social sciences. Instead, we will focus on three theories that
incorporate a natural science approach and that I consider to be relevant to the study of social
media and TikTok. The theories I decided to introduce are social network theory, structural
functional theory, and the theory of social action.

Émile Durkheim is considered one of the founding fathers of sociology. He argued that
society is like an organism, with different parts that work together to maintain social stability
and balance. One of the most famous studies conducted by Durkheim was his examination
of suicide rates that I already mentioned in section 2.1. In his research Durkheim found
that certain social factors, such as integration into social groups and the presence of moral
regulation, had a significant impact on suicide rates. This led him to conclude that society
functions as a system, with different parts serving specific functions that contribute to the


stability and balance of the whole. According to Durkheim, society can be seen as a complex system of interrelated parts that work together to maintain stability and order. The theory that emerged out of his findings and conclusions is referred to as structural functional theory [14], [16], [17]. Structural functionalists believe that society is made up of a number of different structures, such as the economy, the family, and the government, each of which has its own specific functions. These structures are seen as being in a state of equilibrium, meaning that they maintain stability by balancing the demands and needs of different parts of society.

A few years after Durkheim published his seminal work on suicide, Max Weber established himself as a leading figure in the field of sociology through his own publication. While Durkheim analyzed the statistical patterns of suicide rates and their connection to individual choice, Weber was captivated by the variation in the development of capitalism across different cultures. Through his research, Weber argued that religion, particularly Calvinism, played a significant role in the growth of capitalism. He claimed that the belief that an individual’s wealth could determine their fate in either heaven or hell motivated them to reinvest their profits into industry, leading to the emergence of capitalism. Although Weber’s hypothesis that capitalism can be explained by religion was later challenged by empirical evidence, his work emphasized the importance of cultural and ideological factors in shaping social behavior and structures. This led to the development of the theory of social action, which focuses on the actions of individuals in society and how these actions can be explained, in contrast to structural functionalism, which primarily examines the existing structures in society [18]–[20].

It is important to recognize that the two theories do not contradict each other, but rather focus on different aspects of society. In the late 1970s, American sociologist James Coleman was able to bridge the gap between these two theories through his renowned micro-macro approach, also known as Coleman’s Bathtub. The model emphasizes that both structural functional theory and the theory of social action are correct, but limited in their perspectives. Social structures are crucial, but they emerge from individual actions, while those actions are determined by the social structures within which they occur. The bathtub model visually represents this dynamic interaction between the micro and macro levels of society [21]. Figures 2.1 and 2.2 demonstrate Coleman’s model and its application using Weber’s explanation of the rise of capitalism.


Figure 2.1: Coleman’s General Micro-Macro Model [22]

Figure 2.2: The Micro-Macro Model Applied to Weber’s Explanation of Capitalism [23]

Coleman’s micro-macro model provided a framework for understanding social phenomena, but it left open the question of how to model the transitions between different states in the system. Social network theory emerged as a response to this challenge, proposing that society


can best be understood as the sum of individuals and their relationships. This perspective
holds that the constellation of these relationships determines the beliefs, attitudes, and values
that exist in a society. The potential for combining this approach with Coleman’s Bathtub
to create a comprehensive explanation of social phenomena will be further discussed in the
next section.

Methodology within the Social Sciences

The previous section provided a macroscopic perspective on social theory. In contrast, this section aims to introduce commonly used methodologies and highlight their relation to the theory. However, it is important to acknowledge the same limitation as in the theory section: providing an extensive overview of all methods used in the field is beyond the scope of this work. Instead, this section will focus on three methods closely linked to the theory presented earlier: statistical regression, network analysis, and agent-based modelling (ABM). While many studies in the field rely on frequency or qualitative analysis, this section will not delve into these methods in detail. Its purpose is to provide a brief introduction to the three aforementioned methodologies and to point to resources for further exploration.

The structural functionalist theory focuses on the structural relationships within society. Structural functionalists believe that there are functions that take a structural property of society as input and produce another property of society as output. Social scientists often use statistical regression as a method for finding these functions.

Statistical regression is a technique for explaining a dependent variable y as a function of one or more independent variables x. The objective is to find the best-fitting function f(x) = y that predicts y based on x using statistical methods. This is accomplished by collecting data on x and y and minimizing the error term ε in the equation Y = f(X) + ε, where Y and X are vectors and each yi corresponds to the prediction based on xi [24].

In social science, it is often assumed that the relationship between x and y is linear, represented as f(x) = aᵀx + b, where a is a vector in Rⁿ. However, it is important to note that many real-world relationships may not be linear. As seen in Coleman’s bathtub metaphor, even if a function relating two or more structural variables is found, this only accounts for the macro perspective, or the "water level" of the bathtub, and does not account for the way social actions influence the observed dependent variable y.
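As an illustration of the least-squares fitting described above, the following minimal Python sketch estimates a and b for a one-dimensional linear model; the synthetic data and all variable names are my own choices for demonstration, not part of the tool developed in this thesis.

```python
import numpy as np

# Minimal sketch of linear regression: fit f(x) = a*x + b by minimizing the
# squared error term epsilon in Y = f(X) + epsilon (ordinary least squares).
# The data below is synthetic, generated only to illustrate the method.
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=200)                  # independent variable
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=200)  # dependent variable plus noise

# Design matrix: one column for x, one column of ones for the intercept b
X = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"estimated a = {a:.2f}, b = {b:.2f}")  # close to the true 2.0 and 1.0
```

Because the data was generated from a truly linear relationship, the estimates recover the underlying coefficients well; with real social data, the linearity assumption itself would have to be justified.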


A possible answer to this problem is ABM, a method that has recently been introduced to the field of social research and that bridges interdisciplinary research between complexity research and the social sciences. This method is gaining popularity in the social sciences, particularly in the field of Analytical Sociology [25], [26].

ABM is a type of simulation that allows researchers to specify the rules of behavior for a set of agents in a given world at each time step t. The behavior of these agents can depend on the state of the world and the behavior of other agents. The agents can represent any type of individual entity, and the world can take various forms. The goal of this method is to recreate and explain observed phenomena by finding the underlying rules of behavior at the micro level. To achieve this, the researcher programs agents that can represent different entities, such as human beings, to simulate behavior that closely mimics what is observed in the real world. By running the simulation, the researcher aims to obtain an outcome that closely resembles the real-world observations2.
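To make this concrete, here is a minimal, hypothetical ABM sketch in Python: agents hold a binary opinion and, at each time step, one randomly chosen agent imitates another (a simple "voter model"). The macro-level outcome, consensus, emerges purely from this micro-level rule. The model and its parameters are illustrative inventions of mine, not the program developed in this thesis.

```python
import random

def run_voter_model(n_agents=30, max_steps=50000, seed=1):
    """Simple voter model: at each step one agent copies another's opinion."""
    rng = random.Random(seed)
    opinions = [rng.choice([0, 1]) for _ in range(n_agents)]
    for step in range(max_steps):
        if len(set(opinions)) == 1:    # consensus: a macro phenomenon
            return step, opinions[0]   # emerging from micro-level rules
        i, j = rng.sample(range(n_agents), 2)
        opinions[i] = opinions[j]      # agent i imitates agent j
    return max_steps, None             # no consensus within the step budget

steps, outcome = run_voter_model()
print(f"consensus on opinion {outcome} after {steps} steps")
```

Even this toy model shows the characteristic ABM workflow: define micro-level rules, run the simulation, and compare the emergent macro pattern to observations.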

By examining structural relationships in more depth and taking into account social actions, agent-based modelling provides a way to offer mechanism-based explanations that align with Coleman’s micro-macro model. A mechanism refers to a macro phenomenon that emerges from the behavior of individual entities or subjects at the micro level. Advocates of this method argue that social researchers should focus on mechanisms instead of linear relationships, as is the case in linear regression.

An important question in the context of ABM is how to model society within the framework
of ABM. As discussed earlier, one approach is to view society as a network, as suggested by
social network theorists. A closely related method to this approach is social network analysis,
which has gained popularity in the social sciences for its ability to provide insightful and
reproducible results [28], [29]. With the growth of the internet and social media, researchers
now have access to vast amounts of data on social networks.

Social network analysis involves the study of social networks, which can be represented as graphs composed of nodes and connections between them. These graphs can be represented mathematically as G = (U, V), where U = {u1, . . . , un} and V ⊆ U × U. For more detailed definitions and explanations, please refer to section 2.4. Researchers in this field aim to model society as graphs and analyze their structure and properties using specific measures and algorithms [30].
2 For an introduction to ABMs see, for instance, [27].


ABM and social network analysis can be linked by modeling the simulated agents’ world as a
graph. This allows for the exploration of the structural relationships within society, providing
valuable insights into the behavior of individuals and their interactions.
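As a sketch of how such a graph G = (U, V) might be represented in code, consider the following Python snippet. The adjacency-dictionary design and the fictive user names are illustrative choices of mine, not the data model used later in this work.

```python
# Sketch of a graph G = (U, V): a set of nodes U and edges V ⊆ U × U,
# stored as an adjacency mapping. The nodes and edges here are fictive.
def build_graph(nodes, edges):
    adjacency = {u: set() for u in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)  # undirected: an interaction links both users
    return adjacency

def degree(adjacency, node):
    # The degree of a node is the number of connections it takes part in.
    return len(adjacency[node])

# Fictive interaction network between three users
graph = build_graph(
    nodes=["user_a", "user_b", "user_c"],
    edges=[("user_a", "user_b"), ("user_b", "user_c")],
)
print(degree(graph, "user_b"))  # → 2: user_b is linked to user_a and user_c
```

Measures such as the degree computed here are among the structural properties that social network analysis examines, and they reappear in the descriptive analysis of chapter 4.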

The Essence of Computer Science

Having established a shared understanding of social science theory and methodology, and their interrelationship, we now turn our attention to computer science. While it may seem that computer science centers around the computer as an object, this is only partially true. Although it is common to see computer scientists working on their machines and writing code, the central focus of computer science is not the computer itself, but rather the study of computational systems and the transmission of information. In some regions, computer science is referred to as informatics, a name which emphasizes the study of computational systems and information transmission.

As George Johnson, a science journalist, stated, "Computer Science is not about computers,
any more than astronomy is about telescopes or biology is about microscopes. These devices
are tools for observing worlds otherwise inaccessible. A computer is a tool for exploring the
world of complex processes, whether they involve cells, stars, or the human mind." [31].

The aim of the above quote is to conceptualize Computer Science as a field that enables the
observation of various phenomena through the use of computation. In this context, computation refers to a systematic, rigorously defined manipulation of symbols according to a
finite set of rules.

Social Media: A Computer Science and Social Science Perspective

The term "social media" refers to a set of internet-based applications. In common language,
the term is often used ambiguously to describe web applications with a social component.
However, a clear definition is not straightforward. Carr et al. worked precisely on this
issue and proposed a definition of social media as "internet-based channels that allow users to
opportunistically interact and selectively self-present, either in real-time or asynchronously,
with both broad and narrow audiences who derive value from user-generated content and
the perception of interaction with others" [32, p. 8].


The preceding definition contains two key terms that should arouse the interest of both social scientists and computer scientists. These terms are "internet-based" and "interaction".
Given that any social medium, according to this definition, takes place on the internet, it is
reasonable to conclude that it is an information transmission system of the kind Computer
Science is interested in. Furthermore, the term "interaction" suggests that the system reflects
some of the social actions and structures that the social sciences are interested in.

Social media, defined as internet-based platforms for user interaction, have become a popular
topic for researchers in both Computer Science and social sciences. Computer scientists are
interested in the technical aspects of social media platforms, such as algorithms and network
structures [33], while social scientists are interested in the ways in which these platforms
reflect and shape social interactions and structures [7].

This overlap highlights the potential benefits of increased collaboration between the two
disciplines. Social scientists require the expertise of computer scientists to analyze social media
data and to understand the behavior of networks, while computer scientists can gain insights
into the network structures present on social media and the way they manifest. As social me-
dia platforms continue to evolve and play a growing role in society, interdisciplinary research
between Computer Science and social sciences will remain essential for understanding their
technical and social dimensions.

2.2 TikTok

After this discussion on the relevance of social media data for social science and computer
science, we will now narrow our focus. It is important to remember that this thesis will
examine a specific social media platform: TikTok. Therefore, before delving into the formal
and technical aspects of this thesis, it is crucial to gain a better understanding of the general
context of TikTok.

The TikTok app evolved from the Chinese app Douyin, developed by ByteDance in 2016.
Douyin allowed users to create and share short videos as well as to consume content from
other users. In September 2017, ByteDance released an international version of the app under
the name TikTok. The app was a worldwide success and, by 2021, was ranked as the world’s
most visited website, ahead of Google. One measure of its success is that in many countries,


mobile phone contracts now offer an option to include TikTok, allowing users to extensively
use the platform. The fact that mobile carriers have recognized this as a significant factor in
users choosing their contract is noteworthy [8].

Upon opening the TikTok app or website, users are immediately directed to a page with a
feed of content, as shown in fig. 2.3.

Figure 2.3: TikTok’s ’For You Page’

Figure 2.3 depicts the TikTok homepage and three key components, labeled 1-3. The ’For
You page’ (1) displays personalized content in the form of short videos. An example of the
content is shown in (2), where the video is the centerpiece. The video automatically starts
playing when all technical requirements (e.g. browser compatibility) are met and sound
can be activated by clicking the sound button in the bottom right corner. Users can directly
interact with the content through likes, comments, or sharing. To do so they need to be logged
in, which they can do in the upper right corner (3).

The centerpiece of TikTok is thus the content shown in (2). Figure 2.4 illustrates the
anatomy of such a content piece.



Figure 2.4: Anatomy of a TikTok Post

As can be seen in fig. 2.4, each video is posted by a user, and to the right of
the video, other users can see the comments made on the video. The video is accompanied
by a description that the user writes when posting it. Within this description, the user can
include hashtags, which are words or phrases preceded by the ’#’ symbol. Hashtags are used
to group and categorize content by topic or theme, and by clicking on a hashtag, users can
access a page that presents all the videos that contain that particular hashtag.

Moreover, the uploader of the video can mention other users by writing ’@’ followed by their
username. When a user is mentioned, they receive a notification, and potential users can visit
the mentioned user’s account by clicking on their name. The account contains all the content
that the user has posted.
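Since hashtags are marked with '#' and mentions with '@', both can be pulled out of a video description with simple pattern matching. The following sketch uses Python regular expressions; the character classes are assumptions and may not match TikTok's exact rules for valid hashtag or username characters.

```python
import re

# Simplified sketch: extracting hashtags and @-mentions from a post
# description. The character classes are assumptions, not TikTok's
# documented rules.
def extract_tags(description: str) -> dict:
    hashtags = re.findall(r"#(\w+)", description)
    mentions = re.findall(r"@([\w.]+)", description)
    return {"hashtags": hashtags, "mentions": mentions}

result = extract_tags("Great duet with @escher! #fyp #art")
print(result)  # {'hashtags': ['fyp', 'art'], 'mentions': ['escher']}
```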

Under the description, there is a link to the music page that corresponds to the music present
in the video. Clicking on this link leads the user to a page where all the videos containing the
same music are stored, similar to clicking on a hashtag.

The purpose of explaining the TikTok page is to give an overview of its features and functions
and to identify potential areas for deeper investigation into user behavior on the platform. The
TikTok homepage is complex and the videos posted are complex messages with underlying
intentions. Investigating the page carries the danger of getting lost in the details and not


seeing the big picture. To avoid this, I constantly asked the question ’What is possible?’
within the given time limit of the thesis. This question served as a guide to filter out elements
and focus on what is achievable. The solution proposed in this work was not arrived at
through a top-down approach, but instead through exploring what data is available and what
data would be useful for a social scientist. The filtering mechanism will further be enhanced
in subsequent chapters to narrow down the focus to the most relevant information.

2.3 Technical Context

In the previous section, I provided an overview of TikTok. As previously mentioned, one
objective of this study is to gather data for social research. From my point of view, there are
two approaches to achieve this: top-down and bottom-up.

As a top-down approach, I had intended to define the trace of social behavior present on
TikTok and then access the relevant data. However, this approach was not feasible due to the
strict limitations on access imposed by TikTok. As we will see later in this chapter, not all of
the platform’s data can be obtained through web scraping. A tool used for research purposes
must ensure that it remains within legal boundaries.

The bottom-up approach, as the name suggests, works in reverse order. This approach in-
volves first exploring the data that is accessible and the way that social scientists and com-
puter scientists currently treat it. Based on these explorations, I will propose a theoretical
model and adjust the data collection accordingly.

To execute this approach, I will begin by providing an overview of web scraping and how
websites restrict access to their data. I will also examine the legal perspectives on data own-
ership and ethics. In the following section I will then build upon the results of this chapter to
establish definitions that will guide the implementation of the web scraper.

To introduce the reader to the concept of web scraping, I'd like to recall a famous quote
by science fiction writer Arthur C. Clarke: "Any sufficiently advanced technology is indistinguishable from magic." When we access a social media platform like TikTok through our
smartphones, tablets, or computers, we enter a seemingly infinite universe that we can ex-
plore and interact with, but where does all the information come from? The simple answer


is: the internet. In order to formulate a more sophisticated answer, I suggest considering what
happens when we type ’www.tiktok.com’ into the input field of our browser and hit enter.
Our computer sends a request to the router asking for information at that address. The router
forwards the request to the next server, which sends it into the right direction until it reaches
the server holding the information. The server validates the request’s permissions and, if
approved, sends the corresponding data back.

The Google Chrome browser provides a convenient way to see what's happening behind the
scenes. Once the browser is open, right-click on the page and choose 'Inspect'. This opens a
window next to the previous view and allows you to select the Network tab at the top. If
you visit 'www.tiktok.com', for example, you'll see a flurry of activity in the network section of the inspect window. This view reveals a bit of the "magic" that the average internet
user does not get to see or need to know about. In the window, one can see and inspect all the files
that the server sent back in response to the request made by typing the URL into the
browser.
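The request half of this exchange can also be illustrated in code. The following sketch uses Python's standard library to assemble, but not actually send, an HTTP request for TikTok's homepage; the User-Agent string is a made-up example.

```python
from urllib.request import Request

# Sketch of what the browser assembles before the data is "magically"
# delivered: an HTTP request with a URL and headers. The request is only
# constructed here, not sent over the network.
req = Request("https://www.tiktok.com/",
              headers={"User-Agent": "my-research-bot/0.1"})

print(req.full_url)                  # https://www.tiktok.com/
print(req.host)                      # www.tiktok.com
print(req.get_header("User-agent"))  # my-research-bot/0.1
```

Sending the request (e.g. with `urllib.request.urlopen`) would return the response data that the inspect window displays.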

With each request, we receive data from the server that is temporarily stored on our machine.
When we visit the TikTok website, we may not be directly connected to the homepage at all
times, but the data for what we see in our browser is temporarily stored in our local machine's
memory.

If one wants to research TikTok, one could visit relevant pages and take note of the infor-
mation. However, since the corresponding files are temporarily present on the researcher’s
machine, they can write a program, known as a scraper, to automate the process. Usually,
the data of interest is similar and can be found at URLs that can be constructed from previ-
ously scraped data. These URLs can be constructed using a fixed set of rules and can also
be generated by a program, known as a web crawler. In practice, the concepts of crawling
and scraping are often used interchangeably, but as mentioned before, they do have different
meanings. 3
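The crawling idea, constructing new URLs from previously scraped data using a fixed set of rules, can be sketched in a few lines. The URL patterns below mirror TikTok's public hashtag and user pages but should be treated as assumptions, not a documented API.

```python
# Illustrative sketch of the crawler idea: new URLs are constructed from
# previously scraped data by a fixed set of rules. The URL patterns are
# assumptions based on TikTok's public pages.
def hashtag_url(tag: str) -> str:
    return f"https://www.tiktok.com/tag/{tag}"

def user_url(username: str) -> str:
    return f"https://www.tiktok.com/@{username}"

# Suppose a previously scraped post description yielded these hashtags;
# the crawler turns them into the next URLs to visit (the "frontier").
scraped_hashtags = ["fyp", "science"]
frontier = [hashtag_url(t) for t in scraped_hashtags]
print(frontier)
```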

As the process of scraping simply automates tasks that an internet user can perform manually,
it can be argued that it is not illegal or unethical. However, there is an ongoing debate on this
topic, as summarized by Krotov in his work on the legality of web scraping [34].

However, the legality and ethics of scraping in the context of social media are even more

3 Throughout this work, the term ’scraping’ will include the process of crawling unless otherwise specified


controversial. Social media users produce the majority of the content that drives the value of
these websites, making them valuable objects for researchers and businesses. However, the
users may not be aware that their actions are being analyzed and understood through these
means.

Accessibility to web scrapers is a double-edged sword4 for social media platforms. On
one hand, it is a necessary condition for enhancing a website's popularity, since search engines
like Google use scraping to access the website and make it appear in search results. On the other hand,
allowing scrapers to access their data could potentially reveal sensitive information that users
may not want to be public. Platforms have implemented measures to make it harder for
web scrapers to access their data in an automated manner, but this balancing act between
accessibility and privacy remains a challenge.

To gain a better understanding of how TikTok manages web scraping on their website, I consulted their community guidelines. In these guidelines, they mention web scraping in the
following sentence: "Do not use automated scripts, web crawling, software, deceptive tech-
niques, or any other method to attempt to obtain, acquire, or request login credentials or
other sensitive information, including non-public data, from TikTok or its users"[35]. Our
scraping program adheres to these guidelines since it does not target login credentials and
only accesses data from public accounts. Additionally, we follow TikTok’s robots.txt file.

The robots.txt file resides on most professional websites and indicates which data a website allows web scrapers to access. The file is intended to instruct web scrapers and is
typically read by programs, not by human users. Its location is fixed, at maindomain/robots.txt, and it has a standardized structure. In the case of TikTok, we can examine its
robots.txt file to understand its stance on web scraping.
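Python's standard library can check URLs against such rules directly. The sketch below feeds a simplified, made-up rule set to `urllib.robotparser`; a real scraper would instead fetch TikTok's actual file from www.tiktok.com/robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Sketch: checking URLs against robots.txt rules with the standard
# library. The rules below are a simplified stand-in, not TikTok's
# actual robots.txt file.
rules = """
User-agent: *
Allow: /music
Allow: /tag
Disallow: /auth
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://www.tiktok.com/tag/fyp"))     # True
print(rp.can_fetch("*", "https://www.tiktok.com/auth/login"))  # False
```

A well-behaved scraper consults these rules before requesting any URL.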

4 The term 'double-edged sword' refers to a situation where a single action can have both positive and negative
consequences for its performer.


Figure 2.5: Screenshot of TikTok's Robots.txt File Accessed via www.tiktok.com/robots.txt

Fig. 2.5 displays TikTok's robots.txt file. The red rectangles have been added for clarification.
The first rectangle shows the user agents that are not allowed to access any content on the
page. These user agents correspond to certain businesses that TikTok wants to restrict from
accessing their website in an automated manner. In the second rectangle, we see all URLs
that TikTok permits all programs except those specified in rectangle 1 to access. In particular,
we can see that crawlers are allowed to access the music and hashtag URLs. The
third rectangle lists all URLs for which TikTok prohibits any scraping activity, including all
auth URLs which can only be viewed from a logged-in perspective. This means that content
that requires a user to be logged in cannot be scraped. The fourth rectangle lists the sitemaps
intended for the Google web crawler, which are not relevant to this work5 .

By introducing the concept of web scraping and analyzing the robots.txt file, we have gained
a better understanding of the information that can be accessed by means of web scraping on
TikTok, including public user accounts, videos, and content related to specific hashtags or
music tracks. With this knowledge, we can now move on to the final sections of this chapter,
which aim to familiarize the reader with the concept of graphs and to translate the insights we
have gained into a theoretical model of TikTok.
5 They indicate to the Google crawler which URLs are particularly 'important' and link to XML files
where these URLs are listed, as there are too many to include in the robots.txt file.


2.4 Graphs

We have now established a solid understanding of both the TikTok platform and the data we
can access by means of web scraping. But simply accessing the data does not achieve the
objectives of this work. The task that I aim to tackle within this work is to bring this data into
a form suitable for analysis from a social science perspective. I want to argue that the best
way of doing so is to store the data directly in the form of a graph. I already mentioned and
briefly defined the term when talking about social network analysis in section 2.1, but did
not go into detail, because the concept is so central to this work that it deserves to be treated
in a proper section. The goal of this section is to introduce, or remind the reader of, the notion
of a graph. I will motivate and introduce the formal definition of a graph and explain it in detail.
I will then explain the usefulness of these mathematical objects in social network analysis.
Finally, I will explain how algorithms can be used to better understand them and the
structures they represent.

Mathematical objects are commonly used to model happenings in the real world. We
observe something and want to find an accurate description that captures as much of the
phenomenon as possible. One might agree that oftentimes a phenomenon can be well
described by a model that includes the entities that participate in the action of
interest and the way in which they participate. Let's take an example. Imagine that Escher,
Gödel, and Curie are distinguished TikTok users. Escher and Gödel
have each posted a video, vE and vG respectively. We additionally know that Gödel watched6 the video of
Escher. Escher, in turn, watched Gödel's video vG . Curie watched both vG and vE
but did not post any video of her own. Now I suggest the reader try the small exercise of
describing the information above in some kind of visualization and then compare his or her
solution with the one I propose in fig. 2.6.

6 Clearly, the information whether a user watched a video or not is not something I aim to include in the data.
Even if this information were possible to obtain, collecting it would be highly unethical, since whether
someone watched a video is a private matter. However, I found this example quite intuitive.


Figure 2.6: Example for a Model of a Fictive TikTok Interaction

What we see in this picture is already a somewhat sophisticated visualization of a graph. It
is sophisticated because we have different kinds of relationships between the entities, i.e.
'posted'-relations and 'watched'-relations, and the connections between the entities have a
direction, indicated by arrows. In order to introduce the object of a graph properly, we will
simplify the above illustration by distinguishing neither between different kinds of node
types nor between different kinds of connections and their directions. Doing so leads to the
graph shown in fig. 2.7.


Figure 2.7: Simplification of the Fictive TikTok Interaction

The entities, in our case the users and the videos, are in computer science often
referred to as nodes. The connections, represented as lines between the nodes, are called
edges. A graph whose edges do not have a direction can be described by simply listing
all the nodes and all the edges between them. Accordingly, we can formally describe
the graph visualized in fig. 2.7 by stating that the graph is Gsimple = (N, Esimple )7
where

N = {Goedel,Curie, Escher,Vg ,Ve }


Esimple = {{Goedel,Vg }, {Goedel,Ve }, {Escher,Vg }, {Escher,Ve }, {Curie,Vg }, {Curie,Ve }}
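Written out in Python, the same undirected graph looks as follows; frozensets are used for the edges so that an edge {a, b} equals {b, a}:

```python
# The undirected graph G_simple = (N, E_simple) from the example.
# Undirected edges are frozensets, so {a, b} == {b, a}.
N = {"Goedel", "Curie", "Escher", "Vg", "Ve"}
E_simple = {
    frozenset({"Goedel", "Vg"}), frozenset({"Goedel", "Ve"}),
    frozenset({"Escher", "Vg"}), frozenset({"Escher", "Ve"}),
    frozenset({"Curie", "Vg"}),  frozenset({"Curie", "Ve"}),
}

# Every edge connects two nodes of N:
assert all(edge <= N for edge in E_simple)
print(len(N), len(E_simple))  # 5 6
```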

We can use a similar way of describing the more sophisticated graph visualized in fig. 2.6.
Similar to the simple graph we describe the graph as Gsoph = (Nsoph , Esoph ). In this case, we
account for the existence of node types by stating that Nsoph = {V,U}, where V is the set of
videos and U is the set of users, i.e. V = {Vg ,Ve } and U = {Goedel, Escher,Curie}. Esoph

7 This example hopefully also illustrates how I introduced the graph in section 2.1 as generally just a set of
nodes and edges.


now accounts for the different edge types in a similar manner. Additionally, the direction
within the graph is accounted for by just using tuples instead of sets to represent one edge.8
Thus we can represent the sophisticated graph by setting Esoph = {W, P}, where W are the
'watched'-relations and P are the 'posted'-relations, i.e.

W = {(Goedel,Ve ), (Escher,Vg ), (Curie,Vg ), (Curie,Ve )}


P = {(Goedel,Vg ), (Escher,Ve )}
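The same typed, directed graph can be written out in Python, with node types kept in separate sets and each edge type as a set of ordered pairs:

```python
# The directed, typed graph G_soph from the example: node types in
# separate sets, each edge type as a set of ordered (user, video) pairs.
V = {"Vg", "Ve"}                   # videos
U = {"Goedel", "Escher", "Curie"}  # users

watched = {("Goedel", "Ve"), ("Escher", "Vg"),
           ("Curie", "Vg"), ("Curie", "Ve")}
posted = {("Goedel", "Vg"), ("Escher", "Ve")}

# Directions matter: every edge here points from a user to a video.
assert all(u in U and v in V for (u, v) in watched | posted)
```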

Generally, a multidimensional graph can be defined in the following manner:

Definition 1 A multidimensional directed graph G = (N, E) is a tuple, where N = {N1 , · · · , Nk }


is a set of sets in which each Ni contains a finite number of nodes, and E = {E1 , · · · , E j } is a set
of sets, where each set within E contains a finite number of edges between the nodes that are
contained in the Ni , for 1 ≤ i ≤ k.

Graphs are widely used in both social science and computer science due to their usefulness in
summarizing information as I tried to show with the small example in the former paragraphs.
Additionally, graphs are well-known objects in math and computer science with known prop-
erties and applicable algorithms. While it is not feasible to describe all of these algorithms, I
will mention two types of questions that can be answered by applying algorithms to graphs.

The first type of question is: which nodes are important in a given graph? These types of
questions are typically answered using node centrality measures. Different algorithms can be
used to calculate centrality measures, depending on what counts as 'important' in a given
scenario. One simple way to calculate centrality is by summing the incoming and outgoing
edges of a node, which will be used as a showcase in this work. This measure is often referred
to as the degree of a node.
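For the example graph from the previous section, the degree can be computed in a few lines (the edge set repeats the fictive Escher/Gödel/Curie data):

```python
# Minimal sketch of degree centrality: count the edges incident to a
# node, using the directed example graph (fictive data).
edges = {("Goedel", "Ve"), ("Escher", "Vg"), ("Curie", "Vg"),
         ("Curie", "Ve"), ("Goedel", "Vg"), ("Escher", "Ve")}

def degree(node: str, edges: set) -> int:
    # Sum incoming and outgoing edges that touch the node.
    return sum(node in edge for edge in edges)

print(degree("Curie", edges))  # 2
print(degree("Vg", edges))     # 3
```

By this measure, the video Vg is the most 'important' node in the example, as it has the highest degree.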

The second type of question is: which nodes likely belong together? These types of questions
are typically answered with community detection algorithms. While these algorithms are
slightly more complicated, the tools introduced in this work will allow for easy application

8 Sets are unordered collections of unique elements, often denoted by enclosing their elements within curly
braces: {}. Tuples are ordered collections that can contain duplicates, often denoted by enclosing their
elements within parentheses: ().


of community detection algorithms. However, since the main goal of this work is not the
analysis of collected data, I will not apply these algorithms within this work.

2.5 Theoretical Model

The previous sections provided insight into the types of questions that will be asked of the
data and the methods that may be used to answer these questions. Furthermore, technical,
ethical and legal considerations were taken into account to identify the data that can be ac-
cessed using a scraping tool. The current section focuses on establishing definitions and a
theoretical model that align with these insights. The goal is to ensure that the terms used
to represent the data generated by this thesis are well-defined and unambiguous, enabling
researchers to define concepts and variables with precision and clarity. By doing so, a shared
understanding of terminology within the research community can be promoted, minimizing
the potential for confusion and facilitating the replication of research findings. Addition-
ally, robust definitions can support the use of mathematical and statistical methods to analyze
social science data, leading to more accurate and insightful results [36].

I understand that some readers with a background in social sciences may find the use of
mathematical symbols in the definitions to be a barrier. However, I still believe that clear
definitions are important and that symbols can help convey the definitions more accurately.
To make this work accessible to a wider audience, I will provide explanations of the most
central definitions in non-mathematical terms. This way, an audience without a background in
formal mathematics should be able to understand all concepts presented in this work.

The purpose of this chapter is to find definitions of key concepts related to human use of
social media, particularly TikTok, that allow us to clearly outline what will be measured using
the tool. The logic here is not that when we define a term X as Y, X is Y and only Y in all
circumstances. Instead, in the context of this work, I aim to understand X as Y and only Y.
Everything that follows in the context of understanding X as Y can then be considered true,
as long as one is willing to understand X as Y.

In section 2.1, we introduced the concept of Coleman's bathtub, a framework which
social scientists can use to comprehend social phenomena. Therefore, we will proceed to
define the fundamental entities that comprise the micro-macro model.


Definition 2 An individual, represented by the symbol I, is an abstract object. A set of multiple individuals is denoted as Iˆ, while the unique set of all existing individuals is represented
by I .

Defining the individual in this manner allows for various research purposes. For example,
one may be interested in an individual's beliefs and can therefore specify the abstract
object so that it represents those beliefs. Individuals are the entities that interact on the
micro-level of Coleman’s bathtub. Now, we must define the macro-level entity, which is
society.

Definition 3 A society St at a point in time t is a tuple (Iˆt , Rt ) where Iˆt is a set of individuals and Rt = {(ia , ib , ω) | ia , ib ∈ Iˆt , ω ∈ R}n is a set of n possible relationship types between all
individuals. We will set It ∈ St ↔ It ∈ Iˆt .

This definition aligns with the understanding of society as a network, as suggested by social
network theorists. Essentially, we have defined society as a collection of individuals and the
relationships between them. As before, we have not specified the relationship type because
it will depend on the research objectives. With the definition of individuals and society, we
now have entities for both the micro and macro levels within Coleman’s framework. Our
next step is to consider the transitions between different states within the models. We will
begin by examining the micro-level, where individuals act according to the theory of social
actions. Therefore, we will proceed to define what is meant by this.

Definition 4 An action a by an individual It ∈ Iˆt at time t is a mapping a : (It , Rt , O) →


Rt+1 , where Rt+1 represents the set of relationships between individuals in the society St+1 .
The mapping a defines how the action of an individual It at time t affects the relationships
between individuals in the society St+1 , and O is a set of other specified factors the action
might depend on.

In simple terms, this definition specifies that an action is determined by an individual’s cur-
rent relationships and other unspecified factors. I have defined the resulting action as a change
in the individual’s relationships within society, as this is a crucial factor in social sciences.
Based on this definition, we can now proceed to define the process at the macro-level as
follows:


Definition 5 A social interaction f is a function from the two-dimensional space of actions and
relationships to the space of relationships, thus St+1 = f (St , At ) where St is a society at
point t and At is a set of actions within this society. We will denote the set of all possible
social interactions at time t by Ωt .

With this understanding of social interactions, we can account for transitions on the micro as
well as on the macro level. Now, we will proceed to define how we interpret a social medium
in this context.

Definition 6 We will define a social medium as a function fSM : Ωt → SMt , where SMt is
some abstract space at time t.

This definition of a social medium states that each possible social interaction is mapped
to exactly one element in an abstract space.9 This approach allows us to capture the traces
left behind by social interactions within a social medium. To account for different social
media, we can specify the structure of the abstract space accordingly. In the previous chapter,
we explored the type of information that can be accessed through web scraping on TikTok.
Therefore, we will define these entities next.

Definition 7 A social media user is a surjective submapping U ⊂ fSM : It → SMt .

Thus, we have defined users as representations of individuals within society, which intuitively
makes sense. By stating that this map is surjective, we have accounted for the fact that each
user necessarily has one or multiple individuals associated with it. We have chosen to define
the user as a mapping instead of a function to account for the fact that an individual or a
group can be associated with multiple social media accounts.

Definition 8 A hashtag is a tuple (s, m) where s is a sequence of characters such that
s(0) = # and m is the meaning of the word s(1 : length(s)). We denote the set of all existing
hashtags as H . H will denote sets of hashtags.
9 It can be argued that not every relationship and individual is present on social media, but we can account for
this fact by introducing a null element within SMt , to which all elements that are not present on social media
are mapped.


Definition 9 We will define a music track as a set M = {A, T, MI}, where A is some representation of the artists who produced the music, T is the associated title, and MI is any
additional information associated with the music.

Now we have all the necessary components to define what a post is.

Definition 10 A TikTok post P is a set {Ut (It ), H, M,Vdes , l, c,C, ME} where

• Ut (It ) is the unique user representation of some individual or group of individuals


associated with the post.

• Vdes is a string representing the video description.

• H is a set of hashtags.

• M is a music track.

• l is an integer representing the number of likes.

• c is an integer representing the number of comments.

• C is the set of all comments 10 associated with the video.

• ME is a list of users mentioned in the post

We will denote the set of all posts at time t as Pt .
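Definition 10 translates almost directly into a data structure. The following dataclass is my own illustrative rendering; the field names are shorthand for the symbols in the definition, and the sample values are invented.

```python
from dataclasses import dataclass, field

# Illustrative Python rendering of Definition 10 (field names are my own
# shorthand for the symbols in the definition; comments C are omitted,
# as in the collected data).
@dataclass
class TikTokPost:
    user: str          # U_t(I_t): unique user representation
    description: str   # V_des: the video description
    hashtags: set      # H: set of hashtags
    music: tuple       # M = (artist, title, additional info)
    likes: int         # l: number of likes
    comment_count: int # c: number of comments
    mentions: list = field(default_factory=list)  # ME: mentioned users

post = TikTokPost("escher", "tiling the plane #art", {"art"},
                  ("Bach", "Fugue", None), 42, 3, ["goedel"])
print(post.likes)  # 42
```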

This now gives us the opportunity to define TikTok by specifying the structure of SM.

Definition 11 The TikTok web application at time t is a social medium where SMt := Pt

Now we will need to define what we understand as a discourse present on TikTok. This will
be the most crucial definition since it constitutes how we model our data. Thinking back
10 Since in this work I do not collect the comments, I leave out a precise definition. However, future work might
add this functionality and might suggest a definition.


to the multidimensional graph introduced in the former section and the knowledge that we
can collect posts, music, and users leads us intuitively to the following definition, which also
constitutes the theoretical model of a TikTok discourse that I propose in this work.

Definition 12 A TikTok discourse at time t is a multidimensional graph T Dt = (N, E) with


N = {P, H, M,U} where

• P = {p1 , ..., pn } is a set of posts.

• H = {h1 , ..., hm } is the set of hashtags so that hi ∈ p j for some j.

• M = {m1 , ..., ml } is the set of music tracks so that mi ∈ p j for some j.

• U = {u1 , ..., uk } is the set of users so that ui ∈ p j or ui ∈ ME ∈ p j for some j

Additionally E = (PO, MEN, IN) where

• PO is the set of posted relations, i.e. (ui , p j ) ∈ PO ↔ ui is the posting user of p j

• MEN is the set of mention relations, i.e. (pi , u j ) ∈ MEN ↔ u j ∈ ME ∈ pi

• IN is the set of include relations regarding hashtags and music, i.e. (pi , h j ) ∈ IN ↔
h j ∈ H ∈ pi and (pi , m j ) ∈ IN ↔ m j ∈ M ∈ pi

This definition can be understood as an adaptation of the example introduced in the former
section. It might become clearer with the presentation of the data in section 4.1, and
especially fig. 4.3 within that chapter.
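To illustrate Definition 12, the sketch below assembles the node sets N and edge sets E of a discourse graph from a single, invented post record:

```python
# Sketch of Definition 12: building the node and edge sets of a TikTok
# discourse graph from one scraped post (all values are hypothetical).
post = {"id": "p1", "user": "escher", "hashtags": ["art", "fyp"],
        "music": "m1", "mentions": ["goedel"]}

# N = {P, H, M, U}: posts, hashtags, music tracks, users.
N = {"posts": {post["id"]},
     "users": {post["user"], *post["mentions"]},
     "hashtags": set(post["hashtags"]),
     "music": {post["music"]}}

# E = (PO, MEN, IN): posted, mention, and include relations.
E = {"PO":  {(post["user"], post["id"])},
     "MEN": {(post["id"], u) for u in post["mentions"]},
     "IN":  {(post["id"], x) for x in post["hashtags"] + [post["music"]]}}

print(sorted(E["IN"]))  # [('p1', 'art'), ('p1', 'fyp'), ('p1', 'm1')]
```

Scraping more posts and merging their node and edge sets grows this structure into the full discourse graph.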

Chapter 3

Implementation

The previous chapter established the objective of generating TikTok data and developed a
theoretical model for the form of the data for social science research purposes. In this chapter,
I will elaborate on how I put the theoretical model into practice. First, I will introduce the
concept of an application programming interface and the way it relates to my program. Then,
I will explain how I utilized the web crawling framework Scrapy for this project. Following
that, I will introduce the reader to the graph database Neo4j which was used to store the data.
Finally, I will briefly touch upon the use of Docker as a tool for enhancing the accessibility
and usability of my tool. Once all these components have been introduced, I will delve into
the algorithms I used to implement the scraping functionality.

3.1 Application Programming Interface

In the previous chapters, I stated that two out of my three primary goals for this work were
to create a web scraper for TikTok data and to ensure that the program is easily accessible.
While this may suggest that I aimed to develop an API, in this section, I will explain why
this is not the case. To do so, I will first define what an application programming interface is. Then, I will
differentiate the process of implementing a web scraper from that of implementing an API.

An API can be thought of as a program that is not meant to be used by an end user but by other computer programs. Originally, web APIs were all sorts of programs that simplified complex interactions within the web.


One metaphor that is often used to illustrate what an API does is that of a waitress in a restaurant. Imagine going to a restaurant. When you sit down, the waitress will come and ask for your order. The waitress then passes the order to the chef in the kitchen, who prepares the food. As soon as the food is finished, the waitress brings it to your table. In this story, the restaurant visitor corresponds to a program, the waitress corresponds to the API, and the chef corresponds to a server. In this process, the restaurant visitor/program does not need any knowledge of the complex cooking/server interaction that is necessary to generate the food/data he or she receives [37], [38].

Creating an API to provide programs with easy access to TikTok data in a digestible format
would be an ideal solution to the problem at hand. However, implementing such an API
is beyond the scope of this bachelor thesis. The reason is that building an API requires an
existing data structure that can be easily accessed. Although TikTok has an API of its own,
it is costly to use and not realistic for most research projects. In short, the web scraper
developed in this work extracts data from TikTok and stores it in a database, while an API
would provide an interface for developers to programmatically access data on TikTok.

3.2 Scrapy

We have now reached a point in this work where we have clearly outlined the development
objectives, i.e., the program requirements we need to fulfill. Our next step is to identify ways
to address these requirements. The good news is that in the field of computer science, we
rarely need to start from scratch. There are usually existing tools available that can be utilized
to solve problems similar to the ones we’re working on. However, it is crucial for developers
to understand these tools and apply them thoughtfully to ensure their effectiveness.

There are numerous libraries available for web scraping, including well-known options like
Selenium and Beautiful Soup, each with its own advantages and disadvantages suitable for
different purposes. In this work, I decided to use the web scraping framework Scrapy. In
this section, I will explain why I chose Scrapy for my use case and how it compares to other
popular web scraping tools like Selenium or Beautiful Soup. While there may be other tools
available, I don’t claim that my choice is the best, but rather a good one.

Scrapy is a web scraping framework that was developed specifically for Python, which is why I chose Python as the main programming language for this project. The fact that Python is well integrated, particularly in scientific contexts, also reinforced my motivation to use it. Furthermore, Python's relatively straightforward usage made it feasible for a short-term project such as a bachelor's thesis.

As a programming framework, Scrapy provides a rigid set of contexts and methodologies for developers to use, in contrast to a programming library, which is simply a collection of pre-written code that can be reused by other programs or projects to simplify development and reduce redundancy. Unlike a programming library, a programming framework guides the programmer's design and architecture, reducing room for errors and messy programming.

Once a user initializes a Scrapy project, Scrapy automatically prepares a program structure with the main components necessary for the scraping and crawling functionality. Each of these components comes in the form of a Python class1. The most important of these components are spiders, items, pipelines, middleware, and the settings. Once one has understood each component's purpose and how the components interact with each other in a crawling process, one has gained a solid understanding of the way Scrapy works. So let us look at them one by one and then walk through a scraping process to understand how Scrapy works2.

First things first: spiders. Spiders are at the heart of a Scrapy project, since they implement the main scraping functionality. There are different types of spiders for different purposes, but they all inherit3 from one main Spider class. The choice of name is no accident: this object can be thought of as a spider that crawls from site to site by making use of the existing cobwebs present on the internet. A Spider always has a name and a list of URLs to start the scraping process. These URLs can be configured to be passed as an argument to the Spider. Optionally, a Spider also has a list of domains that are allowed to be scraped. Additionally, a Spider must implement a method called 'start_requests'. This method should yield HTTP requests and determines how the scraping process starts. The responses to these requests are by default sent back to the 'parse' method inside the Spider. However, one can specify other callback methods for this purpose that can take additional parameters inside the Spider.
1 In Python, as in object-oriented programming languages in general, a class is a blueprint or template for creating objects that encapsulate data and behavior. It defines a set of attributes and methods that an object can have, allowing the developer to create and manage complex data structures and functionalities.
2 In the presentation of the tool, I loosely follow the Scrapy documentation, which is also a great resource if the reader wants to deepen their knowledge on some of the topics discussed in this section: [39]
3 In object-oriented programming, inheritance is a mechanism that allows new classes to be based on existing classes.


In the parsing method, it is possible to generate new requests based on collected responses, and the scraped data can be manipulated in any way imaginable.4 However, to promote maintainability and code readability, Scrapy recommends storing the data in items. These items are similar to normal dictionaries5, but their structure must be defined beforehand. Items can be thought of as data containers that Scrapy knows about before the scraping process begins. The parsing methods yield these items; yielding is similar to returning but allows for emitting multiple objects at different times within the function. Since the yield statement produces a generator object, each yielded value is treated as an element of a sequence that can only be iterated once.
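The yield semantics can be illustrated with a plain Python generator, independent of Scrapy (a toy example):

```python
def parse_items():
    # Each yield hands one item back to the caller as soon as it is produced.
    yield {"videoUrl": "https://example.com/v/1"}
    yield {"videoUrl": "https://example.com/v/2"}

gen = parse_items()        # nothing has run yet; gen is a generator object
items = list(gen)          # iterating the generator executes the body
assert len(items) == 2
assert list(gen) == []     # a generator is exhausted after one pass
```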

The item pipelines come into play after the data has been collected through the parsing methods and stored in items. To further process the data, one can specify a pipeline class and define specific methods that take an item as input. During the scraping process, Scrapy automatically sends the items yielded from the parsing methods to the defined methods within the pipeline class. This allows for data cleaning and for storage in a database or any other desired format.
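As a sketch of this mechanism, a pipeline is a plain class exposing hook methods that Scrapy calls by name; the class and the cleaning step below are hypothetical:

```python
class StoragePipeline:
    """Hypothetical item pipeline: Scrapy calls open_spider once at startup,
    process_item for every yielded item, and close_spider at the end."""

    def open_spider(self, spider):
        # In the real tool this would open a database connection;
        # here an in-memory list stands in for the database.
        self.stored = []

    def process_item(self, item, spider):
        # Clean the item (drop empty fields), then store it.
        cleaned = {k: v for k, v in item.items() if v is not None}
        self.stored.append(cleaned)
        return cleaned  # returning hands the item to the next pipeline stage
```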

Thus Scrapy already strongly suggests a way in which one should collect the data. However, one can customize much of this process through the settings. These settings are known to all the classes that participate in the crawling process. In this way, one can change many parameters and specify how the crawling process takes place.

To further customize the scraping process, one can specify middleware classes. These classes
can be used to modify the requests and responses as they flow through the Spider, allowing
for additional processing to be performed at different stages of the scraping process. For
example, one could use a middleware class to check for duplicates, modify the user agent
of requests, or implement proxy rotation. By specifying the order in which the middleware
classes should be executed, it is possible to build complex pipelines that transform the data
in a variety of ways. Within this project, a middleware was used to check for duplicates by
verifying if the requested URL was already in the database before allowing the Spider to
proceed.
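The duplicate check used in this project can be sketched as a small helper that such a middleware might wrap; here an in-memory set is a hypothetical stand-in for the database lookup:

```python
class SeenUrlFilter:
    """Tracks URLs that have already been requested. A downloader middleware
    could call is_new() in process_request and raise IgnoreRequest when it
    returns False; the real tool checks the Neo4j database instead of a set."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        # True the first time a URL is offered, False on every later offer.
        if url in self._seen:
            return False
        self._seen.add(url)
        return True
```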

It is important to note that the description of Scrapy provided here is not an exhaustive representation of the Scrapy architecture. The purpose is to outline the logic that one needs to understand when implementing a Spider, rather than to describe every detail of the process. While the classes mentioned in this text are the ones that a user is most likely to work with when building a Scrapy project, there are other important components involved in the crawling process that are not necessarily accessed by the user. In my own project, I did not use these components directly. Figure 3.1 shows the Scrapy architecture, which includes the engine, the actual heart of the architecture. Although it is the engine that performs the actions, for the purposes of understanding the process it is sufficient to imagine the interaction of the different components described in this section.
4 For example, one could write the data on a sheet of paper and bury it in the Sahara, given the appropriate resources.
5 In Python, a dictionary is a collection of key-value pairs, where each key is unique and mapped to a specific value.

Figure 3.1: Scrapy Architecture as illustrated in the Documentation [39].

After having explained the logic behind Scrapy, I will now elaborate on why I have chosen
it over other common web scraping tools, namely Beautiful Soup and Selenium.

Beautiful Soup is a parsing library rather than a dedicated web scraping tool, which makes
it suitable for extracting information from files after they have been retrieved. It is a highly
effective solution for this task, and its simplicity and ease of use make it an excellent option
for small-scale web scraping projects. For example, if the location of the desired data is
known and it is contained within a single domain, it may be unnecessary to develop a full-
scale web scraping project using tools like Scrapy. Instead, Beautiful Soup can be employed
to quickly and efficiently retrieve the necessary data.


Selenium is a software tool that enables developers to simulate browser interactions. Although it was originally created to test web applications, it has become a popular tool for web scraping due to its capability to handle JavaScript content. However, since Selenium relies on simulated browser interactions, it may not be the most efficient option for large web scraping projects.

Scrapy differs from Selenium and Beautiful Soup in that it is specifically designed to handle
large-scale projects. The framework is highly extensible, allowing developers to create cus-
tomized web scrapers tailored to their specific needs. Additionally, Scrapy is optimized for
efficiency, utilizing asynchronous requests6 and data processing to streamline the scraping
process.

Figure 3.2 summarizes these features and provides further evidence supporting the decision
to use Scrapy for this particular project.

Figure 3.2: Comparison of Scrapy, BeautifulSoup, and Selenium [40]

6 Asynchronous requests allow multiple requests to be sent and processed simultaneously, without requiring each request to wait for the previous one to complete. This can significantly improve the efficiency of web scraping by reducing the time spent waiting for server responses.


3.3 Neo4j

In section 2.4 we came to the conclusion that the most beneficial form in which we can bring the data is that of a graph. We also formulated our theoretical model based on this idea. However, the question remains how to practically bring the data into the form of a graph.
One option is to manually draw the graph using pen and paper after scraping the necessary
information using Scrapy. However, this method is time-consuming and may not be efficient.
An alternative solution is to use software such as Gephi, which is capable of taking data in
the form of a CSV-file and displaying it as a graph, and allowing the user to explore it further
by applying graph algorithms. However, Gephi heavily relies on the data being in the correct
format, and storing data in a CSV-file may not be the best approach. To address this issue, a
more scalable and flexible solution would be to store the data in a database.

The most commonly used databases are relational databases7. However, storing the data in this form would still make it necessary to transform the data into the form of a graph for each analysis.

An efficient solution to this problem would be to store the data directly in the format of a
graph. Fortunately, there exists a class of databases specifically designed for this purpose.
One such graph database is called Neo4j. Neo4j is a powerful graph database management
system that comes with a wide range of features and capabilities. Its core functionality as a
graph database is complemented by good integration with various programming languages
and extensive documentation. The software includes a user-friendly desktop interface that
allows for easy exploration of graphs using the intuitive Cypher query language, or through
a graphical interface. Additionally, Neo4j offers built-in support for various applications to
manipulate, analyze, explore and visualize the stored graph [41].
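To give a flavor of Cypher, the following hypothetical query would return the ten most-liked posts that include a given hashtag, assuming node labels and properties of the form produced later in this work:

```cypher
MATCH (p:Post)-[:INCLUDES]->(h:Hashtag {name: "lgbt"})
RETURN p.videoUrl, p.nrLikes
ORDER BY p.nrLikes DESC
LIMIT 10
```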

Throughout this work, I relied heavily on two applications within Neo4j: the normal user
interface for Neo4j Desktop and the graph application GraphXR [42]. I used the user inter-
face primarily during the development process to manipulate the graph. On the other hand,
GraphXR is a powerful data exploration and visualization platform that uses graph-based
representations to help users analyze and explore complex data sets. It offers interactive data
exploration, network analysis, and graph visualization, which helps to identify relationships

7 Relational databases are a type of database that organizes data into tables with predefined relationships between them, allowing for efficient and effective data management and retrieval.


and patterns within the data. Additionally, one can directly apply a wide range of algorithms to the stored graph to analyze it further. Thanks to its plugin for Neo4j Desktop, we can directly open the data in the GraphXR interface and conduct analyses without any need for transformations.

3.4 Docker

In the previous section, we discussed the use of Neo4j as a graph database for storing and
visualizing data for social science research. Neo4j offers a set of powerful tools for exploring
and analyzing data in graph form, which is particularly useful for social network analysis.
However, to make the web scraper tool easily accessible to other researchers, we must con-
sider its ease of use and distribution. To address this, we will introduce Docker as a solution
for making the tool more accessible. Docker allows for easy packaging, deployment, and
distribution of the web scraper, ensuring consistency and reproducibility across different en-
vironments, as well as efficient management and collaboration during tool development.

Docker is an open-source platform that enables developers to package, deploy, and run appli-
cations in containers. Containers are lightweight, portable, and self-sufficient environments
that include everything an application needs to run, such as system tools, libraries, and con-
figurations. They are isolated from the host system, ensuring consistency and reproducibility
across different environments.

Docker simplifies the process of creating, deploying, and running applications through con-
tainerization. With Docker, developers can create an application on their local machine,
package it in a container, and then run it on any other machine with Docker installed, without
worrying about the underlying infrastructure.

In the case of this work, Docker allows us to package the web scraper and its necessary dependencies, such as Python packages, libraries, and configurations, in a container. Simultaneously, we run the Neo4j database in another container and connect the two containers. In that way, the whole web scraping process can be run on any machine with just one command. This containerized version of the web scraper can be easily deployed and run on any machine with Docker installed. Additionally, Docker facilitates efficient management of the containerized web scraper and provides a centralized repository for sharing the tool.
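A minimal sketch of such a two-container setup as a Compose file; service names, image tag, and credentials are illustrative, not the exact configuration shipped with the tool:

```yaml
version: "3.8"
services:
  neo4j:
    image: neo4j:5           # graph database container
    environment:
      - NEO4J_AUTH=neo4j/secret
    ports:
      - "7474:7474"          # HTTP browser interface
      - "7687:7687"          # Bolt protocol used by the scraper
  scraper:
    build: .                 # image containing Scrapy and the spider
    depends_on:
      - neo4j
    environment:
      - NEO4J_URI=bolt://neo4j:7687   # containers reach each other by name
```

Running `docker compose up` would then start both containers with one command.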


3.5 Algorithmic Implementation

We have now described all the tools that we need for implementing the web scraper. This section will go into detail on how the data is actually collected.

I implemented a Scrapy spider which takes a TikTok URL as an argument when the object is initialized. Starting from this URL, the spider requests the HTML file from the server corresponding to this URL. From this file the spider collects 15 videos8. For each of these video pages the spider collects the URLs of the hashtag pages, the music pages, and the page of the user who posted the video, and visits them recursively in the same way as before. All the video pages found within this process are sent to the parse function, which extracts the relevant data. Listing 3.1 summarizes this algorithm and table 3.1 shows all the data extracted.

Listing 3.1: Pseudo Code of the Scraping Algorithm

def collectURLs(url) -> void:
    htmlresponse = htmlrequest(url)
    responseUrls = htmlresponse.getAllSignificantURLs()
    for respUrl in responseUrls:
        if isVideoUrl(respUrl):
            parse(respUrl)
        collectURLs(respUrl)

All the data which is parsed is then loaded into items as described in section 3.2 and then sent to the pipeline. Within this pipeline the data is sent to the Neo4j database. The variables 'videoUrl', 'music', 'userScreenname' and each hashtag are understood as ids for nodes and connected to each other as proposed in the theoretical model in chapter 2, and then sent to the database. Additionally, a timestamp is added to the nodes when they are added to the database. The other variables are understood as attributes of these nodes. Table 3.2 summarizes the resulting structure, which will be further explored, visualized, and analyzed in the following chapter.

8 This number corresponds to the number of videos that are sent as a response to a basic HTML request to the TikTok server.
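As a sketch of how the pipeline step can translate one item into the structure of table 3.2, the following hypothetical function builds parameterized Cypher MERGE statements; the real code may differ, and a Neo4j driver session would execute each pair with `session.run(query, params)`:

```python
def item_to_cypher(item: dict) -> list[tuple[str, dict]]:
    """Translate one scraped item into (query, parameters) pairs.
    Labels, properties, and edge types follow table 3.2; the function
    itself is an illustrative sketch, not the thesis code."""
    video = {"videoUrl": item["videoUrl"]}
    ops = [
        # Post node, keyed by its URL
        ("MERGE (p:Post {videoUrl: $videoUrl}) "
         "SET p.nrLikes = $nrLikes, p.nrComments = $nrComments, p.date = $date",
         {**video, "nrLikes": item["nrLikes"],
          "nrComments": item["nrComments"], "date": item["date"]}),
        # User node plus POSTED edge
        ("MERGE (u:User {userScreenname: $name}) "
         "MERGE (p:Post {videoUrl: $videoUrl}) "
         "MERGE (u)-[:POSTED]->(p)",
         {**video, "name": item["userScreenname"]}),
        # Music node plus INCLUDES edge
        ("MERGE (m:Music {name: $music}) "
         "MERGE (p:Post {videoUrl: $videoUrl}) "
         "MERGE (p)-[:INCLUDES]->(m)",
         {**video, "music": item["music"]}),
    ]
    # One INCLUDES edge per hashtag found in the post
    for tag in item["hashtags"]:
        ops.append((
            "MERGE (h:Hashtag {name: $tag}) "
            "MERGE (p:Post {videoUrl: $videoUrl}) "
            "MERGE (p)-[:INCLUDES]->(h)",
            {**video, "tag": tag}))
    return ops
```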


Variable Type Description


videoUrl String URL which corresponds to the video and is used as id
videoDescription String Description corresponding to the video
user String Unique identifier for each user
UserScreenname String Alternative to Username
nrComments Integer Number of comments below the corresponding video
nrLikes Integer Number of likes below the corresponding video
nrForwarded Integer Number of times the video has been forwarded
hashtags List of Strings Hashtags that are present in the video
music String Name of the music present in the video
date String Indicates the date the video was posted
mentionedUsers List of Strings Indicates the other users who are mentioned within the post

Table 3.1: Description of all Variables Collected for each Video Extracted in the Parse Func-
tion

Variable Role
Post Node
videoUrl Attribute and ID of Post
nrComments Attribute of Post
nrLikes Attribute of Post
nrForwarded Attribute of Post
date Attribute of Post
created9 Attribute of Post
INCLUDES Edge from Post to Hashtag or Music
MENTIONS Edge from Post to User

User Node
userScreenname Attribute and ID of User
username Attribute of user
created Attribute of user
POSTED Edge from User to Post

Hashtag Node
created Attribute of Hashtag

Music Node
created Attribute of Music

Table 3.2: Description of the Data as it is sent to the Database in the Pipeline

Chapter 4

Analysis

The last chapter explained the components and the algorithmic realization of the scraping
functionality. In this chapter I will now perform a runtime analysis to give researchers a
reference point in estimating the computing costs it will take to collect the data they desire.
Additionally I will perform a descriptive analysis myself in order to show how this tool could
be used.

4.1 Tool

In this section I aim to evaluate the tool's performance on an example scrape. I will first report the duration of the scrape and meta information about the collected data, and then use the timestamps of the data within the database to evaluate the tool's performance. I will also discuss problems that occurred during the collection.

The first scrape was executed with the start hashtag lgbt. The reason for this choice is that this work is directly connected to a project of Thibault Grison, who works on content moderation and uses the lgbt community as a showcase. The scraper was started on Monday (6 February 2023) at 20:00 and the scraping was exited manually on Tuesday (7 February 2023) at 21:00, which led to a total scraping time of 27 hours. In this time a total of 4980 items were collected. Figure 4.1 and fig. 4.2 show the amount of collected items over time during the collection.


Figure 4.1: Collected Posts per Hour

Figure 4.2: Accumulated Collected Posts over Time

Figures 4.1 and 4.2 demonstrate that the collection was relatively consistent. Although there are some fluctuations in the amount of data collected, as illustrated in fig. 4.2, the overall trend appears to be linear, with an average of 191 collected items per hour.


However, there were some apparent disruptions at 14 hours, which may have been caused by
poor internet connectivity or issues with the TikTok server. If we disregard the anomalies at
14 hours and 21 hours, the average number of collected items per hour increases to 203.

4.2 Descriptive Analysis of the Collected Data

In this section I will report the results of the descriptive analysis of the scraped data. This analysis serves two purposes. First, the results on general metrics can be seen as approximations of the real values present on TikTok. Secondly, the analysis is meant as a showcase of how the data can be explored and how the theoretical framework of chapter 2 has been translated into practice. Figure 4.3 shows the general structure of the data, which exactly matches the theoretical model formulated in section 2.5.

Figure 4.3: Structure of collected data

In total, the collected graph consisted of 57,997 nodes and 92,071 edges. Figures 4.4 and 4.5 show how these properties are distributed over the different node types.


Figure 4.4: Portions of Nodes by Nodetype (Hashtag 51.56%, Post 21.48%, User 15.96%, Music 11.01%)

Figure 4.5: Summed Degrees per Nodetype

It is evident that the majority of nodes in our dataset are hashtags. However, the posts have proportionally more edges. This can be explained by the fact that each post in our data scheme has three automatic edges. It is also not surprising that the number of edges for hashtags is relatively high, as each post can contain multiple hashtags, but only one user who posted it and one music track included in the video. To gain a better understanding, refer to fig. 4.6, which compares the average number of edges for each node type.

Figure 4.6: Average Degree per Nodetype

The average degree for each post is slightly above seven, which can be attributed to the
relatively low number of mention relations and the fact that each post can only include one
music track. Based on this, we can conclude that each post includes an average of five
hashtags. To better understand how this number of edges is determined, refer to fig. 4.7,
which displays the degree distributions for each node type.
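Statistics like these can be computed directly in the database. As a sketch, and assuming the node labels from table 3.2, the following Cypher query returns the average degree of all Post nodes:

```cypher
MATCH (p:Post)
OPTIONAL MATCH (p)-[r]-()
WITH p, count(r) AS degree
RETURN avg(degree) AS averageDegree
```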


Figure 4.7: Degree Distribution per Nodetype

The degree distributions provide interesting insights into the data. We observe that the posts have a peak at zero, which is surprising, since there should not be any posts that are not connected to any other nodes. The next peak is at six, and the curve then gradually decreases and approaches zero for higher degrees. Additionally, it is particularly intriguing that all three other node types appear to be Poisson-distributed, indicating that there are a few nodes with a high degree and many with a low degree. Although I cut the x-axis in the figure at degree 14, there are actually a few nodes in the data with much higher degrees. To illustrate this point, fig. 4.8 shows the maximum degree for each node type.


Figure 4.8: Maximal Degree per Nodetype

Now that we have taken a first glance at the general properties of the graph, we can begin to explore it. As mentioned earlier, I used GraphXR, a software that can be directly integrated into Neo4j to visualize and analyze the data within the database. Figure 4.9 displays the complete graph.

Figure 4.9: Visualization of Graph of Collected Data


Using GraphXR, the data can be modified quickly and easily. Graph algorithms can also be
applied to the data to create new variables for measuring centrality or community belonging.
Furthermore, nodes can be filtered based on their attributes to determine which ones should
be displayed. One particularly useful feature is the ability to quickly transform the graph
into different visualizations. For example, fig. 4.10 displays a form that I found particularly
helpful, as it effectively illustrates the structure of the data.

Figure 4.10: Visualization of Graph of Collected Data with Ring View

As we can see, and as we learned from the degree distribution, we have a few nodes in the
center of the discourse that are particularly well-connected, which leads to a density of edges
in the middle. On the other hand, the majority of nodes are on the outer circle of the graph.

To gain further insight into how the scraping algorithm contributed to the emergence of this
structure, I applied a filter based on the collection time and visualized nodes gradually based
on when they were added to the database. Figure 4.11 illustrates this process.


Figure 4.11: Visualization of the Scraping Process

Chapter 5

Discussion & Conclusion

The purpose of this chapter is to discuss the contributions and limitations of this work and
draw conclusions. The first section will provide a brief summary of the results from the
previous chapters and assess their alignment with the objectives of this work. The following
section will present an overview of related work and its relevance to the current research. An
evaluation will then be conducted to determine the extent to which this work contributes to
the scientific discourse. Possible limitations of the study will also be addressed.

Furthermore, this chapter will identify potential avenues for future research directions and
provide information on the next steps for this work. Finally, the chapter will conclude this
work.

5.1 Results

This section will evaluate the results of the previous chapters to determine the extent to which
they address the problem and objectives of this work.

The first objective of this thesis was to develop a theoretical model of TikTok as a social medium that is accessible from both a social science and a computer science perspective. In chapter 2, I presented a theoretical model that was informed by the theoretical and methodological perspectives of social science, as well as the data accessibility of TikTok. By doing so, I have successfully achieved the first objective of this thesis.


The second objective of this thesis was to embed the tool in a theoretical model which aligns
with common methods and theories in social science while incorporating the rigorous and
formal nature of computer science. As demonstrated in chapter 4, the developed program
manages to translate the collected data directly into a structure that aligns with the model
established in the final section of chapter 2. This enables the data to be interpreted within the
framework provided by the definitions, thereby fulfilling the second objective of this thesis.

The third objective of this thesis was to ensure that the web scraping tool is as accessible
and user-friendly as possible. In chapter 3, I provide an introduction to all the principal
components used in the development of the tool. Additionally, by packaging the tool with
Docker, it can be used without requiring programming knowledge or the installation of any
dependencies. Furthermore, by embedding the tool into a Neo4j-database, users can access
the data through a range of graph applications supported by Neo4j Desktop, without needing
programming knowledge. However, users can still access the data through querying if they
prefer. In chapter 4, I demonstrate how GraphXR can be used to explore the data and present some general descriptive statistics which can be obtained from the data and used as an entry point for further analysis, thereby fulfilling the final objective of this thesis.

5.2 Related Work & Contributions

An analogous project to this thesis is the Unofficial TikTok API, which is available on Github
[43]. At the outset of this thesis work, I evaluated this tool but encountered various issues
that were frequently reported and often went unanswered on the project’s Discord server.
Since then, the project initiator has released a new version that aims to address the issues I
encountered.

The project is based on Selenium, which brings with it the problems and benefits I discussed in section 3.2. Overall, the program gives researchers the opportunity to scrape data for a specified request; however, it does not embed the data in a predefined structure, which might be beneficial for some research purposes and mean additional work for others. In short, even if both projects answer the same problem, they do it in a different manner. As the name suggests, the Unofficial TikTok API is an API, whereas the program proposed in this work is a web scraper with an integrated database. Where the Unofficial TikTok API gives researchers the option to quickly access specific data directly, my project provides a framework for continuously scraping data in a form that reflects the structure of the platform. It might be good practice to integrate the functionality of the Unofficial TikTok API into my work to enrich the scraped data with additional information.

With regard to the formal formulation of the model, i.e., the rigorous definitions provided in
section 2.5, Mukkamala et al. present a similar framework in their work [44]. However, in
their model formulations, they focus more on a generalized model of social media rather than
any specific platform. Nevertheless, their model shares significant similarities with the one
developed in this work.

This work contributes to the scientific discourse by offering a formal lens on social media
in connection with social phenomena. This might motivate researchers to use TikTok data to
address their research problems. The rigorous definition of the formal model might help to
establish a common language for research on TikTok and facilitate the communication of
results between researchers.

5.3 Limitations

Even though this work successfully constructed a tool that scrapes data from TikTok and
thereby makes this data accessible for social science research, the work has limitations
which should not be left unmentioned.

First of all, this work is itself experimental. It was driven by my personal experience of
incorporating both the social science and the computer science perspective, and by my
first-hand experience of how things that are obvious in one discipline are often a black box
in the other. The tool I developed is an attempt to facilitate access to the tools of one
discipline from the other and has to be seen as a prototype. The approach I propose draws on
my experience, and whether understanding the data in this way is truly applicable and
beneficial has yet to be tested.

Additionally, it is important to note that the theoretical model developed in this work is
based on my personal judgment of what is useful to the scientific community, taking into
consideration current theories and methods in social science. It is therefore important to
acknowledge that the model may not accurately reflect the actual data model of TikTok.
Unfortunately, this is a limitation that cannot be easily overcome, as only TikTok itself
has access to the real data model.

Another limitation of this work is the limited quantity of data that can be scraped from each
page. As Scrapy works with requests and the data is scraped from the received HTML pages
without any virtual browser in between, I was only able to scrape data from the first page of
each entity, which always contained only 15 items. This may result in a low amount of data
for a particular context, making it difficult to analyze the discourse around a certain topic.
Additionally, I did not include comments on videos in this work since they were not directly
accessible from the HTML responses. However, as comments represent a significant form of
interaction, they could be valuable for social science researchers.
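To illustrate why the request-based approach is capped at the first page: the scraper reads
the JSON state that TikTok embeds in the initial HTML response, and that state only contains
the items rendered on the first page; everything else is loaded later via JavaScript. The
sketch below shows the general pattern using only Python's standard library. The script id
"STATE" and the item layout are illustrative assumptions and will not match TikTok's actual
markup exactly.

```python
import json
import re

def extract_embedded_state(html: str, script_id: str = "STATE"):
    """Pull the JSON blob embedded in a <script> tag out of an HTML page.

    TikTok-style pages ship their first page of items inside such a blob;
    the id "STATE" and the layout used below are illustrative assumptions.
    """
    match = re.search(
        rf'<script id="{script_id}"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    if not match:
        return None
    return json.loads(match.group(1))

# Synthetic page: only the 15 first-page items are present in the HTML itself.
page = '<html><script id="STATE" type="application/json">{}</script></html>'.format(
    json.dumps({"items": [{"id": i} for i in range(15)]})
)
state = extract_embedded_state(page)
print(len(state["items"]))  # prints 15; later items never appear in the raw HTML
```

Since a virtual browser never executes the page's JavaScript in this setup, no request-based
scraper can see beyond what this embedded blob contains.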

5.4 Outlook

Both of the limitations mentioned in the previous section might be tackled in future work on
this tool. TikTok has announced that it will release an API to enable researchers to conduct
research on the platform. The model developed in this work was designed for two reasons:

1. To reflect a theoretical model in the data that is accessible from both the social
science and the computer science perspective.

2. To establish a way to actually gain access to data from TikTok.

These motivations have given the tool the interesting property that the structure of the data
it produces reflects the way it was collected. Since the collection was oriented on the
structure of the platform, the data thereby reflects the structure of the platform. While a
future release of the TikTok API may make the use of Scrapy for data acquisition obsolete,
the fundamental structure of this work can be maintained by using an API to collect data.
This would enable the tool to access a larger portion of the TikTok discourse for research.
It would be worthwhile to test this using the Unofficial TikTok API to enrich the data with
comments and corresponding user information. By using users who posted comments as
additional collection points, the collection could focus more on a specific discourse.
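The idea of using commenters as additional collection points can be sketched as a simple
breadth-first snowball collection. The `fetch_videos` and `fetch_comments` functions below
are stubs standing in for whichever client ends up providing the data (the Unofficial TikTok
API or an official one); their names and return formats are assumptions for illustration.

```python
from collections import deque

def snowball_users(seed_users, fetch_videos, fetch_comments, max_users=100):
    """Breadth-first collection: start from seed users, follow their commenters.

    fetch_videos(user) -> list of video ids; fetch_comments(video) -> list of
    commenting user names. Both are stand-ins for a real TikTok client.
    """
    seen = set(seed_users)
    queue = deque(seed_users)
    while queue and len(seen) < max_users:
        user = queue.popleft()
        for video in fetch_videos(user):
            for commenter in fetch_comments(video):
                if commenter not in seen:
                    seen.add(commenter)
                    queue.append(commenter)
    return seen

# Tiny stubbed network to demonstrate the traversal.
videos_by_user = {"a": ["v1"], "b": ["v2"], "c": []}
comments_by_video = {"v1": ["b"], "v2": ["c"]}
users = snowball_users(
    ["a"],
    fetch_videos=lambda u: videos_by_user.get(u, []),
    fetch_comments=lambda v: comments_by_video.get(v, []),
)
print(sorted(users))  # prints ['a', 'b', 'c']
```

Because each newly discovered commenter becomes a collection point, the sample grows along
the interactions of one discourse rather than across the platform as a whole.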


Another aspect that I plan to explore is the acquisition of music features. A wide range of
videos on TikTok include music, and the musical content of these videos may provide valuable
information about their overall content. MusicBrainz is a platform that, through its
AcousticBrainz project, offers features extracted from music using machine learning,
including information about genres and emotions. Integrating these features into the data
collected during the scraping process could provide further insights into the videos and
their associated discourse.
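Such an enrichment step could be sketched as follows: given a lookup of feature dictionaries
keyed by track (here a plain dict standing in for responses from an AcousticBrainz-style
service), each scraped video record is annotated with the genre and mood of its music. The
field names "track", "genre", and "mood" are illustrative assumptions, not the service's
actual response schema.

```python
def attach_music_features(videos, features_by_track):
    """Annotate video records with genre/mood features of their music.

    features_by_track stands in for an AcousticBrainz-style lookup; the
    field names ("track", "genre", "mood") are illustrative assumptions.
    """
    enriched = []
    for video in videos:
        features = features_by_track.get(video.get("track"), {})
        enriched.append({
            **video,
            "genre": features.get("genre"),  # None when the track is unknown
            "mood": features.get("mood"),
        })
    return enriched

videos = [{"id": "v1", "track": "song-1"}, {"id": "v2", "track": "unknown"}]
features = {"song-1": {"genre": "electronic", "mood": "happy"}}
print(attach_music_features(videos, features))
```

Missing tracks simply yield empty feature values, so the enrichment never drops videos from
the collection.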

5.5 Conclusion

Social media is a goldmine for those who want to understand human behavior, whether in-
dividual or collective. The data produced on social media allows for a glimpse into human
interactions that would be difficult to achieve without their existence. However, access to this
data is not without its problems. Firstly, it requires technical know-how to obtain the data in
the first place. Secondly, one must be aware of the legal and ethical questions surrounding
the collection and analysis of the data. On the one hand, the data is a trail of individuals who
may not necessarily be aware that they are being analyzed. On the other hand, these people
are operating in a public sphere and creating content that they want to reach a wide audience.
Allowing social media platform providers to have a monopoly on the insights that can be
gained from the data does not seem like an appropriate solution.

This work attempts to provide researchers with an access point with a low technical barrier
for the analysis of social behavior on TikTok, designed with these legal and ethical
questions in mind. The tool presented in this work will enable researchers to scrape data
from TikTok. Additionally, this work proposes a formal model of social media in general and
of TikTok in particular, and relates this definition to a framework for the analysis of
social phenomena. Even though the amount of data that can be scraped is quite small,
especially for data related to small parts of the discourse, it might be useful for
understanding general characteristics of the TikTok discourse and the connections between
multiple specific discourses.

The multidisciplinary approach of this work, between social science and computer science,
can be another brick in the bridge facilitating collaborative work between the two
disciplines. Additionally, the connection of social theory and rigorous formal model
definitions is an experimental approach that might motivate new, innovative, and interesting
future work.

Bibliography

[1] J. Adams, S. Leestma, and L. Nyhoff, Turbo C++: An Introduction to Computing.
Prentice-Hall, Inc., 1995.
[2] Computer Science Is Not About Computers, Any More Than Astronomy Is About Telescopes –
Quote Investigator®, Apr. 2021. [Online]. Available:
https://quoteinvestigator.com/2021/04/02/computer-science/ (visited on 02/03/2023).
[3] J. Arsac and J. Vauthier, Jacques Arsac, un informaticien: entretien avec Jacques Vau-
thier. Editions Beauchesne, 1989, vol. 1.
[4] D. Colander and E. Hunt, Social Science: An Introduction to the Study of Society,
17th ed. New York: Routledge, Mar. 2019, ISBN: 978-0-429-01955-5. DOI: 10.4324/
9780429019555.
[5] K. Varnelis, Networked publics. Mit Press, 2012.
[6] Z. Papacharissi and M. de Fatima Oliveira, “Affective news and networked publics:
The rhythms of news storytelling on# egypt,” Journal of communication, vol. 62, no. 2,
pp. 266–282, 2012.
[7] D. Boyd, “Social network sites as networked publics: Affordances, dynamics, and
implications,” in A networked self, Routledge, 2010, pp. 47–66.
[8] K. E. Anderson, “Getting acquainted with social networks and apps: It is time to talk
about TikTok,” Library Hi Tech News, vol. 37, no. 4, pp. 7–12, Jan. 2020, Publisher:
Emerald Publishing Limited, ISSN: 0741-9058. DOI: 10.1108/LHTN-01-2020-
0001. [Online]. Available: https://doi.org/10.1108/LHTN-01-2020-
0001 (visited on 02/17/2023).
[9] G. Weimann and N. Masri, “Research note: Spreading hate on tiktok,” Studies in con-
flict & terrorism, pp. 1–14, 2020.


[10] G. Payne, “Surveys, Statisticians and Sociology: A History of (a Lack of) Quantitative
Methods,” Enhancing Learning in the Social Sciences, vol. 6, no. 2, pp. 74–89, Jul. 2014,
Publisher: Routledge. DOI: 10.11120/elss.2014.00028. [Online]. Available:
https://doi.org/10.11120/elss.2014.00028 (visited on 02/17/2023).
[11] Social sciences - New World Encyclopedia. [Online]. Available:
https://www.newworldencyclopedia.org/entry/Social_sciences (visited on 01/04/2023).
[12] B. S. Turner, The Cambridge dictionary of sociology. 2006.
[13] M. Louzek, “The battle of methods in economics: The classical methodenstreit—Menger
vs. Schmoller,” American Journal of Economics and Sociology, vol. 70, no. 2, pp. 439–
463, 2011, Publisher: JSTOR.
[14] E. Durkheim, Suicide: A Study in Sociology, 2nd ed. London: Routledge, Feb. 2002,
ISBN: 978-0-203-99432-0. DOI: 10.4324/9780203994320.
[15] D. B. Pedersen, “Integrating social sciences and humanities in interdisciplinary
research,” Palgrave Communications, vol. 2, no. 1, pp. 1–7, 2016, Publisher: Palgrave.
[16] A. Giddens, Capitalism and Modern Social Theory: An Analysis of the Writings of Marx,
Durkheim and Max Weber. Cambridge University Press, 1971, ISBN: 978-0-521-09785-7.
[17] J. J. Macionis, Sociology. Pearson, Sep. 2011, ISBN: 978-0-205-11671-3.
[18] M. Weber and S. Kalberg, The Protestant ethic and the spirit of capitalism. Routledge,
2013.
[19] A. Schutz, “The social world and the theory of social action,” in Collected Papers II:
Studies in Social Theory, Springer, 1976, pp. 3–19.
[20] H. H. Gerth and C. Wright Mills, From Max Weber: Essays in Sociology, 1948.
[21] J. S. Coleman, Foundations of social theory. Harvard university press, 1994.
[22] A. Šilenskytė, “Corporate strategy implementation: How strategic plans become indi-
vidual strategic actions across organizational levels of the MNC,” PhD Thesis, Sep.
2020. DOI: 10.13140/RG.2.2.10450.63683.
[23] Diagrams of Theory: Coleman’s Boat, Jan. 2014. [Online]. Available:
https://dustinstoltz.com/blog/2014/01/26/diagrams-of-theory-james-colemans-boat-bathtub
(visited on 02/10/2023).


[24] M. Lunt, “Introduction to statistical modelling: Linear regression,” Rheumatology,
vol. 54, no. 7, pp. 1137–1140, 2015.
[25] G. Manzo, Analytical sociology: actions and networks. John Wiley & Sons, 2014.
[26] G. Manzo, Research Handbook on Analytical Sociology. Edward Elgar Publishing Ltd.,
Jan. 2021, ISBN: 978-1-78990-685-1. DOI: 10.4337/9781789906851.
[27] U. Wilensky and W. Rand, An introduction to agent-based modeling: modeling natu-
ral, social, and engineered complex systems with NetLogo. Mit Press, 2015.
[28] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’networks,” na-
ture, vol. 393, no. 6684, pp. 440–442, 1998.
[29] P. DiMaggio and F. Garip, “How network externalities can exacerbate intergroup in-
equality,” American Journal of Sociology, vol. 116, no. 6, pp. 1887–1933, 2011.
[30] M. M. Durland and K. A. Fredericks, “An introduction to social network analysis,”
New Directions for Evaluation, vol. 2005, no. 107, pp. 5–13, 2005.
[31] G. Johnson, Machinery of Mind: Inside the New Science of Artificial Intelligence,
1st ed. New York: Time Books, Sep. 1986, ISBN: 978-0-8129-1229-6.
[32] C. T. Carr and R. A. Hayes, “Social Media: Defining, Developing, and Divining,”
Atlantic Journal of Communication, vol. 23, no. 1, pp. 46–65, 2015, ISSN: 15456889.
[33] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news
media?” in Proceedings of the 19th International Conference on World Wide Web, ser. WWW ’10,
New York, NY, USA: Association for Computing Machinery, Apr. 2010, pp. 591–600,
ISBN: 978-1-60558-799-8. DOI: 10.1145/1772690.1772751. [Online]. Available:
https://doi.org/10.1145/1772690.1772751 (visited on 02/15/2023).
[34] V. Krotov and L. Silva, Legality and Ethics of Web Scraping. Sep. 2018.
[35] Community Guidelines. [Online]. Available:
https://www.tiktok.com/community-guidelines?lang=en (visited on 02/17/2023).
[36] N. Belnap, “On Rigorous Definitions,” Philosophical Studies: An International Journal
for Philosophy in the Analytic Tradition, vol. 72, no. 2/3, pp. 115–146, 1993, Publisher:
Springer, ISSN: 0031-8116. [Online]. Available: https://www.jstor.org/stable/4320448
(visited on 02/20/2023).


[37] K. Lane, What Is an API and How Does It Work?, Oct. 2020. [Online]. Available:
https://blog.postman.com/intro-to-apis-what-is-an-api/ (visited on 03/03/2023).
[38] K. Lane, Intro to APIs: History of APIs, Oct. 2019. [Online]. Available:
https://blog.postman.com/intro-to-apis-history-of-apis/ (visited on 03/03/2023).
[39] Scrapy 2.8 documentation — Scrapy 2.8.0 documentation. [Online]. Available:
https://docs.scrapy.org/en/latest/ (visited on 03/03/2023).
[40] Scrapy vs Beautiful Soup vs Selenium – Which One to Use? [Online]. Available:
https://proxyway.com/guides/scrapy-vs-beautiful-soup-vs-selenium (visited on 03/01/2023).
[41] Neo4j Graph Data Platform – The Leader in Graph Databases. [Online]. Available:
https://neo4j.com/ (visited on 03/03/2023).
[42] Kineviz GraphXR: Visual analytics, graph BI, and more... [Online]. Available:
https://www.kineviz.com (visited on 03/03/2023).
[43] D. Teather, TikTokAPI, Jul. 2022. [Online]. Available:
https://github.com/davidteather/tiktok-api (visited on 02/24/2023).
[44] R. R. Mukkamala, R. Vatrapu, and A. Hussain, “Towards a formal model of social
data,” 2013.

Ehrenwörtliche Erklärung (Declaration of Authorship)

I hereby declare that I have written this bachelor's thesis independently and that I have
used no sources or aids other than those indicated. All passages taken from the sources have
been marked as such. This thesis has not been submitted in the same or a similar form to any
examination authority.

Angers, March 13, 2023
Yannick Zelle
