Professional Documents
Culture Documents
2022-05-09 - Textplot Tutorial (Silvio Liesch)
2022-05-09 - Textplot Tutorial (Silvio Liesch)
Textplot
Source: https://github.com/davidmcclure/textplot
Textplot is a little program that converts a document into a network of terms, with the goal of
teasing out information about the high-level topic structure of the text. For each term:
1. Get the set of offsets in the document where the term appears.
2. Using kernel density estimation, compute a probability density function (PDF) that
represents the word's distribution across the document. Eg, from War and Peace:
1
09.05.2022 / Textplot Tutorial Silvio Liesch
1. Compute a Bray-Curtis dissimilarity between the term's PDF and the PDFs of all other
terms in the document. This measures the extent to which two words appear in the
same locations.
2. Sort this list in descending order to get a custom "topic" for the term. Skim off the top
N words (usually 10-20) to get the strongest links. Here's "napoleon":
[('napoleon', 1.0),
('war', 0.65319871313854128),
('military', 0.64782349297012154),
('men', 0.63958189887106576),
('order', 0.63636730075877446),
('general', 0.62621616907584432),
('russia', 0.62233286026418089),
('king', 0.61854160459241103),
('single', 0.61630514751638699),
('killed', 0.61262010905310182),
('peace', 0.60775702746632576),
('contrary', 0.60750138486684579),
('number', 0.59936009740377516),
('accompanied', 0.59748552019874168),
('clear', 0.59661288775164523),
('force', 0.59657370362505935),
('army', 0.59584331507492383),
('authority', 0.59523854206807647),
('troops', 0.59293965397478188),
('russian', 0.59077308177196441)]
1. Shovel all of these links into a network and export a GML file.
2
09.05.2022 / Textplot Tutorial Silvio Liesch
Generating graphs
There are two ways to create graphs - you can use the textplot executable from the
command line, or, if you want to tinker around with the underlying NetworkX graph instance,
you can fire up a Python shell and use the build_graph() helper directly.
pyvenv env
. env/bin/activate
pip install -r requirements.txt
python setup.py install
Where the input is a regular .txt file, and the output is a .gml file. So, if you're working with
War and Peace:
• --term_depth=1000 (int) - The number of terms to include in the network. For now,
Textplot takes the top N most frequent terms, after stopwords are removed.
• --skim_depth=10 (int) - The number of connections (edges) to skim off the top of the
"topics" computed for each word.
• --d_weights (flag) - By default, terms that appear in similar locations in the
document will be connected by edges with "heavy" weights, the semantic expected by
force-directed layout algorithms like Force Atlas 2 in Gephi. If this flag is passed, the
weights will be inverted - use this if you want to do any kind of pathfinding analysis
on the graph, where it's generally assumed that edge weights represent distance or
cost.
• --bandwidth=2000 (int) - The bandwidth for the kernel density estimation. This
controls how "smoothness" of the curve. 2000 is a sensible default for long novels, but
bump it down if you're working with shorter texts.
• --samples=1000 (int) - The number of equally-spaced points on the X-axis where the
kernel density is sampled. 1000 is almost always enough, unless you're working with a
huge document.
1
It should say «textplot» here, of course.
2
Ditto.
3
09.05.2022 / Textplot Tutorial Silvio Liesch
In [2]: g = build_graph('war-and-peace.txt') 4
Tokenizing text...
Extracted 573064 tokens
Indexing terms:
[################################] 124750/124750 - 00:00:06
Generating graph:
[################################] 500/500 - 00:00:03
In [4]: nx.degree_centrality(g.graph) 5
3
Unfortunately, in Windows (at least in Windows 10) we couldn’t get Textplot running directly from the
command line. Therefore we have to make a little detour via Python. To do this, please make sure that both
Python and Textplot (see "pip installation" above) are installed on your computer.
4
Here, we suggest the following code:
g = build_graph("C:/Users/[...].txt" [insert the whole path to your text file here], term_depth = 100)
The "term_depth" option defines the number of terms to include in the network. At N = 100 this of
course is set very low. For more detailed analysis and interpretation, set the depth to N = 500 for
example.
5
We then add a fifth line to create a Geography Markup Language (GML) file that contains spatial data:
g.write_gml("[name your output file].gml")
The GML file (it is probably located in the corresponding user folder under C:/) can now be opened in Gephi
and further customized for visualization.