
09.05.2022 / Textplot Tutorial Silvio Liesch

Textplot
Source: https://github.com/davidmcclure/textplot

[Figure: term network for War and Peace]

Textplot is a little program that converts a document into a network of terms, with the goal of
teasing out information about the high-level topic structure of the text. For each term:

1. Get the set of offsets in the document where the term appears.
2. Using kernel density estimation, compute a probability density function (PDF) that
represents the word's distribution across the document. Eg, from War and Peace:

[Figure: kernel density estimate for a single term in War and Peace]

3. Compute a Bray-Curtis dissimilarity between the term's PDF and the PDFs of all other
terms in the document. This measures the extent to which two words appear in the
same locations.
4. Sort this list in descending order to get a custom "topic" for the term. Skim off the top
N words (usually 10-20) to get the strongest links. Here's "napoleon":

[('napoleon', 1.0),
('war', 0.65319871313854128),
('military', 0.64782349297012154),
('men', 0.63958189887106576),
('order', 0.63636730075877446),
('general', 0.62621616907584432),
('russia', 0.62233286026418089),
('king', 0.61854160459241103),
('single', 0.61630514751638699),
('killed', 0.61262010905310182),
('peace', 0.60775702746632576),
('contrary', 0.60750138486684579),
('number', 0.59936009740377516),
('accompanied', 0.59748552019874168),
('clear', 0.59661288775164523),
('force', 0.59657370362505935),
('army', 0.59584331507492383),
('authority', 0.59523854206807647),
('troops', 0.59293965397478188),
('russian', 0.59077308177196441)]

5. Shovel all of these links into a network and export a GML file. (Steps 2 and 3 are sketched in the short code example below.)
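
As a rough illustration of steps 2 and 3, here is a minimal sketch using scikit-learn's KernelDensity and SciPy's braycurtis. The function names term_pdf and term_similarity are ours for illustration; they are not part of Textplot's API.

import numpy as np
from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import braycurtis

def term_pdf(offsets, doc_length, bandwidth=2000, samples=1000):
    # Step 2: estimate the term's probability density across the document.
    grid = np.linspace(0, doc_length, samples)[:, None]
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
    kde.fit(np.asarray(offsets, dtype=float)[:, None])
    pdf = np.exp(kde.score_samples(grid))
    return pdf / pdf.sum()  # normalize so densities are comparable

def term_similarity(pdf_a, pdf_b):
    # Step 3: 1 - Bray-Curtis dissimilarity, so 1.0 means identical distributions.
    return 1 - braycurtis(pdf_a, pdf_b)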

Generating graphs
There are two ways to create graphs - you can use the textplot executable from the
command line, or, if you want to tinker around with the underlying NetworkX graph instance,
you can fire up a Python shell and use the build_graph() helper directly.

Either way, first install Textplot. With PyPI:

pip install textplot

Or, clone the repo and install the package manually:

python -m venv env
. env/bin/activate
pip install -r requirements.txt
python setup.py install

From the command line

Then, from the command line, generate graphs with:

textplot generate [IN_PATH] [OUT_PATH] [OPTIONS]

Where the input is a regular .txt file, and the output is a .gml file. So, if you're working with
War and Peace:

textplot generate war-and-peace.txt war-and-peace.gml

The generate command takes these options (an example invocation follows the list):

• --term_depth=1000 (int) - The number of terms to include in the network. For now,
Textplot takes the top N most frequent terms, after stopwords are removed.
• --skim_depth=10 (int) - The number of connections (edges) to skim off the top of the
"topics" computed for each word.
• --d_weights (flag) - By default, terms that appear in similar locations in the
document will be connected by edges with "heavy" weights, the semantics expected by
force-directed layout algorithms like Force Atlas 2 in Gephi. If this flag is passed, the
weights will be inverted - use this if you want to do any kind of pathfinding analysis
on the graph, where it's generally assumed that edge weights represent distance or
cost.
• --bandwidth=2000 (int) - The bandwidth for the kernel density estimation. This
controls the "smoothness" of the curve. 2000 is a sensible default for long novels, but
bump it down if you're working with shorter texts.
• --samples=1000 (int) - The number of equally-spaced points on the X-axis where the
kernel density is sampled. 1000 is almost always enough, unless you're working with a
huge document.

• --kernel=gaussian (str) - The kernel function. The scikit-learn implementation also
supports tophat, epanechnikov, exponential, linear, and cosine.
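
Putting these options together, a fully spelled-out invocation might look like the line below. The values shown are simply the documented defaults, so they are illustrative rather than required:

textplot generate war-and-peace.txt war-and-peace.gml --term_depth=1000 --skim_depth=10 --bandwidth=2000 --samples=1000 --kernel=gaussian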

From a Python shell 3

Or, fire up a Python shell and import build_graph() directly:

In [1]: from textplot.helpers import build_graph

In [2]: g = build_graph('war-and-peace.txt') 4

Tokenizing text...
Extracted 573064 tokens

Indexing terms:
[################################] 124750/124750 - 00:00:06

Generating graph:
[################################] 500/500 - 00:00:03

build_graph() returns an instance of textplot.graphs.Skimmer, which gives access to an
instance of networkx.Graph. Eg, to get degree centralities:

In [3]: import networkx as nx

In [4]: nx.degree_centrality(g.graph) 5
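
To rank terms by centrality, the returned dictionary can be sorted with plain Python; this snippet is our own illustration, not part of Textplot's API:

top_terms = sorted(nx.degree_centrality(g.graph).items(),
                   key=lambda item: item[1], reverse=True)[:10]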

3
Unfortunately, on Windows (at least on Windows 10) we couldn't get Textplot running directly from the
command line, so we have to make a little detour via Python. To do this, please make sure that both
Python and Textplot (see the pip installation above) are installed on your computer.
4
Here, we suggest the following code (insert the whole path to your text file):
g = build_graph("C:/Users/[...].txt", term_depth=100)
The term_depth option defines the number of terms to include in the network. At N = 100 this is of
course set very low; for a more detailed analysis and interpretation, set the depth to N = 500, for
example.
5
We then add one more line to create a Graph Modelling Language (GML) file that describes the term network:
g.write_gml("[name your output file].gml")
The GML file (it is probably located in the corresponding user folder under C:/) can now be opened in Gephi
and customized further for visualization.
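
Putting notes 3 to 5 together, the whole detour via Python condenses to a few lines. The path and the
output file name are placeholders you need to fill in, and term_depth=100 is the deliberately low value
suggested above:

from textplot.helpers import build_graph

g = build_graph("C:/Users/[...].txt", term_depth=100)  # full path to your text file
g.write_gml("[name your output file].gml")             # open the resulting file in Gephi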
