2022-05-09 - Textplot Tutorial (Silvio Liesch)

09.05.
2022 / Textplot Tutorial Silvio Liesch
Textplot
Source: https://github.com/davidmcclure/textplot
War and Peace (click to zoom)
Textplot is a little program that converts a document into a network of terms, with the goal of
teasing out information about the high-level topic structure of the text. For each term:
1. Get the set of offsets in the document where the term appears.
2. Using kernel density estimation, compute a probability density function (PDF) that
represents the word's distribution across the document. Eg, from War and Peace:
1
09.05.2022 / Textplot Tutorial Silvio Liesch
1. Compute a Bray-Curtis dissimilarity between the term's PDF and the PDFs of all other
terms in the document. This measures the extent to which two words appear in the
same locations.
2. Sort this list in descending order to get a custom "topic" for the term. Skim off the top
N words (usually 10-20) to get the strongest links. Here's "napoleon":
[('napoleon', 1.0),
('war', 0.65319871313854128),
('military', 0.64782349297012154),
('men', 0.63958189887106576),
('order', 0.63636730075877446),
('general', 0.62621616907584432),
('russia', 0.62233286026418089),
('king', 0.61854160459241103),
('single', 0.61630514751638699),
('killed', 0.61262010905310182),
('peace', 0.60775702746632576),
('contrary', 0.60750138486684579),
('number', 0.59936009740377516),
('accompanied', 0.59748552019874168),
('clear', 0.59661288775164523),
('force', 0.59657370362505935),
('army', 0.59584331507492383),
('authority', 0.59523854206807647),
('troops', 0.59293965397478188),
('russian', 0.59077308177196441)]
1. Shovel all of these links into a network and export a GML file.
2
Generating graphs
There are two ways to create graphs - you can use the textplot executable from the
command line, or, if you want to tinker around with the underlying NetworkX graph instance,
you can fire up a Python shell and use the build_graph() helper directly.
Either way, first install Textplot. With PyPI:
pip install textplot
Or, clone the repo and install the package manually:
pyvenv env
. env/bin/activate
pip install -r requirements.txt
python setup.py install
From the command line
Then, from the command line, generate graphs with:
texplot 1 generate [IN_PATH] [OUT_PATH] [OPTIONS]
Where the input is a regular .txt file, and the output is a .gml file. So, if you're working with
War and Peace:
texplot 2 generate war-and-peace.txt war-and-peace.gml
The generate command takes these options:
• --term_depth=1000 (int) - The number of terms to include in the network. For now,
Textplot takes the top N most frequent terms, after stopwords are removed.
• --skim_depth=10 (int) - The number of connections (edges) to skim off the top of the
"topics" computed for each word.
• --d_weights (flag) - By default, terms that appear in similar locations in the
document will be connected by edges with "heavy" weights, the semantic expected by
force-directed layout algorithms like Force Atlas 2 in Gephi. If this flag is passed, the
weights will be inverted - use this if you want to do any kind of pathfinding analysis
on the graph, where it's generally assumed that edge weights represent distance or
cost.
• --bandwidth=2000 (int) - The bandwidth for the kernel density estimation. This
controls how "smoothness" of the curve. 2000 is a sensible default for long novels, but
bump it down if you're working with shorter texts.
• --samples=1000 (int) - The number of equally-spaced points on the X-axis where the
kernel density is sampled. 1000 is almost always enough, unless you're working with a
huge document.
1
It should say «textplot» here, of course.
2
Ditto.
3
• --kernel=gaussian (str) - The kernel function. The scikit-learn implementation also

supports tophat, epanechnikov, exponential, linear, and cosine.
From a Python shell 3
Or, fire up a Python shell and import build_graph() directly:
In [1]: from textplot.helpers import build_graph
In [2]: g = build_graph('war-and-peace.txt') 4
Tokenizing text...
Extracted 573064 tokens
Indexing terms:
[################################] 124750/124750 - 00:00:06
Generating graph:
[################################] 500/500 - 00:00:03
build_graph() returns an instance of textplot.graphs.Skimmer, which gives access to an

instance of networkx.Graph. Eg, to get degree centralities:
In [3]: import networkx as nx
In [4]: nx.degree_centrality(g.graph) 5
3
Unfortunately, in Windows (at least in Windows 10) we couldn’t get Textplot running directly from the
command line. Therefore we have to make a little detour via Python. To do this, please make sure that both
Python and Textplot (see "pip installation" above) are installed on your computer.
4
Here, we suggest the following code:
g = build_graph("C:/Users/[...].txt" [insert the whole path to your text file here], term_depth = 100)
 The "term_depth" option defines the number of terms to include in the network. At N = 100 this of
course is set very low. For more detailed analysis and interpretation, set the depth to N = 500 for
example.
5
We then add a fifth line to create a Geography Markup Language (GML) file that contains spatial data:
g.write_gml("[name your output file].gml")
The GML file (it is probably located in the corresponding user folder under C:/) can now be opened in Gephi
and further customized for visualization.

2022-05-09 - Textplot Tutorial (Silvio Liesch)

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2022-05-09 - Textplot Tutorial (Silvio Liesch)

Uploaded by

Copyright:

Available Formats

09.05.

2022 / Textplot Tutorial Silvio Liesch

War and Peace (click to zoom)

Either way, first install Textplot. With PyPI:

pip install textplot

Or, clone the repo and install the package manually:

From the command line

Then, from the command line, generate graphs with:

texplot 1 generate [IN_PATH] [OUT_PATH] [OPTIONS]

texplot 2 generate war-and-peace.txt war-and-peace.gml

The generate command takes these options:

• --kernel=gaussian (str) - The kernel function. The scikit-learn implementation also

From a Python shell 3

Or, fire up a Python shell and import build_graph() directly:

In [1]: from textplot.helpers import build_graph

build_graph() returns an instance of textplot.graphs.Skimmer, which gives access to an

In [3]: import networkx as nx

You might also like