NLTK

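NLTK built-in functions
The examples assume the sample texts text1 ... text9 have been loaded:
>>> from nltk.book import *
1. Concordance
(1) concordance shows every occurrence of a given word, together with its surrounding context.
>>> text1.concordance('monstrous')
(2) similar finds words that occur in contexts similar to those of the given word, such as the ___ pictures and the ___ size.
>>> text1.similar("monstrous")
(3) common_contexts finds the contexts shared by two or more words.
>>> text2.common_contexts(["monstrous", "very"])
be_glad am_glad a_pretty is_pretty a_lucky
>>>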
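To make the idea of a concordance concrete, here is a minimal plain-Python sketch of a keyword-in-context listing. It is not NLTK's implementation; the function name, the `width` parameter, and the sample sentence are assumptions made for illustration.

```python
def concordance(tokens, target, width=3):
    """Return each occurrence of target with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Made-up sample text, already split into tokens.
tokens = "the monstrous whale and the monstrous size of the sea".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

Each printed line shows the matched word in brackets with its neighbours, which is the same information text1.concordance() displays for a real corpus.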
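2. Lexical dispersion plot
A word's position in a text is the number of words that precede it, counting from the beginning. A dispersion plot displays this positional information.
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>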
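The data behind a dispersion plot is just a list of offsets per word. The sketch below computes those offsets in plain Python; `word_offsets` and the sample sentence are hypothetical, and no plotting is done.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions (counted from 0) where it occurs."""
    offsets = {t: [] for t in targets}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets

# Made-up sample text, already split into tokens.
tokens = "we the citizens value freedom and the citizens defend freedom".split()
print(word_offsets(tokens, ["citizens", "freedom", "duties"]))
```

Plotting each list of offsets on its own horizontal line would give the kind of picture dispersion_plot() draws.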
3. Counting words
>>> len(text3)
44764
4. Text --> sorted vocabulary
>>> sorted(set(text3))
5. Lexical richness
>>> from __future__ import division  # needed on Python 2 only; Python 3 already uses true division
>>> len(text3) / len(set(text3))
16.050197203298673
>>>
6. Count and percentage of a word in a text
>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312
>>>
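The counting steps in sections 3 to 6 can be tried on any list of tokens. The following sketch wraps them in two small helpers; the helper names and the sample sentence are assumptions, not part of the notes above.

```python
def lexical_diversity(tokens):
    """Average number of times each distinct word is used."""
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Share of the text, in percent, taken up by `count` tokens."""
    return 100 * count / total

# Made-up sample text, already split into tokens.
tokens = "in the beginning god created the heaven and the earth".split()

print(len(tokens))                                     # number of tokens
print(sorted(set(tokens)))                             # sorted vocabulary
print(lexical_diversity(tokens))                       # tokens per distinct word
print(percentage(tokens.count("the"), len(tokens)))    # percentage of "the"
```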
7. Indexing lists
(1) The number that gives an element's position is called its index.
>>> text1[50]
'grammars'
>>>
(2) Find the index of a word's first occurrence.
>>> text1.index('grammars')
50
>>>
8. Slicing extracts a run of words (a fragment of the text).
>>> text1[100:120]
['and', 'to', 'teach', 'them', 'by', 'what', 'name', 'a', 'whale', '-', 'fish',
'is', 'to', 'be', 'called', 'in', 'our', 'tongue', 'leaving', 'out']
>>>
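Indexing and slicing behave the same way on an ordinary Python list, which makes the rules easy to check on a small example. The token list below is made up for illustration.

```python
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "me"]

third = tokens[2]               # indices start at 0, so this is the third element
first_me = tokens.index("me")   # index() reports only the first occurrence
middle = tokens[4:7]            # a slice includes the start index and excludes the end index
tail = tokens[-2:]              # negative indices count from the end

print(third, first_me, middle, tail)
```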
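9. Functions defined for NLTK's frequency distribution class
Example                        Description
fdist = FreqDist(samples)      create a frequency distribution containing the given samples
fdist[sample] += 1             increment the count for this sample (fdist.inc(sample) in NLTK 2)
fdist['monstrous']             number of times the given sample occurred
fdist.freq('monstrous')        frequency of the given sample
fdist.N()                      total number of samples
fdist.most_common(n)           the n most frequent samples with their counts (in NLTK 2, fdist.keys() was sorted by decreasing frequency)
for sample in fdist:           iterate over the samples (in decreasing frequency order in NLTK 2)
fdist.max()                    sample with the greatest count
fdist.tabulate()               tabulate the frequency distribution
fdist.plot()                   plot the frequency distribution
fdist.plot(cumulative=True)    plot the cumulative frequency distribution
fdist1 < fdist2                test whether samples in fdist1 occur less frequently than in fdist2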
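Several of the operations in this table can be approximated with the standard library's collections.Counter, which NLTK 3's FreqDist subclasses. The sketch below uses Counter directly so that it runs without NLTK; the sample tokens are made up, and the comments name the FreqDist call each line stands in for.

```python
from collections import Counter

# Made-up sample text, already split into tokens.
samples = "the whale the sea the ship a whale".split()
fdist = Counter(samples)

total = sum(fdist.values())                                # stands in for fdist.N()
freq_the = fdist["the"] / total                            # stands in for fdist.freq('the')
most_frequent = fdist.most_common(1)[0][0]                 # stands in for fdist.max()
top_two = fdist.most_common(2)                             # the two most frequent samples
hapaxes = sorted(w for w, c in fdist.items() if c == 1)    # stands in for fdist.hapaxes()

print(total, freq_the, most_frequent, top_two, hapaxes)
```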
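text1.concordance("monstrous") # search for a word and show its contexts
text1.similar("monstrous") # find words that occur in similar contexts
text2.common_contexts(["monstrous", "very"]) # contexts shared by two or more words
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) # the corpus is concatenated in chronological order, so this plots where the words occur in it and can be used to study how language use changes over time
text3.generate() # generate random text from the word-sequence statistics of text3 (is this how computers write SCI papers?)
len(text3) / len(set(text3)) # average number of uses per word, also called lexical richness
100 * text3.count("smote") / len(text3) # percentage of the text taken up by a specific word
Tokens: all words
Types: unique words
FreqDist(text1).most_common(50) # the 50 most frequent words in text1 (FreqDist(text1).keys()[:50] in NLTK 2); FreqDist([]) counts the frequency of each element of a list
FreqDist(text1).hapaxes() # words that occur only once
list(bigrams(['more', 'is', 'said', 'than', 'done'])) # build bigrams, i.e. [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
text4.collocations() # print the collocations (frequently co-occurring bigrams) in the text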
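Bigrams are simply adjacent pairs of tokens, so they can be built with zip. The sketch below is a plain-Python stand-in for nltk.bigrams, followed by a count of the most frequent pair as a rough illustration of what a collocation finder starts from; the helper name and the second sample sentence are assumptions.

```python
from collections import Counter

def make_bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

pairs = make_bigrams(["more", "is", "said", "than", "done"])
print(pairs)

# Counting pairs in a made-up text shows which bigram occurs most often.
tokens = "red wine and red wine and white wine".split()
pair_counts = Counter(make_bigrams(tokens))
print(pair_counts.most_common(1))
```

Note that real collocation detection also weighs how frequent each word is on its own; raw pair counts are only the first step.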
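nltk.Text(gutenberg.words('austen-emma.txt')) # wrap the words in a Text object; only then can concordance and similar functions be used
gutenberg.raw(fileid) # the raw text content
gutenberg.words(fileid) # the text as a list of words
gutenberg.sents(fileid) # the text as a list of sentences
wordlists = PlaintextCorpusReader(corpus_root, '.*') # load your own corpus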
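cfdist = ConditionalFreqDist(pairs)     create a conditional frequency distribution from a list of pairs
cfdist.conditions()                     the conditions (NLTK 2 returned them sorted alphabetically)
cfdist[condition]                       the frequency distribution for this condition
cfdist[condition][sample]               frequency of the given sample for this condition
cfdist.tabulate()                       tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)    tabulation limited to the specified samples and conditions
cfdist.plot()                           plot the conditional frequency distribution
cfdist.plot(samples, conditions)        plot limited to the specified samples and conditions
cfdist1 < cfdist2                       test whether samples in cfdist1 occur less frequently than in cfdist2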
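A conditional frequency distribution is one frequency distribution per condition. The sketch below models that with a dictionary of Counters from the standard library rather than with NLTK's class; the (condition, sample) pairs are made up.

```python
from collections import Counter, defaultdict

# Made-up (condition, sample) pairs, e.g. (genre, word).
pairs = [("news", "the"), ("news", "said"), ("news", "the"),
         ("romance", "love"), ("romance", "the")]

cfdist = defaultdict(Counter)
for condition, sample in pairs:
    cfdist[condition][sample] += 1

conditions = sorted(cfdist)             # comparable to cfdist.conditions()
news_dist = cfdist["news"]              # the distribution for one condition
news_the = cfdist["news"]["the"]        # count of one sample under one condition

print(conditions, dict(news_dist), news_the)
```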
Application of conditional probability:
# -*- encoding: utf-8 -*-
import nltk

def generate_model(cfdist, word, num=15):
    # Print num words, moving each time to the most frequent successor of the current word.
    for i in range(num):
        print(word)
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
print(cfd['living'])
generate_model(cfd, 'living')
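The same greedy generation can be tried without downloading a corpus. The sketch below rebuilds the bigram table with the standard library and returns the generated words as a list instead of printing them; the token list, `build_cfd`, and the early exit for words with no successor are assumptions added for illustration.

```python
from collections import Counter, defaultdict

def build_cfd(tokens):
    """Count, for every word, which words follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        cfd[first][second] += 1
    return cfd

def generate_model(cfd, word, num=5):
    """Follow the most frequent successor num times, starting from word."""
    generated = []
    for _ in range(num):
        generated.append(word)
        if not cfd[word]:        # the word was never followed by anything
            break
        word = cfd[word].most_common(1)[0][0]
    return generated

# Made-up sample text, already split into tokens.
tokens = "the living creature and the living soul and the living creature".split()
cfd = build_cfd(tokens)
print(generate_model(cfd, "living"))
```

Because the model always picks the single most frequent successor, the output quickly falls into a cycle, which is the main limitation of this approach.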
nltk.corpus.stopwords.words('english') # stop words
nltk.corpus.names # corpus of personal names
wordnet.synsets('car') # synonym sets (synsets)
wordnet.lemmas('car') # all lemmas for the word car
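Downloading, reading, and processing text from the web
from urllib.request import urlopen # Python 3; in Python 2 use: from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read().decode('utf8')
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
raw = nltk.clean_html(html) # strips HTML markup but cannot remove navigation text and similar clutter; NLTK 3 dropped this function and points to BeautifulSoup's get_text() instead
import feedparser
blog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
blog['feed']['title']
post = blog.entries[2]
tokens = nltk.word_tokenize(raw) # tokenize
text = nltk.Text(tokens) # only then can text.collocations() and similar functions be used
# Decoding
import codecs
f = codecs.open(path, encoding='latin2')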
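In Python 3 the built-in open() accepts an encoding argument, so the decoding step can be checked with a round trip. The sketch below writes a short Latin-2 file to a temporary directory and reads it back; the file name and the sample string are made up.

```python
import os
import tempfile

text = "Pruska Biblioteka Państwowa"
path = os.path.join(tempfile.mkdtemp(), "polish-lat2.txt")  # hypothetical file name

# Write the text as Latin-2 (ISO 8859-2) bytes.
with open(path, "w", encoding="latin2") as f:
    f.write(text)

# Reading in binary mode gives the undecoded bytes.
with open(path, "rb") as f:
    raw_bytes = f.read()

# Reading with the matching encoding gives the original string back.
with open(path, encoding="latin2") as f:
    decoded = f.read()

print(raw_bytes)
print(decoded)
```

Latin-2 is a single-byte encoding, so the byte string has exactly one byte per character of the original text.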
# Regular expressions
import re
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') ==> ['ing']
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') ==> ['processing']
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes') ==> [('processe', 's')]
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes') ==> [('process', 'es')]
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language') ==> [('language', '')]
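The last pattern above, with a non-greedy stem and an optional suffix, is enough to build a crude suffix stripper. The sketch below wraps it in a function; the function name `stem` is chosen for illustration, and the approach is far cruder than a real stemmer.

```python
import re

SUFFIX_PATTERN = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'

def stem(word):
    """Strip one known suffix, if present, using the non-greedy pattern."""
    stem_part, suffix = re.findall(SUFFIX_PATTERN, word)[0]
    return stem_part

for word in ["processing", "processes", "language"]:
    print(word, "->", stem(word))
```

The non-greedy `.*?` makes the stem as short as possible, so the longest matching suffix wins, and the trailing `?` lets words without any listed suffix pass through unchanged.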
# Finding hypernyms and hyponyms
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
This gives:
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
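The "X and other Ys" pattern can also be tried on a raw string with the standard re module. The sketch below is a rough equivalent of the token-based search above, not the Text.findall method itself; the function name and the sample sentence are assumptions.

```python
import re

def hypernym_candidates(text):
    """Return (hyponym, hypernym) candidate pairs matched by 'X and other Ys'."""
    return re.findall(r"(\w+) and other (\w+s)\b", text)

# Made-up sample sentence.
sample = "They sold pearls and other jewels near roads and other features."
print(hypernym_candidates(sample))
```

In each pair the first word is a candidate hyponym and the second a candidate hypernym, for example pearls as a kind of jewels. The pattern also produces false positives, so the results need manual checking.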
# Stemming
tokens = nltk.word_tokenize(raw)
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]
# Lemmatization
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]
# Tokenization
nltk.regexp_tokenize(text, pattern) # split text into the tokens that match a regular expression
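Regular-expression tokenization amounts to listing every non-overlapping match of a token pattern. The sketch below does this with re.findall from the standard library; the pattern and the function name are assumptions chosen for a small demonstration, not the pattern NLTK uses.

```python
import re

# Hypothetical pattern: words (allowing internal hyphens or apostrophes),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = r"\w+(?:[-']\w+)*|[^\w\s]"

def simple_regexp_tokenize(text, pattern=TOKEN_PATTERN):
    """Return all non-overlapping matches of pattern, in order."""
    return re.findall(pattern, text)

print(simple_regexp_tokenize("It's a well-known fact, isn't it?"))
```

The group in the pattern is non-capturing, so findall returns whole tokens. A capturing group would make it return only the captured part.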
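# Procedural versus declarative style in Python
# Find the longest words in a text
maxlen = max(len(word) for word in text)
[word for word in text if len(word) == maxlen] # an idiom worth learning and using often
lengths = list(map(len, nltk.corpus.brown.sents(categories="news"))) # list() is needed in Python 3, where map returns an iterator
avg = sum(lengths) / len(lengths)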
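To see the difference between the two styles, the sketch below solves the longest-word task both ways on a small made-up word list and then computes an average sentence length. All sample data is invented for illustration.

```python
text = ["a", "whale", "of", "unimaginable", "size", "and", "considerable", "power"]

# Procedural style: track the running maximum step by step.
longest = []
maxlen = 0
for word in text:
    if len(word) > maxlen:
        maxlen = len(word)
        longest = [word]
    elif len(word) == maxlen:
        longest.append(word)

# Declarative style: state what is wanted and let built-ins do the looping.
maxlen_decl = max(len(word) for word in text)
longest_decl = [word for word in text if len(word) == maxlen_decl]

# Average sentence length over made-up sentences.
sents = [["a", "b"], ["c", "d", "e", "f"]]
lengths = list(map(len, sents))
avg = sum(lengths) / len(lengths)

print(longest, longest_decl, avg)
```

Both versions give the same result. The declarative one passes over the text twice but is shorter and harder to get wrong.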
set() # sets are hash-indexed internally, so prefer a set whenever you test membership
matplotlib # plotting library
NetworkX # network (graph) visualization
