Каращук Біляєва Проєкт Корчинський.ipynb - Colaboratory
Mounted at /content/drive
Collecting emot
Downloading emot-3.1-py3-none-any.whl (61 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.5/61.5 kB 543.7 kB/s eta 0:00:00
Installing collected packages: emot
Successfully installed emot-3.1
Collecting advertools
Downloading advertools-0.14.1-py2.py3-none-any.whl (321 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 321.9/321.9 kB 4.7 MB/s eta 0:00:00
Requirement already satisfied: pandas>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from advertools) (1.5.3)
Requirement already satisfied: pyasn1>=0.4 in /usr/local/lib/python3.10/dist-packages (from advertools) (0.5.1)
Collecting scrapy>=2.5.0 (from advertools)
Downloading Scrapy-2.11.1-py2.py3-none-any.whl (287 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 287.8/287.8 kB 11.5 MB/s eta 0:00:00
Collecting twython>=3.8.0 (from advertools)
Downloading twython-3.9.1-py3-none-any.whl (33 kB)
Requirement already satisfied: pyarrow>=5.0.0 in /usr/local/lib/python3.10/dist-packages (from advertools) (14.0.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.1.0-
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.1.0->advertool
Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.1.0->advertoo
Collecting Twisted>=18.9.0 (from scrapy>=2.5.0->advertools)
Downloading twisted-23.10.0-py3-none-any.whl (3.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 16.2 MB/s eta 0:00:00
Requirement already satisfied: cryptography>=36.0.0 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.5.0->a
Collecting cssselect>=0.9.1 (from scrapy>=2.5.0->advertools)
Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting itemloaders>=1.0.1 (from scrapy>=2.5.0->advertools)
Downloading itemloaders-1.1.0-py3-none-any.whl (11 kB)
Collecting parsel>=1.5.0 (from scrapy>=2.5.0->advertools)
Downloading parsel-1.8.1-py2.py3-none-any.whl (17 kB)
Requirement already satisfied: pyOpenSSL>=21.0.0 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.5.0->adve
Collecting queuelib>=1.4.2 (from scrapy>=2.5.0->advertools)
Downloading queuelib-1.6.2-py2.py3-none-any.whl (13 kB)
Collecting service-identity>=18.1.0 (from scrapy>=2.5.0->advertools)
Downloading service_identity-24.1.0-py3-none-any.whl (12 kB)
Collecting w3lib>=1.17.0 (from scrapy>=2.5.0->advertools)
Downloading w3lib-2.1.2-py3-none-any.whl (21 kB)
Collecting zope.interface>=5.1.0 (from scrapy>=2.5.0->advertools)
Downloading zope.interface-6.2-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux20
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 247.3/247.3 kB 15.9 MB/s eta 0:00:00
Collecting protego>=0.1.15 (from scrapy>=2.5.0->advertools)
Downloading Protego-0.3.0-py2.py3-none-any.whl (8.5 kB)
Collecting itemadapter>=0.1.0 (from scrapy>=2.5.0->advertools)
Downloading itemadapter-0.8.0-py3-none-any.whl (11 kB)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.5.0->advertools
Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.5.0->advertools)
Collecting tldextract (from scrapy>=2.5.0->advertools)
Downloading tldextract-5.1.1-py3-none-any.whl (97 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 97.7/97.7 kB 10.4 MB/s eta 0:00:00
Requirement already satisfied: lxml>=4.4.1 in /usr/local/lib/python3.10/dist-packages (from scrapy>=2.5.0->advertools
Collecting PyDispatcher>=2.0.5 (from scrapy>=2.5.0->advertools)
Downloading PyDispatcher-2.0.7-py3-none-any.whl (12 kB)
Requirement already satisfied: requests>=2.1.0 in /usr/local/lib/python3.10/dist-packages (from twython>=3.8.0->adve
Requirement already satisfied: requests-oauthlib>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from twython>=3.8
Requirement already satisfied: cffi>=1.12 in /usr/local/lib/python3.10/dist-packages (from cryptography>=36.0.0->scra
Collecting jmespath>=0.9.5 (from itemloaders>=1.0.1->scrapy>=2.5.0->advertools)
Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pymorphy3
import tokenize_uk
import re
import string
import spacy
from collections import Counter
nlp = spacy.load("uk_core_news_lg")
import matplotlib.pyplot as plt
import seaborn as sns
import emot
import advertools as adv
df = pd.read_csv("/content/all_posts.csv")
df.head()
Date-Time Post
stops = []
with open("/content/stopwords_ua.txt", encoding="utf-8") as file:
    for line in file.readlines():
        stops.append(line.strip())
df.rename(columns={'Date-Time': 'date_time', 'Post': 'text'}, inplace=True)
df['text']= df['text'].astype(str)
print(df.columns)
# Split the DataFrame into two groups based on the split date
# (split_date is defined in a cell not shown in this printout)
before_date = df[df['date_time'] < split_date]
after_date = df[df['date_time'] >= split_date]
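Comparing `date_time` as raw strings only works for ISO-like formats; converting the column with `pd.to_datetime` first makes the split robust. A self-contained sketch (the sample rows and the `split_date` value here are illustrative, not from the notebook):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'date_time': ['2022-01-10 12:00', '2023-05-01 09:30'],
    'text': ['перший пост', 'другий пост'],
})

# Parse the timestamps so the comparison is chronological, not lexical
df_demo['date_time'] = pd.to_datetime(df_demo['date_time'])
split_date = pd.Timestamp('2022-02-24')  # illustrative split point

before_demo = df_demo[df_demo['date_time'] < split_date]
after_demo = df_demo[df_demo['date_time'] >= split_date]
```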
morph = pymorphy3.MorphAnalyzer(lang='uk')
def find_verbs(df):
    verbs = []
    for item in df['text'].tolist():
        doc = nlp(item)
        for token in doc:
            if token.pos_ == "VERB":
                verbs.append(token.text)
    return verbs
def lemmatize(text):
    # Lemmatize each token and drop stopwords (the printed line is truncated;
    # the stopword filter is reconstructed)
    return ' '.join(
        morph.parse(word)[0].normal_form
        for word in tokenize_uk.tokenize_words(text)
        if morph.parse(word)[0].normal_form not in stops
    )
def remove_links_and_punctuation(text):
    # Define a regular expression pattern to match URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    # Use the sub() function to replace matched URLs with an empty string
    text_without_links = url_pattern.sub('', text)
    # Strip punctuation (the printed cell returned an undefined variable)
    text_without_punctuation = text_without_links.translate(
        str.maketrans('', '', string.punctuation))
    return text_without_punctuation
verbs_before = find_verbs(before_date)
verbs_after = find_verbs(after_date)
<ipython-input-15-da908878dbe3>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
def find_verbs(df):
    print(type(df))    # Check the type of df
    print(df.columns)  # Check the columns in df
    verbs = []
    for item in df['Post'].tolist():
        doc = nlp(item)
        for token in doc:
            if token.pos_ == 'VERB':
                verbs.append(token.lemma_)
    return verbs
def return_key_words(data, n=20):
    # Create a TF-IDF vectorizer and rank words by summed TF-IDF score
    # (only the vectorizer line survived the printout; the ranking and the
    # calls below are reconstructed)
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(data)
    scores = tfidf.sum(axis=0).A1
    top = scores.argsort()[::-1][:n]
    return vectorizer.get_feature_names_out()[top].tolist()

keywords_before = return_key_words(before_date['lemmatized'])
keywords_after = return_key_words(after_date['lemmatized'])

keys = pd.DataFrame({
    'Ключові слова 1': keywords_before,
    'Ключові слова 2': keywords_after,
})
def analyze_verbs(verbs):
    # Tally verb forms by grammatical number (the rest of this cell is cut
    # off in the printout; the loop is reconstructed with pymorphy3)
    singular_counter = Counter()
    plural_counter = Counter()
    for verb in verbs:
        number = morph.parse(verb)[0].tag.number
        if number == 'sing':
            singular_counter[verb] += 1
        elif number == 'plur':
            plural_counter[verb] += 1
    return singular_counter, plural_counter
keys
0 україна україна
1 братство братство
2 батальйон батальйон
3 том йога
4 йога том
5 канал боєць
6 український військовий
7 день життя
8 корчинський канал
9 допомога посилання
10 життя церква
11 українська український
12 перемога день
13 слава бойовий
14 українець театр
15 війнути зараза
16 мен просити
17 боєць бог
18 бог збір
19 війна допомога
res_verbs_after = analyze_verbs(verbs_after)
res_verbs_before = analyze_verbs(verbs_before)
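The cells that build `df_verbs_after` and `df_verbs_before` from the two counters are not shown in the printout. The side-by-side tables below could be produced with a helper like this (a sketch with toy counts; the helper name and layout are assumptions):

```python
import pandas as pd
from collections import Counter

def verbs_to_df(singular, plural, n=10):
    # Pair the n most frequent singular and plural forms column-wise
    return pd.DataFrame({
        'singular': [f'{w} {c}' for w, c in singular.most_common(n)],
        'plural': [f'{w} {c}' for w, c in plural.most_common(n)],
    })

# Toy counters standing in for the output of analyze_verbs
sing = Counter({'є': 3, 'має': 2})
plur = Counter({'просимо': 4, 'мають': 1})
df_verbs_demo = verbs_to_df(sing, plur, n=2)
```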
df_verbs_after
0 є 142 просимо 78
1 має 65 мають 32
2 може 49 закликаємо 27
3 будь 46 маємо 25
4 бере 42 могли 25
5 буде 33 можуть 22
6 виголошу 31 отримають 21
7 хоче 26 поширте 17
8 немає 22 знаємо 14
9 потребує 19 можете 13
df_verbs_before
0 є 110 просимо 57
1 має 82 мають 45
2 може 81 маємо 41
3 будь 43 можете 27
4 буде 39 зібрали 24
5 виголошу 31 можуть 21
6 знаю 30 відкриваємо 20
7 виконує 28 московити 20
8 відбувається 25 знаєте 19
9 немає 24 дякуємо 19
# # Apply the count_words function to each row in the 'text_column' and create a new column 'word_count'
# before_date['word_count'] = before_date['text'].apply(count_words)
# # Apply the count_words function to each row in the 'text_column' and create a new column 'word_count'
# after_date['word_count'] = after_date['text'].apply(count_words)
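`get_top_ngram`, used in the next cell, is not defined anywhere in the printout. A common implementation based on `CountVectorizer` (the function name and signature are taken from the call below; the body is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_top_ngram(corpus, n_top, n):
    # Count all n-grams across the corpus and return the n_top most frequent
    vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)[:n_top]

# Small usage check on toy documents
top_bigrams = get_top_ngram(["слава україні слава", "слава україні"], 2, 2)
```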
top_n_bigrams=get_top_ngram(before_date['lemmatized'],15, 2)
x,y=map(list,zip(*top_n_bigrams))
sns.barplot(x=y,y=x).set(title='15 найчастіше вживаних біграм часовий період 1')
/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py:409
warnings.warn(
[Text(0.5, 1.0, '15 найчастіше вживаних біграм часовий період 1')]
top_n_bigrams=get_top_ngram(after_date['lemmatized'],15, 2)
x,y=map(list,zip(*top_n_bigrams))
sns.barplot(x=y,y=x).set(title='15 найчастіше вживаних біграм часовий період 2')
print(most_common_emojis(emojis_s))
print(most_common_emojis(emojis_n))
{}
{}
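Both dictionaries come back empty, and neither `most_common_emojis` nor the `emojis_s`/`emojis_n` inputs appear in the printout. A self-contained sketch of the intended pattern (function names taken from the calls above; the regex covers only the main emoji blocks and is an assumption):

```python
import re
from collections import Counter

# Rough coverage of common emoji code points (not exhaustive)
EMOJI_RE = re.compile(r'[\U0001F300-\U0001FAFF\u2600-\u27BF]')

def extract_emojis(texts):
    # Count every emoji character across a list of posts
    return Counter(ch for t in texts for ch in EMOJI_RE.findall(t))

def most_common_emojis(counter, n=10):
    # Return the n most frequent emojis as a dict
    return dict(counter.most_common(n))

demo_counts = extract_emojis(["🔥🔥 слава", "🔥"])
```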
emot_obj = emot.core.emot()
emoticons_s = Counter(emot_obj.emoticons(before_date['text'].to_string())['value'])
emoticons_n = Counter(emot_obj.emoticons(after_date['text'].to_string())['value'])
print(emoticons_s)
print(emoticons_n)
Counter({':/': 354, '=3': 5, '=D': 3, 'XP': 2, '8D': 2, 'XD': 2, '=p': 2, '=L': 2, 'DX': 1, 'D8': 1})
Counter({':/': 361, '=D': 3, '=3': 2, ':]': 2, 'QQ': 1, '=L': 1, 'DX': 1, 'D8': 1, '8D': 1, '=p': 1})
# def find_entities(text,n):
# doc = nlp(text)
# return dict(Counter([ent.text for ent in doc.ents]).most_common(n))
# ents_n
# ents_s
В: 200
А: 196
У: 157
Я: 121
БРАТСТВО: 104
ЗСУ: 89
З: 82
ФСБ: 63
О: 60
ТРО: 50
УПЦ: 49
БРАТСТВА: 44
СБУ: 38
ПОТРЕБИ: 26
РФ: 23
НАТО: 19
ЗИМОВИЙ: 16
США: 13
ЗБОРУ: 12
Р: 11
МП: 11
С: 11
НЕ: 10
РПЦ: 10
ВТ: 9
Х: 9
УНСО: 9
АТО: 9
НОВИНИ: 9
ФРОНТУ: 9
ХХ: 8
И: 8
МАГАТЕ: 8
ТЕЛЕГРАМ: 8
КАНАЛ: 8
ССО: 7
УВАГА: 7
Д: 7
ЗА: 7
ВР: 7
Й: 7
МВС: 6
МОЗ: 6
ООС: 6
ООН: 6
КМДА: 6
БМП: 6
ВОЦ: 6
МО: 5
ГУР: 5
УБН: 5
М: 5
ПЦУ: 5
АЕС: 5
ЗАЕС: 5
НА: 5
Т: 5
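The frequency lists above and below rely on an `all_caps_words` column whose construction is not shown in the printout. A plausible extraction (the function name and regex are assumptions; single capitals match too, which is consistent with entries like В and А above):

```python
import re
from collections import Counter

def find_all_caps(text):
    # Words written entirely in capitals (Ukrainian Cyrillic and Latin);
    # the \b anchors reject words that continue in lowercase
    return Counter(re.findall(r'\b[А-ЯЇІЄҐA-Z]{1,}\b', text))

# In the notebook, presumably:
# after_date['all_caps_words'] = after_date['text'].apply(find_all_caps)

caps_demo = find_all_caps("ЗСУ і США, потім ЗСУ. Братство")
```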
# Extract word frequency statistics
word_stats = after_date['all_caps_words'].sum()
word_stats_common = word_stats.most_common()
А: 167
В: 148
У: 148
БРАТСТВО: 122
Я: 84
КУЛЬТ: 78
З: 51
ЗСУ: 51
БРАТСТВА: 44
О: 39
ФСБ: 28
УПЦ: 24
РФ: 22
НЕ: 15
ЛГБТ: 15
США: 14
КОЛАБОРАНТ: 12
НА: 11
СБУ: 10
КМС: 10
ГРН: 10
М: 9
МВС: 9
МП: 8
РПЦ: 8
ПЦУ: 8
ТРО: 8
ШТ: 8
Д: 7
Й: 7
ДШВ: 7
ГО: 7
ЗА: 7
ТЗ: 7
ЗС: 6
П: 6
ДБР: 6
ВЛК: 6
МЕРЧ: 6
С: 6
ВЖЕ: 6
ССО: 5
ХХ: 5
АТО: 5
ДО: 5
ООН: 5
ЖИТТЯ: 5
ППО: 5
ОВА: 5
СП: 5
Р: 5
Х: 5
ГЕС: 5
УНР: 4
ООС: 4
АКС: 4
МО: 4
ВР: 4
ЦВЛК: 4
df['Post']= df['Post'].astype(str)
df.rename(columns={'Date-Time': 'date_time', 'Post': 'text'}, inplace=True)
import pandas as pd
import spacy
import re
nlp = spacy.load("uk_core_news_lg")
df = pd.read_csv("/content/all_posts.csv")
def count_quotes(text):
    quote_pattern = re.compile(r'[«»"\'“”„]')
    # findall is missing from the printed cell; count every quote character
    quote_matches = quote_pattern.findall(text)
    return len(quote_matches)
df['quote_count'] = df['text'].apply(count_quotes)
number_of_quotes_used = df['quote_count'].sum()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'text'
The above exception was the direct cause of the following exception:
KeyError: 'text'
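The `KeyError: 'text'` arises because this cell reloads the CSV without repeating the `Post` → `text` rename. Renaming first avoids it; a self-contained restatement with a toy row (sample data is illustrative):

```python
import re
import pandas as pd

def count_quotes(text):
    # Count quote characters of the kinds used in the posts
    return len(re.findall(r'[«»"\'“”„]', text))

# The failing cell read the raw CSV, so only 'Post' exists; rename first
df_demo = pd.DataFrame({'Date-Time': ['2023-01-01'],
                        'Post': ['Він сказав «слава»']})
df_demo.rename(columns={'Date-Time': 'date_time', 'Post': 'text'},
               inplace=True)
df_demo['quote_count'] = df_demo['text'].apply(count_quotes)
```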
Collecting stanza
Downloading stanza-1.7.0-py3-none-any.whl (933 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 933.2/933.2 kB 17.6 MB/s eta 0:00:00
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (3.2.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Collecting emoji (from stanza)
Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 421.5/421.5 kB 44.7 MB/s eta 0:00:00
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from stanza) (1.25.2)
Requirement already satisfied: protobuf>=3.15.0 in /usr/local/lib/python3.10/dist-packages (from stanza) (3.20.3)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from stanza) (2.31.0)
Requirement already satisfied: toml in /usr/local/lib/python3.10/dist-packages (from stanza) (0.10.2)
Requirement already satisfied: torch>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from stanza) (2.1.0+cu121)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from stanza) (4.66.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (4.48.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotli
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.3.0->stanza) (3.13.1)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.3.0->stanza)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.3.0->stanza) (1.12)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.3.0->stanza) (3.1.3)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch>=1.3.0->stanza) (2023.6.0)
Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.3.0->stanza) (2.1
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->stanz
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->stanza) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->stanza) (2.
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->stanza) (20
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.3.0->st
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.3.0->stanza
Installing collected packages: emoji, stanza
Successfully installed emoji-2.10.1 stanza-1.7.0
import pandas as pd
import stanza
import networkx as nx
import matplotlib.pyplot as plt
import mpld3
mpld3.enable_notebook()

# The pipeline construction is not shown in the printout; the log below
# indicates a Ukrainian pipeline with tokenize, mwt and ner processors
nlp_stanza = stanza.Pipeline('uk', processors='tokenize,mwt,ner')

df = pd.read_csv("/content/ryslan_martsinkiv.csv")
def extract_entities(text):
    doc = nlp_stanza(text)
    entities = [(ent.text, ent.type)
                for sent in doc.sentences for ent in sent.ents]
    return entities

df['entities'] = df['text'].apply(extract_entities)
G = nx.Graph()
# The edge-building code is missing from the printed cells; a plausible
# reconstruction links entities that co-occur in the same post
for entities in df['entities']:
    names = [text for text, _ in entities]
    G.add_edges_from(zip(names, names[1:]))

pos = nx.spring_layout(G)
node_colors = list(range(len(G.nodes)))  # simple sequential colouring
plt.figure(figsize=(10, 8))
nx.draw(G, pos, with_labels=True, font_size=8, node_color=node_colors,
        cmap=plt.cm.Paired, font_color='black', node_size=500)
plt.title("Named Entity Recognition (NER) Visualization")
plt.show()
resources/main/resources_1.7.0.json:
WARNING:stanza:Can not find tokenize: languagetool from official model list. Ignoring it.
WARNING:stanza:Can not find ner: languagetool from official model list. Ignoring it.
INFO:stanza:Downloading these customized packages for language: uk (Ukrainian)...
=======================
| Processor | Package |
-----------------------
=======================
resources/main/resources_1.7.0.json:
WARNING:stanza:Language uk package default expects mwt, which has been added
INFO:stanza:Loading these models for language: uk (Ukrainian):
=======================
| Processor | Package |
-----------------------
| tokenize | iu |
| mwt | iu |
| ner | languk |
=======================
ner_counts = {}
# The loop body is cut off in the printout; count occurrences per entity type
for entities in df['entities']:
    for ent_text, ent_type in entities:
        ner_counts[ent_type] = ner_counts.get(ent_type, 0) + 1