
Каращук Біляєва Проєкт Корчинський.ipynb - Colaboratory

!pip install tokenize_uk
!pip install pymorphy3
!pip install pymorphy3-dicts-uk
!python3 -m spacy download uk_core_news_lg

Collecting pymorphy3-dicts-ru (from pymorphy3)
Downloading pymorphy3_dicts_ru-2.4.417150.4580142-py2.py3-none-any.whl (8.4 MB)
Installing collected packages: pymorphy3-dicts-ru, dawg-python, pymorphy3
Successfully installed dawg-python-0.7.2 pymorphy3-2.0.1 pymorphy3-dicts-ru-2.4.417150.4580142
Collecting pymorphy3-dicts-uk
Downloading pymorphy3_dicts_uk-2.4.1.1.1663094765-py2.py3-none-any.whl (8.2 MB)
Successfully installed pymorphy3-dicts-uk-2.4.1.1.1663094765
Collecting uk-core-news-lg==3.7.0
Downloading https://github.com/explosion/spacy-models/releases/download/uk_core_news_lg-3.7.0/uk_core_news_lg-3.7.0 (231.2 MB)
Installing collected packages: uk-core-news-lg
Successfully installed uk-core-news-lg-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('uk_core_news_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

!pip install emot
!pip install advertools

Collecting emot
Downloading emot-3.1-py3-none-any.whl (61 kB)
Installing collected packages: emot
Successfully installed emot-3.1
Collecting advertools
Downloading advertools-0.14.1-py2.py3-none-any.whl (321 kB)
Collecting scrapy>=2.5.0 (from advertools)
Downloading Scrapy-2.11.1-py2.py3-none-any.whl (287 kB)
Collecting twython>=3.8.0 (from advertools)
Downloading twython-3.9.1-py3-none-any.whl (33 kB)
Collecting Twisted>=18.9.0 (from scrapy>=2.5.0->advertools)
Collecting cssselect>=0.9.1 (from scrapy>=2.5.0->advertools)
Collecting itemloaders>=1.0.1 (from scrapy>=2.5.0->advertools)
Collecting parsel>=1.5.0 (from scrapy>=2.5.0->advertools)
Collecting queuelib>=1.4.2 (from scrapy>=2.5.0->advertools)
Collecting service-identity>=18.1.0 (from scrapy>=2.5.0->advertools)
Collecting w3lib>=1.17.0 (from scrapy>=2.5.0->advertools)
Collecting zope.interface>=5.1.0 (from scrapy>=2.5.0->advertools)
Collecting protego>=0.1.15 (from scrapy>=2.5.0->advertools)
Collecting itemadapter>=0.1.0 (from scrapy>=2.5.0->advertools)
Collecting tldextract (from scrapy>=2.5.0->advertools)
Collecting PyDispatcher>=2.0.5 (from scrapy>=2.5.0->advertools)
Collecting jmespath>=0.9.5 (from itemloaders>=1.0.1->scrapy>=2.5.0->advertools)
Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pymorphy3
import tokenize_uk
import re
import string
import spacy
from collections import Counter
nlp = spacy.load("uk_core_news_lg")
import matplotlib.pyplot as plt
import seaborn as sns
import emot
import advertools as adv

df = pd.read_csv("/content/all_posts.csv")

df.head()

Date-Time Post

0 2023-12-24 23:08:09 __Минув рік від загибелі Непийпива, Святоші, Т...

1 2023-12-24 22:38:06 Рік тому, в Різдвяну ніч 2022 одна з диверсійн...

2 2023-12-24 20:06:29 Браття, дуже важливо підписати цю [петицію](ht...

3 2023-12-24 17:06:18 Слава Ісусу Христу! \n\nБраття та сестри, в це...

4 2023-12-24 16:01:21 26 грудня у КУЛЬТ театрі\n\nМоторошна містична...

stops = []
with open("/content/stopwords_ua.txt", encoding="utf-8") as file:
    for line in file.readlines():
        stops.append(line.strip())

stops.extend(['telegra', 'file', 'ph'])

df.rename(columns={'Date-Time': 'date_time', 'Post': 'text'}, inplace=True)

df['text'] = df['text'].astype(str)

print(df.columns)

Index(['date_time', 'text'], dtype='object')

df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)

# Convert 'date_time' column to datetime type
df['date_time'] = pd.to_datetime(df['date_time'])

# Define the split date
split_date = pd.to_datetime('2023-02-25')

# Split the DataFrame into two groups based on the split date; .copy() keeps the
# groups independent of df and avoids SettingWithCopyWarning when columns are added later
before_date = df[df['date_time'] < split_date].copy()
after_date = df[df['date_time'] >= split_date].copy()
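A quick sanity check that the split produced two non-empty groups (illustrative; the counts depend on the scraped dataset):

print(f"Posts before 2023-02-25: {len(before_date)}")
print(f"Posts on/after 2023-02-25: {len(after_date)}")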

morph = pymorphy3.MorphAnalyzer(lang='uk')

def find_verbs(df):
    verbs = []
    for item in df['text'].tolist():
        doc = nlp(item)
        for token in doc:
            if token.pos_ == "VERB":
                verbs.append(token.text)
    return verbs

def lemmatize(text):
    # the end of this line is cut off in the export; the condition presumably
    # keeps only lemmas that are not in the stopword list loaded above
    return ' '.join(morph.parse(word)[0].normal_form
                    for word in tokenize_uk.tokenize_words(text)
                    if morph.parse(word)[0].normal_form not in stops)

INFO:pymorphy3.opencorpora_dict.wrapper:Loading dictionaries from /usr/local/lib/python3.10/dist-packages/pymorphy3_dict
INFO:pymorphy3.opencorpora_dict.wrapper:format: 2.4, revision: 1, updated: 2022-09-13T18:45:24.998984
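A small smoke test of the lemmatizer on a made-up sentence (not from the corpus; the exact output depends on the pymorphy3 dictionaries and on the stopword list):

# expected to print something like 'браття зібрати кошти батальйон'
print(lemmatize("Браття зібрали кошти на батальйон"))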

def remove_links_and_punctuation(text):
    # Regular expression pattern to match URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')

    # Replace matched URLs with an empty string
    text_without_links = url_pattern.sub('', text)

    # Remove punctuation using the translate method
    translator = str.maketrans("", "", string.punctuation)
    text_without_punctuation = text_without_links.translate(translator)

    return text_without_punctuation
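A quick check of the cleaning function on a made-up string (illustrative):

# note: string.punctuation covers only ASCII marks, so «…» and other
# non-ASCII punctuation is left untouched by this function
print(remove_links_and_punctuation("Підпишіть «петицію» https://example.com/petition дякуємо"))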

verbs_before = find_verbs(before_date)
verbs_after = find_verbs(after_date)

before_date['lemmatized'] = before_date['text'].apply(lambda x: remove_links_and_punctuation(lemmatize(x)))
after_date['lemmatized'] = after_date['text'].apply(lambda x: remove_links_and_punctuation(lemmatize(x)))

<ipython-input-15-da908878dbe3>: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

def find_verbs(df):
    print(type(df))    # Check the type of df
    print(df.columns)  # Check the columns in df

    verbs = []
    # the column was renamed from 'Post' to 'text' earlier, so use the new name
    for item in df['text'].tolist():
        doc = nlp(item)
        for token in doc:
            if token.pos_ == 'VERB':
                verbs.append(token.lemma_)
    return verbs

def return_key_words(data):
    # Create a TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # Fit and transform the documents
    tfidf_matrix = vectorizer.fit_transform([data])

    # Get feature names (words)
    feature_names = vectorizer.get_feature_names_out()

    # Get TF-IDF scores for each word in the first document
    tfidf_scores = tfidf_matrix[0].toarray()[0]

    # Get top N keywords based on TF-IDF scores
    return [feature_names[i] for i in tfidf_scores.argsort()[-20:][::-1]]
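Note that return_key_words fits the vectorizer on a single concatenated document, so the IDF factor is identical for every word and the ranking effectively reduces to raw term frequency. A toy call (made-up input):

# 'кіт' occurs most often and should rank first
print(return_key_words("кіт спить кіт їсть пес гавкає кіт"))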

keywords_before = return_key_words(' '.join(before_date['lemmatized'].tolist()))


keywords_after = return_key_words(' '.join(after_date['lemmatized'].tolist()))

keys = pd.DataFrame({
    'Keywords, period 1': keywords_before,
    'Keywords, period 2': keywords_after,
})

# Create a morphological analyzer
morph_analyzer = pymorphy3.MorphAnalyzer(lang='uk')

def analyze_verbs(verbs):
    singular_counter = Counter()
    plural_counter = Counter()

    # Analyze each verb form
    for verb in verbs:
        parsed = morph_analyzer.parse(verb.lower())[0]
        number = parsed.tag.number

        # Update counters based on the number information
        if number == 'sing':
            singular_counter[verb.lower()] += 1
        elif number == 'plur':
            plural_counter[verb.lower()] += 1
    return singular_counter.most_common(10), plural_counter.most_common(10)

INFO:pymorphy3.opencorpora_dict.wrapper:Loading dictionaries from /usr/local/lib/python3.10/dist-packages/pymorphy3_dict
INFO:pymorphy3.opencorpora_dict.wrapper:format: 2.4, revision: 1, updated: 2022-09-13T18:45:24.998984
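A toy check of the singular/plural split (made-up verb list; the parses depend on the pymorphy3 dictionaries):

singular_top, plural_top = analyze_verbs(['має', 'має', 'мають', 'просимо'])
print(singular_top)  # expected to include ('має', 2)
print(plural_top)    # expected to include ('мають', 1) and ('просимо', 1)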

keys


    Keywords, period 1   Keywords, period 2
0   україна              україна
1   братство             братство
2   батальйон            батальйон
3   том                  йога
4   йога                 том
5   канал                боєць
6   український          військовий
7   день                 життя
8   корчинський          канал
9   допомога             посилання
10  життя                церква
11  українська           український
12  перемога             день
13  слава                бойовий
14  українець            театр
15  війнути              зараза
16  мен                  просити
17  боєць                бог
18  бог                  збір
19  війна                допомога

res_verbs_after = analyze_verbs(verbs_after)
res_verbs_before = analyze_verbs(verbs_before)

# analyze_verbs returns (singular_top10, plural_top10): index 0 holds the
# singular forms, index 1 the plural ones
df_verbs_after = pd.DataFrame({'Verb in Singular': [item[0] for item in res_verbs_after[0]],
                               'Count in Singular': [item[1] for item in res_verbs_after[0]],
                               'Verb in Plural': [item[0] for item in res_verbs_after[1]],
                               'Count in Plural': [item[1] for item in res_verbs_after[1]]})

df_verbs_before = pd.DataFrame({'Verb in Singular': [item[0] for item in res_verbs_before[0]],
                                'Count in Singular': [item[1] for item in res_verbs_before[0]],
                                'Verb in Plural': [item[0] for item in res_verbs_before[1]],
                                'Count in Plural': [item[1] for item in res_verbs_before[1]]})

df_verbs_after

   Verb in Singular  Count in Singular  Verb in Plural  Count in Plural
0  є                 142                просимо         78
1  має               65                 мають           32
2  може              49                 закликаємо      27
3  будь              46                 маємо           25
4  бере              42                 могли           25
5  буде              33                 можуть          22
6  виголошу          31                 отримають       21
7  хоче              26                 поширте         17
8  немає             22                 знаємо          14
9  потребує          19                 можете          13

df_verbs_before


   Verb in Singular  Count in Singular  Verb in Plural  Count in Plural
0  є                 110                просимо         57
1  має               82                 мають           45
2  може              81                 маємо           41
3  будь              43                 можете          27
4  буде              39                 зібрали         24
5  виголошу          31                 можуть          21
6  знаю              30                 відкриваємо     20
7  виконує           28                 московити       20
8  відбувається      25                 знаєте          19
9  немає             24                 дякуємо         19

# # Function to count words in a text
# def count_words(text):
#     words = tokenize_uk.tokenize_words(text)
#     return len(words)

# # Apply count_words to each row of 'text' and create a new column 'word_count'
# before_date['word_count'] = before_date['text'].apply(count_words)

# # Calculate the average word count
# average_word_count_b = before_date['word_count'].mean()

# # Same for the second period
# after_date['word_count'] = after_date['text'].apply(count_words)
# average_word_count_a = after_date['word_count'].mean()

# print(f"Average post length in words, period 1: {average_word_count_b}")
# print(f"Average post length in words, period 2: {average_word_count_a}")

def get_top_ngram(corpus, k, n=None):
    vec = CountVectorizer(ngram_range=(n, n), stop_words=stops,
                          token_pattern=r"[А-ЩЬЮЯҐЄІЇа-щьюяґєії'`’ʼ]+").fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items() if word != "' '"]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:k]
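An illustrative call on a toy corpus (the real calls below use the lemmatized posts; this assumes none of the toy words appear in stopwords_ua.txt):

# expected to rank the bigram 'кіт спить' first with a count of 2
print(get_top_ngram(["кіт спить вдома", "кіт спить тихо"], 5, 2))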

top_n_bigrams = get_top_ngram(before_date['lemmatized'], 15, 2)
x, y = map(list, zip(*top_n_bigrams))
sns.barplot(x=y, y=x).set(title='15 most frequent bigrams, period 1')

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py:409: UserWarning: Your stop_words may be inconsistent with your preprocessing
warnings.warn(
[Text(0.5, 1.0, '15 most frequent bigrams, period 1')]

top_n_bigrams = get_top_ngram(after_date['lemmatized'], 15, 2)
x, y = map(list, zip(*top_n_bigrams))
sns.barplot(x=y, y=x).set(title='15 most frequent bigrams, period 2')

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py:409: UserWarning: Your stop_words may be inconsistent with your preprocessing
warnings.warn(
[Text(0.5, 1.0, '15 most frequent bigrams, period 2')]

def extract_emojis(text: str):
    emoji_dict = adv.extract_emoji([text])
    return emoji_dict['emoji'][0]

def most_common_emojis(emojis: list):
    all_emos = []
    for i in emojis:
        if i:
            all_emos.extend(i)
    return dict(Counter(all_emos).most_common(10))

emojis_s = before_date['text'].astype(str).apply(lambda x: extract_emojis(x)).to_list()
emojis_n = after_date['text'].astype(str).apply(lambda x: extract_emojis(x)).to_list()

print(most_common_emojis(emojis_s))
print(most_common_emojis(emojis_n))

{}
{}

emot_obj = emot.core.emot()

emoticons_s = Counter(emot_obj.emoticons(before_date['text'].to_string())['value'])

emoticons_n = Counter(emot_obj.emoticons(after_date['text'].to_string())['value'])

print(emoticons_s)
print(emoticons_n)

Counter({':/': 354, '=3': 5, '=D': 3, 'XP': 2, '8D': 2, 'XD': 2, '=p': 2, '=L': 2, 'DX': 1, 'D8': 1})
Counter({':/': 361, '=D': 3, '=3': 2, ':]': 2, 'QQ': 1, '=L': 1, 'DX': 1, 'D8': 1, '8D': 1, '=p': 1})
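The overwhelming ':/' count is almost certainly an artifact of 'https://' in links rather than a real emoticon. One possible check is to strip URLs before extraction (a sketch, reusing the URL pattern from remove_links_and_punctuation):

url_pattern = re.compile(r'https?://\S+|www\.\S+')
clean_before = url_pattern.sub('', before_date['text'].to_string())
# with links removed, ':/' should drop out of the top emoticons
print(Counter(emot_obj.emoticons(clean_before)['value']))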

# def find_entities(text, n):
#     doc = nlp(text)
#     return dict(Counter([ent.text for ent in doc.ents]).most_common(n))

# ents_s = find_entities(before_date['lemmatized'].to_string().lower(), 10)
# ents_n = find_entities(after_date['lemmatized'].to_string().lower(), 10)

# ents_n
# ents_s


Tagging words written in all caps (filtering is needed, since abbreviations are included)


# Function to find words in all caps
def find_all_caps(text):
    # NOTE: the Cyrillic range [А-Я] does not include the Ukrainian letters
    # Ґ, Є, І, Ї, so all-caps words containing them (e.g. УКРАЇНА) are missed
    pattern = r'\b[А-Я]+\b'
    all_caps_words = re.findall(pattern, text)
    return Counter(all_caps_words)

# Apply the function to the 'text' column
before_date['all_caps_words'] = before_date['text'].apply(find_all_caps)
after_date['all_caps_words'] = after_date['text'].apply(find_all_caps)

# Extract word frequency statistics
word_stats = before_date['all_caps_words'].sum()
word_stats_common = word_stats.most_common()

# Print the word frequency statistics in a user-friendly format
for word, count in word_stats_common:
    print(f"{word}: {count}")

В: 200
А: 196
У: 157
Я: 121
БРАТСТВО: 104
ЗСУ: 89
З: 82
ФСБ: 63
О: 60
ТРО: 50
УПЦ: 49
БРАТСТВА: 44
СБУ: 38
ПОТРЕБИ: 26
РФ: 23
НАТО: 19
ЗИМОВИЙ: 16
США: 13
ЗБОРУ: 12
Р: 11
МП: 11
С: 11
НЕ: 10
РПЦ: 10
ВТ: 9
Х: 9
УНСО: 9
АТО: 9
НОВИНИ: 9
ФРОНТУ: 9
ХХ: 8
И: 8
МАГАТЕ: 8
ТЕЛЕГРАМ: 8
КАНАЛ: 8
ССО: 7
УВАГА: 7
Д: 7
ЗА: 7
ВР: 7
Й: 7
МВС: 6
МОЗ: 6
ООС: 6
ООН: 6
КМДА: 6
БМП: 6
ВОЦ: 6
МО: 5
ГУР: 5
УБН: 5
М: 5
ПЦУ: 5
АЕС: 5
ЗАЕС: 5
НА: 5
Т: 5
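As the section heading notes, the raw counts mix single letters and abbreviations with genuinely shouted words. One possible filtering step (a sketch; the abbreviation set is illustrative, not from the original notebook):

# drop one-letter tokens and a hand-made set of known abbreviations
ABBREVIATIONS = {'ЗСУ', 'ФСБ', 'СБУ', 'УПЦ', 'ТРО', 'РФ', 'НАТО', 'США', 'РПЦ', 'АТО'}
filtered_stats = [(w, c) for w, c in word_stats_common
                  if len(w) > 1 and w not in ABBREVIATIONS]
for word, count in filtered_stats[:20]:
    print(f"{word}: {count}")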

# Extract word frequency statistics
word_stats = after_date['all_caps_words'].sum()
word_stats_common = word_stats.most_common()

# Print the word frequency statistics in a user-friendly format
for word, count in word_stats_common:
    print(f"{word}: {count}")

А: 167
В: 148
У: 148
БРАТСТВО: 122
Я: 84
КУЛЬТ: 78
З: 51
ЗСУ: 51
БРАТСТВА: 44
О: 39
ФСБ: 28
УПЦ: 24
РФ: 22
НЕ: 15
ЛГБТ: 15
США: 14
КОЛАБОРАНТ: 12
НА: 11
СБУ: 10
КМС: 10
ГРН: 10
М: 9
МВС: 9
МП: 8
РПЦ: 8
ПЦУ: 8
ТРО: 8
ШТ: 8
Д: 7
Й: 7
ДШВ: 7
ГО: 7
ЗА: 7
ТЗ: 7
ЗС: 6
П: 6
ДБР: 6
ВЛК: 6
МЕРЧ: 6
С: 6
ВЖЕ: 6
ССО: 5
ХХ: 5
АТО: 5
ДО: 5
ООН: 5
ЖИТТЯ: 5
ППО: 5
ОВА: 5
СП: 5
Р: 5
Х: 5
ГЕС: 5
УНР: 4
ООС: 4
АКС: 4
МО: 4
ВР: 4
ЦВЛК: 4

Other methods


print(df.columns)

Index(['Date-Time', 'Post'], dtype='object')

df['Post'] = df['Post'].astype(str)
df.rename(columns={'Date-Time': 'date_time', 'Post': 'text'}, inplace=True)

import pandas as pd
import spacy
import re

nlp = spacy.load("uk_core_news_lg")

df = pd.read_csv("/content/all_posts.csv")

def count_quotes(text):
    # common quotation marks (the original character class listed '”' twice)
    quote_pattern = re.compile(r'[«»"\'“”„]')
    quote_matches = re.findall(quote_pattern, text)
    # counts quote characters; a quotation normally contributes an opening
    # and a closing mark (the original counted only distinct characters)
    return len(quote_matches)

df['quote_count'] = df['text'].apply(count_quotes)

frequency_of_posting = len(df) / (pd.to_datetime(df['date_time']).max() - pd.to_datetime(df['date_time']).min()).days

df['word_count'] = df['text'].apply(lambda x: len(nlp(x)))
avg_word_count_in_posts = df['word_count'].mean()

number_of_quotes_used = df['quote_count'].sum()

print(f'Posting frequency: {frequency_of_posting:.2f} posts per day')
# this is a token count per post, not a sentence count
print(f'Average post length: {avg_word_count_in_posts:.2f} words')
print(f'Number of quote characters: {number_of_quotes_used}')

INFO:pymorphy3.opencorpora_dict.wrapper:Loading dictionaries from /usr/local/lib/python3.10/dist-packages/pymorphy3_dict
INFO:pymorphy3.opencorpora_dict.wrapper:format: 2.4, revision: 1, updated: 2022-09-13T18:45:24.998984
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3801  try:
-> 3802      return self._engine.get_loc(casted_key)
   3803  except KeyError as err:

4 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'text'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3802      return self._engine.get_loc(casted_key)
   3803  except KeyError as err:
-> 3804      raise KeyError(key) from err
   3805  except TypeError:
   3806      # If we have a listlike key, _check_indexing_error will raise

KeyError: 'text'
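The KeyError occurs because the freshly re-read CSV still has the original 'Date-Time'/'Post' headers, while the cell indexes df['text']. A minimal fix (a sketch) is to rename the columns right after reading, as was done earlier:

df = pd.read_csv("/content/all_posts.csv")
df = df.rename(columns={'Date-Time': 'date_time', 'Post': 'text'})
df['text'] = df['text'].astype(str)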

!pip install stanza networkx matplotlib

Collecting stanza
Downloading stanza-1.7.0-py3-none-any.whl (933 kB)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (3.2.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Collecting emoji (from stanza)
Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.10.1 stanza-1.7.0

Stanza - GPU test

import pandas as pd
import stanza
import networkx as nx
import matplotlib.pyplot as plt
import mpld3
mpld3.enable_notebook()

stanza.download('uk', processors='tokenize,ner', package='languagetool', logging_level='INFO')
nlp_stanza = stanza.Pipeline('uk', processors='tokenize,ner', use_gpu=True)

df = pd.read_csv("/content/ryslan_martsinkiv.csv")

def extract_entities(text):
    doc = nlp_stanza(text)
    entities = [(ent.text, ent.type) for sent in doc.sentences for ent in sent.ents]
    return entities

df['entities'] = df['text'].apply(extract_entities)

G = nx.Graph()

# add every extracted entity as a node, tagged with its NER type
for entities in df['entities']:
    for entity, entity_type in entities:
        G.add_node(entity, type=entity_type)

entity_type_mapping = {'MISC': 0, 'PERS': 1, 'LOC': 2, 'ORG': 3, 'TIME': 4, 'DATE': 5, 'NUM': 6}
node_colors = [entity_type_mapping[G.nodes[n]['type']] for n in G.nodes]

pos = nx.spring_layout(G)
plt.figure(figsize=(10, 8))
nx.draw(G, pos, with_labels=True, font_size=8, node_color=node_colors, cmap=plt.cm.Paired,
        font_color='black', node_size=500)
plt.title("Named Entity Recognition (NER) Visualization")
plt.show()

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 370k/? [00:00<00:00, 7.39MB/s]
WARNING:stanza:Can not find tokenize: languagetool from official model list. Ignoring it.
WARNING:stanza:Can not find ner: languagetool from official model list. Ignoring it.
INFO:stanza:Downloading these customized packages for language: uk (Ukrainian)...
INFO:stanza:Finished downloading models and saved to /root/stanza_resources.
INFO:stanza:Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 370k/? [00:00<00:00, 7.40MB/s]
WARNING:stanza:Language uk package default expects mwt, which has been added
INFO:stanza:Loading these models for language: uk (Ukrainian):
=======================
| Processor | Package |
-----------------------
| tokenize  | iu      |
| mwt       | iu      |
| ner       | languk  |
=======================

INFO:stanza:Using device: cuda
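As built above, the graph contains only isolated nodes, since no edges are ever added. If entity co-occurrence is of interest, one possible extension (a sketch, not part of the original notebook) links entities that appear in the same post:

for entities in df['entities']:
    ents = [e for e, _ in entities]
    # connect every pair of entities mentioned together in one post
    for i in range(len(ents)):
        for j in range(i + 1, len(ents)):
            G.add_edge(ents[i], ents[j])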

ner_counts = {}
for entities in df['entities']:
    # the export is cut off here; a plausible completion tallies entities by NER type
    for entity, entity_type in entities:
        ner_counts[entity_type] = ner_counts.get(entity_type, 0) + 1