
Каращук Біляєва Проєкт Корчинський.ipynb - Colaboratory

!pip install tokenize_uk
!pip install pymorphy3
!pip install pymorphy3-dicts-uk
!python3 -m spacy download uk_core_news_lg

Collecting pymorphy3-dicts-ru (from pymorphy3)
Downloading pymorphy3_dicts_ru-2.4.417150.4580142-py2.py3-none-any.whl (8.4 MB)
Installing collected packages: pymorphy3-dicts-ru, dawg-python, pymorphy3
Successfully installed dawg-python-0.7.2 pymorphy3-2.0.1 pymorphy3-dicts-ru-2.4.417150.4580142
Collecting pymorphy3-dicts-uk
Downloading pymorphy3_dicts_uk-2.4.1.1.1663094765-py2.py3-none-any.whl (8.2 MB)
Successfully installed pymorphy3-dicts-uk-2.4.1.1.1663094765
Collecting uk-core-news-lg==3.7.0
Downloading https://github.com/explosion/spacy-models/releases/download/uk_core_news_lg-3.7.0/uk_core_news_lg-3.7.0 (231.2 MB)
Installing collected packages: uk-core-news-lg
Successfully installed uk-core-news-lg-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('uk_core_news_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

!pip install emot
!pip install advertools

Collecting emot
Downloading emot-3.1-py3-none-any.whl (61 kB)
Installing collected packages: emot
Successfully installed emot-3.1
Collecting advertools
Downloading advertools-0.14.1-py2.py3-none-any.whl (321 kB)
Collecting scrapy>=2.5.0 (from advertools)
Downloading Scrapy-2.11.1-py2.py3-none-any.whl (287 kB)
Collecting twython>=3.8.0 (from advertools)
Downloading twython-3.9.1-py3-none-any.whl (33 kB)
Collecting Twisted>=18.9.0 (from scrapy>=2.5.0->advertools)
Collecting cssselect>=0.9.1 (from scrapy>=2.5.0->advertools)
Collecting itemloaders>=1.0.1 (from scrapy>=2.5.0->advertools)
Collecting parsel>=1.5.0 (from scrapy>=2.5.0->advertools)
Collecting queuelib>=1.4.2 (from scrapy>=2.5.0->advertools)
Collecting service-identity>=18.1.0 (from scrapy>=2.5.0->advertools)
Collecting w3lib>=1.17.0 (from scrapy>=2.5.0->advertools)
Collecting zope.interface>=5.1.0 (from scrapy>=2.5.0->advertools)
Collecting protego>=0.1.15 (from scrapy>=2.5.0->advertools)
Collecting itemadapter>=0.1.0 (from scrapy>=2.5.0->advertools)
Collecting tldextract (from scrapy>=2.5.0->advertools)
Collecting PyDispatcher>=2.0.5 (from scrapy>=2.5.0->advertools)
Collecting jmespath>=0.9.5 (from itemloaders>=1.0.1->scrapy>=2.5.0->advertools)
Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pymorphy3
import tokenize_uk
import re
import string
import spacy
from collections import Counter
nlp = spacy.load("uk_core_news_lg")
import matplotlib.pyplot as plt
import seaborn as sns
import emot
import advertools as adv

df = pd.read_csv("/content/all_posts.csv")

df.head()

Date-Time Post

0 2023-12-24 23:08:09 __Минув рік від загибелі Непийпива, Святоші, Т...

1 2023-12-24 22:38:06 Рік тому, в Різдвяну ніч 2022 одна з диверсійн...

2 2023-12-24 20:06:29 Браття, дуже важливо підписати цю [петицію](ht...

3 2023-12-24 17:06:18 Слава Ісусу Христу! \n\nБраття та сестри, в це...

4 2023-12-24 16:01:21 26 грудня у КУЛЬТ театрі\n\nМоторошна містична...

stops = []
with open("/content/stopwords_ua.txt", encoding="utf-8") as file:
    for line in file.readlines():
        stops.append(line.strip())

stops.extend(['telegra', 'file', 'ph'])

df.rename(columns={'Date-Time': 'date_time', 'Post': 'text'}, inplace=True)

df['text'] = df['text'].astype(str)

print(df.columns)

Index(['date_time', 'text'], dtype='object')

df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)

# Convert 'date_time' column to datetime type
df['date_time'] = pd.to_datetime(df['date_time'])

# Define the split date
split_date = pd.to_datetime('2023-02-25')

# Split the DataFrame into two groups based on the split date; .copy() keeps the
# groups independent of df and avoids SettingWithCopyWarning when columns are added later
before_date = df[df['date_time'] < split_date].copy()
after_date = df[df['date_time'] >= split_date].copy()
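A quick sanity check that the split produced two non-empty groups (illustrative; the counts depend on the scraped dataset):

print(f"Posts before 2023-02-25: {len(before_date)}")
print(f"Posts on/after 2023-02-25: {len(after_date)}")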

morph = pymorphy3.MorphAnalyzer(lang='uk')

def find_verbs(df):
    verbs = []
    for item in df['text'].tolist():
        doc = nlp(item)
        for token in doc:
            if token.pos_ == "VERB":
                verbs.append(token.text)
    return verbs

def lemmatize(text):
    # the end of this line is cut off in the export; the condition presumably
    # keeps only lemmas that are not in the stopword list loaded above
    return ' '.join(morph.parse(word)[0].normal_form
                    for word in tokenize_uk.tokenize_words(text)
                    if morph.parse(word)[0].normal_form not in stops)

INFO:pymorphy3.opencorpora_dict.wrapper:Loading dictionaries from /usr/local/lib/python3.10/dist-packages/pymorphy3_dict
INFO:pymorphy3.opencorpora_dict.wrapper:format: 2.4, revision: 1, updated: 2022-09-13T18:45:24.998984
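A small smoke test of the lemmatizer on a made-up sentence (not from the corpus; the exact output depends on the pymorphy3 dictionaries and on the stopword list):

# expected to print something like 'браття зібрати кошти батальйон'
print(lemmatize("Браття зібрали кошти на батальйон"))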

def remove_links_and_punctuation(text):
    # Regular expression pattern to match URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')

    # Replace matched URLs with an empty string
    text_without_links = url_pattern.sub('', text)

    # Remove punctuation using the translate method
    translator = str.maketrans("", "", string.punctuation)
    text_without_punctuation = text_without_links.translate(translator)

    return text_without_punctuation
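A quick check of the cleaning function on a made-up string (illustrative):

# note: string.punctuation covers only ASCII marks, so «…» and other
# non-ASCII punctuation is left untouched by this function
print(remove_links_and_punctuation("Підпишіть «петицію» https://example.com/petition дякуємо"))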

verbs_before = find_verbs(before_date)
verbs_after = find_verbs(after_date)

before_date['lemmatized'] = before_date['text'].apply(lambda x: remove_links_and_punctuation(lemmatize(x)))
after_date['lemmatized'] = after_date['text'].apply(lambda x: remove_links_and_punctuation(lemmatize(x)))

<ipython-input-15-da908878dbe3>: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

def find_verbs(df):
    print(type(df))    # Check the type of df
    print(df.columns)  # Check the columns in df

    verbs = []
    # the column was renamed from 'Post' to 'text' earlier, so use the new name
    for item in df['text'].tolist():
        doc = nlp(item)
        for token in doc:
            if token.pos_ == 'VERB':
                verbs.append(token.lemma_)
    return verbs

def return_key_words(data):
    # Create a TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # Fit and transform the documents
    tfidf_matrix = vectorizer.fit_transform([data])

    # Get feature names (words)
    feature_names = vectorizer.get_feature_names_out()

    # Get TF-IDF scores for each word in the first document
    tfidf_scores = tfidf_matrix[0].toarray()[0]

    # Get top N keywords based on TF-IDF scores
    return [feature_names[i] for i in tfidf_scores.argsort()[-20:][::-1]]
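Note that return_key_words fits the vectorizer on a single concatenated document, so the IDF factor is identical for every word and the ranking effectively reduces to raw term frequency. A toy call (made-up input):

# 'кіт' occurs most often and should rank first
print(return_key_words("кіт спить кіт їсть пес гавкає кіт"))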

keywords_before = return_key_words(' '.join(before_date['lemmatized'].tolist()))


keywords_after = return_key_words(' '.join(after_date['lemmatized'].tolist()))

keys = pd.DataFrame({
    'Keywords, period 1': keywords_before,
    'Keywords, period 2': keywords_after,
})

# Create a morphological analyzer
morph_analyzer = pymorphy3.MorphAnalyzer(lang='uk')

def analyze_verbs(verbs):
    singular_counter = Counter()
    plural_counter = Counter()

    # Analyze each verb form
    for verb in verbs:
        parsed = morph_analyzer.parse(verb.lower())[0]
        number = parsed.tag.number

        # Update counters based on the number information
        if number == 'sing':
            singular_counter[verb.lower()] += 1
        elif number == 'plur':
            plural_counter[verb.lower()] += 1
    return singular_counter.most_common(10), plural_counter.most_common(10)

INFO:pymorphy3.opencorpora_dict.wrapper:Loading dictionaries from /usr/local/lib/python3.10/dist-packages/pymorphy3_dict
INFO:pymorphy3.opencorpora_dict.wrapper:format: 2.4, revision: 1, updated: 2022-09-13T18:45:24.998984
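A toy check of the singular/plural split (made-up verb list; the parses depend on the pymorphy3 dictionaries):

singular_top, plural_top = analyze_verbs(['має', 'має', 'мають', 'просимо'])
print(singular_top)  # expected to include ('має', 2)
print(plural_top)    # expected to include ('мають', 1) and ('просимо', 1)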

keys


    Keywords, period 1   Keywords, period 2
0   україна              україна
1   братство             братство
2   батальйон            батальйон
3   том                  йога
4   йога                 том
5   канал                боєць
6   український          військовий
7   день                 життя
8   корчинський          канал
9   допомога             посилання
10  життя                церква
11  українська           український
12  перемога             день
13  слава                бойовий
14  українець            театр
15  війнути              зараза
16  мен                  просити
17  боєць                бог
18  бог                  збір
19  війна                допомога

res_verbs_after = analyze_verbs(verbs_after)
res_verbs_before = analyze_verbs(verbs_before)

# analyze_verbs returns (singular_top10, plural_top10): index 0 holds the
# singular forms, index 1 the plural ones
df_verbs_after = pd.DataFrame({'Verb in Singular': [item[0] for item in res_verbs_after[0]],
                               'Count in Singular': [item[1] for item in res_verbs_after[0]],
                               'Verb in Plural': [item[0] for item in res_verbs_after[1]],
                               'Count in Plural': [item[1] for item in res_verbs_after[1]]})

df_verbs_before = pd.DataFrame({'Verb in Singular': [item[0] for item in res_verbs_before[0]],
                                'Count in Singular': [item[1] for item in res_verbs_before[0]],
                                'Verb in Plural': [item[0] for item in res_verbs_before[1]],
                                'Count in Plural': [item[1] for item in res_verbs_before[1]]})

df_verbs_after

   Verb in Singular  Count in Singular  Verb in Plural  Count in Plural
0  є                 142                просимо         78
1  має               65                 мають           32
2  може              49                 закликаємо      27
3  будь              46                 маємо           25
4  бере              42                 могли           25
5  буде              33                 можуть          22
6  виголошу          31                 отримають       21
7  хоче              26                 поширте         17
8  немає             22                 знаємо          14
9  потребує          19                 можете          13

df_verbs_before


   Verb in Singular  Count in Singular  Verb in Plural  Count in Plural
0  є                 110                просимо         57
1  має               82                 мають           45
2  може              81                 маємо           41
3  будь              43                 можете          27
4  буде              39                 зібрали         24
5  виголошу          31                 можуть          21
6  знаю              30                 відкриваємо     20
7  виконує           28                 московити       20
8  відбувається      25                 знаєте          19
9  немає             24                 дякуємо         19

# # Function to count words in a text
# def count_words(text):
#     words = tokenize_uk.tokenize_words(text)
#     return len(words)

# # Apply count_words to each row of 'text' and create a new column 'word_count'
# before_date['word_count'] = before_date['text'].apply(count_words)

# # Calculate the average word count
# average_word_count_b = before_date['word_count'].mean()

# # Same for the second period
# after_date['word_count'] = after_date['text'].apply(count_words)
# average_word_count_a = after_date['word_count'].mean()

# print(f"Average post length in words, period 1: {average_word_count_b}")
# print(f"Average post length in words, period 2: {average_word_count_a}")

def get_top_ngram(corpus, k, n=None):
    vec = CountVectorizer(ngram_range=(n, n), stop_words=stops,
                          token_pattern=r"[А-ЩЬЮЯҐЄІЇа-щьюяґєії'`’ʼ]+").fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx])
                  for word, idx in vec.vocabulary_.items() if word != "' '"]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:k]
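An illustrative call on a toy corpus (the real calls below use the lemmatized posts; this assumes none of the toy words appear in stopwords_ua.txt):

# expected to rank the bigram 'кіт спить' first with a count of 2
print(get_top_ngram(["кіт спить вдома", "кіт спить тихо"], 5, 2))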

top_n_bigrams = get_top_ngram(before_date['lemmatized'], 15, 2)
x, y = map(list, zip(*top_n_bigrams))
sns.barplot(x=y, y=x).set(title='15 most frequent bigrams, period 1')

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py:409: UserWarning: Your stop_words may be inconsistent with your preprocessing
warnings.warn(
[Text(0.5, 1.0, '15 most frequent bigrams, period 1')]

top_n_bigrams = get_top_ngram(after_date['lemmatized'], 15, 2)
x, y = map(list, zip(*top_n_bigrams))
sns.barplot(x=y, y=x).set(title='15 most frequent bigrams, period 2')

/usr/local/lib/python3.10/dist-packages/sklearn/feature_extraction/text.py:409: UserWarning: Your stop_words may be inconsistent with your preprocessing
warnings.warn(
[Text(0.5, 1.0, '15 most frequent bigrams, period 2')]

def extract_emojis(text: str):
    emoji_dict = adv.extract_emoji([text])
    return emoji_dict['emoji'][0]

def most_common_emojis(emojis: list):
    all_emos = []
    for i in emojis:
        if i:
            all_emos.extend(i)
    return dict(Counter(all_emos).most_common(10))

emojis_s = before_date['text'].astype(str).apply(lambda x: extract_emojis(x)).to_list()
emojis_n = after_date['text'].astype(str).apply(lambda x: extract_emojis(x)).to_list()

print(most_common_emojis(emojis_s))
print(most_common_emojis(emojis_n))

{}
{}

emot_obj = emot.core.emot()

emoticons_s = Counter(emot_obj.emoticons(before_date['text'].to_string())['value'])

emoticons_n = Counter(emot_obj.emoticons(after_date['text'].to_string())['value'])

print(emoticons_s)
print(emoticons_n)

Counter({':/': 354, '=3': 5, '=D': 3, 'XP': 2, '8D': 2, 'XD': 2, '=p': 2, '=L': 2, 'DX': 1, 'D8': 1})
Counter({':/': 361, '=D': 3, '=3': 2, ':]': 2, 'QQ': 1, '=L': 1, 'DX': 1, 'D8': 1, '8D': 1, '=p': 1})
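The overwhelming ':/' count is almost certainly an artifact of 'https://' in links rather than a real emoticon. One possible check is to strip URLs before extraction (a sketch, reusing the URL pattern from remove_links_and_punctuation):

url_pattern = re.compile(r'https?://\S+|www\.\S+')
clean_before = url_pattern.sub('', before_date['text'].to_string())
# with links removed, ':/' should drop out of the top emoticons
print(Counter(emot_obj.emoticons(clean_before)['value']))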

# def find_entities(text, n):
#     doc = nlp(text)
#     return dict(Counter([ent.text for ent in doc.ents]).most_common(n))

# ents_s = find_entities(before_date['lemmatized'].to_string().lower(), 10)
# ents_n = find_entities(after_date['lemmatized'].to_string().lower(), 10)

# ents_n
# ents_s


Tagging words written in all caps (filtering is needed, since abbreviations are included)


# Function to find words in all caps
def find_all_caps(text):
    # NOTE: the Cyrillic range [А-Я] does not include the Ukrainian letters
    # Ґ, Є, І, Ї, so all-caps words containing them (e.g. УКРАЇНА) are missed
    pattern = r'\b[А-Я]+\b'
    all_caps_words = re.findall(pattern, text)
    return Counter(all_caps_words)

# Apply the function to the 'text' column
before_date['all_caps_words'] = before_date['text'].apply(find_all_caps)
after_date['all_caps_words'] = after_date['text'].apply(find_all_caps)

# Extract word frequency statistics
word_stats = before_date['all_caps_words'].sum()
word_stats_common = word_stats.most_common()

# Print the word frequency statistics in a user-friendly format
for word, count in word_stats_common:
    print(f"{word}: {count}")

В: 200
А: 196
У: 157
Я: 121
БРАТСТВО: 104
ЗСУ: 89
З: 82
ФСБ: 63
О: 60
ТРО: 50
УПЦ: 49
БРАТСТВА: 44
СБУ: 38
ПОТРЕБИ: 26
РФ: 23
НАТО: 19
ЗИМОВИЙ: 16
США: 13
ЗБОРУ: 12
Р: 11
МП: 11
С: 11
НЕ: 10
РПЦ: 10
ВТ: 9
Х: 9
УНСО: 9
АТО: 9
НОВИНИ: 9
ФРОНТУ: 9
ХХ: 8
И: 8
МАГАТЕ: 8
ТЕЛЕГРАМ: 8
КАНАЛ: 8
ССО: 7
УВАГА: 7
Д: 7
ЗА: 7
ВР: 7
Й: 7
МВС: 6
МОЗ: 6
ООС: 6
ООН: 6
КМДА: 6
БМП: 6
ВОЦ: 6
МО: 5
ГУР: 5
УБН: 5
М: 5
ПЦУ: 5
АЕС: 5
ЗАЕС: 5
НА: 5
Т: 5
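As the section heading notes, the raw counts mix single letters and abbreviations with genuinely shouted words. One possible filtering step (a sketch; the abbreviation set is illustrative, not from the original notebook):

# drop one-letter tokens and a hand-made set of known abbreviations
ABBREVIATIONS = {'ЗСУ', 'ФСБ', 'СБУ', 'УПЦ', 'ТРО', 'РФ', 'НАТО', 'США', 'РПЦ', 'АТО'}
filtered_stats = [(w, c) for w, c in word_stats_common
                  if len(w) > 1 and w not in ABBREVIATIONS]
for word, count in filtered_stats[:20]:
    print(f"{word}: {count}")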

# Extract word frequency statistics
word_stats = after_date['all_caps_words'].sum()
word_stats_common = word_stats.most_common()

# Print the word frequency statistics in a user-friendly format
for word, count in word_stats_common:
    print(f"{word}: {count}")

А: 167
В: 148
У: 148
БРАТСТВО: 122
Я: 84
КУЛЬТ: 78
З: 51
ЗСУ: 51
БРАТСТВА: 44
О: 39
ФСБ: 28
УПЦ: 24
РФ: 22
НЕ: 15
ЛГБТ: 15
США: 14
КОЛАБОРАНТ: 12
НА: 11
СБУ: 10
КМС: 10
ГРН: 10
М: 9
МВС: 9
МП: 8
РПЦ: 8
ПЦУ: 8
ТРО: 8
ШТ: 8
Д: 7
Й: 7
ДШВ: 7
ГО: 7
ЗА: 7
ТЗ: 7
ЗС: 6
П: 6
ДБР: 6
ВЛК: 6
МЕРЧ: 6
С: 6
ВЖЕ: 6
ССО: 5
ХХ: 5
АТО: 5
ДО: 5
ООН: 5
ЖИТТЯ: 5
ППО: 5
ОВА: 5
СП: 5
Р: 5
Х: 5
ГЕС: 5
УНР: 4
ООС: 4
АКС: 4
МО: 4
ВР: 4
ЦВЛК: 4

Other methods


print(df.columns)

Index(['Date-Time', 'Post'], dtype='object')

df['Post'] = df['Post'].astype(str)
df.rename(columns={'Date-Time': 'date_time', 'Post': 'text'}, inplace=True)

import pandas as pd
import spacy
import re

nlp = spacy.load("uk_core_news_lg")

df = pd.read_csv("/content/all_posts.csv")

def count_quotes(text):
    # common quotation marks (the original character class listed '”' twice)
    quote_pattern = re.compile(r'[«»"\'“”„]')
    quote_matches = re.findall(quote_pattern, text)
    # counts quote characters; a quotation normally contributes an opening
    # and a closing mark (the original counted only distinct characters)
    return len(quote_matches)

df['quote_count'] = df['text'].apply(count_quotes)

frequency_of_posting = len(df) / (pd.to_datetime(df['date_time']).max() - pd.to_datetime(df['date_time']).min()).days

df['word_count'] = df['text'].apply(lambda x: len(nlp(x)))
avg_word_count_in_posts = df['word_count'].mean()

number_of_quotes_used = df['quote_count'].sum()

print(f'Posting frequency: {frequency_of_posting:.2f} posts per day')
# this is a token count per post, not a sentence count
print(f'Average post length: {avg_word_count_in_posts:.2f} words')
print(f'Number of quote characters: {number_of_quotes_used}')

INFO:pymorphy3.opencorpora_dict.wrapper:Loading dictionaries from /usr/local/lib/python3.10/dist-packages/pymorphy3_dict
INFO:pymorphy3.opencorpora_dict.wrapper:format: 2.4, revision: 1, updated: 2022-09-13T18:45:24.998984
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3801  try:
-> 3802      return self._engine.get_loc(casted_key)
   3803  except KeyError as err:

4 frames
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'text'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3802      return self._engine.get_loc(casted_key)
   3803  except KeyError as err:
-> 3804      raise KeyError(key) from err
   3805  except TypeError:
   3806      # If we have a listlike key, _check_indexing_error will raise

KeyError: 'text'
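The KeyError occurs because the freshly re-read CSV still has the original 'Date-Time'/'Post' headers, while the cell indexes df['text']. A minimal fix (a sketch) is to rename the columns right after reading, as was done earlier:

df = pd.read_csv("/content/all_posts.csv")
df = df.rename(columns={'Date-Time': 'date_time', 'Post': 'text'})
df['text'] = df['text'].astype(str)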

!pip install stanza networkx matplotlib

Collecting stanza
Downloading stanza-1.7.0-py3-none-any.whl (933 kB)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (3.2.1)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.7.1)
Collecting emoji (from stanza)
Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.10.1 stanza-1.7.0

Stanza - GPU test

import pandas as pd
import stanza
import networkx as nx
import matplotlib.pyplot as plt
import mpld3
mpld3.enable_notebook()

stanza.download('uk', processors='tokenize,ner', package='languagetool', logging_level='INFO')
nlp_stanza = stanza.Pipeline('uk', processors='tokenize,ner', use_gpu=True)

df = pd.read_csv("/content/ryslan_martsinkiv.csv")

def extract_entities(text):
    doc = nlp_stanza(text)
    entities = [(ent.text, ent.type) for sent in doc.sentences for ent in sent.ents]
    return entities

df['entities'] = df['text'].apply(extract_entities)

G = nx.Graph()

# add every extracted entity as a node, tagged with its NER type
for entities in df['entities']:
    for entity, entity_type in entities:
        G.add_node(entity, type=entity_type)

entity_type_mapping = {'MISC': 0, 'PERS': 1, 'LOC': 2, 'ORG': 3, 'TIME': 4, 'DATE': 5, 'NUM': 6}
node_colors = [entity_type_mapping[G.nodes[n]['type']] for n in G.nodes]

pos = nx.spring_layout(G)
plt.figure(figsize=(10, 8))
nx.draw(G, pos, with_labels=True, font_size=8, node_color=node_colors, cmap=plt.cm.Paired,
        font_color='black', node_size=500)
plt.title("Named Entity Recognition (NER) Visualization")
plt.show()

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 370k/? [00:00<00:00, 7.39MB/s]
WARNING:stanza:Can not find tokenize: languagetool from official model list. Ignoring it.
WARNING:stanza:Can not find ner: languagetool from official model list. Ignoring it.
INFO:stanza:Downloading these customized packages for language: uk (Ukrainian)...
INFO:stanza:Finished downloading models and saved to /root/stanza_resources.
INFO:stanza:Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 370k/? [00:00<00:00, 7.40MB/s]
WARNING:stanza:Language uk package default expects mwt, which has been added
INFO:stanza:Loading these models for language: uk (Ukrainian):
=======================
| Processor | Package |
-----------------------
| tokenize  | iu      |
| mwt       | iu      |
| ner       | languk  |
=======================

INFO:stanza:Using device: cuda
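As built above, the graph contains only isolated nodes, since no edges are ever added. If entity co-occurrence is of interest, one possible extension (a sketch, not part of the original notebook) links entities that appear in the same post:

for entities in df['entities']:
    ents = [e for e, _ in entities]
    # connect every pair of entities mentioned together in one post
    for i in range(len(ents)):
        for j in range(i + 1, len(ents)):
            G.add_edge(ents[i], ents[j])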

ner_counts = {}
for entities in df['entities']:
    # the export is cut off here; a plausible completion tallies entities by NER type
    for entity, entity_type in entities:
        ner_counts[entity_type] = ner_counts.get(entity_type, 0) + 1