
Task

1. Find Themes and Sub-Themes from the dataset.


2. You need to find the themes and sub-themes at an overall level, by years, by journal, by license type.
3. Carry out sentiment analysis for various themes and sub-themes in the dataset.
4. Use only abstract and title column for analysis.
5. Present the findings in a pdf: approach used, analysis, outputs and insights.

Importing required libraries

In [6]:

import numpy as np # linear algebra


import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [9]:
df = pd.read_csv("./metadata.csv",nrows=100)

In [10]:

df.shape #shows the number of rows and columns in the dataset


Out[10]:
(100, 19)

In [12]:

df.head(3)
Out[12]:

   cord_uid  sha                                       source_x  title                                              doi                    pmcid     pubmed_id  license
0  ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb  PMC       Clinical features of culture-proven Mycoplasma...  10.1186/1471-2334-1-6  PMC35282  11472636   no-cc
1  02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d  PMC       Nitric oxide: a pro-inflammatory mediator in l...  10.1186/rr14           PMC59543  11667967   no-cc
2  ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927  PMC       Surfactant protein-D and pulmonary host defense    10.1186/rr19           PMC59549  11667972   no-cc

In [13]:

#Checking number of null objects


df.isna().sum()
Out[13]:
cord_uid 0
sha 6
source_x 0
title 0
doi 0
pmcid 0
pubmed_id 0
license 0
abstract 8
publish_time 0
authors 6
journal 0
mag_id 100
who_covidence_id 100
arxiv_id 100
pdf_json_files 6
pmc_json_files 9
url 0
s2_id 100
dtype: int64

In [14]:
# These four columns are null in all 100 rows, so dropping them has no impact on our analysis
df.drop(['mag_id','who_covidence_id','arxiv_id','s2_id'],axis =1,inplace = True)
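
The four columns above were picked by reading the null counts. As a minimal sketch, assuming a small stand-in frame (the `demo` frame and its values are made up for illustration), the same kind of cleanup can be done without listing column names by hand, using `DataFrame.dropna` with `axis=1` and `how="all"`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the metadata frame; values are illustrative only.
demo = pd.DataFrame({
    "title": ["a", "b", "c"],
    "abstract": ["x", None, "z"],          # partly null: kept
    "mag_id": [np.nan, np.nan, np.nan],    # entirely null: dropped
})

# Drop only the columns in which every value is null.
cleaned = demo.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Columns that are only partly null, such as `abstract`, are left in place and still need separate handling.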

In [16]:
df.head(3)
Out[16]:

   cord_uid  sha                                       source_x  title                                              doi                    pmcid     pubmed_id  license
0  ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb  PMC       Clinical features of culture-proven Mycoplasma...  10.1186/1471-2334-1-6  PMC35282  11472636   no-cc
1  02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d  PMC       Nitric oxide: a pro-inflammatory mediator in l...  10.1186/rr14           PMC59543  11667967   no-cc
2  ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927  PMC       Surfactant protein-D and pulmonary host defense    10.1186/rr19           PMC59549  11667972   no-cc

In [17]:

df.isna().sum() # The all-null columns are gone; the remaining nulls are in sha, abstract, authors and the two JSON file columns


Out[17]:
cord_uid 0
sha 6
source_x 0
title 0
doi 0
pmcid 0
pubmed_id 0
license 0
abstract 8
publish_time 0
authors 6
journal 0
pdf_json_files 6
pmc_json_files 9
url 0
dtype: int64
In [20]:
#Finding number of articles published by each journal
df.groupby(['journal']).size().groupby(level=0).max().sort_values(ascending=False)
Out[20]:
journal
Nucleic Acids Res 16
Respir Res 8
BMC Public Health 8
PLoS One 7
J Gen Intern Med 6
Crit Care 6
The EMBO Journal 3
EMBO J 2
J Biomed Biotechnol 2
Virol J 2
BMC Med Ethics 2
BMC Infect Dis 2
BMC Genomics 2
BMC Bioinformatics 2
Aust New Zealand Health Policy 2
PLoS Comput Biol 2
Microb Cell Fact 1
Methods 1
Nat Med 1
PLoS Pathog 1
PLoS Biol 1
PLoS Med 1
Journal of the American Medical Informatics Association 1
Reprod Biol Endocrinol 1
Retrovirology 1
Theor Biol Med Model 1
Mediators Inflamm 1
AIDS Res Ther 1
Journal of Neuropathology and Experimental Neurology 1
J Transl Med 1
Ann Clin Microbiol Antimicrob 1
Insect Mol Biol 1
Immunome Res 1
Harm Reduct J 1
Global Health 1
Genome Biol 1
Evid Based Complement Alternat Med 1
Clinical Chemistry 1
Cell Microbiol 1
Biol Proced Online 1
BMC Mol Biol 1
BMC Gastroenterol 1
BMC Biotechnol 1
Int J Health Geogr 1
dtype: int64

We find that Nucleic Acids Res has the highest number of publications (16).
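
The counts above list 'The EMBO Journal' (3) and 'EMBO J' (2) separately, although both names appear to refer to the same journal. A minimal sketch, assuming a hand-written alias table (the `ALIASES` mapping and the `normalize_journal` helper are illustrative and not part of the notebook), of merging such variants before counting:

```python
from collections import Counter

# Hypothetical alias table: variant spelling -> canonical name.
ALIASES = {"The EMBO Journal": "EMBO J"}

def normalize_journal(name):
    """Return the canonical journal name, or the name itself if no alias is known."""
    return ALIASES.get(name, name)

# Counts copied from the output above (only the journals relevant to the example).
raw_counts = {"Nucleic Acids Res": 16, "The EMBO Journal": 3, "EMBO J": 2}

merged = Counter()
for journal, n in raw_counts.items():
    merged[normalize_journal(journal)] += n

print(merged.most_common())
```

On the notebook's frame the same idea could be applied with `df['journal'].replace(ALIASES)` before grouping.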

In [29]:
df.groupby(['license']).size().groupby(level=0).max().sort_values(ascending=False)
Out[29]:
license
cc-by 46
no-cc 38
cc-by-nc 7
green-oa 4
bronze-oa 3
cc0 2
dtype: int64
The largest share of publications (46) was published under the cc-by license type; cc0 is the least common
license type, with only 2 publications.

In [30]:
df.groupby(['license','journal']).size().sort_values(ascending=False)
Out[30]:
license journal
no-cc Nucleic Acids Res 9
cc-by BMC Public Health 8
cc-by-nc Nucleic Acids Res 7
no-cc J Gen Intern Med 6
cc-by PLoS One 6
no-cc Respir Res 5
Crit Care 5
cc-by Respir Res 3
green-oa The EMBO Journal 3
cc-by Aust New Zealand Health Policy 2
PLoS Comput Biol 2
BMC Bioinformatics 2
Virol J 2
no-cc EMBO J 2
J Biomed Biotechnol 2
cc-by BMC Med Ethics 2
BMC Genomics 2
no-cc Evid Based Complement Alternat Med 1
Nat Med 1
Cell Microbiol 1
Biol Proced Online 1
BMC Infect Dis 1
J Transl Med 1
green-oa Methods 1
cc0 PLoS Pathog 1
PLoS One 1
no-cc Mediators Inflamm 1
Insect Mol Biol 1
bronze-oa Clinical Chemistry 1
cc-by Theor Biol Med Model 1
Reprod Biol Endocrinol 1
bronze-oa Journal of the American Medical Informatics Association 1
cc-by AIDS Res Ther 1
Ann Clin Microbiol Antimicrob 1
BMC Biotechnol 1
BMC Gastroenterol 1
BMC Infect Dis 1
BMC Mol Biol 1
Crit Care 1
Genome Biol 1
Global Health 1
Harm Reduct J 1
Immunome Res 1
Int J Health Geogr 1
Microb Cell Fact 1
PLoS Biol 1
PLoS Med 1
bronze-oa Journal of Neuropathology and Experimental Neurology 1
no-cc Retrovirology 1
dtype: int64

We find that Nucleic Acids Res appears under 2 license types, namely 'cc-by-nc' and 'no-cc'. Respir Res, Crit Care,
PLoS One and BMC Infect Dis also appear under two license types each.
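
Reading the grouped output by eye makes it easy to miss journals that appear under more than one license. A minimal sketch, using a few (license, journal) pairs copied from the output above, of listing such journals programmatically:

```python
from collections import defaultdict

# (license, journal) pairs copied from a few rows of the output above.
pairs = [
    ("no-cc", "Nucleic Acids Res"),
    ("cc-by-nc", "Nucleic Acids Res"),
    ("cc-by", "BMC Public Health"),
    ("cc-by", "PLoS One"),
    ("cc0", "PLoS One"),
    ("no-cc", "J Gen Intern Med"),
]

# Collect the set of license types seen for each journal.
licenses_by_journal = defaultdict(set)
for license_type, journal in pairs:
    licenses_by_journal[journal].add(license_type)

# Keep only journals that appear under more than one license type.
multi = {j: sorted(l) for j, l in licenses_by_journal.items() if len(l) > 1}
print(multi)
```

On the notebook's frame, `df.groupby('journal')['license'].nunique()` gives the number of distinct license types per journal directly.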

In [27]:
df.groupby(['publish_time']).size().groupby(level=0).max().sort_values(ascending=False)
Out[27]:
publish_time
2006-09-29 3
2005-03-30 3
2006-01-10 2
2006-08-18 2
2004-04-03 2
..
2005-01-04 1
2005-01-03 1
2004-11-19 1
2004-11-01 1
2007-06-11 1
Length: 88, dtype: int64

Grouping by full date is too fine-grained to show a trend, so we extract the year from the publish_time column

In [31]:
df.dtypes
Out[31]:
cord_uid object
sha object
source_x object
title object
doi object
pmcid object
pubmed_id int64
license object
abstract object
publish_time object
authors object
journal object
pdf_json_files object
pmc_json_files object
url object
dtype: object

We found that publish_time is stored as a string (dtype object) instead of a datetime datatype

In [33]:
#Converting the datatype to datetime (read from df itself; the format string must not have a trailing space)
df['publish_time'] = pd.to_datetime(df['publish_time'], format='%Y-%m-%d')

In [34]:
df.dtypes
Out[34]:
cord_uid object
sha object
source_x object
title object
doi object
pmcid object
pubmed_id int64
license object
abstract object
publish_time datetime64[ns]
authors object
journal object
pdf_json_files object
pmc_json_files object
url object
dtype: object

In [37]:
df['year'] = df['publish_time'].dt.year
In [38]:
df.head()
Out[38]:

   cord_uid  sha                                       source_x  title                                              doi                    pmcid     pubmed_id  license
0  ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb  PMC       Clinical features of culture-proven Mycoplasma...  10.1186/1471-2334-1-6  PMC35282  11472636   no-cc
1  02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d  PMC       Nitric oxide: a pro-inflammatory mediator in l...  10.1186/rr14           PMC59543  11667967   no-cc
2  ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927  PMC       Surfactant protein-D and pulmonary host defense    10.1186/rr19           PMC59549  11667972   no-cc
3  2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605  PMC       Role of endothelin-1 in lung disease               10.1186/rr44           PMC59574  11686871   no-cc
4  9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32  PMC       Gene expression in epithelial cells in respons...  10.1186/rr61           PMC59580  11686888   no-cc

In [39]:
df.groupby(['year']).size().groupby(level = 0).max().sort_values(ascending=False)
Out[39]:

year
2006 36
2005 23
2007 13
2004 10
2001 7
2000 5
2003 4
1997 1
2002 1
dtype: int64

We find that 1997 and 2002 have the fewest publications, i.e. 1 each.
The highest number of publications, i.e. 36, was published in 2006.
There is an increasing trend from 1997 to 2006, interrupted by a dip to only 1 publication in 2002, followed by a
decline to 13 in 2007. Since only the first 100 rows were loaded, these counts describe the sample rather than the
full dataset.
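
The output above is sorted by count, which makes the trend hard to read. A minimal sketch, using the yearly counts copied from that output, of sorting by year and computing the change relative to the previous year present in the sample:

```python
# Yearly publication counts copied from the output above.
counts = {2006: 36, 2005: 23, 2007: 13, 2004: 10, 2001: 7,
          2000: 5, 2003: 4, 1997: 1, 2002: 1}

# Sort by year rather than by count so the trend can be read top to bottom.
by_year = sorted(counts.items())

# Change relative to the previous year present in the sample.
changes = [(year, n, n - prev_n)
           for (_, prev_n), (year, n) in zip(by_year, by_year[1:])]

for year, n, delta in changes:
    print(f"{year}: {n:>2} ({delta:+d})")
```

On the notebook's Series the equivalent step would be `.sort_index()` in place of `.sort_values(ascending=False)`.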

In [40]:
df.groupby(['year','journal'])['title'].size().sort_values(ascending=False)
Out[40]:
year journal
2005 Nucleic Acids Res 9
2006 Nucleic Acids Res 6
BMC Public Health 6
2007 PLoS One 5
2006 Crit Care 3
..
2005 Int J Health Geogr 1
J Gen Intern Med 1
Methods 1
Microb Cell Fact 1
2007 Virol J 1
Name: title, Length: 69, dtype: int64

In [41]:
df.groupby(['year','license']).size().sort_values(ascending=False)
Out[41]:
year license
2006 cc-by 23
2005 cc-by 11
no-cc 11
2007 cc-by 9
2001 no-cc 6
2006 no-cc 6
2004 no-cc 6
2006 cc-by-nc 6
2000 no-cc 3
2003 no-cc 3
2004 cc-by 3
2007 cc0 2
2000 green-oa 2
2007 cc-by-nc 1
1997 no-cc 1
2006 bronze-oa 1
2005 green-oa 1
2004 bronze-oa 1
2003 bronze-oa 1
2002 no-cc 1
2001 green-oa 1
2007 no-cc 1
dtype: int64

Extracting the title and abstract column for sentiment analysis

In [42]:
metadata1 = df[['title','abstract']]
metadata1.head()
Out[42]:

title abstract

0 Clinical features of culture-proven Mycoplasma... OBJECTIVE: This retrospective chart review des...

1 Nitric oxide: a pro-inflammatory mediator in l... Inflammatory diseases of the respiratory tract...

2 Surfactant protein-D and pulmonary host defense Surfactant protein-D (SP-D) participates in th...

3 Role of endothelin-1 in lung disease Endothelin-1 (ET-1) is a 21 amino acid peptide...

4 Gene expression in epithelial cells in respons... Respiratory syncytial virus (RSV) and pneumoni...

In [43]:
metadata1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 100 non-null object
1 abstract 92 non-null object
dtypes: object(2)
memory usage: 1.7+ KB

This shows that there are 8 null values in the abstract column

In [44]:
metadata1.dropna(how = "any",axis = 0, inplace = True)
metadata1.reset_index(drop = True, inplace = True)
metadata1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 92 non-null object
1 abstract 92 non-null object
dtypes: object(2)
memory usage: 1.6+ KB

C:\Users\DEV\AppData\Local\Temp\ipykernel_16408\3320786844.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
metadata1.dropna(how = "any",axis = 0, inplace = True)
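
The warning above arises because metadata1 was created by selecting columns from df and was then modified in place. A minimal sketch, assuming a small stand-in frame (`df_demo` and its values are made up for illustration), of the usual way to avoid it: take an explicit copy with `.copy()` before modifying the subset.

```python
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only.
df_demo = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": ["Abstract A", None, "Abstract C"],
    "journal": ["J1", "J2", "J3"],
})

# .copy() makes the subset an independent frame, so later changes
# are unambiguous and do not touch df_demo.
subset = df_demo[["title", "abstract"]].copy()
subset = subset.dropna(how="any").reset_index(drop=True)
subset["title_lower"] = subset["title"].str.lower()

print(subset)
```

The original frame is left unchanged, and the later column assignments in this notebook would run without the repeated warning.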

In [45]:
metadata1['text'] = metadata1['title'] + metadata1['abstract']
metadata1

Out[45]:

    title                                              abstract                                           text
0   Clinical features of culture-proven Mycoplasma...  OBJECTIVE: This retrospective chart review des...  Clinical features of culture-proven Mycoplasma...
1   Nitric oxide: a pro-inflammatory mediator in l...  Inflammatory diseases of the respiratory tract...  Nitric oxide: a pro-inflammatory mediator in l...
2   Surfactant protein-D and pulmonary host defense    Surfactant protein-D (SP-D) participates in th...  Surfactant protein-D and pulmonary host defens...
3   Role of endothelin-1 in lung disease               Endothelin-1 (ET-1) is a 21 amino acid peptide...  Role of endothelin-1 in lung diseaseEndothelin...
4   Gene expression in epithelial cells in respons...  Respiratory syncytial virus (RSV) and pneumoni...  Gene expression in epithelial cells in respons...
... ...                                                ...                                                ...
87  Global Surveillance of Emerging Influenza Viru...  BACKGROUND: Effective influenza surveillance r...  Global Surveillance of Emerging Influenza Viru...
88  Transmission Parameters of the 2001 Foot and M...  Despite intensive ongoing research, key aspect...  Transmission Parameters of the 2001 Foot and M...
89  Efficient replication of pneumonia virus of mi...  Pneumonia virus of mice (PVM; family Paramyxov...  Efficient replication of pneumonia virus of mi...
90  Designing and conducting tabletop exercises to...  BACKGROUND: Since 2001, state and local health...  Designing and conducting tabletop exercises to...
91  Transcript-level annotation of Affymetrix prob...  BACKGROUND: The wide use of Affymetrix microar...  Transcript-level annotation of Affymetrix prob...

92 rows × 3 columns
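
In the text column above, row 3 reads 'lung diseaseEndothelin...': concatenating title and abstract with a bare `+` joins the last word of the title to the first word of the abstract. A minimal sketch of the effect and of one way to avoid it, using plain strings and `re.findall` as a stand-in for the tokenizer (the abstract string is shortened for illustration):

```python
import re

title = "Role of endothelin-1 in lung disease"
abstract = "Endothelin-1 (ET-1) is a 21 amino acid peptide"

fused = (title + abstract).lower()          # as in the notebook
spaced = (title + " " + abstract).lower()   # with an explicit separator

# Tokens around the title/abstract boundary in each variant.
print(re.findall(r"\w+", fused)[5:7])
print(re.findall(r"\w+", spaced)[5:8])
```

In the notebook the equivalent change would be `metadata1['title'] + ' ' + metadata1['abstract']`. The outputs shown in this document were produced without the separator, so they contain fused tokens such as `diseaseendothelin` and `makingbackground`.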

In [46]:
metadata1['text'] = metadata1['text'].astype(str).str.lower()
metadata1.head()

Out[46]:

    title                                              abstract                                           text
0   Clinical features of culture-proven Mycoplasma...  OBJECTIVE: This retrospective chart review des...  clinical features of culture-proven mycoplasma...
1   Nitric oxide: a pro-inflammatory mediator in l...  Inflammatory diseases of the respiratory tract...  nitric oxide: a pro-inflammatory mediator in l...
2   Surfactant protein-D and pulmonary host defense    Surfactant protein-D (SP-D) participates in th...  surfactant protein-d and pulmonary host defens...
3   Role of endothelin-1 in lung disease               Endothelin-1 (ET-1) is a 21 amino acid peptide...  role of endothelin-1 in lung diseaseendothelin...
4   Gene expression in epithelial cells in respons...  Respiratory syncytial virus (RSV) and pneumoni...  gene expression in epithelial cells in respons...

Using Regexp to tokenize the words

In [48]:
from nltk.tokenize import RegexpTokenizer

# Raw string so that \w reaches the regex engine unchanged
regexp = RegexpTokenizer(r'\w+')

metadata1['text_token'] = metadata1['text'].apply(regexp.tokenize)

metadata1.head()

Out[48]:

    title                                              abstract                                           text                                               text_token
0   Clinical features of culture-proven Mycoplasma...  OBJECTIVE: This retrospective chart review des...  clinical features of culture-proven mycoplasma...  [clinical, features, of, culture, proven, myco...
1   Nitric oxide: a pro-inflammatory mediator in l...  Inflammatory diseases of the respiratory tract...  nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, a, pro, inflammatory, mediator...
2   Surfactant protein-D and pulmonary host defense    Surfactant protein-D (SP-D) participates in th...  surfactant protein-d and pulmonary host defens...  [surfactant, protein, d, and, pulmonary, host,...
3   Role of endothelin-1 in lung disease               Endothelin-1 (ET-1) is a 21 amino acid peptide...  role of endothelin-1 in lung diseaseendothelin...  [role, of, endothelin, 1, in, lung, diseaseend...
4   Gene expression in epithelial cells in respons...  Respiratory syncytial virus (RSV) and pneumoni...  gene expression in epithelial cells in respons...  [gene, expression, in, epithelial, cells, in, ...
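
The pattern `\w+` splits on every non-word character, so 'culture-proven' becomes two tokens and 'endothelin-1' leaves a stray '1'. As a sketch of an alternative (an assumption, not what the notebook uses), a pattern that keeps hyphenated terms together can be compared using `re.findall`, which should match the same tokens as `RegexpTokenizer` does for these patterns:

```python
import re

text = "role of endothelin-1 in lung disease"

# Pattern used in the notebook: every run of word characters is a token.
plain = re.findall(r"\w+", text)

# Alternative: keep runs of word characters joined by single hyphens as one token.
# The group is non-capturing so that findall returns whole matches.
hyphen_aware = re.findall(r"\w+(?:-\w+)*", text)

print(plain)
print(hyphen_aware)
```

Whether hyphenated terms should stay intact depends on the analysis; for biomedical names such as 'endothelin-1' or 'protein-D' keeping them whole may give more meaningful frequency counts.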

In [49]:
import nltk
from nltk.corpus import stopwords

# Make a list of English stopwords
# (note: this rebinds the name stopwords from the imported corpus reader to a plain list)
stopwords = nltk.corpus.stopwords.words("english")

Remove stopwords

In [50]:
metadata1['text_token'] = metadata1['text_token'].apply(
    lambda x: [item for item in x if item not in stopwords])
metadata1.head()

Out[50]:

    title                                              abstract                                           text                                               text_token
0   Clinical features of culture-proven Mycoplasma...  OBJECTIVE: This retrospective chart review des...  clinical features of culture-proven mycoplasma...  [clinical, features, culture, proven, mycoplas...
1   Nitric oxide: a pro-inflammatory mediator in l...  Inflammatory diseases of the respiratory tract...  nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, pro, inflammatory, mediator, l...
2   Surfactant protein-D and pulmonary host defense    Surfactant protein-D (SP-D) participates in th...  surfactant protein-d and pulmonary host defens...  [surfactant, protein, pulmonary, host, defense...
3   Role of endothelin-1 in lung disease               Endothelin-1 (ET-1) is a 21 amino acid peptide...  role of endothelin-1 in lung diseaseendothelin...  [role, endothelin, 1, lung, diseaseendothelin,...
4   Gene expression in epithelial cells in respons...  Respiratory syncytial virus (RSV) and pneumoni...  gene expression in epithelial cells in respons...  [gene, expression, epithelial, cells, response...

In [51]:
# Rebuild one string per row, keeping only tokens longer than two characters
metadata1['text_string'] = metadata1['text_token'].apply(
    lambda x: ' '.join([item for item in x if len(item)>2]))


In [52]:
metadata1[['text','text_token','text_string']]
Out[52]:

    text                                               text_token                                         text_string
0   clinical features of culture-proven mycoplasma...  [clinical, features, culture, proven, mycoplas...  clinical features culture proven mycoplasma pn...
1   nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, pro, inflammatory, mediator, l...  nitric oxide pro inflammatory mediator lung di...
2   surfactant protein-d and pulmonary host defens...  [surfactant, protein, pulmonary, host, defense...  surfactant protein pulmonary host defensesurfa...
3   role of endothelin-1 in lung diseaseendothelin...  [role, endothelin, 1, lung, diseaseendothelin,...  role endothelin lung diseaseendothelin amino a...
4   gene expression in epithelial cells in respons...  [gene, expression, epithelial, cells, response...  gene expression epithelial cells response pneu...
... ...                                                ...                                                ...
87  global surveillance of emerging influenza viru...  [global, surveillance, emerging, influenza, vi...  global surveillance emerging influenza virus g...
88  transmission parameters of the 2001 foot and m...  [transmission, parameters, 2001, foot, mouth, ...  transmission parameters 2001 foot mouth epidem...
89  efficient replication of pneumonia virus of mi...  [efficient, replication, pneumonia, virus, mic...  efficient replication pneumonia virus mice pvm...
90  designing and conducting tabletop exercises to...  [designing, conducting, tabletop, exercises, a...  designing conducting tabletop exercises assess...
91  transcript-level annotation of affymetrix prob...  [transcript, level, annotation, affymetrix, pr...  transcript level annotation affymetrix probese...

92 rows × 3 columns

In [53]:

#Create a list of all words

all_words = ' '.join([word for word in metadata1['text_string']])

In [54]:
#Tokenize all_words
tokenized_words = nltk.tokenize.word_tokenize(all_words)

In [55]:

# Create a frequency distribution which records the number of times each word has occurred:

from nltk.probability import FreqDist

fdist = FreqDist(tokenized_words)
fdist
Out[55]:

FreqDist({'health': 78, 'rna': 63, 'virus': 62, 'expression': 56, 'results': 50,
          'patients': 49, 'public': 49, 'gene': 48, 'protein': 47, 'viral': 44, ...})

We can now use fdist to drop words that occur fewer than a chosen number of times (a threshold of 3 or 4 is
typical).
Because our dataset is very small, we set the threshold to 1, which keeps every word counted in fdist; a higher
threshold would leave too few words in this particular dataset.
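
As a minimal sketch of how such a frequency threshold behaves, assuming toy token lists and using `collections.Counter` in the role that fdist plays in the notebook (the `keep_frequent` helper is illustrative):

```python
from collections import Counter

# Toy tokenized documents; the words are illustrative only.
docs = [
    ["virus", "rna", "health"],
    ["virus", "protein", "health"],
    ["virus", "gene"],
]

# Count every token across all documents.
freq = Counter(token for doc in docs for token in doc)

def keep_frequent(tokens, freq, min_count):
    """Keep tokens whose corpus-wide count is at least min_count."""
    return [t for t in tokens if freq[t] >= min_count]

# With a threshold of 2, words seen only once are removed.
print([keep_frequent(doc, freq, 2) for doc in docs])

# With a threshold of 1, every counted word is kept.
print([keep_frequent(doc, freq, 1) for doc in docs] == docs)
```

On a larger sample than the 92 abstracts used here, raising the threshold would remove rare words and typos while keeping the dominant vocabulary.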

In [56]:

metadata1['text_string_fdist'] = metadata1['text_token'].apply(
    lambda x: ' '.join([item for item in x if fdist[item] >= 1]))


In [57]:
metadata1[['text', 'text_token', 'text_string', 'text_string_fdist']].head()

Out[57]:

    text                                               text_token                                         text_string                                        text_string_fdist
0   clinical features of culture-proven mycoplasma...  [clinical, features, culture, proven, mycoplas...  clinical features culture proven mycoplasma pn...  clinical features culture proven mycoplasma pn...
1   nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, pro, inflammatory, mediator, l...  nitric oxide pro inflammatory mediator lung di...  nitric oxide pro inflammatory mediator lung di...
2   surfactant protein-d and pulmonary host defens...  [surfactant, protein, pulmonary, host, defense...  surfactant protein pulmonary host defensesurfa...  surfactant protein pulmonary host defensesurfa...
3   role of endothelin-1 in lung diseaseendothelin...  [role, endothelin, 1, lung, diseaseendothelin,...  role endothelin lung diseaseendothelin amino a...  role endothelin lung diseaseendothelin amino a...
4   gene expression in epithelial cells in respons...  [gene, expression, epithelial, cells, response...  gene expression epithelial cells response pneu...  gene expression epithelial cells response pneu...

In [58]:
fdist.most_common(3)

Out[58]:
[('health', 78), ('rna', 63), ('virus', 62)]

In [59]:
fdist.tabulate(3)

health    rna  virus
    78     63     62

Visualizing Common Words


In [60]:
# Obtain top 10 words
top_10 = fdist.most_common(10)

# Create pandas series to make plotting easier
# (note: this rebinds fdist from the FreqDist to a Series holding only the top 10 words)
fdist = pd.Series(dict(top_10))

In [67]:

import seaborn as sns


sns.set_theme(style="ticks")

sns.barplot(y=fdist.index, x=fdist.values, palette="hls")

Out[67]:
<AxesSubplot: >
In [68]:

import plotly.express as px

fig = px.bar(y=fdist.index, x=fdist.values)

# sort values
fig.update_layout(barmode='stack', yaxis={'categoryorder':'total ascending'})

# show plot
fig.show()

Search Specific words


In [69]:
# Show frequency of a specific word (fdist now holds only the top 10 words)
fdist["viral"]

Out[69]:
44

Word Cloud
In [71]:

%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud

wordcloud = WordCloud(width=600,
height=400,
random_state=2,
max_font_size=100).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off');

From the word cloud and the word frequencies we can see that the articles mostly contain technical terms relevant
to Covid-19 research (such as RNA, gene, protein, etc.), together with public-health terms such as health and patients.

In [72]:

#Different style: draw the cloud inside a circular mask
import numpy as np

x, y = np.ogrid[:300, :300]

mask = (x - 150) ** 2 + (y - 150) ** 2 > 130 ** 2
mask = 255 * mask.astype(int)

wc = WordCloud(background_color="white", repeat=True, mask=mask)
wc.generate(all_words)

plt.axis("off")
plt.imshow(wc, interpolation="bilinear");

Sentiment analysis

VADER lexicon
NLTK provides a simple rule-based model for general sentiment analysis called VADER, which stands for
“Valence Aware Dictionary and Sentiment Reasoner” (Hutto & Gilbert, 2014).

Sentiment Intensity Analyzer


Initialize an object of SentimentIntensityAnalyzer with name “analyzer”:

In [74]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to


[nltk_data] C:\Users\DEV\AppData\Roaming\nltk_data...

Polarity scores
Use the polarity_scores method:

In [75]:
metadata1['polarity'] = metadata1['text_string_fdist'].apply(lambda x: analyzer.polarity_scores(x))
metadata1.tail(3)


Out[75]:

    title                                              abstract                                           text                                               text_token                                         text_string                                        text_string_fdist                                  polarity
89  Efficient replication of pneumonia virus of mi...  Pneumonia virus of mice (PVM; family Paramyxov...  efficient replication of pneumonia virus of mi...  [efficient, replication, pneumonia, virus, mic...  efficient replication pneumonia virus mice pvm...  efficient replication pneumonia virus mice pvm...  {'neg': 0.057, 'neu': 0.862, 'pos': 0.081, 'co...
90  Designing and conducting tabletop exercises to...  BACKGROUND: Since 2001, state and local health...  designing and conducting tabletop exercises to...  [designing, conducting, tabletop, exercises, a...  designing conducting tabletop exercises assess...  designing conducting tabletop exercises assess...  {'neg': 0.05, 'neu': 0.843, 'pos': 0.107, 'com...
91  Transcript-level annotation of Affymetrix prob...  BACKGROUND: The wide use of Affymetrix microar...  transcript-level annotation of affymetrix prob...  [transcript, level, annotation, affymetrix, pr...  transcript level annotation affymetrix probese...  transcript level annotation affymetrix probese...  {'neg': 0.014, 'neu': 0.86, 'pos': 0.126, 'com...

Transform data
Change data structure

In [76]:

# Expand the polarity dict into neg/neu/pos/compound columns and drop title and abstract
metadata1 = pd.concat(
    [metadata1.drop(['title', 'abstract'], axis=1), metadata1['polarity'].apply(pd.Series)],
    axis=1)
metadata1.head(3)

Out[76]:

    text                                               text_token                                         text_string                                        text_string_fdist                                  polarity                                           neg    neu    pos    compound
0   clinical features of culture-proven mycoplasma...  [clinical, features, culture, proven, mycoplas...  clinical features culture proven mycoplasma pn...  clinical features culture proven mycoplasma pn...  {'neg': 0.092, 'neu': 0.886, 'pos': 0.022, 'co...  0.092  0.886  0.022  -0.8779
1   nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, pro, inflammatory, mediator, l...  nitric oxide pro inflammatory mediator lung di...  nitric oxide pro inflammatory mediator lung di...  {'neg': 0.125, 'neu': 0.828, 'pos': 0.048, 'co...  0.125  0.828  0.048  -0.7717
2   surfactant protein-d and pulmonary host defens...  [surfactant, protein, pulmonary, host, defense...  surfactant protein pulmonary host defensesurfa...  surfactant protein pulmonary host defensesurfa...  {'neg': 0.036, 'neu': 0.91, 'pos': 0.054, 'com...  0.036  0.910  0.054   0.1531

In [77]:
# Create new variable with sentiment "neutral," "positive" and "negative"
metadata1['sentiment'] = metadata1['compound'].apply(
    lambda x: 'positive' if x > 0 else 'neutral' if x == 0 else 'negative')
metadata1.head()
Out[77]:

    text                                               text_token                                         text_string                                        text_string_fdist                                  polarity                                           neg    neu    pos    compound  sentiment
0   clinical features of culture-proven mycoplasma...  [clinical, features, culture, proven, mycoplas...  clinical features culture proven mycoplasma pn...  clinical features culture proven mycoplasma pn...  {'neg': 0.092, 'neu': 0.886, 'pos': 0.022, 'co...  0.092  0.886  0.022  -0.8779   negative
1   nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, pro, inflammatory, mediator, l...  nitric oxide pro inflammatory mediator lung di...  nitric oxide pro inflammatory mediator lung di...  {'neg': 0.125, 'neu': 0.828, 'pos': 0.048, 'co...  0.125  0.828  0.048  -0.7717   negative
2   surfactant protein-d and pulmonary host defens...  [surfactant, protein, pulmonary, host, defense...  surfactant protein pulmonary host defensesurfa...  surfactant protein pulmonary host defensesurfa...  {'neg': 0.036, 'neu': 0.91, 'pos': 0.054, 'com...  0.036  0.910  0.054   0.1531   positive
3   role of endothelin-1 in lung diseaseendothelin...  [role, endothelin, 1, lung, diseaseendothelin,...  role endothelin lung diseaseendothelin amino a...  role endothelin lung diseaseendothelin amino a...  {'neg': 0.0, 'neu': 0.942, 'pos': 0.058, 'comp...  0.000  0.942  0.058   0.3400   positive
4   gene expression in epithelial cells in respons...  [gene, expression, epithelial, cells, response...  gene expression epithelial cells response pneu...  gene expression epithelial cells response pneu...  {'neg': 0.0, 'neu': 0.949, 'pos': 0.051, 'comp...  0.000  0.949  0.051   0.4939   positive
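
The labels above treat any non-zero compound score as positive or negative. A commonly cited convention from the VADER documentation uses a neutral band of ±0.05 around zero instead. A minimal sketch of that rule (the `label` helper and its default threshold are written for illustration and are not part of the notebook), applied to the compound scores shown above:

```python
def label(compound, threshold=0.05):
    """Map a VADER compound score to a sentiment label using a neutral band."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores of rows 0-4 from the output above, plus one near-zero
# value (0.01) added for illustration.
scores = [-0.8779, -0.7717, 0.1531, 0.3400, 0.4939, 0.01]
print([label(s) for s in scores])
```

For the five rows shown the labels are unchanged; the band only matters for abstracts whose compound score is very close to zero.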

Analyze data
Title with highest positive sentiment

In [78]:
metadata1.loc[metadata1['compound'].idxmax()].values

Out[78]:
array(['pandemic influenza preparedness: an ethical framework to guide decision-makingbac
kground: planning for the next pandemic influenza outbreak is underway in hospitals acros
s the world. the global sars experience has taught us that ethical frameworks to guide de
cision-making may help to reduce collateral damage and increase trust and solidarity with
in and between health care organisations. good pandemic planning requires reflection on v
alues because science alone cannot tell us how to prepare for a public health crisis. dis
cussion: in this paper, we present an ethical framework for pandemic influenza planning.
the ethical framework was developed with expertise from clinical, organisational and publ
ic health ethics and validated through a stakeholder engagement process. the ethical fram
ework includes both substantive and procedural elements for ethical pandemic influenza pl
anning. the incorporation of ethics into pandemic planning can be helped by senior hospit
al administrators sponsoring its use, by having stakeholders vet the framework, and by de
signing or identifying decision review processes. we discuss the merits and limits of an
applied ethical framework for hospital decision-making, as well as the robustness of the
framework. summary: the need for reflection on the ethical issues raised by the spectre o
f a pandemic influenza outbreak is great. our efforts to address the normative aspects of
pandemic planning in hospitals have generated interest from other hospitals and from the
governmental sector. the framework will require re-evaluation and refinement and we hope
that this paper will generate feedback on how to make it even more robust.',
list(['pandemic', 'influenza', 'preparedness', 'ethical', 'framework', 'guide', 'd
ecision', 'makingbackground', 'planning', 'next', 'pandemic', 'influenza', 'outbreak', 'u
nderway', 'hospitals', 'across', 'world', 'global', 'sars', 'experience', 'taught', 'us',
'ethical', 'frameworks', 'guide', 'decision', 'making', 'may', 'help', 'reduce', 'collate
ral', 'damage', 'increase', 'trust', 'solidarity', 'within', 'health', 'care', 'organisat
ions', 'good', 'pandemic', 'planning', 'requires', 'reflection', 'values', 'science', 'al
one', 'cannot', 'tell', 'us', 'prepare', 'public', 'health', 'crisis', 'discussion', 'pap
er', 'present', 'ethical', 'framework', 'pandemic', 'influenza', 'planning', 'ethical', '
framework', 'developed', 'expertise', 'clinical', 'organisational', 'public', 'health', '
ethics', 'validated', 'stakeholder', 'engagement', 'process', 'ethical', 'framework', 'in
cludes', 'substantive', 'procedural', 'elements', 'ethical', 'pandemic', 'influenza', 'pl
anning', 'incorporation', 'ethics', 'pandemic', 'planning', 'helped', 'senior', 'hospital
', 'administrators', 'sponsoring', 'use', 'stakeholders', 'vet', 'framework', 'designing'
, 'identifying', 'decision', 'review', 'processes', 'discuss', 'merits', 'limits', 'appli
ed', 'ethical', 'framework', 'hospital', 'decision', 'making', 'well', 'robustness', 'fra
mework', 'summary', 'need', 'reflection', 'ethical', 'issues', 'raised', 'spectre', 'pand
emic', 'influenza', 'outbreak', 'great', 'efforts', 'address', 'normative', 'aspects', 'p
andemic', 'planning', 'hospitals', 'generated', 'interest', 'hospitals', 'governmental',
'sector', 'framework', 'require', 'evaluation', 'refinement', 'hope', 'paper', 'generate'
, 'feedback', 'make', 'even', 'robust']),
'pandemic influenza preparedness ethical framework guide decision makingbackground
planning next pandemic influenza outbreak underway hospitals across world global sars exp
erience taught ethical frameworks guide decision making may help reduce collateral damage
increase trust solidarity within health care organisations good pandemic planning require
s reflection values science alone cannot tell prepare public health crisis discussion pap
er present ethical framework pandemic influenza planning ethical framework developed expe
rtise clinical organisational public health ethics validated stakeholder engagement proce
ss ethical framework includes substantive procedural elements ethical pandemic influenza
planning incorporation ethics pandemic planning helped senior hospital administrators spo
nsoring use stakeholders vet framework designing identifying decision review processes di
scuss merits limits applied ethical framework hospital decision making well robustness fr
amework summary need reflection ethical issues raised spectre pandemic influenza outbreak
great efforts address normative aspects pandemic planning hospitals generated interest ho
spitals governmental sector framework require evaluation refinement hope paper generate f
eedback make even robust',
'pandemic influenza preparedness ethical framework guide decision makingbackground
planning next pandemic influenza outbreak underway hospitals across world global sars exp
erience taught ethical frameworks guide decision making may help reduce collateral damage
increase trust solidarity within health care organisations good pandemic planning require
s reflection values science alone tell prepare public health crisis discussion paper pres
ent ethical framework pandemic influenza planning ethical framework developed expertise c
linical organisational public health ethics validated stakeholder engagement process ethi
cal framework includes substantive procedural elements ethical pandemic influenza plannin
g incorporation ethics pandemic planning helped senior hospital administrators sponsoring
use stakeholders vet framework designing identifying decision review processes discuss me
rits limits applied ethical framework hospital decision making well robustness framework
summary need reflection ethical issues raised spectre pandemic influenza outbreak great e
fforts address normative aspects pandemic planning hospitals generated interest hospitals
governmental sector framework require evaluation refinement hope paper generate feedback
make even robust',
{'neg': 0.047, 'neu': 0.609, 'pos': 0.344, 'compound': 0.995},
0.047, 0.609, 0.344, 0.995, 'positive'], dtype=object)

Title with highest negative sentiment

In [79]:
metadata1.loc[metadata1['compound'].idxmin()].values

Out[79]:
array(['public awareness of risk factors for cancer among the japanese general population
: a population-based surveybackground: the present study aimed to provide information on
awareness of the attributable fraction of cancer causes among the japanese general popula
tion. methods: a nationwide representative sample of 2,000 japanese aged 20 or older was
asked about their perception and level of concern about various environmental and genetic
risk factors in relation to cancer prevention, as a part of an omnibus survey. interviews
were conducted with 1,355 subjects (609 men and 746 women). results: among 12 risk factor
candidates, the attributable fraction of cancer-causing viral and bacterial infection was
considered highest (51%), followed by that of tobacco smoking (43%), stress (39%), and en
docrine-disrupting chemicals (37%). on the other hand, the attributable fractions of canc
er by charred fish and meat (21%) and alcohol drinking (22%) were considered low compared
with other risk factor candidates. for most risk factors, attributable fraction responses
were higher in women than in men. as a whole, the subjects tended to respond with higher
values than those estimated by epidemiologic evidence in the west. the attributable fract
ion of cancer speculated to be genetically determined was 32%, while 36% of cancer was co
nsidered preventable by improving lifestyle. conclusion: our results suggest that awarene
ss of the attributable fraction of cancer causes in the japanese general population tends
to be dominated by cancer-causing infection, occupational exposure, air pollution and foo
d additives rather than major lifestyle factors such as diet.',
list(['public', 'awareness', 'risk', 'factors', 'cancer', 'among', 'japanese', 'ge
neral', 'population', 'population', 'based', 'surveybackground', 'present', 'study', 'aim
ed', 'provide', 'information', 'awareness', 'attributable', 'fraction', 'cancer', 'causes
', 'among', 'japanese', 'general', 'population', 'methods', 'nationwide', 'representative
', 'sample', '2', '000', 'japanese', 'aged', '20', 'older', 'asked', 'perception', 'level
', 'concern', 'various', 'environmental', 'genetic', 'risk', 'factors', 'relation', 'canc
er', 'prevention', 'part', 'omnibus', 'survey', 'interviews', 'conducted', '1', '355', 's
ubjects', '609', 'men', '746', 'women', 'results', 'among', '12', 'risk', 'factor', 'cand
idates', 'attributable', 'fraction', 'cancer', 'causing', 'viral', 'bacterial', 'infectio
n', 'considered', 'highest', '51', 'followed', 'tobacco', 'smoking', '43', 'stress', '39'
, 'endocrine', 'disrupting', 'chemicals', '37', 'hand', 'attributable', 'fractions', 'can
cer', 'charred', 'fish', 'meat', '21', 'alcohol', 'drinking', '22', 'considered', 'low',
'compared', 'risk', 'factor', 'candidates', 'risk', 'factors', 'attributable', 'fraction'
, 'responses', 'higher', 'women', 'men', 'whole', 'subjects', 'tended', 'respond', 'highe
r', 'values', 'estimated', 'epidemiologic', 'evidence', 'west', 'attributable', 'fraction
', 'cancer', 'speculated', 'genetically', 'determined', '32', '36', 'cancer', 'considered
', 'preventable', 'improving', 'lifestyle', 'conclusion', 'results', 'suggest', 'awarenes
s', 'attributable', 'fraction', 'cancer', 'causes', 'japanese', 'general', 'population',
'tends', 'dominated', 'cancer', 'causing', 'infection', 'occupational', 'exposure', 'air'
, 'pollution', 'food', 'additives', 'rather', 'major', 'lifestyle', 'factors', 'diet']),
'public awareness risk factors cancer among japanese general population population
based surveybackground present study aimed provide information awareness attributable fra
ction cancer causes among japanese general population methods nationwide representative s
ample 000 japanese aged older asked perception level concern various environmental geneti
c risk factors relation cancer prevention part omnibus survey interviews conducted 355 su
bjects 609 men 746 women results among risk factor candidates attributable fraction cance
r causing viral bacterial infection considered highest followed tobacco smoking stress en
docrine disrupting chemicals hand attributable fractions cancer charred fish meat alcohol
drinking considered low compared risk factor candidates risk factors attributable fractio
n responses higher women men whole subjects tended respond higher values estimated epidem
iologic evidence west attributable fraction cancer speculated genetically determined canc
er considered preventable improving lifestyle conclusion results suggest awareness attrib
utable fraction cancer causes japanese general population tends dominated cancer causing
infection occupational exposure air pollution food additives rather major lifestyle facto
rs diet',
'public awareness risk factors cancer among japanese general population population
based surveybackground present study aimed provide information awareness attributable fra
ction cancer causes among japanese general population methods nationwide representative s
ample 000 japanese aged older asked perception level concern various environmental geneti
c risk factors relation cancer prevention part omnibus survey interviews conducted 355 su
bjects 609 men 746 women results among risk factor candidates attributable fraction cance
r causing viral bacterial infection considered highest followed tobacco smoking stress en
docrine disrupting chemicals hand attributable fractions cancer charred fish meat alcohol
drinking considered low compared risk factor candidates risk factors attributable fractio
n responses higher women men whole subjects tended respond higher values estimated epidem
iologic evidence west attributable fraction cancer speculated genetically determined canc
er considered preventable improving lifestyle conclusion results suggest awareness attrib
utable fraction cancer causes japanese general population tends dominated cancer causing
infection occupational exposure air pollution food additives rather major lifestyle facto
rs diet',
{'neg': 0.282, 'neu': 0.661, 'pos': 0.057, 'compound': -0.9927},
0.282, 0.661, 0.057, -0.9927, 'negative'], dtype=object)
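
Each record above ends with the four sentiment scores and a label derived from the compound score. The labelling rule is not shown in this part of the notebook, so the following is only a minimal sketch, assuming the commonly used cut-offs of +0.05 and -0.05 on the compound score. The function name label_from_compound and the threshold parameter are hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    Assumption: the +/-0.05 cut-offs are a common convention,
    not necessarily the rule used in this notebook.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# The two extreme records shown above
print(label_from_compound(0.995))    # positive
print(label_from_compound(-0.9927))  # negative
```

Under this assumption, both extreme records receive the same labels as in the outputs above.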

Visualize data
In [80]:
# Number of records per sentiment class
sns.countplot(y='sentiment',
              data=metadata1,
              palette=['#b2d8d8', '#008080', '#db3d13']
              );
In [81]:
# Boxplot
sns.boxplot(y='compound',
            x='sentiment',
            palette=['#b2d8d8', '#008080', '#db3d13'],
            data=metadata1);

In [83]:
# Lineplot
g = sns.lineplot(x=df['year'], y=metadata1['compound'])

g.set(title='Sentiment of Titles')
g.set(xlabel="Time")
g.set(ylabel="Sentiment")

g.axhline(0, ls='--', c='grey');


The graphs above indicate that more titles and abstracts were classified as positive than as negative.
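
A reading like this can be backed with numbers by computing the share of each sentiment label instead of judging it from the plots alone. The sketch below is illustrative only: the helper sentiment_shares and the example labels are hypothetical, and in the notebook the input would be the sentiment column, for example sentiment_shares(metadata1['sentiment']).

```python
from collections import Counter


def sentiment_shares(labels):
    """Return the fraction of records carrying each sentiment label."""
    labels = list(labels)
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels, not taken from the dataset
example_labels = ["positive", "positive", "negative", "neutral"]
shares = sentiment_shares(example_labels)
print(shares)
```

If the positive share is clearly larger than the negative share, the conclusion drawn from the count plot is supported.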

Thank You
