Data Analytics Task
In [6]:
In [9]:
df = pd.read_csv("./metadata.csv",nrows=100)
In [10]:
In [12]:
df.head(3)
Out[12]:
   cord_uid  sha                                       source_x  title                                              doi                    pmcid     pubmed_id  license
0  ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb  PMC       Clinical features of culture-proven Mycoplasma...  10.1186/1471-2334-1-6  PMC35282  11472636   no-cc
1  02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d  PMC       Nitric oxide: a pro-inflammatory mediator in l...  10.1186/rr14           PMC59543  11667967   no-cc
2  ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927  PMC       Surfactant protein-D and pulmonary host defense    10.1186/rr19           PMC59549  11667972   no-cc
In [13]:
In [14]:
# All 100 sampled rows are null in these columns, so we drop them; they have no impact on our analysis
df.drop(['mag_id','who_covidence_id','arxiv_id','s2_id'],axis =1,inplace = True)
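The comment above relies on a null count per column, which this export does not show. A minimal sketch of how such a check might look, using a small hypothetical frame (`sample`) in place of `metadata.csv`:

```python
import pandas as pd

# Hypothetical stand-in for the first rows of metadata.csv
sample = pd.DataFrame({
    "cord_uid": ["ug7v899j", "02tnwd4m", "ejv2xln0"],
    "license": ["no-cc", "no-cc", "no-cc"],
    "mag_id": [None, None, None],
    "arxiv_id": [None, None, None],
})

# Count missing values per column
null_counts = sample.isna().sum()

# Columns where every row is missing carry no information
empty_cols = null_counts[null_counts == len(sample)].index.tolist()

# Drop them without naming each column by hand
cleaned = sample.drop(columns=empty_cols)
print(empty_cols)
print(cleaned.columns.tolist())
```

Deriving the column list from the counts keeps the drop step in sync with the data if more rows are loaded later.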
In [16]:
df.head(3)
Out[16]:
   cord_uid  sha                                       source_x  title                                              doi                    pmcid     pubmed_id  license
0  ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb  PMC       Clinical features of culture-proven Mycoplasma...  10.1186/1471-2334-1-6  PMC35282  11472636   no-cc
1  02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d  PMC       Nitric oxide: a pro-inflammatory mediator in l...  10.1186/rr14           PMC59543  11667967   no-cc
2  ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927  PMC       Surfactant protein-D and pulmonary host defense    10.1186/rr19           PMC59549  11667972   no-cc
In [17]:
We find that Nucleic Acids Res has the highest number of publications.
In [29]:
df.groupby(['license']).size().groupby(level=0).max().sort_values(ascending=False)
Out[29]:
license
cc-by 46
no-cc 38
cc-by-nc 7
green-oa 4
bronze-oa 3
cc0 2
dtype: int64
Most publications (46) were released under the cc-by license; cc0 is the least common license type, with only 2 publications.
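The `groupby(...).size().groupby(level=0).max()` chain used for this count does the same job as a plain `value_counts()`. A small sketch that rebuilds a `license` column from the counts shown above and compares the two approaches (`licenses` is a reconstructed stand-in, not the original frame):

```python
import pandas as pd

# License counts copied from the output above, expanded back into one row per paper
counts = {"cc-by": 46, "no-cc": 38, "cc-by-nc": 7, "green-oa": 4, "bronze-oa": 3, "cc0": 2}
licenses = pd.Series([name for name, n in counts.items() for _ in range(n)], name="license")

# value_counts() counts each label and sorts in descending order
by_value_counts = licenses.value_counts()

# The groupby chain from the notebook, applied to the same data
by_groupby = (
    licenses.to_frame()
    .groupby(["license"]).size()
    .groupby(level=0).max()
    .sort_values(ascending=False)
)

print(by_value_counts.to_dict() == by_groupby.to_dict())
```

The second `groupby(level=0).max()` has nothing left to aggregate after `size()`, so the shorter form is usually easier to read.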
In [30]:
df.groupby(['license','journal']).size().sort_values(ascending=False)
Out[30]:
license journal
no-cc Nucleic Acids Res 9
cc-by BMC Public Health 8
cc-by-nc Nucleic Acids Res 7
no-cc J Gen Intern Med 6
cc-by PLoS One 6
no-cc Respir Res 5
Crit Care 5
cc-by Respir Res 3
green-oa The EMBO Journal 3
cc-by Aust New Zealand Health Policy 2
PLoS Comput Biol 2
BMC Bioinformatics 2
Virol J 2
no-cc EMBO J 2
J Biomed Biotechnol 2
cc-by BMC Med Ethics 2
BMC Genomics 2
no-cc Evid Based Complement Alternat Med 1
Nat Med 1
Cell Microbiol 1
Biol Proced Online 1
BMC Infect Dis 1
J Transl Med 1
green-oa Methods 1
cc0 PLoS Pathog 1
PLoS One 1
no-cc Mediators Inflamm 1
Insect Mol Biol 1
bronze-oa Clinical Chemistry 1
cc-by Theor Biol Med Model 1
Reprod Biol Endocrinol 1
bronze-oa Journal of the American Medical Informatics Association 1
cc-by AIDS Res Ther 1
Ann Clin Microbiol Antimicrob 1
BMC Biotechnol 1
BMC Gastroenterol 1
BMC Infect Dis 1
BMC Mol Biol 1
Crit Care 1
Genome Biol 1
Global Health 1
Harm Reduct J 1
Immunome Res 1
Int J Health Geogr 1
Microb Cell Fact 1
PLoS Biol 1
PLoS Med 1
bronze-oa Journal of Neuropathology and Experimental Neurology 1
no-cc Retrovirology 1
dtype: int64
We find that Nucleic Acids Res appears under 2 license types, namely 'cc-by-nc' and 'no-cc'.
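Rather than reading this off the long listing, the number of license types per journal can be computed directly. A minimal sketch on a few rows copied from the output above (`sample` is a reconstructed subset, not the full frame):

```python
import pandas as pd

# A few (license, journal, count) rows copied from the output above
rows = [
    ("no-cc", "Nucleic Acids Res", 9),
    ("cc-by-nc", "Nucleic Acids Res", 7),
    ("cc-by", "BMC Public Health", 8),
    ("no-cc", "Respir Res", 5),
    ("cc-by", "Respir Res", 3),
]
sample = pd.DataFrame(
    [(lic, journal) for lic, journal, n in rows for _ in range(n)],
    columns=["license", "journal"],
)

# Number of distinct license types per journal
licenses_per_journal = sample.groupby("journal")["license"].nunique()

# Journals that appear under more than one license
mixed = licenses_per_journal[licenses_per_journal > 1]
print(mixed)
```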
In [27]:
df.groupby(['publish_time']).size().groupby(level=0).max().sort_values(ascending=False)
Out[27]:
publish_time
2006-09-29 3
2005-03-30 3
2006-01-10 2
2006-08-18 2
2004-04-03 2
..
2005-01-04 1
2005-01-03 1
2004-11-19 1
2004-11-01 1
2007-06-11 1
Length: 88, dtype: int64
Grouping by the full date is too fine-grained to show a pattern, so we extract the year from the publish_time column.
In [31]:
df.dtypes
Out[31]:
cord_uid object
sha object
source_x object
title object
doi object
pmcid object
pubmed_id int64
license object
abstract object
publish_time object
authors object
journal object
pdf_json_files object
pmc_json_files object
url object
dtype: object
We found that publish_time is stored as a string (object dtype) rather than as a datetime.
In [33]:
# Convert the publish_time strings to datetime
df['publish_time'] = pd.to_datetime(df['publish_time'], format='%Y-%m-%d')
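The conversion above assumes every value matches `%Y-%m-%d`. Whether that holds beyond the 100 rows loaded here is not shown. A sketch, on hypothetical values, of coercing entries that do not match into `NaT` so they can be inspected instead of raising an error:

```python
import pandas as pd

# Hypothetical publish_time values; the last one is year-only and does not match the format
raw = pd.Series(["2006-09-29", "2005-03-30", "2004"], name="publish_time")

# errors="coerce" turns values that do not match the format into NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Rows that failed to parse can then be inspected separately
failed = raw[parsed.isna()]
print(parsed)
print(failed.tolist())
```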
In [34]:
df.dtypes
Out[34]:
cord_uid object
sha object
source_x object
title object
doi object
pmcid object
pubmed_id int64
license object
abstract object
publish_time datetime64[ns]
authors object
journal object
pdf_json_files object
pmc_json_files object
url object
dtype: object
In [37]:
df['year'] = df['publish_time'].dt.year
In [38]:
df.head()
Out[38]:
   cord_uid  sha                                       source_x  title                                              doi                    pmcid     pubmed_id  license
0  ug7v899j  d1aafb70c066a2068b02786f8929fd9c900897fb  PMC       Clinical features of culture-proven Mycoplasma...  10.1186/1471-2334-1-6  PMC35282  11472636   no-cc
1  02tnwd4m  6b0567729c2143a66d737eb0a2f63f2dce2e5a7d  PMC       Nitric oxide: a pro-inflammatory mediator in l...  10.1186/rr14           PMC59543  11667967   no-cc
2  ejv2xln0  06ced00a5fc04215949aa72528f2eeaae1d58927  PMC       Surfactant protein-D and pulmonary host defense    10.1186/rr19           PMC59549  11667972   no-cc
3  2b73a28n  348055649b6b8cf2b9a376498df9bf41f7123605  PMC       Role of endothelin-1 in lung disease               10.1186/rr44           PMC59574  11686871   no-cc
4  9785vg6d  5f48792a5fa08bed9f56016f4981ae2ca6031b32  PMC       Gene expression in epithelial cells in respons...  10.1186/rr61           PMC59580  11686888   no-cc
In [39]:
df.groupby(['year']).size().groupby(level = 0).max().sort_values(ascending=False)
Out[39]:
year
2006 36
2005 23
2007 13
2004 10
2001 7
2000 5
2003 4
1997 1
2002 1
dtype: int64
We find that 1997 and 2002 have the fewest publications, with 1 each.
The highest number of publications, 36, falls in 2006.
Counts generally increase from 1997 to 2006, apart from a dip to only 1 publication in 2002, and then decline again in 2007 (13).
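The output above is sorted by count, which makes the trend over time hard to read. A minimal sketch that re-sorts the same numbers chronologically (`per_year` is typed in from the output, not recomputed from the data):

```python
import pandas as pd

# Yearly counts copied from the output above
per_year = pd.Series(
    {2006: 36, 2005: 23, 2007: 13, 2004: 10, 2001: 7, 2000: 5, 2003: 4, 1997: 1, 2002: 1},
    name="publications",
)

# Sort by year rather than by count so the trend reads top to bottom
chronological = per_year.sort_index()

# Change versus the previous year present in the sample (years without papers are absent)
change = chronological.diff()
print(chronological)
print(change)
```

In the notebook the equivalent would be to end the chain with `.sort_index()` instead of `.sort_values(ascending=False)`.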
In [40]:
df.groupby(['year','journal'])['title'].size().sort_values(ascending=False)
Out[40]:
year journal
2005 Nucleic Acids Res 9
2006 Nucleic Acids Res 6
BMC Public Health 6
2007 PLoS One 5
2006 Crit Care 3
..
2005 Int J Health Geogr 1
J Gen Intern Med 1
Methods 1
Microb Cell Fact 1
2007 Virol J 1
Name: title, Length: 69, dtype: int64
In [41]:
df.groupby(['year','license']).size().sort_values(ascending=False)
Out[41]:
year license
2006 cc-by 23
2005 cc-by 11
no-cc 11
2007 cc-by 9
2001 no-cc 6
2006 no-cc 6
2004 no-cc 6
2006 cc-by-nc 6
2000 no-cc 3
2003 no-cc 3
2004 cc-by 3
2007 cc0 2
2000 green-oa 2
2007 cc-by-nc 1
1997 no-cc 1
2006 bronze-oa 1
2005 green-oa 1
2004 bronze-oa 1
2003 bronze-oa 1
2002 no-cc 1
2001 green-oa 1
2007 no-cc 1
dtype: int64
In [42]:
metadata1 = df[['title','abstract']]
metadata1.head()
Out[42]:
title abstract
0 Clinical features of culture-proven Mycoplasma... OBJECTIVE: This retrospective chart review des...
1 Nitric oxide: a pro-inflammatory mediator in l... Inflammatory diseases of the respiratory tract...
2 Surfactant protein-D and pulmonary host defense Surfactant protein-D (SP-D) participates in th...
4 Gene expression in epithelial cells in respons... Respiratory syncytial virus (RSV) and pneumoni...
In [43]:
metadata1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 100 non-null object
1 abstract 92 non-null object
dtypes: object(2)
memory usage: 1.7+ KB
In [44]:
metadata1.dropna(how = "any",axis = 0, inplace = True)
metadata1.reset_index(drop = True, inplace = True)
metadata1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 92 non-null object
1 abstract 92 non-null object
dtypes: object(2)
memory usage: 1.6+ KB
C:\Users\DEV\AppData\Local\Temp\ipykernel_16408\3320786844.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
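The `SettingWithCopyWarning` appears because `metadata1` was created by selecting columns from `df` and is then modified in place. A minimal sketch of one common way to avoid it, taking an explicit copy first (the frame and column values here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for df with one missing abstract
df_sample = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "abstract": ["Text A", None, "Text C"],
    "journal": ["J1", "J2", "J3"],
})

# .copy() makes the subset an independent frame, so later assignments
# are not flagged as writes to a slice of df_sample
subset = df_sample[["title", "abstract"]].copy()

# Reassigning the result avoids inplace=True on a derived frame
subset = subset.dropna(how="any").reset_index(drop=True)

# Adding a column now modifies only the copy
subset["abstract_len"] = subset["abstract"].str.len()
print(subset)
```

With this pattern the same warning should also disappear from the later cells that add columns to `metadata1`.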
In [45]:
metadata1['text'] = metadata1['title'] + metadata1['abstract']
metadata1
C:\Users\DEV\AppData\Local\Temp\ipykernel_16408\4285303817.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
    title                                              abstract                                           text
0   Clinical features of culture-proven Mycoplasma...  OBJECTIVE: This retrospective chart review des...  Clinical features of culture-proven Mycoplasma...
1   Nitric oxide: a pro-inflammatory mediator in l...  Inflammatory diseases of the respiratory tract...  Nitric oxide: a pro-inflammatory mediator in l...
2   Surfactant protein-D and pulmonary host defense    Surfactant protein-D (SP-D) participates in th...  Surfactant protein-D and pulmonary host defens...
4   Gene expression in epithelial cells in respons...  Respiratory syncytial virus (RSV) and pneumoni...  Gene expression in epithelial cells in respons...
..  ...                                                ...                                                ...
88  Transmission Parameters of the 2001 Foot and M...  Despite intensive ongoing research, key aspect...  Transmission Parameters of the 2001 Foot and M...
89  Efficient replication of pneumonia virus of mi...  Pneumonia virus of mice (PVM; family Paramyxov...  Efficient replication of pneumonia virus of mi...
90  Designing and conducting tabletop exercises to...  BACKGROUND: Since 2001, state and local health...  Designing and conducting tabletop exercises to...
91  Transcript-level annotation of Affymetrix prob...  BACKGROUND: The wide use of Affymetrix microar...  Transcript-level annotation of Affymetrix prob...
92 rows × 3 columns
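Concatenating with `+` joins the title directly to the abstract, which is why fused tokens such as `diseaseendothelin` and `defensesurfa...` show up in later outputs. A minimal sketch of adding a separator (the abstract text here is shortened and partly hypothetical):

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["Surfactant protein-D and pulmonary host defense"],
    # Shortened abstract with a hypothetical ending
    "abstract": ["Surfactant protein-D (SP-D) participates in host defense"],
})

# Plain + fuses the last title word with the first abstract word
fused = sample["title"] + sample["abstract"]

# str.cat with sep keeps the two fields apart
spaced = sample["title"].str.cat(sample["abstract"], sep=" ")

print(fused[0])
print(spaced[0])
```

With a separator, the tokenizer sees `defense` and `surfactant` as two words, so the word counts and sentiment scores are computed on real vocabulary.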
In [46]:
metadata1['text'] = metadata1['text'].astype(str).str.lower()
metadata1.head()
C:\Users\DEV\AppData\Local\Temp\ipykernel_16408\320398156.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
   title                                              abstract                                           text
0  Clinical features of culture-proven Mycoplasma...  OBJECTIVE: This retrospective chart review des...  clinical features of culture-proven mycoplasma...
1  Nitric oxide: a pro-inflammatory mediator in l...  Inflammatory diseases of the respiratory tract...  nitric oxide: a pro-inflammatory mediator in l...
2  Surfactant protein-D and pulmonary host defense    Surfactant protein-D (SP-D) participates in th...  surfactant protein-d and pulmonary host defens...
4  Gene expression in epithelial cells in respons...  Respiratory syncytial virus (RSV) and pneumoni...  gene expression in epithelial cells in respons...
In [48]:
from nltk.tokenize import RegexpTokenizer
regexp = RegexpTokenizer(r'\w+')
metadata1['text_token'] = metadata1['text'].apply(regexp.tokenize)
metadata1.head()
C:\Users\DEV\AppData\Local\Temp\ipykernel_16408\956240939.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
   title                                              abstract                                           text                                               text_token
0  Clinical features of culture-proven Mycoplasma...  OBJECTIVE: This retrospective chart review des...  clinical features of culture-proven mycoplasma...  [clinical, features, of, culture, proven, myco...
1  Nitric oxide: a pro-inflammatory mediator in l...  Inflammatory diseases of the respiratory tract...  nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, a, pro, inflammatory, mediator...
2  Surfactant protein-D and pulmonary host defense    Surfactant protein-D (SP-D) participates in th...  surfactant protein-d and pulmonary host defens...  [surfactant, protein, d, and, pulmonary, host,...
3  Role of endothelin-1 in lung disease               Endothelin-1 (ET-1) is a 21 amino acid peptide...  role of endothelin-1 in lung diseaseendothelin...  [role, of, endothelin, 1, in, lung, diseaseend...
4  Gene expression in epithelial cells in respons...  Respiratory syncytial virus (RSV) and pneumoni...  gene expression in epithelial cells in respons...  [gene, expression, in, epithelial, cells, in, ...
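The pattern `\w+` keeps runs of letters, digits, and underscores, so hyphenated terms are split, as in `protein, d` above. The same matching can be sketched with the standard-library `re` module, which should give the same tokens as `RegexpTokenizer` with its default settings:

```python
import re

text = "surfactant protein-d and pulmonary host defense"

# \w+ matches maximal runs of word characters; hyphens and spaces act as separators
tokens = re.findall(r"\w+", text)
print(tokens)
```

The raw-string prefix `r` keeps Python from treating `\w` as a string escape.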
In [49]:
import nltk
from nltk.corpus import stopwords
Remove stopwords
In [50]:
metadata1['text_token'] = metadata1['text_token'].apply(lambda x: [item for item in x if item not in stopwords])
metadata1.head()
C:\Users\DEV\AppData\Local\Temp\ipykernel_16408\3856681829.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
   title                                              abstract                                           text                                               text_token
0  Clinical features of culture-proven Mycoplasma...  OBJECTIVE: This retrospective chart review des...  clinical features of culture-proven mycoplasma...  [clinical, features, culture, proven, mycoplas...
1  Nitric oxide: a pro-inflammatory mediator in l...  Inflammatory diseases of the respiratory tract...  nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, pro, inflammatory, mediator, l...
2  Surfactant protein-D and pulmonary host defense    Surfactant protein-D (SP-D) participates in th...  surfactant protein-d and pulmonary host defens...  [surfactant, protein, pulmonary, host, defense...
3  Role of endothelin-1 in lung disease               Endothelin-1 (ET-1) is a 21 amino acid peptide...  role of endothelin-1 in lung diseaseendothelin...  [role, endothelin, 1, lung, diseaseendothelin,...
4  Gene expression in epithelial cells in respons...  Respiratory syncytial virus (RSV) and pneumoni...  gene expression in epithelial cells in respons...  [gene, expression, epithelial, cells, response...
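The filter above tests membership in `stopwords`, so that name has to be a collection of words rather than the corpus module imported earlier; the cell that builds it is not shown. With NLTK this is commonly `stopwords.words('english')` after `nltk.download('stopwords')`. A minimal sketch of the filtering step itself, with a tiny hypothetical stopword set (`stop_set`) standing in for the NLTK list:

```python
# Hypothetical, deliberately tiny stopword set (the NLTK English list is much longer)
stop_set = {"of", "and", "in", "a", "the"}

tokens = ["role", "of", "endothelin", "1", "in", "lung", "disease"]

# Keep tokens that are not stopwords; a set makes each lookup O(1)
filtered = [tok for tok in tokens if tok not in stop_set]

# Same length filter as the next notebook cell: drop tokens of 1 or 2 characters
text_string = " ".join(tok for tok in filtered if len(tok) > 2)
print(filtered)
print(text_string)
```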
In [51]:
metadata1['text_string'] = metadata1['text_token'].apply(lambda x: ' '.join([item for item in x if len(item)>2]))
C:\Users\DEV\AppData\Local\Temp\ipykernel_16408\2135274076.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
In [52]:
metadata1[['text','text_token','text_string']]
Out[52]:
    text                                               text_token                                         text_string
0   clinical features of culture-proven mycoplasma...  [clinical, features, culture, proven, mycoplas...  clinical features culture proven mycoplasma pn...
1   nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, pro, inflammatory, mediator, l...  nitric oxide pro inflammatory mediator lung di...
2   surfactant protein-d and pulmonary host defens...  [surfactant, protein, pulmonary, host, defense...  surfactant protein pulmonary host defensesurfa...
3   role of endothelin-1 in lung diseaseendothelin...  [role, endothelin, 1, lung, diseaseendothelin,...  role endothelin lung diseaseendothelin amino a...
4   gene expression in epithelial cells in respons...  [gene, expression, epithelial, cells, response...  gene expression epithelial cells response pneu...
..  ...                                                ...                                                ...
87  global surveillance of emerging influenza viru...  [global, surveillance, emerging, influenza, vi...  global surveillance emerging influenza virus g...
88  transmission parameters of the 2001 foot and m...  [transmission, parameters, 2001, foot, mouth, ...  transmission parameters 2001 foot mouth epidem...
89  efficient replication of pneumonia virus of mi...  [efficient, replication, pneumonia, virus, mic...  efficient replication pneumonia virus mice pvm...
90  designing and conducting tabletop exercises to...  [designing, conducting, tabletop, exercises, a...  designing conducting tabletop exercises assess...
91  transcript-level annotation of affymetrix prob...  [transcript, level, annotation, affymetrix, pr...  transcript level annotation affymetrix probese...
92 rows × 3 columns
In [53]:
In [54]:
#Tokenize all_words
tokenized_words = nltk.tokenize.word_tokenize(all_words)
In [55]:
# Create a frequency distribution which records the number of times each word has occurred:
fdist = FreqDist(tokenized_words)
fdist
Out[55]:
FreqDist({'health': 78, 'rna': 63, 'virus': 62, 'expression': 56, 'results': 50, 'patients': 49, 'public': 49, 'gene': 48, 'protein': 47, 'viral': 44, ...})
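`all_words` and the `FreqDist` import are used above, but the cells defining them are not part of this export. A sketch of the counting step, assuming `all_words` is the space-joined `text_string` column, and using `collections.Counter` (which `FreqDist` subclasses) so it runs without NLTK. The three strings are hypothetical:

```python
from collections import Counter

# Hypothetical cleaned strings standing in for metadata1['text_string']
text_strings = [
    "public health surveillance virus",
    "rna virus gene expression",
    "public health rna",
]

# One long string of all documents, then split back into tokens
all_words = " ".join(text_strings)
tokenized_words = all_words.split()

# Count how often each token occurs
fdist = Counter(tokenized_words)
print(fdist.most_common(3))
```

Because the strings were already cleaned and joined with single spaces, `split()` is enough here; the notebook uses `nltk.tokenize.word_tokenize` for the same step.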
Now we can use the fdist dictionary to drop words that occur fewer than a certain number of times (a threshold of 3 or 4 is common).
Since our dataset is really small, we do not filter out any words and keep every word that occurs at least once (otherwise not many words would be left in this particular dataset).
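The cell that applies this threshold is empty in the export. A minimal sketch of the idea, assuming tokens are kept when their corpus-wide count reaches a hypothetical `min_count` (the helper `keep_frequent` and the token lists are illustrative, not taken from the notebook):

```python
from collections import Counter

# Hypothetical token lists standing in for metadata1['text_token']
docs = [
    ["public", "health", "virus"],
    ["rna", "virus", "gene"],
    ["public", "health", "rna"],
]

# Corpus-wide frequency of every token
fdist = Counter(tok for doc in docs for tok in doc)

def keep_frequent(tokens, counts, min_count):
    """Join the tokens whose corpus-wide count is at least min_count."""
    return " ".join(tok for tok in tokens if counts[tok] >= min_count)

# min_count=1 keeps everything, as in the notebook; 2 drops words seen only once
unfiltered = [keep_frequent(doc, fdist, 1) for doc in docs]
filtered = [keep_frequent(doc, fdist, 2) for doc in docs]
print(unfiltered)
print(filtered)
```

The result of this step is what the notebook stores in the `text_string_fdist` column.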
In [56]:
C:\Users\DEV\AppData\Local\Temp\ipykernel_16408\2477811021.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
In [57]:
metadata1[['text', 'text_token', 'text_string', 'text_string_fdist']].head()
Out[57]:
   text                                               text_token                                         text_string                                        text_string_fdist
0  clinical features of culture-proven mycoplasma...  [clinical, features, culture, proven, mycoplas...  clinical features culture proven mycoplasma pn...  clinical features culture proven mycoplasma pn...
1  nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, pro, inflammatory, mediator, l...  nitric oxide pro inflammatory mediator lung di...  nitric oxide pro inflammatory mediator lung di...
2  surfactant protein-d and pulmonary host defens...  [surfactant, protein, pulmonary, host, defense...  surfactant protein pulmonary host defensesurfa...  surfactant protein pulmonary host defensesurfa...
3  role of endothelin-1 in lung diseaseendothelin...  [role, endothelin, 1, lung, diseaseendothelin,...  role endothelin lung diseaseendothelin amino a...  role endothelin lung diseaseendothelin amino a...
4  gene expression in epithelial cells in respons...  [gene, expression, epithelial, cells, response...  gene expression epithelial cells response pneu...  gene expression epithelial cells response pneu...
In [58]:
fdist.most_common(3)
Out[58]:
[('health', 78), ('rna', 63), ('virus', 62)]
In [59]:
fdist.tabulate(3)
In [67]:
Out[67]:
<AxesSubplot: >
In [68]:
import plotly.express as px
# sort values
fig.update_layout(barmode='stack', yaxis={'categoryorder':'total ascending'})
# show plot
fig.show()
Out[69]:
44
Word Cloud
In [71]:
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud
wordcloud = WordCloud(width=600,
height=400,
random_state=2,
max_font_size=100).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off');
From the word cloud and the word frequencies we can see that the articles mostly contain technical terms related to Covid-19 research (such as RNA, gene, protein, etc.).
In [72]:
#Different style:
import numpy as np
plt.axis("off")
plt.imshow(wc , interpolation="bilinear");
Sentiment analysis
VADER lexicon
NLTK provides a simple rule-based model for general sentiment analysis called VADER, which stands for
“Valence Aware Dictionary and Sentiment Reasoner” (Hutto & Gilbert, 2014).
In [74]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()
Polarity scores
Use the polarity_scores method:
In [75]:
metadata1['polarity'] = metadata1['text_string_fdist'].apply(lambda x: analyzer.polarity_scores(x))
metadata1.tail(3)
C:\Users\DEV\AppData\Local\Temp\ipykernel_16408\1250974064.py:1: SettingWithCopyWarning:
Out[75]:
    title                                              abstract                                           text                                               text_token                                         text_string                                        text_string_fdist                                  polarity
91  Transcript-level annotation of Affymetrix prob...  BACKGROUND: The wide use of Affymetrix microar...  transcript-level annotation of affymetrix prob...  [transcript, level, annotation, affymetrix, pr...  transcript level annotation affymetrix probese...  transcript level annotation affymetrix probese...  {'neg': 0.014, 'neu': 0.86, 'pos': 0.126, 'com...
Transform data
Change data structure
In [76]:
metadata1 = pd.concat(
    [metadata1.drop(['title', 'abstract'], axis=1), metadata1['polarity'].apply(pd.Series)],
    axis=1)
metadata1.head(3)
Out[76]:
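`apply(pd.Series)` expands each polarity dictionary into its own columns. A sketch of the same reshaping on two score dictionaries taken from this notebook, building the frame from the list of dictionaries instead, which tends to be faster on larger data (`scores` and the `text` labels are hypothetical):

```python
import pandas as pd

# Polarity dictionaries as reported for two documents in this notebook
scores = pd.DataFrame({
    "text": ["doc a", "doc b"],
    "polarity": [
        {"neg": 0.092, "neu": 0.886, "pos": 0.022, "compound": -0.8779},
        {"neg": 0.036, "neu": 0.910, "pos": 0.054, "compound": 0.1531},
    ],
})

# One column per dictionary key, aligned on the original index
expanded = pd.DataFrame(scores["polarity"].tolist(), index=scores.index)
wide = pd.concat([scores, expanded], axis=1)
print(wide.columns.tolist())
```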
In [77]:
# Create new variable with sentiment "neutral", "positive" and "negative"
metadata1['sentiment'] = metadata1['compound'].apply(lambda x: 'positive' if x > 0 else 'neutral' if x == 0 else 'negative')
metadata1.head()
Out[77]:
   text                                               text_token                                         text_string                                        text_string_fdist                                  polarity                                           neg    neu    pos    compound  sentiment
0  clinical features of culture-proven mycoplasma...  [clinical, features, culture, proven, mycoplas...  clinical features culture proven mycoplasma pn...  clinical features culture proven mycoplasma pn...  {'neg': 0.092, 'neu': 0.886, 'pos': 0.022, 'co...  0.092  0.886  0.022  -0.8779   negative
1  nitric oxide: a pro-inflammatory mediator in l...  [nitric, oxide, pro, inflammatory, mediator, l...  nitric oxide pro inflammatory mediator lung di...  nitric oxide pro inflammatory mediator lung di...  {'neg': 0.125, 'neu': 0.828, 'pos': 0.048, 'co...  0.125  0.828  0.048  -0.7717   negative
2  surfactant protein-d and pulmonary host defens...  [surfactant, protein, pulmonary, host, defense...  surfactant protein pulmonary host defensesurfa...  surfactant protein pulmonary host defensesurfa...  {'neg': 0.036, 'neu': 0.91, 'pos': 0.054, 'com...  0.036  0.910  0.054  0.1531    positive
3  role of endothelin-1 in lung diseaseendothelin...  [role, endothelin, 1, lung, diseaseendothelin,...  role endothelin lung diseaseendothelin amino a...  role endothelin lung diseaseendothelin amino a...  {'neg': 0.0, 'neu': 0.942, 'pos': 0.058, 'comp...  0.000  0.942  0.058  0.3400    positive
4  gene expression in epithelial cells in respons...  [gene, expression, epithelial, cells, response...  gene expression epithelial cells response pneu...  gene expression epithelial cells response pneu...  {'neg': 0.0, 'neu': 0.949, 'pos': 0.051, 'comp...  0.000  0.949  0.051  0.4939    positive
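The labelling rule above treats any compound score above 0 as positive and any score below 0 as negative. A minimal sketch of the same rule as a named function; `threshold` is a hypothetical parameter added for illustration (the VADER documentation commonly suggests a cut-off of 0.05), and its default of 0.0 reproduces the rule used in this notebook:

```python
def label_sentiment(compound, threshold=0.0):
    """Map a VADER compound score to 'positive', 'neutral' or 'negative'."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"

# Two compound scores from the output above, plus an exact zero
labels = [label_sentiment(c) for c in (-0.8779, 0.1531, 0.0)]
print(labels)

# With a wider neutral band, weakly scored texts are no longer forced into a class
print(label_sentiment(0.03, threshold=0.05))
```

A function like this can be passed straight to `metadata1['compound'].apply(...)` in place of the nested conditional expression.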
Analyze data
Title with highest positive sentiment
In [78]:
metadata1.loc[metadata1['compound'].idxmax()].values
Out[78]:
array(['pandemic influenza preparedness: an ethical framework to guide decision-makingbac
kground: planning for the next pandemic influenza outbreak is underway in hospitals acros
s the world. the global sars experience has taught us that ethical frameworks to guide de
cision-making may help to reduce collateral damage and increase trust and solidarity with
in and between health care organisations. good pandemic planning requires reflection on v
alues because science alone cannot tell us how to prepare for a public health crisis. dis
cussion: in this paper, we present an ethical framework for pandemic influenza planning.
the ethical framework was developed with expertise from clinical, organisational and publ
ic health ethics and validated through a stakeholder engagement process. the ethical fram
ework includes both substantive and procedural elements for ethical pandemic influenza pl
anning. the incorporation of ethics into pandemic planning can be helped by senior hospit
al administrators sponsoring its use, by having stakeholders vet the framework, and by de
signing or identifying decision review processes. we discuss the merits and limits of an
applied ethical framework for hospital decision-making, as well as the robustness of the
framework. summary: the need for reflection on the ethical issues raised by the spectre o
f a pandemic influenza outbreak is great. our efforts to address the normative aspects of
pandemic planning in hospitals have generated interest from other hospitals and from the
governmental sector. the framework will require re-evaluation and refinement and we hope
that this paper will generate feedback on how to make it even more robust.',
list(['pandemic', 'influenza', 'preparedness', 'ethical', 'framework', 'guide', 'd
ecision', 'makingbackground', 'planning', 'next', 'pandemic', 'influenza', 'outbreak', 'u
nderway', 'hospitals', 'across', 'world', 'global', 'sars', 'experience', 'taught', 'us',
'ethical', 'frameworks', 'guide', 'decision', 'making', 'may', 'help', 'reduce', 'collate
ral', 'damage', 'increase', 'trust', 'solidarity', 'within', 'health', 'care', 'organisat
ions', 'good', 'pandemic', 'planning', 'requires', 'reflection', 'values', 'science', 'al
one', 'cannot', 'tell', 'us', 'prepare', 'public', 'health', 'crisis', 'discussion', 'pap
er', 'present', 'ethical', 'framework', 'pandemic', 'influenza', 'planning', 'ethical', '
framework', 'developed', 'expertise', 'clinical', 'organisational', 'public', 'health', '
ethics', 'validated', 'stakeholder', 'engagement', 'process', 'ethical', 'framework', 'in
cludes', 'substantive', 'procedural', 'elements', 'ethical', 'pandemic', 'influenza', 'pl
anning', 'incorporation', 'ethics', 'pandemic', 'planning', 'helped', 'senior', 'hospital
', 'administrators', 'sponsoring', 'use', 'stakeholders', 'vet', 'framework', 'designing'
, 'identifying', 'decision', 'review', 'processes', 'discuss', 'merits', 'limits', 'appli
ed', 'ethical', 'framework', 'hospital', 'decision', 'making', 'well', 'robustness', 'fra
mework', 'summary', 'need', 'reflection', 'ethical', 'issues', 'raised', 'spectre', 'pand
emic', 'influenza', 'outbreak', 'great', 'efforts', 'address', 'normative', 'aspects', 'p
andemic', 'planning', 'hospitals', 'generated', 'interest', 'hospitals', 'governmental',
'sector', 'framework', 'require', 'evaluation', 'refinement', 'hope', 'paper', 'generate'
, 'feedback', 'make', 'even', 'robust']),
'pandemic influenza preparedness ethical framework guide decision makingbackground
planning next pandemic influenza outbreak underway hospitals across world global sars exp
erience taught ethical frameworks guide decision making may help reduce collateral damage
increase trust solidarity within health care organisations good pandemic planning require
s reflection values science alone cannot tell prepare public health crisis discussion pap
er present ethical framework pandemic influenza planning ethical framework developed expe
rtise clinical organisational public health ethics validated stakeholder engagement proce
ss ethical framework includes substantive procedural elements ethical pandemic influenza
planning incorporation ethics pandemic planning helped senior hospital administrators spo
nsoring use stakeholders vet framework designing identifying decision review processes di
scuss merits limits applied ethical framework hospital decision making well robustness fr
amework summary need reflection ethical issues raised spectre pandemic influenza outbreak
great efforts address normative aspects pandemic planning hospitals generated interest ho
spitals governmental sector framework require evaluation refinement hope paper generate f
eedback make even robust',
'pandemic influenza preparedness ethical framework guide decision makingbackground
planning next pandemic influenza outbreak underway hospitals across world global sars exp
erience taught ethical frameworks guide decision making may help reduce collateral damage
increase trust solidarity within health care organisations good pandemic planning require
s reflection values science alone tell prepare public health crisis discussion paper pres
ent ethical framework pandemic influenza planning ethical framework developed expertise c
linical organisational public health ethics validated stakeholder engagement process ethi
cal framework includes substantive procedural elements ethical pandemic influenza plannin
g incorporation ethics pandemic planning helped senior hospital administrators sponsoring
use stakeholders vet framework designing identifying decision review processes discuss me
rits limits applied ethical framework hospital decision making well robustness framework
summary need reflection ethical issues raised spectre pandemic influenza outbreak great e
fforts address normative aspects pandemic planning hospitals generated interest hospitals
governmental sector framework require evaluation refinement hope paper generate feedback
make even robust',
{'neg': 0.047, 'neu': 0.609, 'pos': 0.344, 'compound': 0.995},
0.047, 0.609, 0.344, 0.995, 'positive'], dtype=object)
In [79]:
metadata1.loc[metadata1['compound'].idxmin()].values
Out[79]:
array(['public awareness of risk factors for cancer among the japanese general population
: a population-based surveybackground: the present study aimed to provide information on
awareness of the attributable fraction of cancer causes among the japanese general popula
tion. methods: a nationwide representative sample of 2,000 japanese aged 20 or older was
asked about their perception and level of concern about various environmental and genetic
risk factors in relation to cancer prevention, as a part of an omnibus survey. interviews
were conducted with 1,355 subjects (609 men and 746 women). results: among 12 risk factor
candidates, the attributable fraction of cancer-causing viral and bacterial infection was
considered highest (51%), followed by that of tobacco smoking (43%), stress (39%), and en
docrine-disrupting chemicals (37%). on the other hand, the attributable fractions of canc
er by charred fish and meat (21%) and alcohol drinking (22%) were considered low compared
with other risk factor candidates. for most risk factors, attributable fraction responses
were higher in women than in men. as a whole, the subjects tended to respond with higher
values than those estimated by epidemiologic evidence in the west. the attributable fract
ion of cancer speculated to be genetically determined was 32%, while 36% of cancer was co
nsidered preventable by improving lifestyle. conclusion: our results suggest that awarene
ss of the attributable fraction of cancer causes in the japanese general population tends
to be dominated by cancer-causing infection, occupational exposure, air pollution and foo
d additives rather than major lifestyle factors such as diet.',
list(['public', 'awareness', 'risk', 'factors', 'cancer', 'among', 'japanese', 'ge
neral', 'population', 'population', 'based', 'surveybackground', 'present', 'study', 'aim
ed', 'provide', 'information', 'awareness', 'attributable', 'fraction', 'cancer', 'causes
', 'among', 'japanese', 'general', 'population', 'methods', 'nationwide', 'representative
', 'sample', '2', '000', 'japanese', 'aged', '20', 'older', 'asked', 'perception', 'level
', 'concern', 'various', 'environmental', 'genetic', 'risk', 'factors', 'relation', 'canc
er', 'prevention', 'part', 'omnibus', 'survey', 'interviews', 'conducted', '1', '355', 's
ubjects', '609', 'men', '746', 'women', 'results', 'among', '12', 'risk', 'factor', 'cand
idates', 'attributable', 'fraction', 'cancer', 'causing', 'viral', 'bacterial', 'infectio
n', 'considered', 'highest', '51', 'followed', 'tobacco', 'smoking', '43', 'stress', '39'
, 'endocrine', 'disrupting', 'chemicals', '37', 'hand', 'attributable', 'fractions', 'can
cer', 'charred', 'fish', 'meat', '21', 'alcohol', 'drinking', '22', 'considered', 'low',
'compared', 'risk', 'factor', 'candidates', 'risk', 'factors', 'attributable', 'fraction'
, 'responses', 'higher', 'women', 'men', 'whole', 'subjects', 'tended', 'respond', 'highe
r', 'values', 'estimated', 'epidemiologic', 'evidence', 'west', 'attributable', 'fraction
', 'cancer', 'speculated', 'genetically', 'determined', '32', '36', 'cancer', 'considered
', 'preventable', 'improving', 'lifestyle', 'conclusion', 'results', 'suggest', 'awarenes
s', 'attributable', 'fraction', 'cancer', 'causes', 'japanese', 'general', 'population',
'tends', 'dominated', 'cancer', 'causing', 'infection', 'occupational', 'exposure', 'air'
, 'pollution', 'food', 'additives', 'rather', 'major', 'lifestyle', 'factors', 'diet']),
'public awareness risk factors cancer among japanese general population population
based surveybackground present study aimed provide information awareness attributable fra
ction cancer causes among japanese general population methods nationwide representative s
ample 000 japanese aged older asked perception level concern various environmental geneti
c risk factors relation cancer prevention part omnibus survey interviews conducted 355 su
bjects 609 men 746 women results among risk factor candidates attributable fraction cance
r causing viral bacterial infection considered highest followed tobacco smoking stress en
docrine disrupting chemicals hand attributable fractions cancer charred fish meat alcohol
drinking considered low compared risk factor candidates risk factors attributable fractio
n responses higher women men whole subjects tended respond higher values estimated epidem
iologic evidence west attributable fraction cancer speculated genetically determined canc
er considered preventable improving lifestyle conclusion results suggest awareness attrib
utable fraction cancer causes japanese general population tends dominated cancer causing
infection occupational exposure air pollution food additives rather major lifestyle facto
rs diet',
'public awareness risk factors cancer among japanese general population population
based surveybackground present study aimed provide information awareness attributable fra
ction cancer causes among japanese general population methods nationwide representative s
ample 000 japanese aged older asked perception level concern various environmental geneti
c risk factors relation cancer prevention part omnibus survey interviews conducted 355 su
bjects 609 men 746 women results among risk factor candidates attributable fraction cance
r causing viral bacterial infection considered highest followed tobacco smoking stress en
docrine disrupting chemicals hand attributable fractions cancer charred fish meat alcohol
drinking considered low compared risk factor candidates risk factors attributable fractio
n responses higher women men whole subjects tended respond higher values estimated epidem
iologic evidence west attributable fraction cancer speculated genetically determined canc
er considered preventable improving lifestyle conclusion results suggest awareness attrib
utable fraction cancer causes japanese general population tends dominated cancer causing
infection occupational exposure air pollution food additives rather major lifestyle facto
rs diet',
{'neg': 0.282, 'neu': 0.661, 'pos': 0.057, 'compound': -0.9927},
0.282, 0.661, 0.057, -0.9927, 'negative'], dtype=object)
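Printing `.values` for a whole row dumps every intermediate text column, which makes the result hard to scan. A sketch of pulling out only the fields of interest, using compound scores reported in this notebook (the `text` values are shortened for illustration):

```python
import pandas as pd

# Compound scores reported in this notebook for three documents
scores = pd.DataFrame({
    "text": [
        "pandemic influenza preparedness ...",
        "public awareness of risk factors for cancer ...",
        "surfactant protein-d and pulmonary host defense ...",
    ],
    "compound": [0.995, -0.9927, 0.1531],
})

# idxmax/idxmin return index labels, which .loc turns back into rows
most_positive = scores.loc[scores["compound"].idxmax(), ["text", "compound"]]
most_negative = scores.loc[scores["compound"].idxmin(), ["text", "compound"]]
print(most_positive)
print(most_negative)
```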
Visualize data
In [80]:
# Number of publications per sentiment class
sns.countplot(y='sentiment',
data=metadata1,
palette =['#b2d8d8',"#008080", '#db3d13']
);
In [81]:
# Boxplot
sns.boxplot (y='compound',
x='sentiment',
palette =['#b2d8d8',"#008080", '#db3d13'],
data=metadata1);
In [83]:
# Lineplot
g = sns.lineplot(x=df ['year'], y=metadata1['compound'])
g.set(title='Sentiment of Titles')
g.set(xlabel="Time")
g.set(ylabel="Sentiment")
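One caveat for this plot: `df` still has 100 rows, while `metadata1` was reduced to 92 rows and re-indexed, so `df['year']` and `metadata1['compound']` no longer line up paper by paper. A sketch, on hypothetical data, of keeping `year` in the same frame through the row drop and then averaging the score per year:

```python
import pandas as pd

# Hypothetical frame: one paper has no abstract and is dropped
papers = pd.DataFrame({
    "year": [2004, 2005, 2005, 2006],
    "abstract": ["a", None, "c", "d"],
    "compound": [0.5, 0.0, -0.2, 0.4],
})

# Keep year in the same frame so it stays attached to its row through dropna/reset_index
kept = papers.dropna(subset=["abstract"]).reset_index(drop=True)

# Mean compound score per publication year, ready for a line plot
yearly = kept.groupby("year")["compound"].mean()
print(yearly)
```

In the notebook this would mean selecting `year` together with `title` and `abstract` when `metadata1` is created, and then plotting `year` against `compound` from that one frame.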
Thank You