
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JBHI.2020.3032479, IEEE Journal of Biomedical and Health Informatics.

Automatically Assessing Quality of Online Health Articles

Fariha Afsana, Student Member, IEEE, Muhammad Ashad Kabir, Member, IEEE, Naeemul Hassan, and Manoranjan Paul, Senior Member, IEEE

Abstract— Today, information on the World Wide Web is overwhelmed by an unprecedented quantity of data on versatile topics of varied quality. The quality of information disseminated in the field of medicine, in particular, has been questioned, as the negative health consequences of health misinformation can be life-threatening. There is currently no generic automated tool for evaluating the quality of online health information across a broad range of topics. To address this gap, in this paper we apply a data mining approach to automatically assess the quality of online health articles against 10 quality criteria. We have prepared a labelled dataset with 53,012 features and applied different feature selection methods to identify the best feature subset, with which our trained classifier achieved an accuracy of 84%–90% across the 10 criteria. Our semantic analysis of features shows the underpinning associations between the selected features and the assessment criteria, and further rationalizes our assessment approach. Our findings will help in identifying high quality health articles, thus aiding users in shaping their opinions and making the right choices when seeking health related help online.

Index Terms— Health articles, misinformation, quality assessment, data mining.

I. INTRODUCTION

THE tremendous advancement of digital technology and the widespread usage of the Internet have made information accessible worldwide. Consequently, a majority of people are turning to the Internet to search for a diverse range of health related information. According to a study by the Australian Institute of Health and Welfare, 78% of Australian adults were found to have searched for health-related information in 2015 [1]. However, the reliability of information from web sources is questionable due to the unregulated nature of the Internet.

In this era of the Internet, misinformation (dubious, low quality, fabricated information) disseminates like wildfire, much faster than the truth. A plethora of information from online health articles (OHA) and other sources (blogs, Facebook, Twitter, YouTube, etc.) is available to health information seekers. But not all of this information is reliable, as it stems from various individuals and organizations [2]–[4]. Hence, the task of distinguishing unreliable health information from reliable information poses substantial challenges for individuals [5]–[7].

The extensive spread of unreliable information can negatively affect public health. Wrong decisions based on misinformation force people to uphold erroneous beliefs and opinions instead of irrefutable evidence [8]. Sometimes, these types of articles fail to render the envisioned information to the reader. This may result in misinterpretation of a concept and may eventually trigger fear and incite one to change regular habits overnight. However, the online network is not going anywhere: seeking and sharing health information online will not stop, and misinformation will prevail as well [9]. For this reason, assessing and assuring the quality of health information on the World Wide Web has become a fundamental issue for users [10]. The better the quality of health information, the more reliable and accessible it is, and the more effective it will be in moulding users' behaviour towards health care.

In order to curb this situation, several approaches have been proposed to assess the quality of health related information. Among these, some approaches conducted the assessment manually and relied on users' perception to qualify a health news item. A number of studies [11], [12] estimated the quality of overall web sources rather than evaluating each article published in them. A few others [13]–[15] tried to evaluate the quality of articles published in a specific disease domain, which narrowed the scope of their work. Some studies (e.g., [16]) proposed evaluation criteria frameworks, and some tried to assess quality based on such a framework [9]. But in the case of criteria selection, a question always remains about the specific application to the medical domain, as criteria selection for health specific articles necessitates the involvement of health professionals. Moreover, given the ever changing landscape of the Internet, no universal framework for automatically assessing the quality of OHA has been proposed to date.

With this context in mind, this study attempts to automate the quality assessment process of OHA based on the ideas and effort of HealthNewsReview.org1. This organization manually evaluates health-related articles with a team comprising 50 experts from various disciplines including journalism, medicine, health services research, public health and patient perspectives. The performance of this organization is excellent, but it is not scalable in comparison to the speed of the information explosion worldwide. In this paper, we applied a data mining based approach to assess the quality of online health articles automatically. Our main contributions can be summarized as follows:

F. Afsana, M. A. Kabir and M. Paul are with the School of Computing and Mathematics, Charles Sturt University, NSW, Australia (e-mail: fafsana@csu.edu.au; akabir@csu.edu.au; mpaul@csu.edu.au).
N. Hassan is with the Philip Merrill College of Journalism, University of Maryland, USA (e-mail: nhassan@umd.edu).
Corresponding author: Muhammad Ashad Kabir
1 https://www.healthnewsreview.org


• We have developed a labelled dataset of health related news articles that were finely annotated by health experts from HealthNewsReview.org. So far, no generic health related dataset is available that is suitable for assessing the quality of OHA. Our dataset, once released, will be a valuable resource for the health and research communities for conducting future studies on the topic of misinformation in the field of medicine.
• We have explored multifaceted feature spaces through systematic content analysis to identify appropriate features for automating the quality assessment process. We have also keyed out criteria-wise discriminating features by analyzing feature importance.
• We have examined the applicability of various data mining techniques in assessing the quality of OHA automatically and achieved state-of-the-art performance on this task.
• We have provided an explanation of the feature subset corresponding to each criterion to justify the validity of the assessment.

II. RELATED WORK

The quality of online health related information has been a major concern since the dawn of the World Wide Web (WWW) era [17], [18]. Numerous tools have been developed to alleviate the quality measurement of health related information, most of which are based on a particular disease (e.g., cancer, diabetes, etc.) and lack robust validity and reliability testing. In [19], Keselman et al. conducted an exploratory study with a view to developing a methodological approach to analyze health related web pages and apply it to a set of relevant web pages. This qualitative study analysed webpages about natural treatment of diabetes to accentuate the challenges faced by consumers in seeking health information. It also underscored the importance of developing support tools so that this formative study could help users to seek, evaluate, and analyze information on the World Wide Web. We have summarized the relevant research along three categories.

A. Statistical Analysis Based Quality Assessment Approach

DISCERN [20], a short instrument, was developed for judging the quality of written consumer health information about treatment choices by producers, health professionals and patients, and for facilitating the production of high quality evidence-based patient information. The DISCERN approach was a combination of qualitative methods and a statistical measure of inter-rater agreement among an expert panel representing a range of expertise, including the production and use of consumer health information [21]. For establishing face and content validity, and inter-rater reliability, this approach administered questionnaires to information providers and self-help organizations. Later, the authors of [20] developed an explicit scheme for calculating a 5-star quality rating system for consumer health information based on DISCERN [22].

The Ensuring Quality Information for Patients (EQIP) tool [23] is another instrument to assess the presentation quality of all types of written health care information in a more rigorous way, and to prescribe the action that is required following the evaluation. The EQIP tool was demonstrated through several processes of item generation, testing for concurrent validity, inter-rater reliability, and utility using large diverse samples of written health care information.

The Quality Index for health related Media Reports (QIMR) was developed as an evaluation tool to monitor the quality of health research reporting in the lay media, more specifically for Canadian media. Themes from interviews with health journalists and researchers were used to develop QIMR [21]. However, the QIMR approach is limited in sample size and scope, and fails to evaluate the quality of news sources having content of varying quality.

In summary, the specific focus on treatment information or on particular media has narrowed the scope of these approaches and calls into question their applicability to online content about other aspects of health and illness. On the contrary, our approach is applicable to all health related information domains. Moreover, the existing approaches were conducted through manual labour, whereas ours is a fully automated system that assesses the quality of health articles in the shortest possible time.

B. Criteria Based Quality Assessment Approach

To date, there is no clear universal standard to assess the quality of web based health information [24]. Kim et al. conducted an extensive review to identify criteria that were already proposed or employed specifically for evaluating health related information worldwide [25]. Eysenbach et al. conducted a systematic review to compile criteria actually used to measure the quality of health information on the Web and synthesized evaluation results from studies containing quantitative data on structure and process [10]. Comparing the methodological frameworks of existing approaches, the authors concluded with the need for defining operational criteria for quality assessment. [2] is another systematic review in which the authors examined empirical studies on trust and credibility in the use of web-based health information (WHI), with an aim to identify factors that impact judgments of trustworthiness and credibility, and to explore the role of demographic factors affecting trust formation.

The Code of Conduct for medical websites (HONcode), initiated by the Health On the Net Foundation, was the first attempt to propose guidelines to information providers for raising the quality of medical and health information available on the World Wide Web [26]. Adopting a set of eight criteria to certify websites containing health information, its creators also developed a Health Website Evaluation Tool, which offered users an indication of the providers' commitment to quality.

There are several criteria-based assessment tools, and few of them have proper validation [27]. The Quality Evaluation Scoring Tool (QUEST) is the first quantitative tool that supports a broad range of health information and has undergone a validation process [16]. Based on a review of existing tools [13], [28], QUEST quantitatively measures six criteria: authorship, attribution, conflicts of interest, currency, complementarity and tone, and it can be used by health care professionals and researchers alike.


QUEST's reliability and validity were demonstrated by evaluating online articles on Alzheimer's disease. In a Fuzzy VIKOR based approach, Afful-Dadzie et al. [9] proposed a new criteria framework for measuring the quality of information provided by each site. The authors demonstrated a decision making model showing how online health information providers could be assessed and ranked based on their quality.

C. Machine Learning Based Analysis and Miscellaneous

Apart from the aforementioned approaches, there are a few more studies which are not directly aligned with our research but provide us with valuable insights.

In [29], the authors developed a new labelled dataset of misinformative and non-misinformative comments from a medical health forum, MedHelp, with a view to creating a resource for medical research communities to study the spread of medical misinformation. A preliminary feature analysis of the dataset was also presented towards developing a real-time automated system for identifying and classifying medical misinformation in online forums.

An applied machine learning based approach is proposed in [30], where the authors addressed the veracity of online health information by automating systematic approaches in conjunction with Evidence-Based Medicine (EBM). Based on EBM and trusted medical information sources, the authors proposed an algorithm, MedFact, which recommends trusted medical information within health related social media and empowers online users to determine the veracity of health information using machine learning techniques. Their aim was to address the factual accuracy of online health information in social media discourse based on keyword extraction, whereas our objective is to evaluate the quality of online health related articles from a data mining perspective. We have focused on identifying the discriminating features of health related articles for assessing their quality in an automatic manner.

Ghenai et al. [31] proposed a tool for tracking misinformation around health concerns on Twitter, based on a case study about Zika. The tool discovered health related rumours in social media by incorporating professional health experts through crowdsourcing for annotating the dataset and machine learning for rumour classification. Our aim is different from this study: rather than focusing on health related rumours, we focus on all types of health related articles available online to evaluate their quality, so that people can identify which articles to read and which to avoid for decision making.

A recent study by Dhoju et al. [11] identified structural, topical and semantic differences between health related news articles from reliable and unreliable media by conducting a systematic content analysis. By leveraging a large-scale dataset, the authors successfully identified some discriminating features which separate reliable health news from unreliable ones.

However, our study is quite different from these existing methodologies. Our aim is to automate the quality assessment process of health related articles using data mining, which, to the best of our knowledge, has not been examined so far. Thus, in this paper, we focus on using finely tuned health articles manually annotated by a group of experts to examine the performance of an automated quality assessment approach using data mining.

III. DATASET DESCRIPTION

There is currently no single dataset for assessing the quality of online health articles (OHA). For this study, we have prepared a dataset based on 1720 health-related articles from HealthNewsReview.org. The mission of this website is to introduce a significant step towards meaningful health care reform by evaluating the accuracy of medical news and examining the quality of evidence they provide. Since its foundation in 2006, HealthNewsReview.org has provided reviews of health news reporting from major U.S. news organizations, conducted by a multi-disciplinary team of reviewers from the journalism, medicine, health services research and public health domains.

According to the editorial team of HealthNewsReview.org, all stories and press news releases about public health interventions should be evaluated against ten different criteria to ensure the quality of information in terms of accuracy, balance and completeness. The organization proposed these ten criteria based on an analysis of previous studies combined with viewpoints from health care journalism2, as a standard for judging the quality of health articles. These criteria address the basic issues that a consumer should know when developing an opinion on a health related intervention and how/whether it matters in their life. Below we provide a list of those criteria3.
• Criterion 1: Does the story adequately discuss the costs of the intervention?
• Criterion 2: Does the story adequately quantify the benefits of the intervention?
• Criterion 3: Does the story adequately explain/quantify the harms of the intervention?
• Criterion 4: Does the story seem to grasp the quality of the evidence?
• Criterion 5: Does the story commit disease-mongering?
• Criterion 6: Does the story use independent sources and identify conflicts of interest?
• Criterion 7: Does the story compare the new approach with existing alternatives?
• Criterion 8: Does the story establish the availability of the treatment/test/product/procedure?
• Criterion 9: Does the story establish the true novelty of the approach?
• Criterion 10: Does the story appear to rely solely or largely on a news release?

On HealthNewsReview.org, for each published health news article, a group of experts reviews and justifies each of the above criteria with a 'Satisfactory' or 'Not Satisfactory' score based on quality. In some cases, a criterion is rated as 'Not Applicable' when it is impossible or unreasonable for an article to address it.

2 https://healthjournalism.org/secondarypage-details.php?id=56
3 https://www.healthnewsreview.org/about-us/review-criteria/


The total scores are translated into a star rating: 0, 1, 2, 3, 4 and 5 stars for the percentage of criteria judged satisfactory being 0%, 1%–20%, 21%–40%, 41%–60%, 61%–80% and 81%–100%, respectively. Among all ten criteria, Criterion 4 is the most important one, as this criterion has the highest 'Satisfactory' rate among all the 4- and 5-star articles and the lowest 'Not Satisfactory' rate among all the 0- and 1-star articles.

The characteristics of the ten defined criteria, grounded in the standards of health reporting and covering the basic points that serve the interests of the public, have convinced us to adopt this set as the standard for evaluating health related articles. Our aim is to automate this quality assessment process using data mining.

A. Data Collection

To collect data from HealthNewsReview.org we have created a GUI app using the C#.NET framework. Since the website has no API, we have created our scraper using HTML Agility Pack4, a free and open source tool to extract data from websites, and stored the data in an MS SQL database. For each review, we gathered the title of the original news, the corresponding link to the original news, the category, and the criteria-wise scores of 'Satisfactory', 'Not Satisfactory' or 'Not Applicable'. We have collected all reviewed stories from 2006 to 2018 and all reviewed press news releases from 2015 to 2018 from the website, and removed duplicates, as the same story may coexist under different categories. Each source URL is then accessed using Newspaper3k5, a Python 3 library for extracting and curating articles.

4 https://html-agility-pack.net
5 https://newspaper.readthedocs.io/en/latest/
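As an illustration of this second stage, the sketch below shows how one reviewed story can be fetched with Newspaper3k; review_urls is a hypothetical list of source links already gathered by the scraper, not a name used in our implementation.

```python
from newspaper import Article  # pip install newspaper3k

def fetch_article(url):
    """Download one reviewed story and return its headline and body text."""
    article = Article(url)
    article.download()
    article.parse()
    return article.title, article.text

# corpus = [fetch_article(url) for url in review_urls]
```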
Overall, our dataset consists of three class labels, Satisfactory, Not Satisfactory and Not Applicable, for each of the ten criteria. Figure 1 shows the criteria-wise distribution of class labels over the 1720-article corpus.

[Figure] Fig. 1: Criteria-wise class distribution over the entire corpus. Axes: number of WHA (Satisfactory, Not Satisfactory, Not Applicable) for Criteria 1–10.

As the number of observations belonging to the Not Applicable class is significantly lower than that of the other two classes for every criterion, we have omitted this class value for our initial study.

IV. FEATURE ENGINEERING

In this section, we explain the data pre-processing, feature extraction and feature selection processes used to establish the baseline performance of our approach. All our data pre-processing and feature extraction have been conducted using Python and some other useful libraries, e.g., scikit-learn6 and NLTK7.

6 https://scikit-learn.org/stable
7 https://www.nltk.org

A. Data Pre-Processing

Certain refinement of the raw data is essential for removing irrelevant information and reducing the size of the actual data [32]. To enhance the accuracy and performance of our classification model, we have run step-by-step data pre-processing tasks on each article. The following three refinement steps are adopted:

1) Contraction Expansion: Contractions are shortened versions of words or syllables which pose problems in text analytics. To help standardize the text with the original form of words, each contraction has been expanded to its main form. For instance, the expanded forms of 'i'd' and 'you've' become 'I would' and 'you have' respectively.

2) Noise Removal: Noise removal is one of the most important text pre-processing steps. Usually, URLs, special characters and symbols add extra noise to unstructured text. We applied punctuation removal, special character removal, HTML formatting removal and number removal to get rid of this noise. Because of their little significance in the corpus, we removed stop words (words like a, the, is, me, etc.) as well.

3) Word Normalization: In text analytics, tokenization of a document is required for identifying meaningful keywords. Apart from tokenizing documents, stemming and lemmatization have also been used for reducing inflectional forms of words (connect, connected, connection, etc.) and derivationally related forms of words to a common base form.

The remaining chunks of cleaned text data are then fed into feature extraction.
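A minimal sketch of these three refinement steps is given below, assuming NLTK's punkt, stopwords and wordnet resources have been downloaded; the contraction map is an illustrative subset, not the full map used in our pipeline.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Illustrative subset of a contraction map.
CONTRACTIONS = {"i'd": "i would", "you've": "you have", "isn't": "is not"}

STOP_WORDS = set(stopwords.words("english"))
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

def preprocess(text):
    text = text.lower()
    # Step 1: contraction expansion.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Step 2: noise removal (URLs, HTML tags, punctuation, numbers), then stop words.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in word_tokenize(text) if t not in STOP_WORDS]
    # Step 3: word normalization via lemmatization followed by stemming.
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]
```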


B. Feature Extraction

Multiple categories of features have been extracted for classifying the criteria. For model construction, we have keyed out the features which might help in predicting the classes. The complete set of extracted features, with their corresponding descriptions, is depicted in Table I.

TABLE I: List of extracted features with brief description

Scope | Feature Name | Description | Feature Number | Output Type
Linguistic measure | LIWC | Measures textual features | 93 | Real
Word frequency | TF-IDF | Measures the importance of a word in a document | 4000 | Real
Word-category disambiguation | POS Tag | Counts the number of each part of speech in a document | 35 | Integer
Word-category disambiguation | POS Word | Defines the part of speech of each word separately and then counts its number within the document | 47450 | Integer
Citation and Ranking | Internal Links | Defines the number of self-citations | 1 | Integer
Citation and Ranking | External Links | Defines the number of external citations | 1 | Integer
Citation and Ranking | Alexa Rating | Ranks every document having a link according to its Alexa rating | 1428 | Real
Similarity measure | Cosine similarity | Measures the relation between headline and body | 1 | Real
Miscellaneous | Normalized distinct word count | Measures how many distinct words were used in the text | 1 | Real
Miscellaneous | Per num count | Counts the number of persons mentioned in the text | 1 | Integer
Miscellaneous | Org num count | Counts the number of organizations mentioned in the text | 1 | Integer

1) Linguistic Inquiry and Word Count (LIWC): To obtain a wide variety of psychological and linguistic features, we apply LIWC2015 [33], a transparent text analysis program that scores words in psychologically meaningful categories, to the original news texts in our dataset. LIWC calculates the following dimensions:
• Summary dimension (8 features; e.g., word count, words per sentence)
• Punctuation marks (12 features; e.g., comma, colon, quote)
• Function words (15 features; e.g., pronoun, article, conjunction)
• Perceptual process (4 features; e.g., see, hear)
• Biological process (5 features; e.g., body, health)
• Drives (6 features; e.g., reward, risk, power)
• Other grammar (6 features; e.g., interrogatives, numbers)
• Time orientation (3 features; e.g., past, present, future)
• Relativity (4 features; e.g., motion, time)
• Affect (6 features; positive emotion, negative emotion (e.g., anger))
• Personal concerns (6 features; e.g., work, leisure, money)
• Social (5 features; e.g., family, friend)
• Informal language (6 features; e.g., filler, swear)
• Cognitive process (7 features; e.g., differ, insight)

2) Term Frequency and Inverse Document Frequency: We have used this weighting metric to measure the importance of a term in a document within the entire dataset [32]. Term Frequency (TF) quantifies the frequency of a word in a particular document. On the contrary, Inverse Document Frequency (IDF) measures the importance of a term within the corpus. Let D symbolize the whole corpus with N documents. If n(t)_d denotes the number of times term t appears in a document d, then TF, denoted by TF(t)_d, can be calculated by equation (1):

TF(t)_d = \frac{n(t)_d}{\sum_{t' \in d} n(t')_d},  (1)

and IDF, denoted by IDF(t)_D, can be calculated by equation (2):

IDF(t)_D = 1 + \log\left(\frac{N}{|\{d \in D : t \in d\}|}\right),  (2)

For a particular term, the product of TF and IDF represents the TF-IDF weight of that term. The higher the TF-IDF score of a term, the more frequent it is in the document and the rarer it is across the corpus. We applied a unigram tokenizer to our text and eliminated features with extremely low as well as extremely high frequency to achieve better accuracy [34], [35]. Since too frequent or too rare words are not influential in characterizing an article, we ignored all words that appeared in more than 90% of the documents or in fewer than 3 documents. Again, to keep the dimensionality of our feature set at a manageable size, we capped the feature count at the top 4000 terms by frequency.
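These settings map directly onto scikit-learn's TfidfVectorizer, as sketched below; note that scikit-learn's smoothed IDF differs slightly from equation (2), and cleaned_articles is a placeholder for the pre-processed corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams only; drop terms in >90% of documents or <3 documents;
# keep the 4000 most frequent of the remaining terms.
vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_df=0.9, min_df=3,
                             max_features=4000)
X_tfidf = vectorizer.fit_transform(cleaned_articles)  # (n_documents, 4000) sparse matrix
```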


3) Part Of Speech Tagging: Part Of Speech Tagging (POST), also known as word-category disambiguation, annotates each word with the appropriate part of speech based on both its definition and its context, to resolve lexical ambiguity [36]. To recognize POS tags, we have applied the Stanford POS tagger8. We found a tagset of 35 part-of-speech tags (e.g., CC, CD, NP, RBR, etc.) in the corpus, from which we derived two sets of features: POS tag count and POSWord count. For the POS tag count, we measured the document-wise count of words belonging to a particular POS tag, and thus obtained 35 individual features. For the POSWord count, on the other hand, we measured the count of the tag associated with each individual word within a document, and obtained 47,451 non-overlapping features. POSWord features are capable of performing rudimentary word sense disambiguation in situations where a word can represent several meanings.

8 https://nlp.stanford.edu/software/tagger.html

4) Citation and Ranking: We analysed the presence of hyperlinks to determine the credibility of an article. We extracted three features – internal links, external links and rank – from the link attributes. We counted the number of internal links to inspect the amount of self-citation occurring in a document, so that we can predict some bias in it. Conversely, the number of external links was counted to capture the citation network of an article. We derived the rank attribute to gauge the quality of an article by measuring the authority of the webpages that the article cites. We considered the Alexa Global Ranking9 as an indicator of a webpage's authority, as it gives an estimate of a website's popularity. We counted outgoing links from all the documents within the corpus and found 1428 distinct domains. We replaced each domain with its associated Alexa rank value, and thus obtained 1428 distinct rank features for the overall corpus.

9 https://www.alexa.com/siteinfo

5) Similarity Measure: An ambiguous or misleading headline can degrade the quality of an article. So, to measure the relevance between the headline-body pair of each article, we used the TF-IDF cosine similarity metric to extract a similarity feature [37]. It quantifies the similarity between the headline and body of a document irrespective of their sizes, by measuring the cosine of the angle between the two vectors projected in a multi-dimensional space.
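The sketch below is one plausible implementation of this feature; fitting the vectorizer on the pair keeps the headline and body in a shared vocabulary (an assumption, since the exact setup is not spelled out above).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def headline_body_similarity(headline, body):
    # TF-IDF vectors over a shared vocabulary, then the cosine of the angle
    # between them: 1.0 means identical term distributions, 0.0 no overlap.
    vectors = TfidfVectorizer().fit_transform([headline, body])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])
```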
6) Miscellaneous: We quantified the normalized distinct word count as a feature to determine how much rare words contribute to the classification problem, as health related articles comprise many different medical terms. We have also counted the numbers of organizations and persons mentioned in the articles to predict bias. We used the Stanford Named Entity Recognizer (NER)10 to extract these features.

10 https://nlp.stanford.edu/software/CRF-NER.html

C. Feature Selection

We are aiming to predict ten different criteria using numerous features (a total of 53,012), some of which might be redundant or irrelevant for making predictions. A dataset containing irrelevant features can result in over-fitting, and can mislead the modelling power of a method. Thus, it is critically important to select the most relevant features from the feature set. In order to select the features that contribute most to our classification task, we have employed three different automatic feature selection techniques. First, correlation-based attribute evaluation (Co_AE-PC), which evaluates the worth of a feature by measuring Pearson's correlation between it and the class. Second, classifier-based attribute evaluation (Cl_AE-LR), which evaluates the worth of a feature using a Logistic Regression classifier. Third, classifier-based attribute evaluation (Cl_AE-RF), which evaluates the worth of a feature using a Random Forest classifier. For each of the above three attribute evaluators, a rank search was performed, which ranks features by their individual evaluations to find the most correlated feature set.
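These evaluators are Weka components; purely as an illustration of the first of them, a correlation-based ranker can be sketched in Python as follows. This is a hedged analogue under our own assumptions, not the exact Weka implementation.

```python
import numpy as np

def rank_by_pearson(X, y, k=4000):
    """Return the indices of the k features most |Pearson|-correlated with the class.

    X: (n_samples, n_features) dense feature matrix; y: binary labels (0/1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each feature column
    yc = y - y.mean()                # center the labels
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0.0, 1.0, denom)
    return np.argsort(corr)[::-1][:k]
```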
V. EXPERIMENTAL EVALUATION

The core contribution of our work is to assess the quality of online health articles automatically by applying various data mining techniques. In this section, we quantify and evaluate the performance of a number of classification techniques, for different feature selection methods and variable feature sizes, to achieve the best result. We have used the WEKA tool [38] in our experimental evaluation.

A. Evaluate Classification Techniques

We have experimented with four prominent classification techniques on our dataset and report their results. We have performed binary-class (Satisfactory and Not Satisfactory) classification using three supervised learning methods and one ensemble method to obtain better accuracy in assessing the quality of OHA. First, the Support Vector Machine (SVM) algorithm, which uses the kernel trick to implicitly map inputs into a high-dimensional feature space [39]; we have used PolyKernel as the kernel to control the projection and the amount of flexibility in separating classes in our dataset. Second, the Naive Bayes classification algorithm, which calculates the posterior probability for each class using a simple implementation of Bayes' theorem and predicts the class with the highest probability; for each numerical attribute, a Gaussian distribution is assumed by default [40]. Third, the Random Forest classifier, which constructs a multitude of decision trees at training time and merges them together to get a more accurate and stable prediction [41]; we considered 100 trees for the Random Forest implementation. Fourth, the EnsembleVoteClassifier, a meta-classifier that combines similar or conceptually different machine learning classifiers for classification via majority voting; we combined the three aforementioned classifiers to build our ensemble estimator and examined its performance on our dataset. All methods were evaluated by 10-fold cross-validation, where in each fold 90% of the dataset was used for training and 10% for testing. Various combinations of the extracted features have been tried to evaluate how accurately our approach can automatically classify each criterion.
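We ran these models in Weka, so the following scikit-learn/mlxtend sketch is only an approximate restatement of the setup, assuming a selected feature matrix X_selected and a label vector y from the previous steps.

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from mlxtend.classifier import EnsembleVoteClassifier

svm = SVC(kernel="poly")                        # polynomial kernel, mirroring Weka's PolyKernel
nb = GaussianNB()                               # Gaussian distribution per numeric attribute
rf = RandomForestClassifier(n_estimators=100)   # 100 trees
ensemble = EnsembleVoteClassifier(clfs=[svm, nb, rf], voting="hard")

# 10-fold cross-validation: each fold trains on 90% and tests on 10%.
scores = cross_val_score(ensemble, X_selected, y, cv=10)
print(scores.mean())
```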


B. Identify Feature Selection Method and Feature Size

To identify the feature selection method and the feature size that yield the best classification accuracy on our dataset, we have experimented with the impact of different feature selection methods and varied feature sizes on classification accuracy.

1) Identify Feature Selection Method: We ran the three feature selection methods on our feature space, with the goal of determining which method performs best by selecting the feature subset that gives the best classification performance. Table II presents the outcomes of the comparative study of the three feature selection methods over four different classifiers (SVM, Naive Bayes, Random Forest and EnsembleVote), carried out against feature subsets of size 4000. Here, we present the weighted Precision (W_P), weighted Recall (W_R) and weighted F-Measure (W_F) from the Weka output, as these give a better estimate of overall classification performance [42]. Weka [38] calculates the weighted average by averaging over the classes, weighted by the proportion of elements in each class. So, for our binary class problem, W_P, W_R and W_F are calculated from equations (3), (4) and (5) respectively:

W_P = \frac{(P_{CS} \times |CS|) + (P_{CNS} \times |CNS|)}{|CS| + |CNS|},  (3)

W_R = \frac{(R_{CS} \times |CS|) + (R_{CNS} \times |CNS|)}{|CS| + |CNS|},  (4)

W_F = \frac{(F_{CS} \times |CS|) + (F_{CNS} \times |CNS|)}{|CS| + |CNS|},  (5)

where P_CS and P_CNS are the Precisions for the classes 'Satisfactory' and 'Not Satisfactory'; R_CS and R_CNS are the Recalls for the classes 'Satisfactory' and 'Not Satisfactory'; F_CS and F_CNS are the F-Measures for the classes 'Satisfactory' and 'Not Satisfactory'; and |CS| and |CNS| are the numbers of instances in the classes 'Satisfactory' and 'Not Satisfactory' respectively.
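For readers replicating this outside Weka, the same support-weighted averages correspond to scikit-learn's "weighted" averaging, assuming arrays y_true and y_pred of gold and predicted labels.

```python
from sklearn.metrics import precision_recall_fscore_support

# Support-weighted precision, recall and F-measure, as in equations (3)-(5).
w_p, w_r, w_f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
```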
For all criteria, SVM clearly performed best among the four classifiers. We also observe from Table II that, for all criteria except criteria 3 and 5, the Pearson's correlation feature selection method (Co_AE-PC) performs best for the SVM classifier. For criteria 3 and 5, the Logistic Regression feature selection performs slightly better than the Pearson's correlation method.

For all of the criteria except criteria 6 and 7, we observed that the Random Forest classifier always misclassified the minority class (e.g., for criterion 1, the minority class was 'Satisfactory') into the majority class (e.g., for criterion 1, the majority class was 'Not Satisfactory'), which results in a drop in the recall value for the minority class. This happened due to the imbalanced class distribution of our dataset (see Fig. 1). In our dataset, criteria 6 and 7 are very close to balanced, and thus the Random Forest classifier performed moderately on them.

2) Identify Feature Size: We have also varied the feature sizes to see how this impacts the classification performance. In this part of the experiment, we used the Pearson's correlation feature selection method (Co_AE-PC) with the SVM classifier, as we found them best for our classification problem (see Table II). Figure 2 shows the performance of the SVM classifier combined with the Co_AE-PC feature set under various feature sizes – 1000, 2000, 3000, 4000, 5000, 10000, and 53012.

[Figure] Fig. 2: Classification performance of SVM with the Co_AE-PC feature selection method for varying feature sizes (1000 to 53012 (all features)). Axes: F-Measure (weighted average) vs. feature size (log scale); one curve per criterion (1–10).

We observe from Figure 2 that, for all criteria, the feature set comprising all features (53,012 in total) performs worse due to irrelevant and redundant features, and that there is an improvement in performance with the reduced feature subsets. For criterion 1, we achieved 90% accuracy in terms of F-measure for feature size 3000. For criteria 2, 4 and 7, we achieved 85% accuracy for feature size 5000. For criterion 3, we achieved 84% accuracy for feature size 5000. Criterion 6 achieved 86% accuracy for the 1000-sized feature set. For criteria 8 and 9, 87% and 86% accuracy were achieved for feature sizes 4000 and 3000 respectively. For the remaining two criteria (5 and 10), we noticed that the performance curves are a bit different from the other curves because of their imbalanced dataset nature. Criterion 5 begins with the highest accuracy (90%) at feature size 1000, and its performance varies with feature size. Criterion 10 achieved its highest accuracy (88%) for feature size 5000. Overall, all reduced feature subsets (of varied sizes) achieved at least 80% accuracy with our explored feature combinations.

C. Class Balancing

We noticed some class imbalance in our dataset. As we found that criteria 5 and 10 have the most imbalanced class distributions, we have combated this class imbalance problem by adopting three class balancing techniques: under-sampling, over-sampling and the Synthetic Minority Over-sampling Technique (SMOTE). As over-sampling duplicates the minority class instances, it can lead to model over-fitting. Similarly, under-sampling can degrade performance if it leaves out important instances while cutting down. Thus, we also experimented on our dataset with SMOTE, which generates synthetic samples of the minority class rather than using duplicates. However, SMOTE still does not fully prevent over-fitting, as it generates the synthetic data from existing data points.
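As a sketch of how these three balancing strategies can be applied, using the imbalanced-learn library (our assumption, not the tooling named above):

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

# X, y: feature matrix and labels for an imbalanced criterion (e.g., 5 or 10).
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
X_over,  y_over  = RandomOverSampler(random_state=0).fit_resample(X, y)
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)  # synthetic minority samples
```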
Figure 3 shows the performance comparison of these three methods in terms of Receiver Operating Characteristic (ROC) curves. It is observed that all of the sampling techniques performed better than the imbalanced dataset. With SMOTE, we obtained 99% and 98% accuracy for criteria 5 and 10, respectively.

[Figure] Fig. 3: ROC curves of the balanced and imbalanced classes for feature size 5000: (a) Criterion 5, (b) Criterion 10. Axes: True Positive Rate vs. False Positive Rate; curves for imbalanced data, over-sampling, SMOTE and under-sampling.

VI. SEMANTIC ANALYSIS OF FEATURES

In this section, we semantically analyze the correlations between each criterion and its corresponding most significant feature set, to show how the feature set justifies our assessment of that criterion.

To analyze each criterion, we have itemized the top 16 most discriminating features by combining the results found from the Pearson's correlation, Logistic Regression and Random Forest feature selection algorithms. The top 16 feature lists are presented in Table III. Overall, POSWord count and TF-IDF features are found to be the most significant, and the other features vary from criterion to criterion. The insights gained from examining the relevant features are as follows.

As criterion 1 is about coverage of the costs of an intervention, it is intuitive to expect features associated with money, cost, price, dollars, amounts of dollars (thousand, hundred), insurance, etc.; each of these is found among the top discriminating features in this study.

TABLE II: Comparative study of the three feature selection methods over four classifiers (feature size: 4000); bold numbers indicate the highest weighted F-Measure for each criterion.

Criterion | FS Method | SVM (W_P W_R W_F) | Random Forest (W_P W_R W_F) | Naive Bayes (W_P W_R W_F) | Ensemble (W_P W_R W_F)
1 | Co_AE-PC | 0.901 0.903 0.899 | 0.794 0.786 0.718 | 0.861 0.756 0.774 | 0.868 0.803 0.816
1 | Cl_AE-LR | 0.887 0.888 0.881 | 0.826 0.786 0.711 | 0.813 0.684 0.708 | 0.859 0.827 0.836
1 | Cl_AE-RF | 0.877 0.881 0.875 | 0.831 0.792 0.724 | 0.792 0.637 0.666 | 0.848 0.813 0.823
2 | Co_AE-PC | 0.855 0.857 0.854 | 0.739 0.707 0.621 | 0.793 0.721 0.730 | 0.799 0.738 0.747
2 | Cl_AE-LR | 0.826 0.828 0.821 | 0.765 0.707 0.615 | 0.766 0.686 0.696 | 0.789 0.739 0.747
2 | Cl_AE-RF | 0.795 0.801 0.794 | 0.751 0.703 0.610 | 0.691 0.567 0.587 | 0.729 0.674 0.685
3 | Co_AE-PC | 0.835 0.837 0.832 | 0.752 0.720 0.655 | 0.790 0.712 0.720 | 0.793 0.727 0.735
3 | Cl_AE-LR | 0.851 0.849 0.841 | 0.779 0.722 0.651 | 0.764 0.698 0.707 | 0.773 0.727 0.735
3 | Cl_AE-RF | 0.804 0.808 0.803 | 0.780 0.721 0.648 | 0.704 0.612 0.624 | 0.729 0.674 0.685
4 | Co_AE-PC | 0.847 0.848 0.846 | 0.707 0.695 0.635 | 0.783 0.713 0.718 | 0.790 0.728 0.733
4 | Cl_AE-LR | 0.824 0.824 0.818 | 0.746 0.693 0.615 | 0.758 0.689 0.694 | 0.769 0.720 0.726
4 | Cl_AE-RF | 0.779 0.783 0.778 | 0.741 0.692 0.615 | 0.680 0.583 0.592 | 0.704 0.643 0.650
5 | Co_AE-PC | 0.894 0.887 0.843 | 0.769 0.877 0.819 | 0.864 0.722 0.767 | 0.873 0.890 0.863
5 | Cl_AE-LR | 0.888 0.901 0.888 | 0.769 0.877 0.819 | 0.829 0.757 0.786 | 0.869 0.887 0.873
5 | Cl_AE-RF | 0.856 0.880 0.862 | 0.892 0.877 0.820 | 0.805 0.707 0.748 | 0.845 0.873 0.852
6 | Co_AE-PC | 0.835 0.835 0.835 | 0.688 0.688 0.688 | 0.754 0.744 0.742 | 0.747 0.737 0.735
6 | Cl_AE-LR | 0.733 0.733 0.732 | 0.678 0.676 0.674 | 0.689 0.644 0.623 | 0.693 0.648 0.628
6 | Cl_AE-RF | 0.719 0.719 0.719 | 0.687 0.686 0.685 | 0.665 0.636 0.621 | 0.667 0.638 0.624
7 | Co_AE-PC | 0.855 0.854 0.854 | 0.669 0.666 0.663 | 0.737 0.733 0.732 | 0.747 0.737 0.735
7 | Cl_AE-LR | 0.716 0.715 0.715 | 0.678 0.666 0.659 | 0.669 0.630 0.611 | 0.669 0.630 0.611
7 | Cl_AE-RF | 0.690 0.689 0.688 | 0.644 0.638 0.632 | 0.627 0.607 0.595 | 0.627 0.607 0.595
8 | Co_AE-PC | 0.875 0.876 0.870 | 0.768 0.737 0.638 | 0.813 0.777 0.786 | 0.813 0.781 0.789
8 | Cl_AE-LR | 0.814 0.821 0.807 | 0.780 0.737 0.638 | 0.710 0.642 0.661 | 0.737 0.712 0.721
8 | Cl_AE-RF | 0.762 0.775 0.765 | 0.750 0.736 0.639 | 0.657 0.523 0.563 | 0.718 0.699 0.706
9 | Co_AE-PC | 0.867 0.869 0.867 | 0.695 0.698 0.607 | 0.782 0.765 0.770 | 0.782 0.765 0.770
9 | Cl_AE-LR | 0.827 0.826 0.816 | 0.754 0.710 0.621 | 0.689 0.608 0.621 | 0.727 0.689 0.699
9 | Cl_AE-RF | 0.769 0.777 0.769 | 0.752 0.704 0.608 | 0.620 0.477 0.507 | 0.661 0.610 0.623
10 | Co_AE-PC | 0.880 0.886 0.878 | 0.815 0.805 0.722 | 0.848 0.765 0.787 | 0.854 0.794 0.811
10 | Cl_AE-LR | 0.867 0.875 0.865 | 0.777 0.804 0.721 | 0.779 0.689 0.717 | 0.822 0.814 0.818
10 | Cl_AE-RF | 0.828 0.842 0.830 | 0.783 0.806 0.730 | 0.757 0.632 0.679 | 0.796 0.793 0.795
Legend: FS – Feature Selection; W_P – Weighted Precision; W_R – Weighted Recall; W_F – Weighted F-Measure.

The inclusion of absolute numbers in quantifying benefits gives readers a better sense of understanding of an intervention. For example, the sentence 'New drug reduces heart failure risk by half' can give the reader a rosy impression of the intervention; but if the sentence instead read 'a 4% risk dropping to a 2% risk' (the same halving of risk), it would sound less dramatic while giving the reader a clearer idea. From the top selected feature subset for criterion 2, we find the TF-IDF feature 'percent' and the LIWC feature 'number' to be pertinent to the usage of absolute numbers in an article. Besides, the TF-IDF features 'compar' and 'trust', and the LIWC features 'differ' and 'quant', are also relevant in explaining benefits.

When reading a story about a new intervention, one expects an explanation of the potential harms and side effects of the intervention. In our feature set, we have found the POSWord features 'risks', 'cause', 'nausea' (a common side effect of drugs), 'side' and 'died', the TF-IDF features 'common', 'effect' and 'high', and the LIWC features 'negate' and 'tentat' quite meaningful in describing criterion 3.

In order to grasp the quality of the evidence, a story needs to present an elaborate explanation of the study (source, size, type, limitations, etc.) behind it. For example, a report published in The Wall Street Journal on an Ebola vaccine stated that this was the 'first placebo-controlled study of two vaccines against the Ebola virus' and mentioned its shortcomings as well, and the reviewers at HealthNewsReview.org rated it as 'Satisfactory' for criterion 4. Our feature subset consists of the POSWord features 'randomly', 'assigned', 'placebo', 'study', 'evidence' and 'group', and the TF-IDF feature 'random-assign', which are stringently aligned with this criterion in describing evidentiary details.

It is a matter of judgement to identify disease mongering. From the feature subset found for criterion 5, we can relate the TF-IDF features 'dry-eye', 'suffer', 'need' and 'inform', and the POSWord features 'revealed' and 'excessive', which directly relate to inflating the seriousness of a condition. For example, using rating scales to diagnose chronic dry-eye is simply an exaggeration of a common disorder. As only 19% of the articles in our dataset were rated 'Not Satisfactory' on this criterion, we found fewer well-aligned extracted features to define disease mongering.

According to criterion 6, independent experts should be included in news stories about health care interventions, and conflicts of interest in the people who are quoted should be explored and disclosed. In order to explore this criterion, we defined a new feature, 'per_ner_count', to count the number of persons referred to in a document, and we found this feature to be the most relevant for describing this criterion. The same is the case with the feature 'org_ner_count', which gives us the count of organizations cited in a document. Apart from these, the POSWord features 'university', 'that', 'said', 'national', 'professor', 'study' and 'involve' are also aligned with this criterion.

As criterion 7 is about comparing a new intervention with existing alternatives, it is usual for a document to contain comparison words. From our feature subset, we found the LIWC feature 'differ' and the POSWord features 'than', 'not' and 'but' to be the most relevant for describing this criterion.


TABLE III: Most discriminating features [Criteria 1-10]

Cr | Most Correlated Features (Top 16)
1 | Money?, cost_NN‡, costs_VBZ‡, insurance_NN‡, costs_NNS‡, But_CC‡, price_NN‡, negate?, verb?, not_RB‡, dollars_NNS‡, covered_VBN‡, thousand dollar*, pay_VB‡, cost_VB‡, tag_NN‡
2 | percent_NN‡, quant?, were_VBD‡, compared_VBN‡, england journal*, group_NN‡, differ?, new england*, year percent*, Reuters_NNP‡, standardsth thomson*, trust principl*, compar percent*, reuter trust*, percent percent*, journal medicin*
3 | not_RB‡, should_MD‡, WC?, negate?, differ?, effects_NNS‡, some_DT‡, cause_VB‡, risks_NNS‡, tentat?, have_VB‡, nausea_NN‡, external link⊗, common effect*, effect includ*, side_JJ‡, high dos*, died_VBD‡
4 | study_NN‡, normalized_distinct_word_count⊗, not_RB‡, randomly_RB‡, differ?, assigned_VBN‡, studies_NNS‡, The_DT‡, were_VBD‡, placebo_NN‡, One_CD‡, editorial_NN‡, evidence_NN‡, group_NN‡, randomized_VBN‡, random assign*, placebo group*, email_NN‡
5 | famili histori*, Anesthesiologists_NNPS‡, dri eye*, revealed_VBN‡, history_NN‡, Hed_NNP‡, moist_JJ‡, transit_NN‡, american suffer*, anesthesiology_NN‡, need new*, excessive_JJ‡, inform patient*, labbased_JJ‡, histori breast*, air_NN‡
6 | per_ner_count⊗, professor_NN‡, study_NN‡, University_NNP‡, involved_VBN‡, The_DT‡, normalized_distinct_word_count⊗, WC?, But_CC‡, that_IN‡, said_VBD‡, National_NNP‡, School_NNP‡, not_RB‡, funded_VBN‡, about_IN‡
7 | Not_RB‡, But_CC‡, WC?, differ?, There_EX‡, The_DT‡, than_IN‡, are_VBP‡, normalized_distinct_word_count⊗, many_JJ‡, For_IN‡, year_NN‡, That_DT‡, University_NNP‡, often_RB‡, better_JJR‡
8 | not_RB‡, differ?, are_VBP‡, negate?, radiotherapy_NN‡, Alessandro_NNP‡, magnet reson*, CITATION_NNP‡, twice week*, reson imag*, temperature_NN‡, outcom studi*, resonance_NN‡, cognit impair*, welltolerated_VBN‡, axis_NN‡
9 | have_VBP‡, studies_NNS‡, FoxNewscom_NNP‡, moisturizers_NNS‡, news releas*, result promis*, help woman*, control blood*, consumption_NN‡, healing_VBG‡, leadership_NN‡, molecular_JJ‡, Melbourne_NNP‡, educated_VBN‡, obesity_NNP‡, penetrate_VB‡
10 | differ?, negate?, social?, tentat?, Sixltr?, detect diseas*, news releas*, develop research*, lead investig*, collaborate_VBP‡, discovery_NN‡, tumour_NN‡, media contact*, innovator_NN‡, resume_VB‡, exceptional_JJ‡
Legend: Cr – Criterion; ? – LIWC feature; * – TF-IDF feature; ‡ – POSWord count; ⊗ – miscellaneous feature. Features common to all three feature selection methods are indicated in bold in the original typeset table.
VII. DISCUSSION

In this study, we have examined the application of a machine learning approach to automate the quality assessment process for web based health related information. We found that it is feasible to apply machine learning classifiers to estimate the quality of health related articles, provided the classifiers are trained properly. This work is not directly comparable to existing studies because most of them examined the quality of health information from a single domain perspective (e.g., vaccination [43], [44], [45]; diabetic neuropathy [13]; reproductive health information [14]; nutrition coverage [15]) through a manual process and statistical analysis, whereas we have examined articles over the entire health domain, ensuring applicability to all possible health related categories.

In addition, one limitation of our study is that the dataset size varied from criterion to criterion (e.g., our dataset for criterion 1 comprised 1426 articles after removing class instances with the ‘Not Applicable’ label). In our future study, we plan to address this shortcoming.

Another limitation is that our dataset is not large enough to be compatible with a deep learning framework. We nevertheless trained a deep learning classifier on our dataset and found approximately 50% accuracy over all criteria. In our future work, we plan to enrich our dataset to examine its feasibility from a deep learning perspective.
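As a concrete illustration of this per-criterion set-up, the following sketch trains and evaluates one binary classifier per criterion; SelectKBest with the chi-squared statistic stands in for the feature selection step, and the value of k, the SVM kernel and the fold count are assumptions rather than our exact settings.

```python
# A minimal per-criterion training sketch, assuming a non-negative feature
# matrix X (articles x 53012 features) and a binary label vector y for one
# criterion. Selector, k and classifier settings are illustrative.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def criterion_accuracy(X, y, k=1000):
    model = Pipeline([
        ("select", SelectKBest(chi2, k=k)),  # keep the k most relevant features
        ("svm", LinearSVC()),                # linear SVM classifier
    ])
    # Selection happens inside the pipeline, so each cross-validation fold
    # picks features from its own training split only, avoiding leakage.
    return cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
```

Keeping the selector inside the pipeline matters: selecting features on the full dataset before cross-validation would leak test information into training and inflate the reported accuracy.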
VIII. CONCLUSION AND FUTURE WORK

In this paper, we have applied a data mining approach to automatically assess the quality of online health articles. We have prepared a dataset comprising 1720 health related articles extensively reviewed by a group of experts. Through a pipeline of data pre-processing steps, we have refined our data and extracted 53012 features to train classifiers. We have identified the best feature selection technique to select the most relevant feature subset from our feature space, and have applied four different classifiers - SVM, Naive Bayes, Random Forest and EnsembleVote - to train models. For our dataset, we found SVM to be the best performer, achieving accuracies of 84% to 90% across the ten criteria. We have also analyzed the top 16 most correlated features for each of the ten criteria to justify the feasibility of our assessment, and found that the selected features are capable of characterizing the criteria successfully. From our experimental results and analysis, it can be concluded that it is feasible to apply data mining techniques to automate the quality assessment process for online health articles. Given the richness of our dataset and the fact that the analysis is not tied to any specific focus area, the proposed model may serve as a universal standard for appraising the quality of OHA and mitigate, to some extent, the negative impact of misinformation dissemination.
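For completeness, one plausible reading of the EnsembleVote combination named above is a soft-voting ensemble over the three base learners; the voting scheme and hyperparameters below are assumptions, not our exact configuration.

```python
# A sketch of an ensemble vote over the three base classifiers named above,
# assuming non-negative features (e.g., TF-IDF) so MultinomialNB applies.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="linear", probability=True)),   # SVM
        ("nb", MultinomialNB()),                           # Naive Bayes
        ("rf", RandomForestClassifier(n_estimators=100)),  # Random Forest
    ],
    voting="soft",  # average the predicted class probabilities
)
# Usage: ensemble.fit(X_train, y_train); ensemble.score(X_test, y_test)
```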
As future work, we will further investigate this study with a deep learning approach and explore a multinomial classification problem to evaluate health related articles that cannot address some of the specific criteria. We also plan to conduct case studies and to develop an article recommendation system based on our model.


REFERENCES

[1] Australian Institute of Health and Welfare, Australia's health 2018.
[2] L. Sbaffi and J. Rowley, "Trust and Credibility in Web-Based Health Information: A Review and Agenda for Future Research," Journal of Medical Internet Research, vol. 19, no. 6, p. 218, 2017.
[3] B. Kitchens, C. A. Harle, and S. Li, "Quality of health-related online search results," Decision Support Systems, vol. 57, pp. 454–462, 2014.
[4] G. Eysenbach and C. Köhler, "How do consumers search for and appraise health information on the world wide web? Qualitative study using focus groups, usability tests, and in-depth interviews," BMJ (Clinical research ed.), vol. 324, no. 7337, pp. 573–7, 2002.
[5] G. Bhatt, A. Sharma, S. Sharma, A. Nagpal, B. Raman, and A. Mittal, "On the Benefit of Combining Neural, Statistical and External Features for Fake News Identification," 2017.
[6] W.-Y. S. Chou, A. Oh, and W. M. P. Klein, "Addressing Health-Related Misinformation on Social Media," JAMA, vol. 320, no. 23, pp. 2417–2418, 2018.
[7] R. M. Merchant and D. A. Asch, "Protecting the Value of Medical Science in the Age of Social Media and Fake News," JAMA, vol. 320, no. 23, p. 2415, 2018.
[8] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, "Fake News Detection on Social Media," ACM SIGKDD Explorations Newsletter, vol. 19, no. 1, pp. 22–36, 2017.
[9] E. Afful-Dadzie, S. Nabareseh, Z. K. Oplatková, and P. Klímek, "Model for Assessing Quality of Online Health Information: A Fuzzy VIKOR Based Method," Journal of Multi-Criteria Decision Analysis, vol. 23, no. 1-2, pp. 49–62, 2016.
[10] G. Eysenbach, J. Powell, O. Kuss, and E.-R. Sa, "Empirical Studies Assessing the Quality of Health Information for Consumers on the World Wide Web," JAMA, vol. 287, no. 20, p. 2691, 2002.
[11] S. Dhoju, M. Main Uddin Rony, M. Ashad Kabir, and N. Hassan, "Differences in Health News from Reliable and Unreliable Media," in Companion Proceedings of The 2019 World Wide Web Conference. ACM, 2019, pp. 981–987.
[12] J. Fairbanks, N. Fitch, N. Knauf, and E. Briscoe, "Credibility assessment in the news: Do we need to read?" in WSDM Workshop on Misinformation and Misbehavior Mining on the Web (MIS2). ACM, 2018, p. 8.
[13] S. Chumber, J. Huber, and P. Ghezzi, "A Methodology to Analyze the Quality of Health Information on the Internet: The Example of Diabetic Neuropathy," The Diabetes Educator, vol. 41, no. 1, pp. 95–105, 2015.
[14] A. Aslani, O. Pournik, A. Abu-Hanna, and S. Eslami, "Web-site evaluation tools: a case study in reproductive health information," Studies in Health Technology and Informatics, vol. 205, pp. 895–9, 2014.
[15] A. R. Kininmonth, N. Jamil, N. Almatrouk, and C. E. L. Evans, "Quality assessment of nutrition coverage in the media: a 6-week survey of five popular UK newspapers," BMJ Open, vol. 7, no. 12, p. e014633, 2017.
[16] J. M. Robillard, J. H. Jun, J.-A. Lai, and T. L. Feng, "The QUEST for quality online health information: validation of a short quantitative tool," BMC Medical Informatics and Decision Making, vol. 18, no. 1, 2018.
[17] T. Devine, J. Broderick, L. M. Harris, H. Wu, and S. W. Hilfiker, "Making Quality Health Websites a National Public Health Priority: Toward Quality Standards," Journal of Medical Internet Research, vol. 18, no. 8, pp. 211–218, 2016.
[18] R. Moynihan, L. Bero, D. Ross-Degnan, D. Henry, K. Lee, J. Watkins, C. Mah, and S. B. Soumerai, "Coverage by the News Media of the Benefits and Risks of Medications," New England Journal of Medicine, vol. 342, no. 22, pp. 1645–1650, 2000.
[19] A. Keselman, C. Arnott Smith, A. C. Murcko, and D. R. Kaufman, "Evaluating the Quality of Health Information in a Changing Digital Ecosystem," Journal of Medical Internet Research, vol. 21, no. 2, pp. 111–129, 2019.
[20] D. Charnock, S. Shepperd, G. Needham, and R. Gann, "DISCERN: an instrument for judging the quality of written consumer health information on treatment choices," Journal of Epidemiology and Community Health, vol. 53, no. 2, pp. 105–11, 1999.
[21] D. Zeraatkar, M. Obeda, J. S. Ginsberg, and J. Hirsh, "The development and validation of an instrument to measure the quality of health research reports in the lay media," BMC Public Health, vol. 17, no. 1, 2017.
[22] S. Shepperd, D. Charnock, and A. Cook, "A 5-star system for rating the quality of information based on DISCERN," Health Information and Libraries Journal, vol. 19, no. 4, pp. 201–205, 2002.
[23] B. Moult, L. Franck, and H. Brady, "Ensuring quality information for patients: development and preliminary validation of a new instrument to improve the quality of written health care information," Health Expectations, vol. 7, no. 2, pp. 165–75, 2004.
[24] L. Theodosiou and J. Green, "Emerging challenges in using health information from the internet," Advances in Psychiatric Treatment, vol. 9, no. 5, pp. 387–396, 2003.
[25] P. Kim, T. R. Eng, M. J. Deering, and A. Maxfield, "Published criteria for evaluating health related web sites: review," BMJ (Clinical research ed.), vol. 318, no. 7184, pp. 647–9, 1999.
[26] C. Boyer, M. Selby, J.-R. Scherrer, and R. Appel, "The Health On the Net Code of Conduct for medical and health Websites," Computers in Biology and Medicine, vol. 28, no. 5, pp. 603–610, 1998.
[27] M. Breckons, R. Jones, J. Morris, and J. Richardson, "What Do Evaluation Instruments Tell Us About the Quality of Complementary Medicine Information on the Internet?" Journal of Medical Internet Research, vol. 10, no. 1, 2008.
[28] W. M. Silberg, G. D. Lundberg, and R. A. Musacchio, "Assessing, controlling, and assuring the quality of medical information on the Internet: Caveant lector et viewor – Let the reader and viewer beware," JAMA, vol. 277, no. 15, pp. 1244–5, 1997.
[29] A. Kinsora, K. Barron, Q. Mei, and V. V. Vydiswaran, "Creating a Labeled Dataset for Medical Misinformation in Health Forums," in 2017 IEEE International Conference on Healthcare Informatics (ICHI). IEEE, 2017, pp. 456–461.
[30] H. Samuel and O. Zaïane, "MedFact: Towards improving veracity of medical information in social media using applied machine learning," in Canadian Conference on Artificial Intelligence, 2018, pp. 108–120.
[31] A. Ghenai and Y. Mejova, "Catching Zika fever: Application of crowdsourcing and machine learning for tracking health misinformation on Twitter," in 2017 IEEE International Conference on Healthcare Informatics (ICHI), 2017, pp. 518–518.
[32] H. Ahmed, I. Traore, and S. Saad, "Detection of online fake news using n-gram analysis and machine learning techniques," in International Conference on Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. Springer, 2017, pp. 127–138.
[33] Y. R. Tausczik and J. W. Pennebaker, "The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods," Journal of Language and Social Psychology, vol. 29, no. 1, pp. 24–54, 2010.
[34] K. T. Huq, A. S. Mollah, and M. S. H. Sajal, "Comparative study of feature engineering techniques for disease prediction," in International Conference on Big Data, Cloud and Applications. Springer, 2018, pp. 105–117.
[35] S. Gilda, "Evaluating machine learning algorithms for fake news detection," in 2017 IEEE 15th Student Conference on Research and Development (SCOReD). IEEE, 2017, pp. 110–115.
[36] K. Sharma, F. Qian, H. Jiang, N. Ruchansky, M. Zhang, and Y. Liu, "Combating fake news: A survey on identification and mitigation techniques," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 10, no. 3, pp. 1–42, 2019.
[37] J. Singh and M. Kumar, "A Meta Search Approach to Find Similarity between Web Pages Using Different Similarity Measures," in International Conference on Advances in Computing, Communication and Control, 2011, pp. 150–160.
[38] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, 4th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2016.
[39] H. Yu and S. Kim, "SVM Tutorial – Classification, Regression and Ranking," in Handbook of Natural Computing. Springer, 2012, pp. 479–506.
[40] H. Zhang and J. Su, "Naive Bayesian classifiers for ranking," in European Conference on Machine Learning. Springer, 2004, pp. 501–512.
[41] S. R. Kalmegh, "Comparative analysis of WEKA data mining algorithms RandomForest, RandomTree and LADTree for classification of indigenous news data," International Journal of Emerging Technology and Advanced Engineering, vol. 5, no. 1, pp. 507–517, 2015.
[42] X. Zeng, D. F. Wong, and L. S. Chao, "Constructing better classifier ensemble based on weighted accuracy and diversity measure," The Scientific World Journal, vol. 2014, p. 12, 2014.
[43] X. Zhou, E. Coiera, G. Tsafnat, D. Arachi, M.-S. Ong, and A. G. Dunn, "Using social connection information to improve opinion mining: Identifying negative sentiment about HPV vaccines on Twitter," Studies in Health Technology and Informatics, vol. 216, pp. 761–5, 2015.
[44] J. Du, J. Xu, H. Song, X. Liu, and C. Tao, "Optimization on machine learning based approaches for sentiment analysis on HPV vaccines related tweets," Journal of Biomedical Semantics, vol. 8, no. 1, p. 9, 2017.
[45] T. Mitra, S. Counts, and J. W. Pennebaker, "Understanding Anti-Vaccination Attitudes in Social Media," in Tenth International AAAI Conference on Web and Social Media, 2016.
