Data and Knowledge Management Competency


Mojdeh Amini
HCIN-548-02-SP21 - HCI Seminar- e-Portfolio
Healthcare Informatics- University of San Diego
Professor Dorothy O'Hagan
May 13, 2021

Introduction

The data and knowledge management competencies were essential to our healthcare informatics program, particularly in the analytics track, because they offered many opportunities to gain the knowledge, skills, and statistical tools and techniques needed to evaluate data, answer questions, and solve problems with more accurate outcomes. Examples include medical and nonmedical terminologies (see Appendix A), structured query language (SQL) for collecting and analyzing data, and Python, a readable programming language used for data analytics, machine learning, web development, and data visualization. I have selected the following competencies:

• Demonstrate proper techniques for gathering, formatting, and storing data to investigate a

given question or problem.

• Demonstrate skills in using data management software such as SQL and Microsoft Office

to analyze a given problem.

• Apply selected statistical methodologies to evaluate a problem.



Artifact 1

Demonstrate proper techniques for gathering, formatting, and storing data to investigate a

given question or problem.

Before selecting and collecting any data, my primary consideration is to ensure that the information collected is consistent with freedom of information and privacy protection legislation and complies with the Health Insurance Portability and Accountability Act of 1996 (HIPAA). To protect the credibility and reliability of the data, information should also be gathered using accepted data collection techniques. The following six steps are commonly used:

Step 1: Collecting the data, which is the most critical step of the knowledge management process,

Step 2: Organizing the collected data,

Step 3: Summarizing,

Step 4: Analyzing and interpreting,

Step 5: Synthesizing, and

Step 6: Decision-making for acting on the data.
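The gathering, formatting, and storing described in this competency can be illustrated with a minimal Python sketch. It is not taken from the class activities: the CSV content, the column names, and the bp_reading table are hypothetical, and an in-memory SQLite database stands in for whatever database a real project would use.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; column names and values are made up for illustration.
RAW_CSV = """patient_id,visit_date,systolic,diastolic
 101 ,2021-01-05,128,82
102,2021-01-06, 141 ,90
103,2021-01-07,,76
"""


def gather(text):
    """Collect the raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def format_row(row):
    """Trim whitespace and convert types; empty measurements become None (SQL NULL)."""
    def to_int(value):
        value = value.strip()
        return int(value) if value else None

    return (
        int(row["patient_id"].strip()),
        row["visit_date"].strip(),
        to_int(row["systolic"]),
        to_int(row["diastolic"]),
    )


def store(rows, connection):
    """Store the formatted rows in a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS bp_reading ("
        "patient_id INTEGER, visit_date TEXT, systolic INTEGER, diastolic INTEGER)"
    )
    connection.executemany("INSERT INTO bp_reading VALUES (?, ?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")  # in-memory database; nothing is written to disk
formatted = [format_row(row) for row in gather(RAW_CSV)]
store(formatted, connection)

stored_count = connection.execute("SELECT COUNT(*) FROM bp_reading").fetchone()[0]
missing_count = connection.execute(
    "SELECT COUNT(*) FROM bp_reading WHERE systolic IS NULL"
).fetchone()[0]
print(stored_count, missing_count)
```

Keeping the three stages in separate functions makes it easier to check each one on its own, for example confirming that blank measurements are stored as NULL rather than as zero.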



Artifact 2

Demonstrate skills in using data management software such as SQL and Microsoft Office

to analyze a given problem.

SQL is the standard language for storing, manipulating, and retrieving data in relational database management systems, in which a database contains one or more objects called tables. Some common relational database management systems that use SQL are Sybase, Microsoft SQL Server, Access, and Ingres. SQL statements are used to perform tasks such as updating data in a database or retrieving data from a database. Most database systems use SQL, and the standard commands such as Select, From, Where, Insert, Update, Delete, Create, and Drop can accomplish almost everything that one needs to do with a database. The Insert statement adds a row of data to a table, and the Select statement queries the database and retrieves the data that match specific criteria. Those criteria are specified by carefully constructing a Where clause.
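The standard SQL commands named above can be tried out from Python with the built-in sqlite3 module. The following is a minimal sketch under stated assumptions: the patient table, its columns, the sample rows, and the 6.5 cutoff in the Where clause are all hypothetical and chosen only to show the syntax.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # temporary in-memory database
cursor = connection.cursor()

# CREATE: define a table (hypothetical schema).
cursor.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, hba1c REAL)")

# INSERT: add rows of data to the table.
cursor.executemany(
    "INSERT INTO patient (id, name, hba1c) VALUES (?, ?, ?)",
    [(1, "Patient A", 5.4), (2, "Patient B", 7.1), (3, "Patient C", 8.3)],
)

# SELECT ... FROM ... WHERE: retrieve only the rows that match the criteria.
cursor.execute("SELECT name, hba1c FROM patient WHERE hba1c >= ? ORDER BY id", (6.5,))
elevated = cursor.fetchall()

# UPDATE: change existing data; the WHERE clause limits which rows change.
cursor.execute("UPDATE patient SET hba1c = ? WHERE id = ?", (6.9, 2))

# DELETE: remove the rows that match the WHERE clause.
cursor.execute("DELETE FROM patient WHERE id = ?", (3,))
remaining = cursor.execute("SELECT id, hba1c FROM patient ORDER BY id").fetchall()

# DROP: remove the table itself.
cursor.execute("DROP TABLE patient")
connection.close()

print(elevated)
print(remaining)
```

The question-mark placeholders pass values separately from the SQL text, which is the usual way to avoid building statements by string concatenation.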



Artifact 3

Apply selected statistical methodologies to evaluate a problem.

Two class activities, one on blood pressure (BP) and one on diabetes monitoring via HbA1c, serve as examples of SQL code used to load tables.

SQL code to load tables: BP_Class_Activity_20191024.sql

SQL code to load tables: HbA1c_Class_Activity_20191024.sql

Appendix A

Machine Learning Terminology and References

Artificial Intelligence & Machine Learning

Artificial Intelligence: the ability of a machine to perform cognitive functions we

associate with human minds, such as perceiving, reasoning, learning, interacting with the

environment, problem-solving, and even exercising creativity.

Hyperlink for more info: https://builtin.com/artificial-intelligence

Machine Learning: algorithms that detect patterns and learn how to make predictions and recommendations by processing data and experiences rather than receiving explicit programming instruction.

Hyperlink for more info: https://www.expert.ai/blog/machine-learning-definition/

Deep Learning: a type of machine learning that can process a broader range of data

resources, requires less data preprocessing by humans, and can often produce more

accurate results than traditional machine-learning approaches.

Hyperlink for more info: https://machinelearningmastery.com/what-is-deep-learning/

Descriptive Analysis: use data aggregation and data mining to provide insight into the

past.

Predictive Modeling: use statistical models and forecasting techniques to understand the

future.

Hyperlink for more info: https://www.microstrategy.cn/us/resources/introductory-guides/predictive-modeling-the-only-guide-you-need

Prescriptive Modeling: use optimization and simulation algorithms to advise on possible

outcomes.

Hyperlink for more info: https://www.valamis.com/hub/prescriptive-analytics

Text Analytics: automated process of translating large volumes of unstructured text into

quantitative data to uncover insights, trends, and patterns.

Natural Language Processing (NLP): a field of Artificial Intelligence that gives the

machines the ability to read, understand and derive meaning from human languages.

Python: a general-purpose programming language that takes a more general approach to data science. It is used to develop software for the web and for applications.

Hyperlink for more info: https://www.pythonforbeginners.com/learn-python/what-is-python

R: a programming language mainly used for statistical analysis, including data manipulation, calculation, and graphical display.

Python Run-Time Environment: the software stack responsible for installing your web service's code and its dependencies and for running your service.

Alternative definition: To get your machine to run python code, you need some way to

convert it into machine code (a low-level language comprised of binary digits - ones and

zeros). The programs, libraries, and configurations that allow you to do this are

collectively known as the "python runtime environment."

Source: https://www.reddit.com/r/learnpython/comments/2pmqcj/can_someone_explain_what_what_is_the_python/

Python Library: a reusable chunk of code that can be included in your programs or projects; a collection of core modules.

Why are libraries used in Python?

Python libraries are sets of useful functions that eliminate the need to write code from scratch.

Source/Read more - 34 Open-Source Python Libraries You Should Know About:

https://www.mygreatlearning.com/blog/open-source-python-libraries/

Python Notebook: an interface to combine, compile, and print the output of software code.

Alternative definition: An open-source web application that allows data scientists to create

and share documents that integrate live code, equations, computational output,

visualizations, and other multimedia resources, along with explanatory text in a single

document.

You can use Jupyter Notebooks for all sorts of data science tasks, including data cleaning

and transformation, numerical simulation, exploratory data analysis, data visualization,

statistical modeling, machine learning, deep learning, and much more.

Source/Read more - Why You Should be Using Jupyter Notebooks:

https://medium.com/@ODSC/why-you-should-be-using-jupyter-notebooks-ea2e568c59f2

Google Colaboratory: tool to combine executable code and rich text in a single

document, along with images, HTML, LaTeX, and more

Alternative definition: Colab is a Python development environment that runs in the

browser using Google Cloud.



Colab notebooks are Jupyter notebooks hosted by Google Colab. Colab enables users to collaborate, to run code that uses Google's cloud resources, e.g., GPUs and TPUs, and to save documents to Google Drive.

Source/Read more - Introduction to Colab and Python:

https://colab.research.google.com/github/tensorflow/examples/blob/master/courses/udacity_intro_to_tensorflow_for_deep_learning/l01c01_introduction_to_colab_and_python.ipynb

References:

Brownlee, J. (2020, August 14). What is deep learning? Retrieved February 03, 2021, from https://machinelearningmastery.com/what-is-deep-learning/

Built In. (n.d.). What is artificial intelligence? How does AI work? Built In. Retrieved February 03, 2021, from https://builtin.com/artificial-intelligence

Expert.ai Team. (2020, May 6). What is machine learning? A definition - expert system. Retrieved February 03, 2021, from https://www.expert.ai/blog/machine-learning-definition/

MicroStrategy. (n.d.). Predictive modeling: The only guide you need. Retrieved February 03, 2021, from https://www.microstrategy.cn/us/resources/introductory-guides/predictive-modeling-the-only-guide-you-need

Panesar. (2019). Machine Learning and AI for Healthcare. Apress.

VALAMIS. (n.d.). What are prescriptive analytics? How does it work? Examples & benefits. Retrieved February 03, 2021, from https://www.valamis.com/hub/prescriptive-analytics

Healthcare Data for Machine Learning



Clinical data sets: groups of information collected for a specific disease, intervention, or monitoring activity to maintain statistics and to support disease management and clinical governance (NIH, 2021).

Clinical value: Improving care, efficiency, and patient satisfaction (Becker's Hospital

Review, n.d.)

International Classification of Diseases (ICD): provides a method of classifying injuries,

diseases, and causes of death (NIH, 2018).

Systematized Nomenclature of Medicine (SNOMED): provides a standardized way to

represent clinical phrases recorded by the clinician and allows for the automatic

interpretation of these clinical phrases (SNOMED, n.d.)

Logical Observation Identifiers Names and Codes (LOINC): allows for the

aggregation and exchange of clinical results for care delivery, research, and outcomes

management by providing a set of standardized codes and structured names to

unambiguously identify things you can observe or measure (LOINC, n.d.).

RxNorm: A standardized naming system for both branded and generic drugs and a tool for supporting semantic interoperation between pharmacy knowledge base systems and drug terminologies (NIH, 2021).

National Drug Code (NDC): A universal product identifier for human drugs in the United States; each code is a unique 10-digit or 11-digit, 3-segment number (Anderson, 2020).

Current Procedural Terminology (CPT): A medical code set used to report surgical, diagnostic, and medical services and procedures to entities such as physicians, health insurance companies, and accreditation organizations (Lee, 2015). Moreover, CPT codes are used together with ICD-9-CM or ICD-10-CM numerical diagnostic coding during the electronic medical record billing process (Lee, 2015).

Web and social media data: Clicks, history, health forums (Panesar, 2021).

Machine-to-machine data: Sensors, wearables (Panesar, 2021).

Big transaction data: Health claim data, billing data (Panesar, 2021).

Biometric data: Fingerprints, genetics, biomarkers derived from wearables (Panesar, 2021).

Human-generated data: Email, paper documents, electronic medical records (Panesar, 2021).

Big Data 4 V's: Volume, Variety, Velocity, Veracity (Panesar, 2021).

Volume: Size of generated and stored data (Panesar, 2021).

Variety: Different types of data (Panesar, 2021).

Velocity: Speed at which data is generated (Panesar, 2021).

Veracity: Data accuracy (Panesar, 2021).

Clinical data processes: the process of collection, cleaning, and management of subject data in compliance with regulatory standards (Krishnankutty et al., 2012).

Diagnosis: investigation or analysis of the cause or nature of a condition, situation, or problem (Merriam-Webster, n.d.).

Lab results: often shown as a set of numbers alongside a reference range, also called normal values, for a test on a sample such as blood, urine, other body fluid, or tissue.

Medications: medicinal substance (Merriam-Webster, n.d.).

Procedures: a particular way of accomplishing something or of acting (Merriam-Webster, n.d.).

Outliers: a statistical observation that is markedly different in value from the others of the

sample (Merriam-Webster, n.d.)

Missing Values: data value that is not stored for a variable in the observation of interest.

Code Libraries: collections of code that are available for public use.

Public data sources: freely available datasets.

Pandas: An open-source, BSD-licensed library providing high-performance, easy-to-use

data structures and data analysis tools for the Python programming language.

Matplotlib: a plotting library for the Python programming language and its numerical

mathematics extension NumPy

Data frame: a table or a two-dimensional array-like structure in which each column

contains values of one variable and each row contains one set of values from each

column.

Data Cleaning: the process of detecting and correcting corrupt or inaccurate records from

a record set, table, or database and refers to identifying incomplete, incorrect, faulty, or

irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse

data.

Null: indicates that a data value does not exist in the database

NaN: Not a Number

Kaggle: an online community of data scientists and machine learning practitioners

Imputation: the process of replacing missing data with substituted values
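The terms Missing Values, Null, NaN, and Imputation can be tied together in a short Python sketch. It uses mean imputation, which is only one of several possible strategies, and the HbA1c values are hypothetical. Python's None stands in for a database Null here.

```python
import math
from statistics import mean


def is_missing(value):
    """Treat None (a database Null) and NaN (Not a Number) as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_mean(values):
    """Replace each missing entry with the mean of the observed entries."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if is_missing(v) else v for v in values]


# Hypothetical lab values containing one Null and one NaN.
hba1c = [5.0, None, 7.0, float("nan"), 6.0]
imputed = impute_mean(hba1c)
print(imputed)
```

Mean imputation keeps the sample size but reduces the variability of the column, so whether it is appropriate depends on the analysis.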

References:

Anderson, L. (2020). National Drug Codes explained: What you need to know. Retrieved February 2021, from https://www.drugs.com/ndc.html.

Becker's Hospital Review. (n.d.). Creating clinical value: 4 steps to drive change and improve care. Retrieved February 03, 2021, from https://go.beckershospitalreview.com/creating-clinical-value-4-steps-to-drive-change-and-improve-care

Lee, K. (2015, June 22). What is Current Procedural Terminology (CPT) code? - definition from whatis.com. Retrieved February 03, 2021, from https://searchhealthit.techtarget.com/definition/Current-Procedural-Terminology-CPT

Logical Observation Identifiers Names and Codes (LOINC). (n.d.). About LOINC. Retrieved February 03, 2021, from https://loinc.org/about/

National Cancer Institute (NIH). (2018, December 3). What is the ICD? Retrieved

February 03, 2021, from https://training.seer.cancer.gov/icd10cm/intro.html.

National Library of Medicine (NIH). (2021). RxNorm overview. Retrieved February 03,

2021, from https://www.nlm.nih.gov/research/umls/rxnorm/overview.html.

Panesar. (2019). Machine Learning and AI for Healthcare. Apress.

Panesar, A. (2021). Machine Learning and AI for Healthcare: Big Data for Improved

Health Outcomes (2nd ed.). Apress. Doi: https://doi.org/10.1007/978-1-4842-6537-6.

Systematized Nomenclature of Medicine (SNOMED). (n.d.). 5-Step briefing. Retrieved

February 03, 2021, from https://www.snomed.org/snomed-ct/five-step-briefing.



Krishnankutty, B., Bellary, S., Kumar, N. B., & Moodahadu, L. S. (2012). Data

management in clinical research: An overview. Indian journal of pharmacology, 44(2),

168–172. https://doi.org/10.4103/0253-7613.93842

Merriam-Webster. (n.d.). Diagnosis. In Merriam-Webster.com dictionary. Retrieved

February 14, 2021, from https://www.merriam-webster.com/dictionary/diagnosis

Fundamentals of Machine Learning Algorithms

Scikit-learn: a Python machine learning library that provides a range of supervised and unsupervised learning algorithms through a consistent interface in Python (Brownlee, 2020).

Encode: converting categorical data, such as ordinal and nominal data, into a form that is readable by the machine.

Target: Output variables (https://machinelearningmastery.com/how-to-transform-target-variables-for-regression-with-scikit-learn/)

Feature: Input Variables to a machine learning model. (Doi: 10.1001/jama.2019.16489)

Scaling the data (aka normalizing): A method used to standardize the range of features of data. Data is transformed so that values fall within a specific range, e.g., (0, 1), using x' = (x - min(x)) / (max(x) - min(x)), where x' is the normalized value. (https://kharshit.github.io/blog/2018/03/23/scaling-vs-normalization)
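Scaling a feature to the range (0, 1) is commonly done with min-max scaling, where each value has the minimum subtracted and is divided by the spread between minimum and maximum. The following plain-Python sketch assumes that approach; the function name and the systolic readings are hypothetical, and it is not the scikit-learn implementation.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    low, high = min(values), max(values)
    if high == low:
        # A constant feature carries no spread; map every value to new_min.
        return [new_min for _ in values]
    span = high - low
    return [new_min + (v - low) / span * (new_max - new_min) for v in values]


# Hypothetical systolic blood pressure readings.
systolic = [110, 120, 130, 150]
scaled = min_max_scale(systolic)
print(scaled)
```

Because the minimum and maximum are learned from the data, they are normally computed on the training portion only and then reused to transform new data.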

Train_test_split: A procedure for estimating the performance of machine learning

algorithms when they are used to make predictions on data not used to train the model.

(https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-

algorithms/)
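The idea behind a train-test split can be sketched without any library: shuffle the records, then hold out a fraction for evaluation. The function below is a hypothetical stand-in written for illustration, not scikit-learn's train_test_split, and the 25% test size and the seed are arbitrary assumptions.

```python
import random


def simple_train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, held_out) lists."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(20))  # stand-in for 20 patient records
train, held_out = simple_train_test_split(records, test_size=0.25, seed=42)
print(len(train), len(held_out))
```

The model is fitted on train only, and held_out is used once at the end to estimate how well the model predicts data it has not seen.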

Linear Regression: A statistical technique for understanding the relationship between an input/independent variable and a continuous output/dependent variable.

Correlation coefficient: Statistical measure of the strength of the relationship between the relative movements of two variables (https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/)

R² (R-Squared): Statistical measure that represents the proportion of the variance for a dependent variable that is explained by an independent variable or variables in a regression model. (https://www.investopedia.com/terms/r/r-squared.asp)

Linear Equation: An equation that makes a straight line when it is graphed and often

written in the form y = mx + b (MathisFun, n.d.).
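Linear regression, the correlation coefficient, R-squared, and the linear equation y = mx + b fit together in one small calculation. The sketch below uses ordinary least squares for a single input variable, written in plain Python so each term is visible; the x and y values are hypothetical.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for y = m*x + b with one input variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def correlation(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


def r_squared(x, y, m, b):
    """Proportion of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
m, b = fit_line(x, y)
r = correlation(x, y)
r2 = r_squared(x, y, m, b)
print(round(m, 3), round(b, 3), round(r, 3), round(r2, 3))
```

For a regression with a single input variable, R-squared equals the square of the correlation coefficient, which the two functions confirm on this data.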

References

Brownlee, J. (2020). A gentle introduction to scikit-learn.

https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-

machine-learning-library/

Brownlee, J. (2020). How to transform target variables for regression in Python.

https://machinelearningmastery.com/how-to-transform-target-variables-for-regression-

with-scikit-learn/

Brownlee, J. (2020). Train-Test Split for Evaluating Machine Learning Algorithms.

https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-

algorithms/

Chen, P. C., Krause, J., Liu, Y., & Peng, L. (2019). How to read articles that use machine learning: Users' guides to the medical literature. https://doi.org/10.1001/jama.2019.16489

Fernando, J. (2020). R-squared Definition. https://www.investopedia.com/terms/r/r-squared.asp

Glen, S. (2021). Correlation Coefficient: Simple Definition, Formula, Easy Steps.

https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-

formula/

Kumar, H. (2018). Scaling vs. Normalization.

https://kharshit.github.io/blog/2018/03/23/scaling-vs-normalization

MathisFun. (n.d.). Linear Equation. Math is Fun. https://www.mathsisfun.com/definitions/linear-equation.html

Supervised Learning Using Classification Algorithms

Logistic Regression: A type of regression analysis to conduct when the dependent

variable is dichotomous (StatisticsSolutions, n.d.).

Confusion matrix: A table that is used to describe the performance of a classification

model on a set of data where the true values are known (data school, 2014).

Prevalence: The proportion of a population who have a specific characteristic in a given time period (NIH, 2017).

Accuracy: a metric for evaluating classification models, calculated as the ratio of correct predictions to all predictions.

Accuracy = Number of correct predictions / Total number of predictions.
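The confusion matrix, accuracy, and prevalence can be computed together for a binary classifier. In the minimal sketch below, the actual and predicted labels are hypothetical, with 1 marking the condition as present and 0 as absent.

```python
def confusion_counts(actual, predicted):
    """Count the four cells of a binary confusion matrix (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),  # true positives
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),  # false positives
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),  # false negatives
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),  # true negatives
    }


# Hypothetical true labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
total = sum(counts.values())
accuracy = (counts["tp"] + counts["tn"]) / total    # correct predictions / all predictions
prevalence = (counts["tp"] + counts["fn"]) / total  # actual positives / all cases in the sample
print(counts, accuracy, prevalence)
```

Reading accuracy next to prevalence matters because, when a condition is rare, a model that always predicts "absent" can still score a high accuracy.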

K-Nearest Neighbors (KNN): a simple, easy-to-implement supervised ML algorithm that makes predictions using the training dataset directly and can solve both classification and regression problems.
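How KNN uses the training dataset directly can be sketched for classification: find the k training points closest to the query and take a majority vote of their labels. The function name, the two-feature points, and the labels below are hypothetical, and Euclidean distance is an assumed choice of metric.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance to the query
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data with two features per observation.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["low", "low", "low", "high", "high", "high"]

prediction_a = knn_predict(points, labels, query=(2, 2), k=3)
prediction_b = knn_predict(points, labels, query=(6, 5), k=3)
print(prediction_a, prediction_b)
```

There is no separate training step: all of the work happens at prediction time, which is why features are usually scaled first so that no single feature dominates the distance.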

Model Tuning: the process of maximizing a model's performance without overfitting or creating too high a variance, so that the algorithm performs the "best," based on what is specified as "best" (Panesar, 2019).

Naive Bayes: A probabilistic classification method based on Bayes' theorem, in which a prediction is made from prior knowledge and current evidence (Saritas & Yasar, 2019).

Decision Tree: A tree-like graph consisting of nodes that represent a test on an attribute, branches that signify the outcome of the test, and leaf nodes that represent a class label (Rai, Devi & Guleria, 2016).

Support Vector Machines: A computer algorithm that learns by example to assign labels to objects (Noble, 2006).

Random Forest: A machine learning algorithm that fits multiple decision trees to input

data using a random subset of the input variables for each tree constructed (Mascaro et al.,

2014).

Coding Schemes: A set of codes, defined by the words and phrases researchers assign to categorize a segment of the data by topic; researchers consider which questions they are trying to answer and the topics related to those questions (Urban Institute, 2015).

References

Brownlee, J. (2020). Metrics to Evaluate Machine Learning Algorithms in Python.

Machine Learning Mastery. https://machinelearningmastery.com/metrics-evaluate-

machine-learning-algorithms-python/.

data school. (2014, March 25). A simple guide to confusion matrix terminology. Data

School. https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/.

Harrison, O. (2019). Machine Learning Basics with the K-Nearest Neighbors Algorithm. https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761.

Mascaro, J., Asner, G. P., Knapp, D. E., Kennedy-Bowdoin, T., Martin, R. E., Anderson,

C., ... & Chadwick, K. D. (2014). A tale of two "forests": Random Forest machine

learning aids tropical forest carbon mapping. PloS one, 9(1), e85993.

National Institute of Mental Health (NIH). (2017, November). What is Prevalence?

National Institute of Mental Health. https://www.nimh.nih.gov/health/statistics/what-is-

prevalence.shtml.

Noble, W. S. (2006). What is a support vector machine? Nature Biotechnology, 24(12),

1565-1567.

Rai, K., Devi, M. S., & Guleria, A. (2016). Decision tree-based algorithm for intrusion

detection. International Journal of Advanced Networking and Applications, 7(4), 2828.

Saritas, M. M., & Yasar, A. (2019). Performance analysis of ANN and Naive Bayes

classification algorithm for data classification. International Journal of Intelligent Systems

and Applications in Engineering, 7(2), 88-91.

statistics solutions. (n.d.). What is Logistic Regression? Statistics Solutions.

https://www.statisticssolutions.com/what-is-logistic-regression/.

Panesar, A. (2019). Machine learning and AI for healthcare. Coventry, UK: Apress.

eBook ISBN 978-1-4842-3799-1; Softcover ISBN 978-1-4842-3798-4

Urban Institute. (2015). Qualitative Data Analysis. Urban Institute: Data & Methods. https://www.urban.org/research/data-methods/data-analysis/qualitative-data-analysis

Unsupervised Clustering Algorithms



MinMaxScaler: transforms features by scaling each feature to a given range (scikit-learn, n.d.).

Pipeline: a set of tools and processes for performing data integration by capturing datasets from multiple sources (AltexSoft, 2019).

Principal Component Analysis (PCA): a technique for reducing the dimensionality of datasets, increasing interpretability, and minimizing information loss (Jolliffe & Cadima, 2016).

K-means: a clustering method in which "means" refers to averaging the data to find the centroid (Garbade, 2018).

Centroids: Actual or predicted center of a given cluster (Garbade, 2018)
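The way K-means alternates between assigning points to the nearest centroid and averaging each cluster to move its centroid can be sketched in plain Python. This is an illustrative sketch, not the scikit-learn implementation: the points are hypothetical, and the starting centroids are picked by hand rather than at random so that the result is reproducible.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-feature observations forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
print(centroids)
```

Each final centroid is the average of the points in its cluster, which is the "means" in the name.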

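Silhouette score: A score, ranging from -1 to +1, indicating the separation distance between resulting clusters. A score near +1 indicates that a sample is far from the neighboring clusters, a score of 0 indicates that it is on or very close to the boundary between two clusters, and a negative score indicates that it may have been assigned to the wrong cluster. (Scikit Learn, 2020)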

Sigmoid function: a mathematical function with a characteristic S-shaped curve. Several standard sigmoid functions exist, such as the logistic function, the hyperbolic tangent, and the arctangent (Wood, 2020).
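The three sigmoid functions named above are all available from Python's math module or are easy to write. The sketch below evaluates them at a few arbitrary inputs; the logistic function is written in a form that avoids overflow for large negative values.

```python
import math


def logistic(x):
    """Logistic function 1 / (1 + e^-x); output lies between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x) to avoid overflow.
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


inputs = [-4.0, -1.0, 0.0, 1.0, 4.0]
logistic_values = [logistic(x) for x in inputs]  # rises from near 0 to near 1
tanh_values = [math.tanh(x) for x in inputs]     # rises from near -1 to near 1
atan_values = [math.atan(x) for x in inputs]     # rises between -pi/2 and pi/2

for x, lv, tv, av in zip(inputs, logistic_values, tanh_values, atan_values):
    print(f"{x:5.1f}  {lv:.3f}  {tv:.3f}  {av:.3f}")
```

All three curves increase steadily and flatten at both ends; they differ mainly in their output range, which is why the logistic function is the one used to turn a score into a probability in logistic regression.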

References

AltexSoft. (2019). What is Data Engineering: Explaining the Data Pipeline, Data

Warehouse, and Data Engineer Role. AltexSoft.

https://www.altexsoft.com/blog/datascience/what-is-data-engineering-explaining-data-

pipeline-data-warehouse-and-data-engineer-role/.

Garbade, D. M. J. (2018, September 12). Understanding K-means Clustering in Machine Learning. Medium. https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1.

Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent

developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical

and Engineering Sciences, 374(2065), 20150202.

scikit-learn. (n.d.). sklearn.preprocessing.MinMaxScaler. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html.

Wood, T. (2020). Sigmoid Function. DeepAI. https://deepai.org/machine-learning-

glossary-and-terms/sigmoid-functio

Scikit Learn. (2020). Selecting the number of clusters with silhouette analysis on KMeans clustering. https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Ethics of Machine Learning

Definitions: A set of organized, reusable code called upon to perform a required coding

action (Python, 2021).

Structure: the aggregate of elements of an entity in their relationships to each other (Merriam-Webster, 2021).

Formatting: A process in Python where a user inserts specified values into the desired placeholders of a string, e.g., string.format(value1, value2, value3) (W3Schools, 2021).
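A short example may make the placeholder idea concrete. It uses Python's built-in str.format() method with positional and named placeholders; the patient identifiers and readings are hypothetical.

```python
# Positional placeholders are filled in the order the values are given.
template = "Patient {0} had a systolic reading of {1} mmHg on {2}."
message = template.format("A-102", 128, "2021-01-05")

# Named placeholders can also carry a format specification (one decimal place here).
summary = "{name}: HbA1c {value:.1f}%".format(name="Patient B", value=7.12)

print(message)
print(summary)
```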

Writing: Process of scripting computer code (Python, 2021).



Attributions (proper APA citations): Actions related to qualities or features as

characteristics of or possessed by entities, people, or things (Pratt & Last, 2014).

Hyperlinks: a tag in a web page that can link one web page to another page or location in

the same web page (Pratt & Last, 2014).

References

Merriam-Webster. (2021). Structure. In Merriam-Webster.com dictionary. https://www.merriam-webster.com/dictionary/structure

Pratt, P. J., & Last, M. Z. (2014). Concepts of database management. Cengage Learning.

Python. (2021). Classes. https://docs.python.org/3/tutorial/classes.html

Python. (2021). Input and Output. Reading and Writing Files.

https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

W3Schools. (2021). Python String format() Method.

https://www.w3schools.com/python/ref_string_format.asp
