Data Science IBM


Module 1 Summary

Welcome to “Open-Source Tools for Data Science, Part 1.” After watching this video, you will be able to: list the open-source data management tools; list the open-source data integration and transformation tools; list the data visualization tools; list the tools for model building, deployment, monitoring, and assessment; and list the tools for code and data asset management.

So, the most widely used open-source data management tools are relational databases like MySQL and PostgreSQL. There are also NoSQL databases like MongoDB, Apache CouchDB, and Apache Cassandra. In addition, there are file-based tools like the Hadoop File System, and cloud file systems like Ceph. Finally, Elasticsearch stores text data, including the creation of a search index for fast document retrieval.

Now, the task of data integration and transformation in the classic data warehousing world is called Extract, Transform, and Load (ETL). Data scientists often propose Extract, Load, Transform (ELT) instead, where the data is dumped somewhere and the data engineer or data scientist handles its transformation. Another term for this process has emerged: data refinery and cleansing. The most widely used open-source data integration and transformation tools are the following: Apache Airflow, which was originally created by Airbnb; Kubeflow, which allows the execution of data science pipelines on top of Kubernetes; Apache Kafka, which originated at LinkedIn; Apache NiFi, which delivers a very nice visual editor; Apache SparkSQL, which lets you use ANSI SQL and scales up to compute clusters of thousands of nodes; and Node-RED, which also brings a visual editor. In addition, Node-RED’s resource consumption is so low that it even runs on tiny devices like a Raspberry Pi.

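As a minimal sketch of the Apache SparkSQL idea just mentioned (assuming PySpark is installed and a hypothetical file sales.csv exists), the same ANSI SQL query runs unchanged whether the cluster has one node or thousands:

```python
# Minimal sketch: an ETL-style job with ANSI SQL on Apache Spark (PySpark).
# Assumes PySpark is installed and a hypothetical file "sales.csv" exists.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: load raw data into a distributed DataFrame.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)
sales.createOrReplaceTempView("sales")

# Transform: an ANSI SQL query that scales from a laptop to a large cluster.
totals = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
""")

# Load: write the transformed result for downstream consumers.
totals.write.mode("overwrite").parquet("sales_by_region.parquet")
spark.stop()
```
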
Now let’s discuss the most widely used open-source data visualization tools. You must distinguish between programming libraries, where you must write code, and tools that provide a user interface. PixieDust is a library, but it has a user interface that facilitates plotting in Python. A similar approach is used by Hue, which can create visualizations from SQL queries. Kibana, a data exploration and visualization web application, is limited to Elasticsearch as its data provider. And finally, Apache Superset is a data exploration and visualization web application.

Model deployment is a crucial step. Once you’ve created a machine learning model capable of predicting some critical aspects of the future, you should make it consumable by other developers by turning it into an API. Apache PredictionIO currently only supports Apache Spark ML models for deployment, but support for all libraries is on the roadmap. Seldon is an interesting product since it supports nearly every framework, including TensorFlow, Apache SparkML, R, and scikit-learn. Interestingly, it can run on top of Kubernetes and Red Hat OpenShift. Another way to deploy SparkML models is MLeap. Finally, TensorFlow can serve any TensorFlow model using TensorFlow Serving. A model can also run on an embedded device like a Raspberry Pi or a smartphone using TensorFlow Lite, or be deployed to a web browser using TensorFlow.js.

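The core deployment idea above, making a trained model consumable as an API, can be sketched with a small Flask service. This is only a hypothetical illustration; tools such as Seldon or TensorFlow Serving do the same job in a production-grade way.

```python
# Hypothetical sketch: exposing a trained model as a simple REST API with Flask.
# Assumes a scikit-learn model was trained earlier and saved as "model.joblib".
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```
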
Model monitoring is an important step as well. Once you’ve deployed a machine learning model, you want to track its prediction performance as new data arrives so that outdated models can be detected and replaced. Some examples are the following: ModelDB is a machine learning model metadata database where information about the models is stored and can be queried. It natively supports Apache Spark ML Pipelines and scikit-learn. A generic, multi-purpose tool called Prometheus is widely used as well; although it is not specifically made for machine learning model monitoring, it is often used for this purpose. Model performance is measured by more than accuracy: model bias against protected groups like gender or race is important as well. The IBM AI Fairness 360 open-source toolkit detects and mitigates bias in machine learning models. These models, especially neural network-based deep learning models, can be subject to adversarial attacks, where an attacker tries to mislead the model with manipulated data or by controlling it. The IBM Adversarial Robustness 360 Toolbox detects vulnerability to adversarial attacks and helps make the model more robust. Finally, machine learning models are often considered black boxes that apply some magic. The IBM AI Explainability 360 toolkit addresses that problem by finding similar examples in a dataset that can be presented to an end user for manual comparison. The toolkit can also train a simpler machine learning model to explain how different input variables contribute to the final decision of the model.

So, the choice of code asset management tools has become quite simple: Git is now the de facto standard for code asset management, also known as version management or version control. Several services have emerged around Git. The most prominent is GitHub, but the runner-up is GitLab, which has the advantage that the platform is entirely open source and can be hosted and managed on your own. Another choice is Bitbucket. Data asset management, also known as data governance or data lineage, is a crucial part of enterprise-grade data science. Data has to be versioned and annotated with metadata. Apache Atlas is a tool that supports this task. Another interesting project is ODPi Egeria, which is managed through the Linux Foundation; it is an open ecosystem that offers a set of open APIs, types, and interchange protocols that metadata repositories use to share and exchange data. And finally, Kylo is an open-source data management software platform with extensive support for data asset management tasks.

In this video, you learned that: data management tools include MySQL, PostgreSQL, MongoDB, Apache CouchDB, Apache Cassandra, the Hadoop File System, Ceph, and Elasticsearch. Data integration and transformation tools include Apache Airflow, Kubeflow, Apache Kafka, Apache NiFi, Apache SparkSQL, and Node-RED. Data visualization tools include PixieDust, Hue, Kibana, and Apache Superset. Model deployment tools include Apache PredictionIO, Seldon, Kubernetes, Red Hat OpenShift, MLeap, TensorFlow Serving, TensorFlow Lite, and TensorFlow.js. Model monitoring tools include ModelDB, Prometheus, IBM AI Fairness 360, the IBM Adversarial Robustness 360 Toolbox, and IBM AI Explainability 360. Code asset management tools include Git, GitHub, GitLab, and Bitbucket. And finally, data asset management tools include Apache Atlas, ODPi Egeria, and Kylo.

Welcome to Open-Source Tools for Data Science Part 2. After watching this video, you will be able
to Compare and contrast different open-source tools, and Describe the relevant features of
open-source tools. Currently, the most popular development environment data scientists use is
“Jupyter,” which emerged as a tool for interactive Python programming. Jupyter now supports more
than a hundred different programming languages through “kernels,” which encapsulate the execution
environment for each language. A key property of Jupyter Notebooks is that they unify
documentation, code, output from the code, shell commands, and visualizations in a single
document. JupyterLab is the next generation of Jupyter Notebooks and, in the long term, will replace
Jupyter Notebooks. Its many architectural changes make Jupyter more modern and
modular. From a user’s perspective, the main difference between JupyterLab and Jupyter
Notebooks is the ability to open different types of files, including Jupyter Notebooks, data, and
terminals, and then arrange them on the canvas. Although it has been reimplemented from scratch,
Apache Zeppelin was inspired by Jupyter Notebooks and provides a similar experience. One key
differentiator is the integrated plotting capability. In Jupyter Notebooks, you are required to use
external libraries; in Zeppelin, plotting doesn’t require coding. You can also extend the
capabilities by using additional libraries. RStudio is among the oldest development environments for
statistics and data science; it has its origins in the year 2011. RStudio exclusively runs R and all
associated R libraries, although Python development is also possible in the RStudio environment.
While R is also tightly integrated into the Jupyter tool, RStudio provides an optimal user experience
for R. RStudio unifies programming, execution, debugging, remote data access, data exploration,
and visualization into one tool. Finally, Spyder tries to mimic the behavior of RStudio to bring its
functionality to the Python world. Although not on par with the functionality of RStudio, data
scientists consider it an alternative; in the Python world, however, Jupyter is used more. Spyder
integrates code, documentation, and
visualizations, among others, into a single canvas. Sometimes your data doesn’t fit into a single
computer’s storage or main memory capacity. Therefore, cluster execution environments exist. The
widely used Apache Spark is among the most active Apache projects and is used across all
industries, including many Fortune 500 companies. The key property of Apache Spark is linear
scalability. This means that if you double the number of servers in a cluster, you’ll roughly double its
performance. Apache Flink was developed after Apache Spark continued to gain market share. The
key difference between Apache Spark and Apache Flink is that Apache Spark is a batch data
processing engine, capable of processing vast amounts of data one by one or file by file. Whereas
Apache Flink is a stream-processing engine with its main focus on processing real-time data
streams. Although both engines support both data processing paradigms, Apache
Spark is the choice for most use cases. After Apache Spark and Apache Flink, Ray is one of the
latest developments in the data science execution environments and has a clear focus on
large-scale deep learning model training. Let’s look at open-source tools for data scientists, which
are fully integrated and visual. This means no programming knowledge is necessary. The tools
support a subset of important tasks that include data integration and transformation, data
visualization, and model building. KNIME originated from the University of Konstanz in 2004. As you
can see, KNIME has a visual user interface with drag-and-drop capabilities. It has built-in
visualization capabilities. In addition, it can be extended by programming in R and Python and even
has connectors to Apache Spark. Orange is another representative of this group of tools. It is less
flexible than KNIME but is easier to use. In this video, you have learned about the most common
tasks in data science and which open-source tools are relevant.

Congratulations! You have completed this module. At this point in the course, you know:
● The Data Science Task Categories include:
○ Data Management - storage, management and retrieval of data
○ Data Integration and Transformation - streamline data pipelines and automate data
processing tasks
○ Data Visualization - provide graphical representation of data and assist with
communicating insights
○ Modelling - enable Building, Deployment, Monitoring and Assessment of Data and
Machine Learning models
● Data Science Tasks support the following:
○ Code Asset Management - store & manage code, track changes and allow
collaborative development
○ Data Asset Management - organize and manage data, provide access control, and
backup assets
○ Development Environments - develop, test and deploy code
○ Execution Environments - provide computational resources and run the code

The data science ecosystem consists of many open source and commercial options and includes
both traditional desktop applications and server-based tools, as well as cloud-based services that
can be accessed using web browsers and mobile interfaces.
Data Management Tools: include Relational Databases, NoSQL Databases, and Big Data
platforms:
● MySQL, and PostgreSQL are examples of Open Source Relational Database Management
Systems (RDBMS), and IBM Db2 and SQL Server are examples of commercial RDBMSes
and are also available as Cloud services.
● MongoDB and Apache Cassandra are examples of NoSQL databases.
● Apache Hadoop and Apache Spark are used for Big Data analytics.
Data Integration and Transformation Tools: include Apache Airflow and Apache Kafka.
Data Visualization Tools: include commercial offerings such as Cognos Analytics, Tableau and
PowerBI and can be used for building dynamic and interactive dashboards.
Code Asset Management Tools: Git is an essential code asset management tool. GitHub is a
popular web-based platform for storing and managing source code. Its features make it an ideal tool
for collaborative software development, including version control, issue tracking, and project
management.
Development Environments: Popular development environments for Data Science include Jupyter
Notebooks and RStudio.
● Jupyter Notebooks provides an interactive environment for creating and sharing code,
descriptive text, data visualizations, and other computational artifacts in a web-browser
based interface.
● RStudio is an integrated development environment (IDE) designed specifically for working
with the R programming language, which is a popular tool for statistical computing and data
analysis.

Python

Welcome to “Introduction to Python”. After watching this video, you will be able to identify the users
of Python. List the benefits of using Python. Describe the diversity and inclusion efforts of the Python
community. Python is a powerhouse of a language. It is the most widely used and most popular
programming language used in the data science industry. According to the 2019 Kaggle Data
Science and Machine Learning Survey, ¾ of the over 10,000 respondents worldwide reported that
they use Python regularly. Glassdoor reported that in 2019 more than 75% of data science positions
listed included Python in their job descriptions. When asked which language an aspiring data
scientist should learn first, most data scientists say Python. Let’s start with the people who use
Python. If you already know how to program, then Python is great for you because it uses clear and
readable syntax. You can develop the same programs as in other languages with less code using
Python. For beginners, Python is a good language to start with because of the huge global
community and wealth of documentation. Several different surveys done in 2019 established that
over 80% of data professionals use Python worldwide. Python is useful in many areas including data
science, AI and machine learning, web development, and Internet of Things (IoT) devices, like the
Raspberry Pi. Large organizations that heavily use Python include IBM, Wikipedia, Google, Yahoo!,
CERN, NASA, Facebook, Amazon, Instagram, Spotify, and Reddit. Python is widely supported by a
global community and shepherded by the Python Software Foundation. Python is a high-level,
general-purpose programming language that can be applied to many different classes of problems. It
has a large, standard library that provides tools suited to many different tasks including but not
limited to Databases, Automation, Web scraping, Text processing, Image processing, Machine
learning, and Data analytics. For data science, you can use Python's scientific computing libraries
like Pandas, NumPy, SciPy, and Matplotlib. For artificial intelligence, it has TensorFlow, PyTorch,
Keras, and Scikit-learn. Python can also be used for Natural Language Processing (NLP) using the
Natural Language Toolkit (NLTK). Another great selling point for Python is that the Python
community has a well-documented history of paving the way for diversity and inclusion efforts in the
tech industry as a whole. The Python language has a code of conduct executed by the Python
Software Foundation that seeks to ensure safety and inclusion for all, in both online and in-person
Python communities. Communities like PyLadies seek to create spaces for people interested in
learning Python in safe and inclusive environments. PyLadies is an international mentorship group
with a focus on helping more women become active participants and leaders in the Python
open-source community.

In this video, you learned that Python uses clear and readable syntax, and that Python has a huge
global community and a wealth of documentation. For data science, you can use Python's scientific
computing libraries like Pandas, NumPy, SciPy, and Matplotlib. Python can also be used for Natural
Language Processing (NLP) using the Natural Language Toolkit (NLTK). The Python community has a
well-documented history of paving the way for diversity and inclusion efforts in the tech industry as a
whole.
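
As a small illustration of the NLP capability mentioned above, NLTK can tokenize text in a few lines (a minimal sketch, assuming NLTK is installed and the punkt tokenizer data has been downloaded):

```python
# Minimal NLTK sketch: tokenizing a sentence into words.
# Assumes NLTK is installed; the "punkt" tokenizer data is fetched on first use.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)

text = "Python is a popular language for data science."
print(word_tokenize(text))
# ['Python', 'is', 'a', 'popular', 'language', 'for', 'data', 'science', '.']
```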

Welcome to “Introduction to R Language”.

After watching this video, you will be able to: compare open source with free software, identify the users of the R language, list the benefits of using R, and list the global communities for connecting with other R users. According to the results of the 2019 Kaggle Data Science survey, which had over ten thousand respondents worldwide, learning three languages can earn you a salary increment, and R offers many possibilities. Now, Python is open source, while R is free software. Let us compare open source and free software. The similarities: both are free to use, both commonly refer to the same set of licenses (for example, many open-source projects use the GNU General Public License), and both support collaboration. In many cases, these terms can be used interchangeably (but not in all). Now, let’s discuss the differences between open source and free software. The Open Source Initiative (OSI) champions open source, while the Free Software Foundation (FSF) defines free software. Open source is more business focused, while free software is more focused on a set of values. So, why R? You should learn R because it is free software: you can use the language in the same way that you contribute to open source, and it allows for private use, commercial use, and public collaboration. R is another language supported by a wide global community of people who want to use the language to solve big problems. Statisticians, mathematicians, and data miners use R to develop statistical software, graphing, and data analysis. R’s array-oriented syntax makes it easier to translate from math to code for learners with no or minimal programming background. According to Kaggle’s Data Science and Machine Learning survey, most programmers learn R a few years into their data science career, and R is mostly popular in academia. In addition, companies that use R include IBM, Google, Facebook, Microsoft, Bank of America, Ford, TechCrunch, Uber, and Trulia. R has become the world’s largest repository of statistical knowledge. As of 2018, R had more than 15,000 publicly released packages, making it possible to conduct complex exploratory data analysis. R integrates well with other computer languages like C++, Java, C, .Net, and Python. Using R, common mathematical operations like matrix multiplication give immediate results, and R has stronger object-oriented programming facilities than most statistical computing languages.



Now, there are many ways to connect with other R users around the globe. You can use communities such as useR, WhyR, SatRdays, and R-Ladies. In addition, you can check out the R Project website for R conferences and events.

In this video, you learned that: the Open Source Initiative (OSI) champions open source, while the Free Software Foundation (FSF) defines free software; Python is open source, and R is free software; R’s array-oriented syntax makes it easier to translate from math to code for learners with no or minimal programming background; and R has become the world’s largest repository of statistical knowledge.

SQL – Structured Query Language

Welcome to “Introduction to SQL”. After watching this video, you will be able to: explain SQL and
relational databases, define the SQL elements, and list the benefits of using SQL. SQL is a bit
different than the other languages. Officially it is pronounced as “ess cue el” though some call it
“sequel”. And while the acronym stands for “Structured Query Language”, many people consider
SQL different from other software development languages because it is a non-procedural language.
Its scope is limited to querying and managing data. While it is not a “Data Science” language, data
scientists regularly use it because it is simple and powerful! Some other facts about SQL are that it is
older than Python and R by about 20 years. It first appeared in 1974 and was developed at IBM! This
language is useful in handling structured data, which is the data incorporating relations among
entities and variables. SQL was designed for managing data in relational databases. Here you can
see a diagram showing the general structure of a relational database. A relational database is
formed by collections of two-dimensional tables, for example, datasets and Excel spreadsheets.
Each of these tables is then formed by a fixed number of columns and any possible number of rows.
However, although SQL was originally developed for use with relational databases, because of its
pervasiveness and ease of use, SQL interfaces have also been developed for many NoSQL and big
data repositories. The SQL language is subdivided into several language elements, including:
Clauses, Expressions, Predicates, Queries, and Statements. So, what makes SQL great? Knowing
SQL will help you get many different jobs in data science, such as a business and data analyst. This
knowledge is also a must in data engineering. When performing operations with SQL, the data is
accessed directly, without needing to copy the data separately, which can considerably speed up
workflow executions. SQL behaves like an interpreter between you and the database. SQL is an
American National Standards Institute (or ANSI) standard, which means if you learn SQL and use it
with one database, you can apply your SQL knowledge to many other databases easily. Now, many
different SQL databases are available, including the following: MySQL, IBM DB2, PostgreSQL,
Apache Open Office Base, SQLite, Oracle, MariaDB, Microsoft SQL Server, and more. The syntax of
the SQL you write may change based on the relational database management system you are
using. If you want to learn SQL, you should focus on a specific relational database and then plug into
the community for that specific platform. In addition, there are many available great introductory
courses on SQL! In this video, you learned that: SQL is different from other software development
languages because it is a non-procedural language. SQL’s scope is limited to querying and
managing data. SQL was designed for managing data in relational databases. SQL behaves like an
interpreter between you and the database. And if you learn SQL and use it with one database, you
can apply your SQL knowledge to many other databases easily.
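
As a minimal sketch of these ideas, the example below uses Python's built-in SQLite driver and a hypothetical weather table; the SELECT syntax carries over, with minor dialect differences, to the other relational databases listed above:

```python
# Minimal sketch: running SQL from Python against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A two-dimensional table: a fixed number of columns, any number of rows.
cur.execute("CREATE TABLE weather (station TEXT, temperature REAL, humidity REAL)")
cur.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [("JFK", 21.5, 0.63), ("JFK", 22.5, 0.58), ("LGA", 19.5, 0.71)],
)

# A query built from clauses, expressions, and predicates.
cur.execute("SELECT station, AVG(temperature) FROM weather GROUP BY station")
print(cur.fetchall())  # [('JFK', 22.0), ('LGA', 19.5)]
conn.close()
```
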
Other Languages
Welcome to “Other Languages for Data Science”. After watching this video, you will be able to
review other languages like Java, Scala, C++, JavaScript, and Julia, and explore how each is used
in Data Science Previously, we reviewed Python, R, and SQL. In this lesson, we will review some
other languages that have compelling use cases for data science. Scala, Java, C++, and Julia are
probably the most traditional data science languages. However, JavaScript, PHP, Go, Ruby, Visual
Basic and many others have found their place in the data science community. Let us go through
some notable highlights about a few of them. Java is a general-purpose tried and tested
object-oriented programming language. It has huge adoption in the enterprise space and was
designed to be fast and scalable. Java applications are compiled to bytecode and run on the Java
Virtual Machine or JVM. Some notable data science tools built with Java include: Weka for data
mining, Java-ML for machine learning, Apache MLlib, which makes machine learning scalable, and
Deeplearning4j for deep learning. Hadoop is another application of Java; it manages data
processing and storage for big data applications running in clustered systems. Scala is a
general-purpose programming language that provides support for functional programming and has a
strong static type system. The Scala language was constructed to address the shortcomings of
Java. It is also inter-operable with Java as it runs on the JVM. The name Scala is a combination of
scalable and language. This language is designed to evolve with the requirements of its users. For
data science, the most popular program built with Scala is Apache Spark. Spark is a fast and
general-purpose cluster computing system that provides APIs, which make parallel jobs easy to
write. It has an optimized engine that supports general computation graphs. Spark includes Shark,
which is a query engine, MLlib for machine learning, GraphX for graph processing, and Spark
Streaming. It was designed to be faster than Hadoop. C++ is a general-purpose programming
language. It is an extension of the C programming language or "C with Classes.” C++ improves
processing speed, enables system programming, and provides broader control over the software
application. Many organizations that use Python or other high-level languages for data analysis and
exploratory tasks rely on C++ to develop programs that feed data to customers in real-time. For data
science, TensorFlow is a popular Deep Learning library for dataflow that was built with C++.
Although C++ is the foundation of TensorFlow, it provides a Python interface, so users don’t require
the knowledge of C++ to run it. MongoDB is a NoSQL database for big data management that was
built with C++. Caffe is a deep learning algorithm repository built with C++ with Python and Matlab
bindings. A core technology for the world wide web, JavaScript is a general-purpose language that
extended beyond the browser with the creation of Node.js and other server-side approaches.
Javascript is NOT related to the Java language. For Data Science, undoubtedly TensorFlow.js is the
most popular implementation. TensorFlow.js makes machine learning and deep learning possible in
Node.js as well as in the browser. TensorFlow.js was also adopted by other open-source libraries
including brain.js and machinelearn.js. Another implementation of JavaScript for Data Science is
R-js. The project R-js has re-written linear algebra specifications from the R Language into
TypeScript. This sets the foundation for future projects to implement more powerful math-based
frameworks like Python’s NumPy and SciPy. TypeScript is a superset of JavaScript. Finally, Julia
was designed at MIT for high-performance numerical analysis and computational science. Julia
provides speedy development like Python or R, while producing programs that run as fast as C or
Fortran programs. It’s compiled which means that Julia code is executed directly on the processor as
executable code. It calls C, Go, Java, MATLAB, R, Fortran, and Python libraries, and has refined
parallelism. Julia as a language is only 8 years old, written in 2012, but there is a lot of promise for
its future impact on the data science industry. One great application of Julia for Data Science is
JuliaDB, which is a package for working with large persistent data sets. In this video, you learned
that data science tools built with Java include Weka, Java-ML, Apache MLlib, and Deeplearning4j.
For data science, a popular program built with Scala is Apache Spark, that includes Shark, MLlib,
GraphX, and Spark Streaming. For data science, TensorFlow, MongoDB and Caffe were built with
C++. Programs built for Data Science with JavaScript include TensorFlow.js and R-js. One great
application of Julia for Data Science is JuliaDB.

Summary
● You should select a language to learn depending on your needs, the problems you are trying
to solve, and whom you are solving them for.
● The popular languages are Python, R, SQL, Scala, Java, C++, and Julia.
● For data science, you can use Python's scientific computing libraries like Pandas, NumPy,
SciPy, and Matplotlib.
● Python can also be used for Natural Language Processing (NLP) using the Natural
Language Toolkit (NLTK).
● Python is open source, and R is free software.
● R language’s array-oriented syntax makes it easier to translate from math to code for
learners with no or minimal programming background.
● SQL is different from other software development languages because it is a non-procedural
language.
● SQL was designed for managing data in relational databases.
● If you learn SQL and use it with one database, you can apply your SQL knowledge to
many other databases easily.
● Data science tools built with Java include Weka, Java-ML, Apache MLlib, and
Deeplearning4j.
● For data science, a popular program built with Scala is Apache Spark, which includes Shark,
MLlib, GraphX, and Spark Streaming.
● Programs built for Data Science with JavaScript include TensorFlow.js and R-js.
● One great application of Julia for Data Science is JuliaDB.
Packages, APIs, Datasets, and Modules
Welcome to “Libraries for Data Science.” After watching this video, you will be able to: list the
scientific computing libraries in Python, list the visualization libraries in Python, list the high-level
machine learning and deep learning libraries, and list the libraries used in other languages. In this
video, you will review several data science libraries. Libraries are a collection of functions and
methods that allow you to perform many actions without writing the code. We will focus on the
following: scientific computing libraries in Python; visualization libraries in Python; high-level
machine learning and deep learning libraries (high-level means you don’t have to worry about
implementation details, although that can make studying or improving a library difficult); deep
learning libraries in Python; and libraries used in other languages. Now, scientific computing
libraries contain built-in
modules providing different functionalities, which you can use directly. They are also called
frameworks. For example, Pandas offers data structures and tools for effective data cleaning,
manipulation, and analysis. It provides tools to work with different types of data. The primary
instrument of Pandas is a two-dimensional table consisting of columns and rows, called a Data
Frame. Pandas also provides easy indexing so you can work with your data. The NumPy library is
based on arrays and matrices, allowing you to apply mathematical functions to the arrays. Pandas is
built on top of NumPy. You use data visualization methods to communicate with others and display
meaningful results of an analysis. These libraries enable you to create graphs, charts, and maps.
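
The following short sketch ties together the libraries just described: a Pandas DataFrame, a NumPy calculation on one of its columns, and a Matplotlib plot (a hypothetical example with made-up numbers):

```python
# Minimal sketch combining Pandas, NumPy, and Matplotlib with made-up data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# A DataFrame: the two-dimensional table at the heart of Pandas.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "sales": [120, 135, 160, 155],
})

# Pandas columns are built on top of NumPy arrays, so NumPy functions apply directly.
df["sales_log"] = np.log(df["sales"])

# Matplotlib turns the result into an easily customizable chart.
plt.plot(df["month"], df["sales"], marker="o")
plt.title("Monthly sales")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()
```
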
The Matplotlib package is the most well-known library for data visualization. It is popular for
making graphs and plots, and the graphs are easily customizable. Another high-level visualization
library is Seaborn. It is based on matplotlib. This library generates heat maps, time series, and violin
plots. Now, for machine learning, the Scikit-learn library contains tools for statistical modeling,
including regression, classification, clustering, and so on. It is built on NumPy, SciPy, and matplotlib.
It is simple to get started. In this high-level approach, you define the model and specify the
parameter types you want to use. For building deep learning models, Keras allows you to build the
standard deep learning model. Like Scikit, the high-level interface allows you to build models in a
quick, simple manner. It can run on graphics processing units (GPUs), but in many deep learning
cases a lower-level environment is necessary. TensorFlow is a low-level framework used in
the large-scale production of deep learning models. It's designed for production and deployment but
can be unwieldy for experimentation. Pytorch is used for experimentation, making it simple for
researchers to test ideas. Apache Spark is a general-purpose cluster-computing framework allowing
you to process data using compute clusters. The data is processed in parallel in more than one
computer simultaneously. The Spark library has functionality similar to Pandas, NumPy, and
Scikit-learn. Apache Spark data processing jobs can be written in Python, R, Scala, and SQL.
There are many Scala libraries; Scala is predominately used in data engineering and data science.
Let’s discuss some libraries that are complementary to Spark. Vegas is a Scala Library for statistical
data visualizations. With Vegas, you can work with data files as well as Spark Data Frames. For
deep learning, you can use BigDL. R has built-in functionality for machine learning and data
visualization, but there are also complementary libraries. ggplot2 is a popular library for data
visualization in R. You can also use libraries that allow you to interface with Keras and TensorFlow.
R was the de facto standard for open-source data science, but Python has now superseded it. In
this video, you learned that: Libraries usually contain built-in modules providing different
functionalities that can be used directly. You can use data visualization methods to communicate
with others and display meaningful results of an analysis. For machine learning, the Scikit-learn
library contains tools for statistical modeling, including regression, classification, clustering and so
on. TensorFlow is a low-level framework used in large-scale production of deep learning models.
And, Apache Spark is a general-purpose cluster-computing framework allowing you to process data
using compute clusters.

APIs
Welcome to “Application Program Interfaces (API).” After watching this video, you will be able to
define an API, list API libraries, and define REST API in relation to request and response. An
application programming interface (API) allows communication between two pieces of software. For
example, in a program, you have some data and other software components. You use the API to
communicate using inputs and outputs without knowing what happens at the backend. The API only
refers to the interface. It is the part of the library you see while it contains all the program
components. To further understand how an API works in a library, consider an example of the
Pandas library. Pandas is a set of software components where not all components are written in
Python. In your program, there is some data and a set of software components. You can use the
Pandas API to process the data by communicating with the other software components. The
software component at the back end can be the same, but there can be an API for different
languages. Consider TensorFlow, whose backend is written in C++; it has APIs for other
languages, such as Python, JavaScript, C++, Java, and Go. Thus, the API is just the interface.
Other volunteer-developed APIs for TensorFlow include Julia, MATLAB, R, Scala, and many more. So, REST
APIs are another popular type of API. The RE stands for Representational. The S stands for State.
The T stands for Transfer. They allow you to communicate through the internet and take advantage
of resources like storage, data, artificially intelligent algorithms, and much more. In a REST API, your
program is the client. The API communicates with a web service that you call through the internet.
There are rules regarding communication, input (request), and output (response). So,
let’s look at some common terms used with regards to API. You or your code is the client. The web
service is the resource. And the client finds the service via an endpoint. The client sends requests to
the resource and receives a response from the resource. Data is transmitted over the internet using
HTTP methods. REST APIs get all the information from the request sent by the client. The
request is sent using an HTTP message that contains a JSON file. The file contains instructions for
what operation is to be performed by the web service. This operation is transmitted to the web
service via the internet. And the service performs the operation. Similarly, the web service returns a
response through an HTTP message, where the information is returned using a JSON file. And this
information is transmitted back to the client. Now, one example of a REST API is the Watson Speech
to Text API, which converts speech to text. In the API call, you send a copy of the audio file
to the API; this is called a POST request. The API then sends back the text transcription of what the
individual is saying. At the backend, the API is making a GET request. And finally, let’s look at another
example, the Watson language-translator API. You send the text you would like to translate into
Watson language-translator API. The API will translate the text and send the translation back to you.
In this case, the API translates English to Spanish.
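
The request/response flow described above can be sketched in Python with the requests package; the endpoint URL and API key below are placeholders rather than a real service:

```python
# Hypothetical sketch of a REST API call: the client sends a JSON request over HTTP
# and receives a JSON response. The URL and key are placeholders, not a real service.
import requests

endpoint = "https://api.example.com/v1/translate"            # hypothetical endpoint
payload = {"text": "Hello, world!", "source": "en", "target": "es"}
headers = {"Authorization": "Bearer YOUR_API_KEY"}            # placeholder credential

response = requests.post(endpoint, json=payload, headers=headers)  # POST request
response.raise_for_status()                                        # raise on HTTP errors
print(response.json())                                             # parsed JSON response
```
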
In this video, you learned that an application programming interface (API) allows communication between
two pieces of software, An API is the part of the library you see while the library contains all the
components of the program. And REST APIs allow you to communicate through the internet and
take advantage of resources like storage, data, artificially intelligent algorithms, and much more.

Data Sets – Powering Data Science


Welcome to “Data Sets – Powering Data Science.” After watching this video, you will be able to
define a data set, describe the types of data ownership, list the sources of data, and describe the
Community Data License Agreement. Let’s first define what a dataset is. A data set is a structured
collection of data. Data embodies information represented as text, numbers, or media such as
images, audio, or video files. A tabular data set comprises a collection of rows containing columns
that store the information. One popular tabular data format is "comma separated values," or CSV. A
CSV file is a delimited text file where each line represents a row, and a comma separates data
values. For example, imagine a dataset of observations from a weather station. Each row represents
an observation at a given time, while each column contains information about that observation, such
as the temperature, humidity, and other weather conditions. Hierarchical or network data structures
are typically used to represent relationships between data. Hierarchical data is organized in a
tree-like format, whereas network data is stored as a graph. For example, the connections between
people on a social networking website are often represented as a graph. A data set might also
include raw data files, such as images or audio. The Modified National Institute of Standards and
Technology (MNIST) dataset is popular for data science. It contains images of handwritten digits and
is commonly used to train image processing systems. Traditionally, most data sets were private
because they contained proprietary or confidential information such as customer data, pricing data,
or other commercially sensitive information. These datasets are typically not shared publicly. Over
time, many public and private entities such as scientific institutions, governments, organizations, and
even companies have started making data sets available to the public as “open data,” providing free
information. For example, the United Nations and federal and municipal governments worldwide
have published many datasets on their websites, covering the economy, society, healthcare,
transportation, the environment, and much more. Access to these and other open datasets enables
data scientists, researchers, analysts, and others to uncover previously unknown and potentially
valuable insights. They are used to create new applications for commercial purposes and the public
good. They are also used to carry out further research. Open data has played a significant role in the
growth of data science, machine learning, and artificial intelligence. It has allowed practitioners to
hone their skills in various data sets. There are many open data sources on the internet. You can
find a comprehensive list of available data portals worldwide on the Open Knowledge Foundation’s
datacatalogs.org website. The United Nations, the European Union, and many other governmental
and intergovernmental organizations maintain data repositories providing access to a wide range of
information. On Kaggle, a popular data science online community, you can find (and contribute) data
sets that might be of general interest. Google provides a search engine that might help you find data
sets that could be of value to you. Open data distribution and use might be restricted, as defined by
certain licensing terms. Without a license for open data distribution, many data sets were shared in
the past under open-source software licenses. These licenses were not designed to cover specific
considerations related to the distribution and use of data sets. To address the issue, the Linux
Foundation created the Community Data License Agreement, or CDLA. Two licenses were initially
created for sharing data: CDLA-Sharing and CDLA-Permissive. The CDLA-Sharing license grants
you permission to use and modify the data. The license stipulates that if you publish your modified
version of the data, you must do so under the same license terms as the original data. The
CDLA-Permissive license also grants you permission to use and modify the data. However, you are
not required to share changes to the data. Note that neither license imposes any restrictions on
results you might derive by using the data, which is important in data science. Let’s say, for example,
that you are building a model that performs a prediction. If you are training the model using
CDLA-licensed data sets, you are under no obligation to share the model or to share it under a
specific license if you choose to share it. In this video, you’ve learned that open data is fundamental
to data science, that the Community Data License Agreement makes it easier to share open data,
and that open datasets might not meet enterprise requirements due to the impact they might have
on the business.

Reading: Additional Sources of Datasets

In this reading, you will learn about:

● Open datasets and sources


● Proprietary datasets and sources
● Dataset license

Open datasets and sources

In this data-driven world, some datasets are freely available for anyone to access, use, modify,
and share. These are called open datasets.
Open datasets include a public license and are very useful for your journey as a Data Scientist.
Some of the most informative open dataset sources are listed below.

Government Data:

● https://www.data.gov/
● https://www.census.gov/data.html
● https://data.gov.uk/
● https://www.opendatanetwork.com/
● https://data.un.org/

Financial Data Sources:

● https://data.worldbank.org/
● https://www.globalfinancialdata.com/
● https://comtrade.un.org/
● https://www.nber.org/
● https://fred.stlouisfed.org/

Crime Data:
● https://www.fbi.gov/services/cjis/ucr
● https://www.icpsr.umich.edu/icpsrweb/content/NACJD/index.html
● https://www.drugabuse.gov/related-topics/trends-statistics
● https://www.unodc.org/unodc/en/data-and-analysis/

Health Data:

● https://www.who.int/gho/database/en/
● https://www.fda.gov/Food/default.htm
● https://seer.cancer.gov/faststats/selections.php?series=cancer
● https://www.opensciencedatacloud.org/
● https://pds.nasa.gov/
● https://earthdata.nasa.gov/
● https://www.sgim.org/communities/research/dataset-compendium/public-datasets-topic-grid

Academic and Business Data:

● https://scholar.google.com/
● https://nces.ed.gov/
● https://www.glassdoor.com/research/
● https://www.yelp.com/dataset

Other General Data:

● https://www.kaggle.com/datasets
● https://www.reddit.com/r/datasets/

Proprietary datasets and sources

Proprietary datasets contain data primarily owned and controlled by specific individuals or
organizations. This data is limited in distribution because it is sold with a licensing agreement.
Some data from private sources cannot be easily disclosed, like public data.

National security data and geological, geophysical, and biological data are examples of proprietary
data. Copyright laws or patents usually bind this type of data. Proprietary datasets that mainly
contain sensitive information are less widely available than open datasets.

Some standard proprietary dataset sources are listed below.


Health Care:

https://www.sgim.org/communities/research/dataset-compendium/proprietary-datasets

Financial Market data:

https://datarade.ai/data-categories/proprietary-market-data

Google Cloud based datasets:

https://cloud.google.com/datasets

Dataset licenses

When you select a dataset, it is necessary to look into the license. A license explains whether you
can use that dataset or not, and whether you have to accept certain guidelines to use it.
The different license types are listed below.

1. PUBLIC DOMAIN MARK - PUBLIC DOMAIN


When a dataset has a Public Domain license, all the rights to use, access, modify and
share the dataset are open to everyone. Here there is technically no license.
2. OPEN DATA COMMONS PUBLIC DOMAIN DEDICATION AND LICENSE –
PDDL
Open Data Commons license has the same features as the Public Domain license, but the
difference is the PDDL license uses a licensing mechanism to give the rights to the
dataset.
3. CREATIVE COMMONS ATTRIBUTION 4.0 INTERNATIONAL CC-BY
This license allows users to share and modify a dataset, but only if they give credit to the
creator(s) of the dataset.
4. COMMUNITY DATA LICENSE AGREEMENT – CDLA PERMISSIVE-2.0
Like most open-source licenses, this license allows users to use, modify, adapt, and share
the dataset, but only if a disclaimer of warranties and liability is also included.
5. OPEN DATA COMMONS ATTRIBUTION LICENSE - ODC-BY
This license allows users to share and adapt a dataset, but only if they give credit to the
creator(s) of the dataset.
6. CREATIVE COMMONS ATTRIBUTION-SHAREALIKE 4.0 INTERNATIONAL
- CC-BY-SA
This license allows users to use, share, and adapt a dataset, but only if they give credit to
the dataset and show any changes or transformations they made to the dataset. Users
might not want to use this license because they have to share the work they did on the
dataset.
7. COMMUNITY DATA LICENSE AGREEMENT – CDLA-SHARING-1.0
This license uses the principle of ‘copyleft’: users can use, modify, and adapt a dataset,
but only if they don’t add license restrictions on the new work(s) they create with the
dataset.
8. OPEN DATA COMMONS OPEN DATABASE LICENSE - ODC-ODBL
This license allows users to use, share, and adapt a dataset but only if they give credit to
the dataset and show any changes or transformations they make to the dataset. Users
might not want to use this license because they have to share the work they did on the
dataset.
9. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL 4.0
INTERNATIONAL - CC BY-NC
This license is a restrictive license. Users can share and adapt a dataset, provided they
give credit to its creator(s) and ensure that the dataset is not used for any commercial
purpose.
10. CREATIVE COMMONS ATTRIBUTION-NO DERIVATIVES 4.0
INTERNATIONAL - CC BY-ND
This license is also a restrictive license. Users can share a dataset if they give credit to its
creator(s). This license does not allow additions, transformations, or changes to the
dataset.
11. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE
4.0 INTERNATIONAL - CC BY-NC-SA
This license allows users to share a dataset only if they give credit to its creator(s). Users
can share additions, transformations, or changes to the dataset, but they cannot use the
dataset for commercial purposes.
12. CREATIVE COMMONS
ATTRIBUTION-NONCOMMERCIAL-NODERIVATIVES 4.0 INTERNATIONAL
- CC BY-NC-ND
This license allows users to share a dataset only if they give credit to its creator(s). Users
are not allowed to modify the dataset and are not allowed to use it for commercial
purposes.

Note: Additional license types exist. Any dataset you use will include details about its license.
Welcome to “Sharing Enterprise Data – Data Asset eXchange”
After watching this video, you will be able to: Navigate around IBM's open data repository, the Data
Asset eXchange. Explore open data sets on the Data Asset eXchange. Identify the notebook
associated with a data set in Watson Studio.
There are many open data sets available to the public, but it can be difficult to find data sets that are
both high quality and have clearly defined license and usage terms. To help solve this challenge,
IBM created the Data Asset eXchange, or "DAX”. DAX provides a curated collection of open data
sets, both from IBM Research and trusted third-party sources. These data sets are ready for use in
enterprise applications, with a wide variety of application types, including images, video, text, and
audio. DAX aims to foster data sharing and collaboration by keeping data sets available under a
Community Data License Agreement (or CDLA). DAX makes it easier for developers to get started
with data sets because it provides a single place to access unique, high-quality data sets from
trusted sources like IBM Research. It also provides tutorial notebooks that walk through the basics of
data cleaning, pre-processing, and exploratory analysis. Certain data sets include advanced
notebooks that explain how to perform more complex tasks, like creating charts, training
machine-learning models, integrating deep learning via the Model Asset eXchange, and running
statistical analysis and time-series analysis. The Data Asset eXchange and the Model Asset
eXchange are both available on the IBM Developer website. With these resources, developers can
create end-to-end analytic and machine learning workflows and consume open data and models
with confidence under clearly defined license terms. Now, let’s explore the Data Asset eXchange.
Open https://developer.ibm.com/ in your web browser. Then select “Open Source at IBM.” From the
drop-down, select “Data Asset eXchange.” In the Data Asset eXchange, multiple open data sets are
available for you to explore. Let’s say you’ve found a data set that might be very interesting to you:
the “NOAA Weather Data - JFK Airport” data set, which contains data from a weather station at the
John F. Kennedy Airport in New York. On this data set page, you can click “Get this data set” to
download the NOAA data set from cloud storage, “Run data set notebooks” to access the
notebooks associated with the data set in Watson Studio, and “Preview the data and notebooks” to
explore DAX metadata, the glossary, and the notebooks.
Most data sets on DAX are complemented by one or more Notebooks. Click assets to view all the
Jupyter Notebooks and data available. You can then click the source code to view all the notebooks
associated with your NOAA project. You can execute all the notebooks in Watson studio to perform
data cleaning, pre-processing, and exploratory analysis. If you are already familiar with opening the
notebooks in Watson studio, you can log into your IBM Cloud account, create a project, and load all
the notebooks into the project. Data sets on DAX also consist of one or more data files. Click the
Data option to view the data files available in your project. In this video, you learned that the IBM
Data Asset eXchange (DAX) site contains high-quality open data sets; DAX open data sets include
tutorial notebooks that provide basic and advanced walkthroughs for developers; DAX and MAX are
available on the IBM Developer website; and you can get, run, and preview data sets and notebooks
on DAX, with DAX notebooks opened in Watson Studio.
Welcome to “Machine Learning Models – Learning from models to make predictions.” After watching this video, you will be able to define a machine learning model, describe the different learning model types, and describe how to use a learning model to solve a problem. Data contains a wealth of information that can be used to solve certain types of problems. Traditional data analysis approaches can be a person manually inspecting the data or a specialized computer program that automates the human analysis. These approaches reach their limits due to the amount of data to be analyzed or the complexity of the problem. Machine learning (ML) uses algorithms – also known as “models” – to identify patterns in the data. The process by which the model learns these patterns from data is called “model training.”



Once a model is trained, it can then be used to make predictions. When the model is presented with new data, it tries to make predictions or decisions based on the patterns it has learned from past data. Machine learning models can be divided into three basic classes: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. The most commonly used type of machine learning is Supervised Learning. In Supervised Learning, a human provides input data and the correct outputs. The model tries to identify relationships and dependencies between the input data and the correct output. This type of learning comprises two types of models: regression and classification. Regression models are used to predict a numeric (or “real”) value. For example, given information about past home sales, such as geographic location, size, number of bedrooms, and sales price, you can train a model to predict the estimated sales price for other homes with similar characteristics. Classification models are used to predict whether some information or data belongs to a category (or “class”). For example, given a set of emails along with a designation of whether or not they are spam, you can train an algorithm to identify unsolicited emails.

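A minimal sketch of the supervised (regression) idea above, using scikit-learn and made-up home-sale numbers:

```python
# Minimal supervised-learning sketch (regression) with scikit-learn and made-up data.
from sklearn.linear_model import LinearRegression

# Input features: [size in square meters, number of bedrooms]; labels: sale price.
X = [[70, 2], [90, 3], [120, 3], [150, 4]]
y = [200_000, 260_000, 330_000, 400_000]

# Model training: the model learns the relationship between inputs and correct outputs.
model = LinearRegression().fit(X, y)

# Prediction for a new, unseen home with similar characteristics.
print(model.predict([[100, 3]]))
```
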
In Unsupervised Learning, the data is not labeled by a human. The models must analyze the data and try to identify patterns and structure within it based on its characteristics. Clustering is an example of this learning style. Clustering models are used to divide the records of a dataset into groups of similar records. An example of a clustering model could be providing purchase recommendations for an e-commerce store based on past shopping behavior and the contents of a shopping basket. Another example is anomaly detection, which identifies outliers in a dataset, such as fraudulent credit card transactions or suspicious online log-in attempts.

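A comparable unsupervised sketch, clustering unlabeled shopping records with scikit-learn's k-means (again with made-up numbers):

```python
# Minimal unsupervised-learning sketch (clustering) with scikit-learn and made-up data.
from sklearn.cluster import KMeans

# Unlabeled records: [number of purchases, average basket value]; no correct answers given.
X = [[2, 15.0], [3, 18.0], [25, 210.0], [30, 250.0], [5, 22.0], [28, 230.0]]

# The model groups similar records together on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster assignment for each record, e.g. [0 0 1 1 0 1]
```
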
And the third type of learning, Reinforcement Learning, is loosely based on the way human beings and other organisms learn. Think about a mouse in a maze: if the mouse gets to the end of the maze, it gets a piece of cheese. This is the “reward” for completing a task. The mouse learns through trial and error how to get through the maze to get as much cheese as it can. In a similar way, a reinforcement learning model learns the best set of actions to take, given its current environment, to get the most rewards over time. This type of learning has recently been very successful in beating the best human players in games such as Go, chess, and popular strategy video games.

Deep learning is a specialized type of machine learning. It refers to a general set of models and techniques that loosely emulate the way the human brain solves a wide range of problems. It is commonly used to analyze natural language (both spoken and text), images, audio, and video, to forecast time series data, and much more. Deep learning has recently been very successful in these and other areas and hence is becoming an increasingly popular and important tool for data science. It requires large datasets of labeled data to train a model, is compute intensive, and usually requires special-purpose hardware to achieve acceptable training times. You can build a custom deep learning model from scratch or use pre-trained models from public model repositories. Deep learning models are implemented using popular frameworks such as TensorFlow, PyTorch, and Keras. These frameworks provide a Python API, and many support other programming languages, such as C++ and JavaScript. You can download pre-trained state-of-the-art models from repositories that are commonly referred to as model zoos. Popular model zoos include those provided by TensorFlow, PyTorch, Keras, and ONNX. Models are also published by academic and commercial research groups.

commercial research groups. Let’s briefly outline the high-level tasks involved in building a model

using an example. Assume you want to enable an application to identify objects in images by

training a deep learning model. First, you collect and prepare data that will be used to train a model.

Data preparation can be a time-consuming and labor-intensive process. In order to train a model to detect
objects in images, you need to label the raw training data. For example, you can draw bounding boxes around
objects and label them. Next, you build a model from scratch or select an existing model that might be well
suited for the task from a public or private resource. You can then train the model on your prepared data.
During training, your model learns from the labeled data how to identify objects that are depicted in an
image. Once training has completed, you analyze the training results and repeat the process until the
trained model’s performance meets your requirements. When the trained model performs as desired, you deploy
it to make it available to your applications.
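As a rough sketch of how the “use a pre-trained model” shortcut looks in code, the snippet below loads a
pre-trained image classification model (MobileNetV2) from the Keras model zoo and runs a prediction. This
is a simpler task than the object-detection example described above, and the file name "dog.jpg" is a
hypothetical local image used only for illustration.

# A hedged sketch of reusing a pre-trained model instead of training from scratch.
# This shows image classification (not full object detection); "dog.jpg" is a
# hypothetical local image file.
import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = MobileNetV2(weights="imagenet")                    # download pre-trained weights

img = image.load_img("dog.jpg", target_size=(224, 224))    # prepare (pre-process) the input
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)                                   # run the model
for _, label, score in decode_predictions(preds, top=3)[0]:
    print(f"{label}: {score:.2f}")                         # post-process the output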

In this video you learned that: Machine learning (ML) uses algorithms – also known as “models” – to
identify patterns in the data. The process by which the model learns data patterns is called “model
training”. Types of ML are Supervised, Unsupervised, and Reinforcement. Supervised learning comprises two
types of models, regression and classification. And deep learning refers to a general set of models and
techniques that loosely emulate the way the human brain solves a wide range of problems.

Model Asset eXchange


Welcome to “The Model Asset eXchange.” After watching this video, you will be able to navigate the
Model Asset Exchange from IBM Research, and explain how deep learning model-serving detects
images. The Model Asset eXchange, or “MAX”, on the IBM Developer platform, is a free open
source resource for deep learning models. The tasks needed to train a model from scratch require a
large amount of data, labor, time, and resources. Because of this, time to value can be quite long. To
reduce time to value, consider taking advantage of pre-trained models for certain types of problems.
These pre-trained models can be ready to use right away, or they might take less time to train.
Models are created by running data through a model using compute resources and domain
expertise. After the research, evaluation, test, train, and validate steps are complete, you will have a
validated model. The Model Asset eXchange is a free open source repository for ready-to-use and
customizable deep learning microservices. These microservices are configured to use pre-trained or
custom-trainable deep learning models to solve common business problems. These models have
been fully tested, and can be quickly deployed in local and cloud environments. All models in MAX
are available under permissive open source licenses, making it easier to use them for personal and
commercial purposes, which reduces the risk of legal liabilities. On MAX, you can find models for a
variety of domains, including: Object detection, Image, audio, video, and text classification, Named
entity recognition, Image to text translation, Human pose detection, and more. Let’s look at the
components of a typical model-serving microservice. Each microservice includes a pre-trained deep
learning model, code that pre-processes the input before it is analyzed by the model, code that
post-processes the model output, and a standardized public API that makes the services
functionality available to applications. Model-serving microservices are created by running inputs
through a validated model and then exposing the output through a REST API. After the implement, package,
document, and test steps are complete, you will have a model-serving microservice that can then be
deployed to a local machine or to a private, hybrid, or public cloud. MAX model-serving microservices are
built and distributed as open source Docker images. Docker is a container platform that makes it
easy to build and deploy applications. The Docker image source is published on GitHub and can be
downloaded and customized for use in personal and commercial environments. Use the Kubernetes
open source system to automate the deployment, scaling, and management of these Docker
images. Red Hat OpenShift is a popular enterprise-grade Kubernetes platform. It is available on IBM
Cloud, Google Cloud Platform, Amazon Web Services, and Microsoft Azure. Let’s explore some
machine learning models. Go to ml-exchange.org. Here you can view and use multiple predefined
models. We'll explore the predefined object detector model. This model will recognize objects in an
image because it consists of: a deep convolutional net base model for image feature extraction, and
added convolutional layers specialized in object detection. On the MAX object detector page, select
CodePen. CodePen is an online tool used by developers to edit front-end languages like HTML,
JavaScript, and CSS. You will be redirected to the CodePen page, where you can select MAX
Tensorflow.js model. This model is trained to identify objects in an image and assigns each pixel of
the image to a particular object. Here you can upload different images of a person, dog, cat, truck, or
car. The model was previously trained on labeled images, so now it can recognize images even
when they are not labeled. Select an image to see what happens when the model invokes the
prediction endpoint. Click on Extract prediction. This invokes the prediction endpoint, and the image
is uploaded. The prebuilt TFJS model prepares the input image for pre-processing. The deep
learning model algorithm identifies the different objects in the image. It generates its response using
the prediction results and returns the result to the application. You will see the existing image
separated into two different images: the background image and the image of the dog. The model test
is complete. You have confirmed that this model is able to identify items within an image without
using predefined labels. In this video, you learned: The Model Asset eXchange is a free open source
repository for ready-to-use and customizable deep learning microservices. To reduce time to value,
consider taking advantage of pre-trained models for certain types of problems. MAX model-serving
microservices are built and distributed on GitHub as open source Docker images. Red Hat
OpenShift is a Kubernetes platform used to automate deployment, scaling, and management of
microservices. Ml-exchange.org has multiple predefined models. The CodePen tool lets users edit
front-end languages.
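Outside of CodePen, the same kind of model-serving microservice can be called from Python over its REST
API. The sketch below assumes you have started a MAX model Docker image locally, that it listens on port
5000, and that it exposes a POST /model/predict endpoint accepting an image upload; the exact port, path,
and field names can differ per model, so check the model’s README. The file "dog.jpg" is a hypothetical
local image.

# A hedged sketch of calling a locally running MAX model-serving microservice.
# Assumed: the container listens on localhost:5000 and exposes POST /model/predict
# accepting an image file; check the specific model's README for the real details.
import requests

url = "http://localhost:5000/model/predict"
with open("dog.jpg", "rb") as f:                      # hypothetical local image
    response = requests.post(url, files={"image": f})

response.raise_for_status()
for prediction in response.json().get("predictions", []):
    print(prediction)                                 # e.g., label, probability, bounding box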

Congratulations! You have completed this module. At this point in the course, you know:
● Libraries usually contain built-in modules that provide different functionalities.
● You can use data visualization methods to communicate with others and display meaningful
results of an analysis.
● For machine learning, the Scikit-learn library contains tools for statistical modeling, including
regression, classification, clustering, and so on.
● Large-scale production of deep-learning models uses TensorFlow, a low-level framework.
● Apache Spark is a general-purpose cluster-computing framework that allows you to process
data using compute clusters.
● An application programming interface (API) allows communication between two pieces of
software.
● An API is the part of the library you see, while the library contains all the components of the program.
● REST APIs allow you to communicate through the internet and take advantage of resources
like storage, data, artificially intelligent algorithms, and much more.
● Open data is fundamental to Data Science.
● Community Data License Agreement makes it easier to share open data.
● The IBM Data Asset eXchange (DAX) site contains high-quality open data sets.
● DAX open data sets include tutorial notebooks that provide basic and advanced
walk-throughs for developers.
● DAX notebooks open in Watson Studio.
● Machine learning (ML) uses algorithms – also known as “models” – to identify patterns in the
data.
● Types of ML are Supervised, Unsupervised, and Reinforcement.
● Supervised learning comprises two types of models, regression and classification.
● Deep learning refers to a general set of models and techniques that loosely emulate the way
the human brain solves a wide range of problems.
● The Model Asset eXchange is a free, open-source repository for ready-to-use and
customizable deep-learning microservices.
● MAX model-serving microservices are built and distributed on GitHub as open-source
Docker images.
● You can use Red Hat OpenShift, a Kubernetes platform, to automate deployment, scaling,
and management of microservices.
● Ml-exchange.org has multiple predefined models.

Introduction to Jupyter Notebook


Welcome to “Introduction to Jupyter Notebooks.” After watching this video, you will be able to define
a Jupyter Notebook, explain how to use JupyterLab, and describe how to use the notebooks in
JupyterLab. Jupyter Notebooks originated as “IPython,” which was originally developed for Python
programming. Later, when it started supporting additional languages, it was renamed Jupyter, which
stands for Julia, Python, and R. However, it now supports many other languages. A Jupyter
Notebook is a browser-based application that allows you to create and share documents containing
code, equations, visualizations, narrative text, links, and more. It is like a scientist’s lab notebook,
where a scientist records all the steps of their experiments and the results so that they can be reproduced.
In the same way, a Jupyter Notebook allows a Data Scientist to record their data experiments and
results that others can reuse. Now a Jupyter Notebook file allows you to combine descriptive text,
code blocks, and code output in a single file. When you run the code, it generates the output,
including plots and tables, within the notebook file. And you can then export the notebook to a PDF
or HTML file format that can then be shared with anyone. Next, let’s learn about Jupyter Lab. Jupyter
Lab is a browser-based application that allows you to access multiple Jupyter Notebook files, other
code, and data files. In addition, it extends the functionalities of Jupyter Notebooks by enabling you
to work with multiple notebooks, text editors, terminals, and custom components in a flexible,
integrated, and extensible manner. It is compatible with several file formats like CSV, JSON, PDF,
Vega, and so on. And it is also open source. Jupyter Notebooks can be used with cloud-based services like
IBM Watson Studio and Google Colab. They don’t require any installation on your local machine. They give
you access to the Jupyter Notebook environment and allow you to import and export notebooks using the
standard IPython Notebook file format. Also, these services support the Python language and other languages
as well. Jupyter Notebooks can be installed via the command line using the pip install command. They can
also be downloaded locally on your laptop through the Anaconda Platform from Anaconda dot com. Anaconda is
one of the popular distributions, which includes Jupyter and JupyterLab. So, for this course, you have
access to a hosted version of JupyterLab in Skills Network
Labs, so you do not require any installations on your own device to complete the hands-on labs. As
shown here, you will see a screen that will launch the Jupyter Lab in the virtual environment. Simply
click the Open tool tab. In this video, you learned that Jupyter Notebooks are used in Data Science
for recording experiments and projects. Jupyter Lab is compatible with many files and Data Science
languages. And there are different ways to install and use Jupyter Notebooks.

Welcome to “Getting started with Jupyter.” After watching this video, you will be able to: Describe
how to run, insert, and delete a cell in a notebook. Work with multiple notebooks. Present the
notebook, and shut down the notebook session. In the lab session of this module, you can launch a
notebook using the Skills Network virtual environment. After selecting the check box, click the Open
tool tab, and the environment will open the Jupyter Lab. Here you see the open notebook. On
opening the notebook, you can change the name of the notebook. Click File. Then click Rename
Notebook to give the required name. And you can now start working on your new notebook. In the
new notebook, print “hello world”.

Then click the Run button to show that the environment is giving the correct output. On the main
menu bar at the top, click Run. In the drop-down menu, click Run Selected Cells to run the current
highlighted cells. Alternatively, you can use a shortcut, press Shift + Enter. In case you have multiple
code cells, click Run All cells to run the code in all the cells. You can add code by inserting a new
cell. To add a new cell, click the plus symbol in the toolbar. In addition, you can delete a cell.
Highlight the cell and on the main menu bar, click Edit, and then click Delete Cells. Alternatively, you
can use a shortcut by pressing D twice on the highlighted cell. Also, you can move the cells up or
down as required. So, now you have learned to work with a single notebook. Next, let’s learn to work
with multiple notebooks. Click the plus button on the toolbar and select the file you want to open.
Another notebook will open. Alternatively, you can click File on the menu bar and click Open a new
launcher or Open a new notebook. And when you open the new file, you can move them around. For
example, as shown, you can place the notebooks side by side. On one notebook, you can assign
variable one to the number 1, and variable two to the number 2 and then you can print the result of
adding the numbers one and two.

As a data scientist, it is important to communicate your results. Jupyter supports presenting results
directly from the notebooks. You can create a Markdown cell to add titles and text descriptions to help
with the flow of the presentation. To add markdown, click Code and select Markdown. You can
create line plots and convert each cell and output into a slide or sub-slide in the form of a
presentation.

The slides functionality in Jupyter allows you to deliver code, visualization, text, and outputs of the
executed code as part of a project.

Now, when you have completed working with your notebook or notebooks, you can shut them down.
Shutting down notebooks releases their memory. Click the stop icon on the sidebar; it is the second
icon from the top. You can terminate all sessions at once or shut them down individually. And after
you shut down the notebook session, you will see “no kernel” at the top right. This confirms it is no
longer active, you can now close the tabs.

In this video, you learned how to: Run, delete, and insert a code cell. Run multiple notebooks at the
same time. Present a notebook using a combination of Markdown and code cells. And shut down
your notebook sessions after you have completed your work.
Jupyter Kernels
Welcome to “Jupyter Kernels.” After watching this video, you will be able to define a kernel, and
describe how to work with kernels. A notebook kernel is a computational engine that executes
the code contained in a Notebook file. Jupyter Kernels for many languages exist, and we will
explore some that are relevant in Data Science. When a Notebook document opens, the related
kernel launches automatically. When the Notebook is executed, the kernel performs the computation
and produces the results. Depending on your settings, you may need to install other notebook
languages in your Jupyter environment. In the Skills Network lab environment, a few languages have
been pre-installed for you. The first one is the Python kernel. When you launch a notebook, pick the
language you are interested in for your Data Science project. The Python kernel allows you to run
python cells. You can run the Python script in the cells to produce an output. The top right corner of
the Notebook shows the name of the kernel. Here it shows the Python kernel. You have the option to
run other kernels. The Skills Network virtual Jupyter environment has Apache, Julia, R, and Swift.
You can use any language to execute your code, either by selecting the kernel on the launch page or
clicking the top right icon and selecting the kernel from the drop-down menu. If running the kernel on
your local machine, you will need to manually install the languages through your command line
interface (CLI). In this video, you learned that the kernel acts like a computational engine and
executes the code in a Notebook file. Jupyter Notebook supports different languages, and you can
switch to a different kernel as per your requirement.

Jupyter Architecture
Welcome to “Jupyter Architecture.” After watching this video, you will be able to describe the basic
Jupyter architecture, and explain Jupyter architecture for conversion of a file format. Jupyter
architecture implements a two-process model with a kernel and a client. The client is the interface
offering the user the ability to send code to the kernel. It is the browser in a Jupyter Notebook. The
kernel executes the code and returns the result to the client for display. Jupyter Notebooks represent
your code, metadata, contents, and outputs. When you save the Notebook, it is sent from your
browser to the Notebook server. It saves the notebook file on a disk as a JSON file with a .ipynb
(pronounced as dot i PI NB) extension. The Notebook server is responsible for saving and loading
the notebooks. And the kernel executes the cells of code contained in the Notebook when the user
runs them. The Jupyter architecture uses the nbconvert tool to convert files to other formats. For
example, if you want to convert a notebook file into an HTML file, the notebook is first modified by a
preprocessor, then an exporter converts the notebook to the new file format. Finally, a postprocessor
works on the exported file to give the final output. After conversion, when you open the URL of the file,
the HTML file displays.
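As a minimal sketch of this conversion pipeline from Python, the snippet below builds a tiny notebook in
memory with the nbformat library and then converts it to HTML with nbconvert’s HTMLExporter; the file names
used here are placeholders.

# Build a small notebook programmatically, save it as .ipynb (JSON), and convert it to HTML.
import nbformat
from nbformat.v4 import new_notebook, new_markdown_cell, new_code_cell
from nbconvert import HTMLExporter

nb = new_notebook(cells=[
    new_markdown_cell("# My analysis"),        # descriptive text cell
    new_code_cell("print('hello world')"),     # code cell
])

nbformat.write(nb, "analysis.ipynb")           # notebooks are stored as JSON with .ipynb

# The exporter plays the preprocessor/exporter role described above.
body, resources = HTMLExporter().from_notebook_node(nb)
with open("analysis.html", "w", encoding="utf-8") as f:
    f.write(body)
print("Wrote", len(body), "characters of HTML")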
In this video, you learned that: Jupyter implements a two-process model with a kernel and a client.
The Notebook server is responsible for saving and loading the notebooks. The kernel executes the
cells of code contained in the Notebook. And the Jupyter architecture uses the nbconvert tool to
convert files to other formats.

Welcome to “Additional Anaconda Jupyter Environments.” After watching this video, you will be able to
describe Anaconda and its data science features, describe Anaconda Jupyter environments, and
identify tools in Anaconda Jupyter environments. Computational notebooks combine code,
computational output, explanatory text, and multimedia resources into a single document. Jupyter
Notebook is a popular type of computational notebook because it supports dozens of programming
languages. JupyterLab and VS Code are popular environments for creating and modifying Jupyter
Notebooks on a local device. JupyterLab is an open-source, web-based application based on
Jupyter Notebook. You can create code, interactive visualizations, text, and equations, just like with
Jupyter Notebook. JupyterLab includes expanded features with some of Anaconda's most extensive
pre-installed Python libraries, including NumPy, Pandas, and Matplotlib. Anaconda is a free and
open-source distributor for Python and R, the top languages used in data science and machine
learning. Anaconda has fifteen hundred plus libraries. It is free to install and has free community
support for any users who need help with Python. The downloadable Anaconda Navigator graphical
user interface allows users to install new packages on their local device without using a command
line interface or ‘CLI.’ You can download Anaconda Navigator from the given URL. Here is the home
page of the Anaconda Navigator. To launch JupyterLab, click Launch in the JupyterLab box. If the
Launch button is missing, click Install first, and then click Launch. To start with the Jupyter Notebook,
type Jupyter Notebook(anaconda3) in the search bar and press enter. The JupyterLab dashboard
opens in the browser on the localhost. It is specifically designed to manage Jupyter Notebooks. To
create a new Jupyter Notebook, click New and select Python 3. This opens a new notebook in a new
tab. You will see the URL, which shows the filename and the kernel. It also shows the Last
Checkpoint. Let’s rename your notebook by clicking Untitled. Type a name for the notebook and click
Rename. Next, you will review two main cell types: Code and Markdown. In the dropdown menu,
select Code. A code cell contains code to be executed in the kernel and displays its output. To
execute the cell, click Run. Alternatively, in the dropdown menu, select Markdown. A Markdown cell
contains rich text and displays its output in place when it executes. To download a notebook, go to
File and click Download as. You will see several download options. You can select the option you
want. VS Code is a free, open-source code editor for debugging and task-running operations. VS
Code works on Linux, Windows, and macOS. It supports multiple languages, syntax highlighting,
auto-indentation, and more. VS Code is one of the most popular development environment tools. If
you prefer to install VS code separately, without using Anaconda Navigator, you can go to
code.visualstudio.com, click the download option that applies to your device, then follow the install
instructions. A separate installation of VS Code will work the same as in Anaconda Navigator, but it
will not configure for Anaconda, Python, or Jupyter Notebooks. To open VS Code using Anaconda
Navigator, open Anaconda Navigator, find the VS Code application, and click Launch. Once
installed, you will see the Get Started screen. You need to install a few extensions to execute Python
code in VS Code. First, click Extensions or use Ctrl + Shift + X keys to open Extensions. Then
search for “Python”; all the extensions related to Python will appear. Once you install the extensions,
click File. Then select New File. In New File, select Jupyter Notebook. The notebook will look like
this. Notice that the kernel is Python. Write your code and then execute it using the RUN icon. You
will get a confirmation that your code has been executed successfully. And finally, navigate to File
and select Save.
In this video, you learned that Jupyter is a popular computational notebook tool because it supports
dozens of programming languages. The Anaconda Navigator GUI can launch multiple applications
on a local device. Jupyter environments in the Anaconda Navigator include JupyterLab and VS
Code. And you can download Jupyter environments separately from the Anaconda Navigator, but
they may not be configured properly.
Additional Cloud-Based Jupyter Environments
Welcome to “Additional Cloud-Based Jupyter Environments.” After watching this video, you will be able to:
describe cloud-based Jupyter environments and their data science features, navigate cloud-based Jupyter
environments, and identify tools in cloud-based environments. Computational notebooks combine code,
computational output, explanatory text, and multimedia resources in a single document. Jupyter Notebook is
a popular type of computational notebook because it supports dozens of programming languages. Popular
cloud-based environments used to create and modify Jupyter notebooks include JupyterLite and Google
Colaboratory. JupyterLite is a lightweight tool built from JupyterLab components that executes entirely in
the browser. JupyterLite does not require a dedicated Jupyter server. Only a web server is required, which
means we can deploy JupyterLite as a static website. We can also use it to create interactive graphics and
visualizations because it supports many visualization libraries like Altair, Plotly, and ipywidgets. Since
JupyterLite is a distribution of JupyterLab, it includes the latest improvements and features. To launch
JupyterLite, open a browser and type jupyter.org/try-jupyter/lab in the URL field. Then press Enter.
JupyterLite will appear. Next, click Python (Pyodide). Here is a view of a JupyterLite notebook. We know
this is a JupyterLite notebook because we see the kernel is Python Pyodide. This kernel allows installing
and running Python packages in a browser. You will notice different kernels depending on the type of
Jupyter environment you use. For cloud-based Jupyter environments, Python Pyodide and Python Pyolite are
common kernels. The default kernel for JupyterLite is Pyolite. Pyolite is a Python kernel based on Pyodide.
Pyolite runs in the background, so that intensive computations can execute quickly. Other kernels can also
be used with JupyterLite. Google Colaboratory (or “Google Colab”) is a free Jupyter notebook environment
that runs entirely in the cloud. Google Colab Jupyter notebooks execute on a browser, and Google Colab
projects are stored on Google Drive and GitHub. You can upload and share notebooks without setup and
installation. You can also clone projects from GitHub and execute them in Google Colab. Most machine
learning and visualization libraries are pre-installed, like scikit-learn and matplotlib. With Google
Colab, you can develop many trending data science applications “on the fly”, which is to say, quickly
without a lot of setup or preparation.
To open the Colab notebook, open Google Drive, and click New. To explore Google Colab, from the
Google Drive menu, select More. Then select Google Colaboratory. The Google Colab notebook will appear.
In the notebook, write the code in the code section, and then to execute the code, click the Run icon. To
add more Code or Text cells, you need to click +Code and +Text. Here, text cells are used to write rich
text, or you can set these cells as Markdown cells.
In this video, you learned that: Jupyter is a popular computational notebook tool because it supports
dozens of programming languages. The Anaconda Navigator GUI can launch multiple applications. Additional
open-source Jupyter environments include the following: JupyterLab, JupyterLite, VS Code, and Google
Colaboratory. JupyterLite is a browser-based tool.

Jupyter Notebooks on the Internet

There are thousands of interesting Jupyter Notebooks available on the internet for you to learn from.
One of the best sources is: https://github.com/jupyter/jupyter/wiki

It is important to notice that you can download such notebooks to your local computer or import them to a
cloud-based notebook tool so that you can rerun, modify, and apply what's explained in the notebook.

Very often, Jupyter Notebooks are already shared in a rendered view. This means that you can look at them
as if they were running locally on your machine. But sometimes, folks only share a link to the Jupyter file
(which you can make out by the *.ipynb extension). In this case, you can pick the URL to that file and
paste it into the NB-Viewer => https://nbviewer.jupyter.org/

The list of Jupyter Notebooks provides you with a huge collection of materials to explore. Therefore, it
might be useful to give you some pointers to interesting notebooks. You have covered some examples with
data in the labs. Let's highlight some useful notebooks that further explore data science. In addition, as
we have covered different tasks in data science, we will also provide a sample notebook for each task.

First, you start with exploratory data analysis, for which this notebook is highly recommended:
https://nbviewer.jupyter.org/github/Tanu-N-Prabhu/Python/blob/master/Exploratory_data_Analysis.ipynb

For data integration/cleansing at a smaller scale, the Python library pandas is often used. For this task,
you can have a look at this notebook:
https://towardsdatascience.com/data-cleaning-with-python-using-pandas-library-c6f4a68ea8eb
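As a taste of what such a data cleaning notebook covers, here is a small, self-contained pandas sketch; the
DataFrame and its column names are invented for illustration only.

# A tiny pandas example of common cleaning steps (illustrative data only).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city":  ["Berlin", "berlin ", "Paris", None, "Rome"],
    "sales": [100.0, np.nan, 250.0, 80.0, 250.0],
})

df["city"] = df["city"].str.strip().str.title()          # normalize text values
df["sales"] = df["sales"].fillna(df["sales"].median())   # impute missing numbers
df = df.dropna(subset=["city"]).drop_duplicates()        # drop missing keys and duplicate rows
print(df)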

If you want to know more about clustering, have a look at this notebook:
https://nbviewer.jupyter.org/github/temporaer/tutorial_ml_gkbionics/blob/master/2%20-%20KMeans.ipynb

And finally, if you want an in-depth notebook on the iris dataset, have a look at this:
https://www.kaggle.com/lalitharajesh/iris-dataset-exploratory-data-analysis
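If you want a quick, code-only version of such an exploratory analysis, the sketch below loads the iris
dataset through scikit-learn (rather than from Kaggle) and prints a few standard summaries; it assumes a
reasonably recent scikit-learn that supports as_frame=True.

# A quick exploratory look at the iris dataset, loaded via scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)       # requires scikit-learn >= 0.23
df = iris.frame                       # features plus the 'target' column

print(df.head())                      # first rows
print(df.describe())                  # summary statistics per column
print(df.groupby("target").mean())    # per-species feature means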

Welcome to “Introduction to R and RStudio.” After watching this video, you will be able to explain
what is R, list R capabilities, describe RStudio environment, and list the R libraries for data science.
R is a statistical programming language. It is a powerful tool for data processing and manipulation,
statistical inference, data analysis, and machine learning algorithms. Based on a 2017 analysis, it was
found that R is used most by academics, healthcare, and the government. R supports importing of
data from different sources like flat files, databases, web, and statistical software such as SPSS and
STATA. R is a preferred language for some data scientists because R functions are easy to use. It is
also known for producing great visualizations and contains packages to handle data analysis without
the need to install additional libraries. A popular integrated development environment for developing
and running the R language source code and programs is RStudio. It improves and increases
productivity with the R language. RStudio includes: a syntax-highlighting editor that supports direct
code execution and a place where you can keep a record of your work, a Console for typing R
commands, a workspace and History tab that shows the list of R objects you created during your R
session and the history of all previous commands, and finally, Files, Plots, Packages, and Help tabs.
The Files tab shows files in your working directory. The Plots tab displays the history of plots you
have created. You can also export plots to PDF or image files. The Packages tab displays external R
packages available on your local computer. And, the Help tab provides help on R resources, RStudio
support, packages, and more. If R is your tool of choice for data science, here are some
popular R libraries available in the Data Science community: dplyr for manipulating data, stringr for
manipulating strings, ggplot for visualizing data, and caret for machine learning. To get you up and
learning quickly, we have provided you with an R Studio virtual environment as part of the Skills
Network Labs. This virtual lab environment is designed to assist you to easily practice what you
learn in the course and skip the need to create an account or download or install anything.
In this video, you learned the capabilities of R and its uses in Data Science, the RStudio interface for
running R codes, and popular R packages for Data Science.

Welcome to “Plotting in RStudio.” After watching this video, you will be able to: List the R data
visualization packages, Use the inbuilt R plot function, Use the R ggplot library to add functions and
arguments to the plot, And, add titles and names to the plot. With the influx of data, one of your
many jobs as data scientists is to produce insights using visualizations. R has different packages for
data visualization that you can use based on your requirement. To install these packages in your R
environment, use the install.packages and the package name command. Examples of R packages
include the following. ggplot is used for data visualizations such as histograms, bar charts,
scatterplots, and so on. It allows adding layers and components to a single visualization. Plotly is
used for web-based data visualizations that can be displayed or saved as individual HTML files.
Lattice is used to implement complex, multi-variable data sets. It is a high-level data visualization
library that can handle graphics without customizations. And, Leaflet is used for creating interactive
plots. R has inbuilt functions to create plots and visualization. For example, you can create a plot
using the definition shown here. The plot function returns a scatterplot of the values vs. the index.
You can also add lines to the function and a title to make the visualization easier to read and
understand. To add a line, you specify the type and to add a title, you select the title function. In the
plot, you have added a line and a title. You can create informative visualizations using the ggplot
library of R. It can handle complex requests by adding layers to plots using different functions and
arguments. For example, to create a scatter plot, let’s use the inbuilt dataset mtcars. You will first
load the ggplot library into memory using the library function. Next, use the ggplot function on the
data frame mtcars, specify the x-axis as miles per gallon and the y-axis as weight. Then add the
geom point function to specify a scatter plot; otherwise, it will return an empty plot. The output will be
an easier-to-read plot. In addition, you can add titles and change the axis names by using the ggtitle
function and the labs function to specify appropriate names for both axes. The result will be a
graph with meaningful titles. In the lab, you will recreate the graphics with ggplot and the extension
library called GGally. GGally extends ggplot by adding several functions to reduce the complexity of
combining geometric objects with transformed data. In this video, you learned about: Popular data
visualization packages in R, Plotting with the inbuilt R plot function, Plotting with ggplot, Adding titles
and changing the axis names using the ggtitle and labs functions.

​Objective for Exercise

We will create different data visualizations with the ggplot package, using the inbuilt R dataset
called mtcars.
Click on the + symbol on the top left and choose R Script from the menu to open a new R edit
window in RStudio:
library(datasets)

#Load Data

data(mtcars)

#View first 5 rows

head(mtcars, 5)
Type this ?mtcars to get information about the variables. This will print the information at the bottom
right panel, on the Help tab

Copy and paste the following code to load the ggplot package and create a scatterplot of disp and
mpg.
#load ggplot package
library(ggplot2)

#create a scatterplot of displacement (disp) and miles per gallon (mpg)

ggplot(aes(x=disp, y=mpg), data=mtcars) + geom_point()


Use the following code to add a title.


#Add a title

ggplot(aes(x=disp, y=mpg), data=mtcars) + geom_point() + ggtitle("displacement vs miles per gallon")

Use the following code to change the name of the x-axis and y-axis

#change axis name

ggplot(aes(x=disp, y=mpg), data=mtcars) + geom_point() + ggtitle("displacement vs miles per gallon") +


labs(x = "Displacement", y = "Miles per Gallon")

Use the following to create a boxplot of the distribution of mpg for the individual engine types, vs
(0 = V-shaped, 1 = straight).
To do this, you have to make vs a string or factor.

#make vs a factor
mtcars$vs <- as.factor(mtcars$vs)

#create boxplot of the distribution for v-shaped and straight Engine

ggplot(aes(x=vs, y=mpg), data = mtcars) + geom_boxplot()

Add color to the boxplots to help differentiate:


ggplot(aes(x=vs, y=mpg, fill = vs), data = mtcars) +
geom_boxplot(alpha=0.3) +
theme(legend.position="none")

Finally, let us create a histogram of the weight (wt).


ggplot(aes(x=wt),data=mtcars) + geom_histogram(binwidth=0.5)

This lab introduces you to plotting in R with ggplot and GGally. GGally is an extension of ggplot2.

Exercise:

Click the plus symbol on the top left and click R Script to create a new R script, if you don’t have one
open already
In this video, you’ll get an overview of Git and GitHub, which are popular environments
among developers and data scientists for performing version control of source code files and
projects and collaborating with others. You can’t talk about Git and GitHub without a basic
understanding of what version control is.
A version control system allows you to keep track of changes to your documents. This makes it easy
for you to recover older versions of your document if you make a mistake, and it makes collaboration
with others much easier. Here is an example to illustrate how version control works. Let’s say you’ve
got a shopping list and you want your roommates to confirm the things you need and add additional
items. Without version control, you’ve got a big mess to clean up before you can go shopping. With
version control, you know exactly what you need after everyone has contributed their ideas.
Git is free and open source software distributed under the GNU General Public License. Git is a
distributed version control system, which means that users anywhere in the world can have a copy
of your project on their own computer. When they’ve made changes, they can sync their version to a
remote server to share it with you. Git isn’t the only version control system out there, but the
distributed aspect is one of the main reasons it’s become one of the most common version control
systems available. Version control systems are widely used for things involving code, but you can
also version control images, documents, and any number of file types. You can use Git without a
web interface by using your command line interface, but GitHub is one of the most popular
web-hosted services for Git repositories. Others include GitLab, BitBucket, and Beanstalk. There are
a few basic terms that you will need to know before you can get started. The SSH protocol is a
method for secure remote login from one computer to another. A repository contains your project
folders that are set up for version control. A fork is a copy of a repository. A pull request is the way
you request that someone reviews and approves your changes before they become final. A working
directory contains the files and subdirectories on your computer that are associated with a Git
repository. There are a few basic Git commands that you will always use. When starting out with a
new repository, you only need to create it once: either locally with the command "git init" and then
pushing it to GitHub, or by cloning an existing repository with "git clone".
"git add" moves changes from the working directory to the staging area. "git status" allows you to see
the state of your working directory and the staged snapshot of your changes. "git commit" takes your
staged snapshot of changes and commits them to the project. "git reset" undoes changes that you’ve
made to the files in your working directory. "git log" enables you to browse previous changes to a
project. "git branch" lets you create an isolated environment within your repository to make changes.
"git checkout" lets you see and change existing branches. "git merge" lets you put everything back
together again. To learn how to use Git effectively and begin collaborating with data scientists around
the world, you will need to learn the essential commands. Luckily for us, GitHub has amazing
resources available to help you get started. Go to try.github.io to download the cheat sheets and run
through the tutorials. In the following modules, we'll give you a crash course on setting up your local
environment and getting started on a project.

Introduction to Github
Welcome to “Introduction to GitHub.” After watching this video, you will be able to: Describe
the purpose of source repositories and explain how GitHub satisfies the needs of a source
repository. Linux development in the early 2000’s was managed under a free-to-use system known
as BitKeeper. In 2005, BitKeeper changed to a for-fee system which was problematic for Linux
developers for many reasons. Linus Torvalds led a team to develop a replacement source-version
control system. The project ran in a short timeframe, and the key characteristics were defined by a
small group. These include: Strong support for non-linear development. (Linux patches were then
arriving at a rate of 6.7 patches per second) Distributed development. Each developer can have a
local copy of the full development history. Compatibility with existing systems and protocols. This
was necessary to acknowledge the diversity of the Linux community. Efficient handling of large
projects. Cryptographic authentication of history. This makes certain that distributed systems all have
identical code updates. Pluggable merge strategies. Many pathways of development can lead to
complex integration decisions that might require explicit integration strategies. What is special about
the Git Repository model? Git is designed as a distributed version-control system. Primarily focused
on tracking source code during development. Contains elements to coordinate among programmers,
track changes, and support non-linear workflows. Created in 2005 by Linus Torvalds for distribution
of Linux kernels. Git is a distributed version-control system that is used to track changes to content.
It serves as a central point for collaboration with a particular focus on agile development
methodologies. In a central version control system, every developer needs to check out code from
the central system and commit back into it. As Git is a distributed version control, each developer
has a local copy of the full development history, and changes are copied from one such repository to
another. Each developer can act as a hub. When Git is used correctly, there is a main branch that
corresponds to the deployable code. Teams can continuously integrate changes that are ready to be
released and can simultaneously work on separate branches in between releases. Git also allows
centralized administration of tasks with access-level controls for each team. Git can co-exist locally
such as through the GitHub Desktop client or it can be used directly through a browser connected to
the GitHub web interface. IBM Cloud is based on sound and established open-source tools including
Git repositories, often called repos. GitHub is an online hosting service for Git repositories. GitHub
is hosted by a subsidiary of Microsoft. GitHub offers free, professional, and enterprise accounts. As of
August 2019, GitHub had over 100M repositories. A repository is a data structure for storing documents,
including application source code, and it can track and maintain version control.
GitLab is a complete DevOps platform, delivered as a single application. GitLab provides access to
Git repositories, controlled by source code management. With GitLab, developers can: Collaborate,
reviewing code, making comments and helping to improve each other’s code. Work from their own
local copy of the code. Branch and merge code when required. Streamline testing and delivery with
Built-in Continuous Integration (CI) and Continuous Delivery (CD). In this video, you learned: GitHub
is the online hosting service for Git repositories. Repositories store documents including application
source code and enable contributors to track and maintain version-control. What is special about the
Git Repository model? Git is designed as a distributed version-control system. Primarily focused on
tracking source code during development. Contains elements to coordinate among programmers,
track changes, and support non-linear workflows.

You will now be redirected to the repository you have created. The root folder of your repository is
listed by default, and it has just one file, README.md.

Now, it’s time to edit the readme. You can do this in your browser. Just click the pencil to open the
online editor and you can change the text of the readme. To save your changes to the repository,
you must commit them. After you have made your changes, scroll down to the Commit changes
section. Add a commit message and optionally add a description, then click Commit changes. Clicking
“Commit changes” saves your changes to the repository. Go back to the home screen by
clicking the repository name link. Note that the readme file is updated and verify your changes.

Let’s learn how to create a new file using the built-in web editor provided by GitHub which runs in
the browser. Click Add File, then click Create New File to create the new file.

To create a Python file called firstpython.py, first provide the file name. Next, add a comment that
describes your code, then add the code.

Once finished, commit the change to the repository. You can see that your file is now added to the
repository and the repository listing shows when the file was added or changed. When you need to
change the file, you can edit it again. Click the file name, and then click the pencil icon, make your
edits and commit the changes.

You can also upload a file from your local system into the repository. From the home screen of the
repository, click Add File and choose the Upload files option.

Click Choose Your Files and select the files you want to upload from your local system.

The file upload process may take a short time, depending on what you are uploading. Once the
files finish uploading, click Commit Changes. The repository now reflects the files that were
uploaded. In this video, you learned how to create a repository, edit files, and commit changes
using the web interface.

Working with branches


Welcome to “GitHub: Working with Branches.” After watching this video, you will be able to define a
GitHub branch, create master and child branches, describe how to merge branches, and create a Pull
Request. A branch is a snapshot of your repository to which you can make changes. It is a copy of
the master branch and can be used to develop and test workflow changes before merging it into the
master branch. In Git and GitHub, there is a main branch called master. It has the deployable code
and is the official working version of your project. It is meant to be stable, thus, it is advisable not to
push any code that has not been tested in the master. If you want to change the code and the
workflow in the master branch, you can create a copy of the master branch. This can be the child
branch that will be a copy of the workflow. In the child branch, changes and experiments are done.
You can build, make edits, test the changes, and when you are satisfied with them, you can merge
them back to the master branch, where you can prepare the model for deployment. You can see that
all of this is done outside the main branch and until you merge, changes will not be made to the
workflow before you branched. To ensure that changes done by one member, do not impede or
affect the workflow of other members, multiple branches can be created and merged appropriately
with the master after the workflow is properly tested and approved. To create branches in GitHub,
let’s look at this repository. There is currently one branch in the repository. You want to make some
changes but don’t want to alter the master in case something goes wrong, so you will create a
branch. To do that, you will click the drop-down arrow and create a new branch. Name the new
branch ‘child branch’ and then press Enter. The repository now has two branches, the master and
child branches. You can check this by selecting the child branch in the Branch selector drop-down
list. All the content in the master branch is copied to the child branch. However, you can add files in
the child branch without adding any to the master branch. To add a file, ensure the child branch is
selected in the branch selector drop-down list. Then click Create new file. In the space provided,
name the file ‘test child dot py’ and then add a few lines of code. You can print the statement inside
the child branch. At the bottom of the screen, you will see a section, ‘Commit new file.’ Commit
messages are important as they help to keep track of the changes made. Add a descriptive commit
message for the convenience of the team. Here you can add ‘Create test child dot py.’ Then click
Commit new file. The file is added to the child branch. You can verify by going to the master branch
by clicking ‘master’ from the Branch selector menu, and you can see that the new file is not added to
the master branch. After you have created the new file, test and ensure it is working. You can merge
the changes in the child branch to reflect in the master branch by creating a Pull Request (PR). Pull
requests show the differences in the content from both branches. It can notify other team members
of the changes and edits to the main branch. Ideally, another team member reviews the changes and
approves them to be merged with the Master branch. Pull requests are a means of collaboration on
GitHub. When you open a pull request, you propose your changes. You can assign team members
to review and approve your contribution and merge in the target branch. To open a pull request to
see the differences between the branches, click Compare and pull request. If you scroll down to the
bottom of the screen, you will see the comparison between both branches. It shows that one file has
changed, and the file has two additions, the two lines you added to the file with zero deletions. You
will now create the pull request. Add the title and an optional comment. Click Create pull request.
The next screen will show the details of the pull request. If you are okay with the changes, click
Merge pull request and then click Confirm. You will get a confirmation that the pull request has been
successfully merged. You can delete the branch if you no longer need to edit or add new information.
Now, the child branch has completely merged with the Master branch. You can check the Master
branch and verify it contains the test child dot py file. In this video, you learned: A branch is a
snapshot of your repository to which you can make changes. In the child branch, you can build,
make edits, and test the changes, and then you can merge them with the Master branch. To ensure
that changes done by one member do not impede or affect the workflow of other members, multiple
branches can be created and merged with the master. And, a pull request is a way to notify other
team members of the changes and edits to the main branch.

[Optional] Getting Started with Branches using Git Commands


You would typically use Git commands from your own desktop/laptop. However, so you can get
started using the commands quickly without having to download or install anything, we are providing
an IDE with a Terminal on the Cloud. Simply click the Open Tool button below to launch the Skills
Network Cloud IDE and in the new browser tab that launches, follow the instructions to practice the
Git commands.

After completing this lab you will be able to use git commands to start working with creating and
managing your code branches, including:

create a new local repository using git init

create and add a file to the repo using git add

commit changes using git commit


create a branch using git branch

switch to a branch using git checkout

check the status of files changed using git status

review recent commits using git log

revert changes using git revert

get a list of branches and active branch using git branch

merge changes in your active branch into another branch using git merge

If you are unable to open the lab or view it properly, please click
here
to view the HTML version full screen in a new browser tab.

This course uses a third-party app, [Optional] Getting Started with Branches using Git Commands, to
enhance your learning experience. The app will reference basic information like your name, email,
and Coursera ID.

Watson Studio
Welcome to Introduction to Watson Studio. After watching this video, you will be able to explain what
Watson Studio is for and who uses it, list the components of IBM Cloud Pak for Data, find common
resources in Watson Studio and IBM Cloud Pak for Data, and build models and manage services
and integrations. Watson Studio is a collaborative platform for the data science community. Data
Analysts, Data Scientists, Data Engineers, Developers and Data Stewards all use Watson Studio to
analyze data and construct models. With Watson Studio, you can create projects to organize data
connections, data assets, and Jupyter notebooks. You can upload files to your project, and you can
clean and shape the data to refine it for analysis. You can then create and share data visualizations
via dashboards without using any coding. Watson Knowledge Catalog provides a secure enterprise
catalog management platform to deliver trusted and meaningful data. And Watson Machine Learning
offers tools and services to build, train, and deploy machine learning models. Cloud Pak for Data as
a Service is a secure, seamless data access and integration platform that enables a single view of
the data, no matter how many data sources you are working with. It includes IBM Watson Studio,
IBM Watson Knowledge Catalog, and IBM Watson Machine Learning and more. In IBM Cloud Pak
for Data, you can find step-by-step tutorials that show how to integrate data from multiple sources,
build, deploy, test, and more. You can create collaborative data workspaces called Projects where
your team can perform tasks for data science, data engineering, data curation, or machine learning
and AI. And you can read Cloud Pak for Data news and updates. As you scroll down, you will see
your work highlights, recent activities, and shortcuts. The Quick start section has links to get you
started, and the navigation menu is on the upper left. In the navigation menu, Projects shows
projects and jobs you have created. In Deployments, you can train, deploy, and manage machine
learning models in collaborative deployment spaces. In Services, you can view different services
associated with your account, and explore the catalog. and Gallery shows a collection of data sets,
notebooks, industry accelerators, and sample projects. Now, on the Gallery page, you can search for
projects and filter your search by clicking All filters You can filter by format and topics. You can then
explore different project types. and you can also explore the data. Once you select the project, you
will see the notebook, when it was last modified, the problem statement, and more. You can Add to
your project or download the notebook to your system. Watson Studio, one of the core services in Cloud Pak
for Data, has an architecture centered around the project. To create a project, select Work
with data from the Cloud Pak for Data homepage. The Create a project popup will appear with
options to create an empty project or create a project from a sample or file. Now when you click,
Create an empty project, this page loads. Here you can manage all your projects. Use the context
information and actions menu from anywhere in the project to view project information or load data
files as assets. The RStudio integrated development environment (IDE) is included with IBM Watson
Studio so you can use R notebooks and scripts in your projects. Launch RStudio from the Launch
IDE menu after you create a project. The Overview page keeps you up to date with recent assets
you created, the resource usage of the project, a readme for your project description, and a project
history. To find project assets, click the Assets tab. The New Assets button lets you use data and
tools to create analytical assets, like flows, visualizations, experiments, or notebooks, and the Import
Asset button lets you import assets. A job is a way of running assets, such as Data Refinery flows or
Notebooks in a project. Under the Jobs tab, you can run a job immediately or schedule a job. You
can manage your projects under the Manage tab. There you can control access with user groups,
define environments and monitor active runtimes, see resource usage, and add tools and processing power using Services and integrations. In Services and integrations, you can associate IBM Cloud services with your project to add tools, environments, and capabilities. And you can also use
third-party Integrations so your project can interact with external tools. For example, you can
integrate with a Git repository to export the project, work with documents and notebooks in
JupyterLab, or back up the project. From the Assets tab, you can use graphical builders, like: Dashboards Editor, to create a shareable visualization in a dashboard, or Data Refinery to create
flows to refine data. Decision Optimization has the Decision Optimization model builder to solve
scenarios, and with SPSS Modeler, you can quickly develop predictive models and deploy them into
business operations to improve decision-making. The New asset tool type options also include Code
editors, which provides a Jupyter notebook editor where you can analyze data or build models. A
Jupyter notebook is a web-based environment for interactive computing. You can use notebooks to
run small pieces of code that process your data, and you can immediately view the results of your
computation. Now, this Jupyter notebook editor is largely used for interactive, exploratory data
analysis programming and data visualization. You should use it if you are new to Jupyter notebooks.
In this video, you learned that Watson Studio is a helpful tool for: analyzing and viewing data, cleaning and shaping data, embedding data into streams, and creating, training, and deploying machine learning models. Learning Watson Studio promotes career growth. Learning Watson Studio is easy and requires no special skills. And Watson Studio offers many available resources.

Welcome to Creating an IBM Cloud Account and a Watson Studio service. After watching this video,
you will be able to create an IBM Cloud account, create an IBM Watson Studio service, and create a
project in Watson Studio.

To create an IBM Cloud account, go to the IBM Cloud registration page. Enter your Email and
Password, and then click Next. Verify your email using the 7-digit code sent to the email address you
entered, then click Next. So, once your email is successfully verified, enter your first name, last
name, and country or region, then click Next. Review the Account Notice, opt in for email updates if
you desire, and accept the terms and conditions, then click Continue. Review the privacy notice and
accept it. Click Continue to create the account. It will take a few seconds to create and set up your
account. Now, in the apply code section, you will see that the feature code has been applied to your
account; click the Create account button to continue. It may take a few minutes to create your
account.

So, once your IBM Cloud account is created, you will see the IBM Cloud dashboard. To use the
Watson Studio service, click the Catalog option. Scroll down and select AI/Machine Learning. Then
select Watson Studio. This will load the Watson Studio creation page. Here, select Dallas or London
under the location option. To avoid any charges, select the lite plan, then accept the license
agreements, and click the Create button. Then click Launch in IBM Cloud Pak for Data. Provide a
company name and phone number, and select the contact options you prefer. Then click Continue.
On the next screen, click Go to the IBM Watson Studio. You have successfully created an account
for IBM Watson Studio and are ready to use the Watson service.

Now, let’s start creating a project: click Work with data. You can either create an empty project or
create a project from a sample or file. In this video, we will create an empty project. Provide the
project name and description, then click Add to select a storage service. You will be redirected to a
new page where you will create a Cloud Object Storage. Scroll down and select the Lite plan, then
click Create. Once created, you will be redirected to the previous page. Notice the Refresh option is
enabled. Click Refresh to refresh the object storage. Now, your cloud object storage is visible, and
the Create option is enabled. Click Create to create a project. It will take a few seconds to create a
project. Once created, the project looks like this. The next video will show you how to add a
notebook to the project. So, without further delay, let's begin exploring Watson Studio. In this
video, you learned that: You must create an IBM Cloud account to use Watson Studio. You can get
to Watson Studio by clicking Catalog on the IBM Cloud dashboard and then AI/Machine Learning.
You can create an empty project or a project from a sample or file by selecting the Work with data
option. And you can add a storage service by selecting Add and then selecting the storage service of
your choice.

Jupyter Notebooks in Watson Studio


Welcome to “Jupyter Notebooks in Watson Studio – Part 1”. After watching this video, you will be
able to create a Jupyter Notebook and load a data file, share a Jupyter Notebook with others, and
create a job and schedule it to run. In the previous video, you learned how to create a project. Once
created, you will see this page. Click New asset to add or create a new notebook. Under Tool type,
select Code editors and then, select Jupyter notebook editor to create a new notebook. On the New
Notebook page, you can create a blank notebook, upload a notebook file from your file system, or
upload a notebook file from a URL. In this video, we will create a blank notebook. First, provide a
notebook name and description.

You need to specify the runtime environment for the language you want to use (for example, Python,
R, or Scala). Then click Create to create a notebook.
After you create a notebook, you will upload the data. Make sure the data you load, and the code
commands you use to analyze that data, both match the kernel/runtime language you selected when
you created your notebook. To upload the data, click Find and Add Data. In the Data pane, browse
for the files or drag them onto the pane. You must stay on the page until the upload is complete. You
have the option to cancel an ongoing upload process if need be. Now, the data is available to work
on. Click Insert to code and select pandas DataFrame. It's a best practice to insert a cell at the top of
the Jupyter notebook using the Insert Cell Above option from the Insert tab. Once the cell is added, change the cell type to Markdown so that it will not be treated as code. In the Markdown cell,
describe what the notebook does and run it. Now you're ready to run the notebook. The inserted
code loads the data set into a data frame. Run the code cell to display the first five rows of the data
set. From the File tab, select the Save Version option to save the latest changes in the notebook.
Click the project name to return to the project home page.
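
For reference, here is a minimal Python sketch of what an "Insert to code" cell that loads a data asset into a pandas DataFrame roughly does. The file name patients.csv is a hypothetical placeholder, and the cell Watson Studio actually generates also includes Cloud Object Storage credentials, which are omitted here.

# Minimal sketch of an "Insert to code -> pandas DataFrame" cell.
# The file name is a hypothetical placeholder; the generated cell also
# handles Cloud Object Storage credentials, omitted in this sketch.
import pandas as pd

df = pd.read_csv("patients.csv")  # load the uploaded data asset into a DataFrame
df.head()                         # display the first five rows of the data set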

Back on the project home page, under the Assets tab, select the Source code section in the left navigation pane, where you'll find the notebook you recently worked on. Click it to open it. The
notebook will open in a read-only mode. To edit it, click the pencil icon in the Notebook action bar.

When the view notebook info icon is selected, under the General tab, you can change the name and
description of your notebook, and see the last editor, last modified date, and the creation date. On
the Environment tab, you can see the environment template used to open the notebook, change the
template, view the environment details, and check the runtime status. You can create a URL to share
the notebook on social media or with people outside of Watson Studio. The URL shows a read-only
view of the notebook. Anyone who has the URL can view or download the notebook. Click the Share
icon in the notebook action bar to see sharing options in a pop-up window.

If you want to share a read-only version of the notebook, in the dialog box: Enable the option Share
with anyone who has a link. You can select how much of the content you'd like to share by selecting
Only text and output or All content excluding sensitive code cells in the sharing options. You can
then share the notebook either through a link or on social media. The Jobs feature provides an
efficient way to run, schedule, and monitor jobs in a Watson Studio project. Click the Create a job
icon from the notebook action bar and click Create a job. In the Define details page, enter the job
name and description, and then click Next. In the Configure page, enter all the required details, and
then click Next. You can add a one-time or repeat schedule on the Schedule page. If you define a
start day and time without selecting Repeat, the job will run exactly once on the specified day and
time. Select the start day and time, and then click Next.

If you require a notification for this job, enable the notification option, and then click Next.
Review the job details and click Create. To view the job you created go to the Jobs tab. From here,
you can edit and delete the job.

In this video, you learned that you can add or create a new notebook by clicking New asset on the
project home page under the Assets tab. You can share your notebook without sharing the sensitive
cells. Jobs are created and scheduled from the Create a job icon in the notebook action bar, and
jobs can be viewed, edited, or deleted on the project home page under the Jobs tab.

Welcome to “Jupyter Notebooks in Watson Studio – Part 2.” After watching this video, you will be
able to check the Jupyter Notebook runtime environment, use different types of Jupyter Notebook
templates, and change the kernel within a Jupyter Notebook. You have learned how to create a
Jupyter Notebook, now you will manage the environment. Go to the Data Science Project home
page and click the Manage tab, and then in the left pane click Environments. If your Notebook is
executing, you will be able to see the active runtime environment. Note that you can execute your
notebook to see the active runtime if it is not visible. To see the available runtime environments, click
Templates. In the Template name list, you can explore available runtime environment templates.
Select any template as your active runtime environment. Let’s select the RStudio environment. On
selecting the template, you will see the Summary of the template, and its Software configuration
details. To create a new template, go back to the Templates page and click New template. Enter the
environment name, its description, and configuration. Then click Create to create the environment. A
Summary and Software configuration details of the new template will be displayed. To change the
current runtime to the newly created template runtime, click the Assets tab, and then select the new
notebook you created. If you see the lock icon, unlock it and select the notebook again. Next select
the three dots on the right and click Change environment. Select the new template environment you
created and click Change. In the pop-up, select the kernel from the drop-down option and click Set
Kernel. On the top right corner, you will see the new runtime. Now you can drag and drop any CSV
file to upload and then, click Insert to code to include the code into the notebook. To execute the
notebook, click Run, and you will see that the output is now using the new runtime environment.

In this video, you learned that Jupyter Notebook runtime environments and templates can be found on the Environments page under the project's Manage tab, and that you can create new Jupyter Notebook templates and change the environment a notebook runs in.

https://eu-gb.dataplatform.cloud.ibm.com/analytics/notebooks/v2/953f475b-87d1-4699-b901-71403e
777879/view?access_token=d426d8be71f20d383ad2768bba6e8f9bc8aaf7f7678e1049b54aec242e5
7ac7c&context=cpdaas

Data Science Methodology


Welcome to the Introduction to Data Science Methodology course. This is the beginning of a story you'll tell others about for years. Your story won't be in the form you experience here, but rather in the stories you'll share with others as you explain how your understanding of a question resulted in an answer that changed how something was done. During this course, you'll gain in-depth insights into data science methodologies. After you complete this course, you'll be able to describe how data scientists apply a structured approach to solve business problems, and you'll be ready to start creating your own data science methodology stories. You'll master the basics of foundational data science methodology through instructional videos that incorporate a real-life case study, readings, interactive hands-on labs and practice assessments where you can practice your skills, downloadable glossaries for your reference, and graded assessments at the end of each lesson and at the end of the course, where you can validate your knowledge. Let's take a more
detailed view of the course content. In each lesson within each module, you'll explore two
foundational data science methodology stages. In Module 1, Lesson 1, you'll explore the stages,
business understanding and analytic approach. In Module 1, Lesson 2, you'll discover how data scientists work during the data requirements and data collection stages. In Module 2, Lesson 1, you'll examine
the work that happens during the data understanding and data preparation stages. Next, in Module
2, Lesson 2, you'll study the modeling and evaluation stages. And finally, in Module 3, you'll
focus on the deployment and feedback stages. The case study in the course highlights how to apply
data science methodology in the following real world scenario. A healthcare facility has a limited
budget to properly address a patient's condition before their initial discharge. The core question is
what is the best way to allocate these funds to maximize their use to provide quality care? As you'll
see, if the new data science pilot program is successful, the facility will deliver better patient care by
giving physicians new tools to incorporate timely data driven information into patient care decisions.
When viewing case study information, use the icons displayed at the top right hand corner of your
screen to help you differentiate theory from practice. While participating in the course, if you encounter challenges or have questions, you can find support and answers and connect with other learners in the course's discussion forums. You're ready for this course if you have basic data science knowledge and you know how to use Jupyter Notebooks. And now, welcome to the course. We look
forward to you completing this course, earning a valuable certificate, and continuing your path to a
data science career.

Syllabus

Module 1: From Problem to Approach and From Requirements to Collection


Video: Course Introduction
Reading: Helpful Tips for Course Completion
Reading: Syllabus
Lesson 1: From Problem to Approach
Video: Data Science Methodology Overview
Video: Business Understanding
Video: Analytic Approach
Hands-on Lab: From Problem to Approach
Reading: Lesson 1 Summary: From Problem to Approach
Practice Quiz: From Problem to Approach
Glossary: From Problem to Approach
Graded Quiz: From Problem to Approach
Lesson 2: From Requirements to Collection
Video: Data Requirements
Video: Data Collection
Hands-on Lab: From Requirements to Collection
Reading: Lesson 2 Summary: From Requirements to Collection
Practice Quiz: From Requirements to Collection
Glossary: From Requirements to Collection
Graded Quiz: From Requirements to Collection
Module 2: From Understanding to Preparation and from Modeling to Evaluation
Lesson 1: From Understanding to Preparation
Video: Data Understanding
Data Preparation - Concepts
Data Preparation - Case Study
Hands-on Lab: From Understanding to Preparation
Reading: Lesson 1 Summary: From Understanding to Preparation
Practice Quiz: From Understanding to Preparation
Glossary: From Understanding to Preparation
Graded Quiz: From Understanding to Preparation
Lesson 2: From Modeling to Evaluation
Video: Modeling - Concepts
Video: Modeling - Case Study
Video: Evaluation
Hands-on Lab: From Modeling to Evaluation
Reading: Lesson 2 Summary: From Modeling to Evaluation
Practice Quiz: From Modeling to Evaluation
Glossary: From Modeling to Evaluation
Graded Quiz: From Modeling to Evaluation
Module 3: From Deployment to Feedback
Video: Deployment
Video: Feedback
Video: Storytelling
Video: Course Summary
Reading: Module 3 Summary: From Deployment to Feedback
Practice Quiz: From Deployment to Feedback
Glossary: From Deployment to Feedback
Graded Quiz: From Deployment to Feedback
Module 4: Final Project and Assessment
Final Project
Video: Introduction to CRISP-DM
Reading: Final Assignment Overview
Peer Review: Final Assignment
Course Summary and Final Quiz
Reading: Review What You Learned
Graded Quiz: Final Quiz
Course Wrap Up
Reading: Congratulations and Next Steps
Reading: Thanks from the Course Team
Reading: IBM Digital Badge

Welcome to data science methodology overview. After watching this video, you'll be able to describe
the term methodology, relate methodology to data science and John Rollins's contributions to data science methodology, identify the 10 stages of standard data science methodology, and categorize the questions for the 10 stages of standard data science methodology. Data science is an influential domain that combines
statistical analysis, technological expertise, and domain knowledge to extract valuable insights from
extensive data sets. However, despite the recent increase in computing power and easier access to
data, we often don't understand the questions being asked or know how to apply the data correctly
to address the problem at hand. Using a methodology helps resolve those issues. What is a
methodology? A methodology is a system of methods used in a particular area of study. A
methodology is a guideline for the decisions researchers must make during the scientific process. In
the context of data science, data science methodology is a structured approach that guides data scientists in solving complex problems and making data-driven decisions. Data science
methodology also includes data collection forms, measurement strategies, and comparisons of data
analysis methods relative to different research goals and situations. Using a methodology provides
the practical guidance needed to conduct scientific research efficiently. There's often a temptation to
bypass methodology and jump directly to solutions. However, jumping to solutions hinders our best
intentions for solving problems. Next, let's explore methodology as it relates to data science. The
data science methodology discussed in this course is a methodology outlined by John Rollins, a
seasoned IBM Senior Data Scientist. This course is built on his professional experience and insights
into the importance of following a methodology for successful data science outcomes. As a general
methodology, data science methodology consists of the following 10 stages. Business
understanding, analytic approach, data requirements, data collection, data understanding, data
preparation, modeling, evaluation, deployment, and feedback. Asking questions is the cornerstone of
success in data science. Questions drive every stage of data science methodology. Data science
methodology aims to answer the following 10 basic questions, which align with its 10 stages. These first two questions help you define the issue and determine what approach to use.
You'll ask, what is the problem that you're trying to solve? How can you use data to answer the
question? You'll use the next four questions to help you get organized around the data. You'll ask,
what data do you need to answer the question, where is the data sourced from, and how will you
receive the data? Does the data you collect represent the problem to be solved and what additional
work is required to manipulate and work with the data? Then you'll use these final four questions to
validate your approach and final design for ongoing analysis. You'll ask, when you apply data
visualizations, do you see answers that address the business problem? Does the data model answer
the initial business question or must you adjust the data? Can you put the model into practice? Can
you get constructive feedback from the data and the stakeholder to answer the business question?
In this video, you learned that data science methodology guides data scientists in solving complex
problems with data. A methodology also includes data collection forms, measurement strategies,
and comparisons of data analysis methods relative to different research goals and situations. As a
general methodology, data science methodology consists of the following 10 stages. Business
understanding, analytic approach, data requirements, data collection, data understanding, data
preparation, modeling, evaluation, deployment, and feedback. The 10 questions align with defining the business issue, determining an approach, organizing your data, and validating your
approach for the final data design.


Business Understanding
Welcome to Data Science Methodology 101 From Problem to Approach Business Understanding!
Has this ever happened to you? You've been called into a meeting by your boss, who makes you
aware of an important task one with a very tight deadline that absolutely has to be met. You both go
back and forth to ensure that all aspects of the task have been considered and the meeting ends
with both of you confident that things are on track. Later that afternoon, however, after you've spent
some time examining the various issues at play, you realize that you need to ask several additional
questions in order to truly accomplish the task. Unfortunately, the boss won't be available again until
tomorrow morning. Now, with the tight deadline still ringing in your ears, you start feeling a sense of
uneasiness. So, what do you do? Do you risk moving forward, or do you stop and seek clarification?
Data science methodology begins with spending the time to seek clarification, to attain what can be
referred to as a business understanding. Having this understanding is placed at the beginning of the
methodology because getting clarity around the problem to be solved, allows you to determine which
data will be used to answer the core question. Rollins suggests that having a clearly defined
question is vital because it ultimately directs the analytic approach that will be needed to address the
question. All too often, much effort is put into answering what people THINK is the question, and
while the methods used to address that question might be sound, they don't help to solve the actual
problem. Establishing a clearly defined question starts with understanding the GOAL of the person
who is asking the question. For example, if a business owner asks: "How can we reduce the costs of
performing an activity?" We need to understand, is the goal to improve the efficiency of the activity?
Or is it to increase the business's profitability? Once the goal is clarified, the next piece of the
puzzle is to figure out the objectives that are in support of the goal. By breaking down the objectives,
structured discussions can take place where priorities can be identified in a way that can lead to
organizing and planning on how to tackle the problem. Depending on the problem, different
stakeholders will need to be engaged in the discussion to help determine requirements and clarify
questions. So now, let's look at the case study related to applying "Business Understanding" In the
case study, the question being asked is: What is the best way to allocate the limited healthcare
budget to maximize its use in providing quality care? This question is one that became a hot topic for
an American healthcare insurance provider. As public funding for readmissions was decreasing, this
insurance company was at risk of having to make up for the cost difference, which could potentially
increase rates for its customers. Knowing that raising insurance rates was not going to be a popular
move, the insurance company sat down with the health care authorities in its region and brought in
IBM data scientists to see how data science could be applied to the question at hand. Before even
starting to collect data, the goals and objectives needed to be defined. After spending time to
determine the goals and objectives, the team prioritized "patient readmissions" as an effective area
for review. With the goals and objectives in mind, it was found that approximately 30% of individuals
who finish rehab treatment would be readmitted to a rehab center within one year; and that 50%
would be readmitted within five years. After reviewing some records, it was discovered that the
patients with congestive heart failure were at the top of the readmission list. It was further
determined that a decision-tree model could be applied to review this scenario, to determine why this
was occurring. To gain the business understanding that would guide the analytics team in
formulating and performing their first project, the IBM Data scientists, proposed and delivered an
on-site workshop to kick things off. The key business sponsor's involvement throughout the project was critical, in that the sponsor set the overall direction, remained engaged and provided guidance, and ensured necessary support where needed. Finally, four business requirements were identified for whatever model would be built: predicting readmission outcomes for those patients with congestive heart failure, predicting readmission risk, understanding the combination of events that led to the predicted outcome, and applying an easy-to-understand process to new patients regarding their readmission risk. This ends the Business Understanding section of this course. Thanks for watching!

Welcome to Data Science Methodology 101 From Problem to Approach Analytic Approach!
Selecting the right analytic approach depends on the question being asked. The approach involves
seeking clarification from the person who is asking the question, so as to be able to pick the most
appropriate path or approach. In this video we'll see how the second stage of the data science
methodology is applied. Once the problem to be addressed is defined, the appropriate analytic
approach for the problem is selected in the context of the business requirements. This is the second
stage of the data science methodology. Once a strong understanding of the question is established,
the analytic approach can be selected. This means identifying what type of patterns will be needed
to address the question most effectively. If the question is to determine probabilities of an action,
then a predictive model might be used. If the question is to show relationships, a descriptive approach may be required. This would be one that looks at clusters of similar activities based on events and preferences. Statistical analysis applies to problems that require counts. For example, if the question requires a yes/no answer, then a classification approach to predicting a
response would be suitable. Machine Learning is a field of study that gives computers the ability to
learn without being explicitly programmed. Machine Learning can be used to identify relationships
and trends in data that might otherwise not be accessible or identified. In the case where the
question is to learn about human behaviour, then an appropriate response would be to use
Clustering Association approaches. So now, let's look at the case study related to applying Analytic
Approach. For the case study, a decision tree classification model was used to identify the
combination of conditions leading to each patient's outcome. In this approach, examining the
variables in each of the nodes along each path to a leaf, led to a respective threshold value. This
means the decision tree classifier provides both the predicted outcome, as well as the likelihood of
that outcome, based on the proportion of the dominant outcome, yes or no, in each group. From this
information, the analysts can obtain the readmission risk, or the likelihood of a yes for each patient. If
the dominant outcome is yes, then the risk is simply the proportion of yes patients in the leaf. If it is
no, then the risk is 1 minus the proportion of no patients in the leaf. A decision tree classification
model is easy for non-data scientists to understand and apply, to score new patients for their risk of
readmission. Clinicians can readily see what conditions are causing a patient to be scored as
high-risk and multiple models can be built and applied at various points during hospital stay. This
gives a moving picture of the patient's risk and how it is evolving with the various treatments being
applied. For these reasons, the decision tree classification approach was chosen for building the
Congestive Heart Failure readmission model. This ends the Analytic Approach section for this
course.

Business understanding questions

The company's e-commerce business goal is to optimize its pricing strategy to maximize revenue and profitability. By leveraging data science, the company aims to identify patterns in historical sales data, pricing changes, and customer behavior to make informed decisions on pricing and promotional strategies.

Analytical Approach
Identifying the pattern to address the question

Business Goal
A transportation company aims to optimize its delivery routes and schedules to minimize costs and
improve delivery efficiency. The company wants to use data science to identify the most optimal
routes and delivery time windows based on historical delivery data and external factors such as
traffic and weather conditions.
Various questions are targeted by data scientists to achieve this business goal.
Glossary
Module 1 Lesson 1: From Problem to Approach

Welcome! This alphabetized glossary contains many of the terms you'll find within this lesson. These
terms are important for you to recognize when working in the industry, when participating in user
groups, and when participating in other certificate programs.

Term Definition
Analytic Approach The process of selecting the appropriate method or path to address a specific
data science question or problem.
Analytics The systematic analysis of data using statistical, mathematical, and computational
techniques to uncover insights, patterns, and trends.
Business Understanding The initial phase of data science methodology involves seeking
clarification and understanding the goals, objectives, and requirements of a given task or problem.
Clustering Association An approach used to learn about human behavior and identify patterns and
associations in data.
Cohort A group of individuals who share a common characteristic or experience, studied or analyzed as a unit.
Cohort study An observational study where a group of individuals with a specific characteristic or
exposure is followed over time to determine the incidence of outcomes or the relationship between
exposures and outcomes.
Congestive Heart Failure (CHF) A chronic condition in which the heart cannot pump enough
blood to meet the body's needs, resulting in fluid buildup and symptoms such as shortness of breath
and fatigue.
CRISP-DM Cross-Industry Standard Process for Data Mining is a widely used methodology for
data mining and analytics projects encompassing six phases: business understanding, data
understanding, data preparation, modeling, evaluation, and deployment.
Data analysis The process of inspecting, cleaning, transforming, and modeling data to discover
useful information, draw conclusions, and support decision-making.
Data cleansing The process of identifying and correcting or removing errors, inconsistencies, or
inaccuracies in a dataset to improve its quality and reliability
Data science An interdisciplinary field that combines scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured and unstructured data.
Data science methodology A structured approach to solving business problems using data
analysis and data-driven insights.
Data scientist A professional using scientific methods, algorithms, and tools to analyze data, extract
insights, and develop models or solutions to complex business problems.
Data scientists Professionals with data science and analytics expertise who apply their skills to solve
business problems.
Data-Driven Insights Insights derived from analyzing and interpreting data to inform
decision-making
Decision tree A supervised machine learning algorithm that uses a tree-like structure of decisions
and their possible consequences to make predictions or classify instances.
Decision Tree Classification Model A model that uses a tree-like structure to classify data based on conditions and thresholds; it provides predicted outcomes and associated probabilities.
Decision Tree Classifier A classification model that uses a decision tree to determine
outcomes based on specific conditions and thresholds.
Decision-Tree Model A model used to review scenarios and identify relationships in data, such as
the reasons for patient readmissions
Descriptive approach An approach used to show relationships and identify clusters of similar
activities based on events and preferences
Descriptive modeling Modeling technique that focuses on describing and summarizing data, often
through statistical analysis and visualization, without making predictions or inferences
Domain knowledge Expertise and understanding of a specific subject area or field, including its
concepts, principles, and relevant data
Goals and objectives The sought-after outcomes and specific objectives that support the overall
goal of the task or problem.
Iteration A single cycle or repetition of a process often involves refining or modifying a solution
based on feedback or new information.
Iterative process A process that involves repeating a series of steps or actions to refine and
improve a solution or analysis. Each iteration builds upon the previous one.
Leaf The final nodes of a decision tree where data is categorized into specific outcomes.
Machine Learning A field of study that enables computers to learn from data without being
explicitly programmed, identifying hidden relationships and trends.
Mean The average value of a set of numbers, calculated by summing all the values and dividing by the total number of values.
Median The middle value in a set of numbers arranged in ascending or descending order; it divides the data into two equal halves.
Model (Conceptual model) A simplified representation or abstraction of a real-world system or
phenomenon used to understand, analyze, or predict its behavior.
Model building The process of developing predictive models to gain insights and make informed
decisions based on data analysis.
Pairwise comparison (correlation) A statistical technique that measures the strength and
direction of the linear relationship between two variables by calculating a correlation coefficient.
Pattern A recurring or noticeable arrangement or sequence in data that can provide insights or be used for prediction or classification.
Predictive model A model used to determine probabilities of an action or outcome based on
historical data.
Predictors Variables or features in a model that are used to predict or explain the outcome
variable or target variable.
Prioritization The process of organizing objectives and tasks based on their importance and
impact on the overall goal.
Problem solving The process of addressing challenges and finding solutions to achieve
desired outcomes.
Stakeholders Individuals or groups with a vested interest in the data science model's outcome and
its practical application, such as solution owners, marketing, application developers, and IT
administration.
Standard deviation A measure of the dispersion or variability of a set of values from their mean; it provides information about the spread or distribution of the data.
Statistical analysis The application of statistical techniques to problems that require counts, such as yes/no answers or classification tasks.
Statistics The collection, analysis, interpretation, presentation, and organization of data to
understand patterns, relationships, and variability in the data.
Structured data (data model) Data organized and formatted according to a predefined schema or
model and is typically stored in databases or spreadsheets.
Text analysis data mining The process of extracting useful information or knowledge from
unstructured textual data through techniques such as natural language processing, text mining, and
sentiment analysis.
Threshold value The specific value used to split data into groups or categories in a decision
tree.
Welcome to Data Science Methodology 101 From Requirements to Collection Data Requirements! If
your goal is to make a spaghetti dinner but you don't have the right ingredients to make the dish,
then your success will be compromised. Think of this section of the data science methodology as
cooking with data. Each step is critical in making the meal. So, if the problem that needs to be
resolved is the recipe, so to speak, and data is an ingredient, then the data scientist needs to
identify: which ingredients are required, how to source or to collect them, how to understand or work
with them, and how to prepare the data to meet the desired outcome. Building on the understanding
of the problem at hand, and then using the analytical approach selected, the Data Scientist is ready
to get started. Now let's look at some examples of the data requirements within the data science
methodology. Prior to undertaking the data collection and data preparation stages of the
methodology, it's vital to define the data requirements for decision-tree classification. This includes
identifying the necessary data content, formats and sources for initial data collection. So now, let's
look at the case study related to applying "Data Requirements". In the case study, the first task was
to define the data requirements for the decision tree classification approach that was selected. This
included selecting a suitable patient cohort from the health insurance provider's member base. In
order to compile the complete clinical histories, three criteria were identified for inclusion in the
cohort. First, a patient needed to be admitted as an in-patient within the provider's service area, so they'd
have access to the necessary information. Second, they focused on patients with a primary
diagnosis of congestive heart failure during one full year. Third, a patient must have had continuous
enrollment for at least six months, prior to the primary admission for congestive heart failure, so that
complete medical history could be compiled. Congestive heart failure patients who also had been
diagnosed as having other significant medical conditions, were excluded from the cohort because
those conditions would cause higher-than-average re-admission rates and, thus, could skew the
results. Then the content, format, and representations of the data needed for decision tree
classification were defined. This modeling technique requires one record per patient, with columns
representing the variables in the model. To model the readmission outcome, there needed to be data
covering all aspects of the patient's clinical history. This content would include admissions, primary,
secondary, and tertiary diagnoses, procedures, prescriptions, and other services provided either
during hospitalization or throughout patient/doctor visits. Thus, a particular patient could have
thousands of records, representing all their related attributes. To get to the one record per patient
format, the data scientists rolled up the transactional records to the patient level, creating a number
of new variables to represent that information. This was a job for the data preparation stage, so
thinking ahead and anticipating subsequent stages is important. This ends the Data Requirements
section for this course.
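
To illustrate the roll-up just described, here is a hedged Python sketch of turning transactional records into one record per patient; the column names and values are hypothetical and are not taken from the case study data.

# Hypothetical sketch: roll up transactional records (many rows per patient)
# into one record per patient, creating new variables in the process.
import pandas as pd

transactions = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 2],
    "record_type": ["admission", "prescription", "admission", "procedure", "procedure"],
    "cost": [1200.0, 80.0, 950.0, 400.0, 300.0],
})

per_patient = (
    transactions.pivot_table(index="patient_id", columns="record_type",
                             values="cost", aggfunc="count", fill_value=0)
    .add_prefix("n_")  # e.g. n_admission, n_prescription, n_procedure
    .join(transactions.groupby("patient_id")["cost"].sum().rename("total_cost"))
)
print(per_patient)  # one row per patient; the columns become model variables
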
Welcome to Data Science Methodology 101 From Requirements to Collection Data Collection! After
the initial data collection is performed, an assessment by the data scientist takes place to determine
whether or not they have what they need. As is the case when shopping for ingredients to make a
meal, some ingredients might be out of season and more difficult to obtain or cost more than initially
thought. In this phase the data requirements are revised and decisions are made as to whether or
not the collection requires more or less data. Once the data ingredients are collected, then in the
data collection stage, the data scientist will have a good understanding of what they will be working
with. Techniques such as descriptive statistics and visualization can be applied to the data set, to
assess the content, quality, and initial insights about the data. Gaps in data will be identified and
plans to either fill or make substitutions will have to be made. In essence, the ingredients are now
sitting on the cutting board. Now let's look at some examples of the data collection stage within the
data science methodology. This stage is undertaken as a follow-up to the data requirements stage.
So now, let's look at the case study related to applying "Data Collection". Collecting data requires
that you know the source or, know where to find the data elements that are needed. In the context of
our case study, these can include: demographic, clinical and coverage information of patients,
provider information, claims records, as well as pharmaceutical and other information related to all
the diagnoses of the congestive heart failure patients. For this case study, certain drug information
was also needed, but that data source was not yet integrated with the rest of the data sources. This
leads to an important point: It is alright to defer decisions about unavailable data, and attempt to
acquire it at a later stage. For example, this can even be done after getting some intermediate
results from the predictive modeling. If those results suggest that the drug information might be
important in obtaining a good model, then the time to try to get it would be invested. As it turned out
though, they were able to build a reasonably good model without this drug information. DBAs and
programmers often work together to extract data from various sources, and then merge it. This
allows for removing redundant data, making it available for the next stage of the methodology, which
is data understanding. At this stage, if necessary, data scientists and analytics team members can
discuss various ways to better manage their data, including automating certain processes in the
database, so that data collection is easier and faster.
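
As a rough illustration of that extract-and-merge step, the following Python sketch combines two hypothetical extracts and removes redundant rows; the table and column names are invented for the example.

# Hypothetical sketch: merge data extracted from two sources and remove
# redundant (duplicate) records before the data understanding stage.
import pandas as pd

claims = pd.DataFrame({
    "patient_id": [1, 2, 2],
    "claim_id": [101, 102, 102],  # claim 102 was extracted twice: redundant
    "primary_diagnosis": ["CHF", "CHF", "CHF"],
})
demographics = pd.DataFrame({
    "patient_id": [1, 2],
    "age": [67, 72],
})

merged = (
    claims.drop_duplicates()  # remove the redundant claim record
          .merge(demographics, on="patient_id", how="left")  # combine sources
)
print(merged)
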
Now that the data collection stage is complete, data scientists typically use descriptive statistics and
visualization techniques to better understand the data and get acquainted with it. Data scientists,
essentially, explore the data to:

understand its content,
assess its quality,
discover any interesting preliminary insights, and
determine whether additional data is necessary to fill any gaps in the data.

● Data scientists apply descriptive statistics and visualization techniques to thoroughly assess the content, quality, and initial insights gained from the collected data, identify gaps, and determine whether new data is needed or existing data should be substituted, as sketched below.
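
The first-pass exploration described above can be sketched in a few lines of Python; the function below assumes the collected data has already been loaded into a pandas DataFrame (for example, via the "Insert to code" step shown earlier), and the tiny example DataFrame is invented only to make the sketch runnable.

# Minimal first-pass exploration of a collected data set, as described above.
import pandas as pd

def explore(df: pd.DataFrame) -> None:
    print(df.shape)                    # content: number of rows and columns
    print(df.dtypes)                   # content: the format of each column
    print(df.describe(include="all"))  # quality and initial insights: summary statistics
    print(df.isnull().sum())           # gaps: missing values per column

# Invented example data, only to make the sketch self-contained.
explore(pd.DataFrame({"age": [67, None, 72], "gender": ["F", "M", None]}))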

Module 1 Lesson 2: From Requirements to Collection



Welcome! This alphabetized glossary contains many of the terms you'll find within this lesson. These
terms are important for you to recognize when working in the industry, when participating in user
groups, and when participating in other certificate programs.

Term Definition
Analytics team A group of professionals, including data scientists and analysts, responsible for
performing data analysis and modeling.
Data collection The process of gathering data from various sources, including demographic, clinical,
coverage, and pharmaceutical information.
Data integration The merging of data from multiple sources to remove redundancy and
prepare it for further analysis.
Data Preparation The process of organizing and formatting data to meet the requirements of
the modeling technique.
Data Requirements The identification and definition of the necessary data elements, formats, and
sources required for analysis.
Data Understanding A stage where data scientists discuss various ways to manage data
effectively, including automating certain processes in the database.
DBAs (Database Administrators) The professionals who are responsible for managing and
extracting data from databases.
Decision tree classification A modeling technique that uses a tree-like structure to classify data
based on specific conditions and variables.
Demographic information Information about patient characteristics, such as age, gender, and
location.
Descriptive statistics Techniques used to analyze and summarize data, providing initial insights
and identifying gaps in data.
Intermediate results Partial results obtained from predictive modeling that can influence decisions on acquiring additional data.
Patient cohort A group of patients with specific criteria selected for analysis in a study or model.
Predictive modeling The building of models to predict future outcomes based on historical data.
Training set A subset of data used to train or fit a machine learning model; consists of input data
and corresponding known or labeled output values.
Unavailable data Data elements that are not currently accessible or integrated into the data sources.
Univariate modeling Analysis focused on a single variable or feature at a time, considering its characteristics and relationship to other variables independently.
Unstructured data Data that does not have a predefined structure or format, typically text, images, audio, or video; it requires special techniques to extract meaning or insights.
Visualization The process of representing data visually to gain insights into its content and quality.

Welcome to Data Science Methodology 101 From Understanding to Preparation Data Understanding! Data understanding encompasses all activities related to constructing the data set.
Essentially, the data understanding section of the data science methodology answers the question:
Is the data that you collected representative of the problem to be solved? Let's apply the data
understanding stage of our methodology, to the case study we've been examining. In order to
understand the data related to congestive heart failure admissions, descriptive statistics needed to
be run against the data columns that would become variables in the model. First, these statistics included univariate statistics on each variable, such as mean, median, minimum,
maximum, and standard deviation. Second, pairwise correlations were used, to see how closely
certain variables were related, and which ones, if any, were very highly correlated, meaning that they
would be essentially redundant, thus making only one relevant for modeling. Third, histograms of the
variables were examined to understand their distributions. Histograms are a good way to understand
how values of a variable are distributed, and which sorts of data preparation may be needed to
make the variable more useful in a model. For example, for a categorical variable that has too many
distinct values to be informative in a model, the histogram would help them decide how to
consolidate those values. The univariate statistics and histograms are also used to assess data
quality. From the information provided, certain values can be re-coded or perhaps even dropped if
necessary, such as when a certain variable has missing values. The question then becomes, does "missing" mean anything? Sometimes a missing value might mean "no" or "0" (zero), and at other times it simply means "we don't know." A variable might also contain invalid or misleading values, such as a numeric variable called "age" that contains values from 0 to 100 and also 999, where that "triple-9" actually means "missing" but would be treated as a valid value unless we corrected it. Initially, the meaning
of congestive heart failure admission was decided on the basis of a primary diagnosis of congestive
heart failure. But working through the data understanding stage revealed that the initial definition
was not capturing all of the congestive heart failure admissions that were expected, based on clinical
experience. This meant looping back to the data collection stage and adding secondary and tertiary
diagnoses, and building a more comprehensive definition of congestive heart failure admission. This
is just one example of the interactive processes in the methodology. The more one works with the
problem and the data, the more one learns and therefore the more refinement that can be done
within the model, ultimately leading to a better solution to the problem. This ends the Data
Understanding section of this course. Thanks for watching! (music)
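
The case study does not name a specific tool, but the descriptive statistics, pairwise correlations, histograms, and re-coding described above can be sketched in Python with pandas. The table below is a small made-up stand-in for the admissions data, so the column names and values are illustrative only.

import numpy as np
import pandas as pd

# Small made-up admissions table standing in for the case-study data.
df = pd.DataFrame({
    "age":              [63, 71, 999, 58, 80],   # 999 is a "missing" sentinel
    "prior_admissions": [1, 3, 0, 2, 4],
    "length_of_stay":   [4, 9, 3, 5, 12],
})

# Univariate statistics: mean, median, minimum, maximum, standard deviation.
print(df.describe())

# Pairwise correlations, to spot variables so highly correlated they are redundant.
print(df.corr())

# Histograms to inspect each variable's distribution (requires matplotlib).
df.hist()

# Data quality: re-code the 999 sentinel so it is treated as missing, not as a valid age.
df["age"] = df["age"].replace(999, np.nan)
print(df["age"].isna().sum(), "missing age value(s)")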

Welcome to Data Science Methodology 101 From Understanding to Preparation Data Preparation -
Concepts! In a sense, data preparation is similar to washing freshly picked vegetables in so far as
unwanted elements, such as dirt or imperfections, are removed. Together with data collection and
data understanding, data preparation is the most time-consuming phase of a data science project,
typically taking 70% and sometimes up to 90% of the overall project time. Automating some of the
data collection and preparation processes in the database, can reduce this time to as little as 50%.
This time savings translates into increased time for data scientists to focus on creating models. To
continue with our cooking metaphor, we know that chopping an onion to a finer state will allow its flavours to spread through a sauce more easily than would be the case if we were to
drop the whole onion into the sauce pot. Similarly, transforming data in the data preparation phase is
the process of getting the data into a state where it may be easier to work with. Specifically, the data
preparation stage of the methodology answers the question: What are the ways in which data is
prepared? To work effectively with the data, it must be prepared in a way that addresses missing or
invalid values and removes duplicates, toward ensuring that everything is properly formatted.
Feature engineering is also part of data preparation. It is the process of using domain knowledge of
the data to create features that make the machine learning algorithms work. A feature is a
characteristic that might help when solving a problem. Features within the data are important to
predictive models and will influence the results you want to achieve. Feature engineering is critical
when machine learning tools are being applied to analyze the data. When working with text, text
analysis steps for coding the data are required to be able to manipulate the data. The data scientist
needs to know what they're looking for within their dataset to address the question. The text analysis
is critical to ensure that the proper groupings are set, and that the programming is not overlooking
what is hidden within. The data preparation phase sets the stage for the next steps in addressing the
question. While this phase may take a while to do, if done right the results will support the project. If
this is skipped over, then the outcome will not be up to par and may have you back at the drawing
board. It is vital to take your time in this area, and use the tools available to automate common steps
to accelerate data preparation. Make sure to pay attention to the detail in this area. After all, it takes
just one bad ingredient to ruin a fine meal. This ends the Data Preparation section of this course, in
which we've reviewed key concepts. Thanks for watching! (Music)
● During the Data Preparation stage, data scientists must address missing or invalid values,
remove duplicates, and validate that the data is properly formatted.
● Feature engineering, also part of the Data Preparation stage, uses domain knowledge of the
data to create features that make the machine learning algorithms work.
● Text analysis during the Data Preparation stage is critical for validating that the proper
groupings are set and that the programming is not overlooking hidden data.
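
As a minimal sketch, assuming a small made-up pandas DataFrame with hypothetical column names, the preparation steps listed above might look like this in Python:

import numpy as np
import pandas as pd

# Made-up patient records; the columns are hypothetical.
df = pd.DataFrame({
    "patient_id":     [1, 2, 2, 3],
    "age":            [63, 999, 999, 58],          # 999 used as a "missing" sentinel
    "gender":         [" f", "M", "M", "F "],
    "discharge_date": ["2023-01-05", "2023-02-11", "2023-02-11", "2023-03-02"],
})

# Address missing or invalid values.
df["age"] = df["age"].replace(999, np.nan)

# Remove duplicate records.
df = df.drop_duplicates()

# Validate that everything is properly formatted.
df["gender"] = df["gender"].str.strip().str.upper()
df["discharge_date"] = pd.to_datetime(df["discharge_date"])

# Simple feature engineering: derive a new feature from an existing column.
df["discharge_month"] = df["discharge_date"].dt.month

print(df)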

Module 2 Lesson 1: From Understanding to Preparation


Welcome! This alphabetized glossary contains many of the terms you'll find within this lesson. These
terms are important for you to recognize when working in the industry, when participating in user
groups, and when participating in other certificate programs.

Term Definition
Automation Using tools and techniques to streamline data collection and preparation processes.
Data Collection The phase of gathering and assembling data from various sources.
Data Compilation The process of organizing and structuring data to create a comprehensive
data set.
Data Formatting The process of standardizing the data to ensure uniformity and ease of
analysis.
Data Manipulation The process of transforming data into a usable format.
Data Preparation The phase where data is cleaned, transformed, and organized for further analysis and modeling, including feature engineering and text analysis.
Data Quality Assessment of data integrity and completeness, addressing missing, invalid, or
misleading values.
Data Quality Assessment The evaluation of data integrity, accuracy, and completeness.
Data Set A collection of data used for analysis and modeling.
Data Understanding The stage in the data science methodology focused on exploring and
analyzing the collected data to ensure that the data is representative of the problem to be solved.
Descriptive Statistics Summary statistics that data scientists use to describe and understand the
distribution of variables, such as mean, median, minimum, maximum, and standard deviation.
Feature A characteristic or attribute within the data that helps in solving the problem.
Feature Engineering The process of creating new features or variables based on domain
knowledge to improve machine learning algorithms' performance.
Feature Extraction Identifying and selecting relevant features or attributes from the data set.
Interactive Processes Iterative and continuous refinement of the methodology based on insights
and feedback from data analysis.
Missing Values Values that are absent or unknown in the dataset, requiring careful handling during
data preparation.
Model Calibration Adjusting model parameters to improve accuracy and alignment with the
initial design.
Pairwise Correlations An analysis to determine the relationships and correlations between different
variables.
Text Analysis Steps to analyze and manipulate textual data, extracting meaningful information and
patterns.
Text Analysis Groupings Creating meaningful groupings and categories from textual data for
analysis.
Visualization techniques Methods and tools that data scientists use to create visual
representations or graphics that enhance the accessibility and understanding of data patterns,
relationships, and insights.

Welcome to Data Science Methodology 101 From Modeling to Evaluation Modeling - Concepts!
Modelling is the stage in the data science methodology where the data scientist has the chance to sample the sauce and determine if it's bang on or in need of more seasoning! This portion of the
course is geared toward answering two key questions: First, what is the purpose of data modeling,
and second, what are some characteristics of this process? Data Modelling focuses on developing
models that are either descriptive or predictive. An example of a descriptive model might examine
things like: if a person did this, then they're likely to prefer that. A predictive model tries to yield
yes/no, or stop/go type outcomes. These models are based on the analytic approach that was taken,
either statistically driven or machine learning driven. The data scientist will use a training set for
predictive modelling. A training set is a set of historical data in which the outcomes are already
known. The training set acts like a gauge to determine if the model needs to be calibrated. In this
stage, the data scientist will play around with different algorithms to ensure that the variables in play
are actually required. The success of data compilation, preparation and modelling, depends on the
understanding of the problem at hand, and the appropriate analytical approach being taken. The
data supports the answering of the question, and like the quality of the ingredients in cooking, sets
the stage for the outcome. Constant refinement, adjustments and tweaking are necessary within
each step to ensure the outcome is one that is solid. In John Rollins' descriptive Data Science
Methodology, the framework is geared to do 3 things: First, understand the question at hand.
Second, select an analytic approach or method to solve the problem, and third, obtain, understand,
prepare, and model the data. The end goal is to move the data scientist to a point where a data
model can be built to answer the question. With dinner just about to be served and a hungry guest at
the table, the key question is: Have I made enough to eat? Well, let's hope so. In this stage of the
methodology, model evaluation, deployment, and feedback loops ensure that the answer is near and
relevant. This relevance is critical to the data science field overall, as it is a fairly new field of study,
and we are interested in the possibilities it has to offer. The more people that benefit from the
outcomes of this practice, the further the field will develop. This ends the Modeling to Evaluation
section of this course, in which we reviewed the key concepts related to modeling. Thanks for
watching! (Music)

Welcome to Data Science Methodology 101 From Modeling to Evaluation Modeling - Case Study!
Modelling is the stage in the data science methodology where the data scientist has the chance to
sample the sauce and determine if it's bang on or in need of more seasoning! Now, let's apply the
case study to the modeling stage within the data science methodology. Here, we'll discuss one of the
many aspects of model building, in this case, parameter tuning to improve the model. With a
prepared training set, the first decision tree classification model for congestive heart failure
readmission can be built. We are looking for patients with high-risk readmission, so the outcome of
interest will be congestive heart failure readmission equals "yes". In this first model, overall accuracy
in classifying the yes and no outcomes was 85%. This sounds good, but it represents only 45% of
the "yes". The actual readmissions are correctly classified, meaning that the model is not very
accurate. The question then becomes: How could the accuracy of the model be improved in
predicting the yes outcome? For decision tree classification, the best parameter to adjust is the
relative cost of misclassified yes and no outcomes. Think of it like this: When a true non-readmission is misclassified, and action is taken to reduce that patient's risk, the cost of that
error is the wasted intervention. A statistician calls this a type I error, or a false-positive. But when a
true readmission is misclassified, and no action is taken to reduce that risk, then the cost of that
error is the readmission and all its attendant costs, plus the trauma to the patient. This is a type II
error, or a false-negative. So we can see that the costs of the two different kinds of misclassification
errors can be quite different. For this reason, it's reasonable to adjust the relative weights of
misclassifying the yes and no outcomes. The default is 1-to-1, but the decision tree algorithm allows
the setting of a higher value for yes. For the second model, the relative cost was set at 9-to-1. This is
a very high ratio, but it gives more insight into the model's behaviour. This time the model correctly
classified 97% of the yes, but at the expense of a very low accuracy on the no, with an overall
accuracy of only 49%. This was clearly not a good model. The problem with this outcome is the large
number of false-positives, which would recommend unnecessary and costly intervention for patients,
who would not have been re-admitted anyway. Therefore, the data scientist needs to try again to find
a better balance between the yes and no accuracies. For the third model, the relative cost was set at
a more reasonable 4-to-1. This time, 68% accuracy was obtained on the yes, called sensitivity by
statisticians, and 85% accuracy on the no, called specificity, with an overall accuracy of 81%. This is
the best balance that can be obtained with a rather small training set through adjusting the relative
cost of misclassified yes and no outcomes parameter. A lot more work goes into the modeling, of
course, including iterating back to the data preparation stage to redefine some of the other variables,
so as to better represent the underlying information, and thereby improve the model. This concludes
the Modeling section of the course, in which we applied the Case Study to the modeling stage within
the data science methodology. Thanks for watching! (Music)
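
The case study does not specify the modeling tool, but the idea of weighting misclassified yes outcomes more heavily can be sketched in Python with scikit-learn, where the class_weight parameter of a decision tree plays the role of the relative misclassification cost. The data below is synthetic stand-in data, not the case-study data.

from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the readmission training set.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weight "yes" (1) four times as heavily as "no" (0), mirroring the 4-to-1 relative cost.
model = DecisionTreeClassifier(class_weight={0: 1, 1: 4}, random_state=0)
model.fit(X_train, y_train)

# Sensitivity (accuracy on yes) and specificity (accuracy on no).
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))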

Welcome to Data Science Methodology 101 From Modeling to Evaluation - Evaluation! Model evaluation goes hand-in-hand with model building; as such, the modeling and evaluation stages are
done iteratively. Model evaluation is performed during model development and before the model is
deployed. Evaluation allows the quality of the model to be assessed but it's also an opportunity to
see if it meets the initial request. Evaluation answers the question: Does the model used really
answer the initial question or does it need to be adjusted? Model evaluation can have two main
phases. The first is the diagnostic measures phase, which is used to ensure the model is working as
intended. If the model is a predictive model, a decision tree can be used to evaluate whether the answer the model outputs is aligned with the initial design. It can be used to see where there are areas that
require adjustments. If the model is a descriptive model, one in which relationships are being
assessed, then a testing set with known outcomes can be applied, and the model can be refined as
needed. The second phase of evaluation that may be used is statistical significance testing. This
type of evaluation can be applied to the model to ensure that the data is being properly handled and
interpreted within the model. This is designed to avoid unnecessary second guessing when the
answer is revealed. So now, let's go back to our case study so that we can apply the "Evaluation"
component within the data science methodology. Let's look at one way to find the optimal model
through a diagnostic measure based on tuning one of the parameters in model building. Specifically
we'll see how to tune the relative cost of misclassifying yes and no outcomes. As shown in this table,
four models were built with four different relative misclassification costs. As we see, each value of
this model-building parameter increases the true-positive rate, or sensitivity, of the accuracy in
predicting yes, at the expense of lower accuracy in predicting no, that is, an increasing false-positive
rate. The question then becomes, which model is best based on tuning this parameter? For
budgetary reasons, the risk-reducing intervention could not be applied to most or all congestive heart
failure patients, many of whom would not have been readmitted anyway. On the other hand, the
intervention would not be as effective in improving patient care as it should be, with not enough
high-risk congestive heart failure patients targeted. So, how do we determine which model was
optimal? As you can see on this slide, the optimal model is the one giving the maximum separation between the blue ROC curve and the red baseline. We can see that model 3, with a relative
misclassification cost of 4-to-1, is the best of the 4 models. And just in case you were wondering,
ROC stands for receiver operating characteristic curve, which was first developed during World War
II to detect enemy aircraft on radar. It has since been used in many other fields as well. Today it is
commonly used in machine learning and data mining. The ROC curve is a useful diagnostic tool in
determining the optimal classification model. This curve quantifies how well a binary classification
model performs in classifying the yes and no outcomes when some discrimination criterion is
varied. In this case, the criterion is a relative misclassification cost. By plotting the true-positive rate
against the false-positive rate for different values of the relative misclassification cost, the ROC curve
helped in selecting the optimal model. This ends the Evaluation section of this course. Thanks for
watching! (Music)
● The Evaluation phase consists of two stages, the diagnostic measures phase, and the
statistical significance phase.
● During the Evaluation stage, data scientists and others assess the quality of the model and
determine if the model answers the initial Business Understanding question or if the data
model needs adjustment.
● The ROC curve, known as the receiver operating characteristic curve, is a useful diagnostic
tool for determining the optimal classification model. This curve quantifies how well a binary
classification model performs in classifying the yes and no outcomes when some
discrimination criterion is varied.
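
As a hedged sketch of how an ROC curve like the one described above can be produced in Python with scikit-learn (on synthetic stand-in data, not the case-study data):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Synthetic data standing in for the readmission data set.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Any model that outputs probabilities will do for this illustration.
scores = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# True-positive rate versus false-positive rate as the discrimination criterion is varied.
fpr, tpr, _ = roc_curve(y_test, scores)
print("area under the ROC curve:", roc_auc_score(y_test, scores))

plt.plot(fpr, tpr, label="model")                 # the ROC curve
plt.plot([0, 1], [0, 1], "--", label="baseline")  # the diagonal baseline
plt.xlabel("False-positive rate")
plt.ylabel("True-positive rate")
plt.legend()
plt.show()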

Glossary
Module 2 Lesson 2: From Modeling to Evaluation
Welcome! This alphabetized glossary contains many of the terms you'll find within this lesson. These
terms are important for you to recognize when working in the industry, when participating in user
groups, and when participating in other certificate programs.

Term Definition
Binary classification model A model that classifies data into two categories, such as yes/no or
stop/go outcomes.
Data compilation The process of gathering and organizing data required for modeling.
Data modeling The stage in the data science methodology where data scientists develop models,
either descriptive or predictive, to answer specific questions.
Descriptive model A type of model that examines relationships between variables and makes
inferences based on observed patterns.
Diagnostic measure based tuning The process of fine-tuning the model by adjusting parameters
based on diagnostic measures and performance indicators.
Diagnostic measures The evaluation of a model's performance to ensure that the model
functions as intended.
Discrimination criterion A measure used to evaluate the performance of the model in classifying
different outcomes.
False-positive rate The rate at which the model incorrectly identifies negative outcomes as
positive.
Histogram A graphical representation of the distribution of a dataset, where the data is divided
into intervals or bins, and the height of each bar represents the frequency or count of data points
falling within that interval.
Maximum separation The point where the ROC curve provides the best discrimination between
true-positive and false-positive rates, indicating the most effective model.
Model evaluation The process of assessing the quality and relevance of the model before
deployment.
Optimal model The model that provides the maximum separation between the ROC curve and the
baseline, indicating higher accuracy and effectiveness.
Receiver Operating Characteristic (ROC) A statistical curve originally developed during World War II to assess radar performance and now used to assess the performance of binary classification models.
Relative misclassification cost This measurement is a parameter in model building used to tune the
trade-off between true-positive and false-positive rates.
ROC curve (Receiver Operating Characteristic curve) A diagnostic tool used to determine the
optimal classification model's performance.
Separation Separation is the degree of discrimination achieved by the model in correctly
classifying outcomes.
Statistical significance testing Evaluation technique to verify that data is appropriately handled and
interpreted within the model.
True-positive rate The rate at which the model correctly identifies positive outcomes.

Welcome to Data Science Methodology 101 From Deployment to Feedback - Deployment! While a
data science model will provide an answer, the key to making the answer relevant and useful to
address the initial question, involves getting the stakeholders familiar with the tool produced. In a
business scenario, stakeholders have different specialties that will help make this happen, such as
the solution owner, marketing, application developers, and IT administration. Once the model is
evaluated and the data scientist is confident it will work, it is deployed and put to the ultimate test.
Depending on the purpose of the model, it may be rolled out to a limited group of users or in a test
environment, to build up confidence in applying the outcome for use across the board. So now, let's
look at the case study related to applying "Deployment." In preparation for solution deployment, the
next step was to assimilate the knowledge for the business group who would be designing and
managing the intervention program to reduce readmission risk. In this scenario, the business people
translated the model results so that the clinical staff could understand how to identify high-risk
patients and design suitable intervention actions. The goal, of course, was to reduce the likelihood
that these patients would be readmitted within 30 days after discharge. During the business
requirements stage, the Intervention Program Director and her team had wanted an application that
would provide automated, near real-time risk assessments of congestive heart failure. It also had to
be easy for clinical staff to use, preferably through a browser-based application on a tablet that
each staff member could carry around. This patient data was generated throughout the hospital stay.
It would be automatically prepared in a format needed by the model and each patient would be
scored near the time of discharge. Clinicians would then have the most up-to-date risk assessment
for each patient, helping them to select which patients to target for intervention after discharge. As
part of solution deployment, the Intervention team would develop and deliver training for the clinical
staff. Also, processes for tracking and monitoring patients receiving the intervention would have to
be developed in collaboration with IT developers and database administrators, so that the results
could go through the feedback stage and the model could be refined over time. This map is an
example of a solution deployed through a Cognos application. In this case, the case study was
hospitalization risk for patients with juvenile diabetes. Like the congestive heart failure use case, this
one used decision tree classification to create a risk model that would serve as the foundation for
this application. The map gives an overview of hospitalization risk nationwide, with an interactive
analysis of predicted risk by a variety of patient conditions and other characteristics. This slide
shows an interactive summary report of risk by patient population within a given node of the model,
so that clinicians could understand the combination of conditions for this subgroup of patients. And
this report gives a detailed summary on an individual patient, including the patient's predicted risk
and details about the clinical history, giving a concise summary for the doctor. This ends the
Deployment section of this course. Thanks for watching! (Music)
Welcome! This alphabetized glossary contains many of the terms you'll find within this lesson. These
terms are important for you to recognize when working in the industry, when participating in user
groups, and when participating in other certificate programs.

Term Definition
Browser-based application An application that users access through a web browser, typically on
a tablet or other mobile device, to provide easy access to the model's insights.
Cyclical methodology An iterative approach to the data science process, where each stage informs
and refines the subsequent stages.
Data collection refinement The process of obtaining additional data elements or information to
improve the model's performance.
Data science model The result of data analysis and modeling that provides answers to specific
questions or problems.
Feedback The process of obtaining input and comments from users and stakeholders to refine
and improve the data science model.
Model refinement The process of adjusting and improving the data science model based on
user feedback and real-world performance.
Redeployment The process of implementing a refined model and intervention actions after
incorporating feedback and improvements.
Review process The systematic assessment and evaluation of the data science model's
performance and impact.
Solution deployment The process of implementing and integrating the data science model into the
business or organizational workflow.
Solution owner The individual or team responsible for overseeing the deployment and management
of the data science solution.
Stakeholders Individuals or groups with a vested interest in the data science model's outcome and
its practical application, such as solution owners, marketing, application developers, and IT
administration.
Storytelling Storytelling is the art of conveying your message or ideas through a narrative
structure that engages, entertains, and resonates with the audience.
Test environment A controlled setting where the data science model is evaluated and refined
before full-scale implementation.

Welcome to an Introduction to CRISP-DM. After watching this video, you'll be able to, define
CRISP-DM, list and describe the six stages of the CRISP-DM model, and explain what happens
after the final CRISP-DM stage. CRISP-DM, which stands for Cross-Industry Standard Process for
Data Mining, is an industry-proven way to guide your data mining efforts. CRISP-DM is an iterative
data mining model and a comprehensive methodology for data mining projects that provides a
structured approach to guide data-driven decision making. As a data methodology, a study of the
CRISP-DM model includes six data mining stages, their descriptions, and explanations of the relationships between tasks and stages. And as a process model, CRISP-DM provides high-level
insights into the data mining cycle. Like other data mining science methodologies, CRISP-DM
requires flexibility at each stage, and communication with peers, management, and stakeholders to
keep the project on track. After any of the following six stages, data scientists might need to revisit
an earlier stage and make changes. The business understanding stage is the most important
because this stage sets and outlines the intentions of the data analysis project. This stage is
common to both John Rollins data science methodology, and CRISP-DM methodology. This stage
requires communication and clarity to overcome stakeholders' differing objectives, biases, and
information-related modalities. Without a clear, concise, and complete understanding of the business
problem and project goals, the project effort will waste time and resources. Then, CRISP-DM
combines the stages of data requirements, data collection, and data understanding from John Rollins' methodology outline into a single data understanding stage. During this stage, data scientists decide on data sources and acquire data. Next, during the data preparation stage, data scientists
transform the collected data into a usable data subset and determine if they need more data. With
data collection complete, data scientists select a dataset and address questionable missing or
ambiguous data values. Data preparation is common to both the foundational data methodology and
CRISP-DM. The modeling stage fulfills the purpose of data mining and creates data models that
reveal patterns and structures within the data. These patterns and structures provide knowledge and
insights that address the stated business problem and goals. Data scientists select models based on
subsets of the data and adjust the models as needed. Model selection is an art and science. Both
foundational methodology and CRISP-DM focus on creating knowledge information that has
meaning and utility. During the evaluation stage, data scientists test the selected model. Data
scientists usually prepare a pre-selected test set to run the trained model on. The test platform sees the
data as new and data scientists then assess the model's effectiveness. These testing results
determine the model's efficacy and foreshadow the model's role in the next and final stage. Finally,
during the deployment stage, data scientists and stakeholders use the model on new data outside of
the scope of the dataset. New interactions during this stage might reveal new variables and the need
for a different dataset and model. Remember that the CRISP-DM model is iterative and cyclical,
deployment results might initiate revisions to the business needs and actions, the model and data, or
any combination of these items. After completing all six stages, you'll have another business
understanding meeting with the stakeholders to discuss the results. In CRISP-DM, this stage is not
named. However, in John Rollins Data Science methodology model, the stage is explicitly named the
Feedback stage. You'll continue the CRISP-DM process stages until the stakeholders, management,
and you agree that the data model and its analysis provide the stakeholder with the answers they
need to resolve their business problems and attain their business goals. In this video, you learned
that CRISP-DM stands for Cross-Industry Standard Process for Data Mining. The CRISP-DM model
consolidates the steps outlined in foundational data methodology into the following six stages,
business understanding, data understanding, data preparation, modeling, evaluation, and
deployment. You'll continue the CRISP-DM process until the stakeholders, management, and you
agree that the data model and its analysis answer the business questions. [MUSIC]


After completing this course, you learned many facts about data science methodology. Here are 14
key, high-level takeaway facts you’ll want to remember from this course.

Foundational methodology, a cyclical, iterative data science methodology developed by John Rollins,
consists of 10 stages, starting with Business Understanding and ending with Feedback.

CRISP-DM, an open source data methodology, combines several data-related methodology stages
into one stage and omits the Feedback stage resulting in a six-stage data methodology.

The primary goal of the Business Understanding stage is to understand the business problem and
determine the data needed to answer the core business question.

During the Analytic Approach stage, you can choose from descriptive, diagnostic, predictive, and
prescriptive analytic approaches and whether to use machine learning techniques.

During the Data Requirements stage, scientists identify the correct and necessary data content,
formats, and sources needed for the specific analytical approach.

During the Data Collection stage, expert data scientists revise data requirements and make critical
decisions regarding the quantity and quality of data. Data scientists apply descriptive statistics and
visualization techniques to thoroughly assess the content, quality, and initial insights gained from the
collected data, identify gaps, and determine if new data is needed, or if they should substitute
existing data.

The Data Understanding stage encompasses all activities related to constructing the data set. This
stage answers the question of whether the collected data represents the data needed to solve the
business problem. Data scientists might use descriptive statistics, predictive statistics, or both.

Data scientists commonly apply Hurst, univariates, and statistics such as mean, median, minimum,
maximum, standard deviation, pairwise correlation, and histograms.

During the Data Preparation stage, data scientists must address missing or invalid values, remove
duplicates, and validate that the data is properly formatted. Feature engineering and text analysis
are key techniques data scientists apply to validate and analyze data during the Data Preparation
stage.
The end goal of the Modeling stage is that the data model answers the business question. During
the Modeling stage, data scientists use a training data set. Data scientists test multiple algorithms on
the training set data to determine whether the variables are required and whether the data supports
answering the business question. The outcome of those models is either descriptive or predictive.

The Evaluation stage consists of two phases, the diagnostic measures phase, and the statistical
significance phase. Data scientists and others assess the quality of the model and determine if the
model answers the initial Business Understanding question or if the data model needs adjustment.

During the Deployment stage, data scientists release the data model to a targeted group of
stakeholders, including solution owners, marketing staff, application developers, and IT
administration.

During the Feedback stage, stakeholders and users evaluate the model and contribute feedback to
assess the model’s performance.

The data model’s value depends on its ability to iterate; that is, how successfully the data model
incorporates user feedback.

Course Overview
Welcome to the Python for Data Science, AI, and Development course. After completing this course,
you'll possess the basic knowledge of Python and acquire a good understanding of different data
types. You’ll also learn to use lists and tuples, dictionaries, and Python sets. Additionally, you’ll
acquire the concepts of condition and branching and will know how to implement loops, create
functions, perform exception handling, and create objects. Furthermore, you’ll be proficient in
reading and writing files and will be able to implement unique ways to collect data using APIs and
web scraping. In addition to the module labs, you'll prove your skills in a peer-graded project and
your overall knowledge with the final quiz.
Course Content
This course is divided into five modules. You should set a goal to complete at least one module per
week.
Module 1: Python Basics
● About the Course
● Types
● Expressions and Variables
● String Operations

Module 2: Python Data Structures


● Lists and Tuples
● Dictionaries
● Sets

Module 3: Python Programming Fundamentals


● Conditions and Branching
● Loops
● Functions
● Exception Handling
● Objects and Classes
● Practice with Python Programming Fundamentals

Module 4: Working with Data in Python


● Reading and Writing Files with Open
● Pandas
● Numpy in Python

Module 5: APIs and Data Collection


● Simple APIs
● REST APIs, Web Scraping, and Working with Files
● Final Exam

The course contains a variety of learning assets: Videos, activities, labs, projects, practice, graded
quizzes, and readings. The videos and readings present the instruction. Labs and activities support
that instruction with hands-on learning experiences. Discussions allow you to interact and learn from
your peers. A peer-reviewed project that mimics real-world scenarios encourages you to showcase your skills. Practice quizzes enable you to test your knowledge of what you learned. Finally, graded
quizzes indicate how well you have learned the course concepts.
Enjoy the course!

Welcome to “Introduction to Python”. After watching this video, you will be able to identify the users
of Python. List the benefits of using Python. Describe the diversity and inclusion efforts of the Python
community. Python is a powerhouse of a language. It is the most widely used and most popular
programming language used in the data science industry. According to the 2019 Kaggle Data
Science and Machine Learning Survey, ¾ of the over 10,000 respondents worldwide reported that
they use Python regularly. Glassdoor reported that in 2019 more than 75% of data science positions
listed included Python in their job descriptions. When asked which language an aspiring data
scientist should learn first, most data scientists say Python. Let’s start with the people who use
Python. If you already know how to program, then Python is great for you because it uses clear and
readable syntax. You can develop the same programs as in other languages with less code using
Python. For beginners, Python is a good language to start with because of the huge global
community and wealth of documentation. Several different surveys done in 2019 established that
over 80% of data professionals use Python worldwide. Python is useful in many areas including data
science, AI and machine learning, web development, and Internet of Things (IoT) devices, like the
Raspberry Pi. Large organizations that heavily use Python include IBM, Wikipedia, Google, Yahoo!,
CERN, NASA, Facebook, Amazon, Instagram, Spotify, and Reddit. Python is widely supported by a
global community and shepherded by the Python Software Foundation. Python is a high-level,
general-purpose programming language that can be applied to many different classes of problems. It
has a large, standard library that provides tools suited to many different tasks including but not
limited to Databases, Automation, Web scraping, Text processing, Image processing, Machine
learning, and Data analytics. For data science, you can use Python's scientific computing libraries
like Pandas, NumPy, SciPy, and Matplotlib. For artificial intelligence, it has TensorFlow, PyTorch,
Keras, and Scikit-learn. Python can also be used for Natural Language Processing (NLP) using the
Natural Language Toolkit (NLTK). Another great selling point for Python is that the Python
community has a well-documented history of paving the way for diversity and inclusion efforts in the
tech industry as a whole. The Python language has a code of conduct executed by the Python
Software Foundation that seeks to ensure safety and inclusion for all, in both online and in-person
Python communities. Communities like PyLadies seek to create spaces for people interested in
learning Python in safe and inclusive environments. PyLadies is an international mentorship group
with a focus on helping more women become active participants and leaders in the Python
open-source community.

In this video, you learned that Python uses clear and readable syntax. Python has a huge global
community and a wealth of documentation. For data science, you can use Python's scientific
computing libraries like Pandas, NumPy, SciPy, and Matplotlib. Python can also be used for Natural
Language Processing (NLP) using the Natural Language Toolkit (NLTK). Python community has a
well-documented history of paving the way for diversity and inclusion efforts in the tech industry as a
whole.


Welcome to “Getting started with Jupyter.” After watching this video, you will be able to: Describe
how to run, insert, and delete a cell in a notebook. Work with multiple notebooks. Present the
notebook, and shut down the notebook session. In the lab session of this module, you can launch a
notebook using the Skills Network virtual environment. After selecting the check box, click the Open
tool tab, and the environment will open the Jupyter Lab. Here you see the open notebook. On
opening the notebook, you can change the name of the notebook. Click File. Then click Rename
Notebook to give the required name. And you can now start working on your new notebook. In the
new notebook, print “hello world”.

Then click the Run button to show that the environment is giving the correct output. On the main
menu bar at the top, click Run. In the drop-down menu, click Run Selected Cells to run the current
highlighted cells. Alternatively, you can use a shortcut, press Shift + Enter. In case you have multiple
code cells, click Run All cells to run the code in all the cells. You can add code by inserting a new
cell. To add a new cell, click the plus symbol in the toolbar. In addition, you can delete a cell.
Highlight the cell and on the main menu bar, click Edit, and then click Delete Cells. Alternatively, you
can use a shortcut by pressing D twice on the highlighted cell. Also, you can move the cells up or
down as required. So, now you have learned to work with a single notebook. Next, let’s learn to work
with multiple notebooks. Click the plus button on the toolbar and select the file you want to open.
Another notebook will open. Alternatively, you can click File on the menu bar and click Open a new
launcher or Open a new notebook. And when you open the new file, you can move them around. For
example, as shown, you can place the notebooks side by side. On one notebook, you can assign
variable one to the number 1 and variable two to the number 2, and then you can print the result of adding the two variables.

As a data scientist, it is important to communicate your results. Jupyter supports presenting results
directly from the notebooks. You can create a Markdown to add titles and text descriptions to help
with the flow of the presentation. To add markdown, click Code and select Markdown. You can
create line plots and convert each cell and output into a slide or sub-slide in the form of a
presentation.
The slides functionality in Jupyter allows you to deliver code, visualization, text, and outputs of the
executed code as part of a project.

Now, when you have completed working with your notebook or notebooks, you can shut them down.
Shutting down notebooks releases their memory. Click the stop icon on the sidebar; it is the second
icon from the top. You can terminate all sessions at once or shut them down individually. And after
you shut down the notebook session, you will see “no kernel” at the top right. This confirms it is no longer active, and you can now close the tabs.

In this video, you learned how to: Run, delete, and insert a code cell. Run multiple notebooks at the
same time. Present a notebook using a combination of Markdown and code cells. And shut down
your notebook sessions after you have completed your work.

A type is how Python represents different types of data. In this video, we will discuss some widely
used types in Python. You can have different types in Python. They can be integers like 11, real
numbers like 21.213, they can even be words. Integers, real numbers, and words can be expressed
as different data types. The following chart summarizes three data types for the last examples. The
first column indicates the expression. The second column indicates the data type. We can see the
actual data type in Python by using the type command. We can have int, which stands for integer, and float, which stands for floating point, essentially a real number. The type string is a sequence of characters.
Here are some integers. Integers can be negative or positive. It should be noted that there is a finite
range of integers but it is quite large. Floats are real numbers. They include the integers but also
numbers in between the integers. Consider the numbers between 0 and 1. We can select numbers
in between them. These numbers are floats. Similarly, consider the numbers between 0.5 and 0.6.
We can select numbers in between them. These are floats as well. We can continue the process
zooming in for different numbers. Of course there is a limit but it is quite small. You can change the
type of the expression in Python, this is called typecasting. You can convert an int to a float. For
example, you can convert or cast the integer 2 to a float 2.0. Nothing really changes. If you cast a float to an integer, you must be careful. For example, if you cast the float 1.1 to the integer 1, you will lose some
information. If a string contains an integer value, you can convert it to int. If we convert a string that
contains a non-integer value, we get an error. Check out more examples in the lab. You can convert
an int to a string or a float to a string. Boolean is another important type in Python. A Boolean can
take on two values. The first value is True, just remember we use an uppercase T. Boolean values
can also be False with an uppercase F. Using the type command on a Boolean value, we obtain the
term bool. This is short for Boolean. If we cast a Boolean True to an integer or float, we will get a 1. If
we cast a Boolean False to an integer or float, we get a 0. If you cast a 1 to a Boolean, you get a
True. Similarly, if you cast a 0 to a Boolean, you get a False. Check the labs for more examples or
check Python.org for other kinds of types in Python. (Music)
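
The following short snippet recaps these ideas about types and typecasting; you can run it in any Python 3 environment:

# Checking types with the type command
print(type(11))         # <class 'int'>
print(type(21.213))     # <class 'float'>
print(type("hello"))    # <class 'str'>

# Typecasting
print(float(2))         # 2.0 - casting an int to a float loses nothing
print(int(1.1))         # 1   - casting a float to an int drops the fractional part
print(int("12"))        # 12  - a string containing an integer converts cleanly
# int("1.1") would raise a ValueError because the string is not an integer

# Booleans
print(type(True))                 # <class 'bool'>
print(int(True), float(False))    # 1 0.0
print(bool(1), bool(0))           # True False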

In this video, we will cover expressions and variables. Expressions describe a type of operation that
computers perform. Expressions are operations that python performs. For example, basic arithmetic
operations like adding multiple numbers. The result in this case is 160. We call the numbers
operands, and the math symbols in this case, addition, are called operators. We can perform
operations such as subtraction using the subtraction sign. In this case, the result is a negative
number. We can perform multiplication operations using the asterisk. The result is 25. In this case,
the operators are given by negative and asterisk. We can also perform division with the forward
slash (/) 25 / 5 is 5.0; 25 / 6 is approximately 4.167. In Python 3, the version we will be using in this
course, both will result in a float. We can use the double slash for integer division, where the result is rounded down to an integer. Be aware that in some cases the results are not the same as regular division. Python follows
mathematical conventions when performing mathematical expressions. The following operations are
in a different order. In both cases, Python performs multiplication, then addition, to obtain the final
result. There are a lot more operations you can do with Python, check the labs for more examples.
We will also be covering more complex operations throughout the course. The expressions in the
parentheses are performed first. We then multiply the result by 60. The result is 1,920. Now, let's
look at variables. We can use variables to store values. In this case, we assign a value of 1 to the
variable my_variable using the assignment operator, i.e, the equal sign. We can then use the value
somewhere else in the code by typing the exact name of the variable. We will use a colon to denote
the value of the variable. We can assign a new value to my_variable using the assignment operator.
We assign a value of 10. The variable now has a value of 10. The old value of the variable is not
important. We can store the results of expressions. For example, we add several values and assign
the result to x. X now stores the result. We can also perform operations on x and save the result to a
new variable-y. Y now has a value of 2.666. We can also perform operations on x and assign the
value x. The variable x now has a value: 2.666. As before, the old value of x is not important. We can
use the type command on variables as well. It's good practice to use meaningful variable names, so you don't have to keep track of what the variable is doing. Let's say we would like to convert the number of minutes in the highlighted examples to the number of hours in the following music dataset.
We call the variable that contains the total number of minutes "total_min". It's common to use the
underscore to represent the start of a new word. You could also use a capital letter. We call the
variable that contains the total number of hours, total_hour. We can obtain the total number of hours
by dividing total_min by 60. The result is approximately 2.367 hours. If we modify the value of the first variable and rerun the code, the final result changes accordingly, but we do not have to modify the rest of the code. (Music)
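
The following short snippet recaps these expressions and variables. The minute values are made up for the example, chosen so the result matches the approximately 2.367 hours mentioned above:

# Expressions: operands and operators
print(43 + 60 + 16 + 41)    # 160
print(25 / 5)               # 5.0 - division always returns a float in Python 3
print(25 // 6)              # 4   - double slash is integer division, rounded down
print(30 + 2 * 60)          # 150 - multiplication happens before addition
print((30 + 2) * 60)        # 1920 - expressions in parentheses are performed first

# Variables and the assignment operator
my_variable = 1
my_variable = 10            # the old value is replaced

# Meaningful variable names: convert total minutes to hours
total_min = 43 + 42 + 57    # made-up track lengths in minutes
total_hour = total_min / 60
print(total_hour)           # approximately 2.367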

Carriage return '\r' in python


In Python, \r is a special character known as the carriage return. A carriage return character is
another special escape sequence in Python that positions the cursor at the start of the line. It
controls the cursor's position when printing text to the console. Imagine you're typing on a typewriter
and want to go back to the beginning of the line without moving down to the following line - that's
what \r does in Python.
import time

for i in range(1, 11):
    print(f"Progress: {i}/10", end='\r')
    # Simulate some processing time
    time.sleep(1)

print("\nTask complete!")

In this example:
● We import the time module to introduce a delay.
● We have a loop that iterates 10 times.
● Inside the loop, we print the current progress using f-strings. The end='\r' argument ensures
that the cursor returns to the beginning of the line after printing.
● We use time.sleep(1) to pause the execution for 1 second, simulating some processing time.

You'll see the progress updated dynamically on the same line in your terminal. This technique is
commonly used to provide progress feedback in command-line applications and scripts.

In Python, a string is a sequence of characters. A string is contained within two quotes. You could
also use single quotes. A string can be spaces or digits. A string can also be special characters. We
can bind or assign a string to another variable. It is helpful to think of a string as an ordered
sequence. Each element in the sequence can be accessed using an index represented by the array
of numbers. The first index can be accessed as follows: We can access index six. Moreover, we can
access the 13th index. We can also use negative indexing with strings. The last element is given by
the index negative one. The first element can be obtained by index negative 15 and so on. We can
bind a string to another variable. It is helpful to think of string as a list or tuple. We can treat the string
as a sequence and perform sequence operations. We can also input a stride value as follows: The
two indicates we'd select every second variable. We can also incorporate slicing. In this case, we
return every second value up to index four. We can use the len command to obtain the length of the
string. As there are 15 elements, the result is 15. We can concatenate or combine strings. We use
the addition symbols. The result is a new string that is a combination of both. We can replicate
values of a string. We simply multiply the string by the number of times we would like to replicate it-
in this case, three. The result is a new string. The new string consists of three copies of the original string. Strings are immutable. This means you cannot change the value of the string, but you can create a new string. For example, you can create a new string by setting it to the original variable and concatenating it with a new string. The result is a new string that changes from Michael Jackson to Michael Jackson is the best. Backslashes represent the beginning of escape sequences. Escape
sequences represent strings that may be difficult to input. For example, backslash "n" represents a
new line. The output is given by a new line after the backslash "n" is encountered. Similarly,
backslash "t" represents a tab. The output is given by a tab where the backslash, "t" is. If you want to
place a backslash in your string, use a double backslash. The result is a backslash after the escape
sequence. We can also place an "r" in front of the string to treat it as a raw string, so the escape sequences are not interpreted. Now, let's take a look at string methods.
Strings are sequences and, as such, can apply methods that work on lists and tuples. Strings also
have a second set of methods that just work on strings. When we apply a method to the string A, we
get a new string B that is different from A. Let's do some examples. Let's try with the method
"Upper". This method converts lowercase characters to uppercase characters. In this example, we
set the variable A to the following value. We apply the method "Upper", and set it equal to B. The
value for B is similar to A, but all the characters are uppercase. The method replace replaces a segment of the string, i.e., a substring, with a new string. We input the part of the string we would like to change.
The second argument is what we would like to exchange the segment with. The result is a new
string with a segment changed. The method find, finds substrings. The argument is the substring you
would like to find. The output is the first index of the sequence. We can find the substring Jack. If the
substring is not in the string, the output is negative one. Check the labs for more examples. (Music)
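A minimal sketch of the operations described in this video, using the "Michael Jackson" example:

name = "Michael Jackson"
print(name[0])         # 'M' - the first character, index zero
print(name[-1])        # 'n' - negative indexing counts from the end
print(name[0:4])       # 'Mich' - slicing
print(name[::2])       # every second character (a stride of two)
print(len(name))       # 15
statement = name + " is the best"    # concatenation creates a new string
print(3 * "Michael Jackson ")        # replication
print(name.upper())                  # 'MICHAEL JACKSON'
print(name.replace("Michael", "Janet"))
print(name.find("Jack"))             # 8 - first index of the substring
print("Michael Jackson\nis the best")  # '\n' starts a new line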

Reading: Format Strings in Python


Estimated effort: 5 mins

Format strings are a way to inject variables into a string in Python. They are used to format strings
and produce more human-readable outputs. There are several ways to format strings in Python:

String interpolation (f-strings)


Introduced in Python 3.6, f-strings are a new way to format strings in Python. They are prefixed with
'f' and use curly braces {} to enclose the variables that will be formatted. For example:

name = "John"
age = 30
print(f"My name is {name} and I am {age} years old.")

This will output:

My name is John and I am 30 years old.
str.format()
This is another way to format strings in Python. It uses curly braces {} as placeholders for variables
which are passed as arguments in the format() method. For example:

name = "John"
age = 50
print("My name is {} and I am {} years old.".format(name, age))

This will output:

My name is John and I am 50 years old.
% Operator
This is one of the oldest ways to format strings in Python. It uses the % operator to replace variables
in the string. For example:

name = "Johnathan"
age = 30
print("My name is %s and I am %d years old." % (name, age))

This will output:

My name is Johnathan and I am 30 years old.
Each of these methods has its own advantages and use cases. However, f-strings are generally
considered the most modern and preferred way to format strings in Python due to their readability
and performance.

Additional capabilities
F-strings are also able to evaluate expressions inside the curly braces, which can be very handy. For
example:

x = 10
y = 20
print(f"The sum of x and y is {x+y}.")

This will output:

The sum of x and y is 30.

Module 1 Summary: Python Basics


Congratulations! You have completed this module. At this point, you know that:
● Python can distinguish among data types such as integers, floats, strings, and Booleans.
● Integers are whole numbers that can be positive or negative.
● Floats include integers as well as decimal numbers between the integers.
● You can convert integers to floats, and floats to integers, using typecasting; converting a float to an integer truncates the decimal part.
● You can convert integers and floats to strings.
● You can cast Boolean values to numbers (True becomes 1 and False becomes 0) and cast 1 or 0 back to a Boolean.
● Expressions in Python are a combination of values and operations used to produce a single result.
● Expressions perform mathematical operations such as addition, subtraction, multiplication, and so on.
● We use "//" for floor division, which rounds the result of the division down to the nearest whole number.
● Python follows the order of operations (BODMAS) to perform operations with multiple expressions.
● Variables store and manipulate data, allowing you to access and modify values throughout
your code.
● The assignment operator "=" assigns a value to a variable.
● ":" denotes the value of the variable within the code.
● Assigning another value to the same variable overrides the previous value of that variable.
● You can perform mathematical operations on variables using the same or different variables.
● When one variable is computed from others, changing a value affects the result only when the expressions that depend on it are evaluated again.
● Python string operations involve manipulating text data using tasks such as indexing,
concatenation, slicing, and formatting.
● A string is usually written within double quotes or single quotes, including letters, white
space, digits, or special characters.
● A string can be assigned to a variable and is an ordered sequence of characters.
● Each character in a string has an index number, which can be positive or negative.
● We use strings as a sequence to perform sequence operations.
● You can input a stride value to perform slicing while operating on a string.
● Operations like finding the length of the string, combining, concatenating, and replicating,
result in a new string.
● You cannot modify an existing string; they are immutable.
● You can use escape sequences, which begin with a backslash "\", to change the layout of a string (for example, "\n" for a new line and "\t" for a tab).
● In Python, you perform tasks such as searching, modifying, and formatting text data with its pre-built string methods.
● Applying a method to a string does not change the original string; it returns a new string.
● You can perform actions such as changing the case of characters in a string, replacing items in a string, finding items in a string, and so on using pre-built string methods.
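A few lines of code recapping these points (the values are illustrative):

print(int(3.99))            # 3  - converting a float to an integer truncates the decimal part
print(float(2))             # 2.0
print(str(1.5))             # '1.5'
print(int(True), bool(0))   # 1 False
print(25 // 6)              # 4  - floor division rounds down
print(2 + 3 * 4)            # 14 - multiplication happens before addition
x = 10
x = x + 5                   # assigning a new value overrides the previous one
print(x)                    # 15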

In this video we will cover lists and tuples. These are called compound data types and are one of the
key types of data structures in Python. Tuples. Tuples are an ordered sequence. Here is a tuple
ratings. Tuples are expressed as comma separated elements within parentheses. These are values
inside the parentheses. In Python, there are different types: strings, integer, float. They can all be
contained in a tuple but the type of the variable is tuple. Each element of a tuple can be accessed
via an index. The following table represents the relationship between the index and the elements in
the tuple. The first element can be accessed by the name of the tuple followed by a square bracket
with the index number, in this case zero. We can access the second element as follows. We can also
access the last element. In Python, we can use negative index. The relationship is as follows. The
corresponding values are shown here. We can concatenate or combine tuples by adding them. The
result is the following with the following index. If we would like multiple elements from a tuple, we
could also slice tuples. For example, if we want the first three elements we use the following
command. The last index is one larger than the index you want; similarly if we want the last two
elements, we use the following command. Notice, how the last index is one larger than the last index
of the tuple. We can use the len command to obtain the length of a tuple. As there are five elements,
the result is 5. Tuples are immutable which means we can't change them. To see why this is
important, let's see what happens when we set the variable ratings 1 to ratings. Let's use the image
to provide a simplified explanation of what's going on. Each variable does not contain a tuple, but
references the same immutable tuple object. See the objects and classes module for more about
objects. Let's say, we want to change the element at index 2. Because tuples are immutable we
can't, therefore ratings 1 will not be affected by a change in rating because the tuple is immutable, i.e
we can't change it. We can assign a different tuple to the ratings variable. The variable ratings now
references another tuple. As a consequence of immutability, if we would like to manipulate a tuple
we must create a new tuple instead. For example, if we would like to sort a tuple we use the function
sorted. The input is the original tuple, the output is a new sorted list. For more on functions, see our
video on functions. A tuple can contain other tuples as well as other complex data types. This is
called nesting. We can access these elements using the standard indexing methods. If we select an
index with a tuple, the same index convention applies. As such, we can then access values in the
tuple. For example, we could access the second element. We can apply this indexing directly to the
tuple variable NT. It is helpful to visualize this as a tree. We can visualize this nesting as a tree. The
tuple has the following indexes. If we consider indexes with other tuples, we see the tuple at index 2
contains a tuple with two elements. We can access those two indexes. The same convention applies
to index 3. We can access the elements in those tuples as well. We can continue the process. We
can even access deeper levels of the tree by adding another square bracket. We can access
different characters in the string or various elements in the second tuple contained in the first. Lists
are also a popular data structure in Python. Lists are also an ordered sequence. Here is a list, "L." A
list is represented with square brackets. In many respects, lists are like tuples; one key difference is that they are mutable. Lists can contain strings, floats, and integers. We can nest other lists, and we can also nest tuples and other data structures; the same indexing conventions apply for nesting. Like tuples, each
element of a list can be accessed via an index. The following table represents the relationship
between the index and the elements in the list. The first element can be accessed by the name of
the list followed by a square bracket with the index number, in this case zero. We can access the
second element as follows. We can also access the last element. In Python, we can use a negative
index; the relationship is as follows. The corresponding indexes are as follows. We can also perform
slicing in lists. For example, if we want the last two elements in this list we use the following
command. Notice how the last index in the slice is one larger than the last index of the list. The index conventions
for lists and tuples are identical. Check the labs for more examples. We can concatenate or combine
lists by adding them. The result is the following. The new list has the following indices. Lists are
mutable, therefore we can change them. For example, we apply the method extend by adding a dot
followed by the name of the method then parentheses. The argument inside the parentheses is a
new list that we are going to concatenate to the original list. In this case, instead of creating a new
list, "L1," the original list, "L," is modified by adding two new elements. To learn more about methods
check out our video on objects and classes. Another similar method is append. If we apply append
instead of extend, we add one element to the list. If we look at the index, there is only one more
element. Index 3 contains the list we appended. Every time we apply a method, the list changes. If
we apply extend, we add two new elements to the list. The list L is modified by adding two new
elements. If we append the string A, we further change the list, adding the string A. As lists are
mutable we can change them. For example, we can change the first element as follows. The list now
becomes hard rock 10 1.2. We can delete an element of a list using the del command. We simply
indicate the list item we would like to remove as an argument. For example, if we would like to
remove the first element the result becomes 10 1.2. We can delete the second element. This
operation removes the second element off the list. We can convert a string to a list using split. For
example, the method split converts every group of characters separated by a space into an element
of a list. We can use the split method to separate strings on a specific character, known as a
delimiter. We simply pass the delimiter we would like to split on as an argument, in this case a
comma. The result is a list. Each element corresponds to a set of characters that have been
separated by a comma. When we set one variable B equal to A, both A and B are referencing the
same list. Multiple names referring to the same object is known as aliasing. We know from the list
slide that the first element in B is set as hard rock. If we change the first element in A to banana, we
get a side effect, the value of B will change as a consequence. A and B are referencing the same
list, therefore if we change A, list B also changes. If we check the first element of B after changing
list A, we get banana instead of hard rock. You can clone list A by using the following syntax.
Variable A references one list. Variable B references a new copy or clone of the original list. Now if
you change A, B will not change. We can get more info on lists, tuples, and many other objects in
Python using the help command. Simply pass in the list, tuple, or any other Python object. See the
labs for more things you can do with lists. (Music)
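A minimal sketch of the tuple and list operations described in this video (the values mirror the ratings and album examples):

ratings = (10, 9, 6, 5, 10)          # a tuple: ordered and immutable
print(ratings[0], ratings[-1])        # 10 10
print(ratings[0:3])                   # (10, 9, 6)
print(len(ratings))                   # 5
print(sorted(ratings))                # [5, 6, 9, 10, 10] - a new, sorted list

L = ["Michael Jackson", 10.1, 1982]   # a list: ordered and mutable
L.extend(["pop", 10])                 # adds two elements
L.append(["rock", 5])                 # adds one element (the nested list itself)
L[0] = "hard rock"                    # lists can be changed in place
del L[1]
print(L)

A = ["hard rock", 10, 1.2]
B = A                                 # aliasing: A and B reference the same list
C = A[:]                              # cloning: C is a new copy
A[0] = "banana"
print(B[0])                           # 'banana' - B changed too
print(C[0])                           # 'hard rock' - the clone is unaffected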

Tuples and lists exercise

Imagine you received album recommendations from your friends and compiled all of the
recommendations into a table, with specific information about each album.

The table has one row for each album and several columns:

Artist - Name of the artist


Album - Name of the album
Released_year - Year the album was released
Length_min_sec - Length of the album (hours,minutes,seconds)
Genre - Genre of the album
Music_recording_sales_millions - Music recording sales (millions in USD) on SONG://DATABASE
Claimed_sales_millions - Album's claimed sales (millions in USD) on SONG://DATABASE
Date_released - Date on which the album was released
Soundtrack - Indicates if the album is the movie soundtrack (Y) or (N)
Rating_of_friends - Indicates the rating from your friends from 1 to 10
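One possible starting point for this exercise (a sketch with illustrative values; the lab provides the actual data) is to store a single row of the table as a list and access its entries by index:

album_row = ["Michael Jackson",   # Artist
             "Thriller",          # Album
             1982,                # Released_year
             "00:42:19",          # Length_min_sec
             "pop, rock, R&B",    # Genre
             46.0,                # Music_recording_sales_millions
             65,                  # Claimed_sales_millions
             "30-Nov-82",         # Date_released
             "N",                 # Soundtrack
             10.0]                # Rating_of_friends

print(album_row[0])   # Artist
print(album_row[-1])  # Rating_of_friends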
Python Data Structures Cheat Sheet

List

append()
Description: The `append()` method is used to add an element to the end of a list.
Syntax: list_name.append(element)
Example:
fruits = ["apple", "banana", "orange"]
fruits.append("mango")
print(fruits)

copy()
Description: The `copy()` method is used to create a shallow copy of a list.
Example:
my_list = [1, 2, 3, 4, 5]
new_list = my_list.copy()
print(new_list)
# Output: [1, 2, 3, 4, 5]

count()
Description: The `count()` method is used to count the number of occurrences of a specific element in a list.
Example:
my_list = [1, 2, 2, 3, 4, 2, 5, 2]
count = my_list.count(2)
print(count)
# Output: 4

Creating a list
Description: A list is a built-in data type that represents an ordered and mutable collection of elements. Lists are enclosed in square brackets [] and elements are separated by commas.
Example:
fruits = ["apple", "banana", "orange", "mango"]

del
Description: The `del` statement is used to remove an element from a list. It removes the element at the specified index.
Example:
my_list = [10, 20, 30, 40, 50]
del my_list[2]  # Removes the element at index 2
print(my_list)
# Output: [10, 20, 40, 50]

extend()
Description: The `extend()` method is used to add multiple elements to a list. It takes an iterable (such as another list, tuple, or string) and appends each element of the iterable to the original list.
Syntax: list_name.extend(iterable)
Example:
fruits = ["apple", "banana", "orange"]
more_fruits = ["mango", "grape"]
fruits.extend(more_fruits)
print(fruits)

Indexing
Description: Indexing in a list allows you to access individual elements by their position. In Python, indexing starts from 0 for the first element and goes up to `length_of_list - 1`.
Example:
my_list = [10, 20, 30, 40, 50]
print(my_list[0])
# Output: 10 (accessing the first element)
print(my_list[-1])
# Output: 50 (accessing the last element using negative indexing)

insert()
Description: The `insert()` method is used to insert an element.
Syntax: list_name.insert(index, element)
Example:
my_list = [1, 2, 3, 4, 5]
my_list.insert(2, 6)
print(my_list)

Modifying a list
Description: You can use indexing to modify or assign new values to specific elements in the list.
Example:
my_list = [10, 20, 30, 40, 50]
my_list[1] = 25  # Modifying the second element
print(my_list)
# Output: [10, 25, 30, 40, 50]

pop()
Description: The `pop()` method is another way to remove an element from a list. It removes and returns the element at the specified index. If you don't provide an index, it removes and returns the last element of the list by default.
Example 1:
my_list = [10, 20, 30, 40, 50]
removed_element = my_list.pop(2)  # Removes and returns the element at index 2
print(removed_element)
# Output: 30
print(my_list)
# Output: [10, 20, 40, 50]
Example 2:
my_list = [10, 20, 30, 40, 50]
removed_element = my_list.pop()  # Removes and returns the last element
print(removed_element)
# Output: 50
print(my_list)
# Output: [10, 20, 30, 40]

remove()
Description: The `remove()` method removes the first occurrence of the specified value from a list.
Example:
my_list = [10, 20, 30, 40, 50]
my_list.remove(30)  # Removes the element 30
print(my_list)
# Output: [10, 20, 40, 50]

reverse()
Description: The `reverse()` method is used to reverse the order of elements in a list.
Example:
my_list = [1, 2, 3, 4, 5]
my_list.reverse()
print(my_list)
# Output: [5, 4, 3, 2, 1]

Slicing
Description: You can use slicing to access a range of elements from a list.
Syntax: list_name[start:end:step]
Example:
my_list = [1, 2, 3, 4, 5]
print(my_list[1:4])
# Output: [2, 3, 4] (elements from index 1 to 3)
print(my_list[:3])
# Output: [1, 2, 3] (elements from the beginning up to index 2)
print(my_list[2:])
# Output: [3, 4, 5] (elements from index 2 to the end)
print(my_list[::2])
# Output: [1, 3, 5] (every second element)

sort()
Description: The `sort()` method is used to sort the elements of a list in ascending order. To sort the list in descending order, pass the `reverse=True` argument to the `sort()` method.
Example 1:
my_list = [5, 2, 8, 1, 9]
my_list.sort()
print(my_list)
# Output: [1, 2, 5, 8, 9]
Example 2:
my_list = [5, 2, 8, 1, 9]
my_list.sort(reverse=True)
print(my_list)
# Output: [9, 8, 5, 2, 1]
Dictionary

Accessing Values
Description: You can access the values in a dictionary using their corresponding keys.
Syntax: value = dict_name["key_name"]
Example:
name = person["name"]
age = person["age"]

Add or modify
Description: Inserts a new key-value pair into the dictionary. If the key already exists, the value will be updated; otherwise, a new entry is created.
Syntax: dict_name[key] = value
Example:
person["Country"] = "USA"   # A new entry will be created.
person["city"] = "Chicago"  # Update the existing value for the same key

clear()
Description: The `clear()` method empties the dictionary, removing all key-value pairs within it. After this operation, the dictionary is still accessible and can be used further.
Syntax: dict_name.clear()
Example:
grades.clear()

copy()
Description: Creates a shallow copy of the dictionary. The new dictionary contains the same key-value pairs as the original, but they remain distinct objects in memory.
Syntax: new_dict = dict_name.copy()
Example:
new_person = person.copy()
new_person = dict(person)  # another way to create a copy of a dictionary

Creating a Dictionary
Description: A dictionary is a built-in data type that represents a collection of key-value pairs. Dictionaries are enclosed in curly braces `{}`.
Example:
dict_name = {}  # Creates an empty dictionary
person = {"name": "John", "age": 30, "city": "New York"}

del
Description: Removes the specified key-value pair from the dictionary. Raises a `KeyError` if the key does not exist.
Syntax: del dict_name[key]
Example:
del person["Country"]

items()
Description: Retrieves all key-value pairs as tuples and converts them into a list of tuples. Each tuple consists of a key and its corresponding value.
Syntax: items_list = list(dict_name.items())
Example:
info = list(person.items())

key existence
Description: You can check for the existence of a key in a dictionary using the `in` keyword.
Example:
if "name" in person:
    print("Name exists in the dictionary.")

keys()
Description: Retrieves all keys from the dictionary and converts them into a list. Useful for iterating or processing keys using list methods.
Syntax: keys_list = list(dict_name.keys())
Example:
person_keys = list(person.keys())

update()
Description: The `update()` method merges the provided dictionary into the existing dictionary, adding or updating key-value pairs.
Syntax: dict_name.update({key: value})
Example:
person.update({"Profession": "Doctor"})

values()
Description: Extracts all values from the dictionary and converts them into a list. This list can be used for further processing or analysis.
Syntax: values_list = list(dict_name.values())
Example:
person_values = list(person.values())
Sets

add()
Description: Elements can be added to a set using the `add()` method. Duplicates are automatically removed, as sets only store unique values.
Syntax: set_name.add(element)
Example:
fruits.add("mango")

clear()
Description: The `clear()` method removes all elements from the set, resulting in an empty set. It updates the set in place.
Syntax: set_name.clear()
Example:
fruits.clear()

copy()
Description: The `copy()` method creates a shallow copy of the set. Any modifications to the copy won't affect the original set.
Syntax: new_set = set_name.copy()
Example:
new_fruits = fruits.copy()

Defining Sets
Description: A set is an unordered collection of unique elements. Sets are enclosed in curly braces `{}`. They are useful for storing distinct values and performing set operations.
Example:
empty_set = set()  # Creating an empty set
fruits = {"apple", "banana", "orange"}

discard()
Description: Use the `discard()` method to remove a specific element from the set. If the element is not found, it is ignored.
Syntax: set_name.discard(element)
Example:
fruits.discard("apple")

issubset()
Description: The `issubset()` method checks if the current set is a subset of another set. It returns True if all elements of the current set are present in the other set, otherwise False.
Syntax: is_subset = set1.issubset(set2)
Example:
is_subset = fruits.issubset(colors)

issuperset()
Description: The `issuperset()` method checks if the current set is a superset of another set. It returns True if all elements of the other set are present in the current set, otherwise False.
Syntax: is_superset = set1.issuperset(set2)
Example:
is_superset = colors.issuperset(fruits)

pop()
Description: The `pop()` method removes and returns an arbitrary element from the set. It raises a `KeyError` if the set is empty. Use this method to remove elements when the order doesn't matter.
Syntax: removed_element = set_name.pop()
Example:
removed_fruit = fruits.pop()

remove()
Description: Use the `remove()` method to remove a specific element from the set. Raises a `KeyError` if the element is not found.
Syntax: set_name.remove(element)
Example:
fruits.remove("banana")

Set Operations
Description: Perform various operations on sets: `union`, `intersection`, `difference`, `symmetric difference`.
Syntax:
union_set = set1.union(set2)
intersection_set = set1.intersection(set2)
difference_set = set1.difference(set2)
sym_diff_set = set1.symmetric_difference(set2)
Example:
combined = fruits.union(colors)
common = fruits.intersection(colors)
unique_to_fruits = fruits.difference(colors)
sym_diff = fruits.symmetric_difference(colors)

update()
Description: The `update()` method adds elements from another iterable into the set. It maintains the uniqueness of elements.
Syntax: set_name.update(iterable)
Example:
fruits.update(["kiwi", "grape"])

Module 2 Summary: Python Data Structures


Congratulations! You have completed this module. At this point, you know that:
● In Python, we often use tuples to group related data together. Tuples refer to ordered and immutable collections of elements.
● Tuples are usually written as comma-separated elements within parentheses "()".
● You can include strings, integers, and floats in tuples and access them using both positive
and negative indices.
● You can perform operations such as combining, concatenating, and slicing on tuples.
● Tuples are immutable, so you need to create a new tuple to manipulate it.
● Tuples, termed nesting, can include other tuples of complex data types.
● You can access elements in a nested tuple through indexing.
● Lists in Python contain ordered collections of items that can hold elements of different types
and are mutable, allowing for versatile data storage and manipulation.
● A list is an ordered sequence, represented with square brackets "[]".
● Lists possess mutability, unlike tuples, which are immutable.
● A list can contain strings, integers, and floats; you can nest lists within it.
● You can access each element in a list using both positive and negative indexing.
● Appending or extending a list modifies that same list, whereas concatenating two lists creates a new list.
● You can perform operations such as adding, deleting, splitting, and so forth on a list.
● You can split a string into a list of elements using a delimiter.
● Aliasing occurs when multiple names refer to the same object.
● You can also clone a list to create another list.
● Dictionaries in Python are key-value pairs that provide a flexible way to store and retrieve
data based on unique keys.
● Dictionaries consist of keys and values; keys must be immutable (strings are commonly used), while values can be of any type.
● You denote dictionaries using curly brackets.
● The keys necessitate immutability and uniqueness.
● The values may be either immutable or mutable, and they allow duplicates.
● You separate each key-value pair with a comma, and a colon ":" separates each key from its value.
● You can assign dictionaries to a variable.
● You use the key as an argument to retrieve the corresponding value.
● You can make additions and deletions to dictionaries.
● You can perform an operation on a dictionary to check the key, which results in a true or false
output.
● You can apply methods to obtain a list of keys and values in a dictionary.
● Sets in Python are collections of unique elements, useful for tasks such as removing
duplicates and performing set operations like union and intersection. Sets lack order.
● Curly brackets "{}" are helpful for defining elements of a set.
● Sets do not contain duplicate items.
● A list passed through the set function generates a set containing unique elements.
● You use “Set Operations” to perform actions such as adding, removing, and verifying
elements in a set.
● You can use the ampersand "&" operator to obtain the intersection of two sets, that is, the elements common to both.
● You can use the union() method to combine two sets, including both the common and unique elements from both sets.
● The issubset() method is used to determine whether one set is a subset of another.
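A minimal sketch of these set operations (the genre values are illustrative):

album_set1 = set(["pop", "rock", "soul", "rock"])   # set() keeps only unique elements
album_set2 = {"rock", "R&B", "disco"}

album_set1.add("hard rock")                  # add an element
print(album_set1 & album_set2)               # intersection via "&": {'rock'}
print(album_set1.union(album_set2))          # union: all unique elements from both sets
print({"rock"}.issubset(album_set2))         # True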
Conditions and Branching
Estimated time needed: 10 minutes

Objective:

In this reading, you'll learn about:

1. Comparison operators
2. Branching
3. Logical operators

1. Comparison operations
Comparison operations are essential in programming. They help compare values and make
decisions based on the results.

Equality operator

The equality operator == checks if two values are equal. For example, in Python:

age = 25
if age == 25:
    print("You are 25 years old.")

Here, the code checks if the variable age is equal to 25 and prints a message accordingly.

Inequality operator

The inequality operator != checks if two values are not equal:

if age != 30:
    print("You are not 30 years old.")

Here, the code checks if the variable age is not equal to 30 and prints a message accordingly.

Greater than and less than

You can also compare if one value is greater than another.

if age >= 20:
    print("Yes, the age is greater than or equal to 20.")

Here, the code checks if the variable age is greater than or equal to 20 and prints a message accordingly.
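For the remaining comparison operators, a brief sketch (the values are illustrative):

age = 25
print(age > 30)    # False - strictly greater than
print(age < 30)    # True  - strictly less than
print(age <= 25)   # True  - less than or equal to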

2. Branching

Branching is like making decisions in your program based on conditions. Think of it as real-life
choices.
The IF statement

Consider a real-life scenario of entering a bar. If you're above a certain age, you can enter;
otherwise, you cannot.

age = 20
if age >= 21:
    print("You can enter the bar.")
else:
    print("Sorry, you cannot enter.")

Here, you are using the if statement to make a decision based on the age variable.

The ELIF Statement

Sometimes, there are multiple conditions to check. For example, if you're not old enough for the bar,
you can go to a movie instead.

if age >= 21:
    print("You can enter the bar.")
elif age >= 18:
    print("You can watch a movie.")
else:
    print("Sorry, you cannot do either.")

Real-life example: Automated Teller Machine (ATM)

When a user interacts with an ATM, the software in the ATM can use branching to make decisions
based on the user's input. For example, if the user selects "Withdraw Cash" the ATM can branch into
different denominations of bills to dispense based on the amount requested.

user_choice = "Withdraw Cash"
if user_choice == "Withdraw Cash":
    amount = int(input("Enter the amount to withdraw: "))  # convert the input string to a number
    if amount % 10 == 0:
        dispense_cash(amount)  # dispense_cash is assumed to be defined elsewhere in the ATM software
    else:
        print("Please enter a multiple of 10.")
else:
    print("Thank you for using the ATM.")

3. Logical operators

Logical operators help combine and manipulate conditions.

The NOT operator

Real-life example: Notification settings


In a smartphone's notification settings, you can use the NOT operator to control when to send
notifications. For example, you might only want to receive notifications when your phone is not in "Do
Not Disturb" mode.

The not operator negates a condition.

is_do_not_disturb = True
if not is_do_not_disturb:
    send_notification("New message received")

The AND operator

Real-life example: Access control

In a secure facility, you can use the AND operator to check multiple conditions for access. To open a
high-security door, a person might need both a valid ID card and a matching fingerprint.

The AND operator checks if all required conditions are true, like needing both keys to open a safe.

has_valid_id_card = True
has_matching_fingerprint = True
if has_valid_id_card and has_matching_fingerprint:
    open_high_security_door()

The OR operator

Real-life example: Movie night decision

When planning a movie night with friends, you can use the OR operator to decide on a movie genre.
You'll choose a movie if at least one person is interested.

The OR operator checks if at least one condition is true. It's like choosing between different movies
to watch.

friend1_likes_comedy = True
friend2_likes_action = False
friend3_likes_drama = False
if friend1_likes_comedy or friend2_likes_action or friend3_likes_drama:
    choose_a_movie()  # choose_a_movie is assumed to be defined elsewhere
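Logical operators are often combined with comparisons inside a single branch. A short sketch (the variables are illustrative):

age = 19
has_ticket = True
if age >= 18 and has_ticket:
    print("You can watch the movie.")
elif age >= 18 or has_ticket:
    print("You meet only one of the requirements.")
else:
    print("You meet neither requirement.")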

Summary
In this reading, you delved into the most frequently used comparison and logical operators and the concept of conditional branching, which encompasses the use of if, elif, and else statements.

Introduction to Loops in Python
Estimated time needed: 10 minutes

Objectives
1. Understand Python loops.
2. Understand how a loop works.
3. Learn about the need for loops.
4. Utilize Python's range function.
5. Familiarize yourself with Python's enumerate function.
6. Apply while loops for conditional tasks.
7. Distinguish which type of loop is appropriate for a task.

What is a Loop?
In programming, a loop is like a magic trick that allows a computer to do something over and over
again. Imagine you are a magician's assistant, and your magician friend asks you to pull a rabbit out
of a hat, but not just once - they want you to keep doing it until they tell you to stop. That is what
loops do for computers - they repeat a set of instructions as many times as needed.

How does a loop work?


Here's how it works in Python:
● Start: The for loop begins with the keyword for, followed by a variable that will take on each
value in a sequence.
● Condition: After the variable, you specify the keyword in and a sequence, such as a list or a
range, that the loop will iterate through.
● If Condition True:
1. The loop takes the first value from the sequence and assigns it to the variable.
2. The indented block of code following the loop header is executed using this value.
3. The loop then moves to the next value in the sequence and repeats the process until all
values have been used.
● Statement: Inside the indented block of the loop, you write the statements that you want to
repeat for each value in the sequence.
● Repeat: The loop continues to repeat the block of code for each value in the sequence until
there are no more values left.
● If Condition False:
1. Once all values in the sequence have been processed, the loop terminates automatically.
2. The loop completes its execution, and the program continues to the next statement after the
loop.

The Need for Loops

Think about when you need to count from 1 to 10. Doing it manually is easy, but what if you had to
count to a million? Typing all those numbers one by one would be a nightmare! This is where loops
come in handy. They help computers repeat tasks quickly and accurately without getting tired.

Main Types of Loops

For Loops

For loops are like a superhero's checklist. A for loop in programming is a control structure that allows
the repeated execution of a set of statements for each item in a sequence, such as elements in a list
or numbers in a range, enabling efficient iteration and automation of tasks

Syntax of for loop

for val in sequence:
    # statement(s) to be executed in sequence as a part of the loop

Here is an example of For loop.

Imagine you're a painter, and you want to paint a beautiful rainbow with seven colors. Instead of
picking up each color one by one and painting the rainbow, you could tell a magical painter's
assistant to do it for you. This is what a basic for loop does in programming.

We have a list of colours.

colors = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]

Let's print each color name on a new line using a for loop.

for color in colors:
    print(color)

In this example, the for loop picks each color from the colors list and prints it on the screen. You don't
have to write the same code for each color - the loop does it automatically!

Sometimes you do not want to paint a rainbow, but you want to count the number of steps to reach
your goal. A range-based for loop is like having a friendly step counter that helps you reach your
target.
Here is how you might use a for loop to count from 1 to 10:

for number in range(1, 11):
    print(number)

Here, the range(1, 11) generates a sequence from 1 to 10, and the for loop goes through each
number in that sequence, printing it out. It's like taking 10 steps, and you're guided by the loop!

Range Function

The range function in Python generates an ordered sequence that can be used in loops. It takes one
or two arguments:

● If given one argument (e.g., range(11)), it generates a sequence starting from 0 up to (but
not including) the given number.

for number in range(11):
    print(number)

● If given two arguments (e.g., range(1, 11)), it generates a sequence starting from the first
argument up to (but not including) the second argument.

for number in range(1, 11):
    print(number)

The Enumerated For Loop

Have you ever needed to keep track of both the item and its position in a list? An enumerated for
loop comes to your rescue. It's like having a personal assistant who not only hands you the item but
also tells you where to find it.

Consider this example:

fruits = ["apple", "banana", "orange"]
for index, fruit in enumerate(fruits):
    print(f"At position {index}, I found a {fruit}")

With this loop, you not only get the fruit but also its position in the list. It's as if you have a magical
guide pointing out each fruit's location!

While Loops

While loops are like a sleepless night at a friend's sleepover. Imagine you and your friends keep
telling ghost stories until someone decides it's time to sleep. As long as no one says, "Let's sleep"
you keep telling stories.
A while loop works similarly - it repeats a task as long as a certain condition is true. It's like saying,
"Hey computer, keep doing this until I say stop!"

Basic syntax of While Loop.

while condition:
    # Code to be executed while the condition is true
    # Indentation is crucial to indicate the scope of the loop

For example, here's how you might use a while loop to count from 1 to 10:

count = 1
while count <= 10:
    print(count)
    count += 1

Here's a breakdown of the above code:

1. There is a variable named count initialized with the value 1.


2. The while loop is used to repeatedly execute a block of code as long as a given condition is
True. In this case, the condition is count <= 10, meaning the loop will continue as long as
count is less than or equal to 10.
3. Inside the loop:
○ The print(count) statement outputs the current value of the count variable.
○ The count += 1 statement increments the value of count by 1. This step ensures that
the loop will eventually terminate when count becomes greater than 10.
4. The loop will continue executing as long as the condition count <= 10 is satisfied.
5. The loop will print the numbers 1 to 10 in consecutive order since the print statement is
inside the loop block and executed during each iteration.
6. Once count reaches 11, the condition count <= 10 will evaluate to False, and the loop will
terminate.
7. The output of the code will be the numbers 1 to 10, each printed on a separate line.

The Loop Flow

Both for and while loops have their special moves, but they follow a pattern:

● Initialization: You set up things like a starting point or conditions.


● Condition: You decide when the loop should keep going and when it should stop.
● Execution: You do the task inside the loop.
● Update: You make changes to your starting point or conditions to move forward.
● Repeat: The loop goes back to step 2 until the condition is no longer true.
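A small sketch that annotates the count-to-10 while loop from earlier with these five steps:

count = 1               # Initialization: set up the starting point
while count <= 10:      # Condition: decide whether the loop keeps going
    print(count)        # Execution: do the task inside the loop
    count += 1          # Update: move the counter forward
# Repeat: after each pass, control returns to the condition until it is False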

When to Use Each


For Loops: Use for loops when you know the number of iterations in advance and want to process
each element in a sequence. They are best suited for iterating over collections and sequences
where the length is known.

While Loops: Use while loops when you need to perform a task repeatedly as long as a certain
condition holds true. While loops are particularly useful for situations where the number of iterations
is uncertain or where you're waiting for a specific condition to be met.
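A brief sketch contrasting the two choices (the list of scores and the sentinel value -1 are illustrative assumptions):

# for loop: the number of iterations is known in advance (one per score)
scores = [7, 9, 10, 6]
for score in scores:
    print("Score:", score)

# while loop: the number of iterations is uncertain; stop when the user types -1
entry = int(input("Enter a score (-1 to stop): "))
while entry != -1:
    print("You entered", entry)
    entry = int(input("Enter a score (-1 to stop): "))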

Summary
In this adventure into coding, we explored loops in Python - special tools that help us do things over
and over again without getting tired. We met two types of loops: "for loops" and "while loops."

For Loops were like helpers that made us repeat tasks in order. We painted colors, counted
numbers, and even got a helper to tell us where things were in a list. For loops made our job easier
and made our code look cleaner.

While Loops were like detectives that kept doing something as long as a rule was true. They helped us take steps, guess numbers, and work until we were tired. While loops were like smart assistants that didn't stop until we said so.

Exploring Python Functions
Estimated time needed: 15 minutes

Objectives:
By the end of this reading, you should be able to:

1. Describe the function concept and the importance of functions in programming


2. Write a function that takes inputs and performs tasks
3. Use built-in functions like len(), sum(), and others effectively
4. Define and use your functions in Python
5. Differentiate between global and local variable scopes
6. Use loops within the function
7. Modify data structures using functions

Introduction to functions

A function is a fundamental building block that encapsulates specific actions or computations. As in


mathematics, where functions take inputs and produce outputs, programming functions perform
similarly. They take inputs, execute predefined actions or calculations, and then return an output.

Purpose of functions

Functions promote code modularity and reusability. Imagine you have a task that needs to be
performed multiple times within a program. Instead of duplicating the same code at various places,
you can define a function once and call it whenever you need that task. This reduces redundancy
and makes the code easier to manage and maintain.

Benefits of using functions


Modularity: Functions break down complex tasks into manageable components
Reusability: Functions can be used multiple times without rewriting code
Readability: Functions with meaningful names enhance code understanding
Debugging: Isolating functions eases troubleshooting and issue fixing
Abstraction: Functions simplify complex processes behind a user-friendly interface
Collaboration: Team members can work on different functions concurrently
Maintenance: Changes made in a function automatically apply wherever it's used

How functions take inputs, perform tasks, and produce outputs

Inputs (Parameters)

Functions operate on data, and they can receive data as input. These inputs are known as
parameters or arguments. Parameters provide functions with the necessary information they need to
perform their tasks. Consider parameters as values you pass to a function, allowing it to work with
specific data.

Performing tasks

Once a function receives its input (parameters), it executes predefined actions or computations.
These actions can include calculations, operations on data, or even more complex tasks. The
purpose of a function determines the tasks it performs. For instance, a function could calculate the
sum of numbers, sort a list, format text, or fetch data from a database.

Producing outputs

After performing its tasks, a function can produce an output. This output is the result of the
operations carried out within the function. It's the value that the function “returns” to the code that
called it. Think of the output as the end product of the function's work. You can use this output in
your code, assign it to variables, pass it to other functions, or even print it out for display.
Example:

Consider a function named calculate_total that takes two numbers as input (parameters), adds them
together, and then produces the sum as the output. Here's how it works:

def calculate_total(a, b):          # Parameters: a and b
    total = a + b                   # Task: addition
    return total                    # Output: sum of a and b

result = calculate_total(5, 7)      # Calling the function with inputs 5 and 7
print(result)                       # Output: 12

Python's built-in functions

Python has a rich set of built-in functions that provide a wide range of functionalities. These functions
are readily available for you to use, and you don't need to be concerned about how they are
implemented internally. Instead, you can focus on understanding what each function does and how
to use it effectively.
Using built-in functions or Pre-defined functions

To use a built-in function, you simply call the function's name followed by parentheses. Any required
arguments or parameters are passed into the function within these parentheses. The function then
performs its predefined task and may return an output you can use in your code.

Here are a few examples of commonly used built-in functions:

len(): Calculates the length of a sequence or collection

string_length = len("Hello, World!")   # Output: 13
list_length = len([1, 2, 3, 4, 5])     # Output: 5

sum(): Adds up the elements in an iterable (list, tuple, and so on)

total = sum([10, 20, 30, 40, 50])   # Output: 150

max(): Returns the maximum value in an iterable

highest = max([5, 12, 8, 23, 16])   # Output: 23

min(): Returns the minimum value in an iterable

lowest = min([5, 12, 8, 23, 16])    # Output: 5

Python's built-in functions offer a wide array of functionalities, from basic operations like len() and
sum() to more specialized tasks.

Defining your functions

Defining a function is like creating your mini-program:

1. Use def followed by the function name and parentheses

Here is the syntax to define a function:

def function_name():
    pass

A "pass" statement in a programming function is a placeholder or a no-op (no operation) statement.


Use it when you want to define a function or a code block syntactically but do not want to specify any
functionality or implementation at that moment.

● Placeholder: "pass" acts as a temporary placeholder for future code that you intend to write
within a function or a code block.
● Syntax Requirement: In languages like Python, a function or conditional block cannot be left empty. Using "pass" ensures that the code remains syntactically correct, even if it doesn't do anything yet.

● No Operation: "pass" itself doesn't perform any meaningful action. When the interpreter
encounters “pass”, it simply moves on to the next statement without executing any code.

Function Parameters:

● Parameters are like inputs for functions


● They go inside parentheses when defining the function
● Functions can have multiple parameters

Example:

def greet(name):
    print("Hello, " + name)

greet("Alice")  # Output: Hello, Alice

Docstrings (Documentation Strings)


● Docstrings explain what a function does
● Placed inside triple quotes under the function definition
● Helps other developers understand your function

Example:

def multiply(a, b):
    """
    This function multiplies two numbers.
    Input: a (number), b (number)
    Output: Product of a and b
    """
    print(a * b)

multiply(2, 6)
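Once a docstring is written, anyone can read it with Python's built-in help() function, or through the __doc__ attribute:

help(multiply)            # displays the function signature and its docstring
print(multiply.__doc__)   # the raw docstring is also available as an attribute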
Return statement

● Return gives back a value from a function


● Ends the function's execution and sends the result
● A function can return various types of data

Example:

def add(a, b):
    return a + b

sum_result = add(3, 5)   # sum_result gets the value 8

Understanding scopes and variables

Scope is where a variable can be seen and used:

● Global Scope: Variables defined outside functions; accessible everywhere


● Local Scope: Variables inside functions; only usable within that function

Example:

Part 1: Global variable declaration

global_variable = "I'm global"

This line initializes a global variable called global_variable and assigns it the value "I'm global".

Global variables are accessible throughout the entire program, both inside and outside functions.

Part 2: Function definition

def example_function():
    local_variable = "I'm local"
    print(global_variable)   # Accessing the global variable
    print(local_variable)    # Accessing the local variable

Here, you define a function called example_function().

Within this function:

● A local variable named local_variable is declared and initialized with the string value "I'm
local." This variable is local to the function and can only be accessed within the function's
scope.
● The function then prints the values of both the global variable (global_variable) and the
local variable (local_variable). It demonstrates that you can access global and local
variables within a function.

Part 3: Function call

example_function()

In this part, you call the example_function() by invoking it. This results in the function's code being
executed.
As a result of this function call, it will print the values of the global and local variables within the
function.

Part 4: Accessing global variable outside the function

print(global_variable)   # Accessible outside the function

After calling the function, you print the value of the global variable global_variable outside the
function. This demonstrates that global variables are accessible inside and outside of
functions.

Part 5: Attempting to access local variable outside the function

# print(local_variable)   # Error: the local variable is not visible here

In this part, you are attempting to print the value of the local variable local_variable outside of the
function. However, this line would result in an error.

Local variables are only visible and accessible within the scope of the function where they are
defined.

Attempting to access them outside of that scope would raise a "NameError".

Using functions with loops

Functions and loops together

1. Functions can contain code with loops


2. This makes complex tasks more organized
3. The loop code becomes a repeatable function

Example:

def print_numbers(limit):
    for i in range(1, limit + 1):
        print(i)

print_numbers(5)   # Output: 1 2 3 4 5

Enhancing code organization and reusability

1. Functions group similar actions for easy understanding


2. Looping within functions keeps code clean
3. You can reuse a function to repeat actions

Example:

def greet(name):
    return "Hello, " + name

for _ in range(3):
    print(greet("Alice"))

Modifying data structure using functions


You'll use Python and a list as the data structure for this illustration. In this example, you will create
functions to add and remove elements from a list.

Part 1: Initialize an empty list

# Define an empty list as the initial data structure
my_list = []

In this part, you start by creating an empty list named my_list. This empty list serves as the data
structure that you will modify throughout the code.

Part 2: Define a function to add elements

# Function to add an element to the list
def add_element(data_structure, element):
    data_structure.append(element)

Here, you define a function called add_element. This function takes two parameters:

● data_structure: This parameter represents the list to which you want to add an element

● element: This parameter represents the element you want to add to the list
Inside the function, you use the append method to add the provided element to the data_structure,
which is assumed to be a list.

Part 3: Define a function to remove elements

# Function to remove an element from the list
def remove_element(data_structure, element):
    if element in data_structure:
        data_structure.remove(element)
    else:
        print(f"{element} not found in the list.")

In this part, you define another function called remove_element. It also takes two parameters:

● data_structure: The list from which we want to remove an element

● element: The element we want to remove from the list


Inside the function, you use conditional statements to check if the element is present in the

data_structure. If it is, you use the remove method to remove the first occurrence of the element. If
it's not found, you print a message indicating that the element was not found in the list.

Part 4: Add elements to the list

# Add elements to the list using the add_element function
add_element(my_list, 42)
add_element(my_list, 17)
add_element(my_list, 99)

Here, you use the add_element function to add three elements (42, 17, and 99) to the my_list. These
are added one at a time using function calls.

Part 5: Print the current list

# Print the current list
print("Current list:", my_list)

This part simply prints the current state of the my_list to the console, allowing us to see the elements
that have been added so far.

Part 6: Remove elements from the list

# Remove an element from the list using the remove_element function
remove_element(my_list, 17)
remove_element(my_list, 55)   # This will print a message since 55 is not in the list

In this part, you use the remove_element function to remove elements from the my_list. First, you
attempt to remove 17 (which is in the list), and then you try to remove 55 (which is not in the list).

The second call to remove_element will print a message indicating that 55 was not found.

Part 7: Print the updated list

# Print the updated list
print("Updated list:", my_list)
Finally, you print the updated my_list to the console. This allows us to observe the modifications
made to the list by adding and removing elements using the defined functions.
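If the parts above are run in order, the printed output would look like this:

Current list: [42, 17, 99]
55 not found in the list.
Updated list: [42, 99]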

Conclusion

Congratulations! You've completed the Reading Instruction Lab on Python functions. You've gained a
solid understanding of functions, their significance, and how to create and use them effectively.
These skills will empower you to write more organized, modular, and powerful code in your Python
projects.

