IBM Data Analyst Professional Certificate Notes
WEEK 1
A data analyst ecosystem includes the infrastructure, software, tools, frameworks, and
processes used to gather, clean, analyze, mine, and visualize data.
Based on how well-defined the structure of the data is, data can be categorized as:
● Structured Data, that is data which is well organized in formats that can be stored in
databases.
● Semi-Structured Data, that is data which is partially organized and partially free form.
● Unstructured Data, that is data which cannot be organized conventionally into rows and
columns.
Data comes in a wide-ranging variety of file formats, such as delimited text files, spreadsheets,
XML, PDF, and JSON, each with its own list of benefits and limitations of use.
Data is extracted from multiple data sources, ranging from relational and non-relational
databases to APIs, web services, data streams, social platforms, and sensor devices.
Once the data is identified and gathered from different sources, it needs to be staged in a data
repository so that it can be prepared for analysis. The type, format, and sources of data
influence the type of data repository that can be used.
Data professionals need a host of languages that can help them extract, prepare, and analyze
data. These can be classified as:
● Querying languages, such as SQL, used for accessing and manipulating data from
databases.
● Programming languages such as Python, R, and Java, for developing applications and
controlling application behavior.
● Shell and Scripting languages, such as Unix/Linux Shell, and PowerShell, for automating
repetitive operational tasks.
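For instance, a querying language and a programming language often work together. The following is a minimal sketch, assuming a hypothetical sales table, that runs a SQL query from Python using the built-in sqlite3 module:

```python
import sqlite3

# Build an in-memory database with a hypothetical 'sales' table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120.0), ("West", 340.5), ("East", 98.25)],
)

# SQL: access and aggregate data, here total amount per region
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(region, total)
conn.close()
```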
A Data Repository is a general term that refers to data that has been collected, organized, and
isolated so that it can be used for reporting, analytics, and also for archival purposes.
ETL, or Extract, Transform, and Load, is an automated process that converts raw data
into analysis-ready data by:
● Extracting data from source locations.
● Transforming raw data into a form suitable for analysis.
● Loading the processed data into a destination data repository.
Data Pipeline, sometimes used interchangeably with ETL, encompasses the entire journey of
moving data from the source to a destination data lake or application, using the ETL process.
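As a toy illustration of the ETL steps (not the course's official example), the sketch below assumes a hypothetical sales.csv file with region and amount columns, and loads the cleaned rows into a SQLite table:

```python
import csv
import sqlite3

# Extract: read raw rows from a hypothetical CSV source
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: convert amounts to numbers and drop rows with missing values
clean = [
    {"region": r["region"], "amount": float(r["amount"])}
    for r in rows
    if r.get("amount")
]

# Load: write analysis-ready rows into a destination repository
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (:region, :amount)", clean)
conn.commit()
conn.close()
```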
Big Data refers to the vast amounts of data that are being produced each moment of every day,
by people, tools, and machines. The sheer velocity, volume, and variety of data challenge the
tools and systems used for conventional data. These challenges led to the emergence of
processing tools and platforms designed specifically for Big Data, such as Apache Hadoop,
Apache Hive, and Apache Spark.
WEEK 3
Gathering data
In this lesson, you have learned:
● The process of identifying data begins by determining the information that needs to be
collected, which in turn is determined by the goal you seek to achieve.
● Having identified the data, your next step is to identify the sources from which you will
extract the required data and define a plan for data collection. Decisions regarding the
timeframe over which you need your data set, and how much data would suffice for
arriving at a credible analysis also weigh in at this stage.
● Data Sources can be internal or external to the organization, and they can be primary,
secondary, or third-party, depending on whether you are obtaining the data directly from
the original source, retrieving it from externally available data sources, or purchasing it
from data aggregators.
● Some of the data sources from which you could be gathering data include databases,
the web, social media, interactive platforms, sensor devices, data exchanges, surveys
and observation studies.
● Data that has been identified and gathered from the various data sources is combined
using a variety of tools and methods to provide a single interface through which the data
can be queried and manipulated.
● The data you identify, the source of that data, and the practices you employ for gathering
the data have implications for quality, security, and privacy, which need to be considered
at this stage.
● Examples of third-party data sources include Forrester and Business Insider.
Wrangling data
In this lesson, you have learned the following information:
Once the data you identified is gathered and imported, your next step is to make it
analysis-ready. This is where the process of Data Wrangling, or Data Munging, comes in. Data
Wrangling is an iterative process that involves data exploration, transformation, and validation.
As part of transformation, you may need to do the following (a short code sketch follows this list):
● Structurally manipulate and combine the data using Joins and Unions.
● Normalize data, that is, clean the database of unused and redundant data.
● Denormalize data, that is, combine data from multiple tables into a single table so that it
can be queried faster.
● Clean data, which involves profiling data to uncover quality issues, visualizing data to
spot outliers, and fixing issues such as missing values, duplicate data, irrelevant data,
inconsistent formats, syntax errors, and outliers.
● Enrich data, which involves considering additional data points that could add value to the
existing data set and lead to a more meaningful analysis.
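The sketch below shows a few of these wrangling steps in pandas, using made-up example frames; the column names and enrichment rule are illustrative only:

```python
import pandas as pd

orders = pd.DataFrame(
    {"order_id": [1, 2, 2], "customer_id": [10, 11, 11], "amount": [50.0, 20.0, 20.0]}
)
customers = pd.DataFrame({"customer_id": [10, 11], "country": ["US", "DE"]})

# Structurally combine: join orders with customer attributes
df = orders.merge(customers, on="customer_id", how="left")

# Clean: drop exact duplicate rows and fill missing amounts
df = df.drop_duplicates().fillna({"amount": 0.0})

# Enrich: add a derived data point for more meaningful analysis
df["is_large_order"] = df["amount"] > 30.0
print(df)
```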
A variety of software and tools are available for the Data Wrangling process. Some of the
popularly used ones include Excel Power Query, Spreadsheets, OpenRefine, Google DataPrep,
Watson Studio Refinery, Trifacta Wrangler, Python, and R, each with their own set of
characteristics, strengths, limitations, and applications.
WEEK 4
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and
presentation of numerical or quantitative data.
Statistical Analysis involves the use of statistical methods in order to develop an understanding
of what the data represents. It can be:
● Descriptive, that is, analysis that provides a summary of what the data represents.
Common measures include Central Tendency, Dispersion, and Skewness.
● Inferential, that is, analysis that involves making inferences, or generalizations, about data.
Common measures include Hypothesis Testing, Confidence Intervals, and Regression
Analysis.
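A minimal sketch of both kinds of analysis, using made-up numbers with NumPy and SciPy; the hypothesized mean of 10 is arbitrary:

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.8])

# Descriptive: central tendency, dispersion, and skewness
print("mean:", data.mean(), "std:", data.std(ddof=1), "skew:", stats.skew(data))

# Inferential: test the hypothesis that the population mean is 10
t_stat, p_value = stats.ttest_1samp(data, popmean=10)
print("t:", t_stat, "p:", p_value)

# Inferential: 95% confidence interval for the mean
ci = stats.t.interval(0.95, df=len(data) - 1, loc=data.mean(), scale=stats.sem(data))
print("95% CI:", ci)
```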
Data Mining, simply put, is the process of extracting knowledge from data. It involves the use of
pattern recognition technologies, statistical analysis, and mathematical techniques, in order to
identify correlations, patterns, variations, and trends in data.
There are several techniques that can help mine data, such as classifying attributes of data,
clustering data into groups, and establishing relationships between events, variables, and
inputs and outputs.
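For example, clustering data into groups can be sketched with scikit-learn's KMeans; the two-dimensional points below are invented (say, customer spend versus visit frequency):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points, e.g. customer spend vs. visit frequency
X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [7.8, 9.2], [0.9, 2.1], [8.1, 8.8]])

# Cluster the data into two groups
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels:", model.labels_)
print("centers:", model.cluster_centers_)
```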
A variety of software and tools are available for analyzing and mining data. Some of the
popularly used ones include Spreadsheets, R-Language, Python, IBM SPSS Statistics, IBM
Watson Studio, and SAS, each with their own set of characteristics, strengths, limitations, and
applications.
When communicating your findings, you need to:
● Ensure that your audience is able to trust you, understand you, and relate to your
findings and insights.
● Establish the credibility of your findings.
● Present the data within a structured narrative.
● Support your communication with strong visualizations so that the message is clear and
concise, and drives your audience to take action.
Data visualization is the discipline of communicating information through the use of visual
elements such as graphs, charts, and maps. The goal of visualizing data is to make information
easy to comprehend, interpret, and retain.
There are several types of graphs and charts available for plotting any kind of data, such as
bar charts, column charts, pie charts, and line charts. Let’s look at some basic examples of the
types of graphs you can create for visualizing your data (a small plotting sketch follows the list).
● Bar Charts are great for comparing related data sets or parts of a whole. For example, a
bar chart could show the population numbers of 10 different countries and how they
compare to one another.
● Column Charts compare values side-by-side. You can use them quite effectively to show
change over time, for example, showing how page views and user session times on
your website change on a month-to-month basis. Although alike except for the
orientation, bar charts and column charts cannot always be used interchangeably. For
example, a column chart may be better suited for showing negative and positive values.
● Pie Charts show the breakdown of an entity into its sub-parts and the proportion of the
sub-parts in relation to one another. Each portion of the pie represents a static value or
category, and the sum of all categories is equal to one hundred percent. For example, in
a marketing campaign with four marketing channels (social sites, native advertising, paid
influencers, and live events), a pie chart could show the total number of leads generated
per channel.
● Line Charts display trends. They’re great for showing how a data value changes in
relation to a continuous variable, for example, how the sales of your product, or
multiple products, have changed over time, where time is the continuous variable. Line
charts can be used for understanding trends, patterns, and variations in data, and also
for comparing different but related data sets with multiple series.
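The following matplotlib sketch draws a column chart and a line chart from invented monthly website figures, mirroring the page-views example above:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
page_views = [1200, 1350, 1100, 1500]
sessions = [300, 340, 280, 390]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Column chart: values side by side, showing change over time
ax1.bar(months, page_views)
ax1.set_title("Monthly page views")

# Line chart: the trend of a value over a continuous variable (time)
ax2.plot(months, sessions, marker="o")
ax2.set_title("User sessions trend")

plt.tight_layout()
plt.show()
```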
You can also use data visualization to build dashboards. Dashboards organize and display
reports and visualizations coming from multiple data sources into a single graphical interface.
They are easy to comprehend and allow you to generate reports on the go.
When deciding which tools to use for data visualization, you need to consider the ease-of-use
and purpose of the visualization. Some of the popularly used tools include Spreadsheets,
Jupyter Notebook, Python libraries, R-Studio and R-Shiny, IBM Cognos Analytics, Tableau, and
Power BI.
WEEK 5
Currently, the demand for skilled data analysts far outweighs the supply, which means
companies are willing to pay a premium to hire skilled data analysts.
● Data Analyst Specialist roles - On this path, you start as a Junior Data Analyst and move
up to the level of a Principal Analyst by continually advancing your technical, statistical,
and analytical skills from a foundational level to an expert level.
● Domain Specialist roles - These roles are for you if you have acquired specialization in a
specific domain and want to work your way up to be seen as an authority in your domain.
● Analytics-enabled job roles - These roles include jobs where having analytic skills can
up-level your performance and differentiate you from your peers.
● Other Data Professions - There are several other roles in a modern data ecosystem,
such as Data Engineer, Big Data Engineer, Data Scientist, Business Analyst, or
Business Intelligence Analyst. If you upskill yourself based on the required skills, you can
transition into these roles.
There are several paths you can consider in order to gain entry into the Data Analyst field.
In a fraud detection use case, patterns and anomalies that could signal fraudulent transactions
include (one of these is sketched in code after the list):
● A change in frequency of orders placed, for example, a customer who typically places a
couple of orders a month suddenly makes numerous transactions within a short span of
time, sometimes within minutes of the previous order.
● Orders that are significantly higher than a user’s average transaction.
● Bulk orders of the same item with slight variations such as color or size—especially if
this is atypical of the user’s transaction history.
● A sudden change in delivery preference, for example, a change from home or office
delivery address to in-store, warehouse, or PO Box delivery.
● A mismatched IP Address, or an IP Address that is not from the general location or area
of the billing address.
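One of these signals can be sketched in pandas; the threshold of twice the customer's average is illustrative, not a rule from the course:

```python
import pandas as pd

tx = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "amount": [20.0, 25.0, 400.0, 60.0, 65.0],
    }
)

# Flag orders significantly higher than a user's average transaction
avg = tx.groupby("customer_id")["amount"].transform("mean")
tx["suspicious"] = tx["amount"] > 2 * avg
print(tx)
```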
Before you can analyze the data for patterns and anomalies, you need to:
● Identify and gather all data points that can be of relevance to your use case. For
example, the cardholder’s details, transaction details, delivery details, location, and
network are some of the data points that could be explored.
● Clean the data. You need to identify and fix issues in the data that can lead to false or
incomplete findings, such as missing data values and incorrect data. You may also need
to standardize data formats in some cases, for example, the date fields.
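A small pandas sketch of that cleaning step, with hypothetical column names, standardizing a date field (including trimming stray whitespace) and dropping rows with missing values:

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "order_date": ["  2023-01-05", "2023-01-06  ", None],
        "amount": [100.0, None, 80.0],
    }
)

# Standardize the date field: trim whitespace, convert to a datetime type
raw["order_date"] = pd.to_datetime(raw["order_date"].str.strip())

# Drop rows with missing values that could lead to false or incomplete findings
clean = raw.dropna()
print(clean)
```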
Excel Basics for Data Analysis
WEEK 1
The list below shows keyboard shortcuts for some of the most common Excel tasks:
● Copy: Ctrl+C
● Cut: Ctrl+X
● Paste: Ctrl+V
● Undo: Ctrl+Z
● Bold: Ctrl+B
● Move to the edge of the current data region in the worksheet (e.g. end of column):
Ctrl+Arrow key (e.g. Ctrl+Down arrow)
● Move to the cell in the upper-left corner of the window (when Scroll Lock is On):
Home+Scroll Lock
● Edit the active cell and put the cursor at the end of the cell’s contents: F2
● There are several spreadsheet applications available in the marketplace; the most
commonly used and fully-featured spreadsheet application is Microsoft Excel.
● Spreadsheets provide several advantages over manual calculation methods and they
help you keep data organized and easily accessible.
● As a Data Analyst, you can use spreadsheets as a tool for your data analysis tasks.
● There are several elements that make up a workbook in a spreadsheet application.
● The ribbon provides access to all the features and tools required to view, enter, edit,
manipulate, clean, and analyze data in Excel.
● There are several ways to navigate around a worksheet and workbook in Excel.
WEEK 2
Completed in Excel for the web (online).
WEEK 3
The five traits of data quality are:
● Accuracy
● Completeness
● Reliability
● Relevance
● Timeliness
Importing Text:
● You can use the ‘Text Import Wizard’ to import data from other formats, such as plain
text or comma-separated values (CSV) files.
The three fundamentals of data privacy are:
● Confidentiality
● Collection and Use
● Compliance
Cleaning data
In this lesson, you have learned the following information:
● It’s important to remove any duplicated or inaccurate data, as well as any empty rows,
from your dataset.
● There are several other types of data inconsistency that you may need to resolve, in
order to properly clean your data:
1. Change the case of text
2. Fix date formatting errors
3. Trim whitespace from your data
● You can use the Flash Fill and Text to Columns features in Excel to manipulate and
standardize your data, and functions can also be used to help manipulate and
standardize your data.
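The same kinds of fixes can also be expressed in code. A minimal pandas sketch (the certificate introduces Python elsewhere) showing case changes, date conversion, and whitespace trimming on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["  alice  ", "BOB", "Carol "],
        "joined": ["2023-01-05", "2023-02-10", "2023-03-15"],
    }
)

# Change the case of text (like Excel's UPPER/LOWER/PROPER functions)
df["name"] = df["name"].str.strip().str.title()

# Fix date formatting by converting text to a true date type
df["joined"] = pd.to_datetime(df["joined"])

print(df)
```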
WEEK 4
The most basic way of shaping your data is to sort and filter it, as sketched below.
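A minimal pandas equivalent of sorting and filtering, using an invented products table:

```python
import pandas as pd

df = pd.DataFrame({"product": ["A", "B", "C"], "sales": [250, 100, 175]})

# Sort rows by sales, highest first
df = df.sort_values("sales", ascending=False)

# Filter to rows where sales exceed a threshold
print(df[df["sales"] > 150])
```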
Excel Functions:
Pivot Tables:
● To obtain usable and presentable insights into your data, you need to use Pivot Tables.
● Pivot tables provide a simple and quick way to summarize and analyze data, to observe
trends and patterns in your data, and to make comparisons of your data.
● Pivot tables are dynamic, so as you change and add data to the original dataset on
which the pivot table is based, the analysis and summary information changes too.
● A Data Analyst can use pivot tables to draw useful and relevant conclusions about, and
create insights into, an organization’s data in order to present those insights to interested
parties within the company.
Use this Pivot Table checklist to ensure your data is in a fit state to make a Pivot Table:
● Format your data as a table for best results.
● Ensure column headings are correct, and there is only one header row, as these column
headings become the field names in a Pivot Table.
● Remove any blank rows and columns, and try to eliminate blank cells also.
● Ensure value fields are formatted as numbers, and not text, and ensure date fields are
formatted as dates, and not text.
● You use the Pivot Table Fields pane to add and arrange data fields in your pivot table.
● Recommended Pivot Tables are a list of suggested different combinations of data that
could be used when creating a Pivot Table, based on the data selected in the worksheet.
● Slicers are on-screen graphical filter objects that enable you to filter your data using
buttons, which makes it easier to perform quick filtering of your pivot table data.
● Timelines are another type of filter tool that enable you to filter specifically on
date-related data in your pivot table. This is a much quicker and more effective way of
dynamically filtering by date, rather than having to create and adjust filters on your date
columns.
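For comparison, a pivot-table-style summary can be reproduced in pandas; this sketch with invented sales data mirrors what an Excel Pivot Table would show:

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West"],
        "product": ["A", "B", "A", "B"],
        "revenue": [100, 150, 200, 120],
    }
)

# Summarize revenue by region and product, like an Excel Pivot Table
pivot = sales.pivot_table(
    index="region", columns="product", values="revenue", aggfunc="sum"
)
print(pivot)
```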
WEEK 5