
Principles and Practices of Data Science
Data Science Life Cycle
Stage 1: Business and Problem Understanding

● Every data science project should aim to fulfill a precise and measurable
goal and a business need that is clearly connected to the purposes,
workflows, and decision-making processes of the business.

● Failing to define a clear goal at the launch of a data science project quickly
leads to confusion: collaboration becomes impossible, actions are
misaligned, and nobody can tell whether the goal has been reached.

● In order to achieve maximum return on investment (ROI), this goal must be
expressed in such a way as to lead to advantageous changes in business
operations. To this end, efforts must be centered on identifying the right
business question to ask: “How will the results be used?”
Example: ‘How will the results be used?’

● Objective:
- Delight our customers

● Key Results:
- Interview 20 customers per month and get feedback
- Increase customer retention
- Achieve a product engagement target
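
A key result such as "increase customer retention" is only measurable once the metric is defined. The sketch below assumes one common definition of retention rate (customers at the end of a period, minus newly acquired customers, divided by customers at the start). The function name and the figures are hypothetical and are not taken from the slides.

```python
def retention_rate(start_customers, end_customers, new_customers):
    """Share of the starting customers who are still customers at period end.

    Assumed definition: (end - new) / start. Other definitions exist.
    """
    if start_customers <= 0:
        raise ValueError("start_customers must be positive")
    return (end_customers - new_customers) / start_customers


# Hypothetical month: 200 customers at the start, 210 at the end,
# 30 of whom were acquired during the month.
rate = retention_rate(start_customers=200, end_customers=210, new_customers=30)
print(f"Retention rate: {rate:.0%}")
```

Tracking this number every month gives the team a concrete way to tell whether the key result is being met.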
A business need could be:

1. A goal
- For example, increasing sales to grow a business
2. Solving a problem
- For example, low customer retention
Problem Types

Problems can be small or large, simple or complex. Each requires a slightly
different approach, but the first step is always the same: understanding what
kind of problem you are trying to solve.
A data scientist has to deal with a variety of problem types, such as:

1. Making predictions.
2. Categorizing things.
3. Spotting something unusual.
4. Identifying themes.
5. Finding patterns.
1. Making predictions
● Using data to make an informed decision about how things may be in the
future.
● For example, a hospital system might use remote patient monitoring to
predict health events for chronically ill patients.
● The patients would take their health vitals at home every day. Combining that
information with data about their age, risk factors, and other important
details could enable the hospital’s algorithm to predict future health
problems and even decrease future hospitalisations.
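
As a minimal illustration of prediction, the sketch below fits a straight line to a few daily readings and extrapolates it. It assumes a simple least-squares trend, which is far simpler than anything a real hospital algorithm would use. The readings and the name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical home readings: day number and systolic blood pressure.
days = [1, 2, 3, 4, 5]
systolic = [120, 122, 124, 126, 128]

slope, intercept = fit_line(days, systolic)
predicted_day_7 = slope * 7 + intercept
print(f"Trend: {slope:+.1f} per day, predicted day 7: {predicted_day_7:.0f}")
```

A steadily rising trend like this one could be a signal to contact the patient before the reading becomes a health event.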
2. Categorizing things
● Assigning information into different groups or clusters based on common
features.
● For example, consider a manufacturer that reviews data on shop floor
employee performance.
● An analyst may create a group for employees who are most and least
effective at engineering, a group for those who are most and least
effective at repair and maintenance, a group for those who are most and
least effective at assembly, and many more groups or clusters.
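
The grouping described above can be sketched in a few lines. The records, scores, and the `categorize` helper are hypothetical, and the sketch assumes each employee has a single numeric effectiveness score per task.

```python
from collections import defaultdict

# Hypothetical records: (employee, task, effectiveness score out of 100).
records = [
    ("Amal", "assembly", 92),
    ("Ben", "assembly", 61),
    ("Chen", "repair", 78),
    ("Dana", "repair", 85),
    ("Eli", "engineering", 70),
    ("Farah", "engineering", 88),
]


def categorize(records):
    """Group employees by task and pick the most and least effective in each."""
    by_task = defaultdict(list)
    for name, task, score in records:
        by_task[task].append((score, name))
    summary = {}
    for task, scored in by_task.items():
        # Tuples compare by score first, so max/min pick by effectiveness.
        summary[task] = {
            "most_effective": max(scored)[1],
            "least_effective": min(scored)[1],
        }
    return summary


groups = categorize(records)
for task, result in groups.items():
    print(task, result)
```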
3. Spotting something unusual
● Identifying data that is different from the norm.

● For example, consider a school system that has a sudden increase in
registered students, perhaps as large as a 30% jump. A data analyst might
look into this increase and discover that several new apartment complexes
had been built in the school district earlier that year. The school system
could use this analysis to make sure the school has enough resources to
handle the additional students.
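
One simple way to spot a jump like this is to compare each year with the previous one and flag large changes. The enrollment figures, the 25% threshold, and the name `flag_jumps` are assumptions chosen for illustration.

```python
def flag_jumps(counts, threshold=0.25):
    """Return (year, relative change) pairs where the year-over-year change
    exceeds the threshold in either direction."""
    flagged = []
    years = sorted(counts)
    for prev, curr in zip(years, years[1:]):
        change = (counts[curr] - counts[prev]) / counts[prev]
        if abs(change) > threshold:
            flagged.append((curr, change))
    return flagged


# Hypothetical registered-student counts per year.
enrollment = {2019: 1000, 2020: 1020, 2021: 1010, 2022: 1313}

unusual = flag_jumps(enrollment)
for year, change in unusual:
    print(f"{year}: {change:+.0%} compared with the previous year")
```

Flagging the year is only the first step. The analyst still has to investigate the cause, as in the apartment complex example.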
4. Identifying themes
● Grouping categorized information into broader concepts.
● For example, go back to the manufacturer mentioned before, which was
reviewing data on shop floor employees.
● First, the employees are grouped by type and task. The data analyst can
then take those categories and group them into the broader concepts of low
productivity and high productivity.
● This makes it possible for the business to see who is most productive
and least productive, in order to reward top performers and provide additional
support to those workers who need more training.
5. Finding patterns
● Using historical data to find out what happened in the past and is therefore
likely to happen again.
● For example, e-commerce companies use data to find patterns all the time.
Data scientists look at transaction data to understand customer buying habits
at certain points in time throughout the year.
● They may find that customers buy more canned food just before a hurricane
or purchase fewer cold weather accessories like hats and gloves during the
warmer months.
● The e-commerce companies can use these insights to make sure they stock
the right amount of products at these key times.
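
A seasonal pattern of this kind can be surfaced by totalling units sold per calendar month across several years. The transactions and the name `units_by_month` below are hypothetical. Dates are assumed to be ISO-formatted strings (YYYY-MM-DD).

```python
from collections import defaultdict

# Hypothetical transactions: (date, product, units sold).
transactions = [
    ("2022-01-15", "gloves", 40),
    ("2022-07-10", "gloves", 5),
    ("2023-01-20", "gloves", 55),
    ("2023-07-05", "gloves", 8),
    ("2022-08-28", "canned food", 90),
    ("2023-03-02", "canned food", 30),
]


def units_by_month(transactions, product):
    """Total units sold per calendar month (1-12) for one product."""
    totals = defaultdict(int)
    for date, item, units in transactions:
        if item == product:
            month = int(date[5:7])  # characters 5-6 hold the month
            totals[month] += units
    return dict(totals)


gloves = units_by_month(transactions, "gloves")
peak_month = max(gloves, key=gloves.get)
print(f"Gloves sell best in month {peak_month}: {gloves}")
```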
Stage 2: Data Collection
● Data collection is the process of accumulating data that’s required to solve a
problem statement or business need.

● It includes acquiring, collecting, extracting, and storing large volumes of
data, which may be structured or unstructured (for example text, video,
audio, XML files, records, or image files), for use in later stages of data
analysis.
A. Determine what type of data is needed
● The next step is to consider what type of data you must collect.
● Is it quantitative or qualitative?
● Accessing and processing quantitative data is easier because it involves
raw numbers and digits. On the other hand, processing qualitative data,
such as customer reviews or feedback, is more complex.
● Who is the target audience?
● What are the data collection sources?
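
The difference in processing effort between the two kinds of data can be seen in a small sketch. The order values, the reviews, and the `word_counts` helper are hypothetical. Counting words is only a crude first step toward analysing qualitative text.

```python
from collections import Counter
from statistics import mean

# Quantitative data: numbers can be summarised directly.
order_values = [20.0, 35.0, 50.0]
average_order = mean(order_values)

# Qualitative data: free text needs cleaning before it can be counted.
reviews = [
    "Fast delivery, great price",
    "Great quality but slow delivery",
]


def word_counts(texts):
    """Lowercase each text, strip basic punctuation, and count the words."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            counts[word.strip(".,!?")] += 1
    return counts


counts = word_counts(reviews)
print(f"Average order value: {average_order}")
print(f"Most common review words: {counts.most_common(2)}")
```

Even after counting, the text still has to be interpreted ("fast delivery" versus "slow delivery"), which is part of why qualitative data is harder to process.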
Data Collection Sources
Data is divided, mainly according to its source, into two types:
1. Primary data
2. Secondary data
1. Primary data
● Data that is raw, original, and extracted directly from official sources is
known as primary data. It is collected directly through techniques such as
questionnaires, interviews, and surveys.
● The data collected must match the demands and requirements of the
target audience on which the analysis is performed.
A few methods of collecting primary data:

1. Interview method
2. Survey method
3. Observation method
4. Experimental method
2. Secondary data
● Secondary data is data that has already been collected and is reused
for another valid purpose. It is derived from previously recorded primary
data and comes from two types of sources:

1. Internal source
2. External source
1. Internal source
● This type of data can easily be found within the organization, for example
market records, sales records, transactions, customer data, and accounting
resources. Obtaining data from internal sources costs less and takes less
time.
2. External source
● Data that cannot be found within the organization and is obtained
through external third-party resources is external source data.

● The cost and time required are higher because external sources contain a
huge amount of data.

● Examples of external sources are government publications, news
publications, the Registrar General of India, the Planning Commission, the
international labor bureau, syndicate services, and other non-governmental
publications.
Other resources

● Sensor data: With the advancement of IoT devices, the sensors in these devices collect data that
can be used for sensor data analytics to track the performance and usage of products.
● Satellite data: Satellites collect terabytes of images and data daily through
surveillance cameras, which can be used to extract useful information.
● Web traffic: Thanks to fast and cheap internet access, data in many formats that users upload to
different platforms can be collected, with their permission, for data analysis. Search
engines also provide data on the most frequently searched keywords and queries.
B. Decide on your data sources

● Once you have an idea about what data you need, start looking into whether
the data is within your organization or if you'll require third-party or external
data.
● In most cases, the smart thing to do is to acquire external data. This
acquisition will keep you on par with your competitors, who will probably also
invest in third-party data.
● You must be willing to buy data and keep your legal team close.
Collect your data
● To effectively collect data, devise a plan that addresses all the questions relevant to
securely collecting data.
● If you're collecting data from a third party or a stakeholder, make sure all requirements and
privacy issues are considered.
● Additionally, create a plan for how you will store the data. Make sure your organization has
the right tools and infrastructure to manage and process the data.
● You also need to establish a systematic approach for storing all the different types of data
so that you can later combine and further process them.
● For example, storing transactional data can be relatively easy since there are many tools
that arrange such data in a tabular format. On the other hand, unstructured data can be
relatively difficult to manage and store due to its loose format.
● Therefore, you must devise a plan to collect your data and make the processing simpler.
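
The contrast between tabular and loosely formatted data can be sketched with Python's built-in `sqlite3` and `json` modules. The table names, columns, and sample records are hypothetical. Storing unstructured content as a JSON text column is only one possible approach, chosen here to keep the example small.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Transactional data fits naturally into fixed, typed columns.
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions (customer, amount) VALUES (?, ?)",
    [("c1", 19.99), ("c2", 5.50)],
)

# Unstructured data has no fixed shape, so it is kept here as a JSON blob.
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, source TEXT, body TEXT)"
)
review = {"text": "Arrived late but works well", "tags": ["delivery", "quality"]}
conn.execute(
    "INSERT INTO documents (source, body) VALUES (?, ?)",
    ("review", json.dumps(review)),
)

# Tabular data can be aggregated directly in SQL.
total = conn.execute("SELECT SUM(amount) FROM transactions").fetchone()[0]

# The blob must be loaded and parsed before it can be analysed.
row = conn.execute(
    "SELECT body FROM documents WHERE source = ?", ("review",)
).fetchone()
stored_review = json.loads(row[0])

print(f"Total sales: {total:.2f}")
print(f"Review tags: {stored_review['tags']}")
```

Keeping a shared key (such as a customer identifier) in both stores is one way to make the two kinds of data easier to combine later.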
C. Create a timeline
● Now it's time to identify the time frame within which the data is most useful.
● For example, do you need end-to-end data about how a customer lands on an
e-commerce website? Or do you need relevant parts about the user's search
history, geography, and background?
● Identifying the timeline is key to getting the exact type of data you need to
solve your problem statement.
● A potential lead may generate data at different stages, and it's your job to
effectively evaluate which data is most relevant.
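
Once a time frame has been chosen, restricting the data to it is a simple filter. The events, the 30-day window, and the name `within_window` are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical events generated by one lead at different stages.
events = [
    {"stage": "ad_click", "at": datetime(2024, 1, 2)},
    {"stage": "product_view", "at": datetime(2024, 3, 10)},
    {"stage": "checkout", "at": datetime(2024, 3, 12)},
]


def within_window(events, end, days):
    """Keep only the events from the `days` days leading up to `end`."""
    start = end - timedelta(days=days)
    return [event for event in events if start <= event["at"] <= end]


recent = within_window(events, end=datetime(2024, 3, 15), days=30)
print([event["stage"] for event in recent])
```

Widening or narrowing `days` changes which stages of the customer journey end up in the analysis, which is why the time frame should be decided before collection starts.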

D. Interpreting the data correctly
● The goal of all data analysis and collection is to use data to draw accurate
conclusions and make good recommendations. That all starts with having
complete, correct, and relevant data.
● But keep in mind, it is possible to have solid data and still make the wrong
choices. It is up to data analysts to interpret the data accurately. When data
is interpreted incorrectly, it can lead to huge losses.
For example: Coke launch failure
● In 1985, New Coke was launched, replacing the classic Coke formula. The company
had done taste tests with 200,000 people and found that test subjects preferred the
taste of New Coke over Pepsi, which had become a tough competitor. Based on this
data alone, classic Coke was taken off the market and replaced with New Coke. This
was seen as the solution to take back the market share that had been lost to Pepsi.
● But as it turns out, New Coke was a massive flop and the company ended up losing
tens of millions of dollars. How could this have happened with data that seemed
correct? It is because the data wasn’t complete, which made it inaccurate. The data
didn't consider how customers would feel about New Coke replacing classic Coke.
The company’s decision to retire classic Coke was a data-driven decision based on
incomplete data.
When data is used strategically, businesses can transform and grow their
revenue.
● In Palestine, and around the world, during the coronavirus pandemic the
majority of stores considered non-essential were ordered to close to help slow
down the spread of the virus.
● Using insights, predictions, and internal data, many clothing stores and
other businesses that had physical shops and had never sold their products
online adapted to the new market during lockdown and focused their marketing
and strategy on online shopping. Some even saw an increase in their sales and
profits.
Ethical issues relevant to data collection and data privacy.
● In the 2010s, personal data belonging to millions of Facebook users was
collected without their consent by British consulting firm Cambridge Analytica,
predominantly to be used for political advertising.
● Data such as public profiles, page likes, birthdays, and current city was used
by the consulting firm on behalf of U.S. presidential candidates Ted Cruz and
Donald Trump to advertise their election campaigns. For a given political
campaign, each profile's information suggested what type of advertisement
would be most effective in persuading a particular person in a particular
location for some political event.
