What Is Data Science

What is Data Science?

By Ds. Jackson Marube


What is Data Science?
 Data science is the practice of using data to understand and solve real-world
problems. The practice itself is old: people have been analyzing sales figures and
trends since the invention of zero.
 While the fundamental concept is not novel, the past decade has witnessed an
unprecedented surge in available data, largely facilitated by advancements in
computing technology. Computers play a dual role, both generating vast amounts of
data and serving as the essential tool for processing this information.
 In the realm of data science, computer code becomes a pivotal instrument, enabling
data scientists to manipulate, aggregate, conduct statistical analyses, and train machine
learning models. The outcomes of this coding effort can take various forms, ranging
from reports or dashboards designed for human consumption to machine learning
models designed for continuous deployment.
What is Data Science? continuation
 In the scenario where a retail company faces challenges in determining the optimal location for a new
store, they might engage the expertise of a data scientist for a comprehensive analysis.
 The data scientist would delve into historical data pertaining to locations where online orders are
dispatched, aiming to discern patterns in customer demand.
 Additionally, the data scientist could integrate this customer location data with demographic and
income details sourced from census records for those specific areas.
 By combining these datasets, the data scientist aims to identify the most suitable location for the new
store.
 Subsequently, they would compile their findings into a Microsoft PowerPoint presentation, intending
to deliver a well-structured recommendation to the vice president of retail operations within the
company.
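
As a rough sketch of the step where the datasets are combined, the example below joins order counts with census figures by area and ranks the areas. The area names, the numbers, and the "orders per 1,000 residents" criterion are all hypothetical, chosen only for illustration; a real analysis would weigh many more factors.

```python
# Hypothetical toy data: every name and number here is made up for illustration.
orders_by_area = {"North": 1200, "East": 450, "South": 980}  # online orders shipped per area
census_by_area = {
    "North": {"population": 50_000, "median_income": 62_000},
    "South": {"population": 20_000, "median_income": 71_000},
    "West": {"population": 35_000, "median_income": 55_000},
}


def combine(orders, census):
    """Join the two datasets on area name, keeping only areas present in both."""
    combined = {}
    for area in orders.keys() & census.keys():
        row = dict(census[area])
        row["orders"] = orders[area]
        # Demand relative to population: one possible criterion, not the only one.
        row["orders_per_1000"] = 1000 * orders[area] / row["population"]
        combined[area] = row
    return combined


combined = combine(orders_by_area, census_by_area)
ranked = sorted(combined, key=lambda a: combined[a]["orders_per_1000"], reverse=True)
print(ranked)  # ['South', 'North']
```

Areas that appear in only one dataset ("East" and "West" here) drop out of the join, which is itself worth checking before making a recommendation.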
Essentials of Data Science: Unveiling the Core Components

Drawing upon my expertise, the path to becoming a proficient
data scientist hinges on three fundamental pillars. These
fundamentals align closely with Conway’s original Venn diagram,
encompassing:

Mathematics/Statistics
Databases/Programming
Business understanding
Component: Mathematics/statistics
At its core, data literacy involves a foundation in mathematics and statistics. This proficiency can be
dissected into three key levels:

Awareness of Techniques:
 Recognizing the existence of various methods is crucial, as not knowing what's possible hinders effective
utilization. For example, in grouping similar customers, a data scientist must first understand that
statistical methods, like clustering, can be employed.

Application Proficiency:
 Beyond awareness, proficiency lies in understanding the intricacies of applying techniques. This extends
beyond coding skills to encompass configuration complexities. For instance, using k-means clustering
involves not only knowing how to implement it in code but also understanding how to configure
parameters for optimal results.
 Running a cluster analysis in a language like R or Python takes programming proficiency. One must also
know how to tune the method's parameters, such as the number of groups to create.
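
To make the point about configuring k-means concrete, here is a minimal sketch in plain Python. The customer spend values are hypothetical, the function name kmeans_1d is illustrative, and the evenly spaced starting centers are an assumption made so the example is reproducible (libraries typically use randomized starts). It shows how the within-cluster sum of squares changes as the number of groups k is varied.

```python
def kmeans_1d(values, k, iterations=20):
    """Minimal 1-D k-means (Lloyd's algorithm) with a deterministic start."""
    data = sorted(values)
    # Assumption: start the centers at evenly spaced positions in the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    # Within-cluster sum of squares: squared distance of each point to its nearest center.
    inertia = sum(min((v - c) ** 2 for c in centers) for v in data)
    return centers, inertia


# Hypothetical yearly spend per customer.
spend = [20, 22, 25, 27, 100, 105, 110, 300, 310, 320]

for k in range(1, 5):
    centers, inertia = kmeans_1d(spend, k)
    print(k, [round(c, 1) for c in centers], round(inertia, 1))
```

On this toy data the sum of squares falls sharply up to k = 3 and only slightly after that, which is one common heuristic (the "elbow") for choosing the number of groups.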
Mathematics/statistics continuation
Choosing Which Techniques to Try:
Selecting the appropriate techniques is crucial in data science due
to the multitude of available options. It is imperative for data
scientists to swiftly evaluate the efficacy of a technique. In the
context of our customer grouping scenario, even when
concentrating on clustering, the data scientist must navigate
through numerous methods and algorithms. Instead of attempting
each one, the ability to swiftly eliminate unsuitable methods and
concentrate on a select few is essential.
A Mathematical Perspective on E-commerce Analysis

 In a data science role, continual application of mathematical skills is essential.
Consider an e-commerce scenario: determining which countries have the highest
average order value.
 The raw averages are easy to compute from available data, but they need a closer
look. For instance, with one $100 order in country A and a thousand $75 orders in
country B, country A has the higher average order value ($100 versus $75).
 However, a single order may be an outlier, so it gives little basis for advising your
business partner to invest in advertising in country A. If country A had 500 orders
instead, a statistical test could assess whether the two countries' order values
differ significantly, and so whether the result is strong enough to act on.
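
One way to run such a comparison is Welch's t-test, sketched below with the standard library only. The order values are simulated (hypothetical), the choice of Welch's test is an assumption rather than something the scenario prescribes, and the p-value uses a normal approximation that is reasonable only because both samples are large.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Squared standard error of each sample mean.
    se_a = statistics.variance(sample_a) / len(sample_a)
    se_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Simulated, hypothetical order values: 500 orders in country A, 1,000 in country B.
rng = random.Random(0)
orders_a = [rng.gauss(100, 30) for _ in range(500)]
orders_b = [rng.gauss(75, 30) for _ in range(1000)]

t_stat, df = welch_t(orders_a, orders_b)
# Normal approximation to the t distribution (an assumption valid for large samples).
p_value = 2 * (1 - statistics.NormalDist().cdf(abs(t_stat)))
print(f"t = {t_stat:.2f}, df = {df:.0f}, p = {p_value:.3g}")
```

With a single order in country A the test cannot be run at all: statistics.variance needs at least two values, which mirrors the point that one data point supports no conclusion.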
Component: Databases/programming
 The skills related to programming and databases involve the capability to
extract data from company databases and to write clean, efficient, and
maintainable code. Unlike much software development, where the output is
defined in advance, a data scientist's code must also support open-ended
analysis.
 While the technical skills required for a data scientist may vary based on
each company's unique data stack, a fundamental understanding of
acquiring data from databases, along with expertise in cleaning,
manipulating, summarizing, visualizing, and sharing data, is essential.
Databases/programming continuation
 In the majority of data science roles, R or Python serves as the primary programming language.
 R, rooted in statistics, excels in statistical analysis, modeling, visualization, and report generation.
 Python, initially a general software development language, has gained immense popularity in data science.
 Python's traditional strengths are working with large datasets, machine learning, and powering real-time
algorithms of the kind behind Amazon’s recommendation engines.
 Ongoing contributions have brought R and Python to near parity in capabilities.
 Data scientists now use R to run machine learning models millions of times a week, and they perform clean
statistical analyses in Python.
 R and Python dominate the data science landscape due to their free and open-source nature, allowing widespread
contributions and the availability of extensive packages or libraries for data-related tasks.
 Their large and active communities make it easy for data scientists to find assistance when encountering
challenges.
 While some companies still use paid programs like SAS, SPSS, Stata, and MATLAB, a growing number are
transitioning to R or Python for data science tasks.
Databases/programming continuation
 Primary data science analysis is typically conducted using R or Python; however, interacting with databases for
data retrieval is common.
 SQL, the standard query language for relational databases, is crucial for data manipulation and extraction in
data science.
 Example scenario: A data scientist analyzing customer order records might use SQL to retrieve daily order counts
before running statistical forecasts in R or Python.
 SQL proficiency is highly valued in the data science community due to its essential role in data manipulation and
extraction from databases.
 Another essential skill is version control, crucial for tracking code changes over time.
 Version control enables storing files, reverting to previous versions, and tracking changes made by individuals,
providing crucial oversight for data science and software engineering.
 Git, widely used for version control, is often paired with GitHub, a web-based hosting service. Git allows saving
changes, viewing project history, and merging work (resolving any conflicts) when multiple individuals edit the same file.
 Proficiency in Git is particularly important for sharing code or deploying solutions in companies, especially those
with strong engineering teams.
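
The SQL scenario above (retrieving daily order counts before forecasting) can be sketched with Python's built-in sqlite3 module. The orders table, its columns, and its rows are hypothetical; a company database would have its own schema and would usually be reached through a different driver.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [
        ("2024-01-01", 20.0), ("2024-01-01", 35.5), ("2024-01-02", 12.0),
        ("2024-01-03", 80.0), ("2024-01-03", 15.0), ("2024-01-03", 42.0),
    ],
)

# Count orders per day; this is the series a forecast would be fitted to.
daily_counts = conn.execute(
    "SELECT order_date, COUNT(*) AS n_orders "
    "FROM orders GROUP BY order_date ORDER BY order_date"
).fetchall()
conn.close()

print(daily_counts)  # [('2024-01-01', 2), ('2024-01-02', 1), ('2024-01-03', 3)]
```

Doing the aggregation in SQL means only the small daily summary, not every order row, has to be pulled into R or Python.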
Quick Trivia: Can you be a data scientist without programming?

 Excel, Tableau, and other business intelligence tools with graphical interfaces can handle substantial
data work without coding.

 Although they require no code, these tools claim functionality comparable to R or Python, and data
scientists do occasionally use them.

 However, these tools are not considered a comprehensive data science toolkit.

 In reality, few companies operate data science teams that entirely avoid programming.

 Programming offers distinct advantages over reliance on graphical interface tools, irrespective of
team structures.
Advantages of Programming in Data Science
Reproducibility:

 Writing code enables the ability to rerun it whenever data changes, ensuring consistent results over time.
 Code pairs well with version control: a single file carries its full history, so there is no need to keep renaming copies.

Flexibility:

 Programming offers greater flexibility compared to point-and-click tools like Tableau.


 If a specific graph type is unavailable, coding allows the creation of custom solutions beyond the tool's limitations.

Community Contribution in Open Source Languages (Python and R):


 Open-source languages like Python and R benefit from community contributions.
 Thousands of individuals create and openly share packages on platforms like GitHub, CRAN (for R), and PyPI (for Python, installed with pip).
 Users can freely download and utilize this code for diverse problem-solving, reducing dependency on a single company or group for
feature updates.
Business understanding
Varying Business Understanding of Data Science:
 Businesses exhibit diverse levels of comprehension regarding data science processes.
 Management often seeks results without a detailed understanding, relying on data science experts.

Core Skill in Data Science:

 Essential skill: Translating business situations into data questions, finding data answers, and delivering actionable insights.
 Example: No ready-made tool answers a question like "Why are customers leaving?"; the data scientist must work out which analysis will answer it.

Practical Application in Business Understanding:


 Business understanding aligns data science ideals with real-world practicalities.
 It includes knowing how the company's data is stored and updated, along with intricacies specific to the business, for example in subscription services.

Question Formulation and Stakeholder Engagement:


 Knowing what questions to ask is a product of business understanding.
 Stakeholder queries like "What should we do next?" necessitate further clarification.
 Developing an understanding of the core business aids in parsing situations and tailoring responses to various stakeholders.
Conclusion
 Summarize Key Insights: Highlight the main takeaways from today's presentation.
 Actionable Recommendations: Emphasize actionable recommendations derived from the data analysis.
 Visualizations Recap: Reinforce key findings using impactful visualizations.
 Follow-Up Opportunities: Encourage further inquiries, questions, or contributions.
 Contact Information: Provide contact details for inquiries via email or LinkedIn.
 Thank you for your engagement, and feel free to explore more Data Science topics in my other postings. Looking
forward to future interactions and discussions.

 For further inquiries, questions, or contributions, reach me at:

 Email: jacksonmarubee@gmail.com
 LinkedIn: Jackson Marube
 Hope the insights were valuable to you.
