Generative AI for Data Analytics
1. welcome
2. 1 Introduction to the use of Generative AI in Data Analytics
3. 2 Using Generative AI to ensure sufficient Data Quality
4. 3 Descriptive analysis and statistical inference supported by Generative AI
5. 4 Utilizing Generative AI for result interpretation and formulating recommendations
6. 5 Basic Text Mining using Generative AI
7. 6 Advanced Text Mining with Generative AI
8. 7 Scaling and Performance Optimization
9. 8 Risk, Mitigation, and Trade-offs
10. Appendix A Specifying multiple DataFrames to ChatGPT v4
11. Appendix B On debugging ChatGPT’s code
12. Appendix C On laziness and human errors
welcome
Thank you for purchasing the Manning Early Access Program (MEAP)
edition of our upcoming book, Generative AI for Data Analytics. We are
excited to share this comprehensive guide that will help you harness the
power of state-of-the-art language models like ChatGPT for data analysis.
As you progress through the chapters, you will gain valuable insights into the
best practices for using GPT models in data analytics, as well as the risks
associated with these cutting-edge tools. You will also look into topics such
as data quality, result uncertainty, bias and ethics. We’ll show you where the
models can excel but also where they are still lacking. This will all be
illustrated by transcripts of conversations with the models. We will guide you
on how to use the models to their full potential and how to recognize,
manage, and mitigate risks, ensuring that you can confidently and responsibly
make the most of GPT-driven data analytics.
This book is tailored for current and future data analysts who wish to take full
advantage of available tools while acknowledging their limitations and
appropriateness for specific tasks. By the end of your journey, you will be
equipped with the knowledge and skills needed to effectively perform data
analytics in a modern data environment, utilizing GPT models such as
ChatGPT to make your work faster and more thorough.
We hope that you find this book insightful, engaging, and valuable as you
explore the exciting opportunities that GPT-driven data analytics can offer.
We look forward to your feedback and contributions in the liveBook
discussion forum as we work together to create a comprehensive guide to this
cutting-edge field. Your input is invaluable, and we are eager to create a
resource that exceeds your expectations.
Once again, thank you for joining us on this exciting journey, and we look
forward to hearing your thoughts on Generative AI for Data Analytics.
In this book
This book aims to show you how to utilize Generative AI to improve multi-
faceted business activities such as Data Analytics. We will show you how to
extract knowledge stored in the depths of neural networks and not fall victim
to risks inherent to this technology. To excel in both tasks, you need to have
in the back of your head what drives the responses you get to your prompts.
Therefore, this chapter will provide a brief overview of Generative AI
models, their underlying technology, and their main limitations. The
point is not to give the readers encyclopedic knowledge of the technology but
a deep enough understanding to demystify it and allow a more critical
interpretation of its abilities.
NOTE:
This book will not be a traditional programming book. You will not be able
to use it like you would a script. You are going to engage in a dialogue with
Large Language Models, and like any conversation, the words and direction
will change depending on the model, as well as the context that came before.
The output that you receive will very likely differ from what is printed in this
book. This should not discourage you. Instead, you should explore. The
journey is as rewarding as the destination. You might find yourself frustrated
when the models cannot follow along. Have patience. If you are disciplined (and
somewhat adventurous), you can get GPT to cooperate with the general
themes and aim of this book: learning how to use generative AI to make you
a better programmer.
While the temptation is strong to jump right into the advanced applications,
let's begin by illuminating the fundamental characteristics of Generative AIs.
These features often result in limitations that a user should keep in mind in
order to utilize the model appropriately.
Input and output limits: While using Generative AIs, you should be aware
of the size limit for prompts and responses, which depends on the token limit
of the specific model version. For GPT-4, the latest model available at the
time of writing, the token limit is 8000 tokens. This limit covers the
combined number of both input (prompt) and output (response) tokens. A
token can be as short as one character or as long as one word, and it’s usually
unclear to users how many tokens they supply in the prompt or expect in the
response. However, a good rule of thumb is to expect this limit to be roughly
equivalent to around 800 to 1000 words. While for most simple prompts, this
seems plenty, most users quickly find this limiting once they try to feed the
model with a reasonable amount of data or they expect a decent size output.
One way around it is to use the model’s limited memory to build up the
context across multiple queries. However, bear in mind that the context
memory of the model is also restricted, albeit much less visibly, and the
model may forget past prompts and responses after a few rounds of dialogue.
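If you want to estimate how many tokens a prompt will consume before you send it, one option is OpenAI's open-source tiktoken tokenizer. The snippet below is our own minimal sketch; it assumes the tiktoken package is installed separately and is not part of ChatGPT itself:

import tiktoken

# Load the tokenizer used by GPT-4-class models.
encoding = tiktoken.encoding_for_model("gpt-4")

prompt = "Summarize the main data quality issues in the table below."
token_count = len(encoding.encode(prompt))

# Compare against the combined prompt + response budget.
print(f"This prompt uses {token_count} tokens of the model's combined limit.")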
Verbosity: When you try some prompts, it will also quickly become clear to
you that LLM models may sometimes generate overly verbose responses or
overuse certain phrases. This is visible, for example in ChatGPT v4, in its
tendency to close each broader response with an “In conclusion” paragraph,
rephrasing the key points. This verbosity can be attributed to biases or
patterns in the training data, where longer responses, or responses of a certain
structure, might be more common.
Egg it on!
The prompt/response size limit and the verbosity can often lead to incomplete
or cut-off responses. One possibility, when designing a conversation with the
Generative AI, is to ensure that the combined length of the prompt and
expected response doesn't exceed the token limit.
Also, when faced with ambiguous questions, Generative AI will still generate
a response. The model tends to guess the user's intent, as it’s currently unable
to ask for more context or specifics, which can lead to answers that may not
be directly relevant to the intended query.
Text-only: At the time of writing this book, Generative AIs do not possess
visual understanding or communication capabilities, as they are based on
text-based models. They cannot process or interpret images, videos, or any
other non-textual information. Sometimes you can describe images in the
prompt, but this quickly becomes limiting in practice. Neither can they
present results in these formats, except in very rudimentary cases, where they
attempt to “draw” a graph using ASCII art. Fortunately, for the topic of data
analysis, this will not be a major limitation most of the time. Firstly, the data
being analyzed is usually already digitized. Secondly, the model will be
mostly used to generate code or provide support and explanation, so even if
visual data, such as images or movies, is to be analyzed, it won’t be done by
the model per se but by running the code generated by it.
Inherent sycophancy: The larger the model at the base of a Generative AI, the
more likely it is to favor agreeability over reliability. If confronted or
questioned about the provided answer, it’s likely to apologize and present the
point of view contradicting its previous statement even if it was correct the
first time; truth be damned!
This ability to generate and explain code will be the most useful feature for us
throughout the book but also comes with the biggest warning, which will be
repeated in many places, and covered in depth in Chapter 7. Specifically,
never trust the model to spit out fully correct answers or perfectly working
code. The higher the importance or risk of your project, the more
scrupulously you should verify any output through review and testing. In
subsequent chapters, you’ll find examples of model-generated code which
either does not work as expected or has incorrect explanations attached to it.
Caveat emptor!
Finally, once the analysis is performed, language models can help interpret
the results by generating natural language summaries and explanations. This
feature enables users to understand complex analytical results more easily
and communicate their findings to a broader audience.
Never take the first answer from the model to be the perfect one.
Evaluate, refine, and test.
You can never blame the model for any shortcomings or failures in data
analysis. The model does not understand, and it’s the user’s sole
responsibility to approach the problem, the data, and the results
critically.
Even if you forget everything else that’s in this book, please apply the two
points above to your data analysis practice.
The unblocker
The scale of Generative AIs’ abilities is only starting to be explored.
However, it’s already clear that they can be taught to respond consistently
and relevantly on a vast number of topics. They have the ability to drill down
into details, summarize, explain, and associate related concepts to an
extraordinary degree. These abilities can be used to effectively unblock your
own thinking and get you out of your rut. You no longer have to trawl
through dozens of random articles trying to find inspiration or pointers. Just
ask a question. Even if the answer is not perfect, it may point you to concepts
you haven’t thought of before. Use this to expand your horizons.
While Generative AIs can generate human-like text based on the context
provided, they may struggle to comprehend and process highly specialized
domain knowledge. Consequently, their ability to accurately generate insights
or recommendations in the context of specific industries or niche subjects
may be limited. This can be overcome by fine-tuning models for specific
domains, but such refinements are frequently prohibitively costly and time-
consuming unless effects of scale can be achieved. One common way to
overcome this limitation is for users to provide the model with more general
prompts and then refine the answer based on their own specialist knowledge.
Data analytics often involves working with databases, APIs, or other data
sources to extract, clean, and process data. Generative AIs lack the built-in
capability to interact directly with these sources. While it is possible to
integrate Generative AIs with custom-built solutions to bridge this gap, doing
so can be resource-intensive and challenging to implement effectively. As in
the previous cases, the model can still be effectively used to guide the
analysis and provide solutions or even whole swathes of code which will be
executed by the users independently of the model.
The sections below will cover some more advanced ways of accessing the
models through API or third-party integrations and are included for
completeness. However, in practice, we find that the majority of useful
interactions can be achieved through the web interface, and it’s also the
fastest and cheapest option to use.
If you’re eager to get started, you can skip the remaining few sections and get
right into it in the next chapter. Later, when you want to play with accessing
the API or integrating ChatGPT into an automated workflow, you can come
back to cover this.
As stated in the previous section, you'll need to have the OpenAI API set up
first. You can follow the instructions to install the library and set up an API
key on https://beta.openai.com/docs/
Once you have the OpenAI API set up, you can use the following simple
Python code to interact with ChatGPT:
Listing 1.1
import openai

def chat_with_gpt(prompt):
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"{prompt}",
        max_tokens=150,
        n=1,
        stop=None,
        temperature=0.5,
    )
    message = response.choices[0].text.strip()
    return message

# Example usage:
prompt = "What is the capital city of France?"
response = chat_with_gpt(prompt)
print(response)
The function names, the variables and parameters, and the code itself are
written in bold.
Comment lines start with the # sign and are written in bold italics. These are
not interpreted by Python at all. They are usually used to explain what the
following piece of code does. Sometimes, when more than one line of
comment is required or the comment contains quotation marks itself, we put
the comment text between triplets of quotation signs.
Those of you new to Python should note a critical property of the Python
language: indentation. It’s a method of telling Python that we are entering
another nesting level, e.g., when we iterate over a collection. Indented code
is executed for each element provided by the enclosing, less-indented code.
If another loop exists within this code, it is indented further still. Failing to
properly indent code will lead to either an error at code execution or, worse,
to incorrect results.
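To make this concrete, here is a tiny illustrative snippet of our own (not model-generated):

numbers = [1, 2, 3]
for n in numbers:
    # Indented once: executed for every element of the list.
    squared = n * n
    if squared > 4:
        # Indented twice: executed only when the enclosing condition holds.
        print(f"{n} squared is {squared}")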
You can adjust the parameters in Listing 1.1 (such as max_tokens and
temperature) based on your specific use case to get the desired output.
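For completeness, the same library also exposes a chat-style endpoint that serves the conversational models behind ChatGPT. The following is a hedged sketch of our own, assuming the pre-1.0 version of the openai package used in Listing 1.1; model availability and the exact response fields may differ between library versions and accounts:

import openai

def chat_with_gpt_chat_api(prompt):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # or "gpt-4" if your account has access
        messages=[{"role": "user", "content": prompt}],
        max_tokens=150,
        temperature=0.5,
    )
    # Extract the assistant's reply from the first choice.
    return response.choices[0].message["content"].strip()

# Example usage:
print(chat_with_gpt_chat_api("What is the capital city of France?"))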
For example, ChatGPT has been integrated with the following tools:
GitHub Copilot has had a very big impact on the whole programming world.
It’s a cloud-based AI pair programmer that offers autocomplete-style
suggestions as you code. It is powered by OpenAI’s Codex language model, a
descendant of GPT-3 trained on a massive dataset of open-source code.
GitHub Copilot can suggest entire functions, classes, and even whole
applications. It can also suggest documentation, unit tests, and other
code-related artifacts.
GitHub Copilot is available for Visual Studio Code, Visual Studio, Neovim,
and JetBrains IDEs. It is free to use for verified students, teachers, and
maintainers of popular open-source projects. For other users, it is available
as a paid subscription.
Here are some other tools that use Generative AI to assist with coding:
These tools can all be helpful for speeding up the coding process and
reducing errors. However, it is important to use them with caution and to
always double-check the suggestions before using them.
By following these best practices and tips, you can successfully integrate
Generative AI into your data analytics workflows and harness their full
potential to enhance your decision-making, automate tasks, and uncover
valuable insights.
Hopefully, after this introduction, Generative AI will no longer appear to
be a mysterious and possibly useless invention. Subsequent chapters will
demonstrate specific exchanges between a human and a Generative AI, using
the responses in all aspects of data analytical work while commenting on the
shortcomings and pitfalls that need to be looked out for to make this
cooperation between humans and AI as painless and productive as possible.
1.4 Summary
Generative AI and derivative tools have taken great strides in recent
years and can be used as invaluable support in many fields, including
data analytics.
Despite the progress, these tools won’t (yet!) replace a competent data
analyst, and there are many limitations that users should be aware of.
At the same time, users should not be afraid to take full advantage of the
immense possibilities of supporting their data analytical work with the
language models, which can be done safely by just following a few
common-sense guidelines.
The easiest way to access Generative AIs is via their web interfaces,
although the API and SDKs can be used in more advanced applications.
2 Using Generative AI to ensure
sufficient Data Quality
This chapter covers
Technical aspects of cleaning the data
Data content quality evaluation
Dealing with data errors
Investigating unclear data
On the other hand, if you threw the coin multiple times and each time you got
your expected tails, maybe the coin is not random. Maybe the quality of your
data is sufficient for your purpose. Unfortunately, despite some marketing
claims, there is no one-size-fits-all solution ensuring data quality. The most
effective way to prepare for analysis is to closely examine the data and
carefully consider the meaning of each variable within the context of your
specific business. However, you are not helpless. While there may not be
methods that fully automate this task, there are techniques and protocols that
can help you approach data evaluation and cleaning in a structured and
systematic manner.
In short, you want to be sure that your data adequately describes the slice of
reality you’re analyzing (points 1-4) and that its structure allows for
reliable analysis (points 5-7).
If it looks like a lot of work, it’s because it is. Ensuring data quality can be a
daunting task due to the sheer amount of knowledge required for a
practitioner to adequately refine their data, as well as the number of steps
required to ensure that you will be analyzing data relevant for the analyzed
process. There is no silver bullet, and cutting corners can lead to unreliable
data, but there is hope: a good structure can lead you through this process in a
finite time. A well-educated GPT advisor can help you here.
Data quality is not a space where you should allow yourself to skate on thin
ice. You just need to get used to doing things systematically. A once-true but
long-since-falsified statement holds that 80% of data work is related to data
cleaning and preparation. Nowadays, data people can rarely afford the luxury
of such solid foundation building, due to a very unpleasant feedback loop.
Business does not see the value of analysis, so it demands more of it. The
more they demand, the higher the time pressure. The higher the time
pressure, the more corners are cut (usually in areas that do not make it into
final reports and are thus invisible, or useless, from the perspective of the
business sponsors). The more corners cut, the less reliable the analysis. The
less reliable the analytics, the lower the value of said analytics. The lower the
value of analytics...
We would ask that you take the time to read over the transcript below and,
likewise, the many others that will appear in this book, despite a possible
temptation to skip over them. As we will show throughout this book and
summarize in Chapter 6, working with GPT models requires prudence. It
means carefully reading the answers; you will discover both real gems as
well as red herrings there. Distinguishing between the two requires attention
to detail.
Human: I want to perform data analysis. I imported my data into pandas DataF
ChatGPT: After importing your data into pandas DataFrames, the first thing y
df.head()
df.shape
df.info()
df.describe()
df.isna().sum()
df['column_name'].value_counts()
Once you have a better understanding of your data, you can proceed with data
ChatGPT off-handedly offered a critical bit of information in its first
sentence. We need to understand our data. It sounds obvious, but
unfortunately, omissions in this regard are hardly uncommon. Such
fundamental mistakes cannot be redeemed by any amount of analysis,
machine learning, or AI applied at later stages; if you start on shaky
foundations, the entire building will be compromised.
If you work in business and are not absolutely, positively sure about the exact
meaning of the variables received from your IT department, don’t assume.
Ask! We once analyzed the efficiency of a service sending tens of millions of
messages to the clients of a multinational company. We discovered a rare
anomaly where the ratio of opened-to-sent messages was higher than one.
While trying to understand what was actually defined as a sent message and
what was actually defined as an opened message, we discovered an error in
the logging system so deep that all the previous analytics on the efficiency of
messaging were worthless! We cannot overemphasize the importance of
close collaboration between the business and the data or IT crowds!
A careful read of the rest of the model’s answer may reveal some less
impressive components. Some steps seem redundant. For example, the
df.isna().sum() method does not offer any information not previously
provided by df.info() (which, in fact, also provides information returned by
df.shape).
Efficiency vs. readability is just one of the examples where you’ll have to
choose one of the different approaches. Generative AI, when inquired, can
often present you with the benefits and weaknesses of different methods, but
you need to be conscious of this dichotomy and sometimes explicitly request
the solution prioritizing one over the other.
Listing 2.1
import pandas as pd

df_customers = pd.read_csv('olist_customers_dataset.csv')
df_order_items = pd.read_csv('olist_order_items_dataset.csv')
df_order_payments = pd.read_csv('olist_order_payments_dataset.csv')
df_order_reviews = pd.read_csv('olist_order_reviews_dataset.csv')
df_orders = pd.read_csv('olist_orders_dataset.csv')
df_products = pd.read_csv('olist_products_dataset.csv')
df_sellers = pd.read_csv('olist_sellers_dataset.csv')
df_product_category_translation = pd.read_csv('product_category_name_translation.csv')
Human: I have a pandas DataFrame with customer_id and order_id columns. How
ChatGPT: You can quickly check if every customer made just one order or if t
import pandas as pd

# Sample DataFrame
df = pd.DataFrame(data)
order_counts = df['customer_id'].value_counts()

if order_counts.max() > 1:
    print("There are customers with more than one order.")
else:
    print("Every customer made just one order.")
ChatGPT understood what we were trying to achieve and proposed code we
can adapt to our needs by replacing just the name of the DataFrame to be
tested with df_orders, as we defined it when loading our datasets.
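For reference, the adaptation we ran looks like this (it assumes df_orders has already been loaded as in Listing 2.1):

# Check whether any customer_id appears on more than one order.
order_counts = df_orders['customer_id'].value_counts()

if order_counts.max() > 1:
    print("There are customers with more than one order.")
else:
    print("Every customer made just one order.")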
You may check the result in the accompanying files. It shows that either we
got the data from the online platform with the lowest customer retention rate
in existence (one visit per user), or there is a problem with the customer_id
column in the df_orders DataFrame.
import pandas as pd
# Sample DataFrame
df = pd.DataFrame(data)
relationship = df.groupby('customer_unique_id')['customer_id'].nunique()
print(relationship)
Based on the results, you can determine which column contains actual identif
Going back to our analysis, the resulting relationship Series is too big to
analyze manually. Let’s try to complain about it a little.
Human: With my data, the "relationship" series produced by your previous cod
ChatGPT: Sure! You can analyze the "relationship" Series by generating a sum
import pandas as pd
# Sample DataFrame
df = pd.DataFrame(data)
unique_customer_ids = df['customer_id'].nunique()
unique_customer_unique_ids = df['customer_unique_id'].nunique()
print(f"Min: {relationship.min()}")
print(f"Max: {relationship.max()}")
print(f"Mean: {relationship.mean()}")
print(f"Count of unique customer_id values per customer_unique_id:\n{relatio
This summary will give you a better understanding of the relationship betwee
The key takeaway message is that ChatGPT can help us choose the right
analytical methods, but it is on us to understand the actual data meaning in
the context of the business domain we operate in and ask the right questions.
2.3 Quality assessment structure
By now, we should know what information is encoded in the data at hand,
and as a bonus, we should have an overview of its completeness.
A note of warning here: some data is inherently sparse. Just counting the
missing values in the previous step will not, by itself, define data
completeness; however, if you combine that count with knowledge of the
data’s meaning, in most cases it will, as sketched below.
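One way to put this on paper is a small summary of our own that pairs the raw missing-value counts with their share of each column (assuming the DataFrames from Listing 2.1 are loaded), so that sparse-by-design columns can be judged against the data’s meaning:

import pandas as pd

# Count and share of missing values per column in df_products.
missing_summary = pd.DataFrame({
    'missing_count': df_products.isna().sum(),
    'missing_share': df_products.isna().mean().round(4),
})
print(missing_summary.sort_values('missing_share', ascending=False))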
Human: I want to perform data analysis. I imported my data into pandas DataF
ChatGPT: Once you have imported your data into pandas DataFrames and perform
1. Data Cleaning:
Address missing values, duplicate records, and data entry errors. Depending
2. Feature Engineering:
Create new features from the existing data to enhance the information availa
3. Data Transformation:
Scale, normalize, or standardize numerical features, and encode categorical
Order matters
In this particular case, the assumption is that the data being processed is
known and understood by the analyst well enough to proceed with its
transformation and feature engineering before Exploratory Data Analysis
(EDA). In such a case, EDA, which should form a solid foundation for
understanding the data set and play an important role in selecting
appropriate analytical tools, becomes a superficial step resulting in a couple
of mostly useless charts. We will therefore change the order of the proposed
steps.
1. Understand the relation of your data with reality, i.e., what each variable
means and how it was generated.
2. Make sure the data meets some basic technical standards, i.e., ensure
that you work with all, and only, the available relevant data points and can
cross-relate them between data sources if needed.
3. Before you try to analyze reality using your data, analyze the data itself,
i.e., validate if it adheres to either expertise- or common-sense-based
expectations.
By now, we should know what variables we have at our disposal. In the
following sections, we will ensure the contents describe phenomena rooted in
reality, not IT system design or implementation artifacts.
Even if you consider yourself above natural human reactions, there is also
another reason. Some of the methods of handling missing values are based on
the values of surrounding data. In such a case, you don’t want the previous or
next row repeated a hundred times.
2.4.1 Removing duplicates
There are many ways by which duplicate records can sneak into your data.
From data entry mistakes (human errors) and system glitches (software
errors), to badly designed identifiers (human), to synchronization problems in
distributed systems (software) to errors in merging datasets (design-level:
human, execution-level: software, or human if it was a bug), to import issues,
the list goes on. With so many things that can go wrong, some of them will.
In all honesty, we’re still waiting to see a duplicate-free dataset that was not
previously specifically curated for sharing. You should test for and eliminate
duplicate records to avoid over-representing certain data points, which could
skew the analysis results.
ChatGPT: To remove duplicate entries from a DataFrame, you can use the `drop_duplicates()` method:
import pandas as pd
df = pd.DataFrame(data)
df_no_duplicates = df.drop_duplicates()
print("Original DataFrame:")
print(df)
print(df_no_duplicates)
If you want to remove duplicates based on a specific set of columns, you can
In our case, it sure is (as we didn’t have any duplicates in the first place), so
we can triumphantly tick this box and move to the next step.
Depending on the context and the amount of missing data, you can choose to
either delete the incomplete data entry or variable (i.e., a row or column in
tabular formats), ignore it, or somehow fill in missing values. You may fill
the empty fields with, e.g., a default value, an estimate (e.g., mean, median,
or mode), or use interpolation methods.
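The snippet below is an illustrative sketch of those options; the column names are hypothetical placeholders rather than columns from our dataset:

# Delete incomplete rows entirely.
df_no_missing_rows = df.dropna()

# Drop columns that are mostly empty (keep those with at least half their values).
df_dense_cols = df.dropna(axis=1, thresh=len(df) // 2)

# Fill with an estimate, a default value, or by interpolation.
df['price'] = df['price'].fillna(df['price'].median())
df['category'] = df['category'].fillna('unknown')
df['temperature'] = df['temperature'].interpolate()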
In our initial analysis, we should notice three DataFrames, where the number
of non-null values differs from the number of rows in some columns
(meaning it’s not just an empty column). These are df_orders,
df_order_reviews, and df_products.
Knowledge is power.
To decide on what to do with the missing values, we should first have a look
at the data.
In the case of df_products, we are missing some of the category names, the
same number of missing product names, description lengths and pictures, and
some product size data. Let’s see if we’ll be able to say anything looking at
the products where some of the values are missing.
There are 611 rows where some of the data is missing. This would be a chore
to analyze and address manually, and if we were dealing with 6111 or 61111
rows, it would become outright unfeasible. What can we do to automate the
process somewhat? Let’s start by investigating whether the number of
missing values in different columns is coincidental or if there is some
structure to it.
ChatGPT: You can check if the missing values form a pattern by analyzing the
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame(data)
missing_indicator = df.isna()
missing_corr = missing_indicator.corr()
Finally, the correlation matrix is visualized using a heatmap from the Seaborn library.
This time, ChatGPT used some really heavy artillery, including visualization
libraries. It’s one of its very useful skills, and we’ll lean on it a lot from now
on.
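The plotting part of ChatGPT’s answer is not reproduced above; a minimal sketch of how such a heatmap can be rendered (our own wording, applied here to df_products) looks like this:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation between the missing-value indicators of each column.
missing_corr = df_products.isna().astype(int).corr()

sns.heatmap(missing_corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between missing values in different columns")
plt.show()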
If you’d like to learn more about the importance of proper visualization for
statistical analysis, we can’t recommend timeless books by Professor Edward
Tufte enough. His Visual Explanations: Images and Quantities, Evidence and
Narrative, or The Visual Display of Quantitative Information are technology-
agnostic bibles of great visual communication.
Figure 2.1 depicts the results of one of the analyses proposed by ChatGPT.
We present just this one, as the work with the others in this category is
identical, while the results are less interesting.
Figure 2.1 The correlation between missing values in different columns.
We can see that some products have missing values in all columns
product_category_name, product_name_lenght, and
product_photos_qty, or in all columns product_weight_g,
product_length_cm, product_height_cm, and product_width_cm. It
means that some products can be either weightless and dimensionless
(services, perhaps?), or they don’t have any category, product name or photos
(mystery box comes to mind).
A search for rows of the dataset in which all columns besides product_id are
missing returned a single product:
5eb564652db742ff8f28759cd8d2652a.
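For the curious, a sketch of such a search (ours, not ChatGPT’s) can be as short as:

# Rows where every column except product_id is missing.
cols_except_id = df_products.columns.drop('product_id')
all_missing = df_products[df_products[cols_except_id].isna().all(axis=1)]
print(all_missing['product_id'].tolist())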
Unfortunately, we need to take off our Sherlock Holmes cap and get to deal
with a much more prevalent, if not so exciting, category of problem.
Unfortunately, the best tool to search for such errors is biological rather than
digital. As of the time of writing this book, nothing beats a good old-
fashioned eyeball. Some of the methods described in section 2.1 can help
identify the oddities, e.g., df['column_name'].value_counts() can help you
spot values that occur only once, which may be suspicious if you expect a
uniform distribution. However, detecting and correcting data entry errors
requires a combination of these techniques, along with domain knowledge
and manual inspection.
The specific type of data entry error analysis we would like to point your
attention to is outlier detection and treatment. Outliers are data points that
deviate significantly from the rest of the dataset (e.g., if all but one
transaction in a set are from NY, and there is suddenly one from LA).
Identifying and handling outliers can be important, as they can
disproportionately impact the analysis results. Or they can indicate data
points of extreme importance. Outliers can be accepted, removed, capped, or
transformed depending on the context to mitigate their influence.
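As an illustration, a common (though by no means the only) convention is the 1.5 × IQR rule; the sketch below is our own, and 'price' is a hypothetical column name:

# Flag values outside 1.5 * IQR of the middle 50% of the data.
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df['price'] < lower) | (df['price'] > upper)]
print(f"{len(outliers)} potential outliers found")

# One possible treatment: cap the values instead of dropping the rows.
df['price_capped'] = df['price'].clip(lower, upper)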
There is one more element of data cleaning that you should consider. We
touched upon one when we identified inconsistencies in customer
identification when first checking our DataFrames.
In human language, for analytical purposes, you’d really rather not find in the
datetime column stuff like “last Tuesday, around noon” next to “it was a
beautiful spring morning”.
1. Define validation rules and constraints: Identify the specific rules and
2. Check data types: Ensure that each column has the correct data type. If n
4. Validate values: Check that the values in each column follow the predefin
5. Handle invalid values: If invalid values are detected, decide how to hand
# Check foreign key constraints (assuming df1 and df2 are two related DataFrames)
foreign_key_violations = df1[~df1['foreign_key_column'].isin(df2['primary_key_column'])]
if not foreign_key_violations.empty:
    print("Foreign key violations found:")
    print(foreign_key_violations)
    # Handle violations (e.g., drop rows or correct the foreign key values)
7. Custom validation functions: In some cases, you may need to write custom
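A minimal sketch of what such a custom validation function might look like is shown below; it is our own illustration, and the column names are assumed to match the Olist orders file, so adjust them to your data:

import pandas as pd

def validate_order_dates(df):
    """Flag rows where the delivery date precedes the purchase date."""
    purchased = pd.to_datetime(df['order_purchase_timestamp'], errors='coerce')
    delivered = pd.to_datetime(df['order_delivered_customer_date'], errors='coerce')
    invalid = df[delivered < purchased]
    if not invalid.empty:
        print(f"{len(invalid)} rows violate the date-ordering rule:")
        print(invalid[['order_id', 'order_purchase_timestamp',
                       'order_delivered_customer_date']])
    return invalid

violations = validate_order_dates(df_orders)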