Generative AI for Data Analytics
1. welcome
2. 1 Introduction to the use of Generative AI in Data Analytics
3. 2 Using Generative AI to ensure sufficient Data Quality
4. 3 Descriptive analysis and statistical inference supported by Generative AI
5. 4 Utilizing Generative AI for result interpretation and formulating recommendations
6. 5 Basic Text Mining using Generative AI
7. 6 Advanced Text Mining with Generative AI
8. 7 Scaling and Performance Optimization
9. 8 Risk, Mitigation, and Trade-offs
10. Appendix A Specifying multiple Dataframes to ChatGPT v4
11. Appendix B On debugging ChatGPT’s code
12. Appendix C On laziness and human errors
welcome
Thank you for purchasing the Manning Early Access Program (MEAP)
edition of our upcoming book, Generative AI for Data Analytics. We are
excited to share this comprehensive guide that will help you harness the
power of state-of-the-art language models like ChatGPT for data analysis.

The book is designed to be accessible to beginners (both to data analytics and to using advanced data environments), though having some knowledge of data analytics or basic programming concepts will undoubtedly be advantageous. Our aim is to provide you with a clear and practical understanding of modern analytics and how GPT models can be employed to improve its various aspects, with ChatGPT serving as a concrete example.

As you progress through the chapters, you will gain valuable insights into the
best practices for using GPT models in data analytics, as well as the risks
associated with these cutting-edge tools. You will also look into topics such
as data quality, result uncertainty, bias and ethics. We’ll show you where the
models can excel but also where they are still lacking. This will all be
illustrated by transcripts of conversations with the models. We will guide you
on how to use the models to their full potential and how to recognize,
manage, and mitigate risks, ensuring that you can confidently and responsibly
make the most of GPT-driven data analytics.

This book is tailored for current and future data analysts who wish to take full
advantage of available tools while acknowledging their limitations and
appropriateness for specific tasks. By the end of your journey, you will be equipped with the knowledge and skills needed to effectively perform data analytics in a modern data environment, utilizing GPT models like ChatGPT to make your work faster and more thorough.

We hope that you find this book insightful, engaging, and valuable as you
explore the exciting opportunities that GPT-driven data analytics can offer.
We look forward to your feedback and contributions in the liveBook
discussion forum as we work together to create a comprehensive guide to this
cutting-edge field. Your input is invaluable, and we are eager to create a
resource that exceeds your expectations.

Once again, thank you for joining us on this exciting journey, and we look
forward to hearing your thoughts on Generative AI for Data Analytics.

Dr. Marian Siwiak, Dr. Marlena Siwiak, Artur Guja

In this book

welcome
1 Introduction to the use of Generative AI in Data Analytics
2 Using Generative AI to ensure sufficient Data Quality
3 Descriptive analysis and statistical inference supported by Generative AI
4 Utilizing Generative AI for result interpretation and formulating recommendations
5 Basic Text Mining using Generative AI
6 Advanced Text Mining with Generative AI
7 Scaling and Performance Optimization
8 Risk, Mitigation, and Trade-offs
Appendix A Specifying multiple Dataframes to ChatGPT v4
Appendix B On debugging ChatGPT’s code
Appendix C On laziness and human errors
1 Introduction to the use of
Generative AI in Data Analytics
This chapter covers
An introduction to the relevant properties of Generative AI models
The role of Generative AI in Data Analytics
Getting started with using LLMs to support Data Analytics

This book aims to show you how to utilize Generative AI to improve multi-
faceted business activities such as Data Analytics. We will show you how to
extract knowledge stored in the depths of neural networks and not fall victim
to risks inherent to this technology. To excel in both tasks, you need to have
in the back of your head what drives the responses you get to your prompts.
Therefore, this chapter will provide a brief overview of Generative AI models, their underlying technology, and their main limitations. The
point is not to give the readers encyclopedic knowledge of the technology but
a deep enough understanding to demystify it and allow a more critical
interpretation of its abilities.

1.1 Key features (or limitations) of Generative AI models
Before diving deep into the transformative potential of Generative AIs in the
realm of Data Analytics, it's imperative to lay a solid foundation. LLMs, with
their intricate architecture and vast capabilities, are not mere tools but
powerful allies in deciphering complex data narratives. Getting a grasp of
their key features (or limitations) will not only enhance your analytical
prowess but will ensure that you have the capability to harness their potential
both efficiently and responsibly.

NOTE:
This book will not be a traditional programming book. You will not be able
to use it like you would a script. You are going to engage in a dialogue with
Large Language Models, and like any conversation, the words and direction
will change depending on the model, as well as the context that came before.
The output that you receive will very likely differ from what is printed in this
book. This should not discourage you. Instead, you should explore. The
journey is as rewarding as the destination. You might find yourself frustrated that the model cannot follow along. Have patience. If you are disciplined (and
somewhat adventurous), you can get GPT to cooperate with the general
themes and aim of this book: learning how to use generative AI to make you
a better programmer.

While the temptation is strong to jump right into the advanced applications,
let's begin by illuminating the fundamental characteristics of Generative AIs.
These features often result in limitations that a user should keep in mind in
order to utilize the model appropriately.

A lack of true comprehension: LLMs, of course, are computer models. This means that they have a total and complete lack of the kind of understanding
we humans (usually) exhibit. Although they can generate contextually
relevant and coherent responses, they do not possess a deep abstract
awareness of the topics under discussion. The model essentially learns
patterns from the training data and generates responses based on those
patterns without actual comprehension of the subject matter. It may result in
answers that are irrelevant, incomplete, ridiculous, and, in the worst case,
logical and convincing yet entirely and utterly wrong.

Inaccurate or outdated information: Each Generative AI's knowledge is derived from the content corpus it was trained on. Depending on the
knowledge cutoff, the model may provide outdated or incorrect information,
lack knowledge of recent events, or be unaware of developments in various
fields since that point in time. This will be visible in examples in this book,
where the model gives answers using API calls or programming language
structures from obsolete versions. This is not such a severe limitation as it
might initially seem. Firstly, the majority of the concepts covered don’t
evolve that fast, and most people will have a lot of ground to cover in the
basics before they reach the need to tap the latest developments. Some
models do have access to the internet; however, the results of mixing the
time-limited body of knowledge used to train the model and continuous
updates do not guarantee human-like development of knowledge and may
lead to inaccuracies and incoherence of results.

Input and output limits: While using Generative AIs, you should be aware
of the size limit for prompts and responses, which depends on the token limit
of the specific model version. For GPT-4, the latest model available at the
time of writing, the token limit is 8000 tokens. This limit covers the
combined number of both input (prompt) and output (response) tokens. A
token can be as short as one character or as long as one word, and it’s usually
unclear to users how many tokens they supply in the prompt or expect in the
response. However, a good rule of thumb is to expect this limit to be roughly
equivalent to around 800 to 1000 words. While for most simple prompts, this
seems plenty, most users quickly find this limiting once they try to feed the
model with a reasonable amount of data or they expect a decent size output.
One way around it is to use the model’s limited memory to build up the
context across multiple queries. However, bear in mind that the context
memory of the model is also restricted, albeit much less visibly, and the
model may forget past prompts and responses after a few rounds of dialogue.
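
If you are unsure how many tokens a prompt will consume, you can get a rough count locally before sending it. The short sketch below is our own illustration rather than part of the book's workflow; it assumes OpenAI's open-source tiktoken tokenizer (pip install tiktoken) is available.

import tiktoken

# Hypothetical prompt used only for illustration
prompt = "Suggest three sanity checks for a freshly imported sales dataset."

# encoding_for_model returns the tokenizer matching a given model family
encoding = tiktoken.encoding_for_model("gpt-4")
token_count = len(encoding.encode(prompt))

print(f"Prompt length: {token_count} tokens")
# Leave headroom for the response: if the count is close to the combined
# limit, shorten the prompt or split the task across several turns.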

Verbosity: When you try some prompts, it will also quickly become clear to
you that LLM models may sometimes generate overly verbose responses or
overuse certain phrases. This is visible, for example in ChatGPT v4, in its
tendency to close each broader response with an “In conclusion” paragraph,
rephrasing the key points. This verbosity can be attributed to biases or
patterns in the training data, where longer responses, or responses of a certain
structure, might be more common.

Egg it on!

The prompt/response size limit and the verbosity can often lead to incomplete or cut-off responses. One option, when designing a conversation with the Generative AI, is to ensure that the combined length of the prompt and expected response doesn't exceed the token limit. Another is simply to ask the model to continue where it left off.

It will always provide an answer (but it could be a wrong one): In Chapter 8, we'll also discuss the model's sensitivity to input phrasing. Slight
rephrasing of a question may result in different answers with varying degrees
of quality and relevance. It’s worth noting that, since prompts are usually
supplied in natural language, this sensitivity is slightly different than for
search engines since the latter only react to clusters of keywords, whereas a
model like ChatGPT may provide a different response based not just on the
keywords, but also the grammatical structure of the prompt, its perceived
emotional tone of writing, and the context created by the recent exchanges.

Also, when faced with ambiguous questions, Generative AI will still generate
a response. The model tends to guess the user's intent, as it’s currently unable
to ask for more context or specifics, which can lead to answers that may not
be directly relevant to the intended query.

Some of the answers might even be invented: There might be instances where Generative AI provides an answer that appears plausible but is not
based on facts or even directly linked to the training material. This is because
the model sometimes fills in gaps in its knowledge by generating content that
aligns with the patterns it has observed in the training data, even if the
information is not accurate or complete. This will also be covered in chapter
7, when we discuss the phenomenon of AI hallucination.

Biased or inappropriate content: Despite efforts to reduce harmful and biased content, Generative AI may still generate responses that exhibit biases
or produce content that could be considered inappropriate. This can result
from some biases still present in training data, biases hidden or purposefully
included in the prompts, or a multitude of other overlapping factors in such a complex model. The developers of most Generative AIs, living up to their name, have gone to great lengths to balance their models' responses. An example can be found in the GPT-4 System Card document[1].

Text-only: At the time of writing this book, Generative AIs do not possess
visual understanding or communication capabilities, as they are based on
text-based models. They cannot process or interpret images, videos, or any
other non-textual information. Sometimes you can describe images in the
prompt, but this quickly becomes limiting in practice. Neither can they
present results in these formats, except in very rudimentary cases, where they
attempt to “draw” a graph using ASCII art. Fortunately, for the topic of data
analysis, this will not be a major limitation most of the time. Firstly, the data
being analyzed is usually already digitized. Secondly, the model will be
mostly used to generate code or provide support and explanation, so even if
visual data, such as images or movies, is to be analyzed, it won’t be done by
the model per se but by running the code generated by it.

Inherent sycophancy: The larger the model at the base of Generative AI, the
more likely it is to exercise agreeability over reliability. If confronted or
questioned about the provided answer, it’s likely to apologize and present the
point of view contradicting its previous statement even if it was correct the
first time; truth be damned!

Awareness of these limitations is crucial for users when interacting with Generative AIs or incorporating them into various applications. Continued
research and development aim to address these limitations and improve the
performance and safety of Generative AIs.

1.2 The role of Generative AI in Data Analytics


Before we dive deep into using Generative AI to further our data analytics,
we have to decide what role we are going to expect AI to fulfill. Keeping in
mind the fact that LLMs have their limitations, the model cannot be expected
to do all the work. Let us, therefore, have a closer look at what we are dealing with and what the differences and similarities are to the other elements of the data analytics flow, namely the analytical software and the analytical mind.

1.2.1 The complementarity of language models and other data analytics tools
To maximize the potential of your data-driven insights, it is crucial to
understand how language models can be complemented by other data
analytics tools, techniques, and algorithms to unlock powerful data analytics
and drive decision-making.

Generative AIs, like ChatGPT, which we will use as an example throughout this book, excel at understanding, interpreting, and generating human
language. These models are particularly well-suited for tasks that involve text
data, such as sentiment analysis, text classification, summarization, and
question-answering. However, their potential extends beyond text-based
tasks. Language models can also provide support for data in formats other
than text by guiding users through the analysis process, generating code to
perform the analysis, and explaining the results in natural language.

Firstly, these models can be utilized as an interactive guide to help users navigate through the various stages of data analysis. By understanding user
queries and offering relevant suggestions, these text-only models can
recommend appropriate data preprocessing techniques, statistical tests, and
visualization methods tailored to the specific data and analysis objectives. It’s
like having an expert on speed dial.

Secondly, language models can be used to generate code in a variety of programming languages such as Python, R, or Scala, or even, for the more adventurous, in PHP, Perl or, heaven forbid, Cobol or Intercal. Probably the
best known example currently is the GitHub Copilot implementation of this
concept. This capability allows users to obtain ready-to-use code for data
processing, analysis, and visualization tasks, saving time and effort. The
generated code can vary in size and complexity, from short snippets and
single functions serving as a starting point for users to customize and refine
their analyses further, to whole algorithm implementations and modules,
limited only by the user’s ability and patience to coax the model to spit it out.
Unlike raw code snippets downloaded off the internet, the model has the
invaluable ability to explain the code, as we’ll see in several examples in this
book, and optimize or modify it to the user’s specifications.

This ability to generate and explain code will be the most useful feature for us
throughout the book but also comes with the biggest warning, which will be
repeated in many places, and covered in depth in Chapter 7. Specifically,
never trust the model to spit out fully correct answers or perfectly working
code. The higher the importance or risk of your project, the more
scrupulously you should verify any output through review and testing. In
subsequent chapters, you'll find examples of model-generated code that either does not work as expected or has incorrect explanations attached to it.
Caveat emptor!

Finally, once the analysis is performed, language models can help interpret
the results by generating natural language summaries and explanations. This
feature enables users to understand complex analytical results more easily
and communicate their findings to a broader audience.

The outline of a good model of data analysis supported by Generative AI can be seen in Figure 1.1.

Figure 1.1 Recommended Generative-AI-supported data analysis flow

The key takeaways of this model are:

Never take the first answer from the model to be the perfect one.
Evaluate, refine, and test.
You can never blame the model for any shortcomings or failures in data
analysis. The model does not understand, and it’s the user’s sole
responsibility to approach the problem, the data, and the results
critically.

Even if you forget everything else that’s in this book, please apply the two
points above to your data analysis practice.

The unblocker
The scale of Generative AIs’ abilities is only starting to be explored.
However, it’s already clear that they can be taught to respond consistently
and relevantly on a vast number of topics. They have the ability to drill down
into details, summarize, explain, and associate related concepts to an
extraordinary degree. These abilities can be used to effectively unblock your
own thinking and get you out of your rut. You no longer have to trawl
through dozens of random articles trying to find inspiration or pointers. Just
ask a question. Even if the answer is not perfect, it may point you to concepts
you haven’t thought of before. Use this to expand your horizons.

1.2.2 The limitation of Generative AIs' ability to automate and streamline data analytics processes
While Generative AIs can be employed in all the various applications listed
above and in the subsequent sections of this book, their effectiveness in
automating and streamlining data analytics processes has certain limitations.
While you can incorporate them into the data analytics domain, you’ll still
have to hold off on firing all of your data analysts.

Lack of Quantitative Analysis Skills

Generative AIs excel at understanding and generating natural language, but they lack the inherent ability to perform complex quantitative analysis. Data
analytics processes often require mathematical and statistical methods, such
as regression analysis, time series forecasting, and clustering techniques.
While Generative AIs can suggest such methods, they cannot inherently
perform these functions without being combined with specialized tools or
algorithms tailored for quantitative analysis. In the most common scenario,
this will involve the model generating the relevant code, and the user running
the code on the input data and providing the model with results for further
analysis.
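
To make this division of labor concrete, here is a hedged sketch of what that loop can look like: assume the model has suggested a simple linear regression as the quantitative step; the user runs the code locally and pastes the printed numbers back into the conversation for interpretation. The monthly figures below are made up purely for illustration.

import numpy as np
from scipy import stats

# Made-up monthly sales figures standing in for the user's real data
months = np.arange(1, 13)
sales = np.array([120, 135, 128, 150, 161, 158, 170, 182, 179, 195, 201, 210])

# The quantitative step itself runs outside the language model
slope, intercept, r_value, p_value, std_err = stats.linregress(months, sales)

# These printed values are what the user would paste back into the chat
print(f"slope={slope:.2f}, intercept={intercept:.2f}, "
      f"r_squared={r_value**2:.3f}, p_value={p_value:.4f}")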

Limited Understanding of Domain-specific Concepts

While Generative AIs can generate human-like text based on the context
provided, they may struggle to comprehend and process highly specialized
domain knowledge. Consequently, their ability to accurately generate insights
or recommendations in the context of specific industries or niche subjects
may be limited. This can be overcome by fine-tuning models for specific
domains, but such refinements are frequently prohibitively costly and time-
consuming unless effects of scale can be achieved. One common way to
overcome this limitation is for users to provide the model with more general
prompts and then refine the answer based on their own specialist knowledge.

Inability to Interact with Databases and APIs

Data analytics often involves working with databases, APIs, or other data
sources to extract, clean, and process data. Generative AIs lack the built-in
capability to interact directly with these sources. While it is possible to
integrate Generative AIs with custom-built solutions to bridge this gap, doing
so can be resource-intensive and challenging to implement effectively. As in
the previous cases, the model can still be effectively used to guide the
analysis and provide solutions or even whole swathes of code which will be
executed by the users independently of the model.
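
For example, rather than querying a database itself, the model can hand the user code along the lines of the hedged sketch below, which the user then executes against their own database; the file, table, and column names here are hypothetical placeholders.

import sqlite3
import pandas as pd

# Placeholder connection; in practice this would point at the real database
conn = sqlite3.connect("sales.db")

query = """
SELECT customer_id, order_date, order_value
FROM orders
WHERE order_date >= '2023-01-01'
"""

# The extraction runs on the user's side, outside the language model
df_recent_orders = pd.read_sql_query(query, conn)
conn.close()

print(df_recent_orders.describe())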

Apart from these concerns, in Chapter 7, we'll cover the phenomenon of AI delusion and the risks associated with the use of Generative AIs. In any case,
you’ll probably need to keep a few smart people in the data analytics team for
some time yet.

1.3 Getting Started with Generative AIs for Data Analytics
There is an old Chinese proverb: "In the forest of algorithms, the path to wisdom has many branches." Actually, there isn't; it's ChatGPT that generated it for us. The idea we tried to convey is that, depending on the
situation, you have more than one way to access your AI advisor. We are
adamant that to utilize Generative AI’s potential, you need to get comfortable
with conversing with it, as most of the tools built upon it strip the answer of
relevant nuance, but you should know your options here.

1.3.1 Web interface


In this book, we will use ChatGPT as an example of Generative AI. The
easiest way to access it is through the OpenAI website, and this is what we
will be using throughout this book. You can use the web-based interface to
input your queries and receive responses from the AI model. Since this mode
of interaction is very intuitive, and also due to frequent changes in the
available options, we won’t dwell too much on describing it in detail.

If you haven't already, visit the OpenAI platform at https://chat.openai.com and create an account. Once logged in, you can access the ChatGPT interface,
along with the history of your previous chats, and start using the model to
assist with your data analysis tasks.

Jump right into it…

The sections below will cover some more advanced ways of accessing the
models through API or third-party integrations and are included for
completeness. However, in practice, we find that the majority of useful
interactions can be achieved through the web interface, and it’s also the
fastest and cheapest option to use.

If you’re eager to get started, you can skip the remaining few sections and get
right into it in the next chapter. Later, when you want to play with accessing
the API or integrating ChatGPT into an automated workflow, you can come
back to cover this.

1.3.2 Accessing and using the API and SDK


For more advanced use cases and seamless integration with your existing data
analysis tools, you can access ChatGPT through the OpenAI API. This
method allows you to send requests and receive responses programmatically,
giving you greater control over the AI's capabilities. To get started, sign up
for an API key on the OpenAI website (https://beta.openai.com/signup).
Then, follow the API documentation (https://beta.openai.com/docs) to
understand how to interact with ChatGPT using your preferred programming
language.

The API is suitable for a variety of programming languages, including Python, JavaScript, and more. With API access, you can create custom
applications, integrate ChatGPT into your existing data analysis workflows,
and even build ChatGPT-powered analytics dashboards.

To further simplify the process of integrating ChatGPT into your data analysis projects, you can use available SDKs (Software Development Kits)
and libraries created by OpenAI or third-party developers. These resources
can save you time and effort when it comes to working with the API, as they
provide pre-built functions and classes that handle common tasks. You can
find popular SDKs and libraries for various programming languages on
platforms like GitHub. Make sure to check the compatibility and support
status before using them in your projects.

1.3.3 Examples of programmatic access to ChatGPT


Although the examples in this book use the ChatGPT web interface provided
by OpenAI, there are various other methods of accessing the model,
including access through third-party libraries, as well as direct API calls to the model. To illustrate programmatic access, please consider the following
example of accessing the ChatGPT models from Python. A similar code will
be used in some of the discussions in Chapter 5 on using ChatGPT directly
for data analysis.

A rose by any other name...

Throughout this book we work with Python in a Jupyter environment. In chapter 7 we show that Generative AIs are perfectly capable of supporting
any other programming environment as well.

As stated in the previous section, you'll need to have the OpenAI API set up
first. You can follow the instructions to install the library and set up an API
key on https://beta.openai.com/docs/

Once you have the OpenAI API set up, you can use the following simple
Python code to interact with ChatGPT:

Listing 1.1
import openai

# Replace "your_api_key_here" with your actual API key
openai.api_key = "your_api_key_here"

def chat_with_gpt(prompt):
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=f"{prompt}",
        max_tokens=150,
        n=1,
        stop=None,
        temperature=0.5,
    )
    message = response.choices[0].text.strip()
    return message

# Example usage:
prompt = "What is the capital city of France?"
response = chat_with_gpt(prompt)
print(response)

Replace "your_api_key_here" with your actual API key obtained from


OpenAI. This code defines a function chat_with_gpt that takes a prompt as
input and returns the generated response. You can then use this function to
interact with ChatGPT by providing different prompts.

Form following function

Throughout this book, we present the transcripts of our inquiries to ChatGPT and its answers, as well as code listings. This brings some formatting challenges.

The function names, the variables and parameters, and the code itself are
written in bold.

Comment lines start with the # sign and are written in bold italics. These are
not interpreted by Python at all. They are usually used to explain what the
following piece of code does. Sometimes, when more than one line of
comment is required or the comment contains quotation marks itself, we put the comment text between triple quotation marks.
Those of you new to Python should note a critical property of the Python
language, i.e., indentations. It’s a method of telling Python that we are
entering another nesting level, e.g., when we iterate over a collection.
Indented code will be executed for each element provided by a code of higher
indentation level. If another loop exists within this code, it would be indented
further. Failing to properly indent code will lead either to an error at code execution or, worse, to incorrect results.

The code calls the openai.Completion.create function that generates completions for a given prompt using an AI model. The parameters used by
the function are as follows:

engine: The ID of the model you want to use for generating completions. For example, "text-davinci-003" refers to one of OpenAI's
powerful language models at the time of writing, although it’s probably
best to ask ChatGPT itself, using the Web interface, for an updated list
of models, since they are likely to change frequently. Other engines you
may consider are:
gpt-3.5-turbo - the most capable GPT-3.5 model and optimized for chat
at 1/10th the per-1k-tokens price of text-davinci-003.
text-davinci-002 - similar capabilities to text-davinci-003 but trained
with supervised fine-tuning instead of reinforcement learning
code-davinci-002 - Optimized for code-completion tasks.
gpt-4 and gpt-4-32k - at the time of writing, these models are available
via the API only to a limited audience of beta testers, although GPT-4 is
available through the web interface to users with paid accounts.
prompt: The input text that you want the model to generate a
completion for. It can be a question, statement, or any piece of text that
you want the AI to process.
max_tokens: The maximum number of tokens (words or word pieces)
in the generated response. The model will stop generating new tokens
once this limit is reached. This parameter helps control the length of the
response.
n: The number of completions you want the model to generate for the
given prompt. For example, if you set n=3, the model will return three
different completions. You can then choose the one that suits your needs
best or process them further.
stop: A sequence or list of sequences where the API will stop generating
tokens when encountered. This can be used to prevent the model from
generating unnecessary or irrelevant text.
temperature: Controls the randomness of the generated text. A higher
value (e.g., 1.0) will make the output more diverse and creative, while a
lower value (e.g., 0.1) will make the output more focused and
deterministic. A value of 0.5 is a good balance between randomness and
determinism.

You can adjust these parameters based on your specific use case to get the
desired output.
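
As a brief, hedged illustration of tweaking these parameters, the snippet below reuses the setup from Listing 1.1 but asks for three lower-temperature completions and prints each one so you can pick the most useful; the prompt itself is just an example.

# Assumes openai is imported and openai.api_key is set as in Listing 1.1
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt="List three common causes of missing values in sales data.",
    max_tokens=100,      # keep the answers short
    n=3,                 # request three alternative completions
    temperature=0.2,     # low temperature for more focused answers
)

# Print every returned completion so the most useful one can be selected
for i, choice in enumerate(response.choices, start=1):
    print(f"--- Completion {i} ---")
    print(choice.text.strip())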

1.3.4 Third-party integrations of ChatGPT


In addition to the methods mentioned above, you may also find ChatGPT
integrated into various third-party applications and plugins. These
integrations typically focus on specific use cases, such as data visualization,
natural language processing, or predictive analytics.

For example, ChatGPT has been integrated with the following tools:

Integrated Development Environments (IDEs) and Code Editors:

Generative AIs can be used to generate code snippets, documentation, or offer suggestions while coding. This can range from highlighting common
mistakes and suggesting fixes to generating whole sections of code based on
the user's description of functionality. It is similar to having a vastly
knowledgeable but slightly narrow-minded programming expert on speed
dial.

Some examples of such implementations include:

Visual Studio Code: Build custom extensions that use ChatGPT to generate code snippets based on user input, provide documentation for
functions, or offer coding suggestions based on the context.
JetBrains IDEs: Create custom plugins that leverage ChatGPT to
generate code examples, provide context-aware code suggestions, or
offer explanations for programming concepts.
Atom: Develop custom packages that use ChatGPT to assist with code
generation, documentation lookup, or coding best practices based on the
context of the code being written.

GitHub Copilot has had a very big impact on the whole programming world. It's a cloud-based AI pair programmer that offers autocomplete-style suggestions as you code. It is powered by OpenAI's Codex model, a descendant of GPT-3 trained on a massive dataset of open-source code. GitHub Copilot can suggest entire functions and classes, and even scaffold whole applications. It can also suggest documentation, unit tests, and other code-related artifacts.

GitHub Copilot is available for Visual Studio Code, Visual Studio, Neovim, and JetBrains IDEs. It is free to use for verified students, teachers, and maintainers of popular open-source projects. For other users, it is available as a paid subscription.

Here are some other tools that use Generative AI to assist with coding:

Tabnine: A code completion tool that uses a neural network to predict the next code token.
Kite: A code completion tool that uses a statistical language model to
predict the next code token.
IntelliSense: A code completion tool that is built into Visual Studio and
other Microsoft products.
Code Assistant: A code completion tool that is built into JetBrains IDEs.

These tools can all be helpful for speeding up the coding process and
reducing errors. However, it is important to use them with caution and to
always double-check the suggestions before using them.

No-code and workflow platforms

These platforms allow users to create workflows or design online applications with varying degrees of complexity by putting together pre-
defined blocks, requiring little or no coding knowledge to fulfill these tasks.
Some of these platforms have provided blocks integrated with ChatGPT,
giving users the ability to include within their workflows such tasks as
automatic content creation, data extraction, or sentiment analysis. For
example:

Bubble.io: Build apps with AI-powered text processing and generation features, like personalized content recommendations or summarizing
user input.
Integromat: Automate workflows that involve processing natural
language input, such as summarizing articles, generating social media
captions, or analyzing customer feedback.
Zapier: Create custom integrations and automation involving text
generation or processing, like drafting emails or analyzing the sentiment
of survey responses.

Business Intelligence (BI) Tools:

Business Intelligence platforms provide sets of tools for data collection, analysis, reporting, and management support. These platforms are used by
organizations to help management stay on top of the vast amounts of data
generated in today’s digitized world. Generative AI can be integrated with BI
tools to provide natural language interfaces for querying data, generating
insights, or automating report creation. For example:

Tableau: Use Generative AI to create a natural language interface for querying data in Tableau, allowing users to ask questions about their
data and receive contextually relevant visualizations or insights.
Power BI: Integrate Generative AI to enable users to ask questions in
natural language and generate insights or visualizations based on the
underlying data. Additionally, Generative AI can be used to generate
human-readable summaries of complex data trends or patterns.
Looker: Develop custom integrations that use Generative AI to provide
natural language querying capabilities, enabling users to ask questions
about their data and receive relevant insights, visualizations, or
recommendations.
Qlik Sense: Utilize Generative AI to create a natural language interface
that allows users to explore their data by asking questions and receiving
context-aware responses, visualizations, or suggestions.

By integrating Generative AI with business intelligence tools, users can
interact with their data in a more intuitive and user-friendly way. This can
help to democratize data access and understanding, making it easier for non-
technical users to gain insights and make data-driven decisions.

1.3.5 Best practices and tips for successful Generative AI implementation
Although this book won’t cover the advanced topics related to the direct
integration of Generative AI into applications using API or SDKs, we’d
recommend you follow some best practices and consider the following useful
tips to ensure successful integration of Generative AIs with data analytics
solutions:

Define clear objectives: Start by clearly identifying the goals and expectations of integrating Generative AI into your data analytics
solution. Determine the tasks you want Generative AI to perform, such
as data preprocessing, generating insights, or creating visualizations, and
tailor your integration accordingly.
Familiarize yourself with the API or SDK you are planning to use:
Thoroughly read and understand the OpenAI API documentation,
including its features, limitations, and best practices. This knowledge
will help you design efficient and reliable interactions between
Generative AI and your data analytics tools.
Use appropriate data formats: Ensure that you are using compatible
data formats when sending requests to and receiving responses from
Generative AI. Transform your data, if necessary, to ensure seamless
integration and prevent data loss or misinterpretation.
Monitor usage and cost: Keep track of your API usage to prevent
unexpected costs, especially when working with large datasets or
complex analytics tasks. Implement rate limiting, caching, or other
optimizations to manage your API calls and stay within your plan's
limits.
Handle errors and timeouts: Implement proper error handling and retry mechanisms to deal with potential issues, such as timeouts or rate limit errors. This will help ensure the stability and reliability of your integration (a minimal sketch of such a wrapper appears after this list).
Optimize your prompts: Craft your prompts carefully to obtain the
most accurate and relevant results from Generative AI. Use clear,
concise language and provide enough context to help the AI understand
your requirements. You may need to experiment with different prompt
structures to find the best approach for your specific use case.
Evaluate the AI's output: Generative AI’s output may not always be
accurate or relevant. Always double-check the results provided by the
AI, and consider implementing human review or validation processes,
especially for critical or high-stakes decisions.
Test and iterate: Before fully integrating Generative AI into your data
analytics solution, thoroughly test its performance with various tasks
and datasets. This will help you identify any issues, limitations, or
inaccuracies. Continuously iterate on your prompts, data formats, and
integration methods to improve the overall effectiveness of the AI in
your analytics workflows.
Ensure data security and privacy: When working with sensitive data,
make sure you comply with data protection regulations and follow
security best practices. If working with cloud-based Generative AI,
encrypt data when transmitting it to and from the provider, and consider
using data anonymization techniques to protect the privacy of your
users.
Stay updated with Generative AI developments: Keep track of
updates and new features developed in the field, as these may impact
your integration or offer additional capabilities. Regularly review the
API documentation and subscribe to relevant newsletters or forums to
stay informed about any changes or improvements to Generative AI.
Leverage community resources: Take advantage of resources provided
by the AI community, such as sample code, tutorials, and forums. These
resources can help you learn from others' experiences and discover best
practices for integrating Generative AIs with various data analytics
solutions.
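
As a minimal sketch of the "handle errors and timeouts" advice above, the wrapper below retries the chat_with_gpt function from Listing 1.1 with exponential backoff. The retry counts and delays are arbitrary examples, and the exception handling is deliberately broad because the exact exception classes depend on the library version you use.

import time

def chat_with_gpt_retrying(prompt, max_retries=3, base_delay=2.0):
    """Call chat_with_gpt (Listing 1.1), retrying failures with backoff."""
    for attempt in range(max_retries):
        try:
            return chat_with_gpt(prompt)
        except Exception as error:  # broad on purpose; narrow it for your setup
            if attempt == max_retries - 1:
                raise  # give up after the last attempt
            wait = base_delay * (2 ** attempt)  # backoff: 2s, 4s, 8s, ...
            print(f"Request failed ({error}); retrying in {wait:.0f}s")
            time.sleep(wait)

# Example usage:
# print(chat_with_gpt_retrying("Summarize the limits of LLMs in one sentence."))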

By following these best practices and tips, you can successfully integrate
Generative AI into your data analytics workflows and harness their full
potential to enhance your decision-making, automate tasks, and uncover
valuable insights.

Hopefully, after this introduction, Generative AI will no longer appear to
be a mysterious and, possibly, useless invention. Subsequent chapters will
demonstrate specific exchanges between a human and a Generative AI, using
the responses in all aspects of data analytical work while commenting on the
shortcomings and pitfalls that need to be looked out for to make this
cooperation between humans and AI as painless and productive as possible.

1.4 Summary
Generative AI and derivative tools have taken great strides in recent
years and can be used as invaluable support in many fields, including
data analytics.
Despite the progress, these tools won’t (yet!) replace a competent data
analyst, and there are many limitations that users should be aware of.
At the same time, users should not be afraid to take full advantage of the
immense possibilities of supporting their data analytical work with the
language models, which can be done safely by just following a few
common-sense guidelines.
The easiest way to access Generative AIs is via their web interfaces,
although API and SDK can be used in more advanced applications.

Things to ask Generative AI

What are your limitations?
What is the knowledge base you’ve been built upon?
What is the latest version of <insert favorite analytics tool> that you
know about?

[1] https://cdn.openai.com/papers/gpt-4-system-card.pdf
2 Using Generative AI to ensure
sufficient Data Quality
This chapter covers
Technical aspects of cleaning the data
Data content quality evaluation
Dealing with data errors
Investigating unclear data

In classical logic, conclusions drawn from false assumptions can be either true or false. Similarly, business analytics performed on false input may result in true or false conclusions. This highlights the importance of accurate input data. Poor quality input leads to unreliable results regardless of the sophistication of the analysis. The key word here is unreliable. Let's assume you risked performing an analysis on poor data, and it returned good results. That doesn't "sanctify" the approach. Yes, it happens. It doesn't mean that the next time your results will be "good" as well. It means that the next time you will yet again get unreliable results. If you tossed a coin and got the tails you bet on, it doesn't mean you will get tails next time. This concept is often referred to as "garbage in, garbage out."

On the other hand, if you tossed the coin multiple times and each time you got
your expected tails, maybe the coin is not random. Maybe the quality of your
data is sufficient for your purpose. Unfortunately, despite some marketing
claims, there is no one-size-fits-all solution ensuring data quality. The most
effective way to prepare for analysis is to closely examine the data and
carefully consider the meaning of each variable within the context of your
specific business. However, you are not helpless. While there may not be
methods that fully automate this task, there are techniques and protocols that
can help you approach data evaluation and cleaning in a structured and
systematic manner.

There is one more important note before we begin. If we were to do things by the book, we'd first prepare a protocol and then apply it to our data. However, one of our goals here is to teach you how to utilize Generative AI's capability to develop such a protocol. Therefore, we will play it by ear, in the sense that we will give the wheel to ChatGPT v4, asking it questions about each step of the journey and applying its suggestions. As long as the received suggestions are of value, we will limit ourselves to commenting on them. Cases where we decided to interfere are few and far between. We don't want to get some "perfect" or "optimal" solution here. We want to show you how to get sufficient solutions.

2.1 A note on best practices


Despite the note in the introduction, we need a teeny-tiny bit of background on the best practices of data quality. We need to clarify what we want to achieve and what we should expect from data we would call useful for decision-making (we take the analytics perspective here and do not get into, e.g., accessibility or security issues).

Relevance: The data should be pertinent to the specific business question or problem being addressed. Irrelevant data can lead to
confusion and detract from the focus of the analysis.
Accuracy: The data should be correct, error-free, and consistent with the
real-world objects or events it represents.
Completeness: All required data points should be present, and any
missing or incomplete information should be identified and addressed.
Timeliness: Data should be up-to-date and relevant to the analyzed
period. Outdated data can result in misleading or irrelevant insights.
Uniqueness: Duplicate data should be identified and eliminated to
prevent redundancies and inaccuracies in the analysis.
Consistency: Data should be consistent across different sources,
formats, and systems. Inconsistencies can lead to incorrect conclusions
or insights.
Integrity: The relationships between data elements should be
maintained, ensuring that the data is coherent and meaningful when
analyzed.

In short, you want to be sure that your data adequately describes the slice of reality you're analyzing (points 1-4) and that its structure allows for reliable analysis (points 5-7).
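
To make a couple of these dimensions concrete, the hedged sketch below shows quick pandas checks for completeness (missing values) and uniqueness (duplicates) on a toy DataFrame; the transcripts later in this chapter develop such checks properly with ChatGPT's help, so treat this only as a preview.

import pandas as pd

# Toy data invented for illustration; replace df with your own DataFrame
df = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "amount": [25.0, None, 40.0, 40.0],
})

# Completeness: count missing values per column
print(df.isna().sum())

# Uniqueness: count fully duplicated rows and duplicated identifiers
print("Duplicate rows:", df.duplicated().sum())
print("Duplicate order_id values:", df["order_id"].duplicated().sum())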

If it looks like a lot of work, it’s because it is. Ensuring data quality can be a
daunting task due to the sheer amount of knowledge required for a
practitioner to adequately refine their data, as well as the number of steps
required to ensure that you will be analyzing data relevant for the analyzed
process. There is no silver bullet, and cutting corners can lead to unreliable
data, but there is hope: a good structure can lead you through this process in a
finite time. A well-educated GPT advisor can help you here.

Slow is smooth, and smooth is fast

Data quality is not an area where you should allow yourself to skate on thin ice. You just need to get used to doing things systematically. An old (and by now long-falsified) statement holds that 80% of data work is related to data cleaning and preparation. Nowadays, data people can rarely afford the luxury of such solid foundation building, due to a very unpleasant feedback loop. Business does not see the value of analysis, so it demands more of it. The more they demand, the higher the time pressure. The higher the time pressure, the more corners are cut (usually in areas that do not make it into final reports and thus are invisible/useless from the perspective of the business sponsors). The more corners cut, the less reliable the analysis. The less reliable the analytics, the lower the value of said analytics. The lower the value of analytics...

Time to get to work.

2.2 Getting started


From clay tablets to AI, tools change. Until now, it was on the data analysts
to use their experience and knowledge to design the best path toward
meaningful analysis. Generative AI promises access to condensed knowledge
of immeasurable quantity, supplemented with contextual awareness. Let’s see
what first step of data analysis emerges from this concoction.

We would ask that you take the time to read over the transcript below and,
likewise, the many others that will appear in this book, despite a possible
temptation to skip over them. As we will show throughout this book and summarize in Chapter 6, working with GPT models requires prudence. This means carefully reading the answers; you will discover real gems as well as red herrings there. Distinguishing between the two requires attention
to detail.

Let’s ask ChatGPT 2.1

Human: I want to perform data analysis. I imported my data into pandas DataFrames. What should I do first?

ChatGPT: After importing your data into pandas DataFrames, the first thing you should do is understand the structure and content of your data. Here are some steps you can follow:

1. Inspect the first few rows of the DataFrame:


Use the head() method to display the first few rows of the DataFrame. This will give you a quick overview of the data and its structure.

df.head()

2. Check the shape of the DataFrame:


Use the shape attribute to get the dimensions of the DataFrame (rows and columns).

df.shape

3. Obtain summary statistics and data types:


Use the info() method to get a summary of the data types, number of non-null values, and memory usage.

df.info()

4. Generate summary statistics for numerical columns:


Use the describe() method to get summary statistics for the numerical columns, such as count, mean, standard deviation, minimum, maximum, and quartiles.

df.describe()

5. Check for missing values:


Use the isna() method in conjunction with the sum() method to count the number of missing values in each column.

df.isna().sum()

6. Investigate the categorical columns:


For categorical columns, you can use the value_counts() method to get a count of each unique value in the column.

df['column_name'].value_counts()

Once you have a better understanding of your data, you can proceed with data cleaning, preprocessing, and analysis.
ChatGPT off-handedly offered a critical bit of information in its first
sentence. We need to understand our data. It sounds obvious, but
unfortunately, omissions in this regard are hardly uncommon. Such
fundamental mistakes cannot be redeemed by any amount of analysis,
machine learning, or AI applied at later stages; if you start on shaky
foundations, the entire building will be compromised.

If you work in business and are not absolutely, positively sure about the exact
meaning of the variables received from your IT department, don’t assume.
Ask! We happened to analyze the efficiency of a service sending tens of millions of messages to the clients of a multinational company. We discovered an anomaly that would occur very rarely, where the ratio of opened-to-sent messages was higher than one. Trying to understand what was actually defined as a sent message and what was actually defined as an opened message, we discovered that the error in the logging system ran so deep that all the previous analytics on the efficiency of messaging were worthless! We cannot
overemphasize the importance of close collaboration between the business
and data or IT crowds!

A careful read of the rest of the model’s answer may reveal some less
impressive components. Some steps seem redundant. For example,
the df.isna().sum() method does not offer any information not previously
provided by df.info() (which, in fact, also provides information returned by
df.shape).

You need to decide on your priorities

Efficiency vs. readability is just one of the examples where you’ll have to
choose one of the different approaches. Generative AI, when inquired, can
often present you with the benefits and weaknesses of different methods, but
you need to be conscious of this dichotomy and sometimes explicitly request
the solution prioritizing one over the other.

Nevertheless, the overall answer is very good. Taking an old carpenter's adage, it's better to measure twice. We appreciate that ChatGPT errs on the
side of redundancy. Provided with a wide enough toolset, in a short time,
you’ll be able to select tools suiting your analysis style best. Provided with
insufficient tools, you may stay in the dark.
To have something to actually analyze, let’s load our real-business example
data (https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce, as we
mentioned in the preface, you’ll need to use a free Kaggle account). Below
we show the code we used to load our data (yet another friendly reminder: all the code required to perform the presented analytics is available in our GitHub
repository
https://github.com/mariansiwiak/Genrative_AI_for_Data_Analytics). Not
because we think it’s so brilliant that it needs to be immortalized on the book
pages, but because it is the simplest way of explicitly requesting and
communicating variables storing different datasets, which we’ll use
throughout the book.

Listing 2.1

import pandas as pd

df_customers = pd.read_csv('olist_customers_dataset.csv')
df_order_items = pd.read_csv('olist_order_items_dataset.csv')
df_order_payments = pd.read_csv('olist_order_payments_dataset.csv')
df_order_reviews = pd.read_csv('olist_order_reviews_dataset.csv')
df_orders = pd.read_csv('olist_orders_dataset.csv')
df_products = pd.read_csv('olist_products_dataset.csv')
df_sellers = pd.read_csv('olist_sellers_dataset.csv')
df_product_category_translation = pd.read_csv('product_category_name_translation.csv')

As the next step, we have to take a look at the DataFrames' properties listed above. And unfortunately, we do mean running all of the commands proposed by our GPT advisor for all downloaded data sets. That's the habit you should have for each and every bit of data you import.
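
A compact way of doing that, which we find convenient (this loop is ours, not something proposed in the transcript), is to gather the DataFrames from Listing 2.1 into a dictionary and run the inspection commands over each of them:

datasets = {
    "customers": df_customers,
    "order_items": df_order_items,
    "order_payments": df_order_payments,
    "order_reviews": df_order_reviews,
    "orders": df_orders,
    "products": df_products,
    "sellers": df_sellers,
    "category_translation": df_product_category_translation,
}

for name, df in datasets.items():
    print(f"===== {name} =====")
    print(df.shape)           # dimensions
    print(df.head())          # first few rows
    df.info()                 # column types and non-null counts
    print(df.describe())      # summary statistics for numerical columns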

We found two interesting points while we waded through the data structures. The number of customers in the df_customers DataFrame and the number of orders in the df_orders DataFrame are identical. There are two possibilities. One is that each customer, as identified by the system, made one and only one purchase; the other is that we were provided with an incomplete set of data, and the number of 99,441 rows was arbitrarily set by whoever generated this dataset for us. Both possibilities have far-reaching implications for further analytics.

The df_customers DataFrame also contains two suspiciously similar columns: customer_id and customer_unique_id. As we will probably want to analyze customer properties in other DataFrames, the column that will allow us to connect them is customer_id. As such, this column is of great importance, so we should stop and ask whether customer_id is indeed a unique identification of customers.

As we mentioned and will mention again, it's of utmost importance not to wave away anything that doesn't look right. Let's see if ChatGPT will understand
the first of our problems and propose a workable solution.

Let’s ask ChatGPT 2.2.

Human: I have a pandas DataFrame with customer_id and order_id columns. How can I quickly check if every customer made just one order?

ChatGPT: You can quickly check if every customer made just one order or if there are customers with multiple orders by counting the number of orders per customer. Here's an example:

import pandas as pd

# Sample DataFrame
data = {'customer_id': [1, 2, 3, 4, 2],
        'order_id': [101, 102, 103, 104, 105]}

df = pd.DataFrame(data)

# Count orders per customer
order_counts = df['customer_id'].value_counts()

# Check if any customer made more than one order
if order_counts.max() > 1:
    print("There are customers with more than one order.")
else:
    print("Every customer made just one order.")

In this example, df['customer_id'].value_counts() returns a pandas Series counting how many orders each customer made. If the maximum count is greater than 1, at least one customer placed more than one order.

ChatGPT understood what we were trying to achieve and proposed code that we can adapt to our needs by replacing just the name of the DataFrame to be tested with df_orders, as defined when loading our datasets. You may check the result in the accompanying files. It shows that either we got the data from the online platform with the lowest customer retention rate in existence (one visit per user), or there is a problem with the customer_id column in the df_orders DataFrame.
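
For reference, the adapted check looks like the short sketch below (the full run is in the accompanying files); it assumes df_orders has been loaded as in Listing 2.1.

# ChatGPT's check, adapted to the real orders DataFrame
order_counts = df_orders['customer_id'].value_counts()

if order_counts.max() > 1:
    print("There are customers with more than one order.")
else:
    print("Every customer made just one order.")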

We previously noticed that the customer_id column in the df_customers DataFrame is accompanied by the customer_unique_id column, which
seems redundant if the customer_id column was properly constructed. We
should check what the relationship is between the two.

Let’s ask ChatGPT 2.3.

Human: In my DataFrame containing customers' data, I have two columns: "customer_id" and "customer_unique_id". How can I check how they relate to each other?

ChatGPT: To check how the "customer_id" and "customer_unique_id" columns relate to each other, you can compare the number of unique values in each column and examine how they map onto each other. Here's an example:

import pandas as pd

# Sample DataFrame
data = {'customer_id': [1, 2, 3, 4, 5],
        'customer_unique_id': [1001, 1002, 1002, 1004, 1004]}

df = pd.DataFrame(data)

# Check the number of unique values in each column
unique_customer_ids = df['customer_id'].nunique()
unique_customer_unique_ids = df['customer_unique_id'].nunique()

print(f"Unique customer_id values: {unique_customer_ids}")
print(f"Unique customer_unique_id values: {unique_customer_unique_ids}")

# Check the relationship between the two columns
relationship = df.groupby('customer_unique_id')['customer_id'].nunique()

print("\nRelationship between customer_unique_id and customer_id:")
print(relationship)

In this example, df['customer_id'].nunique() and df['customer_unique_id'].nu


Next, df.groupby('customer_unique_id')['customer_id'].nunique() groups the D

Based on the results, you can determine which column contains actual identif

That’s an interesting answer, especially in regard to the fact that it doesn’t


suggest that the answer will be definitive! There is, of course, a possibility
that none of these columns actually are true identifiers. That point is actually
really important. Don’t assume anything about the data you received from an
uncertain source. Even if you firmly believe and logic dictates that some
values mean this thing or another. In nine out of ten cases, it will be what you
expect, yet in that tenth case you’ll thank us when you’ll scratch your head in
astonishment after reaching out to the data source and learning what this
variable actually means.

Going back to our analysis, the resulting relationship Series is too big to
analyze manually. Let’s try to complain about it a little.

Let’s ask ChatGPT 2.4.

Human: With my data, the "relationship" series produced by your previous cod

ChatGPT: Sure! You can analyze the "relationship" Series by generating a sum

import pandas as pd

# Sample DataFrame

data = {'customer_id': [1, 2, 3, 4, 5, 6, 7],
        'customer_unique_id': [1001, 1002, 1002, 1004, 1004, 1001, 1004]}

df = pd.DataFrame(data)

# Check the number of unique values in each column

unique_customer_ids = df['customer_id'].nunique()
unique_customer_unique_ids = df['customer_unique_id'].nunique()

print(f"Unique customer_id values: {unique_customer_ids}")

print(f"Unique customer_unique_id values: {unique_customer_unique_ids}")

# Check the relationship between the two columns


relationship = df.groupby('customer_unique_id')['customer_id'].nunique()

print("\nSummary of the relationship between customer_unique_id and customer

print(f"Min: {relationship.min()}")
print(f"Max: {relationship.max()}")
print(f"Mean: {relationship.mean()}")
print(f"Count of unique customer_id values per customer_unique_id:\n{relatio

In this example, the summary of the relationship between the "customer_uniqu

This summary will give you a better understanding of the relationship betwee

The results are indeed easier to digest:


Unique customer_id values: 99441
Unique customer_unique_id values: 96096

Summary of the relationship between customer_unique_id and customer_id:


Min: 1
Max: 17
Mean: 1.0348089410589412
Count of unique customer_id values per customer_unique_id:
1 93099
2 2745
3 203
4 30
5 8
6 6
7 3
9 1
17 1
Name: customer_id, dtype: int64

Looks like customer_id values uniquely identify records in the dataset rather than uniquely identifying real customers. Therefore, customer_unique_id is more likely to serve as a customer identifier. In an actual business environment, our next step would be to check with the IT or other relevant department whether over 93% of our clients really make only a single order in their lifetime or whether it is indeed a data quality issue. For now, we'll take the data at face value and continue our analysis.
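If we later want to analyze behavior per real customer, one possible way to carry the better identifier into order-level analysis is a simple merge. This is only a sketch, assuming the column names we have seen so far.

# Attach the more reliable customer identifier to the orders table
df_orders_ext = df_orders.merge(
    df_customers[['customer_id', 'customer_unique_id']],
    on='customer_id',
    how='left')

# Orders can now be aggregated per actual customer
orders_per_customer = df_orders_ext.groupby('customer_unique_id')['order_id'].nunique()
print(orders_per_customer.describe())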

The key takeaway message is that ChatGPT can help us choose the right
analytical methods, but it is on us to understand the actual data meaning in
the context of the business domain we operate in and ask the right questions.
2.3 Quality assessment structure
By now, we should know what information is encoded in the data at hand,
and as a bonus, we should have an overview of its completeness.

A note of warning here: some data is inherently sparse. Simply counting the missing values, as we did in the previous step, will not by itself define data completeness; however, if you combine it with knowledge of the data's meaning, in most cases it will.

Let’s see what next steps ChatGPT proposes we take.

Let’s ask ChatGPT 2.5.

Human: I want to perform data analysis. I imported my data into pandas DataF

ChatGPT: Once you have imported your data into pandas DataFrames and perform

1. Data Cleaning:
Address missing values, duplicate records, and data entry errors. Depending

2. Feature Engineering:
Create new features from the existing data to enhance the information availa

3. Data Transformation:
Scale, normalize, or standardize numerical features, and encode categorical

4. Exploratory Data Analysis (EDA):


Perform a more in-depth exploration of the data by visualizing relationships

5. Model Selection and Training (if applicable):


If your analysis involves machine learning or statistical modeling, choose a

6. Model Evaluation and Validation (if applicable):


Assess the performance of your models using relevant metrics and validation

7. Interpretation and Communication:


Interpret the results of your analysis or modeling, draw meaningful insights

8. Iterate and Refine (if needed):


Based on the insights and feedback from stakeholders or the performance of y
By following these steps, you can effectively perform a data analysis using

Looks like ChatGPT jumped to a bit of a different level of abstraction this time. The proposed order, however, rests on some underlying assumptions which, if not confirmed, could lead to suboptimal results.

Order matters

GPT models are trained on big datasets containing a plenitude of guides, tutorials, and articles about many different topics. For the sake of readability, most of these training materials assume “normal conditions”, whatever that means for each of the covered topics. Sometimes a cleaning step may obfuscate errors, especially in preparatory work. It is therefore of great importance to understand how the data changes after each step.

In this particular case, the assumption is that the data being processed is known and understood by the analyst well enough to proceed with its transformation and with feature engineering before Exploratory Data Analysis (EDA). In such a case, EDA, which should form a solid foundation for understanding the dataset and play an important role in selecting appropriate analytical tools, becomes a superficial step resulting in a couple of mostly useless charts. We will therefore change the order of the proposed steps.

There's more than one way to skin a cat.

We do not think that a one-size-fits-all, silver-bullet, cure-all data preparation protocol actually exists. That being said, you need to start somewhere. The general framework we believe may lead to high-quality data is as follows:

1. Understand the relationship of your data to reality, i.e., what each variable means and how it was generated.
2. Make sure the data meets some basic technical standards, i.e., ensure that you work exclusively with all available relevant data points and can cross-relate them between data sources if needed.
3. Before you try to analyze reality using your data, analyze the data itself, i.e., validate whether it adheres to either expertise- or common-sense-based expectations.
By now, we should know what variables we have at our disposal. In the
following sections, we will ensure the contents describe phenomena rooted in
reality, not IT system design or implementation artifacts.

2.4 Data cleaning


We agree with ChatGPT’s assessment of Data Cleaning's importance. It must
be done as early as possible, as errors resulting from poor data representation,
consistency, or integrity are persistent and stack quickly. The answer we
received focused on three main aspects of data cleaning: handling missing
values, removing duplicates, and searching for data entry errors.

That is a solid common denominator of different possible data-cleaning strategies. You may encounter data-cleaning protocols that include steps we describe as part of EDA, or even as part of the data analysis presented in chapter 3. And they are most likely much better for the data types they are designed for! For the exercise we're running, we assume the worst-case scenario, where you're unsure what to expect from your data. In that situation, the three proposed aspects are a good starting point. You certainly don't want to work with data missing critical values or containing plainly wrong data points. Such data simply doesn't reflect reality.

In this section, we added an element of data validation for good measure. It completes the first iteration of ensuring that, from the technical perspective, the data is analysis-grade.

The first element of data cleaning mentioned by ChatGPT is handling missing values. However, there are two reasons not to follow this order: one technical and one psychological. Let's start with the latter. Removing duplicates is rather quick and simple. If your data is really a big unknown, being able to “tick a box” is a big confidence booster.

Even if you consider yourself above such natural human reactions, there is another, technical reason. Some methods of handling missing values are based on the values of surrounding data. In such cases, you don't want the previous or next row repeated a hundred times.
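A tiny, made-up illustration of why duplicates should go first: any statistic you later use to fill gaps (here, the column mean) gets skewed as soon as duplicated rows inflate it. The numbers below are ours, purely for demonstration.

import pandas as pd

clean = pd.Series([10.0, 30.0, None])
with_dupes = pd.Series([10.0, 10.0, 10.0, 10.0, 30.0, None])

# On clean data the gap is filled with 20.0; with the duplicated 10.0 rows,
# the same strategy fills it with 14.0
print(clean.fillna(clean.mean()))
print(with_dupes.fillna(with_dupes.mean()))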
2.4.1 Removing duplicates

There are many ways by which duplicate records can sneak into your data.
From data entry mistakes (human errors) and system glitches (software errors), to badly designed identifiers (human), to synchronization problems in distributed systems (software), to errors in merging datasets (design-level: human, execution-level: software, or human if it was a bug), to import issues, the list goes on.
In all honesty, we’re still waiting to see a duplicate-free dataset that was not
previously specifically curated for sharing. You should test for and eliminate
duplicate records to avoid over-representing certain data points, which could
skew the analysis results.

Let’s ask ChatGPT 2.6.

Human: How can I remove duplicate entries from a DataFrame?

ChatGPT: To remove duplicate entries from a DataFrame, you can use the `

import pandas as pd

# Sample DataFrame with duplicate rows

data = {'A': [1, 2, 2, 4, 5, 1],
        'B': [6, 7, 7, 9, 10, 6]}

df = pd.DataFrame(data)

# Remove duplicate rows

df_no_duplicates = df.drop_duplicates()

print("Original DataFrame:")

print(df)

print("\nDataFrame without duplicates:")

print(df_no_duplicates)

If you want to remove duplicates based on a specific set of columns, you can

# Remove duplicates based on column 'A'


df_no_duplicates = df.drop_duplicates(subset=['A'])

print("\nDataFrame without duplicates based on column 'A':")


print(df_no_duplicates)

In this example, df.drop_duplicates(subset=['A']) removes duplicate rows bas

This answer is technically correct and, under regular circumstances, sufficient.

Efficiency may not be the top priority

Depending on your GPT model's specifics, it may prioritize the simplicity of the answer over the resource efficiency of the solution. And, quoting Grace Hopper, it's a feature, not a bug. Chapter 5 describes methods for using even general-knowledge GPT models to optimize code.

In our case, it sure is (as we didn’t have any duplicates in the first place), so
we can triumphantly tick this box and move to the next step.
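For completeness, this is roughly how the box gets ticked: a quick sweep over every DataFrame. A sketch, assuming the names from our loading step; extend the mapping to whatever you imported.

# Count exact duplicate rows in each DataFrame before deciding what, if anything, to drop
for name, df in {'customers': df_customers, 'orders': df_orders,
                 'products': df_products, 'sellers': df_sellers}.items():
    print(f"{name}: {df.duplicated().sum()} duplicate rows")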

2.4.2 Handling missing values


We will repeat the warning we issued earlier: some data is sparse. However, the list of possible causes for missing data entries is at least as long as it was in the case of duplicates. Data entry errors, sensor malfunctions, copying errors, integration errors, you name it. Finally, there is the plain unavailability of information.

Depending on the context and the amount of missing data, you can choose to either delete the incomplete data entry or variable (i.e., a row or column in tabular formats), ignore it, or somehow fill in the missing values. You may fill the empty fields with, e.g., a default value or an estimate (e.g., mean, median, or mode), or use interpolation methods.
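Purely to illustrate those options in pandas, here is a sketch on a copy of df_products, using columns we already know are there; we are not applying any of these to our data yet.

# Work on a copy so the illustration doesn't touch the real data
demo = df_products.copy()

# Option 1: drop rows that miss a critical value
demo_dropped = demo.dropna(subset=['product_category_name'])

# Option 2: fill categorical gaps with a default label
demo['product_category_name'] = demo['product_category_name'].fillna('unknown')

# Option 3: fill a numeric column with an estimate such as the median
demo['product_weight_g'] = demo['product_weight_g'].fillna(demo['product_weight_g'].median())

# Option 4: interpolation, which makes sense mostly for ordered or time-indexed data
# demo['some_ordered_column'].interpolate()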

In our initial analysis, we should notice three DataFrames where the number of non-null values in some columns differs from the number of rows (meaning it's not just an entirely empty column): df_orders, df_order_reviews, and df_products.
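One quick way to surface this without re-reading every .info() printout is a short loop; a sketch using the DataFrames just mentioned:

# Show only the columns that actually contain missing values
for name, df in {'orders': df_orders, 'order_reviews': df_order_reviews,
                 'products': df_products}.items():
    missing = df.isna().sum()
    print(f"--- {name} ---")
    print(missing[missing > 0])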
Knowledge is power.

It is paramount that the decision on what to do with missing values is based on understanding why the value is missing! Was it an error of, e.g., a sensor system that missed a measurement? Or was it an error of the system writing down the results? Was there an error when multiple data sources were connected together? Or does it simply reflect a lack of information in its own right?

To decide on what to do with the missing values, we should first have a look
at the data.

In the df_orders DataFrame, we noticed that what is missing are some of the approval and delivery dates. This seems like a good candidate for analyzing the process itself and the relationship of, e.g., the order status with the missing dates. Let's leave it for Exploratory Data Analysis.

In the case of df_products, we are missing some of the category names, the same number of product name lengths, description lengths, and photo counts, and some product size data. Let's see if we'll be able to say anything by looking at the products where some of the values are missing.

There are 611 rows where some of the data is missing. This would be a chore
to analyze and address manually, and if we were dealing with 6111 or 61111
rows, it would become outright unfeasible. What can we do to automate the
process somewhat? Let's start by investigating whether the numbers of missing values in different columns are coincidental or whether there is some structure to them.

Let’s ask ChatGPT 2.7.

Human: In my DataFrame, I have missing values. It looks like the number of m

ChatGPT: You can check if the missing values form a pattern by analyzing the

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample DataFrame with missing values
data = {'A': [1, None, 3, 4, None],
        'B': [1, 2, None, 4, None],
        'C': [None, 2, 3, None, 5],
        'D': [1, 2, 3, 4, 5]}

df = pd.DataFrame(data)

# Create a DataFrame indicating which values are missing
missing_indicator = df.isna()

# Compute the correlation between the missing value indicators
missing_corr = missing_indicator.corr()

# Visualize the correlation matrix using a heatmap
sns.heatmap(missing_corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation of Missing Value Indicators")
plt.show()

In this example, df.isna() returns a DataFrame of the same shape as

missing_indicator.corr() computes the pairwise correlation between the colum

Finally, the correlation matrix is visualized using a heatmap from the Seabo

This time, ChatGPT used some really heavy artillery, including visualization
libraries. It’s one of its very useful skills, and we’ll lean on it a lot from now
on.

A picture is worth a thousand words.

If you’d like to learn more about the importance of proper visualization for
statistical analysis, we can’t recommend timeless books by Professor Edward
Tufte enough. His Visual Explanations: Images and Quantities, Evidence and
Narrative, or The Visual Display of Quantitative Information are technology-
agnostic bibles of great visual communication.

Figure 2.1 depicts the results of one of the analyses proposed by ChatGPT. We present just this one here, as the work with the others in this category is identical, while their results are not as interesting.
Figure 2.1 The correlation between missing values in different columns.

We can see that some products have missing values in all columns
product_category_name, product_name_lenght, and
product_photos_qty, or in all columns product_weight_g,
product_length_cm, product_height_cm, and product_width_cm. It
means that some products can be either weightless and dimensionless
(services, perhaps?), or they don’t have any category, product name or photos
(mystery box comes to mind).

A search for rows of the dataset in which all the columns besides product_id are missing returned a single product:
5eb564652db742ff8f28759cd8d2652a.
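The search itself fits in a couple of lines; a sketch of how it can be expressed:

# Rows in df_products where every column except product_id is missing
non_id_cols = df_products.columns.drop('product_id')
fully_empty = df_products[df_products[non_id_cols].isna().all(axis=1)]
print(fully_empty['product_id'].tolist())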

As we discussed earlier, to decide whether to remove or keep such an item, we should check the possible business or data collection process reasons behind such an oddity.
Fraud detection and forensic analysis are outside the scope of this book, but
as we try to instill in you a bit of professional paranoia, we can’t completely
ignore the fact that this case just begs for an investigation.

If we were to check such a case, we would first check the simplest explanation: someone was playing with the database and didn't clean up after themselves. In that case, this product_id should be present only in this single table. The product_id in question, however, seems to be connected to an order. If possible, we'd try to find the product's website and see whether it isn't, e.g., a service (unlikely, as there is just one such case). If we felt especially inquisitive, we'd move on to check other transactions of this seller, particularly those with the same buyer. In short, we would try to understand the actual event and the expected shape of the data describing it. From there, we would try to identify how the error could have occurred in said data (if it is, in fact, an error and not just a non-typical but legitimate situation).

Unfortunately, we need to take off our Sherlock Holmes cap and deal with a much more prevalent, if less exciting, category of problem.

2.4.3 Correcting data entry errors


There are multiple possible data entry errors, such as typos, mislabeling, or
inconsistent formats, which can negatively affect the dataset’s quality. You
need to identify and correct these errors to ensure the data is accurate and
consistent.

Unfortunately, the best tool for finding such errors is biological rather than digital. As of the time of writing this book, nothing beats a good old-fashioned eyeball. Some of the methods described in section 2.1 can help identify the oddities, e.g., df['column_name'].value_counts() can help you spot values that occur only once, which may be suspicious if you expect a uniform distribution. However, detecting and correcting data entry errors requires a combination of these techniques, along with domain knowledge and manual inspection.
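If you want to give the eyeball a shortlist to work from, a small helper along those lines might look as follows. This is only a sketch; the function name and the cardinality threshold are our arbitrary choices.

def rare_values(df, max_unique=50):
    """Print values occurring exactly once in columns with few distinct values."""
    for col in df.columns:
        counts = df[col].value_counts()
        if len(counts) <= max_unique:
            singletons = counts[counts == 1]
            if not singletons.empty:
                print(f"{col}: {list(singletons.index)}")

rare_values(df_products)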

The specific type of data entry error analysis we would like to draw your attention to is outlier detection and treatment. Outliers are data points that
deviate significantly from the rest of the dataset (e.g., if all but one
transaction in a set are from NY, and there is suddenly one from LA).
Identifying and handling outliers can be important, as they can
disproportionately impact the analysis results. Or they can indicate data
points of extreme importance. Outliers can be accepted, removed, capped, or
transformed depending on the context to mitigate their influence.
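As a starting point for numeric columns, the classic interquartile-range rule is easy to script. This is a sketch; whether a flagged value is an error or a genuinely important observation remains a business question, not a statistical one.

def iqr_outliers(series, k=1.5):
    """Return values lying more than k * IQR below the first or above the third quartile."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]

print(iqr_outliers(df_products['product_weight_g'].dropna()))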

There is one more element of data cleaning that you should consider. We touched upon it when we identified inconsistencies in customer identification while first checking our DataFrames.

2.4.4 Data validation


Validating the data ensures that the data adheres to specific rules or
constraints, such as data types, value ranges, or relationships between
columns, which is crucial for maintaining data integrity and avoiding
inconsistencies that could lead to erroneous conclusions.

In human language, for analytical purposes, you’d really rather not find in the
datetime column stuff like “last Tuesday, around noon” next to “it was a
beautiful spring morning”.

Let’s see what ChatGPT’s approach to data validation is.

Let’s ask ChatGPT 2.8.

Human: What shall I do to perform data validation on my DataFrames?

ChatGPT: Performing data validation on your DataFrames involves checking the

1. Define validation rules and constraints: Identify the specific rules and

2. Check data types: Ensure that each column has the correct data type. If n

# Check the data types of each column


print(df.dtypes)

# Convert a column to the correct data type


df['column_name'] = df['column_name'].astype('desired_data_type')
3. Check for missing values: Identify columns with missing values and decide

# Check for missing values


print(df.isna().sum())

# Drop rows or columns with missing values


df.dropna(axis=0, inplace=True) # Drop rows
df.dropna(axis=1, inplace=True) # Drop columns

# Fill missing values


df['column_name'].fillna(value, inplace=True)

4. Validate values: Check that the values in each column follow the predefin

# Check if values are in the allowed set


allowed_values = {value1, value2, value3}
df['column_name'].isin(allowed_values)

# Check if values are in the allowed range


df['column_name'].between(min_value, max_value)

5. Handle invalid values: If invalid values are detected, decide how to hand

# Drop rows with invalid values


df.drop(df[~df['column_name'].isin(allowed_values)].index, inplace=True)

# Set invalid values to a default value


df.loc[~df['column_name'].isin(allowed_values), 'column_name'] = default_val

6. Check relationships between columns: Ensure that relationships between co

# Check for duplicate entries based on unique key constraints


duplicates = df[df.duplicated(subset=['unique_key_column'], keep=False)]
if not duplicates.empty:
    print("Duplicate entries found:")
    print(duplicates)
    # Handle duplicates (e.g., drop or merge rows)

# Check foreign key constraints (assuming df1 and df2 are two related DataFrames)
foreign_key_violations = df1[~df1['foreign_key_column'].isin(df2['primary_key_column'])]
if not foreign_key_violations.empty:
    print("Foreign key violations found:")
    print(foreign_key_violations)
    # Handle violations (e.g., drop rows or correct the foreign key values)

7. Custom validation functions: In some cases, you may need to write custom
Another random document with
no related content on Scribd:
of the early battles and negotiations with foreign powers. There was
one way in which these advantages could have been supported and
continued. Breckenridge, shrewd and able politician as he was, saw
that the way was to keep Southern Representatives in Congress, at
least as long as Northern sentiment would abide it, and in this way
win victories at the very fountain-head of power. But at the close of
the extra session this view had become unpopular at both ends of the
line, and even Breckenridge abandoned it and sought to hide his
original purpose by immediate service in the Confederate armies.
It will be noted that those who vacated their seats to enter the
Confederacy were afterwards expelled. In this connection a curious
incident can be related, occurring as late as the Senate session of
1882:
The widow of the late Senator Nicholson, of Tennessee, who was in
the Senate when Tennessee seceded, a short time ago sent a petition
to Congress asking that the salary of her late husband, after he
returned to Tennessee, might be paid to her. Mr. Nicholson’s term
would have expired in 1865 had he remained in his seat. He did not
appear at the special session of Congress convened in July, 1861, and
with other Senators from the South was expelled from the Senate on
July 11th of that year. The Senate Committee on Claims, after
examining the case thoroughly, submitted to the Senate an adverse
report. After giving a concise history of the case the committee say:
“We do not deem it proper, after the expiration of twenty years, to
pass special acts of Congress to compensate the senators and
Representatives who seceded in 1861 for their services in the early
part of that year. We recommend that the claim of the petitioner be
disallowed.”
The Sessions of the 37th Congress changed the political course of
many public men. It made the Southern believers in secession still
more vehement; it separated the Southern Unionists from their
former friends, and created a wall of fire between them; it changed
the temper of Northern Abolitionists, in so far as to drive from them
all spirit of faction, all pride of methods, and compelled them to
unite with a republican sentiment which was making sure advances
from the original declaration that slavery should not be extended to
the Territories, to emancipation, and, finally, to the arming of the
slaves. It changed many Northern Democrats, and from the ranks of
these, even in representative positions, the lines of the Republicans
were constantly strengthened on pivotal questions. On the 27th of
July Breckinridge had said in a speech: “When traitors become
numerous enough treason becomes respectable.” Senator Andrew
Johnson, of Tennessee, replied to this, and said: “God being willing,
whether traitors be many or few, as I have hitherto waged war
against traitors and treason, I intend to continue it to the end.” And
yet Johnson had the year before warmly supported Breckinridge in
his presidential campaign.
Among the more conspicuous Republicans and anti-Lecompton
Democrats in this session were Charles Sumner, a man who then
exceeded all others in scholarly attainments and as an orator, though
he was not strong in current debate. Great care and preparation
marked every important effort, but no man’s speeches were more
admired throughout the North, and hated throughout the South,
than those of Charles Sumner. An air of romance surrounded the
man, because he was the first victim of a senatorial outrage, when
beaten by Brooks of South Carolina; but, sneered his political
enemies, “no man more carefully preserved his wounds for
exhibition to a sympathetic world.” He had some minor weaknesses,
which were constantly displayed, and these centred in egotism and
high personal pride—not very popular traits—but no enemy was so
malicious as to deny his greatness.
Fessenden of Maine was one of the great lights of that day. He was
apt, almost beyond example, in debate, and was a recognized leader
of the Republicans until, in the attempt to impeach President
Johnson, he disagreed with the majority of his party and stepped
“down and out.” Yet no one questioned his integrity, and all believed
that his vote was cast on this question in a line with his convictions.
The leading character in the House was Thaddeus Stevens, an
original Abolitionist in sentiment, but a man eminently practical and
shrewd in all his methods.
The chances of politics often carry men into the Presidential Chair,
into Cabinets, and with later and demoralizing frequency into Senate
seats; but chance never makes a Commoner, and Thaddeus Stevens
was throughout the war, and up to the hour of his death, recognized
as the great Commoner of the Northern people. He led in every
House battle, and a more unflinching party leader was never known
to parliamentary bodies. Limp and infirm, he was not liable to
personal assault, even in days when such assaults were common; but
when on one occasion his fiery tongue had so exasperated the
Southerners in Congress as to make them show their knives and
pistols, he stepped out into the aisle, and facing, bid them defiance.
He was a Radical of the Radicals, and constantly contended that the
government—the better to preserve itself—could travel outside of the
Constitution. What cannot be said of any other man in history, can
be said of Thaddeus Stevens. When he lay dead, carried thus from
Washington to his home in Lancaster, with all of his people knowing
that he was dead, he was, on the day following the arrival of his
corpse, and within a few squares of his residence, unanimously
renominated by the Republicans for Congress. If more poetic and
less practical sections or lands than the North had such a hero,
hallowed by such an incident, both the name and the incident would
travel down the ages in song and story.[20]
The “rising” man in the 37th Congress was Schuyler Colfax, of
Indiana, elected Speaker of the 38th, and subsequently Vice-
President. A great parliamentarian, he was gifted with rare
eloquence, and with a kind which won friends without offending
enemies—something too rare to last. In the House were also Justin S.
Morrill, the author of the Tariff Bill which supplied the “sinews of
war,” Henry L. Dawes of Massachusetts, then “the man of Statistics”
and the “watch-dog of the treasury.” Roscoe Conkling was then the
admitted leader of the New York delegation, as he was the admitted
mental superior of any other in subsequent terms in the Senate, up
to the time of his resignation in 1881. Reuben E. Fenton, his factional
opponent, was also there. Ohio was strongly represented in both
parties—Pendleton, Cox and Vallandigham on the side of the
Democrats; Bingham and Ashley on the part of the Republicans.
Illinois showed four prominent anti-Lecompton supporters of the
administration—Douglas in the Senate; Logan, McClernand and
Richardson in the House; while prominent among the Republicans
were Lovejoy (an original Abolitionist), Washburne, a candidate for
the Presidential nomination in 1880—Kellogg and Arnold. John F.
Potter was one of the prominent Wisconsin men, who had won
additional fame by accepting the challenge to duel of Roger A. Pryor
of Virginia, and naming the American rifle as the weapon.
Fortunately the duel did not come off. Pennsylvania had then, as she
still has, Judge Kelley of Philadelphia, chairman of Ways and Means
in the 46th Congress; also Edward McPherson, frequently since
Clerk of the House, temporary President of the Cincinnati
Convention, whose decision overthrew the unit rule, and author of
several valuable political works, some of which we freely quote in this
history. John Hickman, subsequently a Republican, but one of the
earliest of the anti-Lecompton Democrats, was an admitted leader, a
man of rare force and eloquence. So radical did he become that he
refused to support the re-election of Lincoln. He was succeeded by
John M. Broomall, who made several fine speeches in favor of the
constitutional amendments touching slavery and civil rights. Here
also were James Campbell, Hendricks B. Wright, John Covode,
James K. Morehead, and Speaker Grow—the father of the
Homestead Bill, which will be found in Book V., giving the Existing
Political Laws.
At this session Senator Trumbull of Illinois, renewed the agitation
of the slavery question, by reporting from the Judiciary Committee of
which he was Chairman, a bill to confiscate all property and free all
slaves used for insurrectionary purposes.[21] Breckinridge fought the
bill, as indeed he did all bills coming from the Republicans, and said
if passed it would eventuate in “the loosening of all bonds.” Among
the facts stated in support of the measure was this, that the
Confederates had at Bull Run used the negroes and slaves against the
Union army—a statement never well established. The bill passed the
Senate by 33 to 6, and on the 3d of August passed the House, though
several Republicans there voted against it, fearing a too rapid
advance would prejudice the Union cause. Indeed this fear was
entertained by Lincoln when he recommended

COMPENSATED EMANCIPATION

in the second session of the 37th Congress, which recommendation


excited official discussion almost up to the time the emancipation
proclamation was issued as a war necessity. The idea of compensated
emancipation originated with or was first formulated by James B.
McKean of New York, who on Feb. 11th, 1861, at the 2d session of the
36th Congress, introduced the following resolution:
Whereas, The “Gulf States” have assumed to secede from the
Union, and it is deemed important to prevent the “border slave
States” from following their example; and whereas it is believed that
those who are inflexibly opposed to any measure of compromise or
concession that involves, or may involve, a sacrifice of principle or
the extension of slavery, would nevertheless cheerfully concur in any
lawful measure for the emancipation of the slaves: Therefore,
Resolved, That the select committee of five be instructed to inquire
whether, by the consent of the people, or of the State governments,
or by compensating the slaveholders, it be practicable for the General
Government to procure the emancipation of the slaves in some, or
all, of the “border States;” and if so, to report a bill for that purpose.
Lincoln was so strongly impressed with the fact, in the earlier
struggles of the war, that great good would follow compensated
emancipation, that on March 2d, 1862, he sent a special message to
the 2d session of the 37th Congress, in which he said:
“I recommend the adoption of a joint resolution by your honorable
bodies, which shall be substantially as follows:
Resolved, That the United States ought to co-operate with any
State which may adopt gradual abolishment of slavery, giving to such
State pecuniary aid, to be used by such State in its discretion, to
compensate for the inconveniences, public and private, produced by
such change of system.
“If the proposition contained in the resolution does not meet the
approval of Congress and the country, there is the end; but if it does
command such approval, I deem it of importance that the States and
people immediately interested should be at once distinctly notified of
the fact, so that they may begin to consider whether to accept or
reject it. The Federal Government would find its highest interest in
such a measure, as one of the most efficient means of self-
preservation. The leaders of the existing insurrection entertain the
hope that this Government will ultimately be forced to acknowledge
the independence of some part of the disaffected region, and that all
the slave States north of such part will then say, ‘the Union for which
we have struggled being already gone, we now choose to go with the
southern section.’ To deprive them of this hope, substantially ends
the rebellion; and the initiation of emancipation completely deprives
them of it as to all the States initiating it. The point is not that all the
States tolerating slavery would very soon, if at all, initiate
emancipation; but that, while the offer is equally made to all, the
more northern shall, by such initiation, make it certain to the more
southern that in no event will the former ever join the latter in their
proposed confederacy. I say ‘initiation,’ because, in my judgment,
gradual, and not sudden emancipation, is better for all. In the mere
financial or pecuniary view, any member of Congress, with the
census tables and Treasury reports before him, can readily see for
himself how very soon the current expenditures of this war would
purchase, at fair valuation, all the slaves in any named State. Such a
proposition on the part of the General Government sets up no claim
of a right by Federal authority to interfere with slavery within State
limits, referring, as it does the absolute control of the subject in each
case to the State and its people immediately interested. It is
proposed as a matter of perfectly free choice with them.
“In the annual message last December, I thought fit to say, ‘the
Union must be preserved; and hence all indispensable means must
be employed.’ I said this not hastily, but deliberately. War has been
made, and continues to be an indispensable means to this end. A
practical reacknowledgment of the national authority would render
the war unnecessary, and it would at once cease. If, however,
resistance continues, the war must also continue; and it is impossible
to foresee all the incidents which may attend, and all the ruin which
may follow it. Such as may seem indispensable, or may obviously
promise great efficiency toward ending the struggle, must and will
come.
“The proposition now made, though an offer only, I hope it may be
esteemed no offence to ask whether the pecuniary consideration
tendered would not be of more value to the States and private
persons concerned, than are the institution, and property in it, in the
present aspect of affairs?
“While it is true that the adoption of the proposed resolution
would be merely initiatory, and not within itself a practical measure,
it is recommended in the hope that it would soon lead to important
practical results. In full view of my great responsibility to my God
and to my country, I earnestly beg the attention of Congress and the
people to the subject.”
Mr. Conkling called the question up in the House March 10th, and
under a suspension of the rules, it was passed by 97 to 36. It passed
the Senate April 2, by 32 to 10, the Republicans, as a rule, voting for
it, the Democrats, as a rule, voting against it; and this was true even
of those in the Border States.
The fact last stated excited the notice of President Lincoln, and in
July, 1862, he sought an interview with the Border State
Congressmen, the result of which is contained in McPherson’s
Political History of the Great Rebellion, as follows:
The President’s Appeal to the Border States.

The Representatives and Senators of the border slaveholding


States, having, by special invitation of the President, been convened
at the Executive Mansion, on Saturday morning last, (July 12,) Mr.
Lincoln addressed them as follows from a written paper held in his
hand:
“Gentlemen: After the adjournment of Congress, now near, I shall
have no opportunity of seeing you for several months. Believing that
you of the border States hold more power for good than any other
equal number of members, I feel it a duty which I cannot justifiably
waive, to make this appeal to you.
“I intend no reproach or complaint when I assure you that, in my
opinion, if you all had voted for the resolution in the gradual
emancipation message of last March, the war would now be
substantially ended. And the plan therein proposed is yet one of the
most potent and swift means of ending it. Let the States which are in
rebellion see definitely and certainly that in no event will the States
you represent ever join their proposed Confederacy, and they cannot
much longer maintain the contest. But you cannot divest them of
their hope to ultimately have you with them so long as you show a
determination to perpetuate the institution within your own States.
Beat them at elections, as you have overwhelmingly done, and,
nothing daunted, they still claim you as their own. You and I know
what the lever of their power is. Break that lever before their faces,
and they can shake you no more forever.
“Most of you have treated me with kindness and consideration,
and I trust you will not now think I improperly touch what is
exclusively your own, when, for the sake of the whole country, I ask,
‘Can you, for your States, do better than to take the course I urge?’
Discarding punctilio and maxims adapted to more manageable
times, and looking only to the unprecedentedly stern facts of our
case, can you do better in any possible event? You prefer that the
constitutional relations of the States to the nation shall be practically
restored without disturbance of the institution; and, if this were
done, my whole duty, in this respect, under the Constitution and my
oath of office, would be performed. But it is not done, and we are
trying to accomplish it by war. The incidents of the war cannot be
avoided. If the war continues long, as it must, if the object be not
sooner attained, the institution in your States will be extinguished by
mere friction and abrasion—by the mere incidents of the war. It will
be gone, and you will have nothing valuable in lieu of it. Much of its
value is gone already. How much better for you and for your people
to take the step which at once shortens the war and secures
substantial compensation for that which is sure to be wholly lost in
any other event! How much better to thus save the money which else
we sink forever in the war! How much better to do it while we can,
lest the war ere long render us pecuniarily unable to do it! How much
better for you, as seller, and the nation, as buyer, to sell out and buy
out that without which the war could never have been, than to sink
both the thing to be sold and the price of it in cutting one another’s
throats!
“I do not speak of emancipation at once, but of a decision at once
to emancipate gradually. Room in South America for colonization
can be obtained cheaply and in abundance, and when numbers shall
be large enough to be company and encouragement for one another,
the freed people will not be so reluctant to go.
“I am pressed with a difficulty not yet mentioned, one which
threatens division among those who, united, are none too strong. An
instance of it is known to you. General Hunter is an honest man. He
was, and I hope still is, my friend. I valued him none the less for his
agreeing with me in the general wish that all men everywhere could
be freed. He proclaimed all men free within certain States, and I
repudiated the proclamation. He expected more good and less harm
from the measure than I could believe would follow. Yet, in
repudiating it, I gave dissatisfaction, if not offence, to many whose
support the country cannot afford to lose. And this is not the end of
it. The pressure in this direction is still upon me, and is increasing.
By conceding what I now ask you can relieve me, and, much more,
can relieve the country in this important point.
“Upon these considerations I have again begged your attention to
the message of March last. Before leaving the Capitol, consider and
discuss it among yourselves. You are patriots and statesmen, and as
such I pray you consider this proposition; and at the least commend
it to the consideration of your States and people. As you would
perpetuate popular government for the best people in the world, I
beseech you that you do in nowise omit this. Our common country is
in great peril, demanding the loftiest views and boldest action to
bring a speedy relief. Once relieved, its form of government is saved
to the world, its beloved history and cherished memories are
vindicated, and its happy future fully assured and rendered
inconceivably grand. To you, more than to any others, the privilege is
given to assure that happiness and swell that grandeur, and to link
your own names therewith forever.”
At the conclusion of these remarks some conversation was had
between the President and several members of the delegations from
the border States, in which it was represented that these States could
not be expected to move in so great a matter as that brought to their
notice in the foregoing address while as yet the Congress had taken
no step beyond the passage of a resolution, expressive rather of a
sentiment than presenting a substantial and reliable basis of action.
The President acknowledged the force of this view, and admitted
that the border States were entitled to expect a substantial pledge of
pecuniary aid as the condition of taking into consideration a
proposition so important in its relations to their social system.
It was further represented, in the conference, that the people of the
border States were interested in knowing the great importance which
the President attached to the policy in question, while it was equally
due to the country, to the President, and to themselves, that the
representatives of the border slaveholding States should publicly
announce the motives under which they were called to act, and the
considerations of public policy urged upon them and their
constituents by the President.
With a view to such a statement of their position, the members
thus addressed met in council to deliberate on the reply they should
make to the President, and, as the result of a comparison of opinions
among themselves, they determined upon the adoption of a majority
and minority answer.
REPLY OF THE MAJORITY.

The following paper was yesterday sent to the President, signed by


the majority of the Representatives from the border slaveholding
States:—

Washington, July 14, 1862.

To the President:

The undersigned, Representatives of Kentucky, Virginia, Missouri,


and Maryland, in the two Houses of Congress, have listened to your
address with the profound sensibility naturally inspired by the high
source from which it emanates, the earnestness which marked its
delivery, and the overwhelming importance of the subject of which it
treats. We have given it a most respectful consideration, and now lay
before you our response. We regret that want of time has not
permitted us to make it more perfect.
We have not been wanting, Mr. President, in respect to you, and in
devotion to the Constitution and the Union. We have not been
indifferent to the great difficulties surrounding you, compared with
which all former national troubles have been but as the summer
cloud; and we have freely given you our sympathy and support.
Repudiating the dangerous heresies of the secessionists, we believed,
with you, that the war on their part is aggressive and wicked, and the
objects for which it was to be prosecuted on ours, defined by your
message at the opening of the present Congress, to be such as all
good men should approve. We have not hesitated to vote all supplies
necessary to carry it on vigorously. We have voted all the men and
money you have asked for, and even more; we have imposed onerous
taxes on our people, and they are paying them with cheerfulness and
alacrity; we have encouraged enlistments and sent to the field many
of our best men; and some of our number have offered their persons
to the enemy as pledges of their sincerity and devotion to the
country.
We have done all this under the most discouraging circumstances,
and in the face of measures most distasteful to us and injurious to
the interests we represent, and in the hearing of doctrines avowed by
those who claim to be your friends, must be abhorrent to us and our
constituents. But, for all this, we have never faltered, nor shall we as
long as we have a Constitution to defend and a Government which
protects us. And we are ready for renewed efforts, and even greater
sacrifices, yea, any sacrifice, when we are satisfied it is required to
preserve our admirable form of government and the priceless
blessings of constitutional liberty.
A few of our number voted for the resolution recommended by
your message of the 6th of March last, the greater portion of us did
not, and we will briefly state the prominent reasons which influenced
our action.
In the first place, it proposed a radical change of our social system,
and was hurried through both Houses with undue haste, without
reasonable time for consideration and debate, and with no time at all
for consultation with our constituents, whose interests it deeply
involved. It seemed like an interference by this Government with a
question which peculiarly and exclusively belonged to our respective
States, on which they had not sought advice or solicited aid. Many of
us doubted the constitutional power of this Government to make
appropriations of money for the object designated, and all of us
thought our finances were in no condition to bear the immense
outlay which its adoption and faithful execution would impose upon
the national Treasury. If we pause but a moment to think of the debt
its acceptance would have entailed, we are appalled by its magnitude.
The proposition was addressed to all the States, and embraced the
whole number of slaves.
According to the census of 1860 there were then nearly four
million slaves in the country; from natural increase they exceed that
number now. At even the low average of $300, the price fixed by the
emancipation act for the slaves of this District, and greatly below
their real worth, their value runs up to the enormous sum of
$1,200,000,000; and if to that we add the cost of deportation and
colonization, at $100 each, which is but a fraction more than is
actually paid by the Maryland Colonization Society, we have
$400,000,000 more. We were not willing to impose a tax on our
people sufficient to pay the interest on that sum, in addition to the
vast and daily increasing debt already fixed upon them by the
exigencies of the war, and if we had been willing, the country could
not bear it. Stated in this form the proposition is nothing less than
the deportation from the country of $1,600,000,000 worth of
producing labor, and the substitution in its place of an interest-
bearing debt of the same amount.
But, if we are told that it was expected that only the States we
represent would accept the proposition, we respectfully submit that
even then it involves a sum too great for the financial ability of this
Government at this time. According to the census of 1860—

Slaves.
Kentucky had 225,490
Maryland 87,188
Virginia 490,887
Delaware 1,798
Missouri 114,965
Tennessee 275,784

Making in the whole 1,196,112

At the same rate of valuation these would amount to $358,933,500


Add for deportation and colonization $100 each 118,244,533

And we have the enormous sum of $478,038,133

We did not feel that we should be justified in voting for a measure


which, if carried out, would add this vast amount to our public debt
at a moment when the Treasury was reeling under the enormous
expenditure of the war.
Again, it seemed to us that this resolution was but the
annunciation of a sentiment which could not or was not likely to be
reduced to an actual tangible proposition. No movement was then
made to provide and appropriate the funds required to carry it into
effect; and we were not encouraged to believe that funds would be
provided. And our belief has been fully justified by subsequent
events. Not to mention other circumstances, it is quite sufficient for
our purpose to bring to your notice the fact that, while this resolution
was under consideration in the Senate, our colleague, the Senator
from Kentucky, moved an amendment appropriating $500,000 to
the object therein designated, and it was voted down with great
unanimity. What confidence, then, could we reasonably feel that if
we committed ourselves to the policy it proposed, our constituents
would reap the fruits of the promise held out; and on what ground
could we, as fair men, approach them and challenge their support?
The right to hold slaves is a right appertaining to all the States of
this Union. They have the right to cherish or abolish the institution,
as their tastes or their interests may prompt, and no one is
authorized to question the right or limit the enjoyment. And no one
has more clearly affirmed that right than you have. Your inaugural
address does you great honor in this respect, and inspired the
country with confidence in your fairness and respect for the law. Our
States are in the enjoyment of that right. We do not feel called on to
defend the institution or to affirm it is one which ought to be
cherished; perhaps, if we were to make the attempt, we might find
that we differ even among ourselves. It is enough for our purpose to
know that it is a right; and, so knowing, we did not see why we
should now be expected to yield it. We had contributed our full share
to relieve the country at this terrible crisis; we had done as much as
had been required of others in like circumstances; and we did not see
why sacrifices should be expected of us from which others, no more
loyal, were exempt. Nor could we see what good the nation would
derive from it.
Such a sacrifice submitted to by us would not have strengthened
the arm of this Government or weakened that of the enemy. It was
not necessary as a pledge of our loyalty, for that had been manifested
beyond a reasonable doubt, in every form, and at every place
possible. There was not the remotest probability that the States we
represent would join in the rebellion, nor is there now, or of their
electing to go with the southern section in the event of a recognition
of the independence of any part of the disaffected region. Our States
are fixed unalterably in their resolution to adhere to and support the
Union. They see no safety for themselves, and no hope for
constitutional liberty but by its preservation. They will, under no
circumstances, consent to its dissolution; and we do them no more
than justice when we assure you that, while the war is conducted to
prevent that deplorable catastrophe, they will sustain it as long as
they can muster a man or command a dollar. Nor will they ever
consent, in any event, to unite with the Southern Confederacy. The
bitter fruits of the peculiar doctrines of that region will forever
prevent them from placing their security and happiness in the
custody of an association which has incorporated in its organic law
the seeds of its own destruction.

Mr. President, we have stated with frankness and candor the


reasons on which we forbore to vote for the resolution you have
mentioned; but you have again presented this proposition, and
appealed to us with an earnestness and eloquence which have not
failed to impress us, to “consider it, and at the least to commend it to
the consideration of our States and people.” Thus appealed to by the
Chief Magistrate of our beloved country, in the hour of its greatest
peril, we cannot wholly decline. We are willing to trust every
question relating to their interest and happiness to the consideration
and ultimate judgment of our own people. While differing from you
as to the necessity of emancipating the slaves of our States as a
means of putting down the rebellion, and while protesting against
the propriety of any extra-territorial interference to induce the
people of our States to adopt any particular line of policy on a subject
which peculiarly and exclusively belongs to them, yet, when you and
our brethren of the loyal States sincerely believe that the retention of
slavery by us is an obstacle to peace and national harmony, and are
willing to contribute pecuniary aid to compensate our States and
people for the inconveniences produced by such a change of system,
we are not unwilling that our people shall consider the propriety of
putting it aside.
But we have already said that we regarded this resolution as the
utterance of a sentiment, and we had no confidence that it would
assume the shape of a tangible, practical proposition which would
yield the fruits of the sacrifice it required. Our people are influenced
by the same want of confidence, and will not consider the
proposition in its present impalpable form. The interest they are
asked to give up is to them of much importance, and they ought not
to be expected even to entertain the proposal until they are assured
that when they accept it their just expectations will not be frustrated.
We regard your plan as a proposition from the Nation to the States to
exercise an admitted constitutional right in a particular manner and
yield up a valuable interest. Before they ought to consider the
proposition, it should be presented in such a tangible, practical,
efficient shape as to command their confidence that its fruits are
contingent only upon their acceptance. We cannot trust anything to
the contingencies of future legislation.
If Congress, by proper and necessary legislation, shall provide
sufficient funds and place them at your disposal, to be applied by you
to the payment of any of our States or the citizens thereof who shall
adopt the abolishment of slavery, either gradual or immediate, as
they may determine, and the expense of deportation and
colonization of the liberated slaves, then will our State and people
take this proposition into careful consideration, for such decision as
in their judgment is demanded by their interest, their honor, and
their duty to the whole country. We have the honor to be, with great
respect,

C. A. Wickliffe, Ch’n,
Garrett Davis,
R. Wilson,
J. J. Crittenden,
John S. Carlile,
J. W. Crisfield,
J. S. Jackson,
H. Grider,
John S. Phelps,
Francis Thomas,
Chas. B. Calvert,
C. L. Leary,
Edwin H. Webster,
R. Mallory,
Aaron Harding,
James S. Rollins,
J. W. Menzies,
Thomas L. Price,
G. W. Dunlap,
Wm. A. Hall.
Others of the minority, among them Senator Henderson and
Horace Maynard, forwarded separate replies, but all rejecting the
idea of compensated emancipation. Still Lincoln adhered to and
advocated it in his recent annual message sent to Congress, Dec. 1,
1862, from which we take the following paragraphs, which are in
themselves at once curious and interesting:
“We have two million nine hundred and sixty-three thousand
square miles. Europe has three million and eight hundred thousand,
with a population averaging seventy-three and one-third persons to
the square mile. Why may not our country, at some time, average as
many? Is it less fertile? Has it more waste surface, by mountains,
rivers, lakes, deserts, or other causes? Is it inferior to Europe in any
natural advantage? If, then, we are at some time to be as populous as
Europe, how soon? As to when this may be, we can judge by the past
and the present; as to when it will be, if ever, depends much on
whether we maintain the Union. Several of our States are already
above the average of Europe—seventy-three and a third to the square
mile. Massachusetts has 157; Rhode Island, 133; Connecticut, 99;
New York and New Jersey, each, 80. Also two other great States,
Pennsylvania and Ohio, are not far below, the former having 63 and
the latter 59. The States already above the European average, except
New York, have increased in as rapid a ratio, since passing that
point, as ever before; while no one of them is equal to some other
parts of our country in natural capacity for sustaining a dense
population.
“Taking the nation in the aggregate, and we find its population and
ratio of increase, for the several decennial periods, to be as follows:

1790    3,929,827       Ratio of increase.
1800    5,305,937       35.02 per cent.
1810    7,239,814       36.45   „
1820    9,638,131       33.13   „
1830    12,866,020      33.49   „
1840    17,069,453      32.67   „
1850    23,191,876      35.87   „
1860    31,443,790      35.58   „
This shows an average decennial increase of 34.60 per cent. in
population through the seventy years from our first to our last census
yet taken. It is seen that the ratio of increase, at no one of these seven
periods is either two per cent. below or two per cent. above the
average; thus showing how inflexible, and, consequently, how
reliable, the law of increase in our case is. Assuming that it will
continue, gives the following results:

1870 42,323,341
1880 56,967,216
1890 76,677,872
1900 103,208,415
1910 138,918,526
1920 186,984,335
1930 251,680,914

“These figures show that our country may be as populous as Europe now is at some point between 1920 and 1930—say about
1925—our territory, at seventy-three and a third persons to the
square mile, being of capacity to contain 217,186,000.
“And we will reach this, too, if we do not ourselves relinquish the
chance by the folly and evils of disunion, or by long and exhausting
war springing from the only great element of national discord among
us. While it cannot be foreseen exactly how much one huge example
of secession, breeding lesser ones indefinitely, would retard
population, civilization, and prosperity, no one can doubt that the
extent of it would be very great and injurious.
“The proposed emancipation would shorten the war, perpetuate
peace, insure this increase of population, and proportionately the
wealth of the country. With these, we should pay all the
emancipation would cost, together with our other debt, easier than
we should pay our other debt without it. If we had allowed our old
national debt to run at six per cent. per annum, simple interest, from
the end of our revolutionary struggle until to-day, without paying
anything on either principal or interest, each man of us would owe
less upon that debt now than each man owed upon it then; and this
because our increase of men through the whole period has been
greater than six per cent.; has run faster than the interest upon the
debt. Thus, time alone relieves a debtor nation, so long as its
population increases faster than unpaid interest accumulates on its
debt.
“This fact would be no excuse for delaying payment of what is
justly due; but it shows the great importance of time in this
connection—the great advantage of a policy by which we shall not
have to pay until we number a hundred millions, what, by a different
policy, we would have to pay now, when we number but thirty-one
millions. In a word, it shows that a dollar will be much harder to pay
for the war than will be a dollar for emancipation on the proposed
plan. And then the latter will cost no blood, no precious life. It will be
a saving of both.”
Various propositions and measures relating to compensated
emancipation were afterwards considered in both Houses, but the
subject was dropped in March, 1863, after a refusal of the House to
suspend the rules for its consideration.
Emancipation as a War Necessity.

Before the idea of compensated emancipation had been dropped, and
while it was constantly discouraged by the Democrats and Border
Statesmen, President Lincoln had determined upon a more radical
policy, and on the 22d of September, 1862, issued his celebrated
proclamation declaring that he would emancipate “all persons held
as slaves within any State or designated part of a State, the people
whereof shall be in rebellion against the United States”—by the first
of January, 1863, if such sections were not “in good faith represented
in Congress.” He followed this by actual emancipation at the time
stated.
