Synthetic Data For Deep Learning: Generate Synthetic Data For Decision Making and Applications With Python and R 1st Edition Necmi Gürsakal

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 70

Synthetic Data for Deep Learning:

Generate Synthetic Data for Decision


Making and Applications with Python
and R 1st Edition Necmi Gürsakal
Visit to download the full and correct content document:
https://ebookmass.com/product/synthetic-data-for-deep-learning-generate-synthetic-d
ata-for-decision-making-and-applications-with-python-and-r-1st-edition-necmi-gursak
al/
More products digital (pdf, epub, mobi) instant
download maybe you interests ...

Synthetic Data for Deep Learning: Generate Synthetic


Data for Decision Making and Applications with Python
and R 1st Edition Necmi Gürsakal

https://ebookmass.com/product/synthetic-data-for-deep-learning-
generate-synthetic-data-for-decision-making-and-applications-
with-python-and-r-1st-edition-necmi-gursakal/

Data Universe: Organizational Insights with Python:


Embracing Data Driven Decision Making Van Der Post

https://ebookmass.com/product/data-universe-organizational-
insights-with-python-embracing-data-driven-decision-making-van-
der-post/

Beginner's Guide to Streamlit with Python: Build Web-


Based Data and Machine Learning Applications 1st
Edition Sujay Raghavendra

https://ebookmass.com/product/beginners-guide-to-streamlit-with-
python-build-web-based-data-and-machine-learning-
applications-1st-edition-sujay-raghavendra/

More Judgment Than Data: Data Literacy and Decision-


Making Michael Jones

https://ebookmass.com/product/more-judgment-than-data-data-
literacy-and-decision-making-michael-jones/
Machine Learning with Python for Everyone (Addison
Wesley Data & Analytics Series) 1st Edition, (Ebook
PDF)

https://ebookmass.com/product/machine-learning-with-python-for-
everyone-addison-wesley-data-analytics-series-1st-edition-ebook-
pdf/

(eBook PDF) Intro to Python for Computer Science and


Data Science: Learning to Program with AI, Big Data and
The Cloud

https://ebookmass.com/product/ebook-pdf-intro-to-python-for-
computer-science-and-data-science-learning-to-program-with-ai-
big-data-and-the-cloud/

Modern Business Analytics: Practical Data Science for


Decision-making Matt Taddy

https://ebookmass.com/product/modern-business-analytics-
practical-data-science-for-decision-making-matt-taddy/

Data Analysis for the Life Sciences with R 1st Edition

https://ebookmass.com/product/data-analysis-for-the-life-
sciences-with-r-1st-edition/

Data Mining for Business Analytics: Concepts,


Techniques and Applications in Python eBook

https://ebookmass.com/product/data-mining-for-business-analytics-
concepts-techniques-and-applications-in-python-ebook/
Synthetic
Data for
Deep Learning
Generate Synthetic Data for Decision
Making and Applications with
Python and R

Necmi Gürsakal
Sadullah Çelik
Esma Birişçi
Synthetic Data for Deep
Learning
Generate Synthetic Data for Decision
Making and Applications with Python
and R

Necmi Gürsakal
Sadullah Çelik
Esma Birişçi
Synthetic Data for Deep Learning: Generate Synthetic Data for Decision Making and
Applications with Python and R
Necmi Gürsakal Sadullah Çelik
Bursa, Turkey Aydın, Turkey

Esma Birişçi
Bursa, Turkey

ISBN-13 (pbk): 978-1-4842-8586-2 ISBN-13 (electronic): 978-1-4842-8587-9


https://doi.org/10.1007/978-1-4842-8587-9

Copyright © 2022 by Necmi Gürsakal, Sadullah Çelik, and Esma Birişçi


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with
every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an
editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the
trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not
identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to
proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or
omissions that may be made. The publisher makes no warranty, express or implied, with respect to the
material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Celestin Suresh John
Development Editor: Laura Berendson
Coordinating Editor: Mark Powers
Cover designed by eStudioCalamar
Cover image by Simon Lee on Unsplash (www.unsplash.com)
Distributed to the book trade worldwide by Apress Media, LLC, 1 New York Plaza, New York, NY 10004,
U.S.A. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.
springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science
+ Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail booktranslations@springernature.com; for reprint,
paperback, or audio rights, please e-mail bookpermissions@springernature.com.
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and
licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales
web page at http://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to
readers on GitHub (https://github.com/Apress). For more detailed information, please visit http://www.
apress.com/source-code.
Printed on acid-free paper
This book is dedicated to our mothers.
Table of Contents
About the Authors���������������������������������������������������������������������������������������������������� ix

About the Technical Reviewer��������������������������������������������������������������������������������� xi


Preface������������������������������������������������������������������������������������������������������������������ xiii

Introduction�������������������������������������������������������������������������������������������������������������xv

Chapter 1: An Introduction to Synthetic Data����������������������������������������������������������� 1


What Synthetic Data is?���������������������������������������������������������������������������������������������������������������� 1
Why is Synthetic Data Important?������������������������������������������������������������������������������������������� 2
Synthetic Data for Data Science and Artificial Intelligence����������������������������������������������������� 3
Accuracy Problems����������������������������������������������������������������������������������������������������������������������� 4
The Lifecycle of Data��������������������������������������������������������������������������������������������������������������������� 5
Data Collection versus Privacy������������������������������������������������������������������������������������������������������ 7
Data Privacy and Synthetic Data��������������������������������������������������������������������������������������������� 8
Synthetic Data and Data Quality������������������������������������������������������������������������������������������������� 10
Aplications of Synthetic Data������������������������������������������������������������������������������������������������������ 10
Financial Services����������������������������������������������������������������������������������������������������������������� 11
Manufacturing����������������������������������������������������������������������������������������������������������������������� 12
Healthcare����������������������������������������������������������������������������������������������������������������������������� 14
Automotive���������������������������������������������������������������������������������������������������������������������������� 16
Robotics��������������������������������������������������������������������������������������������������������������������������������� 17
Security��������������������������������������������������������������������������������������������������������������������������������� 18
Social Media�������������������������������������������������������������������������������������������������������������������������� 19
Marketing������������������������������������������������������������������������������������������������������������������������������ 20
Natural Language Processing������������������������������������������������������������������������������������������������ 21

v
Table of Contents

Computer Vision�������������������������������������������������������������������������������������������������������������������� 22
Summary������������������������������������������������������������������������������������������������������������������������������������ 27
References���������������������������������������������������������������������������������������������������������������������������������� 27

Chapter 2: Foundations of Synthetic data�������������������������������������������������������������� 31


How to Generated Fair Synthetic Data?�������������������������������������������������������������������������������������� 31
Generating Synthetic Data in A Simple Way�������������������������������������������������������������������������������� 32
Using Video Games to Create Synthetic Data������������������������������������������������������������������������������ 37
The Synthetic-to-Real Domain Gap��������������������������������������������������������������������������������������������� 42
Bridging the Gap�������������������������������������������������������������������������������������������������������������������� 42
Is Real-World Experience Unavoidable?�������������������������������������������������������������������������������� 49
Pretraining����������������������������������������������������������������������������������������������������������������������������� 50
Reinforcement Learning�������������������������������������������������������������������������������������������������������� 51
Self-Supervised Learning������������������������������������������������������������������������������������������������������ 53
Summary������������������������������������������������������������������������������������������������������������������������������������ 54
References���������������������������������������������������������������������������������������������������������������������������������� 55

Chapter 3: Introduction to GANs����������������������������������������������������������������������������� 61


GANs������������������������������������������������������������������������������������������������������������������������������������������� 61
CTGAN������������������������������������������������������������������������������������������������������������������������������������ 63
SurfelGAN������������������������������������������������������������������������������������������������������������������������������ 64
Cycle GANs���������������������������������������������������������������������������������������������������������������������������� 65
SinGAN-Seg��������������������������������������������������������������������������������������������������������������������������� 66
MedGAN��������������������������������������������������������������������������������������������������������������������������������� 66
DCGAN����������������������������������������������������������������������������������������������������������������������������������� 67
WGAN������������������������������������������������������������������������������������������������������������������������������������� 68
SeqGAN���������������������������������������������������������������������������������������������������������������������������������� 69
Conditional GAN��������������������������������������������������������������������������������������������������������������������� 70
BigGAN���������������������������������������������������������������������������������������������������������������������������������� 71
Summary������������������������������������������������������������������������������������������������������������������������������������ 72
References���������������������������������������������������������������������������������������������������������������������������������� 72

vi
Table of Contents

Chapter 4: Synthetic Data Generation with R��������������������������������������������������������� 75


Basic Functions Used in Generating Synthetic Data������������������������������������������������������������������� 75
Creating a Value Vector from a Known Univariate Distribution���������������������������������������������� 77
Vector Generation from a Multi-Levels Categorical Variable������������������������������������������������� 78
Multivariate��������������������������������������������������������������������������������������������������������������������������� 78
Multivariate (with correlation)����������������������������������������������������������������������������������������������� 79
Generating an Artificial Neural Network Using Package “nnet” in R������������������������������������������ 84
Augmented Data�������������������������������������������������������������������������������������������������������������������� 90
Image Augmentation Using Torch Package��������������������������������������������������������������������������������� 97
Multivariate Imputation Via “mice” Package in R��������������������������������������������������������������������� 102
Generating Synthetic Data with the “conjurer” Package in R��������������������������������������������������� 114
Creat a Customer����������������������������������������������������������������������������������������������������������������� 115
Creat a Product�������������������������������������������������������������������������������������������������������������������� 117
Creating Transactions���������������������������������������������������������������������������������������������������������� 118
Generating Synthetic Data��������������������������������������������������������������������������������������������������� 119
Generating Synthetic Data with “Synthpop” Package In R������������������������������������������������������� 121
Copula��������������������������������������������������������������������������������������������������������������������������������� 145
t Copula������������������������������������������������������������������������������������������������������������������������������� 147
Normal Copula��������������������������������������������������������������������������������������������������������������������� 150
Gaussian Copula������������������������������������������������������������������������������������������������������������������ 153
Summary���������������������������������������������������������������������������������������������������������������������������������� 157
References�������������������������������������������������������������������������������������������������������������������������������� 157

Chapter 5: Synthetic Data Generation with Python���������������������������������������������� 159


Data Generation with Know Distribution����������������������������������������������������������������������������������� 159
Data with Date information������������������������������������������������������������������������������������������������� 163
Data with Internet information�������������������������������������������������������������������������������������������� 163
A more complex and comprehensive example�������������������������������������������������������������������� 163
Synthetic Data Generation in Regression Problem������������������������������������������������������������������� 164
Gaussian Noise Apply to Regression Model������������������������������������������������������������������������ 168

vii
Table of Contents

Friedman Functions and Symbolic Regression������������������������������������������������������������������������� 172


Make 3d Plot������������������������������������������������������������������������������������������������������������������������ 174
Make3d Plot������������������������������������������������������������������������������������������������������������������������� 177
Synthetic data generation for Classification and Clustering Problems������������������������������������� 182
Classification Problems������������������������������������������������������������������������������������������������������� 183
Clustering Problems������������������������������������������������������������������������������������������������������������ 194
Generation Tabular Synthetic Data by Applying GANs��������������������������������������������������������������� 203
Synthetic data Generation��������������������������������������������������������������������������������������������������� 205
Summary���������������������������������������������������������������������������������������������������������������������������������� 214
Reference���������������������������������������������������������������������������������������������������������������������������������� 214

Index��������������������������������������������������������������������������������������������������������������������� 215

viii
About the Authors
Necmi Gürsakal a statistics professor at Mudanya University
in Turkey, where he shares his experience and knowledge
with his students. Before that, he worked as a faculty member
at the Econometrics Department Bursa Uludağ University
for more than 40 years. Necmi has many published Turkish
books and English and Turkish articles on data science,
machine learning, artificial intelligence, social network
analysis, and big data. In addition, he has served as a
consultant to various business organizations.

Sadullah Çelik a mathematician, statistician, and data


scientist who completed his undergraduate and graduate
education in mathematics and his doctorate in statistics. He
has written Turkish and English numerous articles on big
data, data science, machine learning, multivariate statistics,
and network science. He developed his programming and
machine learning knowledge while writing his doctoral
thesis, Big Data and Its Applications in Statistics. He has
been working as a Research Assistant at Adnan Menderes
University Aydin, for more than 8 years and has extensive
knowledge and experience in big data, data science, machine
learning, and statistics, which he passes on to his students.

ix
About the Authors

Esma Birişçi a programmer, statistician, and operations


researcher with more than 15 years of experience in
computer program development and five years in teaching
students. She developed her programming ability while
studying for her bachelor degree, and knowledge of machine
learning during her master degree program. She completed
her thesis about data augmentation and supervised learning.
Esma transferred to Industrial Engineering and completed
her doctorate program on dynamic and stochastic nonlinear
programming. She studied large-scale optimization and life
cycle assessment, and developed a large-­scale food supply chain system application
using Python. She is currently working at Bursa Uludag University, Turkey, where she
transfers her knowledge to students. In this book, she is proud to be able to explain
Python’s powerful structure.

x
About the Technical Reviewer
Fatih Gökmenoğlu is a researcher focused on synthetic
data, computational intelligence, domain adaptation, and
active learning. He also likes reporting on the results of his
research.
His knowledge closely aligns with computer vision,
especially with deepfake technology. He studies both the
technology itself and ways of countering it.
When he’s not on the computer, you’ll likely find him
spending time with his little daughter, whose development
has many inspirations for his work on machine learning.

xi
Preface
In 2017, The Economist wrote, “The world’s most valuable resource is no longer oil,
but data,” and this becomes truer with every passing day. The gathering and analysis
of massive amounts of data drive the business world, public administration, and
science, giving leaders the information they need to make accurate, strategically-sound
decisions. Although some worry about the implications of this new “data economy,” it is
clear that data is here to stay. Those who can harness the power of data will be in a good
position to shape the future.
To use data ever more efficiently, machine and deep learning—forms of artificial
intelligence (AI)—continue to evolve. And every new development in how data and
AI are used impacts innumerable areas of everyday life. In other words, from banking
to healthcare to scientific research to sports and entertainment, data has become
everything. But, for privacy reasons, it is not always possible to find sufficient data.
As the lines between the real and virtual worlds continue to blur, data scientists have
begun to generate synthetic data, with or without real data, to understand, control, and
regulate decision-making in the real world. Instead of focusing on how to overcome
barriers to data, data professionals have the option of either transforming existing data
for their specific use or producing it synthetically. We have written this book to explore
the importance and meaning of these two avenues through real-world examples. If you
work with or are interested in data science, statistics, machine learning, deep learning, or
AI, this book is for you.
While deep learning models’ huge data needs are a bottleneck for such applications,
synthetic data has allowed these models to be, in a sense, self-fueled. Synthetic data
is still an emerging topic, from healthcare to retail, manufacturing to autonomous
driving. It should be noted that since labeling processes start with real data. Real data,
augmented data, and synthetic data all take place in these deep learning processes.
This book includes examples of Python and R applications for synthetic data
production. We hope that it proves to be as comprehensive as you need it to be.

—Necmi Gürsakal
— Sadullah Çelik
— Esma Birişçi

xiii
Introduction
“The claim is that nature itself operates in a way that is analogous to a priori reasoning.
The way nature operates is, of course, via causation: the processes we see unfolding
around us are causal processes, with earlier stages linked to later ones by causal
relations” [1]. Data is extremely important in the operation of causal relationships and
can be described as the “sine qua non” of these processes. In addition, data quality is
related to quantity and diversity, especially in the AI framework.
Data is the key to understanding causal relationships. Without data, it would
be impossible to understand how the world works. The philosopher David Hume
understood this better than anyone. According to Hume, our knowledge of the world
comes from our experiences. Experiences produce data, which can be stored on a
computer or in the cloud. Based on this data, we can make predictions about what will
happen in the future. These predictions allow us to test our hypotheses and theories. If
our predictions are correct, we can have confidence in our ideas. If they are wrong, we
need to rethink our hypotheses and theories. This cycle of testing and refinement is how
we make progress in science and life. This is how we make progress as scientists and as
human beings.
Many of today’s technology giants, such as Amazon, Facebook, and Google, have
made data-driven decision-making the core of their business models. They have done
this by harnessing the power of big data and AI to make decisions that would otherwise
be impossible. In many ways, these companies are following in the footsteps of Hume,
using data to better understand the world around them.
As technology advances, how we collect and store data also changes. In the past,
data was collected through experiments and observations made by scientists. However,
with the advent of computers and the internet, data can now be collected automatically
and stored in a central location. This has led to a change in the way we think about
knowledge. Instead of knowledge being stored in our minds, it is now something that is
stored in computers and accessed through algorithms.
This change in the way we think about knowledge has had a profound impact on the
way we live and work. In the past, we would have to rely on our memory and experience
to make decisions. However, now we can use data to make more informed decisions.

xv
Introduction

For example, we can use data about the past behavior of consumers to predict what
they might buy in the future. This has led to a more efficient and effective way of doing
business.
In the age of big data, it is more important than ever to have high-quality data to
make accurate predictions. However, it is not only the quantity and quality of the data
that is important but also the diversity. The diversity of data sources is important to
avoid bias and to get a more accurate picture of the world. This is because different data
sources can provide different perspectives on the same issue, which can help to avoid
bias. Furthermore, more data sources can provide a more complete picture of what is
happening in the world.

Machine Learning
In recent years, a method has been developed to teach machines to see, read, and hear
via data input. The point of origin for this is what we think of in the brain as producing
output bypassing inputs through a large network of neurons. In this framework, we
are trying to give machines the ability to learn by modeling artificial neural networks.
Although some authors suggest that the brain does not work that way, this is the path
followed today.
Many machines learning projects in new application areas began with the labeling
of data by humans to initiate machine training. These projects were categorized under
the title of supervised learning. This labeling task is similar to the structured content
analysis applied in social sciences and humanities. Supervised learning is a type of
machine learning that is based on providing the machine with training data that is
already labeled. This allows the machine to learn and generalize from the data to make
predictions about new data. Supervised learning is a powerful tool for many machine
learning applications.
The quality of data used in machine learning studies is crucial for the accuracy
of the findings. A study by Geiger et al. (2020) showed that the data used to train a
machine learning model for credit scoring was of poor quality, which led to an unfair
and inaccurate model. The study highlights the importance of data quality in machine
learning research. Data quality is essential for accurate results. Furthermore, the study
showed how data labeling impacts data quality. About half of the papers using original
human annotation overlap with other papers to some extent, and about 70% of the

xvi
Introduction

papers that use multiple overlaps report metrics of inter-annotator agreement [2]. This
suggests that the data used in these studies is unreliable and that further research is
needed to improve data quality.
As more business decisions are informed by data analysis, more companies are
built on data. However, data quality remains a problem. Unfortunately, “garbage in,
garbage out,” which was a frequently used motto about computers in the past, is valid
in the sense of data sampling, which is also used in the framework of machine learning.
According to the AI logic most employed today, if qualified college graduates have been
successful in obtaining doctorates in the past, they will remain doing so in the future.
In this context, naturally, the way to get a good result in machine learning is to include
“black swans” in our training data, and this is also a problem with our datasets.
A “black swan” is a term used to describe outliers in datasets. It is a rare event that
is difficult to predict and has a major impact on a system. In machine learning, a black
swan event is not represented in the training data but could significantly impact the
results of the machine learning algorithm. Black swans train models to be more robust
to unexpected inputs. It is important to include them in training datasets to avoid biased
results.
Over time, technological development has moved it into the framework of human
decision-making with data and into the decision-making framework of machines. Now,
machines evaluate big data and make decisions with algorithms written by humans.
For example, a driverless car can navigate toward the desired destination by constantly
collecting data on stationary and moving objects around it in various ways. Autonomous
driving is a very important and constantly developing application area for synthetic data.
Autonomous driving systems should be developed at a capability level that can solve
complex and varied traffic problems in simulation. The scenarios we mentioned in these
simulations are sometimes made by gaming engines such as Unreal and Unity. Creating
accurate and useful “synthetic data” with simulations based on real data will be the way
companies will prefer real data that cannot be easily found.
Synthetic data is becoming an increasingly important tool for businesses looking to
improve their AI initiatives and overcome many of the associated challenges. By creating
synthetic data, businesses can shape and form data to their needs and augment and
de-bias their datasets. This makes synthetic data an essential part of any AI strategy.
DataGen, Mostly, Cvedia, Hazy, AI.Reverie, Omniverse, and Anyverse can be counted
among the startups that produce synthetic data. Sample images from synthetic outdoor
datasets produced by such companies can be seen in the given source.

xvii
Introduction

In addition to the benefits mentioned, synthetic data can also help businesses train
their AI models more effectively and efficiently. Businesses can avoid the need for costly
and time-consuming data collection processes by using synthetic data. This can help
businesses save money and resources and get their AI initiatives up and running more
quickly.

Who Is This Book For?


The book is meant for people who want to learn about synthetic data and its
applications. It will prove especially useful for people working in machine learning and
computer vision, as synthetic data can be used to train machine learning models that
can make more accurate predictions about real-world data.
The book is written for the benefit of data scientists, machine learning engineers,
deep learning practitioners, artificial intelligence researchers, data engineers, business
analysts, information technology professionals, students, and anyone interested in
learning more about synthetic data and its applications.

Book Structure
Synthetic data is not originally collected from real-world sources. It is generated by
artificial means, using algorithms or mathematical models, and has many applications
in deep learning, particularly in training neural networks. This book, which discusses the
structure and application of synthetic data, consists of five chapters.
Chapter 1 covers synthetic data, why it is important, and how it can be used in
data science and artificial intelligence applications. This chapter also discusses the
accuracy problems associated with synthetic data, the life cycle of data, and the
tradeoffs between data collection and privacy. Finally, this chapter describes some
applications of synthetic data, including financial services, manufacturing, healthcare,
automotive, robotics, security, social media, marketing, natural language processing,
and computer vision.
Chapter 2 provides information about different ways of generating synthetic data. It
covers how to generate fair synthetic data, as well as how to use video games to create
synthetic data. The chapter also discusses the synthetic-to-real domain gap and how
to overcome it using domain transfer, domain adaptation, and domain randomization.

xviii
Introduction

Finally, the chapter discusses whether a real-world experience is necessary for training
machine learning models and, if not, how to achieve it using pretraining, reinforcement
learning, and self-supervised learning.
Chapter 3 explains the content and purpose of a generative adversarial network, or
GAN, a type of AI used to generate new data, like training data.
Chapter 4 explores synthetic data generation with R.
Chapter 5 covers different methods of synthetic data generation with Python.

Learning Outcomes of the Book


Readers of this book will learn about the various types of synthetic data, how to create
them, and their benefits and challenges. They will also learn about its importance in data
science and artificial intelligence. Furthermore, readers will come away understanding
how to employ automatic data labeling and how GANs can be used to generate synthetic
data. Lastly, readers who complete this book will know how to generate synthetic data
using the R and Python programming languages.

Source Code
The datasets and source code used in this book can be downloaded from ­github.com/
apress/synthetic-data-deep-learning.

References
[1]. M. Rozemund, “The Nature of the Mind,” in The Blackwell Guide
to Descartes’ Meditations, S. Gaukroger, John Wiley & Sons, 2006.

[2]. R. S. Geiger et al., “Garbage In, Garbage Out?,” in Proceedings of the


2020 Conference on Fairness, Accountability, and Transparency,
Jan. 2020, pp. 325–336. doi: 10.1145/3351095.3372862.

xix
CHAPTER 1

An Introduction
to Synthetic Data
In this chapter, we will explore the concept of data and its importance in today’s world.
We will discuss the lifecycle of data from collection to storage and how synthetic data can
be used to improve accuracy in data science and artificial intelligence (AI) applications.
Next, we will explore of synthetic data applications in financial services, manufacturing,
healthcare, automotive, robotics, security, social media, and marketing. Finally, we
will examine natural language processing, computer vision, understanding of visual
scenes, and segmentation problems in terms of synthetic data.

What Synthetic Data is?


Despite 21st-century advances in data collection and analysis, there is still a lack of
understanding of how to properly utilize data to minimize the perceived ambiguity or
subjectivity of the information it represents. This is because the same meaning can be
expressed in a variety of ways, and a single expression can have multiple meanings. As
a result, it is difficult to create a comprehensive framework for data interpretation that
considers all of the potential nuances and implications of the information. One way to
overcome this challenge is to develop standardized methods for data collection and
analysis. This will ensure that data is collected consistently and that the results can be
compared across different studies and synthetic data can help us do just that.
People generally view synthetic data as being less reliable than data that is obtained
by direct measurement. Put simply, Synthetic data is data that is generated by a
computer program rather than being collected from real-world sources. While synthetic
data is often less reliable than data that is collected directly from the real world, it is still
an essential tool for data scientists. This is because synthetic data can be used to test

1
© Necmi Gürsakal, Sadullah Çelik, and Esma Birişçi 2022
N. Gürsakal et al., Synthetic Data for Deep Learning, https://doi.org/10.1007/978-1-4842-8587-9_1
Chapter 1 An Introduction to Synthetic Data

hypotheses and models before they are applied to real-world data. This can help data
scientists avoid making errors that could have negative consequences in the real world.
Synthetic data that is artificially generated by a computer program or simulation,
rather than being collected from real-world sources [11]. When we examine this
definition, we see that the following concepts are included in the definition:
“Annotated information, computer simulations, algorithm, and “not measured in a
real-world”.
The key features of synthetic data are as follows:

• Not obtained by direct measurement

• Generated via an algorithm

• Associated with a mathematical or statistical model

• Mimics real data

Now let’s explain why synthetic data is important.

Why is Synthetic Data Important?


Humans have a habit of creating synthetic versions of expensive products. Silk is an
expensive product that began to be used thousands of years ago, and rayon was created
in the 1880s. So, it’s little surprise that people would do the same with data, choosing
to produce synthetic data because it is cost-effective. As mentioned earlier, synthetic
data allows scientists to test hypotheses and models in a controlled environment it can
also be used to create “what if” scenarios, helping data scientists to better understand
the outcomes of their models.
Likewise, synthetic data can be used in a variety of ways to improve machine
learning models and protect the privacy of real data. First, synthetic data can be used
to train machine learning models when real data is not available. This is especially
important for developing countries, where data is often scarce. Second, synthetic data
can be used to test machine learning models before deploying them on real data. This is
important for ensuring that models work as intended and won’t compromise the real
data. Finally, synthetic data can be used to protect the privacy of real data by generating
data that is similar to real data but does not contain any personal information.
Synthetic data also provides more control than real data. Actual data comes from
many different sources, which can result in datasets so large and diverse that they
become unwieldy. Because synthetic data is created using a model whose data is
2
Chapter 1 An Introduction to Synthetic Data

generated for a specific purpose, it will not be randomly scattered. In some cases,
synthetic data may even be of a higher quality than real data. Actual data may need to be
over-processed when necessary, and too much data may be processed when necessary.
These actions can reduce the quality of the data. Synthetic data, on the other hand, can
be of higher quality, thanks to the model used to generate the data.
Overall, synthetic data has many advantages over real data. Synthetic data is more
controlled, of higher quality, and can be generated in the desired quantities. These
factors make synthetic data a valuable tool for many applications. A final reason why
synthetic data is important is that it can be used to generate data for research purposes,
allowing researchers to study data that is not biased or otherwise not representative of
the real data.
Now let’s explain the importance of synthetic data for data science and artificial
intelligence.

Synthetic Data for Data Science and Artificial Intelligence


The use of synthetic data is not a new concept. In the early days of data science and AI,
synthetic data was used to train machine learning models. However, synthetic data of the
past was often low-quality and not realistic enough to be useful for training today’s more
sophisticated AI models.
Recent advances in data generation techniques, such as Generative Adversarial
Networks (GANs), have made it possible to generate synthetic data that is virtually
indistinguishable from real-world data. This high-quality synthetic data is often referred
to as “realistic synthetic data”.
The use of realistic synthetic data has the potential to transform the data science and
AI fields. Realistic synthetic data can be used to train machine learning models without
the need for real-world data. This is especially beneficial in cases where real-world data
is scarce, expensive, or difficult to obtain.
In addition, realistic synthetic data can be used to create “virtual environments” for
testing and experimentation. These virtual environments can be used to test machine
learning models in a safe and controlled manner, without the need for real-world data.
For example, a computer algorithm might be used to generate realistic-looking
images of people or objects. This could be used to train a machine learning system
to better recognize these objects in real-world images. Alternatively, synthetic data
could be used instead of real-world data if the latter is not available or is too expensive
to obtain.
3
Chapter 1 An Introduction to Synthetic Data

Overall, the use of synthetic data is a promising new trend in data science and
AI. The ability to generate high-quality synthetic data is opening new possibilities for
training and experimentation. For example, a synthetic data set could be created that
contains 6000 words, instead of the usual 2000. This would allow the AI system to learn
from a larger and more diverse data set, which would in turn improve its performance
on real-world data. In the future, synthetic data is likely to play an increasingly important
role in the data science and AI fields.
Let us now consider accuracy problems in terms of synthetic data.

Accuracy Problems
Supervised learning algorithms are trained with labeled data. In this method, the data
is commonly named “ground truth”, and the test data is called “holdout data”. We have
three types to compare accuracies across algorithms [2]:

• Estimator score method: An estimator is a number that is used to


estimate, or guess, the value of something. The score method is a
way to decide how good an estimator is by analyzing how close the
estimator’s guesses are to the actual value.

• Scoring parameter: Cross-validation is a model-evaluation technique


that relies on an internal scoring strategy.

• Metric functions: The sklearn.metrics module provides functions for


assessing prediction error for specific purposes.

It’s important to acknowledge that the accuracy of synthetic data can be


problematic for several reasons. First, the data may be generated by a process that is not
representative of the real-world process that the data is meant to represent. This can lead
to inaccuracies in the synthetic data that may not be present in real-world data. Second,
the data may be generated with a specific goal in mind, such as training a machine
learning algorithm, that does not match the goal of the data’s user of the synthetic data.
This can also lead to inaccuracies in the synthetic data. Finally, synthetic data may be
generated using a stochastic process, which can introduce randomness into the data that
may not be present in real-world data. This randomness can also lead to inaccuracies in
the synthetic data.

4
Chapter 1 An Introduction to Synthetic Data

One way to overcome potential issues with accuracy in machine learning is to


use synthetic data. This can be done by automatically tagging and preparing data for
machine learning algorithms, which cuts down on the time and resources needed to
create a training dataset. This also creates a more consistent dataset that is less likely
to contain errors. Another way to improve accuracy in machine learning is to use a
larger training dataset. This will typically result in better performance from the machine
learning algorithm.
Working on the recognition and classification of aircraft from satellite photos, the
Airbus and OneView companies, in their studies on data for machine learning, achieved
accuracy of 88% versus 82% with the simulated dataset of OneView company, compared
to data consisting of only of real data. When real data and synthetic data are used
in a mixed way, an accuracy of ~ 90% is obtained, and this number represents an
8% improvement over real-only real data [3]. This improved accuracy is due to the
increased variety of data that is available when both real and simulated data are used.
The increased variety of data allows the machine learning algorithm to better learn the
underlying patterns of the data. This improved accuracy is significant and can lead to
better decision-making in a variety of applications.
Now let’s examine the life cycle of data in terms of synthetic data.

The Lifecycle of Data


Before leveraging the power of synthetic data, it’s important to understand the lifecycle
of data. First, it can help organizations to better manage their data; by understanding
the stages that data goes through, organizations can more effectively control how data
is used and prevent unauthorized access. Additionally, the data lifecycle can help
organizations ensure that their data is of high quality. Finally, the data lifecycle can help
organizations plan for the eventual destruction of data; by understanding when data
is no longer needed, organizations can ensure that they do not keep data longer than
necessary, which can both save space and reduce costs.
The data lifecycle is the process of managing data from its creation to its eventual
disposal. Figure 1-1 shows the five main phases of the data lifecycle.

5
Chapter 1 An Introduction to Synthetic Data

ĂƚĂ ĐƌĞĂƟŽŶ

ĂƚĂ ĞŶƚƌLJ ĂŶĚ


ĂƚĂĚŝƐƉŽƐĂů
ƐƚŽƌĂŐĞ

ĂƚĂ ůŝĨĞĐLJĐůĞ

ĂƚĂ ŽƵƚƉƵƚ
ĂŶĚ ĂƚĂƉƌŽƐƐĞƐŝŶŐ
ĚŝƐƐĞŵŝŶĂƟŽŶ

Figure 1-1. Data lifecycle

In the context of synthetic data, the data life cycle refers to the process of generating,
storing, manipulating, and outputting synthetic data. This process is typically carried
out by computers and involves the use of algorithms and rules to generate data that
resembles real-world data.
Following are the five stages of the data lifecycle:

1. Data creation: The first stage of the data lifecycle is data creation. This is
the stage at which synthetic data is first generated, either through direct
input or through the capture of information from an external source.

2. Data entry and storage: This stage involves the entry of synthetic
data into a computer system and its storage in a database. Data
entry and storage typically involve the use of algorithms or rules to
generate data that resembles real-world data.

3. Data processing: This stage covers the manipulation of synthetic


data within the computer system, to convert it into a format
that is more useful format for users. This may involve the use of
algorithms and the application of rules and filters. Data processing
typically involves the use of algorithms or rules to generate data
that resembles real-world data.
6
Chapter 1 An Introduction to Synthetic Data

4. Data output and dissemination: This stage is the process of


generating synthetic data from a computer system and making it
available to users. This may involve the generation of reports, the
creation of graphs and charts, or the output of data in a format
that can be imported into another system.

5. Data disposal: The final stage of the data lifecycle is data disposal.
This stage covers the disposal of synthetic data that is no longer
needed. This may involve the deletion of data from a database or
the physical destruction of storage media. Data disposal typically
involves the use of algorithms or rules to generate.

Reinforcement learning algorithms are used to learn how to do things by interacting


with an environment. However, these algorithms can be inefficient, meaning they need
a lot of interactions to learn well. To address this issue, some people are using external
sources of knowledge, such as data from demonstrations or observations. This data can
come from experts, real-world demonstrations, simulations, or synthetic demonstrations.
Researchers at Google and DeepMind think that datasets also have a lifecycle and
they summarize the lifecycle of datasets in three stages: Producing the data, consuming
the data, and sharing the data [4].
In the production data phase, users record their interactions with the environment
and provide datasets. At this stage, users add additional information to the data
automatically or manually labeling or filtering the data.
In the consuming the data phase, researchers analyze and visualize datasets or use
them to train algorithms for machine learning purposes.
In the sharing the data stage, researchers often share their data with other
researchers to help with their reserach. When researchers share data, it makes it easier
for other researchers to run and validate new algorithms. However, the researchers who
produced the data still own it and should be given credit for their work.
Let’s now consider data collection and privacy issues in terms of synthetic data.

Data Collection versus Privacy


Data can be collected in many ways. For example, data from radar, LIDAR, and the
camera systems of driverless cars can be taken and fused into a format usable for
decision making. Considering that fusion data is also virtual data, it is necessary to
think in detail about the importance of real and virtual data. So, since real data can be

7
Chapter 1 An Introduction to Synthetic Data

converted into virtual data, and used or augmented data can be used together with real
data in machine learning, these two data types are important for us. This is sometimes
referred to as the Ensemble Method.
In the Ensemble Method, a few basic models can be combined to create an optimal
predictive model; Why can’t the data be fused, and more qualified data can be obtained?
Labeling the data is tedious and costly, as machine learning models require large and
diverse data to produce good results. Therefore creating synthetic data by transforming
real data using with data augmentation techniques or directly generating synthetic data
and using it as an alternative to real data can reduce transaction costs. According to
Gartner, by 2030, there will be more synthetic data than real data in AI models.
Another issue that synthetic data can help overcome is that of data privacy.

Data Privacy and Synthetic Data


Today, many institutions and organizations use large amounts of data to forecast, create
policies, plan, and achieve higher profit margins. By using this data, they can better
understand the world around them and make more informed decisions. However, due
to privacy restrictions and guarantees given to personal data, only the personnel of the
institutions have full access to such data. Anonymization techniques are used to prevent
the identities of data subjects from being revealed. Sure, data collectors can maintain
data privacy by using aggregation, recoding, record exchange, suppression of sensitive
values, and random error insertion, data collectors can maintain data privacy. However,
advances in computer and cloud technologies are likely to make such measures
insufficient to maintain data privacy. We’ll explore some examples in the next section.
In today’s world, with the advances in information technology, patient data, and
driver the data, of those using vehicles the data obtained by research companies from
public opinion surveys have reached enormous amounts. However, most of the time, when
this data is used to find new solutions, the concept of “individual privacy” comes up. This
problem is overcome by anonymizing the data, which is the process of modifying the data
to eliminate any information that could lead to privacy intrusion. Anonymizing data is
important to protect people’s privacy, as even without personal identifiers, the remaining
attributes in the data may still be used to re-identify an individual. In the simplest form of
data anonymization, all personal identifiers are removed. However, it has been shown that

8
Chapter 1 An Introduction to Synthetic Data

this is not enough to protect people’s privacy. Therefore, it is important to fully anonymize
data possible to protect people’s privacy rights.
Most people think that privacy is protected by anonymization. This is when there
is no name, surname, or any sign to indicate identity in the database. However, this is
not always accurate. This means that if you have an account on both Twitter and Flickr,
there’s a good chance that someone could identify you from the anonymous Twitter
graph. However, the error rate is only 12%, so the chances are still pretty good that you
won’t be identified [14]. Even though the chances of being identified are relatively low,
it is still important to be aware of the potential risks of sharing personal information
online. Anonymity is not a guarantee of privacy, and even seemingly innocuous
information can identify individuals in certain cases. Therefore, exercising caution is
required when sharing personal information online, even if it is ostensibly anonymous.
Anonymization and labeling are two primary techniques in AI applications.
However, both techniques have their own set of problems. Anonymization can lead
to the loss of vital information, while labeling can introduce bias and be costly to
implement. In addition, hand-labeled data might not be high quality because it is often
mislabeled. To overcome these problems, researchers have proposed various methods,
such as semi-supervised learning and active learning. However, these methods are still
not perfect, and further research is needed to improve them.

The Bottom Line


The collection of more data with increasing data sources makes it necessary for businesses
to take security measures against information attacks. In some cases, businesses need more
data than is available to innovate in certain areas. In some cases, more data may be necessary
due to a lack of practical research or high costs of data collection. Many businesses generate
data programmatically in the real world to obtain otherwise unattainable information. The
use of synthetic data is becoming increasingly popular as businesses attempt to collect more
data and test different scenarios. Synthetic data is created by computer programs and is
designed to mimic real-world data. This allows businesses to gather data more efficiently
and to test various scenarios to see what may happen in the real world.
The world is becoming more data-centric, so businesses are starting to use computer
programs to create data similar to data gathered from the real world. This is useful
because it facilitates data collection and helps businesses test different scenarios to see
what will happen in the real world.
Now let’s examine synthetic data and data quality.

9
Chapter 1 An Introduction to Synthetic Data

Synthetic Data and Data Quality


When working on AI projects, it is important to focus on data quality. It all starts there; if
data quality is poor, the AI system will be fatally compromised. Data cascades can occur
when AI practitioners apply conventional AI practices that don’t value data quality. Most
AI practitioners (92%) have reported experiencing one or more data cascades. This often
happens because they applied conventional AI practices that didn’t value data quality. For
this reason, it is important to use high-quality data when training deep learning networks [5].
Andrew Ng has said that “Data is food for AI” and that the issue of data quality should be
focused on data more than the model/algorithm [6]. The use of synthetic data can help
to address the issue of data quality in AI projects. This is because synthetic data can be
generated to be of high quality, and it can be generated to be representative of the real-world
data that the AI system will be used on. This means that the AI system trained on synthetic
data will be more likely to generalize well to the real world.
AI technologies in particular use synthetic data intensively. Just a few examples
include medicine, where synthetic data is used extensively to test specific conditions
and cases for which real data is not available; self-driving cars, such as the ones used
by Uber and Google, are trained using synthetic data; fraud detection and protection
in the financial industry is facilitated using synthetic data. Synthetic data gives data
professionals access to centrally stored data while maintaining the privacy of the data. In
addition, synthetic data reproduce important features of real data without revealing its
true meaning and protects confidentiality. In research departments, on the other hand,
synthetic data is used to develop and deliver innovative products for which the necessary
data may not be available [7]. Overall, the use of synthetic data is extremely beneficial
as it allows for the testing of new products and services while maintaining the privacy of
the original data. Synthetic data is also incredibly versatile and can be used in a variety
of different industries and applications. In the future, the use of synthetic data is likely to
become even more widespread as the benefits of using it become more widely known.
Let us now examine some of synthetic data applications.

Aplications of Synthetic Data


Synthetic data is often used in financial services, manufacturing, healthcare, automotive,
robotics, security, social media, and marketing.
Let’s first quickly explore how synthetic data cen be used in finance.

10
Chapter 1 An Introduction to Synthetic Data

Financial Services
The use of synthetic data is becoming increasingly important in financial services as the
industry moves towards more data-driven decision-making. Synthetic data can be used
to supplement or replace traditional data sources, providing a more complete picture of
the underlying risk.
Financial services is an industry that is highly regulated and subject to constant
change. New rules and regulations are constantly being introduced, and the industry is
constantly evolving. As a result, it can be difficult for financial institutions to keep up with
the changes and ensure that their data is compliant.
Synthetic data can be used to generate data that is compliant with the latest rules and
regulations. This can help financial institutions avoid the costly fines and penalties that
can be associated with non-compliance. In addition, synthetic data can be used to test
new products and services before they are launched. This can help financial institutions
avoid the costly mistakes that can be made when launching new products and services.
Synthetic data can also help to improve the accuracy of risk models by providing a
more complete picture of underlying risks. For example, consider a portfolio of loans.
Traditional data sources may only provide information on the loan amount, interest rate,
and term. However, synthetic data can provide additional information on the borrower’s
credit score, employment history, and other factors that can impact the risk of default.
This additional information can help to improve the accuracy of the risk model.
Another key benefit of synthetic data is that it can provide a way to test and validate
models before they are deployed in live environments. This is because synthetic data can
be generated with known values for the inputs and outputs. This allows for the testing
of models under a variety of different scenarios, which can help to identify any potential
issues before the model is deployed in a live environment.
Synthetic data can be used in financial services in a variety of other ways. For
example, it can be used to:

• Generate realistic scenarios for stress testing and risk


management: Generating synthetic data can help financial
institutions to identify potential risks and to develop plans for dealing
with them. This can be used to generate realistic scenarios for stress
testing and risk management purposes. Doing so can help to improve
the resilience of the financial system.

11
Chapter 1 An Introduction to Synthetic Data

• Train machine learning models: Synthetic data can help train


machine learning models for tasks such as fraud detection and credit
scoring. This can automate processes for financial institutions and
make them more efficient.

• Generate synthetic transactions: Synthetic data can be used


to generate synthetic transactions, which can help financial
institutions test new products and services, or simulate market
conditions.

• Generate synthetic customer data: Financial institutions can use


synthetic data to generate synthetic customer data. This can help
them to test new customer acquisition strategies or to evaluate
customer service levels.

• Generate synthetic financial data: Synthetic data can be used


to generate synthetic financial data. This can help financial
institutions to test new financial products or to evaluate the impact of
new regulations.

Finally, synthetic data can help to reduce the cost of data acquisition and storage.
This is because synthetic data can be generated on-demand, as needed. This eliminates
the need to store large amounts of data, which can save on both the cost of data
acquisition and storage.
Now, let’s look at how synthetic data can be used in the manufacturing field.

Manufacturing
In the world of manufacturing, data is used to help inform decision-makers about
various aspects of the manufacturing process, from production line efficiency to quality
control. In some cases, this data is easy to come by- for example, data on production
line outputs can be gathered through sensors and other monitoring devices. However,
in other cases, data can be much more difficult to obtain. For example, data on the
performance of individual components within a production line may be hard to come by
or may be prohibitively expensive to gather. In these cases, synthetic data can be used to
fill in the gaps.
In many manufacturing settings, it is difficult or impossible to obtain real-world
data that can be used to train models. This is often due to the proprietary nature of

12
Chapter 1 An Introduction to Synthetic Data

manufacturing processes, which can make it difficult to obtain data from inside a
factory. Additionally, the data collected in a manufacturing setting may be too noisy or
unrepresentative to be useful for training models.
To address these issues, synthetic data can be used to train models for
manufacturing applications. However, it is important to consider both the advantages
and disadvantages of using synthetic data before deciding whether it is the right choice
for a particular application.
Synthetic data can be employed in manufacturing in several ways. First, synthetic
data can be used to train machine learning models that can be used to automate
various tasks in the manufacturing process. This can improve the efficiency of the
manufacturing process and help to reduce costs. Second, synthetic data can be used to
test and validate manufacturing processes and equipment. This can help to ensure that
the manufacturing process is running smoothly, and that the equipment is operating
correctly. Third, synthetic data can be used to monitor the manufacturing process and to
identify potential problems. This can help to improve the quality of the products being
produced and to avoid costly manufacturing defects.
Synthetic data can be used to improve the efficiency of data-driven models. This
is because synthetic data can be generated much faster than real-world data. This is
important because it is allowing manufacturers to train data-driven models faster and
get them to market quicker.
The use of synthetic data is widespread in the manufacturing industry. It helps
companies to improve product quality, reduce manufacturing costs, and improve
process efficiency. Some examples of the use of synthetic data in manufacturing are as
follows:

• Quality Control: Synthetic data can be used to create models that


predict the likelihood of defects in products. This information can be
used to improve quality control procedures.

• Cost Reduction: The use of synthetic data can help identify patterns
in manufacturing processes that lead to increased costs. This
information can be used to develop strategies for reducing costs,
thereby reducing the overall cost of production.
• Efficiency Improvement: Synthetic data can be used to create
models that predict the efficiency of manufacturing processes. This
information can be used to improve process efficiency.

13
Chapter 1 An Introduction to Synthetic Data

• Product Development: Synthetic data can help improve product


development processes by predicting the performance of new
products. In this way, it can be decided which products to monitor
and how to develop them.

• Production Planning: Production planning can be done by using


synthetic data to create models that predict the demand for products.
In this way, businesses can improve their production planning by
making better predictions about future demand.

• Maintenance: Synthetic data can be used to create models that


predict the probability of equipment failures. In this way, preventive
measures can be taken, and maintenance processes can be improved
by predicting when equipment will fail.

Now, let’s quickly explore how synthetic data can be employed in the
healthcare realm.

Healthcare
The most obvious benefit of utilizing synthetic data in healthcare is to protect the
privacy of patients. By using synthetic data, healthcare organizations can create models
and simulations that are based on real data but do not contain any actual patient
information. This can be extremely helpful in situations where patient privacy is of
paramount concern, such as when developing new treatments or testing new medical
devices.
The use of synthetic data will evolve in line with the needs and requirements of
health institutions. However, the following are some of the most common reasons why
healthcare organizations might use synthetic data include:

• Machine learning models: One of the most common reasons why


healthcare organizations use synthetic data is to train machine
learning models. This is because synthetic data can be generated in a
controlled environment, which allows for more reliable results.
• Artificial intelligence: synthetic data can be used to identify patterns
in patient data that may be indicative of a particular condition
or disease. This can then be used to help diagnose patients more
accurately and to also help predict how they are likely to respond
14
Chapter 1 An Introduction to Synthetic Data

to treatment. This is extremely important in terms of ensuring that


patients receive the most effective care possible.

• Protect privacy: One of the biggest challenges in the healthcare


industry is the reliable sharing of data. Health data is very important
for doctors to diagnose and treat patients quickly. For this reason,
many hospitals and health institutions attach great importance to
patient data. Synthetic data help provide the best possible treatment.
In addition, synthetic data is a technology that will help healthcare
organizations share information while protecting personal privacy.

• Treatments: Another common reason why healthcare organizations


use synthetic data is to test new treatments. This is because synthetic
data can be used to create realistic simulations of real-world
conditions, which can help to identify potential side effects or issues
with a new treatment before it is used on real patients.

• To help design new drugs and to test their efficacy.

• Improve patient care: Healthcare organizations can also use synthetic


data to improve patient care. This is because synthetic data can be
used to create realistic simulations of real-world conditions, which
can help healthcare professionals to identify potential issues and
make better-informed decisions about patient care.

• Reduce costs: Healthcare organizations can also use synthetic data to


reduce cost. This is because synthetic data can be generated relatively
cheaply, which can help to reduce the overall costs associated with
real-world data collection and analysis.

• Several hospitals are now using synthetic data in the health sector
to improve the quality of care that they can provide. This is being
one in several different ways, but one of the most common is to use
computer simulations. This allows for a more realistic representation
of patients and their conditions, which can then be used to test
out new treatments or procedures. This can be extremely beneficial
in reducing the risk of complications and ensuring that patients
receive the best possible care.

15
Chapter 1 An Introduction to Synthetic Data

Overall, the use of synthetic data in the health sector is extremely beneficial. It
is helping to improve the quality of care that is being provided and is also helping to
reduce the risk of complications. In addition, it is also helping to speed up the process of
diagnosis and treatment.
Now let’s look at how synthetic data can be used in the automotive industry field.

Automotive
Another application of synthetic data in the automotive industry is autonomous
driving. A large amount of data is needed to train an autonomous driving system. This
data can be used to train a machine learning model that can then be used to make
predictions about how the autonomous driving system should behave in different
situations. However, real-world data is often scarce, expensive, and difficult to obtain.
Another important application of synthetic data in automotive is in safety-critical
systems. To ensure the safety of a vehicle, it is important essential to be able to test
the systems in a variety of scenarios. Synthetic data can be used to generated data for all
the different scenarios that need to be tested. This is important because it allows for to
provide more thorough testing of system and helps ensure the safety of the vehicle.
Overall, synthetic data has to potential to be a valuable tool for the automotive
industry. It can be used to speed up the development process and to generate large
quantities of data. However, it is important to be aware of the challenges associated with
synthetic data and to ensure that it is used in a way that maximizes its benefits.
There are a few reasons why automotive companies need synthetic data. The first
has to do with the development of new technologies a large amount of data. In order to
create and test new features or technologies, companies need a large amount of data.
This data is used to train algorithms that will eventually be used in the product. However,
collecting this data can be difficult, time-consuming, and expensive.
Another reason automotive companies need synthetic data is for testing
purposes. Before a new product is released, it needs to go through rigorous testing.
This testing often includes putting the product through a range of different scenarios.
However, it can be difficult to test every single scenario in the real world. This is where
synthetic data comes in. It can be used to create realistic test scenarios that would be
difficult or impossible to re-create in the real world.

16
Chapter 1 An Introduction to Synthetic Data

Synthetic data can be used for marketing purposes. Automotive companies


also often use data to create marketing materials such as ads or website content.
However, this data can be difficult to obtain. Synthetic data can be used to create
realistic marketing scenarios that can be used to test different marketing strategies.
In conclusion, synthetic data is needed in automotive industry for a variety of
reasons. It can be used to create realistic test scenarios, train algorithms, and create
marketing materials.
Now let’s look at how synthetic data is used in the robotics field.

Robotics
Robots are machines that can be programmed to do specific tasks. Sometimes these
tasks are very simple, like moving a piece of paper from one place to another. Other
times, the tasks are more complex, like moving around in the world and doing things
that humans can do, like solving a Rubik’s Cube. Creating robots that can do complex
tasks is a challenge because the robots need a lot of training data to behave like humans.
This data can be generated by simulations, which is a way of creating a model of how the
robot will behave.
There are several reasons why synthetic data is needed in robotics. The first is that
real-world data is often scarce. This is especially true for data needed to train machine
learning models, a key component of robotics. Synthetic data can be used to supplement
real-world data and, in some cases, to replace them entirely. Second, real-world data is
often noisy. This noise can come from a variety of sources, such as sensors, actuators,
and the environment. Synthetic data can be used to generate noise-free data that can be
helpful for training machine learning models. The third reason is that collecting real-­
world data is often expensive. This is especially true for data needed to train machine
learning models. Synthetic data can be used to generate data that is much cheaper to
collect. A fourth reason is that real-world data is often biased. This bias can come from
a variety of sources, such as sensors, actuators, and the environment. Synthetic data
can be used to generate bias-free data that can be helpful for training machine learning
models. The fifth reason synthetic data is needed in robotics is that real-world data is
often unrepresentative. This is especially true for data needed to train machine learning
models. Synthetic data can be used to create data that better represents the real world,
which can be helpful for training machine learning models.

17
Chapter 1 An Introduction to Synthetic Data

Robots can learn to identify and respond to different types of objects by using
synthetic data. By learning from this data, the robot can learn how to better identify and
respond to different types of human behavior. For example, a robot might be given a set
of synthetic data that includes variousa variety of different types of human behavior and
how to respond to them.
Now let’s look at how synthetic data can be used in security field.

Security
Synthetic data can play a vital role in enhancing security, both through its ability to
train machine learning models to better detect security threats and by providing it also
provides a means way of testing security systems and measuring their effectiveness.
Machine learning models that are trained on synthetic data are more effective at
detecting security threats because they are not limited by available the real-world data
that is available synthetic data can be generated to match any desired distribution,
including distributions that are not present in the real world. This allows machine
learning models to learn more about the underlying distribution of data, and to better
identify outliers that may represent security threats.
Testing security systems with synthetic data is important because it allows a
controlled environmentets for measure the system’s performance. Synthetic data can
be generated to match any desired distribution of security threats, making it possible to
test how well a security system can detect and respond to a wide variety of threats. This
is importnat because real-world data is often limited in scope and may not be
representative of the full range of security threats that a system may encounter.
Overall, the use of synthetic data is importnat essential for both training machine
learning models to detect security threats and for testing the performance of security
systems. Synthetic data provides a more complete picture of the underlying distribution
of data which leads to better improves the detection of security threats. Additionally,
synthetic data can be used to create controlled environments for testing security system
performance, making it possible to measure the effectiveness of a security system more
accurately.
Now, let’s quickly explore how synthetic data can be employed in the social
media realm.

18
Chapter 1 An Introduction to Synthetic Data

Social Media
Social media has become an integral part of our lives. It is a platform where we share our
thoughts, ideas, and feelings with our friends and family. However, social media has also
become a breeding ground for fake news and misinformation. This is because anyone
can create a fake account and spread false information.
To combat this problem, many social media platforms are now using AI to detect
fake accounts and flag them. However, AI can only be as effective as the data it is trained
on. If the data is biased or inaccurate, the AI ​​will also be biased or inaccurate. This is
where synthetic data comes in. Synthetic data can be used to train AI algorithms to be
more accurate in detecting fake accounts. Synthetic data can help reduce the spread of
fake news and misinformation on social media.
One way to generate synthetic data is to use generative models. For example, a
generative model could be trained on a dataset of real images of people. Once trained,
the model could then generate new images of people that look real but are fake. This is
important because it allows us to creates data that is representative of the real world.
Simulation is another way of generating synthetic data. For example, we could create
a simulation of a social media platform. This simulation would include all the same
features as the real social media platform. However, it would also allows us to control
what data is generated. This is important because it allows us to test different scenarios.
For example, we could test what would happen if a certain percentage of accounts were
fake. This would allow us to see how our AI algorithms would react in the real world.
Some social media platforms that have been known to use synthetic data include
Facebook, Google, and Twitter; Each of this platforms has used synthetic data in
different ways and for different purposes.
Facebook has been known to uses synthetic data to train its algorithms. For example,
Facebook has used synthetic data to train its facial recognition algorithms. Because it
is difficult to obtain a large enough dataset of real-world faces to train these algorithms
effectively. In addition, Facebook has also used synthetic data to generate fake user
profiles. This is done to test how effective plartfom algorithms are at detecting fake
profiles.
In addition to using real data, Google has been known to use synthetic data.
Synthetic data is generated data that is designed to mimic real data. For example, Google
has to used synthetic data to train its machine learning algorithms to better understand
natural language. Google has also used synthetic data to generate fake reviews. This is
done to test how effective the platform’s algorithms at detectare detecting fake reviews.

19
Chapter 1 An Introduction to Synthetic Data

Twitter is also known to use synthetic data. The platform has used synthetic data
to generate fake tweets and fake user profiles to test how effective its algorithms are
at detecting detect them.
Now, let’s quickly explore how synthetic data can be employed in
the marketing realm.

Marketing
There are many benefits to using synthetic data in marketing. Perhaps the most obvious
benefit is that it can be used to generate data that would be otherwise unavailable.
This is especially useful for marketing research, as because it can be used to generate
data about consumer behavior that would be difficult or impossible to obtain through
traditional means.
The use of synthetic data in marketing is important for several reasons. First, it allows
marketing researchers to study behavior in a controlled environment. This is important
because it allows for the isolation of variables and the testing of hypotheses in a way that
would not be possible with real-world data. Second, synthetic data can be used to
generate new insights into consumer behavior. By analyzing how consumers behave in
a simulated environment, marketing researchers can develop new theories and models
that can be applied to real-world data. Finally, synthetic data can be used to evaluate
marketing campaigns and strategies. By testing campaigns and strategies in a simulated
environment, marketers can identify which ones are most likely to be successful in the
real world.
Synthetic data is also needed in marketing because it can be used to protect
the privacy of real customers. By using synthetic data instead of real customer data,
marketers can avoid having to collect and store sensitive information about their
customers. This is especially important for businesses that are subject to strict privacy
laws, such as those in the European Union.
Several marketing organizations use synthetic data to get a better understanding
of customer behavior and to improve marketing strategies. Each of these organizations
uses synthetic data in different ways, but all of them utilize it to gain insights into the
behavior of consumers.

20
Chapter 1 An Introduction to Synthetic Data

Natural Language Processing


Language models are trained on a large corpus of text and can be used to generate
new text that is similar to the training data. Language models can be used to generate
synthetic data that is representative of different groups of people [8] or to generate data
with specific properties.
Natural language processing (NLP) is a subfield of AI that deals with the
interpretation and manipulation of human language. NLP is used in a variety of
applications, including text classification, chatbots, and machine translation. NLP helps
computers to understand, interpret, and manipulate human language [9].
One area where NLP will likely have a significant impact is in the generation of
synthetic data. Synethetic data that is generated by artificial means, as opposed to
being collected from real-world sources. Synthetic data can be used to train machine
learning models [10], and NLP can be used to generate synthetic data that is realistic
and diverse. This is important because it allows machine learning models to be trained
on data that is representative of the real world, which can improve the accuracy of the
models. For example, synthetic data can be used to generate realistic images of people or
objects that don’t exist in the real world or to create simulated environments for training
autonomous vehicles. Machine learning models trained on data that is representative of
the real world have improved accuracy.
In addition, NLP can be used to automatically label synthetic data, which is
important for training supervised machine learning models. For example, NLP can be
used to also automatically generate descriptions of images or videos, which can then
be used as labels for training image recognition or object detection models. This is
important for training supervised machine learining models, as it can help reduce the
amount of manual labeling that is required.
Overall, NLP is a powerful tool for generating and manipulating synthetic data. It
can be used to automatically generate large amounts of realistic data, which is important
for training machine learning models. In addition, NLP can automatically label synthetic
data, which is important for training supervised machine learning models.
Consequently, in the feature, NLP will continue to play an importnat role in the
generation of synthetic data. The use of NLP to generate synthetic data will allow for the
creation of data that is more representative of different groups of people and will allow
for the creation of data with specific properties.

21
Chapter 1 An Introduction to Synthetic Data

Computer Vision
Computer vision is the process of using computers to interpret and understand digital
images. This can be done using algorithms that can identify patterns and features in
images, and then make decisions about what those images represent.
The computer can detect a human by identifying their silhouette or by drawing
bounding boxes around the person. In the photos in Figure 1-2, the computer would
count the person by identifying their silhouette, as seen in Figure 1-2 (a) (instance
segmentation) or by drawing bounding boxes around the person (object detection), as in
Figure 1-2 (b).

;ĂͿ /ŶƐƚĂŶĐĞ ƐĞŐŵĞŶƚĂƟŽŶ ;ďͿ KďũĞĐƚ ĚĞƚĞĐƟŽŶ

Figure 1-2. Computer vision: (a) instance segmentation (b) object detection

22
Chapter 1 An Introduction to Synthetic Data

An autonomous vehicle needs to be able to detect objects and understand what they
are to make decisions in real-time. This is done through object detection, which is the
process of finding bounding boxes around each item in the image. After that, semantic
segmentation assigns a label to every pixel in the image, indicating what the object is.
Finally, instance segmentation shows you how many individual objects there are [11].
Functions such as seeing, reading, and hearing on a computer are performed with
artificial neural networks that imitate the human brain. Recognition of human faces,
navigation of autonomous cars, and diagnosing patients with the help of medical images
obtained by scanning on the computer are all related to the computer’s vision, and these
works are done with algorithms called artificial neural networks.
The neural network is an area type of machine learning algorithm that is used to
simulate the workings of the human brain. It is made up of a series of interconnected
nodes, each with its weight and three should value. If the output of a node is above
the specified threshold, that node is activated and sends data to the next layer of the
network. If the output of a node is below the specified threshold, that node is not
activated, and no data is sent along [12]. The output of the node is compared to the
threshold. If the output is above the threshold, the node is activated and sends data to
the next layer.
When a model learns too much detail and noise from the training data, it is said to be
overfitting. This can negatively impact the model’s performance on new data [13]. This
means that if we use synthetic data when training a deep learning model, it will be less
likely to overfit the data and as a result, the model will be more accurate and able to
generalize to new data better.
Convolutional Neural Networks (CNN) are specifically designed to be good at
image recognition tasks. They work by taking in an image as input and then assigning
importance to various aspects or objects in the image. This allows the CNN to
differentiate between different objects in the image. Such a deep neural network has a
much more complex structure than a simple neural network. The following resources
can be reviewed for CNN applications in R and Python.
However, no matter how complex the algorithms are in computer vision; as we
mention before the quality of the data, which is only called “input” in such forms, is very
important for the accuracy of the results. Whether the data is text, audio or photographic,
the size of the data and its good labeling are essential to the accuracy of the results. The
raw material of artificial intelligence is big data. Privacy and data privacy can prevent

23
Chapter 1 An Introduction to Synthetic Data

companies from accessing big data with appropriate laws and enforcement. In addition,
labeling data for training purposes also creates a significant cost in machine learning.

Understanding of Visual Scenes


The Visual scenes are the images that are seen on the computer screen. These images
are made up of pixels, which are tiny dots of color. The pixels are arranged in a grid,
and each pixel has a specific address. The computer visual scenes are created by the
computer graphics software, which is responsible for creating the images that are seen
on the screen. The computer graphics software can create the images by using the
addresses of the pixels.
One of the most important purposes of computer vision is to examine visual scenes
and understand them in detail. This includes determining the objects in the image, their
places in 2D and 3D, their attributes, and making a semantic description of the scene.
Different types of tasks that can be performed with object recognition datasets: (a)
Image classification involves assigning a label to an entire image, (b) object bounding
box localization involves identifying the location of an object within an image, and (c)
semantic pixel-level segmentation involves classifying each pixel in an image, and (d)
segmenting individual object instances. This involves identifying each object in an
image and then segmenting it from the rest of the image, as demonstrated in Figure 1-3.

24
Chapter 1 An Introduction to Synthetic Data

Figure 1-3. Object recognition

We can explain what us done in the four panels shown in of Figure 1-3 as fallows:

(a) Image Classification: This task is done by giving binary labels to


objects so that they can be recognized as they exist in the image,
as shown in Figure 1-3 (a). It is widely used in classification, image
processing, and machine learning.

(b) Object bounding box localization: It is a method used


to determine the boundaries of an object, as shown in
Figure 1-3 (b). This method is used to detect where an object in an
image begins and ends.

25
Chapter 1 An Introduction to Synthetic Data

(c) Semantic pixel-level segmentation: It is a method used


to distinguish objects in an image from each other, as shown in
Figure 1-3 (c). This method is used to detect whether each pixel
belongs to an object or not.

(d) Segmenting individual object instances: It is a method used


to distinguish an object one by one, as shown in Figure 1-3 (d).
This method is used to distinguish between objects in an image.

Visual Scenes is used to create and manage online photo albums. It can be used
to store photos on a personal website or blog or to share photos with friends and family.
Visual Scenes provides an easy way to organize photos into albums, add captions and
tags, and share photos with others.
Visual scenes is used to help people with autism spectrum disorder improve their
communication and social skills. It is a software program that uses videos to help people
understand facial expressions, body language, and social situations. The program
is designed to help people with an autism spectrum disorder understand what is
happening in a scene and learn how to respond to social situations.

Segmentation Problem
Synthetic data is important for machine learning, and especially for computer “vision.”
Synthetic data provides us with an almost unlimited amount of perfectly labeled
data at a small cost. Computer scientists can create 3-D models of objects and their
environments to create synthetic data with randomized conditions. This can help solve
computer vision problems.
Semantic segmentation is the task of partitioning an image into semantically
meaningful regions. Each pixel in the image is assigned a semantic label that reflects
what the pixel represents. Semantic segmentation can be performed using a deep neural
network with an encoder-decoder structure.
The segmentation problem in synthetic data is the problem of accurately dividing a
dataset into homogeneous groups or segments. This is a difficult problem because it is
often hard to determine what constitutes a homogeneous group and because there may
be many ways to segment a dataset.
There are a few different approaches that can be used to segment a dataset. One
approach is to use a clustering algorithm, which will group data points that are similar to
each other. Another approach is to use a classification algorithm, which will assign each

26
Chapter 1 An Introduction to Synthetic Data

data point to a class. Finally, one can also use a rule-based approach, which will define a
set of rules that determine how to segment the data.
The segmentation problem is particularly difficult in synthetic data because the
data is often generated by a process that is not well understood. As a result, it is often
hard to determine what features of the data are important for grouping the data points.
In addition, synthetic data often contains a lot of noise, which can make it difficult to
accurately segment the data.

Summary
In this chapter, you learned about synthetic data and why it is beneficial for data
science and artificial intelligence. You also learned about the lifecycle of data and how
synthetic data can be used to improve data quality at various stages. Finally, you learned
how synthetic data is applied across various industries, including financial services,
manufacturing, healthcare, automotive, robotics, security, social media, marketing,
natural language processing, and computer vision.
Next, we’ll begin delving into the different types of Synthetic Data.

References
[1]. G. Andrews, “What Is Synthetic Data?” Carrushome.com, 2022.
https://www.carrushome.com/en/what-is-synthetic-data/
(accessed Apr. 13, 2022).

[2]. Scikit, “3.3. Metrics and scoring: quantifying the quality of


predictions — scikit-­learn 1.0.2 documentation.” Sicikit Learn,
2022. https://scikit-learn.org/stable/modules/model_
evaluation.html (accessed Apr. 13, 2022).

[3]. Airbus, “Can Synthetic Data Really Improve Algorithm Accuracy?”


May 20, 2021. https://www.intelligence-airbusds.com/
newsroom/news/can-synthetic-data-really-improve-
algorithm-accuracy/, (accessed Apr. 13, 2022).spiepr Par10

27
Chapter 1 An Introduction to Synthetic Data

[4]. S. Ramos et al., “RLDS: an Ecosystem to Generate, Share and Use


Datasets in Reinforcement Learning,” Nov. 2021, Accessed: Apr.
13, 2022. [Online]. Available: https://github.com/deepmind/
envlogger.

[5]. N. Sambasivan, S. Kapania, H. Highfill, D. Akrong, P. Paritosh, and


L. M. Aroyo, ‘Everyone Wants To Do The Model Work, Not The
Data Work’: Data Cascades in High-­Stakes AI, in Proceedings of the
2021 CHI Conference on Human Factors in Computing Systems,
May 6, 2021, pp. 1–15, doi: 10.1145/3411764.3445518.spiepr Par109

[6]. G. Press, “Andrew Ng Launches A Campaign For Data-Centric


AI,” Forbes, Jun. 16, 2021. https://www.forbes.com/sites/
gilpress/2021/06/16/andrew-ng-launches-a-campaign-for-
data-centric-ai/?sh=d556d9d74f57, (accessed Apr. 13, 2022).

[7]. K. Singh, “Synthetic Data — key benefits, types, generation


methods, and challenges,” Towards Data Science, May 12,
2021. https://towardsdatascience.com/synthetic-
data-key-benefits-types-generation-methods-and-
challenges-11b0ad304b55 (accessed Apr. 13, 2022).

[8]. B. Ding et al., “DAGA: Data Augmentation with a Generation


Approach for Low-­resource Tagging Tasks,” 2020. [Online].
Available: https://ntunlpsg.

[9]. MRP, “Natural Language Processing: Providing Structure To The


Complexity Of Language.” Jul. 30, 2018. https://www.mrpfd.com/
resources/naturallanguageprocessing/ (accessed Jul. 02, 2022)

[10]. S. I. Nikolenko, Synthetic Data for Deep Learning, vol.


174. Cham:Springer International Publishing, 2021, doi:
10.1007/978-3-030-75178-4.

[11]. S. Colaner, “Why Unity claims synthetic data sets can improve
computer vision models,” VentureBeat, Jul. 17, 2022. https://
venturebeat.com/2020/07/17/why-unity-claims-synthetic-
data-sets-can-improve-computer-vision-models/ (accessed
Apr. 13, 2022).

28
Another random document with
no related content on Scribd:
encircled this moving spectacle. The divers tints of the rainbow, the
most brilliant dyes that the sun lays in the lap of spring, added to the
whole family of gems, could not display a more beautiful, radiant,
and dazzling spectacle than accompanied the black horse. You
would have thought all the stars of heaven had met in merriment on
the turnpike. In the midst of this luminous configuration sat a man,
distinctly to be seen, in a miserable-looking chair, drawn by a black
horse. The turnpike-gate ought, by the laws of Nature and the laws
of the State, to have made a wreck of the whole, and have dissolved
the enchantment; but no, the horse without an effort passed over the
gate, and drew the man and chair horizontally after him without
touching the bar. This was what I call enchantment. What think you,
sir?”
“My friend,” said I, “you have grossly magnified a natural
occurrence. The man was Peter Rugg, on his way to Boston. It is
true, his horse travelled with unequalled speed, but as he reared
high his forefeet, he could not help displacing the thousand small
stones on which he trod, which flying in all directions struck one
another, and resounded and scintillated. The top bar of your gate is
not more than two feet from the ground, and Rugg’s horse at every
vault could easily lift the carriage over that gate.”
This satisfied Mr. McDoubt, and I was pleased at that
occurrence; for otherwise Mr. McDoubt, who is a worthy man, late
from the Highlands, might have added to his calendar of
superstitions. Having thus disenchanted the macadamized road and
the turnpike-gate, and also Mr. McDoubt, I pursued my journey
homeward to New York.
Little did I expect to see or hear anything further of Mr. Rugg, for
he was now more than twelve hours in advance of me. I could hear
nothing of him on my way to Elizabethtown, and therefore concluded
that during the past night he had turned off from the turnpike and
pursued a westerly direction; but just before I arrived at Powles’s
Hook, I observed a considerable collection of passengers in the
ferry-boat, all standing motionless, and steadily looking at the same
object. One of the ferry-men, Mr. Hardy, who knew me well,
observing my approach delayed a minute, in order to afford me a
passage, and coming up, said, “Mr. Dunwell, we have a curiosity on
board that would puzzle Dr. Mitchell.”
“Some strange fish, I suppose, has found its way into the
Hudson.”
“No,” said he, “it is a man who looks as if he had lain hidden in
the ark, and had just now ventured out. He has a little girl with him,
the counterpart of himself, and the finest horse you ever saw,
harnessed to the queerest-looking carriage that ever was made.”
“Ah, Mr. Hardy,” said I, “you have, indeed, hooked a prize; no
one before you could ever detain Peter Rugg long enough to
examine him.”
“Do you know the man?” said Mr. Hardy.
“No, nobody knows him, but everybody has seen him. Detain
him as long as possible; delay the boat under any pretence, cut the
gear of the horse, do anything to detain him.”
As I entered the ferry-boat, I was struck at the spectacle before
me. There, indeed, sat Peter Rugg and Jenny Rugg in the chair, and
there stood the black horse, all as quiet as lambs, surrounded by
more than fifty men and women, who seemed to have lost all their
senses but one. Not a motion, not a breath, not a rustle. They were
all eye. Rugg appeared to them to be a man not of this world; and
they appeared to Rugg a strange generation of men. Rugg spoke
not, and they spoke not; nor was I disposed to disturb the calm,
satisfied to reconnoitre Rugg in a state of rest. Presently, Rugg
observed in a low voice, addressed to nobody, “A new contrivance,
horses instead of oars; Boston folks are full of notions.”
It was plain that Rugg was of Dutch extraction. He had on three
pairs of small clothes, called in former days of simplicity breeches,
not much the worse for wear; but time had proved the fabric, and
shrunk one more than another, so that they showed at the knees
their different qualities and colors. His several waistcoats, the flaps
of which rested on his knees, made him appear rather corpulent. His
capacious drab coat would supply the stuff for half a dozen modern
ones; the sleeves were like meal bags, in the cuffs of which you
might nurse a child to sleep. His hat, probably once black, now of a
tan color, was neither round nor crooked, but in shape much like the
one President Monroe wore on his late tour. This dress gave the
rotund face of Rugg an antiquated dignity. The man, though deeply
sunburned, did not appear to be more than thirty years of age. He
had lost his sad and anxious look, was quite composed, and seemed
happy. The chair in which Rugg sat was very capacious, evidently
made for service, and calculated to last for ages; the timber would
supply material for three modern carriages. This chair, like a
Nantucket coach, would answer for everything that ever went on
wheels. The horse, too, was an object of curiosity; his majestic
height, his natural mane and tail, gave him a commanding
appearance, and his large open nostrils indicated inexhaustible wind.
It was apparent that the hoofs of his forefeet had been split, probably
on some newly macadamized road, and were now growing together
again; so that John Spring was not altogether in the wrong.
How long this dumb scene would otherwise have continued I
cannot tell. Rugg discovered no sign of impatience. But Rugg’s
horse having been quiet more than five minutes, had no idea of
standing idle; he began to whinny, and in a moment after, with his
right forefoot he started a plank. Said Rugg, “My horse is impatient,
he sees the North End. You must be quick, or he will be
ungovernable.”
At these words, the horse raised his left forefoot; and when he
laid it down every inch of the ferry-boat trembled. Two men
immediately seized Rugg’s horse by the nostrils. The horse nodded,
and both of them were in the Hudson. While we were fishing up the
men, the horse was perfectly quiet.
“Fret not the horse,” said Rugg, “and he will do no harm. He is
only anxious, like myself, to arrive at yonder beautiful shore; he sees
the North Church, and smells his own stable.”
“Sir,” said I to Rugg, practising a little deception, “pray tell me, for
I am a stranger here, what river is this, and what city is that opposite,
for you seem to be an inhabitant of it?”
“This river, sir, is called Mystic River, and this is Winnisimmet
ferry,—we have retained the Indian names,—and that town is
Boston. You must, indeed, be a stranger in these parts, not to know
that yonder is Boston, the capital of the New England provinces.”
“Pray, sir, how long have you been absent from Boston?”
“Why, that I cannot exactly tell. I lately went with this little girl of
mine to Concord, to see my friends; and I am ashamed to tell you, in
returning lost the way, and have been travelling ever since. No one
would direct me right. It is cruel to mislead a traveller. My horse,
Lightfoot, has boxed the compass; and it seems to me he has boxed
it back again. But, sir, you perceive my horse is uneasy; Lightfoot, as
yet, has only given a hint and a nod. I cannot be answerable for his
heels.”
At these words Lightfoot reared his long tail, and snapped it as
you would a whiplash. The Hudson reverberated with the sound.
Instantly the six horses began to move the boat. The Hudson was a
sea of glass, smooth as oil, not a ripple. The horses, from a smart
trot, soon pressed into a gallop; water now ran over the gunwale; the
ferry-boat was soon buried in an ocean of foam, and the noise of the
spray was like the roaring of many waters. When we arrived at New
York, you might see the beautiful white wake of the ferry-boat across
the Hudson.
Though Rugg refused to pay toll at turnpikes, when Mr. Hardy
reached his hand for the ferriage, Rugg readily put his hand into one
of his many pockets, took out a piece of silver, and handed it to
Hardy.
“What is this?” said Mr. Hardy.
“It is thirty shillings,” said Rugg.
“It might once have been thirty shillings, old tenor,” said Mr.
Hardy, “but it is not at present.”
“The money is good English coin,” said Rugg; “my grandfather
brought a bag of them from England, and had them hot from the
mint.”
Hearing this, I approached near to Rugg, and asked permission
to see the coin. It was a half-crown, coined by the English
Parliament, dated in the year 1649. On one side, “The
Commonwealth of England,” and St. George’s cross encircled with a
wreath of laurel. On the other, “God with us,” and a harp and St.
George’s cross united. I winked at Mr. Hardy, and pronounced it
good current money; and said loudly, “I will not permit the gentleman
to be imposed on, for I will exchange the money myself.”
On this, Rugg spoke,—“Please to give me your name, sir.”
“My name is Dunwell, sir,” I replied.
“Mr. Dunwell,” said Rugg, “you are the only honest man I have
seen since I left Boston. As you are a stranger here, my house is
your home; Dame Rugg will be happy to see her husband’s friend.
Step into my chair, sir, there is room enough; move a little, Jenny, for
the gentleman, and we will be in Middle Street in a minute.”
Accordingly I took a seat by Peter Rugg.
“Were you never in Boston before?” said Rugg.
“No,” said I.
“Well, you will now see the queen of New England, a town
second only to Philadelphia, in all North America.”
“You forget New York,” said I.
“Poh, New York is nothing; though I never was there. I am told
you might put all New York in our mill-pond. No, sir, New York, I
assure you, is but a sorry affair; no more to be compared with
Boston than a wigwam with a palace.”
As Rugg’s horse turned into Pearl Street, I looked Rugg as fully
in the face as good manners would allow, and said, “Sir, if this is
Boston, I acknowledge New York is not worthy to be one of its
suburbs.”
Before we had proceeded far in Pearl Street, Rugg’s
countenance changed: his nerves began to twitch; his eyes trembled
in their sockets; he was evidently bewildered. “What is the matter,
Mr. Rugg; you seem disturbed.”
“This surpasses all human comprehension; if you know, sir,
where we are, I beseech you to tell me.”
“If this place,” I replied, “is not Boston, it must be New York.”
“No, sir, it is not Boston; nor can it be New York. How could I be
in New York, which is nearly two hundred miles from Boston?”
By this time we had passed into Broadway, and then Rugg, in
truth, discovered a chaotic mind. “There is no such place as this in
North America. This is all the effect of enchantment; this is a grand
delusion, nothing real. Here is seemingly a great city, magnificent
houses, shops and goods, men and women innumerable, and as
busy as in real life, all sprung up in one night from the wilderness; or
what is more probable, some tremendous convulsion of Nature has
thrown London or Amsterdam on the shores of New England. Or,
possibly, I may be dreaming, though the night seems rather long; but
before now I have sailed in one night to Amsterdam, bought goods of
Vandogger, and returned to Boston before morning.”
At this moment a hue-and-cry was heard, “Stop the madmen,
they will endanger the lives of thousands!” In vain hundreds
attempted to stop Rugg’s horse. Lightfoot interfered with nothing; his
course was straight as a shooting-star. But on my part, fearful that
before night I should find myself behind the Alleghanies, I addressed
Mr. Rugg in a tone of entreaty, and requested him to restrain the
horse and permit me to alight.
“My friend,” said he, “we shall be in Boston before dark, and
Dame Rugg will be most exceeding glad to see us.”
“Mr. Rugg,” said I, “you must excuse me. Pray look to the west;
see that thunder-cloud swelling with rage, as if in pursuit of us.”
“Ah!” said Rugg, “it is in vain to attempt to escape. I know that
cloud; it is collecting new wrath to spend on my head.” Then
checking his horse, he permitted me to descend, saying, “Farewell,
Mr. Dunwell, I shall be happy to see you in Boston; I live in Middle
Street.”
It is uncertain in what direction Mr. Rugg pursued his course,
after he disappeared in Broadway; but one thing is sufficiently known
to everybody,—that in the course of two months after he was seen in
New York, he found his way most opportunely to Boston.
It seems the estate of Peter Rugg had recently fallen to the
Commonwealth of Massachusetts for want of heirs; and the
Legislature had ordered the solicitor-general to advertise and sell it
at public auction. Happening to be in Boston at the time, and
observing his advertisement, which described a considerable extent
of land, I felt a kindly curiosity to see the spot where Rugg once
lived. Taking the advertisement in my hand, I wandered a little way
down Middle Street, and without asking a question of any one, when
I came to a certain spot I said to myself, “This is Rugg’s estate; I will
proceed no farther. This must be the spot; it is a counterpart of Peter
Rugg.” The premises, indeed, looked as if they had fulfilled a sad
prophecy. Fronting on Middle Street, they extended in the rear to
Ann Street, and embraced about half an acre of land. It was not
uncommon in former times to have half an acre for a house-lot; for
an acre-of land then, in many parts of Boston, was not more valuable
than a foot in some places at present. The old mansion-house had
become a powder-post, and been blown away. One other building,
uninhabited, stood ominous, courting dilapidation. The street had
been so much raised that the bed-chamber had descended to the
kitchen and was level with the street. The house seemed conscious
of its fate; and as though tired of standing there, the front was fast
retreating from the rear, and waiting the next south wind to project
itself into the street. If the most wary animals had sought a place of
refuge, here they would have rendezvoused. Here, under the ridge-
pole, the crow would have perched in security; and in the recesses
below, you might have caught the fox and the weasel asleep. “The
hand of destiny,” said I, “has pressed heavy on this spot; still heavier
on the former owners. Strange that so large a lot of land as this
should want an heir! Yet Peter Rugg, at this day, might pass by his
own door-stone, and ask, ‘Who once lived here?’”
The auctioneer, appointed by the solicitor to sell this estate, was
a man of eloquence, as many of the auctioneers of Boston are. The
occasion seemed to warrant, and his duty urged, him to make a
display. He addressed his audience as follows,—
“The estate, gentlemen, which we offer you this day, was once
the property of a family now extinct. For that reason it has escheated
to the Commonwealth. Lest any one of you should be deterred from
bidding on so large an estate as this for fear of a disputed title, I am
authorized by the solicitor-general to proclaim that the purchaser
shall have the best of all titles,—a warranty-deed from the
Commonwealth. I state this, gentlemen, because I know there is an
idle rumor in this vicinity, that one Peter Rugg, the original owner of
this estate, is still living. This rumor, gentlemen, has no foundation,
and can have no foundation in the nature of things. It originated
about two years since, from the incredible story of one Jonathan
Dunwell, of New York. Mrs. Croft, indeed, whose husband I see
present, and whose mouth waters for this estate, has countenanced
this fiction. But, gentlemen, was it ever known that any estate,
especially an estate of this value, lay unclaimed for nearly half a
century, if any heir, ever so remote, were existing? For, gentlemen,
all agree that old Peter Rugg, if living, would be at least one hundred
years of age. It is said that he and his daughter, with a horse and
chaise, were missed more than half a century ago; and because they
never returned home, forsooth, they must be now living, and will
some day come and claim this great estate. Such logic, gentlemen,
never led to a good investment. Let not this idle story cross the noble
purpose of consigning these ruins to the genius of architecture. If
such a contingency could check the spirit of enterprise, farewell to all
mercantile excitement. Your surplus money, instead of refreshing
your sleep with the golden dreams of new sources of speculation,
would turn to the nightmare. A man’s money, if not employed, serves
only to disturb his rest. Look, then, to the prospect before you. Here
is half an acre of land,—more than twenty thousand square feet,—a
corner lot, with wonderful capabilities; none of your contracted lots of
forty feet by fifty, where, in dog-days, you can breathe only through
your scuttles. On the contrary, an architect cannot contemplate this
lot of land without rapture, for here is room enough for his genius to
shame the temple of Solomon. Then the prospect—how
commanding! To the east, so near to the Atlantic that Neptune,
freighted with the select treasures of the whole earth, can knock at
your door with his trident. From the west, the produce of the river of
Paradise—the Connecticut—will soon, by the blessings of steam,
railways, and canals pass under your windows; and thus, on this
spot, Neptune shall marry Ceres, and Pomona from Roxbury, and
Flora from Cambridge, shall dance at the wedding.
“Gentlemen of science, men of taste, ye of the literary emporium,
—for I perceive many of you present,—to you this is holy ground. If
the spot on which in times past a hero left only the print of a footstep
is now sacred, of what price is the birthplace of one who all the world
knows was born in Middle Street, directly opposite to this lot; and
who, if his birthplace were not well known, would now be claimed by
more than seven cities. To you, then, the value of these premises
must be inestimable. For ere long there will arise in full view of the
edifice to be erected here, a monument, the wonder and veneration
of the world. A column shall spring to the clouds; and on that column
will be engraven one word which will convey all that is wise in
intellect, useful in science, good in morals, prudent in counsel, and
benevolent in principle,—a name of one who, when living, was the
patron of the poor, the delight of the cottage, and the admiration of
kings; now dead, worth the whole seven wise men of Greece. Need I
tell you his name? He fixed the thunder and guided the lightning.
“Men of the North End! Need I appeal to your patriotism, in order
to enhance the value of this lot? The earth affords no such scenery
as this; there, around that corner, lived James Otis; here, Samuel
Adams; there, Joseph Warren; and around that other corner, Josiah
Quincy. Here was the birthplace of Freedom; here Liberty was born,
and nursed, and grew to manhood. Here man was newly created.
Here is the nursery of American Independence—I am too modest—
here began the emancipation of the world; a thousand generations
hence millions of men will cross the Atlantic just to look at the north
end of Boston. Your fathers—what do I say—yourselves,—yes, this
moment, I behold several attending this auction who lent a hand to
rock the cradle of Independence.
“Men of speculation,—ye who are deaf to everything except the
sound of money,—you, I know, will give me both of your ears when I
tell you the city of Boston must have a piece of this estate in order to
widen Ann Street. Do you hear me,—do you all hear me? I say the
city must have a large piece of this land in order to widen Ann Street.
What a chance! The city scorns to take a man’s land for nothing. If it
seizes your property, it is generous beyond the dreams of avarice.
The only oppression is, you are in danger of being smothered under
a load of wealth. Witness the old lady who lately died of a broken
heart when the mayor paid her for a piece of her kitchen-garden. All
the faculty agreed that the sight of the treasure, which the mayor
incautiously paid her in dazzling dollars, warm from the mint, sped
joyfully all the blood of her body into her heart, and rent it with
raptures. Therefore, let him who purchases this estate fear his good
fortune, and not Peter Rugg. Bid, then, liberally, and do not let the
name of Rugg damp your ardor. How much will you give per foot for
this estate?”
Thus spoke the auctioneer, and gracefully waved his ivory
hammer. From fifty to seventy-five cents per foot were offered in a
few moments. The bidding labored from seventy-five to ninety. At
length one dollar was offered. The auctioneer seemed satisfied; and
looking at his watch, said he would knock off the estate in five
minutes, if no one offered more.
There was a deep silence during this short period. While the
hammer was suspended, a strange rumbling noise was heard, which
arrested the attention of every one. Presently, it was like the sound
of many shipwrights driving home the bolts of a seventy-four. As the
sound approached nearer, some exclaimed, “The buildings in the
new market are falling in promiscuous ruins.” Others said, “No, it is
an earthquake; we perceive the earth tremble.” Others said, “Not so;
the sound proceeds from Hanover Street, and approaches nearer;”
and this proved true, for presently Peter Rugg was in the midst of us.
“Alas, Jenny,” said Peter, “I am ruined; our house has been
burned, and here are all our neighbors around the ruins. Heaven
grant your mother, Dame Rugg, is safe.”
“They don’t look like our neighbors,” said Jenny; “but sure
enough our house is burned, and nothing left but the door-stone and
an old cedar post. Do ask where mother is.”
In the mean time more than a thousand men had surrounded
Rugg and his horse and chair. Yet neither Rugg personally, nor his
horse and carriage, attracted more attention than the auctioneer. The
confident look and searching eyes of Rugg carried more conviction
to every one present that the estate was his, than could any
parchment or paper with signature and seal. The impression which
the auctioneer had just made on the company was effaced in a
moment; and although the latter words of the auctioneer were, “Fear
not Peter Rugg,” the moment the auctioneer met the eye of Rugg his
occupation was gone; his arm fell down to his hips, his late lively
hammer hung heavy in his hand, and the auction was forgotten. The
black horse, too, gave his evidence. He knew his journey was
ended; for he stretched himself into a horse and a half, rested his
head over the cedar post, and whinnied thrice, causing his harness
to tremble from headstall to crupper.
Rugg then stood upright in his chair, and asked with some
authority, “Who has demolished my house in my absence, for I see
no signs of a conflagration? I demand by what accident this has
happened, and wherefore this collection of strange people has
assembled before my door-step. I thought I knew every man in
Boston, but you appear to me a new generation of men. Yet I am
familiar with many of the countenances here present, and I can call
some of you by name; but in truth I do not recollect that before this
moment I ever saw any one of you. There, I am certain, is a
Winslow, and here a Sargent; there stands a Sewall, and next to him
a Dudley. Will none of you speak to me,—or is this all a delusion? I
see, indeed, many forms of men, and no want of eyes, but of motion,
speech, and hearing, you seem to be destitute. Strange! Will no one
inform me who has demolished my house?”
Then spake a voice from the crowd, but whence it came I could
not discern: “There is nothing strange here but yourself, Mr. Rugg.
Time, which destroys and renews all things, has dilapidated your
house, and placed us here. You have suffered many years under an
illusion. The tempest which you profanely defied at Menotomy has at
length subsided; but you will never see home, for your house and
wife and neighbors have all disappeared. Your estate, indeed,
remains, but no home. You were cut off from the last age, and you
can never be fitted to the present. Your home is gone, and you can
never have another home in this world.”
JAMES HALL
1793–1868

Judge Hall gained eminence in the early Middle West at both law and letters. His
law studies in Philadelphia, where he was born, were interrupted by the war of
1812. After soldiering along the Niagara he went sailoring with Decatur in the
Mediterranean (1815). Then completing his studies at Pittsburgh, he emigrated to
Shawneetown, where he became public prosecutor. The office of state treasurer
bringing him to Vandalia, he edited there, with Robert Blackwell, the Illinois
Intelligencer. The Western Souvenir, projected, edited, and largely written by him,
was published at Cincinnati in 1829; the Illinois Magazine, at Vandalia, 1829–1831,
then successively at Cincinnati, St. Louis, and again at Cincinnati. Following
himself its last remove, Hall continued it there, 1833–1835, as the Western
Monthly Magazine. Letters from the West, which appeared first, 1821–1824, in the
Portfolio, were printed collectively in London, 1828. His scattered observations
were brought into more consistent form: Sketches of the West, Philadelphia, 1835;
Notes on the Western States, Philadelphia, 1838. A collection of his tales, entitled
The Wilderness and the War Path, appeared in Wiley and Putnam’s “Library of
American Books,” 1846. A uniform edition of his works was published in four
volumes, 1853–1856 (a list is given in the American Cyclopedia). Some details of
his life not compiled in the cyclopedias were published by Hiram W. Beckwith of
Danville, in the fifth of the papers entitled “The Land of the Illini,” Chicago Tribune,
8th September, 1895.
A writer of tolerable verse and historically valuable descriptive sketches of the
frontier, Hall gave much of his leisure also to embodying the history, legend, and
local color of the Mississippi valley and the prairies beyond in tales. These are
often removed from our present taste by the magniloquence then considered
literary; but they keep the interest of close observation and have their flashes of
enduring human import. The local truth of Hall’s tales is commended in the
Western Monthly Review for November, 1828 (volume ii, page 367). Unless the
reference be to some of the early Letters from the West, the tale printed below
may have appeared earlier than the date of its incorporation in the Western
Souvenir. (See also pages 5, 9 and 12 of the Introduction.)
THE FRENCH VILLAGE
[From the “Western Souvenir,” 1829]

[A long introduction and a concluding summary of the effects of


American development have been omitted as not essential to the
narrative; and certain obvious corrections have been made in the
text.]

* * * * *

THIS little colony was composed partly of emigrants from France, and
partly of natives—not Indians, but bona fide French, born in America,
but preserving their language, their manners, and their agility in
dancing, although several generations had passed away since their
first settlement. Here they lived perfectly happy, and well they might,
for they enjoyed to the full extent those three blessings on which our
declaration of independence has laid so much stress—life, liberty, and
the pursuit of happiness. Their lives, it is true, were sometimes
threatened by the miasm aforesaid; but this was soon ascertained to
be an imaginary danger. For whether it was owing to their temperance
or their cheerfulness, their activity or their being acclimated, or to the
want of attraction between French people and fever, or to all these
together, certain it is that they were blessed with a degree of health
enjoyed only by the most favoured nations. As to liberty, the wild Indian
scarcely possessed more; for although the “grand monarque” had not
more loyal subjects in his wide domains, he had never condescended
to honor them with a single act of oppression, unless the occasional
visits of the Commandant could be so called. He sometimes, when
levying supplies, called upon the village for its portion, which they
always contributed with many protestations of gratitude for the honor
conferred on them. And as for happiness, they pursued nothing else.
Inverting the usual order, to enjoy life was their daily business, to
provide for its wants an occasional labor, sweetened by its brief
continuance, and its abundant fruit. A large tract of land around the
village was called the “common field.” Most of this was allowed to
remain in open pasturage; but spots of it were cultivated by any who
chose to enclose them; and such enclosure gave a firm title to the
individual so long as the occupancy lasted, but no longer. They were
not an agricultural people, further than the rearing of a few esculents
for the table made them such; relying chiefly on their large herds, and
on the produce of the chase for support. With the Indians they drove
an amicable, though not extensive, trade, for furs and peltry; giving
them in exchange, merchandize and trinkets, which they procured from
their countrymen at St. Louis. To the latter place they annually carried
their skins, bringing back a fresh supply of goods for barter, together
with such articles as their own wants required; not forgetting a large
portion of finery for the ladies, a plentiful supply of rosin and catgut for
the fiddler, and liberal presents for his reverence, the priest.
If this village had no other recommendation, it is endeared to my
recollection as the birthplace and residence of Monsieur Baptiste
Menou, who was one of its principal inhabitants when I first visited it.
He was a bachelor of forty, a tall, lank, hard featured personage, as
straight as a ramrod, and almost as thin, with stiff, black hair, sunken
cheeks, and a complexion a tinge darker than that of the aborigines.
His person was remarkably erect, his countenance grave, his gait
deliberate; and when to all this is added an enormous pair of sable
whiskers, it will be admitted that Mons. Baptiste was no insignificant
person. He had many estimable qualities of mind and person which
endeared him to his friends, whose respect was increased by the fact
of his having been a soldier and a traveller. In his youth he had
followed the French Commandant in two campaigns; and not a
comrade in the ranks was better dressed, or cleaner shaved on parade
than Baptiste, who fought besides with the characteristic bravery of the
nation to which he owed his lineage. He acknowledged, however, that
war was not as pleasant a business as is generally supposed.
Accustomed to a life totally free from constraint, he complained of
being obliged to eat and drink and sleep at the call of the drum.
Burnishing a gun, and brushing a coat, and polishing shoes, were
duties beneath a gentleman; and after all, Baptiste saw but little honor
in tracking the wily Indians through endless swamps. Besides he
began to have some scruples as to the propriety of cutting the throats
of the respectable gentry whom he had been in the habit of
considering as the original and lawful possessors of the soil. He
therefore proposed to resign, and was surprised when his commander
informed him that he was enlisted for a term, which was not yet
expired. He bowed, shrugged his shoulders, and submitted to his fate.
He had too much honor to desert, and was too loyal, and too polite, to
murmur; but he forthwith made a solemn vow to his patron saint never
again to get into a scrape from which he could not retreat whenever it
suited his convenience. It was thought that he owed his celibacy in
some measure to this vow. He had since accompanied the friendly
Indians on several hunting expeditions towards the sources of the
Mississippi, and had made a trading voyage to New Orleans. Thus
accomplished, he had been more than once called upon by the
Commandant to act as a guide, or an interpreter; honors which failed
not to elicit suitable marks of respect from his fellow villagers, but
which had not inflated the honest heart of Baptiste with any
unbecoming pride. On the contrary there was not a more modest man
in the village.
In his habits he was the most regular of men. He might be seen at
any hour of the day, either sauntering through the village, or seated in
front of his own door, smoking a large pipe formed of a piece of
buckhorn, curiously hollowed out, and lined with tin; to which was
affixed a short stem of cane from the neighboring swamp. This pipe
was his inseparable companion; and he evinced towards it a
constancy which would have immortalized his name, had it been
displayed in a better cause. When he walked abroad, it was to stroll
leisurely from door to door, chatting familiarly with his neighbours,
patting the white-haired children on the head, and continuing his
lounge, until he had peregrinated the village. His gravity was not a
“mysterious carriage of the body to conceal the defects of the mind,”
but a constitutional seriousness of aspect, which covered as happy
and as humane a spirit as ever existed. It was simply a want of
sympathy between his muscles and his brains; the former utterly
refusing to express any agreeable sensation which might happily
titillate the organs of the latter. Honest Baptiste loved a joke, and
uttered many, and good ones; but his rigid features refused to smile
even at his own wit—a circumstance which I am the more particular in
mentioning, as it is not common. He had an orphan niece whom he
had reared from childhood to maturity,—a lovely girl, of whose
beautiful complexion a poet might say that its roses were cushioned
upon ermine. A sweeter flower bloomed not upon the prairie than
Gabrielle Menou. But as she was never afflicted with weak nerves,
dyspepsia or consumption, and had but one avowed lover, whom she
treated with uniform kindness, and married with the consent of all
parties, she has no claim to be considered as the heroine of this
history. That station will be awarded by every sensible reader to the
important personage who will be presently introduced.
Across the street, immediately opposite to Mons. Baptiste, lived
Mademoiselle Jeannette Duval, a lady who resembled him in some
respects, but in many others was his very antipode. Like him, she was
cheerful and happy, and single; but unlike him, she was brisk, and fat,
and plump. Monsieur was the very pink of gravity; and Mademoiselle
was blessed with a goodly portion thereof,—but hers was specific
gravity. Her hair was dark, but her heart was light, and her eyes,
though black, were as brilliant a pair of orbs as ever beamed upon the
dreary solitude of a bachelor’s heart. Jeannette’s heels were as light
as her heart, and her tongue as active as her heels, so that
notwithstanding her rotundity, she was as brisk a Frenchwoman as
ever frisked through the mazes of a cotillion. To sum her perfections,
her complexion was of a darker olive than the genial sun of France
confers on her brunettes, and her skin was as smooth and shining as
polished mahogany. Her whole household consisted of herself and a
female negro servant. A spacious garden, which surrounded her
house, a pony, and a herd of cattle, constituted, in addition to her
personal charms, all the wealth of this amiable spinster. But with these
she was rich, as they supplied her table without adding much to her
cares. Her quadrupeds, according to the example set by their
superiors, pursued their own happiness without let or molestation,
wherever they could find it—waxing fat or lean, as nature was more or
less bountiful in supplying their wants; and when they strayed too far,
or when her agricultural labours became too arduous for the feminine
strength of herself and her sable assistant, every monsieur of the
village was proud of an occasion to serve Mam’selle. And well they
might be, for she was the most noticeable lady in the village, the life of
every party, the soul of every frolic. She participated in every festive
meeting, and every sad solemnity. Not a neighbor could get up a
dance, or get down a dose of bark, without her assistance. If the ball
grew dull, Mam’selle bounced on the floor, and infused new spirit into
the weary dancers. If the conversation flagged, Jeannette, who
occupied a kind of neutral ground between the young and the old, the
married and the single, chatted with all, and loosened all tongues. If
the girls wished to stroll in the woods, or romp on the prairie,
Mam’selle was taken along to keep off the wolves and the rude young
men; and in respect to the latter, she faithfully performed her office by
attracting them around her own person. Then she was the best
neighbour and the kindest soul! She made the richest soup, the
clearest coffee, and the neatest pastry in the village; and in virtue of
her confectionery was the prime favourite of all the children. Her
hospitality was not confined to her own domicile, but found its way in
the shape of sundry savoury viands, to every table in the vicinity. In the
sick chamber she was the most assiduous nurse, her step was the
lightest, and her voice the most cheerful—so that the priest must
inevitably have become jealous of her skill, had it not been for divers
plates of rich soup, and bottles of cordial, with which she conciliated
his favour, and purchased absolution for these and other offences.
Baptiste and Jeannette were the best of neighbours. He always
rose at the dawn, and after lighting his pipe, sallied forth into the open
air, where Jeannette usually made her appearance at the same time;
for there was an emulation of long standing between them, which
should be the earlier riser.
“Bon jour! Mam’selle Jeannette,” was his daily salutation.
“Ah! bon jour! Mons. Menou,” was her daily reply.
Then as he gradually approximated the little paling which
surrounded her door, he hoped Mam’selle was well this morning, and
she reiterated the kind inquiry, but with increased emphasis. Then
Monsieur enquired after Mam’selle’s pony and Mam’selle’s cow, and
her garden, and everything appertaining to her, real, personal and
mixed; and she displayed a corresponding interest in all concerns of
her kind neighbour. These discussions were mutually beneficial. If
Mam’selle’s cattle ailed, or if her pony was guilty of an impropriety, who
was so able to advise her as Mons. Baptiste? And if his plants
drooped, or his poultry died, who so skilful in such matters as
Mam’selle Jeanette? Sometimes Baptiste forgot his pipe in the
superior interest of the “tête à tête,” and must needs step in to light it at
Jeannette’s fire, which caused the gossips of the village to say that he
purposely let his pipe go out, in order that he might himself go in. But
he denied this, and, indeed, before offering to enter the dwelling of
Mam’selle on such occasions, he usually solicited permission to light
his pipe at Jeannette’s sparkling eyes, a compliment at which,
although it had been repeated some scores of times, Mam’selle never
failed to laugh and curtesy, with great good humour and good
breeding.
It cannot be supposed that a bachelor of so much discernment
could long remain insensible to the galaxy of charms which centered in
the person of Mam’selle Jeannette; and accordingly it was currently
reported that a courtship of some ten years standing had been slyly
conducted on his part, and as cunningly eluded on hers. It was not
averred that Baptiste had actually gone the fearful length of offering his
hand; or that Jeannette had been so imprudent as to discourage, far
less reject, a lover of such respectable pretensions. But there was
thought to exist a strong hankering on the part of the gentleman, which
the lady had managed so skilfully as to keep his mind in a kind of
equilibrium, like that of the patient animal between the two bundles of
hay—so that he would sometimes halt in the street, midway between
the two cottages, and cast furtive glances, first at the one, and then at
the other, as if weighing the balance of comfort, while the increased
volumes of smoke which issued from his mouth seemed to argue that
the fire of his love had other fuel than tobacco, and was literally
consuming the inward man. The wary spinster was always on the alert
on such occasions, manœuvering like a skilful general according to
circumstances. If honest Baptiste, after such a consultation, turned on
his heel and retired to his former cautious position at his own door,
Mam’selle rallied all her attractions, and by a sudden demonstration
drew him again into the field; but if he marched with an embarrassed
air towards her gate, she retired into her castle, or kept shy, and by
able evolutions avoided everything which might bring matters to an
issue. Thus the courtship continued longer than the siege of Troy, and
Jeannette maintained her freedom, while Baptiste with a magnanimity
superior to that of Agamemnon, kept his temper, and smoked his pipe
in good humour with Jeannette and all the world.
Such was the situation of affairs when I first visited this village,
about the time of the cession of Louisiana to the United States. The
news of that event had just reached this sequestered spot, and was
but indifferently relished. Independently of the national attachment,
which all men feel, and the French so justly, the inhabitants of this
region had reason to prefer to all others the government which had
afforded them protection without constraining their freedom, or
subjecting them to any burthens; and with the kindest feelings towards
the Americans, they would willingly have dispensed with any nearer
connexion than that which already existed. They, however, said little on
the subject; and that little was expressive of their cheerful
acquiescence in the honor done them by the American people, in
buying the country which the Emperor had done them the honor to sell.
It was on the first day of the Carnival that I arrived in the village,
about sunset, seeking shelter only for the night, and intending to
proceed on my journey in the morning. The notes of the violin and the
groups of gaily attired people who thronged the street attracted my
attention, and induced me to inquire the occasion of this merriment. My
host informed me that a “King Ball” was to be given at the house of a
neighbour, adding the agreeable intimation that strangers were always
expected to attend without invitation. Young and ardent, I required little
persuasion to change my dress, and hasten to the scene of festivity.
The moment I entered the room, I felt that I was welcome. Not a single
look of surprise, not a glance of more than ordinary attention, denoted
me as a stranger, or an unexpected guest. The gentlemen nearest the
door, bowed as they opened a passage for me through the crowd, in
which for a time I mingled, apparently unnoticed. At length a young
gentleman adorned with a large nosegay approached me, invited me
to join the dancers, and after inquiring my name, introduced me to
several females, among whom I had no difficulty in selecting a graceful
partner. I was passionately fond of dancing, so that readily imbibing the
joyous spirit of those around me, I advanced rapidly in their estimation.
The native ease and elegance of the females, reared in the wilderness,
and unhackneyed in the forms of society, surprised and delighted me,
as much as the amiable frankness of all classes. By and by the

You might also like