Better Data Science - Make Synthetic Datasets With Python

Uploaded by

Derek Degbedzui

Better Data Science | Make Synthetic Datasets with Python
● Library imports
● rcParams is only used for plot styling (it hides the top and right axes spines)
In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False

Make a synthetic dataset

● 1000 data points described by 2 features

● Balanced (approximately 50:50) class distribution; the default flip_y=0.01 can shift the counts slightly
● Binary target variable; each class consists of a single cluster
● Set random_state (42 here) if you want reproducible results
In [2]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']
# 5 random rows
df.sample(5)
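● A quick way to check the class distribution claimed above is to count the labels. The sketch below rebuilds the same dataset so it runs on its own; the names class_counts and class_shares are illustrative and not part of the original notebook:

```python
import pandas as pd
from sklearn.datasets import make_classification

# Same settings as the cell above
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']

# Absolute and relative number of samples per class
class_counts = df['y'].value_counts().sort_index()
class_shares = df['y'].value_counts(normalize=True).sort_index()

print(class_counts)
print(class_shares.round(3))
```

The shares should sit close to 0.5 each; small deviations are expected because a small fraction of labels is reassigned at random by default.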
Visualization

● The plot() function visualizes a synthetic dataset:

In [3]:
def plot(df: pd.DataFrame, x1: str, x2: str, y: str, title: str = '',
         save: bool = False, figname: str = 'figure.png') -> None:
    plt.figure(figsize=(14, 7))
    # One scatter call per class so each class gets its own color and legend entry
    plt.scatter(x=df[df[y] == 0][x1], y=df[df[y] == 0][x2], label='y = 0')
    plt.scatter(x=df[df[y] == 1][x1], y=df[df[y] == 1][x2], label='y = 1')
    plt.title(title, fontsize=20)
    plt.legend()
    # Save before show(), since show() can clear the current figure
    if save:
        plt.savefig(figname, dpi=300, bbox_inches='tight', pad_inches=0)
    plt.show()
In [4]:
plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes')

Adding noise

● You can use the flip_y parameter to add label noise (the default is 0.01)

● From the docs:
○ The fraction of samples whose class is assigned randomly. Larger
values introduce noise in the labels and make the classification
task harder. Note that the default setting flip_y > 0 might lead to
less than n_classes in y in some cases.
In [5]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    flip_y=0.15,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Added noise')
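● "Assigned randomly" does not mean "flipped": a randomly reassigned label can land on its original class. The NumPy sketch below illustrates the idea from the docs quote; it is not scikit-learn's implementation, and all variable names are made up for this example:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_classes, flip_y = 10_000, 2, 0.15

# Stand-in for clean labels (hypothetical data, not from make_classification)
y_clean = rng.integers(n_classes, size=n_samples)
y_noisy = y_clean.copy()

# Select roughly a flip_y fraction of samples and give them a random class
flip_mask = rng.uniform(size=n_samples) < flip_y
y_noisy[flip_mask] = rng.integers(n_classes, size=flip_mask.sum())

selected_share = flip_mask.mean()
changed_share = (y_noisy != y_clean).mean()

print(f'Selected for reassignment: {selected_share:.3f}')
print(f'Labels actually changed:   {changed_share:.3f}')
```

With two classes, about half of the selected samples keep their original label, so flip_y=0.15 changes roughly 7.5% of the labels under this scheme.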

Add class imbalance

● A perfectly balanced (50:50) class distribution is rare in practice

● You can use the weights parameter to control the class proportions
○ Passing weights=[0.95] gives the y = 0 class about 95% of the samples, which leaves about 5% for the y = 1 class
In [6]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95],
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 1)')

● You can do the opposite and make y = 0 the minority class with weights=[0.05]:

In [7]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.05],
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 0)')
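● The effect of weights is easy to confirm numerically. The sketch below compares the share of the y = 0 class for both settings; the helper class_zero_share is hypothetical and only exists for this illustration:

```python
import numpy as np
from sklearn.datasets import make_classification


def class_zero_share(weight: float) -> float:
    """Return the fraction of samples labeled 0 for a given first-class weight."""
    _, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        weights=[weight],
        random_state=42
    )
    return float(np.mean(y == 0))


share_majority = class_zero_share(0.95)  # y = 0 is the majority class
share_minority = class_zero_share(0.05)  # y = 0 is the minority class

print(f'weights=[0.95] -> share of y = 0: {share_majority:.3f}')
print(f'weights=[0.05] -> share of y = 0: {share_minority:.3f}')
```

The shares will not match the requested weights exactly, because the default label noise reassigns a small fraction of labels at random.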

Make classification task easier/harder

● You can adjust the class_sep parameter to control class separation (the default is 1.0)
● The higher the value, the more separated the classes are and the easier the classification task becomes
In [8]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=5,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)

df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Make classification easier')
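● To make the task harder instead, lower class_sep below its default. As a rough sketch, the distance between the two class means can serve as a proxy for separation; the helper centroid_distance and the value 0.5 are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification


def centroid_distance(class_sep: float) -> float:
    """Euclidean distance between the mean points of class 0 and class 1."""
    X, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        class_sep=class_sep,
        random_state=42
    )
    mean_0 = X[y == 0].mean(axis=0)
    mean_1 = X[y == 1].mean(axis=0)
    return float(np.linalg.norm(mean_0 - mean_1))


# Harder (0.5), default (1.0), and easier (5.0) settings
distances = {sep: centroid_distance(sep) for sep in (0.5, 1.0, 5.0)}

for sep, dist in distances.items():
    print(f'class_sep={sep}: distance between class means = {dist:.2f}')
```

The distance grows with class_sep, so the clusters overlap more at 0.5 and pull far apart at 5.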
