
Better Data Science | Make Synthetic Datasets with Python
● Library imports
● rcParams is only here for plot styling
In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False

Make a synthetic dataset

● 1000 data points described by 2 features
● Balanced (roughly 50:50) class distribution
● Binary target variable, each class forms a single cluster
● Set random_state (42 here) if you want reproducible results
In [2]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']
# 5 random rows
df.sample(5)
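
As a quick sanity check on the bullets above, the sketch below confirms the shape and class balance of the generated data, and that a fixed random_state reproduces the same arrays. It is only an illustration; the names params, X_again, y_again, same_data and counts are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Same settings as the cell above, kept in one place for reuse
params = dict(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
)

X, y = make_classification(**params, random_state=42)
X_again, y_again = make_classification(**params, random_state=42)

# Same seed, same data
same_data = np.array_equal(X, X_again) and np.array_equal(y, y_again)

# Rows per class; the default flip_y=0.01 can nudge this slightly off 500/500
counts = pd.Series(y).value_counts().sort_index()

print(X.shape)
print(counts)
print('Reproducible:', same_data)
```

If the counts are not exactly 500 and 500, that is most likely the small amount of label noise that make_classification applies by default.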
Visualization

● The plot() function visualizes a synthetic dataset:


In [3]:
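def plot(df: pd.DataFrame, x1: str, x2: str, y: str, title: str = '',
         save: bool = False, figname: str = 'figure.png'):
    """Scatter-plot two features, colored by the binary target column."""
    plt.figure(figsize=(14, 7))
    # One scatter call per class so each gets its own color and legend entry
    plt.scatter(x=df[df[y] == 0][x1], y=df[df[y] == 0][x2], label='y = 0')
    plt.scatter(x=df[df[y] == 1][x1], y=df[df[y] == 1][x2], label='y = 1')
    plt.title(title, fontsize=20)
    plt.legend()
    # Save before show(), otherwise the written figure may be empty
    if save:
        plt.savefig(figname, dpi=300, bbox_inches='tight', pad_inches=0)
    plt.show()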
In [4]:
plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes')

Adding noise

● You can use the flip_y parameter to add label noise
● From the docs:
○ The fraction of samples whose class is assigned randomly. Larger
values introduce noise in the labels and make the classification
task harder. Note that the default setting flip_y > 0 might lead to
less than n_classes in y in some cases.
In [5]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    flip_y=0.15,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Added noise')
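
Because flip_y assigns labels randomly rather than inverting them, a reassigned sample can land back on its original class. With two classes, the share of labels that actually change should therefore be around half of flip_y. The sketch below estimates that share by comparing a noise-free run with a noisy one. It assumes that shuffle=False keeps the rows in the same order across both calls; the names params, y_clean, y_noisy and changed are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification

params = dict(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    shuffle=False,  # assumption: rows stay aligned between the two calls
    random_state=42,
)

_, y_clean = make_classification(**params, flip_y=0.0)
_, y_noisy = make_classification(**params, flip_y=0.15)

# Fraction of samples whose label differs from the noise-free run
changed = np.mean(y_clean != y_noisy)
print(f'Labels changed: {changed:.1%}')
```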


Add class imbalance

● A perfectly balanced (50:50) class distribution is rare in real-world data
● You can use the weights parameter to control the class distribution
○ Passing weights=[0.95] assigns about 95% of the samples to y = 0, leaving roughly 5% for y = 1
In [6]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95],
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 1)')
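
The plot shows the imbalance visually; the sketch below checks it numerically. The weights list gives class proportions in order, and when it has one entry fewer than the number of classes, the last proportion is inferred. The resulting shares may be slightly off 95/5 because of the default label noise. The name shares is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.95],  # share of y = 0; the share of y = 1 is inferred
    random_state=42,
)

# Share of samples per class
shares = np.bincount(y, minlength=2) / len(y)
print(f'y = 0: {shares[0]:.1%}, y = 1: {shares[1]:.1%}')
```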

● You can do the opposite and make y = 0 the minority class:
In [7]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    weights=[0.05],
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Class imbalance (y = 0)')

Make classification task easier/harder


● You can adjust class separation with the class_sep parameter (default 1.0)
● The higher the value, the more separated the classes are, and the easier the classification task
In [8]:
X, y = make_classification(
    n_samples=1000,
    n_features=2,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=5,
    random_state=42
)

df = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
df.columns = ['x1', 'x2', 'y']

plot(df=df, x1='x1', x2='x2', y='y', title='Dataset with 2 classes - Make classification easier')
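
One way to put a number on "easier" is to fit a simple classifier at several class_sep values and compare accuracy. The sketch below uses logistic regression with 5-fold cross-validation. Neither is part of the original notebook; they are just a convenient yardstick, and the class_sep values tried here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = {}
for class_sep in (0.5, 1.0, 5.0):
    X, y = make_classification(
        n_samples=1000,
        n_features=2,
        n_redundant=0,
        n_clusters_per_class=1,
        class_sep=class_sep,
        random_state=42,
    )
    # Mean 5-fold cross-validated accuracy of a plain logistic regression
    scores[class_sep] = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

for class_sep, accuracy in scores.items():
    print(f'class_sep={class_sep}: accuracy={accuracy:.3f}')
```

Accuracy should generally rise with class_sep, though it is unlikely to reach exactly 1.0 because the default flip_y still randomizes a small share of labels.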
