Pandas in Scikit-Learn

Introduction to Pandas in Scikit-Learn
Pandas and Scikit-Learn are two fundamental libraries in the Python ecosystem for data science and
machine learning. While Pandas provides powerful data manipulation tools, Scikit-Learn is a
comprehensive library for machine learning that includes tools for model selection, preprocessing,
and evaluation. Used together, the two cover the path from raw data to a trained model: Pandas
prepares the data, and Scikit-Learn estimators accept the resulting DataFrames directly.
Key Features of Pandas
1. Data Structures:
   • Series: One-dimensional labeled array capable of holding any data type.
   • DataFrame: Two-dimensional labeled data structure with columns of potentially different types.
2. Data Manipulation:
   • Cleaning: Handling missing data, filtering, and reshaping data.
   • Transformation: Applying functions, merging, and grouping data.
3. Integration: Seamlessly integrates with other data science libraries, including Scikit-Learn.
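The two data structures listed above can be illustrated with a small sketch. The names and values here (ages, people, the city column) are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
ages = pd.Series([34, 28, 45], index=["ann", "bob", "cara"], name="age")

# A DataFrame is a table whose columns may hold different types.
people = pd.DataFrame(
    {
        "age": [34, 28, 45],
        "city": ["Oslo", "Lima", "Oslo"],
        "subscribed": [True, False, True],
    },
    index=["ann", "bob", "cara"],
)

print(ages["bob"])                             # label-based lookup on a Series
print(people.dtypes)                           # one dtype per column
print(people.groupby("city")["age"].mean())    # grouping, as listed under Transformation
```

The same label-based indexing carries over to the feature matrices passed to Scikit-Learn later on.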
Data Preparation with Pandas
Effective machine learning requires thorough data preparation. Pandas excels in this aspect by
offering a range of tools to clean, transform, and manipulate data.
Loading Data
    import pandas as pd

    # Load dataset
    df = pd.read_csv('data.csv')
    print(df.head())
Handling Missing Values
    # Option 1: fill missing numeric values with the mean of each column
    # (numeric_only=True keeps text columns from raising an error)
    df = df.fillna(df.mean(numeric_only=True))

    # Option 2: drop rows that still contain any missing values
    df = df.dropna()
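To make the two strategies concrete, here is a minimal sketch on a small hand-made table. The frame df_demo and its columns are hypothetical; the point is only that mean imputation applies to numeric columns, while dropping rows affects every column.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {
        "height": [1.70, np.nan, 1.82, 1.65],
        "weight": [68.0, 80.0, np.nan, 55.0],
        "city": ["Oslo", None, "Lima", "Oslo"],
    }
)

# Count missing values per column before choosing a strategy.
print(df_demo.isna().sum())

# Option A: fill numeric columns with their own mean; text columns are left alone.
numeric_cols = df_demo.select_dtypes(include="number").columns
filled = df_demo.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Option B: drop every row that has at least one missing value.
dropped = df_demo.dropna()

print(filled)
print(dropped)
```

Filling keeps all four rows but leaves the missing city untouched, whereas dropping keeps only the fully populated rows.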
Feature Engineering
1. Creating New Features
    # Creating a new feature based on existing ones
    df['new_feature'] = df['feature1'] * df['feature2']
2. Encoding Categorical Variables

    # One-hot encoding
    df = pd.get_dummies(df, columns=['categorical_feature'])
3. Label Encoding
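    from sklearn.preprocessing import LabelEncoder

    # Run this instead of (or before) the one-hot step above, which replaces
    # the original 'categorical_feature' column with indicator columns.
    # Note: LabelEncoder is intended for target labels; the integer codes it
    # assigns imply an ordering that may not be meaningful for a feature.
    le = LabelEncoder()
    df['label_encoded_feature'] = le.fit_transform(df['categorical_feature'])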
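The two encodings can be compared side by side. The sketch below uses a hypothetical color column and swaps in OrdinalEncoder, which is not used in the original text; it is Scikit-Learn's encoder for feature columns, whereas LabelEncoder is documented as a target encoder.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_demo = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df_demo, columns=["color"])
print(list(one_hot.columns))

# Ordinal encoding: one integer code per category (categories are sorted).
# OrdinalEncoder expects 2-D input, hence the double brackets.
encoder = OrdinalEncoder()
df_demo["color_code"] = encoder.fit_transform(df_demo[["color"]]).ravel()
print(encoder.categories_[0])
print(df_demo)
```

One-hot encoding is usually the safer choice for linear models, since integer codes suggest an order between categories that may not exist.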
Using Pandas with Scikit-Learn
Scikit-Learn is designed to work smoothly with Pandas DataFrames, enabling easy transitions from
data preparation to model training and evaluation.
Splitting Data
    from sklearn.model_selection import train_test_split

    # Define features and target
    X = df.drop('target', axis=1)
    y = df['target']

    # Split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
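A small sketch of what the split returns when given Pandas objects. The data is made up, and the stratify=y argument is an addition not present in the original call; it asks for the class proportions to be preserved in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df_demo = pd.DataFrame(
    {
        "feature1": range(10),
        "feature2": [x * 0.5 for x in range(10)],
        "target": [0, 1] * 5,
    }
)
X = df_demo.drop(columns="target")
y = df_demo["target"]

# stratify=y keeps the 50/50 class balance in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(type(X_train).__name__)               # the split returns DataFrames, not arrays
print(X_test.index.equals(y_test.index))    # features and target stay aligned by index
```

Because the original row labels are preserved, a prediction can always be traced back to the row in df_demo that produced it.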
Preprocessing with Scikit-Learn and Pandas
1. Scaling Features
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    # Fit on the training data only, then apply the same scaling to the test data
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
2. Pipelines
Using pipelines to streamline preprocessing and model training:
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('classifier', LogisticRegression())
    ])
    pipeline.fit(X_train, y_train)
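The pipeline above assumes every column is numeric. When a DataFrame mixes numeric and categorical columns, one possible approach is to route columns by name with ColumnTransformer. This is a sketch, not part of the original text: ColumnTransformer, OneHotEncoder, and the age and city columns are all introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df_demo = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 51, 46, 38, 29, 60],
        "city": ["Oslo", "Lima", "Oslo", "Lima", "Oslo", "Lima", "Oslo", "Lima"],
        "target": [0, 0, 0, 1, 1, 1, 0, 1],
    }
)
X = df_demo[["age", "city"]]
y = df_demo["target"]

# Numeric columns: impute missing values, then scale.
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Columns are selected by their DataFrame names.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X, y)
predictions = pipeline.predict(X)
print(predictions)
```

Keeping the encoding inside the pipeline means the categories are learned from the training data only, and handle_unknown="ignore" lets prediction proceed when a new category appears.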
Model Training and Evaluation
1. Training a Model
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier()
    model.fit(X_train, y_train)
2. Evaluating a Model
    from sklearn.metrics import accuracy_score, classification_report

    # Predicting on test set
    y_pred = model.predict(X_test)

    # Evaluating model performance
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
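Evaluation results can themselves be held in DataFrames, which makes them easier to sort, filter, or export. The sketch below uses hand-written labels and predictions rather than a fitted model; confusion_matrix is an addition not used in the original text.

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions for six test rows.
y_test = pd.Series([0, 1, 1, 0, 1, 0])
y_pred = pd.Series([0, 1, 0, 0, 1, 1])

# output_dict=True returns nested dicts that convert directly to a table.
report = pd.DataFrame(classification_report(y_test, y_pred, output_dict=True)).T
print(report.round(2))

# A labeled confusion matrix: rows are actual classes, columns are predictions.
labels = [0, 1]
cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=labels),
    index=[f"actual_{label}" for label in labels],
    columns=[f"predicted_{label}" for label in labels],
)
print(cm)
```

Labeling the rows and columns removes the usual ambiguity about which axis of the confusion matrix holds the true classes.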
Advanced Techniques
Cross-Validation
    from sklearn.model_selection import cross_val_score

    # Perform cross-validation
    cv_scores = cross_val_score(model, X, y, cv=5)
    print('Cross-validation scores:', cv_scores)
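When preprocessing is involved, one common pattern is to cross-validate a whole pipeline so that scaling is refit on each training fold. The sketch below assumes synthetic data from make_classification and uses cross_validate, a sibling of cross_val_score that can report several metrics at once; neither appears in the original text.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a prepared dataset.
features, labels = make_classification(n_samples=200, n_features=5, random_state=42)
X = pd.DataFrame(features, columns=[f"feature{i}" for i in range(5)])
y = pd.Series(labels, name="target")

# The scaler is refit inside every fold, so no test-fold statistics leak in.
model = make_pipeline(StandardScaler(), LogisticRegression())
results = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1"])

# One row per fold; summarize with the mean and standard deviation.
scores = pd.DataFrame(results)
print(scores[["test_accuracy", "test_f1"]].agg(["mean", "std"]))
```

Reporting the spread across folds alongside the mean gives a rough sense of how stable the estimate is.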
Grid Search for Hyperparameter Tuning
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [None, 10, 20]
    }
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    print
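After fitting, a grid search is typically inspected through attributes such as best_params_, best_score_, and cv_results_. The following is a sketch under assumptions: the data is synthetic, and the grid is deliberately smaller than the one above so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split.
features, labels = make_classification(n_samples=120, n_features=5, random_state=42)
X_train = pd.DataFrame(features, columns=[f"feature{i}" for i in range(5)])
y_train = pd.Series(labels, name="target")

# A reduced grid (illustrative values only).
param_grid = {"n_estimators": [10, 25], "max_depth": [None, 3]}
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)              # best parameter combination found
print(round(grid_search.best_score_, 3))     # its mean cross-validated score

# cv_results_ converts to a DataFrame with one row per parameter combination.
results = pd.DataFrame(grid_search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]].sort_values("rank_test_score"))
```

By default the search refits the best configuration on the full training data, so grid_search.predict can be called directly afterwards.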
