
Introduction to Pandas in Scikit-Learn

Pandas and Scikit-Learn are two fundamental libraries in the Python ecosystem for data science and
machine learning. Pandas provides data manipulation tools built around labeled tables, while
Scikit-Learn is a machine learning library that includes tools for preprocessing, model selection,
and evaluation. Used together, they cover the path from raw data to a trained model: Pandas prepares
the data, and Scikit-Learn consumes the result.

Key Features of Pandas

1. Data Structures:

• Series: One-dimensional labeled array capable of holding any data type.

• DataFrame: Two-dimensional labeled data structure with columns of potentially different types.

2. Data Manipulation:

• Cleaning: Handling missing data, filtering, and reshaping data.

• Transformation: Applying functions, merging, and grouping data.

3. Integration: Seamlessly integrates with other data science libraries, including Scikit-Learn.
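The structures and operations listed above can be sketched on a small, made-up table. The names `s`, `df_demo`, `city`, and `temp`, and all values, are illustrative assumptions rather than part of any dataset used in this document.

```python
import pandas as pd

# A Series: one-dimensional labeled array (values are made up for illustration)
s = pd.Series([1.5, 2.5, 4.0], index=["a", "b", "c"])

# A DataFrame: columns may hold different types
df_demo = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "temp": [3.0, None, 18.0, 20.0],
})

# Cleaning: drop rows with a missing temperature
clean = df_demo.dropna(subset=["temp"])

# Transformation: group by city and average the temperature
mean_temp = clean.groupby("city")["temp"].mean()
print(mean_temp)
```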

Data Preparation with Pandas

Effective machine learning requires thorough data preparation. Pandas excels in this aspect by
offering a range of tools to clean, transform, and manipulate data.

Loading Data

import pandas as pd

# Load dataset
df = pd.read_csv('data.csv')
print(df.head())
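Since 'data.csv' is a placeholder file name, the following sketch reads the same kind of data from an in-memory string and then checks its shape, column types, and missing values. The CSV contents and the name `df_demo` are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for 'data.csv': a tiny in-memory CSV with made-up values
csv_text = "feature1,feature2,target\n1.0,10,0\n2.0,,1\n3.0,30,0\n"
df_demo = pd.read_csv(io.StringIO(csv_text))

# Inspect shape, column types, and missing values per column
print(df_demo.shape)
print(df_demo.dtypes)
missing_per_column = df_demo.isna().sum()
print(missing_per_column)
```

Checking the missing-value counts first helps decide whether to fill or drop in the next step.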

Handling Missing Values

# Fill missing values in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))

# Drop rows that still contain missing values
df = df.dropna()
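A mean only makes sense for numeric columns, so text columns usually need a different rule. One possible approach, sketched here on made-up data (the columns `age` and `color` are hypothetical), is to fill numeric columns with the mean and categorical columns with the most frequent value.

```python
import numpy as np
import pandas as pd

# Made-up data with one numeric and one categorical column
df_demo = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "color": ["red", None, "red"],
})

# Numeric column: fill with the column mean
df_demo["age"] = df_demo["age"].fillna(df_demo["age"].mean())

# Categorical column: fill with the most frequent value (mode)
df_demo["color"] = df_demo["color"].fillna(df_demo["color"].mode()[0])
print(df_demo)
```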

Feature Engineering

1. Creating New Features

# Creating a new feature based on existing ones
df['new_feature'] = df['feature1'] * df['feature2']

2. Encoding Categorical Variables

# One-hot encoding
df = pd.get_dummies(df, columns=['categorical_feature'])
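To see what one-hot encoding produces, here is a minimal sketch on a made-up frame. The column names `value` and `color` are assumptions; `dtype=int` is passed so the new columns hold 0/1, because the default output type differs between Pandas versions.

```python
import pandas as pd

# Made-up data with one categorical column
df_demo = pd.DataFrame({"value": [1, 2, 3], "color": ["red", "blue", "red"]})

# Each category becomes its own 0/1 column named <column>_<category>
encoded = pd.get_dummies(df_demo, columns=["color"], dtype=int)
print(encoded)
```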

3. Label Encoding

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['label_encoded_feature'] = le.fit_transform(df['categorical_feature'])
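LabelEncoder is intended for target labels and assigns integers in sorted order. For an input feature with a meaningful order, Scikit-Learn's OrdinalEncoder may be a better fit. This is an alternative to the snippet above, not a replacement the source prescribes. The column `size` and its category order are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up ordered categorical feature
df_demo = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Explicit category order (an assumption made for this illustration)
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])

# The encoder expects 2D input, hence the double brackets; ravel() flattens the result
df_demo["size_encoded"] = encoder.fit_transform(df_demo[["size"]]).ravel()
print(df_demo)
```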

Using Pandas with Scikit-Learn

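Scikit-Learn estimators and transformers accept Pandas DataFrames directly as input, so prepared
data can pass straight into model training and evaluation. Most transformers return NumPy arrays
by default, so column labels are not carried through unless they are re-attached.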

Splitting Data

from sklearn.model_selection import train_test_split

# Define features and target
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
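When classes are imbalanced, a purely random split can leave one side with few or no examples of the rare class. A possible refinement, sketched on made-up data, is the `stratify` argument of `train_test_split`. The frame `df_demo` and the 50/50 split size are assumptions chosen to keep the example small.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 8 rows of class 0, 2 rows of class 1
df_demo = pd.DataFrame({"feature1": range(10), "target": [0] * 8 + [1] * 2})
X_demo = df_demo.drop("target", axis=1)
y_demo = df_demo["target"]

# stratify=y_demo keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=42, stratify=y_demo
)
print(y_tr.value_counts())
print(y_te.value_counts())
```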

Preprocessing with Scikit-Learn and Pandas

1. Scaling Features

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the training set only, then apply the same scaling to the test set
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
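By default the scaler returns a NumPy array, so the column names and row index of the DataFrame are lost. One way to keep them, sketched here with a made-up frame named `X_tr`, is to wrap the result in a new DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up training data with a non-default index
X_tr = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]}, index=[7, 8, 9])

scaler = StandardScaler()

# fit_transform returns a NumPy array, so column names and index are re-attached
X_tr_scaled = pd.DataFrame(
    scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
print(X_tr_scaled)
```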

2. Pipelines

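A pipeline chains preprocessing steps and a final estimator into a single object, so the same
transformations are applied consistently during fitting and prediction: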

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train)
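The pipeline above assumes every column is numeric. When a DataFrame mixes numeric and categorical columns, one option is Scikit-Learn's ColumnTransformer, which selects columns by name. The following is a sketch under assumed data: the columns `age` and `city`, the labels, and the names `preprocess` and `model_demo` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data with one numeric and one categorical column
X_demo = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, 41.0, 30.0],
    "city": ["Oslo", "Rome", "Oslo", "Rome", "Oslo", "Rome"],
})
y_demo = [0, 1, 0, 1, 0, 1]

# Numeric columns: impute missing values, then scale
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Route each group of columns (selected by name) to its own preprocessing
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

model_demo = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model_demo.fit(X_demo, y_demo)
predictions = model_demo.predict(X_demo)
print(predictions)
```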

Model Training and Evaluation

1. Training a Model

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)

2. Evaluating a Model

from sklearn.metrics import accuracy_score, classification_report

# Predicting on test set
y_pred = model.predict(X_test)

# Evaluating model performance
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
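Accuracy alone hides which classes are being confused. As a small complement, the sketch below computes a confusion matrix and labels it with a DataFrame. The labels in `y_true` and `y_hat` are made up for illustration and do not come from any model in this document.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions, for illustration only
y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes
cm = pd.DataFrame(
    confusion_matrix(y_true, y_hat, labels=[0, 1]),
    index=["true_0", "true_1"],
    columns=["pred_0", "pred_1"],
)
print(acc)
print(cm)
```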

Advanced Techniques

Cross-Validation

from sklearn.model_selection import cross_val_score

# Perform cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)
print('Cross-validation scores:', cv_scores)
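If preprocessing such as scaling is fitted on the full dataset before cross-validation, information from the validation folds can leak into training. A common way to avoid this is to cross-validate a pipeline instead of a bare model. The sketch below assumes synthetic data from `make_classification` in place of `X` and `y`, with hypothetical column names `f0` to `f4`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for X and y
X_arr, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

# The scaler is re-fitted on the training part of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```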

Grid Search for Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20]
}
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
print
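After fitting, a GridSearchCV object exposes the best configuration it found along with per-candidate scores. A minimal sketch follows, assuming synthetic data and a deliberately smaller grid than the one above so that it runs quickly. The names `search`, `param_grid_demo`, and `results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for X_train and y_train
X_demo, y_demo = make_classification(n_samples=60, n_features=4, random_state=0)

# A deliberately small grid so the sketch runs quickly
param_grid_demo = {"n_estimators": [10, 20], "max_depth": [None, 3]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid=param_grid_demo, cv=3
)
search.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated score
print(search.best_params_)
print(search.best_score_)

# cv_results_ is a dict of arrays, which loads directly into a DataFrame
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

Loading `cv_results_` into a DataFrame makes it easy to sort and compare candidates with the same Pandas tools used for data preparation.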
