Day 18 - Numpy


Day 18: Array Masking and Filtering

Applying masks to arrays to filter data


Using Boolean arrays for advanced data manipulation

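On Day 18, we focus on array masking and filtering in NumPy. A mask is a Boolean array, usually produced by a comparison such as arr > 30, that marks which elements of an array satisfy a condition. Indexing an array with a mask selects the matching elements, and assigning through a mask modifies only those elements. Let's explore how to apply masks to arrays and use Boolean arrays for advanced data manipulation: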

Applying Masks to Arrays to Filter Data:


Example 1: Masking with a Condition

In [1]: import numpy as np

# Create a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50])

# Create a mask for values greater than 30
mask = arr > 30

# Apply the mask to filter the array
filtered_arr = arr[mask]

print("Original Array:", arr)
print("Filtered Array:", filtered_arr)

Original Array: [10 20 30 40 50]
Filtered Array: [40 50]
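
To make the mask itself less abstract, the following minimal sketch inspects it directly. It reuses the array from Example 1; the name n_matches is introduced here for illustration only. It also shows that Boolean indexing returns a copy, so editing the filtered result does not change the original array.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
mask = arr > 30

# The mask has the same shape as arr and holds one Boolean per element
print(mask)        # [False False False  True  True]
print(mask.dtype)  # bool

# Count how many elements satisfy the condition
n_matches = np.count_nonzero(mask)
print(n_matches)   # 2

# Boolean indexing returns a copy: editing it leaves arr untouched
filtered_arr = arr[mask]
filtered_arr[0] = 999
print(filtered_arr)  # [999  50]
print(arr)           # [10 20 30 40 50]
```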

Example 2: Applying Multiple Masks


In [2]: import numpy as np

# Create a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50])

# Create masks for values greater than 20 and less than 40
mask1 = arr > 20
mask2 = arr < 40

# Combine masks using the element-wise AND operator
combined_mask = mask1 & mask2

# Apply the combined mask to filter the array
filtered_arr = arr[combined_mask]

print("Original Array:", arr)
print("Filtered Array:", filtered_arr)

Original Array: [10 20 30 40 50]
Filtered Array: [30]
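
Example 2 uses only &. As a short sketch of the other element-wise operators, the block below also shows | (OR) and ~ (NOT) on the same array. The variable names are illustrative. Note the parentheses around each comparison: & and | bind more tightly than > and <, and Python's and/or keywords do not work element-wise on arrays.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# AND: both conditions must hold (parentheses are required)
between = arr[(arr > 20) & (arr < 40)]

# OR: either condition may hold
outside = arr[(arr <= 20) | (arr >= 40)]

# NOT: invert a mask
not_thirty = arr[~(arr == 30)]

print(between)     # [30]
print(outside)     # [10 20 40 50]
print(not_thirty)  # [10 20 40 50]

# The keyword `and` asks for a single truth value of a whole array,
# which NumPy refuses for arrays with more than one element
and_failed = False
try:
    arr[(arr > 20) and (arr < 40)]
except ValueError:
    and_failed = True
print(and_failed)  # True
```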

Using Boolean Arrays for Advanced Data Manipulation:


Example 3: Broadcasting with Boolean Arrays

In [3]: import numpy as np

# Create a 1D NumPy array
arr = np.array([1, 2, 3, 4, 5])

# Create a mask for even values
mask = arr % 2 == 0

# Replace even values with -1 in place
# (the scalar -1 is broadcast to every position where the mask is True)
arr[mask] = -1

print("Modified Array:", arr)

Modified Array: [ 1 -1  3 -1  5]
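
Assigning through a mask changes the array in place. When the original values should be preserved, one alternative is the three-argument form of np.where, sketched below; this is a swapped-in technique rather than the one used in Example 3, and the name replaced is illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Where the condition is True take -1, otherwise keep the original value.
# np.where builds a new array, so arr itself is not modified.
replaced = np.where(arr % 2 == 0, -1, arr)

print(replaced)  # [ 1 -1  3 -1  5]
print(arr)       # [1 2 3 4 5]
```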

Example 4: Using Boolean Arrays for Indexing

In [5]: import numpy as np

# Create a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50])

# Create a mask for values greater than 25
mask = arr > 25

# Get the indices of the True values in the mask
indices = np.where(mask)[0]

print("Indices of Values Greater than 25:", indices)

Indices of Values Greater than 25: [2 3 4]
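
The [0] in Example 4 is needed because np.where with a single argument returns a tuple containing one index array per dimension. The sketch below, using a hypothetical 2D array named grid, shows what that looks like with two dimensions and how np.flatnonzero gives positions in the flattened array instead.

```python
import numpy as np

# Hypothetical 2D data for illustration
grid = np.array([[10, 20, 30],
                 [40, 50, 60]])
mask = grid > 25

# One index array per dimension: row indices and column indices
rows, cols = np.nonzero(mask)  # same result as np.where(mask)
print(rows)  # [0 1 1 1]
print(cols)  # [2 0 1 2]

# Positions of the True values in the flattened (row-major) array
flat = np.flatnonzero(mask)
print(flat)  # [2 3 4 5]

# Indexing a 2D array with a 2D mask returns the matches as a 1D array
print(grid[mask])  # [30 40 50 60]
```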

In these examples:

* We created masks using Boolean conditions to filter elements in an array.
* We applied masks to filter and manipulate array data.
* We used Boolean arrays for advanced data manipulation, including broadcasting and indexing.

Masking and filtering are core techniques for data manipulation and analysis. Because a Boolean mask is evaluated element-wise over the whole array, it lets you select or modify elements by condition without writing an explicit loop.

🌐 Real-World Scenarios:

1. Data Cleaning and Preprocessing:

Use Case: In data preprocessing for machine learning, you often need to separate relevant records from irrelevant or noisy ones.
NumPy Application: Boolean indexing helps in this process. For instance, in a dataset of customer reviews, you can use a mask to isolate reviews with low ratings, either to inspect them or to exclude them from your analysis.
Example: Suppose you have a dataset of product reviews and you want to isolate the reviews rated lower than 3 stars. You can create a mask from a condition on the ratings and apply the same mask to both the reviews and the ratings, since the two arrays are aligned element by element.

In [7]: import numpy as np

# Sample data: an array of customer reviews and star ratings
reviews = np.array(["Great product!", "Not so good...", "Excellent!", "Average.", "Terrible experience."])
ratings = np.array([5, 2, 5, 3, 1])

# Create a mask for reviews with ratings less than 3 stars
low_rating_mask = ratings < 3

# Use the same mask on both aligned arrays to select the low-rated reviews
filtered_reviews = reviews[low_rating_mask]
filtered_ratings = ratings[low_rating_mask]

# Display the filtered reviews and ratings
for review, rating in zip(filtered_reviews, filtered_ratings):
    print(f"Review: '{review}' | Rating: {rating} stars")

Review: 'Not so good...' | Rating: 2 stars
Review: 'Terrible experience.' | Rating: 1 stars

By using Boolean indexing and masks, you've isolated the reviews with low ratings. From here you can examine them directly or invert the mask to exclude them and keep the remaining reviews. This is a common preprocessing step in machine learning and data analysis workflows.
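
If the goal is to drop the low-rated reviews rather than inspect them, the mask can be inverted with ~. The sketch below reuses the sample data from the example above; kept_reviews, kept_ratings, and has_keyword are names introduced here for illustration. The keyword mask is one possible approach, using np.char.find, which returns -1 for elements where the substring is absent.

```python
import numpy as np

reviews = np.array(["Great product!", "Not so good...", "Excellent!",
                    "Average.", "Terrible experience."])
ratings = np.array([5, 2, 5, 3, 1])

low_rating_mask = ratings < 3

# Invert the mask to keep everything that is NOT low-rated
kept_reviews = reviews[~low_rating_mask]
kept_ratings = ratings[~low_rating_mask]

for review, rating in zip(kept_reviews, kept_ratings):
    print(f"Review: '{review}' | Rating: {rating} stars")

# A keyword-based mask: True where the review text contains "good"
has_keyword = np.char.find(reviews, "good") >= 0
print(reviews[has_keyword])
```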

2. Financial Data Analysis:

Use Case: In financial analysis, you might have a dataset of daily stock prices.
NumPy Application: Boolean indexing helps in selecting the days that meet a specific condition.
Example: You can use Boolean indexing to select the days on which the stock price exceeded a certain threshold, helping you identify potentially significant market events.

In [10]: import numpy as np

# Sample data: an array of dates and corresponding stock prices
dates = np.array(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'])
prices = np.array([100.0, 102.0, 105.5, 103.2, 101.8])

# Create a mask for days when the stock price exceeded 105
significant_event_mask = prices > 105

# Use the mask to filter dates and prices
significant_dates = dates[significant_event_mask]
significant_prices = prices[significant_event_mask]

# Display the dates and corresponding stock prices for significant events
for date, price in zip(significant_dates, significant_prices):
    print(f"Date: {date} | Stock Price: ${price:.2f}")

Date: 2023-01-03 | Stock Price: $105.50

Using Boolean indexing, you've selected the days on which the stock price exceeded the specified threshold, helping you identify potentially significant market events. This can be a useful starting point for investment decisions or further financial analysis.
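
The example above finds days on which the price was above the threshold. To find the days on which the price actually crossed it, one possible approach is to compare each day's mask value with the previous day's. The sketch below assumes the same sample data; threshold, above, crossed_up, and crossed_down are illustrative names.

```python
import numpy as np

dates = np.array(['2023-01-01', '2023-01-02', '2023-01-03',
                  '2023-01-04', '2023-01-05'])
prices = np.array([100.0, 102.0, 105.5, 103.2, 101.8])
threshold = 105.0

above = prices > threshold

# Compare each day (from the second onward) with the day before it
crossed_up = above[1:] & ~above[:-1]    # above today, not above yesterday
crossed_down = ~above[1:] & above[:-1]  # not above today, above yesterday

# The shifted masks line up with dates[1:], not with dates
print("Crossed above:", dates[1:][crossed_up])    # ['2023-01-03']
print("Crossed below:", dates[1:][crossed_down])  # ['2023-01-04']
```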

3. Epidemiology and Health Analysis:


Use Case: When analyzing health data, you can filter patient records based on specific medical conditions.
NumPy Application: Boolean indexing allows you to select only those who have been diagnosed with a particular disease for further
analysis.
Example: Imagine you have a dataset of patient health records, and you want to analyze data only for patients with a specific medical
condition, like diabetes. You can create a mask based on the condition and apply it to filter the relevant patient records.

In [11]: import numpy as np

# Sample data: an array of patient IDs, medical conditions (e.g., 'diabetes' or 'none'), and age
patient_ids = np.array([101, 102, 103, 104, 105])
medical_conditions = np.array(['diabetes', 'none', 'diabetes', 'none', 'diabetes'])
ages = np.array([45, 62, 38, 55, 60])

# Create a mask for patients with diabetes
diabetes_mask = medical_conditions == 'diabetes'

# Use the mask to filter patient records
diabetes_patients = patient_ids[diabetes_mask]
diabetes_conditions = medical_conditions[diabetes_mask]
diabetes_ages = ages[diabetes_mask]

# Display the information of patients with diabetes
for patient_id, condition, age in zip(diabetes_patients, diabetes_conditions, diabetes_ages):
    print(f"Patient ID: {patient_id} | Condition: {condition} | Age: {age} years")

Patient ID: 101 | Condition: diabetes | Age: 45 years
Patient ID: 103 | Condition: diabetes | Age: 38 years
Patient ID: 105 | Condition: diabetes | Age: 60 years

Using Boolean indexing, you've selected the records of patients diagnosed with diabetes, allowing you to focus your analysis on this medical condition. This can be useful for research, treatment planning, or identifying trends related to diabetes.
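
Because the three arrays are aligned, masks built from different arrays can be combined, and a mask can feed directly into a summary statistic. The following sketch assumes the same sample data; the age cutoff of 45 and the names older_diabetes_mask, mean_age_diabetes, and mean_age_other are hypothetical choices made for illustration.

```python
import numpy as np

patient_ids = np.array([101, 102, 103, 104, 105])
medical_conditions = np.array(['diabetes', 'none', 'diabetes', 'none', 'diabetes'])
ages = np.array([45, 62, 38, 55, 60])

diabetes_mask = medical_conditions == 'diabetes'

# Combine a condition on one array with a condition on another
older_diabetes_mask = diabetes_mask & (ages >= 45)
print(patient_ids[older_diabetes_mask])  # [101 105]

# Compare the mean age of the two groups using the mask and its inverse
mean_age_diabetes = ages[diabetes_mask].mean()
mean_age_other = ages[~diabetes_mask].mean()
print(f"Mean age with diabetes: {mean_age_diabetes:.1f}")  # 47.7
print(f"Mean age without: {mean_age_other:.1f}")           # 58.5
```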
