Applied Datascience - Phase3

DEVELOPMENT PART – 1
PROJECT – FUTURE SALES PREDICTION
DETAILED EXPLANATION OF DATASET:

Predicting future sales is a critical task for businesses, as it allows them to
make informed decisions about inventory management, marketing strategies,
and overall business planning. To create a dataset for future sales prediction,
you need to collect and organize various data points that can influence sales.
Below is a detailed explanation of the components of such a dataset:
1. Time Period: Your dataset should span a specific time period, such as
months, quarters, or years. The choice of time granularity depends on the
business's sales cycle and the desired level of prediction accuracy.
2. Sales Data: This is the primary target variable you want to predict. It includes
historical sales figures for each time period, which serve as the basis for
training and testing your prediction model.
3. Features:
- Product Information: Information about the products you sell, such as
product ID, category, brand, and price.
- Store Information: Details about your stores or sales channels, like store ID,
location, size, and any promotions or events.
- Time-related Features: Time-related attributes like day of the week, month,
quarter, holidays, and special events.
- Economic Indicators: Economic factors such as inflation rates, GDP, and
consumer sentiment that can impact sales.
- Competitor Data: Data about competitors, such as their pricing, promotions,
and market share.
- Inventory Data: Information about inventory levels, stockouts, and lead
times for restocking.
- Online Data: For e-commerce businesses, data on website traffic, bounce rates,
and conversion rates can be useful.
4. Lagged Variables: Lagged or historical values of sales and other relevant
features. For example, the sales from the previous month or the same month
in the previous year.
5. Categorical Variables: If your features include categorical data (e.g., product
categories, store locations), you may need to encode them as numerical values
using techniques like one-hot encoding.
6. Data Preprocessing:
- Missing Data Handling: Decide how to deal with missing data. You can impute
missing values, remove the affected rows, or use techniques like time-series
interpolation.
- Outlier Detection: Identify and handle outliers that might skew your
predictions.
7. Feature Engineering: Create new features from the existing ones when the
existing ones are not sufficient for making accurate predictions. For instance,
you can calculate moving averages, seasonal indicators, or growth rates.
8. Data Splitting: Split your dataset into training, validation, and test sets to
assess and tune your model's performance.
9. Scaling and Normalization: Depending on the algorithms used, it may be
necessary to scale or normalize features to ensure they have similar ranges.
10. Time Series Data: If dealing with time series data, consider using
techniques like differencing, smoothing, or decomposition to handle trends
and seasonality.
11. Model Evaluation Metrics: Define the metrics you will use to evaluate the
performance of your sales prediction model, such as Mean Absolute Error
(MAE), Root Mean Square Error (RMSE), or Mean Absolute Percentage Error
(MAPE).
12. Modeling: Use various machine learning or time series forecasting models
like linear regression, decision trees, Random Forest, ARIMA, or deep learning
models like LSTM and GRU, depending on the nature of your data.
13. Hyperparameter Tuning: Optimize the hyperparameters of your chosen
model(s) to improve their performance.
14. Model Validation: Validate your model's performance using the validation
set and fine-tune it if necessary.
15. Testing and Deployment: Test your model on the test set to evaluate its
real-world predictive performance. Once satisfied with the results, deploy the
model for making future sales predictions.
16. Monitoring and Maintenance: Continuously monitor the model's
performance and update it as needed to account for changes in the business
environment.
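Several of the items above (lagged variables, one-hot encoding, feature engineering, and data splitting) can be sketched together in a few lines. The following is a minimal illustration on synthetic monthly data, not a definitive implementation: the column names `sales` and `category`, the helper names `build_features` and `chronological_split`, and the 70/15/15 split fractions are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lagged, rolling, growth and one-hot features to monthly sales data.

    Assumes rows are sorted by time and that the frame has a numeric 'sales'
    column and a categorical 'category' column (hypothetical names).
    """
    out = df.copy()
    out["lag_1"] = out["sales"].shift(1)    # sales of the previous month
    out["lag_12"] = out["sales"].shift(12)  # same month in the previous year
    # Average of the three months before the current one, so the feature
    # never contains the value being predicted.
    out["rolling_mean_3"] = out["sales"].shift(1).rolling(window=3).mean()
    # Growth rate of the previous month relative to the month before it.
    out["growth_prev"] = out["sales"].pct_change().shift(1)
    # One-hot encode the categorical column.
    out = pd.get_dummies(out, columns=["category"], dtype=int)
    # Rows without a full history contain NaN and are dropped.
    return out.dropna()


def chronological_split(df: pd.DataFrame, train_frac=0.7, val_frac=0.15):
    """Split into train/validation/test sets without shuffling."""
    n = len(df)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]


# Synthetic data: three years of monthly sales with a yearly cycle plus noise.
rng = np.random.default_rng(0)
months = pd.date_range("2021-01-01", periods=36, freq="MS")
month_numbers = months.month.to_numpy()
raw = pd.DataFrame(
    {
        "sales": 100 + 10 * np.sin(2 * np.pi * month_numbers / 12) + rng.normal(0, 5, 36),
        "category": np.where(month_numbers % 2 == 0, "A", "B"),
    },
    index=months,
)

features = build_features(raw)
train, val, test = chronological_split(features)
print(features.head())
print(len(train), len(val), len(test))
```

The split is chronological rather than random so that the validation and test sets always lie after the training period, which mirrors how the model would be used to forecast future sales.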
IMPLEMENTATION OF DATASET:
To implement a simple dataset for a future sales prediction program, we can use the following
Python code:
import pandas as pd

# Define the dataset: one year of monthly sales figures
data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'Sales': [120, 150, 110, 180, 160, 170, 140, 190, 130, 160, 140, 150]
}

# Create a DataFrame from the dataset
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
OUTPUT:
Month Sales
0 Jan 120
1 Feb 150
2 Mar 110
3 Apr 180
4 May 160
5 Jun 170
6 Jul 140
7 Aug 190
8 Sep 130
9 Oct 160
10 Nov 140
11 Dec 150
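The month labels above are plain strings, so pandas cannot yet treat the rows as a time series. As a small sketch, the same data can be given a datetime index from which the time-related features described earlier (month number, quarter) are derived. The year 2023 is an assumption made only for illustration, since the dataset does not state one, and the new column names are hypothetical.

```python
import pandas as pd

data = {
    "Month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "Sales": [120, 150, 110, 180, 160, 170, 140, 190, 130, 160, 140, 150],
}
df = pd.DataFrame(data)

YEAR = 2023  # assumed for illustration; the original data gives no year

# Parse "Jan 2023", "Feb 2023", ... into timestamps and use them as the index.
df["Date"] = pd.to_datetime(df["Month"] + f" {YEAR}", format="%b %Y")
df = df.set_index("Date")

# Time-related features derived from the index.
df["MonthNumber"] = df.index.month
df["Quarter"] = df.index.quarter

# Month-over-month change in sales (NaN for the first month).
df["Sales_change"] = df["Sales"].diff()

print(df)
```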
PREPROCESSING OF DATA:
1. Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
2. Load the dataset
df = pd.read_csv('your_dataset.csv')
3. Handle missing values
df = df.dropna()
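4. Convert categorical data to numerical data
# Category codes would number the months alphabetically (Apr = 0), so parse the
# abbreviations into calendar month numbers (Jan = 1, ..., Dec = 12) instead
df['Month'] = pd.to_datetime(df['Month'], format='%b').dt.month
5. Create a time series dataset
# Assumes the CSV also contains a 'Year' column
df['Date'] = pd.to_datetime(dict(year=df['Year'], month=df['Month'], day=1))
df = df.set_index('Date').sort_index()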
6. Create lag features
df = df.assign(Sales_lag1=df['Sales'].shift(1))
7. Create a target variable
df['Target'] = df['Sales'].shift(-1)
# Shifting leaves NaN in the first lag row and the last target row; drop them
df = df.dropna()
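8. Split the dataset into training and testing sets
# shuffle=False keeps the chronological order, so the test set is the most recent data
train, test = train_test_split(df, test_size=0.2, shuffle=False)
9. Scale the data
# Fit the scaler on the training set only to avoid leaking test information
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)
10. Convert the data back into a DataFrame
# Reuse each set's own column names and Date index
train = pd.DataFrame(train_scaled, columns=train.columns, index=train.index)
test = pd.DataFrame(test_scaled, columns=test.columns, index=test.index)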
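Once the data is scaled, predictions come out in scaled units and have to be mapped back to sales figures. One way to do this, shown here as a sketch under stated assumptions rather than as part of the original pipeline, is to keep a separate scaler for the target and call `inverse_transform` on it. The toy values and the names `x_scaler` and `y_scaler` are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for the preprocessed data (values are illustrative).
df = pd.DataFrame(
    {"Sales": [120.0, 150.0, 110.0, 180.0, 160.0, 170.0, 140.0, 190.0, 130.0, 160.0]},
    index=pd.date_range("2023-01-01", periods=10, freq="MS"),
)
df["Sales_lag1"] = df["Sales"].shift(1)
df["Target"] = df["Sales"].shift(-1)
df = df.dropna()

# Chronological 80/20 split.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

feature_cols = ["Sales", "Sales_lag1"]

# Separate scalers for features and target, both fitted on the training set only.
x_scaler = MinMaxScaler().fit(train[feature_cols])
y_scaler = MinMaxScaler().fit(train[["Target"]])

X_train = pd.DataFrame(x_scaler.transform(train[feature_cols]),
                       columns=feature_cols, index=train.index)
X_test = pd.DataFrame(x_scaler.transform(test[feature_cols]),
                      columns=feature_cols, index=test.index)
y_train = y_scaler.transform(train[["Target"]]).ravel()

# Stand-in for model output: any array in scaled target units can be
# converted back to sales units with inverse_transform.
scaled_pred = y_train[:3]
restored = y_scaler.inverse_transform(scaled_pred.reshape(-1, 1)).ravel()
print(restored)

# Test values may fall outside [0, 1] because the scaler only saw training data.
print(X_test)
```

Because the scalers are fitted on the training rows only, a test month with sales above the training maximum is scaled to a value greater than 1. That is expected and is the price of not leaking test information into preprocessing.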
PERFORMING ANALYSIS IN THE DATASET:
Performing analysis on a dataset for future sales prediction involves several steps, including data
exploration, preprocessing, feature engineering, model selection, training, and evaluation. Here's a
step-by-step guide with code examples in Python using popular data science libraries like Pandas,
Scikit-Learn, and Matplotlib. We'll use a simplified example dataset for illustration:
```python
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt
# Load the dataset
data = pd.read_csv('sales_dataset.csv')
# Data Exploration
print(data.head())
print(data.describe())
data.info()  # info() prints its summary directly and returns None
# Data Preprocessing
# Handle missing values if any
data = data.fillna(0)
# Feature Engineering
# Create lag features (e.g., sales from the previous month)
data['lag_1'] = data['sales'].shift(1)
data['lag_2'] = data['sales'].shift(2)
# The first two rows have no lag values; drop them before training
data = data.dropna(subset=['lag_1', 'lag_2'])
# Split data into training and test sets
X = data[['lag_1', 'lag_2']] # Features
y = data['sales'] # Target variable
# shuffle=False keeps the time order, so the model is tested on the most recent periods
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
# Model Selection and Training
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Model Evaluation
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
# Visualization of Predictions
plt.figure(figsize=(12, 6))
plt.plot(y_test.index, y_test.values, label='True Sales', marker='o')
plt.plot(y_test.index, y_pred, label='Predicted Sales', marker='x')
plt.xlabel('Time Period')
plt.ylabel('Sales')
plt.legend()
plt.show()
```
In this code, we perform the following steps:
1. **Data Loading**: Load the dataset, assuming you have a CSV file named 'sales_dataset.csv'.
2. **Data Exploration**: Print the first few rows of the dataset, basic statistics, and a summary of
column types and non-null counts (which reveals missing values).
3. **Data Preprocessing**: Handle missing values (in this example, we fill them with zeros).
4. **Feature Engineering**: Create lag features to capture the influence of previous sales on future
sales.
5. **Data Splitting**: Split the data into training and test sets to evaluate the model.
6. **Model Selection and Training**: Use a Random Forest Regressor as the prediction model and
train it on the training data.
7. **Model Evaluation**: Calculate Mean Absolute Error (MAE), Mean Squared Error (MSE), and
Root Mean Squared Error (RMSE) to assess the model's performance.
8. **Visualization**: Plot the true sales values and the predicted sales values to visually assess the
model's performance.
This code is a simplified example, and in a real-world scenario, you would need to fine-tune the
model, handle more complex features, and perform more extensive data analysis and validation. The
output will include the calculated error metrics and a visualization of the true vs. predicted sales
values, which can help you assess the model's accuracy and make informed decisions for future sales
predictions.
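The fine-tuning and validation mentioned above can be sketched with scikit-learn's `TimeSeriesSplit` and `GridSearchCV`, which evaluate each parameter combination on folds that always lie after the data used for fitting. The example below is a minimal sketch on synthetic data: the series, the small parameter grid, and the naive "last value" baseline are assumptions chosen for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic monthly sales: a yearly cycle plus noise (illustrative only).
rng = np.random.default_rng(42)
n = 60
t = np.arange(n)
data = pd.DataFrame({"sales": 200 + 20 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 5, n)})
data["lag_1"] = data["sales"].shift(1)
data["lag_2"] = data["sales"].shift(2)
data = data.dropna()

X, y = data[["lag_1", "lag_2"]], data["sales"]

# Chronological hold-out: the last 20% of periods form the test set.
split = int(len(data) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Grid search with time-ordered folds; the grid is deliberately tiny.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
naive_pred = X_test["lag_1"]  # baseline: predict the previous period's sales

print("Best parameters:", search.best_params_)
print("Model MAE:", mean_absolute_error(y_test, y_pred))
# scikit-learn returns MAPE as a fraction (0.05 means 5%), not a percentage.
print("Model MAPE:", mean_absolute_percentage_error(y_test, y_pred))
print("Naive MAE:", mean_absolute_error(y_test, naive_pred))
```

Comparing against the naive baseline gives the error metrics a reference point: a model that cannot beat "next month equals this month" is probably not yet worth deploying.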
