Applied Datascience - Phase3
1. Time Period: Your dataset should span a specific time period, such as
months, quarters, or years. The choice of time granularity depends on the
business's sales cycle and the desired level of prediction accuracy.
2. Sales Data: This is the primary target variable you want to predict. It includes
historical sales figures for each time period, which serve as the basis for
training and testing your prediction model.
3. Features:
Product Information: Information about the products you sell, such as
product ID, category, brand, and price.
Store Information: Details about your stores or sales channels, like store ID,
location, size, and any promotions or events.
Time-related Features: Time-related attributes like day of the week, month,
quarter, holidays, and special events.
Economic Indicators: Economic factors such as inflation rates, GDP, and
consumer sentiment that can impact sales.
Competitor Data: Data about competitors, such as their pricing, promotions,
and market share.
Inventory Data: Information about inventory levels, stockouts, and lead
times for restocking.
Online Data: For e-commerce businesses, data on website traffic, bounce rates,
and conversion rates can be useful.
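As a minimal sketch of the time-related features listed above, pandas can derive them from a date column. The `date` column, the sample values, and the hand-written holiday list are assumptions made for illustration, not part of any particular dataset.

```python
import pandas as pd

# Hypothetical sales records; column names and values are illustrative only.
df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-02', '2023-04-15', '2023-12-25']),
    'sales': [120, 180, 150],
})

# Derive calendar features from the date column.
df['day_of_week'] = df['date'].dt.dayofweek  # Monday = 0 ... Sunday = 6
df['month'] = df['date'].dt.month
df['quarter'] = df['date'].dt.quarter

# Flag holidays from a hand-written list (an assumption; a real project
# would take these from a proper holiday calendar).
holidays = pd.to_datetime(['2023-12-25'])
df['is_holiday'] = df['date'].isin(holidays)

print(df)
```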
6. Data Preprocessing:
Missing Data Handling: Decide how to deal with missing data. You can impute
missing values, remove them, or use techniques like time-series interpolation.
Outlier Detection: Identify and handle outliers that might skew your
predictions.
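The two preprocessing steps above could look roughly like this. The sketch assumes a small made-up series, uses linear interpolation for the gap, and uses the interquartile-range (IQR) rule as one possible outlier test; other rules may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical sales series with one missing value and one extreme value.
sales = pd.Series([120.0, 150.0, np.nan, 180.0, 160.0, 900.0, 140.0])

# Missing data: fill the gap by linear interpolation between neighbours.
filled = sales.interpolate(method='linear')

# Outlier detection: flag points outside 1.5 * IQR from the quartiles.
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (filled < lower) | (filled > upper)

# One way to handle outliers: clip them to the bounds instead of dropping rows.
cleaned = filled.clip(lower=lower, upper=upper)

print(pd.DataFrame({'filled': filled, 'is_outlier': is_outlier, 'cleaned': cleaned}))
```

Clipping keeps every time period in the series, which matters when later steps rely on consecutive rows (for example, lag features).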
7. Feature Engineering:
Create new features from the existing ones when the raw columns are not sufficient for
making accurate predictions. For instance, you can calculate moving averages,
seasonality indicators, or growth rates.
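For example, a moving average and a period-over-period growth rate can be computed with pandas as sketched below. The sales values and the window of 3 periods are illustrative choices.

```python
import pandas as pd

# Illustrative sales values, one row per period.
df = pd.DataFrame({'sales': [120, 150, 110, 180, 160, 170]})

# 3-period moving average (NaN until three observations are available).
df['ma_3'] = df['sales'].rolling(window=3).mean()

# Growth rate relative to the previous period (NaN for the first row).
df['growth'] = df['sales'].pct_change()

print(df)
```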
8. Data Splitting: Split your dataset into training, validation, and test sets to
assess and tune your model's performance.
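For time-ordered sales data, one common approach is to split chronologically so that the model is never evaluated on periods earlier than those it was trained on. The helper below, `chronological_split`, is a hypothetical function written for this sketch, and the set sizes are arbitrary.

```python
import pandas as pd

def chronological_split(df, val_size, test_size):
    """Split rows in their existing (time) order into train, validation, test."""
    train_end = len(df) - val_size - test_size
    val_end = len(df) - test_size
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Illustrative data: 12 consecutive periods.
df = pd.DataFrame({'period': range(1, 13), 'sales': range(100, 112)})

train, val, test = chronological_split(df, val_size=2, test_size=2)
print(len(train), len(val), len(test))
```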
11. Model Evaluation Metrics: Define the metrics you will use to evaluate the
performance of your sales prediction model, such as Mean Absolute Error
(MAE), Root Mean Square Error (RMSE), or Mean Absolute Percentage Error
(MAPE).
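The three metrics can be written directly with NumPy, as in the sketch below. The sample values are made up, and note that MAPE is undefined when any true value is zero.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (requires non-zero y_true)."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Illustrative true and predicted sales.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

RMSE penalizes large errors more heavily than MAE, while MAPE expresses the error relative to the size of the true value.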
12. Modeling: Use various machine learning or time series forecasting models
like linear regression, decision trees, Random Forest, ARIMA, or deep learning
models like LSTM and GRU, depending on the nature of your data.
14. Model Validation: Validate your model's performance using the validation
set and fine-tune it if necessary.
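One way to fine-tune against the validation set is to train several candidate configurations and keep the one with the lowest validation error. The sketch below assumes a synthetic seasonal series, a single lag feature, and a Random Forest whose `max_depth` is the only parameter being tuned; all of these are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic series (an assumption for illustration): yearly seasonality plus noise.
rng = np.random.default_rng(0)
t = np.arange(60)
sales = 150 + 30 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 5, size=60)

# One lag feature: the previous period's sales predicts the current period.
X = sales[:-1].reshape(-1, 1)
y = sales[1:]

# Chronological split: first 40 rows train, next 10 validate, the rest stay
# untouched as the test set.
X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:50], y[40:50]

# Try a few candidate depths and record the validation MAE of each.
scores = {}
for depth in [2, 4, 8]:
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    scores[depth] = mean_absolute_error(y_val, model.predict(X_val))

best_depth = min(scores, key=scores.get)
print(scores, best_depth)
```

The test set is deliberately left out of this loop so that it can still give an unbiased estimate once a configuration has been chosen.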
15. Testing and Deployment: Test your model on the test set to evaluate its
real-world predictive performance. Once satisfied with the results, deploy the
model for making future sales predictions.
IMPLEMENTATION OF DATASET:
To build a small dataset for a future sales prediction program, we can use the following Python
code:
import pandas as pd
# Define the dataset
data = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'Sales': [120, 150, 110, 180, 160, 170, 140, 190, 130, 160, 140, 150]
}
# Build a DataFrame from the dictionary and display it
df = pd.DataFrame(data)
print(df)
OUTPUT:
Month Sales
0 Jan 120
1 Feb 150
2 Mar 110
3 Apr 180
4 May 160
5 Jun 170
6 Jul 140
7 Aug 190
8 Sep 130
9 Oct 160
10 Nov 140
11 Dec 150
PREPROCESSING OF DATA:
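Import the necessary libraries, then load, preprocess, and model the data:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Data Loading
data = pd.read_csv('sales_dataset.csv')

# Data Exploration
print(data.head())
print(data.describe())
data.info()  # prints its own summary, so no print() is needed

# Data Preprocessing
data.fillna(0, inplace=True)

# Feature Engineering
data['lag_1'] = data['sales'].shift(1)
data['lag_2'] = data['sales'].shift(2)
data = data.dropna()  # the first two rows have no lag values

# Split data into training and test sets
# (test_size=0.2 is an illustrative choice; shuffle=False keeps time order)
X = data[['lag_1', 'lag_2']]
y = data['sales']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

# Model Selection and Training (hyperparameters are illustrative)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Model Evaluation
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f'MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}')

# Visualization of Predictions
plt.figure(figsize=(12, 6))
plt.plot(y_test.values, label='True Sales')
plt.plot(y_pred, label='Predicted Sales')
plt.xlabel('Time Period')
plt.ylabel('Sales')
plt.legend()
plt.show()
```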
In this code, we perform the following steps:
1. **Data Loading**: Load the dataset, assuming you have a CSV file named 'sales_dataset.csv'.
2. **Data Exploration**: Print the first few rows of the dataset, basic statistics, and a summary of
column types and non-null counts (which reveals missing values).
3. **Data Preprocessing**: Handle missing values (in this example, we fill them with zeros).
4. **Feature Engineering**: Create lag features to capture the influence of previous sales on future
sales.
5. **Data Splitting**: Split the data into training and test sets to evaluate the model.
6. **Model Selection and Training**: Use a Random Forest Regressor as the prediction model and
train it on the training data.
7. **Model Evaluation**: Calculate Mean Absolute Error (MAE), Mean Squared Error (MSE), and
Root Mean Squared Error (RMSE) to assess the model's performance.
8. **Visualization**: Plot the true sales values and the predicted sales values to visually assess the
model's performance.
This code is a simplified example, and in a real-world scenario, you would need to fine-tune the
model, handle more complex features, and perform more extensive data analysis and validation. The
output will include the calculated error metrics and a visualization of the true vs. predicted sales
values, which can help you assess the model's accuracy and make informed decisions for future sales
predictions.