
KDDI Research, Inc.

Data Intelligence Division


Internship Program

Mesh Profile Estimation using
Logistic Regression and
Random Forest

Presented by Hsu Lin Hnet Yee (UCSM)
25/12/2019
Outline
Abstract

Dataset and Features

Exploratory Data Analysis

Data Preparation

Feature Engineering

Experimental Results

References
Abstract
 The dataset is composed of four files: “calendar.csv”, “dynamic_train.csv”, “dynamic_test.csv”, and “meta_data.csv”.
 The class labels are Residence, Station, Park, EventHall, and Office.
 First, all features and records from these files are aggregated into one dataset.
 From the aggregated features, additional features are extracted using domain knowledge.
 Based on the correlation between features, a subset of features is selected.
 To improve the performance of the learning techniques, randomized-search parameter tuning is used.
 Finally, two learning techniques are implemented: Logistic Regression (LR) and Random Forest (RF).
Dataset and features
calendar.csv has daily information for one month. It has a Mesh_Id attribute, a dow attribute for the day of the week (0-Mon, 1-Tue, 2-Wed, 3-Thu, 4-Fri, 5-Sat, 6-Sun), and an hd_flag attribute for the holiday flag (1 for holiday, 0 for non-holiday).

dynamic_train.csv records the move and stay counts every 30 minutes over 24 hours. Each mesh has 2880 records (24 hours × 2 records per hour × 2 types × 30 days) for one month.

dynamic_test.csv has the same kind of records as the train data, for 20 meshes.

meta_data.csv gives the target variable for each mesh id in the train data.
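The aggregation of these files can be sketched with pandas. The column names (mesh, date, type, count) and the miniature data below are assumptions for illustration, not the actual schema:

```python
import pandas as pd

# Hypothetical miniature versions of the described files.
calendar = pd.DataFrame({
    "date": ["2019-07-01", "2019-07-06"],
    "dow": [0, 5],          # 0 = Mon ... 5 = Sat
    "hd_flag": [0, 0],      # 1 = holiday
})
dynamic_train = pd.DataFrame({
    "mesh": [1, 1, 1, 1],
    "date": ["2019-07-01", "2019-07-01", "2019-07-06", "2019-07-06"],
    "type": ["stay", "move", "stay", "move"],
    "count": [120, 30, 200, 10],
})
meta = pd.DataFrame({"mesh": [1], "target": ["Residence"]})

# Aggregate all features and records into one dataset.
df = (dynamic_train
      .merge(calendar, on="date", how="left")
      .merge(meta, on="mesh", how="left"))
print(df.shape)  # one row per dynamic record, enriched with calendar + target
```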
Exploratory Data Analysis
There are 5 different class labels. The following are the class labels in the train dataset:

Residence : 7200 records
Station : 7200 records
Park : 5760 records
EventHall : 4320 records
Office : 4320 records
Exploratory Data Analysis
[Chart: Weekend_count]
Exploratory Data Analysis
[Chart: EDA for target class with stay and move counts]
Exploratory Data Analysis
[Charts: Weekend and weekday stay for each class label; weekend and weekday move for each class label]
Exploratory Data Analysis
Weekend and weekday stay:move ratios for each class label:

Class       Weekend (stay:move)   Weekday (stay:move)
EventHall   5:5                   8.3:1.7
Office      8.3:1.7               8.5:1.5
Park        7.5:2.5               7.5:2.5
Residence   8.9:1.1               0:10
Station     6.4:3.6               6.3:3.7
Data Preparation
The attributes and records are aggregated into one dataset.

The aggregated dataset has 13 attributes and 115200 records (train dataset: 57600 and test dataset: 57600).

The attribute named “type” is separated into two features, “move” and “stay”.

A new feature named “offday” is created from the “dow” and “hd_flag” features.

The gap between the stay count and the move count is created by subtracting move from stay. This new feature is “stay-move”.
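The preparation steps above can be sketched as follows. The exact column layout is an assumption, and the offday rule (weekend or holiday) is inferred from the dow/hd_flag description:

```python
import pandas as pd

# Miniature long-format data: one row per (type, timestamp) record.
df = pd.DataFrame({
    "mesh": [1, 1],
    "time": ["00:00", "00:00"],
    "type": ["stay", "move"],
    "count": [120, 30],
    "dow": [5, 5],        # 5 = Saturday
    "hd_flag": [0, 0],
})

# Separate the "type" attribute into "stay" and "move" features.
wide = df.pivot_table(index=["mesh", "time", "dow", "hd_flag"],
                      columns="type", values="count").reset_index()

# "offday": weekend (dow 5 or 6) or holiday (hd_flag = 1).
wide["offday"] = ((wide["dow"] >= 5) | (wide["hd_flag"] == 1)).astype(int)

# Gap between stay count and move count.
wide["stay-move"] = wide["stay"] - wide["move"]
print(wide[["stay", "move", "offday", "stay-move"]])
```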
Feature Engineering
Rolling window feature
The rolling window is applied to the “stay” and “move” features.
The window size is set to 32 (16 hours), because the dataset has one record every 30 minutes.
The resulting features are named “stay_rolling_mean” and “move_rolling_mean”.
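The rolling mean can be sketched with pandas (synthetic series; `min_periods=1` is an assumption for handling the first, incomplete windows, and in practice the rolling would be applied per mesh):

```python
import pandas as pd

# Window of 32 half-hour records ≈ 16 hours.
df = pd.DataFrame({"stay": range(64)})
df["stay_rolling_mean"] = (df["stay"]
                           .rolling(window=32, min_periods=1)
                           .mean())
print(df["stay_rolling_mean"].iloc[-1])  # mean of the last 32 values
```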
Although there are 13 attributes, the unnecessary features are dropped from the dataset.

The unnecessary features are category, date_time, date, time, hd_flag, and dow.

This leaves 7 features: mesh, stay, move, stay_rolling_mean, move_rolling_mean, offday, and stay-move.

The dependent variable is target.


Correlation Between Features

 According to the correlation matrix, no pair of features has a correlation above 0.95.

 All of the features are therefore used for implementation.
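The 0.95 correlation check can be sketched as below (synthetic values stand in for the prepared features):

```python
import pandas as pd

# Toy feature matrix; the real check runs on the 7 prepared features.
df = pd.DataFrame({"stay": [1, 2, 3, 4],
                   "move": [4, 3, 1, 2],
                   "offday": [0, 0, 1, 1]})
corr = df.corr().abs()

# Any off-diagonal pair above the threshold would be a drop candidate.
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.95]
print(high)  # empty list -> keep all features
```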
Feature Engineering
 Label Encoding

To convert the categorical variables to numerical variables, label encoding is used.
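A minimal sketch with scikit-learn's LabelEncoder, applied here to the five target classes (note that LabelEncoder assigns integers in alphabetical order of the class names):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Residence", "Station", "Park", "EventHall", "Office"]
le = LabelEncoder()
encoded = le.fit_transform(labels)
# classes_ is sorted alphabetically: EventHall=0, Office=1, Park=2, ...
print(dict(zip(le.classes_, range(len(le.classes_)))))
```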
Parameter Tuning
To achieve high performance from the learning techniques, randomized-search parameter tuning is used.
For Logistic Regression, the hyperparameter settings are as follows:
C : np.logspace(0, 4, num=10)
max_iter : 50, 100, 150, 200, 250, 300, 350, 400, 450, 500
For Random Forest, the hyperparameter settings are:
n_estimators : 50, 100, 150, 200, 250, 300, 350, 400, 450, 500
max_depth : range(1, 100)
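The Random Forest search can be sketched with scikit-learn's RandomizedSearchCV over the grid above (synthetic data stands in for the prepared mesh dataset; `n_iter` and the random seeds are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic multi-class data in place of the real mesh features.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

param_dist = {
    "n_estimators": list(range(50, 501, 50)),
    "max_depth": list(range(1, 100)),
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```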
Experimental Results
To avoid overfitting, cross-validation with cv=5 is used.

Logistic Regression with C=166.81 and max_iter=500 is evaluated.

Random Forest with max_depth=13 and n_estimators=400 is evaluated.
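The 5-fold evaluation of the tuned Logistic Regression can be sketched as follows (C and max_iter are the tuned values from the slide; the data here is synthetic, so the score does not reproduce the reported accuracy):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared train dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

clf = LogisticRegression(C=166.81, max_iter=500)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold accuracy
print(scores.mean())
```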

Comparison of Logistic Regression and Random Forest

[Chart] Accuracy: Logistic Regression 75%, Random Forest 70%
References
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://python-graph-gallery.com/11-grouped-barplot/
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Thank You
