
KDDI Research, Inc.

Data Intelligence Division


Internship Program

Mesh Profile Estimation using
Logistic Regression and
Random Forest

Presented by Hsu Lin Hnet Yee (UCSM)
25/12/2019
Outline
Abstract

Dataset and Features

Exploratory Data Analysis

Data Preparation

Feature Engineering

Experimental Results

References
Abstract
 The dataset is composed of four files: “calendar.csv”, “dynamic_train.csv”, “dynamic_test.csv”, and “meta_data.csv”.
 The class labels are Residence, Station, Park, EventHall, and Office.
 First, all features and records from these files are aggregated into one dataset.
 From the aggregated features, additional features are extracted using domain knowledge.
 Based on the correlation between features, a subset of features is selected.
 To improve the performance of the learning techniques, randomized-search parameter tuning is used.
 Finally, two learning techniques are implemented: Logistic Regression (LR) and Random Forest (RF).
Dataset and features
calendar.csv has daily information for one month. It has a Mesh_Id attribute, a dow attribute for the day of the week (0-Mon, 1-Tue, 2-Wed, 3-Thu, 4-Fri, 5-Sat, 6-Sun), and an hd_flag attribute for the holiday flag (1 for holiday, 0 for non-holiday).

dynamic_train.csv records the move and stay counts every 30 minutes over 24 hours. Each mesh has 2880 records (24 hours × 2 records per hour × 2 types × 30 days) for one month.

dynamic_test.csv has the same kind of records as the train data, for 20 meshes.

meta_data.csv gives the target variable for each mesh id in the train data.
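The aggregation of these files can be sketched with pandas. The column names (mesh, date, type, count) and the miniature data below are assumptions for illustration, not the actual schema:

```python
import pandas as pd

# Hypothetical miniature versions of the described files.
calendar = pd.DataFrame({
    "date": ["2019-07-01", "2019-07-06"],
    "dow": [0, 5],          # 0 = Mon ... 5 = Sat
    "hd_flag": [0, 0],      # 1 = holiday
})
dynamic_train = pd.DataFrame({
    "mesh": [1, 1, 1, 1],
    "date": ["2019-07-01", "2019-07-01", "2019-07-06", "2019-07-06"],
    "type": ["stay", "move", "stay", "move"],
    "count": [120, 30, 200, 10],
})
meta = pd.DataFrame({"mesh": [1], "target": ["Residence"]})

# Aggregate all features and records into one dataset.
df = (dynamic_train
      .merge(calendar, on="date", how="left")
      .merge(meta, on="mesh", how="left"))
print(df.shape)  # one row per dynamic record, enriched with calendar + target
```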
Exploratory Data Analysis
There are 5 different class labels. The following are the class labels in the train dataset:

Residence : 7200 records
Station : 7200 records
Park : 5760 records
EventHall : 4320 records
Office : 4320 records
Exploratory Data Analysis
[Chart: Weekend_count]
Exploratory Data Analysis
[Chart: EDA for target class with stay and move counts]
Exploratory Data Analysis
[Charts: Weekend and weekday stay for each class label; weekend and weekday move for each class label]
Exploratory Data Analysis
Weekend and weekday stay:move ratios for each class label:

Class       Weekend (stay:move)   Weekday (stay:move)
EventHall   5:5                   8.3:1.7
Office      8.3:1.7               8.5:1.5
Park        7.5:2.5               7.5:2.5
Residence   8.9:1.1               0:10
Station     6.4:3.6               6.3:3.7
Data Preparation
The attributes and records are aggregated into one dataset.

The aggregated dataset has 13 attributes and 115200 records (train dataset: 57600 and test dataset: 57600).

The attribute named “type” is separated into two features, “move” and “stay”.

A new feature named “offday” is created from the “dow” and “hd_flag” features.

The gap between the stay count and the move count is created by subtracting move from stay. This new feature is “stay-move”.
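The preparation steps above can be sketched as follows. The exact column layout is an assumption, and the offday rule (weekend or holiday) is inferred from the dow/hd_flag description:

```python
import pandas as pd

# Miniature long-format data: one row per (type, timestamp) record.
df = pd.DataFrame({
    "mesh": [1, 1],
    "time": ["00:00", "00:00"],
    "type": ["stay", "move"],
    "count": [120, 30],
    "dow": [5, 5],        # 5 = Saturday
    "hd_flag": [0, 0],
})

# Separate the "type" attribute into "stay" and "move" features.
wide = df.pivot_table(index=["mesh", "time", "dow", "hd_flag"],
                      columns="type", values="count").reset_index()

# "offday": weekend (dow 5 or 6) or holiday (hd_flag = 1).
wide["offday"] = ((wide["dow"] >= 5) | (wide["hd_flag"] == 1)).astype(int)

# Gap between stay count and move count.
wide["stay-move"] = wide["stay"] - wide["move"]
print(wide[["stay", "move", "offday", "stay-move"]])
```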
Feature Engineering
Rolling window feature
The rolling window is applied to the “stay” and “move” features.
The window size is set to 32 (16 hours), because the dataset has one record every 30 minutes.
The resulting features are named “stay_rolling_mean” and “move_rolling_mean”.
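The rolling mean can be sketched with pandas (synthetic series; `min_periods=1` is an assumption for handling the first, incomplete windows, and in practice the rolling would be applied per mesh):

```python
import pandas as pd

# Window of 32 half-hour records ≈ 16 hours.
df = pd.DataFrame({"stay": range(64)})
df["stay_rolling_mean"] = (df["stay"]
                           .rolling(window=32, min_periods=1)
                           .mean())
print(df["stay_rolling_mean"].iloc[-1])  # mean of the last 32 values
```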
Although there are 13 attributes, the unnecessary features are dropped from the dataset.

The unnecessary features are category, date_time, date, time, hd_flag, and dow.

This leaves 7 features: mesh, stay, move, stay_rolling_mean, move_rolling_mean, offday, and stay-move.

The dependent variable is target.


Correlation Between Features

 According to the correlation matrix, no pair of features has a correlation above 0.95.

 All of the features are therefore used for implementation.
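The 0.95 correlation check can be sketched as below (synthetic values stand in for the prepared features):

```python
import pandas as pd

# Toy feature matrix; the real check runs on the 7 prepared features.
df = pd.DataFrame({"stay": [1, 2, 3, 4],
                   "move": [4, 3, 1, 2],
                   "offday": [0, 0, 1, 1]})
corr = df.corr().abs()

# Any off-diagonal pair above the threshold would be a drop candidate.
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and corr.loc[a, b] > 0.95]
print(high)  # empty list -> keep all features
```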
Feature Engineering
 Label Encoding

To convert the categorical variables to numerical variables, label encoding is used.
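A minimal sketch with scikit-learn's LabelEncoder, applied here to the five target classes (note that LabelEncoder assigns integers in alphabetical order of the class names):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Residence", "Station", "Park", "EventHall", "Office"]
le = LabelEncoder()
encoded = le.fit_transform(labels)
# classes_ is sorted alphabetically: EventHall=0, Office=1, Park=2, ...
print(dict(zip(le.classes_, range(len(le.classes_)))))
```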
Parameter Tuning
To achieve high performance from the learning techniques, randomized-search parameter tuning is used.
For Logistic Regression, the hyperparameter settings are as follows:
C : np.logspace(0, 4, num=10)
max_iter : 50, 100, 150, 200, 250, 300, 350, 400, 450, 500
For Random Forest, the hyperparameter settings are:
n_estimators : 50, 100, 150, 200, 250, 300, 350, 400, 450, 500
max_depth : range(1, 100)
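The Random Forest search can be sketched with scikit-learn's RandomizedSearchCV over the grid above (synthetic data stands in for the prepared mesh dataset; `n_iter` and the random seeds are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic multi-class data in place of the real mesh features.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

param_dist = {
    "n_estimators": list(range(50, 501, 50)),
    "max_depth": list(range(1, 100)),
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```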
Experimental Results
To avoid overfitting, cross-validation with cv=5 is used.

Logistic Regression with C=166.81 and max_iter=500 is evaluated.

Random Forest with max_depth=13 and n_estimators=400 is evaluated.
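The 5-fold evaluation of the tuned Logistic Regression can be sketched as follows (C and max_iter are the tuned values from the slide; the data here is synthetic, so the score does not reproduce the reported accuracy):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared train dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

clf = LogisticRegression(C=166.81, max_iter=500)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold accuracy
print(scores.mean())
```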

Comparison of Logistic Regression and Random Forest

[Chart] Accuracy: Logistic Regression 75%, Random Forest 70%
References
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://python-graph-gallery.com/11-grouped-barplot/
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Thank You
