Data Mining

Week 13
Task0
JSON file preprocessing
import glob
import json
import os
import statistics

import pandas as pd

# Search every JSON file under the Data folder, including subfolders
base_path = './Data'
json_files = glob.glob(os.path.join(base_path, '**/*.json'), recursive=True)

all_combined_df = pd.DataFrame()

for file_path in json_files:
    print(file_path)
    with open(file_path, 'r', encoding='UTF-8') as file:
        json_data = json.load(file)

    rows = []
    # Scene data: one row per occupant per image
    for scene in json_data['scene']['data']:
        img_name = scene['img_name']
        for occupant in scene['occupant']:
            row = {
                'img_name': img_name,
                'occupant_id': occupant.get('occupant_id', None),
                'action': occupant.get('action', None),
                'emotion': occupant.get('emotion', None),
            }
            rows.append(row)
    # Creating a DataFrame
    combined_df = pd.DataFrame(rows)

    # occupant df
    occupant_df = pd.DataFrame(json_data['occupant_info'])

    # sensor df
    sensor_df = pd.DataFrame()
    for sensor_data in json_data['scene']['sensor']:
        # Calculating the average and standard deviation for each sensor
        avg_ecg = sum(sensor_data['ECG']) / len(sensor_data['ECG'])
        std_ecg = statistics.stdev(sensor_data['ECG']) if len(sensor_data['ECG']) > 1 else 0
        avg_ppg = sum(sensor_data['PPG']) / len(sensor_data['PPG'])
        std_ppg = statistics.stdev(sensor_data['PPG']) if len(sensor_data['PPG']) > 1 else 0
        avg_spo2 = sum(sensor_data['SPO2']) / len(sensor_data['SPO2'])
        std_spo2 = statistics.stdev(sensor_data['SPO2']) if len(sensor_data['SPO2']) > 1 else 0
        # EEG is a list of lists: average each inner list first
        eeg_means = [sum(eeg_list) / len(eeg_list) for eeg_list in sensor_data['EEG']]
        avg_eeg = sum(eeg_means) / len(eeg_means)
        std_eeg = statistics.stdev(eeg_means) if len(eeg_means) > 1 else 0

        # Each sensor is recorded as a time series with many readings,
        # so it is stored as a mean and a standard deviation
        avg_std_sensor_df = pd.DataFrame({
            'occupant_id': sensor_data['occupant_id'],
            'ECG_avg': [avg_ecg],
            'ECG_std': [std_ecg],
            'PPG_avg': [avg_ppg],
            'PPG_std': [std_ppg],
            'SPO2_avg': [avg_spo2],
            'SPO2_std': [std_spo2],
            'EEG_avg': [avg_eeg],
            'EEG_std': [std_eeg]
        })
        sensor_df = pd.concat([sensor_df, avg_std_sensor_df], ignore_index=True)

    # Combine scene_df, sensor_df, occupant_df
    combined_df = pd.merge(combined_df, sensor_df, on='occupant_id')
    combined_df = pd.merge(combined_df, occupant_df, on='occupant_id')

    # Drop duplicate rows, comparing every column except 'img_name' and 'action'
    # Rationale: the data is a video split into many images, so the same occupant
    # appears in several duplicate rows within one scene.
    columns_to_check = [col for col in combined_df.columns if col not in ['img_name', 'action']]
    combined_df = combined_df.drop_duplicates(subset=columns_to_check).reset_index(drop=True)

    all_combined_df = pd.concat([all_combined_df, combined_df], ignore_index=True)
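The code above reads the JSON files in a given folder and then parses and processes their data. The steps are as follows.
1. File discovery and reading
• glob is used to find every .json file under the 'Data' folder. The glob.glob function returns the list of file paths that match the given pattern.
• Each JSON file is opened (open(file_path, 'r', encoding='UTF-8')) and json.load converts the JSON data into a Python object.
2. JSON data processing
• The loop visits each file and processes its data according to the given JSON structure.
• From the scene data, the 'img_name' and 'occupant' information of each 'scene' is extracted and turned into rows of the DataFrame combined_df.
• occupant_info is used to create the DataFrame occupant_df, which holds the occupant information.
• For the sensor part of scene, the mean and standard deviation of each sensor's data are computed and added to sensor_df.
3. Data merging and duplicate removal
• pd.merge is used to merge combined_df, sensor_df, and occupant_df on 'occupant_id', so that each occupant's action, emotion, sensor data, and personal information end up in a single row.
• Duplicate rows are removed. Rows that are identical in every column except 'img_name' and 'action' are dropped, which reduces repeated information about the same occupant.
4. Final concatenation of the combined data
• The combined_df built from each JSON file is appended to all_combined_df, so that the data from all files is contained in one large DataFrame.
The data spread across multiple JSON files is thus combined into a single DataFrame, duplicates are removed, and the result is later used for prediction with multi-class logistic regression. The merging and duplicate-removal steps matter for keeping the data consistent and accurate.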
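The aggregate, merge, and deduplicate steps can be tried on a few made-up rows. This is only a sketch: the occupant IDs, image names, actions, and ECG readings are hypothetical and are not taken from the AIHub files, and the helper mean_std is introduced here for illustration.

```python
import statistics

import pandas as pd


def mean_std(values):
    """Return (mean, sample standard deviation); the std is 0 for a single reading."""
    mean = sum(values) / len(values)
    std = statistics.stdev(values) if len(values) > 1 else 0
    return mean, std


# Hypothetical scene rows: the same occupant appears in several images
scene_df = pd.DataFrame({
    'img_name': ['a_001.jpg', 'a_002.jpg', 'a_003.jpg'],
    'occupant_id': ['P1', 'P1', 'P2'],
    'action': ['driving', 'drinking', 'sitting'],
    'emotion': ['neutral', 'neutral', 'happy'],
})

# Hypothetical raw ECG time series per occupant
raw_sensor = {'P1': [60, 62, 64], 'P2': [70]}
sensor_rows = []
for occupant_id, ecg in raw_sensor.items():
    avg, std = mean_std(ecg)
    sensor_rows.append({'occupant_id': occupant_id, 'ECG_avg': avg, 'ECG_std': std})
sensor_df = pd.DataFrame(sensor_rows)

# Attach the per-occupant sensor summary to every scene row
merged = pd.merge(scene_df, sensor_df, on='occupant_id')

# Compare every column except 'img_name' and 'action' when looking for duplicates
columns_to_check = [c for c in merged.columns if c not in ['img_name', 'action']]
deduped = merged.drop_duplicates(subset=columns_to_check).reset_index(drop=True)

print(deduped)
```

The two P1 rows differ only in 'img_name' and 'action', so one of them is dropped and the result has one row per occupant. Note that the 'action' of the dropped row is lost as a side effect.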
import numpy as np

# Assumption: all_df is the combined frame built above (all_combined_df);
# the original listing uses all_df without defining it
all_df = all_combined_df.copy()

# Drop rows that contain NA values
all_df.dropna(inplace=True)

# Convert occupant_age to numeric data
def age_group_to_numeric(age_group):
    if age_group == '20 대':
        return 20
    elif age_group == '30 대':
        return 30
    elif age_group == '40 대':
        return 40
    elif age_group == '60 대_이상':
        return 60
    else:
        # '기타' (other) becomes NA; there are very few such rows, so they are removed later
        return np.nan

# Apply the mapping function to each value of the 'occupant_age' column
all_df['occupant_age'] = all_df['occupant_age'].apply(age_group_to_numeric)

# Fix the typo in the column name
all_df = all_df.rename(columns={'occupant_posoition ': 'occupant_position'})

# One-hot encode the 'occupant_sex' and 'occupant_position' columns
sex_dummies = pd.get_dummies(all_df['occupant_sex'], prefix='sex')
position_dummies = pd.get_dummies(all_df['occupant_position'], prefix='position')

# Concatenate the one-hot encoded frames to the original DataFrame
all_df = pd.concat([all_df, sex_dummies, position_dummies], axis=1)

# SPO2_std and EEG_std have a single unique value -> no variation -> drop them
all_df = all_df.drop(columns=['SPO2_std', 'EEG_std'])

# Normalize (z-score)
columns_to_normalize = ['ECG_avg', 'ECG_std', 'PPG_avg',
                        'PPG_std', 'SPO2_avg', 'EEG_avg', 'occupant_age']
all_df[columns_to_normalize] = (
    all_df[columns_to_normalize] - all_df[columns_to_normalize].mean()
) / all_df[columns_to_normalize].std()

all_df.to_csv('all_data_norm.csv', index=False, encoding='CP949')
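In addition, a preprocessing pass converts the data extracted from the JSON files into a form suitable for analysis. The main steps are as follows:
1. Missing-value removal
• dropna() removes every row of the all_df DataFrame that contains a missing value (NA). This is a standard step for keeping the data accurate and reliable.
2. Age conversion
• The age_group_to_numeric function converts the string values in the 'occupant_age' column to numbers. For example, '20 대' becomes 20 and '30 대' becomes 30.
• Unclassified age groups such as '기타' (other) are set to np.nan so that they can be removed later.
3. Data cleaning
• rename fixes the typo in the name of the 'occupant_position' column.
4. One-hot encoding
• The sex ('occupant_sex') and position ('occupant_position') columns are one-hot encoded, which converts categorical data into a form that the model can handle easily.
• pd.get_dummies creates a new column for each category, and these columns are concatenated to the original DataFrame.
5. Removal of unnecessary columns
• The 'SPO2_std' and 'EEG_std' columns are dropped. Every value in these columns is identical, so they carry no variation and were judged not useful for the analysis.
6. Normalization
• The sensor and age columns are normalized to mean 0 and standard deviation 1, which makes the variables easier to compare and analyze.
7. Saving the data
• The processed DataFrame is saved to 'all_data_norm.csv'. The 'CP949' encoding is used for compatibility with Korean text.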
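The preprocessing steps above can be sketched on a tiny made-up frame. The age labels ('20s', '30s', 'other'), the sex codes, and the numbers are placeholders rather than values from the data set, and a dictionary lookup with Series.map stands in for the if/elif function. Unlike the listing above, this variant drops the unmapped age rows immediately.

```python
import pandas as pd

# Hypothetical rows; the labels below are illustrative placeholders
df = pd.DataFrame({
    'occupant_age': ['20s', '30s', 'other', '40s'],
    'occupant_sex': ['M', 'F', 'F', 'M'],
    'ECG_avg': [1.0, 2.0, 3.0, 4.0],
    'SPO2_std': [0.0, 0.0, 0.0, 0.0],
})

# Labels missing from the mapping become NaN, like the else branch of the function
age_map = {'20s': 20, '30s': 30, '40s': 40}
df['occupant_age'] = df['occupant_age'].map(age_map)
df = df.dropna(subset=['occupant_age'])

# One-hot encode a categorical column and attach the new columns
df = pd.concat([df, pd.get_dummies(df['occupant_sex'], prefix='sex')], axis=1)

# Columns with a single unique value carry no information for a model
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)

# Z-score normalization (pandas .std() is the sample standard deviation, ddof=1)
cols = ['ECG_avg', 'occupant_age']
df[cols] = (df[cols] - df[cols].mean()) / df[cols].std()

print(constant_cols)
print(df)
```

Checking nunique() is one way to find columns such as 'SPO2_std' and 'EEG_std' automatically instead of listing them by hand.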
The final data set is shown below.

Task1
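The week 13 exercise uses the "운전자 및 탑승자 상태 및 이상행동 모니터링" (driver and passenger state and abnormal behavior monitoring) data provided by aihub.or.kr to perform multi-class classification that recognizes the driver's emotional state.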
# Load necessary libraries
library(readr)
library(dplyr)
library(tidyr)

################################################################################
### Task 1.
################################################################################

# Read the CSV file
data <- read.csv("csv_Data/all_data_norm.csv")

# Remove the '분류 없음' (unclassified) rows
data <- data %>% filter(emotion != '분류 없음')
head(data)

# Convert to factor
data$emotion <- as.factor(data$emotion)

# Handle other preprocessing tasks (e.g., normalization, missing value imputation)

# Label
labelDF <- data.frame(y = data$emotion, val = 1, i = 1:nrow(data))

# One-hot encoding for the emotion variable
labelDF.new <- (labelDF %>%
                  spread(key = y, value = val, fill = 0))[, -1]
labelDF.new %>% colSums()

# Name the columns
emotion_levels <- levels(data$emotion)
names(labelDF.new) <- emotion_levels

# Remove unused columns
dataX <- data[, -which(names(data) %in% c("img_name", "emotion", "action",
                                          "occupant_id", "occupant_sex",
                                          "occupant_position"))]

# Build the list of glm models: one per class (5 classes)
# Note: the models are fit on all rows, before the train/test split below
modelList <- lapply(emotion_levels, function(x) {
  glm(data = cbind(labelDF.new %>% select(all_of(x)), dataX),
      formula = as.formula(paste0(x, ' ~ .')),
      family = binomial(link = 'logit'))
})

# Prediction function
predict_emotion <- function(inputMat) {
  nData <- data.frame(inputMat)
  result <- sapply(modelList, function(x) {
    predict(x, newdata = nData, type = 'response')
  })
  # Return the class with the highest probability
  return(apply(result, 1, function(x) { emotion_levels[which.max(x)] }))
}

# Split data into training and testing sets
set.seed(1234) # for reproducibility
train_indices <- sample(1:nrow(data), size = 0.8 * nrow(data))
train_data <- dataX[train_indices, ]
test_data <- dataX[-train_indices, ]
train_emotions <- data$emotion[train_indices]
test_emotions <- data$emotion[-train_indices]

# Make predictions on the training and test sets
train_pred_emotions <- predict_emotion(train_data)
test_pred_emotions <- predict_emotion(test_data)

# Calculate accuracy
train_accuracy <- mean(train_pred_emotions == train_emotions)
test_accuracy <- mean(test_pred_emotions == test_emotions)

# Compare
print(train_accuracy)
print(test_accuracy)
# Underfitting
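The steps above implement multi-class logistic regression directly, by building as many binary logistic regression models as there are classes.
1. Loading libraries and reading the data
• The readr, dplyr, and tidyr libraries are loaded. These are standard R packages for data processing and manipulation.
• read.csv reads the 'all_data_norm.csv' file and stores it in the data variable.
2. Data preprocessing
• filter removes the rows whose 'emotion' value is '분류 없음' (unclassified).
• as.factor converts the 'emotion' column to a factor (categorical variable).
3. One-hot encoding and label preparation
• The emotion variable is one-hot encoded, turning each emotion into its own column so that it fits the model.
• spread creates a column for each emotion, and the unneeded column is dropped.
4. Data cleaning
• The unneeded columns (img_name, emotion, action, occupant_id, occupant_sex, occupant_position) are removed, because columns whose value is unique in every row, or that have already been turned into dummy variables, must be excluded.
5. Building the list of logistic regression models
• A separate logistic regression model is built for each emotion category. lapply stores the model for each emotion level in a list.
6. Defining the prediction function
• predict_emotion runs each model's prediction on the input data and returns the emotion with the highest probability.
7. Data split and model evaluation
• The data is split into a training set and a test set.
• Emotions are predicted for the training and test data, and the accuracy is computed.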
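The same one-model-per-class idea can be sketched in Python with scikit-learn on synthetic data. The class names, cluster centers, and sample sizes below are made up for illustration. Unlike the R listing, this sketch fits the binary models on the training split only, so the test accuracy is measured on rows the models have not seen.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic three-class data; the class names are placeholders
X, y_idx = make_blobs(n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
                      cluster_std=0.5, random_state=0)
class_names = np.array(['class_a', 'class_b', 'class_c'])
y = class_names[y_idx]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234)

# One binary model per class: "this class" (1) versus "everything else" (0)
models = {}
for name in class_names:
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, (y_train == name).astype(int))
    models[name] = model


def predict_class(X_new):
    """For each row, pick the class whose binary model gives the highest probability."""
    probs = np.column_stack([models[name].predict_proba(X_new)[:, 1]
                             for name in class_names])
    return class_names[np.argmax(probs, axis=1)]


train_accuracy = np.mean(predict_class(X_train) == y_train)
test_accuracy = np.mean(predict_class(X_test) == y_test)
print(train_accuracy, test_accuracy)
```

The probabilities from the separate binary models do not sum to 1 across classes; taking the argmax only uses their relative order.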
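Model performance comparison
• The training and test accuracies are printed. The training accuracy is 0.445 and the test accuracy is 0.457, so the model can be considered to underfit somewhat, even on the training data. Inspecting the predictions directly showed that most of them were '중립' (neutral).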
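When most predictions fall into one class, it helps to compare the accuracy with what a model that always predicts the most frequent label would score. The sketch below uses made-up labels and predictions, not the report's data; the two helper functions are introduced here for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def prediction_share(predictions):
    """Fraction of predictions assigned to each class."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels and predictions, for illustration only
true_labels = ['neutral'] * 5 + ['happy'] * 3 + ['sad'] * 2
predicted = ['neutral'] * 9 + ['happy']

print(majority_baseline_accuracy(true_labels))
print(prediction_share(predicted))
```

If the model's accuracy is close to the majority baseline and one class dominates the predictions, the model has learned little beyond the class frequencies.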
Task2
Carry out feature engineering to optimize the model by adding new variables or removing unnecessary ones. If overfitting occurs while optimizing through added variables, use regularization to resolve it and check how the performance changes.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the data
data = pd.read_csv('all_data.csv', encoding='cp949')

# Remove the '분류없음' (unclassified) rows
data = data[data['emotion'] != '분류없음']

# Apply one-hot encoding
emotion_one_hot = pd.get_dummies(data['emotion'])
data = pd.concat([data, emotion_one_hot], axis=1)

# Create interaction and polynomial features
polynomial_features = PolynomialFeatures(degree=2, include_bias=False,
                                         interaction_only=False)
sensor_data = data[['ECG_avg', 'ECG_std', 'PPG_avg',
                    'PPG_std', 'SPO2_avg', 'EEG_avg']]
sensor_poly = polynomial_features.fit_transform(sensor_data)
# Keep data's index so that the concat below aligns rows after the filtering above
sensor_poly_df = pd.DataFrame(
    sensor_poly,
    columns=polynomial_features.get_feature_names_out(sensor_data.columns),
    index=data.index)
data = pd.concat([data, sensor_poly_df], axis=1)

# Convert categorical variables (can be skipped if already one-hot encoded)
# data = pd.get_dummies(data, columns=['occupant_sex', 'occupant_position'])

# Select the required features
features = list(sensor_poly_df.columns) + ['occupant_age',
                                           'position_back', 'position_front']
X = data[features]

# Select the target variables
y = data[emotion_one_hot.columns]

# Split the data and set the random seed
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=123)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build one binary model per emotion
models = {}
emotions = y.columns
for emotion in emotions:
    model = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=123)
    model.fit(X_train_scaled, y_train[emotion])
    models[emotion] = model

# Multi-class prediction function
def predict_multiclass(X_scaled):
    probabilities = np.array([model.predict_proba(X_scaled)[:, 1]
                              for emotion, model in models.items()]).T
    predicted_class_indices = np.argmax(probabilities, axis=1)
    predicted_classes = [emotions[idx] for idx in predicted_class_indices]
    return predicted_classes

# Predict on the training and test data
y_train_pred = predict_multiclass(X_train_scaled)
y_test_pred = predict_multiclass(X_test_scaled)

# Map the one-hot targets back to the original 'emotion' labels for comparison
y_train_original = y_train.idxmax(axis=1)
y_test_original = y_test.idxmax(axis=1)

# Evaluate performance
train_accuracy = accuracy_score(y_train_original, y_train_pred)
test_accuracy = accuracy_score(y_test_original, y_test_pred)
print("Train Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)
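The code above applies feature engineering to the existing logistic-regression-based emotion classifier in order to improve its performance. The added functionality is as follows.
1. Polynomial feature generation:
• PolynomialFeatures generates interaction and polynomial features from the given sensor data (ECG_avg, ECG_std, PPG_avg, PPG_std, SPO2_avg, EEG_avg).
• The degree=2 setting means that the square of each feature and every pairwise product of features are included. This lets the model learn complex relationships that are not captured in the original data.
2. DataFrame conversion:
• The generated sensor_poly array is converted to a DataFrame, with column names assigned through the get_feature_names_out method.
• This new DataFrame, sensor_poly_df, is appended to the original data set data.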
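Feature selection
1. Building the feature list:
• A list is built that contains the generated polynomial features (sensor_poly_df.columns) and the additional features (occupant_age, position_back, position_front).
• This list determines the final feature set used for model training.
2. Building the final feature data set:
• X = data[features] builds the final feature data set X used for model training.
Performance evaluation result 1
The performance came out even lower, so further feature engineering appears to be necessary.
The additional processing is as follows.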
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

# Load the data
data = pd.read_csv('all_data.csv', encoding='cp949')

# Remove the '분류없음' (unclassified) rows
data = data[data['emotion'] != '분류없음']

# Handle outliers (example: using the Z-score)
data = data[(np.abs(stats.zscore(data[['ECG_avg', 'PPG_avg',
                                       'SPO2_avg', 'EEG_avg']])) < 3).all(axis=1)]

# Apply one-hot encoding
emotion_one_hot = pd.get_dummies(data['emotion'])
data = pd.concat([data, emotion_one_hot], axis=1)

# Create interaction and polynomial features
polynomial_features = PolynomialFeatures(degree=2, include_bias=False,
                                         interaction_only=False)
sensor_data = data[['ECG_avg', 'PPG_avg', 'EEG_avg']]
sensor_poly = polynomial_features.fit_transform(sensor_data)
# Keep data's index so that the concat below aligns rows after the filtering above
sensor_poly_df = pd.DataFrame(
    sensor_poly,
    columns=polynomial_features.get_feature_names_out(sensor_data.columns),
    index=data.index)
data = pd.concat([data, sensor_poly_df], axis=1)

# Select the required features
X = data.drop(['emotion', 'img_name', 'occupant_id',
               'occupant_sex', 'occupant_position'] +
              list(emotion_one_hot.columns), axis=1)

# Select the target variables
y = data[emotion_one_hot.columns]

# Split the data and set the random seed
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2, random_state=123)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Feature selection (fit on the training data only)
selector = SelectKBest(f_classif, k=20)
X_train_scaled = selector.fit_transform(X_train_scaled, y_train.idxmax(axis=1))
X_test_scaled = selector.transform(X_test_scaled)

# Build one model per emotion class
models = {}
emotions = y.columns
for emotion in emotions:
    model = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=123)
    model.fit(X_train_scaled, y_train[emotion])
    models[emotion] = model

# Multi-class prediction function
def predict_multiclass(X_scaled):
    probabilities = np.array([model.predict_proba(X_scaled)[:, 1]
                              for emotion, model in models.items()]).T
    predicted_class_indices = np.argmax(probabilities, axis=1)
    predicted_classes = [emotions[idx] for idx in predicted_class_indices]
    return predicted_classes

# Predict on the training and test data
y_train_pred = predict_multiclass(X_train_scaled)
y_test_pred = predict_multiclass(X_test_scaled)

# Compare with the original 'emotion' target
y_train_original = y_train.idxmax(axis=1)
y_test_original = y_test.idxmax(axis=1)

# Evaluate performance
train_accuracy = accuracy_score(y_train_original, y_train_pred)
test_accuracy = accuracy_score(y_test_original, y_test_pred)
print("Train Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)
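Added parts:
1. Outlier handling:
• Z-scores are used to remove outliers in the ECG_avg, PPG_avg, SPO2_avg, and EEG_avg data. The Z-score expresses how far a value lies from the mean in units of standard deviation, and here only rows whose absolute Z-score is below 3 in every one of these columns are kept. This step is intended to improve data quality and model performance.
2. Interaction and polynomial feature generation:
• PolynomialFeatures is applied to the selected sensor data to generate interaction and polynomial features. This lets the model learn nonlinear relationships in the data, which can help prediction accuracy.
3. Feature selection:
• SelectKBest with f_classif selects the 20 features with the highest ANOVA F-scores against the emotion label. This keeps model complexity in check and focuses the model on the most informative features, which can improve performance.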
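The row filter in step 1 can be written out with NumPy alone to show what it does. The readings below are made up: one column contains a single extreme value and the other is well behaved. The population standard deviation (ddof=0) is used, which is also the default of scipy.stats.zscore.

```python
import numpy as np

# Hypothetical readings: column 0 has one extreme value, column 1 is well behaved
col0 = np.array([10.0] * 10 + [11.0] * 9 + [100.0])
col1 = np.tile([0.0, 1.0], 10)
readings = np.column_stack([col0, col1])

# Population z-score per column (ddof=0)
z = (readings - readings.mean(axis=0)) / readings.std(axis=0)

# Keep a row only if every column lies within 3 standard deviations of its mean
mask = (np.abs(z) < 3).all(axis=1)
kept = readings[mask]

print(readings.shape, kept.shape)
```

Because .all(axis=1) is used, a row is discarded as soon as any one of its columns is extreme, even if its other columns are ordinary.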
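Compared with the previous model, these additional steps aim to improve the quality of the data and help the model learn meaningful patterns from it. However, outlier handling and feature selection can also discard important information, so these methods should be applied with care.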
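How SelectKBest with f_classif ranks features can be seen on synthetic data. In this sketch the data is made up: one column is constructed to follow the class label closely and four columns are pure noise, and k=1 is chosen only to keep the output short.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)                 # three placeholder classes
informative = y + rng.normal(0, 0.1, 150)    # strongly related to the class
noise = rng.normal(0, 1, (150, 4))           # unrelated to the class

# The informative column is placed at index 2, between the noise columns
X = np.column_stack([noise[:, :2], informative, noise[:, 2:]])

selector = SelectKBest(f_classif, k=1)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))
print(selector.scores_)
```

f_classif scores each feature on its own, so it cannot detect features that are only useful in combination, and highly correlated features (such as a column and its square) tend to be selected together.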
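Performance evaluation result 2
1. Insufficient data processing
• Outliers and noise: if outliers or noise were not handled sufficiently, such data may have negatively affected model training.
• Data distribution: the new features may have distorted the underlying distribution of the data. For example, some features may be overemphasized relative to others.
2. Hyperparameter settings
• Model tuning: the hyperparameters used (max_iter, solver, and so on) may not have suited the new feature set. Without proper hyperparameter tuning, model performance may not be optimal.
3. Class imbalance
• Class ratio: if the emotion classes are unevenly distributed in the data set, the model may be biased toward the majority class.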
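One simple way to check for and reduce class imbalance is random downsampling, where every class is cut to the size of the rarest one. The sketch below uses only the standard library; the label counts are hypothetical and the downsample helper is introduced here for illustration, not taken from the report.

```python
import random
from collections import Counter, defaultdict


def downsample(rows, labels, seed=0):
    """Randomly keep as many rows per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    n_min = min(len(items) for items in by_class.values())
    out_rows, out_labels = [], []
    for label, items in by_class.items():
        for row in rng.sample(items, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical, imbalanced labels; the rows are just row numbers here
labels = ['neutral'] * 6 + ['happy'] * 3 + ['sad'] * 2
rows = list(range(len(labels)))

balanced_rows, balanced_labels = downsample(rows, labels)
print(Counter(labels))
print(Counter(balanced_labels))
```

Downsampling should be applied to the training split only, so that the test set keeps the original class proportions. It also throws away data, which can hurt when the rarest class is very small.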
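Conclusion
The drop in performance despite adding polynomial and interaction features through feature engineering may stem from several factors. Insufficient data processing, unsuitable hyperparameter settings, and class imbalance are the likely suspects, and the following approaches can be tried to improve performance.
• Revisit data preprocessing: use broad and rigorous EDA to re-examine outlier removal, normalization, and similar steps, and improve data quality.
• Hyperparameter optimization: find the best hyperparameters with grid search, random search, and similar methods.
• Address class imbalance: use techniques such as SMOTE or downsampling to resolve the class imbalance.
In the end, the model's complexity needs to be balanced against the characteristics of the data, and more careful model tuning is required.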
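A grid search over the regularization strength could look like the sketch below. It runs on synthetic data generated with make_classification, and the grid values are arbitrary choices for illustration. It also uses a single multi-class LogisticRegression inside a pipeline rather than the one-model-per-class setup used in this report, so it illustrates the tuning procedure, not the report's exact model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data, for illustration only
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)

# Scaling is part of the pipeline, so it is refit inside each cross-validation fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# In scikit-learn, smaller C means stronger L2 regularization
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Comparing the cross-validated score across the values of C shows whether stronger regularization helps, which is the check that Task2 asks for when added features lead to overfitting.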
