Boosting Algorithms with Wine Data

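1. Ensemble

1.1 What is an ensemble

Ensemble methods are traditionally divided into Voting, Bagging, Boosting, and Stacking.
In voting and bagging, several classifiers vote to decide the final prediction.
The difference is that voting combines different kinds of classifiers, while bagging trains the same kind of classifier on different bootstrap samples of the data.
Random Forest is the best-known bagging method.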
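
The contrast between voting and bagging can be sketched with scikit-learn. This is a minimal illustration on synthetic stand-in data (an assumption, not the wine data); the choice of base classifiers and the names `X_demo`, `voting`, and `bagging` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the wine data used later)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=13)

# Voting: different kinds of classifiers, all trained on the same data, vote
voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('knn', KNeighborsClassifier()),
        ('dt', DecisionTreeClassifier(random_state=13)),
    ],
    voting='hard',  # majority vote on predicted labels
)

# Bagging: one kind of classifier, each copy trained on its own bootstrap sample
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=13)

for name, clf in [('voting', voting), ('bagging', bagging)]:
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))
```

The scores depend on the synthetic data and should not be read as a ranking of the two approaches.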

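1.2 Overview of Boosting

Several classifiers are trained sequentially; each new classifier gives more weight to the data that the previous classifiers predicted incorrectly and continues training from there.
Boosting often delivers strong predictive performance and has become a leading approach in ensemble learning.
Examples include Gradient Boost, XGBoost, and LightGBM.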

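1.3 Difference between bagging and boosting

Bagging: the models are independent, so they can be trained in parallel and their results combined at once.
Boosting: the models are trained sequentially, each one depending on those before it.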
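
One way to see this difference in scikit-learn, as a rough sketch on synthetic stand-in data (an assumption, not the wine data): a random forest can build its trees in parallel via `n_jobs`, while a gradient boosting model exposes its sequential stages through `staged_predict`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the wine data used later)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=13)

# Bagging-style: trees are independent, so they can be built on several cores
rf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=13)
rf.fit(X_tr, y_tr)

# Boosting: tree k is fitted only after trees 1..k-1 exist
gb = GradientBoostingClassifier(n_estimators=50, random_state=13)
gb.fit(X_tr, y_tr)

# Test accuracy after each boosting stage (one entry per tree added)
staged_acc = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]

print('random forest:', rf.score(X_te, y_te))
print('boosting after 1, 10, 50 trees:', staged_acc[0], staged_acc[9], staged_acc[49])
```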

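1.4 AdaBoost

Weights are assigned sequentially to reach the final result.
AdaBoost is usually built on decision trees, typically very shallow ones (stumps).
It runs through several steps, and at each step it increases the weight of the misclassified data and fits a new decision boundary.
Finally, the boundaries found in all the steps are combined, with each step weighted by how accurate it was.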
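
The steps above can be sketched by hand. This is a simplified version of discrete AdaBoost with decision stumps on synthetic stand-in data (an assumption); it is meant to show the reweighting idea, not to reproduce scikit-learn's `AdaBoostClassifier` exactly. The names `y_pm`, `stumps`, `alphas`, and `predict_boosted` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); labels recoded to {-1, +1}
X_demo, y01 = make_classification(n_samples=400, n_features=8, random_state=13)
y_pm = np.where(y01 == 1, 1, -1)

n_rounds = 20
weights = np.full(len(y_pm), 1 / len(y_pm))  # start with equal sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Fit a depth-1 tree (stump) that respects the current sample weights
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_demo, y_pm, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Weighted error of this stump, clipped to avoid division by zero
    err = np.clip(np.sum(weights[pred != y_pm]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # say of this stump in the final vote

    # Misclassified samples (y * pred = -1) get heavier, correct ones lighter
    weights *= np.exp(-alpha * y_pm * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def predict_boosted(X):
    """Combine all stumps: sign of the alpha-weighted sum of their votes."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)


train_acc = np.mean(predict_boosted(X_demo) == y_pm)
print('training accuracy of the combined stumps:', train_acc)
```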

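1.5 Boosting techniques

GBM (Gradient Boosting Machine): similar to AdaBoost, but it uses gradient descent to decide each update, so every new learner is fitted to the gradient of the loss.
XGBoost: builds on GBM and adopts various techniques to use the machine's computing power efficiently, which gives it high speed and efficiency.
LightGBM: generally trains faster than XGBoost.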
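
To make the gradient idea concrete, here is a minimal hand-rolled sketch for regression with squared loss, where the negative gradient is simply the residual. The data, `learning_rate`, `n_rounds`, and tree depth are arbitrary assumptions for illustration; real GBM, XGBoost, and LightGBM implementations add many refinements on top of this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (assumption): noisy sine curve
rng = np.random.default_rng(13)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Start from a constant model: the mean of the targets
prediction = np.full_like(y_demo, y_demo.mean())
trees = []
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # For the loss 0.5 * (y - F)^2 the negative gradient w.r.t. F is y - F
    residual = y_demo - prediction

    # Fit a small tree to the residuals and take a small step in that direction
    tree = DecisionTreeRegressor(max_depth=2, random_state=13)
    tree.fit(X_demo, residual)
    prediction += learning_rate * tree.predict(X_demo)

    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print('training MSE before and after boosting:', mse_history[0], mse_history[-1])
```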

1.6 Bagging = Bootstrap AGGregatING
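
As the name says, bagging draws bootstrap samples (rows sampled with replacement) and aggregates the models trained on them. A small hand-written sketch, assuming synthetic stand-in data and an arbitrary choice of 25 trees:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the wine data used later)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=13)
rng = np.random.default_rng(13)

n_estimators = 25  # odd, so a majority vote over 0/1 labels cannot tie
trees = []
for _ in range(n_estimators):
    # Bootstrap: draw as many row indices as there are rows, with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier().fit(X_demo[idx], y_demo[idx]))

# Aggregating: every tree votes, and the majority label wins
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_estimators, n_samples)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

print('training accuracy of the bagged trees:', np.mean(bagged_pred == y_demo))
```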

1.7 Difference between Bagging and Boosting

2. Hands-on with the Wine data

2.1 Data load

import pandas as pd

wine_url = 'https://raw.githubusercontent.com/hmkim312/datas/main/wine/wine.csv'

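wine = pd.read_csv(wine_url, index_col=0)
# Binary target: 1.0 if quality is above 5, otherwise 0.0
wine['taste'] = [1. if grade > 5 else 0. for grade in wine['quality']]

# Drop the target and the column it was derived from
X = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']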

Load the data and create the taste column based on quality.

2.2 Apply a scaler, then split the data

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_sc = sc.fit_transform(X)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_sc, y, test_size=0.2, random_state=13)
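
The code above fits the scaler on all rows before splitting, so the test rows contribute to the mean and variance used for scaling. A common alternative, sketched here on synthetic stand-in data (an assumption, not the wine data), is to split first and put the scaler in a pipeline so that it is fitted on the training rows only. For the tree-based models used later the difference is usually small, since trees are largely insensitive to feature scaling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 11 features (assumption: not the wine data)
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=13)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=13)

# The pipeline fits the scaler on X_tr only, then reuses it to transform X_te
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))
pipe.fit(X_tr, y_tr)

print('test accuracy:', pipe.score(X_te, y_te))
```

A pipeline like this can also be passed directly to `cross_val_score`, in which case the scaler is refitted inside every fold.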

2.3 Histograms of all columns

import matplotlib.pyplot as plt
%matplotlib inline

wine.hist(bins = 10, figsize=(24, 24))
plt.show()

2.4 How the other features differ by quality

column_names = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
                'pH', 'sulphates', 'alcohol']
# Median of each feature for every quality grade
df_pivot_table = wine.pivot_table(values=column_names, index='quality', aggfunc='median')
df_pivot_table

	alcohol	chlorides	citric acid	density	fixed acidity	free sulfur dioxide	pH	residual sugar	sulphates	total sulfur dioxide	volatile acidity
quality
3	10.15	0.0550	0.33	0.995900	7.45	17.0	3.245	3.15	0.505	102.5	0.415
4	10.00	0.0505	0.26	0.994995	7.00	15.0	3.220	2.20	0.485	102.0	0.380
5	9.60	0.0530	0.30	0.996100	7.10	27.0	3.190	3.00	0.500	127.0	0.330
6	10.50	0.0460	0.31	0.994700	6.90	29.0	3.210	3.10	0.510	117.0	0.270
7	11.40	0.0390	0.32	0.992400	6.90	30.0	3.220	2.80	0.520	114.0	0.270
8	12.00	0.0370	0.32	0.991890	6.80	34.0	3.230	4.10	0.480	118.0	0.280
9	12.50	0.0310	0.36	0.990300	7.10	28.0	3.280	2.20	0.460	119.0	0.270

A pivot table of the median feature values, grouped by quality.
free sulfur dioxide appears to differ across quality grades.

2.5 Correlation of the other features with quality

corr_matrix = wine.corr()
print(corr_matrix['quality'].sort_values(ascending = False))

quality                 1.000000
taste                   0.814484
alcohol                 0.444319
citric acid             0.085532
free sulfur dioxide     0.055463
sulphates               0.038485
pH                      0.019506
residual sugar         -0.036980
total sulfur dioxide   -0.041385
fixed acidity          -0.076743
color                  -0.119323
chlorides              -0.200666
volatile acidity       -0.265699
density                -0.305858
Name: quality, dtype: float64

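Looking at the correlations with quality, alcohol has the clearest positive correlation (about 0.44), while density (about -0.31) and volatile acidity (about -0.27) have the strongest negative ones. free sulfur dioxide is positive but very weak (about 0.06).
taste was derived from quality, so the two are bound to be highly correlated; it is excluded from this comparison.

2.6 Distribution of the taste column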

import seaborn as sns

# Pass the column by keyword; recent seaborn versions no longer accept it positionally as x
sns.countplot(x='taste', data=wine)
plt.show()

In the taste column, tasty (1) is the more common class.

2.7 Testing several models at once

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

models = []
models.append(('RandomForestClassifier',RandomForestClassifier()))
models.append(('DecisionTreeClassifier',DecisionTreeClassifier()))
models.append(('AdaBoostClassifier',AdaBoostClassifier()))
models.append(('GradientBoostingClassifier',GradientBoostingClassifier()))
models.append(('LogisticRegression',LogisticRegression(solver = "liblinear")))

Several models are instantiated and added to the models list. No hyperparameters are tuned; only the solver for LogisticRegression is set.

2.8 Check the results

from sklearn.model_selection import KFold, cross_val_score

results  = []
names = []

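# The same shuffled 5-fold split is used for every model so the scores are comparable
kfold = KFold(n_splits=5, random_state=13, shuffle=True)

for name, model in models:
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')

    results.append(cv_results)
    names.append(name)

    # Mean and standard deviation of accuracy across the folds
    print(name, cv_results.mean(), cv_results.std())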

RandomForestClassifier 0.8185420522691939 0.018560021121147078
DecisionTreeClassifier 0.7498519286295995 0.013712535378522434
AdaBoostClassifier 0.7533103205745169 0.02644765901536818
GradientBoostingClassifier 0.7663959428444511 0.021596556352125432
LogisticRegression 0.7425394240023693 0.015704134753742827

Each model is validated with 5-fold cross-validation (KFold).

2.9 Plotting the cross-validation results

fig = plt.figure(figsize=(14,8))
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Random Forest looks best.
A boxplot is used because it shows the distribution of each model's accuracy across the folds, and any outliers, at a glance.

2.10 Applying the test data in the same way

from sklearn.metrics import accuracy_score

for name, model in models:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, accuracy_score(y_test, pred))

RandomForestClassifier 0.8392307692307692
DecisionTreeClassifier 0.7784615384615384
AdaBoostClassifier 0.7553846153846154
GradientBoostingClassifier 0.7876923076923077
LogisticRegression 0.7469230769230769

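Random Forest again gives the best result (about 0.84 test accuracy). With default hyperparameters, neither boosting model beats it on this data.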
