
Predicting Titanic Survivors with Machine Learning

1. Titanic EDA


1.1 Loading the Data

import pandas as pd

titanic = pd.read_excel('https://github.com/hmkim312/datas/blob/main/titanic/titanic.xls?raw=true')
titanic.head()
|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|--------|----------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
  • The Titanic, the largest passenger liner of its day, sank in 1912 on its voyage from England to New York.
  • The Titanic file is uploaded at the GitHub link above (a quick check of the loaded frame follows this list).
  • pclass : cabin class
  • survived : survival status (0 = died, 1 = survived)
  • sex : sex
  • name : name
  • age : age
  • sibsp : number of siblings/spouses aboard
  • parch : number of parents/children aboard
  • fare : fare paid
  • boat : lifeboat number used in the escape
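
A quick sanity check of the loaded frame (a minimal sketch; the expected shape follows from the info() output later in this post):

# Confirm dimensions and column names of the loaded frame
print(titanic.shape)             # (1309, 14): 1,309 passengers, 14 raw columns
print(titanic.columns.tolist())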


1.2 Survival Overview

import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(1, 2, figsize=(18, 8))

titanic['survived'].value_counts().plot.pie(
    explode=[0, 0.1], autopct='%1.1f%%', ax=ax[0], shadow=True)

ax[0].set_title('Pie plot - Survived')
ax[0].set_ylabel('')
sns.countplot(x='survived', data=titanic, ax=ax[1])
ax[1].set_title('Count plot - Survived')
plt.show()

  • Overall survival rate of 38.2% (computed directly below).
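
The exact figure can be read straight off the 0/1 column (a one-line sketch):

# Mean of a 0/1 column is the survival rate: 500 / 1309 ≈ 0.382
print(titanic['survived'].mean())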


1.3 Survival by Sex

f, ax = plt.subplots(1, 2, figsize=(18, 8))

sns.countplot(x='sex', data=titanic, ax=ax[0])
ax[0].set_title('Count of Passengers by Sex')
ax[0].set_ylabel('')

sns.countplot(x='sex', hue='survived', data=titanic, ax=ax[1])
ax[1].set_title('Sex: Survived and Unsurvived')
plt.show()

  • More of the passengers were men, but men's survival rate was lower.


1.4 Survival Rate by Economic Status

pd.crosstab(titanic['pclass'], titanic['survived'], margins=True)
| pclass \ survived | 0 | 1 | All |
|-------------------|---|---|-----|
| 1 | 123 | 200 | 323 |
| 2 | 158 | 119 | 277 |
| 3 | 528 | 181 | 709 |
| All | 809 | 500 | 1309 |
  • First class had a much higher chance of survival.
  • Women's survival rate is also high.
  • Did first class carry many women? (checked in the sketch below)
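
Both questions can be answered numerically with the same crosstab tools (a small sketch):

# Survival rate within each class (rows sum to 1)
print(pd.crosstab(titanic['pclass'], titanic['survived'], normalize='index'))

# How many women and men each class carried
print(pd.crosstab(titanic['pclass'], titanic['sex'], margins=True))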


1.5 Sex Distribution by Cabin Class

grid = sns.FacetGrid(titanic, row = 'pclass', col = 'sex', height = 4, aspect=2)
grid.map(plt.hist, 'age', alpha =0.8, bins = 20)
grid.add_legend()
plt.show()

  • Third class carried many men, especially men in their twenties.


1.6 Passengers by Age

import plotly.express as px

fig = px.histogram(titanic, x = 'age')
fig.show()

  • Children and passengers in their twenties and thirties make up a large share.


1.7 Survival Rate by Class, Viewed by Age

grid = sns.FacetGrid(titanic, col = 'survived', row = 'pclass', height = 4, aspect = 2)
grid.map(plt.hist, 'age', alpha = .5, bins = 20)
grid.add_legend()
plt.show()

  • The higher the cabin class, the higher the survival rate.


1.8 Binning Age into Five Groups

titanic['age_cat'] = pd.cut(titanic['age'], bins=[0, 7, 15, 30, 60, 100],
                            include_lowest=True, labels=['baby', 'teen', 'young', 'adult', 'old'])
titanic.head()
|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | age_cat |
|---|--------|----------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|---------|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | young |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | baby |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | baby |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | young |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | young |
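
A quick look at how the passengers fall into the five bins (sketch):

# Count passengers per age bin; the 263 missing ages are excluded
print(titanic['age_cat'].value_counts())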


1.9 Survival by Age, Sex, and Class at a Glance

plt.figure(figsize=(12, 4))
plt.subplot(131)
sns.barplot(x='pclass', y='survived', data=titanic)
plt.subplot(132)
sns.barplot(x='age_cat', y='survived', data=titanic)
plt.subplot(133)
sns.barplot(x='sex', y='survived', data=titanic)
plt.subplots_adjust(top=1, bottom=0.1, left=0.1,
                    right=1, hspace=0.5, wspace=0.5)
plt.show()

  • Being young, female, and in first class looks to have favored survival (see the grouped means below).
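
The same impression in numbers, via a grouped mean (sketch):

# Mean survival rate for each sex x class combination
print(titanic.groupby(['sex', 'pclass'])['survived'].mean())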


1.10 Survival by Age for Men and Women

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))

women = titanic[titanic['sex'] == 'female']
men = titanic[titanic['sex'] == 'male']

# sns.distplot is deprecated in recent seaborn; sns.histplot is the modern equivalent
ax = sns.distplot(women[women['survived'] == 1]['age'], bins=20,
                  label='survived', ax=axes[0], kde=False)
ax = sns.distplot(women[women['survived'] == 0]['age'], bins=40,
                  label='not survived', ax=axes[0], kde=False)
ax.legend()
ax.set_title('Female')

ax = sns.distplot(men[men['survived'] == 1]['age'], bins=18,
                  label='survived', ax=axes[1], kde=False)
ax = sns.distplot(men[men['survived'] == 0]['age'], bins=40,
                  label='not survived', ax=axes[1], kde=False)
ax.legend()
ax.set_title('Male')
Text(0.5, 1.0, 'Male')

1.11 Reading Social Status from Passenger Names

for idx, dataset in titanic.iterrows():
    print(dataset['name'])
Allen, Miss. Elisabeth Walton
Allison, Master. Hudson Trevor
Allison, Miss. Helen Loraine
Allison, Mr. Hudson Joshua Creighton
Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
...


1.12 Extracting a title Column for Social Status

import re

title = []

for idx, dataset in titanic.iterrows():
    tmp = dataset['name']
    # names look like 'Allen, Miss. Elisabeth Walton': grab the word(s)
    # between ', ' and '.', then trim those two delimiters off the match
    title.append(re.search(r',\s\w+(\s\w+)?\.', tmp).group()[2:-1])

titanic['title'] = title
titanic.head()


|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | age_cat | title |
|---|--------|----------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|---------|-------|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | young | Miss |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | baby | Master |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | baby | Miss |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | young | Mr |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | young | Mrs |
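
For reference, the same titles can be pulled out without an explicit loop; a vectorized sketch (the regex is an assumption that mirrors the loop above):

# Capture the word(s) between ', ' and the first '.' for every name at once
titanic['name'].str.extract(r',\s(\w+(?:\s\w+)?)\.')[0].head()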


1.13 Titles by Sex

pd.crosstab(titanic['title'], titanic['sex'])
| title \ sex | female | male |
|-------------|--------|------|
| Capt | 0 | 1 |
| Col | 0 | 4 |
| Don | 0 | 1 |
| Dona | 1 | 0 |
| Dr | 1 | 7 |
| Jonkheer | 0 | 1 |
| Lady | 1 | 0 |
| Major | 0 | 2 |
| Master | 0 | 61 |
| Miss | 260 | 0 |
| Mlle | 2 | 0 |
| Mme | 1 | 0 |
| Mr | 0 | 757 |
| Mrs | 197 | 0 |
| Ms | 2 | 0 |
| Rev | 0 | 8 |
| Sir | 0 | 1 |
| the Countess | 1 | 0 |


1.14 Preprocessing the Social Titles

titanic['title'] = titanic['title'].replace('Mlle', 'Miss')
titanic['title'] = titanic['title'].replace('Ms', 'Miss')
titanic['title'] = titanic['title'].replace('Mme', 'Mrs')

Rare_f = ['Dona', 'Dr', 'Lady', 'the Countess']
Rare_m = ['Capt', 'Col', 'Don', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Master']

for each in Rare_f:
    titanic['title'] = titanic['title'].replace(each, 'Rare_f')

for each in Rare_m:
    titanic['title'] = titanic['title'].replace(each, 'Rare_m')

titanic['title'].unique()
array(['Miss', 'Rare_m', 'Mr', 'Mrs', 'Rare_f'], dtype=object)
  • Passengers holding noble or otherwise uncommon titles were consolidated into Miss, Mrs, Mr, Rare_f, and Rare_m.


1.15 The Nobility Survived More Often Than Expected

titanic[['title','survived']].groupby(['title'], as_index=False).mean()
|   | title | survived |
|---|-------|----------|
| 0 | Miss | 0.678030 |
| 1 | Mr | 0.162483 |
| 2 | Mrs | 0.787879 |
| 3 | Rare_f | 0.636364 |
| 4 | Rare_m | 0.443038 |
  • The nobility's survival rate is also on the high side.


2. Predicting Survivors with Machine Learning


2.1 Checking the Structure

titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   int64   
 1   survived   1309 non-null   int64   
 2   name       1309 non-null   object  
 3   sex        1309 non-null   object  
 4   age        1046 non-null   float64 
 5   sibsp      1309 non-null   int64   
 6   parch      1309 non-null   int64   
 7   ticket     1309 non-null   object  
 8   fare       1308 non-null   float64 
 9   cabin      295 non-null    object  
 10  embarked   1307 non-null   object  
 11  boat       486 non-null    object  
 12  body       121 non-null    float64 
 13  home.dest  745 non-null    object  
 14  age_cat    1046 non-null   category
 15  title      1309 non-null   object  
dtypes: category(1), float64(3), int64(4), object(8)
memory usage: 155.0+ KB
  • Null handling and label encoding look necessary (null counts sketched below).
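
A compact view of where the gaps are (sketch):

# Missing values per column, largest first
print(titanic.isnull().sum().sort_values(ascending=False))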


2.2 Converting the Sex Column to Numbers

titanic['sex'].unique()
array(['female', 'male'], dtype=object)


2.3 Using LabelEncoder

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(titanic['sex'])
titanic['gender'] = le.transform(titanic['sex'])
titanic.head()


|   | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | age_cat | title | gender |
|---|--------|----------|------|-----|-----|-------|-------|--------|------|-------|----------|------|------|-----------|---------|-------|--------|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO | young | Miss | 0 |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON | baby | Rare_m | 1 |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | baby | Miss | 0 |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON | young | Mr | 1 |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON | young | Mrs | 0 |
  • A gender column was created holding the numerically encoded sex (mapping confirmed below).
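
LabelEncoder assigns codes in alphabetical order of the labels; the mapping can be confirmed from the fitted encoder (sketch):

# classes_ lists labels in code order: here female -> 0, male -> 1
print(dict(zip(le.classes_, le.transform(le.classes_))))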


2.4 Dropping Missing Values

titanic = titanic[titanic['age'].notnull()]
titanic = titanic[titanic['fare'].notnull()]
titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1045 entries, 0 to 1308
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1045 non-null   int64   
 1   survived   1045 non-null   int64   
 2   name       1045 non-null   object  
 3   sex        1045 non-null   object  
 4   age        1045 non-null   float64 
 5   sibsp      1045 non-null   int64   
 6   parch      1045 non-null   int64   
 7   ticket     1045 non-null   object  
 8   fare       1045 non-null   float64 
 9   cabin      272 non-null    object  
 10  embarked   1043 non-null   object  
 11  boat       417 non-null    object  
 12  body       119 non-null    float64 
 13  home.dest  685 non-null    object  
 14  age_cat    1045 non-null   category
 15  title      1045 non-null   object  
 16  gender     1045 non-null   int64   
dtypes: category(1), float64(3), int64(5), object(8)
memory usage: 140.0+ KB
  • Rows with a missing age or fare were dropped, leaving 1,045 entries.


2.5 Correlations

# numeric_only is required on recent pandas to skip the text columns
correlation_matrix = titanic.corr(numeric_only=True).round(1)
sns.heatmap(data=correlation_matrix, annot=True, cmap='bwr')
plt.show()

  • Looking at the correlations among the numeric columns, survived correlates most strongly with gender and pclass (sorted list below).
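
The heatmap's message as a sorted list (sketch; numeric_only again keeps the object columns out on recent pandas):

# Correlation of every numeric column with survived, strongest first
corr = titanic.corr(numeric_only=True)['survived'].drop('survived')
print(corr.sort_values(key=abs, ascending=False))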


2.6 Selecting Features and Splitting the Data

from sklearn.model_selection import train_test_split

X = titanic[['pclass', 'age', 'sibsp', 'parch', 'fare','gender']]
y = titanic['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)
  • Only ‘pclass’, ‘age’, ‘sibsp’, ‘parch’, ‘fare’, and ‘gender’ are used as features for prediction.


2.7 DecisionTree

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dt = DecisionTreeClassifier(max_depth= 4, random_state= 13)
dt.fit(X_train, y_train)

pred = dt.predict(X_test)

print(accuracy_score(y_test, pred))
0.7655502392344498
  • The decision tree comes in at an accuracy of about 0.766.
  • Not as high as expected (a cross-validation sketch follows).
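
A single split can flatter or punish a model by chance; 5-fold cross-validation on the same depth-4 tree gives a steadier estimate (a sketch; exact scores will vary):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Accuracy over 5 folds instead of one fixed split
scores = cross_val_score(DecisionTreeClassifier(max_depth=4, random_state=13), X, y, cv=5)
print(scores.mean(), scores.std())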


2.8 LogisticRegression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression(random_state= 13, solver='liblinear')
lr.fit(X_train, y_train)

pred = lr.predict(X_test)
print(accuracy_score(y_test, pred))
0.7511961722488039
  • Logistic regression comes in at 0.75 (coefficients inspected below).
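
Since logistic regression is linear, its coefficients show each feature's direction of effect on the odds of survival (sketch):

# Positive coefficient -> raises survival odds; negative -> lowers them
print(pd.Series(lr.coef_[0], index=X.columns))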


2.9 RandomForest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(random_state= 13, n_estimators= 100, max_depth=4)
rf.fit(X_train, y_train)

pred = rf.predict(X_test)
print(accuracy_score(y_test, pred))
0.7799043062200957
  • The random forest ensemble scores higher than the other two models (feature importances below).
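
The fitted forest also reports how much each feature contributed to its splits (sketch):

# Impurity-based importances, largest first
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))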

2.10 Building a Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

estimators = [('scaler', StandardScaler()),
             ('clf', RandomForestClassifier(random_state=13))]

pipe = Pipeline(estimators)
  • A Pipeline was built from a StandardScaler followed by a random forest (a usage sketch follows).
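
Tree ensembles are largely insensitive to feature scaling, so the scaler here mainly demonstrates the pipeline pattern. The pipeline then behaves like a single estimator (a usage sketch on the split from section 2.6):

# Fit scaler + forest in one call, then score on the held-out set
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))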

2.11 Grid Search for the Best Parameters

from sklearn.model_selection import GridSearchCV

params = [{
    'clf__max_depth': [6, 8, 10, 100],
    'clf__n_estimators': [50, 100, 200, 1000]
}]

gridsearch = GridSearchCV(
    estimator=pipe, param_grid=params, return_train_score=True, cv=5, verbose=2)

gridsearch.fit(X, y)
Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


[CV] ........... clf__max_depth=6, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=6, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=6, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=6, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total=   1.0s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total=   1.1s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total=   1.0s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total=   1.1s
[CV] clf__max_depth=6, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=6, clf__n_estimators=1000, total=   1.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=50 ..........................
[CV] ........... clf__max_depth=8, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=100 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=8, clf__n_estimators=200 .........................
[CV] .......... clf__max_depth=8, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total=   1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total=   1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total=   1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total=   1.1s
[CV] clf__max_depth=8, clf__n_estimators=1000 ........................
[CV] ......... clf__max_depth=8, clf__n_estimators=1000, total=   1.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=50 .........................
[CV] .......... clf__max_depth=10, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=100 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=10, clf__n_estimators=200 ........................
[CV] ......... clf__max_depth=10, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total=   1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total=   1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total=   1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total=   1.2s
[CV] clf__max_depth=10, clf__n_estimators=1000 .......................
[CV] ........ clf__max_depth=10, clf__n_estimators=1000, total=   1.2s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=50 ........................
[CV] ......... clf__max_depth=100, clf__n_estimators=50, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=100 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=100, total=   0.1s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=100, clf__n_estimators=200 .......................
[CV] ........ clf__max_depth=100, clf__n_estimators=200, total=   0.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total=   1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total=   1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total=   1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total=   1.2s
[CV] clf__max_depth=100, clf__n_estimators=1000 ......................
[CV] ....... clf__max_depth=100, clf__n_estimators=1000, total=   1.3s


[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:   35.1s finished



GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('clf',
                                        RandomForestClassifier(random_state=13))]),
             param_grid=[{'clf__max_depth': [6, 8, 10, 100],
                          'clf__n_estimators': [50, 100, 200, 1000]}],
             return_train_score=True, verbose=2)
  • GridSearchCV was run on top of the pipeline.
  • max_depth candidates were 6, 8, 10, 100; n_estimators candidates were 50, 100, 200, 1000.
  • When grid-searching a pipeline, parameter names must carry the step name plus a double underscore, e.g. clf__max_depth (the winning combination is pulled out below).
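
Once fitted, the search object exposes the winning combination directly (sketch; the values follow from the table in the next section):

# Best hyper-parameters and their mean cross-validated accuracy
print(gridsearch.best_params_)   # {'clf__max_depth': 6, 'clf__n_estimators': 1000}
print(gridsearch.best_score_)    # ~0.7043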


2.12 Comparing Model Performance

score_df = pd.DataFrame(gridsearch.cv_results_)
score_df[['params', 'rank_test_score', 'mean_train_score',
          'mean_test_score', 'std_train_score']]
|    | params | rank_test_score | mean_train_score | mean_test_score | std_train_score |
|----|--------|-----------------|------------------|-----------------|-----------------|
| 0  | {'clf__max_depth': 6, 'clf__n_estimators': 50} | 4 | 0.861962 | 0.691866 | 0.013440 |
| 1  | {'clf__max_depth': 6, 'clf__n_estimators': 100} | 2 | 0.860526 | 0.694737 | 0.014366 |
| 2  | {'clf__max_depth': 6, 'clf__n_estimators': 200} | 3 | 0.859091 | 0.692823 | 0.013937 |
| 3  | {'clf__max_depth': 6, 'clf__n_estimators': 1000} | 1 | 0.859809 | 0.704306 | 0.013605 |
| 4  | {'clf__max_depth': 8, 'clf__n_estimators': 50} | 8 | 0.898325 | 0.684211 | 0.011597 |
| 5  | {'clf__max_depth': 8, 'clf__n_estimators': 100} | 5 | 0.900239 | 0.688038 | 0.012491 |
| 6  | {'clf__max_depth': 8, 'clf__n_estimators': 200} | 6 | 0.898804 | 0.686124 | 0.013247 |
| 7  | {'clf__max_depth': 8, 'clf__n_estimators': 1000} | 6 | 0.899043 | 0.686124 | 0.012985 |
| 8  | {'clf__max_depth': 10, 'clf__n_estimators': 50} | 12 | 0.933971 | 0.664115 | 0.013730 |
| 9  | {'clf__max_depth': 10, 'clf__n_estimators': 100} | 11 | 0.933971 | 0.666029 | 0.013264 |
| 10 | {'clf__max_depth': 10, 'clf__n_estimators': 200} | 9 | 0.932775 | 0.670813 | 0.012417 |
| 11 | {'clf__max_depth': 10, 'clf__n_estimators': 1000} | 9 | 0.935646 | 0.670813 | 0.014540 |
| 12 | {'clf__max_depth': 100, 'clf__n_estimators': 50} | 13 | 0.981100 | 0.648804 | 0.004102 |
| 13 | {'clf__max_depth': 100, 'clf__n_estimators': 100} | 14 | 0.981818 | 0.645933 | 0.003960 |
| 14 | {'clf__max_depth': 100, 'clf__n_estimators': 200} | 16 | 0.981818 | 0.643062 | 0.003960 |
| 15 | {'clf__max_depth': 100, 'clf__n_estimators': 1000} | 14 | 0.981818 | 0.645933 | 0.003960 |
  • Each candidate's scores are collected into a DataFrame for comparison. The deeper trees clearly overfit: mean train scores climb toward 0.98 while mean test scores fall.


2.13 Best Model

gridsearch.best_estimator_
Pipeline(steps=[('scaler', StandardScaler()),
                ('clf',
                 RandomForestClassifier(max_depth=6, n_estimators=1000,
                                        random_state=13))])
  • Confirming the best model: max_depth=6 with 1,000 trees.


2.14 Re-checking on the Test Data

pred = gridsearch.best_estimator_.predict(X_test)
print(accuracy_score(y_test, pred))
0.8325358851674641
  • The best model found by the grid search reaches an accuracy of 0.83 on the test data. Note, though, that gridsearch.fit(X, y) saw the test rows during cross-validation, so this figure is likely a little optimistic (a leakage-free variant is sketched below).
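
A stricter protocol fits the search on the training split only, so the test rows stay unseen until the final score (a sketch reusing the same pipe and params):

# Leakage-free variant: the grid search sees only the training split
gridsearch_train = GridSearchCV(estimator=pipe, param_grid=params, cv=5)
gridsearch_train.fit(X_train, y_train)

pred = gridsearch_train.best_estimator_.predict(X_test)
print(accuracy_score(y_test, pred))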


3. What Were DiCaprio's and Winslet's Survival Chances?


3.1 DiCaprio

import numpy as np
decaprio = np.array([[3, 18, 0, 0, 5, 1]])
print('Decaprio :', gridsearch.best_estimator_.predict_proba(decaprio)[0, 1])
Decaprio : 0.16496996405863845
  • DiCaprio's profile was entered as: 3rd class, age 18, no siblings or spouse, no parents or children, a $5 fare, male.
  • His survival probability comes out around 16%.


3.2 Winslet

import numpy as np
winslet = np.array([[1, 16, 1, 1, 100, 0]])
print('Winslet :', gridsearch.best_estimator_.predict_proba(winslet)[0, 1])
Winslet : 0.9628936507308983
  • Winslet's profile was entered as: 1st class, age 16, 1 sibling, 1 parent, a $100 fare, female.
  • Her survival probability comes out around 96% (a DataFrame-based call is sketched below).
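
One housekeeping note: the model was fitted on a DataFrame with named columns, so recent scikit-learn releases warn when predict_proba receives a bare NumPy array. A sketch passing the same profile with feature names (column order matches X from section 2.6):

import pandas as pd

# Same Winslet profile, wrapped with the training feature names
winslet_df = pd.DataFrame([[1, 16, 1, 1, 100, 0]],
                          columns=['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender'])
print('Winslet :', gridsearch.best_estimator_.predict_proba(winslet_df)[0, 1])
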
This post is licensed under CC BY 4.0 by the author.