PCA and kNN on the Titanic Dataset

1. Preprocessing the Titanic data

1.1 Data load

import pandas as pd

titanic_url = 'https://github.com/hmkim312/datas/blob/main/titanic/titanic.xls?raw=true'
titanic = pd.read_excel(titanic_url)
titanic.head()

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON

The data is uploaded to a GitHub repository, so it can be read directly from that URL.
For an EDA of the Titanic data, see https://hmkim312.github.io/posts/타이타닉_튜토리얼_with_Kaggle/

1.2 Building a title column from the name

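import re

# Match ", <title>." (one or two words) and strip the leading ", " and trailing "."
# A raw string avoids invalid-escape warnings for \s and \w.
title = []
for idx, dataset in titanic.iterrows():
    title.append(re.search(r',\s\w+(\s\w+)?\.', dataset['name']).group()[2:-1])

titanic['title'] = title
titanic.head()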

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest	title
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO	Miss
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON	Master
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON	Miss
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON	Mr
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON	Mrs

A title column holding values such as Miss and Master is created from the name column.
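The same extraction can also be written without an explicit loop. The following is a minimal sketch using the pandas `str.extract` method with a capture group, not the code used in this post. The first three names come from the table above; the last one is a made-up name that only illustrates the two-word case.

```python
import pandas as pd

# Three names from the table above plus one hypothetical two-word title.
names = pd.Series([
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Allison, Mrs. Hudson J C (Bessie Waldo Daniels)',
    'Doe, the Countess. of Example',  # hypothetical, for illustration only
])

# The capture group keeps only the title, so no slicing like [2:-1] is needed.
titles = names.str.extract(r',\s(\w+(?:\s\w+)?)\.', expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, the result is a Series that can be assigned straight to a column.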

1.3 Separating nobility from commoners

print(set(title))

{'Sir', 'Dr', 'Mme', 'Major', 'Col', 'Mlle', 'Don', 'Jonkheer', 'Rev', 'Mr', 'Master', 'Dona', 'Ms', 'Capt', 'Lady', 'Mrs', 'Miss', 'the Countess'}

Apart from common titles such as Miss, Mr, and Ms, the set contains noble and other rare titles. These are merged into one rare category per sex.

titanic['title'] = titanic['title'].replace('Mlle', 'Miss')
titanic['title'] = titanic['title'].replace('Ms', 'Miss')
titanic['title'] = titanic['title'].replace('Mme', 'Mrs')

Rare_f = ['Dona', 'Dr','Lady','the Countess']
Rare_m = ['Capt', 'Col','Don','Major','Rev','Sir','Jonkheer','Master']

for each in Rare_f:
    titanic['title'] = titanic['title'].replace(each, 'Rare_f')
    
for each in Rare_m:
    titanic['title'] = titanic['title'].replace(each, 'Rare_m')
    
titanic['title'].unique()

array(['Miss', 'Rare_m', 'Mr', 'Mrs', 'Rare_f'], dtype=object)

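Mlle and Ms are changed to Miss
Mme is changed to Mrs
Dona, Dr, Lady, and the Countess are changed to the female rare title (Rare_f)
Capt, Col, Don, and the rest of Rare_m are changed to the male rare title (Rare_m)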
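The same merge can be expressed as one lookup table. This is a sketch under the assumption that a single dict passed to `Series.replace` is acceptable here; the groupings are copied from the lists above, and the short `titles` Series is sample input rather than the real column.

```python
import pandas as pd

# Sample titles standing in for titanic['title'].
titles = pd.Series(['Mlle', 'Ms', 'Mme', 'Dr', 'Lady', 'Capt', 'Master', 'Mr'])

rare_f = ['Dona', 'Dr', 'Lady', 'the Countess']
rare_m = ['Capt', 'Col', 'Don', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Master']

# One mapping for every rename; titles not listed (Mr, Mrs, Miss) pass through unchanged.
mapping = {'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'}
mapping.update({t: 'Rare_f' for t in rare_f})
mapping.update({t: 'Rare_m' for t in rare_m})

merged = titles.replace(mapping)
print(merged.tolist())
```

Keeping the rules in one dict makes it easy to see at a glance which group each title ends up in.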

1.4 Creating the gender column

from sklearn.preprocessing import LabelEncoder

le_sex = LabelEncoder()
le_sex.fit(titanic['sex'])
titanic['gender'] = le_sex.transform(titanic['sex'])

le_sex.classes_

array(['female', 'male'], dtype=object)

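LabelEncoder converts female and male in the sex column to 0 and 1.
A model cannot work with the strings female and male directly, so this preprocessing step turns them into numbers.
The numbers are only labels: 0 is not lower or worse than 1.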
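To make the mapping concrete, here is a small sketch on a toy list (the four values are sample input, not the real column). LabelEncoder sorts the classes, so female becomes 0 and male becomes 1, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Toy input standing in for titanic['sex'].
sex = ['male', 'female', 'female', 'male']

le = LabelEncoder()
codes = le.fit_transform(sex)           # fit and transform in one call

print(list(le.classes_))                # classes are stored in sorted order
print(codes.tolist())                   # integer code for each input value
print(le.inverse_transform([0, 1]).tolist())  # map codes back to strings
```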

1.5 Creating the grade column

le_grade = LabelEncoder()
le_grade.fit(titanic['title'])
titanic['grade'] = le_grade.transform(titanic['title'])

le_grade.classes_

array(['Miss', 'Mr', 'Mrs', 'Rare_f', 'Rare_m'], dtype=object)

In the same way, the title values Miss, Mr, Mrs, Rare_f, and Rare_m are label-encoded.

1.6 Dropping nulls

titanic = titanic[titanic['age'].notnull()]
titanic = titanic[titanic['fare'].notnull()]
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1045 entries, 0 to 1308
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1045 non-null   int64  
 1   survived   1045 non-null   int64  
 2   name       1045 non-null   object 
 3   sex        1045 non-null   object 
 4   age        1045 non-null   float64
 5   sibsp      1045 non-null   int64  
 6   parch      1045 non-null   int64  
 7   ticket     1045 non-null   object 
 8   fare       1045 non-null   float64
 9   cabin      272 non-null    object 
 10  embarked   1043 non-null   object 
 11  boat       417 non-null    object 
 12  body       119 non-null    float64
 13  home.dest  685 non-null    object 
 14  title      1045 non-null   object 
 15  gender     1045 non-null   int64  
 16  grade      1045 non-null   int64  
dtypes: float64(3), int64(6), object(8)
memory usage: 147.0+ KB

Rows with null values in the age and fare columns are removed.
The other columns that still contain nulls are not used.
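The two filters can also be written as a single `dropna(subset=...)` call. A minimal sketch on a toy frame (the values are sample data, not the real dataset) showing that both forms keep the same rows:

```python
import numpy as np
import pandas as pd

# Toy frame with nulls in age, fare, and an unused column.
df = pd.DataFrame({
    'age':   [29.0, np.nan, 2.0, 30.0],
    'fare':  [211.3, 151.5, np.nan, 151.5],
    'cabin': ['B5', None, None, None],
})

# Two-step boolean filtering, as in the post.
step = df[df['age'].notnull()]
step = step[step['fare'].notnull()]

# One call: drop rows where age or fare is null; nulls in cabin are ignored.
once = df.dropna(subset=['age', 'fare'])

print(once.index.tolist())
```

Both versions keep the original index labels, which is why the info output above still runs from 0 to 1308.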

2. PCA

2.1 Data split

from sklearn.model_selection import train_test_split

X = titanic[['pclass', 'age', 'sibsp', 'parch', 'fare', 'gender', 'grade']].astype('float')

y = titanic['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 13)

Only the pclass, age, sibsp, parch, fare, gender, and grade columns are used as the X data.

2.2 Writing the PCA helper functions

from sklearn.decomposition import PCA

def get_pca_data(ss_data, n_components = 2):
    pca = PCA(n_components = n_components)
    pca.fit(ss_data)
    
    return pca.transform(ss_data), pca

A function that fits a PCA and returns both the transformed data and the fitted PCA object

def get_pd_from_pca(pca_data, col_num):
    cols = ['pca_'+str(n) for n in range(col_num)]
    return pd.DataFrame(pca_data, columns = cols)

A function that turns the PCA output into a DataFrame

import numpy as np

def print_variance_ratio(pca, only_sum = False):
    if not only_sum:
        print('variance_ratio :', pca.explained_variance_ratio_)
    print('sum of variance_ratio: ', np.sum(pca.explained_variance_ratio_))

A function that prints the explained variance ratio of the PCA

2.3 Applying PCA (2 components)

pca_data, pca = get_pca_data(X_train, n_components=2)
print_variance_ratio(pca)

variance_ratio : [0.93577394 0.06326916]
sum of variance_ratio:  0.9990431009511274

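Two components already explain about 99.9% of the variance. X_train is passed to PCA unscaled here, so this figure is probably dominated by the features with the widest numeric spread (likely fare and age) rather than reflecting all seven features equally.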
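How much the explained variance ratio depends on feature scale can be checked on synthetic data. The sketch below is an illustration only: the four independent columns and their standard deviations are assumptions chosen to mimic one wide-range column and several narrow ones, not values taken from the Titanic data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Independent synthetic features with very different spreads (assumed values).
X_demo = np.column_stack([
    rng.normal(0, 50, n),    # wide range, fare-like
    rng.normal(0, 14, n),    # medium range, age-like
    rng.normal(0, 1, n),     # narrow
    rng.normal(0, 0.5, n),   # narrow
])

raw = PCA(n_components=2).fit(X_demo)
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

raw_sum = raw.explained_variance_ratio_.sum()
scaled_sum = scaled.explained_variance_ratio_.sum()
print('unscaled:', raw_sum)
print('scaled  :', scaled_sum)
```

On unscaled input the two wide columns account for nearly all of the variance, while after standardization two components of four independent features explain far less. This is one reason a StandardScaler step is usually placed before PCA.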

2.4 Visualizing the data

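import matplotlib.pyplot as plt
import seaborn as sns

pca_columns = ['pca_1', 'pca_2']
pca_pd = pd.DataFrame(pca_data, columns=pca_columns)
# Use .values so labels are assigned by position: y_train keeps the shuffled
# original index, which does not line up with pca_pd's fresh 0..n-1 index.
pca_pd['survived'] = y_train.values

sns.pairplot(pca_pd, hue='survived', height=5,
             x_vars=['pca_1'], y_vars=['pca_2'])

plt.show()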

Survivors and non-survivors do not appear to separate well.

2.5 Applying PCA (3 components)

pca_data, pca = get_pca_data(X_train, n_components=3)
print_variance_ratio(pca)

variance_ratio : [9.35773938e-01 6.32691630e-02 4.00903990e-04]
sum of variance_ratio:  0.9994440049413533

2.6 Building the DataFrame

pca_pd = get_pd_from_pca(pca_data, 3)

pca_pd['survived'] = y_train.values
pca_pd.head()

	pca_0	pca_1	pca_2	survived
0	-28.763184	4.479379	-0.451531	0
1	41.587362	22.084594	0.011834	0
2	-19.598979	-10.999936	0.558167	0
3	-28.232483	-6.559632	-1.349217	1
4	-29.055717	-1.510811	-0.538886	0

The data is now transformed into three components.
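The code in this section assigns `y_train.values` rather than `y_train` itself. The reason is that pandas aligns on index labels when a Series is assigned to a DataFrame column, and a split from `train_test_split` keeps the shuffled original index. A small sketch with made-up labels and values shows the difference:

```python
import pandas as pd

# Labels with a shuffled index, as a train/test split would produce (toy values).
labels = pd.Series([1, 0, 1], index=[7, 2, 5])

# A freshly built frame gets a default 0..n-1 index.
frame = pd.DataFrame({'pca_0': [0.1, 0.2, 0.3]})

aligned = frame.copy()
aligned['survived'] = labels           # matched by index label: only label 2 exists in both

positional = frame.copy()
positional['survived'] = labels.values  # matched by position

print(aligned)
print(positional)
```

In the aligned version only the row whose index label happens to appear in both objects receives a value, and the rest become NaN, so the label column no longer corresponds to the PCA rows.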

2.7 Visualizing the data

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

markers = ['^', 'o']

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

for i, marker in enumerate(markers):
    x_axis_data = pca_pd[pca_pd['survived'] == i]['pca_0']
    y_axis_data = pca_pd[pca_pd['survived'] == i]['pca_1']
    z_axis_data = pca_pd[pca_pd['survived'] == i]['pca_2']

    ax.scatter(x_axis_data, y_axis_data, z_axis_data,
               s=20, alpha=0.5, marker=marker)
    
ax.view_init(30, 80)
plt.show()

2.8 Building the pipeline

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

estimators = [('scaler', StandardScaler()),
              ('pca', PCA(n_components=3)),
              ('clf', KNeighborsClassifier(n_neighbors=20))]

pipe = Pipeline(estimators)
pipe.fit(X_train, y_train)

pred = pipe.predict(X_test)
print(accuracy_score(y_test, pred))

0.7703349282296651

The pipeline chains StandardScaler, PCA, and a kNN classifier.
Accuracy comes out to about 0.77.
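Once a pipeline is fitted, its individual steps remain accessible by name. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the Titanic features (an assumption made so the example runs on its own) and reads the explained variance ratio of the PCA step, which here is computed on scaled input.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with seven features (not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=7,
                                     n_informative=4, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=13)

pipe_demo = Pipeline([('scaler', StandardScaler()),
                      ('pca', PCA(n_components=3)),
                      ('clf', KNeighborsClassifier(n_neighbors=20))])
pipe_demo.fit(X_tr, y_tr)

# Each fitted step can be looked up by the name given in the step list.
ratios = pipe_demo.named_steps['pca'].explained_variance_ratio_
acc = pipe_demo.score(X_te, y_te)   # mean accuracy on the held-out split
print(ratios, acc)
```

Because the scaler and PCA are fitted inside the pipeline, they only ever see the training split, and the same fitted transforms are reused on the test split at prediction time.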

2.9 Survival probabilities for DiCaprio and Winslet

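# Feature order: pclass, age, sibsp, parch, fare, gender, grade
# gender: 0 = female, 1 = male; grade: 1 = Mr, 3 = Rare_f (see le_grade.classes_)
dicaprio = np.array([[3, 18, 0, 0, 5, 1, 1]])
print('DiCaprio : ', pipe.predict_proba(dicaprio)[0, 1])

winslet = np.array([[1, 16, 1, 1, 100, 0, 3]])
print('Winslet : ', pipe.predict_proba(winslet)[0, 1])

DiCaprio :  0.05
Winslet :  0.85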
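With the default uniform weights, the probability returned by a kNN classifier is simply the fraction of the k nearest neighbors that belong to the class. With n_neighbors=20, the values 0.05 and 0.85 above correspond to 1 and 17 survivors among the 20 nearest training passengers. A minimal sketch on toy one-dimensional data (made-up points and labels) shows the same arithmetic with k=4:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: five points on a line with binary labels.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([1, 1, 1, 0, 0])

knn = KNeighborsClassifier(n_neighbors=4).fit(X_toy, y_toy)

query = [[1.5]]
proba = knn.predict_proba(query)[0, 1]   # probability of class 1

# Recompute by hand: look up the 4 nearest neighbors and average their labels.
_, idx = knn.kneighbors(query)
by_hand = y_toy[idx[0]].mean()

print(proba, by_hand)
```

The four nearest points to 1.5 are 0, 1, 2, and 3, of which three carry label 1, so both numbers come out as 0.75. It also means the predicted probability can only take values that are multiples of 1/k.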
