Pandas Basics (3)

1. Pandas io (input, output)

1.1 Load

import pandas as pd

titanic = pd.read_csv('datas/train.csv')
titanic.tail(2)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.00	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q

csv = values separated by commas
tsv = values separated by tabs
Files are loaded with pd.read_xxx(path, options)
Here the Titanic data is loaded
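As a small sketch of the csv/tsv distinction, the snippet below parses the same records once as comma-separated and once as tab-separated text. The inline strings and column names are made up for illustration, and `io.StringIO` stands in for a file path.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the Titanic file.
csv_text = "name,age\nKim,30\nLee,25\n"
tsv_text = "name\tage\nKim\t30\nLee\t25\n"

# read_csv splits on commas by default.
df_csv = pd.read_csv(io.StringIO(csv_text))

# For tab-separated values, pass sep="\t".
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both calls should yield the same table.
print(df_csv.equals(df_tsv))
```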

1.2 Save

titanic.to_csv('datas/titanic.csv', index = False)

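sep = changes the separator, so the values can be split by something else, for example \t
index = the row index is usually not saved (index=False), since a new default index is created when the file is loaded anyway
Data is saved with dataframe.to_xxx(path, options)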
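A minimal sketch of the `sep` and `index` options, writing to an in-memory buffer instead of a file. The small DataFrame here is hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"name": ["Kim", "Lee"], "age": [30, 25]})

# index=False leaves the row labels out; sep="\t" writes tab-separated text.
buf = io.StringIO()
df.to_csv(buf, sep="\t", index=False)
tsv_text = buf.getvalue()

# With the default index=True, the row labels become an extra first column.
# Omitting the path makes to_csv return the text as a string.
with_index = df.to_csv()

# Reading the tab-separated text back requires the same separator.
restored = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(tsv_text)
print(with_index)
```

Comparing the two printed outputs shows the unnamed leading column that appears when the index is written.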

1.3 UnicodeDecodeError

pd.read_csv("datas/2014_p.csv")
df.tail

---------------------------------------------------------------------------

UnicodeDecodeError                        Traceback (most recent call last)

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)

<ipython-input-3-b8fe44f7bef5> in <module>
----> 1 pd.read_csv("datas/2014_p.csv")
      2 df.tail

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    674         )
    675 
--> 676         return _read(filepath_or_buffer, kwds)
    677 
    678     parser_f.__name__ = name

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    452 
    453     try:
--> 454         data = parser.read(nrows)
    455     finally:
    456         parser.close()

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in read(self, nrows)
   1131     def read(self, nrows=None):
   1132         nrows = _validate_integer("nrows", nrows)
-> 1133         ret = self._engine.read(nrows)
   1134 
   1135         # May alter columns / col_dict

/opt/anaconda3/lib/python3.8/site-packages/pandas/io/parsers.py in read(self, nrows)
   2035     def read(self, nrows=None):
   2036         try:
-> 2037             data = self._reader.read(nrows)
   2038         except StopIteration:
   2039             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._string_convert()

pandas/_libs/parsers.pyx in pandas._libs.parsers._string_box_utf8()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

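This error means the file cannot be decoded as utf-8
utf-8 usually works, but if the file was saved with a different encoding, that encoding has to be passed as an option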

df = pd.read_csv('datas/2014_p.csv', encoding='euc-kr')
df.tail()

	ID	RCTRCK	RACE_DE	RACE_NO	PARTCPT_NO	RANK	RCHOSE_NM	HRSMN	RCORD	ARVL_DFFRNC	EACH_SCTN_PASAGE_RANK	A_WIN_SYTM_EXPECT_ALOT	WIN_STA_EXPECT_ALOT
27213	27214	제주	2014-11-29	9	4	2.0	황용신화	이재웅	0:01:27.1	2½	4 - - - 5 - 5 - 2	1.8	2.2
27214	27215	제주	2014-11-29	4	5	2.0	백록장원	장우성	0:01:19.9	머리	7 - - - 7 - 6 - 4	3.5	1.3
27215	27216	제주	2014-11-29	4	3	7.0	산정무한	안득수	0:01:22.8	1½	4 - - - 4 - 4 - 6	30.9	5.2
27216	27217	제주	2014-11-29	9	7	6.0	미주여행	김경휴	0:01:31.1	13	2 - - - 2 - 3 - 6	6.2	9.4
27217	27218	제주	2014-11-29	9	6	1.0	철옹성	장우성	0:01:26.6	NaN	1 - - - 1 - 1 - 1	3.9	2.9

Passing euc-kr as the encoding option reads the file correctly
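The failure and the fix can be reproduced without the original file. The sketch below writes a tiny euc-kr file to a temporary directory; the file name and rows are hypothetical stand-ins for datas/2014_p.csv.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows; only the column names mirror the real file.
text = "RCTRCK,RANK\n제주,2\n제주,7\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_euckr.csv")
    with open(path, "w", encoding="euc-kr") as f:
        f.write(text)

    # Reading with the default utf-8 decoding fails on the Korean bytes.
    try:
        pd.read_csv(path)
        failed = False
    except UnicodeDecodeError:
        failed = True

    # Naming the encoding the file was actually written with fixes it.
    df = pd.read_csv(path, encoding="euc-kr")

print(failed)
print(df)
```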

1.4 What is encoding?

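Encoding is the method used to convert characters into the binary form a computer stores. The three below are the most common
When loading, pass encoding="name"; it is usually utf-8
ascii : can encode only English letters, digits, and special characters
utf-8 : can encode English, Korean, Japanese, and the scripts of virtually every language
euc-kr : can encode English letters, Korean, digits, and special characters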
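The differences between the three encodings can be seen with plain Python strings. This is only an illustrative sketch; the sample text is arbitrary.

```python
# Arbitrary sample text mixing Hangul, English letters, and digits.
text = "판다스 pandas 123"

utf8_bytes = text.encode("utf-8")
euckr_bytes = text.encode("euc-kr")

# The same text becomes different byte sequences of different lengths.
print(len(utf8_bytes), len(euckr_bytes))

# ascii has no way to represent Hangul, so encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Bytes written as euc-kr are not valid utf-8, which is the situation
# behind the UnicodeDecodeError in section 1.3.
try:
    euckr_bytes.decode("utf-8")
    wrong_codec_ok = True
except UnicodeDecodeError:
    wrong_codec_ok = False

# Decoding with the matching codec restores the original text.
round_trip = euckr_bytes.decode("euc-kr")
```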

2. Pandas Pivot

2.1 What is Pandas pivot?

A way to build a new DataFrame by choosing which existing columns supply the index, the columns, and the values
df.pivot(index, columns, values)
- Run pivot after groupby (each index/column combination must appear only once)
df.pivot_table(values, index, columns, aggfunc)
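The difference between the two calls shows up when an index/column pair repeats. The sketch below uses a hypothetical four-row frame: `pivot` refuses to reshape it, while `pivot_table` collapses the duplicates with `aggfunc`.

```python
import pandas as pd

# Hypothetical rows; the (female, 1) pair appears twice.
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Pclass": [1, 1, 1, 2],
    "Fare": [80.0, 60.0, 30.0, 10.0],
})

# pivot only reshapes, so a duplicated index/column pair raises ValueError.
try:
    df.pivot(index="Sex", columns="Pclass", values="Fare")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table aggregates the duplicates (here with the mean) before reshaping.
table = df.pivot_table(values="Fare", index="Sex", columns="Pclass", aggfunc="mean")
print(table)
```

This is why the practice section below runs groupby first: after grouping, each pair is unique and `pivot` is safe.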

2.2 Pivot practice

df1 = titanic.groupby(['Sex', 'Pclass']).size().reset_index(name = 'counts')
df1

	Sex	Pclass	counts
0	female	1	94
1	female	2	76
2	female	3	144
3	male	1	122
4	male	2	108
5	male	3	347

Practice with the well-known Titanic data
Number of rows for each combination of sex and passenger class

result = df1.pivot(index='Sex', columns='Pclass', values='counts')
result

Pclass	1	2	3
Sex
female	94	76	144
male	122	108	347

Applying pivot to the grouped DataFrame turns it into a table

2.3 Pivot Table practice

titanic['Counts'] = 1
titanic.tail(1)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Counts
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.75	NaN	Q	1

result = titanic.pivot_table(values='Counts', index='Sex', columns='Survived', aggfunc='sum')
result

Survived	0	1
Sex
female	81	233
male	468	109

Counts of survivors and non-survivors by sex
0 means died, 1 means survived
Arguments are given in the order dataframe.pivot_table(values, index, columns, aggfunc)

result = titanic.pivot_table(values='Counts', index='Pclass', columns='Survived', aggfunc='sum')
result

Survived	0	1
Pclass
1	80	136
2	97	87
3	372	119

Counts of survivors and non-survivors by passenger class

result['total'] = result[0] + result[1]
result

Survived	0	1	total
Pclass
1	80	136	216
2	97	87	184
3	372	119	491

Adding a total (column)
Selecting DataFrame columns and applying any arithmetic operator (addition, subtraction, division, and so on) produces a new column
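As a sketch of column arithmetic beyond addition, the snippet below rebuilds the Pclass table from the counts printed above and derives a ratio column. The `survival_rate` column name is an assumption added for illustration.

```python
import pandas as pd

# Counts copied from the Pclass x Survived output above.
result = pd.DataFrame(
    {0: [80, 97, 372], 1: [136, 87, 119]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)

# Addition of two columns gives the per-class total.
result["total"] = result[0] + result[1]

# Division works the same way; survival_rate is a hypothetical column name.
result["survival_rate"] = result[1] / result["total"]
print(result)
```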

result.loc['total'] = result.loc[1] + result.loc[2] + result.loc[3]
result

Survived	0	1	total
Pclass
1	80	136	216
2	97	87	184
3	372	119	491
total	549	342	891

Adding a total (row)
This works the same way as above, except that rows are selected with loc
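As an alternative to adding the totals by hand, `pivot_table` accepts `margins=True`, which appends both a total row and a total column in one call. The sketch below uses a hypothetical seven-row frame rather than the Titanic file; `margins_name` only sets the label.

```python
import pandas as pd

# Hypothetical mini data set reusing the Titanic column names.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 0, 1],
})
df["Counts"] = 1

# margins=True adds the row and column totals; margins_name labels them.
table = df.pivot_table(
    values="Counts",
    index="Pclass",
    columns="Survived",
    aggfunc="sum",
    margins=True,
    margins_name="total",
)
print(table)
```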
