A Recommendation System with the Good Books Data (Recommendations)

1. Good Books


1.1 Good Books data

https://www.kaggle.com/zygmunt/goodbooks-10k

  • The goodbooks-10k dataset covers 10k (10,000) books and comes as five files: ratings, books, tags, book_tags, and to_read


2. Recommendation system walkthrough


2.1 Data load

2.1.1 Books data

import numpy as np
import pandas as pd

books = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/books.csv', encoding='ISO-8859-1')
books.head()
   id  book_id  best_book_id  work_id  books_count       isbn        isbn13
0   1  2767052       2767052  2792775          272  439023483  9.780439e+12
1   2        3             3  4640799          491  439554934  9.780440e+12
2   3    41865         41865  3212258          226  316015849  9.780316e+12
3   4     2657          2657  3275794          487   61120081  9.780061e+12
4   5     4671          4671   245494         1356  743273567  9.780743e+12

                       authors  original_publication_year                            original_title  ...
0              Suzanne Collins                     2008.0                          The Hunger Games  ...
1  J.K. Rowling, Mary GrandPré                     1997.0  Harry Potter and the Philosopher's Stone  ...
2              Stephenie Meyer                     2005.0                                  Twilight  ...
3                   Harper Lee                     1960.0                     To Kill a Mockingbird  ...
4          F. Scott Fitzgerald                     1925.0                          The Great Gatsby  ...

   ratings_count  work_ratings_count  work_text_reviews_count  ratings_1  ratings_2  ratings_3  ratings_4  ratings_5
0        4780653             4942365                   155254      66715     127936     560092    1481305    2706317
1        4602479             4800065                    75867      75504     101676     455024    1156318    3011543
2        3866839             3916824                    95009     456191     436802     793319     875073    1355439
3        3198671             3340896                    72586      60427     117415     446835    1001952    1714267
4        2683664             2773745                    51992      86236     197621     606158     936012     947718

                                           image_url                                    small_image_url
0  https://images.gr-assets.com/books/1447303603m...  https://images.gr-assets.com/books/1447303603s...
1  https://images.gr-assets.com/books/1474154022m...  https://images.gr-assets.com/books/1474154022s...
2  https://images.gr-assets.com/books/1361039443m...  https://images.gr-assets.com/books/1361039443s...
3  https://images.gr-assets.com/books/1361975680m...  https://images.gr-assets.com/books/1361975680s...
4  https://images.gr-assets.com/books/1490528560m...  https://images.gr-assets.com/books/1490528560s...

5 rows × 23 columns

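  • A CSV file holding the information about each book
  • The data files in this post are read with encoding='ISO-8859-1'
  • ratings_1 through ratings_5 hold the number of 1-star through 5-star ratings each book received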


2.1.2 Ratings Data

ratings = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/ratings.csv', encoding='ISO-8859-1')
ratings.head()
   book_id  user_id  rating
0        1      314       5
1        1      439       3
2        1      588       5
3        1     1169       4
4        1     1185       4
  • The ratings data holds a book_id, a user_id, and the rating that user gave the book


2.1.3 Book tags Data load

book_tags = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/book_tags.csv', encoding='ISO-8859-1')
book_tags.head()
   goodreads_book_id  tag_id   count
0                  1   30574  167697
1                  1   11305   37174
2                  1   11557   34173
3                  1    8717   12986
4                  1   33114   12716
  • Contains the book id (goodreads_book_id) and the tag id (tag_id)


2.1.4 Tags Data load

tags = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/tags.csv')
tags.tail()
       tag_id    tag_name
34247   34247   Childrens
34248   34248   Favorites
34249   34249       Manga
34250   34250      SERIES
34251   34251  favourites
  • Maps each tag id to the tag name it stands for


2.1.5 To-read Data load

to_read = pd.read_csv('https://media.githubusercontent.com/media/hmkim312/datas/main/goodbooks-10k/to_read.csv')
to_read.head()
   user_id  book_id
0        1      112
1        1      235
2        1      533
3        1     1198
4        1     1874
  • Lists, by id, the books each user has marked as to-read (books they want to read, not books already read)


2.2 Tag data preprocessing

tags_join_Df = pd.merge(book_tags, tags, left_on='tag_id', right_on='tag_id', how = 'inner')
tags_join_Df.head()
   goodreads_book_id  tag_id   count tag_name
0                  1   30574  167697  to-read
1                  2   30574   24549  to-read
2                  3   30574  496107  to-read
3                  5   30574   11909  to-read
4                  6   30574     298  to-read
  • Merges tag_id and tag_name into the data frame that carries the book ids
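
As a minimal sketch of what the inner merge does, the toy frames below stand in for book_tags and tags. All values and the toy_* names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for book_tags and tags (hypothetical values).
toy_book_tags = pd.DataFrame({
    'goodreads_book_id': [1, 1, 2],
    'tag_id': [10, 20, 10],
    'count': [5, 3, 7],
})
toy_tags = pd.DataFrame({
    'tag_id': [10, 20, 30],
    'tag_name': ['fantasy', 'to-read', 'manga'],
})

# Both frames use the same key name, so on='tag_id' is equivalent to
# left_on='tag_id', right_on='tag_id'.
toy_joined = pd.merge(toy_book_tags, toy_tags, on='tag_id', how='inner')
print(toy_joined)
```

With how='inner', a tag that no book uses (tag_id 30 here) simply drops out of the result.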


2.3 TF-IDF on authors

books['authors'][:3]
0                 Suzanne Collins
1    J.K. Rowling, Mary GrandPré
2                 Stephenie Meyer
Name: authors, dtype: object
  • The books data has an authors column


from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=1 (the default) keeps every term, same as 0; newer scikit-learn versions reject an integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(books['authors'])
tfidf_matrix
<10000x14742 sparse matrix of type '<class 'numpy.float64'>'
	with 43235 stored elements in Compressed Sparse Row format>
  • Runs TF-IDF on the author names in books, giving one row per book (10,000) and one column per unigram or bigram (14,742)
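
To see which features a vectorizer like this produces, here is a small sketch on three made-up author strings. The toy_* names are hypothetical. It also shows one side effect of the default token pattern: tokens shorter than two characters are dropped, so initials such as "J.R.R." do not become features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy author strings (illustrative only).
toy_authors = ['J.R.R. Tolkien', 'J.R.R. Tolkien', 'Suzanne Collins']

toy_tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
toy_matrix = toy_tf.fit_transform(toy_authors)

# vocabulary_ maps each unigram/bigram to its column index.
print(sorted(toy_tf.vocabulary_))
print(toy_matrix.shape)  # (number of documents, number of features)
```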


2.4 Measuring similarity

from sklearn.metrics.pairwise import linear_kernel

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])
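  • Uses scikit-learn's linear_kernel to turn the author TF-IDF matrix into a similarity matrix. TfidfVectorizer L2-normalizes each row by default, so this dot product equals the cosine similarity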
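
The link between linear_kernel and cosine similarity can be checked on a toy input. This is a sketch under the assumption that the vectorizer keeps its default norm='l2'; the docs list is made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ['Suzanne Collins', 'Stephenie Meyer', 'Suzanne Collins']
m = TfidfVectorizer().fit_transform(docs)  # rows are L2-normalized by default

sim = linear_kernel(m, m)

# Unit-length rows: every document has similarity 1 with itself.
assert np.allclose(np.diag(sim), 1.0)
# For unit-length rows the plain dot product is the cosine similarity.
assert np.allclose(sim, cosine_similarity(m, m))
print(sim)
```

Rows 0 and 2 are identical strings, so their similarity is 1, while rows 0 and 1 share no token and score 0.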


2.5 Which books are similar to The Hobbit?

title = books['title']
indices = pd.Series(books.index, index=books['title'])
indices['The Hobbit']
6
  • The index of The Hobbit is 6
  • Let's pull row 6 and use it to find similar books


cosine_sim[indices['The Hobbit']]
array([0., 0., 0., ..., 0., 0., 0.])
  • Pulls the row of the similarity matrix at The Hobbit's index


cosine_sim[indices['The Hobbit']].shape
(10000,)
  • The row has 10,000 entries, one per book in the data


list(enumerate(cosine_sim[indices['The Hobbit']]))[:3]
[(0, 0.0), (1, 0.0), (2, 0.0)]
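  • Takes only The Hobbit's row from the similarity matrix, uses enumerate to pair each column position (the index of the other book) with its cosine similarity score as a tuple, and collects the tuples in a list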


2.6 Indices of the most similar books

sim_scores = list(enumerate(cosine_sim[indices['The Hobbit']]))
sim_scores = sorted(sim_scores, key = lambda x : x[1], reverse= True)
sim_scores[:3]
[(6, 1.0), (18, 1.0), (154, 1.0)]
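  • Sorts the (index, cosine score) pairs for The Hobbit in descending order and prints the top three; the index here is the column position, that is, the other book
  • Some books score exactly 1.0, a perfect match: indices 18 and 154
  • Note that the first entry, (6, 1.0), is The Hobbit itself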
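
One caveat about ties is worth a small sketch. Python's sorted is stable, so among books tied at 1.0 the one with the lowest index comes first. The Hobbit happens to have the lowest index among its ties, but a book with a higher index would not be first in its own list, and slicing off the first entry would then remove the wrong book. The toy row below is made up. Filtering by index instead of by position is one possible alternative, not the post's method.

```python
# Toy similarity row for the book at index 2; index 0 ties with it at 1.0.
row = [1.0, 0.0, 1.0, 0.3]
query_idx = 2

ranked = sorted(enumerate(row), key=lambda x: x[1], reverse=True)
print(ranked)

# Dropping the first entry removes index 0 and keeps the query book itself.
by_position = [i for i, _ in ranked[1:3]]
# Dropping by index always removes the query book, whatever its rank.
by_index = [i for i, _ in ranked if i != query_idx][:2]

print(by_position)
print(by_index)
```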


print('Title at index 6:', books['title'][6])
print('Title at index 18:', books['title'][18])
print('Title at index 154:', books['title'][154])
Title at index 6: The Hobbit
Title at index 18: The Fellowship of the Ring (The Lord of the Rings, #1)
Title at index 154: The Two Towers (The Lord of the Rings, #2)
  • The books closest to The Hobbit turn out to be the Lord of the Rings series


2.7 Finding similar books by author

sim_scores = sim_scores[1:11]
book_indices = [i[0] for i in sim_scores]
title.iloc[book_indices]
18      The Fellowship of the Ring (The Lord of the Ri...
154            The Two Towers (The Lord of the Rings, #2)
160     The Return of the King (The Lord of the Rings,...
188     The Lord of the Rings (The Lord of the Rings, ...
963     J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...
4975        Unfinished Tales of Númenor and Middle-Earth
2308                               The Children of Húrin
610              The Silmarillion (Middle-Earth Universe)
8271                   The Complete Guide to Middle-Earth
1128     The History of the Hobbit, Part One: Mr. Baggins
Name: title, dtype: object
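  • The rest of the list is also mostly Hobbit-related, but that is most likely because the author is the same person
  • In fact, since TF-IDF was run on author names only, every book whose authors string is identical gets the same score (1)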


2.8 Adding tags

books_with_tags = pd.merge(books, tags_join_Df, left_on= 'book_id', right_on='goodreads_book_id', how = 'inner')
books_with_tags.head()
   id  book_id  best_book_id  work_id  books_count       isbn        isbn13
0   1  2767052       2767052  2792775          272  439023483  9.780439e+12
1   1  2767052       2767052  2792775          272  439023483  9.780439e+12
2   1  2767052       2767052  2792775          272  439023483  9.780439e+12
3   1  2767052       2767052  2792775          272  439023483  9.780439e+12
4   1  2767052       2767052  2792775          272  439023483  9.780439e+12

           authors  original_publication_year    original_title  ...
0  Suzanne Collins                     2008.0  The Hunger Games  ...
1  Suzanne Collins                     2008.0  The Hunger Games  ...
2  Suzanne Collins                     2008.0  The Hunger Games  ...
3  Suzanne Collins                     2008.0  The Hunger Games  ...
4  Suzanne Collins                     2008.0  The Hunger Games  ...

   ratings_2  ratings_3  ratings_4  ratings_5
0     127936     560092    1481305    2706317
1     127936     560092    1481305    2706317
2     127936     560092    1481305    2706317
3     127936     560092    1481305    2706317
4     127936     560092    1481305    2706317

                                           image_url                                    small_image_url
0  https://images.gr-assets.com/books/1447303603m...  https://images.gr-assets.com/books/1447303603s...
1  https://images.gr-assets.com/books/1447303603m...  https://images.gr-assets.com/books/1447303603s...
2  https://images.gr-assets.com/books/1447303603m...  https://images.gr-assets.com/books/1447303603s...
3  https://images.gr-assets.com/books/1447303603m...  https://images.gr-assets.com/books/1447303603s...
4  https://images.gr-assets.com/books/1447303603m...  https://images.gr-assets.com/books/1447303603s...

   goodreads_book_id  tag_id  count           tag_name
0            2767052   30574  11314            to-read
1            2767052   11305  10836            fantasy
2            2767052   11557  50755          favorites
3            2767052    8717  35418  currently-reading
4            2767052   33114  25968        young-adult

5 rows × 27 columns

  • Merges the tag ids and tag names prepared earlier into the books data frame; the result has one row per (book, tag) pair


2.9 TF-IDF on tags

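# min_df=1 (the default) keeps every term, same as 0; newer scikit-learn versions reject an integer 0
tf_tag = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=1, stop_words='english')
# Note: books_with_tags has one row per (book, tag) pair, so head(10000) is 10,000 tag rows, not 10,000 books
tfidf_matrix_tag = tf_tag.fit_transform(books_with_tags['tag_name'].head(10000))
cosine_sim_tag = linear_kernel(tfidf_matrix_tag, tfidf_matrix_tag)
  • The earlier TF-IDF used author names; this one uses tags
  • Caveat: each row of books_with_tags is a single (book, tag) pair, so the rows of cosine_sim_tag do not line up with row positions in books. Joining the tags per book first (section 2.12 onward) avoids this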
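
The shape of the merged frame is easy to check on a toy example. The sketch below uses made-up frames (the toy_* names are hypothetical) to show that after the merge a row position refers to a (book, tag) pair rather than to a book.

```python
import pandas as pd

toy_books = pd.DataFrame({'book_id': [1, 2], 'title': ['A', 'B']})
toy_book_tags = pd.DataFrame({
    'goodreads_book_id': [1, 1, 1, 2],
    'tag_name': ['fantasy', 'classics', 'to-read', 'fantasy'],
})

toy_merged = pd.merge(toy_books, toy_book_tags,
                      left_on='book_id', right_on='goodreads_book_id', how='inner')

# Two books, but four (book, tag) rows after the merge.
print(len(toy_books), len(toy_merged))
# The first three rows all belong to book 'A'.
print(toy_merged['title'].head(3).tolist())
```

A similarity matrix built from toy_merged would therefore be indexed by tag rows, and looking up a position taken from toy_books in it would compare the wrong things.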


2.10 A function that returns recommended books

title_tag = books['title']
indices_tag = pd.Series(books.index, index=books['title'])


def tags_recommendations(title):
    idx = indices_tag[title]
    sim_scores = list(enumerate(cosine_sim_tag[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    book_indices = [i[0] for i in sim_scores]
    return title_tag.iloc[book_indices]
  • This time we write a function that takes a book title and returns recommended books
  • sim_scores = sim_scores[1:11] keeps 10 entries; it starts at 1 because entry 0 is expected to be the input title itself
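
For comparison, here is a slightly more defensive variant of such a function, written as a sketch. The name recommend, its parameters, and the toy data are all hypothetical. It raises a clear error for an unknown title, uses the first match when a title is duplicated, and removes the query book by index rather than by position.

```python
import numpy as np
import pandas as pd


def recommend(title, titles, sim_matrix, top_n=10):
    """Return the top_n titles most similar to `title` (hypothetical helper)."""
    matches = np.flatnonzero(titles.to_numpy() == title)
    if len(matches) == 0:
        raise KeyError(f'title not found: {title!r}')
    idx = matches[0]  # if a title appears more than once, use the first row
    # Stable descending order of similarity scores.
    order = np.argsort(-sim_matrix[idx], kind='stable')
    # Remove the query book itself, then keep the best top_n.
    order = order[order != idx][:top_n]
    return titles.iloc[order]


# Toy data: four books and a made-up symmetric similarity matrix.
toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_sim = np.array([
    [1.0, 0.1, 0.8, 0.3],
    [0.1, 1.0, 0.0, 0.5],
    [0.8, 0.0, 1.0, 0.2],
    [0.3, 0.5, 0.2, 1.0],
])
print(recommend('A', toy_titles, toy_sim, top_n=2).tolist())
```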


2.11 Books similar to The Hobbit, by tag

tags_recommendations('The Hobbit').head(20)
16             Catching Fire (The Hunger Games, #2)
31                                  Of Mice and Men
107    Confessions of a Shopaholic (Shopaholic, #1)
125                       Dune (Dune Chronicles #1)
149                                    The Red Tent
206          One for the Money (Stephanie Plum, #1)
214                                Ready Player One
231             The Gunslinger (The Dark Tower, #1)
253          Shiver (The Wolves of Mercy Falls, #1)
313                         Inkheart (Inkworld, #1)
Name: title, dtype: object
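  • The Hunger Games sequel, Dune, and similar titles show up, which at first glance looks like fantasy close to The Hobbit. The list also contains Of Mice and Men and Confessions of a Shopaholic, though, and because cosine_sim_tag was built from (book, tag) rows rather than one row per book, these matches should not be read as genuine tag similarity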


2.12 Joining all tag names per book id

temp_df = books_with_tags.groupby('book_id')['tag_name'].apply(' '.join).reset_index()
temp_df.head()
   book_id                                           tag_name
0        1  to-read fantasy favorites currently-reading yo...
1        2  to-read fantasy favorites currently-reading yo...
2        3  to-read fantasy favorites currently-reading yo...
3        5  to-read fantasy favorites currently-reading yo...
4        6  to-read fantasy young-adult fiction harry-pott...
  • All tag names belonging to each book id are gathered into a single string
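
The same groupby-and-join step on a three-row toy frame (made-up values, hypothetical names) makes the reshaping easier to see: many tag rows go in, one row per book comes out.

```python
import pandas as pd

toy = pd.DataFrame({
    'book_id': [1, 1, 2],
    'tag_name': ['fantasy', 'to-read', 'classics'],
})

# ' '.join receives all tag_name values of one book_id and glues them together.
per_book = toy.groupby('book_id')['tag_name'].apply(' '.join).reset_index()
print(per_book)
```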


2.13 Merging into books

books = pd.merge(books, temp_df, on = 'book_id', how = 'inner')
books.head()
   id  book_id  best_book_id  work_id  books_count       isbn        isbn13
0   1  2767052       2767052  2792775          272  439023483  9.780439e+12
1   2        3             3  4640799          491  439554934  9.780440e+12
2   3    41865         41865  3212258          226  316015849  9.780316e+12
3   4     2657          2657  3275794          487   61120081  9.780061e+12
4   5     4671          4671   245494         1356  743273567  9.780743e+12

                       authors  original_publication_year                            original_title  ...
0              Suzanne Collins                     2008.0                          The Hunger Games  ...
1  J.K. Rowling, Mary GrandPré                     1997.0  Harry Potter and the Philosopher's Stone  ...
2              Stephenie Meyer                     2005.0                                  Twilight  ...
3                   Harper Lee                     1960.0                     To Kill a Mockingbird  ...
4          F. Scott Fitzgerald                     1925.0                          The Great Gatsby  ...

   work_ratings_count  work_text_reviews_count  ratings_1  ratings_2  ratings_3  ratings_4  ratings_5
0             4942365                   155254      66715     127936     560092    1481305    2706317
1             4800065                    75867      75504     101676     455024    1156318    3011543
2             3916824                    95009     456191     436802     793319     875073    1355439
3             3340896                    72586      60427     117415     446835    1001952    1714267
4             2773745                    51992      86236     197621     606158     936012     947718

                                           image_url                                    small_image_url
0  https://images.gr-assets.com/books/1447303603m...  https://images.gr-assets.com/books/1447303603s...
1  https://images.gr-assets.com/books/1474154022m...  https://images.gr-assets.com/books/1474154022s...
2  https://images.gr-assets.com/books/1361039443m...  https://images.gr-assets.com/books/1361039443s...
3  https://images.gr-assets.com/books/1361975680m...  https://images.gr-assets.com/books/1361975680s...
4  https://images.gr-assets.com/books/1490528560m...  https://images.gr-assets.com/books/1490528560s...

                                            tag_name
0  to-read fantasy favorites currently-reading yo...
1  to-read fantasy favorites currently-reading yo...
2  to-read fantasy favorites currently-reading yo...
3  to-read favorites currently-reading young-adul...
4  to-read favorites currently-reading young-adul...

5 rows × 24 columns

  • This time each book has all of its tag names together in a single column, one row per book


2.14 Combining authors and tag names

books['corpus'] = (pd.Series(books[['authors', 'tag_name']]
                            .fillna('')
                            .values.tolist()
                           ).str.join(' '))
books['corpus'][:3]
0    Suzanne Collins to-read fantasy favorites curr...
1    J.K. Rowling, Mary GrandPré to-read fantasy f...
2    Stephenie Meyer to-read fantasy favorites curr...
Name: corpus, dtype: object
  • The corpus column now holds the authors and the tags together in one string
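
The same column can also be built with plain string concatenation, which some readers may find easier to follow. This is only an equivalent sketch on a toy frame (made-up values), not a change to the approach.

```python
import pandas as pd

toy = pd.DataFrame({
    'authors': ['Harper Lee', None],
    'tag_name': ['classics fiction', 'fantasy'],
})

# fillna('') keeps a missing author from turning the whole string into NaN;
# str.strip() removes the leading space left when the author is missing.
toy['corpus'] = (toy['authors'].fillna('') + ' ' + toy['tag_name'].fillna('')).str.strip()
print(toy['corpus'].tolist())
```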


2.15 Running TF-IDF

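# min_df=1 (the default) keeps every term, same as 0; newer scikit-learn versions reject an integer 0
tf_corpus = TfidfVectorizer(analyzer='word', ngram_range=(1,2), min_df=1, stop_words='english')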
tfidf_matrix_corpus = tf_corpus.fit_transform(books['corpus'])
cosine_sim_corpus = linear_kernel(tfidf_matrix_corpus, tfidf_matrix_corpus)
titles = books['title']
indices = pd.Series(books.index, index=books['title'])
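  • Runs TF-IDF on the combined authors and tag names; books now has one row per book, so the rows of cosine_sim_corpus match the positions in indices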
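
One detail of tokenizing tag strings: the default token pattern splits on hyphens, and "to" is an English stop word, so a tag such as to-read no longer exists as a single feature. The sketch below compares the default analyzer with a custom token_pattern that keeps hyphenated tags whole. The pattern is an assumption chosen for illustration, and whether it improves recommendations is untested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sample = 'to-read young-adult'

# Default pattern: hyphens act as separators, then stop words are removed.
default_tokens = TfidfVectorizer(stop_words='english').build_analyzer()(sample)

# Hypothetical alternative: allow hyphens inside a token.
hyphen_tokens = TfidfVectorizer(
    stop_words='english',
    token_pattern=r'(?u)\b\w[\w-]*\w\b',
).build_analyzer()(sample)

print(default_tokens)
print(hyphen_tokens)
```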


2.16 Writing the recommendation function

def corpus_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim_corpus[idx]))
    sim_scores = sorted(sim_scores, key = lambda x : x[1], reverse=True)
    sim_scores = sim_scores[1:11]
    book_indices = [i[0] for i in sim_scores]
    return titles.iloc[book_indices]


2.17 Which books are similar?

corpus_recommendations('The Hobbit')
188     The Lord of the Rings (The Lord of the Rings, ...
154            The Two Towers (The Lord of the Rings, #2)
160     The Return of the King (The Lord of the Rings,...
18      The Fellowship of the Ring (The Lord of the Ri...
610              The Silmarillion (Middle-Earth Universe)
4975        Unfinished Tales of Númenor and Middle-Earth
2308                               The Children of Húrin
963     J.R.R. Tolkien 4-Book Boxed Set: The Hobbit an...
465                             The Hobbit: Graphic Novel
8271                   The Complete Guide to Middle-Earth
Name: title, dtype: object
  • The books similar to The Hobbit now look reasonable


corpus_recommendations('Twilight (Twilight, #1)')
51                                 Eclipse (Twilight, #3)
48                                New Moon (Twilight, #2)
991                    The Twilight Saga (Twilight, #1-4)
833                         Midnight Sun (Twilight, #1.5)
731     The Short Second Life of Bree Tanner: An Eclip...
1618    The Twilight Saga Complete Collection  (Twilig...
4087    The Twilight Saga: The Official Illustrated Gu...
2020             The Twilight Collection (Twilight, #1-3)
72                                The Host (The Host, #1)
219     Twilight: The Complete Illustrated Movie Compa...
Name: title, dtype: object
  • Books similar to Twilight


corpus_recommendations('Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)')
1       Harry Potter and the Sorcerer's Stone (Harry P...
26      Harry Potter and the Half-Blood Prince (Harry ...
22      Harry Potter and the Chamber of Secrets (Harry...
24      Harry Potter and the Deathly Hallows (Harry Po...
23      Harry Potter and the Goblet of Fire (Harry Pot...
20      Harry Potter and the Order of the Phoenix (Har...
3752         Harry Potter Collection (Harry Potter, #1-6)
398                          The Tales of Beedle the Bard
1285                           Quidditch Through the Ages
421              Harry Potter Boxset (Harry Potter, #1-7)
Name: title, dtype: object
  • Books similar to Harry Potter


corpus_recommendations('Romeo and Juliet')
352                      Othello
769                Julius Caesar
124                       Hamlet
153                      Macbeth
247    A Midsummer Night's Dream
838       The Merchant of Venice
854                Twelfth Night
529       Much Ado About Nothing
713                    King Lear
772      The Taming of the Shrew
Name: title, dtype: object
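  • Books similar to Romeo and Juliet


3. Summary


3.1 Summary

  • This was a recommendation system built on book data with TF-IDF. Using only the authors or only the tags would mostly have recommended books by the same author or with the same tags
  • Putting both into a single column before running TF-IDF gave somewhat different results, but I am not sure whether this is the right approach or whether there are better methods
  • Recommendation systems seem hard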
This post is licensed under CC BY 4.0 by the author.