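Everytime WordCloud

Everytime (에브리타임) is a community and timetable platform for university students.

Many university students use it, so let's find out which words appear there most often.

I used Python 3 in a Jupyter notebook.

1. First, install the required modules.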

pip install wordcloud
 
pip install krwordrank 

2. To use selenium, install selenium and a webdriver.

2-1 Install selenium

pip install selenium

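(BeautifulSoup must already be installed.)

2-2 Install the webdriver

Chrome webdriver

Remember where you save the webdriver, because the script needs its path!

3. To log in, get the XPath of the ID field, the password field, the login button, and the free board (자유게시판) button.

In Chrome, open More tools -> Developer tools, click the element-picker icon, and then click the part of the page you want. The tag for that part is shown, and you can copy its XPath as in the screenshot.

<ID input field> <Password input field>

<Login button> <Free board button>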
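
To make the idea of an XPath concrete, here is a minimal sketch that runs XPath-style lookups against a small, hypothetical login form using only the standard library. The markup and the attribute names (`userid`, `password`) are assumptions for illustration, not the real Everytime page, and the XPath that Chrome copies for you will usually look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified login form. The real Everytime markup may differ.
SAMPLE_FORM = """
<form>
  <p><input type="text" name="userid" /></p>
  <p><input type="password" name="password" /></p>
  <p><input type="submit" value="login" /></p>
</form>
"""

root = ET.fromstring(SAMPLE_FORM)

# A positional path: "the input inside the first / second <p>".
id_input = root.find("./p[1]/input")
password_input = root.find("./p[2]/input")

# An attribute-based path: "any input whose type is submit".
login_button = root.find(".//input[@type='submit']")

print(id_input.attrib)
print(password_input.attrib)
print(login_button.attrib)
```

ElementTree only supports a subset of XPath, so this shows how such paths select elements rather than replacing the selenium lookups used in the full code.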

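4. Check the tag and class of the posts on the free board.

Scrape the href (each post's URL) inside that tag, append it to the everytime_link list, and save the list to a txt file.

Then move on to the next page. The part to change is the page number at the end of https://everytime.kr/370451/p/2. (You can confirm this in the picture below.)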
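
The page URL pattern can be sketched as a small helper. The function name `board_page_url` is hypothetical; the URL shape and the board number 370451 come from the example above.

```python
BASE_URL = "https://everytime.kr"


def board_page_url(board_id, page_number):
    """Build the URL of one page of a board (hypothetical helper)."""
    return f"{BASE_URL}/{board_id}/p/{page_number}"


# 370451 is the free board number used as the example in this post.
for page_number in range(1, 4):
    print(board_page_url(370451, page_number))
```

Only the last path segment changes from page to page, which is why the full code simply increments a page counter.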

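5. Read the URLs from everytime_link.txt one at a time and crawl each post's title, content, time, and replies.

Find the tag and class name of the title, content, time, and replies in the same way as above.

<Post title> <Post content>

<Post date> <Post replies>

Save these in JSON format.

Then close the driver.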
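
As a minimal sketch of the save step, the snippet below writes one post to a JSON file and reads it back. The sample post is made up, and the helper names `save_post` and `load_post` are hypothetical. Passing `ensure_ascii=False` together with `encoding="utf-8"` is one option for keeping Korean text readable in the saved file.

```python
import json
import os
import tempfile


def save_post(post, path):
    """Write one post dictionary to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(post, f, ensure_ascii=False)


def load_post(path):
    """Read one post dictionary back from a JSON file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up sample data with the same kinds of fields the crawler collects.
sample_post = {
    "title": "시험 기간",
    "text": "도서관 자리 있나요?",
    "comment_text": ["없어요", "3층에 있어요"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text0.txt")
    save_post(sample_post, path)
    restored = load_post(path)

print(restored == sample_post)
```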

▼ Full code

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from collections import defaultdict
import time
import datetime
import os
import json
from krwordrank.word import KRWordRank
from wordcloud import WordCloud


def tree():
    return defaultdict(tree)

# Selenium 4 style: the chromedriver path goes through Service, and elements are located with By.XPATH
driver = webdriver.Chrome(service=Service('path/to/chromedriver'))
driver.implicitly_wait(10)
driver.get('https://everytime.kr/login')

# Log in
driver.find_element(By.XPATH, 'id input XPath').send_keys('your id')               # ID
driver.find_element(By.XPATH, 'password input XPath').send_keys('your password')   # password
driver.find_element(By.XPATH, 'login button XPath').click()                        # login button
time.sleep(2)
driver.find_element(By.XPATH, 'free board button XPath').click()                   # open the free board

everytime_link = list()    # post links
fail_link = list()         # links that failed
page_number = 2            # page 1 is already loaded on entry, so navigation starts from page 2
content_number = 0


if not os.path.isdir("./Result/"):
    os.mkdir("./Result/")

for i in range(1, 3):
    time.sleep(5)
    html = driver.page_source # the page currently shown (page 1 on the first pass)
    soup = BeautifulSoup(html, 'html.parser')

    content = soup.findAll('article')

    for url in content:
        find_url = url.find('a', attrs={'class': 'article'}).get('href')
        everytime_link.append(find_url)
    time.sleep(2)
    # Replace BOARD_ID with the board's own number, e.g. 370451 for the free board
    driver.get('https://everytime.kr/BOARD_ID/p/' + str(page_number))

    page_number = page_number + 1 # move to the next page (20 posts per page)


with open('./everytime_link.txt', 'w', encoding='utf-8') as fileobject:    # save each post link
    for join_link in everytime_link:
        fileobject.write(join_link)
        fileobject.write('\n')



for url in everytime_link:

    time_now = datetime.datetime.now()      # current time
    json_data = dict()
    comment_text = list()
    comment_time = list()
    json_data['comment_text'] = list()
    json_data['comment_time'] = list()

    try:
        driver.get('https://everytime.kr' + url)
        time.sleep(5)

        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')

        title = soup.find('h2', attrs={'class': 'large'}).get_text()        # title
        text = soup.find('p', attrs={'class': 'large'}).get_text()          # content
        text_time = soup.find('time', attrs={'class': 'large'}).get_text()  # date
        try:
            comment = soup.findAll('article')

            for content in comment:
                comment_text.append(content.find('p').get_text())
                comment_time.append(content.find('time').get_text())
        except AttributeError:
            pass                   # no comments


        json_data['title'] = title
        json_data['text'] = text
        json_data['text_time'] = text_time
        json_data['now_time'] = str(time_now)

        json_data['comment_text'] = comment_text
        json_data['comment_time'] = comment_time

    except Exception as e:
        print(e)
        fail_link.append(url)
        continue

    # Write into ./Result/, the directory created above and read back later
    with open('./Result/' + 'text' + str(content_number) + '.txt', 'w', encoding='utf-8') as fileobject:
        json.dump(json_data, fileobject)
        content_number = content_number + 1


with open('./fail_url2.txt', 'w', encoding='utf-8') as fileobject:
    for join_link in fail_link:
        fileobject.write(join_link)
        fileobject.write('\n')

driver.quit()    # close the browser and end the webdriver session

After writing one function that lists the file paths in a directory and another that loads those files as JSON, printing just the first entry gives the result below.

def search(dirname):        # list the file paths in a directory, e.g. ./Result/file.txt
    file_name_list = list()
    filenames = os.listdir(dirname)
    for filename in filenames:
        full_filename = os.path.join(dirname, filename)
        file_name_list.append(full_filename)
    return file_name_list


def file_read(file_name_list):  # load the file at each path as JSON
    data = list()
    for file_path in file_name_list:
        with open(file_path, 'r', encoding='utf-8') as file_point:
            data.append(json.load(file_point))
    return data
 

file_list = search('./Result/')
data = file_read(file_list)
 
print(file_list)
print(data[0].keys())
print(data[0]['title'])
print(data[0]['text'])

▶︎ Result
The list of saved files is printed, followed by the first file's keys, title, and content.


Collect each post's title, content, and replies to build input_text.
wordrank_extractor = KRWordRank(
    min_count = 10, # minimum number of occurrences of a word (when building the graph)
    max_length = 15, # maximum word length
    verbose = True
    )

beta = 0.85    # PageRank decaying factor beta
max_iter = 10


input_text = str()
text = list()
for content in data:
    text.append(content['text'])
    text.append(content['title'])

    for comment in content['comment_text']:
        text.append(comment)

input_text = ' '.join(text)
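
The comment in the code above calls beta the PageRank decaying factor. To show roughly what beta and max_iter control, here is a minimal, generic PageRank sketch on a toy graph. It is not KRWordRank's actual implementation; the graph, the words in it, and the function name `pagerank` are assumptions for illustration.

```python
def pagerank(graph, beta=0.85, max_iter=10):
    """Run a fixed number of PageRank iterations on {node: [linked nodes]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform score

    for _ in range(max_iter):
        # Every node keeps a small base score of (1 - beta) / n ...
        new_rank = {node: (1.0 - beta) / n for node in nodes}
        # ... and passes a beta-scaled share of its score along its links.
        for node, neighbors in graph.items():
            if not neighbors:
                continue  # the toy graph below has no such node
            share = beta * rank[node] / len(neighbors)
            for neighbor in neighbors:
                new_rank[neighbor] += share
        rank = new_rank
    return rank


# Toy graph: "exam" is linked from both other words, so it should rank highest.
toy_graph = {
    "exam": ["study", "grade"],
    "study": ["exam"],
    "grade": ["exam"],
}

ranks = pagerank(toy_graph, beta=0.85, max_iter=10)
for word, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(word, round(score, 4))
```

A larger beta lets scores flow further along the links, and max_iter caps how many rounds of passing scores are run.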

print(input_text)
type(input_text)

▶︎ Result
The full combined text and its type are printed.



Show the top 60 words in order of rank.

keywords = None
rank = ''
graph = None
keywords, rank, graph = wordrank_extractor.extract(text, beta, max_iter)

for word, r in sorted(keywords.items(), key=lambda x:x[1], reverse=True)[:60]:
    print('%8s:\t%.4f' % (word, r))
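
The same top-N pattern can be tried without running the crawler. This is a minimal sketch with made-up scores; in the real notebook the dictionary comes from wordrank_extractor.extract(), and the helper name `top_keywords` is hypothetical.

```python
# Made-up keyword scores for illustration only.
sample_keywords = {"exam": 4.2, "lecture": 7.9, "grade": 3.1, "professor": 5.6}


def top_keywords(keywords, n):
    """Return the n highest-scoring (word, score) pairs, best first."""
    return sorted(keywords.items(), key=lambda item: item[1], reverse=True)[:n]


for word, score in top_keywords(sample_keywords, 3):
    print('%8s:\t%.4f' % (word, score))
```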




print(keywords)
print(type(keywords))

wordcloud = WordCloud(
    font_path = './NanumGothic.ttf',
    width = 1200,
    height = 1200,
    background_color="white"
)

# Remove noise tokens; the None default avoids a KeyError when a token is absent
keywords.pop('삭제된', None)
keywords.pop('댓글입니다.', None)
keywords.pop('처리중입니다', None)

#wordcloud = wordcloud.generate_from_text(text)
wordcloud = wordcloud.generate_from_frequencies(keywords)
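
The cleanup step can also be written as a filter over a set of unwanted tokens, which leaves the original dictionary untouched. This is a minimal sketch with made-up scores; the names `UNWANTED` and `drop_unwanted` are hypothetical, and the three tokens are the ones removed in the code above.

```python
# Tokens treated as noise in this post.
UNWANTED = {'삭제된', '댓글입니다.', '처리중입니다'}


def drop_unwanted(keywords, unwanted=UNWANTED):
    """Return a copy of keywords without the unwanted tokens."""
    return {word: score for word, score in keywords.items() if word not in unwanted}


# Made-up scores for illustration only.
sample_keywords = {'삭제된': 9.0, 'exam': 4.2, 'lecture': 7.9}
cleaned = drop_unwanted(sample_keywords)
print(cleaned)
```

Adding another noise token then only means extending the set.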

# Excerpt of two WordCloud methods from the wordcloud library, shown for reference
def __array__(self):
    """Convert to numpy array.
    Returns
    -------
    image : nd-array size (width, height, 3)
        Word cloud image as numpy matrix.
    """
    return self.to_array()
 
def to_array(self):
    """Convert to numpy array.
    Returns
    -------
    image : nd-array size (width, height, 3)
        Word cloud image as numpy matrix.
    """
    return np.array(self.to_image())

array = wordcloud.to_array()
print(type(array)) # numpy.ndarray
print(array.shape) # (1200, 1200, 3) for the 1200 x 1200 cloud configured above

import matplotlib.pyplot as plt
 
fig = plt.figure(figsize=(10, 10))
plt.imshow(array, interpolation="bilinear")
plt.axis("off")
plt.show()
fig.savefig('자유게시판.png')

Once all of this is done, the word cloud is saved as shown below.

이정민
