Python Crawling

Aug 17, 2019

Web Crawling

Web crawling is the task of collecting information that exists on the web. There are several ways to do it:

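Call an open API and use only the fields you need from the data it returns
Fetch the HTML source and extract the information you want from it
Drive a browser programmatically and read the information you want from the rendered page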
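To make the first approach concrete, here is a minimal sketch of keeping only the needed fields from an API response. The JSON payload, its field names, and the helper extract_titles are all hypothetical and stand in for whatever a real open API would return.

```python
import json

# Hypothetical response body; a real open API defines its own schema.
SAMPLE_RESPONSE = """
{
  "items": [
    {"title": "First post", "views": 120, "author": "kim"},
    {"title": "Second post", "views": 45, "author": "lee"}
  ]
}
"""


def extract_titles(body):
    """Parse a JSON body and keep only the title of each item."""
    data = json.loads(body)
    return [item["title"] for item in data["items"]]


titles = extract_titles(SAMPLE_RESPONSE)
print(titles)
```

With a live API, the string passed to extract_titles would typically be the text of the HTTP response instead of a hard-coded sample.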

Here we will use requests and beautifulsoup to crawl a web page.

First, install requests and beautifulsoup:

$ pip install requests beautifulsoup4

requests fetches the HTML source from a URL.
beautifulsoup4 is a library that parses HTML tags and provides functions for extracting only the data you need.

Let's start by fetching the HTML source of Google.

crawling.py

import requests


def crawler():
    url = 'https://www.google.com'
    # requests.get returns a Response object
    response = requests.get(url)
    # .text is the response body decoded as a string (the HTML source)
    print(response.text)


crawler()

Running the file above prints the raw HTML source of the page.
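The fetch step can be made a little more defensive. The sketch below is one possible variant, not part of the original script: the function name fetch_html and the 10-second timeout are assumptions chosen for illustration. It uses the timeout parameter of requests.get and Response.raise_for_status() so that a slow server or an HTTP error status does not go unnoticed.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML source of url, raising on network or HTTP errors."""
    # timeout (seconds) keeps the call from hanging forever on a slow server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://www.google.com') would return the same text that the script above prints, as long as the request succeeds.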

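There are two ways to parse the HTML source with beautifulsoup and extract the data you want:

Use find to extract the content of a given tag
Use select to extract the content matched by a given CSS selector

Here we will use find to extract the content of a tag. Let's fetch the meta tags of google.com.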
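To compare the two styles side by side, here is a small sketch that runs both on an inline HTML string, so it needs no network access. The markup, the id news, and the class headline are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup used only to compare the two lookup styles
SAMPLE_HTML = """
<html>
  <head><title>Sample</title></head>
  <body>
    <div id="news">
      <a class="headline" href="/a">First</a>
      <a class="headline" href="/b">Second</a>
      <a href="/c">Other</a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first match; find_all returns every match.
# Both filter by tag name and attributes.
first_link = soup.find("a")
headlines_by_find = soup.find_all("a", class_="headline")

# select takes a CSS selector and returns every match
headlines_by_select = soup.select("div#news a.headline")

find_texts = [a.get_text() for a in headlines_by_find]
select_texts = [a.get_text() for a in headlines_by_select]
print(first_link["href"], find_texts, select_texts)
```

Both lookups return the same two headline links here; select tends to be more compact when the target is easiest to describe by its position in the page structure.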

crawling.py

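import requests
from bs4 import BeautifulSoup


def crawler():
    url = 'https://www.google.com'
    response = requests.get(url)
    # BeautifulSoup expects markup text, so pass response.text, not the Response object
    soup = BeautifulSoup(response.text, 'html.parser')
    metas = soup.head.find_all('meta')

    # Print the content attribute of each meta tag (None if the attribute is absent)
    for meta in metas:
        print(meta.get('content'))


crawler()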

Running this file prints the content of Google's meta tags.
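Not every meta tag carries a content attribute (for example, a charset declaration), and for those tags meta.get('content') returns None. The sketch below shows one way to skip such tags and pair each value with its name or property. The sample markup and the helper meta_contents are hypothetical, and the example works on an inline string rather than a live page.

```python
from bs4 import BeautifulSoup

# Made-up <head> section used instead of a live page
SAMPLE_HEAD = """
<html><head>
  <meta charset="utf-8">
  <meta name="description" content="A sample page">
  <meta property="og:title" content="Sample">
</head><body></body></html>
"""


def meta_contents(html_text):
    """Map each meta tag's name (or property) to its content value."""
    soup = BeautifulSoup(html_text, "html.parser")
    result = {}
    for meta in soup.find_all("meta"):
        content = meta.get("content")
        if content is None:
            continue  # e.g. <meta charset="utf-8"> has no content attribute
        key = meta.get("name") or meta.get("property")
        if key:
            result[key] = content
    return result


metas = meta_contents(SAMPLE_HEAD)
print(metas)
```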

Written by 홍찬기
