Python - Scraping Information from the Web

BaekNohing 2022. 5. 30. 18:47

DoCwalTest.cs

I needed to scrape some information from the web for a side project, so I put together something simple.

It stores the result of the request with the urllib.request module, then parses the HTML with BeautifulSoup.
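
The fetch-then-parse flow can be sketched on its own as below. This is a minimal sketch, not the exact code used in this post: the helper names fetch_html and parse_html, the User-Agent string, and the 10 second timeout are all assumptions made for illustration. The parsing half is run on an inline HTML string so it can be tried without touching the network.

```python
import urllib.request
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    # Hypothetical User-Agent value; some sites reject urllib's default one.
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (example-scraper)'})
    try:
        # The with-block closes the connection once the body has been read.
        with urllib.request.urlopen(req, timeout=timeout) as conn:
            return conn.read()
    except OSError as e:
        # urllib.error.URLError and socket timeouts are both OSError subclasses.
        print('request failed:', url, e)
        return None


def parse_html(html):
    """Parse bytes or str with Python's built-in HTML parser."""
    return BeautifulSoup(html, 'html.parser')


# Parsing demo on a literal document, so no request is made here.
sample = b'<html><head><title>demo</title></head><body></body></html>'
soup = parse_html(sample)
print(soup.title.string)  # prints "demo"
```

For a real page, the two helpers would be chained as parse_html(fetch_html(url)), after checking that fetch_html did not return None.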

I only needed the text inside the archive, so:

    arta = soup.find_all('div', class_='*** class name ***')

    filteredText = url + '\n'
    for art in arta:
        filteredText += art.text.strip()

I find the divs I want in soup (soup is the result of parsing with BeautifulSoup), then keep only their text via text.strip().

This way .text drops unneeded markup such as <div> </div>, and strip() trims the surrounding whitespace.
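
To see what find_all plus text.strip() actually keeps, here is a small sketch run on an inline HTML string. The sample markup and the helper name extract_text are made up for illustration; the class name archives-description is the one used in the full script further down.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: two matching divs and one that should be ignored.
html = """
<div class="archives-description">
  <p>First paragraph.</p>
</div>
<div class="sidebar">ignore me</div>
<div class="archives-description">  Second block.  </div>
"""


def extract_text(html, class_name):
    """Return the stripped text of every div that carries class_name."""
    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.find_all('div', class_=class_name)
    # .text concatenates all nested text nodes; strip() trims the outer whitespace.
    return [block.text.strip() for block in blocks]


texts = extract_text(html, 'archives-description')
print(texts)  # ['First paragraph.', 'Second block.']
```

Returning a list leaves the choice of separator to the caller, for example '\n'.join(texts), whereas the loop above appends each block directly to one string.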

This was my first time using it, but Python really does seem convenient for quickly knocking out a simple module.

For the crawl code I got help from here: https://crazyj.tistory.com/190 [크레이지J의 탐구생활:티스토리]

import urllib.request
from bs4 import BeautifulSoup
import os
import time

def get_index():
    url = '*** target Url ***'
    conn = urllib.request.urlopen(url)
    soup = BeautifulSoup(conn, 'html.parser')
    arta = soup.find_all('li', class_='px-1')
    index = ""
    for art in arta:
        index += art.find('a')["href"] + '\n'
    print(index)
    with open('cwals/index/index.txt', 'wt', encoding='utf8') as f:
        f.write(str(index))
    return index

def get_subIndex(url):
    conn = urllib.request.urlopen('*** target Url ***' + url)
    soup = BeautifulSoup(conn, 'html.parser')
    arta = soup.find_all('li', class_='py-1')
    subIndex = ""
    for art in arta:
        subIndex += art.find('a')["href"] + '\n'
    print(subIndex)
    with open('cwals/index/' + url + '.txt', 'wt', encoding='utf8') as f:
        f.write(str(subIndex))
    return subIndex

def get_text_from_index():
    listdir = os.listdir('cwals/index/archives')
    for i in listdir:
        if i.endswith('.txt'):
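            # Read the URL list inside a with-block so the file handle is closed.
            with open('cwals/index/archives/' + i, 'r', encoding='utf8') as f:
                urls = f.read().splitlines()
            for url in urls:
                try:
                    get_text(url)
                except Exception as e:
                    # Report the failing URL and the reason, then keep going.
                    print("error :", url, e)
                # Wait one second between requests to go easy on the server.
                time.sleep(1)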

def get_text(url):
    fname = 'cwals/results' + url + '.txt'
    # Skip pages that were already saved in an earlier run.
    if os.path.isfile(fname):
        return
    print(fname)
    conn = urllib.request.urlopen('https://neolook.com' + url)
    soup = BeautifulSoup(conn, "html.parser")
    arta = soup.find_all('div', class_='archives-description')

    filteredText = url + '\n'
    for art in arta:
        filteredText += art.text.strip()
    filteredText = filteredText.replace(',', '')
    filteredText = filteredText.replace('\n\n\n', ',    ')

    with open(fname, 'wt', encoding='utf8') as f:
        f.write(str(filteredText))

text = input('keyword: ').lower()
if text == 'index':
    get_index()
elif text == 'url':
    get_text(input('url: '))
elif text == 'subindex':
    # get_index() writes its output to cwals/index/index.txt, so read it from there.
    with open('cwals/index/index.txt', 'rt', encoding='utf8') as f:
        lines = f.read().splitlines()
    for line in lines:
        get_subIndex(line)
elif text == 'text':
    get_text_from_index()
else:
    # Print the message to stderr and exit with a non-zero status.
    raise SystemExit('keyword error')
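
The cleanup step in get_text (drop commas, then turn each run of three newlines into a comma separator) is easy to try on a plain string. The function name flatten_description and the sample text below are made up for illustration; the two replace() calls are the same ones used above.

```python
def flatten_description(text):
    """Apply the same two replace() calls as get_text to a plain string."""
    # Drop existing commas so they cannot be confused with the separator added next.
    text = text.replace(',', '')
    # Each run of three newlines becomes a comma followed by four spaces.
    return text.replace('\n\n\n', ',    ')


sample = 'Title, 2022\n\n\nArtist: A, B\n\n\nVenue'
print(flatten_description(sample))  # Title 2022,    Artist: A B,    Venue
```

Note that replace() works left to right on exact matches, so a run of four newlines leaves one newline behind after the separator.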
