I wrote earlier about scraping pages with phantomjs (//www.jb51.net/article/55789.htm), which works through selectors.
With the beautifulSoup Python module (docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/), grabbing page content is very easy.
# coding=utf-8
# Python 2 style: urlencode/urlopen live directly in urllib
import urllib
from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s'
values = {'wd': '網球'}
encoded_param = urllib.urlencode(values)  # -> wd=%E7%B6%B2%E7%90%83
full_url = url + '?' + encoded_param
response = urllib.urlopen(full_url)       # file-like object; BeautifulSoup accepts it directly
soup = BeautifulSoup(response)
alinks = soup.find_all('a')               # every <a> tag on the results page
The code above fetches Baidu's search results for 網球 (tennis).
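To do something with the scrape, note that each entry in alinks is a Tag whose attributes come out via get(). A minimal follow-on sketch (this loop is my addition, not part of the original example):

for link in alinks:
    print(link.get('href'))  # get() returns None instead of raising if href is absent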
beautifulSoup has a lot of very useful methods built in.
A few of the handier features:
Constructing a node element
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
Attributes can be fetched with attrs; the result is a dict:
tag.attrs
# {u'class': u'boldest'}
You can also read a single attribute directly, e.g. tag['class'].
Attributes can be manipulated freely too:
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>
del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
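The snippet above obtained its Tag by parsing a string; to build a node from scratch, BeautifulSoup also provides new_tag(). A small sketch (the URL and link text are made up for illustration):

new_link = soup.new_tag('a', href='http://example.com')  # hypothetical target URL
new_link.string = 'a brand new link'
tag.append(new_link)  # attach the new node under the existing tag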
You can also navigate and search the DOM however you like, as in the following example.
1. Build a document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
2. Poke at it in all sorts of ways
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
len(soup.contents)
# 1
soup.contents[0].name
# u'html'
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
for child in title_tag.children:
    print(child)
# The Dormouse's story
head_tag.contents
# [<title>The Dormouse's story</title>]
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
len(list(soup.children))
# 1
len(list(soup.descendants))
# 25
title_tag.string
# u'The Dormouse's story'
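The examples above only navigate the tree; for real searching, find_all() also takes attribute filters, keyword arguments, and regular expressions. A short sketch against the same html_doc (these particular filters are my illustration, not from the original post):

import re
soup.find_all('a', class_='sister')            # filter by CSS class (class_ dodges the Python keyword)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(id='link2')                      # any attribute works as a keyword argument
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all('a', href=re.compile('lacie'))   # values may be regular expressions
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]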