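The Baidu Tieba crawler works on essentially the same principle as the Qiushibaike crawler: inspect the page source, pick out the key data, and save it to a local txt file.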

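Project:

A web crawler for Baidu Tieba, written in Python.

Usage:

Create a new file named BugBaidu.py, paste the code into it, and double-click the file to run it.

Features:

Collects everything the original poster published in a thread and saves it to a local txt file.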

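How it works:

First, open any Tieba thread. After clicking "only show the original poster" and then clicking a page number, the URL changes slightly and becomes:
http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1
Here see_lz=1 means "only show the original poster" and pn=1 is the page number. Keep this in mind for the code later.
This is the URL we need to work with.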
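
As a small illustration of the URL structure described above, the sketch below builds the per-page URLs for a thread. The helper name build_page_url and its only_poster argument are hypothetical and not part of the crawler; the sketch assumes only the two query parameters mentioned above, see_lz and pn.

```python
# Hypothetical helper (not part of the crawler): builds the URL for one page
# of a thread, using the see_lz and pn query parameters described above.
BASE_URL = 'http://tieba.baidu.com/p/'


def build_page_url(thread_id, page, only_poster=True):
    url = BASE_URL + str(thread_id)
    params = []
    if only_poster:
        params.append('see_lz=1')  # show only the original poster
    params.append('pn=%d' % page)  # page number, starting from 1
    return url + '?' + '&'.join(params)


if __name__ == '__main__':
    for page in range(1, 4):
        print(build_page_url(2296712428, page))
```

Keeping the URL construction in one place means the page loop only has to vary the page number.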
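Next, look at the page source.
First pick out the title, which will be used as the file name when saving.
The page is encoded in gbk, and the title is wrapped in an h1 tag: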

<h1 ...>【原創】時尚首席（關于時尚，名利，事業，愛情，勵志）</h1>
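
To make the title step concrete, here is a minimal sketch of pulling the text out of an h1 tag with a regular expression and turning it into a safe file name. The sample HTML is made up for illustration (the real attributes on the h1 tag may differ), and the function is a standalone stand-in rather than the crawler's own method.

```python
import re

# Made-up page fragment for illustration; the real attributes on the
# <h1> tag may differ, which is why the pattern ignores them.
SAMPLE_HTML = u'<div><h1 class="example" title="x">\n  Sample thread title\n</h1></div>'


def find_title(page, default=u'Untitled'):
    # re.S lets "." match newlines, since the title may span several lines
    match = re.search(r'<h1.*?>(.*?)</h1>', page, re.S)
    if not match:
        return default
    title = match.group(1).strip()
    # Remove characters that are not allowed in Windows file names
    return re.sub(r'[\\/:*?"<>|]', '', title)


if __name__ == '__main__':
    print(find_title(SAMPLE_HTML))
```

The non-greedy (.*?) stops at the first closing h1 tag, so a later heading on the same page does not end up in the title.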

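Likewise, the body of each post is marked by a div combined with a class attribute, so all that remains is to match it with regular expressions.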
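
The matching step can be sketched as follows. The HTML fragment is invented for the example; the only detail taken from the crawler is that post bodies sit in a div whose id starts with post_content. The function name extract_posts is hypothetical.

```python
import re

# Made-up fragment; the id prefix "post_content" is what the crawler's
# pattern looks for, everything else here is invented for the example.
SAMPLE_HTML = (
    u'<div id="post_content_1" class="example">First line<br>second line</div>'
    u'<div id="post_content_2" class="example">Another <b>post</b></div>'
)


def extract_posts(page):
    posts = []
    # Non-greedy groups: each match ends at the first closing </div>
    for raw in re.findall(r'id="post_content.*?>(.*?)</div>', page, re.S):
        text = re.sub(r'<br\s*/?>', u'\n', raw)  # line breaks become newlines
        text = re.sub(r'<.*?>', u'', text)       # drop any remaining tags
        posts.append(text.strip())
    return posts


if __name__ == '__main__':
    for post in extract_posts(SAMPLE_HTML):
        print(post)
        print('---')
```

Because the match ends at the first closing div, a post that itself contains a nested div would be cut short. A real HTML parser would be more robust, at the cost of more code.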
[Screenshot: the program running]

[Screenshot: the generated txt file]

The code is as follows:

# -*- coding: utf-8 -*-
#---------------------------------------
#   Program:  Baidu Tieba crawler
#   Version:  0.5
#   Author:   why
#   Date:     2013-05-16
#   Language: Python 2.7
#   Usage:    enter the thread URL; the crawler switches to "original poster only" and saves to a local file
#   Function: save everything the original poster published to a local txt file.
#---------------------------------------

import urllib2
import re

#----------- Clean up the HTML tags found in the page -----------
class HTML_Tool:
    # Match \t, \n, a space, a hyperlink tag, or an image tag (non-greedy)
    BgnCharToNoneRex = re.compile("(\t|\n| |<a.*?>|<img.*?>)")

    # Match any <...> tag (non-greedy)
    EndCharToNoneRex = re.compile("<.*?>")

    # Match any opening <p> tag (non-greedy)
    BgnPartRex = re.compile("<p.*?>")
    # Tags that end a line, and the tag that starts a table cell
    # (assumed set: the original tag names are missing from this listing)
    CharToNewLineRex = re.compile("(<br/?>|</p>|<tr>|<div>|</div>)")
    CharToNextTabRex = re.compile("<td>")

    # Convert HTML entities back to the characters they stand for.
    # &amp; goes last so that text such as "&amp;lt;" is not decoded twice.
    replaceTab = [("&lt;","<"),("&gt;",">"),("&quot;","\""),("&nbsp;"," "),("&amp;","&")]

    def Replace_Char(self,x):
        x = self.BgnCharToNoneRex.sub("",x)
        x = self.BgnPartRex.sub("\n    ",x)
        x = self.CharToNewLineRex.sub("\n",x)
        x = self.CharToNextTabRex.sub("\t",x)
        x = self.EndCharToNoneRex.sub("",x)

        for t in self.replaceTab:
            x = x.replace(t[0],t[1])
        return x

class Baidu_Spider:
    # Declare the attributes used by the crawler
    def __init__(self,url):
        self.myUrl = url + '?see_lz=1'
        self.datas = []
        self.myTool = HTML_Tool()
        print u'Baidu Tieba crawler started'

    # Load the first page, decode it, and start the download
    def baidu_tieba(self):
        # Read the raw page and decode it from gbk
        myPage = urllib2.urlopen(self.myUrl).read().decode("gbk")
        # Work out how many pages the original poster's content spans
        endPage = self.page_counter(myPage)
        # Get the title of the thread
        title = self.find_title(myPage)
        print u'Thread title: ' + title
        # Fetch and save the actual data
        self.save_data(self.myUrl,title,endPage)

    # Work out the total number of pages
    def page_counter(self,myPage):
        # Match the page count shown in the "共有 12 頁" (12 pages in total) text
        myMatch = re.search(r'class="red">(\d+?)</span>', myPage, re.S)
        if myMatch:
            endPage = int(myMatch.group(1))
            print u'Crawler report: the original poster has %d page(s) of content' % endPage
        else:
            endPage = 0
            print u'Crawler report: could not work out how many pages the original poster wrote!'
        return endPage

    # Find the title of the thread
    def find_title(self,myPage):
        # Match <h1 ...>xxxxxxxxxx</h1> to find the title
        myMatch = re.search(r'<h1.*?>(.*?)</h1>', myPage, re.S)
        title = u'Untitled'
        if myMatch:
            title = myMatch.group(1).strip()
        else:
            print u'Crawler report: could not load the thread title!'
        # File names must not contain any of: \ / : * ? " < > |
        title = re.sub(r'[\\/:*?"<>|]', '', title)
        return title

    # Save what the original poster published
    def save_data(self,url,title,endPage):
        # Load the page data into self.datas
        self.get_data(url,endPage)
        # Write the collected text to a local file
        f = open(title+'.txt','w+')
        f.writelines(self.datas)
        f.close()
        print u'Crawler report: the thread has been downloaded and saved as a txt file'
        print u'Press Enter to exit...'
        raw_input()

    # Fetch every page and store its content in self.datas
    def get_data(self,url,endPage):
        url = url + '&pn='
        for i in range(1,endPage+1):
            print u'Crawler report: crawler no. %d is loading...' % i
            myPage = urllib2.urlopen(url + str(i)).read()
            # Clean the HTML in myPage and store the result in datas
            self.deal_data(myPage.decode('gbk'))

    # Pick the post content out of the page source
    def deal_data(self,myPage):
        myItems = re.findall('id="post_content.*?>(.*?)</div>',myPage,re.S)
        for item in myItems:
            data = self.myTool.Replace_Char(item.replace("\n","").encode('gbk'))
            self.datas.append(data+'\n')

#-------- Program entry point ------------------
print u"""#---------------------------------------
#   Program:  Baidu Tieba crawler
#   Version:  0.5
#   Author:   why
#   Date:     2013-05-16
#   Language: Python 2.7
#   Usage:    enter the thread URL; the crawler switches to "original poster only" and saves to a local file
#   Function: save everything the original poster published to a local txt file.
#---------------------------------------
"""
# Using a novel thread as an example
# bdurl = 'http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1'

print u'Enter the number at the end of the thread URL:'
bdurl = 'http://tieba.baidu.com/p/' + raw_input('http://tieba.baidu.com/p/').strip()

# Run the crawler
mySpider = Baidu_Spider(bdurl)
mySpider.baidu_tieba()
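
The replaceTab list in HTML_Tool decodes HTML entities by plain string replacement, and with that approach the order of the pairs matters. The standalone sketch below illustrates why the ampersand entity is best handled last. The names ENTITIES and unescape are illustrative and not part of the crawler.

```python
# Entities are decoded by plain string replacement; "&amp;" is handled last.
ENTITIES = [
    (u'&lt;', u'<'),
    (u'&gt;', u'>'),
    (u'&quot;', u'"'),
    (u'&nbsp;', u' '),
    (u'&amp;', u'&'),
]


def unescape(text, table=ENTITIES):
    # Apply each (entity, character) replacement in the order given
    for entity, char in table:
        text = text.replace(entity, char)
    return text


if __name__ == '__main__':
    print(unescape(u'1 &lt; 2 &amp;&amp; &quot;a&quot;'))
    # "&amp;lt;" is the escaped form of the literal text "&lt;";
    # decoding "&amp;" first would wrongly turn it into "<".
    print(unescape(u'&amp;lt;'))
```

On Python 3, the standard library function html.unescape covers the full entity set and may be a simpler choice than a hand-written table.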

That is the complete, improved code for crawling Baidu Tieba. It is simple and practical, and I hope it helps.
