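This article continues from Part 1 of the series.
I. Data classification
- Correct data: id, sex, and last-active time are all present.
  Stored in file1 = 'ruisi\\correct%s-%s.txt' % (startNum, endNum)
  Data format: 293001 男 2015-5-1 19:17 (男 = male)
- No time: id and sex are present, but there is no last-active time.
  Stored in file2 = 'ruisi\\errTime%s-%s.txt' % (startNum, endNum)
  Data format: 2566 女 notime (女 = female)
- User does not exist: no user corresponds to this id.
  Stored in file3 = 'ruisi\\notexist%s-%s.txt' % (startNum, endNum)
  Data format: 29005 notexist
- Unknown sex: the id exists, but the sex cannot be determined from the web page (on inspection, these users have no last-active time either).
  Stored in file4 = 'ruisi\\unkownsex%s-%s.txt' % (startNum, endNum)
  Data format: 221794 unkownsex
- Network error: the connection dropped or the server failed, so these ids need to be checked again.
  Stored in file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)
  Data format: 271004 httperror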
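To make the five line formats concrete, here is a minimal Python 3 sketch that reads one output line back into a category. The function name `parse_record` and the category labels are hypothetical and do not appear in the original scripts; only the line formats come from the list above.

```python
def parse_record(line):
    """Split one output line into (id, category, fields)."""
    parts = line.split()
    uid = int(parts[0])
    rest = parts[1:]
    # Single-token markers written for the three non-normal cases.
    if rest in (["notexist"], ["unkownsex"], ["httperror"]):
        return uid, rest[0], {}
    # Sex is known but the last-active time is missing.
    if len(rest) == 2 and rest[1] == "notime":
        return uid, "notime", {"sex": rest[0]}
    # Otherwise: sex followed by date and clock time, e.g. "2015-5-1 19:17".
    return uid, "correct", {"sex": rest[0], "time": " ".join(rest[1:])}
```

For example, `parse_record("2566 女 notime")` yields the id 2566, the category `notime`, and the recorded sex.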
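How to crawl without interruption
- One design goal of this project is to keep crawling without interruption. If the crawler exited whenever the network dropped or the BBS server failed, we would have to resume from the point of failure or, worse, start over from the beginning.
- So my approach is to record the id whenever a failure occurs, and to re-crawl those failed ids only after one full pass has finished.
- Part 1 of this series introduced getInfo(myurl, seWord), which fetches a given URL and extracts information with a given regular expression.
- This function can be used to look up both the sex and the last-active time.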
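The "re-crawl the failed ids after one pass" step could look roughly like this. It is a Python 3 sketch, not the author's code: `ids_to_retry` is a hypothetical helper that assumes the `id httperror` line format described earlier.

```python
def ids_to_retry(lines):
    """Return the unique ids found in 'id httperror' lines, in ascending order."""
    ids = set()
    for line in lines:
        parts = line.split()
        # Ignore blank or malformed lines; keep only 'id httperror' records.
        if len(parts) == 2 and parts[1] == "httperror":
            ids.add(int(parts[0]))
    return sorted(ids)
```

The returned list could then be passed to the crawl function for a second pass, for instance by opening the httperror file and handing the file object to `ids_to_retry`.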
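Next we define a safe fetch function that never interrupts the program, using try/except exception handling.
The code calls getInfo(myurl, seWord) up to twice; if the second attempt also raises an exception, the id is appended to file5.
If the information can be retrieved, it is returned.
file5 = 'ruisi\\httperror%s-%s.txt' % (startNum, endNum)
def safeGet(myid, myurl, seWord):
    try:
        return getInfo(myurl, seWord)
    except Exception:
        try:
            return getInfo(myurl, seWord)
        except Exception:
            # both attempts failed: record the id so it can be re-crawled later
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myid, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'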
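The nested try/except can also be expressed as a loop, which makes the number of attempts a parameter. This is a Python 3 sketch under stated assumptions: `safe_call`, `attempts`, and `on_fail` are hypothetical names, and the `fetch` callable stands in for a call such as getInfo(myurl, seWord).

```python
def safe_call(fetch, attempts=2, on_fail=None):
    """Call fetch() up to `attempts` times; return 'httperror' if every try raises."""
    for _ in range(attempts):
        try:
            return fetch()
        except Exception:
            # Swallow the error and try again (if any attempts remain).
            continue
    # Every attempt failed: let the caller record the failure, then signal it.
    if on_fail is not None:
        on_fail()
    return "httperror"
```

Passing the failure handler in as `on_fail` keeps the retry logic separate from the decision of where failed ids are written.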
Iterating over the ids in [1, 300,000] to fetch user information
We define a function that fetches sex and time. If sex is found, it goes on to check whether time is present; if sex is missing, it checks whether the user does not exist or whether the sex simply cannot be scraped.
Network outages and BBS server failures also have to be taken into account.
url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)  # replace %s with the id
        timeUrl = url2 % (id)
        sex = safeGet(id, sexUrl, sexRe)
        if not sex:  # if sexUrl does not contain the sex, try timeUrl as well
            sex = safeGet(id, timeUrl, sexRe)
        time = safeGet(id, timeUrl, timeRe)
        # if an httperror occurred, the id has to be re-crawled later
        # (compare with ==, not 'is': identity of equal strings is not guaranteed)
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                # sex is None here, so check whether the user does not exist
                # when the network is down, this extra lookup leads to 4 duplicate httperror lines
                # the user may simply not have filled in a sex
                notexist = safeGet(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()
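The branching in searchWeb boils down to a small decision table. The following Python 3 sketch isolates that decision as a pure function so it can be checked without network access; `classify` and its return labels are hypothetical names chosen to mirror the five output files.

```python
def classify(sex, last_time, notexist):
    """Map the three lookup results to one of the five output categories."""
    # Any failed lookup for sex or time means the id must be re-crawled.
    if "httperror" in (sex, last_time):
        return "httperror"
    if sex:
        return "correct" if last_time else "notime"
    # No sex found: the third lookup decides between the remaining cases.
    if notexist == "httperror":
        return "httperror"
    return "notexist" if notexist else "unkownsex"
```

Each argument is either the matched text, `None` when nothing matched, or the string `'httperror'`, the same three outcomes safeGet can produce.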
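A later review turned up a problem with this part:
sex = safeGet(id, sexUrl, sexRe)
if not sex:
    sex = safeGet(id, timeUrl, sexRe)
time = safeGet(id, timeUrl, timeRe)
When the network is down, this code calls safeGet up to 3 times, and every failed call appends another httperror line for the same id to the file, for example: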
251538 httperror
251538 httperror
251538 httperror
251538 httperror
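One possible way to avoid such duplicates is to remember which ids have already been recorded. This is a Python 3 sketch of that idea, not something the original code does; the class `ErrorLog` and its members are hypothetical, and the lines are kept in memory here instead of being appended to file5.

```python
class ErrorLog:
    """Remember which ids already failed so each is recorded at most once."""

    def __init__(self):
        self.seen = set()
        self.lines = []

    def record(self, uid):
        """Record uid as failed; return False if it was already recorded."""
        if uid in self.seen:
            return False
        self.seen.add(uid)
        self.lines.append("%d httperror\n" % uid)
        return True
```

With this in place, four failed lookups for id 251538 would produce a single `251538 httperror` line.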
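Crawling with multiple threads?
The data collection can run in parallel, because each id range writes to its own independent text files.
1. Introducing Popen
Popen lets you customize standard input, standard output, and standard error. When I interned at SAP, the project team often used Popen on Linux, probably because it makes redirecting output convenient.
The code below borrows from that team's approach. Popen can run system commands, and each call starts a separate child process (what this article loosely calls a "thread"). The three consecutive communicate() calls wait until all three workers have finished.
A puzzle?
In my experiments, the three communicate() calls had to sit right next to each other for the three workers to run at the same time and for the program to wait for all of them at the end. A likely explanation: each Popen(...) call starts its process immediately, while communicate() blocks until that process exits, so all three Popen calls must come before the first communicate().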
p1=Popen(['python', 'ruisi.py', str(s0),str(s1)],bufsize=10000, stdout=subprocess.PIPE)
p2=Popen(['python', 'ruisi.py', str(s1),str(s2)],bufsize=10000, stdout=subprocess.PIPE)
p3=Popen(['python', 'ruisi.py', str(s2),str(s3)],bufsize=10000, stdout=subprocess.PIPE)
p1.communicate()
p2.communicate()
p3.communicate()
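The following Python 3 sketch checks the point that the child processes are already running before any communicate() call. It is an illustration under assumptions: the one-line `CHILD` program and the function `run_three` are hypothetical, and the children wait on standard input (unlike ruisi.py, which takes command-line arguments) so that their concurrent existence can be observed deterministically.

```python
import subprocess
import sys

# Hypothetical child: block until a line arrives on stdin, then report it.
CHILD = "import sys; print('done ' + sys.stdin.readline().strip())"

def run_three():
    # Each Popen call starts its child process right away.
    procs = [
        subprocess.Popen([sys.executable, "-c", CHILD],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         universal_newlines=True)
        for _ in range(3)
    ]
    # All three children are alive here, before any communicate() call.
    all_running = all(p.poll() is None for p in procs)
    # communicate() sends the input, then waits for that child to exit.
    outputs = [p.communicate("job%d\n" % i)[0].strip()
               for i, p in enumerate(procs)]
    return all_running, outputs
```

If a communicate() call were placed between two Popen calls, the second child would not be started until the first had exited, which would serialize the work.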
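2. Defining a single-threaded crawler
Usage: python ruisi.py, followed by the two arguments the script reads (startNum and endNum in the code).
This script crawls the information for ids in [startNum, endNum) and writes it to the corresponding text files. It is a single-threaded program; to run the crawl in parallel, do so at the place where the script is invoked from outside.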
# ruisi.py
# coding=utf-8
import urllib2, re, sys, threading, time, thread

# myurl: the target URL
# seWord: a compiled regular expression, written as unicode
# returns the text matched by the regular expression, or None
def getInfo(myurl, seWord):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }
    req = urllib2.Request(
        url=myurl,
        headers=headers
    )
    time.sleep(0.3)
    response = urllib2.urlopen(req)
    html = response.read()
    html = unicode(html, 'utf-8')
    timeMatch = seWord.search(html)
    if timeMatch:
        s = timeMatch.groups()
        return s[0]
    else:
        return None

# try getInfo() twice
# after the second failure, mark this id as httperror
def safeGet(myid, myurl, seWord):
    try:
        return getInfo(myurl, seWord)
    except Exception:
        try:
            return getInfo(myurl, seWord)
        except Exception:
            httperrorfile = open(file5, 'a')
            info = '%d %s\n' % (myid, 'httperror')
            httperrorfile.write(info)
            httperrorfile.close()
            return 'httperror'

# input: a range of ids idArr, e.g. [1,1001)
def searchWeb(idArr):
    for id in idArr:
        sexUrl = url1 % (id)
        timeUrl = url2 % (id)
        sex = safeGet(id, sexUrl, sexRe)
        if not sex:
            sex = safeGet(id, timeUrl, sexRe)
        time = safeGet(id, timeUrl, timeRe)
        # compare with ==, not 'is': identity of equal strings is not guaranteed
        if (sex == 'httperror') or (time == 'httperror'):
            pass
        else:
            if sex:
                info = '%d %s' % (id, sex)
                if time:
                    info = '%s %s\n' % (info, time)
                    wfile = open(file1, 'a')
                    wfile.write(info)
                    wfile.close()
                else:
                    info = '%s %s\n' % (info, 'notime')
                    errtimefile = open(file2, 'a')
                    errtimefile.write(info)
                    errtimefile.close()
            else:
                notexist = safeGet(id, sexUrl, notexistRe)
                if notexist == 'httperror':
                    pass
                else:
                    if notexist:
                        notexistfile = open(file3, 'a')
                        info = '%d %s\n' % (id, 'notexist')
                        notexistfile.write(info)
                        notexistfile.close()
                    else:
                        unkownsexfile = open(file4, 'a')
                        info = '%d %s\n' % (id, 'unkownsex')
                        unkownsexfile.write(info)
                        unkownsexfile.close()

def main():
    reload(sys)
    sys.setdefaultencoding('utf-8')
    if len(sys.argv) != 3:
        print 'usage: python ruisi.py'  # expects two arguments: startNum endNum
        sys.exit(-1)
    global sexRe,timeRe,notexistRe,url1,url2,file1,file2,file3,file4,startNum,endNum,file5
    startNum = int(sys.argv[1])
    endNum = int(sys.argv[2])
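    # ASSUMPTION: the HTML tags inside these three patterns were lost from this listing.
    # The '</em>', '</li>' and '(p>)' fragments below are guesses; check them against the real page source.
    sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li>')
    timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li>')
    notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')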
    url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
    url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
    file1 = '..\\newRuisi\\correct%s-%s.txt' % (startNum, endNum)
    file2 = '..\\newRuisi\\errTime%s-%s.txt' % (startNum, endNum)
    file3 = '..\\newRuisi\\notexist%s-%s.txt' % (startNum, endNum)
    file4 = '..\\newRuisi\\unkownsex%s-%s.txt' % (startNum, endNum)
    file5 = '..\\newRuisi\\httperror%s-%s.txt' % (startNum, endNum)
    searchWeb(xrange(startNum,endNum))
    # numThread = 10
    # searchWeb(xrange(endNum))
    # total = 0
    # for i in xrange(numThread):
    #     data = xrange(1+i,endNum,numThread)
    #     total += len(data)
    #     t=threading.Thread(target=searchWeb,args=(data,))
    #     t.start()
    # print total

main()
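The matching step at the end of getInfo can be tried offline. The following is a Python 3 sketch with made-up input: the helper `first_group`, the `sample` fragment, and the pattern `sex_re` are all hypothetical and are not taken from the real BBS pages.

```python
import re

def first_group(html, pattern):
    """Return the first captured group of pattern in html, or None if no match."""
    match = pattern.search(html)
    return match.group(1) if match else None

# Hypothetical page fragment and pattern, for illustration only.
sample = "<li><em>sex</em>male</li>"
sex_re = re.compile(r"em>sex</em>(.*?)</li>")
```

Here `first_group(sample, sex_re)` returns the text between the closing `</em>` and `</li>`, and a page without that fragment gives `None`, mirroring the two return paths of getInfo.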
Parallel crawler (three processes)
Code
# coding=utf-8
from subprocess import Popen
import subprocess
import threading,time

startn = 1
endn = 300001
step = 1000
total = (endn - startn + 1) / step
ISOTIMEFORMAT = '%Y-%m-%d %X'
# hardcode 3 worker processes
# I did not investigate whether 3 workers is better than 4 or more
# print the formatted date and time (year-month-day hour:minute:second)
# print the elapsed time of the program (in seconds)
for i in xrange(0, total, 3):
    startNumber = startn + step * i
    startTime = time.clock()
    s0 = startNumber
    s1 = startNumber + step
    s2 = startNumber + step*2
    s3 = startNumber + step*3
    p1=Popen(['python', 'ruisi.py', str(s0),str(s1)],bufsize=10000, stdout=subprocess.PIPE)
    p2=Popen(['python', 'ruisi.py', str(s1),str(s2)],bufsize=10000, stdout=subprocess.PIPE)
    p3=Popen(['python', 'ruisi.py', str(s2),str(s3)],bufsize=10000, stdout=subprocess.PIPE)
    startftime ='[ '+ time.strftime( ISOTIMEFORMAT, time.localtime() ) + ' ] '
    print startftime + '%s - %s download start... ' %(s0, s1)
    print startftime + '%s - %s download start... ' %(s1, s2)
    print startftime + '%s - %s download start... ' %(s2, s3)
    p1.communicate()
    p2.communicate()
    p3.communicate()
    endftime = '[ '+ time.strftime( ISOTIMEFORMAT, time.localtime() ) + ' ] '
    print endftime + '%s - %s download end !!! ' %(s0, s1)
    print endftime + '%s - %s download end !!! ' %(s1, s2)
    print endftime + '%s - %s download end !!! ' %(s2, s3)
    endTime = time.clock()
    print "cost time " + str(endTime - startTime) + " s"
    time.sleep(5)
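The range arithmetic in the loop (s0 to s3) can be separated from the process handling. This Python 3 sketch generates the same half-open ranges in groups of three; `batches` and its parameter names are hypothetical, and it generalizes the hardcoded three workers to any count.

```python
def batches(start, end, step, workers):
    """Yield lists of up to `workers` half-open (lo, hi) ranges covering [start, end)."""
    ranges = [(lo, min(lo + step, end)) for lo in range(start, end, step)]
    for i in range(0, len(ranges), workers):
        yield ranges[i:i + workers]
```

With start 1, end 300001, step 1000, and 3 workers, this produces 100 batches, the first being (1, 1001), (1001, 2001), (2001, 3001), which matches the ranges that appear in the logs.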
Here is the log with timestamps:
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
[ 2015-11-23 11:31:15 ] 1 - 1001 download start...
[ 2015-11-23 11:31:15 ] 1001 - 2001 download start...
[ 2015-11-23 11:31:15 ] 2001 - 3001 download start...
[ 2015-11-23 11:53:44 ] 1 - 1001 download end !!!
[ 2015-11-23 11:53:44 ] 1001 - 2001 download end !!!
[ 2015-11-23 11:53:44 ] 2001 - 3001 download end !!!
cost time 1348.99480677 s
[ 2015-11-23 11:53:50 ] 3001 - 4001 download start...
[ 2015-11-23 11:53:50 ] 4001 - 5001 download start...
[ 2015-11-23 11:53:50 ] 5001 - 6001 download start...
[ 2015-11-23 12:16:56 ] 3001 - 4001 download end !!!
[ 2015-11-23 12:16:56 ] 4001 - 5001 download end !!!
[ 2015-11-23 12:16:56 ] 5001 - 6001 download end !!!
cost time 1386.06407734 s
[ 2015-11-23 12:17:01 ] 6001 - 7001 download start...
[ 2015-11-23 12:17:01 ] 7001 - 8001 download start...
[ 2015-11-23 12:17:01 ] 8001 - 9001 download start...
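Above is the log of the parallel run. It shows that a batch of 3,000 users takes roughly 1,350 to 1,390 s, so 1,000 users need about 500 s on average, or about 0.5 s per id. For all 300,000 ids that gives 500*300/3600 = 41.67 hours, or about two days.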
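The estimate can be reproduced with a few lines of Python 3. This is only the arithmetic from the paragraph above; the function name `estimate_hours` is hypothetical.

```python
def estimate_hours(seconds_per_1000_ids, total_ids):
    """Total crawl time in hours, given the average time per 1,000 ids."""
    return seconds_per_1000_ids * (total_ids / 1000.0) / 3600.0

# Rounded figure used in the text: 500 s per 1,000 ids.
rounded = estimate_hours(500, 300000)
# Figure taken from the first logged batch: 1348.99 s for 3,000 ids.
from_log = estimate_hours(1348.99480677 / 3, 300000)
```

The rounded input gives about 41.67 hours; using the first logged batch instead gives about 37.5 hours, so the two-day figure is on the cautious side.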
For comparison, we also timed a single-process run of the crawler. The log is as follows:
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
1 - 1001 download start...
1 - 1001 download end !!!
cost time 1583.65911889 s
1001 - 2001 download start...
1001 - 2001 download end !!!
cost time 1342.46874278 s
2001 - 3001 download start...
2001 - 3001 download end !!!
cost time 1327.10885725 s
3001 - 4001 download start...
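A single process takes roughly 1,500 s to crawl 1,000 users, while the parallel program crawls 3*1000 users in roughly the same 1,500 s.
So running three processes in parallel does save a lot of time compared with a single process.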
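Using the numbers from the two logs, the speedup for ids 1 to 3001 can be computed directly. A Python 3 sketch; the variable and function names are hypothetical, and the timings are copied from the logs in this article.

```python
# Three sequential 1,000-id runs from the single-process log (seconds).
single = [1583.65911889, 1342.46874278, 1327.10885725]
# One 3,000-id batch from the timestamped parallel log (seconds).
multi = 1348.99480677

def speedup(sequential_times, parallel_time):
    """Ratio of total sequential time to parallel time for the same work."""
    return sum(sequential_times) / parallel_time
```

For these figures the ratio comes out slightly above 3. Since only three workers were used, the excess most likely reflects run-to-run variation in network and server response time, not a real gain beyond 3x.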
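Note:
getInfo(myurl, seWord) contains the line time.sleep(0.3). It is there to avoid hitting the BBS too frequently and being refused access. This 0.3 s delay affects the timings reported above for both the parallel and the single-process runs.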
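To get a feel for how much of the measured time is deliberate waiting, the sleep can be totalled up. In searchWeb, every id triggers at least two getInfo calls (sex and time) and a third when the sex is not found at first. The Python 3 sketch below does that arithmetic; `sleep_seconds` is a hypothetical helper and the per-id call counts are read off the code, not measured.

```python
def sleep_seconds(ids, calls_per_id, delay=0.3):
    """Total time spent in time.sleep() for a given number of ids."""
    return ids * calls_per_id * delay

low = sleep_seconds(1000, 2)   # every id needs at least two lookups
high = sleep_seconds(1000, 3)  # three lookups when sex is missing at first
```

So for 1,000 ids, somewhere between about 600 s and 900 s of a run is spent sleeping, a substantial share of the measured 1,300 to 1,600 s.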
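Finally, here is the original log, without timestamps. (Adding timestamps shows when each crawl started, which helps in dealing with a worker that hangs.)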
"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/sum.py
1 - 1001 download start...
1001 - 2001 download start...
2001 - 3001 download start...
1 - 1001 download end !!!
1001 - 2001 download end !!!
2001 - 3001 download end !!!
cost time 1532.74102812 s
3001 - 4001 download start...
4001 - 5001 download start...
5001 - 6001 download start...
3001 - 4001 download end !!!
4001 - 5001 download end !!!
5001 - 6001 download end !!!
cost time 2652.01624951 s
6001 - 7001 download start...
7001 - 8001 download start...
8001 - 9001 download start...
6001 - 7001 download end !!!
7001 - 8001 download end !!!
8001 - 9001 download end !!!
cost time 1880.61513696 s
9001 - 10001 download start...
10001 - 11001 download start...
11001 - 12001 download start...
9001 - 10001 download end !!!
10001 - 11001 download end !!!
11001 - 12001 download end !!!
cost time 1634.40575553 s
12001 - 13001 download start...
13001 - 14001 download start...
14001 - 15001 download start...
12001 - 13001 download end !!!
13001 - 14001 download end !!!
14001 - 15001 download end !!!
cost time 1403.62795496 s
15001 - 16001 download start...
16001 - 17001 download start...
17001 - 18001 download start...
15001 - 16001 download end !!!
16001 - 17001 download end !!!
17001 - 18001 download end !!!
cost time 1271.42177906 s
18001 - 19001 download start...
19001 - 20001 download start...
20001 - 21001 download start...
18001 - 19001 download end !!!
19001 - 20001 download end !!!
20001 - 21001 download end !!!
cost time 1476.04122024 s
21001 - 22001 download start...
22001 - 23001 download start...
23001 - 24001 download start...
21001 - 22001 download end !!!
22001 - 23001 download end !!!
23001 - 24001 download end !!!
cost time 1431.37074164 s
24001 - 25001 download start...
25001 - 26001 download start...
26001 - 27001 download start...
24001 - 25001 download end !!!
25001 - 26001 download end !!!
26001 - 27001 download end !!!
cost time 1411.45186874 s
27001 - 28001 download start...
28001 - 29001 download start...
29001 - 30001 download start...
27001 - 28001 download end !!!
28001 - 29001 download end !!!
29001 - 30001 download end !!!
cost time 1396.88837788 s
30001 - 31001 download start...
31001 - 32001 download start...
32001 - 33001 download start...
30001 - 31001 download end !!!
31001 - 32001 download end !!!
32001 - 33001 download end !!!
cost time 1389.01316718 s
33001 - 34001 download start...
34001 - 35001 download start...
35001 - 36001 download start...
33001 - 34001 download end !!!
34001 - 35001 download end !!!
35001 - 36001 download end !!!
cost time 1318.16040825 s
36001 - 37001 download start...
37001 - 38001 download start...
38001 - 39001 download start...
36001 - 37001 download end !!!
37001 - 38001 download end !!!
38001 - 39001 download end !!!
cost time 1362.59222822 s
39001 - 40001 download start...
40001 - 41001 download start...
41001 - 42001 download start...
39001 - 40001 download end !!!
40001 - 41001 download end !!!
41001 - 42001 download end !!!
cost time 1253.62498539 s
42001 - 43001 download start...
43001 - 44001 download start...
44001 - 45001 download start...
42001 - 43001 download end !!!
43001 - 44001 download end !!!
44001 - 45001 download end !!!
cost time 1313.50461988 s
45001 - 46001 download start...
46001 - 47001 download start...
47001 - 48001 download start...
45001 - 46001 download end !!!
46001 - 47001 download end !!!
47001 - 48001 download end !!!
cost time 1322.32317331 s
48001 - 49001 download start...
49001 - 50001 download start...
50001 - 51001 download start...
48001 - 49001 download end !!!
49001 - 50001 download end !!!
50001 - 51001 download end !!!
cost time 1381.58027296 s
51001 - 52001 download start...
52001 - 53001 download start...
53001 - 54001 download start...
51001 - 52001 download end !!!
52001 - 53001 download end !!!
53001 - 54001 download end !!!
cost time 1357.78699459 s
54001 - 55001 download start...
55001 - 56001 download start...
56001 - 57001 download start...
54001 - 55001 download end !!!
55001 - 56001 download end !!!
56001 - 57001 download end !!!
cost time 1359.76377246 s
57001 - 58001 download start...
58001 - 59001 download start...
59001 - 60001 download start...
57001 - 58001 download end !!!
58001 - 59001 download end !!!
59001 - 60001 download end !!!
cost time 1335.47829775 s
60001 - 61001 download start...
61001 - 62001 download start...
62001 - 63001 download start...
60001 - 61001 download end !!!
61001 - 62001 download end !!!
62001 - 63001 download end !!!
cost time 1354.82727645 s
63001 - 64001 download start...
64001 - 65001 download start...
65001 - 66001 download start...
63001 - 64001 download end !!!
64001 - 65001 download end !!!
65001 - 66001 download end !!!
cost time 1260.54731607 s
66001 - 67001 download start...
67001 - 68001 download start...
68001 - 69001 download start...
66001 - 67001 download end !!!
67001 - 68001 download end !!!
68001 - 69001 download end !!!
cost time 1363.58255686 s
69001 - 70001 download start...
70001 - 71001 download start...
71001 - 72001 download start...
69001 - 70001 download end !!!
70001 - 71001 download end !!!
71001 - 72001 download end !!!
cost time 1354.17163074 s
72001 - 73001 download start...
73001 - 74001 download start...
74001 - 75001 download start...
72001 - 73001 download end !!!
73001 - 74001 download end !!!
74001 - 75001 download end !!!
cost time 1335.00425259 s
75001 - 76001 download start...
76001 - 77001 download start...
77001 - 78001 download start...
75001 - 76001 download end !!!
76001 - 77001 download end !!!
77001 - 78001 download end !!!
cost time 1360.44054978 s
78001 - 79001 download start...
79001 - 80001 download start...
80001 - 81001 download start...
78001 - 79001 download end !!!
79001 - 80001 download end !!!
80001 - 81001 download end !!!
cost time 1369.72662457 s
81001 - 82001 download start...
82001 - 83001 download start...
83001 - 84001 download start...
81001 - 82001 download end !!!
82001 - 83001 download end !!!
83001 - 84001 download end !!!
cost time 1369.95550676 s
84001 - 85001 download start...
85001 - 86001 download start...
86001 - 87001 download start...
84001 - 85001 download end !!!
85001 - 86001 download end !!!
86001 - 87001 download end !!!
cost time 1482.53886433 s
87001 - 88001 download start...
88001 - 89001 download start...
89001 - 90001 download start...
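As noted before this log, timestamps make it possible to tell when a batch started and therefore how long it has been running. A minimal Python 3 sketch, assuming the `[ YYYY-MM-DD HH:MM:SS ]` prefix seen in the timestamped log earlier in this article; `parse_stamp` and `elapsed_seconds` are hypothetical helpers.

```python
from datetime import datetime

def parse_stamp(line):
    """Extract the '[ YYYY-MM-DD HH:MM:SS ]' prefix of a log line as a datetime."""
    stamp = line[line.index("[") + 1:line.index("]")].strip()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")

def elapsed_seconds(start_line, end_line):
    """Seconds between the timestamps of two log lines."""
    return (parse_stamp(end_line) - parse_stamp(start_line)).total_seconds()
```

Applied to the first start and end lines of the timestamped log (11:31:15 and 11:53:44), this gives 1349 s, in line with the logged cost time of 1348.99 s. A batch whose start line is much older than that with no matching end line would be a candidate for a hung worker.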
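That concludes the second article on using a Python crawler to measure the male-to-female ratio on the school BBS. It focused on running the crawler in parallel; I hope it helps with your own learning.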

