Lagou Web Crawler

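Analyzing the Lagou website:

After entering a keyword on Lagou, we get the matching job postings (Python is used as the example here). We first fetch all the cities listed on the site, then iterate over those cities to crawl Python job postings nationwide.

In the Headers of the captured request we can find the request details, such as the page URL, the request method, the cookies, and the user agent.

Getting all cities: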

            
import re

import requests


class CrawlLaGou(object):
    def __init__(self):
        # Use a session so that cookies are kept between requests
        self.lagou_session = requests.session()
        self.header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
        }
        self.city_list = []

    # Fetch the city names
    def crawl_city(self):
        # Regular expression that extracts the city names from the HTML.
        # The lazy group must be closed by </a>; without it, (.*?) always matches an empty string.
        city_search = re.compile(r'www\.lagou\.com\/.*\/">(.*?)</a>')
        # Page URL
        city_url = "https://www.lagou.com/jobs/allCity.html"
        city_result = self.crawl_request(method="GET", url=city_url)
        self.city_list = city_search.findall(city_result)
        self.lagou_session.cookies.clear()

    # Send a request and return the response text
    def crawl_request(self, method, url, data=None, info=None):
        if method == "GET":
            response = self.lagou_session.get(url=url, headers=self.header)
        elif method == "POST":
            response = self.lagou_session.post(url=url, headers=self.header, data=data)
        else:
            raise ValueError("unsupported method: %s" % method)
        response.encoding = "utf8"
        return response.text


if __name__ == '__main__':
    lagou = CrawlLaGou()
    lagou.crawl_city()
    print(lagou.city_list)

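The User-Agent value in self.header can also be found in the Headers of the captured request. The code above first downloads the page source at the given URL, then uses a regular expression to extract every city name from the page.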
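To see what the regular expression captures without contacting the site, here is a minimal offline sketch. The HTML fragment is a hypothetical imitation of the city page (the real markup may differ), and the pattern assumes each city name is the text of a link that ends with `</a>`.

```python
import re

# Hypothetical HTML fragment imitating the city page; not copied from the real site
SAMPLE_HTML = '''
<ul class="city_list">
  <li><a href="https://www.lagou.com/beijing-zhaopin/">北京</a></li>
  <li><a href="https://www.lagou.com/shanghai-zhaopin/">上海</a></li>
  <li><a href="https://www.lagou.com/hangzhou-zhaopin/">杭州</a></li>
</ul>
'''

# Capture the link text between the end of the city URL and the closing </a>
city_search = re.compile(r'www\.lagou\.com\/.*\/">(.*?)</a>')

city_list = city_search.findall(SAMPLE_HTML)
print(city_list)
```

Because `.` does not match a newline by default, each link is matched on its own line, and `findall` returns one captured name per link.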


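Once we have all the city names, we move on to the job postings for each city. Go back to the job list page (https://www.lagou.com/jobs/list_python) and find the request whose response holds the job data, along with its request headers.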


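After obtaining the job data, we can parse it with https://www.json.cn/ and locate the fields we need.

From the parsed JSON, we can see that the list of job postings is stored at 'content' → 'positionResult' → 'result'.

Getting the job information: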
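The path 'content' → 'positionResult' → 'result' can be followed in Python as in the sketch below. The sample payload is hypothetical and heavily trimmed, and `extract_jobs` is an illustrative helper name, not part of the original crawler.

```python
import json

# Hypothetical, heavily trimmed response body; real responses carry many more fields
SAMPLE_RESPONSE = '''
{
  "content": {
    "positionResult": {
      "result": [
        {"positionId": 1, "positionName": "Python Engineer", "city": "北京", "salary": "15k-25k"},
        {"positionId": 2, "positionName": "Backend Developer", "city": "上海", "salary": "20k-30k"}
      ]
    }
  }
}
'''


def extract_jobs(response_text):
    """Return the list at content -> positionResult -> result, or [] if any level is missing."""
    data = json.loads(response_text)
    content = data.get("content") or {}
    position_result = content.get("positionResult") or {}
    return position_result.get("result") or []


jobs = extract_jobs(SAMPLE_RESPONSE)
for job in jobs:
    print(job["positionId"], job["positionName"], job["salary"])
```

Using `.get()` at each level means a response without job data yields an empty list instead of raising `KeyError`.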

            
# Fetch the job postings for one city
# (a method of CrawlLaGou; requires `import json` at the top of the file)
def crawl_city_job(self, city):
    # URL of the job list page
    first_request_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput=" % city
    first_response = self.crawl_request(method="GET", url=first_request_url)
    # Regular expression that extracts the number of result pages
    total_page_search = re.compile(r'class="span\stotalNum">(\d+)')
    try:
        total_page = total_page_search.search(first_response).group(1)
    except AttributeError:
        # search() returned None: no postings for this city, so return
        return
    else:
        for i in range(1, int(total_page) + 1):
            # Form fields sent with the POST request
            data = {
                "pn": i,
                "kd": "python"
            }
            # URL that returns the job data
            page_url = "https://www.lagou.com/jobs/positionAjax.json?city=%s&needAddtionalResult=false" % city
            # Add the matching Referer; encode it because the city name is not ASCII
            referer_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput=" % city
            self.header['Referer'] = referer_url.encode()
            response = self.crawl_request(method="POST", url=page_url, data=data, info=city)
            lagou_data = json.loads(response)
            # List of job postings found through the JSON analysis
            job_list = lagou_data['content']['positionResult']['result']
            for job in job_list:
                print(job)

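The code above first requests the job list page (first_request_url) and reads the page count from its HTML. The page count also tells us whether any postings exist: if none is found, the method returns. Otherwise, for each page it sends a request to the URL that returns the job data (page_url), with the corresponding data fields and Referer header, and finally reads the postings we need by following 'content' → 'positionResult' → 'result'.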
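The page-count step can be tried offline with the following sketch. The HTML fragment is hypothetical, and `build_page_requests` is an illustrative helper that only builds the form data for each page; it does not send any request.

```python
import re

# Same pattern as in crawl_city_job: the digits that follow class="span totalNum">
TOTAL_PAGE_SEARCH = re.compile(r'class="span\stotalNum">(\d+)')


def build_page_requests(list_page_html, keyword="python"):
    """Return one POST form dict per result page, or [] when no page count is found."""
    match = TOTAL_PAGE_SEARCH.search(list_page_html)
    if match is None:
        # No page count in the HTML: treat it as "no postings for this city"
        return []
    total_page = int(match.group(1))
    return [{"pn": page, "kd": keyword} for page in range(1, total_page + 1)]


# Hypothetical fragment of a job list page with three result pages
sample_html = '<span class="span totalNum">3</span>'
page_requests = build_page_requests(sample_html)
print(page_requests)

# A page without the totalNum element produces no requests
print(build_page_requests('<div>no results</div>'))
```

Checking the match for `None` explicitly plays the same role as the `try`/`except` in the crawler, without hiding unrelated errors.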

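Handling the "operation too frequent, please try again later" (操作太频繁) message:

If "operation too frequent" appears while the crawler is running, the site has detected the crawler. In that case we clear the cookies, request the job list URL again to obtain fresh cookies, pause the program for 10 s, and then retry the request.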

            
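# Send a request and return the response text
# (replaces the earlier crawl_request; requires `import time` at the top of the file)
def crawl_request(self, method, url, data=None, info=None):
    while True:
        if method == "GET":
            response = self.lagou_session.get(url=url, headers=self.header)
        elif method == "POST":
            response = self.lagou_session.post(url=url, headers=self.header, data=data)
        else:
            raise ValueError("unsupported method: %s" % method)
        response.encoding = "utf8"
        # Handle the "operation too frequent" response.
        # The site answers in Simplified Chinese, so the check must use '频繁'.
        if '频繁' in response.text:
            print(response.text)
            # Drop the flagged cookies and visit the list page again to get new ones
            self.lagou_session.cookies.clear()
            first_request_url = "https://www.lagou.com/jobs/list_python?city=%s&cl=false&fromSearch=true&labelWords=&suginput=" % info
            self.crawl_request(method="GET", url=first_request_url)
            time.sleep(10)
            continue
        return response.text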
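The clear-cookies, wait, and retry pattern above can be exercised without the network. The sketch below is a simplified variant, not the crawler's own code: `fetch_with_retry` and its parameters are hypothetical names, the retry count is capped (the method above retries indefinitely), and the fetch, reset, and sleep steps are passed in as callables so they can be replaced by fakes.

```python
import time


def fetch_with_retry(fetch, reset, is_blocked, wait_seconds=10, max_retries=3, sleep=time.sleep):
    """Call fetch() until the response is no longer flagged as blocked.

    fetch:      callable returning the response text
    reset:      callable that clears state (for example cookies) before a retry
    is_blocked: callable that decides whether a response is the rate-limit page
    """
    for attempt in range(max_retries + 1):
        text = fetch()
        if not is_blocked(text):
            return text
        if attempt == max_retries:
            break
        reset()
        sleep(wait_seconds)
    raise RuntimeError("still blocked after %d retries" % max_retries)


# Fake responses: blocked twice, then a normal body
responses = iter(["操作太频繁", "操作太频繁", '{"content": {}}'])
calls = {"reset": 0}


def fake_reset():
    calls["reset"] += 1


result = fetch_with_retry(
    fetch=lambda: next(responses),
    reset=fake_reset,
    is_blocked=lambda text: "频繁" in text,
    sleep=lambda seconds: None,  # skip the real wait in this demo
)
print(result, calls["reset"])
```

Capping the retries is a design choice for the sketch: it keeps a permanently blocked crawler from looping forever, at the cost of having to handle the raised error.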

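Saving the crawled data to a database:

So far we have only printed every field of each entry in the result list, which is hard to read. We need to create a database and insert only the fields we need.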
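Filtering a posting down to the needed fields can be sketched as follows. The sample posting and the `select_fields` helper are hypothetical; the field names in `WANTED_FIELDS` are a subset of those used by the table definition in this article.

```python
# A subset of the fields stored in the database
WANTED_FIELDS = (
    "positionId", "positionName", "city", "salary",
    "workYear", "education", "companyShortName",
)


def select_fields(job, fields=WANTED_FIELDS):
    """Keep only the wanted keys; keys missing from the posting become None."""
    return {field: job.get(field) for field in fields}


# Hypothetical posting with one extra key that we do not want to store
sample_job = {
    "positionId": 1,
    "positionName": "Python Engineer",
    "city": "北京",
    "salary": "15k-25k",
    "workYear": "1-3年",
    "education": "本科",
    "someOtherField": "ignored",
}

item = select_fields(sample_job)
print(item)
```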

Creating the database:

            
# SQLAlchemy 1.4+; on older versions import declarative_base from sqlalchemy.ext.declarative
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

# Create the database connection
engine = create_engine("mysql+pymysql://root:root@127.0.0.1:3306/lagou?charset=utf8")
# Session factory used to work with the database
Session = sessionmaker(bind=engine)
# Declare the base class for table definitions
Base = declarative_base()

class Lagoutables(Base):
    # Table name
    __tablename__ = 'lagou_java'
    # id: primary key, auto-increment
    id = Column(Integer, primary_key=True, autoincrement=True)
    # Position ID
    positionID = Column(Integer, nullable=True)
    # Longitude
    longitude = Column(Float, nullable=False)
    # Latitude
    latitude = Column(Float, nullable=False)
    # Position name
    positionName = Column(String(length=50), nullable=False)
    # Years of experience
    workYear = Column(String(length=20), nullable=False)
    # Education
    education = Column(String(length=20), nullable=False)
    # Job nature
    jobNature = Column(String(length=20), nullable=True)
    # Company type (financing stage)
    financeStage = Column(String(length=30), nullable=True)
    # Company size
    companySize = Column(String(length=30), nullable=True)
    # Business field
    industryField = Column(String(length=30), nullable=True)
    # City
    city = Column(String(length=10), nullable=False)
    # Position tags (advantages)
    positionAdvantage = Column(String(length=200), nullable=True)
    # Company short name
    companyShortName = Column(String(length=50), nullable=True)
    # Company full name
    companyFullName = Column(String(length=200), nullable=True)
    # Salary
    salary = Column(String(length=20), nullable=False)
    # Crawl date
    crawl_date = Column(String(length=20), nullable=False)

# Create the table in the existing `lagou` database if it does not exist yet
Base.metadata.create_all(engine)

Inserting data:

            
# Add these lines to the existing CrawlLaGou.__init__
def __init__(self):
    self.mysql_session = Session()
    self.date = time.strftime("%Y-%m-%d", time.localtime())

# Store one job posting (a method of CrawlLaGou)
def insert_item(self, item):
    # Today's date
    date = time.strftime("%Y-%m-%d", time.localtime())
    # Row to insert
    data = Lagoutables(
        # Position ID
        positionID=item['positionId'],
        # Longitude
        longitude=item['longitude'],
        # Latitude
        latitude=item['latitude'],
        # Position name
        positionName=item['positionName'],
        # Years of experience
        workYear=item['workYear'],
        # Education
        education=item['education'],
        # Job nature
        jobNature=item['jobNature'],
        # Company type (financing stage)
        financeStage=item['financeStage'],
        # Company size
        companySize=item['companySize'],
        # Business field
        industryField=item['industryField'],
        # City
        city=item['city'],
        # Position tags (advantages)
        positionAdvantage=item['positionAdvantage'],
        # Company short name
        companyShortName=item['companyShortName'],
        # Company full name
        companyFullName=item['companyFullName'],
        # Salary
        salary=item['salary'],
        # Crawl date
        crawl_date=date
    )

    # Before storing, check whether this posting is already in the table for today
    query_result = self.mysql_session.query(Lagoutables).filter(Lagoutables.crawl_date == date,
                                                                Lagoutables.positionID == item['positionId']).first()

    if query_result:
        print('Job posting already exists %s:%s:%s' % (item['positionId'], item['city'], item['positionName']))
    else:
        # Insert the row
        self.mysql_session.add(data)
        # Commit the transaction
        self.mysql_session.commit()
        print('Added job posting %s' % item['positionId'])
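The check-before-insert logic can be tried without a MySQL server. The sketch below swaps in the standard library's `sqlite3` with an in-memory database in place of SQLAlchemy and MySQL, and uses a hypothetical, reduced table `lagou_jobs` with only four of the columns; it illustrates the deduplication rule (same position ID on the same crawl date), not the article's storage layer.

```python
import sqlite3

# In-memory database with a reduced, hypothetical version of the table
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE lagou_jobs (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           positionID INTEGER,
           positionName TEXT NOT NULL,
           city TEXT NOT NULL,
           crawl_date TEXT NOT NULL)"""
)


def insert_item(conn, item, date):
    """Insert item unless the same positionID is already stored for this date.

    Returns True if a row was inserted, False if it was skipped.
    """
    row = conn.execute(
        "SELECT 1 FROM lagou_jobs WHERE crawl_date = ? AND positionID = ? LIMIT 1",
        (date, item["positionId"]),
    ).fetchone()
    if row:
        return False
    conn.execute(
        "INSERT INTO lagou_jobs (positionID, positionName, city, crawl_date) VALUES (?, ?, ?, ?)",
        (item["positionId"], item["positionName"], item["city"], date),
    )
    conn.commit()
    return True


item = {"positionId": 1, "positionName": "Python Engineer", "city": "北京"}
first = insert_item(conn, item, "2019-08-20")     # new row
second = insert_item(conn, item, "2019-08-20")    # duplicate on the same day, skipped
next_day = insert_item(conn, item, "2019-08-21")  # same posting on another day, stored again
count = conn.execute("SELECT COUNT(*) FROM lagou_jobs").fetchone()[0]
print(first, second, next_day, count)
```

Because the crawl date is part of the check, the same posting is stored once per day, which matches the filter on `crawl_date` and `positionID` used in `insert_item` above.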


At this point, the job information has been saved to the database.

Full code:
GitHub: https://github.com/KeerZhou/crawllagou
CSDN: https://download.csdn.net/download/keerzhou/11584694
