
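For the basics of Nutch, see lemo's blog column.

Nutch supports custom (secondary) development. To improve search precision, I wanted to extract only the main body text of each web page and use that as the indexed content, which corresponds to the parse_text data. I am using Nutch 1.4 and run the crawl command under Cygwin: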

      bin/nutch crawl urls -dir crawl -depth 3 -topN 30

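The crawl proceeds as follows. inject: injects the URLs from the URL files under urls into the crawl database. generate: selects URLs from the database to build the list of URLs to fetch. fetch: downloads the pages in that fetch list. parse: parses the content of each page. So the part I need to rewrite is the page parsing done by parse. After parsing a page, parse stores the extracted text in the parse_text directory of the corresponding segment under crawl/segments. We can use the command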

      bin/nutch readseg -dump crawl/segments/20120710142020 segdata

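to inspect the crawled content in detail.

Nutch is extended through extension points: by implementing the Parser extension point you can plug in your own parsing logic. HTML pages are parsed by the default parse-html plugin. For convenience (though this becomes inconvenient once the Nutch version changes), I modify the parse-html plugin directly.

First, locate the getParse method with which parse-html implements the Parser interface. This is the method that actually parses the page content.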

      public ParseResult getParse(Content content) {
        HTMLMetaTags metaTags = new HTMLMetaTags();

        URL base;
        try {
          base = new URL(content.getBaseUrl());
        } catch (MalformedURLException e) {
          return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
        }

        String text = "";
        String title = "";
        Outlink[] outlinks = new Outlink[0];
        Metadata metadata = new Metadata();

        // parse the content
        DocumentFragment root;
        try {
          byte[] contentInOctets = content.getContent();
          InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));

          EncodingDetector detector = new EncodingDetector(conf);
          detector.autoDetectClues(content, true);
          detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
          String encoding = detector.guessEncoding(content, defaultCharEncoding);

          metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
          metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

          input.setEncoding(encoding);
          if (LOG.isTraceEnabled()) { LOG.trace("Parsing..."); }
          root = parse(input);
        } catch (IOException e) {
          return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
        } catch (DOMException e) {
          return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
        } catch (SAXException e) {
          return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
        } catch (Exception e) {
          e.printStackTrace(LogUtil.getWarnStream(LOG));
          return new ParseStatus(e).getEmptyParseResult(content.getUrl(), getConf());
        }

        // get meta directives
        HTMLMetaProcessor.getMetaTags(metaTags, root, base);
        if (LOG.isTraceEnabled()) {
          LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
        }
        // check meta directives
        if (!metaTags.getNoIndex()) {               // okay to index
          StringBuffer sb = new StringBuffer();
          if (LOG.isTraceEnabled()) { LOG.trace("Getting text..."); }
          try {
            utils.getText(sb, root);          // this is where the text is actually extracted
            text = sb.toString();
          } catch (SAXException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
          }
          sb.setLength(0);
          if (LOG.isTraceEnabled()) { LOG.trace("Getting title..."); }
          utils.getTitle(sb, root);         // extract title
          title = sb.toString().trim();
        }

        if (!metaTags.getNoFollow()) {              // okay to follow links
          ArrayList<Outlink> l = new ArrayList<Outlink>();   // extract outlinks
          URL baseTag = utils.getBase(root);
          if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
          utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
          outlinks = l.toArray(new Outlink[l.size()]);
          if (LOG.isTraceEnabled()) {
            LOG.trace("found " + outlinks.length + " outlinks in " + content.getUrl());
          }
        }

        ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
        if (metaTags.getRefresh()) {
          status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
          status.setArgs(new String[] {metaTags.getRefreshHref().toString(),
            Integer.toString(metaTags.getRefreshTime())});
        }
        ParseData parseData = new ParseData(status, title, outlinks,
                                            content.getMetadata(), metadata);
        ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
                                                     new ParseImpl(text, parseData));

        // run filters on parse
        ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult,
                                                                 metaTags, root);
        if (metaTags.getNoCache()) {             // not okay to cache
          for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
            entry.getValue().getData().getParseMeta().set(Nutch.CACHING_FORBIDDEN_KEY,
                                                          cachingPolicy);
        }
        return filteredParse;
      }
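To make the `utils.getText(sb, root)` step more concrete: it collects text by walking the parsed DOM tree. The sketch below shows that general idea using only the JDK's XML APIs. It is a simplified illustration, not Nutch's DOMContentUtils: the class name `DomTextSketch` is hypothetical, it assumes well-formed XHTML input (real pages need a lenient parser such as the ones parse-html uses), and the rule of skipping `script` and `style` elements is my own assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Simplified illustration of DOM-based text extraction (hypothetical class). */
final class DomTextSketch {

    /** Parses well-formed XHTML and returns its text with whitespace collapsed. */
    static String extractText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder sb = new StringBuilder();
            collectText(doc.getDocumentElement(), sb);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Depth-first walk: append text nodes, skip script and style subtrees. */
    private static void collectText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return; // assumption: these subtrees never hold readable text
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim().replaceAll("\\s+", " ");
            if (!t.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(t);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), sb);
        }
    }
}
```

A walk like this keeps every text node it meets, navigation and footers included, which is why the extracted text contains more than the article body.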

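The code shows where the text is actually extracted, and that is the spot we need to change. Reading the source, Nutch extracts text by traversing the DOM tree. To get only the main body of the page, I chose the open-source tool boilerpipe. The code inserted into the function above is: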

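      // Try boilerpipe first; it works on the raw page bytes.
      text = BoilerpipeUtils.getMainbodyTextByBoilerpipe(new InputSource(
          new ByteArrayInputStream(content.getContent())));
      if (text.equals("")) {
        // Boilerpipe found no main body: fall back to Nutch's DOM-based extraction.
        utils.getText(sb, root);
        text = sb.toString();
        if (LOG.isTraceEnabled()) {
          LOG.trace("Extract text using DOMContentUtils...");
        }
      } else if (LOG.isTraceEnabled()) {
        LOG.trace("Extract text using Boilerpipe...");
      }
      // Debug only: append the URL and the extracted text to a local file.
      // The enclosing try must also catch BoilerpipeProcessingException and IOException.
      FileWriter fw = new FileWriter("E://mainbodypage//URLText.txt", true);
      fw.write("url::" + content.getUrl() + "\n");
      fw.write("text::" + text + "\n");
      fw.close();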
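The debug logging at the end of that snippet leaves the writer open if a write fails, and it relies on the platform default encoding. A minimal sketch of a safer variant follows, assuming Java 7 or later. The class name `ParseDebugLog` is hypothetical and the target path is passed in rather than hard-coded. The `url::` / `text::` record format is the one used above.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Appends url/text records to a debug file (hypothetical helper). */
final class ParseDebugLog {

    /** Appends one record; the file is created on first use and always closed. */
    static void append(Path file, String url, String text) {
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            w.write("url::" + url + "\n");
            w.write("text::" + text + "\n");
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need an extra catch clause.
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing UTF-8 explicitly matters for pages with non-ASCII text, since the default encoding under Cygwin on Windows is often not UTF-8.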

I write each page's URL and extracted text to a fixed path, which makes testing easier. The static utility class called by the getParse snippet is as follows:

      package org.apache.nutch.parse.html;

      import org.xml.sax.InputSource;
      import org.xml.sax.SAXException;

      import de.l3s.boilerpipe.BoilerpipeExtractor;
      import de.l3s.boilerpipe.BoilerpipeProcessingException;
      import de.l3s.boilerpipe.document.TextDocument;
      import de.l3s.boilerpipe.extractors.CommonExtractors;
      import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

      public class BoilerpipeUtils {

        /**
         * Extracts the main body text of a page with boilerpipe's article extractor.
         * Returns an empty string when no content is found.
         */
        public static String getMainbodyTextByBoilerpipe(InputSource is)
            throws BoilerpipeProcessingException, SAXException {
          final TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
          final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
          extractor.process(doc);
          final String content = doc.getContent();
          return content != null ? content : "";
        }
      }
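The contract between this helper and getParse is simple: an empty result means "no main body found", and the caller then falls back to the DOM text. A generic sketch of that pattern is below, assuming Java 8 or later. `TextFallbackSketch` and `firstNonEmpty` are hypothetical names, and unlike the snippet above this variant also treats a runtime failure of the primary extractor as an empty result, which is a design choice of the sketch rather than something the original code does.

```java
import java.util.function.Supplier;

/** Generic "primary extractor with fallback" pattern (hypothetical helper). */
final class TextFallbackSketch {

    /**
     * Returns the primary result unless it is null, blank, or the primary
     * extractor throws; in those cases the fallback is evaluated instead.
     */
    static String firstNonEmpty(Supplier<String> primary, Supplier<String> fallback) {
        String text;
        try {
            text = primary.get();
        } catch (RuntimeException e) {
            text = ""; // assumption: a failed extraction counts as "nothing found"
        }
        if (text == null || text.trim().isEmpty()) {
            return fallback.get(); // evaluated lazily, only when needed
        }
        return text;
    }
}
```

Because the fallback is a supplier, the DOM walk only runs for pages where the primary extractor comes back empty.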

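Because this uses the open-source tool boilerpipe, the related jar files must be placed in the lib folder under the plugin folder, and the runtime section of the plugin's plugin.xml becomes: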

      <runtime>
         <library name="parse-html.jar">
            <export name="*"/>
         </library>
         <library name="tagsoup-1.2.1.jar"/>
         <library name="boilerpipe-1.2.0.jar"/>
         <library name="nekohtml-1.9.13.jar"/>
         <library name="xerces-2.9.1.jar"/>
      </runtime>
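A jar that is declared in plugin.xml but missing from disk only shows up as a failure at crawl time, so it can help to check the declarations up front. The sketch below is a hypothetical helper (`PluginLibraries` is not part of Nutch): it reads the `library` names from a plugin.xml string with the JDK's DOM parser and reports which ones are absent from a directory you pass in.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Lists the jars a plugin.xml declares and finds missing ones (hypothetical helper). */
final class PluginLibraries {

    /** Returns the name attribute of every library element, in document order. */
    static List<String> libraryNames(String pluginXml) {
        try {
            NodeList libs = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("library");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < libs.getLength(); i++) {
                names.add(((Element) libs.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("plugin.xml is not well-formed XML", e);
        }
    }

    /** Returns the declared jars that are not regular files inside dir. */
    static List<String> missingFrom(List<String> names, File dir) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            if (!new File(dir, name).isFile()) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

Which directory to check depends on your build layout, so the sketch leaves that choice to the caller.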

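That completes the plugin. After running build project in Eclipse, run the crawl command shown above and you get parse_text data containing only the main body text. If you run the crawl command under Cygwin, it may still fail at runtime with a NoClassDefFound error because the jars cannot be found. Copying the three jars added above (boilerpipe, nekohtml and xerces) into the runtime/local/lib folder fixes this.

That said, boilerpipe's extraction quality is not ideal and leaves room for improvement. Another option is to write site-specific extraction rules to pull out the text.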
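When chasing a NoClassDefFound problem like this, a quick first step is to check whether the classes can be loaded at all from a given classpath. The sketch below is a hypothetical diagnostic (`ClasspathCheck` is not part of Nutch). The boilerpipe class name comes from the imports above. The NekoHTML and Xerces class names are my assumption of representative classes from those jars. Note that it tests the classpath it is run with, which is not necessarily what Nutch's plugin class loader sees.

```java
/** Reports whether classes are loadable from the current classpath (hypothetical). */
final class ClasspathCheck {

    /** Returns true if the class can be located without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError for missing dependencies.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "de.l3s.boilerpipe.extractors.CommonExtractors", // boilerpipe
            "org.cyberneko.html.parsers.DOMParser",          // NekoHTML (assumed class)
            "org.apache.xerces.parsers.DOMParser"            // Xerces (assumed class)
        };
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Running it with the contents of runtime/local/lib on the classpath shows whether the copied jars are actually being picked up.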

Nutch custom development: extracting main body text in parse
