Crawling a Website's Content, Building an Index, and Performing Search

2019-04-26 14:27:33 | Author: Zizhen | Tags: index, search | Views: 2610

The walkthrough below crawls 51Job as the example.

Crawling and search are quite popular features these days, so I wanted to build a small version of them myself. I hope it will be helpful to you.

First, to store crawled data we need a matching database table. I wrote a POJO, and you can create the table from it. The database is MySQL; everything below uses 51Job as the example.
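
The post does not include the DDL, but as a rough sketch (column names mirror the POJO below; the types and lengths are my guesses, so adjust them for your data) the MySQL table might look like:

```sql
-- Hypothetical DDL mirroring the Jobcrawl POJO; tweak types and lengths as needed.
CREATE TABLE jobcrawl (
  toid          VARCHAR(32) NOT NULL PRIMARY KEY, -- primary key
  jobname       VARCHAR(255),  -- job title
  companyname   VARCHAR(255),  -- company name
  comtype       VARCHAR(64),   -- company type
  publishtime   VARCHAR(32),   -- publish date
  place         VARCHAR(128),  -- work location
  requirecount  INT,           -- number of openings
  workyear      VARCHAR(32),   -- years of experience required
  qualifications VARCHAR(64),  -- education requirement
  jobcontent    TEXT,          -- job description
  url           VARCHAR(512),  -- detail-page URL
  industry      VARCHAR(128),  -- company industry
  comscale      VARCHAR(64)    -- company size
) DEFAULT CHARSET=utf8;
```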

Write a POJO with constructors for initialization:

public class Jobcrawl implements java.io.Serializable {
  // primary key
  private String toid;
  // job title
  private String jobname;
  // company name
  private String companyname;
  // company type
  private String comtype;
  // publish date of the posting
  private String publishtime;
  // work location
  private String place;
  // number of openings
  private Integer requirecount;
  // years of work experience required
  private String workyear;
  // education requirement
  private String qualifications;
  // job description
  private String jobcontent;
  // URL of the company's detail page (on 51job)
  private String url;
  // company industry
  private String industry;
  // company size
  private String comscale;
  // Constructors
  public String getComscale() {
    return comscale;
  }
  public void setComscale(String comscale) {
    this.comscale = comscale;
  }
  public String getIndustry() {
    return industry;
  }
  public void setIndustry(String industry) {
    this.industry = industry;
  }
  /** default constructor */
  public Jobcrawl() {
  }
  /** minimal constructor */
  public Jobcrawl(String toid) {
    this.toid = toid;
  }
  /** full constructor */
  public Jobcrawl(String toid, String jobname, String companyname,String comtype, String publishtime, String place,
Integer requirecount, String workyear, String qualifications,
String jobcontent, String url) {
    this.toid = toid;
    this.jobname = jobname;
    this.companyname = companyname;
    this.comtype = comtype;
    this.publishtime = publishtime;
    this.place = place;
    this.requirecount = requirecount;
    this.workyear = workyear;
    this.qualifications = qualifications;
    this.jobcontent = jobcontent;
    this.url = url;
  }
  public String getToid() {
    return this.toid;
  }
  public void setToid(String toid) {
    this.toid = toid;
  }
  public String getJobname() {
    return this.jobname;
  }
  public void setJobname(String jobname) {
    this.jobname = jobname;
  }
  public String getCompanyname() {
    return this.companyname;
  }
  public void setCompanyname(String companyname) {
    this.companyname = companyname;
  }
  public String getComtype() {
    return this.comtype;
  }
  public void setComtype(String comtype) {
    this.comtype = comtype;
  }
  public String getPublishtime() {
    return this.publishtime;
  }
  public void setPublishtime(String publishtime) {
    this.publishtime = publishtime;
  }
  public String getPlace() {
    return this.place;
  }
  public void setPlace(String place) {
    this.place = place;
  }
  public Integer getRequirecount() {
    return this.requirecount;
  }
  public void setRequirecount(Integer requirecount) {
    this.requirecount = requirecount;
  }
  public String getWorkyear() {
    return this.workyear;
  }
  public void setWorkyear(String workyear) {
    this.workyear = workyear;
  }
  public String getQualifications() {
    return this.qualifications;
  }
  public void setQualifications(String qualifications) {
    this.qualifications = qualifications;
  }
  public String getJobcontent() {
    return this.jobcontent;
  }
  public void setJobcontent(String jobcontent) {
    this.jobcontent = jobcontent;
  }
  public String getUrl() {
    return this.url;
  }
  public void setUrl(String url) {
    this.url = url;
  }
}

To crawl, we have to fetch the site's URLs and then run a search within the site:

1. Pick the content you want to crawl. For example, to crawl Java-related jobs in Shanghai, you need to fetch the corresponding search-result URL.

2. Then open a specific job posting and inspect the page source to find the fields you want to crawl: attributes, elements, ids, tag names. Parse the page with Jsoup, extract the information, and store it in the database.

3. Once the data is in the database, run the fields through a word-segmentation analyzer and build an index on them. You can also index all the fields together (which is what I did).

4. Enter a keyword; any record whose fields contain that word is returned, with the keyword highlighted (highlighting is an extra feature I added).
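
To illustrate step 2: the real code below uses Jsoup for parsing, but the core idea (locate a marker in the HTML and pull out the text next to it) can be sketched with nothing but the JDK's regex support. The HTML fragment here is made up for the demo:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractDemo {
    public static void main(String[] args) {
        // A made-up fragment of a job-detail page.
        String html = "<td class=\"sr_bt\">Java Developer</td>"
                    + "<strong>Industry:</strong>IT Services";
        // Grab the text inside the td with class sr_bt (the job title).
        Matcher m = Pattern.compile("<td class=\"sr_bt\">([^<]+)</td>").matcher(html);
        if (m.find()) {
            System.out.println(m.group(1)); // prints the extracted job title
        }
    }
}
```

Jsoup does the same thing more robustly with CSS selectors, which is why the implementation below uses it instead of regexes.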

All right, with the analysis done, let's implement it:

First, crawl the site.

Define the interface:

public interface CrawlService {
  public void doCrawl()throws Exception;
}

Implement the interface:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.springframework.orm.hibernate3.support.HibernateDaoSupport;
import pojo.Jobcrawl;
public class CrawlServiceImpl extends HibernateDaoSupport implements  CrawlService {
                                       
  @Override
  public void doCrawl() throws Exception{
    HttpClient httpClient=new HttpClient();
    GetMethod getMethod=new GetMethod("http://search.51job.com/list/%2B,%2B,%2B,%2B,%2B,%2B,java,2,%2B.html?lang=c&stype=1");
    httpClient.executeMethod(getMethod);
    String html=getMethod.getResponseBodyAsString();
    html=new String(html.getBytes("iso8859-1"),"gb2312");
                                       
    Document doc=Jsoup.parse(html);
    Element totalCountEle=doc.select("table.navBold").select("td").get(1);
    String totalCountStr=totalCountEle.text();
    totalCountStr=totalCountStr.split("/")[1].trim();
    int totalCount=Integer.parseInt(totalCountStr);
    // total number of pages (integer division truncates)
    int pageCount=totalCount/30;
    // NOTE: only the first 4 pages are crawled here; use pageCount to crawl them all
    for(int currentPage=1;currentPage<5;currentPage++){
    GetMethod gmPerPage=new GetMethod("http://search.51job.com/jobsearch/search_result.php?curr_page="+currentPage+"&&keyword=java");
    httpClient.executeMethod(gmPerPage);
    String perPageHtml=gmPerPage.getResponseBodyAsString();
    perPageHtml=new String(perPageHtml.getBytes("iso8859-1"),"gb2312");
    Document pageDoc=Jsoup.parse(perPageHtml);
    Elements eles=pageDoc.select("a.jobname");
    for(int i=0;i<eles.size();i++){
     Element ele=eles.get(i);
     // URL of the job detail page
     String detailUrl=ele.attr("href");
     GetMethod detailGet=new GetMethod(detailUrl);
     httpClient.executeMethod(detailGet);
     String detailHtml=detailGet.getResponseBodyAsString();
   detailHtml=new String(detailHtml.getBytes("iso8859-1"),"gb2312");
     Document detailDoc=Jsoup.parse(detailHtml);
     // get the job title
     Elements detailEles=detailDoc.select("td.sr_bt");
     Element jobnameEle=detailEles.get(0);
     String jobname=jobnameEle.text();
     System.out.println("Job title: "+jobname);
                                          
     // get the company name
     Elements companyEles=detailDoc.select("table.jobs_1");
     Element companyEle=companyEles.get(0);
     Element companyEle_Rel=companyEle.select("a").get(0);
     String companyName=companyEle_Rel.text();
     System.out.println("Company name: "+companyName);
                                       
     // company industry
     Elements comp_industry=detailDoc.select("strong:contains(公司行业)");
     String comp_industry_name="";
     if(comp_industry.size()>0){
       Element comp_ele=comp_industry.get(0);
       TextNode comp_ele_real=(TextNode)comp_ele.nextSibling();
       comp_industry_name=comp_ele_real.text();
       System.out.println("Company industry: "+comp_industry_name);
     }
                                          
     // company type
     Elements compTypeEles=detailDoc.select("strong:contains(公司性质)");
     String comType="";
     if(compTypeEles.size()>0){
       Element compTypeEle=compTypeEles.get(0);
       TextNode comTypeNode=(TextNode)compTypeEle.nextSibling();
       comType=comTypeNode.text();
       System.out.println("Company type: "+comType);
     }
                                          
     // company size
     Elements compScaleEles=detailDoc.select("strong:contains(公司规模)");
     String comScale="";
     if(compScaleEles.size()>0){
       comScale=((TextNode)compScaleEles.get(0).nextSibling()).text();
       System.out.println("Company size: "+comScale);
     }
     // publish date
     Elements publishTimeEles=detailDoc.select("td:contains(发布日期)");
     Element publishTimeEle=publishTimeEles.get(0).nextElementSibling();
     String publishTime=publishTimeEle.text();
     System.out.println("Publish date: "+publishTime);
                                         
     // work location
     Elements placeEles=detailDoc.select("td:contains(工作地点)");
     String place="";
     if(placeEles.size()>0){
       place=placeEles.get(0).nextElementSibling().text();
       System.out.println("Work location: "+place);
     }
                                          
     Elements jobDeteilEle=detailDoc.select("td.txt_4.wordBreakNormal.job_detail");
     Elements jobDetailDivs=jobDeteilEle.get(0).select("div");
     Element jobDetailDiv=jobDetailDivs.get(0);
     String jobcontent=jobDetailDiv.html();
                                          
     Jobcrawl job=new Jobcrawl();
     job.setJobname(jobname);
     job.setCompanyname(companyName);
     job.setIndustry(comp_industry_name);
     job.setComtype(comType);
     job.setComscale(comScale);
     job.setPublishtime(publishTime);
     job.setPlace(place);
     job.setJobcontent(jobcontent);
     this.getHibernateTemplate().save(job);
     System.out.println("=");  
    }
    }  
  }
}
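
One detail worth noting in doCrawl(): `totalCount/30` uses truncating integer division, so a leftover partial page is dropped (and the loop above only visits the first four pages anyway). A sketch of the usual ceiling trick, if you want every page:

```java
public class PageCountDemo {
    // Ceiling division: number of pages needed to hold totalCount results,
    // at pageSize results per page.
    static int pageCount(int totalCount, int pageSize) {
        return (totalCount + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        System.out.println(pageCount(90, 30)); // 90 results fit exactly in 3 pages
        System.out.println(pageCount(91, 30)); // one extra result needs a 4th page
    }
}
```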

Build the index. Create a folder to hold the index files:

Define the interface, and declare the folder where the index will be stored:

public interface IndexService {
  public static final String INDEXPATH="D:\\Workspaces\\Job51\\indexDir";
                            
  public void createIndex() throws Exception;
}
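
The index path above is hard-coded with Windows separators. If you want it portable, the JDK can assemble the path from segments instead (the folder names here are just the ones from the constant):

```java
import java.io.File;

public class IndexPathDemo {
    public static void main(String[] args) {
        // Build the path from segments instead of hard-coding "\\" separators.
        File indexDir = new File(new File("Workspaces", "Job51"), "indexDir");
        System.out.println(indexDir.getPath()); // uses the platform's separator
    }
}
```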

Implement the interface:

import java.io.File;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.springframework.orm.hibernate3.support.HibernateDaoSupport;
import org.wltea.analyzer.lucene.IKAnalyzer;
import pojo.Jobcrawl;
public class IndexServiceImpl extends HibernateDaoSupport implements IndexService {
  public void createIndex() throws Exception {
                            
    // directory object for the index folder
    Directory dir=FSDirectory.open(new File(IndexService.INDEXPATH));
    // Chinese analyzer
    Analyzer analyzer=new IKAnalyzer();
    // configuration for the IndexWriter
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
    // create the IndexWriter, which writes the index
    IndexWriter writer = new IndexWriter(dir, iwc);
    List<Jobcrawl> list=this.getHibernateTemplate().find("from Jobcrawl");
    writer.deleteAll();
    for(Jobcrawl job:list){
    Document doc=new Document();
    Field toidField=new Field("toid",job.getToid(),Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS);
    doc.add(toidField);
                            
    Field jobField=new Field("jobname",job.getJobname(),Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS);
    doc.add(jobField);
                            
    Field companyField=new Field("companyname",job.getCompanyname(),Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS);
    doc.add(companyField);
                            
    Field placeField=new Field("place",job.getPlace(),Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS);
    doc.add(placeField);
                            
    Field publishTimeField=new Field("publishTime",job.getPublishtime(),Field.Store.YES,Field.Index.NOT_ANALYZED_NO_NORMS);
    doc.add(publishTimeField);
                            
    // add all the fields to the index in one catch-all field
    String content=job.getJobname()+job.getComtype()+job.getIndustry()+job.getPlace()+job.getWorkyear()+job.getJobcontent();
    Field contentField=new Field("content",content,Field.Store.NO,Field.Index.ANALYZED);
    doc.add(contentField);
    writer.addDocument(doc);
    }
    writer.close();
  }
}
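
One caveat about the catch-all content field above: the fields are concatenated with no separator, so the last token of one field can fuse with the first token of the next, and the analyzer may index the fused form. Joining with a space is a safer sketch (the field values here are made up):

```java
public class ContentJoinDemo {
    public static void main(String[] args) {
        String jobname = "Java Developer";
        String place = "Shanghai";
        // Plain concatenation fuses "Developer" and "Shanghai" into one token.
        String fused = jobname + place;
        // Joining with a space keeps the token boundary intact.
        String joined = String.join(" ", jobname, place);
        System.out.println(fused);  // Java DeveloperShanghai
        System.out.println(joined); // Java Developer Shanghai
    }
}
```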

Perform the search. Define the interface:

import java.util.List;
import pojo.Jobcrawl;
public interface SearchService {
  public List<Jobcrawl> searchJob(String keyword)throws Exception;
}

Implement the interface:

import java.io.File;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;
import pojo.Jobcrawl;
public class SearchServiceImpl implements SearchService {
  @Override
  public List<Jobcrawl> searchJob(String keyword)throws Exception {
    IndexSearcher searcher = new IndexSearcher(FSDirectory
      .open(new File(IndexService.INDEXPATH)));
    // Chinese analyzer; the analyzer used for querying must match the one used for indexing
    Analyzer analyzer = new IKAnalyzer();
    // create the query parser
    QueryParser parser = new QueryParser(Version.LUCENE_34,"content",
      analyzer);
    // build the query object from the keyword
    Query query = parser.parse(keyword);
    TopDocs top_docs=searcher.search(query,20);
    ScoreDoc[] docs=top_docs.scoreDocs;
    // highlighting
    SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color=red>", "</font>");
    Highlighter highlighter = new Highlighter(simpleHTMLFormatter,new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(1024));
    List<Jobcrawl> list=new ArrayList<Jobcrawl>();
    for(ScoreDoc sd:docs){
    Document pojoDoc=searcher.doc(sd.doc);
    Jobcrawl job=new Jobcrawl();
    job.setToid(pojoDoc.get("toid"));
                          
    String jobname=pojoDoc.get("jobname");
    TokenStream tokenStream = analyzer.tokenStream("jobname",new StringReader(jobname)); 
    String jobname_high = highlighter.getBestFragment(tokenStream,jobname);
    if(jobname_high!=null){
      jobname=jobname_high;
    }
    job.setJobname(jobname);
                          
    String companyname=pojoDoc.get("companyname");
    tokenStream = analyzer.tokenStream("companyname",new StringReader(companyname));
    String companyname_high=highlighter.getBestFragment(tokenStream,companyname);
    if(companyname_high!=null){
      companyname=companyname_high;
    }
    job.setCompanyname(companyname);
                          
    String place=pojoDoc.get("place");
    tokenStream = analyzer.tokenStream("place",new StringReader(place));
    String place_high=highlighter.getBestFragment(tokenStream,place);
    if(place_high!=null){
      place=place_high;
    }
    job.setPlace(place);
                          
    job.setPublishtime(pojoDoc.get("publishTime"));
    list.add(job);
    }
    return list;
  }
  public static void main(String[] args) throws Exception{
    String keyword="android";
    IndexSearcher searcher = new IndexSearcher(FSDirectory
      .open(new File(IndexService.INDEXPATH)));
    // Chinese analyzer; the analyzer used for querying must match the one used for indexing
    Analyzer analyzer = new IKAnalyzer();
    // create the query parser
    QueryParser parser = new QueryParser(Version.LUCENE_34,"content",
      analyzer);
    // build the query object from the keyword
    Query query = parser.parse(keyword);
    TopDocs top_docs=searcher.search(query,20);
    ScoreDoc[] docs=top_docs.scoreDocs;

    // highlighting
    SimpleHTMLFormatter simpleHTMLFormatter = new SimpleHTMLFormatter("<font color=red>", "</font>");
    Highlighter highlighter = new Highlighter(simpleHTMLFormatter,new QueryScorer(query));
    highlighter.setTextFragmenter(new SimpleFragmenter(1024));
    for(ScoreDoc sd:docs){
    Document pojoDoc=searcher.doc(sd.doc);
    String jobname=pojoDoc.get("jobname");
    TokenStream tokenStream = analyzer.tokenStream("jobname",new StringReader(jobname)); 
    String highLightText = highlighter.getBestFragment(tokenStream,jobname);
    System.out.println(highLightText);
    }   
  }
}
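
For reference, the highlighter wraps each match in the tags given to SimpleHTMLFormatter. The visible effect on a single term can be mimicked with a plain case-insensitive regex replacement. This is only an illustration, not what Lucene does internally; Lucene matches analyzed tokens, not raw substrings:

```java
public class HighlightDemo {
    public static void main(String[] args) {
        String jobname = "Senior Java Engineer";
        // Wrap every case-insensitive occurrence of the keyword in a red font tag,
        // mimicking SimpleHTMLFormatter("<font color=red>", "</font>").
        String highlighted = jobname.replaceAll("(?i)java", "<font color=red>$0</font>");
        System.out.println(highlighted); // Senior <font color=red>Java</font> Engineer
    }
}
```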

An admin page to trigger the crawl and the index build:

<%@ page language="java" pageEncoding="UTF-8"%>
<%
String path = request.getContextPath();
%>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <title>Admin Console</title>
            
  <script type="text/javascript" src="<%=path%>/js/jquery-1.7.1.min.js"></script>
  <script type="text/javascript" src="<%=path%>/bootstrap/js/bootstrap.min.js"></script>
   <link rel="stylesheet" media="screen"
    href="<%=path%>/bootstrap/css/bootstrap.min.css">
    <link rel="stylesheet" href="<%=path%>/bootstrap/css/bootstrap-responsive.min.css">
            
  </head>
            
  <body>
  <div >
  <div >
  <div >
   <ul >
   <li ><a href="<%=path%>/index/userAction!doCrawl.action"> Crawl data</a></li>
   <li ><a href="<%=path%>/index/userAction!doIndex.action">Build index</a></li>
   <li ><a href="#">Edit account</a></li>
   </ul>
  </div>
  </div>
  </div>
             
  </body>
</html>

Next, a search box:

<%@ page language="java" pageEncoding="UTF-8"%>
<%
String path = request.getContextPath();
%>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  <title>Welcome to the Search Engine</title>
  <script type="text/javascript" src="<%=path%>/js/jquery-1.7.1.min.js"></script>
  <script type="text/javascript" src="<%=path%>/bootstrap/js/bootstrap.min.js"></script>
   <link rel="stylesheet" media="screen"
    href="<%=path%>/bootstrap/css/bootstrap.min.css">
    <link rel="stylesheet" href="<%=path%>/bootstrap/css/bootstrap-responsive.min.css">
  </head>
                 
  <body>
  <div >
  <div >
    <div ><a href="<%=path%>/account/accountAction!toRegister.action">Register</a>,<a href="<%=path%>/account/accountAction!toLogin.action">Log in</a></div>
  </div>
   <form action="<%=path%>/web/userAction!searchJob.action" >
   <div  >
    <div >&nbsp;</div>
    <div >
                   
   <input type="text" name="keyword" >
                   
    </div>
    <div  ><button type="submit" >Search</button></div>
    <div >&nbsp;</div>
  </div>
  </form>
  </div>
  </body>
</html>

Displaying the job list:

<%@ page language="java" import="java.util.*" pageEncoding="UTF-8"%>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c"%>
<%
String path = request.getContextPath();
%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  <title>Job Search</title>
  <script type="text/javascript" src="<%=path%>/js/jquery-1.7.1.min.js"></script>
  <script type="text/javascript" src="<%=path%>/bootstrap/js/bootstrap.min.js"></script>
   <link rel="stylesheet" media="screen"
    href="<%=path%>/bootstrap/css/bootstrap.min.css">
    <link rel="stylesheet" href="<%=path%>/bootstrap/css/bootstrap-responsive.min.css">
  </head>
               
  <body>
  <div >
  <div  >
    <div >
    <form action="<%=path%>/web/userAction!searchJob.action" >
    <div>
    <input type="text" Style="height:30px" name="keyword" value="${param.keyword}" >
    <button type="submit" >Search</button>
    </div>
    </form>
    </div>
  </div>
  <div >
    <div >
   <table >
   <tr>
    <td>Job Title</td>
    <td>Company</td>
    <td>Location</td>
    <td>Updated</td>
   </tr>
   <c:forEach var="job" items="${requestScope.results}" >
   <tr>
    <td><a target="_blank" href="<%=path%>/web/userAction!searchJobDetail.action?toid=${job.toid}">${job.jobname}</a></td>
    <td>${job.companyname}</td>
    <td>${job.place}</td>
    <td>${job.publishtime}</td>
   </tr>
   </c:forEach>
   </table>
    </div>
  </div>
  </div>
  </body>
</html>

Don't forget the second directive at the very top (the JSTL taglib declaration); the page uses the c: tags.

The above is built with the three classic frameworks (Struts2 + Spring + Hibernate); you'll need to write the remaining CRUD operations yourself.
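
The action URLs like userAction!doCrawl.action suggest Struts2's bang notation (Dynamic Method Invocation). A hypothetical struts.xml fragment for the admin links might look like this; the class reference and result page are guesses, not from the source:

```xml
<!-- Hypothetical mapping; the "!" syntax requires Dynamic Method Invocation. -->
<constant name="struts.enable.DynamicMethodInvocation" value="true"/>
<package name="index" namespace="/index" extends="struts-default">
  <action name="userAction" class="userAction">
    <result name="success">/index.jsp</result>
  </action>
</package>
```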

1. Build the index

2. Perform the search

3. The search results


Attachment: http://down.51cto.com/data/2364017