Building a search engine with Nutch


This content is licensed under the CC 4.0 BY-SA license.


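This article describes how to build a simple search engine with Nutch and Lucene, covering environment configuration, software installation, and crawler setup.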

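Original source: http://blog.sina.com.cn/s/blog_694448320100kzsc.html

With some free time over the past couple of days, I looked at the open-source projects Lucene and Nutch and, on impulse, used Nutch to build a small Baidu-style search engine. Along the way I consulted one of the later chapters of Yu Tian'en's (于天恩) book 《LUCENE搜索引擎开发权威经典》, which explains Lucene in a clear and accessible way. The steps for setting up Nutch follow.

Software to prepare:

Cygwin: http://inst.eecs.berkeley.edu/~instcd/iso//cygwin-release-20061108.iso
Nutch: http://apache.freelamp.com/nutch/nutch-0.9.tar.gz
Tomcat: (no download link provided; find it yourself)
JDK: (no download link provided; find it yourself)

Installation is not covered here; instructions are easy to find online. The configuration needed after installation is described below.

Environment configuration:

1. Edit C:/nutch-0.9/conf/nutch-site.xml.

The copy of this file that ships with Nutch contains no properties, so add the ones below. The values can be changed to suit your own crawler.

---------------------------------------------
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>Nutch</value>
    <description>Peter Pu Wang</description>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>Nutch,*</value>
    <description>The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. You should put the value of http.agent.name as the first agent name, and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,*</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Nutch</value>
    <description>Further description of our bot; this text is used in the User-Agent header. It appears in parentheses after the agent name.</description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://lucene.apache.org/nutch/bot.html</value>
    <description>A URL to advertise in the User-Agent header. This will appear in parentheses after the agent name. Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler.</description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>nutch-agent@lucene.apache.org</value>
    <description>An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming.</description>
  </property>
</configuration>
----------------------------------------------------

2. In the Nutch root directory, create a directory named urls, and inside it a file named url.txt containing the address the crawler should start from, for example:

http://www.qq.com/

3. Edit C:/nutch-0.9/conf/crawl-urlfilter.txt.

Only the line +^http://www.qq.com/ needs to change: replace the address with the site you want to crawl. The stock file has a regular expression at this position, and you can write your own regular expression there to match more sites.

------------------------------------------------
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
+^http://www.qq.com/

# skip everything else
-.
------------------------------------------------

4. Set the system environment variables for Nutch:

NUTCH_JAVA_HOME: C:/Program Files/Java/jdk1.6.0_13
classpath: %JAVA_HOME%/lib/dt.jar;%JAVA_HOME%/lib/htmlconverter.jar;%JAVA_HOME%/lib/tools.jar;%JAVA_HOME%/sample/jnlp/servlet/jnlp-servlet.jar;%JAVA_HOME%/lib/jconsole.jar

5. Crawl pages from Cygwin.

After Cygwin is installed, a Cygwin icon appears on the desktop; opening it starts a console.

Step 1: change to the root directory of the unpacked Nutch download:

cd c:/nutch-0.9

Step 2: run the crawl command:

bin/nutch crawl urls -dir mydir -depth 10

The crawl command accepts a number of optional parameters; the ones above are those I used.

mydir: the directory in which the crawled data is stored. The name is arbitrary; when the crawl finishes, a mydir directory appears under the Nutch root.
-depth 10: the link depth of the crawl. I used 10; a larger or smaller value also works.

6. Deploy the Nutch web application.

Copy C:/nutch-0.9/nutch-0.9.war into Tomcat's webapps directory and let Tomcat deploy it.

Configuring the deployed application:

1. Edit D:/apache-tomcat-6.0.18/webapps/nutch-0.9/WEB-INF/classes/hadoop-site.xml so that it contains the property below. The mydir directory is the one created by the crawl, so the name must match the one passed to -dir.

<property>
  <name>searcher.dir</name>
  <value>C:/nutch-0.9/mydir</value>
</property>

2. Start Tomcat.

Once everything above is configured, run the crawl from step 5 first so that the data exists, then start Tomcat. The search page can then be opened in a browser.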
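The comments in crawl-urlfilter.txt (step 3) state the matching rule: each line is a regular expression prefixed by '+' or '-', the first matching pattern decides, and a URL that matches nothing is ignored. The following Python sketch imitates that rule so you can check a URL against your patterns before starting a long crawl. It is not Nutch's own implementation: the helper names parse_rules and accepts are made up for this illustration, the patterns are written with backslash escapes (\. and \1), and the sketch assumes a pattern may match anywhere in the URL (Python's re.search), which may differ in detail from Java's regex behavior.

```python
import re

# Rules from the crawl-urlfilter.txt in step 3 (comments shortened).
RULES_TEXT = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times
-.*(/.+?)/.*?\1/.*?\1/
# accept the site being crawled
+^http://www.qq.com/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (include, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule; no match means ignore."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False


RULES = parse_rules(RULES_TEXT)

for url in ("http://www.qq.com/news/index.html",
            "http://www.qq.com/logo.gif",
            "http://www.baidu.com/"):
    print(url, "->", "accept" if accepts(RULES, url) else "ignore")
```

Because the first match wins, the order of the lines matters: the '+' line for your site has to come before the final "-." line, and any '-' rule placed above it still rejects URLs on your own site (image files and query URLs, for example).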