注册 登录  
 加关注
   显示下一条  |  关闭
温馨提示!由于新浪微博认证机制调整,您的新浪微博帐号绑定已过期,请重新绑定!立即重新绑定新浪微博》  |  关闭

和申的个人主页

专注于java开发,1985wanggang

 
 
 

日志

 
 

使用 HtmlUnit引擎解析javascript,抓取网页。  

2013-07-12 15:36:10|  分类: 爬虫相关 |  标签: |举报 |字号 订阅

  下载LOFTER 我的照片书  |
Java开源项目

htmlunit 是一款开源的java 页面分析工具,读取页面后,可以有效的使用htmlunit分析页面上的内容。项目可以模拟浏览器运行,被誉为java浏览器的开源实现。这个没有界面的浏览器,运行速度也是非常迅速的。
编辑本段
Javascript引擎

采用的是Rhinojs引擎。模拟js运行
编辑本段
主要用途
常规意义上,该项目可以用来进行页面的测试工作,实现网页自动化测试,(包括JS)
但是一般来说,在小型爬虫项目中,这种框架十分常用,可以有效的分析出 dom的标签,并且有效的运行页面上的js以便得到一些需要执行JS才能得到的值。

对于一个内部需要javassript解析,需要跳转的页面,使用HtmlUnit 可以很方便的自动跳转到最终的页面。

String url = "http://out.zhe800.com/ju/deal/maorongwan_40990";
// 模拟一个浏览器
WebClient webClient = new WebClient();
// 设置webClient的相关参数
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setTimeout(35000);
webClient.getOptions().setThrowExceptionOnScriptError(false);


// 模拟浏览器打开一个目标网址
HtmlPage rootPage = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
rootPage = webClient.getPage(url);
System.out.println(rootPage.getUrl());

//System.out.println(rootPage.getHead().asXml());
//System.out.println(rootPage.getBody().asXml());
System.out.println("==================== all info ====================");
System.out.println(rootPage.asXml());

页面输出如下,可以看到成功跳转到最终的页面,然后打出了最终的网址和网页内容:
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:434)
at com.gargoylesoftware.htmlunit.WebClient.loadDownloadedResponses(WebClient.java:2289)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:697)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:784)
at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1255)
at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1198)
at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1158)
at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.click(HTMLElement.java:2106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:120)
at net.sourceforge.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:448)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1531)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:405)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:275)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3031)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:546)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:654)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:601)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:507)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:555)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:530)
at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:979)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:337)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:415)
at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:260)
at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:276)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:676)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:635)
at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3074)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2041)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:892)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:241)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:187)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:434)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:309)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:374)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:359)
at Test.main(Test.java:58)
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: ReferenceError: "TShop" is not defined. (script in http://detail.tmall.com/item.htm?id=17444693020&ali_trackid=2:mm_40395660_0_0,438901857:1373614462_6k3_768915461&spm=2014.21426951.1.0 from (978, 32) to (978, 60)#978)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3603)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3587)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.notFoundError(ScriptRuntime.java:3657)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.name(ScriptRuntime.java:1685)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1622)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:118)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:546)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:654)
... 82 more
Enclosed exception: 
net.sourceforge.htmlunit.corejs.javascript.EcmaError: ReferenceError: "TShop" is not defined. (script in http://detail.tmall.com/item.htm?id=17444693020&ali_trackid=2:mm_40395660_0_0,438901857:1373614462_6k3_768915461&spm=2014.21426951.1.0 from (978, 32) to (978, 60)#978)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3603)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3587)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.notFoundError(ScriptRuntime.java:3657)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.name(ScriptRuntime.java:1685)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1622)
at script(script in http://detail.tmall.com/item.htm?id=17444693020&ali_trackid=2:mm_40395660_0_0,438901857:1373614462_6k3_768915461&spm=2014.21426951.1.0 from (978, 32) to (978, 60):978)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:118)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:546)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:654)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:601)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:507)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:555)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:530)
at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:979)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:337)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:415)
at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:260)
at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:276)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:676)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:635)
at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3074)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2041)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:892)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:241)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:187)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:434)
at com.gargoylesoftware.htmlunit.WebClient.loadDownloadedResponses(WebClient.java:2289)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:697)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:784)
at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1255)
at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1198)
at com.gargoylesoftware.htmlunit.html.HtmlElement.click(HtmlElement.java:1158)
at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLElement.click(HTMLElement.java:2106)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:120)
at net.sourceforge.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:448)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1531)
at script.bol(script in http://s.click.taobao.com/t_js?tu=http%3A%2F%2Fs.click.taobao.com%2Ft%3Fe%3Dm%253D2%2526s%253DDIRCqCGUzuwcQipKwQzePOeEDrYVVa64yK8Cckff7TVRAdhuF14FMb7qCLa%252BrnRbt4hWD5k2kjOHPBikNt6ub9QRGlh44dSUO3o2WZUdy2KBoEHZXsbZGZV%252FuXncgIhTxGL99wIqWvSss8W%252F04dFBjOjShYn6lyE%26spm%3D2014.21426951.1.0%26unid%3D438901857%26ref%3D%26et%3DjFBDO397%252F7fmQw%253D%253D from (6, 32) to (49, 10):44)
at script(script in http://s.click.taobao.com/t_js?tu=http%3A%2F%2Fs.click.taobao.com%2Ft%3Fe%3Dm%253D2%2526s%253DDIRCqCGUzuwcQipKwQzePOeEDrYVVa64yK8Cckff7TVRAdhuF14FMb7qCLa%252BrnRbt4hWD5k2kjOHPBikNt6ub9QRGlh44dSUO3o2WZUdy2KBoEHZXsbZGZV%252FuXncgIhTxGL99wIqWvSss8W%252F04dFBjOjShYn6lyE%26spm%3D2014.21426951.1.0%26unid%3D438901857%26ref%3D%26et%3DjFBDO397%252F7fmQw%253D%253D from (6, 32) to (49, 10):48)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:405)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:275)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3031)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:546)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:654)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:601)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:507)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:555)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:530)
at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:979)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:337)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:415)
at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:260)
at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:276)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:676)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:635)
at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3074)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2041)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:892)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:241)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:187)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:434)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:309)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:374)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:359)
at Test.main(Test.java:58)
======= EXCEPTION END ========
http://detail.tmall.com/item.htm?id=17444693020&ali_trackid=2:mm_40395660_0_0,438901857:1373614462_6k3_768915461&spm=2014.21426951.1.0
==================== all info ====================
<?xml version="1.0" encoding="GBK"?>
<html class="TMDstandard ks-trident0 ks-trident ks-ie8 ks-ie">
  <head>
    <script ........
</html>
========================================================================================
========================================================================================

相关信息:
统计信息唧唧歪歪唧唧网ggyygg.net
  评论这张
 
阅读(4222)| 评论(4)
推荐 转载

历史上的今天

在LOFTER的更多文章

评论

<#--最新日志,群博日志--> <#--推荐日志--> <#--引用记录--> <#--博主推荐--> <#--随机阅读--> <#--首页推荐--> <#--历史上的今天--> <#--被推荐日志--> <#--上一篇,下一篇--> <#-- 热度 --> <#-- 网易新闻广告 --> <#--右边模块结构--> <#--评论模块结构--> <#--引用模块结构--> <#--博主发起的投票-->
 
 
 
 
 
 
 
 
 
 
 
 
 
 

页脚

网易公司版权所有 ©1997-2016