I worked out this method after several days of hard work. It is very simple and efficient to develop with.
Steps:
1 Download the required web page
2 Tidy the web page into a standard XHTML file
2.1 Translate Entities &--> &
2.2 Enforce strict tag pairs: close <span> <meta> <br> <link> <img>
2.3 Add XML features: PI, encoding, ...
2.4 The Quote Symbol
3 Retag the current XHTML with the following rules:
method 1: add "_d(num)" to current tags
where (num) is the node depth from the document root.
method 2: add "_tl(num)" to current tags
where (num) is the table depth of the current node relative to the body node.
Both rules are applied to all nodes except processing instructions, comment nodes and script nodes.
4 Write out the re-tagged XHTML as an XML file
Remove the XHTML namespace here, otherwise unprefixed match patterns in the XSLT will not match the elements
5 Write the corresponding XSLT file
Note: clear your special templates or elements
6 Write an accurate schema file
7 Transform to get the final XML file.
Make sure that you have the correct character encoding. Otherwise, MSXML will fail.
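Steps 3 and 4 could be sketched roughly as follows, using Python's standard xml.etree.ElementTree. This is only an illustration under several assumptions that the post does not spell out: the suffix is written without parentheses (for example "table_d2"), because parentheses are not allowed in XML names; the root element has depth 0; and the table depth counts the enclosing table ancestors, so a top-level table has level 0. The function names are made up for this sketch. ElementTree's default parser already discards comments and processing instructions, so only script elements need to be skipped explicitly.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def strip_namespace(root):
    """Step 4: drop the XHTML namespace from every element tag."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and elem.tag.startswith(XHTML_NS):
            elem.tag = elem.tag[len(XHTML_NS):]

def retag_by_depth(elem, depth=0):
    """Method 1: append "_d<num>", num = depth from the root (root assumed 0)."""
    children = list(elem)
    if elem.tag != "script":  # script nodes keep their original tag
        elem.tag = f"{elem.tag}_d{depth}"
    for child in children:
        retag_by_depth(child, depth + 1)

def retag_by_table_level(elem, level=0):
    """Method 2: append "_tl<num>", num = number of enclosing table ancestors."""
    children = list(elem)
    original = elem.tag
    if original != "script":
        elem.tag = f"{original}_tl{level}"
    # Everything inside a table sits one table level deeper.
    next_level = level + 1 if original == "table" else level
    for child in children:
        retag_by_table_level(child, next_level)

# Small made-up document standing in for the tidied page from step 2.
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<table><tr><td><a href="x.html">link</a></td></tr></table>'
    '</body></html>'
)
root = ET.fromstring(doc)
strip_namespace(root)
retag_by_depth(root)
print(ET.tostring(root, encoding="unicode"))
```

With tags such as "a_d5" or "td_tl1", an XSLT template can match a node at one specific depth or table level by name alone, which appears to be the point of the retagging.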
Nice steps.
My question is: how do I access an attribute value, for example the href in <a href=""> </a>?
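One possible answer, offered as a sketch and not as the author's reply: in XSLT an attribute is addressed with the attribute axis, for example select="@href" when the current node is the a element. The retagging only changes element names, so attributes stay reachable under their original names. The same idea in Python's xml.etree.ElementTree is shown below; the fragment and the helper name attribute_values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical retagged fragment: the tag carries a "_d5" suffix,
# but the href attribute keeps its original name.
fragment = ET.fromstring(
    '<td_d4><a_d5 href="http://example.com/">text</a_d5></td_d4>'
)

def attribute_values(root, tag, name):
    """Collect attribute `name` from every `tag` element in the tree."""
    return [
        elem.get(name)
        for elem in root.iter(tag)
        if elem.get(name) is not None
    ]

print(attribute_values(fragment, "a_d5", "href"))
```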
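This article describes a method for converting a web page into a standard XHTML file and then into XML. The process consists of downloading the page, tidying and standardizing the HTML, retagging the XHTML, writing the corresponding XSLT file, creating a schema file, and finally transforming to XML. Particular attention is paid to handling special tags and attributes.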





