How to scrape a page so that the tags, say <p></p>, are all kept in the scraped data as they appear in the HTML structure

<html>my news article</html>
<title>scraping</title>
<p>the world of so many articles</p>
<p>has been placed in this blocknotes</p>
<p>and i really wanna scraped that html structure just as it is</p>
<p>with all the tags in the scraped data</p>

How do I scrape all the tags in it?

I want the scraped data to look like ...

This Python script may help:

from lxml import html
HTML = """<html>
<title>scraping</title>
<p>the world of so many articles</p>
<p>has been placed in this blocknotes</p>
<p>and i really wanna scraped that html structure just as it is</p>
<p>with all the tags in the scrapped data</p>
</html>"""
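# Parse the markup into an element tree.
tree = html.fromstring(HTML)
# Take the text directly inside every <p> and wrap it in <p>...</p> again.
# print() is called as a function so the script runs on Python 3.
print(' '.join("<p>{}</p>".format(x) for x in tree.xpath('//p/text()')))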

Output:

<p>the world of so many articles</p> <p>has been placed in this blocknotes</p> <p>and i really wanna scraped that html structure just as it is</p> <p>with all the tags in the scrapped data</p>
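Note that the script above rebuilds each <p> from its text, so any markup nested inside a paragraph (a <b> or an <a>, for example) would be dropped. If the paragraphs need to come out exactly as written, one possible approach is to collect the raw pieces while parsing. The following is only a minimal sketch using the standard library's html.parser; the class name ParagraphCollector and the <b> tag in the sample are assumptions added for illustration, and it assumes every <p> has an explicit closing tag.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collect each <p> element with its inner markup."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.paragraphs = []  # finished "<p>...</p>" strings
        self._parts = None    # pieces of the current <p>, None outside one

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._parts = []
        if self._parts is not None:
            # get_starttag_text() returns the start tag as it appears in the source.
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, without an end tag.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._parts is None:
            return
        self._parts.append("</{}>".format(tag))
        if tag == "p":
            self.paragraphs.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_entityref(self, name):
        if self._parts is not None:
            self._parts.append("&{};".format(name))

    def handle_charref(self, name):
        if self._parts is not None:
            self._parts.append("&#{};".format(name))


# Sample input; the <b> tag is added here to show nested markup surviving.
HTML = """<html>
<title>scraping</title>
<p>the world of <b>so many</b> articles</p>
<p>has been placed in this blocknotes</p>
</html>"""

collector = ParagraphCollector()
collector.feed(HTML)
collector.close()
print("\n".join(collector.paragraphs))
```

If lxml is already in use, serializing each element returned by tree.xpath('//p') with html.tostring(element, encoding='unicode', with_tail=False) should give a similar result, though lxml may normalize the markup slightly while parsing.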