Finding the number of 'p' tags and iterating through them to scrape the underlying text using PHP

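I can't figure out how to scrape the underlying text of paragraphs from a web page using PHP when the paragraphs have no 'id' or 'class'. One approach is to count and iterate over the <p> tags inside a <div>, but the div itself closes before any <p> tag is encountered. I'm planning to learn scraping by pulling information from wikitravel.org. Here is an example from wikitravel.org: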

   <h2><span class="editsection">[<a href="/wiki/en/index.php?title=Kanniyakumari&amp;action=edit&amp;section=18" title="Edit section: Sleep">edit</a>][<a href="#Sleep" title="click to add a sleep listing" onclick="addListing(this, '18', 'sleep', 'Kanniyakumari');">add listing</a>]</span> <span class="mw-headline" id="Sleep">Sleep</span></h2>
   <p>There are numerous hotels, residencies etc. in and around Kanyakumari and therefore, staying over is not be a problem. But there are agents, touts and brokers in every nook and corner looking for unsuspecting tourists. Eschew buying or booking rooms from them, as many a time you end up paying a lot more than the actual price. Vivekananda Kendra can be a good option for people looking for a decent, yet cheap accommodation, but it's around 3 km from Kanyakumari. Prefer hotels near the beach especially if you want to watch the sunrise right out of your bed! Note that, you should quote this preference when booking the room or else, you'll always be given a room without a window opening out to the sea. Moreover many a times, these rooms are in great demand and you'll find yourself shelling a extra 400 - 500 Rs (~10 US$)for such a room. Hotel Sea View, Hotel Sangam and a couple of other hotels offer such rooms and the rent is about Rs. 1100 (~ 25 US$) for 12 hrs. Note that many rooms are priced for 12 hrs  and not per day especially during the peak season.
</p>
<p>ATM's in Kanyakumari:</p>
 <p>Canara Bank 
 Main Road, Kanyakumari 629702, ,
 </p>
 <p>Indian Bank 
  S No 658 / 1, National High Way Opp St Antony'S Higher Secondary Sckanyakumari 629702
 </p>
<p>State Bank Of Travancore 
P.B.No.1, 1/17 Amman Sannathi Street, Kanyakumari, Tamil Nadu, 629702
</p>

Can anyone help? Thanks in advance!

Take a look at the simplehtmldom parser. It lets you query the document with jQuery-like selectors.

Here is an example for your case:

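require_once 'simple_html_dom.php'; // Simple HTML DOM library; provides file_get_html()

$html = file_get_html('http://www.wikitravel.com/yourpage');
foreach ($html->find('p') as $element) {
    echo $element->innertext; // the content of each <p> tag, including any nested markup
}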
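Since the question also asks for the number of <p> tags, here is a minimal sketch of counting and iterating with PHP's bundled DOM extension rather than Simple HTML DOM. The `$sample` string is a shortened stand-in for the page in the question, and `paragraphTexts()` is a hypothetical helper name, not part of any library.

```php
<?php
// Shortened stand-in for the fetched page (assumption: a real page would be
// downloaded first, e.g. with file_get_contents()).
$sample = <<<HTML
<h2><span class="mw-headline" id="Sleep">Sleep</span></h2>
<p>There are numerous hotels in and around Kanyakumari.</p>
<p>ATM's in Kanyakumari:</p>
<p>Canara Bank
Main Road, Kanyakumari 629702</p>
HTML;

// Hypothetical helper: returns the plain text of every <p> element.
function paragraphTexts(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings silently
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $texts = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent drops nested tags; collapse line breaks and repeated spaces
        $texts[] = trim(preg_replace('/\s+/', ' ', $p->textContent));
    }
    return $texts;
}

$paragraphs = paragraphTexts($sample);
echo count($paragraphs), " paragraphs found\n";
foreach ($paragraphs as $i => $text) {
    echo $i + 1, ': ', $text, "\n";
}
```

No 'id' or 'class' is needed on the paragraphs, because getElementsByTagName() matches on the tag name alone.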

I have always thought jQuery is the best way to scrape HTML data. Have PHP render a page that uses jQuery to parse the scraped HTML and then return a JSON data set to PHP.

If you want to stick with the pure PHP route, try a library like this one: http://simplehtmldom.sourceforge.net/
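If adding a library is not an option, PHP's bundled DOM extension may be enough. The sketch below assumes the section heading carries an id, as the `mw-headline` span does in the question's sample, and collects only the <p> elements between that heading and the next <h2>. The `$sample` markup is a made-up, shortened page and `sectionParagraphs()` is a hypothetical helper name.

```php
<?php
// Made-up, shortened page with three sections (placeholder text only).
$sample = <<<HTML
<h2><span class="mw-headline" id="Eat">Eat</span></h2>
<p>Paragraph from another section.</p>
<h2><span class="mw-headline" id="Sleep">Sleep</span></h2>
<p>There are numerous hotels in and around Kanyakumari.</p>
<p>ATM's in Kanyakumari:</p>
<h2><span class="mw-headline" id="Contact">Contact</span></h2>
<p>Paragraph from a later section.</p>
HTML;

// Hypothetical helper: text of the <p> elements that follow the <h2> whose
// child <span> has the given id, stopping at the next <h2>.
// Assumption: $headlineId contains no double quotes.
function sectionParagraphs(string $html, string $headlineId): array
{
    libxml_use_internal_errors(true); // tolerate invalid real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $heading = $xpath->query(sprintf('//h2[span[@id="%s"]]', $headlineId))->item(0);
    if ($heading === null) {
        return []; // no such section on this page
    }

    $texts = [];
    for ($node = $heading->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'h2') {
            break; // the next section starts here
        }
        if ($node->nodeName === 'p') {
            $texts[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
        }
    }
    return $texts;
}

$section = sectionParagraphs($sample, 'Sleep');
echo count($section), " paragraphs in the Sleep section\n";
```

Walking the siblings of the heading avoids relying on an enclosing <div>, which matters here because the paragraphs are not wrapped in one.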