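I am new to PHP. What I want to do is fetch the pagination links. The page shows page numbers, and of course the link changes as we select a different page. How can I get the pagination URLs while staying on the main page (http://ahadith.co.uk/sahihmuslim.php)?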
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://ahadith.co.uk/sahihmuslim.php");
// return the fetched page as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
// this regex collects the chapter links from the page fetched above
$pattern = "/href=[']([^'][a-zA-Z]+.[a-zA-Z]+.[cid]+=[0-9]+)[']?/";
preg_match_all($pattern, $output, $matches, PREG_PATTERN_ORDER);
foreach ($matches[1] as $data) {
    // download the page behind each matched link and store it in $homepage
    $homepage = file_get_contents('http://ahadith.co.uk/' . $data);
    // extract the chapter titles (the text inside <h2> tags) from $homepage
    $pattern_chapter = "/(?<=<h2>)(\s*.*\s*)(?=<\/h2>)/";
    preg_match_all($pattern_chapter, $homepage, $matches_chapter, PREG_PATTERN_ORDER);
    foreach ($matches_chapter[1] as $chapters) {
        print_r($chapters);
    }
}
?>
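Now I have to get the pagination links from the data stored in $homepage. In this case the pagination has 44 pages, and I want the links to all 44 of them. This is the regex I use to match the pagination links:
http:\/\/([a-zA-Z]+.[a-zA-Z]+.[a-zA-Z]+.[a-zA-Z]+.[a-zA-Z]+.[cid]+=[0-9]&[a-zA-Z]+=[0-9]&[a-zA-Z]+=[0-9]+)
I have searched in many places but could not find anything related to this. Can anyone help me?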
Use "HtmlPageDom". It is a third-party library for easily manipulating HTML documents through the DOM. You can extract any kind of data you need from any page.
https://github.com/wasinger/htmlpagedom/blob/master/README.md
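To show the DOM-based idea without installing anything, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath rather than HtmlPageDom. The function name extractPaginationLinks is hypothetical, and so are the file name and the query parameter names (page, rows) in the sample markup. The only assumption taken from the question is that a pagination link carries cid plus two more query parameters, which is the shape the question's regex describes.

```php
<?php
// Hypothetical helper (not part of any library): collects pagination links
// from an HTML string using PHP's bundled DOM extension.
function extractPaginationLinks(string $html, string $baseUrl): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $href = $anchor->getAttribute('href');
        parse_str((string) parse_url($href, PHP_URL_QUERY), $params);

        // Assumption: a pagination link has "cid" plus at least two other
        // query parameters, as in the regex from the question.
        if (!isset($params['cid']) || count($params) < 3) {
            continue;
        }

        // Turn relative links into absolute ones.
        if (!preg_match('#^https?://#i', $href)) {
            $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }

    // The same page number often appears more than once in a pager.
    return array_values(array_unique($links));
}

// Made-up sample markup; the real page structure may differ.
$sample = '<h2>Chapter 1</h2>'
    . '<a href="sahihmuslim.php?cid=1">Chapter 1</a>'
    . '<a href="chapter.php?cid=1&amp;page=1&amp;rows=10">1</a>'
    . '<a href="http://ahadith.co.uk/chapter.php?cid=1&amp;page=2&amp;rows=10">2</a>'
    . '<a href="chapter.php?cid=1&amp;page=2&amp;rows=10">Next</a>';

$links = extractPaginationLinks($sample, 'http://ahadith.co.uk/');
print_r($links);
```

In the loop from the question, you could call this on $homepage instead of $sample. If the pager only shows a few page numbers at a time, you would still have to follow the returned links to discover the remaining pages.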