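I want to get the value of the <title> tag for every page on my website. I am trying to run the script only against my own domain and collect all the page links and titles on my site.
Here is my code: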
$html = file_get_contents('http://xxxxxxxxx.com');
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo $link->getAttribute('href'), '<br>';
}
What I get: for a link such as <a href="z1.html">z2</a>, I get z1.html and z2. My z1.html has a title named z3. I want z1.html and z3, not z2. Can anyone help me?
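Adding a bit to hitesh's answer: check that the element has attributes and that the attribute you need exists. Also check that fetching the "title" elements actually returned at least one item before trying to read the first one ($a_html_title->item(0)).
I also added a follow-location option for curl (my hardcoded test against google.com needed it).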
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    if ($link->hasAttributes()){
        if ($link->hasAttribute('href')){
            $href = $link->getAttribute('href');
            $href = 'http://google.com'; // hardcoding just for testing
            echo $link->nodeValue;
            echo "<br/>".'MY ANCHOR LINK : - ' . $href . "---TITLE--->";
            $a_html = my_curl_function($href);
            $a_doc = new DOMDocument();
            @$a_doc->loadHTML($a_html);
            $a_html_title = $a_doc->getElementsByTagName('title');
            //get and display what you need:
            if ($a_html_title->length){
                $a_html_title = $a_html_title->item(0)->nodeValue;
                echo $a_html_title;
                echo '<br/>';
            }
        }
    }
}
function my_curl_function($url) {
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, $url);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_handle, CURLOPT_USERAGENT, 'name');
    curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, TRUE); // added this
    $html = curl_exec($curl_handle);
    curl_close($curl_handle);
    return $html;
}
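The checks described above could also be gathered into one small helper. The following is only a sketch: extract_title() is a hypothetical name that does not appear in either answer, and it assumes the caller passes in whatever my_curl_function() returned, which may be false when the request fails.

```php
<?php
// Hypothetical helper (not part of the original answers): returns the text of
// the first <title> element, or null when there is nothing usable.
function extract_title($html) {
    // curl_exec() returns false on failure, and loadHTML() rejects an empty string.
    if (!is_string($html) || trim($html) === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Make sure at least one <title> exists before reading item(0).
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }
    return trim($titles->item(0)->nodeValue);
}
```

Inside the loop this would be called as extract_title(my_curl_function($href)), printing the result only when it is not null.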
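You need to write your own custom function and call it in the appropriate place. If you need to fetch more than one tag from the pages that the anchor tags point to, you only need to write further custom functions.
The code below should help you get started: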
$html = my_curl_function('http://www.anchorartspace.org/');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$mytag = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $mytag->item(0)->nodeValue;
$links = $doc->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link) {
    //Extract and show the "href" attribute.
    echo $link->nodeValue;
    echo "<br/>".'MY ANCHOR LINK : - ' . $link->getAttribute('href') . "---TITLE--->";
    $a_html = my_curl_function($link->getAttribute('href'));
    $a_doc = new DOMDocument();
    @$a_doc->loadHTML($a_html);
    $a_html_title = $a_doc->getElementsByTagName('title');
    //get and display what you need:
    $a_html_title = $a_html_title->item(0)->nodeValue;
    echo $a_html_title;
    echo '<br/>';
}
echo "Title: $title" . '<br/><br/>';
function my_curl_function($url) {
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, $url);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl_handle, CURLOPT_USERAGENT, 'name');
    $html = curl_exec($curl_handle);
    curl_close($curl_handle);
    return $html;
}
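One thing to watch for: the question's links are relative (href="z1.html"), and curl cannot fetch a relative href on its own. The question also asks to stay on the site's own domain. Below is a minimal sketch of how both could be handled. resolve_href() and is_same_host() are hypothetical helper names, not part of the code above, and the resolver is deliberately simple: it does not normalize "../" segments.

```php
<?php
// Hypothetical helper: turn an href found on the page $base into an absolute
// http(s) URL. Returns null for links that should not be fetched.
function resolve_href($base, $href) {
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return null; // empty link or same-page fragment
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return null; // the base itself must be an absolute URL
    }
    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href; // protocol-relative link
    }
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if ($scheme !== null && $scheme !== false) {
        // Already absolute: keep http/https, drop mailto:, javascript:, etc.
        return in_array(strtolower($scheme), ['http', 'https'], true) ? $href : null;
    }
    $origin = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    if ($href[0] === '/') {
        return $origin . $href; // root-relative link
    }
    // Relative link: append to the directory part of the base path.
    $path = isset($b['path']) ? $b['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Hypothetical helper: true when both URLs have the same, non-empty host.
function is_same_host($url, $base) {
    $a = (string) parse_url($url, PHP_URL_HOST);
    $b = (string) parse_url($base, PHP_URL_HOST);
    return $a !== '' && strcasecmp($a, $b) === 0;
}
```

In the loop, the idea would be to compute $url = resolve_href($base, $link->getAttribute('href')) and call my_curl_function($url) only when $url is not null and is_same_host($url, $base) is true, where $base is the address of the page that was parsed.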
Let me know if you need more help.