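I have a PHP web crawler and I would like to add the get_meta_tags() function to it. The crawler scans a given web page for all URLs and so on. Is it possible to add get_meta_tags() to the crawler so that it fetches the meta tags from each scanned URL?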
<?php
session_start();
$domain = "www.ebay.com";
if(empty($_SESSION['page']))
{
$original_file = file_get_contents("http://" . $domain . "/");
$_SESSION['i'] = 0;
$connect = mysql_connect("xxxxx", "xxxxx", "xxxx");
if (!$connect)
{
die("MySQL could not connect!");
}
$DB = mysql_select_db('xxxx');
if(!$DB)
{
die("MySQL could not select Database!");
}
}
if(isset($_SESSION['page']))
{
$connect = mysql_connect("xxxxx", "xxxxx", "xxxx");
if (!$connect)
{
die("MySQL could not connect!");
}
$DB = mysql_select_db('xxxx');
if(!$DB)
{
die("MySQL could not select Database!");
}
$PAGE = $_SESSION['page'];
$original_file = file_get_contents("$PAGE");
}
$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
foreach($matches[1] as $key => $value)
{
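// Relative link: no scheme at the start. strpos() returns 0 for a match at position 0, so compare strictly.
if(strpos($value, "http://") !== 0 && strpos($value, "https://") !== 0)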
{
$New_URL = "http://" . $domain . $value;
}
else
{
$New_URL = $value;
}
// Escape with the connection-aware function rather than addslashes().
$New_URL = mysql_real_escape_string($New_URL);
$Check = mysql_query("SELECT * FROM pages WHERE url='$New_URL'");
$Num = mysql_num_rows($Check);
if($Num == 0)
{
mysql_query("INSERT INTO pages (url)
VALUES ('$New_URL')");
$_SESSION['i']++;
echo $_SESSION['i'] . "";
}
echo mysql_error();
}
$RandQuery = mysql_query("SELECT DISTINCT * FROM pages ORDER BY rank LIMIT 0,1");
$RandReturn = mysql_num_rows($RandQuery);
while($row1 = mysql_fetch_assoc($RandQuery))
{
$_SESSION['page'] = $row1['url'];
}
echo $RandReturn;
echo $_SESSION['page'];
mysql_close();
?>
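I ran into this problem before when reading HTML tags from external sources. Jstel provided me with a good solution, and I am sure you can incorporate that solution into yours.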
http://www.php.net/manual/en/function.get-meta-tags.php#92197
Based on your code, here is how it works:
$domain = "www.ebay.com";
$original_file = file_get_contents("http://" . $domain . "/");
preg_match_all("/<meta[^>]+(http-equiv|name)=\"([^\"]*)\"[^>]+content=\"([^\"]*)\"[^>]*>/i", $original_file, $result);
print_r($result);
Below are the sample results I got from this regular expression:
Array
(
[0] => Array
(
[0] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
[1] => <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
[2] => <meta name="keywords" content="ebay, electronics, cars, clothing, apparel, collectibles, sporting goods, digital cameras, antiques, tickets, jewelry, online shopping, auction, online auction">
[3] => <meta name="description" content="Buy and sell electronics, cars, fashion apparel, collectibles, sporting goods, digital cameras, baby items, coupons, and everything else on eBay, the world's online marketplace">
[4] => <meta name="verify-v1" content="j6ZKbG61n+f9pUtbkf69zFRBrRSeUqyfEJ2BjiRxWDQ=">
[5] => <meta name="y_key" content="acf32e2a69cbc2b0">
[6] => <meta name="msvalidate.01" content="31154A785F516EC9842FC3BA2A70FB1A">
)
[1] => Array
(
[0] => http-equiv
[1] => http-equiv
[2] => name
[3] => name
[4] => name
[5] => name
[6] => name
)
[2] => Array
(
[0] => Content-Type
[1] => Content-Type
[2] => keywords
[3] => description
[4] => verify-v1
[5] => y_key
[6] => msvalidate.01
)
[3] => Array
(
[0] => text/html; charset=UTF-8
[1] => text/html; charset=UTF-8
[2] => ebay, electronics, cars, clothing, apparel, collectibles, sporting goods, digital cameras, antiques, tickets, jewelry, online shopping, auction, online auction
[3] => Buy and sell electronics, cars, fashion apparel, collectibles, sporting goods, digital cameras, baby items, coupons, and everything else on eBay, the world's online marketplace
[4] => j6ZKbG61n+f9pUtbkf69zFRBrRSeUqyfEJ2BjiRxWDQ=
[5] => acf32e2a69cbc2b0
[6] => 31154A785F516EC9842FC3BA2A70FB1A
)
)
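The parallel arrays in that result can be folded into a single name => content map, which is usually easier to work with. A minimal sketch, assuming the same pattern as above; the helper name meta_matches_to_map() and the sample HTML are made up for illustration and are not part of the original answer:

```php
<?php
// Hypothetical helper: flatten a preg_match_all() result into a
// name => content map.
function meta_matches_to_map(array $result): array
{
    $map = [];
    // $result[2] holds the name / http-equiv values, $result[3] the content values.
    foreach ($result[2] as $i => $name) {
        // Lower-case the key so "Keywords" and "keywords" collapse together;
        // a repeated tag simply overwrites the earlier entry.
        $map[strtolower($name)] = $result[3][$i];
    }
    return $map;
}

// Sample HTML standing in for the fetched page (an assumption for the demo).
$html = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
      . '<meta name="keywords" content="ebay, electronics">'
      . '<meta name="y_key" content="acf32e2a69cbc2b0">';

preg_match_all(
    '/<meta[^>]+(http-equiv|name)="([^"]*)"[^>]+content="([^"]*)"[^>]*>/i',
    $html,
    $result
);

$meta = meta_matches_to_map($result);
print_r($meta);
```

Note that this pattern only matches double-quoted attributes with the name before the content, so tags written in another order or with single quotes would be skipped.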
First, why do you put quotes around the variable on this line?
$original_file = file_get_contents("$PAGE");
Second, you can retrieve all the meta tags with:
$tags = get_meta_tags('http://www.example.com/');
See php.net.
So in your example, I guess you would have to use:
$tags = get_meta_tags($New_URL);
and save that array to the database.
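One way to store the array is to serialize it as JSON next to the URL. The following is only a sketch under stated assumptions: it uses PDO with an in-memory SQLite database (which needs the pdo_sqlite extension) so that it runs on its own, and the save_meta_tags() helper and the page_meta table are hypothetical names, not part of the crawler in the question. With MySQL you would swap in your own DSN and table.

```php
<?php
// Hypothetical helper: read the meta tags for one URL (or local file)
// and store them as a JSON string. Table and column names are assumptions.
function save_meta_tags(PDO $db, string $url): array
{
    // get_meta_tags() returns false on failure; fall back to an empty array.
    $tags = @get_meta_tags($url) ?: [];

    // A prepared statement avoids manual escaping of the URL.
    $stmt = $db->prepare('INSERT INTO page_meta (url, meta) VALUES (:url, :meta)');
    $stmt->execute([':url' => $url, ':meta' => json_encode($tags)]);

    return $tags;
}

// Demo setup: in-memory database and a local file standing in for a page.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE page_meta (url TEXT PRIMARY KEY, meta TEXT)');

$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents(
    $file,
    '<html><head>'
    . '<meta name="keywords" content="php, crawler">'
    . '<meta name="description" content="Example page">'
    . '</head><body></body></html>'
);

$tags   = save_meta_tags($db, $file);
$stored = $db->query('SELECT meta FROM page_meta')->fetchColumn();
print_r(json_decode($stored, true));

unlink($file);
```

Keep in mind that get_meta_tags() only reads tags that have a name attribute and stops parsing at the closing head tag, so http-equiv entries will not appear in the stored array.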