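How can I get the domain from the URL in an embed code? I have 400k videos collected from many websites; some of them use an iframe and others use an object. What is the simplest and best way to get the domain of an embed code?
Iframe code example: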
<iframe src="http://www.websites-test.com/video231/" frameborder=0 width=510 height=400 scrolling=no></iframe>
Embed code example:
<object width="990" height="750"> <param name="movie" value="http://www.websites-test.com/video231/"></param><param name="AllowScriptAccess" value="always"></param><param name="wmode" value="transparent"></param><embed src="http://www.websites-test.com/video231/" type="application/x-shockwave-flash" wmode="transparent" AllowScriptAccess="always" width="990" height="750"></embed></object>
So, let's say $Domain_Embed = websites-test.com
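I suggest you parse the HTML (see "How do you parse and process HTML/XML in PHP?") and then extract the domain from the appropriate attributes. For example: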
<?php
// Returns the host of the first embedded URL found in $html,
// or an array of every host found when $all is true.
function getDomainFromEmbed($html, $all = false)
{
    $result = array();
    $doc = new DOMDocument;
    // Suppress libxml warnings about sloppy or non-standard markup.
    @$doc->loadHTML($html);

    // <iframe src="...">
    foreach ($doc->getElementsByTagName('iframe') as $iframe) {
        if ($iframe->hasAttribute('src')) {
            $url = parse_url($iframe->getAttribute('src'), PHP_URL_HOST);
            if ($all) {
                $result[] = $url;
            } else {
                return $url;
            }
        }
    }

    // <object data="..."> and its <param name="movie" value="..."> children
    foreach ($doc->getElementsByTagName('object') as $object) {
        if ($object->hasAttribute('data')) {
            $url = parse_url($object->getAttribute('data'), PHP_URL_HOST);
            if ($all) {
                $result[] = $url;
            } else {
                return $url;
            }
        }
        foreach ($object->getElementsByTagName('param') as $param) {
            if ($param->hasAttribute('name') && $param->hasAttribute('value') && 'movie' === $param->getAttribute('name')) {
                $url = parse_url($param->getAttribute('value'), PHP_URL_HOST);
                if ($all) {
                    $result[] = $url;
                } else {
                    return $url;
                }
            }
        }
    }

    // <embed src="...">
    foreach ($doc->getElementsByTagName('embed') as $embed) {
        if ($embed->hasAttribute('src')) {
            $url = parse_url($embed->getAttribute('src'), PHP_URL_HOST);
            if ($all) {
                $result[] = $url;
            } else {
                return $url;
            }
        }
    }

    return $all ? $result : null;
}

echo '<pre>';
var_dump(getDomainFromEmbed('<iframe src="http://www.websites-test.com/video231/" frameborder=0 width=510 height=400 scrolling=no></iframe>'));
var_dump(getDomainFromEmbed('<object width="990" height="750"> <param name="movie" value="http://www.websites-test.com/video231/"></param><param name="AllowScriptAccess" value="always"></param><param name="wmode" value="transparent"></param><embed src="http://www.websites-test.com/video231/" type="application/x-shockwave-flash" wmode="transparent" AllowScriptAccess="always" width="990" height="750"></embed></object>'));
echo '</pre>';
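Note that parse_url() returns the full host (www.websites-test.com), while the question asks for websites-test.com. The following is a minimal sketch of one way to close that gap, assuming that lowercasing and dropping a single leading "www." is all the normalization you need. normalizeHost is a hypothetical helper name, not part of the answer above.

```php
<?php
// Hypothetical helper: normalize a host returned by parse_url() so that
// "www.Example.com" and "example.com" compare equal.
function normalizeHost($host)
{
    // parse_url() can return null or false when no host is available.
    if (!is_string($host) || $host === '') {
        return null;
    }
    $host = strtolower($host);
    // Strip one leading "www." label only; other subdomains are kept as-is.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    return $host;
}

echo normalizeHost(parse_url('http://www.websites-test.com/video231/', PHP_URL_HOST)), "\n";
// prints "websites-test.com"
```

Reducing a host to its registrable domain in general (for example, handling suffixes such as .co.uk) needs a public-suffix list and is outside what this sketch covers.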
Try this code:
function getDomain($html) {
    // Capture the value of the first src attribute, whether or not it is quoted.
    preg_match('`<[^>]*src=["\'\s]?([^"\'\s]+)["\'\s][^>]*>`i', $html, $matches);
    if (isset($matches[1])) {
        return parse_url($matches[1], PHP_URL_HOST);
    }
    return false;
}
$html = '<iframe src="http://www.websites-test.com/video231/" frameborder=0 width=510 height=400 scrolling=no></iframe>';
echo getDomain($html);
echo '<br />';
$html = '<object width="990" height="750"> <param name="movie" value="http://www.websites-test.com/video231/"></param><param name="AllowScriptAccess" value="always"></param><param name="wmode" value="transparent"></param><embed src="http://www.websites-test.com/video231/" type="application/x-shockwave-flash" wmode="transparent" AllowScriptAccess="always" width="990" height="750"></embed></object>';
echo getDomain($html);
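Of course, you can assign the result to a variable, $Domain_Embed = getDomain($html), instead of using echo getDomain($html). Here $html is the HTML code containing the tags with the src attribute you mentioned.
For multiple objects in the same $html, you can change the function to return an array of results: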
function getDomains($html) {
    $results = array();
    // Collect the src value of every tag that has one.
    preg_match_all('`<[^>]*src=["\'\s]?([^"\'\s]+)["\'\s][^>]*>`i', $html, $matches);
    if (isset($matches[1]) && is_array($matches[1])) {
        foreach ($matches[1] as $match) {
            $results[] = parse_url($match, PHP_URL_HOST);
        }
    }
    return empty($results) ? false : $results;
}
echo '<pre>' . print_r(getDomains($html), true) . '</pre>';
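With a few hundred thousand embed codes, it may also be useful to tally which hosts occur and how often. The sketch below is one possible way to do that, using a simplified variant of the regex approach. firstSrcHost, countHosts, the sample $embedCodes array, and the player.example.org host are all hypothetical; in practice the embed codes would come from your own storage.

```php
<?php
// Hypothetical helper: return the lowercased host of a src attribute
// found in $html, or null when there is none.
function firstSrcHost($html)
{
    if (preg_match('`<[^>]*\ssrc=["\']?([^"\'\s>]+)`i', $html, $m)) {
        $host = parse_url($m[1], PHP_URL_HOST);
        return is_string($host) ? strtolower($host) : null;
    }
    return null;
}

// Hypothetical helper: count how many embed codes point at each host.
function countHosts($embedCodes)
{
    $counts = array();
    foreach ($embedCodes as $html) {
        $host = firstSrcHost($html);
        if ($host === null) {
            continue; // no src attribute, or a URL without a host
        }
        $counts[$host] = isset($counts[$host]) ? $counts[$host] + 1 : 1;
    }
    arsort($counts); // most frequent host first
    return $counts;
}

// Sample data for illustration only.
$embedCodes = array(
    '<iframe src="http://www.websites-test.com/video231/" frameborder=0></iframe>',
    '<embed src="http://www.websites-test.com/video9/" type="application/x-shockwave-flash"></embed>',
    '<iframe src="http://player.example.org/v/42"></iframe>',
    '<p>no embed here</p>',
);
print_r(countHosts($embedCodes));
```

Because countHosts only iterates with foreach, it should also accept a generator that yields rows one at a time, which avoids loading every embed code into memory at once.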