I want to get all script src links from a website using curl and DOM.
I have this code:
$scripts = $dom->getElementsByTagName('script');
foreach ($scripts as $scripts1) {
    if ($scripts1->getAttribute('src')) {
        echo $scripts1->getAttribute('src');
    }
}
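For context, here is a minimal sketch of how `$dom` might be built from a curl response before the loop above runs. The helper names `fetchHtml` and `extractScriptSrcs` are hypothetical, and the curl options are assumptions rather than part of the original code.

```php
<?php
// Hypothetical helper: download a page with curl and return the body as a string.
function fetchHtml(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $html = curl_exec($ch);
    return $html === false ? '' : $html;
}

// Hypothetical helper: collect every non-empty src attribute of <script> tags.
function extractScriptSrcs(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    // Real-world markup is rarely valid, so keep libxml parse warnings quiet.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $srcs = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        $src = $script->getAttribute('src');
        if ($src !== '') {
            $srcs[] = $src;
        }
    }
    return $srcs;
}
```

Something like `print_r(extractScriptSrcs(fetchHtml('http://somesite.com/')));` would then list the external scripts. Inline scripts have no src attribute, so they are skipped here.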
This code works perfectly, but what happens if a website has a script tag like this:
<script type="text/javascript">
window._wpemojiSettings = {"source":{"concatemoji":"http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4"}}; ........
</script>
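I need to get this script src as well. How can I do that?
If the first parser comes back empty, I would create a second parser using a regular expression, i.e.: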
$html = file_get_contents("http://somesite.com/");
preg_match_all('/<script.*?(http.*?\.js(?:\?.*?)?)"/si', $html, $matches, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($matches[1]); $i++) {
    // Undo the JSON escaping of slashes (\/ becomes /)
    echo str_replace("\\/", "/", $matches[1][$i]);
}
You may need to tweak the regex to work with different websites, but the code above should give you an idea of what you need.
Demo: http://ideone.com/Fwf6Mb
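One way to narrow the regex is to run it only on the text of inline `<script>` elements instead of the whole page. The following is a sketch under assumptions, not a definitive implementation: `extractInlineScriptUrls` is a hypothetical name, and the pattern assumes each URL sits inside a double-quoted string, possibly with JSON-escaped slashes (`\/`).

```php
<?php
// Hypothetical helper: find .js URLs mentioned inside inline <script> blocks.
function extractInlineScriptUrls(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $urls = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        if ($script->getAttribute('src') !== '') {
            continue; // external scripts are already covered by the src attribute
        }
        // Double-quoted http(s) URL ending in .js, with an optional query string.
        $pattern = '~"(https?:[^"]*?\.js(?:\?[^"]*)?)"~i';
        if (preg_match_all($pattern, $script->textContent, $m)) {
            foreach ($m[1] as $url) {
                $urls[] = str_replace('\\/', '/', $url); // undo JSON escaping of slashes
            }
        }
    }
    return $urls;
}
```

Restricting the match to `[^"]*` keeps it from running past the end of the quoted string, which the looser `.*?` in the whole-page pattern can do on some pages.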
Regex explanation:
<script.*?(http.*?\.js(?:\?.*?)?)"
----------------------------------
Match the character string “<script” literally «<script»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regex below and capture its match into backreference number 1 «(http.*?\.js(?:\?.*?)?)»
   Match the character string “http” literally «http»
   Match any single character «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character “.” literally «\.»
   Match the character string “js” literally «js»
   Match the regular expression below «(?:\?.*?)?»
      Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
      Match the character “?” literally «\?»
      Match any single character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “"” literally «"»
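The breakdown above can also be kept next to the pattern itself by using PCRE's extended mode (the `x` flag), which ignores whitespace in the pattern and allows `#` comments. This is only a sketch: the `~` delimiter and the `$sample` string are assumptions made for illustration.

```php
<?php
// Same pattern as above, written in extended mode so each part carries a comment.
$pattern = '~
    <script          # literal "<script"
    .*?              # any characters, as few as possible
    (                # start of capture group 1
        http         # literal "http"
        .*?          # any characters, as few as possible
        \.js         # literal ".js"
        (?:\?.*?)?   # optional query string starting with "?"
    )                # end of capture group 1
    "                # closing double quote
~six';

// Hypothetical sample input with JSON-escaped slashes.
$sample = '<script>var s = {"u":"http:\/\/domain.com\/js\/app.min.js?ver=1"};</script>';

$url = null;
if (preg_match($pattern, $sample, $m)) {
    $url = str_replace('\\/', '/', $m[1]); // undo JSON escaping of slashes
}
echo $url, PHP_EOL;
```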
Regex tutorial
http://www.regular-expressions.info/tutorial.html