PHP DOM: get all script src URLs from a website

I want to get all script src links from a website using cURL and DOM.

I have this code:

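// $dom is a DOMDocument already loaded with the HTML fetched via cURL
$scripts = $dom->getElementsByTagName('script');
foreach ($scripts as $script) {
    // getAttribute() returns an empty string when the attribute is missing
    $src = $script->getAttribute('src');
    if ($src !== '') {
        echo $src, PHP_EOL;
    }
}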

This works perfectly, but what happens if a website has a script tag like this:

<script type="text/javascript">
window._wpemojiSettings = {"source":{"concatemoji":"http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4"}}; ........
</script>

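I need to get this script's src as well. How can I do that?

If the first parser returns nothing, I would create a second parser using a regular expression, i.e.: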

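$html = file_get_contents("http://somesite.com/");
// Capture http...js URLs (with an optional query string) that follow a <script tag
preg_match_all('/<script.*?(http.*?\.js(?:\?.*?)?)"/si', $html, $matches, PREG_PATTERN_ORDER);
foreach ($matches[1] as $url) {
    // Undo the JSON escaping of slashes (\/ becomes /)
    echo str_replace("\\/", "/", $url), PHP_EOL;
}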

You may need to adjust the regex to work with different websites, but the code above should give you an idea of what you need.
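
One way the two approaches could be combined is sketched below: read the src attribute where a script tag has one, and otherwise scan the inline script body for quoted .js URLs. The function name collectScriptUrls and the inline-script pattern are assumptions made for illustration (the pattern is a stricter variation, not the regex from this answer), and unlike the "only if the first parser is empty" idea it checks every script tag.

```php
<?php
// Hypothetical helper: the name and the inline-script pattern are assumptions.
function collectScriptUrls(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects empty input
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Quoted http(s) URL ending in .js, with an optional query string.
    // Slashes after the scheme may be JSON-escaped (\/).
    $pattern = '~https?:(?:\\\\?/){2}[^"\'\s]+?\.js(?:\?[^"\'\s]*)?(?=["\'])~i';

    $urls = [];
    foreach ($dom->getElementsByTagName('script') as $script) {
        $src = $script->getAttribute('src');
        if ($src !== '') {
            // External script: the URL is in the src attribute.
            $urls[] = $src;
            continue;
        }
        // Inline script: search the script body for .js URLs.
        if (preg_match_all($pattern, $script->textContent, $matches)) {
            foreach ($matches[0] as $match) {
                $urls[] = str_replace('\\/', '/', $match); // undo \/ escaping
            }
        }
    }

    return array_values(array_unique($urls));
}
```

Calling collectScriptUrls($html) with the HTML you already fetched returns a de-duplicated list of URLs. Restricting the regex to the contents of script elements avoids picking up .js links from unrelated markup, at the cost of missing URLs that are built by string concatenation at runtime.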


Demo: http://ideone.com/Fwf6Mb


Regex explanation:

<script.*?(http.*?\.js(?:\?.*?)?)"
----------------------------------
Match the character string “<script” literally «<script»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regex below and capture its match into backreference number 1 «(http.*?\.js(?:\?.*?)?)»
   Match the character string “http” literally «http»
   Match any single character «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character “.” literally «\.»
   Match the character string “js” literally «js»
   Match the regular expression below «(?:\?.*?)?»
      Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
      Match the character “?” literally «\?»
      Match any single character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “"” literally «"»
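
To see what the pattern captures, here is a small sketch that runs it against a script tag adapted from the question (the trailing elided part is left out). The variable names $raw and $clean are only illustrative.

```php
<?php
// Sample input adapted from the question; a nowdoc keeps the backslashes literal.
$html = <<<'HTML'
<script type="text/javascript">
window._wpemojiSettings = {"source":{"concatemoji":"http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4"}};
</script>
HTML;

preg_match_all('/<script.*?(http.*?\.js(?:\?.*?)?)"/si', $html, $matches, PREG_PATTERN_ORDER);

// Group 1 holds the raw URL, still containing the escaped "\/" sequences.
$raw = $matches[1][0];

// Undo the JSON escaping of forward slashes.
$clean = str_replace('\\/', '/', $raw);

echo $raw, PHP_EOL;
// prints: http:\/\/domain.com\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.2.4
echo $clean, PHP_EOL;
// prints: http://domain.com/wp-includes/js/wp-emoji-release.min.js?ver=4.2.4
```

Because of the s flag, «.*?» also matches newlines, so the lazy «<script.*?» can run past the end of a script element until it reaches the first "http". On real pages this means the pattern may also capture .js URLs that appear outside script bodies.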

Regex tutorial

http://www.regular-expressions.info/tutorial.html