preg_match misses some ids while fetching data with cURL


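For learning purposes, I am trying to fetch data from the Steam Store: if the image game_header_image_full exists on a page, I have reached a game. Both alternatives below work, but each has a catch. One is really slow, and the other seems to miss some data and therefore does not write the URL to the text file.

For some reason, Simple HTML DOM manages to catch 9 URLs, whereas the second one (cURL) catches only 8 URLs with preg_match.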

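Question 1.

Is $reg written in a way that $html->find('img.game_header_image_full') would catch but my preg_match would not? Or is the problem somewhere else?

Question 2.

Am I doing things correctly here? I plan to go with the cURL alternative, but can I make it faster somehow?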

Simple HTML DOM Parser (time to search 100 ids: 1 minute, 39 seconds. Returns: 9 URLs.)

<?php
    include('simple_html_dom.php');
    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);
    while ($i++ < $times_to_run) {
        // Find target image
        $url = "http://store.steampowered.com/app/".$i;
        $html = file_get_html($url);
        $element = $html->find('img.game_header_image_full');
        if($i == $times_to_run) {
            echo "Success!";
        }
        foreach($element as $key => $value){
            // Check if image was found (strict comparison, so offset 0 is not read as "not found")
            if (strpos($value, 'img') !== false) {
                // Add (don't overwrite) to file steam.txt
                file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
            }
        }
    }
?>

vs. the cURL alternative (time to search 100 ids: 34 seconds. Returns: 8 URLs.)

<?php
    $i = 0;
    $times_to_run = 100;
    set_time_limit(0);
    while ($i++ < $times_to_run) {
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_URL, 'http://store.steampowered.com/app/'.$i);
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true);
        $content = curl_exec($ch);
        $url = "http://store.steampowered.com/app/".$i;
        $reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";
        if(preg_match($reg, $content)) {
            file_put_contents('steam.txt', $url.PHP_EOL , FILE_APPEND);
        }
    }
?>

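You should not use regex on HTML. It mostly works, but when it does not, you have to go through hundreds of pages to figure out which one failed and why, correct the regex, and then hope and pray that nothing like that ever happens again. Spoiler alert: it will.

Long story short, read this funny answer: RegEx match open tags except XHTML self-contained tags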
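
As a rough illustration of that failure mode, the sketch below runs the pattern from the question against three made-up snippets that are all valid HTML for the same element. The snippets are assumptions for demonstration only and are not taken from the Steam Store, so this does not claim to explain the one missing URL. It only shows that the pattern depends on details (quoting, spacing) that a parser does not care about.

```php
<?php
// Pattern from the question: requires a quoted class value directly after "class=".
$reg = "/<\\s*img\\s+[^>]*class=['\"][^'\"]*game_header_image_full[^'\"]*['\"]/i";

// Hypothetical snippets; each is valid HTML describing the same image.
$samples = [
    'quoted'   => '<img class="game_header_image_full" src="header.jpg">',
    'unquoted' => '<img class=game_header_image_full src="header.jpg">',
    'spaced'   => '<img class = "game_header_image_full" src="header.jpg">',
];

$results = [];
foreach ($samples as $name => $html) {
    // preg_match() returns 1 on a match, 0 on no match.
    $results[$name] = preg_match($reg, $html) === 1;
    echo $name, ': ', $results[$name] ? 'match' : 'no match', PHP_EOL;
}
```

Only the quoted variant satisfies the pattern. Every such variation would need its own tweak to the regex, which is the maintenance burden described above.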

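Do not use regular expressions to parse HTML. Use an HTML parser, which is a complex algorithm that does not rely on regular expressions and is reliable (as long as the HTML is valid). You are already using one in your first example. Yes, it is slow, because it does more than just search for a string in the document. But it is reliable. You can also use another implementation, especially a native one, such as http://php.net/manual/en/domdocument.loadhtml.php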
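
A minimal sketch of the native approach, assuming the page body is already available as a string (for example, the $content returned by curl_exec in the question). The helper name hasImageWithClass and the sample markup are hypothetical and used only for illustration. The class name is interpolated into the XPath query unescaped, so the sketch assumes it contains no quotes.

```php
<?php
// Hypothetical helper: true if $html contains an <img> whose class list includes $class.
function hasImageWithClass(string $html, string $class): bool
{
    if (trim($html) === '') {
        return false; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real pages often contain markup errors; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match $class as a whole token in the space-separated class attribute.
    // Assumes $class contains no double quotes.
    $query = sprintf(
        '//img[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    return $xpath->query($query)->length > 0;
}

// Made-up sample markup with an unquoted class attribute.
$sample = '<html><body><img class=game_header_image_full src="header.jpg"></body></html>';
var_dump(hasImageWithClass($sample, 'game_header_image_full'));
```

In the cURL loop from the question, the preg_match($reg, $content) condition could be swapped for a call like hasImageWithClass($content, 'game_header_image_full'). Whether that recovers the missing URL depends on what the server actually returned for that id.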