Web scraping - PHP doesn't let me output the HTML of certain sites, why?

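I'm trying to build a basic web scraper. It works on almost any site, but there are some sites I can't scrape. Why is that? Here's my code for a site that works (this site):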

<!doctype html>
<html lang="en-US">
  <body>
    <?php
      $url ='http://stackoverflow.com/';
      $output = file_get_contents($url);
      echo $output;
    ?>
  </body>
</html>

When I run this on my own localhost, it outputs the contents of stackoverflow.com into my page. Here's a site it doesn't work for:

<!doctype html>
<html lang="en-US">
  <body>
    <?php
      $url ='https://www.galottery.com/en-us/home.html';
      $output = file_get_contents($url);
      echo $output;
    ?>
  </body>
</html>

Instead of the site loading, I get this error:

Warning: file_get_contents(https://www.galottery.com/en-us/home.html): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in C:\xampp\htdocs\projects\QD\webScraping\index.php on line 6

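Why does this work for some sites and not others? I thought it might be because one is an HTTPS site, but I've tried this code on other HTTPS sites (such as https://google.com) and it works fine.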

I'm using XAMPP to run PHP locally.

This works:

<?php
$ops =  array(
    'http' => array(
        'method' => "GET",
        // Browser-like request headers; each header line must end with \r\n.
        'header' => "Accept-language: en\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\n" .
                    "Cookie: foo=bar\r\n" .
                    "User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.102011-10-16 20:23:10\r\n"
    )
);
$context = stream_context_create($ops);
echo file_get_contents('https://www.galottery.com/en-us/home.html', false, $context);
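
The header string above is easy to get wrong, since every line has to end with \r\n. A minimal sketch of building it from an associative array instead; build_header_string() is a hypothetical helper name, not part of PHP:

```php
<?php
// Hypothetical helper: turn ['Name' => 'value', ...] into the "\r\n"-terminated
// string expected by the 'header' option of the http stream context.
function build_header_string(array $headers): string
{
    $lines = [];
    foreach ($headers as $name => $value) {
        // Strip CR/LF so a value cannot smuggle in an extra header line.
        $lines[] = $name . ': ' . str_replace(["\r", "\n"], '', (string) $value);
    }
    return $lines ? implode("\r\n", $lines) . "\r\n" : '';
}
```

The result can be passed as the 'header' value in the array given to stream_context_create(), in place of the hand-written string.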

Either they are checking the User-Agent, or they have banned your IP address.
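
One way to tell the two cases apart is to compare the status code returned with and without a browser-like User-Agent. The following is a rough sketch: parse_status_code() and fetch_status() are hypothetical helper names, and it relies on $http_response_header, which PHP fills in after an HTTP request made through the stream wrapper (newer PHP versions also offer http_get_last_response_headers()).

```php
<?php
// Hypothetical helper: pull the numeric status code out of a list of response
// header lines such as "HTTP/1.1 403 Forbidden".
function parse_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // After redirects there can be several status lines; keep the last one.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: request $url with the given User-Agent and return the
// final HTTP status code, or null if no response was received at all.
function fetch_status(string $url, string $userAgent): ?int
{
    $context = stream_context_create([
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $userAgent,
            'ignore_errors' => true, // do not fail on 4xx/5xx, so headers stay readable
        ],
    ]);
    @file_get_contents($url, false, $context);
    // $http_response_header is set by PHP in this function's scope.
    return isset($http_response_header) ? parse_status_code($http_response_header) : null;
}
```

If fetch_status() returns 403 with an empty User-Agent but 200 with a browser string, the site is most likely filtering on the User-Agent. If both calls return 403, an IP block or some other check is more likely.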

To send a proper User-Agent, you can use curl, like this:

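$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, false);          // do not include response headers in the output
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow HTTP redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, true);      // set Referer automatically when redirected
curl_setopt($ch, CURLOPT_URL, "https://www.galottery.com/en-us/home.html");
$result = curl_exec($ch);
if ($result === false) {
    // curl_exec() returns false when the transfer itself fails
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo $result;
}
curl_close($ch);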

That said, they might be using a JavaScript redirect. For example: you load the page, they set a cookie and do a document.location.href redirect, and then they check for that cookie.
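
If the cookie is sent in a Set-Cookie response header, curl can store it and send it back on the next request. A minimal sketch of that, assuming the curl extension is loaded; cookie_aware_options() and fetch_with_cookies() are hypothetical helper names, and the cookie file path is up to the caller:

```php
<?php
// Hypothetical helper: cURL options for a request that keeps and resends cookies.
function cookie_aware_options(string $url, string $userAgent, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follows HTTP redirects, not JavaScript ones
        CURLOPT_COOKIEFILE     => $cookieFile, // turns on the cookie engine, loads saved cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // saves cookies when the handle is closed
    ];
}

// Hypothetical helper: request the page twice on one handle, so a cookie set by
// the first response is sent along with the second request.
function fetch_with_cookies(string $url, string $userAgent, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, cookie_aware_options($url, $userAgent, $cookieFile));
    curl_exec($ch);         // first pass: the server may set its cookie here
    $body = curl_exec($ch); // second pass: the stored cookie is sent back
    curl_close($ch);
    return $body;           // string on success, false on failure
}
```

curl does not execute JavaScript, so this only helps when the cookie arrives in a response header. A cookie written by a script through document.cookie, or a redirect done with document.location.href, would need a headless browser or a hand-built request instead.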

Update: just tested it, and my solution works fine.