How to get specific content from a cross-domain HTTP request

There is a Dutch news website, nu.nl, and I am very interested in getting the URL of the first headline on it:

<h3 class="hdtitle">
          <a style="" onclick="NU.AT.internalLink(this, event);" xtclib="position1_article_1" href="/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html">
            Griekse hotels ontruimd om bosbranden            <img src="/images/i18n/nl/slideshow/bt_fotograaf.png" class="vidlinkicon" alt="">          </a>
        </h3>

So my question is: how do I get this URL? Can I do it with jQuery? I suspect not, because the page is not on my server. So maybe I have to use PHP? Where do I start?

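Tested and working

Because http://www.nu.nl is not your site, you can use a PHP proxy to do the cross-domain GET; otherwise you will get an error like this:

XMLHttpRequest cannot load http://www.nu.nl/. Origin http://yourdomain.com is not allowed by Access-Control-Allow-Origin.

First, put this file on your server, on the PHP side:

proxy.php (updated)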

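<?php
// Minimal proxy: fetch the page named in ?site=... and echo it back.
// Only http(s) URLs are accepted, so the script cannot be used to read local files.
if (isset($_GET['site']) && preg_match('#^https?://#i', $_GET['site'])) {
  $f = fopen($_GET['site'], 'r');
  if ($f !== false) {
    $html = '';
    while (!feof($f)) {
      $html .= fread($f, 24000);
    }
    fclose($f);
    echo $html;
  }
}
?>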

Now, on the JavaScript side, using jQuery, you can do the following:

(Note that I am using prop() because I am on jQuery 1.7.2. If you are using a version older than 1.6, use attr() instead.)

$(function(){
   var site = 'http://www.nu.nl';
   $.get('proxy.php', { site:site }, function(data){
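      // prop('href') returns an absolute URL resolved against YOUR domain,
      // so the host part (url[2]) is swapped for nu.nl below
      var href = $(data).find('.hdtitle').first().children(':first-child').prop('href');
      var url = href.split('/');
      href = href.replace(url[2], 'nu.nl');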
      // Put the 'href' inside your div as a link
      $('#myDiv').html('<a href="' + href + '" target="_blank">' + href + '</a>');
   }, 'html');
});

As you can see, the request now goes to your own domain. It is a bit of a trick, but it means you will no longer get the Access-Control-Allow-Origin error!
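As an alternative to swapping the host with string replacement, here is a minimal sketch that resolves the link against the remote site's base URL. It assumes you read the raw attribute with attr('href') (which gives the relative path as written in the page) and that the built-in URL constructor is available, as it is in current browsers and Node.js. The helper name toAbsoluteUrl is hypothetical and not part of the original answer.

```javascript
// Hypothetical helper: turn a link taken from the fetched page into an
// absolute URL on the remote site.
function toAbsoluteUrl(href, base) {
  // new URL(relative, base) resolves href against base and also leaves
  // already-absolute links untouched
  return new URL(href, base).href;
}

// Example with the path from the question's HTML snippet
console.log(
  toAbsoluteUrl(
    '/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html',
    'http://www.nu.nl/'
  )
);
```

With this approach the link keeps the www.nu.nl host of the base URL instead of relying on the position of the host in a split string.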

Update

If you want the href of every headline, as you wrote in the comments, you can do the following:

Just change the jQuery code like this...

$(function(){
   var site = 'http://www.nu.nl';
   $.get('proxy.php', { site:site }, function(data){
        // get all html headlines
        var headlines = $(data).find('.hdtitle');
        // get 'href' attribute of each headline and put it inside div
        headlines.each(function(){
            var href = $(this).children(':first-child').prop('href');
            var url = href.split('/');
            href = href.replace(url[2], 'nu.nl');
            $('#myDiv').append('<a href="' + href + '" target="_blank">' + href + '</a><br/>');
        });
   }, 'html');
});

And use the updated proxy.php file (it works for both cases, one headline or all of them).

Hope this helps :-)

You can use the simplehtmldom library to get that link.

Something like this:

include 'simple_html_dom.php';
$html = file_get_html('website_link');
// hdtitle is a class, not an id, so select the first matching link with a CSS selector
echo $html->find('h3.hdtitle a', 0)->href;

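See the simplehtmldom documentation for more.

I would suggest using RSS, but unfortunately the headline you are looking for does not seem to appear there.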

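<?php
$f = fopen('http://www.nu.nl', 'r');
$html = '';
// read until the marker shows up; stop at end of file so a missing marker cannot loop forever
while (!feof($f) && strpos($html, 'position1_article_1') === false) {
    $html .= fread($f, 24000);
}
fclose($f);
$pos = strpos($html, 'position1_article_1');
// skip the 19-character marker plus the 8 characters of '" href="'
$urlleft = substr($html, $pos + 27);
$url = substr($urlleft, 0, strpos($urlleft, '"'));
echo 'http://www.nu.nl' . $url;
?>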

Output: http://www.nu.nl/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html
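To make the offset of 27 in the PHP snippet less magic, here is a small JavaScript sketch of the same marker-and-offset technique. It assumes the markup looks like the snippet in the question, where the xtclib attribute is immediately followed by `" href="`. The function name urlAfterMarker and the sample string are illustrative only.

```javascript
// The marker that identifies the first headline link, and the text that
// sits between the marker and the start of the URL in the page source.
const marker = 'position1_article_1'; // 19 characters
const gap = '" href="';               // 8 characters, so 19 + 8 = 27

// Hypothetical helper: return the URL that follows the marker, or null.
function urlAfterMarker(html) {
  const pos = html.indexOf(marker);
  if (pos === -1) return null;        // marker not present in the page
  const start = pos + marker.length + gap.length;
  const end = html.indexOf('"', start);
  return end === -1 ? null : html.slice(start, end);
}

const sample =
  '<a style="" xtclib="position1_article_1" ' +
  'href="/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html">';
console.log('http://www.nu.nl' + urlAfterMarker(sample));
```

Like the PHP version, this breaks as soon as the site changes the attribute order or the marker name, so treat it as a quick hack, not a robust parser.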

Retrieve the page with cURL. Then run the following function on the string you get back:

preg_match('/<a.*?href="(.*?)".*?>/is', $text, $matches);

The resulting URL will be in the $matches array, in $matches[1].
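For readers working on the JavaScript side, the same capture-group idea can be sketched as follows. This is only an illustration under the assumption that the href value is wrapped in double quotes, as in the question's snippet. The function name firstHref is hypothetical, and the pattern is a slightly stricter variant of the one above, not an exact port.

```javascript
// Hypothetical helper: return the href of the first <a> tag in a string
// of HTML, or null when no link with a double-quoted href is found.
function firstHref(html) {
  // [^>]*? keeps the match inside a single tag; group 1 is the href value
  const match = /<a\b[^>]*?\bhref="([^"]*)"/i.exec(html);
  return match ? match[1] : null;
}

const snippet =
  '<h3 class="hdtitle">' +
  '<a style="" xtclib="position1_article_1" ' +
  'href="/buitenland/2880252/griekse-hotels-ontruimd-bosbranden.html">' +
  'Griekse hotels ontruimd om bosbranden</a></h3>';
console.log(firstHref(snippet));
```

As with the PHP call, index 0 of the match holds the whole matched text and index 1 holds the captured URL.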

If you want to set up a jQuery bot to scrape the page through the browser (a Google Chrome extension allows this):

// print out the found anchor link's href attribute
console.log($('.hdtitle').find('a').attr('href'));

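If you want to use PHP, you need to scrape the page for this href link. Use a library such as SimpleTest for this. The best way to scrape periodically is to hook your PHP script up to a cron job.

SimpleTest: http://www.lastcraft.com/browser_documentation.php

Cron jobs: http://net.tutsplus.com/tutorials/php/managing-cron-jobs-with-php-2/

Good luck!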