Scraping HTML using PHP

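I am trying to scrape the following pieces of text from the HTML given below.

Scrape:

  1. Promise Me This (Between Breaths, #4)

  2. The src of the image, which is http://d.gr-assets.com/books/1402555544l/22077246.jpg

  3. A new love will test the boundaries of passion between a privileged boy next door and the tattooed, blue-haired girl who helps him embrace his wild side... Nate has developed quite a playboy reputation around campus. It's not that he doesn't respect or trust women; he doesn't trust himself. The men in Nate's family are prone to abusive behavior - a dirty secret that Nate's been running from his entire life - so Nate doesn't do relationships. But he can't help himself around one girl… Jessie is strong, independent, and works at a tattoo parlor. Nate can't resist getting close to her, even if it's strictly a friendship. But it doesn't take long for Nate to admit that what he wants with Jessie is more than just friendly. With Jessie, he can be himself and explore what he's always felt was a terrifying darkness inside him. Even when Nate begins to crave her in a way that both shocks and horrifies him, Jessie still wants to know every part of him. Testing their boundaries together will take a trust that could render them inseparable… or tear them apart.

HTML: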

<div class="leftAlignedImage bookBox">
<div class="coverWrapper" id="bookCover646987_22077246">
<a href="/book/show/22077246-promise-me-this"><img alt="Promise Me This (Between Breaths, #4)" class="bookImage" src="https://i.stack.imgur.com/NXMoh.jpg" title="" width="115" /></a>
</div>
<script type="text/javascript">
//<![CDATA[
      var newTip = new Tip($('bookCover646987_22077246'), "\n\n  <h2><a href=\"http://www.goodreads.com/book/show/22077246-promise-me-this?from_choice=false&amp;from_home_module=false\" class=\"readable\">Promise Me This (Between Breaths, #4)<\/a><\/h2>\n\n  <div>\n    by <a href=\"/author/show/7060187.Christina_Lee\" class=\"authorName\">Christina  Lee<\/a><span title=\"Goodreads Author!\">*<\/span>\n  <\/div>\n  <div class=\"smallText uitext darkGreyText\">\n    <span class=\"minirating\"><span class=\"stars staticStars\"><a class=\"staticStar p10\" size=\"12x12\" title=\"4.13 of 5 stars\">4.13 of 5 stars<\/a><a class=\"staticStar p10\" size=\"12x12\" title=\"4.13 of 5 stars\"><\/a><a class=\"staticStar p10\" size=\"12x12\" title=\"4.13 of 5 stars\"><\/a><a class=\"staticStar p10\" size=\"12x12\" title=\"4.13 of 5 stars\"><\/a><a class=\"staticStar p3\" size=\"12x12\" title=\"4.13 of 5 stars\"><\/a><\/span> 4.13 avg rating &mdash; 388 ratings<\/span>\n    &mdash; published 2014\n  <\/div>\n\n    <div class=\"addBookTipDescription\">\n      \n<span id=\"freeTextContainer3494377565927542800\" class=\"elementOne\">\n  A new love will test the boundaries of passion between a privileged boy next door and the tattooed, blue-haired girl who helps him embrace his wild side...\n\n\nNate has developed quite a playboy reputation around campus. It\'s not that he doesn\'t respect or trust women; he doesn\'t trust himself. The men<\/span>\n  <span id=\"freeText3494377565927542800\" class=\"elementTwo\" style=\"display:none\">\n  A new love will test the boundaries of passion between a privileged boy next door and the tattooed, blue-haired girl who helps him embrace his wild side...\n\n\nNate has developed quite a playboy reputation around campus. It\'s not that he doesn\'t respect or trust women; he doesn\'t trust himself. The men in Nate’s family are prone to abusive behavior—a dirty secret that Nate’s been running from his entire life—so Nate doesn\'t do relationships. But he can’t help himself around one girl…\n\nJessie is strong, independent, and works at a tattoo parlor. Nate can’t resist getting close to her, even if it’s strictly a friendship. But it doesn\'t take long for Nate to admit that what he wants with Jessie is more than just friendly.\n\nWith Jessie, he can be himself and explore what he’s always felt was a terrifying darkness inside him. Even when Nate begins to crave her in a way that both shocks and horrifies him, Jessie still wants to know every part of him. Testing their boundaries together will take a trust that could render them inseparable… or tear them apart.<\/span>\n  <a data-text-id=\"3494377565927542800\" href=\"#\" onclick=\"swapContent($(this));; return false;\">...more<\/a>\n    <\/div>\n\n\n\n", { style: 'addbook', stem: 'leftMiddle', hook: { tip: 'leftMiddle', target: 'rightMiddle' }, hideOn: false, width: 400, hideAfter: 0.05, delay: 0.35 });
  $('bookCover646987_22077246').observe('prototip:shown', function() {
    if (this.up('#box')) {
      $$('div.prototip')[0].setStyle({zIndex: $('box').getStyle('z-index')});
    } else {
      $$('div.prototip')[0].setStyle({zIndex: 6000});
    }
  });
  newTip['wrapper'].addClassName('prototipAllowOverflow');
    $('bookCover646987_22077246').observe('prototip:shown', function () {
      $$('div.prototip').each(function (e) {
        if ($('bookCover646987_22077246').hasClassName('ignored')) {
          e.setStyle({'display': 'none'});
          return;
        }
        e.setStyle({'overflow': 'visible'});
      });
    });
  $('bookCover646987_22077246').observe('prototip:hidden', function () {
    $$('span.elementTwo').each(function (e) {
      if (e.getStyle('display') !== 'none') {
        var lessLink = e.next();
        swapContent(lessLink);
      }
    });
  });
//]]>
</script>

            </div>

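I am new to PHP and XAMPP and have looked online for help, but to no avail. I have started Apache from the XAMPP control panel and created a save.php page containing the following: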

<?php
$html = file_get_contents('http://www.goodreads.com/genres/new_releases/art');
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXpath( $doc);
$node = $xpath->query( '//div[@name="coverWrapper"]')->item( 0);
echo $node->textContent;
?>

This gives me the following error on line 11:
Error: Trying to get property of non-object in C:\xampp\htdocs\xampp\ind\save.php on line 11

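For something this simple, I would save myself the headache and skip XPath... You are already reading the HTML into a text string, so it will probably be easier to handle $html as a string. For example: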

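You know that, on the page you are looking at, the book title sits between class=\"readable\"> (which appears only once in the document) and the <\/a> that follows it.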
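The title lookup described above can be sketched as follows. This is only an illustration: the $html sample is a shortened stand-in for the fetched page, between() is a hypothetical helper name, and it assumes the quotes and slashes in the raw source are backslash-escaped because the markup sits inside a JavaScript string.

```php
<?php
// Shortened stand-in for the fetched page (assumption: the real source escapes
// quotes and slashes with backslashes, as inside the page's JavaScript string).
$html = '<h2><a href=\"http://www.goodreads.com/book/show/22077246-promise-me-this\" class=\"readable\">Promise Me This (Between Breaths, #4)<\/a><\/h2>';

/**
 * Return the text between the first $startMarker and the next $endMarker,
 * or null when either marker is missing. (Hypothetical helper, not a PHP built-in.)
 */
function between(string $haystack, string $startMarker, string $endMarker): ?string
{
    $start = strpos($haystack, $startMarker);
    if ($start === false) {
        return null;
    }
    $start += strlen($startMarker);               // move past the marker itself
    $end = strpos($haystack, $endMarker, $start); // search only after the start marker
    if ($end === false) {
        return null;
    }
    return substr($haystack, $start, $end - $start);
}

// Single-quoted PHP strings keep the backslashes literally, so these markers
// match the escaped markup byte for byte.
$title = between($html, 'class=\"readable\">', '<\/a>');
echo $title, PHP_EOL;
```

If the page ever stops escaping its markup, the markers would need to change to the unescaped forms, which is exactly the kind of fragility that comes with string slicing.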

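For the image, there is only one img tag, so the first src attribute after it should always belong to that tag. Code along the following lines will cut it out quickly.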

$imgStart = stripos($html, '<img');              // offset of the start of the img tag
$srcStart = stripos($html, 'src="', $imgStart);  // offset of the first src=" after it
$srcStart += 5;                                  // skip the five characters of src="
$srcEnd = strpos($html, '"', $srcStart);         // offset of the closing quote
$imgSrc = substr($html, $srcStart, $srcEnd - $srcStart);
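The same idea can be wrapped with checks for missing markers, since stripos and strpos return false when nothing matches. This is a sketch only: firstImageSrc() is a hypothetical helper name, and the sample string is the img tag copied from the question.

```php
<?php
// Sample markup copied from the question; a real run would use the fetched page.
$html = '<a href="/book/show/22077246-promise-me-this"><img alt="Promise Me This (Between Breaths, #4)" class="bookImage" src="https://i.stack.imgur.com/NXMoh.jpg" title="" width="115" /></a>';

// Hypothetical helper: returns the first double-quoted src value found
// after the first img tag, or null if any of the markers is missing.
function firstImageSrc(string $html): ?string
{
    $imgStart = stripos($html, '<img');
    if ($imgStart === false) {
        return null;                         // no img tag at all
    }
    $srcStart = stripos($html, 'src="', $imgStart);
    if ($srcStart === false) {
        return null;                         // no src=" after the img tag
    }
    $srcStart += strlen('src="');            // skip past the attribute name and quote
    $srcEnd = strpos($html, '"', $srcStart);
    if ($srcEnd === false) {
        return null;                         // attribute value never closes
    }
    return substr($html, $srcStart, $srcEnd - $srcStart);
}

echo firstImageSrc($html), PHP_EOL;
```

Comparing against false with === matters here, because a match at offset 0 is a valid result that a loose comparison would treat as "not found".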

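Super robust? No... but you are screen scraping, and there is no truly robust way to do that, because you always depend heavily on the exact structure or syntax of someone else's code.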
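If the string slicing proves too brittle, the DOMXPath attempt from the question can also be made to work. The error there comes from the query matching on a name attribute that the div does not have, so item(0) returns null and reading textContent from it fails. The following is a minimal sketch, assuming the cover markup looks like the sample in the question (embedded here in place of a live fetch) and that matching on class="coverWrapper" is acceptable.

```php
<?php
// Sample markup copied from the question; a real run would fetch the page instead.
$html = <<<'HTML'
<div class="leftAlignedImage bookBox">
<div class="coverWrapper" id="bookCover646987_22077246">
<a href="/book/show/22077246-promise-me-this"><img alt="Promise Me This (Between Breaths, #4)" class="bookImage" src="https://i.stack.imgur.com/NXMoh.jpg" title="" width="115" /></a>
</div>
</div>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match on class, not name: the div in the question has no name attribute.
$img = $xpath->query('//div[@class="coverWrapper"]//img')->item(0);

$title = null;
$src = null;
if ($img instanceof DOMElement) {
    $title = $img->getAttribute('alt');   // the cover's alt text carries the book title
    $src = $img->getAttribute('src');
    echo $title, PHP_EOL, $src, PHP_EOL;
} else {
    echo 'No cover image found', PHP_EOL; // item(0) returned null
}
```

Checking the result of item(0) before using it avoids the non-object error whenever the page layout changes and the query stops matching.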

Also make sure that the terms of use of the site you are scraping allow it. A lot of sites take a very dim view of this.