PHP saveHTML function is not saving HTML properly

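I have been trying to use PHP to save the source code of part of a web page. When I extract the content of the whole page, the order of the source is preserved, but when I try to get just part of the document using:
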
$dom = new DOMDocument;
$dom->loadHTML($webpage);
$xpath = new DOMXPath($dom);
$query_tag = "//div[contains(@class, 'class-name')]";
$result = $dom->saveHTML($xpath->query($query_tag)->item(0));

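…the script tags get messed up. So far, this is the only website where I have seen this problem. Does the saveHTML function have limitations I am not aware of?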

This is what I should be getting:

<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
        var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}
        $('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
                     $('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onClick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96" /></a>');
                                   $('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
         $('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
         $('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);

});</script> </div>

This is what I actually get:

<div id="sponsored-category-header" class="page-header sponsored-category-header clear"> <script type="text/javascript">jQuery(document).ready(function($) {
        var cat_head_params = {"sponsor":"SEO PowerSuite","sponsor_logo":"https:\/\/www.searchenginejournal.com\/wp-content\/plugins\/abm-sej\/includes\/category-images\/SPS_128.png","sponsor_text":"<div class=\"taxonomy-description\">Dominate Google local search results with ease! Get your copy of SEO PowerSuite and keep <a rel=\"nofollow\" href=\"http:\/\/sejr.nl\/PowerSuite-2016-5\" onClick=\"__gaTracker('send', 'event', 'Sponsored Category Click Var 1', 'Local Search', 'SEO PowerSuite');\" target=\"_blank\">your local SEO strategy<\/a> up to par.<\/div>","logo_url":"http:\/\/sejr.nl\/PowerSuite-2016-5","ga_labels":["Local Search","SEO PowerSuite"]}
        $('#sponsored-category-header').append('<div class="sponsored-category-logo"></script>

</div>');
                     $('#sponsored-category-header .sponsored-category-logo').append(' <a rel="nofollow" href="'+cat_head_params.logo_url+'" onclick="__gaTracker(\'send\', \'event\', \'Sponsored Category Click Var 1\', \''+cat_head_params.ga_labels[0]+'\', \''+cat_head_params.ga_labels[0]+'\');" target="_blank"><img class="nopin" src="'+cat_head_params.sponsor_logo+'" width="96" height="96"></a>');
                                   $('#sponsored-category-header').append('<div class="sponsored-category-details"></div>');
         $('#sponsored-category-header .sponsored-category-details').append('<h3 class="page-title sponsored-category-title">'+cat_head_params.sponsor+'</h3>');
         $('#sponsored-category-header .sponsored-category-details').append(cat_head_params.sponsor_text);

    }); </div>

In case you missed it, the closing script tag has moved up a few lines.

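Just to be clear, I am not talking about the rendered HTML. I am talking about the actual source code I get back after making the request. I would appreciate any help in resolving this.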

I know the saveHTML function is causing the problem, because when I return the whole page through PHP every tag is in the right place.

First of all, your code should be triggering a bunch of warnings like these:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity
Warning: DOMDocument::loadHTML(): Unexpected end tag : strong in Entity
Warning: DOMDocument::loadHTML(): Tag header invalid in Entity

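That is what HTML in the wild looks like (and this page's code is not particularly bad), but you do not even mention the warnings, which makes me suspect that you do not have error reporting enabled on your development box.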
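As a rough sketch of how to make those parser complaints visible, the snippet below turns on full error reporting and collects libxml's messages through `libxml_use_internal_errors()` and `libxml_get_errors()` instead of letting them surface as PHP warnings. The markup in `$webpage` is made up for the example, and the exact messages (or whether any are reported at all) depend on the libxml2 version PHP was built against.

```php
<?php
// Show every notice and warning while developing.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Deliberately sloppy markup, invented for this example.
$webpage = '<div class="class-name"><p>Fish & chips</strong></p></div>';

// Collect libxml's parse problems instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($webpage);

$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with line, column, level and message.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```

Collecting the errors this way keeps the log readable when a page produces dozens of them, while still letting you inspect what the parser had to repair.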

Additionally, the page has a lot of JavaScript, and DOMDocument is just an HTML parser.

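With that in mind, we can get a clear picture of what is happening. Since DOMDocument is not a full-fledged browser, it does not understand JavaScript code. That means it detects the <script> tag but does not treat its contents as JavaScript: it just looks for a closing tag, and the first one it finds is this: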

$('#sponsored-category-header').append('<div class="sponsored-category-logo"></div>');
                                                                             ^^^^^^

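It does not know that this is part of a JavaScript string and should be ignored. Instead, it thinks the wrong tag is being closed, so it tries to repair the technically invalid HTML by adding the missing </script> tag.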

For this very reason, <script>...</script> blocks were traditionally written like this:

<script type="text/javascript"><!--
var foo = '<p>Escaped end tag<\/p>';
//--></script>

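…so that user agents that did not know JavaScript could safely ignore the entire tag (hey, it is just a good old HTML comment). Nowadays, however, this is almost universally considered bad practice because "all browsers understand JavaScript".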
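Since you cannot change the remote page, one possible workaround is to apply that kind of end-tag escaping yourself before handing the markup to DOMDocument. The sketch below is only an illustration under some assumptions: `escapeEndTagsInScripts()` is a hypothetical helper (it is not part of PHP or the DOM extension), it locates script blocks with a regular expression, and it assumes every `</` inside a script body sits in a JavaScript string, where `<\/` means the same thing. The sample `$webpage` is made up.

```php
<?php
// Hypothetical helper: rewrites "</" as "<\/" inside <script> bodies so that
// the only end tag the parser can see there is the real </script>.
function escapeEndTagsInScripts(string $html): string
{
    $escaped = preg_replace_callback(
        '~(<script\b[^>]*>)(.*?)(</script>)~is',
        static function (array $m): string {
            return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $escaped ?? $html;
}

// Made-up page fragment with an end tag inside a JavaScript string.
$webpage = <<<'HTML'
<div class="class-name"><script type="text/javascript">
$('#box').append('<div class="logo"></div>');
</script></div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML(escapeEndTagsInScripts($webpage));
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$query_tag = "//div[contains(@class, 'class-name')]";
$result = $dom->saveHTML($xpath->query($query_tag)->item(0));

echo $result, "\n";
```

The saved fragment will contain `<\/div>` where the page had `</div>`, which is equivalent inside a JavaScript string but is not a byte-for-byte copy of the original source. Scripts that use `</` outside of strings would need more careful handling than this.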

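Final note: the DOM extension is probably aware of the <script> tag and knows that it is not allowed to contain other tags. That would explain why the inner opening tag is not taken into account.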
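A small check along those lines, using made-up markup whose script body contains a start tag but no `</` sequence: if the parser treats script content as plain text, no `<b>` element should appear in the tree and the tag should still be present in the script's text.

```php
<?php
// Invented markup: a start tag inside a script body, with no "</" before </script>.
$webpage = '<div class="class-name"><script>var s = "<b>bold";</script></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($webpage);
libxml_clear_errors();

$script = $dom->getElementsByTagName('script')->item(0);

// Number of <b> elements the parser created anywhere in the document.
$boldElements = $dom->getElementsByTagName('b')->length;

// Raw text the parser kept inside the <script> element.
$scriptText = $script->textContent;

var_dump($boldElements, $scriptText);
```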