DOMDocument saveHTML() adding extra quotes and some other URL-encoded characters

I have been using PHP's DOMDocument extension to find image tags that have no alt attribute or an empty one. Here is the HTML I am using for testing:

<span style="font-weight:bold;">Blender</span> is an Open Source 3D modelling and animation software. 
This is a very popular software among hobbyists.<i>Blender</i> has a vast list of features which include bones and meshing, textures, particle physics etc.
<u>Blender</u> was originally a proprietary software which was eventually made opensource. 
Blender is known to be difficult to learn because its interface is very intimiding to a newbie. 
But on the other hand, <a href="http://www.blender.org">Blender</a> is so much customizable that you can actually modify your workspace according to your personal preference. 
Also blender interface has been developed in the OpenGL graphics library, so blender looks all the same on all platforms whether you use Windows, Linux, BSD or even Mac. 
3D is a very interesting field to work with but 3D is somewhat tough to start with. You can <a href="http://www.google.com"" target="_blank">Google</a> for numerous tutorials on Blender. 
There are quite some awesome websites dedicated to blender development, such as BlenderGuru.com. <img src="http://www.cochinsquare.com/wp-content/uploads/2010/08/Blender.jpg">

Here is the DOMDocument code I use to find the img tags and add an alt attribute:

$dom=new DOMDocument();
$dom->loadHTML($content);
$dom->formatOutput = true;
$imgs = $dom->getElementsByTagName("img");
foreach($imgs as $img){
 $alt = $img->getAttribute('alt');
 if ($alt == ''){
  $k_alt = $this->keyword;    
 }else{
  $k_alt = $alt;
 }
 $img->setAttribute( 'alt' , $k_alt );
}
$html_mod = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $dom->saveHTML()));
return $html_mod;

This is the HTML I get back:

<span style='"font-weight:bold;"'>Blender</span> is an Open Source 3D modelling and animation software. 
This is a very popular software among hobbyists.<i>Blender</i> has a vast list of features which include bones and meshing, textures, particle physics etc.
<u>Blender</u> was originally a proprietary software which was eventually made opensource. 
Blender is known to be difficult to learn because its interface is very intimiding to a newbie. 
But on the other hand, <a href=""http://www.blender.org"">Blender</a> is so much customizable that you can actually modify your workspace according to your personal preference. 
Also blender interface has been developed in the OpenGL graphics library, so blender looks all the same on all platforms whether you use Windows, Linux, BSD or even Mac. 
3D is a very interesting field to work with but 3D is somewhat tough to start with. You can <a href=""http://www.google.com""" target='"_blank"'>Google</a> for numerous tutorials on Blender. 
There are quite some awesome websites dedicated to blender development, such as BlenderGuru.com. 
<img src=""http://www.cochinsquare.com/wp-content/uploads/2010/08/Blender.jpg"" alt="Blender">

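Notice the extra quotes (both single and double) in the img src, the anchor tags, and the span's style attribute.

Please help! I want the HTML returned intact, with only the new alt attribute added.

I should also mention that I am using PHP 5.3.2 with the Suhosin patch on Ubuntu 10.04.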

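I finally found a solution to this problem and want to share my approach.

To avoid the extra quotes that show up after saveHTML(), run html_entity_decode() on the result of saveHTML(), for example: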

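$filecontent = file_get_contents('file.html');
$doc = new DOMDocument();
$doc->loadHTML($filecontent);
$xpath = new DOMXPath($doc);
// Attribute tests in XPath need the @ prefix: [@id='bg'], not [id='bg'].
$node = $xpath->query("//*[@id='bg']")->item(0);
if ($node !== null) {
    $node->nodeValue = 'asd';
}
// Decode the entities in the saveHTML() output before writing the file back.
$filecontent = html_entity_decode($doc->saveHTML());
file_put_contents('file.html', $filecontent);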

That way you end up with correct HTML in the $filecontent variable, without the extra quotes. You're welcome!
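
For the alt-attribute task from the question, the same decoding step could be combined with serializing only the fragment, rather than stripping the doctype, html and body tags with str_replace and preg_replace. The following is a minimal sketch, not the original poster's code. The helper name addMissingAlt, the fragment-root wrapper id and the sample input are made up for illustration. It assumes PHP 7 or later (scalar type declarations), and the node argument to saveHTML() needs PHP 5.3.6 or later, so it would not run unchanged on the PHP 5.3.2 mentioned in the question.

```php
<?php
// Hypothetical helper, not from the original post: gives every <img> that has
// no alt text (or an empty one) the fallback value, and returns only the
// fragment, without the doctype/html/body wrapper that loadHTML() adds.
function addMissingAlt(string $content, string $fallbackAlt): string
{
    // Buffer parser warnings instead of emitting them; invalid markup such as
    // the stray quote in the question's sample would otherwise raise warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Wrap the fragment in a container that can be found again after parsing.
    // Assumption: $content is plain ASCII or declares its own charset.
    $dom->loadHTML('<div id="fragment-root">' . $content . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        if (trim($img->getAttribute('alt')) === '') {
            $img->setAttribute('alt', $fallbackAlt);
        }
    }

    $xpath = new DOMXPath($dom);
    $root = $xpath->query("//*[@id='fragment-root']")->item(0);

    // Serialize only the children of the container.
    $html = '';
    foreach ($root->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }

    // The decoding step suggested in the answer above.
    return html_entity_decode($html);
}

echo addMissingAlt('<p>Blender</p><img src="blender.jpg">', 'Blender');
```

One thing to keep in mind with this approach: html_entity_decode() decodes every entity in the output, including ones in text content such as &lt; or &amp;, so it may be worth checking that the content contains no entities that have to stay encoded.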