PHP preg_replace: MS Word image tags only

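I have a client who uses Word to produce newsletters and then copies the HTML into MailChimp to send them.
Word has all its weird and wonderful ideas about formatting, and I need to keep most of them so that the result matches what he sees in Word.

The only real problem is how MS Word inserts images. Here is a snippet: it adds both a <!--[if gte vml 1]> block and an <![if !vml]> block: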

<td width=640 style='width:480.0pt;border-top:solid #1F497D 1.0pt;mso-border-top-themecolor: text2;border-left:none;border-bottom:solid #1F497D 1.0pt;mso-border-bottom-themecolor: text2;border-right:none;background:#1F497D;mso-background-themecolor:text2; padding:0cm 0cm 0cm 0cm;height:26.6pt'>
<p class=MsoNormal align=center style='text-align:center'><b style='mso-bidi-font-weight:normal'><span style='font-family:"Arial","sans-serif"; mso-ansi-language:EN-NZ;mso-fareast-language:EN-NZ;mso-no-proof:yes'><!--[if gte vml 1]><v:shapetype id="_x0000_t75" coordsize="21600,21600" o:spt="75" o:preferrelative="t" path="m@4@5l@4@11@9@11@9@5xe" filled="f" stroked="f">
<v:stroke joinstyle="miter"/>
<v:formulas>
<v:f eqn="if lineDrawn pixelLineWidth 0"/>
<v:f eqn="sum @0 1 0"/>
<v:f eqn="sum 0 0 @1"/>
<v:f eqn="prod @2 1 2"/>
<v:f eqn="prod @3 21600 pixelWidth"/>
<v:f eqn="prod @3 21600 pixelHeight"/>
<v:f eqn="sum @0 0 1"/>
<v:f eqn="prod @6 1 2"/>
<v:f eqn="prod @7 21600 pixelWidth"/>
<v:f eqn="sum @8 21600 0"/>
<v:f eqn="prod @7 21600 pixelHeight"/>
<v:f eqn="sum @10 21600 0"/>
</v:formulas>
<v:path o:extrusionok="f" gradientshapeok="t" o:connecttype="rect"/>
<o:lock v:ext="edit" aspectratio="t"/>
</v:shapetype><v:shape id="_x0000_i1033" type="#_x0000_t75" style='width:479.25pt;height:112.5pt;visibility:visible;mso-wrap-style:square'>
<v:imagedata src="22nd%20September%20-%20Take%205...%20Your%205%20minute%20fortnightly%20roundup%20of%20alcohol%20and%20other%20drug%20news%20and%20research%202_files/image001.png" o:title=""/>
</v:shape><![endif]--><![if !vml]><img border=0 width=639 height=150 src="22nd%20September%20-%20Take%205...%20Your%205%20minute%20fortnightly%20roundup%20of%20alcohol%20and%20other%20drug%20news%20and%20research%202_files/image025.png"v:shapes="_x0000_i1033"><![endif]></span></b><b style='mso-bidi-font-weight:normal'><span lang=EN-GB style='font-family:"Arial","sans-serif"'><o:p></o:p></span></b></p>
</td>

If I strip out all of the MS code, it kills all the formatting:

$parsed_html = preg_replace('/<!--\[[\s\S]*?\]-->/s', '', $html);

I have tried to be more specific:

$parsed_html = preg_replace('/<!--\[if gte vml 1\]*?--><!\[if !vml\]>/s', '', $html);

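This works just as well, but again it strips too much. Do you know whether there is a way to export better HTML from Word (haha), or a better matching pattern?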
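One possible narrower pattern, offered only as a sketch: remove just the <!--[if gte vml 1]> ... <![endif]--> block (the VML copy of the image) and unwrap the <![if !vml]> ... <![endif]> markers so the plain img tag stays. The function name strip_word_vml and the shortened sample markup are hypothetical and not from the original post. The sketch assumes Word always closes these blocks with <![endif]--> and <![endif]> respectively, as in the snippet above.

```php
<?php
// Hypothetical helper (not from the original post): removes only Word's
// VML image blocks and unwraps the plain <img> fallback next to them.
function strip_word_vml(string $html): string
{
    $patterns = [
        // 1. The whole <!--[if gte vml 1]> ... <![endif]--> comment,
        //    VML markup included. The lazy .*? with the s modifier
        //    stops at the first closing <![endif]-->.
        '/<!--\[if gte vml 1\]>.*?<!\[endif\]-->/s',
        // 2. The <![if !vml]> ... <![endif]> wrapper; the inner
        //    content (the <img>) is captured and kept.
        '/<!\[if !vml\]>(.*?)<!\[endif\]>/s',
    ];
    $replacements = ['', '$1'];

    // preg_replace returns null on a PCRE error (for example a
    // backtrack limit on a very large file); fall back to the input.
    return preg_replace($patterns, $replacements, $html) ?? $html;
}

// Shortened, made-up sample in the shape of the Word output above.
$sample = '<p style="x"><!--[if gte vml 1]><v:shape id="a">' . "\n"
        . '<v:imagedata src="image001.png"/></v:shape><![endif]-->'
        . '<![if !vml]><img src="image025.png"><![endif]></p>'
        . '<!--[if gte mso 9]><xml>kept</xml><![endif]-->';

echo strip_word_vml($sample), "\n";
```

On this sample the VML block disappears, the img tag survives without its wrapper, and the unrelated <!--[if gte mso 9]> block is left alone, so other conditional blocks in the document are not touched.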

Here is a complete Word HTML document: http://pastebin.com/myPwnHbd

Here is the PHP so far (it takes an HTML file uploaded from a simple HTML form): http://pastebin.com/Wc7hEk7c

Thanks, this thread pointed me to: http://htmlpurifier.org/

My final code (in summary) is:
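<?php
    // Keep PHP notices and warnings out of the generated HTML.
    error_reporting(0);
    ini_set('display_errors', FALSE);
    require_once 'htmlpurifier-4.8.0/library/HTMLPurifier.auto.php';

    // Read the Word HTML file uploaded through the form.
    $html = file_get_contents($_FILES['file']['tmp_name']);

    $config = HTMLPurifier_Config::createDefault();
    // Treat input and output as ISO-8859-1 instead of the UTF-8 default.
    $config->set('Core.Encoding', 'ISO-8859-1');
    // Turn double newlines into paragraphs.
    $config->set('AutoFormat.AutoParagraph', true);

    $purifier = new HTMLPurifier($config);
    $clean_html = $purifier->purify($html);
    echo $clean_html;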