
Can I use HTML Purifier to find encoding issues instead of just stripping them?

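I have a (large) body of text that I am trying to beat into shape, converting it from its original web-friendly format into a "slightly" stricter one (EPUB; some readers are very picky about the HTML they accept).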


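Where HTML Purifier does not work so well is when it runs into encoding issues. Many characters are stored in &#1234; format (numeric character references), which (apparently?) HTML Purifier does not care about. Maybe I just need to configure it better. The other problem is the bane of my existence: curly quotes, dashes, and the like. I have managed to do a lot of search-and-replace on these, but what worries me is that I may have missed a character somewhere (I say this after running into a spelling of "deja vu" that used both the acute and the grave accent).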
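To make the "did I miss a character?" worry concrete, here is a minimal sketch of the kind of audit I have in mind. It does not use HTML Purifier at all, only plain PHP with the mbstring extension, and the function name `find_suspect_characters` is made up for illustration. It assumes the input is meant to be UTF-8 and simply counts every numeric character reference and every literal non-ASCII character, so they can be reviewed rather than silently stripped.

```php
<?php
// Hypothetical helper (not part of HTML Purifier): builds a report of
// every numeric character reference and every literal non-ASCII
// character found in $html, with the number of occurrences of each.
function find_suspect_characters(string $html): array
{
    // Assumption: the text is supposed to be UTF-8. Fail loudly if not,
    // because the /u regex below cannot run on invalid UTF-8.
    if (!mb_check_encoding($html, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $report = [];

    // Numeric character references such as &#1234; or &#x4D2;
    preg_match_all('/&#(?:x[0-9a-fA-F]+|[0-9]+);/', $html, $matches);
    foreach ($matches[0] as $entity) {
        $report[$entity] = ($report[$entity] ?? 0) + 1;
    }

    // Literal characters outside the ASCII range (curly quotes, dashes,
    // accented letters, ...), keyed by code point plus the character.
    preg_match_all('/[^\x00-\x7F]/u', $html, $matches);
    foreach ($matches[0] as $char) {
        $key = sprintf('U+%04X %s', mb_ord($char, 'UTF-8'), $char);
        $report[$key] = ($report[$key] ?? 0) + 1;
    }

    ksort($report);
    return $report;
}

// Usage: print the report for a small sample string.
$sample = "d\u{00E9}j\u{00E0} vu &#8220;quoted&#8221;";
print_r(find_suspect_characters($sample));
```

Running something like this over the whole text before and after cleanup would at least give a complete list of the characters to check by hand, instead of relying on search-and-replace having caught them all.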



$text="It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).";
$main = mysql_real_escape_string($text);