Converting accented characters and HTML entities into UTF-8?

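I'm working on a project that will let me download stories from Portkey.org to read on my Kindle, and I can't for the life of me figure out how to correctly encode/parse the HTML scraped from the site. I'm using simple_html_dom to scrape it, and I'm passing along the innertext of the main element that holds the story for parsing.

So, what I'm trying to accomplish here is the following:

  1. Fetch the HTML of a Portkey.org story
  2. Convert all HTML entities on the page to regular characters for reading (entities such as &rdquo;, &ldquo;, &hellip;)
  3. Leave any accented characters or characters from other languages (Korean, Japanese, Chinese, etc.) as they are.
  4. Use tidy to repair the HTML and save it to an .html file.

Everything I've tried so far results in one of the following:

  • Diamonds with question marks inside them where the accented characters should be
  • Broken UTF-8 characters where the quotes and ellipses should be, while the accented characters display correctly

Example from the story HTML: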

<p> Wel [snip] your emotions&hellip;but most impor [snip] ng fiancé </p>

Edit:

html_entity_decode produces the following output:

 Wel [snip] your emotions…but most impor [snip] ng fiancé

As you can see, the accented character is correct, but the &hellip; is now displayed incorrectly.
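One plausible explanation for this symptom, offered as an assumption rather than a confirmed diagnosis, is that the decoded ellipsis is a valid three-byte UTF-8 sequence that something downstream reads as Windows-1252. The sketch below reproduces that effect in isolation; it assumes the mbstring extension is available and is not taken from the original test file.

```php
<?php
// Decode the entity to UTF-8: U+2026 becomes the three bytes E2 80 A6.
$ellipsis = html_entity_decode('&hellip;', ENT_QUOTES, 'UTF-8');
echo bin2hex($ellipsis), "\n"; // e280a6

// Simulate a consumer that reads those bytes as Windows-1252:
// each byte is treated as its own character and re-encoded as UTF-8.
$misread = mb_convert_encoding($ellipsis, 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // three separate characters instead of one ellipsis
```

If the broken output matches `$misread`, the decoding step itself is fine and the mismatch lies in how the bytes are labelled or re-read afterwards.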

Edit 2:

Result of get_html_translation_table(HTML_ENTITIES):

array(252) { ["""]=> string(6) """ ["&"]=> string(5) "&" ["<"]=> string(4) "<" [">"]=> string(4) ">" [" "]=> string(6) " " ["¡"]=> string(7) "¡" ["¢"]=> string(6) "¢" ["£"]=> string(7) "£" ["¤"]=> string(8) "¤" ["Â¥"]=> string(5) "¥" ["¦"]=> string(8) "¦" ["§"]=> string(6) "§" ["¨"]=> string(5) "¨" ["©"]=> string(6) "©" ["ª"]=> string(6) "ª" ["«"]=> string(7) "«" ["¬"]=> string(5) "¬" ["­"]=> string(5) "­" ["®"]=> string(5) "®" ["¯"]=> string(6) "¯" ["°"]=> string(5) "°" ["±"]=> string(8) "±" ["²"]=> string(6) "²" ["³"]=> string(6) "³" ["´"]=> string(7) "´" ["µ"]=> string(7) "µ" ["¶"]=> string(6) "¶" ["·"]=> string(8) "·" ["¸"]=> string(7) "¸" ["¹"]=> string(6) "¹" ["º"]=> string(6) "º" ["»"]=> string(7) "»" ["¼"]=> string(8) "¼" ["½"]=> string(8) "½" ["¾"]=> string(8) "¾" ["¿"]=> string(8) "¿" ["À"]=> string(8) "À" ["Ã"]=> string(8) "Á" ["Â"]=> string(7) "Â" ["Ã"]=> string(8) "Ã" ["Ä"]=> string(6) "Ä" ["Ã…"]=> string(7) "Å" ["Æ"]=> string(7) "Æ" ["Ç"]=> string(8) "Ç" ["È"]=> string(8) "È" ["É"]=> string(8) "É" ["Ê"]=> string(7) "Ê" ["Ë"]=> string(6) "Ë" ["ÃŒ"]=> string(8) "Ì" ["Ã"]=> string(8) "Í" ["ÃŽ"]=> string(7) "Î" ["Ã"]=> string(6) "Ï" ["Ã"]=> string(5) "Ð" ["Ñ"]=> string(8) "Ñ" ["Ã’"]=> string(8) "Ò" ["Ó"]=> string(8) "Ó" ["Ô"]=> string(7) "Ô" ["Õ"]=> string(8) "Õ" ["Ö"]=> string(6) "Ö" ["×"]=> string(7) "×" ["Ø"]=> string(8) "Ø" ["Ù"]=> string(8) "Ù" ["Ú"]=> string(8) "Ú" ["Û"]=> string(7) "Û" ["Ãœ"]=> string(6) "Ü" ["Ã"]=> string(8) "Ý" ["Þ"]=> string(7) "Þ" ["ß"]=> string(7) "ß" ["à "]=> string(8) "à" ["á"]=> string(8) "á" ["â"]=> string(7) "â" ["ã"]=> string(8) "ã" ["ä"]=> string(6) "ä" ["Ã¥"]=> string(7) "å" ["æ"]=> string(7) "æ" ["ç"]=> string(8) "ç" ["è"]=> string(8) "è" ["é"]=> string(8) "é" ["ê"]=> string(7) "ê" ["ë"]=> string(6) "ë" ["ì"]=> string(8) "ì" ["í"]=> string(8) "í" ["î"]=> string(7) "î" ["ï"]=> string(6) "ï" ["ð"]=> string(5) "ð" ["ñ"]=> string(8) "ñ" ["ò"]=> string(8) "ò" ["ó"]=> string(8) "ó" ["ô"]=> string(7) "ô" ["õ"]=> string(8) 
"õ" ["ö"]=> string(6) "ö" ["÷"]=> string(8) "÷" ["ø"]=> string(8) "ø" ["ù"]=> string(8) "ù" ["ú"]=> string(8) "ú" ["û"]=> string(7) "û" ["ü"]=> string(6) "ü" ["ý"]=> string(8) "ý" ["þ"]=> string(7) "þ" ["ÿ"]=> string(6) "ÿ" ["Å’"]=> string(7) "Œ" ["Å“"]=> string(7) "œ" ["Å "]=> string(8) "Š" ["Å¡"]=> string(8) "š" ["Ÿ"]=> string(6) "Ÿ" ["Æ’"]=> string(6) "ƒ" ["ˆ"]=> string(6) "ˆ" ["Ëœ"]=> string(7) "˜" ["Α"]=> string(7) "Α" ["Î’"]=> string(6) "Β" ["Γ"]=> string(7) "Γ" ["Δ"]=> string(7) "Δ" ["Ε"]=> string(9) "Ε" ["Ζ"]=> string(6) "Ζ" ["Η"]=> string(5) "Η" ["Θ"]=> string(7) "Θ" ["Ι"]=> string(6) "Ι" ["Κ"]=> string(7) "Κ" ["Λ"]=> string(8) "Λ" ["Îœ"]=> string(4) "Μ" ["Î"]=> string(4) "Ν" ["Ξ"]=> string(4) "Ξ" ["Ο"]=> string(9) "Ο" ["Î "]=> string(4) "Π" ["Ρ"]=> string(5) "Ρ" ["Σ"]=> string(7) "Σ" ["Τ"]=> string(5) "Τ" ["Î¥"]=> string(9) "Υ" ["Φ"]=> string(5) "Φ" ["Χ"]=> string(5) "Χ" ["Ψ"]=> string(5) "Ψ" ["Ω"]=> string(7) "Ω" ["α"]=> string(7) "α" ["β"]=> string(6) "β" ["γ"]=> string(7) "γ" ["δ"]=> string(7) "δ" ["ε"]=> string(9) "ε" ["ζ"]=> string(6) "ζ" ["η"]=> string(5) "η" ["θ"]=> string(7) "θ" ["ι"]=> string(6) "ι" ["κ"]=> string(7) "κ" ["λ"]=> string(8) "λ" ["μ"]=> string(4) "μ" ["ν"]=> string(4) "ν" ["ξ"]=> string(4) "ξ" ["ο"]=> string(9) "ο" ["Ï€"]=> string(4) "π" ["Ï"]=> string(5) "ρ" ["Ï‚"]=> string(8) "ς" ["σ"]=> string(7) "σ" ["Ï„"]=> string(5) "τ" ["Ï…"]=> string(9) "υ" ["φ"]=> string(5) "φ" ["χ"]=> string(5) "χ" ["ψ"]=> string(5) "ψ" ["ω"]=> string(7) "ω" ["Ï‘"]=> string(10) "ϑ" ["Ï’"]=> string(7) "ϒ" ["Ï–"]=> string(5) "ϖ" [" "]=> string(6) " " [" "]=> string(6) " " [" "]=> string(8) " " ["‌"]=> string(6) "‌" ["â€"]=> string(5) "‍" ["‎"]=> string(5) "‎" ["â€"]=> string(5) "‏" ["–"]=> string(7) "–" ["—"]=> string(7) "—" ["‘"]=> string(7) "‘" ["’"]=> string(7) "’" ["‚"]=> string(7) "‚" ["“"]=> string(7) "“" ["â€"]=> string(7) "”" ["„"]=> string(7) "„" ["†"]=> string(8) "†" ["‡"]=> string(8) "‡" ["•"]=> string(6) "•" ["…"]=> string(8) "…" ["‰"]=> 
string(8) "‰" ["′"]=> string(7) "′" ["″"]=> string(7) "″" ["‹"]=> string(8) "‹" ["›"]=> string(8) "›" ["‾"]=> string(7) "‾" ["â„"]=> string(7) "⁄" ["€"]=> string(6) "€" ["â„‘"]=> string(7) "ℑ" ["℘"]=> string(8) "℘" ["â„œ"]=> string(6) "ℜ" ["â„¢"]=> string(7) "™" ["ℵ"]=> string(9) "ℵ" ["â†"]=> string(6) "←" ["↑"]=> string(6) "↑" ["→"]=> string(6) "→" ["↓"]=> string(6) "↓" ["↔"]=> string(6) "↔" ["↵"]=> string(7) "↵" ["â‡"]=> string(6) "⇐" ["⇑"]=> string(6) "⇑" ["⇒"]=> string(6) "⇒" ["⇓"]=> string(6) "⇓" ["⇔"]=> string(6) "⇔" ["∀"]=> string(8) "∀" ["∂"]=> string(6) "∂" ["∃"]=> string(7) "∃" ["∅"]=> string(7) "∅" ["∇"]=> string(7) "∇" ["∈"]=> string(6) "∈" ["∉"]=> string(7) "∉" ["∋"]=> string(4) "∋" ["âˆ"]=> string(6) "∏" ["∑"]=> string(5) "∑" ["−"]=> string(7) "−" ["∗"]=> string(8) "∗" ["√"]=> string(7) "√" ["âˆ"]=> string(6) "∝" ["∞"]=> string(7) "∞" ["∠"]=> string(5) "∠" ["∧"]=> string(5) "∧" ["∨"]=> string(4) "∨" ["∩"]=> string(5) "∩" ["∪"]=> string(5) "∪" ["∫"]=> string(5) "∫" ["∴"]=> string(8) "∴" ["∼"]=> string(5) "∼" ["≅"]=> string(6) "≅" ["≈"]=> string(7) "≈" ["≠"]=> string(4) "≠" ["≡"]=> string(7) "≡" ["≤"]=> string(4) "≤" ["≥"]=> string(4) "≥" ["⊂"]=> string(5) "⊂" ["⊃"]=> string(5) "⊃" ["⊄"]=> string(6) "⊄" ["⊆"]=> string(6) "⊆" ["⊇"]=> string(6) "⊇" ["⊕"]=> string(7) "⊕" ["⊗"]=> string(8) "⊗" ["⊥"]=> string(6) "⊥" ["â‹…"]=> string(6) "⋅" ["⌈"]=> string(7) "⌈" ["⌉"]=> string(7) "⌉" ["⌊"]=> string(8) "⌊" ["⌋"]=> string(8) "⌋" ["〈"]=> string(6) "⟨" ["〉"]=> string(6) "⟩" ["â—Š"]=> string(5) "◊" ["â™ "]=> string(8) "♠" ["♣"]=> string(7) "♣" ["♥"]=> string(8) "♥" ["♦"]=> string(7) "♦" }

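Edit 3:

For full disclosure, here is the test file I set up to figure this out. Currently, all the entities display correctly, but the accented characters show up as question-mark diamonds (�).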

<?php
header('Content-Type: text/html; charset=UTF-8');
require_once('_RESOURCES/simple_html_dom.php');
$url = 'http://fanfiction.portkey.org/index.php?act=read&storyid=1585&chapterid=&agree=1';
function tidyHTML($html) {
    ob_start();
    $tidy = new tidy;
    $config = array('indent' => true, 'output-xhtml' => false, 'wrap' => 200, 'clean' => false, 'show-body-only' => true);
    $tidy->parseString($html, $config, 'utf8');
    $tidy->cleanRepair();
    $input = $tidy;
    return $input;
}
function filter($html) {
    // Collapse whitespace between adjacent tags.
    $html = preg_replace('~>\s+<~', '><', $html);
    // Merge back-to-back bold and italic runs.
    $html = preg_replace('/<\/b>\s?<b>/', '', $html);
    $html = preg_replace('/<\/i>\s?<i>/', '', $html);
    $html = str_replace('<br>', '', $html);
    $output = $html;
    return $output;
}
$page_html = file_get_html($url);
$chapter_html = $page_html->find('td[class="story"]', 0);
foreach ($chapter_html->find('center') as $node) { $node->outertext = ''; }
$entities = html_entity_decode($chapter_html->innertext, ENT_QUOTES, 'UTF-8');
echo tidyHTML(filter($entities));
// var_dump(get_html_translation_table(HTML_ENTITIES));
?>

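You probably want html_entity_decode. From the docs: it "converts all HTML entities in the string to their applicable characters." Depending on your PHP version and settings, you may need to specify the encoding manually, like this: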

html_entity_decode($raw_text, ENT_QUOTES, 'UTF-8');

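Tidy may be re-encoding your entities. I'm not sure how complex your input string is, but if you don't need the formatting to match exactly, you could consider stripping the HTML tags with something like strip_tags.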
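As a rough illustration of the two suggestions above, here is a minimal sketch that decodes the sample paragraph from the question and then strips its tags. It assumes the input is already valid UTF-8 (the é is a UTF-8 literal in this file); the variable names are made up for the example.

```php
<?php
// Sample taken from the question; the source file is assumed to be saved as UTF-8.
$raw = '<p> Wel [snip] your emotions&hellip;but most impor [snip] ng fiancé </p>';

// Decode named entities into UTF-8 characters; existing UTF-8 text is left alone.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

// Optionally drop the markup if only the text is needed.
$text = trim(strip_tags($decoded));

echo $text, "\n";
```

Note that strip_tags discards all formatting, so it only fits if plain text is acceptable for the Kindle output.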

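I solved it by changing tidy's encoding from

$tidy->parseString($html, $config, 'utf8');

to

$tidy->parseString($html, $config, 'win1252');

This converts the accented characters into HTML entities. I then use html_entity_decode to convert all the entities into UTF-8 characters.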

New test file (it works!):

<?php
header('Content-Type: text/html; charset=UTF-8');
require_once('_RESOURCES/simple_html_dom.php');
$url = 'http://fanfiction.portkey.org/index.php?act=read&storyid=1585&chapterid=&agree=1';
function tidyHTML($html) {
    ob_start();
    $tidy = new tidy;
    $config = array('indent' => true, 'output-xhtml' => false, 'wrap' => 200, 'clean' => false, 'show-body-only' => true);
    $tidy->parseString($html, $config, 'win1252');
    $tidy->cleanRepair();
    $input = $tidy;
    return $input;
}
function filter($html) {
    // Collapse whitespace between adjacent tags.
    $html = preg_replace('~>\s+<~', '><', $html);
    // Merge back-to-back bold and italic runs.
    $html = preg_replace('/<\/b>\s?<b>/', '', $html);
    $html = preg_replace('/<\/i>\s?<i>/', '', $html);
    $html = str_replace('<br>', '', $html);
    $output = $html;
    return $output;
}
$page_html = file_get_html($url);
$chapter_html = $page_html->find('td[class="story"]', 0);
foreach ($chapter_html->find('center') as $node) { $node->outertext = ''; }
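// Cast the tidy object to a string and decode explicitly as UTF-8 rather than relying on the PHP default charset.
echo filter(html_entity_decode((string) tidyHTML($chapter_html->innertext), ENT_QUOTES, 'UTF-8'));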
?>
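The win1252 setting working suggests, though the thread does not confirm it, that the scraped page is not actually UTF-8. Under that assumption, another way to get the same result might be to convert the raw bytes to UTF-8 first and only then decode the entities. The helper name `toUtf8` and its default source encoding are hypothetical, and the sketch requires the mbstring extension.

```php
<?php
// Hypothetical helper: normalise scraped markup to UTF-8 before decoding entities.
// $sourceEncoding is an assumption; check the page's Content-Type header or <meta> tag.
function toUtf8(string $html, string $sourceEncoding = 'Windows-1252'): string
{
    // Leave the input alone if it is already valid UTF-8.
    if (!mb_check_encoding($html, 'UTF-8')) {
        $html = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
    }
    // Entities now decode into the same encoding as the surrounding text.
    return html_entity_decode($html, ENT_QUOTES, 'UTF-8');
}

// "fianc\xE9" is the Windows-1252 byte sequence for "fiancé".
$sample = "emotions&hellip;but fianc\xE9";
echo toUtf8($sample), "\n";
```

The validity check is only a heuristic: some Windows-1252 byte sequences also happen to be valid UTF-8, so an explicit encoding taken from the response headers is more reliable when it is available.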

Couldn't have done it without you, Skunk Waffle!