Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored

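I want to replace invalid UTF-8 characters with question marks (PHP 5.3.5).

So far I have this solution, but invalid characters are removed instead of being replaced by '?'.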

function replace_invalid_utf8($str)
{
  return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
echo mb_substitute_character()."\n";
echo replace_invalid_utf8('éééaaaàààeeé')."\n";
echo replace_invalid_utf8('eeeaaaaaaeeé')."\n";

It should output:

63 // ASCII code for '?' character
???aaa???eé // or ??aa??eé
eeeaaaaaaeeé

But it currently outputs:

63
aaaee // removed invalid characters
eeeaaaaaaeeé

Any suggestions?

Would you do it another way (using preg_replace(), for example)?

Thanks.

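Since PHP 5.4, you can use mb_convert_encoding() or the ENT_SUBSTITUTE option of htmlspecialchars(). Of course, you can use preg_match() too. If you use intl, UConverter is available since PHP 5.5.

The recommended substitute for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for details.

When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).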

// REPLACEMENT CHARACTER (U+FFFD)
mb_substitute_character(0xFFFD);
function replace_invalid_byte_sequence($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
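
Because mb_substitute_character() changes a setting that applies to every later conversion in the request, it can be useful to scope the change. The following is only a sketch under assumptions: it needs the mbstring extension and PHP 5.5+ (for finally), and with_substitute_character() is a hypothetical helper name, not part of mbstring.

```php
<?php
// Hypothetical helper: run $fn with a temporary substitute character,
// then restore whatever was configured before.
function with_substitute_character($codepoint, callable $fn)
{
    // Called without arguments, mb_substitute_character() returns the current setting.
    $previous = mb_substitute_character();
    mb_substitute_character($codepoint);
    try {
        return $fn();
    } finally {
        mb_substitute_character($previous);
    }
}

// "\xE2\x82" is a truncated 3-byte sequence between two ASCII letters.
$clean = with_substitute_character(0xFFFD, function () {
    return mb_convert_encoding("a\xE2\x82b", 'UTF-8', 'UTF-8');
});
```

The exact output of the conversion depends on the PHP version, so it is best inspected rather than assumed.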

UConverter offers both a procedural and an object-oriented API.

function replace_invalid_byte_sequence3($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}
function replace_invalid_byte_sequence4($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

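When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest form vulnerability. The range of the trail bytes depends on the range of the lead byte.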

lead byte: 0x00 - 0x7F, 0xC2 - 0xF4
trail byte: 0x80(or 0x90 or 0xA0) - 0xBF(or 0x8F)

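You can refer to the following resources to check the byte ranges.

  1. "Syntax of UTF-8 Byte Sequences" in RFC 3629
  2. "Table 3-7. Well-Formed UTF-8 Byte Sequences" in Unicode Standard 6.1
  3. "Multilingual form encoding" in W3C Internationalization

The byte range table is as follows: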

      Code Points    First Byte Second Byte Third Byte Fourth Byte
  U+0000 -   U+007F   00 - 7F
  U+0080 -   U+07FF   C2 - DF    80 - BF
  U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
  U+1000 -   U+CFFF   E1 - EC    80 - BF     80 - BF
  U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
  U+E000 -   U+FFFF   EE - EF    80 - BF     80 - BF
 U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
 U+40000 -  U+FFFFF   F1 - F3    80 - BF     80 - BF    80 - BF
U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF
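
The table above can be turned into a validity check. This is a minimal sketch rather than a definitive implementation: is_well_formed_utf8() is a hypothetical name, and the possessive quantifier (*+) is my own choice to keep PCRE from backtracking over long input.

```php
<?php
// Hypothetical helper: true only if $str consists entirely of
// well-formed UTF-8 sequences, one alternative per row of the table.
function is_well_formed_utf8($str)
{
    $regex = '/\A(?:
        [\x00-\x7F]                        #   U+0000 -   U+007F
      | [\xC2-\xDF][\x80-\xBF]             #   U+0080 -   U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]         #   U+0800 -   U+0FFF
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  #   U+1000 -   U+CFFF, U+E000 - U+FFFF
      | \xED[\x80-\x9F][\x80-\xBF]         #   U+D000 -   U+D7FF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      #  U+10000 -  U+3FFFF
      | [\xF1-\xF3][\x80-\xBF]{3}          #  U+40000 -  U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # U+100000 - U+10FFFF
    )*+\z/x';
    return preg_match($regex, $str) === 1;
}

var_dump(is_well_formed_utf8("\xE2\x82\xAC")); // U+20AC, well-formed
var_dump(is_well_formed_utf8("\xC0\x80"));     // non-shortest form of U+0000
```

The pattern deliberately omits the /u modifier so that PCRE matches raw bytes instead of rejecting the subject outright.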

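For how to replace invalid byte sequences without breaking valid characters, see "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in the Unicode Standard.

The Unicode Standard gives an example: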

before: <61    F1 80 80  E1 80  C2    62    80    63    80    BF    64  >
after:  <0061  FFFD      FFFD   FFFD  0062  FFFD  0063  FFFD  FFFD  0064>
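
To compare a function's result with the "before" row above, it helps to print strings in the same byte notation. A small sketch, where format_bytes() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: render a byte string as space-separated uppercase hex.
function format_bytes($str)
{
    if ($str === '') {
        return '';
    }
    return implode(' ', str_split(strtoupper(bin2hex($str)), 2));
}

echo format_bytes("\x61\xF1\x80\x80\xE1\x80\xC2\x62"), PHP_EOL;
// prints: 61 F1 80 80 E1 80 C2 62
```

In the output of a replacement function, each U+FFFD shows up as the three bytes EF BF BD.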

Here is an implementation with preg_replace_callback() that follows the rules above.

function replace_invalid_byte_sequence5($str)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $regex = '/
      ([\x00-\x7F]                       #   U+0000 -   U+007F
      |[\xC2-\xDF][\x80-\xBF]            #   U+0080 -   U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]       #   U+0800 -   U+0FFF
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} #   U+1000 -   U+CFFF
      | \xED[\x80-\x9F][\x80-\xBF]       #   U+D000 -   U+D7FF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}    #  U+10000 -  U+3FFFF
      |[\xF1-\xF3][\x80-\xBF]{3}         #  U+40000 -  U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2})   # U+100000 - U+10FFFF
      |(\xE0[\xA0-\xBF]                  #   U+0800 -   U+0FFF (invalid)
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]    #   U+1000 -   U+CFFF (invalid)
      | \xED[\x80-\x9F]                  #   U+D000 -   U+D7FF (invalid)
      | \xF0[\x90-\xBF][\x80-\xBF]?      #  U+10000 -  U+3FFFF (invalid)
      |[\xF1-\xF3][\x80-\xBF]{1,2}       #  U+40000 -  U+FFFFF (invalid)
      | \xF4[\x80-\x8F][\x80-\xBF]?)     # U+100000 - U+10FFFF (invalid)
      |(.)                               # invalid 1-byte
    /xs';
    // $matches[1]: valid character
    // $matches[2]: invalid 3-byte or 4-byte character
    // $matches[3]: invalid 1-byte
    $ret = preg_replace_callback($regex, function($matches) use($substitute) {
        if (isset($matches[2]) || isset($matches[3])) {
            return $substitute;
        }
    
        return $matches[1];
    }, $str);
    return $ret;
}

You can also compare the bytes directly, which avoids the byte size limitation of preg_match().

function replace_invalid_byte_sequence6($str) {
    $size = strlen($str);
    $substitute = "\xEF\xBF\xBD";
    $ret = '';
    $pos = 0;
    // Filled in by reference on each call to utf8_get_next_char()
    $char = '';
    $char_size = 0;
    $valid = false;
    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
        $ret .= $valid ? $char : $substitute;
    }
    return $ret;
}
function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
{
    $valid = false;
    if ($str_size <= $pos) {
        return false;
    }
    if ($str[$pos] < "\x80") {
        $valid = true;
        $char_size =  1;
    } else if ($str[$pos] < "\xC2") {
        $char_size = 1;
    } else if ($str[$pos] < "\xE0")  {
        if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {
            $char_size = 1;
        } else {
            $valid = true;
            $char_size = 2;
        }
    } else if ($str[$pos] < "\xF0") {
        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";
        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } else {
            $valid = true;
            $char_size = 3;
        }
    } else if ($str[$pos] < "\xF5") {
        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";
        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {
            $char_size = 3;
        } else {
            $valid = true;
            $char_size = 4;
        }
    } else {
        $char_size = 1;
    }
    $char = substr($str, $pos, $char_size);
    $pos += $char_size;
    return true;
}

Here is the test case.

function run(array $callables, array $arguments)
{
    return array_map(function($callable) use($arguments) {
         return array_map($callable, $arguments);
    }, $callables);
}
    
$data = [
    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
    // http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
    ."\x80"."\xBF"."\x64",
    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
];
var_dump(run([
    'replace_invalid_byte_sequence', 
    'replace_invalid_byte_sequence2',
    'replace_invalid_byte_sequence3',
    'replace_invalid_byte_sequence4',
    'replace_invalid_byte_sequence5',
    'replace_invalid_byte_sequence6'
], $data));
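
As a result, mb_convert_encoding() turns out to have a bug: it can corrupt the valid character immediately following an invalid byte sequence, or remove an invalid byte sequence that follows valid characters without adding U+FFFD.
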
$data = [
    // U+20AC
    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",
    // U+24B62
    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    // 'FULL MOON SYMBOL' (U+1F315)
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
];

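Although preg_match() can be used instead of preg_replace_callback(), that function has a limitation on byte size. See bug report #36463 for details. You can confirm it with the following test case.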

str_repeat('a', 10000)
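
When PCRE gives up on an input, preg_replace_callback() returns null instead of a string, which is easy to miss. Here is a sketch of a guard around it; preg_replace_callback_checked() is a hypothetical wrapper name, and whether a given input actually hits a limit depends on the PHP and PCRE versions in use.

```php
<?php
// Hypothetical wrapper: turn a silent null result into an exception
// that carries the code reported by preg_last_error().
function preg_replace_callback_checked($regex, $callback, $subject)
{
    $result = preg_replace_callback($regex, $callback, $subject);
    if ($result === null) {
        throw new RuntimeException('PCRE failed, preg_last_error() = ' . preg_last_error());
    }
    return $result;
}

// A long subject with a trivial pattern, as in the test case above.
$out = preg_replace_callback_checked('/a/', function ($matches) {
    return 'b';
}, str_repeat('a', 10000));
```

The same guard also catches the case where a pattern with the /u modifier is applied to a string that is not valid UTF-8.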

Finally, my benchmark results are as follows:

mb_convert_encoding()
0.19628190994263
htmlspecialchars()
0.082863092422485
UConverter::transcode()
0.15999984741211
UConverter::convert()
0.29843020439148
preg_replace_callback()
0.63967490196228
direct comparison
0.71933102607727

Here is the benchmark code.

function timer(array $callables, array $arguments, $repeat = 10000)
{
    $ret = [];
    $save = $repeat;
    foreach ($callables as $key => $callable) {
        $start = microtime(true);
        do {
    
            array_map($callable, $arguments);
        } while($repeat -= 1);
        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;
    }
    return $ret;
}
$functions = [
    'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
    'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
    'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
    'UConverter::convert()' => 'replace_invalid_byte_sequence4',
    'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
    'direct comparison' => 'replace_invalid_byte_sequence6'
];
foreach (timer($functions, $data) as $description => $time) {
    echo $description, PHP_EOL,
         $time, PHP_EOL;
}