What is the best PHP method/class for cleaning strings to UTF-8?

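I've been using a handful of methods in a simple class that have worked well for me, but I've noticed they are really slow because of strtr() and the large number of translations defined. The class is also very long, which makes it harder to maintain and understand.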

That said, every one of the "bad" examples is a solution to a real-world problem I hit while converting strings to UTF-8.

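Can anyone point me to a well-known or more efficient way to do this? (And yes, I have tried the htmlentities() approach and the iconv() approach, but neither really replaces all of the funky characters correctly.)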

Here is the class I'm currently using: https://gist.github.com/2559140

As of PHP 5.4.0, mbstring support is enabled by default (but not loaded). Load the extension, and you can do the following:

<?php // PHP 5.4+
$ensureIsUTF8 = static function($data){
    $dataEncoding = mb_detect_encoding(
        $data,
        ['UTF-8', 'windows-1251', 'iso-8859-1', /*others you encounter*/],
        true
    );
    //UTF-16/32 encoding detection always fails for PHP <= 5.4.1
    //Use detection code copied from PHP docs comments:
    //http://www.php.net/manual/en/function.mb-detect-encoding.php
    if ($dataEncoding === false){
        $UTF32_BIG_ENDIAN_BOM = chr(0x00) . chr(0x00) . chr(0xFE) . chr(0xFF);
        $UTF32_LITTLE_ENDIAN_BOM = chr(0xFF) . chr(0xFE) . chr(0x00) . chr(0x00);
        $UTF16_BIG_ENDIAN_BOM = chr(0xFE) . chr(0xFF);
        $UTF16_LITTLE_ENDIAN_BOM = chr(0xFF) . chr(0xFE);
        $first2 = substr($data, 0, 2);
        $first4 = substr($data, 0, 4);
        if ($first4 === $UTF32_BIG_ENDIAN_BOM) {
            $dataEncoding = 'UTF-32BE';
        } elseif ($first4 === $UTF32_LITTLE_ENDIAN_BOM) {
            $dataEncoding = 'UTF-32LE';
        } elseif ($first2 === $UTF16_BIG_ENDIAN_BOM) {
            $dataEncoding = 'UTF-16BE';
        } elseif ($first2 === $UTF16_LITTLE_ENDIAN_BOM) {
            $dataEncoding = 'UTF-16LE';
        } else {
            throw new Exception('Whoa! No idea what that was.');
        }
    }
    if ($dataEncoding === 'UTF-8'){
        return $data;
    } else {
        return mb_convert_encoding(
            $data,
            'UTF-8',
            $dataEncoding
        );
    }
};
$utf8Data = $ensureIsUTF8(file_get_contents('something'));
$utf8Data = $ensureIsUTF8(file_get_contents('http://somethingElse'));
$utf8Data = $ensureIsUTF8($userProvidedData);
?>
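
If most of your input is either already valid UTF-8 or in one known legacy encoding, a shorter variant of the same idea may be enough. The following is only a minimal sketch and not part of the code above: the function name toUtf8 and the ISO-8859-1 default for $fallbackEncoding are assumptions made for illustration, so substitute whatever legacy encoding your data actually uses. It relies on mb_check_encoding() to skip conversion for strings that are already valid UTF-8.

```php
<?php
// Hypothetical helper, not taken from the answer above.
// Assumption: anything that is not valid UTF-8 is in $fallbackEncoding.
function toUtf8($data, $fallbackEncoding = 'ISO-8859-1')
{
    // Fast path: the bytes already form valid UTF-8, so return them untouched.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    // Otherwise reinterpret the bytes as the assumed legacy encoding.
    return mb_convert_encoding($data, 'UTF-8', $fallbackEncoding);
}
```

Unlike the closure above, this does not try to guess between several encodings or look for a UTF-16/32 BOM, so it only fits when the single-fallback assumption holds for your data.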