PHP UTF-8 handling



I'm parsing a text file, and occasionally I run into data like this:

CASTA¥EDA, JASON  

I'm using a MongoDB backend, and when I try to save the information I get the following error:

[MongoDB\Driver\Exception\UnexpectedValueException]
  Got invalid UTF-8 value serializing 'Jason Casta�eda'

After googling around, I found two functions that their author says work:

    public function is_utf8( $str )
    {
        // Single-quoted so that PCRE, not PHP, interprets the \x escapes.
        return preg_match( '/^(
           [\x09\x0A\x0D\x20-\x7E]            # ASCII
         | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
         |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
         | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
         |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
         |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
         | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
         |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*$/x',
            $str
        ) === 1;
    }

    public function force_utf8( $str, $inputEnc = 'WINDOWS-1252' )
    {
        if ( $this->is_utf8( $str ) ) // Nothing to do.
            return $str;
        // Note: utf8_encode() is deprecated as of PHP 8.2.
        if ( strtoupper( $inputEnc ) === 'ISO-8859-1' )
            return utf8_encode( $str );
        if ( function_exists( 'mb_convert_encoding' ) )
            return mb_convert_encoding( $str, 'UTF-8', $inputEnc );
        if ( function_exists( 'iconv' ) )
            return iconv( $inputEnc, 'UTF-8', $str );
        // You could also just return the original string.
        trigger_error(
            'Cannot convert string to UTF-8 in file '
            . __FILE__ . ', line ' . __LINE__ . '!',
            E_USER_ERROR
        );
    }

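Using the two functions above, I call is_utf8($text) to check whether a line of text is valid UTF-8, and if it isn't, I call force_utf8($text). However, I still get the same error. Any pointers?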

This question is quite old, but for anyone who faces the same problem and lands on this page like I did:

mb_convert_encoding($value, 'UTF-8', 'UTF-8');

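This should replace every byte sequence that is not valid UTF-8 with a ? character (mbstring's default substitute character), which makes the value safe for MongoDB insert/update operations. Note that this is lossy: the original character is not recovered.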
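
For illustration, here is a minimal sketch of how that one-liner could be wrapped so that strings which are already valid UTF-8 pass through untouched. The helper name scrub_utf8 and the sample strings are my own assumptions, not part of the original answer, and it assumes the mbstring extension is loaded with its default substitute character.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original answer).
// Returns $value unchanged when it is already valid UTF-8; otherwise replaces
// each invalid byte sequence with mbstring's substitute character ('?' by default).
function scrub_utf8(string $value): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    return mb_convert_encoding($value, 'UTF-8', 'UTF-8');
}

$bad  = "Jason Casta\xA5eda";      // a lone 0xA5 byte is not valid UTF-8
$good = "Jason Casta\xC3\xB1eda";  // "ñ" correctly encoded as UTF-8

var_dump(mb_check_encoding($bad, 'UTF-8')); // is the raw input valid?
var_dump(scrub_utf8($bad));                 // invalid byte replaced
var_dump(scrub_utf8($good) === $good);      // valid input is left alone
```

If you know the file's real encoding, passing it as the third argument to mb_convert_encoding instead of 'UTF-8' would convert the characters rather than discard them.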