What are some valid and invalid UTF-8 strings I can use for my unit tests?

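I have written two functions in PHP, str_to_utf8() and seems_utf8() (they are assembled from parts I borrowed from other code). I am now writing unit tests for them and want to make sure the test cases are suitable. This is what I currently have, taken from Facebook: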

public function test_str_to_utf8()
{
    // Make sure ASCII characters are ignored
    $this->assertEquals( "this\x01 is a \x7f test string", str_to_utf8( "this\x01 is a \x7f test string" ) );
    // Make sure UTF8 characters are ignored
    $this->assertEquals( "\xc3\x9c \xc3\xbc \xe6\x9d\xb1!", str_to_utf8( "\xc3\x9c \xc3\xbc \xe6\x9d\xb1!" ) );
    // Test long strings
    #str_to_utf8( str_repeat( 'x', 1024 * 1024 ) );
    $this->assertEquals( TRUE, TRUE );
    // Test some invalid UTF8 to see if it is properly fixed
    $input = "\xc3 this has \xe6\x9d some invalid utf8 \xe6";
    $expect = "\xEF\xBF\xBD this has \xEF\xBF\xBD\xEF\xBF\xBD some invalid utf8 \xEF\xBF\xBD";
    $this->assertEquals( $expect, str_to_utf8( $input ) );
}

Are these valid test cases?

I found this resource useful when testing UTF-8.
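
Test data for a UTF-8 validator usually covers a handful of categories: plain ASCII, well-formed 2-, 3-, and 4-byte sequences, stray continuation bytes, truncated sequences, overlong encodings, encoded UTF-16 surrogates, and code points above U+10FFFF. The following is only a sketch of such a data set. is_valid_utf8_reference() is a hypothetical helper, not the asker's seems_utf8(), and it assumes that preg_match() with the /u modifier is an acceptable reference validator.

```php
<?php
// Hypothetical reference check for illustration only (an assumption,
// not the seems_utf8() from the question). preg_match() with the /u
// modifier fails when the subject is not valid UTF-8.
function is_valid_utf8_reference(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Well-formed UTF-8, written as byte escapes.
$valid = [
    'empty string'                 => "",
    'ASCII with control bytes'     => "this\x01 is a \x7f test string",
    '2-byte sequence (U+00DC)'     => "\xc3\x9c",
    '3-byte sequence (U+6771)'     => "\xe6\x9d\xb1",
    '4-byte sequence (U+1F600)'    => "\xf0\x9f\x98\x80",
    'highest code point (U+10FFFF)' => "\xf4\x8f\xbf\xbf",
];

// Ill-formed UTF-8, one entry per failure category.
$invalid = [
    'lone continuation byte'         => "\x80",
    'lead byte without continuation' => "\xc3 ",
    'truncated 3-byte sequence'      => "\xe6\x9d",
    'overlong encoding of "/"'       => "\xc0\xaf",
    'overlong encoding of NUL'       => "\xc0\x80",
    'encoded surrogate (U+D800)'     => "\xed\xa0\x80",
    'code point above U+10FFFF'      => "\xf4\x90\x80\x80",
    'bytes that never occur in UTF-8' => "\xfe\xff",
];

// Report how the reference check classifies each sample.
foreach (['valid' => $valid, 'invalid' => $invalid] as $expected => $samples) {
    foreach ($samples as $label => $bytes) {
        $actual = is_valid_utf8_reference($bytes) ? 'valid' : 'invalid';
        echo $label, ': expected ', $expected, ', got ', $actual, "\n";
    }
}
```

The same two arrays could feed a PHPUnit data provider, with seems_utf8() asserted against each entry.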

If you use any non-Latin-1 text, you need to make sure your PHP file is saved as UTF-8, or else pre-escape the text.
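
As a rough illustration of pre-escaping, the sketch below writes the same text twice, once as raw byte escapes and once with PHP 7's \u{...} code point escapes. Both forms contain only ASCII in the source file, so they do not depend on the encoding the file is saved in. The variable names are made up for this example.

```php
<?php
// Byte escapes: the UTF-8 bytes are spelled out explicitly.
$escaped = "\xc3\x9c \xc3\xbc \xe6\x9d\xb1!";

// Code point escapes (PHP 7.0+): PHP emits the UTF-8 encoding itself.
$by_code_point = "\u{00DC} \u{00FC} \u{6771}!";

var_dump($escaped === $by_code_point); // bool(true)
echo bin2hex("\u{6771}"), "\n";        // e69db1
```

A literal such as "東" typed directly into the file yields these bytes only if the file itself is saved as UTF-8, which is why escaping is the safer choice for test fixtures.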