PHP - Substring after X characters with special characters

Sorry for the title, I really didn't know how to phrase it...

I often have a string that needs to be cut after X characters. My problem is that this string often contains special characters such as: &egrave;

So I'm wondering whether there is a way in PHP, without converting the string, to know that I am in the middle of a special character when I cut it.

Example

This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact

So right now the result of my substring is:

This is my string with a special char : &egra

But I want something like this:

This is my string with a special char : &egrave;

The best thing to do here is to store your string as UTF-8 without any HTML entities, and use the mb_* family of functions with UTF-8 as the encoding.
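A minimal sketch of that recommendation, assuming the text is kept as raw UTF-8 and only escaped when it is written into HTML (the sample string and variable names are illustrative):

```php
<?php
// Assumption: $s is stored as raw UTF-8 with no HTML entities in it.
// "\u{E8}" is the character è (two bytes in UTF-8).
$s = "This is my string with a special char : \u{E8} - and more text";

// mb_substr() counts characters, not bytes, so è is never split in half.
$cut = mb_substr($s, 0, 41, 'UTF-8');

// Escape only at output time.
echo htmlspecialchars($cut, ENT_QUOTES, 'UTF-8'), "\n";
```

Here the cut is 41 characters long but 42 bytes long, which is exactly the difference substr() would get wrong.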

But if your string is ASCII or iso-8859-1/win1252, you can use the special HTML-ENTITIES encoding of the mbstring library:

$s = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
echo mb_substr($s, 0, 40, 'HTML-ENTITIES');
echo mb_substr($s, 0, 41, 'HTML-ENTITIES');

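However, if the underlying string is UTF-8 or some other multibyte encoding, using HTML-ENTITIES is not safe! This is because HTML-ENTITIES really means "win1252 with high-bit characters as HTML entities". Here is an example of where it can go wrong: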

// Assuming that é is in utf8:
mb_substr('é ', 0, 2, 'HTML-ENTITIES') === '&Atilde;&copy;'
// should be '&eacute; '

If your string is in a multibyte encoding, you must convert all HTML entities to a common encoding before you split. For example:

// 'UTF-8' is the spelling documented for both html_entity_decode() and mb_substr()
$strings_actual_encoding = 'UTF-8';
$s_noentities = html_entity_decode($s, ENT_QUOTES, $strings_actual_encoding);
$s_trunc_noentities = mb_substr($s_noentities, 0, 41, $strings_actual_encoding);

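The best solution would be to store your text as UTF-8 instead of storing it as HTML entities. Other than that, if you don't mind the count being off (&egrave; counts as one character instead of 8), then the following snippet should work: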

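<?php
$string = 'This is my string with a special char : &egrave; - and I want it to cut in the middle of the "&egrave;" but still keeping the string intact';
// Decode the entities, cut by characters, then re-encode.
// ENT_NOQUOTES (value 0) keeps the behaviour of the NULL flags used originally.
$decoded = html_entity_decode($string, ENT_NOQUOTES, 'UTF-8');
$cut_string = htmlentities(mb_substr($decoded, 0, 45, 'UTF-8'), ENT_NOQUOTES, 'UTF-8')."<br><br>";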

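Note: if you use a different function to encode the text (e.g. htmlspecialchars()), use that function instead of htmlentities(). If you use a custom function, use another custom function that does the opposite of your custom function instead of html_entity_decode() (and your custom function instead of htmlentities()).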

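The longest HTML entity is 10 characters long, including the ampersand and the semicolon (that holds for the HTML 4 names, e.g. &thetasym;; HTML5 defines longer ones). If you intend to cut the string at X bytes, check bytes X-9 through X-1 for an ampersand. If the corresponding semicolon appears at byte X or later, cut the string after the semicolon instead of after byte X.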
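The check described above could be sketched roughly as follows. The function name cut_after_entity() and its $maxLen parameter are hypothetical, offsets are 0-based bytes, and the 10-byte limit is the assumption stated in this answer, so treat it as a starting point rather than a finished implementation:

```php
<?php
// Hypothetical helper: cut $text at $x bytes without splitting an entity.
// Assumes an entity is at most $maxLen bytes long, "&" and ";" included.
function cut_after_entity(string $text, int $x, int $maxLen = 10): string
{
    if (strlen($text) <= $x) {
        return $text;
    }
    // Look for an ampersand in the last ($maxLen - 1) bytes before the cut.
    $start = max(0, $x - ($maxLen - 1));
    $amp = strrpos(substr($text, $start, $x - $start), '&');
    if ($amp === false) {
        return substr($text, 0, $x);
    }
    $amp += $start;
    // Find the semicolon that would close this candidate entity.
    $semi = strpos($text, ';', $amp);
    if ($semi === false || $semi < $x || $semi - $amp >= $maxLen) {
        // Never closed, closed before the cut, or too long to be an entity.
        return substr($text, 0, $x);
    }
    // The entity straddles the cut: keep it whole.
    return substr($text, 0, $semi + 1);
}

$s = 'This is my string with a special char : &egrave; - and more';
echo cut_after_entity($s, 45), "\n";
```

With the string from the question, a cut at byte 45 lands inside &egrave; and is extended to the semicolon, while a bare ampersand such as the one in "AT&T" is cut as usual.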

However, if you are willing to preprocess the string, Mike's solution will be more accurate, because his cuts the string at X characters, not bytes.

You can first use html_entity_decode() to decode all HTML entities. Then split your string. Then use htmlentities() to re-encode the entities.

$decoded_string = html_entity_decode($original_string);
// implement logic to split string here
// then for each string part do the following:
$encoded_string_part = htmlentities($split_string_part);
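One way the missing split logic could look, as a sketch only: it assumes UTF-8 text, fixed-size chunks of 5 characters, and PHP 7.4+ for mb_str_split() and arrow functions. The sample string is made up for illustration.

```php
<?php
// Assumption: the input is ASCII/UTF-8 text that contains HTML entities.
$original_string = 'caf&eacute; cr&egrave;me br&ucirc;l&eacute;e';

$decoded_string = html_entity_decode($original_string, ENT_QUOTES, 'UTF-8');

// Split into chunks of 5 characters (not bytes).
$parts = mb_str_split($decoded_string, 5, 'UTF-8');

// Re-encode each part, so every entity ends up whole in exactly one part.
$encoded_parts = array_map(
    fn(string $part): string => htmlentities($part, ENT_QUOTES, 'UTF-8'),
    $parts
);

print_r($encoded_parts);
```

Each chunk holds at most 5 visible characters, although its encoded byte length varies with the entities it contains.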

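A small brute-force solution, which I'm not really happy with, is a PCRE expression. Let's assume you want to let 80 characters through and that the longest possible HTML entity is 7 characters long: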

$regex = '~^(.{73}([^&]{7}|.{0,7}$|[^&]{0,6}&[^;]+;))(.*)~mx';
// Note, this could return a slightly shorter text
return preg_replace($regex, '$1', $text);

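To break it down:

.{73} - 73 characters
[^&]{7} - okay, we can fill the rest with anything that doesn't contain &
.{0,7}$ - keep in mind the possible end of the text (this shouldn't be necessary, because shorter text wouldn't match at all)
[^&]{0,6}&[^;]+; - up to 6 characters (you'd be at position 79), then & and let the entity finish

Something that looks much better, but requires playing with the numbers a bit, is: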

// If $text is not longer than $N there is nothing to cut :)
if (strlen($text) <= $N) {
    return $text;
}
$head = substr($text, 0, $N);
// Get the last & within the first $N bytes
$pos = strrpos($head, '&');
// We're not young anymore, we have to check this too (no entities at all) :)
if ($pos === false) {
    return $head;
}
// Get the last ; within the first $N bytes
$end = strrpos($head, ';');
// false must not count as smaller than 0 (entity opened at the very beginning)
if ($end === false) {
    $end = -1;
}
// Okay, the entity is closed (; comes after &)
if ($end > $pos) {
    return $head;
}
// Now we need to find the first ; at or after $N
$end = strpos($text, ';', $N);
if ($end === false) {
    // Not valid HTML, unclosed entity, do whatever you want (here: the plain cut)
    return $head;
}
// + 1 so that the semicolon itself is kept
return substr($text, 0, $end + 1);

(Check the numbers, there may be a +/-1 in the indexes.)

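I think you'd have to use a combination of strpos and strrpos to find the next and previous spaces, parse the text between the spaces, check it against a known list of special characters, and if it matches, extend your "cut" to the position of the next space. If you had a code sample of what you have right now, we could give you a better answer.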
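A rough sketch of that idea, under a few assumptions: the helper name cut_at_word_if_entity() is made up, words are separated by single spaces, and a generic entity pattern stands in for the "known list of special characters".

```php
<?php
// Hypothetical helper: if the word around the cut contains an entity,
// extend the cut to the next space; otherwise cut at $n bytes as usual.
function cut_at_word_if_entity(string $text, int $n): string
{
    if (strlen($text) <= $n) {
        return $text;
    }
    // Previous space (before the cut) and next space (at or after the cut).
    $prev = strrpos(substr($text, 0, $n), ' ');
    $next = strpos($text, ' ', $n);
    $wordStart = ($prev === false) ? 0 : $prev + 1;
    $wordEnd = ($next === false) ? strlen($text) : $next;
    $word = substr($text, $wordStart, $wordEnd - $wordStart);

    // Stand-in for a known list: any named or numeric entity.
    if (preg_match('/&(#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i', $word) === 1) {
        return substr($text, 0, $wordEnd);
    }
    return substr($text, 0, $n);
}

$s = 'This is my string with a special char : &egrave; - and more';
echo cut_at_word_if_entity($s, 45), "\n";
```

Note that this extends the cut whenever the surrounding word contains an entity, even if the cut point itself falls outside it, which is what the description above implies.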