Regular expression for slicing UTF-8 strings at line breaks or after a number of characters

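I found a function on the web that uses a regular expression to iterate over a string and insert a line break after a specified number of characters, so that the text fits into a narrow table cell of fixed width. Here is the function: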

/**
 * wordwrap for utf8 encoded strings
 *
 * @param string  $str
 * @param integer $width
 * @param string  $break
 * @param boolean $cut
 * @return string
 * @author Milian Wolff <mail@milianw.de>
 */
function utf8_wordwrap($str, $width, $break, $cut = false) {
    if (!$cut || $_SESSION['wordwrap']) {
        // one ASCII byte, or a lead byte followed by continuation bytes, $width times
        $regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){'.$width.'}#';
    } else {
        return $str; // if no wordwrap turned on, returns the original string
    }
    if (function_exists('mb_strlen')) {
        $str_len = mb_strlen($str, 'UTF-8');
    } else {
        // count ASCII bytes and lead bytes, i.e. one per character
        $str_len = preg_match_all('/[\x00-\x7F\xC0-\xFD]/', $str, $var_empty);
    }
    $while_what = ceil($str_len / $width);
    $i = 1;
    $return = '';
    while ($i < $while_what) {
        preg_match($regexp, $str, $matches);
        $string = $matches[0];
        $return .= $string.$break;
        $str = substr($str, strlen($string));
        $i++;
    }
    return $return.$str;
}

Here is the regular expression:

#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){20}#

Used together with the while loop, it does its job well as long as the string contains no line breaks.

Example string:

1. first
2. second
3. third

Output of preg_match:

array (
  0 => '1. first
2. second
3',
)

So it simply counts to the 20th character, line breaks included, and returns everything up to that point.
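This follows from the character class itself: `[\x00-\x7F]` covers every ASCII byte, including the line feed (0x0A), so a newline is counted like any other character. A minimal sketch that reproduces the behaviour, assuming the three-line sample above is separated by plain "\n" (the variable names are illustrative only):

```php
<?php
// The sample string from the question; the line breaks are real "\n" bytes.
$str = "1. first\n2. second\n3. third";

// Same pattern as above, with a fixed width of 20 characters.
$regexp = '#^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]+){20}#';

preg_match($regexp, $str, $matches);

// "\n" (0x0A) lies inside [\x00-\x7F], so the match runs across line breaks.
$newlines = substr_count($matches[0], "\n");

var_dump($matches[0]);
var_dump($newlines);
```

The match contains two newlines and ends right after the "3", which is the output shown above.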

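What I need: make it return everything up to the newline character (\n) or, if there is none, the first 20 characters. So in this case the output would look like this: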

array (
      0 => '1. first',
      1 => '2. second',
      2 => '3. third'
    )

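Update: I tried Steve Robbins's answer, and it works well until the string contains some non-ASCII UTF-8 characters. That is my fault; I did not provide a decent example in the first place. Here is what it does: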

<?php
header('Content-type: text/html; charset=UTF-8');
$input = '1. first
2. second
3. third
ez eg nyoulőűúúú3456789öüö987654323456789öü
pam
papam';
$output = array();
foreach (explode("\n", $input) as $value) {
    foreach (str_split($value, 20) as $v) {
        $trimmed = trim($v);
        if (!empty($trimmed))
            $output[] = $trimmed;
    }
}
var_dump($output);

The output is:

array(8) {
  [0]=>
  string(8) "1. first"
  [1]=>
  string(9) "2. second"
  [2]=>
  string(8) "3. third"
  [3]=>
  string(20) "ez eg nyoulőűúú�"
  [4]=>
  string(20) "�3456789öüö987654"
  [5]=>
  string(13) "323456789öü"
  [6]=>
  string(3) "pam"
  [7]=>
  string(5) "papam"
}

http://codepad.org/Gt4CshXt
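The broken characters (�) appear because `str_split` counts bytes, not characters: letters such as `ő` and `ú` take two bytes each in UTF-8, so a cut after 20 bytes can land in the middle of one. A sketch of a character-aware variant, assuming PCRE with UTF-8 support; `utf8_str_split` is a hypothetical name, and on PHP 7.4 or later the built-in `mb_str_split` should serve the same purpose.

```php
<?php
/**
 * Character-aware counterpart to str_split() for UTF-8 input.
 * Hypothetical helper for illustration.
 */
function utf8_str_split(string $str, int $width): array
{
    // With /u, "." matches one code point; /s lets it match newlines as well.
    $ok = preg_match_all('/.{1,' . $width . '}/us', $str, $matches);
    return $ok ? $matches[0] : [];
}

$line = 'ez eg nyoulőűúúú3456789öüö987654323456789öü';

$byBytes = str_split($line, 20);      // cuts every 20 bytes
$byChars = utf8_str_split($line, 20); // cuts every 20 characters

// preg_match() with /u returns false when the subject is not valid UTF-8,
// which is the case for the first byte-based chunk.
var_dump(preg_match('//u', $byBytes[0]));
var_dump($byChars);
```

Joining the character-based chunks back together yields the original line unchanged, and every chunk is valid UTF-8 on its own.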

Why use a regular expression at all?

<?php
$input = '1. first
2. second
3. third';
$output = array();
foreach (explode("\n", $input) as $value) {
    foreach (str_split($value, 20) as $v) {
        $trimmed = trim($v);
        if (!empty($trimmed))
            $output[] = $trimmed;
    }
}
var_dump($output);

array(3) {
  [0]=>
  string(8) "1. first"
  [1]=>
  string(9) "2. second"
  [2]=>
  string(8) "3. third"
}

Example: http://codepad.org/OoillEUu

Thanks, everyone, for your efforts! I found a solution:

<?php
header('Content-Type: text/html; charset=utf-8');
$input = '1. first
2. second
3. third
ez eg nyoulőűúúú3456789öüö987654323456789öü
pam
papam';
var_dump(utf8_wordwrap($input,20,"<br>",true));
function utf8_wordwrap($string, $width=20, $break="\n", $cut=false)
{
  if($cut) {
    // Match anything 1 to $width chars long followed by whitespace or EOS,
    // otherwise match anything $width chars long
    $search = '/(.{1,'.$width.'})(?:\s|$)|(.{'.$width.'})/uS';
    $replace = '$1$2'.$break;
  } else {
    // Anchor the beginning of the pattern with a lookahead
    // to avoid crazy backtracking when words are longer than $width
    $search = '/(?=\s)(.{1,'.$width.'})(?:\s|$)/uS';
    $replace = '$1'.$break;
  }
  return preg_replace($search, $replace, $string);
}
?>
string '1. first
<br>2. second
<br>3. third
<br>ez eg<br>nyoulőűúúú3456789öüö<br>987654323456789öü
<br>pam
<br>papam<br>' (length=122)