preg_split regex for splitting on a range but retaining some of that range as the string suffix

Text being processed:

This is some text
which I am working   on.
This text has whitespace before the new line but after this word 
Another line.

I am using preg_split to split on Unicode spaces and all special whitespace characters except the newline, as follows:

preg_split("/\p{Z}|[^\S\n]/u", $data, -1, PREG_SPLIT_OFFSET_CAPTURE);

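The PREG_SPLIT_OFFSET_CAPTURE flag is there because I absolutely need to retain the position of each piece in the string.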

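I would like preg_split to also keep each newline character together with the word it comes after. Alternatively, the newline could appear at the beginning of the next word, or even as an element on its own.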

Expected output when working correctly:

This
is
some
text\n
which
I
am
working
on.\n
This
text
has
whitespace
before
the
new
line
but
after
this
word\n
Another
line.

Can anyone explain how to do this? Thanks.

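Use a lookbehind to match the zero-width boundary that exists right after a newline, and a negative lookahead so that spaces directly before a newline are not split on.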

<?php
$str = <<<EOT
This is some text
which I am working   on.
This text has whitespace before the new line but after this word 
Another line.
EOT;
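// Split at the empty position right after a newline, or on a run of separators not followed by a newline.
$splits = preg_split("~(?<=\n)|\p{Z}+(?!\n)~", $str);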
print_r($splits);
?>

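Output (elements 3, 8, and 20 end with the newline they were split after; the blank lines print_r emits for them are omitted here):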

Array
(
    [0] => This
    [1] => is
    [2] => some
    [3] => text
    [4] => which
    [5] => I
    [6] => am
    [7] => working
    [8] => on.
    [9] => This
    [10] => text
    [11] => has
    [12] => whitespace
    [13] => before
    [14] => the
    [15] => new
    [16] => line
    [17] => but
    [18] => after
    [19] => this
    [20] => word 
    [21] => Another
    [22] => line.
)
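
Since the question needs string positions, the same pattern can be combined with PREG_SPLIT_OFFSET_CAPTURE. What follows is only a sketch under a few assumptions that are not in the answer above: the /u modifier is added for UTF-8 input, PREG_SPLIT_NO_EMPTY is added to drop any empty pieces, and the array destructuring in the foreach needs PHP 7.1 or later. The offsets returned are byte offsets, not character offsets.

```php
<?php
// Sample text from the question; note the trailing space after "word".
$str = "This is some text\nwhich I am working   on.\n"
     . "This text has whitespace before the new line but after this word \n"
     . "Another line.";

// Same pattern as in the answer, with /u added (an assumption) for UTF-8 subjects.
// Single quotes pass \n and \p{Z} through to PCRE unchanged.
$pattern = '~(?<=\n)|\p{Z}+(?!\n)~u';

// PREG_SPLIT_OFFSET_CAPTURE makes each element a [text, byte offset] pair;
// PREG_SPLIT_NO_EMPTY discards empty pieces.
$parts = preg_split($pattern, $str, -1, PREG_SPLIT_OFFSET_CAPTURE | PREG_SPLIT_NO_EMPTY);

foreach ($parts as [$word, $offset]) {
    // json_encode makes the retained "\n" visible in the printed output.
    echo $offset, "\t", json_encode($word), "\n";
}
```

Each word that ends a line keeps its newline as a suffix, and "word" also keeps the space before it, because the negative lookahead prevents a split on separators that directly precede a newline. Tabs and other non-separator whitespace from the question's `[^\S\n]` class are not covered by `\p{Z}` and would need to be added to the pattern.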