Split string by delimiter, but not if it is escaped

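How can I split a string by a delimiter, but not where the delimiter is escaped? For example, I have this string:

1|2\|2|3\\|4\\\|4

The delimiter is | and an escaped delimiter is \|. I also want escaped backslashes to be ignored, so in \\| the | is still a delimiter.

So for the string above, the result should be:

[0] => 1
[1] => 2\|2
[2] => 3\\
[3] => 4\\\|4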

Use dark magic:

$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

\\\\. matches a backslash followed by any character, (*SKIP)(*FAIL) skips that pair, and \| matches your delimiter.
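To see the (*SKIP)(*FAIL) pattern at work, here is a minimal sketch that runs it on the sample string from the question. The variable names are only illustrative, and the double-quoted literal assumes each backslash of the input is written as "\\".

```php
<?php
// Sample string from the question: 1|2\|2|3\\|4\\\|4
// (in a double-quoted PHP literal every backslash is written twice).
$string = "1|2\\|2|3\\\\|4\\\\\\|4";

// First alternative:  a backslash plus the character it escapes. (*SKIP)(*FAIL)
//                     throws that pair away, so it can never be a split point.
// Second alternative: a pipe that was not consumed as part of such a pair,
//                     i.e. a real delimiter.
$parts = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

print_r($parts);
```

The escape characters stay in the tokens, so the four elements should be the ones listed in the question: 1, 2\|2, 3\\ and 4\\\|4.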

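Instead of split(...), it is IMO more intuitive to use some sort of "scan" function that operates like a lexical tokenizer. In PHP that would be the preg_match_all function. You simply say that you want to match:

  1. something other than a \ or |
  2. or a \ followed by a \ or |
  3. #1 or #2 repeated at least once

The following demo:

$input = "1|2\\|2|3\\\\|4\\\\\\|4";
echo $input . "\n\n";
preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $parts);
print_r($parts[0]);

will print:

1|2\|2|3\\|4\\\|4

Array
(
    [0] => 1
    [1] => 2\|2
    [2] => 3\\
    [3] => 4\\\|4
)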

Recently I came up with this solution:

$array = preg_split('~ ((?<!\\\\)|(?<=[^\\\\](\\\\\\\\)+)) \| ~x', $string);

But the dark magic solution is still three times faster.

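For future readers, here is a general-purpose solution. It is based on NikiC's idea of using (*SKIP)(*FAIL):

function split_escaped($delimiter, $escaper, $text)
{
    // Quote both strings so they are matched literally inside the pattern.
    $d = preg_quote($delimiter, "~");
    $e = preg_quote($escaper, "~");
    // Skip every escaped escaper or escaped delimiter, then split on what is left.
    $tokens = preg_split(
        '~' . $e . '(' . $e . '|' . $d . ')(*SKIP)(*FAIL)|' . $d . '~',
        $text
    );
    // Protect backslashes and dollar signs, which are special in replacement strings.
    $escaperReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $escaper);
    $delimiterReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $delimiter);
    // Unescape the tokens: doubled escaper -> escaper, escaped delimiter -> delimiter.
    return preg_replace(
        ['~' . $e . $e . '~', '~' . $e . $d . '~'],
        [$escaperReplacement, $delimiterReplacement],
        $tokens
    );
}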

Give it a try:

// the base situation:
$text = "asdf\\,fds\\,ddf,\\\\,f\\,,dd";
$delimiter = ",";
$escaper = "\\";
print_r(split_escaped($delimiter, $escaper, $text));
// other signs:
$text = "dk!%fj%slak!%df!!jlskj%%dfl%isr%!%%jlf";
$delimiter = "%";
$escaper = "!";
print_r(split_escaped($delimiter, $escaper, $text));
// delimiter with multiple characters:
$text = "aksd()jflaksd())jflkas(('()j()fkl'()()as()d('')jf";
$delimiter = "()";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));
// escaper is same as delimiter:
$text = "asfl''asjf'lkas'''jfkl''d'jsl";
$delimiter = "'";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));

Output:

Array
(
    [0] => asdf,fds,ddf
    [1] => \
    [2] => f,
    [3] => dd
)
Array
(
    [0] => dk%fj
    [1] => slak%df!jlskj
    [2] => 
    [3] => dfl
    [4] => isr
    [5] => %
    [6] => jlf
)
Array
(
    [0] => aksd
    [1] => jflaksd
    [2] => )jflkas((()j
    [3] => fkl()
    [4] => as
    [5] => d(')jf
)
Array
(
    [0] => asfl'asjf
    [1] => lkas'
    [2] => jfkl'd
    [3] => jsl
)

Note: there is a problem at the theoretical level: implode('::', ['a:', ':b']) and implode('::', ['a', '', 'b']) produce the same string, 'a::::b'. Imploding can be an interesting problem too.
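The ambiguity is easy to reproduce. The sketch below first shows the collision, then a possible escaping join. Note that join_escaped is a hypothetical helper, not part of the answer, and it assumes a single-character delimiter that differs from a single-character escaper. The multi-character '::' case from the note is exactly what it does not solve.

```php
<?php
// Two different token lists collapse to the same string with a plain implode().
$first  = implode('::', ['a:', ':b']);
$second = implode('::', ['a', '', 'b']);
var_dump($first === $second); // bool(true): both are "a::::b"

// Hypothetical helper (an assumption for illustration): escape the escaper
// and the delimiter inside every token before joining. strtr() replaces in a
// single pass, so text it has already inserted is never replaced again.
function join_escaped(string $delimiter, string $escaper, array $tokens): string
{
    $escapedTokens = array_map(
        fn (string $token): string => strtr($token, [
            $escaper   => $escaper . $escaper,
            $delimiter => $escaper . $delimiter,
        ]),
        $tokens
    );
    return implode($delimiter, $escapedTokens);
}

$joined = join_escaped(',', '\\', ['a,b', 'c\\', 'd']);
echo $joined, "\n"; // a\,b,c\\,d
```

Under those assumptions, a string built this way can be taken apart again by split_escaped(',', '\\', $joined), because every delimiter or escaper inside a token is preceded by the escaper.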

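Regex is painfully slow. A better approach is to replace the escaped characters in the string with placeholders before splitting, and then put them back:

$foo = 'a,b|,c,d||,e';
function splitEscaped($str, $delimiter, $escapeChar = '\\') {
    // Just some temporary strings to use as markers that will not appear in the original string
    $double = "\0\0\0_doub";
    $escaped = "\0\0\0_esc";
    // Hide doubled escape characters first, then escaped delimiters
    $str = str_replace($escapeChar . $escapeChar, $double, $str);
    $str = str_replace($escapeChar . $delimiter, $escaped, $str);
    $split = explode($delimiter, $str);
    // Turn the markers back into a literal escape character and a literal delimiter
    foreach ($split as &$val) $val = str_replace([$double, $escaped], [$escapeChar, $delimiter], $val);
    return $split;
}
print_r(splitEscaped($foo, ',', '|'));

This splits on "," but not where it is escaped with "|". It also supports double escapes, so "||" becomes a single "|" after the split: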

Array ( [0] => a [1] => b,c [2] => d| [3] => e )
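One difference from the regex answers is worth keeping in mind: this function returns unescaped tokens, while the preg_split pattern from the dark magic answer leaves the escape characters in place. A minimal sketch, assuming the same function body as above (plus an unset() of the loop reference, which is only a tidiness choice) and the default backslash escape character, applied to the sample string from the question:

```php
<?php
function splitEscaped($str, $delimiter, $escapeChar = '\\') {
    // Markers that are assumed never to occur in the input string.
    $double = "\0\0\0_doub";
    $escaped = "\0\0\0_esc";
    // Hide doubled escape characters first, then escaped delimiters.
    $str = str_replace($escapeChar . $escapeChar, $double, $str);
    $str = str_replace($escapeChar . $delimiter, $escaped, $str);
    $split = explode($delimiter, $str);
    // Restore the markers as a literal escape character and a literal delimiter.
    foreach ($split as &$val) {
        $val = str_replace([$double, $escaped], [$escapeChar, $delimiter], $val);
    }
    unset($val); // drop the reference left behind by the foreach
    return $split;
}

// Sample string from the question: 1|2\|2|3\\|4\\\|4
$result = splitEscaped("1|2\\|2|3\\\\|4\\\\\\|4", '|');
print_r($result);
```

The four tokens come back as 1, 2|2, 3\ and 4\|4, so the escapes have been resolved rather than preserved. Whether that is desirable depends on what the caller does with the tokens next.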