Splitting a string into an array using a regex to obtain key-value pairs

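I'm parsing a text, but I can't get one of the pieces when a space is missing (which is allowed).
Edit: I added colons inside the free text.
Edit: This is an arbitrary text format in which key-value pairs can be written. After discarding element [0], the remaining elements of the array form a key-value sequence. Multi-line values are accepted.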

This is the test case text:

:part1  only one \s removed:OK
:part2 :text :with
new lines
on it
:noSpaceAfterThis
:thisShoudBeAStandAlongText but: here there are more text
:part4 :even more text

This is what I want:

Array
(
    [0] => 
    [1] => part1
    [2] =>  only one \s removed:OK
    [3] => part2
    [4] => :text :with
new lines
on it
    [5] => noSpaceAfterThis
    [6] => 
    [7] => thisShoudBeAStandAlongText
    [8] => but: here there are more text
    [9] => part4
    [10] => :even more text
)

This is what I get:

Array
(
    [0] => 
    [1] => part1
    [2] =>  only one \s removed:OK
    [3] => part2
    [4] => :text :with
new lines
on it
    [5] => noSpaceAfterThis
    [6] => :thisShoudBeAStandAlongText but: here there are more text
    [7] => part4
    [8] => :even more text
)

This is my test code:

<?php
$text = '
:part1  only one \s removed:OK
:part2 :text :with
new lines
on it
:noSpaceAfterThis
:thisShoudBeAStandAlongText but: here there are more text
:part4 :even more text';
echo '<pre>';
// my effort so far:
$ret = preg_split('|\r?\n:([\w\d]+)(?:\r?\s)?|i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($ret);
// nor this one:
$ret = preg_split('|\r?\n:([\w\d]+)\r?\s?|i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($ret);
// for debugging, an extra capturing group
$ret = preg_split('|\r?\n:([\w\d]+)(\r?\s)?|i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
var_dump($ret);

Another way, with preg_match_all:

$pattern = '~(?<=^:|\n:)\S++|(?<=\s)(?:[^:]+?|(?<!\n):)+?(?= *+(?>\n:|$))~';
preg_match_all($pattern, $text, $matches);
echo '<pre>' . print_r($matches[0], true);

Pattern details:

# capture the whole first word at the beginning of a line, preceded by a colon #
(?<=^:|\n:)       # lookbehind: preceded by the beginning of the string
                  # and a colon, or by a newline and a colon
\S++              # everything that is not whitespace
# capture all the content until the next line with : at first position #
(?<=\s)           # lookbehind: preceded by a whitespace character
(?:               # open a non-capturing group
   [^:]+?         # any character that is not a colon, one or more times (lazy)
  |               # OR
   (?<!^|\n):     # negative lookbehind: a colon not preceded by a newline
                  # or the beginning of the string
)+?               # close the non-capturing group,
                  # repeat one or more times (lazy)
(?= *+(?>\n:|$))  # lookahead: followed by spaces (zero or more) and a newline
                  # with a colon at first position, or the end of the string

The advantage here is that it avoids empty results.

Or with preg_split:

$res = preg_split('~(?:\s*\n|^):(\S++)(?: )?~', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

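Explanation:

The goal is to split the text in two situations:

  • on a newline, when the first character of the next line is :
  • at the first space of a line, when that line begins with :

So at the beginning of such a line there are two splitting points, one on each side of the :word. The : and the space after the word must be removed, but the word itself must be preserved. This is why I use PREG_SPLIT_DELIM_CAPTURE to keep the word.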
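To see what PREG_SPLIT_DELIM_CAPTURE changes, here is a minimal sketch on a two-line sample. The `$sample` string is made up for illustration and is not part of the question's test case.

```php
<?php
// Hypothetical two-line sample, not the question's test text.
$sample  = ":key value\n:other more";
$pattern = '~(?:\s*\n|^):(\S++)(?: )?~';

// Without the flag, everything the pattern matches is dropped, keys included.
$without = preg_split($pattern, $sample);

// With the flag, the text captured by (\S++) is returned between the pieces.
$with = preg_split($pattern, $sample, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($without); // '', 'value', 'more'
print_r($with);    // '', 'key', 'value', 'other', 'more'
```

Only the capturing group survives; the colon, the newline and the optional space sit outside it and are still discarded.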

Pattern details:

(?:           # non-capturing group (everything inside will be removed)
   \s*\n      # trim the trailing whitespace of the preceding line and the newline
  |           # OR
   ^          # the beginning of the string
)             # end of the non-capturing group
:             # remove the first character when it is a :
(\S++)        # keep the first word with DELIM_CAPTURE
(?: )?        # remove the first space if present
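
As a rough end-to-end sketch, the split can be run on the question's test text and the flat result folded into key-value pairs by discarding element [0], as the question describes. The variable names `$parts`, `$chunks` and `$pairs` are illustrative, and the text is built with implode() only so that the line endings are known to be "\n".

```php
<?php
// The question's test text, with explicit "\n" line endings.
$text = implode("\n", [
    '',
    ':part1  only one \s removed:OK',
    ':part2 :text :with',
    'new lines',
    'on it',
    ':noSpaceAfterThis',
    ':thisShoudBeAStandAlongText but: here there are more text',
    ':part4 :even more text',
]);

$parts = preg_split('~(?:\s*\n|^):(\S++)(?: )?~', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Drop element [0], then group the rest two by two: [key, value].
$chunks = array_chunk(array_slice($parts, 1), 2);

// Use column 0 of each chunk as the key and column 1 as the value.
$pairs = array_column($chunks, 1, 0);

var_dump($pairs['noSpaceAfterThis']); // string(0) ""
echo $pairs['part2'], "\n";           // the three-line value
```

Because `(?: )?` matches only a literal space, it cannot consume the newline that introduces the next key, which `\s` can. That is why `noSpaceAfterThis` gets an empty value here instead of absorbing the following line.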