Extract fragments of text via PHP and REGEXP

Suppose I have a string variable:

$str = '
[WhiteTitle "GM"]
[WhiteCountry "Cuba"]
[BlackCountry "United States"]
1. d4 d5 2. Nf3 Nf6 3. e3 c6 4. c4 e6 5. Nc3 Nbd7 6. Bd3 Bd6
7. O-O O-O 8. e4 dxe4 9. Nxe4 Nxe4 10. Bxe4 Nf6 11. Bc2 h6
12. b3 b6 13. Bb2 Bb7 14. Qd3 g6 15. Rae1 Nh5 16. Bc1 Kg7
17. Rxe6 Nf6 18. Ne5 c5 19. Bxh6+ Kxh6 20. Nxf7+ 1-0
';

I want to extract some information from this variable into an array that looks like this:

Array {
    ["WhiteTitle"] => "GM",
    ["WhiteCountry"] => "Cuba",
    ["BlackCountry"] => "United States"
}

Thanks.

Here is a safer, more compact solution:

$re = '~\[([^]["]*?)\s*"([^]"]+)~';   // Defining the regex
$str = "[WhiteTitle \"GM\"]\n[WhiteCountry \"Cuba\"]\n[BlackCountry \"United States\"]\n\n1. d4 d5 2. Nf3 Nf6 3. e3 c6 4. c4 e6 5. Nc3 Nbd7 6. Bd3 Bd6\n7. O-O O-O 8. e4 dxe4 9. Nxe4 Nxe4 10. Bxe4 Nf6 11. Bc2 h6\n12. b3 b6 13. Bb2 Bb7 14. Qd3 g6 15. Rae1 Nh5 16. Bc1 Kg7\n17. Rxe6 Nf6 18. Ne5 c5 19. Bxh6+ Kxh6 20. Nxf7+ 1-0";
preg_match_all($re, $str, $matches);  // Getting all matches
print_r(array_combine($matches[1], $matches[2])); // Creating the final array with array_combine

See the IDEONE PHP demo and the regex demo.

Regex details

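  • \[ - an opening [
  • ([^]["]*?) - Group 1, matching 0+ characters other than ", [ and ], as few as possible, up to the following whitespace or quote
  • \s* - 0+ whitespace characters (this trims the first value)
  • " - a double quote
  • ([^]"]+) - Group 2, matching 1+ characters other than ] and "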
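The pieces listed above can be checked on a single tag line. This is only a minimal sketch: the input line with several spaces before the quote is a made-up example, used to show that the whitespace is consumed by \s* and does not end up in Group 1.

```php
<?php
// Same pattern as in the answer; \s* sits outside Group 1,
// so trailing spaces after the key are not captured.
$re = '~\[([^]["]*?)\s*"([^]"]+)~';

// Hypothetical input with extra spaces between key and value
$line = '[WhiteTitle    "GM"]';

preg_match($re, $line, $m);

// $m[1] is the key (Group 1), $m[2] is the value (Group 2)
var_dump($m[1], $m[2]);
```

Because Group 1 is lazy and followed by \s*, the key comes back without trailing spaces, so no extra trim() call is needed.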

You can use:

preg_match_all('/\[(.*?) "(.*?)"\]/m', $str, $matches, PREG_SET_ORDER);
print_r($matches);

This gives you all the matches in an array. In each match, key 0 is the full match, key 1 is the first part, and key 2 is the second part:

Output:
Array
(
    [0] => Array
        (
            [0] => [WhiteTitle "GM"]
            [1] => WhiteTitle
            [2] => GM
        )
    [1] => Array
        (
            [0] => [WhiteCountry "Cuba"]
            [1] => WhiteCountry
            [2] => Cuba
        )
    [2] => Array
        (
            [0] => [BlackCountry "United States"]
            [1] => BlackCountry
            [2] => United States
        )
)

If you want the format you asked for, you can use a simple loop:

$array = array();
foreach($matches as $match){
    $array[$match[1]] = $match[2];
}
print_r($array);
Output:
Array
(
    [WhiteTitle] => GM
    [WhiteCountry] => Cuba
    [BlackCountry] => United States
)
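As an alternative to the loop, the same reshaping can probably be done with array_column(), which is available from PHP 5.5 onward. This is a sketch, not part of the answer above; the shortened $str is an assumption made to keep the example small.

```php
<?php
// Shortened sample input (assumption: same tag layout as in the question)
$str = "[WhiteTitle \"GM\"]\n[WhiteCountry \"Cuba\"]\n[BlackCountry \"United States\"]\n1. d4 d5 2. Nf3 Nf6 1-0";

preg_match_all('/\[(.*?) "(.*?)"\]/', $str, $matches, PREG_SET_ORDER);

// Each element of $matches is [0 => full match, 1 => key, 2 => value].
// Take column 2 as the values and index the result by column 1.
$array = array_column($matches, 2, 1);

print_r($array);
```

This relies on PREG_SET_ORDER, where every match is its own row; with the default PREG_PATTERN_ORDER, array_combine($matches[1], $matches[2]) does the same job.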

You can use the following:

<?php
$string = <<< EOF
[WhiteTitle "GM"]
[WhiteCountry "Cuba"]
[BlackCountry "United States"]
1. d4 d5 2. Nf3 Nf6 3. e3 c6 4. c4 e6 5. Nc3 Nbd7 6. Bd3 Bd6
7. O-O O-O 8. e4 dxe4 9. Nxe4 Nxe4 10. Bxe4 Nf6 11. Bc2 h6
12. b3 b6 13. Bb2 Bb7 14. Qd3 g6 15. Rae1 Nh5 16. Bc1 Kg7
17. Rxe6 Nf6 18. Ne5 c5 19. Bxh6+ Kxh6 20. Nxf7+ 1-0
EOF;
$final = array();
preg_match_all('/\[(.*?)\s+(".*?")\]/', $string, $matches, PREG_PATTERN_ORDER);
for($i = 0; $i < count($matches[1]); $i++) {
    $final[$matches[1][$i]] = $matches[2][$i];
}
print_r($final);

Output:

Array
(
    [WhiteTitle] => "GM"
    [WhiteCountry] => "Cuba"
    [BlackCountry] => "United States"
)

Ideone demo:

http://ideone.com/wQYshT
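Note that the values in this output keep their surrounding double quotes, because the quotes are inside the second capture group, while the array requested in the question has them without. If that matters, one way to drop them is trim() with a character list. A minimal sketch on a shortened input (the two-line $string is an assumption for brevity):

```php
<?php
// Shortened sample input (assumption: same tag layout as in the question)
$string = '[WhiteTitle "GM"]' . "\n" . '[BlackCountry "United States"]';

preg_match_all('/\[(.*?)\s+(".*?")\]/', $string, $matches, PREG_PATTERN_ORDER);

$final = array();
foreach ($matches[1] as $i => $key) {
    // Group 2 includes the quotes, so strip them from both ends
    $final[$key] = trim($matches[2][$i], '"');
}

print_r($final);
```

Moving the quotes outside the group, as in "(.*?)", would avoid the trim() call altogether.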


Regex explanation:

\[(.*?)\s+(".*?")\]
Match the character “[” literally «\[»
Match the regex below and capture its match into backreference number 1 «(.*?)»
   Match any single character that is NOT a line break character (line feed) «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line) «\s+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 2 «(".*?")»
   Match the character “"” literally «"»
   Match any single character that is NOT a line break character (line feed) «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character “"” literally «"»
Match the character “]” literally «\]»
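The lazy quantifiers in the explanation above matter as soon as two tags share a line. A small sketch comparing the lazy pattern with a greedy variant; the one-line input and the greedy pattern are illustrative assumptions, not something from the question:

```php
<?php
// Hypothetical input: two tags on the same line
$line = '[A "x"] [B "y"]';

// Lazy version, as explained above: stops at the first possible closing point
preg_match_all('/\[(.*?)\s+(".*?")\]/', $line, $lazy);

// Greedy variant for comparison: .* grabs as much as it can
preg_match_all('/\[(.*)\s+(".*")\]/', $line, $greedy);

print_r($lazy[1]);   // keys found by the lazy pattern
print_r($greedy[1]); // keys found by the greedy pattern
```

With the lazy pattern each tag is matched separately, while the greedy one swallows both tags into a single match whose first group is A "x"] [B.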