Regex headache with multi-line text in PHP

Keywords: text, php, Regex | Updated: 2023-09-27

I have extracted unformatted text data from a PDF, like this:

AB01234 This could be a
long question with multiple
new lines a)these b)are c)the responses which could
contains new lines d)either b
AB01235 This is another question with same multiple
response a) one b) two c) three d) four c
...

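My goal is to group the question identifier, the question, the answers, and the correct answer, which is the last character. Is there a way to do this with a regular expression?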

{
   [0] => 'AB01234',
   [1] => 'This could be a long question with multiple new lines',
   [2] => 'these',
   [3] => 'are',
   [4] => 'the responses which could contains new lines',
   [5] => 'either',
   [6] => 'b'
}

I wouldn't try to do this with a single regular expression. There is too much variation in the input. I would clean up the text like this:

$text = '
    AB01234 This could be a
    long question with multiple
    new lines a)these b)are c)the responses which could
    contains new lines d)either b
    AB01235 This is another question with same multiple
    response a) one b) two c) three d) four c
';
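// Flag each question ID (two capitals + five digits) and each answer label such as "a)".
$text = preg_replace('/([A-Z]{2}[0-9]{5})/', ' QUESTION\1 ', $text);
$text = preg_replace('/([a-z]\))/', ' ANSWER\1 ', $text);
// Collapse every run of whitespace, newlines included, into a single space.
$text = trim(preg_replace('/\s+/', ' ', $text));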
print($text);

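You will see that the text is now fairly clean. It is a single line, and the spacing has been normalized. You also have clear QUESTION and ANSWER markers. You can change them to anything you like, such as !@#$#@!#, as long as it is something that will never appear in the text.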
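The requirement that the markers never appear in the text can be checked before they are inserted. The following is a minimal sketch, not part of the original answer: the helper name `markText` and its parameters are hypothetical, and it assumes the markers are plain letters with no `$` or `\` characters, since those have a special meaning in a `preg_replace` replacement string.

```php
<?php
// Hypothetical helper: insert markers only after confirming that
// neither marker already occurs in the raw text.
function markText(string $text, string $qMark = 'QUESTION', string $aMark = 'ANSWER'): string
{
    foreach ([$qMark, $aMark] as $mark) {
        if (strpos($text, $mark) !== false) {
            throw new InvalidArgumentException("Marker '$mark' already occurs in the text");
        }
    }
    // Flag question IDs, flag answer labels, then collapse whitespace.
    $text = preg_replace('/([A-Z]{2}[0-9]{5})/', " $qMark\\1 ", $text);
    $text = preg_replace('/([a-z]\))/', " $aMark\\1 ", $text);
    return trim(preg_replace('/\s+/', ' ', $text));
}

$raw = "AB01234 Which one\nis right? a)this b)that a";
echo markText($raw), "\n";
```

If a marker does collide with the text, the exception makes the problem visible immediately instead of producing silently wrong splits later.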

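Now you could try a regular expression, but at this point it is easier to split the text, because you have marked the delimiters. In this example I use explode and implode a lot, in case you haven't seen them much. You don't have to use them. You could use regular expressions or substrings instead.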
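For readers who have not used these functions much, here is a small illustration of how `explode`, `array_shift`, and `implode` work together. The sample string is made up for the demonstration.

```php
<?php
// explode() splits a string on a delimiter; implode() joins array items back together.
$piece = 'AB01234 This could be a long question';

$words = explode(' ', $piece);   // array of the seven space-separated words
$id    = array_shift($words);    // removes and returns the first word
$rest  = implode(' ', $words);   // joins the remaining words with single spaces

echo $id, "\n";
echo $rest, "\n";
```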

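$questions = array();
$qas = explode("QUESTION", $text);
foreach ($qas as $qa)
{
    $qa = trim($qa);
    if ($qa == "") continue; // skip the empty piece before the first marker
    $answers = explode("ANSWER", $qa);
    $q = array();
    foreach ($answers as $i => $answer)
    {
        $a = explode(' ', trim($answer));
        // The first token is the question ID (piece 0) or a label such as "a)".
        if ($i == 0) $q[] = $a[0];
        array_shift($a);
        // The last token of the last answer is the letter of the correct answer.
        $correct = ($i > 0 && $i == count($answers) - 1) ? array_pop($a) : null;
        $q[] = implode(' ', $a);
        if ($correct !== null) $q[] = $correct;
    }
    $questions[] = $q;
}
print_r($questions);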

You should now have the array you wanted.
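Once the markers are in place, a regular expression becomes a workable alternative to splitting. The sketch below is an assumption-laden illustration, not the method from the answer: it handles one question at a time and assumes every question has exactly four answers labelled a) to d), followed by a single letter for the correct answer. The `$chunk` string is what one piece would look like after cleanup and splitting on QUESTION.

```php
<?php
// Hypothetical alternative: parse one marked-up question with a single pattern.
$chunk = 'AB01235 This is another question with same multiple response '
       . 'ANSWERa) one ANSWERb) two ANSWERc) three ANSWERd) four c';

// Groups: ID, question, four answers, and the trailing correct-answer letter.
$pattern = '/^(\S+) (.*?) ANSWERa\) (.*?) ANSWERb\) (.*?) ANSWERc\) (.*?) ANSWERd\) (.*) ([a-d])$/';

if (preg_match($pattern, $chunk, $m)) {
    array_shift($m);   // drop the full match, keep the seven captured groups
    print_r($m);
}
```

Splitting on the markers remains more forgiving when the number of answers varies, which is why it is the approach shown above.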