I have extracted unformatted text data from a PDF. It looks like this:
AB01234 This could be a
long question with multiple
new lines a)these b)are c)the responses which could
contains new lines d)either b
AB01235 This is another question with same multiple
response a) one b) two c) three d) four c
...
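My goal is to group the question identifier, the question, the answers, and the correct answer, which is the last character. Is there a way to do this with a regular expression?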
{
[0] => 'AB01234',
[1] => 'This could be a long question with multiple new lines',
[2] => 'these',
[3] => 'are',
[4] => 'the responses which could contains new lines',
[5] => 'either',
[6] => 'b'
}
I wouldn't try to do this with a single regular expression. The input varies too much. I would clean up the text first, like this:
$text = '
AB01234 This could be a
long question with multiple
new lines a)these b)are c)the responses which could
contains new lines d)either b
AB01235 This is another question with same multiple
response a) one b) two c) three d) four c
';
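// Tag every question ID (two capitals + five digits) with a QUESTION marker.
$text = preg_replace('/([A-Z]{2}[0-9]{5})/', ' QUESTION\1 ', $text);
// Tag every answer label such as "a)" with an ANSWER marker.
$text = preg_replace('/([a-z]\))/', ' ANSWER\1 ', $text);
// Collapse all runs of whitespace (including newlines) into single spaces.
$text = trim(preg_replace('/\s+/', ' ', $text));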
print($text);
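You'll see that the text is now fairly clean. It is a single line, and the spacing has been normalized. You also have explicit "QUESTION" and "ANSWER" markers. You can change them to anything you like, for example !@#$#@!#QUESTION. They just have to be something that will never appear in the text.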
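If you are unsure whether a marker could already occur in the text, you can check before using it. The following is only a minimal sketch: `pick_marker` is a hypothetical helper name, the sample string is made up, and the counter suffix is just one possible way to make a marker unique.

```php
<?php
// Hypothetical helper (not part of the original answer): returns a marker
// based on $base that does not occur anywhere in $text.
function pick_marker(string $text, string $base): string
{
    $marker = $base;
    $n = 0;
    // Append a counter until the candidate no longer appears in the text.
    while (strpos($text, $marker) !== false) {
        $n++;
        $marker = $base . $n;
    }
    return $marker;
}

// Made-up sample in which the word QUESTION is part of the question itself.
$text = 'AB01234 Is this a QUESTION? a)yes b)no a';
$questionMarker = pick_marker($text, 'QUESTION'); // collides, so a suffix is added
$answerMarker   = pick_marker($text, 'ANSWER');   // no collision, used as-is

// Use the chosen markers in place of the hard-coded ones.
$text = preg_replace('/([A-Z]{2}[0-9]{5})/', ' ' . $questionMarker . '${1} ', $text);
$text = preg_replace('/([a-z]\))/', ' ' . $answerMarker . '${1} ', $text);
$text = trim(preg_replace('/\s+/', ' ', $text));
```

The later `explode` calls would then split on `$questionMarker` and `$answerMarker` instead of the literal strings.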
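Now you could try a regular expression, but at this point splitting is easier because you have marked the delimiters. I use explode and implode a lot in this example, in case you haven't seen them much. You don't have to use them; you could use a regular expression or substring functions instead.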
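$questions = array();
// Each chunk after a QUESTION marker holds one question plus its answers.
$qas = explode("QUESTION", $text);
foreach ($qas as $qa)
{
    if (trim($qa) == "") continue; // skip the empty chunk before the first marker
    $answers = explode("ANSWER", $qa);
    $q = array();
    foreach ($answers as $i => $answer)
    {
        $a = explode(' ', trim($answer));
        // Chunk 0 starts with the question ID; the others start with "a)", "b)", ...
        if ($i == 0) $q[] = $a[0];
        array_shift($a); // drop the ID or the answer label
        $q[] = implode(' ', $a);
    }
    $questions[] = $q;
}
print_r($questions);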
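You should now have one array per question: the identifier, the question text, and then each answer. The correct-answer letter is still attached to the end of the last answer (for example "either b").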
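To get the correct answer as its own element, as the question asked, one option is to split the last entry of each question at its final space. This is a minimal sketch under the assumption that every question ends with a single-token answer key; `split_correct_answer` is a hypothetical helper name, not part of the code above.

```php
<?php
// Hypothetical helper: takes one parsed question (ID, question text,
// answers...) and moves the trailing answer key into its own element.
function split_correct_answer(array $q): array
{
    // The last element looks like "either b": answer text plus the key.
    $last = trim((string) array_pop($q));
    $pos = strrpos($last, ' ');
    if ($pos === false) {
        // No space found, so there is nothing to split; put it back as-is.
        $q[] = $last;
    } else {
        $q[] = substr($last, 0, $pos);  // answer text without the key
        $q[] = substr($last, $pos + 1); // the key, e.g. "b"
    }
    // Remove any leftover leading or trailing spaces from every element.
    return array_map('trim', $q);
}

// One question in the shape produced by the explode/implode loop.
$q = array(
    'AB01234',
    'This could be a long question with multiple new lines',
    'these',
    'are',
    'the responses which could contains new lines',
    'either b',
);
print_r(split_correct_answer($q));
```

Applied to the whole result, this would be `$questions = array_map('split_correct_answer', $questions);`.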