如何使用 php(并且没有正则表达式)将文本文件中的不同部分分解为数组


How to explode different section from a textfile into an array using php (and no regex)?

这个问题几乎与如何将结构化文本文件转换为PHP多维数组重复,但由于我无法理解给出的基于正则表达式的解决方案,我再次发布了它。尝试仅使用 PHP 解决此问题似乎更好,以便我可以实际从中学习(正则表达式在这一点上太难理解了(。

假定以下文本文件:

HD Alcoa Earnings Soar; Outlook Stays Upbeat 
BY By James R. Hagerty and Matthew Day 
PD 12 July 2011
LP 
Alcoa Inc.'s profit more than doubled in the second quarter.
The giant aluminum producer managed to meet analysts' forecasts.
However, profits wereless than expected
TD
Licence this article via our website:
http://example.com

我用PHP阅读了这个文本文件,需要一种强大的方法来将文件内容放入数组中,如下所示:

array(
  [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat,
  [BY] => By James R. Hagerty and Matthew Day,
  [PD] => 12 July 2011,
  [LP] => Alcoa Inc.'s profit...than expected,
  [TD] => Licence this article via our website: http://example.com
)

单词HD BY PD LP TD是标识文件中新节的关键。在数组中,可以从值中删除所有换行符。理想情况下,我将能够在没有正则表达式的情况下执行此操作。我相信在所有键上爆炸可能是一种方式,但它会非常肮脏:

$fields = array('HD', 'BY', 'PD', 'LP', 'TD');
$parts = explode($text, "'nHD ");
$HD = $parts[0];

没有人对如何遍历文本有更清晰的想法,甚至可能是一次,并将其划分为上面给出的数组?

这是另一种不使用正则表达式的更短的方法。

/**
 * @param  array  array of stopwords eq: array('HD', 'BY', ...)
 * @param  string Text to search in
 * @param  string End Of Line symbol
 * @return array  [ stopword => string, ... ]
 */
function extract_parts(array $parts, $str, $eol=PHP_EOL) {
  $ret=array_fill_keys($parts, '');
  $current=null;
  foreach(explode($eol, $str) AS $line) {
    $substr = substr($line, 0, 2);
    if (isset($ret[$substr])) {
      $current = $substr;
      $line = trim(substr($line, 2));
    }
    if ($current) $ret[$current] .= $line;
  }
  return $ret;
}
$ret = extract_parts(array('HD', 'BY', 'PD', 'LP', 'TD'), $str);
var_dump($ret);

为什么不使用正则表达式?

由于 php 文档,特别是在 preg_* 函数中,建议如果不强烈要求,不要使用正则表达式。我想知道这个问题的答案中哪个例子具有最好的效果。

结果让我自己大吃一惊:

Answer 1 by: hek2mgl     2.698 seconds (regexp)
Answer 2 by: Emo Mosley  2.38  seconds
Answer 3 by: anubhava    3.131 seconds (regexp)
Answer 4 by: jgb         1.448 seconds

我本以为正则表达式变体将是最快的。

好吧,在任何情况下不使用正则表达式都不是一件坏事。换句话说:使用正则表达式通常不是最好的解决方案。您必须根据具体情况决定最佳解决方案。

您可以使用此脚本重复测量。


编辑

下面是一个使用正则表达式模式的简短、更优化的示例。仍然没有我上面的例子那么快,但比其他基于正则表达式的示例更快。

输出格式可以优化(空格/换行符(。

function extract_parts_regexp($str) {
  $a=array();
  preg_match_all('/(?<k>[A-Z]{2})(?<v>.*?)(?='n[A-Z]{2}|$)/Ds', $str, $a);
  return array_combine($a['k'], $a['v']);
}

代表简化、快速和可读正则表达式代码的请求!

(来自评论中的 Pr0no(您认为您可以简化正则表达式还是有关如何开始使用 php 解决方案的提示? 是的,Pr0n0,我相信我可以简化正则表达式。

我想说明的是,正则表达式是迄今为止完成这项工作的最佳工具,它不一定像我们之前看到的那样是令人恐惧和不可读的表达。我已经将此功能分解为易于理解的部分。

我避免了复杂的正则表达式功能,如捕获组和通配符表达式,并专注于尝试生成一些简单的东西,你会在 3 个月内回到这里感到舒服。

我提议的功能(已评论(

function headerSplit($input) {
    // First, let's put our headers (any two consecutive uppercase characters at the start of a line) in an array
    preg_match_all(
        "/^[A-Z]{2}/m",       /* Find 2 uppercase letters at start of a line   */
        $input,               /* In the '$input' string                        */
        $matches              /* And store them in a $matches array            */
    );
    // Next, let's split our string into an array, breaking on those headers
    $split = preg_split(
        "/^[A-Z]{2}/m",       /* Find 2 uppercase letters at start of a line   */
        $input,               /* In the '$input' string                        */
        null,                 /* No maximum limit of matches                   */
        PREG_SPLIT_NO_EMPTY   /* Don't give us an empty first element          */
    );
    // Finally, put our values into a new associative array
    $result = array();
    foreach($matches[0] as $key => $value) {
        $result[$value] = str_replace(
            "'r'n",              /* Search for a new line character            */
            " ",                 /* And replace with a space                   */
            trim($split[$key])   /* After trimming the string                  */
        );
    }
    return $result;
}

和输出(注意:根据您的操作系统,您可能需要str_replace函数中将'r'n替换为'n(:

array(5) {
  ["HD"]=> string(41) "Alcoa Earnings Soar; Outlook Stays Upbeat"
  ["BY"]=> string(35) "By James R. Hagerty and Matthew Day"
  ["PD"]=> string(12) "12 July 2011"
  ["LP"]=> string(172) "Alcoa Inc.'s profit more than doubled in the second quarter.  The giant aluminum producer managed to meet analysts' forecasts.    However, profits wereless than expected"
  ["TD"]=> string(59) "Licence this article via our website:    http://example.com"
}

删除更简洁函数的注释

此函数的精简版本。它与上面完全相同,但删除了注释:

function headerSplit($input) {
    preg_match_all("/^[A-Z]{2}/m",$input,$matches);
    $split = preg_split("/^[A-Z]{2}/m",$input,null,PREG_SPLIT_NO_EMPTY);
    $result = array();
    foreach($matches[0] as $key => $value) $result[$value] = str_replace("'r'n"," ",trim($split[$key]));
    return $result;
}

从理论上讲,在实时代码中使用哪一个并不重要,因为解析注释对性能的影响很小,因此请使用您更熟悉的注释。

此处使用的正则表达式的细分

函数中只有一个表达式(尽管使用了两次(,为简单起见,让我们将其分解:

"/^[A-Z]{2}/m"
/     - This is a delimiter, representing the start of the pattern.
^     - This means 'Match at the beginning of the text'.
[A-Z] - This means match any uppercase character.
{2}   - This means match exactly two of the previous character (so exactly two uppercase characters).
/     - This is the second delimiter, meaning the pattern is over.
m     - This is 'multi-line mode', telling regex to treat each line as a new string.

这个微小的表达式足够强大,可以匹配HD但不HDM在行首,也不HD(例如在Full HD中(在行中间。使用非正则表达式选项,您将不容易实现这一点。

如果要使用两个或多个(而不是正好 2 个(连续的大写字符来表示新节,请使用 /^[A-Z]{2,}/m

使用预定义标头列表

阅读了您的最后一个问题以及您在@jgb帖子下的评论后,您似乎想使用预定义的标题列表。您可以通过将我们的正则表达式替换为 "/^(HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)/m 来做到这一点——|在正则表达式中被视为"or"。

基准测试 - 可读并不意味着慢

不知何故,基准测试已经成为对话的一部分,尽管我认为它错过了为您提供可读和可维护解决方案的重点,但我重写了 JGB 的基准测试以向您展示一些内容。

以下是我的结果,表明这个基于正则表达式的代码是这里最快的选项(这些结果基于 5,000 次迭代(:

SWEETIE BELLE'S SOLUTION (2 UPPERCASE IS A HEADER):         0.054 seconds
SWEETIE BELLE'S SOLUTION (2+ UPPERCASE IS A HEADER):        0.057 seconds
MATEWKA'S SOLUTION (MODIFIED, 2 UPPERCASE IS A HEADER):     0.069 seconds
BABA'S SOLUTION (2 UPPERCASE IS A HEADER):                  0.075 seconds
SWEETIE BELLE'S SOLUTION (USES DEFINED LIST OF HEADERS):    0.086 seconds
JGB'S SOLUTION (USES DEFINED LIST OF HEADERS, MODIFIED):    0.107 seconds

以及输出格式不正确的解决方案的基准测试:

MATEWKA'S SOLUTION:                                         0.056 seconds
JGB'S SOLUTION:                                             0.061 seconds
HEK2MGL'S SOLUTION:                                         0.106 seconds
ANUBHAVA'S SOLUTION:                                        0.167 seconds

我之所以提供 JGB 函数的修改版本,是因为他的原始函数在将段落添加到输出数组之前不会删除换行符。小字符串操作在性能上会产生巨大差异,必须平等地进行基准测试才能获得公平的性能估计。

此外,使用 jgb 的函数,如果您传入完整的标头列表,您将在数组中获得一堆 null 值,因为它似乎不会在分配之前检查键是否存在。如果您想稍后循环访问这些值,这将导致另一个性能下降,因为您必须先检查empty

这是一个没有正则表达式的简单解决方案

$data = explode("'n", $str);
$output = array();
$key = null;
foreach($data as $text) {
    $newKey = substr($text, 0, 2);
    if (ctype_upper($newKey)) {
        $key = $newKey;
        $text = substr($text, 2);
    }
    $text = trim($text);
    isset($output[$key]) ? $output[$key] .= $text : $output[$key] = $text;
}
print_r($output);

输出

Array
(
    [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
    [BY] => By James R. Hagerty and Matthew Day
    [PD] => 12 July 2011
    [LP] => Alcoa Inc.'s profit more than doubled in the second quarter.The giant aluminum producer managed to meet analysts' forecasts.However, profits wereless than expected
    [TD] => Licence this article via our website:http://example.com
)

观看现场演示

注意

您可能还需要执行以下操作:

  • 检查重复数据
  • 确保仅使用HD|BY|PD|LP|TD
  • 删除$text = trim($text)以便新行保留在文本中

如果每个文件只有一条记录,您可以:

$record = array();
foreach(file('input.txt') as $line) {
    if(preg_match('~^(HD|BY|PD|LP|TD) ?(.*)?$~', $line, $matches)) {
        $currentKey = $matches[1];
        $record[$currentKey] = $matches[2];
    } else {
        $record[$currentKey] .= str_replace("'n", ' ', $line);
    }   
}

代码循环访问每行输入,并检查该行是否以标识符开头。如果是这样,currentKey将设置为此标识符。删除新行后,除非找到新标识符,否则所有后续内容都将添加到数组中的此键中。

var_dump($record);

输出:

array(5) {
  'HD' =>
  string(42) "Alcoa Earnings Soar; Outlook Stays Upbeat "
  'BY' =>
  string(36) "By James R. Hagerty and Matthew Day "
  'PD' =>
  string(12) "12 July 2011"
  'LP' =>
  string(169) " Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts.  However, profits wereless than expected  "
  'TD' =>
  string(58) "Licence this article via our website:  http://example.com "
}

注意:如果每个文件有多个记录,则可以优化解析器以返回多维数组:

$records = array();
foreach(file('input.txt') as $line) {
    if(preg_match('~^(HD|BY|PD|LP|TD) ?(.*)?$~', $line, $matches)) {
        $currentKey = $matches[1];
        // start a new record if `HD` was found.
        if($currentKey === 'HD') {
            if(is_array($record)) {
                $records []= $record;
            }
            $record = array();
        }
        $record[$currentKey] = $matches[2];
    } else {
        $record[$currentKey] .= str_replace("'n", ' ', $line);
    }   
}

但是,数据格式本身对我来说看起来很脆弱。如果 LP 看起来像这样:

LP dfks ldsfjksdjlf
lkdsjflk dsfjksld..
HD defsdf sdf sd....

你看,在我的例子中,LP的数据中有一个HD。为了保持数据的可解析性,您必须避免这种情况。

更新 :

鉴于发布的示例输入文件和代码,我已经更改了我的答案。 我添加了 OP 提供的"部分",用于定义部分代码并使函数能够处理 2 位或更多位数的代码。下面是一个非正则表达式过程函数,应该会产生所需的结果:

# Parses the given text file and populates an array with coded sections.
# INPUT:
#   filename = (string) path and filename to text file to parse
# RETURNS: (assoc array)
#   null is returned if there was a file error or no data was found
#   otherwise an associated array of the field sections is returned
function getSections($parts, $lines) {
   $sections = array();
   $code = "";
   $str = "";
   # examine each line to build section array
   for($i=0; $i<sizeof($lines); $i++) {
      $line = trim($lines[$i]);
      # check for special field codes
      $words = explode(' ', $line, 2);
      $left = $words[0];
      #echo "DEBUG: left[$left]'n";
      if(in_array($left, $parts)) {
         # field code detected; first, finish previous section, if exists
         if($code) {
            # store the previous section
            $sections[$code] = trim($str);
         }
         # begin to process new section
         $code = $left;
         $str = trim(substr($line, strlen($code)));
      } else if($code && $line) {
         # keep a running string of section content
         $str .= " ".$line;
      }
   } # for i
   # check for no data
   if(!$code)
      return(null);
   # store the last section and return results
   $sections[$code] = trim($str);
   return($sections);
} # getSections()

$parts = array('HD', 'BY', 'WC', 'PD', 'SN', 'SC', 'PG', 'LA', 'CY', 'LP', 'TD', 'CO', 'IN', 'NS', 'RE', 'IPC', 'PUB', 'AN');
$datafile = $argv[1]; # NOTE: I happen to be testing this from command-line
# load file as array of lines
$lines = file($datafile);
if($lines === false)
   die("ERROR: unable to open file ".$datafile."'n");
$data = getSections($parts, $lines);
echo "Results from ".$datafile.":'n";
if($data)
   print_r($data);
else
   echo "ERROR: no data detected in ".$datafile."'n";

结果:

Array
(   
    [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
    [BY] => By James R. Hagerty and Matthew Day
    [PD] => 12 July 2011
    [LP] => Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts. However, profits wereless than expected
    [TD] => Licence this article via our website: http://example.com
)

这是一个问题,我认为使用正则表达式应该不是考虑解析传入数据的规则的问题。考虑这样的代码:

$s = file_get_contents('input'); // read input file into a string
$match = array(); // will hold final output
if (preg_match_all('~(^|[A-Z]{2})'s(.*?)(?=[A-Z]{2}'s|$)~s', $s, $arr)) {
    for ( $i = 0; $i < count($arr[1]); $i++ )
       $match[ trim($arr[1][$i]) ] = str_replace( "'n", "", $arr[2][$i] );
}
print_r($match);

如您所见,由于使用preg_match_all来匹配输入文件中的数据的方式,代码变得多么紧凑。

输出:

Array
(
    [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat 
    [BY] => By James R. Hagerty and Matthew Day 
    [PD] => 12 July 2011
    [LP] => Alcoa Inc.'s profit more than doubled in the second quarter.The giant aluminum producer managed to meet analysts' forecasts.However, profits wereless than expected
    [TD] => Licence this article via our website:http://example.com
)

根本不循环。这个怎么样(假设每个文件一条记录(?

$inrec = file_get_contents('input');
$inrec = str_replace( "'n'", "'", str_replace( array( 'HD ', 'BY ', 'PD ', 'LP', 'TD' ), array( "'HD' => '", "','BY' => '", "','PD' => '", "','LP' => '", "','TD' => '" ), str_replace( "'", "'''", $inrec ) ) )."'";
eval( '$record = array('.$inrec.');' );
var_export($record);

结果:

array (
  'HD' => 'Alcoa Earnings Soar; Outlook Stays Upbeat ',
  'BY' => 'By James R. Hagerty and Matthew Day ',
  'PD' => '12 July 2011',
  'LP' => ' 
Alcoa Inc.''s profit more than doubled in the second quarter.
The giant aluminum producer managed to meet analysts'' forecasts.
However, profits wereless than expected
',
  'TD' => '
Licence this article via our website:
http://example.com',
)

如果每个文件可以有多个记录,请尝试以下操作:

$inrecs = explode( 'HD ', file_get_contents('input') );
$records = array();
foreach ( $inrecs as $inrec ) {
   $inrec = str_replace( "'n'", "'", str_replace( array( 'HD ', 'BY ', 'PD ', 'LP', 'TD' ), array( "'HD' => '", "','BY' => '", "','PD' => '", "','LP' => '", "','TD' => '" ), str_replace( "'", "'''", 'HD ' . $inrec ) ) )."'";
   eval( '$records[] = array('.$inrec.');' );
}
var_export($records);

编辑

这是一个将$inrec函数拆分出来的版本,因此可以更容易理解 - 并进行了一些调整:去除换行符,修剪前导和尾随空格,并解决EVAL中的反斜杠问题,以防数据来自不受信任的来源。

$inrec = file_get_contents('input');
$inrec = str_replace( '''', '''''', $inrec );       // Preceed all backslashes with backslashes
$inrec = str_replace( "'", "'''", $inrec );         // Precede all single quotes with backslashes
$inrec = str_replace( PHP_EOL, " ", $inrec );       // Replace all new lines with spaces
$inrec = str_replace( array( 'HD ', 'BY ', 'PD ', 'LP ', 'TD ' ), array( "'HD' => trim('", "'),'BY' => trim('", "'),'PD' => trim('", "'),'LP' => trim('", "'),'TD' => trim('" ), $inrec )."')";
eval( '$record = array('.$inrec.');' );
var_export($record);

结果:

array (
  'HD' => 'Alcoa Earnings Soar; Outlook Stays Upbeat',
  'BY' => 'By James R. Hagerty and Matthew Day',
  'PD' => '12 July 2011',
  'LP' => 'Alcoa Inc.''s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts'' forecasts.  However, profits wereless than expected',
  'TD' => 'Licence this article via our website:  http://example.com',
)

更新

我突然意识到,在多记录场景中,在记录循环之外构建$repl会表现得更好。这是 2 字节关键字版本:

$inrecs = file_get_contents('input');
$inrecs = str_replace( PHP_EOL, " ", $inrecs );
$keys  = array( 'HD', 'BY', 'PD', 'LP', 'TD' );
$split = chr(255);
$repl = explode( ',', $split . implode( ','.$split, $keys ) );
$inrecs = explode( 'HD ', $inrecs );
array_shift( $inrecs );
$records = array();
foreach( $inrecs as $inrec ) $records[] = parseRecord( $keys, $repl, 'HD '.$inrec );
function parseRecord( $keys, $repl, $rec ) {
    $split = chr(255);
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) $out[ substr( $line, 0, 2 ) ] = trim( substr( $line, 3 ) );
    return $out;
}

基准测试(感谢@jgb(:

Answer 1 by: hek2mgl     6.783 seconds (regexp)
Answer 2 by: Emo Mosley  4.738 seconds
Answer 3 by: anubhava    6.299 seconds (regexp)
Answer 4 by: jgb         2.47 seconds
Answer 5 by: gwc         3.589 seconds (eval)
Answer 6 by: gwc         1.871 seconds

这是多个输入记录(假设每个记录以"HD"开头(并支持 2 字节、2 或 3 字节或可变长度关键字的另一个答案。

$inrecs = file_get_contents('input');
$inrecs = str_replace( PHP_EOL, " ", $inrecs );
$keys  = array( 'HD', 'BY', 'PD', 'LP', 'TD' );
$inrecs = explode( 'HD ', $inrecs );
array_shift( $inrecs );
$records = array();
foreach( $inrecs as $inrec ) $records[] = parseRecord( $keys, 'HD '.$inrec );

使用 2 字节关键字解析记录:

function parseRecord( $keys, $rec ) {
    $split = chr(255);
    $repl = explode( ',', $split . implode( ','.$split, $keys ) );
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) $out[ substr( $line, 0, 2 ) ] = trim( substr( $line, 3 ) );
    return $out;
}

使用 2 或 3 字节关键字解析记录(假设键和内容之间有空格或PHP_EOL(:

function parseRecord( $keys, $rec ) {
    $split = chr(255);
    $repl = explode( ',', $split . implode( ','.$split, $keys ) );
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) $out[ trim( substr( $line, 0, 3 ) ) ] = trim( substr( $line, 3 ) );
    return $out;
}

使用可变长度关键字解析记录(假设键和内容之间有空间或PHP_EOL(:

function parseRecord( $keys, $rec ) {
    $split = chr(255);
    $repl = explode( ',', $split . implode( ','.$split, $keys ) );
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) {
        $keylen = strpos( $line.' ', ' ' );
        $out[ trim( substr( $line, 0, $keylen ) ) ] = trim( substr( $line, $keylen+1 ) );
    }
    return $out;
}

期望上面的每个parseRecord函数的性能会比其前身差一点。

结果:

Array
(
    [0] => Array
        (
            [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
            [BY] => By James R. Hagerty and Matthew Day
            [PD] => 12 July 2011
            [LP] => Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts.  However, profits wereless than expected
            [TD] => Licence this article via our website:  http://example.com
        )
)

我准备了自己的解决方案,结果比 jgb 的答案略快。代码如下:

function answer_5(array $parts, $str) {
    $result = array_fill_keys($parts, '');
    $poss = $result;
    foreach($poss as $key => &$val) {
        $val = strpos($str, "'n" . $key) + 2;
    }
    arsort($poss);
    foreach($poss as $key => $pos) {
        $result[$key] = trim(substr($str, $pos+1));
        $str = substr($str, 0, $pos-1);
    }
    return str_replace("'n", "", $result);
}

以下是性能的比较:

Answer 1 by: hek2mgl    2.791 seconds (regexp) 
Answer 2 by: Emo Mosley 2.553 seconds 
Answer 3 by: anubhava   3.087 seconds (regexp) 
Answer 4 by: jgb        1.53  seconds 
Answer 5 by: matewka    1.403 seconds

测试环境与 jgb 相同(100000 次迭代 - 从这里借用的脚本(。

享受并请发表评论。