PHP - Best approach to detect CSV delimiter

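I have seen multiple threads discussing the best way to auto-detect the delimiter of an incoming CSV. Most of them are functions 20 to 30 lines long that loop several times over a predetermined list of delimiters, read the first 5 lines, compare the counts, and so on.

Here is one example.

I have just implemented this procedure, with a few modifications. It works brilliantly.

Then I found the following code: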

private function DetectDelimiter($fh)
{
    $data_1 = null;
    $data_2 = null;
    $delimiter = self::$delim_list['comma'];
    foreach(self::$delim_list as $key=>$value)
    {
        $data_1 = fgetcsv($fh, 4096, $value);
        $delimiter = sizeof($data_1) > sizeof($data_2) ? $key : $delimiter;
        $data_2 = $data_1;
    }
    $this->SetDelimiter($delimiter);
    return $delimiter;
}

As far as I can tell, this achieves the same result, where $delim_list is an array of delimiters as follows:

static protected $delim_list = array('tab'=>"\t", 
                                     'semicolon'=>";", 
                                     'pipe'=>"|", 
                                     'comma'=>",");

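Can anyone explain why I shouldn't do it this simpler way? And why does the more complex solution seem to be the accepted answer?

Thanks!

This function is elegant :)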

/**
* @param string $csvFile Path to the CSV file
* @return string Delimiter
*/
public function detectDelimiter($csvFile)
{
    $delimiters = [";" => 0, "," => 0, "\t" => 0, "|" => 0];
    $handle = fopen($csvFile, "r");
    $firstLine = fgets($handle);
    fclose($handle); 
    foreach ($delimiters as $delimiter => &$count) {
        $count = count(str_getcsv($firstLine, $delimiter));
    }
    return array_search(max($delimiters), $delimiters);
}

None of these worked for my use case, so I made some small modifications.

   /**
    * @param string $filePath
    * @param int $checkLines
    * @return string
    */
   public function getCsvDelimiter(string $filePath, int $checkLines = 3): string
   {
       $delimiters = [",", ";", "\t"];
       $default = ",";
       $fileObject = new \SplFileObject($filePath);
       $results = [];
       $counter = 0;
       // Use "<" so that exactly $checkLines lines are inspected ("<=" reads one extra)
       while ($fileObject->valid() && $counter < $checkLines) {
           $line = $fileObject->fgets();
           foreach ($delimiters as $delimiter) {
               $fields = explode($delimiter, $line);
               $totalFields = count($fields);
               if ($totalFields > 1) {
                   if (!empty($results[$delimiter])) {
                       $results[$delimiter] += $totalFields;
                   } else {
                       $results[$delimiter] = $totalFields;
                   }
               }
           }
           $counter++;
       }
       if (!empty($results)) {
           $results = array_keys($results, max($results));
           return $results[0];
       }
       return $default;
   }

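Fixed version.

In your code, if a line contains more than one kind of delimiter, you get the wrong result (example: val; string, with comma; val2; val3). The same happens if the file has only 1 line (number of lines < number of delimiters).

Here is a fixed variant: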

private function detectDelimiter($fh)
{
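    $delimiters = ["\t", ";", "|", ","];
    $data_1 = null; $data_2 = [];
    $delimiter = $delimiters[0];
    foreach($delimiters as $d) {
        $data_1 = fgetcsv($fh, 4096, $d);
        // fgetcsv() returns false on EOF or error, and sizeof() needs an array in PHP 8
        if(is_array($data_1) && sizeof($data_1) > sizeof($data_2)) {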
            $delimiter = $d;
            $data_2 = $data_1;
        }
        rewind($fh);
    }
    return $delimiter;
}

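In general, you cannot detect the delimiter of a text file. If you have additional hints, you need to build them into your detection to be safe.

One particular problem with the suggested approach is that it counts the number of elements in different lines of the file. Suppose you have a file like this: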

a;b;c;d
a   b;  c   d
this|that;here|there
It's not ready, yet.; We have to wait for peter, paul, and mary.; They will know what to do

Although this appears to be separated by semicolons, your approach will return comma.
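One way to avoid comparing different lines is to test every candidate against the same set of lines. The sketch below is only an illustration, not code from any answer here: the function name guessDelimiter and the scoring rule (keep the candidate whose smallest per-line field count is highest) are my own assumptions, and the second sample line assumes the gaps in the file above are tabs.

```php
<?php
/**
 * Hypothetical helper: score each candidate on the SAME lines and keep
 * the one whose smallest per-line field count is the highest.
 * Returns null when no candidate splits every line into 2+ fields.
 */
function guessDelimiter(array $lines, array $candidates = [',', ';', "\t", '|']): ?string
{
    $best = null;
    $bestScore = 1; // a score of 1 means "at least one line was not split"
    foreach ($candidates as $candidate) {
        $counts = [];
        foreach ($lines as $line) {
            // str_getcsv() honours quoted fields, unlike explode()
            $counts[] = count(str_getcsv($line, $candidate, '"', '\\'));
        }
        $score = $counts === [] ? 0 : min($counts);
        if ($score > $bestScore) {
            $bestScore = $score;
            $best = $candidate;
        }
    }
    return $best;
}

$sample = [
    'a;b;c;d',
    "a\tb;\tc\td",
    'this|that;here|there',
    "It's not ready, yet.; We have to wait for peter, paul, and mary.; They will know what to do",
];
echo guessDelimiter($sample), PHP_EOL; // prints ";"
```

On the sample, the semicolon is the only candidate that splits every line, so it wins even though the last line contains more commas than semicolons. This is still a heuristic; on ties the first candidate in the list is kept.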

Another one (built by combining many answers I found on the internet):

/**
 * Detects the delimiter of a CSV file (can be semicolon, comma or pipe) by trying every delimiter, then
 * counting how many potential columns could be found with this delimiter and removing the delimiter from the array if
 * only one column could be created (without a working delimiter you'll always have "one" column: the entire row).
 * The delimiter that created the most columns is returned.
 *
 * @param string $pathToCSVFile path to the CSV file
 * @return string|null nullable delimiter
 * @throws \Exception
 */
public static function detectDelimiter(string $pathToCSVFile): ?string
{
    $delimiters = [
        ';' => 0,
        ',' => 0,
        "|" => 0,
    ];
    $handle = fopen($pathToCSVFile, 'r');
    $firstLine = fgets($handle);
    fclose($handle);
    foreach ($delimiters as $delimiterCharacter => $delimiterCount) {
        $foundColumnsWithThisDelimiter = count(str_getcsv($firstLine, $delimiterCharacter));
        if ($foundColumnsWithThisDelimiter > 1) {
            $delimiters[$delimiterCharacter] = $foundColumnsWithThisDelimiter;
        }else {
            unset($delimiters[$delimiterCharacter]);
        }
    }
    if (!empty($delimiters)) {
        return array_search(max($delimiters), $delimiters);
    } else {
        throw new \Exception('The CSV delimiter could not been found. Should be semicolon, comma or pipe!');
    }
}

And the corresponding unit test (you have to add your own test CSV files):

/**
 * Test the delimiter detector
 *
 * @test
 */
public function testDetectDelimiter()
{
    $this->assertEquals(',', Helper::detectDelimiter('test1.csv'));
    $this->assertEquals(';', Helper::detectDelimiter('test-csv-with-semicolon-delimiter.csv'));
    $this->assertEquals('|', Helper::detectDelimiter('test-csv-with-pipe-delimiter.csv'));
    $this->expectExceptionMessage('The CSV delimiter could not been found. Should be semicolon, comma or pipe!');
    Helper::detectDelimiter('test-csv-with-failing-delimiter.csv');
}

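OK, this one parses a single line of your CSV (usually you take the first one) and throws an Exception if more than one delimiter is possible or if none matches. It also makes sure that the delimiter you are testing for is not inside a quoted string and is not escaped.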

    public function getDelimiter(string $content, $throwExceptionOnNonUnique = true, $expectSingleColumn = false): string
    {
        // Would be cleaner if you pass the delimiters from outside
        // as also the order matters in the special case you've got something like "a,b;c"
        // and you don't throw the exception - then the first match is preferred
        // But for StackOverflow I put them inside
        $delimiters = ["\t", ";", "|", ","];
        $result = ',';
        $maxCount = 0;
        foreach ($delimiters as $delimiter) {
            // Impress your code reviewer by some badass regex ;)
            $pattern = "/(?<!\\\)(?:\\\\\\\)*(?!\B\"[^\\\"]*)\\" . $delimiter . "(?![^\"]*\\\"\B)/";
            $amount = preg_match_all($pattern, $content);
            if ($maxCount > 0 && $amount > 0 && $throwExceptionOnNonUnique) {
                $msg = 'Identifier is not clear: "' . $result . '" and "' . $delimiter . '" are possible';
                throw new \Exception($msg);
            }
            if ($amount > $maxCount) {
                $maxCount = $amount;
                $result = $delimiter;
            }
        }
        // If nothing matches and you don't expect the CSV to
        // consist of just a single column without a delimiter at the end
        if ($maxCount === 0 && !$expectSingleColumn) {
            throw new \Exception('Unknown delimiter');
        }
        return $result;
    }

P.S.: It is unit tested as well, but I don't want to paste 100+ lines of tests here ;)
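To show why the quoted-string check matters, here is a small illustration that is not part of the answer's code. The sample line is made up; it compares a naive explode() with the quote-aware str_getcsv() on a field that contains the delimiter inside quotes.

```php
<?php
// Made-up sample line: the last field contains a semicolon inside quotes.
$line = '"Smith, John";42;"a;b"';

// explode() splits on every semicolon, including the quoted one.
$naiveCount = count(explode(';', $line));

// str_getcsv() treats the quoted semicolon as part of the field.
$quoteAwareCount = count(str_getcsv($line, ';', '"', '\\'));

echo $naiveCount, ' vs ', $quoteAwareCount, PHP_EOL; // prints "4 vs 3"
```

Any detection that counts raw delimiter characters will overcount on lines like this, which is presumably what the lookarounds in the regex are meant to guard against.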

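This is the shortest version for detecting the CSV delimiter, using the SplFileObject class, its getCsvControl method, and array dereferencing.

There is a catch, though: the function below only works if the delimiter was set manually with setCsvControl(). Otherwise, use one of the top-rated answers.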

// SplFileObject::getCsvControl — Get the delimiter, enclosure and escape character for CSV
function detectDelimiter($csvFile){
    if(!file_exists($csvFile) || !is_readable($csvFile)){
        return false;
    }   
    $file = new SplFileObject($csvFile);
    return $file->getCsvControl()[0]; 
}
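The catch mentioned above can be seen in a short sketch. It uses SplTempFileObject (an in-memory subclass of SplFileObject) so that no file on disk is needed; the sample data is made up.

```php
<?php
// In-memory file holding semicolon-separated data.
$file = new SplTempFileObject();
$file->fwrite("a;b;c\n1;2;3\n");
$file->rewind();

// getCsvControl() reports the configured control characters,
// not anything detected from the content: the default is ",".
$before = $file->getCsvControl()[0];

// Only after setting the delimiter by hand does it report ";".
$file->setCsvControl(';', '"', '\\');
$after = $file->getCsvControl()[0];

echo $before, ' ', $after, PHP_EOL; // prints ", ;"
```

In other words, getCsvControl() returns whatever was configured, so on its own it cannot tell you what the file actually uses.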