Converting badly formatted txt to csv

I have a badly formatted text file that I want to convert to csv.

Here is an example:

100910 NA/1-2013-99636 VIA DEI PESCATORI 2/A LODI APR 8 2013 4:24PM DANNEGGIATO -10% 200 2700 0 0 NO
148013 NA/1-2014-146194 CAVALLOTTI SNC LODI GEN 3 2014 3:37PM DANNEGGIATO -10% 0 0 2 0 NO
160032 NA/1-2014-158129 PAOLO GORINI SNC LODI MAG 6 2014 11:51AM DANNEGGIATO -10% 2 0 2 0 NO
54900 NA/1-2014-158070 STRADA VECCHIA CREMONESE SNC LODI MAG 6 2014 9:53AM DANNEGGIATO +10% 10 0 10 0 NO
100910 NA/1-2013-99636 VIA DEI PESCATORI 2/A LODI APR 8 2013 4:24PM DANNEGGIATO -10% 200 2700 0 0 NO
147959 NA/1-2014-146140 DOSSENA SNC LODI GEN 3 2014 10:45AM DANNEGGIATO -10% 200 0 200 0 NO

It is roughly of this form:

[number] [id] [awfully formatted street] ['LODI'] [timestamp] [damaged or not] [percentage] [squaremeters] [squaremeters] [squaremeters] [squaremeters] [asbest-crumbled or not]

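My problem is how to extract the third part, [awfully formatted street]. Basically, it is the string after [id] and before the string ['LODI'] (but that ['LODI'] must come directly before the [timestamp]).

Should I explode() each line on spaces, then walk the array backwards past the [timestamp] and past ['LODI'], and concatenate the values back down to the [id], which is array[1]? Or is there a smarter (more elegant) way to do this, maybe with preg_match()?

Thanks for any hints!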

<?php
    // Read the file line by line; a single sample line is used here.
    $line = '148013 NA/1-2014-146194 CAVALLOTTI SNC LODI GEN 3 2014 3:37PM DANNEGGIATO -10% 0 0 2 0 NO';
    // Start by separating the string on LODI.
    // Note: this assumes 'LODI' does not also occur inside the street name.
    $lodi_split = explode('LODI', $line);
    // Now split the first part into an array on spaces.
    $bits = explode(' ', $lodi_split[0]);
    $address = '';
    // Start reading from index 2 to lose the first 2 fields.
    for ($i = 2; $i < count($bits); $i++) {
        $address .= $bits[$i] . ' ';
    }
    // Trim the trailing spaces left over from the loop.
    echo trim($address) . PHP_EOL;

The result is

CAVALLOTTI SNC
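
The question requires that the ['LODI'] used as the boundary sits directly before the [timestamp], which a plain split on 'LODI' does not check. A minimal sketch of that constraint with preg_match() follows. The function name extractStreet() is hypothetical, and the timestamp shape (three-letter month, day, year, time with AM/PM) is an assumption read off the sample rows.

```php
<?php
// Hypothetical helper: returns the street part of one row, or null if the
// row does not match the assumed layout.
function extractStreet(string $line): ?string
{
    // [number] [id], then a lazily captured street, then LODI directly
    // followed by a timestamp of the assumed form "MON D YYYY H:MMAM/PM".
    $pattern = '/^\d+\s+\S+\s+(.+?)\s+LODI\s+[A-Z]{3}\s+\d{1,2}\s+\d{4}\s+\d{1,2}:\d{2}[AP]M\b/';

    return preg_match($pattern, $line, $m) === 1 ? $m[1] : null;
}

// Prints "CAVALLOTTI SNC" for the sample row used above.
echo extractStreet('148013 NA/1-2014-146194 CAVALLOTTI SNC LODI GEN 3 2014 3:37PM DANNEGGIATO -10% 0 0 2 0 NO'), PHP_EOL;
```

Because the street is captured lazily and LODI must be followed by a timestamp, a street that itself contains the word LODI should still be captured whole.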

This should extract the address from a single line.

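<?php
$row = "100910 NA/1-2013-99636 VIA DEI PESCATORI 2/A LODI APR 8 2013 4:24PM DANNEGGIATO -10% 200 2700 0 0 NO";
// Split the row on runs of whitespace.
$row_array = preg_split('/\s+/', $row);

// Drop the first two fields, [number] and [id].
array_shift($row_array);
array_shift($row_array);
// Drop the last 12 tokens: LODI, the 4-token timestamp and the 7 fields after it.
for ($i = 0; $i < 12; $i++) {
    array_pop($row_array);
}
// Whatever is left is the street.
$address = implode(" ", $row_array);
?>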
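
The same positional idea can be sketched with array_slice() so that every field of the row is kept, not only the address. The function name parseRow() is hypothetical, and the token counts (a one-token city, a four-token timestamp, seven single-token fields after it) are assumptions taken from the sample rows.

```php
<?php
// Hypothetical helper: splits one row into the 12 fields of the layout
// described in the question, using token positions only.
function parseRow(string $row): array
{
    $tokens = preg_split('/\s+/', trim($row));
    if (count($tokens) < 15) {
        // 2 leading tokens + at least 1 street token + 12 trailing tokens
        throw new InvalidArgumentException('Unexpected row layout: ' . $row);
    }

    $tail   = array_slice($tokens, -12);    // LODI ... NO
    $street = array_slice($tokens, 2, -12); // everything in between

    return array_merge(
        [$tokens[0], $tokens[1], implode(' ', $street), $tail[0]],
        [implode(' ', array_slice($tail, 1, 4))], // timestamp as one field
        array_slice($tail, 5)                     // the 7 remaining fields
    );
}

print_r(parseRow('100910 NA/1-2013-99636 VIA DEI PESCATORI 2/A LODI APR 8 2013 4:24PM DANNEGGIATO -10% 200 2700 0 0 NO'));
```

This only works while every row has exactly 12 tokens after the street; a city name with two words, for example, would shift all fields.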

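I don't think explode() will do the job here. I suggest using a regexp. For example, if you read the .txt file into a single string (where the data rows are separated by newlines):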

$f = fopen($fname = "file.txt", "rt");
$str = fread($f, filesize($fname));
fclose($f);

Then use preg_match_all() like this:

// [+-]? accepts both -10% and +10%, since the sample data contains both signs.
$re = '/^(\d+)\s*(.*)(LODI)\s*(.+(?:AM|PM))\s*(\w+)\s+([+-]?\d{1,3}%)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\w+)$/m';
preg_match_all($re, $str, $matches, PREG_SET_ORDER);
echo "<pre>\n";
print_r($matches);
echo "</pre>\n";

The output looks like this:

Array
(
    [0] => Array
        (
            [0] => 100910 NA/1-2013-99636 VIA DEI PESCATORI 2/A LODI APR 8 2013 4:24PM DANNEGGIATO -10% 200 2700 0 0 NO
            [1] => 100910
            [2] => NA/1-2013-99636 VIA DEI PESCATORI 2/A 
            [3] => LODI
            [4] => APR 8 2013 4:24PM
            [5] => DANNEGGIATO
            [6] => -10%
            [7] => 200
            [8] => 2700
            [9] => 0
            [10] => 0
            [11] => NO
        )
    [1] => Array
        (
            [0] => 148013 NA/1-2014-146194 CAVALLOTTI SNC LODI GEN 3 2014 3:37PM DANNEGGIATO -10% 0 0 2 0 NO
            [1] => 148013
            [2] => NA/1-2014-146194 CAVALLOTTI SNC 
            [3] => LODI
            [4] => GEN 3 2014 3:37PM
            [5] => DANNEGGIATO
            [6] => -10%
            [7] => 0
            [8] => 0
            [9] => 2
            [10] => 0
            [11] => NO
        )
..........// And so on

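I used the text you provided above in this example. In the output you get the data formatted as a list of arrays, so you can do whatever you want with it. $matches[$i][0] holds the whole match, so skip it and use $matches[$i][1] through $matches[$i][11] as your data. Note that group 2 still contains the [id] and the street together.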
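
To get from the matches to actual csv, one option is fputcsv(). The sketch below is not the pattern from the answer verbatim: it is a variant that captures [id] and the street as separate groups and tolerates a trailing carriage return. The function name rowsToCsv() is hypothetical, and php://memory is used only so the example is self-contained; a real script would probably open an output file instead.

```php
<?php
// Hypothetical helper: converts the raw text into csv text, one csv line
// per row that matches the assumed layout.
function rowsToCsv(string $str): string
{
    // Variant of the pattern above with [id] and street in separate groups.
    $re = '/^(\d+)\s+(\S+)\s+(.+)\s+(LODI)\s+(.+(?:AM|PM))\s+(\w+)\s+([+-]?\d{1,3}%)'
        . '\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\w+)\s*$/m';
    preg_match_all($re, $str, $matches, PREG_SET_ORDER);

    $out = fopen('php://memory', 'w+');
    foreach ($matches as $match) {
        array_shift($match); // drop the whole-match entry at index 0
        fputcsv($out, $match, ',', '"', '\\');
    }
    rewind($out);
    $csv = stream_get_contents($out);
    fclose($out);

    return $csv;
}

echo rowsToCsv(
    "100910 NA/1-2013-99636 VIA DEI PESCATORI 2/A LODI APR 8 2013 4:24PM DANNEGGIATO -10% 200 2700 0 0 NO\n"
    . "54900 NA/1-2014-158070 STRADA VECCHIA CREMONESE SNC LODI MAG 6 2014 9:53AM DANNEGGIATO +10% 10 0 10 0 NO\n"
);
```

Rows that do not match the pattern are silently skipped, so comparing the number of csv lines with the number of input lines is a cheap sanity check.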