从 HTM 中提取表数据


extract table data from htm

我需要从html表中提取数据并将其保存在csv文件中。有没有简单的方法可以在 bash 或 php 中获取表属性中的所有信息?

这是代码

<html><head> <link rel=STYLESHEET href="/XPIcons/style.css" type="text/css">
<title>Control 20 November 2014</title>
</head>
<body 
>
<table width="100%" cellapdding="0" cellspacing="0">
<tr><td WIDTH="100%" class="username">xxxx<br><font  color=#A4A6A0>IDIOMABASE</font>    </td>
<td>&nbsp;</td><td bgcolor=#FFFFFF rowspan="3" align="right"><img src="/XPIcons/logo.jpg"></td></tr>
<tr><td width="100%" align="top" class="guio"><img src="/XPIcons/guion_verde.jpg"></td></tr>
<tr><td width="100%" align="top" class="title">CONTROL 20 NOVEMBER 2014
<br><font color=#B7D30C size="1px"><SCRIPT LANGUAGE="JavaScript" SRC="/XPIcons /calendar.js"></SCRIPT></font>
</td></tr>
</table>
<P>
<script language="JavaScript">
function doNothing(){
        }
function ShowData(){
        var obj = "QUALCTRL.ShowDataTD?p_date="+p_date.value;
        location.href=obj;
    }
</script>
<script language="JavaScript">
function test(formu)
            {
             error=formu.p_date.value==""?"ErrorDate'n":"";
             if  (error != "")
                alert(error);
             else
                formu.submit();
            }
</script>
<table>
<td>Fecha</td><td>
<input type="text" id="p_date" name="p_date" value="20/11/2014"  onblur="Compruebap_fecha(formu.p_date);">
<A HREF="javascript:doNothing()" onClick="var obj=document.getElementById('p_date'); setDateField(obj);top.newWin=window.open('/XPIcons/calendar.html', 'cal', 'dependent=yes, resizable=yes, width=210, height=230, screenX=200, screenY=300, titlebar=no')">
<IMG SRC="/XPIcons/calendar.gif" BORDER=0></A><font size=1>Ver calendario</font>
(dd/mm/YYYY)
</td></table>
<input type="button" class="btn" onClick="javascript:RecarregaPlana();" value="Cambia de Dia >>">
<P>
<table border="1%">
<tr><td class="fila_blanca">Población</td>
<td class="fila_mesgris">MAX</td>
<td class="fila_mesgris">MIN</td>
<td class="fila_menysgris">MASS MAX</td>
<td class="fila_menysgris">MASS MIN</td>
<td class="fila_mesgris">MERGE MAX</td>
<td class="fila_mesgris">MERGE MIN</td>
<td class="fila_menysgris">MOS MAX</td>
<td class="fila_menysgris">MOS MIN</td>
<td class="fila_blanca">DIF MAX</td>
<td class="fila_blanca">DIF MIN</td>
<td class="fila_blanca">DIF MAX MERGE</td>
<td class="fila_blanca">DIF MIN MERGE</td>
<td class="fila_blanca">DIF MAX MOS</td>
<td class="fila_blanca">DIF MIN MOS</td>
</tr>
<tr>
<td class="fila_blanca">Palermo</td>
<td class="fila_mesgris">20</td>
<td class="fila_mesgris">11</td>
<td class="fila_menysgris">21</td>
<td class="fila_menysgris">10</td>
<td class="fila_mesgris">20</td>
<td class="fila_mesgris">17</td>
<td class="fila_menysgris">20</td>
<td class="fila_menysgris">9</td>
<td class="fila_blanca">-1</td>
<td class="fila_blanca">1</td>
<td class="fila_mesgris">0</td>
<td class="fila_mesgris">-6</td>
<td class="fila_menysgris">0</td>
<td class="fila_menysgris">2</td>
</tr>
<tr>
<td class="fila_blanca">Bergamo</td>
<td class="fila_mesgris"></td>
<td class="fila_mesgris"></td>
<td class="fila_menysgris">16</td>
<td class="fila_menysgris">7</td>
<td class="fila_mesgris">17</td>
<td class="fila_mesgris">7</td>
<td class="fila_menysgris">17</td>
<td class="fila_menysgris">7</td>
<td class="fila_blanca"></td>
<td class="fila_blanca"></td>
<td class="fila_mesgris"></td>
<td class="fila_mesgris"></td>
<td class="fila_menysgris"></td>
<td class="fila_menysgris"></td>
</tr>
<tr>
<td class="fila_blanca">Rome</td>
<td class="fila_mesgris"></td>
<td class="fila_mesgris"></td>
<td class="fila_menysgris">19</td>
<td class="fila_menysgris">16</td>
<td class="fila_mesgris">19</td>
<td class="fila_mesgris">14</td>
<td class="fila_menysgris">19</td>
<td class="fila_menysgris">14</td>
<td class="fila_blanca"></td>
<td class="fila_blanca"></td>
<td class="fila_mesgris"></td>
<td class="fila_mesgris"></td>
<td class="fila_menysgris"></td>
<td class="fila_menysgris"></td>
</tr>
</table>
<SCRIPT>
              function openSearch() {
                 window.open('XPSearch.Search', 'XPSearch', 'scrollbars=yes,resizable=yes,toolbar=yes,location=yes,status=yes,width=550,height=500,screenX=550,screenY=500');                   }
              function doNothing() {
              }
</SCRIPT>
<P>
<table width="100%" cellspacing="0">
<tr><td class="pie" width="100%"><a href="XPMenuPrincipal.menu"><b>Menú principal</b></a> </td><td bgcolor=#FCFCFA><a href="javascript:doNothing()" onClick="javascript:openSearch()"><img border="0" src="/XPIcons/search.jpg" ></a></td>   <td><marquee hspace=147></marquee></td></table>
</body></html>

我想得到这样的csv:

Población,MAX,MIN,MASS MAX,MASS MIN,MERGE MAX,MERGE MIN,MOS MAX,MOS MIN,DIF MAX,DIF MIN ,DIF MAX MERGE,DIF MIN MERGE,DIF MAX MOS,DIF MIN MOS
Palermo,20,11,21,10,20,17,20,9,-1,1,0,-6,0,2
Bergamo,,,16,7,17,7,17,7,,,,
Rome,,,19,16,19,14,19,14,,,,        

这可能是一种方法:

awk -F'">|<' -v OFS="," 
     'NF>3{if (r) {r=r OFS $3} else r=$3}
      /tr/ {print r; r=""}' file

对于您的示例输入:

$ awk -F'">|<' -v OFS="," 'NF>3{if (r) {r=r OFS $3} else r=$3} /tr/ {print r; r=""}' a
td class="fila_blanca
MAX,MIN,MASS MAX,MASS MIN,MERGE MAX,MERGE MIN,MOS MAX,MOS MIN,DIF MAX,DIF MIN,DIF MAX MERGE,DIF MIN MERGE,DIF MAX MOS,DIF MIN MOS
Palermo,20,11,21,10,20,17,20,9,-1,1,0,-6,0,2
Bergamo,,,16,7,17,7,17,7,,,,,,
Rome,,,19,16,19,14,19,14,,,,,,

解释

  • -F'">|<'输入字段分隔符设置为 ">< 。这样,我们可以轻松捕获标签中的值,而无需进一步处理。
  • -v OFS=","输出字段分隔符设置为逗号。
  • NF>3{if (r) {r=r OFS $3} else r=$3}如果记录包含 3 个以上的字段,请将第 3 个字段存储在变量 r 中。这将继续添加内容,直到找到<tr...
  • /tr/ {print r; r=""},这就是我们打印内容并清空变量以开始处理下一个块的时候。