Find Tables by ID using Simple HTML DOM Parser

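Last year I wrote a database seeder that scrapes a stats website. On revisiting my code, it no longer seems to work, and I'm a bit stumped as to why. $html->find() is supposed to return an array of the elements it finds, but when I use it, it only seems to find the first table.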

Following the documentation, I also tried calling find() with each table's ID, but that seems to fail as well.

$table_passing = $html->find('table[id=passing]');
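Can anyone help me figure out what is going wrong here? I don't see why neither approach works: the page source clearly shows multiple tables with IDs, so both approaches should work.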
private function getTeamStats()
{
    $url = 'http://www.pro-football-reference.com/years/2016/opp.htm';
    $html = file_get_html($url);
    $tables = $html->find('table');
    $table_defense = $tables[0];
    $table_passing = $tables[1];
    $table_rushing = $tables[2];
    //$table_passing = $html->find('table[id=passing]');
    $teams = array();
    # OVERALL DEFENSIVE STATISTICS #
    foreach ($table_defense->find('tr') as $row)
    {
        $stats = $row->find('td');
        // Ignore the lines that don't have ranks, these aren't teams
        if (isset($stats[0]) && !empty($stats[0]->plaintext))
        {
            $name = $stats[1]->plaintext;
            $rank = $stats[0]->plaintext;
            $games = $stats[2]->plaintext;
            $yards = $stats[4]->plaintext;
            // Calculate the Yards Allowed per Game by dividing Total / Games
            $tydag = $yards / $games;
            $teams[$name]['rank'] = $rank;
            $teams[$name]['games'] = $games;
            $teams[$name]['tydag'] = $tydag;
        }
    }
    # PASSING DEFENSIVE STATISTICS #
    foreach ($table_passing->find('tr') as $row)
    {
        $stats = $row->find('td');
        // Ignore the lines that don't have ranks, these aren't teams
        if (isset($stats[0]) && !empty($stats[0]->plaintext))
        {
            $name = $stats[1]->plaintext;
            $pass_rank = $stats[0]->plaintext;
            $pass_yards = $stats[14]->plaintext;
            $teams[$name]['pass_rank'] = $pass_rank;
            $teams[$name]['paydag'] = $pass_yards;
        }
    }
    # RUSHING DEFENSIVE STATISTICS #
    foreach ($table_rushing->find('tr') as $row)
    {
        $stats = $row->find('td');
        // Ignore the lines that don't have ranks, these aren't teams
        if (isset($stats[0]) && !empty($stats[0]->plaintext))
        {
            $name = $stats[1]->plaintext;
            $rush_rank = $stats[0]->plaintext;
            $rush_yards = $stats[7]->plaintext;
            $teams[$name]['rush_rank'] = $rush_rank;
            $teams[$name]['ruydag'] = $rush_yards;
        }
    }

    return $teams;
}

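I've never used simplexml or its derivatives, but when an XPath query looks up an attribute such as an ID, you would normally prefix the attribute with @ and quote the value, so in your case it might be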

$table_passing = $html->find('table[@id="passing"]');
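
For reference, here is a minimal sketch of that attribute syntax using PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM. The inline markup and the variable names are made up for illustration; the only point is the `@id` predicate with a quoted value.

```php
<?php
// Hypothetical sample markup; the real page is far larger.
$sample = '<html><body>'
        . '<table id="team_stats"><tr><td>1</td></tr></table>'
        . '<table id="passing"><tr><td>2</td></tr></table>'
        . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// "@" selects an attribute; the value it is compared against is quoted.
$matches = $xpath->query('//table[@id="passing"]');

echo $matches->length, "\n";                       // number of matching tables
echo $matches->item(0)->getAttribute('id'), "\n";  // id of the first match
```

Whether Simple HTML DOM's own find() accepts this XPath form is not something this sketch demonstrates.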

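With the standard DOMDocument & DOMXPath approach, the problem turned out to be that the actual tables are "commented out" in the page source. A simple string replacement that strips the HTML comment delimiters makes the following work, and it could easily be adapted to the original code.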

$url='http://www.pro-football-reference.com/years/2016/opp.htm';
$html=file_get_contents( $url );
/* remove the html comments */
$html=str_replace( array('<!--','-->'), '', $html );
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->formatOutput=false;
$dom->loadHTML( $html );
libxml_clear_errors();  

$xp=new DOMXPath( $dom );
$tbl=$xp->query( '//table[@id="passing"]' );
foreach( $tbl as $n )echo $n->tagName.' > '.$n->getAttribute('id');
/* outputs */
table > passing
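
To see why stripping the comment delimiters matters, the effect can be reproduced on a small inline string. This is only a sketch under stated assumptions: the markup is invented to mimic a page that wraps some of its tables in HTML comments, and `countTables` is a hypothetical helper, not part of any library.

```php
<?php
// Hypothetical helper: parse an HTML string and count its <table> elements.
function countTables(string $html): int
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    return $doc->getElementsByTagName('table')->length;
}

// Invented markup: one live table and one hidden inside an HTML comment.
$sample = '<html><body>'
        . '<table id="team_stats"><tr><td>1</td></tr></table>'
        . '<!-- <table id="passing"><tr><td>2</td></tr></table> -->'
        . '</body></html>';

$before = countTables($sample);
// Same replacement as in the answer: drop only the comment delimiters.
$after = countTables(str_replace(array('<!--', '-->'), '', $sample));

echo "before: $before, after: $after\n";
```

In the original scraper, the same replacement could probably be applied to the raw page text before parsing, for example by fetching it with file_get_contents() and passing the cleaned string to Simple HTML DOM's str_get_html() instead of calling file_get_html() directly. Note that removing the delimiters exposes everything that was commented out, not just the tables.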