Need Help Selecting with PHP Simple HTML DOM Parser

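I'm working on a community website, converting it from ASP to PHP. Currently the client manually enters the weekly movie times for our local theatre, which they get from another website. Since we're redoing the site anyway, I figured I'd try to automate the process, and I found PHP Simple HTML DOM Parser. I'm stuck on selecting the movie's rating (PG, 18, etc.).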

Here is one div, which contains the information for a single movie:

            <div class="mshow">
                <span style="float:right; font-size:11px;">
                    <a href="/trailers/enders-game/19330/" title="enders-game movie trailer" style="font-size:11px;">Trailer</a> | 
                    <a href="/reviews/enders-game/30945/" title="Ender's Game movie reviews" style="font-size:11px;">Rating: </a>
                    <b>Tribute</b>
                    <img src="/images/stars/4_sm.gif" alt="Current rating: 3.88" border="0" />
                </span>
                <strong>
                    <a href="/movies/enders-game/30945/" title="Ender's Game movie info">Ender's Game</a>
                </strong>
                (PG)<br />
                <div class="block">&nbsp;</div>
                <div class="rsd">Fri, Nov 15: </div>
                <div class="rst" >7:00pm &nbsp;&nbsp;9:20pm &nbsp;&nbsp;</div><br />
                <div class="rsd">Sat, Nov 16: </div>
                <div class="rst" >1:00pm &nbsp;&nbsp;3:15pm &nbsp;&nbsp;7:00pm &nbsp;&nbsp;9:20pm &nbsp;&nbsp;</div><br />
                <div class="rsd">Sun, Nov 17: </div>
                <div class="rst" >1:00pm &nbsp;&nbsp;3:15pm &nbsp;&nbsp;7:00pm &nbsp;&nbsp;9:20pm &nbsp;&nbsp;</div><br />
                <div class="rsd">Mon, Nov 18: </div>
                <div class="rst" >7:00pm &nbsp;&nbsp;9:20pm &nbsp;&nbsp;</div><br />
                <div class="rsd">Tue, Nov 19: </div>
                <div class="rst" >7:00pm &nbsp;&nbsp;9:20pm &nbsp;&nbsp;</div><br />
                <div class="rsd">Wed, Nov 20: </div>
                <div class="rst" >7:00pm &nbsp;&nbsp;9:20pm &nbsp;&nbsp;</div><br />
                <div class="rsd">Thu, Nov 21: </div>
                <div class="rst" >7:00pm &nbsp;&nbsp;9:20pm &nbsp;&nbsp;</div><br />
            </div>

Here is my code so far:

            <?php
            include_once('../simple_html_dom.php');
            $html = file_get_html('http://www.tribute.ca/showtimes/theatres/may-cinema-6/mayc5/?datefilter=-1');
            $movies = array();
            foreach ($html->find("div.mshow") as $movie) {
                $item['trailer'] = $movie->find('a', 0)->href;
                $item['reviews'] = $movie->find('a', 1)->href;
                $item['link'] = $movie->find('a', 2)->href;
                $item['title'] = $movie->find('a', 2)->plaintext;
                $movies[] = $item;
            }
            var_dump($movies);
            ?>

I can't figure out how to grab the (PG). Any suggestions?

EDIT: This works, but it doesn't seem like a very good solution.

            function parseDOM($url) {
                $movies = array();
                foreach ($url->find("div.mshow") as $movie) {
                    $item['trailer'] = $movie->find('a', 0)->href;
                    $item['reviews'] = $movie->find('a', 1)->href;
                    $item['link'] = $movie->find('a', 2)->href;
                    $item['title'] = $movie->find('a', 2)->plaintext;
                    $info = $movie->plaintext;
                    preg_match('/\((.*?)\)/', $info, $matches);
                    $item['rating'] = $matches[1];
                    $movies[] = $item;
                }
                return $movies;
            }

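Unfortunately, the Simple HTML DOM library is a poor choice for this. The rating is a bare text node that follows the <strong> element, so an element selector cannot match it, and the library neither supports full XPath queries nor has a proper sibling-node selector.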

With PHP's built-in DOM extension, you can get what you want quite easily:

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://www.tribute.ca/showtimes/theatres/may-cinema-6/mayc5/?datefilter=-1');
$xpath = new DOMXPath($dom);
$movies = array();
foreach ($xpath->query("//div[@class='mshow']") as $movie) {
    $item = array();
    $links = $xpath->query('.//a', $movie);
    $item['trailer'] = $links->item(0)->getAttribute('href');
    $item['reviews'] = $links->item(1)->getAttribute('href');
    $item['link'] = $links->item(2)->getAttribute('href');
    $item['title'] = $links->item(2)->nodeValue;
    $item['rating'] = trim($xpath->query('.//strong/following-sibling::text()',
        $movie)->item(0)->nodeValue);
    $movies[] = $item;
}
var_dump($movies);
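
To try the sibling-text query without hitting the live site, here is a minimal offline sketch. It assumes a trimmed copy of the markup from the question (the $html string below is hand-copied and shortened, not fetched), and it picks only the first text node after the <strong> element. Stripping the parentheses at the end is optional and is my own addition, not part of the code above.

```php
<?php
// Trimmed copy of the question's markup, inlined so no network access is needed.
$html = <<<'HTML'
<div class="mshow">
    <span style="float:right; font-size:11px;">
        <a href="/trailers/enders-game/19330/">Trailer</a> |
        <a href="/reviews/enders-game/30945/">Rating: </a>
    </span>
    <strong>
        <a href="/movies/enders-game/30945/">Ender's Game</a>
    </strong>
    (PG)<br />
    <div class="rsd">Fri, Nov 15: </div>
    <div class="rst">7:00pm 9:20pm</div><br />
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$movie = $xpath->query("//div[@class='mshow']")->item(0);

// following-sibling::text()[1] is the nearest text node after <strong>.
$textNode = $xpath->query('./strong/following-sibling::text()[1]', $movie)->item(0);

$rating = ($textNode !== null) ? trim($textNode->nodeValue) : '';
$bare   = trim($rating, '()');      // optional: drop the surrounding parentheses

echo $rating, "\n";                 // (PG)
echo $bare, "\n";                   // PG
```

The null check matters on a scraped page: if the site changes its layout, item(0) returns null and calling ->nodeValue on it would fail.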

This gives me the following:

array(7) {
  [0]=>
  array(5) {
    ["trailer"]=>
    string(28) "/trailers/enders-game/19330/"
    ["reviews"]=>
    string(27) "/reviews/enders-game/30945/"
    ["link"]=>
    string(26) "/movies/enders-game/30945/"
    ["title"]=>
    string(12) "Ender's Game"
    ["rating"]=>
    string(4) "(PG)"
  }
  [1]=>
  array(5) {
    ["trailer"]=>
    string(27) "/trailers/free-birds/19436/"
    ["reviews"]=>
    string(26) "/reviews/free-birds/36183/"
    ["link"]=>
    string(25) "/movies/free-birds/36183/"
    ["title"]=>
    string(10) "Free Birds"
    ["rating"]=>
    string(3) "(G)"
  }
  [2]=>
  array(5) {
    ["trailer"]=>
    string(30) "/trailers/free-birds-3d/14421/"
    ["reviews"]=>
    string(29) "/reviews/free-birds-3d/37230/"
    ["link"]=>
    string(28) "/movies/free-birds-3d/37230/"
    ["title"]=>
    string(13) "Free Birds 3D"
    ["rating"]=>
    string(3) "(G)"
  }
  [3]=>
  array(5) {
    ["trailer"]=>
    string(45) "/trailers/jackass-presents-bad-grandpa/19318/"
    ["reviews"]=>
    string(44) "/reviews/jackass-presents-bad-grandpa/36493/"
    ["link"]=>
    string(43) "/movies/jackass-presents-bad-grandpa/36493/"
    ["title"]=>
    string(29) "Jackass Presents: Bad Grandpa"
    ["rating"]=>
    string(5) "(14A)"
  }
  [4]=>
  array(5) {
    ["trailer"]=>
    string(27) "/trailers/last-vegas/19291/"
    ["reviews"]=>
    string(26) "/reviews/last-vegas/35853/"
    ["link"]=>
    string(25) "/movies/last-vegas/35853/"
    ["title"]=>
    string(10) "Last Vegas"
    ["rating"]=>
    string(4) "(PG)"
  }
  [5]=>
  array(5) {
    ["trailer"]=>
    string(36) "/trailers/thor-the-dark-world/19327/"
    ["reviews"]=>
    string(35) "/reviews/thor-the-dark-world/32002/"
    ["link"]=>
    string(34) "/movies/thor-the-dark-world/32002/"
    ["title"]=>
    string(20) "Thor: The Dark World"
    ["rating"]=>
    string(4) "(PG)"
  }
  [6]=>
  array(5) {
    ["trailer"]=>
    string(39) "/trailers/thor-the-dark-world-3d/14425/"
    ["reviews"]=>
    string(38) "/reviews/thor-the-dark-world-3d/34705/"
    ["link"]=>
    string(37) "/movies/thor-the-dark-world-3d/34705/"
    ["title"]=>
    string(23) "Thor: The Dark World 3D"
    ["rating"]=>
    string(4) "(PG)"
  }
}
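
Since the end goal in the question is the weekly showtimes, the same following-sibling axis can probably pair each day label with its list of times as well. The sketch below is only an illustration under assumptions: it runs against a shortened, hand-copied fragment of the question's markup, the $showtimes array layout is my own choice, and the time pattern assumes the "7:00pm" format seen in the sample.

```php
<?php
// Shortened copy of the question's markup (two days only), inlined for an offline run.
$html = <<<'HTML'
<div class="mshow">
    <strong><a href="/movies/enders-game/30945/">Ender's Game</a></strong>
    (PG)<br />
    <div class="rsd">Fri, Nov 15: </div>
    <div class="rst">7:00pm &nbsp;&nbsp;9:20pm &nbsp;&nbsp;</div><br />
    <div class="rsd">Sat, Nov 16: </div>
    <div class="rst">1:00pm &nbsp;&nbsp;3:15pm &nbsp;&nbsp;7:00pm &nbsp;&nbsp;9:20pm &nbsp;&nbsp;</div><br />
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$showtimes = array();
foreach ($xpath->query("//div[@class='mshow']/div[@class='rsd']") as $day) {
    // The times sit in the first "rst" div that follows each "rsd" div.
    $timesNode = $xpath->query("following-sibling::div[@class='rst'][1]", $day)->item(0);
    if ($timesNode === null) {
        continue;                   // skip a day label that has no times block
    }
    // "Fri, Nov 15: " becomes "Fri, Nov 15"
    $label = rtrim(trim($day->nodeValue), ':');
    // Pull out each time; this also sidesteps the non-breaking spaces between them.
    preg_match_all('/\d{1,2}:\d{2}[ap]m/', $timesNode->nodeValue, $m);
    $showtimes[$label] = $m[0];
}

var_dump($showtimes);
```

Matching the times with a pattern, rather than splitting on whitespace, avoids having to deal with the &nbsp; characters that separate them in the source page.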