Scrape HTML list structure with PHP

I want to scrape an HTML list structure so that I can save the parent and child nodes separately.

Here is the HTML, taken from the page's view source:

<ul class="categories_list">
    <li><a href="/sports-nutrition">Sports Nutrition</a>
        <ul class="categories_list">
            <li><a href="/protein">Protein</a>
                <ul class="categories_list">
                    <li><a href="/protein-powder">Protein Powder</a>
                        <ul class="categories_list">
                            <li><a href="/whey-protein">Whey Protein</a>
                                <ul class="categories_list">
                                    <li><a href="/whey-protein-isolate">Whey Protein Isolate</a></li>
                                </ul>
                            </li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/pre-workout-supplements">Pre Workout Supplements</a></li>
        </ul>
        <ul class="categories_list">
            <li><a href="/creatine">Creatine</a>
                <ul class="categories_list">
                    <li><a href="/creatine-monohydrate">Creatine Monohydrate</a></li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/amino-acids">Amino Acids</a>
                <ul class="categories_list">
                    <li><a href="/essential-amino-acids">Essential Amino Acids</a>
                        <ul class="categories_list">
                            <li><a href="/bcaa">BCAA</a></li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/joint-supplements">Joint Supplements</a>
                <ul class="categories_list">
                    <li><a href="/curcumin">Curcumin</a>
                        <ul class="categories_list">
                            <li><a href="/curcumin-phytosome">Curcumin Phytosome</a></li>
                        </ul>
                    </li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/energy-endurance">Energy &amp; Endurance</a>
                <ul class="categories_list">
                    <li><a href="/stimulants">Stimulants</a></li>
                </ul>
            </li>
        </ul>
    </li>
</ul>

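I am using Simple HTML DOM for the scraping. I can get all the categories, but I cannot get them in the proper hierarchy. I also tried the children() method, but that did not work.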

So I am looking for some help with my existing work. Here is my current code:

$html= file_get_html($url);
foreach ($html->find('ul.categories_list li') as $link) {
    echo $link->plaintext.'<br>';
}

This script attempts to fetch all of the elements. It could still be improved:

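<?php
require_once "simple_html_dom.php";

$dom = file_get_html("source.php");
$categoryList = array();
// Start at the outermost list; nested lists are reached by recursion.
getCategory($dom->find('ul.categories_list', 0), null, $categoryList);
print_r($categoryList);

/**
 * Append every <li> directly under $ul to $categoryList, then recurse
 * into the lists nested in that <li>, passing its href as the parent.
 */
function getCategory($ul, $parentName, array &$categoryList) {
    if (!$ul) {
        return;
    }
    foreach ($ul->children() as $li) {
        if ($li->tag !== 'li') {
            continue;
        }
        // Extract the <a> tag if found (in this markup it is the item's own link)
        $a = $li->find('a', 0);
        if (!$a) {
            continue;
        }
        $categoryList[] = array(
            "categoryName"   => $a->href,
            "categoryLabel"  => $a->innertext,
            "categoryParent" => $parentName, // added key: href of the parent category, or null
        );
        // Descend only into <ul> elements that are direct children of this <li>,
        // so each category is visited once instead of once per ancestor
        foreach ($li->children() as $child) {
            if ($child->tag === 'ul') {
                getCategory($child, $a->href, $categoryList);
            }
        }
    }
}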

It basically populates $categoryList recursively, with each call appending the categories it finds.
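
The recursive fill described above can also record each item's parent, which is what the question asks for. Below is a minimal sketch of that idea using PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, so it is a swapped-in technique and not the library used above. The sample markup is a shortened copy of the list from the question. The helper names (childElements, collectCategories) and the row keys (href, label, parent, depth) are hypothetical and chosen only for illustration.

```php
<?php
// Shortened sample of the markup from the question (illustration only).
$html = <<<HTML
<ul class="categories_list">
    <li><a href="/sports-nutrition">Sports Nutrition</a>
        <ul class="categories_list">
            <li><a href="/protein">Protein</a>
                <ul class="categories_list">
                    <li><a href="/protein-powder">Protein Powder</a></li>
                </ul>
            </li>
        </ul>
        <ul class="categories_list">
            <li><a href="/creatine">Creatine</a></li>
        </ul>
    </li>
</ul>
HTML;

/**
 * Return the element children of $node that have the given tag name.
 */
function childElements(DOMNode $node, string $tag): array
{
    $found = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement && $child->tagName === $tag) {
            $found[] = $child;
        }
    }
    return $found;
}

/**
 * Walk one <ul>: add a row per direct <li>, then recurse into the
 * lists nested in that <li>, passing its href down as the parent.
 */
function collectCategories(DOMElement $ul, ?string $parentHref, int $depth, array &$rows): void
{
    foreach (childElements($ul, 'li') as $li) {
        $links = childElements($li, 'a');
        if (!$links) {
            continue; // skip items that have no direct link
        }
        $href = $links[0]->getAttribute('href');
        $rows[] = [
            'href'   => $href,
            'label'  => trim($links[0]->textContent),
            'parent' => $parentHref,
            'depth'  => $depth,
        ];
        foreach (childElements($li, 'ul') as $nested) {
            collectCategories($nested, $href, $depth + 1, $rows);
        }
    }
}

libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$rows = [];
$xpath = new DOMXPath($doc);
// Start only from lists that are not nested inside another <li>.
foreach ($xpath->query('//ul[not(ancestor::li)]') as $topUl) {
    collectCategories($topUl, null, 0, $rows);
}

// Print an indented outline with each item's parent href.
foreach ($rows as $row) {
    echo str_repeat('  ', $row['depth']), $row['label'],
        ' (parent: ', $row['parent'] ?? 'none', ")\n";
}
```

Because every row carries its parent's href, the rows can be stored in a flat table and the tree rebuilt later. Walking only direct children at each level, instead of selecting every descendant li, is what keeps each category from being collected more than once.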