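I want to scrape an HTML list structure so that I can save the parent and child nodes separately.
Here is the HTML, taken from the page's view source:
<ul class="categories_list">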
<li><a href="/sports-nutrition">Sports Nutrition</a>
<ul class="categories_list">
<li><a href="/protein">Protein</a>
<ul class="categories_list">
<li><a href="/protein-powder">Protein Powder</a>
<ul class="categories_list">
<li><a href="/whey-protein">Whey Protein</a>
<ul class="categories_list">
<li><a href="/whey-protein-isolate">Whey Protein Isolate</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul class="categories_list">
<li><a href="/pre-workout-supplements">Pre Workout Supplements</a></li>
</ul>
<ul class="categories_list">
<li><a href="/creatine">Creatine</a>
<ul class="categories_list">
<li><a href="/creatine-monohydrate">Creatine Monohydrate</a></li>
</ul>
</li>
</ul>
<ul class="categories_list">
<li><a href="/amino-acids">Amino Acids</a>
<ul class="categories_list">
<li><a href="/essential-amino-acids">Essential Amino Acids</a>
<ul class="categories_list">
<li><a href="/bcaa">BCAA</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<ul class="categories_list">
<li><a href="/joint-supplements">Joint Supplements</a>
<ul class="categories_list">
<li><a href="/curcumin">Curcumin</a>
<ul class="categories_list">
<li><a href="/curcumin-phytosome">Curcumin Phytosome</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<ul class="categories_list">
<li><a href="/energy-endurance">Energy & Endurance</a>
<ul class="categories_list">
<li><a href="/stimulants">Stimulants</a></li>
</ul>
</li>
</ul>
</li>
</ul>
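I am using Simple HTML DOM for the scraping. I can get all the categories, but I cannot get them in the proper hierarchy. I also tried the children() method, but it did not work.
So I am looking for some help with my existing work. Here is my existing code: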
$html= file_get_html($url);
foreach ($html->find('ul.categories_list li') as $link) {
echo $link->plaintext.'<br>';
}
This script attempts to fetch all the elements. It still needs improvement:
<?php
require_once("simple_html_dom.php");
$categoryList = array(); // filled by getCategory() through the global below
$dom = file_get_html("source.php");
getCategory($dom);
print_r($categoryList);
function getCategory(simple_html_dom $dom){
    global $categoryList;
    // The descendant selector also matches nested <li> elements,
    // so deeper categories can be recorded more than once.
    foreach($dom->find('ul.categories_list li') as $li){
        // extract the <a> tag if found; skip list items without a link
        $anchor = $li->find('a',0);
        if($anchor === null){
            continue;
        }
        $categoryList[] = array(
            "categoryName" => $anchor->href,
            "categoryLabel" => $anchor->innertext,
        );
        // blank out the <a> node so only the nested lists remain
        $anchor->outertext = '';
        $string = $li->innertext;
        if(trim($string) == ''){
            continue;
        }
        // parse the remaining markup and recurse into it
        $dom2 = str_get_html($string);
        if($dom2){
            getCategory($dom2);
        }
    }
}
It basically fills $categoryList recursively on each call.
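To keep the hierarchy rather than a flat list, one option is to walk only the direct children at each level and pass the parent's link down the recursion. The sketch below is not the Simple HTML DOM approach from the answer: it swaps in PHP's built-in DOM extension (DOMDocument) so that it runs without extra files. It assumes PHP 7.1 or later. The function name collectCategories() and the "parent" key are hypothetical names chosen for illustration, and the markup is a shortened copy of the list from the question.

```php
<?php
// Sketch using PHP's built-in DOM extension, not Simple HTML DOM.
// collectCategories() and the "parent" key are hypothetical names.

// Visit each <ul> that is a direct child of $node, then each <li> inside it,
// and record the category together with the href of its parent category.
function collectCategories(DOMNode $node, ?string $parentHref, array &$out): void
{
    foreach ($node->childNodes as $ul) {
        if (!($ul instanceof DOMElement) || $ul->tagName !== 'ul') {
            continue;
        }
        foreach ($ul->childNodes as $li) {
            if (!($li instanceof DOMElement) || $li->tagName !== 'li') {
                continue;
            }
            // The category link is the first <a> directly under this <li>.
            $anchor = null;
            foreach ($li->childNodes as $child) {
                if ($child instanceof DOMElement && $child->tagName === 'a') {
                    $anchor = $child;
                    break;
                }
            }
            if ($anchor === null) {
                continue;
            }
            $href = $anchor->getAttribute('href');
            $out[] = [
                'href'   => $href,
                'label'  => trim($anchor->textContent),
                'parent' => $parentHref, // null for top-level categories
            ];
            // Nested lists under this <li> become children of this category.
            collectCategories($li, $href, $out);
        }
    }
}

// Shortened copy of the markup from the question.
$html = <<<HTML
<ul class="categories_list">
<li><a href="/sports-nutrition">Sports Nutrition</a>
<ul class="categories_list">
<li><a href="/protein">Protein</a>
<ul class="categories_list">
<li><a href="/protein-powder">Protein Powder</a></li>
</ul>
</li>
</ul>
<ul class="categories_list">
<li><a href="/creatine">Creatine</a></li>
</ul>
</li>
</ul>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$categoryList = [];
// loadHTML() wraps the fragment in <html><body>, so start from <body>.
$body = $doc->getElementsByTagName('body')->item(0);
collectCategories($body, null, $categoryList);

foreach ($categoryList as $category) {
    echo $category['href'], ' (parent: ', $category['parent'] ?? 'none', ")\n";
}
```

Because each level looks only at direct children, every category is recorded once, and the "parent" value can be stored alongside it (for example as a parent ID column) when saving parents and children separately.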