I need help developing a regular expression to extract some data from HTML. The HTML pattern is as follows:
<h5>Work Experience</h5>
<p><span id="organization">Company Name 1</span></p>
Designation 1
<p>Date 1
</p>
<ul>
<li>Some text 1</li>
</ul>
<p><span id="organization">Company Name 2</span></p>
Designation 2
<p>Date 2
</p>
<ul>
<li>Some text 2</li>
</ul>
<p><span id="organization">Company Name 3</span></p>
Designation 3
<p>Date 3
</p>
<ul>
<li>Some text 3</li>
</ul></div>
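I tried the following regular expression:
|<h5>Work Experience<\/h5>\s*<p>(.*)<\/p>(.*)<p>(.*)<\/p>\s*<ul>(.*)<\/ul>\s*<\/div>|Uis
I need to extract all the company names, designations, and dates.
Please help me. Thanks in advance.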
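Don't use regular expressions to parse HTML (see this famous answer for a detailed explanation of why). Instead, use something like DOM, which makes things much easier. For the example above, you could do: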
$doc = new DOMDocument();
$doc->loadHTML($html); // $html should contain the HTML source
// Get all spans from the document
$spans = $doc->getElementsByTagName('span');
// Loop over the spans
foreach ($spans as $span) {
    // Check if the span has an id attribute with "organization" as value
    if ($span->hasAttribute('id') && $span->getAttribute('id') === 'organization') {
        echo $span->nodeValue; // This will echo the company name
    }
}
You can see a complete working example and its output here: https://3v4l.org/XdrQ1
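The loop above prints only the company names, while the question also asks for the designation and the date. Here is a minimal sketch of how those might be picked up with DOMXPath. It assumes the layout from the question holds: the designation is the first non-empty text node after the company paragraph, and the date sits in the next <p>. The wrapping <div> and the variable name $jobs are my own additions for illustration.

```php
<?php
// Sample markup modelled on the question (duplicate ids kept as-is).
$html = <<<'HTML'
<div>
<h5>Work Experience</h5>
<p><span id="organization">Company Name 1</span></p>
Designation 1
<p>Date 1
</p>
<ul>
<li>Some text 1</li>
</ul>
<p><span id="organization">Company Name 2</span></p>
Designation 2
<p>Date 2
</p>
<ul>
<li>Some text 2</li>
</ul>
</div>
HTML;

// Collect parser warnings (e.g. about the repeated id) instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$jobs = []; // hypothetical name for the collected rows

// Every company name lives in a <p> that wraps the "organization" span.
foreach ($xpath->query('//p[span[@id="organization"]]') as $p) {
    // Assumption: the designation is the first non-blank text node after that <p>.
    $designation = $xpath->evaluate('string(following-sibling::text()[normalize-space()][1])', $p);
    // Assumption: the date is the next <p> sibling.
    $date = $xpath->evaluate('string(following-sibling::p[1])', $p);

    $jobs[] = [
        'company'     => trim($p->textContent),
        'designation' => trim($designation),
        'date'        => trim($date),
    ];
}

print_r($jobs);
```

If the real page is laid out differently (for example, the designation wrapped in its own element), the two following-sibling expressions would need adjusting.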
Another suggestion to use a parser. Consider this example with SimpleXML and xpath queries. Also, IDs need to be unique, so it is better to use a class:
<?php
$html = '
<div>
<h5>Work Experience</h5>
<p><span class="organization">Company Name 1</span></p>
Designation 1
<p>Date 1</p>
<ul>
<li>Some text 1</li>
</ul>
<p><span class="organization">Company Name 2</span></p>
Designation 2
<p>Date 2</p>
<ul>
<li>Some text 2</li>
</ul>
</div>';
$xml = simplexml_load_string($html);
$spans = $xml->xpath("//span[@class='organization']");
foreach ($spans as $span) {
    // do something useful here
}
?>
Hint:
As @Oldskool pointed out, you may not be able to change the original (invalid) HTML string. In that case, you need to change the query like this:
$spans = $xml->xpath("//span[@id='organization']");
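One more caveat: simplexml_load_string() expects well-formed XML, and HTML that you cannot change often is not. A possible workaround, sketched below, is to let DOMDocument's HTML parser read the markup and then hand the result to SimpleXML with simplexml_import_dom(). The sample string (with its unclosed <br> tags) and the variable name $names are made up for illustration.

```php
<?php
// Hypothetical fragment: the unclosed <br> is fine in HTML but not well-formed XML.
$html = '<div><h5>Work Experience</h5>'
      . '<p><span id="organization">Company Name 1</span></p>Designation 1<br>'
      . '<p><span id="organization">Company Name 2</span></p>Designation 2<br>'
      . '</div>';

// Collect parser warnings (e.g. about the repeated id) instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Convert the parsed DOM tree into a SimpleXMLElement and query it as before.
$xml = simplexml_import_dom($doc);

$names = []; // hypothetical name for the collected company names
foreach ($xml->xpath("//span[@id='organization']") as $span) {
    $names[] = (string) $span;
}

print_r($names);
```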
Try this:
<span id="organization">(?<company_name>[^<]+)<\/span><\/p>\n\s*(?<designation>[^\n]+)\n\s*<p>(?<date>[^\n]+)
Regex demo
Output:
MATCH 1
company_name [54-68] `Company Name 1`
designation [82-95] `Designation 1`
date [103-109] `Date 1`
MATCH 2
company_name [192-206] `Company Name 2`
designation [220-233] `Designation 2`
date [241-247] `Date 2`
MATCH 3
company_name [330-344] `Company Name 3`
designation [358-371] `Designation 3`
date [379-385] `Date 3`
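For completeness, here is roughly how a pattern like this could be applied in PHP with preg_match_all(). Treat it as a sketch: it assumes LF line endings and exactly the markup shown in the question, and it will break as soon as the HTML varies. I used ~ as the delimiter so the slashes need no escaping; $rows is just an illustrative name.

```php
<?php
// Sample input modelled on the question's markup (LF line endings assumed).
$html = <<<'HTML'
<p><span id="organization">Company Name 1</span></p>
Designation 1
<p>Date 1
</p>
<p><span id="organization">Company Name 2</span></p>
Designation 2
<p>Date 2
</p>
HTML;

// Same pattern as above, with "~" as delimiter so "/" can stay unescaped.
$pattern = '~<span id="organization">(?<company_name>[^<]+)</span></p>\n\s*'
         . '(?<designation>[^\n]+)\n\s*<p>(?<date>[^\n]+)~';

// PREG_SET_ORDER yields one array per match, with the named groups as keys.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$rows = []; // hypothetical name for the extracted records
foreach ($matches as $m) {
    $rows[] = [
        trim($m['company_name']),
        trim($m['designation']),
        trim($m['date']),
    ];
}

print_r($rows);
```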
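I would suggest using SimpleXML rather than regex in this case, because it lets you parse the DOM with specific selectors.
Also, IDs in the DOM should be unique.
More information on SimpleXML: http://en.php.net/SimpleXML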
Here is my demo. It simply loops and uses explode() to break up the string:
<?php
$html = '<div>
<h5>Work Experience</h5>
<p><span class="organization">Company Name 1</span></p>
Designation 1
<p>Date 1</p>
<ul>
<li>Some text 1</li>
</ul>
<p><span class="organization">Company Name 2</span></p>
Designation 2
<p>Date 2</p>
<ul>
<li>Some text 2</li>
</ul>
</div>';
// Each company block ends with </ul>, so split the document there
$companyBlocks = explode('</ul>', $html);
foreach ($companyBlocks as $block) {
    // Skip fragments without a company, e.g. the trailing </div>
    if (strpos($block, 'organization">') === false) {
        continue;
    }
    $company = explode('organization">', $block);
    $company = explode('</span>', $company[1]);
    echo 'Company: ' . trim($company[0]) . '<br>';
    // After the company paragraph come the designation and then <p>date</p>
    $rest = explode('</span></p>', $block);
    $parts = explode('<p>', $rest[1]);
    echo 'Designation: ' . trim($parts[0]) . '<br>';
    $date = explode('</p>', $parts[1]);
    echo 'Date: ' . trim($date[0]) . '<br>';
}