Need regular expression for PHP

I need help developing a regular expression to extract some data from HTML. The HTML pattern looks like this:

<h5>Work Experience</h5>
  <p><span id="organization">Company Name 1</span></p>
  Designation 1
    <p>Date 1
  </p>
    <ul>
      <li>Some text 1</li>
    </ul>
  <p><span id="organization">Company Name 2</span></p>
  Designation 2
    <p>Date 2
  </p>
    <ul>
      <li>Some text 2</li>
    </ul>
  <p><span id="organization">Company Name 3</span></p>
  Designation 3
    <p>Date 3
  </p>
    <ul>
      <li>Some text 3</li>
    </ul></div>

I tried the following regular expression:

|<h5>Work Experience<\/h5>\s*<p>(.*)<\/p>(.*)<p>(.*)<\/p>\s*<ul>(.*)<\/ul>\s*<\/div>|Uis

I need to get all the company names, designations, and dates.

Please help me. Thanks in advance.

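Don't use regular expressions to parse HTML (see this famous answer for a detailed explanation of why). Instead, use something like DOM, which makes this much easier. For the example above, you could do: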

$doc = new DOMDocument();
$doc->loadHTML($html); // $html should contain the HTML source
// Get all spans from the document
$spans = $doc->getElementsByTagName('span');
// Loop over the spans
foreach ($spans as $span) {
    // Check if the span has an id attribute with "organization" as value
    if ($span->hasAttribute('id') && $span->getAttribute('id') === 'organization') {
        echo $span->nodeValue; // This will echo the company name
    }
}

You can see a complete working example and its result here: https://3v4l.org/XdrQ1
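The loop above only prints the company names, while the question also asks for the designation and the date. One possible way to get all three with the same DOM approach is DOMXPath, reading the text node and the `<p>` that follow each company paragraph. This is a minimal sketch, not a definitive implementation: the `$jobs` array and its keys are names made up for illustration, and it assumes every entry follows exactly the layout shown in the question.

```php
<?php
// Sample markup mirroring the question's structure (shortened to two entries)
$html = '<div><h5>Work Experience</h5>
  <p><span id="organization">Company Name 1</span></p>
  Designation 1
    <p>Date 1
  </p>
    <ul><li>Some text 1</li></ul>
  <p><span id="organization">Company Name 2</span></p>
  Designation 2
    <p>Date 2
  </p>
    <ul><li>Some text 2</li></ul></div>';

// Collect parser warnings (e.g. about the duplicate ids) instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$jobs = []; // hypothetical result structure, one entry per company

foreach ($xpath->query('//span[@id="organization"]') as $span) {
    $p = $span->parentNode; // the <p> that wraps the company name
    $jobs[] = [
        'company'     => trim($span->textContent),
        // first text node that follows the wrapping <p>
        'designation' => trim($xpath->evaluate('string(following-sibling::text()[1])', $p)),
        // first <p> that follows the wrapping <p>
        'date'        => trim($xpath->evaluate('string(following-sibling::p[1])', $p)),
    ];
}

print_r($jobs);
```

If the real markup places other elements between the company name and the date, the two `following-sibling` expressions would need adjusting.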

Another suggestion to use a parser. Consider this example with SimpleXML and an XPath query. Also, IDs need to be unique, so it is better to use a class:

<?php
$html = '
<div>
    <h5>Work Experience</h5>
    <p><span class="organization">Company Name 1</span></p>
    Designation 1
    <p>Date 1</p>
    <ul>
      <li>Some text 1</li>
    </ul>
    <p><span class="organization">Company Name 2</span></p>
    Designation 2
    <p>Date 2</p>
    <ul>
      <li>Some text 2</li>
    </ul>
</div>';
$xml = simplexml_load_string($html);
$spans = $xml->xpath("//span[@class='organization']");
foreach ($spans as $span) {
    // Each match casts to its text content, i.e. the company name
    echo $span, PHP_EOL;
}
?>

Hint:

As @Oldskool pointed out, you may not be able to change the original (invalid) HTML string. In that case, you need to change the query like this:

$spans = $xml->xpath("//span[@id='organization']");
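Note that `simplexml_load_string()` expects well-formed XML, so it will likely fail on the question's original fragment (no single root element, stray closing `</div>`). One way around this, sketched below, is to let `DOMDocument::loadHTML()` repair the markup first and then hand the result to SimpleXML via `simplexml_import_dom()`. The shortened `$html` sample and the `$companies` variable are assumptions for illustration.

```php
<?php
// Fragment shaped like the question's HTML: duplicate ids and a stray </div>
$html = '<h5>Work Experience</h5>
  <p><span id="organization">Company Name 1</span></p>
  Designation 1
  <p>Date 1</p>
  <p><span id="organization">Company Name 2</span></p>
  Designation 2
  <p>Date 2</p></div>';

// Collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html); // the HTML parser tolerates the invalid markup
libxml_clear_errors();

// Convert the repaired DOM tree into a SimpleXMLElement
$xml = simplexml_import_dom($doc);

$companies = []; // hypothetical variable collecting the names
foreach ($xml->xpath("//span[@id='organization']") as $span) {
    $companies[] = (string) $span;
}

print_r($companies);
```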

Try this:

<span id="organization">(?<company_name>[^<]+)<\/span><\/p>\n\s*(?<designation>[^\n]+)\n\s*<p>(?<date>[^\n]+)

Regex demo

Output:

MATCH 1
company_name    [54-68] `Company Name 1`
designation [82-95] `Designation 1`
date    [103-109]   `Date 1`
MATCH 2
company_name    [192-206]   `Company Name 2`
designation [220-233]   `Designation 2`
date    [241-247]   `Date 2`
MATCH 3
company_name    [330-344]   `Company Name 3`
designation [358-371]   `Designation 3`
date    [379-385]   `Date 3`
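For completeness, here is roughly how that pattern could be used from PHP with `preg_match_all()`. This is a sketch under a few assumptions: the input uses LF line endings and keeps exactly the layout from the question, and `~` is chosen as the delimiter so the slashes need no escaping. The `$pattern` and `$matches` names are illustrative.

```php
<?php
// Input shaped like the question's HTML (two entries, LF line endings assumed)
$html = '<h5>Work Experience</h5>
  <p><span id="organization">Company Name 1</span></p>
  Designation 1
    <p>Date 1
  </p>
    <ul>
      <li>Some text 1</li>
    </ul>
  <p><span id="organization">Company Name 2</span></p>
  Designation 2
    <p>Date 2
  </p>
    <ul>
      <li>Some text 2</li>
    </ul></div>';

// Single-quoted string: \n and \s reach PCRE unchanged and are interpreted there
$pattern = '~<span id="organization">(?<company_name>[^<]+)</span></p>\n\s*'
         . '(?<designation>[^\n]+)\n\s*<p>(?<date>[^\n]+)~';

// PREG_SET_ORDER yields one array per match, with the named groups as keys
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    printf("%s | %s | %s\n", $m['company_name'], $m['designation'], $m['date']);
}
```

Any change in whitespace or attribute order in the real pages would break the match, which is the main argument the other answers make for using a parser.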

I'd suggest using SimpleXML rather than regex in this case, because it lets you parse the DOM with specific selectors.

Also, IDs in the DOM should be unique.

More about SimpleXML: http://en.php.net/SimpleXML

Here is my demo. It just loops and uses explode to break the string apart:

<?php
$html = '<div>
    <h5>Work Experience</h5>
    <p><span class="organization">Company Name 1</span></p>
    Designation 1
    <p>Date 1</p>
    <ul>
      <li>Some text 1</li>
    </ul>
    <p><span class="organization">Company Name 2</span></p>
    Designation 2
    <p>Date 2</p>
    <ul>
      <li>Some text 2</li>
    </ul>
</div>';
// Each company entry ends with </ul>, so split the markup there
$companyBlocks = explode('</ul>', $html);
foreach ($companyBlocks as $block) {
    // The chunk after the last </ul> holds no company; skip it
    if (strpos($block, 'organization">') === false) {
        continue;
    }
    $company = explode('organization">', $block);
    $company = explode('</span>', $company[1]);
    echo 'Company: ' . $company[0] . '<br>';
    // After </span></p> comes the designation, then <p>date</p>
    $rest = explode('</span></p>', $block);
    $parts = explode('<p>', $rest[1]);
    echo 'Designation: ' . trim($parts[0]) . '<br>';
    $date = explode('</p>', $parts[1]);
    echo 'Date: ' . trim($date[0]) . '<br>';
}