PHP Regex, Matching anything between two specific words/tags with conditions

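My regex skills are poor, so here is my scenario.

I am trying to extract some information from a web page that contains several tables. Only some of the tables contain a unique URL (say "very/unique.key"), so the page looks like this: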

<table ....>
   (bunch of content)
</table>
<table ....>
   (bunch of content)
</table>
<table ....>
   (bunch of content + "very/unique.key" keyword)
</table>
<table ....>
   (bunch of content)
</table>
<table ....>
   (bunch of content + "very/unique.key" keyword)
</table>

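What I want is to extract the content of every table that contains the "very/unique.key" keyword. Here are the patterns I have tried:

$pattern = "#<table[^>]+>((?!\<table)(?=very\/unique\.key).*)<\/table>#i";

This returns nothing...

$pattern = "#<table[^>]+>((?!<table).*)<\/table>#i";

This returns everything from the opening tag <table...> of the first table to the closing tag </table> of the last table, even with the (?!<table) condition...

Thanks to anyone willing to help me with this.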

--EDIT--

Here is the solution I found, which uses the DOM to loop through each table.

--MY SOLUTION--

        $index = array(); // indexes of all the table(s) that contain the keyword
        $cd = 0; // counter: number of matching tables recorded so far
        $DOM = new DOMDocument();
        $DOM->loadHTMLFile("http://uni.corp/sub/sub/target.php?key=123");
        $xpath = new DOMXPath($DOM);
        $tables = $DOM->getElementsByTagName("table");
        for ($n = 0; $n < $tables->length; $n++) {
            $rows = $tables->item($n)->getElementsByTagName("tr");
            for ($i = 0; $i < $rows->length; $i++) {
                $cols = $rows->item($i)->getElementsByTagName("td");
                for ($j = 0; $j < $cols->length; $j++) {

                    $td = $cols->item($j); // grab the td element
                    $img = $xpath->query('./img', $td)->item(0); // grab the first direct img child element

                    if ($img !== null) {
                        $image = $img->getAttribute('src'); // grab the source of the image
                        echo $image;
                        if ($image == "very/unique.key") {
                            echo $td->nodeValue, "\t";
                            // record each matching table once, in document order
                            if (!in_array($n, $index, true)) {
                                $index[$cd] = $n;
                                $cd++;
                            }

                            echo $cd . " " . $n; // for troubleshooting
                        }

                    }
                }
                echo "<br/>";
            }
        }
        // loop that echoes only the table(s) that contain the keyword
        $loop = count($index);
        for ($n = 0; $n < $loop; $n++) {
            $temp = $index[$n];
            $rows = $tables->item($temp)->getElementsByTagName("tr");
            for ($i = 0; $i < $rows->length; $i++) {
                $cols = $rows->item($i)->getElementsByTagName("td");
                for ($j = 0; $j < $cols->length; $j++) {
                    echo $cols->item($j)->nodeValue, "\t";
                    // process the extracted table content here
                }
                //echo "<br/>";
            }
        }

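Personally, though, I am still curious about the regex part, and I hope someone can find a regex pattern that solves this problem. In any case, thanks to everyone who helped or advised me on this (especially AbsoluteƵERæ).

This works in PHP 5. We parse the tables with the DOM and use preg_match() to check each one for the key. The reason for this approach is that HTML, unlike XML, does not have to be syntactically well-formed, so you may not actually have proper closing tags. You may also have nested tables, which would give you multiple candidate results when trying to pair opening and closing tags with a regex. This way, we only check for the key itself, not the well-formedness of the document being parsed.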

<?php
$input = "<html>
<table id='1'>
<tr>
<td>This does not contain the key.</td>
</tr>
</table>
<table id='2'>
<tr>
<td>This does contain the unique.key!</td>
</tr>
</table>
<table id='3'>
<tr>
<td>This also contains the unique.key.</td>
</tr>
</table>
</html>";
$html = new DOMDocument;
$html->loadHTML($input);
$findings = array();
$tables = $html->getElementsByTagName('table');
foreach($tables as $table){
    $element = $table->nodeValue;
    if(preg_match('!unique\.key!',$element)){
        $findings[] = $element;
    }
}
print_r($findings);
?>

Output

Array
(
    [0] => This does contain the unique.key!
    [1] => This also contains the unique.key.
)

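Although I agree with the comments on your post, I will give a solution anyway. If you want to replace very/unique.key with something else, the correct regex would look something like this:

#<table(.*)>((.*)very\/unique\.key(.*))<\/table>#imsU

The key here is to use the right modifiers so that the pattern works with your input string. For details on these modifiers, see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

Here is an example in which I replace very/unique.key with "foobar":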
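As a rough illustration of what the s and U modifiers change, here is a minimal sketch on a made-up input (the $text string and the <b> tags are hypothetical, chosen only to keep the example short). The i modifier makes matching case-insensitive and m changes how ^ and $ behave, so neither affects this demo.

```php
<?php
// Hypothetical sample input: the first element spans two lines.
$text = "<b>one\n1</b><b>two</b>";

// No modifiers: "." stops at newlines, so only the single-line element can match.
preg_match('#<b>.*</b>#', $text, $plain);

// s (PCRE_DOTALL): "." also matches newlines; the greedy ".*" runs to the last </b>.
preg_match('#<b>.*</b>#s', $text, $dotall);

// s + U (PCRE_UNGREEDY): quantifiers become lazy, so the match stops at the first </b>.
preg_match('#<b>.*</b>#sU', $text, $ungreedy);

var_dump($plain[0], $dotall[0], $ungreedy[0]);
```

Without s, a table whose content spans several lines would never match at all, which is one likely reason the patterns in the question returned nothing.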

<?php
$string = "
<table ....>
   (bunch of content)
</table>
<table ....>
   (bunch of content)
</table>
<table ....>
   bunch of content very/unique.key 
</table>
<table ....>
   (bunch of content)
</table>
<table ....>
   blabla very/unique.key
</table>
";
$pattern = '#<table(.*)>((.*)very\/unique\.key(.*))<\/table>#imsU';
echo preg_replace($pattern, '<table$1>$3foobar$4</table>', $string);
?>

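This code prints exactly the same string, but with both occurrences of "very/unique.key" replaced by "foobar", just as we wanted.

Although this solution works, it is certainly neither the most efficient nor the simplest one. As Mehdi said in the comments, PHP has an extension made specifically for working with XML (and therefore HTML).

Here is the link to the documentation for that extension: http://www.php.net/manual/en/intro.dom.php

With it, you can easily go through each table element and find the ones that contain the unique key.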
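One caveat if the goal is extraction rather than replacement: with the pattern above, a match starts at the first <table and the lazy (.*) simply keeps expanding across earlier tables until it reaches the keyword, so a captured group can contain several tables. The following is only a sketch of one possible regex-only alternative, not the pattern from this answer: it steps through the content one character at a time and refuses to cross any <table or </table boundary. The sample input and its id attributes are made up for the demo, and nested tables or malformed markup can still defeat it.

```php
<?php
// Hypothetical sample input modelled on the question.
$string = '
<table id="1">
   (bunch of content)
</table>
<table id="2">
   bunch of content very/unique.key
</table>
<table id="3">
   (bunch of content)
</table>
<table id="4">
   blabla very/unique.key
</table>
';

// (?:(?!</?table\b).)* consumes characters one at a time, but only while
// the current position is not the start of an opening or closing table tag.
$pattern = '#<table\b[^>]*>'
         . '((?:(?!</?table\b).)*?very/unique\.key(?:(?!</?table\b).)*)'
         . '</table>#is';

preg_match_all($pattern, $string, $matches);

// $matches[1] holds the inner content of each table that contains the key.
$contents = array_map('trim', $matches[1]);
print_r($contents);
```

The lookahead in the question's own pattern, (?!<table), is only tested once, at the position right after the opening tag, which is why it did not stop .* from running on to the last </table>.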
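As a minimal sketch of that idea, assuming the key shows up either as an img src attribute (as in the asker's DOM solution) or somewhere in the table text, a single XPath query can select the matching tables. The sample HTML and the id attributes below are hypothetical, and passing a node to saveHTML() requires PHP 5.3.6 or later.

```php
<?php
// Hypothetical sample page.
$html = '<html><body>
<table id="a"><tr><td>plain</td></tr></table>
<table id="b"><tr><td><img src="very/unique.key">wanted</td></tr></table>
<table id="c"><tr><td>see very/unique.key here</td></tr></table>
</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Tables that have a descendant <img> whose src equals the key,
// or whose text content contains the key.
$query = '//table[.//img[@src="very/unique.key"] or contains(., "very/unique.key")]';

$ids = array();
foreach ($xpath->query($query) as $table) {
    $ids[] = $table->getAttribute('id');
    echo $dom->saveHTML($table), "\n"; // markup of the matching table
}
```

Note that with nested tables an outer table also matches when only an inner table contains the key, so the query may need tightening for such pages.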