Regex to parse Amazon snippet HTML tag

I have these two snippets:

<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21">PFIFF Reitstrumpf kariert, grau/lila, 37-39, 100322-144-37</a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />

The second one:

<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21"><img border="0" src="http://ws-eu.amazon-adsystem.com/widgets/q?_encoding=UTF8&ASIN=B004DI7A5S&Format=_SL110_&ID=AsinImage&MarketPlace=DE&ServiceVersion=20070822&WS=1&tag=webbigode-21" ></a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />

(Note that they are similar, but the second one is slightly longer.)

From the first snippet I need the contents of the href; from the second snippet I need the contents of the image source.

This does not work:

$result = preg_match_all("/<img.*?src\s*=.*?>/",$_POST['bild'],$matches);

What should I do?

You can use Simple HTML DOM to parse the HTML instead of using a regex.

include 'simple_html_dom.php';
$html = str_get_html('<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21"><img border="0" src="http://ws-eu.amazon-adsystem.com/widgets/q?_encoding=UTF8&ASIN=B004DI7A5S&Format=_SL110_&ID=AsinImage&MarketPlace=DE&ServiceVersion=20070822&WS=1&tag=webbigode-21" ></a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />');
echo $html->find('a', 0)->href . PHP_EOL;
echo $html->find('img', 0)->src;

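You can parse these values with a very simple regex, using the concept of the non-greedy "dot" (.*?). Although the dot can match anything, it consumes only one character at a time and then lets the rest of the pattern (the closing double-quote delimiter) match. For readability and easier access to the results, you can add some named groups: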

href="(?<href>.*?)"|src="(?<imgsrc>.*?)" //global

As Laurel pointed out, this reduction in complexity comes at the cost of execution speed. The trade-off depends on your use case.
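As a rough sketch of how the named-group pattern above might be used from PHP: the URLs below are shortened versions of the ones in the question, and the variable names ($snippet, $hrefs, $imgSrcs) are made up for illustration. It assumes double-quoted attribute values, as the pattern itself does.

```php
<?php
// Shortened copy of the second snippet from the question (query strings trimmed for readability).
$snippet = '<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?tag=webbigode-21">'
         . '<img border="0" src="http://ws-eu.amazon-adsystem.com/widgets/q?ASIN=B004DI7A5S" ></a>'
         . '<img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21" width="1" height="1" />';

// The pattern from the answer, wrapped in PHP delimiters.
$pattern = '/href="(?<href>.*?)"|src="(?<imgsrc>.*?)"/';

// PREG_SET_ORDER gives one array per match, so each hit can be sorted by group name.
preg_match_all($pattern, $snippet, $matches, PREG_SET_ORDER);

$hrefs = [];
$imgSrcs = [];
foreach ($matches as $m) {
    // Only one of the two alternatives takes part in any given match;
    // the other group is either missing or an empty string.
    if (($m['href'] ?? '') !== '') {
        $hrefs[] = $m['href'];
    } elseif (($m['imgsrc'] ?? '') !== '') {
        $imgSrcs[] = $m['imgsrc'];
    }
}

print_r($hrefs);
print_r($imgSrcs);
```

Note that this collects every src in the input, including the 1x1 tracking pixel at the end, so the image you want is the first entry in $imgSrcs.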

Regex demo

This one extracts the href (~36 steps):

<a(?:\s*(?!href)[^\s>]*)*\s*href=["']([^"']+)

This one extracts the src (~59 steps):

<img(?:\s*(?!src)[^\s>]*)*\s*src=["']([^"']+)

The markup is regular and can easily be parsed with a regex. Note that I assume the attributes (href and src) are surrounded by either type of quote.

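These regexes are very fast (more than 10 times faster than the other regex). In fact, given all the optimizations in PCRE, they may be faster than a full parser.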

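Essentially, my two regexes are almost identical. They find the start of the tag, <a, and check whether any attributes follow it. Attributes that are not the one you want are skipped by (?:\s*(?!href)[^\s>]*)*. The one you want is captured by \s*href=["']([^"']+)["'].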
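For completeness, here is a minimal sketch of how these two patterns might be called from PHP. The helper firstCapture() and the variable names are hypothetical, the sample URLs are shortened versions of those in the question, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not part of the original answer):
// returns the first capture group of the first match, or null if nothing matched.
function firstCapture(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// The two patterns from the answer. Inside a single-quoted PHP string
// only the single quote needs escaping; \s is passed through to PCRE unchanged.
$hrefPattern = '/<a(?:\s*(?!href)[^\s>]*)*\s*href=["\']([^"\']+)/';
$srcPattern  = '/<img(?:\s*(?!src)[^\s>]*)*\s*src=["\']([^"\']+)/';

// Shortened copies of the two snippets from the question.
$textLink  = '<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?tag=webbigode-21">PFIFF Reitstrumpf</a>';
$imageLink = '<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?tag=webbigode-21">'
           . '<img border="0" src="http://ws-eu.amazon-adsystem.com/widgets/q?ASIN=B004DI7A5S" ></a>';

$href = firstCapture($hrefPattern, $textLink);   // href from the first snippet
$src  = firstCapture($srcPattern, $imageLink);   // image source from the second snippet

echo $href, PHP_EOL, $src, PHP_EOL;
```

Because preg_match stops at the first match, the trailing tracking-pixel <img> in the full snippets would not be returned when the product image comes first.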