Extracting URLs from a page with regex


Keywords: extract, url, regex | Last updated: 2023-09-27

I have a PHP script that extracts all URLs from a page:

$regex = '/https?:\/\/[^" ]+/i';
preg_match_all($regex, $page, $matches);
$links = ($matches[0]);
foreach($links as $link)
{
  echo $link.'<br />';
}

How can I modify it so that, instead of every link, it extracts only the links that match a partial URL such as `http://www.site.com/artist/`? The result I'm looking for is a list like:

http://www.site.com/artist/Nirvana/

http://www.site.com/artist/Jayz/

and so on.

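By changing the delimiter to an exclamation mark, the slashes no longer need escaping. The \s character class matches whitespace characters such as tabs, spaces, and newlines. I also made sure both types of quote are covered (in case the page uses them inconsistently).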

$regex = '!https?://www.site.com/artist/[^\'"\s]+!i';
preg_match_all($regex, $page, $matches);
$links = ($matches[0]);
foreach($links as $link)
{
  echo $link.'<br />';
}
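
As a quick check of this pattern, here is a minimal sketch run against a made-up HTML fragment. The contents of `$page` are assumptions for illustration only. The dots are also escaped here so that they match only a literal `.`, which the pattern above does not do.

```php
<?php
// Hypothetical sample markup; a real $page would come from the fetched document.
$page = <<<'HTML'
<a href="http://www.site.com/artist/Nirvana/">Nirvana</a>
<a href='https://www.site.com/artist/Jayz/'>Jay-Z</a>
<a href="http://www.site.com/album/Nevermind/">Nevermind</a>
HTML;

// "!" as delimiter so slashes need no escaping; the class stops at a quote or whitespace.
$regex = '!https?://www\.site\.com/artist/[^\'"\s]+!i';
preg_match_all($regex, $page, $matches);
$links = $matches[0];

print_r($links);
```

Only the two artist links should be collected; the album link does not match the prefix and is skipped.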

$regex = '/http:\/\/www.site.com\/artist\/[^" ]+\//';

Of course, what you allow after the artist part depends on what counts as acceptable input.

If you only accept letters and digits, use [a-zA-Z0-9]+.
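
A small sketch of that stricter variant, assuming artist slugs contain only ASCII letters and digits. The sample text is hypothetical, and the capture group is an addition for pulling out the artist name alongside the URL.

```php
<?php
// Hypothetical input: one slug that fits the class and one that only partly fits.
$page = 'See http://www.site.com/artist/Nirvana/ and http://www.site.com/artist/AC-DC/ today.';

// Group 1 captures the slug; matching stops at the first character outside [a-zA-Z0-9].
$regex = '~http://www\.site\.com/artist/([a-zA-Z0-9]+)~';
preg_match_all($regex, $page, $matches);

$urls    = $matches[0]; // full matches
$artists = $matches[1]; // captured slugs

print_r($urls);
print_r($artists);
```

With this input, `AC-DC` is cut short at the hyphen and captured as `AC`, which is the trade-off of a narrow character class: anything outside it ends the match.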

Where are these URLs? Are they in a web page? Try this:

http://www.site.com/artist/.*\b

Update 1:

If you're using PHP, try this:

preg_match_all('%http://www\.site\.com/artist/.*\b%', $html, $urls, PREG_PATTERN_ORDER);
$urls = $urls[0];
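
One caveat worth checking against your own data: `.*` is greedy, so when more text follows the URL on the same line, the match can run past the end of the link. The sketch below compares it with a bounded character class; the `$html` string is a hypothetical one-line input chosen to show the difference.

```php
<?php
// Hypothetical one-line input with text after the link.
$html = '<a href="http://www.site.com/artist/Nirvana/">Nirvana</a> rocks';

// Greedy: .* runs to the end of the line, then backs up to the last word boundary.
preg_match_all('%http://www\.site\.com/artist/.*\b%', $html, $greedy);

// Bounded: stop at the first quote, whitespace, or angle bracket.
preg_match_all('%http://www\.site\.com/artist/[^\'"\s<>]+%', $html, $bounded);

echo $greedy[0][0], PHP_EOL;
echo $bounded[0][0], PHP_EOL;
```

On this input the greedy pattern swallows the rest of the line, while the bounded one stops at the closing quote of the `href` attribute.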