How can I extract all anchor tags, their hrefs and their anchor text within a string?

I need to process the links in an HTML string in several different ways.

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>.';
$links = extractLinks($str);
foreach ($links as $link) {
    $pattern = "#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i";
    if (preg_match($pattern, $link['href'])) {
        // Process Remote links
        //   For example, replace url with short url,
        //   or replace long anchor text with truncated
    } else {
        // Process Local Links, Anchors
    }
}
function extractLinks($str) {
    // First, I tried DomDocument
    $dom = new DomDocument();
    $dom->loadHTML($str);
    return $dom->getElementsByTagName('a');
    // But this just returns:
    //   DOMNodeList Object
    //   (
    //       [length] => 3
    //   )
    // Then I tried Regex
    if(preg_match_all("|<a.*(?=href=\"([^\"]*)\")[^>]*>([^<]*)</a>|i", $str, $matches)) {
        print_r($matches);
    }
    // But this didn't work either.
}

Desired result of extractLinks($str):

[0] => Array(
           'str' => '<a href="http://example.com/abc" rel="link">string</a>',
           'href' => 'http://example.com/abc',
           'anchorText' => 'string'
       ),
[1] => Array(
           'str' => '<a href="/local/path" title="with attributes">number</a>',
           'href' => '/local/path',
           'anchorText' => 'number'
       ),
[2] => Array(
           'str' => '<a href="#anchor" data-attr="lots">links</a>',
           'href' => '#anchor',
           'anchorText' => 'links'
       )

I need all of these so I can do things like edit the href (add tracking, shorten it, etc.) or replace the whole tag with something else (<a href="/u/username">username</a> could become username).

Here is a demo of what I'm trying to do.

You just need to change it to:

$str = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
    <a href="/local/path" title="with attributes">number</a> of
    <a href="#anchor" data-attr="lots">links</a>.';
$dom = new DomDocument();
$dom->loadHTML($str);
$output = array();
foreach ($dom->getElementsByTagName('a') as $item) {
   $output[] = array (
      'str' => $dom->saveHTML($item),
      'href' => $item->getAttribute('href'),
      'anchorText' => $item->nodeValue
   );
}

By iterating over the node list in a loop and using getAttribute, nodeValue and saveHTML(THE_NODE), you get the output you want.
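Building on the same loop, here is a minimal sketch of the two edits mentioned in the question (adding tracking to remote links and replacing a profile link with plain text). The function name rewriteLinks, the utm_source=demo parameter and the wrapping <div> are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper, not part of the original answer.
function rewriteLinks(string $html): string
{
    $dom = new DOMDocument();
    // Wrap the fragment so its contents can be serialized back without <html>/<body>.
    $dom->loadHTML('<div>' . $html . '</div>');
    $wrapper = $dom->getElementsByTagName('div')->item(0);

    // Copy the live node list first, because nodes are replaced inside the loop.
    foreach (iterator_to_array($dom->getElementsByTagName('a')) as $a) {
        $href = $a->getAttribute('href');
        if (preg_match('#^(https?|ftp)://#i', $href)) {
            // Remote link: append a made-up tracking parameter.
            $sep = strpos($href, '?') === false ? '?' : '&';
            $a->setAttribute('href', $href . $sep . 'utm_source=demo');
        } elseif (strpos($href, '/u/') === 0) {
            // Profile link: replace the whole tag with its text content.
            $a->parentNode->replaceChild($dom->createTextNode($a->textContent), $a);
        }
    }

    // Serialize only the children of the wrapper, i.e. the original fragment.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Copying the node list with iterator_to_array matters because getElementsByTagName returns a live list, and replacing nodes while iterating over it directly can skip elements.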

Like this:

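<a\s*href="([^"]+)"[^>]+>([^<]+)</a>
  1. The overall match is what you want for array element 0
  2. The Group #1 capture is what you want for array element 1
  3. The Group #2 capture is what you want for array element 2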

Use preg_match($pattern, $string, $m).

The array elements will be in $m[0], $m[1] and $m[2].

Here is a working PHP demo:

$string = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>. ';
$regex='|<a\s*href="([^"]+)"[^>]+>([^<]+)</a>|';
$howmany = preg_match_all($regex,$string,$res,PREG_SET_ORDER);
print_r($res);
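
To get from the PREG_SET_ORDER result to the array shape asked for in the question, the numeric groups can be mapped to named keys. This is only a sketch: the function name linksFromRegex is made up for illustration. Note also that the pattern's [^>]+ requires at least one character after the href value, so a bare <a href="x">y</a> would not be matched by this regex.

```php
<?php
// Hypothetical helper, not part of the original answer.
function linksFromRegex(string $string): array
{
    $regex = '|<a\s*href="([^"]+)"[^>]+>([^<]+)</a>|';
    preg_match_all($regex, $string, $res, PREG_SET_ORDER);

    // Each $m holds one match: [0] full tag, [1] href, [2] anchor text.
    return array_map(function (array $m) {
        return array(
            'str'        => $m[0],
            'href'       => $m[1],
            'anchorText' => $m[2],
        );
    }, $res);
}

$string = 'My long <a href="http://example.com/abc" rel="link">string</a> has any
        <a href="/local/path" title="with attributes">number</a> of
        <a href="#anchor" data-attr="lots">links</a>. ';
$links = linksFromRegex($string);
print_r($links);
```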