Extract HTML Tags using preg_split

Keywords: HTML, tag extraction, split, preg, usage | Updated: 2023-09-27

I have a string:

$string = 'this is test <b>bold</b> this is another test <img src="#"> image' ;

I want to split it so that the HTML tags and the plain text end up as separate elements.

I need output like the following:

[0] => this is test
[1] => <b>bold</b>
[2] => this is another test
[3] => <img src="#">
[4] => image

I am using this code:

$strip = preg_split('/\s+(?![^<>]+>)/m', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

Output:

[0] => this
[1] => is
[2] => test
[3] => <b>bold</b>
[4] => this
[5] => .....

I'm new to this. Please help!

I found it easier to get the result using preg_match_all:

$string = 'this is test <b>bold</b> this is another test <img src="#"> image <hr/>';
preg_match_all('/<([^\s>]+)(.*?)>((.*?)<\/\1>)?|(?<=^|>)(.+?)(?=$|<)/i', $string, $result);
$result = $result[0];
// keep only the full matches (index 0), discarding the capture groups
foreach ($result as &$group) {
    $group = preg_replace('/^\s*(.*?)\s*$/', '$1', $group);
    // this strips leading and trailing whitespace
}
print_r($result);

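Edit:

I had assumed there must be at least one character between a tag's opening and closing, but that is not required, so I changed the second + to *. I also allowed for tag names being case-insensitive.

Output: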

Array
(
    [0] => this is test
    [1] => <b>bold</b>
    [2] => this is another test
    [3] => <img src="#">
    [4] => image
    [5] => <hr/>
)
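
As a side note, the trimming loop can be written more compactly. The sketch below is a variation on the code above, not part of the original answer. It assumes trim() is an acceptable stand-in for the \s-based preg_replace, and it uses a separate variable, $matches, which is a name introduced here. It also avoids leaving a reference behind after the foreach.

```php
<?php
$string = 'this is test <b>bold</b> this is another test <img src="#"> image <hr/>';

// Same pattern as above: either a tag (with its closing tag, if present)
// or a run of text between tags.
preg_match_all('/<([^\s>]+)(.*?)>((.*?)<\/\1>)?|(?<=^|>)(.+?)(?=$|<)/i', $string, $matches);

// $matches[0] holds the full matches; trim() strips surrounding whitespace.
$result = array_map('trim', $matches[0]);

print_r($result);
```

For the sample string this should give the same six elements as the output listed above.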

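Edit 2:

This does not work for irregular cases, such as the ones given as examples in the comments:

foobaritalic or foobarbazfail

To make it work, the RegEx would need to be adjusted to look inside each match and handle it accordingly.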
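
One way to deal with such cases, assuming they are nested tags, is to stop relying on a single pattern. Instead, tokenize with preg_split and count the nesting depth. This is only a rough sketch, not the original answer's method. The function name split_top_level and the list of void elements are made up for illustration, and it will break on attribute values that contain ">". For real-world HTML a proper parser is likely the safer choice.

```php
<?php
// Hypothetical helper (not from the original answer): returns top-level
// text runs and complete elements, treating nested tags as one element.
function split_top_level(string $html): array
{
    // Tokenize with preg_split, keeping each tag as its own token.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Partial list of void elements, assumed for this sketch.
    $void = ['br', 'hr', 'img', 'input', 'link', 'meta'];

    $result = [];
    $buffer = '';
    $depth = 0;

    foreach ($tokens as $token) {
        if (!preg_match('/^<(\/?)([a-z][a-z0-9]*)\b[^>]*>$/i', $token, $m)) {
            // Plain text: keep it inside an open element, else emit it trimmed.
            if ($depth > 0) {
                $buffer .= $token;
            } elseif (trim($token) !== '') {
                $result[] = trim($token);
            }
            continue;
        }

        $isVoid = in_array(strtolower($m[2]), $void, true)
            || substr($token, -2) === '/>';

        if ($m[1] === '/') {
            // Closing tag: finish the element when we are back at the top level.
            $buffer .= $token;
            $depth = max(0, $depth - 1);
            if ($depth === 0) {
                $result[] = $buffer;
                $buffer = '';
            }
        } elseif ($isVoid) {
            // Void or self-closing tag: an element by itself at the top level.
            if ($depth === 0) {
                $result[] = $token;
            } else {
                $buffer .= $token;
            }
        } else {
            // Opening tag: start or continue buffering.
            $buffer .= $token;
            $depth++;
        }
    }

    // Emit whatever is left if the input had an unclosed tag.
    if ($buffer !== '') {
        $result[] = $buffer;
    }
    return $result;
}

print_r(split_top_level('<b>foo<i>bar</i>italic</b> plain <img src="#"> end'));
```

The depth counter does not check that closing tags match their opening tags, so mis-nested markup is simply grouped into one element and not rejected.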