Cut html input while preserving tags with PHP

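I need to cut HTML input to a certain length while still preserving the tags. The total length must include the HTML markup. I couldn't find a solution that counts the length of the HTML tags in the final length. How can I cut HTML input to a certain length without breaking tags, while making sure every opening tag gets closed?

Basically, I get HTML input from the user and I need to make it fit within a certain length. I don't want to break any HTML tags, but I need to make sure the total length (including tags) does not exceed the maximum.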

For example, cutting this string to 20 characters:

<p>this is an example</p>

should give the output

<p>this is an ex</p>

and cutting this one to 50 characters:

<p>this <a href="http://example.com">click me</a>jiasd</p>

should give

<p>this <a href="http://example.com">click</a></p>

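I have already tried this solution. It works for cutting the text to a certain length, but I can't find a way to make it count the tag lengths in the total: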

function truncateHtml($text, $length = 100, $ending = '...', $exact = false, $considerHtml = true) {
    if ($considerHtml) {
        // if the plain text is shorter than the maximum length, return the whole text
        if (strlen(preg_replace('/<.*?>/', '', $text)) <= $length) {
            return $text;
        }
        // splits all html-tags to scanable lines
        preg_match_all('/(<.+?>)?([^<>]*)/s', $text, $lines, PREG_SET_ORDER);
        $total_length = strlen($ending);
        $open_tags = array();
        $truncate = '';
        foreach ($lines as $line_matchings) {
            // if there is any html-tag in this line, handle it and add it (uncounted) to the output
            if (!empty($line_matchings[1])) {
                // if it's an "empty element" with or without xhtml-conform closing slash
                if (preg_match('/^<(\s*.+?\/\s*|\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)(\s.+?)?)>$/is', $line_matchings[1])) {
                    // do nothing
                // if tag is a closing tag
                } else if (preg_match('/^<\s*\/([^\s]+?)\s*>$/s', $line_matchings[1], $tag_matchings)) {
                    // delete tag from $open_tags list
                    $pos = array_search($tag_matchings[1], $open_tags);
                    if ($pos !== false) {
                        unset($open_tags[$pos]);
                    }
                // if tag is an opening tag
                } else if (preg_match('/^<\s*([^\s>!]+).*?>$/s', $line_matchings[1], $tag_matchings)) {
                    // add tag to the beginning of $open_tags list
                    array_unshift($open_tags, strtolower($tag_matchings[1]));
                }
                // add html-tag to $truncate'd text
                $truncate .= $line_matchings[1];
            }
            // calculate the length of the plain text part of the line; handle entities as one character
            $content_length = strlen(preg_replace('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', ' ', $line_matchings[2]));
            if ($total_length+$content_length> $length) {
                // the number of characters which are left
                $left = $length - $total_length;
                $entities_length = 0;
                // search for html entities
                if (preg_match_all('/&[0-9a-z]{2,8};|&#[0-9]{1,7};|&#x[0-9a-f]{1,6};/i', $line_matchings[2], $entities, PREG_OFFSET_CAPTURE)) {
                    // calculate the real length of all entities in the legal range
                    foreach ($entities[0] as $entity) {
                        if ($entity[1]+1-$entities_length <= $left) {
                            $left--;
                            $entities_length += strlen($entity[0]);
                        } else {
                            // no more characters left
                            break;
                        }
                    }
                }
                $truncate .= substr($line_matchings[2], 0, $left+$entities_length);
                // maximum length is reached, so get off the loop
                break;
            } else {
                $truncate .= $line_matchings[2];
                $total_length += $content_length;
            }
            // if the maximum length is reached, get off the loop
            if($total_length>= $length) {
                break;
            }
        }
    } else {
        if (strlen($text) <= $length) {
            return $text;
        } else {
            $truncate = substr($text, 0, $length - strlen($ending));
        }
    }
    // if the words shouldn't be cut in the middle...
    if (!$exact) {
        // ...search the last occurrence of a space...
        $spacepos = strrpos($truncate, ' ');
        if ($spacepos !== false) {
            // ...and cut the text in this position
            $truncate = substr($truncate, 0, $spacepos);
        }
    }
    // add the defined ending to the text
    $truncate .= $ending;
    if($considerHtml) {
        // close all unclosed html-tags
        foreach ($open_tags as $tag) {
            $truncate .= '</' . $tag . '>';
        }
    }
    return $truncate;
}

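With regular expressions alone it is not easy to keep track of lengths and occurrences while walking through the characters, so you could use a solution that combines regular expressions with DOM (most of the work in this workaround is done by DOM).

The whole idea is this:

1- If the last node is a DOMElement or DOMText:

  • If removing this node would leave the total length greater than the limit -> remove it.
  • If removing this node would leave the total length less than the limit -> truncate it.
  • Total length > limit? Repeat (1) : return the result.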

function truncateHTML($html, $limit = 20) {
    static $wrapper = null;
    static $wrapperLength = 0;
    // trim unwanted CR/LF characters
    $html = trim($html);
    // Remove HTML comments
    $html = preg_replace("~<!--.*?-->~", '', $html);
    // If $html is plain text
    if ((strlen(strip_tags($html)) > 0) && strlen(strip_tags($html)) == strlen($html))  {
        return substr($html, 0, $limit);
    }
    // If $html doesn't have a root element
    elseif (is_null($wrapper)) {
        if (!preg_match("~^\s*<[^\s!?]~", $html)) {
            // Defining a tag as our HTML wrapper
            $wrapper = 'div';
            $htmlWrapper = "<$wrapper></$wrapper>";
            $wrapperLength = strlen($htmlWrapper);
            $html = preg_replace("~><~", ">$html<", $htmlWrapper);
        }
    }
    // Calculating total length
    $totalLength = strlen($html);
    // If our input length is less than limit, we are done.
    if ($totalLength <= $limit) {
        if ($wrapper) {
            return preg_replace("~^<$wrapper>|</$wrapper>$~", "", $html);
        }
        return strlen(strip_tags($html)) > 0 ? $html : '';
    }
    // Initializing a DOM object to hold our HTML
    $dom = new DOMDocument;
    $dom->loadHTML($html,  LIBXML_HTML_NOIMPLIED  | LIBXML_HTML_NODEFDTD);
    // Initializing a DOMXPath object to query on our DOM
    $xpath = new DOMXPath($dom);
    // Query last node (this does not include comment or text nodes)
    $lastNode = $xpath->query("./*[last()]")->item(0);
    // While manipulating, when there is no HTML element left
    if ($totalLength > $limit && is_null($lastNode)) {
        if (strlen(strip_tags($html)) >= $limit) {
            $textNode = $xpath->query("//text()")->item(0);
            if ($wrapper) {
                $textNode->nodeValue = substr($textNode->nodeValue, 0, $limit );
                $html = $dom->saveHTML();
                return preg_replace("~^<$wrapper>|</$wrapper>$~", "", $html);
            } else {
                $lengthAllowed = $limit - ($totalLength - strlen($textNode->nodeValue));
                if ($lengthAllowed <= 0) {
                    return '';
                }
                $textNode->nodeValue = substr($textNode->nodeValue, 0, $lengthAllowed);
                $html = $dom->saveHTML();
                return strlen(strip_tags($html)) > 0 ? $html : '';
            }
        } else {
            $textNode = $xpath->query("//text()")->item(0);
            $textNode->nodeValue = substr($textNode->nodeValue, 0, -(($totalLength - ($wrapperLength > 0 ? $wrapperLength : 0)) - $limit));
            $html = $dom->saveHTML();
            return strlen(strip_tags($html)) > 0 ? $html : '';
        }
    }
    // If we have a text node after our last HTML element
    elseif ($nextNode = $lastNode->nextSibling) {
        if ($nextNode->nodeType === 3 /* DOMText */) {
            $nodeLength = strlen($nextNode->nodeValue);
            // If by removing our text node total length will be greater than limit
            if (($totalLength - ($wrapperLength > 0 ? $wrapperLength : 0)) - $nodeLength >= $limit) {
                // We should remove it
                $nextNode->parentNode->removeChild($nextNode);
                $html = $dom->saveHTML();
                return truncateHTML($html, $limit);
            }
            // If by removing our text node total length will be less than limit
            else {
                // We should truncate our text to fit the limit
                $nextNode->nodeValue = substr($nextNode->nodeValue, 0, ($limit - (($totalLength - ($wrapperLength > 0 ? $wrapperLength : 0)) - $nodeLength)));
                $html = $dom->saveHTML();
                // Caring about custom wrapper
                if ($wrapper) {
                    return preg_replace("~^<$wrapper>|</$wrapper>$~", "", $html);
                }
                return $html;
            } 
        }
    }
    // If current node is an HTML element 
    elseif ($lastNode->nodeType === 1 /* DOMElement */) {
        $nodeLength = strlen($lastNode->nodeValue);
        // If by removing current HTML element total length will be greater than limit
        if (($totalLength - ($wrapperLength > 0 ? $wrapperLength : 0)) - $nodeLength >= $limit) {
            // We should remove it
            $lastNode->parentNode->removeChild($lastNode);
            $html = $dom->saveHTML();
            return truncateHTML($html, $limit);
        }
        // If by removing current HTML element total length will be less than limit
        else {
            // We should truncate our node value to fit the limit
            $lastNode->nodeValue = substr($lastNode->nodeValue, 0, ($limit - (($totalLength - ($wrapperLength > 0 ? $wrapperLength : 0)) - $nodeLength)));
            $html = $dom->saveHTML();
            if ($wrapper) {
                return preg_replace("~^<$wrapper>|</$wrapper>$~", "", $html);
            }
            return $html;
        }
    }
}

Examples

1- The input shown below, with $limit = 16:

<div>some data from <span class="first">blahblah test</span> was <span class="second">good</span>test<p> something</p><span>letter</span></div>

is reduced to the final HTML step by step, as shown here:

Step 0: <div>some data from <span class="first">blahblah test</span> was <span class="second">good</span>test<p> something</p><span>letter</span></div>
Step 1: <div>some data from <span class="first">blahblah test</span> was <span class="second">good</span>test<p> something</p></div>
Step 2: <div>some data from <span class="first">blahblah test</span> was <span class="second">good</span>test</div>
Step 3: <div>some data from <span class="first">blahblah test</span> was <span class="second">good</span></div>
Step 4: <div>some data from <span class="first">blahblah test</span> was </div>
Step 5: <div>some data from <span class="first">blahblah test</span></div>
Step 6: <div>some data from </div>
Step 7: <div>some </div>

2- Your own example, with $limit = 50:

<p>this <a href="http://example.com">click me</a>jiasd</p>

outputs the expected HTML:

<p>this <a href="http://example.com">click</a></p>

3- Plain text is handled the same way ($limit = 10):

Hi how are you doing?

Output:

Hi how are

4- With an HTML comment included ($limit = 10):

<div>some data from <span class="first">blahblah test</span> was <span class="second">good</span>test<p> something</p><span class="text">hola</span><!-- comment --></div>

Output:

string(0) ""

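Why is it empty? Because when the function reaches its next-to-last step it is left with <div></div>, which has a length of 11. Nothing more can be cut from it, so it is removed entirely.

5- Output of the last example with $limit = 12: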
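
The arithmetic behind the empty result can be checked directly. This is only an illustration using strlen, the same byte count the function relies on; the variable name is made up for the example.

```php
<?php
// The smallest well-formed result for this input is an empty root element.
$emptyRoot = '<div></div>';

var_dump(strlen($emptyRoot));        // int(11)
var_dump(strlen($emptyRoot) <= 10);  // bool(false): does not fit $limit = 10
var_dump(strlen('<div>s</div>'));    // int(12): one text character fits once $limit = 12
```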

<div>s</div>

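PHP live demo

You should extract the HTML tags from the truncated string, count their characters, and then truncate the plain-text part of the string again.

Note that the following regular expression matches all HTML tags together with all their attributes (both opening and closing tags).

Right before the final return $truncate; add: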

preg_match_all('/<[^>]+>/', $truncate, $html_tags);
// $html_tags[0] holds the full matches (the tags themselves)
$html_tags_length = strlen(implode('', $html_tags[0]));

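At this point you can either extract the text part of the string and cut it down, or call your function recursively.

This is a bit heuristic, but it should work.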
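
As a rough sketch of that idea (not the answerer's code): count the bytes taken by all tags, subtract them from the limit, and use what is left as the budget for the text. The helper name textBudget is hypothetical, and the final reassembly below only works for a single wrapping element like the question's first example.

```php
<?php
// Hypothetical helper: bytes left for text once every tag has been counted.
function textBudget(string $html, int $limit): int
{
    preg_match_all('/<[^>]+>/', $html, $htmlTags);
    // $htmlTags[0] holds the full matches, i.e. the tags themselves.
    $tagsLength = strlen(implode('', $htmlTags[0]));
    return max(0, $limit - $tagsLength);
}

$html   = '<p>this is an example</p>';
$budget = textBudget($html, 20);                  // 20 - strlen('<p></p>') = 13
$text   = substr(strip_tags($html), 0, $budget);  // "this is an ex"

echo '<p>' . $text . '</p>', PHP_EOL;             // <p>this is an ex</p>
```

Note that the budget is computed from all tags in the input, so it is pessimistic when some tags end up being dropped by the truncation.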

function truncateHtml($text, $length = 100) {
    $current_size = strlen($text);
    $diff = strlen($text);
    $remainder = $current_size - $length;
    while($diff > 0 AND $remainder > 0) {
        $pattern = "/(.*)[^<>](?=<)/s";
        $text = preg_replace($pattern, "$1", $text);
        $diff = $current_size - strlen($text);
        $current_size = strlen($text);
        $remainder = $current_size - $length;
    }
    // if $diff == 0, there are no more characters that can be removed
    // if $remainder <= 0, no more characters need to be removed
    return $text;
}
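
To see what a single pass of the loop does, here is a minimal sketch that applies the same pattern once to the question's first example. Because (.*) is greedy, the match ends at the last text character sitting directly before a "<", and that one character is dropped.

```php
<?php
// Same pattern as in the function above, applied a single time.
$pattern = '/(.*)[^<>](?=<)/s';
$text    = '<p>this is an example</p>';

// Keep everything captured by (.*) and drop the one character after it.
$text = preg_replace($pattern, '$1', $text);

echo $text, PHP_EOL; // <p>this is an exampl</p>
```

Each iteration therefore removes one byte of text, working backwards from the end, and the loop stops once the string fits or no text in front of a tag is left to remove.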

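Here is the running code.

You can do this with a regular expression that splits the input and counts only the characters that are not part of a tag.

Example: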

<?php
function htmlTrim($inStr, $length) {
    $c = 0;
    $outStr =  preg_replace_callback(
        "/<.*?>|[^<>]*/",
        function($str) use (& $c, $length){
            $str = $str[0];
            if ($str && $str[0] == "<") { // Is tag.
                return $str;
            } else {
                if ($c >= $length) return ""; // Length already exceeded.
                $l = strlen($str);
                $c += $l;
                $overflow = $c - $length;
                if ($overflow > 0) {
                    return substr($str, 0,  $l - $overflow);
                } else {
                    return $str;
                }
            };
        },
        $inStr
    );
    return $outStr;
};
echo htmlTrim("<span>Hello <b>World foobar</b></span>", 11);
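
The function above counts only text characters, while the question asks for the tags to be included in the total. A minimal sketch of a variant that does this follows. It is not taken from any of the answers: the name truncateHtmlTotal is hypothetical, lengths are byte lengths (strlen, so multibyte text and entities may be cut in the middle), attribute values are assumed not to contain ">", and an element whose text is cut to nothing is kept as an empty element. The trick is to reserve room for the closing tag at the moment an element is opened.

```php
<?php
// Hypothetical helper: truncate HTML so that the byte length of the result,
// tags included, never exceeds $limit.
function truncateHtmlTotal(string $html, int $limit): string
{
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'param', 'source', 'track', 'wbr'];
    // Split the input into tags and runs of text.
    preg_match_all('~<[^>]+>|[^<]+~', $html, $m);
    $out = '';
    $stack = [];    // names of the currently open elements
    $closeLen = 0;  // bytes reserved for their closing tags

    foreach ($m[0] as $token) {
        if ($token[0] !== '<') {
            // Text: copy as much as still fits, then stop.
            $room = $limit - strlen($out) - $closeLen;
            if (strlen($token) <= $room) {
                $out .= $token;
                continue;
            }
            $out .= substr($token, 0, max(0, $room));
            break;
        }
        if (preg_match('~^<\s*/\s*([a-z][a-z0-9]*)~i', $token, $t)) {
            // Closing tag: its length was reserved when the element was opened.
            // Closing tags that do not match the innermost element are ignored.
            $name = strtolower($t[1]);
            if ($stack && end($stack) === $name) {
                array_pop($stack);
                $closeLen -= strlen($name) + 3;
                $out .= "</$name>";
            }
            continue;
        }
        $isElement = preg_match('~^<\s*([a-z][a-z0-9]*)~i', $token, $t) === 1;
        $name = $isElement ? strtolower($t[1]) : '';
        // Comments, void elements and <x/> never get a closing tag.
        $standalone = !$isElement || in_array($name, $void, true)
            || substr($token, -2) === '/>';
        // An opening tag also costs its future closing tag "</name>".
        $cost = strlen($token) + ($standalone ? 0 : strlen($name) + 3);
        if (strlen($out) + $closeLen + $cost > $limit) {
            break;
        }
        $out .= $token;
        if (!$standalone) {
            $stack[] = $name;
            $closeLen += strlen($name) + 3;
        }
    }
    // Close whatever is still open, innermost element first.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo truncateHtmlTotal('<p>this is an example</p>', 20), PHP_EOL;
// <p>this is an ex</p>
echo truncateHtmlTotal('<p>this <a href="http://example.com">click me</a>jiasd</p>', 50), PHP_EOL;
// <p>this <a href="http://example.com">click</a></p>
```

Both calls reproduce the outputs requested in the question, and each result is exactly as long as its limit.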