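I'm trying to get the ratio of text to HTML on a given web page. I use strip_html_tags to remove the HTML tags and compare the result with the page's original content to get the ratio. My concern is that my strip_html_tags function may not catch every tag on the page. Is there a better way to do this, perhaps by simply replacing everything that starts with < and ends with >? I can already tell that I'm missing many tags that the regular expressions should remove, but there must be a better way to do all of this.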
function strip_html_tags($text)
{
    $text = preg_replace(array(
        // Remove invisible content
        '@<head[^>]*?>.*?</head>@siu',
        '@<style[^>]*?>.*?</style>@siu',
        '@<script[^>]*?.*?</script>@siu',
        '@<object[^>]*?.*?</object>@siu',
        '@<embed[^>]*?.*?</embed>@siu',
        '@<applet[^>]*?.*?</applet>@siu',
        '@<noframes[^>]*?.*?</noframes>@siu',
        '@<noscript[^>]*?.*?</noscript>@siu',
        '@<noembed[^>]*?.*?</noembed>@siu',
        // Block-level tags
        '@</?((address)|(blockquote)|(center)|(del))@iu',
        '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
        '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
        '@</?((table)|(th)|(td)|(caption))@iu',
        '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
        '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
        '@</?((frameset)|(frame)|(iframe))@iu',
        '#<[\/\!]*?[^<>]*?>#siu', // Strip out HTML tags
        '#<![\s\S]*?--[ \t\n\r]*>#siu' // Strip multi-line comments including CDATA
    ), array(
        // One space for each of the nine invisible-content patterns
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        ' ',
        // Newline plus the matched text for the next eight patterns;
        // the last pattern has no entry, so its matches become an empty string
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0",
        "\n\$0"
    ), $text);
    return strip_tags($text);
}
function check_ratio($url)
{
    $file_content = // getting data from curl request here
    $page_size = mb_strlen($file_content, '8bit');
    $content = strip_html_tags($file_content);
    $text_size = mb_strlen($content, '8bit');
    $content = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", " ", $content);
    $len_real = strlen($file_content);
    $len_strip = strlen($content);
    return round((($len_strip / $len_real) * 100), 2);
}
Why reinvent the wheel?
There is a better way: http://php.net/manual/en/function.strip-tags.php
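As a rough illustration of that suggestion, here is a minimal sketch of the ratio computed with strip_tags() alone. The helper name text_to_html_ratio and the whitespace collapsing are my own assumptions, not part of the question's code. Note that strip_tags() removes the tags but keeps whatever sits between <script> or <style> tags, so that text is still counted.

```php
<?php
// Hypothetical helper (not from the original post): percentage of the page's
// bytes that remain once strip_tags() has removed the markup.
function text_to_html_ratio(string $html): float
{
    if ($html === '') {
        return 0.0; // avoid division by zero on an empty page
    }
    // strip_tags() drops the tags themselves but keeps the text inside
    // <script> and <style>, so those bytes still count as "text" here.
    $text = strip_tags($html);
    // Collapse whitespace runs so that indentation does not inflate the size.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    return round(strlen($text) / strlen($html) * 100, 2);
}

$html = '<html><body><p>Hello</p></body></html>';
echo text_to_html_ratio($html), "\n";
```

If script and style content should not count as text, it has to be removed before strip_tags() runs, which is what the first nine patterns in the question are for.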
DOMNode::$textContent could serve as a starting point:
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents('http://www.google.com'));
libxml_use_internal_errors(false);
$items = $domd->getElementsByTagName('body');
var_dump($items[0]->textContent);
It also includes data from tags you probably wouldn't consider "text", such as <style> or <script>, but accounting for that shouldn't be hard.
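One possible way to account for that, sketched under my own assumptions: remove the <script>, <style> and <noscript> elements from the parsed document before reading textContent. The helper name visible_text and the choice of which elements to drop are illustrative, not something the answer above prescribes.

```php
<?php
// Hypothetical helper: text content of <body> after the elements that hold
// non-visible text have been removed from the DOM tree.
function visible_text(string $html): string
{
    if ($html === '') {
        return ''; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the node list into an array first, then detach each node.
    $unwanted = iterator_to_array($xpath->query('//script | //style | //noscript'));
    foreach ($unwanted as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    // Collapse whitespace so the length reflects the text, not the layout.
    return trim(preg_replace('/\s+/', ' ', $body->textContent));
}

$page = '<html><head><style>p{color:red}</style></head>'
      . '<body><p>Hello <b>world</b></p><script>var a = 1;</script></body></html>';
echo visible_text($page), "\n";
```

The length of the returned string divided by the length of the original page would then give the ratio the question asks about.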
Here is a way to do it using a regular expression.
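Update 1:
- Had to add an atomic group around the tag body of the invisible-content tags,
or unbalanced quotes could cause catastrophic backtracking.
- Added a list of invisible content that will be removed:
script, style, head, object, embed, applet, noframes, noscript, noembed
If there is no closing tag, only the tag itself is removed; otherwise the content is removed along with the tags.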
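To make the atomic group point a bit more concrete, here is a small sketch with two deliberately simplified patterns (they are my own, not taken from the regex in this answer). Once an atomic group (?>...) has matched, the engine will not go back into it to try shorter matches, which is what stops the repeated retries on unbalanced input.

```php
<?php
// Ordinary group: a+ first takes "aaa", then gives one "a" back so that the
// trailing "ab" can match.
$backtracking = preg_match('/(?:a+)ab/', 'aaab');

// Atomic group: whatever a+ consumed is kept, so no "a" is left over for the
// trailing "ab" and the match fails at every starting position.
$atomic = preg_match('/(?>a+)ab/', 'aaab');

var_dump($backtracking, $atomic);
```

The same mechanism applies to the attribute alternation inside the invisible-content branch: a quoted value that has been consumed is not reopened when the rest of the tag fails to match.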
Find, raw regex:
<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Replace with nothing.
Various string / delimited representations
Delimiter only: /<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/
Single Quote & Delimiter: '/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|\'[\S\s]*?\'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|\'[\S\s]*?\'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/'
Double Quote only: "<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\\s+(?>\"[\\S\\s]*?\"|'[\\S\\s]*?'|(?:(?!/>)[^>])?)+)?\\s*>)[\\S\\s]*?</\\1\\s*(?=>))|(?:/?[\\w:]+\\s*/?)|(?:[\\w:]+\\s+(?:\"[\\S\\s]*?\"|'[\\S\\s]*?'|[^>]?)+\\s*/?)|\\?[\\S\\s]*?\\?|(?:!(?:(?:DOCTYPE[\\S\\s]*?)|(?:\\[CDATA\\[[\\S\\s]*?\\]\\])|(?:--[\\S\\s]*?--)|(?:ATTLIST[\\S\\s]*?)|(?:ENTITY[\\S\\s]*?)|(?:ELEMENT[\\S\\s]*?))))>"
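For PHP specifically, a minimal sketch of how the "Delimiter only" form might be plugged into preg_replace(). The helper name strip_markup is hypothetical, and storing the pattern in a nowdoc is just one way to avoid a second layer of string escaping. No modifiers are added, matching the benchmark's "Options: none", so tag names are matched case-sensitively.

```php
<?php
// The "Delimiter only" form, kept verbatim in a nowdoc so that backslashes
// and quotes need no further escaping.
$tagPattern = <<<'REGEX'
/<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+)?\s*>)[\S\s]*?<\/\1\s*(?=>))|(?:\/?[\w:]+\s*\/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*\/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>/
REGEX;

// Hypothetical helper: replace every match with nothing.
function strip_markup(string $html, string $pattern): string
{
    $result = preg_replace($pattern, '', $html);
    // preg_replace() returns null on failure, for example when a PCRE limit is hit.
    if ($result === null) {
        throw new RuntimeException('preg_replace failed, error code ' . preg_last_error());
    }
    return $result;
}

// A closed <script> element is removed together with its content.
var_dump(strip_markup('<p>Hello <b>world</b></p><script>var x = "<b>";</script>!', $tagPattern));
// Without a closing tag, only the <script> tag itself is removed.
var_dump(strip_markup('a<script>b', $tagPattern));
```

The ratio from the question would then be the length of the stripped string divided by the length of the original page.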
Expanded
 # <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

 <
 (?:
      (?:
           (?:
                # Invisible content; end tag req'd
                (                             # (1 start)
                     script
                  |  style
                  |  head
                  |  object
                  |  embed
                  |  applet
                  |  noframes
                  |  noscript
                  |  noembed
                )                             # (1 end)
                (?:
                     \s+
                     (?>
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>]
                          )?
                     )+
                )?
                \s* >
           )
           [\S\s]*? </ \1 \s*
           (?= > )
      )
   |  (?: /? [\w:]+ \s* /? )
   |  (?:
           [\w:]+
           \s+
           (?:
                " [\S\s]*? "
             |  ' [\S\s]*? '
             |  [^>]?
           )+
           \s* /?
      )
   |  \? [\S\s]*? \?
   |  (?:
           !
           (?:
                (?: DOCTYPE [\S\s]*? )
             |  (?: \[CDATA\[ [\S\s]*? \]\] )
             |  (?: -- [\S\s]*? -- )
             |  (?: ATTLIST [\S\s]*? )
             |  (?: ENTITY [\S\s]*? )
             |  (?: ELEMENT [\S\s]*? )
           )
      )
 )
 >
Benchmark:
Regex1: <(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
Options: < none >
Completed iterations: 3 / 3 ( x 1000 )
Matches found per iteration: 3780
Elapsed Time: 43.52 s, 43523.08 ms, 43523084 µs
Sample analysis, page size 126,000 bytes:
3,780 tags / page
x 3,000 iterations
--------------------------
11,340,000 total tags
/ 43.52 seconds
--------------------------
260,569 tags / second
/ 3,780 tags / page
--------------------------
~69 pages / second
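The arithmetic above can be reproduced with a short sketch. The function names throughput and time_regex are hypothetical, and the timing loop is only an assumption about how such a run could be set up in PHP, not the tool that produced the figures above. Plugging in the reported numbers gives roughly 260,570 tags per second and a little under 69 pages per second, in line with the rounded result above.

```php
<?php
// Hypothetical helper: turn a timed run into tags/second and pages/second.
function throughput(int $tagsPerPage, int $iterations, float $seconds): array
{
    $totalTags = $tagsPerPage * $iterations;
    return [
        'total_tags'       => $totalTags,
        'tags_per_second'  => $totalTags / $seconds,
        // One iteration scans one page, so pages/second is iterations/time.
        'pages_per_second' => $iterations / $seconds,
    ];
}

// Hypothetical timing loop: run the pattern over the same page repeatedly
// and return the elapsed wall-clock time in seconds.
function time_regex(string $pattern, string $page, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        preg_match_all($pattern, $page);
    }
    return microtime(true) - $start;
}

// Figures reported in the benchmark above.
$stats = throughput(3780, 3000, 43.52);
printf(
    "%d total tags, %.0f tags/s, %.1f pages/s\n",
    $stats['total_tags'],
    $stats['tags_per_second'],
    $stats['pages_per_second']
);
```

Absolute numbers will depend on the machine, the PCRE build and whether JIT is enabled, so they are only comparable between runs made on the same setup.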