I want to use REGEX to get all links from an HTML document, except those that have a specified class name.
For example:
<a href="someSite" class="className">qwe</a> <a href="someSite">qwe</a>
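As a result, I want to get only href="someSite" from the links whose class is not equal to "className".
I have created this regex:
(?<=<\s*a.*)href\s*?=\s*?("|').*?("|')
It returns exactly what I want, but from all links, and I don't know how to add an exception to my regex so that it does not return links with the specified class name.
Any help will be appreciated :)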
If you're open to using jQuery, you can do this without regex:
var list = $("a", document).filter(function () {
    // Keep only the links that do not have the excluded class.
    return !$(this).hasClass("className");
});
Assuming you have the HTML in a variable, you can use http://code.google.com/p/phpquery/wiki/Selectors (phpquery, a PHP jQuery-esque library).
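The other answers are sensible. But if for any reason you insist on a REGEX approach, try this.
I'm assuming you're running the REGEX in PHP (or .NET), since your pattern contains a lookbehind assertion, which JavaScript did not support until ES2018.
I've also separated the matching from the filtering out of those matches that have the bad class, since REGEX is not ideal for the latter (the class attribute could appear at any point in the link's opening tag).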
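$str = "<a href='bad_href' class='badClass'>bad link</a> <a href='good_href'>good link</a>";
// Step 1: match every anchor; group 1 captures the whole href attribute.
// Note: inside [...] \2 is not a backreference; the ungreedy /U flag is what stops at the closing quote.
preg_match_all('/<a.+(href ?= ?("|\')[^\2]*\2).*>.*<\/a>/U', $str, $matches);
// Step 2: discard the hrefs of anchors whose class attribute contains badClass.
foreach ($matches[0] as $key => $match) {
    if (preg_match('/class=(\'|")[^\1]*badClass[^\1]*\1/', $match)) {
        unset($matches[1][$key]);
    }
}
$matches = $matches[1]; // array containing "href='good_href'"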
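For readers working in JavaScript, the same two-step idea (match every anchor first, filter on the class afterwards) could be sketched roughly as follows. The helper name `hrefsWithoutClass` and its patterns are illustrative assumptions, not part of the answer above, and the sketch only handles quoted attribute values that contain no `>`.

```javascript
// Hypothetical helper: collect href values from anchors that lack a given class.
function hrefsWithoutClass(html, excludedClass) {
  const anchorRe = /<a\b[^>]*>/gi;                 // step 1: every <a ...> start tag
  const hrefRe = /\shref\s*=\s*(["'])(.*?)\1/i;    // quoted href value in group 2
  const classRe = /\sclass\s*=\s*(["'])(.*?)\1/i;  // quoted class value in group 2
  const result = [];
  for (const tag of html.match(anchorRe) || []) {
    const cls = classRe.exec(tag);
    // The class value is a whitespace-separated list of tokens.
    const classes = cls ? cls[2].split(/\s+/) : [];
    if (classes.includes(excludedClass)) continue; // step 2: filter out bad links
    const href = hrefRe.exec(tag);
    if (href) result.push(href[2]);
  }
  return result;
}

const sample =
  "<a href='bad_href' class='badClass'>bad link</a> <a href='good_href'>good link</a>";
console.log(hrefsWithoutClass(sample, "badClass")); // [ 'good_href' ]
```

Keeping the class test out of the main pattern means the attribute order inside the tag does not matter, which is the same reason the PHP version filters in a second pass.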
var aList = document.getElementsByTagName('a');
for (var i = 0; i < aList.length; i++) {
    // Skip links that carry the excluded class (YourClassName is a string you define).
    if (aList[i].classList.contains(YourClassName)) continue;
    //...
    //... Your code
}
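Disclaimer:
As others will or have already pointed out, using regex to parse non-regular languages is fraught with peril! It is best to use a dedicated parser specifically designed for the job, especially when parsing HTML tag soup.
However...
If you insist on using a regex, here is a tested PHP script implementing a regex solution that does a "pretty good" job: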
<?php // test.php Rev:20120626_2100
function strip_html_anchor_tags_not_having_class($text) {
    $re_html_anchor_not_having_class ='% # Rev:20120626_1300
    # Match an HTML 4.01 A element NOT having a specific class.
    <a\b                   # Anchor element start tag open delimiter
    (?:                    # Zero or more attributes before CLASS.
      \s+                  # Attributes are separated by whitespace.
      (?!class\b)          # Only non-CLASS attributes here.
      [A-Za-z][\w\-:.]*    # Attribute name is required.
      (?:                  # Attribute value is optional.
        \s*=\s*            # Name and value separated by =
        (?:                # Group for value alternatives.
          "[^"]*"          # Either a double-quoted string,
        | \'[^\']*\'       # or a single-quoted string,
        | [\w\-:.]+        # or a non-quoted string.
        )                  # End group of value alternatives.
      )?                   # Attribute value is optional.
    )*                     # Zero or more attributes before CLASS.
    (?:                    # Optional CLASS (but only if NOT MyClass).
      \s+                  # CLASS attribute is separated by whitespace.
      class                # (case insensitive) CLASS attribute name.
      \s*=\s*              # Name and value separated by =
      (?:                  # Group allowable CLASS value alternatives.
        (?-i)              # Use case-sensitive match for CLASS value.
        "                  # Either a double-quoted value...
        (?:                # Single-char-step through CLASS value.
          (?!              # Assert each position is NOT MyClass.
            (?<=["\s])     # Preceded by opening quote or space.
            MyClass        # (case sensitive) CLASS value to NOT be matched.
            (?=["\s])      # Followed by closing quote or space.
          )                # End assert each position is NOT MyClass.
          [^"]             # Safe to match next CLASS value char.
        )*                 # Single-char-step through CLASS value.
        "                  # Ok. DQ value does not contain MyClass.
      | \'                 # Or a single-quoted value...
        (?:                # Single-char-step through CLASS value.
          (?!              # Assert each position is NOT MyClass.
            (?<=[\'\s])    # Preceded by opening quote or space.
            MyClass        # (case sensitive) CLASS value to NOT be matched.
            (?=[\'\s])     # Followed by closing quote or space.
          )                # End assert each position is NOT MyClass.
          [^\']            # Safe to match next CLASS value char.
        )*                 # Single-char-step through CLASS value.
        \'                 # Ok. SQ value does not contain MyClass.
      |                    # Or a non-quoted, non-MyClass value...
        (?!                # Assert this value is NOT MyClass.
          MyClass          # (case sensitive) CLASS value to NOT be matched.
        )                  # Ok. NQ value is not MyClass.
        [\w\-:.]+          # Safe to match non-quoted CLASS value.
      )                    # End group of allowable CLASS values.
      (?:                  # Zero or more attribs allowed after CLASS.
        \s+                # Attributes are separated by whitespace.
        [A-Za-z][\w\-:.]*  # Attribute name is required.
        (?:                # Attribute value is optional.
          \s*=\s*          # Name and value separated by =
          (?:              # Group for value alternatives.
            "[^"]*"        # Either a double-quoted string,
          | \'[^\']*\'     # or a single-quoted string,
          | [\w\-:.]+      # or a non-quoted string.
          )                # End group of value alternatives.
        )?                 # Attribute value is optional.
      )*                   # Zero or more attributes after CLASS.
    )?                     # Optional CLASS (but only if NOT MyClass).
    \s*                    # Optional whitespace before closing >
    >                      # Anchor element start tag close delimiter
    (                      # $1: Anchor element contents.
      [^<]*                # {normal*} Zero or more non-<
      (?:                  # Begin {(special normal*)*} construct
        <                  # {special} Allow a < but only if
        (?!/?a\b)          # not the start of the </a> close tag.
        [^<]*              # more {normal*} Zero or more non-<
      )*                   # Finish {(special normal*)*} construct
    )                      # End $1: Anchor element contents.
    </a\s*>                # A element close tag.
    %ix';
    // Remove all matching start and end tags but keep the element contents.
    return preg_replace($re_html_anchor_not_having_class, '$1', $text);
}
$input = file_get_contents('testdata.html');
$output = strip_html_anchor_tags_not_having_class($input);
file_put_contents('testdata_out.html', $output);
?>
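function strip_html_anchor_tags_not_having_class($text)
This function strips the start and matching end tags of all HTML 4.01 anchor elements (i.e. <A> tags) that do NOT have a (case-sensitive) CLASS attribute value containing: MyClass. The CLASS value may contain any number of values, but one of them must be exactly: MyClass. The anchor tag names and CLASS attribute names are matched case-insensitively.
Example input (testdata.html):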
<h2>Paragraph contains links to be preserved (CLASS has "MyClass"):</h2>
<p>
Single DQ matching CLASS: <a href="URL" class="MyClass">Test 01</a>.
Single SQ matching CLASS: <a href="URL" class='MyClass'>Test 02</a>.
Single NQ matching CLASS: <a href="URL" class=MyClass>Test 03</a>.
Variable whitespace: <a href = "URL" class = MyClass >Test 04</a>.
Variable capitalization: <A HREF = "URL" CLASS = "MyClass" >Test 04</A>.
Reversed attribute order: <a class="MyClass" href="URL">Test 05</a>
Class before MyClass: <a href="URL" class="Pre MyClass">Test 06</a>.
Class after MyClass: <a href="URL" class="MyClass Post">Test 07</a>.
Sandwiched MyClass: <a href="URL" class="Pre MyClass Post">Test 08</a>.
Link with HTML content: <a class="MyClass" href="URL"><b>Test</b> 09</a>.
</p>
<h2>Paragraph contains links to be stripped (NO CLASS with "MyClass"):</h2>
<p>
Case does not match: <a href="URL" class="myclass">TEST 10</a>.
CLASS not whole word: <a href="URL" class="NotMyClass">TEST 11</a>.
No class attribute: <a href="URL">TEST 12</a>.
Link with HTML content: <a class="NotMyClass" href="URL"><b>Test</b> 13</a>.
</p>
Example output (testdata_out.html):
<h2>Paragraph contains links to be preserved (CLASS has "MyClass"):</h2>
<p>
Single DQ matching CLASS: <a href="URL" class="MyClass">Test 01</a>.
Single SQ matching CLASS: <a href="URL" class='MyClass'>Test 02</a>.
Single NQ matching CLASS: <a href="URL" class=MyClass>Test 03</a>.
Variable whitespace: <a href = "URL" class = MyClass >Test 04</a>.
Variable capitalization: <A HREF = "URL" CLASS = "MyClass" >Test 04</A>.
Reversed attribute order: <a class="MyClass" href="URL">Test 05</a>
Class before MyClass: <a href="URL" class="Pre MyClass">Test 06</a>.
Class after MyClass: <a href="URL" class="MyClass Post">Test 07</a>.
Sandwiched MyClass: <a href="URL" class="Pre MyClass Post">Test 08</a>.
Link with HTML content: <a class="MyClass" href="URL"><b>Test</b> 09</a>.
</p>
<h2>Paragraph contains links to be stripped (NO CLASS with "MyClass"):</h2>
<p>
Case does not match: TEST 10.
CLASS not whole word: TEST 11.
No class attribute: TEST 12.
Link with HTML content: <b>Test</b> 13.
</p>
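The rule these test cases exercise (the CLASS value is a whitespace-separated list, and one token must equal MyClass exactly, case-sensitively) can be stated more directly outside of regex. The following is a minimal JavaScript sketch of that rule only; `hasClassToken` is a hypothetical name and is not part of the PHP script above.

```javascript
// Hypothetical helper: true only when one whitespace-separated token of the
// class value equals the target exactly (case-sensitive, whole word).
function hasClassToken(classValue, target) {
  return classValue.split(/\s+/).includes(target);
}

console.log(hasClassToken("Pre MyClass Post", "MyClass")); // true  (Test 08 is preserved)
console.log(hasClassToken("myclass", "MyClass"));          // false (TEST 10 is stripped)
console.log(hasClassToken("NotMyClass", "MyClass"));       // false (TEST 11 is stripped)
```

The lookbehind and lookahead around MyClass in the regex enforce this same whole-token requirement one character position at a time.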
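Readers wishing to advance their regex skills would do well to study this (rather long and complex) regex. It is carefully handcrafted for both accuracy and speed, and it implements several advanced efficiency techniques. It is, of course, fully commented so that mere humans can read it. This example clearly demonstrates that "regular expressions" have evolved into a rich (non-regular) programming language.
Note that there will always be edge cases where this solution fails. For example, evil strings within CDATA sections, comments, scripts, styles and tag attribute values can trip it up. (See the disclaimer above.) That said, this solution will do a pretty good job in many cases (but will never be 100% reliable!).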
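To make the kind of edge case concrete, here is a small JavaScript illustration. It uses a deliberately simple pattern of my own (an assumption for demonstration, not the regex from the script above) and shows that a pattern-based scan has no notion of comments or scripts.

```javascript
// Deliberately simple pattern, used only for illustration.
const anchorRe = /<a\b[^>]*>/gi;

const html = [
  '<!-- <a href="old">commented out</a> -->',      // inside an HTML comment
  '<script>var s = "<a href=\'fake\'>";</script>', // inside a script string
  '<a href="real">real link</a>',                  // the only real link
].join("\n");

const found = html.match(anchorRe);
console.log(found.length); // 3, although only the last one is a real link
```

A real HTML parser would report a single anchor element here, which is why the disclaimer recommends one whenever the input is not under your control.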