I'm trying to "find and replace" glossary terms on my website.
The terms are pulled from my database and built into a simple array of strings, like this:
/* get the glossary terms */
$results = $wpdb->get_results( 'SELECT post_title AS list FROM wp_posts WHERE post_status="publish" AND post_type="glossary" AND post_parent>0' );
$glossary_terms = array();
foreach ( $results as $row ) {
    $term = preg_quote( str_replace( array("/", "'"), array("/", '"'), $row->list ) );
    $glossary_terms[] = $term;
}
This $glossary_terms array is then used as $glossary in the following function:
$urls = array();
$pattern = array();
// build a normalized lookup (case-insensitive, whitespace-agnostic)
foreach ($glossary as $term) {
    $term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
    $pattern[] = preg_replace('/ /', '\\s+', preg_quote($term_norm));
    $initial = substr($term, 0, 1);
    $urls[$term_norm] = '/dev/glossary/' . $initial . '/' . rawurlencode($term);
    $rels[$term_norm] = '/dev/glossary/' . $initial . '/' . rawurlencode($term) . '?preview=true';
    $title[$term_norm] = $term;
}
$pattern = '/\b(' . implode('|', $pattern) . ')\b/i';
Now, $pattern contains the full list of words. An excerpt, including a few terms that I think might be causing my problem, is:
MANGROVE\\s+TREE|MANTLE|MARACYN|MARACYN\\-2|MARBLED|MARGIN|MARGINAL|MARINE|MOTROPHY|MATURE|MAXILLA|MAXILRAY|MEANDER|MEDIAL|MEDIAN|MELANIN|MELANOPHORE|MEMBRANE|MENISCUS|MENTAL|MENTAL\\s+BARBEL|MERISTIC|MERISTICS|MERISTIC\\s+CHARTER|MESETHMOID|MESIAL|MESO\\-|MESOCORACOID|META\\-|代谢|代谢\\s+蓝色|微生物|微生物捕食者|微生物|微小陨石|迁移率|MIGRATION|MILLITRE\\s+\\
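The problem I'm having is that the filter is running amok and linking every single space and word in $content.
My question is: which terms in $pattern (per the pastebin/excerpt) are causing this? I suspect it has something to do with ' and BAUDELOT'S\s+LIGAMENT, but I'm not sure how to correct it, since preg_quote doesn't seem to escape the apostrophe?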
EDIT: Here is the additional code, to help determine whether the problem lies here rather than in the preg_replace:
$text_nodes = $xpath->query('//text()[not(ancestor::a)]');
foreach($text_nodes as $original_node) {
    $text = $original_node->nodeValue;
    $hitcount = preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    if ($hitcount == 0) continue;
    $offset = 0;
    $parent = $original_node->parentNode;
    $refnode = $original_node->nextSibling;
    $parent->removeChild($original_node);
    foreach ($matches[0] as $i => $match) {
        $term_txt = $match[0];
        $term_pos = $match[1];
        $term_norm = preg_replace('/\s+/', ' ', strtoupper($term_txt));
        // insert any text before the term instance
        $prefix = substr($text, $offset, $term_pos - $offset);
        $parent->insertBefore($document->createTextNode($prefix), $refnode);
        // insert the actual term instance as a link
        $link = $document->createElement("a", $term_txt);
        $link->setAttribute("href", $urls[$term_norm]);
        $link->setAttribute("rel", $rels[$term_norm]);
        $link->setAttribute("class", "link_glossary");
        $parent->insertBefore($link, $refnode);
        $offset = $term_pos + strlen($term_txt);
        if ($i == $hitcount - 1) { // last match, append remaining text
            $suffix = substr($text, $offset);
            $parent->insertBefore($document->createTextNode($suffix), $refnode);
        }
    }
}
Thanks in advance,
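"but I'm not sure how to correct it, since preg_quote doesn't seem to escape the apostrophe?"
preg_quote doesn't need to escape apostrophes, because they are not special characters in a regular expression.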
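To see this for yourself, here is a minimal sketch. The sample term and sentence are made up for illustration, and the delimiter argument passed to preg_quote is my addition rather than something from the question's code:

```php
<?php
// Hypothetical sample term; any entry containing an apostrophe behaves the same way.
$term = "BAUDELOT'S LIGAMENT";

// preg_quote() only escapes regex metacharacters; the apostrophe is not one of them,
// so the term comes back unchanged.
$quoted = preg_quote($term, '/');

// Allow any run of whitespace between the words, as the question's code does.
$pattern = '/\b(' . str_replace(' ', '\s+', $quoted) . ')\b/i';

// Hypothetical input text with extra spaces and different letter case.
$matched = preg_match($pattern, "The Baudelot's   ligament is a bone.", $m);

var_dump($quoted === $term); // the apostrophe was left alone
var_dump($matched, $m[1]);   // the term is still found in the text
```

So the apostrophe on its own should not be what makes the pattern run wild.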
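I don't see why this regex would match every space and every word that isn't in the list.
One problem I do see, though, is that you wrap the alternation in word boundaries (\b). That is a problem for terms that end in a non-word character, such as "MACRO\\-|" or "MESO\\-|MESOCORACOID|META\\-|". Of course, such a term will still match if a word character directly follows the dash. (I don't know the text you want to match.)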
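A small sketch of the boundary problem, with one possible workaround. The two terms and the sample sentence are hypothetical, and swapping \b for the lookarounds (?<!\w) and (?!\w) is my suggestion, not something taken from your code:

```php
<?php
// Hypothetical terms: one ends in a word character, one ends in a dash.
$terms = array('MANTLE', 'MESO-');

$quoted = array();
foreach ($terms as $t) {
    $quoted[] = preg_quote($t, '/');
}
$alternation = implode('|', $quoted);

// Style used in the question: \b on both sides of the alternation.
$with_b = '/\b(' . $alternation . ')\b/i';

// Possible alternative: require "no word character" on either side.
$with_lookaround = '/(?<!\w)(' . $alternation . ')(?!\w)/i';

// Hypothetical text where the dash term is followed by a space.
$text = 'The meso- prefix and the mantle.';

preg_match_all($with_b, $text, $a);
preg_match_all($with_lookaround, $text, $b);

print_r($a[1]); // only "mantle": there is no \b between "-" and " "
print_r($b[1]); // "meso-" and "mantle"
```

Note the trade-off: with the lookaround version, "meso-" directly followed by a letter (as in "meso-pelagic") no longer matches, whereas the \b version matches only in that case. Which behaviour you want depends on your text.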
$term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));
You need preg_quote($term) there ^
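One way to read this, as a rough sketch only: keep the terms unquoted when they come out of the database, use the normalized but unquoted term as the lookup key, and quote exactly once when building the pattern. The sample terms are hypothetical, and passing '/' as the delimiter to preg_quote is an assumption on my part:

```php
<?php
// Hypothetical raw terms as they might come from the database (not yet quoted).
$glossary = array('Mental  barbel', 'Maracyn-2', "Baudelot's ligament");

$urls = array();
$alternatives = array();
foreach ($glossary as $term) {
    // Lookup key: upper-cased, whitespace collapsed, NOT regex-quoted,
    // so it equals what strtoupper() of a matched string will produce.
    $term_norm = preg_replace('/\s+/', ' ', strtoupper(trim($term)));

    // Quote exactly once, passing the delimiter so "/" gets escaped as well.
    $quoted = preg_quote($term_norm, '/');

    // Let a single space in the term match any run of whitespace.
    $alternatives[] = str_replace(' ', '\s+', $quoted);

    $urls[$term_norm] = '/dev/glossary/' . substr($term, 0, 1) . '/' . rawurlencode($term);
}
$pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';

echo $pattern, "\n";
```

If the terms are already quoted before they reach this loop, the keys of $urls contain backslashes (for example MARACYN\-2), so the later lookup by the matched text would miss them.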