Extract X number of words surrounding a given search string within a string


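I am looking for a way to extract X number of words on either side of a given word in a search.

For example, if a user enters "inmate" as a search word and the MySQL query finds an article whose content contains "inmate", I would like to return not the entire article but just X words on either side of the match. That gives the user the gist of the article, so they can decide whether to continue and read it in full.

I am using PHP.

Thanks!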

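You probably can't solve this completely with a regex. There are too many possibilities for other characters between the words.

But you can try this regex:

((?:\S+\s*){0,5}\S*inmate\S*(?:\s*\S+){0,5})

See it here on Rubular.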

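You might also want to exclude certain characters, since they should not count as words. Right now the regex counts any sequence of non-whitespace characters surrounded by whitespace as a word.

To match only real words:

((?:\w+\s*){0,5}<search word>(?:\s*\w+){0,5})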

But here any non-word character (, " . and so on) terminates the match.
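To see that limitation in practice, here is a small sketch; the sample sentence is made up for illustration and is not from the question. The comma right after the search word stops the trailing context from being captured.

```php
<?php
// Illustrative sample text (an assumption, not from the original post).
$text = 'the guard said the inmate, who had escaped';

// \w-only version of the pattern, with "inmate" as the search word.
preg_match_all('/(?:\w+\s*){0,5}inmate(?:\s*\w+){0,5}/', $text, $matches);

// The match ends at "inmate" because the following comma is not a word
// character, so none of the words after it are included.
echo $matches[0][0], "\n";
```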

So you can go on...

((?:[\w"',.-]+\s*){0,5}["',.-]?<search word>["',.-]?(?:\s*[\w"',.-]+){0,5})

This also matches the 5 words around your search word when they contain any of the characters " ' , . -

To use it in PHP:

$sourcestring = "For example, if a user enters \"inmate\" as a search word and the MySQL";
preg_match_all('/(?:\S+\s*){0,5}\S*inmate\S*(?:\s*\S+){0,5}/s', $sourcestring, $matches);
echo $matches[0][0]; // you might have more matches, they will be in $matches[0][x]
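In practice the search word comes from the user and the word count varies, so the pattern is usually built at runtime. The following is only a sketch under some assumptions: `extractContext` is a hypothetical helper name, the `i` flag is my addition, and it uses `\s+` instead of `\s*` between words, a small variation on the pattern above that requires real whitespace between the counted words.

```php
<?php
// Hypothetical helper (not part of the original answer): returns up to $n
// whitespace-delimited words on each side of every occurrence of $term.
function extractContext(string $text, string $term, int $n = 5): array
{
    $n = max(0, $n); // a negative count would produce an invalid quantifier

    // preg_quote() escapes regex metacharacters in the user's input; the
    // second argument also escapes the '/' delimiter used below.
    $quoted = preg_quote($term, '/');

    $pattern = '/(?:\S+\s+){0,' . $n . '}\S*' . $quoted
             . '\S*(?:\s+\S+){0,' . $n . '}/i';

    if (preg_match_all($pattern, $text, $matches) === false) {
        return []; // the regex engine reported an error
    }
    return $matches[0];
}

$text = 'For example, if a user enters "inmate" as a search word and the MySQL query finds it';
print_r(extractContext($text, 'inmate', 3));
```

Escaping the term with `preg_quote` matters here because a search such as `a.b` or `(inmate` would otherwise be interpreted as regex syntax.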

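I would use this regex in PHP, which also takes UTF-8 characters into account:

'~(?:[\p{L}\p{N}\']+[^\p{L}\p{N}\']+){0,5}<search word>(?:[^\p{L}\p{N}\']+[\p{L}\p{N}\']+){0,5}~u'

In this case '~' is the delimiter, and the trailing modifier 'u' tells PHP to treat both the pattern and the subject string as UTF-8.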
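As a rough illustration of the Unicode-aware pattern on non-ASCII text, here is a sketch. The function name `extractContextUtf8`, the German sample sentence, and the added `i` flag are assumptions for the example, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: a word is a run of Unicode letters, digits or
// apostrophes; everything else counts as a gap between words.
function extractContextUtf8(string $text, string $term, int $n = 5): array
{
    $n = max(0, $n);
    $quoted = preg_quote($term, '~');   // escape user input and the '~' delimiter
    $word = '[\p{L}\p{N}\']+';
    $gap  = '[^\p{L}\p{N}\']+';

    $pattern = '~(?:' . $word . $gap . '){0,' . $n . '}' . $quoted
             . '(?:' . $gap . $word . '){0,' . $n . '}~iu';

    if (preg_match_all($pattern, $text, $matches) === false) {
        return []; // for example, when the subject is not valid UTF-8
    }
    return $matches[0];
}

// Illustrative sample text (not from the original post).
$text = 'Der Wärter öffnete die Tür, und der Häftling trat langsam in den Hof.';
print_r(extractContextUtf8($text, 'Häftling', 2));
```

Because punctuation falls into the gap class, a comma or quote next to a word no longer cuts the context short, unlike the `\w`-only pattern.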

See the documentation on Unicode regex identifiers:

http://www.regular-expressions.info/refunicode.html