Unicode (UTF8) string word count in PHP

Keywords: string, UTF8, Unicode, PHP | Updated: 2023-09-27

I need to get the word count of the following Unicode string. Using str_word_count:

$input = 'Hello, chào buổi sáng'; 
$count = str_word_count($input);
echo $count;

The result is

7

which is obviously wrong.

How do I get the desired result (4)?
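For context, the PHP manual notes that str_word_count does not support multibyte locales, so the bytes of each accented UTF-8 character typically act as word breaks. The sketch below is only an illustration: it prints the fragments str_word_count sees (the exact fragments may depend on the active locale) and, for comparison, counts by splitting on Unicode whitespace with the /u modifier. The variable names are made up for this example.

```php
<?php
$input = 'Hello, chào buổi sáng';

// Format 1 returns the pieces that str_word_count treats as words,
// which makes it easy to see where the accented characters split them.
$fragments = str_word_count($input, 1);
print_r($fragments);

// Splitting on runs of Unicode whitespace keeps each accented word intact.
$words = preg_split('/\s+/u', $input, -1, PREG_SPLIT_NO_EMPTY);
echo count($words), "\n"; // 4
```

Note that the whitespace split leaves punctuation attached to the words ("Hello," counts as one word), which is fine for counting but not for extracting clean tokens.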

$tags = 'Hello, chào buổi sáng'; 
$word = explode(' ', $tags);
echo count($word);

Here is a demo: http://codepad.org/667Cr1pQ

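Here is a quick and dirty regex-based (Unicode-aware) word-count function:

function mb_count_words($string) {
    // \pL = any letter, \pN = any number, \p{Pd} = any dash punctuation
    preg_match_all('/[\pL\pN\p{Pd}]+/u', $string, $matches);
    return count($matches[0]);
}

A "word" is anything that consists of one or more of the following:

any letter
any digit
any hyphen/dash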

This means the following contains 5 "words" (4 normal words, 1 hyphenated):

 echo mb_count_words('Hello, chào buổi sáng, chào-sáng');

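Now, this function is not well suited to very large texts, although it should be able to handle most blocks of text found on the Internet. That is because preg_match_all has to build and populate a large array, only to discard it as soon as it has been counted (which is very inefficient). A more efficient way to count would be to walk through the text character by character, identify Unicode whitespace sequences, and increment an auxiliary variable. It would not be that difficult, but it is tedious and takes time.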
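One possible way to avoid the large array, sketched here under assumptions, is to count one match at a time. This is not the character-by-character scan described above; it swaps in repeated preg_match calls with a byte offset and the \G anchor, keeping the same definition of a "word". The function name mb_count_words_incremental is hypothetical, and the approach has not been benchmarked, so treat the memory benefit as plausible rather than measured.

```php
<?php
// Hypothetical helper (not from the answer above): counts words one match
// at a time, so an array holding every match is never built.
function mb_count_words_incremental(string $string): int
{
    $count  = 0;
    $offset = 0; // byte offset at which the next search starts

    // \G anchors the match at $offset: skip any non-word characters, then
    // consume one run of letters, numbers or dashes.
    $pattern = '/\G[^\pL\pN\p{Pd}]*[\pL\pN\p{Pd}]+/u';

    while (preg_match($pattern, $string, $m, 0, $offset) === 1) {
        $count++;
        // Advance past the skipped characters and the word just counted.
        $offset += strlen($m[0]);
    }

    return $count;
}

echo mb_count_words_incremental('Hello, chào buổi sáng, chào-sáng'); // 5
```

Because each match ends on a complete character, the offset always stays on a valid UTF-8 boundary. If the input is not valid UTF-8, preg_match returns false and the loop simply stops at that point.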

You can use this function to count the Unicode words in a given string:

function count_unicode_words( $unicode_string ){
  // First remove all the punctuation marks & digits
  $unicode_string = preg_replace('/[[:punct:][:digit:]]/', '', $unicode_string);
  // Now replace every whitespace character (tab, new line, etc.) with a plain space
  $unicode_string = preg_replace('/[[:space:]]/', ' ', $unicode_string);
  // The words are now separated by spaces and can be split into an array
  // I have included \n\r\t here as well, but a space alone would also suffice
  $words_array = preg_split( "/[\n\r\t ]+/", $unicode_string, 0, PREG_SPLIT_NO_EMPTY );
  // Now we can get the word count by counting array elements
  return count($words_array);
}

All credit goes to the author.

I use this code to count words. You can try this:

$s = 'Hello, chào buổi sáng'; 
$s1 = array_map('trim', explode(' ', $s));
$s2 = array_filter($s1, function($value) { return $value !== ''; });
echo count($s2);