Select words and count from long text (PHP)

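I need to extract the hashtags from a long text and count them. I know this can be done with a regex, but I could not get it to work. I would appreciate any help. Here is my sample text: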

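#paris #love #spring #outdoor #life #istanbul #par #sacrecoeur #paris #france #latex #dog Thats what the world is, paris after all, an endless battle of contrasting memories. I can see clearly now the rain is gone. #music I can see all obstacles in my way. #paris #queenstreet #foreveronvocationNever felt more glamorous. #ski #music #skiing #skier #terrainpark #paris #snowboard #snowboarding #snowboarder #longboard #longboarding #longboarder #skateboard #skateboarder #skateboarding #winter #just my voice and my good friend Danny Marin will dj for our auditory exploration. #stack #over #flow to be or not to be #poem #music #paris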

I only need the tags such as "#paris". I want to count each tag and then print the tags sorted by count. For example:

#paris (6)
#music (3)
#... (2)
#... (2)
#... (1)
#... (1)
#... (1)

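preg_match_all('/(#\w+)/', $string, $array);
// $array[1] holds every captured tag; count how often each one occurs
$array = array_count_values($array[1]);
// sort by count, highest first, as in the question's example
arsort($array);
foreach ($array as $key => $value) {
    echo "$key ($value)<br>\n";
}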

That should give you what you need.

Edit: sorry, I forgot the array index.

Working example:
http://sandbox.onlinephpfunctions.com/code/d1fe24cbc8deedd24f7825ea4e48eaa691b8d401

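Split the string on '#' into an array.

Split each element of that array on ' ' and keep only the first word.

Count each token and store the counts in a parallel array.

Sort using the parallel array.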
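
The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name countTagsBySplitting and the sample string are made up for the example, and it uses one associative array (tag => count) in place of the parallel arrays mentioned in the steps, since that is the more common PHP idiom. It assumes each tag is followed by a space or the end of the text, so a tag written as "#paris." would keep its trailing dot.

```php
<?php
// Hypothetical helper (not from the original answers): counts hashtags by
// splitting on '#' and keeping only the first word of each piece.
function countTagsBySplitting(string $text): array
{
    $pieces = explode('#', $text);
    // Whatever precedes the first '#' is not a tag, so drop it.
    array_shift($pieces);

    $counts = [];
    foreach ($pieces as $piece) {
        // Keep only the first space-separated word of the piece.
        $tag = explode(' ', $piece, 2)[0];
        if ($tag === '') {
            continue; // a lone '#' or '# word' is not a tag
        }
        $counts[$tag] = ($counts[$tag] ?? 0) + 1;
    }

    // Sort by count, highest first, keeping the tags as keys.
    arsort($counts);
    return $counts;
}

// Made-up sample text for illustration.
$sample = '#paris #love #music I can see. #paris #music #flow to be #paris';
foreach (countTagsBySplitting($sample) as $tag => $count) {
    echo "#$tag ($count)\n";
}
```

Because only the first word of each piece is kept, the prose that follows a tag no longer ends up in the array keys.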

You can use array_count_values. Here is an example:

<?php
$html = <<< EOF
#paris #love #spring #outdoor #life #istanbul #par #sacrecoeur #paris #france #latex #dog Thats what the world is, paris after all, an endless battle of contrasting memories. I can see clearly now the rain is gone. #music I can see all obstacles in my way. #paris #queenstreet #foreveronvocationNever felt more glamorous. #ski #music #skiing #skier #terrainpark #paris #snowboard #snowboarding #snowboarder #longboard #longboarding #longboarder #skateboard #skateboarder #skateboarding #winter #just my voice and my good friend Danny Marin will dj for our auditory exploration. #stack #over #flow to be or not to be #poem #music #paris
EOF;
preg_match_all('/(#.*?\S+)/im', $html, $hTags, PREG_PATTERN_ORDER);
print_r(array_count_values($hTags[1]));

Output:

Array
(
    [#paris] => 5
    [#love] => 1
    [#spring] => 1
    [#outdoor] => 1
    [#life] => 1
    [#istanbul] => 1
    [#par] => 1
    [#sacrecoeur] => 1
    [#france] => 1
    [#latex] => 1
    [#dog] => 1
    [#music] => 3
    [#queenstreet] => 1
    [#foreveronvocationNever] => 1
    [#ski] => 1
    [#skiing] => 1
    [#skier] => 1
    [#terrainpark] => 1
    [#snowboard] => 1
    [#snowboarding] => 1
    [#snowboarder] => 1
    [#longboard] => 1
    [#longboarding] => 1
    [#longboarder] => 1
    [#skateboard] => 1
    [#skateboarder] => 1
    [#skateboarding] => 1
    [#winter] => 1
    [#just] => 1
    [#stack] => 1
    [#over] => 1
    [#flow] => 1
    [#poem] => 1
)

Regex explanation:

(#.*?\S+)
Match the regex below and capture its match into backreference number 1 «(#.*?\S+)»
   Match the character “#” literally «#»
   Match any single character that is NOT a line break character (line feed) «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match a single character that is NOT a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line) «\S+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
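
Since \S+ accepts any non-whitespace character, this pattern also picks up punctuation or a second "#" that directly follows a tag. The short sketch below compares it with the stricter #\w+ pattern. The input string is made up for illustration and is not part of the original post.

```php
<?php
// Made-up input: tags followed by punctuation, and two tags with no space between them.
$text = 'I love #paris, #music. and #ski#snow today';

// Pattern from the explanation above: '#', then a run of non-whitespace characters.
preg_match_all('/#.*?\S+/', $text, $loose);

// Stricter alternative: '#', then word characters only.
preg_match_all('/#\w+/', $text, $strict);

print_r($loose[0]);  // ['#paris,', '#music.', '#ski#snow']
print_r($strict[0]); // ['#paris', '#music', '#ski', '#snow']
```

On the sample text from the question, where every tag is followed by a space, both patterns find the same tags.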


If you prefer, you can also do this with plain PHP string functions:

$tagString = "#paris #love #spring #outdoor #life #istanbul #par #sacrecoeur #paris #france #latex #dog Thats what the world is, paris after all, an endless battle of contrasting memories. I can see clearly now the rain is gone. #music I can see all obstacles in my way. #paris #queenstreet #foreveronvocationNever felt more glamorous. #ski #music #skiing #skier #terrainpark #paris #snowboard #snowboarding #snowboarder #longboard #longboarding #longboarder #skateboard #skateboarder #skateboarding #winter #just my voice and my good friend Danny Marin will dj for our auditory exploration. #stack #over #flow to be or not to be #poem #music #paris";
$countArray = array();
foreach (explode("#", trim($tagString, '#')) as $tag) {
    $tag = trim($tag);
    if (array_key_exists($tag, $countArray)) {
        $countArray[$tag] = (int) $countArray[$tag] + 1;
    } else {
        $countArray[$tag] = 1;
    }
}
arsort($countArray);
var_dump($countArray);

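This gives the following. Note that the string is split only on "#", so any prose that follows a tag becomes part of the key, and "music" is therefore counted twice rather than three times: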

array(34) {
  ["paris"]=>
  int(5)
  ["music"]=>
  int(2)
  ["skateboard"]=>
  int(1)
  ["snowboarding"]=>
  int(1)
  ["snowboarder"]=>
  int(1)
  ["longboard"]=>
  int(1)
  ["longboarding"]=>
  int(1)
  ["longboarder"]=>
  int(1)
  ["skateboarder"]=>
  int(1)
  ["terrainpark"]=>
  int(1)
  ["skateboarding"]=>
  int(1)
  ["winter"]=>
  int(1)
  ["just my voice and my good friend Danny Marin will dj for our auditory exploration."]=>
  int(1)
  ["stack"]=>
  int(1)
  ["over"]=>
  int(1)
  ["flow to be or not to be"]=>
  int(1)
  ["snowboard"]=>
  int(1)
  ["skier"]=>
  int(1)
  ["love"]=>
  int(1)
  ["skiing"]=>
  int(1)
  ["ski"]=>
  int(1)
  ["foreveronvocationNever felt more glamorous."]=>
  int(1)
  ["queenstreet"]=>
  int(1)
  ["music I can see all obstacles in my way."]=>
  int(1)
  ["dog Thats what the world is, paris after all, an endless battle of contrasting memories. I can see clearly now the rain is gone."]=>
  int(1)
  ["latex"]=>
  int(1)
  ["france"]=>
  int(1)
  ["sacrecoeur"]=>
  int(1)
  ["par"]=>
  int(1)
  ["istanbul"]=>
  int(1)
  ["life"]=>
  int(1)
  ["outdoor"]=>
  int(1)
  ["spring"]=>
  int(1)
  ["poem"]=>
  int(1)
}

You can test it online here: http://sandbox.onlinephpfunctions.com/code/3058b887590845e33685b25e14e21df9959e94e7