How to split Tamil characters in a string in PHP

How do I split the Tamil characters in a string?

When I use preg_match_all('/./u', $str, $results),
I get "த", "ம", "ி", "ழ" and "்".

How can I get the combined characters "த", "மி" and "ழ்"?

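I think you should be able to use the grapheme_extract function to iterate over the combined characters (technically called "grapheme clusters").

Alternatively, if you prefer the regex approach, I think you can use this:

preg_match_all('/\pL\pM*|./u', $str, $results)

where \pL means a Unicode "letter" and \pM means a Unicode "mark".

(Disclaimer: I have not tested either of these approaches.)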
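
For readers who want to check the letter-plus-marks grouping outside PHP, here is a rough Python analogue of the regex idea. Python's built-in re module does not support \pL or \pM, so this sketch uses unicodedata.category instead. The function name split_letters_and_marks is made up for illustration, and the sketch only approximates the pattern (for example, it does not reproduce how "." treats newlines).

```python
import unicodedata


def split_letters_and_marks(text):
    """Group each Unicode letter with the combining marks that follow it.

    Any other character, including a mark with no letter before it,
    becomes its own element, roughly like the '|.' branch of the regex.
    """
    clusters = []
    after_letter = False  # True while the last cluster started with a letter
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("M") and after_letter:
            # Mn, Mc and Me are the "mark" categories: attach to the letter
            clusters[-1] += ch
        else:
            clusters.append(ch)
            after_letter = category.startswith("L")
    return clusters


word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # தமிழ்
print(split_letters_and_marks(word))  # ['த', 'மி', 'ழ்']
```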

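If I understand your question correctly, you have a Unicode string made up of code points and you want to convert it into an array of graphemes?

I'm working on an open-source Python library that performs tasks like this for a Tamil-language website.

It's been a while since I've used PHP, so I'll post the logic. You can look at the code in the split_letters() function of the amuthaa/TamilWord.py file.

As ruakh mentioned, Tamil graphemes are built up from code points.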

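  • The vowels (உயிர் எழுத்து), the aytham (ஆய்த எழுத்து - ஃ) and all of the combinations (உயிர்-மெய் எழுத்து) in the "a" column (அ வரி - i.e. க, ச, ட, த, ப, ற, ங, ஞ, ண, ந, ம, ன, ய, ர, ள, வ, ழ, ல) each use a single code point.

  • Every consonant is made up of two code points: the a-combination letter + the pulli, e.g. ப் = ப + ்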

  • Every combination other than the a-combinations is also made up of two code points: the a-combination letter + a marking, e.g. பி = ப + ி, தை = த + ை
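
One quick way to see this structure is to list the code points behind a few graphemes. The following is a small Python sketch using only the standard unicodedata module; the helper name describe is hypothetical. Note that Unicode names the pulli "TAMIL SIGN VIRAMA".

```python
import unicodedata


def describe(text):
    """Return a (code point, Unicode name) pair for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# அ (vowel), ப (a-combination), ப் (consonant), பி (another combination)
for grapheme in ("\u0b85", "\u0baa", "\u0baa\u0bcd", "\u0baa\u0bbf"):
    print(grapheme, describe(grapheme))
```

The first two graphemes come back as a single pair each, while the last two come back as two pairs: the letter ப followed by the pulli or by the vowel sign.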

So your logic could be something like this:

initialize an empty array
for each codepoint in word:
    if the codepoint is a vowel, a-combination or aytham, it is also its grapheme, so add it to the array
    otherwise, the codepoint is a marking such as the pulli (i.e. ்) or one of the combination extensions (e.g.  ி or  ை), so append it to the end of the last element of the array

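Of course, this assumes that your string is well-formed and that you don't have things like two markings in a row.

Here is the Python code, in case you find it helpful. If you'd like to help us port it to PHP, please let me know: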

@staticmethod
def split_letters(word=u''):
    """ Returns the graphemes (i.e. the Tamil characters) in a given word as a list """
    # ensure that the word is a valid word
    TamilWord.validate(word)
    # list (which will be returned to user)
    letters = []
    # a tuple of all combination endings and of all அ combinations
    combination_endings = TamilLetter.get_combination_endings()
    a_combinations = TamilLetter.get_combination_column(u'அ').values()
    # loop through each codepoint in the input string
    for codepoint in word:
        # if codepoint is an அ combination, a vowel, aytham or a space,
        # add it to the list
        if codepoint in a_combinations or \
            TamilLetter.is_whitespace(codepoint) or \
            TamilLetter.is_vowel(codepoint) or \
            TamilLetter.is_aytham(codepoint):
            letters.append(codepoint)
        # if codepoint is a combination ending or a pulli ('்'), add it
        # to the end of the previously-added codepoint
        elif codepoint in combination_endings or \
            codepoint == TamilLetter.get_pulli():
            # ensure that at least one character already exists
            if len(letters) > 0:
                letters[-1] = letters[-1] + codepoint
            # otherwise raise an Error. However, validate_word()
            # should catch this
            else:
                raise ValueError("""%s cannot be first character of a word""" % (codepoint))
    return letters
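
The method above depends on the library's TamilWord and TamilLetter helpers. For trying the same logic in isolation, here is a minimal standalone sketch. The helpers is_base and is_ending and the constants are hypothetical stand-ins, not the library's API; they are based on code point ranges in the Unicode Tamil block, which is an assumption about what the real helpers accept. Unrecognized code points are simply kept as their own element here.

```python
PULLI = "\u0bcd"   # ்
AYTHAM = "\u0b83"  # ஃ


def is_base(ch):
    """Vowels, aytham, a-combination letters or whitespace (assumed ranges)."""
    return ch == AYTHAM or "\u0b85" <= ch <= "\u0bb9" or ch.isspace()


def is_ending(ch):
    """Combination endings (vowel signs, U+0BD7) or the pulli (assumed ranges)."""
    return "\u0bbe" <= ch <= PULLI or ch == "\u0bd7"


def split_letters(word):
    """Return the graphemes of word as a list, one element per Tamil letter."""
    letters = []
    for ch in word:
        if is_ending(ch):
            # an ending needs an earlier letter to attach to
            if not letters:
                raise ValueError("%r cannot be first character of a word" % ch)
            letters[-1] += ch
        else:
            # base letters, and anything unrecognized, start a new element
            letters.append(ch)
    return letters


print(split_letters("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))  # தமிழ் -> ['த', 'மி', 'ழ்']
```

Like the pseudocode, this does not reject malformed input such as two markings in a row; it only refuses a word that starts with a marking.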