Inverted index tokenization problem

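Version: 2.1.0
Problem description: When using the inverted index for full-text search, the tokenizer cannot produce a token that spans the boundary between Chinese and English text.
For example, the song title is 《那些花儿开在春末夏初The Flowers Bloom in Late Spring and Early Summer》.
I want to search for 初T, but so far neither the Chinese tokenizer nor the Unicode tokenizer seems to support this kind of query.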

Chinese tokenizer:
mysql> SELECT TOKENIZE('那些花儿开在春末夏初The Flowers Bloom in Late Spring and Early Summer','"parser"="chinese","parser_mode"="fine_grained"');
+------------------------------------------------------------------------------------------------------------------------------------------------+
| tokenize('那些花儿开在春末夏初The Flowers Bloom in Late Spring and Early Summer', '"parser"="chinese","parser_mode"="fine_grained"') |
+------------------------------------------------------------------------------------------------------------------------------------------------+
| ["那些", "花儿", "开在", "春", "末", "夏初", "The", "Flowers", "Bloom", "Late", "Spring", "Early", "Summer"] |
+------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.81 sec)

Unicode tokenizer:
mysql> SELECT TOKENIZE('那些花儿开在春末夏初The Flowers Bloom in Late Spring and Early Summer','"parser"="unicode"');
+---------------------------------------------------------------------------------------------------------------------------------+
| tokenize('那些花儿开在春末夏初The Flowers Bloom in Late Spring and Early Summer', '"parser"="unicode"') |
+---------------------------------------------------------------------------------------------------------------------------------+
| ["那", "些", "花", "儿", "开", "在", "春", "末", "夏", "初", "flowers", "bloom", "late", "spring", "early", "summer"] |
+---------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.02 sec)

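Question 1: Is there a tokenizer that splits text into single characters like this? If not, could it be added as a feature request?
Question 2: In the case study where Tencent Music replaced ES with Apache Doris, how was this kind of problem solved?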

1 Answer

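Tokenizers generally do not split English words into individual letters. An ngram tokenizer will be added to the inverted index later, which makes every n consecutive characters a token. For now, you can use LIKE '%初T%'.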
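To illustrate why an ngram tokenizer would help here, below is a minimal Python sketch of character n-grams. It is only a conceptual illustration, not Doris's implementation: the function name char_ngrams is hypothetical, and a real tokenizer may also lowercase text or treat whitespace and punctuation differently.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every run of n consecutive characters in text.

    Hypothetical helper for illustration only; it does not
    reproduce the behavior of any Doris tokenizer.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n across the string, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


title = "那些花儿开在春末夏初The Flowers Bloom in Late Spring and Early Summer"
tokens = char_ngrams(title, 2)

# With n = 2, the pair that straddles the Chinese/English boundary
# becomes a token of its own, so a lookup for "初T" can match.
print("初T" in tokens)
```

With n = 2 the window that covers the last Chinese character and the first English letter yields the token 初T, which neither the word-level Chinese tokenizer nor the per-character Unicode tokenizer shown in the question produces.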

The scenarios are different.