Manjusaka

A Brief Discussion on MySQL Full-Text Indexing

Recently, I've encountered quite a few pitfalls with MySQL's full-text search, so I’m writing a detailed article to summarize some knowledge points about MySQL's full-text search.

During our use of MySQL, there are always some fuzzy search requirements. For example, we have a table like this:

create table if not exists `user`
(
    `id`         bigint(20)   not null auto_increment,
    `name`       varchar(255) not null,
    `age`        int          not null,
    `update_time` timestamp    not null,
    `create_time` timestamp    not null,
    index (`name`),
    primary key (`id`)
) engine = InnoDB
  charset = 'utf8mb4';

Now we need to perform some fuzzy matching on name; for instance, we want to match names that contain the character 草. So, after some thought, we write the following SQL:

select * from user where name like '%草%'

Great! But when you excitedly deploy this code, it falls over in production. Why? Because of the pitfalls in MySQL. MySQL's like query has two limitations:

  1. Only a prefix match such as '草%' can use the index; a suffix match '%草' or an arbitrary match '%草%' cannot.
  2. When the index cannot be used, MySQL traverses the entire table to find the data.

When the data scale of a table is large, this brute-force table scan will inevitably bring significant overhead.
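You can verify this yourself with EXPLAIN (a sketch; the exact plan output depends on the MySQL version and data volume):

```sql
-- Prefix match: the optimizer can use the B-tree index on `name` (type: range)
EXPLAIN SELECT * FROM `user` WHERE name LIKE '草%';

-- Arbitrary match: no index applies, so MySQL falls back to a full scan (type: ALL)
EXPLAIN SELECT * FROM `user` WHERE name LIKE '%草%';
```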

However, in our actual work, there are definitely many arbitrary matching requirements. So what should we do? Perhaps we can try full-text search.

Full-text search is not unfamiliar to everyone. In simple terms, it is a way to find matches in a pile of text data using a set of keywords. Currently, the more mainstream full-text search solutions in the industry include:

  1. Relational databases that support full-text search
  2. Apache Lucene
  3. ElasticSearch based on Apache Lucene
  4. Apache Solr

The latter two are the mainstream choices in the industry today, and many full-text search requirements end up on ES or Solr. However, this path is not free. There are several practical issues:

  1. Operating ES/Solr at scale: keeping the cluster highly available is a real test of a team's capabilities.
  2. Getting data from MySQL or other sources into the search engine, whether via real-time or offline ETL.
  3. The added learning curve and codebase maintenance cost.
  4. Every new dependency makes it harder to guarantee the HA of the system as a whole.

In technical decision-making, we often need to weigh the ROI of an option to assist in decision-making. If we are facing a relatively simple search scenario, the overhead brought by choosing ES/Solr will make its ROI relatively low. Therefore, in some simple scenarios, we may prefer to utilize the capabilities of the database itself to meet our needs.

Fortunately, MySQL itself offers some full-text search capability: MyISAM tables have supported it for a long time, and since MySQL 5.6, InnoDB tables support it as well.

The prerequisite for MySQL full-text search is to create a Full Text Index in the table.

alter table `user`
    ADD FULLTEXT INDEX name_index (`name`);

Note that full-text indexes only take effect on fields of type CHAR/VARCHAR/TEXT.

Next, we insert two pieces of data:

insert into `user` (name, age, create_time, update_time)
values ('Jeff.Li', 18, current_timestamp, current_timestamp);

insert into `user` (name, age, create_time, update_time)
values ('Jeff.S.Wang', 18, current_timestamp, current_timestamp);

Now, let's see how MySQL performs full-text queries.

First, according to the official definition,

MATCH (col1,col2,...) AGAINST (expr [search_modifier])

The search_modifier is the selected matching mode, and there are four types in MySQL:

  1. IN NATURAL LANGUAGE MODE
  2. IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
  3. IN BOOLEAN MODE
  4. WITH QUERY EXPANSION

The commonly used modes are Natural Language Mode and Boolean Mode.

First, let's talk about Natural Language Mode. It's quite simple; as the name suggests, MySQL directly calculates the keywords to be matched and returns the corresponding values. Here’s a quote from the official explanation:

By default or with the IN NATURAL LANGUAGE MODE modifier, the MATCH() function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT index. The search string is given as the argument to AGAINST(). For each row in the table, MATCH() returns a relevance value; that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.
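The relevance value the documentation mentions can also be selected directly, which is handy for ranking. A sketch against the `user` table above:

```sql
-- MATCH ... AGAINST returns a relevance score per row; 0 means no match.
SELECT name,
       MATCH(name) AGAINST('Jeff' IN NATURAL LANGUAGE MODE) AS score
FROM `user`
ORDER BY score DESC;
```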

Let's write an SQL statement:

select * 
from `user` 
where MATCH(name) AGAINST('Jeff' IN NATURAL LANGUAGE MODE)

Then we find we get the following results:

id | name        | age | update_time         | create_time
1  | Jeff.Li     | 18  | 2020-03-01 15:38:07 | 2020-03-01 15:38:07
2  | Jeff.S.Wang | 18  | 2020-03-01 15:42:28 | 2020-03-01 15:42:28

Next, we try to match the user's last name, for example, we want to find a user with the surname Wang.

We write the following SQL:

select * 
from `user` 
where MATCH(name) AGAINST('Wang' IN NATURAL LANGUAGE MODE)

We get the following result:

id | name        | age | update_time         | create_time
2  | Jeff.S.Wang | 18  | 2020-03-01 15:42:28 | 2020-03-01 15:42:28

Now we start trying to search for a user with the surname Li, and we write the following SQL:

select * 
from `user` 
where MATCH(name) AGAINST('Li' IN NATURAL LANGUAGE MODE)

And we find that there are no results at all?! Why?

The reason lies in the granularity of tokenization. When new data comes in, MySQL splits the indexed fields into tokens and only stores tokens whose length falls within configured bounds. Four parameters control these bounds:

  1. innodb_ft_min_token_size
  2. innodb_ft_max_token_size
  3. ft_min_word_len (serves the same purpose but is for the MyISAM engine)
  4. ft_max_word_len

For InnoDB, the default value of innodb_ft_min_token_size is 3, so tokens shorter than three characters ('Li', 'S') are discarded. In other words, the only tokens stored from the data we previously entered are:

  1. Jeff
  2. Wang

So we had no results for the second search: 'Li' is only two characters long, below the minimum token length. Now, after setting innodb_ft_min_token_size to 2 in the configuration file and restarting MySQL, let's execute it again:

select * 
from `user` 
where MATCH(name) AGAINST('Li' IN NATURAL LANGUAGE MODE)

Still no results?!

After checking the official documentation, we find the following description:

Some variable changes require that you rebuild the FULLTEXT indexes in your tables. Instructions for doing so are given later in this section.

The index word segmentation granularity is also included in this, so we need to delete/rebuild the index, and then execute it again (a bit of a pitfall...).
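A sketch of the rebuild. Note that innodb_ft_min_token_size is not a dynamic variable: it has to be set in my.cnf under [mysqld] and the server restarted before the rebuild takes effect:

```sql
-- Either drop and recreate the full-text index ...
ALTER TABLE `user` DROP INDEX name_index;
ALTER TABLE `user` ADD FULLTEXT INDEX name_index (`name`);

-- ... or rebuild it in place via OPTIMIZE TABLE
SET GLOBAL innodb_optimize_fulltext_only = ON;
OPTIMIZE TABLE `user`;
SET GLOBAL innodb_optimize_fulltext_only = OFF;
```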

select * 
from `user` 
where MATCH(name) AGAINST('Li' IN NATURAL LANGUAGE MODE)

Now we get the normal return results:

id | name    | age | update_time         | create_time
1  | Jeff.Li | 18  | 2020-03-01 15:38:07 | 2020-03-01 15:38:07

Now let's talk about another matching mode, BOOLEAN MODE.

Boolean mode allows us to use some operators to retrieve data. Here are some common examples; the rest can be found in the MySQL official documentation:

  1. AGAINST('Jeff Li' IN BOOLEAN MODE) means a row matches if it contains either Jeff or Li.
  2. AGAINST('+Jeff' IN BOOLEAN MODE) means Jeff must be present.
  3. AGAINST('+Jeff -Li' IN BOOLEAN MODE) means Jeff must be present and Li must not.

Let's execute these SQL statements:

select * 
from `user` 
where MATCH(name) AGAINST('Jeff Li' IN BOOLEAN MODE)

Result:

id | name        | age | update_time         | create_time
1  | Jeff.Li     | 18  | 2020-03-01 15:38:07 | 2020-03-01 15:38:07
2  | Jeff.S.Wang | 18  | 2020-03-01 15:42:28 | 2020-03-01 15:42:28

select *
from `user`
where MATCH(name) AGAINST('+Jeff' IN BOOLEAN MODE)

Result:

id | name        | age | update_time         | create_time
1  | Jeff.Li     | 18  | 2020-03-01 15:38:07 | 2020-03-01 15:38:07
2  | Jeff.S.Wang | 18  | 2020-03-01 15:42:28 | 2020-03-01 15:42:28

select *
from `user`
where MATCH(name) AGAINST('+Jeff -Li' IN BOOLEAN MODE)

Result:

id | name        | age | update_time         | create_time
2  | Jeff.S.Wang | 18  | 2020-03-01 15:42:28 | 2020-03-01 15:42:28
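Boolean mode supports a few more operators worth knowing; a sketch (see the official documentation for the full list):

```sql
-- '*' is a trailing wildcard: matches Jeff, Jeffrey, ...
SELECT * FROM `user` WHERE MATCH(name) AGAINST('Jef*' IN BOOLEAN MODE);

-- Double quotes match a literal phrase, words in order
SELECT * FROM `user` WHERE MATCH(name) AGAINST('"Jeff Li"' IN BOOLEAN MODE);

-- '>' / '<' raise or lower a word's contribution to the relevance score
SELECT * FROM `user` WHERE MATCH(name) AGAINST('>Jeff <Wang' IN BOOLEAN MODE);
```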

Now, we have some Chinese search requirements. Let's first insert some data:

insert into `user` (name, age, create_time, update_time)
values ('奥特曼', 18, current_timestamp, current_timestamp);

Now we want to search for users with the surname 奥. Following the previous approach, we write the following SQL:

select *
from `user`
where MATCH(name) AGAINST('+奥' IN BOOLEAN MODE)

Then we are surprised to find that there are still no results. Why?

Actually, it's still the issue mentioned earlier: tokenization. MySQL's default full-text parser only knows how to tokenize space-separated languages like English; it cannot segment Chinese. And without tokens, there is nothing to search. So what do we do?

Starting from MySQL 5.7, MySQL provides the ngram component to help us perform Chinese word segmentation, and it's very easy to use:

alter table `user`
    add fulltext index name_index (`name`) with parser ngram;

Here are a few points to note:

  1. ngram is not only suitable for Chinese; according to the official documentation, it also supports Korean and Japanese.
  2. A single field can only have one full-text index, so the original full-text index needs to be deleted.

At the same time, like the default parser, ngram is subject to a token-size limit, but for ngram it is controlled by a single parameter: ngram_token_size (default 2, valid range 1 to 10; it can only be set at server startup). We can set it according to our needs.
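For example, with the default ngram_token_size of 2, '奥特曼' is indexed as the bigrams '奥特' and '特曼', so a single-character search term like '奥' cannot form a 2-gram and finds nothing. A sketch of the difference:

```sql
-- With ngram_token_size = 2 (the default): a two-character term works
SELECT * FROM `user` WHERE MATCH(name) AGAINST('奥特' IN BOOLEAN MODE);

-- To match single characters, set ngram_token_size = 1 in my.cnf,
-- restart the server, rebuild the full-text index, and then:
SELECT * FROM `user` WHERE MATCH(name) AGAINST('奥' IN BOOLEAN MODE);
```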

Summary

Full-text search is a common requirement in daily development. When our infrastructure does not allow us to confidently use external components, leveraging the capabilities provided by the database may be a good option. However, there are still many pitfalls to navigate, and many parameters to optimize. By the way, the settings for Alibaba Cloud's RDS are really difficult to use (quietly complaining).

Alright... my procrastination is really hopeless... and I've been suffering from toothache these days, which is really frustrating.
