Present and Future of ArangoDB Fulltext Index

The ArangoDB Fulltext index allows you to search for text in arbitrary strings. It is a great way to implement things like autocompletion, product searches or many other use-cases which need some form of fulltext search.
The fulltext index is a good fit if your use case boils down to the following:

  • Full matches of words
  • Prefix matches of words
  • No need for a “ranking” of the matching documents

Usage Example

Using the fulltext index is fairly straightforward. You create the index on an existing emails collection and insert a few documents:

db.emails.ensureIndex({ type: "fulltext", fields: [ "text" ], minLength: 3 })
db.emails.insert({ text: "banana apple" })
db.emails.insert({ text: "banana mango" })
db.emails.insert({ text: "banana avocado" })

Search for documents

FOR mail IN FULLTEXT(emails, "text", "banana,-apple")
    RETURN mail

This will return documents containing “banana”, but not “apple”. Other word combinations are possible. For more information, check the Fulltext documentation page.
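To make the query semantics concrete, here is a minimal in-memory sketch in plain JavaScript. It is only a simplified model and not how ArangoDB implements the index. The helper names (tokenize, matchesQuery) are made up for illustration, and the assumed query syntax (comma-separated terms, a leading "-" to exclude a word, "prefix:" for prefix matches) should be checked against the Fulltext documentation page.

```javascript
// Simplified model of a fulltext query. NOT ArangoDB's implementation.
// Assumed syntax: comma-separated terms that must all match,
// "-" excludes a word, "prefix:" matches the beginning of a word.

// Split a text into lowercase words (letters and digits only).
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Return true if the text satisfies every term of the query.
function matchesQuery(text, query) {
  const words = tokenize(text);
  return query.split(",").every((rawTerm) => {
    let term = rawTerm.trim().toLowerCase();
    const exclude = term.startsWith("-");
    if (exclude) term = term.slice(1);
    const isPrefix = term.startsWith("prefix:");
    if (isPrefix) term = term.slice("prefix:".length);
    const found = words.some((w) => (isPrefix ? w.startsWith(term) : w === term));
    return exclude ? !found : found;
  });
}

const mails = ["banana apple", "banana mango", "banana avocado"];
console.log(mails.filter((t) => matchesQuery(t, "banana,-apple")));
console.log(mails.filter((t) => matchesQuery(t, "banana,prefix:avo")));
```

The first call keeps the mango and avocado mails, the second only the avocado mail. Note that nothing here orders the results, which mirrors the missing “ranking” mentioned above.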

Dealing with Human Language

The current fulltext index has a number of shortcomings when it comes to dealing with arbitrary languages, substring search and especially non-Latin languages like Chinese, Japanese or Arabic. The following is a list of tasks which the current fulltext index does not yet perform out of the box, but which might be relevant for your use case:

1. Normalizing Words

  • Removing diacritics
  • Reducing words to their stems
  • Matching synonymous words

Removing diacritics means stripping marks such as ^, ° or ` from letters, e.g. turning “ç” into “c” so that “Curaçao” also matches “Curacao”.
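Until the index does this for you, one option is to normalize words in the application before storing and before querying. A minimal sketch in plain JavaScript, assuming Unicode decomposition (NFD) is acceptable for your data; the function name stripDiacritics is made up for illustration:

```javascript
// Decompose each character into base letter + combining marks (NFD),
// then drop the combining marks in the range U+0300..U+036F.
function stripDiacritics(word) {
  return word.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Both spellings now normalize to the same indexable form.
console.log(stripDiacritics("Curaçao") === stripDiacritics("Curacao"));
```

This only covers marks that Unicode treats as combining characters. Letters such as “ø” or “ß” have no such decomposition and would need an explicit mapping.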

Word stems are the base forms of words, as opposed to their inflections. For example, English forms plurals by adding an -s (house / houses), and verbs take inflected forms such as pay, paid and paying. Often we also want to match inflected forms of a word, which is possible by indexing only the word stem.
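As a rough illustration of the idea, the following plain JavaScript sketch strips a couple of English suffixes and looks up irregular forms in a tiny table. It is deliberately naive and not a real stemming algorithm; the suffix list, the irregular-form table and the function name stem are assumptions made for this example only.

```javascript
// Irregular forms cannot be derived by suffix stripping, so they are looked up.
const IRREGULAR_FORMS = new Map([["paid", "pay"]]);
// Suffixes to strip, checked in this order.
const SUFFIXES = ["ing", "s"];
// Do not strip if the remaining stem would be shorter than this.
const MIN_STEM_LENGTH = 3;

function stem(word) {
  const w = word.toLowerCase();
  if (IRREGULAR_FORMS.has(w)) return IRREGULAR_FORMS.get(w);
  for (const suffix of SUFFIXES) {
    if (w.endsWith(suffix) && w.length - suffix.length >= MIN_STEM_LENGTH) {
      return w.slice(0, -suffix.length);
    }
  }
  return w;
}

// All inflections from the text collapse onto their stems.
console.log(["house", "houses"].map(stem));
console.log(["pay", "paid", "paying"].map(stem));
```

A rule set this small will mangle many words (for example, words that simply end in “s”), which is why production systems rely on language-specific stemmers.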

Words should also match their synonyms, e.g. a search for “quick” should also match “fast”.
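One simple way to approximate this is to map every synonym to a single canonical word, both when indexing and when querying. A minimal plain JavaScript sketch; the synonym table and the function name canonicalize are hypothetical:

```javascript
// Hypothetical synonym table: each entry maps a word to its canonical form.
const CANONICAL_FORM = new Map([
  ["fast", "quick"],
  ["rapid", "quick"],
]);

// Replace a word by its canonical form, or keep it if no synonym is known.
function canonicalize(word) {
  const w = word.toLowerCase();
  return CANONICAL_FORM.has(w) ? CANONICAL_FORM.get(w) : w;
}

// "fast" and "quick" end up as the same index term.
console.log(canonicalize("fast") === canonicalize("quick"));
```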

2. Removing Stop-Words

Commonly used words such as “the” and “and” in English or “und”, “am” and “an” in German are not relevant for a lot of use cases. Removing these words before indexing can improve the perceived quality of search results.
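A sketch of such a filter in plain JavaScript, using only the example words from above. The per-language lists and the function name removeStopWords are assumptions for illustration; real stop-word lists are much longer.

```javascript
// Tiny per-language stop-word lists, taken from the examples in the text.
const STOP_WORDS = {
  en: new Set(["the", "and"]),
  de: new Set(["und", "am", "an"]),
};

// Split a text into lowercase words and drop the stop words of one language.
function removeStopWords(text, language) {
  const stopWords = STOP_WORDS[language] || new Set();
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((word) => word !== "" && !stopWords.has(word));
}

console.log(removeStopWords("the banana and the mango", "en"));
console.log(removeStopWords("Banane und Mango", "de"));
```

Keeping the lists separate per language matters: “an” is a stop word in German but may be a meaningful token in another context.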

3. Identifying Word Boundaries

In non-Latin languages such as Chinese, a word may consist of one or more characters, and words are not separated by spaces. For example, in Chinese the word “公共汽車” means bus, whereas “汽車” on its own means car.
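One classic approach to this problem is dictionary-based segmentation. The plain JavaScript sketch below uses greedy forward longest-match against a tiny hypothetical dictionary; it only illustrates why word boundaries need language knowledge and is not what ArangoDB does.

```javascript
// Hypothetical mini dictionary; a real one would hold many thousands of words.
const DICTIONARY = new Set(["公共汽車", "汽車", "公共"]);
const MAX_WORD_LENGTH = 4; // longest dictionary entry, in characters

// Greedy forward longest-match: at each position take the longest
// dictionary word, falling back to a single character.
function segment(text) {
  const chars = Array.from(text);
  const words = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i];
    const longest = Math.min(MAX_WORD_LENGTH, chars.length - i);
    for (let len = longest; len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (DICTIONARY.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    words.push(matched);
    i += Array.from(matched).length;
  }
  return words;
}

// "公共汽車" (bus) stays one word instead of being split into "公共" + "汽車".
console.log(segment("公共汽車"));
console.log(segment("汽車"));
```

Greedy matching can pick the wrong split for ambiguous input, so practical segmenters combine dictionaries with statistical models.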

What’s next?

For the next release of ArangoDB, we are planning a new feature that will offer advanced text analysis functionality (and more). It will allow you to match documents by similarity and to sort result sets according to several different scoring algorithms.

It will also allow you to index and analyze human language input, as well as data from other scientific domains, via customizable analyzers. It will, for example, be possible to store DNA sequences in ArangoDB and later search for sub-sequences by similarity.

For more information check out this section of the ArangoDB docs.
[Editor's note: The current working title of the upcoming feature is IResearch. We are thinking of another name for it, so the name might change.]

Jan Steemann
January 11, 2018

After more than 30 years of playing around with 8 bit computers, assembler and scripting languages, Jan decided to move on to work in database engineering. Jan is now a senior C/C++ developer with the ArangoDB core team, being there from version 0.1. He is mostly working on performance optimization, storage engines and the querying functionality. He also wrote most of AQL (ArangoDB’s query language).