Present and Future of ArangoDB Fulltext Index

January 11, 2018

The ArangoDB Fulltext index allows you to search for text in arbitrary strings. It is a great way to implement things like autocompletion, product searches or many other use-cases which need some form of fulltext search.
The Fulltext Index is suitable for you if your use-case can be broken down to:

Full matches of words
Prefix matches of words
You do not need a “ranking” of the matching documents

Usage Example

Using the fulltext index is fairly straightforward. You create the index on an existing emails collection and insert a few documents:

// arangosh; assumes the emails collection already exists
var emails = db.emails;
emails.ensureIndex({ type: "fulltext", fields: [ "text" ], minLength: 3 })
emails.insert({ text: "banana apple" })
emails.insert({ text: "banana mango" })
emails.insert({ text: "banana avocado" })

Search for documents

FOR mail IN FULLTEXT(emails, "text", "banana,-apple")
    RETURN mail

This returns documents containing “banana” but not “apple”. Other word combinations are possible; for more information, check the Fulltext documentation page.
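To make the matching rules more concrete, here is a minimal in-memory sketch of how such a query string can be evaluated. It is not ArangoDB's implementation: the helper names (tokenize, parseQuery, matches) are hypothetical, and only comma-separated terms, a leading "-" for exclusion and the "prefix:" modifier are modelled.

```javascript
// Simplified model of fulltext query matching (illustration only).
function tokenize(text) {
  // Lowercase and split on anything that is not a letter or digit.
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

function parseQuery(query) {
  return query.split(",").map((raw) => {
    let term = raw.trim().toLowerCase();
    const exclude = term.startsWith("-");
    if (exclude) term = term.slice(1);
    const prefix = term.startsWith("prefix:");
    if (prefix) term = term.slice("prefix:".length);
    return { term, exclude, prefix };
  });
}

function matches(text, query) {
  const words = tokenize(text);
  // All terms must hold: included terms must occur, excluded terms must not.
  return parseQuery(query).every(({ term, exclude, prefix }) => {
    const found = words.some((w) => (prefix ? w.startsWith(term) : w === term));
    return exclude ? !found : found;
  });
}

const mails = ["banana apple", "banana mango", "banana avocado"];
const hits = mails.filter((text) => matches(text, "banana,-apple"));
console.log(hits); // the two documents that do not contain "apple"
```

In this model, a prefix search such as "prefix:man" would match the “banana mango” document only.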

Dealing with Human Language

The current fulltext index has a number of shortcomings when it comes to dealing with arbitrary languages, substring search and especially non-Latin languages like Chinese, Japanese or Arabic. The following is a list of tasks which the current fulltext index does not yet perform out of the box, but which might be relevant for your use-case:

1. Normalizing Words

Removing diacritics
Reducing words to their word stems
Matching synonymous words

Removing diacritics means stripping marks such as ^, ° or ` from letters, e.g. turning “ç” into “c” so that “Curaçao” also matches “Curacao”.
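As a rough sketch of what such a normalization step could look like if you apply it yourself before indexing, the following uses JavaScript's built-in Unicode normalization. The function name stripDiacritics is hypothetical, and this is not something the fulltext index does for you.

```javascript
// Remove diacritics by decomposing characters and dropping combining marks.
function stripDiacritics(word) {
  // NFD splits "ç" into "c" plus a combining cedilla (U+0327);
  // \p{M} then matches and removes the combining marks.
  return word.normalize("NFD").replace(/\p{M}/gu, "");
}

console.log(stripDiacritics("Curaçao")); // "Curacao"
```

The same function has to be applied to both the indexed text and the search term, otherwise the two forms will still not match.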

Word stems are the common base forms of words, as opposed to their inflections. For example, English forms plurals by adding an -s (house / houses), and verbs have inflected forms such as pay, paid and paying. Often we also want to match inflected forms of a word, which is possible by indexing only the word stem.
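The idea of indexing only the stem can be illustrated with a deliberately naive suffix stripper. This is a toy sketch, not a real stemming algorithm: the suffix list, the irregular-form table and the function name naiveStem are assumptions made up for this example.

```javascript
// Toy stemmer: a lookup table for irregular forms plus simple suffix stripping.
const IRREGULAR = new Map([["paid", "pay"]]);
const SUFFIXES = ["ing", "ed", "s"];

function naiveStem(word) {
  const w = word.toLowerCase();
  if (IRREGULAR.has(w)) return IRREGULAR.get(w);
  for (const suffix of SUFFIXES) {
    // Keep at least three characters so short words such as "bus" stay intact.
    if (w.endsWith(suffix) && w.length - suffix.length >= 3) {
      return w.slice(0, -suffix.length);
    }
  }
  return w;
}

console.log(["house", "houses"].map(naiveStem));      // both reduce to "house"
console.log(["pay", "paid", "paying"].map(naiveStem)); // all reduce to "pay"
```

Real stemmers use much larger, language-specific rule sets; the point here is only that all inflections end up under one index key.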

Synonymous words should also match each other, e.g. a search for “quick” should also match “fast”.
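One simple way to think about synonym matching is to map every word to a single canonical representative before indexing and before searching. The sketch below assumes a hand-written synonym table; the table contents and the function name canonical are hypothetical.

```javascript
// Map each synonym to one canonical word (hypothetical, hand-written table).
const SYNONYMS = new Map([
  ["fast", "quick"],
  ["rapid", "quick"],
]);

function canonical(word) {
  const w = word.toLowerCase();
  return SYNONYMS.get(w) || w;
}

// "fast" and "quick" now share the same index key.
console.log(canonical("fast") === canonical("quick")); // true
```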

2. Removing Stop-Words

Commonly used words such as “the” and “and” in English, or “und”, “am” and “an” in German, are not relevant for a lot of use-cases. Removing these words before indexing can improve the perceived quality of search results.
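A minimal sketch of stop-word filtering as a preprocessing step might look like the following. The stop-word set is just the handful of words mentioned above, and mixing English and German in one set is done purely for illustration; in practice you would keep one list per language. The function name indexableWords is hypothetical.

```javascript
// Tiny illustrative stop-word list (real lists are per language and much longer).
const STOP_WORDS = new Set(["the", "and", "und", "am", "an"]);

function indexableWords(text) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u) // split on anything that is not a letter or digit
    .filter((w) => w && !STOP_WORDS.has(w));
}

console.log(indexableWords("The banana and the mango")); // [ 'banana', 'mango' ]
```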

3. Identifying Word Boundaries

In non-Latin languages such as Chinese, words may consist of one or more characters and are not separated by spaces. For example, the Chinese word “公共汽車” means bus, while “汽車” on its own means car.
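To illustrate why word boundaries need language knowledge, here is a small sketch of greedy longest-match segmentation against a dictionary. The two-entry dictionary, the maximum word length and the function name segment are assumptions for this example; real segmenters rely on large dictionaries or statistical models.

```javascript
// Greedy longest-match segmentation with a tiny, hypothetical dictionary.
const DICTIONARY = new Set(["公共汽車", "汽車"]);
const MAX_WORD_LENGTH = 4;

function segment(text) {
  const chars = Array.from(text); // iterate by code point, not UTF-16 unit
  const words = [];
  let i = 0;
  while (i < chars.length) {
    let matched = chars[i]; // fall back to a single character
    for (let len = Math.min(MAX_WORD_LENGTH, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (DICTIONARY.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    words.push(matched);
    i += Array.from(matched).length;
  }
  return words;
}

console.log(segment("公共汽車")); // one word (bus), not "公共" followed by "汽車"
console.log(segment("汽車"));     // one word (car)
```

Because the longest dictionary entry wins, “公共汽車” is indexed as a single word, so a search for “汽車” (car) would not match a document that only mentions a bus.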

What’s next?

For the next release of ArangoDB, we are planning a new feature which will offer advanced text analysis functionality (and more). It will allow you to match documents by similarity and to sort result sets according to several different scoring algorithms.

It will also allow you to index and analyze human language input, as well as data from other scientific domains, via customizable analyzers. It will, for example, be possible to store DNA sequences in ArangoDB and later search for sub-sequences by similarity.

For more information check out this section of the ArangoDB docs.
[Editor's note: The current working title of the upcoming feature is IResearch. We are thinking of another name for it, so the name might change.]
