Full-Text Index

ArcadeDB’s full-text index is built on Apache Lucene and supports configurable analyzers, advanced Lucene query syntax, multi-property indexing, relevance scoring, and "More Like This" similarity search.

Creating a Full-Text Index

CREATE INDEX ON Article (content) FULL_TEXT

Multiple properties can be indexed together:

CREATE INDEX ON Article (title, body) FULL_TEXT

Configuring the Analyzer

Pass a METADATA block to choose a Lucene analyzer:

CREATE INDEX ON Article (content) FULL_TEXT
  METADATA {
    "analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer",
    "allowLeadingWildcard": false,
    "defaultOperator": "OR"
  }

Metadata Options

Option Type Default Description

analyzer

string

org.apache.lucene.analysis.standard.StandardAnalyzer

Lucene analyzer class used for both indexing and querying

index_analyzer

string

Override analyzer used only at index time

query_analyzer

string

Override analyzer used only at query time

allowLeadingWildcard

boolean

false

Allow *term wildcard queries

defaultOperator

"OR" | "AND"

"OR"

Default operator between terms when none is specified

<field>_analyzer

string

Per-field analyzer override, e.g. "title_analyzer"

Common Analyzers

Analyzer Class Description

org.apache.lucene.analysis.standard.StandardAnalyzer

General-purpose tokenizer (default)

org.apache.lucene.analysis.en.EnglishAnalyzer

English stemming and stop words

org.apache.lucene.analysis.core.SimpleAnalyzer

Lowercase only, no stop words

org.apache.lucene.analysis.core.WhitespaceAnalyzer

Split on whitespace only

Per-Field Analyzers

For multi-property indexes, each field can use a different analyzer:

CREATE INDEX ON Article (title, body) FULL_TEXT
  METADATA {
    "analyzer": "org.apache.lucene.analysis.standard.StandardAnalyzer",
    "title_analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer"
  }

Here title uses the English analyzer while body falls back to the standard analyzer.

Searching

SEARCH_INDEX()

Searches a named full-text index using Lucene query syntax.

SELECT * FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java programming')

Signature: SEARCH_INDEX(indexName, query)

Parameter Type Description

indexName

string

The full index name as shown in the schema, e.g. Article[content]

query

string

A Lucene query string

SEARCH_FIELDS()

Finds the full-text index automatically from field names, without needing to know the index name.

SELECT * FROM Article
WHERE SEARCH_FIELDS(['title', 'body'], 'database tutorial')

Signature: SEARCH_FIELDS(fieldNames, query)

Parameter Type Description

fieldNames

array of strings

Fields to search; a full-text index covering these fields must exist

query

string

A Lucene query string

Lucene Query Syntax

Both SEARCH_INDEX and SEARCH_FIELDS accept standard Lucene query syntax.

Boolean Operators

Syntax Meaning

java programming

Either term (OR, default)

+java +programming

Both terms required (AND)

java -python

java required, python excluded

java AND programming

Explicit AND

java OR python

Explicit OR

-- Requires both terms
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', '+java +programming')

-- Excludes documents about python
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'java -python')

Phrase Queries

Wrap a phrase in double quotes to require terms to appear in order:

SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', '"machine learning"')

Wildcard Queries

Syntax Matches

data*

database, datastore, dataset…​

dat?base

database, datXbase…​

*base

Requires allowLeadingWildcard: true

SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'data*')

Fuzzy Queries

Append ~ to match terms within an edit distance:

SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'database~')
-- also matches terms within edit distance 2 (Lucene default), e.g. "databaSe", "databasee"

Field-Qualified Queries (Multi-Property Indexes)

For indexes over multiple fields, restrict a term to a specific field:

-- Only match "database" in the title field
SELECT * FROM Article WHERE SEARCH_INDEX('Article[title,body]', 'title:database')

-- Combine field-specific and general terms
SELECT * FROM Article WHERE SEARCH_INDEX('Article[title,body]', '+title:"multi model" -nosql')

Relevance Score ($score)

Every match carries a relevance score. Use $score in projections or ORDER BY clauses:

SELECT title, $score
FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java programming')
ORDER BY $score DESC

Documents that match more query terms receive higher scores.

SELECT title, $score AS relevance
FROM Article
WHERE SEARCH_FIELDS(['content'], 'java programming')
ORDER BY relevance DESC
LIMIT 10

More Like This

"More Like This" finds documents similar to one or more source documents. It extracts representative terms from the sources, then searches for other documents sharing those terms.

The four full-text search functions (searchIndex, searchFields, searchIndexMore, searchFieldsMore) are registered under the fulltext.* namespace. The legacy snake_case names (search_index, search_fields, search_index_more, search_fields_more) remain available as aliases for backward compatibility. Both forms dispatch to the same implementation.

fulltext.searchIndexMore()

SELECT title, $score, $similarity
FROM Article
WHERE `fulltext.searchIndexMore`('Article[content]', [#10:3])
ORDER BY $similarity DESC

Signature: fulltext.searchIndexMore(indexName, sourceRIDs [, options]) (alias: search_index_more)

Parameter Type Description

indexName

string

Full-text index name

sourceRIDs

array of RIDs

One or more source documents

options

map

Optional MLT tuning options (see More Like This Configuration). Unknown keys are rejected.

fulltext.searchFieldsMore()

Same as fulltext.searchIndexMore but resolves the full-text index automatically from field names:

SELECT title, $similarity
FROM Article
WHERE `fulltext.searchFieldsMore`(['content'], [#10:3])
ORDER BY $similarity DESC

Signature: fulltext.searchFieldsMore(fieldNames, sourceRIDs [, options]) (alias: search_fields_more)

Parameter Type Description

fieldNames

array of strings

Fields to search; a full-text index covering these fields must exist

sourceRIDs

array of RIDs

One or more source documents

options

map

Optional MLT tuning options (see More Like This Configuration). Unknown keys are rejected.

Multiple Source Documents

Provide multiple RIDs to find documents similar to a combination of sources:

SELECT title, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:3, #10:4])
ORDER BY $similarity DESC

Similarity Score ($similarity)

$similarity is a normalized score from 0.0 to 1.0. The most similar document in the result set always receives 1.0.

SELECT title, $score, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:3])
ORDER BY $similarity DESC
LIMIT 5

More Like This Configuration

Pass an optional map of tuning options. Unknown keys are rejected with a descriptive error to catch typos; keys are case sensitive.

SELECT title, $similarity
FROM Article
WHERE `fulltext.searchFieldsMore`(['title', 'content'], [#10:3], {
  minTermFreq: 1,
  minDocFreq: 3,
  maxQueryTerms: 50,
  excludeSource: false
})
Option Type Default Description

minTermFreq

int

2

Minimum times a term must appear in the source document(s) to be considered

minDocFreq

int

5

Minimum number of index documents that must contain a term for it to be used

maxDocFreqPercent

float

null

Exclude terms appearing in more than this fraction of all documents (e.g. 0.5 = 50%)

maxQueryTerms

int

25

Maximum number of terms to use in the similarity query

minWordLen

int

0

Ignore terms shorter than this length (0 = no minimum)

maxWordLen

int

0

Ignore terms longer than this length (0 = no maximum)

boostByScore

boolean

true

Weight terms by TF-IDF score rather than raw frequency

excludeSource

boolean

true

Exclude source documents from results

maxSourceDocs

int

25

Maximum number of source RIDs allowed

Practical Examples

Blog Search with Ranking

SELECT title, author, $score AS relevance
FROM BlogPost
WHERE SEARCH_INDEX('BlogPost[title,body]', '+java +spring -legacy')
ORDER BY relevance DESC
LIMIT 20

Autocomplete with Prefix Wildcard

SELECT title
FROM Product
WHERE SEARCH_INDEX('Product[name]', 'micro*')

Stemming with English Analyzer

With EnglishAnalyzer, searching for "run" also matches "running", "runs", and "ran":

CREATE INDEX ON Article (content) FULL_TEXT
  METADATA { "analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer" }

SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'running')
-- also returns documents containing "run", "runs", "ran"
SELECT title, $similarity AS score
FROM Article
WHERE SEARCH_INDEX_MORE('Article[title,body]', [#10:42], {
  "minTermFreq": 1,
  "minDocFreq": 2,
  "maxQueryTerms": 30
})
ORDER BY score DESC
LIMIT 5

Readers Who Liked This Also Liked

SELECT title, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:1, #10:7, #10:12])
ORDER BY $similarity DESC
LIMIT 10

Notes

  • Full-text indexes require all indexed properties to be of type STRING.

  • Full-text indexes cannot be marked UNIQUE.

  • Without a METADATA block, indexes use StandardAnalyzer with OR default operator and leading wildcards disabled.

  • Indexes created without metadata continue to work with SEARCH_INDEX and SEARCH_INDEX_MORE exactly as before.

  • $score and $similarity are always available as query variables for matching documents; non-matching documents receive 0.