Full-Text Index

ArcadeDB’s full-text index is built on Apache Lucene and supports configurable analyzers, advanced Lucene query syntax, multi-property indexing, native BM25 relevance scoring (with tunable parameters and per-field boosts), and "More Like This" similarity search.

Creating a Full-Text Index

CREATE INDEX ON Article (content) FULL_TEXT

Multiple properties can be indexed together:

CREATE INDEX ON Article (title, body) FULL_TEXT

Configuring the Analyzer

Pass a METADATA block to choose a Lucene analyzer:

CREATE INDEX ON Article (content) FULL_TEXT
  METADATA {
    "analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer",
    "allowLeadingWildcard": false,
    "defaultOperator": "OR"
  }

Metadata Options

Option Type Default Description

analyzer

string

org.apache.lucene.analysis.standard.StandardAnalyzer

Lucene analyzer class used for both indexing and querying

index_analyzer

string

Override analyzer used only at index time

query_analyzer

string

Override analyzer used only at query time

allowLeadingWildcard

boolean

false

Allow *term wildcard queries

defaultOperator

"OR" | "AND"

"OR"

Default operator between terms when none is specified

<field>_analyzer

string

Per-field analyzer override, e.g. "title_analyzer"

similarity

"BM25" | "CLASSIC"

"BM25"

Relevance model (see Relevance Scoring). BM25 is the default for newly created indexes; CLASSIC is the legacy term-coordination (match-count) model

bm25_k1

float

1.2

BM25 term-frequency saturation (must be >= 0). Higher values let repeated terms keep increasing the score

bm25_b

float

0.75

BM25 document-length normalization, in [0, 1] (0 disables length normalization)

<field>_boost

float

1.0

Per-field BM25 boost multiplier applied to field-qualified matches on that field (must be >= 0), e.g. "title_boost"

Common Analyzers

Analyzer Class Description

org.apache.lucene.analysis.standard.StandardAnalyzer

General-purpose tokenizer (default)

org.apache.lucene.analysis.en.EnglishAnalyzer

English stemming and stop words

org.apache.lucene.analysis.core.SimpleAnalyzer

Lowercase only, no stop words

org.apache.lucene.analysis.core.WhitespaceAnalyzer

Split on whitespace only

Per-Field Analyzers

For multi-property indexes, each field can use a different analyzer:

CREATE INDEX ON Article (title, body) FULL_TEXT
  METADATA {
    "analyzer": "org.apache.lucene.analysis.standard.StandardAnalyzer",
    "title_analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer"
  }

Here title uses the English analyzer while body falls back to the standard analyzer.

Searching

SEARCH_INDEX()

Searches a named full-text index using Lucene query syntax.

SELECT * FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java programming')

Signature: SEARCH_INDEX(indexName, query)

Parameter Type Description

indexName

string

The full index name as shown in the schema, e.g. Article[content]

query

string

A Lucene query string

SEARCH_FIELDS()

Finds the full-text index automatically from field names, without needing to know the index name.

SELECT * FROM Article
WHERE SEARCH_FIELDS(['title', 'body'], 'database tutorial')

Signature: SEARCH_FIELDS(fieldNames, query)

Parameter Type Description

fieldNames

array of strings

Fields to search; a full-text index covering these fields must exist

query

string

A Lucene query string

Lucene Query Syntax

Both SEARCH_INDEX and SEARCH_FIELDS accept standard Lucene query syntax.

Boolean Operators

Syntax Meaning

java programming

Either term (OR, default)

+java +programming

Both terms required (AND)

java -python

java required, python excluded

java AND programming

Explicit AND

java OR python

Explicit OR

-- Requires both terms
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', '+java +programming')

-- Excludes documents about python
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'java -python')

Phrase Queries

Wrap a phrase in double quotes to require terms to appear in order:

SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', '"machine learning"')

Wildcard Queries

Syntax Matches

data*

database, datastore, dataset…​

dat?base

database, datXbase…​

*base

Requires allowLeadingWildcard: true

SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'data*')

Fuzzy Queries

Append ~ to match terms within an edit distance:

SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'database~')
-- also matches terms within edit distance 2 (Lucene default), e.g. "databaSe", "databasee"

Field-Qualified Queries (Multi-Property Indexes)

For indexes over multiple fields, restrict a term to a specific field:

-- Only match "database" in the title field
SELECT * FROM Article WHERE SEARCH_INDEX('Article[title,body]', 'title:database')

-- Combine field-specific and general terms
SELECT * FROM Article WHERE SEARCH_INDEX('Article[title,body]', '+title:"multi model" -nosql')

Boosting Terms (Caret)

Append ^N to a term, phrase, field-qualified term, or parenthesized group to weight it more (or less, with N < 1) when scoring. This follows the Lucene/Elasticsearch convention and applies to BM25 indexes.

Syntax Meaning

java^3

Boost the term java 3x

java^0.5

De-emphasize the term java (any N < 1 lowers its weight)

"machine learning"^2

Boost the phrase 2x

title:java^5

Boost a field-qualified term 5x

(java OR python)^3

Boost a whole parenthesized group 3x

-- Make "java" count three times as much as "database" in the ranking
SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java^3 database') ORDER BY $score DESC

-- Boost a field-qualified term, combined with a required term
SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[title,body]', '+database title:java^5') ORDER BY $score DESC

-- Boost a whole OR group
SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[title,body]', '(title:java OR title:python)^3 body:cooking') ORDER BY $score DESC

The query-time caret combines multiplicatively with any configured per-field boost (see Relevance Scoring): the effective weight is caret * field_boost.

Relevance Scoring

Similarity Models: BM25 and CLASSIC

Full-text indexes rank results with one of two similarity models, selected by the similarity metadata option:

Model Description

BM25 (default)

Industry-standard Okapi BM25 ranking: combines term frequency (TF), inverse document frequency (IDF, so rare terms count more than common ones) and document-length normalization (shorter documents matching the same term rank higher). Tunable via bm25_k1 and bm25_b.

CLASSIC

Legacy term-coordination model: the score is simply the number of distinct query terms a document matches. Kept for backward compatibility.

BM25 is the default for newly created indexes. Configure it explicitly, optionally tuning bm25_k1/bm25_b:

-- BM25 with tuned parameters
CREATE INDEX ON Article (content) FULL_TEXT
  METADATA { "similarity": "BM25", "bm25_k1": 1.2, "bm25_b": 0.75 }

-- Keep the legacy match-count scoring
CREATE INDEX ON Article (content) FULL_TEXT
  METADATA { "similarity": "CLASSIC" }

Per-Field Boosts

On a multi-property index, give a field a default BM25 weight with <field>_boost, so matches in that field contribute more to the score:

CREATE INDEX ON Article (title, body) FULL_TEXT
  METADATA { "similarity": "BM25", "title_boost": 3.0 }

A field boost applies to field-qualified query terms on that field (e.g. title:java). Unqualified terms score against the field-agnostic token with boost 1.0. Per-field boosts and query-time caret boosts compose: the effective weight is caret * field_boost.

From the Java API

The same configuration is available on the index builder:

database.getSchema().buildTypeIndex("Article", new String[] { "title", "body" })
    .withType(Schema.INDEX_TYPE.FULL_TEXT).withFullTextType()
    .withBM25(1.2f, 0.75f)          // similarity = BM25, k1, b
    .withFieldBoost("title", 3.0f)  // per-field boost
    .create();

Use withSimilarity("CLASSIC") to select the legacy model instead of withBM25(…​).

Relevance Score ($score)

Every match carries a floating-point relevance score, exposed as the query variable $score. Use it in projections or ORDER BY clauses:

SELECT title, $score
FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java programming')
ORDER BY $score DESC
SELECT title, $score AS relevance
FROM Article
WHERE SEARCH_FIELDS(['content'], 'java programming')
ORDER BY relevance DESC
LIMIT 10

With BM25, documents matching rarer, more discriminative terms (and shorter documents matching the same term) rank higher. With CLASSIC, documents that match more distinct query terms rank higher.

$score is always a floating-point number, including for CLASSIC indexes (where it is the match count widened to a float, e.g. 3.0). Application code that previously read $score as an integer must read it as a Number/float.

Inspecting Scores with EXPLAIN / PROFILE

EXPLAIN (or PROFILE) of a query whose WHERE uses SEARCH_INDEX(…​) annotates the full-text fetch step with a SCORING line: the similarity model, the k1/b parameters, the corpus statistics, and, per query term, its document frequency (df), idf and applied boost.

EXPLAIN SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java database') = true

More Like This

"More Like This" finds documents similar to one or more source documents. It extracts representative terms from the sources, then searches for other documents sharing those terms.

The four full-text search functions (searchIndex, searchFields, searchIndexMore, searchFieldsMore) are registered under the fulltext.* namespace. The legacy snake_case names (search_index, search_fields, search_index_more, search_fields_more) remain available as aliases for backward compatibility. Both forms dispatch to the same implementation.

fulltext.searchIndexMore()

SELECT title, $score, $similarity
FROM Article
WHERE `fulltext.searchIndexMore`('Article[content]', [#10:3])
ORDER BY $similarity DESC

Signature: fulltext.searchIndexMore(indexName, sourceRIDs [, options]) (alias: search_index_more)

Parameter Type Description

indexName

string

Full-text index name

sourceRIDs

array of RIDs

One or more source documents

options

map

Optional MLT tuning options (see More Like This Configuration). Unknown keys are rejected.

fulltext.searchFieldsMore()

Same as fulltext.searchIndexMore but resolves the full-text index automatically from field names:

SELECT title, $similarity
FROM Article
WHERE `fulltext.searchFieldsMore`(['content'], [#10:3])
ORDER BY $similarity DESC

Signature: fulltext.searchFieldsMore(fieldNames, sourceRIDs [, options]) (alias: search_fields_more)

Parameter Type Description

fieldNames

array of strings

Fields to search; a full-text index covering these fields must exist

sourceRIDs

array of RIDs

One or more source documents

options

map

Optional MLT tuning options (see More Like This Configuration). Unknown keys are rejected.

Multiple Source Documents

Provide multiple RIDs to find documents similar to a combination of sources:

SELECT title, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:3, #10:4])
ORDER BY $similarity DESC

Similarity Score ($similarity)

$similarity is a normalized score from 0.0 to 1.0. The most similar document in the result set always receives 1.0.

SELECT title, $score, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:3])
ORDER BY $similarity DESC
LIMIT 5

More Like This Configuration

Pass an optional map of tuning options. Unknown keys are rejected with a descriptive error to catch typos; keys are case sensitive.

SELECT title, $similarity
FROM Article
WHERE `fulltext.searchFieldsMore`(['title', 'content'], [#10:3], {
  minTermFreq: 1,
  minDocFreq: 3,
  maxQueryTerms: 50,
  excludeSource: false
})
Option Type Default Description

minTermFreq

int

2

Minimum times a term must appear in the source document(s) to be considered

minDocFreq

int

5

Minimum number of index documents that must contain a term for it to be used

maxDocFreqPercent

float

null

Exclude terms appearing in more than this fraction of all documents (e.g. 0.5 = 50%)

maxQueryTerms

int

25

Maximum number of terms to use in the similarity query

minWordLen

int

0

Ignore terms shorter than this length (0 = no minimum)

maxWordLen

int

0

Ignore terms longer than this length (0 = no maximum)

boostByScore

boolean

true

Weight terms by TF-IDF score rather than raw frequency

excludeSource

boolean

true

Exclude source documents from results

maxSourceDocs

int

25

Maximum number of source RIDs allowed

Practical Examples

Blog Search with Ranking

SELECT title, author, $score AS relevance
FROM BlogPost
WHERE SEARCH_INDEX('BlogPost[title,body]', '+java +spring -legacy')
ORDER BY relevance DESC
LIMIT 20

Autocomplete with Prefix Wildcard

SELECT title
FROM Product
WHERE SEARCH_INDEX('Product[name]', 'micro*')

Stemming with English Analyzer

With EnglishAnalyzer, searching for "run" also matches "running", "runs", and "ran":

CREATE INDEX ON Article (content) FULL_TEXT
  METADATA { "analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer" }

SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'running')
-- also returns documents containing "run", "runs", "ran"
SELECT title, $similarity AS score
FROM Article
WHERE SEARCH_INDEX_MORE('Article[title,body]', [#10:42], {
  "minTermFreq": 1,
  "minDocFreq": 2,
  "maxQueryTerms": 30
})
ORDER BY score DESC
LIMIT 5

Readers Who Liked This Also Liked

SELECT title, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:1, #10:7, #10:12])
ORDER BY $similarity DESC
LIMIT 10

Hybrid Retrieval: Fusing BM25 with Vector Search

With native BM25, the full-text index becomes a first-class leg of hybrid retrieval: exact keyword relevance (BM25), dense semantic similarity, and learned-sparse recall can be merged into a single ranked top-K, server-side, with vector.fuse (see concepts/vector-search.adoc#hybrid-search).

This is the accuracy-first pattern behind enterprise document search - for example legal/government case files, where matching an exact statute or case number (keyword) matters as much as semantic similarity. Here a Case type carries the report content (full-text, BM25), a dense embedding, and learned-sparse tokens/weights. RRF fusion tolerates the very different score scales (BM25 in [0, +inf), cosine in [-1, 1]):

SELECT title, caseNumber, score
FROM (
  SELECT expand(`vector.fuse`(
      `vector.neighbors`('Case[dense]', :denseVec, 100),
      `vector.sparseNeighbors`('Case[tokens,weights]', :qIdx, :qVal, 100),
      (SELECT @rid, $score FROM Case WHERE SEARCH_INDEX('Case[content]', :keywords) = true),
      { fusion: 'RRF', k: 60 }
  ))
)
ORDER BY score DESC
LIMIT 10

Because the fused candidates are records, the graph can be expanded in the same statement - rank the most relevant cases, then traverse to their related cases, circulars, and evidence:

SELECT caseNumber, title
FROM (
  SELECT expand(out('RELATED_TO'))
  FROM (
    SELECT expand(`vector.fuse`(
        `vector.neighbors`('Case[dense]', :denseVec, 100),
        (SELECT @rid, $score FROM Case WHERE SEARCH_INDEX('Case[content]', :keywords) = true),
        { fusion: 'RRF' }
    )) LIMIT 10
  )
)

Maintenance: BM25 Corpus Statistics

BM25 maintains per-type corpus counters (document count and total document length) that feed the average-document-length normalizer. They are updated incrementally as documents are indexed and removed. Because they are not transactionally reversed on rollback, a workload with many rolled-back transactions can let them drift slightly over time, which only affects length normalization (suboptimal, never wrong, scores). The first BM25 query of a session self-corrects them with a cheap check; for an exact repair without a full reindex, run:

-- Re-derive the corpus statistics for one index...
REBUILD INDEX `Article[content]` WITH statsOnly = true

-- ...or for every full-text index
REBUILD INDEX * WITH statsOnly = true

This is also a good way to "pre-warm" the counters of a BM25 index migrated from an older version before serving traffic (otherwise the first query does a one-time full scan to build them).

Behavior Changes When Upgrading

When BM25 support was introduced, existing full-text indexes keep their CLASSIC ranking (BM25 is opt-in for existing data, see Getting BM25 on Existing Indexes), but the following user-visible behaviors changed for all full-text indexes, including CLASSIC ones. Review them before upgrading.

Change Impact Migration

$score is now a float

It was an Integer match count for CLASSIC indexes and is now the same value widened to a Float (e.g. 33.0). Code reading $score as a java.lang.Integer will fail (e.g. ClassCastException).

Read $score as a Number/Float, e.g. Number) result.getProperty("$score".floatValue(). SQL that only sorts/filters on $score is unaffected.

Cursor order is now most-relevant-first

The CLASSIC full-text cursor previously returned matches in ascending score order (least relevant first) and ignored the result limit. It now returns descending score (with a stable tiebreaker) and honors limit - consistent with BM25. Applications that consumed the raw Index.get(…​) order without an explicit ORDER BY will see it reversed, and a limit now truncates to the top-N.

Add an explicit ORDER BY $score ASC if you relied on the old order; SQL queries already using ORDER BY $score DESC are unaffected.

IndexCursorEntry identity excludes the score (extension API)

equals()/hashCode() now use only (record, keys). Code that put IndexCursorEntry objects into a Set (or used them as map keys) expecting score-differentiated entries now deduplicates on (record, keys) only.

Treat (record, keys) as the identity. Affects only code working directly with the low-level index cursor API.

Getting BM25 on Existing Indexes

Indexes created before BM25 support open and keep scoring with CLASSIC (apart from the behavior changes in Behavior Changes When Upgrading). To switch existing data to BM25, rebuild the index: term frequencies were never stored before, so there is no in-place migration.

REBUILD INDEX `Article[content]`

Notes

  • Full-text indexes require all indexed properties to be of type STRING.

  • Full-text indexes cannot be marked UNIQUE.

  • Without a METADATA block, newly created indexes use StandardAnalyzer with OR default operator, leading wildcards disabled, and BM25 similarity.

  • $score and $similarity are always available as query variables for matching documents; non-matching documents receive 0.

  • $score is a floating-point value (for CLASSIC indexes too, where it is the match count as a float).

  • BM25 is scored per bucket (like an Elasticsearch shard): on a multi-bucket type, $score is not globally calibrated across buckets, so prefer ORDER BY $score over comparing absolute scores across documents in different buckets.

  • The direct Index.get(text) Java API does simple token matching only; use SEARCH_INDEX(…​) for the full Lucene query syntax (boolean operators, phrases, wildcards, caret boosts).