Full-Text Index
ArcadeDB’s full-text index is built on Apache Lucene and supports configurable analyzers, advanced Lucene query syntax, multi-property indexing, native BM25 relevance scoring (with tunable parameters and per-field boosts), and "More Like This" similarity search.
Creating a Full-Text Index
CREATE INDEX ON Article (content) FULL_TEXT
Multiple properties can be indexed together:
CREATE INDEX ON Article (title, body) FULL_TEXT
Configuring the Analyzer
Pass a METADATA block to choose a Lucene analyzer:
CREATE INDEX ON Article (content) FULL_TEXT
METADATA {
"analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer",
"allowLeadingWildcard": false,
"defaultOperator": "OR"
}
Metadata Options
| Option | Type | Default | Description |
|---|---|---|---|
|
string |
|
Lucene analyzer class used for both indexing and querying |
|
string |
— |
Override analyzer used only at index time |
|
string |
— |
Override analyzer used only at query time |
|
boolean |
|
Allow |
|
|
|
Default operator between terms when none is specified |
|
string |
— |
Per-field analyzer override, e.g. |
|
|
|
Relevance model (see Relevance Scoring). |
|
float |
|
BM25 term-frequency saturation (must be >= 0). Higher values let repeated terms keep increasing the score |
|
float |
|
BM25 document-length normalization, in |
|
float |
|
Per-field BM25 boost multiplier applied to field-qualified matches on that field (must be >= 0), e.g. |
Common Analyzers
| Analyzer Class | Description |
|---|---|
|
General-purpose tokenizer (default) |
|
English stemming and stop words |
|
Lowercase only, no stop words |
|
Split on whitespace only |
Per-Field Analyzers
For multi-property indexes, each field can use a different analyzer:
CREATE INDEX ON Article (title, body) FULL_TEXT
METADATA {
"analyzer": "org.apache.lucene.analysis.standard.StandardAnalyzer",
"title_analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer"
}
Here title uses the English analyzer while body falls back to the standard analyzer.
Searching
SEARCH_INDEX()
Searches a named full-text index using Lucene query syntax.
SELECT * FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java programming')
Signature: SEARCH_INDEX(indexName, query)
| Parameter | Type | Description |
|---|---|---|
|
string |
The full index name as shown in the schema, e.g. |
|
string |
A Lucene query string |
SEARCH_FIELDS()
Finds the full-text index automatically from field names, without needing to know the index name.
SELECT * FROM Article
WHERE SEARCH_FIELDS(['title', 'body'], 'database tutorial')
Signature: SEARCH_FIELDS(fieldNames, query)
| Parameter | Type | Description |
|---|---|---|
|
array of strings |
Fields to search; a full-text index covering these fields must exist |
|
string |
A Lucene query string |
Lucene Query Syntax
Both SEARCH_INDEX and SEARCH_FIELDS accept standard Lucene query syntax.
Boolean Operators
| Syntax | Meaning |
|---|---|
|
Either term (OR, default) |
|
Both terms required (AND) |
|
|
|
Explicit AND |
|
Explicit OR |
-- Requires both terms
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', '+java +programming')
-- Excludes documents about python
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'java -python')
Phrase Queries
Wrap a phrase in double quotes to require terms to appear in order:
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', '"machine learning"')
Wildcard Queries
| Syntax | Matches |
|---|---|
|
database, datastore, dataset… |
|
database, datXbase… |
|
Requires |
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'data*')
Fuzzy Queries
Append ~ to match terms within an edit distance:
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'database~')
-- also matches terms within edit distance 2 (Lucene default), e.g. "databaSe", "databasee"
Field-Qualified Queries (Multi-Property Indexes)
For indexes over multiple fields, restrict a term to a specific field:
-- Only match "database" in the title field
SELECT * FROM Article WHERE SEARCH_INDEX('Article[title,body]', 'title:database')
-- Combine field-specific and general terms
SELECT * FROM Article WHERE SEARCH_INDEX('Article[title,body]', '+title:"multi model" -nosql')
Boosting Terms (Caret)
Append ^N to a term, phrase, field-qualified term, or parenthesized group to weight it more (or less, with N < 1) when scoring.
This follows the Lucene/Elasticsearch convention and applies to BM25 indexes.
| Syntax | Meaning |
|---|---|
|
Boost the term |
|
De-emphasize the term |
|
Boost the phrase 2x |
|
Boost a field-qualified term 5x |
|
Boost a whole parenthesized group 3x |
-- Make "java" count three times as much as "database" in the ranking
SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java^3 database') ORDER BY $score DESC
-- Boost a field-qualified term, combined with a required term
SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[title,body]', '+database title:java^5') ORDER BY $score DESC
-- Boost a whole OR group
SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[title,body]', '(title:java OR title:python)^3 body:cooking') ORDER BY $score DESC
The query-time caret combines multiplicatively with any configured per-field boost (see Relevance Scoring): the effective weight is caret * field_boost.
Relevance Scoring
Similarity Models: BM25 and CLASSIC
Full-text indexes rank results with one of two similarity models, selected by the similarity metadata option:
| Model | Description |
|---|---|
|
Industry-standard Okapi BM25 ranking: combines term frequency (TF), inverse document frequency (IDF, so rare terms count more than common ones) and document-length normalization (shorter documents matching the same term rank higher). Tunable via |
|
Legacy term-coordination model: the score is simply the number of distinct query terms a document matches. Kept for backward compatibility. |
BM25 is the default for newly created indexes. Configure it explicitly, optionally tuning bm25_k1/bm25_b:
-- BM25 with tuned parameters
CREATE INDEX ON Article (content) FULL_TEXT
METADATA { "similarity": "BM25", "bm25_k1": 1.2, "bm25_b": 0.75 }
-- Keep the legacy match-count scoring
CREATE INDEX ON Article (content) FULL_TEXT
METADATA { "similarity": "CLASSIC" }
Per-Field Boosts
On a multi-property index, give a field a default BM25 weight with <field>_boost, so matches in that field contribute more to the score:
CREATE INDEX ON Article (title, body) FULL_TEXT
METADATA { "similarity": "BM25", "title_boost": 3.0 }
A field boost applies to field-qualified query terms on that field (e.g. title:java). Unqualified terms score against the field-agnostic token with boost 1.0. Per-field boosts and query-time caret boosts compose: the effective weight is caret * field_boost.
From the Java API
The same configuration is available on the index builder:
database.getSchema().buildTypeIndex("Article", new String[] { "title", "body" })
.withType(Schema.INDEX_TYPE.FULL_TEXT).withFullTextType()
.withBM25(1.2f, 0.75f) // similarity = BM25, k1, b
.withFieldBoost("title", 3.0f) // per-field boost
.create();
Use withSimilarity("CLASSIC") to select the legacy model instead of withBM25(…).
Relevance Score ($score)
Every match carries a floating-point relevance score, exposed as the query variable $score.
Use it in projections or ORDER BY clauses:
SELECT title, $score
FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java programming')
ORDER BY $score DESC
SELECT title, $score AS relevance
FROM Article
WHERE SEARCH_FIELDS(['content'], 'java programming')
ORDER BY relevance DESC
LIMIT 10
With BM25, documents matching rarer, more discriminative terms (and shorter documents matching the same term) rank higher. With CLASSIC, documents that match more distinct query terms rank higher.
|
|
Inspecting Scores with EXPLAIN / PROFILE
EXPLAIN (or PROFILE) of a query whose WHERE uses SEARCH_INDEX(…) annotates the full-text fetch step with a SCORING line: the similarity model, the k1/b parameters, the corpus statistics, and, per query term, its document frequency (df), idf and applied boost.
EXPLAIN SELECT title, $score FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java database') = true
More Like This
"More Like This" finds documents similar to one or more source documents. It extracts representative terms from the sources, then searches for other documents sharing those terms.
|
The four full-text search functions ( |
fulltext.searchIndexMore()
SELECT title, $score, $similarity
FROM Article
WHERE `fulltext.searchIndexMore`('Article[content]', [#10:3])
ORDER BY $similarity DESC
Signature: fulltext.searchIndexMore(indexName, sourceRIDs [, options]) (alias: search_index_more)
| Parameter | Type | Description |
|---|---|---|
|
string |
Full-text index name |
|
array of RIDs |
One or more source documents |
|
map |
Optional MLT tuning options (see More Like This Configuration). Unknown keys are rejected. |
fulltext.searchFieldsMore()
Same as fulltext.searchIndexMore but resolves the full-text index automatically from field names:
SELECT title, $similarity
FROM Article
WHERE `fulltext.searchFieldsMore`(['content'], [#10:3])
ORDER BY $similarity DESC
Signature: fulltext.searchFieldsMore(fieldNames, sourceRIDs [, options]) (alias: search_fields_more)
| Parameter | Type | Description |
|---|---|---|
|
array of strings |
Fields to search; a full-text index covering these fields must exist |
|
array of RIDs |
One or more source documents |
|
map |
Optional MLT tuning options (see More Like This Configuration). Unknown keys are rejected. |
Multiple Source Documents
Provide multiple RIDs to find documents similar to a combination of sources:
SELECT title, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:3, #10:4])
ORDER BY $similarity DESC
Similarity Score ($similarity)
$similarity is a normalized score from 0.0 to 1.0.
The most similar document in the result set always receives 1.0.
SELECT title, $score, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:3])
ORDER BY $similarity DESC
LIMIT 5
More Like This Configuration
Pass an optional map of tuning options. Unknown keys are rejected with a descriptive error to catch typos; keys are case sensitive.
SELECT title, $similarity
FROM Article
WHERE `fulltext.searchFieldsMore`(['title', 'content'], [#10:3], {
minTermFreq: 1,
minDocFreq: 3,
maxQueryTerms: 50,
excludeSource: false
})
| Option | Type | Default | Description |
|---|---|---|---|
|
int |
|
Minimum times a term must appear in the source document(s) to be considered |
|
int |
|
Minimum number of index documents that must contain a term for it to be used |
|
float |
|
Exclude terms appearing in more than this fraction of all documents (e.g. |
|
int |
|
Maximum number of terms to use in the similarity query |
|
int |
|
Ignore terms shorter than this length (0 = no minimum) |
|
int |
|
Ignore terms longer than this length (0 = no maximum) |
|
boolean |
|
Weight terms by TF-IDF score rather than raw frequency |
|
boolean |
|
Exclude source documents from results |
|
int |
|
Maximum number of source RIDs allowed |
Practical Examples
Blog Search with Ranking
SELECT title, author, $score AS relevance
FROM BlogPost
WHERE SEARCH_INDEX('BlogPost[title,body]', '+java +spring -legacy')
ORDER BY relevance DESC
LIMIT 20
Autocomplete with Prefix Wildcard
SELECT title
FROM Product
WHERE SEARCH_INDEX('Product[name]', 'micro*')
Stemming with English Analyzer
With EnglishAnalyzer, searching for "run" also matches "running", "runs", and "ran":
CREATE INDEX ON Article (content) FULL_TEXT
METADATA { "analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer" }
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'running')
-- also returns documents containing "run", "runs", "ran"
Related Articles Recommendation
SELECT title, $similarity AS score
FROM Article
WHERE SEARCH_INDEX_MORE('Article[title,body]', [#10:42], {
"minTermFreq": 1,
"minDocFreq": 2,
"maxQueryTerms": 30
})
ORDER BY score DESC
LIMIT 5
Readers Who Liked This Also Liked
SELECT title, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:1, #10:7, #10:12])
ORDER BY $similarity DESC
LIMIT 10
Hybrid Retrieval: Fusing BM25 with Vector Search
With native BM25, the full-text index becomes a first-class leg of hybrid retrieval: exact keyword relevance (BM25), dense semantic similarity, and learned-sparse recall can be merged into a single ranked top-K, server-side, with vector.fuse (see concepts/vector-search.adoc#hybrid-search).
This is the accuracy-first pattern behind enterprise document search - for example legal/government case files, where matching an exact statute or case number (keyword) matters as much as semantic similarity. Here a Case type carries the report content (full-text, BM25), a dense embedding, and learned-sparse tokens/weights. RRF fusion tolerates the very different score scales (BM25 in [0, +inf), cosine in [-1, 1]):
SELECT title, caseNumber, score
FROM (
SELECT expand(`vector.fuse`(
`vector.neighbors`('Case[dense]', :denseVec, 100),
`vector.sparseNeighbors`('Case[tokens,weights]', :qIdx, :qVal, 100),
(SELECT @rid, $score FROM Case WHERE SEARCH_INDEX('Case[content]', :keywords) = true),
{ fusion: 'RRF', k: 60 }
))
)
ORDER BY score DESC
LIMIT 10
Because the fused candidates are records, the graph can be expanded in the same statement - rank the most relevant cases, then traverse to their related cases, circulars, and evidence:
SELECT caseNumber, title
FROM (
SELECT expand(out('RELATED_TO'))
FROM (
SELECT expand(`vector.fuse`(
`vector.neighbors`('Case[dense]', :denseVec, 100),
(SELECT @rid, $score FROM Case WHERE SEARCH_INDEX('Case[content]', :keywords) = true),
{ fusion: 'RRF' }
)) LIMIT 10
)
)
Maintenance: BM25 Corpus Statistics
BM25 maintains per-type corpus counters (document count and total document length) that feed the average-document-length normalizer. They are updated incrementally as documents are indexed and removed. Because they are not transactionally reversed on rollback, a workload with many rolled-back transactions can let them drift slightly over time, which only affects length normalization (suboptimal, never wrong, scores). The first BM25 query of a session self-corrects them with a cheap check; for an exact repair without a full reindex, run:
-- Re-derive the corpus statistics for one index...
REBUILD INDEX `Article[content]` WITH statsOnly = true
-- ...or for every full-text index
REBUILD INDEX * WITH statsOnly = true
This is also a good way to "pre-warm" the counters of a BM25 index migrated from an older version before serving traffic (otherwise the first query does a one-time full scan to build them).
Behavior Changes When Upgrading
When BM25 support was introduced, existing full-text indexes keep their CLASSIC ranking (BM25 is opt-in for existing data, see Getting BM25 on Existing Indexes), but the following user-visible behaviors changed for all full-text indexes, including CLASSIC ones. Review them before upgrading.
| Change | Impact | Migration |
|---|---|---|
|
It was an |
Read |
Cursor order is now most-relevant-first |
The CLASSIC full-text cursor previously returned matches in ascending score order (least relevant first) and ignored the result |
Add an explicit |
|
|
Treat |
Getting BM25 on Existing Indexes
Indexes created before BM25 support open and keep scoring with CLASSIC (apart from the behavior changes in Behavior Changes When Upgrading).
To switch existing data to BM25, rebuild the index: term frequencies were never stored before, so there is no in-place migration.
REBUILD INDEX `Article[content]`
Notes
-
Full-text indexes require all indexed properties to be of type
STRING. -
Full-text indexes cannot be marked
UNIQUE. -
Without a
METADATAblock, newly created indexes useStandardAnalyzerwithORdefault operator, leading wildcards disabled, and BM25 similarity. -
$scoreand$similarityare always available as query variables for matching documents; non-matching documents receive0. -
$scoreis a floating-point value (forCLASSICindexes too, where it is the match count as a float). -
BM25 is scored per bucket (like an Elasticsearch shard): on a multi-bucket type,
$scoreis not globally calibrated across buckets, so preferORDER BY $scoreover comparing absolute scores across documents in different buckets. -
The direct
Index.get(text)Java API does simple token matching only; useSEARCH_INDEX(…)for the full Lucene query syntax (boolean operators, phrases, wildcards, caret boosts).