Full-Text Index
ArcadeDB’s full-text index is built on Apache Lucene and supports configurable analyzers, advanced Lucene query syntax, multi-property indexing, relevance scoring, and "More Like This" similarity search.
Creating a Full-Text Index
CREATE INDEX ON Article (content) FULL_TEXT
Multiple properties can be indexed together:
CREATE INDEX ON Article (title, body) FULL_TEXT
Configuring the Analyzer
Pass a METADATA block to choose a Lucene analyzer:
CREATE INDEX ON Article (content) FULL_TEXT
METADATA {
"analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer",
"allowLeadingWildcard": false,
"defaultOperator": "OR"
}
Metadata Options
| Option | Type | Default | Description |
|---|---|---|---|
|
string |
|
Lucene analyzer class used for both indexing and querying |
|
string |
— |
Override analyzer used only at index time |
|
string |
— |
Override analyzer used only at query time |
|
boolean |
|
Allow |
|
|
|
Default operator between terms when none is specified |
|
string |
— |
Per-field analyzer override, e.g. |
Common Analyzers
| Analyzer Class | Description |
|---|---|
|
General-purpose tokenizer (default) |
|
English stemming and stop words |
|
Lowercase only, no stop words |
|
Split on whitespace only |
Per-Field Analyzers
For multi-property indexes, each field can use a different analyzer:
CREATE INDEX ON Article (title, body) FULL_TEXT
METADATA {
"analyzer": "org.apache.lucene.analysis.standard.StandardAnalyzer",
"title_analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer"
}
Here title uses the English analyzer while body falls back to the standard analyzer.
Searching
SEARCH_INDEX()
Searches a named full-text index using Lucene query syntax.
SELECT * FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java programming')
Signature: SEARCH_INDEX(indexName, query)
| Parameter | Type | Description |
|---|---|---|
|
string |
The full index name as shown in the schema, e.g. |
|
string |
A Lucene query string |
SEARCH_FIELDS()
Finds the full-text index automatically from field names, without needing to know the index name.
SELECT * FROM Article
WHERE SEARCH_FIELDS(['title', 'body'], 'database tutorial')
Signature: SEARCH_FIELDS(fieldNames, query)
| Parameter | Type | Description |
|---|---|---|
|
array of strings |
Fields to search; a full-text index covering these fields must exist |
|
string |
A Lucene query string |
Lucene Query Syntax
Both SEARCH_INDEX and SEARCH_FIELDS accept standard Lucene query syntax.
Boolean Operators
| Syntax | Meaning |
|---|---|
|
Either term (OR, default) |
|
Both terms required (AND) |
|
|
|
Explicit AND |
|
Explicit OR |
-- Requires both terms
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', '+java +programming')
-- Excludes documents about python
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'java -python')
Phrase Queries
Wrap a phrase in double quotes to require terms to appear in order:
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', '"machine learning"')
Wildcard Queries
| Syntax | Matches |
|---|---|
|
database, datastore, dataset… |
|
database, datXbase… |
|
Requires |
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'data*')
Fuzzy Queries
Append ~ to match terms within an edit distance:
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'database~')
-- also matches terms within edit distance 2 (Lucene default), e.g. "databaSe", "databasee"
Field-Qualified Queries (Multi-Property Indexes)
For indexes over multiple fields, restrict a term to a specific field:
-- Only match "database" in the title field
SELECT * FROM Article WHERE SEARCH_INDEX('Article[title,body]', 'title:database')
-- Combine field-specific and general terms
SELECT * FROM Article WHERE SEARCH_INDEX('Article[title,body]', '+title:"multi model" -nosql')
Relevance Score ($score)
Every match carries a relevance score.
Use $score in projections or ORDER BY clauses:
SELECT title, $score
FROM Article
WHERE SEARCH_INDEX('Article[content]', 'java programming')
ORDER BY $score DESC
Documents that match more query terms receive higher scores.
SELECT title, $score AS relevance
FROM Article
WHERE SEARCH_FIELDS(['content'], 'java programming')
ORDER BY relevance DESC
LIMIT 10
More Like This
"More Like This" finds documents similar to one or more source documents. It extracts representative terms from the sources, then searches for other documents sharing those terms.
|
The four full-text search functions ( |
fulltext.searchIndexMore()
SELECT title, $score, $similarity
FROM Article
WHERE `fulltext.searchIndexMore`('Article[content]', [#10:3])
ORDER BY $similarity DESC
Signature: fulltext.searchIndexMore(indexName, sourceRIDs [, options]) (alias: search_index_more)
| Parameter | Type | Description |
|---|---|---|
|
string |
Full-text index name |
|
array of RIDs |
One or more source documents |
|
map |
Optional MLT tuning options (see More Like This Configuration). Unknown keys are rejected. |
fulltext.searchFieldsMore()
Same as fulltext.searchIndexMore but resolves the full-text index automatically from field names:
SELECT title, $similarity
FROM Article
WHERE `fulltext.searchFieldsMore`(['content'], [#10:3])
ORDER BY $similarity DESC
Signature: fulltext.searchFieldsMore(fieldNames, sourceRIDs [, options]) (alias: search_fields_more)
| Parameter | Type | Description |
|---|---|---|
|
array of strings |
Fields to search; a full-text index covering these fields must exist |
|
array of RIDs |
One or more source documents |
|
map |
Optional MLT tuning options (see More Like This Configuration). Unknown keys are rejected. |
Multiple Source Documents
Provide multiple RIDs to find documents similar to a combination of sources:
SELECT title, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:3, #10:4])
ORDER BY $similarity DESC
Similarity Score ($similarity)
$similarity is a normalized score from 0.0 to 1.0.
The most similar document in the result set always receives 1.0.
SELECT title, $score, $similarity
FROM Article
WHERE SEARCH_INDEX_MORE('Article[content]', [#10:3])
ORDER BY $similarity DESC
LIMIT 5
More Like This Configuration
Pass an optional map of tuning options. Unknown keys are rejected with a descriptive error to catch typos; keys are case sensitive.
SELECT title, $similarity
FROM Article
WHERE `fulltext.searchFieldsMore`(['title', 'content'], [#10:3], {
minTermFreq: 1,
minDocFreq: 3,
maxQueryTerms: 50,
excludeSource: false
})
| Option | Type | Default | Description |
|---|---|---|---|
|
int |
|
Minimum times a term must appear in the source document(s) to be considered |
|
int |
|
Minimum number of index documents that must contain a term for it to be used |
|
float |
|
Exclude terms appearing in more than this fraction of all documents (e.g. |
|
int |
|
Maximum number of terms to use in the similarity query |
|
int |
|
Ignore terms shorter than this length (0 = no minimum) |
|
int |
|
Ignore terms longer than this length (0 = no maximum) |
|
boolean |
|
Weight terms by TF-IDF score rather than raw frequency |
|
boolean |
|
Exclude source documents from results |
|
int |
|
Maximum number of source RIDs allowed |
Practical Examples
Blog Search with Ranking
SELECT title, author, $score AS relevance
FROM BlogPost
WHERE SEARCH_INDEX('BlogPost[title,body]', '+java +spring -legacy')
ORDER BY relevance DESC
LIMIT 20
Autocomplete with Prefix Wildcard
SELECT title
FROM Product
WHERE SEARCH_INDEX('Product[name]', 'micro*')
Stemming with English Analyzer
With EnglishAnalyzer, searching for "run" also matches "running", "runs", and "ran":
CREATE INDEX ON Article (content) FULL_TEXT
METADATA { "analyzer": "org.apache.lucene.analysis.en.EnglishAnalyzer" }
SELECT * FROM Article WHERE SEARCH_INDEX('Article[content]', 'running')
-- also returns documents containing "run", "runs", "ran"
Notes
-
Full-text indexes require all indexed properties to be of type
STRING. -
Full-text indexes cannot be marked
UNIQUE. -
Without a
METADATAblock, indexes useStandardAnalyzerwithORdefault operator and leading wildcards disabled. -
Indexes created without metadata continue to work with
SEARCH_INDEXandSEARCH_INDEX_MOREexactly as before. -
$scoreand$similarityare always available as query variables for matching documents; non-matching documents receive0.