Indexes
ArcadeDB supports multiple index algorithms, each optimized for different access patterns:
-
LSM Tree (default) — Optimized for range scans, ordered iteration, and write-heavy workloads.
-
Hash Index — O(1) equality lookups using extendable hashing. Best for primary key access, JOINs, and edge traversal where ordering is not needed.
LSM Tree algorithm
LSM tree is a type of data structure that is used to store and retrieve data efficiently. It works by organizing data in a tree-like structure, where each node in the tree represents a certain range of data.
Here’s how it works:
-
When you want to store a piece of data in the LSM tree, it first goes into a special part of the tree called a "write buffer." The write buffer is like a temporary storage area where new data is kept until it’s ready to be added to the tree.
-
When the write buffer gets full, the LSM tree will "flush" the data from the write buffer into the main part of the tree. This is done by creating a new node in the tree and adding the data from the write buffer to it.
-
As more and more data is added to the tree, it will eventually become too large to be stored in memory (this is known as "overflowing"). When this happens, the LSM tree will start to "compact" the data by moving some of it to disk storage. This allows the tree to continue growing without running out of memory.
-
When you want to retrieve a piece of data from the LSM tree, the algorithm will search for it in the write buffer, the main part of the tree, and any data that has been compacted to disk storage. If the data is found, it will be returned to you
LSM Tree vs B+Tree
B+Tree is the most common algorithm used by relational DBMSs. What are the differences?
-
LSM tree and B+ tree are both data structures that are commonly used to store and retrieve data efficiently. Here are some of the main advantages of LSM tree over B+ tree:
-
LSM tree is more efficient for writes: LSM tree uses a write buffer to temporarily store new data, which allows it to batch writes and reduce the number of disk accesses required. This can make it faster than B+ tree for inserting large amounts of data.
-
LSM tree is more efficient for compaction: Because LSM tree stores data in a sorted fashion, it can compact data more efficiently by simply merging sorted data sets. B+ tree, on the other hand, requires more complex rebalancing operations when compacting data.
-
LSM tree is more space-efficient: LSM tree stores data in a compact, sorted format, which can make it more space-efficient than B+ tree. This can be especially useful when storing large amounts of data on disk.
-
However, there are also some potential disadvantages of LSM tree compared to B+ tree. For example, B+ tree may be faster for queries that require range scans or random access, and it may be easier to implement in some cases.
If you’re interested to ArcadeDB’s LSM-Tree index implementation detail, look at LSM-Tree
Hash Index algorithm
ArcadeDB’s Hash Index uses extendable hashing, a disk-oriented algorithm that provides O(1) equality lookups with typically 1-2 page reads.
How it works
-
Each key is hashed to produce a binary hash code. The index maintains a global depth — the number of leading bits used from each hash.
-
A directory maps each possible bit prefix (2globalDepth entries) to a bucket page. Multiple directory entries can point to the same bucket if the bucket’s local depth is less than the global depth.
-
To look up a key, the index hashes the key, reads the directory entry for the corresponding prefix, and reads the target bucket page. The key is found via binary search within the bucket.
-
When a bucket overflows, it splits: the bucket’s local depth increases by one, and entries are redistributed based on the additional bit. If the local depth exceeds the global depth, the directory doubles in size.
When to use Hash Index vs LSM Tree
| Use Case | Hash Index | LSM Tree |
|---|---|---|
Point lookup ( |
Best — O(1), 1-2 page reads |
O(log N), multiple page reads |
JOINs and edge traversal |
Best — constant-time resolution |
Good |
Range scans ( |
Not supported |
Best — ordered iteration |
|
Not supported |
Best — natural ordering |
Bulk insertion throughput |
Consistent — no compaction |
May degrade at scale due to compaction |
Key properties
-
No compaction needed: Unlike LSM Tree, there are no background merge operations and no write amplification.
-
Local splits: When a bucket is full, only that bucket splits — no global reorganization.
-
Supports unique and non-unique: Both
UNIQUE_HASHandNOTUNIQUE_HASHmodes are available. -
Null strategy: Supports the same
NULL_STRATEGYoptions as LSM Tree (SKIP,ERROR,INDEX).
Case-Insensitive Indexes (COLLATE CI)
By default, ArcadeDB indexes are case-sensitive: "Hello" and "hello" are treated as different keys.
You can create case-insensitive indexes by using the COLLATE CI clause in the CREATE INDEX statement.
When COLLATE CI is specified for a property, the index stores values internally in lowercase (using the root locale).
This means:
-
Equality lookups are case-insensitive:
WHERE Name = 'Hello World'matches"Hello World","HELLO WORLD","hello world", etc. -
Range queries are case-insensitive:
WHERE Name > 'a'uses the lowercased form for comparison. -
Unique constraints are case-insensitive: A
UNIQUEindex withCOLLATE CIprevents inserting both"Admin"and"admin". -
Original values are preserved: The document still stores the original cased value; only the index key is lowercased.
Syntax
CREATE INDEX ON <type> (<property> COLLATE CI) <index-type>
In composite indexes, COLLATE CI can be specified independently per property:
CREATE INDEX ON Product (Name COLLATE CI, Code) UNIQUE
Here Name is case-insensitive while Code remains case-sensitive.
Query optimizer support
The query optimizer automatically recognizes queries that use .toLowerCase() on a property with a COLLATE CI index.
For example:
SELECT FROM Product WHERE Name.toLowerCase() = 'hello world'
This query will use the case-insensitive index on Name instead of performing a full scan, avoiding the overhead of calling toLowerCase() on every record.
When to use
-
User-facing lookups: Usernames, email addresses, product names — where the end user expects case-insensitive matching.
-
Deduplication: When you want a unique constraint that ignores case differences.
-
Replacing
toLowerCase()patterns: If your queries already use.toLowerCase()for case-insensitive matching, aCOLLATE CIindex is more efficient because the conversion happens once at insert time, not on every query.
Index Property Types
ArcadeDB indexes can index property fields in various ways:
-
All properties (scalar or collections) can be indexed in a unique or non-unique index (as a whole).
-
An index can be for a single property, or multiple properties in a compound index.
-
String properties can be index tokenized in a full-text index.
-
Object properties can be indexed by keys or by values.
-
List properties can be indexed by values.
-
Vectors of floats can be indexed using specialized vector indexes (HNSW or LSMVectorIndex).
-
WKT geometry strings can be indexed using a geospatial index.
JVectorLSMVectorIndex
LSMVectorIndex is a vector index built on ArcadeDB’s LSM Tree architecture, providing efficient persistent storage and retrieval of vector embeddings. Key features include:
-
Persistent Storage: Vector indexes are stored on disk with automatic page management and compaction
-
Configurable Similarity Functions: Supports COSINE (default), DOT_PRODUCT, and EUCLIDEAN distance metrics
-
Configurable ID Property: The property used to identify vertices can be customized (defaults to "id")
-
Multi-Language Query Support: Query vectors using
vector.neighbors()from SQL orCALL db.index.vector.queryNodes()from Cypher -
Automatic Compaction: Efficiently reclaims disk space through automatic compaction of immutable pages
-
High Performance: Leverages LSM Tree benefits for write efficiency and space optimization at scale