Indexes

ArcadeDB supports multiple index algorithms, each optimized for different access patterns:

  • LSM Tree (default) — Optimized for range scans, ordered iteration, and write-heavy workloads.

  • Hash Index — O(1) equality lookups using extendable hashing. Best for primary key access, JOINs, and edge traversal where ordering is not needed.

LSM Tree algorithm

LSM tree is a type of data structure that is used to store and retrieve data efficiently. It works by organizing data in a tree-like structure, where each node in the tree represents a certain range of data.

Here’s how it works:

  1. When you want to store a piece of data in the LSM tree, it first goes into a special part of the tree called a "write buffer." The write buffer is like a temporary storage area where new data is kept until it’s ready to be added to the tree.

  2. When the write buffer gets full, the LSM tree will "flush" the data from the write buffer into the main part of the tree. This is done by creating a new node in the tree and adding the data from the write buffer to it.

  3. As more and more data is added to the tree, it will eventually become too large to be stored in memory (this is known as "overflowing"). When this happens, the LSM tree will start to "compact" the data by moving some of it to disk storage. This allows the tree to continue growing without running out of memory.

  4. When you want to retrieve a piece of data from the LSM tree, the algorithm will search for it in the write buffer, the main part of the tree, and any data that has been compacted to disk storage. If the data is found, it will be returned to you

LSM Tree vs B+Tree

B+Tree is the most common algorithm used by relational DBMSs. What are the differences?

  1. LSM tree and B+ tree are both data structures that are commonly used to store and retrieve data efficiently. Here are some of the main advantages of LSM tree over B+ tree:

  2. LSM tree is more efficient for writes: LSM tree uses a write buffer to temporarily store new data, which allows it to batch writes and reduce the number of disk accesses required. This can make it faster than B+ tree for inserting large amounts of data.

  3. LSM tree is more efficient for compaction: Because LSM tree stores data in a sorted fashion, it can compact data more efficiently by simply merging sorted data sets. B+ tree, on the other hand, requires more complex rebalancing operations when compacting data.

  4. LSM tree is more space-efficient: LSM tree stores data in a compact, sorted format, which can make it more space-efficient than B+ tree. This can be especially useful when storing large amounts of data on disk.

  5. However, there are also some potential disadvantages of LSM tree compared to B+ tree. For example, B+ tree may be faster for queries that require range scans or random access, and it may be easier to implement in some cases.

If you’re interested to ArcadeDB’s LSM-Tree index implementation detail, look at LSM-Tree

Hash Index algorithm

ArcadeDB’s Hash Index uses extendable hashing, a disk-oriented algorithm that provides O(1) equality lookups with typically 1-2 page reads.

How it works

  1. Each key is hashed to produce a binary hash code. The index maintains a global depth — the number of leading bits used from each hash.

  2. A directory maps each possible bit prefix (2globalDepth entries) to a bucket page. Multiple directory entries can point to the same bucket if the bucket’s local depth is less than the global depth.

  3. To look up a key, the index hashes the key, reads the directory entry for the corresponding prefix, and reads the target bucket page. The key is found via binary search within the bucket.

  4. When a bucket overflows, it splits: the bucket’s local depth increases by one, and entries are redistributed based on the additional bit. If the local depth exceeds the global depth, the directory doubles in size.

When to use Hash Index vs LSM Tree

Use Case Hash Index LSM Tree

Point lookup (WHERE id = ?)

Best — O(1), 1-2 page reads

O(log N), multiple page reads

JOINs and edge traversal

Best — constant-time resolution

Good

Range scans (WHERE age > 30)

Not supported

Best — ordered iteration

ORDER BY on indexed field

Not supported

Best — natural ordering

Bulk insertion throughput

Consistent — no compaction

May degrade at scale due to compaction

Key properties

  • No compaction needed: Unlike LSM Tree, there are no background merge operations and no write amplification.

  • Local splits: When a bucket is full, only that bucket splits — no global reorganization.

  • Supports unique and non-unique: Both UNIQUE_HASH and NOTUNIQUE_HASH modes are available.

  • Null strategy: Supports the same NULL_STRATEGY options as LSM Tree (SKIP, ERROR, INDEX).

Case-Insensitive Indexes (COLLATE CI)

By default, ArcadeDB indexes are case-sensitive: "Hello" and "hello" are treated as different keys. You can create case-insensitive indexes by using the COLLATE CI clause in the CREATE INDEX statement.

When COLLATE CI is specified for a property, the index stores values internally in lowercase (using the root locale). This means:

  • Equality lookups are case-insensitive: WHERE Name = 'Hello World' matches "Hello World", "HELLO WORLD", "hello world", etc.

  • Range queries are case-insensitive: WHERE Name > 'a' uses the lowercased form for comparison.

  • Unique constraints are case-insensitive: A UNIQUE index with COLLATE CI prevents inserting both "Admin" and "admin".

  • Original values are preserved: The document still stores the original cased value; only the index key is lowercased.

Syntax

CREATE INDEX ON <type> (<property> COLLATE CI) <index-type>

In composite indexes, COLLATE CI can be specified independently per property:

CREATE INDEX ON Product (Name COLLATE CI, Code) UNIQUE

Here Name is case-insensitive while Code remains case-sensitive.

Query optimizer support

The query optimizer automatically recognizes queries that use .toLowerCase() on a property with a COLLATE CI index. For example:

SELECT FROM Product WHERE Name.toLowerCase() = 'hello world'

This query will use the case-insensitive index on Name instead of performing a full scan, avoiding the overhead of calling toLowerCase() on every record.

When to use

  • User-facing lookups: Usernames, email addresses, product names — where the end user expects case-insensitive matching.

  • Deduplication: When you want a unique constraint that ignores case differences.

  • Replacing toLowerCase() patterns: If your queries already use .toLowerCase() for case-insensitive matching, a COLLATE CI index is more efficient because the conversion happens once at insert time, not on every query.

Index Property Types

ArcadeDB indexes can index property fields in various ways:

  1. All properties (scalar or collections) can be indexed in a unique or non-unique index (as a whole).

  2. An index can be for a single property, or multiple properties in a compound index.

  3. String properties can be index tokenized in a full-text index.

  4. Object properties can be indexed by keys or by values.

  5. List properties can be indexed by values.

  6. Vectors of floats can be indexed using specialized vector indexes (HNSW or LSMVectorIndex).

  7. WKT geometry strings can be indexed using a geospatial index.

JVectorLSMVectorIndex

LSMVectorIndex is a vector index built on ArcadeDB’s LSM Tree architecture, providing efficient persistent storage and retrieval of vector embeddings. Key features include:

  • Persistent Storage: Vector indexes are stored on disk with automatic page management and compaction

  • Configurable Similarity Functions: Supports COSINE (default), DOT_PRODUCT, and EUCLIDEAN distance metrics

  • Configurable ID Property: The property used to identify vertices can be customized (defaults to "id")

  • Multi-Language Query Support: Query vectors using vector.neighbors() from SQL or CALL db.index.vector.queryNodes() from Cypher

  • Automatic Compaction: Efficiently reclaims disk space through automatic compaction of immutable pages

  • High Performance: Leverages LSM Tree benefits for write efficiency and space optimization at scale