Vector Search Tutorial
This tutorial walks you through building a semantic search system with ArcadeDB. You will create vector embeddings, index them, query by similarity, and combine vector search with graph traversal.
What You Will Build
A product catalog with semantic search: given a query like "portable computing device", find the most relevant products by embedding similarity rather than keyword matching.
Prerequisites
- ArcadeDB running (Docker or binary install)
- A way to send queries (Console, HTTP API, or a Python/JavaScript client)
Step 1: Create the Schema
Create a vertex type with a vector property:
CREATE VERTEX TYPE Product
CREATE PROPERTY Product.name STRING
CREATE PROPERTY Product.category STRING
CREATE PROPERTY Product.embedding LIST OF FLOAT
Step 2: Create a Vector Index
Create an LSM_VECTOR index on the embedding property. Specify the number of dimensions and the similarity metric:
CREATE INDEX ON Product (embedding) LSM_VECTOR METADATA {
dimensions: 4,
similarity: 'COSINE'
}
In production, embeddings typically have 384-1536 dimensions; this tutorial uses 4 for simplicity. For production workloads, add quantization: 'INT8' for significantly better search performance (see concepts/vector-search.adoc#quantization-performance).
Step 3: Insert Data with Embeddings
Insert products with pre-computed embedding vectors:
CREATE VERTEX Product SET name = 'Laptop', category = 'Electronics', embedding = [0.9, 0.1, 0.8, 0.2]
CREATE VERTEX Product SET name = 'Tablet', category = 'Electronics', embedding = [0.85, 0.15, 0.75, 0.25]
CREATE VERTEX Product SET name = 'Smartphone', category = 'Electronics', embedding = [0.8, 0.2, 0.7, 0.3]
CREATE VERTEX Product SET name = 'Headphones', category = 'Electronics', embedding = [0.6, 0.4, 0.5, 0.5]
CREATE VERTEX Product SET name = 'Novel', category = 'Books', embedding = [0.1, 0.9, 0.2, 0.8]
CREATE VERTEX Product SET name = 'Textbook', category = 'Books', embedding = [0.2, 0.8, 0.3, 0.7]
CREATE VERTEX Product SET name = 'Running Shoes', category = 'Sports', embedding = [0.3, 0.5, 0.9, 0.1]
CREATE VERTEX Product SET name = 'Yoga Mat', category = 'Sports', embedding = [0.25, 0.55, 0.85, 0.15]
In a real application, you would generate embeddings using an external model such as OpenAI’s text-embedding-3-small (1536 dimensions) or Sentence Transformers' all-MiniLM-L6-v2 (384 dimensions).
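If you script the inserts, you can send the same commands through ArcadeDB's HTTP API. The sketch below only builds the JSON payload for the command endpoint; the endpoint path (/api/v1/command/{database}) and the :name parameter-binding syntax are assumptions to verify against your server version.

```python
import json

def build_insert_payload(name, category, embedding):
    # Payload for POST /api/v1/command/{database}; endpoint path and
    # :param binding are assumptions -- check your ArcadeDB version's docs.
    return {
        "language": "sql",
        "command": ("CREATE VERTEX Product SET name = :name, "
                    "category = :category, embedding = :embedding"),
        "params": {"name": name, "category": category, "embedding": embedding},
    }

payload = build_insert_payload("Laptop", "Electronics", [0.9, 0.1, 0.8, 0.2])
body = json.dumps(payload)  # send with e.g. requests.post(url, data=body, auth=...)
```

Binding the embedding as a parameter avoids string-formatting a long float list into the SQL text.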
Step 4: Query by Similarity
Find the 3 products most similar to a query vector:
SELECT name, category, distance FROM (
SELECT expand(vectorNeighbors('Product[embedding]', [0.88, 0.12, 0.78, 0.22], 3))
)
The vectorNeighbors() function returns a list of results; expand() flattens it into individual rows so you can access properties like name, category, and distance directly. The query vector [0.88, 0.12, 0.78, 0.22] sits close to the electronics cluster, so the query should return Laptop, Tablet, and Smartphone, the three most similar items by cosine similarity.
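To see why those three win, you can reproduce the ranking by hand. This standalone Python sketch recomputes cosine distance over the Step 3 vectors; it mirrors what the index computes, without ArcadeDB.

```python
from math import sqrt

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# The eight products and embeddings from Step 3.
products = {
    "Laptop": [0.9, 0.1, 0.8, 0.2],
    "Tablet": [0.85, 0.15, 0.75, 0.25],
    "Smartphone": [0.8, 0.2, 0.7, 0.3],
    "Headphones": [0.6, 0.4, 0.5, 0.5],
    "Novel": [0.1, 0.9, 0.2, 0.8],
    "Textbook": [0.2, 0.8, 0.3, 0.7],
    "Running Shoes": [0.3, 0.5, 0.9, 0.1],
    "Yoga Mat": [0.25, 0.55, 0.85, 0.15],
}
query = [0.88, 0.12, 0.78, 0.22]
top3 = sorted(products, key=lambda n: cosine_distance(products[n], query))[:3]
# top3 -> ['Laptop', 'Tablet', 'Smartphone']
```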
Step 5: Add Graph Relationships
Make the example more interesting by adding edges between products:
CREATE EDGE TYPE FREQUENTLY_BOUGHT_WITH
CREATE EDGE TYPE SIMILAR_TO
CREATE EDGE FREQUENTLY_BOUGHT_WITH
FROM (SELECT FROM Product WHERE name = 'Laptop')
TO (SELECT FROM Product WHERE name = 'Headphones')
CREATE EDGE SIMILAR_TO
FROM (SELECT FROM Product WHERE name = 'Laptop')
TO (SELECT FROM Product WHERE name = 'Tablet')
CREATE EDGE FREQUENTLY_BOUGHT_WITH
FROM (SELECT FROM Product WHERE name = 'Novel')
TO (SELECT FROM Product WHERE name = 'Textbook')
Step 6: Combine Vector Search with Graph Traversal
Find similar products, then expand recommendations through graph relationships:
-- Step 1: Find top 3 by vector similarity
SELECT name, category, distance FROM (
SELECT expand(vectorNeighbors('Product[embedding]', [0.88, 0.12, 0.78, 0.22], 3))
)
Then traverse from those results to find related products:
-- Step 2: Get products frequently bought with the top match
MATCH {type: Product, as: product, where: (name = 'Laptop')}
  -FREQUENTLY_BOUGHT_WITH-> {type: Product, as: friend}
RETURN friend.name, friend.category
This two-step pattern — vector search to find semantically similar items, then graph traversal to expand through relationships — is the foundation of the Graph RAG and Recommendation Engine patterns.
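The two-step pattern can be sketched in a few lines of plain Python, with the Step 3 vectors as an in-memory table and the Step 5 edges as an adjacency list (a toy stand-in for the database, not client code):

```python
from math import sqrt

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Step 3 vectors as an in-memory table (electronics subset).
embeddings = {
    "Laptop": [0.9, 0.1, 0.8, 0.2],
    "Tablet": [0.85, 0.15, 0.75, 0.25],
    "Smartphone": [0.8, 0.2, 0.7, 0.3],
    "Headphones": [0.6, 0.4, 0.5, 0.5],
}
# Step 5 edges as an adjacency list.
frequently_bought_with = {"Laptop": ["Headphones"], "Novel": ["Textbook"]}

query = [0.88, 0.12, 0.78, 0.22]
# Step 1: vector search for the best semantic match.
top = min(embeddings, key=lambda n: cosine_distance(embeddings[n], query))
# Step 2: expand through FREQUENTLY_BOUGHT_WITH edges.
recommendations = frequently_bought_with.get(top, [])
```

In the database the same flow runs as the two queries above; the point of the sketch is the hand-off: the vector step yields anchors, the graph step yields items no similarity score would surface.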
Step 7: Use Different Similarity Metrics
Create additional indexes with different metrics for comparison:
-- Euclidean distance (absolute distance in vector space)
CREATE PROPERTY Product.embedding_l2 LIST OF FLOAT
CREATE INDEX ON Product (embedding_l2) LSM_VECTOR METADATA {
dimensions: 4,
similarity: 'EUCLIDEAN'
}
-- Dot product (fastest for normalized vectors)
CREATE PROPERTY Product.embedding_dot LIST OF FLOAT
CREATE INDEX ON Product (embedding_dot) LSM_VECTOR METADATA {
dimensions: 4,
similarity: 'DOT_PRODUCT'
}
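A quick way to feel the difference between the metrics: compute all three by hand for one product, and note that scaling a vector changes Euclidean distance and dot product but not cosine similarity. Plain Python, independent of ArcadeDB:

```python
from math import sqrt

a = [0.9, 0.1, 0.8, 0.2]      # Laptop embedding from Step 3
q = [0.88, 0.12, 0.78, 0.22]  # query vector from Step 4

def dot(x, y):
    return sum(p * r for p, r in zip(x, y))

def norm(x):
    return sqrt(dot(x, x))

cosine_sim = dot(a, q) / (norm(a) * norm(q))               # direction only
euclidean = sqrt(sum((p - r) ** 2 for p, r in zip(a, q)))  # absolute distance
dot_product = dot(a, q)                                     # magnitude-sensitive

# Doubling the vector's magnitude leaves cosine unchanged but changes
# the other two, which is why the metric choice matters.
a2 = [2 * p for p in a]
cosine_sim2 = dot(a2, q) / (norm(a2) * norm(q))
```

This is why DOT_PRODUCT is usually paired with normalized vectors: once all norms are 1, it ranks identically to cosine at lower cost.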
Step 8: Control Search Quality with efSearch
By default, ArcadeDB uses an adaptive search strategy that works well for most queries. You can override it per-query by passing efSearch as the 4th argument:
-- Higher efSearch for better recall (useful for critical queries)
SELECT name, category, distance FROM (
SELECT expand(vectorNeighbors('Product[embedding]', [0.88, 0.12, 0.78, 0.22], 3, 200))
)
See Adaptive efSearch for details on how the default strategy works.
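The recall/latency trade-off behind efSearch can be illustrated with a toy model: pretend the index may only examine the first efSearch entries of an unordered candidate list. Real HNSW walks a graph rather than a prefix, so this is only an analogy for "a bigger candidate budget finds more of the true neighbours".

```python
import random

random.seed(0)
points = [[random.random() for _ in range(4)] for _ in range(1000)]
query = [0.5, 0.5, 0.5, 0.5]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Ground truth: the exact 10 nearest neighbours by brute force.
exact = set(sorted(range(len(points)), key=lambda i: sq_dist(points[i], query))[:10])

def recall_at(ef):
    # Toy budget: only the first `ef` points may be examined.
    cand = sorted(range(ef), key=lambda i: sq_dist(points[i], query))[:10]
    return len(set(cand) & exact) / 10

recall_small = recall_at(100)
recall_large = recall_at(900)
```

recall_at never decreases as the budget grows; the cost is that each query inspects more candidates.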
Step 9: Enable Quantization for Large Datasets
For production datasets with many vectors, enable INT8 quantization to reduce memory by 75%:
CREATE VERTEX TYPE LargeProduct
CREATE PROPERTY LargeProduct.embedding ARRAY_OF_FLOATS
CREATE INDEX ON LargeProduct (embedding) LSM_VECTOR METADATA {
dimensions: 384,
similarity: 'COSINE',
quantization: 'INT8'
}
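The 75% figure is just the storage arithmetic: float32 uses 4 bytes per dimension, int8 uses 1. The sketch below shows one common scale-based INT8 scheme; ArcadeDB's internal encoding may differ, so treat it as an illustration of the math, not the engine's implementation.

```python
def quantize_int8(vec):
    # Symmetric scale quantization: the largest magnitude maps to +/-127.
    peak = max(abs(x) for x in vec)
    scale = peak / 127.0 if peak else 1.0
    return [round(x / scale) for x in vec], scale

def dequantize_int8(qvec, scale):
    return [q * scale for q in qvec]

vec = [0.9, 0.1, 0.8, 0.2]
qvec, scale = quantize_int8(vec)
restored = dequantize_int8(qvec, scale)

# Storage: 4 bytes (float32) -> 1 byte (int8) per dimension.
memory_reduction = 1 - 1 / 4  # 0.75
```

The reconstruction error is bounded by the scale, which is why quantized search stays accurate enough for most ranking tasks.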
Queries work exactly the same — quantization is transparent:
SELECT name, distance FROM (
SELECT expand(vectorNeighbors('LargeProduct[embedding]', $queryVector, 10))
)
Next Steps
- Vector Search Concepts — Architecture, algorithms, and parameter tuning
- Vector Embeddings How-To — Production best practices
- Recommendation Engine — Full use case with vector + graph + time-series
- Graph RAG — Vector + graph for LLM retrieval augmentation