# Graph Batch

> Available since ArcadeDB 26.4.1.
The GraphBatch is a high-performance bulk graph loader designed for importing large volumes of edges into an ArcadeDB graph.
It achieves an 8-12x speedup over the standard edge creation API by buffering edges in memory, sorting them by vertex for sequential I/O, connecting edges in parallel, and deferring incoming-edge connection to a single optimized pass.
## Quick Start

```java
Database db = new DatabaseFactory("mydb").create();

db.transaction(() -> {
  db.getSchema().createVertexType("Person");
  db.getSchema().createEdgeType("KNOWS");
});

try (GraphBatch batch = db.batch()
    .withExpectedEdgeCount(1_000_000)
    .build()) {

  // Phase 1: Create vertices in bulk
  RID[] vertices = batch.createVertices("Person", 100_000);

  // Phase 2: Buffer edges (auto-flushes when batch is full)
  for (int i = 0; i < edgeCount; i++)
    batch.newEdge(vertices[src[i]], "KNOWS", vertices[dst[i]]);

} // close() flushes remaining edges + connects all incoming edges
```
All edges (both outgoing and incoming) are fully connected when `close()` returns.
## How It Works

The importer operates in phases for maximum throughput:

1. **Buffer**: Edges are stored in flat primitive arrays (minimal GC pressure).
2. **Sort**: On flush, outgoing edges are sorted by source vertex, converting random I/O into sequential page access.
3. **Connect OUT**: Outgoing edges are connected to source vertices in sorted order, using vectorized segment writes.
4. **Accumulate IN**: Incoming edges are deferred to an in-memory buffer across flushes.
5. **Connect IN at `close()`**: All deferred incoming edges are sorted by destination vertex and connected in a single pass.
6. **Batch vertex update**: All vertex head-chunk pointers are updated in one final sorted pass.
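The buffer and sort phases can be illustrated with plain Java arrays. This is a simplified sketch of the idea, not the actual GraphBatch internals: edges live in flat parallel `long[]` arrays (no per-edge objects), and an index sort by source vertex groups all edges of the same vertex together so they can be connected with sequential page access.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the buffer + sort phases: edges are stored in flat parallel
// arrays to minimize GC pressure, then sorted by source vertex so each
// vertex's edges are processed together.
class EdgeBufferSketch {
  final long[] src;
  final long[] dst;
  int count = 0;

  EdgeBufferSketch(int capacity) {
    src = new long[capacity];
    dst = new long[capacity];
  }

  void add(long sourceVertex, long destVertex) {
    src[count] = sourceVertex;
    dst[count] = destVertex;
    count++;
  }

  // Index sort by source vertex: the parallel arrays stay untouched,
  // only the visiting order changes.
  Integer[] sortedBySource() {
    Integer[] order = new Integer[count];
    for (int i = 0; i < count; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingLong(i -> src[i]));
    return order;
  }
}
```

Sorting before connecting is what turns random writes (edges arrive in arbitrary order) into sequential ones (all edges of a vertex are applied in one visit).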
## Edges with Properties

The importer supports edges with properties. Pass them as key-value pairs:

```java
try (GraphBatch batch = db.batch()
    .withExpectedEdgeCount(500_000)
    .withLightEdges(false) // edges have properties
    .build()) {

  RID[] vertices = batch.createVertices("Person", 50_000);

  for (int i = 0; i < edgeCount; i++)
    batch.newEdge(vertices[src[i]], "KNOWS", vertices[dst[i]],
        "weight", weights[i], "timestamp", timestamps[i]);
}
```
Property edges use a bulk record creation path that bypasses per-record overhead, achieving nearly the same throughput as light edges.
## Builder Options

| Method | Default | Description |
|---|---|---|
| `withExpectedEdgeCount()` | — | Auto-tunes the batch size to the expected edge count (see Auto-Tuning Batch Size below). |
| `withBatchSize()` | 100,000 | Maximum edges buffered before auto-flush. Overrides auto-tuning if both are set. |
| `withLightEdges()` | | If true, edges without properties are created as light edges (no record stored, only connectivity pointers). |
| | | If true, incoming edges are also connected. Set to false for unidirectional graphs. |
| | | Write-Ahead Logging during import. Disable for maximum speed; enable if crash recovery is needed. |
| | 0 (WAL off) / 50,000 (WAL on) | Edges per transaction commit within a flush. Set to 0 to commit once per flush. Auto-set to 0 when WAL is disabled. |
| | 2048 | Initial size in bytes for edge segments. Larger values reduce segment splits for high-degree vertices. |
| | | Pre-allocate empty edge segments at vertex creation time, eliminating lazy allocation cost. |
| | | Parallel edge connection partitioned by bucket. Uses the database async executor to connect edges across buckets concurrently. |
## Auto-Tuning Batch Size

The `withExpectedEdgeCount()` method automatically selects the optimal batch size based on benchmarks. A single flush (batch size = edge count) delivers the best throughput, so the batch size is set to the edge count, clamped between 100K and 5M:
| Expected Edges | Auto-tuned Batch | Approx. RAM |
|---|---|---|
| < 100K | 100K (minimum) | ~200MB |
| 100K - 5M | = edge count (single flush) | ~300MB-1GB |
| > 5M | 5M (maximum) | ~1GB |
You can always override the auto-tuned value with `withBatchSize()` if you know your workload characteristics.
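The clamping rule described above can be expressed in a few lines (a sketch of the documented tuning rule, not ArcadeDB's actual implementation; the method name `tuneBatchSize` is hypothetical):

```java
// Sketch of the documented auto-tuning rule: the batch size equals the
// expected edge count, clamped to [100K, 5M], so any import up to 5M
// edges completes in a single flush.
class BatchSizeTuning {
  static final int MIN_BATCH = 100_000;
  static final int MAX_BATCH = 5_000_000;

  static int tuneBatchSize(int expectedEdgeCount) {
    return Math.max(MIN_BATCH, Math.min(MAX_BATCH, expectedEdgeCount));
  }
}
```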
## Key Methods

| Method | Description |
|---|---|
| `createVertices()` | Creates vertices in bulk within a single transaction. Returns an array of RIDs for use with `newEdge()`. |
| `newEdge()` | Buffers an edge for batch processing. Automatically triggers a flush when the batch size is reached. |
| `flush()` | Manually flushes all buffered edges. Outgoing edges are sorted and connected; incoming edges are deferred. |
| `close()` | Flushes remaining edges, connects all deferred incoming edges in a single sorted pass, and batch-updates vertex head-chunk pointers. |
## Performance

Benchmark results on a single machine (500K edges, 50K vertices):

| Method | Time (ms) | Edges/sec | Speedup |
|---|---|---|---|
| Standard API (tx batches of 1000) | 11,933 | 42K | 1.0x |
| GraphBatch (light edges) | 1,028 | 486K | 11.6x |
| Standard API + properties (int + long) | 13,385 | 37K | 1.0x |
| GraphBatch + properties | 1,584 | 316K | 8.5x |
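The derived columns follow directly from the raw timings: edges/sec is the 500K edges divided by elapsed time, and speedup is the baseline time divided by the batch time. A quick sanity check of the arithmetic:

```java
// Sanity check of the benchmark table's derived columns.
class BenchmarkMath {
  // Edges per second from a total edge count and elapsed milliseconds.
  static long edgesPerSecond(long edges, long millis) {
    return edges * 1000 / millis;
  }

  // Speedup of the batch loader over the baseline, to one decimal place.
  static double speedup(long baselineMillis, long batchMillis) {
    return Math.round(10.0 * baselineMillis / batchMillis) / 10.0;
  }
}
```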
## Importing into an Existing Graph

The importer can also connect edges to vertices that already exist in the database (not just vertices created by `createVertices()`). Simply pass existing vertex RIDs to `newEdge()`:

```java
try (GraphBatch batch = db.batch()
    .withExpectedEdgeCount(100_000)
    .build()) {

  // Use existing vertex RIDs from queries or lookups
  for (int i = 0; i < edgeCount; i++)
    batch.newEdge(existingVertexRIDs[src[i]], "FOLLOWS", existingVertexRIDs[dst[i]]);
}
```
## HTTP Batch Endpoint

GraphBatch is also available over HTTP for non-Java clients. The `POST /api/v1/batch` endpoint accepts JSONL or CSV input and uses GraphBatch under the hood, with all builder options exposed as query parameters. See HTTP Batch Import for full details and examples in curl, Python, and JavaScript.
## Relationship with GraphImporter
The GraphImporter is a declarative, JSON-driven graph importer built on top of GraphBatch.
It adds file parsing (XML, CSV, JSONL), schema auto-creation, foreign key resolution, and post-import commands.
Use GraphBatch directly when you need full programmatic control over vertex and edge creation.
Use GraphImporter when you want to import graph data from files using a JSON configuration.