Importer

ArcadeDB provides some basic ETL capabilities for automatically importing datasets in any of the following formats:

OrientDB database export
Neo4j database export
GraphML database export
GraphSON database export
Generic XML files
Generic JSON files
Generic JSONL files
Generic CSV files
Generic RDF files

From files of the following types:

Plain text
Compressed with ZIP (only the first file is read)
Compressed with GZip

Located on:

local file system (just provide the path or use file:// in the URL)
remote, by specifying http:// or https:// in the URL
classpath, by using classpath:// as a prefix
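
As a rough illustration of the options above, the following Python sketch classifies a source URL by location and unpacks file content by compression type. The function names `classify_location` and `unpack` are hypothetical, and detecting compression from magic bytes is an assumption made for this example. It is not a description of how the ArcadeDB importer is implemented.

```python
import gzip
import io
import zipfile
from urllib.parse import urlparse


def classify_location(url: str) -> str:
    """Return 'remote', 'classpath', or 'local' for a source URL (hypothetical helper)."""
    scheme = urlparse(url).scheme.lower()
    if scheme in ("http", "https"):
        return "remote"
    if scheme == "classpath":
        return "classpath"
    # Bare paths and file:// URLs are both treated as local files.
    return "local"


def unpack(raw: bytes) -> bytes:
    """Return the plain content of a GZip, ZIP, or uncompressed payload."""
    if raw[:2] == b"\x1f\x8b":  # GZip magic number
        return gzip.decompress(raw)
    if raw[:4] == b"PK\x03\x04":  # ZIP local file header
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            first_entry = archive.namelist()[0]  # only the first file is read
            return archive.read(first_entry)
    return raw  # plain text, nothing to unpack
```

For instance, `classify_location("classpath://data/import.csv")` yields `"classpath"`, while a bare path such as `/import/file.csv` is classified as `"local"`.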

The easiest way is to use the console and the SQL IMPORT DATABASE command. You can also use the Java API directly, located in com.arcadedb.integration.importer.Importer.

Starting an import is as simple as providing the URL of the source file. URLs can be local paths (use file://) or remote locations using http:// or https://.

Example of loading the Freebase RDF dataset:

> CREATE DATABASE FreeBase
{FreeBase}> IMPORT DATABASE http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz

Analyzing url: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz... [SourceDiscovery]
Recognized format RDF (limitBytes=9.54MB limitEntries=0) [SourceDiscovery]
Creating type 'Node' of type VERTEX [Importer]
Creating type 'Relationship' of type EDGE [Importer]
Parsed 144951 (28990/sec) - 0 documents (0/sec) - 143055 vertices (28611/sec) - 144951 edges (28990/sec) [Importer]
Parsed 362000 (54256/sec) - 0 documents (0/sec) - 164118 vertices (5260/sec) - 362000 edges (54256/sec) [Importer]
...

Example of loading the Discogs dataset in the database on path "/temp/discogs":

> IMPORT DATABASE https://discogs-data.s3-us-west-2.amazonaws.com/data/2018/discogs_20180901_releases.xml.gz

Note that in this case the URL is https and the file is compressed with GZip.

Example of importing the New York Taxi dataset in CSV format. The first line of the CSV file sets the property names:

> IMPORT DATABASE file:///personal/Downloads/data-society-uber-pickups-in-nyc/original/uber-raw-data-april-15.csv/uber-raw-data-april-15.csv
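
To illustrate how a header line becomes property names, here is a minimal Python sketch using the standard `csv` module. The column names, the sample rows, and the `rows_as_properties` helper are made up for illustration; they are not taken from the actual dataset or from ArcadeDB.

```python
import csv
import io


def rows_as_properties(text: str, delimiter: str = ",") -> list:
    """Parse CSV text, using the first line as the property names of each record."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Hypothetical sample: the header row defines two properties, "base" and "pickup".
sample = "base,pickup\nB001,2015-04-01 10:00:00\nB002,2015-04-01 10:05:00\n"

for record in rows_as_properties(sample):
    print(record)
```

Each data row is turned into a record whose keys come from the header line, which is the same idea the importer applies when it derives property names from the first CSV line.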

Additional Settings

The Importer takes additional settings as pairs of setting name and value. With the SQL IMPORT DATABASE command, this is the syntax:

IMPORT DATABASE <url> [ WITH ( <setting-name> = <setting-value> [,] )* ]

Example:

> IMPORT DATABASE file:///import/file.csv WITH forceDatabaseCreate = true, commitEvery = 100
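
When the command is generated by a script, the setting pairs can be assembled from a dictionary. The sketch below is only a string-building helper under stated assumptions: `build_import_command` is a hypothetical name, booleans are rendered as lowercase `true`/`false` as in the example above, and every other value is written as-is because the quoting rules for string values are not covered here.

```python
from typing import Optional


def format_value(value) -> str:
    """Render a setting value; booleans become lowercase true/false."""
    if isinstance(value, bool):  # checked first because bool is a subclass of int
        return "true" if value else "false"
    return str(value)  # assumption: other values are written unquoted


def build_import_command(url: str, settings: Optional[dict] = None) -> str:
    """Build an IMPORT DATABASE command string with optional WITH settings."""
    command = f"IMPORT DATABASE {url}"
    if settings:
        pairs = ", ".join(
            f"{name} = {format_value(value)}" for name, value in settings.items()
        )
        command += f" WITH {pairs}"
    return command


print(build_import_command(
    "file:///import/file.csv",
    {"forceDatabaseCreate": True, "commitEvery": 100},
))
```

The printed command matches the example above and can then be sent through the console or any client that accepts SQL commands.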

Below you can find all the supported settings for the Importer.

Setting	Default	Description
url		url of the file to import
database	./databases/imported	Path of the final imported database
forceDatabaseCreate	false	If true, the database is created brand new at every import
wal	false	Use the WAL (journal) during the import. With the WAL enabled, the import is much slower and requires much more RAM
commitEvery	5000	Create transactions that commit every X records
parallel	Half of the available cores - 1. If you have 8 cores, the default is 3	The number of concurrent threads used
typeIdProperty		Property that represents the ID of the vertex
typeIdUnique	false	True creates a unique index on the type id property, otherwise a non unique index
typeIdType	String	Type of the id property
trimText	true	If true, leading and trailing spaces are trimmed from the imported text
maxProperties	512	Maximum number of properties per type (CSV)
maxPropertySize	4096	Maximum size of a property in bytes (CSV)
delimiter	`,`	Delimiter used to separate fields (CSV)
analysisLimitBytes	100,000	Maximum number of bytes parsed from the source to determine the source file type
analysisLimitEntries	10,000	Maximum number of entries (if applicable) parsed from the source to determine the source file type
parsingLimitBytes		Maximum number of bytes parsed from the source to be imported
parsingLimitEntries		Maximum number of entries imported
mapping	null
probeOnly	false	Only probe if url is reachable or file path is readable
documents		url of the file to import containing documents only. This is useful when the database is split in separate files
documentsFileType		The format of the file containing documents (csv, graphml, graphson)
documentsDelimiter		Delimiter used to separate documents
documentsHeader		Header containing the properties in the CSV document. One property per column. If not defined it is parsed from the first line
documentsSkipEntries	0	Number of rows to skip from the documents file
documentPropertiesInclude	`*`	List of properties to import from documents. `*` means all
documentType	Document	Name of the type defined in the schema when importing documents
vertices		url of the file to import containing vertices only. This is useful when the database is split in separate files
verticesFileType		The format of the file containing vertices (csv, graphml, graphson)
verticesDelimiter		Delimiter used to separate vertices
verticesHeader		Header containing the properties in the CSV vertices. One property per column. If not defined it is parsed from the first line
verticesSkipEntries	0	Number of rows to skip from the vertices file
expectedVertices	0	Number of vertices expected. This is useful to determine the ETA of the importing process of vertices. 0 means unknown
vertexType	Vertex	Name of the type defined in the schema when importing vertices
vertexPropertiesInclude	`*`	List of properties to import from vertices. `*` means all
edges		url of the file to import containing edges only. This is useful when the database is split in separate files
edgesFileType		The format of the file containing edges (csv, graphml, graphson)
edgesDelimiter		Delimiter used to separate edges
edgesHeader		Header containing the properties in the CSV edges. One property per column. If not defined it is parsed from the first line
edgesSkipEntries	0	Number of rows to skip from the edges file
expectedEdges	0	Number of edges expected. This is useful to determine the ETA of the importing process of edges. 0 means unknown
maxRAMIncomingEdges	256MB	Maximum RAM used to create edges. The more RAM, the faster.
edgeType	Edge	Name of the type defined in the schema when importing edges
edgePropertiesInclude	`*`	List of properties to import from edges. `*` means all
edgeFromField		Name of the property containing the starting vertex
edgeToField		Name of the property containing the ending vertex
edgeBidirectional	true	When creating edges, create bidirectional edges if true, otherwise unidirectional
distanceFunction	`innerproduct`	Type of distance measure, see similarity measures.
efConstruction	256	Size of the dynamic neighbor candidate list (during insert).
ef	256	Number of nearest neighbors to return (in layer search).
m	16	Maximum number of connections per layer in the HNSW index. Higher values improve recall but increase memory usage
vectorType	`float`	The data type of a vector element, for example 'float'.
idProperty	`"name"`	Name of the property that will be used as the unique identifier for vertices during import
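
Two of the settings above, `parallel` and `commitEvery`, describe simple arithmetic that a short Python sketch can make concrete. The helper names `default_parallel` and `batches` are hypothetical. Integer division for odd core counts and the lower bound of one thread are assumptions made for this example, not documented behavior.

```python
from itertools import islice


def default_parallel(cores: int) -> int:
    """Half of the available cores minus one (8 cores gives 3)."""
    # Assumption: never go below one thread on small machines.
    return max(1, cores // 2 - 1)


def batches(records, commit_every: int = 5000):
    """Yield lists of at most commit_every records, one list per transaction."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, commit_every))
        if not batch:
            return
        yield batch


print(default_parallel(8))
print([len(batch) for batch in batches(range(12), commit_every=5)])
```

A smaller `commitEvery` means more, shorter transactions; a larger one means fewer commits with more records held per transaction.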

The probeOnly setting can also be used to send a GET request to another service or HTTP API, for example to report that a previous import has finished.