Importer

ArcadeDB provides some basic ETL capabilities for automatically importing datasets in any of the following formats:

  • OrientDB database export

  • Neo4j database export

  • GraphML database export

  • GraphSON database export

  • Generic XML files

  • Generic JSON files

  • Generic JSONL files

  • Generic CSV files

  • Generic RDF files

The source file can be any of the following types:

  • Plain text

  • Compressed with ZIP (only the first file is read)

  • Compressed with GZip
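To illustrate the three file types listed above, here is a minimal Python sketch of how a reader could unpack each one. It is not ArcadeDB's implementation: the `open_source` helper is hypothetical, and detecting the format from magic bytes is an assumption made for the example. The point it shows is that for a ZIP archive only the first entry is used.

```python
import gzip
import io
import zipfile


def open_source(raw: bytes) -> bytes:
    """Return the uncompressed payload of a plain, ZIP, or GZip source."""
    if raw[:2] == b"\x1f\x8b":  # GZip magic number
        return gzip.decompress(raw)
    if raw[:4] == b"PK\x03\x04":  # ZIP local file header
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            first = archive.namelist()[0]  # only the first entry is used
            return archive.read(first)
    return raw  # plain text: nothing to unpack


# Build a two-entry ZIP in memory to show that later entries are ignored.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("first.csv", "id,name\n1,Ann\n")
    archive.writestr("second.csv", "id,name\n2,Bob\n")

zip_payload = open_source(buffer.getvalue())
gzip_payload = open_source(gzip.compress(b"id,name\n3,Cy\n"))
plain_payload = open_source(b"id,name\n4,Di\n")
```

If a dataset is split across several files inside one ZIP archive, this behavior means the remaining entries would need to be extracted or repackaged before importing.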

Located on:

  • local file system (just provide the path or use file:// in the URL)

  • remote, by specifying http:// or https:// in the URL

  • classpath, by using classpath:// as a prefix
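The location rules above can be sketched as a small Python function. This is only an illustration of the prefixes listed here, not the Importer's actual resolution logic; the `source_location` name and its return labels are hypothetical.

```python
from urllib.parse import urlparse


def source_location(url: str) -> str:
    """Classify an import source as 'local', 'remote', or 'classpath'."""
    scheme = urlparse(url).scheme.lower()
    if scheme in ("http", "https"):
        return "remote"
    if scheme == "classpath":
        return "classpath"
    # A bare path has no scheme; file:// is the explicit local form.
    return "local"


locations = {
    url: source_location(url)
    for url in (
        "/import/file.csv",
        "file:///import/file.csv",
        "https://example.com/data.xml.gz",
        "classpath://datasets/file.csv",
    )
}
```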

The easiest way is to use the console and the SQL IMPORT DATABASE command. You can also use the Java API directly through com.arcadedb.integration.importer.Importer.

To start an import, provide the URL of the source file. URLs can point to local files (use file://) or to remote files on the Internet (use http:// or https://).

Example of loading the Freebase RDF dataset:

> CREATE DATABASE FreeBase
{FreeBase}> IMPORT DATABASE http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz
Analyzing url: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz... [SourceDiscovery]
Recognized format RDF (limitBytes=9.54MB limitEntries=0) [SourceDiscovery]
Creating type 'Node' of type VERTEX [Importer]
Creating type 'Relationship' of type EDGE [Importer]
Parsed 144951 (28990/sec) - 0 documents (0/sec) - 143055 vertices (28611/sec) - 144951 edges (28990/sec) [Importer]
Parsed 362000 (54256/sec) - 0 documents (0/sec) - 164118 vertices (5260/sec) - 362000 edges (54256/sec) [Importer]
...

Example of loading the Discogs dataset in the database on path "/temp/discogs":

> IMPORT DATABASE https://discogs-data.s3-us-west-2.amazonaws.com/data/2018/discogs_20180901_releases.xml.gz

Note that in this case the URL is https and the file is compressed with GZip.

Example of importing the New York Taxi dataset in CSV format. The first line of the CSV file sets the property names:

> IMPORT DATABASE file:///personal/Downloads/data-society-uber-pickups-in-nyc/original/uber-raw-data-april-15.csv/uber-raw-data-april-15.csv
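As a rough illustration of what "the first line sets the property names" means for a CSV source, the Python sketch below parses a tiny in-memory sample with the standard `csv` module. The column names and values are made up for the example and are not taken from the actual dataset, and this is not how the Importer itself is implemented.

```python
import csv
import io

# Hypothetical sample: the header row names the properties of each record.
sample = io.StringIO(
    "pickup_time,lat,lon\n"
    "2015-04-01 00:01:00,40.7,-73.9\n"
)

reader = csv.DictReader(sample)  # first line supplies the field names
records = list(reader)
property_names = reader.fieldnames
```

Each subsequent row becomes one record whose keys are the header names, and every value is read as text unless converted afterwards.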

Additional Settings

The Importer takes additional settings as pairs of setting name and value. With the SQL IMPORT DATABASE command, this is the syntax:

IMPORT DATABASE <url> [ WITH ( <setting-name> = <setting-value> [,] )* ]

Example:

> IMPORT DATABASE file:///import/file.csv WITH forceDatabaseCreate = true, commitEvery = 100
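When the command is generated from a script, the WITH clause can be assembled from a dictionary of settings. The Python helper below is a minimal sketch that follows the syntax shown above; `import_command` is a hypothetical name, and the sketch assumes values need no quoting or escaping, which may not hold for every setting.

```python
def import_command(url: str, **settings: object) -> str:
    """Build an IMPORT DATABASE command string from a URL and settings."""
    command = f"IMPORT DATABASE {url}"
    if settings:
        pairs = []
        for name, value in settings.items():
            # Render Python booleans as the lowercase literals used in the docs.
            text = str(value).lower() if isinstance(value, bool) else str(value)
            pairs.append(f"{name} = {text}")
        command += " WITH " + ", ".join(pairs)
    return command


command = import_command(
    "file:///import/file.csv", forceDatabaseCreate=True, commitEvery=100
)
```

The resulting string can then be pasted into the console or sent through whichever client is in use.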

Below you can find all the supported settings for the Importer.

| Setting | Default | Description |
|---|---|---|
| url | | URL of the file to import |
| database | ./databases/imported | Path of the final imported database |
| forceDatabaseCreate | false | If true, the database is created brand new at every import |
| wal | false | Use the WAL (journal) during the import. If the WAL is enabled, the import will be much slower and will require much more RAM |
| commitEvery | 5000 | Create transactions that commit every X records |
| parallel | Half of the available cores - 1. If you have 8 cores, the default is 3 | The number of concurrent threads used |
| typeIdProperty | | Property that represents the ID of the vertex |
| typeIdUnique | false | True creates a unique index on the type id property, otherwise a non-unique index |
| typeIdType | String | Type of the id property |
| trimText | true | True if leading and trailing spaces are trimmed from the imported text |
| maxProperties | 512 | Maximum number of properties per type (CSV) |
| maxPropertySize | 4096 | Maximum size of a property in bytes (CSV) |
| delimiter | , | Delimiter used to separate fields (CSV) |
| analysisLimitBytes | 100,000 | Maximum number of bytes parsed from the source to determine the source file type |
| analysisLimitEntries | 10,000 | Maximum number of entries (if applicable) parsed from the source to determine the source file type |
| parsingLimitBytes | | Maximum number of bytes parsed from the source to be imported |
| parsingLimitEntries | | Maximum number of entries imported |
| mapping | null | |
| probeOnly | false | Only probe whether the URL is reachable or the file path is readable |
| documents | | URL of the file to import containing documents only. This is useful when the database is split into separate files |
| documentsFileType | | The format of the file containing documents (csv, graphml, graphson) |
| documentsDelimiter | | Delimiter used to separate documents |
| documentsHeader | | Header containing the properties in the CSV document. One property per column. If not defined, it is parsed from the first line |
| documentsSkipEntries | 0 | Number of rows to skip from the documents file |
| documentPropertiesInclude | * | List of properties to import from documents. * means all |
| documentType | Document | Name of the type defined in the schema when importing documents |
| vertices | | URL of the file to import containing vertices only. This is useful when the database is split into separate files |
| verticesFileType | | The format of the file containing vertices (csv, graphml, graphson) |
| verticesDelimiter | | Delimiter used to separate vertices |
| verticesHeader | | Header containing the properties in the CSV vertices. One property per column. If not defined, it is parsed from the first line |
| verticesSkipEntries | 0 | Number of rows to skip from the vertices file |
| expectedVertices | 0 | Number of vertices expected. This is useful to determine the ETA of the vertex import. 0 means unknown |
| vertexType | Vertex | Name of the type defined in the schema when importing vertices |
| vertexPropertiesInclude | * | List of properties to import from vertices. * means all |
| edges | | URL of the file to import containing edges only. This is useful when the database is split into separate files |
| edgesFileType | | The format of the file containing edges (csv, graphml, graphson) |
| edgesDelimiter | | Delimiter used to separate edges |
| edgesHeader | | Header containing the properties in the CSV edges. One property per column. If not defined, it is parsed from the first line |
| edgesSkipEntries | 0 | Number of rows to skip from the edges file |
| expectedEdges | 0 | Number of edges expected. This is useful to determine the ETA of the edge import. 0 means unknown |
| maxRAMIncomingEdges | 256MB | Maximum RAM used to create edges. The more RAM, the faster |
| edgeType | Edge | Name of the type defined in the schema when importing edges |
| edgePropertiesInclude | * | List of properties to import from edges. * means all |
| edgeFromField | | Name of the property containing the starting vertex |
| edgeToField | | Name of the property containing the ending vertex |
| edgeBidirectional | true | When creating edges, create bidirectional edges if true, otherwise unidirectional |
| distanceFunction | innerproduct | Type of distance measure, see similarity measures |
| efConstruction | 256 | Size of the dynamic neighbor candidate list (during insert) |
| ef | 256 | Number of nearest neighbors to return (in layer search) |
| m | 16 | Maximum number of connections per layer in the HNSW index. Higher values improve recall but increase memory usage |
| vectorType | float | The data type of a vector element, for example 'float' |
| idProperty | "name" | Name of the property that will be used as the unique identifier for vertices during import |

The probeOnly setting can also be used to send a GET request to another service or HTTP API, for example to report that a previous import has finished.
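Two of the settings listed above, parallel and commitEvery, are easier to reason about with a small worked example. The Python sketch below is an illustration only, not the Importer's code: both function names are hypothetical, and the rounding for odd core counts and the lower bound of one thread are assumptions, since the description only gives the 8-core case.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def default_parallel(cores: int) -> int:
    """Half of the available cores minus one, as described for 'parallel'."""
    # Integer division and the floor of 1 are assumptions for this sketch.
    return max(1, cores // 2 - 1)


def commit_batches(
    records: Iterable[dict], commit_every: int = 5000
) -> Iterator[List[dict]]:
    """Group records into batches, one batch per simulated transaction."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, commit_every))
        if not batch:
            return
        yield batch


threads_for_8_cores = default_parallel(8)
batch_sizes = [
    len(batch)
    for batch in commit_batches(({"id": i} for i in range(12)), commit_every=5)
]
```

With 12 records and commit_every set to 5, the sketch produces two full batches and one final partial batch, which mirrors how a commit interval splits an import into transactions.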