Importer
ArcadeDB provides some basic ETL capabilities for automatically importing datasets in any of the following formats:
From files of type:

- Plain text
- Compressed with ZIP (only the first file is read)
- Compressed with GZip

Located on:

- the local file system (just provide the path, or use file:// in the URL)
- a remote server, by specifying http:// or https:// in the URL
- the classpath, by using classpath:// as a prefix
The easiest way is to use the console with the SQL IMPORT DATABASE command. Alternatively, you can use the Java API directly, located in com.arcadedb.integration.importer.Importer.
To start an import, simply provide the URL where the source file is located. URLs can be local paths (use file://) or point to the Internet (use http:// or https://).
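For example, the same command can be issued from Java against an embedded database. This is a minimal sketch using the generic SQL command API (the database path and file location are placeholders); the com.arcadedb.integration.importer.Importer class can also be invoked directly.

import com.arcadedb.database.Database;
import com.arcadedb.database.DatabaseFactory;

public class ImporterExample {
  public static void main(String[] args) {
    // Create a new embedded database (use open() for an existing one); the path is a placeholder
    try (Database database = new DatabaseFactory("./databases/MyDatabase").create()) {
      // Run the same IMPORT DATABASE command available from the console
      database.command("sql", "IMPORT DATABASE file:///temp/data.csv");
    }
  }
}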
Example of loading the Freebase RDF dataset:
> CREATE DATABASE FreeBase
{FreeBase}> IMPORT DATABASE http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz
Analyzing url: http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-latest.gz... [SourceDiscovery]
Recognized format RDF (limitBytes=9.54MB limitEntries=0) [SourceDiscovery]
Creating type 'Node' of type VERTEX [Importer]
Creating type 'Relationship' of type EDGE [Importer]
Parsed 144951 (28990/sec) - 0 documents (0/sec) - 143055 vertices (28611/sec) - 144951 edges (28990/sec) [Importer]
Parsed 362000 (54256/sec) - 0 documents (0/sec) - 164118 vertices (5260/sec) - 362000 edges (54256/sec) [Importer]
...
Example of loading the Discogs dataset into the database at path "/temp/discogs":
> IMPORT DATABASE https://discogs-data.s3-us-west-2.amazonaws.com/data/2018/discogs_20180901_releases.xml.gz
Note that in this case the URL is https and the file is compressed with GZip.
Example of importing the New York Taxi dataset in CSV format. The first line of the CSV file sets the property names:
> IMPORT DATABASE file:///personal/Downloads/data-society-uber-pickups-in-nyc/original/uber-raw-data-april-15.csv/uber-raw-data-april-15.csv
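If the CSV uses a different field separator, the delimiter setting (see Additional Settings below) can be supplied; for instance (the path is a placeholder, and the single-quoting of the value is an assumption):

> IMPORT DATABASE file:///temp/data.csv WITH delimiter = ';'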
See also:

- Graph Importer (high-performance, declarative graph import from XML, CSV, JSONL with a JSON configuration)
Additional Settings
The Importer takes additional settings as pairs of setting name and value. With the SQL IMPORT DATABASE command, this is the syntax:
IMPORT DATABASE <url> [ WITH ( <setting-name> = <setting-value> [,] )* ]
Example:
> IMPORT DATABASE file:///import/file.csv WITH forceDatabaseCreate = true, commitEvery = 100
Below you can find all the supported settings for the Importer.
| Setting | Default | Description |
|---|---|---|
| url | | URL of the file to import |
| database | ./databases/imported | Path of the final imported database |
| forceDatabaseCreate | false | If true, the database is created brand new at every import |
| wal | false | Use the WAL (journal) for the import. If the WAL is enabled, the import process is much slower and requires much more RAM |
| commitEvery | 5000 | Create transactions that commit every X records |
| parallel | Half of the available cores minus 1 (with 8 cores, the default is 3) | Number of concurrent threads used |
| typeIdProperty | | Property that represents the ID of the vertex |
| typeIdUnique | false | If true, creates a unique index on the type id property, otherwise a non-unique index |
| typeIdType | String | Type of the id property |
| trimText | true | If true, imported text is trimmed of leading and trailing spaces |
| maxProperties | 512 | Maximum number of properties per type (CSV) |
| maxPropertySize | 4096 | Maximum size of a property in bytes (CSV) |
| delimiter | | Delimiter used to separate fields (CSV) |
| analysisLimitBytes | 100,000 | Maximum number of bytes parsed from the source to determine the source file type |
| analysisLimitEntries | 10,000 | Maximum number of entries (if applicable) parsed from the source to determine the source file type |
| parsingLimitBytes | | Maximum number of bytes parsed from the source to be imported |
| parsingLimitEntries | | Maximum number of entries imported |
| mapping | null | |
| probeOnly | false | Only probe if the url is reachable or the file path is readable |
| documents | | URL of the file to import containing documents only. This is useful when the database is split into separate files |
| documentsFileType | | Format of the file containing documents (csv, graphml, graphson) |
| documentsDelimiter | | Delimiter used to separate documents |
| documentsHeader | | Header containing the properties in the CSV documents, one property per column. If not defined, it is parsed from the first line |
| documentsSkipEntries | 0 | Number of rows to skip from the documents file |
| documentPropertiesInclude | | List of properties to import from documents |
| documentType | Document | Name of the type defined in the schema when importing documents |
| vertices | | URL of the file to import containing vertices only. This is useful when the database is split into separate files |
| verticesFileType | | Format of the file containing vertices (csv, graphml, graphson) |
| verticesDelimiter | | Delimiter used to separate vertices |
| verticesHeader | | Header containing the properties in the CSV vertices, one property per column. If not defined, it is parsed from the first line |
| verticesSkipEntries | 0 | Number of rows to skip from the vertices file |
| expectedVertices | 0 | Number of vertices expected. This is useful to determine the ETA of the vertex import. 0 means unknown |
| vertexType | Vertex | Name of the type defined in the schema when importing vertices |
| vertexPropertiesInclude | | List of properties to import from vertices |
| edges | | URL of the file to import containing edges only. This is useful when the database is split into separate files |
| edgesFileType | | Format of the file containing edges (csv, graphml, graphson) |
| edgesDelimiter | | Delimiter used to separate edges |
| edgesHeader | | Header containing the properties in the CSV edges, one property per column. If not defined, it is parsed from the first line |
| edgesSkipEntries | 0 | Number of rows to skip from the edges file |
| expectedEdges | 0 | Number of edges expected. This is useful to determine the ETA of the edge import. 0 means unknown |
| maxRAMIncomingEdges | 256MB | Maximum RAM used to create edges. The more RAM, the faster |
| edgeType | Edge | Name of the type defined in the schema when importing edges |
| edgePropertiesInclude | | List of properties to import from edges |
| edgeFromField | | Name of the property containing the starting vertex |
| edgeToField | | Name of the property containing the ending vertex |
| edgeBidirectional | true | When creating edges, create bidirectional edges if true, otherwise unidirectional |
| distanceFunction | | Type of distance measure, see similarity measures |
| efConstruction | 256 | Size of the dynamic neighbor candidate list (during insert) |
| ef | 256 | Number of nearest neighbors to return (in layer search) |
| m | 16 | Maximum number of connections per layer in the HNSW index. Higher values improve recall but increase memory usage |
| vectorType | | Data type of a vector element, for example 'float' |
| idProperty | | Name of the property used as the unique identifier for vertices during import |
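When a dataset is split into separate files, the vertices and edges settings above can reference the individual files in a single import. The following is a hypothetical sketch, not confirmed syntax: the file paths, the choice of url argument, the edgeFromField/edgeToField property names, and the single-quoting of string values are all assumptions.

> IMPORT DATABASE file:///import/nodes.csv WITH vertices = 'file:///import/nodes.csv', edges = 'file:///import/relationships.csv', edgeFromField = 'from', edgeToField = 'to'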
The probeOnly setting can also be used to send a GET request to another service or HTTP API, for example to report that a previous import has finished.
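For instance (the URL below is a placeholder):

> IMPORT DATABASE https://example.com/hooks/import-finished WITH probeOnly = true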