Many NoSQL products shard against a hashed key (see MongoDB, or any DHT system such as DynamoDB). This would not work well in a Graph Database, where traversing relationships means jumping from one node to another with very high probability, often landing on a different shard.
So the sharding strategy is left to the developer/DBA. How? By defining multiple clusters per class. Example of splitting the class “Client” into 3 clusters:
Class Client -> Clusters [ client_0, client_1, client_2 ]
This means that OrientDB considers any record/document/graph element stored in any of those clusters as a “Client” (the Client class relies on those clusters). In a distributed architecture, each cluster can be assigned to one or multiple server nodes.
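As a rough sketch, the class and its clusters could also be created programmatically through the Java API; the connection URL and credentials below are placeholders, not values from this guide:

```java
import com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx;
import com.orientechnologies.orient.core.metadata.schema.OClass;

public class ShardedClientSetup {
  public static void main(String[] args) {
    // "remote:localhost/mydb", "root" and "password" are placeholder values.
    ODatabaseDocumentTx db = new ODatabaseDocumentTx("remote:localhost/mydb").open("root", "password");
    try {
      // Create the Client class; this also creates its default cluster.
      OClass client = db.getMetadata().getSchema().createClass("Client");
      // addCluster() attaches a cluster to the class, creating it if it does not exist yet.
      client.addCluster("client_0");
      client.addCluster("client_1");
      client.addCluster("client_2");
    } finally {
      db.close();
    }
  }
}
```

The same result can be obtained in SQL with CREATE CLASS followed by ALTER CLASS ... ADDCLUSTER.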
Shards, being based on clusters, work with both indexed and non-indexed classes/clusters.
You can assign each cluster to one or more servers. If multiple servers are listed, the records are copied to all of them, much like RAID does for disks.
This is an example of a configuration where the Client class has been split into the 3 clusters client_0, client_1 and client_2, each one with a different server assignment.
To keep things simple, the entire OrientDB distributed configuration is stored in a single JSON file. Example of the distributed database configuration for the [Multiple servers per cluster](Distributed-Sharding#Multiple-servers-per-cluster) use case:
{
  "autoDeploy": true,
  "hotAlignment": false,
  "readQuorum": 1,
  "writeQuorum": 2,
  "failureAvailableNodesLessQuorum": false,
  "readYourWrites": true,
  "clusters": {
    "internal": {
    },
    "index": {
    },
    "client_0": {
      "servers" : [ "europe", "usa" ]
    },
    "client_1": {
      "servers" : [ "usa" ]
    },
    "client_2": {
      "servers" : [ "asia", "europe", "usa" ]
    },
    "*": {
      "servers" : [ "<NEW_NODE>" ]
    }
  }
}
To enable auto-sharding, create one cluster per server for every class you want to shard automatically. OrientDB can do this for you every time a new class is created, by executing this command first:
alter database minimumclusters 4
This will create 4 clusters for every newly created class, so the following command creates the "Garden" class with 4 clusters (garden_0, garden_1, garden_2, garden_3):
create class Garden
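The same sequence can be sketched from Java, assuming an open ODatabaseDocumentTx named `db` as in the earlier example; the printed message is only illustrative:

```java
import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.sql.OCommandSQL;

// Raise the minimum number of clusters created per class...
db.command(new OCommandSQL("ALTER DATABASE MINIMUMCLUSTERS 4")).execute();
// ...then create the class and check how many clusters it received.
OClass garden = db.getMetadata().getSchema().createClass("Garden");
System.out.println("Garden spans " + garden.getClusterIds().length + " clusters");
```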
By default each class uses the ROUND_ROBIN strategy, where the clusters are used in turn on insertion. This generally works well when you pre-create multiple clusters, but if you add further clusters to an existing class, they start with 0 records. In this case you can use the BALANCED strategy, which favors the emptiest cluster so that it fills up faster than the others.
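The strategy can be changed per class with an ALTER CLASS command; a minimal sketch, again assuming an open database handle `db`:

```java
import com.orientechnologies.orient.core.sql.OCommandSQL;

// Switch the Client class to the balanced strategy:
// new records are routed to the emptiest cluster first.
db.command(new OCommandSQL("ALTER CLASS Client CLUSTERSELECTION balanced")).execute();
```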
The application can decide where to insert a new Client. To insert it into the first cluster, specify the cluster name:
INSERT INTO Client CLUSTER:client_0 SET name = 'Jay'
The same can be done through the Graph API:
OrientVertex v = graph.addVertex("class:Client,cluster:client_0");
v.setProperty("name", "Jay");
Or through the Document API, passing the target cluster name to the save() method:
ODocument doc = new ODocument("Client");
doc.field("name", "Jay");
doc.save( "client_0" );
In all of these cases, OrientDB sends the record to both the “europe” and “usa” nodes, because client_0 has been configured with 2 servers: [ "europe", "usa" ].
All queries work by aggregating the result sets of all the involved nodes. Example of a query against a particular cluster:
select from cluster:client_0
This query will be executed on either the “europe” or the “usa” node. This query, instead, runs against the class:
select from Client
It will be executed against all 3 clusters that make up the Client class, and therefore against all the nodes.
OrientDB supports Map/Reduce through OrientDB SQL. The Map/Reduce operation is totally transparent to the developer. When a query involves multiple shards (clusters), OrientDB executes the query against all the involved server nodes (the Map operation) and then merges the results (the Reduce operation). Example:
select max(amount), count(*), sum(amount) from Client
In this case the query is executed across all 3 nodes and the partial results are then merged and filtered again on the node that started the query.
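A sketch of running the same aggregation from Java, assuming an open database handle `db`; the projection names "max", "count" and "sum" follow OrientDB's default naming of function projections:

```java
import java.util.List;

import com.orientechnologies.orient.core.record.impl.ODocument;
import com.orientechnologies.orient.core.sql.query.OSQLSynchQuery;

// The Map phase runs on every server owning a Client cluster;
// the Reduce phase merges the partial results on the node that issued the query.
List<ODocument> result = db.query(new OSQLSynchQuery<ODocument>(
    "select max(amount), count(*), sum(amount) from Client"));
ODocument row = result.get(0);
System.out.println("max=" + row.field("max")
    + ", count=" + row.field("count")
    + ", sum=" + row.field("sum"));
```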
All indexes are managed locally by each server. This means that if a class spans 3 clusters on 3 different servers, each server maintains its own local indexes. When a distributed (Map/Reduce-like) query is executed, each server uses its own indexes.
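For example, an index defined on the sharded Client class is built independently by every server that owns one of its clusters; a sketch, assuming an open database handle `db` and that the `name` property is not yet in the schema:

```java
import com.orientechnologies.orient.core.metadata.schema.OClass;
import com.orientechnologies.orient.core.metadata.schema.OType;

OClass client = db.getMetadata().getSchema().getClass("Client");
// Define the property if needed, then index it: each server keeps its own copy of the index.
if (!client.existsProperty("name"))
  client.createProperty("name", OType.STRING);
client.createIndex("Client.name", OClass.INDEX_TYPE.NOTUNIQUE, "name");
```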
With the Community Edition the distributed configuration cannot be changed at run-time: you have to stop and restart all the nodes. The Enterprise Edition allows you to create and drop shards without stopping the distributed cluster.
Using the Enterprise Edition and the Workbench, you can deploy the database to a new server and define the clusters to assign to it. In this example a new server "usa2" is created and only the cluster "client_1" is copied to it. After the deployment, the cluster "client_1" will be replicated across the nodes "usa" and "usa2".