The Paginated Local Storage, "plocal" from now on, is a disk-based storage that works with data using a page model.
The plocal storage consists of several components; each of them works with disk data through the disk cache.
Below is the list of the plocal storage components, with a short description of each:
Since plocal is based on disk, all the pages are flushed to physical files. You can use any mounted partition on your machine, backed by hard disks, SSDs, flash disks or DRAM.
A cluster is a logical piece of disk space where the storage keeps record data. Each cluster is split into pages; a page is the single atomic unit used by the cluster.
Each page contains system information and record data. The system information includes a "magic number" and a CRC32 checksum of the page content; it is used to check storage integrity after a DB crash. To start the integrity check, run the "check database" command from the console.
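To make the checksum idea concrete, here is a minimal sketch in Java, assuming a hypothetical page layout of an 8-byte magic number followed by a 4-byte CRC32 of the content; OrientDB's actual on-disk layout may differ:

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class PageIntegrity {
  // Hypothetical layout: 8-byte magic number, 4-byte CRC32, then the page content.
  static final long MAGIC_NUMBER = 0xFACB03FEL; // illustrative value only

  static boolean isPageIntact(byte[] page) {
    ByteBuffer header = ByteBuffer.wrap(page);
    if (header.getLong() != MAGIC_NUMBER)
      return false; // header corrupted or page never fully written

    int storedCrc = header.getInt();

    // The checksum covers everything after the 12-byte system header.
    CRC32 crc = new CRC32();
    crc.update(page, 12, page.length - 12);
    return (int) crc.getValue() == storedCrc;
  }
}
```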
Each cluster has two subcomponents:
To speed up access to the most requested clusters, it's recommended to keep their cluster files on an SSD or any medium faster than the main disk. To do that, move the files to the faster mounted partition and create symbolic links to them at the original path, as shown in the sketch below. OrientDB follows symbolic links and opens cluster files wherever they are reachable.
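A minimal sketch of the move-and-link step using java.nio.file; the paths and the file name are hypothetical placeholders for a real cluster file and a faster partition, and the server should be stopped before storage files are moved:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MoveClusterFile {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths: adjust to your database directory and fast partition.
    Path original = Paths.get("/data/databases/mydb/mycluster.pcl");
    Path onSsd = Paths.get("/mnt/ssd/mydb/mycluster.pcl");

    // Move the cluster file to the faster partition, then leave a symbolic
    // link at the original location so OrientDB can still find the file.
    Files.createDirectories(onSsd.getParent());
    Files.move(original, onSsd);
    Files.createSymbolicLink(original, onSsd);
  }
}
```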
The mapping between a record and its physical position is managed with a list, where each entry is a fixed-size element: the pointer to the physical position of the record in the data file.
Because the data file is paginated, this pointer consists of two items: the page index (a long value) and the position of the record inside the page (an int value), so each record pointer consumes 12 bytes.
When a new record is inserted, a new pointer is added to the list, and the index of that pointer becomes the record's cluster position. The list is an append-only data structure, so the cluster position of a newly added record is unique and will not be reused.
When you delete a record, its page index and record position are set to -1, which turns the record pointer into a tombstone. You can think of a record id like a UUID: it is unique and never reused.
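A minimal sketch of this scheme, as a simplified in-memory model rather than OrientDB's actual on-disk structure:

```java
import java.util.ArrayList;
import java.util.List;

public class PointerList {
  // Each pointer takes 12 bytes on disk: page index (long) + position in page (int).
  static final class RecordPointer {
    long pageIndex;
    int pagePosition;

    RecordPointer(long pageIndex, int pagePosition) {
      this.pageIndex = pageIndex;
      this.pagePosition = pagePosition;
    }

    boolean isTombstone() {
      return pageIndex == -1 && pagePosition == -1;
    }
  }

  private final List<RecordPointer> pointers = new ArrayList<>();

  // Append-only: the returned cluster position is unique and never reused.
  long create(long pageIndex, int pagePosition) {
    pointers.add(new RecordPointer(pageIndex, pagePosition));
    return pointers.size() - 1;
  }

  // Deletion turns the pointer into a tombstone instead of removing the entry,
  // so the cluster positions of all other records stay stable.
  void delete(long clusterPosition) {
    RecordPointer pointer = pointers.get((int) clusterPosition);
    pointer.pageIndex = -1;
    pointer.pagePosition = -1;
  }
}
```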
Usually when you delete records you lose only a very small amount of disk space. It can be reclaimed with a periodic "offline compaction", performed through a database export/import: tombstones are ignored during the export, so the lost space is recovered, and as a consequence the cluster positions of the records change during the import.
The OrientDB import tool uses a manual hash index (named '___exportImportRIDMap' by default) to map the old record ids to the new record ids.
The Write Ahead Log, "WAL" from now on, is used to restore storage data after a non-soft shutdown:
All the operations on plocal components are logged in the WAL before they are performed on those components. The WAL is an append-only data structure; you can think of it as a list of records containing information about the operations performed on the storage components.
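A minimal sketch of the write-ahead discipline itself, with a hypothetical log record type (OrientDB's real WAL records are more elaborate):

```java
import java.util.ArrayList;
import java.util.List;

public class WriteAheadSketch {
  // Hypothetical log record: which page changed and its new content.
  static final class LogRecord {
    final long pageIndex;
    final byte[] newContent;

    LogRecord(long pageIndex, byte[] newContent) {
      this.pageIndex = pageIndex;
      this.newContent = newContent;
    }
  }

  private final List<LogRecord> log = new ArrayList<>();

  public void updatePage(long pageIndex, byte[] newContent) {
    // 1. Append the operation to the log first; the log reaches the disk
    //    according to the flush policy described below.
    log.add(new LogRecord(pageIndex, newContent));
    // 2. Only then apply the change to the page through the disk cache.
    applyToPage(pageIndex, newContent);
  }

  private void applyToPage(long pageIndex, byte[] newContent) {
    // In the real storage this updates the off heap page via the disk cache.
  }
}
```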
WAL content is flushed to the disk on these events:
As a result, if OrientDB crashes, all data changes made during the interval of at most 1 second before the crash will be lost. This is the trade-off between performance and durability.
It's strongly recommended to store the WAL records on a disk separate from the one that stores the DB content, so that data I/O operations are not interrupted by WAL I/O operations. This can be done by setting the storage.wal.path property to the folder where the storage WAL files will be placed.
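One way to set the property programmatically is through the OGlobalConfiguration registry; I assume here that the WAL_LOCATION entry is the one backing storage.wal.path (verify the key against your OrientDB version), and the path itself is hypothetical:

```java
import com.orientechnologies.orient.core.config.OGlobalConfiguration;

public class WalLocation {
  public static void main(String[] args) {
    // Assumed mapping: OGlobalConfiguration.WAL_LOCATION backs "storage.wal.path".
    // Must be set before any plocal storage is opened; the folder is hypothetical.
    OGlobalConfiguration.WAL_LOCATION.setValue("/mnt/wal-disk/orientdb-wal");
  }
}
```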
Indexes can work with the WAL in two modes: ROLLBACK_ONLY and FULL.
In ROLLBACK_ONLY mode, only the data needed to roll back transactions is stored, and the WAL records cannot be used to restore index content after a crash; in that case automatic indexes are rebuilt. In FULL mode, indexes can be restored after a DB crash without a rebuild. You can change the index durability mode by setting the index.txMode property.
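As with the WAL path, the mode can presumably be switched through OGlobalConfiguration; I assume INDEX_TX_MODE is the entry backing index.txMode (again, verify against your version):

```java
import com.orientechnologies.orient.core.config.OGlobalConfiguration;

public class IndexDurability {
  public static void main(String[] args) {
    // Assumed mapping: OGlobalConfiguration.INDEX_TX_MODE backs "index.txMode".
    // Accepted values, per the text above: "FULL" or "ROLLBACK_ONLY".
    OGlobalConfiguration.INDEX_TX_MODE.setValue("FULL");
  }
}
```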
You can find more details about the WAL here.
The plocal storage writes the database to the file system using several kinds of files. Below are all the extensions:
Basically, the paginated storage is nothing more than a 2-level disk cache that works together with the write ahead log.
Every file is split into pages, and each file operation is atomic at the page level. The 2-level disk cache allows:
The 2-level cache itself consists of the Read Cache (whose implementation is based on the 2Q cache algorithm) and the Write Cache (whose implementation is based on the WOW cache algorithm).
The typical set of operations needed to work with any file looks like the following:
When we load a page, the Read Cache first looks it up in one of its LRU lists. There are two of them: one for data that is accessed several times and then not accessed for a very long period of time (it consumes 25% of the cache memory), and one for data that is accessed frequently over a long period of time (it consumes 75% of the cache memory).
If the page is absent from both LRU queues, the Read Cache asks the Write Cache to load the data from disk.
If we are lucky and the page queued for flush is still in the Write Queue of the Write Cache, it is retrieved from there; otherwise the Write Cache loads the data from the file on disk.
When the data has been read from the file by the Write Cache, it is put into the LRU queue that holds the "short living" pages. Eventually, if the page is accessed frequently over a long time interval, it is moved to the LRU queue of the "long living" pages.
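A minimal sketch of this two-queue policy; it is a simplification of 2Q (there is no ghost queue) and not OrientDB's actual read cache:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TwoQueueCache<K, V> {
  private final Map<K, V> shortLiving; // ~25% of the cache: newly loaded pages
  private final Map<K, V> longLiving;  // ~75% of the cache: frequently used pages

  public TwoQueueCache(int capacity) {
    shortLiving = lru(Math.max(1, capacity / 4));
    longLiving = lru(Math.max(1, capacity - capacity / 4));
  }

  public V get(K key) {
    // A repeated hit in the "short living" queue promotes the page.
    V value = shortLiving.remove(key);
    if (value != null) {
      longLiving.put(key, value);
      return value;
    }
    return longLiving.get(key); // refreshes LRU order of "long living" pages
  }

  // A page loaded from disk always enters the "short living" queue first.
  public void load(K key, V value) {
    shortLiving.put(key, value);
  }

  // Access-ordered LinkedHashMap that evicts its eldest entry beyond maxSize.
  private static <K, V> Map<K, V> lru(final int maxSize) {
    return new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxSize;
      }
    };
  }
}
```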
When we release a page that is marked as dirty, it is put into the Write Cache, which adds it to the Write Queue. The Write Queue can be considered a ring buffer where all the pages are sorted by their position on disk. This trick minimizes disk head movements while pages are flushed. What is even more interesting is that pages are always flushed in the background, in a "background flush" thread. This approach mitigates the I/O bottleneck, as long as there is enough RAM to work in memory only while data is flushed in the background.
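A minimal sketch of flushing dirty pages sorted by their position in the file from a background thread; it uses a sorted map instead of a ring buffer, the flush interval is hypothetical, and OrientDB's WOW-based write cache is considerably more sophisticated:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

public class BackgroundFlusher {
  // Dirty pages keyed by page index, i.e. sorted by their position on disk,
  // so a flush pass writes them in (mostly) sequential order.
  private final ConcurrentSkipListMap<Long, byte[]> writeQueue = new ConcurrentSkipListMap<>();

  public void markDirty(long pageIndex, byte[] pageContent) {
    writeQueue.put(pageIndex, pageContent);
  }

  public void startBackgroundFlush() {
    Thread flusher = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        for (Map.Entry<Long, byte[]> entry : writeQueue.entrySet()) {
          writeToDisk(entry.getKey(), entry.getValue());
          // Remove only if the page was not re-dirtied during the write;
          // otherwise the newer content stays queued for the next pass.
          writeQueue.remove(entry.getKey(), entry.getValue());
        }
        try {
          Thread.sleep(25); // hypothetical flush interval
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }, "background flush");
    flusher.setDaemon(true);
    flusher.start();
  }

  private void writeToDisk(long pageIndex, byte[] content) {
    // In the real storage this writes the page at offset pageIndex * pageSize.
  }
}
```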
So much for how the disk cache works. But how do we achieve durability of changes at the page level and, more interestingly, at the level of complex data structures such as trees or hash maps (the data structures used in indexes)?
If we look back at the set of operations we perform to manipulate file data, you can see that step 5 does not contain any reference to the OrientDB API. That is because there are two ways to work with an off heap page: a durable one and a non-durable one.
The simple (not durable) way is to work with the methods of the direct memory pointer com.orientechnologies.common.directmemory.ODirectMemoryPointer (setLong/getLong, setInt/getInt and so on). If you would like to make all the changes in your data structures durable, you should not work with the direct memory pointer; instead, create a component that represents a part of your data structure and extends the com.orientechnologies.orient.core.storage.impl.local.paginated.ODurablePage class. This class has similar methods for manipulating data in the off heap page, but it also tracks all the changes made to the page, so the diff between the old and new states of the page can always be obtained through the com.orientechnologies.orient.core.storage.impl.local.paginated.ODurablePage#getPageChanges method. The class can also apply a given diff to an old or new snapshot of the page to repeat (restoreChanges()) or revert (revertChanges()) the changes made to that page.
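To make the change-tracking idea concrete, here is a minimal sketch; it is illustrative only, not the ODurablePage implementation, and works on a plain byte array instead of an off heap page:

```java
import java.util.ArrayList;
import java.util.List;

public class ChangeTrackingPage {
  // One logged change: where it happened, what was there, what is there now.
  static final class PageChange {
    final int offset;
    final byte[] oldValue;
    final byte[] newValue;

    PageChange(int offset, byte[] oldValue, byte[] newValue) {
      this.offset = offset;
      this.oldValue = oldValue;
      this.newValue = newValue;
    }
  }

  private final byte[] content;
  private final List<PageChange> changes = new ArrayList<>();

  public ChangeTrackingPage(byte[] content) {
    this.content = content;
  }

  // Every write is recorded before it is applied, so the page can always
  // produce the diff between its old and new states.
  public void setBytes(int offset, byte[] value) {
    byte[] old = new byte[value.length];
    System.arraycopy(content, offset, old, 0, value.length);
    changes.add(new PageChange(offset, old, value.clone()));
    System.arraycopy(value, 0, content, offset, value.length);
  }

  // The diff between the page's old and new states (cf. getPageChanges()).
  public List<PageChange> getPageChanges() {
    return changes;
  }

  // Repeat the logged changes on an old snapshot (cf. restoreChanges()).
  public static void restoreChanges(byte[] snapshot, List<PageChange> diff) {
    for (PageChange change : diff)
      System.arraycopy(change.newValue, 0, snapshot, change.offset, change.newValue.length);
  }

  // Undo the logged changes on a new snapshot, newest first (cf. revertChanges()).
  public static void revertChanges(byte[] snapshot, List<PageChange> diff) {
    for (int i = diff.size() - 1; i >= 0; i--) {
      PageChange change = diff.get(i);
      System.arraycopy(change.oldValue, 0, snapshot, change.offset, change.oldValue.length);
    }
  }
}
```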