The Paginated Local Storage, "plocal" from now, is a disk based storage which works with data using page model.
plocal storage consists of several components each of those components use disk data through disk cache.
Below is list of plocal storage components and short description of each of them:
- Clusters are managed by 2 kinds of files:
- .pcl files contain the cluster data
- .cpm files contain the mapping between record's cluster position and real physical position
- Write Ahead (operation) Log (WAL) are managed by 2 kinds of files:
- .wal to store the log content
- .wmr contains timing about synchronization operations between storage cache and disk system
- SBTree Index, it uses files with extensions .sbt.
- Hash Index, it uses files with extensions .hit, .him and .hib.
- Index Containers to store values of single entries of not unique index (Index RID Set). It uses files with extension .irs.
- File mapping, maps between file names and file ids (used internally). It's a single file with name: name_id_map.cm.
Since PLOCAL is disk-based, all pages are flushed to physical files. You can specify any mounted partitions on your machine, backed by Disks, SSD, Flash Disks or DRAM.
Cluster is logical piece of disk space where storage stores records data. Each cluster is split in pages. Page is a single atomic unit, which is used by cluster.
Each page contains system information and records data. System information includes "magic number" and a crc32 check sum of the page content. This information is used to check storage integrity after a DB crash. To start an integrity check run command "check database" from console.
Each cluster has 2 sub components:
- data file, with extension .pcl
- mapping between physical position of record in the data file and cluster position, with extension .cpm
To speed up access to the most requested clusters it's recommended to use the cluster files to a SSD or any faster support than disk. To do that, move the files to the mounted partitions and create symbolic links to them on original path. OrientDB will follow symbolic links and will open cluster files everywhere are reachable.
The mapping between data file and physical position is managed with a list. Each entry in this list is a fixed size element, which is the pointer to the physical position of the record in the data file.
Because the data file is paginated, this pointer will consist of 2 items: a page index (long value) and the position of the record inside the page (int value). Each record pointer consumes 12 bytes.
When a new record is inserted, a pointer is added to the list. The index of this pointer is the cluster position. The list is an append only data structure, so if you add a new record its cluster position will be unique and will not be reused.
When you delete a record, the page index and record position are set to -1. So the record pointer is transformed into a "tombstone". You can think of a record id like a uuid.. It is unique and never reused.
Usually when you delete records you lose very small amount of disk space. This can be mitigated with a periodic "offline compaction", by performing a database export/import. During this process, cluster positions will be changed (tombstones will be ignored during export) and the lost space will be recovered. So during the import process, the cluster positions can change.
The OrientDB import tool uses a manual hash index (by default the name is '___exportImportRIDMap') to map the old record ids to new record ids.
The Write Ahead Log (or WAL) is used to restore storage data after a non-soft shutdown:
- Hard kill of the OrientDB process
- Crash/Failure of the Java Virtual Machine that runs OrientDB
- Crash/Failure of the Operating System that is hosting OrientDB
All the operations on plocal components are logged in WAL before they are performed. WAL is an append only data structure. You can think of it as a list of records which contain information about operations performed on storage components.
WAL content is flushed to the disk on these events:
- every 1 second in background thread (flush interval can be changed in storage.wal.commitTimeout configuration property)
- synchronously if the amount of RAM used by WAL exceeds 65Mb (can be changed in storage.wal.cacheSize configuration property).
As result if OrientDB crashes, all data changes done during <=1 second interval before crash will be lost. This is a trade off between performance and durability.
It's strongly recommended that WAL records are stored on a different disk than the disk used to store the DB content. In this way data I/O operations will not be interrupted by WAL I/O operations. This can be done by setting the storage.wal.path property to the folder where storage WAL files will be placed.
Indexes can work with WAL in 2 modes:
- ROLLBACK_ONLY (default mode) and
In ROLLBACK_ONLY mode only the data needed to rollback transactions is stored. This means that WAL records can not be used to restore index content after a crash. In the case of a crash, the indexes will be rebuilt automatically.
In FULL mode, indexes can be restored after DB crash without a rebuild. You can change index durability mode by setting the property index.txMode.
You can find more details about WAL here.
PLocal stores data on the file system using different files, using the following extensions:
- .cpm, contains the mapping between real physical positions and cluster positions. If you delete a record, the tombstone is placed here. Each tombstone consumes about 12 bytes.
- .pcl, data file
- .sbt, is index file
- .wal and .wmr, are Journal Write Ahead (operation) Log files
- .cm, is the mapping between file id and real file name (used internally)
- .irs, is the RID set file for non-unique indexes
Paginated storage is a 2-level disk cache that works together with the write ahead log.
Every file is spit into pages, and each file operation is atomic at a page level. The 2-level disk cache allows:
- Cache frequently accessed pages in memory.
- Automatically separate pages which are rarely accessed from frequently accessed and rid off the first from cache memory.
- Minimize amount of disk head seeks during data writes.
- In case of low or middle write data load allows to mitigate pauses are needed to write data to the disk by flushing all changed or newly added pages to the disk in background thread.
- Works together with WAL to make any set changes on single page look like atomic operation.
2-level cache itself consist of a Read Cache (implementation is based on 2Q cache algorithm) and a *Write cache (implementation is based on WOW cache algorithm).
Typical set of operations are needed to work with any file looks like following:
- Open file using OReadWriteDiskCache#openFile operation and get id of open file. If the file does not exist it will be automatically created. The id of file is stored in a special meta data file and will always belong to the given file till it will be deleted.
- Allocate new page OReadWriteDiskCache#allocateNewPage or load existing one ORreadWriteDiskCache#load into off-heap memory.
- Retrieve pointer to the allocated area of off-heap memory OCacheEntry#getCachePointer().
- If you plan to change page data acquire a write lock, or a read lock if you read data and your single file page is shared across several data structures. Write lock must be acquired whether a single page is used between several data structures or not. The write lock is needed to prevent flushing inconsistent pages to the disk by the “data flush” thread of the write cache. OCachePointer#acquireExclusiveLock.
- Update/read data in off heap memory.
- Release write lock if needed. OCachePointer#releaseExclusiveLock.
- Mark page as dirty if you changed page data. It will allow write cache to flush pages which are really changed OCacheEntry#markDirty.
- Push record back to the disk cache: indicate to the cache that you will not use this page any more so it can be safely evicted from the memory to make room to other pages OReadWriteDiskCache#release.
When we load page the Read Cache looks it in one of its two LRU lists. One list is for data that was accessed several times and then not accessed for very long period of time. It consumes 25% of memory. The second is for data that is accessed frequently for a long period of time. It consumes 75% of memory.
If the page is not in either LRU queue, the Read Cache asks the Write Cache to load page from the disk.
If we are lucky and the page is queued to flush but is still in the Write Queue of Write Cache it will be retrieved from there. Otherwise, the Write Cache will load the page from disk.
When data will be read from file by Write Cache, it will be put in LRU queue which contains “short living” pages. Eventually, if this pages will be accessed frequently during long time interval, loaded page will be moved to the LRU of “long living” pages.
When we release a page and the page is marked as dirty, it is put into the Write Cache which adds it to the Write Queue. The Write Queue can be considered as ring buffer where all the pages are sorted by their position on the disk. This trick allows to minimize disk head movements during pages flush. What is more interesting is that pages are always flushed in the background in the “background flush” thread. This approach allows to mitigate I/O bottleneck if we have enough RAM to work in memory only and flush data in background.
So it was about how disk cache works. But how we achieve durability of changes on page level and what is more interesting on the level when we work with complex data structures like Trees or Hash Maps (these data structures are used in indexes).
If we look back on set of operations which we perform to manipulate file data you see that step 5 does not contains any references to OrientDB API. That is because there are two ways to work with off heap pages: durable and not durable.
The simple (not durable way) is to work with methods of direct memory pointer com.orientechnologies.common.directmemory.ODirectMemoryPointer(setLong/getLong, setInt/getInt and so on). If you would like to make all changes in your data structures durable you should not work with direct memory pointer but should create a component that will present part of your data structure by extending com.orientechnologies.orient.core.storage.impl.local.paginated.ODurablePage class. This class has similar methods for manipulation of data in off heap pages, but it also tracks all changes made to the page. It can return the diff between the old/new states of page using the com.orientechnologies.orient.core.storage.impl.local.paginated.ODurablePage#getPageChanges method. Also this class allows to apply given diff to the old/new snapshot of given pages to repeat/revert (restoreChanges()/revertChanges()) changes are done for this page.