Neuroglancer Precomputed Sharded Format

The precomputed sharded format is logically a key-value store that maps 8-byte (uint64) keys to arbitrary byte sequence values.

It packs any number of key/value pairs (called chunks) into a fixed number of larger shard files. Compared to storing each key in a separate file, it can reduce space overhead and improve write efficiency on storage systems with high per-file overhead, as is common in many distributed storage systems, including cloud object stores. There are several downsides to the sharded format, however:

  • It requires greater complexity in the generation pipeline.
  • It is not possible to re-write the data for individual chunks; the entire shard must be re-written.
  • There is somewhat higher read latency due to the need to retrieve additional index information before retrieving the actual chunk data, although this latency is partially mitigated by client-side caching of the index data in Neuroglancer.

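The sharded format uses a two-level index hierarchy: a fixed-size shard index at the start of each shard file locates the minishard index of each minishard, and each minishard index in turn lists the chunk IDs in that minishard and the location of their data within the shard file.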

Note

The sharded format requires that the underlying key-value store supports :ref:`byte range reads<kvstore-byte-range-reads>`.

The sharded format consists of the :json:schema:`sharding metadata parameters<PrecomputedSharding>`, which are embedded in the parent format :file:`info` metadata file, and a directory containing the :ref:`shard data files<precomputed-sharded-shard-data-files>`.

Sharding metadata

.. json:schema:: PrecomputedSharding

Shard data files

For each shard number in the range [0, 2**shard_bits), there is a :file:`{<shard>}.shard` file, where :file:`{<shard>}` is the lowercase base-16 shard number zero padded to ceil(shard_bits/4) digits.
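The naming rule above can be sketched in a few lines of Python. The function name ``shard_filename`` is hypothetical and chosen only for illustration; it is not part of Neuroglancer.

```python
def shard_filename(shard: int, shard_bits: int) -> str:
    """Return the file name for a shard number (illustrative helper)."""
    if not 0 <= shard < 2**shard_bits:
        raise ValueError("shard number must be in [0, 2**shard_bits)")
    # ceil(shard_bits / 4) hex digits, computed with integer arithmetic.
    num_digits = (shard_bits + 3) // 4
    # Lowercase base-16, zero padded to num_digits.
    return f"{shard:0{num_digits}x}.shard"
```

For example, with ``shard_bits = 9`` the shard number is padded to three hex digits, so shard 1 is stored as ``001.shard``.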

Note

There was an earlier (obsolete) version of the sharded format, which used the same :json:`"neuroglancer_uint64_sharded_v1"` identifier. The earlier format differed only in that there was a separate :file:`{<shard>}.index` file (containing the shard index) and a :file:`{<shard>}.data` file (containing the remaining data) in place of the single :file:`{<shard>}.shard` file of the current format; the :file:`{<shard>}.shard` file is equivalent to the concatenation of the :file:`{<shard>}.index` and :file:`{<shard>}.data` files of the earlier version.

Shard index format

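The first 2**minishard_bits * 16 bytes of each shard file is the shard index, consisting of 2**minishard_bits 16-byte entries of the form:

  • start_offset: uint64le, the inclusive start byte offset of the minishard index in the shard file.
  • end_offset: uint64le, the exclusive end byte offset of the minishard index in the shard file.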

Both the start_offset and end_offset are relative to the end of the shard index, i.e. shard_index_end = 2**minishard_bits * 16 bytes.

That is, the encoded :ref:`minishard index<precomputed-sharded-format-minishard-index>` for a given minishard is stored in the byte range [shard_index_end + start_offset, shard_index_end + end_offset) of the shard file. A zero-length byte range indicates that there are no chunk IDs in the minishard.
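As a minimal sketch of the lookup described above, the following Python function resolves the absolute byte range of one minishard index from the raw shard index bytes. It assumes each 16-byte entry is a pair of little-endian uint64 values (start_offset, end_offset); the function name ``minishard_index_range`` is hypothetical.

```python
import struct


def minishard_index_range(shard_index: bytes, minishard: int, minishard_bits: int):
    """Return the absolute [start, end) byte range of a minishard index.

    Returns None when the range is empty, i.e. the minishard has no chunks.
    """
    shard_index_end = (2**minishard_bits) * 16
    if len(shard_index) != shard_index_end:
        raise ValueError("shard index must be 2**minishard_bits * 16 bytes")
    if not 0 <= minishard < 2**minishard_bits:
        raise ValueError("minishard number out of range")
    # Assumed entry layout: two little-endian uint64 values per minishard.
    start_offset, end_offset = struct.unpack_from("<QQ", shard_index, minishard * 16)
    if start_offset == end_offset:
        return None
    # Offsets are relative to the end of the shard index.
    return shard_index_end + start_offset, shard_index_end + end_offset
```

Since the shard index has a fixed size, a reader can fetch it with a single byte range read of the first ``2**minishard_bits * 16`` bytes and then issue a second range read for the returned interval.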

Minishard index format

The minishard index stored in the shard file is encoded according to the :json:schema:`~PrecomputedSharding.minishard_index_encoding` metadata value.

The decoded minishard index is a binary string of 24*n bytes, specifying a contiguous C-order array of [3, n] uint64le values.

  • Values array[0, 0], ..., array[0, n-1] specify the chunk IDs in the minishard, and are delta encoded, such that array[0, 0] is equal to the ID of the first chunk, and the ID of chunk i is equal to the sum of array[0, 0], ..., array[0, i].
  • Values array[1, 0], ..., array[1, n-1] specify the starting offsets in the shard file of the data corresponding to each chunk, and are also delta encoded, relative to the end of the prior chunk: the starting offset of the first chunk is equal to shard_index_end + array[1, 0], and the starting offset of chunk i is equal to shard_index_end plus the sum of array[1, 0], ..., array[1, i] plus the sum of array[2, 0], ..., array[2, i-1].
  • Values array[2, 0], ..., array[2, n-1] specify the size in bytes of the data corresponding to each chunk.
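The delta decoding above can be illustrated with a short Python sketch. It takes the already decoded minishard index (after any ``minishard_index_encoding`` has been undone) and returns absolute chunk IDs and byte ranges. The function name ``decode_minishard_index`` and the tuple layout of the result are assumptions made for this example.

```python
import struct


def decode_minishard_index(decoded: bytes, shard_index_end: int):
    """Return a list of (chunk_id, start, end) with absolute [start, end) ranges."""
    if len(decoded) % 24 != 0:
        raise ValueError("decoded minishard index must be 24*n bytes")
    n = len(decoded) // 24
    # C-order [3, n] array of uint64le: row 0, then row 1, then row 2.
    values = struct.unpack(f"<{3 * n}Q", decoded)
    id_deltas = values[:n]
    offset_deltas = values[n:2 * n]
    sizes = values[2 * n:]

    chunks = []
    chunk_id = 0
    position = shard_index_end  # end of the prior chunk; initially the end of the shard index
    for i in range(n):
        chunk_id += id_deltas[i]             # running sum of row 0
        start = position + offset_deltas[i]  # offset is relative to the prior chunk's end
        end = start + sizes[i]
        chunks.append((chunk_id, start, end))
        position = end
    return chunks
```

Tracking the end of the prior chunk in ``position`` is equivalent to the closed-form sum given above, and avoids recomputing the partial sums for each chunk.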

The start and size values in the minishard index specify the location in the shard file of the chunk data, which is encoded according to the :json:schema:`~PrecomputedSharding.data_encoding` metadata value.