a modern archive file format (in rust)

Format of a bitbottle

A bitbottle archive is a "bottle", with other bottles inside, nested like an onion.

Each bottle contains:

  • a fixed "bottle cap" describing the bottle type and header size
  • a variable-sized header of key/value pairs, describing metadata specific to this bottle type
  • zero or more streams, which may each be raw data or a nested bottle
  • an "end of streams" marker

For example:

+--------------------------------------+
| Bottle cap: "compressed"             |
+--------------------------------------+
| Header:                              |
|     - algorithm: snappy              |
+--------------------------------------+
| Data stream: (compressed)            |
|     +---------------------------+    |
|     | Bottle cap: "file list"   |    |
|     +---------------------------+    |
|     | Header:                   |    |
|     |     - ...                 |    |
|     +---------------------------+    |
|     | ...                       |    |
|     |                           |    |
|     +---------------------------+    |
|     | End                       |    |
|     +---------------------------+    |
+--------------------------------------+
| End                                  |
+--------------------------------------+

Bottle cap

Every bottle starts with a bottle cap, containing a bottle type, version, length of the header, and a CRC of the header and bottle cap data to detect corruption. All multi-byte fields are little-endian (LE).

Specifically, the CRC field is the CRC-32C of (version || type || header length || header).

+-Magic:---+----------+----------+----------+
| 0xF0     | 0x9F     | 0x8D     | 0xBC     |
+----------+----------+----------+----------+
| Version  | Type     | Header length (LE)  |
+----------+----------+---------------------+
| CRC-32C (LE)                              |
+-------------------------------------------+

The magic (F0 9F 8D BC) is present in every bottle. Its purpose is to confirm, as a double-check, that you are actually reading a bitbottle.

Version is always 00. If the version is not 00, you can't read the archive correctly. Future versions may ascribe more specific meaning to the bits of this field.

Only a few bottle types are defined:

  • 1 - File list
  • 2 - File
  • 3 - File block
  • 4 - Compressed
  • 5 - Encrypted
  • 6 - Signed

Because the header length is 16 bits, the header can be at most 65,535 bytes (just under 64KB). It will usually be much smaller.
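As a rough illustration of the layout above, the following sketch parses the 12-byte cap and checks the CRC. The names (`BottleCap`, `parse_cap`, `crc32c`) are made up for this example and are not taken from the bitbottle source. It assumes the CRC is computed over the raw encoded bytes of the version, type, and header length fields, followed by the header bytes.

```rust
/// Magic bytes that open every bottle cap.
const MAGIC: [u8; 4] = [0xF0, 0x9F, 0x8D, 0xBC];

/// Fields recovered from a bottle cap (hypothetical struct for this sketch).
#[derive(Debug, PartialEq)]
struct BottleCap {
    version: u8,
    bottle_type: u8,
    header_len: u16,
    crc: u32,
}

/// Bitwise CRC-32C (Castagnoli), using the reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, otherwise zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Parse the cap at the start of `buf`, verify the CRC, and return the cap
/// together with the header bytes that follow it.
fn parse_cap(buf: &[u8]) -> Result<(BottleCap, &[u8]), String> {
    if buf.len() < 12 {
        return Err("truncated bottle cap".to_string());
    }
    if buf[..4] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    let version = buf[4];
    if version != 0 {
        return Err(format!("unsupported version {}", version));
    }
    let bottle_type = buf[5];
    let header_len = u16::from_le_bytes([buf[6], buf[7]]);
    let crc = u32::from_le_bytes([buf[8], buf[9], buf[10], buf[11]]);
    let end = 12 + header_len as usize;
    let header = buf.get(12..end).ok_or("truncated header")?;
    // Assumption: CRC input is version || type || header length || header.
    let mut covered = buf[4..8].to_vec();
    covered.extend_from_slice(header);
    if crc32c(&covered) != crc {
        return Err("CRC mismatch".to_string());
    }
    Ok((BottleCap { version, bottle_type, header_len, crc }, header))
}
```

A table-driven or hardware-accelerated CRC-32C would be faster; the bitwise loop is used here only to keep the sketch dependency-free.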

Header

Each header is specific to its bottle type, but they share the same encoding format. The header format is also used for some data streams, where the data is meant to encode information that isn't just raw data or a nested bottle.

A header is a variable number of fields. Each field is a one-byte descriptor, optionally followed by bytes of data. The field type (0 - 15) is the high nybble of the descriptor (descriptor >> 4), and the field ID (0 - 15) is the low nybble (descriptor & 0xf).

  7   6   5   4   3   2   1   0
+---------------+---------------+
| Field type    | Field ID      |
+---------------+---------------+

Field IDs for types 8 - 15 are per field type, so descriptor 0x93 (type 9, ID 3) is a different field from 0xA3 (type 10, ID 3). In other words, the descriptor byte must be unique per header, but the ID itself must only be unique for its field type.

Field IDs for types 0 - 7 (integers) are unique among themselves, so descriptor 0x1A (type 1, ID 10) and 0x2A (type 2, ID 10) are different encodings for the same field, and only one of them may be in a header. The encoding must use the smallest type that will fit the full value.

Field lengths are implicit per type. For types 9 and A (10), the length is variable (0 - 255) and specified by a byte following the descriptor.

The types are:

  • 0 - u8 (length = 1)
  • 1 - u16 (length = 2)
  • 2 - u32 (length = 4)
  • 3 - u64 (length = 8)
  • 8 - bool (length = 0) -- the presence of this field is "true", the absence is "false"
  • 9 - UTF-8 string (length byte follows)
  • A - binary data (length byte follows)

Types 4 - 7 are reserved for negative integers, if they're needed in the future. No current bottle uses them.

For example, this header's fields are decoded as:

00 64 23 40 42 0f 00 80 90 03 63 61 74
^     ^              ^  ^
|     |              |  `-- UTF-8 string ID 0, length = 3, "cat"
|     |              `-- bool ID 0, true
|     `-- int ID 3, encoded as u32, 1_000_000 (0xf4240)
`-- int ID 0, encoded as u8, 100 (0x64)

Each bottle defines the exhaustive list of fields that can be in its header.
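The decoding rules above can be sketched as a small parser. The `Field` enum and the `decode_header` and `take` functions are hypothetical names for this example. The sketch assumes integers are little-endian, as in the bottle cap, and it does not enforce the "smallest type" or uniqueness rules.

```rust
/// One decoded header field (illustrative type, not from the bitbottle source).
#[derive(Debug, PartialEq)]
enum Field {
    Int { id: u8, value: u64 },
    Bool { id: u8 },
    Str { id: u8, value: String },
    Bytes { id: u8, value: Vec<u8> },
}

/// Split `n` bytes off the front of `buf`, or fail if it is too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], String> {
    let whole: &'a [u8] = *buf;
    if whole.len() < n {
        return Err("truncated field".to_string());
    }
    let (head, tail) = whole.split_at(n);
    *buf = tail;
    Ok(head)
}

/// Decode every field in a header.
fn decode_header(mut buf: &[u8]) -> Result<Vec<Field>, String> {
    let mut fields = Vec::new();
    while let Some((&descriptor, rest)) = buf.split_first() {
        buf = rest;
        let kind = descriptor >> 4; // high nybble: field type
        let id = descriptor & 0x0f; // low nybble: field ID
        match kind {
            // Types 0-3 are unsigned integers of 1, 2, 4, or 8 bytes.
            0..=3 => {
                let width = 1usize << kind;
                let raw = take(&mut buf, width)?;
                let mut le = [0u8; 8];
                le[..width].copy_from_slice(raw);
                fields.push(Field::Int { id, value: u64::from_le_bytes(le) });
            }
            // A bool carries no data; its presence means "true".
            8 => fields.push(Field::Bool { id }),
            // Strings and binary data are preceded by a length byte.
            9 | 10 => {
                let len = take(&mut buf, 1)?[0] as usize;
                let raw = take(&mut buf, len)?.to_vec();
                if kind == 9 {
                    let value = String::from_utf8(raw).map_err(|e| e.to_string())?;
                    fields.push(Field::Str { id, value });
                } else {
                    fields.push(Field::Bytes { id, value: raw });
                }
            }
            other => return Err(format!("unsupported field type {}", other)),
        }
    }
    Ok(fields)
}
```

Fed the 13 example bytes above, this yields four fields: int 0 = 100, int 3 = 1_000_000, bool 0, and string 0 = "cat".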

Streams

Each stream is preceded by a byte to indicate which type of stream will follow.

  • 0x40 - raw data stream
  • 0x80 - nested bottle
  • 0xc0 - end of streams, and end of bottle

A nested bottle is included as-is, with no framing, because it has a deterministic ending.

A raw data stream uses a frame encoding called "expony": Each frame is preceded by a length byte.

  7   6   5   4   3   2   1   0
+-------+-----------------------+
| Y     | X                     |
+-------+-----------------------+

The length of the frame is computed as X * 2**(6 * Y) or X << (6 * Y). In other words, Y encodes the scale (1, 64, 4KB, or 256KB), and X encodes the multiplier for that scale.

For example:

  • 0xc1 = 0b11000001 = 1 << 18 = 256KB
  • 0xe0 = 0b11100000 = 32 << 18 = 8MB
  • 0x3f = 0b00111111 = 63 << 0 = 63

Every frame must have Y > 0 until the last frame, which has Y = 0 and may be any size from 0 to 63. Frames with X = 0 and Y > 0 are illegal.

Normally, a data stream will be buffered so that each frame is the same round size (like 1MB) until the last few, which wrap up the straggling bytes.

The largest frame size is just under 16MB (Y = 3, X = 63).
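A minimal sketch of reading an expony-framed stream follows. The function names are invented for this example. It assumes the input starts at the first frame length byte (just after the 0x40 stream marker), and it treats any zero-length frame with Y > 0 as invalid, which is a choice made by this sketch.

```rust
/// Decode one expony length byte: Y is the top two bits, X the low six.
fn frame_len(byte: u8) -> usize {
    let y = (byte >> 6) as usize;
    let x = (byte & 0x3f) as usize;
    x << (6 * y)
}

/// Concatenate the frames of one raw data stream. Returns the data and the
/// number of input bytes consumed, so the caller can continue with whatever
/// follows the stream.
fn read_raw_stream(buf: &[u8]) -> Result<(Vec<u8>, usize), String> {
    let mut data = Vec::new();
    let mut pos = 0;
    loop {
        let byte = *buf.get(pos).ok_or("missing frame length byte")?;
        pos += 1;
        let scale = byte >> 6;
        let len = frame_len(byte);
        if scale > 0 && len == 0 {
            return Err("zero-length frame with Y > 0".to_string());
        }
        let frame = buf.get(pos..pos + len).ok_or("truncated frame")?;
        data.extend_from_slice(frame);
        pos += len;
        if scale == 0 {
            // A frame with Y = 0 (0 to 63 bytes) is the last one.
            return Ok((data, pos));
        }
    }
}
```

Because the final frame always has Y = 0, a stream whose length is an exact multiple of 64 still ends with an explicit zero-length frame (a single 0x00 byte).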

Bottle types

File list (1)

Header fields:

  • int 0: total file count
  • int 1: total block count
  • int 2: hash function used for files & blocks
    • 0: SHA-256
    • 1: Blake2
    • 2: Blake3

Streams:

  • (int 0) * bottle 2 (File)
  • (int 1) * bottle 3 (Block)

A file list contains a list of files and folders and the contents of those files. Each enclosed "File" bottle contains a file or folder's metadata, the hash of the entire contents (for files) and a list of the blocks that make up the file. Each enclosed "Block" bottle is a block of raw data, identified by its hash.

File (2)

Header fields:

  • string 0: filename (full relative path)
  • bool 0: is this a folder?
  • int 0: file size (F)
  • int 1: posix mode (only the lower 9 bits)
  • int 2: creation time (ctime) in nanoseconds since epoch
  • int 3: modification time (mtime) in nanoseconds since epoch
  • int 4: block count (F)
  • string 1: mime type (currently unused)
  • string 2: posix owner username
  • string 3: posix group name
  • bytes 0: hash of the entire file (using the algorithm from the file list) (F)
  • string 4: target of the symlink (symlinks only)

Fields marked with (F) are only present for files, not folders.

Streams:

  • (int 4) * data, header-encoded:
    • int 0: block size
    • bytes 0: hash of the block (using the algorithm from the file list)

The streams are a list of the blocks that make up the file, in order, with no gaps. The sum of the size of each block in this list must be the file size in int 0. Each block exists at an offset computed by summing the sizes of each previous block. The first block is always at offset 0. For example, if there are 3 blocks, with sizes 10, 15, and 11, then the total file size is 36, and the third block is at offset 25.
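The offset rule amounts to a running sum over the block sizes. The helper below is a hypothetical sketch, not part of bitbottle: it returns each block's offset and rejects a list whose sizes do not add up to the file size.

```rust
/// Given block sizes in list order, return the offset of each block,
/// checking that the blocks cover exactly `file_size` bytes.
fn block_offsets(sizes: &[u64], file_size: u64) -> Result<Vec<u64>, String> {
    let mut offsets = Vec::with_capacity(sizes.len());
    let mut next = 0u64;
    for &size in sizes {
        // Each block starts where the previous one ended.
        offsets.push(next);
        next = next.checked_add(size).ok_or("block sizes overflow u64")?;
    }
    if next != file_size {
        return Err(format!("blocks cover {} bytes, file size is {}", next, file_size));
    }
    Ok(offsets)
}
```

For the example above, `block_offsets(&[10, 15, 11], 36)` gives offsets 0, 10, and 25.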

If the block count is 0, there are no streams and no file content. If the block count is 1, there are no streams, and the implicit block list is a single block containing the whole file: its size is in int 0 and its hash is in bytes 0.

Data for each block is enclosed in a "Block" bottle inside the same "File list" bottle that holds these "File"s.

Filenames may contain "/" to create relative nested paths, as long as there's a "File" record for each intermediate folder. (For example, folder records for "src" and "src/cli" allow a file record for "src/cli/mod.rs".)

Block (3)

Header fields:

  • int 0: block size
  • bytes 0: hash of the block (using the algorithm from the file list)

Streams:

  • raw data for this block

Compressed (4)

Header fields:

  • int 0: algorithm
    • 0: framed snappy
    • 1: liblzma2

Streams:

  • data: which decompresses into a bottle

A compressed bottle is just a thinly-wrapped data stream containing another bottle.

Encrypted (5)

Header fields:

  • int 0: algorithm
    • 0: AES-128-GCM
    • 1: XCHACHA20-POLY1305
  • int 1: block size (bits) (20 = 1MB)
  • int 2: # of public key recipients (optional)
  • int 3: recipient public key algorithm (optional)
    • 0: ED25519-NACL-SEALED
  • bytes 0: argon2id parameters (optional):
    • u8: version (0)
    • u16: time cost
    • u8: memory cost (bits)
    • u8: parallelism
    • u8[key_length]: salt
  • bytes 1: key encrypted via (int 0 algorithm) with argon key (optional)

Streams:

  • (int 2) * data, header-encoded:
    • string 0: recipient name (free-form, go nuts)
    • bytes 0: public key
    • bytes 1: encrypted form of bottle encryption key
  • raw data which encodes a bottle, in encrypted blocks

An encrypted bottle is, at core, an encrypted data stream containing another bottle. The data stream is encrypted with a key generated by a cryptographically secure random number generator (CRNG). This key is then encrypted independently for each recipient. A recipient may be a password (only one) or an asymmetric key (many) or both.

If a password is used, Argon2 (in "Argon2id" mode) derives a temporary key from the password, and this temporary key is used to encrypt the real key. The Argon2 parameters and salt are stored in the bytes 0 header field, which must be present for password decryption. The encrypted key is stored in the bytes 1 header field.

  • AES-128-GCM: Argon2 is used to generate a 16-byte (128-bit) temporary key, used to encrypt the real key as a single AES-128 block.
  • XCHACHA20-POLY1305: Argon2 is used to generate a 56-byte temporary key. The first 32 bytes (256 bits) are the key and the remaining 24 bytes are the nonce, used to encrypt the real key using an XChaCha20 stream cipher.
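For illustration, the Argon2 parameter blob from the bytes 0 header field could be unpacked as follows. This is a sketch under two assumptions that the text does not state outright: the u16 time cost is little-endian like the other multi-byte fields, and the salt is whatever remains after the four fixed fields. The struct and function names are made up.

```rust
/// Argon2id parameters carried in the "bytes 0" header field
/// (hypothetical struct for this sketch).
#[derive(Debug, PartialEq)]
struct Argon2Params {
    time_cost: u16,
    memory_bits: u8, // memory cost, expressed as a number of bits
    parallelism: u8,
    salt: Vec<u8>,
}

/// Unpack: u8 version, u16 time cost, u8 memory cost, u8 parallelism, salt.
fn parse_argon2_params(raw: &[u8]) -> Result<Argon2Params, String> {
    if raw.len() < 5 {
        return Err("argon2 parameters too short".to_string());
    }
    if raw[0] != 0 {
        return Err(format!("unknown parameter version {}", raw[0]));
    }
    Ok(Argon2Params {
        // Assumption: little-endian, matching the bottle cap fields.
        time_cost: u16::from_le_bytes([raw[1], raw[2]]),
        memory_bits: raw[3],
        parallelism: raw[4],
        // Assumption: the salt runs to the end of the field.
        salt: raw[5..].to_vec(),
    })
}
```

The parsed values would then be handed to an Argon2id implementation to derive the 16-byte or 56-byte temporary key described above.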

If public keys are used, the key is encrypted independently for each public key, and attached as a data stream encoded in header format. The stream includes a text name (to help users figure out which secret key to use), the full public key that was used to encrypt the key, and the encrypted key.

  • ED25519-NACL-SEALED: An Ed25519 public key is used to generate a NaCl "sealed box" containing the real key.

Block encryption

The key is used to encrypt the data stream in blocks, based on the block size in the int 1 header. Each block is preceded by a nonce and tag, based on the encryption algorithm. The nonce is generated from a CRNG, and the tag is generated by the algorithm. A new nonce is generated for each block. The final block may be smaller than the block size.

  • AES-128-GCM: 12 byte nonce, 16 byte tag
  • XCHACHA20-POLY1305: 24 byte nonce, 16 byte tag

So, for example, an XCHACHA20-POLY1305 block with a 1MB block size (int 1 header = 20) would have 1048616 bytes in each block: 24 bytes of nonce, followed by a 16 byte tag, followed by 1048576 bytes of encrypted data.

+------------+------------+------------....----+
| nonce (24) | tag (16)   | data (1MB)         |
+------------+------------+------------....----+
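The block layout arithmetic can be sketched in a few lines. `Cipher`, `encrypted_block_len`, and `split_block` are illustrative names for this example, and no actual decryption is performed.

```rust
/// The two data stream ciphers listed for the encrypted bottle.
#[derive(Clone, Copy)]
enum Cipher {
    Aes128Gcm,
    XChaCha20Poly1305,
}

/// Both algorithms use a 16-byte tag.
const TAG_LEN: usize = 16;

impl Cipher {
    /// Nonce length in bytes for this algorithm.
    fn nonce_len(self) -> usize {
        match self {
            Cipher::Aes128Gcm => 12,
            Cipher::XChaCha20Poly1305 => 24,
        }
    }
}

/// Encoded size of one full block: nonce, tag, then 2^bits bytes of data.
fn encrypted_block_len(cipher: Cipher, block_size_bits: u32) -> usize {
    cipher.nonce_len() + TAG_LEN + (1usize << block_size_bits)
}

/// Split one encoded block into (nonce, tag, encrypted data).
/// Returns None if the block is too short to hold a nonce and tag.
fn split_block(cipher: Cipher, block: &[u8]) -> Option<(&[u8], &[u8], &[u8])> {
    let nonce_len = cipher.nonce_len();
    if block.len() < nonce_len + TAG_LEN {
        return None;
    }
    let (nonce, rest) = block.split_at(nonce_len);
    let (tag, data) = rest.split_at(TAG_LEN);
    Some((nonce, tag, data))
}
```

With a block size of 20 bits, `encrypted_block_len` gives 1048616 for XCHACHA20-POLY1305 and 1048604 for AES-128-GCM.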