a modern archive file format (in rust)

# Format of a bitbottle
A bitbottle archive is a "bottle", with other bottles inside, nested like an onion.
Each bottle contains:
- a fixed "bottle cap" describing the bottle type and header size
- a variable-sized header of key/value pairs, describing metadata specific to this bottle type
- one or more streams, which may each be raw data or a nested bottle
- an "end of streams" marker
For example:

    +--------------------------------------+
    | Bottle cap: "compressed"             |
    +--------------------------------------+
    | Header:                              |
    |   - algorithm: snappy                |
    +--------------------------------------+
    | Data stream: (compressed)            |
    |   +---------------------------+      |
    |   | Bottle cap: "file list"   |      |
    |   +---------------------------+      |
    |   | Header:                   |      |
    |   |   - ...                   |      |
    |   +---------------------------+      |
    |   | ...                       |      |
    |   |                           |      |
    |   +---------------------------+      |
    |   | End                       |      |
    |   +---------------------------+      |
    +--------------------------------------+
    | End                                  |
    +--------------------------------------+

## Bottle cap
Every bottle starts with a 12-byte bottle cap, containing a magic number, a version, a bottle type, the length of the header, and a CRC of the header and bottle cap data to detect corruption. All multi-byte fields are little-endian (LE).
Specifically, the CRC field is the CRC-32C of `(version || type || header length || header)`.

    +-Magic:---+----------+----------+----------+
    |   0xF0   |   0x9F   |   0x8D   |   0xBC   |
    +----------+----------+----------+----------+
    | Version  |   Type   | Header length (LE)  |
    +----------+----------+---------------------+
    |               CRC-32C (LE)                |
    +-------------------------------------------+

The magic (`F0 9F 8D BC`) is present in every bottle. Its purpose is to confirm, as a double-check, that you are actually reading a bitbottle.
Version is always `00`. If the version is not `00`, you can't read the archive correctly. Future versions may ascribe more specific meaning to the bits of this field.
Only a few bottle types are defined:
- `1` - File list
- `2` - File
- `3` - File block
- `4` - Compressed
- `5` - Encrypted
- `6` - Signed
Because the header length is 16 bits, the header can be at most 65,535 bytes (just under 64KB). It will usually be much smaller.
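As an illustration of the layout above, here is a minimal sketch of reading and checking a bottle cap. The names `parse_cap`, `BottleCap`, and `crc32c` are hypothetical and not taken from the bitbottle source. The sketch assumes the CRC input uses the version, type, and header length bytes exactly as they appear on the wire, and it computes CRC-32C bit by bit to avoid depending on an external crate.

```rust
/// Magic bytes that open every bottle.
const MAGIC: [u8; 4] = [0xF0, 0x9F, 0x8D, 0xBC];

/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
/// Slow but dependency-free; a table or CPU instruction would be faster.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Hypothetical in-memory form of the fields after the magic.
#[derive(Debug, PartialEq)]
struct BottleCap {
    version: u8,
    bottle_type: u8,
    header_len: u16,
}

/// Parses the 12-byte cap at the start of `buf` and checks the CRC against
/// the header bytes that follow it. Returns the cap and the header slice.
fn parse_cap(buf: &[u8]) -> Result<(BottleCap, &[u8]), String> {
    if buf.len() < 12 {
        return Err("truncated bottle cap".to_string());
    }
    if buf[..4] != MAGIC[..] {
        return Err("bad magic".to_string());
    }
    let version = buf[4];
    if version != 0 {
        return Err(format!("unsupported version {}", version));
    }
    let bottle_type = buf[5];
    let header_len = u16::from_le_bytes([buf[6], buf[7]]);
    let stored_crc = u32::from_le_bytes([buf[8], buf[9], buf[10], buf[11]]);
    let header_end = 12 + header_len as usize;
    if buf.len() < header_end {
        return Err("truncated header".to_string());
    }
    let header = &buf[12..header_end];
    // CRC input: version || type || header length || header (assumed wire order).
    let mut crc_input = buf[4..8].to_vec();
    crc_input.extend_from_slice(header);
    if crc32c(&crc_input) != stored_crc {
        return Err("CRC mismatch".to_string());
    }
    Ok((BottleCap { version, bottle_type, header_len }, header))
}
```

The returned header slice can then be handed to whatever decodes the key/value fields for that bottle type.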
## Header
Each header is specific to its bottle type, but they share the same encoding format. The header format is also used for some data streams, where the data is meant to encode information that isn't just raw data or a nested bottle.
A header is a variable number of fields. Each field is a one-byte descriptor, optionally followed by bytes of data. The field type (0 - 15) is the high nybble of the descriptor (`descriptor >> 4`), and the field ID (0 - 15) is the low nybble (`descriptor & 0xf`).

      7   6   5   4   3   2   1   0
    +---------------+---------------+
    |  Field type   |   Field ID    |
    +---------------+---------------+

Field IDs for types 8 - 15 are _per field type_, so descriptor `0x93` (type 9, ID 3) is a different field from `0xA3` (type 10, ID 3). In other words, the descriptor byte must be unique per header, but the ID itself must only be unique for its field type.
Field IDs for types 0 - 7 (integers) are unique among themselves, so descriptor `0x1A` (type 1, ID 10) and `0x2A` (type 2, ID 10) are different encodings for the same field, and only one of them may be in a header. The encoding must use the smallest type that will fit the full value.
Field lengths are implicit per type. For types 9 and 10, the length is variable (0 - 255) and specified by a byte following the descriptor.
The types are:
- `0` - u8 (length = 1)
- `1` - u16 (length = 2)
- `2` - u32 (length = 4)
- `3` - u64 (length = 8)
- `8` - bool (length = 0) -- the presence of this field is "true", the absence is "false"
- `9` - UTF-8 string (length byte follows)
- `A` - binary data (length byte follows)
Types 4 - 7 are reserved for negative integers, if they're needed in the future. No current bottle uses them.
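The "smallest type that will fit" rule for integer fields can be sketched as below. `encode_int_field` is a hypothetical helper name, and the sketch assumes integer payloads are written little-endian like the other multi-byte fields.

```rust
/// Appends integer field `id` (0-15) with `value` to `out`, choosing the
/// smallest of u8/u16/u32/u64 (field types 0-3) that can hold the value.
fn encode_int_field(out: &mut Vec<u8>, id: u8, value: u64) {
    assert!(id <= 0x0f, "field ID must fit in the low nybble");
    if value <= u8::MAX as u64 {
        out.push(id); // type 0: the high nybble is zero
        out.push(value as u8);
    } else if value <= u16::MAX as u64 {
        out.push(0x10 | id); // type 1: u16
        out.extend_from_slice(&(value as u16).to_le_bytes());
    } else if value <= u32::MAX as u64 {
        out.push(0x20 | id); // type 2: u32
        out.extend_from_slice(&(value as u32).to_le_bytes());
    } else {
        out.push(0x30 | id); // type 3: u64
        out.extend_from_slice(&value.to_le_bytes());
    }
}
```

With this helper, ID 10 holding the value 300 is emitted as descriptor `0x1A` followed by `2c 01`, never as the wider `0x2A` form.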
For example, this header's fields are decoded as:

    00 64 23 40 42 0f 00 80 90 03 63 61 74
    ^     ^              ^  ^
    |     |              |  `-- UTF-8 string ID 0, length = 3, "cat"
    |     |              `-- bool ID 0, true
    |     `-- int ID 3, encoded as u32, 1_000_000 (0xf4240)
    `-- int ID 0, encoded as u8, 100 (0x64)

Each bottle defines the exhaustive list of fields that can be in its header.
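To make the encoding concrete, the following is a rough sketch of a header decoder. The `Field` enum and the `decode_header` and `take` functions are hypothetical names chosen for this example. The sketch treats the reserved types 4 - 7 and the undefined types 11 - 15 as errors, and it does not check the uniqueness or smallest-encoding rules.

```rust
/// One decoded header field; the first member is always the field ID.
#[derive(Debug, PartialEq)]
enum Field {
    Int(u8, u64),
    Bool(u8),
    Str(u8, String),
    Bytes(u8, Vec<u8>),
}

/// Splits `n` bytes off the front of `buf`, failing if too few remain.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Result<&'a [u8], String> {
    if buf.len() < n {
        return Err("truncated header".to_string());
    }
    let (head, tail) = buf.split_at(n);
    *buf = tail;
    Ok(head)
}

/// Decodes every field in `buf`, in the order they appear.
fn decode_header(mut buf: &[u8]) -> Result<Vec<Field>, String> {
    let mut fields = Vec::new();
    while !buf.is_empty() {
        let descriptor = take(&mut buf, 1)?[0];
        let (field_type, id) = (descriptor >> 4, descriptor & 0x0f);
        let field = match field_type {
            // Types 0-3: little-endian unsigned integers of 1, 2, 4, or 8 bytes.
            0..=3 => {
                let raw = take(&mut buf, 1 << field_type)?;
                let mut wide = [0u8; 8];
                wide[..raw.len()].copy_from_slice(raw);
                Field::Int(id, u64::from_le_bytes(wide))
            }
            // Type 8: presence alone means "true", so there is no payload.
            8 => Field::Bool(id),
            // Types 9 and 10: one length byte, then that many bytes of data.
            9 | 10 => {
                let len = take(&mut buf, 1)?[0] as usize;
                let data = take(&mut buf, len)?.to_vec();
                if field_type == 9 {
                    Field::Str(id, String::from_utf8(data).map_err(|e| e.to_string())?)
                } else {
                    Field::Bytes(id, data)
                }
            }
            other => return Err(format!("unknown field type {}", other)),
        };
        fields.push(field);
    }
    Ok(fields)
}
```

Feeding it the 13 example bytes above yields an int with ID 0 (100), an int with ID 3 (1,000,000), a bool with ID 0, and the string "cat" with ID 0.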
## Streams
Each stream is preceded by a byte to indicate which type of stream will follow.
- `0x40` - raw data stream
- `0x80` - nested bottle
- `0xc0` - end of streams, and end of bottle
A nested bottle is included as-is, with no framing, because it has a deterministic ending.
A raw data stream uses a frame encoding called "expony": Each frame is preceded by a length byte.

      7   6   5   4   3   2   1   0
    +-------+-----------------------+
    |   Y   |           X           |
    +-------+-----------------------+

The length of the frame is computed as `X * 2**(6 * Y)` or `X << (6 * Y)`. In other words, Y encodes the scale (1, 64, 4KB, or 256KB), and X encodes the multiplier for that scale.
For example:
- `0xc1` = `0b11000001` = `1 << 18` = 256KB
- `0xe0` = `0b11100000` = `32 << 18` = 8MB
- `0x3f` = `0b00111111` = `63 << 0` = 63
Every frame must have `Y > 0` until the last frame, which may be any size from 0 to 63. Frames with `X = 0` and `Y > 1` are illegal.
Normally, a data stream will be buffered so that each frame is the same round size (like 1MB) until the last few, which wrap up the straggling bytes.
The largest frame size is just under 16MB (Y = 3, X = 63).
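The frame length rule can be sketched in a few lines. `frame_len` and `read_stream` are hypothetical names. The reader assumes that a frame with `Y = 0` is the final frame of the stream, which is how this sketch interprets the rule that every frame before the last must have `Y > 0`. It does not reject the illegal `X = 0` combinations.

```rust
/// Decodes an "expony" length byte: Y is the top two bits, X the low six.
fn frame_len(byte: u8) -> usize {
    let y = (byte >> 6) as u32;
    let x = (byte & 0x3f) as usize;
    x << (6 * y)
}

/// Reads one framed raw data stream from the front of `buf`.
/// Returns the concatenated payload and the number of bytes consumed.
fn read_stream(buf: &[u8]) -> Result<(Vec<u8>, usize), String> {
    let mut data = Vec::new();
    let mut pos = 0;
    loop {
        let header = *buf.get(pos).ok_or("missing frame header")?;
        pos += 1;
        let len = frame_len(header);
        let frame = buf.get(pos..pos + len).ok_or("truncated frame")?;
        data.extend_from_slice(frame);
        pos += len;
        // Assumption: a frame with Y = 0 (length 0-63) ends the stream.
        if header >> 6 == 0 {
            return Ok((data, pos));
        }
    }
}
```

For instance, `frame_len(0xc1)` gives 262,144 (256KB), and a stream made of a 64-byte frame (`0x41`) followed by a 2-byte final frame (`0x02`) consumes 68 bytes in total.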
# Bottle types
## File list (1)
Header fields:
- `int 0`: total file count
- `int 1`: total block count
- `int 2`: hash function used for files & blocks
- `0`: SHA-256
- `1`: Blake2
- `2`: Blake3
Streams:
- (`int 0`) * bottle 2 (File)
- (`int 1`) * bottle 3 (Block)
A file list contains a list of files and folders and the contents of those files. Each enclosed "File" bottle contains a file or folder's metadata, the hash of the entire contents (for files) and a list of the blocks that make up the file. Each enclosed "Block" bottle is a block of raw data, identified by its hash.
## File (2)
Header fields:
- `string 0`: filename (full relative path)
- `bool 0`: is this a folder?
- `int 0`: file size _(F)_
- `int 1`: posix mode (only the lower 9 bits)
- `int 2`: creation time (ctime) in nanoseconds since epoch
- `int 3`: modification time (mtime) in nanoseconds since epoch
- `int 4`: block count _(F)_
- `string 1`: mime type (currently unused)
- `string 2`: posix owner username
- `string 3`: posix group name
- `bytes 0`: hash of the entire file (using the algorithm from the file list) _(F)_
- `string 4`: target of the symlink _(symlinks only)_
Fields marked with _(F)_ are only present for files, not folders.
Streams:
- (`int 4`) * data, header-encoded:
- `int 0`: block size
- `bytes 0`: hash of the block (using the algorithm from the file list)
The streams are a list of the blocks that make up the file, in order, with no gaps. The sum of the size of each block in this list must be the file size in `int 0`. Each block exists at an offset computed by summing the sizes of each previous block. The first block is always at offset 0. For example, if there are 3 blocks, with sizes 10, 15, and 11, then the total file size is 36, and the third block is at offset 25.
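The offset rule is a running sum over the block sizes, as in this small sketch (`block_offsets` is a hypothetical helper, not part of the format):

```rust
/// Returns the starting offset of each block, plus the total file size.
fn block_offsets(sizes: &[u64]) -> (Vec<u64>, u64) {
    let mut offsets = Vec::with_capacity(sizes.len());
    let mut total = 0u64;
    for &size in sizes {
        // Each block starts where the previous blocks end.
        offsets.push(total);
        total += size;
    }
    (offsets, total)
}
```

For block sizes 10, 15, and 11 this gives offsets 0, 10, and 25, and a total of 36, which must match the file size in `int 0`.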
If the block count is 0, there are no streams and no file content. If the block count is 1, there are no streams, and the implicit block list is a single block containing the whole file: its size is in `int 0` and its hash is in `bytes 0`.
Data for each block is enclosed in a "Block" bottle inside the same "File list" bottle that holds these "File"s.
Filenames may contain "/" to create relative nested paths, as long as there's a "File" record for each intermediate folder. (For example, folder records for "src" and "src/cli" allow a file record for "src/cli/mod.rs".)
## Block (3)
Header fields:
- `int 0`: block size
- `bytes 0`: hash of the block (using the algorithm from the file list)
Streams:
- raw data for this block
## Compressed (4)
Header fields:
- `int 0`: algorithm
- `0`: framed snappy
- `1`: liblzma2
Streams:
- data: which decompresses into a bottle
A compressed bottle is just a thinly-wrapped data stream containing another bottle.
## Encrypted (5)
Header fields:
- `int 0`: algorithm
- `0`: `AES-128-GCM`
- `1`: `XCHACHA20-POLY1305`
- `int 1`: block size, as a power of 2 (20 = 2^20 bytes = 1MB)
- `int 2`: # of public key recipients (optional)
- `int 3`: recipient public key algorithm (optional)
- `0`: `ED25519-NACL-SEALED`
- `bytes 0`: argon2id parameters (optional):
- `u8`: version (0)
- `u16`: time cost
- `u8`: memory cost (bits)
- `u8`: parallelism
- `u8[key_length]`: salt
- `bytes 1`: key encrypted via (int 0 algorithm) with argon key (optional)
Streams:
- (`int 2`) * data, header-encoded:
- `string 0`: recipient name (free-form, go nuts)
- `bytes 0`: public key
- `bytes 1`: encrypted form of bottle encryption key
- raw data which encodes a bottle, in encrypted blocks
An encrypted bottle is, at core, an encrypted data stream containing another bottle. The data stream is encrypted with a key generated by a [CRNG](https://en.wikipedia.org/wiki/Cryptographically-secure_pseudorandom_number_generator). This key is then encrypted independently for each recipient. A recipient may be a password (only one) or an asymmetric key (many) or both.
If a password is used, [Argon2](https://en.wikipedia.org/wiki/Argon2) in "Argon2id" mode derives a temporary key from the password, and this temporary key is used to encrypt the real key. The Argon2 parameters and salt are stored in the `bytes 0` header field, which must be present for password decryption. The encrypted key is stored in the `bytes 1` header field.
- `AES-128-GCM`: Argon2 is used to generate a 16-byte (128-bit) temporary key, used to encrypt the real key as a single AES-128 block.
- `XCHACHA20-POLY1305`: Argon2 is used to generate a 56-byte temporary key. The first 32 bytes (256 bits) are the key and the remaining 24 bytes are the nonce, used to encrypt the real key using an XChaCha20 stream cipher.
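A rough sketch of unpacking the password-related data follows. All names here (`ArgonParams`, `parse_argon_params`, `split_xchacha_material`) are hypothetical. Two details are assumptions: the `u16` time cost is read as little-endian, and the salt is taken to be whatever remains of the `bytes 0` field after the fixed-size parameters. The Argon2 computation itself is left out.

```rust
/// Hypothetical parsed form of the `bytes 0` Argon2id parameters.
#[derive(Debug, PartialEq)]
struct ArgonParams {
    time_cost: u16,
    memory_cost_bits: u8,
    parallelism: u8,
    salt: Vec<u8>,
}

/// Parses the parameter block: version, time cost, memory cost, parallelism, salt.
fn parse_argon_params(field: &[u8]) -> Result<ArgonParams, String> {
    if field.len() < 5 {
        return Err("argon2id parameters too short".to_string());
    }
    if field[0] != 0 {
        return Err(format!("unsupported parameter version {}", field[0]));
    }
    Ok(ArgonParams {
        // Assumption: little-endian, like the other multi-byte fields.
        time_cost: u16::from_le_bytes([field[1], field[2]]),
        memory_cost_bits: field[3],
        parallelism: field[4],
        // Assumption: the salt is the remainder of the field.
        salt: field[5..].to_vec(),
    })
}

/// Splits 56 bytes of Argon2 output into the XChaCha20 key (first 32 bytes)
/// and nonce (remaining 24 bytes) used to encrypt the real key.
fn split_xchacha_material(output: &[u8; 56]) -> ([u8; 32], [u8; 24]) {
    let mut key = [0u8; 32];
    let mut nonce = [0u8; 24];
    key.copy_from_slice(&output[..32]);
    nonce.copy_from_slice(&output[32..]);
    (key, nonce)
}
```

For `AES-128-GCM` no split is needed, since the 16 bytes of Argon2 output are used directly as the temporary key.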
If public keys are used, the key is encrypted independently for each public key, and attached as a data stream encoded in header format. The stream includes a text name (to help users figure out which secret key to use), the full public key that was used to encrypt the key, and the encrypted key.
- `ED25519-NACL-SEALED`: An Ed25519 public key is used to generate a [NaCl](https://en.wikipedia.org/wiki/NaCl_(software)) "sealed box" containing the real key.
### Block encryption
The key is used to encrypt the data stream in blocks, sized according to the `int 1` header field. Each block is preceded by a nonce and a tag, whose lengths depend on the encryption algorithm. A new nonce is generated from a CRNG for each block, and the tag is produced by the algorithm. The final block may be smaller than the block size.
- `AES-128-GCM`: 12 byte nonce, 16 byte tag
- `XCHACHA20-POLY1305`: 24 byte nonce, 16 byte tag
So, for example, an `XCHACHA20-POLY1305` block with a 1MB block size (`int 1` header = 20) would have 1048616 bytes in each block: 24 bytes of nonce, followed by a 16 byte tag, followed by 1048576 bytes of encrypted data.

    +------------+------------+------------....----+
    | nonce (24) |  tag (16)  |     data (1MB)     |
    +------------+------------+------------....----+

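The arithmetic behind these sizes can be checked with a short sketch. The `Algorithm` enum and the function names are hypothetical. The sketch relies on the ciphertext being the same length as the plaintext, which holds for both listed algorithms.

```rust
/// The two encryption algorithms defined for `int 0`.
#[derive(Clone, Copy)]
enum Algorithm {
    Aes128Gcm,
    XChaCha20Poly1305,
}

/// Both algorithms produce a 16-byte authentication tag.
const TAG_LEN: u64 = 16;

impl Algorithm {
    /// Nonce length in bytes for this algorithm.
    fn nonce_len(self) -> u64 {
        match self {
            Algorithm::Aes128Gcm => 12,
            Algorithm::XChaCha20Poly1305 => 24,
        }
    }
}

/// Plaintext bytes in one full block, given the `int 1` header value.
/// Panics in debug builds if `block_bits` is 64 or more.
fn full_block_data_len(block_bits: u8) -> u64 {
    1u64 << block_bits
}

/// Bytes one block occupies in the stream: nonce, then tag, then data.
fn block_wire_len(alg: Algorithm, data_len: u64) -> u64 {
    alg.nonce_len() + TAG_LEN + data_len
}
```

With `int 1` set to 20, a full `XCHACHA20-POLY1305` block takes 1,048,616 bytes and a full `AES-128-GCM` block takes 1,048,604 bytes. A shorter final block is the same calculation with a smaller `data_len`.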