DESIGN - dedup - deduplicating backup program

git clone git://bitreich.org/dedup/
git clone git://enlrupgkhuxnvlhsf6lc3fziv5h2hhfrinws65d7roiv6bfj7d652fid.onion/dedup/

Design notes
============

There are three main abstractions in the design of dedup:

- The chunker interface
- The snapshot layer
- The block layer

The block layer
---------------

Seen from the outside, the block layer is simply an abstraction for
dealing with variable-length blocks. Every block is referenced by
its hash.

The block layer is arranged as a stack of layers. From top to
bottom these are:

- The generic layer
- The compression layer
- The encryption layer
- The storage layer

The generic layer is the one that client code interfaces with. It is
the top-level entry point to the block layer. The generic layer
calculates the hash of the block and passes the block down to the
compression layer.

The compression layer prepends a compression descriptor to the block
and compresses the block using snappy or lz4. Compression can be
disabled, in which case a special descriptor is prepended and the
data is passed uncompressed to the encryption layer.

The encryption layer prepends an encryption descriptor to the block
and encrypts and authenticates the block using XChaCha20 and
Poly1305. Encryption can be disabled, in which case the layer acts
as a bypass and uses a special type of encryption descriptor. The
block is then passed to the storage layer.

The storage layer prepends a storage descriptor and appends the
descriptor and the data to a single backing file.

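The descriptor stacking described above can be sketched as follows.
This is only an illustration with compression and encryption both
disabled: the 2-byte descriptor layout, the tag and type values, the
small MAX_BLK limit and the in-memory "backing file" are assumptions
made for the example, not dedup's actual on-disk format. Hashing in
the generic layer is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All values below are hypothetical, chosen only for this sketch. */
enum { DESC_LEN = 2, MAX_BLK = 256 };
enum { TAG_COMP = 'C', TAG_ENC = 'E', TAG_STORE = 'S' };
enum { TYPE_NONE = 0 };

/* Write a {tag, type} descriptor followed by in[0..n) into out. */
static size_t
prepend_desc(uint8_t tag, uint8_t type, const uint8_t *in, size_t n,
             uint8_t *out)
{
	out[0] = tag;
	out[1] = type;
	memcpy(out + DESC_LEN, in, n);
	return n + DESC_LEN;
}

/*
 * Push one block through the stack and append the result to store.
 * Returns the new number of bytes used in store.
 */
size_t
block_put(uint8_t *store, size_t cap, size_t used,
          const uint8_t *data, size_t n)
{
	uint8_t a[MAX_BLK + 3 * DESC_LEN];
	uint8_t b[MAX_BLK + 3 * DESC_LEN];

	assert(n <= MAX_BLK);
	n = prepend_desc(TAG_COMP, TYPE_NONE, data, n, a); /* compression off */
	n = prepend_desc(TAG_ENC, TYPE_NONE, a, n, b);     /* encryption bypass */
	n = prepend_desc(TAG_STORE, TYPE_NONE, b, n, a);   /* storage layer */

	assert(used + n <= cap);
	memcpy(store + used, a, n);                        /* append only */
	return used + n;
}
```

In this sketch a stored record reads, from the outside in, as storage
descriptor, encryption descriptor, compression descriptor, data.
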
The snapshot layer
------------------

The snapshot abstraction is currently very simple. A snapshot is a
file under $repo/archive/<name>. The file contains the hashes of the
blocks that make up the data stored in the snapshot.

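A snapshot file of this kind could be handled roughly as below. The
sketch assumes fixed-size raw hashes of 32 bytes stored back to back
with no header; the hash length, the function names snap_append and
snap_read, and the absence of any framing are assumptions, not
something this document specifies.

```c
#include <assert.h>
#include <stdio.h>

#define HASH_LEN 32 /* assumed hash size in bytes */

/* Append one block hash to the end of the snapshot file. */
int
snap_append(FILE *fp, const unsigned char hash[HASH_LEN])
{
	assert(fp != NULL);
	if (fseek(fp, 0, SEEK_END) != 0)
		return -1;
	return fwrite(hash, 1, HASH_LEN, fp) == HASH_LEN ? 0 : -1;
}

/*
 * Read the idx-th block hash of the snapshot.
 * Returns 0 on success, -1 on error or when idx is past the end.
 */
int
snap_read(FILE *fp, long idx, unsigned char hash[HASH_LEN])
{
	assert(fp != NULL && idx >= 0);
	if (fseek(fp, idx * HASH_LEN, SEEK_SET) != 0)
		return -1;
	return fread(hash, 1, HASH_LEN, fp) == HASH_LEN ? 0 : -1;
}
```

Restoring a snapshot would then amount to reading the hashes in order
and asking the block layer for each block in turn.
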
The chunker interface
---------------------

The chunker emits variable-length blocks. The minimum block size is
512KB, the maximum block size is 8MB and the average block size is
2MB. These parameters can be changed by editing config.h, but tuning
them properly can be tricky.

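For illustration, the three sizes could be expressed as compile-time
constants along these lines. The macro names are hypothetical and may
not match the ones in config.h; the static assertions are an addition
of this sketch that catches an inconsistent combination at build time.

```c
#include <assert.h>

/* Hypothetical macro names; the real config.h may use different ones. */
#define BLKSIZE_MIN (512UL * 1024)      /* 512KB */
#define BLKSIZE_AVG (2UL * 1024 * 1024) /* 2MB */
#define BLKSIZE_MAX (8UL * 1024 * 1024) /* 8MB */

/* Reject inconsistent tuning at compile time (C11 static_assert). */
static_assert(BLKSIZE_MIN <= BLKSIZE_AVG, "minimum exceeds average");
static_assert(BLKSIZE_AVG <= BLKSIZE_MAX, "average exceeds maximum");
```
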
The buzhash[0] rolling hash algorithm is used to fingerprint the input
stream.

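A minimal sketch of buzhash-based chunking is shown below. It is not
dedup's chunker: the 48-byte window, the table generated with
xorshift32, the mask-based boundary test and all function names are
assumptions chosen to keep the example short. Real implementations
normally use a fixed table of random constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINSIZE 48 /* hypothetical rolling window size in bytes */

static uint32_t
rotl32(uint32_t x, unsigned n)
{
	n &= 31;
	return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Fill the table with pseudo-random words (xorshift32, for the sketch). */
void
buz_table_init(uint32_t table[256], uint32_t seed)
{
	uint32_t x = seed ? seed : 1; /* xorshift32 must not start at 0 */

	for (int i = 0; i < 256; i++) {
		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		table[i] = x;
	}
}

/* Hash of the WINSIZE bytes starting at win, computed from scratch. */
uint32_t
buz_init(const uint32_t table[256], const uint8_t *win)
{
	uint32_t h = 0;

	for (size_t i = 0; i < WINSIZE; i++)
		h = rotl32(h, 1) ^ table[win[i]];
	return h;
}

/* Slide the window by one byte: drop out, take in. */
uint32_t
buz_roll(const uint32_t table[256], uint32_t h, uint8_t out, uint8_t in)
{
	return rotl32(h, 1) ^ rotl32(table[out], WINSIZE) ^ table[in];
}

/*
 * Length of the next chunk in buf[0..len): the first position at or
 * after minsz where the low bits of the hash selected by mask are
 * all zero, capped at maxsz and at len.
 */
size_t
chunk_len(const uint8_t *buf, size_t len, const uint32_t table[256],
          size_t minsz, size_t maxsz, uint32_t mask)
{
	size_t limit = len < maxsz ? len : maxsz;
	uint32_t h;

	assert(minsz >= WINSIZE && minsz <= maxsz);
	if (limit <= minsz)
		return limit;
	h = buz_init(table, buf + minsz - WINSIZE);
	for (size_t i = minsz; i < limit; i++) {
		if ((h & mask) == 0)
			return i;
		h = buz_roll(table, h, buf[i - WINSIZE], buf[i]);
	}
	return limit;
}
```

Because the hash depends only on the last WINSIZE bytes, a boundary
is determined by local content rather than by the offset in the
stream, which is what lets identical data produce identical blocks
after an insertion earlier in the input.
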
When encryption is enabled, a random seed is generated and stored
encrypted in the repository state file. The seed is XOR-ed with the
buzhash initial state table to mitigate length fingerprinting
attacks.

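The seeding step could look roughly like this. The sketch assumes a
32-bit seed applied to every entry of a 256-entry table of 32-bit
words; the seed width, the table shape and the name buz_seed_table
are assumptions. Generating the seed and storing it encrypted are
not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Derive a per-repository table by XOR-ing the seed into each entry. */
void
buz_seed_table(uint32_t dst[256], const uint32_t base[256], uint32_t seed)
{
	assert(dst != NULL && base != NULL);
	for (size_t i = 0; i < 256; i++)
		dst[i] = base[i] ^ seed;
}
```

With a secret seed the chunk boundaries, and therefore the block
lengths visible in the repository, are harder for an attacker to
predict from known plaintext.
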
[0] http://www.serve.net/buz/Notes.1st.year/HTML/C6/rand.012.html