# BZip3

[![Build](https://github.com/iczelia/bzip3/actions/workflows/build.yml/badge.svg)](https://github.com/iczelia/bzip3/actions/workflows/build.yml)

A better, faster and stronger spiritual successor to BZip2. It features higher compression ratios and better performance thanks to an order-0 context mixing entropy coder, a fast Burrows-Wheeler transform implementation built on suffix arrays, and an RLE with Lempel-Ziv+Prediction (LZP) pass based on LZ77-style string matching and PPM-style context modeling.

Like its ancestor, **BZip3 excels at compressing text or code**.

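To give a rough idea of what the Burrows-Wheeler transform stage does, here is a minimal Python sketch. It is an illustration only, not BZip3's implementation: BZip3 relies on `libsais` for fast suffix array construction, whereas this sketch naively sorts all rotations. The function names `bwt_encode` and `bwt_decode` are hypothetical and chosen for this example.

```python
def bwt_encode(data: bytes) -> tuple[bytes, int]:
    """Naive BWT: sort all rotations and keep the last column.

    Returns the transformed bytes and the row index of the original
    string in the sorted rotation table (needed for decoding).
    """
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data  # rotation i is doubled[i:i + n]
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last_column = bytes(doubled[i + n - 1] for i in order)
    return last_column, order.index(0)


def bwt_decode(last_column: bytes, primary: int) -> bytes:
    """Invert the transform by following the last-to-first mapping."""
    n = len(last_column)
    if n == 0:
        return b""
    # A stable sort of positions by byte value reconstructs the first
    # column and links each of its entries to the last column.
    order = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    row = primary
    for _ in range(n):
        row = order[row]
        out.append(last_column[row])
    return bytes(out)


transformed, primary = bwt_encode(b"banana")
print(transformed, primary)              # b'nnbaaa' 3
print(bwt_decode(transformed, primary))  # b'banana'
```

The transform groups identical bytes that share a following context, which is what makes the later run-length and entropy coding stages effective on text and code.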
## Installation

```console
# If using a git clone (not needed for source packages), first...
$ ./bootstrap.sh

# All...
$ ./configure
$ make
$ sudo make install
```

Alternatively, you might be able to install bzip3 using your system's package manager:

[![Packaging status](https://repology.org/badge/vertical-allrepos/bzip3.svg?columns=3)](https://repology.org/project/bzip3/versions)

On macOS, you can use [Homebrew](https://brew.sh) to easily install:

```console
$ brew install bzip3
```

## Perl source code benchmark

First, I downloaded every version of Perl5 ever released and decompressed them.

```bash
% wget -r -l1 -nH --cut-dirs=2 --no-parent -A.tar.gz --no-directories https://www.cpan.org/src/5.0/
% for g in *.gz; do gunzip "$g"; done
% ls -la | wc -l
262
```

Then, I packed all the resulting `.tar` files into a single `.tar` file and compressed it with various compressors:

```
xz -T16 -9 -k all.tar  10829.91s user 26.91s system 1488% cpu 14658M memory 12:09.24 total
bzip2 -9 -k all.tar  981.78s user 9.77s system 95% cpu 8M memory 17:16.64 total
bzip3 -e -b 256 -j 12 all.tar  2713.81s user 16.28s system 634% cpu 18301M memory 7:10.10 total
bzip3 -e -b 511 -j 4 all.tar  17.65s user 12.19s system 170% cpu 12178M memory 7:08.65 total
zstd -T12 -16 all.tar  4162.94s user 16.40s system 1056% cpu 687M memory 6:35.62 total
```

The results follow:

| Method           | Compressed size (bytes) |
| ---------------- | -----------------------:|
| LZMA (xz)        |           2'056'645'240 |
| bzip2            |           3'441'163'911 |
| bzip3 -b 256     |           1'001'957'587 |
| bzip3 -b 511     |             546'456'978 |
| Zstandard        |           3'076'143'660 |

Wall-clock decompression times (WD Blue HDD):

| Method           | Decompression time |
| ---------------- | ------------------:|
| LZMA (xz)        |           4min 40s |
| bzip2            |           9min 22s |
| bzip3 (parallel) |           4min 06s |
| Zstandard        |           3min 51s |

Then, I used `lrzip` to perform long-range deduplication on the original `.tar` file:

```
% time lrzip -n -o all_none.tar.lrz all.tar
546.17s user 160.87s system 102% cpu 10970M memory 11:28.00 total

% time lrzip --lzma -o all_lzma.tar.lrz all.tar
702.16s user 161.87s system 122% cpu 10792M memory 11:44.83 total

% time lrzip -b -o all_bzip2.tar.lrz all.tar
563.93s user 147.38s system 112% cpu 10970M memory 10:34.10 total
```

Finally, I compressed the resulting `all_none.tar.lrz` file using bzip3:

```
% time bzip3 -e -b 256 -j 2 all_none.tar.lrz
32.05s user 0.76s system 146% cpu 2751M memory 22.411 total
```

The results follow:

| Method           | Compressed size (bytes) |
| ---------------- | -----------------------:|
| lrzip + bzip3    |              60'672'608 |
| lrzip + lzma     |              64'774'202 |
| lrzip + bzip2    |              75'685'065 |

For further benchmarks against Turbo-Range-Coder and BSC, check [powturbo's benchmark](https://github.com/powturbo/Turbo-Range-Coder) of bzip3, bzip2, bsc and others.

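The compressed sizes reported in this section are easier to compare as relative figures. The following Python sketch expresses each result as a fraction of a chosen baseline. The sizes are copied from the tables in this section; the helper name `relative_sizes` and the choice of bzip2 as the baseline are assumptions made for illustration.

```python
# Compressed sizes in bytes, copied from the benchmark tables.
DIRECT_SIZES = {
    "LZMA (xz)": 2_056_645_240,
    "bzip2": 3_441_163_911,
    "bzip3 -b 256": 1_001_957_587,
    "bzip3 -b 511": 546_456_978,
    "Zstandard": 3_076_143_660,
}
LRZIP_SIZES = {
    "lrzip + bzip3": 60_672_608,
    "lrzip + lzma": 64_774_202,
    "lrzip + bzip2": 75_685_065,
}


def relative_sizes(sizes: dict[str, int], baseline: str) -> dict[str, float]:
    """Return each size as a fraction of the baseline method's size."""
    base = sizes[baseline]
    return {name: size / base for name, size in sizes.items()}


for table, baseline in ((DIRECT_SIZES, "bzip2"), (LRZIP_SIZES, "lrzip + bzip2")):
    ranked = sorted(relative_sizes(table, baseline).items(), key=lambda kv: kv[1])
    for name, fraction in ranked:
        print(f"{name:<14} {fraction:6.1%} of {baseline}")
    print()
```

With these figures, the `bzip3 -b 511` output comes to roughly 16% of the bzip2 output for the same input.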
## Disclaimers

**I TAKE NO RESPONSIBILITY FOR ANY LOSS OF DATA ARISING FROM THE USE OF THIS PROGRAM/LIBRARY, HOWSOEVER CAUSED.**

Every compression of a file implies an assumption that the compressed file can be decompressed to reproduce the original. Great efforts in design, coding and testing have been made to ensure that this program works correctly.

However, the complexity of the algorithms, and, in particular, the presence of various special cases in the code which occur with very low but non-zero probability make it impossible to rule out the possibility of bugs remaining in the program.

DO NOT COMPRESS ANY DATA WITH THIS PROGRAM UNLESS YOU ARE PREPARED TO ACCEPT THE POSSIBILITY, HOWEVER SMALL, THAT THE DATA WILL NOT BE RECOVERABLE.

That is not to say this program is inherently unreliable. Indeed, I very much hope the opposite is true. Bzip3/libbz3 has been carefully constructed and extensively tested.

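One way to act on the assumption that compressed data can be decompressed to reproduce the original is to check a round trip before discarding the source. The Python sketch below shows the idea with a hypothetical helper, `verify_roundtrip`, that accepts any pair of compress and decompress callables. It uses the standard library's `bz2` module purely as a stand-in; it does not call bzip3 or libbz3.

```python
import bz2
import hashlib
from typing import Callable


def verify_roundtrip(
    data: bytes,
    compress: Callable[[bytes], bytes],
    decompress: Callable[[bytes], bytes],
) -> bool:
    """Compress, decompress, and compare SHA-256 digests with the input."""
    original_digest = hashlib.sha256(data).hexdigest()
    restored = decompress(compress(data))
    return hashlib.sha256(restored).hexdigest() == original_digest


# Stand-in codec: Python's bz2 module, not bzip3.
sample = b"BZip3 excels at compressing text or code.\n" * 1000
print(verify_roundtrip(sample, bz2.compress, bz2.decompress))  # True
```

The same check applies to files: keep the original until the decompressed copy has been shown to match it.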
**Bzip3's performance is _heavily_ dependent on the compiler. x64 Linux clang13 builds can usually reach up to 17 MiB/s compression and 23 MiB/s decompression _per thread_. Windows and 32-bit builds might be considerably slower.**

Bzip3 has been tested on the following architectures:
- x86
- x86_64
- armv6
- armv7
- aarch64
- ppc64le
- mips
- mips64
- sparc
- s390x

## Corpus benchmarks

![visualisation of the benchmarks](etc/benchmark.png)

Check [etc/BENCHMARKS.md](etc/BENCHMARKS.md) for more results.

## Licensing

A breakdown of components and their licenses follows:

- (runtime) The codebase as a whole: Copyright 2022-2023, Kamila Szewczyk (k@iczelia.net); LGPL (LICENSE)
- (runtime) The Burrows-Wheeler transform (libsais) and LZP code: 2021-2022, Ilya Grebnov (ilya.grebnov@gmail.com); Apache 2.0 (3rdparty/libsais-LICENSE)
- (compile-time) `build-aux`: Copyright 2011, Daniel Richard G (skunk@iSKUNK.ORG), 2019, Marc Stevens (marc.stevens@cwi.nl), 2008, Steven G. Johnson (stevenj@alum.mit.edu); GPL-3+ with AutoConf exception
- (compile-time) `build-aux/ax_check_compile_flag.m4`: Copyright 2008, Guido U. Draheim (guidod@gmx.de), 2011, Maarten Bosmans (mkbosmans@gmail.com); FSFAP
- (compile-time) `build-aux/git-version-gen`: Copyright 2007-2012, Free Software Foundation, Inc; GPLv3
- (runtime) `bz3grep`: Copyright 2003, Thomas Klausner; BSD-2-clause

`bzip3` as a whole is licensed under LGPLv3 only. It is not dual-licensed under LGPLv3 and Apache 2.0.

## Thanks

- Ilya Grebnov for his `libsais` library, used for BWT construction in BZip3, and for the LZP encoder, which I used as a reference implementation to improve my own.
- Caleb Maclennan for configuring autotools as a packaging-friendly build system for BZip3.
- Ilya Muravyov for his public domain BWT post-coder, a derivative of which is used in this project.