20 March 2024

Applying any sort of entropy coding to LZW output seems like a lot of pain. Nominally there are a lot of invariants you could exploit — e.g. literal packets that are mostly US-ASCII, which FSE could squeeze for a double-digit CR — but the match packets (>256, plus 2 reserved codes for table clear and promotion) are just basically random. The "simple" scheme of inserting trash into the LZW dictionary on almost every byte is moot: you end up with some weird sparse 16-bit alphabet to compress, where most entries occur once or twice per block.

And then there's the invariant that all the codes are <= the currently-seen maximum since the last table clear, which UNIX "compress" seems to use for a primitive sort of entropy coding: you begin with 9-bit codes, promote all the way up to 16-bit codes, then freeze the dict and scrap it when the CR drops below a delta of 1-3%. For purely technical reasons my program seems faster/stronger than unix compress, but it's a low bar. There's nothing you can really do with this format; I can see why people ditched it in the 90s and never came back.

... but now I'm reading up on some interesting schemes, like LZW with 12-bit dictionaries selected by o0...o3 context, and it clicked in my head that LZP/ROLZ are basically LZW. Also looking into some other variations on LZW. Probably won't figure out a nice way to do EC on the LZW output, though.
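For my own reference, a minimal sketch of the variable-width idea — codes start at 9 bits and promote up to a 16-bit cap as the dictionary fills. Assumptions baked in (not how compress actually does it at the bit level): codes 256/257 are reserved for table clear and promotion as in the packet layout above, so phrases start at 258, and there's no real bit-packing — each emitted code is just paired with the width it would be written at:

```python
FIRST = 258          # 0..255 literals, 256 = clear, 257 = promote (assumed layout)
CODE_CAP = 1 << 16   # freeze the dictionary once codes hit 16 bits

def lzw_encode(data: bytes):
    dictionary = {bytes([i]): i for i in range(256)}
    next_code, width = FIRST, 9
    out, w = [], b""
    for byte in data:
        wb = w + bytes([byte])
        if wb in dictionary:
            w = wb
            continue
        out.append((dictionary[w], width))
        if next_code < CODE_CAP:
            dictionary[wb] = next_code
            next_code += 1
            # promotion: once the next code wouldn't fit, widen the field
            if next_code > (1 << width) and width < 16:
                width += 1
        w = bytes([byte])
    if w:
        out.append((dictionary[w], width))
    return out

def lzw_decode(codes):
    table = {i: bytes([i]) for i in range(256)}
    next_code, prev = FIRST, None
    out = bytearray()
    for code, _width in codes:
        # KwKwK case: code referenced before the decoder has built it
        entry = table[code] if code in table else prev + prev[:1]
        out += entry
        if prev is not None and next_code < CODE_CAP:
            table[next_code] = prev + entry[:1]
            next_code += 1
        prev = entry
    return bytes(out)
```

The invariant from above falls out directly: every emitted code fits in the width current at emission time, which is exactly what makes the 9-to-16-bit promotion scheme decodable.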

< back to journal