30 March 2025

I can't possibly convey how fun working with bioinformatics tool suites is. To a newcomer, they all look pretty okay, and all the standard stuff seems to be already there.

You try running a simple sequence alignment program, but instead of standard input/output, it insists on six different configuration files in a format no other tool understands, and it marks FASTA headers with semicolons instead of ">".

You grab a genome assembler, but it only works with a very specific version of an old library that isn't maintained anymore. The documentation tells you to "just install it from this FTP link", which is of course broken.

You run a phylogenetic tree builder, but the output file is in a format that no visualization software can read, except for one program from 2003 that only runs on an old version of Java that conflicts with everything else on your system.

You take a look at the variant caller. It requires chromosome names in the format chr1, but another tool in your pipeline needs them as just 1, and neither has an option to switch.

The genome assembler hasn't been updated since 2012, but everyone still uses it because the "modern alternative" requires 200 GB of RAM and crashes silently when it runs out.

And on you go. Every tool mostly does what it's supposed to, but each one has its own weird dependencies, its own obscure input formats, its own inexplicable quirks. There's no clear problem with bioinformatics software as a whole; all the essential tools technically exist.

Now imagine you meet millions of bioinformaticians who tell you, "Well hey, what's the problem? This is what we've always used, and it works fine!" And they show you their pipelines, held together with ten different shell scripts, manual file renaming, and a README that just says, "Run this on Ubuntu 16.04, trust me". And you point out that one of their scripts has a hardcoded file path to someone's home directory, and they just say, "Yeah, that happens sometimes".