I’ve been syncing files/copying drives onto my new NAS for the past few days and I may actually end up running a bit low on space (currently up to 26/40TiB, 13M+ files and counting).
While my main plan is to do mass deduplication as a part of a larger effort (there are multiple backups/copies of many of the files I’m dumping onto the NAS) if I run out of space I may have to do some manual lookups, which will probably involve using something like fdupes
.
One interesting thing looking at the fdupes repo is that it’s about to turn 19 years old soon (actually pretty close in age to this blog), maintained basically by a single author over the years. It’s at version 1.6.1 currently.
Anyway, just thought that was a neat little thing. When you think about it, all computing is about people dedicating time and energy towards building this larger infrastructure that makes the modern world possible (one bug at a time), but there are tons of these small utilities/projects that are stewarded/maintained by individuals over the course of decades.
On a somewhat related tangent, a small anecdote which I don’t think I ever posted about here, but a few (wait, 8!) years ago WordPress asked me if I’d re-license some really old code (2001) I wrote from GPLv2 to GPLv2+. It turns out this code I wrote mainly as a way to learn Cold Fusion (which also still runs in some form in Metafilter I believe) lives on in the WP formatting code and gets run every time content is saved. It’s some of my oldest still-running code (and almost certainly the most executed), and what’s especially interesting is that it pretty much happened without any effort from my part. I put it online on an old Drupal site that had my little code projects back in the day and one day Michel from b2 (proto-WP) dropped a line that he had used the code. The kicker to the tale is that I switched majors (CECS to FA) before taking a compilers class or knowing anything about lexing/parsing (which, tbf, I still don’t), so it’s really just me writing something from first principles w/ barely any idea of what I was doing. And yet, if the stats are correct, probably a third of the world’s published content has been touched by it. Pretty wacky stuff, but probably not such an uncommon tale when you think about it. (I’ve also written plenty of code on purpose for millions+ of people to use; it’s one of the big appeals of programming IMO.)
Tangent two: eventually I will publish my research into file and document crawlers, indexers, management systems. I’m trying out Fess right now, which was pleasantly easy get running, and plan on trying out Ambar and Diskover. I have an idea of the features I actually want and they don’t exist, so it’s likely that I’ll end up trying to adapt/write a crawler (storage-crawler? datacat? fscrawler?) and then adding what I need on top of that.