Since I wasn’t able to find a file cataloguer and dupe-finding app that quite fit my needs (on the Mac, DiskTracker came closest; of all the apps I tried, it’s the one I’d recommend), I started to code some stuff up. One of the first things I wanted to know was how well Python’s os.walk() (plus os.lstat() on every entry) would perform against ls, and I threw in find while I was at it. Here are the results for a few hundred thousand files; the relative speeds were consistent over a few runs:
python (44M, 266173 lines) --- real 0m54.003s  user 0m18.982s  sys 0m19.972s
ls     (35M, 724416 lines) --- real 0m45.994s  user 0m9.316s   sys 0m20.204s
find   (36M, 266174 lines) --- real 1m42.944s  user 0m1.434s   sys 0m9.416s
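For reference, a minimal sketch of this kind of os.walk()/os.lstat() pass looks like the following. This isn’t the exact script that was timed; the function name and the output format are just illustrative.

import os
import sys

def walk_and_stat(root):
    """Walk the tree, lstat every entry, and print one line per entry.
    Roughly the shape of the traversal timed above (names are illustrative)."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # permission errors, vanished files, etc. -- skip and keep going
                continue
            print(path, st.st_size, int(st.st_mtime))
            count += 1
    return count

if __name__ == "__main__":
    total = walk_and_stat(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("entries:", total, file=sys.stderr)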
The Python code uses the most CPU time but is still I/O-bound, and in real time it’s only negligibly slower than ls. The options I used for ls were -aAlR, which produces output with lots of line breaks but still ends up smaller than find’s single-line, full-path output. The find run was really just a file-count sanity check (the off-by-one difference from the Python script is because find lists the starting directory itself). Using Python’s os module has the advantage of returning all the attributes I need without any additional parsing, and since the performance is fine, that’s what I’ll be using. So, I just thought it’d be worth sharing these results for anyone who needs to process a fair number of files. I’ll be processing somewhere in the ballpark of 2M files (3-4TB of data?) across about a dozen NAS, DAS, and removable drives; obviously, if you’re processing a very much larger number, you may want a different approach.
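As a rough sketch of what “no additional parsing” buys you: the stat result already carries size, mtime, inode, and so on, which makes a first-pass dupe-candidate grouping (same size means possible dupe) almost trivial. The size_index function below is only an illustration of that idea, not the actual catalogue code.

import os
from collections import defaultdict

def size_index(root):
    """Group files by st_size as a cheap first pass for dupe candidates.
    Illustrative only -- a real catalogue would keep more attributes."""
    by_size = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue
            # st already has the fields you'd otherwise parse out of ls -l:
            # size, mtime, inode, device, mode...
            by_size[st.st_size].append((path, st.st_mtime, st.st_ino))
    # keep only sizes that occur more than once
    return {size: entries for size, entries in by_size.items() if len(entries) > 1}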