While experimenting with some of my GridBackup code last night, I noticed something very interesting: There's a lot more wasted space on my hard drive than I thought.
Oh, I knew there were tons of files I hadn't touched in years and might never look at again. What I hadn't realized was how much stuff is duplicated.
GridBackup has progressed to the point where the file system scanning and backup log generation are pretty solid (though I still need to test on Windows). I'm working on actually uploading data to the grid now.
Anyway, I decided to put my queue of jobs to be uploaded into a database, for convenient access and lookups. One of the things stored for each job is a hash of the file. For those who don't know what that is, it's a kind of unique fingerprint. Different files basically never have the same hash, and identical files always have the same hash. So, by counting unique hashes in the database, I can tell how many unique files are on my computer.
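To make that concrete, here's a minimal sketch of the idea in Python. This isn't GridBackup's actual code or schema; the database file, the "jobs" table, and the choice of SHA-256 are all just illustrative assumptions, since I haven't described which hash function GridBackup actually uses.

```python
import hashlib
import os
import sqlite3

def file_hash(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks so huge files don't have to fit in memory."""
    h = hashlib.sha256()  # assumption: any cryptographic hash works as a fingerprint
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def scan(root, db_path="upload_queue.db"):
    """Walk a directory tree, recording each file's path, size, and content hash."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS jobs (path TEXT PRIMARY KEY, size INTEGER, hash TEXT)")
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                db.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
                           (path, os.path.getsize(path), file_hash(path)))
            except OSError:
                continue  # unreadable or vanished file; skip it
    db.commit()
    # Counting distinct hashes tells you how many unique files there really are.
    total, unique = db.execute("SELECT COUNT(*), COUNT(DISTINCT hash) FROM jobs").fetchone()
    print(f"{total} files scanned, {unique} unique")
```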
It turns out that out of the 733,189 files on the computer, only 520,242 of them are unique. So over 200,000 files are duplicates of at least one other.
Of course there are a boatload of empty files, over 11,000 of them. Programs create empty files for all sorts of reasons, so that didn't surprise me. But there are a lot of non-empty duplicates, too; all told, over 1 GB of space is wasted on duplicates.
I generated a list of the culprits, sorted by total space wasted (size * (count - 1)), and while some of the worst offenders were my own fault, lots of others had been quietly generated by various programs, wasting a lot of space.
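In SQL terms, that report is just a GROUP BY over the hash column. Here's roughly what it looks like against the hypothetical "jobs" table sketched above (again, not GridBackup's real schema):

```python
import sqlite3

def wasted_space_report(db_path="upload_queue.db", limit=20):
    """Print duplicated contents ordered by wasted space, i.e. size * (count - 1)."""
    db = sqlite3.connect(db_path)
    rows = db.execute("""
        SELECT hash, MAX(size) AS size, COUNT(*) AS copies,
               MAX(size) * (COUNT(*) - 1) AS wasted
        FROM jobs
        GROUP BY hash
        HAVING COUNT(*) > 1
        ORDER BY wasted DESC
        LIMIT ?""", (limit,)).fetchall()
    for h, size, copies, wasted in rows:
        print(f"{wasted:>12,} bytes wasted: {copies} copies of a {size:,}-byte file ({h[:12]})")
```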
What to do? Nothing, of course. This is 1 GB wasted out of nearly 200 GB of stuff on a 300 GB drive. But I thought it was interesting, and it reaffirms my suspicion that because GridBackup tracks files by content rather than by name, lots of stuff will only have to be backed up once, even though it may exist many times across the group of computers.
I bet I have a ton of wasted space on my computer; you should check it out and clean it up!
Very interesting, a back-up service that can identify duplicates. And possibly eliminate unnecessary ones?
I thought about that, and I don't think so. I mean, the tool certainly could eliminate duplicates, but which ones are the duplicates and which one should be kept? It couldn't really know.
On real operating systems, there's a notion called a "hardlink", which allows you to have a single copy of a file that is "linked" into multiple directories. It looks like any other file; it's not a special shortcut or symbolic link or anything. So I thought at least the tool could remove all of the dups and replace them with hardlinks. That way the tool wouldn't have to decide which copy can be kept and which can be lost.
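Just to illustrate what "replace a duplicate with a hardlink" would look like, here's a tiny Unix-only sketch (not part of GridBackup, and it carries the caveat described next):

```python
import os

def replace_with_hardlink(keep, dup):
    """Replace 'dup' with a hardlink to 'keep'. Assumes both paths have identical
    contents and live on the same filesystem (hardlinks can't cross devices)."""
    if os.path.samefile(keep, dup):
        return  # already the same inode, nothing to do
    tmp = dup + ".tmp-link"
    os.link(keep, tmp)    # create a second link to keep's inode
    os.replace(tmp, dup)  # atomically swap it in over the duplicate
```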
Unfortunately, if you modify one of the hardlinked "files", they will *all* change (because it's really just one file), and that may not be what you want. What you really need is a "copy on write" hardlink, so that if one of the linked "files" is modified, it gets duplicated at that point. As long as the files stay the same they share space, but if one gets modified it takes on a life of its own.
Unfortunately, I don't know of any file system other than btrfs that supports copy-on-write links, and btrfs is still under heavy development and not recommended for general use. Besides, btrfs already does the copy-on-write link trick automatically, so my tool would be unnecessary there: when you copy a file on btrfs it doesn't actually make a copy at all, just a new link with the "copy-on-write" bit set, which is why copying files on btrfs is blazing fast.
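For reference, you can also ask for one of those copy-on-write copies explicitly: GNU coreutils' cp has a --reflink flag for this, and it only works on a filesystem that supports reflinks, such as btrfs. A trivial wrapper:

```python
import subprocess

def reflink_copy(src, dst):
    """Copy-on-write copy via GNU cp. With --reflink=always, cp fails rather than
    silently falling back to a full copy on filesystems without reflink support."""
    subprocess.run(["cp", "--reflink=always", src, dst], check=True)
```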
So, bottom line, I think the best thing my tool could do is generate a report of duplicate files and then let you decide what, if anything, to do about it. If my experience is typical, it wouldn't be worth the effort to save 0.3% of disk space. You're better off just buying a bigger drive.