This is a REALLY bad time. Please please PLEASE tell me I'm not fucked...
I've accidentally created about a million files in a directory. Now `ls` takes forever, Python's `os.listdir` is faster but still dog-slow – but! but! there's hope – a C loop program using `opendir`/`readdir` is reasonably fast.
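(For concreteness, a minimal sketch of the kind of loop I mean – one `readdir()` pass, no sorting and no per-file `stat()`:)

```c
/* Minimal sketch of the fast enumeration loop: a single readdir()
   pass over the directory, with no sorting and no per-file stat(). */
#include <stdio.h>
#include <dirent.h>

int main(int argc, char** argv)
{
    DIR* dir = opendir(argc > 1 ? argv[1] : ".");
    if (!dir) { perror("opendir"); return 1; }
    struct dirent* ent;
    while ((ent = readdir(dir)) != NULL)
        /* readdir hands you the inode number (d_ino) for free */
        printf("%lu %s\n", (unsigned long)ent->d_ino, ent->d_name);
    closedir(dir);
    return 0;
}
```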
Now I want to modify said program so that it removes those files which have a certain substring in their name. (I want the files in that directory and its subdirectories, just not the junk I created due to a typo in my code.)
HOW THE FUCK DO I DO THAT IN O(N)??
O(N^2) is easy enough. The `unlink` function takes a filename, which means that under the hood – at least on a filesystem whose directories are plain unindexed lists, which mine apparently has – it reads ALL THE BLOODY FILE NAMES IN THAT DIRECTORY until it finds the right one and THEN it deletes that file. Repeated a million times, that's on the order of a trillion operations – half a trillion, really, since the directory shrinks as you go and the average scan covers half of it, but you get the idea.
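(The easy version looks something like this sketch – the substring is a made-up placeholder, and note that POSIX leaves `readdir()` behavior unspecified if you modify the directory mid-scan, hence the collect-first-delete-after structure:)

```c
/* The "easy" removal loop: enumerate once, then unlink() by name.
   Each unlink() is where the hidden linear directory scan happens on
   a filesystem with unindexed directories – hence O(N^2) in total.
   "junk" is a made-up placeholder for the typo'd substring. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    const char* path   = argc > 1 ? argv[1] : ".";
    const char* substr = argc > 2 ? argv[2] : "junk";

    if (chdir(path) != 0) { perror("chdir"); return 1; }
    DIR* dir = opendir(".");
    if (!dir) { perror("opendir"); return 1; }

    /* POSIX leaves readdir() unspecified if the directory changes
       under it, so collect the matching names first, delete after. */
    char** names = NULL;
    size_t n = 0, cap = 0;
    struct dirent* ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strstr(ent->d_name, substr) == NULL)
            continue;
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            names = realloc(names, cap * sizeof *names);
        }
        names[n++] = strdup(ent->d_name);
    }
    closedir(dir);

    for (size_t i = 0; i < n; i++) {
        if (unlink(names[i]) != 0)  /* the O(directory size) part */
            perror(names[i]);
        free(names[i]);
    }
    free(names);
    return 0;
}
```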
Now, `readdir` gives me the fucking inode number. How the FUCK do I pass it back to this piece of shit operating system from hell, WITHOUT having it search through the whole damned directory AGAIN to find what it just gave me?
I would have thought that `rm -rf`, for instance, would be able to deal with this kind of job efficiently. I'm not sure it can. The `excise` function in the guts of coreutils seems to be using `unlinkat`, which still takes a pathname (relative to a directory file descriptor, but a name all the same). All attempts to google for this shit came up with advice to use `find -inum ... -exec rm` or some shit like that, which means find converts the inode to a name, rm gets the name, and Unix converts the name back to an inode...
So am I correct in that:
- Neither Unix nor the commercial network filers nor anybody else BOTHERS to use a hash table somewhere in their guts to get the inode in O(1) from the name (NOT A VERY HARD TASK DAMMIT), and
- Unix does not provide a way to remove files given inode numbers, but
- Unix does unfortunately make it easy enough (O(1)) to CREATE a new file in a directory which is already full of files, so that the loop creating those bloody files in the first place was NOT quadratic?? So that you can REALLY shoot yourself in the foot, big time?
Please tell me I'm wrong about the first 2 points, especially the second... Please please please... I kinda need those files in there...
(And I mean NOTHING, nothing at all works in such a directory at a reasonable speed because every time you need to touch a file the entire directory is traversed underneath... FUUUUUCK... I guess I could traverse it in linear time and copy aside somehow though... maybe that's what I'd do...)
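(Roughly this kind of thing – a sketch of the copy-aside idea. The substring and the destination directory are made-up placeholders, it assumes the files worth keeping are few compared to the junk, and the same caveat about modifying a directory under a live `readdir()` applies – collecting the names first is the paranoid version:)

```c
/* Sketch of the "copy aside" idea: one readdir() pass over the huge
   directory, rename() every file I actually want into a fresh empty
   directory, then nuke the old directory wholesale. Each rename()
   still does a name lookup in the huge directory, so this only wins
   if the keepers are few. "junk" is a placeholder substring and
   "../rescued" a made-up destination, which must be on the same
   filesystem (rename() can't cross filesystems). */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
    const char* junk = "junk";        /* placeholder substring */
    const char* dest = "../rescued";  /* pre-created, same filesystem */

    DIR* dir = opendir(".");
    if (!dir) { perror("opendir"); return 1; }
    struct dirent* ent;
    char to[4096];
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        if (strstr(ent->d_name, junk) != NULL)
            continue;  /* leave the junk behind, nuke it with the dir */
        snprintf(to, sizeof to, "%s/%s", dest, ent->d_name);
        if (rename(ent->d_name, to) != 0)
            perror(ent->d_name);
    }
    closedir(dir);
    return 0;
}
```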
Anyway, a great entry for the Accidentally Quadratic blog I guess...
Update: gitk fires up really damn quickly in that repository, showing all the changes. Hooray! Not the new files though. git citool is kinda... sluggish. I hope there were no new files there...
Updates:
- `find fucked-dir -maxdepth 1 -name "fuckers-name*" -delete` nuked the bastards; I didn't measure the time it took, but I ran it in the evening, didn't see it finish in the half hour I had to watch it, and the files were gone in the morning. Better than I feared it would be.
- As several commenters pointed out, many modern filesystems do provide O(1) or O(log(N)) access to files given a name, so they asked what my file system was. Answer: fucked if I know, it's an NFS server by a big-name vendor.
- Commenters also pointed out how deleting a file given an inode is hard because you'd need a back-link to all the directories with a hard link to the file. I guess it makes sense in some sort of a warped way. (What happens to a file when you delete it by name and there are multiple hard links to it from other directories?.. I never bothered to learn the semantics under the assumption that hard links are, um, not a very sensible feature.)
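(For what it's worth, the answer to that parenthetical turns out to be simple: `unlink` removes one name and decrements the file's link count; the data is only freed once the count hits zero and no process still has the file open. A toy demonstration – file names made up, error checks omitted:)

```c
/* Toy demo of hard-link semantics: unlink() removes a name, not the
   file. The data goes away only when the last name is gone (and no
   process still has the file open). */
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(void)
{
    int fd = open("original", O_CREAT | O_WRONLY, 0644);
    write(fd, "hello\n", 6);
    close(fd);

    link("original", "alias");    /* second name, same inode */

    struct stat st;
    stat("alias", &st);
    printf("links after link():   %lu\n", (unsigned long)st.st_nlink); /* 2 */

    unlink("original");           /* removes one name... */
    stat("alias", &st);
    printf("links after unlink(): %lu\n", (unsigned long)st.st_nlink); /* 1 */
    /* ...but "alias" still opens fine and still holds "hello". */

    unlink("alias");              /* now the data is actually freed */
    return 0;
}
```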