Accidentally quadratic: rm -rf??
This is a REALLY bad time. Please please PLEASE tell me I'm not fucked...
I've accidentally created about a million files in a directory. Now ls takes forever, Python's os.listdir is faster but still
dog-slow – but! but! there's hope – a C loop program using opendir/readdir is reasonably fast.
Now I want to modify said program so that it removes those files which have a certain substring in their name. (I want the
files in that directory and its subdirectories, just not the junk I created due to a typo in my code.)
HOW THE FUCK DO I DO THAT IN O(N)??
O(N^2) is easy enough. The unlink function takes a filename, which means that under the hood it reads ALL THE BLOODY FILE
NAMES IN THAT DIRECTORY until it finds the right one and THEN it deletes that file. Repeated a million times, that's a trillion
operations – a bit less because shit gets divided by 2 in there, but you get the idea.
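For the record, the delete-by-substring loop itself is trivial — here's a sketch in Python using os.scandir, which streams entries via the same readdir machinery instead of slurping and sorting everything like ls does (directory name and substring are placeholders). The quadratic cost, if any, lives inside each unlink call, not in this loop:

```python
import os

def unlink_by_substring(dirpath, substring):
    """One pass over the directory: unlink every regular file whose
    name contains substring. The scan itself is linear; whether the
    whole thing is O(N) or O(N^2) depends on what the filesystem
    does inside each unlink."""
    deleted = 0
    for entry in os.scandir(dirpath):
        if substring in entry.name and entry.is_file(follow_symlinks=False):
            os.unlink(entry.path)
            deleted += 1
    return deleted
```

(POSIX leaves it unspecified whether an entry removed mid-scan shows up again later in the scan, but unlinking the entry you just read is the standard safe pattern.)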
Now, readdir gives me the fucking inode number. How the FUCK do I pass it back to this piece of shit operating system from
hell, WITHOUT having it search through the whole damned directory AGAIN to find what it just gave me?
I would have thought that rm -rf for instance would be able to deal with this kind of job efficiently. I'm not sure it can.
The excise function in the guts of coreutils for
instance seems to be using unlinkat, which gets a pathname. All attempts to google for this shit came up with advice to use find
-inum -exec rm or some shit like that, which means find converts the inode to a name, rm gets the name, Unix converts the name back
to an inode...
So am I correct in that:
- Neither Unix nor the commercial network filers nor anybody else BOTHERS to use a hash table somewhere in their guts to get the
inode in O(1) from the name (NOT A VERY HARD TASK DAMMIT), and
- Unix does not provide a way to remove files given inode numbers, but
- Unix does unfortunately make it easy enough (O(1)) to CREATE a new file in a directory which is already full of files, so
that a loop creating those bloody files in the first place is NOT quadratic?? So that you can REALLY shoot yourself in the foot,
big time?
Please tell me I'm wrong about the first 2 points, especially the second... Please please please... I kinda need those files
in there...
(And I mean NOTHING, nothing at all works in such a directory at a reasonable speed because every time you need to touch a
file the entire directory is traversed underneath... FUUUUUCK... I guess I could traverse it in linear time and copy aside
somehow though... maybe that's what I'd do...)
Anyway, a great entry for the Accidentally Quadratic blog I
guess...
Update: gitk fires up really damn quickly in that repository, showing all the changes. Hooray! Not the new
files though. git citool is kinda... sluggish. I hope there were no new files there...
Updates:
- `find fucked-dir -maxdepth 1 -name "fuckers-name*" -delete` nuked the bastards; I didn't measure the time it took, but I ran
it in the evening, didn't see it finish in the half hour I had to watch it, and then the files were gone in the morning.
Better than I feared it would be.
- As several commenters pointed out, many modern filesystems do provide O(1) or O(log(N)) access to files given a name, so
they asked what my file system was. Answer: fucked if I know, it's an NFS server by a big-name vendor.
- Commenters also pointed out how deleting a file given an inode is hard because you'd need a back-link to all the directories
with a hard link to the file. I guess it makes sense in some sort of a warped way. (What happens to a file when you delete it by
name and there are multiple hard links to it from other directories?.. I never bothered to learn the semantics under the
assumption that hard links are, um, not a very sensible feature.)
What does `find -delete` do in this case?
I think that if you divide the files into 10 folders then you will have
1/10th of the original complexity
@Les: I don't know. Perhaps I should find out.
@Viktor: yeah, that's what everybody does when they do it ON PURPOSE
in production, but this here was a fucking one-time ACCIDENT. Now how do
I clean it up??
You didn't say what filesystem this is. Ext4 (and I think even ext3,
but not by default) has dir_index support (tune2fs -l /dev/sdaX | grep
dir_index), and XFS can similarly deal with large numbers of files.
Perhaps this will help:
https://github.com/danfruehauf/Scripts/blob/master/spread_files/spread_files.sh
Just use rsync to delete all files that match a substring. Rsync
caches aggressively, so it's much faster than doing find . | grep
$keyword | xargs rm or something similar.
Modern Linuxes do actually use a b-tree for directories, so it's only
O(n log n). But still, the whole directory / inode dichotomy in Unix is
a mistake. (And no, you can't delete a file by inode number, because a
file can have multiple links and the OS would have to keep a list of
*all* of them to do the reverse mapping).
Maybe copying the good files to a new directory will be faster.
Is it possible to generate all possible filenames, and call unlink on
them? That is, call unlink without first knowing whether the file
exists.
I guess it's a question of whether the filenames are predictable
beyond "has this substring in its name" and what fraction of possible
filenames you actually used (e.g., if you created 900,000 files out of a
possible 1,000,000, then that's one thing; if you created 100,000 out of
a possible 1,000,000, that's something else).
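If the junk names really were generated by a predictable loop — a big if — you don't need to read the directory at all: just attempt to unlink every candidate name and ignore the misses. A sketch, with the name pattern made up for illustration:

```python
import os

def unlink_candidates(dirpath, candidate_names):
    """Unlink each candidate name that exists; each attempt is a
    single name lookup, no full directory scan. Only pays off if the
    space of possible names isn't much larger than the number of
    files actually present."""
    deleted = 0
    for name in candidate_names:
        try:
            os.unlink(os.path.join(dirpath, name))
            deleted += 1
        except FileNotFoundError:
            pass
    return deleted

# hypothetical usage, assuming a counter-style naming bug:
# unlink_candidates("fucked-dir", ("junk_%06d" % i for i in range(10**6)))
```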
xargs?
ls | grep delete_me | xargs rm
Is xargs with rm quadratic?
Might be interesting/safer to use interim data files in /tmp instead
of pipes.
I would be surprised if you could delete by inode alone, since that
would leave around a bunch of file names potentially pointing to those
inodes. Deleting a file is more than deleting its inode, you have to
delete its entry(ies) in the directory, too... no?
Yossi, assuming you're using a filesystem in the family of ext*fs,
have a look at the e2fsprogs source code, and more specifically
debugfs/kill_file_by_inode(). Still, be careful not to corrupt your
fs!
'ls' is dog slow because it reads *all* the directory entries into
memory, and then sorts them, before output. Use 'ls -U' to output in
directory order as it does the readdir.
As noted above, deleting by inode would be dangerous to the
filesystem's integrity, since there could be other links to that inode.
Of course, the kernel could enforce a condition of inode->link_count
== 1 to allow it.
More sensible for this case, I think, would be deletion by dirent,
which gives some cookie within the directory (d_off), and the name and
inode number to either confirm that the right thing is to be deleted or
to look up the right thing if the directory's contents changed.
tune2fs -O dir_index, if it's an old ass ext3
newer filesystems are not that stupid
Filename lookup depends on what filesystem you are using. ZFS uses an
on-disk hashtable (extendible hash to be specific), so filename lookup
is O(1) and "rm -rf" can be O(number files deleted).
Delete by inode number would be much slower, because the filesystem
would need to find the directory entry to remove, which AFAIK is at best
O(number files in parent directory).
Copy out the files you want to a new filesystem, and nuke the old one
from orbit.
That is, assuming you're right about rm being O(N); on Linux all the
major filesystems (ext4,xfs,btrfs) use hash tables or trees and thus
should be faster than that. Or is this a Mac or some BSD variant?
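The copy-aside idea (also floated in the post's parenthetical) is a single linear pass. A sketch assuming the good files are identifiable by name and the destination directory is on the same filesystem, so os.rename works — across mount points you'd need an actual copy:

```python
import os

def move_good_files_aside(src_dir, dst_dir, junk_substring):
    """One scan, one rename per good file: move every entry whose
    name does NOT contain junk_substring into dst_dir. Afterwards
    the old directory (now all junk) can be nuked wholesale.
    Note: moves any entry type, not just regular files."""
    os.makedirs(dst_dir, exist_ok=True)
    moved = 0
    for entry in os.scandir(src_dir):
        if junk_substring not in entry.name:
            os.rename(entry.path, os.path.join(dst_dir, entry.name))
            moved += 1
    return moved
```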
It does not make too much sense for file creation to be O(1) and file
removal to be O(N) – before you create a file you have to check whether a
file with the same name is already there. Of course it's possible to
implement file removal to be O(N) *after* a sub-linear name lookup but
it feels unlikely that file systems would do this.
Might be much faster, depending on your filesystem, to just remove
the whole directory, if that's an option! For instance in the case of the
HAMMER2 filesystem, directory removal is a constant-time operation.
NTFS actually keeps the list of directories a file is in, as well as
the names the file has in those directories, in the NTFS-equivalent of
the i-node. So deleting by i-node would have been entirely doable had UNIX
ever tried to support it.
Move out the files you want and kill the rest. Alternatively look at
the inodes and see the first bad file and directly remove from
there.
The `dirent` struct contains a field called `d_name`; see `man 3
readdir`. Also `man 2 unlinkat`.
Re "What happens to a file when you delete it by name and there are
multiple hard links to it from other directories?": Files are reference
counted. Unlinking simply deletes a reference. If all the references are
gone, the file is removed.
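The reference counting is directly observable from Python: st_nlink is the link count, and the data outlives any one name until the last link goes.

```python
import os
import tempfile

def demo_hard_links():
    """Two names, one inode: unlinking one name leaves the data
    reachable through the other; the file itself only disappears
    when the last link is removed."""
    d = tempfile.mkdtemp()
    a, b = os.path.join(d, "a"), os.path.join(d, "b")
    with open(a, "w") as f:
        f.write("payload")
    os.link(a, b)                      # second hard link, same inode
    assert os.stat(a).st_nlink == 2
    os.unlink(a)                       # "delete by name": just drops one link
    with open(b) as f:                 # data still there via the other name
        assert f.read() == "payload"
    assert os.stat(b).st_nlink == 1
    os.unlink(b)                       # last link gone: now the file is gone
    return os.listdir(d)               # -> []
```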
FreeBSD supports hashed directories on UFS as an optimization (the old
format is still always present, for binary compatibility).
XFS uses btrees. ZFS, I'm pretty sure, too (but need to check).
NFS is very slow anyway, it is almost irrelevant which underlying FS
is used by server.
If the NFS server's export is on a journaled filesystem, the journal
semantics can have a huge impact on FS operations. Think ext3's
"data={journal|ordered|writeback}" variations.
In addition, unlinking a file on a journaled FS is one of the most
complicated operations there is, managing reference counts, likely (but
not guaranteed) block/extent releases, and directory inode
modifications. Given that all the unlinks are in the same directory,
that's a lot of serialization happening in the journal entries when the
journal fills up.
Hahahaa. I used to have this tarball that had special names that
collided on the kernel filesystem hash table -> O(n^2). So if you did
"rm -rf *" it was meant to take a century, I think. You literally had to
re-compile your kernel with a different hash to kill the files.
I think it's been fixed, but posting as anon just to be sure not to
step on too many toes :) The fix was meant to be randomization of a seed
in the hash func at kernel startup. The bug was filed but dunno if it
got followed up on. Anyway, accidental O(n^2) is _super_ frustrating and
the filesystem has weird quirks.
Also relevant: "You can list a directory containing 8 million files!
But not with ls.." (intentionally not linked)
"What happens to a file when you delete it by name and there are
multiple hard links to it from other directories?"
That's the easy one! You don't delete it (by name or otherwise), you
unlink it (removing the link with the given name). The file only
actually goes away when there are no more links to it and no more file
descriptors open to it, as a sort of autonomic process. The system
doesn't give you the ability to reach in and delete the "actual file"
out from under things that hold references to it.