Tuesday, January 11, 2011

what is the difference between find -exec cmd {} + and xargs

Which one is more efficient over a very large set of files and should be used?

Method 1:
# find . -exec ls -l {} \;

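Note: the above command runs ls -l once for each individual file found.

Note: because every invocation receives a single file name, this form cannot fail with "Argument list too long", no matter how many files are in the directory. That error affects commands such as "ls -l *", where the shell expands every name onto one command line.

Note: each file name is passed to the command as a single argument without going through a shell, so special characters need no escaping. The effect is as if the file name had been enclosed in single quotes.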

Method 2:
# find . | xargs ls -l

Note: the above command builds an argument list from the output of the find command and passes it to ls.

Note: find feeds xargs a long list of file names. xargs then splits this list into sublists that fit within the system's argument-length limit and calls ls once for every sublist.

Consider the case where the find command outputs:
H1
H2
H3

the Method 1 command would execute
ls -l H1
ls -l H2
ls -l H3

but the Method 2 would execute
ls -l H1 H2 H3
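
The difference in the number of invocations can be checked with a small sketch like the one below. The scratch directory and the stand-in command (sh -c 'echo run', which prints one line per invocation) are assumptions made for illustration and are not part of the original examples.

```shell
#!/bin/sh
# Scratch directory with three files, mirroring H1, H2 and H3 above.
tmp=$(mktemp -d)
touch "$tmp/H1" "$tmp/H2" "$tmp/H3"

# Stand-in command: prints "run" once per invocation and ignores its arguments.
# Method 1: one invocation per file.
per_file=$(find "$tmp" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')

# Method 2: xargs packs the names into as few invocations as possible.
batched=$(find "$tmp" -type f | xargs sh -c 'echo run' sh | wc -l | tr -d ' ')

echo "Method 1 invocations: $per_file"
echo "Method 2 invocations: $batched"
rm -rf "$tmp"
```

With only three short names, xargs should need a single invocation, while Method 1 needs three.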

Note: using "-exec cmd {} \;" starts a separate ls process for each file that find locates, so a few hundred files mean a few hundred ls processes started one after another. xargs consumes fewer resources.

Note: the main reason to use xargs is efficiency.

When you use "-exec cmd {} \;" with find, it starts a new process for each file that is found.

Note: Method 2 is faster than Method 1 because xargs collects file names and runs the command with as long an argument list as the system allows. Often this means just a single invocation.

However, the xargs solution fails when file names contain spaces, tabs, newlines, or quote characters, because xargs itself (not the shell) splits its input on whitespace and treats quotes and backslashes specially. Try:

# touch "stupid name"

and then retry the two commands.
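
A minimal sketch of that experiment follows. It runs in a scratch directory (an assumption, so that no other files interfere) and uses sh -c 'echo $#' as a stand-in command that reports how many arguments it received.

```shell
#!/bin/sh
tmp=$(mktemp -d)
touch "$tmp/stupid name"

# -exec hands the whole name to the command as one argument.
exec_args=$(find "$tmp" -type f -exec sh -c 'echo $#' sh {} \;)

# Plain xargs splits the name at the space, producing two arguments.
xargs_args=$(find "$tmp" -type f | xargs sh -c 'echo $#' sh)

echo "arguments seen via -exec: $exec_args"
echo "arguments seen via xargs: $xargs_args"
rm -rf "$tmp"
```

With ls -l in place of the stand-in, the xargs variant would complain about two nonexistent files instead of listing "stupid name".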

There is a third solution that combines the best of both worlds. It is specified by POSIX, but not every version of the find command supports it. It is like the first syntax, except that you terminate the command with + instead of \;.

Method 3:
# find . -exec cmd {} +

Note: "find . -exec cmd {} +" command will NOT start a new process for each file, so it is as efficient as xargs command.
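
The following sketch illustrates both properties of the + form, batching and safe handling of odd names. The scratch directory and the stand-in command (sh -c 'echo "$#"', which prints its argument count once per invocation) are illustrative assumptions.

```shell
#!/bin/sh
tmp=$(mktemp -d)
touch "$tmp/H1" "$tmp/H2" "$tmp/stupid name"

# One line of output means one invocation; the number on that line is
# how many file names the invocation received.
plus_output=$(find "$tmp" -type f -exec sh -c 'echo "$#"' sh {} +)

echo "argument count per invocation: $plus_output"
rm -rf "$tmp"
```

A single line reading 3 indicates that all three names, including the one containing a space, reached one process intact.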

Method 4:
# find . -print0 | xargs -0 cmd -option1 -option2

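Note: always place -print0 at the END of the expression, after tests such as -name. find evaluates its expression from left to right, so a -print0 placed before the tests prints every file instead of only the matches.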
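
The effect of the position of -print0 can be seen with a sketch like this one. The scratch directory and the two file names are assumptions for illustration; tr converts the NUL separators to newlines so that wc can count the names.

```shell
#!/bin/sh
tmp=$(mktemp -d)
touch "$tmp/a.log" "$tmp/b.txt"

# -print0 last: only names that passed the -name test are printed.
at_end=$(find "$tmp" -type f -name '*.log' -print0 | tr '\0' '\n' | wc -l | tr -d ' ')

# -print0 before -name: each file is printed before -name is evaluated.
too_early=$(find "$tmp" -type f -print0 -name '*.log' | tr '\0' '\n' | wc -l | tr -d ' ')

echo "names printed with -print0 at the end: $at_end"
echo "names printed with -print0 too early:  $too_early"
rm -rf "$tmp"
```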

Note: without -print0, the pipeline breaks as soon as a file name contains a space, tab, etc. This can be a security vulnerability: given a file name like "foo -o index.html", xargs splits it into three arguments and -o is treated as an option. Try in an empty directory: "touch -- foo\ -o\ index.html; find . | xargs cat". You'll get: "cat: invalid option -- 'o'"

Note: the Method 4 command works even if file names contain funky characters (-print0 makes find print NUL-terminated matches, and -0 makes xargs expect this format).
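
As a hedged illustration of the difference, the sketch below replaces cat with stand-in commands that only report the arguments they receive. The scratch directory is an assumption, and -print0 and -0 are taken to be available, as they are in GNU and BSD versions of find and xargs.

```shell
#!/bin/sh
tmp=$(mktemp -d)
touch -- "$tmp/foo -o index.html"

# Newline-separated names: xargs splits the single name into three
# arguments, one of which is the stray option "-o".
unsafe_args=$(find "$tmp" -type f | xargs sh -c 'printf "[%s]" "$@"' sh)

# NUL-separated names: the name arrives as exactly one argument.
safe_count=$(find "$tmp" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

echo "without -print0: $unsafe_args"
echo "with -print0 and -0: $safe_count argument(s)"
rm -rf "$tmp"
```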

Method 5:
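#!/bin/sh
# Read one path per line so that names containing spaces or tabs stay intact
# (names containing newlines would still break this loop).
find srcDir -name "*.log" -print 2> /dev/null |
while IFS= read -r FILE
do
    python pyScript "$FILE" "$(dirname "$FILE")/python.log"
done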
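
An alternative that may be worth considering for Method 5 is to let find start the loop itself through -exec sh -c '...' sh {} +, so that no file name is ever split by the calling shell. The sketch below is only an illustration under stated assumptions: a scratch directory stands in for srcDir, and printf stands in for the "python pyScript" call.

```shell
#!/bin/sh
src=$(mktemp -d)                      # hypothetical stand-in for srcDir
mkdir "$src/job one"
touch "$src/job one/app.log" "$src/other.log"

# find passes the matching names as arguments to one small shell script,
# which loops over them; printf stands in for "python pyScript".
pairs=$(find "$src" -name '*.log' -exec sh -c '
    for f in "$@"; do
        printf "%s -> %s\n" "$f" "$(dirname "$f")/python.log"
    done' sh {} +)

echo "$pairs"
pair_count=$(printf '%s\n' "$pairs" | wc -l | tr -d ' ')
rm -rf "$src"
```

Because the names never pass through word splitting, this form is expected to cope with spaces, tabs, and even newlines in file names.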

Method 6:
If the command to run is CPU intensive you may want to use GNU Parallel:

# find . | parallel command

Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ

Thanks tange.

Reference:
http://www.unix.com/unix-dummies-questions-answers/19217-difference-between-xargs-exec.html

http://stackoverflow.com/questions/896808/find-exec-cmd-vs-xargs

Renaming files with find, sed, xargs, and mv

http://www.softpanorama.org/Tools/Find/using_exec_option_and_xargs_in_find.shtml

http://en.wikipedia.org/wiki/Xargs
