I find myself having to compress a number of very large files (80-ish GB), and I am surprised at the (lack of) speed my system is exhibiting. I get about 500 MB / min conversion speed; using top, I seem to be using a single CPU at approximately 100%.
I am pretty sure it's not (just) disk access speed, since creating a tar file (that's how the 80G file was created) took just a few minutes (maybe 5 or 10), but after more than 2 hours my simple gzip command is still not done.
In summary:
tar -cvf myStuff.tar myDir/*    # took <5 minutes to create an 87 GB tar file
gzip myStuff.tar                # took two hours and 10 minutes, creating a 55 GB gzip file
My question: Is this normal? Are there certain options in gzip to speed things up? Would it be faster to combine the commands and use tar -czvf? I saw a reference to pigz (Parallel Implementation of GZip), but unfortunately I cannot install software on the machine I am using, so that is not an option for me. See for example this earlier question.
I am intending to try some of these options myself and time them - but it is quite likely that I will not hit "the magic combination" of options. I am hoping that someone on this site knows the right trick to speed things up.
When I have the results of other trials available I will update this question - but if anyone has a particularly good trick available, I would really appreciate it. Maybe the gzip just takes more processing time than I realized...
UPDATE
As promised, I tried the tricks suggested below: changing the amount of compression, and changing the destination of the file. I got the following results for a tar that was about 4.1 GB:
flag user system size sameDisk
-1 189.77s 13.64s 2.786G +7.2s
-2 197.20s 12.88s 2.776G +3.4s
-3 207.03s 10.49s 2.739G +1.2s
-4 223.28s 13.73s 2.735G +0.9s
-5 237.79s 9.28s 2.704G -0.4s
-6 271.69s 14.56s 2.700G +1.4s
-7 307.70s 10.97s 2.699G +0.9s
-8 528.66s 10.51s 2.698G -6.3s
-9 722.61s 12.24s 2.698G -4.0s

So yes, changing the flag from the default -6 to the fastest -1 gives me a 30% speedup, with (for my data) hardly any change to the size of the compressed file. Whether I'm using the same disk or another one makes essentially no difference (I would have to run this multiple times to get any statistical significance).
If anyone is interested, I generated these timing benchmarks using the following two scripts:
#!/bin/bash
# compare compression speeds with different options
sameDisk='./'
otherDisk='/tmp/'
sourceDir='/dirToCompress'
logFile='./timerOutput'
rm -f $logFile
for i in {1..9}
do
  /usr/bin/time -a --output=$logFile ./compressWith $sourceDir $i $sameDisk $logFile
  /usr/bin/time -a --output=$logFile ./compressWith $sourceDir $i $otherDisk $logFile
done

And the second script (compressWith):
#!/bin/bash
# use: compressWith sourceDir compressionFlag destinationDisk logFile
echo "compressing $1 to $3 with setting $2" >> $4
tar -c $1 | gzip -$2 > $3test-$2.tar.gz

Three things to note:
- Using /usr/bin/time rather than time, since the built-in command of bash has many fewer options than the GNU command
- I did not bother using the --format option, although that would make the log file easier to read
- I used a script-in-a-script since time seemed to operate only on the first command in a piped sequence (so I made it look like a single command...)
With all this learnt, my conclusions are:

- Speed things up with the -1 flag (accepted answer)
- Much more time is spent compressing the data than reading it from disk
- Invest in faster compression software (pigz seems like a good choice)
- If you have multiple files to compress, you can put each gzip command in its own process and use more of the available CPU (poor man's pigz)
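A minimal sketch of that last point, poor man's pigz; the file names here are just placeholders, not files from the benchmark above:

```shell
# Poor man's pigz: start one gzip process per file, in the background,
# then wait for all of them to finish. One CPU core per gzip process.
for f in part1.tar part2.tar part3.tar; do
    gzip "$f" &
done
wait   # returns once every background gzip has exited
```

Each background gzip replaces its input file with a .gz file, just as the single-file invocation does.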
Thanks everyone who helped me learn all this!
Answers
You can change the speed of gzip using --fast, --best, or -# where # is a number between 1 and 9 (1 is fastest with less compression, 9 is slowest with the best compression). By default gzip runs at level 6.
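As a quick sketch of the two extremes (the file name sample.dat is just a placeholder for your own data):

```shell
# Generate a throwaway test file (size and contents are arbitrary).
head -c 10000000 /dev/urandom > sample.dat

# -1 / --fast: quickest, least compression; -9 / --best: slowest, best compression.
# -c writes to stdout, so the original file is left untouched.
gzip -1 -c sample.dat > fast.gz
gzip -9 -c sample.dat > best.gz

ls -l fast.gz best.gz   # compare sizes to judge the trade-off for your data
```

On real data the size gap between -1 and -9 is often modest, as the benchmark table in the question's update shows.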
The reason tar takes so little time compared to gzip is that there is very little computational overhead in copying your files into a single file (which is what it does). gzip, on the other hand, is actually using compression algorithms to shrink the tar file.
The problem is that gzip is constrained (as you discovered) to a single thread.
Enter pigz, which can use multiple threads to perform the compression. An example of how to use this would be:
tar -c --use-compress-program=pigz -f tar.file dir_to_zip

There is a nice succinct summary of the --use-compress-program option over on a sister site.
"I seem to be using a single CPU at approximately 100%."
That implies there isn't an I/O performance issue but that the compression is only using one thread (which will be the case with gzip).
If you manage to achieve the access/agreement needed to get other tools installed, then 7zip also supports multiple threads to take advantage of multi core CPUs, though I'm not sure if that extends to the gzip format as well as its own.
If you are stuck using just gzip for the time being and have multiple files to compress, you could try compressing them individually - that way you'll use more of that multi-core CPU by running more than one process in parallel. Be careful not to overdo it, though: as soon as you get anywhere near the capacity of your I/O subsystem, performance will drop off precipitously (to lower than if you were using one process/thread) as the latency of head movements becomes a significant bottleneck.
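One way to sketch that idea with a cap on the number of concurrent processes (note: xargs -P is a GNU/BSD extension rather than POSIX, and the '*.log' glob is just an example pattern):

```shell
# Compress each matching file individually, but run at most 4 gzip
# processes at a time so the I/O subsystem is not overwhelmed.
find . -maxdepth 1 -name '*.log' -print0 | xargs -0 -P 4 -n 1 gzip
```

The -print0/-0 pairing keeps file names with spaces intact; adjust -P to however many cores your disk can actually keep fed.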
One can also exploit the number of processors available with pigz, which usually gives faster performance, as shown in the following command:
tar cf - directory_to_archive | pigz -0 -p largenumber > mydir.tar.gz

Example: tar cf - patha | pigz -0 -p 32 > patha.tar.gz
This is probably faster than the methods suggested in the post, as -p sets the number of processes to run. In my personal experience, setting a very large value doesn't hurt performance if the directory to be archived consists of a large number of small files; otherwise the default value used is 8. For large files, my recommendation would be to set this value to the total number of threads supported on the system. For example, setting p = 32 on a 32-CPU machine helps.

-0 gives the fastest pigz compression, as it does not compress the archive at all and is instead focused on speed. The default compression level is 6.