TAR and GZIP Compression in Java

When you generate a lot of output files, as I do on some of my projects, at some point you want to archive them. So you end up creating an archive directory, placing the files in there (usually with something like org.apache.commons.io.FileUtils), and then having your network guys or admins set up backups of the archive (or keeping the archive itself on a fully backed-up network appliance).

This works really well, except that you are most likely wasting a ton of hard drive space because:

  • It’s an archive, you won’t be referencing it that much
  • Odds are it’s mostly text based so it can be heavily compressed
  • You have a whole directory, why not make it one file?

This is where tar and gzip (.tar.gz) can help you out!
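
The second bullet is easy to check with nothing but the JDK. The sketch below is not part of my FileUtils class; GzipRatioDemo and its methods are names made up for this illustration, and the log lines are invented sample data. It gzips a repetitive block of text in memory and prints the size before and after. How much you save on real files depends entirely on their content.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRatioDemo {

    /** Builds an invented, repetitive log with the given number of lines. */
    public static byte[] sampleLog(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2024-01-01 INFO job=").append(i).append(" status=OK\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Compresses the bytes in memory with gzip. */
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the gzip stream writes the trailer, so close before reading the result
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(raw);
        }
        return bytes.toByteArray();
    }

    /** Reverses gzip(), used here only to show that nothing is lost. */
    public static byte[] gunzip(byte[] packed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = sampleLog(10_000);
        byte[] packed = gzip(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
    }
}
```

Log-style text like this shrinks a lot because the same phrases repeat on every line. Already-compressed data (images, zip files) will barely shrink at all.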

The Basics

I’m not going to go into how tar and gzip work here other than to say:

  • tar takes any number of files and puts them into one file
  • gzip compresses a file

Knowing that, you can see how they work together: you tar up your archive directory and then gzip the resulting file to get a single compressed archive.

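To pull this off I created my own FileUtils class that extends org.apache.commons.io.FileUtils, and I use that class whenever I want what FileUtils gives me. The tar classes in the code come from Apache Commons Compress (org.apache.commons.compress.archivers.tar), GZIPOutputStream comes from java.util.zip, and IOUtils is presumably the one from Commons IO.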

The Code

	/**
	 * Compress (tar.gz) the input file (or directory) to the output file
	 * <p/>
	 *
	 * In the case of a directory all files within the directory (and all nested
	 * directories) will be added to the archive
	 *
	 * @param file The file(s if a directory) to compress
	 * @param output The resulting output file (should end in .tar.gz)
	 * @throws IOException
	 */
	public static void compressFile(File file, File output)
		throws IOException
	{
		ArrayList<File> list = new ArrayList<File>(1);
		list.add(file);
		compressFiles(list, output);
	}

	/**
	 * Compress (tar.gz) the input files to the output file
	 *
	 * @param files The files to compress
	 * @param output The resulting output file (should end in .tar.gz)
	 * @throws IOException
	 */
	public static void compressFiles(Collection<File> files, File output)
		throws IOException
	{
		// LOG and FILE_SEPARATOR are fields of the enclosing class (not shown here)
		LOG.debug("Compressing " + files.size() + " file(s) to " + output.getAbsoluteFile());
		// Create the output stream for the output file and wrap it in streams
		// that will tar and gzip everything. try-with-resources closes both
		// streams (tar/gzip first, then the file) even if adding a file throws.
		try (FileOutputStream fos = new FileOutputStream(output);
			TarArchiveOutputStream taos = new TarArchiveOutputStream(
				new GZIPOutputStream(new BufferedOutputStream(fos)))) {
			// Plain tar caps each entry at 8 gigs; the star extension gets around that
			taos.setBigNumberMode(TarArchiveOutputStream.BIGNUMBER_STAR);
			// Plain tar caps names at 100 characters, so enable GNU long file names
			taos.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU);

			// Get to putting all the files in the compressed output file
			for (File f : files) {
				addFilesToCompression(taos, f, ".");
			}
		}
	}

	/**
	 * Does the work of compression and going recursive for nested directories
	 * <p/>
	 *
	 * Borrowed heavily from http://www.thoughtspark.org/node/53
	 *
	 * @param taos The archive
	 * @param file The file to add to the archive
	 * @param dir The directory that should serve as the parent directory in the archive
	 * @throws IOException
	 */
	private static void addFilesToCompression(TarArchiveOutputStream taos, File file, String dir)
		throws IOException
	{
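		// Create an entry for the file, named relative to its parent in the archive
		String entryName = dir + FILE_SEPARATOR + file.getName();
		taos.putArchiveEntry(new TarArchiveEntry(file, entryName));
		if (file.isFile()) {
			// Copy the file's bytes into the archive; the input stream is
			// closed even if the copy fails
			try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
				IOUtils.copy(bis, taos);
			}
			taos.closeArchiveEntry();
		}
		else if (file.isDirectory()) {
			// A directory entry has no content, so close it right away
			taos.closeArchiveEntry();
			// Recurse into the directory, passing the full path so far so that
			// nested directories keep their place in the archive
			File[] children = file.listFiles();
			if (children != null) { // null when the directory cannot be read
				for (File childFile : children) {
					addFilesToCompression(taos, childFile, entryName);
				}
			}
		}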
	}
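
The dir argument decides where each entry ends up inside the archive, which matters once directories are nested. As a rough way to see that without writing a tar at all, the JDK-only sketch below mirrors the recursion but only collects the entry names. EntryNameDemo and collectNames are hypothetical names for this illustration, it uses "/" where my code uses FILE_SEPARATOR, and it assumes the full parent path (dir plus the directory's own name) is handed down on each recursive call.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class EntryNameDemo {

    /** Mirrors the recursion, but records entry names instead of writing tar entries. */
    public static void collectNames(List<String> names, File file, String dir) {
        String entryName = dir + "/" + file.getName();
        names.add(entryName);
        if (file.isDirectory()) {
            File[] children = file.listFiles();
            if (children != null) { // null when the directory cannot be read
                for (File child : children) {
                    // Hand down the full path so far, not just this directory's name
                    collectNames(names, child, entryName);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway tree: <tmp>/2024/01/report.txt
        File root = Files.createTempDirectory("archive").toFile();
        File nested = new File(root, "2024/01");
        if (!nested.mkdirs()) {
            throw new IOException("Could not create " + nested);
        }
        Files.write(new File(nested, "report.txt").toPath(), "hello".getBytes());

        List<String> names = new ArrayList<>();
        collectNames(names, root, ".");
        for (String name : names) {
            System.out.println(name);
        }
    }
}
```

If only the directory's own name were passed down instead of the full path, report.txt would be listed directly under 01 and lose the directories above it.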

Conclusion

As you can see, it’s pretty easy to create a compressed version of your files. Trust me when I say this is worth it. One project created 13 gigs of files a day for archiving. Once compressed, that was less than 2 gigs. At 11 gigs saved per day, that is a savings of 4,015 gigs a year (11 × 365). That’s huge for such a small thing.

Posted by sseaman in Java, Programming.

2 Responses to TAR and GZIP Compression in Java

  1. George H says:

    If you are interested in using even higher grade compression in Java, try checking out the 7zip Java API (http://www.7-zip.org/sdk.html). I’ve used these classes in my Java projects for compressing Java objects serialized to bytes. I’ve found that LZMA compresses by far the best of all the algorithms I tried.

    Also a good point on using the 7zip SDK… it’s public domain so no license worries 🙂

  2. leon says:

    Line 79: should be
    addFilesToCompression(taos, childFile, dir + File.separator + file.getName());
