taro

Published February 27, 2019

Normally, extracting a tar archive leaves the archive itself on disk next to the extracted files, so a full extraction needs roughly twice the archive's size in disk space. taro is a tool that extracts tar archives in-place. taro can extract an entire archive with only 512 extra bytes of disk space overhead, so it's ideal when disk space is limited.
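To make the idea concrete, here is a minimal Python sketch of one way in-place extraction can work. It is not taro's actual algorithm, which this post does not describe; it is an illustration under stated assumptions. It indexes the archive once, then walks the members from last to first, copying each file's data out tail-first in chunks and truncating the archive after every chunk. The function name `extract_in_place` and the `chunk` parameter are made up for this example. The peak overhead is roughly one chunk rather than 512 bytes, it assumes the filesystem supports sparse files (each output file is written from its end backwards), and it only handles directories and regular files.

```python
import os
import tarfile


def extract_in_place(archive_path, dest_dir, chunk=1 << 20):
    """Extract an uncompressed tar archive, shrinking it as data is copied out.

    Illustrative sketch only: symlinks, hard links, sparse members, and file
    metadata (mode, owner, mtime) are ignored, and a crash midway leaves both
    the archive and the output incomplete.
    """
    # Pass 1: read every header to learn each member's position. Nothing is
    # extracted yet. "r:" refuses compressed archives, which cannot be
    # truncated piecewise like this.
    with tarfile.open(archive_path, "r:") as tar:
        members = tar.getmembers()

    dest_root = os.path.realpath(dest_dir)
    with open(archive_path, "r+b") as archive:
        # Pass 2: go from the last member to the first, so truncating the
        # archive never removes data that still has to be read.
        for m in reversed(members):
            target = os.path.realpath(os.path.join(dest_root, m.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {m.name}")

            if m.isdir():
                os.makedirs(target, exist_ok=True)
            elif m.isreg() and not m.issparse():
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with open(target, "wb") as out:
                    end = m.size
                    while end > 0:
                        start = max(0, end - chunk)
                        # m.offset_data is where this member's data begins.
                        archive.seek(m.offset_data + start)
                        data = archive.read(end - start)
                        out.seek(start)
                        out.write(data)
                        out.flush()
                        # The chunk now lives in the output file, so drop it
                        # from the archive.
                        archive.truncate(m.offset_data + start)
                        end = start

            # Drop this member's header block(s) as well.
            archive.truncate(m.offset)

    os.remove(archive_path)
```

Calling `extract_in_place("data.tar", "out")` would leave the extracted tree under `out` and delete `data.tar`. Because the archive is destroyed as it goes, a sketch like this should only ever be tried on a copy.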

origin

I was the system administrator for the Northeastern University team at the Student Cluster Competition at SC18, where we had to build a 3kW computing cluster and benchmark its performance on unknown datasets.

At the competition, the datasets were made available in a single 190GB tar archive.

We had a four-node cluster, with just a single 256GB drive on each node. The archive would fit, but it'd be close. With great care, I managed to clear up enough space on one node to store the entire archive, and downloaded it (excruciatingly slowly!) from the competition FTP server through a Raspberry Pi router.

It was here that I discovered GNU tar has no way to perform an in-place extraction. I could selectively extract files, but the original archive would still take up the majority of the drive, and we needed the entire dataset before we could run anything with it!

Due to competition rules, we were not allowed to add external drives or change our network configuration. We attempted to transfer the archive to another machine, extract it there, and then transfer the extracted files back. Unfortunately, the network speed was a limiting factor, and we eventually ran out of time. After this ordeal, I was inspired to build an in-place tar archive extractor. It may not be necessary in most cases, but hopefully it can save someone in a time of need.

If taro is useful for you, I'd love to hear your story.
