
Quick Tip #4: Sorting Large Files

With traditional Unix sort(1), the size of the files you can sort is limited by the amount of available main memory. As soon as the files get larger and your system has to swap, performance degrades significantly. Even GNU sort, which uses temporary files to get around this limitation, doesn't sort in parallel. The only viable option for sorting very large files efficiently is to split them, sort the individual parts in parallel, and merge the results.

First, you have to split the input at line boundaries because sort works line-oriented. Fortunately, most split(1) utilities today (like GNU split) provide an -l switch. Example:

  $ split -l 100000 input input-

This splits the input file into chunks of 100000 lines. The chunks are named input-aa, input-ab, etc. You'll have to experiment with chunk sizes to see what works well for your problem.

Now sort the individual files using whatever flags you need:

  $ sort input-aa > sorted-input-aa

To speed things up, you can parallelize this step. For example, on a quad-core box you'd typically run four sort processes in parallel, because sorting is CPU-bound if your input is large enough.
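
One way to do this without extra tooling is xargs with its -P switch (a sketch assuming GNU or BSD xargs; -P is a common extension, not required by POSIX):

  $ ls input-* | xargs -P 4 -I {} sh -c 'sort {} > sorted-{}'

Each chunk gets its own sort process, but at most four of them run at the same time.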

As soon as all files are sorted, we merge them using sort's -m flag:

  $ sort -m sorted-input-* > sorted-input

Note that you have to use the same flags for the merge that you used to sort the chunks, otherwise the result won't be correct!
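
For example, if the chunks were sorted numerically, the merge has to be numeric as well (a quick sketch using sort's -n flag):

  $ sort -n input-aa > sorted-input-aa      # ... and so on for every chunk
  $ sort -mn sorted-input-* > sorted-input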

This whole process is pretty simple and you can script it easily. However, if your files get really big (several hundred GB and more) and you start thinking about parallelizing across multiple machines, you might want to look into using a MapReduce cluster.
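
For reference, here is a minimal sketch of such a script, reusing the chunk size and file names from the examples above (and plain background jobs instead of a fixed number of workers):

  #!/bin/sh
  # Split the input into 100000-line chunks named input-aa, input-ab, ...
  split -l 100000 input input-

  # Sort every chunk in its own background process and wait for all of them.
  for f in input-*; do
    sort "$f" > "sorted-$f" &
  done
  wait

  # Merge the sorted chunks into the final result.
  sort -m sorted-input-* > sorted-input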


