Sorting Lists with More Than 1.1M Rows

Recently I ran into a very weird problem. My wife works with satellite data, which produces huge text files of numbers separated by spaces. The total number of rows exceeds 1.1 million. So how can we sort these huge lists by multiple criteria?

If you try LibreOffice, you will notice very quickly that the maximum number of rows is 1,048,576. Anything beyond that number is lost. You can of course split the list, but then you can't simply sort the numbers. Besides, LibreOffice has a stupid limitation of only 3 sort criteria.

The solution is called «Use the damn terminal!» The actual command is «sort», and with a few parameters you get the whole file back with your values sorted in a few seconds.

How? Let's say my file is called foo.txt and has 5 columns separated by spaces. You want to sort this file first by the 5th column and then by the 2nd, 3rd and 4th.

$ sort -k5n,5 -k2n,2 -k3n,3 -k4n,4 foo.txt > foo_new.txt

Bam… Done!
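If you want to see what the keys do before running it on the real file, here is a small sketch on made-up data. The file names sample.txt and sample_sorted.txt and the five rows are my own placeholders, not the satellite data. Each -kNn,N says "the key starts and ends at field N, compare it numerically", and the keys are applied in the order you write them.

```shell
# Hypothetical five-row sample standing in for the real data
# (space-separated, 5 columns; the first column is just a row label).
cat > sample.txt <<'EOF'
10 3 7 1 2
11 1 9 4 2
12 1 9 2 2
13 5 0 0 1
14 2 2 2 10
EOF

# Same keys as the command above: column 5 first, then columns 2, 3 and 4,
# each one compared as a number.
sort -k5n,5 -k2n,2 -k3n,3 -k4n,4 sample.txt > sample_sorted.txt

cat sample_sorted.txt
```

Row 13 comes first because it has the smallest 5th column. Rows 12, 11 and 10 share a 5th column of 2, so they are ordered by columns 2, 3 and 4. Row 14 comes last because 10 is compared as a number, not as text.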

For reference, here are the options from the sort man page. Mandatory arguments to long options are mandatory for short options too.

Ordering options:

-b, --ignore-leading-blanks
ignore leading blanks
-d, --dictionary-order
consider only blanks and alphanumeric characters
-f, --ignore-case
fold lower case to upper case characters
-g, --general-numeric-sort
compare according to general numerical value
-i, --ignore-nonprinting
consider only printable characters
-M, --month-sort
compare (unknown) < 'JAN' < ... < 'DEC'
-n, --numeric-sort
compare according to string numerical value
-r, --reverse
reverse the result of comparisons
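To make the difference between the ordering options concrete, here is a quick sketch with tiny made-up inputs. The variable names are just for illustration, and the scientific-notation values are an assumption about what such data might contain, not something taken from the real files.

```shell
# Tiny made-up inputs, just to show how the comparison modes differ.
text_order=$(printf '9\n10\n100\n' | sort)         # no option: compared as text
numeric_order=$(printf '9\n10\n100\n' | sort -n)   # -n: compared as numbers
reverse_order=$(printf '9\n10\n100\n' | sort -nr)  # -n with -r: numbers, descending
general_order=$(printf '1e3\n5\n2e1\n' | sort -g)  # -g: also parses exponents like 1e3

printf 'default:\n%s\n\n' "$text_order"
printf 'with -n:\n%s\n\n' "$numeric_order"
printf 'with -nr:\n%s\n\n' "$reverse_order"
printf 'with -g:\n%s\n' "$general_order"
```

Without -n the lines come out as 10, 100, 9, because "1" sorts before "9" as a character. That is why every key in the command above carries the n flag.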

Other options:

-c, --check
check whether input is sorted; do not sort
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
-m, --merge
merge already sorted files; do not sort
-o, --output=FILE
write result to FILE instead of standard output
-s, --stable
stabilize sort by disabling last-resort comparison
-S, --buffer-size=SIZE
use SIZE for main memory buffer
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
-T, --temporary-directory=DIR
use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories
-u, --unique
with -c: check for strict ordering
otherwise: output only the first of an equal run
-z, --zero-terminated
end lines with 0 byte, not newline
--help
display this help and exit
--version
output version information and exit
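A few of these are handy for big files, so here is a rough sketch of how they fit together. The comma-separated sample and the file names are made up for the example (the file in this post is space-separated), and the 64M buffer size is an arbitrary placeholder you would tune for your own machine.

```shell
# Hypothetical comma-separated sample, only here to show -t.
printf '3,b\n1,c\n2,a\n' > sample.csv

# -t sets the field separator, -o writes the result to a file
# instead of standard output.
sort -t, -k1n,1 -o sample_sorted.csv sample.csv

# -c does not sort, it only checks; exit status 0 means "already sorted".
if sort -t, -k1n,1 -c sample_sorted.csv; then
    echo "sample_sorted.csv is sorted by column 1"
fi

# -S sets the memory buffer and -T the directory for temporary files,
# which matters once the input no longer fits in memory.
sort -S 64M -T "${TMPDIR:-/tmp}" -t, -k2,2 -o sample_by_name.csv sample.csv

cat sample_by_name.csv
```

Using -o instead of a shell redirect also lets you write back to the same file you are reading from, since sort reads all of its input before it opens the output file.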