I have a complex sorting situation that I'd like to optimize. I have a working implementation but I think it could be better. I have a large list (millions) of strings. There can and are likely many redundant entries. I'd like to take this list as input and output a list that is sorted, unique, and with a count of the number of times a particular item appeared in the original list. So for example, if I had a list like: 3 5 2 3 9 1 1 3 3 I'd like to output (1, 2) (2, 1) (3, 4) (5, 1) (9, 1) (first number is the item, second number is number of occurances, note it is sorted by the first number). Output format is irrelevant, I just used this format to be concise. The main problem here is that the original list is too large to fit into memory in many cases, so I cannot sort the list in place. Does anyone have any suggestions for an efficient solution? Let me explain my current solution. I'm using a PHP command line script in Linux (not the most ideal programming language, but it's associative arrays and easy file handling are nice). I read data in, using an array with the item being the key, and value being the number of occurrances. When I find I'm nearing the memory limit, I sort the keys of the array using PHP's ksort function, and spit the array out into a temp file, and purge the data from memory, and start fresh continuing through the data. I do this until I'm done with the data, and then I do a merge on all of the temp files. When I use this function on a data set small enough to fit into memory (and therefore skip the caching to disk and merging part), it's faster than doing the equivalent sort and uniq -c commands in linux. However the sort command is more memory efficient. Once I get into the problem of running out of memory, it takes longer for data that would be small enough for the Linux sort command. The AVL tree idea seems interesting. I will explore this option a bit and let you know how that goes. I however do not think it effectively solves the problem of the data being too large to fit entirely in memory.