Natwerk Designs

Best algorithm to sort and unique a large list of items?

I have a complex sorting situation that I'd like to optimize. I have a working implementation but I think it could be better. I have a large list (millions) of strings. There can and are likely many redundant entries. I'd like to take this list as input and output a list that is sorted, unique, and with a count of the number of times a particular item appeared in the original list. So for example, if I had a list like: 3 5 2 3 9 1 1 3 3 I'd like to output (1, 2) (2, 1) (3, 4) (5, 1) (9, 1) (first number is the item, second number is number of occurances, note it is sorted by the first number). Output format is irrelevant, I just used this format to be concise. The main problem here is that the original list is too large to fit into memory in many cases, so I cannot sort the list in place. Does anyone have any suggestions for an efficient solution? Let me explain my current solution. I'm using a PHP command line script in Linux (not the most ideal programming language, but it's associative arrays and easy file handling are nice). I read data in, using an array with the item being the key, and value being the number of occurrances. When I find I'm nearing the memory limit, I sort the keys of the array using PHP's ksort function, and spit the array out into a temp file, and purge the data from memory, and start fresh continuing through the data. I do this until I'm done with the data, and then I do a merge on all of the temp files. When I use this function on a data set small enough to fit into memory (and therefore skip the caching to disk and merging part), it's faster than doing the equivalent sort and uniq -c commands in linux. However the sort command is more memory efficient. Once I get into the problem of running out of memory, it takes longer for data that would be small enough for the Linux sort command. The AVL tree idea seems interesting. I will explore this option a bit and let you know how that goes. I however do not think it effectively solves the problem of the data being too large to fit entirely in memory.

Public Comments

  1. [UPDATED 2] Ok, here is a solution: Read the data into an AVL tree. Your tree node should store a word as well as the number of occurances. Your insert function should insert the word if it doesn't exist in the tree. If the word already exists, then your function should increment the number of occurances of that word without inserting it again. Now, to print the sorted data you should use inorder traversal. Using an AVL tree rather than a regular binary search tree will ensure that your tree in balanced and thus your insertion will take O(logn) (that's fast) and printing the data in a sorted fashion is going to take O(n) (that's fast too). If you need to know what an AVL tree is and how to implement it, I can help you with that. Let me know your decision. You are right that this AVL solution doesn't solve the memory problem. I can't think of a way to do this other than the merge files approach you described. However, using AVL trees rather than arrays is still faster to sort and thus write to files.
Powered by Yahoo! Answers