Memory-efficient list of integers with fast search and slow insert/delete

1

I'm trying to find the best data structure for a sorted list of positive integers (millions of elements). Requirements are (in order of importance):

  1. Small memory footprint

  2. Fast O(log n) search

  3. Insert/delete faster than memcpy()

I'm thinking about keeping two arrays: one for search and one for insert. Every few operations I will re-organize the main one and clean the second one. Any thoughts? Am I on the right track?

p.s. There are no duplicates. It doesn't need to be thread-safe. Reads will happen very often, writes very rarely. Integers are distributed in the structure unevenly, meaning that some structures will contain just a few elements, while others may have millions of them, with values anywhere from zero to 0xFFFFFFFF.
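Roughly what I have in mind (just a sketch; the class name and merge threshold are arbitrary, and only insert is shown, but delete could keep a similar buffer of pending removals):

import java.util.Arrays;

// Large sorted array for binary search, plus a small sorted buffer for recent
// inserts; the buffer is merged back into the main array every few writes.
public class TwoArraySet {
    private int[] main = new int[0];      // large, sorted
    private int[] buffer = new int[0];    // small, sorted, holds recent inserts
    private static final int MERGE_AT = 1024;

    public boolean contains(int value) {
        return Arrays.binarySearch(main, value) >= 0
            || Arrays.binarySearch(buffer, value) >= 0;
    }

    public void insert(int value) {
        if (contains(value)) return;                       // no duplicates
        buffer = Arrays.copyOf(buffer, buffer.length + 1);
        buffer[buffer.length - 1] = value;
        Arrays.sort(buffer);                               // buffer stays small, so this is cheap
        if (buffer.length >= MERGE_AT) merge();
    }

    private void merge() {
        int[] merged = new int[main.length + buffer.length];
        int i = 0, j = 0, k = 0;
        while (i < main.length && j < buffer.length)
            merged[k++] = main[i] <= buffer[j] ? main[i++] : buffer[j++];
        while (i < main.length) merged[k++] = main[i++];
        while (j < buffer.length) merged[k++] = buffer[j++];
        main = merged;
        buffer = new int[0];
    }
}

Search stays O(log n) over both arrays, and the full memcpy-style cost is only paid once per merge.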

java
algorithm
design-patterns
data-structures
asked on Stack Overflow Jul 2, 2012 by yegor256 • edited Jul 2, 2012 by yegor256

7 Answers

2

I think you want to use a van Emde Boas tree.

It has the following characteristics:

Space   O(M)
Search  O(log log M)
Insert  O(log log M)
Delete  O(log log M)
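A minimal membership-only sketch in Java, in case it helps (lazily allocated clusters, no delete or successor, non-negative keys assumed; a production version needs more care):

class VebTree {
    private final int w;                 // number of bits in this universe (size 2^w)
    private int min = -1, max = -1;      // -1 means "empty"; min is never stored in a cluster
    private VebTree summary;             // tracks which clusters are non-empty
    private final VebTree[] clusters;    // sub-trees over the low w/2 bits, allocated lazily

    VebTree(int bits) {                  // e.g. new VebTree(32) for int keys
        this.w = bits;
        this.clusters = bits > 1 ? new VebTree[1 << (bits - bits / 2)] : null;
    }

    private int high(int x) { return x >>> (w / 2); }
    private int low(int x)  { return x & ((1 << (w / 2)) - 1); }

    boolean member(int x) {
        if (x == min || x == max) return true;
        if (w == 1) return false;
        VebTree c = clusters[high(x)];
        return c != null && c.member(low(x));
    }

    void insert(int x) {
        if (min == -1) { min = max = x; return; }       // first element
        if (x == min || x == max) return;               // no duplicates
        if (x < min) { int t = x; x = min; min = t; }   // new min; push the old min down
        if (w > 1) {
            int h = high(x), l = low(x);
            if (clusters[h] == null) clusters[h] = new VebTree(w / 2);
            if (clusters[h].min == -1) {                // cluster was empty: note it in the summary
                if (summary == null) summary = new VebTree(w - w / 2);
                summary.insert(h);
            }
            clusters[h].insert(l);                      // O(1) when the cluster was empty
        }
        if (x > max) max = x;
    }
}

For 32-bit keys the recursion halves the number of bits at each level, so it bottoms out after about five levels, which is where the O(log log M) bounds come from.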
answered on Stack Overflow Jul 2, 2012 by Woot4Moo
1

Could you use a char[65536][], where the top or bottom 16 bits are an index into an array of the other 16 bits? This could use less than 4 bytes per entry.

Lookup

 private final char[][] bitsArray = new char[65536][];

 public int countFor(int num) {
     int topBits = num >>> 16;          // bucket index: upper 16 bits
     int lowerBits = num & 0xFFFF;      // value stored in the bucket: lower 16 bits
     char[] lowerBitsArray = bitsArray[topBits];
     if (lowerBitsArray == null)        // bucket never populated
         return 0;
     int count = 0;
     for (char l : lowerBitsArray)
        if (l == lowerBits)
           count++;
     return count;
 }
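For completeness, one way bitsArray could be populated (the addValue helper and the grow-by-copy strategy are just an illustration):

 public void addValue(int num) {
     int topBits = num >>> 16;
     char lowerBits = (char) (num & 0xFFFF);
     char[] bucket = bitsArray[topBits];
     if (bucket == null) {                 // first value in this bucket
         bitsArray[topBits] = new char[] { lowerBits };
         return;
     }
     // append by copying; acceptable since the question says writes are rare
     char[] grown = java.util.Arrays.copyOf(bucket, bucket.length + 1);
     grown[bucket.length] = lowerBits;
     bitsArray[topBits] = grown;
 }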

If the count can never be more than 1, a BitSet is likely to be a better choice (possibly an array of BitSets, depending on the pattern of the data). E.g. if you were to record IP addresses seen, you might not need to worry about 0.*, 10.*, 127.* or 224-255.*


Here is a test of whether an int[] or a char[] is faster to access, including the cast to int.

public static void main(String... args) {
    char[] chars = new char[1000000];
    for (int i = 0; i < 5; i++)
        timeSum(chars);
    int[] ints = new int[1000000];
    for (int i = 0; i < 5; i++)
        timeSum(ints);
}

private static int timeSum(char[] chars) {
    long start = System.nanoTime();
    int sum = 0;
    for (char ch : chars) {
        sum += ch;
    }
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d us to sum %,d chars%n", time / 1000, chars.length);
    return sum;
}

private static int timeSum(int[] ints) {
    long start = System.nanoTime();
    int sum = 0;
    for (int i : ints) {
        sum += i;
    }
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d us to sum %,d ints%n", time / 1000, ints.length);
    return sum;
}

prints

Took 5,378 us to sum 1,000,000 chars
Took 11,551 us to sum 1,000,000 chars
Took 437 us to sum 1,000,000 chars
Took 407 us to sum 1,000,000 chars
Took 407 us to sum 1,000,000 chars
Took 5,539 us to sum 1,000,000 ints
Took 532 us to sum 1,000,000 ints
Took 530 us to sum 1,000,000 ints
Took 511 us to sum 1,000,000 ints
Took 507 us to sum 1,000,000 ints

My conclusion is that cache efficiency is more important than cast cost.

answered on Stack Overflow Jul 2, 2012 by Peter Lawrey • edited Jul 4, 2012 by Peter Lawrey
1

This is actually an interesting and non-trivial problem. The optimal answer will depend on your specific requirements and the precise mix of operations you perform.

If the data is dense and duplicates are not allowed, then a big bitmap may be optimal. Just set a bit to show the presence / absence of each possible integer value. This approach will be very fast and O(1) for both reads and writes, but memory usage will obviously be driven by how large a range you have / how sparse your data tends to be.
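For example, a minimal sketch of the bitmap approach with java.util.BitSet (the class name is illustrative):

import java.util.BitSet;

// One bit per possible value; memory is ~(range / 8) bytes regardless of how
// many values are actually present. Works for non-negative int keys.
public class IntBitmap {
    private final BitSet bits;

    public IntBitmap(int maxValueExclusive) {
        this.bits = new BitSet(maxValueExclusive);
    }

    public void add(int value)         { bits.set(value); }
    public void remove(int value)      { bits.clear(value); }
    public boolean contains(int value) { return bits.get(value); }
    public int size()                  { return bits.cardinality(); }
}

A BitSet over all non-negative ints is roughly 268 MB (2^31 bits), so this only pays off when the data is reasonably dense.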

If the data is dense and duplicates are allowed / common, then an array storing the number of occurrences for each possible value may work well. It is similar in performance to the bitmap approach; however, you probably need 32x as much memory, assuming ints for the occurrence counts.

If you are read-heavy and the data is sparse, then a sorted-array-based approach (with binary search for lookup) may be best. If you have knowledge about the rough distribution of the values, you may be able to go even faster by using heuristics to guess the likely position of the target value in the array (e.g. you can significantly beat log2(N) if you exploit the knowledge that the distribution is roughly uniform).
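A sketch of the plain binary-search version (the distribution-guessing heuristics are left out; insert shows the memcpy-style shift, which is fine if writes are rare):

import java.util.Arrays;

// Sorted, duplicate-free array: binary search for lookup, full copy for the rare insert.
public class SortedIntArray {
    private int[] data = new int[0];

    public boolean contains(int value) {
        return Arrays.binarySearch(data, value) >= 0;
    }

    public void insert(int value) {
        int pos = Arrays.binarySearch(data, value);
        if (pos >= 0) return;                    // already present (no duplicates)
        int insertAt = -(pos + 1);               // decode the insertion point
        int[] next = new int[data.length + 1];
        System.arraycopy(data, 0, next, 0, insertAt);
        next[insertAt] = value;
        System.arraycopy(data, insertAt, next, insertAt + 1, data.length - insertAt);
        data = next;
    }
}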

If you have a lot of writes and data is sparse then you probably want a tree-based structure that splits based on subsets of the bits in your integers (e.g. a 32-way trie splitting on the next most significant 5 bits at each node). Clojure's persistent data structures use this technique to great effect.
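A very rough sketch of such a trie, mutable and uncompressed (real implementations such as Clojure's compress the child arrays with a bitmap):

// 32-way trie branching on 5 bits per level; 7 levels cover all 32 key bits.
// Only insert and membership are shown; node layout is illustrative.
public class FiveBitTrie {
    private static final int BITS = 5, FANOUT = 1 << BITS;

    private static final class Node {
        Node[] children;          // allocated lazily for interior nodes
        boolean present;          // set on the node reached by the last chunk of bits
    }

    private final Node root = new Node();

    public void insert(int key) {
        Node n = root;
        for (int shift = 30; shift >= 0; shift -= BITS) {   // chunks: bits 31-30, 29-25, ..., 4-0
            int idx = (key >>> shift) & (FANOUT - 1);
            if (n.children == null) n.children = new Node[FANOUT];
            if (n.children[idx] == null) n.children[idx] = new Node();
            n = n.children[idx];
        }
        n.present = true;
    }

    public boolean contains(int key) {
        Node n = root;
        for (int shift = 30; shift >= 0; shift -= BITS) {
            int idx = (key >>> shift) & (FANOUT - 1);
            if (n.children == null || n.children[idx] == null) return false;
            n = n.children[idx];
        }
        return n.present;
    }
}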

answered on Stack Overflow Jul 2, 2012 by mikera • edited Jul 2, 2012 by mikera
1

I think @Peter Lawrey has a good start: subdivide. Partly to be different, I'd subdivide into 256 things, each of which tracks 2^23 things. Depending on the distribution of your integers, either use the top or bottom 8 bits to subdivide.

As for the sub-structures, start with a Set (or similar) while the ints are sparse. But once that Set reaches a certain size (i.e. it starts getting dense), switch to a BitSet. I don't know if you will need to support removing values, in which case you might also need to switch back from a BitSet to a Set.
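A rough sketch of one such sub-structure, as described above (the switch-over threshold and the names are arbitrary):

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// One bucket covering 2^23 values: HashSet while sparse, BitSet once dense.
class HybridBucket {
    private static final int RANGE = 1 << 23;     // offsets 0 .. 2^23-1 within this bucket
    private static final int SWITCH_AT = 100000;  // arbitrary density threshold

    private Set<Integer> sparse = new HashSet<Integer>();
    private BitSet dense;                         // non-null once we have switched

    void add(int offset) {
        if (dense != null) { dense.set(offset); return; }
        sparse.add(offset);
        if (sparse.size() >= SWITCH_AT) {         // got dense enough: migrate to a BitSet
            dense = new BitSet(RANGE);
            for (int v : sparse) dense.set(v);
            sparse = null;
        }
    }

    boolean contains(int offset) {
        return dense != null ? dense.get(offset) : sparse.contains(offset);
    }
}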

p.s. If all else fails, a simple BitSet of all the positive integers is "only" 268MB (If my calculations are right...)

answered on Stack Overflow Jul 2, 2012 by user949300 • edited Jul 2, 2012 by user949300
0

What about a linked list? The memory cost would be the size of the ints plus a little overhead for the previous and next pointers. As far as inserting and deleting, the time required is only to go down the list until you find one smaller than the one you're inserting and put the new one just before that record. Deleting would only require the previous and next pointers to change, and searching would be just as easy as inserting.
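For example (a sketch with java.util.LinkedList, keeping ascending order):

import java.util.LinkedList;
import java.util.ListIterator;

public class SortedLinkedList {
    private final LinkedList<Integer> list = new LinkedList<Integer>();

    public void insert(int value) {
        ListIterator<Integer> it = list.listIterator();
        while (it.hasNext()) {
            int current = it.next();
            if (current == value) return;            // no duplicates
            if (current > value) { it.previous(); break; }
        }
        it.add(value);                               // splice is O(1), but the walk is linear
    }

    public boolean contains(int value) {
        return list.contains(value);                 // also a linear scan
    }
}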

answered on Stack Overflow Jul 2, 2012 by Matt Westlake
0

If you are not too worried about speed, and are petrified of memory usage, you can load an array of ints in, create another array, sort through the array until you have some number X of elements (1K or so, to prevent memory overload), and then save that part of the array to a file (ObjectOutputStream will save ints as ints), clear the array, and do the same for the next X ints. Just make sure you open the output stream in append mode (true) rather than overwrite, which is the default.

answered on Stack Overflow Jul 2, 2012 by Matt Westlake
0

You could look at some of the modern generation of tries (that link doesn't mention fusion trees). However, I think they're all pretty complicated to implement. If you're lucky, you might find that some bold individual has written and open-sourced an implementation you can use.

Another thing to look at would be a classic B-tree.

If your data sets are of a relatively consistent size, you could even write a one-level B-tree (so with a single root node and multiple child nodes), which simplifies the implementation quite a bit (in that you can just store an int[][], and replace the internal keys with peeks into the leaves, if that makes any sense).
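Something like this for the lookup side (leaf splitting on insert is left out, and each leaf is assumed non-empty):

import java.util.Arrays;

// Root array of sorted leaf blocks, using each leaf's first key as the separator.
public class OneLevelBTree {
    private final int[][] leaves;   // each leaf sorted; leaves ordered by first element
    private final int[] firstKeys;  // firstKeys[i] == leaves[i][0]

    public OneLevelBTree(int[][] leaves) {
        this.leaves = leaves;
        this.firstKeys = new int[leaves.length];
        for (int i = 0; i < leaves.length; i++) firstKeys[i] = leaves[i][0];
    }

    public boolean contains(int value) {
        int pos = Arrays.binarySearch(firstKeys, value);
        int leaf = pos >= 0 ? pos : -(pos + 1) - 1;   // last leaf whose first key <= value
        if (leaf < 0) return false;                    // smaller than everything stored
        return Arrays.binarySearch(leaves[leaf], value) >= 0;
    }
}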

answered on Stack Overflow Jul 4, 2012 by Tom Anderson

User contributions licensed under CC BY-SA 3.0