So, I've had a chance to get away from Python and do something more interesting in C++ at work for the last couple days. I will describe the problem, with details abstracted away to protect the aesthetically guilty.
There is a certain class, which I will refer to as ItemList, which stores a list of items and allows it to be queried with the following operations: several getItemByX() methods which return the first item in the list matching the provided value of X, for different properties X of the items, and a getNthItem() method which retrieves items by an integer position. This class allows the list to be modified by a push_back() method which, following the STL naming convention, appends an item to the end of the list, a sort() method which sorts the list according to a particular order, one removeItem() method which searches by one of the values X queryable by the getItemByX() methods, and a removeOffendingItems() method which takes an arbitrary boolean test and removes all matching items.
This class uses an STL vector to store refcounting pointers to items, so we have a declaration something like this:
class ItemList {
public:
Item * getItemByID(UUID id) const;
Item * getItemByName(std::string name) const;
// etc.
Item * getNthItem(int n) const;
void push_back(Pointer<Item>& new_item);
void removeItem(UUID id);
void removeOffendingItems(ItemTest& test);
void sort(void);
protected:
std::vector<Pointer<Item> > mItemList;
};
These are all implemented in the obvious way, by iterating over the vector. The getItemByX() methods take O(n), the getNthItem() and push_back() methods take O(1), the removeItem() and removeOffendingItems() method take O(n), and the sort() method takes O(n*log(n)).
A certain piece of code which calls this class needs to iterate through the list of items, and then replace some of them (which match particular UUIDs) with different items, possibly renaming said replacements (to preserve uniqueness of names). The test for non-uniqueness of name takes O(n) with this implementation, and it needs to be done for some fixed proportion of all items, so the entire process takes O(n^2), which has proven unacceptably slow.
The solution, naturally, is to speed up getItemByX() to a more sensible O(log(n)), so this whole loop only needs O(n*log(n)). Unfortunately, the existence of more than one getItemByX() method, together with getNthItem() and its implied semantics about the constancy of the order of items makes the simplest option, replacing the vector with an STL map with Pointer as the value and UUID or std::string as the key, impossible. The next best option is to add an additional STL map for each value of X, indexing the vector. Thus, getItemByX() only needs O(log(n)), push_back() method now takes O(log(n)) for the index updates, and the other methods retain their original asymptotic performance.
It occurred to me, however, that to have such an index available and still do removeItem() by searching the entire vector was a bit silly. Using an STL vector still makes deletions take O(n), but if I could find the item to be deleted using the index, I could save one pass through the vector, at least. One needs an iterator to delete an item from an STL container, however, so to do it that way I would have to store iterators in the indices rather than additional refcounting pointers.
That won't work, though, because iterators into an STL vector refer to a particular index in the vector, and don't point to the same value after an insertion or deletion before that value. To get stable iterators, I would have to replace the vector with a list, and then I could put iterators in the indices and do removeItem() in O(log(n)), but at the cost of making getNthItem() take logarithmic time, which would violate the constraint of not slowing down any other operations by more than a factor of log(n). Thus, I'm going to leave the vector and just accept an O(n) removeItem() that can't use the indices, as silly as that seems.
If I were feeling really effortful there would be a better solution, though, which is the real point of this post: a data structure with the semantics of a doubly linked list (a sequence of items in a stable, unsorted order, insertions and deletions permitted at any point) and the performance characteristics of a balanced tree (all operations take O(log(n))). Consider a tree where each node has a pointer to its parent and its two children, a value and a count of all the nodes, including itself, in the subtree rooted at that node. Now provide tree rotation operations which adjust the counts accordingly, and use them to keep the tree balanced. You can insert and delete in arbitrary positions in O(log(n)), iterate forward or backward from any node in O(log(n)), so it's no more than log(n) worse than a doubly linked list, but you could also binary-search down the tree using the node counts to find the nth node in O(log(n)), so the integer index lookup is no more than log(n) slower than an array.
Someone must have invented something like this before, but I hadn't ever encountered it, so I thought I'd write it down for future reference. I'm not going to do it for this particular task; it's overkill, and balanced trees are a pain to implement, to say nothing of balanced trees with the interface of an STL container.

This entry was originally posted at http://puellavulnerata.dreamwidth.org/127832.html. Please comment there using OpenID.