Implementing OpenDataStructures in Rust

In safe Rust, I've implemented Open Data Structures. In this article, I'd like to write about my motivation, how to proceed and what was tricky to implement.

Repository

https://github.com/o8vm/ods

Since this project is also a learning project of Rust, it has been implemented using only Safe Rust. In other words, I implemented Linked-List in Rc.
This project also has simple test codes, and you can try it out with cargo test.

The second half of this project code may be more polished than the first half, as I became more proficient in Rust during the implementation. I think I implemented the best is RedBlackTree, which becomes my favorite data structure.

Motivation

I wanted to learn about data structures and Rust simultaneously, but I couldn't find suitable teaching materials. Open Data Structures has implementations in various languages, but not Rust, so I decided to study it while implementing it myself. Another reason was that there were calls for a Rust implementation in the Japanese translation of the ODS book.

https://lambdanote.com/blogs/news/article-6

『みんなのデータ構造』を読んでRustで実装した」といった報告をお待ちしています！

How to Read the ODS book

I thought it would be a good idea for beginners to read Open Data Structures in order from the beginning. If you don't understand the first chapter, you won't understand the subsequent chapters, so I recommend reading at least the first chapter.

I read the chapters and implemented them when I understood them. Also, when I encountered complicated data structures, there were times when I could understand them easily by drawing pictures, so I recommend drawing pictures anyway.

In the following sections, I will focus on implementing the code in Rust.

Implementation Overview for each Chapter

Chapter 1: Interfaces

The first chapter also explains the basic data structure interfaces.
Note that in Rust, common behaviors can be defined using trait, so I used trait to represent each interface first. See interface.rs for details.


pub trait Queue<T> {
    fn add(&mut self, x: T);
    fn remove(&mut self) -> Option<T>;
}

pub trait Stack<T> {
    fn push(&mut self, x: T);
    fn pop(&mut self) -> Option<T>;
}

pub trait List<T: Clone> {
    fn size(&self) -> usize;
    fn get(&self, i: usize) -> Option<T>;
    fn set(&mut self, i: usize, x: T) -> Option<T>;
    fn add(&mut self, i: usize, x: T);
    fn remove(&mut self, i: usize) -> Option<T>;
}

pub trait USet<T: PartialEq + Clone> {
    fn size(&self) -> usize;
    fn add(&mut self, x: T) -> bool;
    fn remove(&mut self, x: &T) -> Option<T>;
    fn find(&self, x: &T) -> Option<T>;
}

pub trait SSet<T: PartialOrd + Clone> {
    fn size(&self) -> usize;
    fn add(&mut self, x: T) -> bool;
    fn remove(&mut self, x: &T) -> Option<T>;
    fn find(&self, x: &T) -> Option<T>;
}

pub trait Graph {
    fn add_edge(&mut self, i: usize, j: usize);
    fn remove_edge(&mut self, i: usize, j: usize);
    fn has_edge(&self, i: usize, j: usize) -> bool;
    fn out_edges(&self, i: usize) -> Vec<usize>;
    fn in_edges(&self, i: usize) -> Vec<usize>;
}

Trait bound can specify the characteristics of the data handled by each interface. For example, the SSet interface requires PartialOrd or Ord, and the USet requires PartialEq or Eq.

Note that due to the use of RefCell in lists, trees, etc., Clone is also specified as a trait bound because it is not possible to return a reference from borrow(), so it is decided to return the clone()ed value by get() or find(). See this article for related info.

It would have been nice if I could have used impl Deref<target = T> in the return value of the associated type or trait method, but that doesn't seem to be available now, so this is the way it is. If there is a better way, please let me know. For details, see this article.

Chapter 2: Array-Based Lists

Array-Based Lists use Boxed Slice because the implementation of resize(), etc., is also done to match the book's contents as much as possible. In addition, I wanted the default state to be None of Option when no element exists, so for example, ArrayStack is defined as follows:


pub struct Array<T> {
    a: Box<[Option<T>]>,
    n: usize,
}

Also, since the other data structures in this chapter are based on ArrayStack, which implements the List interface, it is easy to implement the rest of the data structures in this chapter if implementing ArrayStack is completed.

By the way, for the type T, Clone is specified as a trait bound for later linked lists and tree structures, so get() and find() do not return a reference. So, for example, in the List interface implemented in ArrayStack, get() does not return a reference but returns the clone()ed value wrapped in an Option.


    fn get(&self, i: usize) -> Option<T> {
        self.a.get(i)?.as_ref().map(|x| x.clone())
    }

For this reason, it is important that the RootishArrayStack which uses the above is based on Rc<[T]>:


pub struct Array<T> {
    blocks: ArrayStack<Rc<[RefCell<Option<T>>]>>,
    n: usize,
}

Therefore, the List interface can be implemented according to the inner mutability pattern:


    fn set(&mut self, i: usize, x: T) -> Option<T> {
        let b = Self::i2b(i);
        let j = i - b * (b + 1) / 2;
        self.blocks.get(b)?[j].borrow_mut().replace(x)
    }

Chapter 3 & 4: Linked Lists

Since a linked list is a pattern with multiple owners for each node, it's not that difficult if we remember to use Rc and RefCell. Here I have defined the following types to represent the links for each node:


type Link<T> = Option<Rc<RefCell<Node<T>>>>;

SkipList looks complicated, but it is pretty simple. What was more challenging to think about in Rust was the DLList and SEList, which are doubly linked lists.
If implemented only in Rc, stack overflow will occur in the debug output due to circular references. To overcome this issue, use two dummy nodes and Week instead of Rc for representing the backword link. In this way, circular references can be avoided.


type Link<T> = Option<Rc<RefCell<Node<T>>>>;
type Wink<T> = Option<Weak<RefCell<Node<T>>>>;

#[derive(Clone, Debug, Default)]
pub struct DLList<T> {
    head: Link<T>,
    tail: Wink<T>,
    n: usize,
}

#[derive(Clone, Debug, Default)]
pub struct Node<T> {
    x: T,
    next: Link<T>,
    prev: Wink<T>,
}

Since I used Option<Rc<RefCell<Node<T>>>> to represent the links, the amount of code to write is a little more than before.

In addition, when I was implementing the tree structure later, I thought that the method chain would have been a little shorter if the links were represented by the RefCell<Option<Rc<Node<T>>>> type. The XFastTrie trie tree also uses a doubly-linked list, but this one uses RefCell<Option<Rc<Node<T>>>>, which is a bit simpler than the code above.

Chapter 5: Hash Tables

I've implemented ChainedHashTable and LinearHashTable in this chapter. In Rust, the hash value of a value x of type T can be easily calculated by implementing Hash Trait as follows. So we can quickly implement a hash table by simply specifying Hash as a trait bound.


pub fn hashcode<T: Hash>(x: &T) -> usize {
    let mut s = DefaultHasher::new();
    x.hash(&mut s);
    s.finish() as usize
}

Using the above, Multiplicative Hashing and 64bit Tabulation Hashing are implemented in ChainedHashTable and LinerHashTable, respectively. The following is an example of Tabulation Hashing:


// tabulation hashing
lazy_static! {
    pub static ref TAB: [[u64; 256]; 8] = {
        let mut array = [[0; 256]; 8];
        for i in 0..8 {
            thread_rng().fill(&mut array[i]);
        }
        array
    };
}
pub fn byte_chunks_64(h: u64) -> [u8; 8] {
    [
        (h & 0xff) as u8,
        ((h >> 8) & 0xff) as u8,
        ((h >> 16) & 0xff) as u8,
        ((h >> 24) & 0xff) as u8,
        ((h >> 32) & 0xff) as u8,
        ((h >> 40) & 0xff) as u8,
        ((h >> 48) & 0xff) as u8,
        ((h >> 56) & 0xff) as u8,
    ]
}
...
    fn hash(&self, x: &T) -> usize {
        // u64 tabulation hashing

        let mut v = 0u64;
        let h = x.hashcode();
        let chunks = byte_chunks_64(h as u64);
        for (i, c) in chunks.iter().enumerate() {
            v ^= TAB[i][*c as usize];
        }
        v = v.overflowing_shr(Self::W - self.d).0;
        v as usize
    }

Note that implementations are based on Boxed Slice for the same reason as the array-based list.


// ChainedHashTable
#[derive(Clone, Debug, Default, Eq, Ord, PartialEq, PartialOrd)]
pub struct ChainedHashTable<T> {
    t: Box<[ArrayStack<T>]>,
    n: usize,
    d: usize,
    z: usize,
}

// LinerHashTable
#[derive(Clone, Debug, Eq, Ord, PartialEq, PartialOrd, Copy)]
enum Elem<T> {
    Val(T),
    Null,
    Del,
}
#[derive(Clone, Debug, Default, Eq, Ord, PartialEq, PartialOrd)]
pub struct LinearHashTable<T> {
    t: Box<[Elem<T>]>,
    n: usize,
    q: usize,
    d: u32,
}

Chapter 6 ~ 10: Tree, Tree, more Trees.

Chapters 6 through 10 deal with tree structures with almost the same base structure, except BinaryHeap.
BinaryTree, BinarySearchTree, Treap, ScapegoatTree, RedBlackTree, and MeldableHeap could all be implemented using the following data structure base.
Except for RedBlackTree, the rest are straightforward to implement.


#[derive(Clone, Debug, Default)]
pub struct BTNode {
    left: RefCell<Option<Rc<BTNode>>>,
    right: RefCell<Option<Rc<BTNode>>>,
    parent: RefCell<Option<Weak<BTNode>>>,
}

It is almost the same as Linked List. The only difference is using RefCell<Option<Rc<BTNode>>> instead of Option<Rc<RefCell<Node>> because it seems a bit simpler to implement.

The above data structure can represent various other tree structures by simply adding the necessary elements. For example, the RedBlackTree looks like the following. Just add the values and color information:


pub enum Color {
    Red,    // 0
    Black,  // 1
    WBlack, // 2
}

#[derive(Clone, Debug, Default)]
pub struct RBTNode<T> {
    color: RefCell<Color>,
    x: RefCell<T>,
    left: RefCell<Option<Rc<RBTNode<T>>>>,
    right: RefCell<Option<Rc<RBTNode<T>>>>,
    parent: RefCell<Option<Weak<RBTNode<T>>>>,
}

However, there is only one problem with RedBlackTree. In the C++ implementation of the ODS book, the part of the tree where the node does not exist is actively involved in preserving the properties of the Red-Black Tree as an external node Nil with black color. However, in the above implementation in Rust, the non-existent node is None, so it does not have any information about its parent or color. Therefore, as it is, it is not possible to perform calculations to maintain the properties of the Red-Black Tree.
But we can come up with a simple implementation if we realize that the external node is involved in the process only once and that the only information that matters in the process is the parent and its color. So we can't trace None back to its parent, but we can figure out children's color from its parent.

The key differences between the C++ and Rust implementations are as follows:

C++:


void removeFicup(Node *u)

Rust:


fn remove_fixup(&mut self, mut color: isize, mut u: Tree<T>, mut p: Tree<T>)

The Rust implementation passes the node u and its parent p and its color.
If you are interested in the detailed implementation, please refer to the source list. I have also defined a method to check if the properties of a red-black tree are satisfied.

Unlike other tree structures, BinaryHeap is an array-based data structure that can be implemented straightforwardly.


pub struct BinaryHeap<T> {
    a: Box<[Option<T>]>,
    n: usize,
}

Chapter 11: Sorting Algorithms

I don't have anything special to say about Rust, so I'll skip it.

Chapter 12: Graphs

Both AdjacencyMatrix and AdjacencyLists can be easily implemented with array-based data structures.
Most Graphs interfaces can be implemented using only the slice, vec, and ArrayStack methods.


// AsjacencyMatrix
pub struct AdjacencyMatrix {
    n: usize,
    a: Vec<Vec<bool>>,
}
...
    fn get_mut(&mut self, i: usize, j: usize) -> Option<&mut bool> {
        if let Some(ii) = self.a.get_mut(i) {
            ii.get_mut(j)
        } else {
            None
        }
    }
...
    fn add_edge(&mut self, i: usize, j: usize) {
        if let Some(e) = self.get_mut(i, j) {
            *e = true;
        }
    }


// AdjacencyList
pub struct AdjacencyLists {
    n: usize,
    adj: Vec<ArrayStack<usize>>,
}
...
    fn add_edge(&mut self, i: usize, j: usize) {
        if let Some(e) = self.adj.get_mut(i) {
            e.add(e.size(), j);
        }
    }

Chapter 13: Trie

This chapter describes three types of trie trees: BinaryTrie, XFastTrie, and YFastTrie. Trie is a data structure that deals with integers, but it is also a tree, and it is based on the same data structure as used in BinaryTree. The tricky part is that it also uses a doubly-linked list internally.

For example, BinaryTrie has the following data structure:


pub struct BTNode<T: USizeV + Default> {
    x: RefCell<T>,
    child: [RefCell<Option<Rc<BTNode<T>>>>; 2], // 0 = left, 1 = right
    jump: RefCell<Option<Rc<BTNode<T>>>>,
    parent: RefCell<Option<Weak<BTNode<T>>>>,
    prev: RefCell<Option<Weak<BTNode<T>>>>, // left
    next: RefCell<Option<Rc<BTNode<T>>>>,   // right
}
pub struct BinaryTrie<T: USizeV + Default> {
    n: usize,
    r: Rc<BTNode<T>>,
    head: Option<Rc<BTNode<T>>>,   // dummy1
    tail: Option<Weak<BTNode<T>>>, // dummy2
}

Compare this with the C++ case:


    T x;
    N *jump;
    N *parent;
    union {
        struct {
            N *left;
            N *right;
        };
        struct {
            N *prev;
            N *next;
        };
        N* child[2];
    };

Safe Rust does not allow to use of Union as in C++. Besides, a doubly-linked list needs a Weak pointer, so even if Safe Rust can define a Union, if one of the pointers to the node's children becomes Weak, the link will be broken, and the link will be meaningless.

So I decided to have separate pointers to prev and next for doubly-linked lists.

Trie accepts any data structure that can derive a non-negative integer value. So, I have defined a trait with a method to output a non-negative integer value as follows:


pub trait USizeV {
    fn usize_value(&self) -> usize;
}

Any data structure that implements this trait is acceptable.

XFastTrie uses a hash table keyed by the bit-value prefix to find nodes.
Nodes need to implement Hash to handle hashing values, but as long as Hash is implemented, LinerHashtable can be used without modification.


pub struct BTNode<T: USizeV + Default> {
    pub x: RefCell<T>,
    prefix: RefCell<usize>,
    child: [RefCell<Option<Rc<BTNode<T>>>>; 2], // 0 = left, 1 = right
    jump: RefCell<Option<Rc<BTNode<T>>>>,
    parent: RefCell<Option<Weak<BTNode<T>>>>,
    prev: RefCell<Option<Weak<BTNode<T>>>>,   // left
    pub next: RefCell<Option<Rc<BTNode<T>>>>, // right
}
impl<T: USizeV + Default> Hash for BTNode<T> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.prefix.borrow().hash(state);
    }
}
pub struct XFastTrie<T: USizeV + Default> {
    n: usize,
    r: Rc<BTNode<T>>,
    head: Option<Rc<BTNode<T>>>,   // dummy1
    tail: Option<Weak<BTNode<T>>>, // dummy2
    t: Box<[LinearHashTable<Rc<BTNode<T>>>]>,
}

YFastTrie is a combination of XFastTrie and Treap. At first glance, it looks complicated, but since the troublesome parts are already implemented in XFastTrie and Treap, the implementation is unexpectedly simple.


struct YPair<T: USizeV + Default> {
    ix: usize,
    t: Rc<RefCell<Treap<T>>>,
}
pub struct YFastTrie<T>
where
    T: USizeV + Default + PartialOrd + Clone,
{
    xft: XFastTrie<YPair<T>>,
    n: usize,
}

Chapter 14: B-Trees

The BTree is originally a tree structure for on-disk data. Still, because it is a generalization of binary trees and handles nodes in blocks, it can be implemented more like an array-based data structure like BinaryHeap, rather than a linked list like conventional tree structures.
There are a lot of features that need to be implemented, but it was not too difficult to think about and implement in Rust.
However, it is hard to think about implementation that calculates the index of an array, such as shifting array elements.

The external memory is abstracted in a data structure called BlockStore (read_block(), write_block(), free_block(), place_block()), and B-tree does operations on it.


#[derive(Clone, Debug, Default, Eq, Ord, PartialEq, PartialOrd)]
struct Node<T: Clone + PartialOrd> {
    id: usize,
    keys: Box<[Option<T>]>,
    children: Box<[i32]>,
}

#[allow(non_snake_case)]
#[derive(Clone, Debug, Default, Eq, Ord, PartialEq, PartialOrd)]
pub struct BTree<T: Clone + PartialOrd> {
    b: usize,  // the maximum number of children of a node (must be odd)
    B: usize,  // d div 2
    n: usize,  // number of elements stored in the tree
    ri: usize, // index of the root
    bs: BlockStore<Node<T>>,
}

BlockStore is implemented in ArrayStack:


pub struct BlockStore<T: Clone> {
    blocks: ArrayStack<T>,
    free: ArrayStack<usize>,
}

Note that since Btree targets the disk, after changing the inside of the node, it is necessary to write into the disk before reading occurs.


                    x = w.remove(0).unwrap();
                    u.add(x, w.id as i32);
                    self.bs.write_block(w.id, w);
                    self.bs.write_block(u.id, u.clone());

Areas of confusion when implementing & Tips

`borrow()` and `loop()`

The following is an excerpt from SkiplistList, but the code below will result in a compilation error if there is no semicolon at the end of the match.


loop {
    let u = Rc::clone(&n);
    match u.borrow().next[r] {
        Some(ref u) if j + n.borrow().length[r] - 1 < i => {
            j += n.borrow().length[r];
            n = Rc::clone(u)
        }
        _ => break,
    }; // <= if miss this semicolon, compilation error
}


error[E0597]: u does not live long enough
  --> chapter04/src/skiplistlist.rs:58:31
   |
58 |                         match u.borrow().next[r] {
   |                               ^---------
   |                               |
   |                               borrowed value does not live long enough
   |                               a temporary with access to the borrow is created here ...
...
65 |                     }
   |                     -
   |                     |
   |                     u dropped here while still borrowed
   |                     ... and the borrow might be used here, when that temporary is dropped and runs the destructor for type std::cell::Ref<'_, skiplistlist::Node<T>>.
   |
   = note: The temporary is part of an expression at the end of a block. Consider adding semicolon after the expression so its temporaries are dropped sooner, before the local variables declared by the block are dropped.

With a semicolon, expressions that use u are match ... {...} , so the order is match expression evaluation (temporal consumption), then u is drop, but without a semicolon, the expression that uses u is loop {...} , which is a problem because u can't be drop()ed until the evaluation of the loop expression is finished.

Trap of `Default` 1

When initializing a structure, using ..Default::default(), a compile error will occur if Default is not on a trait bound, even if the type T does not require Default.


impl<T: Default> RBTNode<T> {
    pub fn new(x: T) -> Self {
        Self {
            x: RefCell::new(x),
            ..Default::default()
        }
    }
}

To avoid this, write the following, although it is complicated:


impl<T> RBTNode<T> {
    pub fn new(x: T) -> Self {
        Self {
            x: RefCell::new(x),
            left: Default::default(),
            right: Default::default(),
            parent: Default::default(),
            color: Default::default(),
        }
    }
}

Trap of `Default` 2

You should not use ..Default::default() when you want to implement Default because it will cause a stack overflow in recursion. Instead, it is better to use derive(Default).


// compile error
impl Default for X {
    fn default() -> Self {
        Self {
            ..Default::default()
        }
    }
}

Use `rotate_left()`

When shifting an element by a certain amount in a slice, use rotate_left() or rotate_right() instead of swap(). In most cases, it is easier to understand.

Option's `or()` method is useful

It is extremely easy to write code that would be cumbersome if it used if statements to separate cases.


match (prev, parent, left, right) {
    (Some(p), Some(v), left, right) if Rc::ptr_eq(&p, &v) => {
        next = left.or(right).or(Some(p));
    }
  ...
}

`fill()` of `rand::Rng` is useful.

There are times when you want to initialize an array with random numbers, but you can easily do so with fill() without working hard with iterators.

pretty printing

In the Debug output, "{:#?}" can be used for pretty printing. This is useful when seeing a small tree structure.

The `Rc`-based linked list trap

Rc has a recursive destruction problem. For this reason, if Rc-based Linked List is too large, it seems to cause a stack overflow when the resource is destroyed. The solution is to implement Drop to prevent recursive calls.

Final Thought

I think that implementing OpenDataStructures in Rust has made me more proficient in writing Rust, and I recommend OSD in Rust as a good practice for those who are learning Rust.

----

I've written other articles about Rust. Please read them if you like:

Implementing Spinlock for RISC-V OS in Rust

Writing an x86 bootloader in Rust that can launch vmlinux

I'm also implementing an OS, bootloader, and DOS application in Rust. I'm open to advise and pull requests for these repositories.

bootloader: https://github.com/o8vm/krabs

os: https://github.com/o8vm/octox

dos app: https://github.com/o8vm/rust_dos