The following was originally part of a presentation I gave at the Elixir Melbourne meetup on the 19th of May, 2022.
In functional languages we often program declaratively.

Declarative programming favors immutability - this means that data, once created, is never modified in place.

In the following post we'll be covering three main data structures in Elixir: lists, tuples and maps.
A naive way to implement immutable data is to copy the entire piece of data every time we need to change it:

a = [1, 2, 3]
b = a ++ [4]

This copies the entirety of a and adds a 4 on the end. The problem with this is that every change costs O(n), given the length of the list n.
This is why we need persistent data structures
Persistent data structures are immutable, but can reuse parts of their old versions when modified (through structural sharing)
In Erlang/Elixir, the list is implemented using a linked list, rather than an array as you'd see in imperative languages like Python or C
In a singly linked list, each node holds a value and a pointer to the next node:
a = [1, 2, 3]
b = [0 | a]
b points to a node with value 0, and that node points to the head of a.

In memory, a might look something like this:
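The same structure can also be written out in plain Elixir as nested cons cells:

```elixir
a = [1, 2, 3]
# the literal above is sugar for a chain of cons cells:
a = [1 | [2 | [3 | []]]]

# prepending allocates exactly one new cell that points at a's head,
# so all of a is shared rather than copied
b = [0 | a]
```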
Let's reason about the ++ operator. When you call c = a ++ b, under the hood the left side a is copied, and then its last node is pointed at the list on the right:
a = [1, 2, 3]
b = [4, 5, 6]
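You can observe the sharing directly. Here's a small sketch using :erts_debug, an internal, undocumented BEAM module, so treat it purely as a tool for poking around:

```elixir
a = [1, 2, 3]
b = [4, 5, 6]
c = a ++ b

# flat_size/1 measures a term as if nothing were shared, while
# size/1 counts shared sub-terms only once. Since c's last cell
# points at the very same cells as b, the shared version is smaller.
IO.inspect(:erts_debug.flat_size([b, c]), label: "flat_size")
IO.inspect(:erts_debug.size([b, c]), label: "size")
```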
So we can see here that the main cost of this operation is in the copying of the first list, a.
In general, the time complexity is O(n) where n is the length of the first list.
Let’s keep that in mind as we go through an example using this operator, to see how we can use this operator optimally.
defmodule Reverse do
  # A body recursive implementation
  def naive_reverse([head | tail]) do
    reverse_tail = naive_reverse(tail)
    new_head = [head]

    IO.write(inspect(reverse_tail))
    IO.write(" ++ ")
    IO.inspect(new_head)
    IO.puts("-----------------")

    reverse_tail ++ new_head
  end

  def naive_reverse([]) do
    []
  end
end
Reverse.naive_reverse([1, 2, 3, 4])
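Running this prints the intermediate concatenations. Note how the left-hand side of ++ (the part that gets copied) grows at every step - this is what makes the whole reverse O(n²):

```
[] ++ [4]
-----------------
[4] ++ [3]
-----------------
[4, 3] ++ [2]
-----------------
[4, 3, 2] ++ [1]
-----------------
```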
++ copies the entirety of whatever is on its left.

defmodule Reverse do
  # A tail-recursive implementation that builds up an accumulator
  def better_reverse([head | tail], acc) do
    reverse_tail = acc
    new_head = [head]

    IO.write(inspect(new_head))
    IO.write(" ++ ")
    IO.inspect(reverse_tail)
    IO.puts("-----------------")

    better_reverse(tail, new_head ++ reverse_tail)
  end

  def better_reverse([], acc) do
    acc
  end
end
Reverse.better_reverse([1, 2, 3, 4], [])
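This time the left-hand side of ++ is always a single-element list, so each step copies only one cell:

```
[1] ++ []
-----------------
[2] ++ [1]
-----------------
[3] ++ [2, 1]
-----------------
[4] ++ [3, 2, 1]
-----------------
```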
defmodule Reverse do
  def even_better_reverse(list, acc \\ [])

  # No ++ at all: prepending to the accumulator is O(1) per element
  def even_better_reverse([head | tail], acc) do
    even_better_reverse(tail, [head | acc])
  end

  def even_better_reverse([], acc) do
    acc
  end
end
Reverse.even_better_reverse([1, 2, 3, 4])
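In everyday code you would just call Enum.reverse/1, which for lists delegates to Erlang's :lists.reverse/1 - the same accumulator idea, implemented as an optimised BIF:

```elixir
Enum.reverse([1, 2, 3, 4])
# => [4, 3, 2, 1]
```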
Operation | Time Complexity |
---|---|
Access | O(n) |
Search | O(n) |
Insertion | O(1) for pre-pending, otherwise O(n) |
Deletion | O(1) if first element, otherwise O(n) |
As you can see, linked lists are really good for prepending, but not so great for access/search.
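These costs are easy to see in code:

```elixir
list = Enum.to_list(1..1_000_000)

hd(list)                # O(1) - the head is immediately reachable
[0 | list]              # O(1) - one new cell, the rest is shared
Enum.at(list, 999_999)  # O(n) - walks the entire list
list ++ [0]             # O(n) - copies every cell of list
```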
Tuples, on the other hand, are stored as a single contiguous block in memory. The tuple a = {:a, 42, :ok} looks something like this in memory.

This means that we have the following time complexities:
Operation | Time Complexity |
---|---|
Access | O(1) |
Search | O(n) |
Insertion | O(n) |
Deletion | O(n) |
So they’re really good for reading, but not so great for creating, inserting and deleting.
This means that they are ideal when you have a small list of values that need to be accessed quickly. This makes sense when you see where they are commonly used: in function return values and in pattern matching, as in the example below.
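For example, Map.fetch/2 returns a tagged tuple that callers pattern match on:

```elixir
case Map.fetch(%{name: "elmo"}, :name) do
  {:ok, value} -> "found #{value}"
  :error -> "not found"
end
# => "found elmo"
```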
For instance, {:ok, 1} as a tuple is a single contiguous block of memory, whereas the same pair as a list would need a cons cell per element - so the tuple is both smaller and faster to access.

Next up: the map/dictionary.
As an example, let us look at how the map %{"a" => "foo", "z" => "bar"} is represented:
m = %{
"a" => "foo",
"z" => "bar",
3 => nil,
4 => nil,
5 => nil,
6 => nil,
7 => nil,
8 => nil,
9 => nil,
10 => nil,
11 => nil,
12 => nil,
13 => nil,
14 => nil,
15 => nil,
16 => nil,
17 => nil,
18 => nil,
19 => nil,
20 => nil,
21 => nil,
22 => nil,
23 => nil,
24 => nil,
25 => nil,
26 => nil,
27 => nil,
28 => nil,
29 => nil,
30 => nil,
31 => nil,
32 => nil
}
m |> Map.keys()
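In IEx, this prints the keys in sorted order, with the binaries coming after all the integers:

```elixir
# => [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
#     22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, "a", "z"]
```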
How come the "a" and "z" keys are at the end though? Because small maps keep their keys sorted using Erlang's term ordering, in which numbers sort before bitstrings (strings are binaries, a kind of bitstring):

number < atom < reference < function < port < pid < tuple < map < list < bitstring
ok, but what happens if we have > 32 keys ...
m = %{
1 => nil,
2 => nil,
3 => nil,
4 => nil,
5 => nil,
6 => nil,
7 => nil,
8 => nil,
9 => nil,
10 => nil,
11 => nil,
12 => nil,
13 => nil,
14 => nil,
15 => nil,
16 => nil,
17 => nil,
18 => nil,
19 => nil,
20 => nil,
21 => nil,
22 => nil,
23 => nil,
24 => nil,
25 => nil,
26 => nil,
27 => nil,
28 => nil,
29 => nil,
30 => nil,
31 => nil,
32 => nil,
33 => nil
}
m |> Map.keys()
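This time the keys come back in an arbitrary-looking order, which we can check:

```elixir
keys = Map.keys(m)
# the 33 keys are no longer returned in sorted order:
keys == Enum.sort(keys)
# => false
```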
What happened? When we exceed 32 keys, maps switch to a different underlying data structure, the hash array mapped trie (HAMT), so the keys are no longer sorted.
A number of other functional languages also use this data structure to implement their maps (e.g. Scala and Clojure).
a = %{"elmo" => "cute", "elixir" => "awesome"}
Map.get(a, "elixir")
, you can just “walk” down the tree until you hit a leaf, and get the value at the leaf.This is where Hashing comes in
Hashing gives us an alternative "key" to use to insert into the trie. For example, we can hash "elmo" to "11100000" and insert this as our key instead of "elmo". The advantage of this is that every key becomes a short, fixed-length string of bits that is roughly uniformly distributed, which keeps the trie shallow and balanced.
Let's say that we're building a = %{"elmo" => "cute", "elixir" => "awesome"}. Here's a helper to compute the (truncated) hashes used in the walkthrough below:
defmodule HashHelper do
  def pad_with_zeroes(binary) do
    case {binary, String.length(binary)} do
      {_, 32} ->
        binary

      {binary, _} ->
        pad_with_zeroes("0" <> binary)
    end
  end

  def partition_hash(hash) do
    for <<hashed_partition::binary-2 <- hash>>, do: hashed_partition
  end

  def hash(key) do
    hashed = :erlang.phash2(key)
    IO.inspect(hashed)

    # I reverse the string here because I realized I did the diagram
    # the wrong way around after I generated this data ...
    hashed
    |> :erlang.integer_to_binary(2)
    |> pad_with_zeroes()
    |> String.reverse()
    |> String.slice(-8, 8)
  end
end
"elmo" |> HashHelper.hash() |> IO.inspect(label: "hash(elmo) ")
"elixir" |> HashHelper.hash() |> IO.inspect(label: "hash(elixir)")
"elsa" |> HashHelper.hash() |> IO.inspect(label: "hash(elsa)")
So we're inserting a = %{"elmo" => "cute", "elixir" => "awesome"}. We hash "elmo" and "elixir" to get hash(elmo) = "11100000" and hash(elixir) = "10000000". So instead of inserting the letters of the key into the trie, we can just insert using this binary representation.
Technically, we could just insert them bit by bit like this, but this means that the tree becomes really deep, as it takes 8 steps each time to reach a leaf. To create wider, shallower trees, we can split the hash into 2-bit sections:
"elmo" |> HashHelper.hash() |> HashHelper.partition_hash() |> IO.inspect(label: "hash(elmo) ")
"elixir" |> HashHelper.hash() |> HashHelper.partition_hash() |> IO.inspect(label: "hash(elixir)")
"elsa" |> HashHelper.hash() |> HashHelper.partition_hash() |> IO.inspect(label: "hash(elsa)")
We can easily clone a map by simply pointing the new variable at the same object:
a = %{"elmo" => "cute", "elixir" => "awesome"}
b = a
Say we wanted to add the key-value pair "elsa" => "princess" to a:
a = %{"elmo" => "cute", "elixir" => "awesome"}
b = Map.put(a, "elsa", "princess")
a |> IO.inspect()
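a is untouched, while b sees both the old entries and the new one:

```elixir
# a => %{"elixir" => "awesome", "elmo" => "cute"}
b |> IO.inspect()
# b => %{"elixir" => "awesome", "elmo" => "cute", "elsa" => "princess"}
```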
elsa = ["00", "00", "00", "00"]
elsa
would follow the same path as “elixir” ["10", "00", "00", "00"] for 3 nodes, and then diverge.
"elsa" => "princess"
00
already exists in A’s tree00
already exists in A’s tree as well, so we create our own node in B and copy all the pointers that it has to A’s nodes00
finally doesn’t have a connection in A’s tree, so we can create the node in our own tree without linking it to A's treeSo in this way, the Hash array mapped trie can effectively use old data structures when creating new ones, without mutating the old data structures
If you follow the paths down in the trie for A, you can see that it is unchanged - there is no way for it to reach the new paths created by b.
B on the other hand has access to the nodes in A, as well as the new value inserted for "elsa".
This is how maps avoid duplicating the whole structure every time we create a new copy or add items.
Below is the implementation of the get operation for maps in Erlang's OTP, with some comments for my own understanding and reference (the source is in the erlang/otp repository listed in the references).
// hx is the hashed key
const Eterm *
erts_hashmap_get(Uint32 hx, Eterm key, Eterm node)
{
/* thing_word tagscheme
* Need two bits for map subtags
*
* Original HEADER representation:
*
* aaaaaaaaaaaaaaaa aaaaaaaaaatttt00 arity:26, tag:4
*
* For maps we have:
*
* vvvvvvvvvvvvvvvv aaaaaaaamm111100 val:16, arity:8, mtype:2
*
* unsure about trailing zeros
*
* map-tag:
* 00 - flat map tag (non-hamt) -> val:16 = #items
* 01 - map-node bitmap tag -> val:16 = bitmap
* 10 - map-head (array-node) -> val:16 = 0xffff
* 11 - map-head (bitmap-node) -> val:16 = bitmap
*/
// hval is any one of the map-tag values above
// if 0xffff, it is an array node
/* erl_map.h stuff */
Eterm *ptr, hdr, *res;
Uint ix, lvl = 0;
Uint32 hval, bp;
ASSERT(is_boxed(node));
// node is a pointer
ptr = boxed_val(node);
// dereferences the pointer to get the header value
hdr = *ptr;
ASSERT(is_header(hdr));
ASSERT(is_hashmap_header_head(hdr));
ptr++;
// NOTE: Infinite loop
for (;;)
{
hval = MAP_HEADER_VAL(hdr);
// hx is the hashed key
// hashmap_index scopes hx to the current last 4 bits
ix = hashmap_index(hx);
// ix stores the index (ix corresponds to the ith element)
// If we are not in an array node, i.e we are in a bitmap
if (hval != 0xffff)
{
// bp is a mask with a single 1 bit at index ix
// you can use this to get the ith value of hval by doing bp & hval
// hval here is the bitmap
// so we are checking whether the ith value of the bitmap is set
// if it isn't set then we return null since the next node does not exist
bp = 1 << ix;
if (!(bp & hval))
{
/* not occupied */
res = NULL;
break;
}
// If we're here, that means that the next value does indeed exist, from our hashed key
// hval & (bp - 1) first filters the bitmask to only contain bits to the right of our bit at the ix'th index
// we then count the number of bits to get the index of the array node that contains the pointer to the next node
ix = hashmap_bitcount(hval & (bp - 1));
}
// Go to the next node
node = ptr[ix + 1];
// If our new node is a list, then the pointer is actually pointing to a key value pair
// If our original key is equal to the key stored at the pointer, then return the value
// Otherwise, we have a mismatch, return NULL
if (is_list(node))
{ /* LEAF NODE [K|V] */
ptr = list_val(node);
res = EQ(CAR(ptr), key) ? &(CDR(ptr)) : NULL;
break;
}
// Our levels go from 0 to 7 (inclusive) since we look at 4 bits each time.
// Since our level is never changed in this function, we always shift by 4
// #define hashmap_shift_hash(Hx, Lvl, Key) \
// (((++(Lvl)) & 7) ? \
// (Hx) >> 4 : \
// make_map_hash(Key, ((Lvl) >> 3)))
// (Lvl) >> 3 divides the level by 8: every 8 levels the 32-bit hash
// is used up, so a fresh hash is generated for the key
hx = hashmap_shift_hash(hx, lvl, key);
ASSERT(is_boxed(node));
ptr = boxed_val(node);
hdr = *ptr;
ASSERT(is_header(hdr));
ASSERT(!is_hashmap_header_head(hdr));
}
return res;
}
Operation | Time Complexity |
---|---|
Access | O(log n) |
Search | O(log n) |
Insertion | O(n) for < 32 elements, O(log n) for >= 32 elements [2] |
Deletion | O(n) for < 32 elements, O(log n) for >= 32 elements |
Since we have a Trie data structure, most things like insertion and deletion take O(log_32(n))
This isn’t too bad though as our Trie’s depth is limited to the hash length / section length, which is usually around 7-8
In practice, it is almost as performant as the regular old hash table, which allows O(1) access.
Diving into the implementations of data structures in Elixir was really interesting! I'd always have a nagging thought that I didn't really understand what was happening beneath the surface when coding, so it was a bit of a relief to finally gain some understanding of how these common data structures work in Elixir!
Please let me know if you spot any issues/incorrect statements in what I've written above! ^^
https://github.com/erlang/otp
https://iq.opengenus.org/time-complexity-of-array/
https://iq.opengenus.org/time-complexity-of-linked-list/
https://softwareengineering.stackexchange.com/questions/294983/why-do-haskell-and-scheme-use-singly-linked-lists
https://dev.to/edisonywh/-elixir--why-linked-lists--1e9d
https://stackoverflow.com/questions/25779783/practical-difference-between-erlang-mapsremove-2-and-mapswithout-2
https://softwareengineering.stackexchange.com/questions/132309/why-are-cons-lists-associated-with-functional-programming
https://gist.github.com/PJUllrich/4cd3d8e2e8c9170b560e5a501c13c9f3
https://stackoverflow.com/questions/28676383/erlang-array-vs-list
https://hypirion.com/musings/understanding-persistent-vector-pt-1
https://www.quora.com/How-does-a-persistent-data-structure-in-functional-programming-work
https://stackoverflow.com/questions/4399837/what-is-the-benefit-of-purely-functional-data-structure
https://doc.lagout.org/programmation/Functional%20Programming/Chris_Okasaki-Purely_Functional_Data_Structures-Cambridge_University_Press%281998%29.pdf
https://stackoverflow.com/questions/599153/are-some-data-structures-more-suitable-for-functional-programming-than-others
https://inquisitivedeveloper.com/lwm-elixir-47/
https://inquisitivedeveloper.com/lwm-elixir-14/
https://www.openmymind.net/Elixir-A-Little-Beyond-The-Basics-Part-1-lists/
https://www.openmymind.net/Elixir-A-Little-Beyond-The-Basics-Part-3-maps/
https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html
https://www.erlang.org/doc/efficiency_guide/myths.html#myth--operator--++--is-always-bad
https://blog.acolyer.org/2015/11/27/hamt/
https://worace.works/2016/05/24/hash-array-mapped-tries/
https://www.erlang.org/doc/efficiency_guide/maps.html
https://www.slideshare.net/ErlangSolutionsLtd/erlang-meetup-19-september-2017
https://github.com/python/cpython/blob/main/Python/hamt.c
https://moaazsidat.com/writings/2015-11-24-supercharging-react-with-immutablejs.md
https://lampwww.epfl.ch/papers/idealhashtrees.pdf
https://www.honeybadger.io/blog/elixir-memory-structure/