...
The following diagram shows how the enqueueing and dequeueing are performed without using locks.
The dequeue operation takes all the elements in the buffer, while enqueue adds one element. The first node in the buffer is always a dummy node with a null object. The tail is an atomic reference to the last element in the buffer. In this example there is no need for an atomic reference to the first element in the buffer, as there is only one sending thread that can dequeue elements and thus change the dummy node's next element.
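The structure above can be sketched as follows. This is a minimal illustration, not the actual implementation: class and method names are invented, and only the shape described in the text (dummy head node, atomic tail reference, CAS-based enqueue, single-consumer dequeue of all elements) is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Sketch of a buffer with lock-free enqueue and a single dequeueing thread. */
public class LockFreeBuffer<T> {
    private static final class Node<T> {
        final T item;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T item) { this.item = item; }
    }

    // Dummy node: its item is null and is never returned.
    private Node<T> head = new Node<>(null);
    // Atomic reference to the last element in the buffer.
    private final AtomicReference<Node<T>> tail = new AtomicReference<>(head);

    /** Lock-free enqueue: CAS the new node after the current tail. */
    public void enqueue(T item) {
        Node<T> node = new Node<>(item);
        while (true) {
            Node<T> last = tail.get();
            if (last.next.compareAndSet(null, node)) {
                tail.compareAndSet(last, node);        // swing tail to new node
                return;
            }
            tail.compareAndSet(last, last.next.get()); // help a lagging tail
        }
    }

    /** Takes all elements; no atomic head reference is needed because
     *  only one (sending) thread ever calls this. */
    public List<T> dequeueAll() {
        List<T> out = new ArrayList<>();
        Node<T> node = head.next.get();
        while (node != null) {
            out.add(node.item);
            head = node;              // the consumed node becomes the new dummy
            node = node.next.get();
        }
        return out;
    }

    public static void main(String[] args) {
        LockFreeBuffer<Integer> b = new LockFreeBuffer<>();
        b.enqueue(1); b.enqueue(2); b.enqueue(3);
        System.out.println(b.dequeueAll()); // [1, 2, 3]
    }
}
```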
However, it is impossible to implement the eviction functionality in this approach without adding a lock. This lock is held by both dequeue and evict, so that these two operations cannot happen at the same time. Nevertheless, the enqueue remains lock-free and should therefore be very fast.
A completely lock-free approach is described below.
The third approach is lock-based and uses two locks:
- tail lock - used by enqueue and dequeue
- head lock - used by dequeue and evict
Next, the three operations performed on the buffer in this lock-based implementation are described step by step.
Enqueue:
- allocate node, compute and set size
- lock tail lock
- link node to tail
- swing tail to new node
- unlock tail lock
- change buffer size
- signal eviction if buffer is full
Dequeue:
- if buffer is empty, return
- lock tail lock
- lock head lock
- get the reference of the first element
- swing head and tail to dummy node
- change size to 0
- unlock head lock
- unlock tail lock
- go through buffer, add objects to list
- return list of dequeued nodes
Evict:
- wait until there is need for eviction
- lock head lock
- go through list until the required eviction size is reached
- change head to new first element
- unlock head lock
- subtract evicted size from total buffer size
- we assume here that eviction will never empty the buffer, so it does not need to acquire the tail lock
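The three step-by-step descriptions above can be combined into one sketch of the two-lock implementation. This is an illustration only: the class, the capacity threshold, and the monitor used to signal eviction are assumptions, not the original code; only the locking discipline (tail lock for enqueue/dequeue, head lock for dequeue/evict, eviction never emptying the buffer) follows the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of the lock-based buffer with two locks. */
public class TwoLockBuffer<T> {
    private static final class Node<T> {
        T item;
        Node<T> next;
        Node(T item) { this.item = item; }
    }

    private final ReentrantLock tailLock = new ReentrantLock(); // enqueue, dequeue
    private final ReentrantLock headLock = new ReentrantLock(); // dequeue, evict
    private final Object evictMonitor = new Object();           // assumed signalling mechanism
    private final AtomicLong size = new AtomicLong();
    private final long capacity;                                // assumed "buffer full" threshold
    private Node<T> head = new Node<>(null);                    // dummy node
    private Node<T> tail = head;

    public TwoLockBuffer(long capacity) { this.capacity = capacity; }

    public void enqueue(T item) {
        Node<T> node = new Node<>(item);            // allocate node
        tailLock.lock();                            // lock tail lock
        try {
            tail.next = node;                       // link node to tail
            tail = node;                            // swing tail to new node
        } finally {
            tailLock.unlock();                      // unlock tail lock
        }
        if (size.incrementAndGet() >= capacity)     // change buffer size
            synchronized (evictMonitor) { evictMonitor.notify(); } // signal eviction
    }

    public List<T> dequeueAll() {
        if (size.get() == 0) return List.of();      // if buffer is empty, return
        Node<T> first;
        tailLock.lock();                            // lock tail lock
        headLock.lock();                            // lock head lock
        try {
            first = head.next;                      // reference of the first element
            head.next = null;                       // swing head and tail
            tail = head;                            // back to the dummy node
            size.set(0);                            // change size to 0
        } finally {
            headLock.unlock();
            tailLock.unlock();
        }
        List<T> out = new ArrayList<>();            // go through buffer outside the locks
        for (Node<T> n = first; n != null; n = n.next)
            out.add(n.item);
        return out;                                 // list of dequeued nodes
    }

    public void evict(long toEvict) throws InterruptedException {
        synchronized (evictMonitor) {               // wait until eviction is needed
            while (size.get() < capacity) evictMonitor.wait();
        }
        long evicted = 0;
        headLock.lock();                            // lock head lock
        try {
            Node<T> n = head;
            while (evicted < toEvict && n.next != null) {
                n = n.next;                         // go through the list
                n.item = null;                      // drop the evicted object
                evicted++;
            }
            head = n;                               // last evicted node is the new dummy
        } finally {
            headLock.unlock();                      // unlock head lock
        }
        size.addAndGet(-evicted);                   // subtract evicted size
    }

    public static void main(String[] args) throws InterruptedException {
        TwoLockBuffer<Integer> b = new TwoLockBuffer<>(3);
        b.enqueue(1); b.enqueue(2); b.enqueue(3);
        b.evict(1);                         // size reached capacity, so no waiting
        System.out.println(b.dequeueAll()); // [2, 3]
    }
}
```

Note that evict takes only the head lock, relying on the assumption stated above that eviction never empties the buffer and therefore never touches the tail.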
Testing
Issues that need to be checked:
...
- 1 sending thread:
  - loop:
    - sleeps 5 s
    - dequeues all the elements in the buffer
    - prints them
- n threads that enter data:
  - n = 2, 4, 108, 100 ..., 256
  - loop that adds elements
  - no sleeping between enqueues to make it fast
- limiting the size by number of elements:
  - use random Floats as data
  - object size not important, only the number of elements
- limiting the size by memory:
  - use Strings with variable length
The threads run for some time (e.g. 1 minute) and are then interrupted; the statistics about the total number of elements added, retrieved and evicted are printed out.
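The harness described above might look roughly like this. Everything here is an illustrative stand-in: a `ConcurrentLinkedQueue` plays the role of the buffer under test, and the run time and thread count are shortened from the values in the text.

```java
import java.util.Queue;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

public class BufferStressTest {
    /** Runs n enqueueing threads and one draining consumer for runMillis,
     *  then interrupts them and returns {added, retrieved}. */
    static long[] run(int n, long runMillis) throws InterruptedException {
        Queue<Float> buffer = new ConcurrentLinkedQueue<>(); // stand-in for the buffer under test
        AtomicLong added = new AtomicLong(), retrieved = new AtomicLong();

        List<Thread> producers = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(() -> {
                Random rnd = new Random();
                while (!Thread.currentThread().isInterrupted()) {
                    buffer.add(rnd.nextFloat());   // no sleeping between enqueues
                    added.incrementAndGet();
                }
            });
            t.start();
            producers.add(t);
        }

        Thread consumer = new Thread(() -> {       // the single sending thread
            try {
                while (true) {
                    Thread.sleep(500);             // the text uses 5 s
                    while (buffer.poll() != null)  // dequeue all the elements
                        retrieved.incrementAndGet();
                }
            } catch (InterruptedException e) { /* stop */ }
        });
        consumer.start();

        Thread.sleep(runMillis);                   // let the threads run for some time
        for (Thread t : producers) t.interrupt();
        consumer.interrupt();
        for (Thread t : producers) t.join();
        consumer.join();
        return new long[] { added.get(), retrieved.get() };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] stats = run(4, 2_000);
        System.out.println("added=" + stats[0] + " retrieved=" + stats[1]);
    }
}
```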
Results
Comparing the number of elements added in 1 minute for the three implementations (see Figure below), it was observed that the lock-based implementation is the least efficient, while the other two implementations give similar results, with some variation for smaller numbers of threads, where the implementation that uses a lock only for dequeue and eviction proves to be better. The tests use the "Limit by number" strategy for the buffer size.
References
[1] Maurice Herlihy, Nir Shavit, "The Art of Multiprocessor Programming, Revised Reprint"
[2] Maged M. Michael, Michael L. Scott, "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms"
[3] ConcurrentLinkedQueue implementation in Java
[4] CMR buffer specifications
[5] Calculation of object sizes
...