Ringbuffer in the inspectIT agent
Meetings
Date | Focus | Link |
---|---|---|
13.09.2012 | Initial discussion of the requirements and possible approaches | 2012-09-13 Ringbuffer meeting (/wiki/spaces/ARCHIVE/pages/5019325) |
21.09.2012 | Discussion of approaches | 2012-09-21 Ring-buffer (/wiki/spaces/ARCHIVE/pages/5019319) |
29.09.2012 | Current status, next steps: Create sample realization | 2012-09-29 Ringbuffer status update (/wiki/spaces/ARCHIVE/pages/5019311) |
25.10.2012 | Current status, next steps: caliper, another approach | /wiki/spaces/ARCHIVE/pages/5019155 |
Requirements
The ringbuffer will replace the buffering strategy in the agent completely, while the sending strategy will be kept, as it defines when data should be sent.
The ringbuffer uses a FIFO policy:
...
In case the buffer is full and an insertion operation is desired, the oldest element(s) must be dropped to make room for the new element(s).
Moreover, when the buffer is limited by memory occupancy, a utility for calculating object sizes is also required.
Approach
There are two issues that have to be dealt with:
- implementing concurrent access, i.e. the enqueue and dequeue operations
- implementing the eviction of old elements when the buffer has reached its size limit
Approach for implementing a concurrent queue [1][2]
- non-blocking - for better performance in highly multi-threaded environments
- linked list:
- each element holds a reference to the next element
- sentinel nodes: head and tail
- head - points to the first element in the list
- tail - points to the last element in the list
- the last element has a NULL reference for next element
- non-blocking enqueue/dequeue: use compare-and-set to set the values of head and tail
- lazy approach - enqueue/dequeue are both done in two steps [1]
- two-step enqueue:
- compare-and-set to append new element
- compare-and-set to change the value of tail
- "every other method call must be prepared to encounter a half-finished enqueue call, and to finish the job"
- an enqueue call never waits for a dequeuer
- two-step dequeue:
- read the value from the first element
- compare-and-set to change the value of head
- make sure that tail does not lag behind
Refer to Section 10.5 "An Unbounded Lock-Free Queue" in [1] and Section 2 in [2] for a more detailed explanation.
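For illustration, below is a minimal Java sketch of such a two-step, CAS-based queue in the style of [1][2]; this is also the algorithm behind Java's ConcurrentLinkedQueue [3]. The class and field names are placeholders, not the actual agent code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the lazy, two-step lock-free queue from [1][2].
public class LockFreeQueue<T> {

    private static class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    public LockFreeQueue() {
        Node<T> sentinel = new Node<>(null);          // dummy node
        head = new AtomicReference<>(sentinel);
        tail = new AtomicReference<>(sentinel);
    }

    public void enqueue(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {
                if (next == null) {
                    // step 1: CAS to append the new element
                    if (last.next.compareAndSet(null, node)) {
                        // step 2: CAS to swing tail; failure is fine, another thread helps
                        tail.compareAndSet(last, node);
                        return;
                    }
                } else {
                    // a half-finished enqueue was encountered: finish the job
                    tail.compareAndSet(last, next);
                }
            }
        }
    }

    public T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == head.get()) {
                if (first == last) {
                    if (next == null) {
                        return null;                  // queue is empty
                    }
                    tail.compareAndSet(last, next);   // make sure tail does not lag behind
                } else {
                    T value = next.value;             // read the value from the first element
                    if (head.compareAndSet(first, next)) {
                        return value;                 // CAS to change the value of head
                    }
                }
            }
        }
    }
}
```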
Another possibility to implement a concurrent queue:
- using two locks, one for enqueueing and one for dequeueing
- livelock-free
- but high lock contention
Adding the elements can be performed with or without locking.
Regarding dequeueing, the constraint that a single thread dequeues all the elements leads to a specialised implementation. Dequeueing can be either lock-based or lock-free, since the eviction thread competes with the dequeueing thread for changing the head element.
Both locking and lock-free approaches will be implemented in order to find the best approach. The approaches are described in detail in the next sections.
Approach for eviction of old elements
- similar to CMR atomic buffer implementation [4]
- use AbstractObjectSizes class for calculating object sizes [5]
- separate thread for analyzing and calculating the buffer size
- separate thread for the eviction operation, following the FIFO policy; this operation has to be suspended while the total size of all objects in the buffer is smaller than the buffer maximum size
- in the analysis or adding thread, signal the eviction thread if needed
- use conditions for suspending and activating the threads that work on eviction and/or analysis (see the sketch below)
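The suspend/signal mechanism could look roughly like the following; this is a minimal sketch with assumed names (EvictionSignalling, maxSize, currentSize), not the actual agent classes.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: the eviction thread sleeps on a condition and is signalled by the
// thread that tracks the buffer size once the maximum size is exceeded.
public class EvictionSignalling {

    private final ReentrantLock evictionLock = new ReentrantLock();
    private final Condition evictionNeeded = evictionLock.newCondition();
    private final AtomicLong currentSize = new AtomicLong();
    private final long maxSize;

    public EvictionSignalling(long maxSize) { this.maxSize = maxSize; }

    // called by the analysis/adding thread after a size increase
    public void sizeIncreased(long delta) {
        if (currentSize.addAndGet(delta) > maxSize) {
            evictionLock.lock();
            try {
                evictionNeeded.signal();    // wake the eviction thread
            } finally {
                evictionLock.unlock();
            }
        }
    }

    // run loop of the eviction thread
    public void evictionLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            evictionLock.lock();
            try {
                while (currentSize.get() <= maxSize) {
                    evictionNeeded.await(); // suspended while below the maximum size
                }
            } finally {
                evictionLock.unlock();
            }
            evict();                        // drop the oldest elements (FIFO)
        }
    }

    protected void evict() { /* remove oldest elements, decrease currentSize */ }
}
```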
The object size can also be computed when enqueueing an element instead of on a separate specialised analysis thread. Both approaches introduce some overhead: size calculation on a separate thread means locking and context switching, while synchronous size calculation slows down the enqueue operation, which has to be very fast. Nevertheless, it is hard to say which approach is faster, so both of them will be implemented.
Therefore, we have the following approaches that have to be implemented and compared:
Size Calculation / Locking | Lock-based | Lock-free |
---|---|---|
Asynchronous (separate thread) | 1 | 2 |
Synchronous (at enqueue) | 3 | 4 |
Realization
Class diagram for a lock-free RingBuffer with separate eviction and analysis threads (Approach 2):
We chose the synchronous size calculation at enqueueing.
Completely lock-free approach
Non-blocking is believed to give better performance in highly multi-threaded environments.
The queue itself is implemented exactly as described in the approach section above: a lazy, two-step enqueue and dequeue based on compare-and-set (see Section 10.5 "An Unbounded Lock-Free Queue" in [1], Section 2 in [2], and the sketch in the approach section).
Lock-free enqueue and lock-based dequeue
The following diagram shows how the enqueueing and dequeueing are performed without using locks.
...
However, it is impossible to implement the eviction functionality within this approach without adding a lock. This lock is shared by dequeue and evict, so that these two operations cannot happen at the same time. Nevertheless, the enqueue remains lock-free and should therefore be very fast.
The completely lock-free approach described above avoids this lock entirely.
...
Lock-based approach
As introduced in the approach section, this queue uses two locks, one for enqueueing and one for dequeueing; it is livelock-free but suffers from high lock contention.
The two locks are:
- tail lock - used by enqueue and dequeue
- head lock - used by dequeue and evict
Next, the three operations performed on the buffer in this lock-based implementation are described step by step; a combined code sketch follows the three lists.
Enqueue:
- allocate node, compute and set size
- lock tail lock
- link node to tail
- swing tail to new node
- unlock tail lock
- change buffer size
- signal eviction if buffer is full
Dequeue:
- if buffer is empty, return
- lock tail lock
- lock head lock
- get the reference of the first element
- swing head and tail to dummy node
- change size to 0
- unlock head lock
- unlock tail lock
- go through buffer, add objects to list
- return list of dequeued nodes
Evict:
- wait until there is need for eviction
- lock head lock
- go through list until the required eviction size is reached
- change head to new first element
- unlock head lock
- subtract evicted size from total buffer size
- we assume here that eviction will never empty the buffer, so it does not need to acquire the tail lock
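The following sketch combines the three step lists above into one class; all names are illustrative assumptions, and the condition-based eviction signalling from the earlier sketch is reduced to a stub here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the two-lock buffer following the step lists above.
public class TwoLockRingBuffer<T> {

    private static class Node<T> {
        final T value;
        final long size;
        volatile Node<T> next;
        Node(T value, long size) { this.value = value; this.size = size; }
    }

    private final ReentrantLock tailLock = new ReentrantLock(); // enqueue, dequeue
    private final ReentrantLock headLock = new ReentrantLock(); // dequeue, evict
    private final AtomicLong totalSize = new AtomicLong();
    private final long capacity;
    private Node<T> head = new Node<>(null, 0);                 // dummy sentinel
    private Node<T> tail = head;

    public TwoLockRingBuffer(long capacity) { this.capacity = capacity; }

    public void enqueue(T value, long size) {
        Node<T> node = new Node<>(value, size);     // allocate node, set size
        tailLock.lock();
        try {
            tail.next = node;                       // link node to tail
            tail = node;                            // swing tail to new node
        } finally {
            tailLock.unlock();
        }
        if (totalSize.addAndGet(size) > capacity) { // change buffer size
            signalEviction();                       // signal eviction if buffer is full
        }
    }

    public List<T> dequeue() {
        Node<T> first;
        tailLock.lock();                            // lock tail lock
        headLock.lock();                            // lock head lock
        try {
            first = head.next;                      // reference of the first element
            Node<T> dummy = new Node<>(null, 0);    // (null if the buffer is empty)
            head = dummy;                           // swing head and tail to dummy
            tail = dummy;
            totalSize.set(0);                       // change size to 0
        } finally {
            headLock.unlock();
            tailLock.unlock();
        }
        List<T> result = new ArrayList<>();
        for (Node<T> n = first; n != null; n = n.next) {
            result.add(n.value);                    // go through buffer, add objects to list
        }
        return result;                              // return list of dequeued nodes
    }

    public void evict(long sizeToEvict) {
        long evicted = 0;
        headLock.lock();
        try {
            // per the assumption above, eviction never empties the buffer,
            // so the tail lock is not acquired here
            Node<T> n = head.next;
            while (n != null && evicted < sizeToEvict) {
                evicted += n.size;                  // walk until the eviction size is reached
                n = n.next;
            }
            head.next = n;                          // change head to new first element
        } finally {
            headLock.unlock();
        }
        totalSize.addAndGet(-evicted);              // subtract evicted size from total size
    }

    protected void signalEviction() { /* wake the eviction thread, e.g. via a Condition */ }
}
```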
Circular Buffer
Another idea involves a preallocated circular linked list that also uses sentinel nodes for the first and last element, namely head and tail. The eviction is done implicitly by overwriting old elements when the buffer becomes full. The drawback is that this approach can only be used with the limiting strategy based on the number of elements. The figure below explains the approach, and a minimal code sketch follows.
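For illustration only: the text describes a circular linked list with sentinel nodes, but the simplified sketch below uses a preallocated array to show the same idea of implicit eviction by overwriting; all names are assumptions.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a preallocated circular buffer limited by element count;
// when full, the oldest element is overwritten (implicit eviction).
public class CircularBuffer<T> {

    private final Object[] slots;
    private final ReentrantLock lock = new ReentrantLock(); // single lock for brevity
    private int head;   // index of the oldest element
    private int count;  // number of occupied slots

    public CircularBuffer(int capacity) {
        slots = new Object[capacity];
    }

    public void add(T element) {
        lock.lock();
        try {
            int index = (head + count) % slots.length;
            slots[index] = element;
            if (count < slots.length) {
                count++;
            } else {
                head = (head + 1) % slots.length;   // implicit eviction of the oldest element
            }
        } finally {
            lock.unlock();
        }
    }

    @SuppressWarnings("unchecked")
    public T removeOldest() {
        lock.lock();
        try {
            if (count == 0) {
                return null;                        // buffer is empty
            }
            T value = (T) slots[head];
            slots[head] = null;
            head = (head + 1) % slots.length;
            count--;
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```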
Correctness
Checking for possible inconsistent states of the "completely lock-free" approach by analyzing what happens when two operations are executed in parallel:
- Dequeue and Evict
- both CAS head first:
- if dequeue succeeds, no eviction is done (first element is different => CAS in eviction fails)
- if evict succeeds, dequeue fails this time and succeeds next time it tries => the evicted elements are not dequeued
- several enqueues
- CAS tail is always the first operation, therefore only one thread can succeed in doing the operation
- the next thread will change tail and link the element after the last element added
- if buffer is empty i.e. (last == dummy) => head is changed only if CAS tail succeeds
- enqueue and dequeue
- if buffer is empty, no dequeue is done (head == dummy)
- dequeue: CAS head to dummy; if this succeeds, set tail to dummy
- if an enqueue is done before CAS head => OK, because enqueue only changes tail (if buffer is empty, enqueue can change head, but no dequeue in that case)
- if an enqueue is done after CAS head, but before set tail, then the element added will be dequeued as well, because it will be added at the end of the buffer.
- when dequeue finally sets tail to dummy, we have an empty buffer
- if an enqueue is done after head and tail are set in dequeue, we have an empty buffer and everything is OK
- after enqueue CAS tail to the new node, it links the old tail to the new node: this should be OK, and it should happen before the dequeuer reaches that node (because the dequeuer has to first set the tail and then go through the list)
- for a while after a dequeue, the size might be greater than the real size (because the size is changed only after setting head and tail, going through the list and calculating the size of the dequeued elements) => but evict checks whether head is dummy
- what if eviction happens after dequeue has changed head and tail, but not the size, and some enqueues were already done?
- shouldEvict checks if size > capacity, which will be true, but the real size is much smaller! The eviction will then be performed even though it was not needed, and we lose some valid new elements!
Testing
Issues that need to be checked:
- Performance
- measure time for different numbers of threads
- compare the different approaches
- Correctness (harder to check)
- linearizability
- thread safety
- no deadlocks
- no livelocks
- the number/size of elements enqueued has to be equal to the sum of number/size of elements dequeued, evicted and remaining in the buffer at the end of the test:
- enqueued = dequeued + evicted + remaining
The general structure of a test application for the RingBuffer is as follows:
...
The threads run for some time (e.g. 1 minute) and are then interrupted, and the statistics about the total numbers of elements added, retrieved and evicted are printed out.
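A minimal, hedged sketch of such a harness is shown below; it uses a ConcurrentLinkedQueue as a stand-in for the RingBuffer and omits eviction, so the invariant reduces to enqueued = dequeued + remaining. The real tests use the RingBuffer and an additional eviction counter as described above.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: several producers and one consumer run for a fixed time; at the
// end the counters must satisfy the invariant so that no element is lost.
public class RingBufferHarness {

    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<Integer> buffer = new ConcurrentLinkedQueue<>();
        AtomicLong enqueued = new AtomicLong();
        AtomicLong dequeued = new AtomicLong();

        Runnable producer = () -> {
            while (!Thread.currentThread().isInterrupted()) {
                buffer.offer(42);
                enqueued.incrementAndGet();
            }
        };
        Runnable consumer = () -> {
            while (!Thread.currentThread().isInterrupted()) {
                if (buffer.poll() != null) {
                    dequeued.incrementAndGet();
                }
            }
        };

        Thread[] threads = { new Thread(producer), new Thread(producer), new Thread(consumer) };
        for (Thread t : threads) t.start();
        TimeUnit.SECONDS.sleep(5);              // run for the test duration
        for (Thread t : threads) t.interrupt(); // then interrupt all threads
        for (Thread t : threads) t.join();

        long remaining = buffer.size();
        System.out.printf("enqueued=%d dequeued=%d remaining=%d%n",
                enqueued.get(), dequeued.get(), remaining);
        // invariant: enqueued = dequeued + remaining (no elements lost)
        System.out.println("invariant holds: "
                + (enqueued.get() == dequeued.get() + remaining));
    }
}
```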
Results
Comparing the number or size of elements added in 1 min for the three implementations (see the figures below), it was observed that the lock-based implementation is the least efficient, while the other two implementations give similar results: with smaller numbers of threads the one that uses a lock for dequeue and eviction proves to be better, but overall the completely lock-free implementation gives slightly better results, as expected. The tests use the "Limit by number" strategy for the buffer size. The figures are based on the average values of three runs for each case.
Microbenchmarking with Caliper
As discussed in the meeting /wiki/spaces/ARCHIVE/pages/5019155, the results for the lock-free enqueue with lock-based dequeue and evict are not always explainable, because the lock-based dequeue and evict is faster than the non-locking approach in some cases. We decided to go for microbenchmarking [6], which measures the performance of a very small piece of code. The results, however, were not very conclusive. The lock-based approach is definitely the most inefficient, but the other two implementations, the completely lock-free one and the one with lock-based dequeue and evict, usually give the same results. This can be explained by the rare occurrence of the dequeueing operation: whether it uses a lock or not, it does not happen often enough to have an impact on the overall performance.
Below, some results of benchmarking with Caliper are presented. We tried different capacities, different sending intervals, and different numbers/sizes of elements added, but the results do not give more insight into which method is better.
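For reference, a skeleton of how such a benchmark could look, assuming the old Caliper 0.x API that produces the output below; the parameter and benchmark names mirror the scenarios, but the method bodies are placeholders, not the actual benchmark code.

```java
import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

// Skeleton of a Caliper benchmark comparing the two buffer implementations.
public class RingBufferBenchmark extends SimpleBenchmark {

    @Param int bufferCapacity;   // e.g. 10000, passed as -DbufferCapacity=...
    @Param int numElementsToAdd; // e.g. 400000
    @Param int numThreads;       // e.g. 16, 32, 64

    public void timeCompletelyLockFree(int reps) {
        for (int i = 0; i < reps; i++) {
            // start numThreads threads that add numElementsToAdd elements
            // to the completely lock-free buffer, then join them
        }
    }

    public void timeLockFreeEnqueue(int reps) {
        for (int i = 0; i < reps; i++) {
            // same workload against the lock-free enqueue / lock-based
            // dequeue and evict implementation
        }
    }

    public static void main(String[] args) {
        Runner.main(RingBufferBenchmark.class, args);
    }
}
```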
```
0% Scenario{vm=java, trial=0, benchmark=CompletelyLockFree, bufferCapacity=10000, numElementsToAdd=400000, numThreads=16} 2502033754.00 ns; s=815687.04 ns @ 3 trials
17% Scenario{vm=java, trial=0, benchmark=LockFreeEnqueue, bufferCapacity=10000, numElementsToAdd=400000, numThreads=16} 2501636021.00 ns; s=645694.28 ns @ 3 trials
33% Scenario{vm=java, trial=0, benchmark=CompletelyLockFree, bufferCapacity=10000, numElementsToAdd=400000, numThreads=32} 2503686894.00 ns; s=620891.89 ns @ 3 trials
50% Scenario{vm=java, trial=0, benchmark=LockFreeEnqueue, bufferCapacity=10000, numElementsToAdd=400000, numThreads=32} 2502442691.00 ns; s=606238.69 ns @ 3 trials
67% Scenario{vm=java, trial=0, benchmark=CompletelyLockFree, bufferCapacity=10000, numElementsToAdd=400000, numThreads=64} 2512078430.00 ns; s=24169268.96 ns @ 7 trials
83% Scenario{vm=java, trial=0, benchmark=LockFreeEnqueue, bufferCapacity=10000, numElementsToAdd=400000, numThreads=64} 2702248773.00 ns; s=114828734.41 ns @ 10 trials
numThreads benchmark s linear runtime
16 CompletelyLockFree 2.50 ===========================
16 LockFreeEnqueue 2.50 ===========================
32 CompletelyLockFree 2.50 ===========================
32 LockFreeEnqueue 2.50 ===========================
64 CompletelyLockFree 2.51 ===========================
64 LockFreeEnqueue 2.70 ==============================
vm: java
trial: 0
bufferCapacity: 10000
numElementsToAdd: 400000
dequeue every 500ms
0% Scenario{vm=java, trial=0, benchmark=CompletelyLockFree, bufferCapacity=20000, numElementsToAdd=800000, numThreads=16} 2503758676.00 ns; s=581886.19 ns @ 3 trials
17% Scenario{vm=java, trial=0, benchmark=LockFreeEnqueue, bufferCapacity=20000, numElementsToAdd=800000, numThreads=16} 2504004679.00 ns; s=24777692.23 ns @ 6 trials
33% Scenario{vm=java, trial=0, benchmark=CompletelyLockFree, bufferCapacity=20000, numElementsToAdd=800000, numThreads=32} 2507706517.00 ns; s=5005925.11 ns @ 3 trials
50% Scenario{vm=java, trial=0, benchmark=LockFreeEnqueue, bufferCapacity=20000, numElementsToAdd=800000, numThreads=32} 2577092776.50 ns; s=58717140.01 ns @ 10 trials
67% Scenario{vm=java, trial=0, benchmark=CompletelyLockFree, bufferCapacity=20000, numElementsToAdd=800000, numThreads=64} 4825908109.00 ns; s=40336488.36 ns @ 3 trials
83% Scenario{vm=java, trial=0, benchmark=LockFreeEnqueue, bufferCapacity=20000, numElementsToAdd=800000, numThreads=64} 4668499674.50 ns; s=457395064.90 ns @ 10 trials
numThreads benchmark s linear runtime
16 CompletelyLockFree 2.50 ===============
16 LockFreeEnqueue 2.50 ===============
32 CompletelyLockFree 2.51 ===============
32 LockFreeEnqueue 2.58 ================
64 CompletelyLockFree 4.83 ==============================
64 LockFreeEnqueue 4.67 =============================
vm: java
trial: 0
bufferCapacity: 20000
numElementsToAdd: 800000
```
Therefore, we opted for the completely lock-free implementation to be used in the inspectIT agent. The functionality tests and integration tests presented in the next section concern this approach.
Functionality and Integration Testing
The tests were implemented using TestNG [7] and thoroughly exercise the completely lock-free implementation, to ensure that it is ready to be used in the agent.
The functionality tests aim to ensure that all the units in the RingBuffer work correctly. The tests include:
- initialisation
- adding one element
- object size calculation
- trying to add a negative sized element
- maintaining the adding order of elements at dequeue time
- dequeueing
- two consecutive dequeues
- checking if the eviction starts once the buffer is full
Each test checks whether size and shouldEvict have the expected values.
For the limiting strategy, the mocking framework Mockito [8] was used.
The integration tests check whether everything works as expected when multiple threads add elements to the buffer while one thread takes all the elements and another one deals with the eviction. The tests are run for a longer period of time (e.g. 1 min) with a variable number of threads; at the end we check whether any elements were lost in the process, i.e. whether the number/size of elements added equals the sum of the number/size of elements dequeued, evicted and remaining in the buffer.
Building and Running
The projects can be built using the Eclipse compiler. The Ivy dependency management system takes care of all the dependencies, with the exception of TestNG, which should be on the build path via the Eclipse plugin.
The functional and integration test configurations can be found in the corresponding XML files, which can simply be run from Eclipse.
Running the Caliper benchmarks can be done by launching the class BenchmarkRunner with the input parameters given as an example in the file.
References
[1] Maurice Herlihy, Nir Shavit, "The Art of Multiprocessor Programming, Revised Reprint"
[2] Maged M. Michael, Michael L. Scott, "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms"
[3] ConcurrentLinkedQueue implementation in Java
[4] CMR buffer specifications
[5] Calculation of object sizes
[6] Microbenchmarking with Caliper
[7] TestNG
[8] Mockito