Viterbi - performance
The DeViterbi method consumes most of the CPU time in sdrdab. Although it works in real time on PCs, there may not be enough computing power on smaller devices.
The current Viterbi decoder implementation has been optimized using SSE (Streaming SIMD Extensions). SSE introduces 8 (on x86) or 16 (on x86-64, in 64-bit mode only) additional XMM registers dedicated to operations on floating-point data. Each register is 128 bits wide, so it can hold 4 single-precision floating-point numbers at once. Algorithms employing SSE can take advantage of performing multiple operations on packed data in parallel: e.g. if XMM0 holds the numbers a1, b1, c1, d1 and XMM1 holds a2, b2, c2, d2, then a single instruction can compute a1 * a2, b1 * b2, c1 * c2, d1 * d2 at once.
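As an illustration, such a packed multiply can be expressed with SSE intrinsics as follows (a minimal sketch with illustrative names, not code from sdrdab):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Multiplies two 4-element float vectors element-wise with a single
// packed-multiply instruction. Names (a, b, out) are illustrative.
void packed_mul(const float a[4], const float b[4], float out[4]) {
    __m128 va = _mm_loadu_ps(a);     // va = (a0, a1, a2, a3)
    __m128 vb = _mm_loadu_ps(b);     // vb = (b0, b1, b2, b3)
    __m128 vp = _mm_mul_ps(va, vb);  // vp = (a0*b0, a1*b1, a2*b2, a3*b3)
    _mm_storeu_ps(out, vp);
}
```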
The most time-consuming operation in DeViterbi is calculating the dot product of two 4-element vectors twice in order to decide which path is better, by comparing the two results. The refOut vectors always consist of ±1 values, so in fact they only affect the signs of the elements of the input vector.
The algorithm therefore does not perform any multiplications: the signs of the input elements are flipped according to the corresponding elements of the refOut vector. Next, the elements are added horizontally by performing a sequence of shuffles and adds. Let's take a closer look at both steps.
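Because refOut holds only ±1, the "multiplication" can be replaced by flipping sign bits with a XOR. A minimal sketch of the idea (assuming refOut is already loaded as ±1.0f floats; this is not the actual sdrdab code):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Flips the sign of each element of `in` wherever the corresponding
// element of `ref` is negative (ref holds only +1.0f / -1.0f values).
// Equivalent to in * ref, but without any multiplication.
__m128 flip_signs(__m128 in, __m128 ref) {
    const __m128 sign_mask = _mm_set1_ps(-0.0f);  // only the IEEE-754 sign bit set
    // Extract the sign bits of ref, then XOR them into in.
    return _mm_xor_ps(in, _mm_and_ps(ref, sign_mask));
}
```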
Assume we have a register V1 holding the vector (A, B, C, D) and we want to add all of its elements. First, we execute __m128 V2 = _mm_movehdup_ps(V1), which duplicates the odd-indexed values of V1 (indices start from 0) into V2, giving (B, B, D, D).
Next, we add the two vectors: __m128 V3 = _mm_add_ps(V1, V2), so V3 = (A+B, 2B, C+D, 2D). Now we can reuse V2 and copy the upper two elements of V3 into the lower elements of V2 by doing V2 = _mm_movehl_ps(V2, V3). Finally, we add just the lowest elements, V3 = _mm_add_ss(V3, V2), to obtain our sum, A + B + C + D, in the lowest element of V3.
We also need to add the corresponding accMetricBuff value (which is loaded into an XMM register as well) to get the final result; again, this can be done with the _mm_add_ss function. The value held in the 0-th element of a vector can be extracted with the float _mm_cvtss_f32(__m128 a) function.
At last, the decision is made by comparing the two resulting path metrics.
The functions used above (like _mm_add_ps) are called intrinsics: functions that are treated specially by the compiler. SSE intrinsics translate almost directly to assembly. Usually, code generated from intrinsics is more efficient than assembly written by hand, because the compiler can still optimize it.
SSE enables the decoder to run 2 times, and in some configurations more than 3 times, faster than the scalar (non-optimized) version. The exact speed-up depends on the hardware, system and compiler configuration, but in all observed cases it was bigger than the speed-up obtained from the compiler's automatic vectorization of the scalar code.
Possible further improvements:
- Implementation of another algorithm, e.g. Lazy Viterbi or BCJR.
- Speeding up the Viterbi algorithm further at the cost of some accuracy.
- Implementation of the depuncturer inside the DeViterbi function.