## Viterbi - performance

The `DeViterbi` method consumes most of the CPU time in *sdrdab*. Although it works in real time on PCs, smaller units may not have enough computing power.

### Optimization

The current Viterbi decoder implementation has been optimized using SSE (Streaming SIMD Extensions).

SSE introduces 8 (on 32-bit x86) or 16 (on x86-64, only in 64-bit mode) additional `XMM` registers dedicated to operations on floating-point data. Each register is 128 bits wide, so it can hold four single-precision floating-point numbers at once. Algorithms employing SSE can take advantage of performing multiple operations on packed data in parallel: e.g. if `XMM0` holds the numbers `a1, b1, c1, d1` and `XMM1` holds `a2, b2, c2, d2`, then it is possible to compute `a1 * a2, b1 * b2, c1 * c2, d1 * d2` at once.

The most time-consuming operation in `DeViterbi` is calculating the dot product of two 4-element vectors twice in order to decide which path is better, by comparing `Value1` and `Value2`.

As we can see, the `refOut` vectors always consist of `-1` and `1` values, so in fact they only affect the signs of the elements of the input vector.

As seen above, the algorithm does not perform any multiplications. The signs of the input elements are flipped according to the corresponding elements of the `refOut` vector. Next, the elements are horizontally added by performing a sequence of shuffles and additions. Let's take a closer look at this.

Assume we have a register holding the following vector `V1`. We want to add all of its elements. First, we perform `__m128 V2 = _mm_movehdup_ps(V1)` on our vector, which duplicates its odd-indexed values (indices start from 0) into `V2`.

Next, we add the vectors `V1` and `V2` with `__m128 V3 = _mm_add_ps(V1, V2)`.

Now we can reuse `V2` and copy the upper elements of `V3` into the lower elements of `V2` with `V2 = _mm_movehl_ps(V2, V3)`.

Finally, we add only the first elements of `V3` and `V2` with `V3 = _mm_add_ss(V3, V2)` to get our sum: `A + B + C + D`.

We also need to add the corresponding `accMetricBuff` value (which is loaded into an `XMM` register too) to get the final result. Again, it can be added with the `_mm_add_ss` function. The value held in the 0th element of a vector can be obtained using the `float _mm_cvtss_f32(__m128 a)` function.

At last, the decision is made by comparing `Value1` and `Value2`.

The functions used above (like `_mm_add_ps`) are called intrinsics: functions that are treated specially by the compiler. SSE intrinsics translate almost directly to assembly. Code generated from intrinsics is usually more efficient than assembly written by hand, because the compiler can still optimize it.

SSE makes the decoder run 2 to over 3 times faster than the scalar (non-optimized) version. The speed-up depends on hardware, system, and compiler configuration, but in all observed cases it was larger than the speed-up from automatic vectorization performed by the compiler on the scalar code.

### Possibilities of further improvements

- Implementation of another algorithm, e.g. Lazy Viterbi or BCJR.
- Speeding up the Viterbi algorithm at the cost of accuracy.
- Implementation of the depuncturer in the `DeViterbi` function.