Commit 074e58f1 authored by Micah Elizabeth Scott

Optimized lutInterpolate, using SMUADX

parent 5ca648fd
@@ -34,14 +34,47 @@ ALWAYS_INLINE static inline uint32_t lutInterpolate(const uint16_t *lut, uint32_
  * input of 0x10000, which can't quite be reached.
  *
  * 'arg' is in the range [0, 0xFFFF]
+ *
+ * This operation is equivalent to the following:
+ *
+ *   unsigned index = arg >> 8;          // Range [0, 0xFF]
+ *   unsigned alpha = arg & 0xFF;        // Range [0, 0xFF]
+ *   unsigned invAlpha = 0x100 - alpha;  // Range [1, 0x100]
+ *
+ *   // Result in range [0, 0xFFFF]
+ *   return (lut[index] * invAlpha + lut[index + 1] * alpha) >> 8;
+ *
+ * This is easy to understand, but it turns out to be a serious bottleneck
+ * in terms of speed and memory bandwidth, as well as register pressure that
+ * affects the compilation of updatePixel().
+ *
+ * To speed this up, we try to do the lut[index] and lut[index+1] portions
+ * in parallel using the SMUAD instruction. This is a pair of 16x16 multiplies,
+ * and the results are added together. We can combine this with an unaligned load
+ * to grab two adjacent entries from the LUT. The remaining complications are:
+ *
+ * 1. We want unsigned, not signed multiplies.
+ * 2. We still need to generate the input values efficiently.
+ *
+ * (1) is easy to solve if we're okay with 15-bit precision for the LUT instead
+ * of 16-bit, which is fine. During LUT preparation, we right-shift each entry
+ * by 1, keeping them within the positive range of a signed 16-bit int.
+ *
+ * For (2), we need to quickly put 'alpha' in the high halfword and invAlpha in
+ * the low halfword, or vice versa. One fast way to do this is (0x01000000 + x - (x << 16)).
  */
-    unsigned index = arg >> 8;          // Range [0, 0xFF]
+    uint32_t index = arg >> 8;          // Range [0, 0xFF]
+
+    // Load lut[index] into low halfword, lut[index+1] into high halfword.
+    uint32_t pair = *(const uint32_t*)(lut + index);
+
     unsigned alpha = arg & 0xFF;        // Range [0, 0xFF]
-    unsigned invAlpha = 0x100 - alpha;  // Range [1, 0x100]

-    // Result in range [0, 0xFFFF]
-    return (lut[index] * invAlpha + lut[index + 1] * alpha) >> 8;
+    // Reversed halfword order
+    uint32_t pairAlpha = (0x01000000 + alpha - (alpha << 16));
+
+    return __SMUADX(pairAlpha, pair) >> 7;
 }

 static uint32_t updatePixel(uint32_t icPrev, uint32_t icNext,
...
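As a reading aid only, here is a rough portable sketch of what the optimized path above computes, checked against the naive formula from the comment. Nothing in it is firmware code: smuadxEmulated and lutInterpolateReference are names invented for this note, and the model assumes SMUADX's cross halfword multiply-accumulate (bottom*top + top*bottom) plus the little-endian halfword order that the unaligned 32-bit load relies on.

#include <assert.h>
#include <stdint.h>

// Rough emulation of the SMUADX cross products: bottom(a)*top(b) + top(a)*bottom(b).
static uint32_t smuadxEmulated(uint32_t a, uint32_t b)
{
    int32_t aLo = (int16_t)(a & 0xFFFF), aHi = (int16_t)(a >> 16);
    int32_t bLo = (int16_t)(b & 0xFFFF), bHi = (int16_t)(b >> 16);
    return (uint32_t)(aLo * bHi + aHi * bLo);
}

// Reference model of the optimized interpolation. 'lut15' holds entries that
// were already right-shifted by 1, as finalizeLUT() now stores them.
static uint32_t lutInterpolateReference(const uint16_t *lut15, uint32_t arg)
{
    uint32_t index = arg >> 8;      // Range [0, 0xFF]
    uint32_t alpha = arg & 0xFF;    // Range [0, 0xFF]

    // What the unaligned 32-bit load sees on a little-endian core:
    // lut15[index] in the low halfword, lut15[index + 1] in the high halfword.
    uint32_t pair = lut15[index] | ((uint32_t)lut15[index + 1] << 16);

    // 0x01000000 + alpha - (alpha << 16) packs alpha into the low halfword
    // and invAlpha = 0x100 - alpha into the high halfword.
    uint32_t pairAlpha = 0x01000000 + alpha - (alpha << 16);

    // Cross multiply-accumulate gives
    //   alpha * lut15[index + 1] + (0x100 - alpha) * lut15[index],
    // and the shift by 7 rather than 8 undoes the LUT's 1-bit pre-shift.
    return smuadxEmulated(pairAlpha, pair) >> 7;
}

int main(void)
{
    // Toy two-entry table, stored 15-bit like the firmware's LUT.
    const uint16_t lut15[2] = { 0x1000 >> 1, 0x9000 >> 1 };

    for (uint32_t arg = 0; arg <= 0xFF; ++arg) {
        uint32_t alpha = arg & 0xFF, invAlpha = 0x100 - alpha;
        uint32_t naive = (lut15[0] * invAlpha + lut15[1] * alpha) >> 7;
        assert(lutInterpolateReference(lut15, arg) == naive);
    }
    return 0;
}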
@@ -109,9 +109,11 @@ void fcBuffers::finalizeLUT()
  * To keep LUT lookups super-fast, we copy the LUT into a linear array at this point.
  * LUT changes are intended to be infrequent (initialization or configuration-time only),
  * so this isn't a performance bottleneck.
+ *
+ * Note the right shift by 1. See lutInterpolate() for an explanation.
  */
     for (unsigned i = 0; i < LUT_TOTAL_SIZE; ++i) {
-        lutCurrent.entries[i] = lutNew.entry(i);
+        lutCurrent.entries[i] = lutNew.entry(i) >> 1;
     }
 }
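And one more reading aid on the scaling: halving the stored entries here and shifting by 7 instead of 8 in lutInterpolate() keeps the interpolated result at the same 16-bit scale, losing at most one LSB to the pre-shift. A minimal standalone sketch of that claim, with checkScaling being a hypothetical name rather than firmware code:

#include <assert.h>
#include <stdint.h>

// Compare the original 16-bit interpolation against the 15-bit-LUT arrangement.
static void checkScaling(uint16_t a16, uint16_t b16, uint32_t alpha)
{
    uint32_t invAlpha = 0x100 - alpha;

    // Original form: full 16-bit entries, final shift by 8.
    uint32_t full = (a16 * invAlpha + b16 * alpha) >> 8;

    // New form: entries pre-shifted by 1 in finalizeLUT(), final shift by 7.
    uint32_t half = ((a16 >> 1) * invAlpha + (b16 >> 1) * alpha) >> 7;

    // At most one LSB is lost when the entries are halved.
    assert(full >= half && full - half <= 1);
}

int main(void)
{
    for (uint32_t alpha = 0; alpha <= 0xFF; ++alpha)
        checkScaling(0xFFFF, 0x1234, alpha);
    return 0;
}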