Commit 074e58f1 authored by Micah Elizabeth Scott

Optimized lutInterpolate, using SMUADX

parent 5ca648fd
@@ -34,14 +34,47 @@ ALWAYS_INLINE static inline uint32_t lutInterpolate(const uint16_t *lut, uint32_
  * input of 0x10000, which can't quite be reached.
  *
  * 'arg' is in the range [0, 0xFFFF]
+ *
+ * This operation is equivalent to the following:
+ *
+ *   unsigned index = arg >> 8;          // Range [0, 0xFF]
+ *   unsigned alpha = arg & 0xFF;        // Range [0, 0xFF]
+ *   unsigned invAlpha = 0x100 - alpha;  // Range [1, 0x100]
+ *
+ *   // Result in range [0, 0xFFFF]
+ *   return (lut[index] * invAlpha + lut[index + 1] * alpha) >> 8;
+ *
+ * This is easy to understand, but it turns out to be a serious bottleneck
+ * in terms of speed and memory bandwidth, as well as register pressure that
+ * affects the compilation of updatePixel().
+ *
+ * To speed this up, we try to do the lut[index] and lut[index+1] portions
+ * in parallel using the SMUAD instruction. This is a pair of 16x16 multiplies,
+ * and the results are added together. We can combine this with an unaligned load
+ * to grab two adjacent entries from the LUT. The remaining complications are:
+ *
+ * 1. We want unsigned, not signed multiplies.
+ * 2. We still need to generate the input values efficiently.
+ *
+ * (1) is easy to solve if we're okay with 15-bit precision for the LUT instead
+ * of 16-bit, which is fine. During LUT preparation, we right-shift each entry
+ * by 1, keeping them within the positive range of a signed 16-bit int.
+ *
+ * For (2), we need to quickly put 'alpha' in the high halfword and invAlpha in
+ * the low halfword, or vice versa. One fast way to do this is (0x01000000 + x - (x << 16)).
  */
-    unsigned index = arg >> 8;          // Range [0, 0xFF]
+    uint32_t index = arg >> 8;          // Range [0, 0xFF]
+
+    // Load lut[index] into low halfword, lut[index+1] into high halfword.
+    uint32_t pair = *(const uint32_t*)(lut + index);
+
     unsigned alpha = arg & 0xFF;        // Range [0, 0xFF]
-    unsigned invAlpha = 0x100 - alpha;  // Range [1, 0x100]

-    // Result in range [0, 0xFFFF]
-    return (lut[index] * invAlpha + lut[index + 1] * alpha) >> 8;
+    // Reversed halfword order
+    uint32_t pairAlpha = (0x01000000 + alpha - (alpha << 16));
+
+    return __SMUADX(pairAlpha, pair) >> 7;
 }

 static uint32_t updatePixel(uint32_t icPrev, uint32_t icNext,
...
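As a reading aid only, here is a rough portable sketch of what the optimized path above computes, checked against the naive formula from the comment. Nothing in it is firmware code: smuadxEmulated and lutInterpolateReference are names invented for this note, and the model assumes SMUADX's cross halfword multiply-accumulate (bottom*top + top*bottom) plus the little-endian halfword order that the unaligned 32-bit load relies on.

#include <assert.h>
#include <stdint.h>

// Rough emulation of the SMUADX cross products: bottom(a)*top(b) + top(a)*bottom(b).
static uint32_t smuadxEmulated(uint32_t a, uint32_t b)
{
    int32_t aLo = (int16_t)(a & 0xFFFF), aHi = (int16_t)(a >> 16);
    int32_t bLo = (int16_t)(b & 0xFFFF), bHi = (int16_t)(b >> 16);
    return (uint32_t)(aLo * bHi + aHi * bLo);
}

// Reference model of the optimized interpolation. 'lut15' holds entries that
// were already right-shifted by 1, as finalizeLUT() now stores them.
static uint32_t lutInterpolateReference(const uint16_t *lut15, uint32_t arg)
{
    uint32_t index = arg >> 8;      // Range [0, 0xFF]
    uint32_t alpha = arg & 0xFF;    // Range [0, 0xFF]

    // What the unaligned 32-bit load sees on a little-endian core:
    // lut15[index] in the low halfword, lut15[index + 1] in the high halfword.
    uint32_t pair = lut15[index] | ((uint32_t)lut15[index + 1] << 16);

    // 0x01000000 + alpha - (alpha << 16) packs alpha into the low halfword
    // and invAlpha = 0x100 - alpha into the high halfword.
    uint32_t pairAlpha = 0x01000000 + alpha - (alpha << 16);

    // Cross multiply-accumulate gives
    //   alpha * lut15[index + 1] + (0x100 - alpha) * lut15[index],
    // and the shift by 7 rather than 8 undoes the LUT's 1-bit pre-shift.
    return smuadxEmulated(pairAlpha, pair) >> 7;
}

int main(void)
{
    // Toy two-entry table, stored 15-bit like the firmware's LUT.
    const uint16_t lut15[2] = { 0x1000 >> 1, 0x9000 >> 1 };

    for (uint32_t arg = 0; arg <= 0xFF; ++arg) {
        uint32_t alpha = arg & 0xFF, invAlpha = 0x100 - alpha;
        uint32_t naive = (lut15[0] * invAlpha + lut15[1] * alpha) >> 7;
        assert(lutInterpolateReference(lut15, arg) == naive);
    }
    return 0;
}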
@@ -109,9 +109,11 @@ void fcBuffers::finalizeLUT()
  * To keep LUT lookups super-fast, we copy the LUT into a linear array at this point.
  * LUT changes are intended to be infrequent (initialization or configuration-time only),
  * so this isn't a performance bottleneck.
+ *
+ * Note the right shift by 1. See lutInterpolate() for an explanation.
  */
     for (unsigned i = 0; i < LUT_TOTAL_SIZE; ++i) {
-        lutCurrent.entries[i] = lutNew.entry(i);
+        lutCurrent.entries[i] = lutNew.entry(i) >> 1;
     }
 }
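And one more reading aid on the scaling: halving the stored entries here and shifting by 7 instead of 8 in lutInterpolate() keeps the interpolated result at the same 16-bit scale, losing at most one LSB to the pre-shift. A minimal standalone sketch of that claim, with checkScaling being a hypothetical name rather than firmware code:

#include <assert.h>
#include <stdint.h>

// Compare the original 16-bit interpolation against the 15-bit-LUT arrangement.
static void checkScaling(uint16_t a16, uint16_t b16, uint32_t alpha)
{
    uint32_t invAlpha = 0x100 - alpha;

    // Original form: full 16-bit entries, final shift by 8.
    uint32_t full = (a16 * invAlpha + b16 * alpha) >> 8;

    // New form: entries pre-shifted by 1 in finalizeLUT(), final shift by 7.
    uint32_t half = ((a16 >> 1) * invAlpha + (b16 >> 1) * alpha) >> 7;

    // At most one LSB is lost when the entries are halved.
    assert(full >= half && full - half <= 1);
}

int main(void)
{
    for (uint32_t alpha = 0; alpha <= 0xFF; ++alpha)
        checkScaling(0xFFFF, 0x1234, alpha);
    return 0;
}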