scanlime / Fadecandy
Commit 074e58f1, authored 11 years ago by Micah Elizabeth Scott
Optimized lutInterpolate, using SMUADX
Parent: 5ca648fd
Showing 2 changed files with 40 additions and 5 deletions:
firmware/fc_pixel.cpp: 37 additions, 4 deletions
firmware/fc_usb.cpp: 3 additions, 1 deletion
firmware/fc_pixel.cpp
@@ -34,14 +34,47 @@ ALWAYS_INLINE static inline uint32_t lutInterpolate(const uint16_t *lut, uint32_
      * input of 0x10000, which can't quite be reached.
      *
      * 'arg' is in the range [0, 0xFFFF]
+     *
+     * This operation is equivalent to the following:
+     *
+     *      unsigned index = arg >> 8;          // Range [0, 0xFF]
+     *      unsigned alpha = arg & 0xFF;        // Range [0, 0xFF]
+     *      unsigned invAlpha = 0x100 - alpha;  // Range [1, 0x100]
+     *
+     *      // Result in range [0, 0xFFFF]
+     *      return (lut[index] * invAlpha + lut[index + 1] * alpha) >> 8;
+     *
+     * This is easy to understand, but it turns out to be a serious bottleneck
+     * in terms of speed and memory bandwidth, as well as register pressure that
+     * affects the compilation of updatePixel().
+     *
+     * To speed this up, we try and do the lut[index] and lut[index+1] portions
+     * in parallel using the SMUAD instruction. This is a pair of 16x16 multiplies,
+     * and the results are added together. We can combine this with an unaligned load
+     * to grab two adjacent entries from the LUT. The remaining complications are:
+     *
+     *   1. We wanted unsigned, not signed
+     *   2. We still need to generate the input values efficiently.
+     *
+     * (1) is easy to solve if we're okay with 15-bit precision for the LUT instead
+     * of 16-bit, which is fine. During LUT preparation, we right-shift each entry
+     * by 1, keeping them within the positive range of a signed 16-bit int.
+     *
+     * For (2), we need to quickly put 'alpha' in the high halfword and invAlpha in
+     * the low halfword, or vice versa. One fast way to do this is (0x01000000 + x - (x << 16)).
      */
-    unsigned index = arg >> 8;          // Range [0, 0xFF]
+    uint32_t index = arg >> 8;          // Range [0, 0xFF]
+    // Load lut[index] into low halfword, lut[index+1] into high halfword.
+    uint32_t pair = *(const uint32_t*)(lut + index);
     unsigned alpha = arg & 0xFF;        // Range [0, 0xFF]
-    unsigned invAlpha = 0x100 - alpha;  // Range [1, 0x100]
-    // Result in range [0, 0xFFFF]
-    return (lut[index] * invAlpha + lut[index + 1] * alpha) >> 8;
+    // Reversed halfword order
+    uint32_t pairAlpha = (0x01000000 + alpha - (alpha << 16));
+    return __SMUADX(pairAlpha, pair) >> 7;
 }
 static uint32_t updatePixel(uint32_t icPrev, uint32_t icNext,
 ...
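
The packing trick and the SMUADX formulation described in the comment above can be checked off-target. The following is a minimal sketch, not the firmware code. It assumes a 257-entry LUT, emulates the instruction in portable C++ instead of calling `__SMUADX`, and assembles the LUT pair explicitly instead of using an unaligned load. The names `smuadxEmulated`, `packAlpha`, `interpolateReference`, `interpolatePacked`, and `worstCaseError` are hypothetical, and the quadratic ramp is made-up test data.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical portable stand-in for the ARM SMUADX instruction:
// (low half of a) * (high half of b) + (high half of a) * (low half of b),
// treating every halfword as a signed 16-bit value.
static int32_t smuadxEmulated(uint32_t a, uint32_t b)
{
    int32_t aLo = static_cast<int16_t>(a & 0xFFFF);
    int32_t aHi = static_cast<int16_t>(a >> 16);
    int32_t bLo = static_cast<int16_t>(b & 0xFFFF);
    int32_t bHi = static_cast<int16_t>(b >> 16);
    return aLo * bHi + aHi * bLo;
}

// Packs invAlpha (0x100 - alpha) into the high halfword and alpha into the low one.
static uint32_t packAlpha(uint32_t alpha)
{
    return 0x01000000 + alpha - (alpha << 16);
}

// Straightforward interpolation over a full 16-bit LUT with 257 entries.
static uint32_t interpolateReference(const uint16_t *lut16, uint32_t arg)
{
    uint32_t index = arg >> 8;          // Range [0, 0xFF]
    uint32_t alpha = arg & 0xFF;        // Range [0, 0xFF]
    uint32_t invAlpha = 0x100 - alpha;  // Range [1, 0x100]
    return (lut16[index] * invAlpha + lut16[index + 1] * alpha) >> 8;
}

// Same interpolation over a LUT whose entries were pre-shifted right by 1.
static uint32_t interpolatePacked(const uint16_t *lut15, uint32_t arg)
{
    assert(arg <= 0xFFFF);
    uint32_t index = arg >> 8;
    uint32_t alpha = arg & 0xFF;
    // Assemble the pair explicitly (lut15[index] low, lut15[index+1] high) so the
    // sketch does not depend on host endianness or alignment rules.
    uint32_t pair = lut15[index] | (static_cast<uint32_t>(lut15[index + 1]) << 16);
    return static_cast<uint32_t>(smuadxEmulated(packAlpha(alpha), pair)) >> 7;
}

// Largest difference between the two versions over every possible 'arg',
// using a made-up quadratic ramp as the LUT contents.
static uint32_t worstCaseError()
{
    uint16_t lut16[257], lut15[257];
    for (uint32_t i = 0; i < 257; ++i) {
        lut16[i] = static_cast<uint16_t>((static_cast<uint64_t>(i) * i * 0xFFFF) >> 16);
        lut15[i] = lut16[i] >> 1;
    }
    uint32_t worst = 0;
    for (uint32_t arg = 0; arg <= 0xFFFF; ++arg) {
        uint32_t ref = interpolateReference(lut16, arg);
        uint32_t fast = interpolatePacked(lut15, arg);
        uint32_t diff = ref > fast ? ref - fast : fast - ref;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

Dropping the low bit of each entry removes at most 0x100 from the weighted sum before the final shift, so the packed result should never exceed the reference and should trail it by at most one count. The final shift is 7 rather than 8 because the entries were already halved.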
firmware/fc_usb.cpp
@@ -109,9 +109,11 @@ void fcBuffers::finalizeLUT()
      * To keep LUT lookups super-fast, we copy the LUT into a linear array at this point.
      * LUT changes are intended to be infrequent (initialization or configuration-time only),
      * so this isn't a performance bottleneck.
+     *
+     * Note the right shift by 1. See lutInterpolate() for an explanation.
      */
     for (unsigned i = 0; i < LUT_TOTAL_SIZE; ++i) {
-        lutCurrent.entries[i] = lutNew.entry(i);
+        lutCurrent.entries[i] = lutNew.entry(i) >> 1;
     }
 }
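
The reason for the shift can be illustrated in isolation. This is a small sketch under assumptions: `asSignedHalfword` and `prepareEntry` are hypothetical helpers, not functions from fc_usb.cpp. They show how a signed 16x16 multiply would read a raw LUT entry versus a pre-shifted one.

```cpp
#include <cstdint>

// Value a signed 16-bit multiply operand would take for a given raw halfword
// (two's complement interpretation, written out explicitly for portability).
static int32_t asSignedHalfword(uint16_t entry)
{
    return entry >= 0x8000 ? static_cast<int32_t>(entry) - 0x10000
                           : static_cast<int32_t>(entry);
}

// Hypothetical helper mirroring the shift applied to each entry in finalizeLUT().
static uint16_t prepareEntry(uint16_t entry)
{
    return entry >> 1;
}
```

Any raw entry at or above 0x8000 would be read as a negative number by a signed multiply, while every pre-shifted entry stays in [0, 0x7FFF] and keeps its sign, at the cost of the lowest bit of precision.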