In a different post, I looked at optimizing a particular algorithm using Arm Neon SIMD. If you are interested in how to use SIMD to convert RGB888 to RGB565, definitely take a look at that post.
While trying different things and figuring out how to reduce the execution time as much as possible, I did a deep dive into the generated assembly code for different implementations of that algorithm as well as the code that the compiler generates when optimizing (and vectorizing) the scalar version.
This post focuses on what I learned about compilers and how to make them do what you want.
1. Compilers are really smart
Modern compilers are pretty good at optimizing algorithms, at least trivial ones. They make use of architecture-specific instructions to get the most performance out of the target machine.
When optimizing this simple algorithm for example:
void rgb888_to_rgb565_scalar(Image *src, Image *dst) {

  for (int i = 0; i < src->img_width * src->img_height; i++) {
    uint8_t r = src->buffer[i * 3 + 0];
    uint8_t g = src->buffer[i * 3 + 1];
    uint8_t b = src->buffer[i * 3 + 2];

    // Add half the lost precision before truncating
    uint16_t r5 = (r + 4) >> 3; // +4 is half of 8 (2^3)
    uint16_t g6 = (g + 2) >> 2; // +2 is half of 4 (2^2)
    uint16_t b5 = (b + 4) >> 3; // +4 is half of 8 (2^3)

    // Clamp to prevent overflow
    if (r5 > 31)
      r5 = 31;
    if (g6 > 63)
      g6 = 63;
    if (b5 > 31)
      b5 = 31;

    uint16_t value = (uint16_t)((r5 << 11) | (g6 << 5) | b5);

    ((uint16_t *)dst->buffer)[i] = value;
  }
}
The compiler essentially generates 3 functions, or rather execution paths: one truly scalar version, one processing 8 pixels per iteration, and one processing 16 pixels per iteration. Since Arm Neon registers are 128 bits wide, sixteen 8-bit values are the most that can fit into one register. The whole assembly is in the Appendix; here I just want to draw your attention to a few small excerpts.
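Conceptually, this dispatch corresponds to the plain-C sketch below. The function and helper names are hypothetical, the "vector" loop bodies are stubbed out with scalar code (in the real output they use ld3.16b / ld3.8b plus the Neon arithmetic), and the real code additionally falls back to the scalar path when the buffers might overlap:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar per-pixel conversion, same math as the C source above. */
static uint16_t px(uint8_t r, uint8_t g, uint8_t b) {
    uint16_t r5 = (uint16_t)((r + 4) >> 3);
    uint16_t g6 = (uint16_t)((g + 2) >> 2);
    uint16_t b5 = (uint16_t)((b + 4) >> 3);
    if (r5 > 31) r5 = 31;
    if (g6 > 63) g6 = 63;
    if (b5 > 31) b5 = 31;
    return (uint16_t)((r5 << 11) | (g6 << 5) | b5);
}

/* Illustrative sketch of the three execution paths the compiler emits. */
void convert_dispatch(const uint8_t *src, uint16_t *dst, size_t n) {
    size_t i = 0;
    if (n >= 8) {                        /* enough pixels to vectorize at all? */
        for (; i + 16 <= n; i += 16)     /* 16-pixels-per-iteration path */
            for (size_t j = i; j < i + 16; j++)
                dst[j] = px(src[j * 3], src[j * 3 + 1], src[j * 3 + 2]);
        for (; i + 8 <= n; i += 8)       /* 8-pixels-per-iteration path */
            for (size_t j = i; j < i + 8; j++)
                dst[j] = px(src[j * 3], src[j * 3 + 1], src[j * 3 + 2]);
    }
    for (; i < n; i++)                   /* scalar path for the 0-7 leftover pixels */
        dst[i] = px(src[i * 3], src[i * 3 + 1], src[i * 3 + 2]);
}
```

For 19 pixels, for example, the sketch runs the 16-pixel loop once, skips the 8-pixel loop, and handles the remaining 3 pixels in the scalar tail.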
The following snippet shows the start of the function, and it contains the logic to decide which one of the aforementioned paths it should take.
0000000100001554 <_rgb888_to_rgb565_scalar>:
100001554: a9412408 ldp x8, x9, [x0, #0x10]
100001558: 9b087d28 mul x8, x9, x8
10000155c: b4000988 cbz x8, 0x10000168c <_rgb888_to_rgb565_scalar+0x138>
100001560: f9400009 ldr x9, [x0]
100001564: f940002a ldr x10, [x1]
100001568: f100211f cmp x8, #0x8
10000156c: 54000543 b.lo 0x100001614 <_rgb888_to_rgb565_scalar+0xc0> ; if num pixels < 8, jump to scalar
100001570: d37ff90b lsl x11, x8, #1
100001574: 8b0b014c add x12, x10, x11
100001578: 8b08012d add x13, x9, x8
10000157c: 8b0b01ab add x11, x13, x11
100001580: eb0b015f cmp x10, x11
100001584: fa4c3122 ccmp x9, x12, #0x2, lo
100001588: 54000463 b.lo 0x100001614 <_rgb888_to_rgb565_scalar+0xc0>
10000158c: f100411f cmp x8, #0x10
100001590: 54000802 b.hs 0x100001690 <_rgb888_to_rgb565_scalar+0x13c> ; if num pixels >= 16, jump to 16 pixels at a time
// ... continue processing 8 pixels at a time
Making the compiler do what you want it to do
If you read my post on using SIMD to solve this problem efficiently, then you know that I implemented several versions. Among them are one that processes 8 pixels per iteration and one that processes 16 pixels per iteration by loading 3x16 values and then handling the upper 8 and lower 8 values of each color channel separately, because it was not possible to keep the algorithm the same and truly process 16 values with each arithmetic instruction without the values overflowing. There is a way to achieve this if you change the algorithm implementation a little, but more on that in SIMD Algorithm Design: Beyond Basic Vectorization.
And the compiler essentially did the same. In this post, we will only look at the 16 pixel version and ignore the 8 pixel one, since the former is more interesting in my opinion.
Before the rest of this post, a short section about terminology:
For the rest of this post I will refer to my ‘16 pixels per iteration’ implementation as 16P (short for ‘16 pixels version’), and
I will refer to the code that the compiler generated when optimizing and vectorizing the scalar version as CGV, ‘the compiler generated version’.
So when I talk about ‘the assembly generated for 16P’ or ‘the 16P assembly’, I mean the assembly that the compiler generated when compiling my handwritten ‘16 pixels per iteration’ version. And when I mention ‘the CGV’, I refer to the ‘16 pixels at a time’ part of the compiler generated version, and ‘the CGV assembly’ refers to the specific instructions the compiler generated for CGV.
Comparing CGV and 16P
Comparing the 16P assembly with the CGV assembly shows that the SIMD instructions are exactly the same. However, benchmarks reveal that CGV is a little bit faster overall.
So it is worthwhile to look at the loop code generated by the compiler (loop of 16P, loop of CGV).
What immediately becomes apparent is that the number of instructions for 16P is higher than for CGV, particularly at the beginning and at the end of the loop.
Beginning of the loop
The first ones are right at the beginning of the loop:
100001f28: f940000b ldr x11, [x0] ; load base address (src->buffer) from memory
100001f2c: 8b08016b add x11, x11, x8 ; add offset: i * 3
100001f30: 4c404164 ld3.16b { v4, v5, v6 }, [x11] ; load 3 x 16 1-byte values from the source address into registers v4-v6
So for some reason the compiler loads something from memory (the source address is in x0) into register x11 and adds the content of x8 to it.
The next instruction makes it painfully obvious what this is:
the interleaved load of 16 pixels (3 color channels times 16 1-byte values).
These 3 lines represent this line from the C code:
v_rgb888 = vld3q_u8(src->buffer + (i * 3));
x8 contains i * 3 (it is set to 0 before the loop and later increased by 0x30, which is 3 * 16 in hexadecimal, since i increases by 16 each iteration), and the ldr instruction loads the buffer start address from the src struct, which is located in memory.
This has to be done at each iteration because the compiler doesn't know that the value of buffer does not change.
An easy way to avoid this 2 instruction overhead is to save the value of buffer in a local variable that is placed outside of the loop:
const uint8_t *src_ptr = src->buffer;

for (int i = 0; i < src->img_width * src->img_height; i += 16) {
  // ...
}
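Another option, hinted at by the `_restrict` suffix on one of the function symbols later in this post, is to attack the aliasing problem directly with C99's restrict qualifier. Here is a minimal scalar sketch, assuming a signature that receives the raw buffer pointers rather than the Image structs:

```c
#include <stddef.h>
#include <stdint.h>

/* restrict promises the compiler that src and dst never alias, so
 * values loaded through them may stay cached in registers instead of
 * being reloaded on every iteration. */
void rgb888_to_rgb565(const uint8_t *restrict src,
                      uint16_t *restrict dst, size_t num_pixels) {
    for (size_t i = 0; i < num_pixels; i++) {
        uint16_t r5 = (uint16_t)((src[i * 3 + 0] + 4) >> 3);
        uint16_t g6 = (uint16_t)((src[i * 3 + 1] + 2) >> 2);
        uint16_t b5 = (uint16_t)((src[i * 3 + 2] + 4) >> 3);
        if (r5 > 31) r5 = 31;
        if (g6 > 63) g6 = 63;
        if (b5 > 31) b5 = 31;
        dst[i] = (uint16_t)((r5 << 11) | (g6 << 5) | b5);
    }
}
```

Whether this helps depends on the compiler actually trusting the qualifier; hoisting into locals, as shown above, is the more direct and portable fix.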
The next couple of instructions are the same. The order is different, and it is possible that this improves pipelining. However, this does not cause any measurable performance difference, so it might just be a coincidence. Furthermore, it is difficult to reason about this without knowing the details of the underlying hardware, which Apple has not made public. The next difference is at the end of the loop after all the Neon instructions.
End of the loop
The CGV ends with these 3 instructions:
10000171c: ac8115a4 stp q4, q5, [x13], #0x20 ; save result to dst->buffer
100001720: f100418c subs x12, x12, #0x10 ; decrement x12 by 0x10 (16 in decimal)
100001724: 54fffc61 b.ne 0x1000016b0 <_rgb888_to_rgb565_scalar+0x15c> ; branch
The first instruction stores the computation result back to memory, using a single instruction to store both registers at their respective offsets.
The values to be stored are in q4 and q5; they are written to the location in x13, and that address is then incremented by 32.
The next 2 instructions decrement the value in x12 by 16 and branch to the start of the loop if the result is not zero.
So what the compiler actually did is convert the for loop to a while loop:
for (int i = 0; i < src->img_width * src->img_height; i++) {
  // ...
}

// to

int remaining_pixels = src->img_width * src->img_height;
while (remaining_pixels > 0) {
  // ...
  remaining_pixels -= 16;
}
Meanwhile, the compiler generated the following instructions for 16P:
100001f98: f940002b ldr x11, [x1] ; load a value from memory (location in x1 aka dst->buffer) to x11
; ... one Neon instruction omitted here that has not yet been executed, probably due to pipelining
100001fa0: 3ca96964 str q4, [x11, x9] ; store q4 (half of the 16 values) to base address + (2 * i)
100001fa4: f940002b ldr x11, [x1] ; load a value from memory (location in x1 aka dst->buffer) to x11
100001fa8: 8b09016b add x11, x11, x9 ; x11 = dst->buffer + (2 * i)
100001fac: 3d800565 str q5, [x11, #0x10] ; store q5 into dst->buffer + (2 * i) + 16
100001fb0: 9100414a add x10, x10, #0x10 ; add 16 to x10 (i += 16)
100001fb4: a941300b ldp x11, x12, [x0, #0x10] ; load img_width into x11 and img_height into x12 from [x0, #0x10]
100001fb8: 9b0b7d8b mul x11, x12, x11 ; num pixels = img_width * img_height
100001fbc: 91008129 add x9, x9, #0x20 ; x9 += 32 (16 values * 2 bytes)
100001fc0: 9100c108 add x8, x8, #0x30 ; x8 += 48 (16 pixels * 3 bytes)
100001fc4: eb0a017f cmp x11, x10 ; compare x11 and x10
100001fc8: 54fffb08 b.hi 0x100001f28 <_rgb888_to_rgb565_neon_16_vals+0x54> ; branch if num_pixels > i
What immediately becomes apparent is that there is no 'store pair' (stp) instruction; instead there are two 'store' (str) instructions.
The stp instruction accepts two SIMD registers as inputs, which makes it possible to vectorize the memory access as well.
The addressing used to figure out where to store the results is also more complicated, and so is the branching logic.
Starting with the store operations, here is the corresponding C code:
vst1q_u16((uint16_t *)dst->buffer + i + 8, v_rgb565_high);
vst1q_u16((uint16_t *)dst->buffer + i, v_rgb565_low);
The ldr pattern should look familiar: we encountered the exact same issue at the beginning of the loop,
and the reason as well as the solution are the same here. x1 contains the location of the destination buffer, and x9 contains
2 * i (because each output pixel is a 2-byte uint16_t).
The compiler reloads the destination address before every store, which also prevents it from replacing those two individual
str instructions with a more efficient stp instruction.
One way to get this ‘store pair’ instruction is by making the compiler understand that the address does not change:
uint16_t *dst_ptr = (uint16_t *)dst->buffer + i;
vst1q_u16(dst_ptr, v_rgb565_low);
vst1q_u16(dst_ptr + 8, v_rgb565_high);
This yields the following assembly, and as you can see we now have a single store pair (stp) instead of two stores (2x str):
1000020ac: f940002b ldr x11, [x1]
1000020b0: 8b09016b add x11, x11, x9
1000020b4: ad001564 stp q4, q5, [x11]
1000020b8: 9100414a add x10, x10, #0x10
1000020bc: a941300b ldp x11, x12, [x0, #0x10]
1000020c0: 9b0b7d8b mul x11, x12, x11
1000020c4: 91008129 add x9, x9, #0x20
1000020c8: 9100c108 add x8, x8, #0x30
1000020cc: eb0a017f cmp x11, x10
1000020d0: 54fffb48 b.hi 0x100002038 <_rgb888_to_rgb565_neon_16_vals_sp+0x54>
However, the compiler still does not know that the value of dst->buffer does not change between iterations,
so it has to load it in every iteration and add the offset.
This is exactly the same issue as at the beginning of the loop, and the solution is also the same:
uint16_t *dst_ptr = (uint16_t *)dst->buffer;
const uint8_t *src_ptr = src->buffer;

for (int i = 0; i < src->img_width * src->img_height; i += 16) {
  // ...
}
This reduces the instructions at the end of the loop from 12 to 7:
1000021b0: ad3f9544 stp q4, q5, [x10, #-0x10] ; store to dst->buffer
1000021b4: 91004108 add x8, x8, #0x10 ; i += 16
1000021b8: a941300b ldp x11, x12, [x0, #0x10] ; load image dimensions
1000021bc: 9b0b7d8b mul x11, x12, x11 ; calculate number of pixels in image
1000021c0: 9100814a add x10, x10, #0x20 ; calculate offset for dst->buffer for the next iteration
1000021c4: eb08017f cmp x11, x8 ; if number of pixels > i:
1000021c8: 54fffbe8 b.hi 0x100002144 <_rgb888_to_rgb565_neon_16_vals_sp_restrict+0x58> ; branch
This is still 4 more instructions than the CGV assembly, and to explain that discrepancy we have to examine the looping behavior. I already mentioned that the compiler converted the for loop into essentially a while loop, which costs 2 instructions, while the for loop costs 5. Part of the reason is that the compiler could not optimize away the inline calculation of the number of pixels, so it loads the image dimensions in every iteration.
const int num_pixels = src->img_width * src->img_height;
for (int i = 0; i < num_pixels; i += 16) {
  // ...
}
Moving that calculation out of the loop gets rid of the extra load instruction and the multiplication, but it keeps the cmp
and adds an lsr instruction, which shifts the value in x10 (i) right by 4.
The comparison changed as well:
100001c84: ad3fc526 stp q6, q17, [x9, #-0x10] ; store to dst->buffer
100001c88: 9100414a add x10, x10, #0x10 ; i += 16
100001c8c: 91008129 add x9, x9, #0x20 ; calculate offset for dst->buffer for the next iteration
100001c90: d344fd4b lsr x11, x10, #4 ; i >> 4
100001c94: f1270d7f cmp x11, #0x9c3 ; (i >> 4) < 2499?
100001c98: 54fffc43 b.lo 0x100001c20 <_rgb888_to_rgb565_neon_sp+0x60> ; branch if (i >> 4) < 2499
So the compiler still cannot convert it to a while loop like it did in the auto-vectorized version. The reason is that the compiler doesn't know whether the total number of pixels is divisible by 16, but we know that it is in this case (200x200 pixels = 40,000 pixels, divided by 16 = 2,500 iterations).
However, we can tell the compiler by changing the comparison from i < num_pixels to i != num_pixels.
This lets the compiler assume that i never surpasses the number of pixels.
const int num_pixels = src->img_width * src->img_height;
for (int i = 0; i != num_pixels; i += 16) {
  // ...
}
After that the generated instructions are almost equal to CGV:
100001c84: ad3fc526 stp q6, q17, [x9, #-0x10]
100001c88: 91008129 add x9, x9, #0x20
100001c8c: f100414a subs x10, x10, #0x10
100001c90: 54fffc81 b.ne 0x100001c20 <_rgb888_to_rgb565_neon_sp+0x60>
The only difference is an extra add instruction to compute the next offset the program should write to.
In CGV that is done as part of the stp instruction.
Here is a side by side comparison:

| opcode | source 1 | source 2 | destination | offset |
|---|---|---|---|---|
| stp | q4 | q5 | [x13] | +32 (post-index writeback) |
| stp | q6 | q17 | [x9] | -16 (signed offset, no writeback) |
So the difference is that in the CGV version both register contents are stored to the address in x13,
and the post-index writeback then increments x13 by 32, so no extra add instruction is necessary.
But for some reason the compiler didn't do this for the 16P version.
In the 16P assembly the destination address is kept in x9, the register contents are stored at x9 - 16, and the
address is then incremented by 32 with a separate instruction (add x9, x9, #0x20).
It is not entirely clear to me why the compiler does it that way. But one way to get it to generate the same instructions is to change the addressing like this:
// from
vst1q_u16(dst_ptr + i, v_rgb565_low);
vst1q_u16(dst_ptr + i + 8, v_rgb565_high);

// to
vst1q_u16(dst_ptr, v_rgb565_low);
dst_ptr += 8;
vst1q_u16(dst_ptr, v_rgb565_high);
dst_ptr += 8;
That way the compiler generates the same instructions for both 16P and CGV.
100001c80: ac814506 stp q6, q17, [x8], #0x20
100001c84: f100414a subs x10, x10, #0x10
100001c88: 54fffca1 b.ne 0x100001c1c <_rgb888_to_rgb565_neon_sp+0x5c>
Conclusion
To conclude, we have looked at a rather simple algorithm that has been vectorized both by hand and by the compiler. The comparison produced several observations that lead to the following conclusions:
- Compilers are really smart
They might optimize your problem just as well as, if not better than, you would, at least for trivial problems. Manual optimizations might even end up confusing the compiler, preventing it from doing its best work.
- Be aware of what the compiler knows and what it doesn't
Many of these differences in the generated assembly arose because the compiler could not know how other code might affect the data, so it had to play it safe, which resulted in extra instructions and an observable performance penalty.
- Pre-calculate what you can
The compiler does not necessarily move calculations out of a loop, even when the result does not change across iterations. Moving calculations out of a hot loop and caching the results can be worth it, especially when the calculations are expensive.
- Be as expressive as you can be
Some things that are obvious to you are not clear to the compiler, and generally the more expressive your code is, the better the code the compiler can generate. One or two extra add instructions might be a worthwhile cost for more concise and readable code. But if you want to make sure the compiler does the best it can, being explicit is usually the way to achieve that.
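Putting the source-level fixes from this post together, here is the scalar version with hoisted buffer pointers, a pixel count computed once outside the loop, and the != loop condition. The Image layout is an assumption based on the fields used in the post, and the Neon intrinsics are left out so the sketch stays portable:

```c
#include <stdint.h>

/* Assumed struct layout, matching the fields used throughout the post. */
typedef struct {
    uint8_t *buffer;
    int img_width;
    int img_height;
} Image;

void rgb888_to_rgb565_hoisted(Image *src, Image *dst) {
    const uint8_t *src_ptr = src->buffer;        /* hoisted: no reload per iteration */
    uint16_t *dst_ptr = (uint16_t *)dst->buffer; /* hoisted: enables stp-friendly stores */
    const int num_pixels = src->img_width * src->img_height; /* pre-computed once */

    /* != instead of < tells the compiler i hits num_pixels exactly */
    for (int i = 0; i != num_pixels; i++) {
        uint16_t r5 = (uint16_t)((src_ptr[i * 3 + 0] + 4) >> 3);
        uint16_t g6 = (uint16_t)((src_ptr[i * 3 + 1] + 2) >> 2);
        uint16_t b5 = (uint16_t)((src_ptr[i * 3 + 2] + 4) >> 3);
        if (r5 > 31) r5 = 31;
        if (g6 > 63) g6 = 63;
        if (b5 > 31) b5 = 31;
        dst_ptr[i] = (uint16_t)((r5 << 11) | (g6 << 5) | b5);
    }
}
```

The same scaffolding carries over unchanged to the Neon version; only the loop body and the step size (i += 16) differ.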
Appendix
Scalar Algorithm (auto vectorized)
0000000100001554 <_rgb888_to_rgb565_scalar>:
100001554: a9412408 ldp x8, x9, [x0, #0x10]
100001558: 9b087d28 mul x8, x9, x8
10000155c: b4000988 cbz x8, 0x10000168c <_rgb888_to_rgb565_scalar+0x138>
100001560: f9400009 ldr x9, [x0]
100001564: f940002a ldr x10, [x1]
100001568: f100211f cmp x8, #0x8
10000156c: 54000543 b.lo 0x100001614 <_rgb888_to_rgb565_scalar+0xc0>
100001570: d37ff90b lsl x11, x8, #1
100001574: 8b0b014c add x12, x10, x11
100001578: 8b08012d add x13, x9, x8
10000157c: 8b0b01ab add x11, x13, x11
100001580: eb0b015f cmp x10, x11
100001584: fa4c3122 ccmp x9, x12, #0x2, lo
100001588: 54000463 b.lo 0x100001614 <_rgb888_to_rgb565_scalar+0xc0>
10000158c: f100411f cmp x8, #0x10
100001590: 54000802 b.hs 0x100001690 <_rgb888_to_rgb565_scalar+0x13c>
100001594: d280000b mov x11, #0x0 ; =0
100001598: aa0b03ee mov x14, x11
10000159c: 927df10b and x11, x8, #0xfffffffffffffff8
1000015a0: d37ff9cd lsl x13, x14, #1
1000015a4: 8b0e01ac add x12, x13, x14
1000015a8: 8b0c012c add x12, x9, x12
1000015ac: 8b0d014d add x13, x10, x13
1000015b0: cb0b01ce sub x14, x14, x11
1000015b4: 4f008480 movi.8h v0, #0x4
1000015b8: 4f008441 movi.8h v1, #0x2
1000015bc: 4f0087e2 movi.8h v2, #0x1f
1000015c0: 4f0187e3 movi.8h v3, #0x3f
1000015c4: 0cdf4184 ld3.8b { v4, v5, v6 }, [x12], #24
1000015c8: 2e241007 uaddw.8h v7, v0, v4
1000015cc: 6f1d04e7 ushr.8h v7, v7, #0x3
1000015d0: 2e251030 uaddw.8h v16, v1, v5
1000015d4: 6f1e0610 ushr.8h v16, v16, #0x2
1000015d8: 2e261004 uaddw.8h v4, v0, v6
1000015dc: 6f1d0484 ushr.8h v4, v4, #0x3
1000015e0: 6e626ce5 umin.8h v5, v7, v2
1000015e4: 6e636e06 umin.8h v6, v16, v3
1000015e8: 6e626c84 umin.8h v4, v4, v2
1000015ec: 4f1b54a5 shl.8h v5, v5, #0xb
1000015f0: 4f1554c6 shl.8h v6, v6, #0x5
1000015f4: 4ea51cc5 orr.16b v5, v6, v5
1000015f8: 4ea41ca4 orr.16b v4, v5, v4
1000015fc: 3c8105a4 str q4, [x13], #0x10
100001600: b10021ce adds x14, x14, #0x8
100001604: 54fffe01 b.ne 0x1000015c4 <_rgb888_to_rgb565_scalar+0x70>
100001608: eb0b011f cmp x8, x11
10000160c: 54000061 b.ne 0x100001618 <_rgb888_to_rgb565_scalar+0xc4>
100001610: 1400001f b 0x10000168c <_rgb888_to_rgb565_scalar+0x138>
100001614: d280000b mov x11, #0x0 ; =0
100001618: d37ff96c lsl x12, x11, #1
10000161c: 8b0c014a add x10, x10, x12
100001620: 8b0b018c add x12, x12, x11
100001624: 8b090189 add x9, x12, x9
100001628: 91000929 add x9, x9, #0x2
10000162c: cb0b0108 sub x8, x8, x11
100001630: 528003eb mov w11, #0x1f ; =31
100001634: 528007ec mov w12, #0x3f ; =63
100001638: 385fe12d ldurb w13, [x9, #-0x2]
10000163c: 385ff12e ldurb w14, [x9, #-0x1]
100001640: 3840352f ldrb w15, [x9], #0x3
100001644: 110011ad add w13, w13, #0x4
100001648: 53037dad lsr w13, w13, #3
10000164c: 110009ce add w14, w14, #0x2
100001650: 53027dce lsr w14, w14, #2
100001654: 110011ef add w15, w15, #0x4
100001658: 53037def lsr w15, w15, #3
10000165c: 71007dbf cmp w13, #0x1f
100001660: 1a8b31ad csel w13, w13, w11, lo
100001664: 7100fddf cmp w14, #0x3f
100001668: 1a8c31ce csel w14, w14, w12, lo
10000166c: 71007dff cmp w15, #0x1f
100001670: 1a8b31ef csel w15, w15, w11, lo
100001674: 531b69ce lsl w14, w14, #5
100001678: 2a0d2dcd orr w13, w14, w13, lsl #11
10000167c: 2a0f01ad orr w13, w13, w15
100001680: 7800254d strh w13, [x10], #0x2
100001684: f1000508 subs x8, x8, #0x1
100001688: 54fffd81 b.ne 0x100001638 <_rgb888_to_rgb565_scalar+0xe4>
10000168c: d65f03c0 ret
100001690: 927ced0b and x11, x8, #0xfffffffffffffff0
100001694: 4f008480 movi.8h v0, #0x4
100001698: 4f008441 movi.8h v1, #0x2
10000169c: 4f0087e2 movi.8h v2, #0x1f
1000016a0: 4f0187e3 movi.8h v3, #0x3f
1000016a4: aa0b03ec mov x12, x11
1000016a8: aa0a03ed mov x13, x10
1000016ac: aa0903ee mov x14, x9
1000016b0: 4cdf41c4 ld3.16b { v4, v5, v6 }, [x14], #48
1000016b4: 6e241007 uaddw2.8h v7, v0, v4
1000016b8: 2e241010 uaddw.8h v16, v0, v4
1000016bc: 6f1d0610 ushr.8h v16, v16, #0x3
1000016c0: 6f1d04e7 ushr.8h v7, v7, #0x3
1000016c4: 6e251031 uaddw2.8h v17, v1, v5
1000016c8: 2e251032 uaddw.8h v18, v1, v5
1000016cc: 6f1e0652 ushr.8h v18, v18, #0x2
1000016d0: 6f1e0631 ushr.8h v17, v17, #0x2
1000016d4: 6e261013 uaddw2.8h v19, v0, v6
1000016d8: 2e261004 uaddw.8h v4, v0, v6
1000016dc: 6f1d0484 ushr.8h v4, v4, #0x3
1000016e0: 6f1d0665 ushr.8h v5, v19, #0x3
1000016e4: 6e626ce6 umin.8h v6, v7, v2
1000016e8: 6e626e07 umin.8h v7, v16, v2
1000016ec: 6e636e30 umin.8h v16, v17, v3
1000016f0: 6e636e51 umin.8h v17, v18, v3
1000016f4: 6e626ca5 umin.8h v5, v5, v2
1000016f8: 6e626c84 umin.8h v4, v4, v2
1000016fc: 4f1b54e7 shl.8h v7, v7, #0xb
100001700: 4f1b54c6 shl.8h v6, v6, #0xb
100001704: 4f155631 shl.8h v17, v17, #0x5
100001708: 4f155610 shl.8h v16, v16, #0x5
10000170c: 4ea61e06 orr.16b v6, v16, v6
100001710: 4ea71e27 orr.16b v7, v17, v7
100001714: 4ea41ce4 orr.16b v4, v7, v4
100001718: 4ea51cc5 orr.16b v5, v6, v5
10000171c: ac8115a4 stp q4, q5, [x13], #0x20
100001720: f100418c subs x12, x12, #0x10
100001724: 54fffc61 b.ne 0x1000016b0 <_rgb888_to_rgb565_scalar+0x15c>
100001728: eb0b011f cmp x8, x11
10000172c: 54fffb00 b.eq 0x10000168c <_rgb888_to_rgb565_scalar+0x138>
100001730: 371ff348 tbnz w8, #0x3, 0x100001598 <_rgb888_to_rgb565_scalar+0x44>
100001734: 17ffffb9 b 0x100001618 <_rgb888_to_rgb565_scalar+0xc4>
Loop (auto vectorized)
1000016b0: 4cdf41c4 ld3.16b { v4, v5, v6 }, [x14], #48
1000016b4: 6e241007 uaddw2.8h v7, v0, v4
1000016b8: 2e241010 uaddw.8h v16, v0, v4
1000016bc: 6f1d0610 ushr.8h v16, v16, #0x3
1000016c0: 6f1d04e7 ushr.8h v7, v7, #0x3
1000016c4: 6e251031 uaddw2.8h v17, v1, v5
1000016c8: 2e251032 uaddw.8h v18, v1, v5
1000016cc: 6f1e0652 ushr.8h v18, v18, #0x2
1000016d0: 6f1e0631 ushr.8h v17, v17, #0x2
1000016d4: 6e261013 uaddw2.8h v19, v0, v6
1000016d8: 2e261004 uaddw.8h v4, v0, v6
1000016dc: 6f1d0484 ushr.8h v4, v4, #0x3
1000016e0: 6f1d0665 ushr.8h v5, v19, #0x3
1000016e4: 6e626ce6 umin.8h v6, v7, v2
1000016e8: 6e626e07 umin.8h v7, v16, v2
1000016ec: 6e636e30 umin.8h v16, v17, v3
1000016f0: 6e636e51 umin.8h v17, v18, v3
1000016f4: 6e626ca5 umin.8h v5, v5, v2
1000016f8: 6e626c84 umin.8h v4, v4, v2
1000016fc: 4f1b54e7 shl.8h v7, v7, #0xb
100001700: 4f1b54c6 shl.8h v6, v6, #0xb
100001704: 4f155631 shl.8h v17, v17, #0x5
100001708: 4f155610 shl.8h v16, v16, #0x5
10000170c: 4ea61e06 orr.16b v6, v16, v6
100001710: 4ea71e27 orr.16b v7, v17, v7
100001714: 4ea41ce4 orr.16b v4, v7, v4
100001718: 4ea51cc5 orr.16b v5, v6, v5
10000171c: ac8115a4 stp q4, q5, [x13], #0x20
100001720: f100418c subs x12, x12, #0x10
100001724: 54fffc61 b.ne 0x1000016b0 <_rgb888_to_rgb565_scalar+0x15c>
Loop (hand optimized)
100001f28: f940000b ldr x11, [x0]
100001f2c: 8b08016b add x11, x11, x8
100001f30: 4c404164 ld3.16b { v4, v5, v6 }, [x11]
100001f34: 6e241007 uaddw2.8h v7, v0, v4
100001f38: 6e251030 uaddw2.8h v16, v1, v5
100001f3c: 6e261011 uaddw2.8h v17, v0, v6
100001f40: 2e241012 uaddw.8h v18, v0, v4
100001f44: 2e251033 uaddw.8h v19, v1, v5
100001f48: 2e261004 uaddw.8h v4, v0, v6
100001f4c: 6f1d04e5 ushr.8h v5, v7, #0x3
100001f50: 6f1e0606 ushr.8h v6, v16, #0x2
100001f54: 6f1d0627 ushr.8h v7, v17, #0x3
100001f58: 6f1d0650 ushr.8h v16, v18, #0x3
100001f5c: 6f1e0671 ushr.8h v17, v19, #0x2
100001f60: 6f1d0484 ushr.8h v4, v4, #0x3
100001f64: 6e626ca5 umin.8h v5, v5, v2
100001f68: 6e636cc6 umin.8h v6, v6, v3
100001f6c: 6e626ce7 umin.8h v7, v7, v2
100001f70: 6e626e10 umin.8h v16, v16, v2
100001f74: 6e636e31 umin.8h v17, v17, v3
100001f78: 6e626c84 umin.8h v4, v4, v2
100001f7c: 4f1b54a5 shl.8h v5, v5, #0xb
100001f80: 4f1554c6 shl.8h v6, v6, #0x5
100001f84: 4f1b5610 shl.8h v16, v16, #0xb
100001f88: 4f155631 shl.8h v17, v17, #0x5
100001f8c: 4ea51cc5 orr.16b v5, v6, v5
100001f90: 4eb01e26 orr.16b v6, v17, v16
100001f94: 4ea41cc4 orr.16b v4, v6, v4
100001f98: f940002b ldr x11, [x1]
100001f9c: 4ea71ca5 orr.16b v5, v5, v7
100001fa0: 3ca96964 str q4, [x11, x9]
100001fa4: f940002b ldr x11, [x1]
100001fa8: 8b09016b add x11, x11, x9
100001fac: 3d800565 str q5, [x11, #0x10]
100001fb0: 9100414a add x10, x10, #0x10
100001fb4: a941300b ldp x11, x12, [x0, #0x10]
100001fb8: 9b0b7d8b mul x11, x12, x11
100001fbc: 91008129 add x9, x9, #0x20
100001fc0: 9100c108 add x8, x8, #0x30
100001fc4: eb0a017f cmp x11, x10
100001fc8: 54fffb08 b.hi 0x100001f28 <_rgb888_to_rgb565_neon_16_vals+0x54>