What Compilers Really Do: A SIMD Case Study

In a different post, I looked at optimizing a particular algorithm using Arm Neon SIMD. If you are interested in how to use SIMD to convert RGB888 to RGB565, definitely take a look at that post.

While experimenting and figuring out how to reduce the execution time as much as possible, I did a deep dive into the generated assembly for the different implementations of that algorithm, as well as into the code that the compiler generates when optimizing (and auto-vectorizing) the scalar version.

This post focuses on what I learned about compilers and how to make them do what you want.

1. Compilers are really smart

Modern compilers are pretty good at optimizing algorithms, at least trivial ones. They make use of architecture-specific instructions to get the most performance out of the target machine.

When optimizing this simple algorithm for example:

void rgb888_to_rgb565_scalar(Image *src, Image *dst) {

  for (int i = 0; i < src->img_width * src->img_height; i++) {
    uint8_t r = src->buffer[i * 3 + 0];
    uint8_t g = src->buffer[i * 3 + 1];
    uint8_t b = src->buffer[i * 3 + 2];

    // Add half the lost precision before truncating
    uint16_t r5 = (r + 4) >> 3; // +4 is half of 8 (2^3)
    uint16_t g6 = (g + 2) >> 2; // +2 is half of 4 (2^2)
    uint16_t b5 = (b + 4) >> 3; // +4 is half of 8 (2^3)

    // Clamp to prevent overflow
    if (r5 > 31)
      r5 = 31;
    if (g6 > 63)
      g6 = 63;
    if (b5 > 31)
      b5 = 31;

    uint16_t value = (uint16_t)((r5 << 11) | (g6 << 5) | b5);

    ((uint16_t *)dst->buffer)[i] = value;
  }
}

The compiler essentially generates three execution paths: one truly scalar version, one processing 8 pixels per iteration, and one processing 16 pixels per iteration. Since Arm Neon registers are 128 bits wide, sixteen 8-bit values are the most that fit into one register. The whole assembly is in the Appendix; here I just want to draw your attention to a few small excerpts.

The following snippet shows the start of the function, and it contains the logic to decide which one of the aforementioned paths it should take.

0000000100001554 <_rgb888_to_rgb565_scalar>:
100001554: a9412408    	ldp	x8, x9, [x0, #0x10]
100001558: 9b087d28    	mul	x8, x9, x8
10000155c: b4000988    	cbz	x8, 0x10000168c <_rgb888_to_rgb565_scalar+0x138>
100001560: f9400009    	ldr	x9, [x0]
100001564: f940002a    	ldr	x10, [x1]
100001568: f100211f    	cmp	x8, #0x8
10000156c: 54000543    	b.lo	0x100001614 <_rgb888_to_rgb565_scalar+0xc0> ; if num_pixels < 8, jump to scalar path
100001570: d37ff90b    	lsl	x11, x8, #1
100001574: 8b0b014c    	add	x12, x10, x11
100001578: 8b08012d    	add	x13, x9, x8
10000157c: 8b0b01ab    	add	x11, x13, x11
100001580: eb0b015f    	cmp	x10, x11
100001584: fa4c3122    	ccmp	x9, x12, #0x2, lo
100001588: 54000463    	b.lo	0x100001614 <_rgb888_to_rgb565_scalar+0xc0>
10000158c: f100411f    	cmp	x8, #0x10
100001590: 54000802    	b.hs	0x100001690 <_rgb888_to_rgb565_scalar+0x13c> ; if num_pixels >= 16, jump to 16-pixels-at-a-time path
// ... continue processing 8 pixels at a time
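One detail worth unpacking: the cmp/ccmp/b.lo sequence at 0x100001570-0x100001588 computes the end of both buffers and falls back to the scalar path when they overlap. The following sketch is my reading of those instructions (not compiler output): for n pixels the source spans 3*n bytes (RGB888) and the destination spans 2*n bytes (RGB565), and two half-open ranges overlap iff each starts before the other ends. If they overlap, the vectorized loads could observe bytes that earlier vector stores already clobbered, so the compiler plays it safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Sketch (my interpretation) of the runtime aliasing check.
// src spans [src, src + 3n), dst spans [dst, dst + 2n).
static bool buffers_overlap(const uint8_t *src, const uint8_t *dst, size_t n) {
    const uint8_t *src_end = src + 3 * n; // add x13, x9, x8 / add x11, x13, x11
    const uint8_t *dst_end = dst + 2 * n; // add x12, x10, x11
    return dst < src_end && src < dst_end; // cmp / ccmp / b.lo
}
```

With distinct buffers the check returns false and the vectorized paths can run.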

Making the compiler do what you want it to do

If you read my post on using SIMD to solve this problem efficiently, then you know that I implemented several versions. Among them were one that processes 8 pixels per iteration and one that processes 16 pixels per iteration by loading 3x16 values and then handling the upper 8 and lower 8 values of each color channel separately. Keeping the algorithm unchanged while truly processing 16 values with each arithmetic instruction was not possible, since the values would overflow. There is a way to achieve it if you change the algorithm implementation a little, but more on that in SIMD Algorithm Design: Beyond Basic Vectorization.

And the compiler essentially did the same. In this post, we will only look at the 16-pixel version and ignore the 8-pixel one, since the former is more interesting in my opinion.

Before we continue, a short note on terminology:

For the rest of this post I will refer to my ‘16 pixels per iteration’ implementation as 16P (short for ‘16 pixels version’), and I will refer to the code that the compiler generated when optimizing and vectorizing the scalar version as CGV, ’the compiler generated version’.

So when I talk about ’the assembly generated for 16P’ or ’the 16P assembly’, then I mean the assembly that the compiler generated when compiling my handwritten ‘16 pixels per iteration’ version. And when I mention ’the CGV’, I refer to the ‘16 pixels at a time’ part of the compiler generated version, and ’the CGV assembly’ refers to the specific instructions the compiler has generated for CGV.

Comparing CGV and 16P

So when comparing the 16P assembly and the CGV assembly, it is clear that the SIMD instructions are exactly the same. However, benchmarks reveal that CGV is a little bit faster overall.

So it is worthwhile to look at the loop code generated by the compiler (loop of 16P, loop of CGV).

What immediately becomes apparent is that the number of instructions for 16P is higher than the number for CGV, particularly at the beginning and at the end of the loop.

Beginning of the loop

The first ones are right at the beginning of the loop:

100001f28: f940000b    	ldr	x11, [x0]                   ; load base address (src->buffer) from memory
100001f2c: 8b08016b    	add	x11, x11, x8                ; add offset: i * 3
100001f30: 4c404164    	ld3.16b	{ v4, v5, v6 }, [x11]   ; load 3 times 16 1-byte values from the source address into the registers v4-v6.

So for some reason the compiler loads something from memory (source address can be found in x0) into the register x11 and adds the content from x8 to it. The next instruction makes it painfully obvious what this is. It represents the interleaved load of 16 pixels (3 color channels times 16 1 byte values).

These 3 lines represent this line from the C code:

v_rgb888 = vld3q_u8(src->buffer + (i * 3));

x8 contains i * 3 (it is set to 0 before the loop and increased by 0x30, i.e. 48 = 3 * 16, each iteration, since i increases by 16 per iteration), and the ldr instruction loads the buffer start address from the src struct, which lives in memory.

This has to be done at each iteration because the compiler doesn’t know that the value of buffer does not change.

An easy way to avoid this two-instruction overhead is to save the value of buffer in a local variable declared outside the loop:

const uint8_t *src_ptr = src->buffer;

for (int i = 0; i < src->img_width * src->img_height; i += 16) {
    // ...
}
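An alternative to manual hoisting is to promise the compiler that the buffers never alias, for example with restrict-qualified local pointers. The sketch below is my illustration, not the post's code (the _restrict suffix in a later symbol name, _rgb888_to_rgb565_neon_16_vals_sp_restrict, suggests the author experimented with something similar); the Image layout is reconstructed from the field accesses seen earlier and may differ from the real definition.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical struct layout, reconstructed from the post's field accesses.
typedef struct {
    uint8_t *buffer;
    int img_width;
    int img_height;
} Image;

// restrict promises that loads through src_ptr and stores through dst_ptr
// never touch the same memory, so the compiler may hoist the loads itself.
void rgb888_to_rgb565_restrict(const Image *src, Image *dst) {
    const uint8_t *restrict src_ptr = src->buffer;
    uint16_t *restrict dst_ptr = (uint16_t *)dst->buffer;
    const int num_pixels = src->img_width * src->img_height;

    for (int i = 0; i < num_pixels; i++) {
        uint16_t r5 = (uint16_t)((src_ptr[i * 3 + 0] + 4) >> 3);
        uint16_t g6 = (uint16_t)((src_ptr[i * 3 + 1] + 2) >> 2);
        uint16_t b5 = (uint16_t)((src_ptr[i * 3 + 2] + 4) >> 3);
        if (r5 > 31) r5 = 31;
        if (g6 > 63) g6 = 63;
        if (b5 > 31) b5 = 31;
        dst_ptr[i] = (uint16_t)((r5 << 11) | (g6 << 5) | b5);
    }
}
```

Note that restrict is a promise, not a check: violating it (overlapping buffers) is undefined behavior, whereas the hoisted-local version stays correct either way.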

The next couple of instructions are the same, only in a different order; it is possible that this improves pipelining, but it causes no measurable performance difference, so it might just be a coincidence. Furthermore, it is difficult to reason about this without knowing the details of the underlying hardware, which Apple has not made public. The next difference is at the end of the loop, after all the Neon instructions.

End of the loop

The CGV ends with these 3 instructions:

10000171c: ac8115a4    	stp	q4, q5, [x13], #0x20    ; save result to dst->buffer
100001720: f100418c    	subs	x12, x12, #0x10     ; decrement x12 by 0x10 (16 in decimal)
100001724: 54fffc61    	b.ne	0x1000016b0 <_rgb888_to_rgb565_scalar+0x15c>  ; branch

The first instruction saves the computation result back to memory, storing both registers with a single instruction. The values to be saved are in q4 and q5; they are stored to the location in x13, and that location is then incremented by 32.

The next two instructions decrement the value in x12 by 16 and branch to the start of the loop if that value is not zero. So what the compiler actually did is convert the for loop into a while loop:

for (int i = 0; i < src->img_width * src->img_height; i++) {
    // ...
}

// to

int remaining_pixels = src->img_width * src->img_height;
while (remaining_pixels > 0) {
    // ...
    remaining_pixels -= 16;
}

Meanwhile, the compiler generated the following instructions for 16P:

100001f98: f940002b    	ldr	x11, [x1]               ; load a value from memory (location in x1 aka dst->buffer) to x11
; ... omitted one Neon instruction that has not yet been executed, probably scheduled there for pipelining
100001fa0: 3ca96964    	str	q4, [x11, x9]           ; store q4 (half of the 16 values) to base address + (2 * i)
100001fa4: f940002b    	ldr	x11, [x1]               ; load a value from memory (location in x1 aka dst->buffer) to x11
100001fa8: 8b09016b    	add	x11, x11, x9            ; x11 = dst->buffer + (2 * i)
100001fac: 3d800565    	str	q5, [x11, #0x10]        ; store q5 into dst->buffer + (2 * i) + 16
100001fb0: 9100414a    	add	x10, x10, #0x10         ; add 16 to x10 (i += 16)
100001fb4: a941300b    	ldp	x11, x12, [x0, #0x10]   ; load [x0 + 0x10] into x11 and [x0 + 0x18] into x12 (img width and height)
100001fb8: 9b0b7d8b    	mul	x11, x12, x11           ; multiply x11 with x12 (num_pixels = img_width * img_height)
100001fbc: 91008129    	add	x9, x9, #0x20           ; x9 += 32 (16 values * 2 bytes)
100001fc0: 9100c108    	add	x8, x8, #0x30           ; x8 += 48
100001fc4: eb0a017f    	cmp	x11, x10                ; compare x11 and x10
100001fc8: 54fffb08    	b.hi	0x100001f28 <_rgb888_to_rgb565_neon_16_vals+0x54> ; branch if num_pixels > i

What becomes immediately apparent is that there is no 'store pair' (stp) instruction; there are two 'store' (str) instructions instead. The stp instruction accepts two SIMD registers as inputs, which vectorizes the memory access as well. The addressing used to figure out where to store the results is also more complicated, and so is the branching logic.

Starting with the store operations, here is the corresponding C code:

vst1q_u16((uint16_t *)dst->buffer + i + 8, v_rgb565_high);
vst1q_u16((uint16_t *)dst->buffer + i, v_rgb565_low);

The ldr pattern should seem familiar, since we encountered the exact same issue at the beginning of the loop, and both the reason and the solution are the same here. x1 contains the location of the destination buffer, and x9 holds 2 * i (since each destination element is a 16-bit integer). The compiler reloads the destination address before every store, which also prevents it from replacing those two individual str instructions with a more efficient stp instruction.

One way to get this ‘store pair’ instruction is by making the compiler understand that the address does not change:

uint16_t *dst_ptr = (uint16_t *)dst->buffer + i;
vst1q_u16(dst_ptr, v_rgb565_low);
vst1q_u16(dst_ptr + 8, v_rgb565_high);

This yields the following assembly; as you can see, we now have a single store-pair (stp) instead of two stores (str):

1000020ac: f940002b    	ldr	x11, [x1]
1000020b0: 8b09016b    	add	x11, x11, x9
1000020b4: ad001564    	stp	q4, q5, [x11]
1000020b8: 9100414a    	add	x10, x10, #0x10
1000020bc: a941300b    	ldp	x11, x12, [x0, #0x10]
1000020c0: 9b0b7d8b    	mul	x11, x12, x11
1000020c4: 91008129    	add	x9, x9, #0x20
1000020c8: 9100c108    	add	x8, x8, #0x30
1000020cc: eb0a017f    	cmp	x11, x10
1000020d0: 54fffb48    	b.hi	0x100002038 <_rgb888_to_rgb565_neon_16_vals_sp+0x54>

However, the compiler still does not know that the value of dst->buffer does not change between iterations, so it has to load it in every iteration and add the offset. This is now exactly the same situation as at the beginning of the loop, and the solution is also the same:

uint16_t *dst_ptr = (uint16_t *)dst->buffer;
const uint8_t *src_ptr = src->buffer;

for (int i = 0; i < src->img_width * src->img_height; i += 16) {
    // ...
}

This reduces the instructions at the end of the loop from 12 to 7:

1000021b0: ad3f9544    	stp	q4, q5, [x10, #-0x10]   ; store to dst->buffer
1000021b4: 91004108    	add	x8, x8, #0x10           ; i += 16
1000021b8: a941300b    	ldp	x11, x12, [x0, #0x10]   ; load image dimensions
1000021bc: 9b0b7d8b    	mul	x11, x12, x11           ; calculate number of pixels in image
1000021c0: 9100814a    	add	x10, x10, #0x20         ; calculate offset for dst->buffer for the next iteration
1000021c4: eb08017f    	cmp	x11, x8                 ; if num_pixels > i:
1000021c8: 54fffbe8    	b.hi	0x100002144 <_rgb888_to_rgb565_neon_16_vals_sp_restrict+0x58> ; branch

This is still 4 more than the CGV assembly, and to explain that discrepancy we have to examine the looping behavior. I already mentioned that the compiler converted the for loop into essentially a while loop, which takes 2 instructions, while the for loop takes 5. Part of the reason is that the compiler could not optimize the inline calculation of the number of pixels, so it reloads the image dimensions in every iteration.

const int num_pixels = src->img_width * src->img_height;
for (int i = 0; i < num_pixels; i += 16) {
  // ...
}

Moving that calculation out of the loop gets rid of the extra load instruction and the multiplication, but it keeps the cmp and adds an lsr instruction, which shifts the value in x10 (i) right by 4. The comparison changed as well:

100001c84: ad3fc526    	stp	q6, q17, [x9, #-0x10]   ; store to dst->buffer
100001c88: 9100414a    	add	x10, x10, #0x10         ; i += 16
100001c8c: 91008129    	add	x9, x9, #0x20           ; calculate offset for dst->buffer for the next iteration
100001c90: d344fd4b    	lsr	x11, x10, #4            ; i >> 4
100001c94: f1270d7f    	cmp	x11, #0x9c3             ; (i >> 4) < 2499?
100001c98: 54fffc43    	b.lo	0x100001c20 <_rgb888_to_rgb565_neon_sp+0x60> ; branch if (i >> 4) < 2499

So the compiler still cannot convert it to a while loop like it can in the compiler-optimized version. The reason is that the compiler doesn't know whether the number of pixels is divisible by 16, but we know that it is in this case (200x200 pixels = 40,000 pixels, divided by 16 = 2,500 iterations).

However, we can tell the compiler that by changing the comparison from i < num_pixels to i != num_pixels. This guarantees to the compiler that i lands exactly on the number of pixels, never overshooting it.

const int num_pixels = src->img_width * src->img_height;
for (int i = 0; i != num_pixels; i += 16) {
  // ...
}

After that, the generated instructions are almost identical to CGV:

100001c84: ad3fc526    	stp	q6, q17, [x9, #-0x10]
100001c88: 91008129    	add	x9, x9, #0x20
100001c8c: f100414a    	subs	x10, x10, #0x10
100001c90: 54fffc81    	b.ne	0x100001c20 <_rgb888_to_rgb565_neon_sp+0x60>

The only difference is that there is an extra add instruction to compute the next offset that the program should write to. In CGV that is done as part of the stp instruction.

Here is a side by side comparison:

opcode | source 1 | source 2 | destination | offset
stp    | q4       | q5       | [x13]       | +32 (post-index writeback)
stp    | q6       | q17      | [x9]        | -16 (signed offset, no writeback)

So the difference is that in the CGV version both register contents are stored to the address in x13, which is then incremented by 32 as part of the same instruction (post-index writeback). An extra add instruction is therefore not necessary.

But for some reason the compiler didn't do this for the 16P version. In the assembly for 16P the destination address is kept in x9, the register contents are stored at x9 - 16, and a separate instruction then increments the destination address by 32 (add x9, x9, #0x20).
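The difference between the two stp forms can be made concrete with a rough C analogy (my illustration, not actual codegen): post-index folds the pointer update into the store itself, while the signed-offset form leaves the pointer untouched and needs a separate add.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// stp q4, q5, [x13], #0x20  -- post-index: store 32 bytes at *p,
// then the writeback advances p by 32 as part of the same instruction.
static void store_pair_post_index(uint8_t **p, const uint8_t data[32]) {
    memcpy(*p, data, 32);
    *p += 32; // folded into the store on Arm
}

// stp q6, q17, [x9, #-0x10] -- signed offset, no writeback...
// add x9, x9, #0x20         -- ...so a distinct instruction bumps the pointer.
static void store_pair_offset_then_add(uint8_t **p, const uint8_t data[32]) {
    memcpy(*p - 16, data, 32); // store at p - 16
    *p += 32;                  // separate add instruction
}
```

Both variants end with the pointer advanced by 32 and the same bytes in memory; the second simply spends one extra instruction to get there.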

It is not entirely clear to me why the compiler does it that way. But one way to get it to generate the same instructions is to change the addressing like this:

// from
vst1q_u16(dst_ptr + i, v_rgb565_low);
vst1q_u16(dst_ptr + i + 8, v_rgb565_high);

// to
vst1q_u16(dst_ptr, v_rgb565_low);
dst_ptr += 8;
vst1q_u16(dst_ptr, v_rgb565_high);
dst_ptr += 8;

That way the compiler generates the same instructions for both 16P and CGV.

100001c80: ac814506    	stp	q6, q17, [x8], #0x20
100001c84: f100414a    	subs	x10, x10, #0x10
100001c88: 54fffca1    	b.ne	0x100001c1c <_rgb888_to_rgb565_neon_sp+0x5c>

Conclusion

To conclude, we have looked at a rather simple algorithm that has been vectorized both by hand and by the compiler. Comparing the two led to several observations and the following conclusions:

  1. Compilers are really smart

They might optimize your problem just as well as, if not better than, you can, at least for trivial problems. Manual optimizations can even end up confusing the compiler, preventing it from doing its best work.

  2. Be aware of what the compiler knows and what it doesn't

Many of the differences in the generated assembly arose because the compiler could not rule out that other code influences the data and therefore had to play it safe, which resulted in extra instructions and an observable performance penalty.

  3. Pre-calculate what you can

The compiler does not necessarily move a calculation out of a loop, even when its result does not change across iterations. Hoisting such calculations out of a hot loop and caching the results can be worth it, especially when the calculations are expensive.

  4. Be as expressive as you can be

Some things that are obvious to you are not clear to the compiler, and generally, the more expressive your code is, the better the code the compiler eventually generates for the CPU. That said, one or two extra add instructions can be a worthwhile price for more concise and thereby more readable code. But if you want to make sure the compiler does the best it can, being more expressive is usually the way to achieve that.


Appendix

Scalar Algorithm (auto vectorized)

0000000100001554 <_rgb888_to_rgb565_scalar>:
100001554: a9412408    	ldp	x8, x9, [x0, #0x10]
100001558: 9b087d28    	mul	x8, x9, x8
10000155c: b4000988    	cbz	x8, 0x10000168c <_rgb888_to_rgb565_scalar+0x138>
100001560: f9400009    	ldr	x9, [x0]
100001564: f940002a    	ldr	x10, [x1]
100001568: f100211f    	cmp	x8, #0x8
10000156c: 54000543    	b.lo	0x100001614 <_rgb888_to_rgb565_scalar+0xc0>
100001570: d37ff90b    	lsl	x11, x8, #1
100001574: 8b0b014c    	add	x12, x10, x11
100001578: 8b08012d    	add	x13, x9, x8
10000157c: 8b0b01ab    	add	x11, x13, x11
100001580: eb0b015f    	cmp	x10, x11
100001584: fa4c3122    	ccmp	x9, x12, #0x2, lo
100001588: 54000463    	b.lo	0x100001614 <_rgb888_to_rgb565_scalar+0xc0>
10000158c: f100411f    	cmp	x8, #0x10
100001590: 54000802    	b.hs	0x100001690 <_rgb888_to_rgb565_scalar+0x13c>
100001594: d280000b    	mov	x11, #0x0               ; =0
100001598: aa0b03ee    	mov	x14, x11
10000159c: 927df10b    	and	x11, x8, #0xfffffffffffffff8
1000015a0: d37ff9cd    	lsl	x13, x14, #1
1000015a4: 8b0e01ac    	add	x12, x13, x14
1000015a8: 8b0c012c    	add	x12, x9, x12
1000015ac: 8b0d014d    	add	x13, x10, x13
1000015b0: cb0b01ce    	sub	x14, x14, x11
1000015b4: 4f008480    	movi.8h	v0, #0x4
1000015b8: 4f008441    	movi.8h	v1, #0x2
1000015bc: 4f0087e2    	movi.8h	v2, #0x1f
1000015c0: 4f0187e3    	movi.8h	v3, #0x3f
1000015c4: 0cdf4184    	ld3.8b	{ v4, v5, v6 }, [x12], #24
1000015c8: 2e241007    	uaddw.8h	v7, v0, v4
1000015cc: 6f1d04e7    	ushr.8h	v7, v7, #0x3
1000015d0: 2e251030    	uaddw.8h	v16, v1, v5
1000015d4: 6f1e0610    	ushr.8h	v16, v16, #0x2
1000015d8: 2e261004    	uaddw.8h	v4, v0, v6
1000015dc: 6f1d0484    	ushr.8h	v4, v4, #0x3
1000015e0: 6e626ce5    	umin.8h	v5, v7, v2
1000015e4: 6e636e06    	umin.8h	v6, v16, v3
1000015e8: 6e626c84    	umin.8h	v4, v4, v2
1000015ec: 4f1b54a5    	shl.8h	v5, v5, #0xb
1000015f0: 4f1554c6    	shl.8h	v6, v6, #0x5
1000015f4: 4ea51cc5    	orr.16b	v5, v6, v5
1000015f8: 4ea41ca4    	orr.16b	v4, v5, v4
1000015fc: 3c8105a4    	str	q4, [x13], #0x10
100001600: b10021ce    	adds	x14, x14, #0x8
100001604: 54fffe01    	b.ne	0x1000015c4 <_rgb888_to_rgb565_scalar+0x70>
100001608: eb0b011f    	cmp	x8, x11
10000160c: 54000061    	b.ne	0x100001618 <_rgb888_to_rgb565_scalar+0xc4>
100001610: 1400001f    	b	0x10000168c <_rgb888_to_rgb565_scalar+0x138>
100001614: d280000b    	mov	x11, #0x0               ; =0
100001618: d37ff96c    	lsl	x12, x11, #1
10000161c: 8b0c014a    	add	x10, x10, x12
100001620: 8b0b018c    	add	x12, x12, x11
100001624: 8b090189    	add	x9, x12, x9
100001628: 91000929    	add	x9, x9, #0x2
10000162c: cb0b0108    	sub	x8, x8, x11
100001630: 528003eb    	mov	w11, #0x1f              ; =31
100001634: 528007ec    	mov	w12, #0x3f              ; =63
100001638: 385fe12d    	ldurb	w13, [x9, #-0x2]
10000163c: 385ff12e    	ldurb	w14, [x9, #-0x1]
100001640: 3840352f    	ldrb	w15, [x9], #0x3
100001644: 110011ad    	add	w13, w13, #0x4
100001648: 53037dad    	lsr	w13, w13, #3
10000164c: 110009ce    	add	w14, w14, #0x2
100001650: 53027dce    	lsr	w14, w14, #2
100001654: 110011ef    	add	w15, w15, #0x4
100001658: 53037def    	lsr	w15, w15, #3
10000165c: 71007dbf    	cmp	w13, #0x1f
100001660: 1a8b31ad    	csel	w13, w13, w11, lo
100001664: 7100fddf    	cmp	w14, #0x3f
100001668: 1a8c31ce    	csel	w14, w14, w12, lo
10000166c: 71007dff    	cmp	w15, #0x1f
100001670: 1a8b31ef    	csel	w15, w15, w11, lo
100001674: 531b69ce    	lsl	w14, w14, #5
100001678: 2a0d2dcd    	orr	w13, w14, w13, lsl #11
10000167c: 2a0f01ad    	orr	w13, w13, w15
100001680: 7800254d    	strh	w13, [x10], #0x2
100001684: f1000508    	subs	x8, x8, #0x1
100001688: 54fffd81    	b.ne	0x100001638 <_rgb888_to_rgb565_scalar+0xe4>
10000168c: d65f03c0    	ret
100001690: 927ced0b    	and	x11, x8, #0xfffffffffffffff0
100001694: 4f008480    	movi.8h	v0, #0x4
100001698: 4f008441    	movi.8h	v1, #0x2
10000169c: 4f0087e2    	movi.8h	v2, #0x1f
1000016a0: 4f0187e3    	movi.8h	v3, #0x3f
1000016a4: aa0b03ec    	mov	x12, x11
1000016a8: aa0a03ed    	mov	x13, x10
1000016ac: aa0903ee    	mov	x14, x9
1000016b0: 4cdf41c4    	ld3.16b	{ v4, v5, v6 }, [x14], #48
1000016b4: 6e241007    	uaddw2.8h	v7, v0, v4
1000016b8: 2e241010    	uaddw.8h	v16, v0, v4
1000016bc: 6f1d0610    	ushr.8h	v16, v16, #0x3
1000016c0: 6f1d04e7    	ushr.8h	v7, v7, #0x3
1000016c4: 6e251031    	uaddw2.8h	v17, v1, v5
1000016c8: 2e251032    	uaddw.8h	v18, v1, v5
1000016cc: 6f1e0652    	ushr.8h	v18, v18, #0x2
1000016d0: 6f1e0631    	ushr.8h	v17, v17, #0x2
1000016d4: 6e261013    	uaddw2.8h	v19, v0, v6
1000016d8: 2e261004    	uaddw.8h	v4, v0, v6
1000016dc: 6f1d0484    	ushr.8h	v4, v4, #0x3
1000016e0: 6f1d0665    	ushr.8h	v5, v19, #0x3
1000016e4: 6e626ce6    	umin.8h	v6, v7, v2
1000016e8: 6e626e07    	umin.8h	v7, v16, v2
1000016ec: 6e636e30    	umin.8h	v16, v17, v3
1000016f0: 6e636e51    	umin.8h	v17, v18, v3
1000016f4: 6e626ca5    	umin.8h	v5, v5, v2
1000016f8: 6e626c84    	umin.8h	v4, v4, v2
1000016fc: 4f1b54e7    	shl.8h	v7, v7, #0xb
100001700: 4f1b54c6    	shl.8h	v6, v6, #0xb
100001704: 4f155631    	shl.8h	v17, v17, #0x5
100001708: 4f155610    	shl.8h	v16, v16, #0x5
10000170c: 4ea61e06    	orr.16b	v6, v16, v6
100001710: 4ea71e27    	orr.16b	v7, v17, v7
100001714: 4ea41ce4    	orr.16b	v4, v7, v4
100001718: 4ea51cc5    	orr.16b	v5, v6, v5
10000171c: ac8115a4    	stp	q4, q5, [x13], #0x20
100001720: f100418c    	subs	x12, x12, #0x10
100001724: 54fffc61    	b.ne	0x1000016b0 <_rgb888_to_rgb565_scalar+0x15c>
100001728: eb0b011f    	cmp	x8, x11
10000172c: 54fffb00    	b.eq	0x10000168c <_rgb888_to_rgb565_scalar+0x138>
100001730: 371ff348    	tbnz	w8, #0x3, 0x100001598 <_rgb888_to_rgb565_scalar+0x44>
100001734: 17ffffb9    	b	0x100001618 <_rgb888_to_rgb565_scalar+0xc4>

Loop (auto vectorized)

1000016b0: 4cdf41c4    	ld3.16b	{ v4, v5, v6 }, [x14], #48
1000016b4: 6e241007    	uaddw2.8h	v7, v0, v4
1000016b8: 2e241010    	uaddw.8h	v16, v0, v4
1000016bc: 6f1d0610    	ushr.8h	v16, v16, #0x3
1000016c0: 6f1d04e7    	ushr.8h	v7, v7, #0x3
1000016c4: 6e251031    	uaddw2.8h	v17, v1, v5
1000016c8: 2e251032    	uaddw.8h	v18, v1, v5
1000016cc: 6f1e0652    	ushr.8h	v18, v18, #0x2
1000016d0: 6f1e0631    	ushr.8h	v17, v17, #0x2
1000016d4: 6e261013    	uaddw2.8h	v19, v0, v6
1000016d8: 2e261004    	uaddw.8h	v4, v0, v6
1000016dc: 6f1d0484    	ushr.8h	v4, v4, #0x3
1000016e0: 6f1d0665    	ushr.8h	v5, v19, #0x3
1000016e4: 6e626ce6    	umin.8h	v6, v7, v2
1000016e8: 6e626e07    	umin.8h	v7, v16, v2
1000016ec: 6e636e30    	umin.8h	v16, v17, v3
1000016f0: 6e636e51    	umin.8h	v17, v18, v3
1000016f4: 6e626ca5    	umin.8h	v5, v5, v2
1000016f8: 6e626c84    	umin.8h	v4, v4, v2
1000016fc: 4f1b54e7    	shl.8h	v7, v7, #0xb
100001700: 4f1b54c6    	shl.8h	v6, v6, #0xb
100001704: 4f155631    	shl.8h	v17, v17, #0x5
100001708: 4f155610    	shl.8h	v16, v16, #0x5
10000170c: 4ea61e06    	orr.16b	v6, v16, v6
100001710: 4ea71e27    	orr.16b	v7, v17, v7
100001714: 4ea41ce4    	orr.16b	v4, v7, v4
100001718: 4ea51cc5    	orr.16b	v5, v6, v5
10000171c: ac8115a4    	stp	q4, q5, [x13], #0x20
100001720: f100418c    	subs	x12, x12, #0x10
100001724: 54fffc61    	b.ne	0x1000016b0 <_rgb888_to_rgb565_scalar+0x15c>

Loop (hand optimized)

100001f28: f940000b    	ldr	x11, [x0]
100001f2c: 8b08016b    	add	x11, x11, x8
100001f30: 4c404164    	ld3.16b	{ v4, v5, v6 }, [x11]
100001f34: 6e241007    	uaddw2.8h	v7, v0, v4
100001f38: 6e251030    	uaddw2.8h	v16, v1, v5
100001f3c: 6e261011    	uaddw2.8h	v17, v0, v6
100001f40: 2e241012    	uaddw.8h	v18, v0, v4
100001f44: 2e251033    	uaddw.8h	v19, v1, v5
100001f48: 2e261004    	uaddw.8h	v4, v0, v6
100001f4c: 6f1d04e5    	ushr.8h	v5, v7, #0x3
100001f50: 6f1e0606    	ushr.8h	v6, v16, #0x2
100001f54: 6f1d0627    	ushr.8h	v7, v17, #0x3
100001f58: 6f1d0650    	ushr.8h	v16, v18, #0x3
100001f5c: 6f1e0671    	ushr.8h	v17, v19, #0x2
100001f60: 6f1d0484    	ushr.8h	v4, v4, #0x3
100001f64: 6e626ca5    	umin.8h	v5, v5, v2
100001f68: 6e636cc6    	umin.8h	v6, v6, v3
100001f6c: 6e626ce7    	umin.8h	v7, v7, v2
100001f70: 6e626e10    	umin.8h	v16, v16, v2
100001f74: 6e636e31    	umin.8h	v17, v17, v3
100001f78: 6e626c84    	umin.8h	v4, v4, v2
100001f7c: 4f1b54a5    	shl.8h	v5, v5, #0xb
100001f80: 4f1554c6    	shl.8h	v6, v6, #0x5
100001f84: 4f1b5610    	shl.8h	v16, v16, #0xb
100001f88: 4f155631    	shl.8h	v17, v17, #0x5
100001f8c: 4ea51cc5    	orr.16b	v5, v6, v5
100001f90: 4eb01e26    	orr.16b	v6, v17, v16
100001f94: 4ea41cc4    	orr.16b	v4, v6, v4
100001f98: f940002b    	ldr	x11, [x1]
100001f9c: 4ea71ca5    	orr.16b	v5, v5, v7
100001fa0: 3ca96964    	str	q4, [x11, x9]
100001fa4: f940002b    	ldr	x11, [x1]
100001fa8: 8b09016b    	add	x11, x11, x9
100001fac: 3d800565    	str	q5, [x11, #0x10]
100001fb0: 9100414a    	add	x10, x10, #0x10
100001fb4: a941300b    	ldp	x11, x12, [x0, #0x10]
100001fb8: 9b0b7d8b    	mul	x11, x12, x11
100001fbc: 91008129    	add	x9, x9, #0x20
100001fc0: 9100c108    	add	x8, x8, #0x30
100001fc4: eb0a017f    	cmp	x11, x10
100001fc8: 54fffb08    	b.hi	0x100001f28 <_rgb888_to_rgb565_neon_16_vals+0x54>