I'm having a hard time getting the execution time, the number of clock cycles, or the memory size any lower than they are now. Does anyone have an idea how I can do this? The code works fine; I just want to change it a bit.
I wrote working code and don't want to break it, but I also don't know what changes to make.
; Calculation of a factorial value using a simple loop
; set up the exception addresses
THUMB
AREA RESET, CODE, READONLY
EXPORT __Vectors
EXPORT Reset_Handler
__Vectors
DCD 0x00180000 ; top of the stack
DCD Reset_Handler ; reset vector - where the program starts
AREA 2a_Code, CODE, READONLY
Reset_Handler
ENTRY
start
MOV r1,#0 ; count the number of multiplications performed
MOV r2,#3 ; the final value in the factorial calculation
MOV r3,#1 ; the factorial result will be stored here
; loop r2 times forming the product
fact
ADD r1,r1,#1 ; find the next multiplicand
MUL r3,r1,r3 ; form the next product - note that MUL r3,r3,r1 gives unpredictable output
CMP r1,r2 ; check if the final value has been reached
BMI fact ; continue if all products have not been formed
exit ; stay in an endless loop
B exit
END
The current results are: Memory Size: 0x00000024, Clock Cycles: 22, Total Execution Time: 1.1 microseconds
We are working with the Cortex M3
I just need any of these to be reduced, the changes to the code can be minor as long as it produces different results.
Often code-size and performance are a tradeoff. Unrolling a loop often helps performance (for large inputs at least), but requires extra logic outside the loop to handle the cleanup and so on.
(The original question didn't specify a core, and I was expecting that even low-end CPUs would have multi-cycle mul latency. I only found Cortex-M3 numbers after writing it.)
Your code will probably bottleneck on the latency of integer multiply. Unlike add, where the result will be ready the next cycle, mul is complex and takes multiple cycles to produce a result.
(Except on some very slowly-clocked chips: apparently Cortex-M3 has a 1-cycle mul instruction. But Cortex-M0/M0+/M23 are available with a choice of 1 cycle or 32 cycle performance for that instruction! Slow iterative = smaller silicon.)
The multiply execution unit itself is often pipelined so multiple independent multiplies can be in flight at once, but your factorial loop needs each multiply result as an input to the next iteration. (Only for higher-performance cores, not Cortex-M series. The 32-cycle multiply on slow cortex-M chips is iterative and presumably not pipelined, so another multiply couldn't start while it's running, and there'd be no benefit to exposing any instruction-level parallelism beyond reducing loop overhead.)
Notice that multiplication is commutative and associative: 1 * 2 * 3 = 3 * 2 * 1, so we can count down from n, as @ensc's answer points out. Or (1*2) * (3*4) = 1*2*3*4.
We could instead do 1 * 2 * ... * (n/2) in parallel with (n/2+1) * (n/2+2) * (n/2+3) * ... * n, interleaving work on those two dependency chains. Or we could interleave 1 * 3 * 5 * ... * n with 2 * 4 * 6 * ... * (n-1), in a loop that did n -= 2 and calculates n+1 from that. (Then at the end, you multiply those 2 products.)
This is obviously going to require more code-size, but could help performance a lot.
Of course, a lookup table is another workaround. If you only care about inputs that don't overflow a 32-bit result, that's a pretty small table. But that has a significant size cost.
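For a feel of how small that table is, here is a minimal C sketch. The names fact_table and fact_lut are made up for this example. 12! is the largest factorial that fits in 32 bits, so the table has 13 entries (52 bytes of data).

```c
#include <assert.h>
#include <stdint.h>

/* 0! .. 12!; 13! = 6227020800 no longer fits in 32 bits. */
static const uint32_t fact_table[13] = {
    1u, 1u, 2u, 6u, 24u, 120u, 720u, 5040u, 40320u,
    362880u, 3628800u, 39916800u, 479001600u
};

/* Hypothetical helper: n must be in 0..12. */
uint32_t fact_lut(uint32_t n)
{
    assert(n <= 12u);
    return fact_table[n];
}
```

Whether 52 bytes of constants beats a 10-byte loop depends on which of size or speed you are being graded on.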
Even on an in-order CPU (where instruction execution has to start in program order), long-running instructions like cache-miss loads, or multiplies, may be allowed to complete out of order, so e.g. some add instructions could run after starting a mul but before the mul result was written back. Or even starting another independent mul instruction in the shadow of an earlier mul's latency.
I googled some ARM performance numbers to maybe get a feel for what's typical.
For example, Cortex-A9 is an older fairly common high-end ARMv7 CPU that is superscalar (multiple instructions per cycle) with out-of-order execution.
mul "takes" 2 cycles, and has 4 cycle result latency. They don't explain what they mean by the non-latency cost. Perhaps that's the reciprocal throughput of the execution unit, like how often you can start a new independent operation. It's an out-of-order CPU so it doesn't make sense for it to stall other instructions for 2 cycles. In the NEON SIMD instruction section, they explain what looks like the same "cycles" number:
This is the number of issue cycles the particular instruction consumes, and is the absolute minimum number of cycles per instruction if no operand interlocks are present.
(operand interlocks = waiting for an input operand to be ready, if an earlier instruction hasn't produced a result yet).
(Cortex-A9 does support packed integer multiplication, so for large factorials you could look at doing 4 multiplies in parallel starting one vector per 4 cycles, using vmul.32 q1, q1, q2. Or 2 per 2 cycles with 64-bit d registers, but then you'd need more vadd instructions and unlike multiply, vadd.32 is just as fast with 128-bit q regs as with 64-bit vectors. So SIMD can give you twice the multiply throughput of scalar on Cortex-A9, if you use enough registers to hide the large latency. But SIMD would probably only be useful with n so large that n! overflows a 32-bit integer, so you get a result modulo 2^32.)
mul is a 32x32 => 32-bit multiply. On Cortex-A9, it has 2c throughput and 4c latency.
(muls is a 16-bit instruction in Thumb mode, and should be preferred unless you need to not clobber the flags. mul in Thumb mode is only available in ARMv6T2 and later.)
smulbb is a 16x16 => 32-bit signed multiply that only reads the low half of its inputs, but has 1c throughput and 3c latency on A9. (BB = bottom, bottom. The other combinations are also available, along with multiply-accumulate and various funky things.)
There is no 2-byte Thumb version of smulxy, so this is worse for code-size than muls.
Unfortunately smulxy isn't available in an unsigned version, so that limits the range of inputs we can use it with to positive int16_t, not uint16_t.
But if we only care about the case where the final 32-bit result doesn't overflow, we can arrange our order of operations so the last multiply has 2 inputs of similar magnitude (both large-ish 16-bit numbers), i.e. as close to sqrt(n!) as possible. So e.g. the product of odds and evens would be reasonable, but (n-1)! * n would be the worst case because that would require (n-1)! to fit in 16 bits. Actually the worst case would be counting down from n so the last one is a multiply by 3 then 2. We could special case the multiply by 2 to a left shift...
Putting these pieces together, notice that multiplying by 1 is a no-op (except with smulbb where it truncates the input to 16 bit). So we can unroll in a way that stops after a multiply by 1 or 2 depending on the input being odd or even.
So instead of knowing which is odd and which is even, we just have lo (starting with n-1) and hi (starting with n).
;; UNTESTED, but it does assemble with the GNU assembler, after sed -i 's/;/@/' arm-fact.S
;; and replacing THUMB with
; .thumb
; .syntax unified
THUMB
;; Input: n in r0. (n is signed positive, otherwise we return n.)
;; Output: n! in r0.
;; clobbers: r1, r2, r3
;; pre-conditions: n! < 2^31. Or maybe slightly lower.
fact:
subs r3, r0, #3 ; r3 = lo = n-3 (first multiplier for loprod)
bls .Ltiny_input
subs r2, r0, #2 ; r2 = hi = n-2 (first multiplier for hiprod)
subs r1, r0, #1 ; r1 = loprod = n-1
; r0 = hiprod = n
.Lloop: ; do {
smulbb r0,r0, r2 ; hiprod *= hi
subs r2, #2 ; hi -= 2 for next iter
smulbb r1,r1, r3
subs r3, #2 ; lo -= 2 for next iter
bgt .Lloop ; while((lo-=2) > 0); signed condition
; r3 = 0 or -1, r2 = 1 or 0. The last multiplies were:
; hiprod *= 2 and loprod *= 1 for even n
; or hiprod *= 3 and loprod *= 2 for odd n
; muls r0, r1
smulbb r0,r0, r1 ; return hiprod *= loprod
bx lr ; or inline this
.Ltiny_input: ; alternate return path for tiny inputs
; r0 = n. flags still set from n - 3
IT eq ; GAS insists on explicit IT for thumb mode
moveq r0, #6 ; 3! = 6, else n! = n for smaller n=1 or 2.
; 0! = 1 case is not handled, nor are negative inputs
bx lr
(.L in a label name makes it a local label that doesn't show up in the object file, at least in GAS syntax. Maybe not in ARMASM, if you're using that assembler.)
ARM assembly lets you leave out the destination when it's the same as the first source, for some instructions like subs but not smulbb. You could write it out like subs r2, r2, #2 every time if you want.
You might use muls r0, r1 for the final product, because the final hiprod is a bit higher than loprod. The product might not overflow even if hiprod > max int16_t. That would save 2 bytes of code-size, too, but add 1 cycle of latency on Cortex-A9. (BTW, ARMv6 fixed the "unpredictable result" with mul d,d, src weirdness, and your code used 32-bit Thumb2 instructions, thus it only works on ARMv6T2 and above anyway.)
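To see where the 16-bit limit bites, here is a rough C model of the two-accumulator loop above. It is a sketch, not a test of the real assembly: low16_signed, smulbb and fact_model are names invented here, smulbb() only imitates the instruction's "signed low halves" behaviour, and the model assumes 1 <= n <= 12.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of x, as SMULBB sees each operand. */
static int32_t low16_signed(int32_t x)
{
    int32_t lo = (int32_t)((uint32_t)x & 0xFFFFu);
    return lo >= 0x8000 ? lo - 0x10000 : lo;
}

/* Model of SMULBB: signed 16x16 => 32-bit multiply of the low halves. */
static int32_t smulbb(int32_t a, int32_t b)
{
    return low16_signed(a) * low16_signed(b);
}

/* Model of the asm: final_is_muls selects muls (full 32-bit multiply)
   instead of smulbb for the last hiprod * loprod step. */
int32_t fact_model(int32_t n, int final_is_muls)
{
    assert(n >= 1 && n <= 12);
    if (n <= 3)                     /* the tiny-input return path */
        return n == 3 ? 6 : n;

    int32_t lo = n - 3, hi = n - 2;
    int32_t loprod = n - 1, hiprod = n;
    do {
        hiprod = smulbb(hiprod, hi);
        hi -= 2;
        loprod = smulbb(loprod, lo);
        lo -= 2;
    } while (lo > 0);               /* same condition as bgt */

    return final_is_muls ? hiprod * loprod : smulbb(hiprod, loprod);
}
```

For n = 12 the final hiprod is 12*10*8*6*4*2 = 46080, which no longer fits in a signed 16-bit value, so in this model only the muls variant returns 12! = 479001600.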
With 2 accumulators for the products, this can possibly run at 2 multiplies per 3 cycles on Cortex-A9, depending greatly on the CPU micro-architecture and whether its front-end can keep up. On an in-order ARM, I'd be worried about it being able to start other instructions before a multiply finished.
It might be better to spend 2 extra bytes on sub instead of subs so we can compute the flags a couple instructions ahead of the branch, maybe reducing branch mispredict penalty and avoiding stalls on in-order CPUs. smulbb doesn't touch flags, so we can do loprod first and have the hi stuff not touch flags.
.loop: ; do {
smulbb r1, r1, r3 ; loprod *= lo
subs r3, #2 ; lo -= 2 for next iter, and set flags
smulbb r0, r0, r2 ; hiprod *= hi
sub r2, #2 ; hi -= 2 for next iter (no flags)
bgt .loop ; while((lo-=2) > 0); signed condition
Note that we're modifying r3 and r2 right after smulbb reads them, avoiding creating a stall for the data dependency on in-order chips.
You're using Thumb mode and optimizing for code-size, so it's important to know which forms of which instructions can use a 2-byte / 16-bit encoding and which are only available as 32-bit Thumb2 encodings.
subs Rd, Rn, #imm can be encoded as a 16-bit Thumb instruction for imm=0..7 (3-bit immediate). Or with the same register as src and destination, for imm=0..255. So my copy-and-sub instructions are compact.
Non-flag-setting sub can't be a 16-bit instruction except inside an IT block, or with SP as the operand.
Predicated instructions in Thumb mode, like moveq r0, #6, require the assembler to use an IT instruction to introduce predication for the next up-to-4 instructions. In ARM mode, the top 4 bits of every instruction signal predication. (If you don't use a suffix, the assembler encodes it as ALways, i.e. not predicated.)
We could handle the n==0 case with another 4 or 6 bytes, with cmp r0,#0 / moveq r0, #1. Maybe getting it down to 4 bytes if we put the cmp / mov inside the same IT block. IT doesn't snapshot the actual flag condition, it snapshots which predicate, so flag-setting instructions inside an IT block can have an effect on later instructions in the same block. (I think this is right, but I'm not 100% sure.)
tiny_input: ; r0 = n, flags set according to n-3
ITET EQ
moveq r0, #6
cmpne r0, #0
moveq r0, #1
Or there's 16-bit cbnz to conditionally jump over a mov r0, #1. But the branch target must be from 4 to 130 bytes after the cbnz, so we can't jump over just a single 16-bit instruction, apparently!
$ arm-none-eabi-gcc -g -c -mcpu=cortex-a9 arm-fact.S
$ arm-none-eabi-objdump -drwC arm-fact.o
arm-fact.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <fact>:
0: 1ec3 subs r3, r0, #3
2: d90b bls.n 1c <.tiny_input>
4: 1e82 subs r2, r0, #2
6: 1e41 subs r1, r0, #1
00000008 <.loop>:
8: fb10 f002 smulbb r0, r0, r2
c: 3a02 subs r2, #2
e: fb11 f103 smulbb r1, r1, r3
12: 3b02 subs r3, #2
14: dcf8 bgt.n 8 <.loop>
16: fb10 f001 smulbb r0, r0, r1
1a: 4770 bx lr
0000001c <.tiny_input>:
1c: bf08 it eq
1e: 2006 moveq r0, #6
20: 4770 bx lr
So it's 0x22 bytes for this function. (Or 0x26 if we want to handle 0! = 1.)
It's larger than your version (your byte count includes some constants in memory, and the mov instructions to produce input), but in theory maybe better than twice as fast for large input, on CPUs with pipelined multipliers. And maybe much faster for inputs from 1 to 3, where it just branches once and produces the result.
You probably don't have anything like a Cortex-A9, because your 1.1 microseconds = 22 clock cycles means a 20MHz clock speed, while Cortex-A9 was available in 0.8 to 2GHz.
So maybe you have a much simpler in-order core like Cortex-M3? M3 does support the mul instruction, and Thumb2 mode. And Wikipedia says its multiply is 1 cycle! So that's weird, I'm surprised it has that efficient a multiplier. Or just that it clocks so slowly that there's time for a lot of gate delays in 1 stage, and it's only a 3-stage pipeline.
subs and muls are single-cycle on Cortex-M3. I haven't found perf numbers on branches, but they're common so I'm assuming it's probably 1 cycle and doesn't cause a big fetch bubble (if correctly predicted...). The Cortex-M3 HTML manual has a section on Branch target forwarding which appears to be about reducing the fetch bubble.
Its instruction timing table shows b<cond> costs 1 cycle for not-taken, or 2 cycles for taken (1 for the branch, 1 for the pipeline reload after an immediate displacement). So taken branches are slow compared to sub/mul and unrolling would be valuable, so my code above should still work well. (But multiple product accumulators are not necessary, so it can be simplified.)
;; UNTESTED
THUMB
;; Input: n in r0. (n is signed positive, otherwise we return n.)
;; Output: n! in r0.
;; clobbers: r1
fact:
subs r1, r0, #1 ; i = n-1
bls .Ltiny_input ; jump if n<=1
.Lloop: ; do {
muls r0, r1 ; prod *= i
subs r1, #1 ; --i
bgt .Lloop ; while(--i > 0); signed condition
; r1 = 0, r0 = n!
; last multiply was a redundant prod *= 1 but avoiding that would take a cmp
.Ltiny_input: ; alternate return path for tiny inputs
; 0! = 1 case is not handled, nor are negative inputs
bx lr ; or inline this
I think that's the smallest we can manage. The loop has 3 instructions, and probably costs 4 cycles per iteration (1 + 1 + 2, the taken branch costing 2 cycles).
00000000 <fact>:
0: 1e41 subs r1, r0, #1
2: d902 bls.n a <fact+0xa>
4: 4348 muls r0, r1
6: 3901 subs r1, #1
8: dcfc bgt.n 4 <fact+0x4>
a: 4770 bx lr # don't count this if inlining
So this is 0xa = 10 bytes, not counting the bx lr return instruction.
We could handle the 0! = 1 case with an IT block after the first subs, before the branch, so we can still jump to right after the loop (instead of to a separate block like my Cortex-A9 version). You could use this trick for it, too, though.
subs r1, r0, #1 ; i = n-1
it lt
movlt r0, #1 ; n = 1 for n<1
bls .Ltiny_input ; return n if n was <=1
If we needed more range for the branch, we could use itt ls / movls r0, #1, so the branch was inside the IT block (where branch instructions can use an encoding that spends more bits on displacement and none on the predicate). But it's a short range in this case, so I chose to leave r0 unmodified in the r0 == 1 case. I don't know if there are any CPUs where it's more efficient or lower latency for a predicated instruction to be a NOP instead of running, but there might be.
Without unrolling, putting a cmp in the loop to avoid the last *=1 iteration would cost us an extra cycle per iteration (5 cycles instead of 4, using the 1 + 1 + 2 count above), so it would only pay for itself with n=2 or maybe n=3.
Unrolling could help speed significantly for larger inputs, going from 1 mul per 4 cycles to asymptotically approaching 1 mul per 2 cycles (sub + mul + amortized loop overhead). I can't see any way to avoid an instruction like sub or mov to generate a separate input for each mul, except by hard-coding special case sequences for each n (like *2 * 4 = *8 = left shift by 3) when you could instead just hard-code the answer.
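As a rough illustration of the unrolling idea (in C rather than Thumb, and not tuned for any particular core), a count-down loop unrolled by two could look like the sketch below. The name fact_unroll2 is invented for this example, and the result is n! modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* Count-down factorial unrolled by two: one loop branch per two multiplies.
   Returns n! modulo 2^32; 0! = 1. */
uint32_t fact_unroll2(uint32_t n)
{
    uint32_t prod = 1u;
    if (n & 1u) {               /* odd n: peel one multiply first */
        prod = n;
        n -= 1u;
    }
    assert((n & 1u) == 0u);     /* invariant: n is even from here on */
    while (n != 0u) {
        prod *= n;
        prod *= n - 1u;         /* still needs a sub to form the second input */
        n -= 2u;
    }
    return prod;
}
```

The peeled multiply is the "extra logic outside the loop" that unrolling costs in code size, and the last pass still does a redundant multiply by 1.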
Combining r1 and r2 is the obvious solution, which you also get when cheating with a C compiler...
unsigned int foo(unsigned int a)
{
unsigned int res = 1;
while (a > 0) {
res *= a;
--a;
}
return res;
}
translates to
subs r3, r0, #0
mov r0, #1
bxeq lr
1: mul r0, r3, r0
subs r3, r3, #1
bne 1b
bx lr
If TL;DR then skip to the end for punch line.
Ran this on an STM32 blue pill, a STM32F103C8T6
Definitely expect results to change with different chips, even if they have the same rev of Cortex-M3: the processor is one thing, but what feeds it and how is another, and that is vendor specific. Also, the chip vendor can sometimes compile the core differently: sometimes they can have multi-cycle multiplies to save on chip real estate, and for some cores they can pick between fetching 16 bits at a time or 32. Benchmarks are often easy to muck with, so take them with a grain of salt.
I have generally seen execution from SRAM be faster than from flash. With ST, though, sometimes not. I don't think these older Cortex-M3 parts have ST's (instruction) cache with the fancy name; newer ones do, and you can't turn it off.
Other chip vendors don't have this and will, for cores that support it, implement ARM's caches rather than their own (or have neither). That is perhaps why the first two experiments below take different amounts of time. (The two-digit number at the top of each listing is hex, the elapsed SysTick timer counts; the SysTick CVR address is passed in r0.) You can see I used a nop to change the alignment of the loop. The ARM documentation didn't state in the usual place whether the Cortex-M3 fetches halfwords or words, but the ST documentation, when talking about something else, states word fetches. Your four-instruction loop is 2 words, but when it is not aligned on a word boundary it needs to fetch three words per loop, whereas if those four instructions are aligned it needs to fetch two words per loop. I will let Peter or someone else count instructions for this/your code. I am sure that is a factor; there are perhaps others, probably not.
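The word-fetch arithmetic can be checked with a tiny C helper. This is only a sketch of the counting argument, assuming aligned 32-bit fetches; words_spanned is a name made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Number of aligned 32-bit words touched by len_bytes bytes of code
   starting at addr (assumes word-wide instruction fetches). */
unsigned words_spanned(uint32_t addr, uint32_t len_bytes)
{
    assert(len_bytes >= 1u);
    uint32_t first = addr >> 2;
    uint32_t last  = (addr + len_bytes - 1u) >> 2;
    return (unsigned)(last - first + 1u);
}
```

For the 8-byte loop in the SRAM experiments below, a start address of 0x20000022 gives 3 words and 0x20000024 gives 2, which lines up with the 0x15 vs 0x12 counts in the notes.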
For this chip, running from flash is much faster. You can see the effects of turning off ST's prefetch and of adding wait states.
000 Zero wait state, if 0 < SYSCLK ≤ 24 MHz
001 One wait state, if 24 MHz < SYSCLK ≤ 48 MHz
010 Two wait states, if 48 MHz < SYSCLK ≤ 72 MHz
So while I am running off the internal 8 MHz clock, there are two measurements here. One is the number of clocks it takes to do something: if we triple the SYSCLK to 24 MHz, the number of clocks should not change. The wall-clock duration of each SYSCLK cycle is a third of what it was, so wall-clock time is shorter and real-time performance is better. Following those rules, go one step above 24 MHz and now you add a wait state, and your code slows down again, as the number of system clocks needed to run the code has gone up. Now if you double that to 48 MHz, has that overcome the wait state? Probably, but for each program/loop there is a point between 24 MHz plus a smidge and 48 MHz where performance catches up to what it was right at 24 MHz. At 48 MHz plus a smidge you slow down again, and somewhere between 48 MHz plus a smidge and 72 MHz we hopefully catch up to and pass the 48 MHz performance.
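A small C sketch of that trade-off, using only the wait-state table above. The function names are made up for this example, and it assumes a clock count measured at 8 MHz with a given wait-state setting would carry over unchanged to a faster SYSCLK.

```c
#include <assert.h>
#include <stdint.h>

/* Flash wait states per the table above (parts up to 72 MHz). */
unsigned flash_wait_states(uint32_t sysclk_hz)
{
    assert(sysclk_hz > 0u && sysclk_hz <= 72000000u);
    if (sysclk_hz <= 24000000u) return 0u;
    if (sysclk_hz <= 48000000u) return 1u;
    return 2u;
}

/* Wall-clock nanoseconds for a run that takes `clocks` SYSCLK cycles. */
double run_time_ns(uint32_t clocks, uint32_t sysclk_hz)
{
    return (double)clocks * 1e9 / (double)sysclk_hz;
}
```

For example, the 0x2d (45) clock loop takes 5625 ns at 8 MHz and would take 1875 ns at 24 MHz, still with zero wait states. One Hz above that, the clock count itself goes up, and that is the cliff.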
Just as the flash cannot keep up, other peripherals have rules, especially with older chips like many of the Cortex-M3 based ones, so there are other performance cliffs you can fall off. Some peripherals cannot run as fast as SYSCLK, so there may be some speed X where you are at the maximum for one or more of your peripherals or peripheral buses, and at X plus a smidge you have to halve their clock, as that is your smallest divisor. Now your peripherals and/or their buses run at half speed, so the performance of your code falls off a cliff, possibly to worse than half. This code of yours doesn't really touch a peripheral. It does use multiply, which is risky for performance, but for the Cortex-M3 I didn't see a compile-time option for single cycle vs. other; it just said single cycle.
Peter covered the obvious optimization: whenever you are counting up to some number, if the instruction set and your code allow it (which they do in this case because a * b * c = c * b * a), you want to count down and use the flags to compare with zero (or plus/minus if that floats your boat), rather than increment and then have to do a compare before the conditional branch. When you skip to the end you will see that it was faster (fewer clocks).
The M3 cores don't have caches; the M7s can, and vendors sometimes put their own in front of the flash on M3/M4 parts. On a cached part, this code with its small loop would want to be wrapped in a run-many-times loop and timed to see the effects of caching, cache-line alignment and such. But for the M3, one time through is fine (if the chip doesn't have a hidden cache you can't control).
I am only really interested in the loop here as that has the most potential for cycle stealers. Validating/limiting the input, checking for shortcuts, looking for overflow when multiplying, etc, not something this answer is worrying about.
I recommend you look for Michael Abrash's books, Zen of Assembly for example, a copy of which you can build from GitHub. I read it when it came out and have used what I learned there ever since: debugging chips, tools, breaking stuff, improving performance, etc. The 8088/86 was obsolete when it came out, and if you think it's an x86 book you are completely missing the point. For example, my assumption that SRAM was going to be faster didn't hold here. I also tried things like adding nops (extra instructions) inside the loop; believe it or not, there are times when that can make a loop faster. With these short-pipeline, small-prefetch processors, though, that generally isn't the case.
Sometimes you can get free instructions in a loop: the number of clocks is the same even with more instructions. For example, if this had a multi-clock multiply, then depending on how many clocks it takes and what registers/resources you touch, you might get some free instructions in that loop. This appears to be a single-cycle multiply, so we can't hope for that here.
Then there is the pipeline stuff you read in the Patterson and Hennessy text books. Which registers you choose can affect the performance. Order of instructions if you can functionally re-arrange the instructions, etc.
Notes taken doing simple experiments
15
20000018 <fact>:
20000018: b430 push {r4, r5}
2000001a: 2100 movs r1, #0
2000001c: 2203 movs r2, #3
2000001e: 2301 movs r3, #1
20000020: 6804 ldr r4, [r0, #0]
20000022 <fact_loop>:
20000022: 3101 adds r1, #1
20000024: 434b muls r3, r1
20000026: 4291 cmp r1, r2
20000028: d4fb bmi.n 20000022 <fact_loop>
2000002a: 6805 ldr r5, [r0, #0]
2000002c: 1b60 subs r0, r4, r5
2000002e: bc30 pop {r4, r5}
20000030: 4770 bx lr
12
20000018 <fact>:
20000018: b430 push {r4, r5}
2000001a: 2100 movs r1, #0
2000001c: 2203 movs r2, #3
2000001e: 2301 movs r3, #1
20000020: 46c0 nop ; (mov r8, r8)
20000022: 6804 ldr r4, [r0, #0]
20000024 <fact_loop>:
20000024: 3101 adds r1, #1
20000026: 434b muls r3, r1
20000028: 4291 cmp r1, r2
2000002a: d4fb bmi.n 20000024 <fact_loop>
2000002c: 6805 ldr r5, [r0, #0]
2000002e: 1b60 subs r0, r4, r5
20000030: bc30 pop {r4, r5}
20000032: 4770 bx lr
15
20000018 <fact>:
20000018: b430 push {r4, r5}
2000001a: 2100 movs r1, #0
2000001c: 2203 movs r2, #3
2000001e: 2301 movs r3, #1
20000020: 46c0 nop ; (mov r8, r8)
20000022: 46c0 nop ; (mov r8, r8)
20000024: 6804 ldr r4, [r0, #0]
20000026 <fact_loop>:
20000026: 3101 adds r1, #1
20000028: 434b muls r3, r1
2000002a: 4291 cmp r1, r2
2000002c: d4fb bmi.n 20000026 <fact_loop>
2000002e: 6805 ldr r5, [r0, #0]
20000030: 1b60 subs r0, r4, r5
20000032: bc30 pop {r4, r5}
20000034: 4770 bx lr
20000036: 46c0 nop ; (mov r8, r8)
12
20000018 <fact>:
20000018: b430 push {r4, r5}
2000001a: 2100 movs r1, #0
2000001c: 2203 movs r2, #3
2000001e: 2301 movs r3, #1
20000020: 46c0 nop ; (mov r8, r8)
20000022: 46c0 nop ; (mov r8, r8)
20000024: 46c0 nop ; (mov r8, r8)
20000026: 6804 ldr r4, [r0, #0]
20000028 <fact_loop>:
20000028: 3101 adds r1, #1
2000002a: 434b muls r3, r1
2000002c: 4291 cmp r1, r2
2000002e: d4fb bmi.n 20000028 <fact_loop>
20000030: 6805 ldr r5, [r0, #0]
20000032: 1b60 subs r0, r4, r5
20000034: bc30 pop {r4, r5}
20000036: 4770 bx lr
55
20000018 <fact>:
20000018: b430 push {r4, r5}
2000001a: 2100 movs r1, #0
2000001c: 220b movs r2, #11
2000001e: 2301 movs r3, #1
20000020: 6804 ldr r4, [r0, #0]
20000022 <fact_loop>:
20000022: 3101 adds r1, #1
20000024: 434b muls r3, r1
20000026: 4291 cmp r1, r2
20000028: d4fb bmi.n 20000022 <fact_loop>
2000002a: 6805 ldr r5, [r0, #0]
2000002c: 1b60 subs r0, r4, r5
2000002e: bc30 pop {r4, r5}
20000030: 4770 bx lr
20000032: bf00 nop
42
20000018 <fact>:
20000018: b430 push {r4, r5}
2000001a: 2100 movs r1, #0
2000001c: 220b movs r2, #11
2000001e: 2301 movs r3, #1
20000020: 46c0 nop ; (mov r8, r8)
20000022: 6804 ldr r4, [r0, #0]
20000024 <fact_loop>:
20000024: 3101 adds r1, #1
20000026: 434b muls r3, r1
20000028: 4291 cmp r1, r2
2000002a: d4fb bmi.n 20000024 <fact_loop>
2000002c: 6805 ldr r5, [r0, #0]
2000002e: 1b60 subs r0, r4, r5
20000030: bc30 pop {r4, r5}
20000032: 4770 bx lr
41
20000018 <fact>:
20000018: b430 push {r4, r5}
2000001a: 210b movs r1, #11
2000001c: 2301 movs r3, #1
2000001e: 6804 ldr r4, [r0, #0]
20000020 <fact_loop>:
20000020: 434b muls r3, r1
20000022: 3901 subs r1, #1
20000024: d1fc bne.n 20000020 <fact_loop>
20000026: 6805 ldr r5, [r0, #0]
20000028: 1b60 subs r0, r4, r5
2000002a: bc30 pop {r4, r5}
2000002c: 4770 bx lr
2000002e: bf00 nop
42
20000018 <fact>:
20000018: b430 push {r4, r5}
2000001a: 210b movs r1, #11
2000001c: 2301 movs r3, #1
2000001e: 46c0 nop ; (mov r8, r8)
20000020: 6804 ldr r4, [r0, #0]
20000022 <fact_loop>:
20000022: 434b muls r3, r1
20000024: 3901 subs r1, #1
20000026: d1fc bne.n 20000022 <fact_loop>
20000028: 6805 ldr r5, [r0, #0]
2000002a: 1b60 subs r0, r4, r5
2000002c: bc30 pop {r4, r5}
2000002e: 4770 bx lr
41
20000018 <fact>:
20000018: b430 push {r4, r5}
2000001a: 210b movs r1, #11
2000001c: 2301 movs r3, #1
2000001e: 46c0 nop ; (mov r8, r8)
20000020: 46c0 nop ; (mov r8, r8)
20000022: 6804 ldr r4, [r0, #0]
20000024 <fact_loop>:
20000024: 434b muls r3, r1
20000026: 3901 subs r1, #1
20000028: d1fc bne.n 20000024 <fact_loop>
2000002a: 6805 ldr r5, [r0, #0]
2000002c: 1b60 subs r0, r4, r5
2000002e: bc30 pop {r4, r5}
20000030: 4770 bx lr
20000032: bf00 nop
FLASH ACR 0x30
2d
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 6804 ldr r4, [r0, #0]
08000028 <fact_loop>:
8000028: 434b muls r3, r1
800002a: 3901 subs r1, #1
800002c: d1fc bne.n 8000028 <fact_loop>
800002e: 6805 ldr r5, [r0, #0]
8000030: 1b60 subs r0, r4, r5
8000032: bc30 pop {r4, r5}
8000034: 4770 bx lr
2d
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 46c0 nop ; (mov r8, r8)
8000028: 6804 ldr r4, [r0, #0]
0800002a <fact_loop>:
800002a: 434b muls r3, r1
800002c: 3901 subs r1, #1
800002e: d1fc bne.n 800002a <fact_loop>
8000030: 6805 ldr r5, [r0, #0]
8000032: 1b60 subs r0, r4, r5
8000034: bc30 pop {r4, r5}
8000036: 4770 bx lr
FLASH_ACR 0x00
2d
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 46c0 nop ; (mov r8, r8)
8000028: 6804 ldr r4, [r0, #0]
0800002a <fact_loop>:
800002a: 434b muls r3, r1
800002c: 3901 subs r1, #1
800002e: d1fc bne.n 800002a <fact_loop>
8000030: 6805 ldr r5, [r0, #0]
8000032: 1b60 subs r0, r4, r5
8000034: bc30 pop {r4, r5}
8000036: 4770 bx lr
FLASH_ACR 0x02
5e
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 6804 ldr r4, [r0, #0]
08000028 <fact_loop>:
8000028: 434b muls r3, r1
800002a: 3901 subs r1, #1
800002c: d1fc bne.n 8000028 <fact_loop>
800002e: 6805 ldr r5, [r0, #0]
8000030: 1b60 subs r0, r4, r5
8000032: bc30 pop {r4, r5}
8000034: 4770 bx lr
5f
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 46c0 nop ; (mov r8, r8)
8000028: 6804 ldr r4, [r0, #0]
0800002a <fact_loop>:
800002a: 434b muls r3, r1
800002c: 3901 subs r1, #1
800002e: d1fc bne.n 800002a <fact_loop>
8000030: 6805 ldr r5, [r0, #0]
8000032: 1b60 subs r0, r4, r5
8000034: bc30 pop {r4, r5}
8000036: 4770 bx lr
FLASH_ACR 0x32
41
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 6804 ldr r4, [r0, #0]
08000028 <fact_loop>:
8000028: 434b muls r3, r1
800002a: 3901 subs r1, #1
800002c: d1fc bne.n 8000028 <fact_loop>
800002e: 6805 ldr r5, [r0, #0]
8000030: 1b60 subs r0, r4, r5
8000032: bc30 pop {r4, r5}
8000034: 4770 bx lr
41
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 46c0 nop ; (mov r8, r8)
8000028: 6804 ldr r4, [r0, #0]
0800002a <fact_loop>:
800002a: 434b muls r3, r1
800002c: 3901 subs r1, #1
800002e: d1fc bne.n 800002a <fact_loop>
8000030: 6805 ldr r5, [r0, #0]
8000032: 1b60 subs r0, r4, r5
8000034: bc30 pop {r4, r5}
8000036: 4770 bx lr
PUT32(FLASH_ACR,0x3A);
41
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 6804 ldr r4, [r0, #0]
08000028 <fact_loop>:
8000028: 434b muls r3, r1
800002a: 3901 subs r1, #1
800002c: d1fc bne.n 8000028 <fact_loop>
800002e: 6805 ldr r5, [r0, #0]
8000030: 1b60 subs r0, r4, r5
8000032: bc30 pop {r4, r5}
8000034: 4770 bx lr
...
41
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 46c0 nop ; (mov r8, r8)
8000028: 6804 ldr r4, [r0, #0]
0800002a <fact_loop>:
800002a: 434b muls r3, r1
800002c: 3901 subs r1, #1
800002e: d1fc bne.n 800002a <fact_loop>
8000030: 6805 ldr r5, [r0, #0]
8000032: 1b60 subs r0, r4, r5
8000034: bc30 pop {r4, r5}
8000036: 4770 bx lr
flash acr 0x32
4c
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 6804 ldr r4, [r0, #0]
08000028 <fact_loop>:
8000028: 46c0 nop ; (mov r8, r8)
800002a: 434b muls r3, r1
800002c: 3901 subs r1, #1
800002e: d1fb bne.n 8000028 <fact_loop>
8000030: 6805 ldr r5, [r0, #0]
8000032: 1b60 subs r0, r4, r5
8000034: bc30 pop {r4, r5}
8000036: 4770 bx lr
4c
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 46c0 nop ; (mov r8, r8)
8000028: 6804 ldr r4, [r0, #0]
0800002a <fact_loop>:
800002a: 46c0 nop ; (mov r8, r8)
800002c: 434b muls r3, r1
800002e: 3901 subs r1, #1
8000030: d1fb bne.n 800002a <fact_loop>
8000032: 6805 ldr r5, [r0, #0]
8000034: 1b60 subs r0, r4, r5
8000036: bc30 pop {r4, r5}
8000038: 4770 bx lr
flash acr 0x30
38
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 6804 ldr r4, [r0, #0]
08000028 <fact_loop>:
8000028: 46c0 nop ; (mov r8, r8)
800002a: 434b muls r3, r1
800002c: 3901 subs r1, #1
800002e: d1fb bne.n 8000028 <fact_loop>
8000030: 6805 ldr r5, [r0, #0]
8000032: 1b60 subs r0, r4, r5
8000034: bc30 pop {r4, r5}
8000036: 4770 bx lr
3b
0800002c <fact_loop>:
800002c: d002 beq.n 8000034 <fact_done>
800002e: 434b muls r3, r1
8000030: 3901 subs r1, #1
8000032: e7fb b.n 800002c <fact_loop>
08000034 <fact_done>:
8000034: 6805 ldr r5, [r0, #0]
8000036: 1b60 subs r0, r4, r5
8000038: bc30 pop {r4, r5}
800003a: 4770 bx lr
38
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 2100 movs r1, #0
8000024: 220b movs r2, #11
8000026: 2301 movs r3, #1
8000028: 6804 ldr r4, [r0, #0]
0800002a <fact_loop>:
800002a: 3101 adds r1, #1
800002c: 434b muls r3, r1
800002e: 4291 cmp r1, r2
8000030: d4fb bmi.n 800002a <fact_loop>
8000032: 6805 ldr r5, [r0, #0]
8000034: 1b60 subs r0, r4, r5
8000036: bc30 pop {r4, r5}
8000038: 4770 bx lr
38
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 2100 movs r1, #0
8000024: 220b movs r2, #11
8000026: 2301 movs r3, #1
8000028: 46c0 nop ; (mov r8, r8)
800002a: 6804 ldr r4, [r0, #0]
0800002c <fact_loop>:
800002c: 3101 adds r1, #1
800002e: 434b muls r3, r1
8000030: 4291 cmp r1, r2
8000032: d4fb bmi.n 800002c <fact_loop>
8000034: 6805 ldr r5, [r0, #0]
8000036: 1b60 subs r0, r4, r5
8000038: bc30 pop {r4, r5}
800003a: 4770 bx lr
2d
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 6804 ldr r4, [r0, #0]
08000028 <fact_loop>:
8000028: 434b muls r3, r1
800002a: 3901 subs r1, #1
800002c: d1fc bne.n 8000028 <fact_loop>
800002e: 6805 ldr r5, [r0, #0]
8000030: 1b60 subs r0, r4, r5
8000032: bc30 pop {r4, r5}
8000034: 4770 bx lr
Skip to here:
Note that I changed the number of loop iterations by raising the input value from 3 to 11.
With zero wait states on the flash and prefetch enabled, your loop:
38
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 2100 movs r1, #0
8000024: 220b movs r2, #11
8000026: 2301 movs r3, #1
8000028: 6804 ldr r4, [r0, #0]
0800002a <fact_loop>:
800002a: 3101 adds r1, #1
800002c: 434b muls r3, r1
800002e: 4291 cmp r1, r2
8000030: d4fb bmi.n 800002a <fact_loop>
8000032: 6805 ldr r5, [r0, #0]
8000034: 1b60 subs r0, r4, r5
8000036: bc30 pop {r4, r5}
8000038: 4770 bx lr
That means 0x38 systick clocks between the two ldr instructions. Alignment didn't affect this in flash.
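The measurement in these listings is just two reads of the timer with a subtract afterwards. A minimal C sketch of that pattern follows. The names `elapsed_ticks` and `time_call` are made up for illustration, and the 24-bit mask is my addition (the listings do a plain subtract); it assumes the timer is SysTick, which is a 24-bit down counter, reloading from 0x00FFFFFF and wrapping at most once between the two reads.

```c
#include <stdint.h>

/* SysTick is a 24-bit down counter, so only the low 24 bits count. */
#define SYSTICK_MASK 0x00FFFFFFu

/* Ticks elapsed between two samples of a down counter.
   Assumption: reload value 0x00FFFFFF and at most one wrap in between. */
uint32_t elapsed_ticks(uint32_t first, uint32_t second)
{
    /* first - second because the counter counts down */
    return (first - second) & SYSTICK_MASK;
}

/* Hypothetical harness: sample, run the code under test, sample again.
   timer would point at the SysTick current-value register. */
uint32_t time_call(volatile uint32_t *timer, void (*code_under_test)(void))
{
    uint32_t first = *timer;
    code_under_test();
    uint32_t second = *timer;
    return elapsed_ticks(first, second);
}
```

The two reads themselves are part of what gets measured, so small constant offsets in the result are to be expected.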
If you use Peter's version, or a variation on it (bne makes more sense to me than branching on plus/minus, YMMV):
2d
08000020 <fact>:
8000020: b430 push {r4, r5}
8000022: 210b movs r1, #11
8000024: 2301 movs r3, #1
8000026: 6804 ldr r4, [r0, #0]
08000028 <fact_loop>:
8000028: 434b muls r3, r1
800002a: 3901 subs r1, #1
800002c: d1fc bne.n 8000028 <fact_loop>
800002e: 6805 ldr r5, [r0, #0]
8000030: 1b60 subs r0, r4, r5
8000032: bc30 pop {r4, r5}
8000034: 4770 bx lr
Alignment didn't affect this loop either. It is fewer instructions, as well as faster.
So, going by another answer and the documentation, mul and sub take one clock each and the branch, when taken, takes 2 clocks, so that is 4 clocks per pass; times 11 is 44 clocks, or 0x2C. The last pass falls through the branch, which by the same counting is one clock shorter, so 43 clocks or 0x2B against the 0x2D measured. No doubt the two ldrs have a cost; perhaps that is where the additional two clocks come from. Or it could be how the prefetch unit works, or something else.
Your loop is 5 clocks per pass, so 55 clocks or 0x37 for 11 passes (54, or 0x36, with the shorter last pass), against 0x38 measured: the same extra two clocks.
So I overcomplicated some of these experiments. The prefetch unit from ST, running at zero wait states, allowed us to see the performance shown in ARM's documentation. Counting down instead of up saved an instruction in the loop, which makes it both smaller and faster, which is what you were asking for.
Your 5 clocks per loop for 3 factorial means 14 clocks (5+5+4). Your 22 clocks (check how you measured it; very often the ruler is the problem with benchmarking, not the code) leave 8 clocks somewhere else, minus the 3 for the setup instructions if you were counting those. Whatever ruler you are using, try the count-down solution and see how that compares on your system. It saves a couple of instructions, one inside and one outside the loop.
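The clock counting above can be written down as a small model. This is only a sketch of the arithmetic, not a statement about the core: `loop_clocks` is a made-up name, and the costs (one clock for each of mul, add/sub and cmp, 2 clocks for a taken branch, 1 for an untaken one) are the figures assumed in the discussion above.

```c
/* Clocks for a counted loop that ends in a conditional branch.
   body_clocks: clocks for the instructions before the branch in one pass.
   Assumed costs, taken from the figures quoted above and not measured here:
   taken branch = 2 clocks, untaken branch = 1 clock. */
unsigned int loop_clocks(unsigned int body_clocks, unsigned int passes)
{
    if (passes == 0)
        return 0;
    /* every pass except the last takes the branch */
    return (passes - 1) * (body_clocks + 2) + (body_clocks + 1);
}
```

With a body of 3 clocks (add, mul, cmp) and 3 passes this gives the 5+5+4 = 14 above. For 11 passes it gives 54 for the count-up loop and, with a body of 2 clocks (mul, sub), 43 for the count-down loop, each two short of the 0x38 and 0x2D measured.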
------- EDIT
I am kinda surprised that gcc didn't optimize this into a count-down loop. I only tried one version; maybe an older 3.x or 4.x might have. Also, if you build for cortex-m3 it uses a thumb2 instruction rather than the thumb instruction.
unsigned int fact ( unsigned int x )
{
    unsigned int a;
    unsigned int rb;
    a=1;
    for(rb=1;rb<=x;rb++)
    {
        a*=rb;
    }
    return(a);
}
unsigned int fact2 ( unsigned int x )
{
    unsigned int a;
    a=1;
    while(x)
    {
        a*=x--;
    }
    return(a);
}
Yes I could optimize the C code further....
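One possible further tweak, as a sketch only: the count-down loop still multiplies by 1 on its last pass, so it can stop at 1 instead of 0. `fact3` is a name I made up for this variant, and `fact2` is repeated here just so the block stands alone. Whether gcc turns this into a tighter loop is something to check in the disassembly; I have not measured it.

```c
#include <stdint.h>

/* Count-down loop, same logic as fact2 above. */
uint32_t fact2(uint32_t x)
{
    uint32_t a = 1;
    while (x)
    {
        a *= x--;
    }
    return a;
}

/* Hypothetical variant: stop at 1, since multiplying by 1 changes nothing.
   Saves one pass through the loop for any x >= 1. */
uint32_t fact3(uint32_t x)
{
    uint32_t a = 1;
    while (x > 1)
    {
        a *= x--;
    }
    return a;
}
```

Both return 1 for inputs of 0 and 1, and both overflow 32 bits above 12!.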
Disassembly of section .text:
00000000 <fact>:
0: b140 cbz r0, 14 <fact+0x14>
2: 2301 movs r3, #1
4: 461a mov r2, r3
6: fb03 f202 mul.w r2, r3, r2
a: 3301 adds r3, #1
c: 4298 cmp r0, r3
e: d2fa bcs.n 6 <fact+0x6>
10: 4610 mov r0, r2
12: 4770 bx lr
14: 2201 movs r2, #1
16: 4610 mov r0, r2
18: 4770 bx lr
1a: bf00 nop
0000001c <fact2>:
1c: 4603 mov r3, r0
1e: 2001 movs r0, #1
20: b123 cbz r3, 2c <fact2+0x10>
22: fb03 f000 mul.w r0, r3, r0
26: 3b01 subs r3, #1
28: d1fb bne.n 22 <fact2+0x6>
2a: 4770 bx lr
2c: 4770 bx lr
2e: bf00 nop
I forgot about cbz. I don't use thumb2 unless I have to; it is not as universally portable as the classic thumb instructions...
A more portable version:
Disassembly of section .text:
00000000 <fact>:
0: 2800 cmp r0, #0
2: d007 beq.n 14 <fact+0x14>
4: 2301 movs r3, #1
6: 2201 movs r2, #1
8: 435a muls r2, r3
a: 3301 adds r3, #1
c: 4298 cmp r0, r3
e: d2fb bcs.n 8 <fact+0x8>
10: 0010 movs r0, r2
12: 4770 bx lr
14: 2201 movs r2, #1
16: e7fb b.n 10 <fact+0x10>
00000018 <fact2>:
18: 0003 movs r3, r0
1a: 2001 movs r0, #1
1c: 2b00 cmp r3, #0
1e: d003 beq.n 28 <fact2+0x10>
20: 4358 muls r0, r3
22: 3b01 subs r3, #1
24: 2b00 cmp r3, #0
26: d1fb bne.n 20 <fact2+0x8>
28: 4770 bx lr
2a: 46c0 nop ; (mov r8, r8)
Hmmmm:
20: 4358 muls r0, r3
22: 3b01 subs r3, #1
24: 2b00 cmp r3, #0
26: d1fb bne.n 20 <fact2+0x8>
Wow: subs already sets the flags that bne tests, so the cmp r3, #0 is redundant.
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Something like this could be used (assuming 32-bit registers, where 12! is the largest value that fits). Peter Cordes is more familiar with ARM than I am (it's been 10 years since I worked with ARM), and his code-based answer is good. The table lookup I show below should be the fastest. It requires more space, but not a lot, since the range is only 0! to 12! for 32-bit unsigned integers.
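        mov     r2,#3               ;r2 = n
;       ...
        mov     r3,#1               ;r3 = result, already correct for n < 2
        subs    r2,r2,#2            ;r2 = n-2; subs (not sub) so the flags are set for blo
        blo     factx               ;n < 2, so n! = 1
        mov     r1,#(fact11-fact12) ;size in bytes of one mul/add step
        mul     r1,r2,r1            ; or better, use a left-shift by 2 or 3 and an assemble time static assert that fact11-fact12 == 4 or 8
        adr     r2,fact2
        sub     r2,r2,r1            ;r2 = entry point, n-2 steps before fact2
        mov     r1,#2               ;first multiplicand
        bx      r2                  ;b needs a label, bx takes a register (on Thumb-only cores such as Cortex-M3, bit 0 of r2 must be set)
fact12  mul     r3,r1,r3
        add     r1,r1,#1
fact11  mul     r3,r1,r3
        add     r1,r1,#1
        mul     r3,r1,r3
        add     r1,r1,#1
        mul     r3,r1,r3
        add     r1,r1,#1
        mul     r3,r1,r3
        add     r1,r1,#1
        mul     r3,r1,r3
        add     r1,r1,#1
        mul     r3,r1,r3
        add     r1,r1,#1
        mul     r3,r1,r3
        add     r1,r1,#1
        mul     r3,r1,r3
        add     r1,r1,#1
        mul     r3,r1,r3
        add     r1,r1,#1
fact2   mul     r3,r1,r3
factx   ...                         ;r3 = n!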
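For comparison, the same idea (jump into the middle of a fully unrolled run of multiplies) can be sketched in C with a switch that falls through. This is not a translation of the assembly above: it multiplies downwards from n rather than upwards from 2, and `fact_unrolled` is a name made up for illustration. It assumes n is in the range 0 to 12; anything larger returns 1 here, which is an arbitrary choice.

```c
#include <stdint.h>

/* Fully unrolled factorial: the switch picks the entry point and
   execution falls through the remaining multiplies.
   Assumption: n <= 12, the largest factorial that fits in 32 bits. */
uint32_t fact_unrolled(uint32_t n)
{
    uint32_t r = 1;
    switch (n)
    {
    case 12: r *= 12; /* fall through */
    case 11: r *= 11; /* fall through */
    case 10: r *= 10; /* fall through */
    case 9:  r *= 9;  /* fall through */
    case 8:  r *= 8;  /* fall through */
    case 7:  r *= 7;  /* fall through */
    case 6:  r *= 6;  /* fall through */
    case 5:  r *= 5;  /* fall through */
    case 4:  r *= 4;  /* fall through */
    case 3:  r *= 3;  /* fall through */
    case 2:  r *= 2;  /* fall through */
    default: break;   /* 0, 1 and out-of-range inputs leave r = 1 */
    }
    return r;
}
```

A compiler is free to implement the switch as a jump table or as a chain of compares, so the generated code may or may not resemble the computed branch above.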
or simpler still, a table lookup:
tblfac dcd 1,1,2,6,24,120,720,5040
dcd 40320,362880,3628800,39916800
dcd 479001600
; ...
mov r2,#3 ;r2 = n
adr r3,tblfac
ldr r3,[r3, r2, lsl #2] ;r3 = n!
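The same table lookup in C, as a minimal sketch. The names `fact_table` and `fact_lookup` are made up, and returning 0 for an out-of-range input is my assumption; the assembly above does no range check at all.

```c
#include <stdint.h>

/* 0! through 12!, the same values as tblfac above. */
static const uint32_t fact_table[13] = {
    1u, 1u, 2u, 6u, 24u, 120u, 720u, 5040u,
    40320u, 362880u, 3628800u, 39916800u,
    479001600u
};

/* Returns n! for n in 0..12.
   Assumption: 0 is returned for larger n, since 13! does not fit in 32 bits. */
uint32_t fact_lookup(uint32_t n)
{
    return (n <= 12u) ? fact_table[n] : 0u;
}
```

The table costs 13 words of read-only data (52 bytes), which is the extra space mentioned above.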
User contributions licensed under CC BY-SA 3.0