LPC2148 vs AT91SAM7 vs STM32, another battle at 48MHz

Some months ago I posted a speed comparison between the LPC2000 and AT91SAM CPUs: viewtopic.php?p=43398

I think it is time to update the benchmarks, both because I can now provide the scores for the new STM32 and because there is a new GCC compiler, now at version 4.3.0.

Test setup

  • Compiler: GCC 4.3.0, experimental WinARM build 20080331.
  • ChibiOS/RT version 0.6.4 stable, all subsystems included.
  • All CPUs clocked at 48MHz with the manufacturer recommended optimal settings.
  • All tests were performed without the interworking switch specified to the compiler; interworking code would be both slower and larger.
  • The STM32 and LPC2148 were tested using the -falign-functions=16 compile switch because their flash prefetch mechanisms require it; without the switch the scores would not be reliable. The side effect is slightly larger generated code.
  • The CPUs were tested in ARM mode using speed settings and in THUMB mode using code size reduction settings. The STM32 is, of course, tested in THUMB2 mode (it has no ARM mode).
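
For reference, the context switch figures below come from a message passing test: one thread sends a message and waits for the answer while a second thread receives it and replies, so each message accounts for two context switches. A minimal sketch of the pattern, written with ChibiOS/RT-style calls (the names follow later ChibiOS releases and are only meant to illustrate the idea, the actual 0.6.4 benchmark code differs), could look like this:

      #include "ch.h"   /* ChibiOS/RT kernel header */

      static WORKING_AREA(waEcho, 128);

      /* Echo thread: waits for a message and immediately replies with it. */
      static msg_t echo_thread(void *arg) {
        (void)arg;
        while (TRUE) {
          Thread *tp = chMsgWait();         /* block until a message arrives */
          chMsgRelease(tp, chMsgGet(tp));   /* reply, waking up the sender   */
        }
        return 0;
      }

      /* Sender side: every chMsgSend() costs two context switches, one to
         reach the echo thread and one to get back here with the answer. */
      static void message_benchmark(void) {
        Thread *echo = chThdCreateStatic(waEcho, sizeof(waEcho),
                                         NORMALPRIO + 1, echo_thread, NULL);
        for (msg_t i = 0; i < 100000; i++)
          chMsgSend(echo, i);
      }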

    The benchmarks

    AT91SAM7X256, ARM mode (-O2 -fomit-frame-pointer -mabi=apcs-gnu)

    Kernel size: 6,028 bytes

    *** Kernel Benchmark, context switch test #1 (optimal):
    Messages throughput = 113365 msgs/S, 226730 ctxswc/S
    *** Kernel Benchmark, context switch test #2 (no threads in ready list):
    Messages throughput = 89245 msgs/S, 178490 ctxswc/S
    *** Kernel Benchmark, context switch test #3 (04 threads in ready list):
    Messages throughput = 89245 msgs/S, 178490 ctxswc/S
    *** Kernel Benchmark, threads creation/termination:
    Threads throughput = 72051 threads/S
    *** Kernel Benchmark, I/O Queues throughput:
    Queues throughput = 240596 bytes/S
    

    AT91SAM7X256, THUMB mode (-Os -fomit-frame-pointer -mabi=apcs-gnu )

    Kernel size: 3,808 bytes

    *** Kernel Benchmark, context switch test #1 (optimal):
    Messages throughput = 96647 msgs/S, 193294 ctxswc/S
    *** Kernel Benchmark, context switch test #2 (no threads in ready list):
    Messages throughput = 83775 msgs/S, 167550 ctxswc/S
    *** Kernel Benchmark, context switch test #3 (04 threads in ready list):
    Messages throughput = 83775 msgs/S, 167550 ctxswc/S
    *** Kernel Benchmark, threads creation/termination:
    Threads throughput = 72268 threads/S
    *** Kernel Benchmark, I/O Queues throughput:
    Queues throughput = 242252 bytes/S
    

    LPC2148, ARM mode (-O2 -fomit-frame-pointer -mabi=apcs-gnu -falign-functions=16)

    Kernel size: 6,512 bytes

    *** Kernel Benchmark, context switch test #1 (optimal):
    Messages throughput = 142327 msgs/S, 284654 ctxswc/S
    *** Kernel Benchmark, context switch test #2 (no threads in ready list):
    Messages throughput = 110956 msgs/S, 221912 ctxswc/S
    *** Kernel Benchmark, context switch test #3 (04 threads in ready list):
    Messages throughput = 110955 msgs/S, 221910 ctxswc/S
    *** Kernel Benchmark, threads creation/termination:
    Threads throughput = 93770 threads/S
    *** Kernel Benchmark, I/O Queues throughput:
    Queues throughput = 343752 bytes/S
    

    LPC2148, THUMB mode (-Os -fomit-frame-pointer -mabi=apcs-gnu -falign-functions=16)

    Kernel size: 4,208 bytes

    *** Kernel Benchmark, context switch test #1 (optimal):
    Messages throughput = 98118 msgs/S, 196236 ctxswc/S
    *** Kernel Benchmark, context switch test #2 (no threads in ready list):
    Messages throughput = 82958 msgs/S, 165916 ctxswc/S
    *** Kernel Benchmark, context switch test #3 (04 threads in ready list):
    Messages throughput = 82956 msgs/S, 165912 ctxswc/S
    *** Kernel Benchmark, threads creation/termination:
    Threads throughput = 73291 threads/S
    *** Kernel Benchmark, I/O Queues throughput:
    Queues throughput = 241820 bytes/S
    

    STM32, THUMB2 mode (-O2 -fomit-frame-pointer -mabi=apcs-gnu -falign-functions=16)

    Kernel size: 4,576 bytes

    *** Kernel Benchmark, context switch test #1 (optimal):
    Messages throughput = 157965 msgs/S, 315930 ctxswc/S
    *** Kernel Benchmark, context switch test #2 (no threads in ready list):
    Messages throughput = 132211 msgs/S, 264422 ctxswc/S
    *** Kernel Benchmark, context switch test #3 (04 threads in ready list):
    Messages throughput = 132211 msgs/S, 264422 ctxswc/S
    *** Kernel Benchmark, threads creation/termination:
    Threads throughput = 113976 threads/S
    *** Kernel Benchmark, I/O Queues throughput:
    Queues throughput = 377696 bytes/S
    

    STM32, THUMB2 mode (-Os -fomit-frame-pointer -mabi=apcs-gnu -falign-functions=16)

    Kernel size: 4,400 bytes

    *** Kernel Benchmark, context switch test #1 (optimal):
    Messages throughput = 148646 msgs/S, 297292 ctxswc/S
    *** Kernel Benchmark, context switch test #2 (no threads in ready list):
    Messages throughput = 130769 msgs/S, 261538 ctxswc/S
    *** Kernel Benchmark, context switch test #3 (04 threads in ready list):
    Messages throughput = 130769 msgs/S, 261538 ctxswc/S
    *** Kernel Benchmark, threads creation/termination:
    Threads throughput = 108285 threads/S
    *** Kernel Benchmark, I/O Queues throughput:
    Queues throughput = 368876 bytes/S
    

    The STM32 clearly outperforms the other two microcontrollers; the new Cortex-M3 core is a clear winner compared to the ARM7.

    The LPC2148 clearly outperforms the AT91SAM7X256 in ARM mode, but in THUMB mode there is not much difference.

    The STM32 code size efficiency is very high: it is close to classic THUMB mode while delivering much better performance.

    I tried to make the comparison as fair as possible: all the CPUs are clocked at the same speed (48MHz) even if the upper limits are very different (60MHz for the LPC2148, 55MHz for the AT91SAM7X256, 72MHz for the STM32). The speed difference would be even greater than the reported scores suggest when using the chips at their top speeds.

    Giovanni

    In GCC 4.x, if you specify -Os, the -falign-functions option will be ignored.

    Cheers

    Spen

    Hello, I tested it again: using the -Os and -falign-functions=16 options, the compiler generates the following directives as the function preamble:

          .align   2
          .p2align 4,,15
          .global  main
          .thumb
          .thumb_func
          .type    main, %function
      main:
    

    It appears that the -falign-functions=16 option is accepted and the “.p2align 4,,15” directive is inserted (align to a 16-byte boundary, padding with at most 15 bytes).

    I am using GCC 4.3.0. What version are you using? Maybe it was fixed at some point in the 4.x branch.

    regards,

    Giovanni

    I just read it in the GCC docs:

    http://gcc.gnu.org/onlinedocs/gcc-4.3.0 … ze-Options

    -Os

    Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.

    -Os disables the following optimization flags:

    -falign-functions -falign-jumps -falign-loops

    -falign-labels -freorder-blocks -freorder-blocks-and-partition

    -fprefetch-loop-arrays -ftree-vect-loop-version

    Cheers

    Spen

    Strange, it must be a bug then. It would probably be a good idea to report it; I will run some more tests first.
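
    In the meantime, if -Os really drops the global alignment, a possible workaround would be forcing the alignment only on the speed critical functions; this is just a sketch using the standard GCC function attribute, with a placeholder function name:

      /* The aligned attribute requests a minimum 16-byte alignment for this
         function regardless of the global -falign-functions setting.
         hot_path_function is only a placeholder name for the example. */
      __attribute__((aligned(16)))
      void hot_path_function(void) {
        /* ... time critical code ... */
      }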

    regards,

    Giovanni

    gdisirio, those are cool figures which are hard to obtain. The deeper I dig into the Cortex-M3, the more I realize it should rather be called ARM-II and not a mere extension of the standard ARM7/ARM9. Would it be worthwhile to tune the exception model and/or the atomic operations using the attractive new instruction set? You could try it.

    Toru Nishimura / ALKYL Technology

    I agree, it is a very efficient architecture and the new exception model fixes the greatest problem I had with the classic ARM architecture: being unable to implement interrupt nesting efficiently.

    The main problem I see with the Cortex-M3 and its exception model is that it is quite complicated to learn; the NVIC has a lot of functionality that you cannot simply ignore. Luckily, for most users, the problem is handled at the OS level.

    I believe that the scores from the Cortex-M3 can be further improved with better usage of the exception model; it is something I am looking into.
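
    Just to give an idea, interrupt nesting on the Cortex-M3 requires nothing more than assigning priorities, the NVIC does the preemption and the stacking in hardware. A minimal sketch, assuming a CMSIS-style API and the usual STM32 device header (the same result can be obtained by programming the NVIC registers directly):

      #include "stm32f10x.h"   /* assumed device header defining the IRQn values */

      void nvic_setup(void) {
        NVIC_SetPriority(TIM2_IRQn, 2);     /* lower urgency                   */
        NVIC_SetPriority(USART1_IRQn, 1);   /* numerically lower = more urgent */
        NVIC_EnableIRQ(TIM2_IRQn);
        NVIC_EnableIRQ(USART1_IRQn);
        /* If USART1 fires while the TIM2 handler is running, the NVIC preempts
           it automatically: no special prologue code is needed in the handlers,
           unlike the ARM7 where nesting must be coded by hand in the IRQ
           entry/exit sequences. */
      }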

    regards,

    Giovanni

    I believe that the scores from the Cortex-M3 can be further improved with better usage of the exception model; it is something I am looking into.

    The following PDFs might be helpful for your venture.

    http://www.luminarymicro.com/products/white_papers.html

    Choose “Transitioning to Cortex-M3 based MCUs” or the MPR article reprint “32 Bits for a Buck.” They should be good introductions for understanding the UM and/or RM.

    Toru Nishimura / ALKYL Technology