
Luminary Micro Forums


paulkimelman

Junior Boarder

2008/05/19 22:43

Re:hand optimized FFT/IFFT for Cortex-M3 attached

ProARM wrote:
Also, I don't want to go off-topic, but I would like to ask you about executing code in RAM. In another post on this forum, a user talks about programs running 40% slower when in RAM.
In the STM32 forum they said: "if you run code from RAM, we will lose the Harvard architecture and you will proceed with a von Neumann one, and the System Bus will be stalled more times"
I'm worried about this because my future project will load code from an SD/MMC card and run it from RAM.
Of course, the program will have to be compiled in a special way (blocks of code), but the advantage would be its unlimited size.
Do you think the old classic ARM7TDMI would be a better choice for this kind of purpose?


Cortex-M3 is still faster than an ARM7 in general, even when running from RAM. But it is not optimal, due to a combination of von Neumann issues (loads/stores competing with fetches) and fetches paying a registration penalty. Since a fair bit of code is 16-bit, it should be impacted less than one might expect. Numbers from 10% to 40% worse are typical. So, if you want to load code into RAM from an SD card, you should consider putting commonly used and highly iterative (looping) functions into Flash. This will cut down on the impact quite a bit. Also, it may be worth optimizing for size, since this will tend to mean more 16-bit instructions.
Regards, Paul


ProARM

Senior Boarder

2008/05/20 02:36

Re:hand optimized FFT/IFFT for Cortex-M3 attached

Ok,
I understand that no processor can work miracles and run everything in 1 cycle all the time... but Cortex has a sophisticated, smart architecture and will attempt to perform as well as possible...

Many thanks again and greetings from Italy! :)

Post edited by: ProARM, at: 2008/05/20 02:38


imellen

Fresh Boarder

2008/05/21 11:30

Re:hand optimized FFT/IFFT for Cortex-M3 attached

Hi everybody, quite a discussion on this post since I last checked it ;-)
To answer your questions:
- The typical benchmark metric is the number of CPU cycles required to execute the routine under test. It can be translated into execution time if you know the cycle duration (e.g. time = 20 ns * CPU cycles for a 50 MHz CPU clock). Another important metric is power consumption per routine execution, e.g. how many 1024-point FFTs can I perform per second if the processor may draw 10 mA. This is what usually matters for battery-operated devices.
- dsPIC vs ARM7 – dsPIC is very strong in DSP processing, ARM7 is not. But keep in mind that dsPIC goes up to 30 MHz, while ARM7 chips go much higher. Furthermore, I do not know how well the ARM7 FFT was optimized.
- "Cortex-M3: coefficients in Flash or RAM, same speed" – for latency 0 and a 64-bit flash bus, the flash can provide one 32-bit (or two 16-bit) instruction and a 32-bit coefficient in one cycle. So there is no penalty if coefficients are read from flash.
- 64-bit prefetch: this is the same mechanism that the ARM7 parts from NXP have. ARM7 chips are usually equipped with 128-bit prefetch, since ARM instructions are 32 bits wide. To save silicon, Cortex-m3 uses only 64 bit prefetch, assuming that most of the time there are 16-bit instructions in the execution stream. (This is not always the case. For flash latency 2 there is a penalty: 64-bit prefetch can provide 64 bits every 3 cycles, which is only enough for 16/16/32 and 16/16/16 bit instruction combinations.) paulkimelman already explained this in more detail.
Anyway, this does not apply to Stellaris, since its flash latency is 0.
- "Cortex-M3 FFT benchmarks in CPU cycles based on real hardware measurements (STM32 used, Stellaris should be the same)" – since Stellaris has flash latency 0, its benchmarks should be the same as the STM32 latency 0 benchmarks.
- To reproduce benchmark results use GCC 4.2.1 (CodeSourcery 2007q3-53 release). Older versions performed different optimization, so they are slower.
- Hand optimization for latency 0 is easier, since you only have to worry about branch speculation and aligning loop jump targets to 64-bit boundaries. For higher latencies, trade-offs between 16-bit and 32-bit instructions have to be considered (32-bit instructions are more likely to get an extra wait cycle if the pre-fetch buffer does not contain the whole instruction, and flash data accesses collide more often).
- RAM execution – faster when flash latency > 0 and there is no RAM data access. Otherwise it is usually slower, since the advantages of the Harvard architecture (simultaneous data and instruction memory access) are lost. A data access stalls the next instruction fetch.
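
The cycles-to-time conversion from the first point can be sketched in a few lines of C. This is only a minimal illustration: the function names are made up for this post, and the 100000-cycle figure in the usage note is a placeholder, not a measured FFT result.

```c
#include <stdint.h>

/* Convert a cycle count into nanoseconds for a given core clock.
 * Integer math only; cycles * 1e9 must fit in 64 bits.
 * Returns 0 if clock_hz is 0 (invalid input). */
uint64_t cycles_to_ns(uint64_t cycles, uint32_t clock_hz)
{
    if (clock_hz == 0u) {
        return 0u;
    }
    return (cycles * 1000000000ull) / clock_hz;
}

/* How many complete runs of a routine fit into one second.
 * Returns 0 if cycles_per_run is 0 (invalid input). */
uint32_t runs_per_second(uint32_t cycles_per_run, uint32_t clock_hz)
{
    if (cycles_per_run == 0u) {
        return 0u;
    }
    return clock_hz / cycles_per_run;
}
```

For example, cycles_to_ns(1, 50000000) gives 20, matching the 20 ns per cycle at 50 MHz quoted above, and a routine that hypothetically took 100000 cycles would run 500 times per second at that clock.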

Hope this helps. If somebody runs this FFT routine and has benchmark results for Stellaris hardware, please let me know.
Ivan


ProARM

Senior Boarder

2008/05/22 14:47

Re:hand optimized FFT/IFFT for Cortex-M3 attached

Hello to both of you,
it's a pleasure to talk with people who really understand the advanced details of ARM processors.
ST, Luminary and others started the bad habit of saying "with ARM you don't need to program in M.L. any more"...
Where can I find this information?
When I try to understand the speed of instructions from the ARM documentation, they say "ask the manufacturer, it depends on the external bus"... and when I search in the manufacturer's book, they say "ask ARM, they made the CPU!" ;)

- - - - - - - - - - - - - - - - - - - -
Anyway, some of your explanations are still not clear to me:

->To save silicon, Cortex-m3 uses only 64 bit prefetch.

uh?
Paulkimelman said the Cortex tries to read 3 words ahead.
So I was thinking the "prefetch" was 3 words in size, but the Cortex can read only 1 word at a time (the bus is 32 bits wide), so it takes 3 cycles to fill the prefetch.

To avoid confusion, let's call "prefetch" the prefetch architecture inside the ARM core. So every ARM Cortex has exactly the same "prefetch". And so does every ARM7TDMI...

Let's call "memory access size" the width of the bus to/from the memory. The ability to read more data at once than the CPU bus width forms a "buffer" that reduces the "stall" situations...
I believe this architecture depends on the manufacturer, because it must be customized for the memory they have fitted in the chip...

For example (correct me if I'm wrong):
LPC has 128-bit memory access (one for FLASH and one for RAM) and performs well even though the flash has 2 wait states.
STM32 has 64-bit memory access (only for flash, because RAM is 0 wait states).
STR9 (ARM9 from ST) has bad performance because of the 32-bit bus size (they made no larger bus because the memory lies on a separate die).
The ATMEL ARM7TDMI has FLASH with only 1 wait state but just a "32bit access" (no buffer). This leads to lower performance than LPC, despite the faster flash, and they often need to load programs into RAM to gain full speed...
Also, Luminary has a simple "32bit access" but with 0 wait states. The clock is limited to 50 MHz, but maybe this architecture will show its advantages when there is a lot of "jumping" code, where the buffered solutions tend to fail badly...

So, when you say "Cortex-m3 uses only 64 bit prefetch", do you mean "STM32 uses only 64 bit prefetch"?

- - - - - - - - - - - - - - - - - - - -
- To reproduce benchmark results use GCC 4.2.1 (CodeSourcery 2007q3-53 release). Older versions performed different optimization, so they are slower.

Another point I don't understand: your code is fully M.L., so how can it depend on the compiler's optimizations?

- - - - - - - - - - - - - - - - - - - -
- If somebody runs this FFT routine and has benchmark results for Stellaris hardware.
Sure, I would do it, but at the moment I'm not able to program anything in M.L. with my LM3S1968 proto board, for the reasons below.
-> No examples or documentation for M.L. in the Luminary pack. Also, there are some rumors that I could ruin the processor forever if I do something wrong via M.L.
Since I'm not able to replace these ultra-small chips on the board, of course, I'm terrified... ;)


paulkimelman

Junior Boarder

2008/05/22 19:58

Re:hand optimized FFT/IFFT for Cortex-M3 attached

ProARM wrote:
Hello to both of you,
Where can I find this information?


Look in the ARM Cortex-M3 TRM, such as r1p0. You can find this at the bottom of http://www.luminarymicro.com/home/data_sheets.html

Look at section 18. This table is very accurate about instruction timings and what impacts them. When using slow flash (such as STM32), you have to add extra cycles for branches, literal loads, and other cases. When using Stellaris parts, you only add for the branches to unaligned 32-bit instructions, which are rare.


->To save silicon, Cortex-m3 uses only 64 bit prefetch.

No, this is incorrect. He means STM32 uses a 64-bit pre-fetcher to try to compensate for its slow flash. The terms are:
1. Cortex-M3 uses a fetch buffer. This is 3 words and is loaded 32 bits at a time. Instructions are read from it 16 or 32 bits at a time, as needed. It keeps itself full when possible. It is flushed on a branch.
2. Cortex-M3 supports a pre-fetcher for slower flash. It supports this by emitting branch information a cycle in advance (including conditional and hard branches).
3. A pre-fetcher is supplied by the silicon vendor if needed. Stellaris does not use one, since it is not needed. More info on pre-fetchers below.

A prefetch buffer model works by having wider flash access (e.g. 64-bit or 128-bit) vs. the normal 32. It parallel-loads the 64/128 bits into the buffer, and then the core's fetch buffer reads 32 bits from it at a time. The reason it works is that a branch will have to wait the full amount (1 or 2 wait states) to get its 32, but the next 32 will be 0 wait state (except when the branch is not aligned to the pre-fetch buffer). Further, the pre-fetcher will read ahead, so the next buffer will be ready when the core needs it; the core emits signals to warn the buffer not to read ahead when a branch is coming. As long as you have inline code, it works well. The STM32 one does not work well, since the buffer width has to grow with each wait state. So, a 1 wait state flash does well with a 64-bit buffer. A 2 wait state flash needs 128-bit buffers. Note that a 2 wait state flash will be quite a bit slower due to the impact on literal loads (loads via PC of ROM constants such as addresses, big constants, and const arrays and strings).
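
The width-versus-wait-state relationship above can be illustrated with a toy cycle model in C. This is a rough sketch under loud assumptions, not a description of any real pre-fetcher: it only covers straight-line (branch-free) code, assumes the core wants one 32-bit word per cycle, assumes each flash access takes 1 + wait_states cycles, and assumes the pre-fetcher reads ahead back-to-back with unlimited buffering. The function name is hypothetical.

```c
#include <stdint.h>

/* Toy model: cycles needed to deliver n_words consecutive 32-bit
 * words to the core from a flash that is line_bits wide and needs
 * wait_states extra cycles per access.
 * line_bits must be a multiple of 32 and at least 32;
 * returns 0 for invalid widths. */
uint32_t straight_line_fetch_cycles(uint32_t n_words,
                                    uint32_t line_bits,
                                    uint32_t wait_states)
{
    uint32_t words_per_line = line_bits / 32u;
    uint32_t access_cycles = 1u + wait_states;
    uint32_t t = 0u; /* cycles elapsed so far */

    if (words_per_line == 0u || (line_bits % 32u) != 0u) {
        return 0u;
    }

    for (uint32_t i = 0u; i < n_words; i++) {
        /* Cycle in which the flash line holding word i arrives,
         * with lines read one after another from cycle 0. */
        uint32_t ready = ((i / words_per_line) + 1u) * access_cycles;
        /* The core takes one word per cycle, but never before
         * the line holding that word has arrived. */
        t = (t + 1u > ready) ? (t + 1u) : ready;
    }
    return t;
}
```

In this model, 8 words take 8 cycles with a 32-bit, 0 wait state flash, 9 cycles with 64 bits at 1 wait state, 10 cycles with 128 bits at 2 wait states, but 13 cycles with 64 bits at 2 wait states. Only the last combination stalls repeatedly, which is the "64 bits every 3 cycles" case.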

As to the other vendors/processors:
1. ARM7 uses corebus and has no emitted side bands. So, NXP cannot do very well with the MAM. They use a wider buffer to try to compensate as much as possible. But, the ARM7 does not have a fetch buffer. The NXP model was an attempt to optimize ARM code vs. Thumb (since it tends to be inline due to condition tests in each instruction), and they use a non-deterministic approach to try to improve literal loads and short loops (so timings can vary quite a bit, including as a side effect of a small code change).
2. STR9 has burst flash in a separate die off the TCM (tightly coupled memory) port. This has poor performance since the TCM is not pipelined (is for fast RAM), has switching costs when loading literals, has interaction costs when switching back and forth between system bus and TCM, and has other issues with regards to it being TCM.
3. Atmel wants you to use Thumb code (to hide flash costs) and put tight loops into RAM (since loops amplify the flash costs). This is not a hardware solution, but forcing a software one. The big problem is that Thumb-1 code is less efficient than ARM or Thumb-2 code (often takes quite a few more instructions to do the same thing). Thumb-2 solved that by providing a much larger range of instructions, including complex ones (what would have taken 2 or 3 thumb-1 instructions) and getting the conditionals concept of ARM instructions via the IT instruction.

Also, Luminary has a simple "32bit access" but with 0 wait states. The clock is limited to 50 MHz, but maybe this architecture will show its advantages when there is a lot of "jumping" code, where the buffered solutions tend to fail badly...

Yes. Many applications are as fast or faster with a lower clock speed as a result. Further, the kind of code that needs the performance (e.g. tight loops) is where the impact is most dramatic.

-> No examples or documentation for M.L. in the Luminary pack. Also, there are some rumors that I could ruin the processor forever if I do something wrong via M.L.

You can use an __asm function if using the ARM/Keil compiler, and a C function with __asm("instr\ninstr\n...") if using GCC. Both can access variables and the args and can return a result (in R0).
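
As a minimal sketch of the GCC style, here is a C function that wraps a single Thumb-2 instruction (RBIT, which reverses the bit order of a word and is handy for FFT index reversal). The function name is hypothetical, and the plain C fallback is an addition so the same file also builds when not targeting Thumb-2. The operand constraints let GCC choose the registers; for several instructions, separate them with "\n" inside the string.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the bit order of a 32-bit word. */
uint32_t bit_reverse32(uint32_t x)
{
#if defined(__GNUC__) && defined(__thumb2__)
    uint32_t result;
    /* GCC extended asm: %0 is the output, %1 the input register. */
    __asm__ ("rbit %0, %1" : "=r" (result) : "r" (x));
    return result;
#else
    /* Portable fallback: shift bits out of x and into result. */
    uint32_t result = 0u;
    for (int i = 0; i < 32; i++) {
        result = (result << 1) | (x & 1u);
        x >>= 1;
    }
    return result;
#endif
}
```

Called from C as bit_reverse32(1u), it returns 0x80000000. The result comes back in R0 as with any other C function.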

You will not ruin the processor with this kind of code. I suspect someone meant doing something like one-time-programming the Flash or locking out pins (such as JTAG). No risk with this style of code. Further, there are techniques to regain control in general (see the Luminary Flash loader).

regards, Paul


imellen

Fresh Boarder

2008/05/23 12:44

Re:hand optimized FFT/IFFT for Cortex-M3 attached

Quick answers:

->To save silicon, Cortex-m3 uses only 64 bit prefetch.

paulkimelman is right, I meant STM32, not the Cortex-M3 core.


->Another point I don't understand: your code is fully M.L., so how can it depend on the compiler's optimizations?
The ARM assembler uses pseudo-instructions such as LDR immediate and ADR that can be translated into different instructions. Furthermore, some instructions have both a 32-bit and a 16-bit form, e.g. STM sp!, {R0, R2}. A more aggressively optimizing assembler picks the narrow version of an instruction when available.

Regarding processor destruction by custom machine language code: I first heard this urban legend on the 8-bit Sinclair ZX80 computer 23 years ago :-)

Testing the FFT routine on Stellaris should be easy if your tools are based on GCC (Rowley Crossworks, Keil with gcc, ...). Just add the assembly file to your project, then declare and call the fft functions from C code.

Ivan
