AT91SAM7 Parallel data

I need to read 8-bit parallel data using an AT91SAM7S64. The issue I am having is that, of the 32 PIO bits, I can’t keep the 8 data bits together because of other peripherals in use.

// Here's the input pins for the 8 bit data
#define    DATA0		(1<<0)			
#define	DATA1		(1<<1)
#define	DATA2		(1<<26)
#define	DATA3		(1<<27)
#define	DATA4		(1<<28)
#define	DATA5		(1<<29)
#define	DATA6		(1<<30)
#define	DATA7		(1<<31)
#define	DATA_MASK	(DATA0|DATA1|DATA2|DATA3|DATA4|DATA5|DATA6|DATA7)

I set up the PIO like this:

AT91C_BASE_PMC->PMC_PCER = (1<<AT91C_ID_PIOA);  // Turns on the peripheral clock
AT91C_BASE_PIOA->PIO_PER = DATA_MASK;		// Enables DATA pins
AT91C_BASE_PIOA->PIO_ODR = DATA_MASK;		// Makes DATA pins inputs
AT91C_BASE_PIOA->PIO_PPUDR = DATA_MASK;		// Disables pullups on DATA pins

Now I read the PIO_PDSR register and mask it like this:

	int a;
	a = AT91C_BASE_PIOA->PIO_PDSR & DATA_MASK;

Now I have variable “a” with 32 bits of data like this: 0bxxxxxx000000000000000000000000xx, where “x” is my 8 bits of incoming data and the only data I want. My question is: what is the fastest way (fewest instruction cycles) to turn this data into a char? I have tried several ways using several variables and shifting it around, but it’s becoming more complicated than I think it should be and it’s not working. Thanks in advance for any replies!

Something like:

char c = (a >> 24) | (a & 3);

regards,

Giovanni

edit: bug fix :slight_smile:
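
For anyone who wants to check the expression on a PC first, here is a minimal sketch. It assumes the pin assignment from the first post (DATA0-1 on PA0-1, DATA2-7 on PA26-31). The name pack_data is made up for this illustration, and the port value is passed in as a plain integer instead of being read from PIO_PDSR.

```c
#include <stdint.h>

/* Pin assignment from the first post: DATA0-1 on PA0-1, DATA2-7 on PA26-31. */
#define DATA_MASK 0xFC000003u

/* Hypothetical helper: pack the eight scattered data bits of one port
 * sample into a byte. An unsigned type makes >> a logical shift. */
static uint8_t pack_data(uint32_t pdsr)
{
    uint32_t a = pdsr & DATA_MASK;           /* keep only the data pins      */
    return (uint8_t)((a >> 24) | (a & 3u));  /* PA26-31 -> bits 2-7,
                                                PA0-1 stay in bits 0-1       */
}
```

On the target it would be called as pack_data(AT91C_BASE_PIOA->PIO_PDSR).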

If you can permute the data lines like this:

#define   DATA0      (1<<26)
#define   DATA1      (1<<27)
#define   DATA2      (1<<28)
#define   DATA3      (1<<29)
#define   DATA4      (1<<30)
#define   DATA5      (1<<31)
#define   DATA6      (1<<0)         
#define   DATA7      (1<<1)

, you need only one instruction:

const unsigned register mask = 0xff;
...
for(...) {
  unsigned char a;
  asm("and %0, %1, %2, ror #26":"=r"(a):"r"(mask),"r"(AT91C_BASE_PIOA->PIO_PDSR));
  ...
}

GCC will not reload the mask when there are enough free registers. If all you do is write the value to memory, you don’t even need to mask the value and can replace the AND with a MOV.
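
To see why the rotate works: rotating right by 26 moves PA26-31 down to bits 0-5 and wraps PA0-1 around to bits 6-7, so the byte comes out in order. Below is a portable C sketch of what the AND with ROR computes, assuming the permuted pin assignment above. The names ror32 and pack_permuted are illustrative only and do not come from any Atmel header.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (valid for 0 < n < 32). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32u - n));
}

/* Same result as "and rd, mask, rs, ror #26" with mask = 0xff. */
static uint8_t pack_permuted(uint32_t pdsr)
{
    return (uint8_t)(ror32(pdsr, 26) & 0xFFu);
}
```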

I’ve tested both pieces of code which were suggested and here is the disassembly of each.

	int b;
	b = AT91C_BASE_PIOA->PIO_PDSR;
0x000002c8 <main+100>: mov  r3, #-1610612736	; 0xa0000000
0x000002cc <main+104>: asr  r3, r3, #19
0x000002d0 <main+108>: ldr  r3, [r3, #60]
0x000002d4 <main+112>: str  r3, [r11, #-28]
	char c = (b >> 24) | (b & 3); 
0x000002d8 <main+116>: ldr  r3, [r11, #-28]
0x000002dc <main+120>: asr  r3, r3, #24
0x000002e0 <main+124>: and  r2, r3, #255	; 0xff
0x000002e4 <main+128>: ldr  r3, [r11, #-28]
0x000002e8 <main+132>: and  r3, r3, #255	; 0xff
0x000002ec <main+136>: and  r3, r3, #255	; 0xff
0x000002f0 <main+140>: and  r3, r3, #3	; 0x3
0x000002f4 <main+144>: orr  r3, r2, r3
0x000002f8 <main+148>: and  r3, r3, #255	; 0xff
0x000002fc <main+152>: strb r3, [r11, #-22]
asm("and %0, %1, %2, ror#26":"=r"(a):"r"(mask),"r"(AT91C_BASE_PIOA->PIO_PDSR)); 
0x00000304 <main+160>: mov  r3, #-1610612736	; 0xa0000000
0x00000308 <main+164>: asr  r3, r3, #19
0x0000030c <main+168>: ldr  r3, [r3, #60]
0x00000310 <main+172>: and  r3, r2, r3, ror #26
0x00000314 <main+176>: strb r3, [r11, #-21]

It appears that the second will require much less time to execute, although I only understand how the first example works. I have read about the first half of the ARM System Developer’s Guide by A. Sloss, and even using it as a reference I don’t understand this line of code. Denial, could you please explain how this works? I understand how the AND and ROR work, but what are the % signs and the “r” and “=r”? Thanks very much for your help!

EDIT - I answered my own questions by reading the ARM GCC Inline Assembler Cookbook.
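
For other readers with the same question, here is a short annotated sketch. In GCC extended asm, %0, %1 and %2 are placeholders for the operands listed after the colons, numbered in order; "=r" marks an output that is written to a general-purpose register, and "r" marks an input held in one. The wrapper name and_ror26 is hypothetical, and the non-ARM branch is only there so the sketch can be tried on a PC.

```c
#include <stdint.h>

/* Hypothetical wrapper around the one-instruction version. */
static uint8_t and_ror26(uint32_t mask, uint32_t port)
{
    uint32_t result;
#if defined(__arm__) && !defined(__thumb__)
    /* asm("template" : outputs : inputs);
     *   %0 -> result  ("=r": output, any general-purpose register)
     *   %1 -> mask    ("r" : input,  any general-purpose register)
     *   %2 -> port    ("r" : input,  any general-purpose register) */
    __asm__("and %0, %1, %2, ror #26"
            : "=r"(result)
            : "r"(mask), "r"(port));
#else
    /* Host fallback: the same computation in plain C. */
    result = mask & ((port >> 26) | (port << 6));
#endif
    return (uint8_t)result;
}
```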

0x000002c8 <main+100>: mov  r3, #-1610612736   ; 0xa0000000 

0x000002cc <main+104>: asr r3, r3, #19
0x000002d0 <main+108>: ldr r3, [r3, #60]
0x000002d4 <main+112>: str r3, [r11, #-28]

You appear to be very low on registers.

It needs to reload the PIO address and can’t even keep b in a register.

Try

volatile unsigned * const port = &AT91C_BASE_PIOA->PIO_PDSR;
...
for(...) {
  unsigned char v;
  asm("mov %0, %1, ror #26":"=r"(v):"r"(*port));
  ...
}

In my tests GCC automatically masks the lower byte when it needs to cast v to int.

Just a question: what optimization level are you using? The generated code is really bad, so you are probably not enabling optimizations. If you are using GCC, try -O2 and -fomit-frame-pointer.

This is the code I get from GCC 4.2.2 (YAGARTO):

char func(unsigned a) {

  a &= DATA_MASK;
  return (a >> 24) | (a & 3);
}
 103 0040 033000E2 		and	r3, r0, #3
 104 0044 FF0300E2 		and	r0, r0, #-67108861
 107 0048 200C83E1 		orr	r0, r3, r0, lsr #24
 112 004c 1EFF2FE1 		bx	lr

And note that the “bx lr” is there only because it is a function, the code itself is 3 instructions.

regards,

Giovanni

Here’s what the flags in the make-file are: (Written by James Lynch)

CFLAGS = -I./ -c -fno-common -O0 -g

I changed it to this:

CFLAGS = -I./ -c -fno-common -O2 -g -fomit-frame-pointer

The only way I know of using the YAGARTO toolchain to view the disassembly is while debugging. This won’t work with optimization -O1 or -O2 as I tried both and the disassembly came out wrong. Is there another way to view it? Here’s what I got:

	int b;
	b = AT91C_BASE_PIOA->PIO_PDSR;
0x00000234 <main+84>:  mvn   r3, #2816	; 0xb00
0x00000238 <main+88>:  ldr   r2, [r3, #-195]
	char c = (b >> 24) | (b & 3); 

	const unsigned register mask = 0xff; 
	unsigned char a; 
	asm("and %0, %1, %2, ror #26":"=r"(a):"r"(mask),"r"(AT91C_BASE_PIOA->PIO_PDSR)); 
0x0000023c <main+92>:  ldr   r3, [r3, #-195]

It seems to just skip these lines of code. The GCC version is 4.2.2

You have to use the variable “c” after you assigned it or it is optimized out.

GCC can create a listing file using “-Wa,-alms=filename.lst”, very useful.

You should download the GCC documentation PDF file, there are a lot of useful options documented there.

regards,

Giovanni

I think I’ve been having some OCD problems. I have the new version and the debugger seems to work correctly now. I created a simple project called “Simple” which is very stripped down to test functions and debugging. Here’s my test program:

#include "AT91SAM7S64.h"
#include "board.h"

main(void)
{
	low_level_init();
	
	int i = 5;
	int j = 0;
	
	while (j < i)
		j++;
}

Here’s the disassembly with optimization turned off (-O0)

{
0x00000108 <main>:    mov  r12, sp
0x0000010c <main+4>:  push {r11, r12, lr, pc}
0x00000110 <main+8>:  sub  r11, r12, #4	; 0x4
0x00000114 <main+12>: sub  sp, sp, #8	; 0x8
	low_level_init();
0x00000118 <main+16>: bl   0x154 <low_level_init>
	
	int i = 5;
0x0000011c <main+20>: mov  r3, #5	; 0x5
0x00000120 <main+24>: str  r3, [r11, #-20]
	int j = 0;
0x00000124 <main+28>: mov  r3, #0	; 0x0
0x00000128 <main+32>: str  r3, [r11, #-16]
	
	while (j < i)
0x0000012c <main+36>: b    0x13c <main+52>
0x0000013c <main+52>: ldr  r2, [r11, #-16]
0x00000140 <main+56>: ldr  r3, [r11, #-20]
0x00000144 <main+60>: cmp  r2, r3
0x00000148 <main+64>: blt  0x130 <main+40>
		j++;
0x00000130 <main+40>: ldr  r3, [r11, #-16]
0x00000134 <main+44>: add  r3, r3, #1	; 0x1
0x00000138 <main+48>: str  r3, [r11, #-16]
}
0x0000014c <main+68>: sub  sp, r11, #12	; 0xc
0x00000150 <main+72>: ldm  sp, {r11, sp, pc}

Here’s with optimization set to (-O1)

{
0x00000108 <main>:   push {lr}		; (str lr, [sp, #-4]!)
	low_level_init();
0x0000010c <main+4>: bl   0x114 <low_level_init>
	
	int i = 5;
	int j = 0;
	
	while (j < i)
		j++;
}
0x00000110 <main+8>: pop  {pc}		; (ldr pc, [sp], #4)

Here’s with optimization set to (-O2)

	low_level_init();
0x00000108 <main>:   b 0x10c <low_level_init>

I’m not going to test a more complex function until I get something simple like this to work first. With optimization turned on, the function low_level_init executes and I am able to step through the code, although it’s not in the order that it is listed in the disassembly. The simple while statement in main does not execute with optimization turned on. This seems pretty straightforward; am I doing something wrong? Thanks again.

Optimizations do remove useless code. You never use i and j after the loop, so every reference to those variables is removed. Try declaring them “volatile”, or use them afterwards (for example, call a function and pass i and j as parameters).

regards,

Giovanni
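
A minimal sketch of that suggestion, combining both ideas. The function name count_to is made up for this example.

```c
/* Declaring the counter volatile forces the compiler to perform every
 * load and store, so the loop survives -O1/-O2 and can be single-stepped. */
static int count_to(int limit)
{
    volatile int j = 0;

    while (j < limit)
        j++;

    return j;   /* returning the value also counts as using it */
}
```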

I’ve made some good progress with optimizations turned on. Here’s the asm code that reads the PIO parallel data into a small array.

	register const unsigned mask asm ("r4") = 0xff;
	unsigned char a;
	char data[100];
	char *pdata;
	pdata = &data[0];

    	while (*pdata < *pdata+100)
 		{	
			asm("and %0, %1, %2, ror #26":"=r"(a):"r"(mask),"r"(AT91C_BASE_PIOA->PIO_PDSR)); 
			*pdata++ = a;
		}

Here’s the disassembly of the code:

	register const unsigned mask asm ("r4") = 0xff;
0x00000168 <main+80>:  mov  r4, #255	; 0xff
	unsigned char a;
	char data[100];
	char *pdata;
	pdata = &data[0];

    	while (*pdata < *pdata+100)
0x0000016c <main+84>:  mov  r0, sp
 		{	
			asm("and %0, %1, %2, ror #26":"=r"(a):"r"(mask),"r"(AT91C_BASE_PIOA->PIO_PDSR)); 
0x00000170 <main+88>:  mvn  r3, #2816	; 0xb00
0x00000174 <main+92>:  ldr  r2, [r3, #-195]
0x00000178 <main+96>:  and  r1, r4, r2, ror #26
			*pdata++ = a;
0x0000017c <main+100>: strb r1, [r0], #1
0x00000180 <main+104>: b    0x170 <main+88>

Does this look like it should or are there still too many instructions? Here’s the C code that was suggested:

char func_a(unsigned a)
	{ 
	  a &= DATA_MASK; 
	  return (a >> 24) | (a & 3); 
	}

Here’s the disassembly:

	  a &= DATA_MASK; 
0x00000108 <func>:    and r3, r0, #3	; 0x3
0x0000010c <func+4>:  and r0, r0, #-67108861	; 0xfc000003
	  return (a >> 24) | (a & 3); 
	} 
0x00000110 <func+8>:  orr r0, r3, r0, lsr #24
0x00000114 <func+12>: bx  lr

This looks correct and matches what was previously posted. I guess I should explain a little more about what I’m trying to do. I’m replacing a PIC18 (which was completely coded in ASM at about 5000 lines :? ) with an ARM7 processor due to the lack of throughput. This particular function needs to read the PCLK output of a 2MP CMOS sensor (while VSYNC is high) and save the parallel data read on the PIO to internal RAM. Is there a better (more efficient) way to save this data than to an array, as I have done in my first section of code? The migration from PIC18 ASM to ARM7 C has a very steep learning curve :smiley:

Do you have to read the data just as fast as possible? No synchronization required?

The image starts when VSYNC goes high; VSYNC envelops the entire image. HSYNC, which envelops each line of data, doesn’t need to be read because I disable the embedded codes. The line size defaults to 512 bytes, which is what I use. PCLK goes high to indicate that the parallel 8-bit data is valid and needs to be read. There are adjustments such as PLL and FIFO data rate to match the data output speed to the capability of the processor. I’ve run the PCLK as low as 100 kHz (which produces grainy images) and as high as 400 kHz on the inefficient PIC18 in SVGA mode. With the PIC18 I saved the image data directly to a 2 GB SD card, using it as a byte array with a table of contents that is decoded later when it’s downloaded. For right now, I want to save the image data to RAM on the AT91SAM7S64, which will hold a small image (QVGA or QCIF), and send it out over RS232. So I guess the roundabout answer is that I need to read the data as fast as possible when the PCLK pin on the PIO goes high and save it to RAM. Thanks again for your help.

So you have to read the port on the rising edge of the PCLK pin and then buffer the data. It should not be complicated.

Don’t worry about the optimizations, GCC does a pretty good job, you don’t really need to verify the generated code unless you run into problems.

You may run into troubles if you have long interrupt service routines (if you use interrupts…) because it may delay the polling and you may miss a cycle.
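
A rough sketch of such a polling loop, under several assumptions that are not from this thread: PCLK is taken to be on PA2, the data pins follow the original assignment (PA0-1 and PA26-31), and the data is assumed valid in the same port read that shows PCLK high. The port is read through a function pointer so the loop can be tried on a PC; fake_port is a stand-in for PIO_PDSR, and all names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DATA_MASK 0xFC000003u   /* DATA0-1 on PA0-1, DATA2-7 on PA26-31 */
#define PCLK_PIN  (1u << 2)     /* assumption: PCLK wired to PA2        */

typedef uint32_t (*port_read_fn)(void);

/* Store one byte per rising edge of PCLK until n bytes are captured. */
static size_t capture(port_read_fn read_port, uint8_t *buf, size_t n)
{
    size_t count = 0;
    uint32_t prev = read_port() & PCLK_PIN;

    while (count < n) {
        uint32_t s   = read_port();
        uint32_t clk = s & PCLK_PIN;

        if (clk && !prev) {                 /* low -> high transition */
            uint32_t a = s & DATA_MASK;
            buf[count++] = (uint8_t)((a >> 24) | (a & 3u));
        }
        prev = clk;
    }
    return count;
}

/* Host-side stand-in for PIO_PDSR: PCLK toggles on every read and the
 * data byte (0, 1, 2, ...) changes while PCLK is low. */
static uint32_t fake_port(void)
{
    static uint32_t tick = 0;
    uint32_t byte = (tick / 2u) & 0xFFu;
    uint32_t clk  = (tick & 1u) ? PCLK_PIN : 0u;

    tick++;
    return ((byte & 0xFCu) << 24) | (byte & 3u) | clk;
}
```

On the target, read_port would be a small function returning AT91C_BASE_PIOA->PIO_PDSR, and the VSYNC check would wrap the call to capture.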

That’s correct: read on the rising edge of PCLK (pixel clock), then buffer the data. I don’t think I will use interrupts, as I would like to read the data as fast as possible and think polling would be better. Besides, I don’t really need to do much while this is happening. The max PCLK of the sensor is 80 MHz, which is required for streaming JPEG images at 30 fps at 1600 x 1200 (80 MB/sec). I only need a few frames per second, so the data rate will not be nearly as high. I would consider something faster than an ARM7, but I need the low power consumption. Thanks again.
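
As a back-of-the-envelope check on whether polling can keep up, the number of CPU cycles available per pixel is simply the core clock divided by PCLK. The 48 MHz core clock used in the example below is an assumption, since the thread does not state the clock; cycles_per_pixel is a made-up helper name.

```c
#include <stdint.h>

/* CPU cycles available per PCLK period (integer division, rounds down). */
static uint32_t cycles_per_pixel(uint32_t cpu_hz, uint32_t pclk_hz)
{
    return cpu_hz / pclk_hz;
}
```

With an assumed 48 MHz core, a 400 kHz PCLK leaves cycles_per_pixel(48000000, 400000) = 120 cycles per byte, while an 80 MHz PCLK gives 0, meaning it cannot be polled at all.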

Has:

    	while (*pdata < *pdata+100)

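You realize this is an endless loop? It compares the byte that pdata points to with that same byte plus 100, which is always true.

What you need is a comparison of the pointers themselves:

while (pdata < data+100)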


I'd do it like this:

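void f(char *p, unsigned n) {
  volatile unsigned * const PDSR = &AT91C_BASE_PIOA->PIO_PDSR;
  char *e = p + n;               // one past the last byte to fill
  do {
    char a;
    // Rotate the port value so the 8 data bits land in bits 0-7
    asm("mov %0, %1, ror #26":"=r"(a):"r"(*PDSR));
    *p++ = a;                    // byte store keeps only the low 8 bits
  } while(p<e);
}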


, assuming n>0