I’ve been down a rabbit hole for nearly a month now, but I am convinced that I have found a bug in the Apollo3 A1/B0 silicon. It may take a couple posts to get all the info down here, but I will start with the facts:
When the STIMER is clocked from a source other than the HFRC, the act of the processor reading the STIMER count register or any of the CTIMER count registers TMR0 through TMR7 may cause the STIMER count to become permanently corrupted due to a double-increment.
Informally stated: a 32KHz crystal-clocked STIMER keeps perfect time unless you want to know what time it is. It’s kind of like a backwards Schrödinger’s cat. Instead of the cat being in an indeterminate state until its box is opened, Schrödinger’s timer (the “S” in STIMER?) is in a completely determinate state, always knowing the actual time, right up until you observe it. The act of observing the time will cause the timer to report that the time is “R”. Unfortunately, the potential double-increment-on-read means that the actual time is either “R” or “R-1”, and there is no practical way I am aware of to know which answer is right. From a statistical standpoint, the double-increment is the unlikely case. Even so, as your system continues to read any of the STIMER/CTIMER count registers, the odds increase that the reported STIMER count gets further and further ahead of the actual time. It’s not a hypothetical situation: that’s exactly how I found the bug in the first place.
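To put a rough number on that drift, here is a small host-side C++ sketch (it does not run on the Apollo3). It assumes every read independently carries the same small probability p of triggering a double-increment. That is a simplification, since the real rate will depend on the STIMER clock and on how the reads line up with it. The function names are made up for illustration, and the only rate available here is the one from the example run further down (4 double-increments in one 250000-read buffer), which applies to that particular polling loop only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Expected number of extra ticks after `reads` reads, ASSUMING each read
// independently double-increments the timer with probability `p`.
double expectedExtraTicks(uint64_t reads, double p) {
    assert(p >= 0.0 && p <= 1.0);
    return static_cast<double>(reads) * p;
}

// Probability that at least one double-increment has happened after
// `reads` reads, under the same independence assumption.
double probAtLeastOne(uint64_t reads, double p) {
    assert(p >= 0.0 && p <= 1.0);
    return 1.0 - std::pow(1.0 - p, static_cast<double>(reads));
}
```

Under this model the expected error grows linearly with the number of reads, which is why a system that polls a count register often drifts ahead faster than one that reads it rarely.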
Cause
It would appear that the corruption event can be triggered when the processor reads any of the STIMER or CTIMER count registers while the STIMER is actively being incremented from a different clock domain. I have good evidence that this is the corruption mechanism, but only Ambiq’s engineering team can confirm it.
Proof
It should be noted that in any practical system, an STIMER double-increment will be undetectable from just looking at the count once in a while. Imagine that each time you stared at your wall clock, the second hand jumped ahead an extra second just before your eyes focused on it. Without staring at the clock constantly, or comparing the clock time to some external source, you would never be aware of the additional seconds. It’s the same for the STIMER. The only way for a system to know that a double-increment occurred would be to read the timer constantly and actually see it double-count. That is the premise behind the first test program, below. It’s in the form of an Arduino sketch so that it is easy for you to try out. Just cut and paste it into an Arduino IDE and it should build and run just fine.
// This is the simplest possible demonstration of the Apollo3 double-increment issue.
// All we need to do is to use the CPU to read the STIMER count faster than
// it is incrementing. By logging the data, we can clearly see it double-increment
// once in a while.
const uint32_t bufLen = 250000;
uint8_t buffer[bufLen];

void setup()
{
  Serial.begin(115200);
  Serial.printf("STIMER bug demonstration #1\n");
  Serial.flush();

  // Uncomment one of the following lines to test an STIMER clock source.
  // The issue appears when STIMER is clocked with the 32KHz XTAL or the LFRC.
  // If the timer is running faster, the bug occurs more often.
  am_hal_stimer_config(AM_HAL_STIMER_XTAL_32KHZ);
  //am_hal_stimer_config(AM_HAL_STIMER_LFRC_1KHZ);

  uint32_t errors = 0;
  // Run the test repeatedly until we get at least 10 errors.
  while (errors < 10) {
    Serial.printf("Filling the buffer...\n");
    delay(200);

    // Disable all interrupts during this read loop to be certain that nothing
    // disturbs our sequence of reads
    am_hal_interrupt_master_disable();
    for (uint32_t i = 0; i < bufLen; i++) {
      // Keep only the low byte of the count so that more readings fit in RAM.
      buffer[i] = CTIMER->STTMR & 0xFF;
      // This delay is unnecessary - It just helps see the counting context that gets printed out when the errors occur.
      // If you remove it, you will only see the counter values N and N+2 because they get read so many times in a row.
      // If the STIMER is clocked via LFRC, it is so slow that this delay won't help with the context.
      am_hal_flash_delay(FLASH_CYCLES_US(6));
    }
    // Turn interrupts back on so Serial IO will work
    am_hal_interrupt_master_enable();

    Serial.printf("Scanning for oddities\n");
    for (uint32_t i = 1; i < bufLen; i++) {
      // Calculate the difference between this reading and the one that came before it.
      // The subtraction is truncated to 8 bits, so a normal FF->00 rollover still gives 1.
      uint8_t diff = (buffer[i] - buffer[i-1]);
      // The only possible values that the variable 'diff' can have are:
      // 0 if the timer didn't increment between the two successive readings, or
      // 1 if the timer did increment between the two successive readings
      if ((diff == 0) || (diff == 1)) {
        // These are the expected cases: everything is fine
      }
      else {
        errors++;
        // Cast the address so that it matches the %X format specifier.
        Serial.printf("0x%08X: %02X->%02X\n ", (unsigned int)(uintptr_t)&buffer[i-1], buffer[i-1], buffer[i]);
        // Show some context to the unusual change in counter value, marking the moment of the change with a "->"
        for (int32_t j = -16; j < 16; j++) {
          // Do the bounds check in signed arithmetic so that an index before the start
          // of the buffer is rejected explicitly rather than by unsigned wraparound.
          int32_t k = (int32_t)i + j;
          if ((k >= 0) && (k < (int32_t)bufLen)) {
            Serial.printf("%02X%s", buffer[k], (j == -1) ? "->" : " ");
          }
        }
        Serial.printf("\n");
      }
    }
    Serial.printf("%u errors detected so far\n", (unsigned int)errors);
  }
  Serial.printf("Done!\n");
}

void loop() {
  // Nothing to do here: the whole test runs once in setup().
}
The test should be simple to understand. All it does is configure the STIMER to run from the 32KHz XTAL clock source, then repeatedly read the STIMER into a really big log buffer. Because the problem shows up as a double increment, it is always visible in the low-order byte of the STIMER, so we can capture more readings by saving only the bottom byte of each read. Once the buffer is full, the test program scans the collected data looking for anything weird. Since the polling loop runs way faster than the STIMER is incrementing, the difference between any adjacent pair of readings in the buffer can only have two possible values: 1 if the counter incremented between the two readings, or 0 if it didn’t. Nothing else is possible. What the test program actually shows is cases where adjacent readings differ by 2. When the program finds anything out of the ordinary, it prints a message describing the unexpected change, along with some context from the buffer so that you can see the timer incrementing properly before and after the double-increment. Here is an example:
STIMER bug demonstration #1
Filling the buffer...
Scanning for oddities
0x1001EF88: E6->E8
E3 E3 E4 E4 E4 E4 E4 E5 E5 E5 E5 E5 E6 E6 E6 E6->E8 E8 E8 E8 E8 E9 E9 E9 E9 E9 EA EA EA EA EA EB
0x1002B73A: 33->35
30 30 31 31 31 31 31 32 32 32 32 32 33 33 33 33->35 35 35 35 35 36 36 36 36 36 37 37 37 37 37 38
0x10046C9B: D8->DA
D5 D5 D6 D6 D6 D6 D6 D7 D7 D7 D7 D7 D8 D8 D8 D8->DA DA DA DA DA DB DB DB DB DB DC DC DC DC DC DD
0x1004C17E: D4->D6
D1 D1 D2 D2 D2 D2 D2 D3 D3 D3 D3 D3 D4 D4 D4 D4->D6 D6 D6 D6 D6 D7 D7 D7 D7 D7 D8 D8 D8 D8 D8 D9
4 errors detected so far
There you have it. I really don’t think there is a flaw in my thinking or testing methodology, but I have made dumber mistakes before.
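For anyone who wants to double-check the scanning logic without a board, the same adjacent-difference test can be run on a PC. The following is a minimal host-side sketch, not part of the Arduino program. The names findBadSteps and exampleCheck are made up for illustration, and the short sample log is hand-built to mimic the E6->E8 jump from the output above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Return every index i where log[i] is neither log[i-1] nor log[i-1] + 1.
// The difference is truncated to 8 bits, so a normal FF -> 00 rollover counts as 1.
std::vector<size_t> findBadSteps(const std::vector<uint8_t>& log) {
    std::vector<size_t> bad;
    for (size_t i = 1; i < log.size(); ++i) {
        uint8_t diff = static_cast<uint8_t>(log[i] - log[i - 1]);
        if (diff > 1) {
            bad.push_back(i);
        }
    }
    return bad;
}

// Hand-built sample: the count goes E5, E6, then skips E7 entirely.
void exampleCheck() {
    const std::vector<uint8_t> log = {0xE5, 0xE6, 0xE6, 0xE8, 0xE8, 0xE9};
    const std::vector<size_t> bad = findBadSteps(log);
    assert(bad.size() == 1 && bad[0] == 3);  // the E6 -> E8 step is flagged
}
```

This only confirms that the scan flags a skipped count and tolerates rollover. It says nothing about why the hardware skips one.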
Status & Workarounds
This issue is not documented in the Ambiq errata, but strangely, a double-increment issue was mentioned in passing in a posting on their support website: https://support.ambiq.com/hc/en-us/arti … are-cases-
The posting never mentioned the cause of the double-increment, or addressed any other issues that a double-increment might cause. The specific situation they mention is perhaps the only one with a workaround. Their workaround does not address the fact that the system will get ahead in time due to the skipped counts, but at least their workaround keeps the system from skipping over a count that should have caused a compare interrupt.
In the general case, I don’t see any simple workarounds for a timer that doesn’t keep time. Unless you double-check every single increment, you have no idea if a read of STIMER/CTIMER0…7 has triggered a false increment. And if you do double-check every single increment, then it would be simpler to just implement a software counter in the first place.
Implications
There are a number of implications. If clocking the STIMER from the HFRC is OK because you can tolerate its 2% accuracy, then this whole thing is a non-issue. But if you are clocking the STIMER from the crystal because you need timing accuracy, then your system gets very complicated very fast. The most important implication is that if you really need the STIMER to be accurate at all times, then your system is not allowed to read the STIMER count register or any of the CTIMER count registers TMR0 through TMR7, ever. But what good is a timekeeping source if you can’t read it? That’s a tough one.
Perhaps there are certain workarounds, though. For example, the STIMER capture mechanism seems to be able to read the STIMER without corrupting it. If that does what you need, then that’s OK. In fact, it should be possible to read the STIMER by creating a fake capture event that returns the current time. One annoyance with this method is that the capture events are synced to the STIMER clock (and not the processor clock), so a capture request may take as many as two STIMER clocks to be recognized. So the workaround would work, but a read would be pretty slow.
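The control flow of such a capture-based read might look like the sketch below. It is written against hypothetical hooks rather than real hardware: CaptureHooks, requestCapture, captureLatched and readCaptureValue are invented names, not Ambiq HAL calls, and how to actually fire a fake capture event on the Apollo3 is left open. The point is only the shape: request a capture, poll until it has latched, then read the latched value instead of the live count register.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical hardware hooks. None of these names come from the Ambiq HAL;
// on real hardware they would wrap whatever triggers and reads a capture.
struct CaptureHooks {
    std::function<void()> requestCapture;        // fire the fake capture event
    std::function<bool()> captureLatched;        // true once a new value has been latched
    std::function<uint32_t()> readCaptureValue;  // read the latched value, not the live count
};

// Try to get the current time through a capture instead of a direct read of
// the count register. Returns false if nothing latched within maxPolls polls.
bool readTimeViaCapture(const CaptureHooks& hw, uint32_t maxPolls, uint32_t& capturedTime) {
    assert(hw.requestCapture && hw.captureLatched && hw.readCaptureValue);
    hw.requestCapture();
    // The capture is synced to the STIMER clock, so it may take a while
    // (as many as two STIMER clocks, per the text above) to be recognized.
    for (uint32_t poll = 0; poll < maxPolls; ++poll) {
        if (hw.captureLatched()) {
            capturedTime = hw.readCaptureValue();
            return true;
        }
    }
    return false;  // capturedTime is left untouched on timeout
}
```

The polling limit is there because of the latency: a caller has to decide how long it is willing to wait for a value that can arrive up to two STIMER clocks late.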
Software innocently reading a timer count is how I found the problem in the first place. It turns out that the tickless FreeRTOS implementation provided by Ambiq uses a CTIMER count register as part of its sleep/wakeup processing. Those reads of the CTIMER count were corrupting the STIMER count I was capturing to measure some very precisely timed events. I was able to create a workaround by running FreeRTOS in its standard ‘always ticking’ mode. That fixes the STIMER corruption, but it means that the power-saving advantages of the tickless mode are lost in my system.
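In stock FreeRTOS, the choice between the two modes is made by the configUSE_TICKLESS_IDLE setting in FreeRTOSConfig.h. The fragment below shows the standard setting for the always-ticking mode. It assumes the Ambiq port honors this option in the usual way; whether a particular project needs further changes is not covered here.

```cpp
// FreeRTOSConfig.h (fragment)
// 0 = keep the periodic tick running at all times (no tickless idle).
// A non-zero value enables tickless idle, which on the Ambiq port is the
// mode whose CTIMER count reads are described above.
#define configUSE_TICKLESS_IDLE 0
```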
I am sure that there are more implications, but the ones listed above strike me as pretty extensive.
Next Steps
So what to do with all this? Well, all silicon has bugs and that’s a fact. I notified Ambiq of this issue a couple of weeks ago, and I am still waiting for them to confirm it. If this turns out to be a real issue in their estimation, I would like to see it documented in their errata. My sincere hope is that Ambiq’s engineering team can figure out some potential workarounds that I am not seeing.
In the meantime, the trick for us users is to understand the nature of the bugs and their implications for a specific system, and to design our systems in a way that avoids triggering these issues. To that end, my FreeRTOS system is back to wasting power in a tickfull mode of operation. My system has excised all stray reads of any CTIMER/STIMER counter registers. Now, my system’s measurements are finally falling within their expected error bounds.
In a bit, I’ll post the source for another test program that shows how to tickle the bug in a different way.