Apollo 3 STIMER Silicon Bug

robin_hodgson · March 6, 2021, 6:12pm

I’ve been down a rabbit hole for nearly a month now, but I am convinced that I have found a bug in the Apollo3 A1/B0 silicon. It may take a couple posts to get all the info down here, but I will start with the facts:

When the STIMER is clocked from a source other than the HFRC, the act of the processor reading the STIMER count register or any of the CTIMER count registers TMR0 through TMR7 may cause the STIMER count to become permanently corrupted due to a double-increment.

Informally stated: a 32KHz crystal-clocked STIMER keeps perfect time unless you want to know what time it is. It’s kind of like a backwards Schrodinger’s cat. Instead of the cat being in an indeterminate state until its box is opened, Schrodinger’s timer (the “S” in STIMER?) is in a completely determinate state, always knowing the actual time right up until you observe it. The act of observing the time will cause the timer to report that the time is “R”. Unfortunately, the potential double-increment-on-read means that the actual time is either “R” or “R-1”, and there is no practical way I am aware of to know which answer is right. From a statistical standpoint, the double-increment is the unlikely case. Even so, as your system continues to read any of the STIMER/CTIMER count registers, the odds increase that the reported STIMER count gets further and further ahead of the actual time. It’s not a hypothetical situation: that’s exactly how I found the bug in the first place.

Cause

It would appear that the corruption event can be triggered when the processor reads any of the STIMER or CTIMER count registers while the STIMER is actively being incremented from a different clock domain. I have good evidence that this is the corruption mechanism, but only Ambiq’s engineering team can confirm it.

Proof

It should be noted that in any practical system, an STIMER double-increment will be undetectable from just looking at the count once in a while. Imagine that each time you stared at your wall clock, the second hand jumped ahead an extra second just before your eyes focused on it. Without staring at the clock constantly, or comparing the clock time to some external source, you would never be aware of the additional seconds. It’s the same for the STIMER. The only way for a system to know that a double-increment occurred would be to read the timer constantly and actually see it double-count. That is the premise behind the first test program, below. It’s in the form of an Arduino sketch so that it is easy for you to try out. Just cut and paste it into an Arduino IDE and it should build and run just fine.

// This is the simplest possible demonstration of the Apollo3 double-increment issue.
// All we need to do is to use the CPU to read the STIMER count faster than 
// it is incrementing. By logging the data, we can clearly see it double-increment
// once in a while.

const uint32_t bufLen = 250000;
uint8_t buffer[bufLen];

void setup()
{
  Serial.begin(115200);
  Serial.printf("STIMER bug demonstration #1\n");
  Serial.flush();
  
  // Uncomment one of the following lines to test an STIMER clock source.
  // The issue appears when STIMER is clocked with the 32KHz XTAL or the LFRC.
  // If the timer is running faster, the bug occurs more often.
  am_hal_stimer_config(AM_HAL_STIMER_XTAL_32KHZ);
  //am_hal_stimer_config(AM_HAL_STIMER_LFRC_1KHZ);
  
  uint32_t errors = 0;

  // Run the test repeatedly until we get at least 10 errors.
  while (errors<10) {
    Serial.printf("Filling the buffer...\n");
    delay(200);

    // Disable all interrupts during this read loop to be certain that nothing
    // disturbs our sequence of reads
    am_hal_interrupt_master_disable();
    
    for (uint32_t i = 0; i<bufLen; i++) {
      buffer[i] = CTIMER->STTMR & 0xFF;
      // This delay is unnecessary - It just helps see the counting context that gets printed out when the errors occur.
      // If you remove it, you will only see the counter values N and N+2 because they get read so many times in a row.
      // If the STIMER is clocked via LFRC, it is so slow that this delay won't help with the context.
      am_hal_flash_delay(FLASH_CYCLES_US(6));
    }

    // Turn interrupts back on so Serial IO will work
    am_hal_interrupt_master_enable();
    
    Serial.printf("Scanning for oddities\n");
    for (uint32_t i=1; i<bufLen; i++) {
      // Calculate the difference between this reading and the one that came before it.
      uint8_t diff = (buffer[i] - buffer[i-1]);

      // The only possible values that the variable 'diff' can have are:
      //  0 if the timer didn't increment between the two successive readings, or
      //  1 if the timer did increment between the two successive readings
      if ((diff == 0) || (diff == 1)) {
        // These are the expected cases: everything is fine
      }
      else {
        errors++;
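        // Cast the pointer to an integer type so it matches the %X format specifier
        Serial.printf("0x%08X: %02X->%02X\n    ", (unsigned int)(uintptr_t)&buffer[i-1], buffer[i-1], buffer[i]);
        // Show some context to the unusual change in counter value, marking the moment of the change with a "->"
        for (int j=-16; j<16; j++) {
          // Compute the index as a signed value so that the lower-bound check is meaningful
          int32_t idx = (int32_t)i + j;
          if ((idx >= 0) && (idx < (int32_t)bufLen)) {
            Serial.printf("%02X%s", buffer[idx], (j == -1) ? "->" : " ");
          }
        }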
        Serial.printf("\n");
      }
    }
    Serial.printf("%lu errors detected so far\n", errors);
  }
  Serial.printf("Done!\n");
}

void loop() {
  // put your main code here, to run repeatedly:

}

The test should be simple to understand. All it does is configure the STIMER to run from the 32KHz XTAL clock source, then repeatedly read the STIMER into a really big log buffer. Because the problem shows up as a double increment, it is always visible in the low-order byte of the STIMER, so we can capture more readings by only saving the bottom byte of each STIMER read. Once the buffer is full, the test program scans the collected data looking for anything weird. Since the polling loop runs way faster than the STIMER is incrementing, any adjacent pair of readings in the buffer can only differ in one of two ways: either the counter incremented once between the two readings, or it didn’t increment at all. Nothing else is possible. Interestingly, what the test program shows is cases where adjacent readings differ by 2. When the program finds anything out of the ordinary, it prints a message describing the unexpected change, along with some context from the buffer so that you can see the timer incrementing properly before and after the double-increment. Here is an example:
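The scan step can also be tried on a PC without any hardware. The following is a minimal host-side sketch, not part of the original test program; the function name findOddSteps is made up for illustration. It applies the same rule as the sketch above: the difference between adjacent low-byte readings, taken modulo 256, must be 0 or 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns every index i where samples[i] is neither equal to
// samples[i-1] nor exactly one count ahead of it.
// Only the low byte of each reading is stored, so the subtraction is done modulo 256.
std::vector<std::size_t> findOddSteps(const std::vector<std::uint8_t>& samples) {
    std::vector<std::size_t> odd;
    for (std::size_t i = 1; i < samples.size(); i++) {
        // The cast keeps the result in 8 bits, so a rollover from 0xFF to 0x00 gives 1.
        std::uint8_t diff = static_cast<std::uint8_t>(samples[i] - samples[i - 1]);
        if (diff > 1) {
            odd.push_back(i);
        }
    }
    return odd;
}
```

Feeding it a sequence such as E5 E5 E6 E6 E8 E8 E9 flags the E6 to E8 step, while a normal rollover such as FE FF FF 00 01 is not flagged.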

STIMER bug demonstration #1
Filling the buffer...
Scanning for oddities
0x1001EF88: E6->E8
    E3 E3 E4 E4 E4 E4 E4 E5 E5 E5 E5 E5 E6 E6 E6 E6->E8 E8 E8 E8 E8 E9 E9 E9 E9 E9 EA EA EA EA EA EB 
0x1002B73A: 33->35
    30 30 31 31 31 31 31 32 32 32 32 32 33 33 33 33->35 35 35 35 35 36 36 36 36 36 37 37 37 37 37 38 
0x10046C9B: D8->DA
    D5 D5 D6 D6 D6 D6 D6 D7 D7 D7 D7 D7 D8 D8 D8 D8->DA DA DA DA DA DB DB DB DB DB DC DC DC DC DC DD 
0x1004C17E: D4->D6
    D1 D1 D2 D2 D2 D2 D2 D3 D3 D3 D3 D3 D4 D4 D4 D4->D6 D6 D6 D6 D6 D7 D7 D7 D7 D7 D8 D8 D8 D8 D8 D9 
4 errors detected so far

There you have it. I really don’t think there is a flaw in my thinking or testing methodology, but I have made dumber mistakes before.

Status & Workarounds

This issue is not documented in the Ambiq errata, but strangely, a double-increment issue was mentioned in passing in a posting on their support website: https://support.ambiq.com/hc/en-us/arti … are-cases-

The posting never mentioned the cause of the double-increment, or addressed any other issues that a double-increment might cause. The specific situation they mention is perhaps the only one with a workaround. Their workaround does not address the fact that the system will get ahead in time due to the skipped counts, but at least their workaround keeps the system from skipping over a count that should have caused a compare interrupt.

In the general case, I don’t see any simple workarounds for a timer that doesn’t keep time. Unless you double-check every single increment, you have no idea if a read of STIMER/CTIMER0…7 has triggered a false increment. And if you do double-check every single increment, then it would be simpler to just implement a software counter in the first place.

Implications

There are a number of implications. If clocking the STIMER from the HFRC is OK because you can tolerate its 2% accuracy, then this whole thing is a non-issue. But if you are clocking the STIMER from the crystal because you need timing accuracy, then your system gets very complicated very fast. The most important implication is that if you really need the STIMER to be accurate at all times, then your system is not allowed to read the current contents of any of the timer registers STIMER or CTIMER TMR0 through TMR7, ever. But what good is a timekeeping source if you can’t read it? That’s a tough one. Perhaps there are certain workarounds though. For example, the STIMER capture mechanism seems to be able to read the STIMER without corrupting it. If that does what you need, then that’s OK. In fact, it should be possible to read the STIMER by creating a fake capture event that returns the current time. One annoyance with this method is that the capture events are synced to the STIMER clock (and not the processor clock), so issuing a capture request may take as many as two STIMER clocks to be recognized. So the workaround would work, but a read would be pretty slow.

The general problem of a system reading the STIMER to find out the current time was how I found the problem initially. It turns out that the tickless FreeRTOS implementation provided by Ambiq uses a CTIMER count register as part of its sleep/wakeup processing. Those reads of the CTIMER count were corrupting the STIMER count I was capturing to measure some very precisely timed events. I was able to create a workaround by running FreeRTOS in its standard ‘always ticking’ mode. That fixes the STIMER corruption, but it means that the power-saving advantages of the tickless mode are lost in my system.

I am sure that there are more implications, but the ones listed above strike me as pretty extensive.

Next Steps

So what to do with all this? Well, all silicon has bugs and that’s a fact. I notified Ambiq of this issue a couple of weeks ago. I am still waiting to hear a confirmation of this issue from them. If this turns out to be a real issue in their estimation, I would like to see this documented in their errata. My sincere hope is that Ambiq’s engineering team can figure out some potential workarounds that I am not seeing.

In the meantime, the trick for us users is to understand the nature of the bugs, their implications on a specific system, and to design our systems in a way that avoids triggering these issues. To that end, my FreeRTOS system is back to wasting power in a tickfull mode of operation. My system has excised all stray reads of any CTIMER/STIMER counter registers. Now, my system’s measurements are finally falling within their expected error bounds.

In a bit, I’ll post the source for another test program that shows how to tickle the bug in a different way.

robin_hodgson · March 6, 2021, 7:13pm

Here is the next test program that demonstrates the STIMER double-increment bug. I went down this path because FreeRTOS was reading a CTIMER count that seemed to be the source of my original problem. At the time, it seemed especially weird that reading a CTIMER count could affect the STIMER, but the evidence was there. That raised the question: how many other CTIMER registers could trigger the issue? This next test aimed to find out.

The test is designed around a software STIMER that counts in parallel with the hardware STIMER. They both do their counting from the same clock source, the 32 KHz XTAL that gets driven to a CLKOUT pin. The only difference is that while the hardware timer counts the XTAL clocks in silicon, the software counter simply polls the XTAL CLKOUT signal to decide when to increment its software STIMER count. In theory, the results from either method of counting should be identical: both counters should always hold the same value. But as we know from the first test, merely reading the STIMER count register will cause the hardware timer to double-count, causing the STIMER count to get further and further ahead of where it should be.

To see if reading other registers in the CTIMER block causes trouble, this test program walks through the entire CTIMER address space, reading a particular CTIMER register over and over while it polls for the XTAL clock changes on CLKOUT. After polling a certain number of XTAL ticks, the program compares the XTAL count to the STIMER count. If the STIMER count has diverged from reality, then reading that CTIMER register has tickled the double-increment issue. Astute readers will point out that the one single read of the STIMER used to compare against the software count could itself be the cause of a miscount. This is true, but if the STIMER is more than 1 count off, then the STIMER was being messed up during the loop. If the hardware STIMER is exactly 1 count off, then reading that register is probably fine and the test just got unlucky with that one final read for the comparison operation.

Here is the Arduino source for the test:

#define CLKOUT_PAD 7

void setup()
{
  uint32_t polledCount=0;
  const uint32_t testLen = 50000;
  volatile uint32_t foo;
  uint32_t initialStimerCount;
  uint32_t clkout, clkout_prev;
  
  const am_hal_gpio_pincfg_t clkout_config =
  {
      .uFuncSel            = AM_HAL_PIN_7_CLKOUT,
      .ePullup             = AM_HAL_GPIO_PIN_PULLUP_NONE,
      .eDriveStrength      = AM_HAL_GPIO_PIN_DRIVESTRENGTH_2MA,
      .eGPOutcfg           = AM_HAL_GPIO_PIN_OUTCFG_PUSHPULL
  };

  Serial.begin(115200);
  Serial.printf("STIMER Bug Demonstration Method 2\n");
  Serial.flush();

  // All interrupts off so that we don't miss any XTAL clock edges in our poll loop
  am_hal_interrupt_master_disable();
  
  // Drive the 32KHz XTAL clock to CLKOUT
  am_hal_stimer_config(AM_HAL_STIMER_XTAL_32KHZ);
  am_hal_gpio_pinconfig(CLKOUT_PAD, clkout_config);
  am_hal_clkgen_clkout_enable(true, AM_HAL_CLKGEN_CLKOUT_XTAL_32768);
  
  // Set the Input Enable on the CLKOUT pad so we can poll its state
  GPIO->PADKEY = GPIO_PADKEY_PADKEY_Key;
  GPIO->PADREGB_b.PAD7INPEN  = GPIO_PADREGB_PAD7INPEN_EN;
  GPIO->PADKEY = 0;
  
  // Enable burst mode so we can poll as fast as possible
  am_hal_burst_avail_e peBurstAvail;
  am_hal_burst_mode_e peBurstStatus;
  am_hal_burst_mode_initialize(&peBurstAvail);
  am_hal_burst_mode_enable(&peBurstStatus);
  
  // The test scans through every single register in the CTIMER block one at a time
  for (uint32_t addr=0x40008000; addr<=0x4000830c; addr+=4) {
    polledCount = 0;
      
    // Wait for a rising edge on the XTAL clock
    clkout_prev = GPIO->RDA & (1<<CLKOUT_PAD);
    while (1) {
      clkout = GPIO->RDA & (1<<CLKOUT_PAD);
      if (clkout != clkout_prev) {
        clkout_prev = clkout;
        if (clkout) {
          // That's our first rising edge!  Make a note of the original STIMER count.
          // It's OK if this initial count resulted in a double-count because it represents time zero.
          initialStimerCount = CTIMER->STTMR;
          break;
        }
      }
    }
  
    while (polledCount < testLen) {
      clkout = GPIO->RDA & (1<<CLKOUT_PAD);
      if (clkout != clkout_prev) {
        if (clkout) {
          // we just saw a 0->1 transition on clkout
          polledCount++;
        }
        clkout_prev = clkout;
      }
      else {
        // Read the current register address to see if the read results in extra counts in STIMER
        foo = *(volatile uint32_t*)addr;
      }
    }
    uint32_t stimer_count = CTIMER->STTMR - initialStimerCount;
    uint32_t extraTicks = stimer_count - polledCount;
    am_hal_interrupt_master_enable();
    if (extraTicks) {
      Serial.printf("%08X: polled count: %lu, stimer: %lu, %lu extra STIMER counts\n", addr, polledCount, stimer_count, extraTicks);
    }
    else {
      // Serial.printf("%08X: No extra counts\n", addr);
    }
    Serial.flush();
    am_hal_interrupt_master_disable();
  }
  am_hal_interrupt_master_enable();
  Serial.printf("Done!\n");
}

void loop() {
  // put your main code here, to run repeatedly:

}

The program only prints out the addresses of registers that triggered double-counts. It takes roughly four minutes to run, so here is some sample output if you don’t want to wait:

STIMER Bug Demonstration Method 2
40008000: polled count: 50000, stimer: 50032, 32 extra STIMER counts
40008020: polled count: 50000, stimer: 50026, 26 extra STIMER counts
40008040: polled count: 50000, stimer: 50026, 26 extra STIMER counts
40008060: polled count: 50000, stimer: 50030, 30 extra STIMER counts
40008080: polled count: 50000, stimer: 50030, 30 extra STIMER counts
400080A0: polled count: 50000, stimer: 50032, 32 extra STIMER counts
400080C0: polled count: 50000, stimer: 50025, 25 extra STIMER counts
400080E0: polled count: 50000, stimer: 50019, 19 extra STIMER counts
40008144: polled count: 50000, stimer: 50018, 18 extra STIMER counts
Done!

As you can see, there are a total of 9 registers in the CTIMER address space, which if read during the polling loop, cause the hardware STIMER to see counts that the software polling loop says are not there. If you decode those register addresses, those 9 registers represent the STIMER counter (which we already knew about from the test in the first post), as well as all 8 of the CTIMER TMR0 through TMR7 counters.
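To make the decoding concrete, here is a small host-side sketch that maps an address from the output above to a register name. It assumes the layout implied by that output: the eight TMRn counters sit 0x20 apart starting at 0x40008000, and the STIMER count (STTMR) sits at offset 0x144. The helper name triggeringRegisterName is made up for illustration; check the offsets against the datasheet before relying on them.

```cpp
#include <cstdint>
#include <string>

// Offsets inferred from the nine addresses in the sample output (an assumption):
// eight counters spaced 0x20 apart from the CTIMER base, plus one at offset 0x144.
constexpr std::uint32_t kCtimerBase  = 0x40008000;  // first address scanned by the test
constexpr std::uint32_t kTimerStride = 0x20;        // spacing between TMR0..TMR7
constexpr std::uint32_t kSttmrOffset = 0x144;       // STIMER count register (STTMR)

// Hypothetical helper: names the register if it is one of the nine that
// triggered extra counts, otherwise returns an empty string.
std::string triggeringRegisterName(std::uint32_t addr) {
    if (addr < kCtimerBase) {
        return "";
    }
    std::uint32_t offset = addr - kCtimerBase;
    if (offset == kSttmrOffset) {
        return "STTMR";
    }
    if (offset % kTimerStride == 0 && offset / kTimerStride < 8) {
        return "TMR" + std::to_string(offset / kTimerStride);
    }
    return "";
}
```

With this mapping, 400080E0 decodes to TMR7 and 40008144 to STTMR, while an address such as 40008004 (not in the output) decodes to nothing.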

There is a statistical element to this test. If you run it again, you will get different numbers of extra counts, but it will always be roughly the same odds. In the example run above, it’s roughly 1 in 2500 times that the counter mis-counts. Other runs will give different answers. Sometimes, it is closer to 1 read in 5000. Your mileage will vary.

robin_hodgson · March 6, 2021, 7:32pm

Forgot: if it wasn’t clear, those Arduino test programs will run on any SparkFun Artemis board. There is no additional wiring that needs to be done. Just connect your board via USB cable so you can see the Arduino Serial output console and you are good to go.

Dr.T · April 24, 2021, 1:57pm

Hello,

I just tried out the last test program several times, always with no error reports. I use an Artemis Nano board (ordered 2 weeks ago). Has there been a silicon update?

BR, Andreas

paulvha · April 24, 2021, 3:40pm

I was working on SoftwareSerial for V2.0.6 and ran into an issue that might be related to this.

With SoftwareSerial, timing is critical, and it becomes more critical at higher baud rates, because the bit values have to be captured at the right moment. It turned out that execution speed differs by a factor of 3 to 5 between V1.2.1 and V2.0.6 (which has Mbed/RTOS).

Using the SAME program, the time between a level change on the RX line and the interrupt routine in the HAL driver being called is 1.5us on V1.2.1, whereas it is 5us on V2.0.6. It is constant: about 3 times as long on the same ATP board, while the processor only needs to handle the interrupt.

A loop (the same one again) in the HAL interrupt driver takes 11us on V1.2.1, while it takes 33us on V2.0.6. THREE times longer, and continuously reproducible. I have documented my experience and SoftwareSerial code on https://github.com/paulvha/apollo3/tree … wareSerial.

I am now working on the CMOS camera on the Edge. It took a number of changes to power the camera on and get the different structures in sync, and some driver code is integrated with the MBED-OS target instead of the library.

I am now running into the same timing issue. Getting the CMOS camera data with a pattern triggered at a frequency of 12MHz works great (NO ISSUES) on V1.2.1 but fails on V2.0.6. After changing the pattern to be sent, which influences the speed at which the camera sends the pixel information, I am making some progress. Not there yet.

To me it looks as if the processor gets interrupted all the time to handle many more actions on V2.0.6 (Mbed/RTOS ??) than on V1.2.1. I have not figured out why, but as a result you COULD see double counts happening in STIMER, as more time has passed on V2.0.6 than the delay in the sketch.

robin_hodgson · April 24, 2021, 4:31pm

Dr. T.:
Hello,

I just tried out the last test program several times, always with no error reports. I use an Artemis Nano board (ordered 2 weeks ago). Has there been a silicon update?

BR, Andreas

I am not aware of any silicon revision past ‘B0’. It is weird (to me) that you are not seeing anything. I have tried the tests on a number of different processors with both A1 and B0 silicon and I see the issues consistently. The issue does seem to be related to the relationship between the HFRC and XTAL clock domains, so it is conceivable that a specific processor might have XTAL and HFRC clocks that are not triggering the issue. You might want to warm your XTAL with your fingertip to change its frequency slightly.

Can you try the first test program too? To verify what silicon revision you are running, modify the test code at the start of setup() from this:

  Serial.printf("STIMER bug demonstration #1\n");
  Serial.flush();

to this:

  Serial.printf("STIMER bug demonstration #1\n");
  am_hal_mcuctrl_device_t pInfo;
  am_hal_mcuctrl_info_get(AM_HAL_MCUCTRL_INFO_DEVICEID, (void *)&pInfo);
  uint8_t majorRev = (pInfo.ui32ChipRev >> 4) & 0xF;
  uint8_t minorRev = pInfo.ui32ChipRev & 0xF;
  Serial.printf("Processor revision: %c%c\n", 'A' + majorRev - 1, '0' + minorRev - 1);
  Serial.flush();

It will print out the silicon revision that you are running. I am expecting it to be ‘B0’, but this will prove what you have.
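If you want to check the decoding arithmetic without a board, here is a minimal host-side sketch of the same calculation. The function name decodeChipRev and the raw values 0x21 and 0x12 are illustrative assumptions derived from the formula in the snippet, not values read from real hardware.

```cpp
#include <cstdint>
#include <string>

// Decodes a chip revision value using the same arithmetic as the snippet above:
// bits 7..4 hold the major revision (1 -> 'A', 2 -> 'B') and bits 3..0 hold the
// minor revision (1 -> '0', 2 -> '1'). A field value of 0 is not handled specially.
std::string decodeChipRev(std::uint32_t chipRev) {
    std::uint8_t majorRev = (chipRev >> 4) & 0xF;
    std::uint8_t minorRev = chipRev & 0xF;
    std::string rev;
    rev += static_cast<char>('A' + majorRev - 1);
    rev += static_cast<char>('0' + minorRev - 1);
    return rev;
}
```

Under this formula a raw value of 0x21 decodes to B0 and 0x12 decodes to A1, which are the two revisions discussed in this thread.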

matt-bathyscope · June 6, 2021, 6:45pm

This seems related https://support.ambiq.com/hc/en-us/arti … are-cases-

It is possible that there is a clock glitch which may cause STIMER double count so that its interrupt is lost. STIMER supports multiple comparators. Currently CMPA is used. One solution is to use another comparator CMPB as a backup. If the STIMER double counts, then CMPA will not see it, but CMPB will see it immediately by setting CMPB one count higher than CMPA.
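To illustrate why the backup comparator helps, here is a toy host-side model. It is only a sketch under stated assumptions: a comparator is assumed to fire only on an exact match with the counter, and the double count is forced at a chosen value (glitchAt). The names runCounter, CompareResult and glitchAt are made up for illustration and are not Ambiq HAL names.

```cpp
#include <cstdint>

struct CompareResult {
    bool cmpaFired;  // primary comparator matched
    bool cmpbFired;  // backup comparator (CMPA + 1) matched
};

// Toy model (an assumption, not the real hardware): the counter steps by 1 on each
// clock, except that it steps by 2 when it currently holds `glitchAt`. A comparator
// fires only when the counter exactly equals its compare value.
CompareResult runCounter(std::uint32_t start, std::uint32_t clocks,
                         std::uint32_t cmpa, std::uint32_t glitchAt) {
    const std::uint32_t cmpb = cmpa + 1;  // backup comparator, one count later
    CompareResult result{false, false};
    std::uint32_t count = start;
    for (std::uint32_t i = 0; i < clocks; i++) {
        count += (count == glitchAt) ? 2 : 1;
        if (count == cmpa) result.cmpaFired = true;
        if (count == cmpb) result.cmpbFired = true;
    }
    return result;
}
```

With CMPA at 5 and a double count forced at 4, the counter goes 3, 4, 6 and only the backup comparator fires. Note that the model says nothing about the time error itself: the counter is still one count ahead afterwards.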

Edit to add more context:

The support article linked from robin_hodgson’s original post seems to acknowledge an unaddressed silicon issue. I’m seeing this behavior as well and I agree the suggested workaround is not really a solution.

robin_hodgson · June 7, 2021, 7:26pm

It still irks me that Ambiq is not documenting this timer bug as an erratum. IMHO, a counter that potentially miscounts simply because it was read seems worthy of an erratum, especially when there is no general workaround.

I submitted a carefully researched and thorough bug report to them 4 months ago. That resulted in some conversation with them regarding the issue, but when the topic came to workarounds, I got ghosted. It’s been three months since then. Maybe that’s just how they work though. I filed a bug against their SDK 2.5.1, and that has been open for 8 months now. I pinged them 4 months ago and they said that they would fix it, although there has been no official fix or patch release since then. I’m not holding my breath regarding either issue at this point.

The end result is that there will continue to be users who will waste a bunch of time rediscovering this issue simply because Ambiq won’t document it. That’s not very customer-friendly.

KyleW · June 10, 2021, 6:49pm

This is interesting.

I have spent some time trying to figure out if the STIMER is still keeping accurate time despite clearly skipping a beat every once in a while.

My theories were that perhaps:

It is reporting a reading for longer than it should in an attempt to prevent a misread. For example, it might report 00 for as long as 00 and 01 together should last, and then report 02 on schedule. I tested this by traversing the buffer of collected values around a double increment, and as far as I can tell this is not happening.

or

It is occasionally taking unusually long to grab the value of the register. For example, a read of the value 00 might occasionally take so long that it would only be accurate to report 02 next. I tested this by pulsing a GPIO on every read. It is difficult to say for certain, but I couldn’t find an unusually long pulse on the misreads.

So I am convinced that it is double counting AND that these double counts are not “making up” for anything.

What I am still trying to figure out is the impact of these misreads. I could see a real problem if you were looking for a specific value and it missed it; otherwise I can’t see a real problem with the time given this example. I ran the example to a full 10 errors twice. The first time it took 495 loops, the next it took 524. During these loops, each timer value was sampled 66 to 67 times before the next value came up (no delay between reads). So by my estimate the worst case would be 10 failures per 495 loops of (250000/67) ticks, or an error of about 1 in 184701, or 0.00054%.
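The estimate above can be reproduced with a few lines of host-side code. This is just the arithmetic from the paragraph written out; the struct and function names are made up for illustration.

```cpp
struct ErrorRate {
    double totalTicks;     // STIMER ticks observed across all loops
    double ticksPerError;  // one double-increment per this many ticks
    double percent;        // error rate expressed as a percentage
};

// loops:          number of times the buffer was filled
// samplesPerLoop: buffer length (readings per fill)
// samplesPerTick: how many readings see the same timer value
// errors:         double-increments found in total
ErrorRate estimateErrorRate(double loops, double samplesPerLoop,
                            double samplesPerTick, double errors) {
    ErrorRate r;
    r.totalTicks = loops * (samplesPerLoop / samplesPerTick);
    r.ticksPerError = r.totalTicks / errors;
    r.percent = 100.0 / r.ticksPerError;
    return r;
}
```

Calling estimateErrorRate(495, 250000, 67, 10) gives roughly 1.85 million ticks in total, one error per roughly 184701 ticks, and about 0.00054%, matching the figures quoted above.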

This doesn’t seem sufficient to explain the problems you were reporting (~2%, right?). I think there is still a piece we (or I) are missing.

KyleW · June 10, 2021, 6:52pm

I forgot to attach, this is the sketch I am currently using.

// This is the simplest possible demonstration of the Apollo3 double-increment issue.
// All we need to do is to use the CPU to read the STIMER count faster than 
// it is incrementing. By logging the data, we can clearly see it double-increment
// once in a while.

const uint32_t bufLen = 250000;
uint8_t buffer[bufLen];

#define GPIO_DEBUG

void setup()
{
  Serial.begin(115200);
  Serial.printf("STIMER bug demonstration #1\n");
  Serial.flush();

  pinMode(18, OUTPUT);
    
  // Uncomment one of the following lines to test an STIMER clock source.
  // The issue appears when STIMER is clocked with the 32KHz XTAL or the LFRC.
  // If the timer is running faster, the bug occurs more often.
  am_hal_stimer_config(AM_HAL_STIMER_XTAL_32KHZ);
  //am_hal_stimer_config(AM_HAL_STIMER_LFRC_1KHZ);
  
  uint32_t errors = 0;

  // Run the test repeatedly until we get at least 10 errors.
  uint16_t numLoops = 0;
  while (errors<10) {
    Serial.printf("Filling the buffer...\n");
    delay(200);

    // Disable all interrupts during this read loop to be certain that nothing
    // disturbs our sequence of reads
    am_hal_interrupt_master_disable();
    
    for (uint32_t i = 0; i<bufLen; i++) {
#ifdef GPIO_DEBUG
      am_hal_gpio_output_set(18);
#endif
      buffer[i] = CTIMER->STTMR & 0xFF;
#ifdef GPIO_DEBUG
      am_hal_gpio_output_clear(18);
#endif
      // This delay is unnecessary - It just helps see the counting context that gets printed out when the errors occur.
      // If you remove it, you will only see the counter values N and N+2 because they get read so many times in a row.
      // If the STIMER is clocked via LFRC, it is so slow that this delay won't help with the context.
      //am_hal_flash_delay(FLASH_CYCLES_US(1));
    }

    // Turn interrupts back on so Serial IO will work
    am_hal_interrupt_master_enable();
    
    Serial.printf("Scanning for oddities\n");
    for (uint32_t i=1; i<bufLen; i++) {
      // Calculate the difference between this reading and the one that came before it.
      uint8_t diff = (buffer[i] - buffer[i-1]);

      // The only possible values that the variable 'diff' can have are:
      //  0 if the timer didn't increment between the two successive readings, or
      //  1 if the timer did increment between the two successive readings
      if ((diff == 0) || (diff == 1)) {
        // These are the expected cases: everything is fine
      }
      else {
        errors++;
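        // Cast the pointer to an integer type so it matches the %X format specifier
        Serial.printf("%lu, 0x%08X: %02X->%02X\n    ", i, (unsigned int)(uintptr_t)&buffer[i-1], buffer[i-1], buffer[i]);
        // Show some context to the unusual change in counter value, marking the moment of the change with a "->"
        for (int j=-16; j<16; j++) {
          // Compute the index as a signed value so that the lower-bound check is meaningful
          int32_t idx = (int32_t)i + j;
          if ((idx >= 0) && (idx < (int32_t)bufLen)) {
            Serial.printf("%02X%s", buffer[idx], (j == -1) ? "->" : " ");
          }
        }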

        Serial.printf("\n");
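        // Walk backwards through up to 10 runs of identical values, stopping at the start of the buffer
        int32_t k = (int32_t)i - 1;
        for(int x = 0; (x<10) && (k >= 0); x++)
        {
          uint32_t numCycles = 0;
          uint8_t prevVal = buffer[k];
          while((k >= 0) && (buffer[k] == prevVal))
          {
            numCycles++;
            k--;
          }
          Serial.printf("previous val %02X repeated %lu times\r\n", prevVal, numCycles);
        }
        // Walk forwards through up to 10 runs of identical values, stopping at the end of the buffer
        k = (int32_t)i;
        for(int x = 0; (x<10) && (k < (int32_t)bufLen); x++)
        {
          uint32_t numCycles = 0;
          uint8_t nextVal = buffer[k];
          while((k < (int32_t)bufLen) && (buffer[k] == nextVal))
          {
            numCycles++;
            k++;
          }
          Serial.printf("next val %02X repeated %lu times\r\n", nextVal, numCycles);
        }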
      }
    }
    Serial.printf("%lu errors detected so far\n", errors);
    numLoops++;
  }
  Serial.printf("Done! 10 errors found in %d loops\n", numLoops);
}

void loop() {
  // put your main code here, to run repeatedly:

}

robin_hodgson · June 10, 2021, 10:56pm

My second test was designed to prove that the timer double counts. It drives the XTAL clock to a CLKOUT pin and at the same time uses a tight software loop to manually count those transitions on CLKOUT. It compares the manual count to the count in the timer. Since the timer and CLKOUT pin are driven by the same internal XTAL clock source, the counts should never, ever diverge regardless of the method used to count the XTAL clocks. But the counts from the two methods do diverge. I proved to myself that the XTAL oscillator is OK and the data on the CLKOUT pin is correct. I connected both a scope and frequency counter to the CLKOUT pin and set them to trigger on out-of-spec CLKOUT timing. Both pieces of equipment swear that the CLKOUT timing is perfect when the timer counter register skips a count. The only possibility is that the timer count is wrong: it has to be the timer double-incrementing. As mentioned earlier, Ambiq admits that their timers double increment, but they never said what triggered this issue, or how often it might happen. All that they said is that it is “rare”. Maybe Ambiq and I just disagree on what “rare” means.

I have shown that all you need to do to trigger the bug is to read the current timer value for any of the timer count registers, STIMER or CTIMER0 through CTIMER7. Each time you read any one of those counters, there is a chance that the STIMER will silently double-increment. This is especially annoying because there is some other bug in the chip whose official workaround in the Ambiq HAL is to read the timer count value three times to determine the actual timer count. That workaround makes it three times more likely that your STIMER count will get corrupted each time you read any timer counter using the HAL!

As for how this affects people, well, that’s an open question. At a minimum, it is certainly something to be aware of. Imagine if the Apollo3 had a bug where when you added two numbers together, there was a chance that the result might be reported as 1 larger than it should be, without any mechanism to detect the corruption. You could still write a large class of programs that would function just fine under those circumstances. Likewise, a timer that skips a value once in a while means that from time to time, 1/32768 of a second will appear to pass when it shouldn’t have. A large class of programs will never notice. However, some will! I found the bug exactly because I needed to use the timer to make precision measurements of the passage of time, and I noticed that the timer counting time was moving faster than the real time. That really tossed a wrench into my plans, so the bug is not without its potentially severe side effects for some applications.
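For a rough feel of the size of the effect, here is a back-of-the-envelope sketch. It assumes that every counter read independently triggers a double-increment with a fixed probability, which is a simplification (the measured odds earlier in this thread ranged from about 1 read in 2500 to 1 read in 5000 and were taken while reading constantly). The function names and the example workload are hypothetical.

```cpp
// STIMER tick rate when clocked from the 32KHz XTAL
constexpr double kStimerHz = 32768.0;

// Expected number of spurious STIMER ticks, assuming each counter read independently
// triggers a double-increment once every `readsPerError` reads (a simplification).
double expectedExtraTicks(double reads, double readsPerError) {
    return reads / readsPerError;
}

// Converts spurious ticks into the number of seconds by which the STIMER runs ahead.
double ticksToSeconds(double ticks) {
    return ticks / kStimerHz;
}
```

As a hypothetical example, 100 reads per second for an hour is 360000 reads; at 1 error per 2500 reads that is 144 extra ticks, or about 4.4 ms of accumulated error. Tripling the number of reads, as the HAL's triple-read workaround does, triples the expected error under this model.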

robin_hodgson · June 10, 2021, 10:59pm

Forgot: the whole thing about the 2% error comment is that the timer bug only shows up when the timer is clocked from the XTAL clock domain. If both the processor and timer are clocked from HFRC, the double-increments do not occur. The only issue with clocking them both from HFRC is that the HFRC is only accurate to 2%. If you need the timing precision of the XTAL clock, beware that it’s not going to be as precise as you think due to this bug.

prittenhouse · May 14, 2022, 8:17pm

Hi Robin,

Sorry to jump on an old thread. First I wanted to thank you for all the effort you put into providing such a detailed investigation. I just wanted to clarify one thing. In your description of the bug you say

all you need to do to trigger the bug is to read the current timer value for any of the timer count registers, STIMER or CTIMER0 through CTIMER7. Each time you read any one of those counters, there is a chance that the STIMER will silently double-increment.

In my application I am planning to use a CTIMER to keep track of elapsed time. Do you know if this bug also causes CTIMERs to double increment or is it only an issue on STIMER?

Best regards,

Phil

robin_hodgson · May 14, 2022, 8:30pm

It would appear that only the STIMER has the double-increment problem. You should be good to go using the CTIMER.

prittenhouse · May 15, 2022, 8:57pm

Great! Thanks for the quick reply
