Getting 1 kHz from a 32.768 kHz crystal

I soldered a 32768 Hz crystal to my Seeeduino Film and set Timer2 to asynchronous mode as follows:

    cli();

    // stop timer2 for now
    TCCR2B  = 0;

    // Enable asynchronous mode
    TIMSK2 = 0;
    ASSR |= _BV(AS2);

    TCNT2 = 0;
    OCR2A = 32;

    // set CTC mode, clear PWM modes
    TCCR2A = _BV(WGM21);

    // no prescaler, start clock
    TCCR2B = _BV(CS20);

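    // clear any pending compare-match flag (writing 1 clears it;
    // plain assignment avoids clearing other set flags as |= would),
    // then enable the interrupt on compare match
    TIFR2 = _BV(OCF2A);
    TIMSK2 |= _BV(OCIE2A);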

    // set up variables
    remainder = 768;
    milliCount = 0;

    sei();

I implemented the Timer2 compare-match routine as follows, with the idea of averaging 1 ms per tick:

    ISR(TIMER2_COMPA_vect) {
        ++milliCount;
        remainder += 768;
        if (remainder >= 1000) {
            remainder -= 1000;
            OCR2A = 33;
        } else {
            OCR2A = 32;
        }
    }
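The accumulator in the routine above can be checked off-target. The following is a minimal host-side sketch, not part of the original code: `simulate_ticks`, `struct tick_stats` and the `short_ticks` parameter are hypothetical names introduced for illustration. It replays the same `remainder` bookkeeping and adds up how many crystal ticks the chosen periods would span. One assumption worth stating: in CTC mode the counter runs from 0 up to and including OCR2A, so a compare value of N gives a period of N + 1 ticks.

```c
#include <stdint.h>

/* Result of replaying the ISR's bookkeeping for a number of interrupts. */
struct tick_stats {
    uint32_t long_periods;   /* interrupts that selected the longer period  */
    uint32_t short_periods;  /* interrupts that selected the shorter period */
    uint32_t total_ticks;    /* crystal ticks spanned by all chosen periods */
};

/*
 * Replays the accumulator from the ISR on the host.
 * short_ticks is the length of the shorter period in crystal ticks
 * (hypothetical parameter, not present in the original code).
 */
static struct tick_stats simulate_ticks(uint32_t interrupts, uint32_t short_ticks)
{
    struct tick_stats s = {0, 0, 0};
    uint16_t remainder = 768;           /* same starting value as the setup code */

    for (uint32_t i = 0; i < interrupts; ++i) {
        remainder += 768;               /* 0.768 of a tick owed per millisecond */
        if (remainder >= 1000) {
            remainder -= 1000;          /* a whole extra tick has accumulated */
            ++s.long_periods;
            s.total_ticks += short_ticks + 1;
        } else {
            ++s.short_periods;
            s.total_ticks += short_ticks;
        }
    }
    return s;
}
```

Calling `simulate_ticks(1000, 32)` gives 768 long and 232 short periods spanning 32768 ticks, so the averaging itself works as intended when the periods really are 32 and 33 ticks long. With the N + 1 rule, compare values of 32 and 33 correspond to `simulate_ticks(1000, 33)`, which spans 33768 ticks, about 3% slow; compare values of 31 and 32 would give the exact figure. Neither case accounts for a clock running at double speed.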

However, my clock seems to be running at about double speed! I made sure to set the WGM mode to CTC, so it should average 32.768 ticks of the 32768 Hz crystal per interrupt (32.768 / 32768 s = 1 ms), then reset and start over.

Any ideas?

Thanks in advance

In case it's useful to anyone, I had a tight sleep loop:

    while (!condition) sleep_cpu();

I think I was putting the processor back to sleep before the compare-match interrupt flag could be cleared. Adding a delayMicroseconds(32) to the loop, roughly one cycle of the 32768 Hz crystal (about 30.5 µs), seemed to make it work.
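A possible alternative to the fixed delay, sketched under assumptions: the ATmega datasheet's notes on asynchronous Timer2 operation suggest writing to one of the Timer2 registers and waiting for its update-busy flag in ASSR to clear before re-entering sleep, which guarantees that at least one crystal cycle has passed. The helper name `sleep_after_tosc_cycle` is hypothetical, the use of OCR2B as a dummy register assumes compare channel B is otherwise unused (as in the setup above), and the `#else` branch contains host stand-ins only so the sketch also compiles off-target.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#include <avr/sleep.h>
#else
/* Host stand-ins (hypothetical) so the sketch also compiles off-target. */
static volatile uint8_t ASSR, OCR2B;
static unsigned sleep_calls;
#define OCR2AUB 3
#define OCR2BUB 2
#define _BV(bit) (1u << (bit))
static void sleep_cpu(void) { ++sleep_calls; }
#endif

/*
 * Waits for pending asynchronous register updates, performs a dummy
 * write to OCR2B, waits for that write to be latched, then sleeps.
 * Assumes set_sleep_mode() and sleep_enable() were called beforehand.
 */
static void sleep_after_tosc_cycle(void)
{
    const uint8_t busy = _BV(OCR2AUB) | _BV(OCR2BUB);

    while (ASSR & busy) { }   /* let the ISR's OCR2A write finish first */
    OCR2B = 0;                /* dummy write; assumes channel B is unused */
    while (ASSR & busy) { }   /* cleared once the write has been latched */
    sleep_cpu();
}
```

The loop would then read `while (!condition) sleep_after_tosc_cycle();`, with `condition` declared `volatile` if it is set from the interrupt routine. Whether this behaves better than the delay on a given board is something to verify on the hardware.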