I had trouble with Atmel’s application note, so I redid it. I also switched to assembly and made the SCL timing as uniform as possible. This runs at about 400 kHz (a bit faster at 10 MHz, a bit slower at 8 MHz, but the 8 MHz case works without a crystal).
I should clarify: the 2 Mbaud output goes to my FTDI serial breakout, with the FTDI set to 2000000 baud. Yes, it is async: 1 start bit, 8 data bits, and a minimum of 1 stop bit (usually somewhat longer, since you have to re-enter the routine and that takes a few clocks). With a 15 MHz crystal it would be 3 Mbaud.