Take a look at the ARM Technical Reference Manual for the specific CPU you are using. Although they are all called ARM, there are many different versions, and some of the earliest ones had the kind of limitations you describe.
The architecture is pipelined, meaning more than one instruction is in progress at the same time. The pipelines are now complex enough to either take a copy of the register value or freeze (stall) the conflicting stages until the conflict is resolved. Compilers no longer insert NOPs for this, and I’m not sure they ever did after the earliest parts.
In some of the bigger parts, accesses to peripherals and/or memory can be queued up and actually performed out of order unless you set the conditions that force ordering. So it gets really interesting. Some peripherals are on faster buses than others, so unless you force the order, things may happen out of order.
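As a rough illustration of forcing order between two accesses, here is a minimal sketch in C. The register names are hypothetical stand-ins (plain variables, so the code builds on a desktop machine), and the portable C11 fence is used only as a placeholder. In real firmware you would more likely use the barrier intrinsics from your vendor headers, such as CMSIS `__DMB()` or `__DSB()`, and which one you need depends on the part, so check the reference manual.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for two memory-mapped registers that sit on
   different buses. On real hardware these would be fixed addresses taken
   from the reference manual, not ordinary variables. */
static volatile uint32_t fake_clock_enable;
static volatile uint32_t fake_uart_ctrl;

/* Turn on the (pretend) peripheral clock, then configure the peripheral.
   The second write only makes sense once the first has taken effect. */
void enable_then_configure(void)
{
    fake_clock_enable = 1u;

    /* Full fence: a placeholder for the hardware barrier. On ARM targets
       compilers typically emit a DMB instruction here. */
    atomic_thread_fence(memory_order_seq_cst);

    fake_uart_ctrl = 0x5u;
}
```

Calling `enable_then_configure()` once from your init code is the intended use. Note that `volatile` on its own only stops the compiler from dropping or reordering these accesses relative to each other; it does not tell the bus fabric anything, which is why the barrier is there.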
All in the name of getting faster performance. I’m sure glad I’m not designing CPUs anymore.