Got myself a few months ago into the optimization rabbit hole as I had a slow quant finance library to take care of, and for now my most successful optimizations are using local memory allocators (see my C++ post, I also played with mimalloc which helped but custom local memory allocators are even better) and rethinking class layouts in a more “data-oriented” way (mostly going from array-of-structs to struct-of-arrays layouts whenever it’s more advantageous to do so, see for example this talk).

What are some of your preferred optimizations that yielded sizeable gains in speed and/or memory usage?