Discussion about this post

Pawel Jozefiak:
The shift from a modeling challenge to a systems engineering challenge is where things get interesting. Most people building on top of these models never touch the optimization stack, but it shows up indirectly in cost and latency.

Flash Attention and ZeRO are a useful framing for understanding why inference costs stay sticky even as raw compute prices drop. Still, it feels like this complexity mostly lives at labs and large enterprises. Curious where you see the "you need to care about this" threshold sitting for production builders who aren't running their own infra.
