Discussion about this post

Mauro Sciancalepore:

Hey, nice read :) My two cents, maybe three.

In science there are cycles. Right now the world is focused on modeling everything as generation, mostly because of ChatGPT (i.e. scale). Let's be real!

The combination of scale and data has indeed proven causal language modeling (CLM) to be an effective pretraining technique.

But do we all need, or even have, trillion-scale datasets and $100k worth of GPU clusters?

Do we need billion-parameter networks to do classification? I don't think so, and your sweet employer paying the AWS bill doesn't know it. The problem is that we're doing it anyway.

Also, let's be even more real and exclude the MAANG context, because we all know those are industry outliers, and what works there doesn't work in "real-world" companies, if you'll pass me the term.

Back to us…

The "generation" modeling paradigm is creating extreme inefficiencies, but it's also bringing super cool optimizations to the table, out of necessity I'd say. See GQA, FlashAttention-3, quantization, etc.

Has this already happened in the past? Absolutely. We start with clever, highly engineered artefacts, mostly because of hardware constraints, and as the burden of those limits eases, we dare to push the boundaries of creativity. At what cost? Efficiency, of course.

If you think about it, with LLMs we have built (I think) the most energy-hungry artefact in the history of computer science. Just take a look at Meta's Llama training reports. Crazy, isn't it?

I think BERT belonged to a period when DL was still contaminated with classical ML approaches, inductive biases, etc. But man, did it do its job or what?

If you're into the AI scene, you're most probably aware of SLMs and the agentic paradigm. Why that, and not pushing for one huge chad LLM? Practicality. Real life and bills.

Sure, tech will probably get cheaper, but in AI-land the moat is still whoever can afford it *at scale* either way.

Just like neural nets came back from the dead thanks to the crazy, unforeseen combo of GPUs and data, the same will keep happening, for architectures too, in a cycle.

See RNNs and Mamba: the parallel scan algorithm unlocked their efficiency (toy sketch of the idea just below).
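
(For context, here is a minimal numpy sketch of what I mean by parallel scan: my own toy illustration, not Mamba's actual selective-scan kernel, and the function names are mine. The recurrence h_t = a_t*h_{t-1} + b_t looks inherently sequential, but composing affine maps is associative, so the whole prefix can be computed in O(log T) parallel rounds instead of T sequential steps.)

```python
import numpy as np

def combine(left, right):
    # Compose two affine maps h -> a*h + b, with `right` applied after `left`:
    # a2*(a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2)
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def sequential_recurrence(a, b, h0=0.0):
    # Plain RNN-style loop: h_t = a_t * h_{t-1} + b_t  -> T sequential steps.
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def scan_recurrence(a, b, h0=0.0):
    # Hillis-Steele inclusive scan over the associative `combine` operator:
    # log2(T) rounds, and each round's combines are independent (GPU-friendly).
    A, B = a.copy(), b.copy()
    shift = 1
    while shift < len(a):
        new_A, new_B = combine((A[:-shift], B[:-shift]), (A[shift:], B[shift:]))
        A[shift:], B[shift:] = new_A, new_B
        shift *= 2
    # The prefix at step t composes f_1..f_t, so h_t = A[t]*h0 + B[t].
    return A * h0 + B

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=16)
b = rng.normal(size=16)
assert np.allclose(sequential_recurrence(a, b), scan_recurrence(a, b))
```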

See FlashAttention: the online softmax trick was already well known and used by NVIDIA engineers, but on other problems, even while attention sat there waiting with its O(n²) complexity. Then Tri Dao came along and connected the dots, letting us run many more models, even locally! (Another toy sketch below.)
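
(The softmax trick in question, again as my own toy illustration rather than the actual FlashAttention kernel: an online softmax keeps only a running max and a running sum, rescaling the sum whenever the max grows, so the scores can be consumed block by block. FlashAttention applies the same rescaling to its accumulated output block, which is what lets it avoid materializing the full n x n attention matrix.)

```python
import numpy as np

def softmax_reference(x):
    # Standard numerically stable softmax: needs the whole vector at once.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def softmax_online(x, block_size=4):
    # Online softmax: stream over the scores in blocks, keeping only a running
    # max `m` and running normalizer `s`. When a new block raises the max, the
    # old sum is rescaled by exp(m_old - m_new).
    m = -np.inf   # running max
    s = 0.0       # running sum of exp(x - m)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        m_new = max(m, block.max())
        s = s * np.exp(m - m_new) + np.exp(block - m_new).sum()
        m = m_new
    # Second pass only for this toy example, to emit the normalized weights;
    # FlashAttention instead rescales its running weighted sum of values.
    return np.exp(x - m) / s

x = np.random.default_rng(0).normal(size=10)
assert np.allclose(softmax_reference(x), softmax_online(x))
```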

Sometimes we all just need a different perspective :)

