Jan 25

TLDR: We often obsess over attention mechanisms and context windows, but two papers suggest that normalization is the actual bottleneck for training speed and stability.

2 Comments

Pratilipi

Jan 26

I have one question: is this valid only for transformer-like architectures, or can it be used with other architectures as well? For instance, I was training a two-tower model and saw that the user tower and item tower have large embedding gradients. That is expected behaviour, which is why we use Adagrad, since the gradient updates are sparse. I was wondering whether adding BatchNorm after the embedding and optimizing with Adam would work. What do you think?
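
To make the setup in this question concrete, here is a minimal sketch of it, assuming PyTorch. The class names (`Tower`, `TwoTower`), the vocabulary sizes, the embedding width, and the random batch are all hypothetical and only there for illustration. It shows how BatchNorm can sit right after the embedding lookup with Adam as the optimizer; it does not show whether this trains better than Adagrad.

```python
import torch
import torch.nn as nn


class Tower(nn.Module):
    """One tower: embedding lookup -> BatchNorm -> small MLP (hypothetical layout)."""

    def __init__(self, num_ids: int, dim: int = 16):
        super().__init__()
        # Default sparse=False, so the embedding produces dense gradients,
        # which torch.optim.Adam accepts.
        self.embed = nn.Embedding(num_ids, dim)
        # Normalizes each embedding feature across the batch.
        self.norm = nn.BatchNorm1d(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.norm(self.embed(ids)))


class TwoTower(nn.Module):
    """Scores a (user, item) pair with a dot product of the two tower outputs."""

    def __init__(self, num_users: int, num_items: int, dim: int = 16):
        super().__init__()
        self.user_tower = Tower(num_users, dim)
        self.item_tower = Tower(num_items, dim)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        return (self.user_tower(user_ids) * self.item_tower(item_ids)).sum(dim=-1)


torch.manual_seed(0)
model = TwoTower(num_users=100, num_items=200)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Random toy batch: 32 (user, item) pairs with binary click labels.
user_ids = torch.randint(0, 100, (32,))
item_ids = torch.randint(0, 200, (32,))
labels = torch.randint(0, 2, (32,)).float()

# One training step.
logits = model(user_ids, item_ids)
loss = loss_fn(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

One thing to keep in mind with this sketch: if the embedding is created with `sparse=True`, `torch.optim.Adam` will reject the sparse gradients, so the dense default is used here.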


Machine Learning At Scale

The Unreasonable Effectiveness of…