TLDR: We often obsess over attention mechanisms and context windows, but two papers suggest that normalization is the actual bottleneck for training speed and stability.
I have one question: is this valid only for transformer-like architectures, or can we use it with other architectures as well? For instance, I was training a two-tower model and saw that the user tower and item tower have large embedding gradients. That's expected behaviour, and it's why we use Adagrad, given the sparse gradient updates. I was wondering whether adding BatchNorm after the embedding and optimizing with Adam would work. What do you think?
In these cases there's only one answer: set up an experiment and see :)
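As a starting point for that kind of experiment, here is a toy sketch in plain Python (no framework) of one thing worth watching when swapping Adagrad for Adam on sparse embedding gradients. It tracks a single embedding coordinate whose ID shows up in the first batch only. Everything here is an assumption for illustration: the function names, the learning rate, and the gradient sequence are hypothetical, and it does not model BatchNorm or a real two-tower setup.

```python
import math

def adagrad_step(w, g, state, lr=0.1, eps=1e-10):
    # Accumulate squared gradients; a zero gradient leaves w untouched.
    state["sum_sq"] += g * g
    return w - lr * g / (math.sqrt(state["sum_sq"]) + eps)

def adam_step(w, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Dense Adam: both moments are updated every step, even when g == 0.
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected mean
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected variance
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def run(grads):
    # Apply both optimizers to the same (hypothetical) gradient sequence.
    w_ada, w_adam = 1.0, 1.0
    s_ada = {"sum_sq": 0.0}
    s_adam = {"t": 0, "m": 0.0, "v": 0.0}
    history = []
    for g in grads:
        w_ada = adagrad_step(w_ada, g, s_ada)
        w_adam = adam_step(w_adam, g, s_adam)
        history.append((w_ada, w_adam))
    return history

# One embedding coordinate: its ID appears in the first batch only.
grads = [1.0, 0.0, 0.0, 0.0]
history = run(grads)
for step, (w_ada, w_adam) in enumerate(history, start=1):
    print(f"step {step}: adagrad={w_ada:.4f} adam={w_adam:.4f}")
```

With Adagrad the coordinate stops moving once its gradient is zero, while dense Adam keeps pushing it through the momentum term. If that drift matters for rarely seen IDs, a sparse Adam variant (for example `torch.optim.SparseAdam` in PyTorch) is one option to include in the comparison. Whether BatchNorm after the embedding helps is still something only the real experiment can tell you.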