Discussion about this post

Pratilipi

I have one question: is this valid only for transformer-like architectures, or can it be used with other architectures as well? For instance, I was training a two-tower model and saw that the user tower and item tower have large embedding gradients. That is expected behaviour, and it is why we use Adagrad, given the sparse gradient updates. I was wondering whether adding BatchNorm after the embedding and optimizing with Adam would work. What do you think?
