Hello, thank you for this amazing article! Can you please explain better why we need to make sure that a single gradient step on a batch of data yields a decrease in loss? The loss usually goes up and down from one training step to the next (hopefully down more often than up), and it does not look like a strictly decreasing function. Let me know if I am missing something.
Hey Riccardo, glad you enjoyed the article! :)
You are absolutely correct that the loss can also go up on an individual step. What I meant to say is: "with enough steps on the same batch of data, the loss should go down more than up". That is, while individual steps may increase the loss, the overall trend over many steps should be downward.
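If it helps, here is a minimal sketch of that sanity check (assuming a PyTorch-style setup; the model, data, and hyperparameters below are placeholders I made up, not anything from the article): take a single fixed batch, step on it many times, and verify that the loss ends up much lower than where it started, even if it wobbles along the way.

```python
import torch
import torch.nn as nn

# Placeholder model and a single fixed batch (purely illustrative).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(64, 10), torch.randn(64, 1)

loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

losses = []
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Individual steps may increase the loss, but over many steps on the
# same batch the trend should be clearly downward (we should be able
# to overfit this one batch).
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
assert losses[-1] < losses[0]
```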
Thanks for the question!