Discussion about this post

Riccardo Bertoglio:

Hello, thank you for this amazing article! Could you explain in more detail why we need to make sure that a single gradient step on a batch of data yields a decrease in loss? The loss usually goes up and down from one training step to the next (hopefully down more often than up), and it does not look like a strictly decreasing function. Let me know if I am missing something.
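For context, a minimal sketch of the check the comment is asking about (the model, data, and learning rate here are hypothetical, not taken from the article): the loss measured on *different* batches can fluctuate during training, but a single, sufficiently small gradient step evaluated on the *same fixed batch* should lower the loss on that batch.

```python
# Sanity check: one small gradient step on a fixed batch should reduce
# the loss measured on that same batch. (Hypothetical toy model/data.)
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(16, 10)          # one fixed batch of inputs
y = torch.randint(0, 2, (16,))   # one fixed batch of labels

# Loss on the fixed batch before the step.
loss_before = criterion(model(x), y)

# Take exactly one gradient step on that same batch.
optimizer.zero_grad()
loss_before.backward()
optimizer.step()

# Re-measure the loss on the same batch after the step.
with torch.no_grad():
    loss_after = criterion(model(x), y)

print(f"before: {loss_before.item():.4f}  after: {loss_after.item():.4f}")
assert loss_after < loss_before, "a small step on the same batch should reduce its loss"
```

The fluctuation the comment describes comes from evaluating the loss on a new batch each step; the check above removes that source of noise by re-using the same batch, so a failure points to a real bug (e.g. a wrong sign on the loss or a broken optimizer step) rather than batch-to-batch variance.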

