Update README.md
README.md CHANGED
```diff
@@ -44,7 +44,7 @@ To put it simply, we compare the predictions of the models given the same inputs
 
 In Fisher-weighted averaging, typically the squared grads are used. However, with the budget I had, I couldn't afford to store gradients in anything other than FP16.
 The loss cannot be scaled much further than it already is, as that leads to NaNs during backpropagation inside the model, and some of the gradients are too small to be squared without a tendency to underflow.
-I also tried taking the square root of the expected square gradients to see if it helped regularize the extremely flat distribution of gradients
+I also tried taking the square root of the expected squared gradients to see if it helped regularize their extremely flat distribution, which spans the entire range of FP16.
 
 Overall, the approach that gave the best results was taking the expected value of the absolute grads, given the FP16 constraint.
 
```
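To make the FP16 constraint concrete, here is a minimal sketch of the approach the text settles on, assuming a standard PyTorch setup. It accumulates the expected absolute gradient per parameter instead of the squared gradient, then uses those scores as merge weights. The names (`expected_abs_grads`, `fisher_weighted_merge`, `loader`, `loss_fn`) are hypothetical placeholders, not taken from the repo.

```python
# A minimal sketch of Fisher-weighted averaging under the FP16 constraint
# described above. Hypothetical names; not the repo's actual code.
import torch


def expected_abs_grads(model, loader, loss_fn):
    """Accumulate E[|grad|] per parameter, stored in FP16.

    Why |grad| instead of grad**2: the smallest normal FP16 value is
    ~6.1e-5 (smallest subnormal ~6e-8), so a gradient of 1e-4 squares
    to 1e-8 and flushes to zero. |grad| keeps the gradient's own
    dynamic range, so it stays representable in half precision.
    """
    scores = {
        name: torch.zeros_like(p, dtype=torch.float16)
        for name, p in model.named_parameters()
    }
    n_batches = 0
    for inputs, targets in loader:
        model.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                scores[name] += p.grad.detach().abs().half()
        n_batches += 1
    return {name: s / n_batches for name, s in scores.items()}


@torch.no_grad()
def fisher_weighted_merge(model_a, model_b, scores_a, scores_b, eps=1e-6):
    """Average two models' parameters, weighting each entry by its score."""
    merged = {}
    params_b = dict(model_b.named_parameters())
    for name, pa in model_a.named_parameters():
        # Upcast to FP32 for the merge arithmetic; eps avoids 0/0 where
        # both scores underflowed to zero in FP16.
        wa = scores_a[name].float() + eps
        wb = scores_b[name].float() + eps
        merged[name] = (wa * pa.float() + wb * params_b[name].float()) / (wa + wb)
    return merged
```

The same skeleton covers the alternatives the text rejects: swapping `.abs()` for `.square()` reproduces the FP16 underflow problem, and taking the square root of an accumulated squared-gradient score gives the square-root variant that was tried.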