Update README.md
README.md
CHANGED
@@ -43,7 +43,7 @@ $$
 To put it simply, we compare the predictions of the models given the same (or very similar) inputs where we expect them to differ, and determine which parameters would contribute the most to reducing the gap between the models.

 In Fisher-weighted averaging, the squared gradients are typically used. However, with the budget I had, I couldn't afford to store gradients in anything other than FP16.
-The loss cannot be scaled much more, as it leads to NaNs during backpropagation inside the model, and some of the gradients are too small to be squared without tending to underflow.
+The loss cannot be scaled much more than it is already, as it leads to NaNs during backpropagation inside the model, and some of the gradients are too small to be squared without tending to underflow.
 I also tried taking the square root of the expected squared gradients.

 Overall, the approach that gave the best results was taking the expected value of the absolute gradients, given the FP16 constraint.
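
Below is a minimal sketch of the weighting scheme this hunk describes, not the repository's actual script. It assumes PyTorch models with identical architectures, a `dataloader` yielding `(inputs, labels)` pairs, and a hypothetical `LOSS_SCALE` constant; none of these names come from the README. The idea is to accumulate E[|grad|] per parameter in FP16 (squaring the FP16 gradients would tend to underflow) and then use those estimates to weight a parameter-wise average of the two models.

```python
import torch

LOSS_SCALE = 128.0  # hypothetical value; scaling much higher produced NaNs in the FP16 backward pass

def expected_abs_grads(model, dataloader, device="cuda"):
    """Estimate per-parameter weights as E[|grad|], accumulated in FP16."""
    model.to(device).train()
    fisher = {n: torch.zeros_like(p, dtype=torch.float16)
              for n, p in model.named_parameters() if p.requires_grad}
    n_batches = 0
    for inputs, labels in dataloader:
        model.zero_grad(set_to_none=True)
        outputs = model(inputs.to(device))
        loss = torch.nn.functional.cross_entropy(outputs, labels.to(device))
        (loss * LOSS_SCALE).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                # absolute value instead of the usual square: FP16 grads are too
                # small to square without underflowing
                fisher[n] += (p.grad.detach().abs() / LOSS_SCALE).half()
        n_batches += 1
    return {n: f / n_batches for n, f in fisher.items()}

def fisher_weighted_average(model_a, model_b, fisher_a, fisher_b, eps=1e-6):
    """Merge the two models' parameters, weighted by their gradient-based estimates."""
    merged = {}
    for (n, pa), (_, pb) in zip(model_a.named_parameters(), model_b.named_parameters()):
        wa, wb = fisher_a[n].float(), fisher_b[n].float()
        merged[n] = (wa * pa.float() + wb * pb.float()) / (wa + wb + eps)
    return merged
```

The accumulation is done in FP32 only at the merge step; during estimation everything stays in FP16, which is the storage constraint the text refers to.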