Update README.md
README.md CHANGED
```diff
@@ -44,7 +44,7 @@ To put it simply, we compare the predictions of the models given the same inputs
 
 In Fisher-weighted averaging, typically the squared grads are used. However, with the budget I had, I couldn't afford to store gradients in anything other than FP16.
 The loss cannot be scaled much further than it already is, as that leads to NaNs during backpropagation inside the model, and some of the gradients are too small to be squared without a tendency to underflow.
-I also tried taking the square root of the expected square gradients to see if it helped regularize the extremely flat distribution of gradients
+I also tried taking the square root of the expected squared gradients to see if it helped regularize their extremely flat distribution, which spans the entire range of FP16.
 
 Overall, the approach that gave the best results was taking the expected value of the absolute grads, given the FP16 constraint.
 
```
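To make the FP16 constraint concrete, here is a minimal sketch of the approach the text settles on, assuming a standard PyTorch setup. It accumulates the expected absolute gradient per parameter instead of the squared gradient, then uses those scores as merge weights. The names (`expected_abs_grads`, `fisher_weighted_merge`, `loader`, `loss_fn`) are hypothetical placeholders, not taken from the repo.

```python
# A minimal sketch of Fisher-weighted averaging under the FP16 constraint
# described above. Hypothetical names; not the repo's actual code.
import torch


def expected_abs_grads(model, loader, loss_fn):
    """Accumulate E[|grad|] per parameter, stored in FP16.

    Why |grad| instead of grad**2: the smallest normal FP16 value is
    ~6.1e-5 (smallest subnormal ~6e-8), so a gradient of 1e-4 squares
    to 1e-8 and flushes to zero. |grad| keeps the gradient's own
    dynamic range, so it stays representable in half precision.
    """
    scores = {
        name: torch.zeros_like(p, dtype=torch.float16)
        for name, p in model.named_parameters()
    }
    n_batches = 0
    for inputs, targets in loader:
        model.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                scores[name] += p.grad.detach().abs().half()
        n_batches += 1
    return {name: s / n_batches for name, s in scores.items()}


@torch.no_grad()
def fisher_weighted_merge(model_a, model_b, scores_a, scores_b, eps=1e-6):
    """Average two models' parameters, weighting each entry by its score."""
    merged = {}
    params_b = dict(model_b.named_parameters())
    for name, pa in model_a.named_parameters():
        # Upcast to FP32 for the merge arithmetic; eps avoids 0/0 where
        # both scores underflowed to zero in FP16.
        wa = scores_a[name].float() + eps
        wb = scores_b[name].float() + eps
        merged[name] = (wa * pa.float() + wb * params_b[name].float()) / (wa + wb)
    return merged
```

The same skeleton covers the alternatives the text rejects: swapping `.abs()` for `.square()` reproduces the FP16 underflow problem, and taking the square root of an accumulated squared-gradient score gives the square-root variant that was tried.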