ljleb committed (verified)
Commit ff958ee · Parent: e3b3525

Update README.md

Files changed (1): README.md (+5 −1)
README.md CHANGED
@@ -42,7 +42,11 @@ $$
 
 To put it simply, we compare the predictions of the models given the same (or very similar) inputs where we expect them to differ, and determine which parameters would contribute the most to reducing the gap between the models.
 
- In Fisher-weighted averaging, the squared gradients are typically used. However, experimentally this leads to overblown importance estimates, especially for outliers. I also tried taking the square root of the expected squared gradients, but overall the approach that gave the best results was taking the expected value of the absolute gradients.
+ In Fisher-weighted averaging, the squared gradients are typically used. However, with the budget I had, I could not afford to store gradients in anything other than FP16.
+ The loss cannot be scaled much further, as doing so leads to NaNs during backpropagation inside the model, and some of the gradients are too small to be squared without a tendency to underflow.
+ I also tried taking the square root of the expected squared gradients.
+
+ Overall, given the FP16 constraint, the approach that gave the best results was taking the expected value of the absolute gradients.
 
 In a sense, the parameters with higher expected absolute gradients have a higher slope in the loss landscape. This means that merging these parameters using a naive weighted average will change the loss L much more than merging parameters with smaller expected absolute gradients would.
 
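
To make the distinction between squared and absolute gradients concrete, here is a minimal PyTorch-style sketch of the importance estimation described in the added lines. This is not the repository's actual code: `loss_fn`, `dataloader`, and the function name are illustrative assumptions, and only the general idea (accumulating the expected |grad| per parameter in FP16) follows the text above.

```python
# Minimal sketch, not the repository's implementation.
# Assumes a PyTorch model, a `loss_fn(model, batch)` that measures the gap
# between the two models' predictions, and a `dataloader` of inputs.
import torch

def expected_abs_grads(model, loss_fn, dataloader, device="cuda"):
    # Per-parameter accumulator kept in FP16, matching the storage
    # constraint described in the README.
    importance = {
        name: torch.zeros_like(p, dtype=torch.float16)
        for name, p in model.named_parameters()
    }
    n_batches = 0
    for batch in dataloader:
        model.zero_grad(set_to_none=True)
        loss = loss_fn(model, batch.to(device))
        loss.backward()
        n_batches += 1
        for name, p in model.named_parameters():
            if p.grad is not None:
                # |grad| instead of grad**2: squaring FP16 gradients tends to
                # underflow, while the absolute value keeps their magnitude.
                importance[name] += p.grad.abs().half()
    # Expected value of the absolute gradients over the sampled batches.
    return {name: acc / n_batches for name, acc in importance.items()}
```

The design choice the sketch illustrates is the one argued in the diff: in FP16, `grad ** 2` underflows for small gradients and loss scaling cannot be pushed further without producing NaNs, so the absolute gradient is used as the importance signal instead of the Fisher-style squared gradient.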