ljleb committed · Commit ab6e160 · verified · 1 Parent(s): 682a6dd

Update README.md

Files changed (1): README.md (+2, -0)
README.md CHANGED
@@ -41,6 +41,8 @@ $$
 
 To put it simply, we compare the predictions of the models given the same (or very similar) inputs where we expect them to differ, and we determine which parameters would contribute the most to reducing the gap between the models.
 
+ In Fisher-weighted averaging, the squared gradients are typically used. However, experimentally this leads to overblown importance estimates, especially for outliers. I also tried taking the square root of the expected squared gradients, but overall the approach that gave the best results was taking the expected value of the absolute gradients.
+
 In a sense, the parameters with higher expected absolute gradients have a higher slope in the loss landscape. This means that merging these parameters with a naive weighted average will cause the loss L to change much more than it does for parameters with smaller expected absolute gradients.
 
 In our case with NoobAI and Animagine, since the loss landscape is highly non-linear, naively merging high-slope parameters completely wrecks the loss instead of improving it: the merge cannot even denoise anything anymore. We therefore want to move the high-slope parameters as little as possible, keeping them in place wherever we can.
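
The added paragraph describes estimating per-parameter importance as the expected absolute gradient of the gap between the two models' predictions, rather than Fisher's expected squared gradient. Below is a minimal PyTorch sketch of that idea, not the repo's actual code: the model objects, the dataloader (assumed to yield input tensors), and the squared-error gap are illustrative assumptions.

```python
# Sketch: per-parameter importance as E[|d gap / d theta|] over shared inputs.
# Assumes `dataloader` yields input tensors accepted by both models.
import torch

def expected_abs_grads(model_a, model_b, dataloader, device="cuda"):
    """Return {param_name: E[|d gap / d theta|]} for model_a's parameters."""
    model_a.to(device).eval()
    model_b.to(device).eval()
    importance = {n: torch.zeros_like(p) for n, p in model_a.named_parameters()}
    n_batches = 0

    for batch in dataloader:
        batch = batch.to(device)
        model_a.zero_grad(set_to_none=True)

        with torch.no_grad():
            target = model_b(batch)              # reference prediction, no grads needed
        gap = (model_a(batch) - target).pow(2).mean()  # prediction gap between the models
        gap.backward()

        for n, p in model_a.named_parameters():
            if p.grad is not None:
                # E[|grad|] instead of Fisher's E[grad^2]: less dominated by outliers
                importance[n] += p.grad.abs()
        n_batches += 1

    return {n: v / max(n_batches, 1) for n, v in importance.items()}
```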
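And a hedged sketch of the closing point, keeping high-slope parameters in place: each parameter's interpolation step is damped by its normalized importance, so high-importance entries barely move while low-importance ones interpolate normally. The max-combine of the two importance maps, the per-tensor normalization, and the exponential damping are illustrative choices, not the method actually used in this repo.

```python
# Sketch: importance-aware interpolation between two state dicts.
import torch

@torch.no_grad()
def importance_aware_merge(theta_a, theta_b, imp_a, imp_b, alpha=0.5, temperature=1.0):
    """theta_*: {name: tensor} state dicts; imp_*: matching importance dicts."""
    merged = {}
    for name, a in theta_a.items():
        b = theta_b[name]
        # Treat a parameter as high-slope if either model reports high importance.
        imp = torch.maximum(imp_a[name], imp_b[name])
        imp = imp / (imp.max() + 1e-12)          # normalize to [0, 1] per tensor
        step = alpha * torch.exp(-imp / temperature)  # damp the step for high-slope entries
        merged[name] = a + step * (b - a)
    return merged
```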