Welcome to RotoGrad’s documentation!
RotoGrad is a solution that alleviates the problem of gradient conflict (a.k.a. gradient interference) in multitask models, which arises when the gradients of the different tasks w.r.t. the shared parameters point in different directions.
RotoGrad homogenizes these task gradients during training by scaling and rotating the input space of the task-specific modules (a.k.a. heads) so that their gradients w.r.t. the shared module (a.k.a. backbone) do not overpower or cancel each other out.
This is a PyTorch implementation. For more information, check out the original paper.
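To make the notion of conflict concrete, the following sketch (not part of the library; all module, loss, and data names are placeholders introduced in the usage example below) measures the cosine similarity between the two task gradients with respect to the shared representation z; a negative value means the task gradients partially cancel each other out.

import torch
import torch.nn.functional as F

# Hypothetical check of gradient conflict at the shared representation z.
# backbone, head1, head2, loss_task1, loss_task2, x, y1 and y2 are placeholders.
z = backbone(x)
g1 = torch.autograd.grad(loss_task1(head1(z), y1), z, retain_graph=True)[0]
g2 = torch.autograd.grad(loss_task2(head2(z), y2), z)[0]
conflict = F.cosine_similarity(g1.flatten(1), g2.flatten(1), dim=1).mean()
# conflict < 0 means the per-task gradients point in opposing directions.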
Installation
Install RotoGrad by running:
pip install rotograd
How to use
Suppose you have a backbone model shared across tasks, and two different tasks to solve. These tasks take the output of the backbone, z = backbone(x), and feed it to a task-specific model (head1 and head2) to obtain the predictions for their tasks, that is, y1 = head1(z) and y2 = head2(z).
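For concreteness, here is a minimal sketch of that setup; the layer sizes and modules are placeholders (any nn.Module works as backbone or head):

import torch
import torch.nn as nn

size_z = 16                         # dimensionality of the shared representation z (placeholder)
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, size_z))
head1 = nn.Linear(size_z, 1)        # e.g. a regression head
head2 = nn.Linear(size_z, 3)        # e.g. a 3-class classification head

x = torch.randn(8, 32)              # a dummy batch of 8 samples
z = backbone(x)                     # shared representation
y1, y2 = head1(z), head2(z)         # task-specific predictions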
Then you can simply use RotateOnly, RotoGrad, or RotoGradNorm (RotateOnly + GradNorm) by putting all parts together in a single model:
from rotograd import RotoGrad
model = RotoGrad(backbone, [head1, head2], size_z, normalize_losses=True)
where you can recover the backbone and the i-th head simply by calling model.backbone and model.heads[i], respectively. Moreover, you can obtain the end-to-end model for a single task (that is, backbone + head) by typing model[i].
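For instance, reusing the hypothetical model and batch x from above:

shared = model.backbone        # the shared backbone module
first_head = model.heads[0]    # the first task-specific head
task1_model = model[0]         # end-to-end model for the first task: backbone + head
pred1 = task1_model(x)         # assuming the end-to-end model is callable like a regular module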
As discussed in the paper, it is advisable to have a smaller learning rate for the parameters of RotoGrad and GradNorm. This is as simple as doing:
import torch

optim_model = torch.optim.Adam(
    [{'params': m.parameters()} for m in [backbone, head1, head2]],
    lr=learning_rate_model,
)
optim_rotograd = torch.optim.Adam(model.parameters(), lr=learning_rate_rotograd)
Finally, we can train the model on all tasks using a simple step function:
import rotograd

def step(x, y1, y2):
    model.train()

    optim_model.zero_grad()
    optim_rotograd.zero_grad()

    with rotograd.cached():  # Speeds up computations by caching RotoGrad's parameters
        pred1, pred2 = model(x)
        loss1 = loss_task1(pred1, y1)
        loss2 = loss_task2(pred2, y2)

        model.backward([loss1, loss2])

    optim_model.step()
    optim_rotograd.step()

    return loss1, loss2
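A hypothetical training loop around this step function could then look as follows; train_loader and num_epochs are placeholders, and the only assumption is that each batch provides x, y1 and y2:

for epoch in range(num_epochs):
    for x, y1, y2 in train_loader:   # train_loader is a placeholder DataLoader
        loss1, loss2 = step(x, y1, y2)
    print(f'epoch {epoch}: loss1={loss1.item():.4f}, loss2={loss2.item():.4f}')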
Contribute
Issue Tracker: https://github.com/adrianjav/rotograd/issues
Source Code: https://github.com/adrianjav/rotograd
Support
If you are having issues, please let us know. You can reach us by email at: adrian.javaloy@gmail.com
Citing
@inproceedings{javaloy2022rotograd,
    title={RotoGrad: Gradient Homogenization in Multitask Learning},
    author={Adri{\'a}n Javaloy and Isabel Valera},
    booktitle={International Conference on Learning Representations},
    year={2022},
    url={https://openreview.net/forum?id=T8wHz4rnuGL}
}
License
The project is licensed under the MIT license.