Abstract

Today deep learning achieves state-of-the-art performance on many tasks, but mainly through supervised training on huge, fixed datasets. Continual learning (CL) is a machine learning paradigm in which the data distribution and the learning objective change over time, or in which all the training data and objective criteria are never available at once. Models are trained on small batches of data and can run even on a CPU, making remarkable levels of autonomy possible. The main problem associated with these techniques is catastrophic forgetting, a sudden erasure of all previously acquired knowledge. In this report, we briefly discuss different continual learning techniques and how they deal with catastrophic forgetting.

Introduction

Continual learning is strongly inspired by the way humans learn: we acquire new knowledge and build on previously learned experience. In machine learning, until recently the approach was stateless: models were retrained on new data each time, without considering the previous experience and state of the model. Recently the trend has shifted from stateless to stateful training. Stateful training is strongly motivated by the principle of reusing learned knowledge to learn new tasks with less training and fewer resources. Commonly referred to as continual learning, multitask learning, or lifelong learning, it builds knowledge cumulatively as opposed to learning from scratch.

Catastrophic forgetting means that the model forgets old tasks while learning new ones, which is undesirable since we want it to perform well on old tasks as well as on new ones. The memory constraint assumes that limited space is available for storing the model parameters and the data from previous tasks.

The main techniques discussed here to deal with forgetting are the Latent Replay approach and Continual Learning with Hypernetworks. Other methods, which mainly rely on growing the network architecture, are discussed briefly at the end of the report.

Handling Memories

Catastrophic forgetting refers to the phenomenon of a neural network's performance on previously learned concepts degrading when it is trained sequentially on new concepts. An efficient memory management strategy should save only the important information, while also being able to transfer knowledge and skills to future tasks. In practice, it is almost impossible to know in advance what will be important and what could be transferable; a trade-off must therefore be found between the precision of the information saved and the acceptable amount of forgetting. This trade-off problem is known as the stability/plasticity dilemma.

Supervised Continual Learning

Supervised continual learning is the particular case where labeled data are not available all at once: the mapping from data to labels must be learned from a sequence of data points so that, at the end of the sequence, it holds for the whole dataset. Supervised continual learning approaches have mostly focused on classification. While studying continual learning in this context may help disentangle the complexity introduced by algorithms that learn continually, in robotics the lack of supervision usually prevents supervised methods from being applied directly.
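To illustrate the setting, the following is a minimal sketch of naive sequential fine-tuning over a stream of classification tasks, assuming PyTorch; the network sizes and the hypothetical `task_loaders` are illustrative assumptions, and no anti-forgetting mechanism is applied, so accuracy on earlier tasks typically degrades.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A small classifier shared across all tasks (sizes chosen for illustration only).
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def fit_on_task(loader, epochs=1):
    """Naive sequential fine-tuning: no replay or regularization, so performance
    on earlier tasks typically degrades (catastrophic forgetting)."""
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(x.view(x.size(0), -1)), y)
            loss.backward()
            opt.step()

@torch.no_grad()
def accuracy(loader):
    model.eval()
    correct = total = 0
    for x, y in loader:
        pred = model(x.view(x.size(0), -1)).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

# task_loaders: hypothetical list of DataLoaders, one per task, arriving one at a time.
# for t, loader in enumerate(task_loaders):
#     fit_on_task(loader)
#     print([accuracy(task_loaders[s]) for s in range(t + 1)])  # accuracy on old tasks drops
```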

Unsupervised Continual Learning

In the context of robotics, unsupervised continual learning may play an important role in building increasingly robust multi-modal representations over time, to be later fine-tuned with an external and very sparse feedback signal from the environment. To learn robust and adaptive representations with unsupervised learning, the main objective is to find suitable surrogate and meaningful learning signals, such as robotics priors, self-supervised models, or curiosity-driven techniques. A particular unsupervised task learned in a continual learning setting is image generation, achieved by training generative models to reproduce images from a dataset. In a CL setting, the distribution changes over time, and at the end of training the generative model should be able to produce images from the whole distribution.
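One common strategy for covering the whole distribution at the end of training, not detailed in this report, is generative replay: samples drawn from a frozen snapshot of the previous generative model are mixed with the new data at each stage. Below is a minimal sketch assuming PyTorch; the tiny VAE, the hypothetical `stage_loaders`, and the hyperparameters are purely illustrative.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """A deliberately tiny VAE over flattened images with pixel values in [0, 1]."""
    def __init__(self, dim=784, latent=16):
        super().__init__()
        self.enc = nn.Linear(dim, 2 * latent)   # outputs mean and log-variance
        self.dec = nn.Linear(latent, dim)
        self.latent = latent

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return torch.sigmoid(self.dec(z)), mu, logvar

    @torch.no_grad()
    def sample(self, n):
        return torch.sigmoid(self.dec(torch.randn(n, self.latent)))

def vae_loss(x, recon, mu, logvar):
    rec = F.binary_cross_entropy(recon, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)

def train_stage(loader, replay_model=None):
    """Train on the current part of the distribution, replaying samples
    generated by the previous (frozen) model when one is available."""
    for x, _ in loader:
        x = x.view(x.size(0), -1)
        if replay_model is not None:
            x = torch.cat([x, replay_model.sample(x.size(0))], dim=0)
        opt.zero_grad()
        recon, mu, logvar = vae(x)
        vae_loss(x, recon, mu, logvar).backward()
        opt.step()

# stage_loaders: hypothetical DataLoaders, each covering a different part of the distribution.
# previous = None
# for loader in stage_loaders:
#     train_stage(loader, replay_model=previous)
#     previous = copy.deepcopy(vae).eval()   # freeze a snapshot to replay from
```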

Hypernetworks

With hypernetworks, the problem of forgetting is taken to the meta-level: the outputs of a metamodel, called a task-conditioned hypernetwork, that maps task embeddings to target-network weights are kept fixed. This makes it possible to represent every task as a single point. In other words, the method places constraints on how the neural weights may be updated; from a computational perspective, this is generally modeled via additional constraints on the mapping that counteract changes in the network's mapping function. The central idea is to learn the parameters of the metamodel rather than learning the parameters of the target model directly. A hypernetwork can thus be treated as a weight generator, and was originally introduced to dynamically compress model parameters.

One way to avoid catastrophic forgetting would be to store data from previous tasks together with the corresponding model outputs and then keep those outputs fixed. By mixing the data from past tasks with the new data, forgetting is largely avoided; this can be seen as multitask learning, since all tasks are learned simultaneously. Although this seems like the easy option, it is potentially memory expensive and not strictly online learning. To tackle this issue, the hypernetwork instead fixes the target network's weights to a single point in weight space per task. This constraint can be enforced with a two-step optimization:

- Compute the current loss: first, a candidate change of the parameters is computed by minimizing the current task loss with respect to the current network; the candidate parameters of the metamodel are then obtained by minimizing this loss together with the constraint on the target parameters.

- Learn the task embeddings: the task embeddings are differentiable and deterministic, and can be learned alongside the candidate metamodel parameters.

This approach is well suited to neural network models, but the past task embeddings must be kept in memory; the hypernetwork uses them to penalize changes in its previous outputs.
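As an illustration, here is a minimal sketch of a task-conditioned hypernetwork with an output regularizer against forgetting, assuming PyTorch. The target-network shapes, embedding size, `beta`, and the helper names are illustrative assumptions, and the two-step lookahead optimization described above is simplified here to a single combined loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Shapes of the target network's parameters (a tiny 2-layer MLP classifier).
TARGET_SHAPES = [(32, 784), (32,), (10, 32), (10,)]
n_target_params = sum(torch.Size(s).numel() for s in TARGET_SHAPES)

class HyperNetwork(nn.Module):
    """Maps a learnable task embedding to the full parameter vector of the target net."""
    def __init__(self, emb_dim=8, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_target_params),
        )

    def forward(self, emb):
        flat = self.net(emb)
        # Split the flat output into the individual weight tensors of the target net.
        params, offset = [], 0
        for shape in TARGET_SHAPES:
            n = torch.Size(shape).numel()
            params.append(flat[offset:offset + n].view(shape))
            offset += n
        return params

def target_forward(x, params):
    """Functional forward pass of the target network using the generated weights."""
    w1, b1, w2, b2 = params
    h = F.relu(F.linear(x, w1, b1))
    return F.linear(h, w2, b2)

emb_dim, beta = 8, 0.01
hnet = HyperNetwork(emb_dim)
# One learnable embedding per task; only these (plus the hypernetwork) are stored.
task_embs = nn.ParameterList(nn.Parameter(0.1 * torch.randn(emb_dim)) for _ in range(3))

def train_task(task_id, loader, epochs=1):
    # Snapshot the hypernetwork outputs for the previous task embeddings; the
    # regularizer below pins the mapping to these points in weight space.
    with torch.no_grad():
        old_outputs = [[p.clone() for p in hnet(task_embs[t])] for t in range(task_id)]
    opt = torch.optim.Adam(list(hnet.parameters()) + [task_embs[task_id]], lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            params = hnet(task_embs[task_id])
            task_loss = F.cross_entropy(target_forward(x.view(x.size(0), -1), params), y)
            # Output regularizer: keep the weights generated for earlier tasks fixed.
            reg = sum(
                sum((p - q).pow(2).sum() for p, q in zip(hnet(task_embs[t].detach()), old))
                for t, old in enumerate(old_outputs)
            )
            (task_loss + beta * reg).backward()
            opt.step()
```

Under these assumptions, only the hypernetwork and one small embedding per task are stored, which keeps the memory footprint modest compared to replaying raw data from previous tasks.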