AutoML | Meta-Learning: The Many Variants of MAML

I haven't followed the latest progress in meta-learning for a while, so this post is my survey of recent variants of the MAML method. It covers 8 papers:

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML, 2017
How to train your MAML, ICLR, 2019
Meta-Learning With Task-Adaptive Loss Function for Few-Shot Learning, ICCV, 2021
Task similarity aware meta learning: theory-inspired improvement on MAML, UAI, 2021
Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML, ICLR, 2020
The Close Relationship Between Contrastive Learning and Meta-learning, ICLR, 2022
How to Train Your MAML to Excel in Few-Shot Classification, ICLR, 2022
Meta-learning with Fewer Tasks through Task Interpolation, ICLR, 2022

With my admittedly limited understanding, I have grouped these papers into 4 directions.


1 ICML2017 | Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML)

First, a quick review of MAML.

Inner loop

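The inner loop updates the base learner on the support set. Here $i$ indexes the task, $\theta_i$ denotes the base-learner parameters for the current task, and $\theta$ denotes the meta-learner's meta-parameters; every task starts from $\theta_i^{0} = \theta$.

Each step $j$ is a gradient-descent update on the support set $S_i$ with inner learning rate $\alpha$:

$$\theta_i^{j} = \theta_i^{j-1} - \alpha \nabla_{\theta_i^{j-1}} \mathcal{L}_{S_i}\big(f_{\theta_i^{j-1}}\big)$$

Because meta-learning aims at fast adaptation, the inner loop usually takes only a few update steps, e.g. 5. However, How to Train Your MAML to Excel in Few-Shot Classification, covered later in this post, reports that more steps give better results.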

Outer loop

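In the outer loop, the meta learner aggregates knowledge from all tasks: each task computes a loss on its own query set $Q_i$ using its adapted parameters $\theta_i^{N}$ (the result of $N$ inner steps), and the summed loss drives a gradient-descent update of the meta-learner parameters $\theta$ with outer learning rate $\beta$:

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_i \mathcal{L}_{Q_i}\big(f_{\theta_i^{N}}\big)$$

Note that the inner loop and the outer loop use the same loss function; this point comes up again later.

It is also worth reviewing the first-order and second-order approximations here.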
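To make the two loops and the two derivative orders concrete, here is a minimal sketch on a toy problem rather than a real network. Everything in it is an assumption made for illustration: the scalar parameter, the quadratic loss, and the names `inner_loop`, `meta_gradient` and `meta_train`. With a quadratic loss the Jacobian of the adapted parameter with respect to the meta-parameter is available in closed form, so no autograd library is needed.

```python
# Toy MAML: each task has a support target and a query target, and the loss is
# L(theta, t) = 0.5 * (theta - t) ** 2, so every gradient is known in closed form.

def inner_loop(theta, support_target, inner_lr, num_steps):
    """Adapt the meta-parameter to one task with a few gradient steps on its support loss."""
    theta_i = theta
    for _ in range(num_steps):
        grad = theta_i - support_target           # dL/dtheta_i for the quadratic loss
        theta_i = theta_i - inner_lr * grad
    return theta_i


def meta_gradient(theta, tasks, inner_lr, num_steps, second_order=True):
    """Sum over tasks of d(query loss after adaptation) / d(theta)."""
    total = 0.0
    for support_target, query_target in tasks:
        theta_i = inner_loop(theta, support_target, inner_lr, num_steps)
        grad_at_adapted = theta_i - query_target  # gradient w.r.t. the adapted parameter
        if second_order:
            # Chain rule through the inner loop: each step multiplies by (1 - inner_lr * Hessian),
            # and the Hessian of this quadratic loss is 1.
            total += grad_at_adapted * (1.0 - inner_lr) ** num_steps
        else:
            # First-order approximation: treat d(theta_i)/d(theta) as 1.
            total += grad_at_adapted
    return total


def meta_train(tasks, inner_lr=0.1, outer_lr=0.05, num_steps=5, iterations=200, second_order=True):
    """Outer loop: plain gradient descent on the meta-parameter."""
    theta = 0.0
    for _ in range(iterations):
        theta -= outer_lr * meta_gradient(theta, tasks, inner_lr, num_steps, second_order)
    return theta


tasks = [(1.0, 1.2), (3.0, 2.8), (-2.0, -1.9)]    # (support target, query target) per task
theta_full = meta_train(tasks, second_order=True)
theta_fo = meta_train(tasks, second_order=False)
```

In this toy the two variants end up at the same meta-parameter, because the inner-loop Jacobian is the same constant for every task and only rescales the meta-gradient. With a real network the Jacobian depends on the task and on the parameters, so the first-order approximation is cheaper but no longer exact.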

For the theory and motivation, see this Zhihu question:

https://www.zhihu.com/question/292959709/answer/1520505806

For a more hands-on, code-level explanation (built around torch.autograd.grad), see this blog post:

https://blog.csdn.net/Cecilia6277/article/details/109091482

Direction 1: Optimizing implementation details

2 ICLR2019 | How to train your MAML (MAML++)

As the title says, this paper is about how to train MAML.

It improves the implementation in several respects:

Gradient Instability → Multi-Step Loss Optimization (MSL)

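👉 MAML works by minimizing the target set loss computed by the base-network after it has completed all of its inner-loop updates towards a support set task. Instead we propose minimizing the target set loss computed by the base-network after every step towards a support set task.

The outer-loop update becomes

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_i \sum_{j=1}^{N} v_j \, \mathcal{L}_{Q_i}\big(f_{\theta_i^{j}}\big)$$

where the inner loop has $N$ update steps, $\theta_i^{j}$ denotes task $i$'s parameters after step $j$, $Q_i$ is its query (target) set, and $v_j$ is the importance weight of the loss at step $j$. Because the weights rescale each step's loss as training progresses, they effectively make the outer-loop learning rate dynamic.

In MAML the inner and outer loops run strictly in sequence: the outer loss is computed only after the last inner step. In MAML++ the two are interleaved to some degree, since every inner step contributes to the outer-loop update of the meta learner.

The corresponding part of the open-source code: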


        
if use_multi_step_loss_optimization and training_phase and epoch < self.args.multi_step_loss_num_epochs:
    # MAML++: every inner update step contributes to the outer-loop update
    target_loss, target_preds = self.net_forward(x=x_target_set_task,
                                                 y=y_target_set_task, weights=names_weights_copy,
                                                 backup_running_statistics=False, training=True,
                                                 num_step=num_step)
    # v_j: per_step_loss_importance_vectors[num_step]
    task_losses.append(per_step_loss_importance_vectors[num_step] * target_loss)
elif num_step == (self.args.number_of_training_steps_per_iter - 1):
    # Otherwise only the target loss after the final inner step is used
    target_loss, target_preds = self.net_forward(x=x_target_set_task,
                                                 y=y_target_set_task, weights=names_weights_copy,
                                                 backup_running_statistics=False, training=True,
                                                 num_step=num_step)
    task_losses.append(target_loss)

# continue with the next inner update step

Implementation of the importance weights:

💡 we employ an annealed weighting for the per step losses. Initially all losses have equal contributions towards the loss, but as iterations increase, we decrease the contributions from earlier steps and slowly increase the contribution of later steps.

This is done to ensure that as training progresses the final step loss receives more attention from the optimizer thus ensuring it reaches the lowest possible loss. If the annealing is not used, we found that the final loss might be higher than with the original formulation.


        
import numpy as np
import torch

# Updated once per training phase
def get_per_step_loss_importance_vector(self):
    """
    Generates a tensor of dimensionality (num_inner_loop_steps) indicating the importance of each step's target
    loss towards the optimization loss.
    :return: A tensor to be used to compute the weighted average of the loss, useful for
    the MSL (Multi Step Loss) mechanism.
    """
    # Start with equal weights for every inner step
    loss_weights = np.ones(shape=(self.args.number_of_training_steps_per_iter)) * (
            1.0 / self.args.number_of_training_steps_per_iter)
    decay_rate = 1.0 / self.args.number_of_training_steps_per_iter / self.args.multi_step_loss_num_epochs
    min_value_for_non_final_losses = 0.03 / self.args.number_of_training_steps_per_iter
    # Decay the weights of all non-final steps, down to a small floor
    for i in range(len(loss_weights) - 1):
        curr_value = np.maximum(loss_weights[i] - (self.current_epoch * decay_rate), min_value_for_non_final_losses)
        loss_weights[i] = curr_value

    # Grow the weight of the final step by the amount removed from the others
    curr_value = np.minimum(
        loss_weights[-1] + (self.current_epoch * (self.args.number_of_training_steps_per_iter - 1) * decay_rate),
        1.0 - ((self.args.number_of_training_steps_per_iter - 1) * min_value_for_non_final_losses))
    loss_weights[-1] = curr_value
    loss_weights = torch.Tensor(loss_weights).to(device=self.device)
    return loss_weights
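To see how this schedule behaves, here is a standalone restatement of the same arithmetic in plain Python, without the class, NumPy or torch. The function name `per_step_loss_weights` and its argument names are made up for this sketch; the values 5 and 10 are example settings, not the repository defaults.

```python
def per_step_loss_weights(current_epoch, num_steps, msl_num_epochs):
    """Plain-Python restatement of the annealed per-step loss weights."""
    weights = [1.0 / num_steps] * num_steps            # equal contributions at the start
    decay_rate = 1.0 / num_steps / msl_num_epochs
    min_non_final = 0.03 / num_steps                   # floor for every non-final step
    for i in range(num_steps - 1):
        weights[i] = max(weights[i] - current_epoch * decay_rate, min_non_final)
    weights[-1] = min(
        weights[-1] + current_epoch * (num_steps - 1) * decay_rate,
        1.0 - (num_steps - 1) * min_non_final,
    )
    return weights


for epoch in (0, 5, 10):
    rounded = [round(w, 3) for w in per_step_loss_weights(epoch, num_steps=5, msl_num_epochs=10)]
    print(epoch, rounded)
```

With these example settings the weights start at 0.2 each, reach 0.1 for the first four steps and 0.6 for the last step at epoch 5, and settle at 0.006 and 0.976 at epoch 10, so the final-step loss gradually dominates.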

Second Order Derivative Cost → Derivative-Order Annealing (DA)

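MAML uses a single derivative order for the whole of training: either full second-order gradients, which are expensive, or the first-order approximation, which is cheaper but can hurt generalization. MAML++ trains with the first-order approximation for the first 50 epochs and with second-order gradients for the rest.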

Absence of Batch Normalization Statistic Accumulation → Per-Step Batch Normalization Running Statistics (BNRS)

💡 A better alternative would be to collect statistics in a per-step regime. To collect running statistics per-step, one needs to instantiate N (where N is the total number of inner-loop update steps) sets of running mean and running standard deviation for each batch normalization layer in the network and update the running statistics respectively with the steps being taken during the optimization. The per-step batch normalization methodology should speed up optimization of MAML whilst potentially improving generalization performance.

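In MAML, batch normalization uses only the statistics of the current batch. MAML++ accumulates running batch statistics instead, with a separate set for each inner-loop step.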

Shared (across step) Batch Normalization Bias → Per-Step Batch Normalization Weights and Biases (BNWB)

💡 In the MAML paper the authors trained their model to learn a single set of biases for each layer. Doing so assumes that the distributions of features passing through the network are similar. However, this is a false assumption since the base-model is updated for a number of times, thus making the feature distributions increasingly dissimilar from each other. To fix this problem we propose learning a set of biases per-step within the inner-loop update process. Doing so, means that batch normalization will learn biases specific to the feature distributions seen at each set, which should increase convergence speed, stability and generalization performance.

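To explain: MAML does nothing special about the biases and shares one set across all inner-loop steps. MAML++ argues that because the base model is updated at every step, the feature distribution passing through the network differs from step to step, so it learns a separate set of batch-normalization weights and biases for each inner-loop step.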
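The per-step statistics (BNRS) and per-step weights and biases (BNWB) can be sketched together. The class below is a simplified illustration, not the MAML++ implementation: it normalizes a batch of scalars, the name `PerStepBatchNorm1d` is invented here, and `gamma` and `beta` are plain floats that a real model would learn.

```python
class PerStepBatchNorm1d:
    """Scalar batch norm with one set of running stats and one (gamma, beta) per inner step."""

    def __init__(self, num_steps, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.running_mean = [0.0] * num_steps
        self.running_var = [1.0] * num_steps
        self.gamma = [1.0] * num_steps    # per-step scale (learnable in a real model)
        self.beta = [0.0] * num_steps     # per-step bias (learnable in a real model)

    def __call__(self, batch, step, training=True):
        if training:
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            # Only the statistics that belong to this inner step are updated.
            self.running_mean[step] = (1 - m) * self.running_mean[step] + m * mean
            self.running_var[step] = (1 - m) * self.running_var[step] + m * var
        else:
            mean, var = self.running_mean[step], self.running_var[step]
        scale = self.gamma[step] / (var + self.eps) ** 0.5
        return [scale * (x - mean) + self.beta[step] for x in batch]


bn = PerStepBatchNorm1d(num_steps=3)
out_step0 = bn([1.0, 3.0], step=0)       # features seen at inner step 0
out_step2 = bn([10.0, 14.0], step=2)     # features seen at inner step 2 look different
```

After these two calls only the entries for steps 0 and 2 have moved, which is the point of the design: statistics and biases collected at one step are never mixed with those of another.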

Fixed Outer Loop Learning Rate → Cosine Annealing of Meta-Optimizer Learning Rate (CA)

This simply applies cosine annealing to the meta-model's learning rate in the outer loop.
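For reference, the standard closed form of a cosine-annealed learning rate is short enough to write out. This is a generic sketch, not code from the MAML++ repository; the function name and the example values are assumptions.

```python
import math


def cosine_annealed_lr(step, total_steps, max_lr, min_lr=0.0):
    """Learning rate that follows half a cosine wave from max_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: meta learning rate over 100 outer-loop iterations
schedule = [cosine_annealed_lr(t, total_steps=100, max_lr=1e-3, min_lr=1e-5) for t in range(101)]
```

The schedule starts at `max_lr`, passes through the midpoint of the two rates halfway through training, and ends at `min_lr`.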

Direction 2: More flexible and adaptive

3 ICCV2021 | Meta-Learning With Task-Adaptive Loss Function for Few-Shot Learning (MeTAL)

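Compared with the inner-loop gradient step of MAML (Eq. 1.2), MeTAL adds learnable parameters to the loss function. Learning these parameters makes the loss function more task-adaptive.

In addition, the loss function takes as input the state of the current task at the current step, i.e. information such as the task's current loss and weights. I won't go into the details here.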

4 UAI2021 | Task Similarity Aware Meta Learning: Theory-inspired Improvement on MAML

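Starting from Lipschitz continuity and smoothness, this paper carries out a detailed theoretical analysis and derives an upper bound on the error. The conclusion is that generalization to the current task improves as the support set grows and as the meta-model and the base model become more similar (in other words, as the initialization gets closer to the optimal parameter region of the current task). Both conclusions are fairly intuitive.

Building on them, the method first trains a meta-model with MAML, then samples a number of tasks and runs MAML adaptation again to obtain one base model per task, and finally clusters these base models with K-means (each cluster center is the mean of its models).

The K clusters give downstream tasks a choice of initial model; similarity between models is measured by Euclidean distance.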
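The clustering and selection steps can be sketched on flattened parameter vectors. This is an illustration under assumptions, not the paper's code: models are short lists of floats, the helper names are invented, the initial centers are passed in by hand, and how a new task's parameter vector is obtained before the lookup is left out.

```python
def euclidean(a, b):
    """Euclidean distance between two flattened parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_vector(vectors):
    """Element-wise mean, used as the cluster center."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def nearest_center(model, centers):
    """Index of the cluster center closest to the given model."""
    return min(range(len(centers)), key=lambda k: euclidean(model, centers[k]))


def kmeans(models, init_centers, iterations=10):
    """Plain K-means over base models; empty clusters keep their previous center."""
    centers = [list(c) for c in init_centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for model in models:
            groups[nearest_center(model, centers)].append(model)
        centers = [mean_vector(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Four toy "base models" that fall into two obvious groups
base_models = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [4.8, 4.9]]
centers = kmeans(base_models, init_centers=[base_models[0], base_models[2]])
chosen = nearest_center([4.5, 5.2], centers)   # a new task's model picks its initialization
```

Here the centers come out near `[0.1, 0.0]` and `[4.9, 5.0]`, and the new model selects the second one as its starting point.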


Direction 3: Empirical observation and theoretical analysis

5 ICLR2020 | Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML (ANIL/NIL)

This paper asks which matters more in MAML, rapid learning or feature reuse, as illustrated in the following figure.

[Figure: rapid learning vs. feature reuse]

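We split the network into a body (feature extractor) and a head (classifier).

Rapid learning means that when a new task is handled, the feature representations in the body change substantially (the red dashed line in the left panel of the figure above).

Feature reuse means that the features need not change much, and simply reusing them already works well (the blue solid line in the right panel of the figure above).

To analyze this, the authors take a 4-layer CNN (counted as the body), freeze progressively more of its layers during adaptation, up to the point where only the head is fine-tuned, and find that performance barely changes. This gives good reason to believe that feature reuse is the key to MAML. The results are in the following table:

[Table: accuracy with body layers frozen during adaptation]

The similarity of the body's feature representations before and after the inner-loop update can also be analyzed with CCA/CKA. As the following figure shows, the only part that changes substantially is the head.

[Figure: CCA/CKA similarity of each layer before and after inner-loop adaptation]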

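Based on these observations, the authors propose ANIL (Almost No Inner Loop) and NIL (No Inner Loop).

In ANIL, during both training and testing, the inner loop updates only the head and never the body; the body is updated only in the outer loop.

NIL builds on ANIL: at test time it discards the head altogether and classifies from the features using cosine similarity. This amounts to skipping the inner loop at test time.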

💡 In ANIL, during training and testing, we remove the inner loop updates for the network body, and apply inner loop adaptation only to the head.

The NIL (No Inner Loop) algorithm, that removes the head entirely at test time, and uses learned features and cosine similarity to perform effective classification, thus avoiding inner loop updates altogether.

One more note: if you are interested in the implementation, I recommend the one in LibFewShot, where the difference between ANIL and MAML is easy to see. https://github.com/RL-VIG/LibFewShot
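The NIL idea of classifying without a head can be sketched in a few lines. This is a simplified reading, not the authors' code: each support label is weighted by the cosine similarity between the query feature and that support feature, and the highest-scoring class wins. The function names and the toy features are assumptions, and features are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nil_predict(query_feature, support_features, support_labels):
    """Score each class by the summed cosine similarity to its support features."""
    scores = {}
    for feature, label in zip(support_features, support_labels):
        scores[label] = scores.get(label, 0.0) + cosine(query_feature, feature)
    return max(scores, key=scores.get)


# Features as they might come out of a frozen body (values are made up)
support_features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8]]
support_labels = ["cat", "cat", "dog", "dog"]
prediction = nil_predict([0.2, 0.9], support_features, support_labels)
```

No parameter is updated at test time here, which is what lets NIL skip the inner loop entirely.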

6 ICLR2022 | The Close Relationship Between Contrastive Learning and Meta-learning

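Meta-learning training consists of 1) sampling a task (a random batch of classes) and 2) updating the feature extractor.

This procedure closely resembles contrastive learning.

Contrastive learning training consists of 1) sampling data (a batch of images) and augmenting them (in meta-learning terms, an image together with its augmented views corresponds to one class) and 2) updating the feature extractor (the same as in meta-learning).

The paper finds that when the data sampling strategies match, meta-learning and contrastive learning algorithms perform similarly. In addition, some meta-learning techniques can improve contrastive learning performance.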

7 ICLR2022 | How to Train Your MAML to Excel in Few-Shot Classification

I personally like this paper a lot. Based on extensive experiments, it raises two issues:

MAML needs a large number of inner loop gradient steps
MAML is sensitive to the label permutations in meta-testing

MAML needs a large number of inner loop gradient steps

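On the first issue: meta-learning aims at fast adaptation, so the inner loop usually takes only a few update steps on each task's support set. But as the paper's results show (figure below), more inner-loop steps tend to give better results.

[Figure: accuracy vs. number of inner-loop steps]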

MAML is sensitive to the label permutations in meta-testing

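The authors run an interesting experiment: for each task, they permute the order of its labels. (For instance, the class “dog” may be assigned to c = 1 and paired with w1 at the current task, but to c = 2 and paired with w2 when it is sampled again.) The procedure is clearest from the following flow chart.

[Figure: procedure of the label-permutation experiment]

The permutation has a large effect on the model's predictions. The following figure summarizes the results of 120 permutations for each of 2k tasks, across datasets and task settings; the effect is clearly visible.

[Figure: accuracy spread across label permutations]

The authors therefore propose making MAML permutation-invariant.

We again split the network into a body and a head, where the head is the classifier. For an N-way task the head weight has dimension d x N. The proposed method learns a single d x 1 vector m and uses it to initialize each of the N head vectors, which are d x 1 each. In the inner loop the head is updated as usual. In the outer loop the gradients of the head weights are used to update m.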

It is clearer to look at the open-source code directly:


        
def forward(self, data_shot, data_query):
    # Set the initial classifier.
    # self.fcone is the vector m used to initialize every class of the head.
    self.encoder.fc.weight.data = self.fcone.weight.data.repeat(self.args.way, 1)
    self.encoder.fc.bias.data = self.fcone.bias.data.repeat(self.args.way)

    # Inner loop: update with gradient descent.
    # This updates self.encoder.fc, i.e. the parameters of the head.
    updated_params, acc_gradients = inner_train_step(self.encoder, data_shot, self.args)

    # Re-update from the initial classifier with the accumulated gradients,
    # so that the head stays connected to self.fcone in the computation graph.
    updated_params['fc.weight'] = self.fcone.weight.repeat(self.args.way, 1) - self.args.gd_lr * acc_gradients[0]
    updated_params['fc.bias'] = self.fcone.bias.repeat(self.args.way) - self.args.gd_lr * acc_gradients[1]

    # Outer loop: the loss on these logits is what updates self.fcone.
    logitis = self.encoder(data_query, updated_params) / self.args.temperature

    return logitis
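Why a shared initialization gives permutation invariance can be checked on a toy head. The sketch below is an illustration under assumptions, not the paper's code: the body is skipped (inputs are used as features directly), the head has no bias, and the names `adapt_head` and `predict` are invented. Since every class starts from the same vector m, relabeling the support set only relabels the rows of the adapted head.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_head(m, support_x, support_y, num_way, lr=0.5, steps=3):
    """Inner loop on the head only: every class row starts from the same vector m."""
    head = [list(m) for _ in range(num_way)]
    for _ in range(steps):
        grads = [[0.0] * len(m) for _ in range(num_way)]
        for x, y in zip(support_x, support_y):
            probs = softmax([sum(w * v for w, v in zip(row, x)) for row in head])
            for c in range(num_way):
                err = probs[c] - (1.0 if c == y else 0.0)   # d(cross-entropy)/d(logit_c)
                for d in range(len(m)):
                    grads[c][d] += err * x[d] / len(support_x)
        head = [[w - lr * g for w, g in zip(row, grad_row)] for row, grad_row in zip(head, grads)]
    return head


def predict(head, x):
    logits = [sum(w * v for w, v in zip(row, x)) for row in head]
    return max(range(len(head)), key=lambda c: logits[c])


m = [0.3, -0.2]
support_x = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
support_y = [0, 1, 2]
perm = [2, 0, 1]                      # relabel class c as perm[c]
query = [0.9, 0.1]

head_a = adapt_head(m, support_x, support_y, num_way=3)
head_b = adapt_head(m, support_x, [perm[y] for y in support_y], num_way=3)
pred_a = predict(head_a, query)
pred_b = predict(head_b, query)       # the same class as pred_a, under its new label
```

In the real method the vector m is itself meta-learned: because the head is built by repeating m, the gradient that reaches m in the outer loop is the sum of the gradients of the N class vectors.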