[Detection | RCNN Series 3] Understanding Faster RCNN Through Principles and Source Code (paper download at the end of the article)


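Recommended reading from this series:

[Detection | RCNN Series 1] RCNN, the Pioneering Work in Object Detection (with instructions for obtaining the paper)

[Detection | RCNN Series 2] The Fast RCNN Object Detection Algorithm (with instructions for obtaining the paper)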


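Building on R-CNN and Fast RCNN, Girshick and his co-authors (Ren, He, Girshick, and Sun) proposed Faster RCNN in 2015. Structurally, Faster RCNN integrates feature extraction, bounding box regression, and object classification into a single network, which improves overall performance considerably, with the gain in detection speed being especially pronounced.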

The Faster RCNN Algorithm

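Faster RCNN unifies the four basic steps of object detection (region proposal, feature extraction, classification, and bounding box regression) in a single deep learning model. Region proposals are generated by a Region Proposal Network (RPN), which replaces the selective search (SS) algorithm used in Fast RCNN, while feature extraction, classification, and bounding box regression still follow the Fast RCNN approach. Proposal extraction is thereby fused with the Fast RCNN back end into one complete convolutional neural network, making this the first model that is end-to-end in the true sense.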

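Faster RCNN consists of two main modules:

1. The RPN, which extracts the candidate boxes;

2. The final classification and bounding box regression, which still use the Fast RCNN detection module, namely RoI Pooling and the multi-task loss function.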
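
The multi-task loss named in item 2 can be sketched as below. This is a minimal illustration in plain Python, assuming the Fast RCNN formulation (log loss on the true class plus a Smooth L1 term on the box offsets for non-background RoIs). The function names, the `loc_weight` parameter, and the convention that class index 0 is background are assumptions made for this example, not the library's API.

```python
import math


def smooth_l1(diff):
    """Smooth L1: quadratic near zero, linear for large errors."""
    abs_diff = abs(diff)
    if abs_diff < 1.0:
        return 0.5 * diff * diff
    return abs_diff - 0.5


def multi_task_loss(class_probs, true_class, predicted_deltas, target_deltas,
                    loc_weight=1.0):
    """Classification loss plus (for non-background RoIs) box regression loss.

    class_probs: softmax probabilities, index 0 assumed to be background.
    predicted_deltas / target_deltas: 4 box offsets (dx, dy, dw, dh).
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so no regression term.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t)
                   for p, t in zip(predicted_deltas, target_deltas))
    return cls_loss + loc_weight * loc_loss
```

Smooth L1 grows only linearly for large errors, so a single badly placed box does not dominate the gradient the way a squared error would.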

1 Algorithm Steps

Figure 1: Faster RCNN model architecture

Figure 2: Faster RCNN training pipeline

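1. First, the original image is fed into a convolutional neural network, and the features of the last convolutional layer serve as input to the later layers. These features go down two paths and are shared by the RPN and the RoI Pooling layer (this is the same RoI Pooling layer described in the previous article; see that article for details).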

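2. The RPN generates the candidate region boxes from the shared feature map. At every spatial location of the feature map it scores a fixed set of anchors, and the highest-scoring boxes that survive NMS are kept as proposals (for example, 300 per image). The number of proposals depends on the spatial locations and the anchors per location, not on the channel count: a feature map with 256 channels does not yield 256*300 = 76800 proposals. The purpose of the RPN is to replace the time-consuming selective search (SS) over the input image.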

3. The candidate boxes produced by the RPN are passed to the RoI Pooling layer, which turns each candidate region into a fixed-size RoI Pooling feature map.

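4. The last step is the same as in Fast RCNN: a softmax loss yields the classification probabilities and a Smooth L1 loss drives the bounding box regression. The RPN heads follow a similar two-branch layout. Suppose the feature map they receive has size (32, 32, 256). The classification layer outputs, for every location, the foreground and background probabilities of 9 candidate boxes (anchors), so its output has size (32, 32, 9*2). The box regression layer outputs, for every location, the translation and scaling parameters of the 9 anchors, so its output has size (32, 32, 9*4).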
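
Step 3 relies on RoI pooling producing the same output size regardless of the region's size. The sketch below shows the idea on a single-channel feature map stored as a nested list. The function name `roi_max_pool` and the floor/ceil rule for splitting the region into bins are assumptions for illustration, not the library's implementation (the torchvision excerpt in this article uses `MultiScaleRoIAlign` instead).

```python
import math


def roi_max_pool(feature_map, roi, output_size):
    """Max-pool a region into an output_size x output_size grid.

    roi = (x1, y1, x2, y2) in feature-map indices, both ends inclusive.
    """
    x1, y1, x2, y2 = roi
    roi_w = x2 - x1 + 1
    roi_h = y2 - y1 + 1
    pooled = []
    for i in range(output_size):
        # Rows covered by output bin i (bins may overlap by one row).
        r0 = y1 + math.floor(i * roi_h / output_size)
        r1 = y1 + math.ceil((i + 1) * roi_h / output_size)
        row = []
        for j in range(output_size):
            c0 = x1 + math.floor(j * roi_w / output_size)
            c1 = x1 + math.ceil((j + 1) * roi_w / output_size)
            values = [feature_map[r][c]
                      for r in range(r0, r1) for c in range(c0, c1)]
            row.append(max(values))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

# Two regions of different sizes both come out as 2x2 grids.
print(roi_max_pool(feature_map, (1, 0, 4, 3), 2))  # [[9, 11], [21, 23]]
print(roi_max_pool(feature_map, (0, 0, 2, 2), 2))  # [[8, 9], [14, 15]]
```

Because every region ends up with the same shape, the fully connected layers that follow can process all proposals with one set of weights.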

Faster RCNN network source code (this excerpt appears to match the `FasterRCNN` constructor in torchvision):


        
          
```python
class FasterRCNN(GeneralizedRCNN):

    def __init__(self, backbone, num_classes=None,
                 # transform parameters
                 min_size=800, max_size=1333,
                 image_mean=None, image_std=None,
                 # RPN parameters
                 rpn_anchor_generator=None, rpn_head=None,
                 rpn_pre_nms_top_n_train=2000, rpn_pre_nms_top_n_test=1000,
                 rpn_post_nms_top_n_train=2000, rpn_post_nms_top_n_test=1000,
                 rpn_nms_thresh=0.7,
                 rpn_fg_iou_thresh=0.7, rpn_bg_iou_thresh=0.3,
                 rpn_batch_size_per_image=256, rpn_positive_fraction=0.5,
                 # Box parameters
                 box_roi_pool=None, box_head=None, box_predictor=None,
                 box_score_thresh=0.05, box_nms_thresh=0.5, box_detections_per_img=100,
                 box_fg_iou_thresh=0.5, box_bg_iou_thresh=0.5,
                 box_batch_size_per_image=512, box_positive_fraction=0.25,
                 bbox_reg_weights=None):

        out_channels = backbone.out_channels

        if rpn_anchor_generator is None:
            anchor_sizes = ((32,), (64,), (128,), (256,), (512,))
            aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)
            rpn_anchor_generator = AnchorGenerator(
                anchor_sizes, aspect_ratios
            )
        if rpn_head is None:
            rpn_head = RPNHead(
                out_channels, rpn_anchor_generator.num_anchors_per_location()[0]
            )

        rpn_pre_nms_top_n = dict(training=rpn_pre_nms_top_n_train, testing=rpn_pre_nms_top_n_test)
        rpn_post_nms_top_n = dict(training=rpn_post_nms_top_n_train, testing=rpn_post_nms_top_n_test)

        rpn = RegionProposalNetwork(
            rpn_anchor_generator, rpn_head,
            rpn_fg_iou_thresh, rpn_bg_iou_thresh,
            rpn_batch_size_per_image, rpn_positive_fraction,
            rpn_pre_nms_top_n, rpn_post_nms_top_n, rpn_nms_thresh)

        if box_roi_pool is None:
            box_roi_pool = MultiScaleRoIAlign(
                featmap_names=['0', '1', '2', '3'],
                output_size=7,
                sampling_ratio=2)

        if box_head is None:
            resolution = box_roi_pool.output_size[0]
            representation_size = 1024
            box_head = TwoMLPHead(
                out_channels * resolution ** 2,
                representation_size)

        if box_predictor is None:
            representation_size = 1024
            box_predictor = FastRCNNPredictor(
                representation_size,
                num_classes)

        roi_heads = RoIHeads(
            # Box
            box_roi_pool, box_head, box_predictor,
            box_fg_iou_thresh, box_bg_iou_thresh,
            box_batch_size_per_image, box_positive_fraction,
            bbox_reg_weights,
            box_score_thresh, box_nms_thresh, box_detections_per_img)

        if image_mean is None:
            image_mean = [0.485, 0.456, 0.406]
        if image_std is None:
            image_std = [0.229, 0.224, 0.225]
        transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)

        super(FasterRCNN, self).__init__(backbone, rpn, roi_heads, transform)
```

2 The RPN

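Classical detection methods spend a great deal of time generating candidate boxes. Faster RCNN drops the traditional sliding-window and SS approaches and generates the boxes directly with the RPN. This is one of its major advantages and greatly speeds up box generation: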

Figure 3: RPN architecture

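The figure above shows the concrete structure of the RPN. The network splits into two branches: the upper branch classifies the anchors as positive or negative with softmax, and the lower branch computes the bounding box regression offsets for the anchors so that accurate proposals can be obtained. The final Proposal layer combines the positive anchors with their regression offsets to produce the proposals, discarding those that are too small or that fall outside the image boundary. By the time the network reaches the Proposal layer, it has in effect already performed object localization.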
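
What the Proposal layer does with an anchor and its regression offsets can be sketched in plain Python as follows. The sketch assumes the usual R-CNN box parameterization (center shifts scaled by anchor size, log-scale width and height). The function names and the `min_size` parameter are hypothetical, and boxes that cross the image boundary are clipped here, which is one common way to handle them rather than necessarily what the figure's Proposal layer does.

```python
import math


def decode_box(anchor, deltas):
    """Apply offsets (dx, dy, dw, dh) to an anchor (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah
    dx, dy, dw, dh = deltas
    cx = acx + dx * aw          # shift the center by a fraction of the size
    cy = acy + dy * ah
    w = aw * math.exp(dw)       # scale width and height in log space
    h = ah * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def clip_box(box, image_width, image_height):
    """Clamp box corners to the image rectangle."""
    x1, y1, x2, y2 = box
    return (min(max(x1, 0.0), image_width), min(max(y1, 0.0), image_height),
            min(max(x2, 0.0), image_width), min(max(y2, 0.0), image_height))


def make_proposals(anchors, deltas, image_width, image_height, min_size):
    """Decode, clip, and drop boxes smaller than min_size on either side."""
    proposals = []
    for anchor, delta in zip(anchors, deltas):
        x1, y1, x2, y2 = clip_box(decode_box(anchor, delta),
                                  image_width, image_height)
        if x2 - x1 >= min_size and y2 - y1 >= min_size:
            proposals.append((x1, y1, x2, y2))
    return proposals
```

With all-zero offsets the decoded box equals the anchor, so the network only has to learn corrections relative to the anchor.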

Main body of the RPN code:


        
          
```python
class RegionProposalNetwork(torch.nn.Module):
    # ... (other methods elided in the original)

    def forward(self, images, features, targets=None):
        features = list(features.values())
        objectness, pred_bbox_deltas = self.head(features)
        anchors = self.anchor_generator(images, features)

        num_images = len(anchors)
        num_anchors_per_level_shape_tensors = [o[0].shape for o in objectness]
        num_anchors_per_level = [s[0] * s[1] * s[2] for s in num_anchors_per_level_shape_tensors]
        objectness, pred_bbox_deltas = \
            concat_box_prediction_layers(objectness, pred_bbox_deltas)
        # apply pred_bbox_deltas to anchors to obtain the decoded proposals
        # note that we detach the deltas because Faster R-CNN do not backprop through
        # the proposals
        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
        proposals = proposals.view(num_images, -1, 4)
        boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)

        losses = {}
        if self.training:
            assert targets is not None
            labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)
            regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)
            loss_objectness, loss_rpn_box_reg = self.compute_loss(
                objectness, pred_bbox_deltas, labels, regression_targets)
            losses = {
                "loss_objectness": loss_objectness,
                "loss_rpn_box_reg": loss_rpn_box_reg,
            }
        return boxes, losses
```
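
The body of `filter_proposals` is not shown in the excerpt. Judging from the `rpn_nms_thresh` and `rpn_post_nms_top_n` parameters, its core is non-maximum suppression, which can be sketched in plain Python as below. The function names and the greedy loop are a simplified illustration under that assumption, not the library's code.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold, top_n):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much.

    Returns the indices of at most top_n kept boxes, best score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep


boxes = [(0, 0, 10, 10), (0, 1, 10, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
# The second box overlaps the first with IoU = 90 / 110, above 0.7.
print(nms(boxes, scores, iou_threshold=0.7, top_n=10))  # [0, 2]
```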

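Code for the RPN head (`RPNHead`). Note that `cls_logits` here has `num_anchors` output channels, that is, one objectness score per anchor, rather than the 2 scores per anchor of the softmax formulation described earlier: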


        
          
```python
class RPNHead(nn.Module):
    def __init__(self, in_channels, num_anchors):
        super(RPNHead, self).__init__()
        self.conv = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=1, padding=1
        )
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        self.bbox_pred = nn.Conv2d(
            in_channels, num_anchors * 4, kernel_size=1, stride=1
        )

    def forward(self, x):
        logits = []
        bbox_reg = []
        for feature in x:
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg
```

3 Anchors

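Anchors are simply a set of predefined rectangles that the RPN places at every point of the feature map. In Faster RCNN's RPN, each feature point has n rectangles (9 by default, from 3 scales × 3 aspect ratios), as shown in the figure below. In effect, anchors bring in the multi-scale approach commonly used in detection.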

Figure 4: Anchors
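
The 9 anchor shapes can be reproduced with a few lines of plain Python. The sketch follows the same arithmetic as the `generate_anchors` method in this article's `AnchorGenerator` excerpt (height scaled by the square root of the ratio, width by its inverse), but without tensors or rounding. The scales (128, 256, 512) are the ones commonly associated with the original paper and are an assumption here, as is the function name.

```python
import math


def generate_base_anchors(scales, aspect_ratios):
    """Return anchors (x1, y1, x2, y2) centered on the origin.

    For each aspect ratio (height / width) and each scale, the anchor
    keeps an area of about scale**2 while changing its shape.
    """
    anchors = []
    for ratio in aspect_ratios:
        h_ratio = math.sqrt(ratio)
        w_ratio = 1.0 / h_ratio
        for scale in scales:
            w = w_ratio * scale
            h = h_ratio * scale
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors


base_anchors = generate_base_anchors((128, 256, 512), (0.5, 1.0, 2.0))
for anchor in base_anchors:
    print(anchor)
```

Each anchor is expressed relative to its own center, so the same 9 shapes can later be shifted to every feature-map location.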

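As shown in the figure below, we traverse the feature maps computed by the conv layers and attach these 9 anchors to every point as the initial detection boxes. Boxes obtained this way are quite inaccurate, but two later rounds of bounding box regression correct their positions.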

Figure 5: Anchor generation

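In effect, the RPN lays out a dense set of candidate anchors at the scale of the original image, then uses a CNN to decide which anchors are positive (they contain an object) and which are negative (they contain none). So it is just a binary classification!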
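
During training, the positive and negative labels come from the overlap between each anchor and the ground-truth boxes. A minimal sketch, assuming the IoU thresholds 0.7 and 0.3 that appear as `rpn_fg_iou_thresh` and `rpn_bg_iou_thresh` in the constructor earlier in this article. The function names and the label values (1, 0, -1) are illustrative, and the additional rule that promotes the best-matching anchor for each ground-truth box is left out for brevity.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, fg_thresh=0.7, bg_thresh=0.3):
    """Return 1 (positive), 0 (negative), or -1 (ignored in the loss)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= fg_thresh:
        return 1
    if best < bg_thresh:
        return 0
    return -1  # ambiguous overlap: contributes to neither class


gt_boxes = [(0, 0, 10, 10)]
print(label_anchor((0, 1, 10, 11), gt_boxes))    # 1, IoU = 90 / 110
print(label_anchor((0, 5, 10, 15), gt_boxes))    # -1, IoU = 50 / 150
print(label_anchor((20, 20, 30, 30), gt_boxes))  # 0, no overlap
```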

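As an example, take an original image of size 800*600. VGG16 downsamples it by a factor of 16, and 9 anchors are placed at every point of the final output feature map, so the total number of anchors for this image is: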

Figure 6: Generate Anchors

This gives 17100 anchors in total: ceil(800/16) × ceil(600/16) × 9 = 50 × 38 × 9 = 17100.
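
The count can be checked by actually tiling the base anchors over the feature map. The sketch below is plain Python with hypothetical function names. It assumes the feature-map size is rounded up (which is what makes 600/16 = 37.5 count as 38 rows and yields 17100) and uses illustrative scales of 128, 256, and 512.

```python
import math


def make_base_anchors(scales=(128, 256, 512), aspect_ratios=(0.5, 1.0, 2.0)):
    """9 origin-centered anchors: 3 aspect ratios x 3 scales."""
    anchors = []
    for ratio in aspect_ratios:
        h_ratio = math.sqrt(ratio)
        w_ratio = 1.0 / h_ratio
        for scale in scales:
            w, h = w_ratio * scale, h_ratio * scale
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors


def grid_anchors(image_width, image_height, stride, base_anchors):
    """Shift every base anchor to every feature-map location."""
    feat_w = math.ceil(image_width / stride)
    feat_h = math.ceil(image_height / stride)
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Position of this feature-map cell in image coordinates.
            sx, sy = gx * stride, gy * stride
            for x1, y1, x2, y2 in base_anchors:
                anchors.append((x1 + sx, y1 + sy, x2 + sx, y2 + sy))
    return anchors


all_anchors = grid_anchors(800, 600, 16, make_base_anchors())
print(len(all_anchors))  # 50 * 38 * 9 = 17100
```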


        
          
```python
class AnchorGenerator(nn.Module):
    # ... (elided in the original)

    def generate_anchors(self, scales, aspect_ratios, dtype=torch.float32, device="cpu"):
        # type: (List[int], List[float], int, Device)  # noqa: F821
        scales = torch.as_tensor(scales, dtype=dtype, device=device)
        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)
        h_ratios = torch.sqrt(aspect_ratios)
        w_ratios = 1 / h_ratios

        ws = (w_ratios[:, None] * scales[None, :]).view(-1)
        hs = (h_ratios[:, None] * scales[None, :]).view(-1)

        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2
        return base_anchors.round()

    def set_cell_anchors(self, dtype, device):
        # type: (int, Device) -> None    # noqa: F821
        # ... (elided in the original)

        cell_anchors = [
            self.generate_anchors(
                sizes,
                aspect_ratios,
                dtype,
                device
            )
            for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)
        ]
        self.cell_anchors = cell_anchors
```

4 Classification

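The classification stage takes the proposal feature maps obtained so far and, through fully connected layers and softmax, computes which class each proposal belongs to (for example person, car, or TV), outputting the probability vector cls_prob. It also applies bounding box regression once more to obtain a position offset bbox_pred for each proposal, which is used to regress a more accurate detection box.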
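
How the raw class scores of one proposal become the cls_prob vector can be sketched in plain Python as follows. The function names, the class list, and the example scores are made up for illustration. In the real network the scores come from the fully connected layers.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_proposal(scores, class_names):
    """Return the most likely class name and its probability."""
    cls_prob = softmax(scores)
    best = max(range(len(cls_prob)), key=lambda i: cls_prob[i])
    return class_names[best], cls_prob[best]


class_names = ["background", "person", "car", "tv"]
name, prob = classify_proposal([0.1, 2.0, 0.3, -1.0], class_names)
print(name, round(prob, 3))
```

A proposal whose most likely class is background would simply be dropped, and the bbox_pred offsets for the winning class would be applied to the rest.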
