Deformable Convolutional Networks

Dlaiml 2021. 9. 25. 15:18

(A short summary for understanding the structure)

Abstract

  • CNNs are inherently limited in modeling geometric transformations because of their fixed geometric structure
  • Two modules are proposed to improve the transformation modeling capacity of CNNs
    • deformable convolution
    • deformable RoI pooling
  • Both modules shift the spatial sampling locations with offsets; the offsets are learned while performing the target task, without additional supervision.

1. Introduction

  • Visual recognition tasks must handle geometric variation, i.e. model geometric transformations.
  • Previously this was handled through augmentation, or with hand-crafted features and pipelines such as SIFT (scale-invariant feature transform) and sliding-window based object detection.
  • The problem with the augmentation approach is that it assumes the geometric transformations are known and fixed. → Augmentation can only be designed from prior knowledge such as "objects will appear at different scales" or "objects will be rotated".
  • The problem with hand-crafted methods is that they cannot handle complex transformations (even for known transformations).
  • CNNs are very effective for visual recognition tasks, but their capacity for geometric transformations comes mostly from augmentation, large models, and simple hand-crafted modules (e.g. max pooling for small translation invariance).
  • All activation units in the same CNN layer share the same receptive field, yet an adaptive method is needed to handle high-level semantics at different locations, scales, and deformations.
  • Object detection methods are advancing quickly, but they still rely on traditional bounding-box based feature extraction.
  • This approach is sub-optimal, especially for non-rectangular objects.
  • Two modules are proposed to improve the geometric transformation modeling capability of CNNs.

2. Deformable Convolutional Networks

  • A normal convolution with dilation=1 computes each output value as

    y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n)

    • R is the regular sampling grid over the input feature map, e.g. R = {(-1,-1), (-1,0), ..., (0,1), (1,1)} for a 3x3 kernel with dilation 1; p_0 is one location of the output feature map y.
    • For a 3x3 kernel, |R| = 9.
  • Deformable convolution adds a learned offset Δp_n to every sampling location:

    y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)

    Since Δp_n is typically fractional, x(p_0 + p_n + Δp_n) is computed by bilinear interpolation.
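
The two equations above can be written out directly. Below is a minimal toy sketch (my own illustrative code, not the paper's or the repository's implementation) that computes a single output value y(p_0), first over the regular grid R and then with fractional offsets resolved by bilinear interpolation:

import numpy as np

def bilinear(img, y_f, x_f):
    """Sample img at a fractional location (y_f, x_f) by bilinear interpolation."""
    h, w = img.shape
    y0, x0 = int(np.floor(y_f)), int(np.floor(x_f))
    wy, wx = y_f - y0, x_f - x0  # fractional parts
    y0c, y1c = np.clip([y0, y0 + 1], 0, h - 1)
    x0c, x1c = np.clip([x0, x0 + 1], 0, w - 1)
    return ((1 - wy) * (1 - wx) * img[y0c, x0c] + (1 - wy) * wx * img[y0c, x1c]
            + wy * (1 - wx) * img[y1c, x0c] + wy * wx * img[y1c, x1c])

# R: the sampling grid of a 3x3 kernel with dilation 1
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

img = np.arange(25, dtype=np.float32).reshape(5, 5)  # toy input feature map x
wk = np.full(9, 1.0 / 9.0, dtype=np.float32)         # kernel weights w(p_n)
p0 = (2, 2)                                          # one output location

# normal convolution: y(p_0) = sum_n w(p_n) * x(p_0 + p_n)
y_normal = sum(wk[n] * img[p0[0] + dy, p0[1] + dx] for n, (dy, dx) in enumerate(R))

# deformable convolution: y(p_0) = sum_n w(p_n) * x(p_0 + p_n + dp_n)
dp = np.random.uniform(-0.5, 0.5, size=(9, 2))  # the offsets are learned in practice
y_deform = sum(wk[n] * bilinear(img, p0[0] + dy + dp[n, 0], p0[1] + dx + dp[n, 1])
               for n, (dy, dx) in enumerate(R))
print(y_normal, y_deform)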

 

Code

  • Modifying the sampling locations of Conv2D itself is required; a from-scratch implementation is very slow, so I studied the existing implementation below line by line.
"""
https://github.com/DHZS/tf-deformable-conv-layer
"""

import tensorflow as tf
from tensorflow.keras.layers import Conv2D

import os

os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # force CPU execution


class DeformableConvLayer(Conv2D):
    """Only support "channel last" data format"""

    def __init__(
        self,
        filters,  # output channels
        kernel_size,  # kernel_size
        strides=(1, 1),  # strides
        padding="valid",  # No padding
        data_format=None,  # `channels_last` (default) or `channels_first`
        dilation_rate=(
            1,
            1,
        ),  # an integer or tuple/list of 2 integers, specifying the dilation rate to use for dilated convolution.
        num_deformable_group=None,  # number of filters share same offset
        activation=None,  # act func
        use_bias=True,  # use_bias
        kernel_initializer="glorot_uniform",  # initializer for kernel
        bias_initializer="zeros",  # initializer for bias
        kernel_regularizer=None,  # regularizer for kernel, keras.regularizers object
        bias_regularizer=None,  # regularizer for bias, keras.regularizers object
        activity_regularizer=None,  # Regularizer function applied to the output of the layer
        kernel_constraint=None,  # Constraint function applied to the kernel weight matrix, ex. keras.constraints.MaxNorm
        bias_constraint=None,  # Constraint function applied to the bias weight matrix, ex. keras.constraints.MaxNorm
        **kwargs
    ):
        """
        `kernel_size`, `strides` and `dilation_rate` must have the same value in both axis.
        :param num_deformable_group: split output channels into groups, offset shared in each group. If
        this parameter is None, then set  num_deformable_group=filters.

        if num_deformable_group=None=filters -> one feature map share same offset

        """
        super().__init__(
            filters=filters,
            kernel_size=kernel_size,
            strides=strides,
            padding=padding,
            data_format=data_format,
            dilation_rate=dilation_rate,
            activation=activation,
            use_bias=use_bias,
            kernel_initializer=kernel_initializer,
            bias_initializer=bias_initializer,
            kernel_regularizer=kernel_regularizer,
            bias_regularizer=bias_regularizer,
            activity_regularizer=activity_regularizer,
            kernel_constraint=kernel_constraint,
            bias_constraint=bias_constraint,
            **kwargs
        )
        self.kernel = None
        self.bias = None
        self.offset_layer_kernel = None
        self.offset_layer_bias = None
        if num_deformable_group is None:
            num_deformable_group = filters
        if filters % num_deformable_group != 0:
            raise ValueError('"filters" mod "num_deformable_group" must be zero')
        self.num_deformable_group = num_deformable_group

    def build(self, input_shape):
        input_dim = int(input_shape[-1])  #! input_dim = 3
        # kernel_shape = self.kernel_size + (input_dim, self.filters)
        # we want to use depth-wise conv
        kernel_shape = self.kernel_size + (
            self.filters * input_dim,
            1,
        )  # (k, k, filters * input_dim, 1) for depth-wise conv
        #! kernel_shape = (3, 3, 16 * 3, 1) = (3, 3, 48, 1)
        self.kernel = self.add_weight(  # Adds a new variable to the layer., tf.keras.layers
            name="kernel",
            shape=kernel_shape,
            initializer=self.kernel_initializer,
            regularizer=self.kernel_regularizer,
            constraint=self.kernel_constraint,
            trainable=True,
            dtype=self.dtype,
        )
        if self.use_bias:
            self.bias = self.add_weight(
                name="bias",
                shape=(self.filters,),
                initializer=self.bias_initializer,
                regularizer=self.bias_regularizer,
                constraint=self.bias_constraint,
                trainable=True,
                dtype=self.dtype,
            )

        # create offset conv layer
        offset_num = (
            self.kernel_size[0]
            * self.kernel_size[1]
            * self.num_deformable_group  # num of total offset = kernel_w * kernel_h * deformable_group
        )
        #! offset_num = 3 * 3 * 16 = 144
        self.offset_layer_kernel = self.add_weight(
            name="offset_layer_kernel",
            shape=self.kernel_size
            + (
                input_dim,
                offset_num * 2,
            ),  # 2 means x and y axis ( kernel_w, kernel_h, input_dim, offset_num * 2)
            initializer=tf.zeros_initializer(),
            regularizer=self.kernel_regularizer,
            trainable=True,
            dtype=self.dtype,
        )  #! offset_layer_kernel = (3, 3, 3, 288) , (k,k,input_dim, output_dim)
        self.offset_layer_bias = self.add_weight(
            name="offset_layer_bias",
            shape=(offset_num * 2,),  #! 288
            initializer=tf.zeros_initializer(),
            # initializer=tf.random_uniform_initializer(-5, 5),
            regularizer=self.bias_regularizer,
            trainable=True,
            dtype=self.dtype,
        )
        self.built = True

    def call(self, inputs, training=None, **kwargs):
        # get offset, shape [batch_size, out_h, out_w, filter_h * filter_w * num_deformable_group * 2]
        offset = tf.nn.conv2d(
            inputs,
            filters=self.offset_layer_kernel,
            strides=[1, *self.strides, 1],
            padding=self.padding.upper(),
            dilations=[1, *self.dilation_rate, 1],
        )  #! offset = (32, 126, 126, 288)
        offset += self.offset_layer_bias
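        # note: the offset kernel and bias are zero-initialized in build(), so at the
        # start of training every offset is 0 and the layer behaves exactly like a
        # regular convolution; the offsets are then learned end-to-end from the task loss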

        # add padding if needed
        inputs = self._pad_input(inputs)
        #! inputs: (32, 128, 128, 3)
        # some length
        batch_size = int(inputs.get_shape()[0])  #! 32
        channel_in = int(inputs.get_shape()[-1])  #! 3
        in_h, in_w = [
            int(i) for i in inputs.get_shape()[1:3]
        ]  # input feature map size #! 128, 128
        out_h, out_w = [
            int(i) for i in offset.get_shape()[1:3]  #! 126, 126
        ]  # output feature map size
        filter_h, filter_w = self.kernel_size  #! 3, 3

        # get x, y axis offset
        offset = tf.reshape(
            offset, [batch_size, out_h, out_w, -1, 2]
        )  #! (32, 126, 126, 144, 2)
        y_off, x_off = offset[:, :, :, :, 0], offset[:, :, :, :, 1]
        #! y_off = (32, 126, 126, 144), x_off = (32, 126, 126, 144)

        # input feature map gird coordinates
        y, x = self._get_conv_indices([in_h, in_w]) #! y=(1, 126, 126, 9), x=(1, 126, 126, 9)
        y, x = [tf.expand_dims(i, axis=-1) for i in [y, x]] #! x,y = (1, 126, 126, 9, 1)
        """
        tf.tile(input, multiple)
        a = tf.constant([[1,2,3],[4,5,6]], tf.int32)
        b = tf.constant([1,2], tf.int32)
        tf.tile(a, b)
        = [[1,2,3,1,2,3],
           [4,5,6,4,5,6]]]
        input a를, multiple인 b([1,2])에 따라 행을 1번 반복, 열을 2번 반복
        tf.tile(a, [2,1]) 
        = [[1,2,3],
           [4,5,6],
           [1,2,3],
           [4,5,6]]
        """
        y, x = [
            tf.tile(i, [batch_size, 1, 1, 1, self.num_deformable_group]) for i in [y, x]
        ] #! x,y = (32, 126, 126, 9, 16) 
        y, x = [tf.reshape(i, [*i.shape[0:3], -1]) for i in [y, x]] #! x,y = (32, 126, 126, 144)
        y, x = [tf.cast(i, dtype=tf.float32) for i in [y, x]]

        # add offset
        y, x = y + y_off, x + x_off #! add offset to sampling location index
        y = tf.clip_by_value(y, 0, in_h - 1) #! clip for edge location
        x = tf.clip_by_value(x, 0, in_w - 1)

        # get four coordinates of points around (x, y)
        y0, x0 = [tf.cast(tf.floor(i), dtype=tf.int32) for i in [y, x]] #! offsets are fractional; floor gives the integer corner coordinates
        y1, x1 = y0 + 1, x0 + 1
        # clip
        y0, y1 = [tf.clip_by_value(i, 0, in_h - 1) for i in [y0, y1]]
        x0, x1 = [tf.clip_by_value(i, 0, in_w - 1) for i in [x0, x1]]
        #! x0,x1,y0,y1 = (32, 126, 126, 144), integer corner coordinates around each sampling location
        # get pixel values
        indices = [[y0, x0], [y0, x1], [y1, x0], [y1, x1]] # 4 coordinates
        p0, p1, p2, p3 = [
            DeformableConvLayer._get_pixel_values_at_point(inputs, i) for i in indices
        ] #! pixel values sampled at the four corner locations
        #! p0~p3 = (32, 126, 126, 144, 3)

        # cast to float
        x0, x1, y0, y1 = [tf.cast(i, dtype=tf.float32) for i in [x0, x1, y0, y1]]
        # weights
        w0 = (y1 - y) * (x1 - x)
        w1 = (y1 - y) * (x - x0)
        w2 = (y - y0) * (x1 - x)
        w3 = (y - y0) * (x - x0)
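        # w0..w3 are the standard bilinear interpolation weights: each one is the area
        # of the rectangle spanned by (y, x) and the opposite corner, so the four
        # weights sum to 1 (up to clipping at the borders)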
        # expand dim for broadcast
        w0, w1, w2, w3 = [tf.expand_dims(i, axis=-1) for i in [w0, w1, w2, w3]]
        # bilinear interpolation
        pixels = tf.add_n([w0 * p0, w1 * p1, w2 * p2, w3 * p3])
        #! pixels = (32, 126, 126, 144, 3)
        # reshape the "big" feature map
        pixels = tf.reshape(
            pixels,
            [
                batch_size,
                out_h,
                out_w,
                filter_h,
                filter_w,
                self.num_deformable_group,
                channel_in,
            ],
        ) #! pixels = (32, 126, 126, 3, 3, 16, 3)
        pixels = tf.transpose(pixels, [0, 1, 3, 2, 4, 5, 6]) #! pixels = (32, 126, 3, 126, 3, 16, 3)
        pixels = tf.reshape(
            pixels,
            [
                batch_size,
                out_h * filter_h,
                out_w * filter_w,
                self.num_deformable_group,
                channel_in,
            ],
        )
        #! pixels = (32, 378, 378, 16, 3)
        # copy channels to same group
        feat_in_group = self.filters // self.num_deformable_group
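        # replicate each group's sampled values for the feat_in_group output channels
        # that share its offsets (here feat_in_group = 16 // 16 = 1)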
        pixels = tf.tile(pixels, [1, 1, 1, 1, feat_in_group])
        pixels = tf.reshape(
            pixels, [batch_size, out_h * filter_h, out_w * filter_w, -1] #! (32, 378, 378, 48)
        )
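        # the sampled values now form a "big" feature map in which every output pixel's
        # k x k sampled patch is laid out spatially; the depth-wise conv below, with
        # stride (filter_h, filter_w), computes the weighted sum over each patch per
        # channel in a single op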

        # depth-wise conv
        out = tf.nn.depthwise_conv2d(
            pixels, self.kernel, [1, filter_h, filter_w, 1], "VALID"
        )
        # add the output feature maps in the same group
        #! out = (32, 126, 126, 48)
        out = tf.reshape(out, [batch_size, out_h, out_w, self.filters, channel_in])
        #! out = (32, 126, 126, 16, 3)
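        # summing over the last (input-channel) axis completes the full convolution:
        # each output channel is the sum of its per-input-channel depth-wise responses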
        out = tf.reduce_sum(out, axis=-1)
        if self.use_bias:
            out += self.bias
        return self.activation(out) #! (32, 126, 126, 16)

    def _pad_input(self, inputs):
        """Check if input feature map needs padding, because we don't use the standard Conv() function.
        :param inputs:
        :return: padded input feature map
        """
        # When padding is 'same', we should pad the feature map.
        # if padding == 'same', output size should be `ceil(input / stride)`
        if self.padding == "same":
            in_shape = inputs.get_shape().as_list()[1:3]  #! (128, 128)
            padding_list = []
            for i in range(2):
                filter_size = self.kernel_size[i]  #! 3
                dilation = self.dilation_rate[i]  #! 1
                dilated_filter_size = filter_size + (filter_size - 1) * (
                    dilation - 1
                )  #! 3
                same_output = (in_shape[i] + self.strides[i] - 1) // self.strides[i]  #! 128
                valid_output = (
                    in_shape[i] - dilated_filter_size + self.strides[i]
                ) // self.strides[i]
                if same_output == valid_output:
                    padding_list += [0, 0]
                else:
                    p = dilated_filter_size - 1
                    p_0 = p // 2
                    padding_list += [p_0, p - p_0]
            if sum(padding_list) != 0:
                padding = [
                    [0, 0],
                    [padding_list[0], padding_list[1]],  # top, bottom padding
                    [padding_list[2], padding_list[3]],  # left, right padding
                    [0, 0],
                ]
                inputs = tf.pad(inputs, padding)
        return inputs

    def _get_conv_indices(self, feature_map_size): #! (128,128)
        """the x, y coordinates in the window when a filter sliding on the feature map
        :param feature_map_size:
        :return: y, x with shape [1, out_h, out_w, filter_h * filter_w]
        """
        feat_h, feat_w = [int(i) for i in feature_map_size[0:2]] #! 128, 128
        """
        tf.meshgrid
        x = [1, 2, 3]
        y = [4, 5, 6]
        X, Y = tf.meshgrid(x, y)
        X = [[1, 2, 3],
            [1, 2, 3],
            [1, 2, 3]]
        Y = [[4, 4, 4],
            [5, 5, 5],
            [6, 6, 6]]
        """
        #! x = (128,128), y = (128,128)
        x, y = tf.meshgrid(tf.range(feat_w), tf.range(feat_h))  #! indices 0~127 along each axis
        x, y = [
            tf.reshape(i, [1, *i.get_shape(), 1]) for i in [x, y]
        ]  # shape [1, h, w, 1]
        #! x = (1, 128, 128, 1), y = (1, 128, 128, 1)
        # https://www.tensorflow.org/api_docs/python/tf/image/extract_patches
        x, y = [
            tf.image.extract_patches(
                i,
                [1, *self.kernel_size, 1],
                [1, *self.strides, 1],
                [1, *self.dilation_rate, 1],
                "VALID",
            )
            for i in [x, y]
        ]  # shape [1, out_h, out_w, filter_h * filter_w]
        #! records, for every sliding position of the kernel, the input indices it covers
        #! shape [1, 126, 126, 3 * 3]: with stride 1, the 9 kernel taps slide over 126 * 126 output positions
        return y, x

    @staticmethod
    def _get_pixel_values_at_point(inputs, indices):
        """get pixel values
        :param inputs:
        :param indices: shape [batch_size, H, W, I], I = filter_h * filter_w * channel_out
        :return:
        """
        #! inputs = (32, 128, 128, 3), indices = [y0, x0] with y0, x0 = (32, 126, 126, 144)
        y, x = indices
        batch, h, w, n = y.get_shape().as_list()[0:4]
        
        batch_idx = tf.reshape(tf.range(0, batch), (batch, 1, 1, 1)) #! (32, 1, 1, 1) (0 ~ 31)
        b = tf.tile(batch_idx, (1, h, w, n))  #! b = (32, 126, 126, 144), batch indices 0~31 copied by tile
        pixel_idx = tf.stack([b, y, x], axis=-1) #! pixel_idx = (32, 126, 126, 144, 3)
        """
        tf.gather_nd(indices, params)
        indices에 따라 sampling 하여 return
        output = [params[0][0][1], params[1][0][1]]

        indices = [[0, 0, 1], [1, 0, 1]]
        params = [[['a0', 'b0'], ['c0', 'd0']],
                  [['a1', 'b1'], ['c1', 'd1']]]
        output = ['b0', 'b1']
        """
        return tf.gather_nd(inputs, pixel_idx) #! (32, 126, 126, 144, 3)


if __name__ == "__main__":
    deformable_conv = DeformableConvLayer(16, 3)
    #! Example) input's shape == (32, 128, 128, 3), output 16 channels, 3x3 kernel
    x = tf.ones((32, 128, 128, 3))
    out = deformable_conv(x)
    # out.shape == (32, 126, 126, 16)
    print("DONE", out.shape)

 

 

Reference

[1] Jifeng Dai et al., Deformable Convolutional Networks, https://arxiv.org/abs/1703.06211

[2] https://github.com/DHZS/tf-deformable-conv-layer

 
