Deeper Learning
Deformable Convolutional Networks
(A short summary for understanding the architecture)
Abstract
- Because of their fixed geometric structure, CNNs are inherently limited in modeling geometric transformations.
- Two modules are proposed to enhance the transformation-modeling capacity of CNNs:
  - deformable convolution
  - deformable RoI pooling
- Both modules shift the spatial sampling locations by offsets; the offsets are learned while training on the target task, without additional supervision.
1. Introduction
- A visual recognition model must handle geometric variations, that is, model geometric transformations.
- This is usually addressed with data augmentation, or with hand-crafted approaches such as SIFT (scale-invariant feature transform) and sliding-window object detection.
- The problem with augmentation is that it assumes the geometric transformations are known and fixed. → Augmentation can only be designed from prior knowledge such as "the scale will differ" or "the object will be rotated".
- The problem with hand-crafted methods is that they cannot handle overly complex transformations (even when the transformations are known).
- CNNs are very effective for visual recognition, but their capacity for geometric transformations comes mostly from augmentation, large model capacity, and simple hand-crafted modules (e.g. max pooling for invariance to small translations).
- All activation units in the same CNN layer have the same receptive field size, but high-level layers encode semantics over different locations, scales, and deformations, so an adaptive method is needed.
- Object detection methods are advancing rapidly, but they still rely on traditional bounding-box-based feature extraction.
- This is suboptimal, especially for non-rigid, non-rectangular objects.
- Two modules are proposed to improve the CNN's ability to model geometric transformations.
2. Deformable Convolutional Networks
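- A standard convolution with dilation=1 computes y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n), which is Eq. (1) of the paper [1].
- R is the sampling grid over the input feature map x, and p_0 is one location on the output feature map y.
- For a 3x3 kernel, R = {(-1, -1), (-1, 0), ..., (0, 1), (1, 1)}, so R has 9 elements.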
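The sum over the grid R can be written out directly. The snippet below is a minimal pure-Python sketch, not the paper's or the repository's code; the names `conv_at`, `x`, and `w` and the toy 4x4 feature map are assumptions made for illustration, and it only handles interior locations (no padding).

```python
# Sampling grid R for a 3x3 kernel with dilation 1: nine (dy, dx) displacements.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def conv_at(x, w, p0):
    """y(p0) = sum over p_n in R of w(p_n) * x(p0 + p_n).

    x  : 2-D list (single-channel feature map)
    w  : dict mapping each p_n in R to its kernel weight
    p0 : (row, col) of an interior output location
    """
    row, col = p0
    return sum(w[(dy, dx)] * x[row + dy][col + dx] for dy, dx in R)


# Toy 4x4 feature map with x[r][c] = 4 * r + c (illustrative values only).
x = [[r * 4 + c for c in range(4)] for r in range(4)]
# All-ones kernel, so the result is simply the sum of the 3x3 neighbourhood.
w = {p: 1.0 for p in R}

print(len(R))                 # 9
print(conv_at(x, w, (1, 1)))  # 45.0
```

Every output location uses the same fixed grid R; deformable convolution keeps this sum but moves each sampling point by a learned offset.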
Code
- The sampling locations of Conv2D itself have to be modified, and a from-scratch implementation would be very slow, so I studied an existing implementation line by line instead.
"""
https://github.com/DHZS/tf-deformable-conv-layer
"""
import os

# Hide all GPUs so the example runs on CPU; set this before TensorFlow is imported
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
from tensorflow.keras.layers import Conv2D
class DeformableConvLayer(Conv2D):
    """Only support "channel last" data format"""

    def __init__(
        self,
        filters,  # output channels
        kernel_size,  # kernel_size
        strides=(1, 1),  # strides
        padding="valid",  # No padding
        data_format=None,  # `channels_last` (default) or `channels_first`
        dilation_rate=(
            1,
            1,
        ),  # an integer or tuple/list of 2 integers, specifying the dilation rate to use for dilated convolution.
        num_deformable_group=None,  # number of offset groups; filters in the same group share offsets
        activation=None,  # act func
        use_bias=True,  # use_bias
        kernel_initializer="glorot_uniform",  # initializer for kernel
        bias_initializer="zeros",  # initializer for bias
        kernel_regularizer=None,  # regularizer for kernel, keras.regularizers object
        bias_regularizer=None,  # regularizer for bias, keras.regularizers object
        activity_regularizer=None,  # Regularizer function applied to the output of the layer
        kernel_constraint=None,  # Constraint function applied to the kernel weight matrix, ex. keras.constraints.MaxNorm
        bias_constraint=None,  # Constraint function applied to the bias weight matrix, ex. keras.constraints.MaxNorm
        **kwargs
    ):
        """
        `kernel_size`, `strides` and `dilation_rate` must have the same value in both axis.
        :param num_deformable_group: split output channels into groups, offset shared in each group. If
        this parameter is None, then set num_deformable_group=filters.
        With num_deformable_group=None (= filters), each output feature map has its own offsets.
        """
        super().__init__(
            filters=filters,
            kernel_size=kernel_size,
            strides=strides,
            padding=padding,
            data_format=data_format,
            dilation_rate=dilation_rate,
            activation=activation,
            use_bias=use_bias,
            kernel_initializer=kernel_initializer,
            bias_initializer=bias_initializer,
            kernel_regularizer=kernel_regularizer,
            bias_regularizer=bias_regularizer,
            activity_regularizer=activity_regularizer,
            kernel_constraint=kernel_constraint,
            bias_constraint=bias_constraint,
            **kwargs
        )
        self.kernel = None
        self.bias = None
        self.offset_layer_kernel = None
        self.offset_layer_bias = None
        if num_deformable_group is None:
            num_deformable_group = filters
        if filters % num_deformable_group != 0:
            raise ValueError('"filters" mod "num_deformable_group" must be zero')
        self.num_deformable_group = num_deformable_group
    def build(self, input_shape):
        input_dim = int(input_shape[-1])  #! input_dim = 3
        # kernel_shape = self.kernel_size + (input_dim, self.filters)
        # we want to use depth-wise conv
        kernel_shape = self.kernel_size + (
            self.filters * input_dim,
            1,
        )  # (k, k, filters * input_dim, 1) for depth-wise conv
        #! kernel_shape = (3, 3, 16 * 3, 1)
        self.kernel = self.add_weight(  # adds a new variable to the layer (tf.keras.layers.Layer.add_weight)
            name="kernel",
            shape=kernel_shape,
            initializer=self.kernel_initializer,
            regularizer=self.kernel_regularizer,
            constraint=self.kernel_constraint,
            trainable=True,
            dtype=self.dtype,
        )
        if self.use_bias:
            self.bias = self.add_weight(
                name="bias",
                shape=(self.filters,),
                initializer=self.bias_initializer,
                regularizer=self.bias_regularizer,
                constraint=self.bias_constraint,
                trainable=True,
                dtype=self.dtype,
            )
        # create offset conv layer
        offset_num = (
            self.kernel_size[0]
            * self.kernel_size[1]
            * self.num_deformable_group  # num of total offset = kernel_w * kernel_h * deformable_group
        )
        #! offset_num = 3 * 3 * 16 = 144
        self.offset_layer_kernel = self.add_weight(
            name="offset_layer_kernel",
            shape=self.kernel_size
            + (
                input_dim,
                offset_num * 2,
            ),  # 2 means x and y axis (kernel_h, kernel_w, input_dim, offset_num * 2)
            initializer=tf.zeros_initializer(),
            regularizer=self.kernel_regularizer,
            trainable=True,
            dtype=self.dtype,
        )  #! offset_layer_kernel = (3, 3, 3, 288), (k, k, input_dim, output_dim)
        self.offset_layer_bias = self.add_weight(
            name="offset_layer_bias",
            shape=(offset_num * 2,),  #! 288
            initializer=tf.zeros_initializer(),
            # initializer=tf.random_uniform_initializer(-5, 5),
            regularizer=self.bias_regularizer,
            trainable=True,
            dtype=self.dtype,
        )
        self.built = True
    def call(self, inputs, training=None, **kwargs):
        # get offset, shape [batch_size, out_h, out_w, filter_h * filter_w * num_deformable_group * 2]
        offset = tf.nn.conv2d(
            inputs,
            filters=self.offset_layer_kernel,
            strides=[1, *self.strides, 1],
            padding=self.padding.upper(),
            dilations=[1, *self.dilation_rate, 1],
        )  #! offset = (32, 126, 126, 288)
        offset += self.offset_layer_bias
        # add padding if needed
        inputs = self._pad_input(inputs)
        #! inputs: (32, 128, 128, 3)
        # some length
        batch_size = int(inputs.get_shape()[0])  #! 32; needs a static batch size (fails if the batch dimension is None)
        channel_in = int(inputs.get_shape()[-1])  #! 3
        in_h, in_w = [
            int(i) for i in inputs.get_shape()[1:3]
        ]  # input feature map size #! 128, 128
        out_h, out_w = [
            int(i) for i in offset.get_shape()[1:3]  #! 126, 126
        ]  # output feature map size
        filter_h, filter_w = self.kernel_size  #! 3, 3
        # get x, y axis offset
        offset = tf.reshape(
            offset, [batch_size, out_h, out_w, -1, 2]
        )  #! (32, 126, 126, 144, 2)
        y_off, x_off = offset[:, :, :, :, 0], offset[:, :, :, :, 1]
        #! y_off = (32, 126, 126, 144), x_off = (32, 126, 126, 144)
        # input feature map grid coordinates
        y, x = self._get_conv_indices([in_h, in_w])  #! y=(1, 126, 126, 9), x=(1, 126, 126, 9)
        y, x = [tf.expand_dims(i, axis=-1) for i in [y, x]]  #! x,y = (1, 126, 126, 9, 1)
        """
        tf.tile(input, multiples)
        a = tf.constant([[1,2,3],[4,5,6]], tf.int32)
        b = tf.constant([1,2], tf.int32)
        tf.tile(a, b)
        = [[1,2,3,1,2,3],
           [4,5,6,4,5,6]]
        multiples b = [1, 2]: the rows of a are kept once, the columns are repeated twice
        tf.tile(a, [2,1])
        = [[1,2,3],
           [4,5,6],
           [1,2,3],
           [4,5,6]]
        """
        y, x = [
            tf.tile(i, [batch_size, 1, 1, 1, self.num_deformable_group]) for i in [y, x]
        ]  #! x,y = (32, 126, 126, 9, 16)
        y, x = [tf.reshape(i, [*i.shape[0:3], -1]) for i in [y, x]]  #! x,y = (32, 126, 126, 144)
        y, x = [tf.cast(i, dtype=tf.float32) for i in [y, x]]
        # add offset
        y, x = y + y_off, x + x_off  #! add offset to sampling location index
        y = tf.clip_by_value(y, 0, in_h - 1)  #! clip for edge location
        x = tf.clip_by_value(x, 0, in_w - 1)
        # get four coordinates of points around (x, y)
        y0, x0 = [tf.cast(tf.floor(i), dtype=tf.int32) for i in [y, x]]  #! offsets are fractional, so floor to get integer grid coordinates
        y1, x1 = y0 + 1, x0 + 1
        # clip
        y0, y1 = [tf.clip_by_value(i, 0, in_h - 1) for i in [y0, y1]]
        x0, x1 = [tf.clip_by_value(i, 0, in_w - 1) for i in [x0, x1]]
        #! x0,x1,y0,y1 = (32, 126, 126, 144): corner coordinates around each sampling location
        # get pixel values
        indices = [[y0, x0], [y0, x1], [y1, x0], [y1, x1]]  # 4 coordinates
        p0, p1, p2, p3 = [
            DeformableConvLayer._get_pixel_values_at_point(inputs, i) for i in indices
        ]  #! sample the input at the offset-shifted locations
        #! p0~p3 = (32, 126, 126, 144, 3)
        # cast to float
        x0, x1, y0, y1 = [tf.cast(i, dtype=tf.float32) for i in [x0, x1, y0, y1]]
        # weights
        #! note: where y == in_h - 1 (or x == in_w - 1), y0 == y1 (x0 == x1) after clipping, so all four weights are 0
        w0 = (y1 - y) * (x1 - x)
        w1 = (y1 - y) * (x - x0)
        w2 = (y - y0) * (x1 - x)
        w3 = (y - y0) * (x - x0)
        # expand dim for broadcast
        w0, w1, w2, w3 = [tf.expand_dims(i, axis=-1) for i in [w0, w1, w2, w3]]
        # bilinear interpolation
        pixels = tf.add_n([w0 * p0, w1 * p1, w2 * p2, w3 * p3])
        #! pixels = (32, 126, 126, 144, 3)
        # reshape the "big" feature map
        pixels = tf.reshape(
            pixels,
            [
                batch_size,
                out_h,
                out_w,
                filter_h,
                filter_w,
                self.num_deformable_group,
                channel_in,
            ],
        )  #! pixels = (32, 126, 126, 3, 3, 16, 3)
        pixels = tf.transpose(pixels, [0, 1, 3, 2, 4, 5, 6])  #! pixels = (32, 126, 3, 126, 3, 16, 3)
        pixels = tf.reshape(
            pixels,
            [
                batch_size,
                out_h * filter_h,
                out_w * filter_w,
                self.num_deformable_group,
                channel_in,
            ],
        )
        #! pixels = (32, 378, 378, 16, 3)
        # copy channels to same group
        feat_in_group = self.filters // self.num_deformable_group
        pixels = tf.tile(pixels, [1, 1, 1, 1, feat_in_group])
        pixels = tf.reshape(
            pixels, [batch_size, out_h * filter_h, out_w * filter_w, -1]  #! (32, 378, 378, 48)
        )
        # depth-wise conv with stride = kernel size, so each k x k block yields one output value
        out = tf.nn.depthwise_conv2d(
            pixels, self.kernel, [1, filter_h, filter_w, 1], "VALID"
        )
        # add the output feature maps in the same group
        #! out = (32, 126, 126, 48)
        out = tf.reshape(out, [batch_size, out_h, out_w, self.filters, channel_in])
        #! out = (32, 126, 126, 16, 3)
        out = tf.reduce_sum(out, axis=-1)
        if self.use_bias:
            out += self.bias
        return self.activation(out)  #! (32, 126, 126, 16)
    def _pad_input(self, inputs):
        """Check if input feature map needs padding, because we don't use the standard Conv() function.
        :param inputs:
        :return: padded input feature map
        """
        # When padding is 'same', we should pad the feature map.
        # if padding == 'same', output size should be `ceil(input / stride)`
        if self.padding == "same":
            in_shape = inputs.get_shape().as_list()[1:3]  #! (128, 128)
            padding_list = []
            for i in range(2):
                filter_size = self.kernel_size[i]  #! 3
                dilation = self.dilation_rate[i]  #! 1
                dilated_filter_size = filter_size + (filter_size - 1) * (
                    dilation - 1
                )  #! 3
                same_output = (in_shape[i] + self.strides[i] - 1) // self.strides[i]  #! 128
                valid_output = (
                    in_shape[i] - dilated_filter_size + self.strides[i]
                ) // self.strides[i]  #! 126
                if same_output == valid_output:
                    padding_list += [0, 0]
                else:
                    p = dilated_filter_size - 1
                    p_0 = p // 2
                    padding_list += [p_0, p - p_0]
            if sum(padding_list) != 0:
                padding = [
                    [0, 0],
                    [padding_list[0], padding_list[1]],  # top, bottom padding
                    [padding_list[2], padding_list[3]],  # left, right padding
                    [0, 0],
                ]
                inputs = tf.pad(inputs, padding)
        return inputs
    def _get_conv_indices(self, feature_map_size):  #! (128, 128)
        """the x, y coordinates in the window when a filter sliding on the feature map
        :param feature_map_size:
        :return: y, x with shape [1, out_h, out_w, filter_h * filter_w]
        """
        feat_h, feat_w = [int(i) for i in feature_map_size[0:2]]  #! 128, 128
        """
        tf.meshgrid
        x = [1, 2, 3]
        y = [4, 5, 6]
        X, Y = tf.meshgrid(x, y)
        X = [[1, 2, 3],
             [1, 2, 3],
             [1, 2, 3]]
        Y = [[4, 4, 4],
             [5, 5, 5],
             [6, 6, 6]]
        """
        #! x = (128, 128), y = (128, 128)
        x, y = tf.meshgrid(tf.range(feat_w), tf.range(feat_h))  #! values 0~127, i.e. the column / row indices
        x, y = [
            tf.reshape(i, [1, *i.get_shape(), 1]) for i in [x, y]
        ]  # shape [1, h, w, 1]
        #! x = (1, 128, 128, 1), y = (1, 128, 128, 1)
        # https://www.tensorflow.org/api_docs/python/tf/image/extract_patches
        x, y = [
            tf.image.extract_patches(
                i,
                [1, *self.kernel_size, 1],
                [1, *self.strides, 1],
                [1, *self.dilation_rate, 1],
                "VALID",
            )
            for i in [x, y]
        ]  # shape [1, out_h, out_w, filter_h * filter_w]
        #! records which input indices each kernel window covers (patches of indices, not of values)
        #! shape [1, 126, 126, 3 * 3]: with stride 1, the 9 kernel taps slide over the input and are evaluated at 126 * 126 positions
        return y, x
    @staticmethod
    def _get_pixel_values_at_point(inputs, indices):
        """get pixel values
        :param inputs:
        :param indices: shape [batch_size, H, W, I], I = filter_h * filter_w * num_deformable_group
        :return:
        """
        #! inputs = (32, 128, 128, 3), indices = [y0, x0] with y0, x0 = (32, 126, 126, 144)
        y, x = indices
        batch, h, w, n = y.get_shape().as_list()[0:4]
        batch_idx = tf.reshape(tf.range(0, batch), (batch, 1, 1, 1))  #! (32, 1, 1, 1) (0 ~ 31)
        b = tf.tile(batch_idx, (1, h, w, n))  #! b = (32, 126, 126, 144), batch indices 0~31 replicated by tile
        pixel_idx = tf.stack([b, y, x], axis=-1)  #! pixel_idx = (32, 126, 126, 144, 3)
        """
        tf.gather_nd(params, indices)
        gathers the slices of params addressed by indices and returns them
        indices = [[0, 0, 1], [1, 0, 1]]
        params = [[['a0', 'b0'], ['c0', 'd0']],
                  [['a1', 'b1'], ['c1', 'd1']]]
        output = [params[0][0][1], params[1][0][1]]
        output = ['b0', 'b1']
        """
        return tf.gather_nd(inputs, pixel_idx)  #! (32, 126, 126, 144, 3)
if __name__ == "__main__":
    deformable_conv = DeformableConvLayer(16, 3)
    #! Example) input's shape == (32, 128, 128, 3), output 16 channels, 3x3 kernel
    x = tf.ones((32, 128, 128, 3))
    out = deformable_conv(x)
    # out.shape == (32, 126, 126, 16)
    print("DONE", out.shape)
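The core of `call` is: shift every sampling point by its offset, then read the input at the resulting fractional location by bilinear interpolation. The following is a rough single-channel sketch of that idea in pure Python, following the deformable form y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) from the paper [1]. It is not the repository's code: `bilinear`, `deformable_conv_at`, and the toy data are hypothetical names and values, and the interpolation weights are computed from the fractional parts, which differs from the clipped-corner weights in the listing above at the last row and column.

```python
import math

# 3x3 sampling grid R with dilation 1.
R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear(x, py, px):
    """Read the 2-D list x at fractional (py, px), clamping to the map borders."""
    h, w = len(x), len(x[0])
    py = min(max(py, 0.0), h - 1.0)
    px = min(max(px, 0.0), w - 1.0)
    y0, x0 = int(math.floor(py)), int(math.floor(px))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = py - y0, px - x0  # fractional parts in [0, 1)
    return (
        (1 - fy) * (1 - fx) * x[y0][x0]
        + (1 - fy) * fx * x[y0][x1]
        + fy * (1 - fx) * x[y1][x0]
        + fy * fx * x[y1][x1]
    )


def deformable_conv_at(x, w, p0, offsets):
    """Sum of w(p_n) * x(p0 + p_n + dp_n), with dp_n = offsets[p_n] as (dy, dx)."""
    total = 0.0
    for dy, dx in R:
        oy, ox = offsets[(dy, dx)]
        total += w[(dy, dx)] * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return total


# Toy 4x4 feature map with x[r][c] = 4 * r + c and an all-ones kernel.
x = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
w = {p: 1.0 for p in R}
zero = {p: (0.0, 0.0) for p in R}   # no deformation
shift = {p: (0.0, 0.5) for p in R}  # every sampling point moved half a pixel to the right

print(bilinear(x, 0.5, 0.5))                     # 2.5
print(deformable_conv_at(x, w, (1, 1), zero))    # 45.0
print(deformable_conv_at(x, w, (1, 1), shift))   # 49.5
```

With all offsets at zero the result matches a standard 3x3 convolution, which is why zero-initialising the offset layer (as `build` does) starts training from an ordinary convolution.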
Reference
[1] Jifeng Dai et al., Deformable Convolutional Networks, https://arxiv.org/abs/1703.06211
[2] https://github.com/DHZS/tf-deformable-conv-layer