Panoptic SegFormer

ai-notes 2024. 10. 7. 16:44

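DEER series

It has been a while since my last post.

While implementing DEER, the Decoder (Character Recognition) trained well after some trial and error, but the Encoder (Segmentation) simply would not train, so I went back and summarized the papers that DEER references (e.g., Efficient DETR).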

Li, Zhiqi, et al. "Panoptic SegFormer: Delving deeper into panoptic segmentation with transformers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
https://arxiv.org/pdf/2109.03814

Abstract

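Panoptic segmentation combines semantic segmentation and instance segmentation, dividing the contents of an image into two types: *things and **stuff.

*things: objects that can be counted individually (people, cars, etc.)

**stuff: regions that are hard to count individually (sky, grass, etc.)

Panoptic SegFormer performs panoptic segmentation with transformers, and the paper claims three contributions.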

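1. Efficient deeply-supervised mask decoder
The model builds on Deformable DETR; specifically, the attention modules of the mask decoder are supervised layer by layer. This strategy lets the attention modules focus quickly on meaningful semantic regions.
2. Query decoupling strategy
The query decoupling strategy separates the roles of the query set and prevents things and stuff from interfering with each other. I read this as meaning that, for example, with num queries = 300, the first 150 segment things and the last 150 segment stuff. (Thing queries are matched to ground truths by bipartite matching, while stuff queries get a fixed label assignment.)
3. Improved post-processing method
To resolve conflicting mask overlaps, the improved post-processing considers classification quality and segmentation quality together, which improves performance at no extra cost.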
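To make the third point concrete, here is a minimal sketch of a confidence score that combines classification and mask quality. The form used (class probability times the mean mask probability over pixels above 0.5, with balancing exponents) follows my reading of the paper; the function name `mask_confidence` and the default exponents of 1.0 are assumptions for illustration only.

```python
import numpy as np

def mask_confidence(cls_prob, mask_prob, alpha=1.0, beta=1.0):
    """Combine classification and segmentation quality into one score.

    cls_prob  : float, predicted probability of the query's class
    mask_prob : (H, W) array of per-pixel mask probabilities in [0, 1]
    alpha/beta: balancing exponents (placeholder values, not from the paper)
    """
    fg = mask_prob > 0.5                        # pixels the mask claims
    if not fg.any():                            # empty mask: no confidence
        return 0.0
    seg_quality = float(mask_prob[fg].mean())   # mean probability inside the mask
    return (cls_prob ** alpha) * (seg_quality ** beta)

# Same class probability, but one mask is crisp and the other is hesitant.
sharp = np.array([[0.95, 0.90], [0.10, 0.05]])
blurry = np.array([[0.60, 0.55], [0.45, 0.40]])
print(mask_confidence(0.9, sharp), mask_confidence(0.9, blurry))
```

With equal class probabilities, the crisp mask gets the higher score, so it would win when two masks compete for the same pixels.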

(a) Previous panoptic segmentation methods trained with bipartite matching without distinguishing things from stuff among the queries, whereas (b) the authors assign stuff (shown in green) in a fixed way and apply bipartite matching only to things.

Architecture

Overview of Panoptic SegFormer. The model consists of a backbone, an encoder, and decoders; the backbone and encoder together output refined multi-scale features. The location decoder takes the thing queries and the multi-scale features as input. The mask decoder takes the thing queries produced by the location decoder together with the stuff queries. The location decoder concentrates on finding the reference points of the queries. To merge the results, the authors use mask-wise merging rather than the commonly used pixel-wise argmax.

The authors' framework design was inspired by the following observations:

1. Deep supervision is important for learning high-quality, discriminative attention representations in the mask decoder.
2. Things and stuff have different properties, so processing them the same way is not optimal.
3. Pixel-wise argmax tends to produce false positives because of extreme outliers.
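The third observation is easier to see on a toy case. The sketch below contrasts pixel-wise argmax with a simplified mask-wise merge that pastes whole masks in descending score order. It is only an illustration: the function names and the thresholds `t_cnf` and `t_keep` are hypothetical, and the scores are taken as given.

```python
import numpy as np

def pixelwise_argmax(mask_probs):
    """Baseline: every pixel goes to the mask with the highest probability."""
    return mask_probs.argmax(axis=0) + 1            # segment ids start at 1

def maskwise_merge(mask_probs, scores, t_cnf=0.3, t_keep=0.6):
    """Paste whole masks in descending score order; 0 means unassigned."""
    scores = np.asarray(scores, dtype=float)
    canvas = np.zeros(mask_probs.shape[1:], dtype=int)
    for i in np.argsort(-scores):
        if scores[i] < t_cnf:                       # drop low-confidence masks
            continue
        full = mask_probs[i] > 0.5
        free = full & (canvas == 0)                 # part not yet taken
        if full.sum() == 0 or free.sum() / full.sum() < t_keep:
            continue                                # mostly covered: discard
        canvas[free] = i + 1
    return canvas

# Mask 0 is a confident object; mask 1 is a weak outlier response.
probs = np.array([[[0.9, 0.9, 0.1, 0.1]],
                  [[0.1, 0.1, 0.2, 0.2]]])
print(pixelwise_argmax(probs))                      # outlier claims two pixels
print(maskwise_merge(probs, scores=[0.9, 0.1]))     # outlier is dropped
```

In this example argmax hands the two background pixels to the weak mask simply because 0.2 beats 0.1, while the mask-wise merge leaves them unassigned.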

Method

Decoder

1. Query Decoupling Strategy

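As noted above (observation 2 under Architecture), things and stuff have very different properties, so letting them interfere with each other is likely to hurt model performance. To prevent this, the authors separate thing queries from stuff queries; the thing queries pass through the location decoder, which trains them to distinguish individual instances well. The mask decoder receives both kinds of queries and produces the final masks and categories. As mentioned earlier (point 2 under Abstract), thing queries are matched to ground truths with a bipartite matching strategy, while stuff queries use a fixed assignment strategy. In other words, each stuff query corresponds to exactly one stuff category. Because things and stuff share the same output format, they can be post-processed in the same way.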
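The two assignment strategies can be sketched as follows. This is a toy illustration, not the paper's implementation: `match_things` brute-forces the bipartite matching over permutations (a real implementation would use the Hungarian algorithm), and the cost values and class names are made up.

```python
from itertools import permutations

def match_things(cost):
    """Brute-force bipartite matching for thing queries.

    cost[q][g] is the cost of pairing thing query q with ground-truth
    instance g. Assumes at least as many queries as ground truths.
    Returns {ground-truth index: query index}.
    """
    n_q, n_gt = len(cost), len(cost[0])
    best = min(permutations(range(n_q), n_gt),
               key=lambda qs: sum(cost[q][g] for g, q in enumerate(qs)))
    return {g: q for g, q in enumerate(best)}

def assign_stuff(stuff_classes, gt_stuff):
    """Fixed assignment: stuff query k always owns stuff_classes[k].

    Returns {query index: class name, or None if absent from the image}.
    """
    return {k: (c if c in gt_stuff else None)
            for k, c in enumerate(stuff_classes)}

cost = [[0.9, 0.2],    # thing query 0
        [0.1, 0.8],    # thing query 1
        [0.5, 0.5]]    # thing query 2 ends up unmatched ("no object")
print(match_things(cost))
print(assign_stuff(["sky", "grass", "road"], {"sky", "road"}))
```

The thing assignment changes from image to image with the costs, while the stuff assignment depends only on the query index.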

2. Location Decoder

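Location information plays an important role in distinguishing individual instances in the panoptic segmentation task. The authors therefore use a location decoder to capture the location of things through learnable queries. During training, an auxiliary MLP head (layer) predicts the bounding boxes and categories of the target objects. These predictions are trained with supervision, and the MLP head is not used at inference. The location decoder follows the approach used in Deformable DETR. Instead of bounding boxes, the location decoder can also learn location information by predicting the centers of the masks.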
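As a rough illustration of the last sentence, a mask-center target could be computed from a ground-truth mask as below. The function name and the pixel-center convention (adding 0.5 before normalizing) are my assumptions, not details taken from the paper.

```python
import numpy as np

def mask_mass_center(mask):
    """Return the (x, y) mass center of a binary mask, normalized to [0, 1].

    Returns None for an empty mask. Pixel (row r, col c) is treated as
    covering [c, c+1) x [r, r+1), hence the 0.5 offset.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None
    return float(xs.mean() + 0.5) / w, float(ys.mean() + 0.5) / h

# A 2x2 object in the top-right corner of a 4x4 image.
mask = np.zeros((4, 4), dtype=bool)
mask[0:2, 2:4] = True
print(mask_mass_center(mask))
```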

3. Mask Decoder

The mask decoder predicts categories and masks from the given queries. The queries are the thing queries output by the location decoder and the stuff queries, each with a fixed class. In the attention computation, K and V are the refined feature tokens obtained from the transformer encoder. Classification is done by attaching an FC layer to the refined query that each decoder layer outputs. A thing query predicts probabilities over all categories, while a stuff query predicts only the probability of the stuff category it is responsible for. To predict masks at the same time, the attention maps are split into three parts $ (A_3,A_4,A_5) $ and reshaped to the spatial sizes of the backbone feature maps $ C_3, C_4,C_5 $. The split attention maps are upsampled to H/8 x W/8 and concatenated along the channel dimension.

$$ A_{fused} =Concat(A_3,Up_{\times2}(A_4),Up_{\times4}(A_5)) $$

Finally, a 1x1 conv produces the binary mask.
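The split, upsample, concat, and 1x1 conv steps can be sketched in NumPy for a single query. Several details here are assumptions made for the sketch: the tokens are ordered C3, then C4, then C5; the feature strides are 8, 16, and 32; nearest-neighbour upsampling stands in for the upsampling operator; and the 1x1 conv weights are random.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_attention(attn, h8, w8, weight, bias=0.0):
    """Fuse one query's attention maps into a mask probability map.

    attn   : (heads, L) attention over the flattened C3, C4, C5 tokens,
             L = h8*w8 + (h8//2)*(w8//2) + (h8//4)*(w8//4)
    weight : (3 * heads,) weights of the 1x1 conv (one output channel)
    """
    heads = attn.shape[0]
    n3, n4 = h8 * w8, (h8 // 2) * (w8 // 2)
    a3 = attn[:, :n3].reshape(heads, h8, w8)
    a4 = attn[:, n3:n3 + n4].reshape(heads, h8 // 2, w8 // 2)
    a5 = attn[:, n3 + n4:].reshape(heads, h8 // 4, w8 // 4)
    fused = np.concatenate(
        [a3, upsample_nearest(a4, 2), upsample_nearest(a5, 4)], axis=0)
    # A 1x1 conv is a per-pixel linear map over the channel axis.
    logits = np.tensordot(weight, fused, axes=([0], [0])) + bias
    return fused, 1.0 / (1.0 + np.exp(-logits))     # sigmoid -> probabilities

rng = np.random.default_rng(0)
attn = rng.random((2, 16 + 4 + 1))                  # 2 heads, 4x4 + 2x2 + 1x1
fused, mask = fuse_attention(attn, 4, 4, rng.standard_normal(6))
print(fused.shape, mask.shape)
```

Thresholding `mask` at 0.5 would give the binary mask for that query.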

According to previous work, DETR converges slowly because the attention modules give nearly equal weight to every pixel of the feature map, and learning to focus on sparse, meaningful locations takes a lot of training.
To address this, the authors (1) use an ultra-light FC head to generate masks from the attention maps, which guides the attention modules toward the parts of the ground-truth mask they should attend to, and (2) apply deep supervision inside the mask decoder: the attention maps of every layer are supervised by the mask, so the attention modules can focus on meaningful information starting from the shallow layers.
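Deep supervision in point (2) amounts to adding a mask loss at every decoder layer rather than only at the last one. Below is a minimal sketch that assumes a soft Dice loss as the mask loss; the loss choice, the smoothing term `eps`, and the toy predictions are my assumptions.

```python
import numpy as np

def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss between a probability map and a binary mask."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def deep_supervision_loss(per_layer_preds, target):
    """Sum the mask loss over every decoder layer, not only the last."""
    return sum(dice_loss(p, target) for p in per_layer_preds)

target = np.array([[1.0, 1.0], [0.0, 0.0]])
layers = [np.full((2, 2), 0.5),                     # layer 1: uninformative
          np.array([[0.8, 0.7], [0.3, 0.2]]),       # layer 2: getting closer
          np.array([[1.0, 1.0], [0.0, 0.0]])]       # layer 3: perfect
per_layer = [dice_loss(p, target) for p in layers]
total = deep_supervision_loss(layers, target)
print(per_layer, total)
```

Because the early layers contribute to the total, their attention maps receive a gradient directly from the mask instead of only through the later layers.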

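Closing thoughts

I only summarized the parts I considered necessary for implementing the DEER Encoder.
In DEER, the Location Head (the counterpart of the segmentation decoder) borrows the Differentiable Binarization network structure, so the decoder from this paper is not used directly; still, the DEER authors say it inspired their network design.

My guess is that the emphasis was on pinpointing reference points through the location decoder.

It will probably be worth revisiting this together with Efficient DETR in a later post.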
