Detect Anything via Next Point Prediction

🏷️ Paper, Object Detection, LLM

Object detection has long been dominated by regression-based models such as YOLO, DETR, and Grounding DINO. Recent attempts to use MLLMs (Multimodal Large Language Models) have run into problems such as low recall, duplicate predictions, and coordinate misalignment. To close this gap, the paper proposes Rex-Omni, a 3B-parameter MLLM. In the zero-shot setting on the COCO and LVIS benchmarks, Rex-Omni matches or outperforms regression-based models such as DINO and Grounding DINO.

Q. Jiang, J. Huo, X. Chen, Y. Xiong, Z. Zeng, Y. Chen, T. Ren, J. Yu, and L. Zhang, "Detect Anything via Next Point Prediction", arXiv preprint arXiv:2510.12798, 2025.


Summary

Architecture: Rex-Omni is built on Qwen2.5-VL-3B and uses special tokens that represent quantized coordinates from 0 to 999. It repurposes the last 1,000 tokens of the vocabulary to represent these coordinates.

Task formulation: All visual perception tasks are unified as coordinate prediction. Pointing outputs one point, detection outputs two points that define a bounding box, polygons use four or more points, and keypoints output several semantic points.

Data engines: Three specialized data engines were built.

Together with 8.9 million samples from public datasets, this yields 22 million high-quality training samples in total.

Training method: A two-stage training pipeline is adopted.

Evaluation metrics: Recall, Precision, and F1 score are used instead of the traditional mAP. They are evaluated at IoU thresholds of 0.5 and 0.95, and as the mean over 0.5 to 0.95.
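The metrics above can be sketched as follows. This is an illustration, not the paper's evaluation code: the greedy one-to-one matching, the 0.05 step of the threshold grid, and the function names `iou`, `f1_at`, and `mean_f1` are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_at(gt_boxes, pred_boxes, thr):
    """Recall, precision and F1 at one IoU threshold (greedy matching, assumed)."""
    used, tp = set(), 0
    for g in gt_boxes:
        best, best_j = 0.0, None
        for j, p in enumerate(pred_boxes):
            if j in used:
                continue
            score = iou(g, p)
            if score >= thr and score > best:
                best, best_j = score, j
        if best_j is not None:
            used.add(best_j)  # each prediction may match at most one GT box
            tp += 1
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# Assumed grid: 0.50, 0.55, ..., 0.95
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_f1(gt_boxes, pred_boxes):
    """F1 averaged over the assumed IoU threshold grid."""
    return sum(f1_at(gt_boxes, pred_boxes, t)[2] for t in THRESHOLDS) / len(THRESHOLDS)


gt = [(0, 0, 10, 10)]
pred = [(0, 0, 10, 8.2)]  # IoU with the GT box is 0.82
print(f1_at(gt, pred, 0.5), f1_at(gt, pred, 0.95), mean_f1(gt, pred))
```

A box with IoU 0.82 counts as a hit at the loose threshold and a miss at the strict one, which is why reporting both ends plus the mean gives a fuller picture than a single number.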

Main results:

Paper details

Introduction

Object detection has progressed from early CNN-based architectures (YOLO, Faster R-CNN) to transformer-based models (DETR, DINO), and from closed-set to open-set detection.

Goal: develop a model that can identify arbitrary objects and concepts.

Limitations of existing approaches:

  1. Open-vocabulary detection models (Grounding DINO, etc.)

    • Use text encoders such as BERT or CLIP
    • Shallow language understanding makes complex semantic descriptions hard to handle
    • Example: given the input "red apple", they still detect every apple
  2. Existing MLLM-based approaches

    • Represent coordinates as discrete tokens and generate bounding boxes via next-token prediction
    • Struggle to localize objects precisely
    • Suffer from low recall, coordinate drift, and duplicate predictions

Two root causes of the performance gap:

1. Difficulty of the discrete-to-continuous mapping

MLLMs treat coordinate prediction as a discrete classification task. They generate absolute coordinate values directly and are trained with a cross-entropy loss.

Problems:

2. Limitations of teacher forcing

SFT (Supervised Fine-tuning) relies on teacher forcing.

Problems:

Core design of Rex-Omni

1. Task formulation

Choosing the coordinate representation:

Three paradigms are compared:

  1. Direct coordinate prediction (adopted): coordinates are treated as discrete tokens in the LLM vocabulary
  2. Retrieval-based: an additional proposal module is used, and the LLM predicts the index of a candidate region
  3. External decoder: the LLM predicts a special token, and its embedding is passed to an external decoder

Choosing the coordinate format:

Three variants are compared:

  1. Relative coordinates with special tokens (adopted): values are quantized to 0~999, and each coordinate is represented by one special token
  2. Relative coordinates without special tokens: values are quantized into 1,000 bins, but each value is spelled out with several atomic tokens
  3. Absolute coordinates: for example, 1921 is tokenized as (1, 9, 2, 1)
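The difference between the variants can be made concrete with a small sketch. It assumes floor-style binning and bin-centre decoding, neither of which is specified in the text, and the helper names (`quantize`, `dequantize`, `box_to_tokens`, `digit_token_count`) are hypothetical.

```python
def quantize(value, size, bins=1000):
    """Map an absolute pixel coordinate to a bin index in [0, bins - 1]."""
    idx = int(value / size * bins)  # floor-style binning (assumption)
    return min(max(idx, 0), bins - 1)


def dequantize(idx, size, bins=1000):
    """Map a bin index back to the pixel at the bin centre (assumption)."""
    return (idx + 0.5) / bins * size


def box_to_tokens(box, width, height):
    """Encode an (x0, y0, x1, y1) pixel box as four relative-coordinate tokens."""
    x0, y0, x1, y1 = box
    ids = [quantize(x0, width), quantize(y0, height),
           quantize(x1, width), quantize(y1, height)]
    return "".join(f"<{i}>" for i in ids)


def digit_token_count(box):
    """Tokens needed if every absolute coordinate were split into single digits."""
    return sum(len(str(int(v))) for v in box)


box = (1921, 540, 2880, 1080)   # pixel box in a 3840 x 2160 image
print(box_to_tokens(box, 3840, 2160))  # <500><250><750><500>
print(digit_token_count(box))          # 15 digit tokens versus 4 special tokens
```

With special tokens a box always costs four coordinate tokens regardless of image resolution, while digit-by-digit absolute coordinates grow with the number of digits. The price is a quantization error of at most one bin width, here 3.84 pixels horizontally.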

Reasons for adoption:

Input format:

Text prompt example:

Please detect pigeon, person, truck, snow in this image. 
Return the output in box format.

Visual prompt example:

Here are some example boxes specifying the location of several objects 
in the image: "object1": ["<12><412><339><568>", "<92><55><179><378>"]. 
Please detect all objects with the same category and return their 
bounding boxes in [x0, y0, x1, y1] format.

Output format:

Basic structure:

<|object_ref_start|>PHRASE<|object_ref_end|><|box_start|>COORDS<|box_end|>

Bounding boxes:

<|object_ref_start|>person<|object_ref_end|><|box_start|>
<12><42><512><612>, <24><66><172><623>, ...<|box_end|>

Points:

<|object_ref_start|>button<|object_ref_end|><|box_start|>
<100><150>,<200><250>, ...<|box_end|>
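The box and point formats above are regular enough to parse with two regular expressions. The sketch below is one possible parser, not code from the paper; `parse_output` and `to_pixels` are hypothetical names, and the keypoint JSON format is not handled.

```python
import re

# One entry: a referring phrase followed by its coordinate list.
ENTRY = re.compile(
    r"<\|object_ref_start\|>(.*?)<\|object_ref_end\|>"
    r"<\|box_start\|>(.*?)<\|box_end\|>",
    re.DOTALL,
)
COORD = re.compile(r"<(\d+)>")


def parse_output(text, coords_per_item=4):
    """Return {phrase: [tuple of bin indices, ...]}.

    Use coords_per_item=4 for boxes and coords_per_item=2 for points.
    Chunks with a different number of coordinates are skipped.
    """
    results = {}
    for phrase, body in ENTRY.findall(text):
        items = []
        for chunk in body.split(","):
            values = [int(v) for v in COORD.findall(chunk)]
            if len(values) == coords_per_item:
                items.append(tuple(values))
        results.setdefault(phrase.strip(), []).extend(items)
    return results


def to_pixels(item, width, height, bins=1000):
    """Scale bin indices back to pixels; even positions are x, odd are y."""
    scale = (width, height)
    return tuple(v * scale[i % 2] / bins for i, v in enumerate(item))


text = (
    "<|object_ref_start|>person<|object_ref_end|><|box_start|>\n"
    "<12><42><512><612>, <24><66><172><623><|box_end|>"
)
boxes = parse_output(text)
print(boxes)
print(to_pixels(boxes["person"][0], width=640, height=480))
```

Skipping malformed chunks rather than raising is a design choice for the example: generated text can be truncated, and a partial box is usually better dropped than guessed.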

Keypoints:

{"person1": {"box": <0><123><42><256>, 
"keypoints": {"left eye": <32><43>, "right eye": <66><55>, ...}}}

Model architecture:

Built on Qwen2.5-VL-3B-Instruct with minimal modifications:

2. Training data

Public datasets: about 8.9 million samples

Grounding Data Engine: about 3 million images

  1. Image captioning: generate descriptions with Qwen2.5-VL-7B
  2. Phrase extraction: extract noun phrases with SpaCy
  3. Phrase filtering: remove phrases that contain attributes such as adjectives (e.g. drop "green lemon", keep "lemon")
    • Reason: current grounding models understand attributes poorly
  4. Phrase grounding: generate bounding boxes with DINO-X

Referring Data Engine: about 3 million images

  1. Expression generation: generate referring expressions with Qwen2.5-VL-7B
  2. Pointing: generate spatial points for each expression with Molmo
  3. Mask generation: generate a mask for each GT box with SAM
  4. Point-to-box association: if a Molmo point falls inside a mask, link that box to the referring expression
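Step 4 above can be sketched as a point-in-mask lookup. This is a minimal illustration rather than the authors' pipeline: masks are plain nested 0/1 lists standing in for SAM output, points are integer (x, y) pixels standing in for Molmo output, and `rect_mask`, `point_in_mask`, and `link_expressions` are hypothetical helper names.

```python
def rect_mask(box, width, height):
    """Toy stand-in for a segmentation mask: fills the box region with 1s."""
    x0, y0, x1, y1 = box
    return [[1 if x0 <= x <= x1 and y0 <= y <= y1 else 0 for x in range(width)]
            for y in range(height)]


def point_in_mask(point, mask):
    """True if the (x, y) point lies on a foreground pixel of the mask."""
    x, y = point
    if 0 <= y < len(mask) and 0 <= x < len(mask[0]):
        return bool(mask[y][x])
    return False


def link_expressions(points_by_expr, boxes, masks):
    """Link each expression to every box whose mask contains one of its points."""
    links = {}
    for expr, points in points_by_expr.items():
        matched = [box for box, mask in zip(boxes, masks)
                   if any(point_in_mask(p, mask) for p in points)]
        if matched:  # expressions with no hit are dropped
            links[expr] = matched
    return links


boxes = [(0, 0, 2, 2), (3, 3, 5, 5)]
masks = [rect_mask(b, 6, 6) for b in boxes]
points = {"the cup on the left": [(1, 1)], "the sky": [(5, 0)]}
print(link_expressions(points, boxes, masks))
```

Testing against the mask rather than the box matters when boxes overlap: a point inside two boxes usually lies on only one object's mask.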

Other data engines:

Total data: 22 million high-quality annotated images

3. Training pipeline

Stage 1: Supervised Fine-tuning (SFT)

Strategy for constructing conversation data online:

Training setup:

Stage 2: GRPO-based reinforcement learning post-training

Limitations of SFT:

  1. Geometric discretization problem

    • Coordinates are represented as categorical tokens (<0>~<999>)
    • If the GT is <33> and the prediction is <32>, the pixel difference is negligible, yet the CE loss treats the prediction as completely wrong
    • If the GT is <0><0><100><100> and the prediction is <0><0><100><1000>, only one token is wrong, yet the box is severely off
  2. Behavioral regulation deficiency

    • With teacher forcing, the number of boxes is fixed to match the GT
    • The model never learns to decide the number of objects on its own
    • At inference time this leads to (1) too few predicted boxes or (2) too many predictions (the same or slightly shifted coordinates repeated)
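The mismatch between token-level errors and geometric errors is easy to check numerically. In the sketch below, the second case copies the values from the example above; the first case is an illustrative box built around the <33> versus <32> example, and the helper names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def wrong_tokens(gt, pred):
    """Number of coordinate tokens that differ, which is all CE loss can see."""
    return sum(g != p for g, p in zip(gt, pred))


cases = {
    # One coordinate is off by a single bin (illustrative box, assumption).
    "off by one bin": ((33, 33, 133, 133), (32, 33, 133, 133)),
    # Values taken from the example in the text.
    "one large error": ((0, 0, 100, 100), (0, 0, 100, 1000)),
}

for name, (gt, pred) in cases.items():
    print(name, wrong_tokens(gt, pred), round(iou(gt, pred), 3))
# off by one bin 1 0.99
# one large error 1 0.1
```

Both predictions have exactly one wrong token, so a per-token classification loss cannot tell them apart by token count, while their IoU differs by roughly a factor of ten. This is the gap a geometry-aware reward is meant to close.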

How GRPO works:

Given an image and a question \((I, x)\):

  1. Sample \(G\) complete responses from the current policy \(\pi_\theta\)
  2. Compute a scalar reward \(r_i\) for each output \(o_i\)
  3. Normalize across the group to obtain the relative advantage:

\[A_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)}\]

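  4. GRPO objective:

\[\mathcal{J}_{\text{GRPO}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[ \min\left(\rho_{i,t} \hat{A}_{i,t},\ \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\, \hat{A}_{i,t}\right) - \beta\, D_{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}] \right]\]

Here \(\rho_{i,t}\) is the probability ratio between the current and the old policy for token \(t\) of output \(o_i\), and \(\hat{A}_{i,t} = A_i\) for every token of that output.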
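The two formulas can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not a training loop: the small `eps` added to the standard deviation, the use of the population standard deviation, and the default values of `clip_eps` and `beta` are assumptions, and the ratios and KL values are passed in as precomputed numbers.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (eps avoids division by zero)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single token."""
    clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def grpo_objective(ratios, advantages, kl, clip_eps=0.2, beta=0.01):
    """ratios[i][t] and kl[i][t] are per-token values for output i.

    The same advantage A_i is applied to every token of output i.
    """
    total = 0.0
    for rho_i, a_i, kl_i in zip(ratios, advantages, kl):
        per_token = [clipped_term(r, a_i, clip_eps) - beta * k
                     for r, k in zip(rho_i, kl_i)]
        total += sum(per_token) / len(per_token)  # 1 / |o_i| average
    return total / len(ratios)                    # 1 / G average


rewards = [0.2, 0.8, 0.5, 0.5]          # one scalar reward per sampled response
adv = group_advantages(rewards)
print([round(a, 3) for a in adv])
print(clipped_term(1.5, 1.0), clipped_term(0.5, -1.0))
```

Because advantages are relative to the group mean, a response is rewarded only for being better than its siblings sampled from the same prompt, which removes the need for a separate value model.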

Geometry-aware rewards:

  1. Box IoU Reward (detection, grounding, referring, OCR)
    • Match GT boxes with predicted boxes
    • If the categories match, the IoU is the reward; otherwise the reward is 0
    • Compute Recall, Precision, and F1:

\[\text{Recall} = \frac{\sum_{j=1}^{n} r_j}{n}, \quad \text{Precision} = \frac{\sum_{j=1}^{n} r_j}{m}, \quad r_{\text{IoU}} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall} + \epsilon}\]

  2. Point-in-Mask Reward (pointing tasks)

    • Extract the mask of each GT box with SAM
    • The reward is 1 if the predicted point lies inside the mask and the category matches, otherwise 0
  3. Point-in-Box Reward (GUI Grounding)

    • The reward is 1 if the predicted point lies inside the GT box, otherwise 0
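A sketch of the box and point rewards described above follows. The text only says that GT and predicted boxes are matched, so the greedy best-IoU matching used here is an assumption, as are the function names and the (category, box) input layout. The Point-in-Mask reward is omitted because it needs SAM masks.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def box_iou_reward(gt, pred, eps=1e-6):
    """gt and pred are lists of (category, box). Returns the F1-style reward."""
    if not gt or not pred:
        return 0.0
    used, matched = set(), []
    for g_cat, g_box in gt:
        best, best_j = 0.0, None
        for j, (p_cat, p_box) in enumerate(pred):
            if j in used or p_cat != g_cat:  # category mismatch earns 0
                continue
            score = iou(g_box, p_box)
            if score > best:
                best, best_j = score, j
        if best_j is not None:
            used.add(best_j)
        matched.append(best)                 # r_j for this GT box
    recall = sum(matched) / len(gt)
    precision = sum(matched) / len(pred)
    return 2 * precision * recall / (precision + recall + eps)


def point_in_box_reward(point, box):
    """1.0 if the predicted point lies inside the GT box, else 0.0."""
    x, y = point
    x0, y0, x1, y1 = box
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0


gt = [("cat", (0, 0, 10, 10))]
exact = [("cat", (0, 0, 10, 10))]
duplicated = exact * 2                       # same box predicted twice
print(round(box_iou_reward(gt, exact), 3), round(box_iou_reward(gt, duplicated), 3))
```

Dividing by the number of predictions in the precision term is what penalizes duplicates: repeating a correct box halves precision even though recall stays at 1.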

Training setup:

Benchmark results

Common Object Detection (COCO)

Evaluation setup:

Main results:

Significance:

Long-tailed Object Detection (LVIS)

๋ฒค์น˜๋งˆํฌ: 1,203๊ฐœ ์นดํ…Œ๊ณ ๋ฆฌ, 19,626๊ฐœ ํ…Œ์ŠคํŠธ ์ด๋ฏธ์ง€

์ฃผ์š” ๊ฒฐ๊ณผ:

์˜๋ฏธ:

Dense and Tiny Object Detection

๋ฒค์น˜๋งˆํฌ:

์ฃผ์š” ๊ฒฐ๊ณผ:

์‹คํŒจ ๋ชจ๋“œ ๋ถ„์„:

  1. Large-box prediction: ์—ฌ๋Ÿฌ ์ธ์ ‘ ๊ฐ์ฒด๋ฅผ ํ•˜๋‚˜์˜ ํฐ ๋ฐ•์Šค๋กœ ์ปค๋ฒ„
  2. Structured duplicate predictions: ์ตœ์†Œ ์˜คํ”„์…‹์œผ๋กœ ์ขŒํ‘œ ๋ฐ˜๋ณต

GRPO์˜ ํšจ๊ณผ:

Referring Object Detection

๋ฒค์น˜๋งˆํฌ:

์ฃผ์š” ๊ฒฐ๊ณผ:


Visual Prompting

Evaluation:

Main results:

Object Pointing

Evaluation: point prediction on COCO, LVIS, Dense200, VisDrone, RefCOCOg, and HumanRef

Main results:

Achieves the highest F1 score on every benchmark

GUI Grounding

๋ฒค์น˜๋งˆํฌ:

์ฃผ์š” ๊ฒฐ๊ณผ:

๊ธฐํƒ€ ํƒœ์Šคํฌ

Layout Grounding (DocLayNet, M6Doc):

OCR (HierText, ICDAR2015, TotalText, SROIE):

Spatial Pointing (RefSpatial-Bench):

Keypoint (COCO, AP10K):

In-depth analysis

Why GRPO works

1. Training dynamics

SFT stage: steady, gradual improvement.

GRPO stage: a sharp performance gain within a small number of steps.

Interpretation: the SFT model already has strong latent capability that is not fully used. GRPO unlocks it through behavior-aware rewards and sequence-level feedback.

2. Behavior correction

Duplicate-prediction removal experiment:

→ GRPO effectively suppresses duplicate predictions

Large-box prediction removal experiment (Dense200):

→ GRPO suppresses excessively large box predictions

3. Does coordinate precision improve?

Controlled experiment: compare only the cases in which both models successfully match the GT

→ The main benefit of GRPO is correcting behavioral deficiencies, not improving coordinate precision

4. Increasing the likelihood of correct predictions

High-temperature sampling experiment:

Results:

→ On simple datasets GRPO improves sampling consistency; on complex tasks it enables inherently more accurate predictions

Inference efficiency and speed

Tokenization efficiency:

Inference speed (A100 GPU, vLLM, BF16):

Inference time scales linearly with the number of predicted objects. MLLM-based detectors are currently slower than optimized traditional detectors, but quantization or distillation could narrow the gap.

Related work

Regression-based Object Detection

The field evolved from early CNN-based models (YOLO, SSD, Faster R-CNN) through anchor-free approaches (CornerNet, CenterNet, FCOS) to transformer-based detectors (DETR, Deformable DETR, DINO).

Innovations driving continued improvement:

Open-set Object Detection

Open-vocabulary detection with text prompts:

MLLM-based Object Detection

Direct coordinate prediction:

Retrieval-based:

External decoder:

Conclusion

Rex-Omni systematically addresses the problems of MLLM-based object detection.

Key contributions:

  1. Efficient coordinate tokenization: special tokens reduce learning complexity and improve efficiency
  2. Large-scale data generation: custom engines yield 22 million high-quality samples
  3. Two-stage training pipeline: SFT + GRPO achieves both precise localization and deep language understanding
  4. Behavior correction: GRPO effectively corrects SFT-induced deficiencies (duplicate predictions, large-box predictions)

Experimental validation:

Limitations and future work:

Rex-Omni is an important step toward versatile, language-aware next-generation perception systems.