SkillOrchestra - Learning to Route Agents via Skill Transfer


Introduction

As compound AI systems grow more complex, a question naturally comes up: "How do we combine multiple models and tools efficiently?" Existing model routing has mostly taken one of two forms: deciding once per query, or learning the policy end to end with RL. The former is powerless in multi-turn interactions, and the latter is expensive and still suffers from the chronic problem of **routing collapse**.

To give the conclusion up front, SkillOrchestra introduces an intermediate abstraction layer called "skills" and achieves up to 22.5 percentage points higher accuracy and a 700x reduction in training cost compared with RL-based orchestrators [1]. It is joint work from the University of Wisconsin-Madison and Salesforce AI Research.

Key Summary

Limitations of Existing Methods

The Problem with Model Routing

Model routing selects a suitable model from a model pool when a query arrives [2][3]. Discriminative methods such as KNN Router, BERT Router, and GraphRouter are typical examples.

The problem is that these methods make a one-shot decision: they look at the query once, pick a model, and that is it. In a multi-turn agent workflow each step needs a different capability, and a one-shot router cannot reflect that. For example, the first turn might need web search, the second code execution, and the third mathematical reasoning.

The Problem with RL-Based Orchestration

RL-based methods such as Router-R1 [4] and ToolOrchestra [5] tried to solve this problem. They train an LLM with PPO or GRPO to optimize a sequential routing policy.

But there are two problems.

  1. Training is expensive. Router-R1 needs PPO training on 14k samples.
  2. Routing collapse occurs. In the experiments, Router-R1 sent 98% of all calls to a single model, LLaMA-3.1-70B. Every other model received less than 1% each, so it is effectively a single-model caller rather than a multi-model router.

This is the core problem SkillOrchestra sets out to solve.
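Routing collapse of this kind is easy to check for in call logs. The following is a minimal sketch, not code from the paper: `routing_shares` and `normalized_entropy` are hypothetical helpers, and the sample log is made up to mirror the 98% concentration described above.

```python
import math
from collections import Counter


def routing_shares(calls):
    """Return the fraction of calls routed to each model."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}


def normalized_entropy(shares, pool_size):
    """Entropy of the routing distribution, scaled to [0, 1] by log(pool_size)."""
    entropy = -sum(p * math.log(p) for p in shares.values() if p > 0)
    return entropy / math.log(pool_size)


# Illustrative log (assumed, not the paper's data): 98 of 100 calls hit one model.
calls = ["LLaMA-3.1-70B"] * 98 + ["Qwen2.5-7B", "Mistral-7B"]
shares = routing_shares(calls)
print(max(shares.values()))                     # share of the most-used model
print(normalized_entropy(shares, pool_size=6))  # close to 0 signals collapse
```

A value near 1 would mean calls are spread evenly over the pool; a value near 0 means the router behaves like a single-model caller.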

SkillOrchestra: Skill-Handbook-Based Orchestration

Core Idea

The idea behind SkillOrchestra is simple. Instead of learning a routing policy end to end, build a **reusable Skill Handbook**.

A skill is defined as "a reusable capability abstraction needed in a particular operating mode." For example, under the coding mode there are fine-grained skills such as data_processing.symbolic_logic (rule-based reasoning) and data_processing.numerical_approximation (numerical approximation).

Formally, a skill \(\sigma\) is written as:

\[\sigma \triangleq \langle D, I \rangle\]

where \(D\) is a natural-language description of the capability and \(I\) is a set of contextual indicators (keywords, structural patterns, and so on) that signal when the skill applies.
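To make the pair \(\langle D, I \rangle\) concrete, here is a minimal sketch of a skill record. The `Skill` class, the `matching_skills` helper, and the example descriptions and indicator keywords are all assumptions for illustration. Plain keyword matching stands in for however the orchestrator actually identifies active skills.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Skill:
    name: str
    description: str             # D: natural-language description of the capability
    indicators: Tuple[str, ...]  # I: contextual indicators (keywords here)


def matching_skills(registry: List[Skill], context: str) -> List[str]:
    """Naive stand-in for skill activation: match indicator keywords in the context."""
    text = context.lower()
    return [s.name for s in registry if any(k in text for k in s.indicators)]


# Hypothetical registry entries; descriptions and keywords are made up.
registry = [
    Skill("data_processing.symbolic_logic",
          "Rule-based reasoning over structured facts",
          ("if-then", "rule", "constraint")),
    Skill("data_processing.numerical_approximation",
          "Approximating numeric quantities",
          ("approximate", "estimate", "round")),
]
active = matching_skills(registry, "Estimate the total and round to two decimals")
print(active)
```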

Structure of the Skill Handbook

The Skill Handbook \(H\) consists of three layers:

  1. Mode-level metadata (\(V_\Psi\)): stores transition insights for each operating mode (search, coding, answering, and so on). A rule such as "if there are two or more arithmetic operations or aggregation is needed, switch to coding mode instead of search" belongs here.
  2. Skill registry (\(V_\Sigma\)): manages the fine-grained skill definitions and the conditions under which they apply.
  3. Agent profiles (\(V_P\)): summarize each agent's per-skill success probability, cost characteristics, and strengths and weaknesses.

Concretely, an agent profile is defined as:

\[P_{A,\psi} = (\{\phi_{A,\sigma}\}_{\sigma \in \Sigma_\psi}, \hat{C}_A(\psi), R_{A,\psi}, \Gamma_A)\]

where \(\phi_{A,\sigma}\) is the estimated success probability of agent \(A\) on skill \(\sigma\), \(\hat{C}_A(\psi)\) is the expected cost per mode, \(R_{A,\psi}\) is the set of routing signals, and \(\Gamma_A\) is a summary of strengths and weaknesses.
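One way to picture the three layers is as a plain data structure. This is only a sketch under assumed names (`AgentProfile`, `SkillHandbook`, `agent_a`) with made-up values; the paper's actual storage format is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AgentProfile:
    success_prob: Dict[str, float]  # phi_{A,sigma}: per-skill success estimate
    expected_cost: float            # C_hat_A(psi): expected cost in this mode
    routing_signals: List[str]      # R_{A,psi}
    summary: str                    # Gamma_A: strengths and weaknesses


@dataclass
class SkillHandbook:
    # V_Psi: mode -> transition insights
    mode_metadata: Dict[str, List[str]] = field(default_factory=dict)
    # V_Sigma: mode -> names of skills registered under it
    skill_registry: Dict[str, List[str]] = field(default_factory=dict)
    # V_P: (agent, mode) -> profile
    agent_profiles: Dict[Tuple[str, str], AgentProfile] = field(default_factory=dict)


handbook = SkillHandbook()
handbook.mode_metadata["coding"] = [
    "Switch from search to coding when two or more arithmetic operations "
    "or aggregation are needed."
]
handbook.skill_registry["coding"] = ["data_processing.symbolic_logic"]
handbook.agent_profiles[("agent_a", "coding")] = AgentProfile(
    success_prob={"data_processing.symbolic_logic": 0.8},  # illustrative value
    expected_cost=0.5,                                     # illustrative value
    routing_signals=["prefer for rule-heavy queries"],
    summary="Strong on rule-based reasoning, weaker on numeric estimation.",
)
```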

How It Works at Deployment

At deployment, the orchestrator makes two decisions at each timestep:

Step 1 - Mode selection: from the current state \(s_t\), decide which operating mode (search, coding, answering) to run.

Step 2 - Skill-based agent selection: identify the set of active skills \(\Sigma_t\) in the selected mode, then combine each agent's competence and cost to choose the best agent:

\[A^*_t = \arg\max_{A \in \mathcal{A}_{\psi_t}} \left[ \sum_{\sigma \in \Sigma_t} w_{t,\sigma} \frac{\alpha_{A,\sigma}}{\alpha_{A,\sigma} + \beta_{A,\sigma}} - \lambda_c \cdot \hat{C}_A(\psi_t) \right]\]

where \(\frac{\alpha_{A,\sigma}}{\alpha_{A,\sigma} + \beta_{A,\sigma}}\) is the per-skill success probability, estimated as the posterior mean of a Beta distribution. What makes this formulation clean is that competence estimation and cost are explicitly separated, so the trade-off between them can be controlled through \(\lambda_c\).
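The selection rule can be sketched in a few lines. The function name `select_agent`, the agent names, and all numbers below are assumptions chosen to show how \(\lambda_c\) shifts the choice; they are not values from the paper.

```python
from typing import Dict, Tuple


def select_agent(
    beta_params: Dict[str, Dict[str, Tuple[float, float]]],
    costs: Dict[str, float],
    weights: Dict[str, float],
    lambda_c: float,
) -> str:
    """Pick the agent maximizing weighted posterior-mean competence minus lambda_c * cost."""
    def score(agent: str) -> float:
        competence = 0.0
        for skill, w in weights.items():
            alpha, beta = beta_params[agent][skill]
            competence += w * alpha / (alpha + beta)
        return competence - lambda_c * costs[agent]

    return max(costs, key=score)


# Hypothetical candidates: (alpha, beta) per active skill, plus expected cost.
beta_params = {
    "small_model": {"symbolic_logic": (3.0, 7.0)},  # posterior mean 0.3
    "large_model": {"symbolic_logic": (9.0, 1.0)},  # posterior mean 0.9
}
costs = {"small_model": 0.1, "large_model": 1.0}
weights = {"symbolic_logic": 1.0}

mild_penalty = select_agent(beta_params, costs, weights, lambda_c=0.1)
steep_penalty = select_agent(beta_params, costs, weights, lambda_c=1.0)
print(mild_penalty, steep_penalty)
```

With a mild cost penalty the more competent agent wins; with a steep one the cheaper agent does, which is the trade-off \(\lambda_c\) controls.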

Learning the Skill Handbook

Phase 1: Skill Discovery and Profile Construction

On an exploration dataset \(D_{\text{train}} = \{(q_i, B_i)\}_{i=1}^N\), the method contrasts the successful and failed trajectories obtained by swapping agents on the same query.

Analyzing the difference \(D_{\text{diff}}(\tau^+ | \tau^-)\) between a successful trajectory \(\tau^+\) and a failed trajectory \(\tau^-\) reveals the capability the failing agent lacked. An LLM-based discoverer abstracts this capability gap into a reusable skill definition.

Agent profiles are estimated from the aggregated outcomes. For each agent \(A\), mode \(\psi\), and skill \(\sigma\), the success probability is modeled with a Beta distribution:

\[\alpha_{A,\sigma}^{(t+1)} \leftarrow \alpha_{A,\sigma}^{(t)} + \sum_{\tau} \mathbb{I}[A \text{ succeeds on } \sigma]\]

\[\beta_{A,\sigma}^{(t+1)} \leftarrow \beta_{A,\sigma}^{(t)} + \sum_{\tau} \mathbb{I}[A \text{ fails on } \sigma]\]

Using a Beta distribution is a sensible choice: with little data the estimate leans on the prior, and as data accumulates it naturally becomes more accurate.
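A small worked example of the Beta update, as a sketch. The uniform prior \(\alpha = \beta = 1\) and the outcome lists are assumptions for illustration; the prior used in the paper is not stated here.

```python
from typing import Iterable, Tuple


def update_beta(alpha: float, beta: float, outcomes: Iterable[bool]) -> Tuple[float, float]:
    """Add observed successes to alpha and observed failures to beta."""
    outcomes = list(outcomes)
    successes = sum(outcomes)
    failures = len(outcomes) - successes
    return alpha + successes, beta + failures


def posterior_mean(alpha: float, beta: float) -> float:
    return alpha / (alpha + beta)


alpha, beta = 1.0, 1.0  # assumed uniform prior
alpha, beta = update_beta(alpha, beta, [True, True, False])
early = posterior_mean(alpha, beta)   # 3 / 5 after three trajectories

alpha, beta = update_beta(alpha, beta, [True] * 8 + [False] * 2)
later = posterior_mean(alpha, beta)   # 11 / 15 after thirteen trajectories
print(early, later)
```

After three trajectories the prior still pulls the estimate toward 0.5; after thirteen it is dominated by the observed success rate.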

Phase 2: Handbook Refinement

This is a refinement step that keeps skills from becoming too fine-grained or redundant.

Pareto-Optimal Handbook Selection

An interesting finding is that finer-grained skills are not always better. The right level of granularity depends on the orchestrator's reasoning ability.

Asking a weak orchestrator to distinguish symbolic_logic from numerical_approximation can lead it to activate the wrong skill. In that case, deciding at the level of the parent skill data_processing is more stable.

To handle this, a Pareto-optimal handbook is selected on validation data:

\[H_{\text{base}}^{(O)} = \arg\max_{H \subseteq H^*} \mathbb{E}_{q \sim D_{\text{val}}} \left[ R(\tau_H(q)) - \lambda \sum_{t=0}^{|\tau_H(q)|} C(\psi_t, A_t) \right]\]
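The objective above is a mean of reward minus \(\lambda\)-weighted cumulative cost over validation queries. A minimal sketch follows, assuming the candidate handbooks have already been enumerated and rolled out on validation data. The names `coarse` and `fine` and all rollout numbers are hypothetical.

```python
from typing import Dict, List, Tuple

# Each rollout is (reward, per-step costs) for one validation query.
Rollout = Tuple[float, List[float]]


def select_handbook(candidates: Dict[str, List[Rollout]], lam: float) -> str:
    """Return the candidate with the highest mean of reward - lam * total cost."""
    def objective(rollouts: List[Rollout]) -> float:
        return sum(r - lam * sum(costs) for r, costs in rollouts) / len(rollouts)

    return max(candidates, key=lambda name: objective(candidates[name]))


# Hypothetical validation rollouts for two granularity levels.
candidates = {
    "coarse": [(1.0, [0.2, 0.1]), (0.0, [0.1])],
    "fine": [(1.0, [0.5, 0.4]), (1.0, [0.6])],
}
low_lambda_choice = select_handbook(candidates, lam=0.1)
high_lambda_choice = select_handbook(candidates, lam=2.0)
print(low_lambda_choice, high_lambda_choice)
```

In this toy setup the fine-grained handbook wins when cost barely matters and the coarse one wins when cost is heavily penalized.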

Experimental Results

Model Routing (QA Benchmarks)

This experiment uses Qwen2.5-3B as the orchestrator with a pool of six models (Qwen2.5-7B, LLaMA-3.1-8B/70B, Mistral-7B, Mixtral-8x22B, Gemma-2-27B). Here are the results on seven QA benchmarks.

| Method | Average EM |
| --- | --- |
| RAG | 26.7 |
| Search-R1 | 29.1 |
| RouterDC | 31.4 |
| FrugalGPT | 31.8 |
| Router-R1 (RL) | 41.6 |
| SkillOrchestra | 47.4 |
| SkillOrchestra+ | 51.6 |

SkillOrchestra improves on Router-R1 by +5.8 and SkillOrchestra+ by +10.0. The gain is especially visible on multi-hop QA: Musique rises from 13.8 → 18.2 → 20.6, and Bamboogle from 51.2 → 58.4 → 63.2.

Math reasoning (MATH, AMC23) is even more dramatic. MATH improves from 55.8% to 73.6%, a gain of 17.8 percentage points, and AMC23 from 25.0% to 52.5%, a gain of 27.5 percentage points, while cost is actually cut roughly in half.

Resolving Routing Collapse

Looking at Router-R1's model-selection distribution, it is effectively a single-model caller. With SkillOrchestra, in contrast, each model handles the work that matches its strengths, so genuine multi-model orchestration takes place. The orchestrator also answers directly in some cases (11.50%), which cuts down on unnecessary external calls.

Agent Orchestration (FRAMES)

The results also hold up in full agent orchestration that includes tool use. These are FRAMES benchmark results with Qwen3-8B as the orchestrator and a different model pool for each of the three modes (search, coding, answering):

| Method | Accuracy (%) | Cost ($) |
| --- | --- | --- |
| ToolOrchestra (RL) | 76.3 | 92.7 |
| GPT-5 (orchestrator) | 74.6 | 120.4 |
| Claude Opus 4.5 (orchestrator) | 77.9 | 758.1 |
| Gemini 3 Pro (orchestrator) | 78.9 | 1,729.3 |
| SkillOrchestra | 84.3 | 72.7 |

SkillOrchestra reaches 8.0 percentage points higher accuracy than the RL-trained ToolOrchestra while cutting cost by 21.6%. It is also more accurate and far cheaper than using a strong proprietary model such as GPT-5 or Claude Opus 4.5 as the orchestrator.

The costs of Claude Opus 4.5 ($758.1) and Gemini 3 Pro ($1,729.3) show how cost-inefficient it is to rely on a single strong model.
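As a quick check of the arithmetic behind these comparisons, using only the FRAMES figures reported above:

\[84.3 - 76.3 = 8.0 \text{ points}, \qquad \frac{92.7 - 72.7}{92.7} \approx 0.216\]

\[\frac{758.1}{72.7} \approx 10.4, \qquad \frac{1729.3}{72.7} \approx 23.8\]

So on these numbers, orchestrating with Claude Opus 4.5 or Gemini 3 Pro costs roughly 10x and 24x as much as SkillOrchestra.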

Handbook Transferability

Results of applying the Skill Handbook learned with Qwen2.5-3B to other models as is:

| Model | Without handbook | With handbook | Gain |
| --- | --- | --- | --- |
| Qwen2.5-3B | 40.7% | 56.1% | +15.4 |
| Qwen2.5-7B | 35.7% | 60.0% | +24.3 |
| LLaMA-3.1-8B | 35.5% | 58.0% | +22.5 |
| Mistral-7B | 36.5% | 59.8% | +23.3 |
| Mixtral-8x22B | 46.5% | 61.3% | +14.8 |

Performance improves consistently without any retraining. It is also interesting that the strongest model, Mixtral-8x22B, reaches the highest accuracy with the handbook (61.3%), although the largest gains in this table go to the 7B and 8B models. This suggests that the Skill Handbook holds transferable orchestration knowledge that is not tied to model parameters.

Component Contribution Analysis (Ablation)

Ablation results on FRAMES with 100 samples:

| Setting | Accuracy (%) | Cost ($) |
| --- | --- | --- |
| No handbook | 71.0 | 122.9 |
| Discovery only (no refinement or selection) | 79.0 | 5.5 |
| With refinement (no selection) | 79.3 | 3.4 |
| No fine-grained skills | 80.4 | 15.1 |
| Full system | 85.0 | 9.3 |

Without a handbook, accuracy is 71.0% at a cost of $122.9. Skill discovery alone drops the cost sharply to $5.5, and the full system reaches the best trade-off at 85.0% / $9.3. Refinement contributes by lowering cost, and fine-grained skills contribute by raising accuracy.

Critical Analysis

Strengths

  1. Skills as an intermediate abstraction: placing "skills" between the query level and the agent level is both intuitive and effective. It is the same principle people use when building a team: "What competencies does this job need, and who has them?"

  2. Resolving routing collapse: addressing the 98% concentration problem of RL-based methods through explicit competence modeling is a valuable result.

  3. Cost efficiency: cutting training cost by 700x (versus Router-R1) and 300x (versus ToolOrchestra) matters a great deal in practice. Being able to learn a handbook from fewer than 50 samples means it can be applied quickly.

  4. Interpretability: it is possible to trace which skills were activated and why a particular agent was chosen. This contrasts with the black-box nature of RL-based methods.

Weaknesses

  1. LLM dependence in skill discovery: discovering and refining skills relies on an LLM (presumably GPT-5). Handbook quality may depend on the quality of that LLM, but there is no sensitivity analysis for this.

  2. Adapting to dynamic environments: there is little discussion of how the handbook should be updated when the model pool changes or new tools are added. The handbook is described as "transferable," but that refers to swapping only the orchestrator within the existing model pool.

  3. Limited experimental setting: it is unclear whether the FRAMES benchmark's cap of 50 turns adequately reflects real compound-agent scenarios. It would be interesting to see how performance changes over longer horizons.

  4. Automatic choice of skill granularity: Pareto-optimal handbook selection depends on validation data, so if that data differs from the distribution seen in deployment, the optimal granularity may differ as well.

Conclusion

SkillOrchestra brings an intermediate abstraction, the "skill," to the orchestration problem in compound AI systems and achieves effective multi-agent routing without RL. Resolving routing collapse, cutting training cost by hundreds of times, and also securing handbook transferability add up to a result with real practical significance.

In my view, what makes this approach especially valuable is how quickly model pools change. With new models arriving every month, retraining a policy from scratch with RL is not realistic. Keeping the Skill Handbook as an independent knowledge structure and adding only a profile for each new model scales much better.

As agent orchestration becomes more important, it will be worth watching how far the skill-based approach can be extended.

References

[1] J. Wang, Y. Ming, Z. Ke, S. Joty, A. Albarghouthi, and F. Sala, "SkillOrchestra: Learning to Route Agents via Skill Transfer," arXiv preprint arXiv:2602.19672, 2026.

[2] L. Chen, M. Zaharia, and J. Zou, "FrugalGPT: How to use large language models while reducing cost and improving performance," Transactions on Machine Learning Research, 2024.

[3] Q. J. Hu et al., "Routerbench: A benchmark for multi-LLM routing system," arXiv preprint arXiv:2403.12031, 2024.

[4] H. Zhang, T. Feng, and J. You, "Router-R1: Teaching LLMs multi-round routing and aggregation via reinforcement learning," in NeurIPS, 2025.

[5] H. Su et al., "ToolOrchestra: Elevating intelligence via efficient model and tool orchestration," arXiv preprint arXiv:2511.21689, 2025.