The Free Transformer

๐Ÿท๏ธ ๋…ผ๋ฌธ ๋”ฅ๋Ÿฌ๋‹

F. Fleuret, "The Free Transformer", arXiv preprint arXiv:2510.17558, 2025.

Summary

Architecture: The standard decoder Transformer is extended into a conditional variational autoencoder (VAE). A random latent variable \(Z\) is injected at the middle layer, and the encoder consists of the first half of the layers plus a single non-causal Transformer block.

[Figure: 1-freetransformer.png]

๋ชจ๋ธ ํฌ๊ธฐ: 1.5B ๋ชจ๋ธ(28๋ ˆ์ด์–ด)๊ณผ 8B ๋ชจ๋ธ(32๋ ˆ์ด์–ด, Llama-3 ๊ตฌ์กฐ)์„ ๊ฐ๊ฐ 47B, 200B, 1T ํ† ํฐ์œผ๋กœ ํ›ˆ๋ จํ–ˆ์Šต๋‹ˆ๋‹ค. ์ธ์ฝ”๋”๋กœ ์ธํ•œ ์˜ค๋ฒ„ํ—ค๋“œ๋Š” 1.5B์—์„œ 3.6%, 8B์—์„œ 3.1%์— ๋ถˆ๊ณผํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜:

ํ›ˆ๋ จ ์†์‹ค: ํ‘œ์ค€ ๊ต์ฐจ ์—”ํŠธ๋กœํ”ผ + ์ œ์–ด๋œ KL ๋ฐœ์‚ฐ \[\mathcal{L} = \text{CE} + \frac{1}{T}\sum_{t=1}^T \max\left(0, D_{KL}(Q(Z_t|S) | P(Z_t)) - \kappa\right)\]

Performance gains (8B, 1T tokens):

Evaluation benchmarks: The models were evaluated on 15 tasks, including code and math generation (HumanEval+, MBPP, GSM8K), multiple-choice commonsense reasoning (MMLU, CSQA, HellaSwag), reading comprehension (RACE, BoolQ), and knowledge retrieval (NQ, TriviaQA).

Paper details

1. Introduction

Almost ten years after the invention of the Transformer, autoregressive modeling has remained essentially unchallenged. This paper revisits that core design choice so that richer and more natural density models can emerge.

A decoder Transformer is an autoregressive discrete density approximator. It models a token sequence \(S_1, \ldots, S_T\) by estimating the conditional distribution of each token given the tokens that precede it. The only density modeling and sampling such a model implements is that of the generated tokens themselves. In particular, a decoder Transformer makes no additional latent decisions about the token stream it is going to generate.

Consider a simple example. Let \(Z \sim B(0.5)\) be a latent "coin flip", and let \(X_1, \ldots, X_T\) be equal to \(Z\) with independent flips of probability \(\epsilon\). The \(X_t\) are conditionally independent given \(Z\), and:

\[P(X_{t+1} = 1 | Z = z) = \epsilon z + (1-\epsilon)(1-z)\]

Expressed as an autoregressive model without \(Z\), however, this becomes:

\[P(X_{t+1} = 1 | X_1 = x_1, \ldots, X_t = x_t) = \frac{\left(\frac{\epsilon}{1-\epsilon}\right)^{\sum_{s=1}^t x_s}(1-\epsilon)^{t+1} + \left(\frac{1-\epsilon}{\epsilon}\right)^{\sum_{s=1}^t x_s}\epsilon^{t+1}}{\left(\frac{\epsilon}{1-\epsilon}\right)^{\sum_{s=1}^t x_s}(1-\epsilon)^t + \left(\frac{1-\epsilon}{\epsilon}\right)^{\sum_{s=1}^t x_s}\epsilon^t}\]

A purely autoregressive density model potentially suffers from several drawbacks:

2. Motivation

์ฒด์ธ ๋ฃฐ๋กœ ์ธํ•ด ๋ชจ๋“  ๋ฐ€๋„๋Š” ์ž๊ธฐํšŒ๊ท€๋กœ ๋ชจ๋ธ๋ง๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ํŠนํžˆ "์ž์—ฐ์Šค๋Ÿฌ์šด" ๊ตฌ์กฐ๊ฐ€ ์ž ์žฌ ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ์กฐ๊ฑด๋ถ€๋ฅผ ํฌํ•จํ•  ๋•Œ, ์‹ ํ˜ธ์˜ ์ž๊ธฐํšŒ๊ท€ ๋ชจ๋ธ์€ ์ž ์žฌ ๋ณ€์ˆ˜๋ฅผ ํฌํ•จํ•œ ์ „์ฒด ๊ฒฐํ•ฉ ๋ชจ๋ธ๋ณด๋‹ค ํ›จ์”ฌ ๋” ๋ณต์žกํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ์—ฐ๊ตฌ์˜ ์ฃผ์š” ๋ชฉํ‘œ๋Š” ํ›ˆ๋ จ ์˜ˆ์ œ์— ์˜ํ•ด ๋ถ€๊ณผ๋˜์ง€ ์•Š๋Š” ์ž ์žฌ ๋žœ๋ค ์–‘์— ์ž๊ธฐํšŒ๊ท€ ํ”„๋กœ์„ธ์Šค๋ฅผ ์กฐ๊ฑดํ™”ํ•  ์ž์œ ๋ฅผ ๋ชจ๋ธ์— ์ œ๊ณตํ•˜์—ฌ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

3. Method

Conditional variational autoencoder: Generating a full sequence from scratch with a model that depends on a random variable \(Z\) is straightforward: sample \(Z \sim P(Z)\), then run the standard autoregressive process.

Training the model is far more involved. Given a training sample \(S\), the objective is to maximize:

\[P(S) = \int_z P(S | Z=z)P(Z=z)dz\]

The role of the encoder in a VAE is to sample from a "good" distribution \(Q(Z|S)\), so that the sampled \(Z\) modulates the decoder into generating \(S\).
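For readers who want the missing step between the marginal likelihood and the encoder, the usual variational bound is written out below. This is the textbook VAE derivation, not something quoted from the paper, and it uses only the symbols \(S\), \(Z\), \(P\), and \(Q\) introduced above.

```latex
\begin{aligned}
\log P(S)
  &= \log \int_z Q(Z=z \mid S)\,
       \frac{P(S \mid Z=z)\,P(Z=z)}{Q(Z=z \mid S)}\,dz \\
  &\ge \mathbb{E}_{Z \sim Q(Z \mid S)}\big[\log P(S \mid Z)\big]
       \;-\; D_{KL}\big(Q(Z \mid S)\,\|\,P(Z)\big)
\end{aligned}
```

The inequality is Jensen's. The first term is a reconstruction term (the negative of a cross-entropy), and the second penalizes how far the encoder distribution drifts from the prior.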

๋ชจ๋ธ ๊ตฌ์กฐ: Free Transformer๋Š” ์ค‘๊ฐ„ ๋ ˆ์ด์–ด์— ๋…ธ์ด์ฆˆ \(Z\)๊ฐ€ ์ฃผ์ž…๋œ ํ‘œ์ค€ ๋””์ฝ”๋”์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ํŠธ๋žœ์Šคํฌ๋จธ ๋ธ”๋ก์˜ ์ ˆ๋ฐ˜์„ ์ธ์ฝ”๋”์™€ ๊ณต์œ ํ•˜์—ฌ ์ธ์ฝ”๋”์— ํŠน์ •ํ•˜๊ฒŒ ๊ณ„์‚ฐํ•ด์•ผ ํ•˜๋Š” ๋‹จ์ผ ํŠธ๋žœ์Šคํฌ๋จธ ๋ธ”๋ก๋งŒ ์žˆ์œผ๋ฉด ๋˜๋ฏ€๋กœ ๊ณ„์‚ฐ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ๋Œ€ํญ ์ค„์ž…๋‹ˆ๋‹ค.


Like a standard decoder Transformer, the Free Transformer encodes the token sequence with an embedding table, producing a tensor \(X_0\) of shape \(T \times D\). It then evaluates the first \(L/2\) Transformer blocks sequentially to obtain \(X_{L/2}\), which has the same shape.

At this point it samples a sequence of one-hot vectors \(Z = (Z_1, \ldots, Z_T) \in \{0,1\}^{T \times C}\). During generation, for each \(Z_t\) an index \(c\) is sampled uniformly from \(\{0, \ldots, C-1\}\) and then encoded as a one-hot vector of dimension \(C\).
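The generation-time sampling of \(Z\) can be sketched in a few lines of NumPy. This is only an illustration of the uniform one-hot sampling described above; the function name `sample_prior_one_hot` and the small values of `T` and `C` are assumptions chosen for readability.

```python
import numpy as np


def sample_prior_one_hot(T: int, C: int, rng: np.random.Generator) -> np.ndarray:
    """Sample Z of shape (T, C): one uniform one-hot vector per position."""
    idx = rng.integers(0, C, size=T)      # index c in {0, ..., C-1} for each t
    Z = np.zeros((T, C), dtype=np.int8)
    Z[np.arange(T), idx] = 1              # one-hot encode each sampled index
    return Z


rng = np.random.default_rng(0)
Z = sample_prior_one_hot(T=5, C=8, rng=rng)
print(Z.shape)          # (5, 8)
print(Z.sum(axis=1))    # each row contains exactly one 1
```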

์ธ์ฝ”๋”์™€ ์†์‹ค: ํ›ˆ๋ จ ๋˜๋Š” KV ์บ์‹œ ์‚ฌ์ „ ์ฑ„์šฐ๊ธฐ ์ค‘์— ํ…์„œ \(Z\)๋Š” ์ธ์ฝ”๋”๋กœ ์ƒ˜ํ”Œ๋ง๋ฉ๋‹ˆ๋‹ค. Free Transformer๋Š” ๋น„์ธ๊ณผ์ ์ธ ์ธ์ฝ”๋” ์ „์šฉ ํŠธ๋žœ์Šคํฌ๋จธ ๋ธ”๋ก ํ•˜๋‚˜๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋””์ฝ”๋”์˜ ์กฐ๊ฑดํ™”๊ฐ€ ์žฅ๊ฑฐ๋ฆฌ ํšจ๊ณผ๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์–ด ์ ์ ˆํ•œ ์ž ์žฌ ์กฐ๊ฑด๋ถ€ ๋ถ„ํฌ๋ฅผ ์–ป๊ธฐ ์œ„ํ•ด ์ „์ฒด ์‹œํ€€์Šค๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์„ ํ˜• ํŒ๋…์€ ์ธ์ฝ”๋” ๋ธ”๋ก์˜ ์ถœ๋ ฅ์—์„œ ๋ชจ๋“  ํ† ํฐ์— ๋Œ€ํ•ด \(H=16\) ์ฐจ์›์˜ ๋ฒกํ„ฐ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ตฌ์„ฑ ์š”์†Œ๋Š” ๊ฐœ๋ณ„ ๋น„ํŠธ์˜ ๋กœ์ง“์œผ๋กœ ํ•ด์„๋˜์–ด \({0, \ldots, 2^H - 1}\)์—์„œ ๊ฐ’์„ ์ƒ˜ํ”Œ๋งํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

KL ๋ฐœ์‚ฐ์€ ๊ฐœ๋ณ„ \(Z_t\)์˜ KL ๋ฐœ์‚ฐ์„ ์ž„๊ณ„๊ฐ’ \(\kappa\) ์ด์ƒ์ธ ๊ฒƒ๋งŒ ํ•ฉ์‚ฐํ•˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ๋ฌด์‹œํ•˜๋Š” ํ† ํฐ๋ณ„ free bits ๋ฐฉ๋ฒ•์œผ๋กœ ์ œ์–ด๋ฉ๋‹ˆ๋‹ค:

\[\frac{1}{T}\sum_{t=1}^T \max\left(0, D_{KL}(Q(Z_t|S_1, \ldots, S_T) \,\|\, P(Z_t)) - \kappa\right)\]
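The free-bits term can be sketched numerically under two assumptions taken from the description in this post: the prior \(P(Z_t)\) is uniform over \(2^H\) values, and \(Q\) factorizes into \(H\) independent bits with the given logits. In that case the per-token KL has the closed form \(H \log 2 - \sum_h \mathcal{H}(p_h)\), where \(\mathcal{H}\) is the binary entropy. The function name `free_bits_kl` is hypothetical, and this is a NumPy illustration, not the author's training code.

```python
import numpy as np


def free_bits_kl(logits: np.ndarray, kappa: float):
    """Token-wise free-bits KL penalty for independent-bit posteriors.

    logits: array of shape (T, H), one logit per bit and per token.
    Returns (penalty, kl_per_token), both in nats.
    """
    T, H = logits.shape
    p = 1.0 / (1.0 + np.exp(-logits))          # P(bit = 1)
    eps = 1e-12                                # avoids log(0)
    entropy = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))
    # KL(Q || uniform over 2^H values) = H*log(2) - sum of bit entropies
    kl = H * np.log(2.0) - entropy.sum(axis=1)
    penalty = np.maximum(0.0, kl - kappa).mean()
    return penalty, kl


kappa = np.log(2.0) / 2                        # half a bit per token
logits = np.array([[0.0, 0.0, 0.0, 0.0],       # uninformative: KL ~ 0
                   [50.0, -50.0, 50.0, -50.0]])  # deterministic: KL ~ 4 bits
penalty, kl = free_bits_kl(logits, kappa)
print(kl, penalty)
```

A token whose posterior stays at the prior pays nothing, and a token that encodes information pays only for the part above \(\kappa\).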

Binary Mapper: The last linear layer of the encoder computes, for every index \(t\) of the sequence being processed, a vector \(L_t = (L_{t,1}, \ldots, L_{t,H}) \in \mathbb{R}^H\). Its components are interpreted as the logits of the individual bits of a binary encoding.

The Binary Mapper samples the bits \(B_{t,1}, \ldots, B_{t,H}\) independently with:

\[P(B_{t,h} = 1) = \frac{1}{1 + e^{-L_{t,h}}}\]

It then outputs a one-hot vector \(Y_t\) of dimension \(2^H\) corresponding to the resulting value.
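The forward sampling path of the Binary Mapper can be sketched as follows. The bit ordering (bit \(h\) carries weight \(2^{h-1}\)) is an assumption, since the text does not specify it, and the sketch covers sampling only; how gradients flow through the discrete sample is not described in this summary and is left out. The name `binary_mapper` is used here purely for illustration.

```python
import numpy as np


def binary_mapper(logits: np.ndarray, rng: np.random.Generator):
    """Sample H independent bits per token and one-hot encode the value.

    logits: array of shape (T, H).
    Returns (bits, values, one_hot) with shapes (T, H), (T,), (T, 2**H).
    """
    T, H = logits.shape
    probs = 1.0 / (1.0 + np.exp(-logits))              # P(B_{t,h} = 1)
    bits = (rng.random((T, H)) < probs).astype(np.int64)
    weights = 2 ** np.arange(H)                        # assumed bit order
    values = bits @ weights                            # integer in [0, 2**H - 1]
    one_hot = np.zeros((T, 2 ** H), dtype=np.int8)
    one_hot[np.arange(T), values] = 1
    return bits, values, one_hot


rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))                       # small H for readability
bits, values, Y = binary_mapper(logits, rng)
print(bits, values, Y.shape)
```

With \(H=16\) the one-hot dimension is 65,536 per token, so a small \(H\) is used here to keep the example light.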

4. Experiments

Synthetic dataset: A synthetic dataset was designed to check that the Free Transformer actually uses \(Z\) to condition the generative process. Each sequence starts as 64 underscores; an uppercase letter and a position in the sequence are chosen at random, and the underscores at that position are replaced with a "target" made of the chosen letter repeated 8 times.
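A generator for sequences of this shape might look like the sketch below. It follows only what is stated above (64 underscores, one random uppercase letter, one random position, a target of length 8). It assumes the target always fits entirely inside the sequence, and it does not model any noise, since the noise process is not specified in this summary. The function name `make_sequence` is hypothetical.

```python
import random
import string


def make_sequence(rng: random.Random, length: int = 64, target_len: int = 8) -> str:
    """Build one synthetic sequence: underscores with an 8-letter target."""
    letter = rng.choice(string.ascii_uppercase)
    # Assumption: the target lies fully within the sequence.
    pos = rng.randrange(0, length - target_len + 1)
    seq = ["_"] * length
    seq[pos:pos + target_len] = [letter] * target_len
    return "".join(seq)


rng = random.Random(0)
for _ in range(3):
    print(make_sequence(rng))
```

The two latent choices here, the letter and the position, are exactly the kind of sequence-level decisions a latent \(Z\) could carry.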

For very low values of the KL divergence, the model behaves like a vanilla model. As the value increases, the model first encodes only the position of the target in the latent state, then both the target position and the noise, and finally the entire sequence, which results in incorrect generation.

Exploratory results: A 1.5B model (47B tokens) and an 8B model (200B tokens) were trained with various KL divergence thresholds, and their performance was compared across several benchmarks.

Substantial gains were observed on the benchmarks that require reasoning: HumanEval+, MBPP, and GSM8K. For the 8B model with a KL threshold of 1/2 bit, there was also a clear improvement on the multiple-choice benchmarks MMLU and CSQA.

1T-token training results: To measure the improvement in a more realistic setting, the 8B model was trained on 1T tokens. Based on the 200B-token results, the value \(\kappa = \log(2)/2\) was chosen, which corresponds to at most half a bit of information per token.

The key result is the performance gain on HumanEval+, MBPP, GSM8K, MMLU, and CSQA, which confirms what was observed in the smaller settings, together with greater stability on the other tasks.
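The "half a bit" reading of \(\kappa\) follows from a unit conversion, written out below under the assumption that \(\log\) denotes the natural logarithm, so that KL values are measured in nats.

```latex
\begin{aligned}
1 \text{ bit} &= \log 2 \approx 0.6931 \text{ nats} \\
\kappa &= \frac{\log 2}{2} \approx 0.3466 \text{ nats} = \frac{1}{2} \text{ bit}
\end{aligned}
```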

5. Previous work

There have been several attempts to combine a VAE with a decoder Transformer. The OPTIMUS model combines a pre-trained BERT as text embedding/encoder with a GPT-2 as decoder, and fine-tunes them with a VAE-like loss.

The CVAE of Fang et al. combines two pre-trained GPT-2 models, one of which is used as the encoder without causal masking. AdaVAE is similarly a combination of two pre-trained GPT-2 models, the first of which acts as the encoder without causal masking.

6. Conclusion

The Free Transformer is a direct extension of the standard decoder Transformer with the abstract structure of a conditional VAE. It is implemented with a single additional non-causal Transformer block and requires only a few percent of compute and memory overhead.

This structure makes it possible to learn latent random variables without supervision and to condition the generative process on them. In a sense, the approach aims to achieve in latent space, with an autoencoder, what reasoning models do in token space with chain-of-thought and RL procedures.

That performance improved across several benchmarks and two model sizes without tuning the optimization hyperparameters is a strong signal that the overall approach does improve the inductive bias of the vanilla Transformer.