Kimi Linear: An Expressive, Efficient Attention Architecture

🏷️ Paper Headliner LLM

Kimi Team et al., 2025 (arXiv:2510.26692)

Transformers and the attention mechanism earned legendary status through sheer performance. Even so, attention has plenty of weaknesses. Its quadratic time complexity is a critical one when handling long contexts. As LLMs came to process inputs on the scale of a million tokens, this drawback gradually surfaced, and it became more serious still once reinforcement learning (RL) based inference scaling started to matter.

1-KimiLinear.png

Linear attention looked like an elegant fix, but it lacks expressiveness and has struggled to catch up with standard attention. This paper offers a new answer to the old question, "Is linear attention really enough?" With a fine-grained gating mechanism and a hybrid architecture, Kimi Linear is reported to be the first to outperform standard attention in every scenario the authors evaluate.

Summary

Core idea: the paper proposes Kimi Delta Attention (KDA), an improved linear attention module, and combines it with the existing MLA (Multi-Head Latent Attention) in a 3:1 hybrid.

Technical specs:

Paper details

1. The problem: the linear attention dilemma

Standard attention compares each query against every key-value pair, so its complexity is O(T²). Processing one million tokens with standard attention takes an enormous amount of compute. Linear attention offers O(T) complexity instead, but its limited expressiveness has kept it from catching up with existing models.
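
To put rough numbers on the O(T²) versus O(T) gap, here is a minimal back-of-the-envelope sketch. It counts multiply-accumulates for a single head; the head sizes `d_k = d_v = 128` and the cost model itself are assumptions for illustration, not figures from the paper.

```python
def softmax_attention_macs(T: int, d_k: int, d_v: int) -> int:
    """Rough multiply-accumulate count for causal softmax attention (one head)."""
    pairs = T * (T + 1) // 2          # each query attends to itself and all earlier keys
    return pairs * (d_k + d_v)        # q.k score plus the weighted sum over v


def linear_attention_macs(T: int, d_k: int, d_v: int) -> int:
    """Rough count for a linear-attention recurrence (one head)."""
    return T * 2 * d_k * d_v          # update the d_k x d_v state, then read it with q


T, d_k, d_v = 1_000_000, 128, 128     # head sizes are hypothetical
ratio = softmax_attention_macs(T, d_k, d_v) / linear_attention_macs(T, d_k, d_v)
print(f"softmax / linear cost ratio at T={T:,}: {ratio:,.0f}x")
```

Under this simplified model the ratio grows linearly with T, which is why the gap only becomes painful at very long contexts.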

Hybrid approaches (some layers use standard attention, others linear attention) have been proposed, but most were evaluated only at limited scale or lacked fair comparisons. The authors set out to address this systematically.

2. Kimi Delta Attention (KDA): fine-grained memory control

2.1 Evolution of the delta rule

At its core, linear attention accumulates a matrix state \(S_t \in \mathbb{R}^{d_k \times d_v}\):

\[S_t = S_{t-1} + k_t v_t^\top\]

This can be viewed as online learning. The catch is that the state only ever accumulates: nothing is erased, so the memory grows without bound and stale information interferes with new information.
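
The interference is easy to reproduce. The sketch below is a plain-Python toy (the helper names are made up for this example) that writes two different values under the same key with the additive rule and then reads the state back with that key.

```python
def outer(k, v):
    """Outer product k v^T as a nested list of shape d_k x d_v."""
    return [[ki * vj for vj in v] for ki in k]


def add(A, B):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def read(S, q):
    """Read-out o = S^T q for a state S of shape d_k x d_v."""
    return [sum(S[i][j] * q[i] for i in range(len(q))) for j in range(len(S[0]))]


def linear_attention_state(keys, values):
    """Apply S_t = S_{t-1} + k_t v_t^T over the whole sequence."""
    S = [[0.0] * len(values[0]) for _ in range(len(keys[0]))]
    for k, v in zip(keys, values):
        S = add(S, outer(k, v))
    return S


# Write two different values under the same key, then query with that key.
key = [1.0, 0.0]
S = linear_attention_state([key, key], [[1.0, 0.0], [0.0, 1.0]])
print(read(S, key))  # [1.0, 1.0]: the old value was never erased, so both are blended
```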

DeltaNet improves on this by taking a gradient descent step on a reconstruction loss:

\[S_t = (I - \beta_t k_t k_t^\top) S_{t-1} + \beta_t k_t v_t^\top\]

This is the classical delta rule: it corrects the memory selectively rather than only adding to it.
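
For readers who want the missing step, the update follows from one gradient step with step size \(\beta_t\) on a squared reconstruction loss. The symbol \(\mathcal{L}_t\) is notation introduced here for the sketch; the \(\tfrac{1}{2}\) scaling is a convenience.

\[\mathcal{L}_t(S) = \tfrac{1}{2}\left\lVert S^\top k_t - v_t \right\rVert^2, \qquad \nabla_S \mathcal{L}_t(S) = k_t \left(S^\top k_t - v_t\right)^\top = k_t k_t^\top S - k_t v_t^\top\]

\[S_t = S_{t-1} - \beta_t \nabla_S \mathcal{L}_t(S_{t-1}) = (I - \beta_t k_t k_t^\top) S_{t-1} + \beta_t k_t v_t^\top\]

As a sanity check, with \(\lVert k_t \rVert = 1\) and \(\beta_t = 1\) the new state satisfies \(S_t^\top k_t = v_t\): the value stored under \(k_t\) is overwritten rather than piled on top of the old one.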

**Gated DeltaNet (GDN)** adds a scalar forget gate \(\alpha_t \in [0, 1]\):

\[S_t = \alpha_t (I - \beta_t k_t k_t^\top) S_{t-1} + \beta_t k_t v_t^\top\]

2.2 KDA's innovation: fine-grained channel-wise gating

KDA extends the scalar gate to a channel-wise vector gate:

\[S_t = \left(I - \beta_t k_t k_t^\top\right) \text{Diag}(\alpha_t) S_{t-1} + \beta_t k_t v_t^\top\]

Here \(\text{Diag}(\alpha_t) \in \mathbb{R}^{d_k \times d_k}\) gives each feature dimension its own forgetting rate. Much like RoPE's per-dimension rotation frequencies, this effectively assigns a different positional encoding to each dimension.

In the actual computation, the weight linking query \(q_t\) to an earlier key \(k_i\) becomes:

\[q_t^\top \left(\prod_{j=i+1}^{t} (I - \beta_j k_j k_j^\top) \text{Diag}(\alpha_j)\right) k_i\]

The cumulative decay in each dimension \(d\) is:

\[\gamma_i^{(d)} = \prod_{j=1}^{i} \alpha_j^{(d)}\]

With this fine-grained control, KDA can hold on to important information for a long time and forget unneeded information quickly.
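
A minimal single-step sketch of the KDA recurrence in plain Python may help make the channel-wise gate concrete. It implements \(S_t = (I - \beta_t k_t k_t^\top)\,\text{Diag}(\alpha_t)\,S_{t-1} + \beta_t k_t v_t^\top\) directly as a naive recurrence; the function name and the toy gate values are assumptions, and this is not the chunked kernel the paper describes.

```python
def kda_step(S, k, v, alpha, beta):
    """One update S <- (I - beta k k^T) Diag(alpha) S + beta k v^T (S is d_k x d_v)."""
    d_k, d_v = len(k), len(v)
    # Channel-wise decay: row i of the state is scaled by its own gate alpha[i].
    decayed = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # What the decayed memory currently returns for key k, i.e. k^T Diag(alpha) S.
    pred = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # Delta-rule correction that moves the stored value toward v.
    return [[decayed[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
            for i in range(d_k)]


# Two channels with different forgetting rates and no new writes (beta = 0).
S = [[1.0], [1.0]]
alpha = [0.99, 0.5]  # hypothetical gates: a slow-forgetting and a fast-forgetting channel
for _ in range(10):
    S = kda_step(S, k=[0.0, 0.0], v=[0.0], alpha=alpha, beta=0.0)
print(S)  # channel 0 keeps most of its content, channel 1 has almost vanished
```

A scalar gate would force both rows to decay at the same rate; the vector gate is what lets one channel act as long-term memory while another tracks only recent context.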

2.3 ํ•˜๋“œ์›จ์–ด ํšจ์œจ์„ฑ: DPLR์˜ ์ตœ์ ํ™”

์„ ํ˜• ์–ดํ…์…˜์˜ ์ผ๋ฐ˜ํ™”๋Š” Diagonal-Plus-Low-Rank(DPLR) ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค:

\[S_t = (D - a_t b_t^\top) S_{t-1} + k_t v_t^\top\]

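In its general form, however, DPLR is expensive to compute and numerically unstable (because of the reciprocal computations it requires).

KDA's key insight: with \(D = \text{Diag}(\alpha_t)\) and the restriction \(a_t = \beta_t k_t\), \(b_t = k_t \odot \alpha_t\), the transition factorizes, so the KDA update is a constrained special case of DPLR:

\[S_t = \left(\text{Diag}(\alpha_t) - \beta_t k_t (k_t \odot \alpha_t)^\top\right) S_{t-1} + \beta_t k_t v_t^\top = \left(I - \beta_t k_t k_t^\top\right) \text{Diag}(\alpha_t) S_{t-1} + \beta_t k_t v_t^\top\]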

Thanks to this constraint, the second-level chunk computations drop from four to two and three further matrix multiplications can be removed, giving roughly a 2x speedup over DPLR.
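
The algebra behind the restriction can be checked numerically. The sketch below builds the DPLR transition \(\text{Diag}(\alpha_t) - a_t b_t^\top\) with \(a_t = \beta_t k_t\), \(b_t = k_t \odot \alpha_t\) and compares it with the factored form \((I - \beta_t k_t k_t^\top)\,\text{Diag}(\alpha_t)\). The function names and test values are made up for illustration.

```python
def dplr_transition(alpha, a, b):
    """Dense matrix Diag(alpha) - a b^T."""
    n = len(alpha)
    return [[(alpha[i] if i == j else 0.0) - a[i] * b[j] for j in range(n)]
            for i in range(n)]


def kda_transition(alpha, k, beta):
    """Dense matrix (I - beta k k^T) Diag(alpha)."""
    n = len(alpha)
    return [[((1.0 if i == j else 0.0) - beta * k[i] * k[j]) * alpha[j]
             for j in range(n)] for i in range(n)]


alpha = [0.9, 0.5, 0.7]   # arbitrary per-channel gates
k = [0.6, 0.0, 0.8]       # arbitrary unit-norm key
beta = 0.5

a = [beta * ki for ki in k]                 # a_t = beta_t k_t
b = [ki * ai for ki, ai in zip(k, alpha)]   # b_t = k_t (element-wise) alpha_t

A = dplr_transition(alpha, a, b)
B = kda_transition(alpha, k, beta)
max_diff = max(abs(A[i][j] - B[i][j]) for i in range(3) for j in range(3))
print(max_diff)  # zero up to floating-point rounding
```

Because both low-rank factors are tied to the same key, a kernel has fewer independent quantities to track, which is consistent with the savings described above.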

3. Hybrid architecture design

Even KDA on its own inherits the fundamental limitation of linear attention: precise retrieval over long contexts is hard. The authors therefore interleave KDA with the existing MLA (Multi-Head Latent Attention) at a 3:1 ratio.

[KDA] → [KDA] → [KDA] → [MLA] → [KDA] → ...
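
As a small illustration of this interleaving, the following sketch generates the layer order for a given depth. The function name and the rule "every fourth layer is MLA" are assumptions read off the diagram above, not the paper's configuration code.

```python
def hybrid_layer_pattern(num_layers: int, kda_per_mla: int = 3) -> list[str]:
    """Layer types in order: kda_per_mla KDA layers followed by one MLA layer, repeated."""
    period = kda_per_mla + 1
    return ["MLA" if (i + 1) % period == 0 else "KDA" for i in range(num_layers)]


pattern = hybrid_layer_pattern(8)
print(" → ".join(f"[{name}]" for name in pattern))
```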

Why 3:1? According to the paper's ablation study:

No Position Encoding (NoPE) in the MLA layers: an interesting design choice is to apply no positional encoding to MLA. By delegating all positional encoding to KDA:

4. Experimental results: ahead on every measure

4.1 Synthetic tasks: checking basic capabilities

Before moving on to complex benchmarks, the authors check the fundamentals with three synthetic tasks:

Palindrome: reproduce a token sequence in reverse order. This tests exact copying, a known weakness of linear attention.

Multi-Query Associative Recall (MQAR): for several queries, retrieve the associated values from various positions in the context. Performance on this task correlates strongly with language modeling performance.

Stack state tracking: manage 64 independent LIFO stacks and track PUSH/POP operations.

Result: KDA outperformed Gated DeltaNet (GDN) on every task and showed the most gradual degradation as the sequence length grew (256→2,048).
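
To give a feel for what the stack state tracking task asks of a model, here is a hypothetical generator for such examples. The encoding of operations, the vocabulary size and the 50% pop probability are all assumptions for illustration; the paper's exact task format is not described in this post.

```python
import random


def make_stack_tracking_example(num_ops, num_stacks=64, vocab_size=16, seed=0):
    """Return (ops, targets); ops are (name, stack_id, value), targets hold popped values."""
    rng = random.Random(seed)
    stacks = [[] for _ in range(num_stacks)]
    ops, targets = [], []
    for _ in range(num_ops):
        s = rng.randrange(num_stacks)
        if stacks[s] and rng.random() < 0.5:
            # POP: the model must recall the most recent value pushed to stack s.
            ops.append(("POP", s, None))
            targets.append(stacks[s].pop())
        else:
            value = rng.randrange(vocab_size)
            stacks[s].append(value)
            ops.append(("PUSH", s, value))
            targets.append(None)  # nothing to predict on a push
    return ops, targets


ops, targets = make_stack_tracking_example(num_ops=8, num_stacks=2, seed=0)
for op, target in zip(ops, targets):
    print(op, "->", target)
```

Answering every POP correctly requires keeping the full contents of all stacks in the recurrent state, which is what makes the task a useful probe of memory control.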

4.2 Pretraining performance: short context and breadth

Results after training on 1.4T tokens:

| Benchmark | MLA | GDN-H | Kimi Linear |
|---|---|---|---|
| MMLU | 71.6 | 72.2 | 73.8 |
| MMLU-Pro | 47.2 | 47.9 | 51.0 |
| BBH | 71.6 | 70.6 | 72.9 |
| GSM8K | 83.7 | 81.7 | 83.9 |
| CEval (Chinese) | 79.3 | 79.1 | 79.5 |

4.3 Long-context performance: a decisive lead

This is where the hybrid structure shows its value. Results at a 128k-token context:

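| Benchmark | MLA | GDN-H | Kimi Linear (RoPE) | Kimi Linear |
|---|---|---|---|---|
| RULER | 81.3 | 80.5 | 78.8 | 84.3 |
| MRCR | 22.6 | 23.9 | 22.0 | 29.6 |
| RepoQA | 63.0 | 63.0 | 66.5 | 68.5 |
| Average (as reported; not the mean of the three rows shown) | 52.2 | 51.2 | 51.8 | 54.5 |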

Effect of NoPE: Kimi Linear (RoPE) performs worse than Kimi Linear on long contexts. This suggests that handling positional bias through KDA is more flexible and scales better.

4.4 Reinforcement learning: scaling reasoning

An interesting finding shows up during RL training. On the AIME 2025 and MATH500 test sets:

Kimi Linear converged faster and reached higher final performance than MLA. The efficiency of linear attention appears to help most on reasoning tasks that require long-form generation.

4.5 Efficiency: a game changer for real deployments

Decoding speed (batch size 1):

Memory: the KV cache shrinks by up to 75%, which makes larger batch sizes possible. In practice, batched throughput improves by up to 6x at a 1M context.
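
The 75% figure follows almost directly from the 3:1 layer ratio: only the MLA layers keep a per-token cache, while KDA layers keep a fixed-size state. A minimal sketch of that arithmetic, assuming a hypothetical 32-layer model and ignoring the small constant KDA state:

```python
def kv_cache_slots(num_layers: int, context_len: int, kda_per_mla=None) -> int:
    """Count (layer, token) cache slots; KDA layers hold a fixed-size state instead."""
    if kda_per_mla is None:
        attn_layers = num_layers                       # full-attention baseline
    else:
        attn_layers = num_layers // (kda_per_mla + 1)  # only MLA layers cache per token
    return attn_layers * context_len


full = kv_cache_slots(32, 1_000_000)                   # 32 layers is a made-up depth
hybrid = kv_cache_slots(32, 1_000_000, kda_per_mla=3)
reduction = 1 - hybrid / full
print(f"KV cache reduction: {reduction:.0%}")
```

The reduction does not depend on the context length in this model, but the absolute memory saved grows with it, which is why the benefit is most visible at 1M tokens.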

5. Technical deep dive: linear attention as positional encoding

This is one of the paper's more interesting theoretical contributions. RoPE encodes relative position through a cumulative product of rotation matrices:

\[\text{RoPE: } q_t^\top \left(\prod_{j=i+1}^{t} R_j\right) k_i\]

Here \(R_j\) is a block-diagonal rotation matrix.

GDN and KDA have a similar structure, but in place of rotation matrices they use data-dependent, learnable transition matrices:

\[\text{GDN: } q_t^\top \left(\prod_{j=i+1}^{t} (I - \beta_j k_j k_j^\top) \alpha_j\right) k_i\]

This relaxes RoPE's orthogonality constraint and could potentially address the context-length extrapolation problem. RoPE uses fixed frequencies and so tends to overfit to the training length, whereas KDA can adjust dynamically.
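
The contrast can be seen in a two-dimensional toy. In the sketch below, the RoPE score depends only on the offset \(t - i\) because the rotations are fixed, while the gated analogue multiplies whatever gate values the data produced between \(i\) and \(t\). The frequency `0.1`, the vectors, and the helper names are arbitrary choices for illustration.

```python
import math


def rotate(x, theta):
    """Rotate a 2-D vector by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * x[0] - s * x[1], s * x[0] + c * x[1]]


def rope_score(q, k, t, i, freq=0.1):
    """Dot product of q rotated to position t and k rotated to position i."""
    qt, ki = rotate(q, t * freq), rotate(k, i * freq)
    return qt[0] * ki[0] + qt[1] * ki[1]


def decay_weight(alphas, i, t):
    """Product of gates alpha_{i+1} .. alpha_t, the scalar-gate analogue of the rotation product."""
    w = 1.0
    for j in range(i + 1, t + 1):
        w *= alphas[j]
    return w


q, k = [1.0, 2.0], [0.5, -1.0]
# Same offset (t - i = 4) at two different absolute positions gives the same score.
print(rope_score(q, k, t=7, i=3), rope_score(q, k, t=104, i=100))
# Gated weights depend on the gate values actually seen between i and t.
print(decay_weight([0.5] * 10, i=2, t=5), decay_weight([1.0] * 10, i=2, t=5))
```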

6. Limitations and future directions

Current limitations:

Future research directions:

Conclusion

Kimi Linear tackles linear attention's long-standing problem from a new angle. Through fine-grained channel-wise gating, an optimized hardware implementation, and a restrained hybrid design, it outperformed conventional attention in every scenario evaluated.

  1. Consistent advantage: the best results across short-context, long-context, and RL settings
  2. Practical efficiency: up to 6x faster decoding at 1M tokens and up to 75% less KV cache memory
  3. Fair evaluation: systematic comparison under identical training conditions
  4. Open source: KDA kernels, vLLM integration, and pretrained checkpoints released
