Deep Delta Learning

๐Ÿท๏ธ LLM ๋ฒ ์ŠคํŠธ๋…ผ๋ฌธ Headliner

ResNet's identity shortcut connection, the workhorse behind stable training of deep neural networks, has a problem: it is arguably too simple. Adding a residual to the input solved the vanishing-gradient problem, but it limits the network's ability to express complex state transitions. A new paper, Deep Delta Learning (DDL), augments the shortcut connection with a learnable geometric transformation, letting the network dynamically choose among an identity mapping, an orthogonal projection, and a geometric reflection. The key point: a single scalar gate \(\beta(X)\) controls all of these transformations.


Paper Information

Title: Deep Delta Learning
Authors: Y. Zhang, Y. Liu, M. Wang, and Q. Gu
Affiliations: Princeton University; University of California, Los Angeles
Published: arXiv preprint, 2026-01-01
DOI: 10.48550/arXiv.2601.00417
Citation: Y. Zhang, Y. Liu, M. Wang, and Q. Gu, "Deep Delta Learning," arXiv preprint arXiv:2601.00417, 2026.


ResNet์ด ๋“ฑ์žฅํ•œ ์ง€ ๊ฑฐ์˜ 10๋…„์ด ์ง€๋‚ฌ์Šต๋‹ˆ๋‹ค. ๊ทธ๋™์•ˆ identity shortcut connection์€ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต ์•ˆ์ •์„ฑ์„ ์ฑ…์ž„์ง€๋Š” ์‚ฌ์‹ค์ƒ์˜ ํ‘œ์ค€์ด ๋˜์—ˆ์ฃ . ํ•˜์ง€๋งŒ ์ด ๊ตฌ์กฐ๋Š” ๊ทผ๋ณธ์ ์œผ๋กœ "๋ง์…ˆ"๋งŒ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. \(X_{l+1} = X_l + F(X_l)\) ํ˜•ํƒœ์˜ ์—…๋ฐ์ดํŠธ๋Š” ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธด ํ–ˆ์ง€๋งŒ, ๋„คํŠธ์›Œํฌ๊ฐ€ ๋ฐฐ์šธ ์ˆ˜ ์žˆ๋Š” ๋™์—ญํ•™(dynamics)์— ๊ฐ•ํ•œ ์ œ์•ฝ์„ ๊ฒ๋‹ˆ๋‹ค. ํŠนํžˆ ์ง„๋™(oscillation)์ด๋‚˜ ๋Œ€๋ฆฝ์  ํ–‰๋™(oppositional behavior) ๊ฐ™์€ ๋ณต์žกํ•œ ํŒจํ„ด์„ ๋ชจ๋ธ๋งํ•˜๋ ค๋ฉด ์Œ์˜ ๊ณ ์œ ๊ฐ’(negative eigenvalue)์„ ๊ฐ€์ง„ ๋ณ€ํ™˜์ด ํ•„์š”ํ•œ๋ฐ, ์ˆœ์ˆ˜ํ•œ ๋ง์…ˆ ๊ตฌ์กฐ๋กœ๋Š” ์ด๊ฒŒ ๋ถˆ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

ํ”„๋ฆฐ์Šคํ„ด ๋Œ€ํ•™๊ต์™€ UCLA์˜ ์—ฐ๊ตฌํŒ€์ด ์ œ์•ˆํ•œ Deep Delta Learning์€ ์ด ๋ฌธ์ œ๋ฅผ ์ •๋ฉด์œผ๋กœ ๋‹ค๋ฃน๋‹ˆ๋‹ค. ์ €์ž๋“ค์€ Householder ๋ฐ˜์‚ฌ๋ผ๋Š” ์ˆ˜์น˜ ์„ ํ˜•๋Œ€์ˆ˜์˜ ๊ณ ์ „์  ๋„๊ตฌ๋ฅผ ์‹ ๊ฒฝ๋ง ๊ตฌ์กฐ์— ์ ‘๋ชฉ์‹œ์ผœ, identity shortcut์„ ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๊ธฐํ•˜ํ•™์  ๋ณ€ํ™˜์œผ๋กœ ์ผ๋ฐ˜ํ™”ํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•ต์‹ฌ ์•„์ด๋””์–ด๋Š” ๋‹จ์ˆœํ•ฉ๋‹ˆ๋‹ค: shortcut ์—ฐ๊ฒฐ์— rank-1 ๋ณ€ํ™˜์„ ์ ์šฉํ•˜๋˜, ๊ทธ ๊ฐ•๋„๋ฅผ ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ๋™์ ์œผ๋กœ ์กฐ์ ˆํ•˜๋Š” ๊ฒƒ์ด์ฃ . ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋„คํŠธ์›Œํฌ๋Š” ์ธต๋งˆ๋‹ค "๊ทธ๋ƒฅ ๋„˜์–ด๊ฐˆ์ง€", "ํŠน์ • ๋ฐฉํ–ฅ์˜ ์ •๋ณด๋ฅผ ์ง€์šธ์ง€", "์™„์ „ํžˆ ๋ฐ˜์‚ฌ์‹œํ‚ฌ์ง€"๋ฅผ ์Šค์Šค๋กœ ๊ฒฐ์ •ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.


Summary

Deep Delta Learning adds a rank-1 geometric transformation called the **Delta Operator** to the identity shortcut of a standard residual block:

\[ X_{l+1} = A(X_l)X_l + \beta(X_l)k(X_l)v(X_l)^\top \]

where \(A(X)\) is the Delta Operator, \(k(X) \in \mathbb{R}^d\) is a learned reflection direction, \(v(X) \in \mathbb{R}^{d_v}\) is a value vector, and \(\beta(X) \in [0, 2]\) is a scalar gate (all defined in Section 2.2 below).

Geometric interpretation

The operator behaves differently depending on the value of \(\beta\):

| \(\beta\) value | Eigenvalues | Geometric meaning | Determinant |
| --- | --- | --- | --- |
| \(\beta \to 0\) | \((1, 1, \ldots, 1)\) | Identity mapping | \(\det(A) = 1\) |
| \(\beta \to 1\) | \((0, 1, \ldots, 1)\) | Orthogonal projection | \(\det(A) = 0\) |
| \(\beta \to 2\) | \((-1, 1, \ldots, 1)\) | Householder reflection | \(\det(A) = -1\) |

Key properties

  1. Spectral control: a single scalar \(\beta\) fully controls the eigenvalue structure of the transformation
  2. Delta Rule integration: implements the Delta Rule along the depth dimension (an error signal of the form \(v^\top - k^\top X\))
  3. Continuous interpolation: transitions differentiably between identity, projection, and reflection
  4. Synchronized erase/write: the same \(\beta\) simultaneously controls information erasure and injection



Paper Details

1. Introduction: The Limits of Residual Connections

The effectiveness of deep residual networks rests fundamentally on the identity shortcut connection. This mechanism effectively mitigates the vanishing-gradient problem, but at the same time it imposes a strictly additive inductive bias on feature transformations.

The standard ResNet update rule is:

\[ X_{l+1} = X_l + F(X_l) \]

This can be viewed as a forward Euler step (with step size 1) for the ODE \(\dot{X} = F(X)\), a perspective that connects deep networks to dynamical systems. But there is a catch: the strictly additive update locks the learned dynamics into a strong translation bias, because the shortcut path always keeps a fixed Jacobian equal to the identity operator.

This rigidity limits the state transitions the network can express. Recent work (Grazzi et al., 2024) pointed out that modeling patterns such as oscillation or oppositional behavior requires transformations with negative eigenvalues, which a purely additive structure cannot provide.

2. The Delta Residual Block: Mathematical Foundations

To overcome this limitation, the authors propose a principled generalization rooted in geometric linear algebra. Its starting point is the Householder transformation.

2.1 The Householder Matrix

For a nonzero vector \(k \in \mathbb{R}^d\), the Householder matrix \(H_k\) is defined as:

\[ H_k = I - 2\frac{kk^\top}{\|k\|_2^2} \]

Geometrically, \(H_k\) reflects vectors across the hyperplane whose normal vector is \(k\). This matrix is a core tool of numerical linear algebra and has several important properties.

From a spectral viewpoint, \(H_k\) has one eigenvalue \(-1\) (with eigenvector \(k\)) and the remaining \(d-1\) eigenvalues equal to \(1\) (with eigenspace \(k^\perp\)).
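These properties are easy to check numerically. A minimal NumPy sketch (my own, not from the paper) that builds \(H_k\) and verifies its involution and spectrum:

```python
import numpy as np

def householder(k):
    """Householder matrix H_k = I - 2 k k^T / ||k||^2 (reflection across k-perp)."""
    k = np.asarray(k, dtype=float)
    return np.eye(k.size) - 2.0 * np.outer(k, k) / (k @ k)

k = np.array([3.0, 4.0, 0.0])          # any nonzero vector works
H = householder(k)

# H is symmetric, orthogonal, and involutory: H H = I
assert np.allclose(H @ H, np.eye(3))
# Spectrum: one eigenvalue -1 (on span(k)), the rest are 1 (on k-perp)
assert np.allclose(np.sort(np.linalg.eigvalsh(H)), [-1.0, 1.0, 1.0])
# k itself is flipped; vectors orthogonal to k are left fixed
assert np.allclose(H @ k, -k)
```

The determinant is \(-1\), consistent with a single reflected direction.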

2.2 Definition of the Delta Operator

The core idea of DDL is to replace the constant factor 2 in the Householder matrix with a learnable, data-dependent scalar gate \(\beta(X)\).

The hidden state is represented as a matrix \(X \in \mathbb{R}^{d \times d_v}\), where \(d\) is the feature dimension and \(d_v\) is the number of value channels. The output of a DDL block is:

\[ X_{l+1} = A(X_l)X_l + \beta(X_l)k(X_l)v(X_l)^\top \]

Here \(v \in \mathbb{R}^{d_v}\) is the residual value vector produced by a branch \(F: \mathbb{R}^{d \times d_v} \to \mathbb{R}^{d_v}\), and the outer product \(kv^\top\) forms the additive update. Crucially, the gate \(\beta(X)\) is applied to this constructive term as well, which couples the erasure and write operations.

\(A(X)\) is the Delta Operator, acting spatially on the feature dimension \(d\):

\[ A(X) = I - \beta(X)\frac{k(X)k(X)^\top}{k(X)^\top k(X) + \epsilon} \]

The reflection direction \(k(X) \in \mathbb{R}^d\), the value vector \(v(X) \in \mathbb{R}^{d_v}\), and the reflection strength \(\beta(X) \in \mathbb{R}\) are each learned by separate lightweight network branches. The constant \(\epsilon > 0\) guarantees numerical stability.

For the theoretical analysis, \(k\) is assumed to be strictly normalized so that \(k^\top k = 1\). Under this condition (with \(\epsilon \to 0\)) the operator simplifies to:

\[ A(X) = I - \beta(X)k(X)k(X)^\top \]

Since \(X\) is a matrix, the operator \(A(X)\) broadcasts over the value dimension \(d_v\), applying the geometric transformation to every column of the hidden state simultaneously.

Under the same unit-norm assumption, substituting \(A(X) = I - \beta(X)k(X)k(X)^\top\) into the update yields an equivalent additive rank-1 Delta form:

\[ X_{l+1} = X_l + \beta(X_l)k(X_l)(v(X_l)^\top - k(X_l)^\top X_l) \]

This form makes explicit that the same scalar \(\beta\) modulates both the erase term \(k^\top X\) and the write term \(v^\top\).
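The algebraic equivalence of the operator form and the additive Delta form can be confirmed in a few lines. A NumPy sketch (my own, with arbitrary shapes and a unit-norm key as in the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
d, dv = 8, 3
X = rng.normal(size=(d, dv))                     # hidden state
k = rng.normal(size=d); k /= np.linalg.norm(k)   # unit-norm reflection direction
v = rng.normal(size=dv)                          # residual value vector
beta = 1.3                                       # gate value in [0, 2]

# Operator form: X' = A X + beta k v^T, with A = I - beta k k^T
A = np.eye(d) - beta * np.outer(k, k)
X_op = A @ X + beta * np.outer(k, v)

# Equivalent additive rank-1 Delta form: X' = X + beta k (v^T - k^T X)
X_delta = X + beta * np.outer(k, v - k @ X)

assert np.allclose(X_op, X_delta)
```

Expanding \(AX + \beta kv^\top = X - \beta kk^\top X + \beta kv^\top\) gives exactly the Delta form, which is what the check confirms.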

๊ฒŒ์ดํŒ… ํ•จ์ˆ˜ \(\beta(X)\)๋Š” \([0, 2]\) ๋ฒ”์œ„์— ์žˆ๋„๋ก ํŒŒ๋ผ๋ฏธํ„ฐํ™”๋ฉ๋‹ˆ๋‹ค:

\[ \beta(X) = 2 \cdot \sigma(\text{Linear}(G(X))) \]

where \(G(\cdot)\) is a pooling, convolution, or flattening operation. This particular range was chosen to provide the rich geometric interpretation analyzed in the next section.
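A minimal sketch of this parameterization, assuming \(G\) is global mean pooling over the value dimension and a single linear head with (hypothetical) weights `w`, `b`; the paper leaves the exact choice of \(G\) open:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def beta_gate(X, w, b):
    """beta(X) = 2 * sigmoid(Linear(G(X))), with G chosen here as mean pooling."""
    g = X.mean(axis=1)               # G(X): pool the value dimension -> R^d
    return 2.0 * sigmoid(g @ w + b)  # sigmoid maps to (0, 1), so beta is in (0, 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
w = rng.normal(size=8)
beta = beta_gate(X, w, 0.0)
assert 0.0 < beta < 2.0              # the gate always lands strictly inside [0, 2]
```

With zero weights the gate sits exactly at the midpoint \(\beta = 1\) (the projection regime), which is one natural initialization; the paper's zero-initialization discussion in Section 3.3 instead targets \(\beta \to 0\).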

3. Analysis: The Spectrum of the Delta Operator

The expressive power of the Delta-Res block comes from the spectral properties of the operator \(A(X)\), which are controlled by the learned gate \(\beta(X)\).

3.1 Spectral Decomposition

Theorem 3.1 (Spectrum of the Delta Operator): Let \(A = I - \beta kk^\top\), where \(k \in \mathbb{R}^d\) is a unit vector (\(k^\top k = 1\)) and \(\beta \in \mathbb{R}\) is a scalar. The spectrum \(\sigma(A)\) of \(A\) is:

\[ \sigma(A) = \{\underbrace{1, 1, \ldots, 1}_{d-1 \text{ times}},\; 1-\beta\} \]

The eigenvector corresponding to the eigenvalue \(\lambda = 1 - \beta\) is \(k\). The eigenspace for the eigenvalue \(\lambda = 1\) is the orthogonal complement of \(k\), \(k^\perp = \{u \in \mathbb{R}^d \mid k^\top u = 0\}\).

Proof sketch:

  1. For a vector \(u\) orthogonal to \(k\) (\(k^\top u = 0\)): \(Au = (I - \beta kk^\top)u = u - \beta k(0) = u\). Hence every vector in the \((d-1)\)-dimensional subspace \(k^\perp\) has eigenvalue 1.

  2. For the vector \(k\) itself: \(Ak = (I - \beta kk^\top)k = k - \beta k(1) = (1-\beta)k\). Hence \(k\) has eigenvalue \(1-\beta\).

This theorem gives the gate \(\beta(X)\) a clear and powerful interpretation: by learning a single scalar, the network can dynamically control the geometry of the residual transformation on all \(d_v\) columns of the state matrix at once.
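Theorem 3.1 can be verified directly by diagonalizing a random instance of \(A\). A short NumPy check (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
d, beta = 6, 1.7                                 # beta > 1 gives a negative eigenvalue
k = rng.normal(size=d); k /= np.linalg.norm(k)   # unit key
A = np.eye(d) - beta * np.outer(k, k)

eigvals = np.sort(np.linalg.eigvalsh(A))         # A is symmetric, so eigvalsh applies
# One eigenvalue equals 1 - beta (eigenvector k); the other d-1 equal 1
assert np.allclose(eigvals[0], 1 - beta)
assert np.allclose(eigvals[1:], 1.0)
assert np.allclose(A @ k, (1 - beta) * k)
```

With \(\beta = 1.7\) the distinguished eigenvalue is \(-0.7\), i.e. a partial reflection along \(k\).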

Extension to matrix states: The spectral statements above are spatial; they describe the linear map \(u \mapsto Au\) on \(\mathbb{R}^d\). Since the hidden state is a matrix \(X \in \mathbb{R}^{d \times d_v}\) and the shortcut acts by left multiplication, each of the \(d_v\) columns is transformed independently by the same \(A\). In the vectorized view, the induced linear operator is \(I_{d_v} \otimes A\). The spectrum of the extended map is therefore the eigenvalues of \(A\) repeated \(d_v\) times, and its determinant is \(\det(A)^{d_v}\).

Orthogonality condition: Because \(A\) is symmetric, its singular values coincide with the absolute values of its eigenvalues. In particular, \(A\) is orthogonal only when \(|1-\beta| = 1\), i.e. when \(\beta \in \{0, 2\}\). For \(\beta \in (0, 2)\), \(A\) performs an anisotropic contraction along \(k\) (with a sign flip when \(\beta > 1\)).

Corollary 3.2 (Spatial determinant): The determinant of the Delta Operator \(A(X)\) acting on the spatial features \(\mathbb{R}^d\) is:

\[ \det(A(X)) = \prod_{i=1}^d \lambda_i = 1^{d-1} \cdot (1-\beta(X)) = 1 - \beta(X) \]

shortcut์ด \(d_v\) value ์—ด์— broadcast๋˜๋ฏ€๋กœ ์ „์ฒด ํ–‰๋ ฌ ์ƒํƒœ ๊ณต๊ฐ„ \(\mathbb{R}^{d \times d_v}\)์—์„œ ์œ ๋„๋œ ํ–‰๋ ฌ์‹์€ \(\det(A(X))^{d_v} = (1-\beta(X))^{d_v}\)์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ \(\beta(X)\)๋Š” ๊ณต๊ฐ„ ๋ฐฉํ–ฅ \(k(X)\)๋ฅผ ๋”ฐ๋ผ ๋ถ€ํ˜ธ ์žˆ๋Š” ๋ถ€ํ”ผ ๋ณ€ํ™”๋ฅผ ์ œ์–ดํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ \(\beta(X) > 1\)์€ \(k\)๋ฅผ ๋”ฐ๋ผ ์Œ์˜ ๊ณต๊ฐ„ ๊ณ ์œ ๊ฐ’(๋ฐ˜์‚ฌ)์„ ๋„์ž…ํ•˜๋ฉฐ, \(d_v\)๊ฐ€ ํ™€์ˆ˜์ผ ๋•Œ๋งŒ ํ™•์žฅ๋œ ์ƒํƒœ ๊ณต๊ฐ„์˜ ์ „์ฒด ๋ฐฉํ–ฅ์ด ๋’ค์ง‘ํž™๋‹ˆ๋‹ค.

3.2 Unifying Geometric Operations

Theorem 3.1 shows that the range \([0, 2]\) of \(\beta(X)\) lets the operator interpolate between three fundamental linear transformations.

Identity mapping (\(\beta(X) \to 0\)): As \(\beta \to 0\), the eigenvalue \(1-\beta \to 1\). All eigenvalues of \(A(X)\) become 1, so \(A(X) \to I\). Since \(\beta\) also modulates the injection term \(\beta kv^\top\), the entire update vanishes and \(X_{l+1} \approx X_l\). This identity behavior is crucial for preserving signal propagation in very deep networks.

Orthogonal projection (\(\beta(X) \to 1\)): As \(\beta \to 1\), the eigenvalue \(1-\beta \to 0\). The operator \(A(X)\) becomes \(I - kk^\top\), the orthogonal projection (of rank \(d-1\)) onto the hyperplane \(k^\perp\). In each column of the input state \(X\), the component parallel to \(k\) is explicitly removed ("forgotten") before the residual is added. The operator becomes singular, with \(\det(A) \to 0\). At the level of the full block update (Eq. 2.5 in the paper), this regime can be read as replace-along-\(k\): the shortcut removes the \(k\)-component, and the rank-1 write injects a new \(k\)-component specified by \(v^\top\).

Full reflection (\(\beta(X) \to 2\)): As \(\beta \to 2\), the eigenvalue \(1-\beta \to -1\). The operator \(A(X)\) becomes \(I - 2kk^\top\), the standard Householder matrix, which perfectly reflects each column of \(X\) across \(k^\perp\). Together with the identity case (\(\beta = 0\)), this is the only setting in \([0, 2]\) where the shortcut operator \(A\) is orthogonal and spatially volume-preserving, with \(\det(A) \to -1\); the negative spatial determinant signals an orientation change of the basis (a reflection). The full block additionally applies the synchronized rank-1 write term, combining a reflection of the incoming state with a \(k\)-aligned write.
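The three regimes are easiest to see with \(k\) aligned to a coordinate axis, so the \(k\)-component is just the first entry. A minimal demonstration (my own sketch):

```python
import numpy as np

d = 4
k = np.zeros(d); k[0] = 1.0                      # unit key along the first axis
x = np.array([2.0, 1.0, -1.0, 0.5])              # one column of the state

def delta_op(beta):
    """Delta Operator A = I - beta k k^T for the fixed key above."""
    return np.eye(d) - beta * np.outer(k, k)

# beta -> 0: identity, the state passes through unchanged
assert np.allclose(delta_op(0.0) @ x, x)
# beta -> 1: orthogonal projection onto k-perp, the k-component is erased
assert np.allclose(delta_op(1.0) @ x, [0.0, 1.0, -1.0, 0.5])
# beta -> 2: Householder reflection, the k-component flips sign
assert np.allclose(delta_op(2.0) @ x, [-2.0, 1.0, -1.0, 0.5])
```

Only the first coordinate (the \(k\)-component) is touched; the \(k^\perp\) part is invariant in all three regimes.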

3.3 Special Case: Gated Residual Learning

An important property of DDL is its behavior at the extremes of the gating scalar. When the gate vanishes (\(\beta(X) \to 0\)), the Delta Operator converges to the identity matrix (\(A(X) \to I\)) and the constructive term disappears. The update rule consequently simplifies to:

\[ X_{l+1} = X_l \]

This recovers the identity mapping, allowing the layer to be skipped entirely. The behavior matches the zero-initialization strategies often needed to train very deep networks.

Conversely, when \(\beta \approx 1\) the layer behaves as a gated rank-1 matrix ResNet, with \(\beta\) acting as a learned step size that controls the magnitude of the update. This shows that DDL generalizes residual learning by introducing a multiplicative geometric modulation synchronously coupled to the value injection.

3.4 The Diagonal Feature-Matrix Case

To better understand the mixing behavior of the Delta Operator, consider the special case where the input state \(X \in \mathbb{R}^{d \times d}\) is a square diagonal matrix \(X = \text{diag}(\lambda_1, \ldots, \lambda_d)\), representing features that are perfectly decoupled across the value dimension. Applying \(A\) gives:

\[ (AX)_{ij} = (X - \beta kk^\top X)_{ij} = \lambda_i\delta_{ij} - \beta\lambda_j k_i k_j \]

Concretely, the off-diagonal entries (\(i \neq j\)) become \(-\beta\lambda_j k_i k_j\), while the diagonal entries (\(i = j\)) are scaled to \(\lambda_i(1 - \beta k_i^2)\). Output feature \(i\) now depends on the magnitude of input feature \(j\), scaled by the geometric coherence \(k_i k_j\).

This result makes an important capability of the Delta block explicit: it induces controlled feature coupling. Even when the incoming features are independent, a nonzero \(\beta\) forces an interaction between the \(i\)-th and \(j\)-th modes, proportional to their projections onto the reflection vector \(k\).

With \(\beta \to 1\) (projection), the shortcut removes the component along \(k\) in each column, mapping the state into \(k^\perp\) before the write term resets the new \(k\)-component specified by \(v^\top\). With \(\beta \to 0\), the diagonal structure is preserved.
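The entry-wise formula for the diagonal case can be checked against a direct matrix product. A short NumPy verification (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
d, beta = 5, 0.8
lam = rng.normal(size=d)
X = np.diag(lam)                                 # perfectly decoupled features
k = rng.normal(size=d); k /= np.linalg.norm(k)

AX = (np.eye(d) - beta * np.outer(k, k)) @ X     # apply the Delta Operator

i, j = 1, 3                                      # any pair of distinct modes
# Off-diagonal coupling: (AX)_{ij} = -beta * lambda_j * k_i * k_j
assert np.allclose(AX[i, j], -beta * lam[j] * k[i] * k[j])
# Diagonal scaling: (AX)_{ii} = lambda_i * (1 - beta * k_i^2)
assert np.allclose(AX[i, i], lam[i] * (1 - beta * k[i] ** 2))
```

The nonzero off-diagonal entry is exactly the coupling term the text describes: it vanishes when \(\beta = 0\) or when \(k\) is one-hot.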

3.5 Vector Hidden-State Dynamics

Although DDL operates on matrix-valued states \(X \in \mathbb{R}^{d \times d_v}\), it naturally contains standard vector-based deep learning as particular limits. Two distinct regimes can be identified.

Scalar-value limit (\(d_v = 1\)): When the value dimension shrinks to 1, the hidden state degenerates to a standard feature vector \(x \in \mathbb{R}^d\), and the value update \(v\) becomes a scalar \(v \in \mathbb{R}\). The Delta update rule simplifies to:

\[ x_{l+1} = x_l + \beta_l \underbrace{(v_l - k_l^\top x_l)}_{\gamma_l} k_l \]

Here the geometric transformation collapses into a dynamic scalar gating mechanism. The term \(\gamma_l\) acts as a data-dependent coefficient that ties the update magnitude to the mismatch between the proposed write value \(v_l\) and the current projection \(k_l^\top x_l\).
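In this scalar-value limit, setting \(\beta = 1\) makes the replace-along-\(k\) behavior exact: after the update, the \(k\)-component of the state equals the write value \(v\), while \(k^\perp\) is untouched. A quick check (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 6
x = rng.normal(size=d)                           # vector hidden state (d_v = 1)
k = rng.normal(size=d); k /= np.linalg.norm(k)
v, beta = 0.75, 1.0                              # scalar write value, full gate

gamma = v - k @ x                                # data-dependent error coefficient
x_next = x + beta * gamma * k                    # the simplified Delta update

# At beta = 1 the new k-component equals the proposed write value exactly
assert np.allclose(k @ x_next, v)
# Components orthogonal to k are untouched
u = rng.normal(size=d); u -= (k @ u) * k         # a test vector in k-perp
assert np.allclose(u @ x_next, u @ x)
```

This is precisely the fast-weight "overwrite one association" semantics that Section 4 connects to the Delta Rule.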

Independent-feature limit: Alternatively, the diagonal case of Section 3.4 can be viewed as a representation of vector states embedded along the matrix diagonal. As the diagonal analysis shows, the Delta Operator introduces feature coupling through the \(\beta k_i k_j\) terms. To recover the behavior of a standard element-wise vector update (where features are not spatially mixed), the reflection vector \(k\) must align with a canonical basis vector (i.e., be one-hot). In that regime the Delta Operator acts as an element-wise gating function, strictly preserving the independence of the feature dimensions.

4. Optimization and Connections to Delta Architectures

The name "Deep Delta Learning" reflects a structural kinship with the Delta Rule, the basic update mechanism that has recently gained popularity in efficient sequence modeling (e.g., DeltaNet; Schlag et al., 2021; Yang et al., 2024).

4.1 The Delta Rule for Residual Learning

The standard residual connection \(X_{l+1} = X_l + F(X_l)\) imposes a strictly additive inductive bias: information produced by \(F\) simply accumulates. This can lead to residual accumulation, where noisy or interfering features persist across layers because the network has no explicit mechanism for selectively filtering the hidden state.

DDL addresses this by integrating the Delta Rule structure into the depth dimension. Expanding the Delta residual update using the rank-1 residual definition:

\[ X_{l+1} = X_l + \beta_l k_l \left(\underbrace{v_l^\top}_{\text{Write}} - \underbrace{k_l^\top X_l}_{\text{Erase}}\right) \]

This formulation exactly recovers the Delta Rule update used in fast associative memories and linear attention. The term \(k_l^\top X_l\) represents the current projection of the state onto the reflection vector (the "error" or "stale memory"), and the term \((v_l^\top - k_l^\top X_l)\) acts as a correction signal.

Since \(X_l \in \mathbb{R}^{d \times d_v}\) is a matrix, the term \(k_l^\top X_l\) produces a row vector in \(\mathbb{R}^{1 \times d_v}\) representing the projection of every value column onto \(k_l\). The update strictly aligns both the erase (destructive) and inject (constructive) operations along the geometric direction defined by the projector \(k_l\), modulated by the step size \(\beta_l\).

When \(\beta(X_l) \approx 1\), the subtracted term acts as an orthogonal projection, effectively erasing the component of the incoming state \(X_l\) parallel to \(k(X_l)\) (forgetting). When \(\beta(X_l) \approx 2\), the term subtracts twice the projection, inducing a sign flip (a reflection). This gives the network a flexible mechanism for selectively clearing or reorienting specific feature subspaces on a per-layer basis, preventing the accumulation of interference.

4.2 Relation to DeltaNet and Householder Products

The paper shares theoretical ties with the DeltaNet architecture (Schlag et al., 2021), which replaces the additive accumulation of linear Transformers with a Delta Rule memory update. The authors show that **DDL is a depth-wise isomorphism of the DeltaNet recurrence**.

In DeltaNet, the hidden state (memory) \(S_t\) evolves over time \(t\). To align notation with the depth-wise formulation, the DeltaNet update is presented in left-multiplication semantics with memory state \(S_t \in \mathbb{R}^{d_k \times d_v}\):

\[ S_t = (I - \beta_t k_t k_t^\top)S_{t-1} + \beta_t k_t v_t^\top \]

where the operator acts on the key dimension \(d_k\), analogous to DDL's feature dimension \(d\). Compare this to the Deep Delta Layer update acting over depth \(l\):

\[ X_{l+1} = (I - \beta_l k_l k_l^\top)X_l + \beta_l k_l v_l^\top \]

where \(v_l\) is the vector output of the value branch.

This reveals a direct structural correspondence: the time step \(t\) maps to the layer index \(l\), the memory state \(S_t\) to the hidden state \(X_l\), and the key dimension \(d_k\) to the feature dimension \(d\).

DDL can therefore be interpreted as applying the Delta Rule to layer-wise feature evolution, allowing the network to forget or rewrite features from shallow layers as it propagates into deeper ones.

5. Related Work

This work builds on several core research threads in deep learning.

Gated and invertible architectures: Highway Networks (Srivastava et al., 2015) introduced data-dependent gating into residual networks, but their gates interpolate between the identity path and the function path rather than modifying the transformation itself. Invertible Residual Networks (i-ResNets) (Behrmann et al., 2019) guarantee invertibility by constraining the Lipschitz constant of \(F\), which is useful for applications such as normalizing flows. DDL's Delta shortcut operator is invertible whenever \(1 - \beta \neq 0\) (in the \(\epsilon \to 0\) analysis) and becomes an orthogonal involution (a Householder reflection) at \(\beta = 2\). DDL does not enforce invertibility globally; instead the network learns when a near-invertible transition is beneficial and when an intentionally singular (projective) transition is useful for controlled forgetting.

์ง๊ต ๋ฐ ์œ ๋‹ˆํ„ฐ๋ฆฌ ๋„คํŠธ์›Œํฌ: ์ƒ๋‹นํ•œ ์—ฐ๊ตฌ๊ฐ€ ๊ธฐ์šธ๊ธฐ ์•ˆ์ •์„ฑ์„ ๊ฐœ์„ ํ•˜๊ณ  ๊ธฐํ•˜ํ•™์  ๊ตฌ์กฐ๋ฅผ ๋ณด์กดํ•˜๊ธฐ ์œ„ํ•ด ๋„คํŠธ์›Œํฌ ๊ฐ€์ค‘์น˜๋ฅผ ์ง๊ต ๋˜๋Š” ์œ ๋‹ˆํ„ฐ๋ฆฌ๋กœ ์ œํ•œํ•˜๋Š” ๋ฐ ์ง‘์ค‘ํ•ด์™”์Šต๋‹ˆ๋‹ค (Arjovsky et al., 2016; Jing et al., 2017). Householder ๋ฐ˜์‚ฌ๋Š” ์ง๊ต ํ–‰๋ ฌ์„ ํŒŒ๋ผ๋ฏธํ„ฐํ™”ํ•˜๋Š” ๊ณ ์ „์  ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์€ ์ง๊ต์„ฑ์„ ์—„๊ฒฉํ•œ ์ œ์•ฝ์œผ๋กœ ๊ฐ•์ œํ•ฉ๋‹ˆ๋‹ค. ๋Œ€์กฐ์ ์œผ๋กœ, ์šฐ๋ฆฌ์˜ Delta Residual Network๋Š” ๊ฒŒ์ดํŠธ \(\beta(x)\)๋ฅผ ํ†ตํ•ด identity์™€ ์ง๊ต์„ฑ์—์„œ ๋ฒ—์–ด๋‚˜๋Š” ๊ฒƒ์„ ํ•™์Šตํ•˜๋ฉฐ, ์ˆœ์ˆ˜ ํˆฌ์˜์ด๋‚˜ ๋ฐ˜์‚ฌ๋กœ ์™„ํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ์†Œํ”„ํŠธํ•œ ์ ์‘์  ์ œ์•ฝ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

Neural ODEs: Neural ODEs (Chen et al., 2018) model the continuous evolution of features; a standard ResNet is a discretization of the simple ODE \(\dot{X} = F(X)\). The proposed architecture changes the underlying dynamics to \(\dot{X} = \beta(X)k(X)(v(X)^\top - k(X)^\top X)\), introducing a state-dependent projection term applied to the matrix state. This admits a much richer family of learnable dynamical systems, capable of contractive or oscillatory behavior across multiple value dimensions.

6. Conclusion: Expanded Expressivity and Open Questions

DDL presents a new architecture built on adaptive geometric residual connections. The analysis establishes that its core component, the Delta Operator, unifies identity mapping, projection, and reflection into a single continuously differentiable module. This unification is controlled by a simple learned scalar gate that dynamically shapes the spectrum of the layer-to-layer transition operator.

By enabling the network to learn transformations with negative eigenvalues in a data-dependent way, DDL delivers a principled and substantial increase in expressivity while retaining the fundamental benefits of the residual learning paradigm.

What the paper does not deliver is equally clear: empirical validation on real large-scale benchmarks is absent. Whether DDL actually outperforms standard ResNets or Transformers on standard vision tasks like ImageNet and COCO, or NLP benchmarks like GLUE and SQuAD, remains unproven, and there is no guarantee that theoretical elegance translates into practical gains.

An analysis of the extra parameter overhead and computational complexity is also needed. How do the separate branches estimating \(k(X)\), \(\beta(X)\), and \(v(X)\) affect overall model size and inference speed? The rank-1 update looks cheap, but each layer requires additional forward-pass computation, so real wall-clock time may increase.

The most interesting question is when DDL is actually needed. The authors argue that negative eigenvalues are required to model complex dynamics such as oscillation or oppositional behavior, but how often do such dynamics arise in real vision or language tasks? For most practical problems, the monotone feature transformations of a standard ResNet may well suffice. DDL's real value is likely to emerge most clearly in specific domains such as physics simulation, time-series forecasting, or reinforcement learning.

In the end, Deep Delta Learning is a meaningful contribution: it precisely identifies a theoretical limitation of residual connections and offers a mathematically elegant remedy. Grafting the classical Householder transformation onto deep learning is creative, and the connection to the Delta Rule shows a theoretical unification with recent sequence-modeling work such as DeltaNet. For practical deployment, however, it still needs large-scale experimental validation, efficiency analysis, and clear guidelines for when the added complexity is justified.