ํ’€์Šคํƒ ์›น๐ŸŒ ๊ฐœ๋ฐœ์ž ์ง€๋ง์ƒ ๐Ÿง‘๐Ÿฝโ€๐Ÿ’ป
โž• ์ธ๊ณต์ง€๋Šฅ ๊ด€์‹ฌ ๐Ÿค–



AI Math Basics

  1. Vector
  2. Matrix
  3. Gradient Algorithm (gradient descent)
  4. Mathematical understanding of AI training
  5. Probability theory
  6. Statistics
  7. Bayesian statistics
  8. CNN
  9. RNN

AIMath

Vector

What is a vector?

  • A list or array whose elements are numbers

    [Meaning of a vector in code]

  • $X$ = column vector, $X^T$ = row vector

  • Mathematically, a vector is a point in space: its position relative to the origin

    [Meaning of a vector in mathematics]

  • The number of components (rows or columns) a vector has is called its dimension.

๋ฒกํ„ฐ์˜ ์„ฑ์งˆ

1. ๋ฒกํ„ฐ์— ์–‘์ˆ˜๋ฅผ ๊ณฑํ•ด์ฃผ๋ฉด ๋ฐฉํ–ฅ์€ ๊ทธ๋Œ€๋กœ, ๊ธธ์ด๋งŒ ๋ณ€ํ•œ๋‹ค.

  • ์ด ๋•Œ ๊ณฑํ•ด์ฃผ๋Š” ์ˆซ์ž๋ฅผ ์Šค์นผ๋ผ๊ณฑ($\alpha$)์ด๋ผ๊ณ  ํ‘œํ˜„ํ•œ๋‹ค.

  • ์Šค์นผ๋ผ๊ณฑ์ด ์Œ์ˆ˜์ด๋ฉด ๋ฐฉํ–ฅ์ด ์ •๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์ด ๋œ๋‹ค.

  • ๋ฒกํ„ฐ์˜ ๊ธธ์ด๋Š” 1๋ณด๋‹ค ํฌ๋ฉด ๊ธธ์ด๊ฐ€ ์ฆ๊ฐ€, 1๋ณด๋‹ค ์ž‘์œผ๋ฉด ๊ธธ์ด๊ฐ€ ๊ฐ์†Œํ•œ๋‹ค.
    2. ๊ฐ™์€ ๋ชจ์–‘(๊ฐ™์€ ํ–‰๊ณผ ์—ด)์„ ๊ฐ€์ง€๋ฉด ๋ง์…ˆ, ๋บ„์…ˆ, ๊ณฑ์…ˆ, ๋‚˜๋ˆ—์…ˆ์ด ๊ฐ€๋Šฅํ•˜๋‹ค.
    - ์ด๋•Œ์˜ ๊ณฑ์…ˆ์„ ์„ฑ๋ถ„๊ณฑ(Hadamard product)๋ผ๊ณ  ํ•œ๋‹ค.
    - numpy array์—๋„ ์ ์šฉ๋œ๋‹ค.
    3. ๋ฒกํ„ฐ์™€ ๋ฒกํ„ฐ์˜ ๋ง์…ˆ๊ณผ ๋บ„์…ˆ์€ ๋‹ค๋ฅธ ๋ฒกํ„ฐ๋กœ๋ถ€ํ„ฐ ์ƒ๋Œ€์  ์œ„์น˜ ์ด๋™์„ ํ‘œํ˜„ํ•จ.
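The properties above can be checked directly with numpy; a minimal sketch (the example vectors are chosen arbitrarily):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

alpha = 2.0
scaled = alpha * x      # scalar multiplication: same direction, doubled length
flipped = -1.0 * x      # negative scalar: direction reversed

added = x + y           # elementwise addition (same shape required)
hadamard = x * y        # Hadamard (elementwise) product
```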

Vector norms and their geometric properties

The norm of a vector

  • The norm of a vector is its distance from the origin.

    • A norm can be computed for a vector of any dimension.
    • The L1 norm is the sum of the absolute values of the components.
      • $(x,y)$ is at distance $|x|+|y|$
    • The L2 norm is the Euclidean distance, computed via the Pythagorean theorem.
      • $(x,y)$ is at distance $\sqrt{|x|^2 +|y|^2}$
      • Available as np.linalg.norm
    • The symbol $\parallel \cdot \parallel$ denotes a norm.
    \[\parallel x\parallel_1 = \sum_{i=1}^d|x_i|, \qquad \parallel x\parallel_2 = \sqrt{\sum_{i=1}^d|x_i|^2}\]

๋ฐฑํ„ฐ ๋…ธ๋ฆ„์˜ ์ฝ”๋“œ ๊ตฌํ˜„

def l1_norm(x):
    x_norm = np.abs(x)
    x_norm = np.sum(x_norm)
    return x_norm

def l2_norm(x):
    x_norm = x*x
    x_norm = np.sum(x_norm)
    x_norm = np.sqrt(x_norm)
    return x_norm

๋…ธ๋ฆ„์˜ ํ™œ์šฉ๊ณผ ์„ฑ์งˆ

  • ๋…ธ๋ฆ„์˜ ์ข…๋ฅ˜๋ฅผ ๋ฌด์—‡์œผ๋กœ ์ ์šฉํ•˜๋Š๋ƒ์— ๋”ฐ๋ผ ๊ธฐํ•˜ํ•™์  ์„ฑ์งˆ์ด ๋‹ฌ๋ผ์ง„๋‹ค.

    • L1 ๋…ธ๋ฆ„์€ ๋งˆ๋ฆ„๋ชจ ๋ชจ์–‘์˜ ์›์„ ๊ทธ๋ฆฌ๋ฉฐ, L2 ๋…ธ๋ฆ„์€ ๊ธฐ์กด์˜ ์› ๋ชจ์–‘์˜ ์›์„ ๊ฐ€์ง„๋‹ค.
    • ๋…ธ๋ฆ„์˜ ์ข…๋ฅ˜์— ๋”ฐ๋ผ ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ์˜ ํ™œ์šฉ๋„ ๋‹ฌ๋ผ์ง„๋‹ค.
  • ๋…ธ๋ฆ„์„ ์ด์šฉํ•ด ๋‘ ๋ฒกํ„ฐ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๋บ„์…ˆ์„ ํ†ตํ•˜์—ฌ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค.

    • $\parallel y - x\parallel = \parallel x - y\parallel$ ์ธ์ ์„ ์ด์šฉํ•˜๊ฒŒ ๋œ๋‹ค.
    • ์ฆ‰ x-y ๋˜๋Š” y-x๋ฅผ ํ•œ ์›์ ์—์„œ์˜ ์ขŒํ‘œ๋ฅผ ์ด์šฉํ•ด L2 norm์„ ๊ตฌํ•˜๋ฉด ๋‘๋ฒกํ„ฐ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๊ฐ€ ๋‚˜์˜จ๋‹ค.
  • L2 ๋…ธ๋ฆ„ ํ•œ์ •์œผ๋กœ ๋‚ด์ ์„ ์ด์šฉํ•ด ์ด๋ ‡๊ฒŒ ๊ตฌํ•œ ๋ฒกํ„ฐ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ด์šฉํ•ด ๊ฐ๋„ ๋˜ํ•œ ๊ณ„์‚ฐ ๊ฐ€๋Šฅํ•˜๋‹ค.

    • <x, y> ๋Š” ๋‚ด์ (inner product)์„ ์˜๋ฏธํ•˜๋ฉฐ ์„ฑ๋ถ„ ๊ณฑ๋“ค์˜ ํ•ฉ์„ ์˜๋ฏธํ•œ๋‹ค.
      • ์˜ˆ๋ฅผ ๋“ค์–ด x = (0, 1), y = (0, 2)์˜ ๋‚ด์ ์€ 0 * 0 + 1 * 2 = 2 ์ด๋‹ค.
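A short sketch of both facts, using the same example vectors as above and numpy's built-in norm:

```python
import numpy as np

x = np.array([0.0, 1.0])
y = np.array([0.0, 2.0])

# distance between two vectors = norm of their difference (||x - y|| == ||y - x||)
dist = np.linalg.norm(x - y)

# inner product = sum of elementwise products: 0*0 + 1*2
inner = np.inner(x, y)
```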

๋‚ด์ ์„ ์ด์šฉํ•œ ๊ฐ๋„ ๊ณ„์‚ฐ

def angle(x,y):
    # np.inner(x, y)๊ฐ€ ๋‚ด์ ์„ ๊ตฌํ•˜๋Š” numpy ํ•จ์ˆ˜
    v = np.inner(x, y) / (l2_norm(x) * l2_norm(y))
    theta = np.arccos(v)
    return theta

๋‚ด์ ์˜ ํ•ด์„

  • ๋‚ด์ ์€ ์ •์‚ฌ์˜(orthogonal projection)๋œ ๋ฒกํ„ฐ์˜ ๊ธธ์ด์™€ ๊ด€๋ จ ์žˆ๋‹ค.

    • Proj(x)๋Š” ๋ฒกํ„ฐ y๋กœ ์ •์‚ฌ์˜๋œ ๋ฒกํ„ฐ x์˜ ๊ทธ๋ฆผ์ž๋ฅผ ์˜๋ฏธํ•œ๋‹ค.
    • Proj(x)์˜ ๊ธธ์ด๋Š” ์ฝ”์‚ฌ์ธ๋ฒ•์น™์— ์˜ํ•ด
\[\parallel x \parallel cos\theta\]
  • ๊ฐ€ ๋œ๋‹ค.

  • ์ด๋•Œ ๋‚ด์ ์€ ์ •์‚ฌ์˜์˜ ๊ธธ์ด๋ฅผ ๋ฒกํ„ฐ y์˜ ๊ธธ์ด $\parallel y \parallel$๋งŒํผ ์กฐ์ •ํ•œ(๊ณฑํ•œ) ๊ฐ’์ด๋‹ค.

  • ๋‚ด์ ์„ ์ด์šฉํ•ด ์œ ์‚ฌ๋„(similarity)๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

Matrix

ํ–‰๋ ฌ์ด๋ž€?

  • ๋ฒกํ„ฐ๋ฅผ ์›์†Œ๋กœ ๊ฐ€์ง€๋Š” 2์ฐจ์› ๋ฐฐ์—ด

ํ–‰๋ ฌ์˜ ์ˆ˜์‹ ํ‘œํ˜„

\[X = \begin{bmatrix} 1 & -2 & 3 \\ 7 & 5 & 0 \\ -2 & -1 & 2\end{bmatrix}\]

ํ–‰๋ ฌ์˜ ์ฝ”๋“œ ํ‘œํ˜„

x = np.array([[1, -2, 3], [7, 5, 0], [-2, -1, 2]]) # numpy์—์„  ํ–‰(row)์ด ๊ธฐ๋ณธ๋‹จ์œ„

\(\boldsymbol{X} = \begin{bmatrix}symbol{x_{1}} \\symbol{x_{2}} \\symbol{\vdots}\\symbol{x_{n}} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1m}\\ x_{21} & x_{22} & \dots & y_{2m}\\ \vdots & \vdots & & \vdots\\ x_{n1} & x_{n2} & \dots & x_{nm}\end{bmatrix} \begin{aligned}\boldsymbol{x_{1}}\\symbol{x_{2}}\\symbol{x_{7}}\end{aligned}\)

  • n x m ํ–‰๋ ฌ์˜ ํ‘œํ˜„

  • ํ–‰๋ ฌ์€ ํ–‰(row)๊ณผ ์—ด(column)์ด๋ผ๋Š” ์ธ๋ฑ์Šค(index)๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

  • ํ–‰๋ ฌ์˜ ํŠน์ • ํ–‰์ด๋‚˜ ์—ด์„ ๊ณ ์ •ํ•˜๋ฉด ํ–‰ ๋ฒกํ„ฐ ๋˜๋Š” ์—ด ๋ฒกํ„ฐ๋ผ ๋ถ€๋ฅธ๋‹ค.

  • ์ „์น˜ ํ–‰๋ ฌ(transpose matrix) $X^T$๋Š” ํ–‰๊ณผ ์—ด์˜ ์ธ๋ฑ์Šค๊ฐ€ ๋ฐ”๋€ ํ–‰๋ ฌ์„ ์˜๋ฏธํ•จ.

  • ๋ฒกํ„ฐ ๋˜ํ•œ ๋™์ผํ•˜๊ฒŒ ํ–‰๊ณผ ์—ด์ด ๋ฐ”๋€ ์ „์น˜ ๋ฒกํ„ฐ๊ฐ€ ์กด์žฌํ•œ๋‹ค.

ํ–‰๋ ฌ์˜ ์ดํ•ด

1. ์ฒซ๋ฒˆ์งธ ์˜๋ฏธ

  • ๋ฒกํ„ฐ๊ฐ€ ๊ณต๊ฐ„์˜ ํ•œ์ ์„ ์˜๋ฏธํ•œ๋‹ค๋ฉด ํ–‰๋ ฌ์€ ๊ณต๊ฐ„์—์„œ ์—ฌ๋Ÿฌ ์ ๋“ค์˜ ์ง‘ํ•ฉ์„ ์˜๋ฏธํ•จ.
  • ํ–‰๋ ฌ์˜ ํ–‰๋ฒกํ„ฐ $x_i$๋Š” i๋ฒˆ์งธ ๋ฐ์ดํ„ฐ๋ฅผ ์˜๋ฏธํ•จ.
  • ํ–‰๋ ฌ $x_{ij}$๋Š” i๋ฒˆ์งธ ๋ฐ์ดํ„ฐ์˜ j ๋ฒˆ์งธ ๋ณ€์ˆ˜๊ฐ’์„ ์˜๋ฏธํ•จ.

  • ๋ฒกํ„ฐ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๊ฐ™์€ ๋ชจ์–‘์„ ๊ฐ€์ง€๋ฉด ๊ฐ™์€ ์ธ๋ฑ์Šค ์œ„์น˜๋ผ๋ฆฌ ๋ง์…ˆ, ๋บ„์…ˆ. ์„ฑ๋ถ„๊ณฑ์„ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค.
  • ๋ฒกํ„ฐ์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์Šค์นผ๋ผ๊ณฑ($\alpha$) ๋˜ํ•œ ๊ฐ€๋Šฅํ•˜๋‹ค.

2. ๋‘๋ฒˆ์งธ ์˜๋ฏธ

  • ํ–‰๋ ฌ์€ ๋ฒกํ„ฐ ๊ณต๊ฐ„์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์—ฐ์‚ฐ์ž(operator)๋ฅผ ์˜๋ฏธ.
  • ํ–‰๋ ฌ ๊ณฑ์„ ํ†ตํ•ด ๋ฒกํ„ฐ๋ฅผ ๋‹ค๋ฅธ ์ฐจ์›์˜ ๊ณต๊ฐ„์„ ๋ณด๋‚ผ ์ˆ˜ ์žˆ์Œ
    • ํ–‰๋ ฌ์„ X๋ฒกํ„ฐ์™€ ๊ณฑํ•˜๋ฉด m์ฐจ์›์—์„œ n์ฐจ์› ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜๋˜์–ด n์ฐจ์›์˜ z๋ฒกํ„ฐ๊ฐ€ ๋จ.
    • ์ด๋ฅผ ํ†ตํ•ด ๋งตํ•‘, ๋””์ฝ”๋”ฉ ๋“ฑ์ด ๊ฐ€๋Šฅํ•จ.
    • ์ด๋ฅผ ์„ ํ˜• ๋ณ€ํ™˜(linear transform)์ด๋ผ๊ณ ๋„ ํ•จ.
    • ๋”ฅ๋Ÿฌ๋‹์€ ์„ ํ˜• ๋ณ€ํ™˜๊ณผ ๋น„์„ ํ˜• ๋ณ€ํ™˜์˜ ํ•ฉ์„ฑ์œผ๋กœ ์ด๋ฃจ์–ด์ง
  • ํŒจํ„ด ์ถ”์ถœ, ๋ฐ์ดํ„ฐ ์••์ถ• ๋“ฑ์—๋„ ์‚ฌ์šฉํ•จ.

ํ–‰๋ ฌ์˜ ๊ณฑ์…ˆ(matrix multiplication)๊ณผ ๋‚ด์ 

1. ํ–‰๋ ฌ์˜ ๊ณฑ์…ˆ(matrix multiplication)

  • ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ i๋ฒˆ์งธ ํ–‰๋ฒกํ„ฐ์™€ j ๋ฒˆ์งธ ์—ด๋ฒกํ„ฐ ์‚ฌ์ด์˜ ๋‚ด์ ์„ ์„ฑ๋ถ„์œผ๋กœ ๊ฐ€์ง€๋Š” ํ–‰๋ ฌ์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

  • ๊ณ ๋กœ ํ–‰๊ณผ ์—ด์˜ ๊ฐฏ์ˆ˜๊ฐ€ ๊ฐ™์•„์•ผ ๊ฐ€๋Šฅํ•˜๋‹ค.

ํ–‰๋ ฌ์˜ ๊ณฑ์…ˆ code ๊ตฌํ˜„ ์‹œ

X = np.array([[1, -2, 3],
            	[7, 5, 0],
            	[-2, -1, 3]])
Y = np.array([[0, 1],
            	[1, -1],
            	[-2, 1]])
print(x @ Y) # numpy์—์„  @ ์—ฐ์‚ฐ์œผ๋กœ ํ–‰๋ ฌ ๊ณฑ์…ˆ ๊ณ„์‚ฐ
# array([[-8, 6],
#        	[5, 2],
#         	[-5, 1]])

2. ํ–‰๋ ฌ์˜ ๋‚ด์ 

  • np.inner๋Š” i๋ฒˆ์งธ ํ–‰๋ฒกํ„ฐ์™€ j๋ฒˆ์งธ ํ–‰๋ฒกํ„ฐ ์‚ฌ์ด์˜ ๋‚ด์ ์„ ์„ฑ๋ถ„์œผ๋กœ ๊ฐ€์ง€๋Š” ํ–‰๋ ฌ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  • ์ˆ˜ํ•™์˜ ํ–‰๋ ฌ ๋‚ด์  $tr(XY^T)$๊ณผ ๋‹ค๋ฆ„

ํ–‰๋ ฌ์˜ ๋‚ด์  code ๊ตฌํ˜„ ์‹œ

X = np.array([[1, -2, 3],
            	[7, 5, 0],
            	[-2, -1, 3]])
Y = np.array([[0, 1, -1],
            	[1, -1, 0]])
print(np.inner(x, Y)) # numpy์—์„  np.inner() ํ•จ์ˆ˜๋กœ ํ–‰๋ ฌ ๋‚ด์  ๊ณ„์‚ฐ
# array([[-5, 3],
#        	[5, 2],
#         	[-3, -1]])

์—ญํ–‰๋ ฌ์˜ ์ดํ•ด

  • ํ–‰๋ ฌ A์˜ ์—ฐ์‚ฐ์„ ๊ฐ™์€ ์—ฐ์‚ฐ์œผ๋กœ ๊ฑฐ๊พธ๋กœ ๋Œ๋ฆฌ๋Š” ํ–‰๋ ฌ์„ ์—ญํ–‰๋ ฌ(Inverse matrix)๋ผ๊ณ  ๋ถ€๋ฅด๋ฉฐ, $A^{-1}$๋ผ ํ‘œ๊ธฐํ•œ๋‹ค.
  • ํ–‰๊ณผ ์—ด ์ˆซ์ž๊ฐ€ ๊ฐ™๊ณ  ํ–‰๋ ฌ์‹(determinant)๊ฐ€ 0์ด ์•„๋‹Œ ๊ฒฝ์šฐ์—๋งŒ ๊ณ„์‚ฐ ๊ฐ€๋Šฅ.

์—ญํ–‰๋ ฌ๊ณผ์˜ ํ–‰๋ ฌ๊ณฑ์˜ ๊ฒฐ๊ณผ

\[AA^{-1} = A^{-1}A = I(ํ•ญ๋“ฑํ–‰๋ ฌ)\]
  • ํ•ญ๋“ฑํ–‰๋ ฌ(Identity Matrix)์€ ๊ณฑํ•˜๊ฒŒ ๋  ์‹œ ์ž๊ธฐ ์ž์‹ ์ด ๋‚˜์˜ค๋Š” ํ–‰๋ ฌ์ด๋‹ค.

์—ญํ–‰๋ ฌ์˜ ์ฝ”๋“œ ๊ตฌํ˜„

Y = np.array([[1, -2, 3], [7, 5, 0], [-2,-1,2]])
print(Y @ np.linalg.inv(Y)) # np.linalg.inv(Y) Y ํ–‰๋ ฌ์˜ ์—ญํ–‰๋ ฌ์ด ๋ฆฌํ„ด
# array([[1, 0, 0], [0, 1, 0], [0, 0, 1]]) # ์ •ํ™•ํžˆ๋Š” float์œผ๋กœ ๋น„์Šทํ•œ ๊ฐ’์ด ๋‚˜์˜จ๋‹ค.

  • ์—ญํ–‰๋ ฌ์„ ๊ณ„์‚ฐํ•  ์ˆ˜ ์—†๋Š” ์กฐ๊ฑด์ด๋ผ๋ฉด ์œ ์‚ฌ์—ญํ–‰๋ ฌ(pseudo-inverse) ๋˜๋Š” ๋ฌด์–ด-ํŽœ๋กœ์ฆˆ(Moore-Penrose) ์—ญํ–‰๋ ฌ $A^ +$์„ ์ด์šฉํ•œ๋‹ค.

์œ ์‚ฌ์—ญํ–‰๋ ฌ์˜ ์„ฑ์งˆ

\[n \geq m ์ธ\ ๊ฒฝ์šฐ, \ A^+ = (A^TA)^{-1}A^T,\ A^+A = I\\ n \leq m ์ธ\ ๊ฒฝ์šฐ, \ A^+ = A^T(A^TA)^{-1},\ AA^+ = I\\\]
  • ์ˆœ์„œ๋ฅผ ๋ฐ”๊พธ๋ฉด ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์ง€๋ฏ€๋กœ ์œ ์‚ฌ์—ญํ–‰๋ ฌ์˜ ์ˆœ์„œ์— ์ฃผ์˜!

์œ ์‚ฌ ์—ญํ–‰๋ ฌ์˜ ์ฝ”๋“œ ๊ตฌํ˜„

Y = np.aray([[0, 1], [1,-1], [-2,1]])
print(Y @ np.linalg.pinv(Y)) # np.linalg.pinv(Y) Y ํ–‰๋ ฌ์˜ ์œ ์‚ฌ์—ญํ–‰๋ ฌ์ด ๋ฆฌํ„ด
# array([[1, 0], [0, 1]]) # ์ •ํ™•ํžˆ๋Š” float์œผ๋กœ ๋น„์Šทํ•œ ๊ฐ’์ด ๋‚˜์˜จ๋‹ค.

ํ–‰๋ ฌ์˜ ์‘์šฉ

1. ์—ฐ๋ฆฝ๋ฐฉ์ •์‹ ํ’€๊ธฐ

\[a_{11}x_1 + a_{12}x_2 + \dots + a_{1m}x_{m} = b_{1}\\ a_{12}x_1 + a_{22}x_2 + \dots + a_{2m}x_{m} = b_{2}\\ \vdots\\ a_{n1}x_1 + a_{n2}x_2 + \dots + a_{nm}x_{m} = b_{n}\\ n \leq m \ ์ธ\ ๊ฒฝ์šฐ:\ ์‹์ด\ ๋ณ€์ˆ˜\ ๊ฐœ์ˆ˜๋ณด๋‹ค\ ์ž‘๊ฑฐ๋‚˜\ ๊ฐ™์•„์•ผ\ ํ•จ\]

sol) n์ด m๋ณด๋‹ค ์ž‘๊ฑฐ๋‚˜ ๊ฐ™์œผ๋ฉด ๋ฌด์–ด-ํŽœ๋กœ์ฆˆ ์—ญํ–‰๋ ฌ์„ ์ด์šฉํ•ด ํ•ด๋ฅผ ํ•˜๋‚˜ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.
\(Ax = B \\ \Rightarrow x = A^+b\\ =A^T(AA^T)^{-1}b\)
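A minimal sketch of this with an arbitrary underdetermined system (1 equation, 2 unknowns):

```python
import numpy as np

# n = 1 equation, m = 2 unknowns (n <= m): x1 + 2*x2 = 5
A = np.array([[1.0, 2.0]])
b = np.array([5.0])

x = np.linalg.pinv(A) @ b   # x = A^+ b, one (minimum-norm) solution
```

Any x satisfying the system would do; the pseudo-inverse picks the solution of smallest L2 norm.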

2. Linear regression

  • np.linalg.pinv lets us find the linear regression fit that interprets the data with a linear model.

  • It gives the same result as sklearn's LinearRegression.

# regression with Scikit-Learn
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X, y)
y_test = model.predict(x_test)

# with the Moore-Penrose inverse; the intercept term must be added manually
X_ = np.array([np.append(x, [1]) for x in X]) # append intercept term
beta = np.linalg.pinv(X_) @ y
y_test = np.append(x_test, [1]) @ beta
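A self-contained sketch of the pseudo-inverse route, on synthetic noiseless data (the coefficients and data here are made up for illustration), verifying that the fit recovers the generating parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 samples, 3 features
true_beta = np.array([2.0, -1.0, 0.5])
true_intercept = 3.0
y = X @ true_beta + true_intercept       # noiseless toy targets

X_ = np.hstack([X, np.ones((len(X), 1))])  # append intercept column
beta = np.linalg.pinv(X_) @ y              # least-squares fit via pseudo-inverse

x_test = np.array([1.0, 1.0, 1.0])
y_pred = np.append(x_test, 1.0) @ beta
```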

Gradient Algorithm (gradient descent)

What is differentiation?

  • A tool for measuring how a function's value changes as its variable moves.
  • The limit of the rate of change; the technique used most in optimization.
\[f'(x) = \lim_{h\rightarrow 0}\frac{f(x+h) - f(x)}{h}\]

Differentiation in code

import sympy as sym # symbolic math: lets us work with functions via symbols
from sympy.abc import x

sym.diff(sym.poly(x**2 + 2*x + 3), x)
# Poly(2*x + 2, x, domain='ZZ')

Using derivatives

  • The derivative equals the slope of the tangent line to the function's graph, which tells us whether the function is increasing or decreasing.

  • By adding or subtracting this derivative we can optimize, even in high-dimensional spaces.

    • Adding the derivative is called gradient ascent and is used to find local maxima.
    • Subtracting the derivative is called gradient descent and is used to find local minima.
  • At a local minimum or maximum the derivative is 0, so the optimization terminates.

Implementing gradient descent

Pseudocode

Input: gradient, init, lr, eps, Output: var
# gradient: function computing the derivative
# init: starting point, lr: learning rate, eps: termination condition

var = init
grad = gradient(var)
while(abs(grad) > eps): # the derivative almost never becomes exactly 0, so stop once it falls below a tiny value (eps)
    var = var - lr * grad # lr: a larger learning rate takes bigger update steps; too large or too small both cause trouble
    grad = gradient(var) # update the derivative
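The pseudocode above translates directly into Python; a minimal sketch minimizing an arbitrary example function f(x) = (x - 3)²:

```python
def gradient_descent(gradient, init, lr=0.1, eps=1e-8):
    """Step against the derivative until |grad| falls below eps."""
    var = init
    grad = gradient(var)
    while abs(grad) > eps:
        var = var - lr * grad
        grad = gradient(var)
    return var

# minimize f(x) = (x - 3)^2, whose derivative is f'(x) = 2(x - 3)
minimum = gradient_descent(lambda v: 2 * (v - 3), init=0.0)
```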

๋‹ค๋ณ€์ˆ˜ ํ•จ์ˆ˜์ผ ๊ฒฝ์šฐ?

  • ๋ฒกํ„ฐ๊ฐ€ ์ž…๋ ฅ์ธ ๋‹ค๋ณ€์ˆ˜ ํ•จ์ˆ˜์˜ ๊ฒฝ์šฐ ํŽธ๋ฏธ๋ถ„(partial differentiation)์„ ์‚ฌ์šฉ.
\[\partial _{x_{i}}f\left( x\right) =\lim _{h\rightarrow 0}\dfrac{f\left( x+he_{i}\right) -f\left( x\right) }{h}\\ e_{i}\ :\ i๋ฒˆ์งธ\ ๊ฐ’๋งŒ \ 1์ด๊ณ \ ๋‚˜๋จธ์ง€๋Š”\ 0์ธ\ ๋‹จ์œ„๋ฒกํ„ฐ\]

์ฝ”๋”ฉ์„ ์ด์šฉํ•œ ํŽธ๋ฏธ๋ถ„

import sympy as sym
from sympy.abc import x, y

sym.diff(sym.poly(x**2 + 2*x*y + 3) + sym.cos(x + 2*y), x)
# 2*x + 2*y - sin(x+2*y)

  • ๊ฐ ๋ณ€์ˆ˜ ๋ณ„๋กœ ํŽธ๋ฏธ๋ถ„์„ ๊ณ„์‚ฐํ•œ ๊ทธ๋ ˆ์ด์–ธํŠธ ๋ฒกํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ฒฝ์‚ฌํ•˜๊ฐ•/๊ฒฝ์‚ฌ์ƒ์Šน๋ฒ•์— ์‚ฌ์šฉ ๊ฐ€๋Šฅ.
\[\partial _{x_{i}}f\left( x\right) =\lim _{h\rightarrow 0}\dfrac{f\left( x+he_{i}\right) -f\left( x\right) }{h}\\ e_{i}\ :\ i๋ฒˆ์งธ\ ๊ฐ’๋งŒ \ 1์ด๊ณ \ ๋‚˜๋จธ์ง€๋Š”\ 0์ธ\ ๋‹จ์œ„๋ฒกํ„ฐ \\ \nabla f = (\partial_{x1}f,\partial_{x2}f,\dots,\partial_{xd}f)\]

Gradient descent with the gradient vector

  • $\nabla f(x,y)$ is the direction in which the function increases fastest at the point (x, y).
  • Therefore $-\nabla f(x,y)$ is the direction in which the function decreases fastest, and that is the direction gradient descent moves in.

Pseudocode for gradient descent with the gradient vector

Input: gradient, init, lr, eps, Output: var
# gradient: function computing the gradient vector
# init: starting point, lr: learning rate, eps: termination condition

var = init
grad = gradient(var)
while(norm(grad) > eps): # for a vector, use the norm instead of the absolute value in the termination condition
    var = var - lr * grad # lr: a larger learning rate takes bigger update steps; too large or too small both cause trouble
    grad = gradient(var) # update the gradient

๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์˜ ์„ ํ˜• ํšŒ๊ท€ ์ ์šฉ (apply to linear regression)

  • ๋ฌด์–ด-ํŽœ๋กœ์ฆˆ ํ–‰๋ ฌ์„ ์ด์šฉํ•ด์„œ ์„ ํ˜•ํšŒ๊ท€๊ฐ€ ๊ฐ€๋Šฅํ–ˆ์ง€๋งŒ, ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์„ ์ด์šฉํ•˜๋Š”๊ฒŒ ์ผ๋ฐ˜์ ์ด๋‹ค.

  • **์„ ํ˜•ํšŒ๊ท€์˜ ๋ชฉ์ ์‹์€ โˆฅy-Xฮฒโˆฅ~2~ ๋˜๋Š” โˆฅy-Xฮฒโˆฅ~2~^2^ ** ์ด๋ฉฐ, ์ด๋ฅผ ์ตœ์†Œํ•˜ํ•˜๋Š” ฮฒ๋ฅผ ์ฐพ๋Š”๊ฒŒ ๋ชฉ์ ์ด๋ฏ€๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ทธ๋ ˆ๋””์–ธํŠธ ๋ฒกํ„ฐ๋ฅผ ๊ตฌํ•ด์•ผํ•œ๋‹ค.(loss : RMSE ๊ธฐ์ค€)
    \(\nabla_\beta\left\| y - X\beta\right\|_2 = (\partial_{\beta_1}\left\| y - X\beta\right\|_2,\dots,\partial_{\beta_d}\left\| y - X\beta\right\|_2)\\ \partial_{\beta_k}\left\| y - X\beta\right\|_2 = \partial_{\beta_k}\left\{\frac{1}n\sum_{i=1}^n{\left(y_i - \sum_{j=1}^{d}X_{ij}\beta_j\right)^2}\right\}^\frac{1}2 = -\frac{X^T_{\cdot k}(y - X\beta)}{n\left\| y - X\beta\right\|_2}\\ X^T_{\cdot k} = ํ–‰๋ ฌ\ X์˜\ k๋ฒˆ์งธ\ ์—ด(column)\ ๋ฒกํ„ฐ๋ฅผ\ ์ „์น˜์‹œํ‚จ\ ๊ฒƒ\\ ์ฆ‰,\\ \nabla_\beta\left\| y - X\beta\right\|_2 = (\partial_{\beta_1}\left\| y - X\beta\right\|_2,\dots,\partial_{\beta_d}\left\| y - X\beta\right\|_2) = \left( -\frac{X^T_{\cdot 1}(y - X\beta)}{n\left\| y - X\beta\right\|_2},\dots, -\frac{X^T_{\cdot d}(y - X\beta)}{n\left\| y - X\beta\right\|_2}\right)\)

  • ์ด์— ๋ชฉ์ ์‹์„ ์ตœ์†Œํ™”ํ•˜๋Š” ฮฒ๋ฅผ ๊ตฌํ•˜๋Š” ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

\[\beta^{(t+1)}\leftarrow\beta^{(t)} - \lambda\nabla_\beta\left\| y - X\beta\right\|_2 =\beta^{(t)} + \frac{\lambda}n\frac{X^T(y - X\beta^{(t)})}{\left\| y - X\beta\right\|_2}\\\]
  • ๋ชฉ์ ์‹์œผ๋กœ โˆฅy-Xฮฒโˆฅ~2~ ๋Œ€์‹  โˆฅy-Xฮฒโˆฅ~2~^2^์„ ์ตœ์†Œํ™”ํ•˜๋ฉด ์‹์ด ์ข€๋” ๊ฐ„๋‹จํ•ด์ง„๋‹ค.
\[\nabla_\beta\left\| y - X\beta\right\|_{2}^2 = (\partial_{\beta_1}\left\| y - X\beta\right\|_{2}^2,\dots,\partial_{\beta_d}\left\| y - X\beta\right\|_{2}^2) = -\frac{2}nX^T(y - X\beta) \\ \beta^{(t+1)}\leftarrow\beta^{(t)} + \frac{2\lambda}{n}X^T(y - X\beta^{(t)})\]

Pseudocode for gradient-descent linear regression

Input: X, y, lr, T, Output: beta
# norm: function computing the L2 norm
# lr: learning rate, T: number of iterations (both hyperparameters)
for t in range(T):  # fixed number of iterations; alternatively stop once the gradient norm falls below a threshold, as before
    error = y - X @ beta
    grad = - transpose(X) @ error
    beta = beta - lr * grad # update beta
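A runnable sketch of this loop on synthetic noiseless data (coefficients chosen arbitrarily), using the squared-norm gradient derived above:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_beta = np.array([1.5, -0.7])
y = X @ true_beta                     # noiseless toy data

beta = np.zeros(2)
lr, T = 0.1, 1000
n = len(y)
for t in range(T):
    error = y - X @ beta
    grad = -(2 / n) * X.T @ error     # gradient of (1/n) * ||y - X beta||_2^2
    beta = beta - lr * grad
```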

ํ™•๋ฅ ์  ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• (stochastic gradient descent)

  • ์ด๋ก ์ ์œผ๋กœ ์ ์ ˆํ•œ ํ•™์Šต๋ฅ ๊ณผ ํ•™์ŠตํšŸ์ˆ˜๋ฅผ ์„ ํƒ์‹œ, ์ˆ˜๋ ด์ด ๋ณด์žฅ๋˜์–ด์žˆ๋‹ค.

  • ํ•˜์ง€๋งŒ ๋น„์„ ํ˜•ํšŒ๊ท€์˜ ๊ฒฝ์šฐ ๋ชฉ์ ์‹์ด ๋ณผ๋กํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ(non-convex) ์ˆ˜๋ ด์ด ํ•ญ์ƒ ๋ณด์žฅ๋˜์ง€ ์•Š์Œ

    • ๋”ฅ๋Ÿฌ๋‹์˜ ๋ชฉ์ ์‹์€ ๋Œ€๋ถ€๋ถ„ ๋ณผ๋กํ•จ์ˆ˜๊ฐ€ ์•„๋‹ˆ๋‹ค, ์ฆ‰ ๋Œ€๋ถ€๋ถ„ ๋ณด์žฅํ•˜์ง€ ์•Š์Œ
    • ์•„๋ž˜์™€ ๊ฐ™์€ ๊ฒฝ์šฐ ํŠน์ • ๋ถ€๋ถ„์— ์ˆ˜๋ ดํ–ˆ์ง€๋งŒ ํ•จ์ˆ˜์˜ ์ตœ์†Œ์ง€์ ์ด ์•„๋‹ˆ๋‹ค.

โ€‹

  • ํ™•๋ฅ ์  ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•(SGD)์€ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์—…๋ฐ์ดํŠธ ํ•˜๋Š” ๋Œ€์‹  ๋ฐ์ดํ„ฐ ํ•œ๊ฐœ ๋˜๋Š” ์ผ๋ถ€ํ™œ์šฉํ•˜์—ฌ ์—…๋ฐ์ดํŠธํ•ฉ๋‹ˆ๋‹ค.

    ๊ฐ€์ •

\[\theta^{(t+1)}\leftarrow\theta^{(t)}-\widehat{\nabla _{a}L}(\theta^{(t)})\]
  • ๋งŒ๋Šฅ์€ ์•„๋‹ˆ์ง€๋งŒ ๋”ฅ๋Ÿฌ๋‹์—์„œ mini-batch ๋ฐฉ์‹์„ ์ถ”๊ฐ€ํ•˜์—ฌ ์ผ๋ฐ˜์ ์œผ๋กœ ์‚ฌ์šฉํ•จ.
  • SGD๋Š” ๋ฐ์ดํ„ฐ์˜ ์ผ๋ถ€๋ฅผ ๊ฐ€์ง€๊ณ  ํŒจ๋Ÿฌ๋ฏธํ„ฐ๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์—ฐ์‚ฐ์ž์›์„ ์ข€ ๋” ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉ
    • ์—ฐ์‚ฐ๋Ÿ‰์ด b/n์œผ๋กœ ๊ฐ์†Œ
\[\beta^{(t+1)}\leftarrow\beta^{(t)} + \frac{2\lambda}{b}X^T_{(B)}(y_{(b)} - X_{(b)}\beta^{(t)})\]
  • ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์€ ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๋ชฉ์ ์‹ ๊ทธ๋ ˆ๋””์–ธํŠธ ๋ฒกํ„ฐ๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐ˜๋ฉด, SGD๋Š” ๋ฏธ์น˜๋ฐฐ์น˜(๋ฐ์ดํ„ฐ์˜ ์ผ๋ถ€)๋ฅผ ๊ฐ€์ง€๊ณ  ๊ทธ๋ ˆ๋””์–ธํŠธ ๋ฒกํ„ฐ๋ฅผ ๊ณ„์‚ฐ

  • ๊ทธ๋Ÿฌ๋ฏ€๋กœ ๋งค๋ฒˆ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๊ณผ ๋น„์Šทํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ•จ์ˆ˜ ๊ณก์„ ์˜ ๋ชจ์–‘์ด ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋งˆ๋‹ค ๋ฐ”๋€Œ๊ฒŒ ๋œ๋‹ค.

  • ํ•˜์ง€๋งŒ ์ตœ์ข…์ ์ธ ๋ฐฉํ–ฅ์„ฑ์€ ์œ ์‚ฌํ•˜๊ฒŒ ์ด๋™ํ•˜๊ฒŒ ๋œ๋‹ค.

  • ์ฆ‰ ๋ณผ๋ก๋ชจ์–‘์ด ์•„๋‹ˆ์–ด๋„ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉ๊ฐ€๋Šฅ ํ•˜๋‹ค.

๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•(์ขŒ) vs ํ™•๋ฅ ์  ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•(์šฐ)

  • ๋‹ค๋งŒ mini-batch ์‚ฌ์ด์ฆˆ๋ฅผ ๋„ˆ๋ฌด ์ž‘๊ฒŒ ์žก์œผ๋ฉด ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์— ๋น„ํ•ด ๋„ˆ๋ฌด ๋Š๋ ค์ง„๋‹ค.

  • ์ตœ๊ทผ์—๋Š” ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์œผ๋กœ ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•ด ํ•™์Šต์‹œํ‚ค๋ฉด, ๋ฐฉ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ์…‹์— ์˜ํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ดˆ๊ณผ๋˜๋ฏ€๋กœ ๋ฏธ๋‹ˆ๋ฐฐ์น˜๋กœ ๋‚˜๋ˆ„์–ด์„œ ํ•™์Šตํ•˜๋Š” SGD๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

์ธ๊ณต์ง€๋Šฅ ํ•™์Šต์˜ ์ˆ˜ํ•™์  ์ดํ•ด

์‹ ๊ฒฝ๋ง์˜ ํ•ด์„

  • ๋น„์„ ํ˜•, ๋ณต์žกํ•œ ๋ชจ๋ธ์ด ๋Œ€๋ถ€๋ถ„์ธ ์‹ ๊ฒฝ๋ง์€ ์‚ฌ์‹ค ์„ ํ˜• ๋ชจ๋ธ๊ณผ ๋น„์„ ํ˜• ํ•จ์ˆ˜๋“ค์˜ ๊ฒฐํ•ฉ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค.
\[\begin{aligned}\begin{aligned}\begin{bmatrix} O_1 \\ O_2 \\ \vdots \\ O_n \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \begin{bmatrix} w_{11} & w_{12} & \dots & w_{1p}\\ w_{21} & w_{22} & \dots & w_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ w_{d1} & w_{d2} & \dots & w_{dp} \end{bmatrix} + \begin{bmatrix} \vdots&\vdots&\ddots&\vdots\\b_1&b_2&\dots&b_p \\ b_1&b_2&\dots&b_p \\ b_1&b_2&\dots&b_p\\\vdots&\vdots&\ddots&\vdots\end{bmatrix}\\ O\ \ \ \ \ \ \ \ \ \ \ \ \ \ X\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ W \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ b\ \ \ \ \ \ \ \ \ \ \ \ \ \ \end{aligned}\\(n \times p)\ \ \ \ (n \times d)\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (d\times p)\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (n\times p) \ \ \ \ \ \ \ \ \ \end{aligned}\]

[math 0. ์‹ ๊ฒฝ๋ง ์ˆ˜์‹ ๋ถ„ํ•ด]

  • b๋Š” ๊ฐ ํ•œ ํ–‰์˜ ๋ชจ๋“  ๊ฐ’์ด ๊ฐ™๋‹ค.
์œผ์—์—... ์ด๊ฒŒ ๋จธ์•ผ ์–ธ์  ๊ฐ„ ๊ณ ์น˜๊ธฐ
graph BT
	x1((x&#38;#60;sub&#38;#62;1&#38;#60;/sub&#38;#62;)) &#38; x2((x&#38;#60;sub&#38;#62;2&#38;#60;/sub&#38;#62;)) &#38; x((x...)) &#38; xd((x&#38;#60;sub&#38;#62;d&#38;#60;/sub&#38;#62;))--&#38;#62;o1((O&#38;#60;sub&#38;#62;1&#38;#60;/sub&#38;#62;)) &#38; os((o...)) &#38; op((O&#38;#60;sub&#38;#62;p&#38;#60;/sub&#38;#62;))

[chart 0. ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์˜ ์ฐจํŠธํ™”]

  • d๊ฐœ์˜ ๋ณ€์ˆ˜๋กœ p๊ฐœ์˜ ์„ ํ˜•๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด p๊ฐœ์˜ ์ž ์žฌ๋ณ€์ˆ˜๋ฅผ ์„ค๋ช…ํ•˜๋Š” ๋ชจ๋ธ์˜ ๋„์‹ํ™”์ด๋‹ค.

  • ํ™”์‚ดํ‘œ๋Š” w~ij~๋“ค์˜ ๊ณฑ์„ ์˜๋ฏธํ•œ๋‹ค.

The softmax operation

\[softmax(o) = softmax(Wx+b) = \left( \dfrac{\exp(o_1)}{\sum_{k=1}^p\exp(o_k) }, \dots, \dfrac{\exp(o_p)}{\sum_{k=1}^p\exp(o_k) }\right)\]

[math 1. softmax]

  • The softmax function converts the model's output so that it can be interpreted as probabilities.
  • For classification problems, we combine a linear model with the softmax function to make predictions.
def softmax(vec):
    numerator = np.exp(vec - np.max(vec, axis=-1, keepdims=True)) # each output value
    # subtracting np.max(vec) guards against overflow when the vector's entries are large
    denominator = np.sum(numerator, axis=-1, keepdims=True) # sum of all the output values
    val = numerator / denominator
    return val

[code 1. softmax in code]

  • The softmax function thus turns a vector into a probability vector (a vector whose components sum to 1).
def one_hot(val, dim):
    return [np.eye(dim)[_] for _ in val]
def one_hot_encoding(vec):
    vec_dim = vec.shape[1]
    vec_argmax = np.argmax(vec, axis=-1)
    return one_hot(vec_argmax, vec_dim)

[code 1-1. implementing the one_hot functions]

  • At inference time (as opposed to training), we just take the largest output as the answer via a one-hot vector, so applying softmax is unnecessary: use one_hot(o), not one_hot(softmax(o)).

Activation functions and the multilayer perceptron (MLP)

  • A neural network is the composition of a linear model with an activation function.
\[H = (\sigma (z_1), \dots, \sigma(z_n)),\quad \sigma(z) = \sigma(Wx + b)\\ \sigma:\ \text{activation function (nonlinear)},\quad z = (z_1,\dots,z_q):\ \text{latent vector},\quad H:\ \text{new latent vector (neurons)}\]

[math 2. neural network neurons]

[img 2. neural network neurons, diagrammed]

  • The new latent vectors built from latent vectors (and the further latent vectors built from those) are called neurons, and an artificial network with this structure is called a perceptron.
    • The value held by each neuron (node) is a tensor.
  • An activation function is a nonlinear function that maps a real number to a real number.
    • It exists to enable nonlinear approximation.
  • This is what separates deep learning from linear models; examples include sigmoid, $\tanh$, and the now-dominant ReLU.

[img 3. graphs of the sigmoid, tanh, and ReLU functions]

  • If we take the latent vector H obtained this way and apply one more linear transform with weight matrix $W^{(2)}$ and bias $b^{(2)}$ to produce the output, we get a 2-layer neural network with parameters ($W^{(1)}, W^{(2)}$).
\[O = H W^{(2)} + b^{(2)},\ H = (\sigma (z_1), \dots, \sigma(z_n)) = \sigma( Z^{(1)}),\ Z^{(1)} = X W^{(1)} + b^{(1)}\]

[math 2-1. 2-layer neural network]

[img 2-1. structure of a 2-layer neural network]

  • A function composed of several such layers is called a multi-layer perceptron (MLP).
\[O = Z^{(L)} \\ \vdots \\ H^{(l)} = \sigma(Z^{(l)}) \\ Z^{(l)} = H^{(l-1)} W^{(l)} + b^{(l)} \\ \vdots \\ H^{(1)} = \sigma(Z^{(1)}) \\ Z^{(1)} = X W^{(1)} + b^{(1)}\]

[math 2-2. the composed functions of an L-layer MLP]

  • The sequential network computation for $l = 1,\dots,L$ is called forward propagation.

[img 2-3. multi-layer network structure]

  • In theory, even a network of about 2 layers can approximate any continuous function (universal approximation theorem).
  • But the deeper the network, the fewer neurons (tensor-holding nodes) and parameters it needs — the count shrinks dramatically — so deeper networks are more efficient.
    • That is, making the network deeper lets it be narrower.
    • Optimization, of course, remains hard (covered in depth with CNNs).
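The 2-layer forward pass above can be sketched in a few lines of numpy; the dimensions and random weights here are arbitrary illustrations:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

rng = np.random.default_rng(0)
n, d, q, p = 4, 3, 5, 2                  # batch size, input dim, hidden dim, output dim
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, q)), np.zeros(q)
W2, b2 = rng.normal(size=(q, p)), np.zeros(p)

Z1 = X @ W1 + b1     # first linear transform
H = relu(Z1)         # nonlinear activation -> new latent vectors (neurons)
O = H @ W2 + b2      # second linear transform: a 2-layer network
```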

The backpropagation algorithm

  • Used to train the parameters $\{W^{(l)},b^{(l)}\}^L_{l=1}$ of each layer, in reverse order.
  • It computes gradients in reverse order using auto-differentiation based on the chain rule for composite functions.
\[z = (x+y)^2:\ \dfrac{\partial z}{\partial x} =\ ?\\ z = w^2 \rightarrow \dfrac{\partial z}{\partial w} = 2w,\qquad w = x + y \rightarrow \dfrac{\partial w}{\partial x} = 1,\ \dfrac{\partial w}{\partial y} = 1\\ \dfrac{\partial z}{\partial x} = \dfrac{\partial z}{\partial w}\dfrac{\partial w}{\partial x} = 2w \cdot 1 = 2(x+y)\]

[math 3. backpropagation example via partial derivatives]

  • First compute the gradient vector of the upper layer, then use that vector to compute the gradient vector of the layer below it.
    • Unlike forward propagation, the tensor value of every neuron (node) must be kept in memory, so backpropagation is memory-hungry.

[img 3. a harder 2-layer network example]

  • blue: forward propagation
  • red: back propagation
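The chain-rule example above can be checked symbolically with sympy (the same library used earlier for differentiation):

```python
import sympy as sym
from sympy.abc import x, y

w = x + y
z = w**2                 # z = (x + y)^2

# chain rule: dz/dx = (dz/dw) * (dw/dx) = 2w * 1 = 2(x + y)
dz_dx = sym.diff(z, x)
```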

Probability theory

  • Deep learning rests on probability-based machine learning theory, and statistical interpretation is the basic working principle behind loss functions.
    • The L2 norm in regression trains the model in the direction that minimizes the variance of the prediction error;
    • cross-entropy in classification trains it in the direction that minimizes the uncertainty of the model's predictions.

Probability distributions

  • A probability distribution is a function giving the probability that a random variable takes particular values.

  • A probability distribution is the distribution from which data is drawn in the data space $(x, y)$.

  • Random variables are classified as discrete or continuous.

    • Discrete random variable: modeled by summing the probabilities of all the values the variable can take.
      \(\mathbb{P}(X \in A) = \sum_{x\in A}{P(X= x)}\)

    [math 4. discrete random variable]

    • Continuous random variable: modeled by integrating over the density of the random variable defined on the data space.
    \[\mathbb{P}(X \in A) = \int_{A}{P(x)dx},\qquad P(x)=\lim_{h\rightarrow0}\frac{\mathbb{P}(x - h \leq X \leq x + h)}{2h}\ \text{(the density of the random variable)}\]

    [math 4-1. continuous random variable]

  • $P(x)$ is the marginal distribution of the input x; it gives no information about y.

Expectation

  • The conditional probability $P(y|x)$ models the relationship between input x and output y in the data space: the probability that the answer is y given the input variable x.

  • $softmax(W\phi + b)$ computes the conditional probability $P(y|x)$ from the feature pattern $\phi (x)$ extracted from data x and the weight matrix W.

  • The expectation is the statistic that summarizes the data, and is used to compute other statistical functionals from the probability distribution.

    • Regression problems estimate the conditional expectation, computed according to the distribution type as follows:
    \[\mathbb{E}_{x\sim P(x)}[f(x)] = \int_\mathcal{X} f(x)P(x)dx\ \ \text{(continuous distribution)}\qquad \mathbb{E}_{x\sim P(x)}[f(x)] = \sum_{x\in\mathcal{X}} f(x)P(x)\ \ \text{(discrete distribution)}\]

    [math 5. computing expectations]

    • Expectations are used to compute statistics such as variance, skewness, and covariance.
\[\text{variance}: \mathbb{V}(x) = \mathbb{E}_{x\sim P(x)}[(x-\mathbb{E}[x])^2]\\ \text{skewness}: Skewness(x) = \mathbb{E}\left[\left(\frac{x-\mathbb{E}[x]}{\sqrt{\mathbb{V}(x)}}\right)^3\right]\\ \text{covariance}: Cov(x_1,x_2) = \mathbb{E}_{x_1,x_2\sim P(x_1,x_2)}[(x_1 -\mathbb{E}[x_1])(x_2 - \mathbb{E}[x_2])]\]

​ [math 5-1. uses of the expectation]

Monte Carlo sampling

  • A method for computing expectations from data when the probability distribution is unknown.
  • Whether the variable is discrete or continuous, it is expressed as:
\[\mathbb{E}_{x\sim P(x)}[f(x)] \approx \frac{1}{N}\sum^N_{i=1}f(x^{(i)}),\ x^{(i)}\stackrel{\text{i.i.d.}}{\sim} P(x)\]

[math 6. Monte Carlo sampling]

  • As long as independent sampling is guaranteed, the law of large numbers guarantees convergence.
import numpy as np
# f(x) = e^(-x^2) on [-1, 1]
def mc_int(fun, low, high, sample_size=100, repeat=10):
    int_len = np.abs(high - low)
    stat = []
    for _ in range(repeat):
        x = np.random.uniform(low=low, high=high, size=sample_size)
        fun_x = fun(x)
        int_val = int_len * np.mean(fun_x)
        stat.append(int_val)
    return np.mean(stat), np.std(stat)

def f_x(x):
    return np.exp(-x**2)

print(mc_int(f_x, low=-1, high=1, sample_size=10000, repeat=100))

[code 6. Monte Carlo integration of f(x) over [-1, 1]]

Statistics

  • The goal of statistical modeling is to infer a probability distribution under appropriate assumptions; this is the same goal machine learning has when predicting outcomes.
  • A finite amount of data can never determine the population distribution exactly, so we estimate the distribution approximately to minimize the uncertainty.

Parameters

  • Assuming a priori that the data follows a particular probability distribution and then estimating the parameters that determine that distribution is called the parametric approach.
    • Conversely, when the structure and number of parameters of the model change flexibly with the data, the approach is called nonparametric; many machine learning methods follow it.

๋ฐ์ดํ„ฐ ๋ชจ์ˆ˜ ์ถ”์ •

ํ™•๋ฅ  ๋ถ„ํฌ ๊ฐ€์ •

  • ํžˆ์Šคํ† ๊ทธ๋žจ์˜ ๋ชจ์–‘์„ ๊ด€์ฐฐํ•˜์—ฌ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ๊ฐ€์ •ํ•  ์ˆ˜ ๋„ ์žˆ๋‹ค.

[img 7. ์—ฌ๋Ÿฌ๊ฐ€์ง€ ๋ชจ์–‘์˜ ํ™•๋ฅ  ๋ถ„ํฌ]

ํ™•๋ฅ  ๋ถ„ํฌ๋ช… ๋ฐ์ดํ„ฐ ๋ชจ์–‘
๋ฒ ๋ฅด๋ˆ„์ด ๋ถ„ํฌ ๋ฐ์ดํ„ฐ๊ฐ€ 2๊ฐœ์˜ ๊ฐ’(0 ๋˜๋Š” 1)๋งŒ ๊ฐ€์ง
์นดํ…Œ๊ณ ๋ฆฌ ๋ถ„ํฌ ๋ฐ์ดํ„ฐ๊ฐ€ n๊ฐœ์˜ ์ด์‚ฐ์ ์ธ ๊ฐ’๋งŒ์„ ๊ฐ€์ง
๋ฒ ํƒ€ ๋ถ„ํฌ ๋ฐ์ดํ„ฐ๊ฐ€ [0, 1] ์‚ฌ์ด์—์„œ ๊ฐ’์„ ๊ฐ€์ง
๊ฐ๋งˆ ๋ถ„ํฌ, ๋กœ๊ทธ ์ •๊ทœ ๋ถ„ํฌ ๋ฐ์ดํ„ฐ๊ฐ€ 0 ์ด์ƒ์˜ ๊ฐ’์„ ๊ฐ€์ง
์ •๊ทœ ๋ถ„ํฌ, ๋ผํ”Œ๋ผ์Šค ๋ถ„ํฌ ๋ฐ์ดํ„ฐ๊ฐ€ $\mathbb{R}$(์‹ค์ˆ˜) ์ „์ฒด์—์„œ ๊ฐ’์„ ๊ฐ€์ง

[fig 7. ํ™•๋ฅ  ๋ถ„ํฌ์˜ ์˜ˆ์‹œ]

  • ํ•˜์ง€๋งŒ ์ด๋Ÿฐ์‹์œผ๋กœ ๊ฐ€์ •ํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ์›๋ฆฌ๋ฅผ ๊ณ ๋ คํ•œ ๋’ค, ๊ฐ ๋ถ„ํฌ๋งˆ๋‹ค ๋ชจ์ˆ˜๋ฅผ ์ถ”์ • ํ›„ ๊ฐ ํ™•๋ฅ ๋ถ„ํฌ์˜ ๊ฒ€์ •๋ฐฉ๋ฒ™์œผ๋กœ ๊ฒ€์ •ํ•˜๋Š” ๋ฐฉ์‹์ด ์›์น™์ด๋‹ค.

๋ฐ์ดํ„ฐ ๋ชจ์ˆ˜ ์ถ”์ •

  • ๋ฐ์ดํ„ฐ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ๊ฐ€์ •ํ•œ ์ˆ˜ ๋ฐ์ดํ„ฐ ๋ชจ์ˆ˜๋ฅผ ์ถ”์ •ํ•œ๋‹ค.

  • ํ‰๊ท  $\mu$์™€ ๋ถ„์‚ฐ $\sigma^2$์œผ๋กœ ์ด๋ฅผ ์ถ”์ •ํ•˜๋Š” ํ†ต๊ณ„๋Ÿ‰(statistic)์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.
    \(\stackrel{ํ‘œ๋ณธ ํ‰๊ท }{\bar{X} = \frac{1}{N}\sum^N_{i=1}X_i},\ \ \ \stackrel{ํ‘œ๋ณธ ๋ถ„์‚ฐ}{S^2 = \frac{1}{N-1}\sum^N_{i=1}(X_i-\bar{X})^2}\\ \mathbb{E}[\bar{X}] = \mu,\ \ \ \mathbb{E}[S^2] = \sigma^2,\ ํ‘œ๋ณธ ํ‘œ์ค€ ํŽธ์ฐจ = \sqrt{ํ‘œ๋ณธ๋ถ„์‚ฐ} = \sqrt{S^2} = S\)
    [math 7. ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ]

    • ํ‘œ๋ณธ๋ถ„์‚ฐ์„ ๊ตฌํ•  ๋•Œ N์ด ์•„๋‹Œ N-1๋กœ ๋‚˜๋ˆ„๋Š” ์ด์œ ๋Š” ๋ถˆํŽธ(unbiased) ์ถ”์ •๋Ÿ‰์„ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด์„œ๋ผ๊ณ  ํ•˜๋ฉฐ, ๊ณ ๊ธ‰ ํ†ต๊ณ„ํ•™ ๋‚ด์šฉ์ด๋ฏ€๋กœ ์ผ๋‹จ ๋„˜์–ด๊ฐ€๊ฒ ๋‹ค.
  • ํ†ต๊ณ„๋Ÿ‰์˜ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ํ‘œ์ง‘๋ถ„ํฌ(Sampling distribution)์ด๋ผ ๋ถ€๋ฅด๋ฉฐ, ํŠนํžˆ ํ‘œ๋ณธํ‰๊ท ์˜ ํ‘œ์ง‘๋ถ„ํฌ๋Š” N์ด ์ปค์งˆ์ˆ˜๋ก ์ •๊ทœ๋ถ„ํฌ $\mathcal{N}(\mu,\sigma^2/N )$๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.

    • ์ด๋ฅผ ์ค‘์‹ฌ ๊ทนํ•œ ์ •๋ฆฌ(Central Limit Theorem)์ด๋ผ ๋ถ€๋ฅด๋ฉฐ, ๋ชจ์ง‘๋‹จ์˜ ๋ถ„ํฌ๊ฐ€ ์ •๊ทœ๋ถ„ํฌ๋ฅผ ๋”ฐ๋ฅด์ง€ ์•Š์•„๋„ ์„ฑ๋ฆฝํ•ฉ๋‹ˆ๋‹ค.
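The Central Limit Theorem can be seen empirically by sampling; a sketch using a uniform (non-normal) population, with arbitrary sample counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# population: Uniform[0, 1], which is not normal: mu = 0.5, sigma^2 = 1/12
N = 1000                                          # sample size
means = rng.uniform(size=(5000, N)).mean(axis=1)  # 5000 sample means

# sampling distribution of the mean should be close to N(mu, sigma^2 / N)
expected_sd = np.sqrt((1 / 12) / N)
```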

์ตœ๋Œ€ ๊ฐ€๋Šฅ๋„(maximum likelihood estimation, MLE) ์ถ”์ •

  • ํ†ต๊ณ„๋Ÿ‰์„ ์ธก์ •ํ•˜๋Š” ์ ์ ˆํ•œ ๋ฐฉ๋ฒ•์€ ํ™•๋ฅ ๋ถ„ํฌ๋งˆ๋‹ค ๋‹ค๋ฅด๋‹ค.

  • ์ด๋ก ์ ์œผ๋กœ ๊ฐ€์žฅ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์€ ๋ชจ์ˆ˜ ์ธก์ • ๋ฐฉ๋ฒ•์€ ์ตœ๋Œ€ ๊ฐ€๋Šฅ๋„ ์ถ”์ •๋ฒ•(maximum likelihood estimation, MLE)์ž…๋‹ˆ๋‹ค.
    \(\hat{\theta}_{MLE} = argmax\ L(\theta;x) = argmax\ P(x|\theta)\)
    [math 8. ์ตœ๋Œ€๊ฐ€๋Šฅ๋„ ์ถ”์ •๋ฒ•]

  • ๋ฐ์ดํ„ฐ ์ง‘ํ•ฉ X๊ฐ€ ๋…๋ฆฝ์ ์œผ๋กœ ์ถ”์ถœ๋˜์—ˆ์„ ๊ฒฝ์šฐ ๋กœ๊ทธ๊ฐ€๋Šฅ๋„๋ฅผ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค.

    • ์ด๋•Œ ๋ชจ์ˆ˜ $\theta$๋Š” ๊ฐ€๋Šฅ๋„๋ฅผ ์ตœ์ ํ™”ํ•˜๋Š” MLE๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
\[L(\theta;X) = \prod^n_{i=1}P(x_i|\theta) \Rightarrow log\ L(\theta; X) = \sum^n_{i=1}logP(x_i|\theta)\]

โ€‹ [math 8. ๋…๋ฆฝ ์ถ”์ถœ ์‹œ์˜ ์ถ”์ •๋ฒ• ์ตœ๋Œ€๊ฐ€๋Šฅ๋„ ์ถ”์ •๋ฒ•]

  • ๋ฐ์ดํ„ฐ์˜ ์ˆซ์ž๊ฐ€ ์ˆ˜์–ต๋‹จ์œ„๊ฐ€ ๋˜๋ฉด ์ปดํ“จํ„ฐ์˜ ์—ฐ์‚ฐ์œผ๋กœ ๊ณ„์‚ฐ ๋ถˆ๊ฐ€๋Šฅํ•˜๋ฏ€๋กœ, ๋ฐ์ดํ„ฐ๊ฐ€ ๋…๋ฆฝ์ผ ์‹œ, ๋กœ๊ทธ ๊ฐ€๋Šฅ๋„์˜ ๋ง์…ˆ์œผ๋กœ ๋ฐ”๊พธ๋ฉด ์ปดํ“จํ„ฐ๋กœ ์—ฐ์‚ฐ์ด ๊ฐ€๋Šฅํ•ด์ง.
    • ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ•์œผ๋กœ ๊ฐ€๋Šฅ๋„ ์ตœ์ ํ™”์‹œ, ๋ฏธ๋ถ„์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋ฉฐ, ์Œ์˜ ๋กœ๊ทธ๊ฐ€๋Šฅ๋„(negative log-likelihood)๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์—ฐ์‚ฐ๋Ÿ‰์ด O(n^2^)์—์„œ O(n)์œผ๋กœ ์ค„์—ฌ์ค€๋‹ค.
    • ๋ถˆํŽธ ์ถ”์ •๋Ÿ‰์„ ๋ณด์žฅํ•˜์ง„ ์•Š์Œ.

์ตœ๋Œ€ ๊ฐ€๋Šฅ๋„ ์ถ”์ •๋ฒ• ์˜ˆ์ œ

์ •๊ทœ๋ถ„ํฌ
\[\hat{\theta}_{MLE}= argmax\ L(\theta; x) = argmax\ P(x|\theta)\\ log\ L(\theta;X) = \sum^n_{i=1}logP(x_i|\theta) = \sum^n_{i=1}log\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{{|x_i-\mu|^2}}{2\mu^2}} = -\frac{n}{2}log2\pi\sigma^2 - \sum^n_{i=1}\frac{{|x_i-\mu|}^2}{2\sigma^2}\]

[math 8-1. ๋ชจ์ˆ˜ ์ถ”์ •์„ ์œ„ํ•œ ๋กœ๊ทธ ๊ฐ€๋Šฅ๋„ ๊ณ„์‚ฐ]
\(0 = \frac{\partial logL}{\partial\mu}= -\sum^n_{i=1}\frac{x_i - \mu}{\sigma^2} \Rightarrow \hat{\mu}_{MLE}=\frac{1}{n}\sum^N_{i=1}x_i\\ 0 = \frac{\partial logL}{\partial\sigma}= -\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum^n_{i=1}|{x_i - \mu}|^2 \Rightarrow \hat{\sigma}_{MLE}^2=\frac{1}{n}\sum^N_{i=1}(x_i -\mu)^2\\\)
[math 8-2. ๋ฏธ๋ถ„์„ ํ†ตํ•œ ๋ชจ์ˆ˜ ์ถ”์ •]
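The closed-form estimates just derived can be checked numerically. A minimal sketch on synthetic data with true $\mu=5$, $\sigma^2=4$ (all names hypothetical): the MLE is simply the sample mean and the biased sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)  # synthetic sample: true mu=5, sigma^2=4

mu_mle = x.mean()                        # hat{mu} = (1/n) sum x_i
sigma2_mle = ((x - mu_mle) ** 2).mean()  # hat{sigma}^2 divides by n, not n-1
```

Note that `sigma2_mle` divides by n rather than n-1, illustrating the earlier remark that MLE does not guarantee an unbiased estimator.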

์นดํ…Œ๊ณ ๋ฆฌ ๋ถ„ํฌ
\[\hat{\theta}_{MLE}= argmax\ P(x_i|\theta) = argmax\ log(\prod^n_{i=1}\prod^d_{k=1}p_k^{x_i,k})\\ log(\prod^n_{i=1}\prod^d_{k=1}p_k^{x_i,k})=\sum^d_{k=1}(\sum^n_{i=1}x_{i,k})logp_k = \sum^d_{k=1}n_klogp_k\ with\ \sum^d_{k=1}p_k=1\\ \Rightarrow \mathcal{L}(p_1,\dots,p_k,\lambda) = \sum^d_{k=1}n_k logp_k + \lambda(1-\sum_kp_k) (๋ผ๊ทธ๋ž‘์ฃผ\ ์Šน์ˆ˜๋ฒ•)\\ 0 = \frac{\partial \mathcal{L}}{\partial p_k} = \frac{n_k}{p_k} - \lambda,\ \ \ 0=\frac{\partial \mathcal{L}}{\partial \lambda} = 1 - \sum^d_{k=1}p_k \rightarrow\ p_k =\frac{n_k}{\sum^d_{k=1}n_k}\]

**[math 8-3. ๋ชจ์ˆ˜ ์ถ”์ •] **

๋”ฅ ๋Ÿฌ๋‹์—์„œ ์ตœ๋Œ€๊ฐ€๋Šฅ๋„ ์ถ”์ •๋ฒ•

  • ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์˜ ๊ฐ€์ค‘์น˜๋ฅผ $\theta=(W^{(1)},\dots,W^{(L)})$๋ผ ํ‘œ๊ธฐํ–ˆ์„ ๋•Œ ๋ถ„๋ฅ˜ ๋ฌธ์ œ์—์„œ ์†Œํ”„ํŠธ๋งฅ์Šค ๋ฒกํ„ฐ๋Š” ์นดํ…Œ๊ณ ๋ฆฌ ๋ถ„ํฌ์˜ ๋ชจ์ˆ˜ $(p_1,\dots,p_k)$๋ฅผ ๋ชจ๋ธ๋งํ•ฉ๋‹ˆ๋‹ค.

  • ์›ํ•ซ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•œ ์ •๋‹ต๋ ˆ์ด๋ธ” $y= (y_1, \dots,y_k)$์„ ๊ด€์ฐฐ๋ฐ์ดํ„ฐ๋กœ ์ด์šฉํ•ด ํ™•๋ฅ ๋ถ„ํฌ์ธ ์†Œํ”„ํŠธ๋งฅ์Šค ๋ฒกํ„ฐ์˜ ๋กœ๊ทธ๊ฐ€๋Šฅ๋„๋ฅผ ์ตœ์ ํ™”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    \(\hat{\theta}_{MLE} = argmax \frac{1}{n}\sum^n_{i=1}\sum^K_{k=1}y_{i,k}log(MLP_\theta(x_i)_k)\)
    [math 9. ๋ถ„๋ฅ˜ ๋ฌธ์ œ ์ตœ๋Œ€๊ฐ€๋Šฅ๋„ ์ถ”์ •]

ํ™•๋ฅ ๋ถ„ํฌ์˜ ๊ฑฐ๋ฆฌ ๊ตฌํ•˜๊ธฐ - ์ฟจ๋ฐฑ-๋ผ์ด๋ธ”๋Ÿฌ ๋ฐœ์‚ฐ

  • ๊ธฐ๊ณ„ํ•™์Šต์—์„œ ์‚ฌ์šฉ๋˜๋Š” ํ•จ์ˆ˜๋“ค์€ ๋ชจ๋ธ์ด ํ•™์Šตํ•˜๋Š” ํ™•๋ฅ ๋ถ„ํฌ์™€ ๋ฐ์ดํ„ฐ์—์„œ ๊ด€์ฐฐ๋˜๋Š” ํ™•๋ฅ ๋ถ„ํฌ์˜ ๊ฑฐ๋ฆฌ๋ฅผ ํ†ตํ•ด ์œ ๋„๋ฉ๋‹ˆ๋‹ค.
  • ๋‘ ๊ฐœ์˜ ํ™•๋ฅ ๋ถ„ํฌ P(x), Q(x)๊ฐ€ ์žˆ์„ ๊ฒฝ์šฐ ๋‘ ํ™•๋ฅ ๋ถ„ํฌ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ(distance)๋ฅผ ๊ณ„์‚ฐํ•  ๋•Œ ์—ฌ๋Ÿฌ ํ•จ์ˆ˜๋ฅผ ์ด์šฉ
    • ์ด๋ณ€๋™ ๊ฑฐ๋ฆฌ (Total Variation Distance, TV)
    • ์ฟจ๋ฐฑ-๋ผ์ด๋ธ”๋Ÿฌ ๋ฐœ์‚ฐ (Kullback-Leibler Divergence, KL)
    • ๋ฐ”์Šˆํƒ€์ธ ๊ฑฐ๋ฆฌ (Wasserstein Distance)
  • ์ด ์ค‘ ์ฟจ๋ฐฑ-๋ผ์ด๋ธ”๋Ÿฌ ๋ฐœ์‚ฐ(KL Divergence)์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜.
    \(\mathbb{KL}(P\|Q) = -\mathbb{E}_{x\sim P(x)}[logQ(x)] + \mathbb{E}_{x\sim P(x)}[logP(x)]\\ -\mathbb{E}_{x\sim P(x)}[logQ(x)] = ํฌ๋กœ์Šค\ ์—”ํŠธ๋กœํ”ผ,\ \mathbb{E}_{x\sim P(x)}[logP(x)] = ์—”ํŠธ๋กœํ”ผ\)

[math 10. ์ฟจ๋ฐฑ-๋ผ์ด๋ธ”๋Ÿฌ ๋ฐœ์‚ฐ ๊ตฌํ•˜๊ธฐ]

  • ๋ถ„๋ฅ˜ ๋ฌธ์ œ์—์„œ ์ •๋‹ต๋ ˆ์ด๋ธ”์„ P, ๋ชจ๋ธ ์˜ˆ์ธก์„ Q๋ผ ๋‘๋ฉด ์ตœ๋Œ€๊ฐ€๋Šฅ๋„ ์ถ”์ •๋ฒ•์€ ์ฟจ๋ฐฑ-๋ผ์ด๋ธ”๋Ÿฌ ๋ฐœ์‚ฐ์„ ์ตœ์†Œํ™” ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™์Œ.

๋ฒ ์ด์ฆˆ ํ†ต๊ณ„ํ•™

  • ๋ฒ ์ด์ฆˆ ํ†ต๊ณ„ํ•™์ด๋ž€, ๋ชจ์ˆ˜ ์ถ”์ •์— ์‚ฌ์šฉ๋˜๋Š” ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ์— ๋Œ€ํ•œ ๋‚ด์šฉ, ๋ฐ์ดํ„ฐ ์ถ”๊ฐ€์‹œ ๋ฐ์ดํ„ฐ ์—…๋ฐ์ดํŠธ ๋ฐฉ๋ฒ•์— ๋Œ€ํ•œ ์ด๋ก 
  • ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋ž€, ์กฐ๊ฑด๋ถ€ํ™•๋ฅ ์„ ์ด์šฉํ•ด ์ •๋ณด๋ฅผ ๊ฐฑ์‹ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ๋ ค์คŒ, ์˜ˆ์ธก ๋ชจํ˜•์˜ ๋ฐฉ๋ฒ•๋ก 

Conditional probability

  • The conditional probability $P(A|B)$ is the probability that event A occurs given that event B has occurred.
  • From it, $P(B|A)$ can also be obtained.
\[P(A\cap B) = P(B)P(A\|B)\\ P(B\|A) = \frac {P(A\cap B)}{P(A)} = P(B)\frac {P(A\|B)}{P(A)}\]

[math 11. Conditional probability]
\(P(\theta\|\mathcal{D}) = P(\theta)\frac{P(\mathcal{D}\|\theta)}{P(\mathcal{D})}\\ P(\theta\|\mathcal{D}) : posterior,\ P(\theta): prior, \ P(\mathcal{D}\|\theta): likelihood,\ P(\mathcal{D}): evidence\)
[math 11-1. Terms in Bayes' theorem]

  • As an example: suppose the COVID-19 prevalence is 10% (prior $P(\theta)=0.1$), the probability of testing positive when infected is 99%, and the probability of a false positive is 1% (likelihoods $P(\mathcal{D}|\theta)=0.99$, $P(\mathcal{D}|\neg\theta)=0.01$). Given a positive test result, what is the probability the person is actually infected?
\[P(\theta) = 0.1,\ P(\neg\theta) = 0.9,\ P(\mathcal{D}\|\theta)=0.99,\ P(\mathcal{D}\|\neg\theta)=0.01\\ P(\mathcal{D}) = \sum_\theta P(\mathcal{D}\|\theta)P(\theta) = 0.99 \times 0.1 + 0.01 \times0.9 = 0.108\\ P(\theta\|\mathcal{D}) = 0.1 \times\frac{0.99}{0.108} \approx 0.916\]

[math 11-2. Computing the posterior]

  • $\theta$: the event of actually having COVID-19 (not observable), $\mathcal{D}$: the test result (observable), $\neg\theta$: the complement of $\theta$
  • As the false-alarm rate rises, the precision of the test falls (at a false-positive rate of 0.1, the posterior drops to 0.524).

[img 11. Visualization of conditional probability]

๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋ฅผ ํ†ตํ•œ ์ •๋ณด์˜ ๊ฐฑ์‹ 

  • ๋ฒ ์ด์ฆˆ ์ •๋ฆฌ๋ฅผ ํ†ตํ•ด ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด์™”์„ ๋•Œ ์•ž์„œ ๊ณ„์‚ฐํ•œ ์‚ฌํ›„ํ™•๋ฅ ์„ ์‚ฌ์ „ํ™•๋ฅ ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐฑ์‹ ๋œ ์‚ฌํ›„ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Œ.
\[new\ P(\theta\|\mathcal{D}) = P(\theta\|\mathcal{D})\frac{P(\mathcal{D}\|\theta)}{P(\mathcal{D})}\]

[math 12. ๊ฐฑ์‹ ๋œ ์‚ฌํ›„ํ™•๋ฅ  ๊ตฌํ•˜๊ธฐ]
\(new\ P(\theta\|\mathcal{D}) = 0.1 \times \frac {0.99}{0.189} \approx 0.524,\ P(\theta\|\mathcal{D}) = 0.99,\ P(\theta\|\neg\mathcal{D}) = 0.1 \\ P(\mathcal{D}^*)=0.99\times0.524+0.1\times0.476 \approx0.566\\ ๊ฐฑ์‹ ๋œ\ ์‚ฌํ›„ํ™•๋ฅ \ P(\theta\|\mathcal{D}^*) = 0.524 \times\frac{0.99}{0.566}\approx0.917\)
[math 12-1. ๊ฐฑ์‹ ๋œ ์‚ฌํ›„ํ™•๋ฅ  ๊ณ„์‚ฐ]

  • ์ฝ”๋กœ๋‚˜ ํ™•์ •์„ ๋ฐ›์€ ์‚ฌ๋žŒ์ด ์˜ค์ง„์œจ์ด 10%์ผ์‹œ ๋‘๋ฒˆ์งธ ๊ฒ€์ง„์ด ์–‘์„ฑ์ผ ์‹œ์—๋„ ํ™•์ง„์ผ ํ™•๋ฅ ?

Conditional probability vs. causality

  • No matter how much data you have, conditional probability is not the same as causality.

  • Causality must be taken into account to build predictive models that are robust to shifts in the data distribution.
    • Ignoring causal structure can make accuracy drop sharply when the scenario or the data change.
  • To measure a causal effect, the influence of confounding factors must be removed so that only the variable acting as the cause is considered.
    • "The taller, the higher the IQ"? => In children, both height and intelligence increase with age, which creates the apparent relationship.
    • Here, the confounding factor is age.

Causal mistakes from Simpson's paradox

[img 13. Cure rates of kidney-stone treatments]

  • Surgery actually has the higher cure rate for both small and large stones, but because the drug treatment is mostly applied to small stones, which are cured at a high rate anyway, its overall cure rate appears higher.
  • The intervention do(T=a) must be applied to remove the influence of Z.
\[P_a(R=1) = \sum_{z\in \{0,1\}}P(R=1\|T=a,Z=z)P(Z=z) = \frac{81}{87}\times\frac{(87+270)}{700} + \frac{192}{263}\times\frac{263+80}{700}\approx 0.8325 \\ P_b(R=1) = \sum_{z\in \{0,1\}}P(R=1\|T=b,Z=z)P(Z=z) = \frac{234}{270}\times\frac{(87+270)}{700} + \frac{55}{80}\times\frac{263+80}{700}\approx 0.7789\]

[math 13. Removing the confounder Z via the adjustment formula]
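The adjustment above is plain arithmetic over the stratum counts in the formula (treatment a = surgery, b = drug), so it can be verified directly:

```python
# Stratum counts: (cured, treated) per treatment and stone size, from the formula above
a_small, a_large = (81, 87), (192, 263)  # surgery
b_small, b_large = (234, 270), (55, 80)  # drug

p_small = (87 + 270) / 700   # P(Z = small stone)
p_large = (263 + 80) / 700   # P(Z = large stone)

p_a = a_small[0] / a_small[1] * p_small + a_large[0] / a_large[1] * p_large
p_b = b_small[0] / b_small[1] * p_small + b_large[0] / b_large[1] * p_large
# After adjusting for Z, surgery (a) beats the drug (b)
```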

CNN

Understanding CNNs (Convolutional Neural Networks)

  • Earlier models were fully connected, with every neuron wired to every input; a CNN instead slides a kernel with fixed, shared weights across the input vector, applying a linear operation followed by an activation function at each position.
  • Convolution is still a kind of linear transformation.
  • Because the kernel size is fixed, the parameter count is small.
\[h_i =\sigma\left(\sum^k_{j=1}V_jx_{i+j-1}\right)\]

[math 14. The convolution operation]

[img 14. Diagram of the convolution operation]

  • CNN์˜ ์ˆ˜ํ•™์  ์˜๋ฏธ๋Š” ์‹ ํ˜ธ(signal)๋ฅผ ์ปค๋„์„ ์ด์šฉํ•ด ๊ตญ์†Œ์ ์œผ๋กœ ์ฆํญ ๋˜๋Š” ๊ฐ์†Œ์‹œ์ผœ ์ •๋ณด๋ฅผ ์ถ”์ถœ ๋˜๋Š” ํ•„ํ„ฐ๋งํ•˜๋Š” ๊ฒƒ์ด๋ฉฐ, ์ด๋Š” ํฌ๊ฒŒ 2๊ฐ€์ง€ ์ •์˜ ํ•  ์ˆ˜ ์žˆ๋‹ค.
    • ์ •์˜์—ญ์ด ์—ฐ์†์ (continuous)์ธ ๊ณต๊ฐ„ : ์ ๋ถ„์œผ๋กœ ํ‘œํ˜„
    • ์ •์˜์—ญ์ด ์ด์ƒ(discrete) ๊ณต๊ฐ„ : ๊ธ‰์ˆ˜๋กœ ํ‘œํ˜„
\[continuous\ \ \[f*g\]\(x\) = \int_{\mathbb{R}^d}f(z)g(x-z)dz=\int_{\mathbb{R}^d}f(x-z)g(z)dz=\[g*f\]\(x\)\\ discrete\ \ \[f*g\]\(i\) = \sum_{a \in \mathbb{Z}^d}f(a)g(i-a)=\sum_{a \in \mathbb{Z}^d}f(i-a)g(a)=\[g*f\]\(i\) \\ g(x-z), g(i-a) : signal\ term,\ f(z), f(a): kernal\ term\]

[math 14-1. Convolution ์—ฐ์‚ฐ ์ˆ˜์‹]

  • z ๋˜๋Š” a๋งŒ ์›€์ง์ด๋Š” ํ˜•ํƒœ๋กœ ์—ฐ์‚ฐ
  • ์‚ฌ์‹ค x-z, i-a๊ฐ€์•„๋‹ˆ๋ผ x+z, i+a ์ด๋ฉฐ cross-correlation ์ด๋‹ค.
    • ์ „์ฒด ๊ณต๊ฐ„์—์„œ๋Š” +,-๊ฐ€ ์ฐจ์ด๊ฐ€ ํฌ์ง€์•Š์œผ๋ฏ€๋กœ convolution์ด๋ผ ๋ถˆ๋Ÿฌ์™”์Œ
Graphical understanding of the convolution operation

[fig 14-1. Graphical view of the convolution operation]

  • The kernel does not change as it moves across the domain (translation invariance) and is applied locally to the given signal.
\[1D-conv\ \ (f*g)(i) = \sum^d_{p=1}f(p)g(i+p)\\ 2D-conv\ \ (f*g)(i,j) = \sum_{p,q}f(p,q)g(i+p, j+q)\\ 3D-conv\ \ (f*g)(i,j,k) = \sum_{p,q,r}f(p,q,r)g(i+p, j+q, k+r)\\\]

[math 14-2. Convolution in several dimensions]

  • Convolution can be computed in many dimensions, not just one.
  • It applies to 1D (text), 2D (grayscale images), and 3D (color images) data.
  • The kernel term f does not change across positions.

๋‹ค์ฐจ์› CNN(Convolution Neural Network)์˜ ์ดํ•ด

[img 14-2. 2์ฐจ์› Convolution ์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”ฝ์  ์ดํ•ด-1]

[img 14-3. 2์ฐจ์› Convolution ์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”ฝ์  ์ดํ•ด-2]
\(O_H=H-K_H+1\\ O_W=W-K_w+1\)

[math 14-3. Convolution ์ถœ๋ ฅ ํฌ๊ธฐ ๊ณ„์‚ฐ]

  • For example, convolving a 28x28 input with a 3x3 kernel yields a 26x26 output.
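The output-size formula and the sliding computation can be sketched as a naive loop, in the cross-correlation form of math 14-2 (`conv2d_valid` is a hypothetical helper, not a library routine; no padding or stride):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D sliding-window 'convolution' (cross-correlation form)."""
    H, W = image.shape
    KH, KW = kernel.shape
    OH, OW = H - KH + 1, W - KW + 1  # O_H = H - K_H + 1, O_W = W - K_W + 1
    out = np.zeros((OH, OW))
    for i in range(OH):
        for j in range(OW):
            out[i, j] = np.sum(image[i:i+KH, j:j+KW] * kernel)
    return out

out = conv2d_valid(np.ones((28, 28)), np.ones((3, 3)))  # 28x28 input, 3x3 kernel -> 26x26
```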

[img 14-4. Graphical view of 3D convolution]

  • In the 3D case (e.g., an RGB image), the 2D convolution is applied three times, once per channel.
    • The number of kernel channels grows to match.
  • From three dimensions on, the input is called a tensor.

CNN์˜ ์—ญ์ „ํŒŒ

  • ์ปค๋„์ด ๋ชจ๋“  ์ž…๋ ฅ๋ฐ์ดํ„ฐ์— ๊ณตํ†ต์œผ๋กœ ์ ์šฉ๋˜๋ฏ€๋กœ ์—ญ์ „ํŒŒ ๊ณ„์‚ฐ์‹œ convolution ์—ฐ์‚ฐ์„ ํ•จ
\[\frac \partial {\partial x}\[f*g\]\(x\) = \frac \partial {\partial x}\int_{\mathbb{R}^d}f(y)g(x-y)dy =\int_{\mathbb{R}^d}f(y)\frac {\partial g}{\partial x}(x-y)dy =\[f*g'\]\(x\)\]

[math 14-4. Convolution ์—ฐ์‚ฐ ์—ฐ์†์‹œ ์—ญ์ „ํŒŒ]

  • Discrete ๊ตฌ์กฐ์—๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์„ฑ๋ฆฝํ•œ๋‹ค.
\[\frac {\partial \mathcal{L}}{\partial w_i} = \sum_j \delta_jx_i+j-1, \\ ex) \frac {\partial \mathcal{L}}{\partial w_1}= \delta_ix_i + \delta_2x_2+\delta_3x_3\]

[math 14-5. Convolution ์—ฐ์‚ฐ]
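The weight-gradient formula can be verified against finite differences. A minimal sketch with a linear (no activation) 1D convolution and a squared-error loss (all names hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # input
w = np.array([0.2, -0.1, 0.3])           # kernel weights
k, n_out = len(w), len(x) - len(w) + 1

h = np.array([w @ x[j:j+k] for j in range(n_out)])  # h_j = sum_i w_i x_{i+j-1}
delta = h                                           # dL/dh_j for L = 0.5 * sum h_j^2

# Formula above: dL/dw_i = sum_j delta_j * x_{i+j-1}
grad = np.array([np.sum(delta * x[i:i+n_out]) for i in range(k)])

# Finite-difference check of the same gradient
eps, num = 1e-6, np.zeros(k)
for i in range(k):
    wp = w.copy(); wp[i] += eps
    hp = np.array([wp @ x[j:j+k] for j in range(n_out)])
    num[i] = (0.5 * np.sum(hp**2) - 0.5 * np.sum(h**2)) / eps
```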

RNN

Sequence data

  • Data that arrive in order: sound, strings, stock prices, and so on.
    • Time-series data, arranged in time order, are a kind of sequence data.
  • Sequence data violate the independent and identically distributed ($i.i.d.$) assumption, so shuffling the order or altering past information changes the data's probability distribution.
    • e.g., "The dog bit the man." $\leftrightarrow$ "The man bit the dog." $\rightarrow$ merely swapping positions changes the probability and the meaning of the data.

Handling sequence data

  • To model the probability distribution of upcoming data from the earlier part of the sequence, we use the chain rule of conditional probability.
\[P(X_1,\dots,X_t) = P(X_t\|X_1,\dots,X_{t-1})P(X_1,\dots,X_{t-1})\\ = P(X_t\|X_1,\dots,X_{t-1})P(X_{t-1}\|X_1,\dots,X_{t-2})\times P(X_1,\dots,X_{t-2})\\ =\prod^t_{s=1}P(X_s\|X_{s-1},\dots,X_1)\\ i.e.,\ X_t \sim P(X_t\|X_{t-1},\dots,X_1), \\ X_{t+1} \sim P(X_{t+1}\|X_t,X_{t-1},\dots,X_1)\]

[math. Factorizing $P(X_s)$ with the chain rule of conditional probability]

  • Past information is used, but not every point from 1 to t-1 is required; information from the distant past can often be left out.
    • To handle sequence data, a model must be able to work with inputs of variable length.
    • For example, a model that predicts $X_t$ using only the most recent $\tau$ values $X_{t-1},\dots,X_{t-\tau}$ rather than the whole past is called an autoregressive model, AR($\tau$).
  • An autoregressive model takes $\tau$ as a hyperparameter; it is hard to choose, and information older than $\tau$ steps may still be needed.

    • The model that compensates for this is the latent autoregressive model, the basic form of an RNN.
\[X_t \sim P(X_t\|X_{t-1},H_t), \\ X_{t+1} \sim P(X_{t+1}\|X_t,H_{t+1})\\ latent\ variable\ H_t=Net_\theta(H_{t-1},X_{t-1}),\ summarizing\ X_{t-2},\dots,X_{1}\]

[math. An RNN learns the patterns of sequence data by reusing the latent variable $H_t$ through a neural network]

RNN์˜ ์ดํ•ด์™€ BPTT

[img. ๊ธฐ๋ณธ RNN ๋ชจํ˜•, ์ˆœ์ „ํŒŒ์™€ ์—ญ์ „ํŒŒ ํ™”์‚ดํ‘œ ํฌํ•จ]

  • ๊ธฐ๋ณธ์ ์ธ RNN ๋ชจํ˜•์€ Multi Layer Perceptron๊ณผ ์œ ์‚ฌํ•˜๋‹ค.
  • ์ด์ „ ์ˆœ์„œ์˜ ์ž ์žฌ๋ณ€์ˆ˜์™€ ํ˜„์žฌ์˜ ์ž…๋ ฅ์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋ธ๋ง
\[O_t = HW^{(2)}+b^{(2)}\\ H_t = \sigma(X_tW_X^{(1)}+H_{t-1}W^{(1)}_H+b^{(1)})\\ O_t = ์ถœ๋ ฅ,\ H_t=์ž ์žฌ๋ณ€์ˆ˜,\ \sigma=ํ™œ์„ฑํ™”ํ•จ์ˆ˜,\ X_tW^{(1)}=๊ฐ€์ค‘์น˜ํ–‰๋ ฌ,\ b^{(1)}=bias\\ O,H์™€\ ๋‹ฌ๋ฆฌ\ ๊ฐ€์ค‘์น˜\ ํ–‰๋ ฌ\ W\ ๋“ค์€\ t(์‹œ๊ฐ„)์—\ ๋”ฐ๋ผ\ ๋ณ€ํ•˜์ง€\ ์•Š์Œ.\]

[math. ์ž ์žฌ๋ณ€์ˆ˜ H~t~์˜ ์ƒ์„ฑ์—์„œ ์ด์ „ ์ž ์žฌ๋ณ€์ˆ˜์ธ H~t-1~ ํ™œ์šฉ]
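The two equations above can be sketched as a forward pass in numpy (toy sizes, random weights; `rnn_step` is a hypothetical helper). Note that the same weight matrices are reused at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, o = 4, 3, 2                    # input, hidden, output dimensions (toy values)
W_X = rng.normal(size=(d, h)) * 0.1  # W_X^{(1)}
W_H = rng.normal(size=(h, h)) * 0.1  # W_H^{(1)}
W_O = rng.normal(size=(h, o)) * 0.1  # W^{(2)}
b1, b2 = np.zeros(h), np.zeros(o)

def rnn_step(x_t, h_prev):
    """H_t = sigma(X_t W_X + H_{t-1} W_H + b1); O_t = H_t W_O + b2."""
    h_t = np.tanh(x_t @ W_X + h_prev @ W_H + b1)
    return h_t, h_t @ W_O + b2

h_t = np.zeros(h)
for x_t in rng.normal(size=(5, d)):  # a length-5 input sequence
    h_t, o_t = rnn_step(x_t, h_t)    # identical weights at every step
```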

  • RNN์˜ ์—ญ์ „ํŒŒ๋Š” ์ž ์žฌ๋ณ€์ˆ˜์˜ ์—ฐ๊ฒฐ ๊ทธ๋ž˜ํ”„์— ๋”ฐ๋ผ ์ˆœ์ฐจ์ ์œผ๋กœ ๊ณ„์‚ฐ๋˜๋ฉฐ ์ด๋ฅผ Backpropagtion Through Time(BPTT)๋ผ๊ณ  ํ•œ๋‹ค.

\(L(x,y,w_h,w_o)=\sum^T_{t=1}l(y_t,o_t)\\ \partial_{w_h}L(x,y,w_h,w_o)=\sum^T_{t=1}\partial_{w_h}l(y_t,o_t)=\sum^T_{t=1}\partial_{o_t}l(y_t,o_t)\partial_{h_t}g(h_t,w_o)[\partial_{w_h}h_t],\\ \partial_{w_h}h_t=\partial_{w_h}f(x_t,h_{t-1},w_h)+\sum^{t-1}_{i=1}\left(\prod^t_{j=i+1}\partial_{h_{j-1}}f(x_j,h_{j-1},w_h)\right)\partial_{w_h}f(x_i,h_{i-1},w_h)\\ while\ h_t=f(x_t,h_{t-1},w_h)\ and\ o_t =g(h_t,w_o).\)
[math. The BPTT gradient computation]

  • RNN์˜ ๊ฐ€์ค‘์น˜ํ–‰๋ ฌ์˜ ๋ฏธ๋ถ„์„ ๊ณ„์‚ฐํ•ด๋ณด๋ฉด ๋ฏธ๋ถ„์˜ ๊ณฑ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ํ•ญ์ด ๊ณ„์‚ฐ๋จ.
    • ์ด ๋ฏธ๋ถ„์˜ ๊ณฑ์€ ์‹œํ€€์Šค์˜ ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์งˆ ์ˆ˜๋ก ๊ฐ’์ด ๋ถˆ์•ˆ์ •ํ•ด์ง„๋‹ค.(๋ฌดํ•œ๋Œ€๋กœ ์ˆ˜๋ ด ๋˜๋Š” 0์œผ๋กœ, ๊ฐ’์ด ํฌ๊ฒŒ ๋ฐ”๋€œ ๋“ฑ)
    • ์ด๋ฅผ ๋ง‰๊ธฐ ์œ„ํ•ด ์ ์ ˆํ•œ ๊ธธ์ด ์‹œ์ ์—์„œ ๋Š์–ด ์ค€๋‹ค.(truncated BPTT ๊ธฐ์ˆ )

[img. LSTM๊ณผ GPU ๊ทธ๋ฆผ]

  • ์ตœ๊ทผ์—๋Š” ๊ธธ์ด๊ฐ€ ๊ธด ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค๋ฅธ RNN unit์„ ์‚ฌ์šฉํ•จ.