DKT Basics

  1. Understanding DKT and an Introduction to DKT Trends
  2. DKT Data Exploratory Data Analysis
  3. Designing a Transformer Architecture That Fits the Sequence Data Problem Definition
  4. Exploring the Kaggle Riiid Competition Winner's Solution
  5. ML Pipeline
    • Model Serving
    • End to End Project

    Deep Knowledge Tracing (DKT) Basics

    These notes summarize the DKT lectures from Naver AI boostcamp.

    Understanding DKT and an Introduction to DKT Trends

    Understanding the DKT Task

    DKT (Deep Knowledge Tracing): tracking a learner's knowledge state with deep learning

    Given problem-solving records made up of Questions and Responses, the model predicts the next knowledge state (usually: can the student solve this problem?).

    • In other words, it is also a Binary Classification problem: did the student get the given question right or wrong?

    The knowledge state keeps changing, so it has to be tracked continuously.

    Usually the questions and their results are given as the Train set,

    and the Test set consists of questions and results in which the result of the last question is masked.

    The more problem-solving records (data) are added, the more accurately the student's knowledge state can be predicted.

    The less data there is, the more easily overfitting occurs.

    Understanding the Metrics

    AUC/ACC (Area Under the ROC Curve / Accuracy)

    Predictions usually come out as floats, and a Threshold of 0.5 decides the predicted label (1 or 0).

    Understanding the Confusion Matrix

    Predicted: the model's predicted value

    Actual: the true value

    Accuracy: the fraction of all samples whose prediction matches the actual value

    Precision (PPV, positive predictive value): of the samples the model predicted as 1, the fraction that are actually 1

    Recall, Sensitivity (true positive rate, TPR): of the samples that are actually 1, the fraction the model predicted as 1

    Specificity: of the samples that are actually 0, the fraction the model predicted as 0

    F1 score: a compromise between Precision and Recall (their harmonic mean) that takes both into account.

    However, all of the metrics above depend on the Threshold (0.5 here).
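    As a minimal sketch of the definitions above, the following plain-Python helpers count the confusion-matrix cells and derive each metric. The function names and the toy labels and scores are made up for illustration; they are not from the lecture.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN after thresholding the predicted probabilities."""
    tp = fp = tn = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, specificity and F1 at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical labels and model outputs.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.2, 0.55]

metrics_at_05 = classification_metrics(y_true, y_prob, threshold=0.5)
metrics_at_07 = classification_metrics(y_true, y_prob, threshold=0.7)
```

    Comparing `metrics_at_05` with `metrics_at_07` shows the threshold dependence directly: the same scores give different accuracy, precision, and specificity once the cut-off moves.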

    AUC (Area Under the ROC Curve)

    The larger the area under the curve, the better the performance.

    AUC ranges from 0 to 1, and it is 0.5 when 0s and 1s are assigned at random.

    AUC is scale-invariant: it measures how well the predictions are ranked rather than their absolute values (the absolute magnitude of the predictions does not matter),

    and it is classification-threshold-invariant: it measures the quality of the model's predictions regardless of which classification threshold is chosen. (It does not depend on the Threshold.)

    It also has drawbacks:

    Scale invariance is not always desirable. For example, if it matters that predicted values are 0.9 or higher, AUC cannot measure that.

    Classification-threshold invariance is not always desirable either. For example, when minimizing false positives (FP) matters more (such as spam classification, where an important email must not be deleted), AUC is not a useful metric.

    On imbalanced data AUC is better than accuracy, but it tends to come out relatively high.

    (However, when the Test data is the same, relative performance comparison is still possible.)

    On the ROC curve, FPR is 1 − Specificity and TPR is Recall.

    Depending on the predicted values, the ROC curve can be drawn in the following way.

    As shown above, the smaller the overlap around the Threshold (= the part where predictions are wrong), the larger the area under the ROC Curve, which means better performance.
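    One way to make the ranking view of AUC concrete: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs; `roc_auc` and the toy data are illustrative names and values, and a real project would more likely call a library routine.

```python
def roc_auc(y_true, y_score):
    """AUC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model outputs.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.8, 0.2, 0.55]

auc = roc_auc(y_true, y_score)

# Dividing every score by 10 changes their magnitude but not their order,
# so the AUC stays the same (scale invariance).
auc_rescaled = roc_auc(y_true, [s / 10 for s in y_score])
```

    No threshold appears anywhere in the computation, which is why AUC does not change when the cut-off does.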

    DKT History and Trends

    DKT has evolved through ML, DL, Transformer, and GNN approaches.


    [See the Lecture 1 reference material, History of deep knowledge tracing]

    DKT Data Exploratory Data Analysis

    An example of EDA on a DKT Dataset

    i-Scream Data Analysis

    A Dataset provided by i-Scream edu

    The features are userID, assessmentItemID, testId, answerCode, Timestamp, and KnowledgeTag.

    In DKT, a single row is usually called an Interaction.

    userID

    • A unique number per user; there are 7442 unique users in total

    assessmentItemID

    • The serial number of the item the user solved; there are 9454 unique items in total
    • 10 characters in total: the first is always the letter A, the next 6 digits are the test number, and the last 3 digits are the item number within the test
    • ex) A030071005

    testId

    • The serial number of the test that contains the item the user solved; there are 1537 unique tests in total
    • 10 characters in total: the first is always the letter A; of the following 9 digits, the first 3 and the last 3 form the test number, and the middle 3 are always 000
    • The middle digit of the first 3 digits takes a value from 1 to 9 and can be used as a major category
    • ex) A030000071

    answerCode

    • Binary data indicating whether the user answered the item correctly: 0 is wrong, 1 is correct
    • 65.45% of all Interactions are correct, so the dataset is slightly imbalanced

    Timestamp

    • The time at which the user started the Interaction; the gap between timestamps gives an estimate of how long the user spent on a problem.

    KnowledgeTag

    • A tag assigned one per item, a kind of mid-level category
    • There are 912 unique tags in total
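    The ID layout described above can be unpacked with simple string slicing. The sketch below is only an illustration: `parse_assessment_item_id` is a hypothetical helper, and the rule that rebuilds testId from assessmentItemID is inferred from the two example IDs above rather than taken from official dataset documentation.

```python
def parse_assessment_item_id(item_id):
    """Split an assessmentItemID such as 'A030071005' into its parts."""
    if len(item_id) != 10 or item_id[0] != "A":
        raise ValueError(f"unexpected assessmentItemID format: {item_id!r}")
    test_number = item_id[1:7]       # 6 digits identifying the test
    question_number = item_id[7:10]  # 3 digits: position within the test
    # Assumption: testId is 'A' + first 3 digits + '000' + last 3 digits.
    test_id = "A" + test_number[:3] + "000" + test_number[3:]
    # Middle digit of the first 3 digits, usable as a major category.
    major_category = int(item_id[2])
    return {
        "test_number": test_number,
        "question_number": int(question_number),
        "test_id": test_id,
        "major_category": major_category,
    }


parsed = parse_assessment_item_id("A030071005")
```

    Derived columns like `major_category` or `question_number` are cheap candidates to try as extra categorical features during EDA.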

    Descriptive Statistics Analysis

    What are descriptive statistics?

    • When looking at data, descriptive statistics are usually the first thing to examine.
    • Their purpose is to summarize and simplify the information in the data as numbers.
    • We compute familiar values such as the mean, median, and max/min, and during EDA we visualize them in a meaningful way.
    • It helps to tie the analysis to the final target, the correct-answer rate.

    The following summarizes the frequency analysis of the i-Scream dataset by feature.

    The following summarizes the correct-answer-rate analysis of the i-Scream dataset by feature.

    Beyond simple descriptive statistics like the above, the relationship between the derived features and the correct-answer rate has to be analyzed, and domain knowledge and experience help here.

    For example: do people who solve more problems answer more accurately? Do problems with more frequent tags have a higher correct-answer rate? How does the time spent on an item relate to the correct-answer rate?

    Students who solved more items tend to answer more accurately.

    This graph asks whether a student's correct-answer rate tends to increase as the student solves more items. Students who did well early on mostly decline gradually, and the opposite group gradually improves.

    Overall, the trend is increasing.

    Other questions to consider include whether solving problems from the same test or tag in a row raises the correct-answer rate.

    Hands on EDA

    [Lab. ]

    Sequence Modeling

    Tabular data includes Non-Sequential Data that is unrelated to time, like Titanic, and Sequential Data that has a time order, like Transactions.

    For Sequential Data, Feature Engineering can either collapse the time axis and aggregate by a particular feature, or keep the sequence as it is and generate additional features.

    For example, after aggregating by question, test, or person, a correct-answer-rate feature can be added.

    Like hyperparameters, these features can be added or removed while checking the model's performance.

    When doing so, the data should be split around the aggregation key rather than simply by counting event rows.

    After that, the features are chosen by varying the features and hyperparameters and comparing the resulting performance.
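    A minimal sketch of the two ideas above, assuming userID is the aggregation key: split by user instead of by row, then compute an aggregated correct-answer-rate feature from the training part only. The helper names and the toy rows are hypothetical; only the column names follow the i-Scream data.

```python
import random
from collections import defaultdict


def split_by_user(rows, valid_ratio=0.2, seed=42):
    """Split interactions so that every user lands entirely in train or valid."""
    users = sorted({row["userID"] for row in rows})
    random.Random(seed).shuffle(users)
    n_valid = max(1, int(len(users) * valid_ratio))
    valid_users = set(users[:n_valid])
    train = [row for row in rows if row["userID"] not in valid_users]
    valid = [row for row in rows if row["userID"] in valid_users]
    return train, valid


def item_correct_rate(rows):
    """Mean answerCode per assessmentItemID, computed only from the given rows."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        sums[row["assessmentItemID"]] += row["answerCode"]
        counts[row["assessmentItemID"]] += 1
    return {item: sums[item] / counts[item] for item in counts}


# Toy interactions with made-up answers.
rows = [
    {"userID": user, "assessmentItemID": item, "answerCode": (user + idx) % 2}
    for user in range(10)
    for idx, item in enumerate(["A030071001", "A030071002", "A030071003"])
]

train_rows, valid_rows = split_by_user(rows)
rates = item_correct_rate(train_rows)  # aggregate on the training users only
train_feature = [rates[row["assessmentItemID"]] for row in train_rows]
```

    Splitting on the user keeps one student's history from appearing on both sides of the split, which a row-level split would not guarantee.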

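    import torch
    import torch.nn as nn

    # Size: [batch_size, seq_len, input_size or num_of_features]
    inputs = torch.randn(3, 5, 4)

    # batch_first=True makes the LSTM expect [batch_size, seq_len, input_size]
    lstm = nn.LSTM(input_size=4, hidden_size=2, batch_first=True)

    # output holds the hidden state at every step; h is the (h_n, c_n) tuple
    output, h = lstm(inputs)
    output.size()  # => torch.Size([3, 5, 2]), i.e. (batch_size, seq_len, hidden_size)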
    
    

    A Sequence input to an LSTM is laid out as follows.

    A 3-dimensional tensor of shape (batch_size (the size of one chunk of the dataset), seq_len (the length of the sequence), input_size (a 4-dimensional embedding here) or num_of_features) goes in,

    and an output of shape (batch_size, seq_len, hidden_size (a hyperparameter)) comes out.

    An example of how the input size changes with the number of features is shown above.

    import torch
    from transformers import BertConfig, BertModel

    config = BertConfig(
        vocab_size=3,  # not used, because embeddings are passed in directly
        hidden_size=4,
        num_attention_heads=1,
    )

    # Size: [batch_size, seq_len, input_size]
    inputs = torch.randn(3, 5, 4)
    # Size: [batch_size, seq_len]; 1 = attend to this position, 0 = padding
    mask = torch.ones(3, 5)

    transformer = BertModel(config)
    encoded_layers = transformer(inputs_embeds=inputs, attention_mask=mask)
    sequence_output = encoded_layers[0]  # last hidden state
    sequence_output.size()  # => torch.Size([3, 5, 4]), i.e. (batch_size, seq_len, hidden_size)
     
    

    The Transformer's input and output are not much different, except that the mask has one fewer dimension than the input.

    For a Transformer with a mix of continuous and categorical inputs, the input size depends on how the embedding layers are configured.

    Unlike continuous features, categorical features have to be turned into vectors through encoding.

    An Embedding is essentially a Lookup Table, and the Lookup Table itself is determined through training.

    The embedded values are then concatenated to form the hidden_size vector.

    At this point a Linear layer reduces the dimension of the concatenated features to match hidden size.
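    The embed, concatenate, and project steps can be sketched in PyTorch as follows. This is an illustrative module under stated assumptions, not the lecture's code: the class name `InteractionEncoder`, the choice of an elapsed-time column as the continuous feature, and the sizes `emb_dim=8` and `hidden_size=16` are all made up.

```python
import torch
import torch.nn as nn


class InteractionEncoder(nn.Module):
    """Embed categorical features, project a continuous one, concat, then project."""

    def __init__(self, n_items, n_tags, emb_dim=8, hidden_size=16):
        super().__init__()
        # Index 0 is reserved for padding, hence the +1.
        self.item_emb = nn.Embedding(n_items + 1, emb_dim, padding_idx=0)
        self.tag_emb = nn.Embedding(n_tags + 1, emb_dim, padding_idx=0)
        # Continuous feature: a Linear layer lifts the scalar to emb_dim.
        self.cont_proj = nn.Linear(1, emb_dim)
        # Reduce the concatenated vector (3 * emb_dim) to hidden_size.
        self.comb_proj = nn.Linear(emb_dim * 3, hidden_size)

    def forward(self, item_ids, tag_ids, elapsed):
        parts = [
            self.item_emb(item_ids),               # [batch, seq_len, emb_dim]
            self.tag_emb(tag_ids),                 # [batch, seq_len, emb_dim]
            self.cont_proj(elapsed.unsqueeze(-1)),  # [batch, seq_len, emb_dim]
        ]
        return self.comb_proj(torch.cat(parts, dim=-1))


encoder = InteractionEncoder(n_items=9454, n_tags=912)
item_ids = torch.randint(1, 9455, (3, 5))  # [batch_size, seq_len]
tag_ids = torch.randint(1, 913, (3, 5))
elapsed = torch.rand(3, 5)                 # hypothetical continuous feature
hidden = encoder(item_ids, tag_ids, elapsed)  # [3, 5, 16]
```

    The output has the same [batch_size, seq_len, hidden_size] layout that the LSTM and BERT examples above take as input.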

    In DKT, when a Transformer is used, a Sequence is usually built per user and each one is fed in as a train input.

    Because the DKT Task is to predict whether the last question is answered correctly, Padding is usually added at the front so that the end of every sequence lines up.
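    A minimal sketch of front padding, assuming 0 is free to use as the padding value. `pad_left` is a hypothetical helper; it also returns the matching mask (1 for real interactions, 0 for padding) and keeps only the most recent `max_len` interactions when a sequence is too long.

```python
def pad_left(sequence, max_len, pad_value=0):
    """Left-pad (or truncate from the front) so the last element stays last."""
    sequence = list(sequence)[-max_len:]  # keep the most recent interactions
    n_pad = max_len - len(sequence)
    padded = [pad_value] * n_pad + sequence
    mask = [0] * n_pad + [1] * len(sequence)
    return padded, mask


short_padded, short_mask = pad_left([3, 1, 2], max_len=5)
long_padded, long_mask = pad_left([1, 2, 3, 4, 5, 6], max_len=4)
```

    With this layout the interaction to be predicted always sits at index -1, regardless of how long the user's history is.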

    Designing a Transformer Architecture That Fits the Sequence Data Problem Definition

    The Transformer architecture is strong on many kinds of Sequence data, but it demands a lot of data and computation, so it often has to be adapted to the situation, and sometimes a different model has to be used altogether.

    inductive bias: models designed for a specific purpose (CNN, RNN) carry a bias tied to the form of their input, so they perform poorly on unsuitable input (for example, feeding Sequential input to a CNN).

    The Transformer has comparatively little inductive bias, but it needs correspondingly more data.

    To adapt the Transformer, let's look at some variations of the Transformer architecture.

    Data Science Bowl

    The goal of the competition is to predict whether children aged 3 to 5 have correctly learned basic math concepts.

    Based on past data, the task is to predict how a child will solve an assessment, using 4 classes (correct on the first try, correct after one mistake, correct after several mistakes, never correct).

    The provided records include learning time, the type of learning activity (video, game, activity, assessment, etc.), the game world, and usage information.

    At the time the Transformer architecture was not yet widely used, and resources and data were limited, yet one participant took 3rd place with BERT, a Transformer-Encoder model. The main points were:

    • how the different categorical and continuous data were embedded,
    • how BERT was used.

    The different categorical and continuous features were embedded in the following way,

    and, as shown above, Transformer blocks were stacked deep and the output of the last Transformer block was passed through softmax for classification.

    In the last layer, the outputs of the lightly shaded Transformer blocks (all except the last one) are not used, and they are not updated by Backpropagation of the Loss.

    Riiid!

    The data collects the learning histories of students preparing for the TOEIC exam, and the task is to predict whether a student will get the last item they attempt right or wrong, which makes it very similar to the i-Scream data.
    However, the dataset is much larger, and it includes lecture-watching interactions and, beyond whether the answer was correct, which answer the user chose and whether the user reviewed the explanation afterwards.

    Because there was so much data, every 2 embedded Sequence steps were concatenated into one, halving the Sequence length while doubling the embedding dimension per step, which reduced the time complexity of training.
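    The length-halving trick can be sketched as a single reshape, assuming the sequence length is even and the tensor is laid out as [batch_size, seq_len, emb_dim]. The sizes below are arbitrary illustration values.

```python
import torch

batch_size, seq_len, emb_dim = 2, 6, 4
embedded = torch.randn(batch_size, seq_len, emb_dim)

# Merge every two consecutive steps into one step with twice the features:
# [batch, seq_len, emb_dim] -> [batch, seq_len // 2, emb_dim * 2]
halved = embedded.reshape(batch_size, seq_len // 2, emb_dim * 2)

# Step 0 of the new sequence is steps 0 and 1 of the old one, concatenated.
first_pair = torch.cat([embedded[0, 0], embedded[0, 1]])
```

    Self-attention cost grows as L² · d, so halving L and doubling d gives (L/2)² · 2d = L² · d / 2, about half the original attention cost.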

    Predicting Molecular Properties

    A competition to find the coupling constant between atoms from various information about a molecule.

    The data includes bond information between atoms in a molecule, shielding effects between atoms, the energy state of the molecule, the charge state of atoms in the molecule, and detailed components of the coupling constant.

    Many teams approached it with LGBM or Graph NNs.

    Because every scalar_coupling_constant has to be computed for the possible atom combinations in each molecule, the data can be viewed as Sequence Data in which position does not matter. If one molecule is the whole Sequence and each atom pair is one element of it, then, as in the figure above, the result must be the same even when the order of the atom pairs differs.

    The Transformer architecture is a good fit: every token in the Sequence attends to every other token, and positional information enters only through the Positional Embedding.

    • In other words, since position does not matter, simply leave out the Positional Embedding (Permutation Invariant Transformer)

    Each molecule has 135 atom pairs, so the Sequence Length is 135 scalar coupling constants (SC) in total,

    and the information of the two atoms plus the relation between them is embedded: the input is a vector embedding each atom's charge, position, and atomic number, the distance between the atoms, and the bond type.

    Because the scalar coupling constant (SC) to be predicted is the sum of fc, sd, pso, and dso, the submitted prediction averaged two kinds of results: a Transformer that predicts SC directly, and one that predicts fc, sd, pso, and dso separately.

    Mechanisms of Actions (MoA)

    A competition to predict which chemical response occurs when a drug is administered. The data includes the type, dose, and duration of the drug, how the drug was synthesized, the recipient's gene expression (772 features), and cell viability.

    There was no data that could be grouped into a Sequence, there were too many Features, and there were only about 23,000 samples for 207 classes to predict, so the Transformer architecture reportedly did not work well.

    For this reason a CNN model like the one in the figure above was used.

    1. The gene information and cell viability information, which have many Features, are reduced by PCA (principal component analysis) to 50-dimensional and 15-dimensional vectors
    2. These are concatenated with the original features as additional features
    3. The result is passed through a Linear layer and converted to a larger 1-dimensional vector
      • Increasing the dimension through linear feature ordering produces enough usable "pixels"
      • This has the effect of learning the optimal ordering of features within the generated data
      • It gives every element of the vector a comparable meaning
    4. This is reshaped into short 1D data with many channels
    5. The data is passed through a Conv1D Architecture to produce the final result
      • The kernel size is often n x embedding size (if the second dimension is not the embedding size, the kernel only covers part of one feature's embedding)

    This model performed best among the single models.

    Exploring the Kaggle Riiid Competition Winner's Solution

    Feature Engineering

    Common FE

    There are 2 approaches to Feature Engineering.

    1. Bottom-Up

    A data-driven approach:

    1) Examine characteristics through EDA,

    2) verify those characteristics on the Test Data,

    3) use them to build new features and confirm that the CV (Cross Validation) score goes up,

    • Run K-fold Validation by time and by group, and keep trying until the score no longer improves (until nothing is wrong).

    4) then build the model and search for hyperparameters.

    For example, find points that deviate from a normal distribution and turn them into a feature.

    2. Top-Down

    A consulting-style methodology based on hypotheses and domain knowledge (logical thinking).

    It consists of hypothesis, implementation, and verification.

    For Feature Extraction:

    1) Pose questions and hypotheses about the data

    2) Visualize, transform, and model the data to look for answers to the hypotheses (implementation and performance evaluation)

    3) Based on what was learned along the way, refine the hypotheses and generate new ones

    Using both approaches together works best.

    After that, for tabular data, separate the Features into Numerical and Categorical types and run EDA suited to each type.

    For example, for numerical features look at the mean, range, kurtosis, and so on,

    and for categorical features look at Missing values, the Count and percentage per value, the most frequent value, and so on.

    The relationship with the Target can be examined by drawing Bar plots, hist plots, and the like.

    In the case of Riiid

    1) From the pattern of how items are solved...

    We can tell whether the item was solved before, or whether the user just marked the same choice for every answer.

    Unfortunately, the i-Scream data has no information about the answer choices.

    2) From the average time a user takes to solve an item...

    When an item took a long time, the average time of students who got it right and of students who got it wrong can be supplied as Features.

    3) From the trend of the user's correct-answer rate...

    The recent correct-answer rate can be used to predict whether upcoming items will be answered correctly.

    • If the recent correct-answer rate is falling, the student does not know the current items well, so it will likely keep falling.

    4) When an item that was already solved appears again...

    The student either got it right before, or got it wrong and probably reviewed it, so the correct-answer rate may go up.

    5) From the average correct-answer rate of the item, test, and tag...

    For easy items, tests, and tags the correct-answer rate may be higher.

    In addition, the more information there is about the problem the user is solving (the item's correct-answer rate, the correct-answer rate of the item's tag), the easier it is to use.

    • content2vec from item-tag information,

    • SVD, LDA, and item2vec from user-item information,
    • IRT and ELO, which characterize the items,

    are examples of implicit information that can be used.

    Data Leakage

    If adding a Feature gives better results, is it safe to use it?

    Consider the average correct-answer rate of an item as a Feature.

    The average correct-answer rate is computed without the validation and test datasets.

    So the correct-answer rate computed on the available data may differ from the actual correct-answer rate.

    In past industry practice the data was often handed over without proper regard for time, for example May 1 to 8 as the train dataset, May 8 to 10 as the validation set, and May 11 to 15 as the test dataset, which frequently led to incorrect results. (An inductive bias problem?)

    Recently, however, a time series API is used so that at inference the features are updated every time a row is processed, so this is no longer a problem.

    Extracting item-specific Features with various methods

    Matrix Factorization, widely used in recommender systems, can produce a vector for each user and a vector for each item. (Recently Factorization Machines are used more often.)

    For the Riiid and i-Scream data, a user-item matrix can be built from the users who solved problems and the items they solved.

    Alternatively, Singular Value Decomposition (SVD) from linear algebra can be used in a similar way.

    Theories of difficulty such as ELO and IRT (Item Response Theory) can also be used.

    These theories assume that each student and each item has its own intrinsic characteristics.

    • Each student is assumed to have a latent ability, and each item is assumed to have its own function that takes the student's latent ability and returns the probability of answering the item correctly.

    • If the students' latent abilities and the per-item parameters are known, the probability of every student answering every item correctly is known as well.

    The item's function is defined as follows.
    \[\phi(\theta;\beta)=c+\frac{1-c}{1+e^{-(\theta-\beta)}}\]
    Here $\theta$ is the student's ability, $\beta$ is the parameter of the item's function (its difficulty), and $c$ is the probability of answering correctly by random guessing (0.25 for a four-choice item).
    IRT (Item Response Theory) adds further assumptions to this so that the per-item function can take more varied forms.

    For Riiid, a simple method for estimating $\theta$ and $\beta$ can also be used.

    1. Initialize $\theta$ for every student and $\beta$ for every item to 0.
    2. Update $\theta$ and $\beta$ with the formula below (correct is the binary 0/1 answer). Because $\phi$ increases with $\theta-\beta$, a correct answer raises $\theta$ and lowers $\beta$.
    \[\theta_{n+1}\leftarrow \theta_n + \eta_{\theta_n}\,(correct-\phi(\theta_n;\beta_n))\\ \beta_{n+1}\leftarrow \beta_n - \eta_{\beta_n}\,(correct-\phi(\theta_n;\beta_n))\]
    3. Repeat this over the whole dataset to find the final values.
    4. Use these values to compute each student's probability of answering each item correctly in the test data.
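    The estimation steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the competition code: it assumes a constant learning rate `lr` in place of the per-student and per-item rates $\eta$, treats $\beta$ as item difficulty (so it moves in the opposite direction to $\theta$), and uses made-up interactions.

```python
import math


def phi(theta, beta, c=0.25):
    """Probability of a correct answer given ability theta and difficulty beta."""
    return c + (1 - c) / (1 + math.exp(-(theta - beta)))


def fit_elo(interactions, lr=0.1, c=0.25):
    """One pass over (user, item, correct) triples, updating theta and beta."""
    theta = {}  # student ability, starts at 0
    beta = {}   # item difficulty, starts at 0
    for user, item, correct in interactions:
        t = theta.get(user, 0.0)
        b = beta.get(item, 0.0)
        error = correct - phi(t, b, c)
        theta[user] = t + lr * error  # correct answer -> ability goes up
        beta[item] = b - lr * error   # correct answer -> difficulty goes down
    return theta, beta


# Hypothetical interactions: u1 answers everything correctly, u2 nothing.
interactions = [
    ("u1", "q1", 1),
    ("u1", "q2", 1),
    ("u2", "q1", 0),
    ("u2", "q2", 0),
]
theta, beta = fit_elo(interactions)
p_u1_q1 = phi(theta["u1"], beta["q1"])
p_u2_q1 = phi(theta["u2"], beta["q1"])
```

    The resulting probabilities, such as `p_u1_q1`, can be used directly as a feature for each student and item pair.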

    Continuous Embedding

    Usually continuous data, unlike categorical data, is fed in without being embedded.

    For categorical data, one column of the embedding matrix is used. That is not possible for a continuous value, so instead the given continuous value gets the largest weight, its neighboring values get smaller weights, and the weighted sum of those columns of the embedding matrix is used as the embedding.

    For example, with embeddings prepared for 1 to 100, embedding the value 50 uses (embedding of 50 * 0.45) + (embedding of 49 * 0.18) + (embedding of 51 * 0.18) + (embedding of 48 * 0.09) + (embedding of 52 * 0.09) ..., with weights that roughly follow a normal distribution.
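    A minimal sketch of this weighted-sum embedding in plain Python. The Gaussian width `sigma`, the `window` of 2 neighbors on each side, and the random table are assumptions chosen for illustration, so the weights will not match the example numbers above exactly; in a real model the table would be a learned embedding matrix.

```python
import math
import random


def neighbor_weights(value, n_bins, sigma=1.0, window=2):
    """Normalized Gaussian weights over the bins near `value` (bins are 1..n_bins)."""
    center = int(round(value))
    bins = [b for b in range(center - window, center + window + 1) if 1 <= b <= n_bins]
    raw = [math.exp(-((b - value) ** 2) / (2 * sigma ** 2)) for b in bins]
    total = sum(raw)
    return {b: w / total for b, w in zip(bins, raw)}


def continuous_embedding(value, table, sigma=1.0, window=2):
    """Weighted sum of the embedding vectors of the bins around `value`."""
    weights = neighbor_weights(value, len(table), sigma, window)
    dim = len(table[0])
    out = [0.0] * dim
    for b, w in weights.items():
        row = table[b - 1]  # bin b is stored at index b - 1
        for j in range(dim):
            out[j] += w * row[j]
    return out


rng = random.Random(0)
# Stand-in for a learned embedding table: 100 bins, 4 dimensions each.
table = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(100)]

weights_50 = neighbor_weights(50, n_bins=100)
embedding_50 = continuous_embedding(50, table)
```

    Nearby values such as 50 and 50.3 end up with similar embeddings, which a plain one-vector-per-bin lookup would not guarantee.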

    Last Query Transformer RNN

    In general,

    1. For Machine Learning models such as LGBM or DNN,
      • a lot of Feature Engineering is needed to produce many Features and to find the meaningful ones.
    2. For Deep Learning models such as the Transformer,
      • the model finds Features on its own, so less FE is used, but it requires a very large amount of data and its time complexity grows with the square of the sequence length, which is a burden.
      • For Tabular data, a lot of FE is still needed, and in most cases FE can still improve performance.

    Resolving deficits

    Last Query Transformer RNN, the 1st-place solution in Riiid, took first place by addressing both of the problems above.

    Its characteristics:

    1. It does not use many Features and instead increases the sequence length (the resulting growth in time complexity is handled by the next point).

      • Only 5 features are used, while other top models used 70 to 80
    2. It uses only the last Query to lower the time complexity

      • In general, multiplying an $n \times m$ matrix by an $m \times l$ matrix has time complexity $O(nml)$.
      • In the Transformer, the Query, Key, and Value matrices Q, K, V each have shape (L, d), and the Attention Score is computed as follows.
      \[Att(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V\]
      Scaled dot-product attention: the dot product is used to compute similarity.
      • The time complexity is therefore $O(L^2d)$.
      • If only the last Query is used, the shape of Q shrinks from (L, d) to (1, d).
      • So the final cost drops to $O(Ld)$.

    3. The Transformer captures the relationships between questions, an LSTM extracts features along the Sequence, and a final DNN predicts the answer for each Sequence

    • Positional embedding and the look-ahead mask are left out so that the relationships between inputs are captured regardless of order
    • After that, an LSTM is used to capture the Sequential characteristics
    • This made it possible to increase the number of Encoders (= number of Layers) and the Sequence length, which improved performance.
    • The sequence length is more than 3 times that of the BERT model (512 vs 1728)
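    The complexity argument in point 2 can be checked with a short PyTorch sketch. It is a single-head illustration without learned projections, and the function names and tensor sizes are made up: attention is computed once with all L queries and once with only the last query, and the last row of the full result matches the cheaper computation.

```python
import math
import torch


def full_attention(q, k, v):
    """Standard scaled dot-product attention; the score matrix is [batch, L, L]."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    return torch.softmax(scores, dim=-1) @ v  # [batch, L, d]


def last_query_attention(q, k, v):
    """Use only the last position as the query; the score matrix is [batch, 1, L]."""
    d = q.size(-1)
    last_q = q[:, -1:, :]                                  # [batch, 1, d]
    scores = last_q @ k.transpose(-2, -1) / math.sqrt(d)   # [batch, 1, L]
    return torch.softmax(scores, dim=-1) @ v               # [batch, 1, d]


torch.manual_seed(0)
batch_size, seq_len, dim = 2, 10, 8
q = torch.randn(batch_size, seq_len, dim)
k = torch.randn(batch_size, seq_len, dim)
v = torch.randn(batch_size, seq_len, dim)

full_out = full_attention(q, k, v)
last_out = last_query_attention(q, k, v)
```

    Since no look-ahead mask is applied, the last query still attends to every position, so nothing needed for predicting the final interaction is lost.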

    ML Pipeline

    See [DKT-8]ML_Pipeline.ipynb

    Model Serving

    Types of model serving

    On-device Serving

    Cloud-based Serving

    Model serving with a web server

    HTTP communication

    Building a web server

    Model serving with MLflow

    MLflow

    Example system

    End to End Project

    Comparing real industry work with Competitions

    The 3 elements of problem definition

    Input (Data_X, DataType)

    Output (Data_Y, the value to predict)

    Metric (evaluation metric)

    Workflow

    What is a Workflow?

    Workflow management

    Workflow management with Apache Airflow

    Airflow is a platform for programmatically authoring, scheduling, and monitoring Workflows; it is a workflow management tool written in Python.

    • The project moved from Airbnb to Apache

    Airflow consists mainly of

    a Webserver, a Scheduler, Workers, and a Meta DB.

    Introduction to the toy project