Supervised Learning Cheatsheet


CS 229 - Machine Learning
https://stanford.edu/~shervine/l/ar/

Afshine Amidi and Shervine Amidi, Fall 2018
Translated by Fares Alqunaieer. Reviewed by Zaid Alyafeai (translation dated 14 Rabi' al-Thani 1441 AH).

Introduction to Supervised Learning

Given a set of data points {x(1), ..., x(m)} associated with a set of outcomes {y(1), ..., y(m)}, we want to build a classifier that learns how to predict y from x.

❒ Type of prediction – The different types of prediction models are summed up in the table below:

                  Regression            Classification
Outcome           Continuous            Class
Examples          Linear regression     Logistic regression, SVM, Naive Bayes

❒ Type of model – The different types of models are summed up in the table below:

                  Discriminative model              Generative model
Goal              Directly estimate P(y|x)          Estimate P(x|y), then deduce P(y|x)
What is learned   Decision boundary                 Probability distribution of the data
Examples          Regressions, SVM                  GDA, Naive Bayes

Notations and general concepts

❒ Hypothesis – The hypothesis, noted hθ, is the model we choose. For a given input x(i), the output predicted by the model is hθ(x(i)).

❒ Loss function – A loss function is a function L : (z,y) ∈ R × Y ⟼ L(z,y) ∈ R that takes as inputs the predicted value z and the real value y, and outputs how different they are. The most common loss functions are summed up in the table below:

Least squared error      Logistic loss           Hinge loss          Cross-entropy
(1/2)(y − z)²            log(1 + exp(−yz))       max(0, 1 − yz)      −[y log(z) + (1 − y) log(1 − z)]
Linear regression        Logistic regression     SVM                 Neural networks

❒ Cost function – The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:

J(θ) = ∑_{i=1}^{m} L(hθ(x(i)), y(i))

❒ Gradient descent – With a learning rate α ∈ R, the update rule of gradient descent is expressed with the learning rate and the cost function J as follows:

θ ← θ − α∇J(θ)
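
To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent on a least-squares cost; the synthetic data, step size and iteration count are assumptions made only for this example:

    import numpy as np

    # Synthetic regression data (assumed for the example): m examples, n features.
    rng = np.random.default_rng(0)
    m, n = 100, 3
    X = rng.normal(size=(m, n))
    true_theta = np.array([1.0, -2.0, 0.5])
    y = X @ true_theta + 0.1 * rng.normal(size=m)

    alpha = 0.1          # learning rate
    theta = np.zeros(n)  # initial parameters

    for _ in range(500):
        grad = X.T @ (X @ theta - y) / m   # gradient of the averaged least-squares cost
        theta = theta - alpha * grad       # theta <- theta - alpha * grad J(theta)

    print(theta)  # should end up close to true_theta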




Remark: in stochastic gradient descent (SGD), the parameters are updated based on each training example separately, whereas batch gradient descent updates them using batches of training examples.

❒ Likelihood – The likelihood L(θ) of a model, where θ denotes the parameters, is used to find the best parameters θ by maximizing the likelihood. In practice, the log-likelihood ℓ(θ) = log(L(θ)) is used instead, since it is easier to optimize. We have:

θ_opt = arg max_θ L(θ)

❒ Newton's algorithm – Newton's algorithm is a numerical method for finding θ such that ℓ′(θ) = 0. Its update rule is as follows:

θ ← θ − ℓ′(θ) / ℓ″(θ)

Remark: there is a more general multidimensional version, called the Newton-Raphson method, whose update rule is:

θ ← θ − (∇²_θ ℓ(θ))⁻¹ ∇_θ ℓ(θ)
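
As an illustration, a minimal sketch of the one-dimensional Newton update applied to maximizing a log-likelihood; the coin-flip data and the starting point are assumptions made for the demo:

    import numpy as np

    # Coin-flip data (assumed): the MLE of the Bernoulli parameter is the sample mean.
    y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

    phi = 0.5  # starting point (assumption)
    for _ in range(10):
        d1 = y.sum() / phi - (1 - y).sum() / (1 - phi)          # l'(phi)
        d2 = -y.sum() / phi**2 - (1 - y).sum() / (1 - phi)**2   # l''(phi)
        phi = phi - d1 / d2                                     # Newton update

    print(phi, y.mean())  # both values should agree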

Linear regression

Here, we assume that y|x; θ ∼ N(µ, σ²).

❒ Normal equations – With X the design matrix, the value of θ that minimizes the cost function has a closed-form solution given by:

θ = (XᵀX)⁻¹ Xᵀy

❒ LMS algorithm – With a learning rate α, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m examples, also known as the Widrow-Hoff learning rule, is as follows:

∀j,  θ_j ← θ_j + α ∑_{i=1}^{m} [y(i) − hθ(x(i))] x_j(i)

Remark: this update rule is a particular case of gradient descent.

❒ LWR – Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with a parameter τ ∈ R as follows:

w(i)(x) = exp(−(x(i) − x)² / (2τ²))
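
A small NumPy sketch contrasting the closed-form normal equations with the iterative LMS rule above; the random data, step size and iteration count are assumptions for the example:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n = 200, 4
    X = rng.normal(size=(m, n))
    y = X @ rng.normal(size=n) + 0.05 * rng.normal(size=m)

    # Closed form: theta = (X^T X)^-1 X^T y (solve is preferred over an explicit inverse).
    theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

    # Iterative Widrow-Hoff / LMS updates.
    alpha = 0.001
    theta_lms = np.zeros(n)
    for _ in range(1000):
        theta_lms += alpha * X.T @ (y - X @ theta_lms)

    print(np.allclose(theta_closed, theta_lms, atol=1e-3))  # both estimates coincide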

Classification and logistic regression

❒ Sigmoid function – The sigmoid function g, also known as the logistic function, is defined as follows:

∀z ∈ R,  g(z) = 1 / (1 + e^(−z))  ∈ ]0,1[

❒ Logistic regression – We assume here that y|x; θ ∼ Bernoulli(ϕ). We then have:

ϕ = p(y = 1|x; θ) = 1 / (1 + exp(−θᵀx)) = g(θᵀx)

Remark: logistic regression does not have a closed-form solution.

❒ Softmax regression – Softmax regression, also called multiclass logistic regression, is used to generalize logistic regression when there are more than two outcome classes. By convention, we set θ_K = 0, which makes the Bernoulli parameter ϕ_i of each class i equal to:

ϕ_i = exp(θ_iᵀx) / ∑_{j=1}^{K} exp(θ_jᵀx)
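
For illustration, a short NumPy sketch of the sigmoid and softmax mappings defined above; the parameter values are arbitrary assumptions:

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^-z), maps R to ]0,1[
        return 1.0 / (1.0 + np.exp(-z))

    def softmax_probs(Theta, x):
        # phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x); rows of Theta are the theta_i
        scores = Theta @ x
        scores -= scores.max()            # subtract the max for numerical stability
        e = np.exp(scores)
        return e / e.sum()

    theta = np.array([1.0, -2.0])         # logistic regression parameters (assumed)
    x = np.array([0.5, 0.25])
    print(sigmoid(theta @ x))             # P(y = 1 | x; theta)

    Theta = np.array([[1.0, 0.0],         # one row per class; last row set to 0 by convention
                      [0.2, -0.3],
                      [0.0, 0.0]])
    print(softmax_probs(Theta, x))        # class probabilities summing to one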




Generalized Linear Models (GLM)

❒ Exponential family – A class of distributions is said to belong to the exponential family if it can be written in terms of a canonical parameter η, a sufficient statistic T(y) and a log-partition function a(η) as follows:

p(y; η) = b(y) exp(η T(y) − a(η))

Remark: we will often have T(y) = y. Also, exp(−a(η)) can be seen as a normalization constant that makes sure the probabilities sum to one.

The most common exponential family distributions are summed up in the following table:

Distribution   η                 T(y)   a(η)                    b(y)
Bernoulli      log(ϕ/(1 − ϕ))    y      log(1 + exp(η))         1
Gaussian       µ                 y      η²/2                    (1/√(2π)) exp(−y²/2)
Poisson        log(λ)            y      e^η                     1/y!
Geometric      log(1 − ϕ)        y      log(e^η/(1 − e^η))      1

❒ Assumptions of GLMs – Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x ∈ R^(n+1), and rely on the following three assumptions:

(1) y|x; θ ∼ ExpFamily(η)    (2) hθ(x) = E[y|x; θ]    (3) η = θᵀx

Remark: ordinary least squares and logistic regression are both special cases of generalized linear models.
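
As a quick sanity check of the exponential-family form, the sketch below evaluates the Bernoulli row of the table and compares it with the usual pmf ϕ^y (1 − ϕ)^(1−y); the chosen ϕ is an assumption for the example:

    import numpy as np

    phi = 0.3                                  # Bernoulli parameter (assumed)
    eta = np.log(phi / (1 - phi))              # canonical parameter
    a = np.log(1 + np.exp(eta))                # log-partition function
    for y in (0, 1):
        p_expfam = 1.0 * np.exp(eta * y - a)   # b(y) exp(eta*T(y) - a(eta)), with T(y)=y, b(y)=1
        p_direct = phi**y * (1 - phi)**(1 - y)
        print(y, p_expfam, p_direct)           # the two columns match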

Support Vector Machines

The goal of support vector machines is to find the line that maximizes the minimum distance to the line.

❒ Optimal margin classifier – The optimal margin classifier h is defined as:

h(x) = sign(wᵀx − b)

where (w, b) ∈ Rⁿ × R is the solution of the following optimization problem:

min (1/2)‖w‖²    such that    y(i)(wᵀx(i) − b) ⩾ 1

Remark: the decision line is defined as wᵀx − b = 0.

❒ Hinge loss – The hinge loss is used in the setting of SVMs, and is defined as follows:

L(z, y) = [1 − yz]₊ = max(0, 1 − yz)

❒ Kernel – Given a feature mapping ϕ, we define the kernel K as follows:

K(x, z) = ϕ(x)ᵀϕ(z)

In practice, the kernel K defined by K(x, z) = exp(−‖x − z‖² / (2σ²)) is called the Gaussian kernel, and is commonly used.

Remark: we say that we use the "kernel trick" to compute the cost function with a kernel because we do not actually need to know the explicit mapping ϕ, which is often very complicated; we only need the values K(x, z).

❒ Lagrangian – We define the Lagrangian L(w, b) as follows:

L(w, b) = f(w) + ∑_{i=1}^{l} β_i h_i(w)

Remark: the coefficients β_i are called the Lagrange multipliers.
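
A minimal sketch of the hinge loss and the Gaussian kernel defined above; the points and the bandwidth σ are assumptions chosen only to illustrate the formulas:

    import numpy as np

    def hinge_loss(z, y):
        # L(z, y) = max(0, 1 - y*z)
        return np.maximum(0.0, 1.0 - y * z)

    def gaussian_kernel(x, z, sigma=1.0):
        # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    x = np.array([1.0, 2.0])
    z = np.array([0.5, 1.0])
    print(hinge_loss(0.3, 1), hinge_loss(1.5, 1))   # 0.7 and 0.0
    print(gaussian_kernel(x, z, sigma=0.5))
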

Generative Learning

A generative model first tries to learn how the data was generated by estimating P(x|y), which can then be used to estimate P(y|x) through Bayes' rule.

Gaussian Discriminant Analysis

❒ Setting – Gaussian Discriminant Analysis assumes that y, x|y = 0 and x|y = 1 are such that:

y ∼ Bernoulli(ϕ),    x|y = 0 ∼ N(µ₀, Σ),    x|y = 1 ∼ N(µ₁, Σ)

❒ Estimation – The following table sums up the estimates found when maximizing the likelihood:

ϕ̂ = (1/m) ∑_{i=1}^{m} 1{y(i) = 1}

µ̂_j = ∑_{i=1}^{m} 1{y(i) = j} x(i) / ∑_{i=1}^{m} 1{y(i) = j}    (j = 0, 1)

Σ̂ = (1/m) ∑_{i=1}^{m} (x(i) − µ_{y(i)})(x(i) − µ_{y(i)})ᵀ
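
The maximum-likelihood estimates above translate directly into NumPy; the synthetic two-class data below is an assumption used only to exercise the formulas:

    import numpy as np

    rng = np.random.default_rng(2)
    m = 500
    y = rng.integers(0, 2, size=m)                           # labels in {0, 1}
    mu_true = {0: np.array([-1.0, 0.0]), 1: np.array([2.0, 1.0])}
    X = np.stack([rng.normal(size=2) + mu_true[label] for label in y])

    phi_hat = y.mean()                                        # fraction of class-1 examples
    mu_hat = {j: X[y == j].mean(axis=0) for j in (0, 1)}      # per-class means
    centered = X - np.stack([mu_hat[label] for label in y])
    Sigma_hat = centered.T @ centered / m                     # shared covariance estimate

    print(phi_hat, mu_hat[0], mu_hat[1])
    print(Sigma_hat)   # close to the identity for this synthetic data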

Naive Bayes

❒ Assumption – The Naive Bayes model assumes that the features of every data point are all independent:

P(x|y) = P(x₁, x₂, ...|y) = P(x₁|y) P(x₂|y) ... = ∏_{i=1}^{n} P(x_i|y)




❒ Solutions – Maximizing the log-likelihood gives the following solutions, for k ∈ {0, 1} and l ∈ [[1, L]]:

P(y = k) = (1/m) × #{j | y(j) = k}    and    P(x_i = l | y = k) = #{j | y(j) = k and x_i(j) = l} / #{j | y(j) = k}

Remark: Naive Bayes is widely used for text classification and spam detection.
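
A compact sketch of these counting estimates for binary features, with Laplace smoothing deliberately left out to stay close to the formulas; the toy spam-style dataset is an assumption for the example:

    import numpy as np

    # Toy data (assumed): rows are documents, columns are binary word-presence features.
    X = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 0, 1],
                  [1, 1, 0]])
    y = np.array([1, 1, 0, 0, 1])          # 1 = spam, 0 = not spam

    prior = {k: np.mean(y == k) for k in (0, 1)}            # P(y = k)
    cond = {k: X[y == k].mean(axis=0) for k in (0, 1)}      # P(x_i = 1 | y = k), per feature

    def predict(x):
        # Pick the class maximizing P(y=k) * prod_i P(x_i | y=k), using the independence assumption.
        scores = {}
        for k in (0, 1):
            p = cond[k]
            scores[k] = prior[k] * np.prod(np.where(x == 1, p, 1 - p))
        return max(scores, key=scores.get)

    print(predict(np.array([1, 0, 0])))    # classified using the counts above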

Tree-based and ensemble methods

These methods can be used for both regression and classification problems.

❒ CART – Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. One advantage of this method is that it is easily interpretable.

❒ Random forest – A random forest is a tree-based technique that uses a large number of decision trees, each built out of a randomly selected set of features. Contrary to a simple decision tree, the model is not easily interpretable, but its generally good performance makes it a popular algorithm.

Remark: random forests are a type of ensemble method.

❒ Boosting – The idea of boosting methods is to combine several weak learners to form a stronger one. The main types are summed up below:

Adaptive boosting
- High weights are put on the errors in order to improve the result at the next boosting step.
- Known as "Adaboost".

Gradient boosting
- Weak learners are trained on the remaining (residual) errors.
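
If scikit-learn is available, the random forest and the two boosting flavours described above can be tried in a few lines; the synthetic dataset and hyperparameters here are illustrative assumptions, not recommendations:

    # Requires scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  GradientBoostingClassifier)
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for model in (RandomForestClassifier(n_estimators=100, random_state=0),
                  AdaBoostClassifier(n_estimators=100, random_state=0),
                  GradientBoostingClassifier(random_state=0)):
        model.fit(X_tr, y_tr)
        print(type(model).__name__, model.score(X_te, y_te))  # held-out accuracy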

Other non-parametric approaches

❒ k-nearest neighbors – The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach in which the response of a data point is determined by the k data points closest to it in the training set. It can be used in both classification and regression settings.

Remark: the higher the parameter k, the higher the bias; the lower the parameter k, the higher the variance.
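
A self-contained NumPy sketch of k-NN classification by majority vote; the toy points and the choice k = 3 are assumptions for the example:

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        # Label x by a majority vote among its k nearest training points (Euclidean distance).
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = y_train[nearest]
        return np.bincount(votes).argmax()

    X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2], [1.1, 0.9]])
    y_train = np.array([0, 0, 1, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.1, 0.0])))   # expected: class 0
    print(knn_predict(X_train, y_train, np.array([1.0, 1.1])))   # expected: class 1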




Learning Theory

❒ Union bound – Let A₁, ..., A_k be k events. We have:

P(A₁ ∪ ... ∪ A_k) ⩽ P(A₁) + ... + P(A_k)

❒ Hoeffding inequality – Let Z₁, ..., Z_m be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ϕ̂ be their sample mean and γ > 0 a fixed constant. We have:

P(|ϕ − ϕ̂| > γ) ⩽ 2 exp(−2γ²m)

Remark: this inequality is also known as the Chernoff bound.

❒ Training error – For a given classifier h, the training error ϵ̂(h), also known as the empirical risk or empirical error, is defined as follows:

ϵ̂(h) = (1/m) ∑_{i=1}^{m} 1{h(x(i)) ≠ y(i)}

❒ Probably Approximately Correct (PAC) – PAC is a framework under which numerous learning-theory results are proved. It relies on the following assumptions:

• the training and testing sets follow the same distribution;
• the training examples are drawn independently.

❒ Shattering – Given a set S = {x(1), ..., x(d)} and a set of classifiers H, we say that H shatters S if, for any set of labels {y(1), ..., y(d)}, we have:

∃h ∈ H,  ∀i ∈ [[1, d]],  h(x(i)) = y(i)

❒ Upper bound theorem – Let H be a finite hypothesis class such that |H| = k, and let δ and the sample size m be fixed. Then, with probability of at least 1 − δ, we have:

ϵ(ĥ) ⩽ (min_{h∈H} ϵ(h)) + 2 √((1/(2m)) log(2k/δ))

❒ VC dimension – The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H), is the size of the largest set that is shattered by H.

Remark: the VC dimension of H = {set of linear classifiers in 2 dimensions} is 3.

❒ Vapnik theorem – Let H be given, with VC(H) = d, and let m be the number of training examples. Then, with probability of at least 1 − δ, we have:

ϵ(ĥ) ⩽ (min_{h∈H} ϵ(h)) + O(√((d/m) log(m/d) + (1/m) log(1/δ)))
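
To make the Hoeffding inequality above concrete, the sketch below estimates P(|ϕ − ϕ̂| > γ) by simulation and compares it with the bound 2 exp(−2γ²m); the values of ϕ, γ, m and the number of trials are assumptions chosen for the example:

    import numpy as np

    rng = np.random.default_rng(3)
    phi, gamma, m, trials = 0.4, 0.05, 200, 20000   # assumed values

    samples = rng.binomial(1, phi, size=(trials, m))
    phi_hat = samples.mean(axis=1)
    empirical = np.mean(np.abs(phi_hat - phi) > gamma)
    bound = 2 * np.exp(-2 * gamma**2 * m)

    print(empirical, bound)   # the empirical frequency stays below the bound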

