[Figures: example applications (images, text, voice) and milestones (Go, 2016; StarCraft, 2019; protein structure, 2020)]
$f_j: X \to \mathbb{R}$ are numerical features.

$$a(x, w) = f(\langle w, x \rangle) = f\Big(\sum_{j=1}^{n} w_j f_j(x) + b\Big),$$

where $w_1, \dots, w_n \in \mathbb{R}$ are the feature weights and $b$ is the bias.

$f(z)$ is the activation function, for example $\operatorname{sign}(z)$, $\dfrac{1}{1 + e^{-z}}$, or $(z)_+$.
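A minimal NumPy sketch of this neuron model (the feature values, weights, and bias below are made up for illustration):

```python
import numpy as np

def neuron(x, w, b, f):
    """Single neuron: a(x, w) = f(<w, x> + b)."""
    return f(np.dot(w, x) + b)

# Activation functions mentioned above.
sign = np.sign                                 # sign(z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # 1 / (1 + e^{-z})
relu = lambda z: np.maximum(z, 0.0)            # (z)_+

x = np.array([0.5, -1.0, 2.0])   # illustrative feature values f_j(x)
w = np.array([0.1, 0.4, -0.3])   # feature weights
b = 0.2                          # bias

for f in (sign, sigmoid, relu):
    print(neuron(x, w, b, f))
```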
The functions AND, OR, NOT of binary variables $x_1$ and $x_2$:

$$x_1 \wedge x_2 = \left[x_1 + x_2 - \tfrac{3}{2} > 0\right]$$
$$x_1 \vee x_2 = \left[x_1 + x_2 - \tfrac{1}{2} > 0\right]$$
$$\neg x_1 = \left[-x_1 + \tfrac{1}{2} > 0\right]$$
The function $x_1 \oplus x_2 = [x_1 \neq x_2]$ (XOR) is not implementable by a single neuron. There are two ways to implement it: with a two-layer network, or by adding a nonlinear feature such as the product $x_1 x_2$; see the sketch below.
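A quick check of the threshold formulas above, together with the two-layer route to XOR; the particular decomposition $(x_1 \vee x_2) \wedge \neg(x_1 \wedge x_2)$ is one illustrative choice:

```python
import numpy as np

def threshold_neuron(x, w, b):
    """[<w, x> + b > 0]: a single neuron with a step activation."""
    return int(np.dot(w, x) + b > 0)

AND = lambda x1, x2: threshold_neuron([x1, x2], [1, 1], -1.5)   # x1 + x2 - 3/2 > 0
OR  = lambda x1, x2: threshold_neuron([x1, x2], [1, 1], -0.5)   # x1 + x2 - 1/2 > 0
NOT = lambda x1:     threshold_neuron([x1],     [-1],   0.5)    # -x1 + 1/2 > 0

# XOR needs two layers: x1 XOR x2 = (x1 OR x2) AND NOT(x1 AND x2).
XOR = lambda x1, x2: AND(OR(x1, x2), NOT(AND(x1, x2)))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", AND(x1, x2), OR(x1, x2), XOR(x1, x2))
```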
A function $\sigma(z)$ is called a sigmoid if $\lim\limits_{z \to -\infty} \sigma(z) = 0$ and $\lim\limits_{z \to +\infty} \sigma(z) = 1$.
If $\sigma(z)$ is a continuous sigmoid, then for any continuous function $f(x)$ on $[0,1]^n$ there exist parameter values $w_h \in \mathbb{R}^n$, $b_h \in \mathbb{R}$, $\alpha_h \in \mathbb{R}$ such that the network with one hidden layer

$$a(x) = \sum_{h=1}^{H} \alpha_h\, \sigma(\langle x, w_h \rangle + b_h)$$

uniformly approximates $f(x)$ to any desired accuracy $\varepsilon$: $|a(x) - f(x)| < \varepsilon$ for all $x \in [0,1]^n$.
G. Cybenko. Approximation by Superpositions of a Sigmoidal Function. Mathematics of Control, Signals, and Systems (MCSS) 2 (4): 303-314 (Dec 1, 1989)
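An illustration of this representation (not a proof of the theorem): fix random hidden parameters $w_h, b_h$ and fit only the output coefficients $\alpha_h$ by least squares to approximate a one-dimensional target. The target function and all constants below are chosen just for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))    # a continuous sigmoid

f = lambda x: np.sin(2 * np.pi * x)           # target function on [0, 1] (illustrative)
x = np.linspace(0, 1, 200)

H = 50                                        # number of hidden units
w = rng.normal(scale=10.0, size=H)            # random hidden weights w_h
b = rng.uniform(-10.0, 10.0, size=H)          # random hidden biases b_h

Phi = sigma(np.outer(x, w) + b)               # Phi[i, h] = sigma(w_h * x_i + b_h)
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)  # fit output weights alpha_h

a = Phi @ alpha                               # a(x) = sum_h alpha_h * sigma(w_h x + b_h)
print("max |a(x) - f(x)| =", np.max(np.abs(a - f(x))))
```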
Prediction: $y_{\text{pred}} = x \cdot W + b$
In our example, the input space is 784-dimensional: $x \in \mathbb{R}^{784}$.
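A minimal sketch of this prediction with MNIST-like shapes (the batch size and the 10-class output dimension are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

batch, n_features, n_classes = 32, 784, 10   # 28x28 images, 10 digit classes (assumed)
x = rng.normal(size=(batch, n_features))     # a batch of flattened images
W = rng.normal(size=(n_features, n_classes)) * 0.01
b = np.zeros(n_classes)

y_pred = x @ W + b                           # prediction y_pred = x . W + b
print(y_pred.shape)                          # (32, 10)
```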
If $y_{\text{true},i} \in \mathbb{R}$ (that is, a linear regression task), then the minimum of the sum of squared differences (the least squares method) can be computed analytically:

$$\hat{W} = (X^T X)^{-1} X^T y_{\text{true}}$$

In the general case the problem is solved numerically by minimizing the loss function, most often with gradient descent.
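A sketch of both routes on synthetic, generated data: the closed-form normal equations above and a numerical least-squares routine (`np.linalg.lstsq`), which in practice is preferable to forming $(X^T X)^{-1}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # design matrix
W_true = rng.normal(size=5)
y_true = X @ W_true + 0.01 * rng.normal(size=100)

# Normal equations, as in the formula above.
W_hat = np.linalg.inv(X.T @ X) @ X.T @ y_true

# Numerically safer equivalent.
W_lstsq, *_ = np.linalg.lstsq(X, y_true, rcond=None)

print(np.allclose(W_hat, W_lstsq, atol=1e-6))
```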
We transform the responses of the linear model into class probabilities:
$$p(c = 0 \mid x) = \frac{e^{y_0}}{e^{y_0} + e^{y_1} + \dots + e^{y_n}} = \frac{e^{y_0}}{\sum_i e^{y_i}}, \qquad p(c = 1 \mid x) = \frac{e^{y_1}}{e^{y_0} + e^{y_1} + \dots + e^{y_n}} = \frac{e^{y_1}}{\sum_i e^{y_i}}, \quad \dots$$

The loss is the cross-entropy (for one-hot targets $y_i \in \{0, 1\}$); in our case it takes the form:
$$L(W, b) = -\sum_j \ln \frac{e^{(x_j W + b)_{y_j}}}{\sum_i e^{(x_j W + b)_i}}$$
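A numerically stable sketch of this loss (the helper names and shapes are illustrative):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax: p_i = e^{z_i} / sum_k e^{z_k}."""
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(x, y, W, b):
    """L(W, b) = -sum_j ln softmax(x_j W + b)[y_j], with y_j the class index of object j."""
    p = softmax(x @ W + b)
    return -np.sum(np.log(p[np.arange(len(y)), y]))
```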
We find the minimum of this function by stochastic gradient descent:

$$W_{k+1} = W_k - \eta \frac{\partial L}{\partial W}, \qquad b_{k+1} = b_k - \eta \frac{\partial L}{\partial b}$$

The variance of the stochastic gradient is reduced by averaging it over a mini-batch of objects.
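For the softmax cross-entropy loss, the gradient with respect to the scores has the well-known closed form $p - \mathrm{onehot}(y)$; a minimal sketch of one gradient step (the function name and shapes are illustrative):

```python
import numpy as np

def grad_step(x, y, W, b, eta):
    """One step W := W - eta * dL/dW, b := b - eta * dL/db for the cross-entropy loss above."""
    z = x @ W + b
    z -= z.max(axis=1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)        # softmax probabilities
    p[np.arange(len(y)), y] -= 1.0           # dL/dz = p - onehot(y)
    dW = x.T @ p                             # dL/dW
    db = p.sum(axis=0)                       # dL/db
    return W - eta * dW, b - eta * db
```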
Input: training sample $X^\ell$, learning rate $\eta$, forgetting rate $\lambda$
Output: weight vector $w \equiv (w_{jh}, w_{hm})$

Initialize the weights $w$ and the quality estimate $Q$
Repeat:
- pick an object $x_i$ from $X^\ell$ at random;
- compute the loss $\varepsilon_i := L_i(w)$;
- make a gradient step: $w := w - \eta \nabla L_i(w)$;
- update the estimate: $Q := (1 - \lambda)\, Q + \lambda\, \varepsilon_i$;

Until the value of $Q$ and/or the weights $w$ converge
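A minimal sketch of this loop for a two-layer network with weights $(w_{jh}, w_{hm})$; biases are omitted, and the sigmoid hidden layer, softmax output, random initialization, and stopping threshold are assumptions for the demo. The forgetting rate $\lambda$ is used as an exponential moving average of the loss, as in the steps above.

```python
import numpy as np

def sgd_two_layer(X, y, n_hidden, n_classes, eta=0.1, lam=0.01, n_steps=10_000, tol=1e-6):
    """SGD for a two-layer network with a smoothed quality estimate Q."""
    rng = np.random.default_rng(0)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    W1 = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))     # w_jh: input -> hidden
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))      # w_hm: hidden -> output
    Q = None
    for _ in range(n_steps):
        i = rng.integers(len(X))                  # pick a random object x_i
        xi = X[i]
        h = sigmoid(xi @ W1)                      # hidden activations
        z = h @ W2                                # output scores
        z -= z.max()
        p = np.exp(z)
        p /= p.sum()                              # softmax probabilities
        loss = -np.log(p[y[i]])                   # L_i(w)
        # Backward pass (backpropagation).
        dz = p.copy()
        dz[y[i]] -= 1.0                           # dL/dz = p - onehot(y_i)
        dW2 = np.outer(h, dz)
        dW1 = np.outer(xi, (W2 @ dz) * h * (1 - h))   # sigmoid derivative h(1-h)
        W1 -= eta * dW1                           # gradient step
        W2 -= eta * dW2
        Q_new = loss if Q is None else (1 - lam) * Q + lam * loss
        if Q is not None and abs(Q_new - Q) < tol:    # stop when Q converges
            break
        Q = Q_new
    return W1, W2
```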