After watching all the videos of the famous Stanford CS231n course that took place in 2017, I decided to write a summary of the whole course to help me remember it, and for anyone who would like to know about it. I've skipped some content in some lectures because it wasn't important to me.
Website: http://cs231n.stanford.edu/
Lectures link: https://www.youtube.com/playlist?list=PLC1qU-LWwrF64f4QKQT-Vg5Wr4qEE1Zxk
Full syllabus link: http://cs231n.stanford.edu/syllabus.html
Assignments solutions: https://github.com/Burton2000/CS231n-2017
Number of lectures: 16
Course description:
Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This course is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. During the 10-week course, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. The final assignment will involve training a multi-million parameter convolutional neural network and applying it on the largest image classification dataset (ImageNet). We will focus on teaching how to set up the problem of image recognition, the learning algorithms (e.g. backpropagation), practical engineering tricks for training and fine-tuning the networks and guide the students through hands-on assignments and a final course project. Much of the background and materials of this course will be drawn from the ImageNet Challenge.
The linear classifier computes a score function:

Y = wX + b

where the shape of w matches x and the shape of b is 1 (per class). A common trick is to fold b into w: the new x is oldX with a 1 appended, w absorbs b, and the equation becomes simply:

Y = wX

In the last section we talked about the linear classifier, but we didn't discuss how to train the parameters of that model, i.e. how to find the w's and b's that make the classifier run at its best.
We need a loss function to measure how good or bad our current parameters are.
Loss = L[i] = L(f(X[i],W), Y[i])
Loss_for_all = 1/N * Sum(Li(f(X[i],W),Y[i])) # Indicates the average
Then we find a way to minimize the loss function given some parameters. This is called optimization.
Loss function for a linear SVM classifier:
L[i] = Sum over all classes j except the correct class y[i] of max(0, s[j] - s[y[i]] + 1)
L = max (0, 437.9 - (-96.8) + 1) + max(0, 61.95 - (-96.8) + 1) = max(0, 535.7) + max(0, 159.75) = 695.45
If your loss function gives you zero, is that set of parameters unique? No, there are a lot of parameter settings that can give you the best score.
You'll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but in some datasets the squared hinge loss can work better.
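The hinge loss above can be sketched in a few lines of numpy; the scores below are hypothetical, just to show the mechanics:

```python
import numpy as np

def svm_loss(scores, correct_class, delta=1.0):
    """Multiclass SVM (hinge) loss for one example.
    scores: 1D array of class scores; correct_class: index of the true label."""
    margins = np.maximum(0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0  # the correct class contributes no loss
    return np.sum(margins)

scores = np.array([3.2, 5.1, -1.7])        # hypothetical scores for 3 classes
print(svm_loss(scores, correct_class=0))   # only the second class violates the margin
```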
We add regularization to the loss function so that the discovered model doesn't overfit the data.
Loss = L = 1/N * Sum(Li(f(X[i],W),Y[i])) + lambda * R(W)
Where R
is the regularizer, and lambda
is the regularization term.
There are different regularizations techniques:
Regularizer | Equation | Comments |
---|---|---|
L2 | R(W) = Sum(W^2) | Sum of all the W's squared |
L1 | R(W) = Sum(|W|) | Sum of the absolute values of all W's |
Elastic net (L1 + L2) | R(W) = beta * Sum(W^2) + Sum(|W|) | Combination of L1 and L2 |
Dropout | No equation | Randomly drop units during training |
Regularization prefers smaller W's over big W's.
Regularization is also called weight decay. Biases should not be included in regularization.
Softmax loss (like logistic regression, but works for more than 2 classes):
Softmax function:
A[L] = e^(score[L]) / sum(e^(score[L]), NoOfClasses)
Sum of the vector should be 1.
Softmax loss:
Loss = -logP(Y = y[i]|X = x[i])
The log of the probability of the correct class. We want it to be near 1, and that's why we add a minus sign.
Softmax loss is also called cross-entropy loss.
Consider this numerical problem when you are computing Softmax:
f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup
# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f) # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer
Optimization:
Follow the slope.
Our goal is to compute the gradient of each parameter we have.
After we compute the gradient of our parameters, we perform the gradient descent update:
W = W - learning_rate * W_grad
The learning_rate is such an important hyperparameter that you should find the best value for it first, before all the other hyperparameters.
stochastic gradient descent:
Computing the analytic gradient for arbitrary complex functions:
What is a computational graph?
Back-propagation simple example:
Suppose we have f(x,y,z) = (x+y)z
Then graph can be represented this way:
X
\
(+)--> q ---(*)--> f
/ /
Y /
/
/
Z---------/
We made an intermediate variable q
to hold the values of x+y
Then we have:
q = (x+y) # dq/dx = 1 , dq/dy = 1
f = qz # df/dq = z , df/dz = q
Then:
df/dq = z
df/dz = q
df/dx = df/dq * dq/dx = z * 1 = z # Chain rule
df/dy = df/dq * dq/dy = z * 1 = z # Chain rule
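The derivation above can be checked numerically; the input values here (x = -2, y = 5, z = -4) are just illustrative:

```python
# A numeric check of the chain rule for f(x, y, z) = (x + y) * z.
def forward_backward(x, y, z):
    q = x + y                 # forward pass: intermediate node
    f = q * z
    df_dq, df_dz = z, q       # local gradients at the (*) node
    df_dx = df_dq * 1.0       # chain rule: dq/dx = 1
    df_dy = df_dq * 1.0       # chain rule: dq/dy = 1
    return f, (df_dx, df_dy, df_dz)

f, (dx, dy, dz) = forward_backward(-2.0, 5.0, -4.0)
print(f, dx, dy, dz)  # -12.0 -4.0 -4.0 3.0
```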
In a computational graph, we call each operation f. For each f we calculate the local gradient before we do backpropagation, and then we compute the gradients with respect to the loss function using the chain rule.
In a computational graph you can split each operation into pieces as simple as you want, but then there will be a lot of nodes. If you want fewer (bigger) nodes, be sure that you can compute the gradient of each node.
A bigger example:
Modularized implementation: forward/ backward API (example multiply code):
class MultiplyGate(object):
    """
    x, y are scalars
    """
    def forward(self, x, y):
        z = x * y
        self.x = x  # Cache
        self.y = y  # Cache
        # We cache x and y because we know that the derivatives contain them.
        return z

    def backward(self, dz):
        dx = self.y * dz  # dz/dx = y
        dy = self.x * dz  # dz/dy = x
        return [dx, dy]
If you look at a deep learning framework, you will find that it follows this modularized implementation, where each class has a definition for forward and backward. For example:
So to define neural network as a function:
f = Wx
f = W2*max(0,W1*x)
f = W3*max(0, W2*max(0, W1*x))
A neural network is a stack of simple operations that together form a complex operation.
Neural networks history:
Convolutional neural networks history:
ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture.
There are a few distinct types of Layers in ConvNet (e.g. CONV/FC/RELU/POOL are by far the most popular)
Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don’t)
Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn’t)
How do convolutional neural networks work?
In a fully connected layer the input (X, M) is stretched into a vector; the weights' shape for this will be (NoOfHiddenNeurons, X), and we compute W.T*X + b. This equation uses the broadcasting technique, and treats the input (and W) as a vector, not a matrix.
A convolution layer instead preserves the spatial structure:
- Input image shape: (32,32,3)
- Convolving one filter of shape (5,5,3) over it produces one activation map of shape (28,28,1)
- With 6 such filters the output is (28,28,6)
- A second conv layer with 10 filters of shape (5,5,6) gives an output of (24,24,10)
What is the stride when we are doing convolution? Given an input of shape (7,7) and a filter of shape (3,3):
- If the stride is 1, the output shape will be (5,5). # 2 are dropped
- If the stride is 2, the output shape will be (3,3). # 4 are dropped
- If the stride is 3, it doesn't work.
In general the output size is ((N-F)/stride + 1):
- stride 1 ==> O = ((7-3)/1)+1 = 4 + 1 = 5
- stride 2 ==> O = ((7-3)/2)+1 = 2 + 1 = 3
- stride 3 ==> O = ((7-3)/3)+1 = 1.33 + 1 = 2.33 # doesn't work
In practice it's common to zero pad the border (padding on both sides).
To preserve the input size with stride 1, it's common to pad according to the equation (F-1)/2, where F is the filter size:
- F = 3 ==> zero pad with 1
- F = 5 ==> zero pad with 2
Example: input with shape (32,32,3), ten filters with shape (5,5), stride 1, and pad 2:
- Output shape: (32,32,10) # We maintain the size.
- Number of parameters per filter = 5*5*3 + 1 = 76
- Total number of parameters = 76 * 10 = 760
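The output-size and parameter arithmetic above can be wrapped in two small helper functions (a sketch mirroring the numbers in this example):

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a conv layer: (N - F + 2*P) / stride + 1."""
    out = (n - f + 2 * pad) / stride + 1
    assert out == int(out), "the filter doesn't fit; pick another stride/pad"
    return int(out)

def conv_num_params(f, in_depth, num_filters):
    """Each filter has f*f*in_depth weights plus one bias."""
    return (f * f * in_depth + 1) * num_filters

print(conv_output_size(32, 5, stride=1, pad=2))  # 32: the size is maintained
print(conv_num_params(5, 3, 10))                 # 760
```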
The number of filters is usually a power of 2. # To vectorize well
So here are the parameters for the Conv layer:
Pooling makes the representation smaller and more manageable.
Pooling Operates over each activation map independently.
An example of pooling is max pooling, e.g. a 2x2 filter applied at stride 2. # Usually the two parameters are the same: 2, 2
Also example of pooling is average pooling.
As a revision, here are the steps of the mini-batch stochastic gradient descent algorithm:
Activation functions:
Different choices for the activation function include Sigmoid, tanh, RELU, Leaky RELU, Maxout, and ELU.
Sigmoid:
Sigmoid(x) = 1 / (1 + e^-x)
Problems with sigmoid: saturated neurons "kill" the gradients, the outputs are not zero-centered, and exp() is a bit compute-expensive.
Tanh:
Tanh(x) squashes the input to [-1, 1] and is zero-centered, but it still kills gradients when saturated.
RELU (Rectified Linear Unit):
RELU(x) = max(0,x)
It converges much faster than sigmoid/tanh in practice (e.g. 6x).
Leaky RELU:
leaky_RELU(x) = max(0.01x,x)
Exponential linear units (ELU):
ELU(x) = { x                    if x > 0
           alpha * (exp(x) - 1) if x <= 0   # alpha is a hyperparameter
         }
It has all the benefits of RELU.
Closer to zero mean outputs and adds some robustness to noise.
Problems:
exp() is a bit compute-expensive.
Maxout activations:
maxout(x) = max(w1.T*x + b1, w2.T*x + b2)
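As a reference, the activations above can be sketched in a few lines of numpy (these are illustrative one-liners, not a library API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x):
    return np.maximum(0.01 * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(0.0))   # 0.5
print(relu(x))        # [0. 0. 2.]
print(leaky_relu(x))  # [-0.02  0.    2.  ]
```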
In practice:
Data preprocessing:
Normalize the data:
# Zero centered data. (Calculate the mean for every input).
# One of the reasons we do this is because we need the data to contain both positive and negative values, not all negative or all positive.
X -= np.mean(X, axis = 1)
# Then apply the standard deviation. Hint: in images we don't do this.
X /= np.std(X, axis = 1)
To normalize images:
Weight initialization:
What happens when we initialize all W's with zeros? All neurons compute the same output and receive the same gradient updates, so they never learn different features.
First idea is to initialize the w's with small random numbers:
W = 0.01 * np.random.randn(D, H)
# Works OK for small networks but it makes problems with deeper networks!
In deeper networks the standard deviation of the activations goes to zero, and the gradient vanishes sooner.
W = 1 * np.random.randn(D, H)
# Works OK for small networks but it makes problems with deeper networks!
The network will explode with big numbers!
Xavier initialization:
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
It works because we want the variance of the input to be the same as the variance of the output.
But it has an issue, It breaks when you are using RELU.
He initialization (Solution for the RELU issue):
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
This solves the issue with RELU. It's recommended when you are using RELU.
Proper initialization is an active area of research.
Batch normalization:
Result = gamma * normalizedX + beta
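A minimal sketch of the batch-norm forward pass implied by this equation, using training-time batch statistics (gamma and beta are the learnable parameters; the input batch here is synthetic):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.
    X: (N, D) batch; gamma, beta: (D,) learnable parameters."""
    mu = X.mean(axis=0)                    # per-feature mean over the batch
    var = X.var(axis=0)                    # per-feature variance
    X_hat = (X - mu) / np.sqrt(var + eps)  # roughly unit gaussian per feature
    return gamma * X_hat + beta

np.random.seed(0)
X = np.random.randn(64, 10) * 5 + 3        # a batch far from zero mean / unit variance
out = batchnorm_forward(X, gamma=np.ones(10), beta=np.zeros(10))
print(round(float(out.mean()), 4), round(float(out.std()), 2))  # close to 0 and 1
```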
Babysitting the learning process:
If you see NaN in the loss, your NN has exploded and your learning rate is too high.
Hyperparameter Optimization
Optimization algorithms:
Problems with stochastic gradient descent:
SGD + momentum:
Build up velocity as a running mean of gradients:
# Computing a weighted average; the best rho is in range [0.9 - 0.99]
v[t+1] = rho * v[t] + dx
x[t+1] = x[t] - learningRate * v[t+1]
v[0] is zero.
Solves the saddle point and local minimum problems.
It overshoots the minimum and then comes back to it.
Nesterov momentum:
dx = compute_gradient(x)
old_v = v
v = rho * v - learning_rate * dx
x+= -rho * old_v + (1+rho) * v
Doesn't overshoot the minimum, but is slower than SGD + momentum.
AdaGrad
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # here is a problem: grad_squared isn't decayed, so it gets very large
    grad_squared += dx * dx
    x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
RMSProp
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # Solves the AdaGrad problem: grad_squared now decays
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
People use this instead of AdaGrad.
Adam
Adam combines the momentum idea with the RMSProp idea, plus bias correction.
beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4
is a great starting point for many models!
Learning rate decay:
All the algorithms we have discussed above are first-order optimization methods.
Second order optimization
In practice, first use Adam; if it doesn't work, try L-BFGS.
Some say all the famous deep architectures use SGD + Nesterov momentum.
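For completeness, the Adam update discussed above (first moment, second moment, and bias correction) can be sketched as follows; the toy target f(x) = x^2 and the loop settings are just for illustration:

```python
import numpy as np

def adam_step(x, dx, m, v, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias correction; t starts at 1."""
    m = beta1 * m + (1 - beta1) * dx         # momentum-like first moment
    v = beta2 * v + (1 - beta2) * dx * dx    # RMSProp-like second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Toy usage: minimize f(x) = x^2 (gradient 2x) starting from x = 5.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, learning_rate=1e-2)
print(abs(x) < 0.5)  # True: x ends close to the minimum at 0
```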
Regularization
Transfer learning:
Sometimes your model overfits because your dataset is small, not because of a lack of regularization.
You need a lot of data if you want to train/use CNNs.
Steps of transfer learning
Guide to use transfer learning:
 | Very similar dataset | Very different dataset |
---|---|---|
Very little data | Use a linear classifier on the top layer | You're in trouble... Try a linear classifier from different stages |
Quite a lot of data | Finetune a few layers | Finetune a larger number of layers |
Transfer learning is the norm, not an exception.
This section changes a lot every year in CS231n due to rapid changes in deep learning software.
CPU vs GPU
Deep learning Frameworks
Tensorflow (Google)
#Ships with tensorflow
#Ships with tensorflow
#Ships with tensorflow
# New from deep mind
PyTorch (Facebook)
PyTorch has three levels of abstraction:
- Tensor: like an ndarray, but runs on GPU # Like numpy arrays in tensorflow
- Variable: a node in a computational graph # Like Tensor, Variable, Placeholders in tensorflow
- Module: a neural network layer # Like tf.layers in tensorflow
Tensorflow builds the graph once, then runs it many times (called a static graph).
In each PyTorch iteration we build a new graph (called a dynamic graph).
Static vs dynamic graphs:
Optimization:
Serialization
Static: once the graph is built, we can serialize it and run it without the code that built the graph, e.g. use the graph in C++.
Dynamic: Always need to keep the code around.
Conditional
Loops:
Tensorflow Fold makes dynamic graphs easier in Tensorflow through dynamic batching.
Dynamic graph applications include: recurrent networks and recursive networks.
Caffe2 uses static graphs and can train models in Python; it also works on iOS and Android.
Tensorflow/Caffe2 are used a lot in production especially on mobile.
This section talks about the famous CNN architectures. Focuses on CNN architectures that won ImageNet competition since 2012.
These architectures includes: AlexNet, VGG, GoogLeNet, and ResNet.
Also we will discuss some interesting architectures as we go.
The first ConvNet was LeNet-5, by Yann LeCun in 1998. Its architecture is:
CONV-POOL-CONV-POOL-FC-FC-FC
Conv filters were 5x5, applied at stride 1.
Subsampling (pooling) layers were 2x2, applied at stride 2.
In 2010, Dan Claudiu Ciresan and Jurgen Schmidhuber published one of the very first GPU implementations of neural nets. It had both forward and backward passes implemented on an NVIDIA GTX 280 graphics processor, for neural networks of up to 9 layers.
AlexNet (2012):
CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8
It achieved a 16.4% top-5 error on ImageNet. Layer shapes:
- CONV1 output is (55,55,96); number of weights is (11*11*3*96)+96 = 34944
- MAXPOOL1 output is (27,27,96); no weights
- NORM1 output is (27,27,96); we don't use normalization layers like this any more
- The final pooling output is (6,6,256)
Training details:
- Dropout 0.5
- Batch size 128
- SGD momentum 0.9
- Learning rate 1e-2, reduced by 10 at some iterations
- 60 million parameters in total
ZFNet (2013)
An improved version of AlexNet's hyperparameters:
- CONV1: change from (11 x 11 stride 4) to (7 x 7 stride 2)
- CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
OverFeat (2013)
VGGNet (2014) (Oxford)
GoogleNet (2014)
An inception module applies parallel filter operations on the same input and concatenates the results depth-wise. For example, on a (28,28,256) input:
- 1x1 conv, 128 filters # output shape (28,28,128)
- 3x3 conv, 192 filters # output shape (28,28,192)
- 5x5 conv, 96 filters # output shape (28,28,96)
- 3x3 max pooling # output shape (28,28,256)
Concatenating all of these gives an output of shape (28,28,672).
The first GoogleNet and VGG were created before batch normalization was invented, so they needed some hacks to train the networks and make them converge well.
ResNet (2015) (Microsoft Research)
A 152-layer model for ImageNet. Winner with a 3.57% top-5 error, which is better than human-level error.
This was also the very first time that a network of more than a hundred, even 1000, layers was trained.
Swept all classification and detection competitions in ILSVRC’15 and COCO’15!
What happens when we continue stacking deeper layers on a “plain” Convolutional neural network?
The deeper model should be able to perform at least as well as the shallower model.
A solution by construction is copying the learned layers from the shallower model and setting additional layers to identity mapping.
Residual block:
Microsoft came with the Residual block which has this architecture:
# Instead of trying to learn a whole new representation, we learn only the residual
Y = (W2 * RELU(W1*x + b1) + b2) + X
Say you have a network till a depth of N layers. You only want to add a new layer if you get something extra out of adding that layer.
One way to ensure this new (N+1)th layer learns something new about your network is to also provide the input(x) without any transformation to the output of the (N+1)th layer. This essentially drives the new layer to learn something different from what the input has already encoded.
The other advantage is such connections help in handling the Vanishing gradient problem in very deep networks.
With the residual block we can now have a deep NN of any depth without fearing that we can't optimize the network.
ResNet with a large number of layers started to use a bottleneck layer similar to the Inception bottleneck to reduce the dimensions.
Full ResNet architecture:
Training details:
- SGD + momentum (0.9)
- Batch size 256
- Weight decay of 1e-5
Inception-v4: ResNet + Inception, introduced in 2016.
Comparing the complexity across all the architectures:
ResNets Improvements:
Beyond ResNets:
Conclusion:
Vanilla neural networks ("feed-forward neural networks"): an input of fixed size goes through some hidden units and then goes to the output. We call it a one-to-one network.
Recurrent Neural Networks RNN Models:
RNNs can also work for Non-Sequence Data (One to One problems)
So what is a recurrent neural network?
A recurrent core cell takes an input x, and the cell has an internal state that is updated each time it reads an input.
The RNN block should return a vector.
We can process a sequence of vectors x by applying a recurrence formula at every time step:
h[t] = fw (h[t-1], x[t]) # Where fw is some function with parameters W
The same function and the same set of parameters are used at every time step.
(Vanilla) Recurrent Neural Network:
h[t] = tanh (W[h,h]*h[t-1] + W[x,h]*x[t]) # Then we save h[t]
y[t] = W[h,y]*h[t]
This is the simplest example of a RNN.
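The recurrence above can be sketched in numpy; the sizes and random weights below are toy values, just to show the shapes:

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, W_hy):
    """One vanilla-RNN time step, matching the equations above."""
    h = np.tanh(W_hh @ h_prev + W_xh @ x)  # new hidden state
    y = W_hy @ h                           # output at this time step
    return h, y

np.random.seed(0)
H, D = 4, 3                                # hidden size, input size (toy values)
W_hh = np.random.randn(H, H)
W_xh = np.random.randn(H, D)
W_hy = np.random.randn(2, H)               # 2 output units, for illustration
h = np.zeros(H)                            # h0 initialized to zero
for x in np.random.randn(5, D):            # a sequence of 5 input vectors
    h, y = rnn_step(h, x, W_hh, W_xh, W_hy)
print(h.shape, y.shape)  # (4,) (2,)
```

Note that the same weight matrices are reused at every time step, as the text says.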
RNN works on a sequence of related data.
Recurrent NN Computational graph:
h0 is initialized to zero.
The gradient of W is the sum of all the W gradients that have been calculated!
Examples:
A character-level language model: the vocabulary is [h, e, l, o] and the training word is [hello].
Backpropagation through time: forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.
So in practice people do "truncated backpropagation through time": we run forward and backward through chunks of the sequence instead of the whole sequence.
Example on image captioning:
Image Captioning with Attention is a project in which when the RNN is generating captions, it looks at a specific part of the image not the whole image.
Multilayer RNNs stack several recurrent layers, feeding the hidden states of one layer into the next as inputs. LSTMs can be stacked this way as well.
Backward flow of gradients in RNN can explode or vanish. Exploding is controlled with gradient clipping. Vanishing is controlled with additive interactions (LSTM)
LSTM stands for Long Short Term Memory. It was designed to help the vanishing gradient problem on RNNs.
Highway networks is something between ResNet and LSTM that is still in research.
Better/simpler architectures are a hot topic of current research
Better understanding (both theoretical and empirical) is needed.
RNNs are mostly used for problems with sequences of related inputs, like NLP and speech recognition.
So far we are talking about image classification problem. In this section we will talk about Segmentation, Localization, Detection.
Semantic Segmentation
We want to Label each pixel in the image with a category label.
As you can see with the cows in the image, semantic segmentation doesn't differentiate instances; it only cares about pixels.
The first idea is to use a sliding window. We take a small window size and slide it all over the picture. For each window we want to label the center pixel.
The second idea is designing a network as a bunch of Convolutional layers to make predictions for pixels all at once!
The third idea is based on the last idea. The difference is that we are downsampling and upsampling inside the network.
We downsample because using the whole image at full resolution is very expensive. So we downsample through multiple layers and then upsample at the end.
Downsampling is an operation like Pooling and strided convolution.
Upsampling is like "Nearest Neighbor" or "Bed of Nails" or "Max unpooling"
Nearest Neighbor example:
Input: 1 2 Output: 1 1 2 2
3 4 1 1 2 2
3 3 4 4
3 3 4 4
Bed of Nails example:
Input: 1 2 Output: 1 0 2 0
3 4 0 0 0 0
3 0 4 0
0 0 0 0
Max unpooling depends on the earlier max pooling step: each value is placed back at the position where the max was taken, and the other positions are filled with zeros.
Max unpooling seems to be the best idea for upsampling.
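The nearest-neighbor and bed-of-nails examples above can be reproduced with a few lines of numpy:

```python
import numpy as np

def nearest_neighbor_upsample(x, k=2):
    """Repeat every element k times along both spatial axes."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def bed_of_nails_upsample(x, k=2):
    """Place each element in the top-left corner of a k x k block of zeros."""
    out = np.zeros((x.shape[0] * k, x.shape[1] * k), dtype=x.dtype)
    out[::k, ::k] = x
    return out

x = np.array([[1, 2], [3, 4]])
print(nearest_neighbor_upsample(x))
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
print(bed_of_nails_upsample(x))
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]
```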
There is an idea of learnable upsampling called "Transpose Convolution".
Classification + Localization:
Here we predict a class label (the plain classification problem we know) plus a bounding box (x,y,w,h) around the object.
Object Detection
Dense Captioning
Instance Segmentation
We want to know what’s going on inside ConvNets?
People want to trust the black box (CNN), understand how exactly it works, and confirm that it makes good decisions.
A first approach is to visualize filters of the first layer.
We can visualize filters from the later layers, but they won't tell us much.
In AlexNet, there were some FC layers at the end. We can take the 4096-dimensional feature vector that an image produces there, collect these feature vectors for many images, and visualize them (for example with nearest neighbors in feature space).
We can Visualize the activation maps.
There is something called Maximally Activating Patches that can help us visualize the intermediate features in ConvNets.
Another idea is Occlusion Experiments
Saliency maps tell us which pixels matter for classification.
Guided backprop makes something like Maximally Activating Patches, but unlike them it identifies the exact input pixels we care about.
Gradient Ascent
Generate a synthetic image that maximally activates a neuron.
The reverse of gradient descent: instead of taking the minimum, it takes the maximum.
We want to maximize a neuron's activation with respect to the input image, so we learn the image that maximizes the activation:
I* = argmax[I] (f(I) + R(I)) # R(I) is a natural image regularizer, f(I) is the neuron value.
Steps of gradient ascent
R(I) may be, for example, the L2 norm of the generated image.
To get better results we use a better regularizer:
A better regularizer makes our images cleaner!
The results in the later layers seem to mean something more than those in the earlier layers.
We can fool a CNN using this procedure:
- Start from an arbitrary image. # Random picture based on nothing
- Pick an arbitrary (incorrect) class. # Random class
- Modify the image by gradient ascent to maximize that class's score.
- Stop when the network is fooled.
The results of fooling the network are pretty surprising!
DeepDream: Amplify existing features
# form an input image (Any image)
I* = arg max[I] sum(f(I)^2)
Feature Inversion
Texture Synthesis
Neural Style Transfer = Feature + Gram Reconstruction
Style transfer requires many forward / backward passes through VGG; very slow!
There is a lot of work on style transfer, and it continues to this day!
Summary:
Generative models are a type of unsupervised learning.
Supervised vs Unsupervised Learning:
 | Supervised Learning | Unsupervised Learning |
---|---|---|
Data structure | Data: (x, y), where x is data and y is label | Data: x. Just data, no labels! |
Data price | Training data is expensive in a lot of cases. | Training data is cheap! |
Goal | Learn a function to map x -> y | Learn some underlying hidden structure of the data |
Examples | Classification, regression, object detection, semantic segmentation, image captioning | Clustering, dimensionality reduction, feature learning, density estimation |
Autoencoders are a Feature learning technique.
Density estimation is where we want to learn/estimate the underlying distribution of the data!
There are a lot of research open problems in unsupervised learning compared with supervised learning!
Generative Models
PixelRNN and PixelCNN
p(x) = prod(p(x[i] | x[1]x[2]...x[i-1])) # the likelihood decomposed with the chain rule
Autoencoders
L[i] = |y[i] - y'[i]|^2
# Now we have the features we need
Variational Autoencoders (VAE)
Generative Adversarial Networks (GANs)
GANs don’t work with any explicit density function!
Instead, take game-theoretic approach: learn to generate from training distribution through 2-player game.
Yann LeCun, who oversees AI research at Facebook, has called GANs:
The coolest idea in deep learning in the last 20 years
Problem: Want to sample from complex, high-dimensional training distribution. No direct way to do this as we have discussed!
Solution: Sample from a simple distribution, e.g. random noise. Learn transformation to training distribution.
So we create a noise vector drawn from a simple distribution and feed it to a NN, which we call the generator network; it should learn to transform this into the distribution we want.
Training GANs: Two-player game:
If we are able to train the Discriminator well then we can train the generator to generate the right images.
The loss function of GANs as minimax game are here:
The discriminator's label for images from the generator network will be 0, and for real images 1.
To train the network we will do:
You can read the full algorithm with the equations here:
Aside: jointly training two networks is challenging and can be unstable. Choosing objectives with better loss landscapes to help training is an active area of research.
Convolutional Architectures:
2017 was the year of GANs! They have exploded in popularity, and there are some really good results.
Active areas of research also is GANs for all kinds of applications.
The GAN zoo can be found here: https://github.com/hindupuravinash/the-gan-zoo
Tips and tricks for using GANs: https://github.com/soumith/ganhacks
NIPS 2016 Tutorial GANs: https://www.youtube.com/watch?v=AJVyzd0rqdc
In reinforcement learning, an agent interacts with an environment in a loop:
State s[t] --> Agent --> Action a[t] --> Environment --> Reward r[t] + next state s[t+1] --> Agent --> and so on...
This is formalized as a Markov Decision Process (S, A, R, P, Y) where:
- S: set of possible states.
- A: set of possible actions.
- R: distribution of reward given a (state, action) pair.
- P: transition probability, i.e. distribution over the next state given a (state, action) pair.
- Y: discount factor. # How much we value rewards coming up soon versus later on.
How it works:
- At time step t=0, the environment samples the initial state s[0].
- Then, until done:
  - The agent selects an action a[t].
  - The environment samples a reward from R given (s[t], a[t]).
  - The environment samples the next state from P given (s[t], a[t]).
  - The agent receives the reward r[t] and the next state s[t+1].
A policy pi is a function from S to A that specifies what action to take in each state.
The objective is to find the policy pi* that maximizes the cumulative discounted reward: Sum(Y^t * r[t], t>0)
The value function at state s is the expected cumulative reward from following the policy from state s:
V[pi](s) = Sum(Y^t * r[t], t>0) given s0 = s, pi
The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
Q[pi](s,a) = Sum(Y^t * r[t], t>0) given s0 = s, a0 = a, pi
Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:
Q*[s,a] = Max(over all pi of (Sum(Y^t * r[t], t>0) given s0 = s, a0 = a, pi))
Q* satisfies the Bellman equation:
Q*[s,a] = r + Y * max Q*(s',a') given s,a # Hint: there is no policy in the equation
The optimal policy pi* corresponds to taking the best action in any state as specified by Q*.
If the state space is huge, we use a function approximator to estimate Q(s,a), e.g. a neural network! This is called Q-learning.
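As a toy illustration of the Bellman equation, here is a single tabular Q-learning update (the states, actions, and reward values are hypothetical):

```python
import numpy as np

# One tabular Q-learning update toward the Bellman target:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    target = r + gamma * np.max(Q[s_next])  # bootstrapped Bellman target
    Q[s, a] += alpha * (target - Q[s, a])   # move part-way toward the target

Q = np.zeros((2, 2))  # 2 hypothetical states, 2 actions
q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])  # 0.5: half-way toward the target of 1.0
```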
Experience replay: keep a replay memory table of transitions (s[t], a[t], r[t], s[t+1]) as game (experience) episodes are played.
Policy gradients: directly optimize the policy parameters by gradient ascent on the expected reward J(theta); often good enough!
# At Baidu
# Can be used on any hardware
# CPU: latency oriented, a single strong thread, like a single elephant
# GPU: throughput oriented, many small threads, like a lot of ants
# FPGA: tuned for a domain of applications
# ASIC: fixed logic, designed for certain applications (can be designed for deep learning applications)
# Most of the time we are inside the linear (bandwidth-limited) curve
Xdash = x + epsilon * (sign of the gradient)
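This update (the fast gradient sign method) can be sketched in numpy; the input and gradient values below are hypothetical:

```python
import numpy as np

def fgsm(x, grad, epsilon=0.1):
    """Perturb the input in the direction that increases the loss."""
    return x + epsilon * np.sign(grad)

x = np.array([0.2, 0.5, 0.8])
grad = np.array([-3.0, 0.0, 2.0])  # hypothetical loss gradient w.r.t. x
print(fgsm(x, grad))               # each entry shifts by at most epsilon
```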
# Called a "universal engineering machine" by Ian Goodfellow
These notes were made by Mahmoud Badry in 2017.