My notes on PyTorch Scholarship Challenge [Phase 1] 2018/2019
A collection of notes on PyTorch Scholarship Challenge 2018/2019.
Contributions are always welcome!
The problem of identifying to which of a set of categories (sub-populations) a new observation belongs.
The separator between classes learned by a model in binary or multi-class classification problems. For example, in the following image representing a binary classification problem, the decision boundary is the frontier between the blue class and the red class:
Linear Boundaries
Higher Dimensions
A system (either hardware or software) that takes in one or more input values, runs a function on the weighted sum of the inputs, and computes a single output value. In machine learning, the function is typically nonlinear, such as ReLU, sigmoid, or tanh.
In the following illustration, the perceptron takes n inputs, each of which is itself modified by a weight before entering the perceptron:
A perceptron that takes in n inputs, each multiplied by separate weights. The perceptron outputs a single value.
Perceptrons are the nodes in deep neural networks. That is, a deep neural network consists of multiple connected perceptrons, plus a backpropagation algorithm to introduce feedback.
AND Perceptron
OR Perceptron
NOT Perceptron
Unlike the other perceptrons we looked at, the NOT operation only cares about one input. The operation returns a 0 if the input is 1 and a 1 if it's a 0. The other inputs to the perceptron are ignored.
XOR Perceptron
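A single perceptron with hand-picked weights (the values below are our own illustrative choice) can implement AND; OR and NOT follow by changing the weights and bias, but XOR cannot be represented by a single perceptron and needs a small multi-layer network. A minimal sketch in plain Python:

```python
def perceptron(inputs, weights, bias):
    # Weighted sum of the inputs, followed by a step activation
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0

# AND: fires only when both inputs are 1 (these weights are one valid choice)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, '->', perceptron([x1, x2], weights=[1.0, 1.0], bias=-1.5))
```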
A function that provides probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, softmax might determine that the probability of a particular image being a duck is 0.67, a beaver 0.33, and a walrus 0. (Also called full softmax.)
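For instance, a minimal PyTorch sketch (the logit values here are made up):

```python
import torch

logits = torch.tensor([2.0, 1.3, -3.0])  # raw scores for duck, beaver, walrus
probs = torch.softmax(logits, dim=0)     # normalized probabilities
print(probs, probs.sum())                # the probabilities sum to 1.0
```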
A sparse vector in which one element is set to 1 and all other elements are set to 0.
One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a given botany data set chronicles 15,000 different species, each denoted with a unique string identifier. As part of feature engineering, you'll probably encode those string identifiers as one-hot vectors in which the vector has a size of 15,000.
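A quick sketch using torch.nn.functional.one_hot (available in recent PyTorch versions; the labels are made up):

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1])            # integer class ids
one_hot = F.one_hot(labels, num_classes=4)  # each row contains a single 1
print(one_hot)
```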
A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions.
A model that generates a probability for each possible discrete label value in classification problems by applying a sigmoid function to a linear prediction. Although logistic regression is often used in binary classification problems, it can also be used in multi-class classification problems (where it is then called multi-class logistic regression or multinomial regression).
A technique to minimize loss by computing the gradients of loss with respect to the model's parameters, conditioned on training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.
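A minimal autograd sketch of the idea, minimizing a toy quadratic loss (the function and learning rate are made up for illustration):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
lr = 0.1                       # learning rate
for _ in range(50):
    loss = (w - 3) ** 2        # toy loss, minimized at w = 3
    loss.backward()            # compute d(loss)/dw
    with torch.no_grad():
        w -= lr * w.grad       # step against the gradient
        w.grad.zero_()         # clear the gradient for the next iteration
print(w.item())                # approaches 3.0
```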
The primary algorithm for performing gradient descent on neural networks. First, the output values of each node are calculated (and cached) in a forward pass. Then, the partial derivative of the error with respect to each parameter is calculated in a backward pass through the graph.
Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.
Producing a model with poor predictive ability because the model hasn't captured the complexity of the training data. Many problems can cause underfitting, such as training on too few or uninformative features, training for too few epochs or at too low a learning rate, or regularizing too heavily.
A method for regularization that involves ending model training before training loss finishes decreasing. In early stopping, you end model training when the loss on a validation data set starts to increase, that is, when generalization performance worsens.
The penalty on a model's complexity. Regularization helps prevent overfitting. Different kinds of regularization include L1 regularization, L2 regularization, dropout regularization, and early stopping.
A form of regularization useful in training neural networks. Dropout regularization works by removing a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks.
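In PyTorch this is the nn.Dropout module; a small sketch (p=0.5 is an arbitrary choice):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # each unit is zeroed with probability 0.5
x = torch.ones(1, 10)
print(drop(x))            # training mode: ~half the units dropped, the rest scaled by 1/(1-p)
drop.eval()
print(drop(x))            # evaluation mode: dropout is a no-op
```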
A sophisticated gradient descent algorithm in which a learning step depends not only on the derivative in the current step, but also on the derivatives of the step(s) that immediately preceded it. Momentum involves computing an exponentially weighted moving average of the gradients over time, analogous to momentum in physics. Momentum sometimes prevents learning from getting stuck in local minima.
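In PyTorch, momentum is an argument to the SGD optimizer; a sketch with illustrative values:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
# momentum=0.9 keeps an exponentially weighted moving average of past gradients
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```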
The primary data structure in deep learning frameworks such as PyTorch. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrices. In PyTorch, the elements of a Tensor hold integer or floating-point values.
The "knobs" that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.
A model that, taking inspiration from the brain, is composed of layers (at least one of which is hidden) consisting of simple connected units or neurons followed by nonlinearities.
A public-domain data set compiled by LeCun, Cortes, and Burges containing 60,000 images, each image showing how a human manually wrote a particular digit from 0–9. Each image is stored as a 28x28 array of integers, where each integer is a grayscale value between 0 and 255, inclusive.
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
The set of examples used in one iteration (that is, one gradient update) of model training.
The number of examples in a batch. For example, the batch size of SGD is 1, while the batch size of a mini-batch is usually between 10 and 1000. Batch size is usually fixed during training and inference.
A full training pass over the entire data set such that each example has been seen once. Thus, an epoch represents N/batch size training iterations, where N is the total number of examples.
A synthetic layer in a neural network between the input layer (that is, the features) and the output layer (the prediction). Hidden layers typically contain an activation function (such as ReLU) for training. A deep neural network contains more than one hidden layer.
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
A specific implementation of the gradient descent algorithm.
A forward and backward evaluation of one batch.
step size
Synonym for learning rate.
A gradient descent algorithm in which the batch size is one. In other words, SGD relies on a single example chosen uniformly at random from a data set to calculate an estimate of the gradient at each step.
In machine learning, often refers to the process of making predictions by applying the trained model to unlabeled examples. In statistics, inference refers to the process of fitting the parameters of a distribution conditioned on some observed data. (See the Wikipedia article on statistical inference.)
A metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.
A metric for classification models that answers the following question: out of all the possible positive labels, how many did the model correctly identify?
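In terms of true/false positives (TP/FP) and false negatives (FN), these two metrics are commonly written as:
precision = TP / (TP + FP)
recall = TP / (TP + FN)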
A subset of the data set—disjoint from the training set—that you use to adjust hyperparameters.
Data that captures the state of the variables of a model at a particular time. Checkpoints enable exporting model weights, as well as performing training across multiple sessions. Checkpoints also enable training to continue past errors (for example, job preemption). Note that the graph itself is not included in a checkpoint.
optimizer.zero_grad() clears the accumulated gradients before each new training step.
Set the model to evaluation mode with model.eval(), then back to training mode with model.train().
Move models and tensors to a device with .to(device), where device is either "cuda" or "cpu".
# http://pytorch.org/
from os.path import exists
from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
# Build the pip wheel platform tag (e.g. cp36-cp36m)
platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
# Detect the CUDA runtime version (e.g. cu92); fall back to CPU if no GPU is present
cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'
# Install the matching PyTorch 0.4.1 wheel plus torchvision
!pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.1-{platform}-linux_x86_64.whl torchvision
import torch
!wget -c https://s3.amazonaws.com/content.udacity-data.com/nd089/Cat_Dog_data.zip;
!unzip -qq Cat_Dog_data.zip;
!wget -c https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/intro-to-pytorch/helper.py
# Pillow provides the PIL module (pip has no installable package named PIL)
!pip install Pillow==4.0.0
!pip install image
import PIL
To use a GPU in Colab, go to Runtime > Change runtime type and set the hardware accelerator to GPU.
Data normalization is an important pre-processing step. It ensures that each input (each pixel value, in this case) comes from a standard distribution.
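For example, a typical torchvision transform pipeline (a mean/std of 0.5 is a common quick choice; dataset-specific statistics would be better):

```python
from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),                 # scales pixel values to [0, 1]
    transforms.Normalize((0.5,), (0.5,)),  # then shifts them to [-1, 1]
])
```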
* layer
A set of neurons in a neural network that process a set of input features, or the output of those neurons.
Layers are Python functions that take Tensors and configuration options as input and produce other tensors as output. Once the necessary Tensors have been composed, the user can convert the result into an Estimator via a model function.
One of a set of enumerated target values for a label. For example, in a binary classification model that detects spam, the two classes are spam and not spam. In a multi-class classification model that identifies dog breeds, the classes would be poodle, beagle, pug, and so on.
The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.
An activation function with the following rules:
- If input is negative or zero, output is 0.
- If input is positive, output is equal to input.
The steps for training/learning from a batch of data are described in the comments below:
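A minimal sketch, with a toy model, criterion, optimizer, and one batch of fake data standing in for the real ones:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(784, 10)            # toy model
criterion = nn.CrossEntropyLoss()     # loss function
optimizer = optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(64, 784)         # one batch of (fake) inputs
labels = torch.randint(0, 10, (64,))  # matching (fake) labels

optimizer.zero_grad()                 # 1. clear accumulated gradients
output = model(images)                # 2. forward pass
loss = criterion(output, labels)      # 3. compute the loss
loss.backward()                       # 4. backward pass (backpropagation)
optimizer.step()                      # 5. update the weights
```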
model.eval() will set all the layers in your model to evaluation mode. Use model.train() (training mode) only during the training loop.
Intensity is a measure of light and dark, similar to brightness.
To identify the edges of an object, look at abrupt changes in intensity
Filters
To detect changes in intensity in an image, filters look at groups of pixels and react to alternating patterns of dark/light pixels, producing an output that shows the edges of objects and differing textures.
Edges
Areas in an image where the intensity changes very quickly.
A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:
The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:
convolutional neural network
A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:
Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.
pooling
Reducing a matrix (or matrices) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area. For example, suppose we have the following 3x3 matrix:
A pooling operation, just like a convolutional operation, divides that matrix into slices and then slides that convolutional operation by strides. For example, suppose the pooling operation divides the convolutional matrix into 2x2 slices with a 1x1 stride. As the following diagram illustrates, four pooling operations take place. Imagine that each pooling operation picks the maximum value of the four in that slice:
Pooling helps enforce translational invariance in the input matrix.
Pooling for vision applications is known more formally as spatial pooling. Time-series applications usually refer to pooling as temporal pooling. Less formally, pooling is often called subsampling or downsampling.
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0)
x = F.relu(self.conv1(x))
arguments:
- in_channels: number of input channels (depth)
- out_channels: number of output channels
- kernel_size: height and width (square) of the convolutional kernel
- stride: default 1
- padding: default 0
pooling layers
down-sampling factors (kernel size and stride)
self.pool = nn.MaxPool2d(2,2)
x = F.relu(self.conv1(x))
x = self.pool(x)
example #1
self.conv1 = nn.Conv2d(1, 16, 2, stride=2)
- grayscale images (depth 1)
- 16 filters
- filter size 2x2
- filters jump 2 pixels at a time (stride 2)
example #2
self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
sequential models
import torch.nn as nn

class ModelName(nn.Module):
    def __init__(self):
        super(ModelName, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 2, stride=2),
            nn.MaxPool2d(2, 2),
            nn.ReLU(True),
            nn.Conv2d(16, 32, 3, padding=1),
            nn.MaxPool2d(2, 2),
            nn.ReLU(True)
        )
formula: number of parameters in a convolutional layer
- K: number of filters
- F: filter size
- D_in: depth of the input (last value in the input shape)
(K * F*F * D_in) + K
formula: shape of a convolutional layer
- K: number of filters (the output depth)
- F: filter size
- S: stride
- P: padding
- W_in: size of the previous layer
((W_in - F + 2P) / S) + 1
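Both formulas as a small helper (the function names are ours):

```python
def conv_layer_params(K, F, D_in):
    # (weights per filter) * (number of filters) + one bias per filter
    return (K * F * F * D_in) + K

def conv_output_size(W_in, F, S, P):
    return ((W_in - F + 2 * P) // S) + 1

# e.g. nn.Conv2d(1, 16, 2, stride=2) applied to a 28x28 grayscale image:
print(conv_layer_params(K=16, F=2, D_in=1))      # 80 parameters
print(conv_output_size(W_in=28, F=2, S=2, P=0))  # 14 (a 14x14 output, depth 16)
```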
flattening
reshaping the feature maps into a vector so that all values can be seen by a linear classification layer
data augmentation
Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your data set doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your data set to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.
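With torchvision this is usually a transform pipeline; a sketch with illustrative parameters:

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),  # reflect
    transforms.RandomRotation(10),      # rotate by up to +/-10 degrees
    transforms.ToTensor(),
])
```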
translational invariance
In an image classification problem, an algorithm's ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.
size invariance
In an image classification problem, an algorithm's ability to successfully classify images even when the size of the image changes. For example, the algorithm can still identify a cat whether it consumes 2M pixels or 200K pixels. Note that even the best image classification algorithms still have practical limits on size invariance. For example, an algorithm (or human) is unlikely to correctly classify a cat image consuming only 20 pixels.
rotational invariance
In an image classification problem, an algorithm's ability to successfully classify images even when the orientation of the image changes. For example, the algorithm can still identify a tennis racket whether it is pointing up, sideways, or down. Note that rotational invariance is not always desirable; for example, an upside-down 9 should not be classified as a 9.
784
28*28*1 values = 784
With a stride of 4, nn.MaxPool2d(2, 4) and nn.MaxPool2d(4, 4) both down-sample an input's x-y dimensions by a factor of 4.
For the following quiz questions, consider an input image that is 130x130 (x, y) and 3 in depth (RGB). Say this image goes through the following layers in order:
nn.Conv2d(3, 10, 3)
nn.MaxPool2d(4, 4)
nn.Conv2d(10, 20, 5, padding=2)
nn.MaxPool2d(2, 2)
Q: After going through all four of these layers in sequence, what is the depth of the final output?
A: 20
E: The final depth is determined by the last convolutional layer, which has depth = out_channels = 20.
Q: What is the x-y size of the output of the final maxpooling layer? Be careful to look at how the 130x130 image passes through (and shrinks) as it moves through each convolutional and pooling layer.
A: 16
E: The 130x130 image shrinks by two after the first convolutional layer, then is down-sampled by 4 then 2 after each successive maxpooling layer!
((W_in - F + 2P) / S) + 1
((130 - 3 + 2*0) / 1) + 1 = 128
128 / 4 = 32
((32 - 5 + 2*2) / 1) + 1 = 32
32 / 2 = 16
Q: How many values, total, will be left after an image passes through all four of the above layers in sequence?
A: 16*16*20
E: It's the x-y size of the final output times the number of final channels/depth = 16*16 * 20.
In the style loss, a is a constant that accounts for the number of values in each layer, and w are the style weights.
Q: Given a convolutional layer with dimensions d x h x w = (20 x 8 x 8), what length will one row of the vectorized convolutional layer have? (Vectorized means that the spatial dimensions are flattened.)
A: 64 (8*8 = 64)
Q: Given the same convolutional layer, d x h x w = (20 x 8 x 8), what dimensions (h x w) will the resultant Gram matrix have?
A: (20 x 20)
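Checking both answers with a quick sketch (the feature values are random):

```python
import torch

features = torch.randn(20, 8, 8)             # d x h x w from the quiz
vectorized = features.view(20, -1)           # flatten spatial dims -> each row has 8*8 = 64 values
gram = torch.mm(vectorized, vectorized.t())  # Gram matrix -> 20 x 20
print(vectorized.shape, gram.shape)
```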
RNN (Recurrent Neural Networks)
A neural network that is intentionally run multiple times, where parts of each run feed into the next run. Specifically, hidden layers from the previous run provide part of the input to the same hidden layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so that the hidden layers can learn from previous runs of the neural network on earlier parts of the sequence.
For example, the following figure shows a recurrent neural network that runs four times. Notice that the values learned in the hidden layers from the first run become part of the input to the same hidden layers in the second run. Similarly, the values learned in the hidden layer on the second run become part of the input to the same hidden layer in the third run. In this way, the recurrent neural network gradually trains and predicts the meaning of the entire sequence rather than just the meaning of individual words.
LSTM (Long Short-Term Memory)
LSTMs are an improvement over RNNs, and are quite useful when a network needs to switch between remembering recent things and things from a long time ago.
Architecture of LSTM
forget gate
the long-term memory (LTM) goes here, where the network forgets everything it doesn't consider useful
learn gate
the short-term memory and the current event are joined together, keeping the recently learned information and removing anything unnecessary
remember gate
the long-term memory that hasn't been forgotten yet and the newly learned information are joined together to update the long-term memory
use gate
decides what information to use, from what was previously known plus what was just learned, to make a prediction. The output becomes both the prediction and the new short-term memory (STM).
Q: Say you've defined a recurrent layer with input_size = 100, hidden_size = 20, and num_layers=1. What will the dimensions of the hidden state be if you're passing in data, batch first, in batches of 3 sequences at a time?
A: (1, 3, 20), that is, (num_layers, batch_size, hidden_dim).
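Checking the answer with nn.LSTM (the sequence length of 5 is arbitrary); note that the hidden state keeps the (num_layers, batch_size, hidden_dim) layout even with batch_first=True:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=20, num_layers=1, batch_first=True)
x = torch.randn(3, 5, 100)  # a batch of 3 sequences, each 5 steps long
out, (h, c) = lstm(x)
print(h.shape)              # torch.Size([1, 3, 20])
```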
Subclass torch.jit.ScriptModule and add the @torch.jit.script_method decorator to convert a model to a script module. Use the save method to serialize the script module to a file, which can then be loaded into C++.
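A toy sketch following that recipe (the module itself is made up):

```python
import torch

class MyModule(torch.jit.ScriptModule):
    def __init__(self):
        super(MyModule, self).__init__()
        self.weight = torch.nn.Parameter(torch.rand(10, 10))

    @torch.jit.script_method
    def forward(self, x):
        return torch.mm(x, self.weight)

module = MyModule()
module.save("my_module.pt")  # this file can be loaded from C++ with libtorch
```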