【Stanford ML Exercise4 Week5】Neural Networks Learning

Implement the backpropagation algorithm for neural networks and apply it to the task of hand-written digit recognition.

1. Neural Networks

  • implement the backpropagation algorithm to learn the parameters for the neural network.

1.1 Visualizing the data

  • 5000 training examples
  • Each training example is a 20 pixel by 20 pixel grayscale image of a digit
  • The 20 by 20 grid of pixels is “unrolled” into a 400-dimensional vector

1.2 Model representation

  • 3 layers – an input layer, a hidden layer and an output layer

1.3 Feedforward and cost function

  • Implement the cost function and gradient for the neural network
  • The terms that correspond to the bias should not be regularized

Cost function with regularization:

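$$J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log\big(h_\Theta(x^{(i)})_k\big) - (1-y_k^{(i)})\log\big(1-h_\Theta(x^{(i)})_k\big)\right] + \frac{\lambda}{2m}\left[\sum_{j=1}^{25}\sum_{k=1}^{400}\big(\Theta^{(1)}_{j,k}\big)^2 + \sum_{j=1}^{10}\sum_{k=1}^{25}\big(\Theta^{(2)}_{j,k}\big)^2\right]$$

Here m is the number of training examples and K = 10 is the number of output classes; the regularization sums skip the bias columns of Θ(1) and Θ(2).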

2. Backpropagation

  • compute the gradient for the neural network cost function

     

2.1 Sigmoid gradient

Gradient for the sigmoid function:

$$g'(z) = \frac{d}{dz}g(z) = g(z)\,\big(1 - g(z)\big), \qquad g(z) = \frac{1}{1 + e^{-z}}$$
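The sigmoid gradient is easy to sanity-check numerically. The sketch below is a scalar Python version (the exercise itself uses Octave and works element-wise on matrices); the function names are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_gradient(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Compare against a central-difference approximation at a few points.
h = 1e-5
for z in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(numeric - sigmoid_gradient(z)) < 1e-8
```

The gradient peaks at z = 0, where it equals 0.25, and approaches 0 for large positive or negative z.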

2.2 Random initialization

  • When training neural networks, it is important to initialize the parameters randomly to break symmetry.

epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
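For comparison, here is a minimal Python sketch of the same initialization using only the standard library. The function name and the `seed` parameter are assumptions made for this example; the weights are stored as a list of rows rather than a matrix.

```python
import random

def rand_initialize_weights(l_in, l_out, epsilon_init=0.12, seed=None):
    """Return an l_out x (1 + l_in) list of rows with entries drawn
    uniformly from [-epsilon_init, epsilon_init]. The extra column
    holds the weights for the bias unit."""
    rng = random.Random(seed)
    return [[rng.random() * 2 * epsilon_init - epsilon_init
             for _ in range(1 + l_in)]
            for _ in range(l_out)]

# Weights from a 400-unit input layer to a 25-unit hidden layer.
W = rand_initialize_weights(400, 25, seed=0)
```

Because every weight starts at a different small value, the hidden units do not all compute the same function, which is what "symmetry breaking" refers to.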

2.3 Backpropagation

Intuition behind the backpropagation algorithm:

  • Given a training example (x^(t), y^(t)), first run a “forward pass” to compute all the activations throughout the network
  • For each node j in layer l, compute an “error term” δ_j^(l) that measures how much that node was “responsible” for any errors in the output

     

Steps 1-4 to implement backpropagation:
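The notes do not list the steps, so the following is only a sketch of the usual formulation for a 3-layer sigmoid network, written for a single training example and without regularization: (1) forward pass, (2) output error δ(3) = a(3) − y, (3) hidden error δ(2) = Θ(2)ᵀδ(3) .* g'(z(2)) with the bias entry dropped, (4) gradients as outer products δ(l+1)·a(l)ᵀ. All function names are illustrative; for m examples the gradients would be accumulated and divided by m.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta1, theta2, x):
    """Forward pass; a1 and a2 include the bias unit as element 0."""
    a1 = [1.0] + list(x)
    a2 = [1.0] + [sigmoid(sum(w * a for w, a in zip(row, a1))) for row in theta1]
    a3 = [sigmoid(sum(w * a for w, a in zip(row, a2))) for row in theta2]
    return a1, a2, a3

def cost(theta1, theta2, x, y):
    """Unregularized cross-entropy cost for one example."""
    _, _, a3 = forward(theta1, theta2, x)
    return sum(-yk * math.log(ak) - (1 - yk) * math.log(1 - ak)
               for yk, ak in zip(y, a3))

def backprop(theta1, theta2, x, y):
    """Gradients of the one-example cost with respect to theta1, theta2."""
    a1, a2, a3 = forward(theta1, theta2, x)                 # step 1
    delta3 = [ak - yk for ak, yk in zip(a3, y)]             # step 2
    # Step 3: back-propagate through theta2, skipping the bias column;
    # for sigmoid units g'(z2) equals a2 * (1 - a2).
    delta2 = [sum(theta2[k][j] * delta3[k] for k in range(len(delta3)))
              * a2[j] * (1 - a2[j])
              for j in range(1, len(a2))]
    # Step 4: each gradient is the outer product delta(l+1) * a(l)'.
    grad1 = [[d * a for a in a1] for d in delta2]
    grad2 = [[d * a for a in a2] for d in delta3]
    return grad1, grad2
```

A numerical gradient check (perturb one weight by a small epsilon and difference the cost) is a common way to confirm such an implementation before training.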

 

Stanford ML Week 6 learning Notes: Advice for Applying Machine Learning

Ways to improve ML performance:

  1. Get more training examples: fixes high variance (not guaranteed to help otherwise)
  2. Try a smaller set of features to prevent overfitting: fixes high variance
  3. Try getting additional features for more information: fixes high bias
  4. Try adding polynomial features: fixes high bias
  5. Try decreasing lambda: fixes high bias
  6. Try increasing lambda: fixes high variance

Evaluating an Algorithm:

ML diagnostic: a test that gives insight into what is and isn’t working in the algorithm

Model Selection and Train/Validation/Test Sets

Model selection: choose what degree of polynomial to fit

How well does the chosen model generalize?

  1. Problem: if the degree d is chosen by its performance on the test set, then J computed on that same test set is an overly optimistic estimate of generalization error, because d was fit to the test set.
  2. Solution: evaluate on examples the model has never seen. Select the model with the cv set instead of the test set, and use the test set only for the final evaluation.
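The selection procedure can be sketched in a few lines of Python. The 60/20/20 split ratio, the function names, and the cv error values below are assumptions for illustration only; in practice each cv error would come from training a degree-d model on the training set and evaluating it on the cv set.

```python
import random

def split_train_cv_test(examples, seed=0):
    """Shuffle and split into 60% train, 20% cv, 20% test (assumed ratio)."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_cv = len(items) * 2 // 10
    return (items[:n_train],
            items[n_train:n_train + n_cv],
            items[n_train + n_cv:])

def select_degree(cv_error_by_degree):
    """Pick the polynomial degree d with the lowest cross-validation error."""
    return min(cv_error_by_degree, key=cv_error_by_degree.get)

train, cv, test = split_train_cv_test(range(100))

# Hypothetical cv errors for d = 1..4; not measured results.
best_d = select_degree({1: 0.42, 2: 0.19, 3: 0.15, 4: 0.23})
```

Only after `best_d` is fixed would the test set be used, once, to report the generalization error.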

Diagnosing Bias vs Variance

curve: x-axis: polynomial degree d, y-axis: error

underfit (small d): high bias, training error high, cv error high and close to the training error

overfit (large d): high variance, training error low, cv error much higher than the training error
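These two patterns can be turned into a rough rule of thumb. The sketch below is only a heuristic: the function name, the `acceptable_error` target, and the `gap_ratio` threshold are all assumptions introduced here, not values from the course.

```python
def diagnose(train_error, cv_error, acceptable_error, gap_ratio=2.0):
    """Rough bias/variance label from training and cv error.
    Both thresholds are illustrative assumptions."""
    if train_error > acceptable_error:
        # The model cannot even fit the training data well.
        return "high bias (underfit)"
    if cv_error > acceptable_error and cv_error > gap_ratio * train_error:
        # Fits the training data but fails to generalize.
        return "high variance (overfit)"
    return "ok"

print(diagnose(0.30, 0.32, acceptable_error=0.10))
print(diagnose(0.02, 0.25, acceptable_error=0.10))
```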

Regularization and Bias/Variance

curve: x-axis: lambda, y-axis: error

underfit: high bias, large lambda

overfit: high variance, small lambda

As the regularization parameter lambda increases, training error rises, while cv error first falls and then rises again

when lambda is small: high variance, training error low, cv error high

when lambda is large: high bias, training error high, cv error high

Learning Curves

learning curve: y-axis: error, x-axis: training set size m

m small: training error low, cv error high

m large: training error higher, cv error lower

If a learning algorithm suffers from high bias, a larger training set does not help much; if it suffers from high variance, more training data is likely to help.

 

Prioritizing What to Work On

e.g. Build a spam classifier

How should you spend your time to improve the model?

  • Collect lots of data
  • Develop more features based on email routing information
  • Develop more features based on the message body
  • Develop an algorithm to detect misspellings

Error Analysis

  1. Start with a model and test it on the cv data
  2. Plot learning curves to decide whether more data or more features are likely to help
  3. Error analysis: manually examine the cv examples the model got wrong

Error Metrics for Skewed Classes

skewed classes: one class has far more examples than the others

A better way to judge whether a model is performing well:

  • Precision: true positives / number of predicted positives
  • Recall: true positives / number of actual positives
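As a small illustration, the two definitions can be computed directly from label lists. This is a sketch with made-up labels (1 = positive, 0 = negative); the function name and the convention of returning 0.0 when a denominator is zero are choices made for this example.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in y_pred if p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Made-up labels: 3 actual positives out of 10 examples.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
p, r = precision_recall(y_true, y_pred)

# A classifier that always predicts 0 is 70% accurate here but has zero recall.
p0, r0 = precision_recall(y_true, [0] * len(y_true))
```

The always-negative case shows why accuracy alone is misleading on skewed classes.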

Trading Off Precision and Recall

raising the threshold: higher precision, lower recall; lowering it: lower precision, higher recall

How to choose a good threshold:

F1 score: 2*P*R/(P+R), where P is precision and R is recall
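A short sketch of using this score to pick a threshold follows. The (precision, recall) pairs are hypothetical numbers for illustration; in practice they would be measured on the cv set for each candidate threshold.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined as 0 when both P and R are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) per threshold; not measured results.
candidates = {0.3: (0.40, 0.90), 0.5: (0.60, 0.70), 0.9: (0.95, 0.10)}
best_threshold = max(candidates, key=lambda t: f1_score(*candidates[t]))
```

Unlike the plain average (P+R)/2, this score is low whenever either P or R is low, so it does not reward extreme thresholds.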

Data for Machine Learning

 

 

【Stanford ML Exercise3】 Multi-class Classification and Neural Networks

In this exercise, use logistic regression and neural networks to recognize handwritten digits (from 0 to 9)

1. Multi-class Classification

  • First part: extend previous logistic regression, apply to one-vs-all classification

1.1 Dataset

5000 training examples, each training example is a 20 pixel by 20 pixel grayscale image of the digit.

1.3 Vectorizing Logistic Regression

  • train 10 separate logistic regression classifiers
  • implement a vectorized version of logistic regression that does not employ any for loops

1.3.3 Vectorizing regularized logistic regression

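cost function:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\big(h_\theta(x^{(i)})\big) - (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

the partial derivative of regularized logistic regression cost:

$$\frac{\partial J(\theta)}{\partial\theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_0^{(i)}$$

$$\frac{\partial J(\theta)}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad (j \ge 1)$$
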
Code: lrCostFunction.m
function [J, grad] = lrCostFunction(theta, X, y, lambda)

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
J = 0;
grad = zeros(size(theta));

h = sigmoid(X*theta); % hypothesis for all m examples (m x 1)

% Cost: cross-entropy term plus the penalty on theta(2:end);
% the bias parameter theta(1) is not regularized
J = (-y'*log(h) - (1 - y)'*log(1 - h))/m;
J = J + (lambda/(2*m))*sum(theta(2:end).^2);

% Vectorized gradient; add the regularization term to every entry except the bias
grad = (X'*(h - y))/m;
grad(2:end) = grad(2:end) + (lambda/m)*theta(2:end);

% =============================================================

grad = grad(:);

end

1.4 One-vs-all Classification

  • implement one-vs-all classification by training multiple regularized logistic regression classifiers
  • train one classifier for each class

    return all the classifier parameters in a matrix

Code: oneVsAll.m

function [all_theta] = oneVsAll(X, y, num_labels, lambda)

% Some useful variables
m = size(X, 1);
n = size(X, 2);

% You need to return the following variables correctly 
all_theta = zeros(num_labels, n + 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

for c = 1:num_labels
    % Train one regularized classifier for class c against all other classes
    initial_theta = zeros(n + 1, 1);
    options = optimset('GradObj', 'on', 'MaxIter', 50);
    theta = fmincg(@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
                   initial_theta, options);
    all_theta(c, :) = theta(:);
end

% =========================================================================

end

1.4.1 One-vs-all Prediction

  • compute the “probability” that it belongs to each class using the trained logistic regression classifiers

Code: predictOneVsAll.m

function p = predictOneVsAll(all_theta, X)

m = size(X, 1);
num_labels = size(all_theta, 1);

% You need to return the following variables correctly 
p = zeros(size(X, 1), 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

[p_val,p] = max(sigmoid(X*all_theta'), [], 2);

% =========================================================================
end
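The prediction rule is an argmax over the per-class probabilities. The following plain-Python sketch mirrors the Octave code without matrix operations; the function names and toy parameters are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(all_theta, X):
    """all_theta: one parameter row per class, bias weight first.
    X: feature rows without the bias term.
    Returns 1-based class labels, as the Octave code does."""
    labels = []
    for x in X:
        xb = [1.0] + list(x)  # prepend the bias feature
        probs = [sigmoid(sum(t * v for t, v in zip(theta, xb)))
                 for theta in all_theta]
        labels.append(probs.index(max(probs)) + 1)
    return labels

# Toy parameters for 3 classes and 2 features (made up for this example).
toy_theta = [[0, 1, 0], [0, 0, 1], [0, -1, -1]]
print(predict_one_vs_all(toy_theta, [[2, 0], [0, 3], [-1, -1]]))
```

Since the sigmoid is monotonic, taking the argmax of the raw scores would select the same class; the sigmoid only matters if the probabilities themselves are needed.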

2. Neural Networks

  • Logistic regression cannot form more complex hypotheses because it is only a linear classifier
  • The neural network will be able to represent complex models that form non-linear hypotheses

2.1 Model representation

  • It has 3 layers – an input layer, a hidden layer and an output layer
  • The images are 20×20 pixels, which gives 400 input layer units
  • The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 output units (corresponding to the 10 digit classes)

2.2 Feedforward Propagation and Prediction

  • implement feedforward propagation for the neural network

Code: predict.m

function p = predict(Theta1, Theta2, X)

% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);

% You need to return the following variables correctly 
p = zeros(size(X, 1), 1);

A2 = sigmoid([ones(m,1), X] *Theta1');

h = sigmoid([ones(m,1), A2]*Theta2');

[p_val,p] = max(h, [], 2);

% =========================================================================

end
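The same feedforward pass can be sketched for a single example in plain Python. The helper names and the tiny 2-2-2 network below are assumptions made for illustration; the exercise's network is 400-25-10.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(theta, activations):
    """Apply one layer: prepend the bias unit, then sigmoid(theta * a)."""
    a = [1.0] + list(activations)
    return [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]

def predict(theta1, theta2, x):
    """Return the 1-based index of the most activated output unit."""
    a2 = layer(theta1, x)   # hidden layer activations
    h = layer(theta2, a2)   # output layer activations
    return h.index(max(h)) + 1

# Toy weights (bias weight first in each row), made up for this example.
toy_theta1 = [[0, 10, 0], [0, 0, 10]]
toy_theta2 = [[0, 5, -5], [0, -5, 5]]
print(predict(toy_theta1, toy_theta2, [1, 0]), predict(toy_theta1, toy_theta2, [0, 1]))
```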

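【Algorithm Study Notes 3】2017.07.19 Finally got to sorting…

2.1 Elementary sorts

2.1.1 Rules of the game

2.1.2 Selection sort

Method: find the smallest item and exchange it with the first entry; find the smallest of the remaining items and exchange it with the second entry. Repeat until the last entry.
Cost: for an array of length N, N exchanges and about (N*N)/2 compares
Drawback: running time does not depend on the initial order of the input
Advantage: data movement is minimal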
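The exchange and compare counts above can be observed with a small instrumented version. This is a Python sketch (the notes follow a Java textbook); the function name and the returned counters are additions for illustration.

```python
def selection_sort(a):
    """Sort list a in place; return (compares, exchanges)."""
    n = len(a)
    compares = exchanges = 0
    for i in range(n):
        smallest = i
        # Find the smallest item in a[i:].
        for j in range(i + 1, n):
            compares += 1
            if a[j] < a[smallest]:
                smallest = j
        # Exchange it into position i (even if it is already there).
        a[i], a[smallest] = a[smallest], a[i]
        exchanges += 1
    return compares, exchanges

data = [5, 2, 4, 1, 3]
counts = selection_sort(data)
```

For N = 5 this performs N(N-1)/2 = 10 compares and N = 5 exchanges, and the counts are the same for an already sorted input, which is the insensitivity to input order noted above.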

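2.1.3 Insertion sort

Method: take the items one at a time, like playing cards, and insert each into its proper place among those already sorted; every item to the right of the insertion point moves one position right to make room.
Cost: for an array of length N, about (N*N)/4 exchanges and (N*N)/4 compares on average; about (N*N)/2 exchanges and (N*N)/2 compares in the worst case; 0 exchanges and N-1 compares in the best case
Advantage: very efficient for partially sorted arrays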
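A matching instrumented sketch of insertion sort shows the best and worst cases. Again this is Python rather than the textbook's Java, and the counters are additions for illustration.

```python
def insertion_sort(a):
    """Sort list a in place; return (compares, exchanges)."""
    compares = exchanges = 0
    for i in range(1, len(a)):
        j = i
        # Move a[i] left until the item before it is not larger.
        while j > 0:
            compares += 1
            if a[j] < a[j - 1]:
                a[j], a[j - 1] = a[j - 1], a[j]
                exchanges += 1
                j -= 1
            else:
                break
    return compares, exchanges

best = insertion_sort([1, 2, 3, 4, 5])    # already sorted
worst = insertion_sort([5, 4, 3, 2, 1])   # reverse order
```

With N = 5, the sorted input costs N-1 = 4 compares and 0 exchanges, while the reversed input costs N(N-1)/2 = 10 of each.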

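【Algorithm Study Notes 2】2017.06.14

1.1.7 APIs
The Math library lives in java.lang
java.util.Arrays provides sort
Which libraries need an import, and which do not? (Classes in java.lang, such as Math, are available without an import; others, such as java.util.Arrays, must be imported.)
The purpose of an API is to separate the calling code from the implementation.
1.1.8 Strings
1.1.8.1 String concatenation
1.1.8.2 Type conversion
Integer.parseInt
Integer.toString
Double.parseDouble
Double.toString
1.1.8.3 Automatic conversion
If one operand of + is a String, Java automatically converts the other operand to a String
Concatenating with the empty string "" converts any value to a String
1.1.8.4 Command-line arguments
1.1.9 Input and output
1.1.9.1 Commands and arguments
1.1.9.2 Standard output
1.1.9.3 Formatted output
StdOut.printf()
1.1.9.4 Standard input
1.1.9.5 Redirection and piping
1.1.9.6 Input and output from a file
1.1.10 Binary search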

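【Algorithm Study Notes 1】2017.05.28

1.1.1 Basic structure of a Java program
[Not yet understood]
To execute a Java program, first compile it with the javac command, then run it with the java command. For example, to run BinarySearch, first type javac BinarySearch.java (this creates a file named BinarySearch.class containing the program's Java bytecode), then type java BinarySearch (followed by the name of a whitelist file) to transfer control to the bytecode.
1.1.2 Primitive data types and expressions
Primitive data types: int, double, boolean, char
int values: integers between -2^31 and +2^31-1, 32-bit two's complement
int operators: add, subtract, multiply, divide, remainder
double values: double-precision real numbers, 64-bit, IEEE 754 standard
boolean operators: && (and), || (or), ! (not), ^ (xor)
char values: characters, 16-bit
Operator precedence: *, /, % bind tighter than +, -
                     ! binds tighter than &&, which binds tighter than ||
Type conversion: a numeric value is automatically promoted to a wider type when no information would be lost. Casting a double to an int truncates the fractional part rather than rounding.
Other primitive types: long (64-bit integer), short (16-bit integer), byte (8-bit integer), float (32-bit single-precision real)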
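1.1.3 Statements
1.1.4 Shortcut notations
If the body of a conditional or loop is a single statement, the {} may be omitted
1.1.5 Arrays
1.1.5.1 Creating and initializing an array
  1. Declare the array name and type
  2. Create the array: the length (number of elements) must be specified; keyword: new
  3. Initialize the array elements
Arrays must be created explicitly because the Java compiler cannot know at compile time how much space to reserve for the array
double[] a; // declare array a
a = new double[N]; // create array a
for (int i = 0; i < N; i++) // initialize
    a[i] = 0.0;
1.1.5.2 Short form
double[] a = new double[N]; // declare + create + default initial values
int[] a = {1, 1, 2, 3, 5, 8}; // declare + create + initialize with given values
The default initial value of double elements is 0.0
The default initial value of boolean elements is false
1.1.5.3 Using an array
Once created, an array's size is fixed
1.1.5.4 Aliasing
Assigning one array variable to another makes both variables refer to the same array.
To copy an array, declare, create, and initialize a new array, then copy the original's values into it one by one.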
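1.1.5.5 Two-dimensional arrays
1.1.6 Static methods
A static method is a sequence of statements executed one after another when the method is called.
Signature + body
Signature = public static + return type + method name + typed parameters
Recursion: three rules
  1. There is always a base case: the first statement is always a conditional containing a return
  2. Recursive calls always address a smaller subproblem, so that they converge to the base case
  3. The subproblems addressed by recursive calls must not overlap
Recursion is concise and easy to understand, and it allows program performance (computational complexity) to be estimated mathematically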
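A recursive binary search is a compact example that satisfies all three rules. The sketch below is in Python rather than Java, and the function name `rank` and its default arguments are choices made for this illustration.

```python
def rank(key, a, lo=0, hi=None):
    """Return the index of key in sorted list a, or -1 if it is absent."""
    if hi is None:
        hi = len(a) - 1
    if lo > hi:
        # Rule 1: the base case comes first and returns.
        return -1
    mid = lo + (hi - lo) // 2
    if key < a[mid]:
        # Rule 2: the range [lo, mid - 1] is strictly smaller.
        return rank(key, a, lo, mid - 1)
    if key > a[mid]:
        # Rule 3: [mid + 1, hi] does not overlap [lo, mid - 1].
        return rank(key, a, mid + 1, hi)
    return mid

sorted_values = [1, 3, 5, 7, 9, 11]
print(rank(7, sorted_values), rank(4, sorted_values))
```

Each call halves the range, so the number of calls grows logarithmically with the list length, which is the kind of mathematical estimate the note refers to.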
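A library of static methods is a set of static methods defined in a Java class
A class declaration has the form public class + class name + {static methods}
The file that stores the class has the same name as the class, with the extension .java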