Stanford ML Week 6 Learning Notes: Advice for Applying Machine Learning

Improving ML performance:

  1. Get more training examples: not guaranteed to help; fixes high variance
  2. Try smaller sets of features: prevents overfitting; fixes high variance
  3. Try getting additional features: adds more information; fixes high bias
  4. Try adding polynomial features: fixes high bias
  5. Try decreasing lambda: fixes high bias
  6. Try increasing lambda: fixes high variance

Evaluating a Learning Algorithm

ML diagnostic: a test you can run to gain insight into what is or isn't working about a learning algorithm, and how best to improve its performance.

Model Selection and Train/Validation/Test Sets

Model selection: choosing which degree of polynomial to fit (more generally, which model or hyperparameters to use).

How well does this model generalize?

  1. Using the test set to calculate J for model selection is a problem: it gives an overly optimistic estimate of the generalization error, because the polynomial degree d was chosen precisely for its performance on the test set.
  2. Solution: evaluate on examples that were not used to make any choices. Instead of using the test set to select the model, use the cross-validation (cv) set to select the model and the test set only to report the final error (a small sketch follows below).
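
A minimal sketch of this split, assuming scikit-learn and a simple polynomial regression; the 60/20/20 split, the toy data, and the candidate degrees are illustrative assumptions, not from the notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Toy 1-D regression data, just for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.standard_normal(200)

# 60% train / 20% cross-validation / 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Select the polynomial degree d on the cross-validation set only.
best_d, best_cv_err, best_model = None, np.inf, None
for d in range(1, 11):
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    cv_err = mean_squared_error(y_cv, model.predict(X_cv))
    if cv_err < best_cv_err:
        best_d, best_cv_err, best_model = d, cv_err, model

# Only now touch the test set, to get an unbiased estimate of generalization error.
test_err = mean_squared_error(y_test, best_model.predict(X_test))
print(f"chosen degree d={best_d}, cv error={best_cv_err:.3f}, test error={test_err:.3f}")
```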

Diagnosing Bias vs Variance

curve: x-axis: polynomial degree d, y-axis: error

underfit (small d): high bias; training error high, cv error also high (close to the training error)

overfit (large d): high variance; training error low, cv error high (much larger than the training error)
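
As a rough rule of thumb (a sketch, not from the notes), the diagnosis can be read off the two error numbers; the comparison thresholds below are arbitrary assumptions.

```python
def diagnose(train_err, cv_err, baseline_err):
    """Rough bias/variance diagnosis from training and cross-validation error.

    baseline_err is a reference level (e.g. the error you would be happy with);
    the comparisons below are heuristics, not exact thresholds.
    """
    high_bias = train_err > baseline_err      # model can't even fit the training set well
    high_variance = cv_err > 2 * train_err    # large gap between train and cv error
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (underfitting)"
    if high_variance:
        return "high variance (overfitting)"
    return "looks fine"

print(diagnose(train_err=0.30, cv_err=0.32, baseline_err=0.10))  # -> high bias
print(diagnose(train_err=0.02, cv_err=0.25, baseline_err=0.10))  # -> high variance
```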

Regularization and Bias/Variance

curve: x-axis: lambda, y-axis: error

underfit: high bias, large lambda

overfit: high variance, small lambda

As the regularization parameter lambda increases, the training error increases (the model fits the training set less closely), while the cv error first decreases and then increases.

When lambda is small: high variance; training error low, cv error high

When lambda is large: high bias; training error high, cv error high
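
A sketch of choosing lambda on the cross-validation set, using scikit-learn's Ridge (L2-regularized linear regression) as a stand-in for the course's regularized cost; the data, the degree-8 features, and the lambda grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.3 * rng.standard_normal(200)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=1)

# Sweep lambda (called alpha in scikit-learn) and record both errors.
lambdas = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100]
for lam in lambdas:
    model = make_pipeline(PolynomialFeatures(8), Ridge(alpha=lam))
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    cv_err = mean_squared_error(y_cv, model.predict(X_cv))
    print(f"lambda={lam:>6}: train error={train_err:.3f}, cv error={cv_err:.3f}")

# Small lambda: train error low, cv error high (variance).
# Large lambda: both errors high (bias). Pick the lambda with the lowest cv error.
```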

Learning Curves

learning curve: y-axis: error, x-axis: training set size

m small: training error low (a few examples are easy to fit), cv error high

m large: training error rises, cv error falls; the two curves approach each other

If a learning algorithm is suffering from high bias, increasing the training set size will not help much by itself (both errors plateau at a high value); if it is suffering from high variance, increasing the training set size is likely to help (the gap between training and cv error shrinks).
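
A sketch of computing a learning curve by training on increasing subsets of the data; scikit-learn also provides a learning_curve helper, but the explicit loop mirrors the description above. The data and the subset sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = 2 * X.ravel() + 1 + 0.5 * rng.standard_normal(300)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=2)

# Train on the first m examples; evaluate on those m and on the full cv set.
for m in [5, 10, 20, 50, 100, len(X_train)]:
    model = LinearRegression().fit(X_train[:m], y_train[:m])
    train_err = mean_squared_error(y_train[:m], model.predict(X_train[:m]))
    cv_err = mean_squared_error(y_cv, model.predict(X_cv))
    print(f"m={m:>3}: train error={train_err:.3f}, cv error={cv_err:.3f}")

# High bias: both curves flatten out at a high error; more data won't help much.
# High variance: a persistent gap between the curves; more data tends to close it.
```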

 

Prioritizing What to Work On

e.g. Build a spam classifier

How should we spend our time to make the classifier better?

  • Collect lots of data
  • Develop more sophisticated features based on email routing information
  • Develop more sophisticated features based on the message body
  • Develop algorithms to detect deliberate misspellings

Error Analysis

  1. Start with a simple model, implement it quickly, and test it on the cv data.
  2. Plot learning curves to decide whether more data or more features are likely to help.
  3. Error analysis: manually examine the examples (in the cv set) the algorithm made errors on, and look for systematic patterns.

Error Metrics for Skewed Classes

Skewed classes: one class has many more examples than the other class(es), so raw accuracy is misleading (always predicting the common class already scores well).

A better way to examine whether a model with skewed classes is performing well (computed in the sketch after the definitions):

  • Precision: true positives / # of predicted positives
  • Recall: true positives / # of actual positives
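
A minimal sketch of computing these from raw predictions; the label convention (1 = the rare positive class) and the toy arrays are assumptions.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall, treating label 1 as the (rare) positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# With skewed classes, "predict 0 always" gets 99% accuracy but precision/recall of 0.
y_true = [0] * 99 + [1]
y_pred = [0] * 100
print(precision_recall(y_true, y_pred))   # (0.0, 0.0)
```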

Trading Off Precision and Recall

Increasing the threshold: higher precision, lower recall; decreasing the threshold: higher recall, lower precision.

How to choose a good trade-off:

F1 score: 2PR / (P + R)
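
A sketch of using the F1 score to pick a probability threshold on the cross-validation set; the candidate thresholds and the toy labels/probabilities are illustrative assumptions.

```python
import numpy as np

def f1(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Assume y_cv are true labels and p_cv are predicted probabilities on the cv set.
y_cv = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
p_cv = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.65, 0.3, 0.05, 0.9, 0.55])

best_t, best_f1 = None, -1.0
for t in [0.3, 0.5, 0.7, 0.9]:                 # candidate thresholds
    y_pred = (p_cv >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_cv == 1))
    fp = np.sum((y_pred == 1) & (y_cv == 0))
    fn = np.sum((y_pred == 0) & (y_cv == 1))
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    score = f1(p, r)
    if score > best_f1:
        best_t, best_f1 = t, score

print(f"best threshold={best_t}, F1={best_f1:.3f}")
```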

Data for Machine Learning

 

 


