Blog

Data Visualisation of Route on Map Python Tutorial

This tutorial introduces how to plot routes on a map using Python. We are going to use the Basemap package; other packages, such as GCMap and Plotly, offer similar functionality.

Install Basemap package

Installing the mpl_toolkits.basemap Python package is not hard. I am working in an installed IPython environment, so the install command looks like:

!apt-get install -y python-mpltoolkits.basemap

After that, import the basic Python packages needed in this tutorial.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

Load Data

In this tutorial, we are going to use a set of pseudo trip data. The plotting data should contain at least four variables: start latitude, start longitude, stop latitude, and stop longitude.

df = pd.read_csv("pseudo_trip.csv")

Pandas loads the data into its default DataFrame format.
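A quick check that the expected columns are present (these are the column names used by the plotting code below; adjust them if your file differs):

print(df[['start_lat', 'start_lon', 'stop_lat', 'stop_lon']].head())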

Plot Data

We will plot each start point as a green circle marker, each destination point as a red circle marker, and draw a great-circle line between the two points.

fig = plt.figure(figsize=(15,15))

xbuf = 0.2
ybuf = 0.35
minlat = np.min([df.stop_lat.min(), df.start_lat.min()])
minlon = np.min([df.stop_lon.min(), df.start_lon.min()])
maxlat = np.max([df.stop_lat.max(), df.start_lat.max()])
maxlon = np.max([df.stop_lon.max(), df.start_lon.max()])
width = maxlon - minlon
height = maxlat - minlat

m = Basemap(llcrnrlon=minlon - width* xbuf,
            llcrnrlat=minlat - height*ybuf,
            urcrnrlon=maxlon + width* xbuf,
            urcrnrlat=maxlat + height*ybuf,
            projection='merc',
            resolution='l',
            lat_0=minlat + height/2,
            lon_0=minlon + width/2,)

m.drawmapboundary(fill_color='#191970')
m.drawcoastlines()
m.drawstates()
m.drawcountries()
m.fillcontinents(color='black',lake_color='#191970')

for i in df.index:
    # draw a great-circle line between the start and stop points
    m.drawgreatcircle(df['start_lon'][i], df['start_lat'][i], 
                      df['stop_lon'][i], df['stop_lat'][i], 
                      linewidth=3, alpha=0.2, color='white')
    # start point in green, destination point in red
    m.plot(*m(df['start_lon'][i], df['start_lat'][i]), 
           color='g', markersize=10, alpha=0.8, marker='o')
    m.plot(*m(df['stop_lon'][i], df['stop_lat'][i]), 
           color='r', alpha=0.5, marker='o')

fig.text(0.15, 0.20, 
         "Plotted using Python, Basemap", 
         ha='left', color='white', style='italic')
fig.text(0.15, 0.18, 
         "data.com", 
         color='white', fontsize=16, ha='left')
plt.savefig('Map.png', dpi=150, 
            frameon=True, transparent=False, 
            bbox_inches='tight', 
            pad_inches=0.2)

The result looks like this:

[map image of the plotted routes]

Last, choose a color you like!


Python/Pandas/NumPy Data Analysis Cheat Notes

Order by column

df.sort_values(['Column1', 'Column2'], ascending = False)

Count by value

df['Column'].value_counts()

Check ‘NaN’ in a dataframe

df.isnull()
df.isnull().sum(axis=0).reset_index()

Drop ‘NaN’ rows in a dataframe

df = df.dropna(subset=['Column']).reset_index()

Drop row

df.drop(df.index[[1, 3]], inplace=True)

Drop column

del df['Column']

Rename columns

df = df.rename(columns = {'A':'a', 'B':'b'})

Reset index

df = df.reset_index()

lambda function

df.column.apply(lambda x: function(x))

Write pd.Dataframe to Google Big Query

pd.io.gbq.to_gbq(df, 'DatasetId.tableId', 'ProjectId', if_exists = 'replace')

 

Linux Cheat Notes

Copy local file to Google cloud storage

!gsutil -m cp $file_name $bucket_name

Copy bucket file to local

!gsutil cp $bucket_file .

List files

ls $file_name
ls -a # list all files
ls -l # long format
ls -s # list file size
ls -S # list sort by size
ls -t # list sort by time

Make folder

mkdir $folder_name

Write file (IPython cell magic)

%%writefile $folder_name/$file_name
#Put code here

 

Load Dataset with tensorflow

In this article, we walk through a brief hands-on notebook on loading data with TensorFlow. TensorFlow can load data directly from local disk, and it can also load data stored in Google Cloud Storage if you give it a storage bucket path.

import tensorflow as tf
import os

Specify data file path

First import the modules. Then specify the path of the data in Google Cloud Storage. TensorFlow will read all CSV files under the “gs://bucket_name/folder_name/” path.

  • Directly load the local files:
input_file_names = tf.train.match_filenames_once('file_name.csv')
  • Load from Google Storage:
input_dir = 'gs://bucket_name'
file_prefix = 'folder_name/'
input_file_names = tf.train.match_filenames_once(os.path.join(input_dir, '{}*{}'.format(file_prefix, '.csv')))

Shuffle and read data

Shuffle the input files and skip the header line of each file. Each call to the reader loads one line of the dataset into the ‘example’ variable.

filename_queue = tf.train.string_input_producer(input_file_names, num_epochs=15, shuffle=True)
reader = tf.TextLineReader(skip_header_lines=1)
_,example = reader.read(filename_queue)

Give column names to data

The data has three columns: the first is numeric, the second is a string, and the third is the target. ‘record_defaults’ declares the type and default value of each column. We then map the three columns to the features ‘x1’, ‘x2’ and ‘target’.

record_defaults = [[0.], ['red'], ['False']]
c1, c2, c3 = tf.decode_csv(example, record_defaults = record_defaults)
features = {'x1': c1, 'x2':c2, 'target':c3}

Begin tensorflow session…print first 10 lines

Here we start the TensorFlow session. First declare a session, then run the variable initializers. In this code, we print out the first 10 parsed rows of the data.

sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())

with sess as sess:
    # queue runners feed the filename queue in background threads
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        print(sess.run(features))

    coord.request_stop()
    coord.join(threads)

【Stanford ML Exercise4 Week5】Neural Networks Learning

Implement the backpropagation algorithm for neural networks and apply it to the task of hand-written digit recognition.

1. Neural Networks

  • implement the backpropagation algorithm to learn the parameters for the neural network.

1.1 Visualizing the data

[figure: a sample of the handwritten digit images]

  • 5000 training examples

    each training example is a 20 pixel by 20 pixel grayscale image of the digit

    The 20 by 20 grid of pixels is “unrolled” into a 400-dimensional vector

     

1.2 Model representation

  • 3 layers – an input layer, a hidden layer and an output layer

[figure: network architecture with input, hidden and output layers]

1.3 Feedforward and cost function

  • implement the cost function and gradient for the neural network
  • should not be regularizing the terms that correspond to the bias

Cost function with regularization:

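Written out, with K output classes and the bias weights excluded from the regularization sum, the cost from the exercise is:

J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log\big((h_\Theta(x^{(i)}))_k\big)-(1-y_k^{(i)})\log\big(1-(h_\Theta(x^{(i)}))_k\big)\right]+\frac{\lambda}{2m}\sum_{l}\sum_{j}\sum_{k}\big(\Theta_{j,k}^{(l)}\big)^2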

2. Backpropagation

  • compute the gradient for the neural network cost function

     

2.1 Sigmoid gradient

Gradient for the sigmoid function:

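Written out, for g(z) = 1 / (1 + e^{-z}):

g'(z) = \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) = g(z)\,\big(1-g(z)\big)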

2.2 Random initialization

  • When training neural networks, it is important to randomly initialize the parameters for symmetry breaking.
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;

2.3 Backpropagation

Intuition behind the backpropagation algorithm:

  • Given a training example (x(t),y(t)), first run a “forward pass” to compute all the activations throughout the network
  • For each node j in layer l, compute an “error term” δ_j(l) that measures how much that node was “responsible” for any errors in our output

     

Steps 1-4 to implement backpropagation:
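Briefly, for each training example the four steps are:

  1. Run a forward pass to compute the activations a(1), a(2), a(3)
  2. Compute the output-layer error: δ(3) = a(3) - y
  3. Propagate it back to the hidden layer: δ(2) = (Θ(2))^T δ(3) .* sigmoidGradient(z(2)), dropping the bias term
  4. Accumulate the gradients: Δ(l) = Δ(l) + δ(l+1) (a(l))^T, then divide by m (adding (λ/m) Θ(l) for the non-bias columns) to obtain the partial derivatives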

 

Stanford ML Week 6 Learning Notes: Advice for Applying Machine Learning

Ways to improve ML performance:

  1. Get more training examples: not certain to help (fixes high variance)
  2. Try smaller sets of features: prevents overfitting (fixes high variance)
  3. Try getting additional features: more information (fixes high bias)
  4. Try adding polynomial features (fixes high bias)
  5. Try decreasing lambda (fixes high bias)
  6. Try increasing lambda (fixes high variance)

Evaluating an Algorithm:

ML diagnostic: a test you run to gain insight into what is or isn’t working about the algorithm

Model Selection and Train/Validation/Test Sets

Model selection: choose what degree of polynomial to fit to the data

How well does this model generalize?

  1. Use the test set to calculate J. Problem: this gives an overly optimistic estimate of the generalization error, because the polynomial degree d was chosen based on its performance on the test set.
  2. Solution: evaluate on data the model has not seen before. Instead of using the test set to select the model, use the cross-validation set to select the model and the test set only to estimate the generalization error (see the split sketch below).
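As a minimal sketch of the split (in Python rather than the course’s Octave; X and y are placeholder arrays), a 60/20/20 partition looks like:

import numpy as np

def train_cv_test_split(X, y, seed=0):
    # shuffle indices, then cut into 60% train, 20% cross-validation, 20% test
    idx = np.random.RandomState(seed).permutation(len(X))
    n_train = int(0.6 * len(X))
    n_cv = int(0.8 * len(X))
    train, cv, test = idx[:n_train], idx[n_train:n_cv], idx[n_cv:]
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])

The cross-validation set is used to pick the degree d (or lambda), and the held-out test set is used only once, to report the generalization error of the chosen model.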

Diagnosing Bias vs Variance

curve: x-axis: polynomial degree d, y-axis: error

underfit: high bias, training error high, cv error high (x-axis is d)

overfit: high variance, training error low, cv error high

Regularization and Bias/Variance

curve: x-axis: lambda, y-axis: error

underfit: high bias, large lambda

overfit: high variance, small lambda

As the regularization parameter lambda increases, training error increases, while cv error first decreases and then increases

when lambda small: high variance, training error low, cv error high

when lambda large: high bias, training error high, cv error high

Learning Curves

learning curve: y-axis: error, x-axis: training set size

m small: training error low, cv error high

m large: training error increases, cv error decreases, and the two curves converge

If a learning algorithm is suffering from high bias, increasing the training set size does not help much; if it is suffering from high variance, increasing the training set size is likely to help.
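A minimal sketch of such a learning curve in Python (synthetic data; a straight-line fit to a quadratic target, i.e. a deliberately high-bias model; not the course’s Octave code):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, 200)
y = X ** 2 + rng.normal(0, 1, 200)          # quadratic target with noise
X_train, y_train = X[:120], y[:120]
X_cv, y_cv = X[120:], y[120:]

def fit_line(x, t):
    # least-squares fit of a straight line
    A = np.column_stack([np.ones_like(x), x])
    w, _, _, _ = np.linalg.lstsq(A, t, rcond=None)
    return w

def cost(w, x, t):
    return np.mean((w[0] + w[1] * x - t) ** 2) / 2

sizes = range(5, len(X_train) + 1, 5)
train_err = [cost(fit_line(X_train[:m], y_train[:m]), X_train[:m], y_train[:m]) for m in sizes]
cv_err = [cost(fit_line(X_train[:m], y_train[:m]), X_cv, y_cv) for m in sizes]

plt.plot(list(sizes), train_err, label='train error')
plt.plot(list(sizes), cv_err, label='cv error')
plt.xlabel('training set size m')
plt.ylabel('error')
plt.legend()
plt.show()

Because the model is too simple, both errors flatten out at a high value, which is the high-bias picture described above.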

 

Prioritizing What to Work On

e.g. Build a spam classifier

How to spend time to make the model better?

  • Collect lots of data
  • Develop more features based on email routing
  • Develop more features based on message
  • Develop algorithm for misspelling

Error Analysis

  1. Start with a simple model and test it on the cross-validation data
  2. Plot learning curves to decide whether to get more data or more features
  3. Error analysis: manually examine the examples the model gets wrong

Error Metrics for Skewed Classes

skewed classes: there are a lot more examples of one class than of the other classes

A better way to examine whether a model is performing well:

  • Precision: True positive/# of predicted positive
  • Recall: True positive/# of actual positive

Trading Off Precision and Recall

Increasing the threshold: higher precision, lower recall (and vice versa)

How to choose a good trade-off:

F Score: 2*(PR)/(P+R)
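As a quick sketch in Python (y_true and y_pred are made-up 0/1 label arrays, just to show the formulas):

import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0])
print(precision_recall_f1(y_true, y_pred))       # (0.75, 0.75, 0.75)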

Data for Machine Learning

 

 

【Stanford ML Exercise3】 Multi-class Classification and Neural Networks

In this exercise, we use logistic regression and neural networks to recognize handwritten digits (0 to 9).

1. Multi-class Classification

  • First part: extend previous logistic regression, apply to one-vs-all classification

1.1 Dataset

5000 training examples, each training example is a 20 pixel by 20 pixel grayscale image of the digit.

1.3 Vectorizing Logistic Regression

  • train 10 separate logistic regression classifiers
  • implement a vectorized version of logistic regression that does not employ any for loops

1.3.3 Vectorizing regularized logistic regression

Cost function:

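Written out (this is what lrCostFunction.m below computes):

J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\big(h_\theta(x^{(i)})\big)-(1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2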

The partial derivative of the regularized logistic regression cost:
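Written out (matching the gradient code below; the bias term θ_0 is not regularized):

\frac{\partial J}{\partial \theta_0}=\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)x_0^{(i)},\qquad \frac{\partial J}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)})-y^{(i)}\big)x_j^{(i)}+\frac{\lambda}{m}\theta_j\quad (j \ge 1)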
Code: lrCostFunction.m
function [J, grad] = lrCostFunction(theta, X, y, lambda)

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
J = 0;
grad = zeros(size(theta));

J = - y'*log(sigmoid(X*theta)) - (1 - y)'*log(1 - sigmoid(X*theta));
J = J/m;
J = J + (lambda/(2*m))*sum(theta(2:length(theta)).^2);

j = 1;
grad(j) = (sigmoid(X*theta) - y)'*X(:,1);
grad(j) = grad(j)/m;

grad(2:length(theta)) = grad(2:length(theta)) + X(:,2:length(theta))'*(sigmoid(X*theta) - y);
grad(2:length(theta)) = grad(2:length(theta))/m;
grad(2:length(theta)) = grad(2:length(theta)) + (lambda/m)*theta(2:length(theta));

% =============================================================

grad = grad(:);

end

1.4 One-vs-all Classification

  • implement one-vs-all classification by training multiple regularized logistic regression classifiers
  • train one classifier for each class

    return all the classifier parameters in a matrix

Code: oneVsAll.m

function [all_theta] = oneVsAll(X, y, num_labels, lambda)

% Some useful variables
m = size(X, 1);
n = size(X, 2);

% You need to return the following variables correctly 
all_theta = zeros(num_labels, n + 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

for c = 1:num_labels
    % train one regularized logistic regression classifier per class,
    % treating class c as the positive class (y == c)
    initial_theta = zeros(n + 1, 1);
    options = optimset('GradObj', 'on', 'MaxIter', 50);
    [theta] = ...
        fmincg(@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
               initial_theta, options);
    all_theta(c,:) = theta(:);
end

% =========================================================================

end

1.4.1 One-vs-all Prediction

  • compute the “probability” that it belongs to each class using the trained logistic regression classifiers

Code: predictOneVsAll.m

function p = predictOneVsAll(all_theta, X)

m = size(X, 1);
num_labels = size(all_theta, 1);

% You need to return the following variables correctly 
p = zeros(size(X, 1), 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

[p_val,p] = max(sigmoid(X*all_theta'), [], 2);

% =========================================================================
end

2. Neural Networks

  • logistic regression cannot form more complex hypotheses as it is only a linear classifier
  • The neural network will be able to represent complex models that form non-linear hypotheses

2.1 Model representation

  • It has 3 layers – an input layer, a hidden layer and an output layer
  • the images are of size 20×20, which gives us 400 input layer units
  • The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 output units (corresponding to the 10 digit classes)

2.2 Feedforward Propagation and Prediction

  • implement feedforward propagation for the neural network

Code: predict.m

function p = predict(Theta1, Theta2, X)

% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);

% You need to return the following variables correctly 
p = zeros(size(X, 1), 1);

A2 = sigmoid([ones(m,1), X] *Theta1');

h = sigmoid([ones(m,1), A2]*Theta2');

[p_val,p] = max(h, [], 2);

% =========================================================================

end