Load a Dataset with TensorFlow

In this article, we walk through a brief hands-on notebook for loading data with TensorFlow. TensorFlow can load data directly from the local file system, and it can also load data stored in Google Cloud Storage if you give it a storage bucket path.

import tensorflow as tf
import os

Specify data file path

First, import the modules. Then specify the path of the data in Google Storage. TensorFlow will read all CSV files under the “gs://bucket_name/folder_name/” path.

  • Directly load the local files:
input_file_names = tf.train.match_filenames_once('file_name.csv')
  • Load from Google Storage:
input_dir = 'gs://bucket_name'
file_prefix = 'folder_name/'
input_file_names = tf.train.match_filenames_once(os.path.join(input_dir, '{}*{}'.format(file_prefix, '.csv')))
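As a sanity check, the format call above just builds a glob pattern, which match_filenames_once then expands (bucket and folder names are the placeholders from above):

```python
import os

input_dir = 'gs://bucket_name'
file_prefix = 'folder_name/'

pattern = '{}*{}'.format(file_prefix, '.csv')
print(pattern)                           # folder_name/*.csv
print(os.path.join(input_dir, pattern))  # gs://bucket_name/folder_name/*.csv
```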

Shuffle and read data

Shuffle the input data and skip the first (header) line of each file. Each loaded line is stored in the ‘example’ variable.

filename_queue = tf.train.string_input_producer(input_file_names, num_epochs=15, shuffle=True)
reader = tf.TextLineReader(skip_header_lines=1)
_, example = reader.read(filename_queue)

Give column names to data

The data has three columns: the first is numeric, the second is a string, and the third is the target. ‘record_defaults’ declares the type and default value of each column. Then we put the three columns into the features ‘x1’, ‘x2’, and ‘target’.

record_defaults = [[0.], ['red'], ['False']]
c1, c2, c3 = tf.decode_csv(example, record_defaults=record_defaults)
features = {'x1': c1, 'x2': c2, 'target': c3}
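tf.decode_csv converts each CSV line into typed tensors, using record_defaults both to declare each column’s type and to fill in empty fields. A rough pure-Python sketch of that behaviour (a simplification for intuition, not the real TensorFlow implementation):

```python
def decode_csv_line(line, record_defaults):
    """Split a CSV line; empty fields fall back to the typed default."""
    values = []
    for field, default in zip(line.split(','), record_defaults):
        if field == '':
            values.append(default[0])       # use the declared default
        elif isinstance(default[0], float):
            values.append(float(field))     # numeric column
        else:
            values.append(field)            # string column
    return values

decode_csv_line('3.5,blue,True', [[0.], ['red'], ['False']])
# [3.5, 'blue', 'True']
decode_csv_line(',,', [[0.], ['red'], ['False']])
# [0.0, 'red', 'False']
```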

Begin a TensorFlow session and print the first 10 lines

Here we begin the TensorFlow session. First declare the session, then run the variable initializers (match_filenames_once stores its file list in a local variable, so both the global and local initializers are needed). Finally, start the queue runners and print the first 10 lines of the data.

with tf.Session() as sess:
    # match_filenames_once keeps the matched file list in a local
    # variable, so both initializers must run before reading
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for i in range(10):
        print(sess.run(features))
    coord.request_stop()
    coord.join(threads)

【Big Data Project2】Recommendation System

Based on item collaborative filtering (Item CF): build a co-occurrence matrix from users’ historical rating data.
Define the relationship between movies using the rating history of users.
Represent the relationship between movies in a co-occurrence matrix, and represent the differences between movies with a rating matrix. Multiply the co-occurrence matrix by the rating matrix using MapReduce jobs.
User CF & Item CF
  • User CF: a form of collaborative filtering based on the similarity between users, calculated from their ratings of items.
  • Item CF: a form of collaborative filtering based on the similarity between items, calculated from users’ ratings of those items.
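For intuition, the similarity between items is often measured with something like cosine similarity between the items’ rating vectors. A minimal sketch (cosine similarity is one common choice, not necessarily what this project uses, and the vectors below are made up):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (one entry per user)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# two movies rated by the same three users
cosine_similarity([5, 3, 0], [4, 0, 2])   # ~0.77
```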

Why use item CF?
  • The number of users typically far outweighs the number of items, so the item-item matrix is much smaller.
  • Items change less frequently than users’ tastes, which lowers the amount of recalculation.
  • It uses the user’s own historical data, which makes the recommendations more convincing.

Procedure of Item CF

  • Build co-occurrence matrix.
  • Build rating matrix.
  • Multiply the matrices to get the recommendation results.

How to define relationship between movies?
We can use:
  • user’s watching history, user’s rating history, favourite list.
  • movie’s category, movie’s producer

How to represent the relationship between movies?
  • Co-occurrence matrix: the number of times each pair of movies has been seen by the same user (then normalized row by row).
  • Rating matrix: each user’s ratings of the movies they have seen.

Data preprocessor

  • Divide the raw data by user_id (Mapper).
  • Merge the records with the same user_id (Reducer).

Build co-occurrence matrix

The number of times one movie has been seen by different users (the diagonal of the co-occurrence matrix):
  • Divide by movie_id (Mapper).
  • Count by movie_id (Reducer).
The number of times two movies have been seen by the same user (the off-diagonal entries):
  • Emit Movie A : Movie B pairs (Mapper).
  • Count identical Movie A : Movie B pairs (Reducer).
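The two counting jobs above can be collapsed into one in-memory sketch: the mapper emits a “MovieA:MovieB” key with count 1 for every pair of movies in a user’s history (including A:A, which produces the diagonal), and the reducer sums identical keys. The data and names here are hypothetical:

```python
from collections import defaultdict

# output of the preprocessing step: movies grouped by user_id
user_movies = {'u1': ['m1', 'm2'], 'u2': ['m1', 'm3']}

# Mapper: emit ('A:B', 1) for every ordered movie pair per user
mapped = []
for movies in user_movies.values():
    for a in movies:
        for b in movies:
            mapped.append(('{}:{}'.format(a, b), 1))

# Reducer: sum the counts of identical keys
cooccurrence = defaultdict(int)
for key, count in mapped:
    cooccurrence[key] += count

cooccurrence['m1:m1']   # 2  (diagonal: m1 seen by both users)
cooccurrence['m1:m2']   # 1
```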

Multiply co-occurrence matrix with rating matrix
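This step combines the two matrices: each candidate movie’s score for a user is the dot product of that movie’s co-occurrence row with the user’s rating vector. A minimal dict-based sketch with made-up numbers (a real implementation distributes this across MapReduce jobs):

```python
cooccurrence = {'m1': {'m1': 2, 'm2': 1, 'm3': 1},
                'm2': {'m1': 1, 'm2': 1, 'm3': 0},
                'm3': {'m1': 1, 'm2': 0, 'm3': 1}}
ratings = {'u1': {'m1': 5.0, 'm2': 3.0, 'm3': 0.0}}

scores = {}
for user, user_ratings in ratings.items():
    # dot product of each co-occurrence row with the user's rating vector
    scores[user] = {
        movie: sum(row.get(m, 0) * r for m, r in user_ratings.items())
        for movie, row in cooccurrence.items()
    }

scores['u1']   # {'m1': 13.0, 'm2': 8.0, 'm3': 5.0}
```

The unseen movie (‘m3’ here) ends up with a nonzero score, which is what gets recommended.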

【Big Data Project1】AutoComplete

Google Auto-Completion Big Data Project
  • Built an offline SQL database through two MapReduce data pipelines.
  • Ran on Hadoop through Docker to serve the online website, which automatically suggests likely following words after the user types a word.

What is Docker?
  • A container.
  • Containers VS Virtual Machines.


What is an N-gram?
  • An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
  • Here, the n-gram library contains the 1-grams, 2-grams, …, n-grams together.
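Extracting the n-grams of a single order is a one-liner; a minimal sketch:

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences from a list of tokens."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

ngrams('big data is fun'.split(), 2)
# ['big data', 'data is', 'is fun']
```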

A language model
A language model is a probability distribution over entire sentences or texts.

How to build a Language Model?
  • A conditional probability: P(word | phrase) = Count(phrase followed by word) / Count(phrase).
  • Simplification: the denominator is the same for every candidate word, so just compare Count(phrase followed by word).
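With toy counts, the conditional probability works out as follows (the counts here are invented for illustration):

```python
from collections import Counter

# counts of 2-grams that start with the phrase 'big'
ngram_counts = Counter({'big data': 3, 'big deal': 1})

phrase = 'big'
total = sum(c for gram, c in ngram_counts.items()
            if gram.startswith(phrase + ' '))
p_data = ngram_counts['big data'] / total
p_data   # 3 / 4 = 0.75
```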

N-gram model
N-gram model: use probability to predict next word/phrase.

How does MapReduce work in this project?

Used MapReduce job 1 to
  • process the raw data: separate each sentence into phrases and trim non-alphabetical symbols (Mapper).
  • count identical phrases to build the N-gram library (Reducer).
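A minimal in-memory sketch of job 1, assuming the input has already been cleaned into token lists (a Counter stands in for the shuffle-and-reduce step):

```python
from collections import Counter

def build_ngram_library(token_lists, max_n):
    """Mapper: cut each sentence into 1..max_n-grams; Reducer: count them."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[' '.join(tokens[i:i + n])] += 1
    return counts

lib = build_ngram_library([['big', 'data'], ['big', 'deal']], 2)
lib['big']        # 2
lib['big data']   # 1
```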

Used MapReduce job 2 to

  • separate each n-gram into “input phrase + following word + count” (Mapper).
  • integrate the mapper outputs with the same “input phrase” and store them in a database in the format “input phrase + {following word1: count1, following word2: count2, …}” (Reducer).
  • threshold to the top k results to show.
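A sketch of job 2 with hypothetical counts: the mapper splits each n-gram into its input phrase and following word, and the reducer groups by phrase and keeps only the top k:

```python
from collections import defaultdict

ngram_counts = {'big data is': 3, 'big data was': 1, 'big data rocks': 2}

# Mapper: split each n-gram into (input phrase, (following word, count))
grouped = defaultdict(list)
for gram, count in ngram_counts.items():
    *phrase, word = gram.split()
    grouped[' '.join(phrase)].append((word, count))

# Reducer: keep only the top-k following words for each input phrase
k = 2
top_k = {phrase: sorted(pairs, key=lambda wc: -wc[1])[:k]
         for phrase, pairs in grouped.items()}

top_k['big data']   # [('is', 3), ('rocks', 2)]
```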

What to store in the database? 

  • Input phrase + Predicted word.
  • Probability/Count.

Document preprocessing
  • Read the large-scale document collection sentence by sentence.
  • Remove all non-alphabetical symbols.
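The cleaning step can be sketched with a regular expression (one reasonable interpretation of “remove all non-alphabetical symbols”):

```python
import re

def preprocess(sentence):
    """Lowercase, replace non-alphabetical symbols with spaces, tokenize."""
    return re.sub(r'[^a-z\s]', ' ', sentence.lower()).split()

preprocess("Hello, World! It's 2016.")
# ['hello', 'world', 'it', 's']
```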

Frequently Used Class(Java) in MapReduce Job
  • IntWritable is a Writable class that wraps a Java int.
  • Text is a Writable class that wraps a Java String.
  • Configuration is like system properties in Java.
  • Job allows the user to configure the job, submit it, control its execution, and query its state.
  • Context allows the Mapper and Reducer to interact with the rest of the Hadoop system.

How to debug a normal Java application?
  • System.out.println(); the MapReduce-framework equivalent is to use a Counter.
  • Set breakpoints.


Project workflow
  • Read a large-scale text document collection.
  • MapReduce job1: build N-gram library.
  • MapReduce job2: calculate probability.
  • Run the project on Hadoop.