【Big Data Project2】Recommendation System

Based on item collaborative filtering (Item CF): built a co-occurrence matrix from users' historical data.
Defined the relationship between movies using users' rating history.
Represented the relationship between movies with a co-occurrence matrix and users' preferences with a rating matrix, then multiplied the co-occurrence matrix by the rating matrix with MapReduce jobs.
User CF & Item CF
  • User CF: a form of collaborative filtering based on the similarity between users, calculated from their ratings of the same items.
  • Item CF: a form of collaborative filtering based on the similarity between items, calculated from users' ratings of those items.

Why use item CF?
  • The number of users is usually far larger than the number of items, so the item-item matrix is smaller.
  • Items change less frequently than users, which lowers the cost of recalculation.
  • It uses the user's own historical data, so the recommendations are more convincing.

Procedure of Item CF

  • Build co-occurrence matrix.
  • Build rating matrix.
  • Matrix computation to get the recommendation result.

How to define relationship between movies?
We can use:
  • the user's watching history, rating history, and favourite list;
  • the movie's category and producer.

How to represent the relationship between movies?
  • Co-occurrence matrix: the number of times two movies have been watched by the same user, then normalized.
  • Rating matrix: each user's ratings of the movies they have seen.
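As a minimal plain-Java sketch (the real project uses Hadoop MapReduce jobs; the class name and toy data here are hypothetical), the co-occurrence matrix for three users and three movies can be built like this:

```java
import java.util.Arrays;

public class CoOccurrenceDemo {
    // Build the co-occurrence matrix from each user's watch list.
    // Diagonal entries count how often a single movie was seen;
    // off-diagonal entries count how often two movies were seen by the same user.
    static int[][] coOccurrence(int[][] watched, int numMovies) {
        int[][] co = new int[numMovies][numMovies];
        for (int[] movies : watched)
            for (int a : movies)
                for (int b : movies)
                    co[a][b]++;
        return co;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: three users' watched movie ids
        int[][] watched = { {0, 1}, {0, 1, 2}, {1, 2} };
        System.out.println(Arrays.deepToString(coOccurrence(watched, 3)));
        // [[2, 2, 1], [2, 3, 2], [1, 2, 2]]
    }
}
```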

Data preprocessor

  • Divide the raw data by user_id (Mapper).
  • Merge the records that share the same user_id (Reducer).
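The two steps above can be sketched in plain Java, simulating the Mapper/Reducer roles with an in-memory map; the `user_id,movie_id,rating` record format and class name are assumptions, not the project's exact code:

```java
import java.util.*;

public class DataPreprocessor {
    // "Mapper" + shuffle: group "user_id,movie_id,rating" records by user_id
    static Map<String, List<String>> map(List<String> lines) {
        Map<String, List<String>> byUser = new TreeMap<>();
        for (String line : lines) {
            String[] f = line.split(",");              // user_id,movie_id,rating
            byUser.computeIfAbsent(f[0], k -> new ArrayList<>())
                  .add(f[1] + ":" + f[2]);             // emit (user_id, "movie:rating")
        }
        return byUser;
    }

    // "Reducer": merge all ratings of one user into a single record
    static List<String> reduce(Map<String, List<String>> byUser) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : byUser.entrySet())
            out.add(e.getKey() + "\t" + String.join(",", e.getValue()));
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("1,10001,5.0", "2,10001,4.0", "1,10002,3.0");
        System.out.println(reduce(map(raw)));          // one merged record per user
    }
}
```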

Build co-occurrence matrix

Times one movie has been seen across all users (diagonal of the co-occurrence matrix)
  • Divide by movie_id (Mapper).
  • Count by movie_id (Reducer).
Times two movies have been seen by the same user (off-diagonal entries of the co-occurrence matrix)
  • Divide into "Movie A : Movie B" pairs (Mapper).
  • Count identical "Movie A : Movie B" keys (Reducer).
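The pair-counting logic above can be sketched in plain Java, with the shuffle simulated by an in-memory map (class names hypothetical; the real jobs run on Hadoop). Emitting every ordered pair, including A:A, produces the diagonal and off-diagonal entries in one pass:

```java
import java.util.*;

public class CoOccurrenceBuilder {
    // Mapper: for each user's movie list, emit every ordered pair "A:B" -> 1
    // (pairs with A == B yield the diagonal, i.e. total views of one movie)
    static List<String> map(List<List<String>> userMovies) {
        List<String> pairs = new ArrayList<>();
        for (List<String> movies : userMovies)
            for (String a : movies)
                for (String b : movies)
                    pairs.add(a + ":" + b);
        return pairs;
    }

    // Reducer: count identical "A:B" keys
    static Map<String, Integer> reduce(List<String> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String p : pairs) counts.merge(p, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> data = Arrays.asList(
            Arrays.asList("10001", "10002"),
            Arrays.asList("10001", "10002", "10003"));
        System.out.println(reduce(map(data)));
    }
}
```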

Multiply co-occurrence matrix with rating matrix
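The multiplication step can be sketched in plain Java (the real project splits it across MapReduce jobs; this in-memory version and its toy numbers are only illustrative). Each row of the co-occurrence matrix is normalized by its row sum, then multiplied by one user's rating vector to produce a recommendation score per movie:

```java
public class Recommend {
    // Multiply the row-normalized co-occurrence matrix by one user's rating vector.
    static double[] scores(int[][] co, double[] ratings) {
        int n = co.length;
        double[] result = new double[n];
        for (int i = 0; i < n; i++) {
            int rowSum = 0;
            for (int j = 0; j < n; j++) rowSum += co[i][j];   // normalization denominator
            for (int j = 0; j < n; j++)
                result[i] += (double) co[i][j] / rowSum * ratings[j];
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] co = { {2, 2, 1}, {2, 3, 2}, {1, 2, 2} };
        double[] userRatings = { 5.0, 0.0, 3.0 };   // 0.0 = unseen movie
        double[] s = scores(co, userRatings);
        System.out.println(java.util.Arrays.toString(s));
    }
}
```

In practice, movies the user has already seen would be filtered out before ranking, so the unseen movie (id 1 here) would be the one recommended.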


【Big Data Project1】AutoComplete

Google Auto-Completion Big Data Project
  • Built the offline SQL database through two MapReduce data pipelines.
  • Ran the pipelines on Hadoop through Docker and served an online website that automatically suggests likely following words as the user types.

What is Docker?
  • A lightweight containerization platform.
  • Container vs. Virtual Machine: a container shares the host OS kernel instead of running a full guest OS, so it is lighter and faster to start.


What is an N-gram?
  • An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
  • N-gram library = 1-gram + 2-gram + … + n-gram.
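Assuming whitespace-tokenized input (method and class names are hypothetical), extracting all 1- to n-grams from a sentence can be sketched as:

```java
import java.util.*;

public class NGrams {
    // Extract all n-grams of length 1..n from a tokenized sentence
    static List<String> ngrams(String sentence, int n) {
        String[] words = sentence.split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int len = 1; len <= n; len++)
            for (int i = 0; i + len <= words.length; i++)
                grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + len)));
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("I love big data", 2));
        // [I, love, big, data, I love, love big, big data]
    }
}
```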

A language model
A languagemodelis a probability distribution over entire sentences or texts.

How to build a Language Model?
  • A conditional probability: P(word | phrase) = Count(phrase followed by word) / Count(phrase).
  • Simplify: Count(phrase) is the same for every candidate word, so ranking by Count(phrase followed by word) alone gives the same order.
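A worked example of the conditional probability with made-up counts (all numbers hypothetical):

```java
public class NextWord {
    public static void main(String[] args) {
        // Hypothetical counts taken from an n-gram library
        int bigData  = 500;   // Count("big data")
        int bigApple = 120;   // Count("big apple")
        int big      = 800;   // Count("big")

        // Full conditional probabilities
        double pData  = (double) bigData  / big;   // P("data"  | "big") = 0.625
        double pApple = (double) bigApple / big;   // P("apple" | "big") = 0.15

        // The denominator Count("big") is shared, so ranking by the raw
        // counts 500 vs. 120 gives the same order -- the simplification above.
        System.out.println(pData > pApple);        // true
    }
}
```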

N-gram model
N-gram model: use probability to predict next word/phrase.

How does MapReduce work in this project?

Used MapReduce1 job to
  • process the raw data: separate each sentence into phrases and trim non-alphabetical symbols (Mapper).
  • count identical phrases to build the N-gram library (Reducer).
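The Reducer's counting step can be sketched in plain Java (the mapper output list here is hypothetical; the real job uses the Hadoop API):

```java
import java.util.*;

public class BuildLibrary {
    // Reducer: sum the occurrences of each identical phrase emitted by the mapper
    static Map<String, Integer> reduce(List<String> mapperOutput) {
        Map<String, Integer> library = new TreeMap<>();
        for (String gram : mapperOutput) library.merge(gram, 1, Integer::sum);
        return library;
    }

    public static void main(String[] args) {
        // Hypothetical mapper output: grams from two short sentences
        List<String> grams = Arrays.asList(
            "big", "data", "big data", "big", "apple", "big apple");
        System.out.println(reduce(grams));
        // {apple=1, big=2, big apple=1, big data=1, data=1}
    }
}
```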

Used MapReduce2 job to

  • separate each n-gram into “input phrase + following word + count” (Mapper).
  • integrate the mapper output that shares the same “input phrase” and store it in the database in the format “input phrase + {following word1: count1, following word2: count2, …}” (Reducer).
  • keep only the top-k following words above a threshold.
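The split-and-top-k logic of the second job can be sketched in plain Java (class name and toy counts are hypothetical; the real reducer writes to the SQL database):

```java
import java.util.*;

public class NGramToDb {
    // Mapper: split each "phrase word" gram into (phrase, "word:count")
    static Map<String, List<String>> map(Map<String, Integer> ngramCounts) {
        Map<String, List<String>> byPhrase = new TreeMap<>();
        for (Map.Entry<String, Integer> e : ngramCounts.entrySet()) {
            String gram = e.getKey();
            int cut = gram.lastIndexOf(' ');
            if (cut < 0) continue;                    // unigram: no preceding phrase
            String phrase = gram.substring(0, cut);
            String word = gram.substring(cut + 1);
            byPhrase.computeIfAbsent(phrase, k -> new ArrayList<>())
                    .add(word + ":" + e.getValue());
        }
        return byPhrase;
    }

    // Reducer: keep only the top-k following words per phrase, by count
    static Map<String, List<String>> reduce(Map<String, List<String>> byPhrase, int k) {
        for (List<String> words : byPhrase.values()) {
            words.sort((a, b) -> Integer.parseInt(b.split(":")[1])
                               - Integer.parseInt(a.split(":")[1]));
            if (words.size() > k) words.subList(k, words.size()).clear();
        }
        return byPhrase;
    }

    public static void main(String[] args) {
        Map<String, Integer> grams = new TreeMap<>();
        grams.put("big data", 10);
        grams.put("big apple", 3);
        System.out.println(reduce(map(grams), 1));   // {big=[data:10]}
    }
}
```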

What to store in the database? 

  • Input phrase + Predicted word.
  • Probability/Count.

Document preprocessing
  • Read the large-scale document collection sentence by sentence.
  • Remove all non-alphabetical symbols.
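The symbol-stripping step is essentially one regular expression; a minimal sketch (class name hypothetical):

```java
public class Cleaner {
    // Lowercase, replace every run of non-alphabetical characters with a
    // single space, and trim the ends
    static String clean(String line) {
        return line.toLowerCase().replaceAll("[^a-z]+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("Hello, World! 123"));   // hello world
    }
}
```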

Frequently Used Class(Java) in MapReduce Job
  • IntWritable is a Writable class that wraps a Java int.
  • Text is a Writable class that wraps a Java String.
  • Configuration is like system properties in Java.
  • Job allows the user to configure the job, submit it, control its execution, and query the state.
  • Context allows the Mapper and Reducer to interact with the rest of Hadoop system.

How to debug a normal Java application?
  • System.out.println(); its counterpart in the MapReduce framework is a Counter.
  • Set breakpoints in an IDE.


Project workflow
  • Read a large-scale text document collection.
  • MapReduce job 1: build the N-gram library.
  • MapReduce job 2: calculate probabilities.
  • Run the project on Hadoop.