0. Overview
This tutorial walks you through building a Word Count program using the Hadoop MapReduce framework. The basic idea is to store the input file in HDFS and run the program from the local machine. The program reads the input file, counts how many times each word occurs, and writes the result back to HDFS.
Steps involved:
- Create an input file and upload it to HDFS.
- Write a Mapper, Reducer, and Driver in Java.
- Compile the code and create a jar file.
- Run the jar file using the Hadoop command.
- View the output stored in HDFS.
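Before starting, it can help to confirm that the Hadoop daemons are actually running. One quick check (assuming a typical single-node setup where the daemons run as local Java processes) is:
jps
The listing should include processes such as NameNode, DataNode, ResourceManager, and NodeManager.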
1. Create Input File
To create a folder in HDFS:
hdfs dfs -mkdir -p /suvash/in
This creates a folder named in inside a folder suvash (the -p flag also creates the parent folder /suvash if it does not already exist).
Let's create an input file locally and then upload it to /suvash/in in HDFS. I am assuming you are already inside the /home/hadoop directory on your local machine.
vi input.txt
// write some input
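For example, the file might contain a single line of sample text. The exact content is up to you; this particular line is just an assumption, reused later to show the expected output:
hello world hello hadoop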
Now copy this file into HDFS:
hdfs dfs -put input.txt /suvash/in
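To confirm the upload, list the directory in HDFS:
hdfs dfs -ls /suvash/in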
2. Create Program
Write the following three Java files on the local machine, all in the same directory and with no package declaration, so the compile and run commands later work as shown.
Driver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(Driver.class);
        job.setMapperClass(WMapper.class);
        // The reducer is also used as a combiner to pre-aggregate counts on the map side.
        job.setCombinerClass(WReducer.class);
        job.setReducerClass(WReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input path in HDFS, args[1] is the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
WMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line on whitespace and emit (word, 1) for every token.
        String[] words = value.toString().split("\\s+");
        for (String str : words) {
            word.set(str);
            context.write(word, one);
        }
    }
}
WReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts received for this word and emit the total.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
3. Creating Jar
To create the jar file, first compile your Java files:
javac -classpath `hadoop classpath` -d . WMapper.java WReducer.java Driver.java
This compiles the Java files into .class files in the current directory. Now it's time to generate the jar file:
jar cf jar_file_name.jar *.class
This will create a jar file named jar_file_name.jar.
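If you want to double-check that all three classes made it into the archive, list its contents:
jar tf jar_file_name.jar
The listing should include Driver.class, WMapper.class, and WReducer.class.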
4. Running Jar File
To run this jar file, use this command:
hadoop jar jar_file_name.jar Driver /suvash/in/input.txt /suvash/out
Here:
- Driver is the name of the driver (main) class of the program.
- /suvash/in/input.txt is the input file path in HDFS.
- /suvash/out is the output directory path in HDFS.
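Note that MapReduce will refuse to run if the output directory already exists. If you re-run the job, delete the previous output first:
hdfs dfs -rm -r /suvash/out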
5. View The Result
To view the result written to the /suvash/out folder, print the reducer's output file:
hdfs dfs -cat /suvash/out/part-r-00000
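Assuming the sample input line suggested earlier (hello world hello hadoop), the output would look like this, with each word followed by its count, separated by a tab:
hadoop	1
hello	2
world	1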
By following these steps, you can run the Word Count program on Hadoop and see the frequency of each word in the input file.