In the previous post we saw how to install Hadoop on Ubuntu; now it’s time to run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. Both the input and the output are text files; each line of the output contains a word and the number of times it occurred, separated by a tab.
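To make that output format concrete, here is a purely illustrative sketch (the words and counts below are made up, not actual job output); it simply prints two lines in the same word-TAB-count shape that the job produces:

# Illustrative only: the real output has one "<word><TAB><count>" pair per line
printf 'example\t3\nhadoop\t5\n'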
1. Download example input data
We will use three ebooks from Project Gutenberg for this example:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
Download each ebook and store the files in a local temporary directory of your choice, for example ‘/tmp/gutenberg’. Now we have to change the file ownership to hduser. Open a terminal and run:
sudo chown -R hduser:hadoop /tmp/gutenberg
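As a quick sanity check (a sketch, assuming ‘/tmp/gutenberg’ is the directory you chose), you can confirm that the ebooks are in place and that the ownership change took effect:

# The downloaded ebooks should be sitting in /tmp/gutenberg at this point
ls -l /tmp/gutenberg
# Every file should now be listed as owned by hduser:hadoop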
2. Restart the Hadoop cluster
Open a new terminal and restart your Hadoop cluster if it isn’t running already:
su - hduser
/usr/local/hadoop/bin/start-all.sh
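Before moving on, it is worth checking that the Hadoop daemons actually came up. A quick way to do that (assuming jps from the JDK is on your PATH) is:

# jps lists the running Java processes; for a single-node setup you should see
# NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker
jps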
3. Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
cd /usr/local/hadoop
bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
We can also check whether the files were copied correctly (see image 1).
bin/hadoop dfs -ls /user/hduser
bin/hadoop dfs -ls /user/hduser/gutenberg
Image 1. Files Copied Successfully.
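If you want an extra sanity check beyond the directory listing, you can compare the file sizes on the local disk and in HDFS; a small sketch:

# Sizes reported locally and by HDFS should match for each ebook
ls -l /tmp/gutenberg
bin/hadoop dfs -du /user/hduser/gutenberg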
4. Run the MapReduce job
Now, we actually run the WordCount example job (image 2).
bin/hadoop jar hadoop*example*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
This command will read all the files in the HDFS directory '/user/hduser/gutenberg', process them, and store the result in the HDFS directory '/user/hduser/gutenberg-output'.
Image 2. Running WordCount.
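One practical note if you run the job more than once: Hadoop refuses to start a job whose output directory already exists. A sketch of clearing it first (this removal step is not part of the original walkthrough, so adapt the path to your setup):

# Remove a leftover output directory from a previous run, then re-run the job
bin/hadoop dfs -rmr /user/hduser/gutenberg-output
bin/hadoop jar hadoop*example*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output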
Check whether the result was successfully stored in the HDFS directory '/user/hduser/gutenberg-output' (image 3):
bin/hadoop dfs -ls /user/hduser
bin/hadoop dfs -ls /user/hduser/gutenberg-output
Image 3. Results Stored Successfully.
If you want to modify some Hadoop settings on the fly, such as increasing the number of reduce tasks, you can use the "-D" option:
bin/hadoop jar hadoop*example*.jar wordcount -D mapred.reduce.tasks=16 /user/hduser/gutenberg /user/hduser/gutenberg-output
Note: Hadoop treats mapred.map.tasks only as a hint, but it does accept a user-specified mapred.reduce.tasks and will not change it. In other words, you cannot force the number of map tasks, but you can set the number of reduce tasks.
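A quick way to see the effect of mapred.reduce.tasks=16 (a sketch, assuming the job above completed successfully) is to list the output directory; with 16 reducers you should find 16 part files, roughly part-r-00000 through part-r-00015, instead of a single one:

# Each reduce task writes its own part file into the output directory
bin/hadoop dfs -ls /user/hduser/gutenberg-output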
5. Retrieve the job result from HDFS
You can use the command
bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000
to read the file directly from HDFS. Alternatively, you can copy it from HDFS to the local file system:
mkdir /tmp/gutenberg-output
bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
Note: The command dfs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
Now we can view the whole output file in any editor. Open a new terminal and run:
sudo gedit /tmp/gutenberg-output/gutenberg-output
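Since the merged file is not sorted (see the note above), a handy alternative to opening it in an editor is to peek at it from the shell, for example (a sketch, assuming the merged file ended up at /tmp/gutenberg-output/gutenberg-output):

# First few lines of the raw, unsorted output
head /tmp/gutenberg-output/gutenberg-output
# The twenty most frequent words, sorted by the count in the second column
sort -k2 -nr /tmp/gutenberg-output/gutenberg-output | head -20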
6. Hadoop Web Interfaces
Hadoop comes with several web interfaces which are, by default (see conf/hadoop-default.xml), available at these locations:
- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon
These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.
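If you just want to confirm from the shell that the three web UIs are up before opening a browser, a minimal sketch (assuming curl is installed) is:

# Each request should return HTTP status 200 if the corresponding daemon is running
for port in 50070 50030 50060; do
  curl -s -o /dev/null -w "port ${port}: HTTP %{http_code}\n" "http://localhost:${port}/"
done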
I. NameNode Web Interface (HDFS layer)
The NameNode web UI shows you a cluster summary, including information about total/remaining capacity and about live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50070/.
II. JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs, and a job history log file. It also gives access to the local machine’s Hadoop log files (the machine on which the web UI is running).
By default, it’s available at http://localhost:50030/.
III. TaskTracker Web Interface (MapReduce layer)
The TaskTracker web UI shows you running and non-running tasks. It also gives access to the local machine’s Hadoop log files.
By default, it’s available at http://localhost:50060/.
— * — * — * — * —