
Running a MapReduce Job (WordCount) on Hadoop Single-Node Cluster

In the previous post we saw how to install Hadoop on Ubuntu; now it's time to run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. Both the input and the output are text files; each line of the output contains a word and the number of times it occurred, separated by a tab.
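For example, a few lines of WordCount output might look like this (the words and counts here are purely illustrative):

ulysses	210
upon	414
us	319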


1. Download example input data
We will use three plain-text ebooks from Project Gutenberg for this example.

Download each ebook and store the files in a local temporary directory of your choice, for example '/tmp/gutenberg'. Now we have to change the file ownership to hduser. Open a terminal and run:

sudo chown -R hduser:hadoop /tmp/gutenberg
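If you prefer to fetch the books from the command line, a sketch like the following would work; the ebook IDs and URLs are examples only (Project Gutenberg file names change over time), so substitute any three plain-text UTF-8 ebooks:

mkdir -p /tmp/gutenberg
cd /tmp/gutenberg
# Example downloads; the ebook IDs/URLs are illustrative and may have moved
wget https://www.gutenberg.org/cache/epub/20417/pg20417.txt
wget https://www.gutenberg.org/cache/epub/5000/pg5000.txt
wget https://www.gutenberg.org/cache/epub/4300/pg4300.txt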


2. Restart the Hadoop cluster
Open a new terminal and restart your Hadoop cluster if it's not running already:

su - hduser
/usr/local/hadoop/bin/start-all.sh
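To check that the daemons actually came up, you can run jps (it ships with the JDK) in the same terminal:

jps
# On a single-node Hadoop 1.x setup you should see NameNode, DataNode,
# SecondaryNameNode, JobTracker, TaskTracker and Jps itself (PIDs vary)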


3. Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.

cd /usr/local/hadoop
bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg

We can also check whether the files were copied correctly (see image 1).

bin/hadoop dfs -ls /user/hduser
bin/hadoop dfs -ls /user/hduser/gutenberg

Image 1. Files Copied Successfully.


4. Run the MapReduce job
Now, we actually run the WordCount example job (image 2).

bin/hadoop jar hadoop*example*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

This command will read all the files in the HDFS directory '/user/hduser/gutenberg', process them, and store the result in the HDFS directory '/user/hduser/gutenberg-output'.

Image 2. Running WordCount.
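If the jar glob does not resolve on your installation, you can locate the examples jar explicitly and pass its exact name; the version number below is hypothetical:

ls /usr/local/hadoop/hadoop*examples*.jar
# Suppose it prints hadoop-examples-1.0.3.jar (illustrative version):
bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output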

Check whether the result was stored successfully in the HDFS directory '/user/hduser/gutenberg-output' (image 3):

bin/hadoop dfs -ls /user/hduser
bin/hadoop dfs -ls /user/hduser/gutenberg-output

Image 3. Results Stored Successfully.

If you want to modify some Hadoop settings on the fly, such as increasing the number of reduce tasks, you can use the "-D" option:

bin/hadoop jar hadoop*example*.jar wordcount -D mapred.reduce.tasks=16 /user/hduser/gutenberg /user/hduser/gutenberg-output

Note: Hadoop does not honor mapred.map.tasks beyond treating it as a hint, but it does accept a user-specified mapred.reduce.tasks and does not alter it. In other words, you cannot force the number of map tasks, but you can set the number of reduce tasks.
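Generic options like "-D" must come before the job's input/output arguments, and you can pass several at once. A sketch (the job name value is just an example):

bin/hadoop jar hadoop*example*.jar wordcount \
  -D mapred.reduce.tasks=16 \
  -D mapred.job.name=wordcount-gutenberg \
  /user/hduser/gutenberg /user/hduser/gutenberg-output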


5. Retrieve the job result from HDFS
You can use the command

bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

to read the file directly from HDFS. Alternatively, you can copy it from HDFS to the local file system:

mkdir /tmp/gutenberg-output
bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output

Note: The command dfs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
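If you would like the merged output ordered by count, a local sort does the trick; a minimal sketch, assuming the merged file ended up at /tmp/gutenberg-output/gutenberg-output:

# Sort by the second (count) column, numerically, highest counts first
sort -t $'\t' -k2,2nr /tmp/gutenberg-output/gutenberg-output | head -20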

Now we can view the whole output file by opening it in any editor. Open a new terminal and run:

sudo gedit /tmp/gutenberg-output/gutenberg-output


6. Hadoop Web Interfaces
Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

http://localhost:50070/ (web UI of the NameNode daemon)
http://localhost:50030/ (web UI of the JobTracker daemon)
http://localhost:50060/ (web UI of the TaskTracker daemon)


These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.


I. NameNode Web Interface (HDFS layer)
The NameNode web UI shows you a cluster summary, including information about total/remaining capacity and live and dead nodes. It also lets you browse the HDFS namespace and view the contents of its files in the web browser, and it gives access to the local machine's Hadoop log files.

By default, it’s available at http://localhost:50070/.


II. JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs, and a job history log file. It also gives access to the local machine's Hadoop log files (the machine the web UI is running on).

By default, it’s available at http://localhost:50030/.


III. TaskTracker Web Interface (MapReduce layer)
The TaskTracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.

By default, it’s available at http://localhost:50060/.
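A quick way to confirm all three UIs are reachable from a shell (assuming curl is installed):

# Print the HTTP status code for each web UI; 200 means it is up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50030/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50060/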
