
Running a MapReduce Job (WordCount) on Hadoop Single-Node Cluster

In the previous post we saw how to install Hadoop on Ubuntu; now it's time to run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. Both the input and the output are text files; each line of the output contains a word and the number of times it occurred, separated by a tab.
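For example, a few lines of the output might look like the following (the words and counts here are purely illustrative; the real values depend on your input files):

a	1245
about	87
above	12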


1. Download example input data
We will use three ebooks from Project Gutenberg for this example:

Download each ebook as a plain text file and store the files in a local temporary directory of your choice, for example '/tmp/gutenberg'. Next we have to change the ownership of the files to hduser. Open a terminal and run:

sudo chown -R hduser:hadoop /tmp/gutenberg
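You can verify that the ownership change worked by listing the directory; every file should now show hduser and hadoop as owner and group:

ls -l /tmp/gutenberg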


2. Restart the Hadoop cluster
Open a new terminal and start your Hadoop cluster if it is not already running:

su - hduser
/usr/local/hadoop/bin/start-all.sh
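To confirm that the cluster is actually up, you can run the jps command as hduser; on a working single-node setup it should list the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker processes:

jps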


3. Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.

cd /usr/local/hadoop
bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg

We can check whether the files were copied correctly (see image 1):

bin/hadoop dfs -ls /user/hduser
bin/hadoop dfs -ls /user/hduser/gutenberg

Image 1. Files Copied Successfully.
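If you want to double-check the file sizes as well, dfs -du prints the size of each file in the directory, which is a quick sanity check that the uploads are not empty:

bin/hadoop dfs -du /user/hduser/gutenberg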


4. Run the MapReduce job
Now, we actually run the WordCount example job (image 2).

bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

This command reads all the files in the HDFS directory '/user/hduser/gutenberg', processes them, and stores the result in the HDFS directory '/user/hduser/gutenberg-output'.

Image 2. Running WordCount.
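While the job is running you can also keep an eye on it from the command line; hadoop job -list shows the currently running jobs and their IDs:

bin/hadoop job -list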

Check that the result was stored successfully in the HDFS directory '/user/hduser/gutenberg-output' (image 3):

bin/hadoop dfs -ls /user/hduser
bin/hadoop dfs -ls /user/hduser/gutenberg-output

Image 3. Results Stored Successfully.

If you want to modify some Hadoop settings on the fly, such as increasing the number of reduce tasks, you can use the "-D" option:

bin/hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /user/hduser/gutenberg /user/hduser/gutenberg-output

Note: Hadoop does not honor mapred.map.tasks beyond treating it as a hint, but it does accept a user-specified mapred.reduce.tasks and does not manipulate it. In other words, you cannot force the number of map tasks, but you can specify the number of reduce tasks.
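With mapred.reduce.tasks=16 the job runs 16 reduce tasks, so the output directory should contain 16 part files (part-r-00000 through part-r-00015) instead of just one. You can verify this by listing the output directory:

bin/hadoop dfs -ls /user/hduser/gutenberg-output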


5. Retrieve the job result from HDFS
You can use the command

bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

to read the file directly from HDFS. Alternatively, you can copy it from HDFS to the local file system:

mkdir /tmp/gutenberg-output
bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output

Note: The command dfs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
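If you want the merged output sorted, for example by word count in descending order, you can simply sort it locally after the getmerge step. Assuming the merged file ended up at /tmp/gutenberg-output/gutenberg-output (as in the next step), this prints the 20 most frequent words:

sort -n -r -k 2 /tmp/gutenberg-output/gutenberg-output | head -n 20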

Now we can view the whole output file by opening it in any editor. For example, open a new terminal and run:

sudo gedit /tmp/gutenberg-output/gutenberg-output


6. Hadoop Web Interfaces
Hadoop comes with several web interfaces which are, by default (see conf/hadoop-default.xml), available at these locations:

http://localhost:50070/ (NameNode web UI, HDFS layer)
http://localhost:50030/ (JobTracker web UI, MapReduce layer)
http://localhost:50060/ (TaskTracker web UI, MapReduce layer)

These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.


I. NameNode Web Interface (HDFS layer)
The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.

By default, it’s available at http://localhost:50070/.


II. JobTracker Web Interface (MapReduce layer)
The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running).

By default, it’s available at http://localhost:50030/.


III. TaskTracker Web Interface (MapReduce layer)
The TaskTracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.

By default, it’s available at http://localhost:50060/.
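If you prefer the terminal, a quick way to check that the interfaces are reachable is to request each URL with curl and look for an HTTP 200 response (or simply open the URLs in a browser):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50030/
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50060/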

— * — * — * — * —

Sources
