Continuing from the previous post (http://codebeta.blogspot.tw/2012/12/hadoophadoop-ubuntu.html), the next step is to run the wordcount example program to verify that the installation works.
Download the required data
To run the wordcount program you first need some input data. Noll kindly provides links to three documents for you to download:
- http://www.gutenberg.org/etext/20417 The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- http://www.gutenberg.org/etext/5000 The Notebooks of Leonardo Da Vinci
- http://www.gutenberg.org/etext/4300 Ulysses by James Joyce
For each ebook, download the plain-text UTF-8 version and put it wherever you like; here the files go in /tmp/gutenberg.
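If you prefer to fetch them from the command line, something along these lines should work (the download URLs are an assumption based on the pgNNNNN.txt file names; verify them on gutenberg.org before relying on them):
hduser@ubuntu:~$ mkdir -p /tmp/gutenberg && cd /tmp/gutenberg
hduser@ubuntu:/tmp/gutenberg$ # assumed URL pattern; check the "Plain Text UTF-8" link on each book's page
hduser@ubuntu:/tmp/gutenberg$ wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
hduser@ubuntu:/tmp/gutenberg$ wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt
hduser@ubuntu:/tmp/gutenberg$ wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
After the downloads, the directory should look something like this: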
hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3592
-rw-r--r-- 1 hduser hadoop 674566 Dec 23 07:37 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573150 Dec 23 07:39 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Dec 23 07:38 pg5000.txt
hduser@ubuntu:~$
Start the Hadoop cluster
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh
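Before copying any data over, it is worth checking that all the daemons actually came up. On a single-node setup like this one, jps should list roughly the following processes (the PIDs here are made up; only the names matter):
hduser@ubuntu:~$ jps
2287 NameNode
2399 DataNode
2518 SecondaryNameNode
2608 JobTracker
2720 TaskTracker
2831 Jps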
Copy the data into HDFS (Hadoop's distributed file system)
Use the hadoop command-line tool to push the files from the local file system into HDFS.
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop dfs -ls /user/hduser
Found 1 items
drwxr-xr-x - hduser supergroup 0 2012-12-25 00:23 /user/hduser/gutenberg
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop dfs -ls /user/hduser/gutenberg
Found 4 items
drwxr-xr-x - hduser supergroup 0 2012-12-25 00:23 /user/hduser/gutenberg/gutenberg
-rw-r--r-- 1 hduser supergroup 674566 2012-12-25 00:21 /user/hduser/gutenberg/pg20417.txt
-rw-r--r-- 1 hduser supergroup 1573150 2012-12-25 00:21 /user/hduser/gutenberg/pg4300.txt
-rw-r--r-- 1 hduser supergroup 1423801 2012-12-25 00:21 /user/hduser/gutenberg/pg5000.txt
hduser@ubuntu:~$
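A side note on the listing above: the nested /user/hduser/gutenberg/gutenberg directory most likely means copyFromLocal was run more than once (a second run copies the local directory into the HDFS directory that already exists), and it would also explain why the job below reports 6 input paths instead of 3. It is harmless for this test, but if you want a clean copy you can drop the HDFS directory and copy again, roughly:
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop dfs -rmr /user/hduser/gutenberg
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg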
Run the MapReduce program
Mind the working directory: cd to /usr/local/hadoop first.
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg/* /user/hduser/gutenberg-output
[Note 1] If you run it from some other working directory, you get:
hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
[Note 2] The reason for the * wildcards in the jar name is explained at http://stackoverflow.com/questions/6891600/how-to-get-hadoop-wordcount-example-working
[Note 3] If you have already run the job once and want to run it again, it will fail because the output directory already exists. In that case, delete the directory first:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -rmr /user/hduser/gutenberg-output
Deleted hdfs://localhost:54310/user/hduser/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$
The run looks like this:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg/* /user/hduser/gutenberg-output
12/12/25 00:52:27 INFO input.FileInputFormat: Total input paths to process : 6
12/12/25 00:52:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/12/25 00:52:27 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/25 00:52:28 INFO mapred.JobClient: Running job: job_201212250016_0007
12/12/25 00:52:29 INFO mapred.JobClient: map 0% reduce 0%
12/12/25 00:52:44 INFO mapred.JobClient: map 16% reduce 0%
12/12/25 00:52:47 INFO mapred.JobClient: map 33% reduce 0%
12/12/25 00:52:53 INFO mapred.JobClient: map 50% reduce 0%
12/12/25 00:52:56 INFO mapred.JobClient: map 66% reduce 11%
12/12/25 00:52:59 INFO mapred.JobClient: map 83% reduce 11%
12/12/25 00:53:01 INFO mapred.JobClient: map 100% reduce 11%
12/12/25 00:53:04 INFO mapred.JobClient: map 100% reduce 22%
12/12/25 00:53:10 INFO mapred.JobClient: map 100% reduce 100%
12/12/25 00:53:15 INFO mapred.JobClient: Job complete: job_201212250016_0007
12/12/25 00:53:15 INFO mapred.JobClient: Counters: 29
12/12/25 00:53:15 INFO mapred.JobClient: Job Counters
12/12/25 00:53:15 INFO mapred.JobClient: Launched reduce tasks=1
12/12/25 00:53:15 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=45175
12/12/25 00:53:15 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/12/25 00:53:15 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/12/25 00:53:15 INFO mapred.JobClient: Launched map tasks=6
12/12/25 00:53:15 INFO mapred.JobClient: Data-local map tasks=6
12/12/25 00:53:15 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=25718
12/12/25 00:53:15 INFO mapred.JobClient: File Output Format Counters
12/12/25 00:53:15 INFO mapred.JobClient: Bytes Written=886978
12/12/25 00:53:15 INFO mapred.JobClient: FileSystemCounters
12/12/25 00:53:15 INFO mapred.JobClient: FILE_BYTES_READ=4429692
12/12/25 00:53:15 INFO mapred.JobClient: HDFS_BYTES_READ=7343786
12/12/25 00:53:15 INFO mapred.JobClient: FILE_BYTES_WRITTEN=7529522
12/12/25 00:53:15 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=886978
12/12/25 00:53:15 INFO mapred.JobClient: File Input Format Counters
12/12/25 00:53:15 INFO mapred.JobClient: Bytes Read=7343034
12/12/25 00:53:15 INFO mapred.JobClient: Map-Reduce Framework
12/12/25 00:53:15 INFO mapred.JobClient: Map output materialized bytes=2948682
12/12/25 00:53:15 INFO mapred.JobClient: Map input records=155864
12/12/25 00:53:15 INFO mapred.JobClient: Reduce shuffle bytes=2681669
12/12/25 00:53:15 INFO mapred.JobClient: Spilled Records=511924
12/12/25 00:53:15 INFO mapred.JobClient: Map output bytes=12152190
12/12/25 00:53:15 INFO mapred.JobClient: Total committed heap usage (bytes)=978935808
12/12/25 00:53:15 INFO mapred.JobClient: CPU time spent (ms)=7070
12/12/25 00:53:15 INFO mapred.JobClient: Combine input records=1258344
12/12/25 00:53:15 INFO mapred.JobClient: SPLIT_RAW_BYTES=752
12/12/25 00:53:15 INFO mapred.JobClient: Reduce input records=204644
12/12/25 00:53:15 INFO mapred.JobClient: Reduce input groups=82335
12/12/25 00:53:15 INFO mapred.JobClient: Combine output records=204644
12/12/25 00:53:15 INFO mapred.JobClient: Physical memory (bytes) snapshot=1124028416
12/12/25 00:53:15 INFO mapred.JobClient: Reduce output records=82335
12/12/25 00:53:15 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2677747712
12/12/25 00:53:15 INFO mapred.JobClient: Map output records=1258344
Check the output with the following command:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
Found 3 items
-rw-r--r-- 1 hduser supergroup 0 2012-12-25 00:53 /user/hduser/gutenberg-output/_SUCCESS
drwxr-xr-x - hduser supergroup 0 2012-12-25 00:52 /user/hduser/gutenberg-output/_logs
-rw-r--r-- 1 hduser supergroup 886978 2012-12-25 00:53 /user/hduser/gutenberg-output/part-r-00000
hduser@ubuntu:/usr/local/hadoop$
To print the contents of the result:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000
To copy the result back to the local file system and then look at it:
hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
12/12/25 01:03:58 INFO util.NativeCodeLoader: Loaded the native-hadoop library
hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra" 2
"1490 2
"1498," 2
"35" 2
"40," 2
"A 4
"AS-IS". 2
"A_ 2
"Absoluti 2
"Alack! 2
hduser@ubuntu:/usr/local/hadoop$
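Each output line is a word and its count separated by a tab, sorted by word. If you would rather see the most frequent words, sort the merged local copy by the second column, for example:
hduser@ubuntu:/usr/local/hadoop$ sort -k 2 -n -r /tmp/gutenberg-output/gutenberg-output | head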
Hadoop Web Interface
This version of Hadoop also provides web interfaces; they are read-only status views, and nothing can be changed through them.
- http://localhost:50070/ – status of the NameNode
- http://localhost:50030/ – status of the JobTracker
- http://localhost:50060/ – status of the TaskTracker
Conclusion
That wraps up this very simple walkthrough of a single-node Hadoop cluster. From the steps above you can see that getting a usable Hadoop system means going through (1) installing the operating system, (2) installing Java, (3) installing Hadoop, and (4) configuring all three. To actually run a job you then have to (1) prepare the program and its data, (2) move the data into the Hadoop file system, (3) write and run the MapReduce program, and (4) move the results back out of the Hadoop file system. None of this is something you solve by buying a packaged product and dragging, dropping, or clicking a button. Moreover, this single-node cluster still has no redundancy and no real distributed computation; both require further planning and tuning.
From this minimal system, the next steps depend on your role:
- System engineers: dig into the Hadoop configuration to turn this into a distributed system, set up redundancy, and study the Hadoop management tools.
- Software engineers: depending on requirements, install an HBase or Cassandra database, or simply keep data in files; study the MapReduce programming style; and build your own interface code for accessing the data store and file system.