Interpreting the MapReduce Example Programs


2020-6-21


    MapReduce is a programming framework for distributed computation. Its core job is to combine the business-logic code written by the user with the framework's built-in default components into one complete distributed program that runs concurrently on a Hadoop cluster. MapReduce follows a "divide and conquer" strategy: a large dataset stored in the distributed file system is cut into many independent splits, and these splits can be processed in parallel by multiple map tasks.
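
The "divide and conquer" idea above can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Hadoop code: the input is cut into independent splits, each "map task" works on its own split in isolation, and a final "reduce" merges the partial results.

```python
# Conceptual sketch of MapReduce's divide-and-conquer model (not Hadoop code):
# the input is cut into independent splits, each split is processed by its
# own map task, and the partial results are merged by a reduce step.
from functools import reduce

def make_splits(data, n):
    """Cut the input into n roughly equal, independent splits."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_task(split):
    """Each map task sees only its own split (here: a partial sum)."""
    return sum(split)

def reduce_task(partials):
    """The reduce step merges the independent partial results."""
    return reduce(lambda a, b: a + b, partials, 0)

data = list(range(1, 101))                     # the "large dataset": 1..100
partials = [map_task(s) for s in make_splits(data, 5)]
total = reduce_task(partials)                  # same answer as a single pass
```

Because each map task touches only its own split, the map calls could run on different machines in any order, which is exactly what makes the model parallelizable.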

    The four major Hadoop components:

    (1) HDFS: the distributed storage system;
    (2) MapReduce: the distributed computation framework;
    (3) YARN: Hadoop's resource scheduling system;
    (4) Common: the underlying support layer for the three components above, mainly providing common utility classes and the RPC framework;

    The MapReduce component ships with a number of official sample programs; the best known are wordcount and pi. Their code is packaged in the hadoop-mapreduce-examples jar, which lives under the Hadoop installation directory at:

/share/hadoop/mapreduce

    Below we walk through these two sample programs one by one.

    Before testing, turn off the firewall, then start Zookeeper and the Hadoop cluster, in the following order:

./start-dfs.sh
./start-yarn.sh

    Once everything has started successfully, check that all the expected processes are present. For details, refer to the earlier blog posts on building the cluster.

    1. The pi example program

    (1) Run the command with its arguments

[hadoop@slave01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.7.6.jar pi 5 5
Number of Maps  = 5
Samples per Map = 5
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Starting Job
...
...
(part of the output omitted)
...
...
18/06/27 16:22:56 INFO mapreduce.Job:  map 0% reduce 0%
18/06/27 16:28:12 INFO mapreduce.Job:  map 73% reduce 0%
18/06/27 16:28:13 INFO mapreduce.Job:  map 100% reduce 0%
18/06/27 16:29:26 INFO mapreduce.Job:  map 100% reduce 100%
18/06/27 16:29:29 INFO mapreduce.Job: Job job_1530087649012_0001 completed successfully
18/06/27 16:29:30 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=116
		FILE: Number of bytes written=738477
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=1320
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=23
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters 
		Launched map tasks=5
		Launched reduce tasks=1
		Data-local map tasks=5
		Total time spent by all maps in occupied slots (ms)=1625795
		Total time spent by all reduces in occupied slots (ms)=48952
		Total time spent by all map tasks (ms)=1625795
		Total time spent by all reduce tasks (ms)=48952
		Total vcore-milliseconds taken by all map tasks=1625795
		Total vcore-milliseconds taken by all reduce tasks=48952
		Total megabyte-milliseconds taken by all map tasks=1664814080
		Total megabyte-milliseconds taken by all reduce tasks=50126848
	Map-Reduce Framework
		Map input records=5
		Map output records=10
		Map output bytes=90
		Map output materialized bytes=140
		Input split bytes=730
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=140
		Reduce input records=10
		Reduce output records=0
		Spilled Records=20
		Shuffled Maps =5
		Failed Shuffles=0
		Merged Map outputs=5
		GC time elapsed (ms)=107561
		CPU time spent (ms)=32240
		Physical memory (bytes) snapshot=500453376
		Virtual memory (bytes) snapshot=12460331008
		Total committed heap usage (bytes)=631316480
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=590
	File Output Format Counters 
		Bytes Written=97
Job Finished in 452.843 seconds
Estimated value of Pi is 3.68000000000000000000

    Meaning of the two arguments:

    The first 5 is the number of map tasks to run;
    The second 5 is the number of samples each map task takes (the number of "dart throws");

    The product of the two is the total number of throws (the pi code estimates the value by dart throwing).

    The run above gives an estimate of Pi of 3.680000. You can change the arguments to see how they relate to the result: with 10 and 10, for example, the result is 3.20000. In general, the larger the arguments, the more precise the estimate.
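
The dart-throwing estimate can be sketched on a single machine in a few lines of Python. Note this is a simplified illustration, not the Hadoop source: the actual Hadoop pi example uses a quasi-Monte Carlo method (Halton sequences) rather than a pseudo-random generator, which is why Hadoop's result for a given argument pair is deterministic.

```python
# Simplified single-machine sketch of the dart-throwing pi estimate
# (the real Hadoop example uses quasi-Monte Carlo with Halton sequences).
import random

def estimate_pi(num_maps, samples_per_map, seed=42):
    random.seed(seed)                        # fixed seed for repeatability
    total = num_maps * samples_per_map       # total number of darts thrown
    inside = 0
    for _ in range(total):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:             # dart landed inside the quarter circle
            inside += 1
    return 4.0 * inside / total              # area ratio -> estimate of pi
```

With only 5 * 5 = 25 darts the estimate is very coarse (like the 3.68 above); raising the totals drives it toward 3.14159, which matches the observation that larger arguments give more precise results.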

    (2) Check the running job

    The job's running time varies, so while it executes you can watch its progress in the web UI by visiting:

slave01:8088

    The page looks as follows (screenshot not included):

    As the page shows, once the Progress bar reaches completion the computation is done. You can also click through to see the details, which I won't demonstrate here.

    2. The wordcount example program

    (1) Prepare the data and upload it to HDFS

    Wordcount is, simply put, word counting. Here we create a new txt file and enter some words, so there is something to count:

[hadoop@slave01 mapreduce]$ touch wordcount.txt
[hadoop@slave01 mapreduce]$ vim wordcount.txt

    Enter the following words and save:

hello word !
you can help me ?
yes , I can
How do you do ?

    Upload the file to HDFS: first create a directory on HDFS, then put the txt file into it. Below is one way to create the directory (alternatively use hadoop fs -mkdir; either works), and mind the paths:

[hadoop@slave01 bin]$ hdfs dfs -mkdir -p /wordcount
[hadoop@slave01 bin]$ hdfs dfs -put ../share/hadoop/mapreduce/wordcount.txt /wordcount
[hadoop@slave01 bin]$ 

    We can browse the HDFS filesystem by visiting slave01:50070 (screenshot not included):

    The upload succeeded.

    (2) Run the program

    Run the following command, minding the paths:

[hadoop@slave01 bin]$ yarn jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /wordcount /word_output
18/06/27 17:34:24 INFO client.RMProxy: Connecting to ResourceManager at slave01/127.0.0.1:8032
18/06/27 17:34:30 INFO input.FileInputFormat: Total input paths to process : 1
18/06/27 17:34:30 INFO mapreduce.JobSubmitter: number of splits:1
18/06/27 17:34:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1530087649012_0003
18/06/27 17:34:32 INFO impl.YarnClientImpl: Submitted application application_1530087649012_0003
18/06/27 17:34:33 INFO mapreduce.Job: The url to track the job: http://slave01:8088/proxy/application_1530087649012_0003/
18/06/27 17:34:33 INFO mapreduce.Job: Running job: job_1530087649012_0003
18/06/27 17:34:52 INFO mapreduce.Job: Job job_1530087649012_0003 running in uber mode : false
18/06/27 17:34:52 INFO mapreduce.Job:  map 0% reduce 0%
18/06/27 17:35:02 INFO mapreduce.Job:  map 100% reduce 0%
18/06/27 17:35:31 INFO mapreduce.Job:  map 100% reduce 100%
18/06/27 17:35:32 INFO mapreduce.Job: Job job_1530087649012_0003 completed successfully
...
...
(part of the output omitted)
...
...
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=59
	File Output Format Counters 
		Bytes Written=72

    Meaning of the command arguments:

    The first is the path to the jar; the second is the name of the example program to run, wordcount; the third is the HDFS path of the input; the fourth is the output directory (it must not already exist).

    The above is the job output; as before, you can track the job's progress at slave01:8088.

    After the job finishes, the HDFS filesystem shows that the output directory has been created and contains the result files (screenshot not included).

    The result file can be viewed with the following command:

[hadoop@slave01 bin]$ hdfs dfs -text /word_output/part*
!	1
,	1
?	2
How	1
I	1
can	2
do	2
hello	1
help	1
me	1
word	1
yes	1
you	2
[hadoop@slave01 bin]$ 

    As shown above, the word counts are complete; you can verify them against the input file.
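
What wordcount does internally can be mimicked on a single machine with a short Python sketch. Again this is an illustration, not the Hadoop source: the map phase emits a (word, 1) pair for every whitespace-separated token, the shuffle groups the pairs by word, and the reduce phase sums each group.

```python
# Single-machine sketch of the wordcount logic (not the Hadoop source):
# map emits (word, 1), the shuffle groups by word, reduce sums the 1s.
from collections import defaultdict

def word_count(lines):
    grouped = defaultdict(list)              # the "shuffle": group values by key
    for line in lines:                       # map phase, one line per record
        for word in line.split():            # tokenize on whitespace
            grouped[word].append(1)          # emit (word, 1)
    # reduce phase: sum each group; sort keys like the Hadoop output file
    return {word: sum(ones) for word, ones in sorted(grouped.items())}

# The same input as wordcount.txt above
lines = [
    "hello word !",
    "you can help me ?",
    "yes , I can",
    "How do you do ?",
]
counts = word_count(lines)
```

Running this on the same four lines reproduces the counts in the part* file above, including the punctuation "words" (`!`, `,`, `?`), since splitting is purely on whitespace.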

    That's it for the walkthrough of these two built-in examples. As for their source code, we can discuss it together another time.

Author: 海岸线的曙光