Some Interview Questions About MapReduce

Original post · last modified 2022-04-05 20:19:39

This content is licensed under CC 4.0 BY-SA.

Tags: #mapreduce #java #big-data #python

First published 2021-04-13 10:10:37

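This article takes a close look at how Hadoop MapReduce executes, including the multi-way merge sort on the map side and the merge on the reduce side. Using the WordCount example, it explains what the Mapper and Reducer do and what Context is for. It also shows a Python WordCount built on Hadoop Streaming and discusses how MapJoin works, in particular how it runs without any reduce tasks.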

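1. The MapReduce Execution Process
- Interview question 1: How does Hadoop implement merge sort in MapReduce?
2. WordCount Is Not as Simple as It Looks
- Interview question 2: What does context do?
- Interview question 3: What do the Mapper type parameters mean?
3. Hadoop Streaming
- Interview question 4: Have you used Hadoop Streaming?
4. MapJoin and ReduceJoin
- Interview question 5: Explain how MapJoin works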

1. The MapReduce Execution Process

Official description:
[figure not preserved]

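Interview question 1: How does Hadoop implement merge sort in MapReduce?

The map-side merge is a multi-way merge:
During the map phase, the small spill files are merged over several rounds. Each round picks the io.sort.factor smallest files (the property is named mapreduce.task.io.sort.factor in Hadoop 2 and later), merges them, and puts the resulting file back into the list of files waiting to be merged. This repeats until fewer than io.sort.factor files remain. Within each round the merge is driven by a min-heap, so merging the files can be seen as repeatedly rebuilding a heap. In essence it is a priority queue backed by a min-heap.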
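
The multi-round merge described above can be sketched in a few lines of Python. This is only an illustration of the technique, not Hadoop's actual merge code: `k_way_merge` and `merge_in_rounds` are hypothetical names, in-memory lists stand in for sorted files on disk, and `factor` plays the role of io.sort.factor.

```python
import heapq

_DONE = object()  # sentinel marking an exhausted run


def k_way_merge(runs):
    """Merge already-sorted runs into one sorted list using a min-heap."""
    heap = []
    for idx, run in enumerate(runs):
        it = iter(run)
        first = next(it, _DONE)
        if first is not _DONE:
            # idx breaks ties, so the iterators themselves are never compared
            heap.append((first, idx, it))
    heapq.heapify(heap)

    merged = []
    while heap:
        value, idx, it = heap[0]          # smallest head among all runs
        merged.append(value)
        nxt = next(it, _DONE)
        if nxt is _DONE:
            heapq.heappop(heap)           # this run is finished
        else:
            heapq.heapreplace(heap, (nxt, idx, it))
    return merged


def merge_in_rounds(runs, factor):
    """Merge the `factor` smallest runs per round until one pass can finish."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    pending = [list(run) for run in runs]
    rounds = 0
    while len(pending) > factor:
        pending.sort(key=len)               # smallest runs first
        batch, pending = pending[:factor], pending[factor:]
        pending.append(k_way_merge(batch))  # result goes back into the list
        rounds += 1
    return k_way_merge(pending), rounds


runs = [[1, 4, 7], [2, 5], [3, 6, 9], [0, 8], [10]]
result, rounds = merge_in_rounds(runs, factor=2)
print(result, rounds)
```

Picking the smallest runs first keeps the intermediate rounds cheap, because the largest runs are read and rewritten as few times as possible.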

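The reduce-side merge is usually described as a two-way merge, although the reduce-side merger in the Hadoop source also takes io.sort.factor as its merge factor.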

My own description:
[figure not preserved]

2. WordCount Is Not as Simple as It Looks

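import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line held in value on the chosen delimiter (a tab here)
        String[] words = value.toString().split("\t");
        for (String word : words) {
            // Emit (hello,1)  (world,1)
            context.write(new Text(word.toLowerCase()), new IntWritable(1));
        }
    }
}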

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum every count emitted for this word
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(key, new IntWritable(count));
    }
}

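[figure not preserved]
The two functions above fill in two positions of the process shown in the figure. Developers only need to care about these two functions.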

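Interview question 2: What does context do?

A simple way to think of context is as one big map-like collection that acts as a buffer: the data you write is held there. More precisely, it is the object through which a task talks to the framework. Calling context.write(...) hands a key/value pair to the framework, which on the map side collects the pairs in an in-memory buffer before sorting and spilling them. The context also exposes the job configuration and counters.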
context.write(new Text(word.toLowerCase()), new IntWritable(1));

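Interview question 3: What do the Mapper type parameters mean?

Both Mapper and Reducer take four type parameters: the first two are the input key and value types, the last two are the output key and value types.
For the Mapper here, the input is the byte offset of a line (LongWritable) and the text of that line (Text); the output is a word and a number (Text, IntWritable), i.e. <word, number>. The Reducer's input types must match the Mapper's output types.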
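
To make the four type parameters concrete, here is a small Python model of the WordCount data flow. It is a sketch only, not Hadoop code: `ToyContext`, `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and an in-memory sort stands in for the framework's shuffle. The comments note which Java type each value would correspond to.

```python
from itertools import groupby
from operator import itemgetter


class ToyContext:
    """Hypothetical stand-in for Context: it only collects written pairs."""

    def __init__(self):
        self.pairs = []

    def write(self, key, value):
        self.pairs.append((key, value))


def map_fn(offset, line, context):
    # input:  (LongWritable offset, Text line)
    # output: (Text word, IntWritable 1)
    for word in line.split("\t"):
        context.write(word.lower(), 1)


def reduce_fn(word, counts, context):
    # input:  (Text word, Iterable<IntWritable> counts)
    # output: (Text word, IntWritable total)
    context.write(word, sum(counts))


def run_job(text):
    map_out = ToyContext()
    offset = 0
    for line in text.splitlines():
        map_fn(offset, line, map_out)
        # byte offset of the next line, assuming "\n" line endings
        offset += len(line.encode("utf-8")) + 1

    # Stand-in for the shuffle: sort by key, then group the values per key.
    reduce_out = ToyContext()
    by_key = sorted(map_out.pairs, key=itemgetter(0))
    for word, group in groupby(by_key, key=itemgetter(0)):
        reduce_fn(word, (count for _, count in group), reduce_out)
    return reduce_out.pairs


print(run_job("Hello\tworld\nhello\tHadoop"))
```

Note how the mapper's output pair types are exactly the reducer's input key type and element type, which is why the middle two parameters of the Mapper must line up with the first two of the Reducer.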

3. Hadoop Streaming

Interview question 4: Have you used Hadoop Streaming?

Reference: Hadoop Streaming with Python
Reference: Writing MapReduce functions in Python, using WordCount as an example

/root/py/mapper.py:

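#!/usr/bin/env python
import sys

# Emit one "word<TAB>1" line for every word read from stdin
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # print() with a single argument behaves the same on Python 2 and 3
        print("%s\t%s" % (word, 1))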

/root/py/reducer.py:

#!/usr/bin/env python
#coding=utf-8
import sys

current_word = None
current_count = 0

# Input arrives sorted by key, so equal words sit on consecutive lines
for line in sys.stdin:
    try:
        word, count = line.strip().split('\t', 1)
        count = int(count)
    except ValueError:  # skip malformed lines and non-numeric counts
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if current_word is not None:  # don't forget to emit the last word
    print("%s\t%s" % (current_word, current_count))

Test data: /test/py_mr/wc.txt

hello world
hello zhu
hello happy
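
Before submitting the job, the same logic can be checked locally. The sketch below imitates what Hadoop Streaming does between the two scripts: the mapper emits tab-separated lines, the framework sorts them by key, and the reducer then sees each key's lines consecutively. The function names are hypothetical, and `itertools.groupby` replaces the manual current_word bookkeeping of reducer.py; it is an illustration of the flow, not a replacement for running the job.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t%s" % (word, 1)


def reducer(sorted_lines):
    """Sum counts per word; relies on the input being sorted by key."""
    parsed = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%s" % (word, sum(int(count) for _, count in group))


def simulate(lines):
    # sorted() stands in for the sort that happens between map and reduce
    return list(reducer(sorted(mapper(lines))))


test_data = ["hello world", "hello zhu", "hello happy"]
for out in simulate(test_data):
    print(out)
```

The same check can be done with the real scripts by piping the test file through mapper.py, then `sort`, then reducer.py on the command line.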

Run the job:

bin/hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-streaming-2.6.0-cdh5.16.1.jar \
-D mapred.reduce.tasks=1 \
-file /root/py/mapper.py     -mapper /root/py/mapper.py \
-file /root/py/reducer.py    -reducer /root/py/reducer.py \
-input /test/py_mr/wc.txt    -output /test/py_mr/py_mr_output

Launched map tasks=2
Launched reduce tasks=1

[root@hadoop001 py]# hadoop fs -cat /test/py_mr/py_mr_output/part-00000
happy	1
hello	3
world	1
zhu	1

If mapred.reduce.tasks is not set, 3 reduce tasks are launched.
The reason is as follows:
[figure not preserved]

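4. MapJoin and ReduceJoin

Interview question 5: Explain how MapJoin works

[figure not preserved]
A map join has no reduce phase, yet the log still prints reduce progress, which is easy to misread.
Call job.setNumReduceTasks(0); to tell the framework that the job has no reduce tasks.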
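
The idea behind a map join can be sketched in Python: the small table is loaded into an in-memory dictionary once per map task, each record of the large table is joined as it streams through, and the result is written out directly with no shuffle and no reduce. The table contents and function names below are made up for illustration; in a real Hadoop job the small table is typically shipped to every mapper through the distributed cache and loaded in the mapper's setup() method.

```python
def load_small_table(rows):
    """Build the in-memory lookup, as a mapper would do once in setup()."""
    return {dept_id: dept_name for dept_id, dept_name in rows}


def map_join(big_rows, lookup):
    """Join each big-table record against the lookup; no reduce step needed."""
    for emp_name, dept_id in big_rows:
        dept_name = lookup.get(dept_id)
        if dept_name is not None:  # inner join: drop unmatched rows
            yield (emp_name, dept_id, dept_name)


# Hypothetical sample data: a small dimension table and a larger fact table.
departments = [(10, "sales"), (20, "engineering")]
employees = [("ann", 10), ("bob", 20), ("cid", 30), ("dee", 10)]

lookup = load_small_table(departments)
joined = list(map_join(employees, lookup))
print(joined)
```

This only works when the small table fits in each mapper's memory; otherwise the join has to be done on the reduce side, where records from both tables are brought together by key.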