hadoop

Latest recommended article published on 2024-03-20 11:46:05

This content is licensed under the CC 4.0 BY-SA license.

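This document describes how to use Hadoop's MapReduce component to process large datasets, including configuring the environment, formatting the filesystem, starting the daemons, submitting a job, and viewing the results.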

bin/hdfs dfs -mkdir /user

bin/hdfs dfs -ls /user

bin/hdfs dfs -put filepath /user

bin/hdfs dfs -get /user/filename localfilename

Check the health of files in DFS:

hdfs fsck /user/hadoop

Report the cluster's DFS status:

hdfs dfsadmin -report

http://blog.csdn.net/catontower/article/details/41719393

Execution

The following instructions run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node.

Format the filesystem:
```
  $ bin/hdfs namenode -format
```
Start NameNode daemon and DataNode daemon:
```
  $ sbin/start-dfs.sh
```
The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
Browse the web interface for the NameNode; by default it is available at:
- NameNode - http://localhost:50070/
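
Besides the web interface, the JDK's `jps` tool lists running Java processes, which gives a quick way to see whether the daemons came up. The sketch below is only an illustration: `missing_daemons` is a hypothetical helper name, and it assumes `jps` prints one `<pid> <name>` pair per line.

```shell
#!/bin/sh
# Sketch: report which expected Hadoop daemons are absent from `jps` output.
# missing_daemons is a hypothetical helper, not part of Hadoop.
# It reads lines such as "12345 NameNode" on stdin and prints each
# daemon name passed as an argument that does not appear in the listing.
missing_daemons() {
    listing=$(cat)
    for daemon in "$@"; do
        # Compare against the whole second field so that "NameNode"
        # does not match "SecondaryNameNode".
        if ! printf '%s\n' "$listing" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}
```

After `sbin/start-dfs.sh`, something like `jps | missing_daemons NameNode DataNode SecondaryNameNode` should print nothing if all three are running. The same helper can be used after `sbin/start-yarn.sh` with `ResourceManager NodeManager`.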

Make the HDFS directories required to execute MapReduce jobs:

  $ bin/hdfs dfs -mkdir /user
  $ bin/hdfs dfs -mkdir /user/<username>
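
Relative HDFS paths such as `input` and `output` resolve against `/user/<username>`, which is why this directory has to exist before the later steps. A minimal sketch of filling in `<username>` automatically follows; `hdfs_home_dir` is a hypothetical helper, and `HADOOP_HOME` is assumed to point at the installation (falling back to the current directory).

```shell
#!/bin/sh
# Sketch: derive the HDFS home directory for a user and create it.
# hdfs_home_dir is a hypothetical helper, not part of Hadoop.
hdfs_home_dir() {
    # Default to the current login name when no argument is given.
    printf '/user/%s\n' "${1:-$(id -un)}"
}

# Only attempt the mkdir when the hdfs launcher is actually present.
# -p creates /user as well and does not fail if the directory exists.
if [ -x "${HADOOP_HOME:-.}/bin/hdfs" ]; then
    "${HADOOP_HOME:-.}/bin/hdfs" dfs -mkdir -p "$(hdfs_home_dir)"
fi
```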

Copy the input files into the distributed filesystem:
```
  $ bin/hdfs dfs -put etc/hadoop input
```

Run some of the examples provided:

  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
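
The `grep` example extracts every string matching the regular expression and counts how often each one occurs. To get a feel for what `'dfs[a-z.]+'` matches before running the job, a rough local approximation can help. This is a sketch, not the example's actual implementation: `local_grep_count` is a hypothetical helper, it relies on the `-o` and `-h` options of GNU or BSD `grep`, and the tab-separated layout is an assumption about how the counts are best displayed.

```shell
#!/bin/sh
# Sketch: approximate locally what the MapReduce grep example computes.
# local_grep_count is a hypothetical helper, not part of Hadoop.
# Usage: local_grep_count PATTERN FILE...
local_grep_count() {
    pattern=$1
    shift
    # -o prints only the matched text, -h suppresses file names.
    # Count identical matches, then sort by count, highest first.
    grep -ohE -- "$pattern" "$@" |
        sort | uniq -c | sort -rn |
        awk '{ print $1 "\t" $2 }'
}
```

For instance, `local_grep_count 'dfs[a-z.]+' etc/hadoop/*` scans the same configuration files on the local disk. Note that the job itself refuses to start if the `output` directory already exists in HDFS, so remove it with `bin/hdfs dfs -rm -r output` before re-running.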

Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
```
  $ bin/hdfs dfs -get output output
  $ cat output/*
```
or

View the output files on the distributed filesystem:
```
  $ bin/hdfs dfs -cat output/*
```
When you're done, stop the daemons with:
```
  $ sbin/stop-dfs.sh
```

YARN on Single Node

You can run a MapReduce job on YARN in pseudo-distributed mode by setting a few parameters and additionally running the ResourceManager and NodeManager daemons.

The following instructions assume that steps 1 to 4 of the above instructions (from formatting the filesystem through creating the HDFS directories) have already been executed.

Configure parameters as follows:

etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
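
In the Hadoop 2.x releases that match the examples jar used above, `etc/hadoop/mapred-site.xml` is not present after unpacking; only `mapred-site.xml.template` ships. The following sketch creates the file from the template when it is missing. `ensure_mapred_site` is a hypothetical helper, and the default directory `etc/hadoop` assumes the command is run from the Hadoop installation root.

```shell
#!/bin/sh
# Sketch: create mapred-site.xml from the shipped template if it is missing.
# ensure_mapred_site is a hypothetical helper, not part of Hadoop.
# Usage: ensure_mapred_site [CONF_DIR]
ensure_mapred_site() {
    conf_dir=${1:-etc/hadoop}
    # Never overwrite an existing file; copy only when the template exists.
    if [ ! -f "$conf_dir/mapred-site.xml" ] &&
        [ -f "$conf_dir/mapred-site.xml.template" ]; then
        cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
    fi
}
```

After calling `ensure_mapred_site`, edit the resulting file so that it contains the property shown above.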

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

Start ResourceManager daemon and NodeManager daemon:
```
  $ sbin/start-yarn.sh
```
Browse the web interface for the ResourceManager; by default it is available at:
- ResourceManager - http://localhost:8088/
Run a MapReduce job.
When you're done, stop the daemons with:
```
  $ sbin/stop-yarn.sh
```