I'm a complete beginner, so for now I'm just gathering material from various places and collecting it here.
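1. How to install Hadoop
- I installed version 2.6.0, following http://www.micmiu.com/bigdata/hadoop/hadoop2x-single-node-setup/; the steps are essentially the same. The one thing to watch: wherever the guide uses "micmiu", substitute your own name.
- After installation one warning remains ("WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable"). It looks complicated to fix, so I'm leaving it for later.
2. PutMerge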
- See http://blog.csdn.net/xiaotom5/article/details/8080615 for reference.
- To resolve "import org.apache.hadoop.fs.Path;", add the Hadoop jars to the project: http://stackoverflow.com/questions/21658724/the-import-org-apache-hadoop-mapreduce-cannot-be-resolved
- That post uses localhost:9100 in args[1] (the output address), but the default port for the version I installed is 9000.
- hadoop jar PutMerge.jar hadoopInAction.PutMerge /Users/xxx/Desktop/hadoop_test_input hdfs://localhost:9000/user/XXX/putmergeresult.txt (format: hadoop jar <jar file> <package.class> <args>)
3. Custom data types and I/O
- Custom Hadoop data types: http://book.douban.com/annotation/17067489/
- I/O: http://book.douban.com/annotation/17068812/
4. Patent
- gfds
- abc
- Streaming: hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -input patentData/ci* -output patentResult/streaming1 -mapper 'cut -f 2 -d ,' -reducer 'uniq'
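- Sampling (RandomSample.py)
#!/usr/bin/env python
import sys, random
# Keep each input line with a probability of argv[1] percent.
for line in sys.stdin:
    if random.randint(1, 100) <= int(sys.argv[1]):
        print(line.strip())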
hadoop jar $HADOOP_HOME/hadoop-0.20.2-streaming.jar -D mapred.reduce.tasks=1 -input input/apat63_99.txt -output output -file /home/tanglg1987/test/streaming/RandomSample.py -mapper 'RandomSample.py 10'
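
Before submitting the sampling job, it can help to check the sampling logic locally without Hadoop. The following is only a minimal sketch, assuming Python 3; the function name `sample_lines`, the fixed seed, and the fake input records are all made up for illustration and are not part of the original RandomSample.py.

```python
import random


def sample_lines(lines, percent, rng=random):
    """Yield each stripped line with a probability of roughly percent/100."""
    for line in lines:
        # Same test as the streaming mapper: draw a uniform integer
        # in 1..100 and keep the line if it falls under the threshold.
        if rng.randint(1, 100) <= percent:
            yield line.strip()


if __name__ == "__main__":
    # Fixed seed (an assumption, only so the local check is repeatable).
    rng = random.Random(42)
    # Hypothetical stand-in for the patent input file.
    fake_input = ["record %d\n" % i for i in range(10000)]
    kept = list(sample_lines(fake_input, 10, rng))
    print("kept %d of %d lines" % (len(kept), len(fake_input)))
```

With a threshold of 10, about one line in ten should survive; the exact count varies from run to run, because each line is sampled independently.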




