
RHadoop in Practice, Part 2: Installing and Using RHadoop

The RHadoop in Practice series covers large-scale data analysis with R and Hadoop combined. Hadoop stores the massive data, while R implements the MapReduce algorithms, replacing Java MapReduce implementations. With RHadoop, R enthusiasts gain a much more powerful tool for handling big data at 1G, 10G, 100G, TB, and PB scale, and the single-machine performance wall that big data used to impose may be gone for good.

RHadoop in Practice is a series of articles, consisting of "Setting Up the Hadoop Environment", "Installing and Using RHadoop", "Implementing a MapReduce Collaborative Filtering Algorithm in R", and "Installing and Using HBase and rhbase". For someone who is purely an R enthusiast, a Java enthusiast, or a Hadoop enthusiast, knowing all three technologies at once is not easy. Although this is an introductory article, you should already have the basics of R, Java, and Hadoop.

About the author:

  • Zhang Dan (Conan): programmer in Java, R, PHP, JavaScript
  • weibo:@Conan_Z
  • blog: http://blog.fens.me
  • email: bsspirit@gmail.com

Please cite the source when reposting:
http://blog.fens.me/rhadoop-rhadoop/


Part 2, Installing and Using RHadoop, is divided into three sections:

1. Environment preparation
2. Installing RHadoop
3. RHadoop example programs

Each section is divided into an explanation part and a code part, keeping the narrative and the code in step with each other.

Note: for a detailed record of setting up the Hadoop environment, see the previous article in this series, "RHadoop in Practice: Setting Up the Hadoop Environment".
Because the two articles were not written at the same time, the Hadoop version, operating system, and distributed environment differ slightly between them.
The two articles are independent of each other; please experiment with understanding, and do not rely entirely on the commands as written in either article.

Environment Preparation

Explanation:

First, environment preparation. I chose the 64-bit version of Ubuntu Linux 12.04; pick whichever Linux distribution you are most comfortable with.

But the JDK must be the official Oracle (Sun) release; please download it from the official site, since the OpenJDK bundled with the OS has various incompatibilities. Choose a 1.6.x JDK; JDK 1.7 also has assorted incompatibilities.
http://www.oracle.com/technetwork/java/javase/downloads/index.html

For installing the Hadoop environment, see the article "Setting Up the Hadoop Environment" in this RHadoop in Practice series.

Install R 2.15 or later; version 2.14 cannot support RHadoop.
If you are also running Ubuntu 12.04, update your package sources first, or you will only be able to get R 2.14.

Code:

1. Operating system: Ubuntu 12.04 x64

~ uname -a
Linux domU-00-16-3e-00-00-85 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

2. Java environment

~ java -version

java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

3. Hadoop environment (only Hadoop itself is needed here)

hadoop-1.0.3  hbase-0.94.2  hive-0.9.0  pig-0.10.0  sqoop-1.4.2  thrift-0.8.0  zookeeper-3.4.4

4. R environment

R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)

4.1 On Ubuntu 12.04, update the sources first, then install R 2.15.3

sh -c "echo deb http://mirror.bjtu.edu.cn/cran/bin/linux/ubuntu precise/ >>/etc/apt/sources.list"
apt-get update
apt-get install r-base

Installing RHadoop

Explanation:

RHadoop is a project from Revolution Analytics, with the open-source code available on GitHub. RHadoop consists of three R packages (rmr, rhdfs, rhbase), corresponding to the MapReduce, HDFS, and HBase parts of the Hadoop architecture. Since these three packages are not available on CRAN, you have to download them yourself.
https://github.com/RevolutionAnalytics/RHadoop/wiki

Next we first install the dependencies of these three packages.
rJava comes first. With the JDK 1.6 environment configured in the previous section, run the R CMD javareconf command so that R reads the Java configuration from the system variables. Then start R and install rJava via install.packages.

After that, install the remaining dependencies (reshape2, Rcpp, iterators, itertools, digest, RJSONIO, functional); all of them can be installed directly with install.packages.

Next, install the rhdfs package. Add the two environment variables HADOOP_CMD and HADOOP_STREAMING; export works for the current shell session, but for convenience next time it is best to add them to the system environment file /etc/environment. Then install the rhdfs package with R CMD INSTALL, which should complete smoothly.

Installing the rmr package with R CMD INSTALL also completes smoothly.

Installation of the rhbase package is covered later in the article "Installing and Using HBase and rhbase", so we skip it here.

Finally, let's check which packages RHadoop has installed.
Because my hard disk is external and I mounted the R library directory via mount and a symlink (ln -s), my R library lives under /disk1/system:
/disk1/system/usr/local/lib/R/site-library/
The usual R library directory is /usr/lib/R/site-library or /usr/local/lib/R/site-library; you can also run the whereis R command to find where the R libraries are installed on your own machine.
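From inside R you can also query the library locations directly, which is often quicker than searching the filesystem (a minimal check; the printed paths will differ from machine to machine):

> .libPaths()                      # directories R searches for packages
> rownames(installed.packages())   # names of every installed package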

Code:

1. Download the three RHadoop packages

https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads

rmr-2.1.0
rhdfs-1.0.5
rhbase-1.1

2. Copy them to the /root/R directory

~/R# pwd
/root/R

~/R# ls
rhbase_1.1.tar.gz  rhdfs_1.0.5.tar.gz  rmr2_2.1.0.tar.gz

3. Install the dependencies

Run on the command line:
~ R CMD javareconf 
~ R

Start R, then run:
install.packages("rJava")
install.packages("reshape2")
install.packages("Rcpp")
install.packages("iterators")
install.packages("itertools")
install.packages("digest")
install.packages("RJSONIO")
install.packages("functional")

4. Install the rhdfs package

~ export HADOOP_CMD=/root/hadoop/hadoop-1.0.3/bin/hadoop
~ export HADOOP_STREAMING=/root/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar (used by rmr2)
~ R CMD INSTALL /root/R/rhdfs_1.0.5.tar.gz 

4.1 Preferably set HADOOP_CMD in the system environment file

~ vi /etc/environment

    HADOOP_CMD=/root/hadoop/hadoop-1.0.3/bin/hadoop
    HADOOP_STREAMING=/root/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar

. /etc/environment
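Before running R CMD INSTALL, it is worth a quick sanity check that a fresh R session can actually see both variables; Sys.getenv() returns an empty string for an unset variable, and here it should print the two paths configured above:

> Sys.getenv("HADOOP_CMD")
[1] "/root/hadoop/hadoop-1.0.3/bin/hadoop"
> Sys.getenv("HADOOP_STREAMING")
[1] "/root/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar"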

5. Install the rmr package

~  R CMD INSTALL rmr2_2.1.0.tar.gz 

6. Install the rhbase package (skipped for now)

7. All installed packages

~ ls /disk1/system/usr/local/lib/R/site-library/
digest  functional  iterators  itertools  plyr  Rcpp  reshape2  rhdfs  rJava  RJSONIO  rmr2  stringr

RHadoop Example Programs

Explanation:

With the rhdfs and rmr packages installed, we can now try some Hadoop operations from R.

First, basic HDFS file operations.

List an HDFS directory
Hadoop command: hadoop fs -ls /user
R function: hdfs.ls("/user/")

View an HDFS data file
Hadoop command: hadoop fs -cat /user/hdfs/o_same_school/part-m-00000
R function: hdfs.cat("/user/hdfs/o_same_school/part-m-00000")
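rhdfs mirrors most of the other hadoop fs verbs as well. As a small sketch of moving a file in and out of HDFS (the paths here are just placeholders, not files from this article):

> hdfs.put("/tmp/local.txt", "/user/hdfs/local.txt")   # local file -> HDFS
> hdfs.get("/user/hdfs/local.txt", "/tmp/copy.txt")    # HDFS -> local file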

Next, let's run an rmr MapReduce job.

An ordinary R program:

> small.ints = 1:10
> sapply(small.ints, function(x) x^2)

The MapReduce version of the R program:

> small.ints = to.dfs(1:10)
> mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> from.dfs("/tmp/RtmpWnzxl4/file5deb791fcbd5")

Because MapReduce can only access the HDFS file system, the data first has to be stored into HDFS with to.dfs. The MapReduce result is then retrieved from HDFS with the from.dfs function.
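Note that the /tmp/RtmpWnzxl4/... path above is a session temp file that changes on every run. A more robust idiom, using nothing beyond what rmr2 already returns, is to keep the object that mapreduce() hands back and pass it straight to from.dfs():

> small.ints = to.dfs(1:10)
> out = mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> from.dfs(out)    # no hard-coded temp path needed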

The second rmr example is wordcount, which counts the words in a file.

> input <- '/user/hdfs/o_same_school/part-m-00000'
> wordcount = function(input, output = NULL, pattern = " "){

    # map: split each line on the pattern, emit (word, 1) pairs
    wc.map = function(., lines) {
        keyval(unlist(strsplit(x = lines, split = pattern)), 1)
    }

    # reduce: sum the counts for each word
    wc.reduce = function(word, counts) {
        keyval(word, sum(counts))
    }

    mapreduce(input = input, output = output, input.format = "text",
        map = wc.map, reduce = wc.reduce, combine = T)
}

> wordcount(input)
> from.dfs("/tmp/RtmpfZUFEa/file6cac626aa4a7")

I placed the data file /user/hdfs/o_same_school/part-m-00000 on HDFS in advance. We write the wordcount MapReduce function, run wordcount, and finally fetch the result from HDFS with from.dfs.
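The same capture-the-result idiom applies here too, avoiding the session-specific /tmp/RtmpfZUFEa/... path:

> out = wordcount(input)
> from.dfs(out)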

Code:

1. Using the rhdfs package

Start R:
> library(rhdfs)

Loading required package: rJava
HADOOP_CMD=/root/hadoop/hadoop-1.0.3/bin/hadoop
Be sure to run hdfs.init()

> hdfs.init()

1.1 Listing an HDFS directory with the hadoop command

~ hadoop fs -ls /user

Found 4 items
drwxr-xr-x   - root supergroup          0 2013-02-01 12:15 /user/conan
drwxr-xr-x   - root supergroup          0 2013-03-06 17:24 /user/hdfs
drwxr-xr-x   - root supergroup          0 2013-02-26 16:51 /user/hive
drwxr-xr-x   - root supergroup          0 2013-03-06 17:21 /user/root

1.2 Listing the HDFS directory with rhdfs

> hdfs.ls("/user/")

  permission owner      group size          modtime        file
1 drwxr-xr-x  root supergroup    0 2013-02-01 12:15 /user/conan
2 drwxr-xr-x  root supergroup    0 2013-03-06 17:24  /user/hdfs
3 drwxr-xr-x  root supergroup    0 2013-02-26 16:51  /user/hive
4 drwxr-xr-x  root supergroup    0 2013-03-06 17:21  /user/root

1.3 Viewing an HDFS data file with the hadoop command

~ hadoop fs -cat /user/hdfs/o_same_school/part-m-00000

10,3,tsinghua university,2004-05-26 15:21:00.0
23,4007,北京第一七一中学,2004-05-31 06:51:53.0
51,4016,大连理工大学,2004-05-27 09:38:31.0
89,4017,Amherst College,2004-06-01 16:18:56.0
92,4017,斯坦福大学,2012-11-28 10:33:25.0
99,4017,Stanford University Graduate School of Business,2013-02-19 12:17:15.0
113,4017,Stanford University,2013-02-19 12:17:15.0
123,4019,St Paul's Co-educational College - Hong Kong,2004-05-27 18:04:17.0
138,4019,香港苏浙小学,2004-05-27 18:59:58.0
172,4020,University,2004-05-27 19:14:34.0
182,4026,ff,2004-05-28 04:42:37.0
183,4026,ff,2004-05-28 04:42:37.0
189,4033,tsinghua,2011-09-14 12:00:38.0
195,4035,ba,2004-05-31 07:10:24.0
196,4035,ma,2004-05-31 07:10:24.0
197,4035,southampton university,2013-01-07 15:35:18.0
246,4067,美国史丹佛大学,2004-06-12 10:42:10.0
254,4067,美国史丹佛大学,2004-06-12 10:42:10.0
255,4067,美国休士顿大学,2004-06-12 10:42:10.0
257,4068,清华大学,2004-06-12 10:42:10.0
258,4068,北京八中,2004-06-12 17:34:02.0
262,4068,香港中文大学,2004-06-12 17:34:02.0
310,4070,首都师范大学初等教育学院,2004-06-14 15:35:52.0
312,4070,北京师范大学经济学院,2004-06-14 15:35:52.0

1.4 Viewing the HDFS data file with rhdfs

>  hdfs.cat("/user/hdfs/o_same_school/part-m-00000")

 [1] "10,3,tsinghua university,2004-05-26 15:21:00.0"
 [2] "23,4007,北京第一七一中学,2004-05-31 06:51:53.0"
 [3] "51,4016,大连理工大学,2004-05-27 09:38:31.0"
 [4] "89,4017,Amherst College,2004-06-01 16:18:56.0"
 [5] "92,4017,斯坦福大学,2012-11-28 10:33:25.0"
 [6] "99,4017,Stanford University Graduate School of Business,2013-02-19 12:17:15.0"
 [7] "113,4017,Stanford University,2013-02-19 12:17:15.0"
 [8] "123,4019,St Paul's Co-educational College - Hong Kong,2004-05-27 18:04:17.0"
 [9] "138,4019,香港苏浙小学,2004-05-27 18:59:58.0"
[10] "172,4020,University,2004-05-27 19:14:34.0"
[11] "182,4026,ff,2004-05-28 04:42:37.0"
[12] "183,4026,ff,2004-05-28 04:42:37.0"
[13] "189,4033,tsinghua,2011-09-14 12:00:38.0"
[14] "195,4035,ba,2004-05-31 07:10:24.0"
[15] "196,4035,ma,2004-05-31 07:10:24.0"
[16] "197,4035,southampton university,2013-01-07 15:35:18.0"
[17] "246,4067,美国史丹佛大学,2004-06-12 10:42:10.0"
[18] "254,4067,美国史丹佛大学,2004-06-12 10:42:10.0"
[19] "255,4067,美国休士顿大学,2004-06-12 10:42:10.0"
[20] "257,4068,清华大学,2004-06-12 10:42:10.0"
[21] "258,4068,北京八中,2004-06-12 17:34:02.0"
[22] "262,4068,香港中文大学,2004-06-12 17:34:02.0"
[23] "310,4070,首都师范大学初等教育学院,2004-06-14 15:35:52.0"
[24] "312,4070,北京师范大学经济学院,2004-06-14 15:35:52.0"

2. Using the rmr2 package

Start R:
> library(rmr2)

Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2

2.1 Running a plain R job

> small.ints = 1:10
> sapply(small.ints, function(x) x^2)

[1]   1   4   9  16  25  36  49  64  81 100

2.2 Running the rmr2 job

> small.ints = to.dfs(1:10)

13/03/07 12:12:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/07 12:12:55 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/03/07 12:12:55 INFO compress.CodecPool: Got brand-new compressor

> mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))

packageJobJar: [/tmp/RtmpWnzxl4/rmr-local-env5deb2b300d03, /tmp/RtmpWnzxl4/rmr-global-env5deb398a522b, /tmp/RtmpWnzxl4/rmr-streaming-map5deb1552172d, /root/hadoop/tmp/hadoop-unjar7838617732558795635/] [] /tmp/streamjob4380275136001813619.jar tmpDir=null
13/03/07 12:12:59 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/07 12:12:59 INFO streaming.StreamJob: getLocalDirs(): [/root/hadoop/tmp/mapred/local]
13/03/07 12:12:59 INFO streaming.StreamJob: Running job: job_201302261738_0293
13/03/07 12:12:59 INFO streaming.StreamJob: To kill this job, run:
13/03/07 12:12:59 INFO streaming.StreamJob: /disk1/hadoop/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=hdfs://r.qa.tianji.com:9001 -kill job_201302261738_0293
13/03/07 12:12:59 INFO streaming.StreamJob: Tracking URL: http://192.168.1.243:50030/jobdetails.jsp?jobid=job_201302261738_0293
13/03/07 12:13:00 INFO streaming.StreamJob:  map 0%  reduce 0%
13/03/07 12:13:15 INFO streaming.StreamJob:  map 100%  reduce 0%
13/03/07 12:13:21 INFO streaming.StreamJob:  map 100%  reduce 100%
13/03/07 12:13:21 INFO streaming.StreamJob: Job complete: job_201302261738_0293
13/03/07 12:13:21 INFO streaming.StreamJob: Output: /tmp/RtmpWnzxl4/file5deb791fcbd5

> from.dfs("/tmp/RtmpWnzxl4/file5deb791fcbd5")

$key
NULL

$val
       v
 [1,]  1   1
 [2,]  2   4
 [3,]  3   9
 [4,]  4  16
 [5,]  5  25
 [6,]  6  36
 [7,]  7  49
 [8,]  8  64
 [9,]  9  81
[10,] 10 100

2.3 Running the wordcount rmr2 job

> input <- '/user/hdfs/o_same_school/part-m-00000'
> wordcount = function(input, output = NULL, pattern = " "){

    wc.map = function(., lines) {
        keyval(unlist(strsplit(x = lines, split = pattern)), 1)
    }

    wc.reduce = function(word, counts) {
        keyval(word, sum(counts))
    }

    mapreduce(input = input, output = output, input.format = "text",
        map = wc.map, reduce = wc.reduce, combine = T)
}

> wordcount(input)

packageJobJar: [/tmp/RtmpfZUFEa/rmr-local-env6cac64020a8f, /tmp/RtmpfZUFEa/rmr-global-env6cac73016df3, /tmp/RtmpfZUFEa/rmr-streaming-map6cac7f145e02, /tmp/RtmpfZUFEa/rmr-streaming-reduce6cac238dbcf, /tmp/RtmpfZUFEa/rmr-streaming-combine6cac2b9098d4, /root/hadoop/tmp/hadoop-unjar6584585621285839347/] [] /tmp/streamjob9195921761644130661.jar tmpDir=null
13/03/07 12:34:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/07 12:34:41 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/07 12:34:41 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/07 12:34:41 INFO streaming.StreamJob: getLocalDirs(): [/root/hadoop/tmp/mapred/local]
13/03/07 12:34:41 INFO streaming.StreamJob: Running job: job_201302261738_0296
13/03/07 12:34:41 INFO streaming.StreamJob: To kill this job, run:
13/03/07 12:34:41 INFO streaming.StreamJob: /disk1/hadoop/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=hdfs://r.qa.tianji.com:9001 -kill job_201302261738_0296
13/03/07 12:34:41 INFO streaming.StreamJob: Tracking URL: http://192.168.1.243:50030/jobdetails.jsp?jobid=job_201302261738_0296
13/03/07 12:34:42 INFO streaming.StreamJob:  map 0%  reduce 0%
13/03/07 12:34:59 INFO streaming.StreamJob:  map 100%  reduce 0%
13/03/07 12:35:08 INFO streaming.StreamJob:  map 100%  reduce 17%
13/03/07 12:35:14 INFO streaming.StreamJob:  map 100%  reduce 100%
13/03/07 12:35:20 INFO streaming.StreamJob: Job complete: job_201302261738_0296
13/03/07 12:35:20 INFO streaming.StreamJob: Output: /tmp/RtmpfZUFEa/file6cac626aa4a7

> from.dfs("/tmp/RtmpfZUFEa/file6cac626aa4a7")

$key
 [1] "-"
 [2] "04:42:37.0"
 [3] "06:51:53.0"
 [4] "07:10:24.0"
 [5] "09:38:31.0"
 [6] "10:33:25.0"
 [7] "10,3,tsinghua"
 [8] "10:42:10.0"
 [9] "113,4017,Stanford"
[10] "12:00:38.0"
[11] "12:17:15.0"
[12] "123,4019,St"
[13] "138,4019,香港苏浙小学,2004-05-27"
[14] "15:21:00.0"
[15] "15:35:18.0"
[16] "15:35:52.0"
[17] "16:18:56.0"
[18] "172,4020,University,2004-05-27"
[19] "17:34:02.0"
[20] "18:04:17.0"
[21] "182,4026,ff,2004-05-28"
[22] "183,4026,ff,2004-05-28"
[23] "18:59:58.0"
[24] "189,4033,tsinghua,2011-09-14"
[25] "19:14:34.0"
[26] "195,4035,ba,2004-05-31"
[27] "196,4035,ma,2004-05-31"
[28] "197,4035,southampton"
[29] "23,4007,北京第一七一中学,2004-05-31"
[30] "246,4067,美国史丹佛大学,2004-06-12"
[31] "254,4067,美国史丹佛大学,2004-06-12"
[32] "255,4067,美国休士顿大学,2004-06-12"
[33] "257,4068,清华大学,2004-06-12"
[34] "258,4068,北京八中,2004-06-12"
[35] "262,4068,香港中文大学,2004-06-12"
[36] "312,4070,北京师范大学经济学院,2004-06-14"
[37] "51,4016,大连理工大学,2004-05-27"
[38] "89,4017,Amherst"
[39] "92,4017,斯坦福大学,2012-11-28"
[40] "99,4017,Stanford"
[41] "Business,2013-02-19"
[42] "Co-educational"
[43] "College"
[44] "College,2004-06-01"
[45] "Graduate"
[46] "Hong"
[47] "Kong,2004-05-27"
[48] "of"
[49] "Paul's"
[50] "School"
[51] "University"
[52] "university,2004-05-26"
[53] "university,2013-01-07"
[54] "University,2013-02-19"
[55] "310,4070,首都师范大学初等教育学院,2004-06-14"

$val
 [1] 1 2 1 2 1 1 1 4 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Please cite the source when reposting:
http://blog.fens.me/rhadoop-rhadoop/


Comments
Nemo

-。- Wow, your WordPress theme is the same as mine!

Nemo

Looks like I haven't found the community yet =。=
I'm still just a bystander for now; my Java isn't great and I've only just started learning R, so there's a lot I don't understand.
By the way, do you have any other contact info, so I can come ask you directly when I run into problems?

Nemo

Yep, just followed you on Weibo; from now on I'll be learning from the forum and your blog.


Conan Zhang

Since R has moved up to 3.0, the latest version gets installed by default.

Use the command below to install the 2.15.3 package:
sudo apt-get install r-base-core=2.15.3-1precise0precise1

Roy Guo

Do we have to use 2.15? Is 3.0 still not supported?

Conan Zhang

I have only tested in this particular environment; I'm not sure about others.

DAHAI

root@slaver1:/home/hadoop/下载# R CMD INSTALL rmr2_3.3.1.tar.gz
* installing to library '/usr/local/lib/R/site-library'
* installing *source* package 'rmr2' ...
** libs
g++ -I/usr/local/lib/R/include -DNDEBUG -I/usr/local/include `/usr/local/lib/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic -g -O2 -c extras.cpp -o extras.o
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object 'CxxFlags' not found
Calls: ::: -> get
Execution halted
In file included from extras.cpp:15:0:
extras.h:18:18: fatal error: Rcpp.h: No such file or directory
#include
^
compilation terminated.
/usr/local/lib/R/etc/Makeconf:130: recipe for target 'extras.o' failed
make: *** [extras.o] Error 1
ERROR: compilation failed for package 'rmr2'
* removing '/usr/local/lib/R/site-library/rmr2'
What is causing this, and how can I fix it?

Conan Zhang

Rcpp probably isn't installed properly. Try installing that package by itself.

DAHAI

Thanks, that problem is solved, but a new one has come up when installing the locally downloaded rmr2_3.3.1.tar.gz (by the way, I couldn't find rmr2_2.1.0.tar.gz anywhere). Please help me analyze it: it looks like an Rcpp problem; after several attempts I found it only installs successfully with Rcpp versions below 0.7.0:

install.packages("/home/hadoop/下载/rmr2_3.3.1.tar.gz", repos=NULL)

The error:

* installing *source* package 'rmr2' ...
** libs
g++ -I/usr/local/lib/R/include -DNDEBUG -I/usr/local/include `/usr/local/lib/R/bin/Rscript -e "Rcpp:::CxxFlags()"` -fpic -g -O2 -c extras.cpp -o extras.o
extras.cpp: In function 'SEXPREC* vsum(SEXP)':
extras.cpp:20:3: error: 'Rcpp' has not been declared
Rcpp::List _xx (xx);
^
extras.cpp:21:31: error: '_xx' was not declared in this scope
std::vector results(_xx.size());
^
extras.cpp:23:29: error: 'Rcpp' has not been declared
std::vector x = Rcpp::as<std::vector >(_xx[i]);
^
extras.cpp:23:58: error: expected primary-expression before '>' token
std::vector x = Rcpp::as<std::vector >(_xx[i]);
^
extras.cpp:26:10: error: 'Rcpp' has not been declared
return Rcpp::wrap(results);}
^
/usr/local/lib/R/etc/Makeconf:130: recipe for target 'extras.o' failed
make: *** [extras.o] Error 1
ERROR: compilation failed for package 'rmr2'
* removing '/usr/local/lib/R/site-library/rmr2'
Warning message:
In install.packages("/home/hadoop/下载/rmr2_3.3.1.tar.gz", repos = NULL) :
installation of package '/home/hadoop/下载/rmr2_3.3.1.tar.gz' had non-zero exit status

Conan Zhang

Download an earlier version of Rcpp manually and install it; don't use the install.packages command directly.

xiaoxu

Did you solve this problem? Which Hadoop version are you using?

saheb

Hi, were you able to solve this issue? I am facing exactly the same problem.


X.J ZHOU

mark

pan

I'm on R 3.1.1 with all the other configuration the same; the job runs successfully, but the from.dfs() function doesn't work. Could it be related to the R version? I've been at this for days; please give me some pointers!! Also, when I manually copy the result to the local machine and open it, it's a binary file, all garbled!

Conan Zhang

There are many technologies involved here. Whether R 3.1.1 is supported, you will have to check the official documentation. If you experiment on your own setup, you'll have to solve the problems yourself; I don't know the specific details either.


isaac

May I ask which command produced that list of Hadoop components (hadoop-1.0.3 hbase-0.94.2 hive-0.9.0 pig-0.10.0 sqoop-1.4.2 thrift-0.8.0 zookeeper-3.4.4)?
Also, what does the line sh -c "echo deb http://mirror.bjtu.edu.cn/cran/bin/linux/ubuntu precise/ >>/etc/apt/sources.list" do?

Conan Zhang

1. The Hadoop-family packages were all downloaded and deployed manually.
2. sh -c "echo deb... adds the bjtu.edu.cn mirror to the system's package sources.

isaac

What does the screenshot mean?
Is that the right command for uninstalling a package?
Or is the package name wrong? I want to uninstall R 2.14 and reinstall R 2.15.

Conan Zhang

To uninstall, usually:
apt-get autoremove r-base

Then to reinstall:
apt-get install r-base-core=2.15.3-1precise0precise1

adrianlu

Does R have to be installed on the master node? Can it be installed on a remote client instead?

Conan Zhang

R must be installed on the master and on the datanodes; remote won't work.

Isaac

Does R have to be installed under /root/R? Is a regular user OK?
Also, why does the root user report no Java,
while my regular isaac user has Java? So if I install R under /root/R, will I hit the missing-Java problem?

Conan Zhang

1. /root/R is not R's default install path, unless you installed it there yourself.
2. Your environment configuration problems are yours to solve.
3. Whether a regular user can use root's applications or commands depends on the permissions you configure.

Isaac

ERROR: compilation failed for package 'RJSONIO'
* removing '/usr/local/lib/R/site-library/RJSONIO'

The downloaded source packages are in
'/tmp/RtmpJleU68/downloaded_packages'

Warning message:
In install.packages("RJSONIO") :
installation of package 'RJSONIO' had non-zero exit status

"Rcpp" hit the same error.
Hoping for your guidance. I've searched Baidu a lot but found no solution.

Conan Zhang

A programmer still using Baidu? Switch to Google.

It looks like a version problem; reinstall R, using version 2.15.3.

beaderchen

Hello, could I ask for your help? I installed R 2.15.3 but cannot install Rcpp; it reports:
package 'Rcpp' is not available (for R version 2.15.3)

How should I install Rcpp?

I tried sudo apt-get install r-cran-rcpp, which installed successfully, and library(Rcpp) in R raised no error, but afterwards installing plyr and some other packages fails with:

g++ -I/usr/share/R/include -DNDEBUG -I"/usr/lib/R/site-library/Rcpp/include" -fpic -O3 -pipe -g -c RcppExports.cpp -o RcppExports.o
/bin/bash: g++: command not found
make: *** [RcppExports.o] Error 127
ERROR: compilation failed for package 'plyr'

I suspect this is related to Rcpp, perhaps it was not installed correctly. How should Rcpp be installed properly?
Thank you.

Conan Zhang

You haven't installed g++?

beaderchen

Got it sorted, thanks.

Conan Zhang

Glad it's solved.

姚冠

I ran into the same problem.
How did you solve it??
Of the 8 packages, only reshape2 is left.
Many thanks!

beaderchen

After a whole morning, I finally succeeded! Thank you, blogger.

Li Roy

Awesome! I've read many of your articles now; they're very well written!

Conan Zhang

You flatter me, thanks.


LEO

While installing rmr:

root@fred-Rev-1-0:/home/fred# R CMD INSTALL rmr-2.1.0.tar.gz
Error in getOctD(x, offset, len) : invalid octal digit
root@fred-Rev-1-0:/home/fred# R CMD INSTALL rhdfs-1.0.5.tar.gz
Error in getOctD(x, offset, len) : invalid octal digit

Conan Zhang

Which R version?

LEO

2.15.3

Conan Zhang

The files you downloaded are probably corrupted, not complete tar.gz archives; download them again.

LEO

** R
** preparing package for lazy loading
Error : This is R 2.15.3, package 'Rcpp' needs >= 3.0.0
ERROR: lazy loading failed for package 'rmr2'
* removing '/usr/local/lib/R/site-library/rmr2'
Now this error has appeared instead.

LEO

Still can't solve it... what should I do... I'm getting desperate!

LEO

> hdfs.init()
14/05/20 20:54:38 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 0 time(s).
14/05/20 20:54:39 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 1 time(s).
14/05/20 20:54:40 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 2 time(s).
14/05/20 20:54:41 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 3 time(s).
14/05/20 20:54:42 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 4 time(s).
14/05/20 20:54:43 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 5 time(s).
14/05/20 20:54:44 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 6 time(s).
14/05/20 20:54:45 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 7 time(s).
14/05/20 20:54:46 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 8 time(s).
14/05/20 20:54:47 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 9 time(s).
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.net.ConnectException: Call to localhost/127.0.0.1:54310 failed on connection exception: java.net.ConnectException: Connection refused

Conan Zhang

Your Hadoop isn't running, is it?

Patrick

Can RHadoop be installed on a Hadoop client?

patrick

Or rather, how do I install RHadoop on a Hadoop client?

Conan Zhang

什么是“hadoop客户端”?

LEO

> small.ints = to.dfs(1:10)

1: In rmr.options(“backend”) :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = …)
2: In rmr.options(“hdfs.tempdir”) :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = …)
3: In rmr.options(“backend”) :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = …)
The fixes I found online still don't work.

LEO

> wordcount(input)
packageJobJar: [/tmp/RtmpUCYW5D/rmr-local-env32755e4d6e94, /tmp/RtmpUCYW5D/rmr-global-env327568c123c3, /tmp/RtmpUCYW5D/rmr-streaming-map3275e529172, /tmp/RtmpUCYW5D/rmr-streaming-reduce32755aa5ffab, /tmp/RtmpUCYW5D/rmr-streaming-combine327523ccd556, /app/hadoop/tmp/hadoop-unjar1493167086586431947/] [] /tmp/streamjob1371528725781418223.jar tmpDir=null
14/05/21 20:57:13 ERROR security.UserGroupInformation: PriviledgedActionException as:root cause:org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode=”staging”:hduser:supergroup:rwxr-xr-x
14/05/21 20:57:13 ERROR streaming.StreamJob: Error Launching job : org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode=”staging”:hduser:supergroup:rwxr-xr-x
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 5
In addition: Warning messages:
1: In rmr.options(“backend”) :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = …)
2: In rmr.options(“hdfs.tempdir”) :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = …)
3: In rmr.options(“backend”) :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = …)
4: In rmr.options(“backend.parameters”) :
Please set an HDFS temp directory with rmr.options(hdfs.tempdir = …)
How do I fix this?

Shawn

Sorry, I posted this in the wrong place. Teacher Zhang, could you give me some pointers? Many thanks!

First I computed a matrix product with the method below and got the correct result.
rnxymat=to.dfs(cbind(c(1:5),c(1:5),c(11:15)))
from.dfs(mapreduce(input=rnxymat, map = function(k,v) keyval(NULL,t(v)%*%v)))

Result:

$key
NULL

$val
[,1] [,2] [,3]
[1,] 55 55 205
[2,] 55 55 205
[3,] 205 205 855

But if I read a csv file and then compute the same way, the result is wrong. The csv is a matrix with no row names or column names. Sometimes it produces a matrix with 2*nrow rows containing one correct result split in two: if I add rows 1:nrow to rows (nrow+1):(2*nrow), the sum equals the correct result.

The wrong result:

$key
NULL

$val
V1 V2 V3
V1 14 14 74
V2 14 14 74
V3 74 74 434
V1 41 41 131
V2 41 41 131
V3 131 131 421

The code:

iptmt = '/data/mat5.csv'

input.format = make.input.format("csv", sep = ",")

from.dfs(mapreduce(input = input, input.format = make.input.format("csv", sep = ","), map = function(k, v) keyval(NULL, t(as.matrix(v)) %*% (as.matrix(v)))))

Also, with the to.dfs() approach I can also compute lm(), kmeans, and arima(), but as soon as I read from csv it breaks... Is there something wrong with the way I read the csv?

Conan Zhang

1. cbind and csv handling are plain R function operations, nothing to do with Hadoop; get them working correctly on a single machine first, then move them into the RHadoop environment.

2. lm(), kmeans(), and arima() are single-machine algorithms; they do not support MapReduce and cannot be used in the RHadoop environment.

araul

@bsspirit:disqus You mentioned that lm(), kmeans(), and arima() are single-machine algorithms that don't support MapReduce. How can one tell whether a function supports MapReduce and can be used in the RHadoop environment? I haven't found any material on this; please give some pointers.

Conan Zhang

Only the functions in the rmr package support MapReduce; nothing else does, so you have to implement it yourself.

lyming

Hello, I followed the configuration in your article but hit an error, and from the log I can't tell where the cause lies. I hope you can help; many thanks!
Here is my error log:
java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:413)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.ja

Conan Zhang

It looks like an org.apache.hadoop.security.UserGroupInformation error; your Hadoop environment is probably misconfigured.

See this article: http://blog.fens.me/hadoop-history-source-install/

lyming

Hi, it was indeed an environment configuration problem; I'm using the CDH build of Hadoop and the files are a bit messy.

It runs successfully now, but there is one error which, although it doesn't block execution, I'd like to understand. Where does it come from?

org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/RtmpeJdUdz/filead2307cfe9c/_temporary/_attempt_201407160936_0001_m_000000_0/part-00000 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1433)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2688)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:569)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.

Does "could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation." mean my datanode has a problem? I haven't changed any other settings; everything was set up exactly per the official guide.

Conan Zhang

The error means the datanode has a problem. Following the "official guide" doesn't guarantee your configuration is correct; keep digging.

csyang

Hello, Teacher Zhang. hdfs.init() errors on startup; how do I fix it?

Conan Zhang

First try operating Hadoop on its own; does it run?
It would help if you provided the full procedure and documentation. With such a brief error message it's hard to pinpoint the problem.

csyang

Teacher Zhang, Hadoop on its own should be fine; I followed Part 1 of your series, and the simple HDFS test at the end also succeeded. The namenode is master and the datanode is slave. jps on the namenode shows Jps, SecondaryNameNode, JobTracker, NameNode; on the datanode it shows Jps, DataNode, TaskTracker. The R dependency packages were all installed as root under /usr/local/...; java -version is fine too. R is started as the newly created hadoop user, and Hadoop is also started as that hadoop user. Many thanks!

Conan Zhang

You've solved the problem now, right?

csyang

Teacher Zhang, the problem persists; I reinstalled once more and still get the same error.

Conan Zhang

Your environment variables also look correct; I'm not sure how to fix it either, sorry.

csyang

Thanks. My installed R is 3.x and I want to switch to 2.15.x, but then the Rcpp package fails because the R version is too low. Did Rcpp install fine for you on R 2.15?

Conan Zhang

1. It seems RHadoop doesn't support R 3.x yet, only R 2.15.x.
2. If install.packages fails for Rcpp, download the source manually and compile it.
3. When I installed, I didn't run into any problem installing Rcpp.

csyang

Here are my environment variables; please take a look for me. Thanks.
# User specific environment and startup programs
PATH=$PATH:$HOME/bin
export PATH

#R
R_HOME=/usr/local/lib64/R
export R_HOME
PATH=$PATH:$R_HOME/bin
export PATH
LD_LIBRARY_PATH=/usr/local/java/jdk1.6.0_45/jre/lib/amd64/server
export LD_LIBRARY_PATH
export CLASSPATH=$CLASSPATH:$R_HOME/library/rJava/jri

#hadoop
export HADOOP_HOME=/home/hadoop/conan/hadoop-0.20.2
export PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_CMD=/home/hadoop/conan/hadoop-0.20.2/bin/hadoop
export HADOOP_STREAMING=/home/hadoop/conan/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-streaming.jar

#java
export JAVA_HOME=/usr/local/java/jdk1.6.0_45
PATH=$JAVA_HOME/bin:$PATH
export PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib:$JAVA_HOME/jre/lib/rt.jar:$HADOOP_HOME/hadoop-0.20.2-core.jar:$HADOOP_HOME/lib/commons-logging-1.0.4.jar
export CLASSPATH

csyang

Here is the error output, thanks:

Exception in thread "main"
java.lang.NoClassDefFoundError: classpath
Caused by:
java.lang.ClassNotFoundException: classpath
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class:
classpath. Program will exit.

Error in
.jnew("org/apache/hadoop/conf/Configuration") :
java.lang.ClassNotFoundException

In addition: Warning message:
running command
'/home/hadoop/conan/hadoop-0.20.2/bin/hadoop classpath' had status 1

Conan Zhang

It looks like rJava isn't installed properly.

csyang

Teacher Zhang, I reinstalled rJava once more and still get the same problem; could you take another look? Many thanks, I really can't solve this one.


Guest

Teacher Zhang, from.dfs("/tmp/file20752138297f") errors. Could the tmpDir=null at the end of the first line of the mapreduce output be the problem?

Conan Zhang

It looks like an environment setup problem; you must use the software versions specified in the article.

luckywind

Hello, Teacher Zhang! My HADOOP_CMD environment variable is set, and rhdfs loads fine in R. I want to take a data file on HDFS as input and extract two of its columns; the code is:

from.dfs(mapreduce(input="/stock/table2.csv", map=function(k,v){key=v[,2]
val=v[,5]
keyval(key,val)}
))

But why does the error below occur? Please take a look for me, thank you!

Error log:
work.dir
Loading required package: methods
Loading required package: rmr2
Loading required package: rJava
Loading required package: rhdfs
Error : .onLoad failed in loadNamespace() for 'rhdfs', details:
call: fun(libname, pkgname)
error: Environment variable HADOOP_CMD must be set before loading package rhdfs
Warning in FUN(c("base", "methods", "datasets", "utils", "grDevices", "graphics", :
can't load rhdfs


Conan Zhang

input="/stock/table2.csv"

The input parameter here can only read files on HDFS, not files on the Linux operating system.

luckywind

Right; this table file was uploaded to HDFS from the terminal.

clz

I hit the following problem while installing rJava. What's going on? Thanks.
** building package indices
** testing if installed package can be loaded
Error : .onLoad failed in loadNamespace() for 'rJava', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/home/hadoop/R/x86_64-pc-linux-gnu-library/3.0/rJava/libs/rJava.so':
libjvm.so: cannot open shared object file: No such file or directory
Error: loading failed
Execution halted
ERROR: loading failed
* removing '/home/hadoop/R/x86_64-pc-linux-gnu-library/3.0/rJava'

The downloaded source packages are in
'/tmp/RtmpOifmH8/downloaded_packages'

Warning messages:
1: In open.connection(con, "r") :
unable to connect to 'cran.r-project.org' on port 80.
2: In install.packages("rJava") :
installation of package 'rJava' had non-zero exit status

Conan Zhang

Could it be a network access problem?
unable to connect to 'cran.r-project.org' on port 80.

学者

When I use rhive.query('select * from a') I must add a limit before anything will display,

but when I query with rhive.big.query('select * from a', fetchSize=40, limit=-1, memLimit=64*1024*1024) it reports the following error:

Error: java.sql.SQLException: The query did not generate a result set!

Conan Zhang

The result set is too large; there isn't enough memory.

zj

15/05/21 21:34:07 INFO streaming.StreamJob: killJob…
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, :
hadoop streaming failed with error code 1
Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://localhost:9000/tmp/Rtmp4nJaqE/file12b25f2bf0ed
What is going on here?

Conan Zhang

A runtime error; check your data and your code.

echocai

> hdfs.init()

Error: Could not find or load main class classpath
Error in .jnew("org/apache/hadoop/conf/Configuration") :
java.lang.ClassNotFoundException
In addition: Warning message:
running command '/home/bcpdm/hadoop-0.20.2/bin/hadoop classpath' had status 1

Hello, I've run into this problem recently. What could the cause be? rJava is normal; I've used rJava from other programs without any problem. Please help take a look, thank you!

Conan Zhang

Sorry for the late reply; I've been very busy lately.
It looks like the hadoop jar hasn't been added to the classpath environment variable?

gxlook

java version "1.6.0_31"
Hadoop 2.0.0-cdh4.6.0
Linux arreat00 3.8.0-29-generic #42~precise1-Ubuntu SMP Wed Aug 14 16:19:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
R version 2.15.3

When I try to install RHadoop, installing rJava fails. Is this related to the choice of mirror? Different mirrors seem to give different error messages, but I haven't found one that works.

Conan Zhang

I haven't tried Hadoop 2.x, so I don't know the details of that environment.

vpccw152c

Master, a question please. I'm working through RHadoop following the book 《R的极客理想-工具篇》, which calls for R 2.15.3 when installing R, but some of RHadoop's dependencies, such as Rcpp, only support R 3.0 and above. How do I resolve this? Are R 3.0+ and RHadoop compatible now??

Conan Zhang

You can manually download an older version of Rcpp and install that.

zhouxuan Wu

Master, please help!!!

Conan Zhang

Choose the node nearest to you, number 4.

Jay

Help, please! My Hadoop version is 2.6.0 and my R version is 3.1.0.
After running mapreduce(input = small.ints, map = function(k,v) cbind(v,v^2)),
I get the following error:

16/01/11 18:13:33 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/01/11 18:13:34 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
16/01/11 18:13:34 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
16/01/11 18:13:34 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= – already initialized
16/01/11 18:13:34 INFO mapred.FileInputFormat: Total input paths to process : 1
16/01/11 18:13:34 INFO mapreduce.JobSubmitter: number of splits:1
16/01/11 18:13:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1388960043_0001
16/01/11 18:13:35 INFO mapred.LocalDistributedCacheManager: Creating symlink: /hadoop/tmp/mapred/local/1452507215583/rmr-local-env12386f5ac56c <- /home/hadoop/workplace/rmr-local-env12386f5ac56c
16/01/11 18:13:35 INFO mapred.LocalDistributedCacheManager: Localized file:/tmp/RtmpQWwCmC/rmr-local-env12386f5ac56c as file:/hadoop/tmp/mapred/local/1452507215583/rmr-local-env12386f5ac56c
16/01/11 18:13:35 INFO mapred.LocalDistributedCacheManager: Creating symlink: /hadoop/tmp/mapred/local/1452507215584/rmr-global-env1238530a6ee6 <- /home/hadoop/workplace/rmr-global-env1238530a6ee6
16/01/11 18:13:35 INFO mapred.LocalDistributedCacheManager: Localized file:/tmp/RtmpQWwCmC/rmr-global-env1238530a6ee6 as file:/hadoop/tmp/mapred/local/1452507215584/rmr-global-env1238530a6ee6
16/01/11 18:13:35 INFO mapred.LocalDistributedCacheManager: Creating symlink: /hadoop/tmp/mapred/local/1452507215585/rmr-streaming-map1238540131a2 <- /home/hadoop/workplace/rmr-streaming-map1238540131a2
16/01/11 18:13:35 INFO mapred.LocalDistributedCacheManager: Localized file:/tmp/RtmpQWwCmC/rmr-streaming-map1238540131a2 as file:/hadoop/tmp/mapred/local/1452507215585/rmr-streaming-map1238540131a2
16/01/11 18:13:36 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
16/01/11 18:13:36 INFO mapreduce.Job: Running job: job_local1388960043_0001
16/01/11 18:13:36 INFO mapred.LocalJobRunner: OutputCommitter set in config null
16/01/11 18:13:36 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
16/01/11 18:13:36 INFO mapred.LocalJobRunner: Waiting for map tasks
16/01/11 18:13:36 INFO mapred.LocalJobRunner: Starting task: attempt_local1388960043_0001_m_000000_0
16/01/11 18:13:36 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
16/01/11 18:13:36 INFO mapred.MapTask: Processing split: hdfs://172.16.15.128:9000/tmp/file12384ad13608:0+417
16/01/11 18:13:36 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
16/01/11 18:13:36 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
16/01/11 18:13:36 INFO mapred.MapTask: numReduceTasks: 0
16/01/11 18:13:36 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/local/bin/Rscript, --vanilla, ./rmr-streaming-map1238540131a2]
16/01/11 18:13:36 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/01/11 18:13:36 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
16/01/11 18:13:36 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
16/01/11 18:13:36 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/01/11 18:13:36 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/01/11 18:13:36 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
16/01/11 18:13:36 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/01/11 18:13:36 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
16/01/11 18:13:36 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
16/01/11 18:13:36 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
16/01/11 18:13:36 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
16/01/11 18:13:36 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/01/11 18:13:36 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
/bin/sh: module: line 1: syntax error: unexpected end of file
/bin/sh: error importing function definition for `module'
Version 0.4-0 included new data defaults. See ?getSymbols.
16/01/11 18:13:37 INFO mapreduce.Job: Job job_local1388960043_0001 running in uber mode : false
16/01/11 18:13:37 INFO mapreduce.Job: map 0% reduce 0%
Please review your hadoop settings. See help(hadoop.settings)

HADOOP_CMD=/hadoop/hadoop-2.6.0/bin/hadoop

Be sure to run hdfs.init()
During startup – Warning message:
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
Loading objects:
.Random.seed
small.ints
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
Loading objects:
backend.parameters
combine
combine.file
combine.line
debug
default.input.format
default.output.format
in.folder
in.memory.combine
input.format
libs
map
map.file
map.line
out.folder
output.format
pkg.opts
postamble
preamble
profile.nodes
reduce
reduce.file
reduce.line
rmr.global.env
rmr.local.env
save.env
tempfile
vectorized.reduce
verbose
work.dir
16/01/11 18:13:38 INFO streaming.PipeMapRed: Records R/W=3/1
16/01/11 18:13:38 INFO streaming.PipeMapRed: MRErrorThread done
16/01/11 18:13:38 INFO streaming.PipeMapRed: mapRedFinished
16/01/11 18:13:38 INFO mapred.LocalJobRunner:
16/01/11 18:13:38 INFO mapred.Task: Task:attempt_local1388960043_0001_m_000000_0 is done. And is in the process of committing
16/01/11 18:13:38 INFO mapred.LocalJobRunner:
16/01/11 18:13:38 INFO mapred.Task: Task attempt_local1388960043_0001_m_000000_0 is allowed to commit now
16/01/11 18:13:38 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1388960043_0001_m_000000_0' to hdfs://172.16.15.128:9000/tmp/file12385c69c3aa/_temporary/0/task_local1388960043_0001_m_000000
16/01/11 18:13:38 INFO mapred.LocalJobRunner: Records R/W=3/1
16/01/11 18:13:38 INFO mapred.Task: Task 'attempt_local1388960043_0001_m_000000_0' done.
16/01/11 18:13:38 INFO mapred.LocalJobRunner: Finishing task: attempt_local1388960043_0001_m_000000_0
16/01/11 18:13:38 INFO mapred.LocalJobRunner: map task executor complete.
16/01/11 18:13:39 INFO mapreduce.Job: map 100% reduce 0%
16/01/11 18:13:39 INFO mapreduce.Job: Job job_local1388960043_0001 completed successfully
16/01/11 18:13:39 INFO mapreduce.Job: Counters: 23
File System Counters
FILE: Number of bytes read=117667
FILE: Number of bytes written=376441
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=420
HDFS: Number of bytes written=797
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Map-Reduce Framework
Map input records=3
Map output records=3
Input split bytes=98
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=13
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=29556736
File Input Format Counters
Bytes Read=420
File Output Format Counters
Bytes Written=797
16/01/11 18:13:39 INFO streaming.StreamJob: Output directory: /tmp/file12385c69c3aa
16/01/11 18:13:45 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/file1238165bcfc
16/01/11 18:13:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/file1238697b8d5e
function ()
{
fname
}

Conan Zhang

S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found

This looks like a package mismatch error.

Jerome Cao

try rmr.options(backend="local") and mp <- mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2)), then from.dfs(mp)


Jerome Cao

Thanks, Brother Dan! I hit so many errors and googled endlessly; today it finally worked. A little self-congratulation is in order.

Conan Zhang

Nice!

jpwl

groups = rbinom(32, n = 50, prob = 0.4)
groups2 = to.dfs(groups)
from.dfs(mapreduce(input = groups2, map = function(k,v) keyval(v, 1), reduce = function(k,vv) keyval(k, length(vv)), output.format='text', output = '/output10'))
When I run the program above, why is the only output the following:
SEQ/org.apache.hadoop.typedbytes.TypedBytesWritable/org.apache.hadoop.typedbytes.TypedBytesWritable
After removing the output.format='text' part, the output is blank. What's the reason? Teacher Zhang, please advise.

Conan Zhang

Try setting combine to FALSE; the problem may be occurring in the combine step.
