
Blog Archives

PageRank Algorithm Implemented in R

The "Algorithm is King" series covers a cross-disciplinary mix of algorithms: computer-science algorithms, data-mining (machine-learning) algorithms, statistical algorithms, financial algorithms, and more. Against the backdrop of the big-data era, algorithms have become the stars at the top of the pyramid. A good algorithm can build a great empire, just as it did for Google.

The age of the algorithm as king has officially arrived...

About the author:

  • Zhang Dan (Conan), programmer: Java, R, PHP, Javascript
  • weibo:@Conan_Z
  • blog: http://blog.fens.me
  • email: bsspirit@gmail.com

Please credit the source when reposting:
http://blog.fens.me/algorithm-pagerank-r/

pagerank-r

Preface

Google Search has long been a tool I use every day, and countless times I have marveled at the accuracy of its results. I have also been doing SEO on Google to promote my own blog: after a few months of effort, my blog's PR reached 2 and it now has tens of thousands of backlinks. Summing it all up, I can only marvel at the magic of PageRank!

PageRank, the algorithm that changed the world!

Contents

  1. Introduction to the PageRank algorithm
  2. How the PageRank algorithm works
  3. Implementing PageRank in R

1. Introduction to the PageRank Algorithm

PageRank is Google's proprietary algorithm for measuring the importance of a particular web page relative to the other pages in the search engine's index. It was invented by Larry Page and Sergey Brin in the late 1990s. PageRank turned the notion of link value into a ranking factor.

PageRank lets links do the "voting"

A page's "vote count" is determined by the importance of all the pages linking to it: a hyperlink to a page counts as a vote for that page. A page's PageRank is computed recursively from the importance of all the pages that link to it (its "inbound pages"). A page with many inbound links earns a high rank; conversely, a page with no inbound links at all has no rank.

In one sentence: a page that many high-quality pages link to is itself a high-quality page.

The PageRank computation rests on two basic assumptions:

  • Quantity assumption: the more inbound links a page receives from other pages, the more important that page is.
  • Quality assumption: inbound links differ in quality; a high-quality page passes more weight through its links. So the higher the quality of the pages pointing to page A, the more important page A is.

Three factors drive a higher PageRank:

  • the number of backlinks
  • whether the backlinks come from pages with a high PageRank
  • the number of outbound links on the pages providing the backlinks

2. How the PageRank Algorithm Works

In the initial stage, the pages and their links form a directed graph, and every page is assigned the same PageRank value. After a number of rounds of computation, each page ends up with its final PageRank value, the current value being updated as each round proceeds.

In one round of updates, each page divides its current PageRank evenly among its outgoing links, so that each link carries a corresponding weight. Each page then sums the weights arriving on its inbound links to obtain its new PageRank score. Once every page has received its updated value, one round of the PageRank computation is complete.
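
Written out as a formula (a restatement of the paragraph above, still without damping), round k+1 computes, for each page p_i:

PR_{k+1}(p_i) = \sum_{p_j \in M(p_i)} \frac{PR_k(p_j)}{L(p_j)}

where M(p_i) is the set of pages linking to p_i and L(p_j) is the number of outbound links on p_j (both notations are defined again in the formula section below).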

1). Algorithm principle

The PageRank algorithm is built on the random surfer model. Its basic idea: the importance ranking of pages is determined by the link relations between them, and the algorithm rates each page's rank and importance from that link structure alone. A page's PR value depends not only on how many pages link to it, but also on the importance of those linking pages themselves.

PageRank has two key properties:

  • PR is transferable: when page A links to page B, part of A's PR value is passed on to B
  • Importance is transferable: an important page passes on more weight than an unimportant one

2). The formula (written out below the parameter list):

pagerank-formula

  • PR(pi): the PageRank value of page pi
  • n: the total number of pages
  • pi: the individual pages p1, p2, p3, ...
  • M(i): the set of pages that link in to pi
  • L(j): the number of outbound links on page pj
  • d: the damping factor, the probability that a user who has arrived at a page keeps clicking onward.
    (1-d = 0.15): the probability that the user stops clicking and jumps to a random new URL
    Range: 0 < d ≤ 1; Google sets d = 0.85
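
Assembled from the parameters above, this is the standard PageRank formula (the content of the pagerank-formula image):

PR(p_i) = \frac{1-d}{n} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}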

3). A worked example with 4 pages

pagerank-sample

Notes on the figure:

  • Page ID=1 links to pages 2, 3 and 4, so a user jumps from page 1 to each of 2, 3, 4 with probability 1/3
  • Page ID=2 links to pages 3 and 4, so a user jumps from page 2 to each of 3, 4 with probability 1/2
  • Page ID=3 links only to page 4, so a user jumps from page 3 to 4 with probability 1
  • Page ID=4 links only to page 2, so a user jumps from page 4 to 2 with probability 1

Build the adjacency list:


Source page    Target pages
    1             2,3,4
    2             3,4
    3             4
    4             2

Build the adjacency matrix (a square matrix):

  • columns: source pages
  • rows: target pages

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    1    0    0    1
[3,]    1    1    0    0
[4,]    1    1    1    0

Convert it to a probability matrix (the transition matrix):


     [,1] [,2] [,3] [,4]
[1,] 0      0    0    0
[2,] 1/3    0    0    1
[3,] 1/3  1/2    0    0
[4,] 1/3  1/2    1    0

Through the link relations we have now constructed the "transition matrix".
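
As a worked power-iteration step (my illustration, matching the eigenMatrix() function below, which starts from x = rep(1,n) and normalizes only at the end):

x_1 = G x_0 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1/3 & 0 & 0 & 1 \\ 1/3 & 1/2 & 0 & 0 \\ 1/3 & 1/2 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 4/3 \\ 5/6 \\ 11/6 \end{pmatrix}

Repeating x_{k+1} = G x_k and normalizing at the end converges to the PageRank vector computed in section 3.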

3. Implementing PageRank in R (single machine)

Create the data file page.csv:


1,2
1,3
1,4
2,3
2,4
3,4
4,2

We implement PageRank in 3 ways below:

  • without the damping factor
  • with the damping factor
  • using R's eigenvalue function directly

1). Without the damping factor

The R implementation:


#build the adjacency matrix
adjacencyMatrix<-function(pages){
  n<-max(apply(pages,2,max))
  A <- matrix(0,n,n)
  for(i in 1:nrow(pages)) A[pages[i,]$dist,pages[i,]$src]<-1
  A
}

#convert to the probability (transition) matrix
probabilityMatrix<-function(G){
  cs <- colSums(G)
  cs[cs==0] <- 1
  n <- nrow(G)
  A <- matrix(0,nrow(G),ncol(G))
  for (i in 1:n) A[i,] <- A[i,] + G[i,]/cs
  A
}

#approximate the dominant eigenvector by power iteration
eigenMatrix<-function(G,iter=100){
  n<-nrow(G)
  x <- rep(1,n)
  for (i in 1:iter) x <- G %*% x
  x/sum(x)
}

> pages<-read.table(file="page.csv",header=FALSE,sep=",")
> names(pages)<-c("src","dist");pages
  src dist
1   1    2
2   1    3
3   1    4
4   2    3
5   2    4
6   3    4
7   4    2

> A<-adjacencyMatrix(pages);A
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    1    0    0    1
[3,]    1    1    0    0
[4,]    1    1    1    0

> G<-probabilityMatrix(A);G
          [,1] [,2] [,3] [,4]
[1,] 0.0000000  0.0    0    0
[2,] 0.3333333  0.0    0    1
[3,] 0.3333333  0.5    0    0
[4,] 0.3333333  0.5    1    0

> q<-eigenMatrix(G,100);q
          [,1]
[1,] 0.0000000
[2,] 0.4036458
[3,] 0.1979167
[4,] 0.3984375

Interpreting the results:

  • Page ID=1 has PR 0, because no page links to page 1
  • Page ID=2 has PR 0.4, the highest weight: both 1 and 4 link to 2, page 4 carries a high weight, and 4 has only the one outbound link to 2, so its weight transfers without loss
  • Page ID=3 has PR 0.19: although 1 and 2 link to 3, they also link to other pages, so their weight is spread out and page 3's PR stays low
  • Page ID=4 has PR 0.39, a high weight, because 1, 2 and 3 all link to 4

The results above show that page ID=1 has a PR of 0, so it has no weight to pass on to other pages, and the computation becomes unreasonable. Hence the damping factor d is introduced: it corrects for pages with no inbound links and guarantees every page a minimum PR greater than 0.
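
In matrix terms (a restatement of what the dProbabilityMatrix() function below computes), damping replaces the transition matrix G with

G' = d\,G + \frac{1-d}{n}\,E

where E is the n \times n all-ones matrix, so in every round each page receives a baseline of (1-d)/n regardless of its inbound links.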

2). With the damping factor

Add a function, dProbabilityMatrix:



#convert to the probability matrix, taking the damping factor d into account
dProbabilityMatrix<-function(G,d=0.85){
  cs <- colSums(G)
  cs[cs==0] <- 1
  n <- nrow(G)
  delta <- (1-d)/n
  A <- matrix(delta,nrow(G),ncol(G))
  for (i in 1:n) A[i,] <- A[i,] + d*G[i,]/cs
  A
}

> pages<-read.table(file="page.csv",header=FALSE,sep=",")
> names(pages)<-c("src","dist");pages
  src dist
1   1    2
2   1    3
3   1    4
4   2    3
5   2    4
6   3    4
7   4    2

> A<-adjacencyMatrix(pages);A
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    1    0    0    1
[3,]    1    1    0    0
[4,]    1    1    1    0

> G<-dProbabilityMatrix(A);G
          [,1]   [,2]   [,3]   [,4]
[1,] 0.0375000 0.0375 0.0375 0.0375
[2,] 0.3208333 0.0375 0.0375 0.8875
[3,] 0.3208333 0.4625 0.0375 0.0375
[4,] 0.3208333 0.4625 0.8875 0.0375

> q<-eigenMatrix(G,100);q
          [,1]
[1,] 0.0375000
[2,] 0.3738930
[3,] 0.2063759
[4,] 0.3822311

With the damping factor in place, page ID=1 now gets a value: PR(1) = (1-d)/n = (1-0.85)/4 = 0.0375, the minimum for a page with no inbound links.

3). Using R's eigenvalue function directly

Add a function, calcEigenMatrix:


#compute the matrix eigenvectors directly
calcEigenMatrix<-function(G){
  x <- Re(eigen(G)$vectors[,1])
  x/sum(x)
}

> pages<-read.table(file="page.csv",header=FALSE,sep=",")
> names(pages)<-c("src","dist");pages
  src dist
1   1    2
2   1    3
3   1    4
4   2    3
5   2    4
6   3    4
7   4    2

> A<-adjacencyMatrix(pages);A
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    1    0    0    1
[3,]    1    1    0    0
[4,]    1    1    1    0

> G<-dProbabilityMatrix(A);G
          [,1]   [,2]   [,3]   [,4]
[1,] 0.0375000 0.0375 0.0375 0.0375
[2,] 0.3208333 0.0375 0.0375 0.8875
[3,] 0.3208333 0.4625 0.0375 0.0375
[4,] 0.3208333 0.4625 0.8875 0.0375

> q<-calcEigenMatrix(G);q
[1] 0.0375000 0.3732476 0.2067552 0.3824972

Computing the eigenvector directly removes the iteration loop and makes the program run more efficiently.
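
Why this works (background, not spelled out in the original): the damped matrix G is column-stochastic, every column sums to 1, so its largest eigenvalue is 1 and the PageRank vector is the corresponding eigenvector:

G\,p = p

eigen() returns eigenvectors ordered by decreasing eigenvalue magnitude, so vectors[,1] is exactly this vector; Re() drops the zero imaginary parts and x/sum(x) normalizes it into a probability vector.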

Once you understand how PageRank works, building a PageRank model in R is very easy. In real applications, too, we like to build the model in a simple way first, validate it, and only then reimplement it in another language for enterprise use!

The next article will show how to implement the PageRank model with a distributed MapReduce algorithm: Parallel implementation of the PageRank algorithm.


Please credit the source when reposting:
http://blog.fens.me/algorithm-pagerank-r/


Implementing Matrix Multiplication with MapReduce

This Hadoop-family series introduces the products of the Hadoop ecosystem. Frequently used projects include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeeper, Avro, Ambari, and Chukwa; newer additions include YARN, Hcatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, Hue, and more.

Since 2011 China has been in a surging era of big data, and the Hadoop family of software has claimed a broad share of big-data processing. Open-source communities and vendors alike, with virtually every piece of data software, have moved toward Hadoop. Hadoop itself has grown from a niche playground into the standard for big-data development, and on top of the original technology a whole family of products keeps innovating under the banner of "big data".

As developers in the IT industry, we too need to keep up the pace, seize the opportunity, and rise together with Hadoop!

About the author:

  • Zhang Dan (Conan), programmer: Java, R, PHP, Javascript
  • weibo:@Conan_Z
  • blog: http://blog.fens.me
  • email: bsspirit@gmail.com

Please credit the source when reposting:
http://blog.fens.me/hadoop-mapreduce-matrix/

hadoop-mapreduce-matrix

Preface

MapReduce opened the door to parallel computing, giving individual developers the ability to process big data. But using MapReduce well, parallelizing algorithms that were originally single-machine, is no easy matter. Much of the time we need to ask whether a single-machine algorithm can be recast in terms of matrices, so matrix operations become the foundation of algorithm parallelization.

The collaborative filtering algorithms behind recommender systems, for example, are parallelized on MapReduce exactly through this matrix mindset.

Contents

  1. Introduction to matrices
  2. Matrix multiplication in R
  3. Matrix multiplication with MapReduce
  4. Sparse matrix multiplication with MapReduce

1. Introduction to Matrices

A matrix: in mathematics, an m×n matrix is a rectangular array of elements arranged in m rows and n columns. The elements of a matrix can be numbers, symbols, or mathematical expressions. Below is a 2-row, 3-column matrix made up of 6 numeric elements:


1 2 3
4 5 6

Matrix addition
Matrices of the same size (same number of rows and columns) can be added and subtracted, element by element.

Example: adding two matrices


1 3 1   +  0 0 5   =   1+0 3+0 1+5   =   1 3 6
1 0 0      7 5 0       1+7 0+5 0+0       8 5 0 

Matrix multiplication
Two matrices can be multiplied if and only if the number of columns of the first equals the number of rows of the second. Matrix multiplication is associative and distributive, but not commutative.

Example: multiplying two matrices


 1 0 2   *   3 1   =  (1*3+0*2+2*1)  (1*1+0*1+2*0)    =  5 1
-1 3 1       2 1      (-1*3+3*2+1*1) (-1*1+3*1+1*0)      4 2
             1 0 
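
In general, each cell of the product C = AB is a sum of pairwise products:

c_{ij} = \sum_{k=1}^{n} a_{ik}\,b_{kj}

This formula is worth keeping in mind: the MapReduce implementation in section 3 computes exactly one such sum per reduce key (i,j).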

2. Matrix Multiplication in R


> m1<-matrix(c(1,0,2,-1,3,1),nrow=2,byrow=TRUE);m1
     [,1] [,2] [,3]
[1,]    1    0    2
[2,]   -1    3    1

> m2<-matrix(c(3,1,2,1,1,0),nrow=3,byrow=TRUE);m2
     [,1] [,2]
[1,]    3    1
[2,]    2    1
[3,]    1    0

> m3<-m1 %*% m2;m3
     [,1] [,2]
[1,]    5    1
[2,]    4    2

Implementing matrix multiplication in R is extremely simple.

3. Matrix Multiplication with MapReduce

The implementation plan (the key/value design is sketched right after this list):

mapreduce-matrix

  • Create 2 matrix data files: m1.csv, m2.csv
  • Create the driver program: MainRun.java
  • Create the MR program: MartrixMultiply.java
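
The key design is this: every cell (i,j) of the result C = A×B becomes a reduce key; the mapper emits the elements of A's row i tagged "A:k,value" and the elements of B's column j tagged "B:k,value" under that key, and the reducer joins the two lists on k and sums the products. The following standalone Java sketch (my illustration, not part of the original post; no Hadoop required) simulates that shuffle for the 2×3 and 3×2 matrices above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MatrixMRSketch {

    public static void main(String[] args) {
        int[][] a = {{1, 0, 2}, {-1, 3, 1}};   // matrix A, 2x3
        int[][] b = {{3, 1}, {2, 1}, {1, 0}};  // matrix B, 3x2

        // "map + shuffle": each result cell "i,j" collects tagged values
        Map<String, List<String>> shuffle = new TreeMap<String, List<String>>();
        for (int i = 0; i < 2; i++)            // emit A's row i to every result column j
            for (int k = 0; k < 3; k++)
                for (int j = 0; j < 2; j++)
                    shuffle.computeIfAbsent((i + 1) + "," + (j + 1), x -> new ArrayList<>())
                           .add("A:" + (k + 1) + "," + a[i][k]);
        for (int k = 0; k < 3; k++)            // emit B's column j to every result row i
            for (int j = 0; j < 2; j++)
                for (int i = 0; i < 2; i++)
                    shuffle.computeIfAbsent((i + 1) + "," + (j + 1), x -> new ArrayList<>())
                           .add("B:" + (k + 1) + "," + b[k][j]);

        // "reduce": join the A and B values on k and sum the products
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            Map<String, Integer> mapA = new HashMap<>();
            Map<String, Integer> mapB = new HashMap<>();
            for (String v : e.getValue()) {
                String[] kv = v.substring(2).split(",");
                (v.startsWith("A:") ? mapA : mapB).put(kv[0], Integer.parseInt(kv[1]));
            }
            int sum = 0;
            for (String k : mapA.keySet())
                sum += mapA.get(k) * mapB.get(k);
            System.out.println(e.getKey() + "\t" + sum);   // 1,1->5  1,2->1  2,1->4  2,2->2
        }
    }
}

Running it prints the four cells of C (5, 1, 4, 2), matching the R result in section 2; the Hadoop code below does exactly this, only with the shuffle performed by the framework.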

1). Create the 2 matrix data files m1.csv and m2.csv

m1.csv


1,0,2
-1,3,1

m2.csv


3,1
2,1
1,0

2). Create the driver program: MainRun.java

The driver program:


package org.conan.myhadoop.matrix;

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.hadoop.mapred.JobConf;

public class MainRun {

    public static final String HDFS = "hdfs://192.168.1.210:9000";
    public static final Pattern DELIMITER = Pattern.compile("[\t,]");

    public static void main(String[] args) {
        martrixMultiply();
    }
    
    public static void martrixMultiply() {
        Map<String, String> path = new HashMap<String, String>();
        path.put("m1", "logfile/matrix/m1.csv");// 本地的数据文件
        path.put("m2", "logfile/matrix/m2.csv");
        path.put("input", HDFS + "/user/hdfs/matrix");// HDFS的目录
        path.put("input1", HDFS + "/user/hdfs/matrix/m1");
        path.put("input2", HDFS + "/user/hdfs/matrix/m2");
        path.put("output", HDFS + "/user/hdfs/matrix/output");

        try {
            MartrixMultiply.run(path);// launch the job
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }

    public static JobConf config() {// remote configuration of the Hadoop cluster
        JobConf conf = new JobConf(MainRun.class);
        conf.setJobName("MartrixMultiply");
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");
        return conf;
    }

}

3). Create the MR program: MartrixMultiply.java

The MapReduce program:


package org.conan.myhadoop.matrix;

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.conan.myhadoop.hdfs.HdfsDAO;

public class MartrixMultiply {

    public static class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {

        private String flag;// m1 or m2

        private int rowNum = 2;// number of rows of matrix A
        private int colNum = 2;// number of columns of matrix B
        private int rowIndexA = 1; // current row of matrix A
        private int rowIndexB = 1; // current row of matrix B

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            FileSplit split = (FileSplit) context.getInputSplit();
            flag = split.getPath().getName();// identify which dataset this split comes from
        }

        @Override
        public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
            String[] tokens = MainRun.DELIMITER.split(values.toString());
            if (flag.equals("m1")) {
                for (int i = 1; i <= rowNum; i++) {
                    Text k = new Text(rowIndexA + "," + i);
                    for (int j = 1; j <= tokens.length; j++) {
                        Text v = new Text("A:" + j + "," + tokens[j - 1]);
                        context.write(k, v);
                        System.out.println(k.toString() + "  " + v.toString());
                    }

                }
                rowIndexA++;

            } else if (flag.equals("m2")) {
                for (int i = 1; i <= rowNum; i++) {// emit one copy of this B row for every row of the result
                    for (int j = 1; j <= colNum; j++) {
                        Text k = new Text(i + "," + j);
                        Text v = new Text("B:" + rowIndexB + "," + tokens[j - 1]);
                        context.write(k, v);
                        System.out.println(k.toString() + "  " + v.toString());
                    }
                }

                rowIndexB++;
            }
        }
    }

    public static class MatrixReducer extends Reducer<Text, Text, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

            Map<String, String> mapA = new HashMap<String, String>();
            Map<String, String> mapB = new HashMap<String, String>();

            System.out.print(key.toString() + ":");

            for (Text line : values) {
                String val = line.toString();
                System.out.print("("+val+")");

                if (val.startsWith("A:")) {
                    String[] kv = MainRun.DELIMITER.split(val.substring(2));
                    mapA.put(kv[0], kv[1]);

                    // System.out.println("A:" + kv[0] + "," + kv[1]);

                } else if (val.startsWith("B:")) {
                    String[] kv = MainRun.DELIMITER.split(val.substring(2));
                    mapB.put(kv[0], kv[1]);

                    // System.out.println("B:" + kv[0] + "," + kv[1]);
                }
            }

            int result = 0;
            Iterator<String> iter = mapA.keySet().iterator();
            while (iter.hasNext()) {
                String mapk = iter.next();
                result += Integer.parseInt(mapA.get(mapk)) * Integer.parseInt(mapB.get(mapk));
            }
            context.write(key, new IntWritable(result));
            System.out.println();

            // System.out.println("C:" + key.toString() + "," + result);
        }
    }

    public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
        JobConf conf = MainRun.config();

        String input = path.get("input");
        String input1 = path.get("input1");
        String input2 = path.get("input2");
        String output = path.get("output");

        HdfsDAO hdfs = new HdfsDAO(MainRun.HDFS, conf);
        hdfs.rmr(input);
        hdfs.mkdirs(input);
        hdfs.copyFile(path.get("m1"), input1);
        hdfs.copyFile(path.get("m2"), input2);

        Job job = new Job(conf);
        job.setJarByClass(MartrixMultiply.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(MatrixMapper.class);
        job.setReducerClass(MatrixReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(input1), new Path(input2));// load the 2 input datasets
        FileOutputFormat.setOutputPath(job, new Path(output));

        job.waitForCompletion(true);
    }

}

The run log:


Delete: hdfs://192.168.1.210:9000/user/hdfs/matrix
Create: hdfs://192.168.1.210:9000/user/hdfs/matrix
copy from: logfile/matrix/m1.csv to hdfs://192.168.1.210:9000/user/hdfs/matrix/m1
copy from: logfile/matrix/m2.csv to hdfs://192.168.1.210:9000/user/hdfs/matrix/m2
2014-1-15 10:48:03 org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-1-15 10:48:03 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2014-1-15 10:48:03 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2014-1-15 10:48:03 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 2
2014-1-15 10:48:03 org.apache.hadoop.io.compress.snappy.LoadSnappy <clinit>
WARNING: Snappy native library not loaded
2014-1-15 10:48:04 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
2014-1-15 10:48:04 org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
2014-1-15 10:48:04 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
2014-1-15 10:48:04 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
2014-1-15 10:48:04 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
1,1  A:1,1
1,1  A:2,0
1,1  A:3,2
1,2  A:1,1
1,2  A:2,0
1,2  A:3,2
2,1  A:1,-1
2,1  A:2,3
2,1  A:3,1
2,2  A:1,-1
2,2  A:2,3
2,2  A:3,1
2014-1-15 10:48:04 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2014-1-15 10:48:04 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2014-1-15 10:48:04 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2014-1-15 10:48:05 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
2014-1-15 10:48:07 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2014-1-15 10:48:07 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
2014-1-15 10:48:07 org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
2014-1-15 10:48:07 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
2014-1-15 10:48:07 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
2014-1-15 10:48:07 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
1,1  B:1,3
2014-1-15 10:48:07 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
1,2  B:1,1
2,1  B:1,3
2,2  B:1,1
1,1  B:2,2
1,2  B:2,1
2,1  B:2,2
2,2  B:2,1
1,1  B:3,1
1,2  B:3,0
2,1  B:3,1
2,2  B:3,0
2014-1-15 10:48:07 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2014-1-15 10:48:07 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
2014-1-15 10:48:08 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
2014-1-15 10:48:10 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2014-1-15 10:48:10 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000001_0' done.
2014-1-15 10:48:10 org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
2014-1-15 10:48:10 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2014-1-15 10:48:10 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 2 sorted segments
2014-1-15 10:48:10 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 2 segments left of total size: 294 bytes
2014-1-15 10:48:10 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
1,1:(B:1,3)(B:2,2)(B:3,1)(A:1,1)(A:2,0)(A:3,2)
1,2:(A:1,1)(A:2,0)(A:3,2)(B:1,1)(B:2,1)(B:3,0)
2,1:(B:1,3)(B:2,2)(B:3,1)(A:1,-1)(A:2,3)(A:3,1)
2,2:(A:1,-1)(A:2,3)(A:3,1)(B:1,1)(B:2,1)(B:3,0)
2014-1-15 10:48:10 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
2014-1-15 10:48:10 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2014-1-15 10:48:10 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_r_000000_0 is allowed to commit now
2014-1-15 10:48:10 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/matrix/output
2014-1-15 10:48:13 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2014-1-15 10:48:13 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_r_000000_0' done.
2014-1-15 10:48:14 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 100%
2014-1-15 10:48:14 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:   File Output Format Counters 
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Bytes Written=24
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:   FileSystemCounters
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_READ=1713
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     HDFS_BYTES_READ=75
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_WRITTEN=125314
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     HDFS_BYTES_WRITTEN=114
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:   File Input Format Counters 
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Bytes Read=30
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:   Map-Reduce Framework
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Map output materialized bytes=302
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Map input records=5
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Reduce shuffle bytes=0
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Spilled Records=48
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Map output bytes=242
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Total committed heap usage (bytes)=764215296
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     SPLIT_RAW_BYTES=220
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Combine input records=0
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Reduce input records=24
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Reduce input groups=4
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Combine output records=0
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Reduce output records=4
2014-1-15 10:48:14 org.apache.hadoop.mapred.Counters log
INFO:     Map output records=24

4. Sparse Matrix Multiplication with MapReduce

When we work with real data, the matrices are usually extremely sparse. To save storage space, we normally store only the non-zero entries.

Let's now work through a sparse matrix example:

spraseMatrix

  • Matrix multiplication in R
  • Create 2 sparse matrix data files: sm1.csv, sm2.csv
  • Modify the driver program: MainRun.java
  • Create the MR program: SparseMartrixMultiply.java

1). Matrix multiplication in R

The R program:


> m1<-matrix(c(1,0,0,3,2,5,0,4,0,0,0,1,4,7,1,2),nrow=4,byrow=TRUE);m1
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    3
[2,]    2    5    0    4
[3,]    0    0    0    1
[4,]    4    7    1    2

> m2<-matrix(c(5,0,0,2,0,0,3,1),nrow=4,byrow=TRUE);m2
     [,1] [,2]
[1,]    5    0
[2,]    0    2
[3,]    0    0
[4,]    3    1

> m3<-m1 %*% m2;m3
     [,1] [,2]
[1,]   14    3
[2,]   22   14
[3,]    3    1
[4,]   26   16

2). Create the 2 sparse matrix data files sm1.csv and sm2.csv

Only the non-zero entries are stored, using 3 columns: the row in the original matrix, the column in the original matrix, and the value.

sm1.csv


1,1,1
1,4,3
2,1,2
2,2,5
2,4,4
3,4,1
4,1,4
4,2,7
4,3,1
4,4,2

sm2.csv


1,1,5
2,2,2
4,1,3
4,2,1

3). Modify the driver program: MainRun.java

Add a launch method for SparseMartrixMultiply:


    public static void main(String[] args) {
        sparseMartrixMultiply();
    }    
    
    public static void sparseMartrixMultiply() {
        Map<String, String> path = new HashMap<String, String>();
        path.put("m1", "logfile/matrix/sm1.csv");// 本地的数据文件
        path.put("m2", "logfile/matrix/sm2.csv");
        path.put("input", HDFS + "/user/hdfs/matrix");// HDFS的目录
        path.put("input1", HDFS + "/user/hdfs/matrix/m1");
        path.put("input2", HDFS + "/user/hdfs/matrix/m2");
        path.put("output", HDFS + "/user/hdfs/matrix/output");

        try {
            SparseMartrixMultiply.run(path);// launch the job
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(0);
    }

4). Create the MR program: SparseMartrixMultiply.java

spareseMatrix2

  • The map function changes; the reduce function stays the same
  • The variables that tracked the current row of each matrix are removed (the sparse format carries the coordinates in each record)

package org.conan.myhadoop.matrix;

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.conan.myhadoop.hdfs.HdfsDAO;

public class SparseMartrixMultiply {

    public static class SparseMatrixMapper extends Mapper<LongWritable, Text, Text, Text> {

        private String flag;// m1 or m2

        private int rowNum = 4;// number of rows of matrix A
        private int colNum = 2;// number of columns of matrix B

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            FileSplit split = (FileSplit) context.getInputSplit();
            flag = split.getPath().getName();// identify which dataset this split comes from
        }

        @Override
        public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
            String[] tokens = MainRun.DELIMITER.split(values.toString());
            if (flag.equals("m1")) {
                String row = tokens[0];
                String col = tokens[1];
                String val = tokens[2];

                for (int i = 1; i <= colNum; i++) {
                    Text k = new Text(row + "," + i);
                    Text v = new Text("A:" + col + "," + val);
                    context.write(k, v);
                    System.out.println(k.toString() + "  " + v.toString());
                }

            } else if (flag.equals("m2")) {
                String row = tokens[0];
                String col = tokens[1];
                String val = tokens[2];

                for (int i = 1; i <= rowNum; i++) {
                    Text k = new Text(i + "," + col);
                    Text v = new Text("B:" + row + "," + val);
                    context.write(k, v);
                    System.out.println(k.toString() + "  " + v.toString());

                }
            }
        }
    }

    public static class SparseMatrixReducer extends Reducer<Text, Text, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

            Map<String, String> mapA = new HashMap<String, String>();
            Map<String, String> mapB = new HashMap<String, String>();

            System.out.print(key.toString() + ":");

            for (Text line : values) {
                String val = line.toString();
                System.out.print("(" + val + ")");

                if (val.startsWith("A:")) {
                    String[] kv = MainRun.DELIMITER.split(val.substring(2));
                    mapA.put(kv[0], kv[1]);

                    // System.out.println("A:" + kv[0] + "," + kv[1]);

                } else if (val.startsWith("B:")) {
                    String[] kv = MainRun.DELIMITER.split(val.substring(2));
                    mapB.put(kv[0], kv[1]);

                    // System.out.println("B:" + kv[0] + "," + kv[1]);
                }
            }

            int result = 0;
            Iterator<String> iter = mapA.keySet().iterator();
            while (iter.hasNext()) {
                String mapk = iter.next();
                String bVal = mapB.containsKey(mapk) ? mapB.get(mapk) : "0";
                result += Integer.parseInt(mapA.get(mapk)) * Integer.parseInt(bVal);
            }
            context.write(key, new IntWritable(result));
            System.out.println();

            // System.out.println("C:" + key.toString() + "," + result);
        }
    }

    public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
        JobConf conf = MainRun.config();

        String input = path.get("input");
        String input1 = path.get("input1");
        String input2 = path.get("input2");
        String output = path.get("output");

        HdfsDAO hdfs = new HdfsDAO(MainRun.HDFS, conf);
        hdfs.rmr(input);
        hdfs.mkdirs(input);
        hdfs.copyFile(path.get("m1"), input1);
        hdfs.copyFile(path.get("m2"), input2);

        Job job = new Job(conf);
        job.setJarByClass(SparseMartrixMultiply.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(SparseMatrixMapper.class);
        job.setReducerClass(SparseMatrixReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(input1), new Path(input2));// load the 2 input datasets
        FileOutputFormat.setOutputPath(job, new Path(output));

        job.waitForCompletion(true);
    }
}

The run output:


Delete: hdfs://192.168.1.210:9000/user/hdfs/matrix
Create: hdfs://192.168.1.210:9000/user/hdfs/matrix
copy from: logfile/matrix/sm1.csv to hdfs://192.168.1.210:9000/user/hdfs/matrix/m1
copy from: logfile/matrix/sm2.csv to hdfs://192.168.1.210:9000/user/hdfs/matrix/m2
2014-1-15 11:57:31 org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-1-15 11:57:31 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2014-1-15 11:57:31 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
2014-1-15 11:57:31 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 2
2014-1-15 11:57:31 org.apache.hadoop.io.compress.snappy.LoadSnappy <clinit>
WARNING: Snappy native library not loaded
2014-1-15 11:57:31 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
2014-1-15 11:57:31 org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
2014-1-15 11:57:31 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
2014-1-15 11:57:31 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
2014-1-15 11:57:31 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
1,1  A:1,1
1,2  A:1,1
1,1  A:4,3
1,2  A:4,3
2,1  A:1,2
2,2  A:1,2
2,1  A:2,5
2,2  A:2,5
2,1  A:4,4
2,2  A:4,4
3,1  A:4,1
3,2  A:4,1
4,1  A:1,4
4,2  A:1,4
4,1  A:2,7
4,2  A:2,7
4,1  A:3,1
4,2  A:3,1
4,1  A:4,2
4,2  A:4,2
2014-1-15 11:57:31 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2014-1-15 11:57:31 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2014-1-15 11:57:31 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2014-1-15 11:57:32 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
2014-1-15 11:57:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2014-1-15 11:57:34 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
2014-1-15 11:57:34 org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
2014-1-15 11:57:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: io.sort.mb = 100
2014-1-15 11:57:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: data buffer = 79691776/99614720
2014-1-15 11:57:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
INFO: record buffer = 262144/327680
1,1  B:1,5
2,1  B:1,5
3,1  B:1,5
4,1  B:1,5
2014-1-15 11:57:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
1,2  B:2,2
2,2  B:2,2
3,2  B:2,2
4,2  B:2,2
1,1  B:4,3
2,1  B:4,3
3,1  B:4,3
4,1  B:4,3
1,2  B:4,1
2,2  B:4,1
3,2  B:4,1
4,2  B:4,1
2014-1-15 11:57:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2014-1-15 11:57:34 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
2014-1-15 11:57:35 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
2014-1-15 11:57:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2014-1-15 11:57:37 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000001_0' done.
2014-1-15 11:57:37 org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
2014-1-15 11:57:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2014-1-15 11:57:37 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 2 sorted segments
2014-1-15 11:57:37 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 2 segments left of total size: 436 bytes
2014-1-15 11:57:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
1,1:(B:1,5)(B:4,3)(A:1,1)(A:4,3)
1,2:(A:1,1)(A:4,3)(B:2,2)(B:4,1)
2,1:(B:1,5)(B:4,3)(A:1,2)(A:2,5)(A:4,4)
2,2:(A:1,2)(A:2,5)(A:4,4)(B:4,1)(B:2,2)
3,1:(B:1,5)(B:4,3)(A:4,1)
3,2:(A:4,1)(B:2,2)(B:4,1)
4,1:(B:4,3)(B:1,5)(A:1,4)(A:2,7)(A:3,1)(A:4,2)
4,2:(A:1,4)(A:2,7)(A:3,1)(A:4,2)(B:2,2)(B:4,1)
2014-1-15 11:57:37 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
2014-1-15 11:57:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2014-1-15 11:57:37 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_r_000000_0 is allowed to commit now
2014-1-15 11:57:37 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/matrix/output
2014-1-15 11:57:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2014-1-15 11:57:40 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_r_000000_0' done.
2014-1-15 11:57:41 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 100%
2014-1-15 11:57:41 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:   File Output Format Counters 
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Bytes Written=53
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:   FileSystemCounters
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_READ=2503
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     HDFS_BYTES_READ=266
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_WRITTEN=126274
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     HDFS_BYTES_WRITTEN=347
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:   File Input Format Counters 
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Bytes Read=98
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:   Map-Reduce Framework
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Map output materialized bytes=444
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Map input records=14
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Reduce shuffle bytes=0
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Spilled Records=72
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Map output bytes=360
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Total committed heap usage (bytes)=764215296
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     SPLIT_RAW_BYTES=220
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Combine input records=0
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Reduce input records=36
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Reduce input groups=8
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Combine output records=0
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Reduce output records=8
2014-1-15 11:57:41 org.apache.hadoop.mapred.Counters log
INFO:     Map output records=36

The full source code has been uploaded to GitHub:
https://github.com/bsspirit/maven_hadoop_template/tree/master/src/main/java/org/conan/myhadoop/matrix

With that, we have implemented matrix multiplication as a MapReduce program! With matrix computation in place, we can go on to do many more things!

Reference: Implementing large matrix multiplication with MapReduce

Please credit the source when reposting:
http://blog.fens.me/hadoop-mapreduce-matrix/


Mahout Recommendation API in Depth

This Hadoop-family series introduces the products of the Hadoop ecosystem. Frequently used projects include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeeper, Avro, Ambari, and Chukwa; newer additions include YARN, Hcatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, Hue, and more.

Since 2011 China has been in a surging era of big data, and the Hadoop family of software has claimed a broad share of big-data processing. Open-source communities and vendors alike, with virtually every piece of data software, have moved toward Hadoop. Hadoop itself has grown from a niche playground into the standard for big-data development, and on top of the original technology a whole family of products keeps innovating under the banner of "big data".

As developers in the IT industry, we too need to keep up the pace, seize the opportunity, and rise together with Hadoop!

About the author:

  • Zhang Dan (Conan), programmer: Java, R, PHP, Javascript
  • weibo:@Conan_Z
  • blog: http://blog.fens.me
  • email: bsspirit@gmail.com

Please credit the source when reposting:
http://blog.fens.me/mahout-recommendation-api

mahout-Recommendation

Preface

Building a recommender system with Mahout is both simple and difficult. Simple, because Mahout fully encapsulates the "collaborative filtering" algorithms, parallelizes them, and exposes a very simple API; difficult, because without understanding the algorithms' internals it is hard to configure and tune them for a concrete business scenario.

This article goes deep into the algorithm APIs to explain what Mahout's recommendation algorithms do underneath.

Contents

  1. Introduction to Mahout's recommendation algorithms
  2. Evaluation metrics: recall and precision
  3. The Recommender.java API
  4. The test program: RecommenderTest.java
  5. User-based collaborative filtering: UserCF
  6. Item-based collaborative filtering: ItemCF
  7. The SlopeOne algorithm
  8. The KNN linear interpolation item-based recommender
  9. The SVD recommender
  10. The Tree Cluster-based recommender
  11. Summary of Mahout's recommendation algorithms

1. Introduction to Mahout's Recommendation Algorithms

By data-processing capacity, Mahout's recommendation algorithms divide into 2 classes:

  • single-machine, in-memory implementations
  • distributed implementations based on Hadoop

1). Single-machine, in-memory implementations

The single-machine, in-memory algorithms run on one machine and are implemented by the cf.taste project. The familiar UserCF and ItemCF both support single-machine, in-memory execution, with flexibly configurable parameters. For a basic single-machine example, see the article: Building a Mahout project with Maven.

The problem with in-memory algorithms is that they are limited by the resources of a single machine. For medium-scale data, say 1G or 10G, they can cope, but beyond 100G the task is impossible for one machine.

2). Distributed implementations based on Hadoop

The Hadoop-based distributed implementations parallelize the single-machine algorithms and spread the work across multiple machines. Mahout provides a parallelized Hadoop implementation of ItemCF. For details, see the article:
Developing distributed Mahout programs: item-based collaborative filtering (ItemCF)

The hard part of distributed parallel algorithms is the parallelization itself. On a single machine we only need to think about the algorithm, data structures, memory, and CPU; a distributed algorithm must additionally handle many more situations, such as merging data across nodes, sorting, network efficiency, recomputation when a node goes down, and distributed storage.

2. Evaluation Metrics: Recall and Precision

Mahout provides 2 metrics for evaluating a recommender: precision and recall, the classic measures from search engines.

precision_recall


                Relevant   Irrelevant
Retrieved          A           C
Not retrieved      B           D
  • A: retrieved and relevant (found, and wanted)
  • B: not retrieved, but relevant (not found, though actually wanted)
  • C: retrieved, but irrelevant (found, but useless)
  • D: not retrieved, and irrelevant (not found, and not wanted)

Retrieving as many of the relevant items as possible is the pursuit of recall, A/(A+B): the larger, the better.
Having the retrieved items be as relevant as possible, with as few irrelevant ones as possible, is the pursuit of precision, A/(A+C): the larger, the better.

On large datasets the two metrics constrain each other: when we want to retrieve more data, precision drops; when we want the retrieval to be more accurate, less data is retrieved.

3. The Recommender API

1). System environment:

  • Win7 64bit
  • Java 1.6.0_45
  • Maven 3
  • Eclipse Juno Service Release 2
  • Mahout 0.8
  • Hadoop 1.1.2

2). The Recommender interface:
org.apache.mahout.cf.taste.recommender.Recommender.java

mahout-Recommender-class

The interface methods explained:

  • recommend(long userID, int howMany): get recommendation results, recommending howMany items to userID
  • recommend(long userID, int howMany, IDRescorer rescorer): the same, but re-rank the results according to rescorer
  • estimatePreference(long userID, long itemID): estimate the user's rating of an item when no actual rating exists
  • setPreference(long userID, long itemID, float value): set a user, item, rating triple
  • removePreference(long userID, long itemID): delete a user's rating of an item
  • getDataModel(): get the underlying recommendation data

From the Recommender interface alone, one can guess that the core algorithm will be implemented in each subclass's estimatePreference() method.
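
Before walking the class hierarchy, here is a minimal usage sketch (my addition, wired up with the standard Mahout 0.8 Taste classes; the data file path is an assumption) showing how a Recommender is assembled and queried:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderQuickstart {

    public static void main(String[] args) throws Exception {
        // a CSV of userID,itemID,score rows, like the item.csv used in section 4
        DataModel model = new FileDataModel(new File("datafile/item.csv"));
        UserSimilarity similarity = new EuclideanDistanceSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // recommend 3 items to user 1; recommend() scores candidates via estimatePreference()
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}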

3). The subclasses that implement the Recommender interface:

mahout-Recommender-hierarchy

The recommender implementation classes:

  • GenericUserBasedRecommender: the user-based recommender
  • GenericItemBasedRecommender: the item-based recommender
  • KnnItemBasedRecommender: the item-based KNN recommender
  • SlopeOneRecommender: the Slope One recommender
  • SVDRecommender: the SVD recommender
  • TreeClusteringRecommender: the TreeCluster recommender

Each algorithm's implementation is introduced below.

4. The Test Program: RecommenderTest.java

Test dataset: item.csv


1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

Test program: org.conan.mymahout.recommendation.job.RecommenderTest.java


package org.conan.mymahout.recommendation.job;

import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.common.RandomUtils;

public class RecommenderTest {

    final static int NEIGHBORHOOD_NUM = 2;
    final static int RECOMMENDER_NUM = 3;

    public static void main(String[] args) throws TasteException, IOException {
        RandomUtils.useTestSeed();
        String file = "datafile/item.csv";
        DataModel dataModel = RecommendFactory.buildDataModel(file);
        slopeOne(dataModel);
    }

    public static void userCF(DataModel dataModel) throws TasteException{}
    public static void itemCF(DataModel dataModel) throws TasteException{}
    public static void slopeOne(DataModel dataModel) throws TasteException{}

    ...

Each algorithm gets its own test method: userCF(), itemCF(), slopeOne(), and so on.

5. User-Based Collaborative Filtering: UserCF

User-based collaborative filtering measures the similarity between users from their ratings of items, and makes recommendations based on that user-to-user similarity. In short: recommend to a user the items that other users with similar interests like.

An example:

image015

The basic idea of user-based CF is quite simple: from users' preferences for items, find a user's neighboring users, then recommend to the current user the items those neighbors like. Computationally, a user's preferences over all items are treated as a vector to compute user-to-user similarities; after the K neighbors are found, their similarity weights and item preferences predict the current user's preference for items he has not yet touched, and a ranked item list is computed as the recommendation. Figure 2 gives an example: for user A, the history yields just one neighbor, user C, so item D, which user C likes, is recommended to user A.

The image and explanation above are taken from: https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy2/
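
Reading ahead to doEstimatePreference() below, the predicted rating is a similarity-weighted average over the neighborhood (my notation, not from the original text):

\hat{r}_{u,i} = \frac{\sum_{v \in N(u)} s(u,v)\, r_{v,i}}{\sum_{v \in N(u)} s(u,v)}

where N(u) is user u's neighborhood and only neighbors v who actually rated item i contribute.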

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender


  @Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    DataModel model = getDataModel();
    Float actualPref = model.getPreferenceValue(userID, itemID);
    if (actualPref != null) {
      return actualPref;
    }
    long[] theNeighborhood = neighborhood.getUserNeighborhood(userID);
    return doEstimatePreference(userID, theNeighborhood, itemID);
  }

 protected float doEstimatePreference(long theUserID, long[] theNeighborhood, long itemID) throws TasteException {
    if (theNeighborhood.length == 0) {
      return Float.NaN;
    }
    DataModel dataModel = getDataModel();
    double preference = 0.0;
    double totalSimilarity = 0.0;
    int count = 0;
    for (long userID : theNeighborhood) {
      if (userID != theUserID) {
        // See GenericItemBasedRecommender.doEstimatePreference() too
        Float pref = dataModel.getPreferenceValue(userID, itemID);
        if (pref != null) {
          double theSimilarity = similarity.userSimilarity(theUserID, userID);
          if (!Double.isNaN(theSimilarity)) {
            preference += theSimilarity * pref;
            totalSimilarity += theSimilarity;
            count++;
          }
        }
      }
    }
    // Throw out the estimate if it was based on no data points, of course, but also if based on
    // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.
    // The reason is that in this case the estimate is, simply, the user's rating for one item
    // that happened to have a defined similarity. The similarity score doesn't matter, and that
    // seems like a bad situation.
    if (count <= 1) {
      return Float.NaN;
    }
    float estimate = (float) (preference / totalSimilarity);
    if (capper != null) {
      estimate = capper.capEstimate(estimate);
    }
    return estimate;
  }

The test program:


    public static void userCF(DataModel dataModel) throws TasteException {
        UserSimilarity userSimilarity = RecommendFactory.userSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        UserNeighborhood userNeighborhood = RecommendFactory.userNeighborhood(RecommendFactory.NEIGHBORHOOD.NEAREST, userSimilarity, dataModel, NEIGHBORHOOD_NUM);
        RecommenderBuilder recommenderBuilder = RecommendFactory.userRecommender(userSimilarity, userNeighborhood, true);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);

        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }

Program output:


AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.0
Recommender IR Evaluator: [Precision:0.5,Recall:0.5]
uid:1,(104,4.333333)(106,4.000000)
uid:2,(105,4.049678)
uid:3,(103,3.512787)(102,2.747869)
uid:4,(102,3.000000)

For UserCF rewritten in R, see the article: Dissecting Mahout's user-based collaborative filtering (UserCF) with R

6. Item-Based Collaborative Filtering: ItemCF

Item-based collaborative filtering measures the similarity between items from users' ratings of them, and makes recommendations based on that item-to-item similarity. In short: recommend to a user items similar to the ones he liked before.

An example:

image017

Item-based CF works on the same principle as user-based CF, except that the neighbors are computed among the items themselves rather than from the user's angle: from users' preferences for items, find similar items, then recommend items similar to the user's historical favorites. Computationally, all users' preferences for an item are treated as a vector to compute item-to-item similarities; once an item's similar items are known, the user's historical preferences predict his preference for items he has not yet rated, and a ranked list is computed as the recommendation. Figure 3 gives an example: for item A, all users who like item A also like item C, so A and C are similar; user C likes item A, so we can infer that user C probably also likes item C.

The image and explanation above are taken from: https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy2/
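
The estimate in doEstimatePreference() below is again a similarity-weighted average, this time over the items the user has already rated (my notation):

\hat{r}_{u,i} = \frac{\sum_{j \in I(u)} s(i,j)\, r_{u,j}}{\sum_{j \in I(u)} s(i,j)}

where I(u) is the set of items rated by user u and s(i,j) is the item-to-item similarity.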

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender


  @Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    PreferenceArray preferencesFromUser = getDataModel().getPreferencesFromUser(userID);
    Float actualPref = getPreferenceForItem(preferencesFromUser, itemID);
    if (actualPref != null) {
      return actualPref;
    }
    return doEstimatePreference(userID, preferencesFromUser, itemID);
  }

protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID)
    throws TasteException {
    double preference = 0.0;
    double totalSimilarity = 0.0;
    int count = 0;
    double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());
    for (int i = 0; i < similarities.length; i++) {
      double theSimilarity = similarities[i];
      if (!Double.isNaN(theSimilarity)) {
        // Weights can be negative!
        preference += theSimilarity * preferencesFromUser.getValue(i);
        totalSimilarity += theSimilarity;
        count++;
      }
    }
    // Throw out the estimate if it was based on no data points, of course, but also if based on
    // just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.
    // The reason is that in this case the estimate is, simply, the user's rating for one item
    // that happened to have a defined similarity. The similarity score doesn't matter, and that
    // seems like a bad situation.
    if (count <= 1) {
      return Float.NaN;
    }
    float estimate = (float) (preference / totalSimilarity);
    if (capper != null) {
      estimate = capper.capEstimate(estimate);
    }
    return estimate;
  }

The test program:


    public static void itemCF(DataModel dataModel) throws TasteException {
        ItemSimilarity itemSimilarity = RecommendFactory.itemSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        RecommenderBuilder recommenderBuilder = RecommendFactory.itemRecommender(itemSimilarity, true);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);

        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }

Program output:


AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:0.8676552772521973
Recommender IR Evaluator: [Precision:0.5,Recall:1.0]
uid:1,(105,3.823529)(104,3.722222)(106,3.478261)
uid:2,(106,2.984848)(105,2.537037)(107,2.000000)
uid:3,(106,3.648649)(102,3.380000)(103,3.312500)
uid:4,(107,4.722222)(105,4.313953)(102,4.025000)
uid:5,(107,3.736842)

7. The SlopeOne Algorithm

This algorithm was marked @Deprecated in mahout-0.8.

SlopeOne is a simple, efficient collaborative filtering algorithm that predicts ratings from average rating differences. Download the SlopeOne paper (PDF)

1). An example:
Users X, Y, and Z rate items A and B as in the table below. What is Z's rating of B?

slopeone

Slope One assumes that an average difference can stand in for the unknown rating difference between two individuals. The average difference of item A over item B is ((5 - 4) + (4 - 2)) / 2 = 1.5, which gives Z's rating of B as 3 - 1.5 = 1.5.

Slope One treats the relation between users' ratings as a simple linear one:

Y = mX + b

2). The weighted-average calculation:
Users X, Y, and Z rate items A, B, and C as in the table below. What is Z's rating of A?

slopeone2

  • 1. The average difference between A and B: ((5-3)+(3-4))/2 = 0.5
  • 2. The average difference between A and C: (5-2)/1 = 3
  • 3. Z's rating of A derived via A and B: 2 + 0.5 = 2.5
  • 4. Z's rating of A derived via A and C: 5 + 3 = 8
  • 5. The weighted average: 2 users rated both A and B, and 1 user rated both A and C, so the weights are 2 and 1: (2*2.5 + 1*8)/(2+1) = 13/3 = 4.33

In this simple way we can quickly compute a missing rating and complete the recommendation!
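
As a formula (my notation, matching the weighted branch of doEstimatePreference() below):

\hat{r}_{u,j} = \frac{\sum_{i \in I(u)} c_{j,i}\,\left(\mathrm{dev}_{j,i} + r_{u,i}\right)}{\sum_{i \in I(u)} c_{j,i}}

where dev_{j,i} is the average rating difference between items j and i, and the weight c_{j,i} is the number of users who rated both. Step 5 above is exactly this: (2(0.5+2) + 1(3+5)) / (2+1) = 4.33.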

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender


@Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    DataModel model = getDataModel();
    Float actualPref = model.getPreferenceValue(userID, itemID);
    if (actualPref != null) {
      return actualPref;
    }
    return doEstimatePreference(userID, itemID);
  }
  
  private float doEstimatePreference(long userID, long itemID) throws TasteException {
    double count = 0.0;
    double totalPreference = 0.0;
    PreferenceArray prefs = getDataModel().getPreferencesFromUser(userID);
    RunningAverage[] averages = diffStorage.getDiffs(userID, itemID, prefs);
    int size = prefs.length();
    for (int i = 0; i < size; i++) {
      RunningAverage averageDiff = averages[i];
      if (averageDiff != null) {
        double averageDiffValue = averageDiff.getAverage();
        if (weighted) {
          double weight = averageDiff.getCount();
          if (stdDevWeighted) {
            double stdev = ((RunningAverageAndStdDev) averageDiff).getStandardDeviation();
            if (!Double.isNaN(stdev)) {
              weight /= 1.0 + stdev;
            }
            // If stdev is NaN, then it is because count is 1. Because we're weighting by count,
            // the weight is already relatively low. We effectively assume stdev is 0.0 here and
            // that is reasonable enough. Otherwise, dividing by NaN would yield a weight of NaN
            // and disqualify this pref entirely
            // (Thanks Daemmon)
          }
          totalPreference += weight * (prefs.getValue(i) + averageDiffValue);
          count += weight;
        } else {
          totalPreference += prefs.getValue(i) + averageDiffValue;
          count += 1.0;
        }
      }
    }
    if (count <= 0.0) {
      RunningAverage itemAverage = diffStorage.getAverageItemPref(itemID);
      return itemAverage == null ? Float.NaN : (float) itemAverage.getAverage();
    } else {
      return (float) (totalPreference / count);
    }
  }

The test program:


    public static void slopeOne(DataModel dataModel) throws TasteException {
        RecommenderBuilder recommenderBuilder = RecommendFactory.slopeOneRecommender();

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);

        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }

Program output:


AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.3333333333333333
Recommender IR Evaluator: [Precision:0.25,Recall:0.5]
uid:1,(105,5.750000)(104,5.250000)(106,4.500000)
uid:2,(105,2.286115)(106,1.500000)
uid:3,(106,2.000000)(102,1.666667)(103,1.625000)
uid:4,(105,4.976859)(102,3.509071)

8. The KNN Linear Interpolation Item-Based Recommender

This algorithm was marked @Deprecated in mahout-0.8.

The algorithm comes from a paper:
This algorithm is based in the paper of Robert M. Bell and Yehuda Koren in ICDM '07.

(TODO: unfinished)

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.knn.KnnItemBasedRecommender


@Override
  protected float doEstimatePreference(long theUserID, PreferenceArray preferencesFromUser, long itemID)
    throws TasteException {
    
    DataModel dataModel = getDataModel();
    int size = preferencesFromUser.length();
    FastIDSet possibleItemIDs = new FastIDSet(size);
    for (int i = 0; i < size; i++) {
      possibleItemIDs.add(preferencesFromUser.getItemID(i));
    }
    possibleItemIDs.remove(itemID);
    
    List mostSimilar = mostSimilarItems(itemID, possibleItemIDs.iterator(),
      neighborhoodSize, null);
    long[] theNeighborhood = new long[mostSimilar.size() + 1];
    theNeighborhood[0] = -1;
  
    List usersRatedNeighborhood = Lists.newArrayList();
    int nOffset = 0;
    for (RecommendedItem rec : mostSimilar) {
      theNeighborhood[nOffset++] = rec.getItemID();
    }
    
    if (!mostSimilar.isEmpty()) {
      theNeighborhood[mostSimilar.size()] = itemID;
      for (int i = 0; i < theNeighborhood.length; i++) {
        PreferenceArray usersNeighborhood = dataModel.getPreferencesForItem(theNeighborhood[i]);
        int size1 = usersRatedNeighborhood.isEmpty() ? usersNeighborhood.length() : usersRatedNeighborhood.size();
        for (int j = 0; j < size1; j++) {
          if (i == 0) {
            usersRatedNeighborhood.add(usersNeighborhood.getUserID(j));
          } else {
            if (j >= usersRatedNeighborhood.size()) {
              break;
            }
            long index = usersRatedNeighborhood.get(j);
            if (!usersNeighborhood.hasPrefWithUserID(index) || index == theUserID) {
              usersRatedNeighborhood.remove(index);
              j--;
            }
          }
        }
      }
    }

    double[] weights = null;
    if (!mostSimilar.isEmpty()) {
      weights = getInterpolations(itemID, theNeighborhood, usersRatedNeighborhood);
    }
    
    int i = 0;
    double preference = 0.0;
    double totalSimilarity = 0.0;
    for (long jitem : theNeighborhood) {
      
      Float pref = dataModel.getPreferenceValue(theUserID, jitem);
      
      if (pref != null) {
        double weight = weights[i];
        preference += pref * weight;
        totalSimilarity += weight;
      }
      i++;
      
    }
    return totalSimilarity == 0.0 ? Float.NaN : (float) (preference / totalSimilarity);
  }
  
}

The test program:


    public static void itemKNN(DataModel dataModel) throws TasteException {
        ItemSimilarity itemSimilarity = RecommendFactory.itemSimilarity(RecommendFactory.SIMILARITY.EUCLIDEAN, dataModel);
        RecommenderBuilder recommenderBuilder = RecommendFactory.itemKNNRecommender(itemSimilarity, new NonNegativeQuadraticOptimizer(), 10);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);

        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }

Program output:


AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:1.5
Recommender IR Evaluator: [Precision:0.5,Recall:1.0]
uid:1,(107,5.000000)(104,3.501168)(106,3.498198)
uid:2,(105,2.878995)(106,2.878086)(107,2.000000)
uid:3,(103,3.667444)(102,3.667161)(106,3.667019)
uid:4,(107,4.750247)(102,4.122755)(105,4.122709)
uid:5,(107,3.833621)

9. The SVD Recommender

(TODO: unfinished)

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender


@Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    double[] userFeatures = factorization.getUserFeatures(userID);
    double[] itemFeatures = factorization.getItemFeatures(itemID);
    double estimate = 0;
    for (int feature = 0; feature < userFeatures.length; feature++) {
      estimate += userFeatures[feature] * itemFeatures[feature];
    }
    return (float) estimate;
  }
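
In other words (my notation), the factorization gives each user u a latent feature vector p_u and each item i a vector q_i, and the estimate is simply their dot product:

\hat{r}_{u,i} = p_u \cdot q_i = \sum_{f=1}^{k} p_{u,f}\, q_{i,f}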

The test program:


    public static void svd(DataModel dataModel) throws TasteException {
        RecommenderBuilder recommenderBuilder = RecommendFactory.svdRecommender(new ALSWRFactorizer(dataModel, 10, 0.05, 10));

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);

        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }

Program output:


AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:0.09990564982096355
Recommender IR Evaluator: [Precision:0.5,Recall:1.0]
uid:1,(104,4.032909)(105,3.390885)(107,1.858541)
uid:2,(105,3.761718)(106,2.951908)(107,1.561116)
uid:3,(103,5.593422)(102,2.458930)(106,-0.091259)
uid:4,(105,4.068329)(102,3.534025)(107,0.206257)
uid:5,(107,0.105169)

10. Tree Cluster-based recommendation algorithm

In the mahout-0.8 release, this algorithm has already been marked @Deprecated.

(TODO: to be completed)

Algorithm API: org.apache.mahout.cf.taste.impl.recommender.TreeClusteringRecommender


  @Override
  public float estimatePreference(long userID, long itemID) throws TasteException {
    DataModel model = getDataModel();
    Float actualPref = model.getPreferenceValue(userID, itemID);
    if (actualPref != null) {
      return actualPref;
    }
    buildClusters();
    List<RecommendedItem> topRecsForUser = topRecsByUserID.get(userID);
    if (topRecsForUser != null) {
      for (RecommendedItem item : topRecsForUser) {
        if (itemID == item.getItemID()) {
          return item.getValue();
        }
      }
    }
    // Hmm, we have no idea. The item is not in the user's cluster
    return Float.NaN;
  }

Test program:


    public static void treeCluster(DataModel dataModel) throws TasteException {
        UserSimilarity userSimilarity = RecommendFactory.userSimilarity(RecommendFactory.SIMILARITY.LOGLIKELIHOOD, dataModel);
        ClusterSimilarity clusterSimilarity = RecommendFactory.clusterSimilarity(RecommendFactory.SIMILARITY.FARTHEST_NEIGHBOR_CLUSTER, userSimilarity);
        RecommenderBuilder recommenderBuilder = RecommendFactory.treeClusterRecommender(clusterSimilarity, 10);

        RecommendFactory.evaluate(RecommendFactory.EVALUATOR.AVERAGE_ABSOLUTE_DIFFERENCE, recommenderBuilder, null, dataModel, 0.7);
        RecommendFactory.statsEvaluator(recommenderBuilder, null, dataModel, 2);

        LongPrimitiveIterator iter = dataModel.getUserIDs();
        while (iter.hasNext()) {
            long uid = iter.nextLong();
            List<RecommendedItem> list = recommenderBuilder.buildRecommender(dataModel).recommend(uid, RECOMMENDER_NUM);
            RecommendFactory.showItems(uid, list, true);
        }
    }

Program output:


AVERAGE_ABSOLUTE_DIFFERENCE Evaluater Score:NaN
Recommender IR Evaluator: [Precision:NaN,Recall:0.0]

11. Summary of Mahout recommendation algorithms

Algorithms and applicable scenarios:

recommender-intro

Evaluation scores of the algorithms:

recommender-score

Comparing the evaluation scores of the algorithms above: itemCF, itemKNN, and SVD achieve the best Precision and Recall values, and itemCF and SVD also have the lowest AVERAGE_ABSOLUTE_DIFFERENCE. So, from the algorithmic point of view, we can tell which algorithms are more accurate or cover more of the dataset.

Some additional considerations:

  • 1. These 3 metrics alone do not prove that itemCF and SVD will always produce better results
  • 2. We did not tune the parameters of any of the algorithms
  • 3. Data volume and data distribution also affect the evaluation scores

Source code download

https://github.com/bsspirit/maven_mahout_template/tree/mahout-0.8/src/main/java/org/conan/mymahout/recommendation/job

Please credit the source when reposting:
http://blog.fens.me/mahout-recommendation-api


Building a Movie Recommender System with Hadoop

This post is part of the Hadoop family series, which introduces the Hadoop family of products. The commonly used projects include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeeper, Avro, Ambari, and Chukwa; newer additions include YARN, Hcatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, Hue, and others.

Since 2011, China has entered the era of big data in full swing, with the Hadoop-led family of software occupying the vast territory of big-data processing. Open-source communities and vendors, with all their data software, have been aligning with Hadoop, which has grown from a niche, elite technology into the standard for big-data development. On top of the original Hadoop technology, the Hadoop family of products keeps innovating around the concept of "big data", driving technical progress.

As developers in the IT industry, we should keep up with the pace, seize the opportunity, and rise together with Hadoop!

About the author:

  • Zhang Dan (Conan), programmer: Java, R, PHP, Javascript
  • weibo:@Conan_Z
  • blog: http://blog.fens.me
  • email: bsspirit@gmail.com

Please credit the source when reposting:
http://blog.fens.me/hadoop-mapreduce-recommend/

hadoop-recommand

Preface

The Netflix Prize, a million-dollar movie recommendation contest, turned "recommendation" into one of the hottest data mining algorithms of the day. The contest also brought deeper technical exchange between industry and academia, triggered a "recommendation" craze on all kinds of websites, and ushered in the era of personalization.

Table of Contents

  1. Overview of recommender systems
  2. Requirements analysis: metric design for the recommender system
  3. Algorithm model: parallel algorithm on Hadoop
  4. Architecture design: recommender system architecture
  5. Development: MapReduce implementation
  6. Supplement: optimizing the Step4 process

1. Overview of recommender systems

E-commerce websites are one of the most important application domains for personalized recommender systems. Amazon is an active adopter and promoter: its recommender system reaches every category of goods on the site and is said to bring Amazon at least 30% of its sales.

And it is not just e-commerce; recommender systems are everywhere: friend recommendations on QQ and Renren; "people you may be interested in" on Sina Weibo; movie recommendations on Youku and Tudou; book recommendations on Douban; restaurant recommendations on Dianping; dating recommendations on Jiayuan; and job recommendations on Tianji.

Categories of recommendation algorithms:

By the data they use:

  • Collaborative filtering: UserCF, ItemCF, ModelCF
  • Content-based recommendation: user content attributes and item content attributes
  • Social filtering: based on the user's social network

By model:

  • Nearest-neighbor models: distance-based collaborative filtering
  • Latent Factor Model (SVD): models based on matrix factorization
  • Graph: graph models, social network graph models

User-based collaborative filtering: UserCF

User-based collaborative filtering measures the similarity between users from their ratings of items and makes recommendations based on that user-to-user similarity. Put simply: recommend to a user the items that other users with similar interests like.

Use case illustration:

image015

For the implementation and usage of the algorithm, see the article: Mahout Recommendation Algorithm API Explained

Item-based collaborative filtering: ItemCF

Item-based collaborative filtering measures the similarity between items from the ratings that users give them and makes recommendations based on that item-to-item similarity. Put simply: recommend to a user items that are similar to the items he liked before.

Use case illustration:

image017

For the implementation and usage of the algorithm, see the article: Mahout Recommendation Algorithm API Explained

Note: item-based collaborative filtering is currently the most widely used recommendation algorithm in commercial systems.

A collaborative filtering implementation consists of 2 steps:

  • 1. Compute the similarity between items
  • 2. Generate a recommendation list for each user from the item similarities and the user's historical behavior

For another article on collaborative filtering, see: RHadoop Practice Series Part 3: Implementing the Collaborative Filtering Algorithm with MapReduce in R

2. Requirements analysis: metric design for the recommender system

Below we start from a company case study to explain comprehensively how to design the metrics of a recommender system.

Case introduction

The Netflix Prize movie recommendation contest: http://www.netflixprize.com/
Netflix official website: www.netflix.com

When Netflix organized the contest in 2006, it was a company living on online movie rentals. It judged which movies users were likely to enjoy from their ratings, combined that with the movies members had watched and their stated taste preferences, and mixed and matched demand across all kinds of movie styles.

After collecting members' information and serving them personalized movie recommendations, many obscure movies made it onto the rental candidate lists. From the standpoint of the company's content costs, popular movies are generally expensive; if Netflix could increase the share of less popular movies in its rentals, it would naturally improve its profitability.

Netflix once claimed that about 60% of its members arrange their rental queues according to the recommendation list. If the recommender cannot accurately guess the kinds of movies a member likes, members who repeatedly rent obscure movies that do not match their personal taste are easily lost. To recommend movies to members more effectively, Netflix kept improving and refining its personalized recommendation service, and in 2006 it announced a million-dollar prize: whoever could best improve the Netflix recommendation algorithm would win 1,000,000 USD. By 2009 the prize had been won by a 7-person team, and Netflix immediately launched a second million-dollar challenge. This fully demonstrates how important a good recommendation algorithm system is, and at the same time how difficult.

netflix_prize

The image above shows the ranking of the teams in the contest.

Additional notes:

  • 1. The Netflix contest was based on static data: given a "training set", match a "result set" that was also prepared in advance. This is actually quite different from the systems we operate every day.
  • 2. The dataset Netflix used for the contest was small, only 666MB in total, whereas real recommender systems are built on large amounts of historical data, easily reaching GB or TB scale.

Netflix data downloads
Partial training set: http://graphlab.org/wp-content/uploads/2013/07/smallnetflix_mm.train_.gz
Partial result set: http://graphlab.org/wp-content/uploads/2013/07/smallnetflix_mm.validate.gz
Full dataset: http://www.lifecrunch.biz/wp-content/uploads/2011/04/nf_prize_dataset.tar.gz

So, when we design recommendations for a real environment, we must weigh metrics such as data volume, algorithm performance, and result accuracy across the board.

  • Algorithm selection: item-based collaborative filtering (ItemCF), implemented in parallel
  • Data volume: based on the Hadoop architecture, supporting GB, TB, and PB scale data
  • Algorithm validation: judged by metrics such as precision, recall, coverage, and popularity (see the definitions right after this list)
  • Result interpretation: explain the results reasonably according to the definition of ItemCF
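
For reference, the precision and recall of a recommendation list are commonly defined as follows, where R(u) is the set of items recommended to user u and T(u) is the set of items the user actually preferred in the test set (these standard definitions are my addition, not part of the original case):


Precision = Σ|R(u) ∩ T(u)| / Σ|R(u)|
Recall    = Σ|R(u) ∩ T(u)| / Σ|T(u)|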

3. Algorithm model: parallel algorithm on Hadoop

Here I implement the distributed item-based collaborative filtering algorithm described in the book "Mahout In Action", Chapter 6: Distributing recommendation computations.

Test dataset: small.csv


1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.0
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

Each line has 3 fields: user ID, movie ID, and the user's rating of the movie (0-5 points, in increments of 0.5).

The idea of the algorithm:

  • 1. Build the item co-occurrence matrix
  • 2. Build the user-to-item rating matrix
  • 3. Compute the recommendations by matrix multiplication

1). Build the item co-occurrence matrix
Group the ratings by user, find the items each user rated, and count each single item as well as each pair of items.


      [101] [102] [103] [104] [105] [106] [107]
[101]   5     3     4     4     2     2     1
[102]   3     3     3     2     1     1     0
[103]   4     3     4     3     1     2     0
[104]   4     2     3     4     2     2     1
[105]   2     1     1     2     2     1     1
[106]   2     1     2     2     1     2     0
[107]   1     0     0     1     1     0     1
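
To make the counting rule concrete, here is a minimal single-machine sketch that rebuilds this co-occurrence matrix from the small.csv user-item pairs (my own illustration; the MapReduce version follows in section 5):


import java.util.*;

public class CooccurrenceExample {
    public static void main(String[] args) {
        // user -> items rated by that user, taken from small.csv
        int[][] pairs = {{1,101},{1,102},{1,103},{2,101},{2,102},{2,103},{2,104},
                         {3,101},{3,104},{3,105},{3,107},{4,101},{4,103},{4,104},{4,106},
                         {5,101},{5,102},{5,103},{5,104},{5,105},{5,106}};
        Map<Integer, List<Integer>> userItems = new TreeMap<Integer, List<Integer>>();
        for (int[] p : pairs) {
            if (!userItems.containsKey(p[0])) userItems.put(p[0], new ArrayList<Integer>());
            userItems.get(p[0]).add(p[1]);
        }
        // count co-occurrences: every (i, j) pair of items rated by the same user
        Map<String, Integer> cooccurrence = new TreeMap<String, Integer>();
        for (List<Integer> items : userItems.values()) {
            for (int i : items) {
                for (int j : items) { // i == j counts the diagonal (the item with itself)
                    String key = i + ":" + j;
                    Integer c = cooccurrence.get(key);
                    cooccurrence.put(key, c == null ? 1 : c + 1);
                }
            }
        }
        System.out.println(cooccurrence.get("101:103")); // 4, as in the matrix above
    }
}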

2). Build the user-to-item rating matrix
Group the ratings by user and collect each user's items and ratings.


       U3
[101] 2.0
[102] 0.0
[103] 0.0
[104] 4.0
[105] 4.5
[106] 0.0
[107] 5.0

3). Compute the recommendations from the matrices
co-occurrence matrix × rating matrix = recommendation results
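
As a concrete check, multiply row [107] of the co-occurrence matrix by user U3's rating vector above (my own verification; it matches the final Step4 output later in this article):


R(U3,107) = 1*2.0 + 0*0.0 + 0*0.0 + 1*4.0 + 1*4.5 + 0*0.0 + 1*5.0 = 15.5


which is exactly the line "3  107,15.5" in the final recommendation results.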

alogrithm_1

Image from "Mahout In Action"

MapReduce job design

aglorithm_2

Image from "Mahout In Action"

Interpreting the MapReduce jobs:

  • Step 1: group by user and compute the combination lists of all items, producing the user-to-item rating matrix
  • Step 2: count the item combinations, producing the item co-occurrence matrix
  • Step 3: merge the co-occurrence matrix and the rating matrix
  • Step 4: compute the recommendation lists

4. Architecture design: recommender system architecture

hadoop-recommand-architect

In the figure above, the left side is the Application business system; the right side is Hadoop's HDFS and MapReduce.

  1. The business system records users' behavior and their ratings of items
  2. A system CRON timer incrementally imports the data (userid, itemid, value, time) into HDFS every xx hours.
  3. After the import finishes, a timer starts the MapReduce program that runs the recommendation algorithm.
  4. After the computation finishes, a timer exports the recommendation results from HDFS to the database so they can be queried quickly later.

5. Development: MapReduce implementation

For the win7 development environment and the Hadoop runtime environment, see the article: Building a Hadoop Project with Maven

New Java classes:

  • Recommend.java, the main job launcher
  • Step1.java, group by user and compute the item combination lists, producing the user-to-item rating matrix
  • Step2.java, count the item combinations, producing the item co-occurrence matrix
  • Step3.java, merge the co-occurrence matrix and the rating matrix
  • Step4.java, compute the recommendation lists
  • HdfsDAO.java, a utility class for HDFS operations

1). Recommend.java, the main job launcher
Source code:


package org.conan.myhadoop.recommend;

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.hadoop.mapred.JobConf;

public class Recommend {

    public static final String HDFS = "hdfs://192.168.1.210:9000";
    public static final Pattern DELIMITER = Pattern.compile("[\t,]");

    public static void main(String[] args) throws Exception {
        Map<String, String> path = new HashMap<String, String>();
        path.put("data", "logfile/small.csv");
        path.put("Step1Input", HDFS + "/user/hdfs/recommend");
        path.put("Step1Output", path.get("Step1Input") + "/step1");
        path.put("Step2Input", path.get("Step1Output"));
        path.put("Step2Output", path.get("Step1Input") + "/step2");
        path.put("Step3Input1", path.get("Step1Output"));
        path.put("Step3Output1", path.get("Step1Input") + "/step3_1");
        path.put("Step3Input2", path.get("Step2Output"));
        path.put("Step3Output2", path.get("Step1Input") + "/step3_2");
        path.put("Step4Input1", path.get("Step3Output1"));
        path.put("Step4Input2", path.get("Step3Output2"));
        path.put("Step4Output", path.get("Step1Input") + "/step4");

        Step1.run(path);
        Step2.run(path);
        Step3.run1(path);
        Step3.run2(path);
        Step4.run(path);
        System.exit(0);
    }

    public static JobConf config() {
        JobConf conf = new JobConf(Recommend.class);
        conf.setJobName("Recommend");
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");
        return conf;
    }

}

2). Step1.java: group by user and compute the combination lists of all items, producing the user-to-item rating matrix

Source code:


package org.conan.myhadoop.recommend;

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.conan.myhadoop.hdfs.HdfsDAO;

public class Step1 {

    public static class Step1_ToItemPreMapper extends MapReduceBase implements Mapper<Object, Text, IntWritable, Text> {
        private final static IntWritable k = new IntWritable();
        private final static Text v = new Text();

        @Override
        public void map(Object key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            String[] tokens = Recommend.DELIMITER.split(value.toString());
            int userID = Integer.parseInt(tokens[0]);
            String itemID = tokens[1];
            String pref = tokens[2];
            k.set(userID);
            v.set(itemID + ":" + pref);
            output.collect(k, v);
        }
    }

    public static class Step1_ToUserVectorReducer extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {
        private final static Text v = new Text();

        @Override
        public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            StringBuilder sb = new StringBuilder();
            while (values.hasNext()) {
                sb.append("," + values.next());
            }
            v.set(sb.toString().replaceFirst(",", ""));
            output.collect(key, v);
        }
    }

    public static void run(Map<String, String> path) throws IOException {
        JobConf conf = Recommend.config();

        String input = path.get("Step1Input");
        String output = path.get("Step1Output");

        HdfsDAO hdfs = new HdfsDAO(Recommend.HDFS, conf);
        hdfs.rmr(input);
        hdfs.mkdirs(input);
        hdfs.copyFile(path.get("data"), input);

        conf.setMapOutputKeyClass(IntWritable.class);
        conf.setMapOutputValueClass(Text.class);

        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(Step1_ToItemPreMapper.class);
        conf.setCombinerClass(Step1_ToUserVectorReducer.class);
        conf.setReducerClass(Step1_ToUserVectorReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        RunningJob job = JobClient.runJob(conf);
        while (!job.isComplete()) {
            job.waitForCompletion();
        }
    }

}

Computed result:


~ hadoop fs -cat /user/hdfs/recommend/step1/part-00000

1       102:3.0,103:2.5,101:5.0
2       101:2.0,102:2.5,103:5.0,104:2.0
3       107:5.0,101:2.0,104:4.0,105:4.5
4       101:5.0,103:3.0,104:4.5,106:4.0
5       101:4.0,102:3.0,103:2.0,104:4.0,105:3.5,106:4.0

3). Step2.java: count the item combinations, producing the item co-occurrence matrix
Source code:


package org.conan.myhadoop.recommend;

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.conan.myhadoop.hdfs.HdfsDAO;

public class Step2 {
    public static class Step2_UserVectorToCooccurrenceMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static Text k = new Text();
        private final static IntWritable v = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String[] tokens = Recommend.DELIMITER.split(values.toString());
            for (int i = 1; i < tokens.length; i++) {
                String itemID = tokens[i].split(":")[0];
                for (int j = 1; j < tokens.length; j++) {
                    String itemID2 = tokens[j].split(":")[0];
                    k.set(itemID + ":" + itemID2);
                    output.collect(k, v);
                }
            }
        }
    }

    public static class Step2_UserVectorToConoccurrenceReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            result.set(sum);
            output.collect(key, result);
        }
    }

    public static void run(Map<String, String> path) throws IOException {
        JobConf conf = Recommend.config();

        String input = path.get("Step2Input");
        String output = path.get("Step2Output");

        HdfsDAO hdfs = new HdfsDAO(Recommend.HDFS, conf);
        hdfs.rmr(output);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Step2_UserVectorToCooccurrenceMapper.class);
        conf.setCombinerClass(Step2_UserVectorToConoccurrenceReducer.class);
        conf.setReducerClass(Step2_UserVectorToConoccurrenceReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        RunningJob job = JobClient.runJob(conf);
        while (!job.isComplete()) {
            job.waitForCompletion();
        }
    }
}

Computed result:


~ hadoop fs -cat /user/hdfs/recommend/step2/part-00000

101:101 5
101:102 3
101:103 4
101:104 4
101:105 2
101:106 2
101:107 1
102:101 3
102:102 3
102:103 3
102:104 2
102:105 1
102:106 1
103:101 4
103:102 3
103:103 4
103:104 3
103:105 1
103:106 2
104:101 4
104:102 2
104:103 3
104:104 4
104:105 2
104:106 2
104:107 1
105:101 2
105:102 1
105:103 1
105:104 2
105:105 2
105:106 1
105:107 1
106:101 2
106:102 1
106:103 2
106:104 2
106:105 1
106:106 2
107:101 1
107:104 1
107:105 1
107:107 1

4). Step3.java: merge the co-occurrence matrix and the rating matrix
Source code:


package org.conan.myhadoop.recommend;

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.conan.myhadoop.hdfs.HdfsDAO;

public class Step3 {

    public static class Step31_UserVectorSplitterMapper extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
        private final static IntWritable k = new IntWritable();
        private final static Text v = new Text();

        @Override
        public void map(LongWritable key, Text values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            String[] tokens = Recommend.DELIMITER.split(values.toString());
            for (int i = 1; i < tokens.length; i++) {
                String[] vector = tokens[i].split(":");
                int itemID = Integer.parseInt(vector[0]);
                String pref = vector[1];

                k.set(itemID);
                v.set(tokens[0] + ":" + pref);
                output.collect(k, v);
            }
        }
    }

    public static void run1(Map<String, String> path) throws IOException {
        JobConf conf = Recommend.config();

        String input = path.get("Step3Input1");
        String output = path.get("Step3Output1");

        HdfsDAO hdfs = new HdfsDAO(Recommend.HDFS, conf);
        hdfs.rmr(output);

        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(Step31_UserVectorSplitterMapper.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        RunningJob job = JobClient.runJob(conf);
        while (!job.isComplete()) {
            job.waitForCompletion();
        }
    }

    public static class Step32_CooccurrenceColumnWrapperMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static Text k = new Text();
        private final static IntWritable v = new IntWritable();

        @Override
        public void map(LongWritable key, Text values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String[] tokens = Recommend.DELIMITER.split(values.toString());
            k.set(tokens[0]);
            v.set(Integer.parseInt(tokens[1]));
            output.collect(k, v);
        }
    }

    public static void run2(Map<String, String> path) throws IOException {
        JobConf conf = Recommend.config();

        String input = path.get("Step3Input2");
        String output = path.get("Step3Output2");

        HdfsDAO hdfs = new HdfsDAO(Recommend.HDFS, conf);
        hdfs.rmr(output);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Step32_CooccurrenceColumnWrapperMapper.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        RunningJob job = JobClient.runJob(conf);
        while (!job.isComplete()) {
            job.waitForCompletion();
        }
    }

}

Computed result:


~ hadoop fs -cat /user/hdfs/recommend/step3_1/part-00000

101     5:4.0
101     1:5.0
101     2:2.0
101     3:2.0
101     4:5.0
102     1:3.0
102     5:3.0
102     2:2.5
103     2:5.0
103     5:2.0
103     1:2.5
103     4:3.0
104     2:2.0
104     5:4.0
104     3:4.0
104     4:4.5
105     3:4.5
105     5:3.5
106     5:4.0
106     4:4.0
107     3:5.0

~ hadoop fs -cat /user/hdfs/recommend/step3_2/part-00000

101:101 5
101:102 3
101:103 4
101:104 4
101:105 2
101:106 2
101:107 1
102:101 3
102:102 3
102:103 3
102:104 2
102:105 1
102:106 1
103:101 4
103:102 3
103:103 4
103:104 3
103:105 1
103:106 2
104:101 4
104:102 2
104:103 3
104:104 4
104:105 2
104:106 2
104:107 1
105:101 2
105:102 1
105:103 1
105:104 2
105:105 2
105:106 1
105:107 1
106:101 2
106:102 1
106:103 2
106:104 2
106:105 1
106:106 2
107:101 1
107:104 1
107:105 1
107:107 1

5). Step4.java: compute the recommendation lists
Source code:


package org.conan.myhadoop.recommend;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.conan.myhadoop.hdfs.HdfsDAO;

public class Step4 {

    public static class Step4_PartialMultiplyMapper extends MapReduceBase implements Mapper<LongWritable, Text, IntWritable, Text> {
        private final static IntWritable k = new IntWritable();
        private final static Text v = new Text();

        private final static Map<Integer, List<Cooccurrence>> cooccurrenceMatrix = new HashMap<Integer, List<Cooccurrence>>();

        @Override
        public void map(LongWritable key, Text values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            String[] tokens = Recommend.DELIMITER.split(values.toString());

            String[] v1 = tokens[0].split(":");
            String[] v2 = tokens[1].split(":");

            if (v1.length > 1) {// cooccurrence
                int itemID1 = Integer.parseInt(v1[0]);
                int itemID2 = Integer.parseInt(v1[1]);
                int num = Integer.parseInt(tokens[1]);

                List<Cooccurrence> list = null;
                if (!cooccurrenceMatrix.containsKey(itemID1)) {
                    list = new ArrayList<Cooccurrence>();
                } else {
                    list = cooccurrenceMatrix.get(itemID1);
                }
                list.add(new Cooccurrence(itemID1, itemID2, num));
                cooccurrenceMatrix.put(itemID1, list);
            }

            if (v2.length > 1) {// userVector
                int itemID = Integer.parseInt(tokens[0]);
                int userID = Integer.parseInt(v2[0]);
                double pref = Double.parseDouble(v2[1]);
                k.set(userID);
                for (Cooccurrence co : cooccurrenceMatrix.get(itemID)) {
                    v.set(co.getItemID2() + "," + pref * co.getNum());
                    output.collect(k, v);
                }

            }
        }
    }

    public static class Step4_AggregateAndRecommendReducer extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {
        private final static Text v = new Text();

        @Override
        public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            Map<String, Double> result = new HashMap<String, Double>();
            while (values.hasNext()) {
                String[] str = values.next().toString().split(",");
                if (result.containsKey(str[0])) {
                    result.put(str[0], result.get(str[0]) + Double.parseDouble(str[1]));
                } else {
                    result.put(str[0], Double.parseDouble(str[1]));
                }
            }
            Iterator<String> iter = result.keySet().iterator();
            while (iter.hasNext()) {
                String itemID = iter.next();
                double score = result.get(itemID);
                v.set(itemID + "," + score);
                output.collect(key, v);
            }
        }
    }

    public static void run(Map<String, String> path) throws IOException {
        JobConf conf = Recommend.config();

        String input1 = path.get("Step4Input1");
        String input2 = path.get("Step4Input2");
        String output = path.get("Step4Output");

        HdfsDAO hdfs = new HdfsDAO(Recommend.HDFS, conf);
        hdfs.rmr(output);

        conf.setOutputKeyClass(IntWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setMapperClass(Step4_PartialMultiplyMapper.class);
        conf.setCombinerClass(Step4_AggregateAndRecommendReducer.class);
        conf.setReducerClass(Step4_AggregateAndRecommendReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(input1), new Path(input2));
        FileOutputFormat.setOutputPath(conf, new Path(output));

        RunningJob job = JobClient.runJob(conf);
        while (!job.isComplete()) {
            job.waitForCompletion();
        }
    }

}

class Cooccurrence {
    private int itemID1;
    private int itemID2;
    private int num;

    public Cooccurrence(int itemID1, int itemID2, int num) {
        super();
        this.itemID1 = itemID1;
        this.itemID2 = itemID2;
        this.num = num;
    }

    public int getItemID1() {
        return itemID1;
    }

    public void setItemID1(int itemID1) {
        this.itemID1 = itemID1;
    }

    public int getItemID2() {
        return itemID2;
    }

    public void setItemID2(int itemID2) {
        this.itemID2 = itemID2;
    }

    public int getNum() {
        return num;
    }

    public void setNum(int num) {
        this.num = num;
    }

}

Computed result:


~ hadoop fs -cat /user/hdfs/recommend/step4/part-00000

1       107,5.0
1       106,18.0
1       105,15.5
1       104,33.5
1       103,39.0
1       102,31.5
1       101,44.0
2       107,4.0
2       106,20.5
2       105,15.5
2       104,36.0
2       103,41.5
2       102,32.5
2       101,45.5
3       107,15.5
3       106,16.5
3       105,26.0
3       104,38.0
3       103,24.5
3       102,18.5
3       101,40.0
4       107,9.5
4       106,33.0
4       105,26.0
4       104,55.0
4       103,53.5
4       102,37.0
4       101,63.0
5       107,11.5
5       106,34.5
5       105,32.0
5       104,59.0
5       103,56.5
5       102,42.5
5       101,68.0

For an optimization of the Step4 process, see the supplementary section at the end of this article.

6). HdfsDAO.java, a utility class for HDFS operations
For a detailed explanation, see the article: Accessing HDFS Programmatically in Hadoop

Source code:


package org.conan.myhadoop.hdfs;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapred.JobConf;

public class HdfsDAO {

    private static final String HDFS = "hdfs://192.168.1.210:9000/";

    public HdfsDAO(Configuration conf) {
        this(HDFS, conf);
    }

    public HdfsDAO(String hdfs, Configuration conf) {
        this.hdfsPath = hdfs;
        this.conf = conf;
    }

    private String hdfsPath;
    private Configuration conf;

    public static void main(String[] args) throws IOException {
        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(conf);
        hdfs.copyFile("datafile/item.csv", "/tmp/new");
        hdfs.ls("/tmp/new");
    }        

    public static JobConf config(){
        JobConf conf = new JobConf(HdfsDAO.class);
        conf.setJobName("HdfsDAO");
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");
        return conf;
    }

    public void mkdirs(String folder) throws IOException {
        Path path = new Path(folder);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        if (!fs.exists(path)) {
            fs.mkdirs(path);
            System.out.println("Create: " + folder);
        }
        fs.close();
    }

    public void rmr(String folder) throws IOException {
        Path path = new Path(folder);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        fs.deleteOnExit(path);
        System.out.println("Delete: " + folder);
        fs.close();
    }

    public void ls(String folder) throws IOException {
        Path path = new Path(folder);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        FileStatus[] list = fs.listStatus(path);
        System.out.println("ls: " + folder);
        System.out.println("==========================================================");
        for (FileStatus f : list) {
            System.out.printf("name: %s, folder: %s, size: %d\n", f.getPath(), f.isDir(), f.getLen());
        }
        System.out.println("==========================================================");
        fs.close();
    }

    public void createFile(String file, String content) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        byte[] buff = content.getBytes();
        FSDataOutputStream os = null;
        try {
            os = fs.create(new Path(file));
            os.write(buff, 0, buff.length);
            System.out.println("Create: " + file);
        } finally {
            if (os != null)
                os.close();
        }
        fs.close();
    }

    public void copyFile(String local, String remote) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        fs.copyFromLocalFile(new Path(local), new Path(remote));
        System.out.println("copy from: " + local + " to " + remote);
        fs.close();
    }

    public void download(String remote, String local) throws IOException {
        Path path = new Path(remote);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        fs.copyToLocalFile(path, new Path(local));
        System.out.println("download: from" + remote + " to " + local);
        fs.close();
    }

    public void cat(String remoteFile) throws IOException {
        Path path = new Path(remoteFile);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        FSDataInputStream fsdis = null;
        System.out.println("cat: " + remoteFile);
        try {  
            fsdis =fs.open(path);
            IOUtils.copyBytes(fsdis, System.out, 4096, false);  
          } finally {  
            IOUtils.closeStream(fsdis);
            fs.close();
          }
    }
}

With this, we have implemented the item-based collaborative filtering algorithm as MapReduce entirely by ourselves.

For the RHadoop implementation, see the article: RHadoop Practice Series Part 3: Implementing the Collaborative Filtering Algorithm with MapReduce in R

For the Mahout implementation, see the article: Mahout Distributed Program Development: Item-based Collaborative Filtering (ItemCF)

I have put the entire MapReduce implementation on github:
https://github.com/bsspirit/maven_hadoop_template/releases/tag/recommend

6. Supplement: optimizing the Step4 process

While Step4.java runs, the mapper class Step4_PartialMultiplyMapper reads the two input datasets separately and performs the computation in memory.

This approach has obvious limitations:

  • a. The two input datasets must be read in a strict order. Since Hadoop cannot enforce a read order, in a multi-node Hadoop cluster the order may go wrong and cause a NullPointerException in the program.
  • b. The computation is done in memory. If the matrices are too large, a single node will run out of memory.

As an optimization, we need to re-implement the Step4 process as MapReduce matrix multiplication. For the principle of the matrix algorithm, see the article: Implementing Matrix Multiplication with MapReduce

The optimized Step4 implements the matrix computation as two MapReduce passes (a worked trace follows the list below):

  • Matrix multiplication pass: Step4_Update.java
  • Matrix addition pass: Step4_Update2.java
  • Modified launcher: Recommend.java
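
As a worked trace of the multiply pass (my own illustration): for reducer key 107, the "A:"-tagged values carry the co-occurrence row entries 101,1 104,1 105,1 107,1 (from step3_2), and the single "B:"-tagged value carries 3,5.0 (user 3's rating of item 107, from step3_1). Multiplying every A entry with every B entry emits:


3       101,5.0
3       104,5.0
3       105,5.0
3       107,5.0


These partial products appear (in some order) in the step5 output below, and the addition pass (Step4_Update2.java) then sums them per user and item.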

New file: Step4_Update.java


package org.conan.myhadoop.recommend;

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.conan.myhadoop.hdfs.HdfsDAO;

public class Step4_Update {

    public static class Step4_PartialMultiplyMapper extends Mapper<LongWritable, Text, Text, Text> {

        private String flag;// "A" = co-occurrence matrix, "B" = rating matrix

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            FileSplit split = (FileSplit) context.getInputSplit();
            flag = split.getPath().getParent().getName();// identify which dataset is being read

            // System.out.println(flag);
        }

        @Override
        public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
            String[] tokens = Recommend.DELIMITER.split(values.toString());

            if (flag.equals("step3_2")) {// 同现矩阵
                String[] v1 = tokens[0].split(":");
                String itemID1 = v1[0];
                String itemID2 = v1[1];
                String num = tokens[1];

                Text k = new Text(itemID1);
                Text v = new Text("A:" + itemID2 + "," + num);

                context.write(k, v);
                // System.out.println(k.toString() + "  " + v.toString());

            } else if (flag.equals("step3_1")) {// 评分矩阵
                String[] v2 = tokens[1].split(":");
                String itemID = tokens[0];
                String userID = v2[0];
                String pref = v2[1];

                Text k = new Text(itemID);
                Text v = new Text("B:" + userID + "," + pref);

                context.write(k, v);
                // System.out.println(k.toString() + "  " + v.toString());
            }
        }

    }

    public static class Step4_AggregateReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            System.out.println(key.toString() + ":");

            Map<String, String> mapA = new HashMap<String, String>();
            Map<String, String> mapB = new HashMap<String, String>();

            for (Text line : values) {
                String val = line.toString();
                System.out.println(val);

                if (val.startsWith("A:")) {
                    String[] kv = Recommend.DELIMITER.split(val.substring(2));
                    mapA.put(kv[0], kv[1]);

                } else if (val.startsWith("B:")) {
                    String[] kv = Recommend.DELIMITER.split(val.substring(2));
                    mapB.put(kv[0], kv[1]);

                }
            }

            double result = 0;
            Iterator<String> iter = mapA.keySet().iterator();
            while (iter.hasNext()) {
                String mapk = iter.next();// itemID

                int num = Integer.parseInt(mapA.get(mapk));
                Iterator<String> iterb = mapB.keySet().iterator();
                while (iterb.hasNext()) {
                    String mapkb = iterb.next();// userID
                    double pref = Double.parseDouble(mapB.get(mapkb));
                    result = num * pref;// matrix multiplication: the multiply step

                    Text k = new Text(mapkb);
                    Text v = new Text(mapk + "," + result);
                    context.write(k, v);
                    System.out.println(k.toString() + "  " + v.toString());
                }
            }
        }
    }

    public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
        JobConf conf = Recommend.config();

        String input1 = path.get("Step5Input1");
        String input2 = path.get("Step5Input2");
        String output = path.get("Step5Output");

        HdfsDAO hdfs = new HdfsDAO(Recommend.HDFS, conf);
        hdfs.rmr(output);

        Job job = new Job(conf);
        job.setJarByClass(Step4_Update.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(Step4_Update.Step4_PartialMultiplyMapper.class);
        job.setReducerClass(Step4_Update.Step4_AggregateReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(input1), new Path(input2));
        FileOutputFormat.setOutputPath(job, new Path(output));

        job.waitForCompletion(true);
    }

}

New file: Step4_Update2.java


package org.conan.myhadoop.recommend;

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.conan.myhadoop.hdfs.HdfsDAO;

public class Step4_Update2 {

    public static class Step4_RecommendMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        public void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
            String[] tokens = Recommend.DELIMITER.split(values.toString());
            Text k = new Text(tokens[0]);
            Text v = new Text(tokens[1]+","+tokens[2]);
            context.write(k, v);
        }
    }

    public static class Step4_RecommendReducer extends Reducer<Text, Text, Text, Text> {
        
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            System.out.println(key.toString() + ":");
            Map<String, Double> map = new HashMap<String, Double>();// accumulated results
            
            for (Text line : values) {
                System.out.println(line.toString());
                String[] tokens = Recommend.DELIMITER.split(line.toString());
                String itemID = tokens[0];
                Double score = Double.parseDouble(tokens[1]);
                
                 if (map.containsKey(itemID)) {
                     map.put(itemID, map.get(itemID) + score);// matrix multiplication: the summation step
                 } else {
                     map.put(itemID, score);
                 }
            }
            
            Iterator<String> iter = map.keySet().iterator();
            while (iter.hasNext()) {
                String itemID = iter.next();
                double score = map.get(itemID);
                Text v = new Text(itemID + "," + score);
                context.write(key, v);
            }
        }
    }

    public static void run(Map<String, String> path) throws IOException, InterruptedException, ClassNotFoundException {
        JobConf conf = Recommend.config();

        String input = path.get("Step6Input");
        String output = path.get("Step6Output");

        HdfsDAO hdfs = new HdfsDAO(Recommend.HDFS, conf);
        hdfs.rmr(output);

        Job job = new Job(conf);
        job.setJarByClass(Step4_Update2.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(Step4_Update2.Step4_RecommendMapper.class);
        job.setReducerClass(Step4_Update2.Step4_RecommendReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        job.waitForCompletion(true);
    }

}

Modified Recommend.java:


package org.conan.myhadoop.recommend;

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.hadoop.mapred.JobConf;
import org.conan.myhadoop.hdfs.HdfsDAO;

public class Recommend {

    public static final String HDFS = "hdfs://192.168.1.210:9000";
    public static final Pattern DELIMITER = Pattern.compile("[\t,]");

    public static void main(String[] args) throws Exception {
        Map<String, String> path = new HashMap<String, String>();
        path.put("data", "logfile/small.csv");
        path.put("Step1Input", HDFS + "/user/hdfs/recommend");
        path.put("Step1Output", path.get("Step1Input") + "/step1");
        path.put("Step2Input", path.get("Step1Output"));
        path.put("Step2Output", path.get("Step1Input") + "/step2");
        path.put("Step3Input1", path.get("Step1Output"));
        path.put("Step3Output1", path.get("Step1Input") + "/step3_1");
        path.put("Step3Input2", path.get("Step2Output"));
        path.put("Step3Output2", path.get("Step1Input") + "/step3_2");
        
        path.put("Step4Input1", path.get("Step3Output1"));
        path.put("Step4Input2", path.get("Step3Output2"));
        path.put("Step4Output", path.get("Step1Input") + "/step4");
        
        path.put("Step5Input1", path.get("Step3Output1"));
        path.put("Step5Input2", path.get("Step3Output2"));
        path.put("Step5Output", path.get("Step1Input") + "/step5");
        
        path.put("Step6Input", path.get("Step5Output"));
        path.put("Step6Output", path.get("Step1Input") + "/step6");       

        Step1.run(path);
        Step2.run(path);
        Step3.run1(path);
        Step3.run2(path);
        //Step4.run(path);
        
        Step4_Update.run(path);
        Step4_Update2.run(path);
               
        System.exit(0);
    }

    public static JobConf config() {
        JobConf conf = new JobConf(Recommend.class);
        conf.setJobName("Recommand");
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");
        conf.set("io.sort.mb", "1024");
        return conf;
    }

}

Run Step4_Update.java and inspect the output:


~ hadoop fs -cat /user/hdfs/recommend/step5/part-r-00000

3       107,2.0
2       107,2.0
1       107,5.0
5       107,4.0
4       107,5.0
3       106,4.0
2       106,4.0
1       106,10.0
5       106,8.0
4       106,10.0
3       105,4.0
2       105,4.0
1       105,10.0
5       105,8.0
4       105,10.0
3       104,8.0
2       104,8.0
1       104,20.0
5       104,16.0
4       104,20.0
3       103,8.0
2       103,8.0
1       103,20.0
5       103,16.0
4       103,20.0
3       102,6.0
2       102,6.0
1       102,15.0
5       102,12.0
4       102,15.0
3       101,10.0
2       101,10.0
1       101,25.0
5       101,20.0
4       101,25.0
2       106,2.5
1       106,3.0
5       106,3.0
2       105,2.5
1       105,3.0
5       105,3.0
2       104,5.0
1       104,6.0
5       104,6.0
2       103,7.5
1       103,9.0
5       103,9.0
2       102,7.5
1       102,9.0
5       102,9.0
2       101,7.5
1       101,9.0
5       101,9.0
2       106,10.0
1       106,5.0
5       106,4.0
4       106,6.0
2       105,5.0
1       105,2.5
5       105,2.0
4       105,3.0
2       104,15.0
1       104,7.5
5       104,6.0
4       104,9.0
2       103,20.0
1       103,10.0
5       103,8.0
4       103,12.0
2       102,15.0
1       102,7.5
5       102,6.0
4       102,9.0
2       101,20.0
1       101,10.0
5       101,8.0
4       101,12.0
3       107,4.0
2       107,2.0
5       107,4.0
4       107,4.5
3       106,8.0
2       106,4.0
5       106,8.0
4       106,9.0
3       105,8.0
2       105,4.0
5       105,8.0
4       105,9.0
3       104,16.0
2       104,8.0
5       104,16.0
4       104,18.0
3       103,12.0
2       103,6.0
5       103,12.0
4       103,13.5
3       102,8.0
2       102,4.0
5       102,8.0
4       102,9.0
3       101,16.0
2       101,8.0
5       101,16.0
4       101,18.0
3       107,4.5
5       107,3.5
3       106,4.5
5       106,3.5
3       105,9.0
5       105,7.0
3       104,9.0
5       104,7.0
3       103,4.5
5       103,3.5
3       102,4.5
5       102,3.5
3       101,9.0
5       101,7.0
5       106,8.0
4       106,8.0
5       105,4.0
4       105,4.0
5       104,8.0
4       104,8.0
5       103,8.0
4       103,8.0
5       102,4.0
4       102,4.0
5       101,8.0
4       101,8.0
3       107,5.0
3       105,5.0
3       104,5.0
3       101,5.0

Run Step4_Update2.java and inspect the output:


~ hadoop fs -cat /user/hdfs/recommend/step6/part-r-00000

1       107,5.0
1       106,18.0
1       105,15.5
1       104,33.5
1       103,39.0
1       102,31.5
1       101,44.0
2       107,4.0
2       106,20.5
2       105,15.5
2       104,36.0
2       103,41.5
2       102,32.5
2       101,45.5
3       107,15.5
3       106,16.5
3       105,26.0
3       104,38.0
3       103,24.5
3       102,18.5
3       101,40.0
4       107,9.5
4       106,33.0
4       105,26.0
4       104,55.0
4       103,53.5
4       102,37.0
4       101,63.0
5       107,11.5
5       106,34.5
5       105,32.0
5       104,59.0
5       103,56.5
5       102,42.5
5       101,68.0

With this, the part that used to be computed in memory has been implemented with MapReduce, and the results are identical to those of the original Step4.

The code has been updated on github; please pull the latest version if you need it.
https://github.com/bsspirit/maven_hadoop_template/tree/master/src/main/java/org/conan/myhadoop/recommend

######################################################
Not enough to just read? For the author's video walkthroughs, visit: http://onbook.me/video
######################################################

Please credit the source when reposting:
http://blog.fens.me/hadoop-mapreduce-recommend/


Mahout Distributed Program Development: Kmeans Clustering

This post is part of the Hadoop family series, which introduces the Hadoop family of products. The commonly used projects include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeeper, Avro, Ambari, and Chukwa; newer additions include YARN, Hcatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, Hue, and others.

Since 2011, China has entered the era of big data in full swing, with the Hadoop-led family of software occupying the vast territory of big-data processing. Open-source communities and vendors, with all their data software, have been aligning with Hadoop, which has grown from a niche, elite technology into the standard for big-data development. On top of the original Hadoop technology, the Hadoop family of products keeps innovating around the concept of "big data", driving technical progress.

As developers in the IT industry, we should keep up with the pace, seize the opportunity, and rise together with Hadoop!

About the author:

  • Zhang Dan (Conan), programmer: Java, R, PHP, Javascript
  • weibo:@Conan_Z
  • blog: http://blog.fens.me
  • email: bsspirit@gmail.com

Please credit the source when reposting:
http://blog.fens.me/hadoop-mahout-kmeans/

mahout-kmeans

Preface

Mahout is a program development framework for machine learning on top of Hadoop. Mahout packages three major classes of machine learning algorithms, among them clustering. kmeans is one of the clustering algorithms we mention and use most often; especially when handling an unknown dataset, we usually cluster it first to see what patterns the data may contain.

This article explains how to implement the distributed kmeans algorithm through Mahout program development.

Table of Contents

  1. The kmeans clustering algorithm
  2. Introduction to the Mahout development environment
  3. Implementing the kmeans clustering algorithm with Mahout
  4. Visualizing the results with R
  5. Template project uploaded to github

1. The kmeans clustering algorithm

Cluster analysis is one of the key problems in data mining and machine learning, with wide applications in data mining, pattern recognition, decision support, machine learning, and image segmentation; it is one of the most important data analysis methods. Clustering searches a given dataset for subsets of similar data: each subset forms a cluster, and the data within a cluster are more similar to each other. Clustering algorithms can be broadly divided into partition-based, hierarchy-based, density-based, grid-based, and model-based methods.

The k-means algorithm is the most widely used partition-based clustering algorithm: it divides n objects into k clusters so that similarity within each cluster is high, with similarity computed from the mean of the objects in a cluster. It is very similar to the expectation-maximization algorithm for mixtures of normal distributions, in that both try to find the centers of the natural clusters in the data.

The algorithm first picks k objects at random, each initially representing the mean or center of a cluster. Every remaining object is assigned to the nearest cluster according to its distance from each cluster center, and then each cluster's mean is recomputed. This process repeats until the criterion function converges (a minimal single-machine sketch follows below).

The kmeans introduction is adapted from: http://zh.wikipedia.org/wiki/K平均算法
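
To make the iteration concrete, here is a minimal single-machine kmeans sketch in plain Java, using made-up 2D points and k=2 (my own illustration; the Mahout implementation below runs the same two steps as MapReduce jobs):


import java.util.Arrays;

public class SimpleKmeans {
    public static void main(String[] args) {
        double[][] points = {{1,1},{1.5,2},{3,4},{5,7},{3.5,5},{4.5,5},{3.5,4.5}};
        double[][] centers = {points[0], points[3]}; // pick two points as initial centers
        int[] assign = new int[points.length];
        for (int iter = 0; iter < 10; iter++) {
            // assignment step: each point goes to the nearest center
            for (int i = 0; i < points.length; i++) {
                assign[i] = dist(points[i], centers[0]) <= dist(points[i], centers[1]) ? 0 : 1;
            }
            // update step: each center moves to the mean of its cluster
            for (int c = 0; c < 2; c++) {
                double sx = 0, sy = 0;
                int n = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assign[i] == c) { sx += points[i][0]; sy += points[i][1]; n++; }
                }
                if (n > 0) centers[c] = new double[]{sx / n, sy / n};
            }
        }
        System.out.println("centers: " + Arrays.deepToString(centers));
        System.out.println("assignments: " + Arrays.toString(assign));
    }

    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }
}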

2. Introduction to the Mahout development environment

This follows on from the previous article: Mahout Distributed Program Development: Item-based Collaborative Filtering (ItemCF)

All environment variables and system configuration are the same as in that article!

3. Implementing the kmeans clustering algorithm with Mahout

Implementation steps:

  • 1. Prepare the data file: randomData.csv
  • 2. Java program: KmeansHadoop.java
  • 3. Run the program
  • 4. Interpret the clustering result
  • 5. Directories produced in HDFS

1). Prepare the data file: randomData.csv
The data file randomData.csv was generated by an R program using a random normal distribution function. For the single-machine, in-memory experiment, see the article:
Building a Mahout Project with Maven

Original data file (only part of the data is shown here):


~ vi datafile/randomData.csv

-0.883033363823402 -3.31967192630249
-2.39312626419456 3.34726861118871
2.66976353341256 1.85144276077058
-1.09922906899594 -6.06261735207489
-4.36361936997216 1.90509905380532
-0.00351835125495037 -0.610105996559153
-2.9962958796338 -3.60959839525735
-3.27529418132066 0.0230099799641799
2.17665594420569 6.77290756817957
-2.47862038335637 2.53431833167278
5.53654901906814 2.65089785582474
5.66257474538338 6.86783609641077
-0.558946883114376 1.22332819416237
5.11728525486132 3.74663871584768
1.91240516693351 2.95874731384062
-2.49747101306535 2.05006504756875
3.98781883213459 1.00780938946366
5.47470532716682 5.35084411045171

Note: the kmeans algorithm in Mahout uses a space (" ") as the default field separator, so I converted the comma-separated data file into a space-separated one (a tiny converter sketch follows).
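
A tiny one-off converter for that step (my own sketch; the file names are assumptions):


import java.io.IOException;
import java.nio.file.*;

public class CommaToSpace {
    public static void main(String[] args) throws IOException {
        // read the comma-separated file and write a space-separated copy
        String text = new String(Files.readAllBytes(Paths.get("datafile/randomData.csv")));
        Files.write(Paths.get("datafile/randomData_space.csv"), text.replace(",", " ").getBytes());
    }
}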

2). Java program: KmeansHadoop.java

For the implementation of the kmeans algorithm, see Mahout in Action.

mahout-kmeans-process


package org.conan.mymahout.cluster08;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.mahout.clustering.conversion.InputDriver;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.utils.clustering.ClusterDumper;
import org.conan.mymahout.hdfs.HdfsDAO;
import org.conan.mymahout.recommendation.ItemCFHadoop;

public class KmeansHadoop {
    private static final String HDFS = "hdfs://192.168.1.210:9000";

    public static void main(String[] args) throws Exception {
        String localFile = "datafile/randomData.csv";
        String inPath = HDFS + "/user/hdfs/mix_data";
        String seqFile = inPath + "/seqfile";
        String seeds = inPath + "/seeds";
        String outPath = inPath + "/result/";
        String clusteredPoints = outPath + "/clusteredPoints";

        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(HDFS, conf);
        hdfs.rmr(inPath);
        hdfs.mkdirs(inPath);
        hdfs.copyFile(localFile, inPath);
        hdfs.ls(inPath);

        InputDriver.runJob(new Path(inPath), new Path(seqFile), "org.apache.mahout.math.RandomAccessSparseVector");

        int k = 3;
        Path seqFilePath = new Path(seqFile);
        Path clustersSeeds = new Path(seeds);
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        clustersSeeds = RandomSeedGenerator.buildRandom(conf, seqFilePath, clustersSeeds, k, measure);
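        // assumed KMeansDriver.run arguments below: input, initial clusters, output, measure,
        // convergenceDelta=0.01, maxIterations=10, runClustering=true,
        // clusterClassificationThreshold=0.01, runSequential=false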
        KMeansDriver.run(conf, seqFilePath, clustersSeeds, new Path(outPath), measure, 0.01, 10, true, 0.01, false);

        Path outGlobPath = new Path(outPath, "clusters-*-final");
        Path clusteredPointsPath = new Path(clusteredPoints);
        System.out.printf("Dumping out clusters from clusters: %s and clusteredPoints: %s\n", outGlobPath, clusteredPointsPath);

        ClusterDumper clusterDumper = new ClusterDumper(outGlobPath, clusteredPointsPath);
        clusterDumper.printClusters(null);
    }
    
    public static JobConf config() {
        JobConf conf = new JobConf(ItemCFHadoop.class);
        conf.setJobName("ItemCFHadoop");
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");
        return conf;
    }

}

3). Run the program
Console output:


Delete: hdfs://192.168.1.210:9000/user/hdfs/mix_data
Create: hdfs://192.168.1.210:9000/user/hdfs/mix_data
copy from: datafile/randomData.csv to hdfs://192.168.1.210:9000/user/hdfs/mix_data
ls: hdfs://192.168.1.210:9000/user/hdfs/mix_data
==========================================================
name: hdfs://192.168.1.210:9000/user/hdfs/mix_data/randomData.csv, folder: false, size: 36655
==========================================================
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2013-10-14 15:39:31 org.apache.hadoop.util.NativeCodeLoader 
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-14 15:39:31 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:31 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:31 org.apache.hadoop.io.compress.snappy.LoadSnappy 
WARNING: Snappy native library not loaded
2013-10-14 15:39:31 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
2013-10-14 15:39:31 org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:31 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:31 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2013-10-14 15:39:31 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000000_0 is allowed to commit now
2013-10-14 15:39:31 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/seqfile
2013-10-14 15:39:31 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2013-10-14 15:39:31 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 0%
2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO: Counters: 11
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:   File Output Format Counters 
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     Bytes Written=31390
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:   File Input Format Counters 
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     Bytes Read=36655
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:   FileSystemCounters
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_READ=475910
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     HDFS_BYTES_READ=36655
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_WRITTEN=506350
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     HDFS_BYTES_WRITTEN=68045
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:   Map-Reduce Framework
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     Map input records=1000
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     Spilled Records=0
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     Total committed heap usage (bytes)=188284928
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     SPLIT_RAW_BYTES=124
2013-10-14 15:39:32 org.apache.hadoop.mapred.Counters log
INFO:     Map output records=1000
2013-10-14 15:39:32 org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
2013-10-14 15:39:32 org.apache.hadoop.io.compress.CodecPool getDecompressor
INFO: Got brand-new decompressor
2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:32 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0002
2013-10-14 15:39:32 org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:32 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
INFO: io.sort.mb = 100
2013-10-14 15:39:32 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
INFO: data buffer = 79691776/99614720
2013-10-14 15:39:32 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
INFO: record buffer = 262144/327680
2013-10-14 15:39:33 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2013-10-14 15:39:33 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2013-10-14 15:39:33 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0002_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2013-10-14 15:39:33 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0002_m_000000_0' done.
2013-10-14 15:39:33 org.apache.hadoop.mapred.Task initialize
INFO:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2013-10-14 15:39:33 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
2013-10-14 15:39:33 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 623 bytes
2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2013-10-14 15:39:33 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0002_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: 
2013-10-14 15:39:33 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0002_r_000000_0 is allowed to commit now
2013-10-14 15:39:33 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0002_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-1
2013-10-14 15:39:33 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:33 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0002_r_000000_0' done.
2013-10-14 15:39:33 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 100% reduce 100%
2013-10-14 15:39:33 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0002
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:   File Output Format Counters 
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Bytes Written=695
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:   FileSystemCounters
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_READ=4239303
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     HDFS_BYTES_READ=203963
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     FILE_BYTES_WRITTEN=4457168
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     HDFS_BYTES_WRITTEN=140321
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:   File Input Format Counters 
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Bytes Read=31390
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:   Map-Reduce Framework
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Map output materialized bytes=627
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Map input records=1000
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Reduce shuffle bytes=0
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Spilled Records=6
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Map output bytes=612
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Total committed heap usage (bytes)=376569856
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Combine input records=0
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Reduce input records=3
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Reduce input groups=3
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Combine output records=0
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Reduce output records=3
2013-10-14 15:39:33 org.apache.hadoop.mapred.Counters log
INFO:     Map output records=3
2013-10-14 15:39:34 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:34 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:34 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0003
2013-10-14 15:39:34 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: io.sort.mb = 100
2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: data buffer = 79691776/99614720
2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: record buffer = 262144/327680
2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-14 15:39:34 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-14 15:39:34 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0003_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:34 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0003_m_000000_0' done.
2013-10-14 15:39:34 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:34 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-14 15:39:34 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:34 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0003_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:34 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0003_r_000000_0 is allowed to commit now
2013-10-14 15:39:34 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0003_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-2
2013-10-14 15:39:34 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-14 15:39:34 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0003_r_000000_0' done.
2013-10-14 15:39:35 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 100%
2013-10-14 15:39:35 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0003
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息: Counters: 19
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=695
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=7527467
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=271193
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=7901744
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=142099
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Map output materialized bytes=681
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Reduce shuffle bytes=0
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=6
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Map output bytes=666
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=575930368
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Combine input records=0
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Reduce input records=3
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Reduce input groups=3
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Combine output records=0
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Reduce output records=3
2013-10-14 15:39:35 org.apache.hadoop.mapred.Counters log
信息:     Map output records=3
2013-10-14 15:39:35 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:35 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:35 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0004
2013-10-14 15:39:35 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: io.sort.mb = 100
2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: data buffer = 79691776/99614720
2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: record buffer = 262144/327680
2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-14 15:39:35 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-14 15:39:35 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0004_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:35 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0004_m_000000_0' done.
2013-10-14 15:39:35 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:35 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-14 15:39:35 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:35 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0004_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:35 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0004_r_000000_0 is allowed to commit now
2013-10-14 15:39:35 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0004_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-3
2013-10-14 15:39:35 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-14 15:39:35 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0004_r_000000_0' done.
2013-10-14 15:39:36 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 100%
2013-10-14 15:39:36 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0004
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息: Counters: 19
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=695
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=10815685
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=338143
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=11346320
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=143877
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Map output materialized bytes=681
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Reduce shuffle bytes=0
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=6
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Map output bytes=666
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=775290880
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Combine input records=0
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Reduce input records=3
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Reduce input groups=3
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Combine output records=0
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Reduce output records=3
2013-10-14 15:39:36 org.apache.hadoop.mapred.Counters log
信息:     Map output records=3
2013-10-14 15:39:36 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:36 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:36 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0005
2013-10-14 15:39:36 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: io.sort.mb = 100
2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: data buffer = 79691776/99614720
2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: record buffer = 262144/327680
2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-14 15:39:36 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-14 15:39:36 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0005_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:36 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0005_m_000000_0' done.
2013-10-14 15:39:36 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:36 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-14 15:39:36 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:36 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0005_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:36 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0005_r_000000_0 is allowed to commit now
2013-10-14 15:39:36 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0005_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-4
2013-10-14 15:39:36 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-14 15:39:36 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0005_r_000000_0' done.
2013-10-14 15:39:37 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 100%
2013-10-14 15:39:37 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0005
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息: Counters: 19
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=695
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=14103903
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=405093
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=14790888
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=145655
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Map output materialized bytes=681
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Reduce shuffle bytes=0
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=6
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Map output bytes=666
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=974651392
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Combine input records=0
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Reduce input records=3
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Reduce input groups=3
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Combine output records=0
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Reduce output records=3
2013-10-14 15:39:37 org.apache.hadoop.mapred.Counters log
信息:     Map output records=3
2013-10-14 15:39:37 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:37 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:37 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0006
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: io.sort.mb = 100
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: data buffer = 79691776/99614720
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: record buffer = 262144/327680
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-14 15:39:37 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0006_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0006_m_000000_0' done.
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:37 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-14 15:39:37 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0006_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0006_r_000000_0 is allowed to commit now
2013-10-14 15:39:37 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0006_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-5
2013-10-14 15:39:37 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-14 15:39:37 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0006_r_000000_0' done.
2013-10-14 15:39:38 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 100%
2013-10-14 15:39:38 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0006
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息: Counters: 19
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=695
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=17392121
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=472043
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=18235456
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=147433
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Map output materialized bytes=681
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Reduce shuffle bytes=0
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=6
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Map output bytes=666
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=1174011904
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Combine input records=0
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Reduce input records=3
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Reduce input groups=3
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Combine output records=0
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Reduce output records=3
2013-10-14 15:39:38 org.apache.hadoop.mapred.Counters log
信息:     Map output records=3
2013-10-14 15:39:38 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:38 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:38 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0007
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: io.sort.mb = 100
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: data buffer = 79691776/99614720
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: record buffer = 262144/327680
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-14 15:39:38 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0007_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0007_m_000000_0' done.
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:38 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-14 15:39:38 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0007_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0007_r_000000_0 is allowed to commit now
2013-10-14 15:39:38 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0007_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-6
2013-10-14 15:39:38 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-14 15:39:38 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0007_r_000000_0' done.
2013-10-14 15:39:39 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 100%
2013-10-14 15:39:39 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0007
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息: Counters: 19
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=695
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=20680339
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=538993
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=21680040
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=149211
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Map output materialized bytes=681
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Reduce shuffle bytes=0
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=6
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Map output bytes=666
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=1373372416
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Combine input records=0
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Reduce input records=3
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Reduce input groups=3
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Combine output records=0
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Reduce output records=3
2013-10-14 15:39:39 org.apache.hadoop.mapred.Counters log
信息:     Map output records=3
2013-10-14 15:39:39 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:39 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:39 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0008
2013-10-14 15:39:39 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:39 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: io.sort.mb = 100
2013-10-14 15:39:39 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: data buffer = 79691776/99614720
2013-10-14 15:39:39 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: record buffer = 262144/327680
2013-10-14 15:39:39 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-14 15:39:40 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0008_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0008_m_000000_0' done.
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:40 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-14 15:39:40 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0008_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0008_r_000000_0 is allowed to commit now
2013-10-14 15:39:40 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0008_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-7
2013-10-14 15:39:40 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-14 15:39:40 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0008_r_000000_0' done.
2013-10-14 15:39:40 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 100%
2013-10-14 15:39:40 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0008
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息: Counters: 19
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=695
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=23968557
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=605943
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=25124624
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=150989
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Map output materialized bytes=681
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Reduce shuffle bytes=0
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=6
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Map output bytes=666
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=1572732928
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Combine input records=0
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Reduce input records=3
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Reduce input groups=3
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Combine output records=0
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Reduce output records=3
2013-10-14 15:39:40 org.apache.hadoop.mapred.Counters log
信息:     Map output records=3
2013-10-14 15:39:41 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:41 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:41 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0009
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: io.sort.mb = 100
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: data buffer = 79691776/99614720
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: record buffer = 262144/327680
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-14 15:39:41 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0009_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0009_m_000000_0' done.
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:41 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-14 15:39:41 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0009_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0009_r_000000_0 is allowed to commit now
2013-10-14 15:39:41 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0009_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-8
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0009_r_000000_0' done.
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 100%
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0009
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息: Counters: 19
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=695
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=27256775
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=673669
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=28569192
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=152767
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Map output materialized bytes=681
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Reduce shuffle bytes=0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=6
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Map output bytes=666
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=1772093440
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Combine input records=0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Reduce input records=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Reduce input groups=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Combine output records=0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Reduce output records=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
信息:     Map output records=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:42 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0010
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: io.sort.mb = 100
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: data buffer = 79691776/99614720
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: record buffer = 262144/327680
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0010_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0010_m_000000_0' done.
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:42 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-14 15:39:42 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0010_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0010_r_000000_0 is allowed to commit now
2013-10-14 15:39:42 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0010_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-9
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0010_r_000000_0' done.
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 100%
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0010
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息: Counters: 19
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=695
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=30544993
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=741007
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=32013760
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=154545
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Map output materialized bytes=681
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Reduce shuffle bytes=0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=6
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Map output bytes=666
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=1966735360
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Combine input records=0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Reduce input records=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Reduce input groups=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Combine output records=0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Reduce output records=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
信息:     Map output records=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:43 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0011
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: io.sort.mb = 100
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: data buffer = 79691776/99614720
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer 
信息: record buffer = 262144/327680
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
信息: Starting flush of map output
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
信息: Finished spill 0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0011_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0011_m_000000_0' done.
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:43 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Merging 1 sorted segments
2013-10-14 15:39:43 org.apache.hadoop.mapred.Merger$MergeQueue merge
信息: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0011_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0011_r_000000_0 is allowed to commit now
2013-10-14 15:39:43 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0011_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-10
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: reduce > reduce
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0011_r_000000_0' done.
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 100%
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0011
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息: Counters: 19
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=695
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=33833211
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=808345
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=35458320
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=156323
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Map output materialized bytes=681
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Reduce shuffle bytes=0
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=6
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Map output bytes=666
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=2166095872
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Combine input records=0
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Reduce input records=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Reduce input groups=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Combine output records=0
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Reduce output records=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
信息:     Map output records=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
警告: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:44 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
信息: Total input paths to process : 1
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Running job: job_local_0012
2013-10-14 15:39:44 org.apache.hadoop.mapred.Task initialize
信息:  Using ResourceCalculatorPlugin : null
2013-10-14 15:39:44 org.apache.hadoop.mapred.Task done
信息: Task:attempt_local_0012_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:44 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:44 org.apache.hadoop.mapred.Task commit
信息: Task attempt_local_0012_m_000000_0 is allowed to commit now
2013-10-14 15:39:44 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
信息: Saved output of task 'attempt_local_0012_m_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
2013-10-14 15:39:44 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
信息: 
2013-10-14 15:39:44 org.apache.hadoop.mapred.Task sendDone
信息: Task 'attempt_local_0012_m_000000_0' done.
2013-10-14 15:39:45 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息:  map 100% reduce 0%
2013-10-14 15:39:45 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
信息: Job complete: job_local_0012
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息: Counters: 11
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:   File Output Format Counters 
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     Bytes Written=41520
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:   File Input Format Counters 
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     Bytes Read=31390
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:   FileSystemCounters
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_READ=18560374
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_READ=437203
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     FILE_BYTES_WRITTEN=19450325
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     HDFS_BYTES_WRITTEN=120417
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:   Map-Reduce Framework
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     Map input records=1000
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     Spilled Records=0
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     Total committed heap usage (bytes)=1083047936
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     SPLIT_RAW_BYTES=130
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
信息:     Map output records=1000
Dumping out clusters from clusters: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-*-final and clusteredPoints: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
CL-552{n=443 c=[1.631, -0.412] r=[1.563, 1.407]}
	Weight : [props - optional]:  Point:
	1.0: [-2.393, 3.347]
	1.0: [-4.364, 1.905]
	1.0: [-3.275, 0.023]
	1.0: [-2.479, 2.534]
	1.0: [-0.559, 1.223]
	...
	
CL-847{n=77 c=[-2.953, -0.971] r=[1.767, 2.189]}
	Weight : [props - optional]:  Point:
	1.0: [-0.883, -3.320]
	1.0: [-1.099, -6.063]
	1.0: [-0.004, -0.610]
	1.0: [-2.996, -3.610]
	1.0: [3.988, 1.008]
	...

CL-823{n=480 c=[0.219, 2.600] r=[1.479, 1.385]}
	Weight : [props - optional]:  Point:
	1.0: [2.670, 1.851]
	1.0: [2.177, 6.773]
	1.0: [5.537, 2.651]
	1.0: [5.663, 6.868]
	1.0: [5.117, 3.747]
	1.0: [1.912, 2.959]
	...

4). Interpreting the clustering results
We can break the log above into 3 parts:

  • a. Environment initialization
  • b. Algorithm execution
  • c. Printing the clustering results

a. Environment initialization
Initialize the data and working directories in HDFS, and upload the data file.


Delete: hdfs://192.168.1.210:9000/user/hdfs/mix_data
Create: hdfs://192.168.1.210:9000/user/hdfs/mix_data
copy from: datafile/randomData.csv to hdfs://192.168.1.210:9000/user/hdfs/mix_data
ls: hdfs://192.168.1.210:9000/user/hdfs/mix_data
==========================================================
name: hdfs://192.168.1.210:9000/user/hdfs/mix_data/randomData.csv, folder: false, size: 36655
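
These operations (delete, create, upload, ls) map to plain org.apache.hadoop.fs.FileSystem calls. A minimal sketch, assuming Hadoop 1.x and the NameNode address seen in the log; the class name HdfsSetup and the hard-coded paths are illustrative, not taken from the original project:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSetup {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://192.168.1.210:9000"); // NameNode from the log
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/hdfs/mix_data");
        fs.delete(dir, true);   // Delete: drop the results of earlier runs
        fs.mkdirs(dir);         // Create: a fresh working directory
        fs.copyFromLocalFile(new Path("datafile/randomData.csv"), dir); // upload the input

        for (FileStatus s : fs.listStatus(dir)) { // ls: verify the upload
            System.out.println("name: " + s.getPath() + ", folder: "
                    + s.isDir() + ", size: " + s.getLen());
        }
        fs.close();
    }
}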

b. Algorithm execution
The algorithm runs in 3 steps:

  • 1): Convert the raw data randomData.csv into Mahout sequence files of VectorWritable.
  • 2): Randomly pick 3 centers for kmeans as the initial clusters.
  • 3): Run MapReduce for the configured number of iterations.

1): Convert the raw data randomData.csv into Mahout sequence files of VectorWritable.

Program source code:


      // CSV in, a sequence file of VectorWritable out (the vector class is passed by name)
      InputDriver.runJob(new Path(inPath), new Path(seqFile), "org.apache.mahout.math.RandomAccessSparseVector");

Log output:

Job complete: job_local_0001
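
To sanity-check the conversion, we can read a few records back from the generated sequence file. A minimal sketch, assuming the part file name shown in the HDFS listing of section 5) below; the class name PeekSeqFile is illustrative, and instantiating the key and value from the reader's own metadata avoids hard-coding the classes InputDriver wrote:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class PeekSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("hdfs://192.168.1.210:9000/user/hdfs/mix_data/seqfile/part-m-00000");
        FileSystem fs = FileSystem.get(file.toUri(), conf);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        // The file records its own key/value classes, so we don't have to guess them.
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

        for (int i = 0; i < 5 && reader.next(key, value); i++) {
            System.out.println(key + " => " + value); // print the first 5 vectors
        }
        reader.close();
    }
}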

2): Randomly pick 3 centers for kmeans as the initial clusters.

Program source code:


        // randomly sample k = 3 vectors from the input as the initial cluster seeds
        int k = 3;
        Path seqFilePath = new Path(seqFile);
        Path clustersSeeds = new Path(seeds);
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        clustersSeeds = RandomSeedGenerator.buildRandom(conf, seqFilePath, clustersSeeds, k, measure);

Log output:

RandomSeedGenerator.buildRandom samples the seeds in-process rather than through a MapReduce job, so this step leaves no "Job complete" line of its own; its result is the seeds/part-randomSeed file shown in the HDFS listing of section 5) below.

3): Run MapReduce for the configured number of iterations to compute the clusters.
Program source code:


        KMeansDriver.run(conf, seqFilePath, clustersSeeds, new Path(outPath), measure, 0.01, 10, true, 0.01, false);
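
Reading this call's arguments against the KMeansDriver.run() signature in Mahout 0.8 (my reading; worth double-checking against the Javadoc for your version): the input vectors, the initial seeds, the output path, the distance measure, a convergence threshold of 0.01, a maximum of 10 iterations, runClustering=true (classify every point into its final cluster, which is what produces the clusteredPoints directory), a cluster classification threshold of 0.01, and runSequential=false (run as MapReduce jobs rather than in-process).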

Log output:


Job complete: job_local_0002
Job complete: job_local_0003
Job complete: job_local_0004
Job complete: job_local_0005
Job complete: job_local_0006
Job complete: job_local_0007
Job complete: job_local_0008
Job complete: job_local_0009
Job complete: job_local_0010
Job complete: job_local_0011
Job complete: job_local_0012

Judging from the commit paths in the full log, job_local_0002 through job_local_0011 are the 10 kmeans iterations (saving result/clusters-1 through result/clusters-10), and the final map-only job, job_local_0012, writes the per-point assignments to result/clusteredPoints.

c. Printing the clustering results


Dumping out clusters from clusters: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-*-final and clusteredPoints: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
CL-552{n=443 c=[1.631, -0.412] r=[1.563, 1.407]}
CL-847{n=77 c=[-2.953, -0.971] r=[1.767, 2.189]}
CL-823{n=480 c=[0.219, 2.600] r=[1.479, 1.385]}

Run result: there are 3 centers.

  • Cluster1 contains 443 points, centered at [1.631, -0.412]
  • Cluster2 contains 77 points, centered at [-2.953, -0.971]
  • Cluster3 contains 480 points, centered at [0.219, 2.600]

As a sanity check, 443 + 77 + 480 = 1000, which matches the 1000 input records reported in the log.

5). Directories created in HDFS


# root directory
~ hadoop fs -ls /user/hdfs/mix_data
Found 4 items
-rw-r--r--   3 Administrator supergroup      36655 2013-10-04 15:31 /user/hdfs/mix_data/randomData.csv
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/seeds
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/seqfile

# output directory
~ hadoop fs -ls /user/hdfs/mix_data/result
Found 13 items
-rw-r--r--   3 Administrator supergroup        194 2013-10-04 15:31 /user/hdfs/mix_data/result/_policy
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusteredPoints
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-0
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-1
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-10-final
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-2
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-3
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-4
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-5
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-6
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-7
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-8
drwxr-xr-x   - Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-9

# directory with the randomly generated center seeds
~ hadoop fs -ls /user/hdfs/mix_data/seeds
Found 1 items
-rw-r--r--   3 Administrator supergroup        599 2013-10-04 15:31 /user/hdfs/mix_data/seeds/part-randomSeed

# directory with the input file converted into Mahout's sequence file format
~ hadoop fs -ls /user/hdfs/mix_data/seqfile
Found 2 items
-rw-r--r--   3 Administrator supergroup          0 2013-10-04 15:31 /user/hdfs/mix_data/seqfile/_SUCCESS
-rw-r--r--   3 Administrator supergroup      31390 2013-10-04 15:31 /user/hdfs/mix_data/seqfile/part-m-00000
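
A note on the result layout, following Mahout's kmeans conventions: clusters-0 appears to hold the initial centers, clusters-1 through clusters-10 hold the centers after each of the 10 iterations (the last one carrying the -final suffix, which is what the clusters-*-final pattern in the dump above matches), and clusteredPoints holds the final assignment of every input point to a cluster.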

4. Visualizing the results with R

Save the clustered points into separate cluster*.csv files, then plot them with R (one way to export those files is sketched right below).
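
The original post does not show the export step, so here is a hedged sketch of one way to do it: read the clusteredPoints sequence file, whose keys I assume to be the cluster ids (IntWritable) and whose values carry the point vectors (in Mahout 0.8 the value class should be WeightedVectorWritable or its subclass WeightedPropertyVectorWritable, which the dump's "[props - optional]" label suggests). The part file name and import paths are assumptions to verify against your version:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.clustering.classify.WeightedVectorWritable; // package may differ by version
import org.apache.mahout.math.Vector;

public class ExportClusteredPoints {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path file = new Path("hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints/part-m-00000");
        FileSystem fs = FileSystem.get(file.toUri(), conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);

        IntWritable clusterId = new IntWritable();
        // Instantiate the value from the file's own metadata: this works whether it is
        // WeightedVectorWritable or WeightedPropertyVectorWritable.
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

        Map<Integer, PrintWriter> out = new HashMap<Integer, PrintWriter>();
        Map<Integer, Integer> fileIndex = new HashMap<Integer, Integer>(); // cluster id -> 1,2,3
        while (reader.next(clusterId, value)) {
            Integer idx = fileIndex.get(clusterId.get());
            if (idx == null) {
                idx = fileIndex.size() + 1;
                fileIndex.put(clusterId.get(), idx);
                out.put(idx, new PrintWriter(new FileWriter("cluster" + idx + ".csv")));
            }
            Vector v = ((WeightedVectorWritable) value).getVector();
            out.get(idx).println(v.get(0) + "," + v.get(1)); // the points are 2-D
        }
        reader.close();
        for (PrintWriter w : out.values()) {
            w.close();
        }
    }
}

Each cluster*.csv then holds one x,y pair per line, which is exactly what the R script below reads.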


# read the three exported clusters and stack them into one data frame
c1<-read.csv(file="cluster1.csv",sep=",",header=FALSE)
c2<-read.csv(file="cluster2.csv",sep=",",header=FALSE)
c3<-read.csv(file="cluster3.csv",sep=",",header=FALSE)
y<-rbind(c1,c2,c3)
# color each point by the cluster it came from
cols<-c(rep(1,nrow(c1)),rep(2,nrow(c2)),rep(3,nrow(c3)))
plot(y, col=c("black","blue","green")[cols])
# overlay the 3 centers reported by Mahout as solid dots
center<-matrix(c(1.631, -0.412,-2.953, -0.971,0.219, 2.600),ncol=2,byrow=TRUE)
points(center, col="violetred", pch = 19)

kmeans

In the figure above we see hollow points in three colors (black, blue, and green); these are the original data.
The 3 solid purple points are the 3 centers produced by Mahout's kmeans.

Compared with the classification and centers from the kmeans implemented in R in the earlier article, the results are noticeably different. (Related article: 用Maven构建Mahout项目)

To sum up: kmeans results vary with the distance measure, the convergence threshold, the initial centers, and the number of iterations. So the kmeans algorithm generally gives us only a rough classification standard. That standard is very helpful for getting to know an unfamiliar dataset, but it cannot serve as a precise yardstick for the data.

5. Template project uploaded to GitHub

https://github.com/bsspirit/maven_mahout_template/tree/mahout-0.8

You can download this project and use it as a starting point for your own development.


~ git clone https://github.com/bsspirit/maven_mahout_template
~ git checkout mahout-0.8

With this, we have completed the distributed implementation of Mahout's kmeans clustering algorithm. Next, we will go on to experiment with the classification algorithms in Mahout!

Please credit the source when reposting:
http://blog.fens.me/hadoop-mahout-kmeans/
