OpenBlas让R的矩阵计算加速

R的极客理想系列文章,涵盖了R的思想,使用,工具,创新等的一系列要点,以我个人的学习和体验去诠释R的强大。

R语言作为统计学一门语言,一直在小众领域闪耀着光芒。直到大数据的爆发,R语言变成了一门炙手可热的数据分析的利器。随着越来越多的工程背景的人的加入,R语言的社区在迅速扩大成长。现在已不仅仅是统计领域,教育,银行,电商,互联网….都在使用R语言。

要成为有理想的极客,我们不能停留在语法上,要掌握牢固的数学,概率,统计知识,同时还要有创新精神,把R语言发挥到各个领域。让我们一起动起来吧,开始R的极客理想。

关于作者:

  • 张丹(Conan), 程序员Java,R,Javascript
  • weibo:@Conan_Z
  • blog: http://blog.fens.me
  • email: bsspirit@gmail.com

转载请注明出处:
http://blog.fens.me/r-martix-openblas/

openblas

前言

昨天在IBM的大会上,又一次见到了OpenBlas主题的分享,这次必须要试一下。我第一次了解OpenBlas是在2年前的R语言大会上,听到了OpenBlas的各种优势,不过听完也就过去了。2年后再来这个项目,团队成员稳定,项目进展顺利,已经广泛接受,再不尝试一下就真的要落伍了。

目录

  1. OpenBlas介绍
  2. R和OpenBlas的安装
  3. 让R语言加速

1. OpenBlas介绍

OpenBlas是一个开源项目,是由 中科院软件所并行软件与计算科学实验室 发起的基于GotoBLAS2 1.13 BSD版的开源BLAS库高性能实现。

BLAS(Basic Linear Algebra Subprograms 基础线性代数程序集)是一个应用程序接口(API)标准,用以规范发布基础线性代数操作的数值库(如矢量或矩阵乘法)。该程序集最初发布于1979年,并 用于建立更大的数值程序包(如LAPACK)。在高性能计算领域,BLAS被广泛使用。例如,LINPACK的运算成绩则很大程度上取决于BLAS中子程 序DGEMM的表现。为提高性能,各軟硬件厂商则针对其產品对BLAS接口实现进行高度优化。

项目主页:http://www.openblas.net/

2. R和OpenBlas的安装

OpenBlas可以为各种语言底层,提供矩阵计算的性能提升,那么让我们把R和OpenBlas结合试试吧!

本机的系统环境:

  • Linux Ubuntu 14.01.1
  • CPU 双核 Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
  • 内存 4G

通过命令查看系统参数


# 操作系统
~ cat /etc/issue
Ubuntu 14.04.1 LTS \n \l

# CPU
cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping	: 4
microcode	: 0x428
cpu MHz		: 2600.048
cache size	: 20480 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm
bogomips	: 5200.09
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
stepping	: 4
microcode	: 0x428
cpu MHz		: 2600.048
cache size	: 20480 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm
bogomips	: 5200.09
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

# 内存
~ cat /proc/meminfo 
MemTotal:        4046820 kB
MemFree:         1572372 kB
Buffers:           40588 kB
Cached:           709684 kB
SwapCached:            0 kB
Active:          1953940 kB
Inactive:         418084 kB
Active(anon):    1621840 kB
Inactive(anon):     5732 kB
Active(file):     332100 kB
Inactive(file):   412352 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                24 kB
Writeback:             0 kB
AnonPages:       1623792 kB
Mapped:            34936 kB
Shmem:              5828 kB
Slab:              58024 kB
SReclaimable:      45252 kB
SUnreclaim:        12772 kB
KernelStack:        1512 kB
PageTables:         8980 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2023408 kB
Committed_AS:    2556460 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        9664 kB
VmallocChunk:   34359723308 kB
HardwareCorrupted:     0 kB
AnonHugePages:   1562624 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       28672 kB
DirectMap2M:     4296704 kB

首先,我们要安装R语言的运行环境,在Linux Ubuntu中一条命令就可以搞定。


# 安装R语言
~ sudo apt-get install r-base

#查看R语言的版本
~ R --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

我们安装的R语言最新版本是3.2.2。

接下来,我们跑一个矩阵的计算,让2个3000行列的方阵相乘。


# 启动R
~ R

# 方阵相乘
> x <- matrix(1:(3000 * 3000), 3000, 3000)

# 计算耗时
> system.time(tmp <- x %*% x)
   user  system elapsed 
 33.329   0.332  33.788 

接下来,我们安装OpenBlas来提高计算性能。在Ubuntu中安装OpenBlas非常简单,只需要一条命令就可以搞定。


~ sudo apt-get install libopenblas-base

切换blas的计算引擎,使用openblas替换libblas。


~ sudo update-alternatives --config libblas.so.3
There are 2 choices for the alternative libblas.so.3 (providing /usr/lib/libblas.so.3).

  Selection    Path                                 Priority   Status
------------------------------------------------------------
* 0            /usr/lib/openblas-base/libblas.so.3   40        auto mode
  1            /usr/lib/libblas/libblas.so.3         10        manual mode
  2            /usr/lib/openblas-base/libblas.so.3   40        manual mode

Press enter to keep the current choice[*], or type selection number: 0

选择0,使用openblas-base引擎。

我们重新打开R运行环境,再次执行刚才的矩阵相乘计算。


~ R
> x <- matrix(1:(3000 * 3000), 3000, 3000)

# 计算耗时
> system.time(tmp <- x %*% x)
   user  system elapsed 
  7.391   0.127   3.869 

神奇的事情发生了,速度提升了4倍多。由于OpenBlas可以对矩阵计算加速,那么我们对所有矩阵操作都做一下测试吧。

3. 让R语言加速

通过互联网我找到了两个用于R语言性能测试的脚本,我们可以在自己的环境中测试一下。Benchmarks脚本的发布页,脚本代码下载

我发现Revolution Analytics公司也用这个脚本进行了测试,并对比了Revolution企业版和R的官方发行版的区别。

下载脚本


~ wget http://brettklamer.com/assets/files/statistical/faster-blas-in-r/R-benchmark-25.R
--2015-09-24 12:06:05--  http://brettklamer.com/assets/files/statistical/faster-blas-in-r/R-benchmark-25.R
Resolving brettklamer.com (brettklamer.com)... 199.96.156.242
Connecting to brettklamer.com (brettklamer.com)|199.96.156.242|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13666 (13K)
Saving to: ‘R-benchmark-25.R’

100%[============================================================================================>] 13,666      --.-K/s   in 0s      

2015-09-24 12:06:06 (203 MB/s) - ‘R-benchmark-25.R’ saved [13666/13666]

执行脚本。


~ R

# 运行脚本
> source("R-benchmark-25.R")
Loading required package: Matrix
Loading required package: SuppDists


   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  1.103 
2400x2400 normal distributed random matrix ^1000____ (sec):  0.812333333333333 
Sorting of 7,000,000 random values__________________ (sec):  0.962666666666667 
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  1.547 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  0.828000000000001 
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  0.957989159036612 

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.365333333333335 
Eigenvalues of a 640x640 random matrix______________ (sec):  1.43466666666667 
Determinant of a 2500x2500 random matrix____________ (sec):  0.895999999999998 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  0.832000000000003 
Inverse of a 1600x1600 random matrix________________ (sec):  0.724333333333334 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.814310314522547 

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.776666666666661 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.269666666666671 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  0.570666666666663 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.506666666666665 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.533000000000001 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.536138937440438 


Total time for all 15 tests_________________________ (sec):  12.162 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  0.747841037469598 
                      --- End of test ---

我们再切换到,R语言默认的blas引擎运行一次。


~ sudo update-alternatives --config libblas.so.3
There are 2 choices for the alternative libblas.so.3 (providing /usr/lib/libblas.so.3).

  Selection    Path                                 Priority   Status
------------------------------------------------------------
* 0            /usr/lib/openblas-base/libblas.so.3   40        auto mode
  1            /usr/lib/libblas/libblas.so.3         10        manual mode
  2            /usr/lib/openblas-base/libblas.so.3   40        manual mode

Press enter to keep the current choice[*], or type selection number: 1
update-alternatives: using /usr/lib/libblas/libblas.so.3 to provide /usr/lib/libblas.so.3 (libblas.so.3) in manual mode

选择1,切换到libblas引擎。重启R语言环境,并执行脚本。


~ R
> source("R-benchmark-25.R")
Loading required package: Matrix
Loading required package: SuppDists


   R Benchmark 2.5
   ===============
Number of times each test is run__________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec):  1.09366666666667 
2400x2400 normal distributed random matrix ^1000____ (sec):  0.817333333333333 
Sorting of 7,000,000 random values__________________ (sec):  0.954333333333333 
2800x2800 cross-product matrix (b = a' * a)_________ (sec):  15.3033333333333 
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec):  7.155 
                      --------------------------------------------
                 Trimmed geom. mean (2 extremes eliminated):  1.95463154033118 

   II. Matrix functions
   --------------------
FFT over 2,400,000 random values____________________ (sec):  0.363666666666669 
Eigenvalues of a 640x640 random matrix______________ (sec):  1.131 
Determinant of a 2500x2500 random matrix____________ (sec):  5.061 
Cholesky decomposition of a 3000x3000 matrix________ (sec):  5.634 
Inverse of a 1600x1600 random matrix________________ (sec):  4.142 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  2.87278425762591 

   III. Programmation
   ------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec):  0.775000000000006 
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec):  0.259666666666665 
Grand common divisors of 400,000 pairs (recursion)__ (sec):  0.633333333333345 
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec):  0.533666666666666 
Escoufier's method on a 45x45 matrix (mixed)________ (sec):  0.647999999999996 
                      --------------------------------------------
                Trimmed geom. mean (2 extremes eliminated):  0.602780428790226 


Total time for all 15 tests_________________________ (sec):  44.505 
Overall mean (sum of I, II and III trimmed means/3)_ (sec):  1.5014435867612 
                      --- End of test ---

从运行结果看到,用OpenBlas库在做矩阵计算时,性能优势是非常明显的。完成15个测试,OpenBlas库用时12秒,而默认的Blas库用时44秒。仅仅是切换一个底层算法库的成本,就可以让计算性能得到非常大的提升,各位R的小伙伴赶紧用起来吧。

转载请注明出处:
http://blog.fens.me/r-martix-openblas/

打赏作者

This entry was posted in R语言实践

  • Jerome Cao

    hello大神哥,这只是改变了运行环境和底层算法,但是基本的函数库没有变,对吧?谢谢