粉丝日志

Posted:

Oct 26, 2015

Building Your Own Alexa Query Service

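The "Node.js from Scratch" series introduces how to use JavaScript as a server-side scripting language and do web development on the Node.js platform. Node.js is built on V8, one of the fastest JavaScript engines available at the time of writing. The Chrome browser is also based on V8 and stays smooth with 20-30 pages open at once. Express, the de facto standard web framework for Node.js, helps us set up a website quickly; in my experience it is more productive than PHP and has a gentler learning curve. It is a great fit for small sites, personalized sites, and our own geek sites!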

About the author

Zhang Dan (Conan), programmer: Java, R, JavaScript
weibo：@Conan_Z
blog: http://blog.fens.me
email: bsspirit@gmail.com

Please credit the source when reposting:
http://blog.fens.me/nodejs-alexa/

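Preface

Every webmaster looks for ways to grow traffic and, with it, advertising revenue. Common yardsticks for judging a site include Google's PR (PageRank) and Baidu's weight score. From the perspective of PV (page view) traffic, one very important indicator is the Alexa site ranking.

Comparing yourself with sites worldwide shows where your own site stands. Let's first squeeze into the global top 100,000; otherwise it is embarrassing to tell peers that "I have a website".

Introduction to Alexa
Developing an Alexa service with Node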

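1. Introduction to Alexa

Alexa (http://www.alexa.com/) is a website that publishes rankings of the world's websites. It started as a search engine and was founded in the United States in April 1996, with the goal of letting Internet users take a greater part in organizing Internet resources while sharing them. Alexa collects more than 1 TB of information from the web every day; it not only provides billions of URL links but also ranks every site among them. At the time of writing, Alexa was arguably the site with the largest collection of URLs and the most detailed published ranking information.

In 1999 Alexa was acquired by Amazon, the flagship US e-commerce company, and became its wholly owned subsidiary. In the spring of 2002 Alexa gave up its own search engine and partnered with Google instead.

Alexa provides a traffic statistics service that records traffic for websites with domain names worldwide. In other words, once you have registered a domain you can look up your site's rank in Alexa. Alexa ranks each site by how much it is viewed: the more views, the higher the rank.

Typically, a freshly registered domain ranks beyond 10 million. If you then run the site carefully every day, it enters the top 1 million once it starts to pick up. Keep publishing quality content for a while and the rank rises into the top 500,000. When the site becomes reasonably well known in its field, the rank can reach around 100,000, as with 粉丝日志 (fens.me) at 122,616 (2015-10-25), and advertisers become willing to place ads. If the site is meant to make a profit, you need to push on into the top 10,000, where the traffic can already bring you business. Do even better and reach roughly the top 2,000, as Xueqiu did at 2,109 (2015-10-25), and the site will carry a very high valuation. If you catch a once-in-a-lifetime opportunity and the site reaches roughly the top 100, it is worth as much as a listed company, as with JD.com at 105 (2015-10-25). And if you are a genius CEO whose site makes the top 10, you become the leader of an industry, perhaps even the richest person in a region, as with Baidu at 4 (2015-10-25).

Keep going, webmasters!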

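2. Developing an Alexa service with Node

2.1 The Alexa open API

The Alexa ranking is widely recognized in the industry and its data is quoted often, so looking it up on the website every time is inconvenient. Amazon provides an Alexa API that lets developers build their own Alexa query applications.

Alexa has two main data API services.

Alexa Web Information Service: looks up ranking information for a single site
Alexa Top Sites: looks up the overall site rankings

In most cases, calling the UrlInfo action is enough to get a site's traffic data. That said, the interface is not as easy to use as I had imagined, and the data it exposes is limited.

The original post illustrates the UrlInfo API with a figure at this point.

Official SDKs are available for several languages, but I find Node.js the most convenient. The Alexa query service I built is at http://fens.me/alexa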

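2.2 Creating AWS API keys

Before using the AWS API we need to create keys, an access mechanism somewhat similar to OAuth2.

1. Register an AWS account (left to the reader).
2. Open the AWS account management console.

3. In the console, choose "Security Credentials".

4. Create an access key (an access key ID and a secret access key).

We will need the access key ID and secret access key shortly, when writing the program.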
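Rather than pasting the two values into source code, one option is to read them from environment variables. The sketch below assumes the variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (the names AWS tools conventionally use; nothing in this article requires them), and loadCredentials is a hypothetical helper, not part of any library.

```javascript
// Hypothetical helper: reads the access key ID and secret access key from an
// environment-like object (process.env by default) and fails early if either
// one is missing.
function loadCredentials(env) {
  env = env || process.env;
  var key = env.AWS_ACCESS_KEY_ID;         // assumed variable name
  var secret = env.AWS_SECRET_ACCESS_KEY;  // assumed variable name
  if (!key || !secret) {
    throw new Error('Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY first');
  }
  return { key: key, secret: secret };
}

// Demo with a stand-in environment; the values are placeholders, not real keys.
var creds = loadCredentials({
  AWS_ACCESS_KEY_ID: 'xxxxxxxxxxxxxxx',
  AWS_SECRET_ACCESS_KEY: 'yyyyyyyyyyyyyyy'
});
console.log(Object.keys(creds)); // [ 'key', 'secret' ]
```

The returned object uses the field names key and secret, so it could be passed to a client that expects an options object of that shape.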

2.3 Developing the Alexa service with Node

Next, let's build an Alexa project with Node.

My system environment

Win10 64bit
Node v0.12.3
NPM 2.9.1

Create the project


~ D:\workspace\nodejs>mkdir nodejs-alexa && cd nodejs-alexa

Create the Node project configuration file: package.json


~ vi package.json

{
  "name": "alexa-demo",
  "version": "0.0.1",
  "description": "alexa web demo",
  "license": "MIT",
  "dependencies": {
    "awis": "0.0.8"
  }
}

Install the awis package


~ D:\workspace\nodejs\nodejs-alexa>npm install
npm WARN package.json alexa-demo@0.0.1 No repository field.
npm WARN package.json alexa-demo@0.0.1 No README data
alexarank@0.1.1 node_modules\alexarank
├── xml2js@0.4.13 (sax@1.1.4, xmlbuilder@3.1.0)
└── request@2.30.0 (forever-agent@0.5.2, aws-sign2@0.5.0, qs@0.6.6, tunnel-agent@0.3.0, oauth-sign@0.3.0, json-stringify-safe@5.0.1, mime@1.2.11, node-uuid@1.4.3, tough-cookie@0.9.15, form-data@0.1.4, hawk@1.0.0, http-signature@0.10.1)

awis@0.0.8 node_modules\awis
├── xml2js@0.4.13 (sax@1.1.4, xmlbuilder@3.1.0)
├── lodash@3.10.1
└── request@2.65.0 (aws-sign2@0.6.0, forever-agent@0.6.1, caseless@0.11.0, stringstream@0.0.4, oauth-sign@0.8.0, tunnel-agent@0.4.1, isstream@0.1.2, json-stringify-safe@5.0.1, extend@3.0.0, node-uuid@1.4.3, qs@5.2.0, tough-cookie@2.2.0, combined-stream@1.0.5, mime-types@2.1.7, form-data@1.0.0-rc3, http-signature@0.11.0, hawk@3.1.0, bl@1.0.0, har-validator@2.0.2)

Create a file named alexa.js that calls the AWS Alexa site ranking API.


~ vi alexa.js

// AWS credentials
var key = 'xxxxxxxxxxxxxxx';
var secret = 'xxxxxxxxxxxxxxx';

// Create the awis client
var awis = require('awis');
var client = awis({
  key: key,
  secret: secret
});

// Call the UrlInfo API
console.log("=============UrlInfo=================");
client({
  'Action': 'UrlInfo',                         // the UrlInfo API
  'Url': 'fens.me',                            // site to look up
  'ResponseGroup': 'TrafficData,ContentData'   // response groups to return
}, function (err, data) {
  if (err) return console.log(err);            // stop here on error
  console.log(data);
});

Run the program with node alexa.js


~ D:\workspace\nodejs\nodejs-alexa>node alexa.js
=============UrlInfo=================
{ contentData:
   { dataUrl: 'fens.me',
     siteData:
      { title: '粉丝日志',
        description: '跨界的IT博客|Hadoop家族, R, RHadoop, Nodejs, AngularJS, NoSQL, IT金融' },
     speed: { medianLoadTime: '982', percentile: '70' },
     adultContent: '',
     language: '',
     linksInCount: '198',
     keywords: '',
     ownedDomains: '' },
  trafficData:
   { dataUrl: 'fens.me',
     rank: '122616',
     usageStatistics: { usageStatistic: [Object] },
     contributingSubdomains: { contributingSubdomain: [Object] } } }

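With just a few lines of code we have the Alexa ranking information. In the console output the nested values are shown only as [Object] rather than expanded, so I built a service that returns the complete response over HTTP: http://api.fens.me/alexa/fens.me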
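The [Object] placeholders appear because console.log only expands a couple of nesting levels by default. A minimal sketch of printing the whole structure locally uses Node's built-in util.inspect with depth set to null. The sample object is a hand-made stand-in that borrows a few values from this article's output; it is not a real API response.

```javascript
var util = require('util');

// Returns a string with every nesting level expanded.
function dump(obj) {
  return util.inspect(obj, { depth: null });
}

// Hand-made stand-in shaped like part of the UrlInfo result.
var sample = {
  trafficData: {
    dataUrl: 'fens.me',
    rank: '122616',
    usageStatistics: {
      usageStatistic: [
        { timeRange: { months: '3' }, rank: { value: '122616', delta: '+28849' } }
      ]
    }
  }
};

console.log(dump(sample));
```

JSON.stringify(obj, null, 2) is an alternative when the result should be valid JSON rather than a debugging dump.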

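Looking at the source of the awis package, we find that the AWS Alexa service actually returns XML and that awis converts it to JSON for us. To see the raw response, modify the parse() function in the package's index.js file.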


function parse(xml, req, cb) {
  console.log(xml); // print the raw XML

  ....
}

Run the program


D:\workspace\nodejs\nodejs-alexa>node alexa.js
=============UrlInfo=================
<?xml version="1.0"?>
<aws:UrlInfoResponse xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:Response xmlns:aws="http://awis.amazonaws.com/doc/2005-07-11"><aws:OperationRequest><aws:RequestId>1e7d8406-4b62-3460-27fb-325fc3dc3e85</aws:RequestId></aws:OperationRequest><aws:UrlInfoResult><aws:Alexa>

  <aws:ContentData>
    <aws:DataUrl type="canonical">fens.me</aws:DataUrl>
    <aws:SiteData>
      <aws:Title>粉丝日志</aws:Title>
      <aws:Description>跨界的IT博客|Hadoop家族, R, RHadoop, Nodejs, AngularJS, NoSQL, IT金融</aws:Description>
    </aws:SiteData>
    <aws:Speed>
      <aws:MedianLoadTime>982</aws:MedianLoadTime>
      <aws:Percentile>70</aws:Percentile>
    </aws:Speed>
    <aws:AdultContent/>
    <aws:Language/>
    <aws:LinksInCount>198</aws:LinksInCount>
    <aws:Keywords/>
    <aws:OwnedDomains/>
  </aws:ContentData>
  <aws:TrafficData>
    <aws:DataUrl type="canonical">fens.me</aws:DataUrl>
    <aws:Rank>122616</aws:Rank>
    <aws:UsageStatistics>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Months>3</aws:Months>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>122616</aws:Value>
          <aws:Delta>+28849</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>110056</aws:Value>
            <aws:Delta>+25785</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>12.5</aws:Value>
            <aws:Delta>-24.68%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.27</aws:Value>
            <aws:Delta>-24.84%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>194189</aws:Value>
            <aws:Delta>43945</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>1.9</aws:Value>
            <aws:Delta>0%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Months>1</aws:Months>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>102621</aws:Value>
          <aws:Delta>-30257</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>95663</aws:Value>
            <aws:Delta>-20326</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>15</aws:Value>
            <aws:Delta>+20%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.37</aws:Value>
            <aws:Delta>+60%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>153976</aws:Value>
            <aws:Delta>-69981</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>2.2</aws:Value>
            <aws:Delta>+30%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Days>7</aws:Days>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>114709</aws:Value>
          <aws:Delta>+32390</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>103552</aws:Value>
            <aws:Delta>+27312</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>14</aws:Value>
            <aws:Delta>-28.59%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.3</aws:Value>
            <aws:Delta>-37.28%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>188124</aws:Value>
            <aws:Delta>58655</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>2.0</aws:Value>
            <aws:Delta>-12.11%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
      <aws:UsageStatistic>
        <aws:TimeRange>
          <aws:Days>1</aws:Days>
        </aws:TimeRange>
        <aws:Rank>
          <aws:Value>74860</aws:Value>
          <aws:Delta>-93163</aws:Delta>
        </aws:Rank>
        <aws:Reach>
          <aws:Rank>
            <aws:Value>70563</aws:Value>
            <aws:Delta>-54001</aws:Delta>
          </aws:Rank>
          <aws:PerMillion>
            <aws:Value>20</aws:Value>
            <aws:Delta>+60%</aws:Delta>
          </aws:PerMillion>
        </aws:Reach>
        <aws:PageViews>
          <aws:PerMillion>
            <aws:Value>0.6</aws:Value>
            <aws:Delta>+300%</aws:Delta>
          </aws:PerMillion>
          <aws:Rank>
            <aws:Value>111541</aws:Value>
            <aws:Delta>-210757</aws:Delta>
          </aws:Rank>
          <aws:PerUser>
            <aws:Value>2</aws:Value>
            <aws:Delta>+100%</aws:Delta>
          </aws:PerUser>
        </aws:PageViews>
      </aws:UsageStatistic>
    </aws:UsageStatistics>
    <aws:ContributingSubdomains>
      <aws:ContributingSubdomain>
        <aws:DataUrl>blog.fens.me</aws:DataUrl>
        <aws:TimeRange>
          <aws:Months>1</aws:Months>
        </aws:TimeRange>
        <aws:Reach>
          <aws:Percentage>99.19%</aws:Percentage>
        </aws:Reach>
        <aws:PageViews>
          <aws:Percentage>99.64%</aws:Percentage>
          <aws:PerUser>2.2</aws:PerUser>
        </aws:PageViews>
      </aws:ContributingSubdomain>
      <aws:ContributingSubdomain>
        <aws:DataUrl>OTHER</aws:DataUrl>
        <aws:TimeRange>
          <aws:Months>1</aws:Months>
        </aws:TimeRange>
        <aws:Reach>
          <aws:Percentage>0</aws:Percentage>
        </aws:Reach>
        <aws:PageViews>
          <aws:Percentage>0.36%</aws:Percentage>
          <aws:PerUser>0</aws:PerUser>
        </aws:PageViews>
      </aws:ContributingSubdomain>
    </aws:ContributingSubdomains>
  </aws:TrafficData>
</aws:Alexa></aws:UrlInfoResult><aws:ResponseStatus xmlns:aws="http://alexa.amazonaws.com/doc/2005-10-05/"><aws:StatusCode>Success</aws:StatusCode></aws:ResponseStatus></aws:Response></aws:UrlInfoResponse>
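Once the XML has been converted, the per-period ranks can be read straight from the parsed object. The sketch below assumes awis maps each aws:UsageStatistic element to an object with camelCased keys (timeRange.months or timeRange.days, rank.value), in the same way aws:MedianLoadTime became medianLoadTime in the earlier output. That shape is an assumption, and summarizeRanks is a hypothetical helper. The sample values are copied from the XML above.

```javascript
// Hypothetical helper: turns the usageStatistic array into
// [{ period: '3m', rank: 122616 }, ...].
function summarizeRanks(usageStatistic) {
  return usageStatistic.map(function (stat) {
    var range = stat.timeRange;
    // A period is given either in months or in days.
    var period = range.months ? range.months + 'm' : range.days + 'd';
    return { period: period, rank: parseInt(stat.rank.value, 10) };
  });
}

// Assumed parsed shape, filled with the values from the XML response.
var stats = [
  { timeRange: { months: '3' }, rank: { value: '122616', delta: '+28849' } },
  { timeRange: { months: '1' }, rank: { value: '102621', delta: '-30257' } },
  { timeRange: { days: '7' }, rank: { value: '114709', delta: '+32390' } },
  { timeRange: { days: '1' }, rank: { value: '74860', delta: '-93163' } }
];

summarizeRanks(stats).forEach(function (row) {
  console.log(row.period + ': ' + row.rank);
});
```

Converting the rank with parseInt matters because the response carries every value as a string.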

Besides UrlInfo, several other actions are available.

TrafficHistory API


console.log("=============TrafficHistory=================");
client({
  'Action': 'TrafficHistory',
  'Url': 'fens.me',
  'ResponseGroup': 'History'
}, function (err, res) {
    if (err) return console.log(err);  // stop here: res is undefined on error
    console.log(res.trafficHistory);
    console.log(res.trafficHistory.range);
    console.log(res.trafficHistory.site);
    console.log(res.trafficHistory.start);
    console.log(res.trafficHistory.historicalData);
    console.log(res.trafficHistory.historicalData.data);
    console.log(res.trafficHistory.historicalData.data.length);
    res.trafficHistory.historicalData.data.forEach(function (item) {
      console.log(item.date);
      console.log(item.pageViews);
      console.log(item.rank);
      console.log(item.reach);
    });
});

Run the program


~ D:\workspace\nodejs\nodejs-alexa>node alexa.js
=============TrafficHistory=================
{ range: '31',
  site: 'fens.me',
  start: '2015-09-23',
  historicalData:
   { data:
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object] ] } }

// remaining output omitted
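As one example of using the history, the sketch below finds the day with the best (numerically lowest) rank. It assumes each entry in historicalData.data carries date and rank as strings, as the loop above suggests. bestRankDay is a hypothetical helper, and the three sample entries are invented for illustration; they are not real Alexa data.

```javascript
// Hypothetical helper: returns the entry whose rank is numerically lowest.
// Expects a non-empty array.
function bestRankDay(data) {
  return data.reduce(function (best, item) {
    return parseInt(item.rank, 10) < parseInt(best.rank, 10) ? item : best;
  });
}

// Invented sample entries; only the shape matters.
var sampleHistory = [
  { date: '2015-09-23', rank: '130000' },
  { date: '2015-09-24', rank: '98000' },
  { date: '2015-09-25', rank: '115000' }
];

var best = bestRankDay(sampleHistory);
console.log(best.date + ' rank ' + best.rank);
```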

SitesLinkingIn API


console.log("=============SitesLinkingIn=================");
client({
  'Action': 'SitesLinkingIn',
  'Url': 'fens.me',
  'ResponseGroup': 'SitesLinkingIn'
}, function (err, data) {
  if(err) console.log(err);
  console.log(data);
});

Run the program


~ D:\workspace\nodejs\nodejs-alexa>node alexa.js
=============SitesLinkingIn=================
{ sitesLinkingIn:
   { site:
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object] ] } }

CategoryBrowse API


console.log("=============CategoryBrowse=================");
client({
  'Action': 'CategoryBrowse',
  'Url': 'fens.me',
  'Path': 'Top/china',
  'ResponseGroup': 'LanguageCategories'
}, function (err, data) {
  if(err) console.log(err);
  console.log(data);
});

Run the program


~ D:\workspace\nodejs\nodejs-alexa>node alexa.js
=============CategoryBrowse=================
{ categoryBrowse: { languageCategories: '' } }

Finally, we only need to wrap this program in a web layer to offer it to users as a service; see my site http://fens.me/alexa for reference.
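A minimal sketch of such a web wrapper, using only Node's built-in http module, might look like the following. It is not the code behind the author's site. The /alexa/<domain> path merely mirrors the URL style mentioned earlier, and fakeLookup is a stand-in for the awis call so that the sketch runs without AWS keys.

```javascript
var http = require('http');

// Extracts the domain from a path such as "/alexa/fens.me"; returns null otherwise.
function parseDomain(path) {
  var m = /^\/alexa\/([A-Za-z0-9.-]+)\/?$/.exec(path);
  return m ? m[1] : null;
}

// Builds a request handler around a lookup(domain, callback) function.
function createHandler(lookup) {
  return function (req, res) {
    var domain = parseDomain(req.url);
    if (!domain) {
      res.writeHead(404, { 'Content-Type': 'application/json' });
      return res.end(JSON.stringify({ error: 'use /alexa/<domain>' }));
    }
    lookup(domain, function (err, data) {
      if (err) {
        res.writeHead(502, { 'Content-Type': 'application/json' });
        return res.end(JSON.stringify({ error: String(err) }));
      }
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(data));
    });
  };
}

// Stand-in for the real API call; replace it with a function that queries awis.
function fakeLookup(domain, cb) {
  cb(null, { site: domain, rank: '122616' });
}

// The server is created but not started; call server.listen(3000) to serve requests.
var server = http.createServer(createHandler(fakeLookup));
```

Keeping the lookup function as a parameter means the handler can be exercised with fake request and response objects before any real API call is wired in.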

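The code for this article can be downloaded from GitHub: https://github.com/bsspirit/nodejs-alexa

Alexa ranks, and in effect even prices, every website in the world from a third-party point of view. As good webmasters we should make good use of Alexa to understand both our own site and our competitors' sites; that is how a site stands out and its owner becomes a successful webmaster!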


Posted:

Oct 12, 2013

Tags:

api fs shell Hadoop hdfs java

Comments:

17 Comments

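Calling HDFS Programmatically with Hadoop

The Hadoop family series mainly introduces the products of the Hadoop family. Commonly used projects include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeeper, Avro, Ambari, and Chukwa; newer additions include YARN, Hcatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, Hue, and others.

Since 2011 China has entered the surging era of big data, and the family of software represented by Hadoop has claimed a large share of big data processing. Across open source projects and vendors, nearly all data software is moving toward Hadoop. Hadoop has gone from a niche, elite field to the standard for big data development. On top of the original Hadoop technology, a family of Hadoop products has emerged, innovating under the "big data" banner and driving technical progress.

As developers in the IT industry, we too should keep pace, seize the opportunity, and rise together with Hadoop!

About the author:

Zhang Dan (Conan), programmer: Java, R, PHP, JavaScript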
weibo：@Conan_Z
blog: http://blog.fens.me
email: bsspirit@gmail.com

Please credit the source when reposting:
http://blog.fens.me/hadoop-hdfs-api/

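Preface

HDFS, the Hadoop Distributed File System, is one of the core parts of Hadoop. To run a distributed MapReduce algorithm, the data must be placed on HDFS beforehand, so operating on HDFS is very important. The Hadoop command line provides a complete set of commands that are as convenient to use as Linux commands.

Sometimes, however, we also need to access HDFS directly from a program, and we can do so through the API.

System environment
ls operation
mkdir operation
rmr operation
copyFromLocal operation
cat operation
copyToLocal operation
Creating a new file and writing content to it

1. System environment

Hadoop cluster environment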

Linux Ubuntu 64bit Server 12.04.2 LTS
Java 1.6.0_29
Hadoop 1.1.2

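For how to set up a Hadoop cluster environment, see the article: Hadoop历史版本安装 (Installing Historical Versions of Hadoop)

Development environment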

Win7 64bit
Java 1.6.0_45
Maven 3
Hadoop 1.1.2
Eclipse Juno Service Release 2

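For how to set up a Hadoop development environment on Win7 with Maven, see the article: 用Maven构建Hadoop项目 (Building a Hadoop Project with Maven)

Note: hadoop-core-1.1.2.jar has been recompiled to solve the problem of calling Hadoop remotely from Windows; see the article: Hadoop历史版本安装 (Installing Historical Versions of Hadoop)

Hadoop command line: java FsShell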


~ hadoop fs

Usage: java FsShell
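           [-ls <path>]
           [-lsr <path>]
           [-du <path>]
           [-dus <path>]
           [-count[-q] <path>]
           [-mv <src> <dst>]
           [-cp <src> <dst>]
           [-rm [-skipTrash] <path>]
           [-rmr [-skipTrash] <path>]
           [-expunge]
           [-put <localsrc> ... <dst>]
           [-copyFromLocal <localsrc> ... <dst>]
           [-moveFromLocal <localsrc> ... <dst>]
           [-get [-ignoreCrc] [-crc] <src> <localdst>]
           [-getmerge <src> <localdst> [addnl]]
           [-cat <src>]
           [-text <src>]
           [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
           [-moveToLocal [-crc] <src> <localdst>]
           [-mkdir <path>]
           [-setrep [-R] [-w] <rep> <path/file>]
           [-touchz <path>]
           [-test -[ezd] <path>]
           [-stat [format] <path>]
           [-tail [-f] <file>]
           [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
           [-chown [-R] [OWNER][:[GROUP]] PATH...]
           [-chgrp [-R] GROUP PATH...]
           [-help [cmd]]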

The usage above lists 29 commands; I have implemented only some of them here.

Create a file named HdfsDAO.java to call the HDFS API.


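// Imports used by this class and by the methods shown in the following sections
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapred.JobConf;

public class HdfsDAO {

    // HDFS access address
    private static final String HDFS = "hdfs://192.168.1.210:9000/";

    public HdfsDAO(Configuration conf) {
        this(HDFS, conf);
    }

    public HdfsDAO(String hdfs, Configuration conf) {
        this.hdfsPath = hdfs;
        this.conf = conf;
    }

    // HDFS path
    private String hdfsPath;
    // Hadoop configuration
    private Configuration conf;

    // Entry point
    public static void main(String[] args) throws IOException {
        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(conf);
        hdfs.mkdirs("/tmp/new/two");
        hdfs.ls("/tmp/new");
    }

    // Load the Hadoop configuration files
    public static JobConf config(){
        JobConf conf = new JobConf(HdfsDAO.class);
        conf.setJobName("HdfsDAO");
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");
        return conf;
    }

    // API implementations
    public void cat(String remoteFile) throws IOException {...}
    public void mkdirs(String folder) throws IOException {...}
    
    ...
}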

2. ls operation

Description: list the files in a directory

Corresponding Hadoop command:


~ hadoop fs -ls /
Found 3 items
drwxr-xr-x   - conan         supergroup          0 2013-10-03 05:03 /home
drwxr-xr-x   - Administrator supergroup          0 2013-10-03 13:49 /tmp
drwxr-xr-x   - conan         supergroup          0 2013-10-03 09:11 /user

Java program:


    public void ls(String folder) throws IOException {
        Path path = new Path(folder);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        FileStatus[] list = fs.listStatus(path);
        System.out.println("ls: " + folder);
        System.out.println("==========================================================");
        for (FileStatus f : list) {
            System.out.printf("name: %s, folder: %s, size: %d\n", f.getPath(), f.isDir(), f.getLen());
        }
        System.out.println("==========================================================");
        fs.close();
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(conf);
        hdfs.ls("/");
    }

Console output:

ls: /
==========================================================
name: hdfs://192.168.1.210:9000/home, folder: true, size: 0
name: hdfs://192.168.1.210:9000/tmp, folder: true, size: 0
name: hdfs://192.168.1.210:9000/user, folder: true, size: 0
==========================================================

3. mkdir operation

Description: create a directory

Corresponding Hadoop command:


~ hadoop fs -mkdir /tmp/new/one
~ hadoop fs -ls /tmp/new
Found 1 items
drwxr-xr-x   - conan supergroup          0 2013-10-03 15:35 /tmp/new/one

Java program:


    public void mkdirs(String folder) throws IOException {
        Path path = new Path(folder);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        if (!fs.exists(path)) {
            fs.mkdirs(path);
            System.out.println("Create: " + folder);
        }
        fs.close();
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(conf);
        hdfs.mkdirs("/tmp/new/two");
        hdfs.ls("/tmp/new");
    }

Console output:


Create: /tmp/new/two
ls: /tmp/new
==========================================================
name: hdfs://192.168.1.210:9000/tmp/new/one, folder: true, size: 0
name: hdfs://192.168.1.210:9000/tmp/new/two, folder: true, size: 0
==========================================================

4. rmr operation

Description: delete directories and files

Corresponding Hadoop command:


~ hadoop fs -rmr /tmp/new/one
Deleted hdfs://master:9000/tmp/new/one

~  hadoop fs -ls /tmp/new
Found 1 items
drwxr-xr-x   - Administrator supergroup          0 2013-10-03 15:38 /tmp/new/two

Java program:


    public void rmr(String folder) throws IOException {
        Path path = new Path(folder);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        fs.deleteOnExit(path);
        System.out.println("Delete: " + folder);
        fs.close();
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(conf);
        hdfs.rmr("/tmp/new/two");
        hdfs.ls("/tmp/new");
    }

Console output:


Delete: /tmp/new/two
ls: /tmp/new
==========================================================
==========================================================

5. copyFromLocal operation

Description: copy a file from the local file system to HDFS

Corresponding Hadoop command:


~ hadoop fs -copyFromLocal /home/conan/datafiles/item.csv /tmp/new/

~ hadoop fs -ls /tmp/new/
Found 1 items
-rw-r--r--   1 conan supergroup        210 2013-10-03 16:07 /tmp/new/item.csv

Java program:


    public void copyFile(String local, String remote) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        fs.copyFromLocalFile(new Path(local), new Path(remote));
        System.out.println("copy from: " + local + " to " + remote);
        fs.close();
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(conf);
        hdfs.copyFile("datafile/randomData.csv", "/tmp/new");
        hdfs.ls("/tmp/new");
    }

Console output:


copy from: datafile/randomData.csv to /tmp/new
ls: /tmp/new
==========================================================
name: hdfs://192.168.1.210:9000/tmp/new/item.csv, folder: false, size: 210
name: hdfs://192.168.1.210:9000/tmp/new/randomData.csv, folder: false, size: 36655
==========================================================

6. cat operation

Description: view the contents of a file

Corresponding Hadoop command:


~ hadoop fs -cat /tmp/new/item.csv
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

Java program:


    public void cat(String remoteFile) throws IOException {
        Path path = new Path(remoteFile);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        FSDataInputStream fsdis = null;
        System.out.println("cat: " + remoteFile);
        try {  
            fsdis =fs.open(path);
            IOUtils.copyBytes(fsdis, System.out, 4096, false);  
          } finally {  
            IOUtils.closeStream(fsdis);
            fs.close();
          }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(conf);
        hdfs.cat("/tmp/new/item.csv");
    }

Console output:


cat: /tmp/new/item.csv
1,101,5.0
1,102,3.0
1,103,2.5
2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0
3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0
4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0
5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

7. copyToLocal operation

Description: copy a file from HDFS to the local file system

Corresponding Hadoop command:


~ hadoop fs -copyToLocal /tmp/new/item.csv /home/conan/datafiles/tmp/

~ ls -l /home/conan/datafiles/tmp/
-rw-rw-r-- 1 conan conan 210 Oct  3 16:16 item.csv

Java program:


    public void download(String remote, String local) throws IOException {
        Path path = new Path(remote);
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        fs.copyToLocalFile(path, new Path(local));
        System.out.println("download: from" + remote + " to " + local);
        fs.close();
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(conf);
        hdfs.download("/tmp/new/item.csv", "datafile/download");
        
        File f = new File("datafile/download/item.csv");
        System.out.println(f.getAbsolutePath());
    }

Console output:


2013-10-12 17:17:32 org.apache.hadoop.util.NativeCodeLoader 
警告: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
download: from/tmp/new/item.csv to datafile/download
D:\workspace\java\myMahout\datafile\download\item.csv

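8. Creating a new file and writing content to it

Description: create a new file and write content to it.

touchz: can be used to create a new zero-length file or, applied to an existing empty file, to update its timestamp.
There is no corresponding command for writing content.

Corresponding Hadoop command: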


~ hadoop fs -touchz /tmp/new/empty

~ hadoop fs -ls /tmp/new
Found 3 items
-rw-r--r--   1 conan         supergroup          0 2013-10-03 16:24 /tmp/new/empty
-rw-r--r--   1 conan         supergroup        210 2013-10-03 16:07 /tmp/new/item.csv
-rw-r--r--   3 Administrator supergroup      36655 2013-10-03 16:09 /tmp/new/randomData.csv

~ hadoop fs -cat /tmp/new/empty

Java program:


    public void createFile(String file, String content) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(hdfsPath), conf);
        byte[] buff = content.getBytes();
        FSDataOutputStream os = null;
        try {
            os = fs.create(new Path(file));
            os.write(buff, 0, buff.length);
            System.out.println("Create: " + file);
        } finally {
            if (os != null)
                os.close();
        }
        fs.close();
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = config();
        HdfsDAO hdfs = new HdfsDAO(conf);
        hdfs.createFile("/tmp/new/text", "Hello world!!");
        hdfs.cat("/tmp/new/text");
    }

Console output:


Create: /tmp/new/text
cat: /tmp/new/text
Hello world!!

Complete file: HdfsDAO.java
https://github.com/bsspirit/maven_mahout_template/blob/mahout-0.8/src/main/java/org/conan/mymahout/hdfs/HdfsDAO.java
