Friday, October 26, 2012

Nginx to Hadoop HDFS with Fluentd





        Fluentd is a log collector that treats everything as JSON. It transmits logs as JSON streams, which makes them easy to route, manage, and process.

      Hadoop HDFS is a distributed filesystem that can store any amount of logs and run MapReduce jobs over them for faster log processing.

 We will use the fluent-plugin-webhdfs plugin to send logs over to the HttpFS interface.


1. Install hadoop-httpfs package

        yum install hadoop-httpfs

2. Enable HDFS access for the httpfs proxy user

vi /etc/hadoop/conf/core-site.xml
  <property>  
   <name>hadoop.proxyuser.httpfs.hosts</name>  
   <value>localhost,httpfshost</value>  
  </property>  
  <property>  
   <name>hadoop.proxyuser.httpfs.groups</name>  
   <value>*</value>  
  </property>  

 

3. Restart the hadoop cluster.

4. Start the hadoop-httpfs service

/etc/init.d/hadoop-httpfs start

5. Check whether it is working
 curl -i "http://<namenode>:14000/webhdfs/v1?user.name=httpfs&op=GETHOMEDIRECTORY"  
 HTTP/1.1 200 OK  
 Server: Apache-Coyote/1.1  
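
Before wiring up Fluentd, you can also exercise the write path by hand. A minimal sketch, assuming a reachable HttpFS endpoint (the host name and test file below are placeholders, not from this setup):

```shell
# Placeholder endpoint -- substitute your HttpFS host before running the curls.
HTTPFS="http://namenode.example.com:14000"   # assumed host
HUSER="httpfs"

# WebHDFS-style URLs served by HttpFS: create a test file, then read it back.
CREATE_URL="${HTTPFS}/webhdfs/v1/tmp/fluentd_test.txt?user.name=${HUSER}&op=CREATE&data=true"
OPEN_URL="${HTTPFS}/webhdfs/v1/tmp/fluentd_test.txt?user.name=${HUSER}&op=OPEN"
echo "$CREATE_URL"
echo "$OPEN_URL"

# Uncomment to run against a live cluster:
# curl -X PUT -H "Content-Type: application/octet-stream" -d 'hello hdfs' "$CREATE_URL"
# curl -L "$OPEN_URL"
```

If the create succeeds, the open call should print the file contents back.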



6. Install the Treasure Data td-agent package on the nginx servers and the log-aggregator server

cat > /etc/yum.repos.d/treasuredata.repo
[treasuredata]
name=TreasureData
baseurl=http://packages.treasure-data.com/redhat/$basearch
gpgcheck=0

 yum install td-agent

7. Install fluent-logger and fluent-plugin-webhdfs on the log-aggregator host

gem install fluent-logger --no-ri --no-rdoc
/usr/lib64/fluent/ruby/bin/fluent-gem install fluent-plugin-webhdfs
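
A quick way to confirm the plugin landed in td-agent's bundled Ruby (the gem path is the one used above; adjust if your install puts it elsewhere):

```shell
# Path from the step above; assumed to be td-agent's bundled fluent-gem.
GEM_BIN=/usr/lib64/fluent/ruby/bin/fluent-gem
if [ -x "$GEM_BIN" ]; then
  "$GEM_BIN" list | grep fluent-plugin-webhdfs
else
  echo "fluent-gem not found at $GEM_BIN -- check your td-agent install"
fi
```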

8. Edit the td-agent configuration on the nginx servers

vi /etc/td-agent/td-agent.conf
 # Tail the nginx logs associated with stats.slideshare.net  
 <source>  
  type tail  
  path /var/log/nginx/stats_access.log  
  format apache  
  tag stats.access  
  pos_file /var/log/td-agent/stats_access.pos  
 </source>  
 <match stats.access>  
  type forward  
  <server>  
   host <LOG AGGREGATOR NODE>  
   port 24224  
  </server>  
  retry_limit 5  
  <secondary>  
   type file  
   path /var/log/td-agent/stats_access.log  
  </secondary>  
 </match>  

Edit the Nginx configuration to use an Apache-style log format:
   log_format main '$remote_addr - $remote_user [$time_local] "$request" '  
            '$status $body_bytes_sent "$http_referer" '  
            '"$http_user_agent"';  
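
To sanity-check that lines written by this log_format have the field layout the apache parser expects, you can eyeball one against a rough regex. This is only a shape check, not Fluentd's exact parser, and the sample line is made up:

```shell
# Sample line in the layout produced by the log_format above (fabricated values).
line='127.0.0.1 - - [12/Oct/2012:07:00:01 +0000] "GET /stats HTTP/1.1" 200 512 "-" "curl/7.19"'

# Rough shape check: addr, user, [time], "request", status, bytes, "referer", "agent".
echo "$line" | grep -E '^[0-9.]+ - [^ ]+ \[[^]]+\] "[^"]*" [0-9]{3} [0-9]+ "[^"]*" "[^"]*"$' \
  && echo "matches apache format"
```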



9. Edit the td-agent configuration on the log-aggregator server




 <source>  
  type forward  
  port 24224  
 </source>  
 <match stats.access>  
  type webhdfs  
  host <NAMENODE OR HTTPFS HOST>  
  port 14000  
  path /user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log  
  httpfs true  
  username httpfsuser  
 </match>  
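
The %Y%m%d_%H placeholders in path are strftime specifiers, so the plugin rolls output into a new HDFS file every hour. The resulting file name can be previewed with date (path as in the config above):

```shell
# strftime-style expansion of the path pattern from the config above.
ts=$(date -u +%Y%m%d_%H)
echo "/user/hdfs/stats_logs/stats_access.${ts}.log"
```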


10. Start td-agent on the log-aggregator host

/etc/init.d/td-agent start

 * Ensure that there are no errors in /var/log/td-agent/td-agent.log

11. Start td-agent on the nginx servers

/etc/init.d/td-agent start
/etc/init.d/nginx restart

12. Check whether you can see the logs in HDFS
 sudo -u hdfsuser hdfs dfs -ls /user/hdfs/stats_logs/  
 Found 1 items  
 -rw-r--r--  3 httpfsuser group   17441 2012-10-12 01:10 /user/hdfs/stats_logs/stats_access.20121012_07.log  
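
If the hdfs CLI is not available on the box, the same listing can be fetched through HttpFS with a LISTSTATUS call (the host name is a placeholder, as before):

```shell
HTTPFS="http://namenode.example.com:14000"   # assumed HttpFS host
LIST_URL="${HTTPFS}/webhdfs/v1/user/hdfs/stats_logs?user.name=httpfs&op=LISTSTATUS"
echo "$LIST_URL"

# Uncomment to run against a live cluster; returns a JSON FileStatuses array:
# curl -s "$LIST_URL"
```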


That is all. Log aggregation from Nginx into HDFS is now up and running.
