Friday, October 26, 2012

Nginx => Hadoop HDFS using Fluentd

        Fluentd is a log collector built around JSON: it transmits logs as JSON streams so they can be uniformly parsed and processed downstream.

        Hadoop HDFS is a distributed filesystem that can store any amount of logs and lets you run MapReduce jobs over them for faster log processing.

        We will be using the fluent-plugin-webhdfs plugin to send logs over to the HttpFS interface.

1. Install hadoop-httpfs package

        yum install hadoop-httpfs

2. Enable access to HDFS for httpfs user

vi /etc/hadoop/conf/core-site.xml
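The XML to add was omitted in the original post. The standard way to let the httpfs service user act on behalf of other users is Hadoop's proxyuser properties in core-site.xml; a minimal (permissive) sketch looks like this — tighten hosts/groups for production:

```xml
<!-- Allow the httpfs service user to impersonate other users.
     "*" is permissive; restrict hosts and groups in production. -->
<property>
  <name>hadoop.proxyuser.httpfs.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.httpfs.groups</name>
  <value>*</value>
</property>
```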


3. Restart the hadoop cluster.

4. Start the hadoop-httpfs service

/etc/init.d/hadoop-httpfs start

5. Check whether it is working
 curl -i "http://<namenode>:14000?"  
 HTTP/1.1 200 OK  
 Server: Apache-Coyote/1.1  

6. Install Treasure Data's td-agent on the nginx servers and the log-aggregator server

cat > /etc/yum.repos.d/treasuredata.repo
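The repo file contents were not shown in the original. A minimal definition along these lines should work — the baseurl below is my best guess at Treasure Data's yum repository of the time, so verify it against their current install docs:

```ini
[treasuredata]
name=TreasureData
# Assumed repository URL; check Treasure Data's install docs
baseurl=http://packages.treasuredata.com/redhat/$basearch
gpgcheck=0
```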

 yum install td-agent

7. Install fluent-logger and fluent-plugin-webhdfs on the log-aggregator host

gem install fluent-logger --no-ri --no-rdoc
/usr/lib64/fluent/ruby/bin/fluent-gem install fluent-plugin-webhdfs

8. Edit the td-agent configuration on the nginx servers

vi /etc/td-agent/td-agent.conf
 <source>  
  # Tail the nginx access log  
  type tail  
  path /var/log/nginx/stats_access.log  
  format apache  
  tag stats.access  
  pos_file /var/log/td-agent/stats_access.pos  
 </source>  

 <match stats.access>  
  # Forward events to the log-aggregator host  
  type forward  
  <server>  
   # placeholder: your log-aggregator's hostname  
   host <log-aggregator>  
   port 24224  
  </server>  
  retry_limit 5  
  # Fall back to a local file if forwarding keeps failing  
  <secondary>  
   type file  
   path /var/log/td-agent/stats_access.log  
  </secondary>  
 </match>  

Edit the Nginx configuration to use an Apache-compatible log format:
   log_format main '$remote_addr - $remote_user [$time_local] "$request" '  
            '$status $body_bytes_sent "$http_referer" '  
            '"$http_user_agent" "$http_x_forwarded_for"';  

9. Edit the td-agent configuration on the log-aggregator server

 <source>  
  # Accept forwarded events from the nginx servers  
  type forward  
  port 24224  
 </source>  

 <match stats.access>  
  # Write events to HDFS through the HttpFS gateway  
  type webhdfs  
  # placeholder: your namenode / HttpFS host  
  host <namenode>  
  port 14000  
  path /user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log  
  httpfs true  
  username httpfsuser  
 </match>  
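The %Y%m%d_%H placeholders in the path are strftime patterns, so td-agent rolls over to a new file each hour. You can preview the filename for the current hour with date:

```shell
# Show the file name the aggregator would write for the current hour
date +"stats_access.%Y%m%d_%H.log"
```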

10. Start td-agent on the log-aggregator host

/etc/init.d/td-agent start

 * ensure that there are no errors in /var/log/td-agent/td-agent.log

11. Start td-agent on the nginx servers

/etc/init.d/td-agent start
/etc/init.d/nginx restart

12. Check whether you can see the logs in HDFS
 sudo -u hdfs hdfs dfs -ls /user/hdfs/stats_logs/  
 Found 1 items  
 -rw-r--r--  3 httpfsuser group   17441 2012-10-12 01:10 /user/hdfs/stats_logs/stats_access.20121012_07.log  
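You can also read a file back through the same HttpFS gateway with the WebHDFS REST API's OPEN operation (the hostname is a placeholder, as in step 5):

```shell
# Stream the log file back through HttpFS (WebHDFS OPEN operation)
curl "http://<namenode>:14000/webhdfs/v1/user/hdfs/stats_logs/stats_access.20121012_07.log?op=OPEN&user.name=httpfsuser"
```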

That is all! You now have log aggregation running from Nginx into HDFS.