Friday, October 26, 2012

Nginx to Hadoop HDFS with Fluentd

Nginx => Hadoop HDFS using Fluentd

Fluentd is a log collector that treats every log event as JSON. Because logs travel as JSON streams, they are easy to route and process downstream.

Hadoop HDFS is a distributed filesystem that can store large volumes of logs and lets you run MapReduce jobs over them for faster processing.

We will use fluent-plugin-webhdfs to send logs to HDFS through the HttpFS interface.

1. Install hadoop-httpfs package

        yum install hadoop-httpfs

2. Enable access to HDFS for httpfs user

vi /etc/hadoop/conf/core-site.xml
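The properties to add in this step are not shown above. As a hedged sketch, HttpFS is normally allowed to act on behalf of other users through Hadoop's `hadoop.proxyuser.<user>.hosts` and `hadoop.proxyuser.<user>.groups` properties. The helper name `proxyuser_props`, the service user `httpfs`, and the wide-open `*` values are assumptions for illustration; tighten them for a real cluster.

```shell
#!/bin/sh
# Print proxy-user properties to paste inside <configuration> in core-site.xml.
# Assumption: the HttpFS service runs as the user passed in $1 ("httpfs" here).
# Assumption: "*" (any host, any group) is acceptable; restrict it in production.
proxyuser_props() {
  user=$1
  cat <<EOF
<property>
  <name>hadoop.proxyuser.${user}.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.${user}.groups</name>
  <value>*</value>
</property>
EOF
}

proxyuser_props httpfs
```

The script only prints the fragment; it does not modify core-site.xml.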


3. Restart the hadoop cluster.

4. Start the hadoop-httpfs service

/etc/init.d/hadoop-httpfs start

5. Check whether it is working
 curl -i "http://<namenode>:14000?"  
 HTTP/1.1 200 OK  
 Server: Apache-Coyote/1.1  
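For a more specific check, HttpFS serves the WebHDFS REST API under `/webhdfs/v1`. The sketch below only builds such a URL so it can be handed to `curl -i`. The helper name `httpfs_url`, the host `namenode.example.com`, and the choice of `GETHOMEDIRECTORY` as a cheap smoke test are assumptions, not part of the original setup.

```shell
#!/bin/sh
# Build a WebHDFS REST URL for an HttpFS endpoint listening on port 14000.
# Arguments: host, user name, operation, optional HDFS path (defaults to /).
# All values used in the call below are hypothetical placeholders.
httpfs_url() {
  host=$1; user=$2; op=$3; path=${4:-/}
  printf 'http://%s:14000/webhdfs/v1%s?op=%s&user.name=%s\n' \
    "$host" "$path" "$op" "$user"
}

# Print the URL; run curl -i against it from a machine that can reach HttpFS.
httpfs_url namenode.example.com httpfsuser GETHOMEDIRECTORY
```

A 200 response with a JSON body would suggest that both HttpFS and the proxy-user settings are working.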

6. Install Treasure Data td-agent on the nginx servers and the log-aggregator server

cat > /etc/yum.repos.d/treasuredate.repo

 yum install td-agent

7. Install fluent-logger and fluent-plugin-webhdfs on the log-aggregator host

gem install fluent-logger --no-ri --no-rdoc
/usr/lib64/fluent/ruby/bin/fluent-gem install fluent-plugin-webhdfs

8. Edit td-agent configuration in nginx server

vi /etc/td-agent/td-agent.conf
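 # Tail the nginx access log
 <source>
   type tail
   path /var/log/nginx/stats_access.log
   format apache
   tag stats.access
   pos_file /var/log/td-agent/stats_access.pos
 </source>

 # Forward to the log aggregator; fall back to a local file if forwarding keeps failing
 <match stats.access>
   type forward
   <server>
     host <log-aggregator-host>
     port 24224
   </server>
   retry_limit 5
   <secondary>
     type file
     path /var/log/td-agent/stats_access.log
   </secondary>
 </match>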

Edit the nginx configuration to use the Apache (combined) log format, so that "format apache" above can parse it.
   log_format main '$remote_addr - $remote_user [$time_local] "$request" '
            '$status $body_bytes_sent "$http_referer" '
            '"$http_user_agent"';
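Before starting td-agent it can help to confirm that nginx really writes lines in the combined layout. The sketch below tests a line against a simplified pattern. The helper name `is_combined_line`, the sample line, and the regular expression are assumptions for illustration; the expression is an approximation and is not the one Fluentd uses internally.

```shell
#!/bin/sh
# Return success if the given line looks like an Apache combined log entry:
# host ident user [time] "METHOD path protocol" status bytes "referer" "agent"
# The pattern is a simplified approximation, not Fluentd's own parser.
is_combined_line() {
  printf '%s\n' "$1" | grep -Eq \
    '^[^ ]+ [^ ]+ [^ ]+ \[[^]]+\] "[A-Z]+ [^ ]+ [^"]*" [0-9]{3} [0-9-]+ "[^"]*" "[^"]*"$'
}

# Hypothetical sample line; replace it with a real one, for example
# the output of: tail -n 1 /var/log/nginx/stats_access.log
sample='192.0.2.10 - - [12/Oct/2012:07:10:01 +0000] "GET /stats?id=1 HTTP/1.1" 200 512 "-" "curl/7.19.7"'

if is_combined_line "$sample"; then
  echo "line matches the combined layout"
else
  echo "line does not match; check log_format"
fi
```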

9. Edit td-agent configuration in log aggregator server

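 # Accept events forwarded by the nginx servers
 <source>
   type forward
   port 24224
 </source>

 # Write events to HDFS through HttpFS, one file per hour
 <match stats.access>
   type webhdfs
   host <namenode>
   port 14000
   path /user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log
   httpfs true
   username httpfsuser
 </match>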
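The `%Y%m%d_%H` placeholders in the path are strftime-style, so each hour ends up in its own file. As a rough sketch, the file name for a given moment can be previewed with `date`. The helper name `hdfs_path_for` is hypothetical, the `-d @EPOCH` form requires GNU date, and UTC is used here only to keep the result predictable; the plugin's own time zone handling may differ.

```shell
#!/bin/sh
# Expand the HDFS path pattern for a given Unix timestamp.
# Assumption: GNU date (for -d @EPOCH). UTC is used for determinism only.
hdfs_path_for() {
  date -u -d "@$1" '+/user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log'
}

# Path that events written right now would map to (in UTC).
hdfs_path_for "$(date +%s)"
```

For example, the timestamp 1350025800 (2012-10-12 07:10:00 UTC) maps to /user/hdfs/stats_logs/stats_access.20121012_07.log.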

10. Start td-agent in log aggregator host

/etc/init.d/td-agent start

 * ensure that there are no errors in /var/log/td-agent/td-agent.log
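One way to make this check repeatable is to count suspicious lines in the log. This is a minimal sketch that assumes td-agent log lines carry a bracketed level such as [warn] or [error]; the helper name `count_log_problems` is hypothetical.

```shell
#!/bin/sh
# Count lines in a td-agent log that carry a [warn] or [error] level tag.
# Assumption: the log uses bracketed level names, e.g. "... [error]: message".
count_log_problems() {
  grep -cE '\[(error|warn)\]' "$1"
}

log=/var/log/td-agent/td-agent.log
if [ -r "$log" ]; then
  echo "warnings and errors in $log: $(count_log_problems "$log")"
else
  echo "cannot read $log"
fi
```

A count of 0 right after startup suggests the webhdfs output was loaded without problems.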

11. Start td-agent in nginx servers

/etc/init.d/td-agent start
/etc/init.d/nginx restart

12. Check whether you can see the logs in HDFS
 sudo -u hdfsuser hdfs dfs -ls /user/hdfs/stats_logs/  
 Found 1 items  
 -rw-r--r--  3 httpfsuser group   17441 2012-10-12 01:10 /user/hdfs/stats_logs/stats_access.20121012_07.log  

That is all. You now have log aggregation running from nginx into HDFS.

