Nginx => Hadoop HDFS using Fluentd
Fluentd is a log collector built around JSON: it transmits logs as JSON streams so they can be routed and processed easily.
Hadoop HDFS is a distributed filesystem that can store any amount of logs and serve as input for MapReduce jobs, which speeds up log processing.
We will use the fluent-plugin-webhdfs plugin to send logs to HDFS over the HttpFS interface.
1. Install the hadoop-httpfs package
yum install hadoop-httpfs
2. Enable proxy access to HDFS for the httpfs user
vi /etc/hadoop/conf/core-site.xml
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>localhost,httpfshost</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>*</value>
</property>
3. Restart the Hadoop cluster.
4. Start the hadoop-httpfs service
/etc/init.d/hadoop-httpfs start
5. Check whether it is working
curl -i "http://<namenode>:14000?user.name=httpfs&op=homedir"
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
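The same check can be reproduced from any HTTP client. As an illustration of how the request URLs are put together (the `httpfs_url` helper and the host name below are hypothetical, not part of HttpFS), note that file operations go through the /webhdfs/v1 prefix while the quick homedir check hits the root:

```python
# Sketch of how HttpFS request URLs are assembled; host, port, and user
# name here are placeholders, not values from a real cluster.
from urllib.parse import urlencode

def httpfs_url(host, op, user, port=14000, path=""):
    """Build an HttpFS REST URL for the given operation and user."""
    query = urlencode({"user.name": user, "op": op})
    return f"http://{host}:{port}{path}?{query}"

# Same check as the curl command above:
print(httpfs_url("namenode", "homedir", "httpfs"))
# -> http://namenode:14000?user.name=httpfs&op=homedir

# A WebHDFS-style directory listing goes through /webhdfs/v1:
print(httpfs_url("namenode", "LISTSTATUS", "httpfs", path="/webhdfs/v1/user/httpfs"))
# -> http://namenode:14000/webhdfs/v1/user/httpfs?user.name=httpfs&op=LISTSTATUS
```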
6. Install Treasure Data's td-agent on the nginx servers and the log-aggregator server
cat > /etc/yum.repos.d/treasuredata.repo
[treasuredata]
name=TreasureData
baseurl=http://packages.treasure-data.com/redhat/$basearch
gpgcheck=0
yum install td-agent
7. Install the fluent-logger gem and fluent-plugin-webhdfs on the log-aggregator host
gem install fluent-logger --no-ri --no-rdoc
/usr/lib64/fluent/ruby/bin/fluent-gem install fluent-plugin-webhdfs
8. Edit td-agent configuration in nginx server
vi /etc/td-agent/td-agent.conf
# Tail the nginx logs associated with stats.slideshare.net
<source>
type tail
path /var/log/nginx/stats_access.log
format apache
tag stats.access
pos_file /var/log/td-agent/stats_access.pos
</source>
<match stats.access>
type forward
<server>
host <LOG AGGREGATOR NODE>
port 24224
</server>
retry_limit 5
<secondary>
type file
path /var/log/td-agent/stats_access.log
</secondary>
</match>
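With format apache, each tailed line is parsed into a JSON record before it is forwarded. As a sketch of what that parsing does (a Python translation of roughly the pattern fluentd's built-in apache format applies; treat it as an illustration, not fluentd's exact code, and the sample log line is made up):

```python
import json
import re

# Roughly the regex behind fluentd's "apache" format, with Python-style
# named groups; referer and agent are optional trailing fields.
APACHE_RE = re.compile(
    r'^(?P<host>[^ ]*) [^ ]* (?P<user>[^ ]*) \[(?P<time>[^\]]*)\] '
    r'"(?P<method>\S+)(?: +(?P<path>[^ ]*) +\S*)?" '
    r'(?P<code>[^ ]*) (?P<size>[^ ]*)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?$'
)

line = ('127.0.0.1 - frank [12/Oct/2012:07:01:55 +0000] '
        '"GET /stats HTTP/1.1" 200 1234 '
        '"http://example.com/" "Mozilla/5.0"')

record = APACHE_RE.match(line).groupdict()
print(json.dumps(record))
# record["method"] == "GET", record["path"] == "/stats", record["code"] == "200"
```

This JSON record, tagged stats.access, is what the forward output ships to the aggregator.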
Edit the nginx configuration to use the Apache-style log format:
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent"';
9. Edit td-agent configuration on the log-aggregator server
<source>
type forward
port 24224
</source>
<match stats.access>
type webhdfs
host <NAMENODE OR HTTPFS HOST>
port 14000
path /user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log
httpfs true
username httpfsuser
</match>
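The %Y%m%d_%H placeholders in the path option are strftime specifiers, so events are sliced into one HDFS file per hour. A sketch of the expansion (the plugin also buffers and appends, which is not shown here):

```python
from datetime import datetime

# The path option from the config above; %Y%m%d_%H are strftime specifiers,
# so each hour's events land in their own HDFS file.
PATH_TEMPLATE = "/user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log"

chunk_time = datetime(2012, 10, 12, 7)  # example timestamp
print(chunk_time.strftime(PATH_TEMPLATE))
# -> /user/hdfs/stats_logs/stats_access.20121012_07.log
```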
10. Start td-agent on the log-aggregator host
/etc/init.d/td-agent start
* Ensure that there are no errors in /var/log/td-agent/td-agent.log
11. Start td-agent on the nginx servers
/etc/init.d/td-agent start
/etc/init.d/nginx restart
12. Check whether you can see the logs in HDFS
sudo -u hdfsuser hdfs dfs -ls /user/hdfs/stats_logs/
Found 1 items
-rw-r--r-- 3 httpfsuser group 17441 2012-10-12 01:10 /user/hdfs/stats_logs/stats_access.20121012_07.log
That is all. Log aggregation from nginx into HDFS is now up and running.