Friday, October 26, 2012

Nginx to Hadoop HDFS with Fluentd





        Fluentd is a log collector that treats everything as JSON. It transmits logs as JSON streams, which makes them easy to route, manage, and process.

      Hadoop HDFS is a distributed filesystem that can store any amount of logs and run MapReduce jobs over them for faster log processing.

 We will use the fluent-plugin-webhdfs plugin to send logs over to the HttpFS interface.


1. Install hadoop-httpfs package

        yum install hadoop-httpfs

2. Enable HDFS access for the httpfs proxy user

vi /etc/hadoop/conf/core-site.xml
  <property>  
   <name>hadoop.proxyuser.httpfs.hosts</name>  
   <value>localhost,httpfshost</value>  
  </property>  
  <property>  
   <name>hadoop.proxyuser.httpfs.groups</name>  
   <value>*</value>  
  </property>  

 

3. Restart the hadoop cluster.

4. Start the hadoop-httpfs service

/etc/init.d/hadoop-httpfs start

5. Check whether it is working
 curl -i "http://<namenode>:14000/webhdfs/v1?user.name=httpfs&op=GETHOMEDIRECTORY"  
 HTTP/1.1 200 OK  
 Server: Apache-Coyote/1.1  
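
Before wiring up Fluentd, you can also exercise the write path by hand. A minimal sketch, assuming a reachable HttpFS endpoint (the host name and test file below are placeholders, not from this setup):

```shell
# Placeholder endpoint -- substitute your HttpFS host before running the curls.
HTTPFS="http://namenode.example.com:14000"   # assumed host
HUSER="httpfs"

# WebHDFS-style URLs served by HttpFS: create a test file, then read it back.
CREATE_URL="${HTTPFS}/webhdfs/v1/tmp/fluentd_test.txt?user.name=${HUSER}&op=CREATE&data=true"
OPEN_URL="${HTTPFS}/webhdfs/v1/tmp/fluentd_test.txt?user.name=${HUSER}&op=OPEN"
echo "$CREATE_URL"
echo "$OPEN_URL"

# Uncomment to run against a live cluster:
# curl -X PUT -H "Content-Type: application/octet-stream" -d 'hello hdfs' "$CREATE_URL"
# curl -L "$OPEN_URL"
```

If the create succeeds, the open call should print the file contents back.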



6. Install the Treasure Data td-agent package on the nginx servers and the log-aggregator server

cat > /etc/yum.repos.d/treasuredata.repo
[treasuredata]
name=TreasureData
baseurl=http://packages.treasure-data.com/redhat/$basearch
gpgcheck=0

 yum install td-agent

7. Install fluent-logger and fluent-plugin-webhdfs on the log-aggregator host

gem install fluent-logger --no-ri --no-rdoc
/usr/lib64/fluent/ruby/bin/fluent-gem install fluent-plugin-webhdfs
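
A quick way to confirm the plugin landed in td-agent's bundled Ruby (the gem path is the one used above; adjust if your install puts it elsewhere):

```shell
# Path from the step above; assumed to be td-agent's bundled fluent-gem.
GEM_BIN=/usr/lib64/fluent/ruby/bin/fluent-gem
if [ -x "$GEM_BIN" ]; then
  "$GEM_BIN" list | grep fluent-plugin-webhdfs
else
  echo "fluent-gem not found at $GEM_BIN -- check your td-agent install"
fi
```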

8. Edit the td-agent configuration on the nginx servers

vi /etc/td-agent/td-agent.conf
 # Tail the nginx logs associated with stats.slideshare.net  
 <source>  
  type tail  
  path /var/log/nginx/stats_access.log  
  format apache  
  tag stats.access  
  pos_file /var/log/td-agent/stats_access.pos  
 </source>  
 <match stats.access>  
  type forward  
  <server>  
   host <LOG AGGREGATOR NODE>  
   port 24224  
  </server>  
  retry_limit 5  
  <secondary>  
   type file  
   path /var/log/td-agent/stats_access.log  
  </secondary>  
 </match>  

Edit the Nginx configuration to use an Apache-style log format:
   log_format main '$remote_addr - $remote_user [$time_local] "$request" '  
            '$status $body_bytes_sent "$http_referer" '  
            '"$http_user_agent"';  
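
To sanity-check that lines written by this log_format have the field layout the apache parser expects, you can eyeball one against a rough regex. This is only a shape check, not Fluentd's exact parser, and the sample line is made up:

```shell
# Sample line in the layout produced by the log_format above (fabricated values).
line='127.0.0.1 - - [12/Oct/2012:07:00:01 +0000] "GET /stats HTTP/1.1" 200 512 "-" "curl/7.19"'

# Rough shape check: addr, user, [time], "request", status, bytes, "referer", "agent".
echo "$line" | grep -E '^[0-9.]+ - [^ ]+ \[[^]]+\] "[^"]*" [0-9]{3} [0-9]+ "[^"]*" "[^"]*"$' \
  && echo "matches apache format"
```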



9. Edit the td-agent configuration on the log-aggregator server




 <source>  
  type forward  
  port 24224  
 </source>  
 <match stats.access>  
  type webhdfs  
  host <NAMENODE OR HTTPFS HOST>  
  port 14000  
  path /user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log  
  httpfs true  
  username httpfsuser  
 </match>  
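
The %Y%m%d_%H placeholders in path are strftime specifiers, so the plugin rolls output into a new HDFS file every hour. The resulting file name can be previewed with date (path as in the config above):

```shell
# strftime-style expansion of the path pattern from the config above.
ts=$(date -u +%Y%m%d_%H)
echo "/user/hdfs/stats_logs/stats_access.${ts}.log"
```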


10. Start td-agent on the log-aggregator host

/etc/init.d/td-agent start

 * Ensure that there are no errors in /var/log/td-agent/td-agent.log

11. Start td-agent on the nginx servers

/etc/init.d/td-agent start
/etc/init.d/nginx restart

12. Check whether you can see the logs in HDFS
 sudo -u hdfsuser hdfs dfs -ls /user/hdfs/stats_logs/  
 Found 1 items  
 -rw-r--r--  3 httpfsuser group   17441 2012-10-12 01:10 /user/hdfs/stats_logs/stats_access.20121012_07.log  
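
If the hdfs CLI is not available on the box, the same listing can be fetched through HttpFS with a LISTSTATUS call (the host name is a placeholder, as before):

```shell
HTTPFS="http://namenode.example.com:14000"   # assumed HttpFS host
LIST_URL="${HTTPFS}/webhdfs/v1/user/hdfs/stats_logs?user.name=httpfs&op=LISTSTATUS"
echo "$LIST_URL"

# Uncomment to run against a live cluster; returns a JSON FileStatuses array:
# curl -s "$LIST_URL"
```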


That is all. Log aggregation from Nginx into HDFS is now up and running.
