Friday, October 26, 2012

Nginx to Hadoop HDFS with Fluentd



Nginx => Hadoop HDFS using Fluentd


Fluentd is a log collector built around JSON: it transmits logs as JSON streams so that log processing can be easily managed downstream.

Hadoop HDFS is a distributed filesystem that can store large volumes of logs and lets you run MapReduce jobs over them for faster log processing.

We will be using fluent-plugin-webhdfs to send logs to HDFS over the HttpFS interface.
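
For illustration, once Fluentd tails and parses an access-log line, it carries it as a tagged, timestamped JSON record roughly like the one below (field names follow Fluentd's apache parser; the tag matches the configuration later in this post and the values are made up):

 2012-10-12 07:05:11 +0000 stats.access: {"host":"203.0.113.7","user":"-","method":"GET","path":"/index.html","code":"200","size":"5120","referer":"-","agent":"Mozilla/5.0"}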


1. Install hadoop-httpfs package

        yum install hadoop-httpfs

2. Enable proxy access to HDFS for the httpfs user

vi /etc/hadoop/conf/core-site.xml
  <property>
   <name>hadoop.proxyuser.httpfs.hosts</name>
   <value>localhost,httpfshost</value>
  </property>
  <property>
   <name>hadoop.proxyuser.httpfs.groups</name>
   <value>*</value>
  </property>

 

3. Restart the Hadoop cluster so the new proxy-user settings take effect.
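
On a CDH-style packaged install the restart looks roughly like this (the service names are an assumption; adjust them to your distribution, and restart the datanode service on every DataNode):

 /etc/init.d/hadoop-hdfs-namenode restart
 /etc/init.d/hadoop-hdfs-datanode restart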

4. Start the hadoop-httpfs service

/etc/init.d/hadoop-httpfs start

5. Check whether it is working
 curl -i "http://<namenode>:14000?user.name=httpfs&op=homedir"  
 HTTP/1.1 200 OK  
 Server: Apache-Coyote/1.1  



6. Install Treasure Data's td-agent on the Nginx servers and the log-aggregator server

cat > /etc/yum.repos.d/treasuredata.repo <<'EOF'
[treasuredata]
name=TreasureData
baseurl=http://packages.treasure-data.com/redhat/$basearch
gpgcheck=0
EOF

 yum install td-agent

7. Install fluent-logger and fluent-plugin-webhdfs on the log-aggregator host

gem install fluent-logger --no-ri --no-rdoc
/usr/lib64/fluent/ruby/bin/fluent-gem install fluent-plugin-webhdfs
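
To confirm the plugin is visible to td-agent's embedded Ruby (same gem path as the install command above):

 /usr/lib64/fluent/ruby/bin/fluent-gem list | grep webhdfs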

8. Edit the td-agent configuration on the Nginx servers

vi /etc/td-agent/td-agent.conf
 # Tail the nginx logs associated with stats.slideshare.net  
 <source>  
  type tail  
  path /var/log/nginx/stats_access.log  
  format apache  
  tag stats.access  
  pos_file /var/log/td-agent/stats_access.pos  
 </source>  
 <match stats.access>  
  type forward  
  <server>  
   host <LOG AGGREGATOR NODE>  
   port 24224  
  </server>  
  retry_limit 5  
  <secondary>  
   type file  
   path /var/log/td-agent/stats_access.log  
  </secondary>  
 </match>  

Edit the Nginx configuration to use the Apache (combined) log format so that Fluentd's apache parser can read it:
   log_format main '$remote_addr - $remote_user [$time_local] "$request" '  
            '$status $body_bytes_sent "$http_referer" '  
            '"$http_user_agent"';  



9. Edit the td-agent configuration on the log-aggregator server




 <source>  
  type forward  
  port 24224  
 </source>  
 <match stats.access>  
  type webhdfs  
  host <NAMENODE OR HTTPFS HOST>  
  port 14000  
  path /user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log  
  httpfs true  
  username httpfsuser  
 </match>  
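
Before starting td-agent here, it is worth confirming that this host can reach HttpFS as the configured user; a quick check along these lines (host, path and username taken from the config above) should return a JSON listing:

 curl -i "http://<NAMENODE OR HTTPFS HOST>:14000/webhdfs/v1/user/hdfs?op=LISTSTATUS&user.name=httpfsuser"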


10. Start td-agent on the log-aggregator host

/etc/init.d/td-agent start

 * Ensure that there are no errors in /var/log/td-agent/td-agent.log.

11. Start td-agent on the Nginx servers

/etc/init.d/td-agent start
/etc/init.d/nginx restart

12. Check whether you can see the logs in HDFS
 sudo -u hdfsuser hdfs dfs -ls /user/hdfs/stats_logs/
 Found 1 items
 -rw-r--r--  3 httpfsuser group   17441 2012-10-12 01:10 /user/hdfs/stats_logs/stats_access.20121012_07.log
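
You can also peek at the file contents; each line should carry the timestamp, tag and JSON record written by the plugin (the exact layout depends on the plugin's output settings):

 sudo -u hdfsuser hdfs dfs -cat /user/hdfs/stats_logs/stats_access.20121012_07.log | head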


That is all. You now have log aggregation running from Nginx into HDFS.

Friday, July 27, 2012

Class org.apache.hadoop.thriftfs.NamenodePlugin not found

 While starting the NameNode you might come across this error:


Class org.apache.hadoop.thriftfs.NamenodePlugin not found



 CDH4 does not require the Thrift plugin on the NameNode or DataNodes, so all configuration related to it should be removed from the NameNode and DataNode hdfs-site.xml files.
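
The entries to delete usually look something like this (property names and values taken from the old CDH3/Hue Thrift plugin setup; check them against your own hdfs-site.xml rather than copying verbatim):

  <property>
   <name>dfs.namenode.plugins</name>
   <value>org.apache.hadoop.thriftfs.NamenodePlugin</value>
  </property>
  <property>
   <name>dfs.datanode.plugins</name>
   <value>org.apache.hadoop.thriftfs.DatanodePlugin</value>
  </property>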

Thursday, April 12, 2012

Extract Data from Master oplog and Restore it in another MongoDB server



Extract data from the oplog on a MongoDB master and restore it in another MongoDB server.





Recently I came across a situation where we had to make a lot of modifications on a MongoDB server without putting the production database at risk. We removed the replication between the master and the slave, performed the changes on the slave, and then brought its data back up to date using the wordnik-oss tools.


Unfortunately we did not have a replica set, only a plain master-slave setup. While all the updates were happening on the slave, I needed to keep track of the writes arriving on the master so that I could apply them to the slave later. For this I used the mongo-admin-utils module from wordnik-oss (https://github.com/wordnik/wordnik-oss).


Required software:


  1. Java and Git:
                  yum install java-1.6.0-sun-devel java-1.6.0-sun git


  2. Maven:
        recent versions of wordnik-oss require Maven 3


                  cd /usr/src
                   wget http://apache.techartifact.com/mirror/maven/binaries/apache-maven-3.0.4-bin.tar.gz
                    tar zxf apache-maven-3.0.4-bin.tar.gz


Building wordnik:


  1. Download the source


                            git clone http://github.com/wordnik/wordnik-oss.git wordnik


  2. Compile and build
 
             In my case I only needed mongo-admin-utils, so I packaged only that module.


                         cd wordnik/modules/mongo-admin-utils
                        /usr/src/apache-maven-3.0.4/bin/mvn package



                       Once this is complete you can use mongo-admin-utils on that host.


Get Incremental oplog Backup from mongo master server




                cd wordnik/modules/mongo-admin-utils
               ./bin/run.sh com.wordnik.system.mongodb.IncrementalBackupUtil  -o /root/mongo -h mastermongodb



                                   /root/mongo => output directory where the oplog is stored.
                                   mastermongodb => mongodb master host.


** We can't use this tool against the slave, as a slave does not have an oplog of its own.


Replay the Data from the oplog to the database


I had some problems restoring data from the backup and had to add the following settings for the restore to work without issues.

    ulimit -n 20000



I added the following Java options to run.sh so that it does not fail with Out Of Memory (OOM) errors.

JAVA_CONFIG_OPTIONS="-Xms5g -Xmx10g -XX:NewSize=2g -XX:MaxNewSize=2g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:PermSize=2g -XX:MaxPermSize=2g"





Replay Command:


            ./bin/run.sh com.wordnik.system.mongodb.ReplayUtil -i /root/mongo -h localhost
                localhost => the MongoDB server into which you want the data restored.


If you face any issues you can go ahead and file an issue at https://github.com/wordnik/wordnik-oss . The developer is an awesome person and will help you sort it out.

Custom puppet master hostname error: "hostname was not match with the server certificate"



 When you want to use a custom hostname for the puppet master, agents show the following error:
=============
err: Could not retrieve catalog from remote server: hostname was not match with the server certificate
warning: Not using cache on failed catalog
err: Could not retrieve catalog; skipping run
err: Could not send report: hostname was not match with the server certificate
==============


 In my case I wanted to use the default hostname "puppet". Add the following entry to the puppet master configuration file /etc/puppet/puppet.conf:

 certname = puppet
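
The entry normally lives in the [master] (or [main]) section of puppet.conf; the section name here is an assumption, so match it to your existing file:

 [master]
     certname = puppet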

Stop the puppet master, move the old SSL data aside so the certificate gets regenerated with the new name, and start it again:

 /etc/init.d/puppetmaster stop
 mv /var/lib/puppet /var/lib/puppet-bak
 /etc/init.d/puppetmaster start

 Ensure that the certificate was generated with the name you want:

==================
puppet cert print $(puppet master --configprint certname)|grep Subject

        Subject: CN=puppet
==================


If the CN field still shows the machine's real hostname instead of "puppet", the agents will keep failing with the error above.


Now you can connect the puppet agents using:


puppet agent --test --server puppet








Ensure that you have entries in /etc/hosts for the puppet master.
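
For example (the IP address is a placeholder for your puppet master's actual address):

 192.0.2.10   puppet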