Tuesday, July 23, 2013

Push Thrift Metrics to Ganglia CDH4

 

 Add the following entries to /etc/hbase/conf/hadoop-metrics.properties or $HBASE_HOME/conf/hadoop-metrics.properties:

 

===========

thriftserver.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

thriftserver.period=10

thriftserver.servers=<ganglia_multicast_address>:<ganglia_port>

===========

 

Restart the HBase Thrift server and you should see the Thrift metrics in the Ganglia web interface.
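
A minimal sketch of the restart and a quick sanity check, assuming the CDH4 init script is named hbase-thrift, gmond's tcp_accept_channel is on the default port 8649, and nc (netcat) is installed; adjust for your packaging:

service hbase-thrift restart
# dump gmond's XML tree and look for the newly pushed metrics
nc localhost 8649 | grep -i thrift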

Tuesday, March 12, 2013

No such device or address while trying to determine filesystem size on SSD in Centos 5

[root@02 ~]# mkfs.ext4 /dev/sdb1
mke4fs 1.41.12 (17-May-2010)
mkfs.ext4: No such device or address while trying to determine filesystem size

The usual troubleshooting with df -h, mount, cat /proc/swaps, /etc/mtab and lsof did not yield any results.

A few forum posts suggested that the error is related to leftover low-level RAID metadata, which can be inspected with dmsetup and dmraid.

List device name

[root@02 ~]# dmsetup ls
ddf1_4c5349202020202010000411100010044711471181ddbcc4 (253, 0)

Get some more information about the device

[root@02 ~]# dmsetup info ddf1_4c5349202020202010000411100010044711471181ddbcc4
Name: ddf1_4c5349202020202010000411100010044711471181ddbcc4
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 0
Event number: 0
Major, minor: 253, 0
Number of targets: 1
UUID: DMRAID-ddf1_4c5349202020202010000411100010044711471181ddbcc4

Discover all software raid devices in the system

[root@02 ~]# dmraid -r
/dev/sdb: ddf1, ".ddf1_disks", GROUP, ok, 123045888 sectors, data@ 0

Get the properties of the RAID set

[root@02 ~]# dmraid -s
*** Group superset .ddf1_disks
--> Subset
name : ddf1_4c5349202020202010000411100010044711471181ddbcc4
size : 123045888
stride : 128
type : stripe
status : ok
subsets: 0
devs : 1
spares : 0

Try to deactivate the RAID device

[root@02 ~]# dmraid -an
ERROR: dos: partition address past end of RAID device
The dynamic shared library "libdmraid-events-ddf1.so" could not be loaded:
libdmraid-events-ddf1.so: cannot open shared object file: No such file or directory

Remove the RAID metadata from the disk

[root@02 ~]# dmraid -r -E /dev/sdb
Do you really want to erase "ddf1" ondisk metadata on /dev/sdb ? [y/n] :y
ERROR: ddf1: seeking device "/dev/sdb" to 32779907366912
ERROR: writing metadata to /dev/sdb, offset 64023256576 sectors, size 0 bytes returned 0
ERROR: erasing ondisk metadata on /dev/sdb

Zero out the entire disk so that it is no longer detected as a RAID device and does not cause further problems.

dd if=/dev/zero of=/dev/sdb
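
Zeroing the whole disk works but can take a long time on a large drive. Since DDF metadata normally sits at the very end of the disk, a quicker alternative is sketched below; it assumes /dev/sdb is the affected disk and that blockdev is available:

# remove the stale device-mapper mapping if it is still active
dmsetup remove ddf1_4c5349202020202010000411100010044711471181ddbcc4
# wipe the partition table and anything else at the start of the disk
dd if=/dev/zero of=/dev/sdb bs=1M count=10
# wipe the last 10 MiB, where the DDF anchor metadata usually lives
SIZE_MB=$(( $(blockdev --getsz /dev/sdb) / 2048 ))
dd if=/dev/zero of=/dev/sdb bs=1M seek=$(( SIZE_MB - 10 )) count=10
# repartition and retry the format
fdisk /dev/sdb
mkfs.ext4 /dev/sdb1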

Friday, February 15, 2013

Centos 5 with ganglia and rrdcached

 

 As the number of metrics in your environment grows, you start to see a huge impact on system IO performance. At some point disk utilization stays at 100% and the CPU spends a lot of time waiting on IO, because the write load exceeds the theoretical IO the disks can sustain. You might add fast 15K disks or a RAID10 array, but the disk contention returns as long as you keep expanding the cluster and adding new metrics.

 

 A simple solution is to add an rrdcached layer in the middle. There are a few things to consider, such as updating the rrdtool package and recompiling Ganglia against the new rrdtool.

 

 Here is a step-by-step guide.

 

A. Steps for building and installing rrdtool and ganglia.

 

1. Uninstall the existing rrdtool packages

yum -y remove rrdtool rrdtool-perl

 

2. Download the latest rrdtool RPMs

 

wget http://apt.sw.be/redhat/el5/en/x86_64/dag/RPMS/rrdtool-1.4.7-1.el5.rf.x86_64.rpm

wget http://apt.sw.be/redhat/el5/en/x86_64/dag/RPMS/perl-rrdtool-1.4.7-1.el5.rf.x86_64.rpm

wget http://apt.sw.be/redhat/el5/en/x86_64/dag/RPMS/rrdtool-devel-1.4.7-1.el5.rf.x86_64.rpm

 

3. Install all three in a single go; installing them one by one triggers perl dependency errors.

 

rpm -ivh rrdtool-1.4.7-1.el5.rf.x86_64.rpm rrdtool-devel-1.4.7-1.el5.rf.x86_64.rpm perl-rrdtool-1.4.7-1.el5.rf.x86_64.rpm

 

4. Get the ganglia source rpm 

 wget "http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.4.0/ganglia-3.4.0-1.src.rpm?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fganglia%2Ffiles%2Fganglia%2520monitoring%2520core%2F3.4.0%2F&ts=1360839947&use_mirror=citylan"

 

5. Install the dependencies required for building the rpm.

 

yum -y install libpng-devel libart_lgpl-devel python-devel libconfuse-devel pcre-devel freetype-devel

 

6. Build the rpm

 rpm -ivh ganglia-3.4.0-1.src.rpm

rpmbuild -ba /usr/src/redhat/SPECS/ganglia.spec

 

7. Install the new ganglia versions

 

rpm -ivh /usr/src/redhat/RPMS/x86_64/ganglia-* /usr/src/redhat/RPMS/x86_64/libganglia-3.4.0-1.x86_64.rpm
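
Before moving on, it is worth confirming that the rebuilt gmetad is linked against the new rrdtool. A quick sketch, assuming gmetad was installed to /usr/sbin:

ldd /usr/sbin/gmetad | grep -i rrd
rpm -q rrdtool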

 

B. Configuring rrdcached to work with ganglia.

 

1. gmetad runs as the ganglia user, while both rrdcached and apache need access to the same rrd directory, so add the ganglia user to the apache group.

usermod -a -G apache ganglia

 

2. Correct the permissions on the rrd directory

chown -R ganglia:apache /var/lib/ganglia/rrds/

 

3. Update rrdcached sysconfig startup options.

cat /etc/sysconfig/rrdcached

OPTIONS="rrdcached -p /tmp/rrdcached.pid -s apache -m 664 -l unix:/tmp/rrdcached.sock -s apache -m 777 -P FLUSH,STATS,HELP -l unix:/tmp/rrdcached.limited.sock -b /var/lib/ganglia/rrds -B"

RRDC_USER=ganglia

4. Update the gmetad sysconfig file so that it knows the rrdcached socket address.

cat /etc/sysconfig/gmetad
RRDCACHED_ADDRESS="unix:/tmp/rrdcached.sock"

5. Update the ganglia-web configuration so that apache talks to the rrdcached daemon when fetching rrd data.

grep rrdcached_socket /var/www/html/gweb/conf_default.php
$conf['rrdcached_socket'] = "/tmp/rrdcached.sock";

6. Stop gmetad

service gmetad stop

7. Start rrdcached

service rrdcached start

8. Start gmetad

service gmetad start
 If everything is fine you should see graphs populating in the Ganglia frontend, and at the same time disk IO utilization should drop dramatically :)
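
To confirm that rrdcached is actually answering on its socket, you can send it a STATS command. A minimal sketch, assuming socat is installed; it uses the limited socket configured above, which permits STATS:

echo "STATS" | socat - UNIX-CONNECT:/tmp/rrdcached.limited.sock

A steadily growing UpdatesReceived counter in the output means gmetad is writing through the cache.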

Friday, October 26, 2012

Nginx to hadoop hdfs with fluentd



Nginx => hadoop HDFS using Fluentd


        Fluentd is a "JSON everywhere" log collector: it transmits logs as JSON streams so that log processing can be easily managed downstream.

      Hadoop HDFS is a distributed filesystem that can store any amount of logs and lets you run MapReduce jobs over them for faster log processing.

 We will be using fluent-plugin-webhdfs to send logs to the HttpFS interface.


1. Install hadoop-httpfs package

        yum install hadoop-httpfs

2. Enable proxy access to HDFS for the httpfs user

vi /etc/hadoop/conf/core-site.xml
  <property>  
   <name>hadoop.proxyuser.httpfs.hosts</name>  
   <value>localhost,httpfshost</value>
  </property>  
  <property>  
   <name>hadoop.proxyuser.httpfs.groups</name>  
   <value>*</value>  
  </property>  

 

3. Restart the hadoop cluster.

4. Start the hadoop-httpfs service

/etc/init.d/hadoop-httpfs start

5. Check whether it is working
 curl -i "http://<namenode>:14000?user.name=httpfs&op=homedir"  
 HTTP/1.1 200 OK  
 Server: Apache-Coyote/1.1  
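
If you want a deeper check than the 200 response, you can call the WebHDFS REST API that HttpFS exposes. A sketch; the path and user name are assumptions, adjust them for your cluster:

 curl -i "http://<namenode>:14000/webhdfs/v1/user?op=LISTSTATUS&user.name=httpfs"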



6. Install Treasure Data's td-agent on the nginx servers and on the log-aggregator server

cat > /etc/yum.repos.d/treasuredata.repo <<'EOF'
[treasuredata]
name=TreasureData
baseurl=http://packages.treasure-data.com/redhat/$basearch
gpgcheck=0
EOF

 yum install td-agent

7. Install the fluent-logger gem and fluent-plugin-webhdfs on the log-aggregator host

gem install fluent-logger --no-ri --no-rdoc
/usr/lib64/fluent/ruby/bin/fluent-gem install fluent-plugin-webhdfs
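
A quick sketch to confirm the plugin landed in td-agent's embedded Ruby rather than in the system Ruby:

/usr/lib64/fluent/ruby/bin/fluent-gem list | grep webhdfs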

8. Edit td-agent configuration in nginx server

vi /etc/td-agent/td-agent.conf
 # Tail the nginx logs associated with stats.slideshare.net  
 <source>  
  type tail  
  path /var/log/nginx/stats_access.log  
  format apache  
  tag stats.access  
  pos_file /var/log/td-agent/stats_access.pos  
 </source>  
 <match stats.access>  
  type forward  
  <server>  
   host <LOG AGGREGATOR NODE>  
   port 24224  
  </server>  
  retry_limit 5  
  <secondary>  
   type file  
   path /var/log/td-agent/stats_access.log  
  </secondary>  
 </match>  

Edit the nginx configuration to use an apache-compatible log format:
   log_format main '$remote_addr - $remote_user [$time_local] "$request" '  
            '$status $body_bytes_sent "$http_referer" '  
            '"$http_user_agent"';  



9. Edit td-agent configuration in log aggregator server




 <source>  
  type forward  
  port 24224  
 </source>  
 <match stats.access>  
  type webhdfs  
  host <NAMENODE OR HTTPFS HOST>  
  port 14000  
  path /user/hdfs/stats_logs/stats_access.%Y%m%d_%H.log  
  httpfs true  
  username httpfsuser  
 </match>  


10. Start td-agent in log aggregator host

/etc/init.d/td-agent start

 * ensure that there are no errors in /var/log/td-agent/td-agent.log

11. Start td-agent in nginx servers

/etc/init.d/td-agent start
/etc/init.d/nginx restart

12. Check whether you can see the logs in HDFS
 sudo -u hdfsuser hdfs dfs -ls /user/hdfs/stats_logs/  
 Found 1 items  
 -rw-r--r--  3 httpfsuser group   17441 2012-10-12 01:10 /user/hdfs/stats_logs/stats_access.20121012_07.log
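
To confirm records are actually flowing, you can peek at the file contents. A sketch using the same file shown above:

 sudo -u hdfsuser hdfs dfs -cat /user/hdfs/stats_logs/stats_access.20121012_07.log | head -5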


That is all. Now you have log aggregation happening.

Friday, July 27, 2012

Class org.apache.hadoop.thriftfs.NamenodePlugin not found

 While starting the NameNode you might come across this error.


Class org.apache.hadoop.thriftfs.NamenodePlugin not found



 In CDH4 the Thrift interface no longer requires a plugin on the NameNode or DataNodes, so all configuration related to it should be removed from hdfs-site.xml on the NameNode and DataNodes.
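
The offending entries typically look like the following (these property names come from the older CDH3/Hue Thrift plugin setup; remove them if they are present):

  <property>
   <name>dfs.namenode.plugins</name>
   <value>org.apache.hadoop.thriftfs.NamenodePlugin</value>
  </property>
  <property>
   <name>dfs.datanode.plugins</name>
   <value>org.apache.hadoop.thriftfs.DatanodePlugin</value>
  </property>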

Thursday, April 12, 2012

Extract Data from Master oplog and Restore it in another MongoDB server



Extract data from oplog in MongoDB and restore in another MongoDB server.





                            Recently I came across a problem where we had to make a lot of modifications to a MongoDB server, which would have caused issues with the production database. We removed the replication between the master and the slave, performed the operations on the slave, and then brought the slave's data back up to date using the wordnik-oss tools.


                             Unfortunately we did not have a replica set, only a plain master-slave setup. While all the updates were happening on the slave I needed to keep track of the data arriving on the master so that I could add it to the slave later. For this I used the mongo-admin-utils module from wordnik-oss (https://github.com/wordnik/wordnik-oss).


Required software:


  1. Java and Git:
                  yum install java-1.6.0-sun-devel java-1.6.0-sun git


  2. Maven:
        Recent versions of wordnik-oss require Maven 3.


                  cd /usr/src
                   wget http://apache.techartifact.com/mirror/maven/binaries/apache-maven-3.0.4-bin.tar.gz
                    tar zxf apache-maven-3.0.4-bin.tar.gz


Building wordnik:


  1. Download the source


                            git clone http://github.com/wordnik/wordnik-oss.git wordnik


  2. Compile and build
 
             In my case I only needed mongo-admin-utils, and hence I packaged only that module.


                         cd wordnik/modules/mongo-admin-utils
                        /usr/src/apache-maven-3.0.4/bin/mvn package



                       Once this is complete you can use mongo-admin-utils on the host.


Get Incremental oplog Backup from mongo master server




                cd wordnik/modules/mongo-admin-utils
               ./bin/run.sh com.wordnik.system.mongodb.IncrementalBackupUtil  -o /root/mongo -h mastermongodb



                                   /root/mongo => output directory where the oplog is stored.
                                   mastermongodb => mongodb master host.


** We can't run this tool against the slave, since a master-slave slave has no oplog of its own.


Replay the Data from the oplog to the database


          I had some problems restoring data from the backup and had to apply the following settings for the restore to work without issues.

    ulimit -n 20000
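
If you want the raised file-descriptor limit to survive new login sessions, a sketch of the equivalent /etc/security/limits.conf entries, assuming the restore runs as root:

root  soft  nofile  20000
root  hard  nofile  20000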



           Added the following Java options in run.sh so that it does not fail with Out Of Memory (OOM) errors.

JAVA_CONFIG_OPTIONS="-Xms5g -Xmx10g -XX:NewSize=2g -XX:MaxNewSize=2g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:PermSize=2g -XX:MaxPermSize=2g"





Replay Command:


            ./bin/run.sh com.wordnik.system.mongodb.ReplayUtil -i /root/mongo -h localhost
                localhost => the mongodb server to which you want the data restored.


         If you face any issues you can go ahead and file an issue at https://github.com/wordnik/wordnik-oss . The developer is an awesome person and will help you sort it out.

Custom puppet master hostname error hostname was not match with the server certificate



 When you want to use a custom hostname for the puppet master, the agents show the following error.
=============
err: Could not retrieve catalog from remote server: hostname was not match with the server certificate
warning: Not using cache on failed catalog
err: Could not retrieve catalog; skipping run
err: Could not send report: hostname was not match with the server certificate
==============


 In my case I wanted to use the default hostname "puppet". Add the following entry to the puppet master configuration file /etc/puppet/puppet.conf:

 certname = puppet

Stop the puppet master, move the old SSL data aside so that a fresh certificate is generated with the new certname, and start it again:

service puppetmaster stop
mv /var/lib/puppet /var/lib/puppet-bak
service puppetmaster start

 Ensure that the cert is loaded with the name you want.

==================
puppet cert print $(puppet master --configprint certname)|grep Subject

        Subject: CN=puppet
==================


If the CN field still shows the machine's original hostname instead of the certname you set, this will not work and agents will keep failing with the certificate error.


Now the puppet agents can connect to the master using:


puppet agent --test --server puppet








Ensure that every agent has an /etc/hosts entry for the puppet master.
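
For example, on each agent (hypothetical IP, replace with your puppet master's address):

192.0.2.10   puppet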