As the number of metrics in your environment grows, you start to see a huge impact on system IO performance. At some point disk utilization stays at 100% and the CPU spends a lot of its time waiting for IO, because the IO demand exceeds the theoretical IO the disk can support. We could go ahead and add fast 15K disks or a RAID 10 array, but the disk contention comes back as long as we keep expanding the cluster and adding new metrics.
A simple solution is to add an rrdcached layer in the middle: it collects RRD updates in memory and writes them to disk in batches instead of one small write per metric. There are a few things to take care of, such as upgrading the rrdtool package and rebuilding ganglia against the new rrdtool.
Here is a step-by-step guide.
A. Steps for building and installing rrdtool and ganglia.
1. Uninstall the existing rrdtool
yum -y remove rrdtool rrdtool-perl
2. Download the latest rrdtool
wget http://apt.sw.be/redhat/el5/en/x86_64/dag/RPMS/rrdtool-1.4.7-1.el5.rf.x86_64.rpm
wget http://apt.sw.be/redhat/el5/en/x86_64/dag/RPMS/perl-rrdtool-1.4.7-1.el5.rf.x86_64.rpm
wget http://apt.sw.be/redhat/el5/en/x86_64/dag/RPMS/rrdtool-devel-1.4.7-1.el5.rf.x86_64.rpm
3. Install all three in a single go, otherwise rpm will show some weird perl dependency errors.
rpm -ivh rrdtool-1.4.7-1.el5.rf.x86_64.rpm rrdtool-devel-1.4.7-1.el5.rf.x86_64.rpm perl-rrdtool-1.4.7-1.el5.rf.x86_64.rpm
4. Get the ganglia source rpm
wget -O ganglia-3.4.0-1.src.rpm "http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.4.0/ganglia-3.4.0-1.src.rpm?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fganglia%2Ffiles%2Fganglia%2520monitoring%2520core%2F3.4.0%2F&ts=1360839947&use_mirror=citylan"
5. Install the dependencies required for building the rpm.
yum -y install libpng-devel libart_lgpl-devel python-devel libconfuse-devel pcre-devel freetype-devel
6. Build the rpm
rpm -ivh ganglia-3.4.0-1.src.rpm
rpmbuild -ba /usr/src/redhat/SPECS/ganglia.spec
7. Install the new ganglia packages
rpm -ivh /usr/src/redhat/RPMS/x86_64/ganglia-* /usr/src/redhat/RPMS/x86_64/libganglia-3.4.0-1.x86_64.rpm
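Before moving on to the configuration, it can be worth confirming that the rrdtool now installed is recent enough, since rrdcached first shipped with rrdtool 1.4. The following is only a rough sketch: `version_ge` is a hypothetical helper written for this post, not part of rrdtool or ganglia, and the check assumes the package is simply named `rrdtool` in the RPM database.

```shell
# Succeed if dotted version $1 is greater than or equal to dotted version $2.
# Fields are compared numerically, left to right; missing fields count as 0.
# (Hypothetical helper, not part of rrdtool or ganglia.)
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Only query the RPM database when rpm and the rrdtool package are present.
if command -v rpm >/dev/null 2>&1 && rpm -q rrdtool >/dev/null 2>&1; then
    installed=$(rpm -q --qf '%{VERSION}' rrdtool)
    if version_ge "$installed" 1.4.0; then
        echo "rrdtool $installed: new enough for rrdcached"
    else
        echo "rrdtool $installed: too old, rrdcached needs 1.4 or later"
    fi
fi
```

If the check reports an old version, the old package was probably not removed cleanly before the new rpms were installed.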
B. Configuring rrdcached to work with ganglia.
1. gmetad runs as the ganglia user, rrdcached needs write access to the rrd directory, and apache needs access to the same directory. So add the ganglia user to the apache group.
usermod -a -G apache ganglia
2. Correct the permissions
chown -R ganglia:apache /var/lib/ganglia/rrds/
3. Update rrdcached sysconfig startup options.
cat /etc/sysconfig/rrdcached
OPTIONS="rrdcached -p /tmp/rrdcached.pid -s apache -m 664 -l unix:/tmp/rrdcached.sock -s apache -m 777 -P FLUSH,STATS,HELP -l unix:/tmp/rrdcached.limited.sock -b /var/lib/ganglia/rrds -B"
RRDC_USER=ganglia
5. Update the gmetad sysconfig file so that gmetad knows where the rrdcached socket is.
cat /etc/sysconfig/gmetad
RRDCACHED_ADDRESS="unix:/tmp/rrdcached.sock"
6. Update the ganglia-web config so that apache talks to the rrdcached daemon to fetch rrd data.
grep rrdcached_socket /var/www/html/gweb/conf_default.php
$conf['rrdcached_socket'] = "/tmp/rrdcached.sock";
7. Stop gmetad
8. Start rrdcached
9. Start gmetad
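The stop/start order above can be sketched as a small script. This is an assumption-laden sketch rather than the exact commands used here: it assumes the init scripts are registered as `gmetad` and `rrdcached` (matching the sysconfig file names) and that `service` is available. `run` and `restart_with_rrdcached` are hypothetical helper names, and `DRY_RUN` is a made-up variable that only prints the commands.

```shell
# Print a command instead of running it when DRY_RUN is set (hypothetical helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop gmetad first, then bring up rrdcached before gmetad so that the
# socket is already there when gmetad starts. Service names are assumed.
restart_with_rrdcached() {
    run service gmetad stop &&
    run service rrdcached start &&
    run service gmetad start
}

# Usage:
#   DRY_RUN=1 restart_with_rrdcached   # preview the commands
#   restart_with_rrdcached             # run them for real, as root
```

The `&&` chain stops at the first failing step, so gmetad is not started against a cache daemon that failed to come up.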
If everything is fine, you should see graphs populating in the ganglia frontend.
At the same time you'll see that disk IO utilization is reduced awesomely :)
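For reference, the long OPTIONS string in /etc/sysconfig/rrdcached is easier to read when it is built up piece by piece. The sketch below produces exactly the same string as the one-liner in the configuration step, with a comment per group of flags. It assumes the init script sources this file as shell, which is the usual convention but is not verified here.

```shell
# Same value as the single-line OPTIONS above, assembled group by group.
OPTIONS="rrdcached -p /tmp/rrdcached.pid"                # PID file
# -s (group) and -m (mode) apply to the UNIX sockets given by the -l that follows.
OPTIONS="$OPTIONS -s apache -m 664"
OPTIONS="$OPTIONS -l unix:/tmp/rrdcached.sock"           # socket used by gmetad and gweb
OPTIONS="$OPTIONS -s apache -m 777"
# -P limits the commands accepted on the socket given by the -l that follows.
OPTIONS="$OPTIONS -P FLUSH,STATS,HELP"
OPTIONS="$OPTIONS -l unix:/tmp/rrdcached.limited.sock"   # restricted socket
# -b sets the base directory; -B only permits writes inside it.
OPTIONS="$OPTIONS -b /var/lib/ganglia/rrds -B"
RRDC_USER=ganglia
```

The order matters: `-s`, `-m` and `-P` affect only the sockets listed after them, which is why they are repeated before each `-l`.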