Chances are, if you have thousands of servers, you are running some sort of hyperscale environment. But is your monitoring hyperscale-friendly?
In the beginning you might well have had 10 servers, all running business-critical applications. You dutifully monitored everything on each server. Well, you dutifully monitored after you had suffered too many issues with no monitoring at all.
Then, over time, each new outage brought a new set of checks, and before you knew it your boxes were monitored to the hilt, with a multitude of ways to set the pager off. Your monitoring strategy continues like this as your server farm grows. Before you know it, you have hundreds, maybe even thousands, of machines with crude monitoring that is prone to false positives.
Now this set-up can be tuned and tuned, but your system will change over time, and ultimately it becomes whack-a-mole with a pager that continually grumbles about something. It’s time to think more intelligently.
With hyperscale, the application is designed to run on scale-out commodity hardware. Classic examples of this are:
Big data – Hadoop, Riak, Cassandra and Redis
Databases – MySQL, Couchbase, SQL Server, MongoDB
When you are at this point, a different stance has to be taken on the monitoring and this is not yet a fully solved problem. However, you should have two things on your side that can ease the situation:
1.) It’s highly likely that most of your hardware is identical. If it isn’t, see what can be done to unify configurations. The fewer configurations you have, the less monitoring you need, the fewer spare parts you have to carry, and so on.
2.) With hyperscaling, your nodes are only running a few applications, and they all run the same applications. This mitigates the erratic nature of hosting many tiny applications, each with its own workload and resource demands.
Monitoring hyperscale environments is quite a challenge and there are many possible approaches. However, it really comes down to three parts that need to be monitored separately:
Capacity monitoring
This is crucial to get right when machine count is high. Budgeting forecasts can change dramatically with a large server count, and adjusting them mid-year will not be possible. However you measure your system capacity, do it regularly and plot the results. Make sure they make sense and reflect reality.
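As a sketch of the “measure regularly and plot” advice, the snippet below fits a straight line through periodic capacity samples and projects when usage will hit the ceiling. The function name, units, and sample figures are illustrative assumptions, not from the original text.

```python
def days_until_full(samples, capacity):
    """Least-squares line through (day, usage) samples, projected
    forward to the day usage crosses total capacity.

    samples  -- list of (day_index, usage) tuples, e.g. weekly readings
    capacity -- total available capacity, same units as usage
    Returns the projected day index, or None if usage is not growing.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    var = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = cov / var
    if slope <= 0:
        return None  # flat or shrinking usage: no exhaustion forecast
    intercept = mean_y - slope * mean_x
    return (capacity - intercept) / slope

# Hypothetical storage usage in TB, measured weekly: +10 TB per week.
usage = [(0, 400), (7, 410), (14, 420), (21, 430)]
print(round(days_until_full(usage, 1000)))  # day 420, about 60 weeks out
```

Even something this crude, run on a schedule and plotted, catches the “budget forecast changed dramatically” problem long before mid-year.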
Service level monitoring
This type of monitoring may well be new to your platform, but it is a crucial change in mindset. It forces you to think holistically about your systems and applications: are they performing at acceptable levels? It is really thinking about what you are serving and how. This should lead to key performance indicators about your system. You can then define services to monitor. It is this sort of information that ends up in dashboards and on wallboards. You do not need to know if a disk has failed in a server, but you do need to know if your mission-critical service is working.
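To make this concrete, here is a minimal sketch of what a service-level check might look like, assuming you already collect per-request latencies and success flags. The KPI thresholds are placeholders you would derive from your own objectives.

```python
def service_healthy(requests, max_error_rate=0.01, max_p95_latency=0.5):
    """requests: list of (latency_seconds, succeeded) tuples from a
    recent window. Returns True when the service meets its KPIs.
    Thresholds here are illustrative, not recommendations."""
    if not requests:
        return False  # no traffic at all is itself a problem
    errors = sum(1 for _, ok in requests if not ok)
    error_rate = errors / len(requests)
    latencies = sorted(l for l, _ in requests)
    # Nearest-rank 95th percentile of latency.
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return error_rate <= max_error_rate and p95 <= max_p95_latency
```

A window of 200 requests with one failure and fast responses passes; a window where the 95th-percentile latency blows the budget pages someone, regardless of how healthy any individual box looks.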
Physical hardware monitoring
With hyperscale you are going to have many components that fail, so the operation of fixing your platform must be well oiled or it will become a noose. Physical hardware monitoring should be kept separate from service level monitoring; it is something that should never hit the service pager. Ideally, if a server breaks beyond the point of being able to perform its duty, it should be pulled from service automatically. It would be very distracting and time-consuming if a human had to take action each time this happened. Typical products that work well for this type of yes/no monitoring are Nagios and Zabbix.
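In Nagios, a yes/no check boils down to a plugin exit code. The sketch below follows the real Nagios exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), but the one-line RAID status format and the `check_raid` helper are invented for illustration.

```python
import sys

# Nagios plugin exit codes (this convention is standard).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_raid(status_line):
    """Map a one-line RAID status (hypothetical format) to a
    Nagios-style (exit_code, message) pair."""
    if status_line is None:
        return UNKNOWN, "RAID UNKNOWN - no status available"
    if "degraded" in status_line.lower():
        return CRITICAL, "RAID CRITICAL - " + status_line
    if "rebuild" in status_line.lower():
        return WARNING, "RAID WARNING - " + status_line
    return OK, "RAID OK - " + status_line

if __name__ == "__main__":
    # In practice the status would come from mdadm or a vendor CLI;
    # here it is passed on the command line for illustration.
    code, message = check_raid(sys.argv[1] if len(sys.argv) > 1 else None)
    print(message)
    sys.exit(code)
```

The point is that a CRITICAL here should trigger automated removal from service and a repair ticket, never the service pager.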
Once you have your service level monitoring and your physical monitoring set up, there is a middle ground of monitoring that you simply do not need any more. For example, why measure the load of a machine? What is that actually telling you?
Conclusion on monitoring hyperscale DCs
Hyperscale monitoring isn’t simple, but it can be done in a way that creates an intelligent working environment. It is important to send only actionable alerts to your on-call technician to prevent loss of focus.