Last October, Nenad Bozic, one of our co-founders and senior consultants, gave a presentation at Data Science Conference 2.0 on the challenges of monitoring distributed systems.
This is the abstract of the presentation:
Back in the day, you had a single machine and could scroll through a single log file to figure out what was going on. In the Big Data world, you need to combine many logs to do the same. Data arrives in huge volumes and at high speed, so picking out the important information and getting rid of the noise becomes a real challenge. There is a need for a centralized monitoring platform that aids the engineers operating the systems and serves the right information at the right time.
This talk will help you understand these challenges and give you an idea of which tools and technology stacks are a good fit for monitoring Big Data systems. The focus will be on open source and free solutions. The problem can be separated into two domains, both of which are covered in this talk: a metrics stack that gathers simple metrics in a central place, and a log stack that aggregates logs from different machines in a central place. We will finish with a combined stack and ideas on how it can be improved even further with alerting and automated failover scenarios.
Here is a link to the video: