Building distributed systems is our métier. One lesson we learned very early is the importance of visibility across all the elements in a system. But the more extended and loosely coupled your systems, the harder it is to achieve the visibility required. Loose coupling promotes availability and resilience but works against oversight and control. This tension echoes the trade-offs described by the CAP theorem. The challenge is very applicable to microservices, as described by Benjamin Wootton in his article "Microservices - Not A Free Lunch!"
It's remarkable how commonly log data simply spools to local files and no one ever thinks about how to access and correlate all that useful information... until the first problem in integration testing or, worse, in production. Too many engineering hours are still spent trawling through logfiles by hand and correlating dodgy timestamps. People should know a lot better.
One approach that worked well for a while was to provide logging services where system components write their state to a central location like a queue or database. Individual transactions could be traced using a unique ID that is passed along the process chain within the message metadata. However, as systems become bigger and more distributed, logging services suffer from some shortcomings:
- Developers need to explicitly put log service calls in their code.
- COTS applications can't participate because they log elsewhere (usually to a file).
- The log database, which is usually relational, becomes a performance bottleneck.
- The log database, again usually relational, has a fixed schema that doesn't support the kind of free-text analysis and discovery required for the "unknown unknowns" typically encountered in distributed troubleshooting.
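The correlation-ID tracing described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the message layout, the `order_service` and `billing_service` components, and the `orders-demo` logger name are all hypothetical, and an in-memory stream stands in for the central log store.

```python
import io
import json
import logging
import uuid


def make_logger(stream):
    """Build a logger that writes one JSON event per line to `stream`."""
    logger = logging.getLogger("orders-demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def log_event(logger, component, correlation_id, message):
    """Explicit log-service call: every event carries the correlation ID."""
    logger.info(json.dumps({
        "component": component,
        "correlation_id": correlation_id,
        "message": message,
    }))


def new_message(payload):
    """Create a message whose metadata carries a fresh unique ID."""
    return {"metadata": {"correlation_id": str(uuid.uuid4())}, "payload": payload}


def order_service(logger, message):
    # Hypothetical first component in the process chain.
    log_event(logger, "order-service", message["metadata"]["correlation_id"], "order accepted")
    return message  # metadata is passed along unchanged


def billing_service(logger, message):
    # Hypothetical second component; it reuses the same ID from the metadata.
    log_event(logger, "billing-service", message["metadata"]["correlation_id"], "invoice created")
    return message


def trace(stream, correlation_id):
    """Return every logged event that belongs to one transaction, in log order."""
    events = [json.loads(line) for line in stream.getvalue().splitlines()]
    return [e for e in events if e["correlation_id"] == correlation_id]
```

Note that each component has to call `log_event` itself, which is exactly the first shortcoming in the list above.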
The logging landscape has changed significantly in the last couple of years, with approaches that address all these issues. Log shipping has become popular and easy, detecting events from many different sources (including logfiles) and forwarding them to logging middleware or directly into logging databases. NoSQL databases provide event storage and processing which is distributed, lightweight and hence fast and scalable. Dashboards and data visualization frameworks have proliferated to provide good choices for visualization, analysis and alerting based on application events.
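To make the log-shipping idea concrete, here is a small sketch of what a shipper does at its core: read raw lines, turn them into structured events, and forward them to a sink. The line format (`timestamp LEVEL source message`) and the `sink` callable are assumptions made for illustration; real shippers support many formats and destinations.

```python
import json
import re

# Assumed line format: "<timestamp> <LEVEL> <source> <free-text message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<source>\S+)\s+(?P<message>.*)$"
)


def parse_line(line):
    """Turn one raw log line into an event dict, or None for a blank line."""
    text = line.strip()
    if not text:
        return None
    match = LINE_PATTERN.match(text)
    if match is None:
        # Keep unrecognised lines as free text rather than dropping them.
        return {"message": text}
    return match.groupdict()


def ship(lines, sink):
    """Forward each parsed event to `sink` as JSON; return the number shipped."""
    shipped = 0
    for line in lines:
        event = parse_line(line)
        if event is None:
            continue
        sink(json.dumps(event))
        shipped += 1
    return shipped
```

Here `sink` could be anything that accepts a string, for example a function that posts to a queue or writes to a document store. Because the events are schemaless JSON, unrecognised lines can still be stored and searched as free text.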
The twelve-factor app approach has taught us the benefits of regarding log entries as "event streams" which can give us insight into the performance of our distributed systems. Indeed, for a brilliant and comprehensive discourse on event logging and its position as a fundamental abstraction for distributed systems, take a look at Jay Kreps' blog post from LinkedIn Engineering. There Jay describes the two main roles that logs play, ordering changes and distributing data, not only as application log files but more fundamentally as a way of distributing and coordinating state across different parts of a distributed system. These parts could be instances of a distributed database, or applications across your enterprise.
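A toy illustration of those two roles, ordering changes and distributing data, might look like the sketch below. It is not taken from the blog post and does not model any particular product; `EventLog` and `Consumer` are hypothetical names, and everything lives in memory.

```python
class EventLog:
    """Append-only log: each event gets the next offset, fixing a total order."""

    def __init__(self):
        self._entries = []

    def append(self, event):
        self._entries.append(event)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset):
        """Return (offset, event) pairs starting at `offset`."""
        return list(enumerate(self._entries))[offset:]


class Consumer:
    """Rebuilds its own copy of the state by replaying the log in order."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # next offset this consumer has not yet applied
        self.state = {}

    def poll(self):
        for offset, event in self.log.read_from(self.offset):
            self.state[event["key"]] = event["value"]
            self.offset = offset + 1
```

Consumers can read at different speeds, yet because they apply the same events in the same order, they converge on the same state once they have caught up.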
But let's turn this capability on its head for a minute. Analytics and insight are an "obligation" for distributed systems in the sense that without them, we lose visibility and control. But analytics and insight also become an "opportunity" for businesses that run on top of these systems. Indeed, a business is fundamentally a distributed system comprising people, processes and applications. The data in motion passing through these systems carries useful, real-time information about the business. Often this information isn't otherwise available or is locked up in hard-to-reach end-systems.
Externalising business information into business events and running those events through analytics and dashboards provides valuable real-time information. This "operational intelligence" capability is commonly used to:
- Analyze business data in real time to support decision management.
- Track process activity across different systems to detect and help resolve business problems.
- Correlate events from multiple sources to derive higher-level business intelligence.
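As a rough sketch of the second and third uses, the snippet below correlates business events from several systems by a shared order ID and flags orders whose process appears to have stalled. The event fields (`order_id`, `type`, `timestamp`, `source`), the step names and the age threshold are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical steps every order is expected to pass through, in order.
EXPECTED_STEPS = ["order_placed", "payment_received", "order_shipped"]


def correlate(events):
    """Group events from all sources by order ID, oldest first."""
    by_order = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        by_order[event["order_id"]].append(event)
    return dict(by_order)


def stalled_orders(events, now, max_age_seconds):
    """Map each stalled order ID to the first step it is still missing."""
    stalled = {}
    for order_id, history in correlate(events).items():
        seen = {event["type"] for event in history}
        missing = [step for step in EXPECTED_STEPS if step not in seen]
        age = now - history[0]["timestamp"]  # seconds since the first event
        if missing and age > max_age_seconds:
            stalled[order_id] = missing[0]
    return stalled
```

Fed from the same event stream that serves operational monitoring, a check like this could drive a dashboard or an alert when orders are paid for but never shipped.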
From operational intelligence it is not a distant step toward using models for predictive analytics and for triggering value-added business processes. The service layer that externalises services to the enterprise also becomes an event bus that reflects the in-flight "state" of the business.
So modern event-based logging frameworks do a great job of handling the monitoring and management requirements for distributed systems. The same infrastructure can also provide better business level visibility for your business processes by treating process interactions as business events.