In a previous blog post, Rashmi showed us how to synchronise logs from CloudHub to an external logging system. This follows an increasing demand to utilise MuleSoft Anypoint CloudHub logs, events, and dashboard statistics as part of a broader monitoring strategy which aims to:
In this post we show how to access and process Anypoint API Gateway logs using AWS Lambda and the ELK stack.
API Gateway in CloudHub offers the ability to pull data using the Logs, Analytics, and Dashboard Statistics APIs. I’ll call these the LAD APIs. An application invoking them needs to:
Neither the Logstash http_poller input plugin nor Elastic’s httpbeat meets these requirements yet.
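As a rough illustration of the first of those requirements, the sketch below authenticates against the Anypoint platform and builds the bearer header the LAD APIs expect. The `/accounts/login` endpoint and the `access_token` response field are assumptions based on the platform’s login flow and may differ for your organisation or authentication method:

```python
import json
import urllib.request

ANYPOINT_BASE = "https://anypoint.mulesoft.com"

def login(username, password):
    """Exchange Anypoint credentials for a bearer token (assumed login flow)."""
    body = json.dumps({"username": username, "password": password}).encode()
    req = urllib.request.Request(
        ANYPOINT_BASE + "/accounts/login",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]

def auth_header(token):
    """Build the Authorization header to send with each LAD API request."""
    return {"Authorization": "Bearer " + token}
```

The token is short-lived, so a collector would typically log in at the start of each invocation rather than caching credentials between runs.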
So that brings us to AWS Lambda—a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you. Lambda instances run briefly (for a period of seconds), complete their tasks, and shut down. We decided to use Lambda as a transport service for pulling API Gateway application logs, events, and dashboard statistics from CloudHub and transmitting them to Elasticsearch. It is important to mention that the Lambda service is specific to the AWS platform. Other cloud platform providers such as Google, IBM, and Microsoft have recently announced their own event-triggered serverless offerings. However, for the sake of this discussion, we will stick with Lambda. The figure below shows the high-level solution with Lambda collectors.
The solution consists of:
| Lambda Collector Type | CloudHub API |
| --- | --- |
| Logs | https://anypoint.mulesoft.com/cloudhub/api/v2/applications/{domain}/instances/{instanceId}/logs?head=true&offset=<pointer_to_last_read_logline>&limit=<number_of_loglines> |
| Events | https://anypoint.mulesoft.com/analytics/1.0/{organisationId}/events?format=json&apiIds=&apiVersionIds=&startDate=<start_date>&endDate=<end_date>&fields=Application.Application Name.Browser.City.Client IP.Continent.Country.Hardware Platform.Message ID.OS Family.OS Major Version.OS Minor Version.OS Version.Postal Code.Request Outcome.Request Size.Resource Path.Response Size.Response Time.Status Code.Timezone.User Agent Name.User Agent Version.Verb.Violated Policy Name |
| Stats | https://anypoint.mulesoft.com/cloudhub/api/v2/applications/{domain}/dashboardStats?startDate=<start_date_iso8601_format>&endDate=<end_date_iso8601_format>&interval=<time_between_samples_in_ms> |
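To make the collector pattern concrete, here is a minimal Python sketch of a Lambda logs collector built around the Logs API from the table above. The Elasticsearch endpoint, the index name, and the event fields (`token`, `domain`, `instanceId`, `offset`) are hypothetical placeholders, not part of any CloudHub contract:

```python
import json
import urllib.request

ANYPOINT = "https://anypoint.mulesoft.com"
ES_URL = "http://elasticsearch.internal:9200"  # assumed Elasticsearch endpoint

def logs_url(domain, instance_id, offset, limit=500):
    """Build the CloudHub v2 logs URL, with offset as the last-read pointer."""
    return (f"{ANYPOINT}/cloudhub/api/v2/applications/{domain}"
            f"/instances/{instance_id}/logs?head=true&offset={offset}&limit={limit}")

def to_bulk(log_lines, index="cloudhub-logs"):
    """Convert log records into an Elasticsearch _bulk request body (NDJSON)."""
    actions = []
    for line in log_lines:
        actions.append(json.dumps({"index": {"_index": index, "_type": "log"}}))
        actions.append(json.dumps(line))
    return "\n".join(actions) + "\n"

def handler(event, context):
    """Lambda entry point: pull new log lines and push them to Elasticsearch."""
    token = event["token"]            # obtained via the platform login step
    offset = event.get("offset", 0)   # pointer to the last log line read
    req = urllib.request.Request(
        logs_url(event["domain"], event["instanceId"], offset),
        headers={"Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(req) as resp:
        lines = json.load(resp)
    if lines:
        bulk = urllib.request.Request(
            ES_URL + "/_bulk",
            data=to_bulk(lines).encode(),
            headers={"Content-Type": "application/x-ndjson"},
        )
        urllib.request.urlopen(bulk)
    return {"offset": offset + len(lines)}
```

The returned offset would be persisted between invocations (for example, in DynamoDB or the scheduling event) so that each run resumes from the last log line read. The Events and Stats collectors follow the same shape with their respective URLs.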
Note: Amazon offers a managed Elasticsearch service. At the time of writing, the managed service comes with Elasticsearch version 1.5.3. We decided to use the latest available version, Elasticsearch 2.3.3, for this demonstration, as it includes fixes for bugs raised since 1.5.3 and offers the ability to customise colours for visualisations and dashboards.
Below, we share some example dashboards we have built from the information captured (using Lambda collectors) from Anypoint CloudHub APIs. For client privacy reasons, we have blanked out some of the information in visualisation legends and titles.
The statistics dashboard screenshot below shows CPU and memory usage information for an application hosted in Anypoint CloudHub.
The figure below shows a custom dashboard for an application hosted in Anypoint CloudHub. Information presented on the dashboard includes:
The figure below shows the events dashboard for all applications hosted in Anypoint CloudHub. Information presented on the dashboard includes:
The cost of building and sustaining such a solution is influenced by the number of Lambda collectors, the ELK stack topology, and the amount of data transferred between API Gateway and Elasticsearch. Factors that can add to the cost include:
We discussed the use of Lambda collectors to retrieve API Gateway application logs, events, and dashboard statistics from Anypoint CloudHub and to publish them to Elasticsearch. We also shared examples of Kibana dashboards that can be produced using the information captured from Anypoint CloudHub. While information captured in the application logs can be reviewed and revised over time, it would be advantageous for the development team to collaborate with business teams and operational teams early in the development cycle, to put together basic guidelines on what should and should not be retained in these logs.