App Mesh visibility

Publish date: Sun, Dec 29, 2019

Increased visibility via App Mesh

[Image: envoy stats]

There are many advantages to using a service mesh. One of the greatest is the increased visibility it can provide. AWS App Mesh leverages Envoy for its data plane, and each Envoy proxy generates local statistics describing the network environment it is embedded in. Envoy originally supported only the TCP and UDP statsD protocol for exporting its statistics; it now exposes a Prometheus endpoint as well. statsD is an incredibly simple but very widely supported transport format.
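To give a sense of just how simple: each statsD metric is a single line of UDP text, name:value|type, and the datadog flavor appends tags after a #. A couple of illustrative lines (the exact metric names Envoy emits will vary):

envoy.cluster.upstream_rq_total:1|c|#cluster_name:greeting
envoy.http.downstream_cx_active:4|g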

One of the advantages of statsD is that the format is easily consumed by the CloudWatch agent. The agent can then forward these stats to CloudWatch, allowing for dashboards, alarms and so on. This approach is demonstrated in the 2019 re:Invent talk CON328, Improving observability of your containers.

This blog post will demonstrate how to enable statsD for your App Mesh data plane. The data will be consumed by a CloudWatch agent, which aggregates it into metrics and forwards them (once per minute by default) to CloudWatch for further processing. App Mesh can be a critical component of any telemetry- or observability-based monitoring system, which ultimately allows for increased application visibility.

[Image: pillars of telemetry]

In order to quickly show this approach, I have created a Cloud Development Kit (CDK) template of a working containerized, microservice-based application running on ECS, along with a CloudWatch dashboard. I will assume you have CDK installed and some familiarity with the tooling. Under the hood, CDK generates CloudFormation, but it is much more powerful and terse: it allows you to use a general-purpose programming language (JavaScript, TypeScript, Python, Java, or C#) to create your CloudFormation templates.

Let CDK do all the work

Clone the demo git repository. This repo has been tested in us-west-2, but should work in any region with only minor changes.

$ git clone https://github.com/nbrandaleone/appmesh-visibility-cdk.git
$ cd appmesh-visibility-cdk
$ npm install

$ npx cdk@1.19.0 deploy --require-approval never

$ # wait about 10 minutes...

The architecture

The application is based upon three microservices (two tasks each). The greeter service must fetch a greeting and a name from the other two services.

The microservices are connected like this:

[Image: application architecture]

Each containerized application has two additional sidecars: one for the Envoy proxy, and one for the CloudWatch agent. A rough schematic of the Task architecture is below.

[Image: 3 sidecars]

From the ECS Console, one can see all three containers working together.

[Image: ECS Console view of the Task]

The CDK template also creates an Application Load Balancer, which exposes the greeter microservice to the world. If you look at the CDK output, you will see the name of the load balancer, so you can quickly test the application from the command line.

[Image: CDK output from the command line]

Testing

$ curl http://Appme-exter-Q7XDLJTVUQED-270012836.us-west-2.elb.amazonaws.com
From ip-10-0-201-189.us-west-2.compute.internal: Greetings (ip-10-0-213-9.us-west-2.compute.internal) Art (ip-10-0-143-116.us-west-2.compute.internal)

$ curl http://Appme-exter-Q7XDLJTVUQED-270012836.us-west-2.elb.amazonaws.com
From ip-10-0-249-232.us-west-2.compute.internal: Hi (ip-10-0-248-152.us-west-2.compute.internal) Courtney (ip-10-0-175-239.us-west-2.compute.internal)

Let’s generate a little traffic to make our statistics more interesting.

$ while true; do curl -s http://Appme-exter-Q7XDLJTVUQED-270012836.us-west-2.elb.amazonaws.com; sleep 0.5; done

After a few minutes, go to your CloudWatch / Metrics console. You will see a new namespace called CWAgent. Once you click on it, you will see over 600 new metrics available in CloudWatch. Thank you, Envoy!
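You can confirm this from the command line as well:

$ aws cloudwatch list-metrics --namespace CWAgent --region us-west-2 | grep -c MetricName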

[Image: envoy stats in CloudWatch]

Configuration and Settings

Several configuration steps must be completed in order to get the telemetry data flowing into CloudWatch. They are:

  1. Configure your Task definition to use a special setting called proxyConfiguration. This proxyConfiguration sets up the network routing between the containers, so that all traffic is routed through the envoy proxy on its way in and out of the Task. For CDK, the code for the proxyConfiguration looks like this:

this.taskDefinition = new ecs.Ec2TaskDefinition(this, 'my-task-definition', {
    taskRole: taskIAMRole,
    networkMode: ecs.NetworkMode.AWS_VPC,
    proxyConfiguration: new ecs.AppMeshProxyConfiguration({
        containerName: 'envoy',
        properties: {
            appPorts: [this.portNumber],
            proxyEgressPort: 15001,
            proxyIngressPort: 15000,
            ignoredUID: 1337,
            egressIgnoredIPs: ['169.254.170.2', '169.254.169.254']
        }
    })
})
  2. Add two sidecar containers into your Task, along with your application container. The first sidecar is Envoy. The second sidecar is the CloudWatch agent, which does not need anything special, so I will omit its config. App Mesh manages the Envoy configuration to provide service mesh capabilities, and exports metrics, logs, and traces to the endpoints specified in the Envoy bootstrap configuration it provides. Notice the environment variables which enable statsD export. We also enable the datadog flavor of statsD, because it allows for richer tagging of metrics, which CloudWatch supports, even though we are obviously not using datadog.
this.envoyContainer = this.taskDefinition.addContainer('envoy', {
    image: ecs.ContainerImage.fromEcrRepository(appMeshRepository, 'v1.12.1.1-prod'),
    essential: true,
    environment: {
        APPMESH_VIRTUAL_NODE_NAME: 'mesh/meshName/virtualNode/myServiceName',
        AWS_REGION: cdk.Stack.of(this).region,
        ENABLE_ENVOY_STATS_TAGS: '1',
        ENABLE_ENVOY_DOG_STATSD: '1',
        ENVOY_LOG_LEVEL: 'debug'
    },
    healthCheck: {
        command: [
            'CMD-SHELL',
            'curl -s http://localhost:9901/server_info | grep state | grep -q LIVE'
        ],
        startPeriod: cdk.Duration.seconds(10),
        interval: cdk.Duration.seconds(5),
        timeout: cdk.Duration.seconds(2),
        retries: 3
    },
    memoryLimitMiB: 128,
    user: '1337',
    logging: new ecs.AwsLogDriver({
        streamPrefix: 'myService-envoy'
    })
})
  3. Configure container dependencies for proper start-up. We do not want our application container to start until Envoy is functioning properly. See the docs for the Task definition commands, or review the CDK statement below.
// Set start-up order of containers
this.applicationContainer.addContainerDependencies(
    {
        container: this.envoyContainer,
        condition: ecs.ContainerDependencyCondition.HEALTHY,
    },
    {
        container: this.cwAgentContainer,
        condition: ecs.ContainerDependencyCondition.START,
    }
)
  4. Configure your second sidecar (i.e. the CloudWatch agent) to accept statsD metrics. There are many ways of configuring the agent, but the easiest is again an environment variable which contains the configuration information. While we hard-code this variable here, it is frequently stored in SSM Parameter Store, where the same config can be leveraged by many agents. The agent grabs most of its relevant information from the EC2 or Task metadata server, making this configuration minimal.
environment: {
    CW_CONFIG_CONTENT: '{ "logs": { "metrics_collected": {"emf": {} }}, "metrics": { "metrics_collected": { "statsd": {}}}}'
}
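For reference, the statsd section of the agent configuration accepts a few optional fields. Spelled out with what I believe are the agent's documented defaults, the same config looks roughly like this:

{
  "metrics": {
    "metrics_collected": {
      "statsd": {
        "service_address": ":8125",
        "metrics_collection_interval": 10,
        "metrics_aggregation_interval": 60
      }
    }
  }
}

The 60-second aggregation interval is what produces the once-per-minute forwarding behavior mentioned earlier.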

DNS

App Mesh relies heavily on DNS or Cloud Map to resolve virtual node endpoints. Fortunately, ECS integrates nicely with Cloud Map, and all our services have private DNS names. Let’s verify this:

$ aws servicediscovery list-services --output table
--------------------------------------------------------------------------------------------------
|                                          ListServices                                          |
+------------------------------------------------------------------------------------------------+
||                                           Services                                           ||
|+------------+---------------------------------------------------------------------------------+|
||  Arn       |  arn:aws:servicediscovery:us-west-2:<account ID>:service/srv-24raqotjvd34dfxj   ||
||  CreateDate|  1577741692.171                                                                 ||
||  Id        |  srv-24raqotjvd34dfxj                                                           ||
||  Name      |  name                                                                           ||
|+------------+---------------------------------------------------------------------------------+|
|||                                          DnsConfig                                         |||
||+--------------------------------------------------+-----------------------------------------+||
|||  RoutingPolicy                                   |  MULTIVALUE                             |||
||+--------------------------------------------------+-----------------------------------------+||
||||                                        DnsRecords                                        ||||
|||+---------------------------------------------------+--------------------------------------+|||
||||  TTL                                              |  60                                  ||||
||||  Type                                             |  A                                   ||||
|||+---------------------------------------------------+--------------------------------------+|||
|||                                   HealthCheckCustomConfig                                  |||
||+-------------------------------------------------------------------------+------------------+||
|||  FailureThreshold                                                       |  2               |||
||+-------------------------------------------------------------------------+------------------+||
||                                           Services                                           ||
|+------------+---------------------------------------------------------------------------------+|
...
||+-------------------------------------------------------------------------+------------------+||

Using CloudWatch

Now that all the metrics are streaming into CloudWatch, we can view the logs and visualize the data. I have built a sample dashboard, which graphs various metrics from Envoy. Envoy generates a lot of statistics, so I focused on only the most important fields, listed below. First, a little clarification of some vocabulary that Envoy uses.

Downstream: A downstream host connects to Envoy, sends requests, and receives responses.

Upstream: An upstream host receives connections and requests from Envoy and returns responses.

Listener: A listener is a named network location (e.g., port, unix domain socket, etc.) that can be connected to by downstream clients. Envoy exposes one or more listeners that downstream hosts connect to.

Cluster: A cluster is a group of logically similar upstream hosts that Envoy connects to. Envoy discovers the members of a cluster via service discovery. It optionally determines the health of cluster members via active health checking.

Counters: Unsigned integers that only increase and never decrease. E.g., total requests.

Gauges: Unsigned integers that both increase and decrease. E.g., currently active requests.

Timers/histograms: Unsigned integers that ultimately will yield summarized percentile values. Envoy does not differentiate between timers, which are typically measured in milliseconds, and raw histograms, which may be any unit. E.g., upstream request time in milliseconds.

Statistic                    Type       Description
downstream_cx_total          Counter    Total downstream connections
downstream_cx_active         Gauge      Total active downstream connections
downstream_cx_http1_active   Gauge      Total active HTTP/1.1 connections
downstream_cx_http2_total    Counter    Total HTTP/2 connections
downstream_rq_2xx            Counter    Total 2xx responses
upstream_cx_*                (family)   Upstream connection statistics
upstream_rq_*                (family)   Upstream request statistics

The Dashboard

If you open up the CloudWatch tab of your AWS Console, you will find a new Dashboard. The name is dynamic, but should start with “cloudwatchdashboardappmesh…”.

This dashboard shows some of the more useful metrics gathered from Envoy, along with information retrieved from your Application Load Balancer. As mentioned, Envoy generates a tremendous number of useful metrics about your application. This dashboard is only an example, and it is quite likely that you would build yours out differently than mine. Here is what Matt Klein from Lyft uses.

[Image: CloudWatch Dashboard]

NOTE: Metric Math is not yet available in CDK. Once it is, these Envoy statistics will become even more powerful. Until then, if you wish to use Metric Math, you can build out dashboards using CloudFormation, the SDKs, or manually via the Console.
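To sketch what that might look like: a dashboard widget that computes a 2xx success rate could use an expression in its metrics array, along these lines. The CWAgent metric names and dimensions here are assumptions; check how the agent actually mapped your DogStatsD tags to dimensions:

"metrics": [
  [ { "expression": "100 * m2 / m1", "label": "2xx rate (%)", "id": "e1" } ],
  [ "CWAgent", "envoy.http.downstream_rq_total", "metric_type", "counter", { "id": "m1", "visible": false } ],
  [ "CWAgent", "envoy.http.downstream_rq_2xx", "metric_type", "counter", { "id": "m2", "visible": false } ]
]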

CloudWatch Logs Insights

CloudWatch Logs Insights is an amazing tool. It allows you to parse through a mountain of logs using a powerful SQL-like syntax. What is also impressive is that it will graph the search terms you are looking for. Also, since Container Insights embeds performance data into logs, you have yet another way to analyze your containers' health. See this helpful blog post to get you going.

We can use Logs Insights to parse through our Envoy logs as well. Every Envoy proxy can log data into CloudWatch Logs. To export only the Envoy access logs (and ignore the other Envoy container logs), you can set ENVOY_LOG_LEVEL to off.

For example, I did a quick review of all my Envoy logs to see if there were any HTTP 503 errors. I only found a handful, but I can imagine how useful this could be when investigating networking issues.
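A query along these lines does the trick; this is a sketch that assumes the HTTP status code appears in the raw access log line:

fields @timestamp, @message
| filter @message like /" 503 /
| stats count(*) as errors by bin(5m)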

[Image: CloudWatch Logs Insights]

Clean up your ECS cluster

Don’t forget to tear down your infrastructure to save money.

$ npx cdk@1.19.0 destroy

Kubernetes and EKS

We are going to deploy a different microservice-based application on an EKS cluster. In terms of gathering statistics from our App Mesh based Envoy proxies, we have more options with Kubernetes.

There are several different methods of forwarding the envoy statistics to CloudWatch or Prometheus.

  1. We can duplicate the sidecar pattern that we demonstrated for ECS (using statsD), and forward to CloudWatch.
  2. We can implement a daemon set, or a single cloudwatch-agent collector for the Kubernetes/EKS cluster (using statsD), and then forward to CloudWatch.
  3. We can use Prometheus, which scrapes metrics from the envoy proxies. We can then optionally configure Prometheus to export the gathered stats into CloudWatch.
  4. Similar to #1, we can add a sidecar, but one that converts statsD metrics into Prometheus metrics. This has the advantage of using the native statsD metrics exported by Envoy, while storing them in and making them visible through Prometheus/Grafana. At one time, Envoy’s Prometheus endpoint (option #3) was missing some histogram data, making this option more metric-rich than option #3. However, that issue has been fixed, so this option is no longer relevant for App Mesh. Some other vendors still use this design.

For this Kubernetes demo, we will set up option #3. This configuration is very lightweight, since a sidecar per pod is not required. It also gives us all the metrics we want while leveraging the most popular Kubernetes monitoring and dashboard solution available.
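Under the hood, the Prometheus chart we install below configures a scrape job that discovers every pod running an Envoy container and scrapes its admin endpoint. A rough sketch of what such a job looks like (names assumed; the eks/appmesh-prometheus chart ships its own version):

- job_name: 'appmesh-envoy'
  metrics_path: /stats/prometheus
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_container_name]
    action: keep
    regex: '^envoy$'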

We will use Helm exclusively for our Kubernetes package installation and management.

It is worth noting that there are many commercial monitoring options as well.

Demo Application

The demo is a very simple, self-contained application consisting of two nginx pods and three traffic-generating pods. The purpose of the application is simply to generate traffic that the Envoy proxies can report upon.

[Image: Prometheus scraping]
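The traffic generators can be as simple as a busybox loop hitting the nginx service. Here is a hypothetical sketch; the chart in ./kubernetes defines the real thing:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-generator
spec:
  replicas: 3
  selector:
    matchLabels:
      app: load-generator
  template:
    metadata:
      labels:
        app: load-generator
    spec:
      containers:
      - name: load-generator
        image: busybox
        # Hit the nginx ClusterIP service twice per second, forever
        command: ["/bin/sh", "-c", "while true; do wget -q -O /dev/null http://nginx; sleep 0.5; done"]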

Create a cluster

Let’s create an EKS cluster to work with. I am using the us-east-2 region this time, in order to spread out my resources.

$ # You can skip this part if you already cloned the repository.
$ git clone https://github.com/nbrandaleone/appmesh-visibility-cdk.git
$ cd appmesh-visibility-cdk

$ eksctl create cluster --name=cluster-appmesh --nodes=2 --node-type=t3.xlarge --appmesh-access --region us-east-2 

Install Helm charts

I recommend using Helm 3.x, since it simplifies the configuration: there is no longer a Tiller component to install on the cluster. We will also add the official EKS repo, so we can easily install the App Mesh components and Prometheus/Grafana using the appropriate charts.

Of course, if you want to add these components manually, see here and here.

$ helm repo add eks https://aws.github.io/eks-charts

Install App Mesh

The next series of commands is documented in the GitHub repo for the EKS Helm charts. There are additional commands and steps available that you might want to use. For example, there are instructions for enabling Jaeger tracing, Datadog tracing and AWS X-Ray. Check it out!

Create the namespace appmesh-system for the App Mesh system components

$ kubectl create ns appmesh-system

$ # Create a namespace for the demo app
$ kubectl create ns appmesh-demo
$ kubectl label namespace appmesh-demo appmesh.k8s.aws/sidecarInjectorWebhook=enabled

Install the App Mesh CRDs and controller

$ kubectl apply -f https://raw.githubusercontent.com/aws/eks-charts/master/stable/appmesh-controller/crds/crds.yaml
$ helm upgrade -i appmesh-controller eks/appmesh-controller --namespace appmesh-system

Install the App Mesh injector

We are going to use the admission controller/injector for this part of the tutorial. The injector will automatically add the appropriate App Mesh init and Envoy sidecar containers to your pods. It will also wire up the proper networking between the containers and add the APPMESH_VIRTUAL_NODE_NAME environment variable to your pod. The injector is a great help in making App Mesh an easy experience for developers working on Kubernetes/EKS. Still, one must add the appropriate annotations and create the various mesh components (i.e. virtual services/routes and virtual nodes) via Kubernetes manifests.

$ helm upgrade -i appmesh-inject eks/appmesh-inject \
--namespace appmesh-system \
--set mesh.create=false \
--set mesh.name=appmesh

Override Sidecar Injector Default Behavior

To override the default behavior of the injector when deploying a pod in a namespace that you’ve enabled the injector for, add any of the following annotations to your pod spec.

appmesh.k8s.aws/mesh: mesh-name – Add when you want to use a different mesh name than the one that you specified when you installed the injector.

appmesh.k8s.aws/ports: “ports” – Specify particular ports when you don’t want all of the container ports defined in a pod spec passed to the sidecars as application ports.

appmesh.k8s.aws/egressIgnoredPorts: ports – Specify a comma separated list of port numbers for outbound traffic that you want ignored. By default all outbound traffic ports will be routed, except port 22 (SSH).

appmesh.k8s.aws/virtualNode: virtual-node-name – Specify your own name if you don’t want to use the default virtual node name passed to the sidecars.

appmesh.k8s.aws/sidecarInjectorWebhook: disabled – Add when you don’t want the injector enabled for a pod.

# Note: the annotations must appear on the pod template so the injector sees them
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        appmesh.k8s.aws/mesh: my-mesh2
        appmesh.k8s.aws/ports: "8079,8080"
        appmesh.k8s.aws/egressIgnoredPorts: "3306"
        appmesh.k8s.aws/virtualNode: my-app
        appmesh.k8s.aws/sidecarInjectorWebhook: disabled

Install Prometheus and Grafana

We are going to install both Prometheus and Grafana, the most common open-source monitoring tools for Kubernetes clusters. The AWS-provided Helm charts have the added benefit of creating default Grafana dashboards, which display various App Mesh statistics automatically.

$ # Install App Mesh Prometheus:
$ helm upgrade -i appmesh-prometheus eks/appmesh-prometheus \
--namespace appmesh-system

$ # Install App Mesh Grafana:
$ helm upgrade -i appmesh-grafana eks/appmesh-grafana \
--namespace appmesh-system 

Install the demo application

$ # You must be in the root of the cloned directory
$ helm install --generate-name -n appmesh-demo ./kubernetes

Verify App Mesh and demo pods

Our Helm chart created five pods in the appmesh-demo namespace. Since this namespace is being watched by the App Mesh injector, all pods will be created with an Envoy container. We should also see the additional components relating to the mesh, i.e. the virtual nodes, virtual services and the mesh itself.

Check that they are all installed properly:

$ kubectl api-resources --api-group=appmesh.k8s.aws
NAME              SHORTNAMES   APIGROUP          NAMESPACED   KIND
meshes                         appmesh.k8s.aws   false        Mesh
virtualnodes                   appmesh.k8s.aws   true         VirtualNode
virtualservices                appmesh.k8s.aws   true         VirtualService

$ kubectl -n appmesh-demo get deploy,po,svc,virtualnode.appmesh.k8s.aws,virtualservice.appmesh.k8s.aws
NAME                                   READY   UP-TO-DATE   AVAILABLE   AGE
deployment.extensions/load-generator   3/3     3            3           22s
deployment.extensions/nginx            2/2     2            2           22s

NAME                                  READY   STATUS    RESTARTS   AGE
pod/load-generator-68c9646f97-bmqsd   2/2     Running   1          22s
pod/load-generator-68c9646f97-q9dcw   2/2     Running   1          22s
pod/load-generator-68c9646f97-vbljf   2/2     Running   1          22s
pod/nginx-65c6c4788-4qt5l             2/2     Running   0          22s
pod/nginx-65c6c4788-vnkj6             2/2     Running   0          22s

NAME            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/nginx   ClusterIP   10.100.140.21   <none>        80/TCP    22s

NAME                                         AGE
virtualnode.appmesh.k8s.aws/load-generator   22s
virtualnode.appmesh.k8s.aws/nginx            22s

NAME                                                                  AGE
virtualservice.appmesh.k8s.aws/nginx.appmesh-demo.svc.cluster.local   22s

If you do not see 2/2 in the application pods’ READY column, you will need to do some troubleshooting. Verify that the demo namespace is labeled, and that the appmesh-system namespace has all the components running as expected.
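A few commands that help when troubleshooting:

$ kubectl get ns appmesh-demo --show-labels
$ kubectl -n appmesh-system get pods
$ kubectl -n appmesh-demo describe pod <pod-name>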

Review Grafana Dashboards

We will connect to the dashboard via the port-forward command. This keeps the dashboard private and simplifies testing.

$ kubectl -n appmesh-system port-forward svc/appmesh-grafana 3000:3000

$ # This can also be done for Prometheus, if you want to see its console
$ # kubectl -n appmesh-system port-forward svc/appmesh-prometheus 9090:9090

Now, open up a browser window to http://127.0.0.1:3000. There will be two dashboards already created for you. The first dashboard is for monitoring the App Mesh control plane that is running on your Kubernetes/EKS cluster.

[Image: Grafana App Mesh control plane]

The second is for the Envoy data plane.

[Image: Grafana App Mesh data plane]

I also like to add a community dashboard. If you import dashboard 6693, you will get a nice Envoy proxy dashboard.

[Image: Envoy dashboard]
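You can also run ad-hoc PromQL queries against the Prometheus console (port-forwarded earlier). For example, the per-cluster upstream request rate, using the metric and label names Envoy exposes on its Prometheus endpoint:

sum(rate(envoy_cluster_upstream_rq_total[1m])) by (envoy_cluster_name)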

Clean up EKS cluster

$ kubectl delete ns appmesh-demo && kubectl delete mesh appmesh
$ kubectl delete namespace appmesh-system; kubectl delete mutatingwebhookconfiguration appmesh-inject
$ kubectl delete clusterrolebindings appmesh-inject; kubectl delete clusterrole appmesh-inject

$ eksctl delete cluster cluster-appmesh --region us-east-2

Summary

Although we have just scratched the surface in terms of Envoy statistics, hopefully you have found this post useful and interesting.

Of course, it is always possible to leverage AWS X-Ray (or the open-source Jaeger) for increased visibility as well. X-Ray provides a visual representation of the traffic flowing between services, which can be invaluable during troubleshooting. This topic has been discussed multiple times in other blog posts and articles.

[Image: X-Ray]

Also, a very easy way to get started with in-depth container and cluster statistics for either ECS or EKS is Container Insights from CloudWatch.

[Image: Container Insights]

Finally, you should definitely check out the new CloudWatch feature, ServiceLens. It brings together numerous monitoring tools (i.e. CloudWatch Logs and X-Ray) in a single location, while focusing on microservices-based applications. There is already a great demo on GitHub.

[Image: CloudWatch ServiceLens]

