Canary Releases

A canary release is described with a Kubernetes custom resource named Canary.

Application bootstrap

Edit the podinfo Helm release to disable the automated image updates and the ClusterIP service (Flagger will generate its own services for the canary analysis):

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: podinfo
  namespace: prod
  annotations:
    fluxcd.io/automated: "false"
spec:
  releaseName: podinfo
  values:
    image:
      repository: stefanprodan/podinfo
      tag: 3.0.0
    service:
      enabled: false
      type: ClusterIP

Apply changes:

git add -A && \
git commit -m "prep canary" && \
git push origin master && \
fluxctl sync

Create a canary release for podinfo:

apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: podinfo
  namespace: prod
  annotations:
    fluxcd.io/ignore: "false"
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898
  canaryAnalysis:
    # schedule interval for the analysis
    interval: 10s
    # canary traffic increment step (percentage)
    stepWeight: 5
    # max number of failed metric checks before rollback
    threshold: 5
    metrics:
    - name: request-success-rate
      # minimum success rate (non-5xx responses) percentage
      threshold: 99
      interval: 1m
    - name: request-duration
      # maximum request duration in milliseconds
      threshold: 500
      interval: 1m
    webhooks:
      - name: load-test
        url: http://load-tester.prod/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 http://podinfo:9898/"

Apply changes:

git add -A && \
git commit -m "add canary" && \
git push origin master && \
fluxctl sync

Validate that Flagger has initialized the canary:

kubectl -n prod get canary
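
If the bootstrap succeeded, the canary should be reported as Initialized with zero traffic weight; the output below is illustrative (the timestamp will differ):

NAME      STATUS        WEIGHT   LASTTRANSITIONTIME
podinfo   Initialized   0        2019-07-10T14:05:07Z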

Automated canary promotion

Install the load testing service to generate traffic during the canary analysis:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: load-tester
  namespace: prod
  annotations:
    fluxcd.io/ignore: "false"
spec:
  releaseName: load-tester
  chart:
    git: https://github.com/weaveworks/flagger
    ref: 0.18.4
    path: charts/loadtester
  values:
    fullnameOverride: load-tester

When you deploy a new podinfo version, Flagger gradually shifts traffic to the canary while measuring the request success rate and the average response duration. Based on the analysis of these Linkerd-provided metrics, the canary deployment is either promoted or rolled back.
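
Under the hood, Flagger drives the traffic shift for Linkerd through an SMI TrafficSplit object. A rough sketch of what that object looks like during an early analysis step is shown below; the backend weights are illustrative and change at every step:

apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: podinfo
  namespace: prod
spec:
  service: podinfo
  backends:
  - service: podinfo-primary
    weight: 95
  - service: podinfo-canary
    weight: 5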

Trigger a canary deployment by updating the container image:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
spec:
  releaseName: podinfo
  values:
    image:
      tag: 3.0.1

Apply changes:

git add -A && \
git commit -m "update podinfo" && \
git push origin master && \
fluxctl sync

When Flagger detects that the deployment revision has changed, it will start a new rollout. You can monitor the traffic shifting with:

watch kubectl -n prod get canaries
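
While the analysis is running, the output should resemble the following (weight and timestamp are illustrative):

NAME      STATUS        WEIGHT   LASTTRANSITIONTIME
podinfo   Progressing   15       2019-07-10T14:10:07Z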

Automated rollback

During the canary analysis you can generate HTTP 500 errors and high latency to test whether Flagger pauses the rollout and rolls back the faulty version.

Trigger another canary release:

apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
spec:
  releaseName: podinfo
  values:
    image:
      tag: 3.0.2

Apply changes:

git add -A && \
git commit -m "update podinfo" && \
git push origin master && \
fluxctl sync

Exec into the tester pod and generate HTTP 500 errors:

kubectl -n prod exec -it $(kubectl -n prod get pods -o name | grep -m1 load-tester | cut -d'/' -f 2) bash

$ hey -z 1m -c 5 -q 5 http://podinfo-canary:9898/status/500
$ hey -z 1m -c 5 -q 5 http://podinfo-canary:9898/delay/1

When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary and the canary is scaled to zero.
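
You can confirm the rollback by checking the canary status; it should eventually report Failed (output illustrative):

kubectl -n prod get canary/podinfo

NAME      STATUS   WEIGHT   LASTTRANSITIONTIME
podinfo   Failed   0        2019-07-10T14:20:07Z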

Watch Flagger logs with:

$ kubectl -n linkerd logs deployment/flagger -f | jq .msg

 Starting canary analysis for podinfo.prod
 Advance podinfo.prod canary weight 5
 Advance podinfo.prod canary weight 10
 Advance podinfo.prod canary weight 15
 Halt podinfo.prod advancement success rate 69.17% < 99%
 Halt podinfo.prod advancement success rate 61.39% < 99%
 Halt podinfo.prod advancement success rate 55.06% < 99%
 Halt podinfo.prod advancement request duration 1.20s > 0.5s
 Halt podinfo.prod advancement request duration 1.45s > 0.5s
 Rolling back podinfo.prod failed checks threshold reached 5
 Canary failed! Scaling down podinfo.prod

Monitoring with Linkerd

The Linkerd dashboard provides a high-level view of what is happening with your services in real time. It can be used to visualize service dependencies and traffic splits, and to understand the health of specific service routes.

Open the dashboard by running:

linkerd dashboard --port=50750

During the canary analysis, navigate to:

http://127.0.0.1:50750/namespaces/ingress-nginx/deployments/nginx-ingress-controller


You can monitor the live traffic for the production namespace from the command line with:

linkerd -n prod top deploy

And you can view all the routes exposed by podinfo with:

linkerd -n prod routes service/podinfo

The above routes were generated from the podinfo swagger spec and exported as a Linkerd service profile.
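
For reference, a service profile produced this way looks roughly like the sketch below; the routes shown are a small, illustrative subset of the podinfo API, and the actual profile is generated from the swagger spec (for example with the linkerd profile --open-api command):

apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: podinfo.prod.svc.cluster.local
  namespace: prod
spec:
  routes:
  - name: GET /healthz
    condition:
      method: GET
      pathRegex: /healthz
  - name: POST /echo
    condition:
      method: POST
      pathRegex: /echo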