

A CIO's Guide to Kubernetes in Production
replex.io 1

Topics

Monitoring
High Availability, Backup and Disaster Recovery
Distributed DevOps and SRE
CI/CD
Choosing the Right Kubernetes Distribution
Storage
Networking
Security, Identity and Access Control
Cost Management
Application Lifecycle
Here is our complete guide to Kubernetes in Production for CIOs and CTOs. The guide covers the topics of monitoring, high availability, storage, networking, security and access control, cost management, CI/CD, application lifecycle, distributed DevOps and SRE, and choosing a Kubernetes distribution.

Monitoring

Monitoring for Cloud-Native Applications

The cloud-native set of tools has changed the way software is developed, deployed and managed. This new toolset has necessitated a shift in the way both the tools themselves as well as the applications propped up by them are monitored.

The same is true of Kubernetes, which introduces a number of new abstractions on both the hardware as well as the application layer. Any monitoring pipeline for Kubernetes needs to take both these new abstractions as well as its resource management model into account.

This means that in addition to monitoring historically relevant infrastructure metrics like CPU and RAM utilization for cloud VMs and physical machines, logical abstractions like pods, services and replica sets also need to be considered.

Observability Paradigm for Kubernetes

More importantly, however, Kubernetes monitoring needs to pivot to a new observability paradigm. Traditionally, organizations have relied on black box monitoring methods to monitor infrastructure and applications. Black box monitoring observes only the external behavior of a system.

In the cloud-native age of containers, orchestration, and microservices, monitoring needs to move beyond black box monitoring. Black box monitoring can still serve as the baseline for a monitoring strategy but it needs to be complemented by newer white box monitoring methods more suited to the distributed, ephemeral nature of containers and Kubernetes.

Observability encompasses both traditional black box monitoring methods in addition to newer monitoring paradigms like logging, tracing and metrics (together known as white box monitoring). Observability pipelines decouple data collection from data ingestion by introducing a buffer.
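As one hedged sketch of white box metrics collection, a Deployment can be annotated for scraping by a Prometheus server. Prometheus is an assumed tool choice here (the guide does not prescribe one), the `prometheus.io` annotations are a widely used convention rather than a Kubernetes built-in, and the application name, port and path are illustrative:

```yaml
# Assumed example: a Deployment whose pods expose metrics for a
# Prometheus server configured to honor the prometheus.io annotations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api            # hypothetical application
spec:
  replicas: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
      annotations:
        prometheus.io/scrape: "true"   # convention, not a Kubernetes built-in
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: api
        image: example.com/payments-api:1.0   # placeholder image
        ports:
        - containerPort: 8080
```

ServiceMonitor-style discovery (Prometheus Operator) is an equally common alternative to annotation-based discovery.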


The pipeline serves as the central repository of traces, metrics, logs and events which are then forwarded to the appropriate service using a data router. This mitigates the need to have agents for each destination running on each host and reduces the number of integrations that need to be maintained. It also allows enterprises to avoid vendor lock-in and quickly test new SaaS-based monitoring services.

Observability Best Practices

Observability aims to understand the internals of a system and how it works to quickly debug and resolve issues in production. Since it integrates logs, traces and metrics into traditional monitoring pipelines it covers much more ground and requires a lot more effort to deploy.

A best practice, therefore, is for CIOs and CTOs to gradually build towards a full observability pipeline for their cloud-native environments by integrating elements of white box monitoring over time.

The adoption of cloud-native technologies has also resulted in much more overlap between traditional dev and ops teams. Observability pipelines allow organizations to better integrate these teams by helping build a culture based on facts and feedback.

High Availability, Backup and Disaster Recovery

High availability and disaster recovery are crucial elements of any enterprise application. Orchestration engines like Kubernetes introduce additional layers which have to be considered when designing highly available architectures.

Multi-Layered High Availability

Highly available Kubernetes environments can be seen in terms of two distinct layers or levels. The bottom-most layer is the infrastructure layer, which can refer to any number of public cloud providers or physical infrastructure in a data center. Next is the orchestration layer, which includes both hardware and software abstractions like nodes, clusters, containers and pods as well as other application components.
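On the orchestration layer, one concrete availability building block is the PodDisruptionBudget, which keeps a minimum number of replicas of an application running during voluntary disruptions such as node drains. A minimal sketch, assuming a hypothetical application labelled `app: web`:

```yaml
# Sketch: keep at least 2 replicas of the web application available
# during voluntary disruptions (node drains, cluster upgrades).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```

The `policy/v1` API is available on Kubernetes 1.21 and later; older clusters use `policy/v1beta1`.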


High Availability on the IaaS and On-Premises Layer

Public cloud providers provide a number of high availability mechanisms for compute, storage and networking that should serve as a baseline for any Kubernetes environment. CIOs and CTOs also need to bake redundancy into the compute, storage and networking equipment supporting Kubernetes environments in on-premise data centers.

High Availability on the Orchestration Layer

On the orchestration layer, a multi-master Kubernetes cluster is a good starting point. Master nodes should also be distributed across cloud provider zones to ensure they are not affected by outages in any one zone.

Availability on the orchestration layer, however, needs to move beyond simple multi-master clusters. A best practice is to provision a minimum of 3 master nodes distributed across multiple zones. Similarly, worker nodes should also be distributed across zones for high availability.

In addition to having at least 3 master nodes, a best practice is to replicate the etcd master component and place it on dedicated nodes. It is recommended to have at least 5 etcd members for production clusters.

On the application layer, CIOs and CTOs need to ensure the use of native Kubernetes controllers like StatefulSets or Deployments. These will ensure that the desired number of pod replicas are always up and running.

Backup and Disaster Recovery

Backup and disaster recovery should also figure at the top of every CIO's to-do list for Kubernetes clusters in production. The etcd master component is responsible for storing the cluster state and configuration. Having a plan for regular etcd backups is, therefore, a best practice. Stateful workloads on Kubernetes leverage persistent volumes which also need to be backed up.

Backup and disaster recovery are important elements of mission-critical enterprise applications. CTOs and CIOs need to have a well thought out and comprehensive high availability, backup and disaster recovery mechanism for Kubernetes that encompasses all layers.
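The controller and zone-distribution practices above can be sketched together as a Deployment that keeps three replicas running and asks the scheduler to spread them across zones. The names and image are illustrative, and `topologySpreadConstraints` assumes a reasonably recent Kubernetes version (1.19+):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                   # the controller keeps 3 pods running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Spread replicas across cloud provider zones so a single zone
      # outage does not take down the whole application.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx:1.25       # placeholder image
```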


Distributed DevOps and SRE

The future of enterprise software is moving towards containerized, microservices-based distributed applications deployed on Kubernetes with the cloud as an underlying layer. This new cloud-native landscape needs to be reflected in the way dev and ops teams are organized internally as well as in the software release cycle.

Cloud-Native Roles and Teams

Kubernetes and cloud-native technologies have changed traditional dev and ops roles, broken down siloed dev and ops teams and changed the entire software release lifecycle. Given these paradigm changes, we will outline best practices for CIOs and CTOs in terms of role definitions, team composition and new paradigms for developing and deploying software.

DevOps has already broken up the siloed development, testing and operations teams that serviced traditional monolithic applications. More and more developer teams are internalizing ops skills.

In the new cloud-native world, however, the boundary between dev and ops has blurred even more. CIOs and CTOs need to ensure that every DevOps team has the required skills and knowledge to automate, monitor and optimize the distributed, cloud-native applications being developed. They should also have the required skills to ensure highly available and scalable applications, implement networking and onboard the tools required throughout the application lifecycle.

DevOps and SRE

One way to inject these skills into already existing DevOps teams is to move towards SRE. SRE is an implementation of DevOps, developed internally by Google, that pushes for an even more overlapping skill set for individual developers. SREs typically divide their time equally between development and ops responsibilities.

In the context of Kubernetes, a best practice for CIOs and CTOs is to sprinkle SREs among DevOps teams. These SREs would, in turn, be responsible for both development as well as managing performance, on-boarding tools, building in automation and monitoring.


The Role of Central IT

The increasingly distributed nature of enterprise applications translating into distributed DevOps teams, however, does not mean that central IT loses its significance. There does need to be some degree of control and oversight over these teams.

Even though organizations increasingly prefer developers with cross-domain knowledge of ops, overlapping skills do tend to dilute both development and ops.

A best practice, therefore, is to have a central IT team that includes personnel with ops and infrastructure skill sets. This skill set will enable central IT to provide DevOps teams with critical services that are shared by those teams. It will also ensure that organizations avoid wasted effort due to distributed teams figuring out solutions to shared problems.

Both the cloud and Kubernetes itself have made it increasingly easier for teams to provision and consume resources. The cloud-native movement and DevOps also emphasize agility and the ability to self-service resources. This can at times lead to an explosion in the number of compute resources provisioned and can potentially lead to wastage and inefficient resource usage. A strong central IT team will be able to govern these distributed teams and avoid the fallout from self-service and ballooning resources. It will also be able to hold teams accountable.

CI/CD

In the same way that Kubernetes and the wider cloud-native technology toolset made CIOs rethink traditional dev and ops roles, it has also required a new way of thinking about build and release cycles. Containerized, microservices-based applications, developed, deployed and managed by distributed teams, are not very suited to traditional one-dimensional build and release pipelines.

CI/CD for Distributed Teams

A best practice for CIOs, therefore, is to support distributed teams with a well-tooled and thought-out CI/CD pipeline. A robust CI/CD pipeline is essential to fully realizing the benefits of faster release cycles and agility promised by Kubernetes and cloud-native technologies. There are a number of tools that CIOs and CTOs


can use to deploy CI/CD pipelines. These include Jenkins, TravisCI, GitLab CI and Spinnaker.

Continuous Integration (CI)

CI/CD is a broad concept and touches on aspects of development, testing and operations. When deploying a CI/CD pipeline from scratch, a good place to start is with the developer team. Continuous integration is a subset of CI/CD that aims to increase the frequency of code merges and automate build and test processes.

Instead of developing new features in isolation, developers are encouraged to merge code into the main pipeline as frequently as possible. An automated build is created from these code changes, which is then run through a suite of automated tests. Getting developer teams to adopt CI best practices will ensure that code changes and new features are always ready to be pushed out to production.

Continuous Delivery and Deployment (CD)

Once CI practices are firmly in place, CIOs and CTOs can then move on to continuous delivery and deployment. Continuous delivery is an extension of continuous integration where code changes are run through more rigorous tests and ultimately deployed to an environment that closely mirrors the production environment.

With continuous delivery there is often a human element involved making decisions about when and how frequently to push code into production. Continuous deployment automates the entire pipeline by automatically pushing code into production once it passes the automated builds and tests defined in both the integration and delivery phases.

CI/CD Best Practices

Agile distributed teams working in isolation can at times lead to an explosion in the number of isolated build pipelines. To avoid this, a best practice for CIOs is to make the CI/CD pipeline the only way to push code into production. This will ensure that all code changes are pushed into a unified build pipeline and are subjected to a consistent set of integration and test suites.

Distributed teams also tend to use a number of different tools and frameworks. CIOs need to ensure that the CI/CD pipeline is flexible enough to accommodate this usage.
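As a hedged sketch of such a unified pipeline, here is what the integration and delivery stages might look like in GitLab CI, one of the tools named above. The registry URL, test script and manual deploy gate are illustrative assumptions:

```yaml
# Sketch of a .gitlab-ci.yml: one unified pipeline from commit to deploy.
stages:
  - build
  - test
  - deploy

build:
  stage: build
  script:
    # Build the image once; later stages reuse the same artifact.
    - docker build -t registry.example.com/app:$CI_COMMIT_SHA .
    - docker push registry.example.com/app:$CI_COMMIT_SHA

test:
  stage: test
  script:
    # Run the automated test suite against the built image.
    - docker run --rm registry.example.com/app:$CI_COMMIT_SHA ./run-tests.sh

deploy:
  stage: deploy
  script:
    - kubectl set image deployment/app app=registry.example.com/app:$CI_COMMIT_SHA
  when: manual   # the human gate typical of continuous delivery
```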


Another best practice is to encourage a culture of small, incremental code changes and frequent merges among developer teams. Smaller changes are easier to integrate and roll back and minimize the fallout if something goes wrong.

CIOs also need to institute a build-once policy at the start of the pipeline. This ensures that later phases of the CI/CD pipeline have a consistent build to work with. It also avoids any inconsistencies that can creep in when using multiple build tools.

Additionally, CIOs need to strike a balance between the extent of the testing regime they push code changes through and the speed of the pipeline itself. More rigorous testing regimes, while minimizing the chances of bad code being pushed to production, also have a time overhead.

CI/CD pipelines, even though championing decentralization and agility, do still need to be governed by central IT for major feature releases. CIOs and CTOs need to ensure they strike a balance between governance and oversight from central IT and the agility and flexibility of distributed teams. They need to ensure a degree of oversight that, while allowing them control, does not impact the release velocity of software and teams.

Choosing the Right Kubernetes Distribution

Even though Kubernetes on its own is vastly feature rich, mission-critical enterprise workloads need to be supported by more feature-rich variants to provide required service levels.

Managed Kubernetes

There are a number of managed Kubernetes offerings from public cloud providers that CIOs and CTOs can evaluate. These managed offerings take over some of the heavy lifting involved in managing upgrades, patches and HA.

Public cloud provider offerings do, however, restrict Kubernetes environments to a specific vendor and might not fit well with a future hybrid or multi-cloud strategy.


Commercial value-added Kubernetes distributions are also available from vendors like Red Hat, Docker, Heptio, Pivotal and Rancher. Below we will outline some of the features CIOs and CTOs need to look for when choosing one.

Feature Set for Kubernetes Distributions

High availability and disaster recovery: CIOs and CTOs need to look for distributions that support high availability out of the box. This would include support for multi-master architectures, highly available etcd components as well as backup and recovery.

Hybrid and multi-cloud support: Vendor lock-in is a very real concern for the modern enterprise. To ensure Kubernetes environments are portable, CIOs need to choose distributions that support a wide range of deployment models, from on-premise to hybrid and multi-cloud. Support for creating and managing multiple clusters is another feature that should be evaluated.

Management, upgrades and operational support: Managed Kubernetes offerings also need to be evaluated based on ease of setup, installation and cluster creation as well as day 2 operations including upgrades, monitoring and troubleshooting. A baseline requirement should be support for fully automated cluster upgrades with zero downtime. The solution chosen should also allow upgrades to be triggered manually. Monitoring, health checks, cluster and node metrics, and alerts and notifications should also be a standard part.

Identity and access management: Identity and access management are important both in terms of security as well as governance. CIOs need to ensure that the Kubernetes distribution they choose supports integration with already existing authentication and authorization tools being used internally. RBAC and granular access control are also important feature sets that should be supported.

Networking and storage: The Kubernetes networking model is highly configurable and can be implemented using a number of options. The distribution chosen should either have a native software-defined networking solution that covers the wide range of requirements imposed by different applications or infrastructure, or support one of the more popular CNI-based networking implementations including Flannel, Calico, kube-router or OVN. CIOs also need to ensure that the Kubernetes distribution


they choose supports, at a minimum, either flexvolume or CSI integration with storage providers as well as deployment on multiple cloud providers and on-premise.

Deploy, manage and upgrade applications: Kubernetes distributions being considered by CIOs also need to support a comprehensive solution for deploying, managing and upgrading applications. A Helm-based application catalog that aggregates both private and public chart repositories should be a minimal requirement.

Storage

Kubernetes storage is a hard nut to crack. Kubernetes was initially designed to support stateless applications that do not use saved data across sessions. Kubernetes pods are meant to be ephemeral and are constantly created, destroyed and moved across nodes. Whenever pods are destroyed, Kubernetes volumes attached to these pods are also terminated.

Most legacy applications, including databases, however, are stateful and store tons of data for use across sessions. Since volumes can only be used to store temporary data, they are not well suited to these applications.

Kubernetes Persistent Volumes

In order to support stateful applications, Kubernetes introduced a new type of volume plugin called persistent volumes. The persistent volume resource decouples storage from the pod lifecycle and allows data to persist across pod restarts, making persistent volumes a good candidate for stateful applications.

Software-Defined Storage (SDS)

Software-defined storage (SDS) solutions are a good bet for CIOs and CTOs looking for a Kubernetes storage solution. Kubernetes supports a number of these SDS providers as persistent volume plugins (StorageOS, CephFS, Portworx, GlusterFS, ScaleIO and Quobyte).

SDS solutions abstract storage from the underlying hardware and present it for consumption as shared storage pools. SDS solutions also abstract away the complexities of having to manage


disparate storage devices and filesystem types. Built-in APIs allow consumers to manage and automate encryption, high availability, backups and replication.

Storage Best Practices

A best practice when choosing a storage solution for Kubernetes is to ensure it is distributed, resilient, durable and robust with no single points of failure. The storage solution chosen should also support dynamic, on-demand provisioning with support for Kubernetes storage classes and be easily scalable. Dynamic provisioning significantly reduces the management overhead of creating, managing and configuring persistent volumes and making them available for consumption in the cluster. Storage can be automatically provisioned whenever it is requested by users.

Another best practice is for CIOs and CTOs to evaluate SDS solutions based on support for high availability, automated replication and backups.

Additional features to look for in a storage provider are support for dynamic resizing, automated volume snapshots, backup and restore. As with Kubernetes and the cloud-native toolset, the storage solution chosen should also be declarative and cloud-agnostic, with support for automated upgrades. Encryption and access control should also be required features.

With the emphasis on multi-cloud deployments, CIOs and CTOs should also ensure that any storage solution they choose is portable and cloud-agnostic. Storage solutions should be able to pool storage resources from disparate (cloud, on-prem etc.) hardware sources.

Performance is another aspect that needs to be considered. In most cases, the storage solution chosen will depend on the unique attributes of each environment and application requirements. Before undertaking a review of SDS solutions based on the features outlined above, CIOs and CTOs should benchmark the performance requirements of their environments and applications.

Networking

As with storage, networking is also an important component of enterprise environments. There are three important elements


CIOs and CTOs need to consider when setting up networking for a Kubernetes environment: communication between application components (pod to pod communication), communication between pods and services, and communication with the outside internet.

Kubernetes Networking Model

The Kubernetes networking model allots a unique IP to each individual pod. By default, all pods belonging to a Kubernetes cluster can communicate with all other pods. This communication happens across namespaces, services and nodes.

The networking model also allows groups of pods that provide the same functionality (Services) to communicate with other pods. The Service abstraction de-couples groups of dependent pods and allows applications to continue functioning in the event of pod restarts.

Kubenet, the default networking plugin from Kubernetes, provides some basic networking functionality but is limited when it comes to cloud environments. For a more feature-rich networking solution that can support mission-critical enterprise environments, CIOs and CTOs should consider one of the CNI-compatible networking plugins.

Feature Set for CNI Evaluation

Flannel, Calico, Canal, kube-router and Weave Net are some of the better-known CNI plugins. Below we review some of the features that CIOs and CTOs need to consider when choosing a CNI for their Kubernetes environment.

Support for Network Policy

Support for Kubernetes network policies is a crucial functionality that CIOs and CTOs should use to evaluate CNI plugins. Network policies allow DevOps to configure and control traffic to and from their applications. Network policies perform both a security and an access control function.

With the network policy resource, Kubernetes enables a shift-left approach where DevOps can configure network policies using the same concepts used to deploy applications.

By default, pods do not filter incoming traffic and there are no firewall rules. Network policies allow granular control over how


pods are allowed to communicate among each other and with other network endpoints. This control extends to both ingress as well as egress traffic.

Type of Network

When choosing a CNI plugin, CIOs and CTOs should also consider the type of network deployed. Plugins that deploy an overlay network usually encapsulate packets in an extra layer. Flannel, for example, deploys a layer 3 overlay network and uses a backend like VXLAN, IPsec or host-gw for packet forwarding. Since overlay networks encapsulate packets in an extra layer, they increase the network overhead and can affect performance. Additionally, tracing network packets is often difficult with plugins that encapsulate traffic.

Calico, in contrast, deploys a layer 3 network that uses BGP to route traffic between hosts. Using BGP has obvious performance benefits since packets do not have to be encapsulated using a backend. It also makes troubleshooting faster and easier using already existing tools. For situations where an overlay network is needed, like routing traffic between AZs, it does use encapsulation like IP-in-IP or VXLAN.

Encryption

Encryption is another feature to look out for, especially in environments that exchange traffic across untrusted networks. The Weaveworks CNI plugin transports data using the fastest available method and also encrypts it: TCP and UDP traffic for the sleeve method is encrypted using the NaCl encryption library, while data plane traffic for the fast datapath method is encrypted using the ESP protocol of IPsec.

Support for Service Mesh

CIOs and CTOs looking for more control over networking, security and access should also ensure that the network plugin chosen supports integration with service meshes like Istio. This is also important given the sheer scale and complicated nature of service-to-service communications for cloud-native applications.

Service, Load Balancing and Ingress

CNIs alone do not address how the cluster communicates with the internet. There are a number of native Kubernetes abstractions that allow traffic to be exchanged with the wider internet. The Service abstraction is one; Kubernetes Ingress is another.
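The pod-level ingress and egress controls discussed under network policy above can be sketched as follows. The labels and port are illustrative, and enforcement requires a CNI plugin that implements network policies:

```yaml
# Sketch: restrict traffic for pods labelled app: api.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend        # only frontend pods may reach the API
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: db              # API pods may only reach the database
```

Real-world policies usually also need an egress rule permitting DNS lookups, otherwise the selected pods can no longer resolve service names.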


ClusterIP, NodePort, LoadBalancer and ExternalName are the available Service types; of these, NodePort and LoadBalancer allow external traffic into the cluster. The LoadBalancer service type is the standard way to expose services to the internet. It does, however, require a supported cloud provider to be present and can get expensive, since each exposed service gets its own IP address.

For most enterprise environments, Kubernetes Ingress is the recommended method to expose services to the internet. Ingress handles load balancing at layer 7 and officially supports the nginx and GCE ingress controllers. There are a number of additional Ingress controllers, including Contour and Istio, that CIOs and CTOs can look into. Ingress is more feature-rich as compared to the LoadBalancer service type and is also a less expensive option.

Security, Identity and Access Control

Kubernetes security extends beyond the immediate cluster environment to applications and infrastructure. CIOs and CTOs need to ensure that security is a part of the entire application lifecycle and encompasses all layers.

Access Control

With the cloud-native movement, identity and access control have become increasingly important in the context of security. Native Kubernetes authentication, authorization and admission controllers allow CIOs and CTOs to draw a security perimeter around their environments, identify users and processes and govern the resources they are allowed to access.

There are two ways requests can be authenticated in Kubernetes: normal accounts and service accounts. Normal accounts usually correspond to user accounts and are managed by an outside third-party service. A best practice for CIOs and CTOs is to enable multiple authentication methods: one each for user accounts (either OpenID Connect or X509 client certificates) and service accounts (service account tokens).

Kubernetes RBAC

Once requests are authenticated, they can then be granted permission to access resources via RBAC. RBAC allows CIOs and CTOs to


regulate and govern access to resources based on the roles of individual users.

Cluster roles grant permissions for the entire cluster across all namespaces. CIOs and CTOs should ensure that cluster roles are only granted to trusted users or groups of users. The cluster-admin role specifically has a very wide range of permissions to perform actions and has access to all resources. A best practice, therefore, is to avoid granting the cluster-admin role as much as possible.

Roles are by default restricted to specific namespaces and should be preferred over cluster roles whenever possible. For this to work, a best practice is to wall off groups of resources into individual namespaces for teams, departments, clients, applications etc. Roles can then be used to implement fine-grained access control by specifying the apiGroup of the resource, the resource itself (e.g. pods) and the operations that can be performed. Roles are granted to individual users or groups of users using role bindings.

A best practice in the context of RBAC is to follow user-access best practices and keep the scope of permissions small. CIOs and CTOs, however, do need to consider the increased management overhead that comes with a fine-grained RBAC policy.

Continuous Security Scanning

CIOs and CTOs also need to encourage a culture of periodic security scanning of container images. This can be accomplished using tools like Clair and Anchore. Periodic scanning will identify any common vulnerabilities and exposures (CVEs) in container images.

To make this a continuous process, a best practice is to bake security and vulnerability checks and tools into the CI/CD pipeline. Any new code as well as the container images built using this code should be checked for CVEs as part of the CI/CD pipeline. CIOs and CTOs should also discourage the use of unknown images from public repositories and prioritize the use of private registries.

The AlwaysPullImages admission controller is another way to ensure images are always pulled with the correct authorization and cannot be reused by pods.
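The namespaced-Role pattern described above might look like the following sketch. The namespace and group name are illustrative, with the group assumed to come from the organization's external identity provider:

```yaml
# Sketch: a namespaced Role granting read-only access to pods,
# bound to a hypothetical team group.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-a            # illustrative namespace
  name: pod-reader
rules:
- apiGroups: [""]              # "" is the core API group, which covers pods
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: team-a
  name: read-pods
subjects:
- kind: Group
  name: team-a-developers      # assumed group from the external identity provider
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```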


Network Segmentation

By default, Kubernetes pods and containers are allowed to communicate across nodes, namespaces and services. To reduce the potential attack surface and limit the area of impact and fallout, a best practice is to segment these communications.

CIOs and CTOs can use the native Kubernetes network policy resource to control and specify how pods or groups of pods are allowed to communicate. A network policy, however, only works in tandem with a network plugin. CIOs and CTOs should ensure that the networking solution they choose supports network policies.

Kubernetes Secrets

Kubernetes has a native Secret resource that allows sensitive information like passwords, tokens or keys to be stored. It is recommended best practice to store such sensitive information in Secrets rather than specifying it directly in pod or container specs. CIOs and CTOs should also ensure that all Secrets are encrypted at rest. They can do this using any of the four providers: aescbc, secretbox, aesgcm or kms. Data should also be encrypted in transit using one of the supported network plugins. CIOs and CTOs should also avoid giving direct access to Kubernetes nodes.

Logging

Kubernetes environments are complex, with many different components and layers. Logs are a good way to keep tabs on cluster activity, understand what is happening and debug problems. CIOs and CTOs should put considerable thought into designing a logging architecture for Kubernetes.

The native logging functionality for Kubernetes only stores logs for the lifetime of a container, pod or node. A best practice, therefore, is to use a cluster-level logging architecture and configure a separate backend for storing Kubernetes logs.

Auditing

Kubernetes auditing is another native resource that can help CIOs and CTOs gain more security insights into cluster activities. Audits keep track of what events happened inside the cluster, who performed those activities and which resources they affected. Rules about which events to record and the associated metadata can be defined in the audit policy resource.
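The Secret-based handling of sensitive data described above can be sketched as follows; the names and placeholder value are illustrative:

```yaml
# Sketch: store a database password in a Secret instead of a pod spec.
# stringData accepts plain text; the API server stores it base64-encoded.
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:
  password: s3cr3t-example     # placeholder value
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: example.com/app:1.0   # placeholder image
    env:
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: password
```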


Cost Management

Freedom, agility, self-service and the ability to move fast are core principles of modern DevOps. They are also baked into most cloud-native tooling, including Kubernetes. These concepts allow distributed DevOps teams to provision and consume resources with minimal oversight from central IT. This can at times lead to increased wastage of resources and ballooning costs. CIOs and CTOs therefore need to implement a comprehensive resource governance and cost management framework to control costs.

Resource Governance Using Namespaces

Kubernetes namespaces are a great way to implement low-level resource governance mechanisms. Creating separate namespaces for teams will allow CIOs to control the resource consumption of individual containers as well as the total resource consumption of all containers that belong to the namespace. They can also ensure that all containers run with default limits and control the total number of pods or other Kubernetes objects allowed to run.

Cluster Autoscaler

Using a cluster autoscaler is another best practice in the context of resource governance and cost management. The cluster autoscaler right-sizes Kubernetes clusters based on utilization metrics of individual nodes. Nodes seeing sustained low utilization are taken out of the cluster pool by the cluster autoscaler.

Trimming down clusters to reduce resource wastage is a good way to reduce cloud provider bills and ensure efficient resource usage. Another good way to control costs is to right-size nodes that see sustained levels of low utilization. This will ensure that the resource footprint of the node matches that required by the pods running on top.

Horizontal and Vertical Pod Autoscaler

CIOs and CTOs should also ensure the use of both the horizontal (HPA) and vertical pod autoscalers (VPA) to govern resource usage and cost management.

The HPA is a native Kubernetes resource that increases or decreases the number of pod replicas for an application based on

replex.io 16
A CIOs Guide to Kubernetes in Production

CPU utilization metrics. HPA ensures that controllers are right


Application lifecycle
sized and do not result in resource wastage.

We have already looked at CI/CD tooling in the context of


Similarly, the VPA ensures that individual pods use resources
Kubernetes and how it allows CIOs and CTOs to accelerate
efficiently. It continuously monitors and recommends optimal
software delivery and shorten release cycles. In adopting
values for resource requests, kills the pods that do not have the
Kubernetes and cloud-native, CIOs and CTOs also need to
correct values and sets the correct value when they are recreated
internalize a broader set of concepts, tools and best practices.
by the controller.

Infrastructure as code (IAC), Environment as code (EC),


Spot Instances
Immutable infrastructure (II), Declarative Configuration,
Spot instances have emerged as a viable cost management
Observability, and Operations by Pull Request are just some of
alternative to on-demand and reserved cloud provider instances.
these concepts. Integrating these concepts helps make both
On Kubernetes too, CIOs and CTOs can consider using spot
infrastructure and applications reproducible, consistent and
instances to reduce cluster costs. A best practice is to create a
traceable. It also helps build in automation, further accelerating
cluster with a mix of on-demand, reserved and spot instances that
software delivery and making it easier for DevOps teams to
is both reliable and reduces overall cluster costs. CIOs and CTOs
conduct operations.
should also develop a comprehensive tagging strategy to keep
track of resources provisioned by distributed teams. This will also
GitOps
help rapidly identify and decommission underutilized or unused
GitOps developed at Weaveworks is one such methodology that
assets.
allows CIOs and CTOs to stitch together tools and best practices
from these concepts. The GitOps workflow encompasses the
entire build, deploy, manage and monitor lifecycle.
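The namespace-level governance described under Cost Management is typically implemented with a ResourceQuota (capping total consumption and object counts per namespace) and a LimitRange (applying default requests and limits to containers that do not set their own). A sketch for a hypothetical team-a namespace; all figures are illustrative:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"        # total CPU requested by all pods
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                # cap on the number of pods
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:         # applied when a container sets no request
        cpu: 100m
        memory: 128Mi
      default:                # applied when a container sets no limit
        cpu: 500m
        memory: 512Mi
```

The LimitRange matters because a quota on requests rejects pods that omit them; with defaults in place, every container is admitted and counted against the quota.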

replex.io 17
A CIOs Guide to Kubernetes in Production
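A minimal horizontal pod autoscaler of the kind described in the Cost Management section, targeting a hypothetical Deployment named web and scaling on CPU utilization (this assumes the metrics server is installed in the cluster):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:             # the controller to right-size
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add/remove replicas around 70% CPU
```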

Version Control
The most important cog in the GitOps workflow is a version control system like Git. Git stores the desired system state (clusters, applications and infrastructure) as declarative, version-controlled configuration files. Any changes to production environments happen via Git commits.

Kubernetes, like most modern cloud-native tools, is declarative in nature and as such is perfectly suited to a GitOps workflow that treats declarative configuration files stored in Git as the single source of truth.

A minimal GitOps workflow includes Git for version control, a CI tool for unit and integration tests, a private image repository, and a deployment and release automation tool like Flux.

The workflow starts by pushing code changes to Git, making a pull request and reviewing and merging the code. Once code is merged, the CI tool runs the changes through an automated unit and integration test suite, builds a new image and deposits it in a registry. The Flux tool, which continuously monitors the registry, detects those updates and automatically deploys them to the Kubernetes cluster.

Unlike regular CD tools, GitOps also compares the actual state of the production system with the one under version control and sends out alerts whenever it discovers a divergence. It also triggers a convergence mechanism that brings the observed and desired states into sync.

Environment as Code
Another GitOps best practice is to adopt a broader "Environment as code" approach. Environment as code extends version control to Kubernetes clusters, infrastructure and observability tooling.

Configuration files for clusters, infrastructure and observability tooling are version controlled in Git. Since the entire system (clusters, applications, infrastructure and observability tools) is version controlled, it is consistent and can be easily reproduced, enabling faster disaster recovery as well as making rollouts and rollbacks more seamless.

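As a sketch of the automated deployment step: newer versions of Flux (v2) declare the Git source and the reconciliation as Kubernetes custom resources rather than watching the image registry directly, which is one way to wire up the workflow described in this guide. The repository URL and path below are hypothetical:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: app-config
  namespace: flux-system
spec:
  interval: 1m                # how often to poll Git for changes
  url: https://github.com/example-org/app-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: app-config
  path: ./deploy
  prune: true                 # delete cluster objects removed from Git
```

With prune enabled, deleting a manifest from Git removes the corresponding object from the cluster, keeping the observed and desired states in sync.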
Audits and post-mortems are also easier, with the Git log serving as an audit trail.

GitOps also facilitates DevOps and SRE teams by making development and operations activities part of the same workflow. Developers can easily internalize operations tasks using the version-controlled observability stack, conducting operations tasks and fixing production issues as pull requests rather than making changes to the running system.

GitOps Best Practices
Below we review some of the best practices that enable a GitOps workflow:

• Use Git or another version control system as the single source of truth to store version-controlled configuration files for the entire system (cluster, infrastructure, applications)
• Ensure any changes to the production system are conducted via pull requests
• Use automated difference tools to monitor divergence between the actual production system and the one under version control
• Use IaC tools (Terraform, Ansible) to create server configuration files and keep them under version control
• Use the principles of immutable infrastructure, containers and diff tools (kubediff, ansiblediff, and terradiff) to reduce configuration drift and ensure the desired system state is maintained
• Version control your observability stack
• Use native Kubernetes constructs for rolling updates.
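The native construct for rolling updates referenced in the best practices is the Deployment update strategy. A sketch with hypothetical names and image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1       # at most one pod down during a rollout
      maxSurge: 1             # at most one pod above the replica count
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.2.3   # hypothetical image
          ports:
            - containerPort: 8080
```

Rolling out a new version is then a matter of committing a new image tag; Kubernetes replaces pods gradually, and `kubectl rollout undo` reverts a bad release.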

Get in touch
replex.io | sales@replex.io

AUTHOR

Hasham Haider
Fan of all things cloud, containers and micro-services!

*The information provided within this eBook is for general informational purposes only. While we try to keep the
information up-to-date and correct, there are no representations or warranties, express or implied, about the
completeness, accuracy, reliability, suitability or availability with respect to the information, products, services, or
related graphics contained in this eBook for any purpose. Any use of this information is at your own risk.
