A CIO's Guide to Kubernetes in Production
Topics
Monitoring
CI/CD
Storage
Networking
Cost Management
Application lifecycle
Here is our complete guide to Kubernetes in Production for CIOs and CTOs. The guide covers the topics of monitoring, high availability, storage, networking, security and access control, cost management, CI/CD, application lifecycle, distributed DevOps and SRE, and choosing a Kubernetes distribution.

Monitoring
In addition to virtual and physical machines, logical abstractions like pods, services and replica sets also need to be considered.

Observability Paradigm for Kubernetes
More importantly, however, Kubernetes monitoring needs to pivot to a new observability paradigm. Traditionally, organizations have relied on black box monitoring methods to monitor infrastructure and applications. Black box monitoring observes only the external behavior of a system.
The pipeline serves as the central repository of traces, metrics, logs and events, which are then forwarded to the appropriate service using a data router. This mitigates the need to have agents for each destination running on each host and reduces the number of integrations that need to be maintained. It also allows enterprises to avoid vendor lock-in and quickly test new SaaS-based monitoring services.

Observability Best Practices
Observability aims to understand the internals of a system and how it works in order to quickly debug and resolve issues in production. Since it integrates logs, traces and metrics into traditional monitoring pipelines, it covers much more ground and requires a lot more effort to deploy.

A best practice, therefore, is for CIOs and CTOs to gradually build towards a full observability pipeline for their cloud-native environments by integrating elements of white box monitoring over time.

The adoption of cloud-native technologies has also resulted in much more overlap between traditional dev and ops teams. Observability pipelines allow organizations to better integrate these teams by helping build a culture based on facts and feedback.

High Availability, Backup and Disaster Recovery
High availability and disaster recovery are crucial elements of any enterprise application. Orchestration engines like Kubernetes introduce additional layers which have to be considered when designing highly available architectures.

Multi-Layered High Availability
Highly available Kubernetes environments can be seen in terms of two distinct layers. The bottom-most layer is the infrastructure layer, which can refer to any number of public cloud providers or physical infrastructure in a data center. Next is the orchestration layer, which includes both hardware and software abstractions like nodes, clusters, containers and pods, as well as other application components.
High Availability on the IaaS and On-premises Layer
Public cloud providers provide a number of high availability mechanisms for compute, storage and networking that should serve as a baseline for any Kubernetes environment. CIOs and CTOs also need to bake redundancy into the compute, storage and networking equipment supporting Kubernetes environments in on-premise data centers.

It is recommended to have at least five etcd members for production clusters. On the application layer, CIOs and CTOs need to ensure the use of native Kubernetes controllers like StatefulSets or Deployments. These will ensure that the desired number of pod replicas are always up and running.
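As a minimal sketch of the controller-based approach above, a Deployment declares a desired replica count that Kubernetes then maintains; the name and image here are illustrative:

```yaml
# Illustrative Deployment: the controller restarts or reschedules
# pods so that three replicas are always running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend   # hypothetical application name
spec:
  replicas: 3          # desired pod count, maintained by Kubernetes
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: nginx:1.25   # stand-in image
          ports:
            - containerPort: 80
```

If a node fails or a pod crashes, the Deployment controller reconciles back to three replicas without operator intervention.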
The Role of Central IT
The increasingly distributed nature of enterprise applications, translating into distributed DevOps teams, does not, however, mean that central IT loses its significance. There does need to be some degree of control and oversight over these teams.

Even though organizations increasingly prefer developers with cross-domain knowledge of ops, overlapping skills do tend to dilute both development and ops.

A best practice, therefore, is to have a central IT team that includes personnel with ops and infrastructure skill sets. This skill set will enable central IT to provide DevOps teams with critical services that are shared by those teams. It will also ensure that organizations avoid wasted effort due to distributed teams figuring out solutions to shared problems.

Both the cloud and Kubernetes itself have made it increasingly easy for teams to provision and consume resources. The cloud-native movement and DevOps also emphasize agility and the ability to self-service resources. This can at times lead to an explosion in the number of compute resources provisioned, which can potentially lead to wastage and inefficient resource usage. A strong central IT team will be able to govern these distributed teams and avoid the fallout from self-service and ballooning resources. It will also be able to hold teams accountable.

CI/CD
In the same way that Kubernetes and the wider cloud-native technology toolset made CIOs rethink traditional dev and ops roles, they have also required a new way of thinking about build and release cycles. Containerized, microservices-based applications, developed, deployed and managed by distributed teams, are not well suited to traditional one-dimensional build and release pipelines.

CI/CD for Distributed Teams
A best practice for CIOs, therefore, is to support distributed teams with a well-tooled and thought-out CI/CD pipeline. A robust CI/CD pipeline is essential to fully realizing the benefits of faster release cycles and agility promised by Kubernetes and cloud-native technologies. There are a number of tools that CIOs and CTOs can use to deploy CI/CD pipelines, including Jenkins, TravisCI, GitLab CI and Spinnaker.

Continuous Integration (CI)
CI/CD is a broad concept that touches on aspects of development, testing and operations. When deploying a CI/CD pipeline from scratch, a good place to start is with the developer team. Continuous integration is a subset of CI/CD that aims to increase the frequency of code merges and automate build and test processes.

Instead of developing new features in isolation, developers are encouraged to merge code into the main pipeline as frequently as possible. An automated build is created from these code changes, which is then run through a suite of automated tests. Getting developer teams to adopt CI best practices will ensure that code changes and new features are always ready to be pushed out to production.

Continuous Delivery and Deployment (CD)
Once CI practices are firmly in place, CIOs and CTOs can then move on to continuous delivery and deployment. Continuous delivery is an extension of continuous integration where code changes are run through more rigorous tests and ultimately deployed to an environment that closely mirrors the production environment.

With continuous delivery there is often a human element involved in making decisions about when and how frequently to push code into production. Continuous deployment automates the entire pipeline by automatically pushing code into production once it passes the automated builds and tests defined in both the integration and delivery phases.

CI/CD Best Practices
Agile distributed teams working in isolation can at times lead to an explosion in the number of isolated build pipelines. To avoid this, a best practice for CIOs is to make the CI/CD pipeline the only way to push code into production. This will ensure that all code changes are pushed into a unified build pipeline and are subjected to a consistent set of integration and test suites.

Distributed teams also tend to use a number of different tools and frameworks. CIOs need to ensure that the CI/CD pipeline is flexible enough to accommodate this usage.
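As an illustrative sketch only (assuming GitLab CI, one of the tools mentioned above; the stage names, registry URL and make target are hypothetical), a unified pipeline that builds once and gates production behind tests and a manual approval might look like:

```yaml
# Hypothetical .gitlab-ci.yml: every change flows through the same
# build -> test -> deploy stages; production is only reachable via
# this pipeline.
stages:
  - build
  - test
  - deploy

build-image:
  stage: build
  image: docker:24
  script:
    # Build once; later stages reuse this immutable tag.
    - docker build -t registry.example.com/app:$CI_COMMIT_SHORT_SHA .
    - docker push registry.example.com/app:$CI_COMMIT_SHORT_SHA

unit-and-integration-tests:
  stage: test
  image: registry.example.com/app:$CI_COMMIT_SHORT_SHA
  script:
    - make test   # placeholder for the project's test suite

deploy-production:
  stage: deploy
  script:
    - kubectl set image deployment/app app=registry.example.com/app:$CI_COMMIT_SHORT_SHA
  when: manual    # human gate: continuous delivery rather than deployment
```

Switching `when: manual` to an automatic rule is what moves a team from continuous delivery to continuous deployment.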
Another best practice is to encourage a culture of small incremental code changes and frequent merges among developer teams. Smaller changes are easier to integrate and roll back, and minimize the fallout if something goes wrong.

CIOs also need to institute a build-once policy at the start of the pipeline. This ensures that later phases of the CI/CD pipeline have a consistent build to work with. It also avoids any inconsistencies that can creep in when using multiple build tools.

Additionally, CIOs need to strike a balance between the extent of the testing regime they push code changes through and the speed of the pipeline itself. More rigorous testing regimes, while minimizing the chances of bad code being pushed to production, also have a time overhead.

CI/CD pipelines, even though they champion decentralization and agility, do still need to be governed by central IT for major feature releases. CIOs and CTOs need to ensure they strike a balance between governance and oversight from central IT and the agility and flexibility of distributed teams. They need to ensure a degree of oversight that, while allowing them control, does not impact the release velocity of software and teams.

Choosing the Right Kubernetes Distribution
Even though Kubernetes on its own is vastly feature rich, mission-critical enterprise workloads need to be supported by more feature-rich variants to provide the required service levels.

Managed Kubernetes
There are a number of managed Kubernetes offerings from public cloud providers that CIOs and CTOs can evaluate. These managed offerings take over some of the heavy lifting involved in managing upgrades, patches and HA.

Public cloud provider offerings do, however, restrict Kubernetes environments to a specific vendor and might not fit well with a future hybrid or multi-cloud strategy.
Commercial value-added Kubernetes distributions are also available from vendors like Red Hat, Docker, Heptio, Pivotal and Rancher. Below we outline some of the features CIOs and CTOs need to look for when choosing one.

Feature Set for Kubernetes Distributions
High availability and disaster recovery: CIOs and CTOs need to look for distributions that support high availability out of the box. This would include support for multi-master architectures and highly available etcd components, as well as backup and recovery.

Hybrid and multi-cloud support: Vendor lock-in is a very real concern for the modern enterprise. To ensure Kubernetes environments are portable, CIOs need to choose distributions that support a wide range of deployment models, from on-premise to hybrid and multi-cloud. Support for creating and managing multiple clusters is another feature that should be evaluated.

Management, upgrades and operational support: Managed Kubernetes offerings also need to be evaluated based on ease of setup, installation and cluster creation, as well as day 2 operations including upgrades, monitoring and troubleshooting. A baseline requirement should be support for fully automated cluster upgrades with zero downtime. The solution chosen should also allow upgrades to be triggered manually. Monitoring, health checks, cluster and node metrics, and alerts and notifications should also be a standard part.

Identity and access management: Identity and access management is important both in terms of security and governance. CIOs need to ensure that the Kubernetes distribution they choose supports integration with the authentication and authorization tools already used internally. RBAC and granular access control are also important features that should be supported.

Networking and storage: The Kubernetes networking model is highly configurable and can be implemented using a number of options. The distribution chosen should either have a native software-defined networking solution that covers the wide range of requirements imposed by different applications and infrastructure, or support one of the more popular CNI-based networking implementations, including Flannel, Calico, kube-router or OVN.
CIOs also need to ensure that the Kubernetes distribution they choose supports, at a minimum, either flexvolume or CSI integration with storage providers, as well as deployment on multiple cloud providers and on-premise.

Deploy, manage and upgrade applications: Kubernetes distributions being considered by CIOs also need to provide a comprehensive solution for deploying, managing and upgrading applications. A Helm-based application catalog that aggregates both private and public chart repositories should be a minimal requirement.

Storage
Kubernetes storage is a hard nut to crack. Kubernetes was initially designed to support stateless applications that do not use saved data across sessions. Kubernetes pods are meant to be ephemeral and are constantly created, destroyed and moved across nodes. Whenever pods are destroyed, Kubernetes volumes attached to these pods are also terminated.

Most legacy applications, including databases, however, are stateful and store large amounts of data for use across sessions. Since volumes can only be used to store temporary data, they are not well suited to these applications.

Kubernetes Persistent Volumes
In order to support stateful applications, Kubernetes introduced a new type of volume plugin called persistent volumes. The persistent volume resource decouples storage from the pod lifecycle and allows data to persist across pod restarts, making it a good candidate for stateful applications.

Software-Defined Storage (SDS)
Software-defined storage (SDS) solutions are a good bet for CIOs and CTOs looking for a Kubernetes storage solution. Kubernetes supports a number of these SDS providers as persistent volume plugins (StorageOS, CephFS, Portworx, GlusterFS, ScaleIO and Quobyte).

SDS solutions abstract storage from the underlying hardware and present it for consumption as shared storage pools.
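The persistent volume machinery described above is typically consumed through a claim against a storage class; as a minimal sketch (the class name, provisioner and size are illustrative and depend on the environment):

```yaml
# Illustrative dynamic provisioning: the claim references a
# StorageClass, and the cluster provisions a matching
# PersistentVolume on demand.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                        # hypothetical class name
provisioner: kubernetes.io/aws-ebs      # stand-in; depends on the environment
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 20Gi
```

A StatefulSet or Deployment then mounts the claim by name, and the data survives pod restarts and rescheduling.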
SDS solutions also abstract away the complexities of having to manage disparate storage devices and filesystem types. Built-in APIs allow consumers to manage and automate encryption, high availability, backups and replication.

Storage Best Practices
A best practice when choosing a storage solution for Kubernetes is to ensure it is distributed, resilient, durable and robust, with no single points of failure. The storage solution chosen should also support dynamic, on-demand provisioning with support for Kubernetes storage classes, and be easily scalable. Dynamic provisioning significantly reduces the management overhead of creating, managing and configuring persistent volumes and making them available for consumption in the cluster. Storage can be automatically provisioned whenever it is requested by users.

Another best practice is for CIOs and CTOs to evaluate SDS solutions based on support for high availability, automated replication and backups.

Additional features to look for in a storage provider are support for dynamic resizing, automated volume snapshots, and backup and restore. As with Kubernetes and the cloud-native toolset, the storage solution chosen should also be declarative and cloud-agnostic, with support for automated upgrades. Encryption and access control should also be required features.

With the emphasis on multi-cloud deployments, CIOs and CTOs should also ensure that any storage solution they choose is portable and cloud-agnostic. Storage solutions should be able to pool storage resources from disparate (cloud, on-prem etc.) hardware sources.

Performance is another aspect that needs to be considered. In most cases, the storage solution chosen will depend on the unique attributes of each environment and application requirements. Before undertaking a review of SDS solutions based on the features outlined above, CIOs and CTOs should benchmark the performance requirements of their environments and applications.

Networking
As with storage, networking is an important component of enterprise environments.
There are three important elements CIOs and CTOs need to consider when setting up networking for a Kubernetes environment: communication between application components (pod-to-pod communication), communication between pods and services, and communication with the outside internet.

Kubernetes Networking Model
The Kubernetes networking model allots a unique IP to each individual pod. By default, all pods belonging to a Kubernetes cluster can communicate with all other pods. This communication happens across namespaces, services and nodes.

The networking model also allows groups of pods that provide the same functionality (Services) to communicate with other pods. The Service abstraction decouples groups of dependent pods and allows applications to continue functioning in the event of pod restarts.

Kubenet, the default networking plugin from Kubernetes, provides some basic networking functionality but is limited when it comes to cloud environments. For a more feature-rich networking solution that can support mission-critical enterprise environments, CIOs and CTOs should consider one of the CNI-compatible networking plugins.

Feature Set for CNI Evaluation
Flannel, Calico, Canal, kube-router and Weave Net are some of the better-known CNI plugins. Below we review some of the features that CIOs and CTOs need to consider when choosing a CNI for their Kubernetes environment.

Support for Network Policy
Support for Kubernetes network policies is a crucial functionality that CIOs and CTOs should use to evaluate CNI plugins. Network policies allow DevOps to configure and control traffic to and from their applications. Network policies perform both a security and an access control function.

With the network policy resource, Kubernetes enables a shift-left approach where DevOps can configure network policies using the same concepts used to deploy applications.

By default, pods do not filter incoming traffic and there are no firewall rules. Network policies allow granular control over how traffic flows to and from pods.
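A minimal sketch of the default-deny pattern that network policies enable (the namespace and labels are illustrative, and enforcement requires a CNI plugin that implements network policy):

```yaml
# Illustrative policies: deny all ingress to pods in a namespace,
# then re-allow traffic to the backend only from frontend pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a          # hypothetical namespace
spec:
  podSelector: {}            # selects every pod in the namespace
  policyTypes:
    - Ingress                # no ingress rules listed => all ingress denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
```

Starting from default-deny and explicitly whitelisting flows mirrors conventional firewall practice and keeps the policy set auditable.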
ClusterIP, NodePort, LoadBalancer and ExternalName are the available Service types; of these, NodePort and LoadBalancer allow external traffic into the cluster. The LoadBalancer service type is the standard way to expose services to the internet. It does, however, require a supported cloud provider to be present, and can get expensive since each exposed service gets its own IP address.

For most enterprise environments, Kubernetes Ingress is the recommended method to expose services to the internet. Ingress handles load balancing at Layer 7 and officially supports the nginx and GCE ingress controllers. There are a number of additional Ingress controllers, including Contour and Istio, that CIOs and CTOs can look into. Ingress is more feature-rich as compared to the LoadBalancer service type and is also a less expensive option.

Security, Identity and Access Control
CIOs and CTOs need to ensure that security is a part of the entire application lifecycle and encompasses all layers.

Access Control
With the cloud-native movement, identity and access control have become increasingly important in the context of security. Native Kubernetes authentication, authorization and admission controllers allow CIOs and CTOs to draw a security perimeter around their environments, identify users and processes, and govern the resources they are allowed to access.

There are two ways requests can be authenticated in Kubernetes: normal accounts and service accounts. Normal accounts usually correspond to user accounts and are managed by an outside third-party service. A best practice for CIOs and CTOs is to enable multiple authentication methods: one for user accounts (either OpenID Connect or X509 client certificates) and one for service accounts (service account tokens).
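As a sketch of what enabling OpenID Connect for user accounts can look like, the relevant flags are passed to the kube-apiserver; the issuer URL and client ID below are placeholders, and the snippet is an excerpt from a static pod manifest rather than a complete one:

```yaml
# Excerpt (not a complete manifest): kube-apiserver flags that
# enable OIDC for user accounts; service account tokens remain
# enabled alongside it by default.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --oidc-issuer-url=https://accounts.example.com   # placeholder issuer
        - --oidc-client-id=kubernetes                      # placeholder client ID
        - --oidc-username-claim=email
        - --oidc-groups-claim=groups
```

The groups claim is what later lets RBAC bindings target identity-provider groups rather than individual users.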
Role-based access control (RBAC) allows CIOs and CTOs to regulate and govern access to resources based on the roles of individual users.

Cluster roles grant permissions for the entire cluster across all namespaces. CIOs and CTOs should ensure that cluster roles are only granted to trusted users or groups of users. The cluster-admin role specifically has a very wide range of permissions to perform actions and has access to all resources. A best practice, therefore, is to avoid granting the cluster-admin role as much as possible.

Roles are by default restricted to specific namespaces and should be preferred over cluster roles whenever possible. For this to work, a best practice is to wall off groups of resources into individual namespaces for teams, departments, clients, applications etc. Roles can then be used to implement fine-grained access control by specifying the apiGroup of the resource, the resource itself (e.g. pods) and the operations that can be performed. Roles are granted to individual users or groups of users using role bindings.

A best practice in the context of RBAC is to follow user-access best practices and keep the scope of permissions small. CIOs and CTOs do, however, need to consider the increased management overhead that comes with a fine-grained RBAC policy.

Continuous Security Scanning
CIOs and CTOs also need to encourage a culture of periodic security scanning of container images. This can be accomplished using tools like Clair and Anchore. Periodic scanning will identify any common vulnerabilities and exposures (CVEs) in container images.

To make this a continuous process, a best practice is to bake security and vulnerability checks and tools into the CI/CD pipeline. Any new code, as well as the container images built using this code, should be checked for CVEs as part of the CI/CD pipeline. CIOs and CTOs should also discourage the use of unknown images from public repositories and prioritize the use of private registries.

The AlwaysPullImages admission controller is another way to ensure images are always pulled with the correct authorization and cannot be reused by pods.
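The namespaced Role and RoleBinding pattern described above can be sketched as follows (the namespace, group name and permission set are illustrative):

```yaml
# Illustrative fine-grained RBAC: read-only access to pods in a
# single namespace, granted to a hypothetical team group.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: team-a
rules:
  - apiGroups: [""]           # "" = the core API group (pods live here)
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers   # placeholder group from the identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

Because both objects are namespaced, the permissions cannot leak beyond team-a, which is exactly the small-scope posture recommended above.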
Cost Management
Freedom, agility, self-service and the ability to move fast are core principles of modern DevOps. They are also baked into most cloud-native tooling, including Kubernetes. These concepts allow distributed DevOps teams to provision and consume resources with minimal oversight from central IT. This can at times lead to increased wastage of resources and ballooning costs. CIOs and CTOs therefore need to implement a comprehensive resource governance and cost management framework to control costs.

Resource Governance using Namespaces
Kubernetes namespaces are a great way to implement low-level resource governance mechanisms. Creating separate namespaces for teams will allow CIOs to control the resource consumption of individual containers as well as the total resource consumption of all containers that belong to the namespace. They can also ensure that all containers run with default limits and control the total number of pods or other Kubernetes objects allowed to run.

Cluster Autoscaler
Using a cluster autoscaler is another best practice in the context of resource governance and cost management. The cluster autoscaler right-sizes Kubernetes clusters based on utilization metrics of individual nodes. Nodes seeing sustained low utilization are taken out of the cluster pool by the cluster autoscaler.

Trimming down clusters to reduce resource wastage is a good way to reduce cloud provider bills and ensure efficient resource usage. Another good way to control costs is to right-size nodes that see sustained levels of low utilization. This will ensure that the resource footprint of the node matches that required by the pods running on top.

Horizontal and Vertical Pod Autoscaler
CIOs and CTOs should also ensure the use of both the horizontal (HPA) and vertical pod autoscalers (VPA) to govern resource usage and cost management.

The HPA is a native Kubernetes resource that increases or decreases the number of pod replicas for an application based on observed metrics such as CPU utilization.
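Two of the governance mechanisms above can be sketched in manifests; the namespace name, quota figures and scaling thresholds are illustrative, and the HPA uses the autoscaling/v2 API:

```yaml
# Illustrative ResourceQuota: caps total requests, limits and pod
# count for everything in the team-a namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
---
# Illustrative HPA: scales a hypothetical web-frontend deployment
# between 2 and 10 replicas to hold roughly 70% CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-frontend
  namespace: team-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-frontend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Together, the quota bounds what a team can consume while the autoscaler keeps actual usage close to what the workload needs.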
Version Control
The most important cog in the GitOps workflow is a version control system like Git. Git stores the desired system state (clusters, applications and infrastructure) as declarative, version-controlled configuration files. Any changes to production environments happen via Git commits.

Kubernetes, like most modern cloud-native tools, is declarative in nature and as such is perfectly suited to a GitOps workflow that treats declarative configuration files stored in Git as the single source of truth.

A minimal GitOps workflow includes Git for version control, a CI tool for unit and integration tests, a private image repository, and a deployment and release automation tool like Flux.

The workflow starts by pushing code changes to Git, making a pull request, and reviewing and merging the code. Once code is merged, the CI tool runs the changes through an automated unit and integration test suite, builds a new image and deposits it in a registry. The Flux tool, which continuously monitors the registry, detects those updates and automatically deploys them to the Kubernetes cluster.

Unlike regular CD tools, GitOps also compares the actual state of the production system with the one under version control and sends out alerts whenever it discovers a divergence. It also triggers a convergence mechanism that brings the observed and desired states into sync.

Environment as Code
Another GitOps best practice is to adopt a broader "environment as code" approach. Environment as code extends version control to Kubernetes clusters, infrastructure and observability tooling.

Configuration files for clusters, infrastructure and observability tooling are version controlled in Git. Since the entire system (clusters, applications, infrastructure and observability tools) is version controlled, it is consistent and can be easily reproduced, enabling faster disaster recovery as well as making rollouts and rollbacks more seamless.
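As an illustrative sketch of the convergence loop described above, using the current Flux toolkit CRDs (which postdate the original Flux tool; the repository URL and path are placeholders):

```yaml
# Hypothetical Flux configuration: watch a Git repository and keep
# the cluster converged on the manifests it contains.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: app-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://git.example.com/org/app-config   # placeholder repository
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: app
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: app-config
  path: ./deploy          # placeholder path to the manifests
  prune: true             # delete cluster objects removed from Git
```

With `prune: true`, objects deleted from Git are also deleted from the cluster, so the repository remains the single source of truth in both directions.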
Audits and post-mortems are also easier, with the Git log serving as an audit trail.

GitOps also facilitates DevOps and SRE teams by making development and operations activities part of the same workflow. Developers can easily internalize operations tasks using the version-controlled observability stack, conducting operations tasks and fixing production issues as pull requests rather than making changes to the running system.

GitOps Best Practices
Below we review some of the best practices that enable a GitOps workflow:

• Use automated difference tools to monitor divergence between the actual production system and the one under version control
• Use IaC tools (Terraform, Ansible) to create server configuration files and keep them under version control
• Use the principles of immutable infrastructure, containers and diff tools (kubediff, ansiblediff and terradiff) to reduce configuration drift and ensure the desired system state is maintained
• Version control your observability stack
• Use native Kubernetes constructs for rolling updates.
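The last bullet, native rolling updates, can be sketched as a Deployment strategy; the surge and unavailability budgets below are illustrative:

```yaml
# Illustrative rolling-update settings on a Deployment: Kubernetes
# replaces pods incrementally, keeping the app available throughout.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend   # hypothetical application name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most one extra pod during the rollout
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: nginx:1.25   # stand-in image
```

Because the image tag is part of the version-controlled manifest, a rollback is simply a Git revert that the GitOps tooling then converges on.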
Get in touch
replex.io | sales@replex.io
AUTHOR
Hasham Haider
Fan of all things cloud, containers and micro-services!
*The information provided within this eBook is for general informational purposes only. While we try to keep the
information up-to-date and correct, there are no representations or warranties, express or implied, about the
completeness, accuracy, reliability, suitability or availability with respect to the information, products, services, or
related graphics contained in this eBook for any purpose. Any use of this information is at your own risk.