Kubernetes Ebook
For the better part of a decade, IT has watched as developers provisioned object storage for emerging
applications on the public cloud - driving much of the adoption of this medium.
This creates many well-known issues for IT. It is not simply a control issue; it is a broader and much more
critical governance issue with regard to security, compliance, budget and overall alignment.
The primary driver for developers turning to the public cloud was simply that IT couldn’t provision multi-tenant
object storage as a service. While IT was adept at archival object storage and was able to protect the crown
jewels when it came to data, they simply didn’t have the skill set to create, deploy, tune, scale and manage
modern, application oriented object storage using Kubernetes.
MinIO is purpose-built to take full advantage of the Kubernetes architecture. Created from scratch in the last
five years, MinIO has known nothing but containers and orchestration - it is simply how we think. As a result,
MinIO and Kubernetes work together to simplify infrastructure management, providing a way to manage
object storage infrastructure within the Kubernetes toolset.
The new Operator and Operator Console graphical user interface are an important evolution in our approach.
They solve a key problem for IT (getting them going on Kubernetes) while further simplifying object storage
for developers - without sacrificing granularity or control in the process.
The operator pattern extends Kubernetes' familiar declarative API model with custom resource definitions
(CRDs) to perform common operations such as resource orchestration, non-disruptive upgrades, cluster
expansion and high-availability maintenance - operations that were previously handled in a Helm chart.
There are two components at play here: Operator and Operator Console.
The Operator is inherently a command line proposition, but merely providing an Operator wasn’t our goal.
MinIO goes further to simplify creation, deployment and management of Kubernetes native object storage
with a straightforward list of commands that make it easy to execute all of the key capabilities outlined above.
The Operator Console makes Kubernetes object storage easier still. In this graphical user interface, MinIO created
something so simple that anyone in the organization can create, deploy and manage object storage as a service.
The primary unit of managing MinIO on Kubernetes is the tenant. The best way to think about tenancy is to start with the
Kubernetes cluster. The MinIO Operator can allocate multiple tenants within the same Kubernetes cluster. Each tenant, in
turn, can have different capacity (e.g. a small 500GB tenant vs a 100TB tenant), resources (1000m CPU and
4Gi RAM vs 4000m CPU and 16Gi RAM) and servers (4 pods vs 16 pods), as well as separate configurations
for Identity Providers, Encryption and versions.
In multi-tenant configurations, each tenant is a cluster of server pools (independent sets of nodes with their
own compute, network, and storage resources) that, while sharing the same physical infrastructure, are fully
isolated from each other in their own namespaces. Each tenant runs its own MinIO cluster, fully isolated
from other tenants, which protects it from any disruption caused by upgrades, updates or security incidents
elsewhere. Each tenant scales independently by federating clusters across geographies.
Since the server binary is fast and lightweight, MinIO's operator is able to densely co-locate several tenants
and use resources efficiently.
In the spirit of Kubernetes everywhere, MinIO runs on any public cloud provider's Kubernetes offering, such as
Amazon EKS (Elastic Kubernetes Service), Google GKE (Google Kubernetes Engine), Google Anthos or Azure AKS
(Azure Kubernetes Service).
With the introduction of the Operator and the browser-based Operator Console, MinIO has delivered a
material upgrade to its already strong Kubernetes story. Now, without even knowing how to spell Kubernetes,
IT administrators can provision multi-tenant object storage as a service across hybrid cloud environments.
Get started and download MinIO! We have a tutorial, Simplifying Object Storage as a Service with Kubernetes
and MinIO’s Operator, that can help you take the first steps. As always, if you have any questions, join our Slack channel.
Object storage as a service is the hottest concept in storage today. The reason is straightforward: object
storage is the storage class of the cloud and the ability to provision it seamlessly to applications or
developers makes it immensely valuable to enterprises of any size.
The challenge is that object storage as a service has traditionally been very difficult to deliver: overly
complex, hard to tune for performance, and prone to failure at scale. While systems like Kubernetes offer
powerful tools for automating the deployment and management of these systems, the overall problem of
complexity remains unsolved as administrators must still invest significant time and effort to deploy even a
small scale object storage resource.
By combining Kubernetes with our new Operator and our Operator Console graphical user interface, MinIO is
changing that dynamic in a big way. It should be stated upfront that MinIO has always obsessed over
simplicity. It permeates everything we do, every design decision we make, every line of code we write.
Nonetheless, we saw even more opportunity for simplification. To do this we created the MinIO Operator and
the MinIO kubectl plugin to facilitate the deployment and management of MinIO Object Storage on
Kubernetes. While the Operator commands were critical for users already proficient with Kubernetes, we also
wanted to address a wider audience so we created a Graphical User Interface for the Operator and
incorporated it into our new MinIO Operator Console to enable anyone in the organization to create, deploy
and manage object storage as a service.
Kubernetes is the platform of the Internet. Given its massive adoption, we chose to remain consistent with the
Kubernetes way of doing things. This meant not using any specialized tools or services to set up MinIO.
The effect is that the MinIO Operator works on any Kubernetes distribution, be it OpenShift, vSphere 7.0u1,
Rancher or stock upstream. Further, MinIO will work on any public cloud provider's Kubernetes service, such as
Amazon EKS, Google GKE or Azure AKS.
Pretty much all you need to get started on any distribution of Kubernetes is some storage device that can be
presented to Kubernetes either via Local Persistent Volumes or with a CSI Driver.
Let’s start with a review on using MinIO with the kubectl plugin and a kustomize based approach. You'll need
to install the kubectl tool on a computer with network access to the Kubernetes cluster. See Install and Set Up
kubectl for installation instructions. You may need to contact your Kubernetes administrator for assistance in
configuring your kubectl installation for access to the Kubernetes cluster.
Installation
kubectl plugin
To install the MinIO Operator we can leverage its kubectl plugin, which can be installed via krew:
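A minimal sketch, assuming krew is already installed on the workstation:

$ kubectl krew update
$ kubectl krew install minio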
Alternatively, for anyone who prefers a kustomize based approach, our repository supports installing specific
tags; of course, you can also use this as the base for your kustomization.yaml file.
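A sketch of the kustomize route; the tag shown here is only an example, so substitute the release you actually want to pin:

$ kubectl apply -k "github.com/minio/operator?ref=v4.5.8"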
The analogy we used to represent a MinIO Object Storage cluster is the Tenant. We did this to communicate that
with the MinIO Operator one can allocate multiple Tenants within the same Kubernetes cluster. Each tenant, in
turn, can have different capacity (e.g. a small 500GB tenant vs a 100TB tenant), resources (1000m CPU and
4Gi RAM vs 4000m CPU and 16Gi RAM) and servers (4 pods vs 16 pods), as well as separate configurations
for Identity Providers, Encryption and versions.
Let's start by creating a small tenant with 16Ti capacity across 4 nodes. We will first create a namespace for
the tenant to be installed called `minio-tenant-1` and then place the tenant there using the `kubectl minio
tenant create` command.
--storage-class standard
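The flag above is the tail of a longer invocation; a sketch of the full command, with server, volume and capacity values assumed from the description (4 nodes, 16Ti, two volumes per server):

$ kubectl create namespace minio-tenant-1
$ kubectl minio tenant create minio-tenant-1 \
  --servers 4 \
  --volumes 8 \
  --capacity 16Ti \
  --namespace minio-tenant-1 \
  --storage-class standard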
This command will output the credentials needed to connect to this tenant. MinIO only displays these
credentials once, so make sure you copy them to a secure location.
Note: Copy the credentials to a secure location. MinIO will not display these
again
+-------------+------------------------+----------------+--------------+-----------------+
| APPLICATION | SERVICE NAME           | NAMESPACE      | SERVICE TYPE | SERVICE PORT(S) |
+-------------+------------------------+----------------+--------------+-----------------+
| MinIO       | minio                  | minio-tenant-1 | ClusterIP    | 443             |
| Console     | minio-tenant-1-console | minio-tenant-1 | ClusterIP    | 9090,9443       |
+-------------+------------------------+----------------+--------------+-----------------+
Usually a tenant takes a few minutes to provision while the MinIO Operator requests TLS certificates for MinIO
and the Operator Console via Kubernetes Certificate Signing Requests. You can check the progress by doing:
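A couple of generic ways to watch the rollout; the namespace is the one created above:

$ kubectl get csr -w
$ kubectl get pods -n minio-tenant-1 -w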
That's it: our Object Storage cluster is up and running, and we can access it via kubectl port-forward. To access
MinIO's Console:
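A sketch using the Console service from the table above (9443 is the HTTPS port listed there):

$ kubectl port-forward svc/minio-tenant-1-console -n minio-tenant-1 9443:9443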
But now let's stop, rewind and remix to add a tenant using the MinIO Console for Operator (a.k.a. the Operator
UI). To access it we can simply run the kubectl minio proxy command. This will tell us how to access the
Operator UI.
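A minimal sketch, assuming the Operator was installed in its default minio-operator namespace:

$ kubectl minio proxy -n minio-operator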
Inside the Operator UI we can see the tenant that we provisioned previously using the kubectl plugin.
To add another one, hit Create Tenant. The first screen will ask a few configuration questions:
If you wish to configure an Identity Provider, TLS Certificates, Encryption or Resources for this tenant I invite
you to play with the Sections on the left where these configuration options reside.
In this screen you can size your tenant with the number of servers, number of drives per server and desired
raw capacity. Additionally, you can get a preview of the usable capacity and the SLA guarantees for each
erasure coding parity value you pick.
Going back to the list of tenants we can see our original cli-provisioned tenant next to the tenant created
using the Operator UI. These processes are equivalent. It is only personal preference as to which you select.
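The tenant definition can also be inspected with kubectl; a sketch, assuming a tenant named bigdata-storage in the default namespace:

$ kubectl get tenants.minio.min.io bigdata-storage -n default -o yaml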
Which returns
apiVersion: minio.min.io/v2
kind: Tenant
metadata:
  name: bigdata-storage
  namespace: default
spec:
  credsSecret:
    name: bigdata-storage-secret
  env:
  - name: MINIO_STORAGE_CLASS_STANDARD
    value: EC:8
  exposeServices:
    console: true
    minio: true
  image: minio/minio:RELEASE.2022-01-08T03-11-54Z
  imagePullSecret: { }
  log:
    audit:
      diskCapacityGB: 10
    image: minio/logsearch:v4.4.3
    resources: { }
  mountPath: /export
  pools:
I encourage you to try the MinIO Operator yourself and explore other cool features such as using the
Prometheus Metrics and Audit Log, or securing your MinIO Tenant with an external Identity Provider such as
LDAP/Active Directory or an OpenID provider.
No matter what approach you take, the ability to provision multi-tenant object storage as a service is now
within the skill set of a wide range of IT administrators, developers and architects.
We were talking with a well-respected industry analyst the other day and he challenged us to articulate why
Kubernetes is so important to Object Storage. It got us thinking that this was a topic worthy of our time, and
yours.
At the most basic level, the value of Kubernetes lies in its ability to treat infrastructure as code, delivering full
scale automation to both stateful and stateless components of the software stack.
To derive the maximum amount of value requires treating the maximum number of components as code and
orchestrating those. That means you put EVERYTHING into the container, including applications,
infrastructure and data.
In the modern world, applications are stateless and containerized. Still, their state has to be held somewhere.
That somewhere is object storage (not legacy block and file) and that object storage needs to run IN the
container. When done this way Kubernetes can manage the automation of the infrastructure - both stateful
and stateless.
If the object store is left to bare metal or public cloud storage services, the benefits of Kubernetes based
infrastructure orchestration are considerably diminished.
Another way to think about it is through a VMware analogy. VMware created the concept of the software
defined datacenter. This was a predecessor to Kubernetes (which is why they claim it as their birthright). To
get the true value of SDDC, you have to virtualize the entire datacenter. If some of the applications are left
behind to run on bare metal, SDDC benefits are lost.
The same is true for Kubernetes. If you only use Kubernetes for the applications, you are only tapping a
fractional amount of the value. Let’s explore this a little deeper.
First off, in the modern model, CPU, Network and Storage are physical layers to be abstracted by Kubernetes.
They have to be abstracted so that applications and data stores can run as containers anywhere. In particular,
the data stores include all persistent services (databases, message queues, object stores..).
From the Kubernetes perspective, object stores are not different from any other key value stores or
databases. The storage layer is reduced to physical or virtual drives underneath. The need to run persistent
services inside Kubernetes is exactly what led VMware to build its Data Persistence platform (DPp); this
VMware post announcing the reason they built it is an excellent resource.
DPp is the answer to the question “how can we allow modern applications to do what they do best, but still
provide the ease of use and transparent operations of the VMware platform to admins and developers?”
Modern applications, in particular, those built to run on Kubernetes, are designed to take care of availability,
replication, scaling and encryption within themselves to become completely independent of the
infrastructure. In turn storage needs to run IN the container in order to deliver Observability, Data Placement,
Maintenance Operations, and Failure Handling.
This was not always the case. Traditionally, applications relied on databases to store and work with structured
data, and storage, such as local drives or distributed file systems, to house all of their unstructured and even
semi-structured data. However, the rapid rise in unstructured data challenged this model. As developers
quickly learned, POSIX was too chatty, had too much overhead to allow the application to perform at scale
and was confined to the data center as it was never meant to provide access across regions and continents.
This led them to object storage, which is designed for RESTful APIs (as pioneered by AWS S3). Now
applications were free of any burden to handle local storage, making them effectively stateless (as the state
is with the remote storage system).
Modern applications are built ground up with this expectation. Well-designed modern applications that deal
with some kind of data (logs, metadata, blobs, etc), conform to the cloud-native (RESTful API) design
principle by saving the state to a relevant storage system.
As a quick side note, REST APIs only address application-storage communication challenges such as PUT and
GET or READ/WRITE data, and tracking metadata and version data, but not container orchestration and
automation. That requires Kubernetes.
SAN and NAS can also make application containers stateless - but POSIX based File and Block are hopelessly
inflexible in a containerized environment, where application workers need to grow and shrink based on
inbound load, move to a new node as soon as a current node goes down, and so on. This is why object
storage has become the storage class of the containerized, cloud-native world.
This is not to say that storage applications, e.g. databases, object stores, key value stores, must be stateless.
On the contrary, they need to be stateful - they just shouldn’t have the effect of making the application
stateful in the process.
Why MinIO
Kubernetes native storage applications (like MinIO) are designed to leverage the flexibility containers bring.
Agile and DevOps best practices dictate that applications and CI/CD processes be simple and
straightforward, independent of underlying infrastructure and consistent in how they access that
infrastructure. Simply put, containers need to run the same way everywhere in order to be portable across
development, test, and production. Combining that with variable hardware infrastructures, it makes sense for
Kubernetes to be the point of contact between all the disaggregated infrastructures, applications and data
stores.
Therefore, storage applications cannot make assumptions about the environment in which they are deployed.
For example, MinIO uses an internal erasure coding mechanism to ensure there is adequate redundancy in the
system, across varying hardware and cloud infrastructures, to allow up to half of the drives to fail. MinIO also
manages the data integrity and security using its own hashing and server side encryption.
In the Kubernetes world, functions are simplified and abstracted: applications do application things and
storage does storage things. The application doesn’t have to think about it - it just happens, all inside a
container that can be expanded, moved or wiped out.
There are certainly non-cloud native ways. For example, you could solve this problem with the Container
Storage Interface (CSI), but sophisticated architects and developers don't, because it adds needless
complexity and scalability challenges. This is because CSI-based PVs bring their own management and
redundancy layers which generally compete with the stateful application's design.
Take the following example of how cloud-native platforms work with storage and state. Apache Spark, in the
cloud-native world, runs in a stateless manner on Kubernetes and ships state to other systems while Spark
containers themselves are running completely stateless. Other major enterprise players in the big data
analytics space like Vertica, Teradata, Greenplum are also moving to a disaggregated model of compute and
storage.
Similarly, all the other major analytics platforms, from Presto and TensorFlow to R and Jupyter notebooks, follow such
patterns. Offloading state to remote cloud storage systems makes your application much easier to scale and
manage. Additionally, it helps keep the application portable to different environments.
We continue to refine our approach. For example, our widely adopted Helm chart approach was not enough
to cross the chasm from our DevOps audience to the mainstream IT administrator audience. Our previous
implementation effectively dealt with a single tenant. For multi-tenancy and other DevOps tasks like
provisioning, scaling, upgrades/updates, monitoring and encryption services - this required custom code.
Our new Kubernetes Operator helps our clients cross the chasm. Building a multi-tenant, self-service object
storage infrastructure on top of MinIO required a significant amount of skills and custom code development.
With the introduction of the Operator, such tasks are automated and API / Web driven. Now MinIO is a full
blown multi-tenant, self-service cloud storage on top of Kubernetes. The Operator and Console put the power
of Kubernetes-native, object-storage-as-a-service into the hands of IT - without requiring CLI or scripting
skills.
MinIO Everywhere
When we started talking about the concept of #minioeverywhere it was to illustrate our integrations with the
cloud-native elite. Now, however, #minioeverywhere speaks to the fact that MinIO, in conjunction with
Kubernetes, runs everywhere.
This can be lost on some given its nuance. Because of key economic and technical hurdles among the public
cloud providers, it is increasingly attractive to use MinIO/Kubernetes across all infrastructures.
For example, public clouds are not interchangeable. AWS S3 does not equal Blob (Azure) and certainly does
not equal GCP (marginally S3 compatible). Also, in the public cloud, bandwidth is more expensive than
storage and latency is high. Smoothing these differences is a very expensive proposition.
Enterprises are adopting MinIO as a core part of their software stack (applications AND storage) because
they can roll it anywhere. AWS, GCP, Azure, Tanzu, OpenShift - the list goes on. Because MinIO is Kubernetes
native, it runs the same way in every one of them.
There is a lot here so let’s summarize quickly. Kubernetes' value lies in its ability to treat infrastructure as
code, delivering full scale automation to both stateful and stateless components of the software stack.
The value of Kubernetes is only achieved if you can get the maximum number of components inside the
container. This includes storage/persistent data.
MinIO is built for this - it easily fits in containers (~45MB), it is designed for RESTful APIs and continues to
evolve its approach (see MinIO Operator) to deliver the most native Kubernetes experience when it comes to
storage.
When you are native to Kubernetes you can run anywhere it does - and today, that is everywhere you care
about running - public cloud, private cloud, Kubernetes distribution and edge.
Don’t take our word for it. See for yourself. You can pull the MinIO Operator for Kubernetes code from Github.
Questions? Join the conversation on our Slack channel, or hit the Ask an Expert button and get started today.
Welcome to the third and final installment of our MinIO and CI/CD series. So far, we’ve discussed the basics of
CI/CD concepts and how to build MinIO artifacts and how to test them in development. In this blog post, we’ll
focus on Continuous Delivery and MinIO. We’ll show you how to deploy a MinIO cluster in a production
environment using infrastructure as code to ensure anyone can read the resources installed and apply version
control to any changes.
MinIO is very versatile and can be installed in almost any environment. This lets developers work on a laptop
against the same environment they use in production, following the CI/CD concepts and pipelines we
discussed. We showed you previously how to install MinIO as a Docker container
and even as a systemd service. Today we’ll show you how to deploy MinIO in distributed mode in a production
Kubernetes cluster using an operator. We’ll use Terraform to deploy the infrastructure first, then we’ll deploy
the required MinIO resources.
MinIO Network
First we’ll use Terraform to build the basic network needed for our infrastructure to get up and running. We
are going to set up a VPC networking with 3 basic commonly used networking types. Within that network we’ll
launch a Kubernetes cluster where we can deploy our MinIO workloads. The structure of our Terraform
modules would look something like this
modules
├── eks
│ ├── main.tf
│ ├── outputs.tf
│ └── variables.tf
└── vpc
├── main.tf
├── outputs.tf
└── variables.tf
In order for the VPC to have different networks, each subnet requires a unique, non-overlapping CIDR block.
For a handful of subnets this is pretty easy to calculate by hand, but for many subnets like we
have here, Terraform provides a handy function cidrsubnet() to split the subnets for us based on a larger
subnet we provide, in this case 10.0.0.0/16.
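For instance, you can check what cidrsubnet() produces in a terraform console session; the values below simply illustrate the function with the CIDR block used here:

$ terraform console
> cidrsubnet("10.0.0.0/16", 4, 1)
"10.0.16.0/20"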
variable "minio_aws_vpc_cidr_block" {
description = "AWS VPC CIDR block"
type = string
default = "10.0.0.0/16"
}
variable "minio_aws_vpc_cidr_newbits" {
description = "AWS VPC CIDR new bits"
type = number
default = 4
}
vpc/variables.tf#L1-L11
Define the VPC resource in Terraform. Any subnet created will be based on this VPC.
cidr_block = var.minio_aws_vpc_cidr_block
instance_tenancy = "default"
enable_dns_hostnames = true
vpc/main.tf#L1-L7
The Public Network with Internet Gateway (IGW) will have inbound and outbound internet access with a
public IP and an Internet Gateway.
variable "minio_public_igw_cidr_blocks" {
type = map(number)
description = "Availability Zone CIDR Mapping for Public IGW subnets"
default = {
"us-east-1b" = 1
"us-east-1d" = 2
"us-east-1f" = 3
}
}
vpc/variables.tf#L15-L24
The aws_subnet resource will loop 3 times, creating 3 public subnets in the VPC
for_each = var.minio_public_igw_cidr_blocks
vpc_id = aws_vpc.minio_aws_vpc.id
cidr_block = cidrsubnet(aws_vpc.minio_aws_vpc.cidr_block, var.minio_aws_vpc_cidr_newbits, each.value)
availability_zone = each.key
map_public_ip_on_launch = true
}
vpc_id = aws_vpc.minio_aws_vpc.id
for_each = aws_subnet.minio_aws_subnet_public_igw
subnet_id = each.value.id
route_table_id = aws_route_table.minio_aws_route_table_public_igw.id
}
vpc/main.tf#L11-L46
The Private Network with NAT Gateway (NGW) will have outbound network access, but no inbound network
access, with a private IP address and NAT Gateway.
variable "minio_private_ngw_cidr_blocks" {
type = map(number)
description = "Availability Zone CIDR Mapping for Private NGW subnets"
default = {
"us-east-1b" = 4
"us-east-1d" = 5
"us-east-1f" = 6
}
}
vpc/variables.tf#L26-L35
The aws_subnet resource will loop 3 times, creating 3 private NGW subnets in the VPC
for_each = var.minio_private_ngw_cidr_blocks
vpc_id = aws_vpc.minio_aws_vpc.id
cidr_block = cidrsubnet(aws_vpc.minio_aws_vpc.cidr_block, var.minio_aws_vpc_cidr_newbits, each.value)
availability_zone = each.key
}
vpc_id = aws_vpc.minio_aws_vpc.id
for_each = aws_subnet.minio_aws_subnet_private_ngw
subnet_id = each.value.id
route_table_id = aws_route_table.minio_aws_route_table_private_ngw.id
}
vpc/main.tf#L50-L98
Finally, we create an Isolated and Air-gapped network with neither outbound nor inbound internet access.
This network is completely air gapped with only a private IP address.
variable "minio_private_isolated_cidr_blocks" {
type = map(number)
description = "Availability Zone CIDR Mapping for Private isolated subnets"
default = {
"us-east-1b" = 7
"us-east-1d" = 8
"us-east-1f" = 9
}
}
vpc/variables.tf#L37-L46
The aws_subnet resource will loop 3 times, creating 3 isolated/air-gapped subnets in the VPC
for_each = var.minio_private_isolated_cidr_blocks
vpc_id = aws_vpc.minio_aws_vpc.id
cidr_block = cidrsubnet(aws_vpc.minio_aws_vpc.cidr_block, var.minio_aws_vpc_cidr_newbits, each.value)
availability_zone = each.key
}
vpc_id = aws_vpc.minio_aws_vpc.id
for_each = aws_subnet.minio_aws_subnet_private_isolated
subnet_id = each.value.id
route_table_id = aws_route_table.minio_aws_route_table_private_isolated.id
}
vpc/main.tf#L102-L123
Create a Kubernetes cluster on which we’ll deploy our MinIO cluster. The
minio_aws_eks_cluster_subnet_ids will be provided by the VPC that we’ll create. Later, we’ll show how
to stitch all this together in the deployment phase.
variable "minio_aws_eks_cluster_subnet_ids" {
description = "AWS EKS Cluster subnet IDs"
type = list(string)
}
variable "minio_aws_eks_cluster_name" {
description = "AWS EKS Cluster name"
type = string
default = "minio_aws_eks_cluster"
}
variable "minio_aws_eks_cluster_endpoint_private_access" {
description = "AWS EKS Cluster endpoint private access"
type = bool
default = true
}
variable "minio_aws_eks_cluster_endpoint_public_access" {
description = "AWS EKS Cluster endpoint public access"
type = bool
variable "minio_aws_eks_cluster_public_access_cidrs" {
description = "AWS EKS Cluster public access cidrs"
type = list(string)
default = ["0.0.0.0/0"]
}
eks/variables.tf#L1-L28
Note: In production you probably don’t want public access to the Kubernetes API endpoint, as it exposes
control of the cluster and could become a security issue.
You will also need a couple of roles to ensure the Kubernetes cluster can communicate properly via the
networks we’ve created, and those are defined at eks/main.tf#L1-L29. The Kubernetes cluster definition is as
follows
vpc_config {
subnet_ids = var.minio_aws_eks_cluster_subnet_ids
endpoint_private_access = var.minio_aws_eks_cluster_endpoint_private_access
endpoint_public_access = var.minio_aws_eks_cluster_endpoint_public_access
public_access_cidrs = var.minio_aws_eks_cluster_public_access_cidrs
}
depends_on = [
aws_iam_role.minio_aws_iam_role_eks_cluster,
]
eks/main.tf#L31-L46
The cluster takes in the API requests made from commands like kubectl, but there’s more to it than that –
the workloads need to be scheduled somewhere. This is where a Kubernetes cluster node group is required.
Below, we define the node group name, the type of instance and the desired group size. Since we have 3 AZs,
we’ll create 3 nodes, one for each of them.
variable "minio_aws_eks_node_group_instance_types" {
description = "AWS EKS Node group instance types"
type = list(string)
default = ["t3.large"]
}
variable "minio_aws_eks_node_group_desired_size" {
description = "AWS EKS Node group desired size"
type = number
default = 3
}
variable "minio_aws_eks_node_group_max_size" {
description = "AWS EKS Node group max size"
type = number
default = 5
}
variable "minio_aws_eks_node_group_min_size" {
description = "AWS EKS Node group min size"
type = number
default = 1
}
eks/variables.tf#L30-L58
You need a couple of roles to ensure the Kubernetes node group can communicate properly, and those are
defined at eks/main.tf#L48-L81. The Kubernetes node group (workers) definition is as follows:
scaling_config {
depends_on = [
aws_iam_role.minio_aws_iam_role_eks_worker,
]
eks/main.tf#L83-L100
This configuration will launch a control plane with worker nodes in any of the 3 VPC networks we configured.
We’ll show later the kubectl get no output once the cluster is launched.
MinIO Deployment
By now, we have all the necessary infrastructure in code form. Next, we’ll deploy these resources and create
the cluster on which we’ll deploy MinIO.
Create an AWS IAM user with the following policy. Note the AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY after creating the user.
$ export AWS_ACCESS_KEY_ID=<access_key>
$ export AWS_SECRET_ACCESS_KEY=<secret_key>
Create a folder called hello_world in the same directory as modules using the structure below
├── hello_world
│ ├── main.tf
│ ├── outputs.tf
│ ├── terraform.tfvars
│ └── variables.tf
├── modules
│ ├── eks
│ └── vpc
https://github.com/minio/blog-assets/tree/main/ci-cd-deploy/terraform/aws/hello_world
hello_minio_aws_region = "us-east-1"
Create a file called main.tf and initialize the terraform AWS provider and S3 backend. Note that the S3
bucket needs to exist beforehand. We are using the S3 backend to store the state so that it can be shared among
developers and CI/CD processes alike, without trying to keep local state in sync across the org.
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.31.0"
    }
  }

  backend "s3" {
    bucket = "aj-terraform-bucket"
    key    = "tf/aj/mo"
    region = "us-east-1"
  }
}

provider "aws" {
  region = var.hello_minio_aws_region
}
hello_world/main.tf#L1-L21
Setting the backend bucket and key as variables is not supported, so those values need to be hard coded.
minio_aws_vpc_cidr_block = var.hello_minio_aws_vpc_cidr_block
minio_aws_vpc_cidr_newbits = var.hello_minio_aws_vpc_cidr_newbits
minio_public_igw_cidr_blocks = var.hello_minio_public_igw_cidr_blocks
minio_private_ngw_cidr_blocks = var.hello_minio_private_ngw_cidr_blocks
minio_private_isolated_cidr_blocks = var.hello_minio_private_isolated_cidr_blocks
hello_world/main.tf#L23-L33
hello_minio_aws_vpc_cidr_block = "10.0.0.0/16"
hello_minio_aws_vpc_cidr_newbits = 4
hello_minio_public_igw_cidr_blocks = {
"us-east-1b" = 1
"us-east-1d" = 2
"us-east-1f" = 3
}
hello_minio_private_ngw_cidr_blocks = {
"us-east-1b" = 4
"us-east-1d" = 5
"us-east-1f" = 6
}
hello_minio_private_isolated_cidr_blocks = {
"us-east-1b" = 7
"us-east-1d" = 8
"us-east-1f" = 9
}
hello_world/terraform.tfvars#L3-L22
module "hello_minio_aws_eks_cluster" {
source = "../modules/eks"
minio_aws_eks_cluster_name = var.hello_minio_aws_eks_cluster_name
minio_aws_eks_cluster_endpoint_private_access =
var.hello_minio_aws_eks_cluster_endpoint_private_access
minio_aws_eks_cluster_endpoint_public_access = var.hello_minio_aws_eks_cluster_endpoint_public_access
minio_aws_eks_cluster_public_access_cidrs = var.hello_minio_aws_eks_cluster_public_access_cidrs
minio_aws_eks_cluster_subnet_ids =
values(module.hello_minio_aws_vpc.minio_aws_subnet_private_ngw_map)
minio_aws_eks_node_group_name = var.hello_minio_aws_eks_node_group_name
minio_aws_eks_node_group_instance_types = var.hello_minio_aws_eks_node_group_instance_types
minio_aws_eks_node_group_desired_size = var.hello_minio_aws_eks_node_group_desired_size
minio_aws_eks_node_group_max_size = var.hello_minio_aws_eks_node_group_max_size
minio_aws_eks_node_group_min_size = var.hello_minio_aws_eks_node_group_min_size
hello_world/main.tf#L37-L51
hello_minio_aws_eks_cluster_name = "hello_minio_aws_eks_cluster"
hello_minio_aws_eks_cluster_endpoint_private_access = true
hello_minio_aws_eks_cluster_endpoint_public_access = true
hello_minio_aws_eks_cluster_public_access_cidrs = ["0.0.0.0/0"]
hello_minio_aws_eks_node_group_name = "hello_minio_aws_eks_node_group"
hello_minio_aws_eks_node_group_instance_types = ["t3.large"]
hello_minio_aws_eks_node_group_desired_size = 3
hello_minio_aws_eks_node_group_max_size = 5
hello_minio_aws_eks_node_group_min_size = 1
hello_world/terraform.tfvars#L24-L32
Finally, we’ll apply the configuration. While still in the hello_world directory, run the following terraform
commands. This will take about 15-20 minutes to get the entire infrastructure up and running. Towards the
end, you should see output similar to the below:
…TRUNCATED…
$ terraform apply
…TRUNCATED…
hello_minio_aws_eks_cluster_name = "hello_minio_aws_eks_cluster"
hello_minio_aws_eks_cluster_region = "us-east-1"
…TRUNCATED…
Finished: SUCCESS
Update your default kubeconfig configuration to use the cluster we just created, using the aws eks
update-kubeconfig command. The --region and --name values are available from the previous output.
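A sketch using the two output values shown above:

$ aws eks update-kubeconfig \
  --region us-east-1 \
  --name hello_minio_aws_eks_cluster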
$ kubectl get no
NAME STATUS ROLES AGE VERSION
ip-10-0-105-186.ec2.internal Ready <none> 3d8h v1.23.9-eks-ba74326
ip-10-0-75-92.ec2.internal Ready <none> 3d8h v1.23.9-eks-ba74326
ip-10-0-94-57.ec2.internal Ready <none> 3d8h v1.23.9-eks-ba74326
Next, install the EBS CSI driver so gp2 PVCs can be mounted. We are using gp2 because it is the default storage
class in EKS.
Set credentials for the AWS secret using the same credentials used for awscli
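A sketch of creating that secret for the EBS CSI driver, following the driver's documented pattern; the secret name and namespace shown are the driver's defaults:

$ kubectl create secret generic aws-secret \
  --namespace kube-system \
  --from-literal "key_id=${AWS_ACCESS_KEY_ID}" \
  --from-literal "access_key=${AWS_SECRET_ACCESS_KEY}"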
$ kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.12"
Now we’re ready to deploy MinIO. First, clone the MinIO repository
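A minimal sketch, assuming the MinIO Operator repository (which contains the example tenant kustomizations referenced below):

$ git clone https://github.com/minio/operator.git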
Since this is AWS, we need to update the storageClassName to gp2. Open the following file and update any
references from storageClassName: standard to storageClassName: gp2. Each MinIO tenant has its
own tenant.yaml that contains the storageClassName configuration. Based on the tenant you are using, be
sure to update the storageClassName accordingly.
$ vim ./operator/examples/kustomization/base/tenant.yaml
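With the storage class updated, the Operator and an example tenant can be applied. A sketch only, assuming the repository's root kustomization and the tenant-lite example that the pool names below come from:

$ kubectl apply -k operator/
$ kubectl apply -k operator/examples/kustomization/tenant-lite/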
Wait at least 5 minutes for the resources to come up, then verify that MinIO is up and running.
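A simple way to verify, assuming the tenant-lite example's tenant-lite namespace:

$ kubectl get pods -n tenant-lite -o wide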
If you look at the output above, the storage-lite-pool- pods are spread across the worker nodes. Two of them
share the same node because we have 4 MinIO pods but only 3 nodes (one per availability zone), which is okay.
Basically, there are 3 nodes in 3 AZs and 4 MinIO pods with 2 PVCs each, which is reflected in the status 8
Online below.
…TRUNCATED…
Documentation: https://min.io/docs/minio/linux/index.html
You will need the TCP port of the MinIO console; in this case it is 9443.
With this information, we can set up Kubernetes port forwarding. We chose port 39443 for the host, but this
could be anything; just be sure to use this same port when accessing the console through a web browser.
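A sketch of the forward; the service name and namespace are the Operator Console defaults and may differ in your cluster:

$ kubectl port-forward svc/console -n minio-operator 39443:9443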
Access MinIO Operator Console through the web browser using the following credentials:
URL: https://localhost:39443
User: minio
You now have a fully production setup of a distributed MinIO cluster. Here is how you can automate it using
Jenkins:
export PATH=$PATH:/usr/local/bin
cd ci-cd-deploy/terraform/aws/hello_world/
terraform init
terraform plan
terraform apply -auto-approve
Final Thoughts
In these past few blogs of the CI/CD series we’ve shown you how nimble and flexible MinIO is. You can build it
into anything you want using Packer and deploy it in VMs or Kubernetes clusters wherever it is needed. This
allows your developers to have as close to a production infrastructure as possible in their development
environment, while at the same time leveraging powerful security features such as Server Side Object
Encryption and managing IAM policies for restricting access to buckets.
In a production environment, you might want to restrict the IAM user to a specific policy but that really
depends on your use cases. For demonstration purposes, we kept things simple with a broad policy, but in
production you would want to narrow it down to specific resources and groups of users. In a later blog we’ll
show some of the best practices on how to design your infrastructure for different AZs and regions.
Would you like to try automating the kubectl part as well with Jenkins instead of applying manually? Let us
know what type of pipeline you’ve built using our tutorials for planning, deploying, scaling and securing MinIO
across the multicloud, and reach out to us on our Slack and share your pipelines!
If you are part of a team running infrastructure, whether as DevOps, SRE or Systems Engineering, it's paramount
to keep tech debt to a minimum. In this case, you want to ensure that the supporting
systems in your infrastructure, such as databases, cache systems, messaging queues, log aggregators,
monitoring systems, application performance monitoring systems and I'm sure a few more I'm missing, do
not add to the overall complexity of managing the infrastructure.
As an engineer, you always want to ensure the infrastructure you are setting up is something that can be
used by multiple teams for multiple applications and projects. If you are picking a database, do your due
diligence and pick one that can serve most, if not all, of your applications. Having too many
disparate database servers running alongside each other adds to the complexity of installing,
updating, maintaining, backing up, testing the backups and monitoring them, among other things that take up a
lot of resources.
The same is true of storage systems. One of the things you want to make sure of is that your storage systems
are not used only for backups and DR. Rather, you should have
storage systems that support object storage, DB external table storage, metrics and log storage,
configuration management data, AI/ML data in data lakes, and big data workloads such as Spark, among countless other
use cases. This way you have a storage infrastructure that is not only resilient but also scalable, because
more teams and applications depend on storage than ever before. While the
overhead of managing myriad pieces of infrastructure is now minimized, it becomes critical to ensure the
underlying infrastructure powering them is also scalable, reliable and performant. MinIO fits the bill for many
of these use cases because of its industry-leading performance and scalability. MinIO is capable
of tremendous performance – we’ve benchmarked it at 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177
GB/s) on PUTs with just 32 nodes of off-the-shelf NVMe SSDs – and is used to build data lakes/lake houses
and analytics and AI/ML workloads. With MinIO playing a critical role in storage infrastructure, it's important to
collect, monitor and analyze performance and usage metrics.
MinIO is also a major proponent of Kubernetes because of its open source nature and the vast resources
available for deploying applications. Since MinIO can be deployed anywhere, it was a no-brainer
to offer more ways to deploy MinIO via containers. From the get-go, MinIO supported the
Docker way of deployment, which is the easiest way to get up and running. You might also know that we maintain
and distribute our own Helm Chart to deploy the MinIO Operator and other components. This helps us deploy
MinIO alongside any Kubernetes application, whether it uses Helm or not.
Install MinIO
Before we set up and configure the Helm repository please ensure you have a working MinIO cluster which
you are already using for existing data. If you just want to test this out first please follow the instructions
below to spin up a MinIO container quickly.
We’ll bring up a MinIO node with 4 disks. MinIO runs anywhere - physical, virtual or containers - and in this
overview, we will use containers created using Docker.
mkdir -p /home/aj/minio/disk-1
mkdir -p /home/aj/minio/disk-2
mkdir -p /home/aj/minio/disk-3
mkdir -p /home/aj/minio/disk-4
Launch the Docker container with the following specifications for the MinIO node:
docker run -d \
-p 20091:9001 \
-v /home/aj/minio/disk-1:/mnt/disk1 \
-v /home/aj/minio/disk-2:/mnt/disk2 \
-v /home/aj/minio/disk-3:/mnt/disk3 \
-v /home/aj/minio/disk-4:/mnt/disk4 \
--name minio \
--hostname minio \
quay.io/minio/minio server http://minio/mnt/disk{1...4}/minio --console-address ":9001"
The above will launch a MinIO service in Docker with the console port listening on 20091 on the host. It will
also mount the local directories we created as volumes in the container and this is where MinIO will store its
data. You can access your MinIO service via http://localhost:20091.
Documentation: https://docs.min.io
Go to the browser to load the MinIO console using http://localhost:20091, log in using minioadmin and
minioadmin for username and password respectively. Click on the Create Bucket button and create
testbucket123.
Installing Helm
Another similarity Helm shares with MinIO is its single binary. You don’t need to install multiple files or
dependencies in different locations while dealing with path issues. You simply run the commands below and you'll be
able to get going:
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh
helm repo update
In order to set up MinIO as a Helm repository, we'll need to create a separate bucket. You can name it
helm-repo or use the testbucket123 we used earlier. Ensure this bucket is public and accessible over
http/https without any API or libraries.
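A sketch using the mc client; the myminio alias is assumed to already point at the MinIO deployment (on older mc releases, use mc policy set download instead of mc anonymous set download):

mc mb myminio/helm-repo
mc anonymous set download myminio/helm-repo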
In this bucket we’ll need to create an index.yaml. Think of it as the index.html on a website. It is the first page
that the browser looks for when accessing a website. Similarly, when the helm command is run it will look for
the index.yaml at the root of the MinIO bucket which will have a list of packages that are available in that
bucket which Helm can use. Think of it as a combination of index.html and the contents of a Site Map page so
you would know exactly where you want to go all the time. This is the same way Helm package manager uses
the index.yaml in the MinIO bucket.
apiVersion: v1
entries:
  minio:
  - apiVersion: v1
    created: 2017-06-15T17:48:36.895822482Z
    description: Distributed object storage server built for cloud applications and
      devops.
    digest: 75ff1e3d779d8937cff57c28a102da97a520245d50e22c1a2763cbea064a76cd
    home: https://minio.io
    icon: https://www.minio.io/img/logo_160x160.png
    keywords:
    - storage
    - S3
    maintainers:
    - email: hello@acale.ph
      name: Acaleph
    - email: hello@minio.io
      name: Minio
    name: minio
    sources:
    - https://github.com/minio/minio
    urls:
    - https://play.minio.io:9000/minio-helm/minio-0.1.2.tgz
    version: 0.1.2
mc cp ./index.yaml myminio/testbucket123
mc cp minio-0.1.2.tgz myminio/testbucket123
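Before installing, register the bucket as a Helm repository; a sketch, assuming the MinIO API is reachable on localhost:9000 and the bucket is publicly readable:

helm repo add myrepo http://localhost:9000/testbucket123
helm repo update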
The moment of truth! Try installing the Helm package from the MinIO bucket you added as a Helm repo:
helm install myrepo/minio --generate-name
This term is generally used when describing monitoring. But it can also be used to explain the state of your
infrastructure. By having yet another useful way to use your MinIO clusters, you have automatically reduced
the tech debt of your team and the larger organization. By using MinIO you can also leverage our SUBNET
support portal where a team of engineers who actually write the core codebase of MinIO are available to help
you tackle some of the cluster’s important tasks such as architecting the initial set up, developing a DR plan
and scaling the cluster as the need for more data, applications and use cases increases.
For more information, ask our experts using the live chat at the bottom right of the blog to learn more about
the SUBNET experience or email us at hello@min.io.
Kubeflow Pipelines (KFP) is the most popular feature of Kubeflow. A Python engineer can turn a function
written in plain old Python into a component that runs in Kubernetes using the KFP decorators. If you used
KFP v1, be warned - the programming model in KFP v2 is very different - however, it is a big improvement.
Transforming plain old Python into reusable components and orchestrating these components into pipelines is
a lot easier.
In this post I want to go beyond the obligatory “Hello World” demo and present something that I hope you will
find either directly usable or at the very least a framework for plugging in your own logic.
What I will do is show how to build a KFP Pipeline that downloads US Census Bureau Data (which is a public
data set that is free to access) and saves this data to MinIO. MinIO is a great way to store your ML data and
models. Using MinIO, you can save training sets, validation sets, test sets, and models without worrying about
scale or performance. Also, someday AI will be regulated; when this day comes, you will need MinIO’s
enterprise features (object locking, versioning, encryption, and legal locks) to secure your data at rest and to
make sure you do not accidentally delete something that a regulatory agency may request.
You can learn more about the data we will be using here. To get an API key for the Census API, go to the
Census Bureau’s site for developers. This is very simple. All you need to do is specify an email address.
In this post, I will build a pipeline that takes a table code (an identifier within the Census Bureau’s dataset) and
a year as parameters. It will then download the table via the Census API, but only if we have not previously
downloaded it. When we call the ACS API,
we will save the data in an instance of MinIO that we set up for storing raw data. This is different from the
MinIO instance KFP uses internally. We could have tried to use KFP’s instance of MinIO - however, this is not
the best design for an ML Data Pipeline. You will want a storage solution that is totally under your control for
the reasons I described earlier. Below is a diagram of our Kubeflow and MinIO deployments that illustrates the
purpose of each instance of MinIO.
The pipelines you run in KFP are known as Directed Acyclic Graphs (DAG). They move in one direction and do
not backtrack - no closed loops. This is what you would expect of a data pipeline. Below is the logical design
of the DAG we will build and run in KFP. It is self-explanatory. Starting with a conceptual workflow is a good
way to help you transform your logic into functions that will leverage KFP to the fullest.
Now that we have a logical design, let’s start coding. I am going to assume you have KFP installed and that
you also have set up your own instance of MinIO. If you do not have KFP 2.0 and MinIO installed, check out
Setting up a Development Machine with Kubeflow Pipeline 2.0 and MinIO.
Each task in the logical design above is going to become a Python function. The function signatures below
show how the parameters and return values would be designed if we were writing a Python script or standalone
service without KFP. I want to discuss this in case you are migrating existing code to KFP.
A few comments about the functions above. They use type hints. If you are writing plain old Python, you can
opt out of type hints because they are optional. In Kubeflow Pipelines, they are not - you must use type hints
so that KFP can tell you if your parameters and return values do not match when assembling functions into a
pipeline. This is a good thing. KFP will find type mismatch errors when you compile your pipeline. These same
errors would be very hard to track down at runtime within a cluster.
It may be tempting to combine functions so that you have fewer functions to manage. For example, the last
three functions could be combined into one by using a simple “if else” statement and then the first function
would not be needed. This is not a best practice when using a tool like KFP. As we will see, KFP has
constructs for conditions and loops. By using KFP's constructs you will get better visualizations of your
pipeline in the KFP UI. Parallelism is also possible, which will improve pipeline performance. Finally, if we
keep our functions simple we will get better reuse.
We are now ready to create Kubeflow Pipeline components using our Python functions.
The code below is the complete implementation of our Pipeline components. When you use tools like KFP and
MinIO, you really do not have a lot of plumbing code to write.
@dsl.component(packages_to_install=['minio==7.1.14'])
def table_data_exists(bucket: str, table_code: str, year: int) -> bool:
    '''
    Check for the existence of Census table data in MinIO.
    '''
    from minio import Minio
    from minio.error import S3Error
    import logging

    object_name = f'{table_code}-{year}.csv'
    logger = logging.getLogger('kfp_logger')
    logger.setLevel(logging.INFO)

    try:
        # Create client with access and secret key.
        client = Minio('host.docker.internal:9000',
                       'Access key here.',
                       'Secret key here.',
                       secure=False)

        bucket_found = client.bucket_exists(bucket)
        if not bucket_found:
            return False

        objects = client.list_objects(bucket)
        found = False
        for obj in objects:
            logger.info(obj.object_name)
            if object_name == obj.object_name:
                found = True
        return found

    except S3Error as s3_err:
        # Surface MinIO/S3 errors so the pipeline run fails visibly.
        logger.error(f'S3 Error occurred: {s3_err}.')
        raise
@dsl.component(packages_to_install=['pandas==1.3.5', 'requests'])
def download_table_data(dataset: str, table_code: str, year: int, table_df: Output[Dataset]):
    '''
    Returns all fields for the specified table. The output is a DataFrame saved to csv.
    '''
    import logging
    import pandas as pd
    import requests

    logger = logging.getLogger('kfp_logger')
    logger.setLevel(logging.INFO)

    census_endpoint = f'https://api.census.gov/data/{year}/{dataset}'
@dsl.component(packages_to_install=['pandas==1.3.5', 'minio==7.1.14'])
def save_table_data(bucket: str, table_code: str, year: int, table_df: Input[Dataset]):
    import io
    import logging
    from minio import Minio
    from minio.error import S3Error
    import pandas as pd

    object_name = f'{table_code}-{year}.csv'
    logger = logging.getLogger('kfp_logger')
    logger.setLevel(logging.INFO)
    logger.info(bucket)
    logger.info(table_code)
    logger.info(year)
    logger.info(object_name)

    df = pd.read_csv(table_df.path)

    try:
        # Create client with access and secret key
        client = Minio('host.docker.internal:9000',
@dsl.component(packages_to_install=['pandas==1.3.5', 'minio==7.1.14'])
def get_table_data(bucket: str, table_code: str, year: int, table_df: Output[Dataset]):
    import io
    import logging
    from minio import Minio
    from minio.error import S3Error
    import pandas as pd

    object_name = f'{table_code}-{year}.csv'
    logger = logging.getLogger('kfp_logger')
    logger.setLevel(logging.INFO)
    logger.info(bucket)
    logger.info(table_code)
    logger.info(year)
    logger.info(object_name)

    finally:
        response.close()
        response.release_conn()
The most important fact to keep in mind as you implement and troubleshoot these functions is that at runtime
they are not functions at all. They will be components. In other words, KFP will take each function and deploy
it to its own container. This sample uses Lightweight Python Components. You can also use containerized
Python components which give you more control over what is put into the container. There is also a
containerized components option for non-Python code.
KFP introduces several constructs to help you seamlessly create functions that can behave as standalone
components running in a container. They are the component decorator, parameters, and artifacts. Let’s walk
through these tools so that you understand how KFP deploys functions and passes data between them at run
time.
Components
The component decorator tells KFP that a function should be deployed as a component. Carefully look at how
this decorator is used in the code above. Since the function will be deployed separately to a container you
need to tell KFP its dependencies. This is done using the packages_to_install parameter of the decorator. This
only ensures that dependencies are installed (via pip). It does not import them for you. You need to do this
yourself within the function definition. This may look a little unorthodox, as most of us are used to importing
dependencies at the module level, but it is OK when using a tool like KFP that turns functions into services.
Passing data between components must be done with care. KFP v2 makes the distinction between
parameters and artifacts. Parameters are for simple data that is passed between function calls (int, bool, str,
float, list, dict). Artifacts, on the other hand, represent data that your functions retrieve from an external
source or produce themselves, such as datasets and models, and that may be large.
KFP makes use of Python type hints for specifying simple input parameters and simple return values. You are
limited to using str, int, float, bool, list, and dict. The table_data_exists function above shows how parameters
are specified in a function signature. Syntactically, you specify these the same way you would with standard
Python. Remember using type hints is a requirement. At runtime, KFP takes care of marshaling these values
between components - which are running in different containers.
If a function requires a more complicated data type as an input or if it returns a complicated data type then
use artifacts.
Artifacts
Artifacts are different from input parameters and output values in that they may get large. Examples of an
artifact are: a dataset, a model, metrics (the results of ML training efforts), HTML, and Markdown. Under the
hood, KFP uses its own instance of MinIO to store artifacts. When you pass an artifact from one component to
another KFP does not pass the artifact directly - rather it stores the artifact in MinIO and passes a reference
to the artifact (object) in MinIO. This is really clever. It means that if you have a large artifact that needs to be
accessed by several components then the artifact can be efficiently accessed by these components - since
MinIO is purpose built for efficient object storage and access.
Let’s look at what happens when you pass an artifact to a component. In the code sample above,
save_table_data shows how this is done. Before your function is invoked, KFP copies the artifact from its
instance of MinIO to the local file system of the container your component is running in. Your code will need to
read this file. This is done using the path attribute of the parameter you declared to be of type Input[Dataset].
In the save_table_data function, I read this file into a Pandas DataFrame.
Output artifacts are specified as function parameters and cannot be the return value of a function. In the code
above, get_table_data shows how to use output artifacts. Notice that the table_df parameter has a datatype
of Output[Dataset]. To successfully return data from a function, you must write the data to the location
specified in the parameter’s path attribute. Again, this is a reference to the local file system in your container
- KFP will take care of moving this file to its instance of MinIO when your function completes.
The code below creates our pipeline (or DAG) from the components we implemented in the previous section.
return table_data.outputs['table_df']
There are a few things worth noting in this function. First, the pipeline decorator is telling KFP that this
function contains our pipeline definition. The name and description you specify here will show up in the KFP
UI.
Next, the return value of this pipeline function is a Dataset. It turns out that pipelines can be used just like
components. When a pipeline has a return value then it can be used within another pipeline. This is a great
way to reuse components. Finally, we are using dsl.Condition (which is a Python context manager) to only
call our download component if the data we need is not already in our instance of MinIO. We could have used
a conventional if statement here. However, if we did then KFP would not have any way of knowing that we
have a branch in our logic. By using the dsl.Condition construct we are telling KFP about a branch in our
pipeline. This will allow the KFP UI to give us a better visual representation.
Running a Pipeline
Once you have your components and your pipeline implemented you are two lines of code away from running
your pipeline.
Choose a meaningful experiment name. The KFP UI has an experiments tab that will group runs with the same
experiment name. The code above “compiles” your pipeline and components - which is merely the act of
putting everything into a YAML file (including your source code). If you have any type mismatches that I
described earlier, then you will find out about these problems while creating the run. This code will also send
your pipeline to KFP and run it. Below is a screenshot showing a few successful runs of our pipeline.
Summary
In this post we created a data pipeline that uses KFP and MinIO to download and save US Census data. To do
this we set up our own instance of MinIO for storing raw data. This is an important piece of an ML pipeline -
someday AI will be regulated and having a storage solution under your control allows you to version, lock, and
encrypt data used for training and the models themselves.
We also discussed how KFP uses its own instance of MinIO to efficiently save and access artifacts during
pipeline runs.
In my next post, I will show how this data pipeline can be used as input to another pipeline that uses Census
data to train a model. If you have questions, drop us a line at hello@min.io or join the discussion on our
general Slack channel.
Apache Kafka is an open-source distributed event streaming platform that is used for building real-time data
pipelines and streaming applications. It was originally developed by LinkedIn and is now maintained by the
Apache Software Foundation. Kafka is designed to handle high volume, high throughput, and low latency data
streams, making it a popular choice for building scalable and reliable data streaming solutions.
■ Scale and Speed: Handling large-scale data streams and millions of events per second, and scaling
horizontally by adding more Kafka brokers to the cluster
■ Fault Tolerance: Replicating data across multiple brokers in a Kafka cluster ensures that data is
highly available and can be recovered in case of failure, making Kafka a reliable choice for critical
data streaming applications
■ Versatility: Support for a variety of data sources and data sinks making it highly versatile. It can be
used for building a wide range of applications, such as real-time data processing, data ingestion, data
streaming, and event-driven architectures
■ Durability: All published messages are stored for a configurable amount of time, allowing consumers
to read data at their own pace. This makes Kafka suitable for use cases where data needs to be retained for
historical analysis or replayed for recovery purposes.
Deploying Kafka on Kubernetes, a widely-used container orchestration platform, offers several additional
advantages. Kubernetes enables dynamic scaling of Kafka clusters based on demand, allowing for efficient
resource utilization and automatic scaling of Kafka brokers to handle changing data stream volumes. This
ensures that Kafka can handle varying workloads without unnecessary resource wastage or performance
degradation.
Running Kafka clusters as containers provides easy deployment, management, and monitoring, and
makes them highly portable across different
environments. This allows for seamless migration of Kafka clusters across various cloud providers, data
centers, or development environments.
Kafka and MinIO are commonly used to build data streaming solutions. MinIO is a high-performance,
distributed object storage system designed to support cloud-native applications with S3-compatible storage
for unstructured, semi-structured and structured data. When used as a data sink with Kafka, MinIO enables
organizations to store and process large volumes of data in real-time.
■ High Performance: MinIO writes Kafka streams as fast as they come in. A recent benchmark
achieved 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177 GB/s) on PUTs with just 32 nodes of
off-the-shelf NVMe SSDs.
■ Scalability: MinIO handles large amounts of data and scales horizontally across multiple nodes,
making it a perfect fit for storing data streams generated by Kafka. This allows organizations to store
and process massive amounts of data in real-time, making it suitable for big data and high-velocity
data streaming use cases.
■ Durability: MinIO provides durable storage, allowing organizations to retain data for long periods of
time, such as for historical analysis, compliance requirements, or data recovery purposes.
■ Fault Tolerance: MinIO erasure codes data across multiple nodes, providing fault tolerance and
ensuring data durability. This complements Kafka's fault tolerance capabilities, making the overall
solution highly available, reliable and resilient.
■ Easy Integration: MinIO is easily integrated with Kafka using Kafka Connect, a built-in framework for
connecting Kafka with external systems. This makes it straightforward to stream data from Kafka to
MinIO for storage, and vice versa for data retrieval, enabling seamless data flow between Kafka and
MinIO. We’ll see how straightforward this is in the tutorial below.
In this post, we will walk through how to set up Kafka on Kubernetes using Strimzi, an open-source project
that provides operators to run Apache Kafka and Apache ZooKeeper clusters on Kubernetes, including
distributions such as OpenShift. Then we will use Kafka Connect to stream data to MinIO.
Prerequisites
The first step is to install the Strimzi operator on your Kubernetes cluster. The Strimzi operator manages the
lifecycle of Kafka and ZooKeeper clusters on Kubernetes.
NAME: my-release
LAST DEPLOYED: Mon Apr 10 20:03:12 2023
NAMESPACE: kafka
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing strimzi-kafka-operator-0.34.0
https://strimzi.io/docs/operators/latest/deploying.html#deploying-cluster-operator-helm-chart-str
This installs the latest version (0.34.0 at the time of this writing) of the operator in the newly created kafka
namespace. For additional configurations refer to this page.
Now that we have installed the Strimzi operator, we can create a Kafka cluster. In this example, we will create
a Kafka cluster with three Kafka brokers and three ZooKeeper nodes.
%%writefile deployment/kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
Overwriting deployment/kafka-cluster.yaml
kafka.kafka.strimzi.io/my-kafka-cluster created
Unset
!kubectl -n kafka get kafka my-kafka-cluster
my-kafka-cluster 3 3 True
Now that we have the cluster up and running, let’s produce and consume sample topic events, starting with
the kafka topic my-topic.
Create a YAML file for the kafka topic my-topic as shown below and apply it.
%%writefile deployment/kafka-my-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
name: my-topic
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
partitions: 3
replicas: 3
Overwriting deployment/kafka-my-topic.yaml
With the Kafka cluster and topic set up, we can now produce and consume messages.
To create a Kafka producer pod to produce messages to the my-topic topic, try the below commands in a
terminal
This will give us a prompt to send messages to the producer. In parallel, we can bring up the consumer to
start consuming the messages that we sent to producer
The consumer will replay all the messages that we sent to the producer earlier and, if we add any new
messages to the producer, they will also start showing up at the consumer side.
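If you would rather check the topic from code than from the console consumer, a minimal sketch using the kafka-python library looks like the following. It assumes it runs inside the cluster (for example, in a pod in the kafka namespace), since my-kafka-cluster-kafka-bootstrap:9092 is the in-cluster bootstrap address:

from kafka import KafkaConsumer

# Read everything published to my-topic so far and keep listening for new messages.
consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="my-kafka-cluster-kafka-bootstrap:9092",
    auto_offset_reset="earliest",
    group_id="my-topic-reader",
)
for message in consumer:
    print(message.value.decode("utf-8"))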
Now that the Kafka cluster is up and running with a dummy topic producer/consumer, we can start consuming
topics directly into MinIO using the Kafka Connector.
Next we will use the Kafka Connector to stream topics directly to MinIO. First let's look at what connectors are
and how to set one up. Here is a high-level overview of how the different Kafka components interact.
Kafka Connectors
Kafka Connect is an integration toolkit for streaming data between Kafka brokers and other systems. The
other system is typically an external data source or target, such as MinIO.
Kafka Connect utilizes a plugin architecture to provide implementation artifacts for connectors, which are
used for connecting to external systems and manipulating data. Plugins consist of connectors, data
converters, and transforms. Connectors are designed to work with specific external systems and define a
schema for their configuration. When configuring Kafka Connect, you configure the connector instance, and
the connector instance then defines a set of tasks for data movement between systems.
In the distributed mode of operation, Strimzi operates Kafka Connect by distributing data streaming tasks
across one or more worker pods. A Kafka Connect cluster consists of a group of worker pods, with each
connector instantiated on a single worker. Each connector can have one or more tasks that are distributed
across the group of workers, enabling highly scalable data pipelines.
Workers in Kafka Connect are responsible for converting data from one format to another, making it suitable
for the source or target system. Depending on the configuration of the connector instance, workers may also
apply transforms, also known as Single Message Transforms (SMTs), which can adjust messages, such as by
filtering certain data, before they are converted. Kafka Connect comes with some built-in transforms, but
additional transformations can be provided by plugins as needed.
1. Source Connectors - push data from an external system, such as MinIO, into Kafka
2. Sink Connectors - extract data from Kafka into an external system, such as MinIO
Let’s configure a Sink Connector that extracts data from Kafka and stores it in MinIO as shown below.
The Sink Connector streams data from Kafka and goes through the following steps:
1. A plugin provides the implementation artifacts for the Sink Connector: In Kafka Connect, a Sink
Connector is used to stream data from Kafka to an external system. The implementation artifacts for
the Sink Connector, such as the code and configuration, are provided by a plugin. Plugins are used to
extend the functionality of Kafka Connect and enable connections to different external data systems.
2. A single worker initiates the Sink Connector instance: In a distributed mode of operation, Kafka
Connect runs as a cluster of worker pods. Each worker pod can initiate a Sink Connector instance,
which is responsible for streaming data from Kafka to the external data system. The worker manages
the lifecycle of the Sink Connector instance, including its initialization and configuration.
3. The Sink Connector creates tasks to stream data: Once the Sink Connector instance is initiated, it
creates one or more tasks to stream data from Kafka to the external data system. Each task is
responsible for processing a portion of the data and can run in parallel with other tasks for efficient
data processing.
Setup
We will create a simple example which will perform the following steps
1. Create a Producer that will stream data from MinIO and produce events for a topic in JSON format
We will be using the NYC Taxi dataset that is available on MinIO. If you don't have the dataset follow the
instructions here
Producer
Below is simple Python code that reads data from MinIO and produces events for the topic my-topic.
%%writefile sample-code/producer/src/producer.py
import logging
import os

import fsspec
import pandas as pd
import s3fs
from kafka import KafkaProducer  # kafka-python client

logging.basicConfig(level=logging.INFO)

producer = KafkaProducer(bootstrap_servers="my-kafka-cluster-kafka-bootstrap:9092")

fsspec.config.conf = {
    "s3":
        {
            "key": os.getenv("AWS_ACCESS_KEY_ID", "openlakeuser"),
            "secret": os.getenv("AWS_SECRET_ACCESS_KEY", "openlakeuser"),
            "client_kwargs": {
                "endpoint_url": "https://play.min.io:50000"
            }
        }
}
s3 = s3fs.S3FileSystem()

total_processed = 0
i = 1
for df in pd.read_csv('s3a://openlake/spark/sample-data/taxi-data.csv', chunksize=1000):
    count = 0
    for index, row in df.iterrows():
        producer.send("my-topic", bytes(row.to_json(), 'utf-8'))
        count += 1
    producer.flush()
    total_processed += count
    # log progress roughly every 10,000 rows
    if total_processed % 10000 * i == 0:
        logging.info(f"total processed till now {total_processed}")
        i += 1
Overwriting sample-code/producer/src/producer.py
Next, add the requirements.txt and Dockerfile from which we will build the Docker image.
Overwriting sample-code/producer/requirements.txt
%%writefile sample-code/producer/Dockerfile
FROM python:3.11-slim
ENV PYTHONDONTWRITEBYTECODE=1
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY src/producer.py .
CMD ["python3", "-u", "./producer.py"]
Overwriting sample-code/producer/Dockerfile
Build and push the Docker image for the producer using the above Dockerfile, or use the prebuilt image
openlake/kafka-demo-producer available in openlake.
Let's create a YAML file that deploys our producer in the Kubernetes cluster as a job
%%writefile deployment/producer.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: producer-job
  namespace: kafka
spec:
  template:
    metadata:
      name: producer-job
    spec:
      containers:
        - name: producer-job
Writing deployment/producer.yaml
!kubectl apply -f deployment/producer.yaml
job.batch/producer-job created
!kubectl logs -f job.batch/producer-job -n kafka # stop this shell once you are done
%%writefile sample-code/connect/Dockerfile
FROM confluentinc/cp-kafka-connect:7.0.9 as cp
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-s3:10.4.2
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-avro-converter:7.3.3
FROM quay.io/strimzi/kafka:0.34.0-kafka-3.4.0
USER root:root
# Add S3 dependency
COPY --from=cp /usr/share/confluent-hub-components/confluentinc-kafka-connect-s3/ /opt/kafka/plugins/kafka-connect-s3/
Overwriting sample-code/connect/Dockerfile
Build and push the Docker image for Kafka Connect using the above Dockerfile, or use the prebuilt image
openlake/kafka-connect:0.34.0 available in openlake.
Before we deploy Kafka Connect, we need to create storage topics if not already present for Kafka Connect
to work as expected.
Let's create connect-status, connect-configs and connect-offsets topics and deploy them as shown below.
%%writefile deployment/connect-status-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: connect-status
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  partitions: 1
  replicas: 3
Writing deployment/connect-status-topic.yaml
%%writefile deployment/connect-configs-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: connect-configs
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  partitions: 1
  replicas: 3
  config:
    cleanup.policy: compact
Writing deployment/connect-configs-topic.yaml
%%writefile deployment/connect-offsets-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: connect-offsets
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  partitions: 1
  replicas: 3
  config:
    cleanup.policy: compact
!kubectl apply -f deployment/connect-status-topic.yaml
!kubectl apply -f deployment/connect-configs-topic.yaml
!kubectl apply -f deployment/connect-offsets-topic.yaml
Next, create a YAML file for Kafka Connect that uses the above image and deploys it in Kubernetes. Kafka
Connect will have 1 replica and make use of the storage topics we created above.
NOTE: spec.template.connectContainer.env has the credentials defined so that Kafka Connect can store
data in the MinIO cluster. Other details, like the endpoint_url and bucket_name, will be part of the KafkaConnector.
%%writefile deployment/connect.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: connect-cluster
  namespace: kafka
  annotations:
    strimzi.io/use-connector-resources: "true"
spec:
  image: openlake/kafka-connect:0.34.0
  version: 3.4.0
  replicas: 1
  bootstrapServers: my-kafka-cluster-kafka-bootstrap:9093
  tls:
    trustedCertificates:
      - secretName: my-kafka-cluster-cluster-ca-cert
        certificate: ca.crt
  config:
    bootstrap.servers: my-kafka-cluster-kafka-bootstrap:9092
    group.id: connect-cluster
    key.converter: org.apache.kafka.connect.json.JsonConverter
    value.converter: org.apache.kafka.connect.json.JsonConverter
    internal.key.converter: org.apache.kafka.connect.json.JsonConverter
    internal.value.converter: org.apache.kafka.connect.json.JsonConverter
Writing deployment/connect.yaml
!kubectl apply -f deployment/connect.yaml
kafkaconnect.kafka.strimzi.io/connect-cluster created
Now that we have Kafka Connect up and running, the next step is to deploy the Sink Connector that will poll
my-topic and store data into the MinIO bucket openlake-tmp.
■ connector.class - specifies what type of connector the Sink Connector will use; in our case it is
io.confluent.connect.s3.S3SinkConnector
■ store.url - the MinIO endpoint URL where you want to store the data from Kafka Connect
■ storage.class - specifies which storage class to use; in our case we are storing in MinIO, so
io.confluent.connect.s3.storage.S3Storage will be used
%%writefile deployment/connector.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: "minio-connector"
  namespace: "kafka"
  labels:
    strimzi.io/cluster: connect-cluster
spec:
  class: io.confluent.connect.s3.S3SinkConnector
  config:
    connector.class: io.confluent.connect.s3.S3SinkConnector
    task.max: '1'
    topics: my-topic
    s3.region: us-east-1
    s3.bucket.name: openlake-tmp
    s3.part.size: '5242880'
    flush.size: '1000'
    store.url: https://play.min.io:50000
    storage.class: io.confluent.connect.s3.storage.S3Storage
    format.class: io.confluent.connect.s3.format.json.JsonFormat
    partitioner.class: io.confluent.connect.storage.partitioner.DefaultPartitioner
    behavior.on.null.values: ignore
Overwriting deployment/connector.yaml
!kubectl apply -f deployment/connector.yaml
kafkaconnector.kafka.strimzi.io/minio-connector created
We can see files being added to the MinIO openlake-tmp bucket with
[...TRUNCATED…]
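The same check can be done from Python with s3fs, reusing the play.min.io credentials from the producer. The topics/my-topic/partition=<n>/ prefix is the default layout of the Confluent S3 sink connector and may differ if you change the partitioner:

import s3fs

s3 = s3fs.S3FileSystem(
    key="openlakeuser",
    secret="openlakeuser",
    client_kwargs={"endpoint_url": "https://play.min.io:50000"},
)

# List the JSON objects the Sink Connector has flushed for partition 0.
for obj in s3.ls("openlake-tmp/topics/my-topic/partition=0"):
    print(obj)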
We created an end-to-end implementation of producing topics in Kafka and consuming them directly into MinIO
using Kafka Connectors. This is a great start to learning how to use MinIO and Kafka together to build a
streaming data repository. But wait, there’s more.
In my next post, I explain and show you how to take this tutorial and turn it into something that is a lot more
efficient and performant.
This blog post showed you how to get started building a streaming data lake. Of course, there are many more
steps involved between this beginning and production.
MinIO is cloud-native object storage that forms the foundation for ML/AI, analytics, streaming video, and
other demanding workloads running in Kubernetes. MinIO scales seamlessly, ensuring that you can simply
expand storage to accommodate a growing data lake.
Customers frequently build data lakes using MinIO and expose them to a variety of cloud-native applications
for business intelligence, dashboarding and other analysis. They build them using Apache Iceberg, Apache
Hudi and Delta Lake. They use Snowflake, SQL Server, or a variety of databases to read data saved in MinIO
as external tables. And they use Dremio, Apache Druid and Clickhouse for analytics, and Kubeflow and
Tensorflow for ML.
MinIO can even replicate data between clouds to leverage specific applications and frameworks, while it is
protected using access control, version control, encryption and erasure coding.
Don’t take our word for it though — build it yourself. You can download MinIO and you can join our Slack
channel.
Cloud native object stores such as MinIO are frequently used to build data lakes that house large structured,
semi-structured and unstructured data in a central repository. Data lakes usually contain raw data obtained
from multiple sources, including streaming and ETL. Organizations analyze this data to spot trends and
measure the health of the business.
What is Dremio?
Dremio is an open-source, distributed analytics engine that provides a simple, self-service interface for data
exploration, transformation, and collaboration. Dremio's architecture is built on top of Apache Arrow, a
high-performance columnar memory format, and leverages the Parquet file format for efficient storage. For
more on Dremio, please see Getting Started with Dremio.
MinIO is a high-performance, distributed object storage system designed for cloud-native applications. The
combination of scalability and high-performance puts every workload, no matter how demanding, within
reach. A recent benchmark achieved 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177 GB/s) on PUTs with
just 32 nodes of off-the-shelf NVMe SSDs.
MinIO is built to power data lakes and the analytics and AI that runs on top of them. MinIO includes a number
of optimizations for working with large datasets consisting of many small files, a common occurrence with any
of today’s open table formats.
Perhaps more importantly for data lakes, MinIO guarantees durability and immutability. In addition, MinIO
encrypts data in transit and on drives, and regulates access to data using IAM and policy based access
controls (PBAC).
We can use Helm charts to deploy Dremio in a Kubernetes cluster. In this scenario, we will use the Dremio
OSS (Open Source Software) image to deploy one Master, three Executors and three ZooKeepers. The Master
node coordinates the cluster and the Executors process data. By deploying multiple Executors, we can
parallelize data processing and improve cluster performance.
We’ll use a MinIO bucket to store the data. New files uploaded to Dremio are stored in the MinIO bucket. This
enables us to store and process large amounts of data in a scalable and distributed manner.
Prerequisites
■ A Kubernetes cluster. You can use Minikube or Kind to set up a local Kubernetes cluster.
■ Helm, the package manager for Kubernetes. You can follow this guide to install Helm on your
machine.
■ A MinIO server running on bare metal or Kubernetes, or you can use our Play server for testing
purposes.
■ A MinIO client (mc) to access the MinIO server. You can follow this guide to install mc on your
machine.
MinIO engineers put together the openlake repository to give you the tools to build open source data lakes.
The overall goal of this repository is to guide you through the steps needed to build a data lake using open
source tools like Apache Spark, Apache Kafka, Trino, Apache Iceberg, Apache Airflow, and other tools
deployed on Kubernetes with MinIO as the object store.
Unset
!git clone https://github.com/minio/openlake
Let's create a MinIO bucket openlake/dremio which will be used by Dremio as the distributed storage
!mc mb play/openlake
!mc mb play/openlake/dremio
We will use the helm charts from the Dremio repo to set it up
We will use the dremio_v2 version of the charts, and we will use the values.minio.yaml file in the Dremio
directory of the openlake repository to set up Dremio. Let’s copy the YAML to
dremio-cloud-tools/charts/dremio_v2 and then confirm that it has been copied
Deployment Details
If we take a deep dive into the values.minio.yaml file (feel free to cat or open the file in your editor of
choice), we’ll gain a greater understanding of our deployment and learn about some of the modifications
made to the distStorage section
distStorage:
  aws:
    bucketName: "openlake"
    path: "/dremio"
    authentication: "accessKeySecret"
    credentials:
      accessKey: "minioadmin"
      secret: "minioadmin"
    extraProperties: |
      <property>
        <name>fs.s3a.endpoint</name>
        <value>play.min.io</value>
      </property>
      <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
      </property>
      <property>
        <name>dremio.s3.compat</name>
        <value>true</value>
      </property>
We set the distStorage to aws, the name of the bucket is openlake and all the storage for Dremio will be
under the prefix dremio (aka s3://openlake/dremio). We also need to add extraProperties since we
are specifying the MinIO endpoint. We also need to add two additional properties in order to make Dremio
work with MinIO: fs.s3a.path.style.access needs to be set to true, and dremio.s3.compat must be set
to true so that Dremio knows this is an S3-compatible object store.
Apart from this we can customize multiple other configurations like executor CPU and Memory usage
depending on the Kubernetes cluster capacity. We can also specify how many executors we need depending
on the size of the workloads Dremio is going to handle.
Make sure to update your MinIO endpoint, access key and secret key in values.minio.yaml. The commands
below will install the Dremio release named dremio in the newly created namespace dremio.
!cd ~/dremio-cloud-tools/charts
Give Helm a few minutes to work its magic, then verify that Dremio was installed and is running
!kubectl -n dremio get pods # after the helm setup is complete it takes some time for the pods to be up and running
!kubectl -n dremio get svc # List all the services in namespace dremio
!mc ls play/openlake/dremio # we should see new prefixes being created that Dremio will use later
Log in to Dremio
To log in to Dremio, let’s open a port-forward for the dremio-client service to our localhost. After executing
the below command, point your browser at http://localhost:9047. For security purposes, please remember to
close the port-forward after you are finished exploring Dremio.
You will need to create a new user when you first launch Dremio
Dremio will automatically parse the CSV and provide the recommended formatting as shown below, click Save
to proceed.
!mc ls --summarize --recursive play/openlake/dremio/uploads # you will see the CSV file uploaded into the MinIO bucket
After loading the file, we will be taken to the SQL Query Console where we can start executing queries. Here
are 2 sample queries that you can try executing
Unset
SELECT count(*) FROM nyc_taxi_small;
Paste the above into the console and click Run; you will see something like the below.
You can click on the Query2 tab to see the number of rows in the dataset
This blog post walked you through deploying Dremio in a Kubernetes cluster and using MinIO as the
distributed storage. We also saw how to upload a sample dataset to Dremio and start querying it. We have
just touched the tip of the iceberg 😜 in this post to help you get started building your data lake.
Speaking of icebergs, Apache Iceberg is an open table format that was built for object storage. Many a data
lake has been built using the combination of Dremio, Spark, Iceberg, and MinIO. To learn more, please see The
Definitive Guide to Lakehouse Architecture with Iceberg and MinIO.
Try Dremio on MinIO today. If you have any questions or want to share tips, please reach out through our
Slack channel or drop us a note on hello@min.io.
Enterprises are deploying multi-cloud services on a scale we’ve never seen before. Kubernetes is a key
enabler of multi-cloud success because it establishes a common, declarative software-based platform that
provides a consistent API-driven experience regardless of underlying hardware and software. However, it can
be time consuming and error prone to manage a multitude of Kubernetes clusters and their applications and
data across the multi-cloud.
It’s no secret that managing Kubernetes manually requires considerable skill to scale effectively. Challenges
grow as you scale because you’re supporting more and bigger Kubernetes clusters. At some point,
Kubernetes’ complexity may even threaten your ability to adapt legacy software to the cloud-native age.
Adding external storage to the mix compounds those challenges, especially when you have to deal with
variations in hardware and inconsistent APIs. If you are not architected for the multi-cloud, you run the risk of
failing in the multi-cloud.
We’ve joined forces with Rafay to develop this tutorial to show you how to make the most of multi-cloud
Kubernetes using Rafay to deploy, update and manage Kubernetes and applications using MinIO for object
storage. Rafay is a SaaS-based Kubernetes operations solution that standardizes, configures, monitors,
automates and manages a set of Kubernetes clusters through a single interface. MinIO is the fastest
software-defined, Kubernetes native, object store. It includes replication, integrations, automations and runs
anywhere Kubernetes does – public/private cloud, edge, developer laptops and more.
MinIO brings S3 API functionality and object storage to Kubernetes, providing a consistent interface anywhere
you run Kubernetes. DevOps and platform teams use the MinIO Operator and kubectl plugin to deploy and
manage object storage across the multi-cloud. Cloud-native MinIO integrates with external identity
management, encryption key management, load balancing, certificate management and monitoring and
alerting applications and services – it simply works with whatever you're already using in your organization.
MinIO is frequently used to build data lakes/lakehouses, at the edge and to deliver Object Storage as a
Service in the datacenter.
MinIO and Rafay are both known for their combination of power and simplicity. Follow the tutorial below to
begin exploring how they can standardize and automate operations for your Kubernetes clusters and manage
its applications and data.
We need a Kubernetes cluster to get started on our endeavor. Regular EKS or GKE clusters would work but
on-prem bare metal Kubernetes clusters would work as well. Our ethos has always been simplicity where
anyone can get started with just their laptop and grow production systems from there. We’ll use our laptops
for this tutorial in order to demonstrate the simplicity of Rafay and MinIO.
Download MicroK8s
....
==> microk8s
Run `microk8s install` to start with MicroK8s
Install MicroK8s
% microk8s install
Add a shortcut alias in bash so you do not have to repeat the entire command every time
% vim ~/.bash_profile
Great, if that is working, let's move on to enabling some essential add-ons required for the operation of our
cluster.
In order for the pods in the MicroK8s cluster to talk internally and to route external DNS requests, let's enable
DNS, which is managed by CoreDNS. In order to have a persistent volume for our MinIO
installation, we’ll enable MicroK8s hostpath storage. Last but not least, we also need RBAC to securely enable
access to Calico for routing, and for internal user-based kubectl access configured using the Rafay console.
deployment.apps/hostpath-provisioner created
storageclass.storage.k8s.io/microk8s-hostpath created
serviceaccount/microk8s-hostpath created
% mk8s get po -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system calico-kube-controllers-869878fccf-84l9q 1/1 Running 0 15m
kube-system calico-node-x4xsj 1/1 Running 0 15m
kube-system coredns-6f5f9b5d74-p4skc
MinIO Cluster
There are a couple of ways to get our MinIO Kubernetes cluster connected to Rafay. We can either go to the
Rafay console and launch a new cluster on AWS, GCP, Azure or even bare metal, or import an already running
Kubernetes cluster into the Rafay console. In this case, we already have a running MicroK8s Kubernetes
cluster, so we’ll go ahead and import that.
Follow steps 1 and 2 on this page to import the MicroK8s cluster we set up locally. Once you are on step 3,
you’ll get a bootstrap yaml file which you need to apply to the Microk8s cluster.
Bootstrap Cluster
Once you apply the bootstrap file, it will take about 5 minutes for all the pods to come up
In the reachability check you should see SUCCESS and the control plane should look HEALTHY.
Deploy MinIO
There are several ways to deploy MinIO: using the Go binary and a systemd service file, in Kubernetes as an
operator, or using a Helm chart. We’ll use a Helm chart in this example to show the workflow in the
Rafay console for importing a Helm chart.
Download the MinIO tar.gz helm chart package, which will later be used to upload to Rafay console.
## Change below settings if you would like to use K8S secrets for the MinIO's access and secret key
## Remove this if you are planning to use the Vault integration
##
existingSecret: ""
accessKey: "minioadmin"
secretKey: "minioadmin"
% mk8s get ns
NAME STATUS AGE
default Active 2d
kube-system Active 2d
kube-public Active 2d
kube-node-lease Active 2d
rafay-system Active 2d
minio Active 4s
We have all the prerequisites now: the Helm chart tar.gz, the Helm values YAML file and a namespace to deploy to on the
cluster. Next, create a new workload and name it “minio” with package type “Helm 3”. This is helpful because it
tells the Rafay console to use the Helm prerequisite files we created earlier. Select “Upload files manually” to
upload the Helm chart tar.gz and the Helm values YAML file.
Select the MinIO package and values yaml file and Publish the workload
MinIO Console
Next, expose the MinIO Console, a browser-based GUI for managing a MinIO Tenant, using Kubernetes port
forwarding.
Log in to the MinIO Console using the credentials that were set in the Helm chart’s values file.
At MinIO, we always strive to make our software as seamless and straightforward as possible. It starts with
detailed, easy-to-read documentation and single-command deployment. You get software-defined object
storage that runs anywhere from a developer’s laptop to production Kubernetes or bare metal clusters
combined with the simplicity of the browser-based MinIO Console user interface. A commercial subscription
adds access to the MinIO Subscription Network and ties it together with real-time collaboration with our
engineers on our revolutionary SUBNET portal. This tutorial shows you how to work with MinIO object storage
and Rafay Systems' management console for Kubernetes to set up Kubernetes workloads on a MicroK8s
cluster. Once the necessary operators are installed, you will be able to see the status of your locally running
MicroK8s cluster in the Rafay console.
This short tutorial can be run on a laptop to demonstrate how quick and easy it is to get started with MinIO
and Rafay. Once you’ve completed this tutorial, you’ll see how simple it is to manage your MinIO object
storage deployments with Rafay Systems. You can focus on running your MinIO clusters in multiple locations
and connect them all back to be managed and monitored by Rafay Systems in a single pane of glass view.
With the average enterprise running hundreds of Kubernetes clusters across more than two locations, the
combination of MinIO and Rafay Systems gives teams the autonomy and reliability they need, laying the
groundwork for successfully deploying and maintaining applications across your entire multi-cloud presence.
Apache Spark is an open-source, distributed computing system used for big data processing and analytics. It
is designed to handle large-scale data processing with speed, efficiency and ease of use. Spark provides a
unified analytics engine for large-scale data processing, with support for multiple languages, including Java,
Scala, Python, and R.
The benefits of using Spark are numerous. First, it provides a high level of parallelism, which means that it can
process large amounts of data quickly and efficiently across multiple nodes in a cluster. Second, Spark
provides a rich set of APIs for data processing, including support for SQL queries, machine learning, graph
processing, and stream processing. Third, Spark has a flexible and extensible architecture that allows
developers to easily integrate with various data sources and other tools.
When running Spark jobs, it is crucial to use a suitable storage system to store the input and output data.
Object storage systems like MinIO are the only way to run Spark jobs against petabytes of data as they are
highly scalable and durable storage solutions. MinIO is an open-source object storage system that can be
easily deployed on-premises or in the cloud of your choice. With industry leading S3-compatibility, MinIO is
used with a wide range of tools that support the S3 API, including Spark.
Using MinIO with Spark provides several benefits over traditional Hadoop Distributed File System (HDFS) or
other file-based storage systems. MinIO is highly scalable and can handle large amounts of data, as in
petabytes, with ease. Capable of over 2.6Tbps for READS and 1.32Tbps for WRITES, MinIO provides the
performance-at-scale that is needed to support large Spark datasets. MinIO is a flexible and cost-effective
storage solution that can be easily integrated with other tools and systems. Data written to MinIO is
immutable and versioned, as well as highly durable, with multiple copies of erasure coded data stored across
multiple nodes for redundancy and fault tolerance. Rounding out functionality, Active-Active replication and
Batch Replication can be used for further redundancy and fault tolerance, or simply to move data where it can
best be used.
Deploying Apache Spark on Kubernetes offers several advantages over deploying it standalone. Here are
some reasons why:
1. Resource management: Kubernetes provides powerful resource management capabilities that can
help optimize resource utilization and minimize waste. By deploying Spark on Kubernetes, you can
take advantage of these capabilities to give Spark jobs exactly the CPU and memory they need while
sharing the cluster with other workloads.
2. Scalability: Kubernetes can automatically scale the resources allocated to Spark based on the
workload. This means that Spark can scale up or down depending on the amount of data it needs to
process, without the need for manual intervention.
3. Fault-tolerance: Kubernetes provides built-in fault tolerance mechanisms that ensure the reliability of
Spark clusters. If a node in the cluster fails, Kubernetes automatically reschedules the Spark tasks to
another node, ensuring that the workload is not impacted.
4. Simplified deployment: Kubernetes offers a simplified deployment model, where you can deploy
Spark using a single YAML file. This file specifies the resources required for the Spark cluster, and
Kubernetes automatically handles the rest.
5. Integration with other Kubernetes services: By deploying Spark on Kubernetes, you can take
advantage of other Kubernetes services, such as monitoring and logging, to gain greater visibility into
your Spark cluster's performance and health.
We will use Spark Operator to set up Spark on Kubernetes. Spark Operator is a Kubernetes controller that
allows you to manage Spark applications on Kubernetes. It provides a custom resource definition (CRD) called
SparkApplication, which allows you to define and run Spark applications on Kubernetes. Spark Operator also
provides a web UI that allows you to easily monitor and manage Spark applications. Spark Operator is built on
top of the Kubernetes Operator SDK, which is a framework for building Kubernetes operators. Spark Operator
is open-source and available on GitHub. It is also available as a Helm chart, which makes it easy to deploy on
Kubernetes. In this tutorial, we will use the Helm chart to deploy Spark Operator on a Kubernetes cluster.
Spark Operator offers various features to simplify the management of Spark applications in Kubernetes
environments. These include declarative application specification and management using custom resources,
automatic submission of eligible SparkApplications, native cron support for scheduled applications, and
customization of Spark pods beyond native capabilities through the mutating admission webhook.
Additionally, the tool supports automatic re-submission and restart of updated SparkApplications, as well as
retries of failed submissions with linear back-off. It also provides functionality to mount local Hadoop
configuration as a Kubernetes ConfigMap and automatically stage local application dependencies to MinIO via
sparkctl. Finally, the tool supports the collection and export of application-level metrics and driver/executor
metrics to Prometheus.
Prerequisites
1. A Kubernetes cluster. You can use Minikube to set up a local Kubernetes cluster on your machine.
2. Helm, the package manager for Kubernetes. You can follow this guide to install Helm on your
machine.
3. A MinIO server running on bare metal or Kubernetes. You can follow this guide to install MinIO on bare
metal or this guide to install MinIO on Kubernetes or you can use the MinIO Play server for testing
purposes.
To install Spark Operator, you need to add the Helm repository for Spark Operator to your local Helm client.
You can do this by running the following command:
Unset
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
Once the repository is added, you can install Spark Operator using the following command (you may have to
wait a minute while it is installed):
Unset
helm install my-release spark-operator/spark-operator \
--namespace spark-operator \
--set webhook.enable=true \
--set image.repository=openlake/spark-operator \
--set image.tag=3.3.1 \
--create-namespace
Unset
LAST DEPLOYED: Mon Feb 27 19:48:33 2023
NAMESPACE: spark-operator
STATUS: deployed
REVISION: 1
To verify that Spark Operator is installed successfully, you can run the following command:
Unset
kubectl get pods -n spark-operator
Unset
NAME READY STATUS RESTARTS AGE
Now that we have the Spark operator installed, we can deploy a Spark application or Scheduled Spark
application on Kubernetes.
Let's try deploying one of the example simple Spark applications that comes with the Spark operator. You can
find the list of example applications here, and we’re interested in calculating Pi, so we will modify the Spark Pi
application to use Spark 3.3.1 and run it on Kubernetes.
Unset
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
name: pyspark-pi
namespace: spark-operator
spec:
type: Python
pythonVersion: "3"
mode: cluster
image: "openlake/spark-py:3.3.1"
imagePullPolicy: Always
mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
sparkVersion: "3.3.1"
restartPolicy:
type: OnFailure
onFailureRetries: 3
onFailureRetryInterval: 10
onSubmissionFailureRetries: 5
onSubmissionFailureRetryInterval: 20
driver:
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: 3.1.1
serviceAccount: my-release-spark
executor:
cores: 1
memory: "512m"
labels:
version: 3.3.1
The above application will calculate the value of Pi using Spark on Kubernetes. You can save the above
application as spark-pi.yaml and deploy it using the following command:
Unset
kubectl apply -f spark-pi.yaml
To verify that the job is running, you can run the following:
Unset
kubectl -n spark-operator get pods
Unset
NAME                          READY   STATUS    RESTARTS   AGE
You can check the status of the application using the following command:
Unset
kubectl get sparkapplications -n spark-operator
NAME          STATUS      ATTEMPTS   START   FINISH   AGE
You can also check the logs of the application using the following command:
Unset
kubectl logs pyspark-pi-driver -n spark-operator
Unset
23/02/27 15:20:55 INFO DAGScheduler: Job 0 finished: reduce at
/opt/spark/examples/src/main/python/pi.py:42, took 2.597098 s
Pi is roughly 3.137960
Now that we have the simple Spark application working as expected we can try to read and write data from
MinIO using Spark.
Reading and writing data from and to MinIO using Spark is very simple once we have the right dependencies
and configurations in place. In this post we will not be discussing the dependencies; to keep things simple, we
use the openlake/spark-py:3.3.1 image, which contains all the dependencies required to read and write data
from MinIO using Spark.
We will be using the NYC Taxi dataset that is available on MinIO. You can download the dataset from here;
it has ~112M rows and is ~10GB in size. Any existing or new MinIO deployment with
enough free space will work for this exercise. You can use any other dataset of your choice and upload it to
MinIO using the following commands. First, we’ll create the buckets that will be referenced by our applications:
Unset
mc mb <Your-MinIO-Endpoint>/openlake
mc mb <Your-MinIO-Endpoint>/openlake/spark
mc mb <Your-MinIO-Endpoint>/openlake/spark/sample-data
mc cp nyc-taxi-data.csv <Your-MinIO-Endpoint>/openlake/spark/sample-data/nyc-taxi-data.csv
Let's now read and write data from MinIO using Spark. We will use the following sample python application to
do that.
Unset
import logging
import os

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

logger = logging.getLogger("MinioSparkJob")

spark = SparkSession.builder.getOrCreate()


def load_config(spark_context):
    # Point the S3A filesystem at MinIO and pass the credentials from the environment.
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key",
        os.getenv("AWS_ACCESS_KEY_ID", "<Your-MinIO-AccessKey>"))
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.secret.key",
        os.getenv("AWS_SECRET_ACCESS_KEY", "<Your-MinIO-SecretKey>"))
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.endpoint",
        os.getenv("ENDPOINT", "<Your-MinIO-Endpoint>"))
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.connection.ssl.enabled", "true")
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.attempts.maximum", "1")
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.connection.establish.timeout", "5000")
    spark_context._jsc.hadoopConfiguration().set("fs.s3a.connection.timeout", "10000")


load_config(spark.sparkContext)

# Schema of the NYC Taxi CSV (only the column used below is shown;
# the remaining columns are elided in this excerpt)
schema = StructType([
    StructField("passenger_count", IntegerType(), True),
])

df = spark.read.option("header", "true").schema(schema).csv(
    os.getenv("INPUT_PATH", "s3a://openlake/spark/sample-data/taxi-data.csv"))

# Keep only the rows where the passenger count is greater than 6
large_passengers_df = df.filter(df.passenger_count > 6)

total_rows_count = df.count()
filtered_rows_count = large_passengers_df.count()
logger.info(f"Total rows: {total_rows_count}, filtered rows: {filtered_rows_count}")

# File Output Committer is used to write the output to the destination (not recommended for production)
large_passengers_df.write.format("csv").option("header", "true").save(
    os.getenv("OUTPUT_PATH", "s3a://openlake-tmp/spark/nyc/taxis_small"))
The above application reads the NYC Taxi dataset from MinIO and filters the rows where the passenger count
is greater than 6. The filtered data is then written to MinIO. You can save the above code as main.py.
We will now build the docker image that contains the above python application. You can create a Dockerfile
with the following contents to build the image:
# Assumed base image: the Spark 3.3.1 image with the S3A dependencies used earlier in this post
FROM openlake/spark-py:3.3.1
USER root
WORKDIR /app
COPY src/*.py .
You can build your own docker image or use the pre-built image openlake/sparkjob-demo:3.3.1 that is
available on Docker Hub. If you need a refresher on building docker images, please see docker build.
To read and write data from MinIO using Spark, you need to create a secret that contains the MinIO access
key and secret key. You can create the secret using the following command:
Unset
kubectl create secret generic minio-secret \
--from-literal=AWS_ACCESS_KEY_ID=<Your-MinIO-AccessKey> \
--from-literal=AWS_SECRET_ACCESS_KEY=<Your-MinIO-SecretKey> \
--from-literal=ENDPOINT=<Your-MinIO-Endpoint> \
--from-literal=AWS_REGION=us-east-1 \
--namespace spark-operator
Unset
secret/minio-secret created
Now that we have the secret created, we can deploy the Spark application that reads and writes data from
MinIO. You can save the following application as sparkjob-minio.yaml:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-minio
  namespace: spark-operator
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "openlake/sparkjob-demo:3.3.1"
  imagePullPolicy: Always
  mainApplicationFile: local:///app/main.py
  sparkVersion: "3.3.1"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    memory: "1024m"
    labels:
      version: 3.3.1
    serviceAccount: my-release-spark
    env:
      - name: AWS_REGION
        value: us-east-1
      - name: AWS_ACCESS_KEY_ID
        value: <Your-MinIO-AccessKey>
      - name: AWS_SECRET_ACCESS_KEY
        value: <Your-MinIO-SecretKey>
  executor:
    cores: 1
    instances: 3
    memory: "1024m"
    labels:
      version: 3.3.1
    env:
      - name: INPUT_PATH
        value: "s3a://openlake/spark/sample-data/taxi-data.csv"
      - name: OUTPUT_PATH
        value: "s3a://openlake/spark/output/taxi-data-output"
      - name: AWS_REGION
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: AWS_REGION
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: AWS_ACCESS_KEY_ID
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: AWS_SECRET_ACCESS_KEY
      - name: ENDPOINT
        valueFrom:
          secretKeyRef:
            name: minio-secret
            key: ENDPOINT
The above Python Spark application YAML file sets the input and output paths for the job and pulls the MinIO
endpoint and credentials from the minio-secret created earlier. Deploy it with kubectl apply -f sparkjob-minio.yaml.
After the application is deployed, you can check the status of the application using the following command:
Unset
kubectl get sparkapplications -n spark-operator
Unset
NAME STATUS ATTEMPTS START FINISH AGE
Once the application is completed, you can check the output data in MinIO. You can use the following
command to list the files in the output directory:
Unset
mc ls minio/openlake/spark/output/taxi-data-output
You can also check the logs of the application using the following command:
Unset
kubectl logs -f spark-minio-driver -n spark-operator
Unset
23/02/27 19:06:11 INFO FileFormatWriter: Finished processing stats for write
job 91dee4ed-3f0f-4b5c-8260-bf99c0b662ba.
There is also an option for you to use the Spark UI to monitor the application while it runs. You can use the
following command to port forward the Spark UI for external access:
Unset
kubectl port-forward svc/spark-minio-ui-svc 4040:4040 -n spark-operator
In your browser, you can access the Spark UI using the following URL:
Unset
http://localhost:4040
Once you are done, you can delete the Spark application using the following command:
Unset
kubectl delete sparkapplications spark-minio -n spark-operator
Deploying a Scheduled Spark Application is almost the same as deploying a normal Spark Application. The
only difference is that you need to add the spec.schedule field to the Spark Application YAML file and the
kind is ScheduledSparkApplication. You can save the following application as
sparkjob-minio-scheduled.yaml:
Unset
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: ScheduledSparkApplication
metadata:
name: spark-scheduled-minio
namespace: spark-operator
spec:
concurrencyPolicy: Allow
template:
type: Python
pythonVersion: "3"
mode: cluster
image: "openlake/sparkjob-demo:3.3.1"
imagePullPolicy: Always
mainApplicationFile: local:///app/main.py
sparkVersion: "3.3.1"
restartPolicy:
type: OnFailure
onFailureRetryInterval: 10
onSubmissionFailureRetries: 5
onSubmissionFailureRetryInterval: 20
driver:
cores: 1
memory: "1024m"
labels:
version: 3.3.1
serviceAccount: my-release-spark
env:
- name: AWS_REGION
value: us-east-1
- name: AWS_ACCESS_KEY_ID
value: <Your-MinIO-AccessKey>
- name: AWS_SECRET_ACCESS_KEY
value: <Your-MinIO-SecretKey>
executor:
cores: 1
instances: 3
memory: "1024m"
labels:
version: 3.3.1
env:
- name: INPUT_PATH
value: "s3a://openlake/spark/sample-data/taxi-data.csv"
value: "s3a://openlake/spark/output/taxi-data-output"
- name: AWS_REGION
valueFrom:
secretKeyRef:
name: minio-secret
key: AWS_REGION
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: minio-secret
key: AWS_ACCESS_KEY_ID
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: minio-secret
key: AWS_SECRET_ACCESS_KEY
- name: ENDPOINT
valueFrom:
secretKeyRef:
name: minio-secret
key: ENDPOINT
You can deploy and see the results of the application in the same way as the normal Spark Application. The
above Spark Application will run every hour and will write the output to the same bucket.
All the source code for this tutorial is available in the following GitHub repository: openlake/spark
Apache Spark and MinIO are powerful tools for data lakes and analytics. Running Spark on Kubernetes gives
you the benefits of better resource management, fault tolerance and scalability for Spark jobs. Add high
performance and highly scalable MinIO and you have a combination that supports all your Spark workloads
wherever you need to run them – public/private cloud, data center, edge – on the Kubernetes platform of your
choice.
Download MinIO and give the Spark Operator a test drive. If you’ve got questions, please ask us on our Slack
channel.
About MinIO
MinIO is pioneering high performance, Kubernetes-native object storage for the multi-cloud. The
software-defined, Amazon S3-compatible object storage system is used by more than half of the Fortune
500. With 1.18B+ Docker pulls, MinIO is the fastest-growing cloud object storage company and is consistently
ranked by industry analysts as a leader in object storage. Founded in 2014, the company is backed by Intel
Capital, Softbank Vision Fund 2, Dell Technologies Capital, Nexus Venture Partners, General Catalyst and key
angel investors.
Additional Information: