Write your first Kubernetes operator in Go
What you need to know to get started
What is a Kubernetes operator?
Kubernetes is designed for automation. Out of the box, you get lots of built-in automation from the core of Kubernetes.
Kubernetes’ operator pattern concept lets you extend the cluster’s behaviour without modifying the code of Kubernetes itself by linking controllers to one or more custom resources. Operators are clients of the Kubernetes API that act as controllers for a Custom Resource.
- Kubernetes official documentation
With out-of-the-box Kubernetes features, we can automate plenty of generic operational use cases: running 3 replicas of some stateless application, automatically increasing that to 6 when CPU usage goes above 60%, and so on.
But for some operational use cases, built-in Kubernetes features are not enough.
For example, operating a distributed database requires a lot of domain knowledge and domain-specific operations. Under increasing load, simply adding one more instance (a Kubernetes replica) won't help; the new instance also needs to be added to the active member list, and much more.
This is where the Kubernetes operator comes in.
Kubernetes operators are Software SREs.
Using the Kubernetes operator pattern, we can extend Kubernetes' built-in features to automate a specific complex workload that requires domain-specific operations normally performed by a human operator.
Before diving into writing an operator, let's first understand how Kubernetes works.
Kubernetes architecture
That's it; that is what Kubernetes is all about. You define a target state of the world, there is the real state of the world, and Kubernetes continuously observes and takes action to make the real state match the target state.
Now, if you were to design such a system yourself, what would you do?
The first thing that comes to mind: we need a datastore to hold both the desired state and the current state.
We also need to define the entity/schema of the objects that hold all the information required to act on them. For example, to deploy an application to a server, we need to know where to find the application, how many instances of it we want to run, and so on.
We also need an API server so that users can create, update, delete, and view the desired state of objects in the datastore, and so that we can handle authentication and authorization.
We need a control loop process that continuously observes the current state and the target state in the datastore and takes action (reconciles) to make the current state the same as the target state.
How should the control loop process access the datastore? Should it communicate with the datastore directly and run operations on it? In that case, the control loop process needs to know all about the schema and validation of objects, and as soon as we add more control loop processes for different things, the same logic gets duplicated. So it seems better that only the API server communicates with the datastore directly and owns all the validation and related logic, while all other components talk to the API server to operate on the datastore.
At this point, we know that our control loop process needs to know about everything happening in the datastore as soon as possible so that it can take action. How can we do that?
- We could poll the API server periodically.
- Or we could use some sort of event-driven mechanism: after changing any object, the API server emits an event about the change to a message broker/queue, and the control loop process subscribes to and consumes those events.
What if our API server implemented a long-polling mechanism instead?
In that case, the control loop process could open a connection to the API server, and the API server would stream every change to it over that connection. This seems to be the simplest option.
Now that we understand the problem better and have an initial design in mind, let's look at the real architecture of Kubernetes.
There is a logical separation of components into a control plane and a data plane: the control plane is about running Kubernetes itself, while the data plane is about running user workloads on worker nodes.
Etcd, the API server, the scheduler, and the controller manager are control plane components and run on the master node.
KubeProxy and Kubelet are data plane components and run on worker nodes.
Etcd
Etcd is the datastore of Kubernetes for all cluster data. All the objects (Pods, ReplicaSets, Deployments, Services, Secrets, and so on) are stored in Etcd.
API Server
The API server is the only component that communicates directly with Etcd; all other components call the API server's APIs to do anything with Etcd. It also handles authentication, authorization, validation, and so on.
As we figured out earlier, our control loop process needs to know about everything happening to resources on the API server, and we discussed a couple of ways to achieve this. Kubernetes uses the long-polling mechanism on the API server, and it is called a watch.
Clients can watch resources on the API server by adding the ?watch=true query string to the API call, e.g. GET /../pods?watch=true.
So whenever we say some component X watches resource Y on the API server, you now know what that means: component X makes an HTTP request with ?watch=true, and the API server treats it as a long-polling connection and starts streaming modifications to the client.
Every Kubernetes resource has a field called resourceVersion, which changes (increases) with every update. We can also track changes from a specific state of an object by passing this resourceVersion when watching.
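To make this concrete, here is a minimal sketch of a watch using client-go (illustrative only; the operator we build later relies on higher-level tooling that wraps this same mechanism). It opens a watch on pods, which under the hood is the same ?watch=true long-polling request, and prints the events the API server streams back.

package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig (assumes ~/.kube/config exists).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // Watch pods in the "default" namespace. Setting ListOptions.ResourceVersion
    // would resume the watch from a known point instead of from "now".
    watcher, err := clientset.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    defer watcher.Stop()

    // The API server streams ADDED/MODIFIED/DELETED events over the open connection.
    for event := range watcher.ResultChan() {
        if pod, ok := event.Object.(*corev1.Pod); ok {
            fmt.Printf("%s pod %s (resourceVersion %s)\n", event.Type, pod.Name, pod.ResourceVersion)
        }
    }
}

If you run this against a cluster and create or delete a pod in another terminal, you will see the events arrive over the single open connection.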
That's all the API server does: it lets other components CRUD objects and notifies all watchers about changes. It doesn't itself schedule pods or deploy anything on nodes.
Scheduler
The scheduler watches the API server for pod changes. Whenever it finds a pod without a nodeName, it figures out, based on a scheduling algorithm, which node would be appropriate for running that pod, and then updates that pod object with nodeName: somenode.
So the scheduler's responsibility is to figure out an optimal node to run the pod on and to record that nodeName on the pod object through the API server.
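As a toy illustration of that responsibility (this is not how kube-scheduler is actually implemented, and the helper below is purely hypothetical), a client can find unscheduled pods with a field selector and assign one of them to a node. Note that the nodeName assignment concretely goes through the pods/binding subresource of the API server:

package main

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// bindFirstUnscheduledPod finds a pod that has no nodeName yet and binds it
// to the given node. A real scheduler would score nodes before choosing one.
func bindFirstUnscheduledPod(ctx context.Context, clientset kubernetes.Interface, namespace, nodeName string) error {
    // Pods with an empty spec.nodeName are the ones still waiting for a node.
    pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
        FieldSelector: "spec.nodeName=",
    })
    if err != nil || len(pods.Items) == 0 {
        return err
    }

    pod := pods.Items[0]
    // Conceptually this "sets nodeName on the pod"; concretely it creates a
    // Binding object through the pods/binding subresource.
    return clientset.CoreV1().Pods(namespace).Bind(ctx, &corev1.Binding{
        ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
        Target:     corev1.ObjectReference{Kind: "Node", Name: nodeName},
    }, metav1.CreateOptions{})
}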
Kubelet
Kubelet runs on every worker node. Its first job is to register its node with the API server. It then watches the API server for pods: whenever it sees a pod scheduled to run on its node, it deploys that pod using the container runtime, and likewise it removes any pods running on the node that have been deleted. It also continuously reports the status, resource usage, and events of all containers running on that node back to the API server.
Controller Manager
As you have already seen, the pod is the basic building block of Kubernetes. The scheduler watches for pod changes and updates pods with a nodeName, and Kubelet watches for pod changes and deploys the pods whose nodeName matches the node it is running on.
So how do Deployments, ReplicaSets, and other more complex types work?
The Controller manager handles them. Currently, a single Controller manager process contains several controllers performing various reconciliation tasks.
If many controllers run in a cluster and each of them queries or watches objects on its own, traffic on the API server increases. So the controller manager does all the watching and querying of the API server and maintains a local cache; whenever a change event occurs, it passes that event to the appropriate controller's work queue, and that controller then performs the reconciliation.
Let's explore how the ReplicaSet controller works, in a simple overview.
The ReplicaSet controller watches ReplicaSet resources and pods on the API server for changes.
When a ReplicaSet object is created, the controller manager receives that creation event and passes it to the ReplicaSet controller, which creates pods matching the replica count defined in that ReplicaSet object. You already know what happens next, right? The scheduler receives these pod creation events and assigns a node to each pod, and finally Kubelet sees that there are new pods to be deployed on its node and deploys them.
Since the ReplicaSet controller also watches pods, whenever a pod is deleted it receives that event and creates another pod to make the current state the same as the desired state.
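To make the reconciliation idea concrete, here is a toy sketch (illustrative only, not the real ReplicaSet controller code) of the decision such a controller has to make on every event:

// reconcileReplicas compares the desired replica count with the pods that
// currently exist and decides how many new pods to create, or which existing
// pods to delete, to converge on the desired state.
func reconcileReplicas(desired int, existingPods []string) (createCount int, deletePods []string) {
    if len(existingPods) < desired {
        return desired - len(existingPods), nil
    }
    return 0, existingPods[desired:]
}

Everything else (watching, caching, work queues) exists to feed this comparison with fresh data and to apply its result through the API server.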
Kubernetes operator
So we can see how Kubernetes controllers build on top of fundamental blocks like pods to manage more complex types like Deployments, ReplicaSets, Jobs, and others.
At this point, it should be clear that a Kubernetes operator is nothing but a controller manager with one or more controllers that watch and modify custom resources as well as other built-in resources.
Let’s first define a problem
Let's imagine a use case: there is an application used internally by your company. The requirement is that during office hours (peak traffic) you want to run a certain number of replicas, then gradually decrease them to a minimum during the night, and gradually increase them back to the maximum the next morning for office hours.
9 AM to 5 PM: 5 replicas
5 PM to 7 PM: 3 replicas
7 PM to 7 AM: 2 replicas
7 AM to 9 AM: 3 replicas
Basically, we want to scale up or down our application based on the hour of the day.
So, as the operator/SRE doing this yourself, following the above requirement, every day at 9 AM you would scale the replicas up to 5, then at 5 PM you would scale them down to 3, and so on.
Since we know exactly when, how, and what we need to do, let's turn this operational logic into a Kubernetes operator.
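At its heart, the decision logic of this operator is tiny. A sketch of it in Go might look like this (the names below are illustrative, not necessarily what the finished operator uses internally):

// ScheduleWindow maps an hour range to a replica count.
type ScheduleWindow struct {
    StartHour int   // inclusive, 24-hour format
    EndHour   int   // exclusive, 24-hour format
    Replicas  int32 // how many replicas to run inside this window
}

// desiredReplicas returns the replica count for the given hour of the day,
// falling back to defaultReplicas when no window matches.
func desiredReplicas(hour int, windows []ScheduleWindow, defaultReplicas int32) int32 {
    for _, w := range windows {
        inWindow := false
        if w.StartHour <= w.EndHour {
            inWindow = hour >= w.StartHour && hour < w.EndHour
        } else { // the window wraps past midnight, e.g. 19 -> 7
            inWindow = hour >= w.StartHour || hour < w.EndHour
        }
        if inWindow {
            return w.Replicas
        }
    }
    return defaultReplicas
}

With the schedule above, this returns 5 at 10 AM and 2 at 3 AM; the operator's job is simply to run this kind of check periodically and patch the Deployment's replica count accordingly.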
But do you remember that the official documentation says operators are clients of the Kubernetes API that act as controllers for a Custom Resource?
Let's talk about Custom Resources now.
Custom Resource (CR) and Custom Resource Definition (CRD)
A Pod object has some properties like image, port, metadata, and other things, and that's sufficient for working with pods.
We know how a ReplicaSet works: it creates, updates, and deletes pods. But to do that it also needs other information, like the replica count, and the ReplicaSet object has that field.
Have you noticed that for our use case we also need some information, namely the scaling config: for which hours, how many replicas should there be? We cannot store that information on the API server with any built-in object. We can achieve this with Custom Resources: we define a schema (a Custom Resource Definition) based on our needs and register it with the API server, and the API server then provides endpoints to create, read, update, delete, and watch that custom resource.
Let's first use the operator
I have already written the operator that solves our use case; you can find it here: https://github.com/BackAged/tdset-operator
To deploy the operator, see https://github.com/BackAged/tdset-operator/tree/main/example
You can simply run kubectl apply -f example/
But let's go through the steps in more detail to understand them better.
- First, we need to deploy the custom resource definition to add the custom resource to the API server: example/tdset_crd.yaml
- Next we deploy the operator. As you know, the operator's controller watches custom resources and may create, read, update, and delete other built-in resources on the API server, so it requires permissions to do so: example/role.yaml, example/role_binding.yaml, example/service_account.yaml
- Now that we have the permissions ready, let's deploy the TDSet operator: example/tdset_operator.yaml. You can see it's a regular Deployment that runs the TDSet controller.
- At this point, we have deployed the TDSet custom resource definition, which has added TDSet custom resource APIs to the API server, and we have deployed a controller that watches TDSet resources and takes action (reconciles).
Let's deploy a TDSet custom resource to achieve our requirements: example/tdset_helloworld.yaml
apiVersion: schedule.rs/v1 # see, we have added a new API group and version
kind: TDSet                # and a new kind -> resource
metadata:
  name: tdset-hello-world
  namespace: temp
spec:
  container:
    image: crccheck/hello-world
    port: 8000
  schedulingConfig:
    - startTime: 9 # 24-hour format
      endTime: 17
      replica: 5
    - startTime: 17
      endTime: 19
      replica: 3
    - startTime: 19
      endTime: 0
      replica: 2
    - startTime: 1
      endTime: 7
      replica: 2
    - startTime: 7
      endTime: 9
      replica: 3
  defaultReplica: 3
You will see that a new Deployment is created with the default replica count, and the replica count is then automatically updated based on the hour of the day!
Let’s write the operator
We are going to use https://github.com/operator-framework/operator-sdk to write the operator; it's built on top of https://github.com/kubernetes-sigs/kubebuilder.
It generates lots of boilerplate code for the controller manager, controllers, CRDs, roles, the mapping between Go types and CRDs, etc., and lets us focus on the actual business logic of the controller's reconciliation.
Please see operator-framework documentation for installation and basic usage.
1. Init the project
operator-sdk init --domain rs --repo github.com/BackAged/tdset-operator --plugins=go/v4-alpha
2. Create the CRDs, Go types, and controller
operator-sdk create api --group schedule --version v1 --kind TDSet --resource --controller
3. Update tdset_types.go
// Container defines container related properties.
type Container struct {
    Image string `json:"image"`
    Port  int    `json:"port"`
}

// Service defines service related properties.
type Service struct {
    Port int `json:"port"`
}

// Scheduling defines scheduling related properties.
type Scheduling struct {
    // +kubebuilder:validation:Minimum=0
    // +kubebuilder:validation:Maximum=23
    StartTime int `json:"startTime"`

    // +kubebuilder:validation:Minimum=0
    // +kubebuilder:validation:Maximum=23
    EndTime int `json:"endTime"`

    // +kubebuilder:validation:Minimum=0
    Replica int `json:"replica"`
}

// TDSetSpec defines the desired state of TDSet
type TDSetSpec struct {
    // +kubebuilder:validation:Required
    Container Container `json:"container"`

    // +kubebuilder:validation:Optional
    Service Service `json:"service,omitempty"`

    // +kubebuilder:validation:Required
    SchedulingConfig []*Scheduling `json:"schedulingConfig"`

    // +kubebuilder:validation:Required
    // +kubebuilder:validation:Minimum=1
    DefaultReplica int32 `json:"defaultReplica"`

    // +kubebuilder:validation:Optional
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=1440
    IntervalMint int32 `json:"intervalMint"`
}

// TDSetStatus defines the observed state of TDSet
type TDSetStatus struct {
    Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type" protobuf:"bytes,1,rep,name=conditions"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status

// TDSet is the Schema for the tdsets API
type TDSet struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   TDSetSpec   `json:"spec,omitempty"`
    Status TDSetStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// TDSetList contains a list of TDSet
type TDSetList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []TDSet `json:"items"`
}

func init() {
    SchemeBuilder.Register(&TDSet{}, &TDSetList{})
}
4. Generate the controller manager, controllers, CRDs, roles, and other manifests
make generate
make manifests

// this is where we glue the controller to the controller manager
// and declare which resources it watches.
func (r *TDSetReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&schedulev1.TDSet{}).
        Owns(&appsv1.Deployment{}).
        Complete(r)
}

5. Implement the TDSet controller's Reconcile function
func (r *TDSetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {}
In the reconciler function, we need to handle every possible state there could be and take appropriate action.
This is what our implementation looks like
func (r *TDSetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := log.FromContext(ctx)

    log.Info("starting reconciliation")

    // Get the TDSet
    tdSet := &schedulev1.TDSet{}
    err := r.GetTDSet(ctx, req, tdSet)
    if err != nil {
        if apierrors.IsNotFound(err) {
            log.Info("TDSet resource not found. Ignoring since object must be deleted")
            return ctrl.Result{}, nil
        }
        log.Error(err, "Failed to get TDSet")
        return ctrl.Result{}, err
    }

    // Try to set the initial condition status
    err = r.SetInitialCondition(ctx, req, tdSet)
    if err != nil {
        log.Error(err, "failed to set initial condition")
        return ctrl.Result{}, err
    }

    // TODO: Delete finalizer

    // Create the Deployment if it does not exist yet
    ok, err := r.DeploymentIfNotExist(ctx, req, tdSet)
    if err != nil {
        log.Error(err, "failed to deploy deployment for TDSet")
        return ctrl.Result{}, err
    }
    if ok {
        return ctrl.Result{RequeueAfter: time.Minute}, nil
    }

    // Update the deployment replica count if it is mismatched
    err = r.UpdateDeploymentReplica(ctx, req, tdSet)
    if err != nil {
        log.Error(err, "failed to update deployment for TDSet")
        return ctrl.Result{}, err
    }

    // Requeue so that the replica count is re-evaluated on the next interval
    interval := DefaultReconciliationInterval
    if tdSet.Spec.IntervalMint != 0 {
        interval = int(tdSet.Spec.IntervalMint)
    }

    log.Info("ending reconciliation")

    return ctrl.Result{RequeueAfter: time.Duration(interval) * time.Minute}, nil
}
Please go through the GitHub repo to see all the controller implementation-related code.
https://github.com/BackAged/tdset-operator/tree/main/controllers
At this point, you may have a question in mind: since different controllers update different things concurrently, how does Kubernetes make sure resources don't run into concurrent-update issues like lost updates or stale updates?
Kubernetes uses the resourceVersion field on resources for optimistic concurrency control. For any update request, the API server checks whether the request's resourceVersion matches the current resourceVersion in Etcd; if it doesn't, the object was updated by some other process in the meantime, and the API server rejects the update request. It's similar to:
UPDATE ResourceX
SET fieldX= 'XX', resourceVersion = x + 1
WHERE ResourceXID = x and resourceVersion = x
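In controller code, the usual way to deal with such a rejected update is to re-read the object and retry. Here is a minimal sketch using client-go's retry helper (illustrative; the TDSet operator may handle conflicts differently):

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/util/retry"
)

// scaleDeployment sets the replica count of a Deployment, retrying whenever
// the API server rejects the update because our resourceVersion is stale.
func scaleDeployment(ctx context.Context, clientset kubernetes.Interface, namespace, name string, replicas int32) error {
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Fetch the latest version of the object on every attempt so the
        // update always carries the current resourceVersion.
        dep, err := clientset.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        dep.Spec.Replicas = &replicas
        _, err = clientset.AppsV1().Deployments(namespace).Update(ctx, dep, metav1.UpdateOptions{})
        return err
    })
}

RetryOnConflict re-runs the function with a small backoff whenever the API server returns a conflict error, so concurrent writers eventually converge without losing updates.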
We have come to the end of this long article. I hope you now have a good understanding of Kubernetes and Kubernetes operators.
I'd love to see what cool operators you write yourself!