What Are Kubernetes StatefulSets? When Should You Use Them?

StatefulSets are Kubernetes objects used to consistently deploy stateful application components. Pods created as part of a StatefulSet are given persistent identifiers that they retain even when they’re rescheduled.

A StatefulSet can deploy applications that need to reliably identify specific replicas, roll out updates in a pre-defined order, or stably access storage volumes. They’re applicable to many different use cases but are most commonly used for databases and other types of persistent data store.

In this article you’ll learn what StatefulSets are, how they work, and when you should use them. We’ll also cover their limitations and the situations where other Kubernetes objects are a better choice.

What Are StatefulSets?

Making Pods part of a StatefulSet instructs Kubernetes to schedule and scale them in a guaranteed manner. Each Pod gets allocated a unique identity which any replacement Pods retain.

The Pod name is suffixed with an ordinal index that defines its order during scheduling operations. Ordinals start at zero, so a StatefulSet called mysql containing three replicas will create the following named Pods:

  • mysql-0
  • mysql-1
  • mysql-2

Pods use their names as their hostname so other services that need to reliably access the second replica of the StatefulSet can connect to mysql-1. Even if the specific Pod that runs mysql-1 gets rescheduled later on, its identity will pass to its replacement.
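Inside the cluster, these hostnames become resolvable DNS names when a headless Service governs the StatefulSet. The helper below is a minimal sketch of the naming pattern, not part of any Kubernetes tooling: the function name pod_dns_names is hypothetical, and the default namespace and the cluster.local cluster domain are assumptions that may differ in your cluster.

```shell
# Print the in-cluster DNS name of every Pod in a StatefulSet.
# Usage: pod_dns_names <statefulset> <headless-service> <replicas> [namespace] [cluster-domain]
pod_dns_names() {
  local sts="$1" svc="$2" replicas="$3"
  local ns="${4:-default}" domain="${5:-cluster.local}"
  local i=0
  while [ "$i" -lt "$replicas" ]; do
    # Pattern: <pod-name>.<service>.<namespace>.svc.<cluster-domain>
    echo "${sts}-${i}.${svc}.${ns}.svc.${domain}"
    i=$((i + 1))
  done
}

pod_dns_names mysql mysql 3
# mysql-0.mysql.default.svc.cluster.local
# mysql-1.mysql.default.svc.cluster.local
# mysql-2.mysql.default.svc.cluster.local
```

Clients in the same namespace can usually shorten this to the Pod name plus the Service name, such as mysql-1.mysql.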

StatefulSets also enforce that Pods are removed in reverse order of their creation. If the three-replica StatefulSet is scaled down to one replica, mysql-2 is guaranteed to exit first, followed by mysql-1. This ordering doesn’t apply when the entire StatefulSet is deleted, and it can be disabled by setting the StatefulSet’s podManagementPolicy field to Parallel.
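To make the ordering concrete, here is a small sketch that prints which Pods a scale-down would remove and in what sequence. It only models the default ordered behavior with plain shell arithmetic. It does not talk to a cluster, and scale_down_order is a hypothetical helper name.

```shell
# Print the Pods removed by a scale-down, in termination order
# (highest ordinal first).
# Usage: scale_down_order <statefulset> <current-replicas> <target-replicas>
scale_down_order() {
  local sts="$1" current="$2" target="$3"
  local i=$((current - 1))
  while [ "$i" -ge "$target" ]; do
    echo "${sts}-${i}"
    i=$((i - 1))
  done
}

# Scaling a three-replica StatefulSet (ordinals 0 to 2) down to one replica
scale_down_order mysql 3 1
# mysql-2
# mysql-1
```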

StatefulSet Use Cases

StatefulSets are normally used to run replicated applications where individual Pods have different roles. As an example, you could be deploying a MySQL database with a primary instance and two read-only replicas. A regular ReplicaSet or Deployment would not be appropriate because you couldn’t reliably identify the Pod running the primary replica.

StatefulSets address this by guaranteeing that each Pod in the StatefulSet maintains its identity. Your other services can reliably connect to mysql-0 to interact with the primary replica. StatefulSets also enforce that new Pods are only started when the previous Pod is Running and Ready. This ensures the read-only replicas get created after the primary is up and ready to expose its data.

The purpose of StatefulSets is to accommodate non-interchangeable replicas inside Kubernetes. Whereas Pods in a stateless application are equivalent to each other, stateful workloads require an intentional approach to rollouts, scaling, and termination.

StatefulSets integrate with persistent volumes to support persistent storage that sticks to each replica. Each Pod gets its own persistent volume claim, and the same volume is automatically reattached when the replica is rescheduled to another node.
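Each replica’s claim is named after the volume claim template, the StatefulSet, and the Pod’s ordinal, which is how the same storage finds its way back to the same identity. The sketch below prints those names. It assumes a template called data on a StatefulSet called mysql, and pvc_names is a hypothetical helper.

```shell
# Print the PersistentVolumeClaim name created for each replica.
# Pattern: <claim-template>-<statefulset>-<ordinal>
# Usage: pvc_names <claim-template> <statefulset> <replicas>
pvc_names() {
  local template="$1" sts="$2" replicas="$3"
  local i=0
  while [ "$i" -lt "$replicas" ]; do
    echo "${template}-${sts}-${i}"
    i=$((i + 1))
  done
}

pvc_names data mysql 3
# data-mysql-0
# data-mysql-1
# data-mysql-2
```

On a running cluster, the output of kubectl get pvc should show claims that follow this pattern.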

Creating a StatefulSet

Here’s an example YAML manifest that defines a StatefulSet for running MySQL with a primary node and two replicas:

apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
    - name: mysql
      port: 3306
  clusterIP: None
  selector:
    app: mysql
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 3
  template:
    metadata:
      labels:
        app: mysql
    spec:
      initContainers:
      - name: mysql-init
        image: mysql:8.0
        command:
        - bash
        - "-c"
        - |
          set -ex
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          echo [mysqld] > /mnt/conf/server-id.cnf
          # MySQL doesn't allow "0" as a `server-id` so we have to add 1 to the Pod's index
          echo server-id=$((1 + $ordinal)) >> /mnt/conf/server-id.cnf
          if [[ $ordinal -eq 0 ]]; then
            printf "[mysqld]\nlog-bin" > /mnt/conf/primary.cnf
          else
            printf "[mysqld]\nsuper-read-only" > /mnt/conf/replica.cnf
          fi
        volumeMounts:
        - name: config
          mountPath: /mnt/conf
      containers:
      - name: mysql
        image: mysql:8.0
        env:
        - name: MYSQL_ALLOW_EMPTY_PASSWORD
          value: "1"
        ports:
        - name: mysql
          containerPort: 3306
        volumeMounts:
        - name: config
          mountPath: /etc/mysql/conf.d
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        livenessProbe:
          exec:
            command: ["mysqladmin", "ping"]
          initialDelaySeconds: 30
          periodSeconds: 5
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command: ["mysql", "-h", "127.0.0.1", "-e", "SELECT 1"]
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 1
      volumes:
      - name: config
        emptyDir: {}
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi

This is quite a long manifest, so let’s unpack what happens.

  • A headless service is created by setting its clusterIP to None. This is tied to the StatefulSet and provides the network identities for its Pods.
  • A StatefulSet is created to hold the MySQL Pods. The replicas field specifies that three Pods will run. The headless service is referenced by the serviceName field.
  • Within the StatefulSet, an init container is created that pre-populates files inside a config directory backed by an emptyDir volume. The container runs a Bash script that establishes the ordinal index of the running Pod. When the index is 0, the Pod is the first to be created within the StatefulSet so it becomes the MySQL primary node. The other Pods are configured as replicas. The appropriate config file gets written into the volume where it’ll be accessible to the MySQL container later on.
  • The MySQL container is created with the config volume mounted to the correct MySQL directory. This ensures the MySQL instance gets configured as either the primary or a replica, depending on whether it’s the first Pod to start in the StatefulSet.
  • Liveness and readiness probes are used to detect when the MySQL instance is healthy and ready. Successive Pods in the StatefulSet don’t start until the previous one is Running and Ready, ensuring MySQL replicas don’t exist before the primary node is up.
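The ordinal logic from the init container can be tried locally before it goes anywhere near a cluster. The following sketch swaps the manifest’s Bash regex match for POSIX parameter expansion so it runs in any sh-compatible shell. The function names pod_ordinal and mysql_role are hypothetical, and the output format is made up for illustration.

```shell
# Extract the ordinal from a StatefulSet Pod hostname such as "mysql-2".
# Returns non-zero if the hostname does not end in "-<number>".
pod_ordinal() {
  local host="$1"
  local ordinal="${host##*-}"   # strip everything up to the last "-"
  case "$host" in
    *-*) ;;
    *) return 1 ;;
  esac
  case "$ordinal" in
    ''|*[!0-9]*) return 1 ;;
  esac
  echo "$ordinal"
}

# Map a hostname to the role and server-id the init container would configure.
mysql_role() {
  local ordinal
  ordinal="$(pod_ordinal "$1")" || return 1
  # The manifest adds 1 to the ordinal so that server-id is never 0.
  if [ "$ordinal" -eq 0 ]; then
    echo "primary server-id=$((ordinal + 1))"
  else
    echo "replica server-id=$((ordinal + 1))"
  fi
}

mysql_role mysql-0   # primary server-id=1
mysql_role mysql-2   # replica server-id=3
```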

An ordinary Deployment or ReplicaSet could not implement this workflow. Once your Pods have started, you can scale the StatefulSet up or down without risking the destruction of the MySQL primary node. Kubernetes provides a guarantee that the established Pod order will be respected.

# Create the MySQL StatefulSet
$ kubectl apply -f mysql-statefulset.yaml

# Scale up to 5 Pods - a MySQL primary and 4 MySQL replicas
$ kubectl scale statefulset mysql --replicas=5

Rolling Updates

StatefulSets implement rolling updates when you change their Pod template. The StatefulSet controller will replace each Pod in sequential reverse order, using the persistently assigned ordinal indexes. With three replicas, mysql-2 will be deleted and replaced first, followed by mysql-1 and mysql-0. mysql-1 won’t get updated until the new mysql-2 Pod is Running and Ready.

The rolling update mechanism includes support for staged deployments too. Setting the .spec.updateStrategy.rollingUpdate.partition field in your StatefulSet’s manifest instructs Kubernetes to only update the Pods with an ordinal index greater than or equal to the given partition.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 3
  updateStrategy:
    rollingUpdate:
      partition: 1
  template:
    ...
  volumeClaimTemplates:
    ...

In this example only Pods indexed 1 or higher will be targeted by update operations. The first Pod in the StatefulSet won’t receive a new specification until the partition is lowered or removed.
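The partition can also be changed on a live StatefulSet, which is how a staged rollout is usually advanced. The sketch below first models which Pods a given partition exposes to an update, then wraps kubectl patch to change the value. Both function names are hypothetical, and the KUBECTL variable is an assumption added so the command can be stubbed out for a dry run.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# List the Pods a rolling update touches for a given partition,
# in update order (highest ordinal first).
# Usage: pods_in_rollout <statefulset> <replicas> <partition>
pods_in_rollout() {
  local sts="$1" replicas="$2" partition="$3"
  local i=$((replicas - 1))
  while [ "$i" -ge "$partition" ]; do
    echo "${sts}-${i}"
    i=$((i - 1))
  done
}

# Change the partition of an existing StatefulSet.
# Usage: set_partition <statefulset> <partition>
set_partition() {
  local sts="$1" partition="$2"
  local patch
  patch="$(printf '{"spec":{"updateStrategy":{"type":"RollingUpdate","rollingUpdate":{"partition":%d}}}}' "$partition")"
  "$KUBECTL" patch statefulset "$sts" -p "$patch"
}

pods_in_rollout mysql 3 1
# mysql-2
# mysql-1
```

Once the updated replicas look healthy, running set_partition mysql 0 would let the update reach mysql-0 as well.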

Limitations

StatefulSets have some limitations you should be aware of before you adopt them. These common gotchas can trip you up when you start deploying stateful applications.

  • Deleting a StatefulSet does not guarantee the Pods will be terminated in the order indicated by their identities.
  • Deleting a StatefulSet or scaling down its replica count will not delete any associated volumes. This guards against accidental data loss.
  • A rolling update can leave the StatefulSet in a broken state. This happens when you supply a configuration that never transitions to the Running or Ready state because of a problem with your application. Reverting to a good configuration won’t fix the problem on its own because Kubernetes waits indefinitely for the bad Pod to become Ready. You have to manually resolve the situation by deleting the pending or failed Pods.
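The recovery for a stuck rollout comes down to two commands: restore a working Pod template, then delete the Pod that is stuck. Here is one way it might look, assuming the previous revision was healthy so kubectl rollout undo is an acceptable way to revert. Re-applying a known-good manifest would work too. The function name recover_stuck_rollout and the KUBECTL override are hypothetical.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Recover a StatefulSet whose rolling update is stuck on a Pod
# that never becomes Ready.
# Usage: recover_stuck_rollout <statefulset> <stuck-pod>
recover_stuck_rollout() {
  local sts="$1" pod="$2"
  # 1. Restore the previous Pod template.
  "$KUBECTL" rollout undo "statefulset/${sts}" || return 1
  # 2. Delete the stuck Pod so the controller recreates it
  #    from the restored template.
  "$KUBECTL" delete pod "$pod"
}

# Example (requires cluster access):
#   recover_stuck_rollout mysql mysql-2
```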

StatefulSets also omit a mechanism for resizing the volumes linked to each Pod. You have to manually edit each persistent volume and its corresponding persistent volume claim, then delete the StatefulSet and orphan its Pods. Creating a new StatefulSet with the revised specification will allow Kubernetes to reclaim the orphaned Pods and resize the volumes.
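A rough sketch of that resize workflow follows. It assumes the StorageClass behind the claims permits volume expansion, in which case patching each claim is typically enough for the underlying volume to be grown. It also assumes your manifest file has already been edited to request the new size. The function name, its arguments, and the KUBECTL override are all hypothetical.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Usage: resize_statefulset_volumes <statefulset> <claim-template> <replicas> <new-size> <manifest>
resize_statefulset_volumes() {
  local sts="$1" template="$2" replicas="$3" size="$4" manifest="$5"
  local i=0
  # 1. Request the new size on each replica's claim.
  while [ "$i" -lt "$replicas" ]; do
    "$KUBECTL" patch pvc "${template}-${sts}-${i}" \
      -p "{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"${size}\"}}}}" || return 1
    i=$((i + 1))
  done
  # 2. Delete the StatefulSet object but leave its Pods running.
  "$KUBECTL" delete statefulset "$sts" --cascade=orphan || return 1
  # 3. Recreate the StatefulSet so it adopts the orphaned Pods.
  "$KUBECTL" apply -f "$manifest"
}

# Example (requires cluster access):
#   resize_statefulset_volumes mysql data 3 2Gi mysql-statefulset.yaml
```

Setting KUBECTL=echo before calling the function prints the commands instead of running them, which is a cheap way to review the plan first.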

When Not To Use a StatefulSet

You should only use a StatefulSet when individual replicas have their own state. A StatefulSet isn’t necessary when all the replicas share the same state, even if it’s persistent.

In these situations you can use a regular ReplicaSet or Deployment to launch your Pods. Any mounted volumes will be shared across all of the Pods, which is the expected behavior for stateless systems.

A StatefulSet doesn’t add value unless you need individual persistent storage or sticky replica identifiers. Using a StatefulSet incorrectly can cause confusion by suggesting Pods are stateful when they’re actually running a stateless workload.

Summary

StatefulSets provide persistent identities for replicated Kubernetes Pods. Each Pod is named with an ordinal index that’s allocated sequentially. When the Pod gets rescheduled, its replacement inherits its identity. The StatefulSet also ensures that Pods get terminated in the reverse order they were created in.

StatefulSets allow Kubernetes to accommodate applications that require graceful rolling deployments, stable network identifiers, and reliable access to persistent storage. They’re suitable for any situation where the replicas in a set of Pods have their own state that needs to be preserved.

A StatefulSet doesn’t need to be used if your replicas are stateless, even if they’re storing some persistent data. Deployments and ReplicaSets are more suitable when individual replicas don’t need to be identified or scaled in a consistent order.
