Miscellaneous notes from reading Kubernetes in Action
api group and api version
The core API group needn't be specified in the apiVersion field.
For example, ReplicationController is in the core API group, so only:
apiVersion: v1
kind: ReplicationController
...
ReplicaSet was added later in the apps group, v1beta2 version (k8s v1.8):
apiVersion: apps/v1beta2
kind: ReplicaSet
https://kubernetes.io/docs/concepts/overview/kubernetes-api/
ReplicationController VS ReplicaSet
ReplicationController is replaced by ReplicaSet, which has more expressive pod selectors.
A ReplicationController's label selector only allows matching pods that include a certain label; a ReplicaSet's selector can require multiple labels at the same time.
rs also supports operators on key/value: In, NotIn, Exists, DoesNotExist.
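A minimal sketch of a ReplicaSet using matchExpressions (the kubia name, label, and image are just example values):
apiVersion: apps/v1beta2
kind: ReplicaSet
metadata:
  name: kubia
spec:
  replicas: 3
  selector:
    matchExpressions:          # more expressive than rc's equality-only selector
    - key: app
      operator: In             # also: NotIn, Exists, DoesNotExist
      values:
      - kubia
  template:
    metadata:
      labels:
        app: kubia
    spec:
      containers:
      - name: kubia
        image: luksa/kubia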
If migrating from an rc to an rs, you can delete the rc with the --cascade=false option: it deletes only the rc and leaves the pods running; then create an rs with the same selector to bring those pods back under management.
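A possible migration sequence, assuming an rc named kubia and a ReplicaSet manifest kubia-replicaset.yaml with the same label selector:
$ kubectl delete rc kubia --cascade=false   # delete only the rc, leave its pods running
$ kubectl create -f kubia-replicaset.yaml   # the rs with the same selector adopts those pods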
DaemonSet
DaemonSet ensures exactly one copy of a pod runs on each node, useful for processes like monitoring agents and log collectors. Use a node selector to make the ds run only on specific nodes.
If a node is made unschedulable, normal pods won't be scheduled onto it, but ds pods will still be deployed to it, since a ds bypasses the scheduler.
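A minimal DaemonSet sketch with a nodeSelector (the ssd-monitor name, disk=ssd label, and image are made-up examples):
apiVersion: apps/v1beta2
kind: DaemonSet
metadata:
  name: ssd-monitor
spec:
  selector:
    matchLabels:
      app: ssd-monitor
  template:
    metadata:
      labels:
        app: ssd-monitor
    spec:
      nodeSelector:            # only run on nodes labeled disk=ssd
        disk: ssd
      containers:
      - name: main
        image: luksa/ssd-monitor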
Job
Job is used to run a single completable task.
Use activeDeadlineSeconds to control the job timeout. backoffLimit defines how many times a job can be retried before it's marked as failed; it defaults to 6.
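A sketch of a Job using both fields (the name, image, and values are placeholders):
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  activeDeadlineSeconds: 600   # fail the job if it runs longer than 10 minutes
  backoffLimit: 3              # give up after 3 retries (default is 6)
  template:
    spec:
      restartPolicy: OnFailure # Jobs can't use the default Always policy
      containers:
      - name: main
        image: luksa/batch-job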
ConfigMap and Secret
If a configmap is injected as env variables into a container, those values can't be modified after the pod is started.
If you need to refer to a value from a configmap in a container's command and args fields, refer to the configmap in env first, then reference that env variable in those fields.
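A sketch of the pattern, assuming a ConfigMap named my-config with a key sleep-interval (both names are assumptions):
apiVersion: v1
kind: Pod
metadata:
  name: configmap-args-demo
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep"]
    args: ["$(INTERVAL)"]       # the env var is referenced via $(VAR) syntax
    env:
    - name: INTERVAL
      valueFrom:
        configMapKeyRef:
          name: my-config       # assumed ConfigMap name
          key: sleep-interval   # assumed key, e.g. "3600"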
Use configmap and expose it through a volume. If configmap is updated, the file mounted in volume will be updated atomically (through symbolic link), but the update interval is long, up to 1 minute.
If you only mount a single file instead of the whole volume, the file will not be updated! One workaround is to mount the whole volume into a different directory and then create a symbolic link pointing to the file.
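For reference, a single-file mount done via subPath looks like this (pod name, paths, and ConfigMap name are made up); files mounted this way don't get the automatic updates:
apiVersion: v1
kind: Pod
metadata:
  name: single-file-demo
spec:
  containers:
  - name: main
    image: nginx:alpine
    volumeMounts:
    - name: config
      mountPath: /etc/nginx/conf.d/app.conf   # mounts one file, keeps the rest of the dir
      subPath: app.conf                       # single-file mounts aren't auto-updated
  volumes:
  - name: config
    configMap:
      name: my-config                         # assumed ConfigMap with a key app.conf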
The contents of a Secret's entries are shown as Base64-encoded strings, whereas those of a ConfigMap are shown in clear text. Secret entries can contain binary values; the total size is limited to 1MB.
The stringData field can be used for non-binary Secret data. It's write-only: if you do kubectl get -o yaml, the stringData field will not be shown; its entries are base64-encoded and displayed under the data field.
A secret volume is mounted into the container as tmpfs.
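A sketch of a Secret combining both fields (names and values are made up):
apiVersion: v1
kind: Secret
metadata:
  name: demo-secret
stringData:
  foo: plain text              # write-only; read back base64-encoded under data
data:
  bar: YmluYXJ5IGRhdGE=        # entries under data must already be base64-encoded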
metadata
Metadata info can be exposed via env variables or a downwardAPI volume.
labels and annotations can't be exposed via env variables, since they can be modified while the pod is running; they can only be exposed through a downwardAPI volume.
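A minimal downwardAPI volume sketch exposing labels and annotations as files (pod name and label are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: downward-demo
  labels:
    app: demo
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "9999999"]
    volumeMounts:
    - name: downward
      mountPath: /etc/downward
  volumes:
  - name: downward
    downwardAPI:
      items:
      - path: labels                 # becomes /etc/downward/labels
        fieldRef:
          fieldPath: metadata.labels
      - path: annotations            # becomes /etc/downward/annotations
        fieldRef:
          fieldPath: metadata.annotations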
rolling update with Deployment
With the old ReplicationController, there is a kubectl rolling-update command to do a rolling update of pods, but it has problems:
- The client needs to maintain a connection to the kube api server; if it's disconnected during the update, pods are left in a mid-way state.
- It actually starts a new ReplicationController with the new image and rolls down the old one, which violates the Kubernetes design: declare the desired state, don't tell it to add or remove something.
- It modifies the containers' labels and the rc's label selector.
Communication between all components in k8s
etcd is the only persistent store; the Kubernetes api server is the only component that accesses etcd directly, and other components only talk to the api server.
All data in k8s is stored in etcd under /registry:
$ etcdctl ls /registry
/registry/configmaps
/registry/daemonsets
/registry/deployments
/registry/events
/registry/namespaces
/registry/pods
...
The k8s api server just manages the data stored in etcd; other clients (kubelet, kube-proxy) connect to the api server through watches, and when data is modified (e.g. pod creation, service creation) the api server notifies those clients, which then do the real work.
pause container
Every pod has a pause container; it's an infrastructure container that holds all the namespaces, and all user-defined containers of the pod use its namespaces.
If the pause container is killed, kubelet recreates it and all of the pod's containers.
scheduler policy
Scheduler doesn’t look at how much of each individual resource is being used at the exact time of scheduling but at the sum of resources requested by the existing pods deployed on the node.
LeastRequestedPriority prefers nodes with fewer requested resources.
MostRequestedPriority prefers nodes that have the most requested resources (to save cost on cloud infra).
resource management
The sum of resource requests of all pods on a node can't be larger than the node's allocatable resources, but resource limits can be overcommitted.
If 100% of the node's resources get used up, some containers will be killed.
If a container exceeds its cpu limit, it won't be killed; it just won't get more cpu time than configured.
If a container tries to allocate memory beyond its limit, the process is killed. If the pod's restart policy is Always or OnFailure, it is restarted immediately. If it keeps being killed, the restart delay is increased (10s, 20s, 40s, 80s, 160s, 300s) and the pod status becomes CrashLoopBackOff.
A container always sees the node's memory and cpus; it isn't aware of its own limits.
Problems:
- If you run a java program, the jvm sets the maximum heap size based on the host's total memory instead of the memory available to the container, which means it may get OOM-killed.
- If a program looks up the number of cpus to decide how many workers to spawn, it may spawn too many.
To get the real cpu limit, use the downward API to pass the cpu limit in, or look at /sys/fs/cgroup/cpu/cpu.cfs_quota_us and /sys/fs/cgroup/cpu/cpu.cfs_period_us.
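A sketch of passing the cpu limit in through the downward API (pod name, image, and values are made up):
apiVersion: v1
kind: Pod
metadata:
  name: limit-aware
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "echo cpu limit: ${CPU_LIMIT_MILLICORES}m; sleep 9999999"]
    resources:
      limits:
        cpu: 500m
        memory: 64Mi
    env:
    - name: CPU_LIMIT_MILLICORES
      valueFrom:
        resourceFieldRef:
          resource: limits.cpu
          divisor: 1m            # expose the limit in millicores (500 here)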
Pod's QoS classes:
- BestEffort (lowest priority): no container in the pod has requests or limits set.
- Burstable: limits don't match requests.
- Guaranteed (highest): requests and limits are both set for all containers in the pod and are equal. If requests is not set, it defaults to limits.
The pod with the lower-priority QoS class is killed first if resources run out.
If two pods are in the same QoS class, the one using a higher percentage of its requested memory is killed first.
LimitRange can be used to validate pod specs: if a pod requests more resources than the LimitRange allows, the api server rejects the request (it works on individual pods/containers). Without a LimitRange, the api server accepts such a pod but may never schedule it. LimitRange can therefore be used to prevent users from creating pods bigger than any node in the cluster.
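A sketch of a LimitRange with per-container defaults and a max (all values are arbitrary examples):
apiVersion: v1
kind: LimitRange
metadata:
  name: example-limits
spec:
  limits:
  - type: Container
    defaultRequest:        # applied to containers that don't set requests
      cpu: 100m
      memory: 10Mi
    default:               # applied to containers that don't set limits
      cpu: 200m
      memory: 100Mi
    max:                   # pods asking for more than this are rejected
      cpu: "1"
      memory: 1Gi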
ResourceQuota can be used to limit the total amount of resources available in a namespace, but it has no effect on existing pods.
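A sketch of a ResourceQuota capping a namespace's totals (values are arbitrary):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-and-mem
spec:
  hard:
    requests.cpu: "2"        # sum of all pods' cpu requests in the namespace
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "10"               # also limits the number of pods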
kubelet contains an agent called cAdvisor, which collects usage statistics for the node and its containers; we can run Heapster centrally to gather that data.
Note:
cAdvisor and Heapster only hold usage data for a short time; if historical monitoring data is needed, we need the usual monitoring tools like InfluxDB, Datadog, or New Relic.
After heapster is running:
kubectl top node # get node runtime metrics
kubectl top pod --all-namespaces # get pod runtime metrics from all namespaces
kubectl top gets its metrics from Heapster, so they may be delayed by several minutes; the following error is okay:
W0312 22:12:58.021885 15126 top_pod.go:186] Metrics not available for pod default/kubia-3773182134-63bmb, age: 1h24m19.021873823s
With the --containers option, you can see metrics for every container instead of per pod.
HorizontalPodAutoscaler (HPA)
Managed by the horizontal pod autoscaler controller. It checks pod metrics, calculates the number of replicas required to meet the configured target metric value, then adjusts the replicas field on the target resource (Deployment, ReplicaSet, ReplicationController, StatefulSet).
The HPA queries Heapster through REST apis to get metrics.
Pods -> cAdvisor -> Heapster -> HorizontalPodAutoscaler
To test autoscaler, we can manually create load:
kubectl run -it --rm --restart=Never loadgenerator --image=busybox -- sh -c "while true; do wget -O - -q http://kubia.default; done"
Initially, HPA could only scale based on cpu usage; memory-based scaling was introduced in 1.8. It also works on custom metrics.
When configuring an HPA, the metric type can be Pods, Resource, or Object.
If a metric is defined as Object, the autoscaler obtains a single metric from a single object.
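A sketch of an HPA with a Resource metric, assuming the autoscaling/v2beta1 API of the k8s 1.8 era and a Deployment named kubia (both assumptions):
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: kubia
spec:
  scaleTargetRef:                      # the resource whose replicas field gets adjusted
    apiVersion: apps/v1beta2
    kind: Deployment
    name: kubia
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource                     # could also be Pods or Object
    resource:
      name: cpu
      targetAverageUtilization: 30     # percent of the pods' cpu requests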
Cluster Autoscaler can be used to auto scale nodes on cloud infrastructure: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
Situations when nodes won’t be relinquished:
- If system pods are running on the node (except those run by a DaemonSet).
- If an unmanaged pod is running on the node.
- A pod with local storage is running on the node.
A node will only be returned to the cloud provider if the Cluster Autoscaler knows that the pods running on it will be rescheduled to other nodes.
If a node is selected to be shut down, it is first marked as unschedulable, then all pods on it are evicted and rescheduled to other nodes.
kubectl cordon <node> marks the node as unschedulable but does nothing with the pods running on it.
kubectl drain <node> marks the node as unschedulable and then evicts all pods from it.
PodDisruptionBudget can be used to ensure a minimum number of pods keep running while pods are being evicted or rescheduled.
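A sketch of a PodDisruptionBudget, assuming pods labeled app=kubia and the policy/v1beta1 API of that era:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: kubia-pdb
spec:
  minAvailable: 3            # keep at least 3 matching pods running during evictions
  selector:
    matchLabels:
      app: kubia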