Miscellaneous notes when reading
<Kubernetes in Action>.
api group and api version
core api group need't specified in
ReplicationController is on core api group, so only:
apiVersion: v1 kind: ReplicationController ...
ReplicationSet is added later in
v1beta2 version (k8s v1.8):
apiVersion: apps/v1beta2 1 kind: ReplicaSet
ReplicationController VS ReplicationSet
ReplicationController is replaced by ReplicationSet, which has more expressive pod selectors.
ReplicationController's label selector only allows matching pods that include a certain label, ReplicationSet can meet multi labels at same time.
rs also support operator on key value:
In, NotIn, Exists, DoesNotExist
If migrate from rc to rs, can delete rc with
--cascade=false option, it will delete
rc only, but left pods running, then we can create a rs with same selector to make pods under management.
DaemonSet ensures pod run exact one copy on one node, useful for processes like monitor agent and log collector. Use
to make ds only run on specific nodes.
If node is made unschedulable, normal pods won't be scheduled to deploy on them, but ds will still be deployed to it, since ds will bypass scheduler.
Job is used to run a single completable task.
activeDeadlineSeconds to control job timeout.
backoffLimit define how many times a job can retry before it's defined as failed, default to 6.
ConfigMap and Secret
If configmap is injected as env variables for container, it can't be modified after prod started.
If need to refer to a value from configmap in container's
args fields, need to refer configmap in env, then refer to env value in those fields.
Use configmap and expose it through a volume. If configmap is updated, the file mounted in volume will be updated atomically (through symbolic link), but the update interval is long, up to 1 minute.
If only mount a sinlge file instead of whole volume, the file will not be updated! one workaround is to mount the whole volume into a different directory and then create a symbolic link pointing to the file.
The contents of a Secret's entries are shown as Base64-encoded strings, whereas those of a ConfigMap are shown in clear text. Secret entries can contain binary values, size is limited to 1MB.
stringData field can be used for non-binary Secret data, it's
write-only, if you do
kubectl get -o yaml,
stringData field will not be shown, they will be base64 encoded and display in
secret volume is mounted to container as tmpfs.
Metadata info can be exposed via env variables or
annotations can't be exposed via env variables, since they can be modified during pod run time.
rolling update with Deployment
ReplicationController, there is a
kubectl rolling-update cmd to do rolling update of pods, but it has problems:
- need client mainten connection to kube api server, if disconnected during updating, pods will be in a mid-way.
- it actually start a new ReplicationController with new image, and rolling shutdown old ones, it violates kubernetes design: declare desired state, not tell it add something or remove something.
- it will modify container's label and rc's label selector.
Communication between all components in k8s
etcd is the only persistent store, kubernetes api server is the only component access etcd directly, other components only talk to kubernets api server.
All data in k8s is stored in etcd under
$ etcdctl ls /registry /registry/configmaps /registry/daemonsets /registry/deployments /registry/events /registry/namespaces /registry/pods ...
K8s api server just manager data stored in etcd, other clients (kubelet, kubeproxy) connect to api server through watch, if data is modified, api server will notify those clients(eg: pod creation, service creation), then clients do the real things.
Every pod will have a pause container, it's an infrastructure container to hold all namespaces, all user defined containers of the pod use the namespaces of it.
If pause container is killed , kubelet will recreate it and all pod's containers.
Scheduler doesn't look at how much of each individual resource is being used at the exact time of scheduling but at the sum of resources requested by the existing pods deployed on the node.
LeastRequestedPriority prefers nodes with fewer requested resources.
MostRequestedPriority prefers nodes that have the most requested resources(to save cost on cloud infra).
Resource requests sum of all pods on a node can't be larger than node's allocatable resource, but resource limits can overcommitted.
If 100% of node's resources are used up, containers will be killed.
If container exceeds cpu limit, it won't be killed, and won't get more cpu time than configured.
If container tries to allocate memory over its limit, process is killed.If pod's restart policy is
OnFailure, it will be restarted immediately.If it keeps
being killed, restart time delay will be increased(10s, 20s, 40s, 80s, 160s, 300s), pod status will be
Container always see node's memory and cpu, won't be aware of its limit.
- If you run a java program, jvm will set maximum heap size based on host's total memory instead of memory available for the container, which means it maybe OOM killed.
- If program looks up cpu numbers to decide spawn how many workers, maybe over spawn too many workers.
To get real cpu limit, can use downward api to pass in cpu limit, or look
Pod's QoS classes:
- BestEffort (lowest priority). all containers in pod don't have
- Guaranteed (highest).
limitsbot set for all containers in pod and equal. If requests is not set, default to limit.
Pod with lower priority will be killed first if resource is not enough.
If two pods at same QoS class, the one with higher ram percentage usage will be killed first.
LimitRange can be used to valid pod spec, if pod requires more resources than available, api server will reject requst(it works on individual pod/container), without
LimitRange, api server will accept pod, but never schedule it.
LimitRange can be used to prevent users from creating pods bigger than any node in cluster.
ResourceRuote can be used to limit total amount of resources available in a namespace.But it has no effect on existing pods.
kubelet contains an agent called cAdvisor, it will collect statistics usage of node and containers, we can run heapster centrally to gather those data.
cAdvisor and heapster only hold usage data for a short time, if historical monitor data is needed, we need usual monitor tools like influxdb, datadog, NewRelic
After heapster is running:
kubectl top node # get node runtime metrics kubectl top pod --all--namespaces # get pods runtime metrics from all namespace
kubectl top get metrics from heapster, so the metrics may delay for several minutes, following error is okay:
W0312 22:12:58.021885 15126 top_pod.go:186] Metrics not available for pod default/kubia-3773182134-63bmb, age: 1h24m19.021873823s
--containers option, you can see metrics for every container instead of pods.
Managed by Horizonal controller. Check pod metrics, calculats the number of replicas required to meet target metric value configured, then adjusts replicas field on target resource(Deployment, ReplicaSet, ReplicationController, StatefulSet).
HPA query Heapster through REST apis to get metrics.
Pods -> cAdvisor -> Heapster -> HorizonalPodAutoscaler
To test autoscaler, we can manually create load:
kubectl run -it --rm --restart=Never loadgenerator --image=busybox -- sh -c "while true; do wget -O - -q http://kubia.default; done"
Initially, HPA can only scale based on cpu usage, in 1.8 memory-based was introduced.Also works on custom metrics.
When configure HPA, metrics type can be
If metrics is defined as
Object, autoscaler obtains a single metric from the single object.
Cluster Autoscaler can be used to auto scale nodes on cloud infrastructure: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
Situations when nodes won't be relinquished:
- If system pods are running(only, but except DaemonSet) on the node.
- If an unmanaged pod running on the node.
- A pod with local storage is running on the node.
A node will only be returned to cloud provider if Cluster Autoscaler knows the pods running on the node will be rescheduled to other nodes.
If a node is selected to shut down, it will be marked as unscheduled first, then all pods on it will be rescheduled to other nodes.
kubectl cordon <node> marks node as unschedulable, but do nothing with pods running on the node.
kubectl drain <node> markds node as unschedulable then evits all pods from the node.
PodDisruptionBudget can be used to ensure minium number of pods when rescheduling pods.