Shining Moon

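Managing Load

I have been reading Google's The Site Reliability Workbook recently. One chapter, "Managing Load", goes into a fair amount of detail, so here are some notes that combine it with my experience on AWS.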

Load Balancing

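Load balancing is where traffic enters the system. The simplest approach is round robin in DNS, but that relies heavily on the client: some clients do not fully honor the DNS TTL, and local ISPs cache records as well.

Google uses anycast inside its own network: multiple endpoints for a domain share a single VIP (virtual IP) that is announced over BGP, and BGP routing delivers each user's packets to the nearest frontend server, which reduces latency.

Relying on BGP alone causes two problems:

Too many users in one region put excessive load on the nearest frontend server.
ISPs recompute BGP routes. When BGP routing changes, in-flight TCP connections are reset, because later packets of the same connection are delivered to a different server that has no TCP session for them.

To fix the problems of plain BGP anycast, Google built Maglev. Even when routes change (a routing flap), connections are not dropped. Google calls this approach stabilized anycast.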
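
To make the idea of connections surviving a route change more concrete, here is a minimal sketch. It is not Maglev's actual algorithm: it assumes that every load balancer node hashes the connection 5-tuple with the same function over the same backend list, and it uses rendezvous hashing as a simple stand-in for the consistent hashing Maglev relies on. The names pick_backend and BACKENDS and all addresses are hypothetical.

```python
import hashlib

# Hypothetical backend pool, known to every load balancer node.
BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]


def pick_backend(five_tuple, backends):
    """Pick a backend with rendezvous (highest-random-weight) hashing.

    The choice depends only on the 5-tuple and the set of backends,
    not on which load balancer node runs the function.
    """
    def weight(backend):
        key = repr((five_tuple, backend)).encode()
        return hashlib.sha256(key).digest()

    return max(backends, key=weight)


# (protocol, source ip, source port, destination ip, destination port)
conn = ("tcp", "203.0.113.7", 51234, "198.51.100.10", 443)

# Two different load balancer nodes see packets of the same connection,
# for example before and after a routing flap.
choice_on_node_a = pick_backend(conn, BACKENDS)
choice_on_node_b = pick_backend(conn, list(reversed(BACKENDS)))
```

Because both nodes arrive at the same backend, the packet still reaches the server that holds the TCP session, whichever node BGP happens to route it to.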

......

2019.02.12 gce aws book

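Starting from a Patch Last Year

Last year I did some performance optimization on a production service and switched the HTTP client from requests to geventhttpclient. After the release, the overall load on the servers that issue RPC calls dropped a lot, but client-side latency went up a lot. After debugging, I concluded that the cause was the extra cost of geventhttpclient sending the header and the body in two separate socket sends. Once I changed it to a single send, latency went back to normal: https://github.com/gwik/geventhttpclient/pull/85

While debugging the gunicorn code recently, I noticed that it sets TCP_NODELAY when it creates a socket. I had seen this TCP option in many projects but never looked into it. According to man tcp, it disables Nagle's algorithm. Nagle's algorithm is enabled by default in the Linux TCP stack: when the data to send is smaller than the MSS and earlier data is still unacknowledged, it is buffered in memory and sent once enough has accumulated. The goal is better bandwidth utilization, since even a 1-byte payload has to carry 40 bytes of IP and TCP headers.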
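
As a rough illustration of the two points above, the sketch below shows how a TCP socket can be created with Nagle's algorithm disabled, how header and body can be joined so they go out in one send call, and how small the payload share is for tiny packets. The helper names are hypothetical, and this is not the code from the patch or from gunicorn.

```python
import socket

# 20-byte IPv4 header + 20-byte TCP header, assuming no options.
IP_TCP_HEADER_BYTES = 40


def payload_efficiency(payload_bytes):
    """Fraction of the bytes on the wire that are actual payload."""
    return payload_bytes / (payload_bytes + IP_TCP_HEADER_BYTES)


def build_request(header, body):
    """Join header and body so they can be written with a single send call."""
    return header + body


def new_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

With a connected socket, sock.sendall(build_request(header, body)) hands the whole request to the kernel at once, so the body never waits behind an unacknowledged header segment.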

......

2018.12.29 http tcp code-infra

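Pod Scheduling in Kubernetes

The simplest way to schedule a pod is to add a node selector to its definition, so that the pod lands on nodes that carry a specific label. There are more advanced mechanisms as well: node taints/tolerations, node affinity, and pod affinity. They let you run certain kinds of pods together and keep other kinds apart.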

Taints and Tolerations

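If a node has taints, only pods that tolerate those taints can be scheduled onto it.

The basic format of a taint is: <key>=<value>:<effect>, where the value is optional.

kubectl describe node xxx shows a node's taints. A master node, for example, has:

Taints:       node-role.kubernetes.io/master:NoSchedule

Here the key is node-role.kubernetes.io/master, there is no equals sign and no value, and the effect is NoSchedule. A toleration matches such a taint with the operator Exists.

This taint on the master node means that only pods that tolerate it can be scheduled there, and those are usually system pods.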
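
The matching rule can be sketched in a few lines. This is a simplified model of how a toleration is compared with a taint, not the scheduler's real code: it only looks at the NoSchedule effect, and the dictionaries simply mirror the field names used in pod and node specs.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    # A toleration without an effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key combined with Exists matches every taint key.
        return key == "" or key == taint["key"]
    # Equal: both key and value have to match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_tolerations, node_taints):
    """A pod fits when every NoSchedule taint is matched by some toleration."""
    return all(
        any(tolerates(tol, taint) for tol in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


master_taint = {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}
system_pod = [
    {"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}
]
ordinary_pod = []
```

Here can_schedule(system_pod, [master_taint]) is true, while an ordinary pod without tolerations is kept off the master node.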

Take kube-proxy as an example: kubectl describe pod kube-proxy-efiv -n kube-system

......

2018.12.16 k8s

Debug Skills on Linux

This post shows several commands for debugging on a Linux server. All examples were tested on Ubuntu 18.04. Some tools are not installed by default; you can install them with sudo apt install xxx. Some commands must be run with sudo.

System resources fall into three main categories: compute, storage, and network. Performance issues are usually caused by the exhaustion of one of these resources.

Universal metric: System load

There are several ways to get the system load: w, uptime, top, and cat /proc/loadavg.

uptime example:

03:34:23 up 20:31,  1 user,  load average: 1.02, 0.65, 0.45

The three values after load average, at the right end of the line, are the system load.

They are the average system load over the last 1, 5, and 15 minutes.

If load / (number of CPU cores) > 1, tasks are waiting in the CPU queue, and the system usually feels slow.
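
The same check can be scripted. The sketch below normalizes the load average by the core count; the function names are made up for illustration, and the parser assumes the usual /proc/loadavg layout where the first three fields are the 1, 5, and 15 minute averages.

```python
import os


def load_per_core(load_average, cpu_cores):
    """Normalize a load average by the number of CPU cores."""
    return load_average / cpu_cores


def parse_loadavg(text):
    """Parse the 1, 5 and 15 minute averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


if __name__ == "__main__":
    one_minute, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    ratio = load_per_core(one_minute, cores)
    note = " (tasks are queuing)" if ratio > 1 else ""
    print(f"load per core: {ratio:.2f}{note}")
```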

System load is different from CPU utilization. CPU utilization is a metric that shows how busy the CPU is handling tasks.

A high system load means tasks are waiting in the queue, which may be the result of:

High CPU usage.
Poor disk performance (disk I/O).
Exhaustion of RAM.
…

free

free -k/-m/-g shows memory usage in KB/MB/GB.

free -h prints human-readable output (the unit, KB/MB/GB…, is chosen automatically).

......

2018.12.03 linux

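Random Thoughts

There are two clocks in my room, one 5 minutes fast and one 5 minutes slow. I have been too lazy to set them right, and it gives me an odd feeling of time being torn apart, which I seem to rather like. 2018 will be over in a month, so let me think about what I have been up to lately.

Deduplicating a Recommendation List with a Bloom Filter

One feature of our product recommends a batch of articles to each user every day, and the articles pushed to a user must not repeat from day to day. The original implementation was straightforward: on every push, record the (user id, topic id) pair; after getting a new topic list, fetch the list of articles already pushed to that user and deduplicate the two sets.

The problems with this implementation are obvious. It takes too much storage (M * N): user id (int64) + topic id (int64) = 16 bytes, so with 1 million users and 10 articles pushed per user per day, one year takes 16 * 10 * 365 * 1M ≈ 58.4 GB (about 54.4 GiB). Queries are inefficient too: either fetch all the topic ids a user has already read in one go, or send every topic id to be pushed to the database for deduplication.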
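
For comparison, here is a minimal Bloom filter sketch. It is only an illustration under assumptions, not the implementation used in the product: it assumes one filter per user, and the bit size, the number of hashes, and the helper filter_unseen are all hypothetical choices.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, size_bits, num_hashes):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def filter_unseen(seen, candidate_topic_ids):
    """Keep the topics the filter has definitely not seen, then record them."""
    fresh = [t for t in candidate_topic_ids if t not in seen]
    for t in fresh:
        seen.add(t)
    return fresh


# Storage of the naive approach: 16 bytes per (user id, topic id) pair.
naive_bytes = 16 * 10 * 365 * 1_000_000
```

A false positive only means that an unseen article is occasionally skipped, which is usually acceptable for recommendations, while an article that was already pushed is never pushed again.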

......

2018.11.17 algorithm bloomfilter code-infra

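The Wind and Sun of the Seto Inland Sea

Last month I took some time off and wandered around the Seto Inland Sea, a place far more beautiful than I had imagined.

My impression of the Seto Inland Sea goes back to a comedy anime I watched in high school, Seto no Hanayome (濑户花嫁). It was a fun show, and between the laughs it left me with a vague picture of the place.

Last year I went to Aomori and Kyushu, so I wanted to visit another out-of-the-way corner of Japan, and Shikoku made it onto the list. After a little research I found that the Seto Inland Sea is right there, and I got interested.

There is an art festival on the Seto Inland Sea next March, and by then the place will probably be packed with people from all over the world. I did not want to join the crowd, so I went in October.

Rough itinerary: Shanghai -> Takamatsu -> Shodoshima -> Teshima -> Naoshima. Apart from one night on Shodoshima, I stayed in Takamatsu the whole time.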

......

2018.11.04

AWS Aurora DB

I have recently been migrating some of our RDS MySQL databases to Aurora, so I read the Aurora paper and compared it with the RDS architecture along the way.

Paper notes

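Storage and compute are separated.
The redo log is pushed down to the storage layer.
Replication: 6 replicas across 3 AZs (2 per AZ). Losing one AZ plus 1 additional node does not lose data (the data stays readable but not writable). Losing one AZ (or any 2 nodes) does not affect writes.
Each segment is 10GB. The 6 replicas of a segment form one PG (protection group), with two replicas per AZ.
On a 10Gbps network, repairing a 10GB segment takes 10s.

In MySQL, one application-level write produces many additional writes underneath, which leads to write amplification:

The redo log is used for crash recovery, and the binlog is uploaded to S3 for point-in-time restore.

In Aurora, only the redo log is replicated over the network to the replicas. The master waits until 4 of the 6 replicas have written the redo log before it treats the write as successful (so once 3 replicas are lost, no data can be written). The other replicas rebuild the data from the redo log, using a separate redo log applicator process.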
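
The failure scenarios in the notes follow directly from the quorum sizes. Here is a small sketch of that arithmetic; the write quorum of 4 comes from the notes above, the read quorum of 3 is the value given in the Aurora paper, and the function names are only illustrative.

```python
REPLICAS_PER_AZ = 2
AZ_COUNT = 3
TOTAL_REPLICAS = REPLICAS_PER_AZ * AZ_COUNT  # 6 copies of every segment
WRITE_QUORUM = 4  # a write succeeds once 4 of 6 replicas persist the redo log
READ_QUORUM = 3   # read quorum from the Aurora paper


def can_write(healthy_replicas):
    """Writes need a write quorum of healthy replicas."""
    return healthy_replicas >= WRITE_QUORUM


def can_read(healthy_replicas):
    """Reads need a read quorum of healthy replicas."""
    return healthy_replicas >= READ_QUORUM


def segment_transfer_seconds(segment_gigabytes, link_gigabits_per_second):
    """Raw transfer time for one segment, ignoring protocol overhead."""
    return segment_gigabytes * 8 / link_gigabits_per_second
```

Losing one AZ leaves 4 healthy replicas, so writes continue. Losing one AZ plus one more node leaves 3, which is enough to read but not to write. Moving 10GB over a 10Gbps link takes about 8 seconds of raw transfer, consistent with the roughly 10s repair time quoted above.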

......

2018.10.31 aws mysql database server-infra

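Defining SLOs for a Service

When we use cloud services, the provider usually offers an SLA (Service Level Agreement) as the standard for the quality of the service it provides (the often-quoted number of nines), and pays compensation when it is not met. AWS compute services are one example: https://aws.amazon.com/compute/sla/ .

For services our company hosts itself, we also need internal technical metrics to track the quality of service we deliver to customers. This is called an SLO (Service Level Objective). You can also think of it as an internal SLA without a compensation agreement.

Defining the Metrics

I mainly track two metrics:

Availability
Quality (quality of service)

Availability used to be defined by simple service uptime: a service check outside the cluster periodically pinged our service's check endpoint, and a failed ping counted as a failure.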
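
Under that uptime-based definition, availability is just the share of successful checks in a window. A minimal sketch, where the one-check-per-minute interval and the sample data are made up for illustration:

```python
def availability(check_results):
    """Share of successful checks; each entry is True when the ping succeeded."""
    if not check_results:
        raise ValueError("no check results in this window")
    return sum(check_results) / len(check_results)


# Hypothetical window: one check per minute for a day, with 3 failed checks.
results = [True] * (24 * 60 - 3) + [False] * 3
daily_availability = availability(results)
```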

......

2018.10.15 server-infra

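Calculating p95 Latency in Redshift

Definition of p95 latency: sort the latencies from a period of time in ascending order and cut off the highest 5%; the largest remaining value is the p95 latency. p99 and p90 work the same way.

The p95 latency means that 95% of the requests in that period were at least as fast as this value.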
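
The definition translates almost literally into code. The sketch below uses the nearest-rank method on a made-up sample; the function name and the data are only for illustration.

```python
def percentile_nearest_rank(latencies, percent):
    """Sort ascending, drop the slowest (100 - percent)%, return the largest left."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    ordered = sorted(latencies)
    # Number of samples to keep, rounded up; integer math avoids float error.
    keep = (percent * len(ordered) + 99) // 100
    return ordered[max(keep, 1) - 1]


# Hypothetical sample: 100 requests that took 1 ms to 100 ms.
sample_ms = list(range(1, 101))
p95 = percentile_nearest_rank(sample_ms, 95)
```

For this sample the five slowest requests (96 ms to 100 ms) are dropped, so p95 is 95 ms.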

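Usually I just look at the p95 values that CloudWatch and Datadog have already computed. This time, let's see how to calculate p95 latency directly from an access log.

Suppose Redshift has a table that stores the application's access log, with the following structure: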

CREATE TABLE access_log (
    url         varchar(2048),  -- request path
    time        varchar(32),    -- ISO 8601 timestamp, stored as text
    resp_time   real            -- response time
);

url	time	resp_time
/test1	2018-10-11T00:10:00.418480Z	0.123
/test2	2018-10-11T00:12:00.512340Z	0.321

Calculating p95 is simple: group the log by minute and use percentile_cont, ordered by resp_time within each group:

select date_trunc('minute', time::timestamp) as ts,
      percentile_cont(0.95) within group(order by resp_time) as p95
from access_log 
group by 1
order by 1;

Result:

     ts          |        p95
---------------------+-------------------
 2018-10-11 00:00:00 |  0.71904999999995
 2018-10-11 00:01:00 | 0.555550000000034
 2018-10-11 00:02:00 | 0.478999999999939
 2018-10-11 00:03:00 | 0.507250000000081
 2018-10-11 00:04:00 | 0.456000000000025
 2018-10-11 00:05:00 | 0.458999999999949
 2018-10-11 00:06:00 | 0.581000000000054
 2018-10-11 00:07:00 | 0.585099999999937
 2018-10-11 00:08:00 | 0.527999999999908
 2018-10-11 00:09:00 | 0.570999999999936
 2018-10-11 00:10:00 | 0.587950000000069
 2018-10-11 00:11:00 | 0.648900000000077
 2018-10-11 00:12:00 | 0.570000000000024
 2018-10-11 00:13:00 | 0.592649999999954
 2018-10-11 00:14:00 | 0.584149999999998
 2018-10-11 00:15:00 |  3.00854999999952
 2018-10-11 00:16:00 | 0.832999999999871
 2018-10-11 00:17:00 |  1.07154999999991
 2018-10-11 00:18:00 | 0.553600000000092
 2018-10-11 00:19:00 | 0.605799999999997
 2018-10-11 00:20:00 | 0.832000000000137
 ...

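PERCENTILE_CONT is an inverse distribution function. Given a percentile, it computes the value at that percentile on a continuous distribution model. If there is no data point exactly at that position, it interpolates between the nearest values before and after it.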
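
The interpolation step is easiest to see in code. This is a sketch of the usual linear-interpolation rule that PERCENTILE_CONT is defined by, written in Python for illustration rather than taken from Redshift; the function name simply mirrors the SQL function.

```python
import math


def percentile_cont(values, fraction):
    """Linear-interpolation percentile over a list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Zero-based fractional position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return ordered[lower]
    # Weight the two neighbours by how close the position is to each.
    return ordered[lower] * (upper - position) + ordered[upper] * (position - lower)
```

This is why the query output contains values such as 0.71904999999995 that need not appear in the log at all: they lie between two neighbouring response times.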

......

2018.10.12 aws data-infra redshift
