A hundred kinds of faults all arise from laziness

Use SNS & SQS to build Pub/Sub System

Recently, we built a pub/sub system based on AWS's SNS & SQS services; here are some notes.

Originally, we had a pub/sub system based on redis (using BLPOP to listen on a redis list). It's really simple and mainly used for cross-app operations. Now we need to enhance it to support more complex pub/sub logic, e.g. topic-based distribution. It doesn't support redelivery either: if a subscriber fails to process a message, the message is dropped.

There are three obvious choices in my mind:

  1. kafka
  2. AMQP based system (rabbitmq, activemq …)
  3. SNS + SQS

My requirements for this system are:

  • Support message persistence.
  • Support topic based message distribution.
  • Easy to manage.

The data volume won’t be very large, so performance and throughput won’t be critical concerns.

I chose SNS + SQS; the main concerns are on the operations side:

  • kafka needs zookeeper to run as a cluster.
  • rabbitmq needs extra configuration for HA, and the AMQP model is relatively complex to program against.

So my decision is:

  • Applications publish messages to an SNS topic
  • Set up multiple SQS queues that subscribe to the SNS topic
  • Let different application processes consume from different queues to run their own logic (see the sketch after this list).
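
A minimal sketch of this wiring with boto3 (the topic/queue ARNs and the payload are hypothetical; creating the topic, the queues and the SQS access policy is left to terraform):

import boto3

sns = boto3.client('sns')

# Publisher side: the application publishes every event to a single SNS topic.
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:app-events',  # hypothetical ARN
    Message='{"event": "user_signup", "user_id": 42}',
)

# Fan-out: each SQS queue subscribes to the topic once (normally created
# by terraform; shown here only to illustrate the wiring).
sns.subscribe(
    TopicArn='arn:aws:sns:us-east-1:123456789012:app-events',
    Protocol='sqs',
    Endpoint='arn:aws:sqs:us-east-1:123456789012:email-worker',  # hypothetical queue ARN
)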

SQS and SNS are very simple, so there's not much to say, just some notes:

  • SQS queues come in two types: FIFO queues and standard queues. A FIFO queue guarantees message order and exactly-once delivery, but throughput is limited (3000/s with batching); a standard queue gives at-least-once delivery with no ordering guarantee, and throughput is practically unlimited. In my case I use standard queues, since order is not very important.
  • The SQS message size limit is 256KB.
  • Use goaws for local development. It has problems processing message attributes, but I only use the message body. Messages are stored only in RAM and are cleared after a restart.
  • If messages fail to be delivered from sns to sqs, you can set the topic's sqs failure feedback role to log to cloudwatch; in most cases failures are caused by iam permissions.
  • Messages in sqs can be retained for at most 14 days.
  • Once a message is received by a client, it becomes invisible to other clients for visibility_timeout_seconds (default 30s). This means that if your client fails to process (and delete) the message, it will be redelivered after 30s.
  • SQS clients can use long polling to receive messages; set receive_wait_time_seconds to reduce API calls and therefore cost.
  • If your client keeps failing to process a message due to a bug, the message will be redelivered in a loop. Set a redrive_policy on the queue to limit the retry count, and a dead letter queue to store those messages, so you can decide how to handle them later. (A consumer sketch follows this list.)
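
Putting the last few notes together, a minimal consumer sketch (queue URL and handler are hypothetical): it long-polls, and only deletes a message after successful processing, so a failed message becomes visible again after the visibility timeout and lands in the dead letter queue once the redrive_policy retry limit is hit.

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/email-worker'  # hypothetical

def handle(body):
    # Hypothetical business logic; raise on failure.
    # Note: body is the SNS JSON envelope unless raw message delivery
    # is enabled on the subscription.
    print(body)

while True:
    # Long polling: wait up to 20s per call to cut down API calls (and cost).
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for msg in resp.get('Messages', []):
        try:
            handle(msg['Body'])
        except Exception:
            # Don't delete: the message becomes visible again after the
            # visibility timeout and moves to the DLQ after maxReceiveCount.
            continue
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])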

I set up SNS and SQS via terraform, using the following resources:

......

Migrate to SQLAlchemy

Recently I rewrote our company's db layer wrapper on top of sqlalchemy; here are some notes.

The original db layer code is very old (over 10 years…), the people who first wrote it are long gone, and it has many problems:

  • No unit tests at all.
  • The exposed interfaces have confusing names, mostly kept for compatibility with historical issues.
  • It carries a set of client-side db sharding logic that is completely unused in new projects, yet it makes joins and subqueries impossible, which is very inconvenient.
  • The old db code has no model layer; it is tied to db migration in a very hacky way, so during development you cannot tell the table schema from the code at all and have to look at the database directly.

Things to consider for the rewrite:

......

AWS's K8S CNI Plugin

EKS hasn't launched yet, but AWS has already open-sourced its own CNI plugin. I took a quick look; here is how its implementation differs from other K8S networking solutions.

A K8S cluster has a few basic requirements for the network:

  • Containers must be able to reach each other without NAT
  • All nodes must be able to communicate with all containers without NAT
  • The IP a container sees for itself must be the same IP other containers see for it.

Flannel in VPC

flannel is a CNI plugin for K8S. To use flannel inside a VPC, there are a few options:

  1. Encapsulate packets with VXLAN/UDP; encapsulation hurts network performance and is hard to debug
  2. Use the aws vpc backend, which adds each host's docker subnet to the vpc routing table. By default a routing table can hold only 50 rules, so you are limited to 50 nodes; you can file a ticket to raise the limit, but too many rules will degrade vpc performance.
  3. host-gw, which maintains routes to all cluster nodes directly on every node. I haven't tested it, and it feels hard to debug when something goes wrong. If the node cluster is managed by an autoscaling group, can K8S update the routes on all nodes when scaling in/out?

All of the approaches above can only use the single network interface on the EC2 instance, and security groups cannot be applied to pods.

......

Some use cases for AWS lambda

Serverless has been hyped a lot these past few years, and we use lambda inside the company as well, so here are some notes. It is quite useful, but far from a silver bullet; the applicable scenarios are fairly limited.

Lambda code is deployed with the serverless framework, which itself supports multiple cloud platforms; we only use it with aws lambda.

I basically use lambda as triggers and web hooks.

Using it with auto scaling groups

All server groups in production are managed by auto scaling groups; only the stateless servers have auto scaling enabled, while the stateful ones (ElasticSearch cluster, redis cache cluster) just keep a fixed size.

There is quite a lot to do when adding a server to a group: tag the new server with its index within the group, add an internal DNS record, provision it, and deploy the latest code.

All of this is done by jenkins, but how do we trigger the jenkins job?

......

A failed performance investigation

Unable to see the forest for the trees. A while ago I spent a long time investigating a performance problem; because of some habitual thinking it took a lot of effort to figure out the cause, and the cause turned out to be very simple…. Writing it down so I don't fall into the same pit again.

The symptom: a bit after 10pm, server latency suddenly spiked for a moment, but it recovered quickly; there weren't many timed-out requests and the impact was small. The problem had actually existed for quite a while, but I never paid much attention to it, since it was short-lived and low-impact, and a quick look didn't reveal anything obvious, so it was left alone. Recently it got worse: latency climbed higher and spiked several times between 10pm and 11pm, so I started looking seriously into why.

......

Access sensitive variables on AWS lambda

AWS lambda is convenient for running simple serverless applications, but how do you access sensitive data (passwords, tokens, …) in code?

Usually we inject secrets as environment variables, but they are still visible on the lambda console, so I don't use them for secrets on aws lambda.

A better way is to use aws parameter store as a configuration center. It works with KMS to encrypt your data.

Code example:

import boto3

# Read a SecureString parameter and let SSM decrypt it via KMS
client = boto3.client('ssm')
resp = client.get_parameter(
    Name='/redshift/admin/password',
    WithDecryption=True
)

The response looks like:

    {
        "Parameter": {
            "Name": "/redshift/admin/password",
            "Type": "SecureString",
            "Value": "password value",
            "Version": 1
        }
    }

Things you need to do to make it work:

  • Create a new KMS key
  • Use the newly created KMS key to encrypt your data in parameter store.
  • Set an execution role for your lambda function.
  • On the KMS key's settings page, add the lambda execution role to the list of roles allowed to use this key for decryption.

Then your lambda code can access the encrypted data at runtime, and you don't need to set an aws access_key/secret_key; the lambda execution role grants access to the data in parameter store.

BTW, parameter store supports hierarchies (at most 15 levels), split by /. You can retrieve all data under the same path in one call; details can be found in the doc, e.g. http://boto3.readthedocs.io/en/latest/reference/services/ssm.html#SSM.Client.get_parameters_by_path
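
For example, a small sketch reusing the /redshift/ prefix from above (pagination via NextToken is omitted):

import boto3

client = boto3.client('ssm')
# Fetch every parameter under the /redshift/ prefix in one call,
# decrypting SecureString values with the KMS key.
resp = client.get_parameters_by_path(
    Path='/redshift/',
    Recursive=True,
    WithDecryption=True,
)
params = {p['Name']: p['Value'] for p in resp['Parameters']}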

......

Glow Infra Evolution

The evolution of Glow's data infrastructure

Glow has always made decisions in a data-driven way, and a stable, efficient platform is essential to support that. This post summarizes how the company's data infrastructure has evolved over the years.

A few principles for technology selection and implementation, given our business characteristics:

  1. The need for real-time analytics is low; a data delay of up to 1 hour is acceptable.
  2. Support fast interactive queries.
  3. Prefer AWS managed services for the underlying platform to reduce maintenance cost.
  4. In case of failures, data may be delayed but must not be lost.
  5. Historical data can be replayed.
  6. Costs are kept under control.

AWS services used:

  • Data storage and querying: S3, Redshift (spectrum), Athena
  • ETL: DMS, EMR, Kinesis, Firehose, Lambda

Open-source software: td-agent, maxwell

Data sources:

  1. Online business databases
  2. Metrics logs generated by user activity
  3. Data pulled from various third-party service APIs (email and the like)

The early days

In the beginning the business was simple and the data volume small: all data lived in MySQL, a slave was set up, and analytical queries all ran on the slave.

......

Get Real Client Ip on AWS

If you run a webserver on AWS, getting the real client ip can be tricky unless you configure the servers and write the code correctly.

Components involved in getting the real client ip:

  • CloudFront (cdn)
  • ALB (loadbalancer)
  • nginx (on ec2)
  • webserver (maybe a python flask application).

The request passes through them in order: User → CloudFront → ALB → nginx → webserver.

The user's real client ip is forwarded by the front proxies one by one in the X-Forwarded-For header.

For CloudFront:

  • If the user's request doesn't have an X-Forwarded-For header, it sets the user's ip (from the tcp connection) as X-Forwarded-For
  • If the user's request already has X-Forwarded-For, it appends the user's ip (from the tcp connection) to the end of X-Forwarded-For

For ALB, the rule is the same as CloudFront, so the X-Forwarded-For header passed to nginx will be the value received from CloudFront plus CloudFront's ip.

For nginx, things get tricky depending on your config.

Things that may be involved in nginx:

  • real ip module
  • proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

If you don't use the real ip module, you need to pass the X-Forwarded-For header explicitly.

proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; appends the ALB's ip to the end of the X-Forwarded-For header received from the ALB.

So the X-Forwarded-For header your webserver receives will be user ip, cloudfront ip, alb ip.

Alternatively, you can use the real ip module to trust the value passed from the ALB.
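
If you pass X-Forwarded-For through to the application instead, the webserver should count trusted proxies from the right rather than trusting the first entry, since clients can send an arbitrary X-Forwarded-For themselves. A sketch for a flask application, assuming exactly two trusted hops (CloudFront and the ALB) as in the chain above:

from flask import Flask, request

app = Flask(__name__)

# CloudFront and the ALB each append one ip to X-Forwarded-For, so the
# client ip is the third entry counted from the right. Adjust the count
# if your proxy chain differs.
TRUSTED_PROXIES = 2

@app.route('/ip')
def client_ip():
    xff = request.headers.get('X-Forwarded-For', '')
    ips = [p.strip() for p in xff.split(',') if p.strip()]
    if len(ips) > TRUSTED_PROXIES:
        return ips[-(TRUSTED_PROXIES + 1)]
    # Fallback for direct connections (e.g. local testing).
    return request.remote_addr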

......

DynamoDB

DynamoDB is AWS's managed NoSQL database; it can be used as a simple KV store or as a document database.

Data model

The unit of data organization is the table. Every table must have a primary key, and an optional sort key can be configured for indexing.

Each piece of data is an item; every item has one or more attributes, which must include the primary key.

Attribute values support the following types (a sketch of the wire representation follows this list):

  • Number: since DynamoDB's wire protocol is http + json, numbers are always transmitted as strings for cross-language compatibility.
  • Binary: arbitrary binary data, base64-encoded for transport.
  • Boolean: true or false
  • Null
  • Document types include List and Map, which can be nested in each other.
    • List: unlimited number of elements, total size no more than 400KB
    • Map: unlimited number of attributes, total size no more than 400KB, nesting no deeper than 32 levels.
  • Set: an unlimited number of elements per set, unordered, no more than 400KB total, and all elements must be of the same type; number sets, binary sets and string sets are supported.
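
For illustration, this is roughly how these types look through the low-level boto3 client, where every value carries an explicit type tag (table and attribute names are hypothetical):

import boto3

client = boto3.client('dynamodb')
# Note the Number is sent as a string ("42"), per the wire protocol;
# boto3 base64-encodes Binary values automatically.
client.put_item(
    TableName='users',
    Item={
        'user_id': {'S': 'u-123'},
        'score':   {'N': '42'},
        'active':  {'BOOL': True},
        'tags':    {'SS': ['early-adopter', 'beta']},
    },
)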

Choosing a primary key

A table's primary key can be either a single partition key or a composite partition key + sort key. Either way, the resulting primary key must be unique within the table.
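
A small boto3 sketch of creating a table with a composite primary key (table name, attributes and capacities are hypothetical):

import boto3

client = boto3.client('dynamodb')
# Composite primary key: partition key + sort key; the combination
# must be unique per item.
client.create_table(
    TableName='events',
    AttributeDefinitions=[
        {'AttributeName': 'user_id', 'AttributeType': 'S'},
        {'AttributeName': 'created_at', 'AttributeType': 'N'},
    ],
    KeySchema=[
        {'AttributeName': 'user_id', 'KeyType': 'HASH'},      # partition key
        {'AttributeName': 'created_at', 'KeyType': 'RANGE'},  # sort key
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5},
)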

......

Handle outage

A few weeks ago, the production environment had an outage. Solving it cost me 8 hours (from 3am to 11am); although the total downtime was not long, it was a really bad experience. In the end the impact was mitigated, and I'm working on a long-term solution. I learned some important things from this incident.

The outage

I received alarms about a live performance issue at 3am: first server latency increased, then soon some services' health checks failed due to high load.

I did the following:

  1. Check the monitoring dashboards
  2. Identify that the problem was caused by the KV system

Okay, so the problem was clear: a performance issue in the KV system. But I couldn't figure out the root cause right away, so I needed a temporary solution. The straightforward way would be to redirect traffic to the slave instance, but I knew it wouldn't work (which turned out to be true); I had run into a similar issue before and made a fix for it, but apparently the fix doesn't work.

The real downtime was not long; performance recovered to some degree soon, but latency was still high, not back to normal. I monitored it for a long time and kept trying to find the root cause until morning. As traffic grew toward peak hours, performance became a problem again.

......