I recently rewrote our company's DB layer on top of sqlalchemy, so here are a few notes. The original DB layer code was ancient (10+ years old) and the people who first wrote it are long gone. It had plenty of problems: no unit tests at all; the exposed interfaces were named in a confusing way, mostly to stay compatible with historical quirks; it carried a set of client-side db sharding logic that the new project doesn't need at all, which also made joins and subqueries impossible and was very inconvenient; and the old db code had no model layer, being tied to db migrations through a very tricky mechanism, so while developing you had no way to tell from the code what the database tables......
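As a rough illustration of the model layer the old code was missing, here is a minimal sqlalchemy sketch; the table, columns, and DSN are made up for the example, not our actual schema:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class User(Base):
    # Hypothetical table: with a declarative model, the table structure
    # is visible right in the code instead of being hidden in migrations.
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    email = Column(String(255), nullable=False, unique=True)

engine = create_engine('sqlite://')  # placeholder DSN
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
```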
EKS hasn't launched yet, but AWS has already open-sourced its own CNI plugin. I took a quick look; here are some notes on its implementation and how it differs from other K8S network solutions. A K8S cluster has a few basic network requirements: containers must be able to reach each other without NAT; every node must be able to talk to every container without NAT; and the IP a container sees for itself must be the same IP that other containers see for it. Flannel in VPC: flannel is a K8S CNI plugin, and when using flannel inside a VPC you have a few options: encapsulate packets with VXLAN/UDP, which hurts network performance and is hard to debug; or use the aws vpc backend, which adds each host's docker subnet to the vpc routing table, but by default a routing table can only hold 50 rules......
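For reference, the aws-vpc backend mentioned above is selected in flannel's network config; a sketch, with a placeholder pod CIDR and route table ID:

```json
{
  "Network": "10.100.0.0/16",
  "Backend": {
    "Type": "aws-vpc",
    "RouteTableID": "rtb-xxxxxxxx"
  }
}
```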
Serverless has been hyped a lot these past few years, and we use lambda inside the company too, so here are some notes. It's quite useful, but far from a silver bullet; the suitable use cases are fairly limited. Lambda code is deployed with the serverless framework, which supports multiple cloud platforms, though we only run it on aws lambda. I basically use lambda as triggers and web hooks. Using it with auto scaling groups: all of our production server groups are managed by auto scaling groups, but only the stateless servers have auto scaling enabled; the stateful ones (ElasticSearch cluster, redis cache cluster) just keep a fixed size. When a new server is added to a group there is quite a lot to do: tag the new server with its number within the group, add an internal DNS name, provision it, and deploy the latest code. All of this is done with......
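A rough sketch of that trigger pattern, assuming the function is subscribed to the auto scaling group's "EC2 Instance Launch Successful" notification (the event fields and the tag key here are illustrative):

```python
import boto3

ec2 = boto3.client('ec2')

def handler(event, context):
    # Assumed event shape for an ASG "EC2 Instance Launch Successful"
    # notification delivered via CloudWatch Events.
    detail = event.get('detail', {})
    instance_id = detail.get('EC2InstanceId')
    group_name = detail.get('AutoScalingGroupName')
    if not instance_id:
        return
    # Tag the new instance so the later provisioning / DNS steps can find it.
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{'Key': 'asg-group', 'Value': group_name or 'unknown'}],
    )
```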
Unable to see the forest for the trees. I spent a while recently tracking down a performance problem, and because of some habitual assumptions it took a lot of effort before I figured out the cause, which turned out to be extremely simple…. I'm writing this down so I don't fall into the same hole again. The symptom: a bit after 10pm, server latency would suddenly spike for a moment, then recover right away; only a few requests timed out, so the impact was small. The problem had actually existed for quite a while, but I never took it seriously, because the spikes were short, the impact was small, and a quick look didn't reveal anything obvious, so it kept getting shelved. Until recently, when I felt the problem......
AWS lambda is convenient for running simple serverless applications, but how do you access sensitive data like passwords and tokens in code? Usually we inject secrets as environment variables, but those are still visible on the lambda console, so I don't use them in aws lambda. A better way is to use aws parameter store as a configuration center; it can work with KMS to encrypt your data. Code example: client = boto3.client('ssm') resp = client.get_parameter( Name='/redshift/admin/password', WithDecryption=True ) resp: { "Parameter": { "Name": "/redshift/admin/password", "Type": "SecureString", "Value": "password value", "Version": 1 } } Things you need to do to make it work: create a new KMS key; use the newly created KMS key to encrypt your data in parameter store; set an execution role for your lambda function; on the KMS key's settings page, add the lambda execution role to the list of roles allowed to use this KMS key. Then your lambda code can access the encrypted data at runtime, and you don't need to set an aws access_key/secret_key: the lambda execution role grants access to the data in parameter store. BTW, parameter store supports hierarchies (at most 15 levels), split by /. You can retrieve all data under the same level in one call; details can be found in the doc, eg: http://boto3.readthedocs.io/en/latest/reference/services/ssm.html#SSM.Client.get_parameters_by_path......
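Fetching a whole subtree with get_parameters_by_path looks roughly like this (using the same hypothetical /redshift/admin path as above):

```python
import boto3

ssm = boto3.client('ssm')

# Fetch every parameter under the (hypothetical) /redshift/admin path,
# decrypting SecureString values with the KMS key the role is allowed to use.
resp = ssm.get_parameters_by_path(
    Path='/redshift/admin',
    Recursive=True,
    WithDecryption=True,
)
for param in resp['Parameters']:
    print(param['Name'], param['Value'])
```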
The evolution of Glow's data infrastructure. Glow has always been a data-driven company, and a stable, efficient platform is the necessary foundation for that; this post summarizes how the company's data infrastructure has evolved over the past few years. A few principles for technology choices and implementation, given our business characteristics: the need for real-time analytics is low, and a time delta within 1 hour is acceptable; support fast interactive queries; prefer AWS managed services for the underlying platform to reduce maintenance cost; when failures happen, data may be delayed but must not be lost; historical data must be replayable; cost must stay under control. AWS services used: data storage and query: S3, Redshift (spectrum), Athena; ETL: DMS, EMR, Kinesis, Firehose, Lambda; open source software: td-agent, maxwell. Data sources: production......
If you run a webserver on AWS, getting the real client IP can be tricky if the servers aren't configured right and the code isn't written correctly. The pieces involved in the client's real IP: CloudFront (cdn), ALB (loadbalancer), nginx (on ec2), webserver (maybe a python flask application). The request flows through them in that order: User -> CloudFront -> ALB -> nginx -> webserver, and the user's real client IP is forwarded by the front proxies one by one in the X-Forwarded-For header. For CloudFront: if the user's request header doesn't have X-Forwarded-For, it sets the user's IP (from the tcp connection) in X-Forwarded-For; if the user's request already has X-Forwarded-For, it appends the user's IP (from the tcp connection) to the end of X-Forwarded-For. For ALB the rule is the same as CloudFront, so the X-Forwarded-For header passed to nginx will be the value received from CloudFront plus CloudFront's IP. For nginx, things get tricky depending on your config. Two things may be involved in nginx: the real ip module, and proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;. If you don't use the real ip module, you need to pass the X-Forwarded-For header explicitly; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; will append the ALB's IP to the end of the X-Forwarded-For header received from ALB, so the X-Forwarded-For header your webserver receives will be user ip, cloudfront ip, alb ip. Or you can use the real ip module to trust the value passed from ALB.......
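On the webserver side, a minimal flask sketch of picking the client IP out of that header, assuming exactly the CloudFront -> ALB -> nginx chain above (the hop count and route are illustrative):

```python
from flask import Flask, request

app = Flask(__name__)

# In this setup three trusted hops append to X-Forwarded-For:
# CloudFront adds the user's IP, ALB adds CloudFront's IP, nginx adds ALB's IP.
TRUSTED_HOPS = 3

@app.route('/whoami')
def whoami():
    xff = request.headers.get('X-Forwarded-For', '')
    hops = [h.strip() for h in xff.split(',') if h.strip()]
    # The entry added by the outermost trusted proxy is the real client IP;
    # anything before it could have been spoofed by the client.
    if len(hops) >= TRUSTED_HOPS:
        return hops[-TRUSTED_HOPS]
    return request.remote_addr or ''
```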
DynamoDB is AWS's managed NoSQL database; it can be used as a simple KV store or as a document database. Data model: data is organized into tables, every table must have a primary key, and an optional sort key can be set for indexing. Each record is an item, and every item contains one or more attributes, which must include the primary key. Attribute values support the following types: Number - since DynamoDB's transport protocol is http + json, numbers are always converted to strings on the wire for cross-language compatibility; Binary - for arbitrary binary data, base64 encoded for transfer; Boolean - true or false; Null; the Document types are List and Map, which can be nested inside each other; List - no limit on the number of elements, total size......
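A minimal boto3 sketch of this data model (the table name, keys, and attributes are made up):

```python
import boto3

dynamodb = boto3.resource('dynamodb')
# Hypothetical table with partition key "user_id" (string) and sort key "created_at" (number).
table = dynamodb.Table('example_table')

table.put_item(Item={
    'user_id': 'u123',                        # primary (partition) key
    'created_at': 1520000000,                 # sort key, a Number
    'tags': ['a', 'b'],                       # List
    'profile': {'age': 30, 'active': True},   # Map nesting Number and Boolean
})

resp = table.get_item(Key={'user_id': 'u123', 'created_at': 1520000000})
print(resp.get('Item'))
```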
A few weeks ago the production environment had an outage; solving it cost me 8 hours (from 3am to 11am). Although the total down time was not long, it was a really bad experience. In the end the impact was mitigated, and I'm working on a long term solution. I learned some important things from this incident. The outage: I received alarms about a live performance issue at 3am, first server latency increasing, then soon some services' health checks failing due to high load. I did the following: check monitoring; identify that the problem was caused by the KV system. Okay, so the problem was clear: the KV system's performance. But I couldn't figure out the root cause right away, so I needed a temporary fix. The straightforward move was to redirect traffic to the slave instance, but I knew it wouldn't work (which turned out to be true); I had run into a similar issue before and made a fix for it, but it didn't seem to help this time. The real down time was not long, and performance soon recovered to some degree, but latency stayed higher than normal. I kept monitoring and trying to find the root cause until morning. As traffic grew toward peak hours, performance became a problem again.......
AWS's DMS (Data Migration Service) can be used for incremental ETL between databases. I use it to load data from RDS (MySQL) into Redshift. It works, but I have some concerns; here are some notes from the project. Prerequisites: the source RDS must enable automatic backups; increase the binlog retention time, call mysql.rds_set_configuration('binlog retention hours', 24); and set binlog_format to ROW. Privileges needed on the source RDS: REPLICATION CLIENT, REPLICATION SLAVE, SELECT on the replicated tables. DDL on the source table: Redshift has some limits on changing columns: new columns can only be added at the end, and columns can't be renamed. So for DDL on the source MySQL you can't add columns at a non-end position, otherwise data in the target table will be corrupted. I disabled ddl changes on the target db: "ChangeProcessingDdlHandlingPolicy":{ "HandleSourceTableDropped":false, "HandleSourceTableTruncated":false, "HandleSourceTableAltered":false }, and if the source table schema changes, I just drop and reload the target table on the console. Controlling write speed on Redshift: since Redshift is an OLAP database, writes are slow and concurrency is low, so streaming data into it directly has a big impact. We also have many analysis jobs running on Redshift all the time, and direct streaming would lock the target tables and make my analysis jobs time out. So I need to batch apply changes on DMS.......
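For reference, batch apply is turned on in the DMS task settings; a sketch, where the timeout and memory values are placeholders rather than the ones I actually used:

```json
"TargetMetadata": {
    "BatchApplyEnabled": true
},
"ChangeProcessingTuning": {
    "BatchApplyTimeoutMin": 60,
    "BatchApplyTimeoutMax": 600,
    "BatchApplyMemoryLimit": 500
}
```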