AWS DMS notes

AWS DMS (Database Migration Service) can be used to do incremental ETL between databases. I use it to load data from RDS (MySQL) to Redshift.

It works, but there are some caveats. Here are the notes I took during this project.

Prerequisites

Source RDS must:

Enable automatic backups
Increase the binlog retention time: call mysql.rds_set_configuration('binlog retention hours', 24);
Set binlog_format to ROW.
Privileges on source RDS: REPLICATION CLIENT, REPLICATION SLAVE, and SELECT on the tables to be replicated

DDL on source table

Redshift has some limits on changing columns:

New columns can only be added at the end
Can’t rename columns

So for DDL on the source MySQL, you can’t add columns at a non-end position, otherwise data in the target table will be corrupted. I disabled propagating DDL changes to the target DB:

    "ChangeProcessingDdlHandlingPolicy":{  
        "HandleSourceTableDropped":false,
        "HandleSourceTableTruncated":false,
        "HandleSourceTableAltered":false
    },

If the source table schema changes, I just drop and reload the target table from the console.

Control write speed on Redshift

Since Redshift is an OLAP database, writes are slow and concurrency is low, so streaming changes directly has a big impact on it.

We also have many analysis jobs running on Redshift all the time; streaming directly locks the target table and makes those jobs time out.

So I need to batch-apply changes in DMS. The following settings need to be tweaked in the task settings JSON:

......

2017.10.14 lang-en AWS DMS data-infra

Get all invalid PTR records on Route53

I use an autoscaling group to manage stateless servers. Servers go up and down every day.

Once a server is up, I add a PTR record for its internal IP. But when it went down, I didn’t clean up the PTR record. As time went by, a lot of invalid PTR records were left in Route53.

To clean up those PTR records in real time, you can write a Lambda function that uses the server termination event as its trigger. But how do you clean up the old records all at once?

The straightforward way is to write a script that calls the AWS API to get the PTR list, extracts the IP from each record, tests whether the IP is live, and deletes the record if it is not.

Since using awscli to delete a Route53 record is very troublesome (it involves JSON input), you’d better write a Python script to delete them. Here I just demo some ideas for collecting those records via shell.
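As a rough illustration of the JSON involved in a delete, here is a minimal Python sketch that builds a Route53 change batch for stale PTR records. The record name, TTL and target hostname are made-up examples, and the helper name build_delete_batch is hypothetical. Route53 only accepts a DELETE when the name, type, TTL and values match the existing record set, so they have to be copied from the listing.

```python
import json


def build_delete_batch(records):
    """Build a Route53 ChangeBatch that deletes the given PTR record sets.

    Each input record must repeat the Name, TTL and values of the existing
    record set exactly, otherwise Route53 rejects the DELETE.
    """
    changes = []
    for rec in records:
        changes.append({
            "Action": "DELETE",
            "ResourceRecordSet": {
                "Name": rec["Name"],
                "Type": "PTR",
                "TTL": rec["TTL"],
                "ResourceRecords": [{"Value": v} for v in rec["Values"]],
            },
        })
    return {"Comment": "remove stale PTR records", "Changes": changes}


# Hypothetical stale record, for illustration only.
stale = [{
    "Name": "1.0.0.10.in-addr.arpa.",
    "TTL": 300,
    "Values": ["web-1.internal.example.com."],
}]
batch = build_delete_batch(stale)
print(json.dumps(batch, indent=2))
```

The resulting dict can be passed as ChangeBatch to boto3's change_resource_record_sets, or saved to a file and given to aws route53 change-resource-record-sets with --change-batch file://batch.json.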

The collection can be done in a single line, but to keep things clear and easy to debug, I split it into several steps.

Get PTR record list

aws route53 list-resource-record-sets  --hosted-zone-id xxxxx --query "ResourceRecordSets[?Type=='PTR'].Name" |  grep -Po '"(.+?)"' | tr -d \" > ptr.txt

ptr.txt will contain lines like:

1.0.0.10.in-addr.arpa.
2.0.0.10.in-addr.arpa.

Get ip list from PTR records

cat ptr.txt | while read -r line ; do echo -n $line | tac -s. | cut -d. -f3- | sed 's/.$//' ; done > ip.txt

ip.txt:

......
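For comparison, the PTR-name-to-IP step from the shell pipeline above could be sketched in Python like this. It is only an illustration for IPv4 names, and the function name ptr_to_ip is hypothetical.

```python
def ptr_to_ip(ptr_name):
    """Convert an IPv4 PTR name such as '1.0.0.10.in-addr.arpa.' to '10.0.0.1'."""
    suffix = ".in-addr.arpa"
    name = ptr_name.strip().rstrip(".")      # drop newline and trailing dot
    if not name.endswith(suffix):
        raise ValueError("not an IPv4 PTR name: %r" % ptr_name)
    octets = name[:-len(suffix)].split(".")  # octets are stored in reverse order
    return ".".join(reversed(octets))


for line in ["1.0.0.10.in-addr.arpa.", "2.0.0.10.in-addr.arpa."]:
    print(ptr_to_ip(line))
```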

2017.09.29 lang-en AWS route53

Build private static website on S3

Building a static website on S3 is very easy, but by default it is accessible from the open internet. It would be super helpful if we could build a website that is only available inside a VPC. Then we could use it to host an internal deb repo, a doc site…

The steps are very easy: you only need VPC endpoints and an S3 bucket policy.

The AWS API is open to the internet; if you access S3 from a VPC, your requests pass through the VPC’s internet gateway or NAT gateway. With VPC endpoints (found in the VPC console), your requests to S3 go through AWS’s internal network. Currently, VPC endpoints only support S3; support for DynamoDB is in testing.

To make an S3 bucket available only in your VPC, you need to set a bucket policy (to host a static website, enable static website support first). At first I didn’t check the docs and tried to restrict access by my VPC’s IP CIDR, but it didn’t work. I needed to restrict by VPC endpoint ID:

{
  "Version": "2012-10-17",
  "Id": "Policy1415115909152",
  "Statement": [
    {
      "Sid": "Access-to-specific-VPCE-only",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::my_secure_bucket",
                   "arn:aws:s3:::my_secure_bucket/*"],
      "Condition": {
        "StringEquals": {
          "aws:sourceVpce": "vpce-1a2b3c4d"
        }
      }
    }
  ]
}

BTW, you can also make the bucket policy restrict access by VPC directly; with a VPC endpoint you can narrow it down to specific subnets. Details can be found in the doc: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-endpoints-s3.html

......

2017.08.19 lang-en AWS S3

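Travel notes

A while back I was feeling unsettled. Things in life and at home piled up together and left me rather worn out. In early August I took a trip around Japan's Tohoku region. It happened to be the peak of the local festival season, so I joined in the fun, and it was quite interesting.

Itinerary: Shanghai -> Tokyo -> Morioka -> Hachinohe -> Lake Towada -> Aomori -> Hirosaki -> Tokyo -> Shanghai, a packed 7 days. I'm too lazy to post photos, so here are just some scattered notes.

Morioka is the capital of Iwate Prefecture, but when I first arrived it felt like one big countryside. In broad daylight, once I left the station area, I met only 2 people in over a kilometer of walking: a man sleeping under a bridge and someone walking a dog. It felt like a ghost town, and I really wasn't used to it. In the evening there was the local Sansa Odori festival, and suddenly crowds appeared out of nowhere. Everyone gathered together, buying and eating at all kinds of street stalls; it was so lively I could hardly believe it was the same city as in the daytime. Sansa Odori is a traditional local dance (said to have been for sealing away some kind of demon). The dancers move forward as they dance; some beat taiko drums, some play flutes, some dance empty-handed. The skilled ones are quite beautiful to watch, and there are plenty of kids just filling out the ranks too :). The participants are all local groups and companies, so it is basically one big advertising parade. When the taiko formation passes right in front of you, it is really impressive. I stayed in Morioka for two days and went from feeling out of place on arrival to thoroughly enjoying the atmosphere of the city. There are wind chimes and hydrangeas everywhere, and life is leisurely. It is the first time I have longed so much to live in a city.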

......

2017.08.13

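Use Redshift Spectrum to query S3

Querying S3 data with Redshift Spectrum

When using Redshift as a data warehouse, you usually have to do a lot of ETL work. The typical flow is to massage data from various sources, drop it onto S3, and then load it from S3 into Redshift. If you have a large amount of historical data to import into Redshift, this process is painful: Redshift is not friendly to loading a huge amount of data at once, so you have to do it in batches.

In April this year, Redshift released a new feature, Spectrum, which can query structured data on S3 directly from Redshift. I recently migrated part of our data warehouse directly to Spectrum, so this is a good time to talk about it.

Motivation

Glow's data warehouse is built on Redshift and is split into two clusters. An SSD cluster stores the most recent 4 months of data, serving ad hoc queries such as product analysis, metrics reports and debugging. Data older than 4 months is stored in an HDD cluster, which is cheap and has large capacity but is slow to query.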

......

2017.07.21 AWS Redshift spark data-infra

Enable coredump on ubuntu 16.04

A coredump file is useful for debugging program crashes. This post shows several settings related to coredumps.

Enable coredump

If you run the program from a shell, enable coredump via ulimit -c unlimited, then check ulimit -a | grep core. If it shows unlimited, coredump is enabled for your current session.

If your program is managed by systemd, you need to edit the [Service] section of its service unit file and add LimitCORE=infinity to enable coredump.
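The shell limit above corresponds to the per-process RLIMIT_CORE resource limit. As a small sketch (not part of the original setup), a Python program can inspect that limit and raise its own soft limit up to the hard limit using the standard resource module; the function name enable_core_dumps is hypothetical.

```python
import resource


def enable_core_dumps():
    """Raise this process's soft core-size limit to its hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Raising the soft limit up to the hard limit needs no extra privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


soft, hard = enable_core_dumps()
unlimited = soft == resource.RLIM_INFINITY
print("core limit: soft=%s hard=%s unlimited=%s" % (soft, hard, unlimited))
```

If the hard limit itself is 0, this cannot enable dumps; the limit then has to be raised by the parent (the shell or the systemd unit) as described above.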

coredump location

The coredump file’s location is determined by the kernel parameter kernel.core_pattern.

On ubuntu 16.04 the default value of kernel.core_pattern is |/usr/share/apport/apport %p %s %c %P. The leading | means the coredump is piped to the following program. %p %s %c %P are template specifiers expanded by the kernel; their meanings can be checked via man core. apport saves the dump file in /var/crash.

If your default disk partition doesn’t have enough space to hold the dump file, you can change kernel.core_pattern to another location, eg: /mnt/core/%e-%t.%P. If redis-server crashes, the dump file will be something like /mnt/core/redis-server-1500000000.1452. Also ensure the user running the crashed process has write permission on the target location.
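To make the template expansion concrete, here is a small Python sketch that mimics how the specifiers in a pattern such as /mnt/core/%e-%t.%P turn into a file name. It is a toy model of the behaviour described in man core, not kernel code; the function name and the values dict are assumptions made for the example.

```python
def expand_core_pattern(pattern, values):
    """Expand %-specifiers in a kernel.core_pattern style template.

    values maps a specifier letter (e.g. 'e', 't', 'P') to its value.
    Unknown specifiers and a lone trailing '%' are dropped, '%%' gives '%'.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            if i + 1 < len(pattern):
                spec = pattern[i + 1]
                out.append("%" if spec == "%" else str(values.get(spec, "")))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


crash = {
    "e": "redis-server",  # executable name
    "t": 1500000000,      # time of dump, seconds since the epoch
    "P": 1452,            # PID in the initial PID namespace
}
print(expand_core_pattern("/mnt/core/%e-%t.%P", crash))
```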

systemd-coredump

You can install systemd-coredump for finer control over dump files, such as size and compression.

Its config file is /etc/systemd/coredump.conf.

After installation, it changes kernel.core_pattern to |/lib/systemd/systemd-coredump %P %u %g %s %t 9223372036854775808 %e.

......

2017.07.15 coredump system ubuntu

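Python web application performance tuning

Python web application performance tuning

To ship quickly, a lot of our early code was written in whatever way was most convenient, which left many hidden problems, and performance was not great. Because of the GIL, Python has an inherent performance disadvantage; even with a coroutine solution like gevent/eventlet, a time-consuming CPU operation can easily block the whole process. A while ago I refactored some of the base code with significant results, so here are some notes.

Goals:

With better performance, the most direct effect is handling the same traffic with fewer machines. The goal is to shut down 20% of the stateless webservers.
Make changes in framework code as much as possible, without touching business logic code.
Low risk (history tells us: dynamic typing feels great for a while, but refactoring is a funeral pyre….)

Treating the symptoms

A common scenario: everyone happily finishes a feature, sandbox testing shows no problems, it goes live, and then server load spikes and all kinds of timeouts show up. You either roll back the code or add machines. Where is the problem code?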

......

2017.07.01 glow server-infra

Build deb repository with fpm , aptly and s3

I’m lazy. I don’t want to be a deb/rpm expert, and I don’t want to maintain a repo server. I want as little maintenance effort as possible. 🙂

By combining the tools fpm and aptly with AWS S3, we can do it.

Use fpm to convert python package to deb

fpm can transform python/gem/npm/dir/… to deb/rpm/solaris/… packages

Example:

fpm -s python -t  deb -m [email protected] --verbose  -v 0.10.1 --python-pip /usr/local/pip Flask

It will transform the Flask 0.10.1 package to a deb. The output package will be python-flask_0.10.1_all.deb

Notes:

If a python package relies on C libs, like MySQLdb (libmysqlclient-dev), you need to install them on the build machine to build the deb binary.
By default fpm uses easy_install to build packages; some packages like httplib2 have a permission bug with easy_install, so I use pip.
By default, msgpack-python is converted to python-msgpack-python. I don’t like it, so I add -n python-msgpack to normalize the package name.
Some packages’ dependency version numbers are not valid (eg: celery 3.1.25 depends on pytz >= dev), so I replace the dependency with --python-disable-dependency pytz -d 'pytz >= 2016.7'
fpm will not download a package’s dependencies automatically; you need to do it yourself.
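When many packages need converting, the notes above can be folded into a small command builder. The sketch below only assembles the argument list and uses just the fpm flags that appear in this post; the helper name fpm_command and its parameters are hypothetical, and running the command is left to the caller.

```python
def fpm_command(package, version, pip="/usr/local/pip", name=None, replace_deps=None):
    """Build an fpm argument list for converting a Python package to a deb.

    replace_deps maps a dependency name to the constraint that should
    replace the one declared by the package.
    """
    cmd = ["fpm", "-s", "python", "-t", "deb", "-v", version, "--python-pip", pip]
    if name:
        cmd += ["-n", name]  # override the generated package name
    for dep, constraint in (replace_deps or {}).items():
        cmd += ["--python-disable-dependency", dep, "-d", constraint]
    return cmd + [package]


print(" ".join(fpm_command("Flask", "0.10.1")))
print(fpm_command("celery", "3.1.25", replace_deps={"pytz": "pytz >= 2016.7"}))
```

The returned list can be handed to subprocess.check_call on a machine where fpm is installed.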

Use aptly to setup deb repository

aptly can help build a self-hosted deb repository and publish it on S3.

......

2017.06.23 lang-en aptly fpm server-infra

Debug python performance issue with pyflame

pyflame is an open source tool developed by Uber: https://github.com/uber/pyflame

It takes snapshots of a running python process and, combined with flamegraph.pl, outputs a flamegraph picture of the python call stacks. This helps analyze the bottlenecks of a python program without injecting any perf code into your application, and the overhead is very low.

Basic Usage

sudo pyflame -s 10 -x -r 0.001 $pid | ./flamegraph.pl > perf.svg

-s, how many seconds to run
-r, sample rate (seconds)

Your output will look something like the following:

A longer bar means more sample points landed in it, which means this part of the code is slower and so has a higher chance of being seen by pyflame.

In my case, the output graph has a long IDLE part. Pyflame can only see call stacks that hold the GIL, so if the running code doesn’t hold the GIL, pyflame doesn’t know what it is doing and labels it as IDLE. The following cases are counted as IDLE:

Your process is sleeping, doing nothing.
Waiting for IO (eg: your application is calling a very slow RPC server).
Calling libs written in C.

The right part is the real application logic code. You can check this part to get an overview of the performance of your code.

You can also exclude the IDLE part from the graph if you don’t care about it: just apply -x
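If you save the sampler output to a file instead of piping it straight into flamegraph.pl, the same data can be summarized as text. The sketch below assumes the "folded" stack format that flamegraph.pl consumes (semicolon-separated frames followed by a sample count); the sample lines, including the (idle) label, are made up for illustration and the function name is hypothetical.

```python
from collections import Counter


def self_time(folded_lines):
    """Sum sample counts per leaf frame from flamegraph 'folded' lines.

    Each line looks like 'frame_a;frame_b;frame_c 42': the stack from
    root to leaf, then the number of samples seen with that stack.
    """
    totals = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        totals[stack.split(";")[-1]] += int(count)
    return totals


# Hypothetical sample data.
sample = [
    "(idle) 700",
    "app.py:main:10;app.py:handle:42;json_util.py:dumps:7 200",
    "app.py:main:10;app.py:handle:42 100",
]
totals = self_time(sample)
total = sum(totals.values())
for frame, n in totals.most_common():
    print("%5.1f%%  %s" % (100.0 * n / total, frame))
```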

......

2017.06.05 lang-en python

Designing Data-Intensive Applications, reading notes, Part 2

Chapter 4, 5, 6

Encoding formats

xml and json are text-based encoding formats; they can’t carry binary bytes (unless you encode them in base64, and size grows 33%). Along with msgpack (a binary encoding of JSON), they carry field names together with the data, which wastes a lot of space.
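The 33% figure follows from base64 mapping every 3 input bytes to 4 output characters. A quick check in Python, using an arbitrary random payload:

```python
import base64
import os

raw = os.urandom(3000)           # arbitrary binary payload
encoded = base64.b64encode(raw)  # ASCII-safe text that fits inside JSON or XML

print(len(raw), len(encoded))
print("overhead: %.0f%%" % (100.0 * (len(encoded) - len(raw)) / len(raw)))

# The round trip is lossless; only the size changes.
assert base64.b64decode(encoded) == raw
```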

thrift and protobuf are binary formats: they can carry binary bytes and only carry data, while the schema is defined in an IDL (interface definition language). They have code generation tools that generate code to encode and decode data, along with validation. Every field is bound to a tag (mapped to a field in the IDL file). If a field is defined as required, it can’t be removed and its tag value can’t be changed, otherwise old code will not be able to decode the data.

avro (used in hadoop) has a writer’s schema and a reader’s schema. When storing a large file in avro format (containing many records with the same schema), the writer’s schema is stored in the file together with the data. If avro is used for RPC, the schema is exchanged during connection setup. When decoding, the lib looks at both the writer’s schema and the reader’s schema and translates the data from the writer’s schema into the reader’s schema. Forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader; backward compatibility means that you can have a new version of the schema as reader and an old version as writer.
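A toy model may help picture the writer/reader translation. The sketch below is plain Python, not the avro library: schemas are reduced to a dict of field name to default value, fields are matched by name, fields unknown to the reader are ignored, and fields missing from the written record are filled from the reader's default. All names in it are made up for illustration.

```python
NO_DEFAULT = object()  # marks a reader field that has no default value


def resolve(record, reader_schema):
    """Reshape a decoded record from the writer's fields to the reader's.

    record: dict of field name -> value, as written with the writer's schema.
    reader_schema: dict of field name -> default value (or NO_DEFAULT).
    """
    result = {}
    for name, default in reader_schema.items():
        if name in record:
            result[name] = record[name]  # field known to both sides
        elif default is not NO_DEFAULT:
            result[name] = default       # writer lacked it: use the default
        else:
            raise ValueError("no value or default for field %r" % name)
    return result                        # writer-only fields are ignored


old_schema = {"user": NO_DEFAULT, "email": None}
new_schema = {"user": NO_DEFAULT, "email": None, "phone": None}

# Forward compatibility: new writer, old reader.
written_by_new = {"user": "alice", "email": "a@example.com", "phone": "555"}
print(resolve(written_by_new, old_schema))

# Backward compatibility: old writer, new reader.
written_by_old = {"user": "bob", "email": None}
print(resolve(written_by_old, new_schema))
```

In this model a new field stays compatible in both directions only because it has a default; adding a field without one would make old data unreadable by the new reader.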

......

2017.05.17 lang-en book