AWS’s DMS (Data migration service) can be used to do incremental ETL between databases. I use it to load data from RDS (MySQL) to Redshift.
It works, but have some concerns. Take some notes when doing this project.
Prerequisites
Source RDS must:
- Enable automatic backups
- Increase binlog remain time,
call mysql.rds_set_configuration('binlog retention hours', 24);
- Set
binlog_format
toROW
. - Privileges on source RDS:
REPLICATION CLIENT
,REPLICATION SLAVE
,SELECT
on replication target tables
DDL on source table
Redshift has some limits on change columns:
- New column only must be added in the end
- Can’t rename columns
So for DDL on source MySQL, you can’t add columns at non end postition, otherwise data in target table will corrupt. I disabled ddl changes target db:
"ChangeProcessingDdlHandlingPolicy":{
"HandleSourceTableDropped":false,
"HandleSourceTableTruncated":false,
"HandleSourceTableAltered":false
},
If source table schema changed, I just drop and reload target table on console.
Control write speed on Redshift
Since Redshift is an OLAP database, write operation is slow and concurrency is low, streaming data directly will have big impact on it.
And we have may analysis jobs running on redshift all the time, directly streaming will lock target table and make my analysis jobs timeout.
So I need to batch apply changes on DMS. Follow settings need to tweak in task settings json:
......