The S3 storage engine in
MariaDB 10.5
FOSDEM 2020
Michael Widenius
CTO @ MariaDB
What is Amazon S3
Storage of files in the cloud
Works through HTTP requests (think GET/PUT)
Supports basic functions like list files, copy to/from and delete.
Move is implemented as copy + delete of the whole file (not fast)
Quite slow for small files. Optimal file size for retrieval is said to be around
4M.
Many vendors support storage of files according to the ‘S3 interface’
Not really usable for databases that need to update blocks within a file.
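As a rough illustration of the HTTP model (not MariaDB code; the bucket, key and host
below are only examples, and real S3 requests must also carry signed authentication
headers, which libmarias3, described later, takes care of), fetching one object is
essentially a single GET:

#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
  CURL *curl;
  CURLcode res;

  curl_global_init(CURL_GLOBAL_DEFAULT);
  if (!(curl= curl_easy_init()))
    return 1;

  /* GET https://<bucket>.s3.amazonaws.com/<key>; the body goes to stdout */
  curl_easy_setopt(curl, CURLOPT_URL,
                   "https://mariadb.s3.amazonaws.com/foo/test1/frm");
  res= curl_easy_perform(curl);
  if (res != CURLE_OK)
    fprintf(stderr, "GET failed: %s\n", curl_easy_strerror(res));

  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return res == CURLE_OK ? 0 : 1;
}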
S3 storage
All files are stored in one of several ‘buckets’
Files are stored as objects in a bucket accessed with a key.
The key may contain ‘/’, and commands like ‘aws s3 ls’ make S3 storage feel
like a file structure, even if it really isn’t.
S3 storage engine in MariaDB 10.5
Read-only tables (perfect for inexpensive archiving of old data).
For some organizations S3 storage is cheaper than local storage
Data and index can optionally be compressed on S3, which can make storage
even cheaper.
S3 works with partitioning for flexible handling of multiple tables
Very fast, thanks to reading of blocks in big chunks (4M by default)
Supports all key formats and optimizations that Aria supports
Data can be accessed by multiple MariaDB servers. Tables are automatically
discovered when needed.
S3 has its own page cache
S3 is also backported to MariaDB Enterprise Server 10.3 and 10.4
Converting a table to S3
Converting a table to S3:
ALTER TABLE old_table ENGINE=S3
And converting back to local:
ALTER TABLE table_on_s3 ENGINE=InnoDB
Internally the data is first copied to a local Aria table with
ROW_FORMAT=PAGE TRANSACTIONAL=0 and then moved to S3.
Converting a table to S3
One can also use ‘aria_s3_copy’ to:
Copy Aria tables to S3
Copy S3 tables to local storage
Delete tables on S3
((/my/maria-10.5/storage/maria)) aria_s3_copy --help
aria_s3_copy Ver 1.0 for Linux on x86_64
...
Copy an Aria table to and from s3
Usage: aria_s3_copy --aws-access-key=# --aws-secret-access-key=# --aws-region=#
       --op=(from_s3 | to_s3 | delete_from_s3) [OPTIONS] tables[.MAI]
ALTER TABLE options for S3
ALTER TABLE old_data ENGINE=S3
S3_BLOCK_SIZE=# // Default 4M
COMPRESSION_ALGORITHM=none|zlib // Default none
Both index and data are compressed. Typical savings up to 70%
Setting up S3 storage engine
Add to your my.cnf file something like:
[mariadb-10.5]
s3=ON
s3-host-name=s3.amazonaws.com
s3-bucket=mariadb
s3-access-key=xxxx
s3-secret-key=xxx
s3-region=eu-north-1
#s3-slave-ignore-updates=1
#s3-pagecache-buffer-size=256M
Different S3 storage
Replication with different S3 storage
on master and slave
In this case the normal replication works.
The slaves can simply re-execute any S3-related command from the master against
their own S3 storage.
Shared S3 storage
Replication with same S3 storage
on master and slave
In this case the slave cannot re-execute any S3-related commands, as the master has
already executed them on the shared storage. Instead the slave has to do the following:
All CREATE TABLE of S3 tables should be ignored
All updates to S3 tables should be ignored (only ALTER is possible)
ALTER TABLE converting org_name to an S3 table should only do DROP TABLE org_name
(the data is already on S3)
DROP TABLE of an S3 table should only remove any local .frm definition, and not touch
the S3 data
RENAME of S3 tables is replicated as RENAME IF EXISTS so that the slave can
ignore the rename (as the table is already renamed in S3). If the slave’s .frm file is
for an S3 table, it is also deleted.
Because of this, RENAME IF EXISTS was implemented in 10.5
Replication with same S3 storage
on master and slave
On the master:
ALTER TABLE s3_table ENGINE=InnoDB should be replicated in the binary
log as follows:
DROP TABLE IF EXISTS s3_table
CREATE TABLE “local_table”
Copy data to binary log with row-logged insert statements.
Setting up replication with same S3 storage
on master and slave
The master can’t know whether any of the slaves will use the same S3 storage as the
master. Because of this, the following option was introduced:
--s3_replicate_alter_as_create_select (defaults to on)
If the slave is using the same S3 storage as the master, then the slave should set the
option:
--s3_slave_ignore_updates (defaults to off)
Interface used to connect to S3
I first tried to use the AWS recommended interface (aws-sdk-cpp.git):
Lots of templates and templates on top of templates
Lots of code!
Hard to work with binary files (optimized for reading line by line or block
by block)
VERY hard to use for what was needed:
Reading or writing a complete file
Checking if files existed
Getting list of files
libmarias3
We decided instead to create our own layer for accessing S3
Uses libcurl and libxml2 internally
Developed by Andrew Hutchings as an independent project
Released under LGPL
Simple and efficient API:
ms3_list(ms3, bucket, prefix, &list)
ms3_put(ms3, bucket, key, data, length)
ms3_get(ms3, bucket, key, data, &length)
ms3_delete(ms3, bucket, key)
ms3_status(ms3, bucket, key, &status)
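A minimal usage sketch of this API (the header name, the ms3_init()/ms3_deinit()/
ms3_library_init() setup calls and the exact argument types are assumptions based on
the libmarias3 headers and may differ; keys, bucket and object name are made up):

#include <stdio.h>
#include <libmarias3/marias3.h>

int main(void)
{
  uint8_t *data= NULL;
  size_t length= 0;
  ms3_st *ms3;

  ms3_library_init();
  ms3= ms3_init("ACCESS_KEY", "SECRET_KEY", "eu-north-1", "s3.amazonaws.com");
  if (!ms3)
    return 1;

  /* Write one object, read it back, then delete it */
  ms3_put(ms3, "mariadb", "foo/example", (const uint8_t *) "hello", 5);
  if (!ms3_get(ms3, "mariadb", "foo/example", &data, &length))
  {
    printf("read %zu bytes\n", length);
    ms3_free(data);
  }
  ms3_delete(ms3, "mariadb", "foo/example");

  ms3_deinit(ms3);
  ms3_library_deinit();
  return 0;
}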
libmarias3 internals
libmarias3 implements the API above: it builds the S3 request headers and the HTTPS
requests.
Curl handles the HTTPS request / response, including TLS (transport layer
security)
libxml2 parses the XML response from the list commands (think ls)
Storage layout on S3
frm file (used by discovery):
s3_bucket/database/table/frm
First index block (contains description of the Aria files):
s3_bucket/database/table/aria
Rest of the index file:
s3_bucket/database/table/index/block_number
Data file:
s3_bucket/database/table/data/block_number
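As a small illustration (a hypothetical helper, not MariaDB code), an object key for
the data file is just the database name, table name and a zero-padded block number
joined with ‘/’:

#include <stdio.h>

int main(void)
{
  char key[512];
  const char *database= "foo", *table= "test1";
  unsigned long long block_number= 1;   /* first S3 block of the data file */

  /* The 6-digit zero-padded block number matches the listing on the next slide */
  snprintf(key, sizeof(key), "%s/%s/data/%06llu", database, table, block_number);
  puts(key);                            /* prints: foo/test1/data/000001 */
  return 0;
}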
Example of data stored on S3
shell> aws s3 ls --recursive s3://mariadb-bucket/
2019-05-10 17:46:48 8192 foo/test1/aria
2019-05-10 17:46:49 3227648 foo/test1/data/000001
2019-05-10 17:46:48 942 foo/test1/frm
2019-05-10 17:46:48 1015808 foo/test1/index/000001
Implementation of S3
The S3 storage class (ha_s3) inherits from the Aria storage engine.
It uses one external library, libmarias3.
The changes needed in the Aria storage engine were quite small:
First commit (almost all code, except replication and the libmarias3 library):
git show --stat ab38b7511bad8cc03a67f0d43e7169e6dfcac9fa
...
66 files changed, 4390 insertions(+), 212 deletions(-)
Implementation of S3
(/my/maria-10.5/storage/maria) wc *s3*.*
330 1109 10336 aria_s3_copy.cc
810 2383 24753 ha_s3.cc
71 269 2048 ha_s3.h
1458 3967 41178 s3_func.c
118 507 4965 s3_func.h
56 139 1060 test_aria_s3_copy.sh
2843 8374 84340 total
The two main Aria files that were changed were:
storage/maria/ma_open.c | 257 (147 non space changes)
storage/maria/ma_pagecache.c | 414 (390 non space changes)
Implementation of S3
ha_s3.h
class ha_s3 :public ha_maria
{
bool in_alter_table;
S3_INFO *open_args;
public:
ha_s3(handlerton *hton, TABLE_SHARE * table_arg);
int create(const char *name, TABLE *table_arg, ...);
int open(const char *name, int mode, uint open_flags);
int write_row(const uchar *buf); // Only usable by ALTER TABLE
void drop_table(const char *name) {} // Only used for internal temporary tables
int delete_table(const char *name);
int rename_table(const char *from, const char *to);
int discover_check_version();
S3_INFO *s3_open_args() { return open_args; }
void register_handler(MARIA_HA *file);
};
Implementation of S3
Aria changes
ma_open.c
If called by S3 engine, read header from S3 instead of from file
Read specific S3 options from the index file header (block size, compression etc)
ma_pagecache.c
Internally, both Aria and S3 tables are stored in blocks of aria_block_size
Added support for reading big blocks: when reading an index or data page that
lies inside an S3 block, the whole S3 block is read and all its pages are
installed in the dedicated S3 page cache.
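A rough sketch of the arithmetic behind these big-block reads (illustrative only;
the real logic is in ma_pagecache.c, and the default 8K aria_block_size, zero-based
page numbers and one-based S3 block numbers are assumptions here):

#include <stdio.h>

#define ARIA_BLOCK_SIZE    (8 * 1024)         /* assumed aria_block_size       */
#define S3_BLOCK_SIZE      (4 * 1024 * 1024)  /* default S3_BLOCK_SIZE (4M)    */
#define PAGES_PER_S3_BLOCK (S3_BLOCK_SIZE / ARIA_BLOCK_SIZE)

int main(void)
{
  unsigned long long page_no= 700;            /* page the server wants         */

  /* One S3 GET fetches the whole surrounding S3 block ...                     */
  unsigned long long s3_block_no= page_no / PAGES_PER_S3_BLOCK + 1;
  /* ... and every page in it is put into the dedicated S3 page cache, so reads
     of the neighbouring pages become cache hits instead of new S3 requests.   */
  unsigned long long first_page= (s3_block_no - 1) * PAGES_PER_S3_BLOCK;
  unsigned long long last_page= first_page + PAGES_PER_S3_BLOCK - 1;

  printf("page %llu -> S3 object .../data/%06llu, caching pages %llu-%llu\n",
         page_no, s3_block_no, first_page, last_page);
  return 0;
}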
Limitations with S3 tables
They are read only ;)
If an S3 table is dropped on one MariaDB server, other MariaDB servers may get
read errors if they access the table while it is being dropped.
ALTER TABLE of an existing S3 table may cause problems on other MariaDB
servers, as they may read a mix of old and new data.
This can be avoided by always renaming the S3 table when doing ALTER
Future work
Store AWS keys and region in the mysql.servers table (as Spider and FederatedX do).
This will allow one to have different tables on different S3 servers.
Allow batch updates (single user) to S3 tables, for adding new archive data to an
S3 table without having to convert the table back to local storage, update it and
copy it back to S3.
Add a shared cache for S3 connections instead of having one per open table.
This will use less memory and speed up opening new S3 tables.
That’s all folks
QUESTIONS ?
The S3 engine is documented in detail at:
https://mariadb.com/kb/en/s3-storage-engine/
Thank you