By Franck Pachot
This year is not very nice for conferences as everything is virtual and we miss the most important: meeting and sharing with people. But the AWS re:Invent is actually a great experience. As an AWS Data Heros, I received an Oculus Quest 2 to teleport to the virtual Neon City where we can meet and have fun in Virtual Reality (but incredibly real-life chatting):
— Filipe Barretto (@filipe_barretto) November 30, 2020
There are 3 important new launches announced around databases: Babelfish for Aurora, Aurora Serverless v2 and AWS Glue Elastic Views but let’s start by a recap of the pre-reInvent new features from this year.
We have more regions, even one planned in Switzerland. And also more cloud at customer solutions, like RDS in Outposts in addition to RDS on VMware. We had new versions, PostgreSQL 12, MariaDB 10.5, SQL Server 2019. SQL Server even came with SSRS). And also recent Release Updates for Oracle (July 2020).
About new features from 2020, we can export RDS snapshots to S3 parquet format. We can share AD with RDS from multiple VPC, we have connection pooling in RDS Proxy (session state aware). SQL Server supports parallel backups. Oracle supports backup to other regions. RDS can use always-on for SQL Server read replicas. And Oracle does not need Active Data Guard option when the replica is not there for read workloads. And talking about licenses, there’s the License Manager for Oracle to help manage them. There’s also the new Graviton2 processors for RDS PostgreSQL and MySQL.
AWS Glue Elastic Views
I mentioned that we can query the NoSQL DynamoDB tables with a SQL-like API, PartiQL. Now those PartiQL queries can do more: continuous query to propagate data and changes, like materialized views. This event sourcing is based on CDC (not Stream). It propagates changes in near real-time (asynchronous, can be throttled by the target capacity) and to multiple destinations: Elasticsearch to search, S3 for data lake, Redshift for analytics. A nice serverless solution for CQRS: DynamoDB for ingest and OLTP and propagation to purpose-build services for the queries that cannot be done in the NoSQL operational database. This is serverless: billed per second of compute, and volume of storage.
Currently, those materialized views support only selection and projection, but hopefully, in the future, they will be able to maintain aggregations with GROUP BY. As I’m not a fan of writing procedural code to process data, I really like materialized views for replication, rather than triggers and lambdas.
Aurora Serverless v2
You don’t want to pre-plan the capacity but have your database server scale up, out, and down according to the load? That’s serverless. You don’t provision servers, but capacity units: Aurora Capacity Units (ACU). Rather than multiplying the capacity when needed, by changing the instance size, the new Aurora Serverless v2 elasticity has a granularity of 0.5 ACU: you start by provisioning 0.5 ACU (not zero because you don’t want to wait seconds on first start after being idle). When compared with v1 (which is still available) the starting capacity is lower, the increment is finer, and the scale-down is in minute rather than a 15 minutes cool down. And it has all Aurora features: Multi-AZ, Global Database, Proxy,… Basically, this relies on the ability to add vCPU and memory online, and reduce it (this includes shrinking the buffer pool according to LRU). This means scale up and down as long as it is possible (depends on the neighbors activity in the same VM). It can scale out as well if in the compute fleet and move to another VM if needed, but the goal is to be able to scale-up in-place most of the time.
Releasing idle CPU is easy, but knowing how much RAM can be released without significantly increase I/O and response time, is probably more challenging. Anyway, we can expect min/max controls on it. The goal is not to replace the capacity planning, but to be more elastic with unplanned workloads.
You have the choice to migrate to v2, but look at the price. The ACU is more expensive, but given the elasticity, you probably save a lot (start lower, increase by smaller steps, decrease sooner).
This is the most revolutionary in my opinion. We want polyglot databases not only to have the coice of language or API for new developments. Many databases run applications, like ERP, which are tied to a specific commercial database. And companies want to get out of this vendor lock-in but migration of those applications is impossible. They use specific behaviour, or code, in the database, and they do it for a reason: the agility and performance of processing data within the database. The business logic is tied to data for consistency and performance, in stored procedures. There are many attempts to translate the code, but this works partially. And that’s not sufficient for enterprise software: rewriting is easy but testing… who will sign the UAT validation that the business code, working for years in a database engine, has been rewritten to show the same behaviour?
This is different when there is no application change at all, and that’s the idea of Babelfish, starting with SQL Server compatibility in Aurora. Given the powerful extensibility of PostgreSQL, AWS has built some extensions to understand T-SQL, and specific SQL Server datatype behaviour. They also add endpoints that understand the MS SQL network protocol. And then can run the applications running on SQL Server, without any change besides the connection to the new endpoint. Of course, this is not easy. Each application may have specificities and need to implement new extensions. And for this reason, AWS decided to Open Source this compatibility layer. Who will contribute? Look at the ISV who has an ERP running on SQL Server. They can invest in developing the compatibility with Babelfish, and then can propose to their customer to move out of the commercial database, to PostgreSQL. Of course, the goal of AWS is to get them to Aurora, providing the high availability and scalability that big companies may require. But Babelfish target is PostgreSQL, the community one.
About the target, Aurora comes with two flavors, using the upper layer from MySQL or PostgreSQL. PostgreSQL was chosen as it is probably the most compatible with commercial databases, and provides easy extensibility in procedural language, datatypes and extensions. About the source, it is SQL Server for the moment (a commercial reply to the licensing policy they have set for their cloud competitors) but I’m sure Oracle will come one day. Probably not 100% compatible, given the complexity of it, but the goal of an ISV is to provide 100% compatibility for one application. And, once compatibility is there, the database is also accessible with the native PostgreSQL API for further developments.
I’m looking forward to seeing how this Open Source project will get contributions. Aurora has a bad reputation in the PostgreSQL community, taking the community code, making money with it, and not giving back their optimizations. But this Babelfish can really extend the popularity of this reliable open-source database. Contributions are not only extensions for code-compatibility. I can expect lot of contributions about test cases and documentation.
I’ve seen a demo about T-SQL and Money datatype. This is nice, but a single-user test case. I’ll test concurrency as soon as I have the preview. Isolation of transactions, read and write consistency in multi-user workloads are very different in PostgreSQL and SQL Server. And test case for compatibility acceptance is not easy.
You can expect more technical insights on this blog, as soon as I have access to the preview. For the moment, let me share some pictures about the Oculus Quest 2 I got from the AWS Heroes program, and the Neon City place where we meet:
I forgot to mention the io2 Block Express which will be very interesting for database bandwidth with 4GB/s (and 256K IOPS if you really need this):
AWS io2 Block Express is multi-attach. Like a SAN in the Cloud. Will be tempting to try Oracle RAC😎
Oh… did I forget that production use of Oracle RAC on
"Third-Party Clouds" is not permitted by this list: https://t.co/KYrofX1Z1E 🤷♂️ https://t.co/Hb6O24wXd4
— Franck Pachot 💙💛 🇺🇦 (@FranckPachot) December 1, 2020
and the EC2 R5b instance: