
SODA Advent Calendar 2024

Day 18

MySQL 8 (Aurora MySQL) Upgrade, Lessons Learned


Hi, I'm Duc, a member of SODA's SRE team.
Our team focuses on maintaining the availability and performance of the snkrdunk.com website and the SNKRDUNK application.
Here is the story of how we upgraded our central database from MySQL 5.7 to MySQL 8.0 (Aurora MySQL v2 to v3).

Our setup

AWS ended standard support for Aurora MySQL 5.7 on October 31, 2024, and one month later the Extended Support charges kick in. We use Aurora MySQL for our primary databases, so the upgrade was inevitable.

Our central cluster has its challenges:

  • Contains approximately 1TB of data
  • The biggest table is > 500GB
  • Is the center of our system; many things depend on it (the main site, parts of smaller systems, BI tools, etc.)
  • Its workloads are very read-heavy
  • Workload is very high during the day, especially when notifications are pushed to our users (4-5 times a day), and low at night

Our upgrade process

Take a quick look at the database

Using a cloned cluster, we performed a quick-and-dirty upgrade to see whether it would fail. Luckily, the upgrade succeeded. However, what we really wanted to check was the pre-upgrade check result.
You can also use MySQL's official client for the upgrade checks, but Aurora MySQL is AWS's modified version of MySQL, so we think checking against an actual cloned cluster is the most reliable way.
In our case, we were primarily concerned about these two warnings.
a. Usage of database objects with names conflicting with new reserved keywords
b. Usage of the utf8mb3 charset
The first problem can be easily fixed by adding backticks (backquotes) around the conflicting identifiers. The second one is a little bit trickier; I will get to it later.
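For example, if a column happens to share its name with a word that became reserved in MySQL 8.0, such as rank, quoting it with backticks is enough (the table and column names below are hypothetical):

-- `rank` became a reserved word in MySQL 8.0;
-- quoting the identifier with backticks keeps the query valid.
select id, `rank`
from leaderboard
where `rank` <= 10;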

Decide the upgrade strategy

As everybody probably already knows, binlog replication is the standard way to reduce downtime here:
- Create a secondary cluster and replicate all data from the origin with binlog replication
- Upgrade the secondary cluster, and then, when the time comes, do the switchover.
Simple, isn't it? Moreover, RDS supports Blue/Green Deployments, which do all the heavy lifting of the switchover procedure.
Unfortunately, we could not use that, as some technical debt had to be paid first.
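So we set up the binlog replication ourselves. Roughly, on the newly created secondary cluster it looks like the following. This is only a sketch using the RDS stored procedures; the host, credentials, and binlog coordinates (taken from the snapshot or clone the secondary was restored from) are placeholders:

-- On the secondary cluster, point replication at the current production writer.
CALL mysql.rds_set_external_master(
  'old-cluster.cluster-xxxxxxxx.ap-northeast-1.rds.amazonaws.com',  -- source (MySQL 5.7) writer
  3306,                          -- port
  'repl_user',                   -- replication user
  'repl_password',               -- replication password
  'mysql-bin-changelog.000123',  -- binlog file at the time of the snapshot
  456,                           -- binlog position
  0                              -- ssl_encryption (0 = off, 1 = on)
);
CALL mysql.rds_start_replication;

-- Later, at switchover time, once the secondary has caught up:
CALL mysql.rds_stop_replication;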

At that time, we had two switchover strategies in mind.

  1. The no-brainer way
    - Put all related systems into maintenance mode
    After confirming that no insert or update is happening, stop the replication and switch all the database's endpoints to the new cluster.
    - In our case, it should take about two hours to complete. And that means two hours of downtime
  2. The fancy way
    - Fix the application so it works with both versions of MySQL. After upgrading the secondary cluster to MySQL 8, slowly direct your application to the new cluster's reader.
    You must use an internal domain for the writer and reader endpoints. Then, you can use weighted routing to gradually shift SELECT queries from the old cluster to the new one.
    Something like this.
    [Figure: weighted routing between the old and new clusters (DB-weighted.drawio.png)]
    - After all your SELECT queries are using the new reader endpoint, stop the replication and then finally switchover the writer endpoint.
    - The switchover is pretty quick, so downtime should be minimal (a couple of seconds)
    This solution is very tempting because of the minimal downtime, but after serious consideration, the risk to data integrity is real, and a two-hour downtime is acceptable from both the business standpoint and our SLO policy. We decided to go with the traditional solution.

Rollback strategy

What if, after upgrading to the new version of MySQL, you find a critical bug that affects your business logic, or the database performance does not meet your expectations? (Yes, MySQL 8.0 is not always better than 5.7. See: https://www.percona.com/blog/mysql-8-0-vs-5-7-are-the-newer-versions-more-problematic/) How do you roll back to the previous version?
Once you take down the maintenance page and let users use your application, the newly added data exists only in the MySQL 8.0 cluster. You cannot reverse your whole database back to the 5.7 version. (In theory, you can dump and restore via a SQL file, but with a big database that will take days, and so will your application's downtime.)
We need an insurance policy to revert our data as quickly and safely as possible.
Here is the architecture that we finally came up with.
[Figure: rollback strategy architecture (ロールバック戦略.drawio.png)]

After the switchover, if you need to roll back with the newest data, you can promote the rollback database to primary. MySQL does not officially support reverse binlog replication from 8.0 to 5.7, but it is not impossible. You need to ensure everything in your database works with both versions.
That is what brings us to the next step.
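As a sketch, once the 5.7 rollback cluster is replicating from the 8.0 writer, the health of that reverse replication can be watched with the usual replica status output (MySQL 5.7 still uses the old terminology):

-- On the MySQL 5.7 rollback cluster, replicating from the new 8.0 writer.
SHOW SLAVE STATUS\G
-- Fields worth watching:
--   Slave_IO_Running / Slave_SQL_Running  -> both should be "Yes"
--   Seconds_Behind_Master                 -> replication lag
--   Last_SQL_Error                        -> any statement the 5.7 cluster failed to apply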

Change all tables' character sets to utf8mb4

From MySQL 8.0, the default character set is utf8mb4 with the default collation utf8mb4_0900_ai_ci instead of utf8 and utf8_general_ci. The utf8 character set in MySQL 8 is aliased to utf8mb3.
https://dev.mysql.com/doc/refman/8.4/en/charset.html
Theoretically, binlog replication should work fine between the two MySQL versions despite the character set being aliased. But since MySQL 5.7 already supports the utf8mb4 character set, and we already had problems because utf8 cannot handle some special characters, we decided to eliminate the old utf8 character set to reduce the risk of incompatibility problems with backward replication.

To change a whole table's character set and collation:

ALTER TABLE table_0 CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

To change the character set of specific columns in a table:

ALTER TABLE table_1 
    MODIFY column_0 varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    MODIFY column_1 varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
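Before running these ALTERs, it helps to know exactly which tables and columns are still on the legacy character set. A query along these lines against information_schema should list them (a sketch; adjust the excluded schemas to your environment):

-- Columns that still use the legacy utf8 (a.k.a. utf8mb3) character set
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
  AND TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');

-- Tables whose default character set is still utf8/utf8mb3
SELECT T.TABLE_SCHEMA, T.TABLE_NAME, CCSA.CHARACTER_SET_NAME
FROM information_schema.TABLES T
JOIN information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA
  ON T.TABLE_COLLATION = CCSA.COLLATION_NAME
WHERE CCSA.CHARACTER_SET_NAME IN ('utf8', 'utf8mb3')
  AND T.TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');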

Keep in mind that if the changed columns are part of any index, MySQL will rebuild that index. The index rebuild can be IO-intensive on big tables, so we recommend testing it first in a separate environment and utilizing Blue/Green Deployments if needed.

After changing all columns to utf8mb4, we realized some queries had become significantly slower. It turns out that with the new character set and collation, some queries no longer used the same index as before. So we had to use the FORCE INDEX hint to fix the problem.
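A minimal sketch of that kind of fix, with a hypothetical index name (idx_users_status) on the users table used in the later examples:

-- Hypothetical example: after the charset conversion the optimizer stopped
-- picking idx_users_status, so we pin it explicitly.
select id
from users force index (idx_users_status)
where status = 'active'
order by id;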

Code changes

After enabling MySQL 8 in our CI workflow, we got 82 failed tests.
That was a considerable number, but we managed to fix them all in two weeks.

Preparing CI (test environment)

The MySQL upgrade project needed to run without stopping other feature development.
So, we needed to ensure that new functions were compatible with MySQL 8 during the development phase while we fixed the incompatible ones.
First, we enabled MySQL 8 testing for every pull request in all our tests. But after a few days, we realized the GitHub Actions bill was through the roof. Yeah, of course. The number of tests that needed to run had doubled, and so had our bill.
If we didn't fix this, the CI bill would soon be double what we pay for computing resources in our production environment :)
In the end, one of my colleagues came up with a brilliant idea.

  • Remove the requirement of running MySQL 8 CI in every test job
  • The MySQL 8 CI workflow will only run if a specified label is attached to the PR
    if: github.event_name == 'pull_request' && contains(github.event.pull_request.labels.*.name, 'MySQL 8 Test')
    
  • To check whether newly developed functions are compatible with MySQL 8, we created a feature branch from the main branch, attached the label above so it would be tested against the MySQL 8 database, and then merged the latest main branch into it once a day
    To achieve that, you can create a scheduled workflow on GitHub Actions, or a couple of commands in a cron job running on a PC will do the job.

With the solution above, the GitHub Actions cost was significantly reduced and everyone is happy!

Fixing the code

Our failed tests can be categorized into 3 sections:

  1. MySQL 8 reserved word violation
    Just add backquotes (backticks) around the offending identifiers, and they are good to go.

  2. Query syntax error
    We got some tests that reported an Incorrect arguments to mysqld_stmt_execute error.
    Digging deeper, we found that in some queries the LIMIT value was built from the pagination parameters with the code below

    limit := math.Abs(float64(perPage * (page - 1))),
    

    Despite being a float, it somehow worked with MySQL 5.7.
    With the right data type (int), those queries work again.

    limit := int(math.Abs(float64(perPage * (page - 1)))),
    
  3. Queries return results in a different order than MySQL 5.7
    a. Queries without order by
    The tests' expected results fluctuated when switching between MySQL versions

    select id from users where status = "active";
    MySQL 5.7 
    +----+
    | id |
    +----+
    | 10 |
    | 11 |
    | 12 |
    | 13 |
    | 14 |
    +----+
    MySQL 8.0
    +----+
    | id |
    +----+
    | 11 |
    | 12 |
    | 10 |
    | 14 |
    | 13 |
    +----+
    

So just add a simple ORDER BY id, or if the result order is not important, you can ignore the order in the test code (go-cmp's cmpopts can help here).
See: https://pkg.go.dev/github.com/google/go-cmp/cmp/cmpopts

b. Queries with order by
This was mostly because our test data was poorly defined.
With ORDER BY, if two rows have the same value in the ordering column, the output from MySQL 5.7 and 8.0 might not be the same.

select id, lastLogin from users where status = "active" order by lastLogin;
MySQL 5.7
+----+---------------------+
| id | lastLogin           |
+----+---------------------+
| 10 | 2018-07-11 06:00:00 |
| 11 | 2018-07-11 06:00:00 |
| 12 | 2018-07-11 06:01:00 |
| 13 | 2018-07-11 06:02:00 |
| 14 | 2018-07-11 06:03:00 |
+----+---------------------+
MySQL 8.0
+----+---------------------+
| id | lastLogin           |
+----+---------------------+
| 11 | 2018-07-11 06:00:00 |
| 10 | 2018-07-11 06:00:00 |
| 12 | 2018-07-11 06:01:00 |
| 13 | 2018-07-11 06:02:00 |
| 14 | 2018-07-11 06:03:00 |
+----+---------------------+

We fixed the fixture data, and the problem went away.

Load testing

As I mentioned earlier, MySQL 8 does not perform 100% better than MySQL 5.7. We needed to do load testing to make sure the application would perform at least as well as before after the upgrade.
We could have performed load testing in a closed environment, but after fixing the code to be compatible with both MySQL versions and a careful QA process, we felt it was safe to go with an A/B testing approach.
We created a "Green" environment with backend servers and a database cluster (continuously replicating data from the old 5.7 cluster), with the only difference being the reader database's endpoint. The green environment's server will execute all SELECT queries on the new MySQL 8.0 cluster's readers, while all the DML queries will still be executed on the old MySQL 5.7 writer.
(As I described above, our workload is very read-heavy, so testing the replicas for SELECT queries should be enough)
Because it was not the real release, we planned for a maximum of 25% of traffic to use the new cluster. So we created the green environment with 25% of the capacity (number of servers, number of Aurora replicas) of the currently running environment.

[Figure: load-testing setup (load-testing.drawio.png)]

On the load testing day, we gradually increased traffic to the green environment step by step (1% -> 5% -> 10% -> 15% -> 20% -> 25%), with careful monitoring on the Datadog dashboard. If the error rate, compared to the current environment, was more than 10% higher, we would stop traffic to the new environment immediately. We also stopped all production releases for 24 hours to get the most accurate data.
Using Datadog APM metrics, we could see every API's performance compared to the current environment.

[Screenshots: Datadog APM latency comparison between the current and green environments]

Some APIs performed better, some a little worse, but luckily for us, after 24 hours of load testing no API performed badly enough in latency or error rate to need fixing before the release.

It's show time

Since we decided to go with the traditional solution, our upgrade process was pretty straightforward.

  1. Stop all traffic to our system using a maintenance page that shows a 503 error from WAF
  2. Stop all batch jobs and wait for all other asynchronous jobs to finish
  3. After confirming there were no DML queries on the writer (a check like the one sketched after this list), stop the binlog replication and perform the switchover
  4. Restart all batch jobs
  5. Internal confirmation
  6. Take down the maintenance page and allow users to access the application as usual
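For reference, the "no DML left" confirmation in step 3 was roughly this kind of check on the old writer (the user and command filters below are placeholders, adjust them to your setup):

-- 1. No application connections still executing statements
select id, user, host, db, command, time, info
from information_schema.processlist
where command not in ('Sleep', 'Binlog Dump')
  and user not in ('rdsadmin', 'event_scheduler');

-- 2. The binlog position is no longer advancing
--    (run twice, a few seconds apart, and compare File/Position)
SHOW MASTER STATUS;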

Lessons learned along the way (TL;DR)

  • A major MySQL upgrade is a very time-consuming task, so plan for about three months before it can be released
  • MySQL 8.0 is not a magic wand that "just works" out of the box. You should test your application carefully for both functionality and performance
  • utf8mb4 can perform poorly in MySQL 5.7 because it can mislead the query optimizer into using the wrong index or no index at all
  • Reverse binlog replication (from MySQL 8.0 to 5.7) is not impossible and can be used as an insurance policy in case you need it
  • Be careful when enabling MySQL tests in your CI. You might be surprised when you realize how much they cost.