Introduction
At the very end of August, there was a live service issue that affected players for several days. This resulted in inconsistent login access and apparent loss of progression for some players, and an inability to complete Daily Quests.
This is not the kind of experience we want to provide to you. Unavoidable errors may be a reality in any type of software development, but if they do happen, we want to make sure that we have processes for robust development, quality assurance, and update releases that will keep the impact of any of those errors low.
In the weeks after these issues were resolved, the team came together to understand what caused them and formalize the steps we could take to prevent these issues from happening again. This article is a summary of those internal discussions.
Incident Summary
To briefly summarize the live service issues:
On August 26th, shortly after midnight PDT (UTC-7), the Community Team alerted other studio teams that players were reporting being unable to log in, and that Android players seemed to be loading into the game as brand new players, with no visible account progress. By 12:10 am, our Backend Team confirmed this unusual behavior in our cloud services.
We adjusted our database configurations to address the login failures and the appearance of progression issues. Within an hour, we saw reports that players could log in, but unfortunately, many players still reported accounts that seemed to have no progress. On top of that, we began seeing reports that Daily Quests could not be completed.
This meant there were three main issues to tackle:
- Inconsistent login
- The appearance of loss of progression
- Inability to complete Daily Quests
Investigation and Resolution
Login and Progression Issues
Through a number of monitoring tools, the Cloud Platforms Team discovered that two services essential for the login process were timing out. In other words, one service would request data from another, and if that second service did not respond within a certain number of seconds, the first service would close the connection and report that the second service took too long to respond.
The result: Some players were unable to log in to Sky, and others who could log in did not receive their account progression data.
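The timeout behavior described above can be sketched roughly as follows. This is only an illustration, not the actual backend code: the function names (`call_with_timeout`, `login`, `slow_progress_service`), the shape of the progress data, and the timeout values are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class UpstreamTimeout(Exception):
    """Raised when the second service does not answer in time."""


def call_with_timeout(request_fn, timeout_s):
    """Run request_fn in a worker thread and give up after timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(request_fn)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            raise UpstreamTimeout(f"no response within {timeout_s}s") from None
    finally:
        # Do not block on a slow worker; let it finish in the background.
        pool.shutdown(wait=False)


def fast_progress_service():
    # Hypothetical healthy service: answers immediately.
    return {"level": 12}


def slow_progress_service():
    # Hypothetical overloaded service: answers after the caller has given up.
    time.sleep(0.6)
    return {"level": 12}


def login(progress_service, timeout_s=0.2):
    """Hypothetical login flow: a timeout leaves the session without progress data."""
    try:
        progress = call_with_timeout(progress_service, timeout_s)
    except UpstreamTimeout:
        progress = None  # the account exists, but its data never arrived
    return {"logged_in": True, "progress": progress}
```

In this sketch, `login(slow_progress_service)` still succeeds but returns no progress data, which is one way a timeout between services could make an existing account look brand new without any data actually being lost.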
Any potential solution also had to be compatible with other backend services so that they wouldn't be negatively affected, and testing had to be done quickly and carefully so that no further impact would be felt in the game.
Our Engineering Team worked on solutions that accounted for these concerns, and together with the Cloud Platforms Team they developed, tested, and reviewed fixes for both the progress data issue and the issues with one of our databases.
Daily Quest Completion
As we investigated reports about Daily Quests, we realized that the completion errors were caused by the wrong Quest system being used.
As you might recall, we’re working on a big revamp to quests in Sky, and this requires developing a new Quest system. We’ve been testing it internally, and we also use temporary configurations that safely enable the new system in the live game exclusively for our QA team so that they can perform vital tests. When testing is completed, those configurations are removed and the new Quest system is completely disabled.
However, as we investigated, we discovered that the new Quest system had mistakenly been deployed to the live environment of the game with the 0.26.5 update. As a result, both the new and legacy Quest systems were enabled, but the legacy system could not handle the new Daily Quests that the update introduced. Any time a player tried to complete a new quest that was designated for the new system, the legacy system would display an error.
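As a rough illustration of the failure described above, the sketch below models a legacy handler receiving a quest that was designated for the new system, alongside a QA-only gate of the kind the temporary configurations are meant to provide. Everything here is hypothetical: the `Quest` type, the `system` tag, the allowlist, and the function names are assumptions for the example and do not reflect the game's real code or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quest:
    quest_id: str
    system: str  # hypothetical tag: "legacy" or "new"


class QuestError(Exception):
    """Raised when a quest cannot be completed."""


def complete_with_legacy(quest):
    # The legacy system only understands quests that were built for it.
    if quest.system != "legacy":
        raise QuestError(f"legacy system cannot complete {quest.quest_id}")
    return f"{quest.quest_id} completed by legacy system"


def complete_with_new(quest):
    return f"{quest.quest_id} completed by new system"


def new_system_enabled(player_id, qa_allowlist, released=False):
    """The new system is on for allowlisted QA accounts only, unless released."""
    return released or player_id in qa_allowlist


def complete_quest(quest, player_id, qa_allowlist, released=False):
    """Route each quest to the system it was designated for."""
    if quest.system == "new":
        if not new_system_enabled(player_id, qa_allowlist, released):
            raise QuestError(f"{quest.quest_id} is not available yet")
        return complete_with_new(quest)
    return complete_with_legacy(quest)
```

Calling `complete_with_legacy(Quest("daily_new_01", "new"))` raises `QuestError`, which mirrors the error players saw. Routing on the quest's designated system, and keeping quests for the new system behind the QA gate until release, is one way such a mismatch could be avoided.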
Due to specific ways that our platforms and services interact with the devices that Sky is played on, we ultimately determined that disabling the new Quest system would lead to more problems. So, we tested and pushed out fixes to the new Quest system which would resolve the completion problems, and then disabled the legacy Quest system.
Fixes for all of these issues were confirmed at 00:15 on August 29, 2024, 72 hours after the first report of issues from players.
In all, the login issue lasted for about one hour, the account progress timeouts for 37 hours, and the Daily Quest completion errors for 72 hours.
Moving Forward
To prevent similar issues in the future, we’ve taken steps that include:
- Updates to our live release readiness checklist: The new version of this checklist better accounts for the needs of a live service game like Sky, and it’s been updated to better emphasize clarity and additional checkpoints for new features.
- New feature sign-off process: Development for new features will include more robust rollback plans, which can be followed in any case where live services are being impacted. Additional validation steps across teams will ensure both feature readiness and stronger reviews for quality standards.
It is very easy to celebrate our successes; acknowledging mistakes and lessons learned is much more difficult, but still important. Our goal with this post is to address the concerns you may have had as a result of this earlier incident, and we appreciate your patience and your feedback as we worked through these issues.
We hope that sharing these insights makes it easier to see our commitment to making your time in Sky more positive, and to taking action to improve the quality of the player experience we provide.
As always, we look forward to hearing your feedback, and encourage you to join us in that discussion on our official Discord server at discord.gg/thatskygame!