Good day, everyone.
We’ve had numerous server problems since the launch of Diablo II: Resurrected, and we want to be transparent about what’s causing them and what we’ve done so far to fix them. We’d also like to give you some insight into our plans for the future.
tl;dr: Our server outages aren’t the result of a single problem; we’re addressing each issue as it arises, with both short-term fixes and longer-term architecture changes. A small number of players have lost character progression; going forward, any progression loss caused by a server crash should be limited to a few minutes. This is not a complete solution in our eyes, and we are still working on it. Our team, with help from others across Blizzard, is working hard to make the experience better for everyone.
We’re going to get into some engineering details here, but we hope this helps you understand why these outages have been happening, what we’ve done to resolve each occurrence, and how we’re investigating the broader root cause. Let’s start at the beginning.
The issue(s) with the servers are as follows:
Before we get into the details of the issues, let’s look at how our server databases work. First is our global database, which serves as the single source of truth for all of your character data and progress. As you can imagine, that’s a huge job for a single database, and it couldn’t handle it on its own. To reduce load and latency on the global database, each region (North America, Europe, and Asia) has its own database that stores your character’s information and progress, and your region’s database periodically writes to the global database. Most of your in-game actions are performed against this regional database because it’s faster, and your character is “locked” there to preserve the integrity of your individual character record. The global database also has a backup in case the primary fails.
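To make that layout a little more concrete, here is a heavily simplified sketch in Go. The type and function names (GlobalRecord, LockToRegion) are purely illustrative and not our actual code; the point is just that the global record tracks which regional database currently “owns” a character, so routine gameplay only needs the faster regional database.

```go
// A minimal sketch, not production code: names are purely illustrative.
package main

import (
	"errors"
	"fmt"
)

// GlobalRecord stands in for a row in the global (source-of-truth) database:
// the authoritative character data plus which region currently "owns" it.
type GlobalRecord struct {
	CharacterID string
	Data        []byte // serialized character state
	LockedBy    string // "NA", "EU", "Asia"; empty means unlocked
}

// LockToRegion marks a character as owned by one regional database so that
// routine in-game saves can be handled there instead of by the global DB.
func LockToRegion(rec *GlobalRecord, region string) error {
	if rec.LockedBy != "" && rec.LockedBy != region {
		return errors.New("character is locked to another region")
	}
	rec.LockedBy = region
	return nil
}

func main() {
	rec := &GlobalRecord{CharacterID: "sorc-01"}
	if err := LockToRegion(rec, "NA"); err != nil {
		fmt.Println("lock failed:", err)
		return
	}
	fmt.Println("character locked to:", rec.LockedBy)
}
```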
With that in mind, we’ll concentrate on the downtimes that occurred between Saturday, October 9 and today to explain what’s been going on.
On Saturday morning Pacific time, a sudden, significant surge in traffic caused a global outage. This was a level of traffic our servers had not seen before, even at launch. It was made worse by an update we had rolled out the previous day to improve game creation performance; the combination of these two factors overloaded our global database, causing it to time out. We decided to roll back that Friday update in the hope of reducing the strain on the servers going into Sunday, while also giving ourselves room to dig further into the underlying problem.
On Sunday, though, it became clear that what we’d done on Saturday wasn’t enough: traffic climbed even higher, leading to another outage. Our game servers noticed the database disconnect and tried to reconnect repeatedly, which meant the database never had time to catch up on its backlog because it was too busy handling a constant stream of connection attempts from game servers. During this time we also realized we could improve the configuration of our database event logging, which is needed to restore a healthy state after a database failure, so we did that work alongside further root-cause analysis.
The double-edged sword of Sunday’s outage was that, thanks to Saturday’s experience, we had essentially built a playbook for recovering from it quickly. That was the good part.
The bad part was that, because we came back online so quickly during a peak period of player activity, with hundreds of thousands of games being created within minutes, we crashed again.
As a result, we had many changes to deploy, including code and configuration improvements, which we applied to the backup global database. That brings us to Monday, October 11, when we switched over to that backup. The switchover itself caused another outage, because the backup database was still running its backup process when it should have been serving requests. We found more issues and made more improvements during this time: we identified a since-deprecated but expensive query we could remove from the database entirely, we optimized the eligibility checks players go through when joining a game, further reducing load, and we are currently testing additional performance improvements. We also believe we’ve resolved the database-reconnect storms, since we didn’t see any on Tuesday.
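For readers curious what taming a “reconnect storm” involves: the standard way to keep many servers from hammering a recovering database is exponential backoff with jitter. The sketch below is a generic illustration of that idea, not necessarily the exact change we shipped; connect, the delays, and the attempt cap are all made up.

```go
// A generic sketch of exponential backoff with jitter; not our exact fix.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// connect is a placeholder for a real database connection attempt.
func connect() error {
	return errors.New("database still recovering") // always fails in this demo
}

// connectWithBackoff retries with an exponentially growing, jittered delay so
// thousands of game servers don't all retry at the same instant.
func connectWithBackoff(maxAttempts int) error {
	delay := 500 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := connect()
		if err == nil {
			return nil
		}
		// Add up to 50% random jitter so retries from many servers spread out.
		jitter := time.Duration(rand.Int63n(int64(delay / 2)))
		fmt.Printf("attempt %d failed (%v); retrying in %v\n", attempt, err, delay+jitter)
		time.Sleep(delay + jitter)
		if delay < 30*time.Second {
			delay *= 2
		}
	}
	return errors.New("gave up after max attempts")
}

func main() {
	_ = connectWithBackoff(3)
}
```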
Then on Tuesday, we set a new concurrent-player record, with tens of thousands of players in a single region alone. This led to another round of database performance degradation, the cause of which our database engineers are still investigating. We’ve also enlisted additional Blizzard engineers to work on smaller fixes while our own team concentrates on the core server problems, and we’re working with our third-party partners as well.
What is causing this:
We kept a lot of legacy code to remain faithful to the original game. One legacy service in particular is struggling to keep up with modern player behavior.
This service handles critical functions: game creation and joining, updating, reading, and filtering game lists, verifying game server health, and reading characters from the database to make sure your character can participate in whatever you’re filtering for. It is a singleton, which means we can only run one instance of it at a time, in order to guarantee that all players see the most up-to-date and accurate game list possible. We have improved this service in a number of ways to make it more modern, but as mentioned, a lot of our problems stem from game creation.
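As a rough illustration of why that matters, here is an outline, with our own hypothetical names rather than the real service’s API, of how many unrelated hot paths end up funneling through that one instance:

```go
// Illustrative outline only; the interface and method names are ours.
package main

import "fmt"

// GameCoordinator gathers the critical functions described above into one
// service; because only one instance can run at a time, every call below
// competes for the same process.
type GameCoordinator interface {
	CreateGame(name, hostCharacterID string) (gameID string, err error)
	JoinGame(gameID, characterID string) error
	ListGames(filter string) ([]string, error)
	CheckGameServerHealth(serverID string) error
	LoadCharacter(characterID string) ([]byte, error) // read from the database
}

func main() {
	fmt.Println("every call above funnels through a single coordinator instance")
}
```

When game creation spikes, every other responsibility of that singleton has to wait behind it.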
We bring up “modern player behavior” because it’s an interesting point to think about. In 2001, there wasn’t nearly as much content on the internet about how to play Diablo II “correctly” (Baal runs for XP, Pindleskin/Ancient Sewers/etc. for magic find, and so on). Today, a new player can find a wealth of great content creators who can teach them how to play the game in many different ways, many of which involve a lot of database load in the form of rapidly creating, loading, and destroying games. We anticipated this, with players making fresh characters on fresh servers and working hard for their magic-find gear, but we vastly underestimated the scope we derived from beta testing.
On top of that, we were saving to the global database far too often: there was no need to do it as frequently as we were. We should save you to the regional database first, and only write to the global database when we need to unlock you; this is one of the mitigations we’ve already put in place. Right now, we’re writing code to change how this works entirely, so we will almost never save to the global database, which will cut the load on that server considerably. This is an architectural redesign, though, and it will take some time to build, test, and deploy.
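Here is a minimal sketch of that mitigation, again with hypothetical names (Store, CharacterService) rather than our real code: routine saves go to the regional database, and the global database is only written when a character is unlocked.

```go
// A minimal sketch of the save-path mitigation; names are hypothetical.
package main

import "fmt"

// Store is the minimal write interface both databases expose in this sketch.
type Store interface {
	Save(characterID string, data []byte) error
}

type printStore struct{ name string }

func (p printStore) Save(id string, data []byte) error {
	fmt.Printf("%s: saved %s (%d bytes)\n", p.name, id, len(data))
	return nil
}

// CharacterService routes routine saves to the regional database and only
// writes to the global database when the character is unlocked.
type CharacterService struct {
	Regional Store
	Global   Store
}

// SaveInGame is the hot path: regional only, keeping global load low.
func (s *CharacterService) SaveInGame(id string, data []byte) error {
	return s.Regional.Save(id, data)
}

// Unlock flushes the final regional state up to the global source of truth.
func (s *CharacterService) Unlock(id string, data []byte) error {
	return s.Global.Save(id, data)
}

func main() {
	svc := &CharacterService{
		Regional: printStore{name: "regional(NA)"},
		Global:   printStore{name: "global"},
	}
	state := []byte("serialized character")
	_ = svc.SaveInGame("sorc-01", state) // frequent, cheap
	_ = svc.Unlock("sorc-01", state)     // rare, only when leaving the region
}
```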
A word on progress rollbacks:
When you are assigned to a region, your character is locked there in the global database (for example, when you play in the US region, your character is locked to the US region, and most of your actions are handled by the US region’s database). This locking is at the heart of why some players lost progress.
The problem was that during one server outage, while the database was failing, a number of characters got stuck in the regional database and couldn’t be written back to the global database. At the time, we believed we had two options: unlock everyone who had unsaved changes in the global database, which would cause some progress to be lost to a global-database overwrite, or take the game down for an indeterminate amount of time and run a script to write the regional data back to the global database.
We went with the former, believing it was more important to keep the game up so people could play than to take it down for a long stretch to recover the data. We apologize to any players who lost meaningful progress or valuable items. As players ourselves, we know the sting of a rollback and feel it keenly.
Moving forward, we believe we’ve found a way to restore characters that doesn’t involve substantial data loss: in the event of a server crash, the loss should be limited to a few minutes, if that.
This is a step forward, but it’s still not good enough in our opinion.
What we’re doing to address it:
Rate limiting: We’re limiting the number of database operations around creating and joining games, and we’re aware this is affecting many of you. For example, if you’re doing Pindleskin runs, you’ll be in and out of a game in 20 seconds and creating a new one, and in that scenario you will be rate limited at some point. When this happens, the error message will say there is an issue communicating with game servers; in this case it is not an indicator that game servers are down, it means you have been rate limited to temporarily reduce load on the database and keep the game running. We want to be clear that this is a stopgap; we do not see rate limiting as a long-term solution.
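If you’re wondering what “rate limited” means mechanically, a classic approach is a token bucket: you can burst a few operations, then you’re throttled to a steady refill rate. The sketch below is a generic example with made-up numbers, not our live limiter or its real thresholds.

```go
// A generic token-bucket sketch; parameters are invented for illustration.
package main

import (
	"fmt"
	"time"
)

// TokenBucket allows short bursts but caps the sustained rate of an operation.
type TokenBucket struct {
	tokens     float64
	capacity   float64
	refillRate float64 // tokens added per second
	lastRefill time.Time
}

func NewTokenBucket(capacity, refillRate float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, refillRate: refillRate, lastRefill: time.Now()}
}

// Allow reports whether one more game-create request may proceed right now.
func (b *TokenBucket) Allow() bool {
	now := time.Now()
	b.tokens += now.Sub(b.lastRefill).Seconds() * b.refillRate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.lastRefill = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// Hypothetical numbers: a burst of 3 creates, refilling one every 20 seconds.
	limiter := NewTokenBucket(3, 1.0/20.0)
	for i := 1; i <= 5; i++ {
		if limiter.Allow() {
			fmt.Printf("create #%d: allowed\n", i)
		} else {
			// In the live game this surfaces as the "issue communicating with
			// game servers" message described above.
			fmt.Printf("create #%d: rate limited, try again shortly\n", i)
		}
	}
}
```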
Creating a login queue: This past weekend brought a rush of different issues, no two of which were quite the same. We may keep running into smaller problems as a result of a reinvigorated playerbase, the addition of new platforms, and other scaling concerns, and we need to stop the “herding” (large numbers of players logging in at the same moment) so we can detect and address those problems quickly. To that end, we’ve built a login queue, much like the one you may have seen in World of Warcraft. It will hold the population at a level we know is safe, letting us find where the system strains and fix it before it takes the game down entirely. Each time we fix a strain, we’ll be able to raise the population caps. The login queue is already partially implemented on the backend (right now it surfaces in the client as a failed authentication) and should be fully deployed on PC in the coming days, with consoles to follow.
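As a toy illustration of the idea (not the actual queue service, and with a deliberately tiny cap), a login queue is essentially a population cap plus a first-in, first-out line:

```go
// A toy login-queue sketch; the cap and names are ours, not the real system's.
package main

import "fmt"

// LoginQueue admits players while the online population is under the cap,
// and otherwise hands out a queue position.
type LoginQueue struct {
	online    int
	maxOnline int
	waiting   []string // account IDs, first in, first out
}

// RequestLogin either admits the player or returns their place in line.
func (q *LoginQueue) RequestLogin(account string) (admitted bool, position int) {
	if q.online < q.maxOnline {
		q.online++
		return true, 0
	}
	q.waiting = append(q.waiting, account)
	return false, len(q.waiting)
}

// OnLogout frees a slot and admits the next waiting player, if any.
func (q *LoginQueue) OnLogout() (admitted string, ok bool) {
	q.online--
	if len(q.waiting) == 0 {
		return "", false
	}
	next := q.waiting[0]
	q.waiting = q.waiting[1:]
	q.online++
	return next, true
}

func main() {
	q := &LoginQueue{maxOnline: 2} // tiny cap for demonstration only
	for _, acct := range []string{"a", "b", "c", "d"} {
		if ok, pos := q.RequestLogin(acct); ok {
			fmt.Println(acct, "logged in")
		} else {
			fmt.Println(acct, "queued at position", pos)
		}
	}
	if next, ok := q.OnLogout(); ok {
		fmt.Println(next, "admitted from queue")
	}
}
```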
Breaking out critical functions into smaller services: This work is underway for tasks that can be completed in less than a day (some have already shipped this week), and it is planned for larger efforts, such as new microservices (for example, a GameList service whose only job is providing the game list to players). Once critical functionality has been broken out, we can look at scaling up our game management services, which will further reduce the load.
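To show what “a service that is only responsible for providing the game list” could look like in its simplest possible form, here is a hypothetical stand-alone HTTP endpoint; the route, payload, and port are invented for illustration and are not our real deployment.

```go
// A sketch of a minimal GameList microservice; endpoint and data are made up.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// GameSummary is the slim view of a game that a list endpoint needs to return.
type GameSummary struct {
	ID      string `json:"id"`
	Name    string `json:"name"`
	Players int    `json:"players"`
}

func main() {
	// In a real deployment this data would come from the game management
	// service; here it is hard-coded to keep the sketch self-contained.
	games := []GameSummary{
		{ID: "1", Name: "baal-run-01", Players: 6},
		{ID: "2", Name: "pindle-03", Players: 1},
	}
	http.HandleFunc("/games", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(games)
	})
	log.Println("GameList service listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```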
Not only on the D2R team, but across Blizzard, people are working very hard to handle problems in real time, diagnose issues, and ship fixes. This game means a great deal to all of us. Many of us on the team are long-time D2 players: some of us played it when it first came out in 2001, some are part of the modding community, and so on. We can promise you that we will keep working until the game experience meets our expectations, not just as developers but as players and members of the community.
Please continue to provide feedback in the Diablo II: Resurrected forum, report problems in our Bug Report thread, and seek troubleshooting help from our Technical Support forum. Thank you for keeping in touch with us through all of these channels; it’s been very helpful as we work through these problems.
The Diablo community team will keep you up to speed on our work through the Diablo community forums.
- Diablo II: Resurrected’s Development Team