Update: Improved chat server restarts

brildum · November 7, 2014, 12:15am

There is a bug when restarting our edge servers (the IRC servers, as you call them) that causes messages to stop flowing during the restart (we disconnect clients over the course of several minutes to avoid thundering herds).

When this happens, users will stop receiving messages despite having an open connection to chat. A page refresh will resolve the issue, but it’s definitely a bad experience.

We’re currently rolling out a fix for this issue. Restarts should no longer be problematic. Along with this current arc of work, we plan to update our clients to gracefully reconnect such that users will no longer see “You were disconnected. Reconnecting in 2 seconds.” after a restart and instead transparently reconnect during restarts without missing any messages.

I should add that because the current restart is pushing out the new changes, you may see this issue appear these last few restarts, but hopefully it never happens again!

moocat · November 7, 2014, 12:36pm

I’m not sure I get it.
Are you saying you’re updating the clients, as in the web client, to adjust to and handle this behavior, or are you saying that you’re rolling out a fix that will solve this for everyone connected, including the people running on their own IRC clients?

In other words, is there anything I need to do to prevent my IRC connection getting stuck?

brildum · November 7, 2014, 3:00pm

We plan on updating the Twitch clients to reconnect gracefully during restarts. We will publicize a guide for how this is accomplished so that bot developers can do the same.

From a high level, it will look something like this:

For clients which have registered for custom commands (IRC v3 capability negotiation), we will send a custom IRC command, possibly similar to: :tmi.twitch.tv RECONNECT
The client will have some duration (possibly 30s) to create a new connection and re-join its channels.
To avoid missing any messages, clients will have to maintain old and new connections simultaneously for a brief period (a few seconds after you’ve re-joined channels on the new connection). Clients will need to de-dupe messages while maintaining duplicate connections.
After a brief period with duplicate connections receiving messages, the client can close the old connection (it will be forcefully closed on the server at some point).

Keep in mind this logic is subject to change, but that is likely to be how it works.
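
A rough Python sketch of the de-dupe step described above. Everything in it is an assumption for illustration: the `OverlapDeduper` class, the "old"/"new" labels, and keying on the raw IRC line are hypothetical, and the final protocol may differ. It assumes both connections see the same stream while they overlap, so messages that arrive right at the edges of the overlap can still be misjudged.

```python
from collections import Counter


class OverlapDeduper:
    """Drop lines already delivered by the other connection during the overlap."""

    def __init__(self):
        # Per-connection count of how many times each raw line has been seen.
        self.seen = {"old": Counter(), "new": Counter()}

    def accept(self, conn, line):
        """Return True if `line` from `conn` ("old" or "new") should be delivered."""
        other = "new" if conn == "old" else "old"
        self.seen[conn][line] += 1
        # Deliver only when this connection is ahead of the other one for this
        # exact line; otherwise the other connection already delivered it.
        return self.seen[conn][line] > self.seen[other][line]
```

Counting per connection, rather than keeping a plain set of seen lines, means a user who really does send the same text twice still has both messages delivered.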

Another note, we’re actively working on creating chat-specific API documentation in the Twitch-API github repo, so expect some of that soon.

brildum · November 7, 2014, 3:57pm

Oh, so the graceful reconnects are separate from the bugfix which resolved the issue of clients not receiving messages during our restarts.

night · November 7, 2014, 8:18pm

How do you propose we implement this reliably with rate limits on joins and new connections?

There’s a severe lack of proper error messages provided by the server for every rate limiting event.

Connecting too quickly? → Disconnected
Joining too quickly? → Disconnected
Chatting too frequently? → Disconnected
Temporary connection interruption (not long enough to cause a ping timeout) → Disconnected

How do you possibly distinguish all of these errors from the client side?

For example, I can only assume that the last disconnection listed above was caused by the server-side 20-message buffer filling up during the connection interruption. There’s no way to know for sure on the client side why any error occurred.

Even if rate limits are published knowledge, there’s no guarantee that following them will ensure uninterrupted access to IRC. Past experience shows this: when TMI was severely overloaded, the servers started to see bursts of activity from clients rather than a steady flow of data. Without a steady flow of data, it became all too easy to break these limits.

Preferably, it would be nice if disconnections only occurred when necessary. Perhaps the system could be modified to warn first, and then disconnect for habitual offenders.

For example:
Client breaks rate limit → Error message sent to client
Client breaks same limit within a certain timeframe → Error message sent to client → Disconnects client

In an ideal setup for me:
Client breaks rate limit → Error message sent to client → Client handles error and increases client side rate limit
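
The "ideal setup" above could be sketched in Python roughly like this. The server does not currently send such an error, so `on_rate_limit_error` is a hypothetical hook, and the class name, the doubling factor, and the default intervals are assumptions chosen for illustration.

```python
class AdaptiveThrottle:
    """Client-side throttle that slows down when the server reports a limit breach."""

    def __init__(self, min_interval=0.2, max_interval=30.0):
        self.interval = min_interval      # seconds to leave between sends
        self.max_interval = max_interval  # never back off further than this
        self.next_allowed = 0.0           # earliest time the next send may happen

    def delay_before_send(self, now):
        """Return how long to wait before sending at time `now`, reserving the slot."""
        wait = max(0.0, self.next_allowed - now)
        self.next_allowed = max(now, self.next_allowed) + self.interval
        return wait

    def on_rate_limit_error(self):
        """Hypothetical hook: call when the server sends a rate-limit error."""
        self.interval = min(self.interval * 2, self.max_interval)
```

A caller would pass something like `time.monotonic()` as `now` and sleep for the returned delay before each JOIN or PRIVMSG.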

moocat · November 7, 2014, 11:30pm

With this it’s safe to assume you’re going to be doing restarts more frequently than in the past?
Keep in mind it already takes hours to rejoin all channels with the current rate limits. (A simple calculation shows this, even with optimal rates and no lag.)

brildum · November 8, 2014, 2:54am

@moocat We do hope to restart our edge servers more often (we currently actively avoid making changes to these servers unless they are absolutely required), so that we can iterate without hesitation and improve the overall system.

@night I definitely hear your feedback and it is totally reasonable. We have some large projects wrapping up in the next few weeks, after which we’ll start working on addressing some of these issues.

How we’ll attempt to solve them I cannot say yet. There are still many things to consider, but expect incremental improvements towards that end to start showing up soon.

moocat · November 8, 2014, 4:10am

I still don’t get it. You’re saying you don’t want to reconnect all clients immediately on a restart, to avoid thundering herds, but there will be an IRC event that makes the clients instead open another connection to rejoin the room? Wouldn’t that be the same as creating a thundering herd, only instead of the disconnection event there’s an IRC event triggering it?

Also, 30 seconds to rejoin rooms? As you can only (optimally) join 5 rooms per second, without counting the authentication rate limit, that’s going to be max 150 rooms in that duration, which is basically nothing. It’s currently taking me more than 3 hours to rejoin rooms on full restarts, and that’s not accounting for all the time it takes to get the moderator status, which often only occurs more than 10 minutes after a join. So we’re talking a significant amount of downtime here.

EDIT: If you’re doing lots of changes on the edge servers, wouldn’t it be better to have a dedicated edge server for this (I assume they run independently from each other), or use hot swapping? Don’t you even have dark launching for chat?

brildum · November 8, 2014, 5:37am

I’ll go into some more detail as to how restarts currently work to clarify things.

The chat system is separated by clusters:

main cluster (majority of channels, on 7 IPs)
event cluster (large event channels, on 3 IPs)
group cluster (on 2 IPs)

There are 3 edge servers per IP: 80, 443, 6667

During restarts, there is at most one box (IP) per cluster being restarted at any given time. Meaning, we can restart edges on a group server, an event server, and a main server simultaneously, but cannot restart edges on 2 main servers simultaneously. Each server’s restart process takes ~15 minutes. The server will notify clients to reconnect at a constant rate over the course of those 15 minutes (the connection will be closed by the server ~30s after they are notified). This period of disconnecting clients one by one is how we avoid thundering herds: we do not notify all clients to reconnect at once.
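
To make the pacing concrete, here is a small Python sketch of a constant-rate drain. The function name `drain_schedule` and the perfectly even spacing are assumptions; only the ~15 minute window and the ~30s grace period come from the description above.

```python
def drain_schedule(num_clients, window_s=15 * 60, grace_s=30):
    """Return (notify_at, close_at) offsets in seconds for each client.

    Clients are notified at evenly spaced times across the window, and each
    connection is closed `grace_s` seconds after its notification.
    """
    if num_clients <= 0:
        return []
    step = window_s / num_clients  # constant notification rate
    return [(i * step, i * step + grace_s) for i in range(num_clients)]
```

Because each client gets its own slot in the window, reconnect attempts arrive spread over the 15 minutes rather than all at once.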

To restart all edge servers in the main cluster takes a minimum of 105 minutes currently. With that in mind, bots should distribute their connections between all servers so that when a given connection is disconnected, you only need to rejoin a subset of the rooms you care about. Obviously, over the course of a full cluster restart you’ll need to rejoin all rooms, but you should never need to do a full restart (aka re-join all your channels immediately).
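
One way a bot might spread its connections as suggested, sketched in Python. The round-robin assignment, the function names, and the placeholder server labels in the usage note are assumptions for illustration, not real addresses or an official recommendation.

```python
def assign_channels(channels, servers):
    """Spread channels across servers round-robin; returns {server: [channels]}."""
    if not servers:
        raise ValueError("at least one server is required")
    plan = {server: [] for server in servers}
    for i, channel in enumerate(channels):
        plan[servers[i % len(servers)]].append(channel)
    return plan


def full_cluster_restart_minutes(num_ips, minutes_per_ip=15):
    """Minimum time to restart a cluster when only one IP restarts at a time."""
    return num_ips * minutes_per_ip
```

For example, `assign_channels(["#a", "#b", "#c"], ["s1", "s2"])` puts `#a` and `#c` on `s1` and `#b` on `s2`, so a restart of one server only forces a rejoin of that server's share. With the 7 main-cluster IPs, `full_cluster_restart_minutes(7)` gives the 105 minutes quoted above.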

It’s still potentially troublesome for bots considering the case where you get unlucky and get notified at the end of a 15 min interval, and then another connection is notified at the start of the next 15 min interval. You effectively have 2 connections restarting simultaneously, but it’s still not a full restart.

The arc of work to make restarts not affect users (aka no missed messages) is aimed at improving the restart experience for users (who largely only need to join a single room) and not bots which are connected to hundreds/thousands of rooms.

This unlocks the ability for us to iterate on our edge server without hesitation and start improving it. A large portion of the work we have planned consists of changes that have been requested via these forums.

moocat · November 8, 2014, 2:42pm

@brildum Thank you, that clears everything up.

EDIT: One more thing. When you’re given the restart event, I assume you can reconnect to the same server? Or will that one be in a kind of locked state until the full 15 minutes are up?

brildum · November 8, 2014, 6:36pm

You can reconnect to the same server when notified to reconnect.

tduva_ · November 13, 2014, 9:05pm

I don’t quite understand this. Why would you want to reconnect to a server that is being restarted? Or am I misunderstanding what the restart process means?

Also, considering a custom client doesn’t support the graceful reconnect with two simultaneous connections (yet), I suppose it would still be preferred if it reacted to the reconnect command but instead just reconnected normally?

brildum · November 13, 2014, 11:19pm

Prior to the restart, we have an operating system process (our edge server) accepting connections on a given port. When we restart, a new process starts accepting connections on the port (and the old process stops accepting connections). So connections are still accepted, but new connections are run with the new code, and the old process kills its connections over 15 minutes (the old code).

For 15 minutes, we have 2 edge server processes running, but only 1 actively accepts connections (the newest).

brildum · November 13, 2014, 11:31pm

Reconnecting normally would be fine in most cases. In fact, if you ignore the reconnect message completely you will be disconnected 30s later. Bots should already handle disconnects, so if you’re doing that, you don’t need to add any new code to handle restarts.
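
A minimal Python sketch of spotting the notification, assuming the command ends up looking like the `:tmi.twitch.tv RECONNECT` line floated earlier in this thread (the final format may differ). The helper name `is_reconnect` is hypothetical, and the sketch does not handle IRC v3 message tags.

```python
def is_reconnect(line):
    """Return True if a raw IRC line carries the RECONNECT command."""
    parts = line.strip().split(" ")
    # An optional prefix starts with ":" and precedes the command.
    if parts and parts[0].startswith(":"):
        parts = parts[1:]
    return bool(parts) and parts[0].upper() == "RECONNECT"
```

A bot that already reconnects after a disconnect can treat a `True` result as a hint to start that same reconnect path early instead of waiting for the connection to be closed.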

tduva_ · November 14, 2014, 3:53am

Thanks for clearing things up.

Dmitriy_Kirillov · April 15, 2015, 8:24pm

Why am I constantly reconnecting to chat? It says “Sorry, we were unable to connect to chat. Reconnecting in 2 seconds”, “Sorry, we were unable to connect to chat. Reconnecting in 4 seconds”, “… in 8 seconds”, “… in 16 seconds”, “… in 32 seconds”, “… in 64 seconds”, “… in 128 seconds”, “… in 256 seconds”, and so on. And I never connect! I tried clearing the browser cache in Google Chrome. It did not help.
Sorry if I posted this in the wrong place: I do not know much English and am writing through Google Translate. Please answer me. Or where do I go for help?

george · April 15, 2015, 8:31pm

Are you logging into Twitch with a Facebook account? If so, try changing your Twitch password.

Dmitriy_Kirillov · April 15, 2015, 8:56pm

Thanks a lot! It helped! =)))