Why was Infinit down briefly on Friday?
We hope that no one was too inconvenienced by Friday's outage. We thought it would be a good opportunity to start blogging about some of our technical challenges.
When updating our servers and client on Friday afternoon, we encountered an unexpected problem with our notification server. This server is used to push transaction and user status information to clients using an SSL TCP socket.
Part of our update process is killing old server instances and launching new ones, to which the clients automatically connect. This approach means that users just see a quick loss of connection (i.e., you see the message "Have you tried turning it off and on again?") before the client connects to the new server.
We noticed immediately that there was a problem because not all of the clients were reconnecting to the server. The ones that were not reconnecting were timing out during their SSL handshake. They would then cool down for several seconds before trying again. Because the timed-out clients kept rejoining it, the queue of clients waiting to connect never got shorter, so those clients never managed to reconnect.
To temporarily resolve the problem, we worked quickly to improve the performance of the notification servers so that everyone could get connected again. The long-term fix is to increase the handshake timeout on the client side and add a factor of randomness to the client reconnect cool-down. The client-side changes will be pushed in the next release.
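To make the long-term fix concrete, here is a minimal Python sketch of a reconnect loop with a longer handshake timeout and a randomized cool-down. It is an illustration only, not our actual client code: the function names, the `connect` callable, and the 5-second and 30-second values are all assumptions chosen for the example.

```python
import random
import time


def cooldown_with_jitter(base_seconds, jitter_fraction=0.5, rng=random):
    """Return a cool-down between base and base * (1 + jitter_fraction).

    The random component spreads clients out so they do not all retry
    at the same instant. The 0.5 default is an illustrative assumption.
    """
    return base_seconds * (1.0 + rng.uniform(0.0, jitter_fraction))


def reconnect(connect, base_cooldown=5.0, handshake_timeout=30.0,
              max_attempts=10, sleep=time.sleep):
    """Call connect(timeout) until it succeeds or attempts run out.

    `connect` is a hypothetical callable that performs the SSL handshake
    and raises OSError on failure (socket timeouts and ssl.SSLError are
    both subclasses of OSError).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect(handshake_timeout)
        except OSError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(cooldown_with_jitter(base_cooldown))
```

With a fixed cool-down, every client that timed out together would come back together. With the random component, their retries are spread over a window, which gives the server a chance to work through the queue.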
What did we learn from this? A couple of things. The first is that even though our server load during normal operation is relatively low, we now have enough users to create load problems when everyone tries to connect at the same time. The second is that our great in-house logging system allowed us to quickly understand the root of the problem and implement a fix.











