User-Facing System Deployment and Testing Tips:
When deploying a new version or a complete rewrite of an old system, a lot of things can go wrong. Here I am documenting what I learned while deploying user-facing systems in the e-commerce space, where such systems are expected to be highly available with zero downtime. In short, I would say: make sure you have a proxy layer.
As you make major changes to your system, they are bound to affect your contracts with other systems. Consider a case where you choose to split a single system into smaller independent microservices; in such cases your API endpoints also change significantly. Building a proxy that acts as both a router and an adapter, translating requests to your new contract specification, is very useful. This way you can deploy independently and give other systems a deadline to move to the new contract, after which you shut down the proxy.
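To make the router-plus-adapter idea concrete, here is a minimal sketch in Python. It is not the proxy we ran; the route table, host names, paths and the `cartId` to `cart_id` rename are all hypothetical, chosen only to show an old-contract request being routed to a new service and translated to its contract.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Request:
    method: str
    path: str
    body: Dict[str, object] = field(default_factory=dict)


# Hypothetical route table: old monolith path prefix -> (new service host, new path prefix).
ROUTES = {
    "/shop/cart": ("cart-service.internal", "/v2/carts"),
    "/shop/order": ("order-service.internal", "/v2/orders"),
}


def adapt(old: Request) -> Tuple[str, Request]:
    """Route an old-contract request and translate it to the new contract."""
    for old_prefix, (host, new_prefix) in ROUTES.items():
        if old.path == old_prefix or old.path.startswith(old_prefix + "/"):
            # Router part: swap the old prefix for the new one, keep the rest of the path.
            new_path = new_prefix + old.path[len(old_prefix):]
            # Adapter part: copy the body and apply a (hypothetical) field rename.
            body = dict(old.body)
            if "cartId" in body:
                body["cart_id"] = str(body.pop("cartId"))
            return host, Request(old.method, new_path, body)
    raise LookupError("no route for " + old.path)
```

Callers keep sending requests in the old shape; once every caller has moved to the new contract, the route table becomes empty and the proxy can be retired.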
Another major advantage of using a proxy layer is being able to test and onboard the new system easily. Consider rewriting a mission-critical system in the user's path. The system is very complex with many use cases, and there is neither a sufficient staging setup to test on nor the time to spend on staging tests. Added to this, the new system is completely new: its data model is redesigned, its data store has been changed, and it has been broken into smaller microservices. In such cases, having the outside world talk to the old system through a proxy lets you run the new system in passive mode and monitor it.
To run in passive mode, the trick is to use the proxy layer to fork each incoming request into two requests: one hitting the main (old) system and the other hitting the new system. The proxy sends the output of the first back to the client and logs the output of the second along with the main system's output. A cron job can then diff the outputs of both systems to look for anomalies.
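The fork-and-diff flow can be sketched as follows. This is an illustrative Python model, not our Nginx/Lua code: the two systems are plain callables, and `ForkingProxy` and `find_anomalies` are hypothetical names. The sketch calls the new system synchronously for simplicity; a real proxy would presumably do that off the request path so the client never waits on it.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], dict]


class ForkingProxy:
    """Sends every request to both systems; only the primary's answer reaches the client."""

    def __init__(self, primary: Handler, shadow: Handler) -> None:
        self.primary = primary
        self.shadow = shadow
        self.log: List[Dict[str, dict]] = []  # stand-in for a real log sink

    def handle(self, request: dict) -> dict:
        primary_out = self.primary(request)
        try:
            shadow_out = self.shadow(request)
        except Exception as exc:
            # A failure in the new system must never affect the client.
            shadow_out = {"error": repr(exc)}
        self.log.append(
            {"request": request, "primary": primary_out, "shadow": shadow_out}
        )
        return primary_out


def find_anomalies(log: List[Dict[str, dict]]) -> List[Dict[str, dict]]:
    """What the periodic job would do: keep the entries where the outputs differ."""
    return [entry for entry in log if entry["primary"] != entry["shadow"]]
```

Exact equality is the simplest possible diff and is an assumption here; if the two systems legitimately differ in some fields, the comparison would need to normalize or ignore those first.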
How does this help? Well, imagine running in passive mode: the new system will start to spit out all of its errors under real traffic, which you can keep fixing and deploying. Run it for long enough and all your new data will be auto-migrated, since passive mode has been creating entries in the new system. The outside world is oblivious to the fact that you had two systems running, and to the moment you configured the proxy to cut over and point only to the second system.
Having said this, there is more to it than using a proxy to fork incoming requests. Outbound requests are where things get challenging: outbound calls by the primary and secondary systems can occur at different points in time, which makes anomaly detection in outbound calls difficult.
Here the primary problem is knowing which request payloads to compare once they are serialized. One way to solve it is to know which field in the payload to group on; say, requests from both the old and new systems carry a primary key such as an entity ID. But this does not always work, and it's not the right way, because your anomaly detector is now specific to your business logic and coupled to it. It didn't work for us, as we were even changing the format of IDs from int to string. The better way was to use metadata, such as request headers in the case of HTTP outbound requests. This is similar to the concept of request tracing in a microservice architecture.
If you have heard of Zipkin at Twitter you will know what I mean; if not, don't worry, I will try to give the gist of it in my context. The idea is to monitor and measure metrics across systems in a microservice architecture. Consider one request to a web page at Facebook hitting 50 services in the backend. Now suppose you want to analyze a sample of such requests and see how long each subsystem took, laid out as a tree starting from the user clicking on the homepage, in order to understand bottlenecks.
At a high level this is achieved by using headers (metadata) to pass a trace ID that identifies the request across the system. This can only work when all systems agree to pass this metadata in their outbound calls. In big companies this gets standardized by implementing a generic, reusable outbound proxy, such as Finagle at Twitter and, to some extent, the service proxy at Flipkart. Hence, by leveraging a similar trace ID generated by our proxy when it forks incoming calls to the old and new systems, we can catch the subsequent outbound calls made by both systems and group them on the trace ID.
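Here is a small sketch of that grouping step in Python. The header name `X-Trace-Id`, the shape of the captured call records and the function names are assumptions made for illustration; the only idea taken from the text is that the proxy stamps both forks with the same trace ID and outbound calls are later grouped on it rather than on any payload field.

```python
import uuid
from collections import defaultdict
from typing import Dict, List

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name


def with_trace_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Proxy side: stamp a request with a trace ID before forking it to both systems."""
    tagged = dict(headers)
    tagged.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return tagged


def group_outbound(calls: List[dict]) -> Dict[str, Dict[str, list]]:
    """Group captured outbound calls by trace ID, split by originating system.

    Each call is assumed to look like
    {"system": "old" or "new", "headers": {...}, "payload": {...}}.
    """
    grouped = defaultdict(lambda: {"old": [], "new": []})
    for call in calls:
        trace_id = call["headers"].get(TRACE_HEADER)
        if trace_id is None:
            continue  # a system that did not propagate the header cannot be matched
        grouped[trace_id][call["system"]].append(call["payload"])
    return dict(grouped)


def mismatched_traces(grouped: Dict[str, Dict[str, list]]) -> List[str]:
    """Trace IDs for which the two systems made different outbound calls."""
    return [t for t, sides in grouped.items() if sides["old"] != sides["new"]]
```

Because the grouping key lives in the headers, the calls can arrive in any order and at different times, and the detector needs no knowledge of entity IDs or other business fields. How the payloads themselves are compared once paired (plain equality here) is a separate choice.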
With all of this going for it, you might feel that having a proxy is a no-brainer. Not really: a proxy brings in another hop, and all of the above is needed only when you have a complex system with many interacting services. It is unnecessary effort for a small-scale or monolithic system. As for what kind of proxy works best, we used OpenResty (Nginx with Lua); HAProxy is another amazing alternative. The only reason we didn't use HAProxy is that it didn't have plugins to embed Lua for carrying out our fork logic. To do it in HAProxy we would have needed to write C code, and we didn't see the need for that, as the Nginx setup met our scale requirements.