Migrating SmartMonsters to the Cloud. Part Two: Breaking Up the Bots.
TriadCity’s robots Death, Help, Marvin and Oscar are AIML-based chatterbots architected monolithically. Each bot is an individual application, fully vertically integrated: TriadCity client code, AIML interpreter, AIML files, plus a thin layer of hardwired individual behaviors. Each is packaged independently of the others and deployed separately, with its own directory hierarchy, and so on. This is not some massive scalability bottleneck. We’re only talking four bots. With a little work we can even semi-automatically update their AIML without having to fully rebuild each application. But there are advantages to decoupling their AIML brains from their TC client code:
Simpler to develop a new generation of bots hosted on different platforms, where these new bots can access a central shared AIML brain via different protocols and devices. For example, we can access Oscar’s newly RESTful brain from mobile devices, Web sites, Twitter feeds, and the existing TriadCity bot without having to maintain separate AIML repositories for each client. AIML updates become independent of the Java code for these multiple platforms, doable in real time without rebooting the bots. The AIML service can scale independently, becoming fully elastic in The Cloud.
Easier to harden our AIML implementation. Today we rely on each developer to get it right. This is a problem, because the XML parser is brittle. If an AIML topic is rejected, subsequent loading from that file stops: all the following topics are ignored. This means that from a file containing 15,000 topics, only the first few hundred might be loaded if there’s a simple XML typo like forgetting a closing angle bracket. Also, notice to developers is non-automated: someone has to read the logs. If by contrast we deploy the AIML interpreter as its own service, we can use standard CI to validate new AIML centrally, independently of developers, in multiple passes. Like this: check in new AIML —> Jenkins fires a build —> the AIML is validated with the same SAX parser the interpreter uses at runtime —> if the validations fail, the build stops and developers are notified; if the validations succeed, a remote script triggers a pull from SCM, updating the runtime AIML —> the app is redeployed, invisibly to end users, without downtime —> a final round of functional tests written in the SoapUI tool validates the AIML rules at runtime, catching any regressions. This is really pretty nifty.
Easier to swap out or enhance AIML interpreters without having to rebuild and redistribute an army of dependents. Specifically, we’re unhappy with the “Program AB” reference implementation, which is brittle and performs poorly. We’re likely to either replace it or fully rework it soon-ish. Decoupling that engine from its clients helps to future-proof.
Enables the AIML engine to become fully “elastic” in The Cloud: scaling out or in automatically under load. Nice!
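The XML-correctness pass in the CI pipeline described above can be quite small. Here is a minimal sketch, assuming the check only needs well-formedness and that the JDK's built-in SAX parser is a fair stand-in for whatever parser the interpreter uses at runtime. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any AIML interpreter.
final class AimlWellFormednessCheck {

    // Returns parse errors for one AIML document; an empty list means it is well-formed.
    static List<String> validate(String aimlXml) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // DefaultHandler rethrows fatal (well-formedness) errors as SAXParseException.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(aimlXml)), new DefaultHandler());
        } catch (SAXParseException e) {
            errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
        } catch (Exception e) {
            // Parser configuration or I/O trouble: report it rather than pass silently.
            errors.add(e.toString());
        }
        return errors;
    }

    public static void main(String[] args) {
        String good = "<aiml><category><pattern>HELLO</pattern>"
                + "<template>Hi there.</template></category></aiml>";
        // Same document, with the closing angle bracket of </template> missing.
        String bad = "<aiml><category><pattern>HELLO</pattern>"
                + "<template>Hi there.</template</category></aiml>";
        System.out.println("good: " + validate(good));
        System.out.println("bad:  " + validate(bad));
    }
}
```

In a Jenkins job, a non-empty error list for any AIML file would fail the build and trigger the notification, so a typo is caught before the runtime ever sees it.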
Here are the migration steps:
Pull the AIML interpreter out of the bot code, into a standalone application accessed as a RESTful service running in The Cloud. Very super minorly tricky, because the off-the-shelf AIML interpreters for Java are file-centric: they read from and write to directories, so they need file system access. Given that we’re not interested in re-writing the interpreter to be more flexible, this rules out Google App Engine, which has no concept of a file system. So, Elastic Beanstalk. With a little bit of hoop-jumping we can make that work. A shell script in the ec2-user’s home directory pulls revised AIML from SCM; Jenkins will fire that shell script remotely. Alternatively, we could package the interpreter code and AIML files together in a single WAR which we'd redeploy with every change. Although this is simpler, I decided against it since our AIML developers are used to accessing their source files in a separate location in SCM. It may be necessary to revisit this, as this update mechanism relies on knowing the underlying EC2 instances’ DNS addresses, which of course can change elastically under load.
Redesign the AIML interpreter to serve all of the bots. Our legacy vertical monoliths were each-bot-integrates-its-own-interpreter. Here, we’ll serve all AIML from one central RESTful service on Elastic Beanstalk. The REST API will be super-simple: a GET call parameterized with the bot name and the input String. Return an easy-peasy JSON object. That’s really it. Whip it together in JAX-RS and there it is. Let EB scale it if that ever becomes necessary.
Remove the AIML interpreter from the Web site, having the appropriate JSPs now RESTfully GET their conversations from the service on EB.
Remove the AIML interpreter from the TriadCity bots, having them similarly call into the service. I decided while I was at it to consolidate the four existing bots into a single JAR rather than continuing to build each as a separate project. This is simpler, and there’s very little Java code distinguishing the bots, really just some hard-wiring of default responses and some variation of the circumstances in which they’ll intervene in chats. A configuration file determines which bot to run; the consolidated code figures out which bot-centric classes to load.
Configure Jenkins to manage Continuous Delivery. The CI build fires when updates are made to the central SCM. The build lifecycle now includes two cycles of AIML validation. A first pass ensures XML correctness before changes are pushed to production; a second pass validates the AIML rules by comparing expected to actual outputs. If we were a larger company I’d break CD into a deployment pipeline where we'd push first to a functional test environment to run these SoapUI-based tests, acting as a quality gate to production. I may circle back and do this at a later time. For now we’ll just do it in production, and send ourselves an email when a test fails.
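To make the REST API from the steps above concrete: a GET parameterized with the bot name and the input String, returning a small JSON object. The sketch below uses the JDK's built-in HttpServer instead of JAX-RS, purely so that it runs without dependencies. The "/chat" path, the "bot" and "input" parameter names, the JSON field names and the stand-in interpreter function are all assumptions, not the actual service.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical sketch; path, parameter names and JSON fields are assumptions.
final class BotChatService {

    // Stand-in for the shared AIML interpreter: (botName, input) -> reply.
    private final BinaryOperator<String> interpreter;

    BotChatService(BinaryOperator<String> interpreter) {
        this.interpreter = interpreter;
    }

    // Starts serving requests such as GET /chat?bot=Oscar&input=hello
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/chat", this::handle);
        server.start();
        return server;
    }

    // Builds the JSON reply for a raw query string, or null if a parameter is missing.
    String respond(String rawQuery) {
        Map<String, String> params = parseQuery(rawQuery);
        String bot = params.get("bot");
        String input = params.get("input");
        if (bot == null || input == null) {
            return null;
        }
        return toJson(bot, interpreter.apply(bot, input));
    }

    private void handle(HttpExchange exchange) throws IOException {
        boolean isGet = "GET".equals(exchange.getRequestMethod());
        String json = isGet ? respond(exchange.getRequestURI().getRawQuery()) : null;
        int status = json != null ? 200 : (isGet ? 400 : 405);
        String body = json != null ? json : "{\"error\":\"bad request\"}";
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "application/json; charset=utf-8");
        exchange.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(bytes);
        }
    }

    // Splits "a=1&b=two%20words" into a decoded name -> value map.
    static Map<String, String> parseQuery(String rawQuery) {
        Map<String, String> params = new HashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            String[] kv = pair.split("=", 2);
            String value = kv.length > 1 ? kv[1] : "";
            params.put(URLDecoder.decode(kv[0], StandardCharsets.UTF_8),
                    URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    static String toJson(String bot, String response) {
        return "{\"bot\":\"" + escape(bot) + "\",\"response\":\"" + escape(response) + "\"}";
    }

    // Escapes quotes, backslashes and control characters for use inside a JSON string.
    private static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BotChatService service = new BotChatService((bot, input) -> bot + " heard: " + input);
        // prints {"bot":"Oscar","response":"Oscar heard: hello there"}
        System.out.println(service.respond("bot=Oscar&input=hello%20there"));
    }
}
```

Calling start(8080) on the same object would expose that logic over HTTP. Keeping the bot name as a request parameter is what lets one service answer for all four bots.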
Now let's add some PaaS sugar. We can log inputs which fail to match any AIML rules, triggering default responses. This feedback will let our botmasters know what their users are interested in, and how their bots are performing. Since we’re on AWS, let’s log to DynamoDB, with a super simple schema: bot name, <input>, <topic>, <that>. Then let’s schedule a job to report on these failures, say, weekly. Coolness! Let’s additionally modularize and abstract this logging so that we’re not 100% married to AWS; we’ll do that with the standard OO Strategy Pattern, via an Interface with a Factory delegating to concrete implementations servicing different environments. Later if we want to bolt from AWS we can simply implement a new Strategy class for wherever we go without untangling miles of upstream code. Check.
The migration took two working days, following the steps above. AWS basics are straightforward, with intuitive controls. To my surprise, DynamoDB logging required a full third day. I struggled to get connected - turns out the published Developer Guide is missing one key point, and incorrect on another. To connect you have to specify the AWS region, duh. And, Integer keys are not necessary, Strings are fine, and there's a handy "@DynamoDBAutoGeneratedKey" annotation in the Java SDK which'll auto-gen the key for you. Nice. Pleased to get it going: I'd very much prefer NoSQL solutions, especially when the data are inherently non-relational, as these records unavoidably are.
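Going back to the Strategy-based logging described above, the shape is roughly this. It's a sketch under assumptions: every name here is invented for illustration, the two concrete strategies are local stand-ins, and the DynamoDB-backed one is deliberately left out so the example runs without the AWS SDK.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an Interface plus Factory; none of these names come from the real code.
final class UnmatchedInputLogging {

    // One record per input that matched no AIML rule: bot name, input, topic, that.
    static final class Entry {
        final String botName, input, topic, that;

        Entry(String botName, String input, String topic, String that) {
            this.botName = botName;
            this.input = input;
            this.topic = topic;
            this.that = that;
        }

        @Override
        public String toString() {
            return botName + " | " + input + " | " + topic + " | " + that;
        }
    }

    // The Strategy interface: upstream code depends only on this.
    interface Log {
        void record(Entry entry);
    }

    // Strategy for tests and local runs: keeps entries in memory.
    static final class InMemoryLog implements Log {
        private final List<Entry> entries = new ArrayList<>();

        @Override
        public synchronized void record(Entry entry) {
            entries.add(entry);
        }

        synchronized List<Entry> entries() {
            return new ArrayList<>(entries);
        }
    }

    // Strategy that just prints; handy when no datastore is available.
    static final class ConsoleLog implements Log {
        @Override
        public void record(Entry entry) {
            System.out.println("UNMATCHED " + entry);
        }
    }

    // The Factory: the only place that knows which environments exist.
    static Log forEnvironment(String environment) {
        switch (environment) {
            case "memory":
                return new InMemoryLog();
            case "console":
                return new ConsoleLog();
            // A DynamoDB-backed strategy would be added as one more case here.
            default:
                throw new IllegalArgumentException("No log strategy for: " + environment);
        }
    }

    public static void main(String[] args) {
        Log log = forEnvironment("console");
        log.record(new Entry("Oscar", "WHAT IS A SMURF", "*", "*"));
    }
}
```

Moving off AWS then means writing one new Log implementation and adding one case to the factory; the code that calls record() doesn't change.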
One more area of enhancement seems obvious. Writing these log messages directly to the database is synchronous: our RESTful API server will block during the transaction. Let's put a message queue between the API server and the database: we'll fire the log messages and forget them. Easy-peasy: add a "worker tier" to the application; call the write-to-DynamoDB logic from its POST method; implement a new Strategy object which writes log messages to SQS instead of the database. Half a day.
The platform is now in place upon which we’ll build a new generation of bots more advanced than these. There’s an upcoming project to rebuild our QA bots via the Drools rules engine - QA bots seem a natural venue for rules-based AI and with Drools we can invent a Domain Specific Language for QA, enabling simple-ish future expansion. Let’s get on it. The next Cloud migration step will be SmartMonsters’ fully-dynamic Web site. More to follow.
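A footnote on the fire-and-forget logging step above. The sketch below shows the idea with an in-process BlockingQueue standing in for SQS and a plain method standing in for the worker tier. It is an illustration of the pattern only, with invented names, not the AWS wiring itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical in-process stand-in for the queue-plus-worker arrangement.
final class FireAndForgetLogging {

    // API-server side: enqueue and return immediately, never touching the database.
    static boolean enqueue(BlockingQueue<String> queue, String message) {
        return queue.offer(message);
    }

    // Worker side: take everything currently queued and hand each message to the slow writer.
    static int drain(BlockingQueue<String> queue, Consumer<String> writer) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch);
        batch.forEach(writer);
        return batch.size();
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        enqueue(queue, "Oscar | WHAT IS A SMURF");
        enqueue(queue, "Marvin | ARE YOU HAPPY");

        // Stand-in for the write-to-database logic the worker would call.
        List<String> stored = new ArrayList<>();
        int written = drain(queue, stored::add);
        System.out.println("wrote " + written + " messages: " + stored);
    }
}
```

The request thread only ever pays for the enqueue; the slow write happens later, on the worker's time.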









