Sometimes it is difficult to figure out just what the best computing architecture is to support your software stack. And when your stack is largely open source, you often have conflicting requirements: some components need lots of threads, others need raw CPU performance. There are questions you should ask when formulating your software architecture at the start.
Is the application you are building required to process lots of requests? If it is a web-based application serving mobile, desktop or web service clients, you can count the hits easily. In a service-oriented application, you may serve many calls per user interaction. In our case, we built a platform for web (desktop and mobile), IVR (VXML) and CTI applications via SOAP and REST web services. The call count is high but not extreme, with strong bursts throughout the day.
Knowing our transaction patterns helped us plan for scale out. The purpose of our interactions is to contextually adjust the experience that our large customer base receives when contacting our clients through web or voice interactions. There are millions of customers, each of whom may receive a custom user experience through one of our channels. To service the volume, we decided we would add extra capacity by adding cluster nodes and sharing information using Cassandra as a key-value (KVP) data store.
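As a rough sketch of that pattern, the snippet below reads and writes per-customer context in Cassandra so that any cluster node can pick up where another left off. The keyspace, table and column names are hypothetical, and it uses the current DataStax Java driver rather than whatever client the original platform actually ran.

```java
// Minimal sketch of the shared-context pattern: per-customer key/value
// pairs kept in Cassandra, readable from any node in the cluster.
// Keyspace, table and column names are hypothetical.
import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;

public class CustomerContextStore {

    private final CqlSession session;
    private final PreparedStatement read;
    private final PreparedStatement write;

    public CustomerContextStore(String contactHost, String datacenter) {
        this.session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress(contactHost, 9042))
                .withLocalDatacenter(datacenter)
                .withKeyspace("context_store")               // hypothetical keyspace
                .build();
        this.read = session.prepare(
                "SELECT context_value FROM customer_context "
                + "WHERE customer_id = ? AND context_key = ?");
        this.write = session.prepare(
                "INSERT INTO customer_context (customer_id, context_key, context_value) "
                + "VALUES (?, ?, ?)");
    }

    /** Any cluster node can look up context written by another node. */
    public String get(String customerId, String key) {
        Row row = session.execute(read.bind(customerId, key)).one();
        return row == null ? null : row.getString("context_value");
    }

    public void put(String customerId, String key, String value) {
        session.execute(write.bind(customerId, key, value));
    }

    public void close() {
        session.close();
    }
}
```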
Processing lots of context is compute-expensive, and to meet the business requirement to adjust rules and flows quickly we chose Drools as our rules engine. We have a big application with a medium count of facts and lots of rules to manage our state transitions. It is partitioned into three components: Customer Experience Management, Data Management and Call Routing to agents (the destination call centre). There is an extensive set of menus and questions that can be presented, so we have a high per-transaction compute requirement when making decisions.
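To make the shape of that concrete, here is a minimal sketch of evaluating a caller's context against a Drools rule base once per transaction, using the standard KIE API. The fact class, session name and "next menu" decision are illustrative assumptions; the post does not show the real rule base or fact model.

```java
// Per-transaction rule evaluation sketch using the Drools KIE API.
// CallerContext and "experience-session" are hypothetical names.
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class ExperienceDecisionService {

    private final KieContainer kieContainer =
            KieServices.Factory.get().getKieClasspathContainer();

    /** Simple fact describing the caller; real facts would be richer. */
    public static class CallerContext {
        public final String customerId;
        public final String channel;   // e.g. "IVR", "WEB"
        public String nextMenu;        // set by the rules

        public CallerContext(String customerId, String channel) {
            this.customerId = customerId;
            this.channel = channel;
        }
    }

    /** Insert the facts for this interaction, fire the rules, read the decision. */
    public String decideNextMenu(CallerContext context) {
        KieSession session = kieContainer.newKieSession("experience-session");
        try {
            session.insert(context);
            session.fireAllRules();
            return context.nextMenu;
        } finally {
            session.dispose();   // release working memory after each transaction
        }
    }
}
```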
Like many modern businesses we run 24x7 operations, and the call centre is no different. To meet the no-downtime requirement we chose to scale out by clustering the web servers and the database.
These days there are many reasons to go to the cloud and very few to supply your own infrastructure. But we had existing infrastructure with excess capacity that could be used.
So we were quickly up and running, adding business value and integrating the future into a legacy platform. A parallel telephony program was looking to migrate to an equivalent modern stack for its future requirements.
The scorecard for phase 1 was good: good buy-in and good results.
Then we turned on the new telephony stack, upped the load a little (well, actually more than we thought) and all of a sudden, under peak loads, we had troubles. Transactions that took under 100 milliseconds blew out to seconds, and we had very strict requirements because the caller experience had to be seamless. Knowing that an Intel-based "laptop" could deliver the performance we needed, we figured it must be something about the way our application mapped to the hardware and OS. The OS was Solaris and the laptop Darwin… probably not that!
Attempt 1 - Quick fix: lower I/O and add more threads (scale up!). Tune the database, app servers, etc. Better, but not fixed.
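As an illustration of the "add more threads" side of that tuning (the actual app server settings are not in the post), a bounded pool sized up from the defaults, with a queue to absorb bursts, might look like this:

```java
// Sketch of the "more threads" scale-up, assuming request handling is
// dispatched to a shared executor; pool sizes are illustrative only.
// A bigger bounded pool plus a queue smooths bursts, but only helps
// while the CPUs can keep up with the work.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RequestExecutor {

    public static ThreadPoolExecutor create() {
        int core = 64;    // raised from the defaults (illustrative values)
        int max  = 128;
        return new ThreadPoolExecutor(
                core, max,
                60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1_000),               // bounded burst buffer
                new ThreadPoolExecutor.CallerRunsPolicy());    // back-pressure when full
    }
}
```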
Attempt 2 - Reduce compute by refactoring the rules. Again, the results this time were impressive, but we could still blow out under "extreme" loads when, say, we lost nodes or the load balancers forgot what they were meant to do!
We were still uncomfortable with the fact that we had little wiggle room and would potentially need a stricter monitoring and support regime. One more attempt in us yet!
Attempt 3 - After fighting for a hardware platform change, we made a radical departure from Solaris on SPARC to Windows on x86 VMs. All of the Attempt 1 and 2 changes were run up on a VM cluster and, "hey presto", we were comfortable again. Three hours to solve our problem. Modern Intel processors kick older SPARCs when raw computing grunt is required.
But even more important: remember how you answered the questions. Somewhere along the way we forgot our architecture principle to scale out. And once we remembered, a couple of man-days had us going again.
For those that must know some stats:
If we had to, we could just about run it off a laptop, but it would get very hot!
If you want to know more, feel free to contact me at rodc@opensoftwaresolutions.com.au.