Wednesday, October 31, 2012

Comparison of different database models for big data

A comparison of different database models for big data, including their scalability & CAP properties.

Monday, October 22, 2012

"In-house" vs buying a complex "out of box" solutions

This is my contribution to the religious debate about self-developed "in-house" versus purchased "out-of-the-box" solutions.

Suppose your manager enthusiastically says he'll purchase a feature-rich "out-of-the-box" solution after a visit from a salesman of a famous vendor. Your response is to remind him not to rely on vendor solutions blindly, since a feature-rich third-party product can have many disadvantages:

  • Difficult to configure, difficult to learn.
  • Expensive total cost of ownership: it needs big resources, the product has complexities of its own, and you need to hire expensive specialists for maintenance, upgrades, and integration.
  • In reality only a small fraction of the features is really used/necessary.
  • Dependency on the vendor, e.g. if you have an urgent bug but the vendor refuses to create a patch soon.
  • Vendor lock-in: you can't easily move your solution to another platform. This can be a problem if the vendor discontinues the product or if the license price becomes too expensive.
  • Even though the salesman advertised a direct out-of-the-box solution, in reality you need to spend effort integrating this external product with your environment (e.g. interfacing with the existing ERP systems/databases in your company).
  • Due to the time difference with the vendor's support team, it can be difficult to discuss problems. Sometimes you need to wait until the next day for the answer to your question; sometimes you need to stay beyond normal working hours to debug the problems together with support.
  • With in-house knowledge, the root cause analysis and the solutions are more transparent than if you just buy a black-box solution.

A good trade-off is open source solutions: you don't have to build from scratch / reinvent the wheel, yet you have more control over the code and its development.

A similar argument holds for cloud solutions. A cloud solution adds a dependency on the service provider, and it will be more difficult to debug problems compared with having the code & the in-house knowledge.

Source: Steve's blogs

Any comments are welcome :)


Scalability Rules by Abbott & Fisher

Friday, October 19, 2012

Simple capacity planning

  1. Identify the actual usage rates, their maxima, and seasonality (e.g. the business calendar). "Usage rate" here is a performance measure (e.g. throughput, processing rate, number of simultaneous transactions, storage/memory, network bandwidth in Mbps).
  2. Determine the growth rate based on business assumptions (e.g. the marketing manager says the users of your social app will grow 10000% in 2 years) or using a forecasting technique (e.g. Holt-Winters to predict the trend).
  3. Determine the headroom gained by optimization / refactoring projects.
  4. Add a 10-20% usage margin (machines that run at 100% capacity become unstable).
  5. Measure the server's peak capacity using a load test.
  6. Compute: capacity needed = max usage + growth - headroom gain + usage margin.
  7. If the capacity needed exceeds the server's peak capacity, you need to buy better hardware (scale up), tune further, or scale out (clustering).
  8. If you choose to scale out, compute the number of servers needed = capacity needed / server peak capacity (rounded up).
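The arithmetic in steps 6-8 can be sketched as follows; every figure (usage, growth, margin, per-server peak) is a made-up example, not a recommendation:

```java
// A numeric sketch of steps 6-8; all figures here are made-up examples.
public class CapacityPlanSketch {

    // Step 6: capacity needed = max usage + growth - headroom gain + usage margin.
    static double capacityNeeded(double maxUsage, double growth,
                                 double headroomGain, double usageMargin) {
        return maxUsage + growth - headroomGain + usageMargin;
    }

    // Step 8: number of servers, rounded up to whole machines.
    static int serversNeeded(double capacityNeeded, double serverPeakCapacity) {
        return (int) Math.ceil(capacityNeeded / serverPeakCapacity);
    }

    public static void main(String[] args) {
        double maxUsage = 1000;              // step 1: measured peak, e.g. requests/s
        double growth = 0.5 * maxUsage;      // step 2: assume 50% growth over the plan period
        double headroomGain = 100;           // step 3: expected gain from optimization
        double usageMargin = 0.2 * maxUsage; // step 4: 20% safety margin
        double serverPeak = 400;             // step 5: per-server peak from a load test

        double needed = capacityNeeded(maxUsage, growth, headroomGain, usageMargin);
        System.out.println(needed + " req/s -> "
                + serversNeeded(needed, serverPeak) + " servers");
    }
}
```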


  • Plan for 2-4 years, not longer, since it's difficult to make good assumptions (e.g. about growth rate) for the long term. In 5 years you might have another marketing manager with different targets & growth assumptions.
  • The number-of-servers equation above assumes that the cluster management work (load balancing etc.) is negligible compared with the main work.
  • The number-of-servers equation above assumes that the servers run on separate physical hardware. If these servers are virtual machines on one physical server, then 1+1 is not 2 anymore.
  • Understand which resources need improvement (e.g. if network bandwidth is the problem, adding more servers might not alleviate it).
  • If possible, try to scale up first, since it's easier, faster to implement, and often cheaper than scaling out. But your vendor's salesman might try to convince you that the scale-out upgrade package is easy to integrate into your deployment environment and applications (just a matter of hiring his consultants); well, as my wise grandma once said... always fact-check what salesmen and politicians say.
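Step 2 above mentions Holt-Winters forecasting. As an illustration, here is a minimal sketch of Holt's linear trend method (double exponential smoothing, essentially Holt-Winters without the seasonal component); the series and the smoothing parameters alpha and beta are made-up examples, not tuned values:

```java
// A minimal sketch of Holt's linear trend method (Holt-Winters without
// the seasonal component), one option for the growth forecast in step 2.
public class HoltTrendSketch {

    // Forecast the series h steps ahead of its last observation.
    // alpha smooths the level, beta smooths the trend (both in [0,1]).
    static double forecast(double[] y, double alpha, double beta, int h) {
        double level = y[0];
        double trend = y[1] - y[0]; // simple initialization from the first two points
        for (int t = 1; t < y.length; t++) {
            double prevLevel = level;
            level = alpha * y[t] + (1 - alpha) * (level + trend);
            trend = beta * (level - prevLevel) + (1 - beta) * trend;
        }
        return level + h * trend;
    }

    public static void main(String[] args) {
        double[] monthlyPeakUsage = {100, 110, 120, 130}; // made-up history
        // forecast two months beyond the last observation
        System.out.println(forecast(monthlyPeakUsage, 0.5, 0.5, 2));
    }
}
```

For real capacity planning you would fit alpha and beta against historical data, and add the seasonal term if your business calendar matters.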

Source: Steve's blogs

Any comments are welcome :)


The Art of Scalability by Abbott & Fisher

Thursday, October 18, 2012

Distributed Transactions

The goal

The goal of transactions is to maintain data integrity & consistency in the case of concurrent access to data (e.g. multiple web sessions updating the same database) and/or distributed resources (e.g. sending data to multiple databases/JMS resources). Typically a transaction has the ACID properties (atomicity, consistency, isolation, durability) in order to guarantee data integrity.

X/Open Distributed Transaction Processing (DTP) standard

This is a popular distributed transaction standard; for the architecture diagram, see Kosaraju's article (references below).

The Transaction Manager manages the transaction: it manages the transaction context, maintains associations with the participating resources, and conducts the 2PC and recovery protocols. You can use an open source transaction manager such as Atomikos; a transaction manager is also included in enterprise platforms such as WebLogic.

The Resource Manager is a wrapper over a resource (e.g. a database or a JMS message broker) that implements the XA interface for participating in transactions with a transaction manager. The transaction manager keeps track of which resources are participating in the transaction using the resource enlistment process. For example, in the case of an Oracle database you get the resource manager by using an XA-capable Oracle JDBC driver.

The TX interface is the interface between your application and the transaction manager. Its most important methods are for transaction demarcation (e.g. begin, rollback, commit).

The XA interface is the interface between the resource managers and the transaction manager. Its most important methods are start, end, prepare (the first phase of 2PC), and commit (the second phase of 2PC).

For more details, please read the article by Allamaraju (see the references below).

Java Transaction API (JTA) and Java Transaction Service (JTS)

For the architecture diagram, see Allamaraju's article (references below).

In this transaction management standard from Sun, the API (JTA) follows the X/Open DTP interfaces, while the transaction manager (JTS) implements the OMG OTS specification. The OMG Object Transaction Service (OTS) standard is basically X/Open DTP with additional features such as CORBA interfaces, nested transactions / finer granularity, and a synchronization protocol.

2PC (two-phase commit) protocol

This is a distributed transaction protocol for achieving data integrity. First the transaction manager issues prepare requests to all resource managers. If all the resource managers are ready to commit (thus a kind of voting), the transaction manager issues the commit request; otherwise it issues a rollback.
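The voting logic can be sketched with hypothetical in-memory participants; this shows only the shape of the protocol, not the real XA interface, and a real transaction manager must also log its decisions and handle recovery after crashes:

```java
import java.util.List;

// A toy 2PC coordinator over hypothetical in-memory participants.
public class TwoPhaseCommitSketch {

    interface Participant {
        boolean prepare(); // phase 1: vote; true means "ready to commit"
        void commit();     // phase 2, if every vote was yes
        void rollback();   // phase 2, if any vote was no
    }

    // Commit only if all participants vote yes; otherwise roll everyone back.
    static boolean runTransaction(List<? extends Participant> participants) {
        boolean allPrepared = true;
        for (Participant p : participants) {
            if (!p.prepare()) { allPrepared = false; break; }
        }
        for (Participant p : participants) {
            if (allPrepared) p.commit(); else p.rollback();
        }
        return allPrepared;
    }

    // A trivial participant that votes according to a fixed flag.
    static class DummyParticipant implements Participant {
        final boolean vote;
        boolean committed, rolledBack;
        DummyParticipant(boolean vote) { this.vote = vote; }
        public boolean prepare() { return vote; }
        public void commit() { committed = true; }
        public void rollback() { rolledBack = true; }
    }
}
```

In real XA, the vote is the status returned by each resource manager's prepare call.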

Java examples:

Oracle JDBC XA driver

In the application, you can implement the 2PC voting programmatically by using the statuses returned by the resource managers after you call prepare() on each of them via the XA interface. Based on this vote you then perform either commit() or rollback() on all resource managers. For a complete code example (Oracle JDBC), see the Oracle Database JDBC Developer's Guide referenced below.

Spring with open source JTA implementations

Spring offers an extra abstraction layer for loose coupling between the transaction management implementations & resources (JTA implementations such as Atomikos and JBossTS, JPA/EJB3, Hibernate, plain JDBC, JMS) and the application you write.

Kosaraju wrote a good article (with sample code) about XA-2PC transaction examples in Spring, using the open source JBossTS (use case: 2 databases) and Atomikos (use case: 1 database + 1 JMS/ActiveMQ).

The article also includes a sequence diagram of the Atomikos case.

Rather than programmatically, you can better implement transactions declaratively (either with annotations or AOP), since transaction management is a cross-cutting concern across your applications and can be separated from your core business logic. It's interesting how Spring's declarative transactions, transaction template, and transaction manager hide the complexities of programmatic transactions. In Kosaraju's sequence diagram, the proxy (which is created by the declarative transaction support) invokes the transaction manager's TX interface (doBegin, doCommit). We also see how the XA interface methods (start, end, prepare, commit) are used on the XA resources (database, JMS).

While 2PC provides data integrity, it is expensive performance-wise. You may trade off data integrity for performance using relaxations of XA-2PC (e.g. XA-1PC, XA-Last-Resource) or non-XA patterns (shared resource, best-efforts 1PC); see the article by Syer (with sample code) in the references below.

If you need to learn the basics of Spring transactions (with IoC, annotations, AOP, templates), please read chapter 16 of the "Spring Recipes" book (with example code).

Source: Steve's blogs

Any comments are welcome :)


Nuts and Bolts of Transaction Processing by Allamaraju

Spring Recipes by Mak

Java Transaction Processing

Java Transaction Design Patterns

XA transactions using Spring by Kosaraju

Distributed transactions in Spring, with and without XA by Syer

Oracle Database JDBC Developer's Guide

Where to focus your effort regarding software performance and availability

In order to work effectively we need to focus our efforts on certain areas of the system.
Normally, we make that choice based on risk analysis (e.g. which components have strict requirements/SLAs, which components use new technology) and benefit (features that yield product differentiation). In this blog we will look at two other insights from the literature.


Meier et al. argued that to improve performance the most important area is the application/query design and implementation (e.g. avoid select * and Cartesian products). The next focus candidates are, in order of importance: database design (e.g. normalization/denormalization, indexes), hardware (e.g. disk defragmentation), database-engine tuning (e.g. SQL Server), and OS tuning (e.g. shared memory parameters).

So before spending much time digging into database-engine optimization, better to inspect your queries first.

High availability

Marcus & Stern argued that for high availability the most important area is system administration design/practices, e.g. keep it simple, reuse configurations, separate environments (prod, test, dev), SLAs, the Pareto principle, planning, change control, avoiding single points of failure, etc.

Surprisingly, there are many areas that are more important than service/application design (in order of importance): backup, disk/storage management, networking, environment (UPS / backup electricity, cooling, data center racks), and client/people/process management.

Please see my software review checklists and database design checklists / guidelines.

Source: Steve's blogs

Any comments are welcome :)


Blueprints for High Availability by Marcus & Stern

Improving .NET Application Performance and Scalability by Meier

Monday, October 1, 2012

How to address software performance issues: proactive vs reactive?

As a Java developer, I had quite some fun learning from the .NET communities, for example from the "patterns & practices" series provided for free by Microsoft. Here are some lessons from "Improving .NET Application Performance and Scalability" by Meier.

Reactive approach

• You investigate performance only when you face performance problems after design & coding, to avoid premature optimization.
• Your bet is that you can tune your way out or scale vertically (buying faster/more expensive hardware, more cloud resources). You experience increased hardware expense / total cost of ownership.
• Performance problems are frequently introduced early in the design and cannot always be fixed through tuning or more efficient coding. Also, fixing architectural / design issues later in the cycle is very expensive and not always possible.
• You generally cannot tune a poorly designed system to perform as well as a system that was well designed from the start.

Proactive approach

• You incorporate performance modelling and validation from the early design phase.
• You iteratively test your assumptions / design decisions by prototyping and validating the performance of that design (e.g. Hibernate vs iBatis).
• You evaluate the trade-offs of performance/scalability against other qualities of service (data integrity, security, availability, manageability) from the design phase onwards.
• You know where to focus your optimization efforts.
• You decrease the need to tune and redesign; therefore, you save money.
• You can save money with less expensive hardware or less frequent hardware upgrades.
• You have reduced operational costs.

Performance modelling process

1. Identify key scenarios (use cases with specific performance requirements/SLAs, frequently executed, consuming significant system resources, or running in parallel).
2. Identify the workload (e.g. total concurrent users, data volume).
3. Identify performance objectives (e.g. response time, throughput, resource utilization).
4. Identify budgets (max processing time, server timeout, CPU utilization percentage, memory in MB, disk I/O, network I/O utilization in Mbps, number of database connections, hardware & license cost).
5. Identify the processing steps for each scenario (e.g. order submit, validate, database processing, response to user).

For each step:

6. Allocate budget
7. Evaluate (by prototyping and testing/measuring): does the budget meet the objective? Are the requirements & budget realistic? Do you need to modify the design / deployment topology?
8. Validate your model.
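Steps 6 and 7 can be sketched as a simple budget check; the step names, budgets, and measured timings below are made-up examples:

```java
import java.util.Map;

// A sketch of steps 6-7: each processing step maps to {budget, measured} in ms.
// The scenario passes if every step is within its budget and the budgets
// sum to no more than the end-to-end response time objective.
public class BudgetSketch {

    static boolean meetsObjective(Map<String, double[]> steps, double objectiveMs) {
        double totalBudget = 0;
        for (double[] s : steps.values()) {
            totalBudget += s[0];           // allocated budget (step 6)
            if (s[1] > s[0]) return false; // measured time over budget (step 7)
        }
        return totalBudget <= objectiveMs;
    }
}
```

If a step fails the check, that tells you where to renegotiate the budget, change the design, or change the deployment topology.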

Performance Model Document

The contents:
• Performance objectives.
• Budgets.
• Workloads.
• Itemized scenarios with goals.
• Test cases with goals.

Use risk driven agile architecture

First, prototype and test the most risky areas (e.g. unfamiliar technologies, strict requirements in the SLA). The results will guide your next design step. Repeat the past tests (regression tests) in the next spirals, for example using continuous integration. When you address the most risky areas first, you still have breathing room to look for alternatives or renegotiate with the customers in case of problems.

Source: Steve's blogs

Any comments are welcome :)


"Improving .NET Application Performance and Scalability" by Meier