Tuesday, January 22, 2013

SOA Software Development Guidelines


In my work I am part of a team that is responsible for design & code review. This activity motivated me to compile best practices in software development. I would like to share this compilation in a series of blog posts that cover these topics:

1. SOA Service & Data Design, Process/governance
http://soa-java.blogspot.nl/2013/01/soa-software-guidelines-service-data.html

2. Security checklists
http://soa-java.blogspot.nl/2012/09/security-checklists.html

3.  Performance, Scalability, Transactions management
http://soa-java.blogspot.nl/2013/01/software-guidelines-performance.html

4. Operations, Availability, Fault management, Maintainability 
http://soa-java.blogspot.nl/2013/01/software-guidelines-operations.html

5. Database Guidelines
http://soa-java.blogspot.nl/2013/01/database-guidelines-rdbmssql.html

6. Coding principles, Usability, Test
http://soa-java.blogspot.nl/2013/01/software-guidelines-coding-principles.html

More about software testing:
http://soa-java.blogspot.nl/2012/09/test-checklists.html
http://soa-java.blogspot.nl/2012/09/development-test.html




This blog mainly focuses on enterprise applications with SOA architecture, but most of the tips are also applicable to non-SOA software projects.

Originally it was written from a developer/architect point of view, but then it grew to also incorporate some best practices for operations in the production environment. As a developer, you also need to know about operations (e.g. monitoring, availability, scalability, fault management, maintainability) since your design & coding will greatly influence the operations. As a developer, when you transfer your solution to the client/production team (via the QA/test team), you often also need to deliver the base template for server configurations (e.g. JMS resources), infrastructure (e.g. virtual machine components including the OS/security configurations: iptables, security certificates, etc.), database creation scripts (including DDL, constraints, indexes, stored procedures, triggers), start-up scripts and operational guidelines/documentation. Thus this operational knowledge is indispensable for developers.


In this guideline I sometimes use questions instead of mandatory compliance checks. The goal of the guideline is to make us aware of certain issues, not to force a specific/narrow solution. The "best" choice depends on the project context/environment (e.g. whether the service is exposed to external users or only used by intranet users inside the company firewall).



This is the long list of guidelines as it was before I organized it into the topics linked above:

Service Design


Strategy

  • Use mature technology, be a late adopter, follow the mainstream/use popular tools and technology. Cutting-edge products are usually not as reliable as stable products. Avoid using version 0.x products. Well-accepted products usually have better user support groups (blogs, discussion groups, etc.). Vendors of successful products have usually grown big enough to have well-resourced customer support.
  • Use well-accepted standard solutions (e.g. OAuth, OpenID, WS-Security, SAML, WS-Policy) for better interoperability & security, don't reinvent the wheel. Standard solutions usually result from cooperation between many developers, so the design and implementation are better tested by the community. For sensitive topics such as security or distributed transactions, it's difficult & more risky to build bulletproof code on your own.
  • In general, avoid premature optimization: build a prototype fast, then do performance tests, and only redesign if the performance doesn't meet the SLA. But if the risk profile of the project indicates that performance is very critical (e.g. options trading), you need to include optimization from the early design (e.g. multi-threading).
  • Validate architecture & design early with prototyping and (performance) tests. Know the cost of specific design choices/features, be prepared to cut features or rework areas that do not meet specifications.
  • Don't be cheap. Think broader in terms of ROI: don't save money at the expense of long-term scalability, maintenance, productivity and reliability. Man-hours cost much more than hardware, so it is better to buy reliable high-performance hardware that lets your team work faster and with fewer problems. Buy a better gateway/firewall that can (intelligently) reject DoS attacks and is easy to configure. For your developers, buy "state of the art" PCs with 2 monitors and abundant RAM & storage so they can work smoothly.
  • Don't assume an out-of-the-box solution (e.g. my framework will take care of everything, including my security/scalability/transaction issues). Test/verify the new bounds of operation (e.g. verify that the framework really secures your application) and verify the effects on other quality aspects (e.g. the new framework improves security but now pushes the performance below the SLA).
  • In general scale up (buying better hardware) is easier than scale out (distributing workloads by adding more servers), so try scale up first whenever possible. Be specific about which resource is to be scaled up (e.g. CPU, memory, network). Scale out also adds problems such as synchronization between servers, distributed transactions, how to define the (horizontal) data partitions, and more difficult recovery/fault tolerance implementation.
  • Optimize the application/query & database design first (look for problems such as locking problems, missing database indexes, Cartesian product queries) before you scale up/scale out or dig into database tuning. See "Where to prioritize your effort regarding software performance": http://soa-java.blogspot.nl/2012/10/where-to-focus-your-effort-re-g-arding.html
  • Use separate environments: sandbox/playground, development, test, production.
  • Reuse design/configurations: if you have a successful design/configuration, use it again in other places. Reuse means fewer permutations of configurations, thus easier to manage/learn. It's also more robust since there are fewer things to get wrong and it has already been tested.
  • Scope your requirements. One solution for one problem, avoid trying to build an application capable of all things for all people. Don't try to anticipate every future problem since it's difficult to make accurate assumptions about the future.
  • Design is a combination of science and art. Make decisions based on facts; if the variables are not known, use an educated guess (e.g. based on historical usage data).
  • Define metrics/measurable objectives to validate your design parameters (e.g. throughput in Mbps, conversion rate of our web-shop, etc.)
  • Recognize and remove design contradictions as early as possible. Write down the trade-off decisions and their implications (e.g. security vs performance).
  • Beware of overusing firewalls: they add frustration & lost time during development, test and production. They lead to unnecessarily complex solutions (workarounds via a bastion host) or even render the functional requirement impossible. You might leave low-value static content (e.g. css, static images) without a firewall.
  • Develop in-house vs buying a complex out-of-the-box solution: http://soa-java.blogspot.nl/2012/10/in-house-vs-buying-complex-out-of-box.html
Principles of Service Orientation (http://www.soaprinciples.com):
  • Standardized Contract (e.g. wsdl, xsd schema, ws-policy). Advantages: interoperability, less transformation due to a consistent data model, self-documenting (the purpose, capabilities e.g. QoS, message contents)
  • Loose Coupling
    • Advantages: maintainability, independence (the interfaces can evolve over time with minimum impact on each other), scalability (e.g. easier physical partitioning/clustering)
    • Characteristics: decouple the service contract/interface from implementation/technology details, asynchronous message based (e.g. JMS) instead of synchronous RPC based, separation of logical layers.
  • Service Abstraction: separate the service contract/interface from the implementation, hide technology & business logic details.
  • Reusability
    • Advantages: faster and cheaper realization of new services
    • Characteristics: agnostic service, generic & extensible contract. Avoid any messages, operations, or logic that are consumer or technology specific, reuse services (e.g. for compositions), reuse components from other projects (e.g. xsd/common data model, wsdl, operations, queues)
    • How do you anticipate this service being reused? How can modifications be minimized?
  • Service Autonomy (the service has a high level of control over its runtime environment). Advantages: increased reliability, performance, predictability
  • Statelessness. Advantages: scalability, performance, more reusability due to being agnostic/having less affinity.
  • Discoverability (e.g. using UDDI)
  • Composability (e.g. using BPEL)

Design principles

  • Where are the tightest couplings/dependencies with other services, other systems, etc?
  • Use layered design (e.g. presentation layer, business logic layer) for cohesion, maintainability & scalability.
  • What patterns have been employed? (see e.g. http://soapatterns.org for SOA patterns or Hohpe's book for messaging patterns)
  • Aspect-oriented programming: separate cross-cutting concerns from the main code (e.g. declarative security using policies)
  • Use progressive processing instead of blocking until the entire result is finished, e.g. incremental updates (using a JMS topic per update event instead of a bulk update of the whole database every night), or rendering the GUI progressively with separate threads (using Ajax for example). Use a paging GUI (e.g. display only 20 results and provide a "next" button). This strategy will improve performance, user experience, responsiveness and availability.
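
To make the paging tip concrete, here is a minimal sketch in Python. The function name `get_page` and the in-memory list are my own assumptions for illustration; in a real service you would usually push the paging down into the database query instead of slicing a list.

```python
def get_page(items, page_number, page_size=20):
    """Return one page of results plus a flag for the "next" button.

    page_number is 1-based. This is only a sketch on an in-memory list.
    """
    start = (page_number - 1) * page_size
    chunk = items[start:start + page_size]
    has_next = start + page_size < len(items)
    return chunk, has_next


results = [f"student-{i}" for i in range(45)]   # hypothetical search results
first_page, more = get_page(results, 1)
last_page, more_after_last = get_page(results, 3)
print(len(first_page), more, len(last_page), more_after_last)
```

The caller only ever holds one page in memory, so the first results can be shown while the rest are never fetched unless the user asks for them.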

Simplicity

  • Simplify the requirement (80/20 Pareto prioritizing), design and implementation.
  • Cut specifications / features which are unnecessary.
  • Minimalist design and implementation. Does this service fulfill its goal with minimum effort?
  • Minimize the number of components & connections.
  • Minimize the number of applications & vendors to avoid integration problems.
  • Avoid duplicate provisioning (e.g. authentication data in both LDAP and the database) since then you have the extra problem of synchronizing them.

 

Design decisions

  • For every design decision evaluate its impact on functional & non-functional requirements (e.g. performance, security, maintainability) and on the project plan/constraints (cost, deadline, resources & skills).
  • Prioritize your non-functional requirements for design trade-offs (e.g. if performance ranks above security & reliability you might avoid message encryption & persistent jms)
  • Which integration style does this service use (e.g. RPC-like web service, file transfer, shared database, messaging)?
  • Does this service wrap a legacy application or database? Does this application/database already provide out-of-the-box SOA integration capabilities (e.g. web services, messaging triggers) that I can use? Can I replace the underlying application/database with another vendor/version without propagating many changes to other services?
  • Where will the services be deployed? e.g. cloud providers, internal virtual machines, distributed servers around the world, local PC, etc.
  • Which trade-off do you choose for message structures: rigid contract (wsdl soap) vs flexibility (e.g. no wsdl, REST, generic key-value inputs), considering security/message validation, performance, extensibility, chains of wsdl/xsd changes?
  • Avoid concurrent programming if possible since it's error prone. If you decide to use concurrency make sure that multi-threading problems (race conditions, deadlocks) have been addressed (tested ok).
  • Which transport protocols do you use (e.g. http-soap, http-rest, jms) and why? Do you need to wrap the protocol (e.g. sync http to wrap async jms)? Be aware that your platform perhaps has non-standard protocols that offer better performance (e.g. Weblogic T3, Weblogic SOA-Direct).
  • Do you consider the event-driven paradigm (e.g. jms topic)?
  • Understand the features of your frameworks (e.g. security, transactions, failover, cache, monitoring, load balancing/clustering/parallelizing, logging). Using framework features will simplify your design so you don't have to reimplement those features. Read the vendor recommendations/best practices documents.
Requirement management
  • Have you followed the standards & laws? e.g. the SOA guidelines document in your organization, regulations such as Sarbanes-Oxley (US, financial reporting), privacy laws such as Wet bescherming persoonsgegevens (Netherlands), etc.
  • Are there any real-time requirements (e.g. nuclear plant control system/TUD-IRI)?
  • What is the functional category of this service?
    • Business Process e.g. StudentRegistrationBPELService
    • Business Entity e.g. OsirisService
    • Business Functions e.g. PublishStudentService
    • Utility, e.g. EmailService, Scheduler, LoggingService
    • Security Service (handle identity, authorization)
  • What is the landscape level of this service: Atomic, Domain, Enterprise? (see The Definitive Guide to SOA by Davies et al.)
  • Does this service fulfill the functional & non-functional requirements defined in the specification document?
  • Avoid constantly changing requirements. Avoid feature creep.

Asynchronous pattern

The benefits of async messages:
    • avoid blocking, thus improving responsiveness & throughput (for better performance, user experience & availability)
    • improve reliability/fault tolerance with persistent queues & durable subscribers
    • loose coupling between producer & consumer (queue) or publisher & subscriber (topic)
    • defer heavy processing to off-peak periods (improves performance & availability)
The drawback of async messages is the complexity of the implementation:
    • how to persist messages in the queue in case of a server fault
    • how to handle messages that are not delivered (e.g. a fault in the subscribers)
    • how to handle duplicate or out-of-sequence messages
    • how to inform the caller about the status of processing (e.g. via a status queue or a status database table)
However, with the advances of enterprise integration frameworks (e.g. Oracle OSB), it's becoming easier to deal with these problems.
Beware that some processes need direct feedback (e.g. form authentication, form validation), in which case a synchronous pattern is more appropriate.
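
To illustrate the duplicate/out-of-sequence problem, here is a minimal sketch in Python of a consumer that ignores duplicates and delivers messages in order. It assumes that the producer numbers its messages 1, 2, 3, ... (the `seq` parameter); the class name and the handler callback are hypothetical, and a real implementation would also persist its state and give up waiting for a missing message after some timeout.

```python
class InOrderConsumer:
    """Delivers each sequence number at most once and in ascending order."""

    def __init__(self, handler):
        self.handler = handler   # hypothetical business callback
        self.next_seq = 1        # next sequence number we expect
        self.pending = {}        # early messages waiting for their turn

    def on_message(self, seq, payload):
        # Duplicate: already processed, or already buffered.
        if seq < self.next_seq or seq in self.pending:
            return
        self.pending[seq] = payload
        # Deliver every buffered message that is now in order.
        while self.next_seq in self.pending:
            self.handler(self.pending.pop(self.next_seq))
            self.next_seq += 1


received = []
consumer = InOrderConsumer(received.append)
for seq, payload in [(2, "b"), (1, "a"), (1, "a"), (3, "c"), (2, "b")]:
    consumer.on_message(seq, payload)
print(received)
```

The handler sees every payload exactly once, even though the messages arrive shuffled and two of them arrive twice.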

Software process / governance

  • Establish standard / guidelines for your department, summarize them into checklists for design & code review.
  • Include design & code review in your software development process. See http://soa-java.blogspot.nl/2012/04/software-review.html
  • Establish architecture policies for your department. Assign a clear role that guards the architecture policies and guidelines, e.g. the architects using design/code review.
  • For maintainability & governance: limit the technologies used in the projects. Avoid constantly changing technology while staying open to new ideas. Provide stability so developers can master the technology.
  • Establish change control. More changes mean more chances of failure. You might need to establish a change committee to approve change requests. A change request consists of why, risks, back-out/undo plan, version control of configuration files, schedule. Communicate the schedule to the affected parties beforehand.
  • Use an SLA to explicitly write down user expectations: availability (max downtime), hours of service (critical working hours, weekends/holidays), max users/connections, response time/processing rate (normal mode, degradation mode), monitoring/alerts, crisis management (procedures to handle failures, who/how to get informed, which services/resources have priority), escalation policy, backup/recovery procedure (how much to back up, how often, how long to keep, how long to recover), limitations (e.g. dependency on an external cloud vendor). Quality comes at a price: be realistic when negotiating an SLA, e.g. 99.99% availability means that you have to guarantee less than 1 hour of downtime per year (about 53 minutes), which is quite tough to achieve.
  • Have a contingency plan/crisis management document ready: procedures to handle failures, failover scripts, how to reboot, how to restart in safe mode, configuration backup/restore, how to turn on/turn off/undeploy/deploy modules/services/drivers, who/how to get informed, which services/resources have priority (e.g. telephony service, logging service, security services). Keep multiple printed copies of this document (the intranet and printer may not work during a crisis). The crisis team should have exercised the procedures (e.g. under a simulated DoS attack) and measured the metrics during the exercise (e.g. downtime, throughput during degradation mode). Plan team vacations such that at least one member of the crisis team is always available. Some organizations need a 24/7 full-time dedicated monitoring & support team.
  • Document incidents (root causes, solutions, prevention, lessons learned), add the incident handling procedures to the crisis management document.
  • Establish a consistent communication channel between architects, developer team, external developers, testers, management, stakeholders/clients (e.g. a documentation trac wiki).
  • Communicate design considerations and issues to the project team/developers and stakeholders.
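
The availability numbers in an SLA are easier to negotiate when you translate them into downtime. A small Python sketch of that arithmetic (it assumes a 365-day year and that every minute of the year counts as service time; the function name is my own):

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525600, ignoring leap years


def max_downtime_minutes(availability_percent):
    """Allowed downtime per year, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% availability -> {max_downtime_minutes(target):.1f} minutes of downtime per year")
```

So each extra "nine" divides the allowed downtime by ten: roughly 3.7 days per year at 99%, under 9 hours at 99.9% and under 1 hour at 99.99%.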


Data


Data design
  • Do you use a common data model? How do you enforce consistent terminology and semantics between the terms used in different systems (databases, LDAP, ERP, external apps, etc.)?
  • Use simple SOAP data types (e.g. int). A new datatype introduces overhead during (de)serialization of the messages. Don't use the xsd:any type.
  • How will the data be synchronized between different databases, LDAP directories and external systems?
  • Use MTOM/XOP instead of SwA (SOAP with Attachments) or inline content for transmitting attachments/large binaries
  • Consider the claim check pattern to avoid passing large data through every processing step: http://eaipatterns.com/StoreInLibrary.html
  • Consider different ways to store data: RDBMS database (for highly relational data, ACID), NoSQL/document store (simple key-value data, better scalability/easier to split), file system (better for read-only data).
  • How will your service handle localization/globalization and different encodings/formats? e.g. unicode for diacritics/Cyrillic/Chinese, the DateTime format in your web service vs the format in the database/LDAP/external web service. Do you consider the effect of time zones?
  • If you use file-based integration: are the file permissions right (e.g. access denied for the apache user trying to read a file generated by the weblogic user)? Is the encoding right?
  • Are transactions required? Are compensations/rollbacks required?
  • Be aware of the complexity/scalability of the algorithms you use.
  • Choose the data structure based on the usage (e.g. a tree can be faster for search but is generally slow for insert/update) and on the particular properties (size, ordered, set, hash key, etc.) that motivate you to choose that structure.
  • Choose the right data format so that the need for transformation is minimized
Keep data small
  • Use short element/attribute names to reduce xml size.
  • Limit the level of XML nesting.
  • Delete old/low-value data, move old data to backup/lower-tier storage. Determine how long old data is kept in production and how long it is kept in the backup/lower-tier storage. Storage (& its maintenance) is not free.
  • Reduce the data with transformation (e.g. summary, data cleaning, sampling)
  • Consider message compression (e.g. set gzip content-encoding in the http header)
  • Structure your XSD such that you minimize the information transmitted: use optional elements, minimize the data for a specific purpose (e.g. you don't need to send nationality data if you just want to transmit the email address of a person)
  • SOAP web service style: use literal instead of encoded. The encoded style has several problems: compatibility, and extra data traffic for the embedded data types
  • Use the simplest/smallest data type (e.g. use a short int instead of a long int if a short is already good enough)
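
To get a feeling for what message compression can bring, here is a small Python sketch that gzips a repetitive XML message using the standard `gzip` module. The payload is made up for illustration; how much you gain depends on your own messages, so measure with real data (and remember that compression also costs CPU time).

```python
import gzip

# Hypothetical, repetitive XML payload; real messages will compress differently.
xml = ("<persons>"
       + "".join(f"<person><name>Student {i}</name><email>s{i}@example.org</email></person>"
                 for i in range(200))
       + "</persons>").encode("utf-8")

compressed = gzip.compress(xml)          # what would travel over the wire
restored = gzip.decompress(compressed)   # what the receiver reconstructs

print(len(xml), "bytes before,", len(compressed), "bytes after gzip")
```

XML with long repeated tag names compresses well, which is why compression and short element names attack the same problem from two sides.
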
Data management
  • Can you categorize the quality level of the data? e.g. sensitive data (encrypted/SSL), non-sensitive data (neither encrypted nor signed, to increase performance), critical data (live backup/redundancy), non-critical data (backed up less often), reliable messages (transported with a persistent queue for guaranteed delivery), time-critical/real-time data (use dedicated high-performance hardware/networks).
  • How will the data be backed up? How do you secure the backup data?
  • Is the content correct/accurate? Has the spelling been checked?
  • Logically group data for readability & maintainability (e.g. use a person struct to contain data about a person: name, birth date, etc.)
  • Data is not free: be aware of the cost of data (network bandwidth, processing, storage, people/power/space to maintain storage, backup cost). Eliminate low-value & high-cost data, sample the low-value and low-cost data. http://www.hertzler.com/images/fck/Image/Whitepapers/DataCost-ValueMatrix.jpg
  • Tiered storage based on value and age.


Performance, Scalability, Transactions

General:
  • Be aware of the performance impact on other consumer services/applications and on the whole system (end-to-end performance).
  • Can you classify the service level of the operations, e.g. long running (processStudentFraternityForm) vs time-critical (buyStockOptions)?
  • Choose your transport carefully, e.g. Weblogic T3 is faster than http but http is more interoperable (e.g. firewalls/routers/load balancers can handle http but not T3).
  • Relax the reliability requirements in exchange for performance, e.g.:
    • don't use persistent/durable jms
    • don't use durable BPEL processes
    • don't use QoS exactly once/WS-ReliableMessaging
    • don't use 2PC/XA distributed transactions, use XA relaxations such as XA-LastResource (e.g. Oracle LLR) or non-XA (e.g. best effort)
  • Don't use a (web service/jms) handler if it's not necessary
  • Regularly audit & remove unnecessary services/processes/databases, back up & remove virtual machines which you don't use anymore (e.g. when the development/test phase has finished). Reducing the number of running components will improve performance and reduce the chance of faults.
  • For future scalability, scale the capacity during design (20x), implementation (3x), deployment (1.5x). These numbers are just an indication, the real numbers depend on how fast your service will grow in the coming 1-2 years.
  • Addressing end-to-end performance (profiling and then reworking/tuning only the components that are bottlenecks) is more efficient than tuning the performance of every component.
  • Use throttling.
  • Prioritize your effort regarding software performance: the most important area is the application/query design and implementation (e.g. avoid select * & Cartesian products). The next candidates are, in order of priority: database design (e.g. normalization/denormalization, indexes), hardware (e.g. disk defragmentation), database-engine tuning (e.g. SQL Server) and OS tuning (e.g. shared memory parameters).
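
The throttling tip above can be implemented in several ways; one common technique is a token bucket. Below is a minimal single-threaded Python sketch (my own illustration, not the mechanism of any specific product; the class and parameter names are assumptions). Many platforms and gateways offer throttling as configuration, which is preferable to hand-written code.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a steady `rate_per_sec` afterwards."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity       # start with a full bucket
        self.clock = clock           # injectable so the sketch can be tested
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill according to the elapsed time, but never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                 # caller should reject or queue the request


bucket = TokenBucket(rate_per_sec=5, capacity=2)
print([bucket.allow() for _ in range(4)])
```

A request that gets `False` can be rejected with an error, delayed, or put on a queue, depending on what your SLA promises for degradation mode.
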
Distributed computing and transactions:
  • First of all, if possible avoid using a distributed system, to avoid problems with concurrency and data integrity/synchronization
  • Reduce the number of transactions by combining operations from different transactions into one transaction if possible.
  • Read-only objects don't have to be synchronized/thread safe. For the sake of performance, don't apply thread safety by default.
  • Can the process be done in parallel? Parallelism improves not only performance but also availability/fault tolerance. Do a cost-benefit analysis of the overhead of parallel processing, e.g. parallel processing of small data chunks is not effective. Use a parallel algorithm (e.g. map reduce) if necessary.
  • Minimize lock times:
    • Hold locks only for code that needs atomicity (e.g. read-only code & preprocessing can be done before/outside the transaction)
    • Tune/optimize the performance of the concurrent processes in the transaction (e.g. using partitioning), since the faster the concurrent processes, the smaller the chance of a bottleneck due to lock contention
    • Use optimistic concurrency strategies (but then you need to take care of conflict detection & resolution)
    • Avoid waiting for user responses during a transaction
    • Minimize calls outside the boundary (e.g. invoking an external cloud web service) during a transaction. If possible do the call before/after the transaction.
    • Use compensation instead of relying on transactional rollback if a long transaction is too expensive.
    • Lock: acquire late, release early.
    • Split big transactions/batch jobs: divide & conquer. In case of a fault, you don't have to restart the whole big job.
  • Relax the isolation level if possible (trade off read consistency for performance)
  • Since for most database usage reads are more prevalent than updates, you can parallelize the reads using multiple slaves, as in MySQL master-slave replication. You update the master only and then synchronize the slaves with the master. Backup & data analysis can run on the slaves, which eases the burden on the master.
  • Whenever possible use system transactions (that rely on your application framework or database) instead of building your own business transactions, since offline-concurrency management is difficult/expensive to implement and error prone.
  • Classify the transactions into a light type (critical/needs fast response, e.g. login) and a heavy type (less frequent complex transactions for which more latency is tolerated, e.g. generateReport). Set a different target for transactions per second for each type.
  • Use deadlock detection & resolution (e.g. using the Oracle database EM console). On many platforms you can define the timeout of (2PC) transactions (e.g. Weblogic console: domain JTA config).
  • Balance the granularity of (database) locks: coarse (more contention/locking problems) vs fine-grained locks (more processing overhead), e.g. do you lock the whole database table or only a row?
  • The scalability of clusters & load balancing: can you easily add more nodes to the clusters?
  • If you use multi-threading make sure that there is no performance loss due to locking problems and no deadlock.
  • Different ways to split application servers, database/storage, networks:
    • clone the same instances (x-axis, using the terminology from Abbott & Fisher's book)
    • split different actions/services (y-axis) e.g. getEmployee, updateEmployee, getStudent
    • split similar things (z-axis) based on customer (names beginning with A-M & N-Z) or geography (Europe mirror, US mirror/cache).
    • Splitting will improve performance, caching (smaller cache sizes in case of a y-axis/z-axis split) and fault isolation. In terms of transaction growth, y-axis splitting yields less scalability benefit than x-axis, but y-axis can help to scale with the growth of code complexity (CPU)/data (memory).
  • When you need to use distributed systems (e.g. database clusters for performance), be aware of the constraints of Brewer's CAP theorem: you can't achieve Consistency, high Availability and Partition tolerance all at the same time. So relax the constraints, e.g. using BASE (Basically Available, Soft state, Eventual consistency) to relax ACID consistency. Recognize which CAP relaxation is used by your platform (e.g. Google AppEng/Hadoop prioritizes CA, Amazon S3 prioritizes AP).
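
As an illustration of the optimistic concurrency strategy with conflict detection & resolution, here is a minimal Python sketch based on a version number per record. Everything in it is an assumption for illustration: `VersionedStore` is an in-memory, single-threaded stand-in for a table with a version column (in a database the check would typically be an update that only succeeds when the version still matches), and retrying is just one possible resolution.

```python
class ConflictError(Exception):
    """Raised when somebody else changed the record since we read it."""


class VersionedStore:
    """In-memory stand-in for a table with a version column (not thread safe)."""

    def __init__(self):
        self._rows = {}   # key -> (version, value)

    def read(self, key):
        return self._rows.get(key, (0, None))

    def write(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:       # conflict detection
            raise ConflictError(f"{key}: expected v{expected_version}, found v{current_version}")
        self._rows[key] = (current_version + 1, value)


def update_with_retry(store, key, change, max_attempts=3):
    """Conflict resolution by re-reading and re-applying the change."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            store.write(key, version, change(value))
            return True
        except ConflictError:
            continue
    return False


store = VersionedStore()
store.write("seats", 0, 10)
version_a, seats_a = store.read("seats")    # user A reads v1
version_b, seats_b = store.read("seats")    # user B reads v1
store.write("seats", version_a, seats_a - 1)          # A wins
try:
    store.write("seats", version_b, seats_b - 1)      # B is stale
except ConflictError as error:
    print("conflict detected:", error)
update_with_retry(store, "seats", lambda seats: seats - 1)   # B retries
print(store.read("seats"))
```

No lock is held between read and write, so readers never block each other; the price is that the losing writer has to redo its work.
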
Reduce network traffic:
  • What is the optimal message size/granularity? Granularity/message-size trade-off: use coarse-grained messages (avoid small messages/chatty communication) but avoid messages that are too large/too complex and will overwhelm the server.
  • Wrap chatty interfaces with coarse-grained interfaces (e.g. remote facade & DTO patterns), especially for costly remote calls (between processes/between machines).
  • Put the processing closer to the resources it uses (e.g. PL/SQL)
  • Batch work, e.g. use stored procedures in the database to reduce communication between the database and the business logic layer. You can also send multiple SQL statements in one call, separated by semicolons.

Minimize data:
  • Minimize data (for performance), e.g. when synchronizing between systems export only the changed data (using change events) instead of exporting a huge bulk of data every day.
  • After several iterations of redesign & recoding, you might have to refactor the data model again to minimize the data size. Due to feature creep you might have added too much data to your database DAOs or JMS messages
  • Minimize GUI page size & images
  • Use http compression. This is especially important if SSL (encrypt/decrypt) and serialization/marshalling/transformation are needed, since the bigger the data, the higher the processing cost.
  • Keep calls inside the same process & machine to reduce serialization & network overhead. You might have to trade this off against reusability and scalability (e.g. SOA grid architecture).
  • Serialize/transmit only the data you need to. Some data is only needed temporarily (transient data). Some data is barely needed in the future (e.g. a yearly report, once per year) and can easily be derived from other persisted data.
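
A small Python sketch of "export only the changed data". It compares two snapshots held as dictionaries, which is my own simplification for illustration (the function name and the data are hypothetical); with real change events you would not even need the comparison, because the source system tells you what changed.

```python
def changed_records(previous, current):
    """Return the inserts/updates and the deleted keys between two snapshots."""
    upserts = {key: value for key, value in current.items()
               if key not in previous or previous[key] != value}
    deletes = [key for key in previous if key not in current]
    return upserts, deletes


yesterday = {1: "ann@example.org", 2: "bob@example.org", 3: "cy@example.org"}
today = {1: "ann@example.org", 2: "bob@new.example.org", 4: "dee@example.org"}

upserts, deletes = changed_records(yesterday, today)
print("send:", upserts, "delete:", deletes)
```

Only two records and one deletion have to travel to the other system, instead of the whole table.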

Resource management:
  • Periodically shrink the connection pool to free up connections that are no longer needed.
  • Use timeouts for resources (e.g. database connection pool). The shorter the timeout, the faster the load balancer can detect a failed server and reassign the tasks to other servers, and the faster the node manager (in case of Weblogic) restarts the failed server for failover.
  • Create/load objects on demand (e.g. lazy loading)
  • Cache/pool to reuse resources which are expensive to create (e.g. jms connections, database connections, sessions, thread pool).
  • If you have usage patterns with a lot of think time (small transactions with thinking pauses, e.g. form submissions with small data) you might consider using a shared server to share resources between multiple connections, thus improving scalability. But for a high-traffic, big-data usage pattern it is better to use a dedicated process.
  • Consider the number of resource connections when you determine the thread pool size. Avoid a bottleneck due to more threads than available resources.
  • Connection/resource: acquire late & release early
  • Clean up your resources after the operation finishes (e.g. close file/connection, destructor, dereference pointer, invalidate session)
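
The tips about pooling, timeouts and "acquire late & release early" fit together nicely in one small Python sketch. The `ConnectionPool` class is hypothetical and only meant as an illustration (your application server or driver normally provides the pool); it creates its connections eagerly for simplicity.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool that hands out connections with a timeout."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())   # `factory` creates one expensive resource

    @contextmanager
    def connection(self, timeout=1.0):
        # Acquire late: raises queue.Empty if nothing is free within `timeout`.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Release early and always, even if the caller's code failed.
            self._idle.put(conn)


pool = ConnectionPool(factory=object, size=2)   # `object` stands in for a real connection

prepared = "SELECT 1"                # preprocessing happens before acquiring
with pool.connection() as conn:      # hold the connection only for the actual work
    print("using", conn, "for", prepared)
```

Because the release sits in a `finally` block, a failing operation cannot leak the connection, and the timeout turns an exhausted pool into a fast, visible error instead of a hanging thread.
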
Caching:
  • Cache to reduce processing & network time cost, especially for data that is expensive to query/compute/render. Good cache candidates include presentation-layer data that is often reused for static web pages, and database-layer data generated by stored procedures/queries that the business logic calls often.
  • Decide where to store the cached data (in memory for short periods or in a database for long periods). To reduce network round-trips, implement the cache in the same layer/machine where the processing is (e.g. cache in the presentation layer if the GUI often reuses the same data).
  • For scalability, avoid per-user caches on the server: they cost an enormous amount of memory as the number of sessions grows. If you really need a per-user cache, use a client-side cache instead, but keep the data limited and use encryption & validation if necessary.
  • Validate with performance tests that the cache mechanism really improves performance instead of just adding extra processing (e.g. too much cache memory overhead, too frequent updates)
  • Select a cache key that is not too long but discriminative enough
  • If you need to transform the cached data, transform it before caching to reduce reprocessing
  • Avoid caching data that needs to be synchronized across servers (e.g. a coherence mechanism is expensive)
  • Avoid caching volatile / frequently changing data. For certain applications you may relax the freshness constraint (e.g. refresh the cache only every minute instead of every millisecond).
  • Determine the appropriate interval to refresh the data (too long: stale data problem, too short: performance cost)
  • Decide which refresh/eviction strategy to use, e.g. least recently used, least frequently used, fixed interval
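
As an illustration of the "least recently used" strategy mentioned above, here is a minimal sketch built on java.util.LinkedHashMap. The class name LruCache and the size limit are my own assumptions. The sketch is not thread-safe and has no time-based refresh, so treat it as a starting point rather than a production cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch (hypothetical class, not thread-safe).
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: entries are ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put: evict the least recently used entry when over the limit.
        return size() > maxEntries;
    }
}
```

With a limit of 2, putting "a" and "b", reading "a" and then putting "c" evicts "b", because "a" was used more recently.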

State management:
  • Prefer stateless systems (lower memory, processing and network bandwidth costs; more scalable; less synchronization needed between clusters). Avoid statefulness (passing session data back and forth in every communication) to reduce communication bandwidth and session/server affinity (which hinders scalability). However, stateful communication sometimes can't be avoided (e.g. shopping cart, business process conversation) and can improve performance (e.g. by avoiding reauthentication or requerying data for the same user on every request).
  • Saving the session data in the client (e.g. using HTML5 storage) is more scalable in terms of server memory if you have many clients; client-side session data also avoids the need for data replication/synchronization between servers. But client-side session data has disadvantages: security risk (so use encryption) and increased communication bandwidth (so minimize the data). Another alternative: a dedicated resource manager that stores the state (e.g. Oracle Access Manager).
  • You can save the session data in centralized storage to avoid the need for data replication between servers.
  • If you save session data on the server or in the database, you need to implement clean-up in case the user cancels (with a timeout) and data persistence (in case of a system crash)
  • Minimize session data, especially if you need to save it in shared resources (the bigger the data, the greater the chance of contention problems).
JMS / Messaging system:
  • Use message quota/limit
  • Don't use guaranteed delivery / durable / transactional JMS if it's not necessary
  • When choosing between file and database persistence for JMS QoS, run performance tests to decide, since it depends on many factors such as message sizes, OS, file system (e.g. RAID level) and database tuning. In general, non-synchronous file persistence (no waiting for an ack from the file system) is faster than database or synchronous file persistence.
  • When you use guaranteed-delivery JMS: send acks for groups of messages (coarser granularity)
  • Prevent large backlogs: delete expired messages, use async consumers, monitor the queue and take action to stop further inputs before the system gets overwhelmed
Factors affecting XML web service performance:
  • message size.
  • schema types (e.g. int is faster than integer, time is faster than datetime).
  • handlers (e.g. handlers for inserting/removing WS-Security headers) and transformations (follow XPath best practices, e.g. avoid using //).

Performance tests:
  • Use performance test & profiling to find the bottleneck. You don't have time to tune everything so you need to know which parts should get prioritized for tuning.
  • Load test & stress test: make sure that the response time is acceptable and within the SLA limits.
  • Is the worst response time within the timeout settings of the load balancers/proxies?
  • Load test & stress test: make sure that the processing capacity is acceptable (e.g. 100M XML messages/second). What are the peak volumes?
  • Load test & stress test: make sure that the memory requirement is acceptable
  • Use performance tests and profiling tools to tune the database.
  • Actually, the more appropriate performance metrics are computation counts (e.g. CPU cycles, number of read operations) rather than execution time (which depends on the hardware and on contending CPU/disk operations during the test). Nevertheless, the performance metrics commonly used in SLAs are expressed in time (e.g. web application response time), so ideally we use both operation counts & time (e.g. as reported by the tkprof tool in Oracle database).

Operational

General
  • Have you consulted the operations & infrastructure teams in the design of this service? Verify your design assumptions & constraints early with the production/infrastructure team.
  • Design & plan for rollback in case the deployment of the new version fails or turns out to be buggy. Sometimes fixing forward costs more or is difficult to perform, e.g. if data corruption has spread to multiple places and was not detected until long after the new version was deployed.
  • Centralized configuration management (e.g. Weblogic management console, PAM configuration in linux)
  • Identify deployment constraints in production (e.g. protocol restrictions, firewalls, deployment topology, security policies, operational requirements). Identify the restrictions of external services / cloud resources (protocol, security, throughput, response time, #open-sessions, message size).
  • Write a short "getting started" administration & configuration document (e.g. artifacts, how to install, servers & resources addresses, dependencies, configurations, service request/response examples, troubleshooting / error scenarios). This document will be especially useful when you act as an external consultant / temporary project developer or if you need to pass the project to other colleagues (perhaps you leave the company, or have to take another project, or get promoted :).
  • Minimize deviations from standard configuration for easy maintenance. Avoid esoteric solutions, use standard hardware & software.
  • Use the least-privilege principle (e.g. fine-grained privileges: the sendOrder process can only read the customer address table but doesn't have admin/write privileges over the entire database)
  • Modules and configurations can be changed/loaded/unloaded without restarting the server/OS (e.g. using OSGi or Spring DM).
  • Have standard operating procedure documents for efficiency and consistency (e.g. how to start/shutdown, backup/restore, database creation/purge, virtual machine creation/purge, user management, incident resolution flow). Automate the procedures if possible (e.g. shell scripting).

Monitoring:
·         Visibility: how are availability and application states monitored (e.g. using JMX, Oracle BAM)? At any point during processing, the admin should be able to inspect the data & state information of any individual unit of work.
·         Design to be monitored (e.g. logging, JMX for java application)
·         Continue to measure performance in production. Gather production information (e.g. number of users, usage patterns, data volume) and use it for design input in the next iteration.
·         Monitor the system continuously to prevent faults before they happen (e.g. disk quota, network congestion, memory leaks, unclosed connections/resources, endless process forking/spawning) and use alert features (e.g. alert email to the admin in case of a threshold violation). You can build automatic scripts to clean up (e.g. clean up disk space, move old files, close idle connections, kill processes, restart servers) in case of alerts. Use a monitoring & alert system (e.g. Nagios).
·         In case of failure, how does the support team receive alerts regarding the health of the service?
·         Alert users who are approaching their quota limit.
·         Log & audit admin operations (e.g. new users/grant, change configurations, server restart)
·         Provide health check test scripts to diagnose problems during contingencies (e.g. check connectivity to other systems/resources, check the number of open connections, statistics of transaction errors, response time of critical operations)
·         Review the logs for incomplete information, too much information, mistaken severity level, unreadable/bad format.

Logging

  • Logging operation is centralized (e.g. LoggingService)
  • Use a centralized file location / database for logging
  • Enforce standardization for logging (e.g. which information to be in log, how to log)
  • How do you manage enormous growth of log files (e.g. throttling, alerts, circular files, automatic backup) in case of a critical component failure or DoS attack? How do you prevent the whole server from failing when the disk space is filled up with logs?
  • Avoid people logging in with default accounts (e.g. sa, weblogic), since then you can't identify who performed the actions; give people distinct accounts.
  • Audit the logs regularly for early detection of anomalies
  • Log this information (e.g. as database fields or xml elements in log files):
    • service /application name
    • proxy / class  name
    • server name
    • operation / method name
    • fault message
    • timestamp
    • messageID / requestID
    • message/request being processed (e.g. soap:body)
    • contextual information (e.g. userID, important application states, configuration/environment variables, jms topic/queue name, version of external software, IP-address of the external cloud service)
  • Beware that logs are read by people (often in stressful crisis situations), so readability is important.
  • Trade off between too little and too much information. Logging is overhead and consumes storage. You might throttle the logs to avoid overwhelming the system with big requests.
  • Decide how long you will keep the log files and how you archive them to other (lower grade / cheaper) storage.
  • Keep log files on a server other than the application server: better performance (parallel) and more robust (damage to the application server's file system will not damage the log files, so hopefully we can trace back what happened)
  • Synchronize the time between components (e.g. using NTP) and convert all timestamps to one time zone (e.g. GMT)
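
To make the standard log fields above concrete, here is a small sketch of a formatter that writes them as key=value pairs with a UTC timestamp. The class name, the field selection and the key=value layout are my own assumptions, and it uses java.time (Java 8 or later); in practice you would plug something like this into your logging framework.

```java
import java.time.Instant;

// Hypothetical formatter: one line per log record with a fixed set of fields.
class LogRecordFormatter {
    static String format(Instant timestamp, String service, String server,
                         String operation, String requestId, String fault) {
        // Instant.toString() renders ISO-8601 in UTC, so every component
        // logs in the same time zone.
        return "ts=" + timestamp
                + " service=" + service
                + " server=" + server
                + " operation=" + operation
                + " requestId=" + requestId
                // Quote the free-text fault message and neutralize embedded quotes.
                + " fault=\"" + fault.replace("\"", "'") + "\"";
    }
}
```

Keeping the requestId in every record is what lets you correlate entries from different servers during an incident.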


Availability, Robustness


General
  • What are the availability requirements? What is the impact if the service is down for seconds or hours?
  • Availability is a function of inputs (e.g. validate the inputs), application robustness (e.g. defensive programming, recovery), infrastructure availability (e.g. resource redundancy) and operational best practice (e.g. monitoring). Availability also depends on security (e.g. a DoS attack can cripple availability) and performance (e.g. locking problems can impede performance so that the application is less available to accept requests).
  • Recovery: the user/client should see minimum disruption (e.g. recover session data / last known states, reauthenticate using a reused SSO token without further user intervention)
  • Ideally the modules can be deployed/undeployed without restarting thus the availability of the services is not interrupted.
  • Simple capacity planning: http://soa-java.blogspot.nl/2012/10/simple-capacity-planning.html
  • Minimize the number of servers or components (e.g. storage) to reduce the complexity of maintenance (e.g. backup). Fewer servers mean fewer chances of failure, less need for reboots and a smaller attack surface. Roughly speaking, if you use 10 storage units, each with an MTBF of 20 years, you can expect to deal with a storage failure about once every 2 years. You can minimize the number of servers/storage units by reducing the data, prioritizing the services (e.g. removing old services) or scaling up (bigger CPU, memory, storage).
  • Clean up temp files regularly. Delete/backup old log files.
  • Trade off your timeout (the most common suggestion is about 3 min). Too long a timeout means your system detects failures late, which slows down the failover process. Too short a timeout (below the application/network latency) can make your system mark a working connection as failed.
  • Beware that a fault-tolerance mechanism can hide the root problem. For example, if the load balancer restarts failed servers without alerting the admin, the root problem (e.g. a memory leak) is never detected and corrected. The result is inconsistent performance: initially good, then degrading, then recovering after a restart, then degrading again.
  • Trade off robustness (e.g. being permissive of mild violations for the sake of user experience, substituting default values instead of aborting/crashing) against correctness (e.g. aborting the process in a radiation therapy machine).
  • Provide an alternative way to perform the same tasks in case of outage / unavailability of the primary systems. Train the operators to be familiar with the procedures. For example, if the CustomerRegistrationWebservice breaks down, customer service can still serve new customers, perhaps using an application that communicates directly with the database over a JDBC connection. Or, if everything breaks down, at least the operators know the manual procedures using pen and paper.
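
The rough MTBF calculation in the list above (10 storage units with an MTBF of 20 years each) can be written down as a one-liner. The sketch assumes identical units that fail independently at a constant rate, which is a simplification; the class and method names are hypothetical.

```java
// Rough estimate only: assumes identical, independent units with a constant failure rate.
class FleetFailureEstimate {
    // Expected time between failures anywhere in the fleet, in years.
    static double yearsBetweenFleetFailures(double unitMtbfYears, int units) {
        if (units <= 0) {
            throw new IllegalArgumentException("units must be positive");
        }
        return unitMtbfYears / units;
    }
}
```

For the example above, yearsBetweenFleetFailures(20, 10) gives 2.0, i.e. about one storage failure every 2 years; halving the number of units doubles that interval.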

Error handling
  • Errors are logged and reported in a consistent fashion. What information is needed in the log/report? Which error statistics will be useful to improve your system?
  • Do you use company standard error codes? For security you might translate framework-specific exceptions into standard error codes to hide the details from hackers.
  • How do you handle errors? Do you have redundant clusters / failover? Do you need failure recovery (e.g. self restart)?
  • What runtime exceptions are likely to be generated? How can this be mitigated?
  • Provide local error checks in the client (e.g. using JavaScript to validate the phone field) to prevent the client from submitting wrong requests to the back-end. But still revalidate the request in the back-end to prevent attacks that bypass the client.
  • Release resources (e.g. files, database connections, threads): close connections and transactions after an error.
  • Centralized error handling.
  • Avoid empty catch block. 
  • Make sure that your code/platform handles all possible errors/failures gracefully, informs the users/administrators and preserves data/transaction integrity (e.g. compensation, clean-up), for more robustness and a better user experience.
  • To make sure that your application handles all exceptions gracefully (including system exceptions) you might try to catch all exceptions in your code. But this is not always possible (the set of possible exceptions is too large, and there are performance & maintenance burdens). Also, if you catch the exception yourself, your framework (e.g. Weblogic) may never see it, so critical higher-level mechanisms (e.g. centralized logging/monitoring, global transaction management) will fail to act.
  • If you handle an exception, it is better to handle it locally (close to the context, close to the resources involved) than to throw it to a higher level (except for cross-cutting concerns such as logging/monitoring and global transaction management). In big software projects the developers are highly specialized; when the exception is handled locally, it is handled within the domain of the developers who have the expertise.
  • The catch-rethrow mechanism is expensive and harder to debug, so use it only if you can add some value
  • Don't use exceptions for flow control, because throwing and catching exceptions is expensive
  • Define the exception handling boundaries well to reduce redundancy & inconsistency
  • Methods should throw exceptions instead of returning int codes (e.g. 0 is success, 1 is input error, etc.), since an exception is more expressive
  • Decide how to tolerate nonfatal failures (e.g. disk quota, read/write failure), for example: move the requests to an error queue to be rerun automatically when the failure condition is resolved.
  • Introduce software "fuses": stop the processing when an unrecoverable error happens or when a maximum number of exceptions has occurred. Make sure that the job can be rerun without risk of duplication/inconsistency by compensating / cleaning up the partial results.
  • Apply software "valves" when the system fails: stop processing further inputs (from GUI or consumer services), e.g. move the requests to an error queue. The valves can also be used in conjunction with a scheduler to prioritize work (e.g. to prevent offline OLAP processing from hindering online OLTP performance during working hours).
  • Test the behavior of the system under all possible errors. What is the expected behavior if one of the consumers erroneously sends a surge of (erroneous) inputs? Test whether the exception impacts other processes (e.g. the security service).
  • Consider alternatives to exceptions, e.g. substituting safe default values.
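
The software "fuse" idea above could look roughly like the following sketch. SoftwareFuse and BatchJob are hypothetical names, and parsing integers merely stands in for real processing; the compensation / clean-up of partial results that a real job needs is left out.

```java
import java.util.List;

// Hypothetical fuse: blows on an unrecoverable error or after too many errors.
class SoftwareFuse {
    private final int maxErrors;
    private int errors;
    private boolean blown;

    SoftwareFuse(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    synchronized void recordError(boolean unrecoverable) {
        errors++;
        if (unrecoverable || errors >= maxErrors) {
            blown = true;
        }
    }

    synchronized boolean isBlown() {
        return blown;
    }

    // Manual reset, to be called by an operator once the cause has been fixed.
    synchronized void reset() {
        errors = 0;
        blown = false;
    }
}

class BatchJob {
    // Returns the number of items processed successfully before the fuse blew.
    static int run(List<String> items, SoftwareFuse fuse) {
        int processed = 0;
        for (String item : items) {
            if (fuse.isBlown()) {
                break; // stop processing further inputs
            }
            try {
                Integer.parseInt(item); // stand-in for the real processing step
                processed++;
            } catch (NumberFormatException e) {
                fuse.recordError(false);
            }
        }
        return processed;
    }
}
```

With a fuse of 2 errors and the inputs "1", "x", "2", "y", "3", the job processes two items and stops before "3", leaving the rest for a rerun after the operator resets the fuse.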

Fault prevention
  • Use service throttling (to prevent the server from getting overwhelmed with requests)
  • Try to process data during off-hours in large batches (e.g. scheduled ETL, precompute aggregation functions in an intermediate table)
  • Make sure that long-running transactions / processing stay within the timeout limits of servers and load balancers. You might need to apply the splitter-aggregator pattern (www.eaipatterns.com/Aggregator.html)
  • Prevent further inputs when faults exist in the pipeline (e.g. disable the submit button in the GUI, inform the user "sorry, the service is temporarily unavailable, please try again later.").
  • Use hostnames that are easy for humans (e.g. Baan4_FinanceDept server instead of b4f556 server).
  • Automate manual tasks to reduce human error (e.g. using scripts for build, deployment, test)
  • Avoid processes that run too long by:
    • simplifying data & processes
    • incremental updates (e.g. update using JMS topics instead of a daily database bulk copy)
    • divide & conquer
    • asynchronous processing
    • dedicating a server to long processes
    • rescheduling to nightly batch jobs (e.g. ETL for OLAP)
  • Avoid long transactions; you can use compensation instead of relying on transactional rollback.
  • Minimize the amount of hardware (e.g. some servers don't need a monitor & mouse), minimize the services/applications running on the server, minimize connections (e.g. to file servers).
  • Validate data to filter error/malicious inputs. For example for web service inputs check the data type, range/permitted values, size limit against XSD.
  • Apply business rule validation (e.g. compare the request with the average size to detect request anomalies, verify time/sequence constraints). You can apply these checks in the application or using triggers or database constraints. Convert the values to a standard norm before checking (e.g. USD for money, GMT for time comparisons).
  • Use RAID storage. Generally the performance of parity-based RAID (3, 4, 5) is poorer than mirroring (RAID 1, 0+1, 1+0), but parity RAID reduces the number of hard disks needed. Buy fast RAID hardware if you decide to use parity-based RAID. RAID 5 is better than RAID 3 and 4 because it avoids the parity disk bottleneck. RAID 1+0 is better than RAID 0+1 in terms of survival rate and recovery time.
  • Use assertions to detect conditions that should never occur. Assertions are useful during development & test to detect bugs; in production you can log mild assertion violations or abort the process for serious violations (e.g. in a radiation therapy machine).
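
As an example of the service throttling mentioned at the start of this list, here is a minimal sketch that limits the number of concurrent requests with a java.util.concurrent.Semaphore and rejects the rest immediately. ServiceThrottle is a hypothetical name; in practice your application server or service bus usually offers throttling as a configuration option, so see this only as an illustration of the principle (it needs Java 8 for Supplier).

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical throttle: at most maxConcurrentRequests handlers run at the same time.
class ServiceThrottle {
    private final Semaphore permits;

    ServiceThrottle(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    // Runs the handler if a permit is free; otherwise rejects immediately and
    // returns the fallback, so the caller can answer "please try again later".
    <T> T call(Supplier<T> handler, T busyFallback) {
        if (!permits.tryAcquire()) {
            return busyFallback;
        }
        try {
            return handler.get();
        } finally {
            permits.release(); // always give the permit back, also on exceptions
        }
    }
}
```

Rejecting fast instead of queueing keeps the server responsive under a surge; whether to reject or to queue the excess requests depends on your requirements.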

Fault tolerant system:
  • Data and transactions are not lost/corrupted; data integrity/consistency across systems is maintained.
  • Use redundancy / partitioning for storage (e.g. RAID) for better reliability and performance.
  • If you really need high availability and can afford it: maintain an idle copy of the server at another physical site for fast disaster recovery. Use a load balancer to switch the requests automatically in case of disaster.
  • Propagate the fault status along the (service) consumer chain while maintaining graceful degradation, e.g. replace the failed service with a dummy response that conveys status=fault information, or use automatic markdown with a timeout.
  • Use a persistent queue to reduce the impact when faults exist in the pipeline; to avoid a bottleneck in this queue, limit the inputs while the faults exist.
  • Design for N+1 for failure, e.g. if capacity planning suggests that you need 3 servers, add 1 more in case of failure.
  • Redundancy for application servers, databases, networks (including routers, gateways/firewalls). Use a load balancer to route load away from failed servers and to distribute the work efficiently. The redundant components should be isolated from each other, e.g. don't put them as virtual machines on the same hardware; use dedicated infrastructure if high availability is demanded.
  • For robustness, avoid single points of failure (SPOF) & minimize components in series: remove unnecessary components, add parallel redundancies. Beware of second-order SPOFs, e.g. dependency on networks, DNS server, firewall router, central security service, single backup system, the cooling system of the server room. When one of these second-order SPOFs breaks down, the whole infrastructure will fail.
  • Can you isolate the failure (e.g. the security services will not be compromised)? Consider "swim lanes" fault isolation: divide the process into domains. The domains don't share their database, servers, disks, host-hardware. The domains don't share services. No synchronous calls between swim lanes, and only limited async calls between swim lanes (calls between swim lanes increase the chance of failure propagation).

Automated recovery techniques
  • Prefer failing gracefully: the users don't notice that a failure happened, and uncommitted work is preserved.
  • If possible, remove/recover from the failure condition before retry/restart.
  • Provide tools to trigger the recovery manually (in case the automatic mechanism/detection doesn't work)
  • Provide transparency about the number of transactions in each state (e.g. done, pending, failed, retry). Some software persists the current transactions in a file/database so that the transactions can be recovered after a crash (e.g. Weblogic transaction log or Oracle redo log).
  • How easily does the system recover from wrong data (e.g. do I need to reboot the whole system or rebuild the database schema, or does it have self-recovery/failover capabilities)? Can the system recognize and clean up the wrong data or move it to the error queue? If I restart the system, will it repeat the same problems because the erroneous data hasn't been cleaned up from the inputs/database?
  • Failover (on the server or the client side): self clean-up and restart
  • If the client/server needs to restart: how to reconnect & log in again, preferably without requiring intervention from the user (e.g. using an SSO token).
  • If the other side of the conversation is restarted: how to detect that a restart happened, how to restart/redo the transactions (while maintaining data integrity, e.g. avoiding double requests), how to reestablish the conversation, how to recover session data.
  • Know which states should be cleaned up and which states can be preserved to speed up the restart process & to preserve uncommitted work.
  • Gather as much information as possible about the failure condition (e.g. states that caused the failure, resources that are not responding); this is especially useful for memory-related problems. Save it in protected log files instead of displaying it in the GUI / blue screen (to avoid information gathering by attackers).
  • Enable the resource to reuse port/connection to avoid "address already in use" error in case of reconnect.
  • Check the application heartbeat: the process might be stuck if it hasn't returned to the main loop / listening state, in which case you might need to kill or restart the process.
  • Audit the automated recovery progress. Provide exception/alert if the recovery fails.
  • After recovery, is it possible to continue the process with minimum rework (e.g. if the fault happens in the 10000th iteration, you'd better continue from the state of the last iteration instead of having to redo the 9999 previous iterations)?
  • Some automated recovery techniques:
    • wait & retry.
    • clean up (e.g. clean up disk space, move old files, close idle connections, kill processes, restart server instances) without restarting the OS
    • save the entire memory and restart the OS
    • save the last good states (not the entire memory) and restart: this works for memory leak problems, but you might end up with the same problem given the same initial states & a deterministic algorithm, so you might have to randomize.
    • restart from a checkpoint (using periodically saved configurations/states): useful to avoid long startup/recovery.
    • crash and restart (last resort, you might lose uncommitted data)
  • If your failover routine includes retry:
    • make sure that the target systems can handle duplicate messages.
    • you might need to compensate / clean up the partial results before retrying.
    • decide when the system stops retrying & escalates the problem (e.g. alert the operator for manual intervention). For example, if the business requirement demands a response within 3 hours and a manual intervention needs 1 hour, then your system has 3 - 1 = 2 hours to retry.
    • limit the number of retry attempts. Ideally you can configure the number of retries (such as in Weblogic JMS).
    • decide the retry interval (too short a retry interval may overwhelm the system)
    • you can use a queue to persist unprocessed requests. The queue can provide information such as how many requests are in the queue, the last time a request was successfully processed, and the last failed request.
    • provide a mechanism to disable the retry mechanism
    • provide manual control to reset and resume the system
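
The retry tips above (limited attempts, a retry interval, a retry budget derived from the deadline, escalation) can be combined in a small sketch. RetryPolicy and its parameters are hypothetical names of mine; the sketch does not cover duplicate handling or compensation, which the target systems still need to take care of.

```java
import java.util.function.Supplier;

// Hypothetical retry helper: bounded attempts, fixed interval, time budget, escalation.
class RetryPolicy {
    private final int maxAttempts;
    private final long intervalMillis;
    private final long retryBudgetMillis;

    RetryPolicy(int maxAttempts, long intervalMillis, long retryBudgetMillis) {
        this.maxAttempts = maxAttempts;
        this.intervalMillis = intervalMillis;
        this.retryBudgetMillis = retryBudgetMillis;
    }

    // Retry budget = business deadline minus the time a manual intervention needs.
    static long retryBudgetMillis(long deadlineMillis, long manualInterventionMillis) {
        return Math.max(0, deadlineMillis - manualInterventionMillis);
    }

    <T> T execute(Supplier<T> task) {
        long start = System.currentTimeMillis();
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and decide whether to retry
            }
            long elapsed = System.currentTimeMillis() - start;
            if (attempt == maxAttempts || elapsed + intervalMillis >= retryBudgetMillis) {
                break; // out of attempts or out of time
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        // Escalate: the caller should alert an operator for manual intervention.
        throw new IllegalStateException("retries exhausted, escalate to operator", last);
    }
}
```

For the example in the list, retryBudgetMillis with a 3-hour deadline and a 1-hour manual intervention leaves 2 hours for automatic retries; the IllegalStateException marks the point where an alert to the operator should be raised.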

Backup strategies
  • Partial restore (restore only the damaged parts) to speed up recovery
  • If the data is too big, use partial backup or incremental backup (e.g. level 0 every weekend, level 1 every day)
  • Schedule the backup process so that it doesn't interfere with daily operational performance (e.g. nights/weekends)
  • For data consistency, prefer cold backup (prevent updates during backup)
  • Read-only data (e.g. department codes) can be backed up separately and less often
  • Test your backup files (e.g. Oracle RMAN validate / restore ... validate) and restore/recovery procedures (by exercising the procedures)

Maintainability

  • Document the code, document installation/configuration procedures.
  • For easier interoperability & maintenance: use homogeneous networks (e.g. all switches & routers are from Novell) and software platforms (e.g. service bus & BPEL engines both from Weblogic/BEA, all databases are from Oracle). But beware of vendor lock-in.
  • Reduce the use of many (& complex) third-party solutions; each third-party software introduces costs for hardware & personnel and vendor dependency. For example, I know a company which uses Baan, PeopleSoft and SAP/BO, and they are kept busy integrating that software along with others (LDAP, etc.), spending a lot of money to hire specialists for custom work during upgrades & integration.
  • Use text-based formats for data exchange (e.g. XML for JMS messages, JSON), since text is more readable than binaries and so easier to debug.
  • For traceability to the original request, include request (& sender) identifiers (e.g. requestID, clientID, sequenceID, wsa:MessageID, wsa:RelatesTo) in the reply, intermediate processing messages, (error) logs, and messages to/from external systems for intermediate processing. The wsa here is the WS-Addressing namespace. This traceability is useful not only for debugging but also for request status monitoring, e.g. when you need to intervene manually in a stuck process in a series of business processes, you know which processing steps have been done or not for a particular request.
  • Register defects using a bug tracking system, e.g. Trac, Bugzilla.

Version management

·         How are the services & wsdl/schemas (e.g. common data model) versioned?
·         How are the services retired? Do you consider backward compatibility? How many back-versions will you keep maintaining?
·         Provide standard structures in the software configuration management/SCM folders (e.g. folders for Java codes, BPEL codes, WSDL, XSD, XSLT/Xq,  test codes, server-specific-deployment-plan, project documentations, project artifacts/deliverables).
·         Keep a minimum of code in the SCM: no prototypes (which might ignore some QoS such as security) or deprecated code in the head SCM revision. Put the prototypes in branches.
·         Minimize the amounts of jms resources: use the same channel for different message versions (e.g. identified by version tag in the soap header/namespace) so instead of using 3 topics (updateEmployee1.1, updateEmployee1.0, insertEmployee1.0) you can use only 1 Employee topic for better manageability & performance. If your service can only process a specific version, use selective consumer pattern http://www.enterpriseintegrationpatterns.com/MessageSelector.html.



Coding principles

  • Follow the best practices / style guidelines in your programming environment / programming language. Use automatic style checking (e.g. findbugs, checkstyle, PMD).
  • Strive for self-documenting code. Add comments if necessary. Use Javadoc.
  • Use a good IDE (e.g. Eclipse, VisualStudio) and good tools (e.g. SOAPUI for web service test, Maven for build, Hudson for continuous integration, Trac.)
  • Use software configuration management (SCM) system (e.g. SVN)
  • Use proven design patterns and beware of anti-patterns
  • Limit accessibility: e.g. declare classes/methods/fields as private instead of public
  • Loose coupling, strong cohesion e.g. Spring dependency injection.
  • Aspect oriented programming, separation of concerns (e.g.  Spring AOP for logging, security, transaction, error handling, cache)
  • Use declarative configurations instead of programmatic coding for cross cutting concerns (e.g. WS-policy for security) . The code ideally only focus to the business logic, has little knowledge of the framework (e.g. using AOP & declarative transactions in Spring to hide the underlying transaction mechanism).
  • Use templates to reduce coding (e.g. Spring DAO templates, Velocity JSP templates)
  • Use standard libraries/solutions; don't reinvent the wheel.
  • If you use multithreading: beware of the effects of locking on performance, and beware of race conditions
  • Defensive programming:  test before executing (e.g. null checking in Java/BPEL), handle exception gracefully, protect from bad inputs (validate, substitute with legal values).
  • Beware of common errors in your development language (e.g. null pointer exception in Java or buffer overflow in C++)
  • Use abstraction layer (e.g. use JAAS for authentication, DAO layer for database access) for loose coupling / flexibility
  • Choose libraries, vendors, tools carefully, consider e.g. maturity, popularity, support, future viability
  • Use thin clients (more robust, better performance)
  • Use early/static binding for better run-time performance
  • Use pre-assigned sizes instead of dynamically growing data types
  • Use a fixed number of parameters instead of a dynamic number of parameters
  • Build the instrumentation up front during coding, e.g. test (TDD, performance measure), log, build/deploy script.
  • Reuse result (e.g. using temp variables) to reduce number of calls
  • Use implicit interface instead of explicit to reduce method call overhead.
  • If you use async/multithreading, do you have message timing/sequencing problems and how do you deal with them? E.g. what if the software receives 3 message events (in arbitrary sequence) which are order-interdependent?
  • Write a short "getting started" developer document (e.g. designs, data models, class diagrams, dependencies, configurations, service request/response examples, troubleshooting / error scenarios). This document will be especially useful when you act as an external consultant / temporary project developer or if you need to pass the project to other colleagues (perhaps you leave the company, or have to take another project, or get promoted :).

Test

  • Have test cases been created for (all) use cases and functional requirements? Do you test all SLA/non-functional requirements (e.g. response time, availability/robustness, compatibility)? Are the test cases traceable to the requirement numbers?
  • Have test cases been created for all exceptions (negative tests), including network failures?
  • Have you tested a variety of data inputs (valid, invalid, null/empty, boundary values, long input)?
  • Are the tests reproducible (e.g. automated, documented, code available in SCM, test case inputs in the database)? It's advisable to rerun the tests (regression tests, performance tests) when the administrator adds a new module/patch, adds a new service, or changes the configuration.
  • How do you perform regression tests to prevent side effects (e.g. triggered by Hudson/a continuous integration framework)?
  • Do you also consider exploratory tests? Does the person who performs the exploratory tests have enough experience? Does a more experienced person need to assist with pair testing?
  • Is the test environment comparable with production (e.g. hardware performance, security constraints, OS/software/patch versions, server configurations)? Do you use a virtual lab to clone the production environment for testing (e.g. LabManager)?
  • Use realistic data.
  • See test checklists: http://soa-java.blogspot.nl/2012/09/test-checklists.html
  • Reconciliation test (e.g. for asynchronous processing, for automatic document processing): compare the number of input orders with the number of fulfilments/outputs. This test detects 2 problems: orders that are never fulfilled and orders that are fulfilled twice.
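
The reconciliation test above could look roughly like the sketch below. It assumes that both the orders and the fulfilments can be exported as lists of order IDs; the class `Reconciliation`, its method names and the sample IDs are hypothetical and only serve as illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of a reconciliation check between orders and fulfilments. */
final class Reconciliation {

    /** Orders that were submitted but have no fulfilment at all. */
    static Set<String> neverFulfilled(List<String> orderIds, List<String> fulfilledOrderIds) {
        Set<String> missing = new TreeSet<>(orderIds);
        missing.removeAll(fulfilledOrderIds);
        return missing;
    }

    /** Orders that appear more than once among the fulfilments. */
    static Set<String> fulfilledMoreThanOnce(List<String> fulfilledOrderIds) {
        Map<String, Integer> counts = new HashMap<>();
        for (String id : fulfilledOrderIds) {
            counts.merge(id, 1, Integer::sum); // count occurrences per order ID
        }
        Set<String> duplicates = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.add(entry.getKey());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> orders = List.of("A-1", "A-2", "A-3");
        List<String> fulfilments = List.of("A-1", "A-3", "A-3");
        System.out.println(neverFulfilled(orders, fulfilments)); // [A-2]
        System.out.println(fulfilledMoreThanOnce(fulfilments));  // [A-3]
    }
}
```

In this sample the totals match (3 orders, 3 fulfilments) although one order is missing and one is fulfilled twice, so it is safer to reconcile per order ID than to compare the totals only.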


Usability, GUI, User-friendliness

  • Involve users during GUI design, prototyping, test (e.g. regular Sprint demo)
  • Use iterative prototyping (e.g. Scrum sprint demo) for frequent user feedback.
  • Avoid complex pages. Keep it simple. Start with just enough requirements. Design and implement no more than the requirements need. Use a minimum number of GUI widgets.
  • Anticipate user mistakes (provide a cancel/undo button; program defensively, e.g. against invalid user input).
  • Minimize user effort.
  • GUI structure & flow/navigation are clear, intuitive, logical, consistent and predictable. Use the business workflow to drive the design of GUI forms & flows.
  • Conform to the user culture (e.g. domain terminology) and the standard web style (e.g. colors, typography, layout) of the user's organization.
  • User documentation / help provided.
  • Update GUI progressively with separate threads (using Ajax for example) to improve responsiveness.
  • Use paging GUI (e.g. display only 20 results and provide a "next" button).
  • Encourage the user to enter a detailed query in order to reduce the number of results and minimize the round trips of multiple searches.
  • Update the user with the application status (e.g. progress bar) and manage user expectations (e.g. "Your request has been submitted. You will receive a notification within 2 days.").
  • To avoid surprise and confusion, inform the user when the application is about to forward them to an external application (e.g. before an OAuth authorization confirmation, before an IDEAL money transaction).
  • Images cost bandwidth (especially on mobile devices), so minimize image sizes & the number of images.
  • Avoid expensive computation while the user is waiting; use an asynchronous pattern or render the result progressively.
  • When the backend is busy, prevent impatient users from resending requests (which would hurt availability even more) by informing the user (e.g. "Your request is being processed, please wait") or by disabling the submit button.
  • For mobile web/applications:
    • Reduce information (due to the limited screen): use only about 20% of the information/features of the normal web version.
    • GUI components are big enough and well-separated for finger touch input.
    • Provide links to the normal (PC version) webpage or text-only (low bandwidth) version.
    • Device awareness & content adaptation e.g. viewport according to screen size.
See Web-GUI checklist http://www.maxdesign.com.au/articles/checklist/
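
The paging tip above (display only 20 results and provide a "next" button) can be sketched as follows. This is only an in-memory illustration; the class `Pager` and its methods are hypothetical names, and the page size of 20 is taken from the example in the tip.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical sketch of server-side paging over an in-memory result list. */
final class Pager {
    static final int PAGE_SIZE = 20; // from the "display only 20 results" tip

    /** Returns the items of a zero-based page; an empty list if the page is out of range. */
    static <T> List<T> page(List<T> results, int pageIndex) {
        if (pageIndex < 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0");
        }
        long from = (long) pageIndex * PAGE_SIZE; // long avoids int overflow
        if (from >= results.size()) {
            return List.of();
        }
        int to = (int) Math.min(from + PAGE_SIZE, results.size());
        return results.subList((int) from, to);
    }

    /** Tells the GUI whether a "next" button should be shown. */
    static boolean hasNext(List<?> results, int pageIndex) {
        return ((long) pageIndex + 1) * PAGE_SIZE < results.size();
    }

    public static void main(String[] args) {
        List<Integer> results = IntStream.rangeClosed(1, 45).boxed().collect(Collectors.toList());
        System.out.println(page(results, 2));    // [41, 42, 43, 44, 45]
        System.out.println(hasNext(results, 1)); // true
        System.out.println(hasNext(results, 2)); // false
    }
}
```

In a real application the paging would normally be done in the query itself, so that only one page of data travels from the database to the GUI.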


Source: Steve's blogs http://soa-java.blogspot.com/

Any comments are welcome :)





I use "indexes" instead of "indices" since it's acceptable in modern English usage (I use English in this blog instead of Latin).