Reliable, Scalable, Maintainable Applications

Published on 2021-01-04

Data-intensive applications are built from standard building blocks. They need to:

  1. Store data - databases
  2. Speed up reads - caches
  3. Search data - search indexes
  4. Send a message to another process for asynchronous processing - stream processing
  5. Periodically crunch accumulated data - batch processing

Which tools we use and how we combine them depends on the application we are building. A given subtask may be handled really well by a particular tool, and we stitch many such tools together in application code. For example, we may have a primary database, a search index, and a cache, and it is usually the application code's responsibility to keep all three in sync.
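
As a rough illustration of that coordination, here is a minimal Python sketch of a write path that updates all three stores from application code. The client objects and their methods (db.upsert, cache.delete, search_index.index) are hypothetical stand-ins for whatever database, cache, and search clients are actually in use, not a specific library's API.

```python
class UserStore:
    """Keeps the primary database, the cache, and the search index in sync."""

    def __init__(self, db, cache, search_index):
        self.db = db                      # primary database: the source of truth
        self.cache = cache                # read cache keyed by user id
        self.search_index = search_index  # full-text index over user profiles

    def update_user(self, user_id, profile):
        # 1. Write to the primary database first, since it is the source of truth.
        self.db.upsert("users", key=user_id, value=profile)
        # 2. Invalidate the cached entry so readers do not see stale data.
        self.cache.delete(f"user:{user_id}")
        # 3. Re-index the document so search results reflect the new profile.
        self.search_index.index("users", doc_id=user_id, body=profile)
```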

We need to ensure the following when building data systems:

  1. Reliability: Continuing to work correctly as expected even when hardware or software faults occur or human operators make mistakes
  2. Scalability: Having reasonable ways of dealing with growth in data volume, traffic, or complexity
  3. Maintainability: Allowing developers and operators to work productively, both while maintaining the existing system and while enhancing it

Reliability

Typical expectations of any application are:

  1. Perform the function the user expected
  2. Tolerate users making mistakes or using the software in unexpected ways
  3. Provide good enough performance under the expected load and data volume
  4. Prevent unauthorized access and protect sensitive data

A system is reliable if it continues to meet the above expectations even when things go wrong.

A fault is when a component stops functioning as expected, whereas a failure is when the system as a whole stops providing the required service; a failure may result from one or more faults. A system that anticipates relevant faults and copes with them is called fault-tolerant or resilient.

Many critical bugs in production turn out to be due to poor error handling or to scenarios that had never been encountered before. It is therefore worth deliberately inducing faults during testing, for example by bringing down services at random, to ensure that the fault-tolerance mechanisms are in place and working correctly. Example: Netflix's Chaos Monkey.
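
One way to exercise those fault-tolerance paths in tests, in the spirit of Chaos Monkey but at a much smaller scale, is to wrap a dependency so that it fails at random. A minimal sketch, assuming a real_service object with a call method (a hypothetical stand-in for an actual client):

```python
import random

class FlakyService:
    """Test double that injects random failures into calls to a real service."""

    def __init__(self, real_service, failure_rate=0.2, seed=None):
        self.real_service = real_service
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def call(self, request):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")  # simulates a crashed or unreachable node
        return self.real_service.call(request)
```

A test can then wrap the real client in FlakyService and assert that retries, fallbacks, or other fault-tolerance mechanisms still produce a correct overall result.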

We should generally prefer tolerating faults over preventing them. The exception is security: faults such as a malicious compromise of the system or unauthorized access to sensitive data cannot be undone, so they must be prevented.

Hardware Faults

For most applications, redundancy of individual hardware components suffices. However, as applications use multiple machines to deal with increased data volumes, the rate of hardware faults increases, and the trend is towards building systems that can tolerate the loss of entire machines. A single-node server will have downtime, for example for a reboot during upgrades or when applying OS security patches. A fault-tolerant system with multiple nodes can be patched one node at a time with no system downtime, also known as a rolling upgrade.
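
Conceptually, a rolling upgrade is just a loop over the nodes that takes each one out of rotation, patches it, and verifies its health before moving on. A minimal sketch; the load balancer, patching, and health-check helpers are hypothetical:

```python
import time

def rolling_upgrade(nodes, load_balancer, apply_patch, is_healthy, settle_seconds=30):
    """Patch nodes one at a time so the system as a whole stays available."""
    for node in nodes:
        load_balancer.drain(node)      # stop routing new requests to this node
        apply_patch(node)              # e.g. install OS security patches and reboot
        while not is_healthy(node):    # wait until the node passes its health checks
            time.sleep(5)
        load_balancer.enable(node)     # put the node back into rotation
        time.sleep(settle_seconds)     # let traffic and metrics settle before the next node
```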

Software Errors

Hardware faults are mostly random and independent of each other. Software errors, such as bugs, are systematic errors within the system: because the same code runs on every node, a single fault can bring down the whole system. There is no single solution; many small things help - thorough testing, process isolation, crashing and restarting services during testing, monitoring, alerting, and so on.

Human Errors

Humans are unreliable. A large share of outages are caused by configuration errors made by operators. Systems can be made more reliable by:

  1. Minimizing opportunities for error, for example with well-designed APIs and simple, easy-to-use admin interfaces that make it easy to do the right thing.
  2. Providing fully featured sandbox environments where people can experiment safely with real data without affecting real users.
  3. Testing thoroughly at all levels - unit tests, integration tests, and manual and automated tests.
  4. Providing ways to recover quickly from human errors, such as configuration rollback, gradual roll-out of new code, and the ability to recompute data in case a previous computation was wrong.
  5. Setting up detailed and clear monitoring, such as performance metrics and error rates (known as telemetry in other engineering disciplines); see the sketch after this list.
  6. Sharing best practices and training.
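
As a sketch of what such telemetry can look like in practice, one common approach is to export request and error counters for a monitoring system to scrape. The example below assumes the prometheus_client Python library; the metric names are made up for illustration:

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
ERRORS = Counter("app_errors_total", "Requests that ended in an error")
LATENCY = Histogram("app_request_seconds", "Request duration in seconds")

def handle_request(process):
    """Wrap a request handler so every call is counted and timed."""
    REQUESTS.inc()
    with LATENCY.time():
        try:
            return process()
        except Exception:
            ERRORS.inc()  # error rate = app_errors_total / app_requests_total
            raise

if __name__ == "__main__":
    start_http_server(8000)  # expose metrics on /metrics for the monitoring system to scrape
```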

Reliability is important even if the application is non-critical: an unreliable system can cause lost productivity, legal issues if data is reported incorrectly, lost revenue, and damage to reputation.