Data-intensive applications are built from standard building blocks. They typically need to:
- Store data so that it, or another application, can find it again later (databases)
- Remember the result of an expensive operation, to speed up reads (caches)
- Allow users to search data by keyword or filter it in various ways (search indexes)
- Send a message to another process, to be handled asynchronously (stream processing)
- Periodically crunch a large amount of accumulated data (batch processing)
Which ones we use, and how we combine them, depends on the application we are building. Each subtask may be handled well by a particular tool, and we stitch many such tools together in application code to build the application. For example, we may have a primary database, a search index, and a cache, and it is usually the application code's responsibility to keep all three in sync; a minimal sketch follows.
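As an illustration of that responsibility, here is a hedged sketch of a write path that updates all three stores from application code. The `db`, `search_index`, and `cache` clients and their methods (`execute`, `index`, `delete`) are hypothetical stand-ins, not any specific library's API.

```python
# Hypothetical sketch: the application writes to the primary database first
# (the system of record), then updates the search index and invalidates the
# cache entry so the next read repopulates it.

def update_user(db, search_index, cache, user_id, new_fields):
    # 1. Write to the primary database, the authoritative copy of the data.
    db.execute(
        "UPDATE users SET name = %s, email = %s WHERE id = %s",
        (new_fields["name"], new_fields["email"], user_id),
    )

    # 2. Update the search index so keyword queries see the change.
    search_index.index(doc_id=user_id, body=new_fields)

    # 3. Invalidate the cached entry rather than updating it in place.
    cache.delete(f"user:{user_id}")
```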
We need to ensure the following when building data systems:
- Reliability: the system continues to work correctly even when things go wrong
- Scalability: there are reasonable ways of dealing with growth in data volume, traffic, or complexity
- Maintainability: many different people can work on the system productively over time
Typical expectations of any application are:
- It performs the function the user expected
- It tolerates the user making mistakes or using the software in unexpected ways
- Its performance is good enough for the required use case, under the expected load and data volume
- It prevents unauthorized access and abuse
A system is reliable if it continues to meet the above expectations even when things go wrong.
A fault is when a component stops functioning as expected, whereas a failure is when the system as a whole stops providing the required service; a failure may be the result of multiple faults. A system that anticipates relevant faults and copes with them is called a fault-tolerant or resilient system.
Many critical bugs in production turn out to be due to poor error handling or scenarios that were never encountered before. It is good to deliberately induce faults during testing, for example by bringing down services at random, to ensure that fault-tolerance mechanisms are in place and working correctly. Example: Netflix's Chaos Monkey. A minimal sketch of this idea follows.
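A hedged sketch of the fault-injection idea (inspired by Chaos Monkey, not its actual implementation): pick a running service process at random and kill it abruptly, then let the test suite verify the system still meets its expectations. The list of process IDs is assumed to be supplied by the test harness.

```python
# POSIX-only sketch: randomly kill one service instance under test.
import os
import random
import signal

def kill_random_service(pids):
    """pids: process IDs of the service instances under test (assumed given)."""
    victim = random.choice(pids)
    os.kill(victim, signal.SIGKILL)  # abrupt crash, no graceful shutdown
    return victim  # so the test can log which instance was killed
```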
We generally prefer tolerating faults over preventing them. The exception is cases where prevention is the only option because no cure exists, notably security: malicious intrusion into systems or unauthorized access to sensitive data cannot be undone.
For most applications, redundancy of individual hardware components suffices. However, as applications use multiple machines to deal with increased data volumes, the rate of hardware faults increases, and the trend is towards building systems that can tolerate the loss of entire machines. A single-node server will have downtime, for example due to a reboot needed during upgrades or when applying OS security patches. A fault-tolerant system with multiple nodes can be patched one node at a time with no system downtime, also known as a rolling upgrade (sketched below).
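A rolling upgrade might look like the following sketch. The `nodes` objects and their `drain()`, `patch()`, `restart()`, and `is_healthy()` methods are hypothetical helpers standing in for whatever orchestration tooling is actually in use.

```python
# Sketch of a rolling upgrade: take one node out of service at a time,
# patch it, and wait for it to become healthy before moving to the next,
# so the rest of the cluster keeps serving traffic throughout.
import time

def rolling_upgrade(nodes, health_check_timeout=300):
    for node in nodes:
        node.drain()    # stop routing new requests to this node
        node.patch()    # apply the OS/security update
        node.restart()  # reboot if the patch requires it

        deadline = time.time() + health_check_timeout
        while not node.is_healthy():
            if time.time() > deadline:
                raise RuntimeError(f"{node} did not come back after the upgrade")
            time.sleep(5)
```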
Hardware faults are mostly independent of one another. Software errors, by contrast, are systematic faults within the system: because the same bug can affect every node, a single software error can bring down the whole system. There is no single solution to this; many small things help, such as thorough testing, process isolation, crashing and restarting services during testing, monitoring, and alerting.
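As one small illustration of the monitoring-and-alerting part, here is a hedged sketch that watches an error-rate metric and fires an alert when it crosses a threshold. `fetch_error_rate` and `send_alert` are hypothetical stand-ins for a real metrics and paging system.

```python
# Sketch: poll an error-rate metric and alert when it exceeds a threshold.
import time

ERROR_RATE_THRESHOLD = 0.01  # alert if more than 1% of requests are failing

def watch(fetch_error_rate, send_alert, interval_seconds=60):
    while True:
        rate = fetch_error_rate()
        if rate > ERROR_RATE_THRESHOLD:
            send_alert(f"Error rate {rate:.2%} exceeds {ERROR_RATE_THRESHOLD:.0%}")
        time.sleep(interval_seconds)
```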
Humans are unreliable: most outages are caused by configuration errors made by operators. Systems can be made more reliable by:
- Designing systems in a way that minimizes opportunities for error (well-designed abstractions, APIs, and admin interfaces)
- Decoupling the places where people make the most mistakes from the places where mistakes can cause failures (e.g., sandbox environments)
- Testing thoroughly at all levels, from unit tests to whole-system integration tests
- Allowing quick and easy recovery from human errors (fast rollback of configuration changes, gradual rollout of new code)
- Setting up detailed and clear monitoring
- Implementing good management practices and training
Reliability is important even if the application is non-critical: an unreliable system can cause lost productivity, legal issues if data is reported incorrectly, lost revenue, and damage to reputation.