The reality of distributed systems is that you should assume anything that can go wrong will go wrong.
Faults and Partial Failures
In a distributed system, some parts may break while other parts are still functional. This is known as a partial failure. The difficulty with partial failures is that they are nondeterministic: the same operation may succeed on one attempt and fail unpredictably on the next.
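To make the idea concrete, here is a minimal sketch of a partial failure, assuming a toy model rather than a real network. The names `Outcome`, `call_node`, and `broadcast`, and the failure and timeout rates, are all hypothetical and chosen only for illustration. Each simulated request to a node either succeeds, returns an error, or times out, and a timeout tells the caller nothing about whether the node ever processed the request.

```python
import random
from enum import Enum
from typing import Dict


class Outcome(Enum):
    OK = "ok"
    FAILED = "failed"    # the node replied with an error
    TIMEOUT = "timeout"  # no reply: the caller cannot tell what happened


def call_node(rng: random.Random, failure_rate: float, timeout_rate: float) -> Outcome:
    """Simulate one request to a remote node (toy model, not a real RPC)."""
    roll = rng.random()  # uniform in [0.0, 1.0)
    if roll < failure_rate:
        return Outcome.FAILED
    if roll < failure_rate + timeout_rate:
        return Outcome.TIMEOUT
    return Outcome.OK


def broadcast(node_count: int, rng: random.Random,
              failure_rate: float = 0.1,
              timeout_rate: float = 0.1) -> Dict[int, Outcome]:
    """Send the same request to every node and record each node's outcome."""
    return {
        node: call_node(rng, failure_rate, timeout_rate)
        for node in range(node_count)
    }


if __name__ == "__main__":
    rng = random.Random()  # unseeded on purpose, so runs may differ
    for attempt in range(3):
        results = broadcast(5, rng)
        print(attempt, [outcome.value for outcome in results.values()])
```

Running the script several times will usually give a different mix of outcomes for the same request, with some nodes answering while others do not. That is the situation the caller has to design for.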
Cloud Computing and Supercomputing
There’s a spectrum of philosophies on building large-scale computing systems:
- High-performance computing (HPC) - many CPUs for one task; the whole machine is treated more like a single node
- Cloud computing - many computers, many tasks; inherently multi-node