15 Common Mistakes When Studying Distributed Systems (And How to Fix Them)
Distributed systems is one of the most intellectually demanding areas of computer science, where partial failures and concurrency make reasoning deceptively difficult. These 15 mistakes reflect the gaps that trip up even experienced engineers when they study consensus, replication, and fault tolerance.
Assuming the Network Is Reliable
Students design systems as if messages always arrive, arrive in order, and arrive exactly once. This leads to architectures that work perfectly in testing but fail catastrophically under real network conditions.
A student designs a distributed key-value store where a write is considered committed after sending the value to replicas, without waiting for acknowledgments. During a network partition, replicas silently miss updates, leading to permanent data divergence.
How to fix it
Internalize the fallacies of distributed computing from day one. For every protocol you study, explicitly ask: what happens if this message is lost, delayed, duplicated, or reordered? Build systems that assume failure is the norm, not the exception.
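To make this concrete, here is a minimal Python sketch of the fix for the key-value store example above: acknowledge a write as committed only after a majority of replicas confirm it. The `replicas` list and its `send` method are hypothetical stand-ins for a real RPC layer.

```python
# Minimal sketch: commit a write only after a majority of replicas
# acknowledge it, instead of assuming every message arrives.
# `replicas` is a hypothetical list of objects whose send() returns
# True on acknowledgment; a timeout means the ack (or the message
# itself) was lost somewhere.

def quorum_write(replicas, key, value):
    acks = 0
    for replica in replicas:
        try:
            if replica.send(key, value):
                acks += 1
        except TimeoutError:
            pass  # a timeout is a missing ack, not proof of failure
    majority = len(replicas) // 2 + 1
    if acks >= majority:
        return "committed"
    # Surface the uncertainty so the client can retry, rather than
    # letting replicas silently diverge.
    raise RuntimeError(f"not durable: {acks}/{len(replicas)} acks")
```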
Confusing Consistency Models
Students treat linearizability, sequential consistency, causal consistency, and eventual consistency as interchangeable or as a simple hierarchy. Each model makes fundamentally different guarantees about the orderings of operations that concurrent clients are allowed to observe.
A student claims their system provides 'strong consistency' because reads always return the latest local write, without realizing this is only read-your-writes consistency, not linearizability. The system allows stale reads from other clients.
How to fix it
Study each consistency model through its formal definition and a concrete anomaly it prevents or allows. Build a comparison table showing which anomalies (stale reads, lost updates, causal violations) each model permits. Work through the PACELC framework to understand the tradeoffs.
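To see one of these anomalies in miniature, the following sketch (illustrative, not any real database's API) shows a write that satisfies read-your-writes for the writer while a second client still observes stale data from a lagging replica, an execution that linearizability forbids:

```python
# Sketch of the stale-read anomaly: read-your-writes holds for the
# writer, but the execution is not linearizable because another client
# can read a replica that has not yet applied a completed write.

class LaggingReplica:
    def __init__(self):
        self.data = {}
        self.pending = []            # replicated writes not yet applied

    def replicate(self, key, value):
        self.pending.append((key, value))

    def apply_pending(self):         # happens "eventually"
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()

primary = {"x": 0}
replica = LaggingReplica()

primary["x"] = 1                     # client A writes and commits
replica.replicate("x", 1)            # replication is still in flight

print(primary["x"])                  # client A reads its own write: 1
print(replica.data.get("x", 0))      # client B reads stale data: 0
```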
Misapplying the CAP Theorem
Students treat CAP as a simple 'pick two of three' theorem and try to classify every system as CP or AP. In practice, the CAP theorem applies only during network partitions, and most systems make nuanced tradeoffs along the consistency-availability spectrum.
A student argues that MongoDB is 'CP' while Cassandra is 'AP,' as if each database has a single fixed CAP classification. In reality, both systems offer configurable consistency levels per operation.
How to fix it
Read the original Brewer conjecture and Gilbert-Lynch proof. Understand that CAP only constrains behavior during partitions. Study how real systems like DynamoDB and CockroachDB offer tunable consistency, and learn the PACELC extension that addresses latency tradeoffs during normal operation.
Treating Consensus Algorithms as Black Boxes
Students memorize that 'Raft elects a leader and replicates a log' without understanding the invariants that ensure safety. This makes it impossible to reason about edge cases or extend the algorithm for real-world use.
A student can describe the happy path of Raft leader election but cannot explain why a candidate whose log is missing committed entries must not win an election, or what happens when a partitioned leader continues accepting writes.
How to fix it
Work through the Raft paper section by section, implementing the state machine in code. For each rule (election restriction, log matching property), construct a scenario where violating it leads to data loss. The Raft visualization tool at thesecretlivesofdata.com makes the protocol concrete.
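As a starting point, here is a minimal Python sketch of the election restriction (Section 5.4.1 of the Raft paper); the function and argument names are illustrative:

```python
# Raft's election restriction: a voter grants its vote only if the
# candidate's log is at least as up-to-date as the voter's own.
# "Up-to-date" compares the last entry's term first; log length
# breaks ties only when the terms are equal.

def candidate_is_up_to_date(cand_last_term, cand_last_index,
                            voter_last_term, voter_last_index):
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index

# Why this matters: a committed entry is on a majority of logs, so in
# any majority of voters, someone will refuse a candidate missing it.
assert not candidate_is_up_to_date(3, 5, 3, 7)  # same term, shorter log
assert candidate_is_up_to_date(4, 2, 3, 7)      # higher last term wins
```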
Ignoring Clock Synchronization Problems
Students use wall-clock timestamps to order events across nodes, not realizing that clock skew between machines can be milliseconds to seconds. This introduces subtle ordering bugs that are nearly impossible to reproduce.
A student implements a last-write-wins conflict resolution strategy using system timestamps. Two clients write to the same key from different nodes at nearly the same time, and clock skew causes the chronologically earlier write to 'win' because its node's clock runs ahead.
How to fix it
Study Lamport timestamps, vector clocks, and hybrid logical clocks. Understand why Google Spanner needs atomic clocks and GPS receivers to bound clock uncertainty. Default to logical clocks for ordering unless you have explicit clock synchronization guarantees.
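A Lamport clock is small enough to implement in a few lines; here is a minimal sketch showing how logical time preserves causal order without any wall clock:

```python
# Minimal Lamport clock: if event a happens-before event b, then
# L(a) < L(b), regardless of what the machines' wall clocks say.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                  # local event or message send
        self.time += 1
        return self.time

    def receive(self, msg_time):     # merge the sender's timestamp
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.tick()           # a sends a message stamped 1
print(b.receive(t))    # b jumps to 2, ordering the receive after the send
```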
Studying Only the Happy Path of Protocols
Students focus on how distributed protocols work when everything goes right, skipping the failure-handling portions that constitute most of the protocol's complexity and correctness guarantees.
A student understands two-phase commit when all participants vote 'yes' but cannot explain the blocking problem when the coordinator crashes after sending 'prepare' but before sending 'commit,' leaving participants locked indefinitely.
How to fix it
For every protocol, enumerate the failure scenarios: what if the coordinator crashes? What if a participant crashes? What if a message is lost at each step? Draw timing diagrams for each failure case and trace through the recovery logic. Read Jepsen analyses to see how real systems fail.
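To internalize the 2PC blocking problem from the example above, consider this small sketch of a participant's state machine (state and message names are illustrative):

```python
# Why 2PC blocks: once a participant votes "yes" to PREPARE, it holds
# its locks and cannot safely decide alone, because the crashed
# coordinator may already have sent COMMIT or ABORT to someone else.

class Participant:
    def __init__(self):
        self.state = "INIT"

    def on_prepare(self):
        self.state = "PREPARED"      # locks are held from here on
        return "yes"

    def on_decision(self, decision):
        self.state = decision        # COMMIT or ABORT releases the locks

    def on_coordinator_timeout(self):
        if self.state == "PREPARED":
            # Aborting unilaterally could contradict a COMMIT delivered
            # elsewhere; committing unilaterally could contradict an
            # ABORT. The participant can only wait.
            raise RuntimeError("blocked until coordinator recovers")

p = Participant()
p.on_prepare()
try:
    p.on_coordinator_timeout()
except RuntimeError as e:
    print(e)                         # the participant is stuck
```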
Confusing Replication Strategies
Students mix up synchronous and asynchronous replication; conflate single-leader, multi-leader, and leaderless replication; and don't understand the consistency and availability tradeoffs each entails.
A student proposes using asynchronous multi-leader replication for a banking system to improve availability, not realizing this allows concurrent conflicting writes (two withdrawals from the same account on different leaders) that can result in a negative balance.
How to fix it
Create a comparison matrix of replication strategies with columns for consistency guarantee, write latency, read latency, availability during partitions, and conflict handling. Study each strategy through a concrete system: single-leader (PostgreSQL streaming replication), multi-leader (CouchDB), leaderless (Cassandra).
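For the leaderless column of that matrix, the key arithmetic is the quorum overlap rule; a quick sketch:

```python
# Leaderless quorum rule: with N replicas, writing W copies and reading
# R copies guarantees every read quorum intersects every write quorum
# whenever R + W > N, so a read always touches the latest write.

def quorums_overlap(n, r, w):
    return r + w > n

print(quorums_overlap(3, 2, 2))   # True: e.g., QUORUM reads and writes
print(quorums_overlap(3, 1, 1))   # False: fast, but stale reads possible
```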
Not Understanding Exactly-Once Semantics
Students assume messages in distributed systems can be delivered exactly once. In reality, exactly-once delivery is impossible over unreliable networks; what systems achieve is exactly-once processing through idempotency or deduplication.
A student designs a payment processing system that retries failed requests without idempotency keys. When a network timeout occurs after the payment was actually processed, the retry creates a duplicate charge.
How to fix it
Distinguish between at-most-once, at-least-once, and exactly-once semantics. Learn that exactly-once processing requires either idempotent operations or deduplication with unique request IDs. Study how Kafka achieves exactly-once semantics through idempotent producers and transactional writes.
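Here is a minimal sketch of the deduplication approach, using an in-memory dict as a stand-in for a durable store; the names are illustrative:

```python
# Exactly-once *processing* on top of at-least-once delivery: the
# client attaches a unique idempotency key, and the server replays the
# stored result on retries instead of repeating the side effect.

import uuid

processed = {}   # idempotency_key -> result (durable in a real system)

def charge(idempotency_key, amount):
    if idempotency_key in processed:
        return processed[idempotency_key]   # retry: replay, don't re-charge
    result = f"charged {amount}"            # side effect runs exactly once
    processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())
print(charge(key, 100))   # first attempt performs the charge
print(charge(key, 100))   # timeout-driven retry is deduplicated
```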
Overlooking the Split-Brain Problem
Students design failover mechanisms where both the old and new leader can accept writes simultaneously after a false failure detection. This split-brain scenario causes data corruption that is extremely difficult to repair.
A student implements a simple heartbeat-based failover where a backup takes over if it misses three heartbeats from the primary. A network partition causes the backup to promote itself while the original primary is still running and accepting writes from clients on its side of the partition.
How to fix it
Study fencing mechanisms: fencing tokens, STONITH (Shoot The Other Node In The Head), and lease-based approaches. Understand why consensus-based leader election (using Raft or ZooKeeper) is safer than simple heartbeat failover. Design systems that can detect and resolve split-brain through quorum-based writes.
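The fencing-token idea fits in a few lines; in this sketch (illustrative, not a real lock service's API), the storage layer rejects any write carrying a token older than one it has already seen:

```python
# Fencing tokens: the lock service issues a monotonically increasing
# token with each lease, and storage rejects stale tokens, so a paused
# or partitioned old leader cannot corrupt data after failover.

class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            raise PermissionError(f"stale token {token}: write rejected")
        self.highest_token = token
        self.data[key] = value

store = FencedStore()
store.write(33, "x", "old leader")     # token 33 issued before failover
store.write(34, "x", "new leader")     # failover issues token 34
try:
    store.write(33, "x", "zombie")     # the old leader wakes up
except PermissionError as e:
    print(e)
```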
Skipping the Foundational Papers
Students rely on blog posts and secondary summaries instead of reading the original distributed systems papers. Summaries often simplify the nuances and invariants that make the difference between correct and incorrect implementations.
A student reads a blog summary of the Raft paper and misses the subtlety around no-op entries after leader election, leading them to believe a new leader can immediately serve reads at the latest committed index.
How to fix it
Read the core papers: Lamport's 'Time, Clocks, and the Ordering of Events in a Distributed System,' the Raft paper, the Dynamo paper, Google's Spanner paper, and the FLP impossibility result. Take notes on each paper's key invariants and assumptions. The MIT 6.824 reading list is an excellent curated sequence.
Neglecting Failure Detection Nuances
Students treat failure detection as binary (a node is either up or down) when in reality, detectors can only suspect failure based on timeouts, and those suspicions may be wrong due to network delays or GC pauses.
A student sets an aggressive 100ms timeout for failure detection in a system with occasional GC pauses of 200ms. The system repeatedly and incorrectly declares healthy nodes as dead, triggering unnecessary failovers and causing instability.
How to fix it
Study the Chandra-Toueg failure detector model and understand the properties of completeness and accuracy. Learn about phi accrual failure detectors used by Cassandra. Recognize that all practical failure detectors face a fundamental tradeoff between detection speed and false positive rate.
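The phi accrual idea can be sketched with a deliberately simplified model: assume heartbeat gaps are exponentially distributed (real implementations fit a distribution to recent inter-arrival times) and report suspicion as a continuous value rather than a binary verdict:

```python
# Simplified phi accrual sketch: phi = -log10(P(node is still alive)),
# so phi = 1 means roughly a 10% chance the silence is benign and
# phi = 3 means roughly 0.1%. Suspicion grows smoothly with silence.

import math

def phi(ms_since_last_heartbeat, mean_interval_ms):
    p_alive = math.exp(-ms_since_last_heartbeat / mean_interval_ms)
    return -math.log10(p_alive)

print(round(phi(100, 100), 2))   # gap equals the mean: phi ~ 0.43
print(round(phi(800, 100), 2))   # 8x the mean: phi ~ 3.5, likely dead
```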
Memorizing Without Building
Students study distributed systems concepts theoretically without implementing any protocols. The gap between understanding an algorithm on paper and making it work in code is enormous in this field.
A student can draw the Raft state diagram on a whiteboard but has never implemented even a simplified version. When asked how to handle a leader receiving a write during a configuration change, they have no intuition to draw on.
How to fix it
Implement at least one consensus protocol (Raft is the most accessible), a simple replicated state machine, or a distributed key-value store with replication. The MIT 6.824 labs provide excellent scaffolded exercises. Even a partial implementation builds more intuition than weeks of reading.
Confusing Latency with Throughput in System Design
Students optimize for throughput (total operations per second) when the real bottleneck is tail latency (p99 response time), or vice versa. These are different metrics that often require opposing optimization strategies.
A student adds batching to a distributed system to improve throughput, not realizing that batching increases individual request latency. For a user-facing service with strict latency SLAs, this optimization makes the system worse for its actual use case.
How to fix it
Always clarify which performance metric matters for a given system. Study the relationship between batching, pipelining, and latency. Read Jeff Dean and Luiz André Barroso's 'The Tail at Scale' to understand why p99 latency matters more than average latency in distributed systems.
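The core fan-out arithmetic from that paper is worth working once by hand:

```python
# Tail-at-scale arithmetic: a rare slow response becomes common under
# fan-out. If each server is slow for 1% of requests, a request that
# touches 100 servers waits on at least one slow server ~63% of the time.

p_at_least_one_slow = 1 - 0.99 ** 100
print(f"{p_at_least_one_slow:.0%}")   # ~63%
```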
Underestimating the Difficulty of Distributed Transactions
Students assume distributed transactions work just like local database transactions but across multiple nodes. They don't appreciate the performance cost, blocking risks, and complexity that protocols like two-phase commit introduce.
A student proposes using distributed transactions across a microservices architecture for every cross-service operation, not realizing that 2PC requires all participants to be available and that coordinator failure blocks all participants until recovery.
How to fix it
Study the limitations of 2PC (blocking, performance overhead, coordinator as single point of failure). Learn alternative patterns: Saga pattern for long-lived transactions, eventual consistency with compensation, and Calvin/deterministic databases. Understand when distributed transactions are worth the cost versus when they should be avoided.
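To get a feel for the Saga pattern, here is a minimal sketch with illustrative step names: each local transaction is paired with a compensating action, and a failure runs the compensations for completed steps in reverse order:

```python
# Saga sketch: on failure, run compensating actions for the completed
# steps in reverse. Compensation is a semantic undo (e.g., a refund),
# not a database rollback, so intermediate states are visible.

def run_saga(steps):
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception as exc:
        print(f"step failed ({exc}); compensating")
        for compensate in reversed(completed):
            compensate()

def ship():
    raise RuntimeError("shipping unavailable")

run_saga([
    (lambda: print("reserve inventory"), lambda: print("release inventory")),
    (lambda: print("charge card"),       lambda: print("refund card")),
    (ship,                               lambda: None),
])
```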
Not Reasoning About Partial Failures
Students think of failures as total (the whole system is up or down) rather than partial (some nodes fail while others continue). Partial failures create inconsistent views of the world that are the fundamental challenge of distributed systems.
A student tests their distributed system by killing all replicas simultaneously and verifying recovery, but never tests scenarios where one replica silently drops writes, a network link becomes asymmetric (A can reach B but not vice versa), or a single node has a corrupted disk.
How to fix it
Use chaos engineering tools like Jepsen, Chaos Monkey, or Toxiproxy to inject realistic partial failures during testing. For each component in your system, enumerate: what happens if this node is slow (not dead), if it lies (Byzantine), or if it can receive but not send?
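Even without a chaos tool, you can fake the nastiest case, a link that delivers requests but drops replies, with a toy wrapper like this (illustrative, not Toxiproxy's actual API):

```python
# Toy fault injection: an asymmetric link that delivers the request
# but drops the reply. The sender cannot distinguish "write lost"
# from "write applied, ack lost" -- the root of most retry bugs.

import random

class FlakyLink:
    def __init__(self, request_drop_rate=0.0, drop_replies=False):
        self.request_drop_rate = request_drop_rate
        self.drop_replies = drop_replies

    def send(self, msg):
        if random.random() < self.request_drop_rate:
            return None              # request lost in flight
        reply = f"ack({msg})"        # the receiver did process msg
        return None if self.drop_replies else reply

link = FlakyLink(drop_replies=True)
print(link.send("write x=1"))   # None: did the write happen? Unknowable.
```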
Quick Self-Check
- Can you explain the difference between linearizability and sequential consistency with a concrete example showing an execution that is sequentially consistent but not linearizable?
- What happens in two-phase commit if the coordinator crashes after sending 'prepare' to all participants but before sending any 'commit' or 'abort' decision?
- Why can't you use wall-clock timestamps to determine the global ordering of events across distributed nodes?
- In Raft, why must a candidate hold all committed entries in its log in order to win an election?
- What is the split-brain problem, and how do fencing tokens prevent data corruption during a failover event?
Pro Tips
- ✓ When studying a new distributed protocol, draw a sequence diagram for the happy path first, then systematically introduce one failure at each step — this reveals the protocol's true complexity and cleverness.
- ✓ Read Jepsen reports (jepsen.io) for real databases to see how production systems violate their stated consistency guarantees — it's the best way to develop a healthy skepticism about distributed systems claims.
- ✓ Keep a 'distributed systems failure zoo' — a personal catalog of failure modes you've encountered in papers, blog posts, and practice, organized by category (network, disk, clock, software bug).
- ✓ When designing a distributed system, start by writing down what must never happen (safety properties) before thinking about what should eventually happen (liveness properties) — getting this priority wrong is the most common design mistake.
- ✓ Use TLA+ or formal specification tools to model your protocols before implementing them — Leslie Lamport designed these tools specifically because distributed systems are too complex to reason about informally.