This is an “evergreen” page that I’m backfilling from reading notes and keeping updated moving forward. I distill and compress the #1 thing I learned or took away from various pieces of literature (except books, but feel free to check out my reading pipeline). It’s far from perfect: run-on sentences galore to fit into the arbitrary one-sentence restriction. Nonetheless, I hope this piques your interest and encourages you to read the source material. Items that are italicized are some of my favorites.
There are no affiliate links here… but there may be broken ones. Let me know!
- Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases (Verbitski 2017). Aurora achieves strong durability and availability guarantees by replicating data six times (!) across three availability zones (AZ), allowing it to lose an entire AZ and just keep chugging along.
- The Byzantine Generals Problem (Lamport 1982). Byzantine fault tolerance can be achieved with synchronous, unsigned (“oral”) messages only if fewer than a third of the generals are traitors, i.e. n ≥ 3m + 1 generals to tolerate m traitors (the OM algorithm), while unforgeable, signed messages allow agreement for any number of traitors (the SM algorithm).
- Chip and PIN is Broken (Murdoch 2010). As of publication, the central flaw was that a banking card’s PIN verification step is never explicitly authenticated, allowing a man-in-the-middle device to spoof a successful PIN check.
- Continuous Deployment at Facebook and OANDA (Savor 2017). It’s possible to grow an engineering team and code base without giving up either productivity or quality, and it mostly revolves around making it easy and quick to continuously (duh) ship code to production.
- Controlling Queue Delay (Nichols 2012). “CoDel,” or controlled delay management, uses the time packets actually spend waiting in the queue (sojourn time) to deliver superior performance and a real solution to the bufferbloat problem.
- Crash-Only Software (Candea 2003). Build your software and systems so you don’t need a graceful shutdown or clean up; rather, let it crash and recover quickly.
- DeepXplore: automated whitebox testing of deep learning systems (Pei 2019). Finding edge case inputs near decision boundaries that cause DNNs to yield the opposite (and often unsafe) result is actually an optimization problem that can be solved using gradient ascent.
- The Diamond Model of Intrusion Analysis (Caltagirone 2013). Attacks can be modeled as activities pivoting between four points of a diamond, consisting of: the adversary, the victim, infrastructure, and (attacker) capabilities.
- Dropout: a simple way to prevent neural networks from overfitting (Srivastava 2014). Brilliant… you can regularize neural networks by randomly deleting (dropping out) nodes while training.
- End-to-End Arguments in System Design (Saltzer 1984). This paper formalized the End-to-end Principle, which has (and continues to have) deep relevance to ensuring the security and reliability properties of networked systems.
- Faster Key-value Stores (Adya 2019). Stateful, linked, in-memory key-value stores in application servers might offer better performance and availability than stateless application servers plugging into remote key-value stores (e.g. Redis, Memcached).
- Firecracker: Lightweight Virtualization for Serverless Applications (Agache 2020). AWS' Firecracker takes the performance of containers, the security of virtualization, and a stripped down kernel to deliver a scalable compute model for AWS Lambda and Fargate.
- Generative Adversarial Networks (Goodfellow 2014). This is the paper that first innovated GANs, pitting generative and discriminative models against each other to produce stunningly realistic results.
- Going Beyond the Sandbox - An Overview of the New Security Architecture in the Java Development Kit 1.2 (Gong 1997). Having an isolation mechanism at the language VM tier is a darn good idea that modern languages seem to have moved away from or (d|r)elegated to other parts of the computing stack.
- Harvest, Yield, and Scalable Tolerant Systems (Fox 1999). Distributed systems are subject to a “Weak CAP Principle”, and have to choose between reducing yield (the fraction of requests that complete) and reducing harvest (the fraction of the data reflected in each response).
- Immutability Changes Everything (Helland 2015). Immutable data structures provide benefits everywhere in modern distributed systems, from append-only computing to database write-ahead logs, though with trade-offs around size (mostly a non-issue), denormalization of data, write amplification, and “read perspiration”.
- Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB (Tyulunev 2019). The design and implementation of a cluster-wide logical clock in MongoDB unlocks causal consistency and future work on distributed, multi-document transactions.
- Jellyfish: Networking Data Centers Randomly (Singla 2012). Organizing networks in a random graph topology might yield greater performance and resiliency than traditional, hierarchical hub-and-spoke networks.
- k-anonymity: A Model for Protecting Privacy (Sweeney 2002). Not the author’s first introduction of k-anonymity (1998), but it did crystallize and formalize the idea of quantifying and guaranteeing a level of anonymity in a dataset, i.e. where one person’s data is indistinguishable from k-1 other records.
- Kafka: a Distributed Messaging System for Log Processing (Kreps 2011). Using a pull model, where consumers retrieve messages from a broker at their maximum rates, leads to a scalable, high-throughput, and (relatively?) low-latency messaging system.
- Meaningful Availability (Hauer 2020). Using a windowed user-uptime metric for availability gives you a better, user-centric view of availability while also showing the shape and nature of outages over a time period.
- Metastable Failures in Distributed Systems (Bronson 2021). Many systems fall into a fragile state that can become “sticky” during black swan events and outages.
- No free lunch theorems for optimization (Wolpert 1997). There’s no “best” machine learning or optimization algorithm when you consider the whole problem space.
- No silver bullet (Brooks 1987). We won’t see the equivalent of Moore’s Law with software productivity because we can only fix accidental complexity, not essential complexity.
- A note on the confinement problem (Lampson 1973). Basically most pieces of software have some type of side channel, as the bar for being unconfined is incredibly low, e.g. writing to disk or RAM.
- One pixel attack for fooling deep neural networks (Su 2019). Whoa, many DNNs fail the robustness test: pathological, low-dimensional perturbations (as small as a single pixel) can trigger weights and paths that make them yield a totally wrong result.
- Password Hardening Based on Keystroke Dynamics (Monrose 1999). You can use a user’s typing pattern combined with their password to generate a more secure “hardened password,” at the cost of a non-zero false positive rate.
- Password Managers: Attacks and Defenses (Silver 2014). At the time of publication, many password managers were vulnerable to a litany of low-sophistication attacks, including a “sweep attack”, where an attacker abuses autofill to make password managers dump passwords for multiple sites.
- Playing Atari with Deep Reinforcement Learning (Mnih 2013). Cool, you can use “experience replay” and a DNN (or DQN, deep Q-network) to estimate an action-value (Q) function that maps continuous input (pixels from an Atari game) to values over discrete outputs (actions on the controller).
- Polymorphic Blending Attacks (Fogla 2006). If you get some type of oracle or know how a signature-based IDS works, you can use some basic machine learning to craft payloads that circumvent and evade the system.
- POSIX Access Control Lists on Linux (Grunbacher 2003). Linux filesystem ACLs aren’t quite sufficient, so we had to bolt on things like extended ACLs and extended attributes… which are still sorely underused.
- Protection (Lampson 1971). You can view access control as a matrix of subjects vs. objects, where the rows are the (more common) Access Control Lists, or ACLs, and the columns are Capability Lists, or c-lists.
- The Protection of Information in Computer Systems (Saltzer 1975). Secure software principles like least privilege, fail-safe defaults, etc. are just as relevant today as they were 50-some years ago.
- Reflections on Trusting Trust (Thompson 1984). You can’t trust code you didn’t totally create yourself: even compiling from source isn’t enough, since the compiler (or the compiler that compiled the compiler) can silently inject a self-replicating backdoor.
- Reward is enough (Silver 2021). The concept of reward (from reinforcement learning) is “enough” to get us eventually to artificial general intelligence.
- Setuid Demystified (Chen 2002). Holy shit, the setuid state machine (as well as those of its many, many siblings) is incredibly complex and just riddled with footguns.
- The Tail at Scale (Dean 2013). When you’re operating distributed systems at scale, anything can cause terrible tail latency at the 99th percentile, and with enough request fan-out, nearly 100% of your users will hit it.
- Targeted Online Password Guessing: An Underestimated Threat (Wang 2016). Unsurprisingly, humans are really bad at crafting passwords and the proposed framework, “TarGuess”, can pwn lots of them.
- Task assignment with unknown duration (Harchol-Balter 2000). Workloads are typically heavy-tailed in reality, and the TAGS scheduling algorithm outperforms all other classical queuing algorithms that were (incorrectly?) designed around exponentially distributed workloads.
- Time, Clocks and the Ordering of Events in a Distributed System (Lamport 1978). Lamport timestamps use a logical clock to represent the partial ordering and causal relationships of events in a distributed system.
- Von Neumann’s First Computer Program (Knuth 1970). The “father of algorithms” (Knuth) analyzes the “father of… lots of things” (Von Neumann) and one of the very first stored programs: a sorting program.
- Xen and the Art of Virtualization (Barham 2003). Xen is a virtual machine monitor that relies heavily on paravirtualization instead of full virtualization, realizing performance gains at the cost of some leaky abstractions and modifications to guest operating systems.
- ZMap: Fast Internet-wide Scanning and its Security Applications (Durumeric 2013). ZMap uses stateless connections and a few “tricks” with probes to scan the internet (or other subnets) insanely fast… like the entire IPv4 space on a single port in under 45 minutes from one machine.
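The core of Lamport’s logical clocks (Time, Clocks, above) fits in a few lines. Here is a minimal Python sketch, with the caveat that the `LamportClock` class and its method names are my own illustrative invention, not the paper’s notation:

```python
class LamportClock:
    """Minimal Lamport logical clock: one counter per process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the clock.
        self.time += 1
        return self.time

    def send(self):
        # Attach the current timestamp to an outgoing message.
        return self.tick()

    def receive(self, msg_time):
        # On receipt, jump past both our clock and the sender's,
        # so the receive event is ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
t_send = a.send()           # a's clock advances to 1
t_recv = b.receive(t_send)  # b's clock becomes max(0, 1) + 1 = 2
assert t_send < t_recv      # causality is reflected in the timestamps
```

The key property this illustrates: if event x causally precedes event y, then x’s timestamp is strictly less than y’s (the converse does not hold, which is why these are only a partial order).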
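The dropout trick (Srivastava, above) is also tiny in code. Here is a sketch in plain Python of the common “inverted” variant, which scales surviving activations at training time rather than scaling weights at test time as the paper does; the `dropout` function name and signature are mine:

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during
    training and scale survivors by 1/(1-p), so expected activations
    match inference time, when this is a no-op."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0
            for a in activations]

random.seed(0)
out = dropout([1.0, 1.0, 1.0, 1.0], p=0.5)
# Survivors are scaled to 1.0 / 0.5 = 2.0; the rest are zeroed.
assert all(v in (0.0, 2.0) for v in out)
# At inference, activations pass through unchanged.
assert dropout([1.0, 2.0], p=0.5, training=False) == [1.0, 2.0]
```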
Articles and Essays
- Against DNSSEC (Ptacek 2015). Well… maybe DNSSEC is kinda pointless and causes more trouble (read: outages) than it’s worth.
- CAP Theorem: Revisited (Greiner 2014). Distributed systems have to be partition-tolerant, so with the CAP Theorem, you basically need to choose between (strong) consistency and availability.
- The Cryptographic Doom Principle (Marlinspike 2011). Encrypt-then-MAC/authenticate; anything else is broken and will likely lead to doom.
- Cryptographic Right Answers (Latacora 2018). A list of “right” answers to common cryptography problems, for developers who aren’t cryptographers (read: most people, myself included).
- A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution (Project Zero 2021). Some grade A impostor syndrome fuel: NSO group exploited a vulnerability in a GIF/PDF library to use a logic gate primitive… to construct an entire virtual CPU on the victim’s host and gain remote code execution!?
- Defense in Depth (Venables 2022). Rather than just stacking a bunch of defensive layers, you actually want them inter-locked and inter-linked, e.g. “every preventative control should have a detective control at the same level and/or one level downstream in the architecture.”
- DNS Response Size (Schaumann 2022). Keep your DNS responses small; otherwise you get all sorts of performance problems as clients are forced into EDNS(0) retries (extra round trips) and TCP fallback (oh no).
- Exploiting the DRAM rowhammer bug to gain kernel privileges (Project Zero 2015). Repeatedly hammering the same DRAM rows induces bit flips in physically adjacent rows, which can be groomed into an arbitrary-write primitive against privileged memory like page tables.
- Hacking the Universe with Quantum Encraption (Kaminsky 2017). Maybe our universe is deterministic and backed by some cryptographically secure pseudorandom number generator, and we totally wouldn’t know… and ehrmagahd I’m having an existential crisis but this is all so super interesting and whew you should just read this article and also RIP Dan, we love you.
- Honest Security (Meller/Kolide 2020). Humans are going to do human things, so security teams should build their endpoint security programs with empathy, positivity, and rationality.
- Local-First Software (Kleppmann 2019). There are seven ideals that characterize and enable local-first software, spanning everything from networking and storage to security, privacy, and ownership.
- Manual Work is a Bug (Limoncelli 2018). The sysadmin (or engineer) that spends more time automating vs. doing manual work will be far more successful in the long run.
- The Mythical Man-Month (Brooks 1995). This is where Brooks’s Law comes from: “adding manpower to a software project that is behind schedule delays it even longer.”
- Optimizing Your Apache Kafka Deployment (Confluent). You have four levers—throughput, latency, durability, and availability—and you can tune broker, producer, and consumer settings to optimize for the ones that are most important to you.
- The Security Mindset (Schneier 2008). Thinking like an attacker and how they might abuse the $THING you just built gives you a tremendous advantage in defending against them.
- The Second-System Effect (Brooks 1995). When engineers rebuild a system the second time, pent up feature creep and aspirations end up producing an over-engineered, high-complexity system.
- Surprisingly Turing-Complete (Branwen 2011). Holy crap, lots of crazy things (like even Magic: the Gathering) are accidentally Turing-complete… and probably have the ability to be pwned and programmed as a weird machine.
- What to look for when reviewing a company’s infrastructure (Lancini 2022). Read this methodical and thorough walkthrough if you ever need to evaluate infrastructure security at a new job, role, acquisition, etc.
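The Cryptographic Doom Principle entry above is easy to show in code. Here is a stdlib-Python sketch of encrypt-then-MAC where the XOR “cipher” is a deliberately toy stand-in for a real one (never use it for anything); the point is purely the ordering, i.e. the MAC covers the ciphertext and is verified before any decryption happens. All names here (`seal`, `open_sealed`, etc.) are mine:

```python
import hashlib
import hmac
import secrets

def _toy_encrypt(key: bytes, data: bytes) -> bytes:
    # Toy XOR keystream: illustration only, NOT a real cipher.
    block = hashlib.sha256(key).digest()
    keystream = (block * (len(data) // len(block) + 1))[:len(data)]
    return bytes(d ^ k for d, k in zip(data, keystream))

def seal(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    ciphertext = _toy_encrypt(enc_key, plaintext)
    # Encrypt-then-MAC: the tag authenticates the *ciphertext*.
    tag = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    return ciphertext + tag

def open_sealed(enc_key: bytes, mac_key: bytes, sealed: bytes) -> bytes:
    ciphertext, tag = sealed[:-32], sealed[-32:]
    expected = hmac.new(mac_key, ciphertext, hashlib.sha256).digest()
    # Verify the MAC (in constant time) BEFORE touching the ciphertext.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad MAC, refusing to decrypt")
    return _toy_encrypt(enc_key, ciphertext)  # XOR is its own inverse

enc_key, mac_key = secrets.token_bytes(32), secrets.token_bytes(32)
sealed = seal(enc_key, mac_key, b"attack at dawn")
assert open_sealed(enc_key, mac_key, sealed) == b"attack at dawn"
```

Because nothing attacker-controlled is decrypted before the `compare_digest` check passes, padding-oracle-style attacks on the decryption path never get off the ground, which is exactly Marlinspike’s point.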
Talks
- Architectures that Scale Deep: Regaining Control in Deep Systems (Sigelman 2019). Distributed tracing is basically the best way to gain observability and reason about deep systems and architectures.
- Close Loops and Opening Minds (MacCarthaigh/AWS 2018). Control theory and feedback loops are a critical part of designing stable and reliable systems, and they appear everywhere! Shout-out to Onkur Sen for the recommendation!
- Simple Made Easy (Hickey 2011). Simple != Easy, and software ends up optimizing for “easy” at the expense of creating complex software that ends up being “hard” in the long run.
- Turning the database inside out with Apache Samza (Kleppmann 2014). If you handle your data like one big stream, everything from the app layer, to caching, to APIs and UIs are basically materialized views.
- What Breaks Our Systems: A Taxonomy of Black Swans (Nolan 2019). Black swan events include: hitting limits, spreading slowness, thundering herds, runaway automation, cyberattacks, and circular dependencies.
Things (mostly papers) in my queue.
- A comprehensive study of Convergent and Commutative Replicated Data Types (Shapiro 2011)
- A few billion lines of code later: using static analysis to find bugs in the real world (Bessey 2010)
- Availability in Globally Distributed Storage Systems (Ford 2010)
- Basic Local Alignment Search Tool (Altschul 1990)
- BeyondCorp: Design to Deployment at Google (Osborn 2016)
- Big Ball of Mud (Foote 1999)
- Bigtable: A Distributed Storage System for Structured Data (Chang 2006)
- Borg, Omega, and Kubernetes (Burns 2016)
- C-Store: A Column-oriented DBMS (Stonebraker 2005)
- Canopy: An End-to-End Performance Tracing And Analysis System (Kaldor 2017)
- CRDTs: Consistency without concurrency control (Letia 2009)
- Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (Sigelman 2010)
- Design patterns for container-based distributed system (Burns 2016)
- Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks (Isard 2007)
- Dynamo: Amazon’s Highly Available Key-value Store (DeCandia 2007)
- Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers (Ren 2010)
- Hints for Computer System Design (Lampson 1983)
- In Search of an Understandable Consensus Algorithm (Ongaro 2014)
- Large-Scale Automated Refactoring Using ClangMR (Wright 2013)
- Large-scale cluster management at Google with Borg (Verma 2015)
- Life beyond Distributed Transactions: an Apostate’s Opinion (Helland 2007)
- Logic and Lattices for Distributed Programming (Conway 2012)
- MapReduce: Simplified Data Processing on Large Clusters (Dean 2008)
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (Hindman 2011)
- Omega: flexible, scalable schedulers for large compute clusters (Schwarzkopf 2013)
- On Designing and Deploying Internet-Scale Services (Hamilton 2007)
- Online, Asynchronous Schema Change in F1 (Rae 2013)
- Out of the Tar Pit (Moseley 2006)
- Paxos Made Live - An Engineering Perspective (Chandra 2007)
- Paxos Made Simple (Lamport 2001)
- Profiling a warehouse-scale computer (Kanev 2015)
- Rules of Machine Learning: Best Practices for ML Engineering (Zinkevich)
- Searching for Build Debt: Experiences Managing Technical Debt at Google (Morgenthaler 2012)
- Security Keys: Practical Cryptographic Second Factors for the Modern Web (Lang 2016)
- Source Code Rejuvenation is not Refactoring (Pirkelbauer 2009)
- Spanner: Google’s Globally-Distributed Database (Corbett 2012)
- Still All on One Server: Perforce at Scale (Bloch 2011)
- SWIM: Scalable Weakly-consistent Infection-style Process Group Membership (Das 2002)
- The Chubby lock service for loosely-coupled distributed systems (Burrows 2006)
- The Google File System (Ghemawat 2003)
- The UNIX Time-Sharing System (Ritchie 1974)
- Towards a Solution to the Red Wedding Problem (Meiklejohn 2018)
- Unreliable Failure Detectors for Reliable Distributed Systems (Chandra 1996)
- Wormhole: Reliable Pub-Sub to Support Geo-replicated Internet Services (Sharma 2015)
- Zab: High-performance broadcast for primary-backup systems (Junqueira 2011)
Sources where I pick up things to consume.