My Thoughts on Designing Data-Intensive Applications
During the Christmas holidays, I started reading Designing Data-Intensive Applications by Martin Kleppmann. I liked the book, and it gave me a lot to think about. It also opened a new Pandora's box for me: an interest in databases in general.
Book Contents
There are many database technologies available, each serving a different purpose. For general data storage, we may use relational databases, which have been around since the 1970s, or key-value stores, which may be faster for certain workloads such as saving application state or caching. In the past decade, NoSQL databases have also gained popularity.
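To make the caching case concrete, here is a minimal cache-aside sketch. It assumes a local Redis server and the third-party redis-py client; the `load_user_from_database` function is a hypothetical stand-in for a slower relational query.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def load_user_from_database(user_id):
    # Hypothetical stand-in for a slow relational query.
    return {"id": user_id, "name": "alice"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = r.get(key)  # fast path: check the key-value store first
    if cached is not None:
        return json.loads(cached)
    user = load_user_from_database(user_id)
    r.setex(key, 300, json.dumps(user))  # cache the result for 5 minutes
    return user
```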
Each technology comes with its own trade-offs, such as its underlying algorithms, scaling capabilities, safety guarantees, and availability (including conflict resolution, concurrency issues, transactions, and ACID compliance). These should be weighed alongside the intended use of the technology, since some are better suited to certain workloads than others.
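As a small illustration of what an ACID guarantee buys us, here is a sketch of atomicity using Python's built-in sqlite3 module. The accounts table and amounts are hypothetical; the point is that a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")

def transfer(conn, src, dst, amount):
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # the whole transfer was rolled back; no partial update is visible

transfer(conn, "alice", "bob", 150)  # fails: alice only has 100
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)] -- both rows are unchanged
```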
We also need to consider the type of operations we want to perform on our database. There are even more options for specific workloads, such as full-text search engines or graph databases. While we can replicate some of these features in conventional databases (e.g., using full-text search in MySQL or modelling a graph as tables of vertices and edges), doing so may come with significant overhead and may not be as powerful as technologies designed specifically for those purposes.
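Here is a rough sketch of that vertices-and-edges approach in a relational database, again using Python's built-in sqlite3 module with a hypothetical schema. Even a simple reachability query already needs a recursive common table expression, which hints at why traversals feel more natural in a dedicated graph database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE edges (src INTEGER REFERENCES vertices(id),
                        dst INTEGER REFERENCES vertices(id));
    INSERT INTO vertices VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO edges VALUES (1, 2), (2, 3);
""")

# Everyone reachable from vertex 1 ('alice'), via a recursive CTE.
reachable = conn.execute("""
    WITH RECURSIVE reach(id) AS (
        SELECT 1
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.id
    )
    SELECT v.label FROM vertices v JOIN reach r ON v.id = r.id
""").fetchall()
print(reachable)  # e.g. [('alice',), ('bob',), ('carol',)]
```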
To determine the best solution, we need to carefully consider our requirements and priorities: is high availability or high performance more important? How much scaling do we need? Should our databases be distributed or run on a single machine? What level of guarantees do we need against common problems such as race conditions and other conflicts? All of these questions play a role in choosing the most suitable technology.
The book also discusses at length how databases are not the only way to work with data: we can process it with approaches such as batch processing or message brokers like Apache Kafka. These may be more suitable for certain purposes, depending on our requirements. For example, message brokers may be a better fit for distributed systems, as they are more fault-tolerant and often solve specific problems more effectively than conventional databases. If we need to process large amounts of data before using it, batch processing may be a good solution as well.
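For a feel of the message-broker model, here is a minimal producer and consumer sketch. It assumes a Kafka broker on localhost:9092 and the third-party kafka-python client; the topic and group names are hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: append an event to the "page-views" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": "alice", "url": "/home"}')
producer.flush()  # block until the broker has acknowledged the message

# Consumer: read events from the same topic; this loop blocks and waits
# for new messages.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # consumers in a group share the work
    auto_offset_reset="earliest",  # start from the beginning if no offset is stored
)
for message in consumer:
    print(message.value)  # messages persist in the log and can be replayed
```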
The book also covers a wide range of theoretical problems and the academic progress made in solving them. It describes various algorithms and their trade-offs, such as accepting less strict correctness and robustness guarantees in exchange for faster execution and lower overhead.
Who Is This Book For?
I would recommend this book to anyone interested in application development or considering a career in data or computer science. Prior knowledge of software development is required, as this is a highly technical book. It provides a comprehensive overview of how we work with data and how we arrived at our current approaches. It is also useful for those who use databases but want to understand their inner workings and the problems that can arise at scale.
Having this knowledge can help with designing the data layer of applications and provides a helpful framework for thinking about data at scale. It has helped me better understand these concepts and how to apply them in practice.