Understanding Stateful Services and Their Challenges

Understanding the challenges inherent in stateful services is vital in software architecture. The need to design a stateful system usually arises from a basic requirement: storing data and querying it later. Whether it’s tracking user sessions, managing transactions, or preserving application state, the data has to be maintained somewhere, and that is what gives rise to stateful services.

In this series of blog posts we take a deep dive into stateful services. In this first post, ‘Understanding Stateful Services and Their Challenges,’ we explore the complexities that engineers face when designing and implementing stateful services. In the next post, we shift our focus to how AWS services can address these challenges, with a guide to choosing the right AWS solutions for stateful applications.

What Are Stateful Services?

Before diving deeper, it’s essential to define what we mean by ‘stateful services’ in the context of this blog. Stateful services, as we discuss them here, are those that manage and store data within themselves. If a service exposes an API for data storage but delegates the actual storage to another service, it does not qualify as a stateful service under the terminology we use in this series.

Challenges of Stateful Services

Data Consistency Challenges

Maintaining multiple copies of data makes it significantly harder to ensure consistency across all instances. Keeping the replicated data sets synchronized becomes a critical task, especially in a distributed environment where the data is changing constantly.

The primary node has to send the write request to all the replicas before acknowledging the write.
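The write path described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `Replica` and `Primary` classes are hypothetical in-memory stand-ins, and there is no network, timeout, or retry handling.

```python
class Replica:
    """Hypothetical in-memory stand-in for a replica node."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # a real replica would reply over the network


class Primary:
    """Acknowledges a write only after every replica has applied it."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        # Synchronous replication: collect an ack from every replica first.
        acks = [replica.apply(key, value) for replica in self.replicas]
        return all(acks)


replicas = [Replica("r1"), Replica("r2")]
primary = Primary(replicas)
acked = primary.write("user:42", "clicked_ad_7")
```

Waiting for every replica keeps the copies in step, but it also means a single slow or unreachable replica delays or blocks the write.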

Complexity in Scaling

Scaling stateful services is more complex than scaling stateless ones, both in terms of managing caches and replicating data while maintaining consistency.

As we can see in the above example, when a stateful service manages the actual data, we need to increase the number of shards to scale. More shards means more primary shards as well as more replica shards.
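A quick back-of-the-envelope calculation shows how the number of shard copies grows. The function name and the figures used (4 or 8 primary shards, 2 replicas each) are assumptions chosen purely for illustration.

```python
def total_shard_copies(primary_shards, replicas_per_shard):
    """Each primary shard is stored once, plus once per replica."""
    return primary_shards * (1 + replicas_per_shard)


before = total_shard_copies(primary_shards=4, replicas_per_shard=2)
after = total_shard_copies(primary_shards=8, replicas_per_shard=2)
```

Under these assumptions, doubling the primary shards from 4 to 8 takes the system from 12 shard copies to 24, all of which have to be placed, monitored, and kept consistent.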

Recovery and Fault Tolerance

In case of failures, restoring stateful services to their last known good state is more challenging than restarting stateless services.

As we can see in the above diagram, the primary node had stopped responding, so the client had to fall back to the replica node to get a response. With stateless services, by contrast, any instance can serve any request, so the client can simply send the request to another instance.
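The fallback behaviour can be sketched as follows. The `Node` class, the `NodeDown` exception, and `read_with_fallback` are hypothetical names used only for this example; a real client would also need timeouts and a way to cope with replicas that lag behind the primary.

```python
class NodeDown(Exception):
    """Raised when a node does not respond (simulated here)."""


class Node:
    def __init__(self, name, data, healthy=True):
        self.name = name
        self.data = data
        self.healthy = healthy

    def get(self, key):
        if not self.healthy:
            raise NodeDown(self.name)
        return self.data[key]


def read_with_fallback(key, primary, replicas):
    """Try the primary first, then each replica in order."""
    for node in [primary, *replicas]:
        try:
            return node.name, node.get(key)
        except NodeDown:
            continue  # move on to the next candidate
    raise NodeDown(f"no node available for key {key!r}")


data = {"ad:7": 120}
primary = Node("primary", data, healthy=False)
replica = Node("replica-1", dict(data))
served_by, value = read_with_fallback("ad:7", primary, [replica])
```

The fallback only works here because the replica already holds a copy of the data, which is exactly the extra burden a stateless service does not carry.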

Maintainability and Team Understanding Challenges

Designing complex stateful systems isn’t just a technical challenge; it’s also about ensuring that every team member understands the intricacies of the system. The code is written once but read numerous times, making ease of understanding a critical aspect. Complex systems require careful documentation and knowledge transfer to ensure ongoing maintainability and effective team collaboration.

Indeed, the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code. …[Therefore,] making it easy to read makes it easier to write.

Robert C. Martin

Types of State in Stateful Services

Stateful services can maintain state in two primary ways:

  1. State as Cache: In this approach, the application maintains state through a cache, with a duplicate set of data managed by an external system. This model ensures that even if a subset of nodes fails, the data remains intact and secure.
  2. State as Actual Data: In this scenario, the system directly manages and stores the actual data. This necessitates data replication across multiple nodes to guarantee availability and safeguard against data loss due to node failures.


Challenges in Systems having “State as Actual Data”:

  • Consistency Issues: Ensuring all copies of data across the system are consistent.
  • Scalability and Consistency: Balancing scalability while maintaining data consistency.

Challenges in Systems having “State as Cache”:

  • Cache Strategy: Deciding whether to use a read-through cache, a write-through cache, or another type, based on specific business needs. This decision is critical because it determines the consistency level you get and the failure scenarios your application needs to handle.
  • Memory Management: Effectively managing the memory for custom caches. How much data do we need to cache, and how should we evict entries when the memory limit is reached? These are the questions we need to answer when introducing a cache.
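To make both questions concrete, here is a minimal sketch of a read-through cache with least-recently-used (LRU) eviction. It assumes a fixed entry limit instead of a byte limit, and `ReadThroughCache`, `load_from_store`, and `backing_store` are hypothetical names standing in for your own cache and external data store.

```python
from collections import OrderedDict


class ReadThroughCache:
    """On a miss, loads the value from the backing store and caches it."""

    def __init__(self, loader, max_entries):
        self.loader = loader
        self.max_entries = max_entries
        self.entries = OrderedDict()  # ordered oldest to most recently used

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        value = self.loader(key)  # cache miss: read through to the store
        self.entries[key] = value
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)  # evict least recently used
        return value


backing_store = {"a": 1, "b": 2, "c": 3}
loads = []  # records which keys had to be fetched from the store


def load_from_store(key):
    loads.append(key)
    return backing_store[key]


cache = ReadThroughCache(load_from_store, max_entries=2)
for key in ["a", "b", "a", "c"]:
    cache.get(key)
```

With a limit of two entries, the second read of "a" is served from the cache, and loading "c" evicts "b" because it was used least recently. A write-through cache would add a `put` that updates the store and the cache together, which changes the failure cases you have to handle.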

Decoding State Intricacies Through a Practical Example

Let’s take a closer look at how a stateful system works by using a straightforward example. We’ll think about a system that keeps track of how users interact with ads—it notes down every time a user sees an ad or clicks on one. With millions of users, that’s a lot of information to handle.

This information is super important because it helps to analyze trends and patterns, which means the system has to store and manage the data really well.

System Requirements

  • High Availability: The data store must always be available. Downtime means inability to record or access crucial data. To avoid this, and the unpleasant task of explaining such lapses to the CEO, we must ensure that we have more than just a single copy of our data.
  • Performance: Speed is critical. The system must handle queries and updates efficiently, so we need to shard the data across different nodes. Otherwise we will always be bottlenecked by the resources of a single machine.
  • Handling Data Skew: Some users may generate significantly more data than others, potentially creating bottlenecks.

Design Challenges and Considerations

Let’s break down the challenges we need to tackle to design such a system in a smart way:

Data Redundancy and Replication

  • Multiple Copies: To keep our service up even if part of it goes down, we need to keep copies of our data on more than one system. The number of copies depends on how much redundancy you want.
  • Replication Strategy: Deciding when and how to acknowledge write operations is crucial, especially in a distributed environment. There is a fine balance between consistency and availability, so make sure you understand the business requirements before you design your system.
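One way to reason about that balance is to treat the acknowledgement rule as a policy. The sketch below is illustrative only: the policy names `one`, `quorum`, and `all` are assumptions for this example and are not tied to any particular database.

```python
def required_acks(policy, copies):
    """How many copies must confirm a write before it is acknowledged."""
    if policy == "one":
        return 1
    if policy == "quorum":
        return copies // 2 + 1  # strict majority
    if policy == "all":
        return copies
    raise ValueError(f"unknown policy: {policy}")


def write_succeeds(policy, copies, reachable):
    """A write can be acknowledged only if enough copies are reachable."""
    return reachable >= required_acks(policy, copies)


# Three copies of the data, one of them currently unreachable.
results = {
    policy: write_succeeds(policy, copies=3, reachable=2)
    for policy in ("one", "quorum", "all")
}
```

With one of three copies down, `one` and `quorum` can still acknowledge writes while `all` cannot. The stricter the policy, the stronger the guarantee that every copy has the write, and the more easily a single failure makes the system unavailable for writes.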

Data Sharding

  • Sharding Mechanism: Distributing data across multiple machines to manage load and disk space is a must. Key considerations include:
    • Shard Keys: Determining the right shard keys and the frequency of resharding as data grows.
    • Handling Hotspots: Avoiding hotspots where a single shard gets overwhelmed with requests.
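The following sketch shows hash-based shard routing, plus one common way to soften a hotspot: key salting. It is a simplified illustration, not a complete design. The function names, the `#` separator, and the use of an event id to pick the salt are all assumptions, and plain modulo routing like this forces most keys to move when the shard count changes.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def salted_keys(user_id, num_salts):
    """All sub-keys that a heavy user's data may be stored under."""
    return [f"{user_id}#{salt}" for salt in range(num_salts)]


def shard_for_write(user_id, event_id, num_shards, num_salts):
    """Spread one heavy user's writes over several sub-keys."""
    salt = event_id % num_salts
    return shard_for(f"{user_id}#{salt}", num_shards)


normal_shard = shard_for("user-1", num_shards=8)
hot_write_shard = shard_for_write("hot-user", event_id=5, num_shards=8, num_salts=4)
hot_read_shards = {shard_for(k, 8) for k in salted_keys("hot-user", 4)}
```

The trade-off is on the read side: to get everything for a salted user, the reader has to query every shard in `hot_read_shards` and merge the results.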

Fault Tolerance and Recovery

  • Machine Failures: Strategies for handling failures, including how quickly a backup or new shard becomes operational.
  • Request Handling During Failures: Managing read/write requests during transitions and ensuring data integrity and consistency.

Designing a stateful system like this ad tracking example involves navigating a maze of complexities. From ensuring data availability and performance to managing data distribution and handling failures, each aspect requires careful planning and robust implementation strategies. Understanding these layers of complexity is essential for anyone venturing into the realm of stateful service design.

Considering Cloud-Based Solutions as Alternatives

While designing and managing stateful systems in-house offers control, it comes with significant complexity. An effective alternative approach is to rely on cloud-based services. These services offer similar levels of reliability and scalability but without the overhead of managing the intricate details of stateful architectures. In the next blog post, we will delve into how to choose the right cloud-based service for your needs. We will also explore the decision between using cloud-managed services and managing open-source solutions in-house.

For more insights and a deeper exploration of suitable AWS solutions for stateful services, be sure to check out the next installment in this series: ‘Choosing the Right AWS Solution for Stateful Services: A Practical Guide’.
