AWS and the Tyranny of Choice
written by Geoffrey Greene on 12/1/2022
An article written in 2004 for Scientific American, titled "The Tyranny of Choice," states that…
It is only logical to think that if some choice is good, more is better; people who care about having infinite options will benefit from them, and those who do not can always just ignore the 273 versions of cereal they have never tried. Yet recent research strongly suggests that, psychologically, this assumption is wrong. Although some choice is undoubtedly better than none, more is not always better than less.
There are, as of this writing, more than 305 services and tools offered by AWS. These range from command-line interfaces (CLIs) to databases (both SQL and NoSQL), multiple container services, an endless buffet of mechanisms for moving data around, authentication frameworks and tools, big data products, networking features, and even physical storage devices that can be shipped to your datacenter to copy data. Hell, AWS will even send you a storage area network (SAN) in a tractor-trailer if you have petabytes of data to move. The number of services offered by AWS is staggering and growing.
But what can really cause your head to spin is that many of these services overlap, some nearly identical in purpose. Add to that the fact that AWS routinely creates new services to replace old ones (but leaves the old ones around, because you can't just shut down technology people are using - cough, looking at you, Google), and you end up with a confounding menu of similar services. To be fair, AWS often does enhance existing services with new features. But looking at the details, you begin to understand why AWS sometimes creates an entirely new service offering instead of upgrading an existing one. First, some of these services have high adoption rates; many AWS customers rely on them, and reliability has to be the top priority. Second - and I'm guessing here, because I don't know this in every case - AWS services are often forks of open source software projects. Adding new features may mean building on top of a different, more modern OSS project, which makes it impossible to support both the existing and the new functionality in the same service; it is nearly impossible to merge the functionality of different OSS projects, or even major upgrades within the same project. Again, this is a guess, but it lines up with how AWS rolls out significant new offerings and enhancements.
Let's look at some examples of the Tyranny of Choice (ToC) on AWS. The first services with significant overlap are the family of load-balancers. These are request-scaling devices that distribute client-generated load to a cluster of backend targets (EC2 instances, Lambdas, ECS tasks). In 2022 there are 4 flavors of load-balancer. In 2012, when I had my first significant build-out on AWS, there was one load-balancer, called ELB (Elastic Load Balancer). Now it's called CLB (Classic Load Balancer). It is a simple load-balancer with few features, but it is reliable and easy to understand and manage, and the features it lacked were not terribly difficult to implement in code. The newer version is called the Application Load Balancer (ALB) and supports many useful features. One that is particularly handy is content-based routing: the ability to direct requests to back-end services after evaluating the textual content of the HTTP payload. In other words, the ALB can parse the URL or the headers of the HTTP request and route traffic accordingly (there's a small sketch of a path-based routing rule after the list below). The CLB can only map requests based on server:port. The other two load-balancers (NLB, GWLB) are more task-specific. The NLB (Network Load Balancer) is appropriate for scaling TCP/UDP applications, especially those that might require stateful connections. The GWLB (Gateway Load Balancer) has a very specific use-case: directing traffic to network/security appliances for inspection before returning a response to the client. The best way to think about this is…
- Classic load-balancer → legacy, probably should not be used for new applications
- Application load-balancer → used for typical HTTP request/response webapps (though it also supports stateful WebSockets)
- Network load-balancer → used for high-traffic applications, especially those that require stateful connections (live sports scores, financial trading apps, chat, etc.).
- Gateway load-balancer → real-time security/privacy analysis of inbound requests
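To make content-based routing concrete, here is a minimal sketch in Python with boto3 of an ALB listener rule that sends /api/* traffic to a dedicated target group - the kind of rule the CLB simply can't express. The listener and target group ARNs are hypothetical placeholders:

```python
import boto3

# Hypothetical ARNs for an existing ALB listener and target group.
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/abc123/def456"
API_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/api-servers/789xyz"

elbv2 = boto3.client("elbv2")

# Route any request whose path starts with /api/ to the API target group;
# everything else falls through to the listener's default action.
elbv2.create_rule(
    ListenerArn=LISTENER_ARN,
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/api/*"]}],
    Actions=[{"Type": "forward", "TargetGroupArn": API_TARGET_GROUP_ARN}],
)
```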
To be fair, from my vantage point, the load-balancer family on AWS is a straightforward set of products. The overlap between them is not especially confusing, even with the bare minimum of googling. And in developing these products AWS has shown how it listens to customers: this is a comprehensive set of offerings covering just about everything you might want from a load-balancer, versus the single product on offer (CLB) pre-2016. Even so, there are a number of subtle differences between these products (depending on your use-case) that can trip you up. I think the load-balancer family, and particularly the CLB and ALB, is an example of AWS deploying an entirely new underlying architecture, possibly built on top of two different OSS projects or a significant fork of a single project. The CLB and ALB serve ostensibly the same purpose, but the ALB has a number of critical features that couldn't simply be bolted onto the existing CLB. And thus we have a case where two AWS service offerings support nearly identical functions.
Simple Queue Service (SQS)/Simple Notification Service (SNS)
A better example of the ToC in AWS is the family of event producer/consumer/streaming services. These are mainly SQS, SNS, and KDS (Kinesis Data Streams); but also throw in EventBridge, Kinesis Data Firehose, CloudWatch Events, S3 Events, and Lambda just to make your head spin. Choosing the appropriate event service among these can be a head-scratcher, and it is often the case that you need to combine these services for certain common architectures. The real event-processing workhorses, though, are SQS and SNS, which cover the essentials for an event-based architecture. They are products with a long track record of reliability, high performance, and effortless scale, and they are truly serverless, requiring no upkeep from end-users.
- SQS: point to point messaging. A message on the queue is expected to be processed exactly once. Consumers of the message are obligated to delete a message once it is successfully processed. Clients must “poll” for events.
- SNS: send a message, opt into getting a message. A message sent to SNS can be processed by any service that has subscribed to the “topic”. SNS can “push” events to clients.
SQS is a point-to-point messaging platform. Applications or other services push messages to a queue and consumers poll that queue. A good analogy is waiting in line in a store, especially if the store has a single line and a light at an array of registers indicating when a cashier is ready for another customer. Think of the light as the "worker" polling for the next message (customer). SNS, on the other hand, is a "pub/sub" (publish and subscribe) system. A single message can be pushed to multiple clients via "topics", and from there the message can be delivered to a variety of endpoints (mobile push, email, SMS, etc.). In the physical world, an example of pub/sub could be a bus station with a PA system announcing the schedule of departures and arrivals. Every rider in the station hears the announcements, and each must decide what to do with the information. This isn't a perfect analogy, since pub/sub systems allow you to be selective about the messages you receive.
Simply stated, SQS is good for decoupling within your stack, and SNS is good for messaging external systems and services. In certain types of complex systems these services can be combined, with messages sent first to SNS and an SQS queue subscribed to one or more topics (known as a "fan-out" architecture).
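Here's a minimal sketch of that fan-out wiring in Python with boto3. The topic and queue names are made up for illustration; in real life you'd handle errors and most likely define this in CloudFormation or Terraform rather than ad-hoc API calls:

```python
import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Hypothetical names, purely for illustration.
topic_arn = sns.create_topic(Name="order-events")["TopicArn"]
queue_url = sqs.create_queue(QueueName="fulfillment-queue")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# The queue needs a policy allowing this specific topic to deliver to it.
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={
        "Policy": json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }],
        })
    },
)

# Subscribe the queue to the topic: every message published to the topic
# is now copied onto the queue (the "fan-out" pattern).
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
```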
The checkout process for an ecommerce website is a good if basic example of an SQS vs SNS use case. When a customer clicks “checkout” at least 4 things have to happen:
- process credit card transaction
- send confirmation email
- send info to fulfillment
- render “order complete” page
We're not going to fulfill or confirm the order if the credit card transaction fails, which makes this a good use case for SQS. I could process the credit card directly in my application code, but that would add complexity and load to my web tier. The credit card should be processed asynchronously, since it likely means communicating with an external service and it's not predictable how long a response might take. A good option is to have the web tier send an SQS message, which is picked up by a credit card processing worker that validates the transaction and publishes the result as an SNS message. Why SNS? Because I still have 3 more jobs to do: send the email, notify fulfillment, and render the order-complete page. Each of these is independent of the others but dependent on the result of the credit card transaction. There's no need to couple them, so I can scale all of this independently of the infrastructure serving the website. If the credit card transaction fails, the SNS message is still sent, but this time it carries metadata indicating "cc transaction failed" so the fulfillment and confirmation listeners don't pick it up (SNS has a handy message filtering feature for exactly this). The page-render listener does get the message and renders the failure UI (probably something like "payment failed, try a different card"). I'm not including the details of how the web page polls asynchronously for the result, but of course that piece is needed too.
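Here's roughly what the filtering piece looks like with boto3. The ARNs, the attribute name ("cc_status"), and the payload are all hypothetical; the point is that the credit card worker publishes the outcome as a message attribute, and the fulfillment/email subscriptions filter on it:

```python
import json
import boto3

sns = boto3.client("sns")

# Hypothetical ARNs for the checkout topic and the fulfillment subscription.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:checkout-events"
FULFILLMENT_SUB_ARN = TOPIC_ARN + ":11111111-2222-3333-4444-555555555555"

# The credit card worker publishes the transaction outcome as a message attribute.
sns.publish(
    TopicArn=TOPIC_ARN,
    Message=json.dumps({"order_id": "ORD-1001", "amount": 49.99}),
    MessageAttributes={
        "cc_status": {"DataType": "String", "StringValue": "failed"}
    },
)

# The fulfillment (and email) subscriptions filter on that attribute, so a
# failed transaction never reaches them; only the page-render listener sees it.
sns.set_subscription_attributes(
    SubscriptionArn=FULFILLMENT_SUB_ARN,
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps({"cc_status": ["approved"]}),
)
```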
Streaming services: Kinesis Data Streams and Kinesis Data Firehose (KDS/KDF)
In the previous section I discussed SQS and SNS, which are familiar and relatively simple services for most technologists to understand, but which still have some overlap. These are internet-scale AWS versions of messaging infrastructure that has been in use for many years; anyone familiar with Java/J2EE messaging will recognize SQS/SNS. Streaming services like KDS/KDF are not new either, of course, but they do overlap with, and can be used in place of, classic messaging architectures. There are differences between "streaming" and "messaging," but both serve a similar decoupling purpose between data producers and consumers. In short, the Kinesis family is appropriate for scenarios where the consumer is expected to process and/or aggregate the data in real time, perhaps by using Kinesis Data Analytics (KDA) or some other stream-processing framework, like Apache Spark.
You could construct a real-time data processing application from a combination of SQS and a mechanism to persist and query the data, but you would run into issues of scalability and utility, as the combination of KDS and KDA/Spark is purpose-built for real-time analytics and high-throughput data processing. In short, you'd be reinventing the wheel, and probably not a good one. What about using KDS as a messaging platform? KDS has a similar boundary between producers and consumers and even has message ordering by default - a feature you would have to explicitly enable (using the lower-throughput FIFO configuration) in the SQS/SNS family. A big barrier for KDS is that scaling it can be quite expensive and requires manual intervention, unless you use the pure on-demand version of the service (which adds even more cost). SQS/SNS, by contrast, is advertised as having automatic and limitless scale.
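For contrast, here's what producing to KDS looks like (boto3, with a hypothetical stream name). Note where the ordering comes from: records that share a partition key land on the same shard and are read back in order:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Hypothetical stream name; records sharing a partition key go to the same
# shard, which is what gives Kinesis its per-key ordering guarantee.
kinesis.put_record(
    StreamName="clickstream",
    PartitionKey="session-1234",
    Data=json.dumps({"event": "add_to_cart", "sku": "ABC-1"}).encode(),
)
```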
The client library support is also much more robust in the SQS/SNS family: you are very likely to find a mature set of APIs for your language and a better fit in a variety of architectures. If you are looking for mechanisms to decouple horizontally within your architecture, SQS/SNS is likely the better option. Processing data for ML/AI or real-time analytics is the sweet-spot use case for the Kinesis family of services. A good example of a streaming use-case is a clickstream processing application, like Google Analytics, where the requirement is low latency between events and visualization, i.e. users need to see what's currently happening on their website. But what if we used KDS in place of SQS/SNS in my checkout example from above? Right off the bat we'd need a mechanism to tag messages that have been processed, to avoid duplicating transactions. SQS has built-in features that make it easy to process a message only once (message visibility semantics and message deletion), as shown in the sketch below. KDS, by contrast, is a simple high-throughput ledger of events, with no metadata or state information about the events on the ledger: a streamed event is immutable and can't be deleted or marked with a status. That functionality would have to be added in a parallel system, and it is not trivial to implement.
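Below is a bare-bones SQS consumer loop (boto3, with a hypothetical queue URL and a stubbed-out handler) showing the visibility and deletion semantics I'm talking about - the pieces you'd have to reinvent on top of KDS:

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URL for the checkout example.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/checkout-queue"


def process_order(body: str) -> None:
    # Stand-in for the real business logic (charge the card, etc.).
    print("processing", body)


while True:
    # Long-poll for up to 20 seconds; a received message becomes invisible to
    # other consumers for 60 seconds (the visibility timeout).
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
        VisibilityTimeout=60,
    )
    for msg in resp.get("Messages", []):
        process_order(msg["Body"])
        # Deleting the message marks it as successfully handled; if the worker
        # crashes before this call, the message reappears and is retried.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```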
Up to this point, I haven't said anything about Kinesis Data Firehose (KDF). KDF's core purpose is to move streamed data into conventional data stores and analytics destinations (S3, Redshift, OpenSearch) for near-real-time analysis. There are a number of differences between the two services, but the big one is latency: KDS has sub-second latency, while KDF buffers for 60 seconds or more. If you don't need real-time processing, KDF is a good choice, as there is no manual configuration of shards or any real management overhead. KDF is often paired with KDS to move streamed data into a data lake or warehouse.
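And a quick sketch of the KDF side, pushing a record onto a hypothetical delivery stream that buffers into S3; Firehose handles the batching and delivery from there:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream configured to buffer records into an S3 data lake.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps({"event": "page_view", "path": "/checkout"}) + "\n").encode()},
)
```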
If you are still with me and have gotten this far, congratulations! It's been a long post and I appreciate your patience. I limited my examples to load-balancers and messaging because at 5 pages I think I've made my point. But every single AWS service category will confront you with a similar set of tradeoffs and overlapping feature sets. The AWS documentation is the best place to start, but YouTube, Reddit, Udemy, Stack Overflow, random blog posts (like this one!), and Twitter have also been important resources for filling in the gaps in my knowledge of AWS.
I hope I've made the case that AWS has become, like Amazon, the "everything store" of cloud services. And like Amazon, it's often really difficult to choose the absolute perfect product. AWS's reputation for customer obsession is evident in everything it builds and in its commitment to maintaining everything it releases. Paradoxically, as AWS responds to new customer demands, the platform starts to sprawl, eventually entering a sustained period of "bloat" and complexity. Google notoriously handles this problem by ruthlessly deprecating underperforming products. How AWS adapts to a future of increasing complexity will be the story of the modern era of cloud computing.