What Is Event-Driven Architecture?
Event-driven architecture is an architectural style where applications and services communicate with each other asynchronously. So instead of Application A calling each of Services B, C and D, and waiting for all of those services to finish, it posts its request or event to a messaging service. That messaging service returns immediately with an acknowledgement, so our Application A can move on to other work. Meanwhile, Services B, C and D each consume their copy of the event and process it. This is roughly analogous to your boss giving you some work to do, and walking away rather than standing there waiting for you to finish. An event-driven architecture can boost performance, maintainability and scalability, at the expense of some added complexity. So let’s dive in.
Let’s use a healthcare reporting scenario as an example. We have a Reporting System that external healthcare providers use to report a claim. It takes this information and feeds it into the insurer’s internal Claims Management System (CMS). The CMS persists claim data to a database.
A very simple architecture would see the Reporting System call the CMS synchronously. So after the Reporting System captures and validates the new claim information, it makes a createClaim call to the CMS. (We’ll set aside the security implications for now.) Notice what happens when the Reporting System makes this call. It waits until the CMS finishes doing all its work of validating, applying business logic, and persisting to the database. The Reporting System is blocked until the CMS finishes its business.
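To make the blocking behaviour concrete, here is a minimal Python sketch of the synchronous call. The class names, the `create_claim` method and the 50 ms sleep are all illustrative assumptions standing in for the real systems; nothing here reflects an actual CMS API.

```python
import time


class ClaimsManagementSystem:
    """Hypothetical stand-in for the CMS."""

    def __init__(self):
        self.database = []

    def create_claim(self, claim):
        # Simulated validation, business logic and database write.
        time.sleep(0.05)
        self.database.append(claim)
        return "persisted"


class ReportingSystem:
    """Hypothetical stand-in for the Reporting System."""

    def __init__(self, cms):
        self.cms = cms

    def report_claim(self, claim):
        # The caller is blocked on this line until the CMS has finished.
        return self.cms.create_claim(claim)


cms = ClaimsManagementSystem()
reporting = ReportingSystem(cms)

started = time.perf_counter()
status = reporting.report_claim({"claim_id": "C-1", "amount": 120.0})
elapsed = time.perf_counter() - started  # includes all of the CMS's work
```

The elapsed time measured by the Reporting System includes everything the CMS does, which is exactly the coupling in time that the event-driven version removes.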
In contrast, with an event-driven architecture, the claim reporting system publishes an event to a messaging service, then goes about its business. Another process consumes that claim report, and persists it to the database. Here the consumer is working asynchronously.
So here our Reporting System publishes a claimCreated event to the Messaging System. The Messaging System immediately responds with an acknowledgement, and the Reporting System goes on to other work. Meanwhile, the CMS sees that claimCreated event and processes it.
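The same flow can be sketched in Python with an in-memory queue playing the part of the Messaging System. This is only a sketch under assumptions: `MessagingSystem`, `run_cms_consumer` and the `"ack"` return value are hypothetical names, and a real messaging service would run as a separate process with its own delivery guarantees.

```python
import queue
import threading


class MessagingSystem:
    """Hypothetical in-memory stand-in for a real messaging service."""

    def __init__(self):
        self.events = queue.Queue()

    def publish(self, event_type, payload):
        self.events.put({"type": event_type, "payload": payload})
        return "ack"  # confirms receipt only; says nothing about processing


def run_cms_consumer(messaging, database):
    """Consume events forever, persisting each claimCreated payload."""
    while True:
        event = messaging.events.get()
        if event["type"] == "claimCreated":
            database.append(event["payload"])
        messaging.events.task_done()


messaging = MessagingSystem()
cms_database = []
threading.Thread(
    target=run_cms_consumer, args=(messaging, cms_database), daemon=True
).start()

ack = messaging.publish("claimCreated", {"claim_id": "C-1", "amount": 120.0})
# The Reporting System is free to do other work at this point.

messaging.events.join()  # demo only: wait so the result can be inspected
```

Note that the acknowledgement only tells the publisher that the event was accepted. Whether the CMS later succeeds in persisting the claim is a separate question.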
In their book Fundamentals of Software Architecture: An Engineering Approach, authors Mark Richards and Neal Ford describe two topologies in event-driven architecture: Mediator and Broker. In the Mediator topology, a central process synchronously calls each of the other processes needed to accomplish the task. Think of it as an orchestra where a conductor invokes each step in the proper order. In contrast, in the Broker topology a producer publishes events, and any processes or services that are interested in those events consume them.
For the sake of simplicity, we’ll use the Broker topology in this article.
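As a rough illustration of the difference between the two topologies, the Python sketch below puts them side by side. The step names (`validate`, `adjudicate`, `notify`) and the `Broker` class are hypothetical, and for brevity this broker delivers events in-process and synchronously, which a real broker would not.

```python
from collections import defaultdict


# Mediator: one coordinator knows every step and the order to run them in.
def validate(claim, log):
    log.append("validate")


def adjudicate(claim, log):
    log.append("adjudicate")


def notify(claim, log):
    log.append("notify")


def run_mediator(claim):
    log = []
    for step in (validate, adjudicate, notify):
        step(claim, log)
    return log


# Broker: the producer only publishes; it does not know who is listening.
class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.subscribers[event_type]:
            handler(payload)


steps = run_mediator({"claim_id": "C-1"})

broker = Broker()
received = []
broker.subscribe("claimCreated", lambda c: received.append(("cms", c["claim_id"])))
broker.subscribe("claimCreated", lambda c: received.append(("analytics", c["claim_id"])))
broker.publish("claimCreated", {"claim_id": "C-1"})
```

In the mediator version, adding a step means changing the coordinator. In the broker version, adding a consumer is just another `subscribe` call and the producer is untouched.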
When to use an Event-Driven Architecture
As with anything in software architecture, there are no right or wrong answers, just a different set of trade-offs. What may be suitable in your context may be unsuitable in mine. It all depends on the constraints each of us has, and the relative importance of architecture characteristics such as scalability, maintainability, performance, deployability, and fault tolerance.
Here are some situations where an event-driven architecture is worth considering.
A common use case is when you need to get two or more different systems or applications to talk to each other. We saw that with our Reporting System and CMS.
You could have the Reporting System write directly to the database that the CMS owns. If you do that, the Reporting System now has detailed knowledge of the inner workings of the CMS. For a simple use case this may be fine, but a change to the CMS can very well have a direct impact on the Reporting System. The scope and complexity of your work just expanded. If you have other consumers of the Reporting System, you’re looking at changes to them too. This is called tight coupling. Ugh.
On the other hand, if we just have the Reporting System publish a claimCreated event to the Messaging System, it doesn’t care who consumes the event. All it needs to do is publish the message, and then it can go on to other work (“fire and forget”).
From the perspectives of maintainability and deployability, an event-driven architecture lets the Reporting System and the CMS evolve at their own pace. Changes to either of those systems need not impact the other, as long as the content and structure of the message payload remain unchanged. You can now deploy changes sooner with less risk. You’ve also reduced your testing effort. This is known as a loosely coupled architecture.
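Since the message payload is the only thing the two systems share, it helps to treat it as an explicit contract. The Python sketch below shows one way this might look; the field names, the envelope layout and the `version` number are assumptions made for illustration, not something the scenario prescribes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClaimCreated:
    """Hypothetical payload contract shared by producer and consumer."""

    claim_id: str
    provider_id: str
    amount: float


def to_message(event):
    # Producer side: wrap the payload in a typed, versioned envelope.
    return json.dumps({"type": "claimCreated", "version": 1, "data": asdict(event)})


def cms_handle(message):
    # Consumer side: depend only on the envelope and the fields it needs.
    envelope = json.loads(message)
    if envelope["type"] != "claimCreated" or envelope["version"] != 1:
        raise ValueError("unsupported message")
    data = envelope["data"]
    return {"claim_id": data["claim_id"], "amount": data["amount"]}


message = to_message(ClaimCreated(claim_id="C-1", provider_id="P-9", amount=120.0))
record = cms_handle(message)
```

Neither side imports the other's code. The producer can be rewritten freely as long as it keeps emitting this shape, and the consumer ignores any fields it does not use.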
When Low Latency is Essential
Take for example an e-commerce application. After the customer clicks Submit My Order, order placement, payment processing, inventory checking, and shipping notifications all need to take place. Instead of making the customer wait for all that to happen, it might be better to fire an orderSubmitted event, and tell the customer “Order submitted, thank you.” Then all those processes can consume their own copy of that event. The system can notify the customer if any of those processes fails, or (ideally) when their order ships.
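Here is a small Python sketch of that fan-out, where each downstream process gets its own queue and its own copy of the event. `FanOutBroker`, the subscriber names and `submit_order` are hypothetical; a real broker would provide this through topics or consumer groups.

```python
import copy
from collections import deque


class FanOutBroker:
    """Hypothetical broker that gives every subscriber its own queue."""

    def __init__(self):
        self.queues = {}

    def subscribe(self, name):
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, event):
        for subscriber_queue in self.queues.values():
            # Each consumer receives an independent copy of the event.
            subscriber_queue.append(copy.deepcopy(event))
        return "ack"


broker = FanOutBroker()
payment_queue = broker.subscribe("payment")
inventory_queue = broker.subscribe("inventory")
shipping_queue = broker.subscribe("shipping")


def submit_order(order):
    broker.publish({"type": "orderSubmitted", "order": order})
    # The customer gets a response before any downstream work has run.
    return "Order submitted, thank you."


response = submit_order({"order_id": "O-1", "items": ["book"]})
```

Each service can now drain its own queue at its own pace, and a slow inventory check does not hold up the response to the customer.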
Another example is IoT (Internet of Things). These devices can generate reams of data and events. To have them call their downstream services synchronously can be problematic, given the volumes of data and the device’s limited processing power. Instead, if they publish their data and events to a messaging or streaming system, they need not wait for their downstream consumers to finish. In turn the consumers, such as monitoring and analysis systems, can process all this data independently of the IoT devices.
Let’s suppose you have a system consisting of several independently deployed services (microservices). The nature of this system is that some parts experience a greater load than others. Let’s take an auction as an example. A Bid Capture system takes bids from two different sources: one from the floor to the auctioneer, and the other from online bidders. Online bids come in at a much faster rate than those from the auction floor. A Bid Recorder system records the bids from the Bid Capture system, and it forms the system of record. We would have a terrible user experience if every bid had to wait until the Bid Recorder validated and persisted the bid, particularly toward the end of a very popular auction. It would be worth considering having the Bid Capture system post events to a messaging system, and letting the Bid Recorder do its work asynchronously.
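The auction scenario can be sketched in Python with a queue acting as a buffer between the two systems. The function names and the `None` shutdown sentinel are assumptions for this sketch, and the in-process queue stands in for a real messaging system.

```python
import queue
import threading

bids = queue.Queue()   # stand-in for the messaging system
system_of_record = []  # stand-in for the Bid Recorder's storage


def bid_recorder():
    """Persist bids one at a time, at whatever pace the recorder can manage."""
    while True:
        bid = bids.get()
        if bid is None:  # sentinel used only to end this demo
            break
        system_of_record.append(bid)


def capture_bid(source, amount):
    """Accept a bid immediately; recording happens later."""
    bids.put({"source": source, "amount": amount})
    return "accepted"


recorder = threading.Thread(target=bid_recorder)
recorder.start()

# A burst of online bids arrives much faster than floor bids.
for i in range(100):
    capture_bid("online", 100 + i)
capture_bid("floor", 250)

bids.put(None)
recorder.join()
```

The queue absorbs the burst, so `capture_bid` returns right away no matter how far behind the recorder is. With a single recorder consuming from a single queue, bids are also persisted in the order they were captured.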
So What’s The Catch?
As with anything in software architecture, there are no right or wrong answers, just a different set of trade-offs. There are some drawbacks to an event-driven architecture that, in your context, may be a worthwhile price to pay for the benefits you gain. On the other hand, if you have a simple system with few constraints, this architecture may be overkill. It all depends on your requirements for performance, maintainability, deployability, etc.
You have added complexity
In the sequence diagrams above, you can see there is another system involved – the Messaging System. So you need to provision it, configure it, secure it, and monitor it.
Error Handling is more complicated
Our Reporting System has no way of knowing whether the back-end processing on the CMS completed successfully or failed. So you need to design some way of handling errors. Some errors are really alternate use case flows, such as the denial of a claim. You might deal with these by showing a notification in the Reporting System UI, or a status field in a list of Claims. Others are system errors, such as a database error or an application error. There’s nothing the Reporting System user can do about these.
One strategy I’ve seen used is to have the consumer (our CMS) park the failed message into an error queue or topic within the messaging system. An alert then goes out to the support people so they can intervene.
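That error queue strategy might look roughly like the Python sketch below. The `persist_claim` failure condition, the shape of the parked message and the `alerts` list (standing in for a paging or ticketing integration) are all hypothetical.

```python
from collections import deque

claim_events = deque([
    {"claim_id": "C-1", "amount": 120.0},
    {"claim_id": "C-2", "amount": None},  # will fail to persist
    {"claim_id": "C-3", "amount": 75.5},
])
error_queue = deque()  # parked messages awaiting human intervention
alerts = []            # stand-in for notifying the support team


def persist_claim(claim):
    if claim.get("amount") is None:
        raise ValueError("amount is missing")
    return claim


def consume(events):
    persisted = []
    while events:
        event = events.popleft()
        try:
            persisted.append(persist_claim(event))
        except Exception as exc:
            # Park the failed message with its error, then keep consuming.
            error_queue.append({"event": event, "error": str(exc)})
            alerts.append(f"claim {event.get('claim_id')} parked: {exc}")
    return persisted


persisted = consume(claim_events)
```

One bad message no longer blocks the ones behind it, and the parked message keeps both the original event and the reason it failed, so support can fix and replay it.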
Troubleshooting becomes more difficult
We used a very simple example with our Reporting System and our CMS above. But suppose there are more consumers? Let’s say the CMS publishes an event, claimAdjudicated, and we have an Analytics System consuming it. Your support people get a ticket from the analytics team who wonder why some Claims are not on their analysis portal. Now you have to figure out where the problem happened. Was it in the CMS? Did the CMS even get the claimCreated event? Maybe it did, but the Analytics System failed. This is where you need a good logging and tracing solution to help you troubleshoot these problems. Regardless, it now becomes a game of “follow the bread crumbs” to learn exactly where the error happened.
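One common way to leave those bread crumbs is to stamp every event with a correlation ID and have each system log it. The Python sketch below assumes a shared in-memory `trace_log` and hypothetical handler names; a real setup would write to a central logging or tracing service instead.

```python
import uuid

trace_log = []  # stand-in for a central logging or tracing service


def record(system, step, correlation_id):
    trace_log.append({"system": system, "step": step, "correlation_id": correlation_id})


def reporting_publish(claim_id):
    correlation_id = str(uuid.uuid4())
    record("Reporting System", "published claimCreated", correlation_id)
    return {"type": "claimCreated", "claim_id": claim_id, "correlation_id": correlation_id}


def cms_handle(event):
    # Downstream events carry the same correlation ID as the original.
    cid = event["correlation_id"]
    record("CMS", "consumed claimCreated", cid)
    record("CMS", "published claimAdjudicated", cid)
    return {"type": "claimAdjudicated", "claim_id": event["claim_id"], "correlation_id": cid}


def analytics_handle(event, fail=False):
    step = "failed on claimAdjudicated" if fail else "consumed claimAdjudicated"
    record("Analytics System", step, event["correlation_id"])


def trail(correlation_id):
    return [
        f"{entry['system']}: {entry['step']}"
        for entry in trace_log
        if entry["correlation_id"] == correlation_id
    ]


created = reporting_publish("C-1")
adjudicated = cms_handle(created)
analytics_handle(adjudicated, fail=True)  # simulate the Analytics System failing
crumbs = trail(created["correlation_id"])
```

Filtering the log by one correlation ID shows how far a given Claim travelled and which system last touched it, which answers the "was it the CMS or the Analytics System?" question directly.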
An event-driven architecture can be a powerful solution to handle highly variable volumes of data, and/or to provide a responsive experience to your end users. But the price you pay is in complexity and error handling. There are contexts where this is a worthwhile trade-off, especially if you can mitigate some of the downsides. On the other hand, you may have a simpler context where an event-driven architecture is overkill, and simple synchronous calls are good enough. It all depends on the scalability, maintainability, deployability and performance constraints you are facing.