ekidd 6 hours ago

Let's assume you're not a FAANG, and you don't have a billion customers.

If you're gluing microservices together using distributed transactions (or durable event queues plus eventual consistency, or whatever), the odds are good that you've gone far down the wrong path.

For many applications, it's easiest to start with a modular monolith talking to a shared database, one that natively supports transactions. When this becomes too expensive to scale, the next step may be sharding your backend. (It depends on whether you have a system where users mostly live in their silos, or where everyone talks to everyone. If your users are siloed, you can shard at almost any scale.)

Microservices make sense when they're "natural". A video encoder makes a great microservice. So does a map tile generator.

Distributed systems are expensive and complicated, and they kill your team's development velocity. I've built several of them. Sometimes, they turned out to be serious mistakes.

As a rule of thumb:

1. Design for 10x your current scale, not 1000x. 10x your scale allows for 3 consecutive years of 100% growth before you need to rebuild. Designing for 1000x your scale usually means you're sacrificing development velocity to cosplay as a FAANG.

2. You will want transactions in places that you didn't expect.

3. If you need transactions between two microservices, strongly consider merging them and having them talk to the same database.

Sometimes you'll have no better choice than to use distributed transactions or durable event queues. They are inherent in some problems. But they should be treated as a giant flashing "danger" sign.

  • natdempk 3 hours ago

    Pretty great advice!

    I think the one hard thing you can run into is that once you want to support datasets that fall outside the scope of a transaction (think events/search/derived data, anything that needs to read/write a system that is not your primary transactional DB), you probably do want some sort of event bus/queue to get eventual consistency across all of them. Otherwise you end up in impossible situations when you try to manage things like a DB write + ES document update: something has to fail, your state is desynced across datastores, and you're in velocity/bug hell.

    The other side of this, though, is that once you introduce the event bus and transactional outbox or whatever, you then have the problem of writes/updates happening and not being reflected immediately. I think the best solutions are things like Meta's TAO, which combines these concepts, but I have no idea what is available to mere mortals/startups. Would love to know if anyone has killer recommendations here.
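
    A minimal sketch of the transactional-outbox idea with Postgres, assuming psycopg2 and invented table/column names: the event/search update is recorded in the same transaction as the business write, and a separate relay publishes it later, so the two stores can lag but never permanently diverge.

    ```python
    import json
    import psycopg2

    def place_order(conn, user_id, total):
        """Write the business row and the outbox row in ONE transaction."""
        with conn:  # psycopg2: commit on success, rollback on exception
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO orders (user_id, total) VALUES (%s, %s) RETURNING id",
                    (user_id, total),
                )
                order_id = cur.fetchone()[0]
                cur.execute(
                    "INSERT INTO outbox (topic, payload) VALUES (%s, %s)",
                    ("order_created", json.dumps({"order_id": order_id})),
                )

    def relay_outbox(conn, publish):
        """Separate process: push unpublished outbox rows to the bus, then mark them."""
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id, topic, payload FROM outbox WHERE published_at IS NULL "
                "ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED"
            )
            for row_id, topic, payload in cur.fetchall():
                publish(topic, payload)  # your Kafka/Pub/Sub producer; consumers must tolerate duplicates
                cur.execute("UPDATE outbox SET published_at = now() WHERE id = %s", (row_id,))
    ```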

    • ljm 17 minutes ago

      I think the question is whether you need the entire system to be strongly consistent, or just the core of it?

      To use Elasticsearch as an example: do you need to add the complexity of keeping the index up to date in real time, or can you live with periodic updates from a background job?

      As long as your primary DB is the source of truth, you can use that to bring other less critical stores up to date outside of the context of an API request.
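
      For the periodic-update option, a background job that re-indexes whatever changed since the last run is often enough. A rough sketch, assuming psycopg2 and invented names (an `updated_at` column, a `search_sync_state` bookkeeping table, and an `index_document` callable for whatever search client you use):

      ```python
      import time
      import psycopg2

      def sync_search_index(conn, index_document, interval_s=60):
          """Poll the primary DB (the source of truth) and push changed rows to the search store."""
          while True:
              with conn, conn.cursor() as cur:
                  cur.execute("SELECT last_synced_at FROM search_sync_state WHERE id = 1 FOR UPDATE")
                  (last_synced_at,) = cur.fetchone()
                  cur.execute(
                      "SELECT id, name, description, updated_at FROM products "
                      "WHERE updated_at > %s ORDER BY updated_at",
                      (last_synced_at,),
                  )
                  rows = cur.fetchall()
                  for pid, name, description, updated_at in rows:
                      # re-indexing the same row twice is harmless, so a crash here just means a re-run
                      index_document(id=pid, body={"name": name, "description": description})
                  if rows:
                      cur.execute(
                          "UPDATE search_sync_state SET last_synced_at = %s WHERE id = 1",
                          (rows[-1][3],),
                      )
              time.sleep(interval_s)
      ```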

  • hggigg 5 hours ago

    I would add that just because you add these things, it does not mean you can scale afterwards. All microservices implementations I've seen so far are bolted on top of some existing layer of mud and serve only to take function calls that used to happen inside a process and run them over the network, with added latency and other overhead. The end result is that aggregate latency and cost go up with no actual scalability improvement.

    Various engineering leads, happy with what went on their resumes, leave and tell everyone how they increased scalability, and so persuade another generation into the same failures.

  • xyzzy_plugh 5 hours ago

    I frequently see folks fail to understand that when the unicorn rocketship spends a month and ten of its hundreds of engineers replacing the sharded MySQL that catches fire daily under overwhelming load, that is actually pretty close to the correct time for that work. Sure, it may have been stressful, and customers may have been impacted, but it's a good problem to have. Conversely, not having that problem maybe doesn't mean anything at all, but there's a good chance it means you were solving these scaling problems prematurely.

    It's a balancing act, but putting out the fires before they even begin is often the wrong approach. Often a little fire is good for growth.

    • AmericanChopper 5 hours ago

      You really have to be doing huge levels of throughput before you start to struggle with scaling MySQL or Postgres. There’s really not many workloads that actually require strict ACID guarantees _and_ produce that level of throughput. 10-20 years ago I was running hundreds to thousands of transactions per second on beefy Oracle and Postgres instances, and the workloads had to be especially big before we’d even consider any fancy scaling strategies to be necessary, and there wasn’t some magic tipping point where we’d decide that some instance had to go distributed all of a sudden.

      Most of the distributed architectures I’ve seen have been led by engineers’ needs (to do something popular or interesting) rather than an actual product need, and most of them have had issues relating to poor attempts to replicate ACID functionality. If you’re really at the scale where you’re going to benefit from a distributed architecture, the chances are eventual consistency will do just fine.

  • jrockway 2 hours ago

    I agree with this.

    Having worked at FAANG, I'm always excited by the fanciness that is required. Spanner is the best database I've ever used.

    That said, my feeling is that in the real world, you just pick Postgres and forget about it. Let's Encrypt issues every TLS cert on the Internet with one beefy Postgres database. Computers are HUGE these days. By the time a 128 core machine isn't good enough for your app, you will have sold your shares and will be living on your own private island or whatever. If you want to wrap sqlite in raft over the weekend for some fun, sure, do that. But don't put it in prod.

  • vbezhenar 3 hours ago

    I wonder if anyone has tried long-lived transactions passed between services?

    Like imagine you have a postgres pooler with API so you can use one postgres connection between several applications. Now you can start the transaction in one application, pass its ID to another application and commit it there.

    Implement queues using postgres, use the same transaction for both business data and queue operations and some things will become easier.
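
    The queues-in-Postgres part works today without any pooler tricks, since the enqueue can share a transaction with the business write. A rough sketch, assuming psycopg2 and invented table names:

    ```python
    import json
    import psycopg2

    def award_points(conn, user_id, points):
        """Business write and queue insert commit or roll back together."""
        with conn, conn.cursor() as cur:
            cur.execute("UPDATE users SET points = points + %s WHERE id = %s", (points, user_id))
            cur.execute(
                "INSERT INTO jobs (kind, payload) VALUES (%s, %s)",
                ("send_points_email", json.dumps({"user_id": user_id, "points": points})),
            )

    def work_one_job(conn, handlers):
        """Claim one job; SKIP LOCKED lets many workers poll the same table safely."""
        with conn, conn.cursor() as cur:
            cur.execute(
                "SELECT id, kind, payload FROM jobs WHERE done = false "
                "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED"
            )
            row = cur.fetchone()
            if row is None:
                return False
            job_id, kind, payload = row
            handlers[kind](json.loads(payload))  # if this raises, the claim rolls back and the job is retried
            cur.execute("UPDATE jobs SET done = true WHERE id = %s", (job_id,))
            return True
    ```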

  • devjab 15 minutes ago

    I disagree rather strongly with this advice, mostly because I’ve spent almost a decade earning rather lucrative money cleaning up after companies and organisations that followed it. Part of what you say is really good advice: if you’re not Facebook, don’t build your infrastructure as though you were. It’s always a good idea to remind yourself that Stack Overflow ran on a few IIS servers for a long while doing exactly what you’re recommending people do. (Well, almost anyway.)

    Using a single database always ends up being a mess. OK, I shouldn’t say always, because it’s technically possible for it not to happen; I’ve just never seen the OOP people not utterly fuck up the complexity in their models. It gets even worse when they’ve decided to use stored procedures or some magical ORM whose underlying workings not everyone understood. I think you should definitely separate your data as much as possible. Even small-scale companies will quickly struggle to scale their DBs if they don’t, and it’ll quickly become absolutely horrible if you have to remove parts of your business. Maybe they become unneeded, maybe they get sold off, whatever it is. With that said, I think you’re completely correct about not doing distributed transactions. Both you and the author are completely right that if you’re doing that, you’re building complexity you shouldn’t be building until you’re Facebook (or maybe almost Facebook).

    A good microservice is one that can live in total isolation. It’ll fulfill the need of a specific business domain, and it should contain all the data for that domain. If that leaves you with a monolith and a single shared database, that is perfectly fine. If you can split it up, though, do it. Say you have solar plants which are owned by companies, but as far as the business goes a solar plant and a company can operate completely independently; then you should absolutely build them as two services. If you don’t, you’re going to start building your mess once you need to add wind plants or something different. Do note that I said this depends on the business needs: if something like the individual banking accounts of company owners is relevant to the greenfield workers and asset managers, then you probably can’t split up solar plants and companies. Keeping things separate like this will also help you immensely as you add on business intelligence and analytics.

    If you keep everything in a single “model store”, you’re eventually going to end up with “oh, only John knows what that data does” while paying someone like me a ridiculous amount of money to help your IT department get to a point where they’re no longer hindering your company’s growth. Again, I’m sure it doesn’t have to be this way, and maybe I should just advise people to do exactly as you say. But in my experience it’s an imperfect world, and unless you keep things as simple as possible, with as few abstractions as possible, you’re going to end up with a mess.

  • MuffinFlavored 3 hours ago

    > For many applications, it's easiest to start with a modular monolith talking to a shared database, one that natively supports transactions.

    I don't think this handles "what if my app is a wrapper on external APIs and my own database".

    You don't get automatic rollbacks with API calls the way you do with database transactions. What do you do then?

    • natdempk 2 hours ago

      You have a distributed system on your hands at that point, so you need idempotent processes + reconciliation/eventual consistency. Basically: thinking a lot about failure, resyncing data/state, and patterns like transactional outboxes, durable queues, two-phase commit, etc. It quickly gets into the specifics of your task/system/APIs, so it's hard to give general advice. Most apps don't solve these problems super well for a long time unless they're in critical places like billing, and even then it might just mean weird repair jobs, manual APIs to resync stuff, audits, etc. Usually an event bus/queue or a dedicated DB table for the idempotent work, plus some async process validating that table, goes a long way though.
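
      One concrete shape of "a DB table for the idempotent work + an async process validating that table", sketched with psycopg2 and invented table/API names (the external `payment_api.refund` is assumed to deduplicate on an idempotency key, as Stripe-style APIs do):

      ```python
      import uuid
      import psycopg2

      def start_refund(conn, order_id, amount_cents):
          """Record the intended external call in the same transaction as the local state change."""
          key = str(uuid.uuid4())
          with conn, conn.cursor() as cur:
              cur.execute("UPDATE orders SET status = 'refund_pending' WHERE id = %s", (order_id,))
              cur.execute(
                  "INSERT INTO external_calls (idempotency_key, kind, order_id, amount_cents, state) "
                  "VALUES (%s, 'refund', %s, %s, 'pending')",
                  (key, order_id, amount_cents),
              )

      def reconcile_external_calls(conn, payment_api):
          """Async process: retry anything still pending; the idempotency key makes retries safe."""
          with conn, conn.cursor() as cur:
              cur.execute(
                  "SELECT idempotency_key, order_id, amount_cents FROM external_calls "
                  "WHERE kind = 'refund' AND state = 'pending' FOR UPDATE SKIP LOCKED"
              )
              for key, order_id, amount_cents in cur.fetchall():
                  payment_api.refund(order_id=order_id, amount_cents=amount_cents, idempotency_key=key)
                  cur.execute("UPDATE external_calls SET state = 'done' WHERE idempotency_key = %s", (key,))
                  cur.execute("UPDATE orders SET status = 'refunded' WHERE id = %s", (order_id,))
      ```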

  • gustavoamigo 3 hours ago

    I agree with starting with a monolith and a shared database; I’ve done that in the past quite successfully. I would just add that if scaling becomes an issue, I wouldn’t make sharding my first option, it’s more of a last resort. I would prefer scaling the shared database vertically and optimizing it as much as possible. Another strategy I’ve adopted is avoiding `JOIN` and `ORDER BY`, as they stress your database’s precious CPU and IO. `JOIN` also adds coupling between tables, which I find hard to refactor once done.

    • vbezhenar 3 hours ago

      I don't understand how you avoid JOIN and ORDER BY.

      Well, with ORDER BY, if your result set is not huge, sure, you can just sort it on the client side, although sorting 100 rows on the database side isn't expensive either. But if you need, say, the latest 500 records out of a million (a very frequent use case), you have to sort on the database side. Also, with proper indices the database can sometimes avoid any explicit sort.

      Do you just prefer to duplicate everything in every table instead of JOINing them? I've done some denormalization to improve performance, but that was more like the last thing I would do, if there's no other recourse, because it makes it very likely that the database will contain logically inconsistent data, and that causes lots of headaches. Fixing bugs in software is easier; fixing bugs in data is hard and requires lots of analysis and sometimes manual work.

      • danenania a few seconds ago

        I think a better maxim would be to never have an un-indexed ORDER BY or JOIN.

        A big part of what many "nosql" databases that prioritize scale are doing is simply preventing you from ever running an ad-hoc un-indexed query.
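
        To make the "latest 500 out of a million" case from above concrete, the difference is whether an index covers the filter and the sort. A sketch with psycopg2 and invented table/index names:

        ```python
        import psycopg2

        # One-time: back the common "latest N for this user" query with a matching index.
        SETUP = "CREATE INDEX IF NOT EXISTS idx_events_user_created ON events (user_id, created_at DESC)"

        def latest_events(conn, user_id, limit=500):
            """With the index above, Postgres walks the index in order and stops after `limit` rows;
            without it, it has to sort every matching row before it can return the top 500."""
            with conn, conn.cursor() as cur:
                cur.execute(
                    "SELECT id, kind, created_at FROM events "
                    "WHERE user_id = %s ORDER BY created_at DESC LIMIT %s",
                    (user_id, limit),
                )
                return cur.fetchall()
        ```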

alphazard 3 hours ago

The best advice (organizationally) is to just do everything in a single transaction on top of Postgres or MySQL for as long as possible. This produces no cognitive overhead for the developers.
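
For the happy path that really is as small as it sounds; a minimal sketch using the points-and-discount example that comes up elsewhere in the thread (psycopg2, invented table names):

```python
import psycopg2

def redeem_points_for_discount(conn, user_id, order_id, points, discount_cents):
    """Both updates commit together or not at all: no outbox, no saga, no repair job."""
    with conn:  # psycopg2: commit on success, rollback on any exception
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE users SET points = points - %s WHERE id = %s AND points >= %s",
                (points, user_id, points),
            )
            if cur.rowcount != 1:
                raise ValueError("not enough points")  # triggers rollback
            cur.execute(
                "UPDATE orders SET discount_cents = discount_cents + %s WHERE id = %s",
                (discount_cents, order_id),
            )
```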

Sometimes that doesn't deliver enough performance and you need to involve another datastore (or the same datastore across multiple transactions). At that point eventual consistency is a good strategy, much less complicated than distributed transactions. It adds a significant tax to all of your work though. Now everyone has to think through all the intermediate states, and additionally design a background process to drive the eventual consistency. Do you have a process in place to ensure all your developers are getting this right for every feature? Did you answer "code review"? Are you sure there's always enough time to re-do the implementation, and that you'll never be forced to merge inconsistent "good enough" behavior?

And the worst option (organizationally) is distributed transactions, which basically means a small group of talented engineers can't work on other things: they need to be consulted for every new service and most new features, and they have to maintain the clients and server for the thumbs-up/thumbs-down system.

If you make it hard to do stuff, then people will either 1. do less stuff, or 2. do the same amount of stuff, but badly.

Scubabear68 24 minutes ago

If I had a nickel for every client I’ve seen with microservices everywhere, where 90% of the code is replicating an RDBMS with hand-coded in-memory joins...

What could have been a simple SQL query in a sane architecture becomes N REST calls (possibly nested with others downstream) and manually stitching together results.
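
To illustrate the contrast (endpoints and tables invented for the sake of the example): the first version is one query the database can plan and index; the second is two network calls plus a hand-rolled hash join.

```python
import psycopg2
import requests

def orders_with_user_names_sql(conn):
    """Monolith + shared DB: the database does the join."""
    with conn, conn.cursor() as cur:
        cur.execute("SELECT o.id, o.total, u.name FROM orders o JOIN users u ON u.id = o.user_id")
        return cur.fetchall()

def orders_with_user_names_microservices():
    """Two services: fetch both datasets over HTTP and re-implement the join in memory."""
    orders = requests.get("https://orders.internal/api/orders").json()
    users = requests.get("https://users.internal/api/users").json()
    users_by_id = {u["id"]: u for u in users}  # the hand-coded in-memory join
    return [(o["id"], o["total"], users_by_id[o["user_id"]]["name"]) for o in orders]
```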

And that is just the read-only case. As the author notes, updates add another couple of levels of horror.

pjmlp 5 hours ago

As usual, don't try to use the network boundary to do what modules already offer in most languages.

Distributed systems spaghetti is much worse to deal with.

kgeist an hour ago

The main source of pain with eventual consistency is lots of customer calls/emails saying "we did X but nothing happened". I'd also add that you should make it clear to the user that the action may not be instantaneous.

relistan 7 hours ago

This is a good summary of building an evented system. Having built and run one that scaled up to 130 services and nearly 60 engineers, I can say this solves a lot of problems. Our implementation was a bit different but in a similar vein. When the company didn’t do well in the market and scaled down, 9 engineers were (are) able to operate almost all of that same system. The decoupling and lack of synchronous dependencies means failures are fairly contained and easy to rectify. Replays can fix almost anything after the fact. Scheduled daily replays prevent drift across the system, helping guarantee that things are consistent… eventually.

liampulles 7 hours ago

This does a good job of encapsulating the considerations of an event driven system.

I think the author is a little too quick to dismiss sagas though - for starters, an event-driven system is still going to need to deal with compensating actions; it's just going to result in a larger set of events that various systems potentially need to handle.

The virtue of a saga or some command-driven orchestration approach is that the story of what happens when a user does X is plainly visible. The ability to dictate that upfront and figure it out easily later when diagnosing issues cannot be overstated.

junto 2 hours ago

This is giving me bad memories of MSDTC and Microsoft SQL Server here.

physicsguy 6 hours ago

> And if your product backlog is full and people who designed the microservices are still around, it’s unlikely to happen.

Oh man I feel this

cletus 4 hours ago

To paraphrase [1]:

> Some people, when confronted with a problem, think “I know, I'll use micro-services.” Now they have two problems.

As soon as I read this example where there's users and orders microservices, you've already made an error (IMHO). What happens when the traffic becomes such an issue that you need to shard your microservices? Now you've got session and load-balancing issues. If you ignore them, you may break the read-your-write guarantee and that's going to create a huge cost to development.

It goes like this: can you read your own uncommitted changes within your transaction or request? Generally the answer should be "yes". But imagine you need to speak to a sharded service: what happens when you hit an instance that didn't perform the mutation, which isn't committed yet?

A sharded data backend will take you as far as you need to go. If it's good enough for Facebook, it's good enough for you.

When I worked at FB, I had a project where someone had come in from Netflix and they fell into the trap many people do of trying to reinvent Netflix architecture at Facebook. Even if the Netflix microservices architecture is an objectively good idea (which I honestly have no opinion on, other than having personally never seen a good solution with microservices), that ship has sailed. FB has embraced a different architecture, so even if it's objectively good, you're going against established practice and changing what any FB SWE is going to expect when they come across your system.

FB has a write through in-memory graph database (called TAO) that writes to sharded MySQL backends. You almost never speak to MySQL directly. You don't even really talk to TAO directly most of the time. There's a data modelling framework on top of it (that enforces privacy and a lot of other things; talk to TAO directly and you'll have a lot of explaining to do). Anyway, TAO makes the read-your-write promise and the proposed microservices broke that. This was pointed out from the very beginning, yet they barreled on through.

I can understand putting video encoding into a "service" but I tend to view those as "workers" more than a "service".

[1]: https://regex.info/blog/2006-09-15/247

  • codethief 21 minutes ago

    > Even if the Netflix microservices architecture is an objectively good idea (which I honestly have no opinion on

    I have no opinion on that either, but at least this[0] story by ThePrimeagen didn't make it sound all too great. (Watch this classic[1] before for context, unless you already know Wingman, Galactus, etc.)

    [0]: https://youtu.be/s-vJcOfrvi0?t=319

    [1]: https://m.youtube.com/watch?v=y8OnoxKotPQ

kunley 5 hours ago

this is smart, but also: the overall design was so overengineered in the first place...

atombender 5 hours ago

I find the "forwarder" system here a rather awkward way to bridge the database and Pub/Sub system.

A better way to do this, I think, is to ignore the term "transaction," which is overloaded with too many concepts (such as transactional isolation), and instead to consider the desired behaviour, namely atomicity: You want two updates to happen together, and (1) if one or both fail you want to retry until they are both successful, and (2) if the two updates cannot both be successfully applied within a certain time limit, they should both be undone, or at least flagged for manual intervention.

A solution to both (1) and (2) is to bundle both updates into a single action that you retry. You can execute this with a queue-based system. You don't need an outbox for this, because you don't need to create a "bridge" between the database and the following update. Just use Pub/Sub or whatever to enqueue an "update user and apply discount" action. Using acks and nacks, the Pub/Sub worker system can ensure the action is repeatedly retried until both updates complete as a whole.
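
A rough sketch of that "bundle both updates into one retried action" shape, with a generic ack/nack queue client (the queue interface and the two update calls here are placeholders, not any particular library's API):

```python
import json

def handle_apply_discount(message, db, queue):
    """One action that performs both updates; a nack means redeliver, so the whole action
    retries until both succeed. Each update must be idempotent, because a crash after the
    first one means the action runs again from the top."""
    payload = json.loads(message.body)
    try:
        db.add_points(payload["user_id"], payload["points"])         # idempotent per (user_id, order_id)
        db.apply_discount(payload["order_id"], payload["discount"])  # idempotent per order_id
    except Exception:
        queue.nack(message, requeue=True)  # try the whole action again later
        return
    queue.ack(message)                     # both updates applied; the action is done
```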

You can build this from basic components like Redis yourself, or you can use a system meant for this type of execution, such as Temporal.

To achieve (2), you extend the action's execution with knowledge about whether it should retry or undo its work. For such a simple action as described above, "undo" means taking away the discount and removing the user points, which are just the opposite of the normal action. A durable execution system such as Temporal can help you do that, too. You simply decide, on error, whether to return a "please retry" error, or roll back the previous steps and return a "permanent failure, don't retry" error.

To tie this together with an HTTP API that pretends to be synchronous, have the API handler enqueue the task, then wait for its completion. The completion can be a separate queue keyed by a unique ID, so each API request filters on just that completion event. If you're using Redis, you could create a separate Pub/Sub per request. With Temporal, it's simpler: The API handler just starts a workflow and asks for its result, which is a poll operation.

The outbox pattern is better in cases where you simply want to bridge between two data processing systems, but where the consumers aren't known. For example, you want all orders to create a Kafka message. The outbox ensures all database changes are eventually guaranteed to land in Kafka, but doesn't know anything about what happens next in Kafka land, which could be stuff that is managed by a different team within the same company, or stuff related to a completely different part of the app, like billing or ops telemetry. But if your app already knows specifically what should happen (because it's a single app with a known data model), the outbox pattern is unnecessary, I think.

  • wavemode 4 hours ago

    I think attempting to automatically "undo" partially failed distributed operations is fraught with danger.

    1. It's not really safe, since another system could observe the updated data B (and perhaps act on that information) before you manage to roll it back to A.

    2. It's not really reliable, since the sort of failures that prevent you from completing the operation, could also prevent you from rolling back parts of it.

    The best way to think about it is that if a distributed operation fails, your system is now in an indeterminate state and all bets are off. So if you really must coordinate updates within two or more separate systems, it's best if either

    a) The operation is designed so that nothing really happens until the whole operation is done. One example of this pattern is git (and, by extension, the GitHub API). To commit to a branch, you have to create a new blob, associate that blob to a tree, create a new commit based on that tree, then move the branch tip to point to the new commit. As you can see, this series of operations is perfectly fine to do in an eventually-consistent manner, since a partial failure just leaves some orphan blobs lying around, and doesn't actually affect the branch (since updating the branch itself is the last step, and is atomic). You can imagine applying this same sort of pattern to problems like ordering or billing, where the last step is to update the order or update the invoice. (A sketch of this pattern follows after point b.)

    b) The alternative is, as you say, flag for manual intervention. Most systems in the world operate at a scale where this is perfectly feasible, and so sometimes it just makes the most sense (compared to trying to achieve perfect automated correctness).
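
    A sketch of pattern (a) above outside of git: build everything as inert rows first, and make the very last step a single atomic "pointer" update, so readers only ever see nothing or the finished whole (psycopg2, invented tables):

    ```python
    import psycopg2

    def publish_price_list(conn, prices):
        """Each step commits on its own (as separate calls across services would); nothing is
        visible to readers until the final single-row update, like moving a branch tip."""
        with conn, conn.cursor() as cur:  # step 1: create an inert draft version
            cur.execute("INSERT INTO price_list_versions (state) VALUES ('draft') RETURNING id")
            version_id = cur.fetchone()[0]
        for sku, cents in prices.items():  # steps 2..n: attach rows nobody reads yet
            with conn, conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO prices (version_id, sku, cents) VALUES (%s, %s, %s)",
                    (version_id, sku, cents),
                )
        with conn, conn.cursor() as cur:  # last step: one atomic pointer flip
            cur.execute("UPDATE current_price_list SET version_id = %s WHERE id = 1", (version_id,))
        # a crash before the last step leaves only orphan draft rows, never a half-published list
    ```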

    • atombender 4 hours ago

      Undo may not always be possible or appropriate, but you do need to consider the edge case where an action cannot be applied fully, in which case a decision must be made about what to do. In OP's example, failure to roll back would grant the user points but no discount, which isn't a nice outcome.

      Trying to achieve a "commit point" where changes become visible only at the end of all the updates is worth considering, but it's potentially much more complex to achieve. Your entire data model (such as database tables, including indexes) has to be adapted to support the kind of "state swap" you need.

    • kgeist an hour ago

      >I think attempting to automatically "undo" partially failed distributed operations is fraught with danger.

      We once tried it, and decided against it because

      1) in practice rollbacks were rarely properly tested and were full of bugs

      2) we had a few incidents when a rollback overwrote everything with stale data

      Manual intervention is probably the safest way.