Speaking of the marvels of Internet Commerce...

I am continually amazed that banks still don't fail safe, and apparently haven't learned about SYN/ACK. I mean, I know it's a recent invention, the three-way handshake is only the single most fundamental concept that makes TCP/IP work.

One of the amazing things about the design of the payment processing system is that it's easy to get into a situation where A) you know the customer's card is good; B) you've tried to charge them; and C) something has gone wrong and you don't know whether you've taken their money or not.

This happens to us every couple of weeks:

  1. Customer places an order.
  2. Us to bank: "Can I charge $30?"
  3. Bank to us: "Yes".
  4. Us to bank: "Ok, do it."
  5. ...radio silence.

If response #3 never comes, we get to show the customer an error, and that's fine. But if response #5 times out and never comes back, which happens regularly, we have no idea whether the failure was that step #4 was not received, or that step #4 was received but reply #5 was lost. Sometimes it's one, sometimes it's the other. In the former case, the customer wasn't charged, and in the latter case they were! We have to fix these by hand, and there's no easy way to automate it. It sucks.

If banks understood TCP/IP, it would go like this:

  1. Customer places an order.
  2. SYN: Can I charge $30?
  3. SYN/ACK: Yes.
  4. ACK + SYN: Do it.
  5. SYN/ACK: I am gonna do it.
  6. ACK: I see that you're gonna do it.

If that were their model, then at no point would a communication failure leave a charge in an ambiguous state. If I never get the message in #5, the customer is not charged. If I get the message in #5 and my response in #6 is not received, the customer is not charged.

If #6 is sent but not received, then I would think the customer was charged when they were not, but the converse can't happen. There's only one possible failure mode and not two, and that failure mode is the safer one.
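
To make the comparison concrete, here is a toy enumeration of both flows. It is only a sketch under stated assumptions: the bank charges the moment it receives step 4 in the current flow, and only when it receives ACK #6 in the handshake flow. The names `Outcome`, `current_flow`, `handshake_flow` and `mismatches` are made up for illustration and are not part of any real gateway API.

```python
from typing import NamedTuple, Optional


class Outcome(NamedTuple):
    merchant_thinks_charged: Optional[bool]  # None = merchant cannot tell
    actually_charged: bool


def current_flow(lost_step: Optional[int]) -> Outcome:
    """Existing flow (assumption: bank charges on receiving step 4)."""
    if lost_step == 4:   # "Ok, do it" never arrives
        return Outcome(None, False)
    if lost_step == 5:   # bank charged, but its reply is lost
        return Outcome(None, True)
    return Outcome(True, True)


def handshake_flow(lost_step: Optional[int]) -> Outcome:
    """Proposed flow (assumption: bank charges only on receiving ACK #6)."""
    if lost_step in (4, 5):  # merchant never sees #5, so it never sends #6
        return Outcome(False, False)
    if lost_step == 6:       # merchant sent #6 and assumes success
        return Outcome(True, False)
    return Outcome(True, True)


def mismatches(flow, steps):
    """Distinct ways the merchant's belief and reality can disagree."""
    outcomes = (flow(step) for step in steps)
    return {o for o in outcomes
            if o.merchant_thinks_charged != o.actually_charged}


if __name__ == "__main__":
    print("current:  ", mismatches(current_flow, [4, 5]))
    print("handshake:", mismatches(handshake_flow, [4, 5, 6]))
```

In this model the current flow has two distinct bad outcomes that look identical from the merchant's side, while the handshake flow has exactly one: the merchant believes the charge happened and it did not.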

This is orthogonal to the complete flying clusterfuck that is AVS, unfortunately, where they put a hold on the money before validating the billing address, and then if that didn't match, often fail to release that hold. Double-you tee fuck.



13 Responses:

  1. Sam Kington says:

    I used to work with credit cards in the late 1990s, and the underlying protocols were all based on hideous COBOL-type fixed-width data structures, sent over (IIRC) X.25. So quite possibly, at the time, the answer would have been "No, the banks haven't heard of TCP/IP".

    Out of interest, what happens if you repeat your original step 4? I don't think it's possible to settle an authorisation twice, so ideally you'd get back either "OK, settling that" or "You did that already". But knowing banks and payment gateways, I wouldn't be surprised if you got back something unhelpful and/or unreliable.

    • jwz says:

      I'm not sure. But, when it's in this state, what's often going on is the reply took several minutes to return, instead of the usual ~4 seconds. So it might be that if we tried again, we'd still not get a real response for N minutes, in which case the user would probably have given up on waiting for the page to load anyway.

      At least, in the "customer was charged but we don't know" state. I'm less sure of what the hell is going on in the "customer was not charged and we don't know" state.

      • Aaron says:

        I solved that once, years ago. It involved an iframe containing the (Paypal) payment form, configured to redirect to a local PHP script that interpreted the response and barfed a glob of JavaScript to tell the containing page whether to render success or failure. That, plus a spinner and a suitably stern message about the consequences of doing anything before a result showed up, was enough to produce a reasonably non-shit UX in spite of everything the payment processor could do to fuck it up.

        Web dev is hell. You have my sympathies.

    • Julian Calaby says:

      They still use a horrible fixed-width format like that in Australia for specifying bank transfers.

      I wrote some software that had to produce files like this a couple of years ago and dealing with those files is a nightmare as every bank has interpreted the standard slightly differently.

      These files were uploaded by the client into their business online banking site, not directly transmitted to the bank, thank god. We used the transactions being listed in the account's statements (downloaded as CSV files) as feedback to determine if they had actually happened.

  2. Nick Lamb says:

    The Two Generals Problem (reliable co-ordination at a distance) does not have a solution. The TCP handshake isn't a solution because, as I said, there isn't one. Adding more steps (one step, or six, or sixty) doesn't fix anything. Right now, today, you could choose to assume that the customer's card was successfully charged as soon as you request this. Exactly the same as in your hypothetical "improved" system, and sometimes you would unknowingly be wrong, exactly the same again.

    What's actually different about your day-by-day experience of TCP is not the protocol, it's the quality of implementation. Almost all of your TCP/IP connections work, whereas apparently many of your credit card transactions fail.

    What you need isn't for your bank (really? it's weird to have a bank take this role, but OK) to "learn about TCP" it's for them to deliver a decent quality of service. But you're in the US, so good luck with that.

    • jwz says:

      You might as well have just said that since you can implement TCP in terms of UDP, TCP doesn't fix anything.

      My solution does not solve the coordination problem; obviously I am aware of that, since I called out the way in which it fails. What it does solve is that right now there are two possible error states, and we get them at random. Having one error state is better than two, especially since one of those error states is much less irritating than the other.

  3. Ewen McNeill says:

    The irony is that due to the two phase processing (auth/settle) -- and lots of internal bank systems -- it's clear they do know about two phase commit. But the problem with credit card processing seems to be a combination of being designed in an environment of reliable communications (X.25 over dedicated lines) and intermediaries who... are not incentivised to make the problem any better.

    It looks like they really need three phases:
    1. Auth (pre-validate you can actually do this, will be auto-cancelled in N hours, returns auth_id to use in commit)

    2. Commit (confirm that you are going ahead with auth_id, will be auto-settled in N days, returns commit_id to use in settlement)

    3. Settle (actually transfer money, perhaps in a large batch, as now)

    If the auth succeeds you have a right to money which will be auto-cancelled if you don't proceed (send commit) within N hours; if the commit succeeds you have a right to money which will auto-proceed if you don't cancel it (presumably with some sort of one-off non-trivial-fee settlement if you don't settle it earlier). That plus idempotent auth, commit, settle (so that if the remote end has already done it, it repeats the ack back at you with a "as I told you before..." flag).
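
    The idempotent part can be sketched roughly as below. This is an in-memory toy, not a real gateway: `ToyGateway`, `Reply` and the merchant-supplied `merchant_key` are hypothetical (the key is an assumption added so that a repeated auth can be recognised), and the auto-cancel and auto-settle timers are left out entirely.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Reply:
    ref: str        # auth_id or commit_id
    repeated: bool  # True means "as I told you before..."


class ToyGateway:
    """Hypothetical in-memory gateway with idempotent auth and commit."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._auth_by_key = {}     # merchant_key -> auth_id
        self._amount_by_auth = {}  # auth_id -> amount in cents
        self._commit_by_auth = {}  # auth_id -> commit_id

    def auth(self, merchant_key, cents):
        # A repeated auth for the same key returns the original auth_id.
        if merchant_key in self._auth_by_key:
            return Reply(self._auth_by_key[merchant_key], repeated=True)
        auth_id = f"auth-{next(self._counter)}"
        self._auth_by_key[merchant_key] = auth_id
        self._amount_by_auth[auth_id] = cents
        return Reply(auth_id, repeated=False)

    def commit(self, auth_id):
        if auth_id not in self._amount_by_auth:
            raise KeyError(f"unknown auth_id: {auth_id}")
        # A repeated commit returns the original commit_id, flagged as such.
        if auth_id in self._commit_by_auth:
            return Reply(self._commit_by_auth[auth_id], repeated=True)
        commit_id = f"commit-{next(self._counter)}"
        self._commit_by_auth[auth_id] = commit_id
        return Reply(commit_id, repeated=False)
```

    With this shape, a merchant whose commit timed out can simply send it again: either it gets a fresh commit_id or the same one back with `repeated=True`, and in both cases exactly one commit exists.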

    Unfortunately due to the Two Generals Problem in the case of complete communication breakdown after auth, you have to make a choice whether you try to go ahead (keep hammering on commit) or rollback (start hammering on cancel). But at least if you re-establish communication before the auto-settle period you do get to roll it back, so choosing "okay, we give up" should be safe in a user-waiting-on-HTTPS-response time period (worst case you do the occasional credit back onto the cards of the ones which committed but you couldn't cancel before auto-settle). And the auth should expire in some hours so there'll be fewer user complaints of "OMG, you stole my money". (Or you can choose to gamble that you'll get the commit through before the auth expires, and go ahead with the purchase at that point. But that way lies dragons.)


    PS: IIRC the official credit card processing line on a bunch of these communication errors is "get a status report and try to figure out where you got up to". Which isn't always achievable in the time frame of "impatient user waiting for an answer" :-(

  4. Gordon says:

    If you act as if 4 in your original setup always works and ignore 5, then aren't your failure modes exactly the same? Specifically, you send a message to the bank and probably they take the money but maybe not. The only way I see you getting the second failure mode is if you interpret the non-receipt of 5 to mean payment not taken. What am I missing?

    What we do is at 3 we get a reference number from the gateway. We can then go back and query the state of 3 at any point. Since the hold already exists at that point we actually finish the customer request immediately and give them their purchase. Then we do 4 & 5 in background processing where we are free to keep kicking it until the thing reports the correct state. And finally for good measure we have another background process that finds any IDs on the gateway side that we don't recognise (step 3 failures) and cleans them up as well.
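
    A rough sketch of that "keep kicking it" loop follows. Everything here is hypothetical: `FlakyGateway` is a fake that drops messages according to a script, and `capture`, `status` and `settle_in_background` are illustrative names, not a real gateway's API. It also assumes the status query itself always gets through, which a real job could not count on.

```python
class FlakyGateway:
    """Fake gateway whose capture call fails according to a script."""

    def __init__(self, failures):
        # Each entry is "lost_request" or "lost_reply"; consumed in order.
        self._failures = list(failures)
        self._state = {}
        self.capture_calls = 0

    def hold(self, ref):
        self._state[ref] = "held"  # step 3: the hold exists

    def capture(self, ref):
        self.capture_calls += 1
        failure = self._failures.pop(0) if self._failures else None
        if failure == "lost_request":
            raise TimeoutError("request never reached the gateway")
        self._state[ref] = "captured"  # capturing twice changes nothing
        if failure == "lost_reply":
            raise TimeoutError("gateway acted, reply was lost")

    def status(self, ref):
        return self._state[ref]


def settle_in_background(gateway, ref, max_attempts=5):
    """Retry capture until the gateway itself reports the captured state."""
    for _ in range(max_attempts):
        try:
            gateway.capture(ref)
        except TimeoutError:
            pass  # cannot tell which half failed, so ask the gateway
        if gateway.status(ref) == "captured":
            return True
    return False
```

    The point of querying by reference is that the loop never has to guess which side of the timeout it landed on; it trusts only what the gateway reports.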

  5. Pascal Bourguignon says:

    Banks are not there to do things right.
    Banks are there to shear us.

    Those features are very successful at shearing us more (either the buyer or the seller, possibly both, with added charges for rectifications, etc).

    • Erbo says:

      Exactly. You're thinking in terms of what is most safe and efficient. The bank, however, is thinking more in terms of what's most profitable for the bank. Which turns into bigger bonuses for the bankster fraudsters, which then gets turned into their fourth vacation homes, six-martini lunches, payments to high-class hookers, and all the cocaine they stuff up their noses.

  6. Mark Beeson says:

    The good old bank network timeout problem. The great thing is that there are three different scenarios where this winds up happening:

    1. your connection to your processor has gone into the ether.

    2. your processor's connection to the acquiring bank has gone into the ether.

    3. the response from the customer's bank has gone into the ether.

    You could treat these differently (because there are different symptoms of each problem) but the generally-accepted sledgehammer method of dealing with all three of these is this: perform your charge in a separate thread from your web app thread (that way the customer aborting the request doesn't hose you). After a set timeout (we use 30 seconds, because anything longer than that you've already lost the customer anyways and they're not going to be around to see the response) immediately issue a reversal for the charge. You'll get a response that says "okay, we successfully reversed the charge" in which case scenario 3 is what happened. Alternatively you'll get a response "we don't know about that charge" which is either scenario 1 or scenario 2. Either of these cases is fine, because the customer doesn't have a charge against their card.

    You can then either immediately issue the charge again or just return a "try again" page to the customer-- it's safer to do the latter, but a better customer experience to do the former. At least 90% of your charges should be resolved within 4 seconds so if you try an immediate re-charge you'll make sure the customer gets their order placed successfully. (note that sometimes you'll get a duplicate error on your retry-- this is great, that means the original charge went through okay and you can just return a success to the customer)
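
    The reverse-on-timeout logic, stripped down, might look like the sketch below. `FakeProcessor` and `charge_or_reverse` are hypothetical names, the timeout is simulated by raising `TimeoutError` rather than by waiting 30 seconds, and the separate worker thread is left out to keep the example deterministic. It also assumes the reversal itself gets an answer.

```python
class FakeProcessor:
    """Fake processor that simulates one of three situations."""

    def __init__(self, scenario):
        self.scenario = scenario  # "ok", "request_lost" or "reply_lost"
        self.charges = {}

    def charge(self, order_id, cents):
        if self.scenario == "request_lost":
            # Scenario 1 or 2: the charge never reached the bank.
            raise TimeoutError("no reply within the timeout")
        self.charges[order_id] = cents
        if self.scenario == "reply_lost":
            # Scenario 3: the charge happened, the response vanished.
            raise TimeoutError("no reply within the timeout")
        return "approved"

    def reverse(self, order_id):
        if order_id in self.charges:
            del self.charges[order_id]
            return "reversed"
        return "unknown_charge"


def charge_or_reverse(processor, order_id, cents):
    """Charge; on timeout, reverse so the card ends up with no charge."""
    try:
        processor.charge(order_id, cents)
        return "charged"
    except TimeoutError:
        outcome = processor.reverse(order_id)
        if outcome == "reversed":
            return "reversed_after_timeout"  # scenario 3
        return "never_charged"               # scenario 1 or 2
```

    Either timeout branch leaves the card uncharged, so the caller can safely retry or show a "try again" page.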

  7. Jon says:

    So, in the end you still don't know if the customer was charged.

    You are missing sequence numbers, timers and retransmissions to complete the story. If #6 was not received, #5 could simply be retransmitted after timeout and you'd notice by the sequence number and you could resend #6.
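
    As a minimal sketch of that idea: the bank keeps retransmitting #5 until it sees #6, and the merchant uses the sequence number to recognise a duplicate #5 and just re-sends #6. The names `Merchant` and `bank_send_until_acked` and the scripted `drop_plan` are all made up for illustration; timers are reduced to "try again on the next loop iteration".

```python
class Merchant:
    """Receives #5 from the bank and answers with #6 (the ack)."""

    def __init__(self):
        self.seen = {}         # sequence number -> message text
        self.times_acted = 0   # how often a message was treated as new

    def on_message(self, seq, text):
        if seq not in self.seen:   # first copy: record it and act on it
            self.seen[seq] = text
            self.times_acted += 1
        return seq                 # duplicates are only re-acknowledged


def bank_send_until_acked(merchant, seq, text, drop_plan, max_tries=5):
    """Retransmit until an ack arrives; drop_plan scripts the losses."""
    for attempt in range(max_tries):
        fate = drop_plan[attempt] if attempt < len(drop_plan) else "ok"
        if fate == "drop_message":
            continue               # #5 lost: timer expires, send again
        ack = merchant.on_message(seq, text)
        if fate == "drop_ack":
            continue               # #6 lost: timer expires, send again
        return ack
    return None                    # gave up without ever seeing an ack
```

    For example, `bank_send_until_acked(Merchant(), 5, "I am gonna do it", ["drop_ack", "drop_message"])` needs three transmissions, but the merchant acts on the message only once.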