Every time I do a new blog post, within a second I have over a thousand simultaneous hits of that URL on my web server from unique IPs. Load goes over 100, and mariadb stops responding.
The server is basically unusable for 30 to 60 seconds until the stampede of Mastodons slows down.
Presumably each of those IPs is an instance, none of which share any caching infrastructure with each other, and this problem is going to scale with my number of followers (followers' instances).
This system is not a good system.
Update: Blocking the Mastodon user agent is a workaround for the DDoS. "(Mastodon|http\.rb)/". The side effect is that people on Mastodon who see links to my posts no longer get link previews, just the URL.
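For anyone wanting to replicate that workaround, here is a minimal Apache 2.4 sketch using the exact user-agent pattern quoted above (the directive placement and vhost layout are assumptions; adapt to your config):

```apache
# Deny requests whose User-Agent matches Mastodon's preview fetcher.
# Side effect, as noted: Mastodon users see a bare URL with no preview card.
<If "%{HTTP_USER_AGENT} =~ m#(Mastodon|http\.rb)/#">
    Require all denied
</If>
```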
yeah, I often get your 503 page if I click the link when it's been recently posted.
MariaDB/MySQL cope so badly under high load it's insane. Maybe having some sort of "staticizing" mechanism to snapshot the dynamic content and then serve it through nginx with some fine tuning would help? (compression? connection reuse? cache-instructing headers?)
Again, I don't really need you leaning over my desk and saying "You know what you OUGHTA do", thanks.
Sorry for adding stress to the situation, just checked the various other replies reaching my instance and I've been yet-another-one adding to the pile of workarounds to an issue that should be tackled on the instances' side too :blobcatsad:
You may find Kris's write-up interesting. TL;DR: Mastodon has huge traffic overheads.
User engagement is such a curse. But seriously, better caching might help?
On the Mastodon side, since instances don’t share cache (they can’t, it’s not centralized), the best thing they could do is schedule the job to fetch data about a URL with a small random amount of delay.
On jwz’s side, request collapsing, rate limiting, or caching would solve this problem. Rate limiting is probably the easiest, because then the randomized backoff algorithms will take effect and delay appropriately.
One could run a transparent proxy and share it among Mastodon instances, but this has abuse potential all around.
sounds like federation needs some sort of method for propagating post information, and allowing federated servers to share fetched data with other servers, instead of hitting the source.
can I preach the gospel of WebSub (formerly PubSubHubbub) for this
Avoid flash floods by relying on intermediate copies? Sounds a lot like what the flooding protocol of usenet did. Doing that for ActivityPub will require some design work, but sounds doable given that each propagated object has a unique ID.
Content-addressed data solves this. Build a protocol on top of ipfs and this is not a problem. Supply automatically scales with demand.
I have to learn quite a bit more about ipfs, freenet, i2p and other things like that. Those technologies are definitely going to be a part of the future of all this.
The key idea is that when you ask for data by its (cryptographic) hash then anyone who has it can give it to you, and it's impossible for them to tamper with it. If anyone who retrieves a block of data also serves it up, then supply grows with demand, and you avoid DDOSing the original provider of the data.
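A minimal sketch of that verification step, with plain SHA-256 standing in for whatever multihash a real protocol would use (names are illustrative, not any real IPFS API):

```python
import hashlib

def content_address(data: bytes) -> str:
    """In a content-addressed store, the address *is* the hash of the bytes."""
    return hashlib.sha256(data).hexdigest()

def verify_block(data: bytes, address: str) -> bool:
    """Any peer can serve the bytes; the requester re-hashes them, so a
    tampering relay is caught no matter who served the block."""
    return content_address(data) == address

block = b"some immutable post content"
addr = content_address(block)
assert verify_block(block, addr)          # honest copy verifies
assert not verify_block(block + b"!", addr)  # any modification is detected
```

This is why "anyone who retrieves a block also serves it" is safe: trust attaches to the hash, not to the peer.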
Relays seem to be playing part of that role? https://holger.one/posts/migrate-from-pub-relay-to-activity-relay-en/
that would require the data to be signed.
The only applicable standards I know of that wouldn't require a bunch of requests to the server are reusing the TLS cert key to sign this metadata (bad idea) or using DNSSEC+DANE to list a signing keypair in a way where a secure lookup is possible (viable, but complex and not widely supported).
also, this seems like a bit of a "humble brag" ooooh, look at me, I'm @jwz and I have so many followers my posts DDoS me... :P (meant in good fun, not to be mean)
This is precisely why WebSub (aka: PubSubHubbub) was created. Polling sucks. Blog updates should be pushed, not polled.
We've been here before.
why does it only use symmetric HMAC tags? A node which fetches HMAC tagged data from the original server can only verify it locally, it can't prove authenticity to other nodes which it may relay the cached data to, they instead have to trust the relaying node.
This is absolutely not a good system, but also, it seems a little weird that every pageview of your blog involves the database...
This shit right here is how you get blocked.
yes! this seems built into the spec! and there's a notion of a `shared inbox` but it's not clear that it's useful for fixing this
everyone posts graphs about the "number of mastodon users" but it'd be interesting to see graphs of "the number of cross-instance follows" over time
the shared inbox is for a single instance.
The issue is that when jwz writes something, the post is propagated across the network. Each instance will fetch OpenGraph / other metadata from the links in a post.
ActivityPub is RDF and thus this metadata _could_ be looked up by the source server and transmitted once… but you have to trust the source server.
the naive approach is simply for every instance to do the lookup themselves. This is what Mastodon does and is the issue. But it does this with no delay and because the network is pretty fast, it results in a spike.
Some are suggesting to use WebSub/PubSubHubbub/etc. but it’s not really the solution - the posts are already being pushed and not polled. It’s the OpenGraph metadata that’s not pushed.
So given that, I’d say that Mastodon should do the easy first step and schedule these jobs to run randomly 500-5000ms in the future rather than ASAP (this may not be possible in vanilla Sidekiq but it is definitely possible). Then consider passing along OpenGraph metadata in the <a/> so lookups don’t always need to occur.
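As a rough illustration of why that jitter helps — a toy simulation, not Mastodon code:

```python
import random

def peak_arrivals(n_servers, lo_s=0.5, hi_s=5.0, bucket_s=0.5, seed=1):
    """Toy model: each of n_servers delays its preview fetch by a uniform
    random amount in [lo_s, hi_s) seconds; return the arrival count in the
    busiest half-second bucket."""
    rng = random.Random(seed)
    buckets = {}
    for _ in range(n_servers):
        b = int(rng.uniform(lo_s, hi_s) / bucket_s)
        buckets[b] = buckets.get(b, 0) + 1
    return max(buckets.values())

# With no jitter, 1000 instances arrive in the same instant: a 1000-request
# spike. Jittered over ~4.5 s, the worst half-second sees roughly 1000/9.
peak = peak_arrivals(1000)
```

The total load is unchanged; only the instantaneous peak drops, which is exactly what a thundering-herd fix needs.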
You are already trusting the source server to tell you that this was my post and what URL was in it.
So, you want us to unfollow you? Won't happen. You're not boring.
Interesting. Do you have an estimate of how many of those might be "organic" hits: actual people following the link in the toot?
Zero. He's talking about the Sidekiq job in Mastodon which fetches the URL to (potentially) create a preview card.
I see. Does that usefully respect robots.txt? If not, it could be blocked at the request level. Easily so if it announces itself as a particular agent type.
Blocking the traffic would make the post broken in all those Mastodon instances. The post preview would be missing.
That hardly qualifies as "broken". The link would still be there, right?
Most hyperlinks on the internet do not come with previews. It's the exception rather than the rule.
Preview cards with images make links look better and get far more clicks.
Even if that is true (where are the numbers to prove it?), it seems to me that since toots can include images, you can have your cake and eat it too: you can attach your own screenshot of the page to accompany the link, and block the preview requests.
The image travels through the fediverse without generating redundant hits on your server.
Proof of concept:
Here is hackernews: https://news.ycombinator.com
I don't know anything about Mastodon's design yet. Couldn't the Twitter card data be pulled in to the post when it is published to avoid the need to manually add an image?
I don't think "Click through rate" is something we should honestly be caring about.
This isn't an advertising network.
You're arguing that Mastodon should not support link previews. It does, so the debate is whether it is supporting them sustainably.
What I said is that "Click through rate" isn't even a metric we should consider.
Again, this isn't an advertising network, and we don't care about CTR or total impressions. We care about socializing with humans.
My suggestion elsewhere in this thread is that link previews should be an optional toggle in the admin UI, to disable it instance-wide, disabled by default. And/or, users should also have that toggle, for their experiences, defaulted to off.
Reason being? Several. One is the huge traffic surge we're causing to other network citizens. The other is... frankly, goatse, or other gore/shock porn. Shit, even making it so people who may have content sensitivities wouldn't get blasted by things they didn't ask to see.
Bottom line: the metric you suggested above is a bad one that should have no place in discussing features present in social media software.
It isn't contrary to the goals of Mastodon to want people to click the links in your posts.
That's one of the only two reasons everybody undertook the hassle of implementing Twitter-style preview cards on their websites.
The other is to give users more information before they decide whether to click the link.
Giving instances and users more control over whether to show link previews is a good idea.
Yes, many sites undertook the effort to do preview cards, because of capitalist drives for economic return.
There's very little social value in preview cards, really. Sure, the argument could be made "Decide on clicking the link", but most of that decision should be made because of who is providing the link to you, not because of what a preview card shows you.
Ie, gargron shares a link about new Masto changes, I don't need a preview card to give me info: I have it from the person sharing the link. Same with the local antifascist groups. They provided the link, I trust them, and I don't need a preview card to tell me more, truth be told.
Random person sharing link? I probably will start to consider the sources, or open in an incognito window. I don't trust the preview card, anyways, and at worst, it's preloading data in my browser from a potentially malicious site looking to track people.
In the first 60 seconds, from skimming the logs, it's very obvious that it's almost none of them. For my account (658 followers), it was 41 servers, each making two requests (one for the page, one for the opengraph preview image), triggering 82 requests... and one human.
And of course, this happens if _anyone_ posts one of my links.
It’s a good suggestion. Unless a visitor has a session cookie, they don’t need an artisanal rendering of the webpage; you can let Apache cache it for a few minutes instead.
Pop quiz, how many years have I been running this blog? Do you think that I have not already considered or tried literally every boneheaded suggestion that happens to be the first thing that pops into your head? Really?
This so describes most of my weeks dealing with customers.
Did you try unplugging it and plugging it back in? I mean, it's not like you helped birth the WWW or anything.
Given how much I hate working with caching as a web dev, I'm going to take your comment here and wrap it up and cherish it, and then take it out and WAVE IT AT PEOPLE the next time they, too, suggest that the laws of physics and computing can be avoided by pressing a switch. Thank you, and may the gods of the internet have mercy on your server.
I am not convinced that replacing a capacity problem with one of the two hard problems of computer science is the correct answer.
If running a LAMP stack, you may consider mod_evasive: https://phoenixnap.com/kb/apache-mod-evasive to deal with burst traffic
Thanks. That seems useful against real DDoS attacks. But for this problem that just looks like a DDoS but is actually a legit albeit heavy handed response, throwing a 403 back at a legitimate requestor is undesirable.
I gather this module’s config provides for blacklisting an IP only on repeated requests for the same resource.
How would that help against a stampede where a myriad of clients request any specific resource just once each?
I'm curious whether anyone has made one of the automated "grep your Twitter follows for Mastodon addresses; import the .csv" processes work. I uploaded mine days ago, and I'm still only following, like, five people.
I've tried several of them and they all found roughly the same set of people. I'm not sure that it's not working so much as most of your Twitter cohort haven't added the links.
I think movetodon.org is the easiest to use. Whether it's the most accurate, I don't know.
It's pretty nice! I added a bunch of people and only had to unfollow about 10.
checking if the user agent is Mastodon and sending a page truncated to just the <title> may avoid the problem pretty well
Out of curiosity, did you think to yourself:
"I know this guy's been running a blog for decades, but if I post these three letters, a lightbulb is going to go on. The first thing that popped into my head is not only going to be the solution, but I'm the first one who thought of it. I'm the one who fixed the problem. Go me."
Is that how you thought this would go?
I was actually wondering if you'd decided not to go there for some reason?
we wouldn’t be engineers if we didn’t have an irresistible compulsion to fix every problem we hear about. My wife jokes that all she has to do to get me to put all the groceries in our relatively small fridge is to say it’s a hard problem.
Yes, well, "learning when to keep your mouth shut to avoid being an irritating blowhard who just adds noise and not signal" is an under-appreciated skill of *good* engineers.
Indeed. You’ve been pointing out stuff like this since back when we were posting to the Opinion bboard. Hard to believe that’s nearly 40 years. Glad to see you’re still yourself.
The Fediverse design could use improvement. But in addition to the problem you note with Mastodon in particular, its implementation seems to have absolutely ridiculous server resource requirements for the amount of work it gets done. I don't know if it's because the thing is implemented in ruby-on-rails and node.js, or if it's just really terribly designed, written, and (un)optimized - but there's no way it's going to have more than a niche set of instance operators if they require these massive systems to handle relatively small amounts of users and traffic.
There's a bunch of other implementations of the underlying protocol:
People are using "Mastodon" as a generic name like Bayer or Kleenex or Xerox because "Fediverse" is just stupid and "activitypub" is not usable directly.
They're also using "Mastodon" because Mastodon servers host 97% of the active Fediverse users. (They're only half of the servers, but they're 95% of the servers that participate in the herd that rumbles after links jwz posts.)
no. they're using "mastodon" because it is not an idiot word. the other words are idiot words that regular people don't want to use. regular people also don't want to say "the president tweeted ... " but for a while they were forced to. one of mastodon's great assets is that it does not sound like an idiot thing to regular non-idiots.
I jumped on Mastodon a week ago and am enjoying it in ways I never did with Twitter, but saying "the President tooted..." vs "tweeted..." doesn't make me feel like less of an idiot.
Um, ya, that's why I said that Mastodon in particular had ridiculous resource requirements... not sure if you were just in violent agreement with me or what.
Yeah, what I've read about the resource requirements of even small instances is absolutely bonkers, and largely what prevented me from making that problem for myself too.
I run my instance in a VM on a Raspberry Pi. About ten users so not big at all but it works fine.
I'm not sure your toy computer handling a toy amount of usage is quite the useful datapoint you think it is.
Kris Nova’s write up that I linked to makes that clear. Migration of one account from a single-user server taking days or weeks seems very poor.
That seems better than an even larger stampede in a shorter interval.
I've heard that https://pleroma.social/ is much lighter-weight, such that you can run a single-user instance on a Pi or a free-tier cloud server.
Yeahhhhhhh I heard that too...
Someone please stop me from writing my own ActivityPub server in ANSI C.
You'll not be surprised to find...
Thanks for all the ACME services way back when. I learned a lot from reading the code in the mid-90s.
Farcical, what a complete shit show. I've not looked at any of their source, and refuse to do so without sufficient eye bleach to hand.
Reporting in from the field: @crschmidt:
that first thread is spectacular. i'm not sure what other people might consider the highwater mark of absolute idiot moronism going on there. but for me when the neckbeard shows up to explain how, because there are alternate ways to ddos a website, mastodon need not be concerned about designing in a critical ddos flaw that will get predictably worse with scale, that's the moment.
Reading that first thread was so painful, I couldn't bring myself to read the other. Fuck these people.
Let's create more problems by throwing IPFS into the mix.... great work guys.
I suspect IPFS would crash and burn as a solution, but at least it would be more interesting to watch than a plain old web request overload! ;)
Best comment in the first thread:
Casually remarked upon as if this weren’t a massive problem with the entire fucking concept of your app
The few Mastodon issues I've read seem to go that way: "there's an issue with how you are doing things" or "this is a much-wanted feature", then a maintainer (*cough* gargron *cough*) responds "well, I don't think it should change", then the convo ends. Now that we are getting past the initial explosion, it might be time for instances to start moving to a Mastodon fork which isn't a baby and can be more freely worked on by the community.
A couple notes. 1 - bufferbloat is a thing. Not only is your site getting flooded, but starting up so many connections at the same time is *really hard on your network*. I'd hope you were running fq_codel or cake on your server in the first place, and sqm on your router. However even fq_codel starts falling apart under a workload like this which you can see with tc -s qdisc show. Cake might be better.
2 - what also really hurts here is the syn flood. Linux's synflood protection is set WAY too high for most people's networks, and if you just start dropping syns, tcp's natural exponential backoff should make it a bit less horrible.
3 - I'd really love a packet capture of what happens to anyone in your situation - start a tcpdump -s 128 -i your_interface -w somefile.cap - (just need the tcp headers) - do a post - get slammed - stop the cap. I'd be perversely happier if the network was behaving ok and it was just your cpu going to hell... but...
Oooh, it is a server issue, not DDOS from clients? That makes the potential to weaponize it even greater. Don't even need an account with followers...
On the plus side, at least this issue was reported in 2017, and had a long discussion.
Unfortunately, the end result was to end up in a place where the solution was "Just add a 1-60 second random delay", because other solutions (including pushing content with the post) were seen as having too high of a risk.
The fact that a malicious server pushing content could just... lie about the URL in the first place wasn't touched on.
Welp, if my choices are going to end up being, A) harden your site to deal with DDoS on a twice-daily basis instead of twice-annually, or B) stop posting to Mastodon, it's gonna be hard to argue against B.
Unfortunately for you, "Stop posting on Mastodon" doesn't actually _solve_ your problem, since you're still going to have this problem anytime _anyone_ posts the link. (Or boosts a post that has the link in it, as we saw.)
Though certainly if you use Mastodon, it's going to end up being more noticeable to you; otherwise, it'll just be like "why is the site slow right now!?" and you'll never figure out it was some semi-popular rando on Mastodon.
I wonder what breaks if I just blackhole the Mastodon user agent. Just link previews, or will other things blow up too? Guess I'll try it and find out...
Assuming you're not hosting any webfinger stuff on jwz.org to allow resolution of e.g. @jwz, just link previews as far as I know.
Apropos of nothing, that's an infuriating discussion. "We don't need to fix this because botnets can generate more traffic" and "...people should have servers that can handle any load" are not valid reasons to not fix your broken design, FFHS.
Exactly! Is Mastodon not already throwing enough roadblocks in front of adoption??
I'll only add a little bit more color by saying that a request to add a feature to do server-side fetch before federation opened in 2020 didn't produce quite the same response, and is still open.
It would probably require a pretty dedicated project lead, since it would be best to coordinate across fediverse software, but I'm hoping that means that someone championing the issue might find traction in helping mitigate things somewhat. https://github.com/mastodon/mastodon/issues/12738
Maybe Mastodon was created by clown services to get more money on network data transfer.
Clown Services is the name of my next startup.
Also, the aforementioned clowns don't do that much advance planning, what with Wall Street and quarterly executive comp and shit. Ideology-driven idiocy suffices to explain what we're seeing here.
To everybody suggesting hosting solutions, the issue isn’t that @jwz needs hosting advice. The issue is that any popular user can DDoS any website. And anybody can become popular through bot accounts.
The decentralized nature of the fediverse makes it significantly harder to deal with problems like this.
Currently simple posts already act like huge write amplification attacks, causing accidental DDoSs. It’s stupidly inefficient and dangerous.
canary token all the things? http://canarytokens.com/feedback/static/2sr8fngn4kwn6oydldcptupg3/submit.aspx
is that people accessing? Or bots 🤔?
It's the instances themselves each hammering the sites simultaneously to get link-preview metadata. Definitely not end users.
Perhaps it's time to bring back the Usenet newsreader warning:
This whole thing is cryptocurrency part two, isn't it?
More amazing still is that, even as crypto is finally collapsing under the weight of its own vast idiocy, people are rushing to this new equally stupid thing. Because, somehow, this time, the technical fix to the social problem will work.
It's not until it:
So far it's only got:
Caching on the Blockchain, and each cache/federated thingy will be explicitly allowed to inject more ads and malware and popups.
The worst part is that this was all so foreseeable.
When I pulled up the ActivityPub standard specification and started reading about the inboxes and outboxes, that was my immediate thought, won't this design cause a huge scalability problem?
But I'm assured that the people behind the standard have a lot of experience, have been working on it for years, while I'm just a guy who pulled up the spec on my lunch break. I guess I should trust in their expertise, that they've thought this through. Right?
Programmers are still taught the concept of big O analysis... right?
> Programmers are still taught the concept of big O analysis... right?
I'm surrounded by peers with CS, CE, and EE degrees, though few are familiar with O(...) analysis. Most of the senior architects get it, but they're only responsible for a tiny amount of the actual coding. For the rest I fall back on plain old algebra and hand-sketched quadratic curves.
Even after simplifying to algebra, I can only reach about half the coders before they write their (invisible by looking at a single function definition) quintuple nested loops. So we have to let empirical measurement provide the proofs for those who don't believe the algebra.
As a result we lean heavily on our QA team to write realistic benchmarks to empirically flush out these performance design flaws. That's very important as our products are designed to handle hundreds of billions of objects and the development teams are accustomed to writing test cases with just a few hundred objects.
I think that part of the issue is many coders have no idea of how a computer system executes their code. They're working within an idealized programmer's model that doesn't take into account physical constraints like cache sizes, hit rates, and link speeds. "My algorithm is so powerful that it needs 2TB of physical memory!"
Genuine question: do you have a concrete idea about how to create a federated version of twitter/facebook/google+/usenet at current volumes of posts and users that doesn't put a largish load on each participating node?
I was one of the people to suggest a federated social network at Google around ~2008. While I now realize that was not viable from a business perspective (how would you mediate a truly federated network?) I do think there's a fundamental tradeoff between centralisation and federation in terms of the cost of propagating the data.
Not claiming you are mourning for usenet, but I think the quote is appropriate in terms of design goals. "Those who mourn for 'USENET like it was' should remember the original design estimates of maximum traffic volume: 2 articles/day" -Steven Bellovin
Disclaimer: I again work for Google but the opinions expressed are mine and not my employer's.
If I had concrete ideas about how to solve these problems, I'd have written a protocol spec. I'm not gonna be the "you know what you OUGHTA do" guy here because these are, in fact, hard problems. But they are also part of a class of problems that have already been solved, at scale, in the past. So while it's hard, there's nothing that makes me think that it's impossible or even impractical.
But the fact that the Mastodon developer community's reaction to "here's a real-world scaling problem happening right now" is victim-blaming and 5 years of delay rather than "we'd better find a solution to that pretty quick" is not a great sign.
I feel like it’s only a matter of time before people start serving goatse to a poorly behaved herd of user agents.
Examining it from a business case perspective tells one rather a lot about what the solutions aren't.
End users don't give a fuck about federation and the money to be made owning 1/20th of a Twitter or 1% of a Facebook is minuscule.
If we boil the ocean to completely redo http, then I'm sure it's possible to start solving the technical problems and cut out the need to trust intermediaries, but then you still need the entire world to change.
Or you can trust an intermediary enough to absorb the load elsewhere.
I'm nowhere near as smart as you folks, but I'm pretty sure that without aggressive moderation, which I can't picture as implemented without the generic mediation capability you touched on, "socials" turn into fascist shitpiles dominated by the absolutely worst actors. This was already the case before fascists recognized every network as an opportunity to do their thang and created tools, techniques, and resources for creating more of same in order to accelerate the process.
You are in fact at least as smart, I think, because anyone ignoring that is an idiot or someone who wants to enable fascism or other equally awful things. Moderation matters and does not scale very well.
I don't have the answers, obviously, but I suspect that the only workable solution to "how do you do moderation at scale" is going to turn out to be, "don't".
If you take as your baseline the idea that when I participate in social media, there are a billion users who might conceivably interact with me, and now someone has to filter out the nazis... that's gonna be really hard.
So maybe don't let a billion users interact with me. Keep the social graph actually social.
Solving the problem of "my roommate's dipshit uncle is a nazi" is a lot easier than solving the problem of "there is a Moldovan bot farm trying to undermine global democracy".
I think this works for people, except for very famous people who will need to spend a lot of time blocking trolls. But I don't see any point in worrying about those people: they can hire minions to do that for them.
But it probably does not solve the problem for organizations running these services. Twitter is probably in the process of providing a demonstration of this: my guess is that the EU actually has working teeth, countries in the EU have had actual experience with what happens when you do not deal with nazis, and they are therefore completely fine with saying 'if you are not dealing with the nazis on your platform then you are not in the EU'. And Musk will have another elaborate public squealing tantrum, which will at least be entertaining.
I'm not sure if federation solves that problem, because if there are enough nazi-infested instances around then the only practical answer is not to federate with anyone.
It is at least possible that there is no workable solution.
FWIW, I think in cases like these it's important to separate out technical issues with technical solutions from social issues with social solutions. Often enough, trying to solve either with the other just makes the mess worse.
So above we were talking about technical problems that might sink these systems even before any of the moderation issues come up. Mastodon has a back-end technical design that eats more and more computational resources, which is a problem. There may end up being no workable system to moderate.
Well, I suppose instances shutting the door to new users they can't afford is a type of social solution to the technical problem, though not exactly an ideal one.
As for your question, though, I always promote social solutions that mirror our normal social interactions, focusing on users having more agency over who they're interacting with instead of having to hope that some [maybe overworked] moderator is going to choose right for them.
We have technical solutions to the technical problems that this brings up. Unfortunately, for whatever reason we'd care to speculate about, they haven't been applied.
Usenet is quite federated - or should I say "it was"? It's past its peak. Part of the reason for using the past tense on Usenet is that its standards are stuck in the mid-'80s (RFC 1036 from 1987 was obsoleted by RFC 5536/5537 in 2009, at which point it was already too late). By the early 2000s some encodings beyond 7-bit ASCII were somewhat accepted, but some clients still couldn't process them, and let's not talk about multipart or anything beyond text/plain. There's a lesson on federated systems and the lowest common denominator of client software in there. (See also: XMPP, where you have about five mutually exclusive standards for just about everything and it's up to you to figure out how clients and servers play along with that.) Also, moderation, which actually does place a burden on operators. See also IRC, which is totally federated and stuck on 7-bit ASCII, as they don't even have a way to signal the client's character set. To close, let me paraphrase Niklas Luhmann, who famously wrote (in a letter) that with high complexity you get selection patterns, and that you should just wait for stuff to explode.
I haven't looked into it, but the IRC instances I use seem able to send emoji.
That's more of a happy accident than by design. The specs (RFC 2812/2813 from April 2000) just say "No specific character set is specified" (section 2.2 (2812), 3.2 (2813)), "The protocol is based on a set of codes which are composed of eight bits", and "delimiters and keywords are such that protocol is mostly usable from US-ASCII terminal and a telnet connection". There is no signalling of character sets and encodings (extensions have been proposed but never became official) and servers mostly just pass the bytes through. If the other side sees the emoji you sent, that's only because they happened to use the same encoding as you did. GB2312 or KOI8, anyone? WTF-7?
To me this is a case where maybe there is no good, practical, workable solution to the technical side, but the space between what they're doing now and the better they could be doing is huge.
This backend design embraces cascades of multiplying connections throughout both participants and the larger internet, and even accepts them as preferable to an ever-so-slightly degraded user experience from what the lead developer just KNOWS all users demand (as per that bug report linked above)... well.
No telling how good such a system could actually be in the real world (or, at least I don't have the expertise to tell), but at the least the system could stop being so bad.
genuine question: do you get a similar spike from Facebook/twitter clients all downloading previews / images ? Or are these all sent from the server?
No because the FB and Twitter servers cache link previews on the server side. Same thing as what Mastodon does except there's only one server instead of many thousands, each with their own separate caches.
sounds like it's time for a kubernetes cluster
we experienced the same thing over at cohost and repeatedly had to block the mastodon user agent until the flurry of traffic was over -- the culprit was instances scraping our site in order to generate previews of embedded links. we ended up having to fix it by creating a specific cache to serve long-lived versions of posts just for mastodon link previews, instead of spending time rerendering them.
It's a #pachydermpounce 🤣
full-mesh federation moment
The best you can do is tarpitting, but that's really suboptimal and still DDoSes you by connection count.
I'm shocked the host Mastodon instance doesn't fetch the preview and share it with the fediverse embedded in the toot. One hit to your server.
It's not really a federation it's a rebroadcast network, a federation would still delegate common services and behave like a system rather than a bunch of islands.
The result being, in this case, that it amplifies requests and creates an unneeded herd: the kind of behaviour that gets your requests killed and, in the end, your domain blocked. I wouldn't be surprised if many content owners were already pattern-matching and dumping these requests, as jwz has been forced to do.
Not at the moment, but I'll check with colleagues who are more technical specialists.
Thanks for the heads-up - I had not considered this.
Please ignore this if you don't have the time; I just like to learn from people smarter than me: Naïvely I would attempt to serve a 503 with a randomized Retry-After to Mastodon Agents if the load goes too high.
I'm sure for some reason that doesn't work otherwise you'd already done it, but why doesn't it work?
Is it because everybody (or at least the Mastodon devs) interprets the 'ought' in RFC7231 7.1.3 as "nah she'll be right mate" ?
(Since you wrote that outright blocking the UA helps, I'm assuming this is not a bandwidth problem; even then, one still has ingress bandwidth for the initial request? Or is that handled differently?)
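For what it's worth, a minimal nginx sketch of the idea above (the UA regex is the one from the post; a fixed Retry-After stands in for the randomized one, since stock nginx can't generate a random header value without njs or Lua):

```nginx
# Match the fediverse preview fetchers by user agent (regex from the post).
map $http_user_agent $fedi_preview {
    default                   0;
    "~(Mastodon|http\.rb)/"   1;
}

server {
    # ...
    location / {
        if ($fedi_preview) {
            # "always" is needed so the header is attached to the 503.
            # A randomized value would need njs/Lua; 120s is a placeholder.
            add_header Retry-After 120 always;
            return 503;
        }
        # ... normal request handling ...
    }
}
```

Whether the fetchers actually honor Retry-After is, as noted, the open question.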
Maybe that would work. I dunno. Trying it would require writing some code and... this is a new problem that I have this week that I didn't have last week, so I did the easiest possible thing to make it go away.
It's not a bandwidth issue, it's a CPU issue. The reason that simply blacklisting the UA works is that the blacklisting happens very early and quickly in the processing of the request (at the Apache layer); actually generating and serving the page runs a lot more code (PHP and mysql).
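For anyone wanting to replicate that early Apache-layer block, here is a minimal mod_rewrite sketch using the regex from the post (an illustration, not necessarily the exact rule in use; adjust flags to taste):

```apache
# Reject fediverse preview fetchers before PHP/MySQL ever run.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (Mastodon|http\.rb)/
    RewriteRule ^ - [F,L]   # 403 Forbidden, no further processing
</IfModule>
```

The whole point is that this match is cheap: the request dies in Apache's rewrite phase rather than spinning up the page-rendering stack.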
There are already people right now typing "well why don't you just" and to those people: please, just, don't.
This seems like the sort of thing where someone is going to come up with a WordPress plugin for it and you’ll be able to go from there.
It's way too late if you hit wordpress. Do you even computer?
Unless you meant writing a plugin to reply to people "please just don't suggest X", in which case you get some internet points.
cue more victim blaming:
replying to myself because, if anyone is "sophisticated enough" to "think about editing" a fucking text file manually they're "probably already" running a caching infrastructure. What fucking planet do these people live on?
They’re still at the “hardcore engineering” stage of thinking that social problems have technical solutions in general (e.g. “if racists are being racist at you, just run your own instance” in the real world we call that a ghetto and if you don’t see why that’s a problem well… you are the problem).
And for this it’s a general Stallman problem. “I don’t understand this. Therefore you are wrong.”
excited to hear how folks plan to solve this!
As a fellow long-time blogger I suggest just writing a boring and unpopular blog. This works well for me.
aka selfinflicted slashdot
yea, someone with 4k followers and a small site shouldn't be at risk of an automatic DDOS on posting. Makes me nervous to link to anything of mine that isn't 100% static.
Remember that it's not just the person with a lot of followers who is at risk here. If I, with a lot of followers, were to post a link to someone else's small web site, that person is going to have a bad day.
Everyone who replied with "use a CDN," is really saying, "I expect all web sites to be run by skilled and dedicated professionals, who deploy future-proofed technology stacks, so that my social network can be run by amateur hobbyists, and developed by those who fear what the future might bring."
well, actually... it's been so dangerous out there that getting _something_ in front of every web server I've run has been an item for years (except static files and nginx).
That said, I too have noticed a few stampedes on my site (even considering I do use Cloudflare and static hosting, the distributed effect breaks through and it shows in outbound blob access metrics _behind_ the CDN):
a friend of mine here in Vegas has a startup that does something like that (aimed at WordPress publishers initially), managing CDN and caches and whatnot to accelerate all of the lighthouse params under whatever load.
Personally, I would rather have this problem than the Instagram silo problem… though your point is well-taken.
There's so many ways around this. Any combination of:
1. Propagate previews in a peer to peer fashion from the posting server
2. Allow servers to set "preview proxies" -- not every mastodon instance needs its own preview renderer. A shared renderer could cache requests.
3. Have mastodon previewing respect robots.txt
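For option 3, the opt-out might look like the robots.txt below. Both the "Mastodon" user-agent token and the fetcher honoring robots.txt at all are assumptions, since neither is true today (which is the complaint):

```
# Hypothetical robots.txt, assuming the preview fetcher both honored
# robots.txt and announced a dedicated "Mastodon" token:
User-agent: Mastodon
Disallow: /
```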
So, in short, “Amateur-run websites, amateur-run social networks: choose one”
I keep looking at this toot and being more and more amazed it even needs to exist.
Just like, I know nothing is more confident than the ego of a mediocre tech bro, but do they not realize who the fuck they're talking to here?
we've actually been looking at this in some detail (now that I work with Fastly people who are actually smart about this stuff) and it seems like there are some ways to help mitigate this a bit.
Having had similar issues outside of Mastodon due to syndicated and white label content services, I fear "all of the above" is likely the continued route. Postgres shared buffer caching for the server, nginx cache, varnishd or other cluster-level request proxy (elasticache, redis), & Cloudflare/Cloudfront/Akamai/etc. at the edge. Needing an overbuilt server to accept scores of simultaneous requests shouldn't be prerequisite to having a server, but if it can't handle 5 to 10 concurrents...
Why not use CloudFlare caching? It's free.
I think your solution is a pretty decent one (Blocking the UA).
Link previews are... nice. But, wholly unrequired, and frankly, should be an option for instance admins to enable, not a default. For this very reason.
CloudFlare caching is DNS level as opposed to a CDN. This is a solved problem, I don't get the purpose of this article. I'm happy to help you set it up.
Technically, it's an HTTP proxy with CDN, WAF, and other security and performance features, that is provisioned by making a DNS change. The object caching doesn't occur at the DNS level; it happens when a visitor makes an HTTP request for that (cacheable) object. If DNS is pointing at the proxy, the request is intercepted by their servers, forwarded on to the origin, and the response gets cached at the PoP that originally forwarded that request, so it's available to be served much more quickly for future requests, within the object's validity period.
I've been messing with this since your post and others'. Even with preliminary data, some things surprised me: Mastodon sites making multiple requests (I guess they don't cache locally), a long tail of requests after the initial burst of activity, and how easy it is to identify the traffic by user-agent.
I'm playing around with NGINX configurations to protect sites from floods of expensive requests. It's promising!
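As one illustrative direction (a sketch, not necessarily the configuration being tested), nginx's limit_req can throttle the preview fetchers as a single bucket while leaving ordinary browsers untouched:

```nginx
# Key all fediverse preview fetchers into one shared bucket; everyone
# else gets an empty key, which limit_req does not count. Regex from
# the post; rate/burst numbers are illustrative.
map $http_user_agent $fedi_bucket {
    default                   "";
    "~(Mastodon|http\.rb)/"   "fedi";
}

limit_req_zone $fedi_bucket zone=fedi:1m rate=5r/s;

server {
    location / {
        # Queue up to 20 fetchers; excess get a 503 (the default
        # limit_req_status) instead of hitting the backend all at once.
        limit_req zone=fedi burst=20;
        # ... normal request handling ...
    }
}
```

Requests with an empty key are exempt from the limit, so regular visitors never queue behind the stampede.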
Like polio, slashdot'ing is back!
I dunno if you specifically would find this useful, but to add to the discussion, yesterday (as a fediverse admin that's been complaining about this problem for years) I stood up jort.link to try to reduce this load, at least from fediverse users aware of the power of their following. Basically a "let us do messy simple caching for you on behalf of fediverse instances" thing — everyone else just gets a 301. (No, it's not based on Cloudflare.)
The amount of people who think this problem doesn't exist or that it's just a "quick fix" is astounding. I can point to numerous examples of things I've accidentally taken offline just by replying to someone on fedi, and I've got "only" 800 followers — I can only imagine the effects of someone with 10000. It sure would be nice if Gargron could swallow his pride and admit this is a problem in need of a fix, and make any attempt to prioritize that.
This sounds like a reasonable workaround for this dumb situation, thank you!
Buuuut I keep getting 504 Gateway Timeout when I test it using a Mastodon UA...
Yeah, I saw that in the request log. The nginx hack is really, really finicky about SSL for reasons I don't fully understand (the original issue) and sometimes just outright doesn't connect (current issue). I'm currently rewriting it as an actual proper server program using a real HTTP client so it'll stop being so buggy. At the very least I've turned off the (very poorly considered) redirect fallback, so it'll just 502 to the thundering herd now rather than be completely useless.
I've just deployed that rewrite, and confirmed it now happily serves a cached copy of your site to a fediverse UA, even behind your shortURLs.
Thanks! It was working last night, but now it's getting 400 Bad Request.
It seems for specifically the jwz.org/b/yj65 URL, a 400 response has been cached from your server; I cache even error responses to make good on my request shield promise. I checked that cache entry and it's just a classic Apache ErrorDocument double-fault.
I've just dropped this cache entry. Let me know if you have a better solution — I don't want to just retry the request or not cache it, due to the nature of this. Maybe a limited number of retries would be okay, but if I were to have implemented that without this context I would likely assume 4XX class errors are not transient. Hm.
Maybe there's more information in logs on your end? The IP is 220.127.116.11, with this UA: Mozilla/5.0 (jort.link shield; +https://jort.link)
Huh. Weird. I do see the 400 error but nothing else in my logs, and when I try to reproduce that with the same URL and UA I don't get the error. It must be that something went wrong on my end but I can't tell what.
That is indeed a good solution.
What's the plan if jort.link becomes too popular? I can imagine that at some point the resource requirements will become non-negligible. Is it "yo fedi users/admins gib monies plz (i need to eat as well)" ?
Or is the long-term goal to demonstrate to the mastodon devs "look guys as we've told you countless times, this _is_ a problem, see here?" and hence finally get a solution, rendering it obsolete?
I already run major fediverse services under Jortage, and a relatively large Mastodon instance on a pretty powerful dedicated server. I don't expect jort.link to become too big for me to run, and regardless people are already donating to keep the Jortage project afloat — the Storage Pool is the media storage for 61 instances, and its deduplication has reduced the costs of our members by over 60%, making it well worth the money of many instance admins that moved from S3 or worse.
Fundamentally, jort.link has very low resource requirements. Once a remote file is cached, it's as expensive as serving a static file — there's no databases or anything involved, and the only point of contention (a synchronize across threads) is avoided if the nearline memory cache has the metadata memoized. The primary concern is bandwidth usage (and I've got plenty of bandwidth) but that's the point of the 8M page limit. And even if I do have to constrain it, these requests do not need to be timely; fedi software will happily wait many seconds to receive their pages, so I can prioritize serving actual browsers and throttle fedi software. Additionally, a previous version of this was proven to be able to run behind bunny.net, a cheap global CDN. I dropped that only because it currently is slowing things down, introducing another point of failure/data processing, and generally doesn't make sense at this stage.
I've been encouraging other fediverse big players to run their own jort.link instances — due to the nature of this, more instances does not equal more load to the origin server. Whichever instance you pick is self-contained and will only request the origin once. This was part of the motivation for the rewrite; it's hard to reproduce my nginx/dnsmasq/bunny.net house of cards, while a small self-contained Java program is very easy to run. (Certainly easier to run than Mastodon itself.)
pyrex (who, for the record, I only know of from this comment thread and them pinging me on fedi with some questions about jort.link and general design) is talking to masto.host about patching the Mastodon code they run to go through a masto.host-hosted jort.link instance for media retrieval transparently on the backend, which would assist with this problem quite a bit due to the sheer number of instances they run.
I do hope Gargron will acknowledge and fix this issue, but after this many years I'm not holding my breath. If jort.link does become obsolete, that's a win.
do you not use any form of static cache? I don't even notice hits like that because everything is statically cached.
Hm -- this won't solve your problem because someone has to actually implement it, but I've been rolling this around in my head for a while. I'm wondering if Masto users could be persuaded to start setting up (shared) caching proxies specifically for link preview information.
Basically, as a secondary set of APIs, I think Masto servers should be able to report "this is the link preview info for page X," given any arbitrary URL X. That should be cached. This way, if you trust an instance you're federating with, you can immediately get the link preview info for them. Otherwise, you can ask some large instance that you trust and stampede _them_ instead.
This way you can still federate with people who you suspect might use bad link previews to defame people not on your instance. This is not something anyone explicitly wants to do, but seems likely to happen by accident if you have a policy of federating with small instances by default -- which Masto does.
Instances that don't want to be stampeded can refuse to publish this service. That being said, the thing they're caching is much smaller than whole web pages and they can probably do everything with a single hit from cache.
Instances who know that they're the caching proxy for a bunch of other instances would ideally use a different user agent that still matches the (.*Mastodon.*) family of regexes: that way, people who intended to block old!Masto also block new!Masto, but people who know the difference can unblock them manually.
Overall, this seems to me like it would prevent stampedes while also having better performance for literally anyone running a Fedi instance. It requires basically no additional work from webmasters. (but unfortunately, significant extra work from Fedi developers)
This is still less efficient than just federating link preview info by default. The real solution is to not federate with people who lie about link previews, and to defederate with them if you find out you were wrong -- I just think it's good to limit harm in the case where people are too lazy to do this, because Masto is designed in a way that inherently leads to transitive trust-style problems.
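A hypothetical shape for such a preview-info endpoint (the path, parameters, and field names here are all invented; nothing like this exists in Mastodon's API today):

```
GET /api/v1/preview_cards?url=https%3A%2F%2Fexample.com%2Fpost
Accept: application/json

200 OK
{
  "url": "https://example.com/post",
  "title": "Example Post",
  "description": "First paragraph of the post...",
  "image": "https://files.example.social/cache/preview/thumb.jpg",
  "fetched_at": "2022-11-27T12:00:00Z"
}
```

An instance serving this answers from its own cache, so a thousand downstream instances cost it a thousand cheap cache hits rather than costing the origin a thousand page renders.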
I've posted this link on Mastodon today (linked) hoping some Masto admins see it and have opinions on the idea.
Replying to myself to add: two people on Masto inform me this has been built as an unofficial thing: https://jort.link/ .
The current implementation appears to put a lot of responsibility on webmasters and Masto users though, while putting basically no responsibility on instance owners. So basically, I think the social incentives are all wrong since it puts all the power on the people who aren't harmed and opts server owners in by default, unless they do user agent filtering manually.
(My opinions on this seem shared.)
This whole notion of "but what if the link previews are full of lieeeees" is complete nonsense. It's an asinine strawman. A red herring.
If I post a link and instead of the cat video you were hoping for it's actually a rickroll, that's on me, and the block button is right there. My instance is already attesting to other instances that I am who I say I am. If I post a link that is not what it purports to be, that's my fault.
If you are accepting that I am not a malicious actor (which presumably you are by allowing me on your timeline) then you must transitively accept that I am not sending you malicious links.
(Where "malicious" in this case means nothing more tragic than "the thumbnail and title are different".)
The instance you post to should retrieve the link preview data, once. When it distributes your post to other servers, it should include that data with it, along with all the other metadata about the post, such as who you are.
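Sketching what that could look like: ActivityStreams already defines a generic `preview` property that could carry the card data the originating instance fetched, attached to the activity it federates. The payload below is illustrative, not an existing Mastodon format:

```json
{
  "type": "Note",
  "content": "Worth a read: https://example.com/post",
  "preview": {
    "type": "Link",
    "href": "https://example.com/post",
    "name": "Example Post",
    "image": "https://files.example.social/cache/preview/thumb.jpg"
  }
}
```

Receiving instances would render that instead of re-fetching the origin, which is exactly the one-hit behavior described above.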
Yes, people can and will post shitty links, and deceptive links. And you can and will block those people for being shitty and/or deceptive people.
You can't code around social problems, but you absolutely can code around not DDoSsing my server under normal, non-adversarial operation.
Once again, "Everyone who replied with 'use a CDN' is really saying, 'I expect all web sites to be run by skilled and dedicated professionals, who deploy future-proofed technology stacks, so that my social network can be run by amateur hobbyists."
And importantly, remember that it's not just the person with a lot of followers who is at risk here. If I, with a lot of followers, were to post a link to someone else's small web site, that person is going to have a bad day. Or even if I reply to a thread in which they were already linked by people with fewer followers! As soon as I say even one word in that thread, that also triggers a stampede on their server.
Well, that failed horribly. Couldn't figure out how to get out of the quoted text box.
Anyway, if you include "the instance you boost from" in the "should retrieve the link preview", that also fixes the "people I don't necessarily trust appearing in my timeline via boosts" problem since that moves the trust back from "random" to "someone I follow".
Yeah, I thought about it (and read more of the comments) and I think you're right. As a server that is receiving a federated post, I should either trust the information I am given about the link preview, or I should not fetch the link preview at all. Any other behavior can DDOS you in at least one case.
(And this is less bad than a lot of masto-level misbehavior that _can't_ be autocaught.)
Someone who says "I don't trust people not to lie about the links" is saying "well, between trusting people not to lie about the links, not having link previews at all, and potentially DDOSing anyone whose link is slightly popular, I picked option 3."
The feature never should have been released in this state and a lot of people who are making this complaint are basically just attempting to justify pushing costs onto you.
EBWOP: and _lying about link previews is less bad_*
“Have you noticed CloudFlare is free” is probably the single most amusing thing to hear from the peanut gallery.
How do you solve Mastodon’s distributed problem? Well, step one is to concentrate even more of the internet’s traffic in one giant corporation! Who needs a bird-shaped chokepoint on microblogging when we can have a cloud-shaped chokepoint on everything! (Also, have the 40-something percent of decentralized web sites hosted on WordPress considered… not? Something something database calls something something.)
The CDN bullshit lets Mastodon cosplay as an independent solution to centralization by forcing everyone else to abandon their independence.
This may be a dumb solution, and I haven't tried it yet, but the blog could respond to that user agent with little more than a <link> in the head referencing another URL with type "application/json+oembed". That second URL then returns a scrap of JSON containing the information to build the preview card. This could be a lighter-weight response than the full blog page. I'm not sure this is really the right place to fix the problem, or whether there's a better way to discover an oembed URL than fetching the whole main post just to get a link to the lightweight card.
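Concretely, that idea follows standard oEmbed discovery: the page's head carries a discovery link, and the endpoint returns a tiny JSON document. The URLs and values below are illustrative; "version" and "type" are the fields the oEmbed spec requires:

```html
<link rel="alternate" type="application/json+oembed"
      href="https://blog.example.com/oembed?url=https%3A%2F%2Fblog.example.com%2Fpost" />
```

```json
{
  "version": "1.0",
  "type": "link",
  "title": "Example Post",
  "provider_name": "Example Blog"
}
```

Note the fetcher still has to retrieve the main page to find the <link>, so this saves rendering weight but not the request itself.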
Even something as simple as "what is the title of this blog post" requires a dive down into WordPress and mysql, so at that point the damage is done. It's not that these are expensive queries but that there are a fuckton of them in a short period of time, and lest we forget, that they are totally fucking unnecessary.
True enough, this saves some bandwidth but not much else. Will dig a little deeper into the Mastodon code; this looks like a nice feature to fix by doing it properly. It needs a serializer for the card, and then the fetch_link_card service needs to wait a few seconds, then ask the origin server whether it already has a serialized link card before attempting to make one itself.
I front everything of mine with CloudFlare.
And I thought Alan Turing helped to defeat the Nazis...
Cool story, Bro. Cloudflare are Nazis.
I use a VPN, does that keep me safe?
Hi, I have the same issue with my blog. Would be kind to post here the exact rule that you wrote on your .htaccess file to block the Mastodon user agent? I'm not very good with this stuff... thank you! 🙏
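A minimal .htaccess sketch matching the regex from the post (an illustration, not necessarily the exact rule jwz uses; this form needs Apache 2.4):

```apache
# Tag fediverse preview fetchers by user agent, then deny them.
SetEnvIfNoCase User-Agent "(Mastodon|http\.rb)/" fedi_preview
<RequireAll>
    Require all granted
    Require not env fedi_preview
</RequireAll>
```

Mind the backslash in http\.rb when pasting.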
Thank you! You are a life saver.
Careful of the backslashes. WordPress likes to strip those out.