By default, if you type some random word into a URL like /blog/something it will spin the wheel and redirect to an arbitrary post that happens to have that word in it, like /blog/
add_filter ('do_redirect_guess_404_permalink', '__return_false');
But this is insufficient because, for example, the post with the URL /blog/2022/12/sewers/ also answers to /blog/2022/00/sewers/, /blog/2022/0/sewers/, /blog/0000/00/sewers/, /blog/0000/0/sewers/, /blog/0000/0/SeWeRs/, and who knows what else. And those aren't even redirects, but multiple URLs returning the same document.
How do I make it knock that shit off?
I imagine the answer is somewhere in the parse_query filter, but that's the part of the map that is very clearly labelled "Here Be Monsters".
I gave up on WordPress many moons ago. It just contains too much opaque magic. It turned out that I was spending more time troubleshooting unexpected behavior than actually being productive. The list of stuff I needed to *turn off* via hooks grew with every release. The plug-in ecosystem is a chaotic mess.
For my own needs, I’m using Hugo or Zola for static sites. For anything that needs a CMS, Kirby is a clean and mostly predictable one.
Cool story bro.
When I ask a very specific question, I am super interested when people respond with, "Well what you ought to do is burn the whole thing down and start over from scratch."
Maybe I'll recompile my kernel, too. Sounds fun!
Wow, cool reaction bro. I’ll make a note to stay away from you. Have a laid-back new year.
It really is weird that jwz didn't react with snivelling gratitude to "step one, throw everything away".
> I'll make a note to stay away from you.
Oh, please do. Responses like yours go past "unhelpful" and into distinctly anti-helpful territory. I too get peeved when a question "How do I accomplish very specific outcome X with system Y?" is answered with random blathering about unrelated other systems, and advice to do huge amounts of work for no clear reason.
I would understand this kind of reaction if he was actually advising jwz to start over. But he's not, he's just sharing his experience. Nothing to be mad about.
Nobody gives a shit.
This is a Geek-vs-Normals thing. When people complain about X, some are saying "I need help with X", whereas others are saying "Sympathise with me!" In general, Geeks always assume the former, and so run up against the problem that Normals get annoyed that the Geeks in their lives try to fix things all the time. "I was just saying my job sucks, I didn't need you to research fourteen new career paths for me and present them in a spreadsheet! Why can't you just listen?".
The key to dealing with this, as a Geek, is to formally ask your Normal friends: is this a rant or a bug report? If it's a rant, sympathise; if it's a bug report, offer help to fix the problem.
With JWZ, it's always a bug report if it's about software. If it's about the night club, the local political scene in SF or anything to do with the parlous state of music in the 21st century, it's a rant. I've never seen anyone so reliable on that matter - most people, even Geeks, vary randomly between the two options.
It is hard to believe that some people simply can't tell when I am looking for a solution to a specific problem, since I almost always do so in a sentence like, "I am looking for a solution to this specific problem".
My theory is that they do understand that, but there are just a lot of people in the world who can't resist taking that as a prompt to say, "Well I don't care about that, but let me tell you a story about this unrelated way in which I think I am very clever."
We usually expect that sort of thing to be a Geek issue -- that is, Geeks don't grok human emotions so they always guess wrong in this case. I think here is the only place where it's gone the other way so reliably. Face it, this isn't random - you inspire the frootloops to come out of the woodwork. Maybe it's the green on black colour scheme.
This is hardly the only venue where neckbearded nerds crawl out of the woodwork to shit out unsolicited and incorrect advice.
It's just one where the dominant culture (our host) strongly objects to it.
True enough, @elm. I suppose if it's not sympathy that the frootloops are offering but merely a bit of derailment and "look at me" behaviour, that's definitely on-brand. I was hoping it was an unusual outbreak of empathy, but I guess not. Oh well.
The response to this sort of thing back in the glory days of Metafilter was "get your own blog."
But more to the point, this *may* help:
If this is an actually useful answer (I have no desire to test it) that would be the first time someone would recover from the JWZ pit of doom they threw themselves in.
No, that does nothing that do_redirect_guess_404_permalink does not already do. redirect_canonical calls redirect_guess_404_permalink before running the redirect_canonical filter. (579 lines before, since that disaster of a function is 772 lines long.)
A 772-line function? JFC.
I thought WordPress was, like, real software, not some random hobbyist's My First Free-Software Project.
If I came across that in a system I was working on, There Would Be Words, not just a silent rewrite of it.
There are longer functions in WordPress and it still powers more than 40% of the web. Like Mozilla at the time, being small is not always a feature.
Having said that there is of course room for improvement. In this case, if you know what you are doing, rather than doing:
add_filter ('do_redirect_guess_404_permalink', '__return_false');
you could do something much worse, which is redirecting in the filter. So rather than simply returning true or false, in the filter for that function you check what was requested and, based on that, redirect to something meaningful. There are probably better ways of doing it, of course.
When someone says 'I have a problem with the detailed behaviour of this Common Lisp function', 'Use Scheme' is a deeply crap answer.
I, too, had URL parsing problems stemming from using WordPress, but I had found that erasing all of my websites, burning all professional bridges behind me and moving to the country to open a llama farm instead had solved all of them.
You're welcome, JWZ!
(the above is a bit. please don't blacklist me)
Don't tempt me. That sounds like a more and more reasonable response every day.
However... I have experience with WP from the early days, so I might see what I can do to produce a reliable fix for this. It would be in the form of a plugin, something that simply deactivates the use of non-canonical URLs of all sorts, forcing bad guesses and typos to go to a properly formatted 404 page. As a sop to the "don't make me think" crowd, it would also need to collate a list of erroneous URLs and the best guesses as to what they should be, so the user could manually authorise certain common redirects. So for example, if users tried to go to /blog/2023/01/wordpress-stupidity often enough, a click of a checkbox could set up a proper 301 redirection to /blog/2023/01/wordpress-url-stupidity.
I could do that. Not sure when, but it might be enough of a five-finger exercise that I may be inspired. I'll try to remember to post here if I succeed.
An even lower-level crap answer is proposing you run the question through ChatGPT,
which I did of course, and it gave the suggestion to add a few lines to your '.htaccess' file.
Which is a wonderful example of why stackoverflow has banned ChatGPT generated answers.
(I'm sorry, I don't have a WP install up to date enough to actually test for you, but since there seems to be no full working solution yet....)
First, those munged URLs to the Sewers blog post all include a
that matches the formatting you prefer, so you really ought not worry about this edge case beyond the "guess_404" filter you turned off. But that wasn't your question.
Also this behavior seems (and is) dumb, but the idea behind it is that WP doesn't know your preferred permastruct is date-based, and so having /blog/tag1/sewers/ and /blog/tag2/sewers/ both serve your page's content is kind of a feature. But yeah, the absence of dates or tags that are "0000" while still serving the content leaves some room for improvement.
I think you don't need to get into parse_query, and could instead override https://github.com/WordPress/wordpress-develop/blob/6.1/src/wp-includes/canonical.php#L42 with the filter already mentioned (though that mentioned code doesn't seem to do anything by my reading of it): https://github.com/WordPress/wordpress-develop/blob/6.1/src/wp-includes/canonical.php#L791
Here you could parse_url() on both URLs and do some comparisons that are stricter than what you're getting out of the box, or you could do some simple-ish regexing and parse the two URLs to your own satisfaction.
Failing that, you could nuclearly hook into template_redirect and examine the global query object for the date values that will be populated from parse_query (`$monthnum` and `$year`) vs the known ones for the queried_object_id in question and fire your own wp_redirect() if it's not an exact match (or just 404 in those cases, which seems like not what you'd actually want, but you're certainly entitled to your controversial opinions).
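A minimal sketch of that template_redirect idea, under some loud assumptions: the permalink structure is date-based and contains %year% and %monthnum%, and `example_date_matches` is a made-up helper name, not a WordPress function. It is untested against a live install. WP_Query runs the date query vars through absint(), so "0000" and "00" arrive as 0, which never equals a real post date.

```php
<?php
// Hypothetical helper (not part of WordPress): does the year/month taken
// from the URL equal the post's real date? $postDate is expected in the
// 'Y-m-d H:i:s' format used by WP_Post::$post_date.
function example_date_matches(int $year, int $monthnum, string $postDate): bool {
    if (!preg_match('/^(\d{4})-(\d{2})-/', $postDate, $m)) {
        return false;
    }
    return $year === (int) $m[1] && $monthnum === (int) $m[2];
}

// Register the hook only when running inside WordPress.
if (function_exists('add_action')) {
    add_action('template_redirect', function () {
        // Assumption: a date-based permastruct. Without %year% in it there
        // is nothing to compare, and redirecting would loop.
        if (strpos((string) get_option('permalink_structure'), '%year%') === false) {
            return;
        }
        if (!is_single() || is_preview()) {
            return;
        }
        $post = get_queried_object();
        if (!($post instanceof WP_Post)) {
            return;
        }
        $year  = (int) get_query_var('year');
        $month = (int) get_query_var('monthnum');
        if (!example_date_matches($year, $month, $post->post_date)) {
            // Send the munged URL to the one true permalink.
            wp_safe_redirect(get_permalink($post), 301);
            exit;
        }
    });
}
```

This only looks at the date parts, so it would not catch /SeWeRs/ with the right date; that needs a comparison on the slug or on the whole path.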
Well, the redirect_canonical filter doesn't fire at all on those dumb /00/ URLs, presumably because they are not being redirected at all, it thinks they're fine as-is.
But maybe the following will work? It seems to be early enough, but I'm not sure if it will cause other problems.
That preg_replace reads to me like it will clobber enough of the url as to cause your comparison to basically never pass, but I would have to log it for a while to check and you are a superhero, so if it's not entirely doing what you intend that will be an easy tweak.
Setting aside the comparison itself, your idea will mostly work, but there are a few minor issues in making sure it's the most correct way to do this....
First, note that WP doesn't actually "set_404()" in a meaningful way: https://github.com/WordPress/wordpress-develop/blob/6.1/src/wp-includes/class-wp.php#L757-L760
They don't even give you an option to filter whether a request is a 404 or not, and there is no function that "Does 404". So you could redirect to the 404 template where you are calling "set_404()" and a human would see the 404 page, but the header would still be a 200 (both because you have to explicitly `status_header(404)` yourself AND because the 200 header would have already been sent by the time your code executes).
There are a few ways to hotwire this, but the most direct would be to use the pre_handle_404 filter: https://github.com/WordPress/wordpress-develop/blob/6.1/src/wp-includes/class-wp.php#L697
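For what it's worth, here is a rough sketch of the pre_handle_404 approach, not a tested fix. `example_request_path` is a hypothetical helper, and the strict path comparison is my assumption about what "not an exact match" should mean. Returning anything other than false from this filter makes WP::handle_404() bail out early, which is what stops it from setting a 200 afterwards.

```php
<?php
// Hypothetical helper (not part of WordPress): return only the path of a
// URL or request URI, dropping any query string or fragment.
function example_request_path(string $requestUri): string {
    $path = parse_url($requestUri, PHP_URL_PATH);
    return is_string($path) ? $path : '';
}

// Register the filter only when running inside WordPress.
if (function_exists('add_filter')) {
    add_filter('pre_handle_404', function ($preempt, $wp_query) {
        // Leave alone anything already handled or not a single post/page.
        if ($preempt || !$wp_query->is_singular()) {
            return $preempt;
        }
        $permalink = get_permalink($wp_query->get_queried_object_id());
        if (!$permalink) {
            return $preempt;
        }
        $wanted = example_request_path($permalink);
        $got    = example_request_path($_SERVER['REQUEST_URI'] ?? '');
        if ($got === $wanted) {
            return $preempt;   // exact match, carry on as usual
        }
        // Case-sensitive mismatch: mark the query as a 404 and send the header.
        $wp_query->set_404();
        status_header(404);
        nocache_headers();
        return true;           // non-false short-circuits WP::handle_404()
    }, 10, 2);
}
```

Because the query string is stripped before comparing, things like "?replytocom=238788" still pass. Paged posts and other legitimate suffixes on the permalink would 404 under this sketch, so those would need their own exceptions.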
You could also patch individual holes by making sure the incoming query includes all the data you care about, like `$wp_query->get('monthnum')`, which would be less fun and less elegant, but also potentially a bit safer, dragons-wise. The $wp_query won't include the slug itself, though, so you'd still be doing some dumb things to cross off every known test you mentioned in your post, which means maybe your instinct to strictly compare canonical URLs is the closest way to achieve your goal.
Without that, it would do 404 on things like "?replytocom=238788#respond" which is less than ideal. It seems to be working so far? I haven't noticed any false positives since I installed that.
Oh, FFS. I didn't notice it was putting the wrong status code on it. This variant seems to be working, though, if I just short circuit the whole thing when the URL doesn't match. So it's not that the status code had already been set when my code ran, it's that it was overwritten later. I think?
I don't really understand the semantics of pre_handle_404 or when it fires...
...it looks like you've already solved this,
however ^this seems vaguely relevant to the solution you found... and 🤷♀️ whatever, happy new year!
I just tried this on my website, and I get the 404 page. Here's the thing, though: I don't think WordPress creates one of those by default, or at least not anything constructive. I went into the Site Editor — I'm using the Twenty Twenty Two theme which supports full-site editing — and put in some text to make it at least a bit more in line with the style of the rest of my blog. Side note: I'm okay with full-site editing for the reason that I can finally pull out and rearrange the default stuff WordPress jams in without having to go into Here Be Dragons territory.
(Btw, your 404 page rocks!)
I tolerate WordPress because I understand it well enough to knock sense into it most of the time. I looked at Hugo and a few other static site generators and nope'd the heck out of there, because I know I'd be spending more time tinkering with (and cursing at) the code than writing blog posts.
Yes, to get WordPress to use a proper 404 page you have to explicitly tell it where yours is, like:
Do you have to go that far? I thought you just needed to include a 404.php file in whatever theme you're using, and only if it doesn't find one does it fall back to the index.php file.
Could be? This is how I did it the first time around and I've never re-adjusted it. Having two different copies of my 404 page would be somewhat suboptimal.
You can blame all the SEO "experts" for not allowing 404s. Back when I was working for agencies I always died a little inside when I had to add hundreds of redirects to stop 404s.
I had thought the tendency of WP to guess at a URL was clever and useful, but I hadn't taken into account the issue of Google being confused by multiple URLs for the same page. If that rel="canonical" entry solves that problem then I'm happy. Except... I just tried a few really simple typos and just got 404s, and further examination showed that the only "correcting" WP does is (a) allowing you to use 0000 or 00 in place of year, month and day elements, and (b) allowing specifically shortened forms of a URL, so .../wordp or ../wordpress-u will both redirect to this page, but ../wpstupid will not. And that latter "feature" is disabled here now.
So it's the zero-as-wildcard that is the problem. If rel="canonical" solves the Google confusion, is the problem worth further effort to fix?
Thanks. That explains the growing number of slightly different redundant URLs delivered by search engines. Seems like the search engines could improve their results by stripping out the redundancy. Sometimes page 2 of the results and onwards are just repeats of the same 6 unique results found on page 1.
That is what rel=canonical is supposed to do.
One potential workaround for the date URL issue would be to add a check on the post template -- it has the post's date and the URL, so you could write (ugh) a date comparison function. If the dates don't match, tell WordPress to throw a 404.
Would be ugly, but it also doesn't require forking WordPress or relying on any plugins.
Your 404 page will now be in the top 50 posts for 2023.
I do not know the answer myself, but I find Ask Metafilter to be a useful resource for questions. People there might be able to answer your question.
Looks like jwz found a solution yesterday... I have two stupid questions.
1) is it viable to use a sitemap.xml from a wordpress plugin to generate Apache or nginx rules for 404s?
2) just how ugly would it be using the sitemap.xml and a crawler to turn wordpress into a static site generator?
I know #2 is non-viable for jwz.org. However, I've seen gobs of business sites on WP that should be on static generators and it seems like something for a lift and shift.