The problem I'm trying to solve: I have scripts that do things like probe all of the YouTube videos I've blogged about to see if they've gone stale, as they so often do.
But recently YouTube has begun putting me on "429 Too Many Requests" probation, no matter how many delays I introduce into my scripts. I can't tell what their limits are and they don't say.
So I just want a simple $HTTP_PROXY that I can use from the command line without having to do ifconfig nonsense or some crazy-assed authentication dance. It does not have to be fast.
Hard to find, I'd say. If you have ssh access to a number of machines you could use ssh tunnelling to forward the requests to different networks/IPs...
Responded on twitter, will expand here.
Free for the taking? Unlikely. Spammers and all.
But you could fairly easily spin up a free tier AWS/Azure/Whatever instance, configure squid (and whatever the http header authn) in a few dozen lines of Ansible.
And when it starts getting 429 errors, kill it, spin up a new one, and leave some other shmo to pick up your banned IP :)
I may be willing to set one up for you.
It will not be fast, but it will be fairly reliable and (of course) free.
It will be behind a VPN (on my end) to make whack-a-mole harder for YouTube, since that makes hopping addresses trivial for me.
Send me an email if that is useful to you.
ssh can act as a SOCKS proxy that can be used with the HTTP_PROXY envar.
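A minimal sketch of that idea in Python, assuming an OpenSSH client: `ssh -D` opens a local SOCKS listener and the script points the proxy variables at it. The host, port 1080, and both helper names are placeholders of mine, not anything from the thread. Whether a given client accepts a `socks5h://` URL in these variables varies (curl does), so check yours.

```python
import os


def ssh_tunnel_command(host, port=1080):
    """Build the ssh invocation that opens a local SOCKS listener.

    -D PORT asks ssh for a dynamic (SOCKS) forward on localhost:PORT;
    -N tells it not to run a remote command.
    """
    return ["ssh", "-N", "-D", str(port), host]


def socks_proxy_env(port=1080):
    """Return a copy of the environment with proxy variables set.

    Both upper- and lower-case names are set because clients differ
    in which spelling they read.
    """
    env = dict(os.environ)
    url = f"socks5h://127.0.0.1:{port}"
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY"):
        env[name] = url
        env[name.lower()] = url
    return env
```

You would start the tunnel with something like `subprocess.Popen(ssh_tunnel_command("user@example.org"))` and then pass `env=socks_proxy_env()` to whatever subprocess runs the actual check.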
Cool story bro
Everybody I've ever worked with who does this pays for it. Is it possible there's a free option? I guess, but I have no idea what the rationale would look like for such a thing. Maybe if you're lucky somebody has decided this needs "disrupting" and is throwing VC dollars at giving it away?
There are lists on the internet of open proxies, but they probably won't be reliable for long. I accidentally left a corporate proxy open to the Internet, and within 24 hours it was overrun by Chinese people avoiding the great firewall and people trying to get around forum bans to troll. If you want something you don't have to tweak constantly it'll be difficult.
Indeed, this is an option if necessary but one needs to expect failures commonly.
As long as the errors are being caught properly, there are a few different forks of a tool called proxychains around which can take a list of proxies and connect through one or more (in a chain) either in a designated order or at random (which is what one wants for this application).
Eventually the list is going to go stale, but there are a few scripts floating around for refreshing a proxychains config from various free proxy list sites.
Also, I'm sure all these proxies are either misconfigured or hacked so it's not impossible that one could receive an abuse report or two over using one.
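The pick-at-random-and-expect-failures part can be sketched without proxychains itself. This is only an illustration in Python: `fetch` is a hypothetical callable you supply (it should raise `OSError` when a proxy is dead), and the proxy strings are whatever your list contains.

```python
import random


def fetch_via_random_proxy(url, proxies, fetch, max_tries=5, rng=None):
    """Try up to max_tries proxies in random order; return the first success.

    fetch(url, proxy) is supplied by the caller and is expected to raise
    OSError on connection failure. Returns None if every attempt fails.
    """
    rng = rng or random.Random()
    candidates = list(proxies)
    rng.shuffle(candidates)
    for proxy in candidates[:max_tries]:
        try:
            return fetch(url, proxy)
        except OSError:
            # Dead or misbehaving proxy: move on to the next candidate.
            continue
    return None
```

Returning `None` rather than raising makes it easy for the caller to decide the list has gone stale and needs refreshing.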
Install torsocks, then type
Addendum: this is obviously not reliable. If "reliable" was implied along with "free", the answer is "no way, get fucked, fuck off".
This is a reasonable guide to how you can get around adversarial websites, but it's a constant struggle.
For your needs, where it's probably OK to take a week or a month to go search for dead links, I would write a script that still used torsocks as a source of free proxies (Google sends a lot of tor exit IPs straight to captcha, but not all of them), tried a new circuit on failure, and tested as many videos as it could get away with on success. And if everything is failing, give up and try again tomorrow.
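Roughly, that loop might look like the following Python sketch. Everything Tor-specific is left out on purpose: `check` and `new_circuit` are hypothetical callables you would provide, `check` is assumed to return the string "blocked" when the exit gets a CAPTCHA or 429, and the give-up threshold of 5 is arbitrary.

```python
def check_videos(video_ids, check, new_circuit, max_consecutive_blocks=5):
    """Check videos until done or until too many circuits in a row are blocked.

    check(video_id) returns a status string; "blocked" means this circuit
    is no good. new_circuit() switches to a fresh circuit.
    Returns (results, pending) so the caller can retry pending tomorrow.
    """
    results = {}
    pending = list(video_ids)
    blocks = 0
    while pending and blocks < max_consecutive_blocks:
        status = check(pending[0])
        if status == "blocked":
            # This exit is burned: count it and try another circuit.
            blocks += 1
            new_circuit()
        else:
            # Success: reset the counter and keep using this circuit.
            blocks = 0
            results[pending.pop(0)] = status
    return results, pending
```

Whatever is left in `pending` is the "try again tomorrow" list.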
Well, my machine has 2 extra IP addresses (needed to run the two primordial MCOM web sites) so I guess I'll see if using tinyproxy and cutting my load by 1/3rd helps. If not, looks like extra AWS IPs are about $3.40/month, so I could add on a couple more...
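Spreading the checks across the three addresses could be as simple as a round-robin assignment. A sketch in Python, assuming one proxy listener per address; the proxy URLs in the usage note are made-up placeholders.

```python
from itertools import cycle


def assign_round_robin(urls, egress_proxies):
    """Pair each URL with the next proxy in rotation.

    With three proxies, each one ends up carrying about a third
    of the requests.
    """
    if not egress_proxies:
        raise ValueError("need at least one egress proxy")
    rotation = cycle(egress_proxies)
    return [(url, next(rotation)) for url in urls]
```

For example, `assign_round_robin(video_urls, ["http://192.0.2.1:8888", "http://192.0.2.2:8888", "http://192.0.2.3:8888"])` gives a list of `(url, proxy)` pairs for the checking script to work through.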
Sonic, my ISP, bundles a VPN with my service, but they don't seem to offer an http proxy.
I second torsocks.
But in any case if you want to try using ssh -D I can give you a user on (small) VPSes in the US and Argentina.
So Tor gives you a SOCKS proxy, which you can turn into an HTTP proxy by putting Polipo (currently unmaintained) on top. It's free, doesn't require ifconfig nonsense...but Google/YouTube's behind CAPTCHA for Tor, which I'm guessing counts as a kind of "crazy-assed authentication dance". (I'm envisioning horrible hack jobs involving using Selenium to control a web browser so you the human can find all the buses before the automated checking bits start.)
I2P also has the concept of "outproxies". Not sure how they work (Something like http://[proxy.i2p]/https://www.youtube.com/watch?v=dQw4w9WgXcQ, maybe?)...or if Google puts CAPTCHA on that too. But that's free as well.
Of course, these tools will make you the eternal friend of the Norwegian Shipping Authority. You may or may not care about that.
There's https://free-proxy-list.net/ available, where also https proxies are offered. I tested one, but that didn't work due to Firefox preventing MITM certificates. But 188.8.131.52 from the list worked for me for HTTPS also. So that's worth a try I would say.
I have been using http://spys.one/en/ for a very long time when I don't have any VPS running for other projects. The IPs stay reliable for a few days at most, but they can do the job if you can write a script to automatically change the proxy after it's gone dark.
I've had consistent luck with a script that steps between these sources (in this order) enough times until it hits or is forced to give up:
Some years ago I misconfigured a proxy and woke up the next morning to 80% of the host's bandwidth being used by random HTTP requests. A quick review of the logs showed that someone(s) had been probing twice an hour since we took possession of the IP, and as soon as they got a non-error response to a proxy request, they put my IP on a list somewhere, and the whole world started requesting pages.
But...not related pages. Nobody ever requested a page and the images on that page. A lot of requests were just a random image from a page. Related requests were getting divided up somewhere upstream and I was only getting part of the traffic. That implied the existence of a service provider, and that gave me a goal: make that entity blacklist my IP.
I set up a custom HTTP proxy that saved each incoming request in a request pool. If there were fewer than 1000 saved requests, the proxy would behave normally: send the incoming request to the upstream server, and send the reply back to the client. If there were more than 1000 saved requests in the pool, then the proxy would pick a request from the pool at random, send it to the upstream server instead of the client's request, and the client would get whatever came back.
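The decision logic described above, reduced to a Python sketch. This is a reconstruction from the description, not the original proxy: the class and method names are invented, the networking is omitted entirely, and what happens at exactly 1000 saved requests is my assumption (treated here as still normal).

```python
import random


class RequestPool:
    """Save every incoming request; past a threshold, swap requests at random."""

    def __init__(self, threshold=1000, rng=None):
        self.threshold = threshold
        self.saved = []
        self.rng = rng or random.Random()

    def choose_upstream_request(self, incoming):
        """Return the request that should actually be sent upstream.

        Below or at the threshold the proxy behaves normally. Above it,
        a random saved request is sent instead, and the client gets
        whatever comes back for that one.
        """
        self.saved.append(incoming)
        if len(self.saved) <= self.threshold:
            return incoming
        return self.rng.choice(self.saved)
```

Only requests are stored, never replies, which matches the reasoning given further down about not replaying private data.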
It took about 3 weeks, but requests eventually stopped. Every now and then they'd start up again, but they'd either never get up to full speed, or they'd run at full speed for a few days and then abruptly stop. My guess is the former are services that just never got popular (probably because they had shitty QA if they kept my host in their server pool) and the latter had QA or a robot that would blacklist my server IP. I tried whitelisting the probes, but some of them used random IPs and URLs. I would empty the request pool and trap another proxy provider, but eventually they all stopped coming back and I ran out of dance partners.
Over the years the requests got nastier--not as much "I'm just trying to do normal business online but I have the bad luck to live in China" and more "I want to join a fight with a white supremacist site (on either side)" or "I want to try a thousand passwords on some privately hosted WordPress blog." I kept waiting for an abuse report but none ever arrived. Eventually the project came to an end when the web host's disk died and took the proxy with it.
I kept the requests but not the replies to save disk space (requests are usually much smaller) and for laziness reasons. If I replay the request and it contains an account credential, the account's owner can change the password to protect their account; if I replay the reply and it's private data, the proxy would randomly spit that data out to strangers indefinitely, and there would be no way for the password owner to stop such dissemination unless they resorted to bothering me. I didn't keep backups because the request pool contained a batshit number of passwords. It was pre-Snowden, before Let's Encrypt, Facebook was cleartext, nobody but banks encrypted anything.
Looking past "no VPN" to on-demand tunnels (not necessarily encrypted) will get you a few seconds of somebody else's IP address in somebody else's rack of servers: Algo infra-as-code tunnels (at GitHub).
Plenty of places you would want to proxy to will refuse any request from even small hosting providers, and diligently keep track of their IP ranges. Until China made the mistake of trying to block VPNs by source IP in 2018, not even they bothered to rotate around a pool of IPs subleased from innocuous large ISPs.
While it's not free, I use ScrapeBox to grab giant lists of public proxies. You can also turn any cheap Linux/BSD VPS into a SOCKS proxy.
If you are cheap, you can use Tor as previously mentioned.