Update - January 4, 2025
It's been a while since the last update. This website has been seeing very little traffic since the flurry of activity
generated by the posting of my first web scraping article on Hacker News. That one damn article is still getting more
views than the website itself or any of the follow-up articles. Where are all the smart developers and engineers? They can't
see the value in what I am offering? Is there some sort of systemic resistance to anything new? Maybe. There is a lot of
sunk cost in the traditional infrastructure and tools, such as Puppeteer and proxy servers like Fiddler (commercial product)
and mitmproxy (open source). As long as such tools continue to serve
their particular purpose (even if only partially), there is little interest in trying something new. Time will tell.
The cracks will only become more evident as websites get more aggressive about protecting their intellectual
property. YouTube is the one to watch right now. They are trying to lock down the site so that there is no alternative to
watching videos on YouTube's own platform. The Invidious
platform has already been destroyed due to the banning of data center IP addresses. My own remote proxy server, hosted by
Vercel and Amazon, is locked out. I patched it to route the requests through secondary proxies which are not banned.
That no longer works because the video link is now tied to the IP address used to retrieve the metadata. The only way around
that is to stream the video through the proxy but that is not possible with a free account (surely you don't think that
I'm spending money on this).
This means that my YouTube player demos (yt-player.htm and yt-extract.htm) must now be considered
"abandonware" since they rely on my remote proxy. YouTube is completely unusable without my local proxy server. A long time
ago, when I first started working on Alleycat Player (which I originally called YouTube Player), I used Invidious for the
video links. I eventually crafted my own code to access YouTube directly using the local proxy. That method still works
despite YouTube's multiple attempts to lock everyone out. It will, mark my words, get harder and harder. Hopefully, I will
be able to stay on top of it.
Some of you might be wondering what the heck is this Alleycat Player that I keep mentioning. It has been on vacation for
over two years. It's time to bring it back. The new version will be modular as opposed to everything packed into a single
file. This is necessary because things change, sometimes frequently. The modular format will make it much easier to patch
the player whenever sites change their format. With support for over 50 video sites, as well as Internet TV, it is really
the only tool like it. Stay tuned.
Alleycat Player also supports a number of pirate websites for those who want the latest movies and TV shows. Ironically, it
is the pirates who are the most aggressive about locking down their (stolen) intellectual property. I stay away from the
worst ones but it is getting tough. I'm seeing WASM (WebAssembly) being used by a few sources to encrypt the video links.
Though these sources appear to be under the control of a single party, it is just a matter of time till we see more of this.
As of now, cracking WASM is impractical but the tools will eventually become available. This is an arms race, after all.
Strictly speaking, cracking WASM is not really necessary since there are ways to run the code in a controlled environment
in order to extract the video link or, possibly, the decryption key.
I am also seeing more sites hiding behind Cloudflare's bot challenge. This can be bypassed though it adds an annoying extra
step to the process. Fortunately, this is not widespread because Cloudflare has been under a lot of pressure from the
copyright warriors (I'm still talking about the pirate sites here). Any site hiding behind Cloudflare runs the risk of
exposure. Obfuscated JavaScript is still a pain in the ass to reverse engineer (I don't have a magic solution) but I've
been able to cut the time from a few days down to a matter of hours in most cases. The hardest cases (like the ones
using WASM) can be worked around using a controlled environment.
What is this "controlled environment"? It means modifying the JavaScript and running the code in an iframe under a
fake domain in order to trap the specific information of interest and transmit that info via postMessage. I
discussed this process in my second web scraping article. This works with Cloudflare and those sites with the nasty WASM.
Ideally, this process would be transparent but that's not always possible, depending on what browser restrictions need to be
bypassed. As I mentioned somewhere else, I'm fighting the web browser as much as I am fighting the website.
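To make the idea concrete, here is a rough sketch of the trap-and-report pattern. It is not the code from Kraker or Alleycat Player. The names buildTrapSource, extractTrappedUrl, watchFrame and the "trapped-url" message type are made up for illustration, and it assumes the info of interest is a URL that the site's script passes to fetch(). A real site may hide the link somewhere else entirely, in which case the hook goes on a different function.

```javascript
// Sketch only: every name here is hypothetical, not taken from Kraker.

// Prepend a hook to the site's script so that every URL it passes to
// fetch() is also reported to the embedding page via postMessage.
function buildTrapSource(siteScript) {
  const hook = `
(function () {
  var realFetch = window.fetch;
  window.fetch = function (url) {
    // Report the URL to the parent page, then let the request proceed.
    window.parent.postMessage({ type: "trapped-url", url: String(url) }, "*");
    return realFetch.apply(window, arguments);
  };
})();
`;
  return hook + siteScript;
}

// Parent-side filter: accept only messages that come from the iframe we
// created and that carry the message type sent by the hook.
function extractTrappedUrl(event, frameWindow) {
  if (event.source !== frameWindow) return null;
  const data = event.data;
  if (!data || data.type !== "trapped-url") return null;
  return data.url;
}

// Browser-only wiring: listen for the report from a given iframe element.
// Assumption: the iframe already loads the modified script (for example,
// served through a local proxy under the fake domain).
function watchFrame(frame, onUrl) {
  window.addEventListener("message", function (event) {
    const url = extractTrappedUrl(event, frame.contentWindow);
    if (url !== null) onUrl(url);
  });
}
```

The check on event.source matters because any window can post a message to the page. Without it, the parent would accept a "trapped-url" from a frame it never created.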
And, puhleeze, try out my Kraker Local Proxy Server. It will be worth the time and effort. You need it, though
you may not realize it yet.
My (awesome) tool for reverse engineering websites:
Kraker Mockery installation and instruction manual