Web Scraping with your Web Browser: Bad Dog

In the previous article of this series, I explored the problem of eval and an old algorithm called Packer. For this article, I originally intended to begin exploring obfuscated JavaScript but something else came up which I want to talk about. There is a problem with fetch which I'd been meaning to fix for the past two years. In a nutshell, fetch operates like a bitch in heat, if you'll forgive my poor attempt at a joke.

The scenario is this: you submit a link to your app with the intent of scraping the HTML but, for whatever reason, you mistakenly provide a link to an mp4 file. Big oops. What happens is that the browser will happily download the entire file into memory before passing control back to your app. If you open Task Manager or whatever, you can watch your system memory being gobbled up bit by bit. The only thing you can do is close the app or reload it. This is a poor way of dealing with the issue so what we need is a way to stop fetch from downloading anything larger than a certain size.

You might think that the problem could be resolved by simply checking the content type before accepting the file but that won't work either. Once the download has been initiated, ignoring the response does nothing to stop it! For example, the following code naively attempts to abort the download:

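response = await fetch (url);
// the header often carries a charset suffix, so compare only the start
var type = response.headers.get ("content-type") || "";
if (!type.startsWith ("text/html")) return "";
return await response.text();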

The browser assumes that your app will, at some point, accept the download so it goes ahead and consumes the whole thing. This will waste memory until your app is closed or the garbage collector kicks in (which might take a while). It is time to start treating fetch as the undisciplined dog that it is (another crummy joke and I might have more). This is unfortunate since fetch was originally intended as a simplified alternative to XMLHttpRequest, which has the same issue but at least provides a method of dealing with it (abort). With fetch, we will have to override its default behaviour by digging up the internals and substituting our own code. This is not very simple and there are a few gotchas.

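// busy: set by the app while scraping; dogfetch: controller for the current fetch
var busy = 0, dogfetch = {abort: () => {}}, oldfetch = fetch; fetch = newfetch;

async function newfetch (url, arg)
{
  // refuse to start unless the app has set the busy flag
  if (!busy) throw ("Oops!"); var e = dogfetch = new AbortController();
  if (!arg) arg = {}; arg.signal = e.signal; var f = await oldfetch (url, arg);
  // no body means nothing to monitor, so return the original response
  var d = 0, r = f.body; try { r = r.getReader() } catch { return (f) }

  // read the next chunk and hand it to b
  function a (c)
  {
    return r.read().then ((x) => b (c, x));
  }

  // count the bytes, abort past the limit, pass the chunk on and keep reading
  function b (c, x)
  {
    if (x.done) return c.close(); d += x.value.length;
    if (d > 3500000) e.abort(); c.enqueue (x.value); return a (c);
  }

  // wrap the monitored stream in a new response
  var s = new ReadableStream ({ start (c) { return a (c) }});
  s = new Response (s, { statusText: f.statusText, headers: f.headers });

  // status and url are read-only, so force them to match the original
  Object.defineProperty (s, "status", { value: f.status });
  Object.defineProperty (s, "url", { value: f.url }); return (s);
}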

You might be relieved to notice that the code does not contain the typical inscrutable nesting of then statements (except for that one then which is simpler to keep than to get rid of). I did my usual search of the Internet for code samples and could not find anything close to readable so I painstakingly evolved my own code. So what are the gotchas? For one thing, the Response constructor only allows setting the values for "status", "statusText" and "headers". Secondly, it throws a RangeError if the "status" value is not in the range of 200 to 599. Fetch itself places no such restriction on the responses it returns so I had to force the value using defineProperty. Some of my scraper code relies on "url" in case of redirection so I have to force that as well.
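Both gotchas are easy to confirm for yourself. The sketch below is illustrative only: the helper name rebuild and the example URL are made up for the demonstration, and the status value 0 was picked simply because it falls outside the permitted range.

```javascript
// Hypothetical helper (not part of the scraper): build a Response and then
// force the read-only "status" and "url" fields to the values we want.
function rebuild (body, status, url)
{
  var copy = new Response (body);   // the constructor itself only ever sees status 200
  Object.defineProperty (copy, "status", { value: status });
  Object.defineProperty (copy, "url", { value: url });
  return copy;
}

// Passing an out-of-range status straight to the constructor throws.
var threw = false;
try { new Response ("", { status: 0 }) }
catch (err) { threw = err instanceof RangeError }

var plain = new Response ("hello");   // a constructed Response has an empty url
var forced = rebuild ("hello", 0, "https://example.com/final");
console.log (threw, plain.url === "", forced.status, forced.url);
```

One caveat with this trick: the "ok" property is computed from the status that the constructor saw, so it is not affected by the override. If your code checks "ok", it may need the same treatment.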

I'm not explaining the details of how the code works or why it is structured the way it is (I'm not sure either). This should have been simpler but that's on the designers of fetch. I could go on a rant about "design by committee", as I am often tempted to do, but I won't. One thing that is surprisingly simple is AbortController, a generic mechanism that can be used in any case where a complex function needs to be made abortable. The variable "dogfetch" is a global which holds the current controller so that an ongoing fetch can be aborted in response to a button press or whatever. It is initialized with a dummy abort function. We also override the default fetch. You don't need to do this since you can simply call the replacement code directly.
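To show what "generic" means here, this is a minimal sketch of a non-fetch function made abortable with the same controller and signal pair. The function abortableDelay is hypothetical and is not used anywhere in the scraper; it only illustrates the pattern.

```javascript
// Hypothetical example: a timer that can be cancelled through an AbortSignal.
function abortableDelay (ms, signal)
{
  return new Promise ((resolve, reject) =>
  {
    // already aborted before we even started
    if (signal.aborted) return reject (new Error ("aborted"));

    var timer = setTimeout (resolve, ms);

    // when the controller fires, stop the timer and reject the promise
    signal.addEventListener ("abort", () =>
    {
      clearTimeout (timer); reject (new Error ("aborted"));
    }, { once: true });
  });
}

// Usage: var c = new AbortController();
//        abortableDelay (5000, c.signal).catch (() => console.log ("stopped"));
//        c.abort();
```

The caller keeps the controller and hands out only the signal, which is exactly the arrangement that the replacement fetch uses.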

The variable "busy" is meant as a catch-all in the case where you want to abort ALL subsequent operations after aborting the current one. Your app will need to set the busy flag before beginning and clear the flag when concluded or upon the request of the user. After checking the "busy" flag, the code initializes the abort controller and passes the value of "signal" to fetch. The process of aborting the operation is as simple as a call to dogfetch.abort. It is also possible to abort multiple operations at once from a single abort controller but we're not doing that here. The code calls fetch and gets a reference to the stream reader if it exists (it may not in the case of an empty body). The code then creates a new readable stream which it monitors for download size to trigger an abort if it exceeds a certain threshold (I used the value 3,500,000 bytes, roughly 3.5 MB, which is a reasonable upper limit).
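The byte-counting idea can also be sketched on its own, without wrapping the response in a new stream. This is a simplified variant and not the code used above: the helper readTextWithLimit and its parameters are hypothetical, and it decodes the body as UTF-8 text directly instead of returning a Response.

```javascript
// Hypothetical helper: read a body as text but give up once it exceeds maxBytes.
// "controller" is optional and, if given, is assumed to be the AbortController
// whose signal was passed to fetch.
async function readTextWithLimit (response, maxBytes, controller)
{
  if (!response.body) return "";              // empty body: nothing to read
  var reader = response.body.getReader();
  var decoder = new TextDecoder();
  var total = 0, text = "";

  while (true)
  {
    var chunk = await reader.read();
    if (chunk.done) break;
    total += chunk.value.byteLength;          // chunks arrive as Uint8Array

    if (total > maxBytes)
    {
      await reader.cancel();                  // stop pulling from the stream
      if (controller) controller.abort();     // and cancel the request itself
      throw new Error ("Response too large");
    }

    text += decoder.decode (chunk.value, { stream: true });
  }

  return text + decoder.decode();             // flush any buffered bytes
}
```

The trade-off is that the caller no longer gets a Response back, so existing code that calls response.text() would have to change. That is presumably why the replacement fetch takes the more roundabout route.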

We have killed two dogs with one bone here. Not only are we protected from accidentally downloading a video in place of an HTML or text file but we also have a method to just stop everything at the push of a button. However, if you're thinking of building an upload progress monitor inspired by this example, rest assured that the design committee has decreed that it not be possible. For that, you'll need to fall back to the old dog XMLHttpRequest but that's a whole different ballgame and I won't be covering that at any point. In your app, you will need this basic structure:

function buttonpress ()
{
  busy = 0; dogfetch.abort();
}

async function scrape (url)
{
  var response, data; busy = 1;

  try
  {
    response = await fetch (url); data = await response.text();
    // ... process data, do more fetches, process more data, etc.
  }
  catch { busy = 0 }

  if (!busy) cleanupthedogshit(); busy = 0;
}

You may notice a fundamental difference between the sample code in this article and the samples that I gave in my previous articles. Namely, why did I change the function declarations? There are multiple ways of declaring a function:

var x = function () {}
var x = () => {}
var x = async function () {}
var x = async () => {}
function x () {}
async function x () {}

You could also use "let" or "const" instead of "var". The difference has to do with variable name scoping and hoisting. It's a heady subject that I'm still trying to wrap my head around and it is, along with the lack of static typing, the main cause of the claim that JavaScript is an unsafe language. I grant that if you're talking about rocket science but I won't bother to argue the point. In any case, I started using the form "function x" when I began working with JS and Firefox back in 2017 but stopped using that form because it didn't work with Google Chrome for some reason.
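One visible difference between the forms is what you can get away with before the declaration is reached. Here is a small sketch; the names demo, declared and assigned are made up for the example.

```javascript
// Hypothetical demonstration of hoisting for two of the declaration forms.
function demo ()
{
  var results = [];

  results.push (typeof declared);   // the whole function is available early
  results.push (typeof assigned);   // only the name exists, with no value yet

  function declared () {}
  var assigned = function () {};

  results.push (typeof assigned);   // now the assignment has run
  return results;
}

console.log (demo());   // → [ 'function', 'undefined', 'function' ]
```

This is why the article's code can say "fetch = newfetch" on the first line even though newfetch is declared further down. With "let" or "const", touching the name before its declaration throws a ReferenceError instead of quietly yielding undefined.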

The sample code shows how the "busy" flag is used to ensure that a series of operations will be properly aborted and a cleanup operation performed if the flag is unexpectedly cleared as the result of an exception. The try/catch structure is critical as the replacement fetch code assumes that the caller will catch any potential exceptions, including the one that may be thrown as a result of the abort operation. Word of warning: never use fetch without consuming or aborting the response. The naive example code shown at the top of this article should be written this way:

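response = await fetch (url);
// the header often carries a charset suffix, so compare only the start
var type = response.headers.get ("content-type") || "";
if (!type.startsWith ("text/html")) { dogfetch.abort(); return ""; }
return await response.text();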

The ignored response will be quietly discarded and no exception will be thrown. This ensures that the browser will not be left running a pointless download in the background while wasting memory and bandwidth. Now go fetch, Fido.
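For anyone who would rather not override fetch globally, the same rule (consume the response or abort it) can be followed with a controller per request. This is a minimal sketch under that assumption. The function fetchHtml and its fetchImpl parameter are hypothetical; fetchImpl exists only so that a stand-in can be supplied for testing, and this version has no size limit.

```javascript
// Hypothetical helper: fetch a page as text, aborting anything that is not HTML.
async function fetchHtml (url, fetchImpl = fetch)
{
  var controller = new AbortController();
  var response = await fetchImpl (url, { signal: controller.signal });

  // the header often looks like "text/html; charset=utf-8"
  var type = response.headers.get ("content-type") || "";

  if (!type.startsWith ("text/html"))
  {
    controller.abort();   // don't leave the download running in the background
    return "";
  }

  return await response.text();
}
```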

October 21, 2024