Web Scraping with your Web Browser: Evil Code

In the previous article of this series, I talked about Cloudflare and gave some code which can only run on the Kraker local proxy server. This time, we'll be using a remote proxy again so all of you undecideds who can't (yet) justify investing the time to set up a local proxy can still play along. I can't keep doing this though. A local proxy server will become more necessary as we keep diving deeper into the muck. In this article, I'm going to talk about the evil of eval and about self-executing (or self-invoking) functions. The newer terminology is "immediately invoked function expression" (IIFE), though that's a bit of a mouthful.

Seriously, read that article on Wikipedia because it is the absolute best tutorial on the subject. It even covers the problem of "automatic semicolon insertion" which I ran into while preparing the sample code for this article:

This code snippet will print "test" and "hello":
  var hello = function () { console.log ('hello'); }
  (function () { console.log ('test'); } ());

This will only print "test":
  var hello = function () { console.log ('hello') };
  (function () { console.log ('test'); } ());

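Same here, even without a semicolon:
  function hello () { console.log ('hello'); }
  (function () { console.log ('test'); } ());

The third sample works correctly because a statement that starts with the function keyword is a function declaration, not an expression. A declaration ends at its closing brace and cannot be called in place, so the second line is parsed as a separate statement. In the first sample, the function is an expression on the right-hand side of an assignment. No semicolon follows it and automatic semicolon insertion does not supply one because the next line begins with an opening parenthesis, so the two lines are parsed as a single call: the self-invoking function runs first and prints "test", then hello is called with its return value as the argument and prints "hello". You can see that I moved the semicolon in the second sample to fix this. I developed a brain clot while puzzling this out though it should have been obvious what was going on. In my defense, my original code was somewhat more verbose. This serves as a really good example of how a misplaced or missing semicolon or comma can totally hose what appears to be perfectly good code. When we get into heavily obfuscated Javascript, this will be important to remember because self-invoking functions and commas (in place of semicolons) are often used just to confuse. Messing around with that kind of code can lead to total madness.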
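To make the parsing visible, here is a small sketch that records the output of the first two samples instead of printing it. The log array and the say helper are scaffolding I made up for this sketch; they are not part of the original snippets.

```javascript
// Collect output in an array instead of printing, so the order is easy to inspect.
// "log" and "say" are scaffolding for this sketch only.
var log = [];
var say = function (s) { log.push (s); };

// Sample 1: no semicolon after the closing brace, so the next line continues the statement.
var hello1 = function () { say ('hello'); }
(function () { say ('test'); } ());

// Sample 2: the semicolon ends the assignment and the second line stands alone.
var hello2 = function () { say ('hello') };
(function () { say ('test'); } ());

console.log (log);            // the calls that actually ran, in order
console.log (typeof hello1);  // hello1 holds the return value of the call, not the function
console.log (typeof hello2);  // hello2 holds the function itself
```

Note the side effect in the first sample: besides the unwanted call, hello1 ends up holding whatever the function returned (nothing, in this case) rather than the function.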

The supposed evil of eval has been widely discussed and, with modern Javascript, there is indeed no good reason for using it. What nobody mentions is that it is still being used to obfuscate both HTML and Javascript. Yes, it can be used to inject malware but it is mostly used to hide functionality from ad blockers and other prying eyes. The usage that I am specifically discussing here involves a twenty-year-old algorithm called "Packer", developed by Dean Edwards (site is currently down and might be for a while). There is a pretty good description here in case you are interested. The usage is easy to recognize by the signature string "p,a,c,k,e,d" or, rarely, "p,a,c,k,e,r".

The original intent of the Packer algorithm was to compress HTML and/or Javascript in order to improve the loading time. Nowadays, web servers routinely compress content and usage of Packer can only hurt the compression ratio. It is still useful as a method of obfuscation though I have actually only seen it on video sites (since I rarely bother with web scraping anything else). The key thing about Packer is that it involves the dreaded eval to run the unpacking algorithm. Here's a simplified example of what it might look like:

eval (function (p,a,c,k,e,d) {
  while (c--)
    if (k[c]) p = p.replace (new RegExp ('\\b' + c.toString (a) + '\\b', 'g'), k[c]);
  return p;
}
('...', 36, 161, '...|...|...'.split ('|')));

The code runs as a self-invoking function with four arguments (some versions may have six arguments). The first argument is always the string to be unpacked. The second is a number; in the example above it is the radix passed to toString, though its purpose may vary between versions. The third is the length of the array created by the fourth argument, which is a collection of strings separated by vertical bars. The purpose of the packed script may be readily apparent by simply looking at the content of the fourth argument because you will see familiar references. Buried in there may be something you are looking for but you may need to run the code.
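A toy payload makes the roles of the arguments easier to see. This is only a sketch with made-up data: the packed string, the word list and the name unpack are mine, and a real page will carry a far longer payload. The function body is the simplified unpacker shown above, called directly without eval.

```javascript
// The simplified unpacker from above, given a name so it can be called directly.
// e and d are unused here; they are kept only to match the p,a,c,k,e,d signature.
function unpack (p, a, c, k, e, d) {
  while (c--)
    if (k[c]) p = p.replace (new RegExp ('\\b' + c.toString (a) + '\\b', 'g'), k[c]);
  return p;
}

// Made-up payload: each token in the packed string is an index into the word list,
// written in base 36 (so index 10 would appear as "a", 35 as "z", 36 as "10").
var packed = '0 1={2:"3"}';
var words  = 'var|cfg|file|video.mp4'.split ('|');

var unpacked = unpack (packed, 36, words.length, words);
console.log (unpacked);
```

The loop walks the word list from the last entry down to the first and swaps every whole-word occurrence of the index for the word itself, which is why the word list alone often tells you what the script is about.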

The eval function takes a string as its only argument. In the example, the string is generated by a function. I point this out so that it is perfectly clear that eval is NOT evaluating the function but the result that it returns. This distinction is important because we will do the opposite: pass the text of the function call to eval so that eval runs the unpacker and hands the unpacked string back to us, unexecuted. We are not interested in executing the unpacked content because, first of all, it will most likely crash and, secondly, it may do things like mess with the DOM (before it crashes). The data we are looking for is somewhere inside the unpacked content and we (generally) don't care about the intended functionality (such as initializing a video player).
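The distinction can be checked with a small sketch. The source string below is a made-up stand-in for what gets pulled out of a page, and the marker property is hypothetical; it exists only to show when the unpacked code actually runs.

```javascript
// Made-up packed script as it might appear in a page, minus the leading "eval".
// Once unpacked it reads: globalThis.marker="ran"
var source =
  "(function(p,a,c,k,e,d){while(c--)if(k[c])p=p.replace(new RegExp('\\\\b'+c.toString(a)+'\\\\b','g'),k[c]);return p}" +
  "('0.1=\"2\"',36,3,'globalThis|marker|ran'.split('|')))";

// eval parses the text, calls the function and returns its return value: a string.
var unpacked = eval (source);
console.log (typeof unpacked);
console.log (unpacked);

// Nothing inside the unpacked text has been executed at this point.
var before = typeof globalThis.marker;
console.log (before);

// Passing the unpacked string to eval a second time is what the original page would do.
eval (unpacked);
console.log (globalThis.marker);
```

The scraper stops after the first eval, which is all we need to get at the text.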

For this experiment, we will use the same scraping template as in the previous articles. We will scrape this video upload site which has a few demo videos that we can play with. Modify the scraper code to look like this:

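var scrape = async () =>
{
  var m, n, p, q, resp, data; print ("");

  // fetch the page through the proxy (proxy, line, print and pullstring come from the scraping template)
  resp = await fetch (proxy + line.value); data = await resp.text();

  // extract the unpacker call, from its opening parenthesis up to the closing script tag
  m = "(function(p,a,c,k,e,d)"; p = m + pullstring (data, m, "</scrip");

  // run the unpacker only: eval returns the unpacked text without executing it
  data = eval (p); q = pullstring (data, 'file:"', '"');
  print (p); print (data); print ("\n" + q + "\n");
}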
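If you want to try the extraction step offline, here is a sketch that runs without the scraping template. The pullstring below is my guess at a minimal stand-in (return the text between two markers) and is not necessarily how the template implements it; the page string is made up as well.

```javascript
// Hypothetical stand-in for the template's pullstring: returns the text found
// between the first occurrence of "start" and the next occurrence of "end".
function pullstring (text, start, end) {
  var i = text.indexOf (start);
  if (i < 0) return "";
  i += start.length;
  var j = text.indexOf (end, i);
  return j < 0 ? "" : text.substring (i, j);
}

// Made-up page fragment containing a tiny packed script.
var page =
  "<script>eval(function(p,a,c,k,e,d){while(c--)if(k[c])p=p.replace(new RegExp('\\\\b'+c.toString(a)+'\\\\b','g'),k[c]);return p}" +
  "('0({1:\"2\"});',36,3,'setup|file|video.mp4'.split('|')))<\/script>";

// Same extraction as in the scraper: the marker's opening parenthesis takes the place of eval's.
var m = "(function(p,a,c,k,e,d)";
var p = m + pullstring (page, m, "</scrip");

// Two equivalent ways of capturing the unpacker's return value.
var data1 = eval (p);
var data2; eval ("data2=" + p);

var link = pullstring (data1, 'file:"', '"');
console.log (data1);
console.log (link);
```

Dropping the word eval while keeping its opening parenthesis is what leaves the parentheses in p balanced.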

The important piece of code is data = eval (p). It could also be written as eval ("data=" + p), which does exactly the same thing: it assigns the return value of the function to a variable. The second form works because eval is not sandboxed, so any code executing inside of it has access to variables in the outer scope. Copy a link from the example site to the scraper app and run it. The app will show the unpacker script extracted from the page, the unpacked content and a playable video link. Now we're going to do something even more evil. Modify the code to include an extra two lines:

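var scrape = async () =>
{
  var m, n, p, q, resp, data; print ("");

  // fetch the page through the proxy (proxy, line, print and pullstring come from the scraping template)
  resp = await fetch (proxy + line.value); data = await resp.text();

  // extract the unpacker call, from its opening parenthesis up to the closing script tag
  m = "(function(p,a,c,k,e,d)"; p = m + pullstring (data, m, "</scrip");

  // run the unpacker only: eval returns the unpacked text without executing it
  data = eval (p); q = pullstring (data, 'file:"', '"');
  print (p); print (data); print ("\n" + q + "\n");

  // pull the object literal out of the setup(...) call and evaluate it into a real object
  m = pullstring (data, "setup(", ");"); eval ("data=" + m); // data = eval ("(" + m + ")");
  print (JSON.stringify (data, null, 2)); print ("\n" + data.sources [0].file);
}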

The video link is part of an object structure so, for this experiment, we will extract the whole object, which means correctly identifying where it begins and ends. It is fairly easy in this case but, for larger constructs, it can be helpful to beautify the content. The process of beautifying (also called prettifying) simply adds whitespace to make the content more readable. The sample code will extract the object literal including its starting and ending braces. If you're thinking that we could use JSON.parse then you've got it wrong. It won't work because the content is not proper JSON syntax (the property names are not quoted, for one thing). The content is meant to be executed so we have to eval it. There are two ways to do it: the method shown in the sample code or the other method which is commented out. In the first, the assignment puts the opening brace in a position where only an expression can appear. In the second, the parentheses are needed because eval would otherwise treat the opening brace as the beginning of a statement block rather than an object literal.
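Here is a sketch comparing the options on a made-up object literal. The literal is only modeled on the kind of thing found inside a player's setup call; its property names and values are assumptions for illustration.

```javascript
// Made-up object literal in the style found inside unpacked player setup code:
// unquoted property names make it valid Javascript but invalid JSON.
var m = '{sources:[{file:"video.mp4",label:"720p"}],autostart:false}';

// JSON.parse rejects it.
var jsonError = null;
try { JSON.parse (m); } catch (err) { jsonError = err.name; }
console.log (jsonError);

// Method 1: the assignment puts the brace in expression position.
var data; eval ("data=" + m);
console.log (data.sources [0].file);

// Method 2: the parentheses force the brace to be read as an object literal.
var data2 = eval ("(" + m + ")");
console.log (data2.sources [0].label);

// With neither, the brace opens a statement block and this literal fails to parse.
var bareError = null;
try { eval (m); } catch (err) { bareError = err.name; }
console.log (bareError);
```

Both methods give the same object, so which one you use is a matter of taste.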

With the object structure extracted, you can access the data using object references instead of parsing the text. This can be useful if you need to iterate through arrays or access embedded structures. Lastly, I want to reiterate that doing this with arbitrary sources can be hazardous since eval is evil. See this article for advice on safer options. In practice, there should be no issue if you know what you're dealing with.
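One commonly suggested alternative to a direct eval is the Function constructor, whose code runs in the global scope and so cannot see local variables. The sketch below is my own illustration, not something taken from the article mentioned above, and the names viaEval, viaFunction, secret and probe are made up. The code still gets executed either way; the only difference shown here is what it can reach.

```javascript
// Made-up object literal, in the same style as the player setup data.
var m = '{sources:[{file:"video.mp4"}]}';

function viaEval (text) {
  var secret = "local";                 // visible to code run by direct eval
  return eval ("(" + text + ")");
}

function viaFunction (text) {
  var secret = "local";                 // not visible to code built by Function
  return Function ('"use strict"; return (' + text + ');') ();
}

console.log (viaFunction (m).sources [0].file);

// A hostile "literal" can read local variables through eval but not through Function.
var probe = '{leak: typeof secret}';
console.log (viaEval (probe).leak);
console.log (viaFunction (probe).leak);
```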

Part 4: Web Scraping with your Web Browser: Bad Dog

My website - My GitHub repository October 8, 2024