PhantomJS vs Selenium for scraping and automated testing

Now that I have a bit of experience with both PhantomJS and Selenium (for Selenium, mostly in its scripting contexts), here are some things I can share. All of this assumes we use Selenium's WebDriver and script it, so the comparison with PhantomJS is fair: PhantomJS is an out-of-the-browser JavaScript runtime (like Node.js) with no graphical shell, unlike Selenium IDE, which lives inside the browser. WebDriver can be controlled from many scripting languages (and even compiled ones), including Python and JavaScript. Assume that anything said here about PhantomJS also applies to CasperJS, which is a wrapper around PhantomJS. I haven't yet used ZombieJS, which promises faster runs than PhantomJS; PhantomJS is already faster than Selenium. But Phantom/Casper runs can still feel slow, because you don't get the visual feedback of watching CSS and images load and render the way you do in a normal browser. In practice you write scripts to run as background or cron tasks while you do other things.

Selenium has been around a while, since the time browsers like Firefox first added hooks that let external programs control them. Most major browsers now include this kind of support, and it is how Selenium works: it remote-controls the browser through a controller the browser itself provides. This means that if you choose Selenium you can run tests or load pages using "real" browsers - all the main ones: Firefox, Chrome, Internet Explorer, Safari, and even Opera. PhantomJS / CasperJS are different: they don't drive a normal browser at all, but instead provide their own headless browser (headless meaning there is nothing visual for a user to see or interact with, no graphical window that loads and gets controlled), built on WebKit via QtWebKit - the same engine behind Safari and the original basis for Chrome's rendering engine (Firefox uses Gecko). This headless browser isn't exactly the same as any of those browsers beyond sharing the rendering engine, though it should lay out CSS and execute JavaScript in much the same way. And even though PhantomJS is headless, you can call a function to render the page it holds in memory to an image file on disk (obviously you can't click any links or buttons in the rendered image).
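As a minimal sketch of that (the URL and output filename are placeholders, not anything specific), a PhantomJS script that loads a page headlessly and renders it to a PNG looks like this:

    // save as screenshot.js and run with: phantomjs screenshot.js
    var page = require('webpage').create();

    // example.com and the output filename are placeholders; substitute your own
    page.open('http://example.com/', function (status) {
        if (status !== 'success') {
            console.log('failed to load the page');
            phantom.exit(1);
        }
        // render the in-memory page (with CSS applied) to an image on disk
        page.render('example.png');
        phantom.exit();
    });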

PhantomJS is a server-side JavaScript runtime, similar in concept to but not the same thing as Node.js. So you can't simply import and use npm (Node.js) modules in your Phantom code, though there is a phantom bridge module for Node which lets you call out to Phantom from Node code. Whereas Node uses Google's V8 JavaScript engine, PhantomJS uses JavaScriptCore from WebKit. Running JavaScript outside the browser seemed far-fetched until a few years ago: after Google spent years optimizing V8, the Node.js project built a standalone command-line runtime around it, and JavaScript went from being a slow browser-only language to something people use to write fast servers. WebKit's JavaScriptCore has seen similar performance improvements, and there's still a bit of a rivalry between Apple and Google over JavaScriptCore versus V8. In any case, people are now used to the idea of writing JS outside the browser. So while PhantomJS isn't the same as Node.js, both are JavaScript, and largely the same JavaScript you have been writing for browsers for years. Nuff said.
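As a rough sketch of that bridge (this assumes the promise-based API of a recent version of the phantom npm package and a phantomjs binary on the system; the URL is a placeholder), calling out to Phantom from Node looks something like this:

    // npm install phantom, then run with node
    const phantom = require('phantom');

    (async function () {
        const instance = await phantom.create();                 // spawns a phantomjs process
        const page = await instance.createPage();
        const status = await page.open('http://example.com/');   // placeholder URL
        console.log('load status:', status);
        const html = await page.property('content');             // the rendered HTML
        console.log(html.length, 'bytes of HTML');
        await instance.exit();                                   // shut the phantomjs process down
    })();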

But there's another main difference between Phantom and Selenium. PhantomJS, as its name implies, is both written in and scripted with JavaScript. Selenium, on the other hand, has "drivers" which can be controlled from a number of languages, including JavaScript - there's a Node plugin for Selenium - but we've driven Selenium from PHP (as part of an existing PHP website) as well as from Python. The thing to keep in mind with Selenium is that it needs an actual browser running for it to control, whereas everything PhantomJS does happens within its own process. Selenium can therefore run "headless", but doing so requires a browser process, and that browser probably needs a windowing system/server like X11/X.org, assuming we're running Linux/Unix on the server. If the server doesn't run X / a graphical UI (being a server and all), you can use Xvfb, an X server that draws into a virtual framebuffer (an in-memory representation of a screen) instead of a physical display; you can even connect to a running Xvfb instance remotely with something like VNC. So on Linux you start Xvfb, start a Firefox process with its display pointed at Xvfb, and then Selenium connects to Firefox. Practically, once you get it all set up, it means you'll need a fair amount of memory to run Xvfb plus each browser instance (Firefox) plus Selenium on the JVM. You'll also have to take care to tear all those processes down properly, lest you run the server out of memory (let's hope you have console access to your remote server if Linux's OOM killer takes out important processes like sshd).
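Here is a hedged sketch of that setup from Node, assuming the selenium-webdriver npm package, Firefox, geckodriver, and Xvfb are all installed, and that display number :99 is free (the URL and selector are placeholders):

    // npm install selenium-webdriver; Xvfb, Firefox and geckodriver must be on the PATH
    const { spawn } = require('child_process');
    const { Builder, By } = require('selenium-webdriver');

    (async function () {
        // start a virtual X server on display :99 (an arbitrary choice of display number)
        const xvfb = spawn('Xvfb', [':99', '-screen', '0', '1280x1024x24']);
        process.env.DISPLAY = ':99';               // Firefox will render into the virtual framebuffer
        await new Promise(r => setTimeout(r, 500)); // crude wait for Xvfb; xvfb-run handles this better

        const driver = await new Builder().forBrowser('firefox').build();
        try {
            await driver.get('http://example.com/');                    // placeholder URL
            const heading = await driver.findElement(By.css('h1')).getText();
            console.log('page heading:', heading);
        } finally {
            // tear everything down so orphaned browsers don't eat the server's memory
            await driver.quit();
            xvfb.kill();
        }
    })();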

If you need to test from real browsers (maybe your scraping target is smart enough to detect a PhantomJS client, although I doubt it), perhaps to check how pages render or how your site's JavaScript performs across different browsers, then Selenium is the way to go. Another advantage of Selenium is the Selenium IDE, where you can explicitly encode browsing steps (like clicks or form fills) in your browser, or even record the steps you take, and then export those steps as code to use from your Python script or whatever language you prefer. However, packaging and reusing Selenium IDE test cases is quite limited, and I recommend going straight to WebDriver instead.
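The kind of script that comes out of that export step, sketched here with the selenium-webdriver npm package rather than Python (the URL, field names, and selectors are made up for illustration), is just a sequence of explicit browsing steps:

    const { Builder, By, until } = require('selenium-webdriver');

    (async function () {
        const driver = await new Builder().forBrowser('firefox').build();
        try {
            // the URL, field names, and button selector below are hypothetical
            await driver.get('http://example.com/login');
            await driver.findElement(By.name('username')).sendKeys('testuser');
            await driver.findElement(By.name('password')).sendKeys('secret');
            await driver.findElement(By.css('button[type="submit"]')).click();
            // wait for the next page instead of sleeping a fixed amount of time
            await driver.wait(until.titleContains('Dashboard'), 5000);
        } finally {
            await driver.quit();
        }
    })();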

If you want something more lightweight and faster-running than Selenium, then PhantomJS (and possibly ZombieJS) is probably the better choice, assuming you can't get away with curl because you need JavaScript execution. Much of the web today relies on JavaScript for interaction or even for page rendering (see Single Page Apps). Sites built with SPA frameworks like AngularJS often lean on PhantomJS to pre-render their pages into regular HTML, so that search engines like Google, which otherwise wouldn't see the content and distinct pages of a Single Page App site, can index them. One nice thing about either Selenium or Phantom is that you can inject jQuery, with all its convenience and ubiquity, and run small snippets against it to quickly pull information out of the DOM.
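As one sketch of that jQuery trick in PhantomJS (the URL, the jQuery CDN address, and the 'h2 a' selector are illustrative placeholders), page.includeJs injects jQuery into the loaded page and page.evaluate runs a snippet inside the page's context:

    // run with: phantomjs scrape.js
    var page = require('webpage').create();

    page.open('http://example.com/', function (status) {
        if (status !== 'success') { phantom.exit(1); }
        // inject jQuery (skip this if the site already ships its own copy)
        page.includeJs('https://code.jquery.com/jquery-1.12.4.min.js', function () {
            // evaluate() runs inside the page, so $ here is the injected jQuery
            var titles = page.evaluate(function () {
                return $('h2 a').map(function () { return $(this).text(); }).get();
            });
            console.log(JSON.stringify(titles, null, 2));
            phantom.exit();
        });
    });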

See more in my other article on Selenium and PhantomJS / CasperJS.