fb55/readabilitySAX

readabilitySAX

a fast and platform independent readability port

About

This is a port of the algorithm used by the Readability bookmarklet, which extracts relevant pieces of information from websites, to a SAX parser.

The advantage over other ports, e.g. arrix/node-readability, is a smaller memory footprint and much faster execution. In my tests, most pages, even large ones, finished within 15 ms (on node; see below for more information). It works with Rhino, so it runs on YQL, which may have interesting uses. It also works within a browser.
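To illustrate why an event-based approach needs little memory, here is a minimal sketch of a SAX-style handler. It is not the Readability algorithm and not code from this module: the function `createLongestParagraphHandler` and its "longest paragraph wins" rule are hypothetical stand-ins. The callback names follow the ones htmlparser2 uses (`onopentag`, `ontext`, `onclosetag`), and the events are fed in by hand so the example runs without any dependency.

```javascript
// Hypothetical sketch: keeps only running state (the paragraph being read and
// the best one so far) instead of building a DOM tree for the whole page.
// The "longest <p> wins" rule is a stand-in, NOT the Readability scoring.
function createLongestParagraphHandler() {
  let current = null; // text of the <p> currently being read, or null outside <p>
  let best = "";      // longest paragraph seen so far

  return {
    onopentag(name) {
      if (name === "p") current = "";
    },
    ontext(text) {
      // Text inside nested inline tags still belongs to the open paragraph.
      if (current !== null) current += text;
    },
    onclosetag(name) {
      if (name === "p" && current !== null) {
        if (current.length > best.length) best = current;
        current = null;
      }
    },
    getResult() {
      return best;
    },
  };
}

// Events are fed by hand here; a SAX parser would emit them while reading HTML.
const handler = createLongestParagraphHandler();
handler.onopentag("div");
handler.onopentag("p");
handler.ontext("Short.");
handler.onclosetag("p");
handler.onopentag("p");
handler.ontext("A much longer paragraph ");
handler.onopentag("b");
handler.ontext("with bold text");
handler.onclosetag("b");
handler.ontext(".");
handler.onclosetag("p");
handler.onclosetag("div");

console.log(handler.getResult());
```

Whatever the page size, the handler only holds two strings at a time, which is the general idea behind the smaller footprint of a SAX-based port.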

The Readability extraction algorithm was completely ported, but some adjustments were made:

HowTo

Installing readabilitySAX (node)

This module is available on npm as readabilitySAX. Just run

npm install readabilitySAX

CLI

A command line interface (CLI) may be installed via

npm install -g readabilitySAX

It's then available via

readability <domain> [<format>]

To get this readme, just run

readability https://github.com/FB55/readabilitySAX

The format is optional: it's either text or html, and defaults to text.

Usage

Node

Just run require("readabilitySAX"). You'll get an object containing three methods:

There are two methods available that are deprecated and will be removed in a future version:

Please don't use those two methods anymore. Streams are the way you should build interfaces in node, and that's what I want to encourage people to use.

Browsers

I started to implement simplified SAX-"parsers" for Rhino/YQL (using E4X) and the browser (using the DOM) to increase the overall performance on those platforms. The DOM version is inside the /browsers dir.

A demo of how to use readabilitySAX inside a browser may be found at jsFiddle. Some basic example files are inside the /browsers directory.

YQL

A table using E4X-based events is available as the community table redabilitySAX, as well as here.

Parsers (on node)

Most SAX parsers (such as sax.js) fail when a document is malformed XML, even if it's correct HTML. readabilitySAX should be used with htmlparser2, my fork of the htmlparser module (used by e.g. jsdom), which corrects most faults. It's listed as a dependency, so npm should install it with readabilitySAX.
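The difference between "malformed XML" and "correct HTML" can be shown with a toy tag checker. This is only a sketch under simplifying assumptions: `checkTags`, `VOID_ELEMENTS` and `TAG_PATTERN` are hypothetical names, the void-element list is partial, and neither sax.js nor htmlparser2 works this way internally. In strict mode every tag must be closed in order, as XML requires; in lenient mode void elements such as `<br>` need no end tag and elements left open are closed implicitly.

```javascript
// Simplified illustration only: real parsers also handle comments, entities,
// quoted attributes and many more HTML rules than this sketch does.
const VOID_ELEMENTS = new Set(["br", "hr", "img", "input", "meta", "link"]);
const TAG_PATTERN = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/g;

function checkTags(markup, lenient) {
  const open = []; // names of elements that are currently open
  for (const [, closing, rawName, selfClosing] of markup.matchAll(TAG_PATTERN)) {
    const name = rawName.toLowerCase();
    if (closing) {
      const index = open.lastIndexOf(name);
      if (lenient) {
        // Close the element and everything still open inside it;
        // stray end tags are silently ignored.
        if (index !== -1) open.length = index;
      } else if (index === -1 || index !== open.length - 1) {
        return { ok: false, error: "unexpected </" + name + ">" };
      } else {
        open.pop();
      }
    } else if (selfClosing || (lenient && VOID_ELEMENTS.has(name))) {
      continue; // nothing stays open for <br/> or, in lenient mode, <br>
    } else {
      open.push(name);
    }
  }
  if (!lenient && open.length > 0) {
    return { ok: false, error: "unclosed <" + open[open.length - 1] + ">" };
  }
  return { ok: true, error: null };
}

const html = "<p>first line<br>second line</p>";
console.log(checkTags(html, false)); // strict, XML-like rules
console.log(checkTags(html, true));  // lenient, HTML-like rules
```

The strict pass rejects the snippet because `<br>` is never closed, while the lenient pass accepts it, which is the kind of input a forgiving HTML parser has to cope with.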

Performance

Speed

Using a package of 724 pages from CleanEval (their website seems to be down; try searching for it), readabilitySAX processed all of them in 5768 ms, an average of 7.97 ms per page.

The benchmark was done using tests/benchmark.js on a MacBook (late 2010) and is probably far from perfect.
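The per-page average follows directly from the two numbers above. The sketch below reproduces that arithmetic and shows the general shape of such a timing loop. It is not the contents of tests/benchmark.js: `benchmark`, `averageMs` and the `processPage` callback are hypothetical names used for illustration.

```javascript
// Hypothetical harness: `processPage` stands in for whatever extraction call
// is being measured; it is not part of readabilitySAX.
function benchmark(pages, processPage) {
  const start = Date.now();
  for (const page of pages) processPage(page);
  const totalMs = Date.now() - start;
  return { totalMs, averageMs: averageMs(totalMs, pages.length) };
}

// Average processing time per page in milliseconds.
function averageMs(totalMs, pageCount) {
  return totalMs / pageCount;
}

// The figures quoted above: 724 pages in 5768 ms.
console.log(averageMs(5768, 724).toFixed(2)); // "7.97"
```

A wall-clock loop like this includes everything the callback does, so results depend on the machine and on what else is running.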

Performance is the main goal of this project. The current speed should be good enough to run readabilitySAX on a single-threaded web server with an average number of requests. That's an accomplishment!

Accuracy

The main goal of CleanEval is to evaluate the accuracy of an algorithm.

// TODO

Todo