HackerID: name the best HackerNews contributors
Peter Renshaw, goonmail at netspace dot net dot au
Friday, September 21, 2007.
"... I just want the data, not the pages, not
pretty design just the data... don't need your
stinkin webpages, apps. Just give us the data!
OK ..."
Introduction
HackerID is born of my frustration on several fronts and some
comments pg made when HackerNews was just launching about the
Leader board and how he used it to get a quick impression of
who the real contributors are at news.yc, since named
HackerNews.
Q "... is there any way you could filter applications before
reading them? ..."
A "... This site was designed partly as an additional filter.
It works too. When I met danielha I knew his name because he
was #1 on News.YC. ..."
http://news.ycombinator.com/item?id=7076
So it was in the back of my mind that I really want to know who
is contributing on HackerNews without having to read through reams
of pages at regular intervals. Next a more fundamental concept.
I just want the data, not the pages, not pretty design just
the data ....
So with these things in mind I hacked together in a day a program
written in Python using a few modules (the last two are third-party):
* urllib (http GET)
* datetime (ISO 8601 date formatting)
* feedparser (RSS parsing)
* BeautifulSoup (screen scrape)
that would allow me to ...
* go to the HackerNews page, grab the RSS feed
* extract as much as I could
* go back to the site and get a few extra things
* create a nice little valid xml file
But reading xml files and understanding them are two different
things. So next I decided to create a quick xhtml page in table
form to read the results. If I could see the results in a nice
table I could at a quick glance view who was submitting, what
karma they have, what score the submission has, how long ago it
was submitted.
But hang on. How can I find the best score, highest number
of comments or easily count the number of articles danw has
submitted? I can presort and create a static page. But what if
I want to sort now? So I looked for a better way and found some
Javascript tools that I can use to do the job. In the end I chose
Yahoo User Interface Library or YUI toolset. Specifically the
Beta version of DataTable. You can find the references here ...
http://developer.yahoo.com/yui/
and here
http://developer.yahoo.com/yui/datatable/
But this is just user-interface floss. The real value created
is in generating the useful data in the first place. The
applications created on top of the data are the cream on the
cake.
GET RSS
Make an HTTP request to download the HackerNews RSS file found
at http://news.ycombinator.com/rss
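A minimal sketch of this step, assuming Python 3 and its standard
urllib.request module (not the 2007 code). The function name, the
User-Agent string and the opener parameter are hypothetical; the
opener hook only exists so the function can be tried without
touching the network.

```python
from urllib.request import Request, urlopen

RSS_URL = "http://news.ycombinator.com/rss"  # URL given in the text


def fetch_rss(url=RSS_URL, opener=urlopen, timeout=30):
    """Return the raw bytes of the RSS feed at `url`.

    `opener` is a hypothetical hook that defaults to urllib's urlopen;
    pass a stand-in to exercise the function offline.
    """
    # The User-Agent value is an arbitrary, made-up identifier.
    request = Request(url, headers={"User-Agent": "hackerid-sketch/0.1"})
    with opener(request, timeout=timeout) as response:
        return response.read()
```

Calling fetch_rss() with no arguments performs the real GET and
hands back bytes ready for the parse step.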
PARSE
Parse the RSS file, extracting title, url, date, parse date
Note the parsed date in the RSS file is invalid.
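The parse step might look roughly like this. It is a sketch that
swaps in the standard library's xml.etree.ElementTree as a
stand-in for feedparser, and it assumes the feed uses the usual
RSS 2.0 element names (item, title, link, pubDate). The date is
kept as a raw string, since the feed's date is noted as invalid.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_bytes):
    """Return a list of dicts, one per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            # Raw string only; no attempt is made to parse the date.
            "date": item.findtext("pubDate", default=""),
        })
    return items
```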
EXTRACT
For each article url extract information on
* user (who posted)
* title (article title)
* url (article url)
* points (current points assigned to article)
* comments (how many people commented)
* item-id (unique id of post at hackernews)
* posted (relative datetime posted)
For each article url then extract the user information
from the user's HackerNews page http://news.ycombinator.com/user?id=user
* inception date (when joined)
* karma (points user has)
At some time this should be cached as a list of users and info
about users. No need to gather the same info over and over again.
Collect all the information (submitted article + user) into a
data structure for later use.
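The caching idea and the combined data structure could be sketched
as below. All names are hypothetical: fetch_user stands for
whatever code scrapes the user page and returns a dict such as
{"inception": ..., "karma": ...}, and each article is assumed to
be a dict with at least a "user" key.

```python
def make_user_lookup(fetch_user):
    """Wrap `fetch_user` so each user name is fetched at most once."""
    cache = {}

    def lookup(user):
        if user not in cache:
            cache[user] = fetch_user(user)
        return cache[user]

    return lookup


def build_records(articles, lookup):
    """Merge each article dict with the info for the user who posted it."""
    records = []
    for article in articles:
        record = dict(article)  # copy so the input is not mutated
        record.update(lookup(article["user"]))
        records.append(record)
    return records
```

With 20 submissions from a handful of users this cuts the number
of user page requests down to one per distinct user per run.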
CONVERSION OF DATA TO XML
Running through the data structure, convert it into a useful xml
document. The structure of the document is
* HEADER (defines contents, who created, when)
* title
* description
* author
* license (loose usage of data,
CC http://creativecommons.org/licenses/by/2.5/au )
* author
* user (url to my hackernews user page)
* changefreq (quarter hourly, change pending)
* lastmod (in ISO 8601 format, Zulu time)
* source (url file can be found)
Then for each batch of submissions ...
* SUBMISSIONS (made up of 20 submissions)
* submission (hackernews user makes submission)
* user (user name)
* inception (days since created hackernews account)
* karma (numerical measurement of hackernews worth)
* item (ID only of user discussion submission)
* title (title user assigned to submission)
* url (external url of submission)
* points (current points assigned to submission)
* comments (number of comments currently found in sub)
* posted (time in minutes, hours, days since sub made)
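Since posted is a relative value, anyone wanting to sort on it
needs a numeric form first. A small sketch, assuming the value
is a string shaped like "3 hours ago" (that exact wording is an
assumption, as is the function name):

```python
import re
from datetime import timedelta

# Map the singular unit found in the text to a timedelta keyword.
_UNITS = {"minute": "minutes", "hour": "hours", "day": "days"}
_POSTED = re.compile(r"(\d+)\s+(minute|hour|day)s?\b")


def posted_to_timedelta(text):
    """Convert e.g. '3 hours ago' to a timedelta; None if no match."""
    match = _POSTED.search(text)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})
```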
The whole point of the exercise is to create a data feed that
captures the current best of HackerNews in data form allowing
others to build off this. This is what I certainly miss with
lots of sites. The format consists of a header that describes
the feed, who created it and when it was created, and a submissions
element which contains many submission entries. The useful
data is contained in the submissions.
Each user submission is captured at a top level allowing you to
identify the user, a little bit about them. The rest of the data
describes the story submission: its title, score, comments, when
posted etc. The structure of the file is inspired in part by the
Flickr public API response format data structure which can be found
at...
http://flickr.com/services/
and here
http://flickr.com/services/api/
The above data is wrapped in the following XML description ...
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<header />
<submissions>
<submission />
<submission />
</submissions>
</rsp>
It is not original but very simple, easy to use and understand. I
would have liked to use RSS as the tools to process RSS are
abundant but I didn't have the time (or patience) to sit down
and work out the exact how. Would also have liked to use ATOM
but the tools as of yet for python are pretty rough (at least
the last time I looked).
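The rsp wrapper described above could be produced along these
lines. This is a sketch using the standard library's
xml.etree.ElementTree on Python 3.8 or later; writing every field
as a child element (rather than as attributes) is an assumption,
as are the function and parameter names.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_feed(header, submissions, now=None):
    """Return the <rsp> document as UTF-8 bytes.

    `header` is a dict of header fields, `submissions` a list of
    dicts. Pass a timezone-aware `now` to fix the lastmod value.
    """
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    rsp = ET.Element("rsp", stat="ok")

    head = ET.SubElement(rsp, "header")
    for name, value in header.items():
        ET.SubElement(head, name).text = str(value)
    # lastmod in ISO 8601 format, Zulu (UTC) time
    ET.SubElement(head, "lastmod").text = now.strftime("%Y-%m-%dT%H:%M:%SZ")

    subs = ET.SubElement(rsp, "submissions")
    for record in submissions:
        sub = ET.SubElement(subs, "submission")
        for name, value in record.items():
            ET.SubElement(sub, name).text = str(value)

    return ET.tostring(rsp, encoding="utf-8", xml_declaration=True)
```

ElementTree escapes text content on output, so titles and urls
containing & or < come out as valid xml.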
Some of the entities (changefreq, lastmod, source) are also
inspired from the Google Sitemap Protocol document which can be
found here ...
https://www.google.com/webmasters/tools/docs/en/protocol.html
The above document also highlighted the need for entity escaping;
unescaped characters play havoc in XML feeds. There are some nice
discussions on url formats (RFC-3986, RFC-3987) and on python tools
to properly parse urls.
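For hand-built xml strings, the standard library's
xml.sax.saxutils module covers the escaping. A short sketch (the
two helper names and the element and attribute names are made up
for illustration):

```python
from xml.sax.saxutils import escape, quoteattr


def url_element(url):
    """Wrap a url in a <url> element, escaping &, < and >."""
    return "<url>%s</url>" % escape(url)


def url_attribute(url):
    """Emit a url as an attribute; quoteattr adds the quotes itself."""
    return "<link href=%s />" % quoteattr(url)
```

A url such as http://example.com/?a=1&b=2 then has its ampersand
written as &amp; in both cases.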
CONVERSION TO XHTML
Generated from a static xhtml template, inserting values while
running through a for loop. Pretty quickly I realised that I need to
display the data so I may as well create a page to look at.
Consists of
* Karma (user overall score with link to user)
* Hacker (user name with link back to user)
* Story (title with link back to original story)
* Reply (numerical, how many replies with link to
discussion)
* posted (relative date submission posted)
This data is wrapped into a table (yes folks plain old table, no
CSS). It's plain and does the job. The most important parts are
the links created back to the original information.
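The template-plus-loop approach could be sketched like this with
string.Template and html.escape from the standard library. The
record keys (user, karma, item, title, url, comments, posted)
follow the field names used earlier in this write-up; the row
layout itself is an assumption, and user names are not
url-quoted here.

```python
from html import escape
from string import Template

USER_URL = "http://news.ycombinator.com/user?id="  # pattern from the text
ITEM_URL = "http://news.ycombinator.com/item?id="  # pattern from the text

ROW = Template(
    '<tr><td><a href="$user_url">$karma</a></td>'
    '<td><a href="$user_url">$user</a></td>'
    '<td><a href="$url">$title</a></td>'
    '<td><a href="$item_url">$comments</a></td>'
    '<td>$posted</td></tr>'
)


def render_rows(records):
    """Return one <tr> per record, with every value HTML-escaped."""
    rows = []
    for rec in records:
        safe = {key: escape(str(value)) for key, value in rec.items()}
        safe["user_url"] = escape(USER_URL + str(rec["user"]))
        safe["item_url"] = escape(ITEM_URL + str(rec["item"]))
        rows.append(ROW.substitute(safe))
    return "\n".join(rows)
```

The returned rows drop straight into the table body of the static
page.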
CONVERSION TO DYNAMIC TABLE USING YUI AND DATATABLE
Static xhtml is fine to look at but not really useful to interact
with. I realised pretty quickly I needed some real-time
interaction. So I went looking for portable client side tools
to play with.
Eventually I decided to play with the Yahoo User Interface toolkit
and the DataTable in particular. I wanted to use the dynamic table
because I could then really find the best contributors by allowing
sorting on...
* Karma (see who has the highest karma now)
* Hacker (know the name of the hacker)
* Story (see what story, go to the story)
* Reply (see how many replies, add a reply in the
discussion)
* posted (see how long ago the submission was posted)
The key bit was I want sorting. No sort, no good. Rather than just
try to find some hack or make my own hack I want a tool that is
open source, in use and development. So the adventure with YUI
tools began. The key bit is tying the static html, the javascript
and YUI controls to the data generated. Once you do this the page
will render and you can sort the data. So what problems did I
find? Well there's links, then understanding the event model and
styling.
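/*
   Simple example code for building custom url with title using
   bits of data. Not shown in any example I could find.

   Cell formatter for the Hacker column: oData is the cell value
   (the user name), oRecord gives the other fields of the same row.
*/
this.formatHacker = function( elCell, oRecord, oColumn, oData ) {
    // Link to the user page; the title attribute is the hover text,
    // built from points, user, posted and comments.
    elCell.innerHTML = "<a class='url' " +
        "href='http://news.ycombinator.com/user?id=" +
        oData +
        "' title='" +
        oRecord.getData("points") +
        " points by " +
        oData + " " +
        oRecord.getData("posted") + " | " +
        oRecord.getData("comments") +
        " comments" +
        "'>" + oData + "<\/a>";
};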
How do you make a link? Would you believe this is probably the
easiest thing to do in static html or with python tools. But with
the YUI DataTable the examples given all used the url as the title.
How stupid is that? An example of something as simple as using
another value as the title was almost impossible to find. Almost.
I solved this, and by all means look at the hackerid source html
file for the Javascript to do this. In the end it was not very hard.
Getting the mouse-over events to work was a matter of finding some
examples. It would have been good to have some tutorial of the
event model (didn't find any, still looking) but the examples are
good enough.
The real problem I think is fiddling around with the styling. I
wanted a very simple layout based on the default YUI skin to avoid
wasting time on styling. I can leave this as another exercise. In
the end I chose just to use a simple mouse over on the rows and add
a hint of "this is a hackernews link" by flavouring the onMouse over
stylesheet with a faint orange.
In the end, with the real problems overcome, I designed the page to
look as near as possible to the static xhtml page. I also used some Y
templates that resolved the need to build my own. This gives the
pages enough similarity to make them look the same.
STRUCTURE
The site has a pretty simple layout...
hackerid/
index.html (dynamic index page)
hackerid-static.html (static xhtml page)
xml/
hackerid.xml (xml data)
The "hackerid" bit allows for cool uri design. The dynamic page
is a simple index page which contains the logic, references to css,
js files in a parent directory. The index page also links directly
to the static page. The dynamic index page reads the necessary data
from the child xml directory. That's it.
UPLOADING
I have a simple python module that runs at regular intervals
uploading the static index page and data file (hackerid.xml) via
FTP.
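An upload step of this kind could be sketched with the standard
ftplib module. The function names, and the choice to upload into
the server's current directory under each file's base name, are
assumptions for illustration.

```python
import os
from ftplib import FTP


def upload_files(ftp, paths):
    """Upload each local path to the FTP server's current directory.

    `ftp` is an already logged-in ftplib.FTP (or compatible) object.
    """
    for path in paths:
        remote_name = os.path.basename(path)
        with open(path, "rb") as handle:
            ftp.storbinary("STOR " + remote_name, handle)


def publish(host, user, password, paths):
    """Connect, log in, upload the files, then close the connection."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        upload_files(ftp, paths)
```

Running publish() from a scheduler such as cron would give the
"regular intervals" part.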
LIMITATIONS
I have yet to add the functionality to a site where python runs
at the webserver. It's a simple constraint but one that I'm
working with. The server based processing is done on a machine I
have at home linked to the Internet via a DSL connection. So the
main limitation is someone might turn the beggar off or mains
power might fail. The page still works but the data will not be the
latest.
CONCLUSIONS
"... we don't need your stinkin webpages, apps. Just give
us the data! ..."
For me this has been an eye opener. I've been able to rather
quickly hack up a simple app with limited processing power
because I have access to the data and some open source tools.
It's something I am going to pursue further and have some fun
with.