Part nerd, part geek, Neek! HackerID

name the best HackerNews contributors



HackerID: name the best HackerNews contributors
Peter Renshaw, goonmail at  netspace dot net dot au
Friday, September 21, 2007.


    "... I just want the data, not the pages, not 
    pretty design just the data... don't need your 
    stinkin webpages, apps. Just give us the data! 
    OK ..."


Introduction

HackerID was born of my frustration on several fronts, and of some 
comments pg made when HackerNews was just launching. He described 
how he used the leaderboard to get a quick impression of who the 
real contributors are at news.yc, since renamed HackerNews.

    
    Q "... is there any way you could filter applications before 
    reading them? ..."

    A "... This site was designed partly as an additional filter. 
    It works too. When I met danielha I knew his name because he 
    was #1 on News.YC. ..."
    
    http://news.ycombinator.com/item?id=7076


So it was in the back of my mind that I really want to know who
is contributing on HackerNews without having to read through reams 
of pages at regular intervals. Next, a more fundamental concept:

    I just want the data, not the pages, not pretty design just 
    the data ....

So with these things in mind I hacked together, in a day, a program 
written in Python using a few standard and third-party modules


        * urllib (http GET)
        * datetime (ISO 8601 date formatting)
        * feedparser (RSS parsing)
        * BeautifulSoup (screen scrape)


that would allow me to ...


        * go to the HackerNews page, grab the RSS feed
        * extract as much as I could
        * go back to the site and get a few extra things
        * create a nice little valid xml file


But reading xml files and understanding them are two different 
things. So next I decided to create a quick xhtml page in table 
form to read the results. If I could see the results in a nice 
table I could at a quick glance view who was submitting, what 
karma they have, what score the submission has, how long ago it 
was submitted.

But hang on. How can I find the best score, the highest number 
of comments or easily count the number of articles danw has 
submitted? I can presort and create a static page. But what if 
I want to sort now? So I looked for a better way and found some 
Javascript tools that I can use to do the job. In the end I chose 
the Yahoo User Interface Library (YUI) toolset, specifically the 
Beta version of DataTable. You can find the references here ...


    http://developer.yahoo.com/yui/

        and here

    http://developer.yahoo.com/yui/datatable/


But this is just user-interface floss. The real value created 
is in generating the useful data in the first place. The 
applications created on top of the data are the cream on the 
cake.



GET RSS

Make an HTTP request to download the HackerNews RSS file found 
at http://news.ycombinator.com/rss
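
A minimal sketch of this step, assuming Python 3, where the HTTP 
GET lives in urllib.request rather than the older urllib module. 
The function name fetch_rss, the timeout and the injectable opener 
argument are hypothetical choices for illustration and are not 
taken from the HackerID source.

```python
import urllib.request

RSS_URL = "http://news.ycombinator.com/rss"  # feed address given above


def fetch_rss(url=RSS_URL, opener=urllib.request.urlopen, timeout=30):
    """Download the feed and return its raw bytes.

    `opener` is injectable (an assumption for illustration) so the
    function can be exercised without touching the network.
    """
    with opener(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch_rss() with no arguments performs the real request; 
passing a stand-in opener lets the rest of the pipeline be tried 
offline.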



PARSE

Parse the RSS file, extracting title, url, date and parsed date. 
Note that the parsed date in the RSS file is invalid.
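
The program described here uses feedparser for this step. As a 
dependency-free sketch, the example below swaps in the standard 
library's xml.etree.ElementTree instead. The sample feed, SAMPLE_RSS 
and parse_items are invented for illustration, and the date is kept 
as raw text since it cannot be relied on.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real feed comes from the GET RSS step.
SAMPLE_RSS = b"""<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Hacker News</title>
    <item>
      <title>Example story</title>
      <link>http://example.com/story</link>
      <comments>http://news.ycombinator.com/item?id=1</comments>
    </item>
  </channel>
</rss>"""


def parse_items(rss_bytes):
    """Return a list of dicts with title, url and date for each <item>."""
    root = ET.fromstring(rss_bytes)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            # pubDate may be missing or malformed, so keep it as raw text
            "date": item.findtext("pubDate", default=""),
        })
    return items
```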



EXTRACT

For each article url extract information on

        * user (who posted)
        * title (article title)
        * url (article url)
        * points (current points assigned to article)
        * comments (how many people commented)
        * item-id (unique id of post at hackernews)
        * posted (relative datetime posted)

Then, for each article, extract the user information from the 
user's hackernews page, http://news.ycombinator.com/user?id=user

        * inception date (when joined)
        * karma (points user has)

At some time this should be cached as a list of users and info 
about users. No need to gather the same info over and over again. 
Collect all the information (submitted article + user) into a 
data structure for later use.
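
The caching idea could be sketched roughly as below. UserCache, 
merge and the fetch_user callable are hypothetical names; 
fetch_user stands in for the screen scrape of the user page and is 
assumed to return a dict holding inception and karma.

```python
class UserCache:
    """Remember user details so each user page is fetched only once."""

    def __init__(self, fetch_user):
        # fetch_user: callable(name) -> dict with "inception" and "karma".
        # It stands in for the scrape of the user page (hypothetical).
        self._fetch_user = fetch_user
        self._users = {}

    def get(self, name):
        # Only call the (slow) fetch the first time a name is seen
        if name not in self._users:
            self._users[name] = self._fetch_user(name)
        return self._users[name]


def merge(articles, cache):
    """Combine each article dict with its submitter's details."""
    return [dict(article, **cache.get(article["user"])) for article in articles]
```

With this shape, two submissions by the same user cost a single 
request for that user's page.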



CONVERSION OF DATA TO XML

Running through the data structure, convert it into a useful xml 
document. The structure of the document is

        * HEADER (defines contents, who created, when)
            * title
            * description
            * author
            * license (loose usage of data, 
              CC http://creativecommons.org/licenses/by/2.5/au )
            * user (url to my hackernews user page)
            * changefreq (quarter hourly, change pending)
            * lastmod (in ISO 8601 format, Zulu time)
            * source (url where the file can be found)
Then for each batch of submissions ...

        * SUBMISSIONS (made up of 20 submissions)
            * submission (hackernews user makes submission)
                * user (user name)
                * inception (days since created hackernews account)
                * karma (numerical measurement of hackernews worth)
                * item (ID only of user discussion submission)
                * title (title user assigned to submission)
                * url (external url of submission)
                * points (current points assigned to submission)
                * comments (number of comments currently found in sub)
                * posted (time in minutes, hours, days since sub made)

The whole point of the exercise is to create a data feed that 
captures the current best of HackerNews in data form, allowing 
others to build on it. This is what I certainly miss with 
lots of sites. The format consists of a header that describes 
the feed, who created it and when it was created, and a 
submissions element which contains many submissions. The useful 
data is contained in the submissions.

Each user submission is captured at a top level, allowing you to 
identify the user and a little bit about them. The rest of the data 
describes the story submission: its title, score, comments, when 
posted etc. The structure of the file is inspired in part by the 
Flickr public API response format, which can be found 
at...


    http://flickr.com/services/

        and here

    http://flickr.com/services/api/


The above data is wrapped in the following XML description ...


    <?xml version="1.0" encoding="utf-8" ?>
    <rsp stat="ok">
       <header />
       <submissions>
           <submission />
           <submission />
       </submissions>
    </rsp>

It is not original but very simple, easy to use and understand. I 
would have liked to use RSS, as the tools to process RSS are 
abundant, but I didn't have the time (or patience) to sit down 
and work out exactly how. I would also have liked to use ATOM 
but the tools for python are as yet pretty rough (at least 
the last time I looked).
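
The rsp wrapper shown above could be generated along these lines. 
This is only a sketch using the standard library's ElementTree, not 
the actual HackerID code; build_feed and its arguments are 
hypothetical, and just one header field (lastmod) is filled in to 
keep it short.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_feed(submissions, now=None):
    """Return the <rsp> document as a UTF-8 encoded byte string."""
    now = now or datetime.now(timezone.utc)
    rsp = ET.Element("rsp", stat="ok")

    header = ET.SubElement(rsp, "header")
    # lastmod as an ISO 8601 timestamp in Zulu (UTC) time
    ET.SubElement(header, "lastmod").text = now.strftime("%Y-%m-%dT%H:%M:%SZ")

    container = ET.SubElement(rsp, "submissions")
    for sub in submissions:
        node = ET.SubElement(container, "submission")
        for key, value in sub.items():
            # ElementTree escapes &, < and > in text when serialising
            ET.SubElement(node, key).text = str(value)

    return ET.tostring(rsp, encoding="utf-8")
```

Each submission is passed in as a plain dict, for example 
{"user": "danw", "points": 12}, and every key becomes a child 
element of that submission.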


Some of the elements (changefreq, lastmod, source) are also 
inspired by the Google Sitemap Protocol document which can be 
found here ...


  https://www.google.com/webmasters/tools/docs/en/protocol.html


The above document also highlighted the need for entity escaping; 
unescaped characters play havoc in XML feeds. There are also some 
nice discussions around on url formats (RFC-3986, RFC-3987) and 
on python tools to properly parse urls.
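
As a small illustration of the escaping problem, the standard 
library's xml.sax.saxutils module can make text and attribute 
values safe. The title and url below are invented examples.

```python
from xml.sax.saxutils import escape, quoteattr

title = "Tom & Jerry <live>"
url = "http://example.com/?a=1&b=2"

# escape() handles &, < and > for element text
safe_title = escape(title)
# quoteattr() escapes the value and wraps it in quotes for use as an attribute
safe_href = quoteattr(url)

print(safe_title)  # Tom &amp; Jerry &lt;live&gt;
print(safe_href)   # "http://example.com/?a=1&amp;b=2"
```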



CONVERSION TO XHTML

The page is generated from a static xhtml template, inserting 
values while running through a for loop. Pretty quickly I realised 
that I needed to display the data, so I may as well create a page 
to look at. It consists of


        * Karma (user overall score with link to user)
        * Hacker (user name with link back to user)
        * Story (title with link back to original story)
        * Reply (numerical, how many replies with link to 
	  discussion)
        * posted (relative date submission posted)


This data is wrapped into a table (yes folks plain old table, no 
CSS). It's plain and does the job. The most important parts are 
the links created back to the original information.
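
The template-plus-loop approach might look something like the 
sketch below. The ROW template, render_rows and the dict keys are 
assumptions for illustration; only the two news.ycombinator.com url 
patterns come from this page.

```python
from html import escape

# One table row: karma, hacker, story, reply count and posted time
ROW = (
    "<tr>"
    '<td><a href="{user_url}">{karma}</a></td>'
    '<td><a href="{user_url}">{user}</a></td>'
    '<td><a href="{url}">{title}</a></td>'
    '<td><a href="{item_url}">{comments}</a></td>'
    "<td>{posted}</td>"
    "</tr>"
)


def render_rows(submissions):
    """Return one <tr> per submission, joined by newlines."""
    rows = []
    for sub in submissions:
        # escape every value so &, < and quotes cannot break the markup
        safe = {key: escape(str(value)) for key, value in sub.items()}
        safe["user_url"] = "http://news.ycombinator.com/user?id=" + safe["user"]
        safe["item_url"] = "http://news.ycombinator.com/item?id=" + safe["item"]
        rows.append(ROW.format(**safe))
    return "\n".join(rows)
```

The returned rows would then be dropped into the table body of the 
static page.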



CONVERSION TO DYNAMIC TABLE USING YUI AND DATATABLE

Static xhtml is fine to look at but not really useful to interact 
with. I realised pretty quickly I needed some real-time 
interaction. So I went looking for portable client side tools 
to play with.

Eventually I decided to play with the Yahoo User Interface toolkit 
and the DataTable in particular. I wanted to use the dynamic table 
because I could then really find the best contributors by allowing 
searching on...


        * Karma (see who has the highest karma now)
        * Hacker (know the name of the hacker)
        * Story (see what story, go to the story)
        * Reply (see how many replies, add a reply in the 
	  discussion)
        * posted (see how long ago the submission was posted)


The key bit was that I wanted sorting. No sort, no good. Rather 
than just try to find some hack or make my own, I wanted a tool 
that is open source, in use and in development. So the adventure 
with YUI tools began. The key bit is tying the static html, the 
javascript and YUI controls to the data generated. Once you do 
this the page will render and you can sort the data. So what 
problems did I find? Well, there are links, then understanding 
the event model, and styling.


/*
   Simple example code for building a custom url with a title using 
   bits of data. Not shown in any example I could find.
*/
this.formatHacker = function( elCell, oRecord, oColumn, oData ) {
   // oData holds the hacker's user name; points, posted and comments 
   // are read from the same record to build the link's title text.
   elCell.innerHTML =  "<a class='url' " +
                       "href='http://news.ycombinator.com/user?id=" +
                         oData +
                         "' title='" +
                         oRecord.getData("points") +
                         " points by " +
                         oData + " " +
                         oRecord.getData("posted") + " | " +
                         oRecord.getData("comments") +
                         " comments" +
                         "'>" + oData + "<\/a>";
};


How do you make a link? Would you believe this is probably the 
easiest thing to do in static html or with python tools. But the 
YUI DataTable examples given all used the url as the title. 
How stupid is that? Something as simple as using another value 
as the title was almost impossible to find. Almost. I solved this, 
and by all means look at the hackerid source html file for the 
Javascript to do this. In the end it was not very hard.


Getting the mouse-over events to work was a matter of finding some 
examples. It would have been good to have some tutorial of the 
event model (didn't find any, still looking) but the examples are 
good enough.


The real problem, I think, is fiddling around with the styling. I 
wanted a very simple layout based on the default YUI skin to avoid 
wasting time on styling. I can leave this as another exercise. In 
the end I chose just to use a simple mouse-over on the rows and add 
a hint of "this is a hackernews link" by flavouring the mouse-over 
style with a faint orange.


In the end, with the real problems overcome, I designed the page to 
look as near as possible to the static xhtml page. I also used some 
YUI templates, which removed the need to build my own. This gives 
the pages enough similarity to make them look the same.



STRUCTURE

The site has a pretty simple layout...

    hackerid/
             index.html (dynamic index page)
             hackerid-static.html (static xhtml page)
             xml/
                 hackerid.xml (xml data)


The "hackerid" bit allows for cool uri design. The dynamic page 
is a simple index page which contains the logic and references to 
css and js files in a parent directory. The index page also links 
directly to the static page. The dynamic index page reads the 
necessary data from the child xml directory. That's it.



UPLOADING

I have a simple python module that runs at regular intervals 
uploading the static index page and data file (hackerid.xml) via 
FTP.
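
An upload step of this kind can be sketched with the standard 
library's ftplib. The host name, login details and the upload 
helper are placeholders for illustration, not the real values or 
the actual module.

```python
import ftplib


def upload(ftp, files):
    """Send each (local_path, remote_path) pair in binary mode."""
    for local_path, remote_path in files:
        with open(local_path, "rb") as handle:
            # STOR writes the file to the given path on the server
            ftp.storbinary("STOR " + remote_path, handle)


def main():
    # Host, login and file list are placeholders, not the real values.
    with ftplib.FTP("ftp.example.com") as ftp:
        ftp.login("user", "password")
        upload(ftp, [
            ("hackerid-static.html", "hackerid-static.html"),
            ("xml/hackerid.xml", "xml/hackerid.xml"),
        ])
```

Running main() at regular intervals is left to whatever scheduler 
is available on the machine doing the processing.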



LIMITATIONS

I have yet to add the functionality to a site where python runs 
at the webserver. It's a simple constraint but one that I'm 
working with. The server based processing is done on a machine I 
have at home linked to the Internet via a DSL connection. So the 
main limitation is someone might turn the beggar off or mains 
power might fail. The page still works but the data will not be the 
latest.



CONCLUSIONS


    "... we don't need your stinkin webpages, apps. Just give 
    us the data! ..."


For me this has been an eye opener. I've been able to rather 
quickly hack up a simple app with limited processing power 
because I have access to the data and some open source tools. 
It's something I am going to pursue further and have some fun 
with.