Remixing culture with RDF
Matthew Haughey, Creative Commons
CC is 2 things:
- a set of liberal licenses
- a search engine to find licensed work
When designing the search infrastructure, it had to be decentralised and metadata-aware, buildable by a small organisation, and use existing toolkits. RDF is a good fit.
Metadata format: metadata in the HTML head is generally ignored by search engines, robots.txt hacks are too limited and hacky, and separate supporting files are too much faff for end users. So they went for RDF documents embedded in HTML comments. This means you can search for media with a particular re-usability status.
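The embedded metadata looks roughly like this: an RDF/XML block wrapped in an HTML comment, invisible to browsers but readable by a metadata-aware crawler. (A simplified fragment based on CC's published RDF vocabulary of the time; the example Work URL and title are made up, and exact namespaces and license URLs varied.)

```html
<!--
<rdf:RDF xmlns="http://web.resource.org/cc/"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="http://example.org/photo.jpg">
    <dc:title>A photo</dc:title>
    <license rdf:resource="http://creativecommons.org/licenses/by/1.0/"/>
  </Work>
  <License rdf:about="http://creativecommons.org/licenses/by/1.0/">
    <permits rdf:resource="http://web.resource.org/cc/Reproduction"/>
    <permits rdf:resource="http://web.resource.org/cc/Distribution"/>
    <requires rdf:resource="http://web.resource.org/cc/Attribution"/>
  </License>
</rdf:RDF>
-->
```

The License block is what makes the reuse search possible: a crawler can match on the `permits`/`requires` resources without understanding the page content at all.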
Built a prototype crawler + search engine out of Python + PostgreSQL. It's not scalable, but it's a proof of concept.
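The core of such a prototype is small. A minimal sketch, assuming a lot: the function names, the regex-based extraction, and the schema are all invented here, and an in-memory SQLite database stands in for PostgreSQL so the sketch is self-contained.

```python
import re
import sqlite3

# Hypothetical sketch: pull the license URI out of an RDF block
# embedded in an HTML comment, and record it against the page URL.
LICENSE_RE = re.compile(
    r'<!--.*?<License\s+rdf:about="([^"]+)".*?-->', re.DOTALL)

def extract_license(html):
    """Return the license URI embedded in the page, or None."""
    m = LICENSE_RE.search(html)
    return m.group(1) if m else None

def index_page(db, url, html):
    """Store (url, license) if the page carries embedded RDF."""
    uri = extract_license(html)
    if uri:
        db.execute("INSERT INTO pages (url, license) VALUES (?, ?)",
                   (url, uri))
    return uri

db = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL store
db.execute("CREATE TABLE pages (url TEXT, license TEXT)")

page = '''<html><body>photo
<!-- <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<License rdf:about="http://creativecommons.org/licenses/by/1.0/"/>
</rdf:RDF> -->
</body></html>'''
print(index_page(db, "http://example.org/photo", page))
```

Querying `pages` by license URI then gives you "find me things I'm allowed to reuse" for free; the unscalable part is fetching and parsing the whole web in Python.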
Rebuilt the app on Lucene. Lucene rocks for building open-source indexing tools; there's an app called Nutch which runs on top of Lucene, and that's what they used. The CC search engine is about 500 lines of Java on top of Nutch. It now indexes about 10 million pages.
This talk is not very good.