
March 16, 2005

Remixing culture with RDF

Matthew Haughey, Creative Commons

CC is 2 things:

  1. a set of liberal licenses
  2. a search engine to find licensed work

When designing the search infrastructure, it had to be decentralised and metadata-aware, workable for a small organisation, and built with existing toolkits. RDF is a good fit.

Metadata format: the HTML head is generally ignored by search engines, robots.txt hacks are too limited and hacky, and separate supporting files are too much faff for end users. So they went for RDF documents embedded in HTML comments. This means that you can search for media with a particular re-usability status.

Built a prototype crawler + search engine out of Python + PostgreSQL. It's not scalable, but it's a proof of concept.
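Roughly, the licence-extraction idea looks like this; a sketch of my own, not the actual CC code, and the regexes assume the licence is referenced via an rdf:resource attribute inside the commented RDF block:

```python
# A rough sketch (mine, not the actual CC crawler) of pulling the licence out of
# the RDF block that CC tools embed inside an HTML comment.
import re
import urllib.request

RDF_IN_COMMENT = re.compile(r"<!--\s*(<rdf:RDF.*?</rdf:RDF>)\s*-->", re.DOTALL)
LICENSE_URL = re.compile(r'<(?:cc:)?license\s+rdf:resource="([^"]+)"')

def licence_for(page_url):
    """Return the licence URL declared in the page's embedded RDF, if any."""
    with urllib.request.urlopen(page_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    rdf = RDF_IN_COMMENT.search(html)
    if not rdf:
        return None
    match = LICENSE_URL.search(rdf.group(1))
    return match.group(1) if match else None

# A crawler would store (page URL, licence URL) pairs in PostgreSQL and index
# them, so a search can be restricted to, say, share-alike works only.
```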

Rebuilt the app on Lucene. Lucene is teH r0X0R for building open-source indexing tools; they used Nutch, an app which runs on top of Lucene. The CC search engine is about 500 lines of Java on top of Nutch. The search engine now indexes about 10 million pages.

This talk is not very good.


Clay Shirky: Ontology is Overrated

Writing about web page http://conferences.oreillynet.com/cs/et2005/view/e_sess/6117

Premise: The ways we currently try to apply categorisation to the web are wrong, because we're re-applying categorisation techniques from the pre-web world

Assertion: It will get worse before it gets better; things get broken before they get fixed

Parable: Travel agents – Travelocity takes the place of traditional travel agents. They made the mistake of trying to do everything that a travel agent does, including helping people make choices about which holiday to take. In fact people don't want an online agent to do this: Travelocity transposed the offline categorisation into the online world.

It's not possible to avoid cultural assumptions in categorisation: the Dewey Decimal system has 10 religion categories – 9 for aspects of Christianity and 1 for 'other'. The Library of Congress geographic terms have the same problem.

Physical ontologies are optimised for physical access – librarians invent classifications to help them find books. If there is no shelf, then there's no need for a librarian's ontology.

Hierarchical ontologies are fundamentally unsuitable for non-physical information, because they're predicated on an object being in one place at one time – which isn't true of digital objects.

hierarchy —> hierarchy + links. When the number of links becomes large enough, you don't need the hierarchy any more.

browse (hierarchy) —> search (network of links)

when does ontological organisation work well?

  • Small domain, formal categories, stable entities, restricted entities, clear edges
  • Coordinated expert users (searchers/browsers), expert catalogers, authoritative source

n.b. the web is the diametric opposite of this!

When categorisations are collapsed, there's always some signal loss. Clay's example: if I tag something "queer" and you tag it "homosexual", we probably mean something subtly different. When categorisations are fixed, errors will always creep in over time, e.g. "Dresden is in East Germany".

great minds don't think alike

Usage (number per user) of tags on del.icio.us follows a power law – indicating an organic organisation. Similarly for the number of items per tag for an individual user. Looking at the number and distribution of tags for a given URL gives an indication of how clear 'the community' is about the categorisation of the item.

Key point: In a folksonomy, each categorisation is worth less individually than a 'professional' categorisation would have been – but when aggregated they have much more value.

User and Time are important attributes of tags. You need to know who tagged a resource and when, in order to assign a value to the tag. The semantics in a folksonomy are in the users not in the system. When del.icio.us sees OSX it doesn't know that it's an operating system; it just knows that things tagged as OSX are also often tagged as 'Mac' or 'Apple'
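As a toy illustration of that last point (my sketch, not del.icio.us code), simple co-occurrence counts are enough to relate 'osx' to 'mac' and 'apple' without the system knowing what any of the tags mean:

```python
# Toy tag co-occurrence: the system never learns what "osx" is, but the counts
# still relate it to "mac" and "apple". Data below is invented.
from collections import Counter
from itertools import combinations

# (user, url, tags) triples – who tagged and when matter, so a real system keeps both
bookmarks = [
    ("alice", "http://apple.com/macosx",  {"osx", "mac", "apple"}),
    ("bob",   "http://apple.com/macosx",  {"osx", "apple", "toread"}),
    ("carol", "http://somehowto.example", {"osx", "mac", "unix"}),
]

cooccur = Counter()
for _user, _url, tags in bookmarks:
    for a, b in combinations(sorted(tags), 2):
        cooccur[(a, b)] += 1

# tags most often seen alongside "osx"
related = Counter()
for (a, b), n in cooccur.items():
    if "osx" in (a, b):
        related[b if a == "osx" else a] += n
print(related.most_common(3))
```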

Does the world make sense, or do we make sense of the world?


Day 3 Keynotes: session 2

Clay Shirky (NYU): Phone as platform

NYU have been looking at the use of phones in teaching:

  • Pac-Manhattan / ConQuest – 'big games' – controllers use phones to control players on a macro scale.
  • dodgeball: started as a site for rating nightspots; but evolved into a more general social network. Tells you who else is where you are
  • Mobjects – a squeezable controller for driving a Bluetooth phone. Squidge it and it sends an SMS. Heartbeat – a very similar idea.
  • The phone is beginning to be used as a platform rather than as a device unto itself. Standardisation of comms protocols (Bluetooth) is making this easier, although phone manufacturers are not used to 'hackers'.
  • Server infrastructure is the key – expose back-end data in ways which phones can use
  • voice is increasingly underused.

Tom Igoe ITP / NYU

Physical computing: the crossover between art & programming – making computer control of physical objects simpler. Lots of cool toys / tools, particularly in the space of communicating emotions over a network. I can't easily describe them in a blog post, so hopefully there'll be a slideshow available on the web soon…

Tom Hoffman / Tim Lauer wiki in the classroom

Teaching middle-school children. No managed file space or tech support – so they looked into wikis. Instiki is a Ruby-based wiki that can run on a workstation / laptop. Since the school is Mac-based, students can use Rendezvous to discover the pages. Since the teachers run the wikis on their own laptops, they can run them at home just as easily.

  • Benefits: easy, responsive, completed project can be published as static HTML
  • Problems: if the teacher's machine is asleep it doesn't work; not all teachers grasped that their laptop had to be in the lesson; and it's not a technology supported by the LEA

Trying to get student information into the wiki is difficult because the student records are silo-ed (sound familiar? :-)). Tim got around this by building an open-source, open-API platform for school student records: SchoolTool. SchoolTool is based around a set of ReST APIs; relationships are modelled as XLinks.
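As a hedged sketch of that pattern (the URL and element names below are my assumptions, not SchoolTool's actual schema), a client fetches an XML record over HTTP and follows its xlink:href attributes to related resources:

```python
# Hedged sketch of a ReST + XLink client: GET an XML resource and collect the
# xlink:href attributes that model its relationships. URLs are hypothetical.
import urllib.request
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

def related_links(resource_url):
    with urllib.request.urlopen(resource_url) as resp:
        doc = ET.fromstring(resp.read())
    # any element carrying an xlink:href points at another ReST resource
    return [el.get(f"{{{XLINK}}}href") for el in doc.iter()
            if el.get(f"{{{XLINK}}}href")]

# e.g. related_links("http://schooltool.example/persons/some-student")
```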

James Surowiecki

Collective action is sometimes touted as a magic bullet – the idea that any difficult problem can be solved with a large enough group of people. It is true that, in the right context, a crowd can be smarter than the smartest person in it, e.g. the average of a large set of guesses about a single value ('guess the weight of the ox'), or the odds of a horse winning – with a large sample, horses with 3/1 odds win about 1/4 of the time.
This works well when the problem has a single 'right' answer. Collective wisdom doesn't appear out of consensus; it arises from the variation in the answers. It also requires that there isn't much interaction between the people involved.
Contrast with Linux: a large group works on problems but ultimately one person writes the code – the decision-making process is highly centralised. Or alternatively the anthill – lots of dumb agents with lots of interconnection and simple rules. However, humans are not ants. We don't do the same efficient interaction; in some scenarios the more we interact, the less intelligence the group has overall.
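A quick simulation makes the 'guess the weight of the ox' point concrete (my own toy numbers, not Surowiecki's data): the average of many independent, noisy guesses lands much closer to the true value than a typical individual guess does.

```python
# Independent noisy guesses average out; the aggregate beats the typical individual.
import random

random.seed(1)
true_weight = 1198                                              # pounds, say
guesses = [random.gauss(true_weight, 150) for _ in range(800)]  # independent, varied

avg = sum(guesses) / len(guesses)
typical_error = sum(abs(g - true_weight) for g in guesses) / len(guesses)
print(f"average guess: {avg:.0f} (off by {abs(avg - true_weight):.0f})")
print(f"typical individual guess is off by {typical_error:.0f}")
```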

The reasons for this are all basically Herding – 'it's better to fail conventionally than to succeed unconventionally'. People like the comfort of the crowd. Leads to 'information cascade'.

The root of all problems in the world is that man cannot simply sit by himself in his room

Pascal

Solutions? (guidelines)

  • Keep ties loose. Loose coupling minimises the disruptive influence of others around you.
  • Keep a wide range of inputs, so that you get the maximum amount of diversity/randomness injected into the solution space.

Jon Bostrom Nokia: Mobile computing on the edge

Advantages of edge computing:

  • ease of use
  • dynamic evolution / low centralised control

… this is turning into a bit of a Nokia sales pitch….yawn…


Day 3 Keynotes

Writing about web page http://conferences.oreillynet.com/cs/et2005/view/e_sess/5910

Neil Gershenfeld Bits & Atoms

The state-of-the-art in fabrication is the chip factory: Actually right now it's not very sophisticated: You spread some stuff around and cook it. Compared to biology the big difference is that the things you're making don't know anything about being made – whereas when you make an animal the cells know how to make more cells – the specification of the structure lives within the structure itself.

For traditional manufacturing, errors in the product are proportional to noise in the process. In signal theory (e.g. networks), a certain amount of noise can be tolerated without having any effect on errors in the system. If we can make fabrication processes where the object being fabbed knows its own specification, we get the same kind of noise tolerance (e.g. genes can cope with errors and still make an organism)

20K buys you a field fab lab: a laser cutter, a sign cutter, a micron-scale milling machine, and a microcontroller programming setup. Microfabrication is now in the same place that computing was about 25 years ago (when minicomputers like the PDP were around). The PC equivalent of a microfabricator is not far off

Fabrication labs at this scale are a disruptive technology. Neil's group have been introducing them into developing countries to see what can be achieved. Answer: all sorts of cool small-scale solutions to local problems

Cory Doctorow All complex ecosystems have parasites

  • AOL chooses to allow spam through despite the cost, because if you solve spam you break email. Uncontrollability is a key element of a fault-tolerant system like email

  • DVDs were developed to be controllable; CDs were not. The result is that if you invest in CDs, you can re-use them as MP3s, ringtones, etc. With an investment in DVDs you never get any increase in the value.

  • The DVD control model is fragile and unscalable; trying to extend it out to other devices – wider DRM – won't work, or will cripple the industry if it does. DRM isn't working now – any movie is available over P2P, despite the huge costs of implementation.

Justin Chapweske, Onion Networks

2 billion dollars a year is spent on HTTP optimisation: load balancers, caches, etc. This is at least partly because HTTP is sub-optimal for the size of the web

  • HTTP is very bad at transferring large (multi-GB) files – packet loss, broken 32-bit apps, etc.
  • One solution is to use very high-quality transports, but it would be better to have a fault-tolerant transport (like RAID for storage)
  • swarming is RAID for the web: it tolerates failures of the transport and failures of servers
  • swarming features: it's a content-delivery system; data is signed and encrypted so you don't need to trust the host you download from; it runs over standard protocols – it's an extension to HTTP (a rough sketch of the idea follows this list)
  • a standard Java HTTP stack replacement is available (open-source)
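To make the 'RAID for the web' idea concrete, here's a toy sketch of swarming-style fetching; it's my own illustration under assumed mirror URLs, not Onion Networks' actual protocol:

```python
# Toy swarming sketch: pull different byte ranges of one file from several
# mirrors over plain HTTP, reassemble, and check a published hash so no
# individual host needs trusting. The mirror URLs are hypothetical.
import hashlib
import urllib.request

MIRRORS = ["http://mirror1.example/big.iso",
           "http://mirror2.example/big.iso"]
CHUNK = 4 * 1024 * 1024   # 4 MB ranges

def fetch_range(url, start, end):
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def swarm_fetch(size, expected_sha256):
    parts = []
    for i, start in enumerate(range(0, size, CHUNK)):
        url = MIRRORS[i % len(MIRRORS)]   # round-robin; a real client retries elsewhere on failure
        parts.append(fetch_range(url, start, min(start + CHUNK, size) - 1))
    blob = b"".join(parts)
    if hashlib.sha256(blob).hexdigest() != expected_sha256:
        raise IOError("reassembled file fails integrity check")
    return blob
```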

Jimmy Wales – Wikipedia & the future of social software

  • 500K entries
  • taxonomy: 350K categories, hierarchical, dynamic
  • 500MM page views / month
  • the original dream of the net – people sharing information freely
  • problems – quality control, author fatigue
  • solution: wiki[pedia|cities] – a social computing successor to 'free homepages'. Uses a free content license so that people can take their content with them if they want to leave
  • sites are maintained by communities rather than by an individual, thus mitigating the risks of quality control and author fatigue
  • wiki software doesn't enforce social rules – for example, the 'votes for deletion' page
  • wikipedia is a social innovation, not a technological one.
  • software which enables collaboration is the future of the net

Panel discussion Folksonomy

Why do companies allow end-users to participate in tagging?

In flickr's case it was primarily done for the individual user and then aggregated; in wikipedia's case it was primarily done for the community. SB (flickr) – folksonomies are not a replacement for a formal taxonomy, they are an addition. JS (del.icio.us) – also started from the assumption that tags were a personal thing, and just let the folksonomy emerge.
Some tags have nothing to do with categorisation, e.g. 'toread' on del.icio.us, even though they are interesting as a social behaviour

flickr / del.icio.us are different to wikipedia, because they start with individual spaces and then aggregate them, whereas WP starts with a shared space and uses negotiation/governance to manage it. The individual approach is less optimal for the social stuff – e.g. people tagging pictures of their trip to Mexico as 'etech' because they went just before the conference – right for the individual, but it breaks the aggregation.

JS: Although you can key tags between del.icio.us / flickr / technorati, it's not always appropriate – the tags mean different things in the different applications

Q: How do you provide feedback to people to improve their tagging? In wikipedia it's easy; in flickr it doesn't matter – the primary purpose of a flickr tag is personal. Also the volume of pics is so great that you don't need a perfect vocabulary. In del.icio.us, there are some tools to help you see which of your tags are also used by others.

SB: formal taxonomies are ultimately limited because (as far as we can tell) the real world isn't easily classified.


Amazon: Interplanetary e-commerce

What kind of problems will Amazon face in delivering retail services to Mars? Or, to put it another way, why is it that we don't think global e-commerce is possible?

We already do some things at massive scale – the internet, mobile phones, chips (multi-billion transistors that all work). There are 1 quadrillion ants on the planet (allegedly)

What do we need to solve the problem of massive scalability? Not just technology, though that may be a necessary precursor. There are only a few systems that can scale up to millions of parallel nodes.

Amazon scale: 47 MM users, 7 websites, 50% is non-US sales. 2.8MM units/day ordered at peak time. 32 orders/second peak. 2MM packages dispatched

Scale ought to be seen as an advantage – the more you scale the more you can sell

Can we use the same engineering techniques to build really large systems that we use for current big systems? Management becomes a big deal; how to cope with unreliability

Real Life scales well - systems need to learn from biology for high fault-tolerance. Biological systems go through continuous refresh - cells are designed to die and be born without affecting the organism as a whole.

Outside monitors are not a good indicator of 'health'. The system should be designed for continuous change, not stability.

Turing's 3 categories of systems:

  • organised (current apps)
  • unorganised (networks)
  • self-organising (biological)

– need to move to self-organisation for massive scalability

Can't expect complete top-down control – since applications won't be deterministic. Real life is not a state machine

Functional units need to be self-organising feedback-centric machines

Comparison point: why are epidemics so robust with respect to message loss / node failure? They can be mathematically modelled in a rigorous way. It works because each node can operate independently if it needs to. As the number of nodes becomes really large, you only need to know about a subset of the system in order to succeed.

Fault detection protocols – on a particular node A, monitor how long it is since another node B updated its state. B does not need to contact A directly, because the state will eventually replicate around the whole system. This needs clear partitioning of data, but then the system becomes highly reliable.
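As a rough illustration of that kind of protocol (my own sketch, not anything Amazon described in detail), a gossip-style heartbeat table lets node A judge node B's health purely from locally replicated state:

```python
# Toy gossip-style failure detection: each node keeps a heartbeat counter per
# peer; counters spread by gossip, so A can declare B suspect from how stale
# B's counter looks locally, without ever contacting B directly.
import random
import time

class Node:
    def __init__(self, name):
        self.name = name
        self.peers = []                            # other Node objects, wired up after creation
        self.table = {name: (0, time.time())}      # name -> (heartbeat counter, when we last saw it increase)

    def beat(self):                                # called periodically by the node itself
        counter, _ = self.table[self.name]
        self.table[self.name] = (counter + 1, time.time())

    def gossip(self):                              # push our table to one random peer
        target = random.choice(self.peers)
        for node, (counter, _) in self.table.items():
            known = target.table.get(node)
            if known is None or counter > known[0]:
                target.table[node] = (counter, time.time())

    def suspects(self, timeout=5.0):               # peers whose counters have gone stale locally
        now = time.time()
        return [n for n, (_, seen) in self.table.items()
                if n != self.name and now - seen > timeout]

# wiring: a, b, c gossip amongst themselves; if b stops calling beat(), it
# eventually appears in a.suspects() without a ever polling b directly.
a, b, c = Node("A"), Node("B"), Node("C")
for n in (a, b, c):
    n.peers = [m for m in (a, b, c) if m is not n]
```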


'Just' use HTTP

Writing about web page http://intertwingly.net/slides/2005/etcon/

Sam's slides are all online so I don't need to annotate everything. Everyone who develops web apps should read them

  • understand unicode: it is an attractive nuisance. Inexperienced developers will screw it up. c.f. the recent punycode (IDN) domain name hijacking bugs in Mozilla

  • The default encoding for HTML is ISO-8859-1; for XML it's UTF-8. This is why you can't put HTML directly into RSS. windows-1252 (the default encoding on Windows) isn't compatible with either (27 differences, mostly around quotes and the euro symbol)
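A small demonstration of that mismatch (my own example, not from the slides): the same bytes round-trip under windows-1252 but turn into C1 control characters under ISO-8859-1, and aren't valid UTF-8 at all.

```python
# windows-1252 vs iso-8859-1 vs utf-8 on the same bytes
text = "“smart quotes” cost €5"
raw = text.encode("cp1252")                    # what a Windows editor typically emits

print(raw.decode("cp1252"))                    # round-trips correctly
print(raw.decode("iso-8859-1"))                # same bytes, now control chars / mojibake
print(raw.decode("utf-8", errors="replace"))   # and it isn't valid UTF-8 either

# the '27 differences': bytes 0x80-0x9f that cp1252 maps to printable
# characters but iso-8859-1 treats as C1 controls
defined = [b for b in range(0x80, 0xA0) if bytes([b]).decode("cp1252", "ignore")]
print(len(defined))                            # 27
```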

  • URIs: the encoding is not defined – it's up to you to document yours clearly to your clients. Equality of URIs is not well-defined; the CLR Uri.Equals method is broken.
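For illustration (my own sketch, not from the talk), here are URI pairs that identify the same resource under RFC 3986 normalisation rules but fail a naive string comparison:

```python
# Naive URI equality vs. a crude normalisation pass
from urllib.parse import urlsplit, unquote

pairs = [
    ("http://example.com/~sam", "http://example.com/%7Esam"),  # %7E vs ~
    ("HTTP://Example.COM/a",    "http://example.com/a"),       # scheme/host case
    ("http://example.com:80/a", "http://example.com/a"),       # default port
]

def crude_normalise(uri):
    # deliberately simplistic: lower-case scheme/host, drop default port,
    # unescape the path (real normalisation must not unescape reserved chars)
    parts = urlsplit(uri)
    port = f":{parts.port}" if parts.port and parts.port != 80 else ""
    return f"{parts.scheme}://{parts.hostname}{port}{unquote(parts.path)}"

for a, b in pairs:
    print(a == b, crude_normalise(a) == crude_normalise(b))    # False True, each time
```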

  • layering is problematic e.g. the rules for encoding a URI don't apply in an XML document (can't encode a ~ as %7e )

  • RSS/Atom: Lots of unanswered questions / ill-defined points in the spec

  • Layering is the problem, not the solution. Layered designs inherit the bugs from all layers.

Questions:
Q: If everything is broken, how come it still works? A: because people are very fault-tolerant. The more machine-machine communication you have the more problematic it becomes.
Q: Are web services genuinely better than HTTP, or just newer? A: because the client stacks have (sometimes) been written with the spec to hand, they're generally more reliable.
Q: How do you avoid the attractive nuisance problem e.g. when writing the atom spec? A: make the spec force people to think about the problems e.g. by specifying the content-encoding for specific element types.


Creating a new web service at google

Nelson Minar

The Google AdWords API

  • AdWords: campaign management is done via a web app. Advertisers select keywords applicable to their ad
  • Hierarchical data model. advertisers->campaigns->keywords
  • API goals – allow developers to integrate with the platform
  • 3rd party companies springing up to tweak keywords for maximum efficiency, or make alternative UIs
  • Smart companies integrating their back-office systems with their ad campaigns e.g. when stock runs out, pause the ads
  • Features: Campaign management, reporting functions, traffic estimator
  • Technologies: SOAP/WSDL over SSL. Quota system; multiple authentication mechanisms (proxying / remote management)
  • consultancy and toolkit vendors are starting to spring up.

  • Uses SOAP 1.1 + the WS-I basic profile
  • objective is to make the integration as simple as possible for a WSDL-enabled application. For a good platform an API call should be 2 lines (make a proxy, call the method)
  • uses document/literal SOAP rather than RPC-oriented: doc/lit is closer to ReST/Atom – it's just passing documents about (see the sketch after this list)
  • Doc/lit soap requires good xml—>native object bindings. Poor binding is a frequent cause of interop problems
  • Reality: Interop is still hard; WSDL support varies by toolkit; doc/lit support likewise
  • Good platforms: .NET, Java (axis). OK: C++ (gSOAP), Perl (SOAP::Lite) Not good: Python (SOAPpy, ZSI), PHP
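Stripped of the toolkits, a document/literal call is just an XML document POSTed over HTTPS with an XML document coming back. The endpoint, namespace and element names below are invented for illustration; this is not the real AdWords service:

```python
# Hypothetical doc/lit SOAP call: build an envelope, POST it, parse the reply.
import urllib.request
import xml.etree.ElementTree as ET

ENDPOINT = "https://example.com/api/v1/CampaignService"   # hypothetical
NS = "https://example.com/api/v1"                         # hypothetical

envelope = f"""<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getCampaigns xmlns="{NS}">
      <accountId>12345</accountId>
    </getCampaigns>
  </soap:Body>
</soap:Envelope>"""

req = urllib.request.Request(
    ENDPOINT,
    data=envelope.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8", "SOAPAction": ""},
)
with urllib.request.urlopen(req) as resp:
    body = ET.fromstring(resp.read())
for campaign in body.iter(f"{{{NS}}}campaign"):
    print(campaign.findtext(f"{{{NS}}}name"))
```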

interop hazards

  • nested complex objects
  • polymorphic objects
  • optional fields
  • overloaded methods
  • xsi:type: Since clients keep getting them wrong it's easier to just not bother
  • WS-* – only Sun and MS support it
  • doc/lit support is weak in scripting languages
  • or you could just parse the XML yourself.

Why not just use ReST?

  • Easy to use
  • tinkerable
  • high ReST – use the HTTP verbs to build the app, use meaningful URL paths, use XML only as a document (payload), use HTTP headers for metadata (see the sketch after this list)
  • Nelson treats POST as update (c.f. Ben yesterday who considered it to be create)
    limitations
  • lack of support for PUT/DELETE from browser – poorly tested in caches
  • limited standardisation for error codes
  • browsers can't cope with URLs more than 1000 chars
  • you've got to do your own databindings – no WSDL
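Here's a minimal sketch of the 'high ReST' style described in the list above; my own illustration against a hypothetical service, not the actual AdWords interface:

```python
# High ReST sketch: verbs carry the action, the URL path names the resource,
# the payload is just an XML document. The base URL is hypothetical.
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://example.com/api"

def get_campaign(campaign_id):
    # GET /campaigns/<id> returns an XML representation of the resource
    with urllib.request.urlopen(f"{BASE}/campaigns/{campaign_id}") as resp:
        return ET.fromstring(resp.read())

def update_campaign(campaign_id, xml_doc):
    # POST-as-update, as Nelson uses it (others would use PUT here)
    req = urllib.request.Request(
        f"{BASE}/campaigns/{campaign_id}",
        data=xml_doc,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```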

bottom line

  • For complex data the XML is what matters and it doesn't make much difference if it's doc/lit soap or ReST
  • for read-mostly apps, ReST is best
  • need better tooling

Lessons learned
– good things:

  • doc/lit
  • stateless design
  • developer reference guide
  • developer tokens
  • interop testing
  • private beta period
  • batch-oriented methods – specify an array of IDs and get back multiple XML entities. Big speedups. Makes error semantics harder, and messages larger

– bad things:
  • doc/lit switch was expensive
  • lack of a common/clear data model
  • dates / TZs are weird – SOAP dates are GMT but Google works on PST
  • no gzip encoding
  • quota confusion / anxiety
  • no sandbox
  • SSL – hard to sniff, XML dumps aren't publishable because they contain plain-text passwords, slow. Note to self: we should use a 1-way hash or something for our APIs (see the sketch after this list)

  • Make sure your SOAP is well validated and clean: test interop. Distributing a client library is worthwhile
  • need good developer support – docs, samples, FAQ, debugging instructions, community
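On the plain-text password point above, a sketch of the 'one-way hash' idea (my own note, with made-up names): store a salted hash so logs and XML dumps never need to contain the raw credential.

```python
# Salted one-way hashing of a credential so dumps/logs stay publishable.
import hashlib
import hmac
import os
from typing import Optional, Tuple

def hash_credential(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); store these instead of the plain-text password."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest

def verify_credential(password: str, salt: bytes, expected: bytes) -> bool:
    _, digest = hash_credential(password, salt)
    return hmac.compare_digest(digest, expected)

salt, stored = hash_credential("s3cret")
print(verify_credential("s3cret", salt, stored))   # True
print(verify_credential("wrong", salt, stored))    # False
```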
