Amazon: Interplanetary e-commerce
What kind of problems will Amazon face in delivering retail services to Mars? Or to put it another way, why is it that we don't think global e-commerce is possible?
We already do some things at massive scale – the internet, mobile phones, chips (billions of transistors that all work). There are 1 quadrillion ants on the planet (allegedly).
What do we need to solve the problem of massive scalability? Not just technology, though that may be a necessary precursor. There are only a few systems that can scale up to millions of parallel nodes.
Amazon scale: 47MM users, 7 websites, 50% of sales are non-US. 2.8MM units/day ordered at peak time. 32 orders/second at peak. 2MM packages dispatched.
Scale ought to be seen as an advantage – the more you scale, the more you can sell.
Can we use the same engineering techniques to build really large systems that we use for current big systems? Management becomes a big deal, as does coping with unreliability.
Real life scales well – systems need to learn from biology for high fault-tolerance. Biological systems go through continuous refresh – cells are designed to die and be born without affecting the organism as a whole.
Outside monitors are not a good indicator of 'health'. Systems should be designed for continuous change, not stability.
Turing's 3 categories of systems:
- organised (current apps)
- unorganised (networks)
- self-organising (biological)
– need to move to self-organisation for massive scalability
We can't expect complete top-down control, since applications won't be deterministic. Real life is not a state machine.
Functional units need to be self-organising, feedback-centric machines.
Comparison point: why are epidemics so robust with respect to message loss and node failure? This can be mathematically modelled in a rigorous way. It works because each node can operate independently if it needs to. As the number of nodes becomes really large, you only need to know a subset of the system in order to succeed.
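To make the epidemic point concrete, here is a minimal sketch of push-style gossip in Python. It is my own illustration, not anything shown in the talk: the function name `gossip_rounds` and all the parameters (`fanout`, `loss`, `dead_fraction`) are assumptions. Each informed node tells a few randomly chosen nodes per round, some messages are dropped, and some nodes are simply dead.

```python
import random


def gossip_rounds(n_nodes=1000, fanout=2, loss=0.3, dead_fraction=0.1,
                  max_rounds=200, seed=0):
    """Return the number of rounds until every live node has heard the
    rumour, or None if that does not happen within max_rounds.

    All names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    # A fixed fraction of nodes has failed: they never receive or forward.
    dead = set(rng.sample(nodes, int(n_nodes * dead_fraction)))
    alive = [i for i in nodes if i not in dead]
    informed = {alive[0]}  # the rumour starts at one live node

    for round_no in range(1, max_rounds + 1):
        newly_informed = set()
        for _sender in informed:
            # Each informed node only knows (picks) a small random subset.
            for target in rng.sample(nodes, fanout):
                if target in dead or rng.random() < loss:
                    continue  # message wasted on a dead node or lost in transit
                newly_informed.add(target)
        informed |= newly_informed
        if len(informed) == len(alive):
            return round_no
    return None
```

No node needs a global view or a reliable channel here. Losing a message or a node only slows the spread, because many other senders will reach the same target in later rounds.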
Fault detection protocols – monitor on a particular node A how long it has been since another node B updated its state. B does not need to contact A directly because the state will eventually replicate around the whole system. Need clear partitioning of data, but then the system becomes highly reliable.
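The fault detection idea can be sketched in a few lines of Python. This is a guess at the general shape of a gossip-style heartbeat table, not the protocol described in the talk; `HeartbeatTable`, `beat`, `merge`, `suspected` and the `timeout` parameter are all hypothetical names.

```python
class HeartbeatTable:
    """Per-node view of everyone's heartbeat counters (illustrative only)."""

    def __init__(self, node_id, timeout):
        self.node_id = node_id
        self.timeout = timeout
        # peer id -> (highest heartbeat counter seen, local time it was seen)
        self.entries = {}

    def beat(self, now):
        """Bump this node's own counter; called periodically while alive."""
        counter, _ = self.entries.get(self.node_id, (0, now))
        self.entries[self.node_id] = (counter + 1, now)

    def merge(self, remote_entries, now):
        """Fold in a table gossiped by any peer, keeping the newest counters."""
        for peer, (counter, _) in remote_entries.items():
            known, _ = self.entries.get(peer, (-1, now))
            if counter > known:
                # Fresh news about this peer: stamp it with our local clock.
                self.entries[peer] = (counter, now)

    def suspected(self, now):
        """Peers whose counter has not advanced within the timeout."""
        return sorted(
            peer for peer, (_, seen) in self.entries.items()
            if peer != self.node_id and now - seen > self.timeout
        )
```

In use, A only ever merges tables from some third node C, yet it still learns about B's heartbeats second-hand. When B stops beating, its counter stops advancing in every table that reaches A, and B shows up in `a.suspected(now)` once the timeout has passed.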
Chris May
Steve Rumsby
I read an article a little while ago (can't find it now – this was before I started furling everything) about the systems behind Google. I forget the details but their systems are based around small, cheap PCs (off-the-shelf & easy to replace and keep spares of) and are built on the assumption that several will fail per day. When that happens, they just plug in a new one, boot it up, and the system is designed to populate it with whatever part of the distributed database was on the failed one, and within a very short time it is able to serve requests again.
16 Mar 2005, 09:20
Chris May
It's true. What's even cooler is that Google are expanding so fast that when an individual node dies, they don't even bother to remove it – they just leave it to rot in its rack. In many cases they don't even know whether a particular node is running or not – just that a certain percentage are available/out.
16 Mar 2005, 16:01