November 06, 2007

Web 2.0 operations

Artur Bergman – Wikia

  • Operations builds trust in a brand.
  • Ops is the stepchild of engineering – development gets the glory.
  • To do good ops, you need sysadmins staffing NOCs etc., and engineers.
  • Good engineers:
      • Detail oriented
      • Don’t aspire to work in development
      • Should work with development though, so that both sides can learn about each other
  • You must understand the product to run it successfully. Complexity kills
  • Rule 3: Look first. You must have good monitoring and good alerting.
  • Do not over-alert, and do not cry wolf.
  • Trending – long-term capacity planning. Make alerts timely – there’s no point alerting at 90% on a 10TB partition that grows at 1MB a day
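The trending point can be made concrete: alert on time-to-full rather than on a fixed percentage. A minimal sketch (the partition sizes below just mirror the example in the notes):

```python
# Estimate how long until a partition fills, so alerts fire on
# time-to-full rather than on a fixed percentage threshold.

TIB = 2 ** 40
MIB = 2 ** 20

def days_until_full(size_bytes, used_bytes, growth_bytes_per_day):
    """Return the number of days before the partition is full."""
    free = size_bytes - used_bytes
    return free / growth_bytes_per_day

# The example from the notes: a 10 TiB partition at 90% used,
# growing 1 MiB a day. A 90% threshold would fire about a
# million days before the disk actually fills.
days = days_until_full(10 * TIB, 9 * TIB, 1 * MIB)
print(int(days))  # 1048576
```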
  • WebSitePulse – an alternative to Gomez.
  • Nagios – doesn’t scale well for large installations; doesn’t keep state; over-alerts
  • Hyperic – looks much nicer
  • Cricket / MRTG / Cacti – impossible to configure
  • Ganglia – rocks – no configuration
  • Load increasing while the count of running processes stays the same reveals blocking calls
  • custom gmetric scripts – can collect more or less anything
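A gmetric script can push any value you can measure into Ganglia. A minimal sketch, assuming the `gmetric` binary is on the PATH; the metric name and the loadavg parsing are illustrative:

```python
import subprocess

def parse_loadavg(line):
    """Extract the 1-minute load average from a /proc/loadavg line."""
    return float(line.split()[0])

def report_to_ganglia(name, value):
    """Push a single metric into Ganglia via the gmetric CLI."""
    subprocess.call([
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", "float",
    ])

# Example /proc/loadavg contents (1/5/15-min load, procs, last pid)
sample = "0.42 0.36 0.30 1/123 4567"
load = parse_loadavg(sample)
# report_to_ganglia("load_one", load)  # uncomment on a box running gmond
```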
  • graphing stuff is a very good way of spotting problems.
  • tcpdump / wireshark – tells you where packets are going wrong.
  • Rule 4: Divide and conquer. Look at the problems in turn; go in the order you suspect is most likely.
  • change one thing at a time, and keep an audit trail. use version control
  • If you didn’t fix it, it ain’t fixed. If it goes away, it will come back and bite you later. Figure out what happened.
  • use strace/dtrace/truss/gdb
  • you need a little bit of process.
  • Design against complexity. Re-use components, define standards. Have a few machine images, re-image all machines periodically for the hell of it
  • MTBF is irrelevant. Dealing with failure is more important. Target the right level of uptime.
  • Don’t kid yourself. You don’t need 5 nines
  • The higher you aim for reliability, the higher complexity and cost you’ll get. If you need 2 nines and aim for 5, you’ll do worse than if you just aim for 2
  • MTTR – a much better metric. 1 minute downtime every week is still 4 nines
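The arithmetic behind that claim:

```python
# One minute of downtime a week, expressed as availability ("nines").

SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800

def availability(downtime_seconds, period_seconds):
    return 1.0 - downtime_seconds / float(period_seconds)

a = availability(60, SECONDS_PER_WEEK)
print("%.5f" % a)  # 0.99990 -- i.e. four nines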
  • Problem management: Once a problem is found, start a phone conference. Use IRC or IM to communicate technical info. Have one person liaise with non-technical people, and one person be in command.
  • Write down results of root cause analysis
  • Automation: All machines are created equal. If you manually make changes, you are wrong (usually)
  • Best practice:
      • Gold images
      • Centralised authentication
      • NTP time sync
      • Central logging
      • All of this applies to virtual machines too!
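Central logging with stock syslog is a one-line client config. A sketch; the host name `loghost` is illustrative, and the receiving syslogd must be started with remote reception enabled:

```
# /etc/syslog.conf on each machine: forward everything to the log host.
*.*    @loghost
```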
  • cfengine – Puppet is much nicer
  • cobbler – linux version of jumpstart
  • Datacentres: keep them tidy; label everything. Have a switch in each rack if you can. Remote consoles / remote power switches save a lot of pain.
  • Virtualisation: use it. Managing becomes much easier, power consumption goes down, and new test boxes can be quickly provisioned
  • Load balancers: keep them simple and low-level. LVS + Squid CARP. Log over UDP so you don’t block if the disk is full
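The point of logging over UDP is that a datagram send does not block the way a write to a full filesystem can. A minimal sketch; the address and message format are illustrative:

```python
import socket

def send_log(message, addr=("127.0.0.1", 5140)):
    """Fire-and-forget a log line over UDP; never blocks on a full disk."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        return sock.sendto(message.encode("utf-8"), addr)
    finally:
        sock.close()

sent = send_log("GET /wiki/Main_Page 200 0.031s")
```

If nothing is listening, the datagram is simply dropped, which is exactly the trade-off being recommended: losing a log line beats blocking the service.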
  • Squid: use it for static content; use Squid with a very short TTL (1 second) for non-logged-in users and dynamic pages.
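One way to get that short-TTL behaviour is to have the application emit cache headers only for anonymous requests; Squid then serves the same dynamic page to every anonymous visitor for a second. A sketch; the cookie name `session_id` is illustrative:

```python
def cache_headers(request_cookies):
    """Cacheable for 1s for anonymous users; never cached when logged in."""
    if "session_id" in request_cookies:
        return {"Cache-Control": "private, no-cache"}
    return {"Cache-Control": "public, max-age=1"}

print(cache_headers({}))                   # anonymous: cache for 1 second
print(cache_headers({"session_id": "x"}))  # logged in: do not cache
```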
  • Databases: report slow queries. Fear ORMs; understand what they are doing
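MySQL (current at the time of these notes) can report slow queries itself. A sketch of the my.cnf fragment; the log path is illustrative:

```
# /etc/my.cnf -- log statements that take longer than one second
[mysqld]
log-slow-queries = /var/log/mysql/slow.log
long_query_time  = 1
```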
  • If you’re small, outsource: use a CDN
  • S3 for binlogs, datafiles, etc
