Saturday, October 02, 2004

Service Reliability

There is still a mindset among operators that their responsibility is for the server rather than the service.

Recently my ISP was briefly blacklisted because it failed to comply with DSBL requirements. For about a week, I could not send emails to certain contacts. Although the ISP has a mechanism for notifying users about server problems, it apparently did not occur to anyone there to notify users about this service problem.

Service reliability is an important issue, and it's still hard (and costly) to run a web service reliably. Phil Windley recommends that the responsibility be assigned to a product engineer:

an engineer on the operations side, whose job it is to make the product (not just the server) work. Properly incented, a product engineer will drive all of the emergency and contingency planning, along with ensuring that engineering delivers a system that can be reliably operated. (blog) (pdf)

But reliable and robust service requires management attention as well as operational attention. This is a matter of business continuity, not just IT continuity.
