Friday, 25 January 2019

Impromptu Hack Day!

Yesterday we were on a company away day: a day of talks and workshops from speakers both internal and external, covering a range of subjects like technology, design, data science and psychology.

All very interesting, except our team missed a couple of hours in the middle of the day because we were dealing with a thing. But we still had an interesting session where we did a lot of learning!

Ok. Full disclosure. We did kind of cause the thing in the first place. And in the interest of embracing failure and learning from it, I'm sure nobody will mind me blogging about this...

We've just rebuilt our login page so not only does it look nice, it's also easier to maintain and deploy. There's not a lot you can do to a login page so version 2.0 basically mirrors the functions of the old one, including the 'Remember my username' button. Except we have done a complete rewrite behind the scenes and built this function slightly differently. (Local Storage instead of a cookie, if you're interested.)

On the face of things, it behaves exactly the same as on the old page. But we'd overlooked the fact that if a user had ticked 'remember me' on the old login page, they wouldn't be remembered when we rolled out the new one. This doesn't seem like a huge problem, but it resulted in four times as many people hitting the 'I've forgotten my username' journey or phoning the contact centre because they couldn't log in. Not great when it's a few thousand confused users on a Thursday. Definitely not ideal when you're expecting a few hundred thousand users on a Saturday afternoon!

Lesson 1: We built the new page to be better than the old, but completely missed a tiny detail of the transition between the two. We made sure all existing functionality was preserved. But we hadn't spotted the new user journey that we had introduced. Old world to New.
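
As a rough illustration of that old-world-to-new journey, here's a minimal TypeScript sketch of a fallback read: look in the new store first, and if nothing is there, check the old cookie and copy the value across. It's an illustration rather than the code we shipped, and the cookie name, storage key and function names are all made up for the example.

```typescript
// Hypothetical names: the real cookie name and storage key are not given in this post.
const LEGACY_COOKIE_NAME = "rememberedUsername";
const STORAGE_KEY = "rememberedUsername";

// Minimal subset of the Web Storage API, so the sketch also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Pull a single cookie value out of a document.cookie-style string.
function readCookie(cookieString: string, name: string): string | null {
  for (const part of cookieString.split(";")) {
    const trimmed = part.trim();
    const eq = trimmed.indexOf("=");
    if (eq === -1) continue;
    if (trimmed.slice(0, eq) === name) {
      try {
        return decodeURIComponent(trimmed.slice(eq + 1));
      } catch {
        return null; // malformed encoding: treat as "not remembered"
      }
    }
  }
  return null;
}

// Prefer the new store; fall back to the old cookie and migrate it once.
function resolveRememberedUsername(
  cookieString: string,
  store: KeyValueStore
): string | null {
  const current = store.getItem(STORAGE_KEY);
  if (current !== null) return current;

  const legacy = readCookie(cookieString, LEGACY_COOKIE_NAME);
  if (legacy !== null && legacy !== "") {
    store.setItem(STORAGE_KEY, legacy); // copy across so the cookie can be retired later
    return legacy;
  }
  return null;
}
```

In a browser you would call something like resolveRememberedUsername(document.cookie, window.localStorage) when the login page loads. Passing the cookie string and the store in as arguments is just a choice to make the old-to-new path easy to test on its own.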

Lesson 2: That tiny missed detail caused a spike in a completely different part of the site. Not only had we missed this part of the journey, we hadn't realised what the knock-on effects of missing it would be. In this case it was obvious why there was an increase in support calls and emails, but next time it might not be so obvious.

Lesson 3: Nobody was blamed for this. It wasn't Test's fault for not spotting it. It wasn't Dev's fault for not building it. It wasn't Product's fault for not asking for it. The team as a whole missed it. We all felt like chumps. But nobody got the sack.

So, round a coffee table on a few sofas outside the hotel meeting rooms, we jumped on the VPN and started working on a solution. We had devs and devs pairing, devs and testers pairing, product owners and devs looking at monitoring and stats to work out priority and impact. We were doing real-time code reviews by literally looking over each other's shoulders and having out-loud, face-to-face conversations. It was brilliant!

Within a couple of hours we had fixes in place in a couple of different codebases and had both deployed to our test environment. We even uncovered an edge case in the initial fix and got a fix in for that too!

Lesson 4: Why don't we work like this all the time!!?

Here's the functional spec/acceptance criteria/test plan I was working from:

  • No prizes for guessing which hotel we were in!
  • Yes, that's chilli smudged in the middle of it. We worked through lunch.
  • No. It's not neat. But I can (just about) read my own writing.
  • No. It's not in any kind of fixed or recognisable format.
  • It's not in a Test Case Management system. It's not even in Excel!

Lesson 5: It's always useful to write stuff down. You can keep track of things and organise your thoughts. But you don't need to stick to a format or a system, as long as you're communicating well within your team and everyone understands what's happening.

We managed to get a fix out that afternoon, thanks to a lot of work we had already done around our pipelines and deploy process. The graphs quickly settled to normal levels and the phones in the contact centre stopped ringing.

Lesson 6: It's all well and good to work quickly, but you need to be able to get code into live quickly too. It allows you to be more reactive to change, means you can ship smaller things more often and gives you a shorter time to recovery. Live deploys should be quick, easy and not a big deal.

Lesson 7: Monitor all the things! Without stats on how many people hit the 'I've forgotten my username' button, we wouldn't have known there was a problem. And we wouldn't know that we'd actually fixed the problem afterwards.
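
To make that concrete, here's a tiny TypeScript sketch of the kind of check a stat like this makes possible: compare the latest count of 'forgotten username' hits against a recent baseline and flag it when it jumps. The function name, the threshold and the idea of using a simple mean are assumptions for the example, not a description of our actual monitoring.

```typescript
// Hypothetical helper: flags the latest count when it is at least `factor`
// times the mean of the earlier counts. The default threshold is arbitrary.
function isSpike(history: number[], latest: number, factor: number = 3): boolean {
  if (history.length === 0) return false; // no baseline, so nothing to compare against
  const mean = history.reduce((sum, n) => sum + n, 0) / history.length;
  return mean > 0 && latest >= mean * factor;
}
```

With a baseline of roughly 100 hits per hour, a jump to 400 would be flagged and a wobble to 120 would not. The same check, run after a deploy, is also what tells you the numbers have settled back down.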

Lesson 8: See Lesson 4. Why don't we work like this all the time?
