{"id":2330,"date":"2015-05-27T20:17:28","date_gmt":"2015-05-28T03:17:28","guid":{"rendered":"https:\/\/mathpirate.net\/log\/?p=2330"},"modified":"2015-05-27T21:51:46","modified_gmt":"2015-05-28T04:51:46","slug":"to-the-cloud","status":"publish","type":"post","link":"https:\/\/mathpirate.net\/log\/2015\/05\/27\/to-the-cloud\/","title":{"rendered":"To The Cloud!"},"content":{"rendered":"<p>About two years ago, the higher ups at my mid-sized web company had an idea.\u00c2\u00a0 Pull together a small team of devs, testers, IT, and even a corporate apps refugee, stick them in a small, dark room, and have them move the entire infrastructure that powered a $300 million a year business into the cloud, within six months.<\/p>\n<p>We were terrified.<\/p>\n<p>We had no idea what we were doing.<\/p>\n<p>But somehow, we managed to pull it off.<\/p>\n<p>It wasn&#8217;t easy.\u00c2\u00a0 A lot of things didn&#8217;t work.\u00c2\u00a0 Many shortcuts were taken.\u00c2\u00a0 Mistakes were made. \u00c2\u00a0We learned a lot of things along the way.<\/p>\n<p>And so now, two years on, it seems like a good time to take a look at how we got to where we are and what we&#8217;ve discovered.<\/p>\n<h1>Where We Started<\/h1>\n<p>When we began the cloud migration process, we lived in two data centers.\u00c2\u00a0 One on the west coast, one on the east coast.\u00c2\u00a0 For the most part, they were clones of one another.\u00c2\u00a0 Each one had the same number of servers and the same code deployed.\u00c2\u00a0 This was great for disaster recovery or maintenance.\u00c2\u00a0 Release in the west?\u00c2\u00a0 Throw all our traffic into the east!\u00c2\u00a0 Hurricane in the east? \u00c2\u00a0Bring all the traffic across the country.\u00c2\u00a0 Site stays up, user is happy.\u00c2\u00a0 It&#8217;s not so great for efficiency.\u00c2\u00a0 The vast majority of the time, our data centers were load balanced.\u00c2\u00a0 We&#8217;d route\u00c2\u00a0users to the data center that would give them the fastest response time.\u00c2\u00a0 Usually, that ended up being the DC closest to them.\u00c2\u00a0 Most of our users were on the east coast, or in Europe or South America.\u00c2\u00a0 As a result, our east coast servers would routinely see 4-5x the traffic load of the boxes on the west coast.\u00c2\u00a0 That meant that we might have 60 boxes in Virginia running at 80% CPU, while the twin cluster of 60 boxes in Seattle is chillin&#8217; at less than 20%.\u00c2\u00a0 And when we need to add capacity in the east, we&#8217;d have to add the same amount of boxes to the west.\u00c2\u00a0 That&#8217;s a lot of wasted processing power.\u00c2\u00a0 And electrical power.\u00c2\u00a0 And software licenses.\u00c2\u00a0 Basically, that&#8217;s a lot of wasted money.<\/p>\n<p>We had virtualized our data centers a few years prior, and while that was a huge step forward over having rack after rack of space heaters locked in a cage, it still wasn&#8217;t the freedom we were promised.\u00c2\u00a0 Provisioning a server still took\u00c2\u00a0a couple of days to get through the process.\u00c2\u00a0 We&#8217;d routinely have resource conflicts, where one runaway box would ruin performance for everything else on the VM host.\u00c2\u00a0 There was limited automation, so pretty much anything you wanted to do involved a manual step where you had to hop on the server and install something and configure something else by hand.\u00c2\u00a0 And if we ran out of capacity on our VM hosts, there&#8217;d be a frantic &#8220;We gotta shut something down&#8221; 
All this meant that we were heavily invested in our servers. We knew their names, we knew their quirks, and if one had a problem, we'd try to revive it for days on end. Servers were explicitly listed in our load balancers, our monitoring tools, our deployment tools, our patch management tools, and a dozen other places. Serious time and money went into each server, so they mattered to us. It was a big deal to create a new box, so it was a big deal to decide to rebuild one.

Our servers were largely Windows (still are). That meant that in the weeks following Patch Tuesday, we'd go through a careful and tedious process of patching several thousand servers to prevent some Ukrainian Script Kiddie from replacing our cartoon dog mascot with a pop-up bomb for his personal "Anti-virus" ransomware. And, of course, that process wasn't perfect. Box 5 in cluster B wouldn't reboot. Box 7 ended up with a corrupted configuration. And box 12 just flat out refused to be patched because it was having a rough day. So hey, now there's a day or two of cleaning up that mess. Hope no one noticed that box 17 was serving a Yellow Screen of Death all day and didn't get pulled from the VIP!

Speaking of VIPs, we had load balancers and firewalls and switches and storage and routers and all manner of other "invisible" hardware that loved to fail. Check the "Fast" checkbox on the load balancer and it would throw out random packets and delay requests by 500 ms. The firewall would occasionally decide to be a jerk and block the entire corporate network. The storage server would get swamped by some forgotten runaway scheduled process, and that would lead to a cascade of troubles that would knock a handful of frontend servers out of commission. Every day, between 9 AM and 1 PM, traffic levels would be so high that we'd overload a bottleneck switch if we had to swing traffic. And don't even get me started about Problem 157. No one ever figured out Problem 157.

In short, life before was a nightmare.

# Where We Are Now

Today, we're living in three regions of AWS. We've got auto-scaling in place, so we're only running the number of servers we need. Our main applications are scripted, so there's no manual intervention required to install or configure anything, and we can have new servers running within minutes.
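To give a feel for the moving parts, here's a minimal sketch of that kind of setup using boto3, the AWS SDK for Python. This isn't our actual tooling; every name, ID, and size below is a made-up placeholder.

```python
# Sketch of an auto-scaled cluster with boto3 (AWS SDK for Python).
# All names, IDs, and sizes are hypothetical placeholders.
import boto3

autoscaling = boto3.client('autoscaling', region_name='us-east-1')

# A launch configuration ties a pre-built, pre-tested image to an instance type.
autoscaling.create_launch_configuration(
    LaunchConfigurationName='frontend-lc-v42',
    ImageId='ami-12345678',   # placeholder image ID
    InstanceType='m3.large',
)

# The auto-scaling group keeps the cluster at the size we need, registers
# new boxes with the load balancer, and replaces any instance that fails
# its ELB health check.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName='frontend-cluster',
    LaunchConfigurationName='frontend-lc-v42',
    MinSize=10,
    MaxSize=60,
    DesiredCapacity=20,
    LoadBalancerNames=['frontend-elb'],
    HealthCheckType='ELB',
    HealthCheckGracePeriod=300,
    AvailabilityZones=['us-east-1a', 'us-east-1b'],
)

# Scale out when load climbs; a CloudWatch alarm would trigger this policy.
autoscaling.put_scaling_policy(
    AutoScalingGroupName='frontend-cluster',
    PolicyName='scale-out-on-cpu',
    AdjustmentType='ChangeInCapacity',
    ScalingAdjustment=4,
    Cooldown=300,
)
```

The point of the health-check-driven group is exactly the "disposable server" model described next: a box that goes bad is terminated and replaced without anyone logging in.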
The servers in all of our main clusters are disposable. If something goes wrong, kill it and build a new one. There's no point in spending hours tracking down a problem that could've been caused by a cosmic ray when it takes two clicks and a couple of minutes to get a brand new box that doesn't have the problem. (That's not to say the code is magically bug-free. There are still problems. The cloud doesn't cure bad code.) We pre-build images for deployment, run those images through a test cycle, then build new boxes in production using the exact same images, with only a few minor config tweaks. We even have large portions of our infrastructure scripted.

Not too long ago, I needed to do a release to virtually every cluster in all regions. Two years ago, this would have been a panic-inducing nightmare. It would have involved a week of planning across multiple teams. We would've had to get director-level sign-off. We would have needed a mile-long backout plan. And it would have been done in the middle of the night and would have taken a team of six people about eight hours to complete. When I had to do it in our heavily automated cloud world, I sent out a courtesy e-mail to other members of my team, then clicked a few buttons. The whole release took two hours (three if you count changing the scripts and building the images), and most of it was completely automatic. It could have taken considerably less than two hours, but I was being exceptionally cautious and doing it in stages. And did I mention that I did this on a Tuesday afternoon during peak traffic, without downtime or a maintenance window, all while working from home?

To recap, I rebuilt almost every box we have, in something like ten separate clusters, across three regions. I didn't have to log into a single box, and I didn't have to debug a single failure. The ones that failed terminated themselves, and new boxes replaced them, automatically.
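For the curious, that bake-an-image, replace-the-fleet flow can be sketched with the same SDK. Again, this is illustrative boto3 rather than our real pipeline: the instance and group names are hypothetical, and the real thing waits on health checks instead of a timer.

```python
# Sketch of a bake-image, roll-the-fleet release (boto3).
# Instance IDs, names, and timings are illustrative placeholders.
import time
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
autoscaling = boto3.client('autoscaling', region_name='us-east-1')

# 1. Bake a new image from a fully configured builder instance.
image_id = ec2.create_image(
    InstanceId='i-0abc1234',          # hypothetical builder box
    Name='frontend-2015-05-27',
)['ImageId']
ec2.get_waiter('image_available').wait(ImageIds=[image_id])

# 2. Point the cluster at the new (already tested) image.
autoscaling.create_launch_configuration(
    LaunchConfigurationName='frontend-lc-v43',
    ImageId=image_id,
    InstanceType='m3.large',
)
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='frontend-cluster',
    LaunchConfigurationName='frontend-lc-v43',
)

# 3. Roll the fleet: terminate old boxes one at a time and let the group
#    replace them from the new image. Replacements that fail never enter
#    service; healthy ones take over automatically.
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=['frontend-cluster'])['AutoScalingGroups'][0]
for instance in group['Instances']:
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance['InstanceId'],
        ShouldDecrementDesiredCapacity=False,  # keep capacity; spawn a replacement
    )
    time.sleep(120)  # crude stagger; real tooling waits for ELB health checks
```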
The boxes were automatically added to our load balancers and our monitoring system, and once the new boxes were in service, the old boxes removed themselves from the load balancers and monitoring systems. While this was going on, I casually watched some graphs and worked on other things.

This is a pretty awesome place to be.

And we still have ideas to make it even better.