My path to SRE


Wherein I describe my path from grungy 12-year-old, to mediocre software dev, through the corporate IT/DevOps hellscape, onwards to more interesting SRE positions.

During a recent trip to my hometown, a friend and ex-coworker was curious about how I became an SRE.

To give a little background, at the company where we worked together, I was probably the only dev-heavy SRE. My profile resembles a dev’s more so than a sysadmin’s (the other most common profile among SREs).

Pre-university days

Growing up, I was interested in computers, entirely in the Windows + PC gaming sphere. For those who haven’t had the childhood experience of the early 2000s in third-world countries, playing multiplayer PC games at Internet cafes was a primo activity among the town youth. My uncle introduced me to his lifelong school friend who was now running one of these cafes, and this led to an extensive Counter-Strike habit (that persisted into my early 20s, and expanded to include Dota, Diablo, Team Fortress, and Quake among others).

Shortly afterwards I got my own computer at home (Windows 2000, Pentium 3), where I played mostly first-person shooters. I remember that after I received a copy of Need for Speed Underground for Christmas, we took my computer to the shop to get a better graphics card.

Several years later I convinced my parents to let me assemble my own computer. I would trawl hardware forums (HardOCP, Overclock.net) and Newegg for days and days, planning builds. I ordered the parts off NCIX.com. It was a good computer and I spent a summer building gaming computers for most of my friends.

University days

Due to the gaming stuff, I was planning on studying game design at university. Another uncle convinced me that an electrical engineering degree would have a wider range of career paths, and I trusted that advice.

During my first few CS courses (electrical, computer, and software eng were in the same department as is somewhat common), I was introduced to Linux.

Unfortunately it was in the hacky way that TAs with a tenuous grasp on fundamentals need to employ to teach a group of 100 complete beginners how to compile C and Java from the command line in time for their first assignment.

Fortunately, I knew enough about C (a laughably small amount) to prove that I’d at least written some C before, and I got my first internship as an embedded systems intern.

First tech job

I’m fortunate to have had an excellent first boss and mentor. I went from 0 Linux knowledge to learning first-hand from a connoisseur - Fedora, Emacs, Vim, IRC, Buildroot, Yocto, Linaro, ARM Linux.

I was basically given a Pandaboard, a Raspberry Pi, and ultimate freedom to figure out a way to test that the I/O interfaces on the Pandaboard were working correctly.

To test the audio I/O, I connected a 3.5mm male-to-male from the output jack to the input jack, emitted tones at a certain frequency, and then used the Goertzel algorithm to ensure that the input jack was indeed receiving the same emitted tone with an acceptable SNR. This knowledge led to the birth of my GitHub account and my very first side projects, Pitcha and pitch-detection.
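For readers who haven’t met it, here is a minimal Python sketch of the Goertzel algorithm, which measures the power of a single frequency in a block of samples. It is not the original test code: the function name, the sample rate, and the tone frequency are assumptions for illustration, and comparing the target bin against another bin is only a crude stand-in for a proper SNR check.

```python
import math

def goertzel_power(samples, sample_rate, target_freq):
    """Return the squared magnitude of the DFT bin closest to target_freq."""
    n = len(samples)
    k = round(n * target_freq / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = 0.0
    s_prev2 = 0.0
    for x in samples:
        # Second-order recurrence: one multiply and two adds per sample.
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

# Hypothetical usage: a synthetic 440 Hz tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(800)]
on_target = goertzel_power(tone, 8000, 440)
off_target = goertzel_power(tone, 8000, 1000)
```

In a loopback test, `tone` would be the samples captured from the input jack, and the test would pass when `on_target` dominates the power at the other frequencies by some chosen margin.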

To test the ethernet, I constructed an ethernet loopback plug and used raw socket programming to send and receive packets through the loopback. I could prove that it worked by disconnecting the loopback plug and watching my test fail. If you just relied on the lo interface, the kernel did clever things to not actually send packets anywhere, so the raw packet construction was a must.
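A rough sketch of what such a raw-socket loopback check can look like in Python. This is an illustration and not the original code. It assumes Linux (`AF_PACKET`), root privileges, and a physical loopback plug on the interface. The EtherType `0x88B5` (reserved for local experimental use), the function names, and the interface name in the usage note are all assumptions.

```python
import socket
import struct

ETH_P_TEST = 0x88B5  # assumed EtherType: IEEE 802 "local experimental" value

def build_frame(dst_mac, src_mac, payload, ethertype=ETH_P_TEST):
    """Build an Ethernet II frame by hand, padding the payload to 46 bytes."""
    header = struct.pack("!6s6sH", dst_mac, src_mac, ethertype)
    return header + payload.ljust(46, b"\x00")

def loopback_ok(ifname, payload=b"loopback-test", timeout=1.0):
    """Send one frame out of ifname and wait for it to come back in."""
    proto = socket.htons(ETH_P_TEST)
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW, proto) as s:
        s.bind((ifname, 0))
        s.settimeout(timeout)
        src_mac = s.getsockname()[4]  # hardware address of the interface
        frame = build_frame(b"\xff" * 6, src_mac, payload)
        s.send(frame)
        try:
            while True:
                data, addr = s.recvfrom(2048)
                # Skip the kernel's copy of our own outgoing frame, if any.
                if addr[2] != socket.PACKET_OUTGOING and data == frame:
                    return True
        except socket.timeout:
            return False
```

With the plug inserted, `loopback_ok("eth0")` would be expected to return `True`, and `False` once the plug is pulled, which mirrors the pass/fail behaviour described above.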

I could go into tremendous detail because I genuinely have not been able to replicate that pure joy of learning and optimism since that job, and I remember every second of it.

Second tech job

It was at the same place (after I got my degree and became full-time), but being a consulting firm, sometimes we had to not do what we wanted, but instead do what our clients needed.

I was sent to a local Montreal shop which wanted to migrate a monolithic Java webapp from on-prem to AWS. There, I learned about Hazelcast and AWS from a developer’s perspective. We were also sent with a contingent of our devops experts, who installed a Jenkins/Puppet/CloudFormation pipeline - I was very interested in the pipeline, but the attitude I received was akin to “go play with your toys, dev - let the real operations engineers handle the deployments”. That’s when I set my sights on working in ops.

I angled myself in the “devops direction” by developing tools outside of the application. I intuited that developing features or fixing bugs on the main application would always leave me in the role of developer, whereas operating in the periphery by writing scripts for deploying, testing, etc. would expose me to a wider world. That’s how I taught myself Python for the first time, by writing a Jabber/XMPP fake data simulator which we used to do load and quality tests on the Java webapp.

Third tech job

With this self-taught Python, I got a job as a Python backend microservices developer. I learned great Python from solid developers. Args, kwargs, generators, etc.

It was at this company where I found a new duality - the ops/dev split was a divide of fun. I looked at my dev cluster where everybody was sitting silently with their headphones in, cranking out line after line of code.

I looked at the operations cluster where the rambunctious sysadmins were playing drinking games to determine who would be on-call that night, comparing their mechanical keyboards, and showing off their i3wm/ArchLinux setups to each other.

This is where I completely rewrote my resume to “pretend” like my last few jobs were more devops-oriented than they truly were. For example, instead of saying I watched other people write Puppet/CloudFormation/Jenkins pipelines, I pretended like I created them (and revised enough Puppet and did some PoCs at home so I could answer questions).

I secured an interview and got my first real job as a “devops engineer” (which FYI is now considered to be a junk title).

Fourth tech job

It was awful. It was an IT/DevOps hell job. If you haven’t read The Phoenix Project, now’s the time to do it.

I’ll say with honesty that very little in that book is exaggerated. Actually, the most exaggerated parts are not about how dysfunctional things were in the beginning, but rather how willing people were to make things better.

We had people modifying Apache httpd configs on production web servers on customer websites and causing outages when doing change tickets. I wrote a Go/Docker/Jenkins pipeline to be able to test redirects in a container before deploying them, and my idea was rejected.

We had people doing deployments with Terminator/Tmux instead of using a centralized management system (e.g. Ansible, Salt). I wrote a Python tool using Fabric to do actions in parallel, and my idea was rejected.

It was important as a first operations job: I learned what it meant to be on-call and how to do some basic Linux triaging. However, I cannot state strongly enough how bad the IT hell aspect is.

When you’re told for the 10th time that “it’s 2PM Thursday, the network engineers are taking their second lunch of the day so don’t expect them to help you with the load balancers that only they are allowed to touch”, it wears on your soul.

I emerged from this job more cynical, but I don’t think that’s necessarily a bad thing. I’m almost impervious to bad practices now. I shrug and code around them. Working in live operations, in my opinion, requires a certain resilience to bad practices. Even at good companies, there’s some crusty old shit in the hidden parts of the stack.

Fifth tech job

This is what was essentially my renaissance. I entered a very strong company with excellent engineering culture. There wasn’t (yet) good collaboration between the engineers and the SREs, but at the very least, there wasn’t any animosity preventing it.

It was probably the perfect job (while it lasted). My SRE boss was excited that he found an SRE who could get along with the devs. The dev leads were excited that there was an SRE who didn’t have the “god damn devs, why do you write new features??? keep everything the same!” crusty sysadmin mentality.

Here I wrote pq and goat. Here I also learned that the title SRE is a respected title in the Bay Area, and that there are some real SREs doing some real cool shit out in the world, e.g. Brendan Gregg. I got the Silicon Valley bug here.

While interviewing for SRE positions, I learned that a certain amount of operational chops are required. I was mostly shielded from that. Instead my skills were used to write glue code for CI/CD pipelines and write infrastructure-as-code with Terraform for AWS.

I saved a transcript of my many interviews. The most important job interview of my life was probably one for Production Engineer at Facebook - this is where I obtained an actionable roadmap to actually becoming a valuable SRE. Here’s my interpretation of that interview (yes, I wrote it down and read it regularly to face my failure and motivate myself to do better):

Production engineer at Facebook

Real operations-heavy work keeping prod systems running

Initiated: recruiter reached out to me via LinkedIn

Preparation: recruiter warned me very accurately about what was coming. Know the ins and outs of Python and operations-style coding (parsing text files, parsing huge files with low memory footprint). Operations-wise: know your kernel and fork, exec, pid, signal stuff. Did lots of leetcode, watched Brendan Gregg lectures and followed his perf labs lectures (learned mpstat, pidstat, vmstat, iostat, top, htop, iotop, atop, etc.). I did insane prep for this interview, perhaps a month of concentrated effort.
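As an illustration of what “parsing huge files with low memory footprint” tends to mean in Python, here is a small sketch that streams a file line by line instead of reading it all at once. The function name and the whitespace-separated input format are assumptions made up for this example, not anything from the interview.

```python
from collections import Counter

def count_first_field(lines):
    """Count occurrences of the first whitespace-separated field per line.

    `lines` can be any iterable of strings, including an open file object,
    so only one line is held in memory at a time.
    """
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts
```

Typical usage would be `with open(path) as f: counts = count_first_field(f)`, where `path` points at some large log file. Memory use then scales with the number of distinct keys, not with the file size.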

Progress: 1 phone screen about coding. No puzzles, it was 2 straightforward Python questions. Parse two text files containing partial data, store them into a combined dict. Did some clever stuff with dict comprehensions, deleted unused dict keys as I went in the loop. Good communication and feedback. Another about splitting a list into two equivalent sublists (something like x[:b-1] == x[b:]). Brute forced it; again, the interviewer seemed more pleased with how well I knew Python than with my actual CS fundamentals.

Exact feedback: Sevag can quickly and clearly come up with a brute force solution. When prompted for optimizations, he can make good guesses for where and how he could improve the runtime of his code.

1 phone screen about operations. What happens when you type ls? Fork, exec, wait. What’s an orphan? What happens to a wildcard? Look at this code: pid = os.getpid(), kill(pid), why is this bad? Luckily I had just read through the pid/kill/signal sections of The Linux Programming Interface so I was able to say that if pid is 0
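The fork, exec, wait sequence mentioned above can be sketched in a few lines of Python. This is a simplified illustration of what a shell does for an external command on a POSIX system. The function name is an assumption, and real shells also handle job control, redirection, and signals.

```python
import os

def run_command(argv):
    """Run argv the way a shell would: fork, exec in the child, wait in the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # conventional "command not found" exit code
    # Parent: block until the child terminates, then decode its status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # child was killed by a signal
```

For example, `run_command(["ls", "-l"])` returns the exit code of `ls`. A wildcard such as `*.txt` never reaches this code path unexpanded in a real shell, because the shell expands it into a list of file names before calling exec.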

Onsites. 1 on coding with a member from the Python foundation. Re-implement GNU Fortune given the following format of delimiters. Came up with a brute force solution quickly (as is typical). He helped me come up with a better solution by leading me to storing file offsets pointing to the contents, rather than the file contents themselves. Asked me to write a cache to not re-parse the fortune file each time. Wrote something nice where I stored the last parsed offsets in ~/.fortune.py.cache
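The offset idea can be sketched as follows. This is a reconstruction for illustration, not the code written in the interview. It assumes fortunes are separated by a line containing only `%`, since the actual delimiter format is not given above, and the function names are made up.

```python
import random

def index_fortunes(path):
    """Return a list of (offset, length) pairs, one per fortune.

    Only byte positions are kept in memory, never the fortune text itself.
    """
    index = []
    start = 0  # offset where the current fortune begins
    pos = 0    # offset of the line being examined
    with open(path, "rb") as f:
        for line in f:
            if line.rstrip(b"\r\n") == b"%":  # assumed delimiter line
                if pos > start:
                    index.append((start, pos - start))
                start = pos + len(line)
            pos += len(line)
    if pos > start:  # last fortune with no trailing delimiter
        index.append((start, pos - start))
    return index

def read_fortune(path, offset, length):
    """Seek straight to one fortune and read only its bytes."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length).decode("utf-8")

def random_fortune(path, index=None):
    """Pick a random fortune, reusing a previously built index when given."""
    if index is None:
        index = index_fortunes(path)
    offset, length = random.choice(index)
    return read_fortune(path, offset, length)
```

A cache along the lines described above would then only need to serialise the list returned by `index_fortunes` (for instance as JSON) and pass it back in as `index` on later runs.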

Feedback: good, he knows Python, he knows how to code, he can communicate clearly

1 on networking with a kernel contributor. Asked me about recursive DNS. Asked me if DNS could be implemented in TCP, and why isn’t it (to reduce load on the important TLD servers?) Asked me to draw the internet (i.e. mesh of personal devices and routers). Asked me to describe routers. Led me to a tricky spot where if I have 2 machines listening on the same port, how does the remote know they are 2 different hosts since it goes through the same router? Answer was NAT.

Feedback: you’re not expected to be amazing at networking for a production eng role anyway, but regardless, they were more or less satisfied with my basic knowledge

1 on ops. Asked, what would I do if I’m paged and a disk is showing me errors. Puttered around too much using iostat when I should’ve started with df -h right away. Made some mentions of software RAID (but he said, assume df -h shows you that it’s just a plain mountpoint). Then, we reached my element - how do you fix it? Ok, so there’s a totally empty disk we can swap the erroring disk. First, do an lsof to see if the disk is being used. Then a recursive grep on the mountpoint to find which application is actually configured to use it - good, he was looking for this. Tells me it’s an Nginx config. Change the configs, restart, everything is good.

Feedback: my navigation of an error is slow and uncertain. Seems like I would need a lot of battle-testing before comfortably being on-call for critical production systems (which I agree with)

1 on large system design. How would I design a system to deliver a 300MB firmware blob update to 100,000 prod webservers.

Here I believe I knocked it out of the park. Described that I would do it in little batches (take offline, update, put back online). How would I calculate the batches? Based on how much traffic the whole fleet of webservers can serve (e.g. if removing 20% of the servers would push the rest to 100% load, remove only 10% at a time, keeping 10% in reserve since we don’t know if external traffic might increase as we’re doing the change). Described how I would exponentially increase the batch size (starting with a batch of 1, capping at a batch of 10%). A lot of estimates and imagination. How to distribute the files? There the interviewer described a torrent-based approach they have within each network rack topology. Very cool stuff, but again, imagination is one thing and ability to execute is another.
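The batching arithmetic above can be sketched in Python. The function names are assumptions, and the 80% utilisation figure is only implied by the “remove 20% and load hits 100%” example, so treat this as a back-of-the-envelope model and not a rollout tool.

```python
def load_after_removal(utilization, removed_fraction):
    """Per-server load once a fraction of the fleet is taken offline."""
    return utilization / (1.0 - removed_fraction)

def batch_sizes(total_servers, cap_percent=10):
    """Yield batch sizes that double from 1 up to a cap of cap_percent of the fleet."""
    cap = max(1, total_servers * cap_percent // 100)
    remaining = total_servers
    size = 1
    while remaining > 0:
        batch = min(size, cap, remaining)
        yield batch
        remaining -= batch
        size *= 2
```

For a fleet of 100,000 servers the sequence starts 1, 2, 4, 8 and keeps doubling until it reaches the 10,000-server cap. `load_after_removal(0.8, 0.2)` comes out at full load, which is the reason for stopping at 10%.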

1 on personality (1 on 1 with a manager)

Surprisingly, the one I failed. Not from a bad attitude, but rather, the manager could sense that I was a coder with no real operations guts. I never spoke about cool operational things I did (to be fair I haven’t done any).

Feedback:

I was told that one of the engineers who interviewed me championed me (a form of vouching/super-like where I’m guaranteed to be on their team if I don’t make it through bootcamp).

Coding and large system design feedback was very good. Networking was acceptable. Operations was weak.

Manager interview was terrible. The manager has no faith that I have the correct operational know-how to do the nitty gritty ops work. Thinks I’m just a SWE in disguise who’ll try to sidle up to a SWE team as soon as I get hired, and I’ll get bored of production very fast and drag my feet and never learn to do the important operational stuff and would rather chase my tail writing code.

After this interview, I followed the Tyrion Lannister trick of “wear it like armor”. I entered my future interviews with the attitude of “look - I’m a programmer, and I don’t know Linux ops as well as I’d like to. I will help you modernize your operations software, and in turn, I want to face the fire and be on-call and learn all about mdadm and RAID and NFS and everything.”

Sixth tech job

I now live and work in the Bay Area as a “Senior site reliability engineer”. There are plenty of things on fire so I can learn how to fight fires. There’s also plenty of automation to be written so I’m writing exactly as much code as I want to write.

It’s been an interesting journey and I hope this helps others who are interested in the SRE career track.