Site Reliability Engineers (SREs) are the backbone of the modern tech powerhouses. The discipline is built around a set of principles that enable development teams to create highly scalable and reliable systems by giving them shared responsibility for IT administration. While the practice now takes many different forms, overall it’s about ensuring systems remain up, no matter the incident or surge in traffic.
At ScorpInc, though, we believe in evolving current engineering practices to meet our needs and our clients’ needs in novel ways, so we’ve taken the same approach to our SRE processes. When we first released our Entropy Ape system three years ago, we realised the importance of making sure our systems were more reliable than the industry standard, and that there were two main ways to achieve that:
While we’ve excelled at point one (1) and can boast of 100% uptime for many years, we’ve always been working heavily on point two (2). And today we’re proud to announce how we’ve achieved this.
We’ve worked to create a new discipline, the Offensive Site Reliability Engineer (OSRE), whose job is to destabilise the systems of rival companies. Those rivals might be microblogging services competing with our own social media network, popular hosting facilities on the eastern seaboard of a major tech-heavy nation that compete with our own cloud computing facilities, or even just local business-focused ISPs in direct competition with our high-speed enterprise data network.
Giving our OSREs wide latitude in how they execute their duty to ensure our systems remain more reliable than our competitors’ gives them the freedom to excel in their work by using their own knowledge and pushing themselves to always scrub out those few extra 9s of reliability. We induct staff with a bias for action and provide them with a marque for duty, and the results speak for themselves.
Constraining staff to working within rigid, highly vertical organisational hierarchies was what limited the scaling of many services, and SRE culture helped give teams the owned responsibility they needed to ensure their work would work. In organisations with a strong, empowered, and just DevOps/SRE culture, teams would get the credit for the successes they made, and understand that their mistakes were also partly on their shoulders. The same goes for OSRE practices: credit goes to the teams for their wins in keeping better service uptime than their rivals, while any failure to prosecute their duties without being captured by LEO rests on their heads and leaves ScorpInc with plausible deniability.
You also see new forms of innovation thanks to the freedom offered by the role. While some OSREs focus on the cyber domain, directing DDoS attacks or ransomware, others help bring traditional methods to the 21st century by going straight to the datacentres and wreaking enough havoc to keep uptime below acceptable levels for our rivals' clients.