• Service Reliability Engineering (SRE) Lead

  • Overview

    Engineering Leaders at Excella are consultants and thought leaders with great business and technology skills who provide expertise to our teams as we deliver game-changing products and applications for commercial and government clients. The SRE Lead is part of a senior team leading all capabilities Excella offers to clients.


    You might be the right person for this role if:

    • Automation is always front-of-mind for you: you’re always on the prowl to reduce the number of manual steps and write automation so computers can do what computers are good at and humans can do what humans are good at. When you’re automating tasks, you think about how to automate something and what not to automate.
    • You love helping people with different perspectives work together to deliver highly resilient services to production quickly and frequently.
    • You recognize the relationship between the pace of innovation and product stability and establish target availability and error budgets to manage the tradeoffs across development and SRE teams.
    • You have the curiosity to understand why systems and services behave the way they do in complex technical environments.
    • You’re passionate about enabling and running systems that align technology to business outcomes.
    • You’re experienced in making systems humane and sustainable for everyone involved.


    The SRE Lead is responsible for shaping the scope and expertise of SRE practices at Excella. Working with our development team, you’ll build reliability and resiliency into cloud-based and hybrid infrastructure, tools, services, and processes, and establish practices for supporting and running them so that services stay highly available to our clients, easily supportable by our developers, and operable for the company.


    SRE Lead responsibilities include:

    • Leading SRE teams to ensure the solutions we deliver consistently meet target availability levels.
    • Working closely with development teams to create resilient systems that are able to run and repair themselves with minimal human interaction.
    • Evolving Excella's incident management process and tools to respond swiftly to critical incidents, provide transparent communications on incident status, introduce playbooks where necessary to reduce MTTR, and conduct incident reviews for continuous improvement.
    • Evolving Excella's release engineering practices to implement progressive rollouts with the ability to detect issues and remediate quickly and safely when required.
    • Establishing and implementing observability to provide visibility into system health and availability.
    • Leading capacity planning efforts to create accurate demand forecasts and conduct regular load testing.
    • Working with account leadership to develop and manage effective and sustainable on-call rotations.


    You will have experience with using, supporting, administering, and leveraging tools in the following areas:

    • Operational experience administering and managing fleets of Linux and Windows servers
    • Amazon Web Services (AWS): S3, EC2, Lambda, CloudFront, ELB
    • Container technology and orchestration: Kubernetes and Docker
    • CI/CD tools, such as Jenkins, TeamCity, GitLab, Bamboo, Travis CI, or CircleCI
    • Monitoring tools:
      • Infrastructure: Nagios, SolarWinds
      • Application: New Relic
      • Reporting: ELK, Splunk
    • Incident management tools: VictorOps, OpsGenie, PagerDuty
    • Configuration management tools: Chef, Puppet, Ansible
    • Source control tools, such as GitHub or Bitbucket
    • Integration of testing tools and services, such as Selenium, Cucumber, JUnit, and JMeter
    • Security testing/compliance integrations: Nexus, Chef Compliance
    • Scripting languages (fluent in at least one from each group):
      • Machine/Operating system: bash shell, PowerShell
      • Multipurpose: Ruby, Python, Java
      • Build process: Make, CMake, Ant, MSBuild, Xcode project
    • Managing and operating SQL and NoSQL databases, such as PostgreSQL and MongoDB
    • Artifact management: Artifactory, Yum/Apt, Chocolatey

    About Excella

    Excella is a technology consulting firm serving commercial, non-profit, and federal clients in the Washington, DC area. Excella builds innovative custom software solutions with a strong focus on Agile engineering practices. We believe that great work leads to great things -- for our clients and our employees. We are growing fast and need passionate, innovative people who love working with technology and are ready to make an impact.


    Here's what you can expect from us:


    • We care about our employees. In fact, The Washington Post and The Washington Business Journal consistently rank us as a "Best Place to Work."
    • You'll work with great people who love what they do: our team includes published authors, certified trainers, and internationally renowned speakers.
    • We have a "bring your own device" workplace and will share the cost of a new computer of your choice -- Mac or PC. It's up to you.
    • We'll invest in your career by providing 3 days of paid professional development every year, including travel and registration fees to attend classes and conferences, in addition to tuition assistance for degrees and certifications.
    • Starting day one, every employee is bonus eligible and receives 17 days of paid vacation.
    • You can bike, drive, or metro to work -- our commute reimbursement plan has you covered.

    Excella is an equal opportunity/affirmative action employer. All qualified applicants will receive consideration for employment without regard to sex, gender identity, sexual orientation, race, color, religion, national origin, disability, protected veteran status, age, or any other characteristic protected by law.

