Software Engineer vs Site Reliability Engineer
(Note: different companies define these roles differently. This doc describes just one typical case. SRE may also be called DevOps or Production Engineer.)
- Software Engineer (SWE): owns the design and implementation of the system; hands over the compiled binaries to SRE to run in production.
- Site Reliability Engineer (SRE): owns the binaries running on servers in production; treats binaries as black boxes.
- Responsibilities overlap, so SWEs and SREs need to work closely together.
The development workflow
Some improvements to production systems may happen within the SRE org; in that case SREs act just like SWEs (they own the design and the implementation). Here we discuss a business-related project instead.
SLOs, Metrics, Monitoring
SWEs and SREs need to collaboratively define the SLOs and the metrics, e.g. latency.
Both SWEs and SREs should be familiar with the monitoring tools.
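As a minimal sketch of what an agreed latency SLO can look like once it is written down, assume a hypothetical target that the 99th percentile of request latency stays at or below 300 ms. The function names, the nearest-rank percentile method, and the numbers are illustrative assumptions, not something this doc prescribes.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest-rank method: 1-based rank = ceil(pct / 100 * n), at least 1.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_slo(latencies_ms, pct=99, threshold_ms=300):
    """True if the pct-th percentile latency is within the threshold.

    The default target (p99 <= 300 ms) is a hypothetical example.
    """
    return percentile(latencies_ms, pct) <= threshold_ms


# Hypothetical measurement window: 99 fast requests and one slow outlier.
window = [100] * 99 + [500]
slo_ok = meets_latency_slo(window)
```

In practice the monitoring tool would compute this check; the point is that SWEs and SREs agree on the percentile, the threshold, and the measurement window up front.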
SWE owns the design. However, the design doc often needs to be reviewed and approved by an SRE to make sure the production systems can handle the new change (e.g. whether SLOs can still be met and whether extra capacity is required).
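The capacity question in such a review can often be reduced to simple arithmetic. The sketch below assumes a hypothetical model where each server handles a fixed number of queries per second and the team keeps a percentage of headroom above peak load; all names and numbers are illustrative.

```python
def servers_needed(peak_qps, qps_per_server, headroom_pct=30):
    """Servers required to serve peak_qps with headroom_pct spare capacity.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    if qps_per_server <= 0:
        raise ValueError("qps_per_server must be positive")
    demand = peak_qps * (100 + headroom_pct)
    supply_per_server = qps_per_server * 100
    return -(-demand // supply_per_server)  # ceiling division


def extra_servers(current_servers, peak_qps, qps_per_server, headroom_pct=30):
    """Additional servers to request; 0 if current capacity is enough."""
    needed = servers_needed(peak_qps, qps_per_server, headroom_pct)
    return max(0, needed - current_servers)
```

For example, with a projected peak of 1000 QPS, 100 QPS per server, and 30% headroom, 13 servers are needed, so a fleet of 10 would require 3 more.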
Implementation is the SWE's responsibility.
Build, Test, Release
SWE needs to make sure that the new change builds successfully and that all tests pass.
SREs are responsible for the build and release tools.
Oncall and Incident Response
This is one of SRE's main responsibilities. If it is a production issue, SREs can take action (e.g. reboot, redirect traffic, get extra capacity, roll back); if it is a bug caused by the new change, SREs will reach out to SWEs (since SREs do not own the logic and treat binaries as black boxes).
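The routing rule above can be sketched as a small triage function. The `Cause` categories, the `Incident` type, and the action lists are hypothetical names chosen for illustration; a real oncall process would be richer than this two-way split.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    PRODUCTION = "production"    # infrastructure or capacity problem
    CODE_CHANGE = "code_change"  # bug introduced by a new change


@dataclass
class Incident:
    summary: str
    cause: Cause


# Mitigations an SRE can apply without knowing the binary's internals.
SRE_ACTIONS = ("reboot", "redirect traffic", "get extra capacity", "roll back")


def route(incident):
    """Return (owner, candidate_actions) for an incident."""
    if incident.cause is Cause.PRODUCTION:
        return "SRE", SRE_ACTIONS
    # SREs treat binaries as black boxes, so logic bugs go to the SWEs.
    return "SWE", ("debug and fix the change",)
```

For instance, `route(Incident("disk full", Cause.PRODUCTION))` names SRE as the owner, while an incident with `Cause.CODE_CHANGE` is handed to the SWEs.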
Sometimes SWEs are fully responsible for oncall for a new system; only when the system matures will SWEs hand over oncall to SREs (with a well-written playbook describing the proper reactions to different failure scenarios).
SRE teams may be geographically distributed in order to cover oncall 24/7. This is not a must for SWE teams.