SITE RELIABILITY ENGINEER
Principal Software Engineer
STATEMENT OF PURPOSE:
The Site Reliability Engineer designs, develops, and operates systems for our online content publishing system. This person is a software developer/engineer who can build reliable systems that enable us to meet our service reliability goals while also embracing speed of change. This role works with Application Software Engineers to create solutions for deployment, monitoring, and performance.
ESSENTIAL DUTIES & RESPONSIBILITIES:
A successful Site Reliability Engineer will be expected to:
- Write reliable, well-documented, well-tested code consistently and predictably.
- Understand trade-offs between reliability, cost, and performance. Communicate them clearly and contribute to effective business decisions.
- Participate in the whole lifecycle of software development and deployment to improve PubFactory.
- Design system architectures that enable successful deployment and operation of services.
- Respond to operations incidents affecting customers and resolve them.
- Participate in an escalation path for 24/7 support of customer-facing systems.
- Define and implement practices for security and auditing of systems to comply with any contractual agreements.
- Work with Software Engineers to improve the deployment and operation of all systems at PubFactory.
- Take responsibility for Chef cookbook automation of all servers.
- Take responsibility for the deployment tools, which are currently written in Fabric.
- Participate in the evolution of infrastructure architecture and how we operate and deploy software at PubFactory.
- Provision new systems and services in support of growth of the business through new clients or products.
- Participate in blameless post-mortem culture and embrace iterative improvement.
- Maintain an infrastructure-as-code production environment spanning more than four datacenters and more than 100 servers.
QUALIFICATIONS:
- BA/BS degree in Computer Science or related technical field, or equivalent practical experience
- Experience in one or more of the following languages: C, C++, Java, Python, Go, Perl and/or Ruby
- Good written and verbal communication skills
- Ability to manage multiple projects simultaneously and meet deadlines
- Linux systems administration experience
- Experience with one or more of the following configuration management systems: Ansible, Chef, Fabric, Puppet, Salt, CFEngine
- Knowledge of TCP/IP network specifications, topology, and design.
- Understanding of computer hardware for networking and hosting servers.
- Understanding of virtualization and containerization of services and systems, including Docker, AWS EC2, and VMware
- Experience on an agile development team using continuous integration
- Experience coding within Git source-controlled environments.
- Demonstrated ability to debug and optimize code
- Experience with algorithms, data structures, complexity analysis and software design.
- Experience with service orchestration tools such as Kubernetes, Nomad, or Mesos
- Knowledge of automated infrastructure provisioning through Terraform or CloudFormation
While performing the duties of this job, the employee is required to do the following:
- Use a computer, telephone, and video conferencing equipment
Provision will be made for candidates who require specially adapted equipment.
If you're interested in learning more, please send your resume to firstname.lastname@example.org