This position is for an experienced systems reliability engineer (SRE) eager to play an integral role in supporting the client's next-generation Splunk platform. The role is to help elevate SRE practices, onboard new technologies, and solve complex scaling and automation problems in order to provide superior operational intelligence.
Primary responsibilities include evaluating the status of our large-scale Splunk environment. This includes consulting on, designing, building, and supporting advancements in the system; automating infrastructure and operations; creating telemetry for monitoring; engineering for high reliability; and reinforcing best practices to secure our company and guest data.
The Sr SRE is expected to have experience running a large-scale Splunk environment that ingests tens of terabytes per day, along with experience using the Enterprise Security module. The Sr SRE is also expected to have expert-level systems administration skills on Linux and Windows platforms, and must have experience with software development (e.g. Python, Go, Java), automation (Chef, Terraform, CloudFormation), cloud hosting (AWS, GCP, and Azure), and the DevOps team culture.
The Sr SRE must be prepared to work with engineering, creative, and production teams in an extremely collaborative and high-energy environment to brainstorm, architect, gather requirements, troubleshoot, and provide stellar customer support. The ideal Sr SRE is passionate about constantly learning and taking technology to the next level to solve complex problems, and is a highly motivated, optimistic, proactive, and creative thought leader and project manager.
- Expertise with Splunk and modules such as Enterprise Security
- Expertise with large-scale Splunk environments (multiple tens of TBs/day of ingest)
- Expertise in multiple scripting languages and advanced skills in programming languages (e.g. Go, Python, Ruby, Dart, Node, Java, or similar), with the ability to build test coverage for all software being developed.
- Systems administration skills on Linux and Windows platforms
- Networking skills and protocols (e.g. HTTP, TLS, SSH, DNS)
- Experience with Source Control Management systems (e.g. Git)
- Expertise in public and private cloud hosting services (AWS, Google Cloud, Azure)
- Proficient with data technologies (e.g. NoSQL, MySQL, MongoDB, Redis, Elastic) including being able to perform basic setup, configuration, and troubleshooting.
- Able to implement existing base standards for new systems and/or applications for all of the following:
  - Site/Systems monitoring and instrumentation
  - Application monitoring and instrumentation
  - System monitoring and instrumentation
  - Resilience, performance, and telemetry data
- Able to diagnose simple to complex system and process problems.
- Able to perform and provide in-depth analysis of load test runs against a moderately complex system.
- Demonstrate exceptional troubleshooting methodology, including the ability to author new methodologies and teach them to the SRE team.
- Independently resolve moderately to highly complex system and application incidents.
- Able to identify and propose system and application fixes for performance bottlenecks.
- Able to evaluate new application requirements for capacity and run-time best practices.
- Able to evaluate new system and/or infrastructure solutions for technical feasibility against known requirements and standards.
- Effective at dealing with change: Able to transition between roles or adapt to a significant modification or new technology with minimal ramp-up time and very little guidance.
Bachelor's degree or higher