Reliability Engineer - Observability
New York, New York, United States
Two Sigma is a financial sciences company, combining data analysis, invention, and rigorous inquiry to help solve the toughest challenges in investment management, insurance technology, securities, private equity, and venture capital.
Our team of scientists, technologists, and academics looks beyond the traditional to develop creative solutions to some of the world’s most complex economic problems.
Our global Reliability Engineering group consists of multiple teams of versatile full-stack engineers who drive the expansion and maintenance of Two Sigma’s many and varied systems. The team exists in the space between traditional systems engineering and development, and seeks to merge the capabilities of both disciplines.
The Observability Reliability Engineering team is focused on providing a top-tier observability platform consisting of Metrics, Events/Logs, and Traces. We aim to provide services and tooling that help teams monitor the health and performance of thousands of jobs and services across the firm. Working on our team will expose you to large-scale distributed systems, public cloud technologies (AWS, GCP, or Azure), Kubernetes, and state-of-the-art observability tools, both commercial and open source.
You will take on the following responsibilities:
- Improving all aspects of software reliability, including better monitoring, alerting and documentation
- Engaging with our software engineering teams on support issues and improvements to our tools, processes, and software
- Acting as a conduit between infrastructure and development teams to ensure teams are making best use of our systems
- Gathering and analyzing metrics from both operating systems and applications to assist in performance tuning and fault finding
- Building out new systems to support large-scale observability data
- Participating in an on-call rotation, investigating and remediating alerts, and responding to user support inquiries
You should possess the following qualifications:
- A bachelor’s degree in a highly technical or scientific discipline such as computer science or electrical engineering
- The ability to leverage off-the-shelf and open-source systems and utilities to provision production systems in a variety of domains, especially for multi-tenant use
- The ability to program (structured and object-oriented) in one or more high-level languages (such as Python, Java, C/C++, or Go), with a proven track record of automation and an algorithmic approach to solving problems
- In-depth knowledge of and experience in at least one of the following: host-based networking, Linux or UNIX engineering, systems programming, distributed systems, databases, or cloud computing, along with a desire to learn more
- Experience with automated configuration management tools such as Ansible, Chef, Puppet, or SaltStack
- Experience with observability and monitoring tooling such as Graphite, Grafana, OpenTSDB, Datadog, Prometheus, ELK, and OpenTracing
- A desire to automate away toil
- Prior experience in a similar Site Reliability Engineering (SRE), DevOps, distributed computing, systems engineering/administration, or related function
You will enjoy the following benefits:
- Core Benefits: Fully paid medical and dental insurance premiums for employees and dependents, 401k match, employer-paid life & disability insurance
- Perks: Onsite gyms with laundry service, wellness activities, casual dress, snacks, game rooms
- Learning: Tuition reimbursement, conference and training sponsorship
- Time Off: Generous vacation, sick days, and paid caregiver leave
We are proud to be an equal opportunity workplace. We do not discriminate based upon race, religion, color, national origin, sex, sexual orientation, gender identity/expression, age, status as a protected veteran, status as an individual with a disability, or any other applicable legally protected characteristics.