Overview

Location: Remote, physically located in the United States (EST time zone preferred)

KAR Global powers the world’s most trusted automotive marketplaces through innovation, technology and people. Our end-to-end platform serves the remarketing needs of the world’s largest OEMs, dealers, fleet operators, rental companies and financial institutions.

  • We’re a technology company delivering next-generation tools to accelerate and simplify remarketing.
  • We’re an analytics company leveraging data to inform and empower our customers with clear, actionable insights.
  • And we’re an auction company powering the world’s most advanced and integrated mobile, digital and physical auction marketplaces.

At KAR, the Service Operations team designs, deploys and operates infrastructure and applications across multiple environments and around the globe. We are a dynamic and innovative team that aims to provide an exceptional customer experience by leveraging best-in-class automation and orchestration practices for infrastructure and applications. As a Site Reliability Engineer, you will use your software and systems engineering background to build and run large-scale, distributed, fault-tolerant systems. We strive to hire people who are looking to make an impact and thrive in a freedom-filled environment driven by context.

About Our Candidate:

Your role is to ensure that our systems, both internal and external facing, are reliable and have maximum uptime. Our current team focuses on optimizing existing systems, building infrastructure and eliminating manual work through automation. You are responsible for the big picture of how our systems relate to each other, and we use a breadth of tools and approaches to solve a broad spectrum of problems. Practices such as limiting time spent on manual operational work, running postmortems and proactively identifying potential outages feed the iterative improvement that is key to both product quality and technical standards.

The challenge ahead:

  • Build scalable systems using automation best practices, and push changes that improve reliability and velocity
  • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, planning and reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Provide mentorship and training to other team members on technologies and processes; drive education and knowledge transfer of design patterns, technical practices, and relevant technologies and tools
  • Drive high standards around incident response practices and policies

You should have the following:

  • 7+ years of experience in an IT Operations, DevOps, Site Reliability Engineer, or Software Engineering role
  • In-depth experience with cloud computing and solid experience setting up and managing cloud infrastructure
  • You can write code, in any language, and you have shipped your work to production
  • Experience with configuration management and infrastructure automation tools such as: Ansible, Artifacts, Build/Release Pipelines, Docker, GitHub, HashiCorp, Kubernetes
  • Experience with large scale distributed systems in the cloud and concerns like load balancing and disaster recovery
  • Experience with the operational aspects of software systems such as monitoring, centralized logging, and alerting with tools such as: Splunk, AppDynamics, Honeycomb.io, Datadog, Prometheus