Site Reliability Engineer (SRE) Manager

Company: Euronet Worldwide, Inc.
Job type: Full-time

Since 1996, epay, a business segment of Euronet, has been at the center of connecting local and global brands to consumers. Our capabilities, platforms, products, and solutions cater to the changing consumer demand for content and payments in categories such as mobile, gaming, and entertainment. We are dedicated to developing new distribution capabilities that serve customers' changing needs through our retailer network, helping our brand partners meet consumers where they shop: in physical stores, online, via mobile devices or wallets and through ATMs.
We're in search of an experienced SRE Manager to join our team at our Las Vegas office. In this role, you will lead a small team dedicated to designing, implementing, and maintaining highly available, scalable, and secure systems in an exciting and fast-paced startup environment. Your responsibilities will include driving the adoption of SRE principles, collaborating with cross-regional teams, and contributing to the strategic direction of our IT infrastructure and services. The ideal candidate should have a hands-on approach and demonstrate strong proficiency in Linux, Kubernetes, CI/CD, and cloud computing.
Responsibilities:
Lead the team in designing, implementing, and maintaining highly available (99.98%), scalable, and secure systems.
Develop and implement operational processes and procedures to ensure smooth IT infrastructure and service operations.
Collaborate with cross-regional teams to implement best practices for building, deploying, and monitoring software systems.
Stay calm under pressure.
Manage major incidents through mitigation and resolution, perform post-incident reviews of all major incidents, and determine the action items required to avoid similar issues and minimize downtime in future incidents.
Define and track key performance indicators (KPIs) and service-level measures (SLIs, SLOs, SLAs) to measure operational effectiveness.
Monitor, analyze, and optimize system performance, capacity, and resource utilization.
Manage budgets and resources effectively.
Identify and implement continuous improvement initiatives to increase efficiency and reduce risks.
Lead incident response activities, performing root cause analysis and implementing preventative solutions.
Drive the development and implementation of automation solutions to streamline operations and reduce manual workloads.
Team Management and Leadership:
Manage a team of Site Reliability and DevOps Engineers, including hiring, evaluating, training, and developing team members.
Build a collaborative and productive team culture.
Own and maintain the company's cloud infrastructure strategy and SRE team roadmap.
Evaluate and improve SRE processes and procedures.
Provide technical expertise by collaborating with stakeholders to make high-level decisions and provide technical direction to team members.
Participate in deep system design and implementation discussions to ensure high-quality systems are built. Work closely with our Software Development and Engineering teams on platforms before they go live to build reliable, production-ready services and applications.
Provide rotational on-call support, where you'll detect, triage, respond to, and resolve production incidents.
Requirements:
Bachelor's degree in Computer Science, Engineering, or a related discipline, or equivalent experience.
Over a decade of experience in IT, including at least two years in a leadership capacity.
Strong technical background in cloud computing, networking, security, and automation.
Excellent leadership, communication, and interpersonal skills.
Strong knowledge of Linux and Windows operating systems and environments.
Strong knowledge of networking, load balancers, DNS, NTP, and TCP/IP.
Strong knowledge of AWS technologies (Global Accelerator, ALB, NLB, EKS, EC2, VPC, S3, RDS) or equivalent experience with Google Cloud.
Experience with containers.
Knowledge of container orchestration.
Experience with infrastructure automation tools such as Terraform, Ansible, or Puppet/Foreman.
Experience with web servers such as IIS, Apache, and Nginx.
Proficiency in the design principles for monitoring and alerting systems.
Experience with monitoring tools such as Nagios, Icinga, SolarWinds, New Relic, and Grafana.
Solid scripting skills; experience with Shell, Bash, Ansible, Python, PowerShell, and Ruby.
Experience in setting up CI/CD pipelines (GitLab or Azure DevOps).
A willingness to learn on the job and take on tasks as needed.
Additional Desired Experience:
Certifications such as AWS Certified DevOps Engineer or Google Professional Cloud DevOps Engineer are a plus.
Experience with one or more of the following F5 products: LTM, AWAF, GTM, AFM, BIG-IQ.
Experience with one or more of the technologies used for big data: ELK, Beats, Kafka, Redis, Search Guard.
Experience with application monitoring tools such as Uptrends.
Experience with Postfix.
Benefits:
401(k) Plan
Health/Dental/Vision Insurance
Employee Stock Purchase Plan
Company-paid Life Insurance
Company-paid Disability Insurance
Tuition Reimbursement
Paid Time Off
Paid Volunteer Days
Paid Holidays
Plus many more employee perks & incentives!
We are an Equal Opportunity Employer, and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability status, genetic information, protected veteran status, or any other characteristic protected by law.
