The primary purpose of the Operations Center (OC) is to oversee Client Technical Operations' highly complex environments and take the actions necessary to provide an exceptional customer experience, maintain product availability, and restore services when issues occur.
The primary monitoring objective of Client's OC team is to help ensure that Client's products and services remain fully operational at all times by providing insight and visibility into our systems and applications and by reducing Mean Time to Detect for any issues that occur. We provide, maintain, and manage a complete suite of integrated processes that use Client's monitoring tools to deliver timely and relevant alerts to technical operations and engineering personnel, with the aim of proactively preventing any impact to consumers or business partners.
The team is geographically dispersed across the US, Europe, China, the Philippines, and India. It consists of about 60 members, including Managers, Tech Leads, Incident Managers, Site Reliability Engineers, and Resolving Engineers.
Successful members of the Operations Center Incident Manager team will:
· Quickly gain the ability to understand multiple applications and technologies within the Client Environment.
· Identify issues in production by observing our monitoring tools and dashboards.
· Follow technical knowledge articles to remediate or escalate issues to the appropriate teams.
· Execute troubleshooting steps and create incident documentation with proper technical details.
· Identify major incidents and escalate via the Incident Management (IM) Process.
· Take a command-and-control role as Incident Manager during critical incidents, focusing on restoring services as quickly as possible.
· Participate in After Action Reviews and facilitate discovery of the root cause.
· Identify, evaluate, and execute preventive measures to minimize or avoid impact to the consumer experience.
· Collaborate with all Client teams to identify new technology (or upgrades to existing technology) and facilitate its onboarding to our monitoring systems.
· Understand, build, and refine communications that inform subscribed audiences of business status and incident impacts.
· Provide leadership with KPI reports and dashboards covering incident trends, SLAs, MTTR, and other information as requested.
· Create documentation for Operations Center Engineers.
The software tools used to execute the projects include:
ServiceNow, Slack, New Relic, Splunk, SolarWinds, Perl, Python, Shell scripting, Jira, Confluence, HipChat, Jabber, Stash / Bitbucket, Artifactory, PagerDuty, ScienceLogic EM7, Microsoft SCOM, ExtraHop, vROps, OEM, Tableau.