- Identify tools needed to monitor custom solutions put in place by Solution Architects and Implementation Engineers.
- Implement monitoring tools and customize them to meet business needs for monitoring and alerting.
- Analyze logs and error messages to determine the source of an issue. Debug code and troubleshoot to resolve issues. Deliver troubleshooting steps and debugging information to engineering teams for resolution.
- Create documentation and processes to allow Engineers to restore service by performing triage, recovery and validation steps for application, network, system and database events.
- Establish procedures for alarm handling and escalation.
- Create reporting and analysis on the health of monitored solutions and identify areas of risk and customer exposure.
- Establish a framework for NOC duties and responsibilities. Identify coverage and response needs for monitored assets.
- Identify hand-offs between departments, SLAs, and responsibilities between groups.
- Track incidents and requests from initial identification through to resolution, ensuring that appropriate categories are used for logging and escalating them.
- Establish a feedback loop between technical solutions and stakeholders for ongoing improvements.
- Work closely with Solution Consultants and Implementation Engineers to scope monitoring and alerting needs for customer solutions in advance of implementation.
- Working knowledge of, and experience with, programming languages, sufficient to follow code execution, detect code errors, debug, and make minor fixes.
- Experience working with Microsoft Azure environments.
- Advanced knowledge of SQL.
- Troubleshooting skills (patience and determination to find and solve problems).
- Strong priority management skills and proven problem-solving skills.
- Highly organized and self-starting.
- Strong communication skills (written and verbal).