Job Description
Roles & Responsibilities
- System Monitoring
Actively monitor applications, servers, and network using dashboards and alerts; identify anomalies and take immediate action or escalate when thresholds are breached.
- Incident Management
Own incidents from detection to closure: validate alerts, classify severity, perform initial troubleshooting, apply fixes when possible, and escalate with proper details.
- Troubleshooting & Support
Perform first-level diagnosis using logs, metrics, and system checks; restart services, verify dependencies, and support L2 teams with accurate findings.
- Communication & Reporting
Provide clear incident updates to stakeholders, log all actions in the ticketing system, and ensure proper documentation and shift handover notes.
- Service Availability
Track system uptime and performance; proactively act on alerts to prevent outages and ensure SLA targets are met.
- Change & Deployment Support
Monitor systems during releases, validate service health post-deployment, and report or escalate any issues observed during change windows.
- Shift Operations
Work in assigned shifts (24/7), respond to alerts within SLA, and ensure smooth handover with complete status and pending actions.
Desired Candidate Profile
- Education
Bachelor's degree in Computer Science, IT, Electronics, or related field
- Experience
1-3 years in NOC / SOC / IT Operations with hands-on incident handling in production environments (24/7 support experience required)
- Networking Knowledge
Practical understanding of TCP/IP, DNS, HTTP/HTTPS, VPN (evidenced by troubleshooting or support roles)
- Systems Administration
Hands-on experience with Linux and/or Windows Server (service checks, logs, basic commands, system health)
- Monitoring Tools Experience
Proven use of monitoring/logging tools (e.g., Elastic Stack, Kibana, Zabbix, Grafana) for alerting and issue investigation
- Log & Metrics Analysis
Demonstrated ability to analyze logs and system metrics to detect issues and support root cause analysis
- Incident Management
Experience following structured incident management and escalation processes (ticketing systems, SLAs, severity handling)