The DevOps Engineer is responsible for maintaining, monitoring, automating, and optimizing the company's infrastructure, applications, and deployment processes. The role ensures system reliability, availability, scalability, and security through proactive monitoring, incident response, automation, and continuous improvement of operational workflows.
I. Responsibilities
- Monitoring and System Inspection
Continuously monitor metrics and alerts through platforms such as Prometheus, Grafana, Huawei Cloud Monitoring, and other monitoring tools.
Track key business indicators, including QPS (Queries Per Second), error rates, response times, and success rates, along with the health status of infrastructure resources such as CPU, memory, disk, and network utilization.
Conduct routine inspections of production systems, data center environments, and network connectivity, and prepare inspection reports in accordance with established procedures.
- Alert Management and Incident Response
Respond promptly to alerts received through phone calls, SMS, Microsoft Teams, and other communication channels, following established Standard Operating Procedures (SOPs).
Classify alerts according to severity levels and execute the corresponding response plans.
Independently resolve incidents when possible, for example by restarting services, clearing disk space, scaling resources, or switching traffic.
Escalate unresolved issues to the Development Team through the established escalation process, track progress until resolution, and document all actions taken.
Continuously optimize alerting rules to minimize false positives, missed alerts, and alert storms.
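As an illustration of the classify-then-respond workflow described above, the following is a minimal sketch only. The severity levels (P1 to P3), the thresholds, and the response plan wording are all hypothetical; the actual values come from the team's SOPs.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels; a real SOP defines its own."""
    P1 = 1  # service down or major customer impact
    P2 = 2  # degraded service
    P3 = 3  # warning, no user impact yet


@dataclass
class Alert:
    name: str
    error_rate: float     # fraction of failed requests, 0.0 to 1.0
    disk_used_pct: float  # percentage of disk in use, 0 to 100


# Hypothetical response plans keyed by severity.
RESPONSE_PLANS = {
    Severity.P1: "page on-call, open incident channel, escalate to Development Team",
    Severity.P2: "follow SOP remediation, escalate if unresolved",
    Severity.P3: "create ticket and review during routine inspection",
}


def classify(alert: Alert) -> Severity:
    """Map an alert to a severity using illustrative thresholds."""
    if alert.error_rate >= 0.20 or alert.disk_used_pct >= 95:
        return Severity.P1
    if alert.error_rate >= 0.05 or alert.disk_used_pct >= 85:
        return Severity.P2
    return Severity.P3


def response_plan(alert: Alert) -> str:
    """Look up the response plan that matches the alert's severity."""
    return RESPONSE_PLANS[classify(alert)]
```

For instance, `response_plan(Alert("api-gateway", 0.25, 40.0))` would return the P1 plan under these assumed thresholds.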
- Daily Operations and Process Management
Handle operational requests submitted through the ticketing system, including:
- Account provisioning
- Access and permission requests
- Resource allocation requests
- Firewall policy modifications
- Domain and SSL certificate applications
Support deployment activities, including application releases, rollbacks, and configuration changes following standardized procedures.
Assist with routine operational tasks such as backup verification, slow query analysis, and vulnerability scan follow-ups.
Maintain and update CMDB (Configuration Management Database) records, ensuring accurate information on servers, IP addresses, applications, and responsible personnel.
- Incident Management and Post-Mortem Analysis
Serve as the first responder during system incidents by coordinating communication channels, notifying stakeholders, managing resources, and providing timely updates.
Participate in incident post-mortem reviews by documenting timelines, root causes, and improvement actions.
Contribute to the continuous improvement of SOPs and the internal knowledge base.
- Documentation and Knowledge Management
Develop and maintain workflow manuals, emergency response procedures, and SOP documentation.
Prepare regular weekly and monthly operational reports, including:
- Top alert statistics
- Incident frequency
- Resolution times
- Service performance metrics
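The report items above could be produced by a small script along the following lines. This is a sketch under stated assumptions: the record layout, the sample incidents, and the `summarize` helper are hypothetical, and real data would come from the alerting or ticketing platform.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (alert name, opened at, resolved at).
INCIDENTS = [
    ("disk_full", "2024-01-01 10:00", "2024-01-01 10:30"),
    ("high_error_rate", "2024-01-02 09:00", "2024-01-02 11:00"),
    ("disk_full", "2024-01-03 14:00", "2024-01-03 14:15"),
]

FMT = "%Y-%m-%d %H:%M"


def summarize(incidents, top_n=3):
    """Return incident count, most frequent alerts, and mean resolution time."""
    counts = Counter(name for name, _, _ in incidents)
    minutes = [
        (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60
        for _, start, end in incidents
    ]
    return {
        "incident_count": len(incidents),
        "top_alerts": counts.most_common(top_n),
        "mean_resolution_minutes": mean(minutes) if minutes else 0.0,
    }


if __name__ == "__main__":
    print(summarize(INCIDENTS))
```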
- Perform other tasks or responsibilities as assigned by Management and the Department Head.
II. Job Requirements
Education and Experience
Associate's Degree, Bachelor's Degree, or higher in Computer Science, Information Technology, Network Engineering, Telecommunications, or a related field.
1 to 3 years of experience in IT Operations, System Administration, or DevOps. Outstanding fresh graduates are encouraged to apply.
Technical Skills
Proficient in Linux administration and commonly used commands, with the ability to:
Analyze logs
Troubleshoot processes
Perform network connectivity testing
Diagnose disk and memory-related issues
Good understanding of TCP/IP, HTTP, DNS, and Load Balancing concepts.
Ability to diagnose and troubleshoot network issues using tools such as:
ping
telnet
curl
tcpdump
Experience with at least one monitoring or log aggregation platform, such as:
Zabbix
Prometheus + Grafana
Loki
Huawei Cloud Monitoring
Familiarity with the operation and maintenance of common web services and middleware, including:
Nginx
Tomcat
Redis
MySQL
Strong command of Linux utilities and tools such as:
grep
awk
systemd
netstat
df
top
Ability to develop Shell or Python scripts for automation and routine operational tasks.
Familiarity with Kubernetes ecosystem tools and configurations, including:
Helm
Operators
Istio
Grafana
Prometheus
Basic YAML configuration
Experience with automation and CI/CD tools such as:
Ansible
Jenkins
ArgoCD
Ability to trigger pipelines, review logs, and troubleshoot build failures.
Proficient in using kubectl commands, including:
get
describe
logs
exec
Solid understanding of Kubernetes concepts, including:
Pods
Services
Deployments
StatefulSets
Persistent Volumes (PV)
Persistent Volume Claims (PVC)
Soft Skills
Excellent communication and interpersonal skills.
Ability to remain calm and organized during incidents and effectively communicate updates.
Strong sense of ownership, accountability, and teamwork.
Ability to clearly explain technical issues and coordinate with cross-functional teams.
III. Preferred Qualifications
Familiarity with IT Service Management processes, including:
Incident Management
Problem Management
Change Management
Configuration Management
Experience with ticketing systems such as:
Jira Service Management
ServiceNow
ONES
Proprietary ticketing platforms
Hands-on experience with Docker and Kubernetes operations, including:
Checking Pod status
Reviewing logs
Restarting services
Experience supporting large-scale business environments such as:
E-commerce
Financial services
Gaming
Live streaming platforms
Ability to develop automation tools using Python or Go to streamline repetitive operational tasks.
IV. DevOps Work Environment and Characteristics
This is not a purely reactive monitoring role. Engineers are encouraged to proactively identify issues, optimize processes, and improve system reliability.
A structured escalation and incident management process is in place to ensure efficient issue resolution and accountability.
The organization provides comprehensive operational tools, including monitoring, logging, alerting, and management platforms to support daily operations and incident response.