bb wave inc.

DevOps Engineer

  • Posted 9 hours ago

Job Description

The DevOps Engineer is responsible for maintaining, monitoring, automating, and optimizing the company's infrastructure, applications, and deployment processes. The role ensures system reliability, availability, scalability, and security through proactive monitoring, incident response, automation, and continuous improvement of operational workflows.

I. Responsibilities

  • Monitoring and System Inspection

Continuously monitor metrics and alerts through platforms such as Prometheus, Grafana, Huawei Cloud Monitoring, and other monitoring tools.

Track key business indicators, including QPS (Queries Per Second), error rates, response times, success rates, and the health status of infrastructure resources such as CPU, memory, disk, and network utilization.

Conduct routine inspections of production systems, data center environments, and network connectivity, and prepare inspection reports in accordance with established procedures.

  • Alert Management and Incident Response

Respond promptly to alerts received through phone calls, SMS, Microsoft Teams, and other communication channels, following established Standard Operating Procedures (SOPs).

Classify alerts according to severity levels and execute the corresponding response plans.

Independently resolve incidents when possible, for example by restarting services, clearing disk space, scaling resources, or switching traffic.

Escalate unresolved issues to the Development Team according to the escalation process, track progress until resolution, and document all actions taken.

Continuously optimize alerting rules to minimize false positives, missed alerts, and alert storms.

  • Daily Operations and Process Management

Handle operational requests submitted through the ticketing system, including:

  • Account provisioning
  • Access and permission requests
  • Resource allocation requests
  • Firewall policy modifications
  • Domain and SSL certificate applications

Support deployment activities, including application releases, rollbacks, and configuration changes following standardized procedures.

Assist with routine operational tasks such as backup verification, slow query analysis, and vulnerability scan follow-ups.

Maintain and update CMDB (Configuration Management Database) records, ensuring accurate information on servers, IP addresses, applications, and responsible personnel.

  • Incident Management and Post-Mortem Analysis

Serve as the first responder during system incidents by coordinating communication channels, notifying stakeholders, managing resources, and providing timely updates.

Participate in incident post-mortem reviews by documenting timelines, root causes, and improvement actions.

Contribute to the continuous improvement of SOPs and the internal knowledge base.

  • Documentation and Knowledge Management

Develop and maintain workflow manuals, emergency response procedures, and SOP documentation.

Prepare regular weekly and monthly operational reports, including:

  • Top alert statistics
  • Incident frequency
  • Resolution times
  • Service performance metrics

Perform other tasks or responsibilities as may be assigned by Management and the Department Head.

II. Job Requirements

Education and Experience

Associate's Degree, Bachelor's Degree, or higher in Computer Science, Information Technology, Network Engineering, Telecommunications, or a related field.

1-3 years of experience in IT Operations, System Administration, or DevOps. Outstanding fresh graduates are encouraged to apply.

Technical Skills

Proficient in Linux administration and commonly used commands, with the ability to:

Analyze logs

Troubleshoot processes

Perform network connectivity testing

Diagnose disk and memory-related issues

Good understanding of TCP/IP, HTTP, DNS, and Load Balancing concepts.

Ability to troubleshoot with, and interpret the output of, tools such as:

ping

telnet

curl

tcpdump

Experience with at least one monitoring platform, such as:

Zabbix

Prometheus + Grafana

Loki

Huawei Cloud Monitoring

Familiarity with the operation and maintenance of common web services and middleware, including:

Nginx

Tomcat

Redis

MySQL

Strong command of Linux utilities such as:

grep

awk

systemd

netstat

df

top

Ability to develop Shell or Python scripts for automation and routine operational tasks.

Familiarity with Kubernetes ecosystem tools and configurations, including:

Helm

Operators

Istio

Grafana

Prometheus

Basic YAML configuration

Experience with automation and CI/CD tools such as:

Ansible

Jenkins

ArgoCD

Ability to trigger pipelines, review logs, and troubleshoot build failures.

Proficient in using kubectl commands, including:

get

describe

logs

exec

Solid understanding of Kubernetes concepts, including:

Pods

Services

Deployments

StatefulSets

Persistent Volumes (PV)

Persistent Volume Claims (PVC)

Soft Skills

Excellent communication and interpersonal skills.

Ability to remain calm and organized during incidents and effectively communicate updates.

Strong sense of ownership, accountability, and teamwork.

Ability to clearly explain technical issues and coordinate with cross-functional teams.

III. Preferred Qualifications

Familiarity with IT Service Management processes, including:

Incident Management

Problem Management

Change Management

Configuration Management

Experience with ticketing systems such as:

Jira Service Management

ServiceNow

ONES

Proprietary ticketing platforms

Hands-on experience with Docker and Kubernetes operations, including:

Checking Pod status

Reviewing logs

Restarting services

Experience supporting large-scale business environments such as:

E-commerce

Financial Services

Gaming

Live Streaming Platforms

Ability to develop automation tools using Python or Go to streamline repetitive operational tasks.

DevOps Work Environment and Characteristics

This is not a purely reactive monitoring role. Engineers are encouraged to proactively identify issues, optimize processes, and improve system reliability.

A structured escalation and incident management process is in place to ensure efficient issue resolution and accountability.

The organization provides comprehensive operational tools, including monitoring, logging, alerting, and management platforms, to support daily operations and incident response.

Job ID: 148965567