Engineer/Sr. Engineer-Infrastructure at Aeris

Noida, Uttar Pradesh, India

Full Time

Start Date

Immediate

Expiry Date

26 Mar, 26

Salary

0.0

Posted On

26 Dec, 25

Experience

5 year(s) or above

Remote Job

Yes

Telecommute

Yes

Sponsor Visa

Skills

Linux Systems, Kubernetes, OpenStack, Virtualization, SSH, Automation, Scripting, Hypervisors, Cloud Computing, Incident Management, Security, Troubleshooting, Documentation, CI/CD, Disaster Recovery, Monitoring

Industry

Software Development

Description

Role Overview

As an Engineer-Infrastructure, you design, deploy, and troubleshoot Linux systems, hypervisors, VMs (with live migration), Kubernetes clusters, and OpenStack environments. You are the SME ensuring the availability, reliability, security, and migration of the infrastructure stack, including secure SSH access and enterprise virtualization.

Location: Noida, work from office, 5 days per week

Responsibilities

- Design, implement, and manage complex infrastructure: Linux systems, hypervisors (OLVM, Proxmox, KVM, VMware), virtual machines, Kubernetes clusters, and OpenStack clouds.
- Configure, secure, and troubleshoot sshd (Secure Shell Daemon), manage SSH keys, and enforce secure remote access policies on Linux hosts.
- Perform live migration of VMs, including setup and operational management within OLVM and other hypervisors.
- Oversee the VM lifecycle: provisioning, migration, resource allocation, backup, and fault resolution.
- Monitor and automate infrastructure for reliability and scalability.
- Ensure Linux system hardening (patching, security, auditing) and manage advanced system/network performance issues.
- Deploy, scale, and secure Kubernetes clusters for container workloads.
- Architect, operate, and troubleshoot OpenStack services and integrations.
- Lead incident management (L3/expert escalations) and technical mentoring.
- Document infrastructure standards and procedures.
- Implement monitoring, alerting, and backup solutions.
- Collaborate with cross-functional teams on infrastructure automation and CI/CD integration.
- Drive compliance with security, regulatory, and operational controls.

Skills Required

A. Operating Systems

- Expert administration of Red Hat, Oracle Linux, Ubuntu, CentOS, SUSE, etc.
- Deep understanding and troubleshooting of sshd
- Installing, configuring, tuning, and securing /etc/ssh/sshd_config
- SSH key management, user access control, two-factor authentication, and auditing
- Handling root and user SSH policies, port forwarding, and proxies
- Linux boot and internals: kernel, systemd, SELinux, PAM, disk partitions, LVM/RAID, ZFS
- Automation and scripting: Bash, Python, Ansible
- OS optimization, security hardening, backup/restore, disaster recovery
- Experience with Windows Server upgrades

B. Hypervisors

- Advanced management of OLVM (Oracle Linux Virtualization Manager), Proxmox, KVM, VMware, Xen, and others
- VM provisioning, storage pools, network bridges, VM snapshots and backups
- Live migration of VMs across Oracle Linux KVM hosts
- Security policies, templates, cluster setup, and performance tuning in OLVM
- VM lifecycle, clustering, resource automation, live migration, network configuration
- Integration with storage and backups, fault finding/remediation

C. Kubernetes

- Cluster deployment, scaling, backup/restore, and upgrades (kubeadm, kops, Rancher, etc.)
- Deep knowledge of Kubernetes architecture: API, etcd, controller-manager, kubelet, and networking (CNI plugins)
- Advanced experience creating, modifying, and troubleshooting Pods, including multi-container and ephemeral containers
- Pod scheduling, affinity/anti-affinity, taints/tolerations, node selectors, and pod priorities
- Health checks with liveness, readiness, and startup probes for pods
- Pod resource management: resource requests, limits, and best practices for optimizing resource usage
- Managing pod disruption budgets (PDBs) to ensure high availability during maintenance
- Handling pod lifecycle events (init containers, hooks, restarts, termination, graceful shutdowns)
- Securing pods with pod security policies (PSPs), SecurityContexts, and namespaces
- Troubleshooting pod networking issues, DNS integration, inter-pod communication (Services, NetworkPolicies)
- Managing pod logs, events, and debugging (kubectl exec, logs, describe)
- Volume management for pods (PersistentVolumeClaims, ephemeral volumes, projected volumes)
- Rolling updates, canary deployments, and managing pod availability during application upgrades
- Hands-on experience with Operators and Custom Resources (CRDs) for advanced pod management
- Application deployment using Helm, Kustomize, and manifest authoring
- Cluster monitoring and alerting: Prometheus, Grafana, ELK stack
- Disaster recovery planning: etcd and cluster state backups, restore procedures
- Security: RBAC, secrets, service accounts, role bindings, pod identity
- Integrating Kubernetes with CI/CD pipelines (Jenkins, GitLab, ArgoCD)
- Cluster autoscaling (HPA/VPA), node autoscaling, and performance optimization

D. OpenStack

- Architectural mastery of core OpenStack services (Nova, Neutron, Cinder, Swift, Keystone, Glance, Horizon), including service dependencies and message queues.
- Automated deployment and upgrade management using tools like Ansible, Kolla, TripleO, and Packstack for installing, configuring, and updating OpenStack clusters.
- Advanced Nova compute operations, including VM provisioning, resizing, live migration, host aggregates, resource optimization, and troubleshooting.
- Neutron networking design and troubleshooting: building complex network topologies (provider/self-service networks, VLAN/VXLAN), managing routers and security groups, and resolving L2/L3 issues.
- Cinder block storage administration with multi-backend setups (Ceph, NFS, iSCSI), volume/snapshot management, performance tuning, and backups.
- Keystone identity, access, and RBAC management: project isolation, integration with LDAP/AD/SAML, token security, and API endpoint security.
- Glance image service management: secure image creation, registration, distribution, snapshotting, and integration with automation pipelines.
- High availability and disaster recovery design: HA clustering of controllers/services (Pacemaker/Corosync), database/message bus HA, failover, and DR strategy/testing.
- Monitoring, telemetry, and logging with Ceilometer, Monasca, or Prometheus/Grafana to enable real-time metrics, alerting, and advanced troubleshooting.
- OpenStack API and automation expertise: scripting with the OpenStack CLI/REST API, orchestration with Heat, and integration with external and hybrid cloud workflows.

Minimum Qualifications

- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
- 5+ years managing enterprise Linux systems and VMs (L3)
- Direct hands-on experience with OLVM in production
- 3+ years with Kubernetes and/or OpenStack clusters
- Proven experience with live migration in OLVM, Proxmox, or VMware

Preferred Certifications

- Red Hat Certified Engineer (RHCE)
- Oracle Linux Certified Implementation Specialist (for OLVM)
- Certified Kubernetes Administrator (CKA)
- OpenStack Administrator Certification
- VMware Certified Professional (VCP)

Soft Skills

- Strong troubleshooting and analytical skills
- Excellent communication and documentation practices
- Collaborative, proactive, and adaptable

Responsibilities

The Engineer/Sr. Engineer-Infrastructure is responsible for designing, deploying, and troubleshooting infrastructure components, including Linux systems, hypervisors, and Kubernetes clusters. The role also ensures the availability, reliability, and security of the infrastructure stack.