Responsibilities
About ByteDance
Founded in 2012, ByteDance’s mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.
Why Join Us
Creation is the core of ByteDance’s purpose. Our products are built to help imaginations thrive. This is doubly true of the teams that make our innovations possible. Together, we inspire creativity and enrich life, a mission we work toward every day. To us, every challenge, no matter how ambiguous, is an opportunity to learn, to innovate, and to grow as one team. Status quo? Never. Courage? Always. At ByteDance, we create together and grow together. That’s how we drive impact for ourselves, our company, and the users we serve. Join us.
About the Team
The Infrastructure Engineering team supports the company’s fast growth by building and operating hyperscale data centers. The team manages the end-to-end lifecycle of the server fleet and provides cloud solutions and infrastructure services that are scalable and reliable.
Embark on an exciting expedition to explore the rapidly expanding ByteDance domain in the United States, Europe, and Asia. Here, the Infrastructure Engineering team is crafting monumental data citadels that encircle the planet, sheltering legions of hundreds of thousands of servers. As the maestro of our production systems, you will embark on a captivating odyssey, taming the life cycles of these servers. Your adventure will begin with the orchestration of their initial deployment, navigating the intricate terrain of OS installation, summoning services like a digital magician, and maintaining vigilant watch over our inventory. But, like any epic tale, there will be times of challenge when you become a troubleshooter extraordinaire, mending and restoring with unwavering dedication. Eventually, you’ll guide them into the sunset, orchestrating their decommissioning and ensuring their rebirth through recycling, all while contributing to the pulsating rhythm of ByteDance’s technological evolution.
Key Responsibilities:
- Operation: As a Production Systems Engineer, your mission is to enhance the stability, efficiency, effectiveness, and scalability of our data center and cloud operations, platforms, and services on a worldwide scale.
- Lifecycle Improvement: Engage in and improve the whole lifecycle of Infrastructure systems - from system design consulting through to launch reviews, deployment, operation, and refinement.
- Automation: Develop & deploy tools and solutions to improve the automation, reliability, scalability, and operability of services.
- Monitoring: Deliver tools and solutions that monitor availability, latency, and the overall health of services, servers, cloud infrastructure, and networks.
- Disaster Recovery: Troubleshoot and resolve complex technical issues in a high-pressure, time-sensitive environment. Conduct high-level root-cause analysis for service interruptions and establish preventive measures. Practice sustainable incident response and postmortems.
- Cross-team Collaboration: Partner with stakeholders like infrastructure architects, project managers, data center operations engineers, platform developers, supply chain teams, and our internal customers to understand overarching business objectives. You will also have the opportunity to design and implement innovative solutions for our Core IDCs and CDN/Edge and Cloud Services.
- On-call: Participate in our cross-region on-call rotation and incident response teams to solve critical problems in production.
Qualifications
Minimum Qualifications
- Education: Bachelor’s degree in Computer Science, Electronic Engineering, or a relevant technical field, or equivalent practical experience.
- Experience: Minimum of 3 years of experience in at least one of the areas below:
- Linux System Administration: Proficient in Linux system administration tasks. Have an in-depth understanding of Linux kernels, drivers, and modules. Be capable of writing scripts in Bash and Python to automate routine system operations, thereby enhancing efficiency and reducing manual effort. This includes skills such as system configuration, performance tuning, and security management within the Linux environment.
- Tooling Adaptation, Deployment, and Maintenance: Skilled in adapting operation and maintenance tools to meet specific requirements for new server hardware. Capable of handling the entire lifecycle of software tools, from deployment to ongoing maintenance. This involves tasks related to facilitating the monitoring of server performance, provisioning resources effectively, managing fault handling in a timely manner, and carrying out repairs to ensure the seamless operation of new server hardware.
- Communication: Experience managing and coordinating teams in a global environment.
- Network: When it comes to networks, we’re seeking at least a junior-level understanding. Your ability to chart the course through the network labyrinth is essential.
Preferred Qualifications
- Data Center: An intermediate level of expertise is preferred. We seek candidates versed in everything from OS installs and break-fix work to larger projects such as planning and operations (covering the full infrastructure lifecycle) and new design-build or retrofits of existing systems.
- Full Stack Software Development: We are actively seeking individuals proficient in full stack software development. Ideal candidates should possess the following preferred skills:
- Be capable of creating and integrating RESTful APIs. This includes expertise in using Flask for Python-based back-end development to build robust API endpoints, as well as a solid understanding of JavaScript and the ability to use it, along with Node.js, for both front-end and back-end development tasks.
- Demonstrate proficiency in SQL for efficient database management, including designing database schemas, writing queries, and ensuring data integrity.
- Be familiar with Redis and have experience with Ansible.
- Experience with GPU server operations & maintenance is a strong plus.
- Project Management: Experience in preparing project plans and specifications, drafting scopes of work, and managing multiple projects simultaneously.
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.