Job Description
<About the Job>
We are looking for a highly motivated and skilled AI Infrastructure Engineer with strong hands-on experience in Kubernetes (K8s), particularly in supporting AI/ML workflows. In this role, you will design, implement, and maintain robust, scalable, and high-performance Kubernetes-based infrastructure that supports the entire lifecycle of our AI applications, from data processing and model training to deployment and monitoring. You will work closely with data scientists, AI/ML engineers, and DevOps teams to ensure seamless integration of AI/ML workloads within cloud-native environments. The ideal candidate has a deep understanding of container orchestration, distributed systems, and MLOps practices, and is passionate about building efficient, reliable platforms that enable rapid AI innovation. This is a unique opportunity to work at the intersection of AI and cloud infrastructure, contributing to next-generation systems that power intelligent applications at scale.

<Job Responsibilities>
.Design & Architecture: Design, build, and scale a reliable, efficient Kubernetes platform optimized for AI/ML workloads, including GPU provisioning, resource management, and performance tuning for computationally intensive tasks.
.Infrastructure Management: Manage the entire Kubernetes cluster lifecycle, from provisioning and configuration to ongoing maintenance, monitoring, and troubleshooting, ensuring high availability and scalability.
.Deployment & Automation: Develop and implement CI/CD pipelines to automate the deployment, scaling, and updating of machine learning models and AI services, with seamless integration of AI tools such as Kubeflow, MLflow, and Argo Workflows.
.Performance Optimization: Continuously monitor and optimize system performance, focusing on resource utilization, latency reduction, and the overall efficiency of AI workloads; ensure high availability and minimal downtime for AI services.
.Collaboration & Guidance: Work closely with data scientists, ML engineers, and cross-functional teams to understand their infrastructure requirements and provide technical solutions that meet workload demands effectively.
.Security & Compliance: Implement best practices for cluster security, including network policies, access controls, and vulnerability management, to safeguard sensitive data and maintain compliance.
.Cost & Resource Efficiency: Manage resources effectively to optimize cost while maintaining high-performance infrastructure for AI model training, inference, and data processing.

<Skills & Qualifications>
.Kubernetes Expertise: Hands-on experience with Kubernetes (K8s) architecture, including deploying applications, managing resources, and troubleshooting complex cluster issues in a production environment.
.Containerization & Linux Environment: Strong knowledge of container technologies such as Docker, with hands-on experience in Linux environments; expertise in container orchestration and deployment practices is highly valued.
.AI Workloads: Deep understanding of GPU scheduling and performance optimization, including strategies for resource allocation, workload balancing, and maximizing throughput for AI/ML tasks.
.Automation & CI/CD: Practical experience building and managing CI/CD pipelines with tools such as GitLab CI, Jenkins, GitHub Actions, or ArgoCD to automate deployments.
.Programming & Scripting: Proficiency in at least one scripting language (e.g., Python, Bash) is required.
.Networking: Knowledge of container networking and service mesh technologies (e.g., Istio, Linkerd) is a strong advantage.