ML Infrastructure Engineer

Data Science, Analytics & Machine Learning
149+ Applicants
Posted: 3 weeks ago
5-7 years
Bengaluru / Bangalore, Karnataka
work from office

Job Description

Byteridge is seeking a Rapid Prototyping Engineer specializing in AI Infrastructure and Optimization to work with our most strategic customers on deploying, fine-tuning, and optimizing large language models at scale. You will be at the forefront of Byteridge's AI infrastructure capabilities, helping customers unlock the full potential of foundation models through expert-level deployment on GPU infrastructure. This highly technical role requires deep expertise in machine learning infrastructure, GPU optimization, and production ML systems, combined with the ability to translate complex technical concepts into customer success.
Core Responsibilities
Model Deployment And Optimization
  • Lead end-to-end deployments of large language models on AWS infrastructure for strategic customers.
  • Design and implement training, fine-tuning, and inference pipelines using Amazon SageMaker AI.
  • Optimize model performance through GPU-level tuning, kernel optimization, and infrastructure configuration.
  • Deploy models on diverse GPU architectures, including NVIDIA and AWS custom silicon (Trainium, Inferentia).
Infrastructure Architecture And Performance
  • Architect scalable ML infrastructure using SageMaker AI Inference, HyperPod, and distributed training frameworks.
  • Implement CUDA-level optimizations and custom kernels for improved model performance.
  • Design storage and networking architectures optimized for high-throughput ML workloads.
  • Troubleshoot and resolve complex performance bottlenecks at the GPU driver and kernel level.
Customer Engagement And Technical Leadership
  • Partner with AWS AI Specialist Solution Architects and customer ML teams to understand model requirements and deployment constraints.
  • Provide technical guidance on model selection, fine-tuning strategies, and production best practices.
  • Conduct performance benchmarking and cost optimization analysis for ML workloads.
  • Share field insights with AWS product teams to influence infrastructure and service roadmaps.
Requirements
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience (Master's or PhD preferred).
  • 5+ years of experience in machine learning infrastructure, model deployment, or GPU computing.
  • Strong programming skills in Python and experience with ML frameworks (PyTorch, TensorFlow, JAX).
  • Deep understanding of LLM architectures, training methodologies, and inference optimization.
Technical Expertise (High-Level Alignment)
  • Hands-on experience training, fine-tuning, or deploying large language models in production.
  • Proficiency with GPU programming, CUDA, and kernel-level optimization techniques.
  • Experience with distributed training frameworks and multi-GPU/multi-node orchestration.
  • Strong knowledge of AWS core services: EC2 (GPU instances), S3, EFS, VPC, and networking.
Preferred Experience
  • Direct experience with Amazon SageMaker AI (Training, Inference, HyperPod) or equivalent ML platforms.
  • Understanding of GPU architectures (NVIDIA A100, H100) and AWS custom silicon (Trainium, Inferentia).
  • Experience with model compression techniques (quantization, pruning, distillation).
  • Knowledge of MLOps practices, model monitoring, and production ML system design.
  • Background in high-performance computing, distributed systems, or systems programming.
Essential Attributes
  • Ability to dive deep into technical problems and debug complex infrastructure issues.
  • Strong analytical skills with a data-driven approach to optimization.
  • Excellent communication skills to explain complex technical concepts to diverse audiences.
  • Comfortable working in ambiguous, fast-paced environments with evolving requirements.
  • Ownership mindset with the ability to drive projects from architecture to production.
This job was posted by Sweety S from Byteridge.

Skills

Python, Machine Learning, Large Language Models, AI, ML

About Company

Byteridge is a technology solutions and services company. We are experts in delivering software solutions, data analytics, cloud services and digital transformation solutions.

Important dates & deadlines

Application Deadline

19 Jul 26, 04:13 PM IST
