
Scaling Autonomy in the Cloud

Dec 4, 2024 · Peijun Zhu, Henry Lancelle, Jerry Fan, Chuan Qiu

At Nuro, developing autonomous driving technology isn’t just about cutting-edge robotics and AI — it’s also about handling an immense amount of data efficiently and cost-effectively. Every day, we run millions of hours of simulations, data processing tasks, and evaluations on Google Cloud. To make this feasible, we’ve built an in-house system that orchestrates complex task dependencies and optimizes our compute cluster for maximum resource utilization.

The Challenge: Scaling Beyond Industry Standards

In the quest for autonomy, leveraging vast datasets is essential. However, existing industry solutions for job processing fall short when scaling to our needs or providing the rich features developers require for in-depth analysis. Broadly, these solutions fall into two categories:

1. Workflow Management Systems

Tools like Airflow, Jenkins, and Buildkite offer user-friendly interfaces for visualizing task execution — monitoring statuses, accessing logs, and more. But they’re typically designed to handle thousands of tasks per pipeline. In contrast, our validation processes often involve millions of tasks. Scaling such systems to our level presents insurmountable challenges. Traditional schedulers can’t handle thousands of tasks per second or manage Directed Acyclic Graphs (DAGs) with millions of nodes. Moreover, due to user interface constraints, identifying and interacting with individual tasks in massive DAGs becomes impractical.

2. Map-Reduce Frameworks

Platforms like Celery, Google Cloud BigQuery, DataFlow, Ray, and Pub/Sub excel at processing large datasets. However, they’re optimized for quick, stateless operations and lack detailed per-invocation tracking. Without that granularity, diagnosing issues in individual simulations is difficult, and customizing resource utilization strategies is nearly impossible.


Fig 1. BATES manages DAGs of millions of scenes.

Introducing BATES: Our Scalable Solution

To overcome these hurdles, we developed BATES (Batch Task Execution System) — a robust platform capable of managing millions of tasks daily. Here’s how BATES transforms our workflow:

1. Hierarchical Task Management

In BATES, every execution unit is a Task, encapsulated as a protobuf message routed to a worker for execution. Tasks can dynamically generate sub-tasks, forming a hierarchical structure. Instead of predefining an entire job graph, tasks can spawn new tasks as needed, allowing for flexibility and scalability. This means the root task, or Job, is submitted by the user without the overhead of specifying every dependency upfront.
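To make the dynamic-spawning idea concrete, here is a minimal sketch of hierarchical tasks in Python. The `Task` class and `spawn` method are illustrative names, not BATES’s actual API (which routes protobuf messages to workers); the point is that a root Job grows its own sub-task tree at runtime instead of declaring the full graph upfront.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    payload: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def spawn(self, name, **payload):
        """Dynamically create a sub-task instead of predefining the DAG."""
        child = Task(name, payload)
        self.children.append(child)
        return child

def count_tasks(task):
    """Total tasks in the tree rooted at `task` (the user-submitted Job)."""
    return 1 + sum(count_tasks(c) for c in task.children)

# The user submits only the root Job; the tree below is built on the fly.
job = Task("simulation_batch")
for scene_id in range(3):
    scene = job.spawn("scene", scene_id=scene_id)
    scene.spawn("evaluate", scene_id=scene_id)
```

Because each task decides its own children, the same mechanism scales from a handful of tasks to millions without the submitter ever enumerating them.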

2. Advanced Orchestration and Tracking

Our BATES server leverages a combination of an OLTP database and Redis to monitor task statuses. We’ve implemented a custom message queue that automatically assigns and adjusts task priorities based on job size, start time, and real-time factors like new job submissions or user reprioritizations. Users can attach metadata to tasks, which we export to observability platforms like Prometheus and Google Cloud BigQuery for analysis and alerting.
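As a rough sketch of how such a queue might blend job size, age, and user reprioritizations into a single priority, consider the toy implementation below. The weighting formula and the `TaskQueue` class are hypothetical stand-ins, not BATES’s actual scoring.

```python
import heapq
import time

def priority(job_size, submit_ts, now, user_boost=0):
    """Lower value = runs sooner. Smaller and older jobs are favored;
    an explicit user boost overrides both (weights are made up)."""
    age_minutes = (now - submit_ts) / 60
    return job_size - 10 * age_minutes - 1000 * user_boost

class TaskQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal priorities stay FIFO

    def push(self, task, job_size, submit_ts, user_boost=0):
        p = priority(job_size, submit_ts, time.time(), user_boost)
        heapq.heappush(self._heap, (p, self._seq, task))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

In this sketch a small job jumps ahead of a million-task job submitted at the same moment, while a user reprioritization can pull any job to the front regardless of size.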


Fig 2. Job scheduling optimizes job turnaround time based on SLA and size.

Simplifying Execution with Generic Workers

Building and managing worker environments was another significant challenge. Engineers needed to create container images with specific software versions, set up workers, and fine-tune auto-scaling and resource allocations — a time-consuming process.

Centralized Worker Pool

To streamline this, we’ve established a pool of Generic Workers capable of running most workloads. The worker environment mirrors that of our engineers’ workstations, minimizing setup efforts. Central management allows the compute infrastructure team to fine-tune performance and efficiency, improving overall resource utilization.

Efficient Package Management

We utilize Content Addressable Storage (CAS) as our package format. Identical files across different versions are stored only once, reducing storage needs through automatic deduplication. Our custom filesystem, built using Filesystem in Userspace (FUSE), fetches only the files actively accessed during execution. This on-demand fetching approach minimizes initialization overhead and enhances flexibility in task-to-worker assignments.
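The deduplication property falls out of keying blobs by their content digest. The following is a minimal in-memory sketch of that idea; real CAS layouts (chunking, garbage collection, the FUSE integration) are considerably more involved, and `CASStore` is an illustrative name.

```python
import hashlib

class CASStore:
    def __init__(self):
        self._blobs = {}      # content digest -> bytes
        self._manifests = {}  # package name -> {path: digest}

    def put_package(self, name, files):
        """Store a package version as a manifest of path -> digest.
        Writing a blob under its digest dedupes identical files for free."""
        manifest = {}
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            self._blobs[digest] = content
            manifest[path] = digest
        self._manifests[name] = manifest

    def read(self, package, path):
        """Fetch a single file on demand, as the FUSE layer would."""
        return self._blobs[self._manifests[package][path]]
```

Two package versions that share a library store that library’s bytes exactly once, and the on-demand `read` means a worker only pulls the files a task actually touches.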

Intelligent Task Scheduling and Auto-Scaling

Auto-scaling our cluster is crucial for balancing job turnaround times against cost efficiency. Our scaling policies weigh several factors.

1. Optimal Resource Allocation

Deciding the number of worker replicas involves trade-offs. Scaling up too little leads to longer queue times and delayed jobs, while scaling up too much leaves resources idle or forces node terminations that disrupt ongoing tasks. We address these challenges with custom task assignment strategies: our scheduler assigns tasks to workers in a way that maximizes resource utilization, and when scaling down, we intelligently select the nodes whose termination will disrupt the fewest running tasks.

Fig 3. Auto-scaling strategies are aware of job SLAs and cloud VM cost tiers.
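One way to picture the scale-down choice is as a ranking problem: drain the nodes whose termination forfeits the least work. The heuristic below (fewest running tasks, then least estimated remaining task time) is an illustrative simplification, not the scheduler’s actual policy.

```python
def pick_nodes_to_drain(nodes, count):
    """Choose `count` nodes to terminate with minimal disruption.
    `nodes` is a list of (name, running_tasks, est_remaining_seconds)."""
    ranked = sorted(nodes, key=lambda n: (n[1], n[2]))
    return [name for name, _, _ in ranked[:count]]
```

An idle node is always drained first; among busy nodes, the one closest to finishing its work loses the least progress.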

2. Cost Optimization

Cloud compute resources come with varying price points due to factors like committed usage discounts or spot instance availability. Our auto-scaler synthesizes multi-faceted data, considering job SLAs, resource costs, and availability. It computes optimal plans that balance performance requirements with cost-saving opportunities. This strategy has saved Nuro millions of dollars in cloud costs while boosting productivity.
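A toy version of this trade-off: buy cheap, preemptible spot capacity when a job’s SLA leaves slack to absorb preemptions, and pay for on-demand capacity when it doesn’t. The prices, the one-hour slack threshold, and the preemption penalty below are made-up parameters for illustration, not Nuro’s actual figures.

```python
SPOT_PRICE = 0.30       # $/vCPU-hour (assumed)
ON_DEMAND_PRICE = 1.00  # $/vCPU-hour (assumed)

def plan_capacity(needed_vcpus, sla_slack_hours, preemption_risk=0.2):
    """Split required capacity between spot and on-demand VMs."""
    if sla_slack_hours < 1:
        # Too little slack to retry preempted work: buy reliable capacity.
        spot = 0
    else:
        # Over-provision spot slightly to absorb expected preemptions.
        spot = round(needed_vcpus * (1 + preemption_risk))
    on_demand = max(needed_vcpus - spot, 0)
    cost = spot * SPOT_PRICE + on_demand * ON_DEMAND_PRICE
    return {"spot": spot, "on_demand": on_demand, "hourly_cost": cost}
```

Even in this crude model, a slack-rich job runs at a fraction of the on-demand price, which is the kind of gap that compounds into the savings described above.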

Adaptive Resource Management

Predicting the exact resource specifications for dynamic workloads is inherently challenging. Allocating too much leads to underutilized resources, wasting money, while allocating too little can cause performance bottlenecks or out-of-memory errors. Our solution eliminates the need for users to specify detailed resource requirements. Users provide minimal information, such as the need for a GPU, and the scheduler uses past and current usage data to estimate and allocate appropriate resources. This approach not only simplifies the user experience but also enables us to plan and optimize resources on a broader scale.
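A simple instance of usage-based sizing: allocate the next run from a high percentile of past peak usage plus headroom, falling back to a conservative default when no history exists. The 95th percentile, 20% margin, and default value are illustrative choices, not the scheduler’s actual parameters.

```python
def estimate_memory_mb(past_peaks_mb, percentile=0.95, headroom=1.2):
    """Estimate a memory allocation from observed peak usage of past runs."""
    if not past_peaks_mb:
        return 4096  # conservative default for first-time tasks (assumed)
    ordered = sorted(past_peaks_mb)
    idx = min(int(percentile * len(ordered)), len(ordered) - 1)
    return int(ordered[idx] * headroom)
```

The percentile trades waste against out-of-memory risk: a higher percentile wastes more memory on the typical run but makes OOM kills rarer, and because the estimator sees fleet-wide history, that trade-off can be tuned centrally rather than per user.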

Conclusion: Driving Autonomy Forward

The development of BATES and our Generic Worker ecosystem represents a significant investment in Nuro’s ability to scale simulations, evaluations, and data processing efficiently. By overcoming industry limitations, we’ve:

  • Enhanced Scalability: Managed millions of daily tasks without sacrificing performance.
  • Improved Cost Efficiency: Saved millions in cloud expenses through intelligent scheduling and resource management.
  • Boosted Productivity: Freed our engineers to focus on advancing autonomy rather than grappling with infrastructure issues.

Our journey underscores the importance of innovative solutions in driving the future of autonomous vehicles. We’re proud of the strides we’ve made and are excited about what lies ahead.



Join Us in Shaping the Future of Mobility

All this great work is a testament to our incredible team. If you’re passionate about solving complex problems and pushing the boundaries of technology, we’d love to hear from you. Join our team!

The authors would like to acknowledge the contributions of Ryan Brown, Hanhong Gao, Adam Groner, Josh Karges, KK Kim, Aaron Lee, Michael Li, Yiwei Li, Daniel Lin, Sachin Padmanabhan, Iris Sun, Qi Zhu, and other Nuro team members not mentioned here who provided help and support.