
🔧 First impressions: GPU + GCP Batch


Category: 🔧 Programming
🔗 Source: dev.to

Weihao and I have been working on programmatic benchmarks for DeepCell on Google Batch.

We tried Vertex AI custom training jobs but ran into an issue with service accounts. It appears that the training job ran on the expected(?) service account, but in an unexpected project. We didn't track down how to give that project's user access to BigQuery. We also figured that we may want to run the container a little closer to the metal (not a VM though).

Enter Google Batch … I've used Batch-like products but never with a GPU. Initial work often looks like a lot of red failures 🥲

Screenshot of the Batch jobs list. Mostly failures, some successes.

First impressions:

1: BigQuery rate limit

I forgot that BigQuery has a fairly low rate limit on table updates (5 operations per 10 seconds). A batch of 10 jobs finishing close together would exceed it and the table update would fail. The quick fix was retry logic.
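The retry logic isn't shown here, so the following is only a minimal sketch of one common approach: exponential backoff with random jitter, so that jobs finishing at the same moment don't all retry at the same moment too. `RateLimitError` and `retry_with_backoff` are hypothetical names for illustration; in real code you would catch whatever exception your BigQuery client raises when the limit is hit.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error raised when the rate limit is hit."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, retry_on=(RateLimitError,),
                       sleep=time.sleep):
    """Call operation(), retrying on rate-limit errors.

    The wait before each retry is drawn uniformly from [0, cap], where the
    cap doubles on every attempt (base_delay, 2x, 4x, ...) up to max_delay.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Usage would look like `retry_with_backoff(lambda: update_results_table(row))`, where `update_results_table` is again a hypothetical name for the function that writes a benchmark result. The jitter matters more than the exact delays: without it, ten jobs that collided once tend to collide again on the retry.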

2: GPU scarcity

We've had bad luck getting GPUs. The zone reports exhausted resource pools on the regular:

Screenshot of an error message showing that the GCE resource pool is exhausted for the zone.

We ran into a surprising quota issue as well, running out of persistent disk SSDs – even though we weren't using any…

Screenshot of error message showing inadequate SSD quota: limit 500, usage 480, wanted 30.

The quota page showed the usage going up and down (again, we never observed any disks in the GCE console):

Screenshot of the quota visualization showing

You can (kinda) see it trying different availability zones within region us-central1 here:

Screenshot of the monitoring metric for allocated quota, showing several availability zones summing up to an overall regional usage.

We tried increasing the quota to 1 TB (from 500 GB). No luck so far: no resources…!

The quota goes up in increments of 30 GB, one per zone resource exhaustion error. I'm guessing it's a Batch implementation detail to spin up the disks in anticipation of having a VM ready.
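The numbers in the quota error line up with that guess. As a rough arithmetic sketch, assuming each failed attempt holds one 30 GB reservation and none has been released yet (an assumption, not confirmed Batch behavior; the quota page showed usage going down as well as up):

```python
def attempts_before_quota_exhausted(limit_gb, disk_gb, usage_gb=0):
    """Number of further disk reservations that fit under the quota."""
    return (limit_gb - usage_gb) // disk_gb


limit_gb = 500  # original persistent disk SSD quota
disk_gb = 30    # size of each reservation seen in the error message

fits = attempts_before_quota_exhausted(limit_gb, disk_gb)
usage_gb = fits * disk_gb

print(fits)                           # 16 reservations fit
print(usage_gb)                       # 480 GB in use
print(usage_gb + disk_gb > limit_gb)  # True: the 17th request exceeds 500 GB
```

That is the "limit 500, usage 480, wanted 30" state from the error message: sixteen outstanding reservations, with the seventeenth rejected.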

Fortunately…! there is no billing charge for these disks. It's nice that it only bills when it actually runs, although it's odd to use up the quota.

I've heard several reports of people using GPUs on Batch, but it's clear that the incantations are arcane indeed. If you know how to reliably get GPUs or have worked through these errors, please let me know!
