Troubleshooting#
This page is organized by symptom rather than by error message. Locate the section that best matches what you are observing and follow the guidance there.
For a quick diagnostic of common environment problems, run:
kinetic init
and choose troubleshoot at the prompt. It checks for missing tools,
misconfigured credentials, and unhealthy infrastructure, and prints a
concrete fix command for each failed check. kinetic init will also
offer this path automatically when it detects that prerequisites are
missing. The full list of categories the troubleshoot path covers is
described at the end of this page.
Startup and build issues#
“Project must be specified”#
KINETIC_PROJECT (or GOOGLE_CLOUD_PROJECT) is not set. Set it once
in your shell profile:
export KINETIC_PROJECT="your-project-id"
Or pass project= to the decorator. See Configuration.
“404 Requested entity was not found”#
A required GCP resource — usually an Artifact Registry repository or a GKE cluster — doesn’t exist yet. Run the setup once:
kinetic up
Or, if up already ran, enable the missing APIs and create the
registry manually (uncommon):
gcloud services enable compute.googleapis.com cloudbuild.googleapis.com \
artifactregistry.googleapis.com storage.googleapis.com \
container.googleapis.com --project=$KINETIC_PROJECT
gcloud artifacts repositories create "kn-${KINETIC_CLUSTER:-kinetic-cluster}" \
--repository-format=docker --location=us \
--project=$KINETIC_PROJECT
Container build is slow on first run#
The first run with a given requirements.txt builds a new container
image via Cloud Build (~2–5 minutes). Subsequent runs reuse the cached
image and start in under a minute. If you’re churning dependencies
multiple times a day and this is hurting you, see
Execution Modes for prebuilt mode.
Container build failures#
Check Cloud Build logs:
gcloud builds list --project=$KINETIC_PROJECT --limit=5
gcloud builds log <build-id> --project=$KINETIC_PROJECT
Common causes: a package in requirements.txt that doesn’t exist,
network issues during install, or a base image that’s been updated
since you last built. See Dependencies.
Auth and config issues#
“Permission denied” on GCP operations#
Your user (or service account) is missing IAM roles. The minimum set
for Kinetic is roles/storage.admin, roles/artifactregistry.admin,
roles/container.admin, and roles/cloudbuild.builds.editor on the
project:
gcloud projects add-iam-policy-binding $KINETIC_PROJECT \
--member="user:your-email@example.com" \
--role="roles/storage.admin"
Repeat for the other roles. kinetic init’s troubleshoot path flags
missing roles by checking the actual operations that fail.
Application Default Credentials missing or expired#
gcloud auth login
gcloud auth application-default login
If you’ve previously set GOOGLE_APPLICATION_CREDENTIALS to a service
account key, that takes precedence over user ADC.
Settings aren’t taking effect#
Run kinetic config — it prints every config value and where it came
from (decorator arg, CLI flag, env var, or default). The precedence
rules are documented in Configuration.
Scheduling and quota issues#
Job stuck in PENDING for more than 10 minutes#
The cluster autoscaler is trying to provision a node but can’t. Two common reasons:
No quota for the requested accelerator in your zone. Check Cloud Console → IAM & Admin → Quotas, filter by your accelerator type. If quota is exhausted, request more or try a different zone.
Spot capacity is unavailable. If your node pool was created with
--spot, GCP may have no spot capacity to allocate right now. Switch to on-demand or try later.
kinetic init’s troubleshoot path includes a quota check that surfaces
exhausted accelerator quotas in your region. If it doesn’t flag
anything, inspect the Cloud Console quota page directly for
finer-grained breakdowns.
Multi-host TPU job fails right after submit#
Likely causes: topology mismatch (your code expected a different number of devices than the slice has), a stale Pathways context from a prior crashed job, or one host failing before the others can join the collective. See Distributed Training for the full list of multi-host failure modes.
Runtime failures#
ImportError on the remote pod#
The package isn’t in your requirements.txt or pyproject.toml. A
local pip install doesn’t carry over — only what’s in those files
gets installed. See Dependencies for the
common pitfalls list.
Pickle / cloudpickle errors at submit time#
The function or one of its closures references something that can’t be serialized — typically an open file handle, a database connection, or a module-level singleton initialized for local use. Move that initialization inside the decorated function.
JAX version mismatch errors#
You probably pinned jax or jaxlib in requirements.txt. Kinetic
filters those out by default; if you need a specific version, use
# kn:keep (see Dependencies), but expect
to debug runtime/library alignment yourself.
Job FAILS but logs look fine#
The pod exited non-zero without writing a result payload — usually
caused by an OOM kill or the kernel reaping the process. Check pod
events with kubectl describe pod <pod-name> (find the pod name from
kinetic jobs status <id>).
Missing outputs and results#
result() raises “result payload not found”#
The job either never produced one (it crashed before finishing), or
it was already cleaned up. Failed jobs don’t write a result payload.
For long jobs, prefer writing artifacts under KINETIC_OUTPUT_DIR
instead of relying on the return value — see Checkpointing.
Files I wrote inside the job are gone#
Two possibilities:
You wrote them under
/tmpor another pod-local path. The pod is destroyed when the job ends; pod-local files don’t survive. Always write toKINETIC_OUTPUT_DIR.You wrote them under
KINETIC_OUTPUT_DIRbut more than 30 days have passed. The default GCS bucket has a 30-day TTL. Copy critical artifacts to a bucket without lifecycle rules. See Checkpointing for the TTL and retention details.
Logs aren’t streaming back#
Network blip during a --follow stream is the most common cause. The
pod is unaffected — log retrieval is read-only. Use
kinetic jobs logs <id> (without --follow) or --tail N to fetch
fresh logs from any machine.
What the troubleshoot path actually checks#
kinetic init’s troubleshoot path runs eight groups of checks and
prints concrete fix commands when any fail. The groups (matching the
source at kinetic/cli/commands/doctor.py):
Local Tools —
gcloud,kubectl, andgke-gcloud-auth-pluginare installed and on your PATH.Authentication — Application Default Credentials are present, refreshable, and not expired.
Configuration —
KINETIC_PROJECT,KINETIC_ZONE, andKINETIC_CLUSTERresolve to non-empty values.GCP Project — the project exists and has billing enabled.
GCP APIs — Compute Engine, Cloud Build, Artifact Registry, Storage, and Container APIs are enabled.
GCP Resources — the Kinetic service accounts, Artifact Registry repository, GCS buckets, VPC network, and Cloud NAT all exist.
Infrastructure — Pulumi state is present and the GKE cluster is in the
RUNNINGstate.Kubernetes — your
kubeconfigpoints at the cluster, the API server responds, node pools are healthy, GPU drivers are installed where needed, and accelerator quotas are not exhausted.
Each failing check prints a one-line fix suggestion. For multi-step fixes, the troubleshoot path prints a copy-paste command block.