Troubleshooting

Troubleshooting#

This page is organized by symptom rather than by error message. Locate the section that best matches what you are observing and follow the guidance there.

For a quick diagnostic of common environment problems, run:

kinetic init

and choose troubleshoot at the prompt. It checks for missing tools, misconfigured credentials, and unhealthy infrastructure, and prints a concrete fix command for each failed check. kinetic init will also offer this path automatically when it detects that prerequisites are missing. The full list of categories the troubleshoot path covers is described at the end of this page.

Startup and build issues#

“Project must be specified”#

KINETIC_PROJECT (or GOOGLE_CLOUD_PROJECT) is not set. Set it once in your shell profile:

export KINETIC_PROJECT="your-project-id"

Or pass project= to the decorator. See Configuration.

“404 Requested entity was not found”#

A required GCP resource — usually an Artifact Registry repository or a GKE cluster — doesn’t exist yet. Run the setup once:

kinetic up

Or, if up already ran, enable the missing APIs and create the registry manually (uncommon):

gcloud services enable compute.googleapis.com cloudbuild.googleapis.com \
    artifactregistry.googleapis.com storage.googleapis.com \
    container.googleapis.com --project=$KINETIC_PROJECT

gcloud artifacts repositories create "kn-${KINETIC_CLUSTER:-kinetic-cluster}" \
    --repository-format=docker --location=us \
    --project=$KINETIC_PROJECT

Container build is slow on first run#

The first run with a given requirements.txt builds a new container image via Cloud Build (~2–5 minutes). Subsequent runs reuse the cached image and start in under a minute. If you’re churning dependencies multiple times a day and this is hurting you, see Execution Modes for prebuilt mode.

Container build failures#

Check Cloud Build logs:

gcloud builds list --project=$KINETIC_PROJECT --limit=5
gcloud builds log <build-id> --project=$KINETIC_PROJECT

Common causes: a package in requirements.txt that doesn’t exist, network issues during install, or a base image that’s been updated since you last built. See Dependencies.

Auth and config issues#

“Permission denied” on GCP operations#

Your user (or service account) is missing IAM roles. The minimum set for Kinetic is roles/storage.admin, roles/artifactregistry.admin, roles/container.admin, and roles/cloudbuild.builds.editor on the project:

gcloud projects add-iam-policy-binding $KINETIC_PROJECT \
    --member="user:your-email@example.com" \
    --role="roles/storage.admin"

Repeat for the other roles. kinetic init’s troubleshoot path flags missing roles by checking the actual operations that fail.

Application Default Credentials missing or expired#

gcloud auth login
gcloud auth application-default login

If you’ve previously set GOOGLE_APPLICATION_CREDENTIALS to a service account key, that takes precedence over user ADC.

Settings aren’t taking effect#

Run kinetic config — it prints every config value and where it came from (decorator arg, CLI flag, env var, or default). The precedence rules are documented in Configuration.

Scheduling and quota issues#

Job stuck in `PENDING` for more than 10 minutes#

The cluster autoscaler is trying to provision a node but can’t. Two common reasons:

No quota for the requested accelerator in your zone. Check Cloud Console → IAM & Admin → Quotas, filter by your accelerator type. If quota is exhausted, request more or try a different zone.
Spot capacity is unavailable. If your node pool was created with --spot, GCP may have no spot capacity to allocate right now. Switch to on-demand or try later.

kinetic init’s troubleshoot path includes a quota check that surfaces exhausted accelerator quotas in your region. If it doesn’t flag anything, inspect the Cloud Console quota page directly for finer-grained breakdowns.

Multi-host TPU job fails right after submit#

Likely causes: topology mismatch (your code expected a different number of devices than the slice has), a stale Pathways context from a prior crashed job, or one host failing before the others can join the collective. See Distributed Training for the full list of multi-host failure modes.

Runtime failures#

`ImportError` on the remote pod#

The package isn’t in your requirements.txt or pyproject.toml. A local pip install doesn’t carry over — only what’s in those files gets installed. See Dependencies for the common pitfalls list.

Pickle / cloudpickle errors at submit time#

The function or one of its closures references something that can’t be serialized — typically an open file handle, a database connection, or a module-level singleton initialized for local use. Move that initialization inside the decorated function.

JAX version mismatch errors#

You probably pinned jax or jaxlib in requirements.txt. Kinetic filters those out by default; if you need a specific version, use # kn:keep (see Dependencies), but expect to debug runtime/library alignment yourself.

Job FAILS but logs look fine#

The pod exited non-zero without writing a result payload — usually caused by an OOM kill or the kernel reaping the process. Check pod events with kubectl describe pod <pod-name> (find the pod name from kinetic jobs status <id>).

Missing outputs and results#

`result()` raises “result payload not found”#

The job either never produced one (it crashed before finishing), or it was already cleaned up. Failed jobs don’t write a result payload. For long jobs, prefer writing artifacts under KINETIC_OUTPUT_DIR instead of relying on the return value — see Checkpointing.

Files I wrote inside the job are gone#

Two possibilities:

You wrote them under /tmp or another pod-local path. The pod is destroyed when the job ends; pod-local files don’t survive. Always write to KINETIC_OUTPUT_DIR.
You wrote them under KINETIC_OUTPUT_DIR but more than 30 days have passed. The default GCS bucket has a 30-day TTL. Copy critical artifacts to a bucket without lifecycle rules. See Checkpointing for the TTL and retention details.

Logs aren’t streaming back#

Network blip during a --follow stream is the most common cause. The pod is unaffected — log retrieval is read-only. Use kinetic jobs logs <id> (without --follow) or --tail N to fetch fresh logs from any machine.

What the troubleshoot path actually checks#

kinetic init’s troubleshoot path runs eight groups of checks and prints concrete fix commands when any fail. The groups (matching the source at kinetic/cli/commands/doctor.py):

Local Tools — gcloud, kubectl, and gke-gcloud-auth-plugin are installed and on your PATH.
Authentication — Application Default Credentials are present, refreshable, and not expired.
Configuration — KINETIC_PROJECT, KINETIC_ZONE, and KINETIC_CLUSTER resolve to non-empty values.
GCP Project — the project exists and has billing enabled.
GCP APIs — Compute Engine, Cloud Build, Artifact Registry, Storage, and Container APIs are enabled.
GCP Resources — the Kinetic service accounts, Artifact Registry repository, GCS buckets, VPC network, and Cloud NAT all exist.
Infrastructure — Pulumi state is present and the GKE cluster is in the RUNNING state.
Kubernetes — your kubeconfig points at the cluster, the API server responds, node pools are healthy, GPU drivers are installed where needed, and accelerator quotas are not exhausted.

Each failing check prints a one-line fix suggestion. For multi-step fixes, the troubleshoot path prints a copy-paste command block.