FAQ
When should I use run() vs run_async()?
Use @kinetic.run() when you want your local script to wait for the
result. Use run_async() when the job is long enough that you’d
rather get a JobHandle back, walk away, and reattach later. run_async() is
the right call for anything multi-hour, anything you might want to monitor
from a different machine, or anything you want to fan out and check on in
parallel. See Managing Async Jobs.
Why is the first run slower?
The first run with a given set of dependencies builds a container image via
Cloud Build (~2–5 minutes). The image is tagged by a hash of your
dependencies, so any subsequent run with the same requirements.txt reuses
the cached image and starts in under a minute. If your dependencies change,
the build re-runs. When the build cost becomes a bottleneck (for example,
when you change requirements.txt several times a day), switch to
prebuilt mode, which installs deps at pod startup instead of baking
them into a fresh image. See Execution Modes and
Dependencies.
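The caching behavior above follows from content-addressed tagging. The sketch below illustrates the idea with a hypothetical helper, image_tag(); the normalization rules and the use of SHA-256 are assumptions for illustration, not a description of how Kinetic actually computes its tags.

```python
import hashlib


def image_tag(requirements_text: str, length: int = 12) -> str:
    """Derive a short, deterministic tag from dependency contents.

    Hypothetical helper: blank lines, full-line comments, and line order
    are normalized away before hashing, so only the dependency set matters.
    """
    lines = []
    for raw in requirements_text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            lines.append(line)
    normalized = "\n".join(sorted(lines))
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]


first = image_tag("jax==0.4.30\nnumpy>=1.26\n")
same = image_tag("# training deps\nnumpy>=1.26\n\njax==0.4.30\n")
changed = image_tag("jax==0.4.31\nnumpy>=1.26\n")

print(first == same)     # True: same dependency set, so a cached image could be reused
print(first == changed)  # False: a version bump yields a new tag, hence a rebuild
```

Under this scheme, any edit that changes the dependency set changes the tag, which is why frequent edits to requirements.txt keep triggering fresh builds.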
Should I use prebuilt or bundled mode?
Default to bundled. It is the only mode that works without first
publishing a base image. Reach for prebuilt when you change
requirements.txt several times a day and the per-iteration build cost is
hurting you. Prebuilt mode itself works with any base image at the
configured repo, but the Kinetic project does not currently publish public
base images, so you will need to run kinetic build-image once to push your
own before this becomes a usable option. See Execution Modes.
When should I use Data(...) vs direct gs://... URIs?
Always prefer kinetic.Data(...). It accepts both local paths and
gs:// URIs and resolves to a plain filesystem path on the remote, so
your function only sees paths regardless of where the bytes started.
That is the whole point: one consistent API whether you are shipping a
local directory, pointing at an existing GCS bucket, or asking for a
FUSE mount via Data(..., fuse=True). Reach for raw gs:// URIs in
your code only if you specifically want to bypass the Data abstraction.
See Data for the decision matrix.
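One practical consequence of the function only seeing paths is that its body can be written and tested locally against an ordinary directory. The sketch below assumes a hypothetical count_examples() function and a made-up shard layout (*.jsonl files); nothing in it calls Kinetic itself.

```python
import tempfile
from pathlib import Path


def count_examples(data_dir: str, pattern: str = "*.jsonl") -> int:
    """Count non-empty lines across shard files under data_dir."""
    total = 0
    for shard in sorted(Path(data_dir).glob(pattern)):
        with shard.open("r", encoding="utf-8") as f:
            total += sum(1 for line in f if line.strip())
    return total


# Local check against a throwaway directory. On the remote, the same function
# would receive whatever filesystem path the Data(...) argument resolved to.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "shard-0.jsonl").write_text('{"x": 1}\n{"x": 2}\n', encoding="utf-8")
    Path(tmp, "shard-1.jsonl").write_text('{"x": 3}\n', encoding="utf-8")
    n_examples = count_examples(tmp)

print(n_examples)  # 3
```

Because the function never parses a gs:// URI, switching between a local directory, a bucket, or a FUSE mount should not require touching its body.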
How do I save checkpoints and outputs?
Write everything you want to keep under KINETIC_OUTPUT_DIR. Kinetic sets
this env var inside the job pod to a per-job GCS prefix. Anything you write
under it is durable: it outlives the pod and is reachable from your local
machine. The job’s Python return value is for small results; larger outputs and
checkpoints belong in the output dir. See Checkpointing.
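A small helper can keep every output path rooted under KINETIC_OUTPUT_DIR. This is a minimal sketch: output_path() is a hypothetical name, the local fallback directory is an assumption for running outside a pod, and the gs:// value set below is a made-up example of what the variable might contain.

```python
import os


def output_path(*parts: str, default: str = "./outputs") -> str:
    """Join parts onto KINETIC_OUTPUT_DIR, falling back to a local directory.

    Plain "/" joining keeps a gs:// prefix intact on every platform.
    """
    base = os.environ.get("KINETIC_OUTPUT_DIR", default).rstrip("/")
    return "/".join([base, *(p.strip("/") for p in parts)])


# Simulate the variable Kinetic sets inside the pod; the bucket name is made up.
os.environ["KINETIC_OUTPUT_DIR"] = "gs://example-bucket/jobs/job-123/"

ckpt = output_path("checkpoints", f"step_{500:06d}")
print(ckpt)  # gs://example-bucket/jobs/job-123/checkpoints/step_000500
```

The resulting string would then be handed to whatever GCS-aware writer your framework provides; Python's built-in open() does not understand gs:// paths.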
How do I reattach to a job?
Use kinetic.attach(job_id). It reconstructs a JobHandle from the
metadata Kinetic persisted to GCS at submission time, so you can call
.status(), .result(), .tail(), or .cleanup() from any machine that
has Kinetic and your GCP credentials. The job_id is available on the JobHandle
returned by run_async(). If you have lost it, kinetic.list_jobs() enumerates
jobs on the cluster. See Managing Async Jobs.
What gets cleaned up automatically?
When a job succeeds, Kinetic removes its Kubernetes Job and pod by default,
so they don’t pile up in the cluster. Failed jobs are kept around so you
can read logs and debug. GCS artifacts (uploaded code, requirements,
metadata) are not auto-deleted; call JobHandle.cleanup(gcs=True) if you
want them gone. Outputs you wrote under KINETIC_OUTPUT_DIR are also kept
unless you explicitly delete them.
How do spot instances affect training?
Spot capacity costs significantly less than on-demand, but pods can be
preempted with very little warning. Single-host jobs with frequent
checkpoints recover well. Multi-host TPU slices do not, because losing
any one host fails the whole slice. Use --spot for fault-tolerant
single-host workloads, and write checkpoints often enough to absorb a
restart. See Cost Optimization.
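Absorbing a restart comes down to finding the newest checkpoint and resuming from it. The sketch below assumes a naming convention of step_<N>, written after N steps have completed; both the convention and the resume_step() helper are hypothetical, and listing the checkpoint directory is left to your storage client.

```python
import re

# Matches names such as "step_000200" at the end of a path.
STEP_RE = re.compile(r"step_(\d+)$")


def resume_step(checkpoint_names) -> int:
    """Return the step to resume from, or 0 if no checkpoint exists yet."""
    steps = []
    for name in checkpoint_names:
        match = STEP_RE.search(name.rstrip("/"))
        if match:
            steps.append(int(match.group(1)))
    return max(steps, default=0)


names = ["step_000100", "step_000200/", "events.log"]
resume_at = resume_step(names)

print(resume_at)        # 200
print(257 - resume_at)  # 57 steps to redo if the pod was preempted at step 257
print(resume_step([]))  # 0: nothing saved yet, start from scratch
```

The work lost to a preemption is bounded by the checkpoint interval, so the interval is the knob that trades checkpoint overhead against recomputation.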
When do I need multiple clusters?
Most users don’t. Spin up a second cluster when you want to isolate GPU and TPU workloads, run jobs in different regions, or separate dev from prod environments. Each cluster has its own GKE control plane management fee, so don’t add them speculatively. See Multiple Clusters.
What does Pathways mean in practice?
Pathways is a JAX
runtime that coordinates execution across many TPU hosts. Concretely,
when you set backend="pathways" on a multi-host accelerator (e.g.,
tpu-v5litepod-2x4), Kinetic launches your job against a
Pathways-enabled cluster and JAX’s collective communication (jax.pmap,
sharding, etc.) Just Works across hosts. Without Pathways, you would have
to manage multi-host JAX coordination yourself. See Distributed Training.
Glossary
Accelerator: A TPU or GPU type identifier (e.g., tpu-v6e-8, l4,
a100) passed to accelerator= on the decorator. Picks both the hardware
and the topology.
Topology: How many chips make up the slice and how they are arranged. For TPUs,
encoded in the accelerator name (tpu-v6e-8 is 8 chips; tpu-v5litepod-2x4
is a 2×4 slice across hosts).
Pathways: JAX runtime for multi-host TPU coordination. Selected via
backend="pathways" and required for cross-host collectives without
hand-rolled setup.
Node pool: A GKE-managed group of VMs of one accelerator type.
Created with kinetic pool add. Scales between --min-nodes and the max
you need for the job.
Cluster: A GKE cluster with its own control plane and Artifact
Registry. Default name kinetic-cluster. Managed with kinetic up,
kinetic down, and kinetic status.
Bundled image: A container image Kinetic builds for you via Cloud
Build, with your dependencies baked in. The default execution mode. Tagged
by a hash of your requirements.txt.
Prebuilt image: A published base image that already has the
accelerator runtime installed. Your project deps are installed at pod
startup. Selected with container_image="prebuilt". Requires you to
publish base images with kinetic build-image first.
FUSE: Filesystem-in-userspace mount. With kinetic.Data(..., fuse=True),
a GCS bucket is mounted lazily into the pod’s filesystem so reads stream
on demand instead of downloading up front.
Handle: A JobHandle returned by run_async() (or
kinetic.attach()). Wraps status(), result(), tail(), and
cleanup() for one job.
Output dir: The GCS prefix at KINETIC_OUTPUT_DIR inside the job
pod. The canonical place to write checkpoints and any files you want to
keep after the pod exits.
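As an illustration of the Topology entry, the chip count can be read off the trailing suffix of a TPU accelerator name. The chip_count() helper below is hypothetical and only handles the two shapes shown in this glossary, a plain count (tpu-v6e-8) and an AxB grid (tpu-v5litepod-2x4); it is not part of Kinetic.

```python
import math


def chip_count(accelerator: str) -> int:
    """Chip count from the trailing topology suffix of a TPU name.

    Multiplies the grid dimensions, so '8' gives 8 and '2x4' gives 8.
    Raises ValueError for names without a numeric suffix, such as 'l4'.
    """
    suffix = accelerator.rsplit("-", 1)[-1]
    dims = suffix.split("x")
    if not all(d.isdigit() for d in dims):
        raise ValueError(f"no topology suffix in {accelerator!r}")
    return math.prod(int(d) for d in dims)


print(chip_count("tpu-v6e-8"))          # 8
print(chip_count("tpu-v5litepod-2x4"))  # 8
```

GPU identifiers such as l4 or a100 carry no such suffix, so the helper rejects them rather than guessing.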