Detached Jobs#
Most Kinetic users start with @kinetic.run(), which blocks the local
process until the remote function returns. That’s the right choice when the
job is short, when you want the result inline in your script, or when
you’re iterating on code interactively.
When the job is long, when you want to walk away from your laptop,
or when you want to fan out and check on multiple jobs in parallel,
switch to calling func.run_async(). It returns a JobHandle immediately
and leaves the actual work running on the cluster. You can then poll status,
tail logs, collect results, or reattach to the job from a different machine
— all backed by metadata Kinetic persisted to GCS at submission time.
This page covers the full submit → observe → collect → cleanup loop, both
from Python and from the kinetic jobs CLI.
A first detached job#
import kinetic
@kinetic.run(accelerator="tpu-v5e-1")
def train_model():
# Long-running training code
return {"final_loss": 0.123}
job = train_model.run_async()
print(f"Submitted: {job.job_id}")
# ... do something else, possibly close the script entirely ...
final = job.result(timeout=3600) # blocks until done
print(final)
The @kinetic.run() decorator accepts the same arguments regardless of
whether you execute the function synchronously or asynchronously (accelerator,
project, zone, cluster, container_image, env vars, data volumes, etc.). The
only difference is how you invoke the function: calling it directly (e.g.,
train_model()) runs it synchronously and blocks, while calling
train_model.run_async() runs it asynchronously and returns a JobHandle
immediately.
Python and CLI side by side#
Every operation is available both as a JobHandle method and as a
kinetic jobs subcommand. Pick whichever fits your workflow.
Operation |
Python |
CLI |
|---|---|---|
Submit |
|
(use the decorator from a script) |
Reattach |
|
(pass |
List |
|
|
Check status |
|
|
Tail logs |
|
|
Follow logs |
|
|
Wait for result |
|
|
Cancel |
|
|
Clean up |
|
|
Job lifecycle#
A submitted job moves through five states (defined as JobStatus in
kinetic.job_status):
┌──────────┐
run_async() ───▶ │ PENDING │ ── pod is waiting on a node
└────┬─────┘
│ pod scheduled
▼
┌──────────┐
│ RUNNING │ ── your function is executing
└────┬─────┘
┌────────┴────────┐
▼ ▼
┌───────────┐ ┌──────────┐
│ SUCCEEDED │ │ FAILED │
└───────────┘ └──────────┘
NOT_FOUND ── the k8s resource no longer exists (cleaned up,
or never registered)
What each state means and what to do:
PENDING — Kubernetes has accepted the job but no pod is running yet. The cluster autoscaler may be provisioning a node; on a fresh accelerator pool this can take 2–5 minutes. What to do: wait. If it’s stuck for much longer, run
kinetic initand choosetroubleshoot, and check your accelerator quota.RUNNING — your function is executing inside the pod. Use
job.tail()orkinetic jobs logs --followto watch progress. What to do: nothing, unless you want to monitor.SUCCEEDED — your function returned normally and Kinetic uploaded the result. What to do: call
job.result()to get the return value. By default this also cleans up the k8s resource and GCS artifacts.FAILED — the pod exited non-zero. The k8s resource is not auto-deleted so you can read logs. What to do:
job.tail()orkinetic jobs logs <id>to see the error, thenjob.cleanup()when you’re done debugging.NOT_FOUND — the Kubernetes Job has already been deleted (typically by a successful
result()call, or by an explicitcleanup). If the result was uploaded to GCS,result()can still return it; otherwise this state means the job is truly gone. What to do: if you need the return value, callresult()once — it will read from GCS even after the pod is gone. Ifresult()raises, the job is unrecoverable.
The full submit-to-cleanup flow:
run_async()packages your code, builds (or reuses) a container image, uploads artifacts to GCS, creates a k8s Job, and returns aJobHandle. Status isPENDING.The cluster autoscaler provisions a node if needed; the pod is scheduled. Status moves to
RUNNING.Your function runs. The pod uploads its return value (or an exception payload) to GCS when it exits.
Status moves to
SUCCEEDEDorFAILED.Calling
job.result()downloads the payload, returns it (or raises the user exception), and — by default — deletes both the k8s resource and the GCS artifacts. Status is nowNOT_FOUNDand the handle is spent.
Reattaching from another machine#
The JobHandle is a small JSON-serializable dataclass that Kinetic
persists to GCS at submit time. Anywhere you have Kinetic installed and
GCP credentials for the same project, you can reconstruct it from the
job ID:
import kinetic
job = kinetic.attach("v5e1-train-model-20260417-153012-abc1234")
print(f"Status: {job.status().value}")
print(job.tail(n=20))
If you don’t remember the ID, list everything currently on the cluster:
for j in kinetic.list_jobs():
print(f"{j.job_id} {j.func_name} {j.status().value}")
The CLI equivalent is kinetic jobs list.
Timeouts and cleanup#
result() blocks indefinitely by default. Pass timeout= (in seconds) to
bound the wait:
try:
final = job.result(timeout=3600)
except TimeoutError:
# Job is still running — handle is still valid; you can call .result()
# again, .tail(), .cancel(), or just walk away.
print(job.tail(n=50))
By default result() cleans up after success: the k8s Job/pod and the
GCS artifacts are deleted. Two ways to opt out:
final = job.result(cleanup=False) # keep everything
job.cleanup(k8s=True, gcs=False) # later: delete pod, keep artifacts
Failed jobs are not auto-cleaned, so logs survive until you delete them.
Anything you wrote under KINETIC_OUTPUT_DIR is also kept regardless of
cleanup — see Checkpointing.
Recommendations for long-running jobs#
The following practices reduce the cost of failures on jobs that run for hours.
Checkpoint regularly. Anything written to
KINETIC_OUTPUT_DIRsurvives a failed pod, but only the checkpoints already written can be used on resume. Pick a cadence that bounds how much progress a restart would lose. See Checkpointing for resume patterns.Persist the
job_id. Record it via stdout, a log file, or your workflow’s tracking system. With the ID, you can reattach from any machine that has Kinetic installed and access to the same GCP project.Do not rely on the local Python process. Once
run_async()returns, the local script is no longer involved in the job’s execution. Interrupting it (for example, withCtrl-C) does not affect the remote job.Avoid
--followfor jobs that run for hours. Continuous log streaming is sensitive to transient network failures. Usekinetic jobs logs <id> --tail 200from a fresh shell to check in periodically instead.Retain artifacts on multi-host or expensive jobs. Pass
cleanup=Falseto the first successfulresult()call so the Kubernetes resources and GCS artifacts remain available for inspection. Callcleanupexplicitly once they are no longer needed.