Releases: kubeflow/trainer

v2.1.0

07 Nov 10:13

This is the Kubeflow Trainer v2.1.0 release.

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0"

$ kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0"
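The two `kubectl apply` commands above differ only in the overlay name and the `ref` tag. As a rough sketch, the Python helper below rebuilds them for any release tag. The function names `kustomize_url` and `install_commands` are hypothetical and exist only for this illustration; the URL pattern is copied from the commands in these notes. The helper only prints the commands and does not run them.

```python
import shlex

# Repository URL as it appears in the install commands of these release notes.
REPO = "https://github.com/kubeflow/trainer.git"


def kustomize_url(overlay: str, tag: str) -> str:
    """Build the kustomize URL for one overlay at a given release tag."""
    return f"{REPO}/manifests/overlays/{overlay}?ref={tag}"


def install_commands(tag: str) -> list[list[str]]:
    """Return the kubectl argument lists for the manager and runtimes overlays.

    The manager overlay comes first because the control plane has to be
    installed before the runtimes.
    """
    return [
        ["kubectl", "apply", "--server-side", "-k", kustomize_url(overlay, tag)]
        for overlay in ("manager", "runtimes")
    ]


if __name__ == "__main__":
    # Print shell-safe command lines for the v2.1.0 release.
    for cmd in install_commands("v2.1.0"):
        print(shlex.join(cmd))
```

Passing a pre-release tag such as `v2.1.0-rc.1` yields the commands listed for that release candidate.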

You can now install the controller manager with Helm charts 🚀

helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0

Install the Kubeflow Python SDK:

pip install -U kubeflow

For more information, please see the Kubeflow Trainer docs.

Breaking Changes

  • feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
  • feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
  • chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
  • chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
  • Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)

New Features

Distributed AI Data Cache

Stream data directly to your GPU nodes with zero-copy transfers from an in-memory cache cluster powered by Apache Arrow and Apache DataFusion. This lets users load massive tabular datasets efficiently, maximize GPU utilization, and minimize I/O for large-scale pre-training and post-training distributed AI workloads.

Explore more about data cache in:

LLM Post-Training

Kueue Enhancements

Check out the official Kueue docs.

Volcano Scheduler

  • feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
  • feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

  • feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
  • feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
  • feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
  • feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
  • feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)
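To illustrate the RFC 1035 name validation mentioned in the last item, here is a minimal Python sketch of the DNS-1035 label rule as Kubernetes conventionally applies it: lowercase alphanumerics or `-`, starting with a letter, ending with an alphanumeric, at most 63 characters. Whether the Trainer webhook enforces exactly this pattern and length limit is an assumption; the helper name `is_rfc1035_label` is hypothetical.

```python
import re

# Kubernetes-style DNS-1035 label pattern (assumed to match the rule
# that the TrainJob name validation enforces).
_DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63


def is_rfc1035_label(name: str) -> bool:
    """Return True if name is a valid DNS-1035 label.

    The name must start with a lowercase letter, contain only lowercase
    alphanumerics or '-', end with an alphanumeric, and be at most
    63 characters long.
    """
    if len(name) > _MAX_LABEL_LENGTH:
        return False
    return _DNS1035_LABEL.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("pytorch-job", "1st-job", "Job", "job-"):
        print(candidate, is_rfc1035_label(candidate))
```

Under this rule, a name that starts with a digit is rejected, which is stricter than the RFC 1123 labels that many other Kubernetes resources accept.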

Bug Fixes

Misc

  • [release-2.1] feat: Adding local execution example notebook (#2924 by @Fiona-Waters)
  • feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
  • [release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
  • [release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)
  • feat(operator): Add validation for required containers in replicatedJobs (#2722 by @ELE...

v2.1.0-rc.1

03 Nov 22:25

v2.1.0-rc.1 Pre-release

This is the Kubeflow Trainer v2.1.0-rc.1 pre-release:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0-rc.1"
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0-rc.1"

New Features

  • feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
  • [release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
  • [release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)

Bug Fixes

  • [release-2.1] fix(manifests): Fix boolean values defaulting in Helm charts (#2914 by @astefanutti)
  • [release-2.1] fix(runtimes): Update pip version in the MLX runtime (#2910 by @andreyvelich)

v2.1.0-rc.0

21 Oct 14:56

v2.1.0-rc.0 Pre-release

This is the Kubeflow Trainer v2.1.0-rc.0 pre-release:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0-rc.0"
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0-rc.0"

Breaking Changes

  • feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
  • feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
  • chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
  • chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
  • Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)

New Features

Distributed AI Data Cache

LLM Post-Training

Kueue Enhancements

Volcano Scheduler

  • feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
  • feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

  • feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
  • feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
  • feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
  • feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
  • feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)

Bug Fixes

Misc

v2.0.1

29 Sep 14:24

This is the Kubeflow Trainer v2.0.1 release.

New Features

  • [release-2.0] feat: Add a public function to create runtime info objects (#2846 by @kaisoz)

Bug Fixes

v2.0.0

21 Jul 15:59

This is the major release of the Kubeflow Trainer 2.0 project.

For more information, please see the

Quickstart

Install the Kubeflow Trainer control plane:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

$ kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"

Install the Kubeflow Python SDK:

pip install git+https://github.com/kubeflow/sdk.git@main#subdirectory=python

Run your first TrainJob by following the getting started guide.

Breaking Changes

New Features

LLM Trainer V2

Runtime Framework

  • feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
  • feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
  • feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
  • Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
  • Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
  • KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

JobSet

New Examples

SDK Updates

Bug Fixes

v1.9.3

17 Jul 14:48
c77ee3f

This is the Training Operator v1.9.3 release.

New Features

Misc

v2.0.0-rc.1

05 Jul 23:52

v2.0.0-rc.1 Pre-release

This is the Kubeflow Trainer v2.0.0-rc.1 pre-release.

New Features

  • [release-2.0] feat: Add schedulingGates to PodSpecOverrides (#2705 by @astefanutti)
  • [release-2.0] feat: Mutable PodSpecOverrides for suspended TrainJob (#2698 by @astefanutti)
  • [Release 2.0] KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2692 by @Doris-xm)

Bug Fixes

  • [release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
  • [cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
  • [release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
  • [release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)

Misc

  • [release-2.0] chore: Copy generated CRDs into Helm charts (#2704 by @astefanutti)
  • [cherry-pick] feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670) (#2702 by @Electronic-Waste)
  • [release-2.0] chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2697 by @tenzen-y)
  • [release-2.0] chore: Remove the vendor specific parameters (#2694 by @tenzen-y)
  • [release-2.0] chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2687 by @andreyvelich)
  • [release-2.0] chore(helm): Sync ClusterRule in Helm chart (#2688 by @astefanutti)

v2.0.0-rc.0

12 Jun 12:00

v2.0.0-rc.0 Pre-release

This is the Kubeflow Trainer v2.0.0-rc.0 pre-release.

Breaking Changes

New Features

LLM Trainer V2

Runtime Framework

  • feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
  • feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
  • feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
  • Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
  • Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
  • KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
  • Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

JobSet

New Examples

SDK Updates

Bug Fixes

Misc

v1.9.2

03 May 02:43
bde9c20

This is the Training Operator v1.9.2 release.

New Features

Bug Fixes

v1.9.1 release

31 Mar 23:09
17077e3

This is the Training Operator v1.9.1 release.

Breaking Changes

New Features

  • Add volume and volume mounts arguments to TrainingClient.create_job API (#2449 by @astefanutti)
  • Add configurable QPS and burst settings for kube API client (#2411 by @ronk21runai)
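For context on the QPS and burst settings in the last item: Kubernetes API clients commonly throttle requests with a token bucket, where QPS is the steady refill rate and burst is the bucket capacity. The Python sketch below illustrates that idea only; it is not the operator's code, and the class name `TokenBucket` and its `clock` parameter are hypothetical. The operator's actual flag names are not shown in these notes.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: refills at `qps` tokens per second and
    holds at most `burst` tokens."""

    def __init__(self, qps: float, burst: int, clock=time.monotonic):
        self.qps = qps
        self.burst = burst
        self._clock = clock            # injectable so tests can control time
        self._tokens = float(burst)    # start full so an initial burst is allowed
        self._last = clock()

    def try_accept(self) -> bool:
        """Consume one token if available; return False when throttled."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self._last
        self._tokens = min(float(self.burst), self._tokens + elapsed * self.qps)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    limiter = TokenBucket(qps=5, burst=10)
    accepted = sum(limiter.try_accept() for _ in range(20))
    print(f"accepted {accepted} of 20 immediate requests")
```

A higher burst lets a controller absorb short spikes of API calls, while QPS bounds the sustained request rate.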

Bug Fixes