Releases · kubeflow/trainer

07 Nov 10:13

v2.1.0

73c9bec

v2.1.0 Latest

This is the Kubeflow Trainer v2.1.0 release.

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0"

$ kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0"
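
The two `kubectl apply` commands above differ only in the overlay name and the `ref` tag. As a minimal sketch, the helper below builds those commands for a given release tag, which may be handy when pinning the same install to another version. The function names `kustomize_url` and `install_commands` are hypothetical and not part of any Kubeflow tooling; only the URL pattern is taken from these release notes.

```python
# Hypothetical helper: builds the Kustomize remote URLs used in the install
# commands above. Only the URL pattern comes from the release notes.
BASE = "https://github.com/kubeflow/trainer.git/manifests/overlays"


def kustomize_url(overlay: str, tag: str) -> str:
    """Return the remote Kustomize target for an overlay at a release tag."""
    if overlay not in ("manager", "runtimes"):
        raise ValueError(f"unknown overlay: {overlay!r}")
    return f"{BASE}/{overlay}?ref={tag}"


def install_commands(tag: str) -> list[str]:
    """Return the kubectl commands in the order they appear above."""
    return [
        f'kubectl apply --server-side -k "{kustomize_url(overlay, tag)}"'
        for overlay in ("manager", "runtimes")
    ]


if __name__ == "__main__":
    for command in install_commands("v2.1.0"):
        print(command)
```

The manager overlay is listed first because the runtimes are custom resources that need the control plane and its CRDs in place.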

You can now install the controller manager with Helm charts 🚀

helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0

Install the Kubeflow Python SDK:

pip install -U kubeflow

For more information, please see the Kubeflow Trainer docs.

Breaking Changes

feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)

New Features

Distributed AI Data Cache

feat(cache): KEP-2655: Adding default runtime with cache and example (#2928 by @akshaychitneni)
feat(cache): KEP-2655 - Supporting readiness probes on cache nodes (#2920 by @akshaychitneni)
feat(cache): KEP-2655 - Add build pipeline and address vulnerabilities for data_cache (#2890 by @akshaychitneni)
feat(cache): KEP-2655: Adding cache initializer (#2793 by @akshaychitneni)
feat: KEP-2655: Add data cache system (#2755 by @akshaychitneni)

Stream data directly to your GPU nodes with zero-copy transfers from an in-memory cache cluster powered by Apache Arrow and Apache DataFusion. This allows users to load massive tabular datasets efficiently, maximize GPU utilization, and minimize I/O for large-scale pre- or post-training distributed AI workloads.

LLM Post-Training

feat(runtimes): Add LoRA/QLoRA/DoRA support in LLM Trainer V2 (#2832 by @Electronic-Waste)
feat: Add Qwen 2.5 1.5b runtime, example and fix gpu e2e test (#2835 by @jaiakash)
feat(runtimes): Support Distributed MLX on CUDA (#2790 by @andreyvelich)

Kueue Enhancements

Support Topology Aware Scheduling for TrainJobs (kubernetes-sigs/kueue#7249 by @kaisoz)
fix: Allow multiple podSpec overrides to target the same TargetJob (#2880 by @kaisoz)
feat: support affinity in TrainJob pod spec overrides (#2796 by @toVersus)
feat: Add schedulingGates to PodSpecOverrides (#2700 by @astefanutti)

Check out the official Kueue docs.

Volcano Scheduler

feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)
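
For the RFC 1035 name validation listed above (#2767), the sketch below shows the DNS label rule as Kubernetes conventionally applies it: at most 63 characters, lowercase alphanumerics or `-`, starting with a letter and ending with an alphanumeric. The helper name `is_rfc1035_label` is hypothetical, and this is not the operator's actual validation code, which may enforce additional limits.

```python
import re

# RFC 1035 label as conventionally enforced for Kubernetes object names:
# starts with a lowercase letter, ends with a letter or digit, and contains
# only lowercase letters, digits, and '-' in between.
_RFC1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63


def is_rfc1035_label(name: str) -> bool:
    """Return True if name is a valid RFC 1035 DNS label (hypothetical helper)."""
    return (
        len(name) <= _MAX_LABEL_LENGTH
        and _RFC1035_LABEL.fullmatch(name) is not None
    )


if __name__ == "__main__":
    # Candidate TrainJob names, checked before submitting a job.
    for candidate in ("pytorch-mnist", "1st-job", "Train_Job", "job-"):
        print(candidate, is_rfc1035_label(candidate))
```

Checking names client-side in this way can surface a rejection before the TrainJob reaches the API server.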

Bug Fixes

[release-2.1] fix(ci): Fix the Kubeflow SDK installation with Docker (#2927 by @andreyvelich)
fix(manifests): Add RBAC rules for Leases in Helm Charts (#2901 by @astefanutti)
fix(docs): correct example usage in KEP-2437-Support-Volcano-Scheduler (#2898 by @Doris-xm)
fix(api): Keep mpiImplementation field a pointer (#2897 by @astefanutti)
fix(api): Fix lint errors for the config API (#2896 by @astefanutti)
fix: charts dependencies (#2892 by @ls-2018)
fix(runtimes): fix missing dependency in torchtune trainer image. (#2887 by @Electronic-Waste)
fix(ci): Add latest image tag only for the master branch (#2854 by @andreyvelich)
fix: read only permission for PRs (#2829 by @jaiakash)
fix: read only permission for PRs (#2827 by @jaiakash)
fix: update examples to reflect func_args now being unpacked (#2815 by @briangallagher)
fix(examples): Update get_job_logs() API in examples (#2813 by @andreyvelich)
fix: teraform for oci gpu based vm (#2810 by @jaiakash)
fix(api): Regenerate TrainJob CRD (#2805 by @astefanutti)
fix(ci): disable Unit and Integration Test - Go gh action in forked repos (#2746 by @milinddethe15)
fix(manifests): Add missing permissions for the RuntimeClass and LimitRange (#2787 by @tenzen-y)
fix: update kubeflow sdk reference (#2780 by @kramaranya)
fix(api): update license path for kubeflow_trainer_api (#2778 by @kramaranya)
fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2774 by @andreyvelich)
fix(docs): update KEP-2401 according to current implementation. (#2765 by @Electronic-Waste)
fix(ci): Remove coverage from Go integration tests (#2773 by @andreyvelich)
fix(api): Fix license path for Kubeflow Trainer Python API (#2771 by @andreyvelich)
fix(examples): Update the argument for Runtime framework (#2766 by @andreyvelich)
fix(test): Fix Ginkgo command for integration tests (#2758 by @astefanutti)
fix: fix the command for fetching Kubeflow Trainer version in the issue template (#2732 by @rudeigerc)
fix(manifests): add rbac config of events for event recorders (#2731 by @rudeigerc)
fix(manifests): fix position of labels of dataset-initializer from pod to job (#2719 by @rudeigerc)
fix(module): Change Go module name to v2 (#2707 by @andreyvelich)
fix(plugins): Fix some errors in torchtune mutation process. (#2675 by @Electronic-Waste)
fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files (#2669 by @Electronic-Waste)
fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2682 by @astefanutti)

Misc

[release-2.1] feat: Adding local execution example notebook (#2924 by @Fiona-Waters)
feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
[release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
[release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)
feat(operator): Add validation for required containers in replicatedJobs (#2722 by @ELE...

Contributors

jskswamy, astefanutti, and 27 other contributors

03 Nov 22:25

andreyvelich

v2.1.0-rc.1

3a71dd0

v2.1.0-rc.1 Pre-release

This is the Kubeflow Trainer v2.1.0-rc.1 pre-release:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0-rc.1"
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0-rc.1"

New Features

feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
[release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
[release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)

Bug Fixes

[release-2.1] fix(manifests): Fix boolean values defaulting in Helm charts (#2914 by @astefanutti)
[release-2.1] fix(runtimes): Update pip version in the MLX runtime (#2910 by @andreyvelich)

Contributors

astefanutti, rudeigerc, and 2 other contributors

21 Oct 14:56

andreyvelich

v2.1.0-rc.0

732d0fb

v2.1.0-rc.0 Pre-release

This is the Kubeflow Trainer v2.1.0-rc.0 pre-release:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0-rc.0"
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0-rc.0"

Breaking Changes

feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)

New Features

Distributed AI Data Cache

feat(cache): KEP-2655 - Add build pipeline and address vulnerabilities for data_cache (#2890 by @akshaychitneni)
feat(cache): KEP-2655: Adding cache initializer (#2793 by @akshaychitneni)
feat: KEP-2655: Add data cache system (#2755 by @akshaychitneni)

LLM Post-Training

feat(runtimes): Add LoRA/QLoRA/DoRA support in LLM Trainer V2 (#2832 by @Electronic-Waste)
feat: Add Qwen 2.5 1.5b runtime, example and fix gpu e2e test (#2835 by @jaiakash)
feat(runtimes): Support Distributed MLX on CUDA (#2790 by @andreyvelich)

Kueue Enhancements

Support Topology Aware Scheduling for TrainJobs (kubernetes-sigs/kueue#7249 by @kaisoz)
fix: Allow multiple podSpec overrides to target the same TargetJob (#2880 by @kaisoz)
feat: support affinity in TrainJob pod spec overrides (#2796 by @toVersus)
feat: Add schedulingGates to PodSpecOverrides (#2700 by @astefanutti)

Volcano Scheduler

feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)

Bug Fixes

fix(manifests): Add RBAC rules for Leases in Helm Charts (#2901 by @astefanutti)
fix(docs): correct example usage in KEP-2437-Support-Volcano-Scheduler (#2898 by @Doris-xm)
fix(api): Keep mpiImplementation field a pointer (#2897 by @astefanutti)
fix(api): Fix lint errors for the config API (#2896 by @astefanutti)
fix: charts dependencies (#2892 by @ls-2018)
fix(runtimes): fix missing dependency in torchtune trainer image. (#2887 by @Electronic-Waste)
fix(ci): Add latest image tag only for the master branch (#2854 by @andreyvelich)
fix: read only permission for PRs (#2829 by @jaiakash)
fix: read only permission for PRs (#2827 by @jaiakash)
fix: update examples to reflect func_args now being unpacked (#2815 by @briangallagher)
fix(examples): Update get_job_logs() API in examples (#2813 by @andreyvelich)
fix: teraform for oci gpu based vm (#2810 by @jaiakash)
fix(api): Regenerate TrainJob CRD (#2805 by @astefanutti)
fix(ci): disable Unit and Integration Test - Go gh action in forked repos (#2746 by @milinddethe15)
fix(manifests): Add missing permissions for the RuntimeClass and LimitRange (#2787 by @tenzen-y)
fix: update kubeflow sdk reference (#2780 by @kramaranya)
fix(api): update license path for kubeflow_trainer_api (#2778 by @kramaranya)
fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2774 by @andreyvelich)
fix(docs): update KEP-2401 according to current implementation. (#2765 by @Electronic-Waste)
fix(ci): Remove coverage from Go integration tests (#2773 by @andreyvelich)
fix(api): Fix license path for Kubeflow Trainer Python API (#2771 by @andreyvelich)
fix(examples): Update the argument for Runtime framework (#2766 by @andreyvelich)
fix(test): Fix Ginkgo command for integration tests (#2758 by @astefanutti)
fix: fix the command for fetching Kubeflow Trainer version in the issue template (#2732 by @rudeigerc)
fix(manifests): add rbac config of events for event recorders (#2731 by @rudeigerc)
fix(manifests): fix position of labels of dataset-initializer from pod to job (#2719 by @rudeigerc)
fix(module): Change Go module name to v2 (#2707 by @andreyvelich)
fix(plugins): Fix some errors in torchtune mutation process. (#2675 by @Electronic-Waste)
fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files (#2669 by @Electronic-Waste)
fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2682 by @astefanutti)

Misc

feat(operator): Add validation for required containers in replicatedJobs (#2722 by @Electronic-Waste)
feat: add controller manager configuration helm chart (#2895 by @kapil27)
chore(ci): Enable Kubernetes API Linter (#2858 by @astefanutti)
feat(runtimes): implement clusterTrainingRuntime deprecation process (#2791 by @tdn21)
feat: add HF token and allow gpu workflow to run from pull request target (#2818 by @jaiakash)
feat(docs): KEP-2442-Support JAX Training Runtime (#2643 by @mahdikhashan)
chore(test): Support e2e cluster setup with Podman (#2861 by @astefanutti)
chore(runtimes): Upgrade torchtune version to v0.6.1 (#2876 by @Electronic-Waste)
chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
feat(docs): Update Trainer diagram and SDK release (#2867 by @andreyvelich)
feat(docs): Add changelog for Kubeflow Trainer v2.0.1 (#2864 by @andreyvelich)
fix(docs): Update the release document to push all changes (#2865 by @andreyvelich)
chore: Install released version of Kubeflow SDK (#2857 by @kramaranya)
chore(ci): Ignore generated files in .gitattributes (#2855 by @andreyvelich)
feat: Add a public function to create runtime info objects (#2837 by @kaisoz)
chore(test): add uts for coscheduling plugin. (#2582 by @IRONICBo)
feat(ci): Add Trivy Vulnerability Scan (#2826 by @andreyvelich)
chore: merge test cases using PodSpecOverrides into a single case (#2822 by @toVersus)
chore(runtimes): update torchtune CTRs with multiple dependson feature in jobset v0.9.0 (#2823 by @Electronic-Waste)
chore(operator): Bump JobSet ...

Contributors

jskswamy, astefanutti, and 25 other contributors

29 Sep 14:24

andreyvelich

v2.0.1

332ad39

v2.0.1

This is the Kubeflow Trainer v2.0.1 release.

New Features

[release-2.0] feat: Add a public function to create runtime info objects (#2846 by @kaisoz)

Bug Fixes

[release-2.0] fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2863 by @andreyvelich)
[release-2.0] fix(ci): Add latest image tag only for the master branch (#2862 by @andreyvelich)
[release-2.0] fix: update examples to reflect func_args now being unpacked (#2815) (#2853 by @astefanutti)
[release-2.0] fix(examples): Update get_job_logs() API in examples (#2813) (#2852 by @astefanutti)
[release-2.0] feat(runtimes): Add Framework Label to the Runtimes (#2761) (#2851 by @astefanutti)
[release-2.0] fix(examples): Update the argument for Runtime framework (#2766) (#2850 by @astefanutti)
[release-2.0] fix: update kubeflow sdk reference (#2780) (#2847 by @astefanutti)
[release-2.0] fix(api): Fix license path for Kubeflow Trainer Python API (#2772 by @andreyvelich)

Contributors

astefanutti, kaisoz, and andreyvelich

21 Jul 15:59

andreyvelich

v2.0.0

117ad23

v2.0.0

This is the major release of the Kubeflow Trainer 2.0 project.

For more information, please see the

Quickstart

Install the Kubeflow Trainer control plane:

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

$ kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"

Install Kubeflow Python SDK:

pip install git+https://github.com/kubeflow/sdk.git@main#subdirectory=python

Run your first TrainJob by following the getting started guide.

Breaking Changes

Migrate SDK to the kubeflow/sdk repository (#2657 by @eoinfennessy)
KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)

New Features

LLM Trainer V2

KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)

Runtime Framework

feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

[feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
Implemenet MPI Plugin for OpenMPI (#2493 by @tenzen-y)
Implement MPI plugin UTs (#2481 by @tenzen-y)
Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)

JobSet

Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)

New Examples

Add question-answer example for v2 trainer (#2580 by @solanyn)
KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)

SDK Updates

feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)

Bug Fixes

[release-2.0] fix(manifests): add rbac config of events for event recorders (#2733 by @rudeigerc)
[release-2.0] fix(manifests): fix position of labels of dataset-initializer from pod to job (#2720 by @rudeigerc)
[release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
[cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
[release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
[release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)
Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
Fix MPI Test runnable errors (#2570 by @tenzen-y)
Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
[hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
[hotfix] fix docker cred (#2530 by @mahdikhashan)
fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
fix type in model initializer entrypoint ([#2489](https://github.com/kubeflow/trainer/pull...

Contributors

astefanutti, szaher, and 29 other contributors

17 Jul 14:48

andreyvelich

v1.9.3

c77ee3f

v1.9.3

This is the Training Operator v1.9.3 release.

New Features

[SDK] Add provision to provide local-queue for the training job (#2636 by @abhijeet-dhumal)

Misc

chore: Remove V2 code from Training Operator 1.9 release branch (#2737 by @andreyvelich)
chore(ci): Add more workaround no space left on device (#2677 by @astefanutti)

Contributors

astefanutti, andreyvelich, and abhijeet-dhumal

05 Jul 23:52

andreyvelich

v2.0.0-rc.1

7122fc1

v2.0.0-rc.1 Pre-release

This is the Kubeflow Trainer v2.0.0-rc.1 pre-release.

New Features

[release-2.0] feat: Add schedulingGates to PodSpecOverrides (#2705 by @astefanutti)
[release-2.0] feat: Mutable PodSpecOverrides for suspended TrainJob (#2698 by @astefanutti)
[Release 2.0] KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2692 by @Doris-xm)

Bug Fixes

[release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
[cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
[release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
[release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)

Misc

[release-2.0] chore: Copy generated CRDs into Helm charts (#2704 by @astefanutti)
[cherry-pick] feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670) (#2702 by @Electronic-Waste)
[release-2.0] chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2697 by @tenzen-y)
[release-2.0] chore: Remove the vendor specific parameters (#2694 by @tenzen-y)
[release-2.0] chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2687 by @andreyvelich)
[release-2.0] chore(helm): Sync ClusterRule in Helm chart (#2688 by @astefanutti)

Contributors

astefanutti, andreyvelich, and 3 other contributors

12 Jun 12:00

andreyvelich

v2.0.0-rc.0

32a474c

v2.0.0-rc.0 Pre-release

This is the Kubeflow Trainer v2.0.0-rc.0 pre-release.

Breaking Changes

KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)

New Features

LLM Trainer V2

KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)

Runtime Framework

feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)

MPI Plugin

[feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
Implemenet MPI Plugin for OpenMPI (#2493 by @tenzen-y)
Implement MPI plugin UTs (#2481 by @tenzen-y)
Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)

JobSet

Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)

New Examples

Add question-answer example for v2 trainer (#2580 by @solanyn)
KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)

SDK Updates

Remove SDK (#2657 by @eoinfennessy)
feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)

Bug Fixes

Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
Fix MPI Test runnable errors (#2570 by @tenzen-y)
Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
[hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
[hotfix] fix docker cred (#2530 by @mahdikhashan)
fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
fix type in model initializer entrypoint (#2489 by @szaher)
fix(runtime): fix error label name. (#2487 by @Electronic-Waste)
fix(sdk): resolve errors in deserialization (#2457 by @Electronic-Waste)
Fix missing external types in apply configurations (#2429 by @astefanutti)
Fix API Group for Torch Runtime (#2424 by @andreyvelich)
Fix Kustomize patchesStrategicMerge deprecation warning (#2405 by @astefanutti)
ControlPlane: Fix flaky integraion testings due to missing the latest version of object (#2414 by @tenzen-y)

Misc

Tag Docker images with GitHub release tags (#2662 by @kramaranya)
feat(controller): Implement PodSpecOverride API (#2614 by @andreyvelich)
Nominate @Electronic-Waste as approver and @astefanutti as reviewer (#2659 by @andreyvelich)
chore(build): Support Podman to run OpenAPI generator (#2656 by @astefanutti)
chore(docs): Add OpenSSF Best Practices Badge (#2611 by @andreyvelich)
[chore] update stale action version to latest (#2642 by @mahdikhashan)
Remove TrainJobCreated condition (#2621 by @astefanutti)
ci: refactor build-push-images workflow (#2607 by @milinddethe15)
Update Go to v1.24 (#2615) (#2620 by @vzamboulingame)
test(runtime): add UT for IndexTrainJobTrainingRuntime (#2603 by @Harshal292004)
ci: add k8s v1.32 for tests env ([#2613](#26...

Contributors

astefanutti, szaher, and 25 other contributors

03 May 02:43

andreyvelich

v1.9.2

bde9c20

v1.9.2

This is the Training Operator v1.9.2 release.

New Features

Add provision to provide labels and annotations for the pytorchjob an… (#2612 by @abhijeet-dhumal)

Bug Fixes

Fix llm hp optimization error (#2576 by @helenxie-bit)
[bug] pull image from ghcr (#2584 by @mahdikhashan)

Contributors

mahdikhashan, abhijeet-dhumal, and helenxie-bit

31 Mar 23:09

andreyvelich

v1.9.1

17077e3

v1.9.1 release

This is the Training Operator v1.9.1 release.

Breaking Changes

Update Manifest Images to GHCR (#2544 by @saileshd1402)
Push images to GHCR for release-1.9 (#2491 by @saileshd1402)

New Features

Add volume and volume mounts arguments to TrainingClient.create_job API (#2449 by @astefanutti)
Add configurable QPS and burst settings for kube API client (#2411 by @ronk21runai)

Bug Fixes

fix(ci): Change publish dir from training to trainer (#2546 by @Electronic-Waste)
fix: fix typos in script comments. (#2465 by @IRONICBo)
fix: adds jaxjobs to the kubeflow-training-roles.yaml ClusterRole (#2417 by @DnPlas)
[release-1.9] Rename paddlepaddle_defaults.go file name (#2400 by @ChristianZaccaria)

Contributors

astefanutti, DnPlas, and 5 other contributors
