Releases: kubeflow/trainer
v2.1.0
This is the Kubeflow Trainer v2.1.0 release.
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0"
$ kubectl get pods -n kubeflow-system
NAME READY STATUS RESTARTS AGE
jobset-controller-manager-54968bd57b-88dk4 2/2 Running 0 65s
kubeflow-trainer-controller-manager-cc6468559-dblnw 1/1 Running 0 65s
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0"
You can now install controller manager with Helm charts 🚀
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0
Install Kubeflow Python SDK:
pip install -U kubeflow
For more information, please see the Kubeflow Trainer docs.
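The two kustomize URLs above differ only in the overlay name and the `ref` tag, so they can be generated from one pinned version. The following is a minimal sketch, not part of any Kubeflow tooling: the variable `TRAINER_VERSION` and the function names `trainer_overlay_url` and `print_install_commands` are hypothetical helpers introduced here for illustration. It only prints the commands and does not run `kubectl`.

```shell
#!/bin/sh
# Sketch only: TRAINER_VERSION and both function names are illustrative
# assumptions, not names defined by Kubeflow Trainer.
TRAINER_VERSION="${TRAINER_VERSION:-v2.1.0}"
TRAINER_REPO="https://github.com/kubeflow/trainer.git"

# Print the kustomize URL for an overlay ("manager" or "runtimes") at a git ref.
trainer_overlay_url() {
  overlay="$1"
  ref="$2"
  printf '%s/manifests/overlays/%s?ref=%s\n' "$TRAINER_REPO" "$overlay" "$ref"
}

# Print (without running) the install commands in the order the notes give them:
# control plane first, then the runtimes.
print_install_commands() {
  ref="$1"
  printf 'kubectl apply --server-side -k "%s"\n' "$(trainer_overlay_url manager "$ref")"
  printf 'kubectl apply --server-side -k "%s"\n' "$(trainer_overlay_url runtimes "$ref")"
}

print_install_commands "$TRAINER_VERSION"
```

Printing rather than executing keeps the sketch safe to run anywhere; after reviewing the output, the two lines could be piped to `sh` on a machine with cluster access.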
Breaking Changes
- feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
- feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
- chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
- chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
- Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)
New Features
Distributed AI Data Cache
- feat(cache): KEP-2655: Adding default runtime with cache and example (#2928 by @akshaychitneni)
- feat(cache): KEP-2655 - Supporting readiness probes on cache nodes (#2920 by @akshaychitneni)
- feat(cache): KEP-2655 - Add build pipeline and address vulnerabilities for data_cache (#2890 by @akshaychitneni)
- feat(cache): KEP-2655: Adding cache initializer (#2793 by @akshaychitneni)
- feat: KEP-2655: Add data cache system (#2755 by @akshaychitneni)
Stream data directly to your GPU nodes with zero-copy transfers from an in-memory cache cluster powered by Apache Arrow and Apache DataFusion. This allows users to load massive tabular datasets efficiently, maximize GPU utilization, and minimize I/O for large-scale pre- or post-training distributed AI workloads.
Explore more about data cache in:
- Kubeflow Trainer docs
- KubeCon + CloudNativeCon London talk
- KubeCon + CloudNativeCon India talk
- Gen AI Summit
LLM Post-Training
- feat(runtimes): Add LoRA/QLoRA/DoRA support in LLM Trainer V2 (#2832 by @Electronic-Waste)
- feat: Add Qwen 2.5 1.5b runtime, example and fix gpu e2e test (#2835 by @jaiakash)
- feat(runtimes): Support Distributed MLX on CUDA (#2790 by @andreyvelich)
Kueue Enhancements
- Support Topology Aware Scheduling for TrainJobs (kubernetes-sigs/kueue#7249 by @kaisoz)
- fix: Allow multiple podSpec overrides to target the same TargetJob (#2880 by @kaisoz)
- feat: support affinity in TrainJob pod spec overrides (#2796 by @toVersus)
- feat: Add schedulingGates to PodSpecOverrides (#2700 by @astefanutti)
Check out the official Kueue docs.
Volcano Scheduler
- feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
- feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)
API Updates
- feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
- feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
- feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
- feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
- feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)
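Since TrainJob names are now validated against RFC 1035, it can help to check a name locally before submitting a job. The sketch below assumes the operator applies the same rule Kubernetes uses for RFC 1035 label names (at most 63 characters, lowercase alphanumerics or '-', starting with a letter, ending with an alphanumeric); the exact check performed by the Trainer webhook is not shown in these notes, and `is_rfc1035_label` is a hypothetical helper name.

```shell
#!/bin/sh
# Use the C locale so that [a-z] matches only ASCII lowercase letters.
LC_ALL=C
export LC_ALL

# Return 0 if $1 is a valid RFC 1035 label as Kubernetes defines it.
# Assumption: the TrainJob name check follows this same rule.
is_rfc1035_label() {
  name="$1"
  [ -n "$name" ] || return 1            # must not be empty
  [ "${#name}" -le 63 ] || return 1     # at most 63 characters
  case "$name" in
    *[!a-z0-9-]*) return 1 ;;           # a character outside a-z, 0-9, '-'
    [!a-z]*)      return 1 ;;           # must start with a lowercase letter
    *-)           return 1 ;;           # must not end with '-'
  esac
  return 0
}

for candidate in pytorch-mnist 1st-run Fine_Tune job-; do
  if is_rfc1035_label "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: rejected"
  fi
done
```

Of the four sample names, only `pytorch-mnist` passes. `1st-run` is rejected because RFC 1035 labels, unlike RFC 1123 labels, may not start with a digit.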
Bug Fixes
- [release-2.1] fix(ci): Fix the Kubeflow SDK installation with Docker (#2927 by @andreyvelich)
- fix(manifests): Add RBAC rules for Leases in Helm Charts (#2901 by @astefanutti)
- fix(docs): correct example usage in KEP-2437-Support-Volcano-Scheduler (#2898 by @Doris-xm)
- fix(api): Keep mpiImplementation field a pointer (#2897 by @astefanutti)
- fix(api): Fix lint errors for the config API (#2896 by @astefanutti)
- fix: charts dependencies (#2892 by @ls-2018)
- fix(runtimes): fix missing dependency in torchtune trainer image. (#2887 by @Electronic-Waste)
- fix(ci): Add latest image tag only for the master branch (#2854 by @andreyvelich)
- fix: read only permission for PRs (#2829 by @jaiakash)
- fix: read only permission for PRs (#2827 by @jaiakash)
- fix: update examples to reflect func_args now being unpacked (#2815 by @briangallagher)
- fix(examples): Update get_job_logs() API in examples (#2813 by @andreyvelich)
- fix: teraform for oci gpu based vm (#2810 by @jaiakash)
- fix(api): Regenerate TrainJob CRD (#2805 by @astefanutti)
- fix(ci): disable Unit and Integration Test - Go gh action in forked repos (#2746 by @milinddethe15)
- fix(manifests): Add missing permissions for the RuntimeClass and LimitRange (#2787 by @tenzen-y)
- fix: update kubeflow sdk reference (#2780 by @kramaranya)
- fix(api): update license path for kubeflow_trainer_api (#2778 by @kramaranya)
- fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2774 by @andreyvelich)
- fix(docs): update KEP-2401 according to current implementation. (#2765 by @Electronic-Waste)
- fix(ci): Remove coverage from Go integration tests (#2773 by @andreyvelich)
- fix(api): Fix license path for Kubeflow Trainer Python API (#2771 by @andreyvelich)
- fix(examples): Update the argument for Runtime framework (#2766 by @andreyvelich)
- fix(test): Fix Ginkgo command for integration tests (#2758 by @astefanutti)
- fix: fix the command for fetching Kubeflow Trainer version in the issue template (#2732 by @rudeigerc)
- fix(manifests): add rbac config of events for event recorders (#2731 by @rudeigerc)
- fix(manifests): fix position of labels of dataset-initializer from pod to job (#2719 by @rudeigerc)
- fix(module): Change Go module name to v2 (#2707 by @andreyvelich)
- fix(plugins): Fix some errors in torchtune mutation process. (#2675 by @Electronic-Waste)
- fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files (#2669 by @Electronic-Waste)
- fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2682 by @astefanutti)
Misc
- [release-2.1] feat: Adding local execution example notebook (#2924 by @Fiona-Waters)
- feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
- [release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
- [release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)
- feat(operator): Add validation for required containers in replicatedJobs (#2722 by @ELE...
v2.1.0-rc.1
This is the Kubeflow Trainer v2.1.0-rc.1 pre-release:
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0-rc.1"
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0-rc.1"
New Features
- feat(manifests): Publish Kubeflow Trainer Helm charts (#2917 by @adity1raut)
- [release-2.1] chore(operator): Use SSA throughout runtime framework (#2912 by @astefanutti)
- [release-2.1] feat(initializer): add s3 model and dataset initializers (#2911 by @rudeigerc)
Bug Fixes
- [release-2.1] fix(manifests): Fix boolean values defaulting in Helm charts (#2914 by @astefanutti)
- [release-2.1] fix(runtimes): Update pip version in the MLX runtime (#2910 by @andreyvelich)
v2.1.0-rc.0
This is the Kubeflow Trainer v2.1.0-rc.0 pre-release:
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0-rc.0"
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0-rc.0"
Breaking Changes
- feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
- feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
- chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
- chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
- Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)
New Features
Distributed AI Data Cache
- feat(cache): KEP-2655 - Add build pipeline and address vulnerabilities for data_cache (#2890 by @akshaychitneni)
- feat(cache): KEP-2655: Adding cache initializer (#2793 by @akshaychitneni)
- feat: KEP-2655: Add data cache system (#2755 by @akshaychitneni)
LLM Post-Training
- feat(runtimes): Add LoRA/QLoRA/DoRA support in LLM Trainer V2 (#2832 by @Electronic-Waste)
- feat: Add Qwen 2.5 1.5b runtime, example and fix gpu e2e test (#2835 by @jaiakash)
- feat(runtimes): Support Distributed MLX on CUDA (#2790 by @andreyvelich)
Kueue Enhancements
- Support Topology Aware Scheduling for TrainJobs (kubernetes-sigs/kueue#7249 by @kaisoz)
- fix: Allow multiple podSpec overrides to target the same TargetJob (#2880 by @kaisoz)
- feat: support affinity in TrainJob pod spec overrides (#2796 by @toVersus)
- feat: Add schedulingGates to PodSpecOverrides (#2700 by @astefanutti)
Volcano Scheduler
- feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
- feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)
API Updates
- feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
- feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
- feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
- feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
- feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)
Bug Fixes
- fix(manifests): Add RBAC rules for Leases in Helm Charts (#2901 by @astefanutti)
- fix(docs): correct example usage in KEP-2437-Support-Volcano-Scheduler (#2898 by @Doris-xm)
- fix(api): Keep mpiImplementation field a pointer (#2897 by @astefanutti)
- fix(api): Fix lint errors for the config API (#2896 by @astefanutti)
- fix: charts dependencies (#2892 by @ls-2018)
- fix(runtimes): fix missing dependency in torchtune trainer image. (#2887 by @Electronic-Waste)
- fix(ci): Add latest image tag only for the master branch (#2854 by @andreyvelich)
- fix: read only permission for PRs (#2829 by @jaiakash)
- fix: read only permission for PRs (#2827 by @jaiakash)
- fix: update examples to reflect func_args now being unpacked (#2815 by @briangallagher)
- fix(examples): Update get_job_logs() API in examples (#2813 by @andreyvelich)
- fix: teraform for oci gpu based vm (#2810 by @jaiakash)
- fix(api): Regenerate TrainJob CRD (#2805 by @astefanutti)
- fix(ci): disable Unit and Integration Test - Go gh action in forked repos (#2746 by @milinddethe15)
- fix(manifests): Add missing permissions for the RuntimeClass and LimitRange (#2787 by @tenzen-y)
- fix: update kubeflow sdk reference (#2780 by @kramaranya)
- fix(api): update license path for kubeflow_trainer_api (#2778 by @kramaranya)
- fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2774 by @andreyvelich)
- fix(docs): update KEP-2401 according to current implementation. (#2765 by @Electronic-Waste)
- fix(ci): Remove coverage from Go integration tests (#2773 by @andreyvelich)
- fix(api): Fix license path for Kubeflow Trainer Python API (#2771 by @andreyvelich)
- fix(examples): Update the argument for Runtime framework (#2766 by @andreyvelich)
- fix(test): Fix Ginkgo command for integration tests (#2758 by @astefanutti)
- fix: fix the command for fetching Kubeflow Trainer version in the issue template (#2732 by @rudeigerc)
- fix(manifests): add rbac config of events for event recorders (#2731 by @rudeigerc)
- fix(manifests): fix position of labels of dataset-initializer from pod to job (#2719 by @rudeigerc)
- fix(module): Change Go module name to v2 (#2707 by @andreyvelich)
- fix(plugins): Fix some errors in torchtune mutation process. (#2675 by @Electronic-Waste)
- fix(manifests): Update manifests to enable LLM fine-tuning workflow with CTR and TrainJob yaml files (#2669 by @Electronic-Waste)
- fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2682 by @astefanutti)
Misc
- feat(operator): Add validation for required containers in replicatedJobs (#2722 by @Electronic-Waste)
- feat: add controller manager configuration helm chart (#2895 by @kapil27)
- chore(ci): Enable Kubernetes API Linter (#2858 by @astefanutti)
- feat(runtimes): implement clusterTrainingRuntime deprecation process (#2791 by @tdn21)
- feat: add HF token and allow gpu workflow to run from pull request target (#2818 by @jaiakash)
- feat(docs): KEP-2442-Support JAX Training Runtime (#2643 by @mahdikhashan)
- chore(test): Support e2e cluster setup with Podman (#2861 by @astefanutti)
- chore(runtimes): Upgrade torchtune version to v0.6.1 (#2876 by @Electronic-Waste)
- chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
- feat(docs): Update Trainer diagram and SDK release (#2867 by @andreyvelich)
- feat(docs): Add changelog for Kubeflow Trainer v2.0.1 (#2864 by @andreyvelich)
- fix(docs): Update the release document to push all changes (#2865 by @andreyvelich)
- chore: Install released version of Kubeflow SDK (#2857 by @kramaranya)
- chore(ci): Ignore generated files in .gitattributes (#2855 by @andreyvelich)
- feat: Add a public function to create runtime info objects (#2837 by @kaisoz)
- chore(test): add uts for coscheduling plugin. (#2582 by @IRONICBo)
- feat(ci): Add Trivy Vulnerability Scan (#2826 by @andreyvelich)
- chore: merge test cases using PodSpecOverrides into a single case (#2822 by @toVersus)
- chore(runtimes): update torchtune CTRs with multiple dependson feature in jobset v0.9.0 (#2823 by @Electronic-Waste)
- chore(operator): Bump JobSet ...
v2.0.1
This is the Kubeflow Trainer v2.0.1 release.
New Features
Bug Fixes
- [release-2.0] fix(runtimes): Set numProcPerNode: 1 in DeepSpeed Runtime (#2863 by @andreyvelich)
- [release-2.0] fix(ci): Add latest image tag only for the master branch (#2862 by @andreyvelich)
- [release-2.0] fix: update examples to reflect func_args now being unpacked (#2815) (#2853 by @astefanutti)
- [release-2.0] fix(examples): Update get_job_logs() API in examples (#2813) (#2852 by @astefanutti)
- [release-2.0] feat(runtimes): Add Framework Label to the Runtimes (#2761) (#2851 by @astefanutti)
- [release-2.0] fix(examples): Update the argument for Runtime framework (#2766) (#2850 by @astefanutti)
- [release-2.0] fix: update kubeflow sdk reference (#2780) (#2847 by @astefanutti)
- [release-2.0] fix(api): Fix license path for Kubeflow Trainer Python API (#2772 by @andreyvelich)
v2.0.0
This is the major release of the Kubeflow Trainer 2.0 project.
For more information, please see the
Quickstart
Install the Kubeflow Trainer control plane:
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"
$ kubectl get pods -n kubeflow-system
NAME READY STATUS RESTARTS AGE
jobset-controller-manager-54968bd57b-88dk4 2/2 Running 0 65s
kubeflow-trainer-controller-manager-cc6468559-dblnw 1/1 Running 0 65s
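The pod listing above can also be checked mechanically instead of by eye. This is a small sketch under the assumption that the default `kubectl get pods` table layout is used (NAME, READY, STATUS as the first three columns); `all_pods_ready` is a hypothetical helper name, and the sample input is the listing from these notes.

```shell
#!/bin/sh
# Read `kubectl get pods` table output on stdin. Succeed only if at least one
# pod is listed and every pod is Running with all containers ready ("n/n").
all_pods_ready() {
  awk '
    NR == 1 { next }                       # skip the header row
    {
      split($2, ready, "/")                # READY column, e.g. "2/2"
      if ($3 != "Running" || ready[1] != ready[2]) {
        printf "not ready: %s (%s, %s)\n", $1, $2, $3
        bad = 1
      }
      seen = 1
    }
    END { if (bad || !seen) exit 1 }
  '
}

# Sample input copied from the listing in the notes.
all_pods_ready <<'EOF'
NAME READY STATUS RESTARTS AGE
jobset-controller-manager-54968bd57b-88dk4 2/2 Running 0 65s
kubeflow-trainer-controller-manager-cc6468559-dblnw 1/1 Running 0 65s
EOF
echo "exit status: $?"
```

Against a live cluster the same function could be fed with `kubectl get pods -n kubeflow-system | all_pods_ready`. `kubectl wait --for=condition=Ready pod --all -n kubeflow-system` is a built-in alternative that blocks until the pods are ready.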
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"
Install Kubeflow Python SDK:
pip install git+https://github.com/kubeflow/sdk.git@main#subdirectory=python
Run your first TrainJob by following the getting started guide.
Breaking Changes
- Migrate SDK to the kubeflow/sdk repository (#2657 by @eoinfennessy)
- KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
- Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
- Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
- Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
- Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)
New Features
LLM Trainer V2
- KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
- KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
- KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
- KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
- KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
- KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
- KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)
Runtime Framework
- feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
- feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
- feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
- Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
- Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
- KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
- Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
- Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)
MPI Plugin
- [feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
- Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
- Implemenet MPI Plugin for OpenMPI (#2493 by @tenzen-y)
- Implement MPI plugin UTs (#2481 by @tenzen-y)
- Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
- Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
- Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
- Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
- KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)
JobSet
- Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
- KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
- Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
- Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)
New Examples
- Add question-answer example for v2 trainer (#2580 by @solanyn)
- KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)
SDK Updates
- feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
- feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
- feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
- feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)
Bug Fixes
- [release-2.0] fix(manifests): add rbac config of events for event recorders (#2733 by @rudeigerc)
- [release-2.0] fix(manifests): fix position of labels of dataset-initializer from pod to job (#2720 by @rudeigerc)
- [release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
- [cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
- [release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
- [release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)
- Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
- fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
- fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
- fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
- Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
- fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
- fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
- Fix MPI Test runnable errors (#2570 by @tenzen-y)
- Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
- fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
- fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
- fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
- fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
- fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
- fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
- [hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
- [hotfix] fix docker cred (#2530 by @mahdikhashan)
- fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
- Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
- fix type in model initializer entrypoint ([#2489](https://github.com/kubeflow/trainer/pull...
v1.9.3
This is the Training Operator v1.9.3 release.
New Features
- [SDK] Add provision to provide local-queue for the training job (#2636 by @abhijeet-dhumal)
Misc
- chore: Remove V2 code from Training Operator 1.9 release branch (#2737 by @andreyvelich)
- chore(ci): Add more workaround no space left on device (#2677 by @astefanutti)
v2.0.0-rc.1
This is the Kubeflow Trainer v2.0.0-rc.1 pre-release.
New Features
- [release-2.0] feat: Add schedulingGates to PodSpecOverrides (#2705 by @astefanutti)
- [release-2.0] feat: Mutable PodSpecOverrides for suspended TrainJob (#2698 by @astefanutti)
- [Release 2.0] KEP-2170: Add the manifests overlay for Kubeflow Training V2 (#2692 by @Doris-xm)
Bug Fixes
- [release-2.0] fix(module): Change Go module name to v2 (#2708 by @andreyvelich)
- [cherry-pick] fix(manifests): Update manifests to enable LLM fine-tuning workflow w… (#2696 by @Electronic-Waste)
- [release-2.0] fix(plugins): Fix some errors in torchtune mutation process. (#2693 by @Electronic-Waste)
- [release-2.0] fix(rbac): Add required RBAC to update ClusterTrainingRuntimes on OpenShift (#2684 by @astefanutti)
Misc
- [release-2.0] chore: Copy generated CRDs into Helm charts (#2704 by @astefanutti)
- [cherry-pick] feat(example): Add alpaca-trianjob-yaml.ipynb. (#2670) (#2702 by @Electronic-Waste)
- [release-2.0] chore: Replace the deprecated intstr.FromInt with intstr.FromInt32 (#2697 by @tenzen-y)
- [release-2.0] chore: Remove the vendor specific parameters (#2694 by @tenzen-y)
- [release-2.0] chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 (#2687 by @andreyvelich)
- [release-2.0] chore(helm): Sync ClusterRule in Helm chart (#2688 by @astefanutti)
v2.0.0-rc.0
This is the Kubeflow Trainer v2.0.0-rc.0 pre-release.
Breaking Changes
- KEP-2170: Change API Group Name to trainer.kubeflow.org (#2413 by @Electronic-Waste)
- Move generated Python models into kubeflow_trainer_api package (#2632 by @kramaranya)
- Upgrade kubernetes Go module version to 1.32 (#2450 by @tenzen-y)
- Remove kubeflow-trainer prefix from jobset resource names (#2596 by @ChenYi015)
- Remove the Training Operator V1 Source Code (#2389 by @andreyvelich)
New Features
LLM Trainer V2
- KEP-2401: Support loading local LLMs (#2644 by @Electronic-Waste)
- KEP-2401: Support mutating dataset preprocessing config in SDK (#2638 by @Electronic-Waste)
- KEP-2401: Create LLM Training Runtimes for Llama 3.2 model family (#2590 by @Electronic-Waste)
- KEP-2401: Complement torch plugin to support torchtune config mutation (#2587 by @Electronic-Waste)
- KEP-2401: Create torchtune trainer image (#2516 by @Electronic-Waste)
- KEP-2401: Refactor current train() API (#2513 by @Electronic-Waste)
- KEP-2401: Kubeflow LLM Trainer V2 (#2410 by @Electronic-Waste)
Runtime Framework
- feat(runtimes): Support MLX Distributed Runtime with OpenMPI (#2565 by @andreyvelich)
- feat(runtimes): Support DeepSpeed Runtime with OpenMPI (#2559 by @andreyvelich)
- feat(runtime): remove needless Launcher chainer. (#2558 by @IRONICBo)
- Store the TrainingRuntime numNodes as runtime.Info.PodSet.Count (#2539 by @tenzen-y)
- Add dependencies to RuntimeRegistrar (#2476 by @tenzen-y)
- KEP: 2170: Adding cel validations on TrainingRuntime/ClusterTrainingRuntime CRDs (#2313 by @akshaychitneni)
- Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to ClusterTrainingRuntime (#2625 by @tenzen-y)
- Implement trainer.kubeflow.org/resource-in-use finalizer mechanism to TrainingRuntime (#2608 by @tenzen-y)
MPI Plugin
- [feature]:add validations for MPIRuntime with RunLauncherAsNode (#2551 by @Harshal292004)
- Implement CustomValidation UT for MPI plugin (#2555 by @tenzen-y)
- Implemenet MPI Plugin for OpenMPI (#2493 by @tenzen-y)
- Implement MPI plugin UTs (#2481 by @tenzen-y)
- Implement MPIImplementation Enum CRD validation (#2482 by @tenzen-y)
- Implement MPI numProcPerNode defaulter (#2483 by @tenzen-y)
- Add MPIMLPolicySource CRD defaulters (#2474 by @tenzen-y)
- Make MPIMLPolicySource optional fields as a pointer (#2472 by @tenzen-y)
- KEP-2170: Implement MPI Plugin for Kubeflow Trainer (#2394 by @andreyvelich)
JobSet
- Retrieve JobSetSpec from runtime.Info in CustomValidations (#2557 by @tenzen-y)
- KEP-2170: Deploy JobSet in kubeflow-system namespace (#2388 by @andreyvelich)
- Bump JobSet to v0.8.0 (#2463 by @andreyvelich)
- Upgrade jobset SDK version to v0.7.3 (#2445 by @Electronic-Waste)
New Examples
- Add question-answer example for v2 trainer (#2580 by @solanyn)
- KEP-2170: Add PyTorch DDP MNIST training example (#2387 by @astefanutti)
SDK Updates
- Remove SDK (#2657 by @eoinfennessy)
- feat(sdk): Get namespace from the provided context (#2593 by @andreyvelich)
- feat(sdk): Support MPI-based TrainJobs (#2545 by @andreyvelich)
- feat(sdk): Migrate to OpenAPI V3 (#2490 by @andreyvelich)
- feat(sdk): Generate external Kubernetes and JobSet models (#2466 by @andreyvelich)
Bug Fixes
- Revert "fix(sdk): Fix type annotation for train method's trainer parameter" (#2651 by @Electronic-Waste)
- fix(sdk): Fix bad arg passed to get_args_using_torchtune_config (#2647 by @eoinfennessy)
- fix(sdk): Fix type annotation for train method's trainer parameter (#2646 by @eoinfennessy)
- fix(controller): Fix RBAC permissions for TrainJob controller (#2626 by @andreyvelich)
- Fix close-pr message in Stale GitHub Action (#2622 by @kramaranya)
- fix: remove redundant K8s version matrix from integration tests (#2617 by @tr33k)
- fix(doc): tidy up KEP-2401. (#2594 by @Electronic-Waste)
- Fix MPI Test runnable errors (#2570 by @tenzen-y)
- Fix issue with fetching clustertrainingruntime for validations (#2564 by @akshaychitneni)
- fix(sdk): Add missing import types. (#2566 by @Electronic-Waste)
- fix(sdk): Using correct entrypoint for mpirun (#2552 by @andreyvelich)
- fix(sdk): add missing import type Initializer. (#2541 by @Electronic-Waste)
- fix(ci): update test-go coverage ci config and replace trainer badge with new address. (#2534 by @IRONICBo)
- fix(doc): Update train() API in KEP-2401 (#2536 by @Electronic-Waste)
- fix(test): Update images for DockerHub publish (#2535 by @andreyvelich)
- [hotfix] fix checkout on workflow (#2531 by @mahdikhashan)
- [hotfix] fix docker cred (#2530 by @mahdikhashan)
- fix: remove unused parameter name in default case of shouldUseCPU function (#2521 by @Diasker)
- Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob (#2492 by @Diasker)
- fix type in model initializer entrypoint (#2489 by @szaher)
- fix(runtime): fix error label name. (#2487 by @Electronic-Waste)
- fix(sdk): resolve errors in deserialization (#2457 by @Electronic-Waste)
- Fix missing external types in apply configurations (#2429 by @astefanutti)
- Fix API Group for Torch Runtime (#2424 by @andreyvelich)
- Fix Kustomize patchesStrategicMerge deprecation warning (#2405 by @astefanutti)
- ControlPlane: Fix flaky integraion testings due to missing the latest version of object (#2414 by @tenzen-y)
Misc
- Tag Docker images with GitHub release tags (#2662 by @kramaranya)
- feat(controller): Implement PodSpecOverride API (#2614 by @andreyvelich)
- Nominate @Electronic-Waste as approver and @astefanutti as reviewer (#2659 by @andreyvelich)
- chore(build): Support Podman to run OpenAPI generator (#2656 by @astefanutti)
- chore(docs): Add OpenSSF Best Practices Badge (#2611 by @andreyvelich)
- [chore] update stale action version to latest (#2642 by @mahdikhashan)
- Remove TrainJobCreated condition (#2621 by @astefanutti)
- ci: refactor build-push-images workflow (#2607 by @milinddethe15)
- Update Go to v1.24 (#2615) (#2620 by @vzamboulingame)
- test(runtime): add UT for IndexTrainJobTrainingRuntime (#2603 by @Harshal292004)
- ci: add k8s v1.32 for tests env ([#2613](#26...
v1.9.2
This is the Training Operator v1.9.2 release.
New Features
- Add provision to provide labels and annotations for the pytorchjob an… (#2612 by @abhijeet-dhumal)
Bug Fixes
- Fix llm hp optimization error (#2576 by @helenxie-bit)
- [bug] pull image from ghcr (#2584 by @mahdikhashan)
v1.9.1 release
This is the Training Operator v1.9.1 release.
Breaking Changes
- Update Manifest Images to GHCR (#2544 by @saileshd1402)
- Push images to GHCR for release-1.9 (#2491 by @saileshd1402)
New Features
- Add volume and volume mounts arguments to TrainingClient.create_job API (#2449 by @astefanutti)
- Add configurable QPS and burst settings for kube API client (#2411 by @ronk21runai)
Bug Fixes
- fix(ci): Change publish dir from training to trainer (#2546 by @Electronic-Waste)
- fix: fix typos in script comments. (#2465 by @IRONICBo)
- fix: adds jaxjobs to the kubeflow-training-roles.yaml ClusterRole (#2417 by @DnPlas)
- [release-1.9] Rename paddlepaddle_defaults.go file name (#2400 by @ChristianZaccaria)