-
Notifications
You must be signed in to change notification settings - Fork 697
fix: prevent buildkit corruption on sticky disk commit #1393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
adityamaru
wants to merge
214
commits into
docker:master
from
useblacksmith:fix-buildkit-corruption-prevention
Closed
fix: prevent buildkit corruption on sticky disk commit #1393
adityamaru
wants to merge
214
commits into
docker:master
from
useblacksmith:fix-buildkit-corruption-prevention
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
1. Checks we have buildx installed 2. Configures a remote builder if we get an address back 3. Uses the already configured builder if we don't get an address back This change does not plumb the dockerfile path through as the entity, and does not differentiate a failed build from a succesful to report to anvil in the post step yet.
*: basic scaffolding for build-push-action
This change teaches the build push action to request a stickydisk every time it runs. Once the SD is hotloaded the VM will mount the buildkit root dir and starts buildkitd.
src: use port from env
Previously, we were firing off an async buildkit prune to clean up layers unused in 14 days. This changes that to cleanup layers unused in 7 days and fires it off inline on cleanup. It just seems easier to reason about that way.
src: move buildkit prune to cleanup stage and invoke it inline
Signed-off-by: Aditya Maru <[email protected]>
Firstly this was a bug where we were trying to commit in the post step even if we had already committed at the end of the main step in a non-setup-only invocation. Secondly, if the action is canceled before the exposeID is set in the main process, we don't want to send a commit request with an empty exposeID.
src: only commit stickydisk in post step if in setup-only
src: print the port bpa is trying to hit
src: more debug logs
src: add ping before get stickydisk
src: add a retry with backoff to combat 429s when downloading buildkit
*: allow users to pass in a buildx version
action: add missing option string
The remote builder was hardcoded to use --platform linux/amd64 regardless of user input or runner architecture. This caused performance issues on ARM runners and cache inefficiencies. Now properly uses the platforms input or detects host architecture to avoid unnecessary QEMU emulation and improve build performance. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
The test was hardcoded to expect arm64 platform, causing failures on AMD runners. Now checks actual host architecture dynamically. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
fix: use correct platform when creating remote buildx builder
Signed-off-by: Aditya Maru <[email protected]>
src: use BLACKSMITH prefixed VM ID env var
src: only prune if buildkitd was spun up
- Add graceful shutdown validation - fail if buildkitd doesn't shutdown cleanly - Add sync after buildkitd termination to flush database writes - Add buildkit state validation before committing sticky disk - Prevent sticky disk commit on build failures - Add multiple sync operations before unmounting - Add buildkit validation utilities to check database integrity This should prevent the BoltDB corruption issues we've been seeing. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Problem
We've been seeing buildkit database corruption issues manifesting as:
These indicate buildkitd isn't properly flushing its database writes before the Ceph volume is committed.
Solution
Changes
shutdownBuildkitd()to track graceful shutdown and fail if timeoutvalidateBuildkitState()to check for corruption indicatorsTesting
🤖 Generated with Claude Code