@adityamaru
Summary

  • Adds graceful shutdown validation to prevent committing corrupted buildkit state
  • Implements multiple sync operations to ensure database writes are flushed
  • Adds state validation before committing sticky disk

Problem

We've been seeing buildkit database corruption issues manifesting as:

  1. BoltDB panic: "assertion failed: Page expected to be: 188, but self identifies as 0"
  2. Overlayfs errors: "failed to rename: file exists"

These indicate buildkitd isn't properly flushing its database writes before the Ceph volume is committed.

Solution

  1. Graceful Shutdown Enforcement: Fail if buildkitd doesn't shut down cleanly within the timeout
  2. Critical Sync After Shutdown: Add sync immediately after buildkitd terminates to flush all database writes
  3. State Validation: Validate buildkit state before committing (no processes, no lock files, non-zero db files)
  4. No Commit on Failure: Don't commit sticky disk if build fails or validation fails
  5. Multiple Syncs: Add multiple sync operations before unmounting
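
The ordering of the five steps above can be sketched as follows. This is a minimal illustration, not the code in this PR: `TeardownDeps`, `waitForExit`, `teardownBuildkit`, and the default timeout are all hypothetical names and values, and the real process control, `sync`, and unmount calls are left as injected hooks.

```typescript
// Hypothetical hooks; these names are illustrative and not taken from the PR.
interface TeardownDeps {
  stopBuildkitd(): Promise<void>;         // e.g. send a termination signal
  isBuildkitdRunning(): Promise<boolean>; // e.g. check the process table
  sync(): Promise<void>;                  // e.g. run `sync`
  validateState(): Promise<string[]>;     // problems found; empty when clean
  unmount(): Promise<void>;
}

interface TeardownResult {
  shouldCommit: boolean;
  reasons: string[];
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the process is gone (true) or the timeout elapses (false).
async function waitForExit(
  isRunning: () => Promise<boolean>,
  timeoutMs: number,
  pollMs: number,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (!(await isRunning())) return true;
    if (Date.now() >= deadline) return false;
    await sleep(pollMs);
  }
}

async function teardownBuildkit(
  deps: TeardownDeps,
  buildSucceeded: boolean,
  timeoutMs: number = 30000, // assumed default, not from the PR
  pollMs: number = 100,
): Promise<TeardownResult> {
  const reasons: string[] = [];
  if (!buildSucceeded) reasons.push("build failed");

  // Step 1: request shutdown and record whether it finished in time.
  await deps.stopBuildkitd();
  const graceful = await waitForExit(() => deps.isBuildkitdRunning(), timeoutMs, pollMs);
  if (!graceful) reasons.push("buildkitd did not exit within timeout");

  // Step 2: flush writes immediately after the shutdown attempt.
  await deps.sync();

  // Step 3: collect any corruption indicators.
  reasons.push(...(await deps.validateState()));

  // Step 5: sync again before unmounting.
  await deps.sync();
  await deps.unmount();

  // Step 4: commit only when nothing above went wrong.
  return { shouldCommit: reasons.length === 0, reasons };
}
```

Returning the reasons alongside the decision keeps the "no commit" path easy to log; whether a timeout should also abort the unmount is a policy choice this sketch does not make.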

Changes

  • Modified shutdownBuildkitd() to track graceful shutdown and fail on timeout
  • Added validateBuildkitState() to check for corruption indicators
  • Added sync operations at critical points
  • Prevented sticky disk commit on any failure condition
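
For illustration, a state check along the lines described for validateBuildkitState() might look like the sketch below. It is not the function added in this PR: `findStateProblems` is a hypothetical name, the `.lock` and `.db` suffixes are assumed file-naming conventions, and the process check is passed in as a boolean rather than performed here.

```typescript
import * as fs from "fs";
import * as path from "path";

// Recursively lists regular files under a directory.
function listFiles(dir: string): string[] {
  const out: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) out.push(...listFiles(full));
    else if (entry.isFile()) out.push(full);
  }
  return out;
}

// Returns a list of corruption indicators; an empty list means the
// state looks safe to commit. The suffix checks are assumptions.
function findStateProblems(rootDir: string, buildkitdRunning: boolean): string[] {
  const problems: string[] = [];
  if (buildkitdRunning) problems.push("buildkitd process still running");
  for (const file of listFiles(rootDir)) {
    if (file.endsWith(".lock")) {
      problems.push(`lock file present: ${file}`);
    }
    if (file.endsWith(".db") && fs.statSync(file).size === 0) {
      problems.push(`zero-length database file: ${file}`);
    }
  }
  return problems;
}
```

A caller would skip the sticky disk commit whenever the returned list is non-empty. Note that a size check only catches empty files; it cannot detect a partially written BoltDB page.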

Testing

  • Test with normal successful builds
  • Test with builds that fail
  • Test with buildkitd that doesn't shut down gracefully
  • Monitor for corruption issues in staging

🤖 Generated with Claude Code

adityamaru and others added 30 commits September 11, 2024 20:08
1. Checks we have buildx installed
2. Configures a remote builder if we get an address back
3. Uses the already configured builder if we don't get an address back

This change does not plumb the dockerfile path through as the entity,
and does not yet differentiate a failed build from a successful one
when reporting to anvil in the post step.
*: basic scaffolding for build-push-action
* tls

* set up tls while creating the remote builder
This change teaches the build push action to request a stickydisk
every time it runs. Once the SD is hotloaded, the VM will mount
the buildkit root dir and start buildkitd.
adityamaru and others added 28 commits April 15, 2025 21:37
Previously, we were firing off an async buildkit prune to clean
up layers unused in 14 days. This changes that to clean up layers
unused in 7 days and fires it off inline on cleanup. It just seems
easier to reason about that way.
src: move buildkit prune to cleanup stage and invoke it inline
Firstly, this fixes a bug where we were trying to commit in the post
step even if we had already committed at the end of the main step in
a non-setup-only invocation.

Secondly, if the action is canceled before the exposeID is set in the main
process, we don't want to send a commit request with an empty exposeID.
src: only commit stickydisk in post step if in setup-only
src: print the port bpa is trying to hit
src: add ping before get stickydisk
src: add a retry with backoff to combat 429s when downloading buildkit
*: allow users to pass in a buildx version
The remote builder was hardcoded to use --platform linux/amd64
regardless of user input or runner architecture. This caused
performance issues on ARM runners and cache inefficiencies.

Now properly uses the platforms input or detects host architecture
to avoid unnecessary QEMU emulation and improve build performance.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The test was hardcoded to expect arm64 platform, causing failures
on AMD runners. Now checks actual host architecture dynamically.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
fix: use correct platform when creating remote buildx builder
src: use BLACKSMITH prefixed VM ID env var
src: only prune if buildkitd was spun up
- Add graceful shutdown validation - fail if buildkitd doesn't shut down cleanly
- Add sync after buildkitd termination to flush database writes
- Add buildkit state validation before committing sticky disk
- Prevent sticky disk commit on build failures
- Add multiple sync operations before unmounting
- Add buildkit validation utilities to check database integrity

This should prevent the BoltDB corruption issues we've been seeing.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>