Skip to content

Refactor buildpack data store and bin/report #1878

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Aug 21, 2025
Merged

Refactor buildpack data store and bin/report #1878

merged 5 commits into from
Aug 21, 2025

Conversation

edmorley
Copy link
Member

@edmorley edmorley commented Aug 20, 2025

The Node.js buildpack has a set of Bash helper utils for managing a key-value store of build data/metadata, that could be added to during the build, and then output when bin/report was run. When the Python buildpack added support for bin/report, its implementation was based on the Node.js one, with some further small improvements.

However, there remains a number of issues with the design:

  1. It handles different data types (string vs int vs float vs bool) at bin/report time, so requires a hardcoded mapping of values which can (and does) get out of sync. (Plus doesn't work well for some cases, like jvm-common being wrapped by the other JVM related buildpacks.)
  2. It tries to escape newlines and carriage returns, but there are a number of edge cases still not handled properly which result in some fields not being sent to Honeycomb properly.

In order to work around the above, Manuel created a jq based approach for use by the JVM buildpacks:
https://github.com/heroku/heroku-buildpack-jvm-common/blob/4fd6e4f8f258a5766a8b2e9ffec2690675b00879/lib/metrics.sh

However, that doesn't have the full API of the existing solution (it's missing the ability to fetch previous values; something which Python needs), and I spotted a few more places where things can be optimised.

In addition, now that we're using jq and JSON internally in the buildpack, it makes sense for us to update cytokine to support the build report being emitted as JSON too (in addition to YAML), particularly since cytokine then submits the report to Vacuole as JSON afterwards.

And in fact, since JSON is roughly a subset of YAML (at least for the simple data types and schema we're using for the build report), we can start emitting the build report as JSON now, rather than needing to wait for the cytokine change to land first. (We'll still want to update cytokine to use a JSON parser though, just to avoid any edge cases in the future).

The new implementation:

  • Sets the data type at the call site where the build data is set, rather than relying on a hardcoded mapping at time of retrieval.
  • Uses jq to do the string escaping, so much less fragile than the hand-rolled implementation.
  • Has been rewritten to avoid unnecessary processes calls, redirections, and patterns that required disabling a number of shellcheck warnings.

I've confirmed that metrics still appear in Honeycomb with the changes here.

After this merges, I'll open PRs against some of the other language buildpacks that don't yet implement bin/report, using the improved implementation from here (with simplifications since other buildpacks won't need the legacy data store handling that Python needs to support caches written by older buildpack versions). And once Manuel is back, I'll discuss with him backporting some of the improvements to the JVM buildpacks too.

GUS-W-19344754.

@edmorley edmorley self-assigned this Aug 20, 2025
The Node.js buildpack had a set of Bash helper utils for managing a
key-value store of metadata, that could be added to during the build,
and then output when bin/report was run.

When the Python buildpack added support for bin/report, its
implementation was based on the Node.js one, with some further small
improvements.

However, there remains a number of issues with the design:
1. It handles different data types (string vs int vs float vs bool) at
   `bin/report` time, so requires a hardcoded mapping of values which
   can (and does) get out of sync. (Plus doesn't work well for some
   cases, like jvm-common being wrapped by the other JVM related
   buildpacks.)
2. It tries to escape newlines and carriage returns, but there are a
   number of edge cases still not handled properly which result in some
   fields not being sent to Honeycomb properly.

In order to work around the above, Manuel created a jq based approach
for use by the JVM buildpacks:
https://github.com/heroku/heroku-buildpack-jvm-common/blob/4fd6e4f8f258a5766a8b2e9ffec2690675b00879/lib/metrics.sh

However, that doesn't have the full API of the existing solution (it's
missing the ability to fetch previous values; something which Python
needs), and I spotted a few more places where things can be optimised.

In addition, now that we're using jq and JSON internally in the
buildpack, it makes sense for us to update cytokine to support the
build report being emitted as JSON too (in addition to YAML),
particularly since cytokine then submits the report to Vacuole as
JSON afterwards.

And in fact, since JSON is roughly a subset of YAML (at least for the
simple data types and schema we're using for the build report), we can
start emitting the build report as JSON now, rather than needing to wait
for the cytokine change to land first. (We'll still want to update
cytokine to use a JSON parser though, just to avoid any edge cases in
the future).

The new implementation:
- Sets the data type at the call site where the metadata is set,
  rather than relying on a hardcoded mapping at time of retrieval.
- Uses jq to do the string escaping, so much less fragile than the
  hand-rolled implementation.
- Uses a simplified initialisation/setup API, which avoids the need to
  call both `init` and `setup`.
- Has been rewritten to avoid unnecessary processes calls, redirections,
  and patterns that required disabling a number of shellcheck warnings.

I've confirmed that metrics still appear in Honeycomb with the changes
here.

After this merges, I'll open PRs against some of the other language
buildpacks that don't yet implement `bin/report`, using the improved
implementation from here (with simplifications since other buildpacks
won't need the legacy data store handling that Python needs to support
caches written by older buildpack versions).

GUS-W-19344754.
@edmorley edmorley marked this pull request as ready for review August 20, 2025 21:15
@edmorley edmorley requested a review from a team as a code owner August 20, 2025 21:15
@dzuelke dzuelke self-requested a review August 20, 2025 21:26
Co-authored-by: Richard Schneeman <richard.schneeman+foo@gmail.com>
Signed-off-by: Ed Morley <501702+edmorley@users.noreply.github.com>
@edmorley
Copy link
Member Author

I've also updated the PR to:

  • s/metadata::/build_data::/ (for other language buildpacks that don't also use the data for things like cache invalidation too, so only ever write to the store and then emit it during bin/report, they might want to call the module telemetry:: or similar, but for Python that would be too narrow a name for what it's used for in this buildpack)
  • Remove ::get() since I only ever use ::get_previous() (for the current build, really Bash variables should be used to pass data around instead of the data store, so I don't think there'll ever really be a need for a ::get() function)

@edmorley edmorley changed the title Refactor buildpack metadata store and bin/report Refactor buildpack data store and bin/report Aug 21, 2025
@edmorley edmorley merged commit db9788b into main Aug 21, 2025
9 of 10 checks passed
@edmorley edmorley deleted the metadata-json branch August 21, 2025 15:05
@heroku-linguist heroku-linguist bot mentioned this pull request Aug 21, 2025
schneems added a commit to heroku/heroku-buildpack-ruby that referenced this pull request Aug 21, 2025
The python buildpack was modified to use JSON instead of YAML as it's easier to use `jq` to format and handle issues such as escaping.

This preserves the same API, but changes the format and adopts the strategy from heroku/heroku-buildpack-python#1878.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants