FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2
Closed, Resolved | Public | 8 Estimated Story Points

Description

When a request that is schema validated (i.e., its events conform to their schemata) bears the HTTP header X-Experiment-Enrollments:

  1. Verify that the header is well-formed. It should match a format like that in fullexample.vtc. If the X-Experiment-Enrollments header is present but malformed, log an error and drop the request. Otherwise, for each event...
  2. Verify that the configuration for the stream for the event includes a producers.eventgate.use_edge_uniques key with a value of true. If it doesn't, drop the event and register this as a schema validation error so it shows up in the EventGate dashboard.
  3. Verify that the event corresponds to a schema for which there is an experiment fragment (cf. the experiment YAML and its inclusion in the web schema in that diff). If the event doesn't contain an experiment fragment, drop the event and register this as a schema validation error so it shows up in the EventGate dashboard.
  4. Look at the event payload's experiment.enrolled and experiment.assigned values and confirm that there is a corresponding experiment_name=group_name assignment (so that experiment.enrolled corresponds to experiment_name and experiment.assigned corresponds to group_name) in the X-Experiment-Enrollments header. If the fields don't map correctly, drop the event and register this as a schema validation error so it shows up in the EventGate dashboard.
  5. Verify that experiment.subject_id contains a string value of awaiting. If it doesn't, drop the event and register this as a schema validation error so it shows up in the EventGate dashboard.
  6. If all of this has worked so far, change the experiment.subject_id value to that provided by X-Experiment-Enrollments for the experiment_name=group_name (i.e., the part that follows the / symbol for the given record).
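
The per-event steps (2 through 6) could be sketched roughly as follows. This is a Python illustration only, not the EventGate code: the function name, the EnrollmentValidationError class, and the shape of the already-parsed enrollments mapping are all assumptions made for the example, and step 3 is approximated by checking for an experiment object on the event itself.

```python
class EnrollmentValidationError(Exception):
    """Hypothetical stand-in for an EventGate schema validation error."""


def hoist_subject_id(event, stream_config, enrollments):
    """Apply steps 2-6 to one event; return the updated event or raise.

    `enrollments` is assumed to be the already-validated header (step 1),
    as a dict mapping experiment_name -> (group_name, subject_id).
    """
    # Step 2: the stream must opt in via producers.eventgate.use_edge_uniques.
    eventgate_cfg = stream_config.get("producers", {}).get("eventgate", {})
    if eventgate_cfg.get("use_edge_uniques") is not True:
        raise EnrollmentValidationError("stream does not enable use_edge_uniques")

    # Step 3 (approximation): the event must carry an experiment fragment.
    experiment = event.get("experiment")
    if not isinstance(experiment, dict):
        raise EnrollmentValidationError("event has no experiment fragment")

    # Step 4: enrolled/assigned must match an experiment_name=group_name record.
    record = enrollments.get(experiment.get("enrolled"))
    if record is None or record[0] != experiment.get("assigned"):
        raise EnrollmentValidationError("no matching enrollment in header")

    # Step 5: the client must have sent the placeholder value.
    if experiment.get("subject_id") != "awaiting":
        raise EnrollmentValidationError("subject_id must be 'awaiting'")

    # Step 6: replace the placeholder with the subject_id from the header.
    _group_name, subject_id = record
    return {**event, "experiment": {**experiment, "subject_id": subject_id}}
```

The sketch returns a copy rather than mutating the incoming event, which keeps the original payload available if an error event needs to be produced.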

Using the example from fullexample.vtc, we see the X-Experiment-Enrollments header as having the following sort of form:

2025-abtests2-foo-10x10=grp0/cSbN4iEngsnYMz1vEK9O6g;2025-abtests2-sel1-4x25=grp1/ckNy35nnQikEwoey8XS1Lw;2025-abtests2-sel2-4x25=grp1/ckNy35nnQikEwoey8XS1Lw;2025-abtests2-udom=grp2/yHi9O8ylLWG2qFw1FfCHww;

Notice that records are semicolon-separated, with a trailing semicolon after the last record. Each record is of the form experiment_name=group_name/subject_id.

  • experiment_name should match the regex ^[A-Za-z0-9][-_.A-Za-z0-9]{7,62}$
  • group_name values should match the regex ^[A-Za-z0-9][-_.A-Za-z0-9]{1,62}$
  • subject_id should be base64-decodable and its length prior to decoding should be at least 22 characters (it doesn't need to be decoded-and-processed, this is just to ensure the value looks correct).
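
A minimal Python sketch of the well-formedness check for the header, using the two regexes above. The function name and return shape are hypothetical. Two things here are assumptions rather than requirements stated in the list: that subject_id uses the URL-safe base64 alphabet without padding, and that the header must end with a semicolon as in the fullexample.vtc form.

```python
import re

EXPERIMENT_NAME_RE = re.compile(r"^[A-Za-z0-9][-_.A-Za-z0-9]{7,62}$")
GROUP_NAME_RE = re.compile(r"^[A-Za-z0-9][-_.A-Za-z0-9]{1,62}$")
# Assumption: URL-safe base64 alphabet, unpadded, at least 22 characters.
SUBJECT_ID_RE = re.compile(r"^[A-Za-z0-9_-]{22,}$")


def parse_enrollments(header):
    """Return {experiment_name: (group_name, subject_id)} or raise ValueError."""
    # Assumption: every record, including the last, ends with a semicolon.
    if not header.endswith(";"):
        raise ValueError("header must end with a semicolon")

    enrollments = {}
    for record in header[:-1].split(";"):
        name, equals, rest = record.partition("=")
        group, slash, subject_id = rest.partition("/")
        if not (equals and slash):
            raise ValueError(f"malformed record: {record!r}")
        if not EXPERIMENT_NAME_RE.fullmatch(name):
            raise ValueError(f"bad experiment_name: {name!r}")
        if not GROUP_NAME_RE.fullmatch(group):
            raise ValueError(f"bad group_name: {group!r}")
        if not SUBJECT_ID_RE.fullmatch(subject_id):
            raise ValueError(f"bad subject_id: {subject_id!r}")
        enrollments[name] = (group, subject_id)
    return enrollments
```

Using fullmatch rather than match avoids accepting a value with a trailing newline, which `$` alone would allow in Python.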

Part 1 acceptance criteria

  • Code
  • Tests
  • Works in beta cluster for requests sent to the existing v1 endpoint on intake-analytics.wikimedia.beta.wmflabs.org. Note that in Part 2 we'll be unsetting the X-Experiment-Enrollments header in Varnish eventually, but right now we're just trying to verify correctly functioning processing at EventGate.
  • Works in the prod cluster for requests sent to the existing v1 endpoint on intake-analytics.wikimedia.org. Note that in Part 2 we'll be unsetting the X-Experiment-Enrollments header in Varnish eventually, but right now we're just trying to verify correctly functioning processing at EventGate.

Part 2, Varnish

  • The X-Experiment-Enrollments header should be unset unconditionally before further request processing in Varnish. Varnish will then set the header with a meaningful value if the user is enrolled in one or more experiments, and will convey it in the request to EventGate only if the client-requested path was /evt-103e/v2/events?hasty=true.
  • When a request to an internet-facing endpoint is destined for a path starting with /evt-103e/v2/events, it should be forwarded to the same EventGate instance, with the start of the path replaced by /v1/events.
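
The path rewrite in the second bullet amounts to a prefix swap. The real change would live in Varnish configuration; the Python below is only an illustration of the intended mapping, with a hypothetical function name, and it does not cover the header handling from the first bullet.

```python
V2_PREFIX = "/evt-103e/v2/events"
V1_PREFIX = "/v1/events"


def rewrite_beacon_path(url):
    """Map a v2 beacon URL onto the existing v1 EventGate path.

    Anything after the prefix (such as a query string) is kept as-is;
    URLs that do not start with the v2 prefix are returned unchanged.
    """
    if url.startswith(V2_PREFIX):
        return V1_PREFIX + url[len(V2_PREFIX):]
    return url
```

For instance, rewrite_beacon_path("/evt-103e/v2/events?hasty=true") yields "/v1/events?hasty=true".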

Part 2 acceptance criteria

A separate task for an A/A experiment (and another one or two for some initial handcrafted experiments) should make it possible to set the X-Experiment-Enrollments header in Varnish manually for EventGate's processing to verify end-to-end that processing is working as expected.

Important notes

This task doesn't presume clients will have started routing experiment-associated events to the v2 endpoint. That will be handled separately in T391988: FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: Route experiment-oriented MediaWiki JavaScript-based events conditionally, which entails both server configuration and client changes.

Event Timeline

Milimetric set the point value for this task to 8. (Apr 15 2025, 2:48 PM)
dr0ptp4kt updated Other Assignee, added: BBlack.
dr0ptp4kt updated the task description.
dr0ptp4kt added subscribers: cjming, phuedx, Vgutierrez.
dr0ptp4kt renamed this task from EventGate & Varnish: update to receive events from beacon to FY 24-25 SDS 2.4.9 EventGate & Varnish: update to receive events from beacon v2. (Apr 15 2025, 9:05 PM)
dr0ptp4kt updated the task description.
dr0ptp4kt renamed this task from FY 24-25 SDS 2.4.9 EventGate & Varnish: update to receive events from beacon v2 to FY 24-25 SDS 2.4.9 CDN Synthetic Beacon: EventGate & Varnish: update to receive events from beacon event v2. (Apr 15 2025, 9:11 PM)
dr0ptp4kt updated the task description.
dr0ptp4kt updated the task description.
dr0ptp4kt updated the task description.

Re the use of the producers.eventgate.enrich_fields_from_http_headers config setting:

Are you sure you need/want to use this setting for this behavior? There are already examples of custom handling of headers to set values in specific fields, e.g. http.client_ip field.

If you can just choose a field to always use, you can avoid conflating the simple purpose of the enrich_fields_from_http_headers setting.

If you do want a setting to enable or disable this, perhaps a different one focused on this specific behavior/field would be more appropriate?

If I'm reading this correctly, now we want in the stream config:

producers:
  eventgate:
    enrich_fields_from_http_headers:
      'x-experiment-enrollments': 'x-experiment-enrollments'

EventGate checks whether this setting exists; if it does, it adds an x-experiment-enrollments field to the event and then does the processing listed on the ticket. After the processing, it removes the x-experiment-enrollments field before the event gets validated by EventGate.

I also think it would be better to have a flag in the configuration rather than overload header hoisting.

Also, reading into this ticket more: does any event that is sent with an X-Experiment-Enrollments header but doesn't have an experiment field in its schema get dropped? Are there instances where there could be an X-Experiment-Enrollments header on non-experiment events that we want to keep? It would be a no-op anyway?

^ agree. I'm not sure if dropping the event in the various cases is the best thing to do. Would it make more sense to just not set the field value in those cases, and then pass the event on through?

Validation happens after these field values are set, so you could just make the relevant event's schema require the field if you want it to be dropped. The event will be invalid if the field is not set.

Validation happens after these field values are set, so you could just make the relevant event's schema require the field if you want it to be dropped. The event will be invalid if the field is not set.

experiment field is an optional property in the base schema. We want as many events as possible using this schema, some will be produced inside experiments (and experiment will be set) and some will be produced outside of experiments (experiment usually not set, but may be set manually by developers).

subject_id should be base64-decodable and its length prior to decoding should be at least 22 characters

Isn't the length of a base64 string always supposed to be a multiple of 4? Where does the 22 come from?

experiment field is an optional property in the base schema

ah okay!

I suppose dropping the event here is okay if that is really what you want.

I'm not sure of the best place to implement that logic though. The function returned by makeSetWikimediaDefaults (3e1298dc6f408e0dfbc800b070f2f2519b1f551a) is the function that should probably augment the event, but the function returned by makeWikimediaValidate is the function that handles logic for accepting or dropping events. makeWikimediaValidate does call setWikimediaDefaults (as well as ensureEventAllowedInStream, which may drop an event).

But we don't have a single function that does both augment + drop. It'd be nice if we could augment and then drop, like happens now, buuut then there would be header handling for this in 2 different places.

I suppose the makeSetWikimediaDefaults function is the best place. Just make sure to add this ^ context/caveat in code comments please!

subject_id should be base64-decodable and its length prior to decoding should be at least 22 characters

Isn't the length of a base64 string always supposed to be a multiple of 4? Where does the 22 come from?

I think the intent is to apply sodium_base64_VARIANT_URLSAFE_NO_PADDING upon a hash output that is a uint8_t[16], although better verify - good question! @BBlack can you confirm that we may see the value without = symbol padding when it comes across to EventGate? And is 22 the correct shortest length in the initial implementation or would we possibly expect a longer length (e.g., 44 characters, which coincidentally would cleanly divide by 4) in the initial implementation?

Sorry I missed the question earlier. Even though I had edited the Description, I was missing the email-based updates because I wasn't yet a subscriber of the ticket! Funny Phabricator behavior to not get auto-subscribed :)

Validation happens after these field values are set, so you could just make the relevant event's schema require the field if you want it to be dropped. The event will be invalid if the field is not set.

I think

  8. When events are dropped register these as schema validation errors so they show up in the EventGate dashboard.

is the catch-all here. Maybe the description could be tweaked to make it explicit what's going to happen at each step? You're right though, as the majority of cases in the description (all except (1)) are due to misconfiguration on the experiment implementer's side, which will end up with well-formed-but-invalid events arriving at EventGate.

This leads me to:

  1. Verify that the header is well-formed. It should match a format like that in fullexample.vtc. If the X-Experiment-Enrollments header is present but malformed, drop the request.

when combined with (8) (see above) is a little dangerous. (1) occurs when something is going fundamentally wrong with either Varnish or during the transmission of the request to EventGate. This isn't bad data from the instrument or Metrics/Event Platform Client. It shouldn't be modelled as such. Maybe we should log an error (not a validation error) and stop processing the request?

@tchin I modified the Description to capture our Meet where we discussed not using the header enrichment, but rather an explicit boolean value for the stream to indicate its intent about use of edge uniques capabilities. See diff ^ in Show Details.

I also modified the numbered list elements for what @phuedx suggested in T391959#10781335.

Change #1140784 had a related patch set uploaded (by Dr0ptp4kt; author: Dr0ptp4kt):

[operations/mediawiki-config@master] Stream config for edge uniques on beta cluster

https://gerrit.wikimedia.org/r/1140784

Change #1140784 merged by jenkins-bot:

[operations/mediawiki-config@master] Stream config for edge uniques on beta cluster

https://gerrit.wikimedia.org/r/1140784

Maybe we should log an error (not a validation error) and stop processing the request?

FWIW, if you want to emit error events for errors other than ValidationErrors, so that they are captured in the data lake (and not just as a metric or error log line), you can modify the makeMapToErrorEvent function. This function currently returns early if the raised error is not a ValidationError, and eventgate-wikimedia only has built-in configuration to set the validation error stream name, but this could be changed to support more streams, or to make the 'error stream' support multiple kinds of errors. (If we did that, we might want to rename the existing error_stream config and the actual configured EventGate stream names.)

subject_id should be base64-decodable and its length prior to decoding should be at least 22 characters

Isn't the length of a base64 string always supposed to be a multiple of 4? Where does the 22 come from?

I think the intent is to apply sodium_base64_VARIANT_URLSAFE_NO_PADDING upon a hash output that is a uint8_t[16], although better verify - good question! @BBlack can you confirm that we may see the value without = symbol padding when it comes across to EventGate? And is 22 the correct shortest length in the initial implementation or would we possibly expect a longer length (e.g., 44 characters, which coincidentally would cleanly divide by 4) in the initial implementation?

To confirm: the subject_id field, as encoded, will always be exactly 22 bytes long, unless future design changes are made, which would require planning and coordination.

The derived subject identifier is a 128-bit binary number in raw form, which stores as 16 bytes of binary data. libsodium is used to encode that number in base64url form without padding, which comes out as 22 bytes of ASCII. Refs on that scheme: enwiki b64 variants table, enwiki b64url, b64 decoding without padding.

Those 22 bytes of ASCII always unambiguously decode as 16 bytes of binary data. If padding were used, there would be two trailing padding bytes added as the fixed string ==. If it's helpful for your implementation's expectations, you can always append the trailing == to what you receive, making it a multiple of 4 bytes, prior to decoding it. The result would be the same 16 bytes of binary data either way.
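
As a small illustration of the arithmetic above, here is a Python sketch using only the standard library (not libsodium); the helper names are made up for the example. It shows that 16 raw bytes encode to 22 unpadded base64url characters, and that restoring the padding before decoding gives back the same 16 bytes.

```python
import base64


def encode_subject_id(raw):
    """Encode raw bytes as base64url and strip the trailing '=' padding."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def decode_subject_id(encoded):
    """Restore padding to a multiple of 4 characters, then decode."""
    padding = "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(encoded + padding)


raw = bytes(range(16))            # any 128-bit value will do for the demo
encoded = encode_subject_id(raw)
assert len(encoded) == 22         # 16 bytes -> 22 characters without padding
assert decode_subject_id(encoded) == raw
```

With padding, the same value would be 24 characters long and end in ==, which is where the multiple-of-4 expectation comes from.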

dr0ptp4kt merged https://gitlab.wikimedia.org/repos/data-engineering/eventgate-wikimedia/-/merge_requests/13

Validate and hoist specified experiment from x-experiment-enrollments header

Reviewed and merged the EventGate patch. And it appears to be working as expected in a manual deployment on the beta cluster. I tried several permutations to exercise the failure cases (e.g., experiment name less than seven characters, missing semicolon termination in header, too long subject_id, illegal character in subject_id, mismatching experiment name / group assignment, using a string other than awaiting as the placeholder) and that appeared to work as expected. Correct-looking events also worked:

$ cat curler.sh
curl -i -s -H "X-Experiment-Enrollments: gastropub=proteinchips/1234567890abcdefghiJKz;" -H "Content-Type: application/json"  -X POST https://intake-analytics.wikimedia.beta.wmflabs.org/v1/events -d @product_metrics.web_base.experiment_enrollment_handling.dev0
$ cat product_metrics.web_base.experiment_enrollment_handling.dev0
{
  "$schema": "/analytics/product_metrics/web/base/1.4.2",
  "dt": "2025-10-10T11:12:11.111Z",
  "meta": {
    "stream": "product_metrics.web_base.experiment_enrollment_handling.dev0"
  },
  "action": "clicked",
  "experiment": {
    "enrolled": "gastropub",
    "assigned": "proteinchips",
    "subject_id": "awaiting",
    "sampling_unit": "edge-unique",
    "coordinator": "forced",
    "other_assigned":
      {
        "ice_cream_cone": "yes"
      }
  },
  "agent" : {
    "client_platform": "mediawiki_js",
    "client_platform_family": "desktop_browser",
    "release_status": "dev"
  }
}
$ bash curler.sh | grep -E 'HTTP|content-length'
HTTP/2 201 
content-length: 0

[Attached screenshot: Screenshot 2025-05-08 at 9.19.14 PM.png]

Nice work @tchin!

Next we'll want to get a production configuration entry into EventStreamConfig and get this onto EventGate in production...for maybe Monday?

URL routing stuff in Varnish (Part 2 in the Description) in the beta cluster should now be okay to tackle. @Vgutierrez is working on material for this in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143474 .

Change #1143772 had a related patch set uploaded (by Dr0ptp4kt; author: Dr0ptp4kt):

[operations/mediawiki-config@master] Stream config for edge uniques on prod cluster

https://gerrit.wikimedia.org/r/1143772

Change #1143772 merged by jenkins-bot:

[operations/mediawiki-config@master] Stream config for edge uniques on prod cluster

https://gerrit.wikimedia.org/r/1143772

Mentioned in SAL (#wikimedia-operations) [2025-05-12T20:11:44Z] <dr0ptp4kt@deploy1003> Started scap sync-world: Backport for [[gerrit:1143772|Stream config for edge uniques on prod cluster (T391959)]]

Mentioned in SAL (#wikimedia-operations) [2025-05-12T20:16:28Z] <dr0ptp4kt@deploy1003> dr0ptp4kt: Backport for [[gerrit:1143772|Stream config for edge uniques on prod cluster (T391959)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-05-12T20:30:38Z] <dr0ptp4kt@deploy1003> Finished scap sync-world: Backport for [[gerrit:1143772|Stream config for edge uniques on prod cluster (T391959)]] (duration: 18m 53s)

The Event Stream Config change for production appears to be working in production as expected (nb. the "awaiting" placeholder isn't yet replaced, that will require the production deployment of EventGate Wikimedia with the hoisting capability).

We looked at eventgate-wikimedia!15 in the beta cluster and although the admission of requests and streaming of events is working as expected, the validation error logging behavior is being investigated a little more.

Anyway, from prod:

stat1010:~$ kafkacat -C -b kafka-jumbo1007.eqiad.wmnet:9092 -t eqiad.product_metrics.web_base.experiment_enrollment_handling.dev1
{"$schema":"/analytics/product_metrics/web/base/1.4.2","dt":"2025-10-10T11:12:11.111Z","meta":{"stream":"product_metrics.web_base.experiment_enrollment_handling.dev1","id":"4f71056f-756a-43f2-b748-e214574e1dac","dt":"2025-05-12T20:32:07.317Z","request_id":"5b083fdc-a0ae-43d8-8dd1-8634bb81435a"},"action":"clicked","experiment":{"enrolled":"gastropub","assigned":"proteinchips","subject_id":"awaiting","sampling_unit":"edge-unique","coordinator":"forced","other_assigned":{"ice_cream_cone":"yes"}},"agent":{"client_platform":"mediawiki_js","client_platform_family":"desktop_browser","release_status":"dev"}}

That was from this:

$ cat curler.bash
curl -i -s -H "X-Experiment-Enrollments: gastropub=proteinchips/1234567890abcdefghiJKz;" -H "Content-Type: application/json"  -X POST https://intake-analytics.wikimedia.org/v1/events -d @product_metrics.web_base.experiment_enrollment_handling.dev1
$ cat dev1.json
{
  "$schema": "/analytics/product_metrics/web/base/1.4.2",
  "dt": "2025-10-10T11:12:11.111Z",
  "meta": {
    "stream": "product_metrics.web_base.experiment_enrollment_handling.dev1"
  },
  "action": "clicked",
  "experiment": {
    "enrolled": "gastropub",
    "assigned": "proteinchips",
    "subject_id": "awaiting",
    "sampling_unit": "edge-unique",
    "coordinator": "forced",
    "other_assigned":
      {
        "ice_cream_cone": "yes"
      }
  },
  "agent" : {
    "client_platform": "mediawiki_js",
    "client_platform_family": "desktop_browser",
    "release_status": "dev"
  }
}

Change #1146038 had a related patch set uploaded (by TChin; author: TChin):

[operations/deployment-charts@master] [eventgate-analytics-external] bump version v1.12.0

https://gerrit.wikimedia.org/r/1146038

Change #1146038 merged by jenkins-bot:

[operations/deployment-charts@master] [eventgate-analytics-external] bump version v1.12.0

https://gerrit.wikimedia.org/r/1146038

Change #1146695 had a related patch set uploaded (by TChin; author: TChin):

[operations/deployment-charts@master] [eventgate-analytics-external] bump version v1.13.0

https://gerrit.wikimedia.org/r/1146695

Change #1146695 merged by jenkins-bot:

[operations/deployment-charts@master] [eventgate-analytics-external] bump version v1.13.0

https://gerrit.wikimedia.org/r/1146695

An updated version of EventGate was deployed to production (and before that, beta cluster). The hoisting works, and additionally now the errors are making their way into the error stream per kafkacat.

Success case:

$ cat curler.bash
curl -i -s -H "X-Experiment-Enrollments: gastropub=proteinchips/1234567890abcdefghiJKz;" -H "Content-Type: application/json"  -X POST https://intake-analytics.wikimedia.org/v1/events -d @dev1.json
$ cat dev1.json
{
  "$schema": "/analytics/product_metrics/web/base/1.4.2",
  "dt": "2025-10-10T11:12:11.111Z",
  "meta": {
    "stream": "product_metrics.web_base.experiment_enrollment_handling.dev1"
  },
  "action": "clicked",
  "experiment": {
    "enrolled": "gastropub",
    "assigned": "proteinchips",
    "subject_id": "awaiting",
    "sampling_unit": "edge-unique",
    "coordinator": "forced",
    "other_assigned":
      {
        "ice_cream_cone": "yes"
      }
  },
  "agent" : {
    "client_platform": "mediawiki_js",
    "client_platform_family": "desktop_browser",
    "release_status": "dev"
  }
}

$ bash curler.bash | grep -E 'HTTP|content-length'
HTTP/2 201 
content-length: 0

Error case:

$ cat curler.bad.bash
curl -i -s -H "X-Experiment-Enrollments: gastropub=proteinchips/1234567890abcdefghiJKz;" -H "Content-Type: application/json"  -X POST https://intake-analytics.wikimedia.org/v1/events -d @dev1.bad.json
$ cat dev1.bad.json
{
  "$schema": "/analytics/product_metrics/web/base/1.4.2",
  "dt": "2025-10-10T11:12:11.111Z",
  "meta": {
    "stream": "product_metrics.web_base.experiment_enrollment_handling.dev1"
  },
  "action": "clicked",
  "experiment": {
    "enrolled": "gastropub",
    "assigned": "proteinchips",
    "subject_id": "awaiting!",
    "sampling_unit": "edge-unique",
    "coordinator": "forced",
    "other_assigned":
      {
        "ice_cream_cone": "yes"
      }
  },
  "agent" : {
    "client_platform": "mediawiki_js",
    "client_platform_family": "desktop_browser",
    "release_status": "dev"
  }
}
$ bash curler.bad.bash
HTTP/2 400 

...

{"invalid":[{"status":"invalid","event":{"$schema":"/analytics/product_metrics/web/base/1.4.2","dt":"2025-10-10T11:12:11.111Z","meta":{"stream":"product_metrics.web_base.experiment_enrollment_handling.dev1","id":"7cec62d7-2c5f-46fc-a86f-19f6d2916e8d","dt":"2025-05-15T19:56:44.938Z","request_id":"a6ca65e1-2895-4fb7-a3dd-bb138a223f93"},"action":"clicked","experiment":{"enrolled":"gastropub","assigned":"proteinchips","subject_id":"awaiting!","sampling_unit":"edge-unique","coordinator":"forced","other_assigned":{"ice_cream_cone":"yes"}},"agent":{"client_platform":"mediawiki_js","client_platform_family":"desktop_browser","release_status":"dev"}},"context":{"errors":[{"dataPath":".experiment.subject_id","message":"subject_id must be `awaiting` for header to be hoisted"}],"errorsText":"'.experiment.subject_id' subject_id must be `awaiting` for header to be hoisted"}}]}


stat1010:~$ kafkacat -C -b kafka-jumbo1007.eqiad.wmnet:9092 -t eqiad.eventgate-analytics-external.error.validation | grep enrollment_handling

{"meta":{"id":"6f28aa4d-6de3-498a-af84-5102792fd3ce","dt":"2025-05-15T19:56:44.940Z","uri":"unknown","domain":"unknown","request_id":"a6ca65e1-2895-4fb7-a3dd-bb138a223f93","stream":"eventgate-analytics-external.error.validation"},"emitter_id":"eventgate-production","raw_event":"{\"$schema\":\"/analytics/product_metrics/web/base/1.4.2\",\"dt\":\"2025-10-10T11:12:11.111Z\",\"meta\":{\"stream\":\"product_metrics.web_base.experiment_enrollment_handling.dev1\",\"id\":\"7cec62d7-2c5f-46fc-a86f-19f6d2916e8d\",\"dt\":\"2025-05-15T19:56:44.938Z\",\"request_id\":\"a6ca65e1-2895-4fb7-a3dd-bb138a223f93\"},\"action\":\"clicked\",\"experiment\":{\"enrolled\":\"gastropub\",\"assigned\":\"proteinchips\",\"subject_id\":\"awaiting!\",\"sampling_unit\":\"edge-unique\",\"coordinator\":\"forced\",\"other_assigned\":{\"ice_cream_cone\":\"yes\"}},\"agent\":{\"client_platform\":\"mediawiki_js\",\"client_platform_family\":\"desktop_browser\",\"release_status\":\"dev\"}}","message":"'.experiment.subject_id' subject_id must be `awaiting` for header to be hoisted","$schema":"/error/1.0.0","errored_schema_uri":"/analytics/product_metrics/web/base/1.4.2","errored_stream_name":"product_metrics.web_base.experiment_enrollment_handling.dev1"}

Change #1155298 had a related patch set uploaded (by TChin; author: TChin):

[operations/deployment-charts@master] [eventgate-analytics-external] bump version v1.14.0

https://gerrit.wikimedia.org/r/1155298

Change #1155298 merged by jenkins-bot:

[operations/deployment-charts@master] [eventgate-analytics-external] bump version v1.14.0

https://gerrit.wikimedia.org/r/1155298