feat(data-modeling): automatically infer relationships #7220

Draft
wants to merge 1 commit into
base: main
from

Conversation

gribnoysup
Collaborator

@gribnoysup gribnoysup commented Aug 21, 2025

This is a POC that implements two algorithms for automatically inferring relationships across a set of namespaces, using the fairly straightforward heuristics described in the architecture design doc for the data modeling project.

  • The first algorithm goes from foreign to local (I'm mostly including it for posterity and will probably drop it before this lands in main): we start from the _id field in a collection and then search through all the fields with matching schemas in other collections. This one is pretty greedy and might be especially expensive if _id is of some very generic type like a string or an int. There is also a bigger chance that an _id value from the foreign collection is missing from the local one (for example, you have users and comments, and a user hasn't left any comments yet).
  • The second algorithm goes from local to foreign: we look at indexes and at any field that is an ObjectId, assuming that if an index exists or an ObjectId was used, the field is a foreign key in the local collection. This algorithm might return fewer results if indexes don't exist or the _id field is of a custom type, but it is less expensive to execute because the find operation always runs against _id. Also, because we start with foreign keys from the local collection, there is a much higher chance we will find them in a foreign collection compared to the first algorithm.
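A minimal sketch of the local-to-foreign idea, not the code in this PR. The `CollectionInfo` shape, the `candidateForeignKeys` and `inferLocalToForeign` names, and the `IdLookup` callback (a stand-in for a find against `_id` with `$in`) are all hypothetical. Skipping self-references is also an assumption made to keep the example small.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR.
type FieldInfo = { path: string; bsonType: string };

type CollectionInfo = {
  ns: string;
  fields: FieldInfo[];
  // Paths covered by at least one index.
  indexedPaths: string[];
  // Sampled values keyed by field path.
  samples: Record<string, unknown[]>;
};

type Relationship = { localNs: string; localField: string; foreignNs: string };

// Stand-in for a lookup like `find(ns, { _id: { $in: values } })` that
// resolves to true when at least one document matches.
type IdLookup = (ns: string, values: unknown[]) => Promise<boolean>;

// Candidate foreign keys: any non-_id field that is indexed or is an ObjectId.
function candidateForeignKeys(local: CollectionInfo): string[] {
  const indexed = new Set(local.indexedPaths);
  return local.fields
    .filter(
      (f) =>
        f.path !== '_id' && (indexed.has(f.path) || f.bsonType === 'objectId')
    )
    .map((f) => f.path);
}

// For every candidate field, check whether its sampled values show up as _id
// in any other collection.
async function inferLocalToForeign(
  collections: CollectionInfo[],
  hasMatchingId: IdLookup
): Promise<Relationship[]> {
  const found: Relationship[] = [];
  for (const local of collections) {
    for (const field of candidateForeignKeys(local)) {
      const values = local.samples[field] ?? [];
      if (values.length === 0) continue; // nothing to look up
      for (const foreign of collections) {
        if (foreign.ns === local.ns) continue; // assumption: skip self-references
        if (await hasMatchingId(foreign.ns, values)) {
          found.push({
            localNs: local.ns,
            localField: field,
            foreignNs: foreign.ns,
          });
        }
      }
    }
  }
  return found;
}
```

Because every lookup targets `_id`, each query can use the default `_id` index, which is where the lower execution cost of this direction comes from.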

WIP, but it's already working, and I'll continue adding unit tests for the smaller functions that are shared across both methods while we're scoping the work.

To make this testable, I added some views and indexes to sample_airbnb and sample_mflix on our test cluster (Cluster 0), so you can try it out there.

@github-actions github-actions bot added the feat label Aug 21, 2025
@gribnoysup gribnoysup force-pushed the automatically-infer-relationships branch from 9b16160 to 2f3f9cc Compare August 21, 2025 13:36
const [matchingDoc] = await dataService
  .find(
    foreignNamespace,
    { _id: { $in: sampleValues as any } },
Collaborator Author

We should probably take the foreign namespace's _id type into account: if it is completely mismatched with what we think are the foreign keys we found in the local collection, we can skip fetching altogether.
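One possible shape for that pre-check, as a sketch only. The `shouldQueryForeignIds` helper is hypothetical, and it assumes that the type names observed for the local candidate field and for the foreign `_id` are already available as plain strings (for example, from an earlier schema analysis step).

```typescript
// Hypothetical helper: returns true only when the local candidate field and
// the foreign _id share at least one observed type, so that running the
// `_id: { $in: ... }` lookup can possibly match anything.
function shouldQueryForeignIds(
  localFieldTypes: readonly string[],
  foreignIdTypes: readonly string[]
): boolean {
  const foreign = new Set(foreignIdTypes);
  return localFieldTypes.some((type) => foreign.has(type));
}
```

A caller would evaluate this before issuing the find and move on to the next foreign namespace when it returns false, which saves a round trip for pairs that cannot match.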

@gribnoysup gribnoysup force-pushed the automatically-infer-relationships branch from 2f3f9cc to 4234d87 Compare August 21, 2025 13:49