[#159] Fix non-deterministic ID assignment #195
Conversation
Codecov Report
@@ Coverage Diff @@
## master #195 +/- ##
=========================================
+ Coverage 86.78% 88.29% +1.5%
=========================================
Files 23 23
Lines 757 743 -14
Branches 59 59
=========================================
- Hits 657 656 -1
+ Misses 100 87 -13
Continue to review full report at Codecov.
    .persist(StorageLevel.MEMORY_AND_DISK)
vertices.select(col(ID), nestAsCol(vertices, ATTR))
  .join(withLongIds, ID)
  .select(LONG_ID, ID, ATTR)
I wonder if it is worth the extra effort to optimize for the newer Spark releases. If monotonically_increasing_id being non-unique is a problem only in earlier versions, perhaps we should create a wrapper for it instead of penalizing all versions with an extra repartition?
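For the sake of discussion, a rough sketch of what such a wrapper could look like. Everything here is hypothetical: `withUniqueLongId` and the `isNewSpark` flag are placeholder names, and the repartition branch only stands in for whatever extra shuffling the older versions would need.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonically_increasing_id

object IdAssignment {
  // Hypothetical wrapper: only pay the extra repartition on older releases
  // where monotonically_increasing_id was suspected to misbehave.
  // `isNewSpark` stands in for a real version check (e.g. isLaterVersion("2.0")).
  def withUniqueLongId(df: DataFrame, longIdCol: String, isNewSpark: Boolean): DataFrame = {
    val prepared =
      if (isNewSpark) df
      else df.repartition(df.rdd.getNumPartitions) // force a fresh, known partitioning
    prepared.withColumn(longIdCol, monotonically_increasing_id())
  }
}
```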
The issue is not monotonically_increasing_id itself but the input DataFrame: its record ordering is not deterministic, so even with a correct monotonically_increasing_id implementation we would not get a correct result.
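To make that distinction concrete, here is a minimal, hypothetical sketch of a data-derived assignment: instead of taking IDs from whatever order the records happen to arrive in, sort by the vertex ID first so the same input always produces the same mapping. `deterministicLongIds`, `idCol`, and the `long_id` column name are illustrative, not the code in this PR.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def deterministicLongIds(vertices: DataFrame, idCol: String): DataFrame = {
  val spark = vertices.sparkSession
  // Sorting by the vertex ID makes the ordering a function of the data,
  // not of partitioning or arrival order.
  val sorted = vertices.select(idCol).distinct().sort(idCol)
  // zipWithIndex preserves the sorted order, so recomputing the same input
  // yields the same (id -> long_id) mapping every time.
  val withLongIds = sorted.rdd
    .zipWithIndex()
    .map { case (row, longId) => Row(row.get(0), longId) }
  val schema = StructType(
    sorted.schema.fields :+ StructField("long_id", LongType, nullable = false))
  spark.createDataFrame(withLongIds, schema)
}
```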
assertComponents(components0, expected)
assert(!isFromCheckpoint(components0),
  "The result shouldn't depend on checkpoint data if checkpointing is disabled.")
if (isLaterVersion("2.0")) {
Could you leave an inline comment explaining why we skipped this for 1.6?
We ran benchmarks on a Spark cluster for Connected Components and PageRank (one iteration), comparing the current development version (with PR-195) against the previous release version.
@phi-dbq Thanks for running performance tests. Now I'm more comfortable with the trade-off: seconds vs. correctness.
This is a follow-up task from [#189].
It removes SQLHelpers.zipWithUniqueId, which is no longer needed.
We will run scalability tests and address any issues they surface.
In the end, we will make a bug-fix release based on the changes in this PR.
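For readers unfamiliar with the removed helper, below is a small stand-alone sketch of the underlying Spark RDD API that SQLHelpers.zipWithUniqueId presumably wrapped (the helper's exact signature is not shown in this PR). RDD.zipWithUniqueId gives the items of partition k the ids k, n+k, 2n+k, ... where n is the number of partitions: the ids are unique, but like monotonically_increasing_id they depend on how records land in partitions, which is why a non-deterministic input ordering makes them unstable.

```scala
import org.apache.spark.sql.SparkSession

object ZipWithUniqueIdDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("zipWithUniqueId-demo")
      .getOrCreate()

    // Two partitions: partition 0 holds (a, b), partition 1 holds (c, d).
    val ids = spark.sparkContext
      .parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
      .zipWithUniqueId()

    // Prints (a,0), (b,2), (c,1), (d,3): unique ids, but tied to the partition layout.
    ids.collect().foreach(println)

    spark.stop()
  }
}
```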