
[#159] Fix non-deterministic ID assignment #195


Merged: 6 commits into graphframes:master on May 11, 2017

Conversation

@phi-dbq (Contributor) commented May 11, 2017

This is a follow-up task from [#189].
It removes SQLHelpers.zipWithUniqueId, which is no longer needed.
We will run scalability tests and address any issues they surface.
Finally, we will cut a bug-fix release based on the changes in this PR.
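
For context, here is a minimal sketch (ours, not the PR's actual code) of the kind of deterministic ID assignment this change is after: impose a canonical order on the vertices before generating indices, so the mapping from user-supplied IDs to long IDs is reproducible across runs. The helper name and the assumption of string vertex IDs in a column named "id" are hypothetical.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not the PR's implementation): assign long IDs
// deterministically by fixing a canonical row order first.
def withStableLongIds(vertices: DataFrame): DataFrame = {
  val spark = vertices.sparkSession
  import spark.implicits._
  vertices
    .select("id")
    .sort("id")        // canonical order => reproducible indices
    .rdd
    .map(_.getString(0))
    .zipWithIndex()    // index now depends only on the sorted order
    .toDF("id", "long_id")
}
```

The sort adds a shuffle, but in exchange the same input always produces the same (id, long_id) pairs, which is the property the non-deterministic approach lacked.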

@codecov-io commented May 11, 2017

Codecov Report

Merging #195 into master will increase coverage by 1.5%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master     #195     +/-   ##
=========================================
+ Coverage   86.78%   88.29%   +1.5%     
=========================================
  Files          23       23             
  Lines         757      743     -14     
  Branches       59       59             
=========================================
- Hits          657      656      -1     
+ Misses        100       87     -13
| Impacted Files | Coverage Δ |
|---|---|
| ...in/spark-2.0/org/apache/spark/sql/SQLHelpers.scala | 0% <ø> (ø) ⬆️ |
| ...in/spark-2.x/org/apache/spark/sql/SQLHelpers.scala | 100% <ø> (ø) ⬆️ |
| ...in/spark-1.x/org/apache/spark/sql/SQLHelpers.scala | 0% <ø> (ø) ⬆️ |
| src/main/scala/org/graphframes/GraphFrame.scala | 86.69% <100%> (+0.4%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8c3df6a...28bad51.

@mengxr self-assigned this May 11, 2017
.persist(StorageLevel.MEMORY_AND_DISK)
vertices.select(col(ID), nestAsCol(vertices, ATTR))
  .join(withLongIds, ID)
  .select(LONG_ID, ID, ATTR)
A reviewer (Member) commented:

I wonder if it is worth the extra effort to optimize for the newer Spark releases. If monotonically_increasing_id is non-unique only in earlier versions, perhaps we should create a wrapper for it instead of penalizing all versions with an extra repartition?

A Contributor replied:

The issue is not monotonically_increasing_id but the input DataFrame: the record ordering in the input is not deterministic, so even with a correct monotonically_increasing_id implementation we won't get correct results.
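
A toy illustration of the point (our example, not from the PR): monotonically_increasing_id encodes the partition index in the upper bits and the row's position within its partition in the lower bits, so whenever the upstream row-to-partition assignment can vary between runs, so can the IDs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().master("local[4]").getOrCreate()
import spark.implicits._

// The generated ID is roughly (partitionId << 33) + rowPositionInPartition.
// It is unique within a run, but if a re-run places rows in different
// partitions or positions, the same record gets a different ID.
val df = Seq("a", "b", "c", "d").toDF("id").repartition(4)
df.withColumn("long_id", monotonically_increasing_id()).show()
```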

assertComponents(components0, expected)
assert(!isFromCheckpoint(components0),
"The result shouldn't depend on checkpoint data if checkpointing is disabled.")
if (isLaterVersion("2.0")) {
A reviewer (Contributor) commented:

Could you leave an inline comment explaining why we skipped this for 1.6?

@mengxr changed the title from [WIP][#159] Fix non-deterministic ID assignment to [#159] Fix non-deterministic ID assignment May 11, 2017
@mengxr added the lgtm label May 11, 2017
@mengxr merged commit d437797 into graphframes:master May 11, 2017
@phi-dbq (Contributor, Author) commented May 16, 2017

We used a Spark cluster with 8 workers (244.0 GB memory, 32 cores).
To measure the end-to-end elapsed time of only the function of interest,
we clipped the RDD lineage graph using checkpoints.
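
A sketch of that methodology (the harness name and checkpoint directory are ours, not from the PR): eagerly checkpoint the input so the timed section excludes upstream transformations, then time only the call of interest. Assumes Spark 2.1+ for Dataset.checkpoint.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical timing harness: clip lineage, then time f in isolation.
def timeClipped[T](input: DataFrame)(f: DataFrame => T): (T, Long) = {
  input.sparkSession.sparkContext.setCheckpointDir("/tmp/checkpoints")
  // checkpoint() is eager by default: it materializes the DataFrame
  // and truncates its lineage before the clock starts.
  val clipped = input.checkpoint()
  val start = System.nanoTime()
  val result = f(clipped)
  (result, (System.nanoTime() - start) / 1000000L) // elapsed ms
}
```

Since Spark is lazy, the function passed in should end with an action (e.g. count) so the timed span covers actual execution.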

Connected Components

| type | #V | #E | time (dev) [str-key] | time (v0.40) [str-key] | comments |
|---|---|---|---|---|---|
| grid | 40200 | 159200 | 2 mins 24 secs 698 msecs | 2 mins 16 secs 438 msecs | |
| grid | 402000 | 15992000 | 5 mins 23 secs 734 msecs | 4 mins 36 secs 993 msecs | |
| chain | 80001 | 80000 | 2 mins 159 msecs | 1 min 31 secs 993 msecs | 8 stars |
| chain | 800001 | 800000 | 4 mins 41 secs 947 msecs | 4 mins 30 secs 394 msecs | 80 stars |
| star | 20001 | 20000 | 1 min 5 secs 287 msecs | 53 secs 90 msecs | |
| star | 200001 | 200000 | 1 min 14 secs 654 msecs | 1 min 21 secs 196 msecs | |

PageRank One Iteration

For the current development version (with PR #195):

| type | #V | #E | time (dev) [str-key] | time (dev) [int-key] | comments |
|---|---|---|---|---|---|
| grid | 40200 | 159200 | 26 secs 937 msecs | 7 secs 998 msecs | |
| grid | 4002000 | 15992000 | 58 secs 427 msecs | 14 secs 98 msecs | |
| chain | 80001 | 80000 | 34 secs 188 msecs | 22 secs 324 msecs | 8 stars |
| chain | 800001 | 800000 | 2 mins 15 secs 725 msecs | 3 mins 6 secs 179 msecs | 80 stars |
| star | 20001 | 20000 | 20 secs 415 msecs | 6 secs 255 msecs | |
| star | 200001 | 200000 | 25 secs 120 msecs | 8 secs 289 msecs | |

For the previous release version v0.40:

| type | #V | #E | time (v0.40) [str-key] | time (v0.40) [int-key] | comments |
|---|---|---|---|---|---|
| grid | 40200 | 159200 | 23 secs 891 msecs | 8 secs 897 msecs | |
| grid | 4002000 | 15992000 | 50 secs 240 msecs | 15 secs 661 msecs | |
| chain | 80001 | 80000 | 35 secs 270 msecs | 21 secs 796 msecs | 8 stars |
| chain | 800001 | 800000 | 2 mins 18 secs 476 msecs | 2 mins 32 secs 584 msecs | 80 stars |
| star | 20001 | 20000 | 20 secs 715 msecs | 7 secs 948 msecs | |
| star | 200001 | 200000 | 22 secs 766 msecs | 7 secs 683 msecs | |

@mengxr (Contributor) commented May 16, 2017

@phi-dbq Thanks for running the performance tests. Now I'm more comfortable with the trade-off: seconds vs. correctness.
