refactor: replace manual Parquet checkpointing with DataFrame.checkpoint() #594

Open
wants to merge 9 commits into master
Conversation

ericsun95

What changes were proposed in this pull request?

refactor: replace manual Parquet checkpointing with DataFrame.checkpoint() #593

  • Simplify checkpoint logic in Connected Components using Spark DataFrame API
  • Use built-in DataFrame checkpointing to replace custom checkpoint workaround

Why are the changes needed?

  • Fix potential consistency issues in Connected Components checkpointing on S3
  • Avoid manual Parquet I/O for checkpointing in CC algorithm
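For illustration, a minimal sketch of the switch; the names `ee`, `spark`, and `checkpointPath` follow the diff but are assumptions here:

```scala
// Before (manual workaround, sketched): round-trip the iteration state
// through Parquet to truncate the lineage.
ee.write.mode("overwrite").parquet(checkpointPath)
ee = spark.read.parquet(checkpointPath)

// After: let Spark persist the plan to the configured checkpoint dir and
// truncate the lineage in one call.
ee = ee.checkpoint(eager = true)
```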

@SauronShepherd
Contributor

SauronShepherd commented Apr 23, 2025 via email

@ericsun95
Author

ericsun95 commented Apr 23, 2025

I deleted the log4j change; it was a mistaken commit.

@ericsun95
Author

But the test suite has to be changed, since Spark's checkpoint behaves differently than saving to Parquet directly; otherwise it won't work.

@ericsun95
Author

I'm not sure about these changes. Why are you changing log4j and the test suite?

Besides, we had in mind to create a centralized checkpointing mechanism.

https://blog.devgenius.io/apache-spark-wtf-welcome-to-hell-83aa677156e5

We could go ahead with this PR, but it should be replaced sooner or later.


Are you sharing something else? The linked blog doesn't mention anything about checkpoint.

@codecov-commenter


Codecov Report

Attention: Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 89.47%. Comparing base (bc487ef) to head (e3bc8b4).
Report is 28 commits behind head on master.

Files with missing lines | Patch % | Lines
...cala/org/graphframes/lib/ConnectedComponents.scala | 83.33% | 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master     #594      +/-   ##
==========================================
- Coverage   91.43%   89.47%   -1.97%     
==========================================
  Files          18       20       +2     
  Lines         829     1026     +197     
  Branches       52      126      +74     
==========================================
+ Hits          758      918     +160     
- Misses         71      108      +37     


@SemyonSinchenko SemyonSinchenko left a comment (Collaborator)

@ericsun95 Thanks for the contribution! It LGTM overall, but I left a couple of comments related to setting the checkpoint dir.

// remove previous checkpoint
// enable checkpointing if not yet done
if (spark.sparkContext.getCheckpointDir.isEmpty) {
  spark.sparkContext.setCheckpointDir(checkpointDir.get)
Collaborator:

I think we should return it to the initial state after convergence. So, if it was empty before, it should be empty after.

Author:

Yeah, good callout, will do.
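One way to sketch that restore (names are hypothetical; note that the Scala SparkContext API has no public setter to unset a checkpoint dir once set, so the sketch only sets one when none existed):

```scala
// Set the checkpoint dir only if the user hasn't configured one; remember
// whether we did, so we can reason about the state after convergence.
val hadCheckpointDir = spark.sparkContext.getCheckpointDir.isDefined
if (!hadCheckpointDir) {
  spark.sparkContext.setCheckpointDir(checkpointDir.get)
}
// ... iterate until convergence ...
// There is no public API to clear the dir afterwards, so "returning to the
// initial state" would mean either documenting the leftover setting or
// cleaning up the directory contents this run created.
```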

Collaborator:

I think this will have multitenancy issues. What if more than one user is working in a single Spark cluster?

Or even in a single-tenant world, imagine there are two pipelines feeding a binary node, both leveraging the checkpoint dir.

Collaborator:

@james-willis SparkContext exists per session, so if there are multiple users using one Spark cluster (for example, via YARN), all of them will have their own SparkContext. For the case of two pipelines: again, both should have their own SparkContext.

For me, returning this var to its initial state is important.

Collaborator:

When I say pipeline I mean two children of a binary node in one query. Anyway, I want to copy my other comment here:

In my opinion we should just let Spark manage the checkpoint location if we want to switch to using the DataFrame checkpoint method. Users can use spark.cleaner.referenceTracking.cleanCheckpoints if they think it's important to GC the checkpoint dir.

Author:

I am fine either way. But one thing I would like to discuss here is whether we need to delete the checkpoint files within iterations.
The previous behavior was saving to S3 as Parquet and cleaning the previous checkpoint files after a certain number of iterations.
However, with Spark's built-in checkpoint method, it is hard for us to know exactly which files to delete after a given iteration under a fixed parent checkpointDir. So if we don't manage it at the parent-folder level, we would expect checkpoint files to accumulate continuously.

The user doesn't have access to the internal loop, which can be a pain if they have limited disk available.

Any ideas on this? I think either way we need to deprecate the existing approach and find a way for users to manage the resources within loops.

Collaborator:

spark.cleaner.referenceTracking.cleanCheckpoints manages this.
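For reference, the cleaner-based setup looks roughly like this (the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("connected-components")
  // When a checkpointed Dataset/RDD is garbage-collected on the driver,
  // the ContextCleaner deletes its files under the checkpoint dir.
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()
```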

Author:

That wouldn't keep the old behavior, and it is a global configuration. Ideally, if we want to be compatible, we need to clean it iteration by iteration, e.g. after 5 iterations, clean the previous checkpoint files.

@rjurney
Collaborator

rjurney commented Apr 25, 2025

Just a comment: in other places we've used a random temp dir for checkpoints; that could be the base dir.

@ericsun95
Author

Just a comment: in other places we've used a random temp dir for checkpoints; that could be the base dir.

Can you share a link to the "other places" you mentioned, for reference?

@james-willis
Collaborator

In my opinion we should just let Spark manage the checkpoint location if we want to switch to using the DataFrame checkpoint method. Users can use spark.cleaner.referenceTracking.cleanCheckpoints if they think it's important to GC the checkpoint dir.

If we want to maintain the current behavior, let's just leave it the way it is.


if (spark.sparkContext.getCheckpointDir.isEmpty) {
  spark.sparkContext.setCheckpointDir(checkpointDir.get)
}
ee = ee.checkpoint(eager = true)
Collaborator:

What does eager help with?

Author:

It helps trigger the checkpoint immediately; otherwise it is lazy.

Collaborator:

OK, so it avoids the unpersists happening before the checkpoint happens.
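To restate the distinction (variable names are hypothetical):

```scala
// eager = true (the default) materializes the checkpoint right away, so the
// inputs can be unpersisted immediately afterwards.
val checkpointed = ee.checkpoint(eager = true)
previousEe.unpersist()

// eager = false only marks the plan; files are written by the first action.
// Unpersisting the inputs before that action risks a full recomputation.
val deferred = ee.checkpoint(eager = false)
```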

@rjurney
Collaborator

rjurney commented May 14, 2025

In my opinion we should just let Spark manage the checkpoint location if we want to switch to using the DataFrame checkpoint method. Users can use spark.cleaner.referenceTracking.cleanCheckpoints if they think it's important to GC the checkpoint dir.

If we want to maintain the current behavior, let's just leave it the way it is.

This makes sense to me. Why is this a GraphFrames feature when it's a Spark feature?

@Kimahriman
Contributor

Another thing that would be interesting is the option to use local checkpointing to avoid having to write to durable storage at all. It would be less resilient to errors but more performant
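A minimal sketch of that option, assuming the same `ee` variable as in the diff:

```scala
// localCheckpoint keeps the data in executor block storage instead of a
// reliable filesystem: faster, and no durable storage needed, but the
// lineage is truncated, so the data cannot be recomputed if an executor
// is lost.
val local = ee.localCheckpoint(eager = true)
```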

@SemyonSinchenko
Collaborator

Another thing that would be interesting is the option to use local checkpointing to avoid having to write to durable storage at all. It would be less resilient to errors but more performant

I like that approach. I will work on implementation.

@SemyonSinchenko
Collaborator

Another thing that would be interesting is the option to use local checkpointing to avoid having to write to durable storage at all. It would be less resilient to errors but more performant

Local checkpoints as an option were added in #662

7 participants