[NVIDIA] Refactor Family Blackwell Support codegen #156176

Closed
johnnynunez wants to merge 2 commits

@johnnynunez johnnynunez commented Jun 17, 2025

With the legacy driver (nvgpu) used with CUDA 12.9, Thor operated as SM 10.1.
With the newer driver model (OpenRM), which is intended for CUDA 13.0, Thor moves to SM 11.0.
Thor: 10.1 --> 11.0
Spark: 12.1
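
The renumbering described above can be sketched as a small helper. This is only an illustration, not code from this PR: the function names `thor_sm` and `remap_arch_list` are hypothetical, and the sketch assumes the switch happens exactly at CUDA major version 13, as the description states.

```python
def thor_sm(cuda_version: str) -> str:
    """Return the SM version Thor is assumed to use for a CUDA toolkit version."""
    major = int(cuda_version.split(".")[0])
    # Assumption: CUDA 13.0 and later use OpenRM (SM 11.0);
    # earlier toolkits use the legacy nvgpu driver (SM 10.1).
    return "11.0" if major >= 13 else "10.1"


def remap_arch_list(arch_list, cuda_version):
    """Rewrite a list of SM versions so Thor's entry matches the toolkit."""
    target = thor_sm(cuda_version)
    thor_names = {"10.1", "11.0"}  # both numbers refer to Thor
    result = []
    for arch in arch_list:
        mapped = target if arch in thor_names else arch
        if mapped not in result:  # drop duplicates, keep order
            result.append(mapped)
    return result


print(remap_arch_list(["10.0", "10.1", "12.1"], "13.0"))
```

An arch list written for CUDA 12.9 would then carry over to CUDA 13.0 with only the Thor entry changed, while Spark's 12.1 is left untouched.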

pytorch-bot bot commented Jun 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156176

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 2f1e36e with merge base 7d87e35 (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@johnnynunez
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 17, 2025
@colesbury colesbury added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jun 17, 2025
@colesbury colesbury requested a review from a team June 17, 2025 12:37
@ezyang
Contributor

ezyang commented Jun 19, 2025

@cyyever, as author of #154595, I'll let you decide what to do about this one

@soumith soumith removed their request for review June 19, 2025 20:24
@cyyever
Collaborator

cyyever commented Jun 22, 2025

I see the need to extend select_compute_arch.cmake even after the forked CUDA modules have been removed. Can you wait until I have fixed all build failures and merged #154595? Then we can discuss how to extend it.

@cyyever
Collaborator

cyyever commented Jun 22, 2025

@johnnynunez Could you try modifying torch_cuda_select_nvcc_arch_flags of cmake/public/utils.cmake and appending to CUDA_ARCHITECTURES?
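
For readers unfamiliar with the two formats involved, the sketch below shows how a dotted SM version relates to a CMake `CUDA_ARCHITECTURES` entry and to an nvcc `-gencode` flag. It is not the body of torch_cuda_select_nvcc_arch_flags; the helper names `arch_to_cmake` and `gencode_flags` are hypothetical, and the sketch assumes plain `major.minor` inputs with no `+PTX` or `a`/`f` suffixes.

```python
def arch_to_cmake(arch: str) -> str:
    """Convert a dotted SM version such as '11.0' to the '110' form."""
    major, minor = arch.split(".")
    return f"{int(major)}{int(minor)}"


def gencode_flags(arch_list):
    """Build nvcc -gencode flags for each SM version in arch_list."""
    flags = []
    for arch in arch_list:
        num = arch_to_cmake(arch)
        # One real-code target per architecture; PTX targets are omitted here.
        flags += ["-gencode", f"arch=compute_{num},code=sm_{num}"]
    return flags


print(";".join(arch_to_cmake(a) for a in ["10.0", "11.0", "12.1"]))
print(" ".join(gencode_flags(["11.0"])))
```

The first line printed is the semicolon-separated form that could be appended to `CUDA_ARCHITECTURES`; the second is the equivalent nvcc command-line flag for Thor.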

@johnnynunez
Contributor Author

johnnynunez commented Jul 2, 2025

> @johnnynunez Could you try modifying torch_cuda_select_nvcc_arch_flags of cmake/public/utils.cmake and appending to CUDA_ARCHITECTURES?

I think I have to wait, right?

@johnnynunez johnnynunez requested a review from malfet as a code owner July 2, 2025 20:47
@johnnynunez
Contributor Author

johnnynunez commented Aug 12, 2025

@tinglvv @malfet @atalman could you merge this? SM 10.1 no longer exists in CUDA 13; Thor is 11.0.

It builds for me: https://pypi.jetson-ai-lab.io/sbsa/cu130/torch/2.9.0

@johnnynunez johnnynunez reopened this Aug 13, 2025
@johnnynunez
Contributor Author

@pytorchbot merge

pytorch-bot bot commented Aug 14, 2025

This PR needs to be approved by an authorized maintainer before merge.

@johnnynunez
Contributor Author

johnnynunez commented Aug 14, 2025

@fmassa @soumith @ezyang could you review and merge? The Thor launch is near.

@johnnynunez
Contributor Author

@ptrblck could you review and merge?

@ezyang
Contributor

ezyang commented Aug 15, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 15, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

asmorkalov pushed a commit to opencv/opencv that referenced this pull request Aug 20, 2025
Refactor Blackwell #27537

In CUDA 13:
- 10.0 is B100/B200, and the same on aarch64 (GB200)
- 10.3 is GB300
- 11.0 is Thor with the new OpenRM driver (moves to SBSA)
- 12.0 is RTX/RTX PRO
- 12.1 is Spark GB10

Thor was moved from 10.1 to 11.0 and Spark is 12.1.
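
The CUDA 13 list above can be kept as a lookup table. This is a minimal sketch that restates only what the commit message says; the names `CUDA13_BLACKWELL_SM` and `describe` are hypothetical and do not come from OpenCV or PyTorch.

```python
# Compute capability (major, minor) -> device family, as listed for CUDA 13.
CUDA13_BLACKWELL_SM = {
    (10, 0): "B100/B200 (also GB200 on aarch64)",
    (10, 3): "GB300",
    (11, 0): "Thor (OpenRM driver, SBSA)",
    (12, 0): "RTX / RTX PRO",
    (12, 1): "Spark GB10",
}


def describe(capability):
    """Return the device family for a (major, minor) pair, or 'unknown'."""
    return CUDA13_BLACKWELL_SM.get(tuple(capability), "unknown")


print(describe((11, 0)))
print(describe((10, 1)))  # Thor's old number is not in the CUDA 13 table
```

The key has the same `(major, minor)` shape that `torch.cuda.get_device_capability()` returns, so a runtime value could be passed to `describe` directly.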
Related patch: pytorch/pytorch#156176
can-gaa-hou pushed a commit to can-gaa-hou/pytorch that referenced this pull request Aug 22, 2025
With the legacy driver (nvgpu) used for CUDA 12.9, Thor was operating with SM 10.1.
This changes to SM 11.0 when the newer driver model (OpenRM), which is intended for CUDA 13.0, is introduced.
Thor 10.1 --> 11.0
Spark 12.1
Pull Request resolved: pytorch#156176
Approved by: https://github.com/ezyang
Labels
- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- open source
- topic: not user facing (topic category)
- triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)