[muon] Introduce Muon optimizer to PyTorch #160213

Open - wants to merge 23 commits into main

Conversation

@chuanhaozhuge (Contributor) commented Aug 8, 2025

A single-device version of Muon. The algorithm follows Keller Jordan's Muon blog post and optionally incorporates Moonshot's learning-rate adjustment strategy.

This implementation maintains a minimalist API consistent with other optimizer conventions. The PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for the different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the PyTorch examples directory.

Usage

    model = MyModelForCausalLM(...)  # instantiate your model
    # filter out your params manually
    muon_params = [...]
    adamw_params = [...]
    muon = Muon(
        params=muon_params,
        lr=lr,
        weight_decay=wd,
    )
    adamw = AdamW(
        params=adamw_params,
        lr=lr,
        weight_decay=wd,
    )

    # in training loop
    loss = model(input)
    loss.backward()
    muon.step()
    adamw.step()
    muon.zero_grad()
    adamw.zero_grad()
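
For the manual filtering step above, one possible split (a sketch only; the name checks for "embed" and "lm_head" are illustrative and depend on your model) is to route 2D weight matrices to Muon and everything else to AdamW:

    # route 2D hidden-layer weight matrices to Muon; embeddings, heads, and 1D params to AdamW
    muon_params = [
        p for name, p in model.named_parameters()
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name
    ]
    muon_ids = {id(p) for p in muon_params}
    adamw_params = [p for p in model.parameters() if id(p) not in muon_ids]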

Additional usage
Users were originally able to pass in a self-defined msign function for orthogonalization and a learning-rate adjustment function, with the interfaces defined below (since removed):

~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~
~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~

As discussed with the team and in the comments, we prefer a simpler, cleaner interface, so we removed the callback interface and canonicalized the original Newton-Schulz (NS) algorithm for Muon. The only configs available to users are ns_steps, coefficients, and eps, configurable through kwargs.
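
A minimal sketch of how these kwargs could be passed (the kwarg names follow the description above; the default values shown are assumptions for illustration, not the canonical signature):

    # assumes the kwargs described above: ns_steps, coefficients, eps
    muon = Muon(
        muon_params,
        lr=1e-3,
        weight_decay=0.1,
        ns_steps=5,                               # Newton-Schulz iteration count
        coefficients=(3.4445, -4.7750, 2.0315),   # quintic NS coefficients from Keller Jordan's post
        eps=1e-7,                                 # stabilizer used when normalizing the gradient
    )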

By default, we use 5-step Newton-Schulz with the coefficients proposed by Keller Jordan, and the learning-rate adjustment proposed by Moonshot, which grafts the learning rate from AdamW.
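
For reference, a minimal sketch of this default orthogonalization, following the Newton-Schulz iteration from Keller Jordan's blog post (the function name and exact signature inside torch.optim may differ):

    import torch

    def newton_schulz_msign(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
        # Quintic Newton-Schulz iteration approximating msign(G) = U @ V^T for G = U S V^T.
        # The coefficients are the ones proposed in Keller Jordan's blog post.
        a, b, c = (3.4445, -4.7750, 2.0315)
        X = G.bfloat16()
        X = X / (X.norm() + eps)          # Frobenius normalization keeps the spectral norm <= 1
        transposed = G.size(0) > G.size(1)
        if transposed:                    # iterate on the wide orientation
            X = X.mT
        for _ in range(steps):
            A = X @ X.mT
            B = b * A + c * A @ A
            X = a * X + B @ X
        if transposed:
            X = X.mT
        return X.to(G.dtype)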

Testing

1. Unit tests: the newly introduced Muon is covered in test/test_optim.py. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.

As discussed, in order not to complicate the codebase, we prefer not to include a reference implementation in PyTorch. We also updated the interface so we no longer need to test the FQN-based filtering. Muon is covered by the existing test_optim.py unit tests.

2. End-to-end test: we added a training script that pre-trains a Qwen-like model on the openwebtext-100k dataset. We trained for one epoch, and the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency.

Numerics
We evaluated our implementation against the existing implementation to confirm numerical consistency.

As discussed, our implementation closely follows the algorithm described in Keller's post, while incorporating the learning-rate adjustment from Moonlight. This captures a key insight that allows users to reuse hyper-parameters tuned for AdamW, making Muon a drop-in swap.
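
A sketch of this Moonshot/Moonlight-style RMS-matching adjustment, assuming the 0.2 * sqrt(max(fan_out, fan_in)) scaling from the Moonlight report (the helper name and signature are illustrative, not the PR's internal API):

    import math
    import torch

    def adjust_lr_rms_match(lr: float, shape: torch.Size) -> float:
        # Scale the orthogonalized update so its RMS roughly matches an AdamW update,
        # which is what allows AdamW-tuned lr/weight_decay to be reused with Muon.
        fan_out, fan_in = shape[0], shape[1]
        return lr * 0.2 * math.sqrt(max(fan_out, fan_in))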

As expected, the numerics difference comes mainly from adjust_lr, with a maximum relative difference of ~5% in the example unit-test setup below.

    import copy

    import torch
    from torch.nn import Linear, MSELoss
    from torch.optim import Muon  # Muon as introduced in this PR
    # KellySingleDeviceMuon: the reference single-device implementation used for comparison

    # dummy model and data
    model0 = Linear(10, 10, bias=False)
    model1 = copy.deepcopy(model0)
    inputs = torch.randn(8, 10)
    targets = torch.randn(8, 10)
    loss = MSELoss()

    lr = 1e-3
    wd = 0.1
    momentum = 0.95

    opt_ref_muon = KellySingleDeviceMuon(
        params=model0.parameters(), 
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    opt_exp_muon = Muon(
        params=model1.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    out_ref = model0(inputs)
    loss_ref = loss(out_ref, targets)
    opt_ref_muon.zero_grad()
    loss_ref.backward()
    opt_ref_muon.step()

    out_exp = model1(inputs)
    loss_exp = loss(out_exp, targets)
    opt_exp_muon.zero_grad()
    loss_exp.backward()
    opt_exp_muon.step()

    for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
        torch.testing.assert_close(p_ref, p_exp)

As explained above, including this adjust_lr is preferable. This is validated by end-to-end training runs on a Qwen2-like 0.5B model, where the curves show that training with adjust_lr converges more effectively than without.

Performance
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP:

  • adamw_ddp finishes in 13.12 min
  • pytorch_muon_ddp finishes in 13.45 min

Muon runs ~20 s slower than AdamW over the epoch, i.e. roughly 2.5% slower overall, assuming no other changes.

AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms

Muon: Optimizer.step() takes ~54 ms, step time ~960 ms

Note
We restrict the implementation to accept only 2D parameters.

An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped [in_channel, height, width, out_channel], applying orthogonalization to the last two dimensions is not meaningful.

Since Muon is designed to perform orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound.

Next Steps

  1. Add MuP
  2. Open-source an optimized Triton kernel for symmetric matmul. A preliminary benchmark found a 1.23x-1.48x speedup on small to large matrices (n = 256 to 16384).
  3. Open-source unsharded Muon co-designed with FSDP2.

cc: @toothacher17, @vinaysrao, @jcui2, @haocizhang

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @mcarilli @ptrblck @leslie-fang-intel @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela

pytorch-bot (bot) commented Aug 8, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/160213

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (7 Unrelated Failures)

As of commit 7fc7a60 with merge base 3e5b021:

BROKEN TRUNK - The following jobs failed but were present on the merge base.

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@janeyx99 (Contributor) left a comment:

Wooo the approach is much simpler indeed now--thank you for the speedy turnaround on the PR. The one main API question I have is how we handle the NS config (whether it should be in the constructor) which I've commented below. Everything else looks super solid.

I know you've put in amazing work for the benchmarks and correctness compared to the original Muon, and I trust that you have verified this PR is still correct and appropriately fast locally. I will look out for the separate PR with those scripts!

Comment on lines 356 to 358:

    params = [weight, bias]
    if optim_cls.__name__ == "Muon":
        params = [weight]

Contributor: nit to not reassign.

Suggested change:

    params = [weight, bias] if optim_cls.__name__ != "Muon" else [weight]

Comment on lines 1568 to 1569:

    model = torch.nn.Sequential(
        torch.nn.Linear(10, 4, bias=False),

Contributor: This can just be one Linear then, right? Or maybe it'd be more indicative to add another Linear in there?

Can you add a comment for why we branch here?

Comment on the diff @@ -1577,14 +1629,26 @@ def test_can_load_from_to_named_state_dict(:

    all_optim_inputs = _get_optim_inputs_including_global_cliquey_kwargs(
        device, dtype, optim_info, skip=("differentiable",)
    )

    def _get_model_and_input(device, dtype, optim_cls):

Contributor: let's only have one version of this helper, it looks the same as above.

Comment on the constructor signature:

    nesterov: bool = True,
    *,
    msign_fn: MsignFn = zeropower_via_newtonschulz,
    msign_fn_config: BaseMsignFnConfig = NewtonSchulzConfig(),

Contributor: What is the pro of having these configs live in the constructor as a struct vs. separate values? Is this because these values are only used if the msign_fn is zeropower_via_newtonschulz? If so, should this config not live in the Muon constructor at all but be customizable via the user-provided msign_fn? What are your thoughts?

Contributor: I also wonder if this config could be a regular dict, accepted in the constructor as Muon(..., msign_fn_config={'eps': 1e-5}) and then just passed as self.msign_fn(..., **self.msign_fn_config); that way it could be more easily saved into a state_dict()...

Contributor (author): I think it's better to encapsulate the configs in a dedicated class, so the function signature stays clean and manageable. Just a preference carried over from my C++ days :)

I see Vadim's point, but I'm not sure it's feasible (or necessary) to store the callable in the state_dict in the first place.

Contributor: Another option is to have the config set as simple args to the function, and then have the user override them by calling functools.partial.
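
For illustration, the functools.partial pattern suggested here could look like the following (it assumes zeropower_via_newtonschulz exposes steps and eps as plain keyword arguments, which is an assumption about its signature rather than the PR's actual API):

    import functools

    # hypothetical override: bind non-default NS settings onto the msign function
    custom_msign = functools.partial(zeropower_via_newtonschulz, steps=7, eps=1e-6)
    muon = Muon(muon_params, lr=1e-3, msign_fn=custom_msign)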

Contributor: cc @albanD regarding API design for best practices.

Collaborator: Ho, that's interesting. I do agree that this doesn't match how we do APIs in PyTorch in general. For value config, I would expect each value to be passed in as its own argument (see other optimizers). If you need to override specific methods and behavior, you can either have a set of pre-defined implementations that a flag toggles between, or you can subclass the optimizer to override the particular method you care about.

Also, I guess I'm missing some context on why we want to do it this way if there is only one option for each right now?

@janeyx99 (Contributor) left a comment: The current CI failures are because you (probably accidentally) committed the third_party differences - please remove those!

@chuanhaozhuge (Contributor, author), replying to the comment above: uh, they must have come from the rebase. Removed.

@chuanhaozhuge chuanhaozhuge force-pushed the muon_dev branch 2 times, most recently from 1c82fc8 to 654f754 Compare August 12, 2025 05:07
@chuanhaozhuge chuanhaozhuge marked this pull request as ready for review August 12, 2025 05:08
@chuanhaozhuge chuanhaozhuge requested a review from albanD as a code owner August 12, 2025 05:08
Comment on lines 61 to 65:

    assert steps < 100, (
        "Number of steps must be less than 100 for computational efficiency"
    )
    assert len(grad.shape) == 2, "Input tensor gradient must be a 2D matrix"
    assert len(coefficients) == 3, "Coefficients must be a tuple of exactly 3 values"

Collaborator: No plain asserts, please; raise appropriate RuntimeError/ValueError/TypeError instead.
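
For example, the 2D check could be rewritten with an explicit exception (illustrative wording, not the exact message used in the PR):

    # replacing a plain assert with a descriptive ValueError
    if grad.dim() != 2:
        raise ValueError(f"Muon expects a 2D gradient matrix, got {grad.dim()} dimensions")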


Comment on:

    has_complex: bool,
    ) -> None:
        lr = _to_scalar(lr)
        assert has_complex is False, "Complex parameters are not supported"

Collaborator: Let's remove the plain asserts here as well.

Contributor (author): done!

@janeyx99 (Contributor):
Given that we had agreed to land the simplest single device Muon into torch/optim as our first step, it'd be clearest to land what people accept as the original implementation as defined in Keller Jordan's blog ( https://kellerjordan.github.io/posts/muon/). As this implementation chooses newton schulz as the algo, we should take the same stance. This means we can simplify the constructor API greatly (I will get to extensibility right after):

  • Remove the msign_fn callable argument (the algorithm will just call NS by default today). OSS folks will not expect to pass in a callable to the constructor, so we will not accept the PR with a kwarg that takes in a callable. Instead, for customization, we can intake a string enum (more on this later).
  • Remove the struct definitions and flatten the kwargs as top-level keyword arguments to the constructor. Having layers of configs is confusing and unnecessarily abstracted. Since we are going in on NS being the algorithm for single device Muon, we can explicitly list out these kwargs in the constructor.
  • Move the algorithm description of NS into the Muon doc, so people can see immediately what algo they are calling.
  • I am remembering that the original test scripts comparing this implementation to Keller Jordan's had high atol and rtols, which is surprising as the two algorithms both use PyTorch ops in python, and so I'd expect the results to be the same. Could you link a standalone script that can be run to ascertain correctness and explain why the high atol/rtols are necessary (if they still are)? As we want to land a trustworthy impl for folks to try out, we need to ensure the accuracy results are expected.

I'm realizing that you have interest in extending the algorithm to be distributed (vs another orthogonalization algo for single-device). We are strict on keeping torch/optim code single-device runnable and maximally composable, so we cannot land anything distributed in torch/optim and I'd propose landing the distributed optimizer solution in https://github.com/pytorch/pytorch/tree/main/torch/distributed/optim. With that, I see the possible extension options as below:
a) If we are interested in other orthogonalization techniques for single device, we'd recommend using string enum kwargs similar to line_search_fn in https://docs.pytorch.org/docs/stable/generated/torch.optim.LBFGS.html#lbfgs, where the default None is NS, and other strings can represent other algorithms.
b) If we are attempting to extend in a distributed manner, and the code (state_dict, etc) is easily shareable, we'd recommend subclassing Muon into a new distributed optimizer in torch/distributed/optim. If the code ends up not being so shareable, it is perfectly acceptable to have Dion or a different optim class entirely living in torch/distributed/optim.

@chuanhaozhuge (Contributor, author):
Thank you, team, for the thoughtful suggestions and thorough review! I made the following updates according to your inputs, so that this version is simple and clear:

  1. removed the callback interface as suggested, and canonicalized the original NS algorithm for Muon. The only configs users can tune are ns_steps, coefficients, and eps, configurable through kwargs.
  2. updated the docstring to follow other optimizers.
  3. used RMS matching (adjust_lr) from the Moonshot tech report to graft the lr and weight decay tuned for AdamW. Keller's impl is equivalent to Moonshot's under certain constraints.

@pytorchmergebot (Collaborator):
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@chuanhaozhuge (Contributor, author):
@pytorchbot rebase

@pytorchmergebot (Collaborator):
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator):
Tried to rebase and push PR #160213, but it was already up to date. Try rebasing against main by issuing:
@pytorchbot rebase -b main
