LESSWRONG

Why We Fight
Book 6 of the Sequences Highlights
First Post: Something to Protect

The pursuit of rationality, and of doing better on purpose, can in fact be rather hard. You have to get the motivation for that from somewhere.

Popular Comments

Raemon · 2d
Which side of the AI safety community are you in?
I think there is some way that the conversation needs to advance, and I think this is roughly carving at some real joints, and it's important that people are tracking the distinction. But a) I'm generally worried about reifying the groups more into existence (as opposed to trying to steer towards a world where people can have more nuanced views). This is tricky; there are tradeoffs and I'm not sure how to handle this. But... b) this post title and framing in particular are super leaning into the polarization, and I wish it did something different.
Phaedrus · 2d
The Doomers Were Right
Although you don't explicitly mention it, I feel like this whole post is about value drift. The doomers are generally right on the facts (and often on the causal pathways), and we do nonetheless consider the post-doom world better, but the 1-nth order effects of these new technologies reciprocally change our preferences and worldviews to favor the (doomed?) world created by the aforementioned new technologies.

The question of value drift is especially strange given that we have a "meta-intuition" that moral/social values evolving and changing is good in human history. BUT, at the same time, we know from historical precedent that we ourselves will not approve of the value changes. One might attempt to square the circle here by arguing that, if we were hypothetically able to see and evaluate future changed values, we would in reflective equilibrium accept these new values. Sadly, from what I can gather this is just not borne out by the social science: when it comes to questions of value drift, society advances by the deaths of the old-value-havers and the maturation of a next generation with "new" values.

For a concrete example, consider that most Americans have historically been Christians. In fact, the history of the early United States is deeply influenced by Christianity, which in certain periods swelled to fanatical levels. If those Americans could see the secular American republic of 2025, with little religious belief and no respect for the moral authority of Christian scripture, they would most likely be morally appalled. Perhaps they might view the loss of "traditional God-fearing values" as a harm that in itself outweighs the cumulative benefits of industrial modernity. As a certain Nazarene said: “For what shall it profit a man, if he shall gain the whole world, and lose his own soul?” (Mark 8:36)

With this in mind, as a final exercise I'd like you, dear reader, to imagine a future where humanity has advanced enormously technologically, but has undergone such profound value shifts that every central moral and social principle that you hold dear has been abandoned, replaced with mores which you find alien and abhorrent. In this scenario, do you obey your moral intuitions that the future is one of Lovecraftian horror? Or do you obey your historical meta-intuitions that future people probably know better than you do?
Wei Dai · 1d
Reminder: Morality is unsolved
Strongly agree that metaethics is a problem that should be central to AI alignment, but is being neglected. I actually have a draft about this, which I guess I'll post here as a comment in case I don't get around to finishing it.

Metaethics and Metaphilosophy as AI Alignment's Central Philosophical Problems

I often talk about humans or AIs having to solve difficult philosophical problems as part of solving AI alignment, but what philosophical problems exactly? I'm afraid that some people might have gotten the impression that they're relatively "technical" problems (in other words, problems whose solutions we can largely see the shapes of, but need to work out the technical details) like anthropic reasoning and decision theory, which we might reasonably assume or hope that AIs can help us solve. I suspect this is because, due to their relatively "technical" nature, they're discussed more often on LessWrong and the AI Alignment Forum, unlike other equally or even more relevant philosophical problems, which are harder to grapple with or "attack". (I'm also worried that some are under the mistaken impression that we're closer to solving these "technical" problems than we actually are, but that's not the focus of the current post.)

To me, the really central problems of AI alignment are metaethics and metaphilosophy, because these problems are implicated in the core question of what it means for an AI to share a human's (or a group of humans') values, or what it means to help or empower a human (or group of humans). I think one way that the AI alignment community has avoided this issue (even those thinking about longer term problems or scalable solutions) is by assuming that the alignment target is someone like themselves, i.e., someone who clearly understands that they are and should be uncertain about what their values are or should be, or is at least willing to question their moral beliefs, and is eager or at least willing to use careful philosophical reflection to solve their value confusion/uncertainty. To help or align to such a human, the AI perhaps doesn't need an immediate solution to metaethics and metaphilosophy, and can instead just empower the human in relatively commonsensical ways, like keeping them safe and gathering resources for them, and allowing them to work out their own values in a safe and productive environment. But what about the rest of humanity, who seemingly are not like that? From an earlier comment:

> I've been thinking a lot about the kind [of value drift] quoted in Morality is Scary. The way I would describe it now is that human morality is by default driven by a competitive status/signaling game, where often some random or historically contingent aspect of human value or motivation becomes the focal point of the game, and gets magnified/upweighted as a result of competitive dynamics, sometimes to an extreme, even absurd degree.
>
> (Of course from the inside it doesn't look absurd, but instead feels like moral progress. One example of this that I happened across recently is filial piety in China, which became more and more extreme over time, until someone cutting off a piece of their flesh to prepare a medicinal broth for an ailing parent was held up as a moral exemplar.)
>
> Related to this is my realization that the kind of philosophy you and I are familiar with (analytical philosophy, or more broadly careful/skeptical philosophy) doesn't exist in most of the world and may only exist in Anglophone countries as a historical accident. There, about 10,000 practitioners exist who are funded but ignored by the rest of the population. To most of humanity, "philosophy" is exemplified by Confucius (morality is everyone faithfully playing their feudal roles) or Engels (communism, dialectical materialism). To us, this kind of "philosophy" is hand-waving and making things up out of thin air, but to them, philosophy is learned from a young age and unquestioned. (Or if questioned, they're liable to jump to some other equally hand-wavy "philosophy", like China's move from Confucius to Engels.)

What are the real values of someone whose apparent values (stated and revealed preferences) can change in arbitrary and even extreme ways as they interact with other humans in ordinary life (i.e., not due to some extreme circumstances like physical brain damage or modification), and who doesn't care about careful philosophical inquiry? What does it mean to "help" someone like this? To answer this, we seemingly have to solve metaethics (generally understand the nature of values) and/or metaphilosophy (so the AI can "do philosophy" for the alignment target, "doing their homework" for them). The default alternative (assuming we solve other aspects of AI alignment) seems to be to still empower them in straightforward ways, and hope for the best. But I argue that giving people who are unreflective and prone to value drift god-like powers to reshape the universe and themselves could easily lead to catastrophic outcomes on par with takeover by unaligned AIs, since in both cases the universe becomes optimized for essentially random values.

A related social/epistemic problem is that, unlike in certain other areas of philosophy (such as decision theory and object-level moral philosophy), people including alignment researchers just seem more confident about their own preferred solution to metaethics, and comfortable assuming their own preferred solution is correct as part of solving other problems, like AI alignment or strategy. (E.g., moral anti-realism is true, therefore empowering humans in straightforward ways is fine, as the alignment target can't be wrong about their own values.) This may also account for metaethics not being viewed as a central problem in AI alignment (i.e., some people think it's already solved). I'm unsure about the root cause(s) of confidence/certainty in metaethics being relatively common in AI safety circles. (Maybe it's because in other areas of philosophy, the various proposed solutions are more obviously unfinished or problematic, e.g. the well-known problems with utilitarianism.) I've previously argued for metaethical confusion/uncertainty being normative at this point, and will also point out now that from a social perspective there is apparently wide disagreement about metaethics among philosophers and alignment researchers, so how can it be right to assume some controversial solution to it (which every proposed solution is at this point) as part of a specific AI alignment or strategy idea?
Welcome to LessWrong! · Ruby, Raemon, RobertM, habryka · 6y · 492 karma · 76 comments
Do One New Thing A Day To Solve Your Problems · Algon · 3d · 167 karma · 26 comments
The "Length" of "Horizons" · Adam Scholl · 8d · 174 karma · 27 comments
[Today] AI Safety Law-a-thon: We need more technical AI Safety researchers to join!
[Today] OC ACXLW Meetup: Halloween Edition — Haunted Houses & Post-Discontent Societies
[Today] Copenhagen – ACX Meetups Everywhere Fall 2025
[Today] ACX Montreal meetup - October 25th @1PM
Quick Takes

leogao · 16m
i find it funny that i know people in all 4 of the following quadrants:
* works on capabilities, and because international coordination seems hopeless, we need to race to build ASI first before the bad guys
* works on capabilities, and because international coordination seems possible and all national leaders like to preserve the status quo, we need to build ASI before it gets banned
* works on safety, and because international coordination seems hopeless, we need to solve the technical problem before ASI kills everyone
* works on safety, and because international coordination seems possible, we need to focus on regulation and policy before ASI kills everyone
Prometheus · 2h
I'd love to see people working on Retroactive Funding for Alignment. Something like a DAO with governance tokens that only pays out after there is consensus that (1) AGI/ASI has been achieved, and (2) humanity has survived. Using AI or human evaluations, there would be an attempt to trace back the greatest contributors toward our survival: the researchers, the organizations, the individuals, the donors, the investors. All would receive a payout based on their calculated impact. It's a way of almost bringing money from the future into the present, and a way of forcing donors, investors, and researchers to think about what will actually contribute toward a positive future. It would also add an incentive for donors, since their contribution would later be rewarded. Would love to speak further with anyone in the Retroactive Funding or DeSci space.
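A minimal sketch of the final payout step being described, assuming the hard part (assessing each contributor's impact after the fact) has already happened; the names here (`retroactive_payouts`, `impact_scores`, `payout_pool`) are hypothetical, not from the comment or any existing project:

```python
# Hypothetical sketch: split a retroactive funding pool among contributors
# in proportion to their assessed impact scores. Only illustrates the payout
# arithmetic; impact assessment (by AI or human evaluators) is assumed done.

def retroactive_payouts(impact_scores: dict[str, float], payout_pool: float) -> dict[str, float]:
    """Distribute `payout_pool` proportionally to non-negative impact scores."""
    total = sum(impact_scores.values())
    if total <= 0:
        return {name: 0.0 for name in impact_scores}
    return {name: payout_pool * score / total for name, score in impact_scores.items()}

# Example: a hypothetical $10M pool split across three contributor classes.
print(retroactive_payouts(
    {"alignment_lab": 5.0, "independent_researcher": 3.0, "early_donor": 2.0},
    10_000_000,
))
```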
Zach Stein-Perlman · 2d
Recently I've been spending much less than half of my time on projects like AI Lab Watch. Instead I've been thinking about projects in the "strategy/meta" and "politics" domains. I'm not sure what I'll work on in the future but sometimes people incorrectly assume I'm on top of lab-watching stuff; I want people to know I'm not owning the lab-watching ball. I think lab-watching work is better than AI-governance-think-tank work for the right people on current margins and at least one more person should do it full-time; DM me if you're interested.
Jesse Hoogland · 3d
We recently put out a new paper on a scalable generalization of influence functions, which quantify how training data affects model behavior (see Nina's post). I'm excited about this because it takes a completely new methodological approach to measuring influence. Instead of relying on a Hessian inverse (which is ill-defined and expensive), our new "Bayesian" influence functions (BIF) rely on a covariance calculation (which can be scalably estimated with MCMC). This approach is more theoretically sound (no more Hessian inverses), and it achieves what I think is a more desirable set of engineering tradeoffs (better model-size scaling but worse dataset-size scaling).

At Timaeus, we think these kinds of techniques are on the critical path to safety. Modern alignment techniques like RLHF and Constitutional AI are about controlling model behavior by selecting the right training data. If this continues to be the case, we will need better tools for understanding and steering the pipeline from data to behavior.

It's still early days for the BIF. We've done some initial validation on retraining benchmarks and other quantitative tests (follow-up work coming soon), where the BIF comes out looking strong, but more work will be needed to understand the full set of costs and benefits. As that foundation gets established, we expect we'll be able to start applying these techniques directly to safety-relevant problems. You can read the full announcement thread on X.
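For readers who want the shape of the object involved, here is a minimal sketch, assuming the "Bayesian" influence function is (up to sign and scaling) the covariance of per-example losses under a local posterior over weights, estimated from MCMC samples; the paper's exact definition and conventions may differ:

$$\mathrm{BIF}(z_m, z_n) \;\propto\; \operatorname{Cov}_{w \sim p(w \mid \mathcal{D})}\!\big(\ell(z_m, w),\, \ell(z_n, w)\big) \;\approx\; \frac{1}{K} \sum_{k=1}^{K} \big(\ell(z_m, w_k) - \bar{\ell}_m\big)\big(\ell(z_n, w_k) - \bar{\ell}_n\big)$$

where $w_1, \dots, w_K$ are MCMC (e.g. SGLD) samples from a posterior localized around the trained weights and $\bar{\ell}_m$ is the sample mean of $\ell(z_m, w_k)$. By contrast, a classical influence function computes $-\nabla_w \ell(z_m)^{\top} H^{-1} \nabla_w \ell(z_n)$, which requires the Hessian inverse the quick take is avoiding.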
leogao · 3d
I think it would be really bad for humanity to rush to build superintelligence before we solve the difficult problem of how to make it safe. But also I think it would be a horrible tragedy if humanity never ever built superintelligence. I hope we figure out how to thread this needle with wisdom.
kave · 11h
There has been a rash of highly upvoted quick takes recently that don't meet our frontpage guidelines. They are often timely, perhaps because they're political, are pitching something to the reader, or are inside baseball. These are all fine or even good things to write on LessWrong! But I (and the rest of the moderation team I talked to) still want to keep the content on the frontpage of LessWrong timeless. Unlike with posts, we don't go through each quick take and manually assign it to be frontpage or personal (posts are treated as personal until they're actively frontpaged). Quick takes are instead treated more like frontpage by default, but we do have the ability to move them to personal. I'm writing this because a bunch of us are planning to be more active about moving quick takes off the frontpage. I also might link to this comment to clarify what's happening in cases of confusion.
Matt Dellago · 12h
Maximally coherent agents are indistinguishable from point particles. They have no internal degrees of freedom; one cannot probe their internal structure from the outside.
Epistemic Status: Unhinged
The Company Man · Tomás B. · 1mo · 727 karma · 68 comments
The Rise of Parasitic AI · Adele Lopez · 1mo · 675 karma · 176 comments
The Doomers Were Right · Algon · 2d · 156 karma · 19 comments
Which side of the AI safety community are you in? · Max Tegmark · 2d · 150 karma · 75 comments
Hospitalization: A Review · Logan Riggs · 16d · 349 karma · 19 comments
The main way I've seen people turn ideologically crazy [Linkpost] · Noosphere89 · 1d · 86 karma · 16 comments
Towards a Typology of Strange LLM Chains-of-Thought · 1a3orn · 12d · 280 karma · 24 comments
Humanity Learned Almost Nothing From COVID-19 · niplav · 5d · 166 karma · 26 comments
EU explained in 10 minutes · Martin Sustrik · 4d · 135 karma · 16 comments
If Anyone Builds It Everyone Dies, a semi-outsider review · dvd · 11d · 207 karma · 64 comments
How Well Does RL Scale? · Toby_Ord · 3d · 104 karma · 19 comments
I take antidepressants. You’re welcome · Elizabeth · 15d · 235 karma · 10 comments
Global Call for AI Red Lines - Signed by Nobel Laureates, Former Heads of State, and 200+ Prominent Figures · Charbel-Raphaël · 1mo · 336 karma · 27 comments