Lotus Reader - The Hackernews Client


I don't know how you get here from “predict the next word”

qsi

20 hours ago

162

225

https://www.grumpy-economist.com/p/refine

pushedx19 hours ago

Yes, most people (including myself) do not understand how modern LLMs work (especially if we consider the most recent architectural and training improvements).

There's the 3b1b video series which does a pretty good job, but now we are interfacing with models that probably have parameter counts in each layer larger than the first models that we interacted with.

The novel insights that these models can produce are truly shocking, I would guess even for someone who does understand the latest techniques.

measurablefuncpushedx19 hours ago

What's the latest novel insight you have encountered?

brookstmeasurablefunc19 hours ago

Not the person you asked, and “novel” is a minefield. What’s the last novel anything, in the sense you can’t trace a precursor or reference?

But.. I recently had an LLM suggest an approach to negative mold-making that was novel to me. Long story, but basically isolating the gross geometry and using NURBS booleans for that, plus mesh addition/subtraction for details.

I’m sure there’s prior art out there, but that’s true for pretty much everything.

measurablefuncbrookst18 hours ago

I don't know, that's why I asked b/c I always see a lot of empty platitudes when it comes to LLM praise so I'm curious to see if people can actually back up their claims.

I haven't done any 3D modeling so I'll take your word for it but I can tell you that I am working on a very simple interpreter & bytecode compiler for a subset of Erlang & I have yet to see anything novel or even useful from any of the coding assistants. One might naively think that there is enough literature on interpreters & compilers for coding agents to pretty much accomplish the task in one go but that's not what happens in practice.

pushedxmeasurablefunc18 hours ago

Which agents are you using, and are you using them in an agent mode (Codex, Claude Code etc.)?

The difference in quality of output between Claude Sonnet and Claude Opus is around an order of magnitude.

The results that you can get from agent mode vs using a chat bot are around two orders of magnitude.

measurablefuncpushedx18 hours ago

The workflow is not the issue. You are welcome to try the same challenge yourself if you want. Extra test cases (https://drive.proton.me/urls/6Z6557R2WG#n83c6DP6mDfc) & specification (https://claude.ai/public/artifacts/5581b499-a471-4d58-8e05-147ab7c3ef4e). I know enough about compilers, bytecode VMs, parsers, & interpreters to know that this is well within the capabilities of any reasonably good software engineer but the implementations from Gemini 3.1 Pro (high & low) & Claude Opus 4.6 (thinking) have been less than impressive.

Kim_Bruningmeasurablefunc17 hours ago

Possibly a dumb question: but are you running this in claude code, or an ide, or basically what are you using to allow for iteration?

measurablefuncKim_Bruning16 hours ago

I'm using Google's antigravity IDE. I initially had it configured to run allowed commands (cargo add|build|check|run, testing shell scripts, performance profiling shell scripts, etc.) so that it would iterate & fix bugs w/ as little intervention from me as possible but all it did was burn through the daily allotted tokens so I switched to more "manual" guidance & made a lot more progress w/o burning through the daily limits.

What I've learned from this experiment is that the reality does not actually live up to the hype. Maybe the next iteration will manage the task better than the current one but it's obvious that basic compiler & bytecode virtual machine design in a language like Rust is still beyond the capabilities of the current coding agents & whoever thinks I'm wrong is welcome to implement the linked specification to see how far they can get by just "vibing".

Kim_Bruningmeasurablefunc16 hours ago

That's roughly where I'm at too. I have seen people have some more success after having practiced, though. Possibly the actual workflows needed for full auto are still kind of tacit. Smaller green-field projects do work for me already though.

measurablefuncKim_Bruning4 hours ago

In my experience a few hundred lines w/ a few crates w/ well-defined scopes & a detailed specification is within current capabilities, e.g. compressing wav files w/ wavelets & arithmetic coding. But it's obvious that a correct parser, compiler, & bytecode VM is still beyond current agents even if the specification is detailed enough to cover basically everything.

pushedxmeasurablefunc17 hours ago

sorry, needed to edit this comment to ask the same question as the sibling:

have you run these models in an agent mode that allows for executing the tests, the agent views the output, and iterates on its own for a while? up to an hour or so?

you will get vastly different output if you ask the agent to write 200 of its own test cases, and then have it iterate from there

kmaitreyspushedx17 hours ago

Can you clarify a bit more about this two orders of magnitude claim? In what context? Sure, they have "agency" and can do more than outputting text, but I would like to see a proper example of this claim.

brookstmeasurablefunc18 hours ago

It’s taken me a while to get good at using them.

My advice: ask for more than what you think it can do. #1 mistake is failing to give enough context about goals, constraints, priorities.

Don’t ask “complete this one small task”, ask “hey I’m working on this big project, docs are here, source is there, I’m not sure how to do that, come up with a plan”

measurablefuncbrookst18 hours ago

The specification is linked in another comment in this thread & you can decide whether it is ambitious enough or not but what I can tell you is that none of the existing coding agents can complete the task even w/ all the details. If you do try it you will eventually get something that will mostly work on simple tests but fail miserably on slightly more complicated test cases.

joquarkymeasurablefunc3 hours ago

Most humans can't force themselves to come up with something novel immediately upon demand.

measurablefuncjoquarky3 hours ago

Completely unrelated to the topic or any of the points I was making so did you get confused & respond to the wrong thread?

kennyloginzbrookst17 hours ago

There is prior art, so it’s not novel.

brookstkennyloginz9 hours ago

Great. Can you point to anything at all that is truly novel, no prior art?

rsyncbrookst6 hours ago

Sliding down handrails on a skateboard.

aurahampushedx19 hours ago

I highly recommend Build a Large Language Model (From Scratch) [1] by Sebastian Raschka. It provides a clear explanation of the building blocks used in the first versions of ChatGPT (GPT-2 if I recall correctly). The output of the model is a huge vector of n elements, where n is the number of tokens in the vocabulary. We use that huge vector as a probability distribution to sample the next token given an input sequence (i.e., a prompt). Under the hood, the model has several building blocks like tokenization, skip connections, self attention, masking, etc. The author does a great job explaining all the concepts. It is very useful for understanding how LLMs work.

[1] https://www.manning.com/books/build-a-large-language-model-from-scratch
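To make the "huge vector as a probability distribution" step concrete, here is a minimal sketch, not taken from the book. The four-word vocabulary and the raw scores are made up for illustration, and the softmax step that turns raw scores into probabilities is the usual approach for GPT-style models rather than something spelled out above.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, vocab, rng):
    """Draw one token, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical toy vocabulary and model output (one score per token).
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 0.5, 1.0, -1.0]

probs = softmax(logits)
rng = random.Random(0)  # seeded so repeated runs draw the same token
token = sample_next_token(logits, vocab, rng)
print(dict(zip(vocab, probs)), token)
```

A real model has a vocabulary of tens of thousands of tokens and repeats this draw once per generated token, appending each result to the input.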

phreezaauraham18 hours ago

But this is missing exactly the gap which OP seems to have, which is going from a next token predictor (a language model in the classical sense) to an instruction finetuned, RLHF-ed and "harnessed" tool?

js8phreeza15 hours ago

The book has a sequel https://www.manning.com/books/build-a-reasoning-model-from-scratch

It will give you an answer to the extent anybody can.

belZaah19 hours ago

It’s called emergent behavior. We understand how an llm works, but do not have even a theory about how the behavior emerges from among the math. We understand ants pretty well, but how exactly does anthill behavior come from ant behavior? It’s a tricky problem in system engineering where predicting emergent behavior (such as emergencies) would be lovely.

devmorbelZaah19 hours ago

The good news is that despite being incredibly complex, it’s still a lot simpler than ants because it is at least all statistical linguistics (as far as LLMs are concerned anyways).

themafiabelZaah19 hours ago

> but do not have even a theory about how the behavior emerges

We fully do. There is a significant quality difference between English language output and other languages which lends a huge hint as to what is actually happening behind the scenes.

> but how exactly does anthill behavior come from ant behavior?

You can't smell what ants can. If you did I'm sure it would be evident.

kristiandupontthemafia19 hours ago

I am very curious about this significant hint, could you point me to some material?

spiralcoasterthemafia19 hours ago

Two very big revelations here that I would love to know more about:

1. Can you reveal "what's actually happening behind the scenes" beyond the hint you gave? I can't figure it out.

2. Can you explain how an ant's sense of smell leads to anthills?

jen729wspiralcoaster19 hours ago

> 2. Can you explain how an ant's sense of smell leads to anthills?

Ant 0: doesn’t seem to be dangerous here. I’ll drop a scent.

Ant 1: oh cool, a safe place. And I didn’t die either. I’ll reinforce that.

Ant 142,857,098,277: cool anthill.

fc417fc802jen729w18 hours ago

The dynamics of ant nest creation are way more complicated than that. The evolved biological parallel of a procedural generation algorithm. In addition, the completed structure has to be compatible with the various programmed behaviors of the workers.

canjobearthemafia19 hours ago

> There is a significant quality difference between English language output and other languages

florencanjobear19 hours ago

They're saying LLMs do better when outputting English than other languages, an assertion I'm not really able to test but have heard elsewhere.

bryanrasmussenfloren19 hours ago

and this is somehow not related to the size and availability of corpora in English?

florenbryanrasmussen18 hours ago

No, I'm quite sure that's why it's better.

bryanrasmussenfloren18 hours ago

OK but then that goes back to their other assertion that it gives a huge hint at what is going on behind the scenes. Is that huge hint just "more data gives better results!"? If so, that doesn't seem at all important since that is the absolutely central idea of an LLM. That is not behind the scenes at all, that is the introduction to the play as written by the author.

Not your fault obviously, but they have not yet described what that huge hint is, and I'm just at the edge of my seat with anticipation here.

fc417fc802belZaah19 hours ago

> but do not have even a theory about how the behavior emerges from among the math

Actually we have an awful lot of those.

I'm not sure if emergent is quite the right term here. We carefully craft a scenario to produce a usable gradient for a black box optimizer. We fully expect nontrivial predictions of future state to result in increasingly rich world models out of necessity.

It gets back to the age old observation about any sufficiently accurate model being of equal complexity as the system it models. "Predict the next word" is but a single example of the general principle at play.

hnfongfc417fc80218 hours ago

> black box optimizer

This is an admission we don't know how it emerges.

Sure, we expect the behavior to emerge, but we don't know how.

fc417fc802hnfong18 hours ago

No, as I said, we have _lots_ of theories about exactly that at various levels of detail. The theories vary based on (at least) the specifics of the loss function being employed to construct the gradient. Giving an overview of that is far beyond the scope of this comment section (but it's well trodden ground so you can just go ask an LLM).

The "black box" bit refers to a generic, interchangeable optimization algorithm that simply makes the number go down (or up or whatever).

There are certainly various details about the internal workings of models that we don't properly understand but a blanket claim about the whole is erroneous.

netfortiusbelZaah19 hours ago

I'd rather go the route of bats [1]

[1] https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F

WD-4219 hours ago

This is really hard to judge because by the looks of it, finance papers mostly consist of gobbledygook and extensive filler to begin with.

sp4cemonekyWD-4219 hours ago

This. Verbalism lands really well to verbalism.

cyanydeezWD-4219 hours ago

Economics is the attempt to take sociology and add numbers to make it look like a hard science. The fintechbros then seem to think because they can make numbers go up that this is proof it's a hard science.

Tarq0ncyanydeez19 hours ago

That's entirely missing the point. "All models are wrong, but some are useful". You can test hypotheses and learn things even about chaotic or emergent systems.

friendzisTarq0n18 hours ago

> You can test hypotheses and learn things even about chaotic or emergent systems.

Ah yes, the famous "Cut GDP in half, abolish public schooling and use that as a control" experiment. Majority of economic "models" are entirely correlational without any mechanistic explanation whatsoever or an explanation so superficial that it contradicts either itself or observed reality.

If you look deeper and read explanatory notes of economic laws, the model may refer to some publications, but then the actual figures plugged into the model are explained as "these values have been observed to lead to the desired outcomes, therefore are set without any modeling or validation, hope for the best, lesssgoooo".

tolerance19 hours ago

It’s interesting to read about the use and leverage of LLMs outside of programming.

I’m not too familiar with the history, but the import of this article is brushing up on my nose hairs in a way that makes me think a sort of neo-Sophistry is on the horizon.

themafia19 hours ago

> The comments it offered were on the par of the best comments I’ve received on a paper in my entire academic career.

Sort of the lowest hanging fruit imaginable. Just because it became "fundamental" to the process doesn't mean it gained any quality.

libraryofbabel19 hours ago

I have come to think “predict the next token” is not a useful way to explain how LLMs work to people unfamiliar with LLM training and internals. It’s technically correct, but at this point saying that and not talking about things like RLVR training and mechanistic interpretability is about as useful as framing talking with a person as “engaging with a human brain generating tokens” and ignoring psychology.

At least AI-haters don’t seem to be talking about “stochastic parrots” quite so much now. Maybe they finally got the memo.

qseralibraryofbabel19 hours ago

>“predict the next token” is not a useful way

That is the exact thing to say because that is exactly what it does, despite how it does so.

It is not useful to say it if you are an AI-shill though. You brought up AI-haters, so I think I am entitled to bring up AI-shills.

vascoqsera18 hours ago

My neurons are also just passing electric signals back and forward and exchanging water and salts with the rest of my body.

qseravasco18 hours ago

> just passing electric signals back and forward

Ok, feel free to call yourselves a toaster, I don't mind!

vascoqsera17 hours ago

What, reductionism only works when you do it?

qseravasco17 hours ago

I didn't

stephenrvasco14 hours ago

I mean that's really just a comparison to how silicon circuits work though isn't it.

"Thinking rocks" vs "thinking meat sacks" isn't much of a distinction really.

Conversely if you approach conversations the same way an LLM does and just repeat what you've heard other people say a lot without actually knowing what it means then you're also likely to be compared to a feathery chatterbox.

dylan604libraryofbabel19 hours ago

I think talking to people unfamiliar with LLM training using words like "RLVR training and mechanistic interpretability" is about as useful as a grave robber in a crematorium.

libraryofbabeldylan60418 hours ago

Obviously you don’t just say those words and leave it at that. Both those things can be explained in understandable terms. And even having a superficial sense of what they are gives people a better picture of what modern LLMs are all about than tired tropes from three years ago like “they’re just trained to predict the next token in the training data, therefore…”

stephenrlibraryofbabel19 hours ago

> stochastic parrots

I prefer to use the term "spicy autocomplete" myself.

measurablefunclibraryofbabel19 hours ago

Sampling over a probability distribution is not as catchy as "stochastic parrot" but I have personally stopped telling believers that their imagined event horizon of transistor scale is not going to deliver them to their wished for automated utopia b/c one can not reason w/ people who did not reach their conclusions by reasoning.

goatloverlibraryofbabel18 hours ago

Must one be an "AI-hater" to use the term "stochastic parrot"? Which is probably in response to all the emergent AGI claims and pointless discussions about LLMs being conscious.

imiriclibraryofbabel18 hours ago

Technical concepts can be broken down into ideas anyone can understand if they're interested. Token prediction is at the core of what these tools do, and is a good starting point for more complex topics.

On the other hand, calling these tools "intelligent", capable of "reasoning" and "thought", is not only more confusing and can never be simplified, but dishonest and borderline gaslighting.

Alex_L_Woodlibraryofbabel18 hours ago

“Stochastic parrots” only stopped because AI fanboys stopped screaming “AGI” and “it will replace everyone”. Maybe they finally got the memo?

wavemode19 hours ago

> the kind of analysis the program is able to do is past the point where technology looks like magic. I don’t know how you get here from “predict the next word.”

You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.

All that being said, the refine.ink tool certainly has an interesting approach, which I'm not sure I've seen before. They review a single piece of writing, and it takes up to an hour, and it costs $50. They are probably running the LLM very painstakingly and repeatedly over combinations of sections of your text, allowing it to reason about the things you've written in a lot more detail than you get with a plain run of a long-context model (due to the limitations of sparse attention).

It's neat. I wonder about what other kinds of tasks we could improve AI performance at by scaling time and money (which, in the grand scheme, is usually still a bargain compared to a human worker).

selridgewavemode19 hours ago

>You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data.

This is just as stuck in a moment in time as "they only do next word prediction". What does this even mean anymore? Are we supposed to believe that a review of this paper, which wasn't written when that model (It's putatively not an "LLM", but IDK enough about it to be pushy there) was trained, is in the training data? Does that even make sense? We're not in the regime of regurgitating training data (if we really ever were). We need to let go of these frames which were barely true when they took hold. Some new shit is afoot.

wavemodeselridge19 hours ago

Statistical models generalize. If you train a model that f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2)

Similarly, if there are millions of academic papers and thousands of peer reviews in the training data, a review of this exact paper doesn't need to be in there for the LLM to write something convincing. (I say "convincing" rather than "correct" since the author himself admits that he doesn't agree with all the LLM's comments.)

I tend to recommend people learn these things from first principles (e.g. build a small neural network, explore deep learning, build a language model) to gain a better intuition. There's really no "magic" at work here.

selridgewavemode18 hours ago

Ok cool cool. Instead of pretending you need to teach me, you could engage with what I'm saying or even the OP!

"I don't know how you get here from "predict the next word"" is not really so much a statement of ignorance where someone needs you to step in but a reflection that perhaps the tech is not so easily explained as that. No magic needs to be present for that to be the case.

wavemodeselridge18 hours ago

If you disagree with someone on the internet, you can just say "I disagree, and here's why". You don't have to aggressively accuse them of "not engaging" with the text.

I engaged. You just don't like what I wrote. That's okay.

selridgewavemode18 hours ago

Thanks but no thanks.

c22wavemode18 hours ago

> If you train a model that f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2)

This is an interesting claim to me. Are there any models that exist that have been trained with a (single digit) number omitted from the training data?

If such a model does exist, how does it represent the answer? (What symbol does it use for the '7'?)

wavemodec2218 hours ago

When I say "model" here I'm referring to any statistical model (in this example, probably linear regression). Not specifically large language models / neural networks.

c22wavemode18 hours ago

Gotcha, I don't think I know enough about it. What constitutes training data for a (non neural network) statistical model? Is this something I could play around with myself with pen and paper?

heavyset_goc2217 hours ago

You can write an f(x) and record the input and output and that can be your training data. Or just download some time-series data or something.

nairboonc2217 hours ago

Just the raw numbers? You list the y's and the x's and the model is approximating y=f(x) from the above example. You can totally do it with pen and paper. This is what it'd look like (for linear regression): https://observablehq.com/@yizhe-ang/interactive-visualization-of-linear-regression
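For anyone who wants to try the pen-and-paper version, here is a minimal sketch of ordinary least squares on the f(x) = 5, f(x+1) = 6 example from upthread. Placing the two training points at x = 0 and x = 1 is an assumption made for illustration, and the helper name `fit_line` is made up.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training data: only the outputs 5 and 6 are ever seen.
xs = [0, 1]
ys = [5, 6]

slope, intercept = fit_line(xs, ys)
prediction = intercept + slope * 2  # the model's guess for the unseen point
print(slope, intercept, prediction)  # 1.0 5.0 7.0
```

The number 7 never appears in the training data; it falls out of the fitted slope and intercept. That only works because a straight line was chosen as the model in advance.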

red75primewavemode18 hours ago

I think the relevant question is: can a statistical model (or a transformer, in particular) generalize to general reasoning ability?

kristiandupontwavemode18 hours ago

I had Claude help me get a program written for Linux to compile on macOS. The program is written in a programming language the author invented for the project, a pretty unusual one (for example, it allows spaces in variable names).

Claude figured out how the language worked and debugged segfaults until the compiler compiled, and then until the program did. That might not be magic, but it shows a level of sophistication where referring to “statistics” is about as meaningful as describing a person as the statistics of electrical impulses between neurons.

compass_copiumkristiandupont18 hours ago

But the programming language has explicitly laid out rules. It was not trained on those sets of rules, but it was trained on many trillions of lines of code. It has a map of how programs work, and an explanation of this new language. It's using training data and data it's fed to generate that result.

selridgecompass_copium18 hours ago

What doesn't that explain tho?

What behavior would you need to see for that explanation to no longer hold? Because it seems like it explains too much.

BobaFloutistselridge8 hours ago

I don't know how you'd prompt this, but if there was a clean example of an A.I. coming up with an idea that's completely novel in more than details, it would be compelling evidence that these next-token predictors have some weird emergent properties that don't necessarily follow from intricate, sophisticated webs of token-prediction.

E.g. "What might be a room-temperature superconductor" -> "some plausible iteration on existing high-temperature superconductors based on our current understanding of the underlying physics" would not be outside how we currently understand them.

"What might be a room-temperature superconductor?" -> "some completely outlandish material that nobody has studied before and, when examined, seems to have higher temperature superconducting than we would predict" would provoke some serious questions.

A fun experiment I've heard suggested is training a model on all scientific understanding just up to some counterintuitive quantum leap in scientific understanding, say, Einstein's theory of relativity, and then seeing if you can prompt it to "discover" or "invent" said leap, without explicitly telling it what to look for. This would of course be pretty hard to prove, but if you could get it to work on a local model, publish the training set and parameters so that anyone can replicate it on their own machine, that could be pretty darn compelling.

selridgeBobaFloutist6 hours ago

Why would it matter whether or not the robot looks something up if it makes a novel discovery?

Why would it matter that the discovery wasn't just novel but felt like an unconventional one to me, someone who is probably a total outsider to that field?

Both of those feel subjective or at least hard to sustain.

Look. What I'm trying to tell people is that the easy explanations for how these models worked circa GPT-2 are just not cutting it anymore. Neither is setting some subjective and needlessly high bar for...what exactly? What? Do we decide to pay attention to AI after it does all the above? That seems a bit late to the party for cheering on or resisting it.

Some new shit is afoot. Folk need to pay attention, not think they got it figured out already.

compass_copiumselridge7 hours ago

Programs are fundamentally lists of instructions. LLMs are very good at building these lists. That it performs well when you say "Build a list you've seen before, but do it in a slightly different way this time. Here's the exact way I want you to do it." is not surprising. I would honestly be surprised if it couldn't do it.

As the other commenter suggested, a genuinely novel scientific idea would be surprising. A new style of art (think Picasso or Pollock coming along), not just an iteration on Ghibli, would be surprising. That's actual creativity.

selridgecompass_copium7 hours ago

>I would honestly be surprised if it couldn't do it.

You'd be surprised if an LLM couldn't write *any* program?

orfcompass_copium17 hours ago

That’s still over-general to the point of being useless.

What you wrote would apply to a human approaching this task as well, sans the “many trillion lines of code”.

Kim_Bruningwavemode18 hours ago

If you run an LLM in an autoregressive loop you can get it to emulate a turing machine though. That sort of changes the complexity class of the system just a touch. 'Just predicts the next word' hits different when the loop is doing general computation.

Took me a bit of messing around, but try to write out each state sequentially, with a check step between each.
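A rough sketch of the loop structure, under heavy assumptions: a hand-written lookup table stands in for the model call, and the machine is a made-up three-rule binary incrementer. Nothing here calls an actual LLM. It only shows how a fixed "predict the next configuration" function becomes general computation once its output is fed back in.

```python
# (state, symbol under head) -> (symbol to write, head move, next state)
RULES = {
    ("carry", "1"): ("0", -1, "carry"),
    ("carry", "0"): ("1", 0, "halt"),
    ("carry", "_"): ("1", 0, "halt"),
}

def predict_next(config):
    """Stand-in for one model call: map a configuration to the next one."""
    state, tape, head = config
    if head < 0:  # ran off the left edge, so grow the tape with a blank
        tape = "_" + tape
        head = 0
    write, move, new_state = RULES[(state, tape[head])]
    tape = tape[:head] + write + tape[head + 1:]
    return (new_state, tape, head + move)

def run(tape):
    """Autoregressive loop: each output becomes the next input."""
    config = ("carry", tape, len(tape) - 1)  # start at the last bit
    trace = [config]
    while config[0] != "halt":
        config = predict_next(config)
        trace.append(config)
    return trace

trace = run("1011")
result = trace[-1][1]
print(result)  # 1100, i.e. 11 + 1 = 12 in binary
```

Writing out every intermediate configuration, as `trace` does, corresponds to the "write out each state sequentially" advice above. A check step would go between iterations of the `while` loop.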

arkhwavemode17 hours ago

I expected (and still expect) a lot from LLMs in cross-disciplinary research.

I think they should be the perfect tool to find methods or results in one field that look like they could be used in another field.

WithinReasonarkh17 hours ago

This might actually be a limitation of the "predict next word" approach since the network is never trained to predict a result in one field from a result in another. It might still make the connection though, but not as easily.

ainchwavemode16 hours ago

Sorry but this is famously not true! There is no guarantee that statistical models generalise. In your example, whether or not your model generalises depends entirely on what f(x) you use - depending on the complexity of your function class f(x+2) could be 7, 8, or -500.

One of the surprises of deep learning is that it can, sometimes, defy prior statistical learning theory to generalise, but this is still poorly understood. Concepts like grokking, double descent, and the implicit bias of gradient descent are driving a lot of new research into the underlying dynamics of deep learning. But I'd say it is pretty ahistoric to claim that this is obvious or trivial - decades of work studied "overfitting" and related problems where statistical models fail to generalise or even interpolate within the support of their training data.
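A small sketch of the function-class point, using a made-up one-parameter family. Every model below fits the two training points (placed at x = 0 and x = 1 for illustration) exactly, yet they disagree wildly on the unseen input.

```python
train = [(0, 5), (1, 6)]

def make_model(c):
    # Family 5 + x + c*x*(x-1): the last term is zero at x = 0 and x = 1,
    # so every choice of c reproduces the training data perfectly.
    return lambda x: 5 + x + c * x * (x - 1)

models = {
    "linear": make_model(0.0),
    "gentle curve": make_model(0.5),
    "wild curve": make_model(-253.5),
}

predictions = {}
for name, f in models.items():
    # Zero training error for every model in the family.
    assert all(f(x) == y for x, y in train)
    predictions[name] = f(2)

print(predictions)  # {'linear': 7.0, 'gentle curve': 8.0, 'wild curve': -500.0}
```

The training data alone cannot choose between these. Whatever prefers the linear one (a restricted model class, regularization, or the implicit bias of the optimizer) is doing the generalizing.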

anon7725selridge18 hours ago

“Represented in the training data” does not mean “represented as a whole in the training data”. If A and B are separately in the training data, the model can provide a result when A and B occur in the input because the model has made a connection between A and B in the latent space.

selridgeanon772518 hours ago

Yes. I’m saying that “it’s just in the training data” is a cognitive containment of these models which is incomplete. You can insist that’s what’s happening, but you’ll be left unable to explain what’s going on beyond truisms.

WithinReasonselridge16 hours ago

It's called "generalization":

https://en.wikipedia.org/wiki/Generalization_(learning)

selridgeWithinReason7 hours ago

>"If A and B are separately in the training data, the model can provide a result when A and B occur in the input because the model has made a connection between A and B in the latent space."

This statement (The one I was replying to) is fundamentally unbounded. There's nothing that can't be explained as a combination of "A" and "B" in "training data" because practically speaking we can express anything as such where the combination only needs to be convex along some high-dimensional semantic surface. Add on to that my scare quotes around "training data" because very few people have any practical idea of what is or isn't in there, so we can just make claims strategically. Do we need to explain a success? It was in the training data. A failure, probably not in the training data. Will anyone call us on this transparent farce? Not usually, no.

If a statement can--at will--explain everything and nothing, what's it worth?

jjmarrwavemode18 hours ago

I created a code review pipeline at work with a similar tradeoff and we found the cost is worth it. Time is a non-issue.

We could run Claude on our code and call it a day, but we have hundreds of style, safety, etc rules on a very large C++ codebase with intricate behaviour (cooperative multitasking be fun).

So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.

"scaling time" on the other hand is useless. You can just divide the problem with subagents until it finishes within a few minutes, because that also increases quality due to less context/more focus.

smallpipejjmarr16 hours ago

> This has completely replaced human code review for anything that isn't functional correctness

Isn’t functional correctness pretty much the only thing that matters though?

grey-areasmallpipe15 hours ago

Well no, style is important too for humans when they read a codebase, so the LLMs the parent is running clearly have some value for them.

They're not claiming LLMs solved every problem, just that they made life easier by taking care of busywork that humans would otherwise be doing. I think personally this is quite a good use for them - offering suggestions on PRs say, as long as humans still review them as well.

1718627440grey-area14 hours ago

But isn't style already achievable by running e.g. GNU indent?

jjmarr17186274407 hours ago

Some examples of complex transformations linters can't catch:

* Function names must start with a verb.

* Use standard algorithms instead of for loops.

* Refactor your code to use IIFEs to make variables constexpr.

The verb one is the best example. Since we work adjacent to hardware, people like creating functions on structs representing register state called "REGISTER_XYZ_FIELD_BIT_1()" and you can't tell if this gets the value of the first field bit or sets something called field bit to 1.

If you rename it to `getRegisterXyzFieldBit1()` or `setRegisterXyzFieldBitTo1()` at least it becomes clear what they're doing.

aktaujjmarr11 hours ago

Any LLM-based code review tooling I've tried has been lackluster (most comments not too helpful). Prose review is usually better.

> So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.

Sure, you could make multiple LLM invocations (different temperature, different prompts, ...). But how does one separate the good comments from the bad comments? Another meta-LLM? [1] Do you know of anyone who summarizes the approach?

[1]: I suppose you could shard that out for as much compute as you want to spend, with one LLM invocation judging/collating the results of (say) 10 child reviewers.
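A minimal sketch of that fan-out-and-collate shape, under loud assumptions: the three reviewers are hypothetical stubs standing in for separate LLM invocations, and the "judge" is replaced by a plain vote count (keep a comment only if enough reviewers raised it independently). Real reviewers rarely phrase a comment identically, so exact string matching is a simplification; matching equivalent comments is the part a judge model would have to do.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical reviewer type: takes a diff, returns review comments.
Reviewer = Callable[[str], list[str]]


def collate(diff: str, reviewers: Iterable[Reviewer], min_votes: int = 2) -> list[str]:
    """Keep comments that at least `min_votes` independent reviewers raised."""
    votes: Counter[str] = Counter()
    for review in reviewers:
        # Use a set so one reviewer repeating itself counts only once.
        votes.update(set(review(diff)))
    return sorted(comment for comment, n in votes.items() if n >= min_votes)


# Stub reviewers standing in for separate LLM calls (assumed, not a real API).
def reviewer_a(diff: str) -> list[str]:
    return ["function name should start with a verb", "nit: trailing whitespace"]


def reviewer_b(diff: str) -> list[str]:
    return ["function name should start with a verb", "prefer std::find over the loop"]


def reviewer_c(diff: str) -> list[str]:
    return ["prefer std::find over the loop", "function name should start with a verb"]


kept = collate("example diff", [reviewer_a, reviewer_b, reviewer_c])
print(kept)
```

Raising `min_votes` trades recall for precision: the lone whitespace nit is dropped here because only one reviewer mentioned it.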

DustinKlentaktau11 hours ago

I have attempted to replicate the "workflow" LLM process, where several LLMs come up with different ways to solve a problem, a "judge" LLM reviews them, and the candidates then go through different verification processes, to see whether the workflow increased the LLMs' accuracy at solving the problem. In my experiments it didn't make much difference, but at the time I was using LLMs significantly dumber than current frontier models. HOWEVER... when I enable "Thinking Mode" on frontier LLMs like ChatGPT, it DOES tend to solve problems that the non-thinking mode can't, so perhaps it's just a matter of throwing enough iterations at it for the LLM to solve a particular complex problem.

ivansavzaktau10 hours ago

> But how does one separate the good comments from the bad comments?

One thing that works very well for me (in a different context) is to ask to return two lists:

- Things that I must absolutely fix (bugs, typos, logic mistakes, etc.)

- Lesser fixes and other stylistic improvements

Then I look only at the first list.

jjmarraktau7 hours ago

You need human alignment on what constitutes a "good" comment. That means consistent rules.

Otherwise, some people feel review is too harsh, other people feel it is not harsh enough. AI does not fix inconsistent expectations.

> But how does one separate the good comments from the bad comments?

If the AI took a valid interpretation of the coding guidelines, it is a legitimate comment. If the AI is being overly pedantic, it is a documentation bug and we change the rules.

Kim_Bruningwavemode17 hours ago

> You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.

I made a cursed CPU in the game 'Turing Complete' and had an older version of Claude build me an assembler for it.

Good luck finding THAT in the training data. :-P

(just to be sure, I then had it write actual programs in that new assembly language)

withinboredomKim_Bruning17 hours ago

But the ideas are not 'new'. A benchmark that I use to tell me if an AI is overfitted is to present the AI with a recent paper (especially one like a Paxos variant) and have it build that. If it writes general Paxos instead of what the paper specified, it's overfitted.

Claude 4.5: not overfitted too much -- does the right thing 6/10 times.

Claude 4.6: overfitted -- does the right thing 2/10 times.

OpenAI 5.3: overfitted -- does the right thing 3/10 times.

These aren't perfect benchmarks, but they let me know how much babysitting I need to do.

My point being that older Claude models weren't overfitted nearly as much, so I'm confirming what you're saying.

Kim_Bruningwithinboredom16 hours ago

Could also be that the model has stronger priors wrt Paxos (and thus has Opinions on what good Paxos should look like)

At any rate, with an assembler, you end up with a lot of random letter-salad mnemonics with odd use cases, so that is very likely to tokenize in interesting ways at the very least.

withinboredomKim_Bruning14 hours ago

I was just using Paxos as an example. Any paper will do.

mnewme19 hours ago

Is this an ad? Seems like it. The text is not really what the headline suggests.

pianom4nmnewme18 hours ago

Do you think the submitter intended this as an ad? His post history doesn't seem suspicious.

Or do you think the article's author wrote this as an ad? He's a reputable academic who seems impressed with an AI tool he used and is honestly sharing his thoughts.

For reference he published the 80 page inflation mini-book 2 weeks ago asking for feedback: https://www.grumpy-economist.com/p/inflation

deauxpianom4n17 hours ago

> Or do you think the article's author wrote this as an ad? He's a reputable academic who seems impressed with an AI tool he used and is honestly sharing his thoughts.

Ghuntley used to be reputable on here, then the crypto money looked too juicy.

pianom4ndeaux7 hours ago

Are you seriously comparing a random hacker to a lifelong academic for their odds of becoming a crypto shill?

callmeal19 hours ago

The "predict the next word" to a current LLM is at the same level as a "transistor" (or gate) is to a modern CPU. I don't understand LLMs enough to expand on that comparison, but I can see how having layers above that, which feed the layers below to "predict the next word" and use the output to modify the input, leads to what we see today. It is turtles all the way down.

brookstcallmeal19 hours ago

It’s a good comparison. It’s about abstraction and layers. Modern LLMs aren’t just models, they’re all the infrastructure around prompting and context management and mixtures of experts.

The next-word bit may be slightly higher than an individual transistor, possibly functional units.

echeloncallmeal18 hours ago

Humans are future predictors. Our vision systems, our mental models of our careers. People that predict the future tend to do well financially.

Now the machines are getting better than we are. It's exciting and a little bit terrifying.

We were polymers that evolved intelligence. Now the sand is becoming smart.

qseraechelon18 hours ago

>Now the machines are getting better than we are

Then AI companies should stop looking for investors and instead play stock markets with all that predictive powers!

echelonqsera18 hours ago

The real money is in using the models to build utility and money-making companies. You're removed from orders of magnitude in upside potential if you have to wait for the public markets.

qseraechelon17 hours ago

> money-making companies

You mean, money sucking companies, right?

>You're removed from orders of magnitude in upside potential if you have to wait for the public markets.

because that won't work. That is why!

echelonqsera17 hours ago

> You mean, money sucking companies, right?

Is that what you (and all people) are in your job function? A money suck?

Do you ever buy anything for food, shelter, and clothing? Do you have hobbies?

Capitalism means we don't have to all be hunter-gatherers, and I'm pretty keen on that trade.

> because that won't work. That is why!

This is the forum for a venture capital firm. A lot of the folks here build things with the intention of creating value and getting compensated for that value creation. Other valid options are sitting at home and playing video games, reading a book, or posting on HN.

I like working on problems where I'm the customer and where I would buy the product if it existed. Turns out, there tend to be other people who would buy my software too.

qseraechelon16 hours ago

I get paid for what I do. Not for what I promise to do.

ejoltocallmeal18 hours ago

There is a big difference, because I understand how those transistors produce a picture on a screen, I don’t understand how LLMs do what they do. The difference is so big that the comparison is useless.

jculejolto18 hours ago

I understand how transistors work too, and how they can result in a picture on a screen. But I think most people outside the software / electronics areas don't and to them it's just magic.

visarga19 hours ago

> Nothing you write will matter if it is not quickly adopted to the training dataset.

That is my take too, I was surprised to see how many people object to their works being trained on. It's how you can leave your mark, opening access for AI, and in the last 25 years opening to people (no restrictions on access, being indexed in Google).

mbgerringvisarga19 hours ago

People who produced the works LLMs are trained on are not compensated for the value they are now producing, and their skills are increasingly less valued in a world with LLMs. The value the LLMs are producing is being captured by employees of AI companies who are driving up rent in the Bay Area, and driving up the cost of electricity and water everywhere else.

Your surprise to people’s objections makes sense if you can’t count.

chiimbgerring18 hours ago

> People who produced the works LLMs are trained on are not compensated for the value they are now producing

the value being extracted via LLM techniques is new value, which did not previously exist. The producer(s) of the old data had an asking price, which was taken by the LLM trainers. They cannot make the argument that since the LLM is producing new value, they should retroactively update their old asking price for their works.

They could update their asking price for any new works they produce. They also have the right to ask their works not be used for training, etc. But they cannot ask their old works to be paid for by the new uses in LLM in a retroactive way.

GolfPopperchii18 hours ago

>The producer(s) of the old data had an asking price, which was taken by the LLM trainers.

This is... blatantly untrue?

https://arstechnica.com/tech-policy/2026/02/microsoft-removes-guide-on-how-to-train-llms-on-pirated-harry-potter-books/

https://www.theatlantic.com/technology/archive/2025/03/libgen-meta-openai/682093/

mbgerringchii13 hours ago

> They could update their asking price for any new works they produce. They also have the right to ask their works not be used for training, etc.

Someone else already pointed out that many works used to train LLMs were stolen, but also, it’s unclear whether this is true, either. Can you opt out? Because copyright should have been enough to prevent a company from stealing and profiting from your work, but it wasn’t in the case of every existing LLM.

joquarkymbgerringan hour ago

> not compensated for the value they are now producing

"To promote the Progress of Science and useful Arts" is the basis for this right.

Does the old way still promote the Progress of Science and useful Arts sufficiently to stifle the new way?

heavyset_govisarga18 hours ago

Most people value their time and work and don't want to give it away for free to some billionaire so they can reproduce it as slop for their own private profit.

That's to say, most people recognize when they're getting fucked over and are correct to object to it.

Morromistvisarga17 hours ago

"On reflection I have started to worry again. In 10 to 20 years nobody will read anything any more, they just will read LLM digests. So, the single most important task of a writer starting right now is to get your efforts wired in to the LLMs"

Your words will be like a drop in the ocean, an ocean where the water volume keeps increasing every year. Also, if nobody reads anything anymore, what's the point?

retrac19 hours ago

I know this sounds insane but I've been dwelling on it. Language models are digital Ouija boards. I like the metaphor because it offers multiple conflicting interpretations. How does a Ouija board work? The words appear. Where do they come from? It can be explained in physical terms. Or in metaphysical terms. Collective summing of psychomotor activity. Conduits to a non-corporeal facet of existence. Many caution against the Ouija board as a path to self-inflicted madness, others caution against the Ouija board as a vehicle to bring poorly understood inhuman forces into the world.

brookstretrac19 hours ago

Ouija boards are just collective negotiation among people.

nekusarretrac18 hours ago

There are two completely different ways to understand how a Ouija board works: Occult and Scientific.

Scientific: It's a blend of the unconscious responses of everyone participating. In other words, it's a probabilistic result of an "answer" to the question everyone hears.

Occult: If an entity is present, it's basically the unshielded response of that entity, which collectively moves everyone's body the same way, as a form of mild channeling. Since Ouija doesn't specify making a circle and requesting the presence of a specific entity, there's a good chance of some being hostile. Or you all get nothing at all, and basically garbage as the divination/communication.

But comparing Ouija to LLMs? The LLM, with the same weights, with the same hyperparameters, and same questions will give the same answers. That is deterministic, at least in that narrow sense. An Ouija board is not deterministic, and cannot be tested in any meaningful scientific sense.
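To make that narrow sense concrete, here is a toy sketch, not any real model's sampler. It assumes the sampling seed counts as one of the "hyperparameters", and the logits, temperature and seed values are made up for illustration. Real serving stacks can add other sources of variation that this ignores.

```python
import math
import random


def sample_tokens(logits: dict[str, float], temperature: float, seed: int, n: int) -> list[str]:
    """Draw n tokens from softmax(logits / temperature) using a seeded RNG."""
    rng = random.Random(seed)
    tokens = list(logits)
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)
    # Subtracting the peak keeps exp() numerically stable; random.choices
    # normalizes the weights itself, so these are unnormalized probabilities.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(tokens, weights=weights, k=n)


# Hypothetical next-token scores for a single question.
logits = {"yes": 2.0, "no": 1.5, "maybe": 0.5}

first = sample_tokens(logits, temperature=0.8, seed=42, n=10)
second = sample_tokens(logits, temperature=0.8, seed=42, n=10)
print(first == second)
```

Same weights (here, the logits), same temperature, same seed: the two runs draw the identical sequence, which is what makes the setup testable in a way a Ouija session is not.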

ChaitanyaSai19 hours ago

The whole next word thing is interesting, isn't it? I like to see it with Dennett's "Competence and comprehension" lens. You can predict the next word competently with shallow understanding. But you could also do it well with understanding or comprehension of the full picture. A mental model that allows you to predict better. Are the AIs stumbling into these mental models? Seems like it. However, because these are such black boxes, we do not know how they are stringing these mental models together. Is it a random pick from 10 models built up inside the weights? Is there any system-wide cohesive understanding, whatever that means? Exploring what a model can articulate using self-reflection would be interesting. Can it point to internal cognitive dissonance because it has been fed both evolution and intelligent design, for example? Or do these exist as separate models to invoke depending on the prompt context, because all that matters is being rewarded by the current user?

halyconWaysChaitanyaSai18 hours ago

Searle's Chinese Room experiment but without knowing what's in the room, and when you try to peek in you just see a cloud of fog and are left to wonder if it's just a guy with that really big dictionary or something more intelligent.

selridgehalyconWays18 hours ago

It's an octopus, perhaps: https://aclanthology.org/2020.acl-main.463.pdf

There's also this blog post: https://julianmichael.org/blog/2020/07/23/to-dissect-an-octopus.html (which IMO is better to read than the paper)

grey-areaChaitanyaSai18 hours ago

Given their failure on novel logic problems, generation of meaningless text, tendency to do things like delete tests and incompetence at simple mathematics, it seems very unlikely they have built any sort of world model. It’s remarkable how competent they are given the way they work.

Predict the next word is a terrible summary of what these machines do though, they certainly do more than that, but there are significant limitations.

‘Reasoning’ etc are marketing terms and we should not trust the claims made by companies who make these models.

The Turing test had too much confidence in humans it seems.

shaknagrey-area18 hours ago

Probably worth remembering that ELIZA passed Turing tests, and was the definition of shallow prediction.

jbotzshakna16 hours ago

ELIZA absolutely did not ever pass anything resembling a real Turing test. A real Turing test is adversarial, the interrogator knows the testees are trying to fool him.

shaknajbotz16 hours ago

Landauer and Bellman absolutely put ELIZA to an adversarial Turing test, and called it such, in 1999. [0]

But... over in 2025, ELIZA was once again put to the Turing test in adversarial conditions. [1] And still had people think it was a real person, over 27% of the time. Over a quarter of the testees thought the thing was a human.

The "ELIZA Effect" wasn't coined because everyone understands that an AI isn't conscious.

[0] https://books.google.com.au/books?id=jTgMIhy6YZMC&pg=PA174

[1] https://arxiv.org/html/2503.23674v1

grey-areajbotz15 hours ago

Unfortunately I'm not sure the Turing test posited a minimal level of intelligence for the human testers. As we have found with LLMs, humans are rather easy to fool.

steve1977grey-area18 hours ago

> Predict the next word is a terrible summary of what these machines do though, they certainly do more than that

What would that be?

grey-areasteve197717 hours ago

They generate text based on quite a large context, including hidden prompts we don’t see and their weights are distorted heavily by training. So I think there’s a lot more than a simple probability of word x coming next. That makes ‘predict next word’ a reductive summary IMO.

I do not personally feel it resembles thinking or reasoning though and really object to that framing because it is misleading many people.

karamanolevgrey-area17 hours ago

> their weights are distorted heavily by training

What does that even mean? Their weights are essentially created by training. There aren't some magic golden weights that are then distorted.

grey-areakaramanolev15 hours ago

I may be using the wrong terms, my impression was:

1. Weights in the model are created by ingesting the corpus

2. Techniques like reinforcement learning, alignment etc are used to adjust those weights before model release

3. The model is used and more context injected which then affects which words it will choose, though it is still heavily biased by the corpus and training.

That could be way off base though, I'd welcome correction on that.

The point I was trying to make though was that they do more than predict next word based on just one set of data. Their weights can encode entire passages of source material in the training data (https://arxiv.org/abs/2505.12546), including books, programs. This is why they are so effective at generating code snippets.

Also text injected at the last stage during use has far less weight than most people assume (e.g. https://georggrab.net/content/opus46retrieval.html) and is not read and understood IMO.

There are a lot of inputs nowadays and a lot of stages to training. So while I don't think they are intelligent I think it is reductive to call them next token predictors or similar. Not sure what the best name for them is, but they are neither next word predictors nor intelligent agents.

karamanolevgrey-area15 hours ago

That extended explanation is more accurate, yes. I'd call your points 1 and 2 both training under the definition "anything that adjusts model weights is training". There are multiple stages and types of training. Right now AFAIK most (all) architectures then fix the weights and you have non-weight-affecting steps like the system prompt, context, etc.

You're right that the weights can enable the model to memorize training data.
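The split between weight-changing stages and weight-frozen use can be sketched with a toy stand-in. This is an assumption-heavy illustration: a bigram counter is not how LLMs are trained (there is no gradient descent here), the class and method names are hypothetical, and the "second stage" is just a more heavily weighted pass over extra text.

```python
class ToyBigramModel:
    """Weights are bigram counts; only train() ever changes them."""

    def __init__(self) -> None:
        self.weights: dict[str, dict[str, float]] = {}

    def train(self, text: str, strength: float = 1.0) -> None:
        """Any call that adjusts the weights counts as training."""
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            followers = self.weights.setdefault(prev, {})
            followers[nxt] = followers.get(nxt, 0.0) + strength

    def predict_next(self, context: str) -> str:
        """Inference: reads the weights and the context, writes nothing."""
        last = context.split()[-1]
        followers = self.weights.get(last)
        if not followers:
            return "<unk>"
        return max(followers, key=followers.get)

    def snapshot(self) -> dict[str, dict[str, float]]:
        return {k: dict(v) for k, v in self.weights.items()}


model = ToyBigramModel()
model.train("the cat sat on the mat the cat ran")    # stage 1: ingest the corpus
after_stage_1 = model.predict_next("the")
model.train("the dog barked", strength=5.0)          # stage 2: a later adjustment pass
frozen = model.snapshot()

# Stage 3: different contexts change the output, not the weights.
print(after_stage_1, model.predict_next("I saw the"), model.predict_next("the dog"))
print(model.snapshot() == frozen)
```

The second training pass changes what follows "the", while the two inference calls leave the snapshot untouched, which mirrors the point that prompt and context steer the output without touching the weights.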

joquarkykaramanolev3 hours ago

Alignment scrubs the underlying raw output to be socially acceptable. It's an artificial superego.

red75primegrey-area16 hours ago

> there are significant limitations

Where can we read about those significant limitations?

grey-areared75prime15 hours ago

Well here's some:

Confabulation/Hallucination - https://github.com/lechmazur/confabulations

Failure to read context - https://georggrab.net/content/opus46retrieval.html

Deleting tests to make them pass - https://www.linkedin.com/posts/jasongorman_and-after-it-did-all-that-the-tests-activity-7383354222240538624--jXv

Going rogue and deleting data - https://x.com/jasonlk/status/1946069562723897802

Agent security nightmares because they are not in fact intelligent assistants - https://x.com/theonejvo/status/2015401219746128322

Failure to read or generate structured data - https://support.google.com/gemini/thread/390981629/llm-ignored-constraints-injected-external-data-and-failed-to-read-file-uploads-google-sheets?hl=en

There are many, many examples, mostly caused by people thinking LLMs are intelligent and reasoning and giving them too much power (e.g. treating them as agents, not text generators). I'm sure they're all fixed in whatever new version came out this week though.

red75primegrey-area14 hours ago

Your sarcasm is misplaced. Without principled limitations that demonstrate the existence of a lower bound on the error rate and show that errors are correlated across invocations and models (so that you can't improve the error rate with multiple supervision), you can’t exclude the possibility that "they're all fixed in the new version" (for practical purposes).

joquarkygrey-area3 hours ago

I've seen all of these from human teammates in my 30+ years in tech.

Kim_Bruninggrey-area13 hours ago

So that might depend on the model, how long ago you last tested it, etc. I've seen LLMs solve novel logic problems, generate meaningful text, retain tests just fine, and simple mathematics on newer models is a lot better.

Btw if you read the actual paper that proposes the Turing test, Turing actually rejects the framing of "can machines think"; preferring to go for the more practical "can you tell them apart in practice".

grey-areaKim_Bruning6 hours ago

Yes, that’s the ‘too much confidence in humans’ bit - he didn’t count on some humans being easily fooled by prolix word generators. I’d be interested in his take on these generators but I think he’d be focussed on what was missing as well as the amazing progress we have seen.

Kim_Bruninggrey-area3 hours ago

So my reading of (Turing 1950)...

> "The original question, 'Can machines think?' I believe to be too meaningless to deserve discussion."

> "the question, 'Can machines think?' should be replaced by 'Are there imaginable digital computers which would do well in the imitation game?'"

> "according to this view the only way to know that a man thinks is to be that particular man. It is in fact the solipsist point of view... instead of arguing continually over this point it is usual to have the polite convention that everyone thinks."

... is: if it's practical to say the system can give meaningful input/output on xyz in -say- natural language; we might just go ahead and say it can think about xyz, because otherwise everyone's just going to go nuts inventing new terms every time.

grey-area!thinking, kim_bruning!thinking, pet_cat!thinking, octopus!thinking, claude_opus!thinking.

Can we leave out the '!' ? Nothing to do with fooling people. Just practical ways of dealing with the overall concept.

https://courses.cs.umbc.edu/471/papers/turing.pdf

baschChaitanyaSai18 hours ago

It's honestly disheartening and a bit shocking how everyone has started repeating the predict the next syllable criticism.

The language model predicts the next syllable by FIRST arriving at a point in space that represents UNDERSTANDING of the input language. This was true all the way back in 2017 at the time of Attention Is All You Need. Google had a beautiful explainer page of how transformers worked, which I am struggling to find. Found it. https://research.google/blog/transformer-a-novel-neural-network-architecture-for-language-understanding/

The example was and is simple and perfect. The word bank exists. You can tell what bank means by its proximity to words, such as river or vault. You compare bank to every word in a sentence to decide which bank it is. Rinse, repeat. A lot. You then add all the meanings together. Language models are making a frequency association of every word to every other word, and then summing it to create understanding of complex ideas, even if it doesn't understand what it is understanding and has never seen it before.
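A toy version of the "bank" example, as a hedged sketch: the 2-D embeddings are invented for illustration (axis 0 roughly "nature", axis 1 roughly "finance"), and a real transformer adds learned query/key/value projections, scaling, multiple heads and many stacked layers, all omitted here.

```python
import math

# Hypothetical embeddings; "bank" is deliberately ambiguous on its own.
EMBED = {
    "bank": [0.5, 0.5],
    "river": [1.0, 0.0],
    "vault": [0.0, 1.0],
    "the": [0.1, 0.1],
}


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(word: str, sentence: list[str]) -> list[float]:
    """Compare `word` to every word in the sentence, then sum the meanings."""
    query = EMBED[word]
    scores = [dot(query, EMBED[w]) for w in sentence]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the sentence's vectors: the context-mixed meaning.
    return [
        sum(w * EMBED[tok][i] for w, tok in zip(weights, sentence))
        for i in range(len(query))
    ]


river_bank = attend("bank", ["the", "river", "bank"])
vault_bank = attend("bank", ["the", "bank", "vault"])
print(river_bank, vault_bank)
```

The same word ends up at two different points: next to "river" the mixed vector leans toward the nature axis, next to "vault" it leans toward the finance axis.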

That all happens BEFORE "autocompleting the next syllable."

The magic part of LLMs is understanding the input. Being able to use that to make an educated guess of what comes next is really a lucky side effect. The fact that you can chain that together indefinitely with some random number generator thrown in and keep saying new things is pretty nifty, but a bit of a show stealer.

What really amazes me about transformers is that they completely ignored prescriptive linguistic trees and grammar rules and let the process decode the semantic structure fluidly and on the fly. (I know Google uses encode/decode backwards from what I am saying here.) This lets people write crazy run-on sentences that break every rule of English (or your favorite language) yet remain parsable as instructions.

It is really helpful to remember that transformers' origins are language translation. They are designed to take text and apply a modification to it, while keeping the meaning static. They accomplish this by first decoding meaning. The fact that they then pivoted from translation to autocomplete is a useful thing to remember when talking to them. A task a language model excels at is taking text, reducing it to meaning, and applying a template. So a good test might be "take Frankenstein, and turn it into a magic school bus episode." Frankenstein is reduced to meaning, the Magic School Bus format is the template, the meaning is output in the form of the template. This is a translation, although from English to English, represented as two completely different forms. Saying "find all the Wild Rice recipes you can, normalize their ingredients to 2 cups of broth, and create a table with ingredient ranges (min-max) for each ingredient option" is closer to a translation than it is to "autocomplete." Input -> Meaning -> Template -> Output. With my last example the template itself is also generated from its own meaning calculation.

A lot has changed since 2017, but the interpreter being the real technical achievement still holds true imho. I am more impressed with AI's ability to parse what I am saying than I am by its output (image models notwithstanding).

qserabasch18 hours ago

>represents UNDERSTANDING of the input language.

It does not have an understanding, it pattern matches the "idea shape" of words in the "idea space" of training data and calculates the "idea shape" that is likely to follow considering all the "idea shape" patterns in its training data.

It mimics understanding. It feels mysterious to us because we cannot imagine the mapping of a corpus of text to this "idea space".

It is quite similar to how mysterious a computer playing a movie can appear, if you are not aware of mapping of movie to a set of pictures, pictures to pixels, and pixels to co-ordinates and colors codes.

baschqsera18 hours ago

Semantics. It's an encoded position that represents meaning in a way that is useful and reusable. That is "understanding." It's a mathematical representation of grasp.

qserabasch17 hours ago

Yea, semantics is important. It is not "understanding" any more than a microphone+ADC is hearing.

baschqsera17 hours ago

agree to disagree. encoding a meaning is understanding. I cited a source using the word in the same way.

qserabasch17 hours ago

>agree to disagree.

Yea

>encoding a meaning is understanding.

encoding a meaning is encoding. Nothing more!

baschqsera17 hours ago

what is understanding but encoded meaning distilled into pure structure, in both cases a property of a pattern?

No need to gatekeep the word "understanding" behind subjective human experience eg qualia.

qserabasch17 hours ago

> No need to gatekeep the word "understanding" behind subjective human experience eg qualia.

Yea, I think gatekeeping is needed exactly for the same reason. Make up another word if you want..

joquarkyqsera2 hours ago

If you make up a word, nobody will know what it means.

fc417fc802qsera16 hours ago

The distinction you're making reads like substance dualism to me. Are you able to provide a clear and objective metric for assessing "understanding"? If not then you're just handwaving an effectively meaningless semantic distinction.

qserafc417fc80216 hours ago

>objective metric for assessing "understanding"

It should involve consciousness. You would not call an AI reacting to red color as "seeing" red. Same thing.

fc417fc802qsera15 hours ago

And where is this objective metric for consciousness? Last I checked we didn't even have a sensible definition for it.

It seems to me you're just kicking the can.

Setting that issue aside. While I certainly don't believe LLMs to be conscious (an entirely subjective and arbitrary take on my part I admit) I don't see any reason that concepts such as "intelligence" and "understanding" should require it. When considering how we apply those terms to humans it seems to me they are results based and highly contextual (ie largely arbitrary).

qserafc417fc80214 hours ago

>humans it seems to me they are results based and highly contextual (ie largely arbitrary).

Is that right? It seems that we generally say that "the computer is programmed to do", instead of "the computer understands" or "the computer knows", even if the programmed computer can produce the same result as a human who does it.

baschqsera12 hours ago

Language models aren’t “programmed” though.

qserabasch11 hours ago

You are right, it is worse.

It is generated by tweaking a bunch of `if` statements until the output starts to look about right.

fc417fc802qsera7 hours ago

Of course we don't say that. You can't ask the (traditionally) programmed computer a freeform question and get a sensible answer back. We tried that for going on 50 years and it never really worked. (The highest achievement that comes to mind is answering jeopardy questions.)

You can very carefully construct a query in a dedicated language, debug that query, and get useful results back. But that's clearly just a human using a tool, not a machine exhibiting understanding or general knowledge.

Meanwhile you can ask a multi-billion parameter LLM a freeform question in ~any human language and it can produce a coherent and meaningful response. It can one shot pieces of code. Track down bugs based on compiler error messages. It might not (yet) be human level in many cases but to get hung up on that is to miss the point.

baschqsera12 hours ago

If I “convey understanding” I transfer it from one person to another. Consciousness does not transfer with understanding.

Some people argue that consciousness emerges in early childhood. I can get an infant to understand what I am saying even if they aren’t conscious.

gkbrkqsera16 hours ago

A microphone + ADC is hearing though, that's the whole reason we even produce microphones. So that our electronics can hear sound.

qseragkbrk16 hours ago

So, according to you, when can you qualify something as capable of hearing?

1. Vibrate according to the input sound, is that hearing?

2. Generate electrical signals according to the sound, is that hearing?

3. Amplify the electrical signal, do we cross the hearing mark?

4. Record the signal to a cassette tape (or use an ADC -> mp3), are we hearing yet?

5. Play it back through a speaker. Sure, we should be hearing now!

At which point exactly would you say the thing is definitely hearing?

fc417fc802qsera6 hours ago

You can reduce the human auditory process to a similar mechanical list. At which specific point would you say a human is hearing?

You've fallen into the trap of human exceptionalism but you don't seem to be aware of that fact. Are you a substance dualist or not?

joquarkyqsera2 hours ago

Alan Watts talks about this.

If a tree falls, does it make a sound? It depends on whether there is somebody to ultimately perceive the vibrations that the falling tree made (either directly or via recording).

aqua_coderqsera18 hours ago

I am not knowledgeable about how transformers work, but what if we humans just do the same thing in our minds as well? What if our feeling of "understanding" is merely the emotional response to pattern matching, as you just said?

qseraaqua_coder17 hours ago

Yea, you said it. It is the feeling of understanding and feeling/sensing implies consciousness. Why does it matter? I don't know. All I know is that it is not the same thing, because a chunk of metal cannot feel. So I don't want it to be called by the same name.

When AI marketing (ab)uses the word, it is to project the appearance of human equivalence. And I don't like to fall for it.

joquarkyqsera2 hours ago

Psychopaths don't feel. Are they conscious?

dnauticsqsera17 hours ago

> pattern matches the "idea shape" of words in the "idea space"

it does much more than this. first layer has an attention mechanism on all previous tokens and spits out an activation representing some sum of all relations between the tokens. then the next layer spits out an activation representing relations of relations, and the next layer and so forth. the llm is capable of deducing a hierarchy of structural information embedded in the text.

not clear to me how this isn't "understanding".
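
To make the mechanism described above concrete, here is a minimal sketch of causal self-attention in plain Python. It is a simplification under loud assumptions: the toy vectors and the helper names (`softmax`, `causal_attention`) are invented for illustration, and queries, keys and values are the raw token vectors rather than the learned projections a real layer would apply. It only shows the shape of the idea: each position mixes information from itself and all earlier positions, and a second layer then mixes the already-mixed outputs.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(vectors):
    """For each position, mix the vectors at that position and all earlier ones.

    Assumption: queries, keys and values are the vectors themselves
    (a real layer applies learned projections first).
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # causal mask: no access to later tokens
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for three tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer1, weights1 = causal_attention(tokens)
# Stacking: the second layer attends over the outputs of the first,
# which is the "relations of relations" step.
layer2, weights2 = causal_attention(layer1)
```

The first token can only attend to itself, so its output equals its input; later tokens get a weighted blend of everything before them.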

steve1977basch18 hours ago

From what I understand, it's more like "input is 1, 3, 5, 7" so "output is likely to be 9".

Understanding would be a bit generous of a term for that I guess, but that also depends on the definition of understanding.

baschsteve197718 hours ago

I'd really invite people to read the google blog post. https://research.google/blog/transformer-a-novel-neural-network-architecture-for-language-understanding/

Google chose the word understanding.

steve1977basch18 hours ago

Thanks for the link, I will read it. But keep in mind that Google wants to sell us something.

baschsteve197718 hours ago

At the time, it was a free language translation tool. You weren't paying for transformers in 2017.

steve1977basch17 hours ago

True, but that doesn't mean that Google did not already have intentions to monetize it if possible.

baschsteve197717 hours ago

You would think, wouldn’t you?

And yet they waited until ChatGPT was a thing and threw Bard together overnight in response.

steve1977basch13 hours ago

Fair point ;)

nairboonbasch17 hours ago

Google chose "understanding" in that context, because the relevant AI/ML task is called "Natural language understanding". But that term is an aspiration. It's the problem of trying to reveal the "meaning" of text data (language) as in making sense of the symbols with computers.

Just because Transformers work well on the "Natural language understanding" task in AI, doesn't mean that a Transformer actually "understands" language in the human sense.

interloxiabasch17 hours ago

The task is language understanding. The tool is amazing. Pianos are amazing. The task is to create music. The process is to transform movement to sound. They don't understand music.

joquarkybasch2 hours ago

Even if it gets the output wrong, it always seems to provide some output that indicates that it got the input right. This is the first thing that really surprised me about this tech.

mzhaaseChaitanyaSai18 hours ago

It always occurred to me that LLMs may be like the language center of the brain. And there should be a "whole damn rest of the brain" behind it to steer it.

LLMs miss very important concepts, like the concept of a fact. There is no "true", just consensus text on the internet given a certain context. Like that study recently where LLMs gave wrong info if there was the biography of a poor person in the context.

steve1977mzhaase18 hours ago

I think much along the same lines. LLMs are probably even just a part of the language center.

And of course they also miss things like embodiment, mirror neurons etc.

If an LLM makes a mistake, it will tell you it is sorry. But does it really feel sorry?

red75primesteve197716 hours ago

> But does it really feel sorry?

And what does it mean to feel sorry? Beyond fallible and imprecise human introspective notion of "sorry", that is. A definition that can span species and computing substrates. A deanthropomorphized definition of "sorry", so to speak.

dnauticsmzhaase17 hours ago

that's unlikely. but they are an awful lot like turing machines (k/v cache ~ turing tape) so their architecture is strongly predisposed to be able to find any algorithm, possibly including reasoning

joquarkymzhaase3 hours ago

Ever practiced meditation of the form where you just witness your thoughts? It seems just like LLM generated words, both factual and confabulated nonsense.

throw310822ChaitanyaSai17 hours ago

> You can predict the next word competently with shallow understanding.

I don't get this. When you say "predict the next word" what you mean is "predict the word that someone who understands would write next". This cannot be done without an understanding that is as complete as that of the human whose behaviour you are trying to predict. Otherwise you'd have the paradox that understanding doesn't influence behaviour.

mekokaChaitanyaSai16 hours ago

> Are the AIs stumbling into these mental models? Seems like it.

Since nature decided to deprive me of telepathic abilities, when I want to externalize my thoughts to share with others, I'm bound to this joke of a substitute we call language. I must either produce sounds that encode my meaning, or gesture, or write symbols, or basically find some way to convey my inner world by using bodily senses as peripherals. Those who receive my output must do the work in reverse to extract my meaning, the understanding in my message. Language is what we call a medium that carries our meaning to one another's psyche.

LLMs, as their name alludes, are trained on language, the medium, and they're LARGE. They're not trained on the meaning, like a child would be, for instance. Saying that by their sole analysis of the structure and patterns in the medium they're somehow capable of stumbling upon the encoded meaning is like saying that it's possible to become an engineer, by simply mindlessly memorizing many perfectly relevant scripted lines whose meaning you haven't the foggiest.

Yes, on the surface the illusion may be complete, but can the medium somehow become interchangeable with the meaning it carries? Nothing indicates this. Everything an LLM does still very much falls within the parameters of "analyze humongous quantity of texts for patterns with massive amount of resources, then based on all that precious training, when I feed you some text, output something as if you know what you're talking about".

I think the seeming crossover we perceive is just us becoming neglectful in our reflection of the scale and significance of the required resources to get them to fool us.

js8ChaitanyaSai16 hours ago

Dennett also came to my mind, reading the title, but in a different sense. When people came up with theory of evolution, it was hard to conceive for many people, how do we get from "subtly selecting from random changes" to "build a complex mechanism such as human". I think Dennett offers a nice analogy with a skyscraper, how it can be built if cranes are only so tall?

In a similar way, LLMs build small abstractions, first on words, how to subtly rearrange them without changing meaning, then they start to understand logic patterns such as "If A follows B, and we're given A, then B", and eventually they learn to reason in various ways.

It's the scale of the whole process that defies human understanding.

(Also modern LLMs are not just next word predictors anymore, there is reinforcement learning component as well.)

intended18 hours ago

The article talks about LLMs reviewing Econ papers.

I’m hesitant to call this an outright win, though.

Perhaps the review service the author is using is really good.

Almost certainly the taste, expertise and experience of the author is doing unseen heavy lifting.

I found that using prompts to do submission reviews for conferences tended to make my output worse, not better.

Letting the LLM analyze submissions resulted in me disconnecting from the content. To the point I would forget submissions after I closed the tab.

I ended up going back to doing things manually, using them as a sanity check.

On the flip side, weaker submissions using generative tools became a nightmare, because you had to wade through paragraphs of fluff to realize there was no substantive point.

It’s to the point that I dread reviewing.

I am going to guess that this is more useful for experts, who will submit stronger submissions, than for novices and journeymen, who will still make foundational errors.

ruhith18 hours ago

'Predict the next token' is true but not explanatory. It's like saying humans 'fire neurons.' Technically correct, explains nothing useful about the behavior you're actually observing. The debate isn't whether the description is accurate - it's whether it's at the right level of abstraction.

pharrington18 hours ago

Why do the deliverables always take about 1 hour? Is this fully automated?

gammalost18 hours ago

It is really interesting how great and also how terrible LLMs can be at the same time. For example, I had a really annoying bug yesterday, I missed one character, "_". Asking ChatGPT for help led to a lot of feedback that was arguably okay but not currently relevant (because there was a fatal flaw in the code).

Remade the conversation with personal information stripped here https://chatgpt.com/share/699fef77-b530-8007-a4ed-c3dda9461d03

GodelNumbering18 hours ago

It is probably the first-time aha moment the author is talking about. But under the hood, it is probably not as magical as it appears to be.

Suppose you prompted the underlying LLM with "You are an expert reviewer in..." and a bunch of instructions followed by the paper. LLM knows from the training that 'expert reviewer' is an important term (skipping over and oversimplifying here) and my response should be framed as what I know an expert reviewer would write. LLMs are good at picking up (or copying) the patterns of response, but the underlying layer that evaluates things against a structural and logical understanding is missing. So, in corner cases, you get responses that are framed impressively but do not contain any meaningful inputs. This trait makes LLMs great at demos but weak at consistently finding novel interesting things.

If the above is true, the author will find after several reviews that the agent they use keeps picking up on the same/similar things (collapsed behavior that makes it good at coding type tasks) and is blind to some other obvious things it should have picked up on. This is not a criticism, many humans are often just as collapsed in their 'reasoning'.

LLMs are good at 8 out of 10 tasks, but you don't know which 8.

Kim_BruningGodelNumbering17 hours ago

In your model, explain the old trick "think step by step"

GodelNumberingKim_Bruning8 hours ago

It simply forces the model to adopt an output style known to be conducive to systematic thinking without actually thinking. At no point has it thought through the thing (unless there are separate thinking tokens)

modeless18 hours ago

It's clear that in the general case "predict the next word" requires arbitrarily good understanding of everything that can be described with language. That shouldn't be mysterious. What's mysterious is how a simple training procedure with that objective can in practice achieve that understanding. But then again, does it? The base model you get after that simple training procedure is not capable of doing the things described in the article. It is only useful as a starting point for a much more complex reinforcement learning procedure that teaches the skills an agent needs to achieve goals.

RL is where the magic comes from, and RL is more than just "predict the next word". It has agents and environments and actions and rewards.

sasjaws18 hours ago

A while ago i did the nanogpt tutorial, i went through some math with pen and paper and noticed the loss function for 'predict the next token' and 'predict the next 2 tokens' (or n tokens) is identical.

That was a bit of a shock to me so wanted to share this thought. Basically i think it's not unreasonable to say llms are trained to predict the next book instead of single token.

Hope this is useful to someone.
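
One way to check the pen-and-paper observation is the chain rule of probability: the log-probability of a whole continuation is the sum of the per-token log-probabilities, so adding up next-token losses gives the same number as scoring the continuation in one go. The sketch below assumes a made-up bigram table (`BIGRAM`) that conditions only on the previous token; a transformer conditions on the whole prefix, but the same identity applies. All names and numbers here are hypothetical.

```python
import math

# Hypothetical next-token model: P(next | previous) over a 3-word vocabulary.
# The probabilities are invented for illustration.
BIGRAM = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.2, "c": 0.3},
    "c": {"a": 0.4, "b": 0.4, "c": 0.2},
}

def next_token_loss(prev, target):
    # Cross-entropy for a single next-token prediction.
    return -math.log(BIGRAM[prev][target])

def summed_token_loss(start, continuation):
    # What training adds up: one loss per position.
    loss, prev = 0.0, start
    for tok in continuation:
        loss += next_token_loss(prev, tok)
        prev = tok
    return loss

def joint_probability(start, continuation):
    # Probability the same model assigns to the whole continuation at once.
    prob, prev = 1.0, start
    for tok in continuation:
        prob *= BIGRAM[prev][tok]
        prev = tok
    return prob

continuation = ["b", "a", "c"]
per_token = summed_token_loss("a", continuation)
whole_sequence = -math.log(joint_probability("a", continuation))
print(per_token, whole_sequence)  # the two losses agree up to rounding

# The model also defines a proper distribution over all 2-token continuations.
total_two_token = sum(
    joint_probability("a", [x, y]) for x in BIGRAM for y in BIGRAM
)
```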

sputknicksasjaws18 hours ago

I'd like to explore this idea, did you make a blog post about it? is it simple enough to post in the reply?

WithinReasonsputknick17 hours ago

Look up attention masks

317070sasjaws17 hours ago

As an expert in the field: this is exactly right.

LLMs are trained to do whole book prediction; at training time we throw in whole books at a time. It's only when sampling that we do one or a few tokens at a time.
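
The asymmetry between training and sampling can be sketched with a toy stand-in. Everything here is an assumption for illustration: `training_pairs`, `fit_counts` and `generate` are hypothetical helpers, and "training" is just counting which word follows which, not gradient descent. The point is only that one pass over a text yields a prediction target at every position, while generation has to feed each new token back in before producing the next.

```python
from collections import Counter, defaultdict

def training_pairs(tokens):
    # One pass over a text yields a (prefix, target) pair at every position.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def fit_counts(tokens):
    # Stand-in for training: count which token follows which.
    # (A real model would condition on the whole prefix, not just its last token.)
    counts = defaultdict(Counter)
    for prefix, target in training_pairs(tokens):
        counts[prefix[-1]][target] += 1
    return counts

def generate(counts, prompt, steps):
    # Sampling is sequential: each new token is appended and fed back in.
    out = list(prompt)
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])  # greedy choice
    return out

text = "the cat sat on the mat".split()
pairs = training_pairs(text)
counts = fit_counts(text)
generated = generate(counts, ["cat"], 3)
print(len(pairs), generated)
```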

justinator31707017 hours ago

where do you get these books?

honking intensifies

WHERE DO YOU GET THESE BOOKS?!

tasukijustinator17 hours ago

The local library.

fc417fc802justinator17 hours ago

Can anyone even say what a book really is at the end of the day? It's such an abstract concept. /s

benterixjustinator17 hours ago

We do things, but it doesn't feel right

TuringTest31707017 hours ago

Isn't that the same as compressing the whole book, in a special differential format that compares how the text looks from any given point before and after?

317070TuringTest17 hours ago

There are many ways to model how the model works in simpler terms. Next-word prediction is useful to characterize how you do inference with the model. Maximizing mutual information, compressing, gradient descent, ... are all useful characterisations of the training process.

But as stated above, next token prediction is a misleading frame for the training process. While the sampling is indeed happening 1 token at a time, due to the training process, much more is going on in the latent space where the model has its internal stream of information.

margalabargalaTuringTest7 hours ago

Everything is the same as everything else. It's all just hydrogen and time mixed together.

croonsasjaws16 hours ago

Isn't that why noise was introduced (seed rolling/temperature/high p/low p/etc)? I mean it is still deterministic given the same parameters.

But this might be misleadingly interpreted as an LLM having "thought out an answer" before generating tokens, which is an incorrect conclusion.

Not suggesting you did.
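
For reference, a minimal sketch of how temperature and top-p (nucleus) sampling are commonly described, and why a fixed seed makes the result reproducible. The function `sample_token`, the helper `draw`, and the logits are all hypothetical; real inference stacks differ in details, and temperature must be greater than zero here.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token from a dict of token -> logit.

    temperature rescales the logits; top_p keeps the smallest set of
    most-likely tokens whose probabilities sum to at least top_p.
    """
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((e / total, tok) for tok, e in exps.items()), reverse=True)

    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens = [tok for _, tok in kept]
    weights = [prob for prob, _ in kept]
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for the next token.
logits = {"cat": 2.0, "dog": 1.5, "fish": 0.1}

def draw(seed, n=10):
    # A fresh generator with a fixed seed gives the same draws every time.
    rng = random.Random(seed)
    return [sample_token(logits, temperature=0.8, top_p=0.9, rng=rng) for _ in range(n)]

same_seed_matches = draw(42) == draw(42)
# With a very small top_p only the single most likely token survives.
greedy_like = sample_token(logits, temperature=1.0, top_p=0.1, rng=random.Random(0))
```

The randomness lives entirely in the choice among the kept tokens; the probabilities themselves come out the same for the same input.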

throw310822croon16 hours ago

> this might be misleadingly interpreted as an LLM having "thought out an answer"

I'm convinced that that is exactly what happens. Anthropic confirms it:

"Claude will plan what it will say many words ahead, and write to get to that destination. We show this in the realm of poetry, where it thinks of possible rhyming words in advance and writes the next line to get there. This is powerful evidence that even though models are trained to output one word at a time, they may think on much longer horizons to do so."

https://www.anthropic.com/research/tracing-thoughts-language-model

apexalphasasjaws15 hours ago

Are you referring to this one?: https://github.com/karpathy/build-nanogpt

tsunamifury18 hours ago

I think it’s funny that at Google I invented and productized next word (and next action) predictor in Gmail and hangouts chat and I’ve never had a single person come to me and ask how this all works.

To me LLMs are incredibly simple. Next word next sentence next paragraph and next answer are stacked attention layers which identify manifolds and run in reverse to then keep the attention head on track for next token. It’s pretty straightforward math and you can sit down and make a tiny LLM pretty easily on your home computer with a good sized bag of words and context.

To me it’s baffling everyone goes around saying constantly that not even Nobel prize winners know how this works it’s a huge mystery.

Has anyone thought to ask the actual people like me and others who invented this?

booleandilemmatsunamifury17 hours ago

A lot of people in tech thrive on the mystery and don't like explaining things in simple terms. It makes what they do seem more valuable if no one can understand what they're talking about. At the same time, being vague and mysterious can help hide someone's own misunderstandings. When you speak clearly you need to be accurate, because it's more obvious when you're wrong.

tsunamifurybooleandilemma5 hours ago

I agree -- or the math is just way over people's heads -- even word points to word N times.

kosh2tsunamifury16 hours ago

This is like saying quantum mechanics is really simple to understand, all you have to do is find the right formula and plug in the numbers.

When people talk about understanding, they mean knowing how the underlying mechanism works, often by finding an analog in real life.

tsunamifurykosh29 hours ago

It is a sophisticated way of putting your foot in front of you and taking a step while keeping your head up and looking at your destination.

teekert18 hours ago

I think this is a thing not often discussed here, but I too have this experience. An LLM can be fantastic if you write a 25-pager then later need to incorporate a lot of comments with sometimes conflicting arguments/viewpoints.

LLMs can be really good at "get all arguments against this", "Incorporate this viewpoint in this text while making it more concise.", "Are these views actually contradicting or can I write it such that they align. Consider incentives".

If you know what you're doing and understand the matter deeply (and that is very important) you'll find that the LLM is sometimes better at wording what you actually mean, especially when not writing in your native language. Of course, you study the generated text, make small changes, make it yours, make sure you feel comfortable with it etc. But man can it get you over that "how am I going to write this down"-hump.

Also: "Make an executive summary" "Make more concise", are great. Often you need to de-linkedIn the text, or tell it to "not sound like an American waiter", and "be business-casual", "adopt style of rest of doc", etc. But it works wonders.

mrorigo18 hours ago

Attention is all you need.

trhway16 hours ago

>I don’t know how you get here from “predict the next word.”

The question puts the horse behind the buggy. The main point isn't "from", it is how you get to “predict the next word.” During training the LLM builds inside itself a compressed aggregated representation - a model - of what is fed into it. Given the model, you can "predict the next word", and you can do a lot of other things as well.

For a simple starting point for understanding i'd suggest looking back at the key foundational stone that started it all - the "sentiment neuron"

https://openai.com/index/unsupervised-sentiment-neuron/

"simply predicting the next character in Amazon reviews resulted in discovering the concept of sentiment.

...

Digging in, we realized there actually existed a single “sentiment neuron” that’s highly predictive of the sentiment value."

asymmetric16 hours ago

The ideas in the update were previously explored by Gwern 2 years ago: https://www.lesswrong.com/posts/PQaZiATafCh7n5Luf/gwern-s-shortform?commentId=KAtgQZZyadwMitWtb

keybored16 hours ago

> On reflection I have started to worry again. In 10 to 20 years nobody will read anything any more, they just will read LLM digests. So, the single most important task of a writer starting right now is to get your efforts wired in to the LLMs. Nothing you write will matter if it is not quickly adopted to the training dataset. As the art of pushing your results to the top of the google search was the 1990s game, getting your ideas into the LLMs is today’s. Refine is no different. It’s so good, everyone will use it. So whether refine and its cousins take a FTPL or new Keynesian view in evaluating papers is now all determining for where the consensus of the profession goes.

I expected a bit more cynicism and less merrily going with the downward spiral flow from a “grumpy” blogger.

Oh, but it’s a grumpy economist.

singularity200116 hours ago

Superhuman chess engines are now trained from just a one-bit reward signal: win / lose. This says absolutely nothing about the complexity that the model develops inside. They even learned the rules of the game just from that reward.

vb713213 hours ago

  IMO, the writer is overzealous with their comments on LLMs. As a coder, it feels like an outsider trying out a product that has amazed me over and over so many times.

  > They aren’t perfect, but the kind of analysis the program is able to do is past the point where technology looks like magic.

  But as you use this product over a long period of time, there are many obvious gaps - hallucinations / repeated tool calls / out of context outputs / etc.

  To me, refine.ink sounds like a company that has built heavy tooling around some super high context window LLMs and then some very good prompts. Their claim is to compare it against any good off-the-shelf LLM with any prompt. But when you are spending a bunch of money to build a whole ecosystem around LLMs, it's obvious that it's not going to beat their output.

  I won't be surprised if the next version of an LLM within the next few months completely outperforms their output -- that's usually the case with all the coding tools and scaffoldings. They are rendered useless by a superior LLM.

TYPE_FASTER9 hours ago

The best overview I've found so far of how LLMs work: https://www.youtube.com/watch?v=7xTGNNLPyMI

joquarky6 hours ago

The links on this site are low contrast.