CHATGPT DOESN'T REASON! (Top scientist bombshell)

Published 2024-07-29
Prof. Subbarao Kambhampati argues that while LLMs are impressive and useful tools, especially for creative tasks, they have fundamental limitations in logical reasoning and cannot provide guarantees about the correctness of their outputs. He advocates for hybrid approaches that combine LLMs with external verification systems.
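
To make the hybrid idea concrete, here is a minimal sketch of a generate-and-verify loop in the spirit of the LLM-Modulo proposal discussed in the episode. The function names (llm_propose, verify_plan) are hypothetical placeholders, not an API from any of the papers linked below; the point is only that the correctness guarantee comes from the external verifier, not from the LLM.

```python
# Minimal sketch of an LLM-Modulo style loop: the LLM drafts candidate plans,
# an external sound verifier either accepts the plan or returns critiques, and
# the critiques are fed back into the next prompt. llm_propose and verify_plan
# are hypothetical stubs, not functions from any published codebase.

def llm_propose(problem: str, critiques: list[str]) -> str:
    """Stub for an LLM call that drafts a candidate plan, conditioned on
    the problem statement and any critiques from previous rounds."""
    raise NotImplementedError  # e.g. call a chat model here

def verify_plan(problem: str, plan: str) -> list[str]:
    """Stub for an external verifier (e.g. a plan validator or unit tests).
    Returns an empty list if the plan is correct, otherwise critiques."""
    raise NotImplementedError

def llm_modulo(problem: str, max_rounds: int = 5) -> str | None:
    critiques: list[str] = []
    for _ in range(max_rounds):
        plan = llm_propose(problem, critiques)
        critiques = verify_plan(problem, plan)
        if not critiques:
            # The correctness guarantee comes from the verifier, not the LLM.
            return plan
    return None  # no verified plan within the round budget
```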

MLST is sponsored by Brave:
The Brave Search API covers over 20 billion webpages, built from scratch without Big Tech biases or the recent extortionate price hikes on search API access. Perfect for AI model training and retrieval augmented generation. Try it now - get 2,000 free queries monthly at brave.com/api.

This is 2/13 of our #ICML2024 series

TOC
[00:00:00] Intro
[00:02:06] Bio
[00:03:02] LLMs are n-gram models on steroids
[00:07:26] Is natural language a formal language?
[00:08:34] Natural language is formal?
[00:11:01] Do LLMs reason?
[00:19:13] Definition of reasoning
[00:31:40] Creativity in reasoning
[00:50:27] Chollet's ARC challenge
[01:01:31] Can we reason without verification?
[01:10:00] LLMs can't solve some tasks
[01:19:07] LLM Modulo framework
[01:29:26] Future trends of architecture
[01:34:48] Future research directions

Pod: podcasters.spotify.com/pod/show/machinelearningstr…

Subbarao Kambhampati:
x.com/rao2z

Interviewer: Dr. Tim Scarfe

Refs:

Can LLMs Really Reason and Plan?
cacm.acm.org/blogcacm/can-llms-really-reason-and-p…

On the Planning Abilities of Large Language Models: A Critical Investigation
arxiv.org/pdf/2305.15771

Chain of Thoughtlessness? An Analysis of CoT in Planning
arxiv.org/pdf/2405.04776

On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks
arxiv.org/pdf/2402.08115

LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
arxiv.org/pdf/2402.01817

Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve
arxiv.org/pdf/2309.13638

"Task Success" is not Enough
arxiv.org/abs/2402.04210

Partition function (number theory) (Srinivasa Ramanujan and G.H. Hardy's work)
en.wikipedia.org/wiki/Partition_function_(number_t…)

Poincaré conjecture
en.wikipedia.org/wiki/Poincar%C3%A9_conjecture

Gödel's incompleteness theorems
en.wikipedia.org/wiki/G%C3%B6del%27s_incompletenes…

ROT13 (Rotate13, "rotate by 13 places")
en.wikipedia.org/wiki/ROT13

A Mathematical Theory of Communication (C. E. Shannon)
people.math.harvard.edu/~ctm/home/text/others/shan…

Sparks of AGI
arxiv.org/abs/2303.12712

Kambhampati thesis on speech recognition (1983)
rakaposhi.eas.asu.edu/rao-btech-thesis.pdf

PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
arxiv.org/abs/2206.10498

Explainable human-AI interaction
link.springer.com/book/10.1007/978-3-031-03767-2

Tree of Thoughts
arxiv.org/abs/2305.10601

On the Measure of Intelligence (ARC Challenge)
arxiv.org/abs/1911.01547

Getting 50% (SoTA) on ARC-AGI with GPT-4o (Ryan Greenblatt ARC solution)
redwoodresearch.substack.com/p/getting-50-sota-on-…

PROGRAMS WITH COMMON SENSE (John McCarthy) - "AI should be an advice taker program"
www.cs.cornell.edu/selman/cs672/readings/mccarthy-…

Original chain of thought paper
arxiv.org/abs/2201.11903

ICAPS 2024 Keynote: Dale Schuurmans on "Computing and Planning with Large Generative Models" (COT)

The Hardware Lottery (Hooker)
arxiv.org/abs/2009.06489

A Path Towards Autonomous Machine Intelligence (JEPA/LeCun)
openreview.net/pdf?id=BZ5a1r-kVsf

AlphaGeometry
www.nature.com/articles/s41586-023-06747-5

FunSearch
www.nature.com/articles/s41586-023-06924-6

Emergent Abilities of Large Language Models
arxiv.org/abs/2206.07682

Language models are not naysayers (Negation in LLMs)
arxiv.org/abs/2306.08189

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
arxiv.org/abs/2309.12288

Embracing negative results
openreview.net/forum?id=3RXAiU7sss

All comments (21)
  • @grahamhenry9368
    Not only can you not tell whether someone memorized the answer or reasoned from first principles to get it, you can't tell whether someone memorized the steps of a first-principles argument when it looks like they have reasoned from first principles. When I first took calculus I had no trouble memorizing the epsilon-delta proof without really understanding it, so if you had asked me to demonstrate that I understood the fundamental principles of calculus, I could have provided a proof of those principles without understanding them.
  • @trucid2
    I've worked with people who don't reason either. They exhibit the kind of shallow non-thinking that ChatGPT engages in.
  • @pruff3
    I memorized all the knowledge of humans. I can't reason but I know everything humans have ever put online. Am I useful? Provide reason.
  • @Neomadra
    LLMs definitely can do transitive closure. Not sure why the guest stated otherwise. I tried it out with completely random strings as object names and Claude could do it easily. So it's not just retrieving information.
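
A rough sketch of the kind of transitive-closure probe described in the comment above: random object names, a chain of pairwise facts, and a query whose answer only follows by chaining the relation. The prompt wording is invented here for illustration; only the ground-truth label is computed.

```python
# Sketch of a transitive-closure probe: invent random object names, state a
# chain of "X is left of Y" facts, and ask whether the first object is left of
# the last. The prompt format is made up; only the ground truth is computed.
import random
import string

def random_name(k: int = 6) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=k))

def make_probe(n: int = 5) -> tuple[str, str]:
    objects = [random_name() for _ in range(n)]
    facts = [f"{a} is left of {b}." for a, b in zip(objects, objects[1:])]
    question = f"Is {objects[0]} left of {objects[-1]}? Answer yes or no."
    prompt = " ".join(facts) + " " + question
    return prompt, "yes"  # "yes" follows by chaining the relation transitively

prompt, answer = make_probe()
print(prompt)  # paste into an LLM and compare its reply with `answer`
print(answer)
```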
  • @dr.mikeybee
    Next word prediction is the objective function, but it isn't what the model learns. We don't know what the learned function is, but I can guarantee you it isn't log-odds.
  • @oscarmoxon102
    There's a difference between in-distribution reasoning and out-of-distribution reasoning. If you can make the distribution powerful enough, you can still advance research with neural models.
  • @timcarmichael
    Have we yet defined intelligence sufficiently well that we can appraise it and identify its hallmarks in machines?
  • @DataTranslator
    His analogy of GPT to learning a second language makes 100% sense to me.
  • @markplutowski
    If the title says "people don't reason", many viewers think it makes the strong claim "ALL people don't reason", when it is actually making the weaker claim "SOME people don't reason". That title is factually defensible but misleading. One could be excused for interpreting this title as claiming "ChatGPT doesn't reason (at all)", when it is actually claiming "ChatGPT doesn't reason (very well)". One of the beauties of human language is that the meaning a listener derives from an utterance depends as much on the deserialization algorithm used by the listener as on the serialization algorithm employed by the speaker. The YouTube algorithm chose this title because the algorithm "knows" that many viewers assume the stronger claim. Nonetheless, be that as it may, this was a wonderful interview, with many gems of insight on multiple levels, including historical ones, which I enjoyed. I especially liked your displaying the title page of an article that was mentioned. Looking forward to someone publishing "Alpha Reasoning: No Tokens Required". I would watch again.
  • @NunTheLass
    Thank you. He was my favorite guest that I watched here so far. I learned a lot.
  • @jeremyh2083
    Those people who assume AGI is going to be achieved have never done long-term work inside any of the major GPT systems. If you want a quick and dirty test, tell it to create a fiction book: first have it outline 15 chapters with 10 sections per chapter, then have it start writing the book. Look at it in detail and you will see that, section after section, it loses sight of essentially every detail. It does a better job if you are working inside a universe another author has already made, and the worst job if you are creating a brand new universe, even if you have it define the universe first.
  • @davidcummins8125
    Could an LLM, for example, figure out whether a request requires a planner, a math engine, etc., transform the request into the appropriate format, use the appropriate tool, and then transform the results for the user? I think LLMs provide a good combination of UI and knowledge base. I was suspicious myself that in the web data they may well have seen joke explanations, movie reviews, and so on, and can lean on that. I think LLMs can do better, but it requires memory and a feedback loop, in the same way that embodied creatures have them.
  • @thenautilator661
    Very convincing arguments. I haven't heard it laid out this succinctly and comprehensively yet. I'm sure Yann LeCun would be in the same camp, but I recall not being persuaded by LeCun's arguments when he made them on Lex Fridman's podcast.
  • @aitheignis
    I love this episode. In science, it's never about what can be done or what happens in the system; it's always about the mechanism that leads to the event (how the event happens, basically). What is severely missing from all the LLM talk today is discussion of the underlying mechanism. Work on mechanism is the key piece that will move all of this deep neural network work from engineering feat to actual science. To know the mechanism is to know causality.
  • @shyama5612
    Sara Hooker said the same about us not fully understanding what is used in training - the low frequency data and memorization of those being interpreted as generalization or reasoning. Good interview.
  • @wtfatc4556
    GPT is like a reactive mega-Wikipedia...
  • @DataJuggler
    0:18 When I was 4 years old, I was often stuck at my parents' work. The only entertaining thing for me to do was play with calculators or adding machines. I memorized the times table because I played with calculators a lot. My parents would spend $8 at the drug store to keep me from asking why the sky is blue and other pertinent questions. I was offered the chance to skip first grade after kindergarten, and my parents said no. Jeff Bezos is the same age as me, and also from Houston. His parents said yes to skipping first grade. I told my parents this forever until they died.
  • @user-qg8qc5qb9r
    Introduction and Initial Thoughts on Reasoning (00:00)
    The Manhole Cover Question and Memorization vs. Reasoning (00:00:39)
    Using Large Language Models in Reasoning and Planning (00:01:43)
    The Limitations of Large Language Models (00:03:29)
    Distinguishing Style from Correctness (00:06:30)
    Natural Language vs. Formal Languages (00:10:40)
    Debunking Claims of Emergent Reasoning in LLMs (00:11:53)
    Planning Capabilities and the PlanBench Paper (00:15:22)
    The Role of Creativity in LLMs and AI (00:32:37)
    LLMs in Ideation and Verification (00:38:41)
    Differentiating Tacit and Explicit Knowledge Tasks (00:54:47)
    End-to-End Predictive Models and Verification (01:02:03)
    Chain of Thought and Its Limitations (01:08:27)
    Comparing Generalist Systems and Agentic Systems (01:29:35)
    LLM Modulo Framework and Its Applications (01:34:03)
    Final Thoughts and Advice for Researchers (01:35:02)
    Closing Remarks (01:40:07)
  • @SurfCatten
    Claude just deciphered a random biography, in a rotation cipher, for me. All I told him was that it was a Caesar cipher, and then gave him the text. I didn't tell him how many letters it was shifted by, and I didn't use ROT13. I tried it three times with three different shift values and it translated perfectly each time. There's no way that Claude has memorized every single piece of information on the internet in cipher form. I don't know if it's "reasoning", but it is certainly applying some procedure to translate this that is more than just memorization or retrieval. ChatGPT also did it, but with some errors. Instead of criticizing other scientists for being fooled and not being analytical enough, maybe you should check your own biases. I have found it true that it can't do logic when a similar logic problem was not in its training data, but it definitely can generalize even when very different words are used.
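
For context on the task described above, this is a minimal sketch of how a rotation (Caesar) cipher with an unknown shift can be cracked mechanically: try all 26 shifts and keep the one that looks most like English. The word list and example sentence are made up purely for illustration.

```python
# Sketch of the task described above: decode a Caesar (rotation) cipher when
# the shift is unknown by trying all 26 shifts and keeping the one that looks
# most like English (scored crudely by counting common words).
def shift_text(text: str, k: int) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + k) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

COMMON_WORDS = {"the", "and", "was", "of", "to", "in", "is"}

def crack_caesar(ciphertext: str) -> tuple[int, str]:
    def score(s: str) -> int:
        return sum(word in COMMON_WORDS for word in s.lower().split())
    best_k = max(range(26), key=lambda k: score(shift_text(ciphertext, k)))
    return best_k, shift_text(ciphertext, best_k)

# Example with a made-up sentence encoded with a shift of 7:
k, plaintext = crack_caesar("Olssv dvysk, aopz pz h alza vm aol jpwoly.")
print(k, plaintext)  # 19 -> "Hello world, this is a test of the cipher."
```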
  • @jimbarchuk
    I have to stop to ask if '150% accuracy' is an actual thing in LLM/GPT? Or other weird number things that I'll have to go read. Keywords?