<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Spatial Intelligence on PaperMoon&#39;s blog</title>
    <link>https://milknocandy.github.io/tags/spatial-intelligence/</link>
    <description>Recent content in Spatial Intelligence on PaperMoon&#39;s blog</description>
    <generator>Hugo -- 0.154.3</generator>
    <language>en</language>
    <lastBuildDate>Sun, 05 Apr 2026 17:58:05 +0800</lastBuildDate>
    <atom:link href="https://milknocandy.github.io/tags/spatial-intelligence/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>When VLMs Become Cognitive Mimics, Not Physical Reasoners: A QuantiPhy Study</title>
      <link>https://milknocandy.github.io/posts/2026-03-23-quantiphy/</link>
      <pubDate>Mon, 23 Mar 2026 16:42:46 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-23-quantiphy/</guid>
      <description>&lt;div class=&#34;paperbox&#34;&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;TOPIC&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Quantitative Physical Understanding&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;WHY READ&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Exposes that top VLMs guess physical quantities from memory (pre-trained world knowledge) rather than measure them from video, with rigorous tests to diagnose this failure.&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;TAKEAWAY&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Current VLMs are cognitive mimics, not physical reasoners, so build systems that arbitrate between perception and memory rather than forcing pure end-to-end inference (see the sketch below). (Context Learning, Agentic AI)&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-links&#34;&gt;
        &lt;span class=&#34;pb-org&#34;&gt;Stanford University, UST&lt;/span&gt;
        &lt;div class=&#34;pb-link-group&#34;&gt;&lt;a href=&#34;https://arxiv.org/abs/2512.19526&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;📄 Paper&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli/QuantiPhy&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;💻 Code&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli/QuantiPhy&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;🌐 Project&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;👤 Author&lt;/a&gt;
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;
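&lt;p&gt;As a rough illustration of that takeaway, the sketch below arbitrates between a quantity measured from video and a quantity recalled from pre-trained priors. The &lt;code&gt;arbitrate&lt;/code&gt; helper, its confidence input, and the threshold are illustrative assumptions, not QuantiPhy&#39;s protocol.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical perception-vs-memory arbitration (not the paper&#39;s method).
def arbitrate(perceived, prior, perception_confidence, threshold=0.6):
    &#39;&#39;&#39;Prefer the quantity measured from video when perception is trusted,
    otherwise fall back to the memorized prior.&#39;&#39;&#39;
    if perception_confidence &gt;= threshold:
        return perceived, &#39;perception&#39;
    return prior, &#39;prior&#39;

# Example: speed of a rolling ball (m/s) estimated by tracking vs. a typical value.
value, source = arbitrate(perceived=1.8, prior=1.4, perception_confidence=0.72)
print(f&#39;estimate={value} m/s (source: {source})&#39;)
&lt;/code&gt;&lt;/pre&gt;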

&lt;hr&gt;
&lt;h2 id=&#34;-1-motivation--problem&#34;&gt;🚀 1 Motivation &amp;amp; Problem&lt;/h2&gt;
&lt;p&gt;Humans understand the physical world through structured mathematical abstractions. From Isaac Newton’s formulation of universal gravitation inspired by a falling apple, to modern physics, quantitative laws enable precise reasoning about the dynamics of the real world. In contrast, although state-of-the-art AI systems demonstrate remarkable capabilities in mathematical reasoning, programming, and scientific writing, enabling artificial intelligence to &lt;u&gt;&lt;i&gt;ground its understanding in the physical world&lt;/i&gt;&lt;/u&gt; remains a fundamental and unresolved challenge. This limitation poses a critical barrier to deploying AI systems in real-world, embodied environments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning</title>
      <link>https://milknocandy.github.io/posts/2026-03-19-si/</link>
      <pubDate>Thu, 19 Mar 2026 11:15:09 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-19-si/</guid>
      <description>&lt;h2 id=&#34;1-benchmark&#34;&gt;1 Benchmark&lt;/h2&gt;
&lt;h3 id=&#34;11-textual-benchmarks&#34;&gt;1.1 Textual Benchmarks&lt;/h3&gt;
&lt;p&gt;&lt;details class=&#34;paper-details-wrapper&#34;&gt;
    &lt;summary class=&#34;paper-summary&#34;&gt;
        &lt;div class=&#34;summary-inner&#34;&gt;
            

            
            
            
            

            &lt;span class=&#34;s-venue-dynamic v-arxiv-2026&#34;&gt;
                &lt;svg viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34;
                stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; class=&#34;v-icon&#34;&gt;
                &lt;path d=&#34;M14.5 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7.5L14.5 2z&#34;&gt;&lt;/path&gt;
                &lt;polyline points=&#34;14 2 14 8 20 8&#34;&gt;&lt;/polyline&gt;
            &lt;/svg&gt;
                &lt;span class=&#34;v-text&#34;&gt;Arxiv 2026&lt;/span&gt;
            &lt;/span&gt;

            &lt;p class=&#34;s-title&#34;&gt;Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions&lt;/p&gt;
            &lt;span class=&#34;s-toggle-icon&#34;&gt;🔻&lt;/span&gt;
        &lt;/div&gt;
    &lt;/summary&gt;

    &lt;div class=&#34;paper-card-expanded&#34;&gt;
        &lt;div class=&#34;expand-action-bar&#34;&gt;
            &lt;div class=&#34;org-outer-container&#34;&gt;
                
                &lt;div class=&#34;org-group&#34;&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        Beijing Institute of Technology&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        BUCT&lt;/span&gt;
                    
                    
                &lt;/div&gt;
                
            &lt;/div&gt;

            &lt;div class=&#34;action-btns-fixed&#34;&gt;
                &lt;a href=&#34;https://binisalegend.github.io/&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;👤 Author&lt;/a&gt;
                &lt;a href=&#34;https://arxiv.org/abs/2601.03590&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;📄 Paper&lt;/a&gt;
                &lt;a href=&#34;https://github.com/binisalegend/SiT-Bench&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;💻 Code&lt;/a&gt;
                
            &lt;/div&gt;
        &lt;/div&gt;&lt;div class=&#34;expand-grid&#34;&gt;&lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏷️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Subject:&lt;/b&gt; Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation&lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;❓&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Problem:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt; &lt;ul&gt;
&lt;li&gt;Perception–reasoning entanglement in VLM benchmarks&lt;/li&gt;
&lt;li&gt;Lack of high-fidelity text-only spatial tasks&lt;/li&gt;
&lt;li&gt;Over-reliance on language priors/pattern matching&lt;/li&gt;
&lt;li&gt;Weak evaluation of global consistency and mental mapping&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;💡&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Idea:&lt;/b&gt; Convert visual scenes into &lt;mark&gt;coordinate-aware text&lt;/mark&gt; to isolate and test &lt;mark&gt;symbolic spatial reasoning&lt;/mark&gt; in LLMs (see the sketch after this card).&lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row ex-sol-box&#34;&gt;
                &lt;span class=&#34;ex-icon&#34;&gt;🛠️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;
                    &lt;b&gt;Solution:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SiT-Bench:&lt;/strong&gt; 3.8K QA across 5 categories, 17 subtasks for spatial cognition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Textual Encoding:&lt;/strong&gt; Multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dual Construction:&lt;/strong&gt; Image-based generation + vision-benchmark-to-text adaptation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;R1 Filtering:&lt;/strong&gt; Reasoning-based filtering removes trivial, inconsistent, or leakage-prone samples&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation Protocol:&lt;/strong&gt; Compare LLMs/VLMs with/without CoT to isolate reasoning ability&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏆&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Results:&lt;/b&gt; Best model reaches 59.46% vs. 74.42% for humans, with the largest gap on global tasks (below 10% on mapping). CoT significantly improves performance, indicating latent but underutilized spatial reasoning.&lt;/div&gt;
            &lt;/div&gt;

            

            
            
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/details&gt;
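&lt;p&gt;A minimal sketch of the encoding idea above: serializing a scene&#39;s objects into coordinate-aware text that a text-only LLM can reason over. The object list, the metre-based coordinate convention, and the &lt;code&gt;scene_to_text&lt;/code&gt; helper are assumptions for illustration, not SiT-Bench&#39;s actual construction pipeline.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical scene-to-text encoding for a text-only spatial question.
def scene_to_text(objects):
    &#39;&#39;&#39;objects: list of (name, (x, y, z)) tuples in a shared world frame (metres).&#39;&#39;&#39;
    lines = [&#39;Scene description (coordinates in metres, shared world frame):&#39;]
    for name, (x, y, z) in objects:
        lines.append(f&#39;- {name} is at (x={x:.2f}, y={y:.2f}, z={z:.2f}).&#39;)
    return &#39;\n&#39;.join(lines)

scene = [(&#39;sofa&#39;, (1.20, 0.00, 2.50)),
         (&#39;lamp&#39;, (2.80, 0.00, 2.40)),
         (&#39;table&#39;, (1.90, 0.00, 1.10))]
prompt = scene_to_text(scene) + &#39;\nQuestion: which object is closest to the sofa?&#39;
print(prompt)
&lt;/code&gt;&lt;/pre&gt;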

&lt;figure &gt;
    &lt;img src=&#34;1_Sample4SiT.png&#34; alt=&#34;Example of SiT Benchmark&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;Example of SiT Benchmark&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
