<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Posts on PaperMoon&#39;s blog</title>
    <link>https://milknocandy.github.io/posts/</link>
    <description>Recent content in Posts on PaperMoon&#39;s blog</description>
    <generator>Hugo -- 0.154.3</generator>
    <language>en</language>
    <lastBuildDate>Sun, 05 Apr 2026 17:58:05 +0800</lastBuildDate>
    <atom:link href="https://milknocandy.github.io/posts/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>When VLMs Become Cognitive Mimics, Not Physical Reasoners: A QuantiPhy Study</title>
      <link>https://milknocandy.github.io/posts/2026-03-23-quantiphy/</link>
      <pubDate>Mon, 23 Mar 2026 16:42:46 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-23-quantiphy/</guid>
      <description>&lt;div class=&#34;paperbox&#34;&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;TOPIC&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Quantitative Physical Understanding&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;WHY READ&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Exposes that top VLMs guess physical quantities from memory (pre-trained world knowledge) rather than measure from video, with rigorous tests to diagnose this failure.&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-item&#34;&gt;
        &lt;span class=&#34;pb-key&#34;&gt;TAKEAWAY&lt;/span&gt;
        &lt;span class=&#34;pb-sep&#34;&gt;&lt;/span&gt;
        &lt;span class=&#34;pb-val&#34;&gt;Current VLMs are cognitive mimics, not physical reasoners, so build systems that arbitrate between perception and memory rather than forcing pure end-to-end inference. (Context Learning, Agentic AI)&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class=&#34;pb-links&#34;&gt;
        &lt;span class=&#34;pb-org&#34;&gt;Stanford University, UST&lt;/span&gt;
        &lt;div class=&#34;pb-link-group&#34;&gt;&lt;a href=&#34;https://arxiv.org/abs/2512.19526&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;📄 Paper&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli/QuantiPhy&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;💻 Code&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli/QuantiPhy&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;🌐 Project&lt;/a&gt;&lt;a href=&#34;https://github.com/Paulineli&#34; target=&#34;_blank&#34; class=&#34;pb-link&#34;&gt;👤 Author&lt;/a&gt;
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;hr&gt;
&lt;h2 id=&#34;-1-motivation--problem&#34;&gt;🚀 1 Motivation &amp;amp; Problem&lt;/h2&gt;
&lt;p&gt;Humans understand the physical world through structured mathematical abstractions. From Isaac Newton’s formulation of universal gravitation inspired by a falling apple, to modern physics, quantitative laws enable precise reasoning about the dynamics of the real world. In contrast, although state-of-the-art AI systems demonstrate remarkable capabilities in mathematical reasoning, programming, and scientific writing, enabling artificial intelligence to &lt;u&gt;&lt;i&gt;ground its understanding in the physical world&lt;/i&gt;&lt;/u&gt; remains a fundamental and unresolved challenge. This limitation poses a critical barrier to deploying AI systems in real-world, embodied environments.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning</title>
      <link>https://milknocandy.github.io/posts/2026-03-19-si/</link>
      <pubDate>Thu, 19 Mar 2026 11:15:09 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-19-si/</guid>
      <description>&lt;h2 id=&#34;1-benchmark&#34;&gt;1 Benchmark&lt;/h2&gt;
&lt;h3 id=&#34;11-textual-benchmarks&#34;&gt;1.1 Textual Benchmarks&lt;/h3&gt;
&lt;p&gt;&lt;details class=&#34;paper-details-wrapper&#34;&gt;
    &lt;summary class=&#34;paper-summary&#34;&gt;
        &lt;div class=&#34;summary-inner&#34;&gt;
            

            
            
            
            

            &lt;span class=&#34;s-venue-dynamic v-arxiv-2026&#34;&gt;
                &lt;svg viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34;
                stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; class=&#34;v-icon&#34;&gt;
                &lt;path d=&#34;M14.5 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7.5L14.5 2z&#34;&gt;&lt;/path&gt;
                &lt;polyline points=&#34;14 2 14 8 20 8&#34;&gt;&lt;/polyline&gt;
            &lt;/svg&gt;
                &lt;span class=&#34;v-text&#34;&gt;Arxiv 2026&lt;/span&gt;
            &lt;/span&gt;

            &lt;p class=&#34;s-title&#34;&gt;Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions&lt;/p&gt;
            &lt;span class=&#34;s-toggle-icon&#34;&gt;🔻&lt;/span&gt;
        &lt;/div&gt;
    &lt;/summary&gt;

    &lt;div class=&#34;paper-card-expanded&#34;&gt;
        &lt;div class=&#34;expand-action-bar&#34;&gt;
            &lt;div class=&#34;org-outer-container&#34;&gt;
                
                &lt;div class=&#34;org-group&#34;&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        Beijing Institute of Technology&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        BUCT&lt;/span&gt;
                    
                    
                &lt;/div&gt;
                
            &lt;/div&gt;

            &lt;div class=&#34;action-btns-fixed&#34;&gt;
                &lt;a href=&#34;https://binisalegend.github.io/&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;👤 Author&lt;/a&gt;
                &lt;a href=&#34;https://arxiv.org/abs/2601.03590&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;📄 Paper&lt;/a&gt;
                &lt;a href=&#34;https://github.com/binisalegend/SiT-Bench&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;💻 Code&lt;/a&gt;
                
            &lt;/div&gt;
        &lt;/div&gt;&lt;div class=&#34;expand-grid&#34;&gt;&lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏷️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Subject:&lt;/b&gt; Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation&lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;❓&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Problem:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt; &lt;ul&gt;
&lt;li&gt;Perception–reasoning entanglement in VLM benchmarks&lt;/li&gt;
&lt;li&gt;Lack of high-fidelity text-only spatial tasks&lt;/li&gt;
&lt;li&gt;Over-reliance on language priors/pattern matching&lt;/li&gt;
&lt;li&gt;Weak evaluation of global consistency, mental mapping&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;💡&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Idea:&lt;/b&gt; Convert visual scenes into &lt;mark&gt;coordinate-aware text&lt;/mark&gt; to isolate and test &lt;mark&gt;symbolic spatial reasoning&lt;/mark&gt; in LLMs.&lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row ex-sol-box&#34;&gt;
                &lt;span class=&#34;ex-icon&#34;&gt;🛠️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;
                    &lt;b&gt;Solution:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;SiT-Bench:&lt;/strong&gt; 3.8K QA across 5 categories, 17 subtasks for spatial cognition&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Textual Encoding:&lt;/strong&gt; Multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning (see the toy example after this card)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dual Construction:&lt;/strong&gt; Image-based generation + vision-benchmark-to-text adaptation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;R1 Filtering:&lt;/strong&gt; Reasoning-based filtering removes trivial, inconsistent, or leaked samples&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluation Protocol:&lt;/strong&gt; Compare LLMs/VLMs with/without CoT to isolate reasoning ability&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏆&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Results:&lt;/b&gt; Best model 59.46% vs. 74.42% for humans; large gap on global tasks (below 10% on mapping). CoT significantly improves performance, indicating latent but underutilized spatial reasoning.&lt;/div&gt;
            &lt;/div&gt;

            

            
            
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/details&gt;
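&lt;p&gt;A toy sketch of what a coordinate-aware textual scene description could look like (the object names and coordinates are invented for illustration; the benchmark&#39;s own templates may differ):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Toy example: turn object coordinates into coordinate-aware text plus a
# spatial-relation question, in the spirit of the textual encoding above.
objects = {
    &#39;chair&#39;: (1.0, 0.0, 2.5),
    &#39;table&#39;: (1.2, 0.0, 3.8),
    &#39;lamp&#39;: (-0.5, 0.0, 1.0),
}

lines = [
    f&#39;A {name} is located at (x={x:.1f}, y={y:.1f}, z={z:.1f}).&#39;
    for name, (x, y, z) in objects.items()
]
scene_text = &#39; &#39;.join(lines)
question = (&#39;Standing at the origin and facing the positive z direction, &#39;
            &#39;is the lamp to the left or to the right of the chair?&#39;)
print(scene_text)
print(question)
&lt;/code&gt;&lt;/pre&gt;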

&lt;figure &gt;
    &lt;img src=&#34;1_Sample4SiT.png&#34; alt=&#34;Example of SiT Benchmark&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;Example of SiT Benchmark&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>The Evolution of Unified Multimodal Models</title>
      <link>https://milknocandy.github.io/posts/2026-03-07-umm/</link>
      <pubDate>Sat, 07 Mar 2026 14:51:21 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-03-07-umm/</guid>
      <description>&lt;h2 id=&#34;1-timeline-order&#34;&gt;1 Timeline Order&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Summarize the literature reviewed in chronological order.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;h3 id=&#34;2026&#34;&gt;2026&lt;/h3&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;details class=&#34;paper-details-wrapper&#34;&gt;
    &lt;summary class=&#34;paper-summary&#34;&gt;
        &lt;div class=&#34;summary-inner&#34;&gt;
            

            
            
            
            

            &lt;span class=&#34;s-venue-dynamic v-arxiv-2026&#34;&gt;
                &lt;svg viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34;
                stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; class=&#34;v-icon&#34;&gt;
                &lt;path d=&#34;M14.5 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7.5L14.5 2z&#34;&gt;&lt;/path&gt;
                &lt;polyline points=&#34;14 2 14 8 20 8&#34;&gt;&lt;/polyline&gt;
            &lt;/svg&gt;
                &lt;span class=&#34;v-text&#34;&gt;Arxiv 2026&lt;/span&gt;
            &lt;/span&gt;

            &lt;p class=&#34;s-title&#34;&gt;WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens&lt;/p&gt;
            &lt;span class=&#34;s-toggle-icon&#34;&gt;🔻&lt;/span&gt;
        &lt;/div&gt;
    &lt;/summary&gt;

    &lt;div class=&#34;paper-card-expanded&#34;&gt;
        &lt;div class=&#34;expand-action-bar&#34;&gt;
            &lt;div class=&#34;org-outer-container&#34;&gt;
                
                &lt;div class=&#34;org-group&#34;&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        University of Science and Technology of China&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        Zhejiang University&lt;/span&gt;
                    
                    
                      &lt;span class=&#34;org-tag&#34;&gt;🏛️
                        The Hong Kong University of Science and Technology&lt;/span&gt;
                    
                    
                &lt;/div&gt;
                
            &lt;/div&gt;

            &lt;div class=&#34;action-btns-fixed&#34;&gt;
                
                &lt;a href=&#34;https://arxiv.org/abs/2512.02536&#34; target=&#34;_blank&#34; class=&#34;act-btn&#34;&gt;📄 Paper&lt;/a&gt;
                
                
            &lt;/div&gt;
        &lt;/div&gt;&lt;div class=&#34;expand-grid&#34;&gt;&lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏷️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Subject:&lt;/b&gt; Bridging Pre-trained VLMs and Diffusion Models for UMMs&lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;❓&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Problem:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt; Existing methods (e.g., MetaQuery) perform &lt;mark&gt;alignment via learnable queries&lt;/mark&gt;, but suffer from poor task generalization: they require early-stage retraining for significantly different task types.&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;
            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;💡&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Idea:&lt;/b&gt; Probabilistic Expert Bridge (from Bagel) samples Noisy Query Tokens.&lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row ex-sol-box&#34;&gt;
                &lt;span class=&#34;ex-icon&#34;&gt;🛠️&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;
                    &lt;b&gt;Solution:&lt;/b&gt;
                    &lt;div class=&#34;ex-markdown-inner&#34;&gt;&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Noisy Query Tokens:&lt;/strong&gt; Sample tokens from the standard normal distribution $N(0, I)$ at each training step to learn a robust distributed intermediate representation space instead of task-specific features (see the sketch after this card).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Probabilistic Expert Bridge:&lt;/strong&gt; Freeze VLM core parameters, add a parallel generative pathway, follow the division of labor (VLM for understanding, Diffusion Model for generation), and use Position MLP for feature alignment and spatial cue injection.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VAE Branch:&lt;/strong&gt; Inject fine-grained VAE features into the VLM via a linear projection layer to fuse high-level semantics and low-level visual details, reducing the Diffusion Model&#39;s burden.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Progressive Training:&lt;/strong&gt; Adopt a four-stage curriculum training strategy, flexibly switch between contrastive/conditional flow matching loss, and gradually upgrade resolution and task complexity.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
                &lt;/div&gt;
            &lt;/div&gt;

            &lt;div class=&#34;ex-row&#34;&gt;&lt;span class=&#34;ex-icon&#34;&gt;🏆&lt;/span&gt;
                &lt;div class=&#34;ex-text&#34;&gt;&lt;b&gt;Results:&lt;/b&gt; Though the performance is not SOTA, it alleviates the task-generalization collapse of UMMs, facilitates stable cross-task continual learning, and retains fine-grained image details.&lt;/div&gt;
            &lt;/div&gt;

            

            
            
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/details&gt;
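&lt;p&gt;A minimal PyTorch-style sketch of the noisy query token idea summarized in the card above (illustrative only; the class name, dimensions, and the VLM call signature are assumptions, not the released implementation):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch
import torch.nn as nn

class NoisyQueryBridge(nn.Module):
    # Sketch: a frozen VLM consumes freshly sampled noisy query tokens, and a
    # small trainable MLP maps their hidden states into the conditioning space
    # expected by the diffusion decoder.
    def __init__(self, vlm, num_queries=64, dim=1024, cond_dim=768):
        super().__init__()
        self.vlm = vlm.eval()                      # frozen understanding core
        for p in self.vlm.parameters():
            p.requires_grad_(False)
        self.num_queries, self.dim = num_queries, dim
        self.bridge = nn.Sequential(               # trainable generative pathway
            nn.Linear(dim, cond_dim), nn.GELU(), nn.Linear(cond_dim, cond_dim)
        )

    def forward(self, text_tokens):
        b = text_tokens.shape[0]
        # Re-sample query tokens from N(0, I) at every step so the bridge learns
        # a distributed intermediate representation, not task-specific queries.
        queries = torch.randn(b, self.num_queries, self.dim, device=text_tokens.device)
        hidden = self.vlm(text_tokens, extra_tokens=queries)   # assumed interface
        return self.bridge(hidden[:, -self.num_queries:])      # diffusion condition
&lt;/code&gt;&lt;/pre&gt;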

&lt;figure &gt;
    &lt;img src=&#34;1_WeMMU.png&#34; alt=&#34;WeMMU&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;WeMMU&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>LoRA Variants Surveys</title>
      <link>https://milknocandy.github.io/posts/2026-01-16-lora/</link>
      <pubDate>Fri, 16 Jan 2026 00:09:30 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2026-01-16-lora/</guid>
      <description>&lt;h2 id=&#34;1-timeline-order&#34;&gt;1 Timeline Order&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;Summarize the literature reviewed in chronological order.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;h3 id=&#34;2023&#34;&gt;2023&lt;/h3&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;📝【&lt;em&gt;&lt;strong&gt;EMNLP 2023 - Main&lt;/strong&gt;&lt;/em&gt;】- Sparse Low-rank Adaptation of Pre-trained Language Models (&lt;em&gt;Tsinghua University, The University of Chicago&lt;/em&gt;)&lt;/p&gt;
&lt;div class=&#34;highlight-box default&#34;&gt;
    &lt;div class=&#34;box-content&#34;&gt;
        &lt;p&gt;&lt;strong&gt;Subject:&lt;/strong&gt; Adaptive Rank Selection&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Standard LoRA uses a fixed, inflexible rank (hyperparameter $r$), requiring expensive manual tuning.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Core Idea:&lt;/strong&gt; Make the rank learnable rather than fixed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mechanism:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Gating:&lt;/strong&gt; Introduces an optimizable gating unit between the low-rank matrices.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimization:&lt;/strong&gt; Uses proximal gradient methods to update the gates (a minimal sketch follows below the box).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dynamics:&lt;/strong&gt; Prunes less important ranks during training automatically.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Result:&lt;/strong&gt; Eliminates discrete rank search; the model discovers its own optimal rank structure.&lt;/li&gt;
&lt;/ul&gt;
    &lt;/div&gt;
&lt;/div&gt;
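&lt;p&gt;A minimal sketch of the gate-plus-proximal-step mechanism summarized above (assuming a soft-thresholding proximal update for an L1-penalized gate; names and defaults are illustrative, not the official SoRA code):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch
import torch.nn as nn

class SparseLoRALinear(nn.Module):
    # LoRA update of the form B diag(g) A x, where the gate vector g sits
    # between the low-rank matrices and is shrunk toward zero by a proximal
    # (soft-thresholding) step; gates that reach exactly zero prune their rank.
    def __init__(self, d_in, d_out, r=16, alpha=16.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.gate = nn.Parameter(torch.ones(r))
        self.scale = alpha / r

    def forward(self, x, base_out):
        # base_out: output of the frozen pretrained linear layer for x
        return base_out + self.scale * ((x @ self.A.t()) * self.gate) @ self.B.t()

    @torch.no_grad()
    def proximal_step(self, lr, lam):
        # soft-thresholding applied to the gate after the usual gradient update
        g = self.gate
        self.gate.copy_(torch.sign(g) * torch.clamp(g.abs() - lr * lam, min=0.0))
&lt;/code&gt;&lt;/pre&gt;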
&lt;p&gt;
&lt;figure &gt;
    &lt;img src=&#34;1-sora.png&#34; alt=&#34;SoRA&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;SoRA&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Designing BERT for Convolutional Networks</title>
      <link>https://milknocandy.github.io/posts/2025-08-28-spark/</link>
      <pubDate>Thu, 28 Aug 2025 20:47:43 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2025-08-28-spark/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;SparK: &lt;a href=&#34;https://github.com/keyu-tian/SparK&#34;&gt;Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling&lt;/a&gt; (ICLR 2023 Spotlight)&lt;/p&gt;
&lt;p&gt;Video introduction: &lt;a href=&#34;https://www.bilibili.com/video/BV11s4y1M7qL/&#34;&gt;https://www.bilibili.com/video/BV11s4y1M7qL/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;BERT masks part of the data and has the model predict it, which yields a self-supervised learning signal. Works such as MAE transfer this idea to Vision Transformers in the image domain, but directly swapping the Transformer for a convolutional network runs into problems. In the figure below, zero-outing denotes this direct replacement:&lt;/p&gt;
&lt;p&gt;
&lt;figure &gt;
    &lt;img src=&#34;fig1.png&#34; alt=&#34;&#34; /&gt;&lt;/figure&gt;&lt;/p&gt;
&lt;p&gt;Zero-outing brings only a 0.1-point improvement, which is essentially no gain at all. The authors&#39; analysis follows.&lt;/p&gt;
&lt;h2 id=&#34;为什么失败&#34;&gt;Why does it fail?&lt;/h2&gt;
&lt;h3 id=&#34;问题1pixel-intensity-distribution-shift&#34;&gt;Problem 1: Pixel Intensity Distribution Shift&lt;/h3&gt;
&lt;p&gt;When a Transformer processes patches, as long as patches are dropped at random, the remaining input keeps the same pixel distribution as the original image. A convolutional network, however, cannot simply drop pixels; it can only zero out (blacken) some pixels to simulate the loss of that information.&lt;/p&gt;
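&lt;p&gt;A quick NumPy illustration of this distribution shift (a toy check, not an experiment from the paper):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(0.5, 0.15, size=(224, 224)).clip(0, 1)  # toy grayscale image
mask = rng.random(img.shape) &lt; 0.6                        # mask 60 percent of pixels

zeroed = np.where(mask, 0.0, img)   # what a plain CNN sees after zero-outing
kept = img[~mask]                   # what a Transformer / sparse CNN still sees

# The zeroed image collapses toward 0, while the kept pixels keep the
# original statistics, i.e. the input distribution is not shifted.
print(img.mean(), zeroed.mean(), kept.mean())
&lt;/code&gt;&lt;/pre&gt;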
&lt;p&gt;
&lt;figure &gt;
    &lt;img src=&#34;fig2.png&#34; alt=&#34;Pixel distributions. The horizontal axis is pixel intensity; the vertical axis is the frequency of that intensity&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;Pixel distributions. The horizontal axis is pixel intensity; the vertical axis is the frequency of that intensity&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Self-supervised Object-Centric Learning for Videos</title>
      <link>https://milknocandy.github.io/posts/2023-12-10-cutler/</link>
      <pubDate>Sun, 10 Dec 2023 11:35:33 +0800</pubDate>
      <guid>https://milknocandy.github.io/posts/2023-12-10-cutler/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Venue: &lt;a href=&#34;https://openreview.net/group?id=NeurIPS.cc/2023/Conference&#34;&gt;NeurIPS 2023&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Paper: &lt;a href=&#34;http://arxiv.org/abs/2310.06907&#34;&gt;http://arxiv.org/abs/2310.06907&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Code: ❌&lt;/p&gt;
&lt;p&gt;Author homepage (second author, Weidi Xie): &lt;a href=&#34;https://weidixie.github.io/&#34;&gt;https://weidixie.github.io/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Project page: &lt;a href=&#34;https://kuis-ai.github.io/solv/&#34;&gt;https://kuis-ai.github.io/solv/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;介绍&#34;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Background&lt;/strong&gt;: &lt;u&gt;Unsupervised multi-object segmentation&lt;/u&gt; has shown remarkable results by leveraging the strong semantic features learned during self-supervised pre-training. Segmentation of video sequences is usually further strengthened by adding extra modalities (e.g., depth, motion). However, the gains observed on &lt;i&gt;synthetic sequences&lt;/i&gt; &lt;u&gt;depend&lt;/u&gt; on the robustness of this extra information and do not transfer to more challenging real-world scenes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt;: Given a video sequence of a complex scene, the goal is to train a visual system that can &lt;u&gt;discover, track, and segment&lt;/u&gt; the objects in each frame, abstracting millions of pixels of visual information into &lt;i&gt;semantic parts&lt;/i&gt; (object-centric visual representation learning).&lt;/p&gt;
&lt;figure class=&#34;main-figure&#34;&gt;
  &lt;div class=&#34;side-by-side-wrapper grid-layout&#34;&gt;
    &lt;div class=&#34;side-item&#34; style=&#34;--w: 45%&#34;&gt;
      &lt;img src=&#34;1.gif&#34;&gt;
      &lt;p&gt;(a) Ground Truth&lt;/p&gt;
    &lt;/div&gt;
    &lt;div class=&#34;side-item&#34; style=&#34;--w: 45%&#34;&gt;
      &lt;img src=&#34;1-2.gif&#34;&gt;
      &lt;p&gt;(b) Prediction&lt;/p&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/figure&gt;
&lt;p&gt;&lt;strong&gt;Progress in the field&lt;/strong&gt;: Starting from &lt;i&gt;synthetic images&lt;/i&gt;, the field has moved toward &lt;u&gt;in-the-wild&lt;/u&gt; images and &lt;u&gt;real-world&lt;/u&gt; videos. Existing methods typically adopt an autoencoder training paradigm (e.g., reconstructing the input signal in the hope that data- or structure-based priors will group &lt;u&gt;region pixels&lt;/u&gt; into semantically meaningful objects).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For images: &lt;u&gt;low-level features&lt;/u&gt; from pre-trained models (e.g., color, semantic features) are used to decide the assignment of pixels to objects&lt;/li&gt;
&lt;li&gt;For videos: extra modalities or signals (e.g., optical flow, depth maps) are usually incorporated, so segmentation masks can be derived directly from discontinuities&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;提出问题&#34;&gt;Problem Statement&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;Problems with extra information&lt;/strong&gt;: Using additional signals for video increases &lt;strong&gt;computational cost&lt;/strong&gt; and &lt;strong&gt;error accumulation&lt;/strong&gt;. For example, optical flow can struggle with &lt;u&gt;static or deformable&lt;/u&gt; objects and with &lt;u&gt;large displacements&lt;/u&gt; between frames, while depth values may be hard to obtain for ordinary videos and their estimation degrades in &lt;u&gt;low-light&lt;/u&gt; or &lt;u&gt;low-contrast&lt;/u&gt; environments.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Over-segmentation&lt;/strong&gt;: Because visual scenes are complex, using a fixed number of &lt;u&gt;slots&lt;/u&gt; can lead to an over-segmentation issue.&lt;/p&gt;
&lt;h3 id=&#34;解决问题&#34;&gt;Proposed Solution&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;The authors&#39; method&lt;/strong&gt;: The &lt;strong&gt;first&lt;/strong&gt; fully unsupervised method for &lt;u&gt;multi-object segmentation on real-world sequences&lt;/u&gt;. SOLV discovers multiple objects in real-world video sequences without extra modality information or any weak-supervision-like cues (such as first-frame initialization).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Approach&lt;/strong&gt;: axial spatial-temporal slot attention&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, group spatial regions within each frame&lt;/li&gt;
&lt;li&gt;Then, enrich the slot representations through interactions with neighboring frames&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Training strategy&lt;/strong&gt;: the masked autoencoder (MAE) training paradigm, which has two advantages:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;It acts as an information bottleneck: the model only observes part of the input, forcing it to learn high-level semantic structure.&lt;/li&gt;
&lt;li&gt;It eases memory constraints and improves computational efficiency.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To address the &lt;strong&gt;over-segmentation&lt;/strong&gt; issue, the authors merge similar slots with a simple clustering algorithm.&lt;/p&gt;
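&lt;p&gt;A minimal sketch of merging similar slots via greedy clustering on cosine similarity (one plausible reading of &#34;simple clustering algorithm&#34;, not necessarily the exact SOLV procedure):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import torch
import torch.nn.functional as F

def merge_slots(slots, threshold=0.9):
    # slots: (num_slots, dim). Greedily group slots whose cosine similarity
    # exceeds the threshold, then average each group into a single slot.
    normed = F.normalize(slots, dim=-1)
    sim = normed @ normed.t()
    groups, assigned = [], set()
    for i in range(slots.shape[0]):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, slots.shape[0]):
            if j not in assigned and sim[i, j] &gt; threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return torch.stack([slots[g].mean(dim=0) for g in groups])

merged = merge_slots(torch.randn(8, 128))  # at most 8 slots remain
&lt;/code&gt;&lt;/pre&gt;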
&lt;p&gt;In summary, the contributions are as follows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A self-supervised multi-object segmentation model for real-world videos that uses axial spatial-temporal slot attention to effectively group visual regions with similar properties, without relying on &lt;u&gt;additional signals&lt;/u&gt;.&lt;/li&gt;
&lt;li&gt;An object-centric learning scheme based on masked feature reconstruction, together with a slot-merging method.&lt;/li&gt;
&lt;li&gt;State-of-the-art results on the MOVi-E and YouTube-VIS 2019 datasets, and competitive performance on DAVIS 2017.&lt;/li&gt;
&lt;/ul&gt;
&lt;blockquote&gt;
&lt;p&gt;A slot corresponds to an individual object in the video; see the figure below.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;
&lt;figure &gt;
    &lt;img src=&#34;2.png&#34; alt=&#34;Source from: Conditional object-centric learning from video&#34; /&gt;&lt;figcaption&gt;
        &lt;span class=&#34;auto-fig-title&#34;&gt;Source from: Conditional object-centric learning from video&lt;/span&gt;
    &lt;/figcaption&gt;&lt;/figure&gt;&lt;/p&gt;
&lt;h3 id=&#34;相关工作&#34;&gt;Related Work&lt;/h3&gt;
&lt;h4 id=&#34;object-centric-learning&#34;&gt;Object-centric Learning&lt;/h4&gt;
&lt;p&gt;Several approaches exist for unsupervised object-centric representation learning on images and videos:&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
