[{"content":" TOPIC Quantitative Physical Understanding WHY READ Exposes that top VLMs guess physical quantities from memory (pre-trained world knowledge) rather than measure from video, with rigorous tests to diagnose this failure. TAKEAWAY Current VLMs are cognitive mimics not physical reasoners, so build systems that arbitrate between perception and memory rather than forcing pure end to end inference. (Context Learning, Agentic AI) Stanford University, UST 📄 Paper💻 Code🌐 Project👤 Author 🚀 1 Motivation \u0026amp; Problem Humans understand the physical world through structured mathematical abstractions. From Isaac Newton’s formulation of universal gravitation inspired by a falling apple, to modern physics, quantitative laws enable precise reasoning about the dynamics of the real world. In contrast, although state-of-the-art AI systems demonstrate remarkable capabilities in mathematical reasoning, programming, and scientific writing, enabling artificial intelligence to ground its understanding in the physical world remains a fundamental and unresolved challenge. This limitation poses a critical barrier to deploying AI systems in real-world, embodied environments.\nModern large language models (LLMs) are predominantly trained under the next-token prediction paradigm, which implicitly encourages models to capture statistical regularities in data. A natural extension toward building world models is to train systems to predict future states—such as future frames in videos or evolving spatial configurations. While such approaches can improve perceptual modeling and temporal prediction, they do not necessarily lead to a true understanding of physical laws. Instead, models may learn to imitate surface-level patterns in visual data without acquiring the underlying causal and quantitative structure of the physical world.\nThis limitation can be intuitively understood by analogy to human cognition. 
If humans were to perceive the world purely through passive observation, without forming explicit conceptual or physical knowledge, their behavior would be driven by superficial correlations rather than grounded reasoning. As a result, actions would lack an understanding of physical consequences (e.g., failing to infer the danger of falling from a height), reflecting a gap between perception and cognition. Similarly, current AI systems often rely on learned statistical priors rather than principled physical reasoning.\nTo mitigate this issue, prior work has introduced large-scale datasets in the form of Visual Question Answering (VQA) to inject world knowledge into models. However, such approaches remain insufficient for evaluating true physical understanding.\nProblem: Existing benchmarks for physical world understanding are predominantly VQA-based and qualitative. These evaluations often reduce reasoning to discrete answer selection or linguistic plausibility, which can be solved via pattern matching rather than genuine physical inference. Insight: To address this limitation, the authors introduce a new paradigm that evaluates quantitative physical reasoning, focusing on whether models can infer numerical kinematic properties (e.g., size, velocity, acceleration) from visual inputs. 💡 2 Methodology 2.1 Task Formulation The paper formulates a kinematic inference task for evaluating physical reasoning in vision-language models. Given a video and a single physical prior (e.g., size $\\bold{S}_t^{\\text{world}}$, velocity $\\bold{V}_t^{\\text{world}}$, or acceleration $\\bold{A}_t^{\\text{world}}$), the model is required to estimate another physical quantity of a target object in real-world units.\nTab. 
1: Pixel-to-World Representation and Scale Mapping\rComponent\rDefinition\rMeasurement Units\rPixel Space\rObservable quantities derived from video frames\r[pixel], [pixel/s], [pixel/s²]\rWorld Space\rPhysical quantities in real-world coordinates\r[m], [m/s], [m/s²]\rScale Factor (γ)\rMapping between pixel space and world space\r[m/pixel]\rGiven a video capturing the translational motion of a target object under a fixed camera, the object's position in pixel space, denoted as $\\mathbf{X}_t^{\\text{pixel}}$, can be obtained at each time step $t$ from the frames. Based on the resulting discrete trajectory, the velocity and acceleration in pixel space can be estimated using finite difference approximations: $$\r\\bold{V}_t^{\\text{pixel}}\\approx\\frac{\\bold{X}_{t+\\mathrm{d}t}^{\\text{pixel}}-\\bold{X}_t^{\\text{pixel}}}{\\mathrm{d}t};\r\\bold{A}_t^{\\text{pixel}}\\approx\\frac{\\bold{X}_{t+2\\mathrm{d}t}^{\\text{pixel}}-2\\bold{X}_{t+\\mathrm{d}t}^{\\text{pixel}}+\\bold{X}_t^{\\text{pixel}}}{\\mathrm{d}t^2}.\r\\tag{1}\r$$To convert these pixel-based measurements into real-world physical quantities, a scale factor $\\gamma$ is introduced, which maps pixel space to world space. The relationship can be expressed as follows: $$\r\\bold{S}_t^{\\text{world}}=\\gamma \\cdot \\bold{S}_t^{\\text{pixel}};\r\\bold{V}_t^{\\text{world}}=\\gamma \\cdot \\bold{V}_t^{\\text{pixel}};\r\\bold{A}_t^{\\text{world}}=\\gamma \\cdot \\bold{A}_t^{\\text{pixel}}.\r\\tag{2}\r$$ Thus, given a single world-space prior to fix $\\gamma$, the remaining kinematic properties can be computed directly from the video.\n2.2 Benchmark Design To comprehensively evaluate the kinematic quantities above, QuantiPhy includes video-question pairs along three primary axes. The first two axes define the core reasoning task:\nDimensionality: {2D, 3D}. 2D movement assumes motion strictly in the x-y plane (constant depth), while 3D movement includes the z-axis (varying depth), making it intrinsically more challenging. Physical prior: {Static, Dynamic}. 
The Static prior provides constant object size $\\bold{S}^{\\text{world}}$ throughout the video, while the Dynamic prior provides velocity $\\bold{V}_t^{\\text{world}}$ or acceleration $\\bold{A}_t^{\\text{world}}$ at a given timestep $t$. These two axes yield four tasks: 2D-Static (2S), 2D-Dynamic (2D), 3D-Static (3S), and 3D-Dynamic (3D). The data statistics of the QuantiPhy benchmark are presented in Table 2.\nTab. 2: Data Statistics of the QuantiPhy Benchmark\rCategory\rValue\rDescription\rTotal Videos\r569\rUnique video samples collected from multiple sources\rTotal QA Pairs\r3,355\rVideo-question pairs with numerical ground truth\rTask Types\r4\r2D-Static, 2D-Dynamic, 3D-Static, 3D-Dynamic\rVideo Duration\r2–3 seconds\rTypical length of each video clip\rData Sources\r3\rBlender simulation, lab capture, internet videos\rStorage Size\r~115 MB\rTotal dataset size after processing\r2.3 Data Construction QuantiPhy employs a three-stage construction pipeline that balances experimental control with real-world diversity. As illustrated in Figure 2.1, the authors integrate synthetic simulation, controlled laboratory capture, and in-the-wild internet videos to create a comprehensive evaluation benchmark.\nThe construction of QuantiPhy Benchmark\rStage 1: Data Collection. The authors source videos from three complementary channels to ensure broad coverage of physical scenarios:\nTab. 
3: Data Source Characteristics and Collection Methodology\rSource\rQuantity\rKey Characteristics\rPrimary Use Case\rBlender Simulation\r300 videos\rFull physical control; precise ground-truth; scalable scene variation\rControlled experiments; counterfactual testing\rLab Capture\r112 videos\rReal-world physics; 4D metric reconstruction; calibrated multi-view\rReal sensor validation; depth-varying 3D motion\rInternet Scraping\r72 videos\rNatural scenes; diverse distributions; uncontrolled conditions\rOut-of-distribution evaluation\rSegmented (SAM2)\r85 videos\rIsolated objects on plain backgrounds; background ablation\rScene complexity analysis\rBlender Simulation enables precise control over object kinematics, camera parameters, and scene composition. They render scenes using Cycles/EEVEE engines with varying resolutions (1920×1080, 1080×1080, 480×960), frame rates (24–120 fps), and lighting conditions. Motion types include: (i) keyframed animation for articulated objects (humans, animals), and (ii) physics-driven simulation for rigid-body dynamics with Newtonian constraints.\nLab Capture utilizes four Orbbec Femto Mega RGB-D cameras arranged in multi-view stereo configuration. They capture diverse motions including free fall, sliding, pendulum oscillation, and bouncing across small-scale (desk-top) and large-scale (room-scale) setups.\nInternet Videos are manually curated from open-source platforms and author-recorded footage, strictly filtered for static camera, translational motion, and visible reference objects. All identifiable information (faces, license plates) is anonymized via blurring.\nStage 2: Data Annotation. They employ source-specific annotation protocols to extract precise kinematic ground truth:\nTab. 
4: Annotation Methods by Data Source\rSource\rAnnotation Method\rExtracted Quantities\rPrecision\rBlender\rAutomated Python scripts querying scene graph\rSize, displacement, velocity, acceleration, depth\rExact (floating-point)\rLab\rUI-assisted depth clicking + multi-view triangulation\rMetric depth, 3D trajectory, instantaneous velocity/acceleration\r±1 cm (depth camera limited)\rInternet\rInteractive pixel measurement tool + reference scaling\rPixel kinematics → world units via γ estimation\rApproximate (reference-dependent)\rStage 3: Task Formulation. Each video is associated with multiple (prior, question, ground-truth) triplets following the kinematic inference framework:\nTab. 5: Video-Text Record Schema\rField\rDescription\rExample\rvideo_id\rUnique identifier\rsimulation_0032\rvideo_type\r4-character code: [Prior][Dim][Objects][Background]\rA3MC (Acceleration, 3D, Multiple objects, Complex)\rinference_type\rPrior dynamics → Target dynamics (S=static, D=dynamic)\rDD (Dynamic prior → Dynamic target)\rground_truth_prior\rProvided physical constant with unit\rgravity acc = 9.8 m/s²\rdepth_info\rTemporal depth annotations (3D tasks only)\rt=1s, distance_ball_camera = 1.4020m\rground_truth_posterior\rNumerical answer (unit specified in question)\r2.86\rThe four-character video type code systematically encodes task complexity:\n1st character: S (Size prior), V (Velocity prior), or A (Acceleration prior) 2nd character: 2 (2D planar motion) or 3 (3D depth-varying motion) 3rd character: S (Single object) or M (Multiple objects requiring relational reasoning) 4th character: X (Plain background), S (Simple texture), or C (Complex scene) This schema yields 36 fine-grained categories (e.g., A2SX, V3MC), each populated with ≥4 videos to ensure statistical validity. The final dataset comprises 569 unique videos and 3,355 question-answer pairs, with 2D:3D ratio of approximately 4:3 and balanced distribution across inference types. 
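The pixel-to-world computation behind the task formulation (Eqs. 1–2) is easy to sketch in code. The snippet below is a minimal NumPy illustration, not the authors' implementation; the function name `world_kinematics` and its inputs (a per-frame pixel trajectory from a fixed camera, the frame interval, and a single size prior with the object's pixel extent) are assumptions made for the sketch.

```python
import numpy as np

def world_kinematics(track_px, dt, size_prior_m, size_px):
    """Recover world-space velocity/acceleration from a pixel trajectory.

    track_px: (T, 2) array of per-frame pixel positions (fixed camera).
    dt: time between frames [s].
    size_prior_m / size_px: the single size prior [m] and the object's
    extent in pixels, which together fix the scale factor gamma (Eq. 2).
    """
    gamma = size_prior_m / size_px  # [m/pixel]
    # Eq. (1): finite-difference estimates in pixel space
    v_px = (track_px[1:] - track_px[:-1]) / dt
    a_px = (track_px[2:] - 2 * track_px[1:-1] + track_px[:-2]) / dt ** 2
    # Eq. (2): map pixel-space quantities to world space via gamma
    return gamma * v_px, gamma * a_px
```

For a quadratic trajectory the second difference is exact, so a synthetic track $x_t = \frac{1}{2}\,10\,t^2$ px with a 1 m object spanning 50 px ($\gamma = 0.02$ m/px) recovers a world acceleration of 0.2 m/s².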
🛠️ 3 Evaluation Protocol The QuantiPhy evaluation framework is designed to rigorously assess Vision-Language Models' quantitative physical reasoning through standardized prompting, robust parsing, and calibrated metrics. Their protocol addresses three critical challenges: (i) ensuring consistent model behavior across diverse architectures, (ii) extracting reliable numerical predictions from potentially verbose outputs, and (iii) measuring proximity to ground truth with appropriate tolerance for physical measurement uncertainty.\n3.1 Benchmark Models The authors evaluate 21 state-of-the-art VLMs spanning proprietary APIs and open-weight architectures to ensure comprehensive coverage of current capabilities:\nTab. 6: Evaluated Model Suite\rCategory\rModels\rKey Characteristics\rProprietary\rChatGPT-5.1, ChatGPT-5\rOpenAI multimodal with extended CoT reasoning\rGemini-2.5 Pro/Flash\rGoogle long-context video understanding\rGrok-4.1 (Fast Reasoning)\rxAI rapid inference with reasoning optimization\rClaude-4.5 Sonnet\rAnthropic detailed explanatory generation\rOpen-Weight\n(Scaling Series)\rQwen3-VL-Instruct (2B/8B/32B)\rAlibaba architecture scaling analysis\rInternVL-3.5 (2B/8B/30B)\rShanghai AI Lab vision-language alignment\rPhi-4-Multimodal / Phi-3-Mini\rMicrosoft efficient multimodal design\rSmolVLM-Instruct (256M)\rUltra-lightweight edge deployment\rSpecialized\rMolmo-7B, VILA-7B, LLaVA-13B\rAcademic research architectures\rMiniCPM-V 4.5, CogVLM2-Video\rNative video input processing\rDeployment Configuration: Proprietary models are accessed via official APIs (OpenAI, Google, Anthropic, xAI). Open-weight models are hosted via Replicate API or self-deployed with vLLM. Temperature is fixed at 0–0.1 for deterministic outputs; token limits range from 500 (lightweight models) to 10,000 (reasoning-intensive models).\n3.2 Prompting Strategy They employ a constrained generation protocol designed to minimize output variance and enforce numerical precision:\nTab. 
7: Standardized Prompt Structure\rComponent\rContent\rPurpose\r[Video Frames]\rFull temporal sequence at 480p resolution; all frames retained\rPreserve motion dynamics; avoid temporal aliasing\r[System Prompt]\r\"You are an expert video analyst specializing in physics measurements\"\rEstablish authoritative persona; pilot-validated for adherence\r[Ground Truth Prior]\rSingle physical constant (e.g., \"length of yellow car = 5.67m\")\rEnable scale factor γ determination\r[Depth Info] (3D only)\rTemporal camera-object distances\rSupport depth-varying kinematic inference\r[Question]\rTarget quantity with explicit unit and timestamp\rRemove ambiguity in prediction target\r[Post-Prompt]\r\"Output ONLY the numerical answer and unit. No explanation.\"\rSuppress verbose CoT; enforce parseability\rCritical Design Choices:\nTemporal fidelity over spatial resolution: 480p preserves all frames; subsampling degrades velocity/acceleration tracking Single prior constraint: Exactly one physical constant provided to test scale transformation, not multi-factor estimation Deterministic decoding: Greedy sampling (temperature=0) where supported; default parameters otherwise. 3.3 Answer Retrieval and Parsing Given model outputs ranging from concise numerical responses to extensive analytical narratives, they implement a hierarchical parsing pipeline (the parse_number function): 1. Exact Match: Check if response matches [number][unit] format 2. Delimiter Search: Scan for \u0026#34;=\u0026#34;, \u0026#34;Final Answer:\u0026#34;, \u0026#34;=\u0026gt;\u0026#34;, \u0026#34;:\u0026#34; → Retain substring after last delimiter 3. Unit Sanitization: Remove \u0026#34;meters\u0026#34;, \u0026#34;m/s\u0026#34;, \u0026#34;cm/s²\u0026#34; etc. 4. Heuristic Extraction: Apply regex for floating-point numbers → Take absolute value; select last valid number if multiple 5. 
Failure Handling: Return None if no valid number identified 3.4 Evaluation Metric: Mean Relative Accuracy (MRA) The authors adopt MRA as the primary metric, extending the design from VSI-Bench with threshold calibration for physical reasoning tasks: $$\r\\text{MRA}=\\frac{1}{10}\\sum_{\\theta\\in\\mathcal{C}}\\mathbb{1}\\bigg(\\frac{|\\hat{y}-y|}{|y|}\\lt 1-\\theta\\bigg),\\quad \\mathcal{C}=\\{0.5,0.55,...,0.95\\}\r\\tag{3}\r$$\rTab. 8: MRA Design Rationale and Properties\rProperty\rDescription\rPhysical Reasoning Justification\rMulti-threshold\r10 confidence levels (0.5–0.95)\rCaptures gradations of \"accurate enough\"; avoids binary rigidity\rRelative error\r$|\\hat{y}-y|/|y|$ rather than absolute\rScale-invariant; comparable across microscopic to astronomical scenes\rPartial credit\rLinear accumulation across thresholds\r3.1m error (3% relative) rewarded; 31m error (1000% relative) penalized\rRobustness\rIndicator function rather than continuous loss\rTolerates annotation ambiguity (hair inclusion in height, rim vs. outer diameter)\rAggregation Protocol:\nQuestion-level: MRA computed per (video, question) pair. Category-level: Average MRA across all questions in {2D-Static, 2D-Dynamic, 3D-Static, 3D-Dynamic}. Model-level: Unweighted mean of four category scores. Questions with no valid numerical output after 5 retries contribute MRA = 0 to the category average.\n📊 4 Experiments 4.1 Main Results Figure 4.1 presents performance across four kinematic inference categories. ChatGPT-5.1 achieves the highest overall MRA (53.1%), marginally surpassing humans on 2D-Dynamic tasks but remaining below the human average of 55.6%. 
Open-weight models exhibit clear scaling effects: Qwen3-VL improves from 29.0% (2B) to 46.0% (32B), with gains most pronounced on dynamic categories requiring temporal integration.\nMain Results on QuantiPhy (MRA %)\r4.2 Effect of Scene Context The authors analyze performance across scene difficulty axes:\nBackground Complexity: Performance in complex backgrounds (C, 0.40 MRA) slightly exceeds simple textures (S, 0.38) and plain backgrounds (X, 0.35). Realistic backgrounds provide additional scale reference cues (road markings, architectural elements) that aid inference. Object Multiplicity: Multiple-object scenes (M) consistently outperform single-object scenes (S) by 3–5 MRA points. Additional objects serve as implicit comparison standards for size and speed estimation. Effect of scene context\r🧠 5 Reflection \u0026amp; Inspiration The study reveals that current vision-language models struggle with quantitative physical reasoning. Instead of relying on visual evidence and provided priors, they tend to depend heavily on pre-trained world knowledge, leading to limited numerical accuracy and poor input faithfulness.\nPros: Novel quantitative paradigm: Moves beyond binary VQA evaluation to continuous numerical accuracy with MRA metric, distinguishing 3.1m error (acceptable) from 31m error (catastrophic). Controlled yet diverse data: Blender simulation enables exact ground-truth and systematic variation; lab capture adds real-world validation; internet data tests distribution generalization. Cons: Simplified physical scope: Restricted to translational motion of rigid objects—no rotation, deformation, fluid dynamics, or multi-body contact physics relevant to real robotics. Fixed camera assumption: Eliminates ego-motion ambiguity present in embodied navigation and AR/VR applications, so findings may not transfer to general moving-camera scenarios. 
No proposed remedy: The benchmark identifies failure modes but offers no demonstration of improved training recipes or fine-tuning strategies to address them. Inspiration: From perception to cognition: Personal testing on VSI-Bench confirms SOTA models maintain strong performance (e.g., Object Counting) without visual input—mirroring QuantiPhy's findings. Rather than forcing pure perception, we should architect cognitive systems that strategically arbitrate between sensing and memory, transforming VLMs from passive perceivers into active agents that know when to look and when to recall. Agentic system as solution pathway: Even with substantial room for base model improvement, the immediate deployment of embodied AI may benefit more from intelligent system design—explicit uncertainty quantification, selective memory retrieval, and input-confidence gating—than from waiting for perfect end-to-end physical reasoning to emerge. ","permalink":"https://milknocandy.github.io/posts/2026-03-23-quantiphy/","summary":"\u003cdiv class=\"paperbox\"\u003e\n    \u003cdiv class=\"pb-item\"\u003e\n        \u003cspan class=\"pb-key\"\u003eTOPIC\u003c/span\u003e\n        \u003cspan class=\"pb-sep\"\u003e\u003c/span\u003e\n        \u003cspan class=\"pb-val\"\u003eQuantitative Physical Understanding\u003c/span\u003e\n    \u003c/div\u003e\n    \u003cdiv class=\"pb-item\"\u003e\n        \u003cspan class=\"pb-key\"\u003eWHY READ\u003c/span\u003e\n        \u003cspan class=\"pb-sep\"\u003e\u003c/span\u003e\n        \u003cspan class=\"pb-val\"\u003eExposes that top VLMs guess physical quantities from memory (pre-trained world knowledge) rather than measure from video, with rigorous tests to diagnose this failure.\u003c/span\u003e\n    \u003c/div\u003e\n    \u003cdiv class=\"pb-item\"\u003e\n        \u003cspan class=\"pb-key\"\u003eTAKEAWAY\u003c/span\u003e\n        \u003cspan class=\"pb-sep\"\u003e\u003c/span\u003e\n        \u003cspan class=\"pb-val\"\u003eCurrent VLMs are cognitive mimics not 
physical reasoners, so build systems that arbitrate between perception and memory rather than forcing pure end to end inference. (Context Learning, Agentic AI)\u003c/span\u003e\n    \u003c/div\u003e\n    \u003cdiv class=\"pb-links\"\u003e\n        \u003cspan class=\"pb-org\"\u003eStanford University, UST\u003c/span\u003e\n        \u003cdiv class=\"pb-link-group\"\u003e\u003ca href=\"https://arxiv.org/abs/2512.19526\" target=\"_blank\" class=\"pb-link\"\u003e📄 Paper\u003c/a\u003e\u003ca href=\"https://github.com/Paulineli/QuantiPhy\" target=\"_blank\" class=\"pb-link\"\u003e💻 Code\u003c/a\u003e\u003ca href=\"https://github.com/Paulineli/QuantiPhy\" target=\"_blank\" class=\"pb-link\"\u003e🌐 Project\u003c/a\u003e\u003ca href=\"https://github.com/Paulineli\" target=\"_blank\" class=\"pb-link\"\u003e👤 Author\u003c/a\u003e\n        \u003c/div\u003e\n    \u003c/div\u003e\n\u003c/div\u003e\n\n\u003chr\u003e\n\u003ch2 id=\"-1-motivation--problem\"\u003e🚀 1 Motivation \u0026amp; Problem\u003c/h2\u003e\n\u003cp\u003eHumans understand the physical world through structured mathematical abstractions. From Isaac Newton’s formulation of universal gravitation inspired by a falling apple, to modern physics, quantitative laws enable precise reasoning about the dynamics of the real world. In contrast, although state-of-the-art AI systems demonstrate remarkable capabilities in mathematical reasoning, programming, and scientific writing, enabling artificial intelligence to \u003cu\u003e\u003ci\u003eground its understanding in the physical world\u003c/i\u003e\u003c/u\u003e remains a fundamental and unresolved challenge. This limitation poses a critical barrier to deploying AI systems in real-world, embodied environments.\u003c/p\u003e","title":"When VLMs Become Cognitive Mimics, Not Physical Reasoners: A QuantiPhy Study"},{"content":"1 Benchmark 1.1 Textual Benchmarks Arxiv 2026\rCan LLMs See Without Pixels? 
Benchmarking Spatial Intelligence from Textual Descriptions\n🔻\r🏛️\rBeijing Institute of Technology\r🏛️\rBUCT\r👤 Author\r📄 Paper\r💻 Code\r🏷️\rSubject: Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation\r❓\rProblem:\rPerception–reasoning entanglement in VLM benchmarks Lack of high-fidelity text-only spatial tasks Over-reliance on language priors/pattern matching Weak evaluation of global consistency, mental mapping 💡\rIdea: Convert visual scenes into coordinate-aware text to isolate and test symbolic spatial reasoning in LLMs.\r🛠️\rSolution:\rSiT-Bench: 3.8K QA across 5 categories, 17 subtasks for spatial cognition Textual Encoding: Multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning Dual Construction: Image-based generation + vision-benchmark-to-text adaptation R1 Filtering: Reasoning-based filtering removes trivial, inconsistent, leakage samples Evaluation Protocol: Compare LLMs/VLMs with/without CoT to isolate reasoning ability 🏆\rResults: Best model 59.46% vs. 74.42% human; large gap in global tasks (\u003c10% mapping). 
CoT significantly improves performance, validating latent but underutilized spatial reasoning.\rExample of SiT Benchmark\r1.2 Text-to-Image Benchmarks ICLR 2026\rEverything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models\n🔻\r🏛️\rAMAP - Alibaba Group\r🏛️\rBeijing University of Posts and Telecommunications\r👤 Author\r📄 Paper\r💻 Code\r🏷️\rSubject: Information-dense Spatial Benchmarking for Text-to-Image Spatial Intelligence\r❓\rProblem:\rPrompt Sparsity: short / sparse prompts → fail to probe complex spatial constraints Metric Coarseness: yes/no, detection → lack fine-grained diagnosis Spatial Intelligence Gap: strong \u0026quot;what\u0026quot;, weak \u0026quot;where/how/why\u0026quot; Reasoning Blind Spot: comparison, occlusion, causality under-evaluated 💡\rIdea: Use information-dense prompts + omni-dimensional QA to explicitly decompose and measure spatial intelligence across perception, reasoning, and interaction.\r🛠️\rSolution:\rSpatialGenEval: 1,230 long prompts + 10 sub-domains; comprehensive spatial coverage Omni-QA Evaluation: 10 multi-choice QAs per prompt; fine-grained capability diagnosis Hierarchical Decomposition: foundation → perception → reasoning → interaction modeling Leakage-Free Evaluation: image-only QA, “None” option prevents forced guessing SpatialT2I Dataset: 15.4K pairs; rewritten dense prompts for training consistency Data-Centric SFT: fine-tune T2I models to enhance spatial reasoning 🏆\rResults: Spatial reasoning emerges as dominant bottleneck (~20–30% on key sub-tasks); SpatialT2I yields consistent gains (+4.2%–5.7%), validating data-centric improvement.\r💭 Thoughts:\rNeed Bidirectional Evaluation: Current T2I benchmarks only test forward generation, but spatial intelligence should be bidirectional and reversible. Can a model truly understand spatial relations if it cannot consistently reconstruct them across generation and interpretation (T2I ↔ I2T)? 
Cross-modal Spatial Consistency: Do multimodal models maintain a unified spatial representation when reasoning across image and text, or do they rely on modality-specific shortcuts? Structure-aware Spatial Robustness: Can a model still perform correct spatial reasoning when specific spatial factors (e.g., position, occlusion) are selectively removed rather than randomly missing? Samples of SpatialGenEval. T2I Generation $\\rightarrow$ MLLMs as a judge evaluation.\rComparisons between SpatialGenEval and previous T2I Benchmarks. \u0026#39;L\u0026#39; and \u0026#39;S\u0026#39; denote long and short prompts.\r1.3 Video-based Benchmarks Arxiv 2025\rQuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models\n🔻\r🏛️\rStanford University\r🏛️\rUST\r👤 Author\r📄 Paper\r💻 Code\r🚀 Demo\r🏷️\rSubject: Quantitative Kinematic Benchmark for VLM Physical Reasoning Evaluation\r❓\rProblem:\rQualitative Evaluation Bias: current benchmarks are VQA-style and lack numerical precision sensitivity. Missing Kinematic Quantification: no explicit size/velocity/acceleration inference. 💡\rIdea: Cast physical reasoning as prior-conditioned kinematic scaling with numerical error calibration.\r🛠️\rSolution:\rQuantiPhy Benchmark: 3.3K video–text pairs; numeric GT for kinematics. Kinematic Inference Task: single prior → infer remaining quantities via scaling MRA Metric: multi-threshold relative error aggregation. Diagnostic Probing Suite: prior-only, counterfactual, CoT analyses. 🏆\rResults: Best VLM achieves 53.1 MRA vs. 
human 55.6; counterfactual drops (70–80%) reveal failure in input-faithful quantitative reasoning and reliance on memorized priors.\rExamples from QuantiPhy Benchmark\r","permalink":"https://milknocandy.github.io/posts/2026-03-19-si/","summary":"\u003ch2 id=\"1-benchmark\"\u003e1 Benchmark\u003c/h2\u003e\n\u003ch3 id=\"11-textual-benchmarks\"\u003e1.1 Textual Benchmarks\u003c/h3\u003e\n\u003cp\u003e\u003cdetails class=\"paper-details-wrapper\"\u003e\r\n    \u003csummary class=\"paper-summary\"\u003e\r\n        \u003cdiv class=\"summary-inner\"\u003e\r\n            \r\n\r\n            \r\n            \r\n            \r\n            \r\n\r\n            \u003cspan class=\"s-venue-dynamic v-arxiv-2026\"\u003e\r\n                \u003csvg viewBox=\"0 0 24 24\" fill=\"none\" stroke=\"currentColor\" stroke-width=\"2\"\n                stroke-linecap=\"round\" stroke-linejoin=\"round\" class=\"v-icon\"\u003e\n                \u003cpath d=\"M14.5 2H6a2 2 0 0 0-2 2v16a2 2 0 0 0 2 2h12a2 2 0 0 0 2-2V7.5L14.5 2z\"\u003e\u003c/path\u003e\n                \u003cpolyline points=\"14 2 14 8 20 8\"\u003e\u003c/polyline\u003e\n            \u003c/svg\u003e\r\n                \u003cspan class=\"v-text\"\u003eArxiv 2026\u003c/span\u003e\r\n            \u003c/span\u003e\r\n\r\n            \u003cp class=\"s-title\"\u003eCan LLMs See Without Pixels? 
Benchmarking Spatial Intelligence from Textual Descriptions\u003c/p\u003e\r\n            \u003cspan class=\"s-toggle-icon\"\u003e🔻\u003c/span\u003e\r\n        \u003c/div\u003e\r\n    \u003c/summary\u003e\r\n\r\n    \u003cdiv class=\"paper-card-expanded\"\u003e\r\n        \u003cdiv class=\"expand-action-bar\"\u003e\r\n            \u003cdiv class=\"org-outer-container\"\u003e\r\n                \r\n                \u003cdiv class=\"org-group\"\u003e\r\n                    \r\n                    \r\n                      \u003cspan class=\"org-tag\"\u003e🏛️\r\n                        Beijing Institute of Technology\u003c/span\u003e\r\n                    \r\n                    \r\n                      \u003cspan class=\"org-tag\"\u003e🏛️\r\n                        BUCT\u003c/span\u003e\r\n                    \r\n                    \r\n                \u003c/div\u003e\r\n                \r\n            \u003c/div\u003e\r\n\r\n            \u003cdiv class=\"action-btns-fixed\"\u003e\r\n                \u003ca href=\"https://binisalegend.github.io/\" target=\"_blank\" class=\"act-btn\"\u003e👤 Author\u003c/a\u003e\r\n                \u003ca href=\"https://arxiv.org/abs/2601.03590\" target=\"_blank\" class=\"act-btn\"\u003e📄 Paper\u003c/a\u003e\r\n                \u003ca href=\"https://github.com/binisalegend/SiT-Bench\" target=\"_blank\" class=\"act-btn\"\u003e💻 Code\u003c/a\u003e\r\n                \r\n            \u003c/div\u003e\r\n        \u003c/div\u003e\u003cdiv class=\"expand-grid\"\u003e\u003cdiv class=\"ex-row\"\u003e\u003cspan class=\"ex-icon\"\u003e🏷️\u003c/span\u003e\r\n                \u003cdiv class=\"ex-text\"\u003e\u003cb\u003eSubject:\u003c/b\u003e Textual spatial reasoning benchmark for intrinsic LLM spatial intelligence evaluation\u003c/div\u003e\r\n            \u003c/div\u003e\r\n            \u003cdiv class=\"ex-row\"\u003e\u003cspan class=\"ex-icon\"\u003e❓\u003c/span\u003e\r\n                \u003cdiv 
class=\"ex-text\"\u003e\u003cb\u003eProblem:\u003c/b\u003e\r\n                    \u003cdiv class=\"ex-markdown-inner\"\u003e \u003cul\u003e\n\u003cli\u003ePerception–reasoning entanglement in VLM benchmarks\u003c/li\u003e\n\u003cli\u003eLack of high-fidelity text-only spatial tasks\u003c/li\u003e\n\u003cli\u003eOver-reliance on language priors/pattern matching\u003c/li\u003e\n\u003cli\u003eWeak evaluation of global consistency, mental mapping\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/div\u003e\r\n                \u003c/div\u003e\r\n            \u003c/div\u003e\r\n            \u003cdiv class=\"ex-row\"\u003e\u003cspan class=\"ex-icon\"\u003e💡\u003c/span\u003e\r\n                \u003cdiv class=\"ex-text\"\u003e\u003cb\u003eIdea:\u003c/b\u003e Convert visual scenes into \u003cmark\u003ecoordinate-aware text\u003c/mark\u003e to isolate and test \u003cmark\u003esymbolic spatial reasoning\u003c/mark\u003e in LLMs.\u003c/div\u003e\r\n            \u003c/div\u003e\r\n\r\n            \u003cdiv class=\"ex-row ex-sol-box\"\u003e\r\n                \u003cspan class=\"ex-icon\"\u003e🛠️\u003c/span\u003e\r\n                \u003cdiv class=\"ex-text\"\u003e\r\n                    \u003cb\u003eSolution:\u003c/b\u003e\r\n                    \u003cdiv class=\"ex-markdown-inner\"\u003e\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eSiT-Bench:\u003c/strong\u003e 3.8K QA across 5 categories, 17 subtasks for spatial cognition\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eTextual Encoding:\u003c/strong\u003e Multi-view scenes → coordinate-aware descriptions enabling symbolic reasoning\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDual Construction:\u003c/strong\u003e Image-based generation + vision-benchmark-to-text adaptation\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eR1 Filtering:\u003c/strong\u003e Reasoning-based filtering removes trivial, inconsistent, leakage samples\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eEvaluation Protocol:\u003c/strong\u003e Compare LLMs/VLMs 
with/without CoT to isolate reasoning ability\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/div\u003e\r\n                \u003c/div\u003e\r\n            \u003c/div\u003e\r\n\r\n            \u003cdiv class=\"ex-row\"\u003e\u003cspan class=\"ex-icon\"\u003e🏆\u003c/span\u003e\r\n                \u003cdiv class=\"ex-text\"\u003e\u003cb\u003eResults:\u003c/b\u003e Best model 59.46% vs. 74.42% human; large gap in global tasks (\u003c10% mapping). CoT significantly improves performance, validating latent but underutilized spatial reasoning.\u003c/div\u003e\r\n            \u003c/div\u003e\r\n\r\n            \r\n\r\n            \r\n            \r\n        \u003c/div\u003e\r\n    \u003c/div\u003e\r\n\u003c/details\u003e\n\r\n\u003cfigure \u003e\r\n    \u003cimg src=\"1_Sample4SiT.png\" alt=\"Example of SiT Benchmark\" /\u003e\u003cfigcaption\u003e\r\n        \u003cspan class=\"auto-fig-title\"\u003eExample of SiT Benchmark\u003c/span\u003e\r\n    \u003c/figcaption\u003e\u003c/figure\u003e\u003c/p\u003e","title":"Spatial Intelligence in Large Models: Benchmarks, Mechanisms, and Reasoning"},{"content":"\r0-1 Knapsack Problem Description: There are $N$ items and one knapsack with a maximum capacity of $V$. Each item can be selected at most once (i.e., either take it or leave it). The $i\\text{-th}$ item has a volume of $v_i$​ and a value of $w_i$​. Your task is to select a subset of items to put into the knapsack such that:\nThe total volume of the selected items does not exceed the knapsack's capacity V. The total value of the selected items is maximized. Output the maximum possible total value achievable under these constraints.\nInput Format:\nLine 1: Two integers $N$ and $V$, separated by a space, representing the number of items and the capacity of the knapsack, respectively. Next $N$ lines: Each line contains two integers $v_i$​ and $w_i$​, separated by a space, representing the volume and value of the $i\\text{-th}$ item. 
Output Format: Output a single integer, the maximum total value of items that can be packed into the knapsack.

Constraints:
$$0 \lt N, V \le 1000,\quad 0 \lt v_i, w_i \le 1000$$

Sample Input:

4 5
1 2
2 4
3 4
4 5

Sample Output:

8

As can be seen from the problem statement, our goal is to determine the optimal subset of items to pack into the knapsack to maximize the total value. Since each item can be selected at most once (the 0-1 knapsack constraint), the core idea is to iteratively determine the optimal selection when considering the $i\text{-th}$ item. This decision-making process can be formalized using the following dynamic programming recurrence relation:
$$\text{dp}[i][j] = \max(\text{dp}[i-1][j], \text{dp}[i-1][j-v_i]+w_i)\tag{1-1}$$
In the iteration where we decide whether to include the $i\text{-th}$ item, we compare the total value obtained by including it against the optimal value achieved without it (i.e., the optimal solution for the first $i-1$ items).
This is a classic dynamic programming problem, and the corresponding implementation code is provided below.

C++: 1-D sequence

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    int n, m;
    cin >> n >> m;
    vector<int> f(m + 1);
    int v, w;
    for (int i = 1; i <= n; i++) {
        cin >> v >> w;
        // Reverse iteration (core: j >= v)
        for (int j = m; j >= v; j--) {
            f[j] = max(f[j], f[j - v] + w);
        }
    }
    cout << f[m] << endl;
    return 0;
}
```

C++: 2-D matrix

```cpp
#include <bits/stdc++.h>
using namespace std;
const int MAXN = 1005;
int v[MAXN]; // volume
int w[MAXN]; // value

int main() {
    int n, m;
    cin >> n >> m;
    vector<vector<int>> f(n + 1, vector<int>(m + 1, 0));
    for (int i = 1; i <= n; i++) cin >> v[i] >> w[i];
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            // If the current capacity cannot fit the i-th item,
            // the optimum equals that of the first i-1 items
            if (j < v[i]) f[i][j] = f[i - 1][j];
            // Otherwise, compare skipping vs. taking the item
            else f[i][j] = max(f[i - 1][j], f[i - 1][j - v[i]] + w[i]);
        }
    cout << f[n][m] << endl;
    return 0;
}
```

Python: 1-D sequence

```python
def main():
    num_N, val_V = map(int, input().split())
    solutions = [0] * (val_V + 1)
    # Bind max to a local name (tiny speedup, no loss of clarity)
    max_func = max
    for _ in range(num_N):  # no loop counter needed, "_" is cleaner
        vol, val = map(int, input().split())
        # Reverse iteration preserves the 0-1 constraint
        for j in range(val_V, vol - 1, -1):
            solutions[j] = max_func(solutions[j], solutions[j - vol] + val)
    print(solutions[val_V])

if __name__ == '__main__':
    main()
```

Complete
Knapsack Problem

Description: There are $N$ items and a knapsack with capacity $V$. Each item can be selected any number of times. The $i\text{-th}$ item has volume $v_i$ and value $w_i$. Select items to maximize total value, with total volume $\le V$. Output the maximum value.

Input Format:
- Line 1: Two integers $N, V$ (number of items, knapsack capacity).
- Next $N$ lines: Each line has two integers $v_i$, $w_i$ (volume and value of the $i\text{-th}$ item).

Output Format: Output a single integer (maximum total value).

Constraints:
$$0 \lt N, V \le 1000,\quad 0 \lt v_i, w_i \le 1000$$

Sample Input:

4 5
1 2
2 4
3 4
4 5

Sample Output:

10

We can add one for-loop to the code from the 0-1 Knapsack Problem.

C++: naive unbounded knapsack

```cpp
#include <bits/stdc++.h>
using namespace std;
const int MAXN = 1005;
int v[MAXN]; // volume
int w[MAXN]; // value

int main() {
    int n, m;
    cin >> n >> m;
    vector<vector<int>> f(n + 1, vector<int>(m + 1, 0));
    for (int i = 1; i <= n; i++) cin >> v[i] >> w[i];
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++) {
            // Try every feasible count k of the i-th item
            for (int k = 0; k * v[i] <= j; k++) {
                f[i][j] = max(f[i][j], f[i - 1][j - k * v[i]] + k * w[i]);
            }
        }
    cout << f[n][m] << endl;
    return 0;
}
```

Let's expand the recurrence to see how the update can be simplified:
$$
\scriptsize
\begin{aligned}
\text{dp}[i][j] &= \max(\text{dp}[i-1][j],\ \text{dp}[i-1][j-v_i]+w_i,\ \text{dp}[i-1][j-2\cdot v_i]+2\cdot w_i,\ \dots)\\
\text{dp}[i][j-v_i] &= \max(\text{dp}[i-1][j-v_i],\ \text{dp}[i-1][j-2\cdot v_i]+w_i,\ \text{dp}[i-1][j-3\cdot v_i]+2\cdot w_i,\ \dots)
\end{aligned}\tag{1-2}
$$
Comparing the two lines, every term of $\text{dp}[i][j]$ after the first equals a term of $\text{dp}[i][j-v_i]$ plus $w_i$, so the inner loop over $k$ collapses to $\text{dp}[i][j] = \max(\text{dp}[i-1][j], \text{dp}[i][j-v_i]+w_i)$.
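Collapsing the inner count loop in this way yields the standard $O(N \cdot V)$ one-dimensional version: iterating $j$ forward (unlike the reverse loop in the 0-1 case) lets `f[j - vol]` already include the current item, which is exactly what permits unlimited reuse. A minimal Python sketch (the function name and interface are mine, not from the original post):

```python
def complete_knapsack(capacity, items):
    """Unbounded knapsack: each (volume, value) item may be taken any number of times."""
    f = [0] * (capacity + 1)
    for vol, val in items:
        # Forward iteration: f[j - vol] may already include the current item,
        # which is what allows taking it an unlimited number of times.
        for j in range(vol, capacity + 1):
            f[j] = max(f[j], f[j - vol] + val)
    return f[capacity]

# Sample from the problem statement: capacity 5, items (1,2), (2,4), (3,4), (4,5)
print(complete_knapsack(5, [(1, 2), (2, 4), (3, 4), (4, 5)]))  # → 10
```

On the sample input this returns 10 (five copies of the (1, 2) item), matching the sample output above.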
Title: Knapsack Problem

---

1 Timeline Order

Summarize the literature reviewed in chronological order.

2026

【Arxiv 2026】- WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens (MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China, Zhejiang University, The Hong Kong University of Science and Technology)
📄 Paper

Subject: Bridging Pre-trained VLMs and Diffusion Models for UMMs
❓ Problem: Existing methods (e.g., MetaQuery) perform alignment via learnable queries but suffer from poor task generalization: they require early-stage retraining for significantly different task types.
💡 Idea: A Probabilistic Expert Bridge (following Bagel) that samples Noisy Query Tokens.
🛠️ Solution:
- Noisy Query Tokens: Sample tokens from the standard normal distribution $N(0, I)$ at each training step to learn a robust distributed intermediate representation space instead of task-specific features.
- Probabilistic Expert Bridge: Freeze the VLM core parameters, add a parallel generative pathway following the division of labor (VLM for understanding, Diffusion Model for generation), and use a Position MLP for feature alignment and spatial cue injection.
- VAE Branch: Inject fine-grained VAE features into the VLM via a linear projection layer to fuse high-level semantics and low-level visual details, reducing the Diffusion Model's burden.
- Progressive Training: Adopt a four-stage curriculum training strategy, flexibly switching between contrastive and conditional flow-matching losses while gradually increasing resolution and task complexity.
🏆 Results: Though the performance is not SOTA, it alleviates task generalization collapse in UMMs, facilitates stable cross-task continual learning, and retains fine-grained image details.

[Figure: WeMMU]

【CVPR 2026】- UAE: Incentivizing Mutual Benefits for Unified Multimodal Understanding and Generation via RL (Peking University, Baidu, Rabbitpre AI, SYSU, USTC, CASIA)
👤 Author 📄 Paper 💻 Code

Subject: Bridging Pre-trained VLMs and Diffusion Models for UMMs
❓ Problem: Image-to-text (I2T) and text-to-image (T2I) tasks are optimized independently, failing to leverage their inherent connection for mutual enhancement. Joint training of existing UMMs leads to mutual degradation of understanding and generation capabilities, while decoupled training misses cross-task reciprocal benefits.
💡 Idea: Link I2T and T2I from an auto-encoder perspective (text as the intermediate latent representation), plus Unified-GRPO RL post-training with reconstructive rewards.
🛠️ Solution:
- Unified Auto-Encoder Paradigm: Define I2T as image-to-text semantic encoding and T2I as text-to-image decoding, taking the semantic similarity between the input and reconstructed images as the core optimization objective.
- Unified-GRPO Post-Training Strategy: Adapt to two mainstream UMMs, freeze the visual modules to optimize only the LLMs, and adopt CLIP + generator as a frozen reconstructive reward module.
- Unified-Bench Evaluation Benchmark: Design dual protocols: compute a Unified-Score through four visual backbones and evaluate caption quality via a commercial LLM.
🏆 Results: UAE achieves an overall Unified-Score of 86.09 on Unified-Bench, surpassing GPT-4o-Image's 85.95, and attains SOTA generation performance of 0.86 on GenEval and 0.475 on GenEval++. The core innovation of reconstructive reinforcement learning is fully validated: it successfully drives the model to produce long, detail-rich text that indirectly enhances image perception, establishing a bidirectional synergistic mechanism.

[Figure: The workflow of UAE]

2025

【ICLR 2025】- Reconstructive Visual Instruction Tuning (CASIA, University of Hong Kong, MEGVII Tech., StepFun)
👤 Author 📄 Paper 💻 Code 🚀 Demo

Subject: Visual Instruction Tuning for Large Multimodal Models
❓ Problem:
- LLM-centric Training Paradigm: Conventional visual instruction tuning for LMMs relies on vision-to-text alignment and text-only supervision.
- Extrinsic Assistance: Previous vision-centric methods leverage extra vision experts[1] at the encoder end to enrich crucial visual details for MLLMs, but require careful manual selection of experts and result in a complex inference process.
- Spatial Redundancy in Images: Visual signals have heavy spatial redundancy, making it hard to generate meaningful feedback from natural images.
💡 Idea: Reconstruct latent visual tokens of input images with a denoiser to supervise the visual outputs of LMMs.
🛠️ Solution:
- Reconstruction Variant Design: Proposes three regression-based reconstruction variants: $\textbf{ROSS}^R\text{-Pixel}$ (regresses raw RGB pixel values via a patchify operation), $\textbf{ROSS}^R\text{-Latent}$ (regresses fine-grained latent tokens extracted by frozen teacher tokenizers: VAE/DINOv2/DeiT-III), and $\textbf{ROSS}^R\text{-Latent2Pixel}$ (maps back to RGB pixel space for regression).
- Training Objective:
  - How to reconstruct: Replaces vanilla regression with a per-token denoising objective to address visual spatial redundancy.
  - How to train: Trains the model with a joint loss of the original textual next-token prediction and visual reconstructive denoising.
🏆 Results: Reconstructive objectives significantly boost LMMs' fine-grained visual comprehension and reduce hallucinations, while generative objectives focus only on high-aesthetic image generation instead of text-image alignment and thus fail to improve multimodal comprehension.

References:
[1] S. Tong et al., Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs,
in CVPR 2024.

[Figure: Training Procedure of ROSS]
Title: The Evolution of Unified Multimodal Models

---

1 Installation

1.1 Installation for python environment

To install nvcc inside a Python virtual environment, conda is all you need. It lets you pick the exact CUDA version for each env, so you never clash with system-wide installs. Please follow the steps below to install nvcc:

1. Search for all available nvcc versions

```bash
# activate your conda environment first
$ conda activate xxenv
# search for available cudatoolkit versions
$ conda search cudatoolkit --channel conda-forge
# output like this:
Loading channels: done
# Name        Version        Build   Channel
cudatoolkit    5.5rc1           p0   anaconda/pkgs/pro
cudatoolkit     5.5.1           p0   anaconda/pkgs/pro
cudatoolkit       6.0           p0   anaconda/pkgs/pro
cudatoolkit       7.0            1   anaconda/pkgs/pro
cudatoolkit       7.5            0   anaconda/pkgs/free
cudatoolkit       7.5            2   anaconda/pkgs/free
cudatoolkit       8.0            1   anaconda/pkgs/free
cudatoolkit       8.0            3   anaconda/pkgs/free
cudatoolkit       9.0   h13b8566_0   anaconda/pkgs/main
cudatoolkit       9.0   h13b8566_0   pkgs/main
cudatoolkit       9.2            0   anaconda/pkgs/main
cudatoolkit       9.2            0   pkgs/main
...
```
2. Install a specific version of nvcc

If you install nvcc within the base environment, you may run into errors such as OSError: [Errno 39] Directory not empty: 'xxx/anaconda3/lib/ossl-modules'. Therefore, I recommend installing it in a manually created virtual environment instead.

```bash
# Install specific version of `nvcc`
$ conda install -c nvidia cudatoolkit=11.8 -y  # or conda install -c conda-forge cudatoolkit=x.x
```

Now let us check what we have obtained by running this.

```bash
$ conda list | grep cuda
# output like this:
cudatoolkit               11.8.0   h6a678d5_0   https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
nvidia-cuda-cupti-cu11    11.8.87      pypi_0   pypi
nvidia-cuda-nvrtc-cu11    11.8.89      pypi_0   pypi
nvidia-cuda-runtime-cu11  11.8.89      pypi_0   pypi
```

Oops, it looks like the cuda-nvcc package is missing. Let's install it with conda.

```bash
$ conda search cuda-nvcc --channel nvidia
# output like this:
Loading channels: done
# Name      Version        Build   Channel
cuda-nvcc   11.5.50    h8f81028_0   nvidia
cuda-nvcc  11.5.119    h2e31d95_0   nvidia
cuda-nvcc   11.6.55    h5758ece_0   nvidia
cuda-nvcc  11.6.112    hf7fc535_0   nvidia
cuda-nvcc  11.6.124    hbba6d2d_0   nvidia
cuda-nvcc   11.7.64             0   nvidia
cuda-nvcc   11.7.99             0   nvidia
cuda-nvcc   11.8.89             0   nvidia
cuda-nvcc   12.0.76             0   nvidia
...
```
```bash
# Install the corresponding version of nvcc
$ conda install -c nvidia cuda-nvcc=11.8.89 -y
```

Finally, let us check if nvcc is correctly installed.

```bash
$ nvcc --version
# output like this:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```

Note that you may need to run the following command before checking nvcc.

```bash
# set temporary variables pointing at the conda environment
export PATH=$CONDA_PREFIX/bin:$PATH
export CUDA_HOME=$CONDA_PREFIX
# or set this permanently (example for a system-wide CUDA 12.1 install; adjust the path to your version)
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc  # refresh
```

2 Basic Usage

Once nvcc is installed, you can start using it to compile your CUDA programs. Here are some basic commands to get you started:

Compile a CUDA program

```bash
$ nvcc -o my_program my_program.cu
```

This command compiles the my_program.cu file and generates an executable named my_program.

Compile with a specific architecture

```bash
$ nvcc -arch=sm_61 -o my_program my_program.cu
```

This command compiles the CUDA program for a specific GPU architecture (in this case, sm_61).
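When scripting environment checks (e.g., before a build step), it can be handy to parse the release number out of `nvcc --version` programmatically. A small Python helper, as a sketch: the regex targets the "release X.Y" line shown in the output above, and the function name is illustrative, not part of any CUDA tooling:

```python
import re

def parse_nvcc_release(version_output):
    """Extract the 'release X.Y' number from `nvcc --version` output."""
    m = re.search(r"release (\d+\.\d+)", version_output)
    return m.group(1) if m else None

# Sample output copied from the check above
sample = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Cuda compilation tools, release 11.8, V11.8.89
"""
print(parse_nvcc_release(sample))  # → 11.8
```

In practice you would feed it the captured stdout of `subprocess.run(["nvcc", "--version"], ...)` and fail fast if it returns None.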
Title: NVCC Master Guide: From Installation to Performance Tuning

---

1 Timeline Order

Summarize the literature reviewed in chronological order.

2023

📝【EMNLP 2023 - Main】- Sparse Low-rank Adaptation of Pre-trained Language Models (Tsinghua University, The University of Chicago)

Subject: Adaptive Rank Selection
- Problem: Standard LoRA uses a fixed, inflexible rank (hyperparameter $r$), requiring expensive manual tuning.
- Core Idea: Make the rank learnable rather than fixed.
- Mechanism:
  - Gating: Introduces an optimizable gating unit to the low-rank matrices.
  - Optimization: Uses proximal gradient methods to update the gates.
  - Dynamics: Automatically prunes less important ranks during training.
- Result: Eliminates discrete rank search; the model discovers its own optimal rank structure.

[Figure: SoRA]

2024

🍰【Arxiv 2024】- MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts (Sichuan University, Purdue University, Emory University, Nanyang Technological University)

A solid summary of various LoRA variants.

[Figure: MixLoRA]

📔【ICLR 2024】- Mixture of LoRA Experts (Microsoft Research Asia, Tsinghua University)

Subject: Multiple LoRA Merging
- Problem: Combining multiple LoRA adapters into a single model is challenging. Existing methods (e.g., linear interpolation or reference tuning) either degrade the generation quality of pre-trained models or incur high training costs.
- Core Idea: Adaptively combine multiple LoRA adapters at each layer via a gating function.
- Method: MoLE treats each trained layer of a LoRA as an independent expert. It implements hierarchical weight control by embedding a learnable gating function in each layer, and dynamically learns the optimal combination weights by combining a gating balance loss with domain-specific losses.
- Results: A more flexible merging method for multiple LoRAs with negligible costs.
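The layer-wise gating idea behind MoLE can be sketched in a few lines of numpy: each layer softmax-weights the low-rank deltas of its LoRA experts and adds the mixture to the hidden state. This is a toy illustration under my own shapes and names, not the paper's implementation (which also trains the gates with a balance loss):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mole_layer(h, experts, gate_logits):
    """Toy layer-wise MoLE combination (illustrative, not the paper's code).

    h: (d,) hidden state; experts: list of (A, B) low-rank pairs with
    A of shape (r, d) and B of shape (d, r); gate_logits: one learnable
    logit per expert for this layer.
    """
    weights = softmax(gate_logits)  # layer-wise gate over LoRA experts
    delta = sum(w * (B @ (A @ h)) for w, (A, B) in zip(weights, experts))
    return h + delta

rng = np.random.default_rng(0)
d, r = 8, 2
experts = [(rng.normal(size=(r, d)), rng.normal(size=(d, r))) for _ in range(3)]
h = rng.normal(size=d)
out = mole_layer(h, experts, np.zeros(3))  # zero logits → uniform gate weights
```

Because the gate is per layer, different layers can favor different experts, which is the key difference from strategy (a)'s single global weight.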
Three LoRA composition strategies: (a) linear arithmetic, applying a single weight across all layers; (b) reference-tuning, retraining the large model with handcrafted masks that fuse multiple LoRA outputs; (c) MoLE, learning layer-wise distributions to set adaptive composition weights.\r2025 🍰【arXiv 2025】- ShareLoRA: Parameter Efficient and Robust Large Language Model Fine-tuning via Shared Low-Rank Adaptation (University of California)\nNumerous efforts aim to reduce LoRA's trainable parameters, but cutting parameters too aggressively slows convergence, and a poorly chosen reduction scheme also makes the model prone to over-fitting. Moreover, many existing PEFT methods struggle to maintain cross-domain robustness after fine-tuning.\nObservation: LoRA's A and B do not need to be uniquely configured across different layers to achieve optimal performance. Method: Share matrix A or B across all layers while keeping the corresponding modules (e.g., qkv, out_proj) distinct in each layer. A variety of sharing strategies (share A, share B, or share AB) are explored, with a key finding that such sharing does not compromise model performance.
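A minimal sketch of the ShareA strategy, under assumed toy dimensions (hidden size 768, rank 8, 12 layers — all illustrative, not taken from the paper): a single down-projection A serves every layer while each layer keeps its own B, which is where the parameter saving comes from.

```python
import numpy as np

d, r, n_layers = 768, 8, 12
rng = np.random.default_rng(0)

# Vanilla LoRA: each layer owns its own A (down-proj) and B (up-proj)
vanilla_params = n_layers * (d * r + r * d)

# ShareA: one A shared by all layers; B stays layer-specific and is
# zero-initialized, so the adapted model starts identical to the base model
shared_A = rng.normal(size=(d, r)) * 0.01
layer_B = [np.zeros((r, d)) for _ in range(n_layers)]
sharea_params = d * r + n_layers * (r * d)

def sharea_delta(x, layer_idx):
    """LoRA update of one layer under the ShareA strategy."""
    return x @ shared_A @ layer_B[layer_idx]

print(vanilla_params, sharea_params)  # sharing A roughly halves trainable parameters
```

Sharing both A and B (ShareAB) would shrink the count further at the cost of less per-layer flexibility, which is exactly the trade-off the paper explores.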
ShareLoRA: three sharing strategies (left) and ShareA applied across self-attention layers (right)\r","permalink":"https://milknocandy.github.io/posts/2026-01-16-lora/","summary":"\u003ch2 id=\"1-timeline-order\"\u003e1 Timeline Order\u003c/h2\u003e\n\u003cblockquote\u003e\n\u003cp\u003eSummarize the literature reviewed in chronological order.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003ch3 id=\"2023\"\u003e2023\u003c/h3\u003e\n\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e📝【\u003cem\u003e\u003cstrong\u003eEMNLP 2023 - Main\u003c/strong\u003e\u003c/em\u003e】- Sparse Low-rank Adaptation of Pre-trained Language Models (\u003cem\u003eTsinghua University, The University of Chicago\u003c/em\u003e)\u003c/p\u003e\n\u003cdiv class=\"highlight-box default\"\u003e\r\n    \u003cdiv class=\"box-content\"\u003e\r\n        \u003cp\u003e\u003cstrong\u003eSubject:\u003c/strong\u003e Adaptive Rank Selection\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eProblem:\u003c/strong\u003e Standard LoRA uses a fixed, inflexible rank (hyperparameter $ r\r\n $), requiring expensive manual tuning.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCore Idea:\u003c/strong\u003e Make the rank learnable rather than fixed.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eMechanism:\u003c/strong\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eGating:\u003c/strong\u003e Introduces an optimizable gating unit to the low-rank matrices.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eOptimization:\u003c/strong\u003e Uses proximal gradient methods to update the gates.\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eDynamics:\u003c/strong\u003e Prunes less important ranks during training automatically.\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eResult:\u003c/strong\u003e Eliminates discrete rank search; the model discovers its own optimal rank structure.\u003c/li\u003e\n\u003c/ul\u003e\r\n    
\u003c/div\u003e\r\n\u003c/div\u003e\n\u003cp\u003e\r\n\u003cfigure \u003e\r\n    \u003cimg src=\"1-sora.png\" alt=\"SoRA\" /\u003e\u003cfigcaption\u003e\r\n        \u003cspan class=\"auto-fig-title\"\u003eSoRA\u003c/span\u003e\r\n    \u003c/figcaption\u003e\u003c/figure\u003e\u003c/p\u003e","title":"LoRA Variants Surveys"},{"content":" SparK: Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling (ICLR 2023 Spotlight)\nPaper introduction (video): https://www.bilibili.com/video/BV11s4y1M7qL/\nThe BERT recipe masks part of the data and trains a model to predict it, which yields a self-supervised learning signal. Vision Transformer works such as MAE carry this idea over to images, but directly swapping the Transformer for a convolutional network breaks down. In the figure below, zero-outing denotes this direct replacement:\nThe improvement is only 0.1 points, i.e., essentially ineffective. The authors' analysis follows.\nWhy does it fail? Problem 1: Pixel Intensity Distribution Shift When a Transformer processes patches, randomly dropping some of them keeps the pixel distribution of the remaining patches consistent with that of the full image. A convolutional network, however, cannot drop pixels; it can only “black out” some pixels to simulate losing their information.\nPixel distributions. The horizontal axis is pixel intensity; the vertical axis is the frequency with which each intensity occurs\rProblem 2: Mask Pattern Vanishing When we convolve a zero-outed (i.e., masked) image, the masked regions gradually disappear, similar to the erosion operation in graphics.\nThe mask-vanishing problem\rProblem 3: a gap between CV and NLP in data processing The differences are:\nIn NLP, data consists of words; each word is a semantic unit with its own meaning, so the data is discrete. In CV, data is optical information from the real world captured by a camera; a single pixel carries no information by itself, and only a contiguous set of pixels can be regarded as a semantic unit, so the collected optical signal is continuous. Objects in images come in all sizes, so we need multi-scale image operations; many classic CV models process image information hierarchically at multiple scales. Hence NLP models such as BERT handle data at a single scale, while CV is inherently multi-scale; this gap cannot be ignored. Solution Use sparse convolution to solve Problems 1 and 2\nBoth problems stem from the fact that CNNs, unlike ViTs, cannot handle irregularly and randomly masked images.\nSparse convolution (an idea borrowed from 3D point clouds) can skip all “empty/masked/zero” positions, therefore:\nmasked positions do not shrink after sparse convolution, solving Problem 2 there is no need to “zero out pixels” to simulate masking, solving Problem 1 Use a hierarchical encoder-decoder The authors run BERT-style training with a multi-scale encoder-decoder, as shown below:\nNetwork architecture\rIn summary:\nthe algorithm operates at 4 scales (4x/8x/16x/32x downsampling) each sparse feature $ S_i $ is fed to the decoder to obtain $ D_i $; the sparse features are densified first, i.e., empty positions are filled with mask tokens UNet-style skip connections a 60% mask ratio Comparison with MAE and the contemporaneous ConvNeXt V2:\nA few open questions:\nA 60% mask ratio is actually quite high (the original MAE uses 75%, also very high); the authors conjecture that images contain far more redundancy than language does. When the masked positions are replaced with mask tokens, the training loss becomes NaN. This is a crucial observation: it may indicate that convolution's local information cannot supply the semantics needed for reconstruction, leaving the reconstructed tokens without a clear restoration target; unless the local receptive field happens to cover the masked object, it sees only a pile of pixel blocks, and the mask token effectively injects noise. SparK's sparse-convolution scheme avoids the influence of the masked regions altogether.
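The failure modes above (Problems 1 and 2) and the sparse-convolution fix can be reproduced in a tiny 1-D toy. This is a hand-rolled sketch, not the SparK code; the signal, mask, and 3-tap averaging kernel are all made up. A dense convolution over a zero-outed signal lets the zeros bleed into visible outputs and produces non-zero values at masked positions (the mask pattern vanishes), while a "sparse" convolution that only aggregates visible neighbours keeps the mask pattern intact.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
visible = np.array([1, 1, 0, 0, 1, 1], dtype=bool)  # random mask, SparK-style
kernel = np.array([1.0, 1.0, 1.0]) / 3              # 3-tap averaging filter

# (a) zero-outing: masked pixels set to 0, dense conv mixes them in,
#     so zeros contaminate visible outputs and masked positions turn non-zero
dense_out = np.convolve(np.where(visible, x, 0.0), kernel, mode="same")

# (b) sparse-conv sketch: compute only at visible positions, averaging only
#     visible neighbours; masked positions stay empty, the mask does not erode
sparse_out = np.zeros_like(x)
for i in np.flatnonzero(visible):
    lo, hi = max(i - 1, 0), min(i + 2, len(x))
    neighbours = [j for j in range(lo, hi) if visible[j]]
    sparse_out[i] = x[neighbours].mean()
```

Here `dense_out[2]` is non-zero even though position 2 was masked (Problem 2), and `dense_out[1]` is dragged down by a zeroed neighbour (Problem 1's distribution shift), whereas `sparse_out` keeps masked positions empty and visible positions uncontaminated.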
","permalink":"https://milknocandy.github.io/posts/2025-08-28-spark/","summary":"\u003cblockquote\u003e\n\u003cp\u003eSparK: \u003ca href=\"https://github.com/keyu-tian/SparK\"\u003eDesigning BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling\u003c/a\u003e (ICLR 2023 Spotlight)\u003c/p\u003e\n\u003cp\u003ePaper introduction (video): \u003cfont style=\"color:rgb(38, 38, 38);\"\u003e\u003c/font\u003e\u003ca href=\"https://www.bilibili.com/video/BV11s4y1M7qL/\"\u003ehttps://www.bilibili.com/video/BV11s4y1M7qL/\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eThe BERT recipe masks part of the data and trains a model to predict it, which yields a self-supervised learning signal. Vision Transformer works such as MAE carry this idea over to images, but directly swapping the Transformer for a convolutional network breaks down. In the figure below, zero-outing denotes this direct replacement:\u003c/p\u003e\n\u003c!-- This is an image; OCR content: HIERARCHY APE MASKING EPOCH METHOD STD. LOSS ACC. 83.1 -1.0 NOT PRETRAINED 0.07 SPARK(OURS) 84.1 2 0.0 MASKED ONLY 1600 SPARSE X 3 83.2 0.06 ZERO-OUTING 1600 -0.9 MASKED ONLY ZERO-OUTING --\u003e\r\n\u003cp\u003e\r\n\u003cfigure \u003e\r\n    \u003cimg src=\"fig1.png\" alt=\"\" /\u003e\u003c/figure\u003e\u003c/p\u003e\n\u003cp\u003eThe improvement is only 0.1 points, i.e., essentially ineffective. The authors' analysis follows.\u003c/p\u003e\n\u003ch2 id=\"为什么失败\"\u003eWhy does it fail?\u003c/h2\u003e\n\u003ch3 id=\"问题1pixel-intensity-distribution-shift\"\u003eProblem 1: Pixel Intensity Distribution Shift\u003c/h3\u003e\n\u003cp\u003eWhen a Transformer processes patches, randomly dropping some of them keeps the pixel distribution of the remaining patches consistent with that of the full image. A convolutional network, however, cannot drop pixels; it can only “black out” some pixels to simulate losing their information.\u003c/p\u003e\n\u003c!-- This is an image; OCR content: CNN SPARSE CNN TRANSFORMER ENCODING PROCESS: PIXEL INTENSITY DATA DISTRIBUTION MA PROBABILITY BEFORE/AFTER MASKING: (A)DIRECTLY DROPPING (C)SPARSELY DROPPING (B)ZERO-OUTING (D) RAW INPUT --\u003e\r\n\u003cp\u003e\r\n\u003cfigure \u003e\r\n    \u003cimg src=\"fig2.png\" alt=\"Pixel distributions. The horizontal axis is pixel intensity; the vertical axis is the frequency with which each intensity occurs\" /\u003e\u003cfigcaption\u003e\r\n        \u003cspan class=\"auto-fig-title\"\u003ePixel distributions. The horizontal axis is pixel intensity; the vertical axis is the frequency with which each intensity occurs\u003c/span\u003e\r\n    \u003c/figcaption\u003e\u003c/figure\u003e\u003c/p\u003e","title":"Designing BERT for Convolutional
Networks"},{"content":" Source: NIPS 2023\nPaper: http://arxiv.org/abs/2310.06907\nCode: ❌\nAuthor homepage: second author Weidi Xie, https://weidixie.github.io/\nProject page: https://kuis-ai.github.io/solv/\nIntroduction Background: Unsupervised multi-object segmentation has shown remarkable results by exploiting the strong semantic information learned during self-supervised pre-training. Segmentation of video sequences is also commonly strengthened by adding extra modalities (e.g., depth, motion). However, the performance gains observed on synthetic sequences depend on the robustness of this extra information and do not carry over to more challenging real-world scenes.\nTask: Given a video sequence of a complex scene, the goal is to train a visual system that can discover, track, and segment the objects in the image frames, abstracting millions of pixels of visual information into semantic parts. (object-centric visual representation learning)\n(a) Ground Truth\n(b) Prediction\nEvolution of the field: starting from synthetic images and moving toward in-the-wild images and real-world videos. Existing methods typically adopt an autoencoder training paradigm (e.g., reconstructing the input signal in the hope that data- or structure-based priors will group region pixels into semantically meaningful objects).\nFor images: use low-level features from pre-trained models (e.g., color, semantic features) to decide the pixel-to-object assignment For videos: usually combine extra modalities or signals (e.g., optical flow, depth maps), from whose discontinuities segmentation masks can be obtained directly Problem Statement Problems caused by extra information: using extra signals in videos increases computational overhead and error accumulation. For example, optical flow can be problematic for static or deformable objects and for large inter-frame displacements, while depth values may not be readily available for ordinary videos and their estimation degrades in low-light or low-contrast environments.\nOver-segmentation: due to the complexity of visual scenes, using a fixed number of slots can lead to an over-segmentation issue.\nSolution Authors' approach: the first fully unsupervised method for multi-object segmentation on real-world sequences. SOLV discovers multiple objects in real-world video sequences without extra modality information or anything resembling weak supervision (such as first-frame initialization).\nScheme: axial spatial-temporal slot attention\nfirst group the spatial regions within each frame then enrich the slot representations through interactions with neighboring frames Training strategy: the masked autoencoder (MAE) paradigm. Two advantages:\nacts as an information bottleneck: the model observes only part of the input, which forces it to learn high-level semantic structure. alleviates memory constraints and improves computational efficiency For the over-segmentation issue: the authors merge similar slots with a simple clustering algorithm.\nOverall, the contributions are:\nA self-supervised multi-object segmentation model for real-world videos that uses axial spatial-temporal slot attention to effectively group visual regions with similar characteristics, without extra signals. An object-centric learning scheme based on masked feature reconstruction, together with a slot-merging method. SOTA on MOVi-E and Youtube-VIS 2019, and competitive performance on DAVIS2017. A slot corresponds to an object in the video; see the figure below.\nSource from: Conditional object-centric learning from video\rRelated Work Object-centric Learning Several families of approaches exist for object-centric unsupervised representation learning on images and videos:\nContrastive methods: Object discovery and representation networks.(ECCV2022) Contrastive learning of structured world models.(ICLR2020) Groupvit: Semantic segmentation emerges from text supervision.(CVPR2022) Reconstruction-based methods (split the input into a set of region-identifying latent variables, i.e., slots, then bind them to different objects): For images: Multi-object representation learning with iterative variational inference.(ICML2019) Monet: Unsupervised scene decomposition and representation. Spatially invariant unsupervised object detection with convolutional neural networks.(AAAI2019) Generative scene graph networks.(ICLR2021) Genesis: Generative scene inference and sampling with object-centric latent representations.(ICLR2020) Space: Unsupervised object-oriented scene representation via spatial attention and decomposition.(ICLR2020) Unsupervised foreground extraction via deep region competition.(NIPS2021) Attend, infer, repeat: Fast scene understanding with generative models(NIPS2016) Tagger: Deep unsupervised perceptual grouping.(NIPS2016) Unsupervised learning of compositional energy concepts.(NIPS2021) Object-centric learning with slot attention.(NIPS2020) Illiterate dall-e learns to compose.(ICLR2022) For videos: Simone: View-invariant, temporally-abstracted object representations via unsupervised video decomposition.(NIPS2021) Faster attend-infer-repeat with tractable probabilistic models.(ICLR2019) Neural expectation maximization.(NIPS2017) Conditional object-centric learning from video.(ICLR2022) Scalor: Generative world models with scalable object representations.(ICLR2020) Entity abstraction in visual model-based reinforcement learning.(CoRL2020) Parts: Unsupervised segmentation with slots, attention and independence maximization.🤔(ICCV2021) Sequential attend, infer, repeat: Generative modelling of moving objects.(NIPS2018) These methods are validated on synthetic data and, as scene complexity grows, are hard to generalize to real-world scenarios. To address this, earlier researchers explored extra guiding information:\nBased on 3D structure: Roots: Object-centric representation and rendering of 3d scenes.(JMLR2021) Giraffe: Representing scenes as compositional generative neural feature fields.(CVPR2021) Unsupervised object-centric video generation and decomposition in 3d.(NIPS2020) Based on reconstructing other modalities: Optical flow: Conditional object-centric learning from video.(ICLR2022) Depth: Savi++: Towards end-to-end object-centric learning from real-world videos.(NIPS2022)\nAccurately identifying objects in complex visual scenes without precise guidance remains challenging. Existing works depend on initialization guided by motion segmentation masks [R1-2] or initial object positions [R3-4]. To overcome this limitation, DINOSAUR [R6] reconstructs the feature space using the inductive bias learned by an earlier pre-trained model [R5]. The authors also follow this approach to learn object-centric representations on real data, without any guided initialization or explicit supervision signal.\nObject Localization from DINO Features DINO demonstrated the remarkable self-supervised performance of ViT. Some researchers [R7] group the features extracted by DINO with traditional graph-partitioning methods such as clustering and apply them to downstream tasks. CutLER extends this approach: its MaskCut continually generates and updates pseudo-labels, which are used to train the network.\nPipeline of CutLER\rObservations from Deep Spectral Methods[R7]\rDeep Spectral Methods\rVideo Object Segmentation Video Object Segmentation (VOS) aims to identify salient objects in a video:\nin the unsupervised setting, no annotation is used; in the semi-supervised setting, only the first frame is annotated for evaluation. Inference is unsupervised, but annotated data is still used during training. Relying on labelled data during training might introduce a bias towards the labelled set of classes that is available during training.\nMotion information is commonly used in unsupervised VOS to match object regions across time, and it is especially convenient when separating multiple instances with object-centric methods. Motion Grouping learns object-centric representations by grouping patterns in the flow to segment moving objects. Recent works mainly rely on sequence models and introduce extra information. In this work, the authors learn temporal slot representations from multiple frames without any explicit motion information, which avoids the performance drop incurred when the flow cannot be estimated reliably.\n[R1]: Discovering objects that can move.(CVPR2022)\n[R2]: Object discovery from motion-guided tokens.(CVPR2023)\n[R3]: Savi++: Towards end-to-end object-centric learning from real-world videos.(NIPS2022)\n[R4]: Conditional object-centric learning from video.(ICLR2022)\n[R5]: Emerging properties in self-supervised vision transformers.(ICCV2021)\n[R6]: Bridging the gap to real-world object-centric learning.(ICLR2023)\n[R7]: Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization.(CVPR2022)\nMethod We first introduce the problem scenario, then describe the details of the proposed object-centric architecture.\nProblem Scenario Input: an RGB video clip, e.g., $ \\mathcal{V}_t=\\{v_{t-n},\\cdots,v_t, \\cdots, v_{t+n}\\}\\in\\mathbb R^{(2n+1)\\times H\\times W\\times 3} $\nGoal: train an object-centric model that processes this clip and outputs segmentation masks for all object instances, i.e., discovers and tracks the instances in the video.\nWith self-supervised learning, we formulate the problem as:\n$$ m_t=\\Phi(\\mathcal V_t;\\Theta)=\\Phi_{vis-dec}\\circ\\Phi_{st-bind}\\circ\\Phi_{vis-enc}(\\mathcal V_t)\\tag{1} $$where $ m_t\\in\\mathbb R^{K_t\\times H\\times W} $ denotes the output segmentation masks of the center frame and $ K_t $ is the number of entities considered to be objects. After each frame is segmented, Hungarian matching is used to track objects across frames. $ \\Phi(\\cdot;\\Theta) $ is the proposed segmentation model, composed of three core components:\nvisual encoder - extracts visual features frame by frame spatial-temporal axial binding - first groups pixels into slots within each frame, then links these slots across time visual decoder - decodes the spatial-temporal slots to reconstruct dense visual features, yielding the objects' segmentation masks as a by-product Architecture The proposed architecture is a Transformer variant trained like a plain masked autoencoder, i.e., reconstructing the complete signal from partial input observations. Unlike standard MAE, which reconstructs the image in pixel space, this is an information bottleneck design:\nfirst assign the spatial-temporal features to slots then reconstruct dense visual features from the latent slots As a result, each slot attaches to a semantically meaningful object, and segmentation masks are obtained through the reconstruction process (as a by-product), without relying on hand-crafted labels.\nVisual Encoder As in a standard ViT, each frame of the input RGB clip is split into non-overlapping patches, giving $ \\mathcal V_t=\\{v_{t-n},\\cdots,v_t,\\cdots,v_{t+n}\\}\\in\\mathbb R^{(2n+1)\\times N\\times(3P^2)} $, where $ N=HW/P^2 $ is the number of tokens extracted per frame with patches of size $ P $. The visual encoder consists of token drop and feature extraction.\nToken Drop: only a subset of the patches is fed to the encoder. The sampling strategy is straightforward: randomly drop a fixed ratio of the input patches per frame; $ N' $ is the number of tokens after random sampling:\n$$\r\\begin{align}\r\\mathcal V_t\u0026=\\{{v'_{t-n},\\cdots,v'_{t+n}}\\}\r\\\\\u0026=\\{\\mathrm{drop}(v_{t-n}),\\cdots,\\mathrm{drop}(v_{t+n})\\}\\in\\mathbb R^{(2n+1)\\times N'\\times(3P^2)}, \\quad N'\\lt N\r\\end{align}\r\\tag{2}\r$$Feature Extraction: the authors initialize with the pre-trained DINOv2 ViT weights and freeze the parameters:\n$$ \\mathcal F=\\{\\bold f_{t-n}, \\cdots, \\bold f_{t+n}\\}=\\bigg\\{\\phi_{\\mathrm{DINO}}(v'_{t-n}),\\cdots,\\phi_{\\mathrm{DINO}}(v'_{t+n})\\bigg\\}\\in\\mathbb R^{(2n+1)\\times N'\\times D}\\tag{3} $$where $ D $ is the dimension of the features output by the last DINOv2 block, before the final Layer Normalization. There are two reasons for this design:\nmasked autoencoding commonly serves as a pretext task for self-supervised learning in NLP and CV; the token drop forces the model to acquire high-quality visual representations for video data, the extra temporal axis introduces orders of magnitude more data; processing the sparsely sampled visual tokens greatly reduces the memory budget, which the authors verify later Spatial-temporal Binding After extracting per-frame visual features, image regions are first grouped spatially into slots, each designating a semantic object (the process of discovering objects within a single image); then a Transformer establishes temporal binding across slots, i.e., associates the objects across the clip.\n$$\r\\Phi_{\\mathrm{st-bind}}(\\mathcal F)=\\psi_{\\mathrm{t-bind}}\\big(\\psi_{\\mathrm{s-bind}}(\\bold f_{t-n}),\\cdots,\\psi_{\\mathrm{s-bind}}(\\bold f_{t+n})\\big)\\in\\mathbb R^{K\\times D_{\\rm slot}}\\tag{4}\r$$Spatial Binding ($ \\psi_{\\bold{s-bind}} $): the spatial binding step operates on each frame independently. The authors use the invariant slot attention proposed by Biza et al., with one difference: a shared initialization $ \\mathcal {Z_T} $ is used at every time step $ \\mathcal T\\in\\{t-n,\\cdots,t+n\\} $.\nConcretely, given the token-dropped features of time step $ \\mathcal T $ as input, we learn a set of initial vectors: $ K $ slot vectors $ \\bold z^j\\in\\mathbb R^{D_{\\rm slot}} $, $ K $ scale vectors $ \\bold S^j_s\\in\\mathbb R^2 $, $ K $ position vectors $ \\bold S^j_p\\in\\mathbb R^2 $, and an absolute positional embedding grid $ \\bold G_{abs}\\in\\mathbb R^{N\\times 2} $ used to shift and scale the input positional encoding of each slot.\nThe patches corresponding to the dropped tokens are masked out, giving for each frame $ \\tau $ an absolute positional embedding $ \\bold G_{abs,\\tau}=\\mathrm{drop}(\\bold G_{abs})\\in\\mathbb R^{N'\\times 2} $. For each frame we then obtain the following set of learnable vectors:\n$$\r\\mathcal Z_\\tau=\\bigg\\{(\\bold z^j,\\bold S^j_s,\\bold S^j_p,\\bold G_{abs,\\tau})\\bigg\\}^K_{j=1}\\tag{5}\r$$These learnable parameters are shared across all frames and updated from the dense visual features of the corresponding frame. In other words, the slots of different frames start from the same representation and diverge after binding through interaction with within-frame features. Essentially, consecutive frames usually share similar visual context, so the learned slots naturally encourage temporal binding: slots with the same index bind to the same object region across frames.\nSee the original paper for the details of invariant slot attention; roughly, it exploits positional encodings to learn relations.\nTemporal Binding ($ \\psi_{\\bold{t-bind}} $): so far the model can discover objects using information from a single frame only. The goal here is to enrich the slot representations with temporal context. Given the output slots of the spatial binding module $ \\bigg\\{\\{\\bold{z}^j_{t-n}\\}^K_{j=1},…,\\{\\bold{z}^j_{t+n}\\}^K_{j=1}\\bigg\\}\\in\\mathbb R^{(2n+1)\\times K\\times D_{\\rm slot}} $, the authors apply a Transformer encoder to the output slots that share the same index across different frames; the self-attention here learns a $ (2n+1)\\times(2n+1) $ affinity matrix over the $ (2n+1) $ time steps. Through self-attention, the Transformer can attend to each slot's representation at past, present, and future time steps, producing more robust representations. To distinguish time steps, learnable temporal positional encodings are added to the slots; slots within the same frame share the same encoding. The temporal transformer finally yields the updated slots $ \\bold c $ at the target time step $ t $: $$ \\bold c=\\Phi_{\\mathrm{st-bind}}(\\mathcal F)\\in\\mathbb R^{K\\times D_{\\rm slot}}\\tag{6} $$Visual Decoder The preceding spatial-temporal binding yields a set of slot vectors $ \\bold c\\in\\mathbb R^{K\\times D_{\\rm slot}} $ at time step $ t $. However, in real videos the number of objects within a frame can vary greatly, so a fixed number of slots may cause over-clustering. To overcome this, the authors propose a simple slot-merging solution based on the Agglomerative Clustering algorithm (hierarchical clustering). There is additionally a slot decoding step that reconstructs the video features, similar to MAE in feature space.\n$$ \\Phi_{\\mathrm{vis-dec}}(\\bold c)=\\psi_{\\rm dec}\\circ\\psi_{\\rm merge}(\\bold c)\\tag{7} $$\rOverall SOLV architecture and workflow\rSlot Merging ($ \\psi_{\\bold{merge}} $): under the self-supervised setting, object segmentation is usually an ill-defined problem, because a visual region admits multiple interpretations. For a person in an image, one common choice is to group all the pixels the person occupies together; another is to decompose the “person” into parts such as face, arms, torso, and legs. Empirically, however, pixel embeddings from the same object should be closer to each other than to those from different objects. The authors merge slots with Agglomerative Clustering (a hierarchical clustering algorithm) to handle this. As shown above, an affinity matrix over all slots is first computed from cosine similarity; this matrix is used to cluster the slots into groups, and a mean slot is then computed for each cluster: $$ \\bold c'=\\psi_{\\rm merge}(\\bold c)\\in\\mathbb R^{K_t\\times D_{\\rm slot}}, \\qquad K_t\\le K\\tag{8} $$\rBy merging semantically similar slots that correspond to the same object, the optimal number of slots can be determined dynamically (🤔not good). This slot-merging step is not post-processing but an essential part of training: the features obtained by such clustering are more distinctive and help the network learn object representations. Below, the left figure is without slot merging and the right figure visualizes the result with it:\nDecoder ($ \\psi_{\\bold{dec}} $): a decoder maps the merged slots $ \\bold c' $ to the corresponding segmentation masks $ \\bold m $ and the complete reconstructed signal $ \\bold y $:\n$$ \\bold y, \\bold m = \\psi_{\\rm dec}(\\bold c'),\\quad \\bold y\\in\\mathbb R^{N\\times D},\\bold m \\in \\mathbb R^{K_t\\times N}\\tag{9} $$The masks $ \\bold m $ are reshaped and upsampled to recover the original input size, giving the final segmentation output. Similar to the MLP decoder in the DINOSAUR paper, the authors use a spatial broadcast decoder that reconstructs a complete feature map $ \\hat{\\bold y}\\in\\mathbb R^{N\\times D} $ for every slot $ j $, together with its alpha weights $ \\alpha^j\\in\\mathbb R^N $. A softmax over these weights gives the final segmentation masks. Learned positional encodings are added at the decoding stage. Each slot $ \\bold c'^j $ is broadcast to the shape of the input feature map and decoded by a stack of linear layers $ \\psi_{\\rm mapper} $ whose weights are shared across all slots. The final reconstructed features are the weighted sum of the decoded slots:\n$$\r\\bold y=\\sum_{j=1}^{K_t}\\hat{\\bold y}^j\\odot\\bold m^j,\\qquad \\bold m^j=\\mathrm{softmax}(\\boldsymbol\\alpha^j),\\\\ \\boldsymbol\\alpha^j,\\hat{\\bold{y}}^j=\\psi_{\\rm mapper}\\bigg(\\mathrm{broadcast}\\Big(\\bold c'^j\\Big)\\bigg)\r\\tag{10}\r$$During training, the model is optimized by minimizing the difference between the reconstructed tokens $ \\bold y $ and the tokens of the feature map obtained by encoding the unmasked frame at time step $ t $ with DINO: $ \\mathcal L=\\| \\phi_{\\mathrm{DINO}}(\\bold v_t)-\\bold y\\|^2 $\nConclusion The paper lays out a broad framework. Much recent research follows this pattern: redefine a task, then integrate existing algorithms and data to solve the new task. The name is catchy, but the actual solution does not break out of the self-supervised paradigm; the pretext task is still the earlier one.\n","permalink":"https://milknocandy.github.io/posts/2023-12-10-cutler/","summary":"\u003cblockquote\u003e\n\u003cp\u003eSource: \u003ca href=\"https://openreview.net/group?id=NeurIPS.cc/2023/Conference\"\u003eNIPS 2023\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003ePaper: \u003ca href=\"http://arxiv.org/abs/2310.06907\"\u003ehttp://arxiv.org/abs/2310.06907\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eCode: ❌\u003c/p\u003e\n\u003cp\u003eAuthor homepage: second author Weidi Xie, \u003ca href=\"https://weidixie.github.io/\"\u003ehttps://weidixie.github.io/\u003c/a\u003e\u003c/p\u003e\n\u003cp\u003eProject page: \u003ca href=\"https://kuis-ai.github.io/solv/\"\u003ehttps://kuis-ai.github.io/solv/\u003c/a\u003e\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003ch2 id=\"介绍\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003e\u003cstrong\u003eBackground\u003c/strong\u003e: \u003cu\u003eUnsupervised multi-object segmentation\u003c/u\u003e has shown remarkable results by exploiting the strong semantic information learned during self-supervised pre-training. Segmentation of video sequences is also commonly strengthened by adding extra modalities (e.g., depth, motion). However, the performance gains observed on synthetic sequences \u003cu\u003edepend\u003c/u\u003e on the robustness of this extra information and do not carry over to more challenging real-world scenes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTask\u003c/strong\u003e: Given a video sequence of a complex scene, the goal is to train a visual system that can \u003cu\u003ediscover, track, and segment\u003c/u\u003e the objects in the image frames, abstracting millions of pixels of visual information into \u003ci\u003esemantic parts\u003c/i\u003e. (object-centric visual representation learning)\u003c/p\u003e\n\u003cfigure class=\"main-figure\"\u003e\r\n  \u003cdiv class=\"side-by-side-wrapper grid-layout\"\u003e\r\n    \u003cdiv class=\"side-item\" style=\"--w: 45%\"\u003e\r\n      \u003cimg src=\"1.gif\"\u003e\r\n      \u003cp\u003e(a) Ground Truth\u003c/p\u003e\r\n    \u003c/div\u003e\r\n    \u003cdiv class=\"side-item\" style=\"--w: 45%\"\u003e\r\n      \u003cimg src=\"1-2.gif\"\u003e\r\n      \u003cp\u003e(b) Prediction\u003c/p\u003e\r\n    \u003c/div\u003e\r\n  \u003c/div\u003e\r\n  \u003c!-- \u003cfigcaption\u003e\r\n    \u003cspan class=\"auto-fig-title\"\u003e非对称比例对比\u003c/span\u003e\r\n  \u003c/figcaption\u003e --\u003e\r\n\u003c/figure\u003e\r\n\u003cp\u003e\u003cstrong\u003eEvolution of the field\u003c/strong\u003e: starting from \u003ci\u003esynthetic images\u003c/i\u003e and moving toward \u003cu\u003ein-the-wild\u003c/u\u003e images and \u003cu\u003ereal-world\u003c/u\u003e videos. Existing methods typically adopt an autoencoder training paradigm (e.g., reconstructing the input signal in the hope that data- or structure-based priors will group \u003cu\u003eregion pixels\u003c/u\u003e into semantically meaningful objects).\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eFor images: use \u003cu\u003elow-level features\u003c/u\u003e from pre-trained models (e.g., color, semantic features) to decide the pixel-to-object assignment\u003c/li\u003e\n\u003cli\u003eFor videos: usually combine extra modalities or signals (e.g., optical flow, depth maps), from whose discontinuities segmentation masks can be obtained directly\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch3 id=\"提出问题\"\u003eProblem Statement\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eProblems caused by extra information\u003c/strong\u003e: using extra signals in videos increases \u003cstrong\u003ecomputational overhead\u003c/strong\u003e and \u003cstrong\u003eerror accumulation\u003c/strong\u003e. For example, optical flow can be problematic for \u003cu\u003estatic or deformable\u003c/u\u003e objects and for \u003cu\u003elarge inter-frame displacements\u003c/u\u003e, while depth values may not be readily available for ordinary videos and their estimation degrades in \u003cu\u003elow-light\u003c/u\u003e or \u003cu\u003elow-contrast\u003c/u\u003e environments.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eOver-segmentation\u003c/strong\u003e: due to the complexity of visual scenes, using a fixed number of \u003cu\u003eslots\u003c/u\u003e can lead to an over-segmentation issue.\u003c/p\u003e\n\u003ch3 id=\"解决问题\"\u003eSolution\u003c/h3\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors' approach\u003c/strong\u003e: the \u003cstrong\u003efirst\u003c/strong\u003e fully unsupervised method for \u003cu\u003emulti-object segmentation on real-world sequences\u003c/u\u003e. SOLV discovers multiple objects in real-world video sequences without extra modality information or anything resembling weak supervision (such as first-frame initialization).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eScheme\u003c/strong\u003e: axial spatial-temporal slot attention\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003efirst group the spatial regions within each frame\u003c/li\u003e\n\u003cli\u003ethen enrich the slot representations through interactions with neighboring frames\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eTraining strategy\u003c/strong\u003e: the masked autoencoder (MAE) paradigm. Two advantages:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eacts as an information bottleneck: the model observes only part of the input, which forces it to learn high-level semantic structure.\u003c/li\u003e\n\u003cli\u003ealleviates memory constraints and improves computational efficiency\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eFor the \u003cstrong\u003eover-segmentation\u003c/strong\u003e issue: the authors merge similar slots with a simple clustering algorithm.\u003c/p\u003e\n\u003cp\u003eOverall, the contributions are:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003eA self-supervised multi-object segmentation model for real-world videos that uses axial spatial-temporal slot attention to effectively group visual regions with similar characteristics, without \u003cu\u003eextra signals\u003c/u\u003e.\u003c/li\u003e\n\u003cli\u003eAn object-centric learning scheme based on masked feature reconstruction, together with a slot-merging method.\u003c/li\u003e\n\u003cli\u003eSOTA on MOVi-E and Youtube-VIS 2019, and competitive performance on DAVIS2017.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cblockquote\u003e\n\u003cp\u003eA slot corresponds to an object in the video; see the figure below.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003e\r\n\u003cfigure \u003e\r\n    \u003cimg src=\"2.png\" alt=\"Source from: Conditional object-centric learning from video\" /\u003e\u003cfigcaption\u003e\r\n        \u003cspan class=\"auto-fig-title\"\u003eSource from: Conditional object-centric learning from video\u003c/span\u003e\r\n    \u003c/figcaption\u003e\u003c/figure\u003e\u003c/p\u003e\n\u003ch3 id=\"相关工作\"\u003eRelated Work\u003c/h3\u003e\n\u003ch4 id=\"object-centric-learning\"\u003eObject-centric Learning\u003c/h4\u003e\n\u003cp\u003eSeveral families of approaches exist for object-centric unsupervised representation learning on images and videos:\u003c/p\u003e","title":"Self-supervised Object-Centric Learning for Videos"}]
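The Slot Merging step described above can be approximated in a few lines. This is a simplified stand-in, not the SOLV code: a greedy cosine-similarity threshold replaces the actual agglomerative clustering, and the threshold 0.9 and the toy slot vectors are illustrative.

```python
import numpy as np

def merge_slots(slots, thresh=0.9):
    """Greedy sketch of SOLV-style slot merging: slots whose cosine
    similarity exceeds `thresh` are grouped and replaced by their mean
    (a stand-in for Agglomerative Clustering on the affinity matrix)."""
    normed = slots / np.linalg.norm(slots, axis=1, keepdims=True)
    affinity = normed @ normed.T          # cosine-similarity affinity matrix
    labels = -np.ones(len(slots), dtype=int)
    k = 0
    for i in range(len(slots)):
        if labels[i] == -1:               # start a new cluster at slot i
            labels[i] = k
            labels[(affinity[i] > thresh) & (labels == -1)] = k
            k += 1
    # one mean slot per cluster: K_t <= K is now determined dynamically
    merged = np.stack([slots[labels == c].mean(axis=0) for c in range(k)])
    return merged, labels

slots = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])  # two near-duplicates + one distinct
merged, labels = merge_slots(slots)
```

The two nearly identical slots collapse into one mean slot while the orthogonal slot survives, which mirrors how the merging step lets the effective slot count adapt to the number of objects in the frame.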