2023-08-09 04:11:01

3D Motion Magnification: Visualizing Subtle Motions with Time Varying Radiance Fields

2023-08-07 17:59:59+00:00 cs.CV

Motion magnification helps us visualize subtle, imperceptible motion.
However, prior methods only work for 2D videos captured with a fixed camera. We
present a 3D motion magnification method that can magnify subtle motions from
scenes captured by a moving camera, while supporting novel view rendering. We
represent the scene with time-varying radiance fields and leverage the Eulerian
principle for motion magnification to extract and amplify the variation of the
embedding of a fixed point over time. We study and validate our proposed
principle for 3D motion magnification using both implicit and tri-plane-based
radiance fields as our underlying 3D scene representation. We evaluate the
effectiveness of our method on both synthetic and real-world scenes captured
under various camera setups.
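
As a rough illustration of the Eulerian principle described above, the sketch below amplifies the temporal variation of a signal observed at one fixed point around its temporal mean. The function name `magnify` and the gain `alpha` are hypothetical, and a plain mean stands in for any temporal filtering; the paper applies the idea to point embeddings in a radiance field, not to scalar samples.

```python
def magnify(signal, alpha):
    """Amplify deviations of a fixed-point signal from its temporal mean.

    signal: list of floats, one value per time step (a stand-in for one
            embedding dimension at a fixed 3D point; an assumption).
    alpha:  hypothetical magnification factor; alpha = 1 leaves the
            signal unchanged up to floating-point rounding.
    """
    mean = sum(signal) / len(signal)
    # Keep the static component, scale only the variation over time.
    return [mean + alpha * (value - mean) for value in signal]


# A barely visible oscillation around 1.0, magnified tenfold.
subtle = [1.00, 1.01, 1.00, 0.99]
magnified = magnify(subtle, alpha=10.0)
```

The position being sampled never moves; only the change in its value over time is scaled, which is what distinguishes an Eulerian formulation from one that tracks moving points.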

FSD V2: Improving Fully Sparse 3D Object Detection with Virtual Voxels

2023-08-07 17:59:48+00:00 cs.CV

LiDAR-based fully sparse architecture has garnered increasing attention.
FSDv1 stands out as a representative work, achieving impressive efficacy and
efficiency, albeit with intricate structures and handcrafted designs. In this
paper, we present FSDv2, an evolution that aims to simplify the previous FSDv1
while eliminating the inductive bias introduced by its handcrafted
instance-level representation, thus promoting better general applicability. To
this end, we introduce the concept of virtual voxels, which takes over
the clustering-based instance segmentation in FSDv1. Virtual voxels not only
address the notorious issue of the Center Feature Missing problem in fully
sparse detectors but also endow the framework with a more elegant and
streamlined approach. Consequently, we develop a suite of components to
complement the virtual voxel concept, including a virtual voxel encoder, a
virtual voxel mixer, and a virtual voxel assignment strategy. Through empirical
validation, we demonstrate that the virtual voxel mechanism is functionally
similar to the handcrafted clustering in FSDv1 while being more general. We
conduct experiments on three large-scale datasets: Waymo Open Dataset,
Argoverse 2 dataset, and nuScenes dataset. Our results showcase
state-of-the-art performance on all three datasets, highlighting the
superiority of FSDv2 in long-range scenarios and its general applicability to
achieve competitive performance across diverse scenarios. Moreover, we provide
comprehensive experimental analysis to elucidate the workings of FSDv2. To
foster reproducibility and further research, we have open-sourced FSDv2 at

Mask Frozen-DETR: High Quality Instance Segmentation with One GPU

2023-08-07 17:53:21+00:00 cs.CV

In this paper, we aim to study how to build a strong instance segmenter with
minimal training time and GPUs, as opposed to the majority of current
approaches that pursue more accurate instance segmenters by building more
advanced frameworks at the cost of longer training time and higher GPU
requirements. To achieve this, we introduce a simple and general framework,
termed Mask Frozen-DETR, which can convert any existing DETR-based object
detection model into a powerful instance segmentation model. Our method only
requires training an additional lightweight mask network that predicts instance
masks within the bounding boxes given by a frozen DETR-based object detector.
Remarkably, our method outperforms the state-of-the-art instance segmentation
method Mask DINO on the COCO test-dev split (55.3% vs. 54.7%) while being over
10x faster to train. Furthermore, all of our experiments can be run using only
one Tesla V100 GPU with 16 GB of memory,
demonstrating the significant efficiency of our proposed framework.

The Copycat Perceptron: Smashing Barriers Through Collective Learning

2023-08-07 17:51:09+00:00 cond-mat.dis-nn

We characterize the equilibrium properties of a model of $y$ coupled binary
perceptrons in the teacher-student scenario, subject to a suitable learning
rule, with an explicit ferromagnetic coupling proportional to the Hamming
distance between the students' weights. In contrast to recent works, we analyze
a more general setting in which a thermal noise is present that affects the
generalization performance of each student. Specifically, in the presence of a
nonzero temperature, which assigns nonzero probability to configurations that
misclassify samples with respect to the teacher's prescription, we find that
the coupling of replicas leads to a shift of the phase diagram to smaller
values of $\alpha$: This suggests that the free energy landscape gets smoother
around the solution with good generalization (i.e., the teacher) at a fixed
fraction of reviewed examples, which allows local update algorithms such as
Simulated Annealing to reach the solution before the dynamics gets frozen.
Finally, from a learning perspective, these results suggest that more students
(in this case, with the same amount of data) are able to learn the same rule
when coupled together with a smaller amount of data.
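
To make the setup concrete, here is a minimal sketch of an energy function for coupled binary student perceptrons: each student pays one unit per training sample it classifies differently from the teacher, plus a coupling term proportional to the Hamming distance between each pair of students. The names `coupled_energy` and `gamma`, the tie-breaking convention in `sign`, and the unnormalized form of both terms are assumptions for illustration, not the paper's exact Hamiltonian.

```python
from itertools import combinations


def sign(x):
    # Convention (an assumption): a zero pre-activation is classified as +1.
    return 1 if x >= 0 else -1


def predict(weights, sample):
    # Binary perceptron output for weights and inputs in {-1, +1}.
    return sign(sum(w * s for w, s in zip(weights, sample)))


def hamming(a, b):
    # Number of positions where two weight vectors differ.
    return sum(1 for x, y in zip(a, b) if x != y)


def coupled_energy(students, teacher, samples, gamma):
    # Error term: disagreements with the teacher, summed over all students.
    errors = sum(
        1
        for w in students
        for s in samples
        if predict(w, s) != predict(teacher, s)
    )
    # Coupling term: pairwise Hamming distances, scaled by gamma (hypothetical).
    coupling = sum(hamming(a, b) for a, b in combinations(students, 2))
    return errors + gamma * coupling


teacher = [1, -1, 1]
students = [[1, -1, 1], [1, 1, 1]]
samples = [[1, 1, 1], [1, -1, -1], [-1, 1, 1]]
energy = coupled_energy(students, teacher, samples, gamma=0.5)
```

At nonzero temperature, configurations with misclassified samples keep a nonzero Boltzmann weight under such an energy, which matches the thermal-noise setting the abstract describes.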

A Cost Analysis of Generative Language Models and Influence Operations

2023-08-07 17:38:41+00:00 cs.CY

Despite speculation that recent large language models (LLMs) are likely to be
used maliciously to improve the quality or scale of influence operations,
uncertainty persists regarding the economic value that LLMs offer
propagandists. This research constructs a model of costs facing propagandists
for content generation at scale and analyzes (1) the potential savings that
LLMs could offer propagandists, (2) the potential deterrent effect of
monitoring controls on API-accessible LLMs, and (3) the optimal strategy for
propagandists choosing between multiple private and/or open source LLMs when
conducting influence operations. Primary results suggest that LLMs need only
produce usable outputs with relatively low reliability (roughly 25%) to offer
cost savings to propagandists, that the potential reduction in content
generation costs can be quite high (up to 70% for a highly reliable model), and
that monitoring capabilities have sharply limited cost imposition effects when
alternative open source models are available. In addition, these results
suggest that nation-states -- even those conducting many large-scale influence
operations per year -- are unlikely to benefit economically from training
custom LLMs specifically for use in influence operations.

Randomized algorithms for precise measurement of differentially-private, personalized recommendations

2023-08-07 17:34:58+00:00 cs.CR

Personalized recommendations form an important part of today's internet
ecosystem, helping artists and creators to reach interested users, and helping
users to discover new and engaging content. However, many users today are
skeptical of platforms that personalize recommendations, in part due to
historically careless treatment of personal data and data privacy. Now,
businesses that rely on personalized recommendations are entering a new
paradigm, where many of their systems must be overhauled to be privacy-first.
In this article, we propose an algorithm for personalized recommendations that
facilitates both precise and differentially-private measurement. We consider
advertising as an example application, and conduct offline experiments to
quantify how the proposed privacy-preserving algorithm affects key metrics
related to user experience, advertiser value, and platform revenue compared to
the extremes of both (private) non-personalized and non-private, personalized

SurvBeX: An explanation method of the machine learning survival models based on the Beran estimator

2023-08-07 17:18:37+00:00 cs.LG

An explanation method called SurvBeX is proposed to interpret predictions of
the machine learning survival black-box models. The main idea behind the method
is to use the modified Beran estimator as the surrogate explanation model.
Coefficients incorporated into the Beran estimator can be regarded as values of
the feature impacts on the black-box model prediction. Following the well-known
LIME method, many points are generated in a local area around an example of
interest. For every generated example, the survival function of the black-box
model is computed, and the survival function of the surrogate model (the Beran
estimator) is constructed as a function of the explanation coefficients. In
order to find the explanation coefficients, it is proposed to minimize the mean
distance between the survival functions of the black-box model and the Beran
estimator produced by the generated examples. Numerical experiments with
synthetic and real survival data demonstrate the efficiency of SurvBeX and
compare it with the well-known SurvLIME method as well as with SurvSHAP.
The code implementing SurvBeX is available at:

Tiny LVLM-eHub: Early Multimodal Experiments with Bard

2023-08-07 17:17:05+00:00 cs.CV

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated
significant progress in tackling complex multimodal tasks. Among these
cutting-edge developments, Google's Bard stands out for its remarkable
multimodal capabilities, promoting comprehensive understanding and reasoning
across various domains. This work presents an early and holistic evaluation of
LVLMs' multimodal abilities, with a particular focus on Bard, by proposing a
lightweight variant of LVLM-eHub, named Tiny LVLM-eHub. In comparison to the
vanilla version, Tiny LVLM-eHub possesses several appealing properties.
Firstly, it provides a systematic assessment of six categories of multimodal
capabilities, including visual perception, visual knowledge acquisition, visual
reasoning, visual commonsense, object hallucination, and embodied intelligence,
through quantitative evaluation of 42 standard text-related visual
benchmarks. Secondly, it conducts an in-depth analysis of LVLMs' predictions
using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and
accurate evaluation and exhibits improved alignment with human evaluation
compared to the word matching approach. Thirdly, it comprises a mere 2.1K
image-text pairs, facilitating ease of use for practitioners to evaluate their
own offline LVLMs. Through extensive experimental analysis, this study
demonstrates that Bard outperforms previous LVLMs in most multimodal
capabilities except object hallucination, to which Bard is still susceptible.
Tiny LVLM-eHub serves as a baseline evaluation for various LVLMs and encourages
innovative strategies aimed at advancing multimodal techniques. Our project is
publicly available at https://github.com/OpenGVLab/Multi-Modality-Arena.

AdaptiveSAM: Towards Efficient Tuning of SAM for Surgical Scene Segmentation

2023-08-07 17:12:54+00:00 cs.CV

Segmentation is a fundamental problem in surgical scene analysis using
artificial intelligence. However, the inherent data scarcity in this domain
makes it challenging to adapt traditional segmentation techniques for this
task. To tackle this issue, current research employs pretrained models and
finetunes them on the given data. Even so, this approach requires training deep
networks with millions of parameters every time new data becomes available. A
recently published foundation model, Segment-Anything (SAM), generalizes well
to a large variety of natural images, hence tackling this challenge to a reasonable
extent. However, SAM does not generalize well to the medical domain as is,
without a large amount of compute resources for fine-tuning and the use of
task-specific prompts. Moreover, these prompts are in the form of
bounding-boxes or foreground/background points that need to be annotated
explicitly for every image, making this solution increasingly tedious as the
dataset grows. In this work, we propose AdaptiveSAM - an adaptive
modification of SAM that can adjust to new datasets quickly and efficiently,
while enabling text-prompted segmentation. For finetuning AdaptiveSAM, we
propose an approach called bias-tuning that requires a significantly smaller
number of trainable parameters than SAM (less than 2%). At the same time,
AdaptiveSAM requires negligible expert intervention since it uses free-form
text as prompt and can segment the object of interest with just the label name
as prompt. Our experiments show that AdaptiveSAM outperforms current
state-of-the-art methods on various medical imaging datasets including surgery,
ultrasound and X-ray. Code is available at

Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation

2023-08-07 17:07:48+00:00 cs.CV

Temporal Sentence Grounding in Videos (TSGV) aims to detect the event
timestamps described by the natural language query from untrimmed videos. This
paper discusses the challenge of achieving efficient computation in TSGV models
while maintaining high performance. Most existing approaches design elaborate,
complex architectures that improve accuracy with extra layers and losses, and
they suffer from inefficiency and heaviness as a result. Although some works
have noticed this, they only address the feature fusion layers, which brings
little speed benefit to the network as a whole. To tackle this problem, we propose
a novel efficient multi-teacher model (EMTM) based on knowledge distillation to
transfer diverse knowledge from both heterogeneous and isomorphic networks.
Specifically, we first unify different outputs of the heterogeneous models into
one single form. Next, a Knowledge Aggregation Unit (KAU) is built to acquire
high-quality integrated soft labels from multiple teachers. After that, the KAU
module leverages the multi-scale video and global query information to
adaptively determine the weights of different teachers. A Shared Encoder
strategy is then proposed to solve the problem that the student's shallow layers
hardly benefit from the teachers, in which an isomorphic teacher is collaboratively
trained with the student to align their hidden states. Extensive experimental
results on three popular TSGV benchmarks demonstrate that our method is both
effective and efficient without bells and whistles.
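
The aggregation step can be pictured as a weighted average of teacher distributions followed by a divergence loss on the student. The sketch below is an illustration under stated assumptions: `aggregate_soft_labels` and `kl_divergence` are hypothetical names, the teacher scores are supplied by hand rather than predicted by the KAU from video and query features, and the short distributions stand in for per-timestep boundary probabilities.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_soft_labels(teacher_probs, teacher_scores):
    # Turn per-teacher scores into weights that sum to 1, then average
    # the teacher distributions position by position.
    weights = softmax(teacher_scores)
    length = len(teacher_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, teacher_probs))
        for i in range(length)
    ]


def kl_divergence(target, predicted, eps=1e-12):
    # KL(target || predicted), a common distillation loss (an assumption here).
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


# Two teachers with equal scores contribute equally to the soft label.
teachers = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
soft_label = aggregate_soft_labels(teachers, teacher_scores=[0.0, 0.0])
student = [1 / 3, 1 / 3, 1 / 3]
loss = kl_divergence(soft_label, student)
```

Because the weights come from a softmax, the aggregated label remains a valid probability distribution whatever scores the teachers receive.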