My research lies at the intersection of
machine learning, computer vision and computer graphics.
|
3D Gaussian Splatting as Markov Chain Monte Carlo
Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Weiwei Sun, Jeff Tseng,
Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi
NeurIPS 2024, Webpage
Abstract
While 3D Gaussian Splatting has recently become popular for
neural rendering, current methods rely on carefully engineered cloning and
splitting strategies for placing Gaussians, which do not always generalize and
may lead to poor-quality renderings. In addition, for real-world scenes, they
rely on a good initial point cloud to perform well. In this work, we rethink 3D
Gaussians as random samples drawn from an underlying probability distribution
describing the physical representation of the scene -- in other words, Markov
Chain Monte Carlo (MCMC) samples. Under this view, we show that the 3D Gaussian
updates are strikingly similar to a Stochastic Gradient Langevin Dynamics (SGLD)
update. As in MCMC, samples are nothing but past visit locations; adding new
Gaussians under our framework can therefore be realized, without heuristics, by
placing Gaussians at existing Gaussian locations. To encourage using fewer
Gaussians for efficiency, we introduce an L1-regularizer on the Gaussians. On
various standard evaluation scenes, we show that our method provides improved
rendering quality, easy control over the number of Gaussians, and robustness to
initialization.
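To make the MCMC analogy concrete, here is a minimal sketch (not the authors' released code) of an SGLD-style parameter update with an L1 penalty on opacities, plus heuristic-free densification that respawns low-opacity Gaussians at the locations of existing ones. The parameter dictionary, thresholds, and hyperparameters are illustrative assumptions.
import torch

def sgld_step(params, render_loss, lr=1e-3, noise_scale=1e-4, l1_weight=1e-2):
    """One SGLD-style update: a gradient step plus injected Gaussian noise,
    with an L1 regularizer on opacity to encourage fewer Gaussians."""
    total = render_loss + l1_weight * params["opacity"].abs().sum()
    grads = torch.autograd.grad(total, list(params.values()))
    with torch.no_grad():
        for p, g in zip(params.values(), grads):
            p -= lr * g                               # deterministic descent term
            p += noise_scale * torch.randn_like(p)    # stochastic Langevin term

def respawn_dead(params, opacity_thresh=0.005):
    """Place 'dead' Gaussians at the positions of live ones, mimicking
    MCMC re-sampling of past visit locations (no clone/split heuristics)."""
    with torch.no_grad():
        opacity = params["opacity"].squeeze()
        dead = opacity < opacity_thresh
        if dead.any():
            live = torch.multinomial(opacity.clamp_min(0.0),
                                     int(dead.sum()), replacement=True)
            params["mean"][dead] = params["mean"][live]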
|
|
Volumetric Rendering with Baked Quadrature Fields
Gopal Sharma, Daniel Rebain, Andrea Tagliasacchi, Kwang Moo Yi.
ECCV 2024, Webpage
Abstract
We propose a novel Neural Radiance Field (NeRF) representation
for non-opaque scenes that allows fast inference by utilizing textured polygons.
Despite the high-quality novel view rendering that NeRF provides, a critical
limitation is that it relies on volume rendering that can be computationally
expensive and does not utilize the advancements in modern graphics hardware.
Existing methods for this problem fall short when it comes to modelling
volumetric effects as they rely purely on surface rendering. We thus propose to
model the scene with polygons, which can then be used to obtain the quadrature
points required to model volumetric effects, and also their opacity and colour
from the texture. To obtain such a polygonal mesh, we train a specialized field
whose zero-crossings would correspond to the quadrature points when volume
rendering, and perform marching cubes on this field. We then rasterize the
polygons and utilize the fragment shaders to obtain the final colour image. Our
method allows rendering on various devices and easy integration with existing
graphics frameworks while keeping the benefits of volume rendering alive.
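As a rough illustration of the extraction step, the sketch below samples a hypothetical trained field f on a dense grid and runs marching cubes at the zero level set; f, the grid bounds, and the resolution are assumptions, and the full pipeline additionally bakes opacity and colour into textures for the fragment shader.
import numpy as np
from skimage import measure

def extract_quadrature_mesh(f, resolution=256, bound=1.0):
    """Extract the zero-crossing surface of field f as a triangle mesh."""
    xs = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
    values = f(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(values, level=0.0)
    # Map vertices from voxel-index coordinates back to world space.
    verts = verts / (resolution - 1) * (2 * bound) - bound
    return verts, faces, normals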
|
|
Unsupervised Keypoints from Pretrained Diffusion Models
Eric Hedlin, Gopal Sharma, Shweta Mahajan, Xingzhe He, Hossam Isack,
Abhishek Kar, Helge Rhodin, Andrea Tagliasacchi, Kwang Moo Yi.
CVPR 2024, Webpage
Abstract
Unsupervised learning of keypoints and landmarks has seen
significant progress with the help of modern neural network architectures, but
performance has yet to match that of supervised counterparts, making their
practicability questionable. We leverage the emergent knowledge within
text-to-image diffusion models to obtain more robust unsupervised keypoints. Our
core idea is to find text embeddings that would cause the generative model to
consistently attend to compact regions in images (i.e. keypoints). To do so, we
simply optimize the text embedding such that the cross-attention maps within the
denoising network are localized as Gaussians with small standard deviations. We
validate our performance on multiple datasets: CelebA, CUB-200-2011,
Tai-Chi-HD, DeepFashion, and Human3.6M. We achieve significantly
improved accuracy, sometimes even outperforming supervised ones, particularly
for data that is non-aligned and less curated.
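A minimal sketch of the localization objective follows, assuming a hypothetical cross_attention_map hook that runs one denoising step and returns a token's (H, W) attention map; the optimized quantity is the text embedding itself.
import torch

def localization_loss(attn, sigma=2.0):
    """Penalize deviation of an attention map from a small Gaussian
    centred at the map's own argmax location."""
    h, w = attn.shape
    idx = attn.flatten().argmax()
    cy, cx = idx // w, idx % w
    ys = torch.arange(h, device=attn.device).float().view(-1, 1)
    xs = torch.arange(w, device=attn.device).float().view(1, -1)
    target = torch.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    attn, target = attn / attn.sum(), target / target.sum()
    return ((attn - target) ** 2).sum()

# embedding = torch.randn(1, dim, requires_grad=True)      # hypothetical shapes
# loss = localization_loss(cross_attention_map(image, embedding))
# loss.backward()  # gradients flow into the text embedding only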
|
|
PointNeRF++: A multi-scale, point-based Neural Radiance Field
Weiwei Sun, Eduard Trulls, Yang-Che Tseng, Sneha Sambandam, Gopal Sharma,
Andrea Tagliasacchi, Kwang Moo Yi.
ECCV 2024, Webpage
Abstract
Point clouds offer an attractive source of information to
complement images in neural scene representations, especially when few images
are available. Neural rendering methods based on point clouds do exist, but they
do not perform well when the point cloud quality is low, e.g., sparse or
incomplete, which is often the case with real-world data. We overcome these
problems with a simple representation that aggregates point clouds at multiple
scale levels with sparse voxel grids at different resolutions. To deal with
point cloud sparsity, we average across multiple scale levels, but only among
those that are valid, i.e., that have enough neighboring points in proximity to
the ray of a pixel. To help model areas without points, we add a global voxel at
the coarsest scale, thus unifying "classical" and point-based NeRF
formulations.
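The aggregation rule can be summarized in a few lines; the sketch below assumes per-scale features and validity masks already gathered from hypothetical sparse-voxel lookups, with the global voxel stored at the coarsest scale.
import torch

def aggregate_scales(features, valid):
    """features: (S, N, C) per-scale features for N ray samples.
    valid: (S, N) bool, True where a scale has enough points near the ray.
    Scale 0 is the single global voxel and always contributes."""
    valid = valid.clone()
    valid[0] = True
    w = valid.float().unsqueeze(-1)                          # (S, N, 1)
    return (features * w).sum(0) / w.sum(0).clamp_min(1.0)   # (N, C)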
|
|
Accelerating Neural Field Training via Soft Mining
Shakiba Kheradmand, Daniel Rebain, Gopal Sharma, Hossam Isack, Abhishek Kar,
Andrea Tagliasacchi, Kwang Moo Yi
CVPR 2024, Webpage
Abstract
We present an approach to accelerate Neural Field training by
efficiently selecting sampling locations. While Neural Fields have recently
become popular, they are often trained by uniformly sampling the training domain,
or through handcrafted heuristics. We show that improved convergence and final
training quality can be achieved by a soft mining technique based on importance
sampling: rather than either considering or ignoring a pixel completely, we
weigh the corresponding loss by a scalar. To implement our idea we use Langevin
Monte Carlo sampling. We show that by doing so, regions with higher error are
selected more frequently, leading to a more than 2x improvement in
convergence speed.
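A minimal sketch of the two ingredients, assuming a differentiable error_fn from continuous pixel coordinates to per-pixel loss: a Langevin step drifts samples toward high-error regions, and an importance weight partially corrects the induced sampling bias.
import torch

def lmc_step(coords, error_fn, step=1e-2, noise=1e-2):
    """One Langevin Monte Carlo step on sample coordinates in [0, 1]^2."""
    coords = coords.detach().requires_grad_(True)
    log_p = torch.log(error_fn(coords) + 1e-8).sum()
    (grad,) = torch.autograd.grad(log_p, coords)
    with torch.no_grad():
        coords = coords + step * grad + noise * torch.randn_like(coords)
    return coords.clamp(0.0, 1.0)   # keep samples inside the image domain

def soft_weights(errors, alpha=0.5):
    """Per-sample loss weights ~ 1 / q(x)^alpha, with q proportional to error."""
    q = errors / errors.sum()
    return (1.0 / (q * len(errors)) ** alpha).detach()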
|
|
Unsupervised Semantic Correspondence Using Stable Diffusion
Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea
Tagliasacchi, Kwang Moo Yi.
NeurIPS 2023, Webpage
Abstract
Text-to-image diffusion models are now capable of generating
images that are often indistinguishable from real images. To generate such
images, these models must understand the semantics of the objects they are asked
to generate. In this work we show that, without any training, one can leverage
this semantic knowledge within diffusion models to find semantic correspondences,
i.e., locations in multiple images that have the same semantic meaning.
Specifically, given an image, we optimize the prompt embeddings of these models
for maximum attention on the regions of interest. These optimized embeddings
capture semantic information about the location, which can then be transferred
to another image. By doing so we obtain results on par with the strongly
supervised state of the art on the PF-Willow dataset and significantly
outperform (20.9% relative for the SPair-71k dataset) any existing weakly
supervised or unsupervised method on the PF-Willow, CUB-200, and SPair-71k
datasets.
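Once an embedding has been optimized for a source location (as in the keypoint entry above), transfer reduces to an attention argmax on the target image; cross_attention_map is again a hypothetical hook into the denoiser.
import torch

def transfer_correspondence(embedding, target_image, cross_attention_map):
    """Return the (row, col) in the target image whose attention response
    to the optimized embedding is strongest."""
    attn = cross_attention_map(target_image, embedding)   # (H, W)
    h, w = attn.shape
    idx = attn.flatten().argmax()
    return (idx // w).item(), (idx % w).item()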
|
|
Attention Beats Concatenation for Conditioning Neural Fields
Daniel Rebain, Mark J. Matthews, Kwang Moo Yi, Gopal Sharma, Dmitry Lagun,
Andrea Tagliasacchi
TMLR 2023
Abstract
Neural fields model signals by mapping coordinate inputs to sampled values. They
are becoming an increasingly important backbone architecture across many fields
from vision and graphics to biology and astronomy. In this paper, we explore the
differences between common conditioning mechanisms within these networks, an
essential ingredient in shifting neural fields from memorization of signals to
generalization, where the set of signals lying on a manifold is modelled
jointly. In particular, we are interested in the scaling behaviour of these
mechanisms to increasingly high-dimensional conditioning variables. As we show
in our experiments, high-dimensional conditioning is key to modelling complex
data distributions; it is thus important to determine which architecture choices
best enable this when working on such problems. To this end, we run experiments
modelling 2D, 3D, and 4D signals with neural fields, employing concatenation,
hyper-network, and attention-based conditioning strategies -- a necessary but
laborious effort that has not been performed in the literature. We find that
attention-based conditioning outperforms other approaches in a variety of
settings.
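The contrast between the two extremes can be shown with toy modules; layer sizes and the single-head attention layout below are illustrative assumptions, not the paper's exact architectures.
import torch
import torch.nn as nn

class ConcatField(nn.Module):
    """Condition by concatenating the latent to the coordinate input."""
    def __init__(self, coord_dim=3, latent_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(coord_dim + latent_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, z):                  # x: (N, 3), z: (latent_dim,)
        return self.net(torch.cat([x, z.expand(x.shape[0], -1)], dim=-1))

class AttentionField(nn.Module):
    """Condition by cross-attending from coordinates to latent tokens."""
    def __init__(self, coord_dim=3, latent_dim=256, hidden=256):
        super().__init__()
        self.to_q = nn.Linear(coord_dim, hidden)
        self.to_kv = nn.Linear(latent_dim, 2 * hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x, z):                  # x: (N, 3), z: (T, latent_dim)
        q = self.to_q(x)                                  # (N, hidden)
        k, v = self.to_kv(z).chunk(2, dim=-1)             # (T, hidden) each
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ v)                         # (N, 1)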
|
|
PRIFIT: Learning to Fit Primitives Improves Few Shot Point Cloud Segmentation
Gopal Sharma*, Bidya Dash*, Matheus Gadelha, Aruni
RoyChowdhury, Marios Loizou, Evangelos Kalogerakis, Liangliang Cao, Erik
Learned-Miller, Rui Wang and Subhransu Maji.
SGP 2022, Webpage
Abstract
We present PriFit, a simple approach for label-efficient learning
of 3D shape segmentation networks. PriFit is based on a self-supervised task of
decomposing the surface of a 3D shape into geometric primitives.
It can be readily applied to existing network architectures for 3D shape
segmentation,
and improves their performance in the few-shot setting, as we demonstrate on the
widely used ShapeNet and PartNet benchmarks.
PriFit outperforms the prior state-of-the-art in this setting, suggesting that
decomposability into primitives is a useful prior for learning representations
predictive of semantic parts.
We present a number of experiments varying the choice of geometric primitives
and downstream tasks to demonstrate the effectiveness of the method.
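A minimal sketch of the self-supervised task, using planes as the primitive for brevity (the paper mainly fits ellipsoids): the backbone predicts soft point-to-primitive memberships, each primitive is fit in closed form, and the weighted residuals form the loss.
import torch

def plane_fit_loss(points, memberships):
    """points: (N, 3); memberships: (N, K) softmax weights over primitives."""
    w = memberships / memberships.sum(0, keepdim=True).clamp_min(1e-8)
    loss = points.new_zeros(())
    for k in range(memberships.shape[1]):
        wk = w[:, k:k + 1]                      # (N, 1)
        center = (wk * points).sum(0)           # weighted centroid
        d = points - center
        cov = (wk * d).t() @ d                  # weighted 3x3 covariance
        _, vecs = torch.linalg.eigh(cov)        # eigenvalues in ascending order
        normal = vecs[:, 0]                     # smallest-eigenvalue direction
        loss = loss + (wk.squeeze(-1) * (d @ normal).abs()).sum()
    return loss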
|
|
MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D
Segmentation
Gopal Sharma, Kangxue Yin, Subhransu Maji, Evangelos Kalogerakis, Or Litany,
and Sanja Fidler
ECCV 2022, Webpage
Abstract
We propose to utilize self-supervised techniques in the 2D domain for
fine-grained 3D shape segmentation tasks. This is inspired by the
observation that view-based surface representations are more effective
at modeling high-resolution surface details and texture than their 3D
counterparts based on point clouds or voxel occupancy. Specifically,
given a 3D shape, we render it from multiple views, and set up a dense
correspondence learning task within the contrastive learning
framework. As a result, the learned 2D representations are
view-invariant and geometrically consistent, leading to better
generalization when trained on a limited number of labeled shapes than
alternatives based on self-supervision in 2D or 3D alone. Experiments
on textured (RenderPeople) and untextured (PartNet) 3D datasets show
that our method outperforms state-of-the-art alternatives in
fine-grained part segmentation. The improvements over baselines are
greater when only a sparse set of views is available for training or
when shapes are textured, indicating that MvDeCor benefits from both 2D
processing and 3D geometric reasoning.
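The correspondence objective is essentially a dense InfoNCE loss; the sketch below assumes the flat indices of matching pixels in the two renderings are known from the rendering geometry.
import torch
import torch.nn.functional as F

def dense_nce(feat_a, feat_b, idx_a, idx_b, tau=0.07):
    """feat_*: (C, H, W) pixel embeddings from the two views.
    idx_*: (M,) flat indices of pixels covering the same surface point."""
    c = feat_a.shape[0]
    fa = F.normalize(feat_a.reshape(c, -1)[:, idx_a], dim=0)   # (C, M)
    fb = F.normalize(feat_b.reshape(c, -1)[:, idx_b], dim=0)   # (C, M)
    logits = fa.t() @ fb / tau                                 # (M, M)
    targets = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, targets)   # match pixel i to pixel i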
|
|
Neural Shape Parsers for Constructive Solid Geometry
Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis and Subhransu
Maji
TPAMI, Paper
Abstract
Constructive Solid Geometry (CSG) is a geometric modeling
technique that defines complex shapes by recursively applying boolean operations
on primitives such as spheres and cylinders. We present CSGNet, a deep network
architecture that takes as input a 2D or 3D shape and outputs a CSG program that
models it. Parsing shapes into CSG programs is desirable as it yields a compact
and interpretable generative model. However, the task is challenging since the
space of primitives and their combinations can be prohibitively large. CSGNet
uses a convolutional encoder and recurrent decoder based on deep networks to map
shapes to modeling instructions in a feed-forward manner and is significantly
faster than bottom-up approaches. We investigate two architectures for this task:
a vanilla encoder (CNN)-decoder (RNN), and a variant that augments the encoder
with an explicit memory module based on the program execution stack.
The stack augmentation improves the reconstruction quality of the generated
shape and learning efficiency. Our approach is also more effective as a shape
primitive detector compared to a state-of-the-art object detector. Finally, we
demonstrate CSGNet can be trained on novel datasets without program annotations
through policy gradient techniques.
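To illustrate the role of the execution stack, here is a minimal 2D executor for the kind of postfix CSG program CSGNet emits; the token format is illustrative, not the paper's exact grammar.
import numpy as np

def execute_csg(program, grid=64):
    """Run a postfix CSG program: primitives push occupancy grids,
    boolean operators pop two operands and push the result."""
    xs = np.linspace(-1, 1, grid)
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    stack = []
    for op, *args in program:
        if op == "circle":
            cx, cy, r = args
            stack.append((X - cx) ** 2 + (Y - cy) ** 2 <= r ** 2)
        elif op == "union":
            b, a = stack.pop(), stack.pop(); stack.append(a | b)
        elif op == "intersect":
            b, a = stack.pop(), stack.pop(); stack.append(a & b)
        elif op == "subtract":
            b, a = stack.pop(), stack.pop(); stack.append(a & ~b)
    return stack.pop()

# disc_with_hole = execute_csg([("circle", 0.0, 0.0, 0.6),
#                               ("circle", 0.2, 0.0, 0.3),
#                               ("subtract",)])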
Cite
@ARTICLE{9293398,
author={G. {Sharma} and R. {Goyal} and D. {Liu} and E. {Kalogerakis} and S.
{Maji}},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={Neural Shape Parsers for Constructive Solid Geometry},
year={2020},
volume={},
number={},
pages={1-1},
doi={10.1109/TPAMI.2020.3044749}}
|
|
ParSeNet: A Parametric Surface Fitting Network for 3D Point Clouds
Gopal Sharma, Difan Liu, Evangelos Kalogerakis, Subhransu Maji,
Siddhartha Chaudhuri and Radomír Měch
ECCV 2020, Paper
Abstract
We propose a novel, end-to-end trainable, deep network called ParSeNet that
decomposes a 3D point cloud into parametric surface patches, including B-spline
patches as well as basic geometric primitives. ParSeNet is trained on a
large-scale dataset of man-made 3D shapes and captures high-level semantic
priors for shape decomposition. It handles a much richer class of primitives
than prior work, and allows us to represent surfaces with higher fidelity. It
also produces repeatable and robust parametrizations of a surface compared to
purely geometric approaches. We present extensive experiments to validate our
approach against analytical and learning-based alternatives.
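For intuition about the parametric output, the sketch below evaluates a single bicubic B-spline patch from a 4x4 control grid with the uniform cubic basis; knot handling and the network that predicts the control points are elided.
import numpy as np

def cubic_basis(t):
    """Uniform cubic B-spline basis values at local parameter t in [0, 1)."""
    return np.array([(1 - t) ** 3,
                     3 * t ** 3 - 6 * t ** 2 + 4,
                     -3 * t ** 3 + 3 * t ** 2 + 3 * t + 1,
                     t ** 3]) / 6.0

def eval_patch(ctrl, u, v):
    """ctrl: (4, 4, 3) control points; returns the surface point at (u, v)."""
    return np.einsum("i,j,ijk->k", cubic_basis(u), cubic_basis(v), ctrl)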
Cite
@misc{sharma2020parsenet,
title={ParSeNet: A Parametric Surface Fitting Network for 3D Point Clouds},
author={Gopal Sharma and Difan Liu and Evangelos Kalogerakis and Subhransu Maji
and Siddhartha Chaudhuri and Radomír Měch},
year={2020},
eprint={2003.12181},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
|
|
Label-Efficient Learning on Point Clouds using Approximate Convex
Decompositions
Matheus Gadelha*, Aruni RoyChowdhury*, Gopal Sharma,
Evangelos Kalogerakis, Liangliang Cao, Erik Learned-Miller, Rui Wang, Subhransu Maji
ECCV 2020, Paper
Abstract
The problems of shape classification and part segmentation from 3D point clouds
have garnered increasing attention in the last few years. But both of these
problems suffer from relatively small training sets, creating the need for
statistically efficient methods to learn 3D shape representations. In this work,
we investigate the use of Approximate Convex Decompositions (ACD) as a
self-supervisory signal for label-efficient learning of point cloud
representations. Decomposing a 3D shape into simpler constituent parts or
primitives is a fundamental problem in geometrical shape processing. There has
been extensive work on such decompositions, where the criterion for simplicity
of a constituent shape is often defined in terms of convexity for solid
primitives. In this paper, we show that using the results of ACD to approximate
a ground truth segmentation provides excellent self-supervision for learning 3D
point cloud representations that are highly effective on downstream tasks. We
report improvements over the state-of-the-art in unsupervised representation
learning on the ModelNet40 shape classification dataset and significant gains in
few-shot part segmentation on the ShapeNetPart dataset.
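A minimal sketch of turning ACD components into a training signal via a pairwise contrastive loss (one plausible instantiation, not necessarily the paper's exact loss): points in the same approximate convex component attract, others repel. Component labels are assumed precomputed with an off-the-shelf ACD tool and transferred to the sampled points.
import torch
import torch.nn.functional as F

def acd_pairwise_loss(feats, labels, tau=0.1):
    """feats: (N, C) point embeddings; labels: (N,) ACD component ids."""
    f = F.normalize(feats, dim=-1)
    n = f.shape[0]
    logits = f @ f.t() / tau
    logits = logits - torch.eye(n, device=f.device) * 1e9   # mask self-pairs
    log_prob = logits.log_softmax(dim=-1)
    same = labels[:, None] == labels[None, :]
    pos = same & ~torch.eye(n, dtype=torch.bool, device=f.device)
    return -log_prob[pos].mean()   # pull same-component pairs together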
Cite
@misc{gadelha2020labelefficient,
title={Label-Efficient Learning on Point Clouds using Approximate Convex
Decompositions},
author={Matheus Gadelha and Aruni RoyChowdhury and Gopal Sharma and Evangelos
Kalogerakis and Liangliang Cao and Erik Learned-Miller and Rui Wang and
Subhransu Maji},
year={2020},
eprint={2003.13834},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
|
|
Search-Guided, Lightly-supervised Training of Structured Prediction Energy
Networks
Amirmohammad Rooshenas, Dongxu Zhang, Gopal Sharma, and Andrew McCallum
NeurIPS 2019, Paper
Abstract
In structured output prediction tasks, labeling ground-truth training output is
often expensive. However, for many tasks, even when the true output is unknown,
we can evaluate predictions using a scalar reward function, which may be easily
assembled from human knowledge or non-differentiable pipelines. But searching
through the entire output space to find the best output with respect to this
reward function is typically intractable. In this paper, we instead use
efficient truncated randomized search in this reward function to train
structured prediction energy networks (SPENs), which provide efficient test-time
inference using gradient-based search on a smooth, learned representation of the
score landscape, and have previously yielded state-of-the-art results in
structured prediction. In particular, this truncated randomized search in the
reward function yields previously unknown local improvements, providing
effective supervision to SPENs, avoiding their traditional need for labeled
training data.
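A minimal sketch of the supervision loop, with reward and energy as hypothetical callables and binary output vectors for simplicity: truncated randomized search looks for a higher-reward output near the current prediction, and any improvement found becomes a ranking constraint on the energy.
import torch

def search_improvement(y, reward, flips=32):
    """Truncated randomized search: try single-bit flips, keep the best."""
    best, best_r = y, reward(y)
    for _ in range(flips):
        cand = y.clone()
        i = torch.randint(len(y), (1,))
        cand[i] = 1 - cand[i]
        r = reward(cand)
        if r > best_r:
            best, best_r = cand, r
    return best, best_r

def ranking_loss(energy, y_pred, y_better, margin=1.0):
    """Hinge loss: the higher-reward output must receive lower energy."""
    return torch.relu(margin + energy(y_better) - energy(y_pred))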
Cite
@incollection{NIPS2019_9507,
title = {Search-Guided, Lightly-Supervised Training of Structured Prediction
Energy Networks},
author = {Rooshenas, Amirmohammad and Zhang, Dongxu and Sharma, Gopal and
McCallum, Andrew},
booktitle = {Advances in Neural Information Processing Systems 32},
editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F.
d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
pages = {13522--13532},
year = {2019},
publisher = {Curran Associates, Inc.},
url =
{http://papers.nips.cc/paper/9507-search-guided-lightly-supervised-training-of-structured-prediction-energy-networks.pdf}
}
|
|
Learning Point Embeddings from Shape Repositories for Few-Shot Segmentation
Gopal Sharma, Evangelos Kalogerakis and Subhransu Maji.
3DV 2019, Paper
Abstract
User generated 3D shapes in online repositories contain rich information about
surfaces, primitives, and their geometric relations, often arranged in a
hierarchy. We present a framework for learning representations of 3D shapes that
reflect the information present in this meta data and show that it leads to
improved generalization for semantic segmentation tasks. Our approach is a point
embedding network that generates a vectorial representation of the 3D points
such that it reflects the grouping hierarchy and tag data. The main challenge is
that the data is noisy and highly variable. To this end, we present a tree-aware
metric-learning approach and demonstrate that such learned embeddings offer
excellent transfer to semantic segmentation tasks, especially when training data
is limited. Our approach reduces the relative error by 10.2% with 8 training
examples and by 11.72% with 120 training examples on the ShapeNet semantic
segmentation benchmark, in comparison to a network trained from scratch. By
utilizing tag data, the relative error is reduced by 12.8% with 8 training
examples, again relative to training from scratch. These improvements
come at no additional labeling cost as the meta data is freely available.
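A minimal sketch of the tree-aware metric objective: a triplet loss whose margin grows with how much farther the negative sits from the anchor in the grouping hierarchy (the margin scheme is an assumption). Leaf-to-leaf tree distances are assumed precomputed from the repository metadata.
import torch
import torch.nn.functional as F

def tree_triplet_loss(emb, anchor, pos, neg, tree_dist, base_margin=0.2):
    """emb: (N, C) point embeddings; anchor/pos/neg: (M,) index tensors;
    tree_dist: (N, N) hop distances between leaves in the grouping tree."""
    d_ap = F.pairwise_distance(emb[anchor], emb[pos])
    d_an = F.pairwise_distance(emb[anchor], emb[neg])
    margin = base_margin * (tree_dist[anchor, neg] - tree_dist[anchor, pos])
    return torch.relu(d_ap - d_an + margin.clamp_min(0.0)).mean()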
Cite
@INPROCEEDINGS{8885650,
author={G. {Sharma} and E. {Kalogerakis} and S. {Maji}},
booktitle={2019 International Conference on 3D Vision (3DV)},
title={Learning Point Embeddings from Shape Repositories for Few-Shot
Segmentation},
year={2019},
volume={},
number={},
pages={67-75},}
|