<!DOCTYPE html>
<html lang="en">
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-111713571-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-111713571-1');
</script>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<!-- Bootstrap core CSS -->
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0-beta/css/bootstrap.min.css" integrity="sha384-/Y6pD6FV/Vv2HJnA6t+vslU6fwYXjCFtcEpHbNJ0lyAFsXTsjBbfaDjzALeQsN6M" crossorigin="anonymous">
<!-- Custom styles for this template -->
<link href="files/jumbotron.css" rel="stylesheet">
<script src="js/main.js"></script>
<script src="js/scroll.js"></script>
<title>Energy-based Models</title>
</head>
<body>
<nav class="navbar navbar-expand-md navbar-dark fixed-top bg-dark" id="Home">
<a class="navbar-brand" href="#Home">Energy-based Models</a>
<div class="collapse navbar-collapse" id="navbarToggle">
<ul class="navbar-nav ml-auto">
<li class="nav-item">
<a class="nav-link" href="#Publications">Publications</a>
</li>
<li class="nav-item">
<a class="nav-link" href="#Contact">Contact</a>
</li>
<li class="nav-item">
<a class="nav-link" href="http://accessibility.mit.edu" target="_blank">Accessibility</a>
</li>
</ul>
</div>
</nav>
<br><br><br><br>
<!-- Publications -->
<div class="container">
<h3 id="Publications" style="padding-top: 80px; margin-top: -80px;">
Recent Publications
</h3>
<div id="pubs"></div>
<div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/24-ired/ired.gif" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">Learning Iterative Reasoning through Energy Diffusion</font></b><br>
<a href="https://yilundu.github.io/">Yilun Du*</a>,
<a href="https://jiayuanm.com/">Jiayuan Mao*</a>,
<a href="https://scholar.google.com/citations?user=rRJ9wTJMUB8C&hl=en">Joshua Tenenbaum</a>
<br>
<b><a href="https://icml.cc/" target="_blank">ICML 2024</a></b> <br>
<a href="https://energy-based-model.github.io/ired/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/abs/2406.11179"> <small>[Paper]</small></a>
<a href="https://github.com/yilundu/ired_code_release"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> We introduce iterative reasoning through energy diffusion (IRED), a novel framework for learning to reason across a variety of tasks by formulating reasoning and decision-making problems as energy-based optimization. IRED learns energy functions to represent the constraints between input conditions and desired outputs. After training, IRED adapts the number of optimization steps during inference based on problem difficulty, enabling it to solve problems outside its training distribution -- such as more complex Sudoku puzzles, matrix completion with large value magnitudes, and pathfinding in larger graphs. Key to our method's success are two novel techniques: learning a sequence of annealed energy landscapes for easier inference, and combining score-function and energy-landscape supervision for faster and more stable training. Our experiments show that IRED outperforms existing methods in continuous-space reasoning, discrete-space reasoning, and planning tasks, particularly in more challenging scenarios. </p>
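<p><small>A rough sketch of the inference procedure described above (hypothetical names, not the released code): the answer is refined by gradient descent over a sequence of annealed energy landscapes, and harder problems can simply be given more optimization steps.</small></p>
<pre><code># Illustrative IRED-style inference sketch (not the released code):
# descend a sequence of annealed energy landscapes energy(x, y, k),
# from the smoothest (largest k) to the sharpest (k = 0).
import torch

def ired_inference(energy, x, y_init, num_landscapes=5, steps=10, lr=0.1):
    y = y_init.clone().requires_grad_(True)
    for k in reversed(range(num_landscapes)):
        for _ in range(steps):  # raise `steps` at test time for harder problems
            grad, = torch.autograd.grad(energy(x, y, k).sum(), y)
            y = (y - lr * grad).detach().requires_grad_(True)
    return y.detach()
</code></pre>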
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/24-decomp-diffusion/decomp_diffusion.png" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">Compositional Image Decomposition with Diffusion Models</font></b><br>
<a href="https://www.semanticscholar.org/author/Jocelin-Su/51149200">Jocelin Su*</a>,
<a href="https://nanliu.io/">Nan Liu*</a>,
<a href="https://openreview.net/profile?id=~Yanbo_Wang3">Yanbo Wang*</a>,
<a href="http://cocosci.mit.edu/josh">Joshua B. Tenenbaum</a>,
<a href="https://yilundu.github.io/">Yilun Du</a>
<br>
<b><a href="https://icml.cc/" target="_blank">ICML 2024</a></b> <br>
<a href="https://energy-based-model.github.io/decomp-diffusion/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/abs/2406.19298"> <small>[Paper]</small></a>
<a href="https://github.com/jsu27/decomp_diffusion"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> Given an image of a natural scene, we are able to quickly decompose it into a set of components such as objects, lighting, shadows, and foreground. We can then envision a scene where we combine certain components with those from other images, for instance, a set of objects from our bedroom and animals from a zoo under the lighting conditions of a forest, even if we have never encountered such a scene before. In this paper, we present a method to decompose an image into such compositional components. Our approach, Decomp Diffusion, is an unsupervised method which, when given a single image, infers a set of different components in the image, each represented by a diffusion model. We demonstrate how components can capture different factors of the scene, ranging from global scene descriptors like shadows or facial expression to local scene descriptors like constituent objects. We further illustrate how inferred factors can be flexibly composed, even with factors inferred from other models, to generate a variety of scenes sharply different from those seen at training time. </p>
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/24-compose-model/decentralized.png" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">Compositional Generative Modeling: A Single Model is Not All You Need</font></b><br>
<a href="https://yilundu.github.io/">Yilun Du</a>,
<a href="https://people.csail.mit.edu/lpk/">Leslie Kaelbling</a>
<br>
<b><a href="https://icml.cc/" target="_blank">ICML 2024</a></b> <br>
<a href="https://arxiv.org/abs/2402.01103"> <small>[Paper]</small></a>
</div>
<div class="col-md-12">
<br>
<p> Large monolithic generative models trained on massive amounts of data have become an increasingly dominant approach in AI research. In this paper, we argue that we should instead construct large generative systems by composing smaller generative models together. We show how such a compositional generative approach enables us to learn distributions in a more data-efficient manner, enabling generalization to parts of the data distribution unseen at training time. We further show how this enables us to program and construct new generative models for tasks completely unseen at training. Finally, we show that in many cases, we can discover separate compositional components from data. </p>
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<video width="100%" playsinline="" autoplay="" loop="" preload="" muted="" style="border:1px solid black">
<source src="files/paper/23-iccv-concept-discovery-diffusion/teaser.m4v" type="video/mp4">
</video>
</div>
<div class="col-md-9">
<b><font color="black">Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models</font></b><br>
<a href="https://nanliu.io/" target="_blank">Nan Liu*</a>,
<a href="https://yilundu.github.io/">Yilun Du*</a>,
<a href="https://people.csail.mit.edu/lishuang/">Shuang Li*</a>,
<a href="https://scholar.google.com/citations?user=rRJ9wTJMUB8C&hl=en" target="_blank">Joshua B. Tenenbaum</a>,
<a href="https://groups.csail.mit.edu/vision/torralbalab/" target="_blank">Antonio Torralba</a>
<br>
<b><a href="https://iccv2023.thecvf.com/" target="_blank">ICCV 2023</a></b> <br>
<a href="https://energy-based-model.github.io/unsupervised-concept-discovery/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/abs/2306.05357"> <small>[Paper]</small></a>
<a href="https://github.com/nanlliu/Unsupervised-Compositional-Concepts-Discovery"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> Text-to-image generative models have enabled high-resolution image synthesis across different domains, but require users to specify the content they wish to generate. In this paper, we consider the inverse problem -- given a collection of different images, can we discover the generative concepts that represent each image? We present an unsupervised approach to discover generative concepts from a collection of images, disentangling different art styles in paintings, objects, and lighting from kitchen scenes, and discovering image classes given ImageNet images. We show how such generative concepts can accurately represent the content of images, be recombined and composed to generate new artistic and hybrid images, and be further used as a representation for downstream classification tasks. </p>
</div>
</div><hr>
<!-- ================================================================ -->
<div class="row">
<div class="col-md-3">
<video width="100%" playsinline="" autoplay="" loop="" preload="" muted="" style="border:1px solid black">
<source src="files/paper/23-icml-niip/teaser_good_trimmed.mp4" type="video/mp4">
</video>
</div>
<div class="col-md-9">
<b><font color="black">Inferring Relational Potentials in Interacting Systems</font></b><br>
<a href="https://www.linkedin.com/in/armandcomas/" target="_blank">Armand Comas</a>,
<a href="https://yilundu.github.io/">Yilun Du</a>,
<a href="https://www.linkedin.com/in/christian-fernandez-lopez-868b09161/">Christian Fernandez Lopez</a>,
<a href="https://sandeshgh.com/" target="_blank">Sandesh Ghimire</a>,
<a href="https://scholar.google.com/citations?user=hbNllP0AAAAJ&hl=en" target="_blank">Mario Sznaier</a>,
<a href="http://cocosci.mit.edu/josh" target="_blank">Joshua B. Tenenbaum</a>,
<a href="https://scholar.google.com/citations?user=htt9T1AAAAAJ&hl=en" target="_blank">Octavia Camps</a>
<br>
<b><a href="https://iccv2023.thecvf.com/" target="_blank">ICML 2023</a>, <font color="firebrick">Oral</font> </b> <br>
<a href="https://energy-based-model.github.io/interaction-potentials/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/pdf/2310.14466.pdf"> <small>[Paper]</small></a>
<a href="https://github.com/ArmandCom/neural-interaction-inference"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> Systems consisting of interacting agents are prevalent in the world, ranging from dynamical systems in physics to complex biological networks. To build systems which can interact robustly in the real world, it is thus important to be able to infer the precise interactions governing such systems. Existing approaches typically discover such interactions by explicitly modeling the feed-forward dynamics of the trajectories. In this work, we propose Neural Interaction Inference with Potentials (NIIP) as an alternative approach to discover such interactions that enables greater flexibility in trajectory modeling: it discovers a set of relational potentials, represented as energy functions, which when minimized reconstruct the original trajectory. NIIP assigns low energy to the subset of trajectories which respect the relational constraints observed. We illustrate that with these representations NIIP displays unique capabilities at test time. First, it allows trajectory manipulation, such as interchanging interaction types across separately trained models, as well as trajectory forecasting. Additionally, it allows adding external hand-crafted potentials at test time. Finally, NIIP enables the detection of out-of-distribution samples and anomalies without explicit training. </p>
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<video width="100%" playsinline="" autoplay="" loop="" preload="" muted="" style="border:1px solid black">
<source src="files/paper/23-recycle-diffusion/teaser.mp4" type="video/mp4">
</video>
</div>
<div class="col-md-9">
<b><font color="black">Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models</font></b><br>
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>,
<a href="https://conormdurkan.github.io/">Conor Durkan</a>,
<a href="https://rstrudel.github.io/">Robin Strudel</a>,
<a href="http://cocosci.mit.edu/josh">Joshua B. Tenenbaum</a>,
<a href="https://benanne.github.io/about/">Sander Dieleman</a>,
<a href="https://scholar.google.com/citations?user=GgQ9GEkAAAAJ&hl=en">Rob Fergus</a>,
<a href="http://www.sohldickstein.com/">Jascha Sohl-Dickstein</a>,
<a href="https://www.stats.ox.ac.uk/~doucet/">Arnaud Doucet</a>,
<a href="http://www.cs.toronto.edu/~wgrathwohl/">Will Grathwohl</a>
<br>
<b><a href="https://icml.cc/Conferences/2023/CallForPapers" target="_blank">ICML 2023</a></b> <br>
<a href="https://energy-based-model.github.io/reduce-reuse-recycle/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/abs/2302.11552"> <small>[Paper]</small></a>
<a href="https://colab.research.google.com/drive/1jvlzWMc6oo-TH1fYMl6hsOYfrcQj2rEs?usp=sharing"> <small>[Colab]</small></a>
<a href="https://github.com/yilundu/reduce_reuse_recycle"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build upon these ideas using the score-based interpretation of diffusion models, and explore alternative ways to condition, modify, and reuse diffusion models for tasks involving compositional generation and guidance. In particular, we investigate why certain types of composition fail using current techniques and present a number of solutions. We conclude that the sampler (not the model) is responsible for this failure and propose new samplers, inspired by MCMC, which enable successful compositional generation. Further, we propose an energy-based parameterization of diffusion models which enables the use of new compositional operators and more sophisticated, Metropolis-corrected samplers. Intriguingly, we find these samplers lead to notable improvements in compositional generation across a wide set of problems such as classifier-guided ImageNet modeling and compositional text-to-image generation. </p>
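<p><small>A minimal sketch of the product composition discussed above (illustrative only, not the paper's code): under the energy-based view, log-densities add, so the scores of two diffusion models can be summed and sampled with an annealed Langevin (MCMC) sampler rather than the standard reverse-diffusion update.</small></p>
<pre><code># Illustrative sketch: compose two diffusion models as a product of
# experts by summing their scores, then sample with unadjusted Langevin
# steps at each noise level.
import torch

def annealed_langevin_sample(score_a, score_b, shape, sigmas, steps=20):
    x = torch.randn(shape)
    for t, sigma in enumerate(sigmas):  # sigmas ordered large to small
        step = 0.5 * sigma ** 2
        for _ in range(steps):
            score = score_a(x, t) + score_b(x, t)  # scores add under products
            x = x + step * score + (2 * step) ** 0.5 * torch.randn_like(x)
    return x
</code></pre>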
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<video width="100%" playsinline="" autoplay="" loop="" preload="" muted="" style="border:1px solid black">
<source src="files/paper/22-compose-pretrain/teaser-2.mp4" type="video/mp4">
</video>
</div>
<div class="col-md-9">
<b><font color="black">Composing Ensembles of Pre-trained Models via Iterative Consensus</font></b><br>
<a href="https://people.csail.mit.edu/lishuang/">Shuang Li*</a>,
<a href="https://yilundu.github.io/" target="_blank">Yilun Du*</a>,
<a href="https://scholar.google.com/citations?user=rRJ9wTJMUB8C&hl=en" target="_blank">Joshua B. Tenenbaum</a>,
<a href="https://groups.csail.mit.edu/vision/torralbalab/" target="_blank">Antonio Torralba</a>,
<a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en" target="_blank">Igor Mordatch</a>
<br>
(*equal contribution. Shuang Li did experiments on image generation, video question answering, and mathematical reasoning. Yilun Du did all the experiments on robot manipulation.)
<br>
<b><a href="https://iclr.cc/Conferences/2023" target="_blank">ICLR 2023</a></b> <br>
<a href="https://energy-based-model.github.io/composing-pretrained-models-web/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/abs/2210.11522" target="_blank"> <small>[Paper]</small></a>
</div>
<div class="col-md-12">
<br>
<p> Large pre-trained models exhibit distinct and complementary capabilities dependent on the data they are trained on. Language models such as GPT-3 are capable of textual reasoning but cannot understand visual information, while vision models such as DALL-E can generate photorealistic photos but fail to understand complex language descriptions. In this work, we propose a unified framework for composing ensembles of different pre-trained models -- combining the strengths of each individual model to solve various multimodal problems in a zero-shot manner. We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization. The generator constructs proposals and the scorers iteratively provide feedback to refine the generated result. Such closed-loop communication enables models to correct errors caused by other models, significantly boosting performance on downstream tasks, e.g. improving accuracy on grade school math problems by 7.5%, without requiring any model finetuning. We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer, by leveraging the strengths of each expert model. Results show that the proposed method can be used as a general purpose framework for a wide range of zero-shot multimodal tasks, such as image generation, video question answering, mathematical reasoning, and robotic manipulation. </p>
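<p><small>The closed-loop optimization described above can be sketched roughly as follows (<code>generator</code> and <code>scorers</code> are hypothetical interfaces, not the paper's API): the generator proposes, the scorer ensemble scores, and the best candidate seeds the next round.</small></p>
<pre><code># Illustrative consensus loop: a generator proposes candidates and an
# ensemble of pre-trained scorers provides feedback to refine them.
def iterative_consensus(generator, scorers, x, rounds=5, num_candidates=8):
    best = None
    for _ in range(rounds):
        candidates = [generator.propose(x, best) for _ in range(num_candidates)]
        scored = [(sum(s.score(x, c) for s in scorers), c) for c in candidates]
        best = max(scored, key=lambda t: t[0])[1]  # keep the ensemble's favorite
    return best
</code></pre>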
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<video width="100%" playsinline="" autoplay="" loop="" preload="" muted="" style="border:1px solid black">
<source src="files/paper/22-compose-diffusion/teaser_glide.mp4" type="video/mp4">
</video>
</div>
<div class="col-md-9">
<b><font color="black">Compositional Visual Generation with Composable Diffusion Models</font></b><br>
<a href="" target="_blank">Nan Liu*</a>,
<a href="https://people.csail.mit.edu/lishuang/">Shuang Li*</a>,
<a href="https://yilundu.github.io/" target="_blank">Yilun Du*</a>,
<a href="https://groups.csail.mit.edu/vision/torralbalab/" target="_blank">Antonio Torralba</a>
<a href="https://scholar.google.com/citations?user=rRJ9wTJMUB8C&hl=en" target="_blank">Joshua B. Tenenbaum</a>, and
(*equal contribution)
<br>
<b><a href="https://eccv2022.ecva.net/" target="_blank">ECCV 2022</a></b> <br>
<a href="https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/pdf/2206.01714.pdf" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/energy-based-model/Compositional-Visual-Generation-with-Composable-Diffusion-Models-PyTorch" target="_blank"> <small>[Code]</small></a>
<a href="https://colab.research.google.com/github/energy-based-model/Compositional-Visual-Generation-with-Composable-Diffusion-Models-PyTorch/blob/main/notebooks/demo.ipynb" target="_blank"> <small>[Colab]</small></a>
<a href="https://huggingface.co/spaces/Shuang59/Composable-Diffusion" target="_blank"> <small>[HuggingFace Demo]</small></a><br>
Press coverage: <a href="https://news.mit.edu/2022/ai-system-makes-models-like-dall-e-2-more-creative-0908" target="_blank">MIT News</a>, <a href="https://www.csail.mit.edu/news/ai-system-makes-models-dall-e-2-more-creative" target="_blank">MIT CSAIL News</a>
</div>
<div class="col-md-12">
<br>
<p>Large text-guided diffusion models, such as DALL-E 2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, such as confusing the attributes of different objects or relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown to be difficult for DALL-E 2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation. </p>
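<p><small>A minimal sketch of the conjunction ("AND") operator implied by the energy-based interpretation above (illustrative signatures, not the released code): each concept's conditional noise prediction is combined relative to the unconditional one.</small></p>
<pre><code># Illustrative sketch: compose several concepts in one diffusion sampler
# by summing their weighted conditional score offsets.
def compose_and(eps_model, x, t, concepts, weight=7.5):
    eps_uncond = eps_model(x, t, cond=None)
    eps = eps_uncond.clone()
    for c in concepts:
        eps = eps + weight * (eps_model(x, t, cond=c) - eps_uncond)
    return eps  # use in place of the usual noise prediction when sampling
</code></pre>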
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<video width="100%" playsinline="" autoplay="" loop="" preload="" muted="" style="border:1px solid black">
<source src="files/paper/22-icml-irem/irem.mp4" type="video/mp4">
</video>
</div>
<div class="col-md-9">
<b><font color="black">Learning Iterative Reasoning through Energy Minimization</font></b><br>
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>,
<a href="https://people.csail.mit.edu/lishuang/">Shuang Li</a>,
<a href="https://scholar.google.com/citations?user=rRJ9wTJMUB8C&hl=en" target="_blank">Joshua B. Tenenbaum</a>, and
<a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en" target="_blank">Igor Mordatch</a>
<br>
<b><a href="https://icml.cc/" target="_blank">ICML 2022</a></b> <br>
<a href="https://energy-based-model.github.io/iterative-reasoning-as-energy-minimization/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/abs/2206.15448" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/yilundu/irem_code_release" target="_blank"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> Deep learning has excelled on complex pattern recognition tasks such as image classification and object recognition. However, it struggles with tasks requiring nontrivial reasoning, such as algorithmic computation.
Humans are able to solve such tasks through iterative reasoning -- spending more time thinking about harder tasks. Most existing neural networks, however, exhibit a fixed computational budget controlled by the neural network architecture, preventing additional computational processing on harder tasks. In this work, we present a new framework for iterative reasoning with neural networks. We train a neural network to parameterize an energy landscape over all outputs, and implement each step of the iterative reasoning as an energy minimization step to find a minimal energy solution. By formulating reasoning as an energy minimization problem, we may adjust our underlying computational budget for harder problems that lead to more complex energy landscapes by running a more complex optimization procedure. We empirically illustrate that our iterative reasoning approach yields more accurate and generalizable solutions to algorithmic reasoning tasks in both graph and continuous domains. Finally, we illustrate that our approach can recursively solve algorithmic problems requiring nested reasoning. </p>
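<p><small>A rough training sketch under the formulation above (hypothetical names; see the released code for the actual implementation): unroll a few energy-minimization steps and supervise the final iterate with the ground-truth answer.</small></p>
<pre><code># Illustrative IREM-style training loss (not the released code): each
# reasoning step is one gradient-descent step on a learned energy.
import torch

def irem_loss(energy, x, y_true, n_steps=5, lr=0.1):
    y = torch.zeros_like(y_true).requires_grad_(True)
    for _ in range(n_steps):
        grad, = torch.autograd.grad(energy(x, y).sum(), y, create_graph=True)
        y = y - lr * grad  # keep the graph so training sees every step
    return ((y - y_true) ** 2).mean()
</code></pre>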
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/21-neurips-decompose/teaser.gif" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">Unsupervised Learning of Compositional Energy Concepts</font></b><br>
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>,
<a href="https://people.csail.mit.edu/lishuang/">Shuang Li</a>,
<a href="https://www.yash-sharma.com" target="_blank">Yash Sharma</a>,
<a href="https://scholar.google.com/citations?user=rRJ9wTJMUB8C&hl=en" target="_blank">Joshua B. Tenenbaum</a>, and
<a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en" target="_blank">Igor Mordatch</a>
<br>
<b><a href="https://neurips.cc/" target="_blank">NeurIPS 2021</a></b> <br>
<a href="https://energy-based-model.github.io/comet/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/pdf/2111.03042.pdf" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/yilundu/comet" target="_blank"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> We introduce an approach to decompose images, in an unsupervised manner, into separate component energy functions.
These energy functions can represent both global factors of variation, such as facial expression and hair color, and
local factors of variation, such as the objects in a scene. Decomposed energy functions generalize well, and may be
recombined with energy functions discovered by training a separate instance of our approach on a different dataset, enabling
the recombination of objects and lighting conditions across datasets. </p>
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<video width="100%" playsinline="" autoplay="" loop="" preload="" muted="" style="border:1px solid black">
<source src="files/paper/21-neurips-compose-relation/clevr_teaser.mp4" type="video/mp4">
</video>
</div>
<div class="col-md-9">
<b><font color="black">Learning to Compose Visual Relations</font></b><br>
<a href="" target="_blank">Nan Liu*</a>,
<a href="https://people.csail.mit.edu/lishuang/">Shuang Li*</a>,
<a href="https://yilundu.github.io/" target="_blank">Yilun Du*</a>,
<a href="https://scholar.google.com/citations?user=rRJ9wTJMUB8C&hl=en" target="_blank">Joshua B. Tenenbaum</a>, and
<a href="https://groups.csail.mit.edu/vision/torralbalab/" target="_blank">Antonio Torralba</a>
(*equal contribution)
<br>
<b><a href="https://neurips.cc/" target="_blank">NeurIPS 2021</a>, <font color="firebrick">Spotlight</font> </b> <br>
<b><a href="https://ctrlgenworkshop.github.io/accepted_papers.html" target="_blank">NeurIPS Workshop on Controllable Generative Modeling 2021</a>, <font color="firebrick">Outstanding Paper Award</font> </b> <br>
Press coverage: <a href="https://news.mit.edu/2021/ai-object-relationships-image-generation-1129" target="_blank">MIT News</a>, <a href="https://www.csail.mit.edu/news/artificial-intelligence-understands-object-relationships" target="_blank">MIT CSAIL News</a> <br>
<a href="https://composevisualrelations.github.io/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/abs/2111.09297" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/nanlliu/compose-visual-relations" target="_blank"> <small>[Code]</small></a><br>
</div>
<div class="col-md-12">
<br>
<p>The visual world around us can be described as a structured set of objects and their associated relations. In this work, we propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes that have multiple sets of relations more faithfully. We further show that decomposition enables our model to effectively understand the underlying relational scene structure.</p>
</div>
</div><hr>
<!-- <div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/21-iccv-pvd/spotlight_pvd.gif" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">3D Shape Generation and Completion through Point-Voxel Diffusion</font></b><br>
<a href="https://yilundu.github.io/">Alex Linqi Zhou</a>,
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>, and
<a href="https://jiajunwu.com/">Jiajun Wu</a>
<br>
<b><a href="https://iccv2021.thecvf.com/home" target="_blank">ICCV 2021</a>, <font color="firebrick">Oral</font> </b> <br>
<a href="https://alexzhou907.github.io/pvd" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/abs/2104.03670" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/alexzhou907/PVD" target="_blank"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> We present a method to generate 3D shapes through diffusion. Our diffusion objective enables the synthesis of
high fidelity 3D shapes. We further illustrate the ability of our approach to generate multiple different completions of a shape from a single partial observation,
and finally illustrate that it may generalize to real depth images.
</div>
</div><hr>
-->
<div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/21-iclr-ebm-improve/fig1.png" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">Improved Contrastive Divergence Training of Energy Based Models</font></b><br>
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>,
<a href="https://people.csail.mit.edu/lishuang/">Shuang Li</a>,
<a href="https://scholar.google.com/citations?user=rRJ9wTJMUB8C&hl=en" target="_blank">Joshua B. Tenenbaum</a>, and
<a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en" target="_blank">Igor Mordatch</a>
<br>
<b><a href="https://icml.cc/Conferences/2021" target="_blank">ICML 2021</a> </b> <br>
<b><a href="https://sites.google.com/view/ebm-workshop-iclr2021" target="_blank">ICLR EBM Workshop 2021</a>, <font color="firebrick">Oral</font> </b> <br>
<a href="https://energy-based-model.github.io/improved-contrastive-divergence/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/pdf/2012.01316.pdf" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/yilundu/improved_contrastive_divergence" target="_blank"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> We present tools to improve the underlying contrastive divergence objective for training EBMs. First, we highlight a neglected term in contrastive divergence
training of EBMs, and present a loss function to mitigate its effect. We further propose to utilize data augmentation to aid the mixing of MCMC chains when training EBMs,
and propose a multiscale architecture to further improve the underlying generative performance. We illustrate how our techniques improve the generative
performance of EBMs, and further show improved out-of-distribution detection. </p>
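<p><small>For reference, a bare-bones contrastive divergence loss looks roughly like the sketch below (illustrative only); the additional gradient term, data augmentation during sampling, and multiscale architecture contributed by the paper are omitted.</small></p>
<pre><code># Minimal contrastive divergence sketch: push data energies down and the
# energies of MCMC ("negative") samples up. The paper's correction term,
# data augmentation, and multiscale model are omitted here.
def cd_loss(energy, x_data, x_neg):
    return energy(x_data).mean() - energy(x_neg).mean()
</code></pre>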
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/21-iclr-continual-learning/fig2.png" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">Energy-Based Models for Continual Learning</font></b><br>
<a href="https://people.csail.mit.edu/lishuang/">Shuang Li</a>,
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>,
<a href="https://scholar.google.com/citations?user=3k0l15MAAAAJ&hl=en" target="_blank">Gido M. van de Ven</a>, and
<a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en" target="_blank">Igor Mordatch</a>
<br>
<b><a href="https://lifelong-ml.cc/" target="_blank">CoLLAs 2022</a>, <font color="firebrick">Oral</font> </b> <br>
<b><a href="https://sites.google.com/view/ebm-workshop-iclr2021" target="_blank">ICLR EBM Workshop 2021</a>, <font color="firebrick">Oral</font> </b> <br>
<a href="https://energy-based-model.github.io/Energy-Based-Models-for-Continual-Learning/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/pdf/2011.12216.pdf" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/ShuangLI59/ebm-continual-learning" target="_blank"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p>We motivate Energy-Based Models (EBMs) as a promising model class for continual learning problems. Instead of tackling continual learning via the use of external memory, growing models, or regularization, EBMs change the underlying training objective to cause less interference with previously learned information. Our proposed version of EBMs for continual learning is simple, efficient, and outperforms baseline methods by a large margin on several benchmarks. Moreover, our proposed contrastive divergence based training objective can be applied to other continual learning methods, resulting in substantial boosts in their performance. </p>
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/20-neurips-ebm-compositional/comp-face.gif" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">Compositional Visual Generation with Energy Based Models</font></b><br>
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>,
<a href="https://people.csail.mit.edu/lishuang/">Shuang Li</a>, and
<a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en" target="_blank">Igor Mordatch</a>
<br>
<b><a href="https://neurips.cc/" target="_blank">NeurIPS 2020</a>, <font color="firebrick">Spotlight</font> </b> <br>
<a href="https://energy-based-model.github.io/compositional-generation-inference/" target="_blank"> <small>[Project]</small></a>
<a href="https://arxiv.org/pdf/2004.06030.pdf" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/yilundu/ebm_compositionality" target="_blank"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> A vital aspect of human intelligence is the ability to compose increasingly complex concepts out of simpler ideas, enabling both rapid learning and adaptation of knowledge. In this paper, we show that energy-based models can exhibit this ability by directly combining probability distributions. Samples from the combined distribution correspond to compositions of concepts. For example, given one distribution for smiling face images, and another for male faces, we can combine them to generate smiling male faces. This allows us to generate natural images that simultaneously satisfy conjunctions, disjunctions, and negations of concepts. We evaluate the compositional generation abilities of our model on the CelebA dataset of natural faces and synthetic 3D scene images. We showcase the breadth of unique capabilities of our model, such as the ability to continually learn and incorporate new concepts, or infer compositions of concept properties underlying an image. </p>
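<p><small>The combinations described above have simple energy-space forms; a rough sketch (illustrative signatures, not the released code): conjunction adds energies (densities multiply), negation subtracts a scaled energy, and samples are drawn with Langevin dynamics.</small></p>
<pre><code># Illustrative sketch of composing concepts in energy space.
import torch

def e_and(e1, e2):             # conjunction: densities multiply, energies add
    return lambda x: e1(x) + e2(x)

def e_not(e1, e2, alpha=1.0):  # concept 1 AND NOT concept 2
    return lambda x: e1(x) - alpha * e2(x)

def langevin_sample(energy, x, steps=60, step_size=10.0, noise=0.005):
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad, = torch.autograd.grad(energy(x).sum(), x)
        x = x - step_size * grad + noise * torch.randn_like(x)
    return x.detach()
</code></pre>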
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/20-iclr-protein/protein.png" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">Energy Based Models for Atomic Level Protein Conformations</font></b><br>
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>,
<a href="https://scholar.google.com/citations?user=2M0OltAAAAAJ&hl=en&oi=ao">Joshua Meier</a>,
<a href="https://scholar.google.com/citations?user=qukcWBAAAAAJ&hl=en&oi=ao">Jerry Ma</a>,
<a href="https://cs.nyu.edu/~fergus/pmwiki/pmwiki.php">Rob Fergus</a>, and
<a href="https://scholar.google.com/citations?user=vqb78-gAAAAJ&hl=en&oi=ao">Alexander Rives</a>
<br>
<b><a href="https://iclr.cc/Conferences/2020" target="_blank">ICLR 2020</a>, <font color="firebrick">Spotlight</font> </b> <br>
<a href="https://arxiv.org/pdf/2004.13167.pdf" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/facebookresearch/protein-ebm" target="_blank"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> We propose an energy-based model (EBM) of protein conformations that operates at atomic scale. The model is trained solely on crystallized protein data. By contrast, existing approaches for scoring conformations use energy functions that incorporate knowledge of physical principles and features that are the complex product of several decades of research and tuning. To evaluate the model, we benchmark on the rotamer recovery task, the problem of predicting the conformation of a side chain from its context within a protein structure, which has been used to evaluate energy functions for protein design. The model achieves performance close to that of the Rosetta energy function, a state-of-the-art method widely used in protein structure prediction and design. An investigation of the model’s outputs and hidden representations finds that it captures physicochemical properties relevant to protein energy. </p>
</div>
</div><hr>
<div class="row">
<div class="col-md-3">
<img class="img-fluid img-rounded" src="files/paper/corl-2019-planning/ebm_plan.png" style="border:1px solid black" alt="">
</div>
<div class="col-md-9">
<b><font color="black">Model Based Planning with Energy Based Models</font></b><br>
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>,
<a href="https://toruowo.github.io/">Toru Lin</a>, and
<a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en" target="_blank">Igor Mordatch</a>
<br>
<b><a href="https://sites.google.com/robot-learning.org/corl2019" target="_blank">CORL 2019</a></b> <br>
<b><a href="https://sites.google.com/view/mbrl-icml2019/home" target="_blank">ICML MBRL Workshop 2019</a>, <font color="firebrick">Oral</font> </b> <br>
<a href="https://arxiv.org/pdf/1909.06878.pdf" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/yilundu/model_based_planning_ebm" target="_blank"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> Model-based planning holds great promise for improving both sample efficiency and generalization in reinforcement learning (RL). We show that energy-based models (EBMs) are a promising class of models to use for model-based planning. EBMs naturally support inference of intermediate states given start and goal state distributions. We provide an online algorithm to train EBMs while interacting with the environment, and show that EBMs allow for significantly better online learning than corresponding feed-forward networks. We further show that EBMs support maximum entropy state inference and are able to generate diverse state space plans. We show that inference purely in state space -- without planning actions -- allows for better generalization to previously unseen obstacles in the environment and prevents the planner from exploiting the dynamics model by applying uncharacteristic action sequences. </p>
</div>
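<div class="col-md-12">
<p><small>A rough sketch of state-space planning with a learned transition energy (hypothetical API, not the released code): intermediate states are optimized directly, with no actions involved.</small></p>
<pre><code># Illustrative sketch: optimize intermediate states so that consecutive
# pairs have low energy under a learned transition EBM.
import torch

def plan(energy, x_start, x_goal, horizon=10, iters=100, lr=0.05):
    traj = torch.randn(horizon - 1, *x_start.shape, requires_grad=True)
    opt = torch.optim.Adam([traj], lr=lr)
    for _ in range(iters):
        states = torch.cat([x_start.unsqueeze(0), traj, x_goal.unsqueeze(0)])
        loss = sum(energy(states[i], states[i + 1]) for i in range(horizon))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return traj.detach()
</code></pre>
</div>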
</div><hr>
<div class="row">
<div class="col-md-3">
<video width="100%" playsinline="" autoplay="" loop="" preload="" muted="">
<source src="files/paper/nips-2019-ebm/half.mp4" type="video/mp4">
</video>
</div>
<div class="col-md-9">
<b><font color="black">Implicit Generation and Generalization with Energy Based Models</font></b><br>
<a href="https://yilundu.github.io/" target="_blank">Yilun Du</a> and
<a href="https://scholar.google.com/citations?user=Vzr1RukAAAAJ&hl=en" target="_blank">Igor Mordatch</a>
<br>
<b><a href="https://neurips.cc/" target="_blank">NeurIPS 2019</a>, <font color="firebrick">Spotlight</font> </b> <br>
<a href="https://openai.com/blog/energy-based-models/" target="_blank"> <small>[OpenAI Blog]</small></a>
<a href="https://arxiv.org/pdf/1903.08689.pdf" target="_blank"> <small>[Paper]</small></a>
<a href="https://github.com/openai/ebm_code_release" target="_blank"> <small>[Code]</small></a>
</div>
<div class="col-md-12">
<br>
<p> Energy Based Models (EBMs) are an appealing class of models due to their generality and simplicity in likelihood modeling. However, EBMs have traditionally been hard to train. We present techniques to scale MCMC-based training of EBMs with continuous neural networks to high-dimensional data domains such as ImageNet 128x128 and robotic hand trajectories. We highlight some unique capabilities of implicit generation. Finally, we illustrate how EBMs are a useful class of models across a wide variety of tasks, achieving out-of-distribution generalization, adversarially robust classification, online continual learning, and compositionality. </p>
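<p><small>A condensed sketch of the MCMC training loop (the replay buffer API here is hypothetical; the released code linked above has the full version): negative samples are drawn with Langevin dynamics, mostly initialized from a buffer of past samples to amortize mixing.</small></p>
<pre><code># Illustrative sketch of scaling MCMC-based EBM training with a replay
# buffer of past samples (buffer.sample / buffer.add are hypothetical).
import torch

def sample_negatives(energy, buffer, batch_shape, steps=60, step_size=10.0):
    x = buffer.sample() if len(buffer) else torch.randn(batch_shape)
    n_fresh = max(1, x.shape[0] // 20)  # restart a few chains from noise
    x[:n_fresh] = torch.randn_like(x[:n_fresh])
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad, = torch.autograd.grad(energy(x).sum(), x)
        x = x - step_size * grad + 0.005 * torch.randn_like(x)
    buffer.add(x.detach())
    return x.detach()
</code></pre>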
</div>
</div><hr>
</div>
<br><br>
<!-- Service -->
<div class="container">
<h3 id="Contact" style="padding-top: 80px; margin-top: -80px;">Contact</h3>
<ul>
<li> <a href="https://yilundu.github.io/" target="_blank">Yilun Du</a>: [email protected]</li>
<li> <a href="https://people.csail.mit.edu/lishuang/">Shuang Li</a>: [email protected] </li>
</ul>
</div><br><br>
<div class="container">
<hr>
<center>
<footer>
<p>© Massachusetts Institute of Technology 2022</p>
</footer>
</center>
</div>
<!-- /container -->
<!-- Bootstrap core JavaScript -->
<!-- Placed at the end of the document so the pages load faster -->
<script>showPubs(1);</script>
<script>var scroll = new SmoothScroll('a[href*="#"]', {speed: 1000});</script>
<script src="https://code.jquery.com/jquery-3.2.1.slim.min.js" integrity="sha384-KJ3o2DKtIkvYIK3UENzmM7KCkRr/rE9/Qpg6aAZGJwFDMVNA/GpGFF93hXpG5KkN" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.11.0/umd/popper.min.js" integrity="sha384-b/U6ypiBEHpOf/4+1nzFpr53nxSS+GLCkfwBdFNTxtclqqenISfwAzpKaMNFNmj4" crossorigin="anonymous"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0-beta/js/bootstrap.min.js" integrity="sha384-h0AbiXch4ZDo7tp9hKZ4TsHbi047NrKGLO3SEJAg45jXxnGIfYzk4Si90RDIqNm1" crossorigin="anonymous"></script>
</body>
</html>