Stars
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
High-resolution models for human tasks.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Official Implementation for "SCAPE: A Simple and Strong Category-Agnostic Pose Estimator", ECCV 2024.
Hiera: A fast, powerful, and simple hierarchical vision transformer.
[ECCV 2024] Beyond MOT: Semantic Multi-Object Tracking
This is the code of our paper "Video-Based Human Pose Regression via Decoupled Space-Time Aggregation".
A Visual Studio Code extension with support for the Ruff linter.
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
MMPD Dataset from ECCV'2024 "When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset"
This repository contains implementations and illustrative code to accompany DeepMind publications
[ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World
[CVPR 2024] Meta-Point Learning and Refining for Category-Agnostic Pose Estimation
Code for "LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model", CVPR 2024 Highlight
A novel few-shot keypoint detector with uncertainty learning for unseen species (CVPR2022).
[ECCV2024] Official implementation of the paper "DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs".
The project is an official implementation of our paper "PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation".
[CVPR 2024] MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
[ICLR 2024] MogaNet: Efficient Multi-order Gated Aggregation Network
Transparent Image Layer Diffusion using Latent Transparency
Unofficial implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (https://arxiv.org/abs/2103.14030)
This is the official code release for our work, Denoising Vision Transformers.