Starred repositories
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
[ECCV 2024] Video Foundation Models & Data for Multimodal Understanding
The model, data and code for the visual GUI Agent SeeClick
MINT-1T: A one trillion token multimodal interleaved dataset.
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.
"Open-Source LLM Usage Guide" (《开源大模型食用指南》): a tutorial for quickly deploying open-source large language models in a Linux environment, tailored for users in China.
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
Large-scale text-video dataset. 10 million captioned short videos.
Official implementation of "Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling"
🔥🔥🔥 Latest Papers, Code, and Datasets on Vid-LLMs.
Papers on event cameras (DVS, Spike) published at top international conferences
🔥🔥 MLVU: Multi-task Long Video Understanding Benchmark
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment, and Generate Anything
GLM-4 series: Open Multilingual Multimodal Chat LMs
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
NEW - YOLOv8 🚀 in PyTorch > ONNX > OpenVINO > CoreML > TFLite
This repository contains code for object detection and tracking in videos using the YOLOv10 object detection model and the DeepSORT algorithm.
YOLOv10: Real-Time End-to-End Object Detection [NeurIPS 2024]
MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
AlwaysReddy is an LLM voice assistant that is always just a hotkey away.
The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model