Stars
Intelligent proxy pool for Humans™ to extract content from the internet and build your own Large Language Models in this new AI era
SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC ac…
The easiest way to serve AI apps and models - Build reliable Inference APIs, LLM apps, Multi-model chains, RAG service, and much more!
A series on learning Apache Spark (PySpark), with quick tips and workarounds for everyday problems
A workspace to experiment with Apache Spark, Livy, and Airflow in a Docker environment.
pyspark🍒🥭 is delicious, just eat it!😋😋
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
fatwang2 / search4all
Forked from leptonai/search_with_lepton
Personal AI search copilot, open-source Perplexity
🔍 AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your da…
Search for words, documents, images, videos, news, maps and text translation using the DuckDuckGo.com search engine. Supports downloading files and images to a local hard drive.
SearXNG is a free internet metasearch engine which aggregates results from various search services and databases. Users are neither tracked nor profiled.
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
The AI-native database built for LLM applications, providing incredibly fast hybrid search of dense vector, sparse vector, tensor (multi-vector), and full-text
Source-agnostic distributed change data capture system
Winning system (DAMO-NLP) of the SemEval 2022 MultiCoNER shared task, in 10 out of 13 tracks.
A library that provides an embeddable, persistent key-value store for fast storage.
Distributed reliable key-value store for the most critical data of a distributed system
A Python-based HTML-to-text conversion library, command-line client, and web service.
Knowledge extraction from semi-structured web.
Unsupervised text tokenizer for Neural Network-based text generation.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Schema-Driven Information Extraction from Heterogeneous Tables
Simplified DOM Trees for Transferable Attribute Extraction from the Web
Finetune LayoutLM on SROIE dataset using W&B tools
xrr233 / NLP_ability
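Forked from DA-southampton/NLP_ability
A summary of the knowledge a natural language processing (NLP) engineer needs to build up, including interview questions, fundamentals, and engineering skills, to strengthen core competitiveness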
SIGIR-2022 Webformer: Pre-training with Web Pages for Information Retrieval
[EMNLP 2021] The baseline code for WebSRC dataset.