Develop (#54)
* Bumped version

* Moved evaluation folder

* Moved test folder

* Fixed zero division error

* Added test for zero division error

* Updated paths

* Docformatters & isort

* Fixed misspelled name

* Added support for models from the hub

* Updated readme

* Moved to smaller model for faster downloading

* Updated examples in readme

* Updated to make use of the new Vectors class

* Added FT Vectors

* Fixed inheritance structure

* Updated benchmark

* Updated readme

* blacked

* Fixed readme

* Fixed dataclasses backwards compatibility

* Updated benchmarks

* Added release & media objects

* Added coverage badge

* Removed manual coverage badge

* Updated readme

* Centered

* Updated logo

* Trigger build
Oliver Borchers committed Dec 3, 2021
1 parent 9c34c4e commit 12bf5d1
Showing 33 changed files with 1,673 additions and 795 deletions.
178 changes: 135 additions & 43 deletions README.md
@@ -6,15 +6,29 @@
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
<a href="https://img.shields.io/github/license/oborchers/Fast_Sentence_Embeddings.svg?style=flat"><img alt="License: GPL3" src="https://img.shields.io/github/license/oborchers/Fast_Sentence_Embeddings.svg?style=flat"></a>
</p>
<p align="center">
<a><img alt="fse" src="https://raw.githubusercontent.com/oborchers/Fast_Sentence_Embeddings/develop/media/fse.png"></a>
</p>

Fast Sentence Embeddings (fse)
Fast Sentence Embeddings
==================================

Fast Sentence Embeddings is a Python library that serves as an addition to Gensim. This library is intended to compute *sentence vectors* for large collections of sentences or documents.
Fast Sentence Embeddings is a Python library that serves as an addition to Gensim. This library is intended to compute *sentence vectors* for large collections of sentences or documents with as little hassle as possible:

**Disclaimer**: I am working full time. Unfortunately, I have yet to find time to add all the features I'd like to see. Especially the API needs some overhaul and we need support for gensim 4.0.0.
I am looking for active contributors to keep this package alive. Please feel free to ping me at <o.borchers@oxolo.com> if you are interested.

```
from fse import Vectors, Average, IndexedList
vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")
model = Average(vecs)
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model.train(IndexedList(sentences))
model.sv.similarity(0,1)
```

If you want to support fse, take a quick [survey](https://forms.gle/8uSU323fWUVtVwcAA) to improve it.

Audience
------------
@@ -24,9 +38,7 @@ This package builds upon Gensim and is intended to compute sentence/paragraph
- Your dataset is too large for existing solutions (spacy)
- Using GPUs is not an option.

The average (online) inference time for a well optimized (and batched) sentence-transformer is around 1ms-10ms per sentence.
If that is not enough and you are willing to sacrifice a bit in terms of quality, this is your package.

The average (online) inference time for a well optimized (and batched) sentence-transformer is around 1ms-10ms per sentence. If that is not enough and you are willing to sacrifice a bit in terms of quality, this is your package.

Features
------------
@@ -43,6 +55,8 @@ Key features of **fse** are:

**[X]** Up to 500.000 sentences / second (1)

**[X]** Provides HUB access to various pre-trained models for convenience

**[X]** Supports Average, SIF, and uSIF Embeddings

**[X]** Full support for Gensim's Word2Vec and all other compatible classes
@@ -95,6 +109,62 @@ If building the Cython extension fails (you will be notified), try:
Usage
-------------

Using pre-trained models with **fse** is easy. You can load them directly from the hub, which downloads them for you.
They are stored locally so you can re-use them later.

```
from fse import Vectors, Average, IndexedList
vecs = Vectors.from_pretrained("glove-wiki-gigaword-50")
model = Average(vecs)
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model.train(IndexedList(sentences))
model.sv.similarity(0,1)
```
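
To make the example above less of a black box, here is a minimal pure-Python sketch of what an averaging model conceptually computes: the mean of the word vectors in each sentence, followed by a similarity between the two results. The three-dimensional vectors and the helper names `average_embedding` and `cosine_similarity` are made up for illustration and are not part of fse, and it is assumed here that the similarity is a cosine similarity. fse itself uses optimized routines rather than Python loops.

```python
import math

# Hypothetical 3-dimensional word vectors; real models use 25 to 300 dimensions.
word_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "say": [0.0, 0.0, 1.0],
    "meow": [1.0, 0.0, 1.0],
    "woof": [0.0, 1.0, 1.0],
}


def average_embedding(tokens, vectors):
    """Return the element-wise mean of the vectors of all known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("sentence contains no known tokens")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


s0 = average_embedding(["cat", "say", "meow"], word_vectors)
s1 = average_embedding(["dog", "say", "woof"], word_vectors)
similarity = cosine_similarity(s0, s1)
```

With these toy vectors the two sentences share only the `say` direction, so the similarity is moderate rather than close to 1.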

If your vectors are large and you don't have a lot of RAM, you can supply the `mmap` argument as follows to read the vectors from disk instead of loading them into RAM:

```
Vectors.from_pretrained("glove-wiki-gigaword-50", mmap="r")
```
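
For readers unfamiliar with memory mapping, the following sketch illustrates the general idea using only Python's standard `mmap` and `struct` modules. It is not how fse or Gensim implement the `mmap` argument; the file layout, `DIM`, `N_VECTORS`, and `read_row` are assumptions chosen for the example. The point is that a single row can be decoded from disk without reading the whole matrix into memory.

```python
import mmap
import os
import struct
import tempfile

DIM = 4              # hypothetical vector dimensionality
N_VECTORS = 3        # hypothetical vocabulary size
ROW_BYTES = DIM * 4  # one float32 value takes 4 bytes

# Write a small matrix of float32 values to disk, row by row.
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as fh:
    for row in range(N_VECTORS):
        values = [float(row * DIM + col) for col in range(DIM)]
        fh.write(struct.pack(f"<{DIM}f", *values))


def read_row(mapped, index):
    """Decode a single row directly from the mapped file."""
    offset = index * ROW_BYTES
    return list(struct.unpack_from(f"<{DIM}f", mapped, offset))


with open(path, "rb") as fh:
    # ACCESS_READ maps the file read-only; the OS loads pages lazily on access.
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        last_row = read_row(mapped, 2)
```

The trade-off is the usual one: lower resident memory in exchange for disk reads whenever a vector that is not yet cached is touched.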

To see which vectors are available on the hub, visit https://huggingface.co/fse. For example, you will find:
- glove-twitter-25
- glove-twitter-50
- glove-twitter-100
- glove-twitter-200
- glove-wiki-gigaword-100
- glove-wiki-gigaword-300
- word2vec-google-news-300
- paragram-25
- paranmt-300
- paragram-300-sl999
- paragram-300-ws353
- fasttext-wiki-news-subwords-300
- fasttext-crawl-subwords-300 (Use with `FTVectors`)

In order to use **fse** with a custom model you must first estimate a Gensim model which contains a
`gensim.models.keyedvectors.BaseKeyedVectors` class, for example *Word2Vec* or *Fasttext*. Then you can proceed to compute sentence embeddings for a corpus as follows:

```
from gensim.models import FastText
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
ft = FastText(sentences, min_count=1, size=10)
from fse import Average, IndexedList
model = Average(ft)
model.train(IndexedList(sentences))
model.sv.similarity(0,1)
```

fse offers multi-thread support out of the box. However, for most applications a *single thread will most likely be sufficient*.

Additional Information
-------------

Within the folder notebooks you can find the following guides:

**Tutorial.ipynb** offers a detailed walk-through of some of the most important functions fse has to offer.
@@ -118,51 +188,69 @@ The models presented are based on
Credits to Radim Řehůřek and all contributors for the **awesome** library
and code that [Gensim](https://github.com/RaRe-Technologies/gensim) provides. A whole lot of the code found in this lib is based on Gensim.

In order to use **fse** you must first estimate a Gensim model which contains a
gensim.models.keyedvectors.BaseKeyedVectors class, for example
*Word2Vec* or *Fasttext*. Then you can proceed to compute sentence embeddings
for a corpus.

from gensim.models import FastText
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
ft = FastText(sentences, min_count=1, size=10)

from fse.models import Average
from fse import IndexedList
model = Average(ft)
model.train(IndexedList(sentences))

model.sv.similarity(0,1)

fse offers multi-thread support out of the box. However, for most
applications a *single thread will most likely be sufficient*.

To install **fse** on Colab, check out: https://colab.research.google.com/drive/1qq9GBgEosG7YSRn7r6e02T9snJb04OEi

Results
------------

Model | [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results)
:---: | :---:
`CBOW-Paranmt` | **79.85**
`uSIF-Paranmt` | 79.02
`SIF-Paranmt` | 76.75
`SIF-Paragram` | 73.86
`uSIF-Paragram` | 73.64
`SIF-FT` | 73.38
`SIF-Glove` | 71.95
`SIF-W2V` | 71.12
`uSIF-FT` | 69.4
`uSIF-Glove` | 67.16
`uSIF-W2V` | 66.99
`CBOW-W2V` | 61.54
`CBOW-Paragram` | 50.38
`CBOW-FT` | 48.49
`CBOW-Glove` | 40.41
Model | Vectors | params | [STS Benchmark](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results)
:---: | :---: | :---: | :---:
`CBOW` | `paranmt-300` | | 79.82
`uSIF` | `paranmt-300` | length=11 | 79.02
`SIF-10` | `paranmt-300` | components=10 | 76.76
`SIF-10` | `paragram-300-sl999` | components=10 | 74.27
`SIF-10` | `paragram-300-ws353` | components=10 | 74.08
`SIF-10` | `fasttext-crawl-subwords-300` | components=10 | 73.54
`uSIF` | `paragram-300-sl999` | length=11 | 73.09
`SIF-10` | `fasttext-wiki-news-subwords-300` | components=10 | 72.24
`uSIF` | `paragram-300-ws353` | length=11 | 71.90
`SIF-10` | `glove-twitter-200` | components=10 | 71.67
`SIF-10` | `glove-wiki-gigaword-300` | components=10 | 71.43
`SIF-10` | `word2vec-google-news-300` | components=10 | 71.17
`SIF-10` | `glove-wiki-gigaword-200` | components=10 | 70.73
`SIF-10` | `glove-twitter-100` | components=10 | 69.70
`uSIF` | `fasttext-crawl-subwords-300` | length=11 | 69.55
`uSIF` | `fasttext-wiki-news-subwords-300` | length=11 | 69.05
`SIF-10` | `glove-wiki-gigaword-100` | components=10 | 68.43
`uSIF` | `glove-wiki-gigaword-300` | length=11 | 67.73
`uSIF` | `glove-wiki-gigaword-200` | length=11 | 67.26
`uSIF` | `word2vec-google-news-300` | length=11 | 67.15
`uSIF` | `glove-twitter-200` | length=11 | 66.73
`SIF-10` | `glove-twitter-50` | components=10 | 65.57
`uSIF` | `glove-wiki-gigaword-100` | length=11 | 65.48
`uSIF` | `paragram-25` | length=11 | 64.31
`uSIF` | `glove-twitter-100` | length=11 | 64.22
`SIF-10` | `glove-wiki-gigaword-50` | components=10 | 64.20
`uSIF` | `glove-wiki-gigaword-50` | length=11 | 62.22
`CBOW` | `word2vec-google-news-300` | | 61.54
`uSIF` | `glove-twitter-50` | length=11 | 60.50
`SIF-10` | `paragram-25` | components=10 | 59.22
`uSIF` | `glove-twitter-25` | length=11 | 55.17
`CBOW` | `paragram-300-ws353` | | 54.72
`SIF-10` | `glove-twitter-25` | components=10 | 54.42
`CBOW` | `paragram-300-sl999` | | 51.46
`CBOW` | `fasttext-crawl-subwords-300` | | 48.49
`CBOW` | `glove-wiki-gigaword-300` | | 44.46
`CBOW` | `glove-wiki-gigaword-200` | | 42.40
`CBOW` | `paragram-25` | | 40.13
`CBOW` | `glove-wiki-gigaword-100` | | 38.12
`CBOW` | `glove-wiki-gigaword-50` | | 37.47
`CBOW` | `glove-twitter-200` | | 34.94
`CBOW` | `glove-twitter-100` | | 33.81
`CBOW` | `glove-twitter-50` | | 30.78
`CBOW` | `glove-twitter-25` | | 26.15
`CBOW` | `fasttext-wiki-news-subwords-300` | | 26.08
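
The SIF rows in the table weight each word by its smooth inverse frequency before averaging, and the `components` parameter presumably gives the number of principal components removed afterwards. The weighting step alone can be sketched as follows. The probabilities, the default `alpha=1e-3`, and the helper name `sif_weight` are assumptions for illustration, not values taken from fse or from the benchmark runs.

```python
# Hypothetical unigram probabilities p(w); in practice these come from corpus counts.
word_probability = {"the": 0.05, "cat": 0.0005, "meow": 0.00001}


def sif_weight(word, probabilities, alpha=1e-3):
    """Return the smooth inverse frequency weight alpha / (alpha + p(w))."""
    return alpha / (alpha + probabilities[word])


weights = {word: sif_weight(word, word_probability) for word in word_probability}
```

Frequent words such as `the` receive a weight close to zero, while rare words keep a weight close to one, which is why weighted averages tend to outperform plain averages for most of the vector sets listed.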

Changelog
-------------

0.2.0:
- Added `Vectors` and `FTVectors` class and hub support by `from_pretrained`
- Extended benchmark
- Fixed zero division bug for uSIF
- Moved tests out of the main folder
- Moved sts out of the main folder

0.1.17:
- Fixed dependency issue where you cannot install fse properly
- Updated readme
@@ -197,6 +285,10 @@ Proceedings of the 3rd Workshop on Representation Learning for NLP. (Toulon, Fra
Copyright
-------------

**Disclaimer**: I am working full time. Unfortunately, I have yet to find time to add all the features I'd like to see. Especially the API needs some overhaul and we need support for gensim 4.0.0.

I am looking for active contributors to keep this package alive. Please feel free to ping me at <o.borchers@oxolo.com> if you are interested.

Author: Oliver Borchers

Copyright (C) 2021 Oliver Borchers
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
6 changes: 6 additions & 0 deletions fse/__init__.py
@@ -1,6 +1,9 @@
import logging

from fse import models
from fse.models import Average, SIF, uSIF, SentenceVectors

from fse.vectors import Vectors, FTVectors

from .inputs import (
BaseIndexedList,
@@ -22,3 +25,6 @@ def emit(self, record):
logger = logging.getLogger("fse")
if len(logger.handlers) == 0: # To ensure reload() doesn't add another one
logger.addHandler(NullHandler())


__version__ = "0.2.0"
2 changes: 1 addition & 1 deletion fse/inputs.py
@@ -2,7 +2,7 @@
# -*- coding: utf-8 -*-

# Author: Oliver Borchers
# Copyright (C) Oliver Borchers Oliver Borchers
# Copyright (C) Oliver Borchers

from pathlib import Path
from typing import List, MutableSequence, Union
2 changes: 1 addition & 1 deletion fse/models/average.py
@@ -2,7 +2,7 @@
# -*- coding: utf-8 -*-

# Author: Oliver Borchers
# Copyright (C) Oliver Borchers Oliver Borchers
# Copyright (C) Oliver Borchers

"""This module implements the base class to compute average representations for sentences, using highly optimized C routines,
data streaming and Pythonic interfaces.
2 changes: 1 addition & 1 deletion fse/models/average_inner.pxd
@@ -5,7 +5,7 @@
# coding: utf-8

# Author: Oliver Borchers
# Copyright (C) Oliver Borchers Oliver Borchers
# Copyright (C) Oliver Borchers

cimport numpy as np

2 changes: 1 addition & 1 deletion fse/models/average_inner.pyx
@@ -6,7 +6,7 @@
# coding: utf-8

# Author: Oliver Borchers
# Copyright (C) Oliver Borchers Oliver Borchers
# Copyright (C) Oliver Borchers

"""Optimized cython functions for computing sentence embeddings"""

32 changes: 1 addition & 31 deletions fse/models/base_s2v.py
@@ -2,7 +2,7 @@
# -*- coding: utf-8 -*-

# Author: Oliver Borchers
# Copyright (C) Oliver Borchers Oliver Borchers
# Copyright (C) Oliver Borchers
# Licensed under GNU General Public License v3.0

"""Base class containing common methods for training, using & evaluating sentence embeddings.
@@ -123,36 +123,6 @@ def __init__(
Key word arguments needed to allow children classes to accept more arguments.
"""

"""
TODO:
[ ] global:
[ ] windows support
[ ] documentation
[ ] more benchmarks
[ ] remove wv_mapfile_path?
[ ] modifiable sv_mapfile_path?
[ ] models:
[ ] check feasibility first
[ ] max-pooling -> easy
[ ] hierarchical pooling -> easy
[ ] discrete cosine transform -> somewhat easy, questionable
[ ] valve -> unclear, not cited enough
[ ] power-means embedding -> very large dimensionalty
[ ] z-score transformation is quite nice
[ ] sentencevectors:
[X] similar_by_sentence model type check
[ ] approximate NN search for large files
[ ] compare ANN libraries
[ ] ease-of-use
[ ] dependencies
[ ] compatibility
[ ] memory-usage
"""

set_madvise_for_mmap()

self.workers = int(workers)
2 changes: 1 addition & 1 deletion fse/models/sentencevectors.py
@@ -2,7 +2,7 @@
# -*- coding: utf-8 -*-

# Author: Oliver Borchers
# Copyright (C) Oliver Borchers Oliver Borchers
# Copyright (C) Oliver Borchers


from __future__ import division
2 changes: 1 addition & 1 deletion fse/models/sif.py
@@ -2,7 +2,7 @@
# -*- coding: utf-8 -*-

# Author: Oliver Borchers
# Copyright (C) Oliver Borchers Oliver Borchers
# Copyright (C) Oliver Borchers

from fse.models.average import Average
from fse.models.utils import compute_principal_components, remove_principal_components