Updating API to v1.0.4
jbesomi committed Apr 27, 2020
1 parent 8cfa542 commit 8a51a56
Showing 3 changed files with 160 additions and 34 deletions.
130 changes: 103 additions & 27 deletions website/docs/api-preprocessing.md
Expand Up @@ -5,7 +5,63 @@ title: Preprocessing

# Preprocessing

Utility functions to clean text-columns of a dataframe.
Preprocess text-based Pandas DataFrame.


### texthero.preprocessing.clean(s, pipeline=None)
Clean a pandas Series by applying a preprocessing pipeline.

For information on a specific function, type `help(texthero.preprocessing.func_name)`.
The default preprocessing pipeline is the following:

>
> * fillna

> * lowercase

> * remove_digits

> * remove_punctuation

> * remove_diacritics

> * remove_stop_words

> * remove_whitespace

* **Return type**

`Series`
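
The pipeline idea can be illustrated without texthero or pandas. What follows is a minimal sketch, assuming plain Python lists of strings in place of a `Series`. The names `clean_sketch`, `fillna_sketch`, `lowercase_sketch`, and `remove_whitespace_sketch` are hypothetical stand-ins covering only a subset of the default steps; they are not texthero's implementation.

```python
def fillna_sketch(text):
    # Assumption: missing values are represented by None and become "".
    return "" if text is None else text

def lowercase_sketch(text):
    return text.lower()

def remove_whitespace_sketch(text):
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())

def clean_sketch(texts, pipeline=None):
    # Apply each step, in order, to every element.
    if pipeline is None:
        pipeline = [fillna_sketch, lowercase_sketch, remove_whitespace_sketch]
    for step in pipeline:
        texts = [step(t) for t in texts]
    return texts

cleaned = clean_sketch([" Hello   World ", None])
print(cleaned)
```

The order of the steps matters in this sketch: handling missing values first means later steps can assume every element is a string.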



### texthero.preprocessing.do_stemm(input, stem='snowball')
Stem series using either NLTK ‘porter’ or ‘snowball’ stemmers.

Not in the default pipeline.


* **Parameters**


* **input** (`Series`) –


* **stem** – Can be either ‘snowball’ or ‘porter’



* **Return type**

`Series`



### texthero.preprocessing.fillna(input)
Expand All @@ -19,13 +75,18 @@ Replace not assigned values with empty spaces.


### texthero.preprocessing.get_default_pipeline()
Default pipeline:
Return a list containing all the methods used in the default cleaning pipeline.

Return a list with the following functions:


* remove_lowercase
* fillna


* lowercase

* remove_numbers

* remove_digits


* remove_punctuation
Expand All @@ -34,17 +95,20 @@ Default pipeline:
* remove_diacritics


* remove_white_space
* remove_stop_words


* remove_stop_words
* remove_whitespace


* **Return type**

[]

* stemming


### texthero.preprocessing.lowercase(input)
Lowercase all cells.
Lowercase all text.


* **Return type**
Expand All @@ -54,7 +118,7 @@ Lowercase all cells.


### texthero.preprocessing.remove_diacritics(input)
Remove diacritics (as accent marks) from input
Remove all diacritics.


* **Return type**
Expand All @@ -64,7 +128,7 @@ Remove diacritics (as accent marks) from input


### texthero.preprocessing.remove_digits(input, only_blocks=True)
Remove all digits.
Remove all digits from a series and replace them with a single space.


* **Parameters**
Expand All @@ -76,31 +140,43 @@ Remove all digits.
* **only_blocks** (*bool*) – Remove only blocks of digits. For instance, hel1234lo 1234 becomes hel1234lo.


### Examples

* **Returns**


```python
>>> import texthero
>>> import pandas as pd
>>> s = pd.Series(["texthero 1234 He11o"])
>>> texthero.preprocessing.remove_digits(s)
0 texthero He11o
dtype: object
>>> texthero.preprocessing.remove_digits(s, only_blocks=False)
0 texthero He o
dtype: object
```


* **Return type**

pd.Series
`Series`


### Examples

```python
>>> import texthero
>>> s = pd.Series(["remove_digits_s remove all the 1234 digits of a pandas series. H1N1"])
>>> texthero.preprocessing.remove_digits_s(s)
u'remove_digits_s remove all the digits of a pandas series. H1N1'
>>> texthero.preprocessing.remove_digits_s(s, only_blocks=False)
u'remove_digits_s remove all the digits of a pandas series. HN'
```
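
The effect of `only_blocks` can be approximated with two regular expressions. This is a sketch on plain strings, assuming that a "block" means a run of digits delimited by word boundaries; `remove_digits_sketch` is a hypothetical name and may differ from how texthero implements the function.

```python
import re

def remove_digits_sketch(text, only_blocks=True):
    # only_blocks=True: match digit runs that stand alone as a word.
    # only_blocks=False: match every run of digits, even inside a word.
    pattern = r"\b\d+\b" if only_blocks else r"\d+"
    return re.sub(pattern, " ", text)

blocks_only = remove_digits_sketch("texthero 1234 He11o")
all_digits = remove_digits_sketch("texthero 1234 He11o", only_blocks=False)
print(blocks_only.split())
print(all_digits.split())
```

Because each match is replaced by a space rather than deleted, the result can contain consecutive spaces; a whitespace-removal step would normally follow.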
### texthero.preprocessing.remove_punctuation(input)
Remove string.punctuation (!"#$%&'()\*+,-./:;<=>?@[\\]^_\`{|}~).

Replace it with a single space.

### texthero.preprocessing.remove_punctuation(input)
Remove punctuations from input

* **Return type**

`Series`
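
A rough standard-library equivalent on a single string, assuming every character in `string.punctuation` is replaced individually by one space. `remove_punctuation_sketch` is a hypothetical name, and texthero may treat runs of punctuation differently.

```python
import string

def remove_punctuation_sketch(text):
    # Map each punctuation character to a single space.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.translate(table)

result = remove_punctuation_sketch("Hello, world!")
print(result.split())
```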



### texthero.preprocessing.remove_stop_words(input)
Remove all stop words using NLTK stopwords list.

List of stopwords: NLTK ‘english’ stopwords, 179 items.


* **Return type**
Expand All @@ -109,8 +185,8 @@ Remove punctuations from input



### texthero.preprocessing.remove_whitespaces(input)
Remove any type of space between words.
### texthero.preprocessing.remove_whitespace(input)
Remove all white spaces between words.


* **Return type**
Expand Down
34 changes: 33 additions & 1 deletion website/docs/api-representation.md
Expand Up @@ -3,4 +3,36 @@ id: api-representation
title: Representation
---

Text representation
Map words into vectors using different algorithms such as TF-IDF, word2vec or GloVe.


### texthero.representation.do_count(s, max_features=100)
Represent input on a Count vector space.
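
To make "count vector space" concrete, here is a small pure-Python sketch. It assumes whitespace tokenization and that `max_features` keeps the most frequent terms across all documents; `count_vectors_sketch` is a hypothetical helper, not the code behind `do_count`.

```python
from collections import Counter

def count_vectors_sketch(docs, max_features=100):
    tokenized = [doc.split() for doc in docs]
    totals = Counter(tok for toks in tokenized for tok in toks)
    # Keep the most frequent terms, sorted for a stable column order.
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

vocab, vectors = count_vectors_sketch(
    ["the cat sat", "the cat ate the fish"], max_features=2
)
print(vocab, vectors)
```

Each document becomes one row, and each column holds the number of times a vocabulary term occurs in that document.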


### texthero.representation.do_dbscan(s, eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
Perform DBSCAN clustering.


### texthero.representation.do_kmeans(s, n_clusters=5, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=-1, algorithm='auto')
Perform K-means clustering algorithm.


### texthero.representation.do_meanshift(s, bandwidth=None, seeds=None, bin_seeding=False, min_bin_freq=1, cluster_all=True, n_jobs=None, max_iter=300)
Perform mean shift clustering.


### texthero.representation.do_nmf(s, n_components=2)
Perform non-negative matrix factorization.


### texthero.representation.do_pca(s, n_components=2)
Perform PCA.


### texthero.representation.do_tfidf(s, max_features=100)
Represent input on a TF-IDF vector space.
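
For readers unfamiliar with TF-IDF, the sketch below computes the textbook variant: term frequency multiplied by `log(N / df)`. It assumes whitespace tokenization and no smoothing or normalization, so the numbers will generally differ from what `do_tfidf` returns. `tfidf_sketch` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vocab = sorted(df)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        length = len(toks) or 1  # guard against empty documents
        vectors.append([
            (counts[term] / length) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors

vocab, vectors = tfidf_sketch(["the cat", "the dog"])
print(vocab, vectors)
```

A term that appears in every document, such as "the" here, gets a weight of zero, which is the behavior that distinguishes TF-IDF from raw counts.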


### texthero.representation.do_tsne(s, vector_columns, n_components, perplexity, early_exaggeration, learning_rate, n_iter)
Perform TSNE.
30 changes: 24 additions & 6 deletions website/docs/api-visualization.md
Expand Up @@ -5,15 +5,23 @@ title: Visualization

# Visualization

Text visualization
Visualize insights and statistics of a text-based Pandas DataFrame.


### texthero.visualization.scatterplot(df, col, color=None, hover_data=None, title='')
Scatterplot of df[column].
Show a scatterplot using Plotly's scatter plot.

Each value in df[col] must be a tuple of 2D coordinates.

Usage example:
* **Parameters**


* **df**


* **col** – The name of the column of the DataFrame used for x and y axis.


### Examples

```python
>>> import texthero
Expand All @@ -22,8 +30,18 @@ Usage example:
```


### texthero.visualization.top_words(s, normalize=True)
Return most common words of a given series sorted from most used.
### texthero.visualization.top_words(s, normalize=False)
Return most common words.


* **Parameters**


* **s** (`Series`) –


* **normalize** – Default is False. If set to True, returns normalized values.



* **Return type**
Expand Down
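
The difference between `normalize=False` and `normalize=True` in `top_words` can be sketched with `collections.Counter`. This assumes whitespace tokenization and that normalizing means dividing each count by the total number of words; `top_words_sketch` is a hypothetical name operating on a list of strings, not on a `Series`.

```python
from collections import Counter

def top_words_sketch(texts, normalize=False):
    counts = Counter(word for text in texts for word in text.split())
    ranked = counts.most_common()  # sorted from most to least used
    if normalize:
        total = sum(counts.values())
        return [(word, count / total) for word, count in ranked]
    return ranked

raw = top_words_sketch(["a b a", "a c"])
relative = top_words_sketch(["a b a", "a c"], normalize=True)
print(raw[0], relative[0])
```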
