
Commit

Merge pull request Conchylicultor#89 from Conchylicultor/vocab_filter
Vocab filter
Conchylicultor committed Mar 19, 2017
2 parents fa451ce + 1bdd7f8 commit a956625
Showing 4 changed files with 231 additions and 76 deletions.
34 changes: 33 additions & 1 deletion README.md
@@ -1,6 +1,8 @@
# Deep Q&A
[![Join the chat at https://gitter.im/chatbot-pilots/DeepQA](https://badges.gitter.im/chatbot-pilots/DeepQA.svg)](https://gitter.im/chatbot-pilots/DeepQA?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

Note: if your models don't work anymore following the last update, please follow [these instructions](#upgrade) to update them.

#### Table of Contents

* [Presentation](#presentation)
@@ -11,6 +13,7 @@
* [Results](#results)
* [Pretrained model](#pretrained-model)
* [Improvements](#improvements)
* [Upgrade](#upgrade)

## Presentation

@@ -205,7 +208,10 @@ It also seems to overfit as sometimes it will just pop out sentences from its tr

## Pretrained model

You can find a pre-trained model [here](https://drive.google.com/file/d/0Bw-phsNSkq23TXltWGlOdk9wOXc/view?usp=sharing), trained of the default corpus. To launch it, extract it inside `DeepQA/save/` and run `./main.py --modelTag pretrainedv2 --test interactive`. The old pre-trained model is still available [here](https://drive.google.com/file/d/0Bw-phsNSkq23amlSZXVqcm5oVFU/view?usp=sharing) (Won't work with the current version).
You can find a pre-trained model [here](https://drive.google.com/file/d/0Bw-phsNSkq23TXltWGlOdk9wOXc/view?usp=sharing), trained on the default corpus. To use it:
1. Extract the zip file inside `DeepQA/save/`
2. Copy the preprocessed dataset from `save/model-pretrainedv2/dataset-cornell-old-lenght10-filter0.pkl` to `data/samples/`.
3. Run `./main.py --modelTag pretrainedv2 --test interactive`.
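
For convenience, the steps above could be scripted roughly as follows. This is only a minimal sketch, assuming the archive was extracted into `save/model-pretrainedv2/` and that the script is run from the repository root; the helper itself is not part of the project:

```python
import os
import shutil
import subprocess

# Step 2: copy the preprocessed dataset next to the other samples.
dataset = 'dataset-cornell-old-lenght10-filter0.pkl'
shutil.copy(os.path.join('save', 'model-pretrainedv2', dataset),
            os.path.join('data', 'samples', dataset))

# Step 3: launch an interactive session with the pretrained weights.
subprocess.run(['./main.py', '--modelTag', 'pretrainedv2', '--test', 'interactive'],
               check=True)
```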

If you have a high-end GPU, don't hesitate to play with the hyper-parameters/corpus to train a better model. From my experiments, it seems that the learning rate and dropout rate have the most impact on the results. Also, if you want to share your models, don't hesitate to contact me and I'll add them here.

@@ -218,3 +224,29 @@ In addition to trying larger/deeper model, there are a lot of small improvements
* Having more data usually doesn't hurt. Training on a bigger corpus should be beneficial. The [Reddit comments dataset](https://www.reddit.com/r/datasets/comments/59039y/updated_reddit_comment_dataset_up_to_201608/) seems to be the biggest for now (and is too big for this program to support). Another trick to artificially increase the dataset size when creating the corpus could be to split the sentences of each training sample (e.g. from the sample `Q:Sentence 1. Sentence 2. => A:Sentence X. Sentence Y.` we could generate 3 new samples: `Q:Sentence 1. Sentence 2. => A:Sentence X.`, `Q:Sentence 2. => A:Sentence X. Sentence Y.` and `Q:Sentence 2. => A:Sentence X.`. Warning: other combinations like `Q:Sentence 1. => A:Sentence X.` won't work because they would break the transition `2 => X` which links the question to the answer). A sketch of this splitting is given after this list.
* The testing curve should really be monitored, as done in my other [music generation](https://github.com/Conchylicultor/MusicGenerator) project. This would greatly help to see the impact of dropout on overfitting. For now it's just done empirically by manually checking the testing predictions at different training steps.
* For now, the questions are independent from each other. To link questions together, a straightforward way would be to feed all previous questions and answers to the encoder before giving the answer. Some caching could be done on the final encoder state to avoid recomputing it each time. To improve the accuracy, the network should be retrained on entire dialogues instead of just individual QA pairs. Also, when feeding the previous dialogue to the encoder, new tokens `<Q>` and `<A>` could be added so the encoder knows when the interlocutor is changing. I'm not sure though that the simple seq2seq model would be sufficient to capture long term dependencies between sentences. Adding a bucket system to group similar input lengths together could greatly improve training speed.
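
As an illustration of the sentence-splitting trick mentioned in the first point, here is a minimal sketch (the function and variable names are made up, and the sentences are assumed to be already split):

```python
def split_samples(question_sentences, answer_sentences):
    """Yield every (question, answer) pair built from a suffix of the question
    and a prefix of the answer, so that the last-question -> first-answer
    transition (`2 => X` in the example above) is always preserved."""
    for i in range(len(question_sentences)):
        for j in range(1, len(answer_sentences) + 1):
            yield (' '.join(question_sentences[i:]),
                   ' '.join(answer_sentences[:j]))

samples = list(split_samples(['Sentence 1.', 'Sentence 2.'],
                             ['Sentence X.', 'Sentence Y.']))
# 4 pairs in total: the original sample plus the 3 extra ones from the example.
```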

## Upgrade

With the last commit, I added the possibility to filter rarely used words from the dataset using `--filterVocab 3`. The dataset preprocessing should also be a lot faster when generating the vocabulary for different `maxLength` values. Unfortunately, this makes the previously pre-processed datasets incompatible. Here are the changes needed to use an old model with the new version (a sketch automating both steps follows the list, and a short illustration of the vocabulary filtering closes this section):

1. Rename the old dataset (located in `data/samples/`) to the new naming format, e.g. from `dataset-cornell-10.pkl` to `dataset-cornell-old-lenght10-filter0.pkl`.

2. Update the model configuration file `params.ini`. The changes to make are:
* `version`: change to `0.5`
* Create a new `[Dataset]` section with the following fields:

```ini
[Dataset]
# Make sure that datasettag matches the one from the filename you just renamed
datasettag = old
# Use the maxLength value of your model (it should also match the value in the filename)
maxlength = 10
filtervocab = 0
```
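
Both steps could also be applied with a small script like the one below. This is a hypothetical helper, assuming the Cornell corpus with `maxLength` 10 and a model saved under `save/model/`; adjust the paths and values to your own setup (note that `configparser` drops comments when rewriting the file):

```python
import configparser
import os

samples_dir = os.path.join('data', 'samples')
config_path = os.path.join('save', 'model', 'params.ini')  # adjust to your model directory

# Step 1: rename the old preprocessed dataset to the new naming scheme.
os.rename(os.path.join(samples_dir, 'dataset-cornell-10.pkl'),
          os.path.join(samples_dir, 'dataset-cornell-old-lenght10-filter0.pkl'))

# Step 2: bump the version and add the [Dataset] section.
config = configparser.ConfigParser()
config.read(config_path)
config['General']['version'] = '0.5'
config['Dataset'] = {'datasettag': 'old', 'maxlength': '10', 'filtervocab': '0'}
with open(config_path, 'w') as configfile:
    config.write(configfile)
```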

If everything goes well, you should see this message somewhere in the terminal output when loading your model:

```
Loading dataset from /home/*/DeepQA/data/samples/dataset-cornell-old-lenght10-filter0.pkl
Loaded cornell: 34991 words, 139979 QA
```
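
For reference, the filtering enabled by `--filterVocab` boils down to replacing infrequent words with the unknown token before building the vocabulary. The snippet below only illustrates that idea and is not the project's actual preprocessing code; the token name and the exact threshold comparison are assumptions:

```python
from collections import Counter

def filter_vocab(tokenized_samples, filter_vocab=1, unknown_token='<unknown>'):
    """Replace words occurring at most `filter_vocab` times with the unknown
    token; with filter_vocab=0 every word is kept."""
    counts = Counter(word for sentence in tokenized_samples for word in sentence)
    kept = {word for word, count in counts.items() if count > filter_vocab}
    return [[word if word in kept else unknown_token for word in sentence]
            for sentence in tokenized_samples]

samples = [['hello', 'there'], ['hello', 'world']]
print(filter_vocab(samples))
# [['hello', '<unknown>'], ['hello', '<unknown>']]  ('there' and 'world' appear only once)
```
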
24 changes: 15 additions & 9 deletions chatbot/chatbot.py
@@ -70,7 +70,7 @@ def __init__(self):
self.MODEL_NAME_BASE = 'model'
self.MODEL_EXT = '.ckpt'
self.CONFIG_FILENAME = 'params.ini'
self.CONFIG_VERSION = '0.4'
self.CONFIG_VERSION = '0.5'
self.TEST_IN_NAME = 'data/test/samples.txt'
self.TEST_OUT_SUFFIX = '_predictions.txt'
self.SENTENCES_PREFIX = ['Q: ', 'A: ']
@@ -113,7 +113,7 @@ def parseArgs(args):
datasetArgs.add_argument('--datasetTag', type=str, default='', help='add a tag to the dataset (file where to load the vocabulary and the precomputed samples, not the original corpus). Useful to manage multiple versions. Also used to define the file used for the lightweight format.') # The samples are computed from the corpus if it does not exist already. They are saved in \'data/samples/\'
datasetArgs.add_argument('--ratioDataset', type=float, default=1.0, help='ratio of dataset used to avoid using the whole dataset') # Not implemented, useless ?
datasetArgs.add_argument('--maxLength', type=int, default=10, help='maximum length of the sentence (for input and output), define number of maximum step of the RNN')
datasetArgs.add_argument('--lightweightFile', type=str, default=None, help='file containing our lightweight-formatted corpus')
datasetArgs.add_argument('--filterVocab', type=int, default=1, help='remove rarely used words (by default words used only once). 0 to keep all words.')

# Network options (Warning: if modifying something here, also make the change on save/loadParams() )
nnArgs = parser.add_argument_group('Network options', 'architecture related option')
@@ -536,18 +536,20 @@ def loadModelParams(self):

# Restoring the parameters
self.globStep = config['General'].getint('globStep')
self.args.maxLength = config['General'].getint('maxLength') # We need to restore the model length because of the textData associated and the vocabulary size (TODO: Compatibility mode between different maxLength)
self.args.watsonMode = config['General'].getboolean('watsonMode')
self.args.autoEncode = config['General'].getboolean('autoEncode')
self.args.corpus = config['General'].get('corpus')
self.args.datasetTag = config['General'].get('datasetTag', '')
self.args.embeddingSource = config['General'].get('embeddingSource', '')

self.args.datasetTag = config['Dataset'].get('datasetTag')
self.args.maxLength = config['Dataset'].getint('maxLength') # We need to restore the model length because of the textData associated and the vocabulary size (TODO: Compatibility mode between different maxLength)
self.args.filterVocab = config['Dataset'].getint('filterVocab')

self.args.hiddenSize = config['Network'].getint('hiddenSize')
self.args.numLayers = config['Network'].getint('numLayers')
self.args.softmaxSamples = config['Network'].getint('softmaxSamples')
self.args.initEmbeddings = config['Network'].getboolean('initEmbeddings')
self.args.embeddingSize = config['Network'].getint('embeddingSize')
self.args.embeddingSource = config['Network'].get('embeddingSource')


# No restoring for training params, batch size or other non model dependent parameters
@@ -556,11 +558,12 @@
print()
print('Warning: Restoring parameters:')
print('globStep: {}'.format(self.globStep))
print('maxLength: {}'.format(self.args.maxLength))
print('watsonMode: {}'.format(self.args.watsonMode))
print('autoEncode: {}'.format(self.args.autoEncode))
print('corpus: {}'.format(self.args.corpus))
print('datasetTag: {}'.format(self.args.datasetTag))
print('maxLength: {}'.format(self.args.maxLength))
print('filterVocab: {}'.format(self.args.filterVocab))
print('hiddenSize: {}'.format(self.args.hiddenSize))
print('numLayers: {}'.format(self.args.numLayers))
print('softmaxSamples: {}'.format(self.args.softmaxSamples))
@@ -585,19 +588,22 @@ def saveModelParams(self):
config['General'] = {}
config['General']['version'] = self.CONFIG_VERSION
config['General']['globStep'] = str(self.globStep)
config['General']['maxLength'] = str(self.args.maxLength)
config['General']['watsonMode'] = str(self.args.watsonMode)
config['General']['autoEncode'] = str(self.args.autoEncode)
config['General']['corpus'] = str(self.args.corpus)
config['General']['datasetTag'] = str(self.args.datasetTag)
config['General']['embeddingSource'] = str(self.args.embeddingSource)

config['Dataset'] = {}
config['Dataset']['datasetTag'] = str(self.args.datasetTag)
config['Dataset']['maxLength'] = str(self.args.maxLength)
config['Dataset']['filterVocab'] = str(self.args.filterVocab)

config['Network'] = {}
config['Network']['hiddenSize'] = str(self.args.hiddenSize)
config['Network']['numLayers'] = str(self.args.numLayers)
config['Network']['softmaxSamples'] = str(self.args.softmaxSamples)
config['Network']['initEmbeddings'] = str(self.args.initEmbeddings)
config['Network']['embeddingSize'] = str(self.args.embeddingSize)
config['Network']['embeddingSource'] = str(self.args.embeddingSource)

# Keep track of the learning params (but without restoring them)
config['Training (won\'t be restored)'] = {}
