
Commit

Merge pull request Conchylicultor#89 from Conchylicultor/vocab_filter
Vocab filter
Conchylicultor committed Mar 19, 2017
2 parents fa451ce + 1bdd7f8 commit a956625
Showing 4 changed files with 231 additions and 76 deletions.
34 changes: 33 additions & 1 deletion README.md
@@ -1,6 +1,8 @@
# Deep Q&A
[![Join the chat at https://gitter.im/chatbot-pilots/DeepQA](https://badges.gitter.im/chatbot-pilots/DeepQA.svg)](https://gitter.im/chatbot-pilots/DeepQA?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

Note: if your models don't work anymore following the last update, please follow [these instructions](#upgrade) to update them.

#### Table of Contents

* [Presentation](#presentation)
@@ -11,6 +13,7 @@
* [Results](#results)
* [Pretrained model](#pretrained-model)
* [Improvements](#improvements)
* [Upgrade](#upgrade)

## Presentation

@@ -205,7 +208,10 @@ It also seems to overfit as sometimes it will just pop out sentences from its tr

## Pretrained model

You can find a pre-trained model [here](https://drive.google.com/file/d/0Bw-phsNSkq23TXltWGlOdk9wOXc/view?usp=sharing), trained of the default corpus. To launch it, extract it inside `DeepQA/save/` and run `./main.py --modelTag pretrainedv2 --test interactive`. The old pre-trained model is still available [here](https://drive.google.com/file/d/0Bw-phsNSkq23amlSZXVqcm5oVFU/view?usp=sharing) (Won't work with the current version).
You can find a pre-trained model [here](https://drive.google.com/file/d/0Bw-phsNSkq23TXltWGlOdk9wOXc/view?usp=sharing), trained on the default corpus. To use it:
1. Extract the zip file inside `DeepQA/save/`
2. Copy the preprocessed dataset from `save/model-pretrainedv2/dataset-cornell-old-lenght10-filter0.pkl` to `data/samples/`.
3. Run `./main.py --modelTag pretrainedv2 --test interactive`.
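
For convenience, the steps above could be scripted roughly as follows. This is only a minimal sketch, assuming the archive was extracted into `save/model-pretrainedv2/` and that the script is run from the repository root; the helper itself is not part of the project:

```python
import os
import shutil
import subprocess

# Step 2: copy the preprocessed dataset next to the other samples.
dataset = 'dataset-cornell-old-lenght10-filter0.pkl'
shutil.copy(os.path.join('save', 'model-pretrainedv2', dataset),
            os.path.join('data', 'samples', dataset))

# Step 3: launch an interactive session with the pretrained weights.
subprocess.run(['./main.py', '--modelTag', 'pretrainedv2', '--test', 'interactive'],
               check=True)
```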

If you have a high-end GPU, don't hesitate to play with the hyper-parameters/corpus to train a better model. From my experiments, it seems that the learning rate and dropout rate have the most impact on the results. Also, if you want to share your models, don't hesitate to contact me and I'll add them here.

@@ -218,3 +224,29 @@ In addition to trying larger/deeper model, there are a lot of small improvements
* Having more data usually doesn't hurt. Training on a bigger corpus should be beneficial. The [Reddit comments dataset](https://www.reddit.com/r/datasets/comments/59039y/updated_reddit_comment_dataset_up_to_201608/) seems to be the biggest for now (and is too big for this program to support). Another trick to artificially increase the dataset size when creating the corpus could be to split the sentences of each training sample (e.g. from the sample `Q:Sentence 1. Sentence 2. => A:Sentence X. Sentence Y.` we could generate 3 new samples: `Q:Sentence 1. Sentence 2. => A:Sentence X.`, `Q:Sentence 2. => A:Sentence X. Sentence Y.` and `Q:Sentence 2. => A:Sentence X.`. Warning: other combinations like `Q:Sentence 1. => A:Sentence X.` won't work because they would break the transition `2 => X` which links the question to the answer). A sketch of this splitting is given after this list.
* The testing curve should really be monitored, as done in my other [music generation](https://github.com/Conchylicultor/MusicGenerator) project. This would greatly help to see the impact of dropout on overfitting. For now it's just done empirically by manually checking the testing predictions at different training steps.
* For now, the questions are independent from each other. To link questions together, a straightforward way would be to feed all previous questions and answers to the encoder before giving the answer. Some caching could be done on the final encoder state to avoid recomputing it each time. To improve the accuracy, the network should be retrained on entire dialogues instead of just individual QA pairs. Also, when feeding the previous dialogue to the encoder, new tokens `<Q>` and `<A>` could be added so the encoder knows when the interlocutor is changing. I'm not sure though that the simple seq2seq model would be sufficient to capture long term dependencies between sentences. Adding a bucket system to group similar input lengths together could greatly improve training speed.
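
As an illustration of the sentence-splitting trick mentioned in the first point, here is a minimal sketch (the function and variable names are made up, and the sentences are assumed to be already split):

```python
def split_samples(question_sentences, answer_sentences):
    """Yield every (question, answer) pair built from a suffix of the question
    and a prefix of the answer, so that the last-question -> first-answer
    transition (`2 => X` in the example above) is always preserved."""
    for i in range(len(question_sentences)):
        for j in range(1, len(answer_sentences) + 1):
            yield (' '.join(question_sentences[i:]),
                   ' '.join(answer_sentences[:j]))

samples = list(split_samples(['Sentence 1.', 'Sentence 2.'],
                             ['Sentence X.', 'Sentence Y.']))
# 4 pairs in total: the original sample plus the 3 extra ones from the example.
```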

## Upgrade

With the last commit, I added the possibility to filter rarely used words from the dataset using `--filterVocab 3`. The dataset preprocessing should also be a lot faster when generating the vocabulary for different `maxLength` values. Unfortunately, this makes the previously pre-processed datasets incompatible. Here are the changes needed to use an old model with the new version (a sketch automating both steps follows the list, and a short illustration of the vocabulary filtering closes this section):

1. Rename the old dataset (located in `data/samples/`) to the new naming format, e.g. from `dataset-cornell-10.pkl` to `dataset-cornell-old-lenght10-filter0.pkl`.

2. Update the model configuration file `params.ini`. The changes to make are:
* `version`: change to `0.5`
* Create a new `[Dataset]` section with the following fields:

```ini
[Dataset]
# Make sure that datasettag matches the one from the filename you just renamed
datasettag = old
# Use the maxLength value of your model (it should also match the value in the filename)
maxlength = 10
filtervocab = 0
```
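
Both steps could also be applied with a small script like the one below. This is a hypothetical helper, assuming the Cornell corpus with `maxLength` 10 and a model saved under `save/model/`; adjust the paths and values to your own setup (note that `configparser` drops comments when rewriting the file):

```python
import configparser
import os

samples_dir = os.path.join('data', 'samples')
config_path = os.path.join('save', 'model', 'params.ini')  # adjust to your model directory

# Step 1: rename the old preprocessed dataset to the new naming scheme.
os.rename(os.path.join(samples_dir, 'dataset-cornell-10.pkl'),
          os.path.join(samples_dir, 'dataset-cornell-old-lenght10-filter0.pkl'))

# Step 2: bump the version and add the [Dataset] section.
config = configparser.ConfigParser()
config.read(config_path)
config['General']['version'] = '0.5'
config['Dataset'] = {'datasettag': 'old', 'maxlength': '10', 'filtervocab': '0'}
with open(config_path, 'w') as configfile:
    config.write(configfile)
```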

If everything goes well, you should see this message somewhere in the terminal output when loading your model:

```
Loading dataset from /home/*/DeepQA/data/samples/dataset-cornell-old-lenght10-filter0.pkl
Loaded cornell: 34991 words, 139979 QA
```
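
For reference, the filtering enabled by `--filterVocab` boils down to replacing infrequent words with the unknown token before building the vocabulary. The snippet below only illustrates that idea and is not the project's actual preprocessing code; the token name and the exact threshold comparison are assumptions:

```python
from collections import Counter

def filter_vocab(tokenized_samples, filter_vocab=1, unknown_token='<unknown>'):
    """Replace words occurring at most `filter_vocab` times with the unknown
    token; with filter_vocab=0 every word is kept."""
    counts = Counter(word for sentence in tokenized_samples for word in sentence)
    kept = {word for word, count in counts.items() if count > filter_vocab}
    return [[word if word in kept else unknown_token for word in sentence]
            for sentence in tokenized_samples]

samples = [['hello', 'there'], ['hello', 'world']]
print(filter_vocab(samples))
# [['hello', '<unknown>'], ['hello', '<unknown>']]  ('there' and 'world' appear only once)
```
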
24 changes: 15 additions & 9 deletions chatbot/chatbot.py
@@ -70,7 +70,7 @@ def __init__(self):
self.MODEL_NAME_BASE = 'model'
self.MODEL_EXT = '.ckpt'
self.CONFIG_FILENAME = 'params.ini'
self.CONFIG_VERSION = '0.4'
self.CONFIG_VERSION = '0.5'
self.TEST_IN_NAME = 'data/test/samples.txt'
self.TEST_OUT_SUFFIX = '_predictions.txt'
self.SENTENCES_PREFIX = ['Q: ', 'A: ']
@@ -113,7 +113,7 @@ def parseArgs(args):
datasetArgs.add_argument('--datasetTag', type=str, default='', help='add a tag to the dataset (file where to load the vocabulary and the precomputed samples, not the original corpus). Useful to manage multiple versions. Also used to define the file used for the lightweight format.') # The samples are computed from the corpus if it does not exist already. They are saved in \'data/samples/\'
datasetArgs.add_argument('--ratioDataset', type=float, default=1.0, help='ratio of dataset used to avoid using the whole dataset') # Not implemented, useless ?
datasetArgs.add_argument('--maxLength', type=int, default=10, help='maximum length of the sentence (for input and output), define number of maximum step of the RNN')
datasetArgs.add_argument('--lightweightFile', type=str, default=None, help='file containing our lightweight-formatted corpus')
datasetArgs.add_argument('--filterVocab', type=int, default=1, help='remove rarely used words (by default words used only once). 0 to keep all words.')

# Network options (Warning: if modifying something here, also make the change on save/loadParams() )
nnArgs = parser.add_argument_group('Network options', 'architecture related option')
@@ -536,18 +536,20 @@ def loadModelParams(self):

# Restoring the parameters
self.globStep = config['General'].getint('globStep')
self.args.maxLength = config['General'].getint('maxLength') # We need to restore the model length because of the textData associated and the vocabulary size (TODO: Compatibility mode between different maxLength)
self.args.watsonMode = config['General'].getboolean('watsonMode')
self.args.autoEncode = config['General'].getboolean('autoEncode')
self.args.corpus = config['General'].get('corpus')
self.args.datasetTag = config['General'].get('datasetTag', '')
self.args.embeddingSource = config['General'].get('embeddingSource', '')

self.args.datasetTag = config['Dataset'].get('datasetTag')
self.args.maxLength = config['Dataset'].getint('maxLength') # We need to restore the model length because of the textData associated and the vocabulary size (TODO: Compatibility mode between different maxLength)
self.args.filterVocab = config['Dataset'].getint('filterVocab')

self.args.hiddenSize = config['Network'].getint('hiddenSize')
self.args.numLayers = config['Network'].getint('numLayers')
self.args.softmaxSamples = config['Network'].getint('softmaxSamples')
self.args.initEmbeddings = config['Network'].getboolean('initEmbeddings')
self.args.embeddingSize = config['Network'].getint('embeddingSize')
self.args.embeddingSource = config['Network'].get('embeddingSource')


# No restoring for training params, batch size or other non model dependent parameters
@@ -556,11 +558,12 @@
print()
print('Warning: Restoring parameters:')
print('globStep: {}'.format(self.globStep))
print('maxLength: {}'.format(self.args.maxLength))
print('watsonMode: {}'.format(self.args.watsonMode))
print('autoEncode: {}'.format(self.args.autoEncode))
print('corpus: {}'.format(self.args.corpus))
print('datasetTag: {}'.format(self.args.datasetTag))
print('maxLength: {}'.format(self.args.maxLength))
print('filterVocab: {}'.format(self.args.filterVocab))
print('hiddenSize: {}'.format(self.args.hiddenSize))
print('numLayers: {}'.format(self.args.numLayers))
print('softmaxSamples: {}'.format(self.args.softmaxSamples))
@@ -585,19 +588,22 @@ def saveModelParams(self):
config['General'] = {}
config['General']['version'] = self.CONFIG_VERSION
config['General']['globStep'] = str(self.globStep)
config['General']['maxLength'] = str(self.args.maxLength)
config['General']['watsonMode'] = str(self.args.watsonMode)
config['General']['autoEncode'] = str(self.args.autoEncode)
config['General']['corpus'] = str(self.args.corpus)
config['General']['datasetTag'] = str(self.args.datasetTag)
config['General']['embeddingSource'] = str(self.args.embeddingSource)

config['Dataset'] = {}
config['Dataset']['datasetTag'] = str(self.args.datasetTag)
config['Dataset']['maxLength'] = str(self.args.maxLength)
config['Dataset']['filterVocab'] = str(self.args.filterVocab)

config['Network'] = {}
config['Network']['hiddenSize'] = str(self.args.hiddenSize)
config['Network']['numLayers'] = str(self.args.numLayers)
config['Network']['softmaxSamples'] = str(self.args.softmaxSamples)
config['Network']['initEmbeddings'] = str(self.args.initEmbeddings)
config['Network']['embeddingSize'] = str(self.args.embeddingSize)
config['Network']['embeddingSource'] = str(self.args.embeddingSource)

# Keep track of the learning params (but without restoring them)
config['Training (won\'t be restored)'] = {}
