README updates

yixuan · yixuan · Jul 11, 2016 · May 5, 2016 · May 5, 2016 · May 7, 2016
commit 3b8d612237773a349ad752e179afeb77194b5a67
diff --git a/README.md b/README.md
@@ -1,20 +1,44 @@
-## Recommender System with the recosystem Package
-
-**Important Notes**: The API of this package has changed since version 0.3, due
-to the API change of the underlying LIBMF library version 1.2.
+### IMPORTANT NOTES
+
+> The API of this package has changed since version 0.4, due
+> to the API change in LIBMF 2.01 and some other design improvements.
+
+- The `cost` option in `$train()` and `$tune()` has been expanded to and replaced
+  by `costp_l1`, `costp_l2`, `costq_l1`, and `costq_l2`, to allow for more
+  flexibility of the model.
+- `$output()` has been renamed to `$export()`.
+- Data input and output are now managed in a unified way via functions
+  `data_file()`, `data_memory()`, `out_file()`, `out_memory()`, and
+  `out_nothing()`. See section **Data Input and Output** below.
+- As a result, a number of arguments in functions `$tune()`, `$train()`,
+  `$export()`, and `$predict()` should now be objects returned by these
+  input/output functions.
 
-- `$convert_train()` and `$convert_test()` have been removed
-- `$train()` and `$predict()` have different argument lists
-- Added `$tune()` member function for parameter tuning
+## Recommender System with the recosystem Package
 
 ### About This Package
 
 `recosystem` is an R wrapper of the `LIBMF` library developed by
-Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin and Chih-Jen Lin
-(http://www.csie.ntu.edu.tw/~cjlin/libmf/),
-an open source library for recommender system using marix factorization.
+Yu-Chin Juan, Wei-Sheng Chin, Yong Zhuang, Bo-Wen Yuan, Meng-Yuan Yang,
+and Chih-Jen Lin (http://www.csie.ntu.edu.tw/~cjlin/libmf/),
+an open source library for recommender systems using parallel matrix
+factorization.
+
+### Highlights of LIBMF and recosystem
+
+`LIBMF` is a high-performance C++ library for large scale matrix factorization.
+`LIBMF` itself is a parallelized library, meaning that
+users can take advantage of multicore CPUs to speed up the computation.
+It also utilizes some advanced CPU features to further improve the performance.
 
-A more detailed introduction can be found in the vignette of this package.
+`recosystem` is a wrapper of `LIBMF`, hence it inherits most of the features
+of `LIBMF`, and additionally provides a number of user-friendly R functions to
+simplify data processing and model building. Also, unlike most other R packages
+for statistical modeling that store the whole dataset and model object in
+memory, `LIBMF` (and hence `recosystem`) can significantly reduce memory use.
+For instance, the constructed model that contains information for prediction
+can be stored on the hard disk, and output results can also be written
+directly to a file rather than kept in memory.
 
 ### A Quick View of Recommender System
 
@@ -31,23 +55,45 @@ rating matrix based on observed values, as is shown in the table below:
 
 Each cell with number in it is the rating given by some user on a specific
 item, while those marked with question marks are unknown ratings that need
-to be predicted. In some other literatures, this problem may be given other
-names, e.g. collaborative filtering, matrix completion, matrix recovery, etc.
+to be predicted. In the literature, this problem is also known as
+collaborative filtering, matrix completion, matrix recovery, etc.
 
-### Highlights of LIBMF and recosystem
+In `recosystem`, we provide convenient functions for model training, parameter
+tuning, model exporting, and model prediction.
+
+### Data Input and Output
 
-`LIBMF` itself is a parallelized library, meaning that users can take
-advantage of multicore CPUs to speed up the computation. It also utilizes 
-some advanced CPU features to further improve the performance. [@LIBMF]
+Each step in the recommender system involves data input and output, as the
+table below shows:
 
-`recosystem` is a wrapper of `LIBMF`, hence the features of `LIBMF`
-are all included in `recosystem`. Also, unlike most other R packages for
-statistical modeling which store the whole dataset and model object in memory,
-`LIBMF` (and hence `recosystem`) is much hard-disk-based, for instance
-the constructed model which contains information for prediction can be stored
-in the hard disk, and prediction result can also be directly written into a file
-rather than kept in memory. That is to say, `recosystem` will have a
-comparatively small memory usage.
+| Step             | Input             | Output                           |
+|------------------|-------------------|----------------------------------|
+| Model training   | Training data set | --                               |
+| Parameter tuning | Training data set | --                               |
+| Exporting model  | --                | User matrix `P`, item matrix `Q` |
+| Prediction       | Testing data set  | Predicted values                 |
+
+Data may have different formats and types of storage. For example, the input
+data set may be saved in a file or stored as R objects, and users may want
+the output results to be written directly to a file or returned as R
+objects for further processing. In `recosystem`, we use two classes,
+`DataSource` and `Output`, to handle data input and output in a unified way.
+
+An object of class `DataSource` specifies the source of a data set (either
+training or testing), which can be created by the following two functions:
+
+- `data_file()`: Specifies a data set from a file on the hard disk
+- `data_memory()`: Specifies a data set from R objects
+
+An object of class `Output` describes how the result should be output, and is
+typically created by the functions below:
+
+- `out_file()`: Result should be saved to a file
+- `out_memory()`: Result should be returned as R objects
+- `out_nothing()`: Nothing should be output
+
+More data source formats and output options may be supported in the future
+along with the development of this package.
 
 ### Data Format
 
@@ -56,15 +102,13 @@ sparse matrix triplet form, i.e., each line in the file contains three
 numbers
 
 ```
-user_id item_id rating
+user_index item_index rating
 ```
 
-Testing data file is similar to training data, but since the ratings in
-testing data are usually unknown, the `rating` entry in testing data file
-can be omitted, or can be replaced by any placeholder such as `0` or `?`.
-
-Be careful with the convention that `user_id` and `item_id` start from 0,
-so the training data file for the example in the beginning will look like
+User index and item index may start with either 0 or 1, and this can be
+specified by the `index1` parameter in `data_file()` and `data_memory()`.
+For example, with `index1 = FALSE`, the training data file for the rating matrix
+at the beginning of this article may look like
 
 ```
 0 0 2
@@ -76,7 +120,11 @@ so the training data file for the example in the beginning will look like
 ...
 ```
 
-And testing data file is
+The testing data file is similar to the training data file, but since the
+ratings in testing data are usually unknown, the `rating` entry in the testing
+data file can be omitted, or can be replaced by any placeholder such as `0` or `?`.
+
+The testing data file for the same rating matrix would be
 
 ```
 0 2
@@ -85,12 +133,8 @@ And testing data file is
 ...
 ```
 
-Since ratings for testing data are unknown, here we simply omit the third entry.
-However if their values are really given, the testing data will serve as
-a validation set on which RMSE of prediction can be calculated.
-
-Example data files are contained in the `recosystem/dat`
-(or `recosystem/inst/dat`, for source package) directory.
+Example data files are contained in the `<recosystem>/dat`
+(or `<recosystem>/inst/dat`, for source package) directory.
 
 ### Usage of recosystem
 
@@ -101,116 +145,124 @@ The usage of `recosystem` is quite simple, mainly consisting of the following st
 along a set of candidate values.
 3. Train the model by calling the `$train()` method. A number of parameters
 can be set inside the function, possibly coming from the result of `$tune()`.
-4. (Optionally) output the model, i.e. write the factorized $P$ and $Q$
-matrices info files.
-5. Use the `$predict()` method to compute predictions and write results
-into a file.
+4. (Optionally) export the model, i.e. write the factorization matrices
+$P$ and $Q$ into files or return them as R objects.
+5. Use the `$predict()` method to compute predicted values.
 
 Below is an example on some simulated data:
 
 ```r
 library(recosystem)
 set.seed(123) # This is a randomized algorithm
-trainset = system.file("dat", "smalltrain.txt", package = "recosystem")
-testset = system.file("dat", "smalltest.txt", package = "recosystem")
+train_set = data_file(system.file("dat", "smalltrain.txt", package = "recosystem"))
+test_set  = data_file(system.file("dat", "smalltest.txt",  package = "recosystem"))
 r = Reco()
-opts = r$tune(trainset, opts = list(dim = c(10, 20, 30), lrate = c(0.05, 0.1, 0.2),
-                                    nthread = 1, niter = 10))
+opts = r$tune(train_set, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2),
+                                     costp_l1 = 0, costq_l1 = 0,
+                                     nthread = 1, niter = 10))
 opts
 ```
 
 ```
-## $min
-## $min$dim
-## [1] 10
-## 
-## $min$cost
-## [1] 0.1
-## 
-## $min$lrate
-## [1] 0.05
-## 
-## 
-## $res
-##    dim cost lrate      rmse
-## 1   10 0.01  0.05 0.9508706
-## 2   20 0.01  0.05 0.9769276
-## 3   30 0.01  0.05 0.9552881
-## 4   10 0.10  0.05 0.9494486
-## 5   20 0.10  0.05 0.9745281
-## 6   30 0.10  0.05 0.9665343
-## 7   10 0.01  0.10 1.0146531
-## 8   20 0.01  0.10 1.0176182
-## 9   30 0.01  0.10 1.0006795
-## 10  10 0.10  0.10 0.9697273
-## 11  20 0.10  0.10 0.9870130
-## 12  30 0.10  0.10 0.9751481
-## 13  10 0.01  0.20 1.1101094
-## 14  20 0.01  0.20 1.0386463
-## 15  30 0.01  0.20 1.0129634
-## 16  10 0.10  0.20 1.0422394
-## 17  20 0.10  0.20 1.0249771
-## 18  30 0.10  0.20 1.0148717
-```
-
-```r
-r$train(trainset, opts = c(opts$min, nthread = 1, niter = 10))
-```
-
-```
-## iter   tr_rmse          obj
-##    0    2.5987   6.9706e+04
-##    1    1.8298   3.7380e+04
-##    2    1.2323   2.0192e+04
-##    3    0.9563   1.4674e+04
-##    4    0.8542   1.3051e+04
-##    5    0.8128   1.2467e+04
-##    6    0.7926   1.2200e+04
-##    7    0.7803   1.2033e+04
-##    8    0.7725   1.1929e+04
-##    9    0.7671   1.1863e+04
-## real tr_rmse = 0.7411
+$min
+$min$dim
+[1] 20
+
+$min$costp_l1
+[1] 0
+
+$min$costp_l2
+[1] 0.1
+
+$min$costq_l1
+[1] 0
+
+$min$costq_l2
+[1] 0.01
+
+$min$lrate
+[1] 0.1
+
+$min$rmse
+[1] 0.9804937
+
+
+$res
+   dim costp_l1 costp_l2 costq_l1 costq_l2 lrate      rmse
+1   10        0     0.01        0     0.01   0.1 0.9996368
+2   20        0     0.01        0     0.01   0.1 1.0040111
+3   30        0     0.01        0     0.01   0.1 0.9967101
+4   10        0     0.10        0     0.01   0.1 0.9930384
+5   20        0     0.10        0     0.01   0.1 0.9804937
+6   30        0     0.10        0     0.01   0.1 0.9921565
+7   10        0     0.01        0     0.10   0.1 0.9857116
+8   20        0     0.01        0     0.10   0.1 1.0006225
+9   30        0     0.01        0     0.10   0.1 0.9891277
+10  10        0     0.10        0     0.10   0.1 0.9826748
+11  20        0     0.10        0     0.10   0.1 0.9807865
+12  30        0     0.10        0     0.10   0.1 0.9863404
+13  10        0     0.01        0     0.01   0.2 1.1022376
+14  20        0     0.01        0     0.01   0.2 1.0266608
+15  30        0     0.01        0     0.01   0.2 1.0039170
+16  10        0     0.10        0     0.01   0.2 1.0734307
+17  20        0     0.10        0     0.01   0.2 1.0393326
+18  30        0     0.10        0     0.01   0.2 1.0003177
+19  10        0     0.01        0     0.10   0.2 1.0769594
+20  20        0     0.01        0     0.10   0.2 1.0323938
+21  30        0     0.01        0     0.10   0.2 1.0061849
+22  10        0     0.10        0     0.10   0.2 1.0365456
+23  20        0     0.10        0     0.10   0.2 1.0023265
+24  30        0     0.10        0     0.10   0.2 1.0044131
 ```
 
 ```r
-outfile = tempfile()
-r$predict(testset, outfile)
+r$train(train_set, opts = c(opts$min, nthread = 1, niter = 10))
 ```
 
 ```
-## prediction output generated at /tmp/RtmpqxN3AV/file2043363dc41b
+iter      tr_rmse          obj
+   0       2.2673   5.3765e+04
+   1       1.0267   1.3667e+04
+   2       0.8372   1.0147e+04
+   3       0.7977   9.4773e+03
+   4       0.7703   9.0439e+03
+   5       0.7402   8.5967e+03
+   6       0.7048   8.1202e+03
+   7       0.6609   7.5638e+03
+   8       0.6133   7.0246e+03
+   9       0.5614   6.4770e+03
 ```
 
 ```r
-## Compare the first few true values of testing data
-## with predicted ones
-# True values
-print(read.table(testset, header = FALSE, sep = " ", nrows = 10)$V3)
+## Write predictions to file
+pred_file = tempfile()
+r$predict(test_set, out_file(pred_file))
+print(scan(pred_file, n = 10))
 ```
 
 ```
-##  [1] 3 4 2 3 3 4 3 3 3 3
+ [1] 3.92323 3.05510 2.98484 3.42607 2.53514 2.88135 2.93226 3.11718 2.40406 3.46282
 ```
 
 ```r
-# Predicted values
-print(scan(outfile, n = 10))
+## Or, directly return an R vector
+pred_rvec = r$predict(test_set, out_memory())
+head(pred_rvec, 10)
 ```
 
 ```
-##  [1] 3.70478 3.02759 2.97616 3.46205 2.15736 3.03603 2.74433 2.96865
-##  [9] 2.02960 3.24131
+ [1] 3.923234 3.055096 2.984840 3.426066 2.535142 2.881347 2.932261 3.117176 2.404063
+[10] 3.462822
 ```
 
 Detailed help document for each function is available in topics
 `?recosystem::Reco`, `?recosystem::tune`, `?recosystem::train`,
-`?recosystem::output` and `?recosystem::predict`.
+`?recosystem::export` and `?recosystem::predict`.
 
-### Installation Issue
+### Performance Improvement with Extra Installation Options
 
-`LIBMF` utilizes some compiler and CPU features that may be unavailable
-in some systems. To build `recosystem` from source, one needs a C++
-compiler that supports C++11 standard.
+To build `recosystem` from source, one needs a C++ compiler that supports
+the C++11 standard.
 
 Also, there are some flags in file `src/Makevars`
 (`src/Makevars.win` for Windows system) that may have influential

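The sparse triplet format described in the README's **Data Format** section can be illustrated with a short base-R sketch. It is not part of this commit: the rating matrix and the helper `to_triplets()` are hypothetical, and only the helper's `index1` argument is modeled on the `index1` parameter of `data_file()` and `data_memory()` mentioned in the README.

```r
# Hypothetical 3 x 3 rating matrix (users in rows, items in columns);
# NA marks an unknown rating.
ratings <- matrix(c( 2, NA,  3,
                    NA,  4, NA,
                     5, NA,  1),
                  nrow = 3, byrow = TRUE)

# Hypothetical helper (not a recosystem function): convert a rating matrix
# into sparse triplet form. index1 = FALSE gives indices starting from 0,
# index1 = TRUE gives indices starting from 1.
to_triplets <- function(mat, index1 = FALSE) {
  known  <- which(!is.na(mat), arr.ind = TRUE)  # 1-based (row, col) pairs
  offset <- if (index1) 0L else 1L
  trip <- data.frame(user_index = known[, "row"] - offset,
                     item_index = known[, "col"] - offset,
                     rating     = mat[known])
  # Sort by user, then by item, so the file reads row by row
  trip <- trip[order(trip$user_index, trip$item_index), ]
  rownames(trip) <- NULL
  trip
}

trip <- to_triplets(ratings, index1 = FALSE)

# Write one "user_index item_index rating" triplet per line
train_path <- tempfile(fileext = ".txt")
write.table(trip, train_path, sep = " ", row.names = FALSE, col.names = FALSE)
writeLines(readLines(train_path))
```

A file written this way could then be wrapped as `data_file(train_path, index1 = FALSE)`, following the calling pattern shown in the README example. Passing `index1 = TRUE` to both the helper and `data_file()` would be the 1-based equivalent.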
diff --git a/inst/NEWS.Rd b/inst/NEWS.Rd
@@ -6,9 +6,10 @@
     \item Update LIBMF to version 2.01.
     \item API change from LIBMF 2.01:
           \itemize{
-            \item The \code{cost} option in \code{$train()} is expanded to
-                  \code{costp_l1}, \code{costp_l2}, \code{costq_l1}, and
-                  \code{costq_l2}.
+            \item The \code{cost} option in \code{$train()} and \code{$tune()}
+                  has been expanded to and replaced by \code{costp_l1},
+                  \code{costp_l2}, \code{costq_l1}, and \code{costq_l2}, to
+                  allow for more flexibility of the model.
           }
     \item Other API change:
           \itemize{

diff --git a/vignettes/introduction.Rmd b/vignettes/introduction.Rmd
@@ -226,7 +226,7 @@ Below is an example on some simulated data:
 library(recosystem)
 set.seed(123) # This is a randomized algorithm
 train_set = data_file(system.file("dat", "smalltrain.txt", package = "recosystem"))
-test_set = data_file(system.file("dat", "smalltest.txt", package = "recosystem"))
+test_set  = data_file(system.file("dat", "smalltest.txt",  package = "recosystem"))
 r = Reco()
 opts = r$tune(train_set, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2),
                                      costp_l1 = 0, costq_l1 = 0,