update to libmf 2.01 #2

Merged
merged 62 commits into from
Jul 11, 2016
Changes from 1 commit
77b45bb
avoid crashes
yixuan May 5, 2016
6cfac01
import from libmf 2.01
yixuan May 5, 2016
be723d8
unneeded function
yixuan May 7, 2016
eea05d1
additinoal headers and macros
yixuan May 7, 2016
160fafc
fallback implementation of aligned memory allocation
yixuan May 7, 2016
108cb84
use R's RNG
yixuan May 8, 2016
29f38ea
use R's printing functions
yixuan May 8, 2016
26f3830
data reader
yixuan Jun 28, 2016
6278364
remove data reading code
yixuan Jun 29, 2016
13cb287
separate header file
yixuan Jun 29, 2016
9cccd4d
remove duplicated code
yixuan Jun 29, 2016
7a42321
model training
yixuan Jun 29, 2016
c1132d5
argument names
yixuan Jun 29, 2016
6d70119
model tuning
yixuan Jun 29, 2016
d1fedfa
upcoming news
yixuan Jun 29, 2016
5a175ce
get reader from R object
yixuan Jun 29, 2016
c5bda37
describe data source
yixuan Jun 29, 2016
224a5f7
virtual destructor
yixuan Jun 29, 2016
bf38713
getting data reader
yixuan Jun 29, 2016
e5d7386
use S4
yixuan Jun 29, 2016
4a3f1f7
make train() to work
yixuan Jun 29, 2016
c4a707d
make tune() to work
yixuan Jun 30, 2016
d313752
formatting
yixuan Jun 30, 2016
36333d5
documentation for train()
yixuan Jun 30, 2016
3442d62
update documentation for tune()
yixuan Jun 30, 2016
4482978
reference
yixuan Jul 2, 2016
2093bdb
documentation for data source
yixuan Jul 2, 2016
450d174
no longer needed
yixuan Jul 3, 2016
346d487
output format
yixuan Jul 4, 2016
aef401e
rename function
yixuan Jul 4, 2016
6339b67
code to export model
yixuan Jul 5, 2016
50bdc4e
R code for export()
yixuan Jul 6, 2016
c9b4498
proper way to read meta information
yixuan Jul 6, 2016
d5f7fe4
typo
yixuan Jul 6, 2016
519f4dd
private => protected
yixuan Jul 6, 2016
f08edd8
export functions
yixuan Jul 6, 2016
bb7de45
new code for predict()
Jul 6, 2016
e3b9d50
update R code
yixuan Jul 6, 2016
65c0f6b
documentation update
yixuan Jul 7, 2016
4e3ebb5
remove trailing space
yixuan Jul 7, 2016
5156a9d
documentation for predict()
yixuan Jul 7, 2016
188217a
header guards
yixuan Jul 8, 2016
4d10251
do not need data frame
yixuan Jul 9, 2016
19005b8
Merge branch 'libmf2.01' of https://github.com/yixuan/recosystem into…
yixuan Jul 10, 2016
089d52d
is_valid() is not so meaningful, removed
yixuan Jul 10, 2016
e1c2b38
refine documentation
yixuan Jul 10, 2016
d2da707
export data_memory()
yixuan Jul 10, 2016
7f5f805
update script to simulate data
yixuan Jul 10, 2016
402bfcc
use new function
yixuan Jul 10, 2016
afba586
reader of data in memory
yixuan Jul 10, 2016
6f5e7f1
typo
yixuan Jul 10, 2016
207a67b
add example for data_memory()
yixuan Jul 10, 2016
d710724
in-memory reader for testing data
yixuan Jul 10, 2016
dabb57c
add example
yixuan Jul 10, 2016
d0c0ea8
package information
yixuan Jul 10, 2016
acb1e25
update Rd files
yixuan Jul 10, 2016
11bac94
NEWS
yixuan Jul 10, 2016
2db7ec5
update code and formula in vignette
yixuan Jul 10, 2016
84ab5ec
updates on vignette
yixuan Jul 11, 2016
af96b2b
invisible NULL
yixuan Jul 11, 2016
f6e4218
refine vignette
yixuan Jul 11, 2016
3b8d612
README updates
yixuan Jul 11, 2016
README updates
yixuan committed Jul 11, 2016
commit 3b8d612237773a349ad752e179afeb77194b5a67
280 changes: 166 additions & 114 deletions README.md
@@ -1,20 +1,44 @@
## Recommender System with the recosystem Package

**Important Notes**: The API of this package has changed since version 0.3, due
to the API change of the underlying LIBMF library version 1.2.
### IMPORTANT NOTES

> The API of this package has changed since version 0.4, due
> to the API change of LIBMF 2.01 and some other design improvement.

- The `cost` option in `$train()` and `$tune()` has been expanded to and replaced
by `costp_l1`, `costp_l2`, `costq_l1`, and `costq_l2`, to allow for more
flexibility of the model.
- `$output()` has been renamed to `$export()`.
- Data input and output are now managed in a unified way via functions
`data_file()`, `data_memory()`, `out_file()`, `out_memory()`, and
`out_nothing()`. See section **Data Input and Output** below.
- As a result, a number of arguments in functions `$tune()`, `$train()`,
`$export()`, and `$predict()` now should be objects returned by these
input/output functions.

- `$convert_train()` and `$convert_test()` have been removed
- `$train()` and `$predict()` have different argument lists
- Added `$tune()` member function for parameter tuning
## Recommender System with the recosystem Package

### About This Package

`recosystem` is an R wrapper of the `LIBMF` library developed by
Yu-Chin Juan, Yong Zhuang, Wei-Sheng Chin and Chih-Jen Lin
(http://www.csie.ntu.edu.tw/~cjlin/libmf/),
an open source library for recommender system using marix factorization.
Yu-Chin Juan, Wei-Sheng Chin, Yong Zhuang, Bo-Wen Yuan, Meng-Yuan Yang,
and Chih-Jen Lin (http://www.csie.ntu.edu.tw/~cjlin/libmf/),
an open source library for recommender system using parallel matrix
factorization.

### Highlights of LIBMF and recosystem

`LIBMF` is a high-performance C++ library for large scale matrix factorization.
`LIBMF` itself is a parallelized library, meaning that
users can take advantage of multicore CPUs to speed up the computation.
It also utilizes some advanced CPU features to further improve the performance.

A more detailed introduction can be found in the vignette of this package.
`recosystem` is a wrapper of `LIBMF`, hence it inherits most of the features
of `LIBMF`, and additionally provides a number of user-friendly R functions to
simplify data processing and model building. Also, unlike most other R packages
for statistical modeling that store the whole dataset and model object in
memory, `LIBMF` (and hence `recosystem`) can significantly reduce memory use,
for instance the constructed model that contains information for prediction
can be stored in the hard disk, and output result can also be directly
written into a file rather than be kept in memory.

### A Quick View of Recommender System

@@ -31,23 +55,45 @@ rating matrix based on observed values, as is shown in the table below:

Each cell with number in it is the rating given by some user on a specific
item, while those marked with question marks are unknown ratings that need
to be predicted. In some other literatures, this problem may be given other
names, e.g. collaborative filtering, matrix completion, matrix recovery, etc.
to be predicted. In some other literatures, this problem may be named
collaborative filtering, matrix completion, matrix recovery, etc.

### Highlights of LIBMF and recosystem
In `recosystem`, we provide convenient functions for model training, parameter
tuning, model exporting, and model prediction.

### Data Input and Output

`LIBMF` itself is a parallelized library, meaning that users can take
advantage of multicore CPUs to speed up the computation. It also utilizes
some advanced CPU features to further improve the performance. [@LIBMF]
Each step in the recommender system involves data input and output, as the
table below shows:

`recosystem` is a wrapper of `LIBMF`, hence the features of `LIBMF`
are all included in `recosystem`. Also, unlike most other R packages for
statistical modeling which store the whole dataset and model object in memory,
`LIBMF` (and hence `recosystem`) is much hard-disk-based, for instance
the constructed model which contains information for prediction can be stored
in the hard disk, and prediction result can also be directly written into a file
rather than kept in memory. That is to say, `recosystem` will have a
comparatively small memory usage.
| Step | Input | Output |
|------------------|-------------------|----------------------------------|
| Model training | Training data set | -- |
| Parameter tuning | Training data set | -- |
| Exporting model | -- | User matrix `P`, item matrix `Q` |
| Prediction | Testing data set | Predicted values |

Data may have different formats and types of storage, for example the input
data set may be saved in a file or stored as R objects, and users may want
the output results to be directly written into file or to be returned as R
objects for further processing. In `recosystem`, we use two classes,
`DataSource` and `Output`, to handle data input and output in a unified way.

An object of class `DataSource` specifies the source of a data set (either
training or testing), which can be created by the following two functions:

- `data_file()`: Specifies a data set from a file in the hard disk
- `data_memory()`: Specifies a data set from R objects

And an object of class `Output` describes how the result should be output,
typically returned by the functions below:

- `out_file()`: Result should be saved to a file
- `out_memory()`: Result should be returned as R objects
- `out_nothing()`: Nothing should be output

More data source formats and output options may be supported in the future
along with the development of this package.
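
The two classes described above can be sketched as follows. This is a minimal illustration, not taken from the package documentation: the vectors `user`, `item`, and `score` are hypothetical, and the argument names `user_index`, `item_index`, and `rating` passed to `data_memory()` are assumed to match the installed version of `recosystem`. The package-dependent part runs only if `recosystem` is available.

```r
# Hypothetical in-memory training triplets with 0-based indices
user  = c(0L, 0L, 1L, 1L, 2L, 2L)
item  = c(0L, 2L, 1L, 3L, 0L, 3L)
score = c(2, 3, 4, 1, 5, 2)

train_src = NULL
if (requireNamespace("recosystem", quietly = TRUE)) {
    library(recosystem)

    # A DataSource built from R objects; index1 = FALSE says indices start at 0
    # (argument names are assumed, check ?data_memory for your version)
    train_src = data_memory(user_index = user, item_index = item,
                            rating = score, index1 = FALSE)

    # Output objects describing where a result should go
    to_file = out_file(tempfile())  # save the result to a file
    to_r    = out_memory()          # return the result as R objects
    to_none = out_nothing()         # produce no output
}
```

The same `DataSource` object can then be passed to `$tune()`, `$train()`, or `$predict()`, and an `Output` object to `$export()` or `$predict()`, so switching between files and R objects only changes how these objects are created.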

### Data Format

@@ -56,15 +102,13 @@ sparse matrix triplet form, i.e., each line in the file contains three
numbers

```
user_id item_id rating
user_index item_index rating
```

Testing data file is similar to training data, but since the ratings in
testing data are usually unknown, the `rating` entry in testing data file
can be omitted, or can be replaced by any placeholder such as `0` or `?`.

Be careful with the convention that `user_id` and `item_id` start from 0,
so the training data file for the example in the beginning will look like
User index and item index may start with either 0 or 1, and this can be
specified by the `index1` parameter in `data_file()` and `data_memory()`.
For example, with `index1 = FALSE`, the training data file for the rating matrix
in the beginning of this article may look like

```
0 0 2
@@ -76,7 +120,11 @@ so the training data file for the example in the beginning will look like
...
```

And testing data file is
Testing data file is similar to training data, but since the ratings in
testing data are usually unknown, the `rating` entry in testing data file
can be omitted, or can be replaced by any placeholder such as `0` or `?`.

The testing data file for the same rating matrix would be

```
0 2
@@ -85,12 +133,8 @@ And testing data file is
...
```

Since ratings for testing data are unknown, here we simply omit the third entry.
However if their values are really given, the testing data will serve as
a validation set on which RMSE of prediction can be calculated.

Example data files are contained in the `recosystem/dat`
(or `recosystem/inst/dat`, for source package) directory.
Example data files are contained in the `<recosystem>/dat`
(or `<recosystem>/inst/dat`, for source package) directory.
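
As a rough sketch of the triplet format, the base R code below turns a small rating matrix into training and testing files with 0-based indices (the `index1 = FALSE` convention). The matrix `ratings` and the file paths are hypothetical and serve only as an illustration; they are not the example data shipped with the package.

```r
# Hypothetical 3 x 4 rating matrix; NA marks an unknown rating
ratings = matrix(c( 2, NA,  3, NA,
                   NA,  4, NA,  1,
                    5, NA, NA,  2),
                 nrow = 3, byrow = TRUE)

# Observed cells as 1-based (row, col) pairs
obs = which(!is.na(ratings), arr.ind = TRUE)

# Shift to 0-based indices and attach the observed ratings
train = data.frame(user_index = obs[, "row"] - 1L,
                   item_index = obs[, "col"] - 1L,
                   rating     = ratings[obs])
train = train[order(train$user_index, train$item_index), ]

# One "user_index item_index rating" triplet per line
train_path = tempfile(fileext = ".txt")
write.table(train, train_path, sep = " ",
            row.names = FALSE, col.names = FALSE)

# Unknown cells form the testing set; the rating column is omitted
unk = which(is.na(ratings), arr.ind = TRUE)
test = data.frame(user_index = unk[, "row"] - 1L,
                  item_index = unk[, "col"] - 1L)
test = test[order(test$user_index, test$item_index), ]
test_path = tempfile(fileext = ".txt")
write.table(test, test_path, sep = " ",
            row.names = FALSE, col.names = FALSE)
```

Files written this way could then be wrapped with `data_file(train_path, index1 = FALSE)` and `data_file(test_path, index1 = FALSE)`.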

### Usage of recosystem

@@ -101,116 +145,124 @@ The usage of `recosystem` is quite simple, mainly consisting of the following st
along a set of candidate values.
3. Train the model by calling the `$train()` method. A number of parameters
can be set inside the function, possibly coming from the result of `$tune()`.
4. (Optionally) output the model, i.e. write the factorized $P$ and $Q$
matrices info files.
5. Use the `$predict()` method to compute predictions and write results
into a file.
4. (Optionally) export the model, i.e. write the factorization matrices
$P$ and $Q$ into files or return them as R objects.
5. Use the `$predict()` method to compute predicted values.

Below is an example on some simulated data:

```r
library(recosystem)
set.seed(123) # This is a randomized algorithm
trainset = system.file("dat", "smalltrain.txt", package = "recosystem")
testset = system.file("dat", "smalltest.txt", package = "recosystem")
train_set = data_file(system.file("dat", "smalltrain.txt", package = "recosystem"))
test_set = data_file(system.file("dat", "smalltest.txt", package = "recosystem"))
r = Reco()
opts = r$tune(trainset, opts = list(dim = c(10, 20, 30), lrate = c(0.05, 0.1, 0.2),
nthread = 1, niter = 10))
opts = r$tune(train_set, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2),
costp_l1 = 0, costq_l1 = 0,
nthread = 1, niter = 10))
opts
```

```
## $min
## $min$dim
## [1] 10
##
## $min$cost
## [1] 0.1
##
## $min$lrate
## [1] 0.05
##
##
## $res
## dim cost lrate rmse
## 1 10 0.01 0.05 0.9508706
## 2 20 0.01 0.05 0.9769276
## 3 30 0.01 0.05 0.9552881
## 4 10 0.10 0.05 0.9494486
## 5 20 0.10 0.05 0.9745281
## 6 30 0.10 0.05 0.9665343
## 7 10 0.01 0.10 1.0146531
## 8 20 0.01 0.10 1.0176182
## 9 30 0.01 0.10 1.0006795
## 10 10 0.10 0.10 0.9697273
## 11 20 0.10 0.10 0.9870130
## 12 30 0.10 0.10 0.9751481
## 13 10 0.01 0.20 1.1101094
## 14 20 0.01 0.20 1.0386463
## 15 30 0.01 0.20 1.0129634
## 16 10 0.10 0.20 1.0422394
## 17 20 0.10 0.20 1.0249771
## 18 30 0.10 0.20 1.0148717
```

```r
r$train(trainset, opts = c(opts$min, nthread = 1, niter = 10))
```

```
## iter tr_rmse obj
## 0 2.5987 6.9706e+04
## 1 1.8298 3.7380e+04
## 2 1.2323 2.0192e+04
## 3 0.9563 1.4674e+04
## 4 0.8542 1.3051e+04
## 5 0.8128 1.2467e+04
## 6 0.7926 1.2200e+04
## 7 0.7803 1.2033e+04
## 8 0.7725 1.1929e+04
## 9 0.7671 1.1863e+04
## real tr_rmse = 0.7411
$min
$min$dim
[1] 20

$min$costp_l1
[1] 0

$min$costp_l2
[1] 0.1

$min$costq_l1
[1] 0

$min$costq_l2
[1] 0.01

$min$lrate
[1] 0.1

$min$rmse
[1] 0.9804937


$res
dim costp_l1 costp_l2 costq_l1 costq_l2 lrate rmse
1 10 0 0.01 0 0.01 0.1 0.9996368
2 20 0 0.01 0 0.01 0.1 1.0040111
3 30 0 0.01 0 0.01 0.1 0.9967101
4 10 0 0.10 0 0.01 0.1 0.9930384
5 20 0 0.10 0 0.01 0.1 0.9804937
6 30 0 0.10 0 0.01 0.1 0.9921565
7 10 0 0.01 0 0.10 0.1 0.9857116
8 20 0 0.01 0 0.10 0.1 1.0006225
9 30 0 0.01 0 0.10 0.1 0.9891277
10 10 0 0.10 0 0.10 0.1 0.9826748
11 20 0 0.10 0 0.10 0.1 0.9807865
12 30 0 0.10 0 0.10 0.1 0.9863404
13 10 0 0.01 0 0.01 0.2 1.1022376
14 20 0 0.01 0 0.01 0.2 1.0266608
15 30 0 0.01 0 0.01 0.2 1.0039170
16 10 0 0.10 0 0.01 0.2 1.0734307
17 20 0 0.10 0 0.01 0.2 1.0393326
18 30 0 0.10 0 0.01 0.2 1.0003177
19 10 0 0.01 0 0.10 0.2 1.0769594
20 20 0 0.01 0 0.10 0.2 1.0323938
21 30 0 0.01 0 0.10 0.2 1.0061849
22 10 0 0.10 0 0.10 0.2 1.0365456
23 20 0 0.10 0 0.10 0.2 1.0023265
24 30 0 0.10 0 0.10 0.2 1.0044131
```

```r
outfile = tempfile()
r$predict(testset, outfile)
r$train(train_set, opts = c(opts$min, nthread = 1, niter = 10))
```

```
## prediction output generated at /tmp/RtmpqxN3AV/file2043363dc41b
iter tr_rmse obj
0 2.2673 5.3765e+04
1 1.0267 1.3667e+04
2 0.8372 1.0147e+04
3 0.7977 9.4773e+03
4 0.7703 9.0439e+03
5 0.7402 8.5967e+03
6 0.7048 8.1202e+03
7 0.6609 7.5638e+03
8 0.6133 7.0246e+03
9 0.5614 6.4770e+03
```

```r
## Compare the first few true values of testing data
## with predicted ones
# True values
print(read.table(testset, header = FALSE, sep = " ", nrows = 10)$V3)
## Write predictions to file
pred_file = tempfile()
r$predict(test_set, out_file(pred_file))
print(scan(pred_file, n = 10))
```

```
## [1] 3 4 2 3 3 4 3 3 3 3
[1] 3.92323 3.05510 2.98484 3.42607 2.53514 2.88135 2.93226 3.11718 2.40406 3.46282
```

```r
# Predicted values
print(scan(outfile, n = 10))
## Or, directly return an R vector
pred_rvec = r$predict(test_set, out_memory())
head(pred_rvec, 10)
```

```
## [1] 3.70478 3.02759 2.97616 3.46205 2.15736 3.03603 2.74433 2.96865
## [9] 2.02960 3.24131
[1] 3.923234 3.055096 2.984840 3.426066 2.535142 2.881347 2.932261 3.117176 2.404063
[10] 3.462822
```
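
The optional export step (step 4) is not shown in the example above. A minimal sketch is given below, assuming that `$export()` takes two `Output` objects named `out_P` and `out_Q`, and that with `out_memory()` it returns a list with components `P` and `Q`; both are assumptions to be checked against `?recosystem::export`. The training options here are arbitrary illustrative values, and the code runs only if `recosystem` is installed.

```r
mats = NULL
if (requireNamespace("recosystem", quietly = TRUE)) {
    library(recosystem)
    set.seed(123)  # This is a randomized algorithm

    train_set = data_file(system.file("dat", "smalltrain.txt",
                                      package = "recosystem"))
    r = Reco()
    # Arbitrary illustrative options, not tuned values
    r$train(train_set, opts = list(dim = 10, nthread = 1, niter = 5))

    # Return both factor matrices as R objects
    # (argument names out_P and out_Q are assumed)
    mats = r$export(out_P = out_memory(), out_Q = out_memory())

    # Or write P to a file and skip Q entirely
    p_file = tempfile()
    r$export(out_P = out_file(p_file), out_Q = out_nothing())
}
```

Keeping the matrices in memory is convenient for small models, while `out_file()` avoids holding large factor matrices in the R session.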

Detailed help document for each function is available in topics
`?recosystem::Reco`, `?recosystem::tune`, `?recosystem::train`,
`?recosystem::output` and `?recosystem::predict`.
`?recosystem::export` and `?recosystem::predict`.

### Installation Issue
### Performance Improvement with Extra Installation Options

`LIBMF` utilizes some compiler and CPU features that may be unavailable
in some systems. To build `recosystem` from source, one needs a C++
compiler that supports C++11 standard.
To build `recosystem` from source, one needs a C++ compiler that supports
the C++11 standard.

Also, there are some flags in file `src/Makevars`
(`src/Makevars.win` for Windows system) that may have influential
7 changes: 4 additions & 3 deletions inst/NEWS.Rd
@@ -6,9 +6,10 @@
\item Update LIBMF to version 2.01.
\item API change from LIBMF 2.01:
\itemize{
\item The \code{cost} option in \code{$train()} is expanded to
\code{costp_l1}, \code{costp_l2}, \code{costq_l1}, and
\code{costq_l2}.
\item The \code{cost} option in \code{$train()} and \code{$tune()}
has been expanded to and replaced by \code{costp_l1},
\code{costp_l2}, \code{costq_l1}, and \code{costq_l2}, to
allow for more flexibility of the model.
}
\item Other API change:
\itemize{
2 changes: 1 addition & 1 deletion vignettes/introduction.Rmd
@@ -226,7 +226,7 @@ Below is an example on some simulated data:
library(recosystem)
set.seed(123) # This is a randomized algorithm
train_set = data_file(system.file("dat", "smalltrain.txt", package = "recosystem"))
test_set = data_file(system.file("dat", "smalltest.txt", package = "recosystem"))
test_set = data_file(system.file("dat", "smalltest.txt", package = "recosystem"))
r = Reco()
opts = r$tune(train_set, opts = list(dim = c(10, 20, 30), lrate = c(0.1, 0.2),
costp_l1 = 0, costq_l1 = 0,