fast_feature_selection

HYBparsimony is explained in:

  • Automatic Hyperparameter Optimization and Feature Selection with HYBparsimony Package
  • Searching Parsimonious Models in Small and High-Dimensional Datasets with HYBparsimony Python Package

versus the methods explained in:

  • Efficient Feature Selection via CMA-ES (Covariance Matrix Adaptation Evolution Strategy)
  • Efficient Feature Selection via Genetic Algorithms
The code was extracted from https://github.com/FlorinAndrei/fast_feature_selection and adapted to include the following HYBparsimony benchmark:

```python
import random

import numpy as np
import statsmodels.api as sm
from hybparsimony import HYBparsimony

def fitness_custom(cromosoma, **kwargs):
    # X (features DataFrame) and y (response) come from the benchmark dataset
    X_train = kwargs["X"]
    y_train = kwargs["y"]

    # Keep only the columns selected by the chromosome's boolean mask
    X_fs_selec = X_train.loc[:, cromosoma.columns]
    predictor = sm.OLS(y_train, X_fs_selec, hasconst=True).fit()
    fitness_val = -predictor.bic  # HYBparsimony maximizes fitness, so negate BIC
    # Return (fitness, complexity): complexity is the number of selected features
    return np.array([fitness_val, np.sum(cromosoma.columns)]), predictor

random.seed(0)
num_indiv_hyb = 20
HYBparsimony_model = HYBparsimony(fitness=fitness_custom,
                                  features=X.columns,
                                  rerank_error=1.0,  # BIC gap tolerated to promote more parsimonious solutions
                                  seed_ini=0,
                                  npart=num_indiv_hyb,  # population of 20 individuals
                                  maxiter=10000,
                                  early_stop=500,
                                  verbose=0,
                                  n_jobs=1)

HYBparsimony_model.fit(X, y)
print(HYBparsimony_model.best_complexity,
      HYBparsimony_model.best_score,
      HYBparsimony_model.minutes_total)
```

The regression model is statsmodels.api.OLS(). The objective function used to select the best features is the Bayesian Information Criterion (BIC); lower is better.
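For context, here is a minimal, self-contained sketch of how a single candidate feature subset is scored; the synthetic X, y and the mask are illustrative stand-ins for the benchmark data:

```python
# Minimal sketch of scoring one candidate subset: fit OLS on the masked
# columns and read its BIC. The data and mask are illustrative stand-ins.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
y = 2.0 * X["f0"] + rng.normal(size=200)

mask = np.array([True, False, True, False, False])   # candidate feature subset
model = sm.OLS(y, X.loc[:, mask], hasconst=True).fit()
print(model.bic)                                     # lower BIC is better
```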

Four feature selection techniques are explored:

  • Sequential Feature Search (SFS), implemented via the mlxtend library
  • Genetic Algorithms (GA), implemented via the deap library
  • Covariance Matrix Adaptation Evolution Strategy (CMA-ES), implemented via the cmaes library (see the sketch after this list)
  • HYBparsimony, a Python package that simultaneously performs automatic feature selection (FS), model hyperparameter optimization (HO), and parsimonious model selection (PMS) using GA and PSO
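As a rough illustration of the CMA-ES approach (a hedged sketch, not the repository's exact code): optimize a continuous vector, threshold it at 0.5 to obtain a binary feature mask, and minimize the BIC of the resulting OLS fit.

```python
# Hedged sketch of CMA-ES feature selection: continuous candidates are
# thresholded into boolean masks; the synthetic data is illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from cmaes import CMA

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 8)), columns=[f"f{i}" for i in range(8)])
y = X["f0"] + 0.5 * X["f3"] + rng.normal(size=200)

def bic_of_mask(mask):
    if not mask.any():                    # empty subset: return a large penalty
        return 1e12
    return sm.OLS(y, X.loc[:, mask], hasconst=True).fit().bic

optimizer = CMA(mean=np.full(X.shape[1], 0.5), sigma=0.3, seed=0)
for _ in range(50):                       # generations
    solutions = []
    for _ in range(optimizer.population_size):
        x = optimizer.ask()               # continuous candidate vector
        solutions.append((x, bic_of_mask(x > 0.5)))   # threshold -> mask
    optimizer.tell(solutions)             # update mean/covariance; minimizes BIC
```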

SFS and GA used a multiprocessing pool with 24 workers to run the objective function; CMA-ES ran everything in a single process, and HYBparsimony was likewise run single-process (n_jobs=1, per the code above).
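The worker-pool pattern looks roughly like this minimal sketch; evaluate_mask is a hypothetical stand-in for the real OLS/BIC objective:

```python
# Minimal sketch of parallel objective evaluation with 24 workers.
# evaluate_mask is a hypothetical stand-in, not the repository's objective.
from multiprocessing import Pool

import numpy as np

def evaluate_mask(mask):
    return float(np.sum(mask))            # placeholder score, not the real BIC

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    candidate_masks = [rng.random(10) > 0.5 for _ in range(40)]  # one per individual
    with Pool(processes=24) as pool:      # 24 workers, as in the benchmark
        scores = pool.map(evaluate_mask, candidate_masks)
    print(scores[:5])
```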

Test system:

  • AMD Ryzen Threadripper 3960X 24-Core
  • Ubuntu 22.04
  • Python 3.10.13

Results:

Run time (lower is better):

SFS:           79.762 sec
GA:            240.776 sec
CMA-ES:        70.152 sec
HYB-PARSIMONY: 101.067 sec

Number of selected features:

baseline:      214
SFS:           36
GA:            33
CMA-ES:        35
HYB-PARSIMONY: 32

Number of times the objective function was invoked (lower is better):

SFS:           22791
GA:            600525
CMA-ES:        20000
HYB-PARSIMONY: 33520

Best objective function value found (lower is better):

baseline BIC:  34570.1662
SFS:           33708.9860
GA:            33706.2129
CMA-ES:        33712.1037
HYB-PARSIMONY: 33710.6326