Handling Categorical Data / Validation: switching from np.ndarray -> pd.DataFrame #120

Open
indialindsay opened this issue Oct 11, 2022 · 4 comments

Comments

@indialindsay
Contributor

Handling categorical data in HDDDM: ask the user to set the dtype of categorical columns to category, then label encode.

For DetectA: if the user inputs a category dtype, then apply get_dummies().

Switch from np.ndarray to pd.DataFrame for the validation steps. Why?

  • It allows us to store the mapping between categorical variables and their encoding, so we can help the user identify which class or which categories are experiencing drift.
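
A minimal sketch of the mapping idea, assuming pandas' built-in category dtype is used for the label encoding. The column name and values are made up for illustration, and nothing here is part of the library's API:

```python
import pandas as pd

# Hypothetical feature column; the values are illustrative only.
df = pd.DataFrame({"animal": ["cat", "dog", "marmoset", "dog", "cat"]})

# Declare the column as categorical, then label encode via the category codes.
df["animal"] = df["animal"].astype("category")
codes = df["animal"].cat.codes.to_numpy()

# The mapping from integer code back to category name stays on the dtype,
# so it can be used later to report which categories are drifting.
code_to_label = dict(enumerate(df["animal"].cat.categories))
```

With a plain np.ndarray only `codes` would survive validation; `code_to_label` is the piece a DataFrame lets us keep.
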
@tms-bananaquit
Collaborator

If we have an internal encoder for each detector (or even just labels), then we might want to set up examples so that they decode, e.g. histograms which swap out "(0,1], (2,3], ..." for "cat, dog, marmoset".
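
A rough sketch of what such decoding could look like in an example, assuming the detector (or the user) still has the list of labels in code order. The names `codes` and `labels` are hypothetical:

```python
import numpy as np

# Hypothetical label-encoded feature and the labels in code order.
codes = np.array([0, 1, 2, 1, 0, 0])
labels = ["cat", "dog", "marmoset"]

# Count occurrences per integer code, then key the counts by label
# so that a histogram can show category names instead of numeric bins.
counts = np.bincount(codes, minlength=len(labels))
histogram = dict(zip(labels, counts.tolist()))
```
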

@indialindsay
Contributor Author

I'm leaning towards requiring the user to convert categorical features with get_dummies as a preprocessing step before using our detectors (thinking this through in the context of DetectA). Automatically converting categorical features to dummies seems risky. Requiring the user to preprocess categorical data prior to use would also be in line with other ML libraries.

I can add a note about label encoding to the HDDDM example / documentation, and a note and example on using get_dummies to DetectA.
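
Something like the following could serve as the preprocessing example, as a sketch only. The data and column names are invented, and the final array is simply what a user would hand to a detector that expects numeric input:

```python
import pandas as pd

# Hypothetical raw data with one numeric and one categorical feature.
raw = pd.DataFrame({
    "size": [1.0, 2.5, 0.7, 3.1],
    "animal": ["cat", "dog", "marmoset", "dog"],
})

# One-hot encode the categorical column before it reaches a detector.
encoded = pd.get_dummies(raw, columns=["animal"], dtype=float)

# Keep the column names so drift can be traced back to a category,
# then pass a plain numeric array on.
feature_names = list(encoded.columns)
X = encoded.to_numpy()
```
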

Thoughts? @tms-bananaquit

@indialindsay
Contributor Author

Saving DetectA's handling of categorical features for later.

Problem: Because we use PCA to whiten the matrix, categorical variables must be one-hot encoded. In the resulting covariance matrix, several of the one-hot encoded features end up with identical rows, because the correlation between them and the other one-hot encoded features is either 0 or 1. The determinant of the covariance matrix is therefore 0, so we cannot compute the inverse and calculate the T2 statistic.
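
The singularity can be reproduced in isolation. The sketch below is not DetectA code; it uses made-up data and relies only on the fact that the one-hot columns of a single categorical feature always sum to 1, which makes them linearly dependent:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature with three levels.
animals = pd.Series(
    ["cat", "dog", "marmoset", "dog", "cat", "marmoset", "cat", "dog"]
)

# One-hot encode every level; each row of the result sums to exactly 1.
one_hot = pd.get_dummies(animals, dtype=float).to_numpy()

# Covariance across the three dummy columns (columns are the variables).
cov = np.cov(one_hot, rowvar=False)

# The linear dependency leaves the matrix rank deficient, so its
# determinant is zero up to floating-point error and it has no inverse.
rank = np.linalg.matrix_rank(cov)
det = np.linalg.det(cov)
```

Dropping one dummy column per feature (for example `drop_first=True` in `pd.get_dummies`) removes this particular dependency, though whether that is suitable for DetectA is not settled in this thread.
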

Potential solutions could consider another form of encoding for the categorical variables. For example, encode using one-hot for PCA and then convert back to label encoding?

@tms-bananaquit
Collaborator

After more discussion, we'll likely leave proper encoding to the user, with examples. One can imagine cases where the incoming data is already properly encoded, e.g. as it comes out of a query, so asking the user to back-convert to a dataframe could duplicate work for them and add the burden of more code for us. I'll think a bit more about this and see whether there's good reason to make other tweaks to the validation.

2 participants