Command line tool for automating your dataset exploratory analysis.

seankim658/leads

LEADS

LEADS is a Lazy Exploratory Analysis Data Summarizer.

Writing the same boilerplate exploratory analysis code in a Jupyter notebook or Excel spreadsheet for each new dataset can be tedious. This tool automates the generation of a consistent, comprehensive, and human-readable exploratory analysis report so that you can quickly become familiar with a dataset. The generated PDF report contains the features listed below.

LEADS currently supports .csv, .tsv, and .parquet files as input and produces reports as .pdf files. Additional report formats, such as Markdown, are planned.

Feature List

  • Report features:
    • Title page.
    • Table of contents.
    • Page numbers.
    • Run metadata.
    • Glossary of statistical terms (will be continually updated as new features are built out).
  • Report analysis sections:
    • Data type analysis:
      • Identification of feature data types.
    • Basic dataset information and descriptive statistics:
      • Number of rows and columns.
      • Column names and data types.
      • Min, max, mean, median, standard deviation.
      • Quartiles and interquartile ranges.
      • Skewness and kurtosis.
    • Missing value analysis:
      • Count and percentage of missing values per column.
      • Visualization of missing value patterns.
    • Distribution analysis:
      • Normality tests (Shapiro-Wilk, Anderson-Darling).
      • Q-Q plots.
    • Outlier detection:
      • Z-score method.
      • IQR method.
      • Local outlier factor (LOF).
      • Visualization of outliers.
    • Visualizations:
      • Histograms.
      • Box plots.
      • Scatter plots.
      • Correlation heatmaps.
      • Pair plots for multivariate data.
      • Unique value counts for categorical variables.
    • Multicollinearity checks:
      • Correlation matrix.
      • Variance inflation factor (VIF).
    • Pairwise data exploration:
      • Scatter plot matrix.
      • Correlation analysis.
    • Dimensionality reduction:
      • Principal component analysis (PCA).
      • t-SNE visualization.
    • Feature importance:
      • For categorical variables: chi-squared test.
      • For numerical target variables: correlation analysis.
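
To make a few of the listed checks concrete, the sketch below shows how the missing value count and percentage, the IQR method, and the Z-score method are commonly computed. It is an illustration only and is not taken from the LEADS source. The function names (`missing_summary`, `iqr_outliers`, `zscore_outliers`), the 1.5 IQR multiplier, and the Z-score threshold of 3 are assumptions chosen here, and the quartiles come from Python's `statistics.quantiles` default method, which may differ from whatever LEADS uses.

```python
import statistics


def missing_summary(column):
    """Return (count, percentage) of missing entries, treating None as missing."""
    if not column:
        return 0, 0.0
    missing = sum(1 for value in column if value is None)
    return missing, 100.0 * missing / len(column)


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is a common convention)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score (using the sample std dev) exceeds threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / stdev) > threshold]


# Hypothetical example column with one obviously extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(missing_summary([1, None, 3, None]))
print(iqr_outliers(sample))
print(zscore_outliers(sample, threshold=2.0))
```

On very small samples the Z-score method can miss an extreme value at the usual threshold of 3, because the outlier itself inflates the standard deviation. This is one reason a report may show several outlier methods side by side.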
