LEADS is a Lazy Exploratory Analysis Data Summarizer.
Writing the same boilerplate exploratory analysis code in a Jupyter notebook or Excel spreadsheet for each new dataset can be tedious. This tool automates the generation of a consistent, comprehensive, and human-readable exploratory analysis report so that you can quickly become familiar with a dataset. The generated PDF report contains the features listed below.
Currently supports `.csv`, `.tsv`, and `.parquet` files as input and `.pdf` as the report format (additional report formats, such as Markdown, are planned).
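As a rough illustration of how input handling by file extension could work, the sketch below dispatches on the suffix using only the Python standard library. The names `SUPPORTED_INPUTS` and `load_rows` are hypothetical and are not taken from LEADS itself; Parquet is recognized but not read here, since it would need a third-party reader.

```python
import csv
from pathlib import Path

# Hypothetical mapping from file extension to delimiter. None marks a
# binary format that needs a dedicated reader (not shown in this sketch).
SUPPORTED_INPUTS = {".csv": ",", ".tsv": "\t", ".parquet": None}


def load_rows(path):
    """Return the rows of a .csv or .tsv file as a list of dicts."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_INPUTS:
        raise ValueError(f"Unsupported input format: {suffix!r}")

    delimiter = SUPPORTED_INPUTS[suffix]
    if delimiter is None:
        # Assumption: Parquet would be handled by a library such as pyarrow.
        raise NotImplementedError("Parquet needs a third-party reader")

    # The first row is treated as the header; all values stay as strings.
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

Checking the extension before opening the file means an unsupported format is rejected early, without touching the file system.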
- Report features:
  - Title page.
  - Table of contents.
  - Page numbers.
  - Run metadata.
  - Glossary of statistical terms (will be continually updated as new features are built out).
- Report analysis sections:
  - Data type analysis:
    - Identification of feature data types.
  - Basic dataset information and descriptive statistics:
    - Number of rows and columns.
    - Column names and data types.
    - Min, max, mean, median, standard deviation.
    - Quartiles and interquartile ranges.
    - Skewness and kurtosis.
  - Missing value analysis:
    - Count and percentage of missing values per column.
    - Visualization of missing value patterns.
  - Distribution analysis:
    - Normality tests (Shapiro-Wilk, Anderson-Darling).
    - Q-Q plots.
  - Outlier detection:
    - Z-score method.
    - IQR method.
    - Local outlier factor (LOF).
    - Visualization of outliers.
  - Visualizations:
    - Histograms.
    - Box plots.
    - Scatter plots.
    - Correlation heatmaps.
    - Pair plots for multivariate data.
  - Unique value counts for categorical variables.
  - Multicollinearity checks:
    - Correlation matrix.
    - Variance inflation factor (VIF).
  - Pairwise data exploration:
    - Scatter plot matrix.
    - Correlation analysis.
  - Dimensionality reduction:
    - Principal component analysis (PCA).
    - t-SNE visualization.
  - Feature importance:
    - For categorical variables: chi-squared test.
    - For numerical target variables: correlation analysis.
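
To make two of the outlier detection methods listed above concrete, here is a minimal sketch of the Z-score and IQR rules using only the Python standard library. The function names, the default thresholds (3.0 standard deviations and 1.5 times the IQR), and the inclusive quartile method are illustrative assumptions, not necessarily what LEADS uses internally.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: quartiles computed with the "inclusive" method.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    sample = [10, 12, 11, 13, 12, 11, 10, 12, 95]
    print("Z-score outliers:", zscore_outliers(sample))
    print("IQR outliers:", iqr_outliers(sample))
```

In small samples a single extreme value inflates the standard deviation, so the Z-score rule can miss it at the usual threshold, while the IQR rule is less sensitive to this effect. That difference is one reason a report may present both methods side by side.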