Exploratory Data Analysis
For a Data Science Master's Thesis.
Introduction
- Evidence is tables
- Examples of data visualization
- References
The Shape of DS problems
- Dataset / Domain
- Model / Method
- Metric / Task
Machine Learning
- Features
- Distances
- Clusters
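
The three bullets above form a chain: examples become feature vectors, vectors are compared by a distance, and distances induce clusters. A minimal sketch in plain Python, assuming Euclidean distance and a small k-means loop. The toy points and the names `euclidean` and `kmeans` are illustrative assumptions, not part of the outline.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=10):
    """Tiny k-means. The first k points serve as initial centroids
    (an assumption made here so the result is deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Made-up features: two well-separated groups in 2D.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.3), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Swapping `euclidean` for another distance changes which points end up together, which is usually the first thing worth checking when clusters look wrong.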
The Routine
- Find shapes of inputs and outputs
- Run model on single example
- Compute metrics
- Establish baseline
- Investigate issues, i.e. find root causes
- Expand literature survey
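
Steps 1 to 5 of the routine can be sketched on a toy problem. Everything below is an assumption made for illustration: the data is made up and `model` is a hypothetical threshold rule standing in for a trained model. Step 6 (the literature survey) has no code counterpart.

```python
from collections import Counter

# Made-up dataset: each input is one float, each output is a label in {0, 1}.
inputs = [0.1, 0.6, 0.35, 0.8, 0.9, 0.65, 0.2, 0.55]
targets = [0, 0, 1, 1, 1, 1, 0, 1]

# Step 1: find the shapes of inputs and outputs.
print(len(inputs), len(targets), sorted(set(targets)))

# Step 2: run the model on a single example before looping over the dataset.
def model(x, threshold=0.5):
    """Hypothetical stand-in for a trained model: thresholds the input."""
    return int(x > threshold)

print(model(inputs[0]))

# Step 3: compute a metric (accuracy here).
def accuracy(predictions, targets):
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

predictions = [model(x) for x in inputs]
model_acc = accuracy(predictions, targets)

# Step 4: establish a baseline, here "always predict the most common label".
majority = Counter(targets).most_common(1)[0][0]
baseline_acc = accuracy([majority] * len(targets), targets)

# Step 5: investigate issues by listing the examples the model gets wrong.
errors = [(i, x, t, p)
          for i, (x, t, p) in enumerate(zip(inputs, targets, predictions))
          if t != p]
print(model_acc, baseline_acc)
print(errors)
```

The gap between `model_acc` and `baseline_acc` is the number to report. The `errors` list is where root-cause analysis starts.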
The Setup
- Boilerplate
- MNIST
- How steps 1-4 are expressed in the example
- How steps 5-6 are often ignored in literature
- Implementing steps 5 and 6
- t-SNE, UMAP, SHAP, etc.
- Example: deepfake detection
The Breakdown
- config
- data
- models
- metrics
- visualization
- experiment
- pipeline
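
One possible reading of the breakdown, collapsed into a single file with one function per bullet. This is a sketch under assumptions: the synthetic data, the threshold "model", and all names (`Config`, `load_data`, `build_model`, `compute_metrics`, `render`, `run_experiment`, `pipeline`) are hypothetical and only illustrate how the pieces could fit together.

```python
from dataclasses import dataclass
import random

@dataclass(frozen=True)
class Config:                      # config: everything that defines one run
    threshold: float
    n_examples: int = 200
    seed: int = 0

def load_data(cfg):                # data: synthetic (x, y) pairs, made up here
    rng = random.Random(cfg.seed)
    xs = [rng.random() for _ in range(cfg.n_examples)]
    return [(x, int(x > 0.6)) for x in xs]

def build_model(cfg):              # models: a threshold rule stands in for a real model
    return lambda x: int(x > cfg.threshold)

def compute_metrics(model, data):  # metrics: accuracy only
    correct = sum(model(x) == y for x, y in data)
    return {"accuracy": correct / len(data)}

def render(results):               # visualization: a text bar chart
    return "\n".join(
        f"{cfg.threshold:.1f} " + "#" * round(40 * metrics["accuracy"])
        for cfg, metrics in results
    )

def run_experiment(cfg):           # experiment: one config in, one metrics dict out
    data = load_data(cfg)
    return compute_metrics(build_model(cfg), data)

def pipeline(configs):             # pipeline: many experiments, one report
    return [(cfg, run_experiment(cfg)) for cfg in configs]

results = pipeline([Config(threshold=t) for t in (0.2, 0.4, 0.6, 0.8)])
print(render(results))
```

Because each piece takes a `Config` and returns plain data, swapping the dataset or the model means changing one function and leaving the rest alone.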
The Payoff
- other classification datasets
- other models
- SQL queries
- dashboards
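
Once every run writes its metrics to one table, comparisons across datasets and models become queries. A minimal sketch with the standard-library `sqlite3` module. The table name `runs`, its columns, and all rows are placeholders invented for the example, not real results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (dataset TEXT, model TEXT, accuracy REAL)")

# Placeholder rows: names and numbers are made up for illustration only.
rows = [
    ("dataset_a", "model_x", 0.70),
    ("dataset_a", "model_y", 0.80),
    ("dataset_b", "model_x", 0.65),
    ("dataset_b", "model_y", 0.60),
]
conn.executemany("INSERT INTO runs VALUES (?, ?, ?)", rows)

# Best model per dataset, written as a correlated subquery.
best = conn.execute(
    """
    SELECT r.dataset, r.model, r.accuracy
    FROM runs AS r
    WHERE r.accuracy = (SELECT MAX(accuracy) FROM runs WHERE dataset = r.dataset)
    ORDER BY r.dataset
    """
).fetchall()
print(best)
conn.close()
```

The same table can feed a dashboard directly, since most dashboard tools accept a SQLite file or a query result as input.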
The Meta
- command line
- editor / IDE
- sqlite3, jq
- datasette
- visidata
- streamlit
- HTML + CSS + JS
- fastapi
- CSV + TikZ
EDA for the Meta
- Do literature survey, scrape repo urls
- Clone repos
- Define heuristics for language, file structures, size, dependencies
- Visualize as infographic
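
The heuristics step can be sketched with the standard library alone. The extension-to-language map, the list of dependency manifests, and the function name `summarize_repo` are assumptions chosen for illustration. The demo runs on a throwaway directory that stands in for a cloned repo.

```python
from collections import Counter
from pathlib import Path
import tempfile

# Hypothetical mapping from file extension to language; extend as needed.
EXT_TO_LANG = {".py": "Python", ".ipynb": "Jupyter", ".r": "R",
               ".jl": "Julia", ".cpp": "C++"}
# Hypothetical dependency manifests to look for at the repo root.
MANIFESTS = ("requirements.txt", "pyproject.toml", "environment.yml", "setup.py")

def summarize_repo(root):
    """Collect simple per-repo statistics: language, structure, size, dependencies."""
    root = Path(root)
    files = [p for p in root.rglob("*")
             if p.is_file() and ".git" not in p.relative_to(root).parts]
    languages = Counter(EXT_TO_LANG[p.suffix.lower()]
                        for p in files if p.suffix.lower() in EXT_TO_LANG)
    return {
        "n_files": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "languages": dict(languages),
        "top_level": sorted(p.name for p in root.iterdir()),
        "manifests": [m for m in MANIFESTS if (root / m).exists()],
    }

# Demo on a temporary directory standing in for a cloned repo.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "src").mkdir()
    (repo / "src" / "train.py").write_text("print('train')\n")
    (repo / "notebook.ipynb").write_text("{}")
    (repo / "requirements.txt").write_text("numpy\n")
    summary = summarize_repo(repo)

print(summary)
```

One such dictionary per cloned repo, collected into a table, is enough input for the infographic.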
Frustrations with the system
- A look at Google Scholar