Exploratory Data Analysis

For a data science master's thesis.

Introduction

  • Evidence is tables
  • Examples of data visualization
  • References

The Shape of DS problems

  • Dataset / Domain
  • Model / Method
  • Metric / Task

Machine Learning

  • Features
  • Distances
  • Clusters
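
The three ideas above can be shown together in a minimal sketch. The feature vectors, the point names, and the two centroids are hypothetical values chosen for illustration, and nearest-centroid assignment is only one simple way to form clusters.

```python
import math

# Hypothetical 2-D feature vectors, keyed by an arbitrary example name.
features = {
    "a": (0.0, 0.1),
    "b": (0.2, 0.0),
    "c": (5.0, 5.1),
    "d": (5.2, 4.9),
}

# Hypothetical cluster centers (assumed known here, not learned).
centroids = [(0.0, 0.0), (5.0, 5.0)]


def nearest_centroid(x, centers):
    """Return the index of the center with the smallest Euclidean distance to x."""
    distances = [math.dist(x, c) for c in centers]
    return distances.index(min(distances))


# Cluster label per example: the index of its closest centroid.
clusters = {name: nearest_centroid(x, centroids) for name, x in features.items()}
print(clusters)
```

Swapping `math.dist` for another distance function changes which points count as close, which is why the choice of distance deserves attention before any clustering result is interpreted.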

The Routine

  1. Find shapes of inputs and outputs
  2. Run model on single example
  3. Compute metrics
  4. Establish baseline
  5. Investigate issues, i.e. find root causes
  6. Expand literature survey
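
Steps 1 to 4 of the routine can be sketched in a few lines. Everything here is an assumption made for illustration: the toy dataset, the `threshold_model` function, and the choice of accuracy as the metric stand in for whatever the thesis actually uses.

```python
from collections import Counter
import random


def make_toy_dataset(n=200, seed=0):
    """Hypothetical stand-in dataset: two features, binary label."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 1 points are shifted by +2 along both axes.
        xs.append((rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)))
        ys.append(label)
    return xs, ys


def threshold_model(x):
    """Hypothetical model: predict 1 when the feature sum exceeds 2."""
    return int(x[0] + x[1] > 2.0)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


xs, ys = make_toy_dataset()

# Step 1: find shapes of inputs and outputs.
print("inputs:", len(xs), "x", len(xs[0]), "| outputs:", len(ys))

# Step 2: run the model on a single example.
print("example:", xs[0], "->", threshold_model(xs[0]), "| true:", ys[0])

# Step 3: compute the metric over the whole dataset.
model_acc = accuracy(ys, [threshold_model(x) for x in xs])

# Step 4: establish a baseline (always predict the majority class).
majority = Counter(ys).most_common(1)[0][0]
baseline_acc = accuracy(ys, [majority] * len(ys))

print("model accuracy:", model_acc, "| baseline accuracy:", baseline_acc)
```

The majority-class baseline gives the number that any model has to beat before its metric means anything; steps 5 and 6 start from the examples where the model and the labels disagree.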

The Setup

  • Boilerplate
  • MNIST
  • How steps 1-4 are expressed in the example
  • How steps 5-6 are often ignored in literature
  • Implementing steps 5-6
  • t-SNE, UMAP, SHAP, etc.
  • Example: deepfake detection

The Breakdown

  1. config
  2. data
  3. models
  4. metrics
  5. visualization
  6. experiment
  7. pipeline
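
One possible way the seven pieces could fit together is sketched below in a single file. The function names, the `Config` fields, and the toy threshold task are all hypothetical; a real project would likely split these into separate modules.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class Config:  # 1. config
    n_samples: int = 100
    threshold: float = 0.5
    seed: int = 0


def load_data(cfg):  # 2. data
    """Toy data: a number in [0, 1) labelled 1 when it exceeds 0.5."""
    rng = random.Random(cfg.seed)
    xs = [rng.random() for _ in range(cfg.n_samples)]
    ys = [int(x > 0.5) for x in xs]
    return xs, ys


def build_model(cfg):  # 3. models
    """Toy model: threshold the input at cfg.threshold."""
    return lambda x: int(x > cfg.threshold)


def accuracy(y_true, y_pred):  # 4. metrics
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def text_bar(value, width=20):  # 5. visualization
    """Render a value in [0, 1] as a fixed-width text bar."""
    filled = round(value * width)
    return "#" * filled + "." * (width - filled)


def run_experiment(cfg):  # 6. experiment
    xs, ys = load_data(cfg)
    model = build_model(cfg)
    return accuracy(ys, [model(x) for x in xs])


def run_pipeline(configs):  # 7. pipeline
    """Run one experiment per config and collect the results."""
    return {cfg.threshold: run_experiment(cfg) for cfg in configs}


results = run_pipeline([Config(threshold=t) for t in (0.3, 0.5, 0.7)])
for threshold, acc in results.items():
    print(f"threshold={threshold:.1f} acc={acc:.2f} {text_bar(acc)}")
```

Keeping the config immutable and passing it to every stage means each experiment can be reproduced from its config alone.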

The Payoff

  • Other classification datasets
  • Other models
  • SQL queries
  • Dashboards
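
As a sketch of the SQL payoff: once every run writes its metrics to one table, comparing models across datasets becomes a query. The table layout, the dataset and model names, and the metric values below are placeholders invented for illustration, not real results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (dataset TEXT, model TEXT, metric TEXT, value REAL)"
)

# Placeholder rows; the numbers are made up to demonstrate the query.
rows = [
    ("dataset_a", "baseline", "accuracy", 0.50),
    ("dataset_a", "model_x", "accuracy", 0.80),
    ("dataset_b", "baseline", "accuracy", 0.60),
    ("dataset_b", "model_x", "accuracy", 0.70),
]
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", rows)

# Gain of each non-baseline model over the baseline on the same dataset.
query = """
SELECT m.dataset, m.model, ROUND(m.value - b.value, 2) AS gain
FROM runs AS m
JOIN runs AS b
  ON m.dataset = b.dataset
 AND m.metric = b.metric
 AND b.model = 'baseline'
WHERE m.model != 'baseline'
ORDER BY m.dataset
"""
gains = conn.execute(query).fetchall()
for dataset, model, gain in gains:
    print(dataset, model, gain)
```

The same table can back a dashboard later, since a dashboard is mostly a set of saved queries with charts on top.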

The Meta

  • Command line
  • Editor / IDE
  • sqlite3, jq
  • Datasette
  • VisiData
  • Streamlit
  • HTML + CSS + JS
  • FastAPI
  • CSV + TikZ

EDA for the Meta

  • Do a literature survey and scrape repository URLs
  • Clone the repositories
  • Define heuristics for language, file structure, size, and dependencies
  • Visualize the results as an infographic
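
The heuristics step could look roughly like the sketch below, run against an already cloned repository. The extension-to-language table, the `summarize_repo` name, and the use of `requirements.txt` as a dependency signal are assumptions; real heuristics would need more cases.

```python
from collections import Counter
from pathlib import Path
import tempfile

# Hypothetical mapping; extend it for the languages seen in the survey.
EXT_TO_LANG = {
    ".py": "Python",
    ".ipynb": "Jupyter Notebook",
    ".r": "R",
    ".jl": "Julia",
    ".cpp": "C++",
}


def summarize_repo(root):
    """Collect simple per-repository statistics from a local checkout."""
    root = Path(root)
    files = [
        p
        for p in root.rglob("*")
        # Skip anything under the .git directory.
        if p.is_file() and ".git" not in p.relative_to(root).parts
    ]
    languages = Counter(EXT_TO_LANG.get(p.suffix.lower(), "other") for p in files)
    return {
        "n_files": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "languages": dict(languages),
        "has_requirements": (root / "requirements.txt").is_file(),
    }


# Demo on a throwaway directory standing in for a cloned repository.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.py").write_text("print('hi')\n")
    (Path(tmp) / "requirements.txt").write_text("numpy\n")
    demo_summary = summarize_repo(tmp)

print(demo_summary)
```

Writing one such summary per repository to a CSV file or a SQLite table gives the tabular input that the infographic step needs.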

Frustrations with the system

A look at Google Scholar.