Exploratory Data Analysis

Authors
Affiliations
McMaster University
Brown University
Updated: 24 Apr 2026

Imports

All required Python libraries are imported below.

import altair as alt
import pandas as pd

The Altair settings below enable the VegaFusion data transformer, which is needed to plot large datasets, and hide the chart's action menu.

_ = alt.data_transformers.enable("vegafusion")
_ = alt.renderers.set_embed_options(actions=False)

About

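Perform exploratory data analysis (EDA) on a movies dataset, looking at the distribution of ratings and the relationship between IMDB and Rotten Tomatoes ratings.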

User Inputs

url = "https://vegafusion-datasets.s3.amazonaws.com/vega/movies_201k.parquet"

Get Data

%%time
df = pd.read_parquet(url)
df.head()
CPU times: user 158 ms, sys: 47.8 ms, total: 206 ms
Wall time: 427 ms

EDA

Distributions of Numerical Features

Show a histogram of IMDB ratings across all movies.

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        alt.X("IMDB_Rating:Q", bin=alt.Bin(maxbins=75)),
        y="count()",
    )
)
chart
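
To cross-check a histogram like this numerically, the binned counts can be approximated in pandas. The following is a minimal sketch, not the binning Altair performs: the `toy` frame is a hypothetical stand-in for `df`, and the fixed 0.5-wide bins are an assumption (with `maxbins=75`, Altair chooses its own bin edges).

```python
import pandas as pd

# Hypothetical stand-in for df; the real data comes from the parquet file above.
toy = pd.DataFrame({"IMDB_Rating": [6.1, 6.4, 7.2, 7.3, 7.4, 8.9, None]})

# Fixed-width bins of 0.5 over the 0-10 rating scale (an assumption).
edges = [i * 0.5 for i in range(21)]

# right=False makes each bin closed on the left, e.g. [7.0, 7.5).
binned = pd.cut(toy["IMDB_Rating"], bins=edges, right=False)

# Count rows per bin, ordered by bin; missing ratings fall in no bin
# and are therefore excluded from the counts.
counts = binned.value_counts().sort_index()
nonempty = counts[counts > 0]
print(nonempty)
```

On the real data, replacing `toy` with `df` should give counts that can be compared against the bar heights, keeping in mind that the bin edges may differ from the chart's.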

Relationships of Numerical Features

%%time
chart = alt.Chart(df).mark_rect().encode(
    alt.X('IMDB_Rating:Q', bin=alt.Bin(maxbins=60)),
    alt.Y('Rotten_Tomatoes_Rating:Q', bin=alt.Bin(maxbins=40)),
    alt.Color('count():Q', scale=alt.Scale(scheme='greenblue'))
)
chart
CPU times: user 843 μs, sys: 0 ns, total: 843 μs
Wall time: 848 μs
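
A binned heatmap shows the joint distribution of the two ratings; a single-number summary of the same relationship is their correlation. The following is a minimal sketch on synthetic data: the `toy` frame is hypothetical, and on the real data `df` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for df, with one missing IMDB rating.
toy = pd.DataFrame(
    {
        "IMDB_Rating": [5.0, 6.0, 7.0, 8.0, None],
        "Rotten_Tomatoes_Rating": [30.0, 50.0, 70.0, 90.0, 40.0],
    }
)

# Pearson correlation; Series.corr uses only rows where both values are present,
# so the row with the missing IMDB rating is ignored.
r = toy["IMDB_Rating"].corr(toy["Rotten_Tomatoes_Rating"])
print(round(r, 3))
```

The toy values are perfectly linear by construction, so the result says nothing about the movies dataset itself.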