The aim of this project is to produce interactive plots of NBA game scores over the course of a match:

Part of a project on GitHub.

Data Collection

The above score data is collected from:

The data is shown in the Play-By-Play table.

We use the requests library to fetch the page contents. We then use pandas to extract and parse the table, with Beautiful Soup as the HTML parsing backend.

The script scrapes the match information from the page, extracts the scores, pre-processes the data and visualises it against time:

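import pandas as pd
import requests

def score_table_from_url(url):
    # Fetch URL contents (raw HTML of the page)
    response = requests.get(url)
    return response.content

def dataframe_from_table_html(html_str):
    # Parse only the table with id="pbp" (the Play-By-Play table)
    score_table = pd.read_html(io=html_str, attrs={"id": "pbp"}, flavor="bs4")[0]
    # The table has a two-level header; keep the columns under "1st Q"
    return score_table["1st Q"]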
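To make the role of the id filter more concrete, here is a minimal sketch of selecting a table by its id attribute and collecting its cell text. It deliberately uses only the standard library's html.parser rather than pandas, so it is not what the project does internally; the class name TableByIdParser and the sample HTML are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TableByIdParser(HTMLParser):
    """Collect the cell text of a <table> whose id matches table_id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0  # > 0 while we are inside the target table
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth > 0:
                self.depth += 1  # nested table inside the target
            elif dict(attrs).get("id") == self.table_id:
                self.depth = 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th") and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.depth and self.in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical sample page: only the table with id="pbp" should be read.
sample_html = """
<table id="other"><tr><td>ignored</td></tr></table>
<table id="pbp">
  <tr><th>Time</th><th>Score</th></tr>
  <tr><td>11:40.0</td><td>0-0</td></tr>
</table>
"""

parser = TableByIdParser("pbp")
parser.feed(sample_html)
print(parser.rows)
```

In practice pandas handles header levels, missing cells and type conversion as well, which is why the project relies on it instead of a hand-written parser like this one.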

Data Cleaning

The output from the table parsing has quite a few issues which we need to clean. Below is a typical example of the initial pandas dataframe:

|   | Time | Toronto | Unnamed: 2_level_1 | Score | Unnamed: 4_level_1 | Charlotte |
|---|------|---------|--------------------|-------|--------------------|-----------|
| 0 | 12:00.0 | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) | Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession) |
| 1 | 11:40.0 | O. Anunoby misses 2-pt layup from 2 ft | nan | 0-0 | nan | nan |
| 2 | 11:35.0 | nan | nan | 0-0 | nan | Defensive rebound by D. Graham |

The steps we take to clean the data are outlined below:

def clean_table(score_df) -> pd.DataFrame:
    score_df = remove_unnamed_columns(score_df)
    score_df = add_quarter_column(score_df)
    score_df = remove_nonscore_rows(score_df)
    score_df = scores_to_separate_columns(score_df)
    score_df, team_names = add_team_label(score_df)
    score_df = add_action_label(score_df, team_names)
    score_df = normalise_time_remaining(score_df)

    # Make Time index
    score_df.set_index(keys=["TimeElapsed"], inplace=True)
    return score_df
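Two of the steps above, dropping rows without a score and splitting the score into separate columns, both depend on recognising a score string such as "0-2". The following is a minimal sketch of that idea on plain strings rather than on a dataframe; the helper name split_score and the regular expression are assumptions for illustration and are not taken from the project source.

```python
import re

# A score cell looks like "0-2"; anything else (e.g. a jump-ball
# description spread across the row, or nan) is not a score.
SCORE_PATTERN = re.compile(r"^(\d+)-(\d+)$")


def split_score(score_text):
    """Return (first, second) as ints, or None if the text is not a score."""
    match = SCORE_PATTERN.match(str(score_text).strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(split_score("0-2"))
print(split_score("Jump ball: S. Ibaka vs. B. Biyombo (P. McCaw gains possession)"))
```

A row whose Score cell yields None can be dropped; for the remaining rows the two integers can be written to separate score columns.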

The full source can be found here:

After cleaning, the resulting dataframe is much easier to process for our plotting needs:

| TimeElapsed | Time | Quarter | HomeScore | AwayScore | TeamLabel | Label |
|-------------|------|---------|-----------|-----------|-----------|-------|
| 1900-01-01 00:00:20 | 11:40.0 | 1 | 0 | 0 | Toronto | O. Anunoby misses 2-pt layup from 2 ft |
| 1900-01-01 00:00:25 | 11:35.0 | 1 | 0 | 0 | Charlotte | Defensive rebound by D. Graham |
| 1900-01-01 00:00:38 | 11:22.0 | 1 | 0 | 2 | Charlotte | M. Bridges makes 2-pt layup from 2 ft (assist by P. Washington) |
| 1900-01-01 00:00:49 | 11:11.0 | 1 | 0 | 2 | Toronto | K. Lowry misses 3-pt jump shot from 30 ft |
| 1900-01-01 00:00:52 | 11:08.0 | 1 | 0 | 2 | Charlotte | Defensive rebound by M. Bridges |
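The TimeElapsed values above follow from the time remaining in the quarter and the quarter number. Here is a minimal sketch of that conversion, assuming 12-minute regulation quarters and ignoring overtime periods; the function name time_elapsed is hypothetical, and it returns a timedelta rather than the datetime values shown in the table.

```python
from datetime import timedelta

QUARTER_LENGTH = timedelta(minutes=12)  # assumption: regulation quarters only


def time_elapsed(time_remaining, quarter):
    """Convert a 'MM:SS.s' time remaining in a quarter to time since tip-off."""
    minutes, seconds = time_remaining.split(":")
    remaining = timedelta(minutes=int(minutes), seconds=float(seconds))
    # Full quarters already played, plus the part of this quarter played so far
    return (quarter - 1) * QUARTER_LENGTH + (QUARTER_LENGTH - remaining)


print(time_elapsed("11:40.0", 1))  # 0:00:20, matching the first row above
print(time_elapsed("11:22.0", 1))  # 0:00:38, matching the third row above
```

Using elapsed time as the index gives a single monotonically increasing axis for the whole game, which is what the plot needs.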

Plot Generation

The plots are generated using the bokeh library through hvplot. hvplot extends the standard pandas plotting API to use different backends. This allows us to create an interactive plot in which we can zoom and hover over data points. The hover tools are set up to list all columns of our score dataframe.

The code required for this is rather simple:

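import hvplot.pandas  # registers the .hvplot accessor on dataframes
from bokeh.models import HoverTool

def create_plot(score_df):
    # Plot both score columns against the TimeElapsed index
    score_plot = score_df.hvplot(
        y=["HomeScore", "AwayScore"], hover_cols=list(score_df.columns)
    )

    hover = HoverTool(
        tooltips=[(col, "@" + col) for col in ["HomeScore", "AwayScore"]]
    )

    score_plot = score_plot.opts(tools=[hover], show_grid=True)
    return score_plot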

Create a web service

This code can be packaged up into a small web service.

This allows you to call the program via a normal web address:

The service uses Flask to set up a REST endpoint as follows:

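from flask import Flask, request

app = Flask(__name__)

@app.route("/nba_score_plot", methods=["GET"])
def process_request():
    # game_id is read from the query string, e.g. /nba_score_plot?game_id=...
    game_id = request.args.get("game_id")
    score_plot = generate_plot(game_id)
    score_plot_html = convert_plot_to_html(score_plot)
    return score_plot_html, 200

if __name__ == "__main__":
    # The original run arguments are not shown; Flask's defaults are assumed here.
    app.run()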

An example of the interactive plots generated can be seen at the top of the page.
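As a rough illustration of how the /nba_score_plot endpoint could be called, the sketch below builds the request URL with the game_id query parameter. The base address assumes the service is running locally on Flask's default development port, and the game ID shown is a placeholder, not a real identifier.

```python
from urllib.parse import urlencode


def build_plot_url(game_id, base_url="http://localhost:5000"):
    """Build the URL for the score plot endpoint (base_url is an assumption)."""
    query = urlencode({"game_id": game_id})
    return f"{base_url}/nba_score_plot?{query}"


# Placeholder ID for illustration only
url = build_plot_url("EXAMPLE_GAME_ID")
print(url)
```

Opening the resulting address in a browser while the service is running should return the plot as an HTML page.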

The full source can be found here: