Age of DataFrames 2: Polars Edition

May 23, 2024

•

Wouter Gins

In this publication, I showcase some Polars tricks and features.

As one of the newer kids on the block, Polars offers an exciting alternative to PySpark for small to medium sized datasets, as already evidenced in other blogposts. Here, we’ll dangle our toes in the water with a more focused topic to showcase how powerful and expressive polars can be: e-sports prediction. We’ll have a look at a tournament of the video game Age of Empires II: Definitive Edition (AoE), and try to make a prediction on how the first round will go. Aside from a similar age (Polars had its first commit in June 2020, AoE was released in November 2019), the AoE community has spawned many community projects to maintain vast databases, making it ideal for a small project like this. I hope that by the end of this blogpost, you will share my joy in the features of Polars that just work 😄

In order to make a prediction, I’ll first have a look at some of the data that has been gathered by the community, then make a simple model to calculate a win chance, and then end by applying this model to the tournament.

If you want to play around with these concepts, here is a link to the GitHub repo you can use to download the data and get started.

📖 Data source

For our data, we scraped from aoestats.io, which maintains a database of online games played. I’m currently only interested in matches that are officially ranked, which corresponds to raw_match_type being between 6 and 9 (inclusive). I will add this as a filter to the reading.

I’ve already pre-processed the data, having done a bit of cleaning. After filtering on the match type, the data looks like this:

df_m = (
    (pl.scan_parquet("data/aoe2/matches/*/*.parquet"))
    .filter(pl.col("raw_match_type").is_between(6, 9))
    .select(pl.all().shrink_dtype())
)
df_m_c = df_m.collect()
print(f"Match data is about {df_m_c.estimated_size('gb'):.2f} GB")

A neat trick the Polars offers is the .shrink_dtype() expression, which shrinks numeric types to the smallest version that supports the values currently in the dataframe. Quite handy to minimize memory usage!

Why filter: Eagle-eyed readers will also know of the existence of the semi-join, which I could also have used in order to filter the data by making a small dataframe with a raw_match_type column. In testing, I found a minute difference in performance between the two (0.4s versus 0.6s). My takeaway here is that the semi-join probably has some overhead. For filtering on larger amounts of values or on values that are not known beforehand, I would recommend the semi-join.

Now, aside from the matches, I also need the data of the players involved in these games. This is an ideal case for the semi-join!

df_p = pl.scan_parquet("data/aoe2/player/*/*.parquet")
df_p_c = (
    df_p.join(other=df_m_c.lazy(), on="game_id", how="semi")
    .collect()
    .select(pl.all().shrink_dtype())
)
print(f"Player data is about {df_p_c.estimated_size('gb'):.2f} GB")

As the dataset contains almost 12 million played games and 42 million records of players in those games, the dataset is large enough to play around with and just small enough to fit in memory on my modest laptop (once I close Chrome 😉).

🔮 Win chance prediction: making a lookup table

Of particular interest in this dataset is the rating or elo columns, which is a number that represents how strong a player is. This concept originated in chess and is widely used among different e-sports in order to rank players. As a naive estimator for a player’s chance to win a match, we can use the rating difference. First, let’s transform the data so it can be fed it into a classifier, and then make a lookup table for the rating difference:

rating_diffs = (
    df_p_c.join(
        other=df_m_c.filter(
            raw_match_type=6 # For filtering a column on equality, Polars offers some syntactical sugar
        ),  # Restrict the data to 1-vs-1 games to get a more accurate image
        on="game_id",
        how="inner",
    )
    .select(
        "match_rating_diff", "winner"
    )  # We only want the rating difference and the winner-flag
    .drop_nulls()  # Placement matches have players with no rating calculated yet, can be removed
)
rating_diffs

For the classifier, I simply took the Gaussian Naive Bayes classifier available in scikit-learn. This is not necessarily the most suitable choice, and I will also not split the data into a training and testing set. For building good models, these steps are absolutely essential, but I want to focus on how to use Polars, not on machine learning 😉 Just for fun, I’ve also scored the model to see how it performs:

from sklearn.naive_bayes import GaussianNB
import numpy as np

gnb = GaussianNB()
gnb.fit(X=rating_diffs.select("match_rating_diff"),
        y=rating_diffs.select("winner")
       )
gnb.score(X=rating_diffs.select("match_rating_diff"),
          y=rating_diffs.select("winner")
         )
>> 0.5476460560275515

Our classifier scores 55%, so slightly better than a straight-up coin toss! The classification selects who has the higher win percentage, somewhat boring. However, the model can also give the win percentage, not so boring! Let’s evaluate this in steps of 5 ELO points over a decent range, so we can see the win chance changing smoothly. Since we need to interface with the classifier, the Polars version of a User Defined Function is required, which can be either map_elements(...) or map_batches(...). The exact difference is quite nuanced, and it has a large entry in the documentation. In this situation, map_batches can be used to evaluate all the data at once, rather than evaluating record per record.