
The Duality of Data

  • Writer: Beckett Sanderson
  • Sep 12, 2022
  • 5 min read

Data has become a staple of sports. Turn on any talk show or podcast and you will hear data point after data point being used to back up different narratives. However, in this world of endless data, it’s easy to be misled by impressive numbers, and an experienced media member can use cherry-picked stats to tell any story they want.

Look at the two NBA players below and their stats from the 2021-22 NBA season, and try to decide who was the better player that year (abbreviation glossary):

Each player has an argument in this comparison. Based on these stats, Player A is the far superior rebounder with slight edges in scoring, FG%, and usage percentage. Player B, on the other hand, is the better creator, with a significant advantage in assists and turnovers as well as exceptional efficiency and advanced stats. The decision is a toss-up that depends on what an individual wants from a star player.

Player A in this exercise is none other than Giannis Antetokounmpo, who placed 3rd in MVP voting last season and is considered by many to be the best player in the world.

Player B is the notorious… Jared Harper.

Who?

The problem with examining only statistics is that there are so many different angles from which to look at a player or team that it’s easy to hide any flaws they may have. Without the surrounding context, there is no way to truly evaluate a performance.

In this example, Jared Harper played PG for the New Orleans Pelicans last year, but only played five games for the team. In fact, Harper has only played 16 games total in his three-year NBA career, spread out across three teams: the Phoenix Suns, the New York Knicks, and the aforementioned Pelicans.

Harper played great in those five games, so he has elite time-adjusted stats, but put up next to Antetokounmpo (or most other NBA players) his raw stats would be severely lacking. However, by using only the time-adjusted data (per 36 minutes for all key stats), the numbers are able to make this 5’10” point guard with 67.2 total NBA minutes to his name appear equivalent to a player many consider the best in the world.*
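To make the per-36 arithmetic concrete, here is a minimal sketch in Python. The function name `per_36` and both stat lines are hypothetical, chosen only to illustrate the scaling; they are not Harper's or Antetokounmpo's actual numbers.

```python
def per_36(total: float, minutes_played: float) -> float:
    """Scale a raw counting stat to a per-36-minute rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total * 36 / minutes_played


# Hypothetical totals, made up purely to show the arithmetic.
low_minute_player = {"points": 50, "minutes": 60}
high_minute_player = {"points": 2000, "minutes": 2200}

low_rate = per_36(low_minute_player["points"], low_minute_player["minutes"])
high_rate = per_36(high_minute_player["points"], high_minute_player["minutes"])

# The two rates land close together even though one player has
# 40 times the raw points of the other.
print(f"Low-minute player:  {low_rate:.1f} points per 36")
print(f"High-minute player: {high_rate:.1f} points per 36")
```

Once the minutes column is dropped, nothing in the two rates reveals that one of them rests on a single hour of play.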

This is just one example of how stats can be used to mislead a layperson. The issue is that it is sometimes difficult to tell this is happening, because all of the numbers are undeniably true. They are simply being presented in a way that hides the proper context. To counteract this problem, here are three scenarios to keep an eye out for.

The first scenario is time-adjusted data presented without the total time played. As demonstrated above, time-adjusted data is one of the most effective ways to make players with little time on the field or court look far better than their raw totals suggest. This can happen in every major sport: in baseball when adjusting per 9 innings, in football when using yards per attempt, in basketball when adjusting per 36 or 48 minutes, and in hockey when adjusting per 60 minutes, to name a few.

Some data does not even make clear how it treats playing time. For instance, the Jared Harper example uses the advanced metric Box Plus/Minus (Box +/-). It is a rate stat, expressed per 100 possessions, and does not account for how much a player actually played, which is why it fits so neatly into the misleading comparison above. On the other hand, the advanced metric Win Shares (WS) accumulates with playing time, so it may undervalue a player who was out for several weeks due to COVID-19 restrictions. It’s important to understand what goes into a statistic or metric before using it to evaluate a player or team.

The second scenario is data with a small sample size. This is another common trap, as any team or player can look like a juggernaut over a small number of games. Graphics makers love to slip “over the last five games” or a similar descriptor into the fine print of their charts, since short stretches make it easier to find eye-catching numbers. For example, in the 2020 NBA playoff bubble, Orlando Magic center Nikola Vucevic averaged 28.0 PPG (6th best that year) on a superb 60.4 TS%. Vucevic is a good center, but with just those stats he could be made out to be a top 3 center in the league, which he was not at the time.
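The “over the last five games” trick can be sketched in a few lines of Python. The game log below is hypothetical, and `best_window_average` is an illustrative helper rather than anything a real graphics team is known to use; it simply searches a season for the most flattering five-game stretch.

```python
from statistics import mean


def best_window_average(values, window):
    """Return the highest average over any run of `window` consecutive games."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return max(
        mean(values[i:i + window])
        for i in range(len(values) - window + 1)
    )


# Hypothetical points-per-game log for one player.
game_log = [12, 18, 15, 30, 28, 33, 27, 29, 14, 16, 11, 19]

season_average = mean(game_log)
hot_streak = best_window_average(game_log, 5)

print(f"Full log average:       {season_average:.1f} PPG")
print(f"Best five-game stretch: {hot_streak:.1f} PPG")
```

The same player can be described by either number, and the larger one is always available to anyone willing to pick the window after the fact.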

The final scenario is a stat built from a large number of very specific qualifiers. This one is sometimes harder to notice and can be used to make a player appear very impressive at first glance. In the example below, Phoenix Suns center Deandre Ayton appears to be one of the best players in the NBA to start the 2018 season. However, a perceptive reader would notice that there are a few too many qualifications attached: 15+ PPG, 10+ RPG, 60+ FG%, and the time frame (over the first four games of a season).

Ayton is a good player, but the number of conditions in this graphic makes it easy to isolate him from other players. Adding the 10+ rebounds practically guarantees most guards won’t qualify, but the most restrictive and misleading condition is “over his first four games.” By putting the time frame in words after the eye-catching numbers, the tweet makes it appear that Ayton is the first player ever to post those numbers, when the claim covers only a player’s first four games.
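A short Python sketch shows how stacking qualifiers shrinks the comparison pool. The five players and their stat lines are hypothetical, and `pool_sizes` is an illustrative helper; only the three thresholds (15+ PPG, 10+ RPG, 60+ FG%) come from the graphic described above.

```python
# Hypothetical stat lines, invented for illustration only.
players = [
    {"name": "Center A", "ppg": 17.0, "rpg": 11.0, "fg_pct": 0.62},
    {"name": "Guard B", "ppg": 25.0, "rpg": 4.0, "fg_pct": 0.47},
    {"name": "Forward C", "ppg": 21.0, "rpg": 10.5, "fg_pct": 0.52},
    {"name": "Center D", "ppg": 12.0, "rpg": 12.0, "fg_pct": 0.65},
    {"name": "Guard E", "ppg": 16.0, "rpg": 3.5, "fg_pct": 0.61},
]

# Each qualifier is a label plus a test a player must pass.
qualifiers = [
    ("15+ PPG", lambda p: p["ppg"] >= 15),
    ("10+ RPG", lambda p: p["rpg"] >= 10),
    ("60+ FG%", lambda p: p["fg_pct"] >= 0.60),
]


def pool_sizes(players, qualifiers):
    """Apply qualifiers one at a time and record how many players remain."""
    remaining = list(players)
    sizes = [("no filter", len(remaining))]
    for label, keep in qualifiers:
        remaining = [p for p in remaining if keep(p)]
        sizes.append((label, len(remaining)))
    return sizes, remaining


sizes, survivors = pool_sizes(players, qualifiers)
for label, count in sizes:
    print(f"{label:>10}: {count} player(s) left")
```

Each condition looks reasonable on its own, but together they leave a pool of one, which is exactly what a “first player to...” graphic needs.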

You now know what to look out for, but what can you do when you see one of these scenarios? The most important step is to understand the context in which the data was gathered or the metric was created, and to find the missing pieces of information that aren’t explicitly stated. Time-adjusted data should be viewed alongside the raw stats it was derived from. With a small sample size, take any conclusion with a grain of salt, since opinions might change once more games are played.
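One way to follow that advice in practice is to never print a rate without the raw totals behind it. The Python sketch below assumes a hypothetical `RateLine` record and a made-up `MIN_MINUTES` cutoff of 500; it is not the qualifying threshold of any real stats site.

```python
from dataclasses import dataclass

# Hypothetical cutoff below which a rate stat gets a warning label.
MIN_MINUTES = 500


@dataclass
class RateLine:
    name: str
    minutes: float  # must be positive
    points: float

    @property
    def points_per_36(self) -> float:
        """Points scaled to a per-36-minute rate."""
        return self.points * 36 / self.minutes

    @property
    def qualifies(self) -> bool:
        """True when the player has logged at least MIN_MINUTES."""
        return self.minutes >= MIN_MINUTES


def describe(line: RateLine) -> str:
    """Report the rate together with the raw totals it came from."""
    flag = "" if line.qualifies else " (small sample)"
    return (
        f"{line.name}: {line.points_per_36:.1f} pts/36 "
        f"on {line.points:.0f} pts in {line.minutes:.0f} min{flag}"
    )


# Hypothetical players, for illustration only.
print(describe(RateLine("Bench guard", minutes=60, points=50)))
print(describe(RateLine("Star forward", minutes=2200, points=2000)))
```

Keeping the totals and the flag in the same line of output means a reader cannot see the flattering rate without also seeing how little play it rests on.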

Overall, data is now a big part of our world, both in and out of sports. There are many more ways to manipulate data than those described above, but once you know to keep an eye out for them, the manipulations are easier to catch and understand. As big data continues to grow, understanding how to use and interpret it will be a crucial skill to develop.


*As a note, Basketball Reference (where these stats were retrieved) makes this easy to avoid with its “When table is sorted, hide non-qualifiers for rate stats” button, which I conveniently ignored for this misleading comparison exercise.


1 Comment


Julian Richardson
Jun 14, 2023

Delete this before Adam Silver sees it


An aspiring data scientist with a data science and economics combined degree from Northeastern University's Khoury College of Computer Sciences and John Martinson Honors Program.
