Bonus: Subset Stats

Beyond what we’ve learned so far in this section, there is still plenty of statistical theory that can deepen our understanding of the abstract mathematical influences on sports. For this bonus page, we will look at subset stats. The goal is to explain three things about this topic:

  1. What subset stats are, and how they apply to sports analysis.
  2. How subset stats impact probability theory and regression to the mean in terms of sports analysis.
  3. How certain subset stats hold more value than a normal random subset, and how that will impact the appearance of a team’s or player’s success (and lack thereof).

These factors will play a MAJOR ROLE in the future analysis on this website. For starters, when any publication focuses on one part of the game, it generally uses a cross-section that involves a subset in some way. That means any statistical assessment could involve a subset stat. In many ways, we use some variation of subset stats much more often than we realize. Therefore, understanding subset stats will help us to gain a deeper and more abstract understanding of sports analysis in general.

Defining Subset Stats
First, we need to understand exactly what a subset means. This will play a big role in all types of sports analysis, statistical-based or not. In terms of general application, a subset is defined as a group of people / places / things that is part of a larger group. Each person / place / thing of the former group is an “element” of the latter group. Just think of the phrase “not all (blank) are (blank), but all (blank) are (blank).” When the second and third blanks match up, and the first and fourth blanks match up, you will establish a scenario involving subsets. I’ll use one example:

Not all musicians are drummers, but all drummers are musicians.

In this example, the drummers are a subset of all musicians. Some subset relationships are more abstract, but they would not be defined clearly enough for the purposes of this website. We want clearly defined items when making an argument. Luckily, as we use subset stats, we will have those clearly defined items. Qualitative or quantitative, we will have a clear and categorical designation.

Let’s remember that we aren’t just looking at subsets, but specifically, subset stats. Thus, we need to look at the mathematical definition. Mathematically speaking, a subset is a group of elements that is collectively part of another inclusive group. In other words, all elements of the former group are also elements of the latter group. We will define the former group as “Set A” and the latter group as “Set B.” For the sake of this discussion, Set A will be orange. Set B will be yellow. Below, you can see how Set A is a subset of Set B. With this, we will be able to start applying subsets to sports analysis.

The orange set (Set A) is a subset of the yellow set (Set B).
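
To make the definition concrete, here is a minimal Python sketch of the musician and drummer example. The member names are hypothetical placeholders chosen only for illustration; the point is simply that every element of Set A also appears in Set B, while the reverse does not hold.

```python
# Hypothetical example data: the names are illustrative placeholders only.
musicians = {"guitarist", "pianist", "drummer_a", "drummer_b"}  # Set B (yellow)
drummers = {"drummer_a", "drummer_b"}                           # Set A (orange)

# Every element of Set A is also an element of Set B.
a_is_subset_of_b = drummers.issubset(musicians)

# The reverse does not hold: not all musicians are drummers.
b_is_subset_of_a = musicians.issubset(drummers)

print(a_is_subset_of_b)
print(b_is_subset_of_a)
```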

Applying Subset Stats to Sports Analysis
In order to apply subset stats to sports analysis, we need to understand that the context of our arguments will affect how we construct our sets. Technically, subsets are pretty much everywhere in sports. Every team sport involves multiple subsets of players that make up one team. For example, wide receivers make up a subset of a football team. Applying it to the picture above, the group of wide receivers acts as the orange set. The group of players that make up the whole team acts as the yellow set.

However, the focus of this page is on subset stats. This is slightly, yet fundamentally, different from simply defining subsets in sports. Therefore, we are not using subsets like the example above. For the sake of the sports analysis on this website, a subset stat is defined as any piece of data that is attached to a subset category. The category itself is an element of an inclusive categorical set. Therefore, an individual stat may be used in the context of a subset and still make for a meaningful argument.

For example, look at third down defense. This is a “subset stat” because it measures the rate of allowing a first down during a select situation that is an element of an inclusive set. The entire set involves first down, second down, third down and fourth down. Each individual down is a subset. Therefore, fourth down defense is another subset stat, as are first down defense and second down defense. Note that we are NOT actually using the numbers to define the subsets. However, we are still analyzing the numbers AFTER determining their subset stat status. We will do so while understanding the impact and implications at hand.
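
As a rough sketch of how this works in practice, the Python below splits a play log by down and computes the rate of first downs allowed for each subset and for the inclusive set. The play log is made up for illustration, and the names `plays` and `first_down_rate` are hypothetical, not taken from any real data source.

```python
from collections import defaultdict

# Hypothetical play log: (down, first_down_allowed). Values are made up.
plays = [
    (1, False), (1, True), (1, False), (1, False),
    (2, True), (2, False), (2, False),
    (3, True), (3, True), (3, False),
    (4, False),
]

def first_down_rate(play_log):
    """Share of plays on which the defense allowed a first down."""
    allowed = sum(1 for _, gave_up in play_log if gave_up)
    return allowed / len(play_log)

# Group plays by down. The category (the down) defines each subset,
# not the numbers recorded inside it.
by_down = defaultdict(list)
for down, gave_up in plays:
    by_down[down].append((down, gave_up))

# Stat for the inclusive set, then one subset stat per down.
overall_rate = first_down_rate(plays)
rate_by_down = {down: first_down_rate(group) for down, group in sorted(by_down.items())}

print(f"all downs: {overall_rate:.3f}")
for down, rate in rate_by_down.items():
    print(f"down {down}: {rate:.3f} ({len(by_down[down])} plays)")
```

Notice that the plays are sorted into subsets by their category first, and the rates are only calculated afterward.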

Subset Stats and the Effect on Regression Analysis
Understanding subset stats becomes very important for any analyst looking to make an argument for “what matters” and “how much something matters” in the sports scene. The presence of subset stats could greatly influence the argument. In terms of regression analysis, subset stats are less trustworthy than the stats from the inclusive set. With the weakened reliability of these subset stats, the subjects of these stats are less likely to maintain abnormal success or failure if the statistical production in the inclusive set is normal. This is extremely important to understand when assessing the progress of a player or team statistically, whether it be during the year or from one year to another. We will know how much to trust the production of the player or team.

Let’s continue with our example involving third down defense. Say a team is allowing a first down at an abnormally high rate during third down situations. However, the team is allowing a first down at a relatively normal rate during all situations. We should put more stock into the overall production than the third down production. This is because the team fielded more snaps overall than it did on third down.

The sample size plays the major role here. With a larger sample size, the results become more reliable. Therefore, we will generally trust the whole set more than the subset. Furthermore, we will generally trust a team’s success and failure more and more as a season progresses. Think of it in terms of subsets: the first two weeks of a season are merely a subset of a 16-week season. This shows how subset stats are intricately tied to the effect of sample sizes.
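
One way to put a number on this is the standard error of a proportion, sqrt(p(1 - p) / n), which shrinks as the sample size n grows. The sketch below assumes a simple binomial model in which every snap is an independent trial, which is a simplification. The snap counts and the rate are hypothetical values chosen only for illustration.

```python
import math

def standard_error(rate, n):
    """Standard error of an observed proportion under a simple binomial model."""
    return math.sqrt(rate * (1 - rate) / n)

# Hypothetical numbers, chosen only for illustration.
all_snaps = 1000          # the inclusive set
third_down_snaps = 200    # the subset
observed_rate = 0.40      # same observed rate in both, to isolate sample size

se_all = standard_error(observed_rate, all_snaps)
se_third = standard_error(observed_rate, third_down_snaps)

print(f"all snaps:  +/- {se_all:.3f}")
print(f"third down: +/- {se_third:.3f}")
```

With one-fifth of the snaps, the subset carries a larger margin of uncertainty around the same observed rate, which is why we lean on the inclusive set when the two disagree.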

Subset Stats That Hold More Value than Normal
Subset stats impact more than regression. In select cases, they also give a “weighted” appearance to the inclusive set stats. This makes it very important for any analyst to understand the most important scenarios of a game. Even though these subset stats are less reliable than the inclusive set stats, they could provide a major boost or major setback to a team or player. In terms of the overall statistical profile, certain subset stats that address important scenarios of the game will have a larger impact than the normal statistical cross-section. Many analysts generalize this subject when discussing the “clutch factor” of players or teams.

Again, let’s look at third down defense. Say a team performs abnormally well on third down, but performs normally in the other down situations. Although the previous section explained why we should trust the normal play more than the abnormally strong play, we cannot dismiss the results altogether. Overall, a defense with abnormally strong play on third down will look much stronger than a defense that only performs abnormally well on first or second down. That’s because stops on third down will generally force the opponent to kick (via punts or field goals), while stops on the earlier downs generally don’t end the drive.

Here’s a list of some scenarios that produce subset stats that have this “weighted” impact:

Football (NFL or NCAA)

  • Third down, offense or defense
  • Fourth down, offense or defense
  • Red zone, offense or defense
  • “Late and close,” offense or defense
  • One-possession play
  • Overtime play
  • Postseason play (for individual value)

Baseball (MLB, MiLB or NCAA)

  • Runners in scoring position, hitting or pitching
  • Two strikes, hitting or pitching
  • “Late and close,” hitting or pitching
  • One-run play
  • Extra-inning play
  • Postseason play (for individual value)

Basketball (NBA or NCAA)

  • Three-point shooting, offense or defense
  • “Late and close,” offense or defense
  • One-possession play
  • Overtime play
  • Postseason play (for individual value)

Hockey (NHL or NCAA)

  • One-goal play
  • Overtime play
  • Power play or short-handed play (for individual value)
  • Postseason play (for individual value)

Get to know these stats very well, and research how they can impact the game. We will be discussing factors for regression involving these subset stats on many occasions here at TABMathletics. Enjoy the analysis!