Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 185-200 (2017)

## Factors Affecting Batters' Contact with a Four-seam Fastball

(Graduate School of Science and Technology, Keio University)
(Department of Mathematics, Keio University)

In baseball, "nobi" describes a four-seam fastball with which a batter has trouble making contact. Our research aims to understand the origin of nobi. It has been speculated that the velocity of a four-seam fastball with nobi does not change much from the time it leaves the pitcher's hand to the time it crosses the plate. Our previous analysis of nobi using PITCHf/x, a system that tracks the ball's trajectory to measure data such as the coordinates and break of a pitch, revealed the opposite relation. Consequently, after defining the batter's contact with a pitch, we applied a logistic regression model to explain bat contact by the difference in ball speed, and a negative relation was obtained.
This study focuses on the break of a pitch. We quantitatively analyzed the relationship between the break of a pitch and contact. Additionally, we evaluated the relationship between the break of the ball and bat contact using a generalized additive model with multivariate spline smoothing. The results indicate that vertical break is important. Moreover, adjusting the model so that the random effect represents hitting difficulty by pitcher rather than pitch quality revealed that, in the 2014 Major League Baseball (MLB) season, Uehara was the most difficult pitcher for batters to face.
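As a rough guide to the two models described above, the logistic regression and the generalized additive model might be written as follows. This is only a sketch: the notation is assumed here and is not taken from the paper. $y_i$ indicates contact on pitch $i$, $\Delta v_i$ is the difference in ball speed, $b^{h}_i$ and $b^{v}_i$ are the horizontal and vertical break, $s$ is a smooth spline surface, and $u_{p(i)}$ is a random effect for the pitcher who threw pitch $i$.

$$
\begin{aligned}
\operatorname{logit} \Pr(y_i = 1) &= \beta_0 + \beta_1 \, \Delta v_i, \\
\operatorname{logit} \Pr(y_i = 1) &= \beta_0 + s\!\left(b^{h}_i, b^{v}_i\right) + u_{p(i)}, \qquad u_p \sim N(0, \sigma_u^2).
\end{aligned}
$$

Under this reading, a negative $\beta_1$ corresponds to the negative relation reported for the speed difference, and a pitcher-level $u_p$ is the kind of term from which a ranking of hitting difficulty by pitcher could be read.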

Key words: PITCHf/x data, four-seam fastball, nobi, break of the pitch, generalized additive model, random effect.

Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 201-215 (2017)

## A Statistical Analysis of Medial Collateral Ligament Injury Using Baseball Tracking Data in MLB

(Faculty of Science and Engineering, Chuo University)
(Graduate School of Science and Engineering, Chuo University)
(Graduate School of Science and Engineering, Chuo University)
(Graduate School of Science and Engineering, Chuo University)
(The Center for Data Science Education and Research, Shiga University)

The incidence of ulnar collateral ligament (UCL) reconstruction surgeries among baseball pitchers has increased in recent decades. Despite the importance of preventing UCL injuries, there is as yet no scientific consensus regarding the risk factors for such injuries.
In this paper, we reconsidered candidate risk factors for UCL injuries, referring to the opinions of an amateur pitcher and a sports doctor, and then obtained adjusted odds ratios for selected risk factors via a logistic regression model with stepwise variable selection using AIC. The results revealed the following risk factors: for starting pitchers, a smaller repertoire of pitch types, a horizontal release location farther from the body, and a greater mean pitch count per game; and for relief pitchers, a smaller repertoire of pitch types, a horizontal release location farther from the body, a greater mean speed of fastballs, and fewer days between consecutive games. These results support previous studies of the risk factors for UCL injuries, and provide important suggestions regarding pitch count per game and mound interval for both starting and relief pitchers.
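For reference, the standard relation between a logistic regression coefficient and an adjusted odds ratio, together with the AIC used in stepwise selection, is shown below. The symbols are generic notation assumed here: $x_1, \dots, x_K$ are the retained risk factors, $\hat{L}$ is the maximized likelihood, and $k$ is the number of estimated parameters.

$$
\begin{aligned}
\operatorname{logit} \Pr(\text{injury}) &= \beta_0 + \sum_{j=1}^{K} \beta_j x_j, \\
\mathrm{OR}_j &= \exp(\beta_j), \\
\mathrm{AIC} &= 2k - 2 \log \hat{L}.
\end{aligned}
$$

Here $\mathrm{OR}_j$ is the multiplicative change in the odds of injury for a one-unit increase in $x_j$ with the other factors held fixed, so $\mathrm{OR}_j > 1$ marks a factor associated with higher risk.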

Key words: Odds ratio, logistic regression, sparse logistic regression, lasso.

Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 217-234 (2017)

## Effectiveness of the Squeeze Play Using Covariate Balancing Propensity Scores

(Graduate School of Science and Technology, Keio University)
(Department of Mathematics, Keio University)

Major League Baseball (MLB) has collected play-by-play data for the past 20 years, and these data are available to the public. In this paper, we estimate the effect of a squeeze play on scoring using the covariate balancing propensity score (CBPS; Imai and Ratkovic, 2014) method. We focus on cases where the score difference is 0 or 1, excluding situations in which the bases are loaded. A simple way to estimate the effect of a squeeze play on scoring is to compare sample averages between two groups (attempting and not attempting a squeeze play). However, the decision to attempt a squeeze play is not random; it depends on the batter, pitcher, inning, etc. If these confounding variables are not considered, the estimated result will not represent the true effect of a squeeze play. We therefore estimate the effect using a propensity score approach to adjust for the effects of other variables. In the analysis, two estimation procedures for the propensity score are compared: the logistic regression model and the CBPS method. CBPS produces more balanced covariate distributions, and the estimated effect of a squeeze play is more stable than when the propensity score is estimated with the logistic regression model. The CBPS analysis indicates that a squeeze play has a positive effect on scoring, increasing the probability of scoring by 18.2%.
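The gap between a naive comparison of group means and a propensity-weighted comparison can be illustrated on synthetic data. The sketch below is not the CBPS method and does not use the paper's data: it assumes a single hypothetical binary confounder, estimates the propensity score by the treated fraction within each level of that confounder, and applies normalized inverse probability weights. All numbers, including the true effect of 0.2, are invented for illustration.

```python
import random

def simulate(n, seed=0):
    """Synthetic plays: (confounder, squeeze attempted, run scored). All rates are hypothetical."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random() < 0.5                   # hypothetical binary confounder
        t = rng.random() < (0.6 if x else 0.2)   # attempts are more likely when x is true
        p_score = 0.3 + 0.2 * t + 0.3 * x        # true effect of the attempt is +0.2
        y = rng.random() < p_score
        rows.append((x, t, y))
    return rows

def naive_difference(rows):
    """Difference in scoring rates between attempts and non-attempts, ignoring the confounder."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

def ipw_difference(rows):
    """Normalized inverse-probability-weighted difference in scoring rates."""
    # Propensity score: fraction of attempts within each level of the confounder.
    propensity = {}
    for level in (False, True):
        stratum = [t for x, t, _ in rows if x == level]
        propensity[level] = sum(stratum) / len(stratum)
    num1 = den1 = num0 = den0 = 0.0
    for x, t, y in rows:
        e = propensity[x]
        if t:
            num1 += y / e
            den1 += 1.0 / e
        else:
            num0 += y / (1.0 - e)
            den0 += 1.0 / (1.0 - e)
    return num1 / den1 - num0 / den0
```

On data generated this way, the naive difference overstates the effect because attempts are concentrated where scoring is already more likely, while the weighted estimate should land near the simulated effect.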

Key words: Baseball, squeeze play, causal inference, covariate adjustment, covariate balancing propensity score.

Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 235-249 (2017)

## Measurements of Baseball Players' Batting Abilities

(Graduate School of Science and Engineering, Chuo University)
(Department of Industrial and Systems Engineering, Chuo University)
(Department of Industrial and Systems Engineering, Chuo University)

Statistics on player performance are an important part of baseball. Many stats have been proposed to measure a batter's performance, including batting average, on-base percentage, and slugging percentage. In the field of baseball analytics, the "streakiness" of a batter's ability is often discussed using a binary sequence of hitting outcomes for a player during a season.
Unlike previous studies, which use data from the batter alone, we take a different approach: to analyze a batter's performance, we simultaneously model the pitcher's and the batter's abilities. To model a batter's streakiness, we employ an extension of the one-parameter logistic item response model. Item response theory (IRT) estimates both the subject's ability and the item difficulty. In this study, the ability parameter and the item difficulty parameter correspond to the batter's ability and the pitcher's ability, respectively. Although the one-parameter logistic model is considered easy to interpret because of its simplicity, our model incorporates numerous parameters. Nevertheless, athletes can still be compared using the odds ratio.
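For orientation, the basic one-parameter logistic model that the proposed model extends can be written as below. The notation is assumed here: $\theta_i$ is the ability of batter $i$, $b_j$ is the difficulty associated with pitcher $j$, and $y_{ij} = 1$ denotes a hit. The paper's extension for streakiness is not reproduced.

$$
\Pr(y_{ij} = 1) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)},
\qquad
\frac{\Pr(y_{ij} = 1) / \Pr(y_{ij} = 0)}{\Pr(y_{i'j} = 1) / \Pr(y_{i'j} = 0)} = \exp(\theta_i - \theta_{i'}).
$$

The second expression shows why the odds ratio is convenient for comparison: against the same pitcher, the ratio of two batters' odds depends only on the difference of their ability parameters.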
We express streakiness through interactions with previous at-bats and by imposing the Markov property on the batting data. For estimation, we use MCMC, specifically the Hamiltonian Monte Carlo method (also called the hybrid Monte Carlo method). The computation of Gibbs sampling is complex and time-consuming, but the Hamiltonian Monte Carlo method is easily computed once the prior distribution and the likelihood function are defined. Our simulation study shows that the true and estimated values agree well. Additionally, the proportion of times that the credible interval contains the true value is close to the nominal value.
To demonstrate the usefulness of our proposed method, we applied it to actual data from Japanese professional baseball. A two-way table can measure the dependence between the previous success and the current success through the Pearson chi-square statistic and the corresponding p-value of the test of independence. The results of our method provide more information and are consistent with the results of the chi-square test. Because comparing streakiness with a hypothesis test is difficult, we ranked streaky players using the credible intervals and the posterior means. IRT requires many subjects to estimate item difficulty parameters. Although we estimated the parameters using fewer batters, the results from our method are similar to those from IRT.
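The two-way table analysis mentioned above can be sketched as follows for a single batter's 0/1 sequence of outcomes. This is a minimal illustration of the classical test, not the proposed Bayesian method; the function names are hypothetical. For a 2x2 table the statistic has one degree of freedom, so the p-value can be obtained from the complementary error function.

```python
import math

def transition_table(outcomes):
    """2x2 counts indexed as table[previous][current] for a 0/1 sequence."""
    table = [[0, 0], [0, 0]]
    for prev, cur in zip(outcomes, outcomes[1:]):
        table[prev][cur] += 1
    return table

def pearson_chi_square(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A small p-value only indicates that the current outcome depends on the previous one; it does not by itself rank players, which is the limitation the abstract points to.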

Key words: Bayesian hierarchical model, MCMC, sabermetrics, logistic model.

Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 251-269 (2017)

## Statistical Rating Method of Volleyball National Teams to Predict Results and Determine Competition Format Design

(Faculty of Science and Technology, Meijo University)

The Fédération Internationale de Volleyball (FIVB), the world's governing body for volleyball, regularly ranks its member nations' teams. The FIVB Board of Administration has designed a point system for select FIVB world and other official competitions. However, the point system does not have a clear mathematical or statistical background. Consequently, this system cannot be used as a quantitative measure of a team's skill.
This paper proposes a novel mathematics-based rating and ranking system of national volleyball teams. The rating, which is a parameter reflecting the skill of the team, is calculated based on the scoring ratio of teams in each major international competition. A logistic regression model is employed to explain the scoring ratio with respect to the rating difference between two teams. Additionally, an iterative rating calculation method is proposed. Numerical experiments demonstrate the stability of the proposed method. The correlation coefficient between the proposed rating difference and the results of several major international competitions (e.g., Rio Olympic Games) is about 0.7. This value shows a strong correlation that is higher than that of the FIVB ranking (point) difference. The proposed rating is used to highlight the improper design of the ranking point attribution system and the competition format of the World Olympic Qualifying Tournament and Rio Olympic Games.
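The idea of explaining the scoring ratio by a logistic function of the rating difference can be sketched as follows. The abstract does not specify the iterative calculation, so the update rule here is an assumption: each rating is moved by the gap between the points a team actually scored and the points the current ratings predict, and the ratings are re-centered to mean zero. The team names and scores used to try it out are hypothetical.

```python
import math

def expected_ratio(rating_a, rating_b):
    """Logistic model: expected share of points won by team a against team b."""
    return 1.0 / (1.0 + math.exp(-(rating_a - rating_b)))

def fit_ratings(matches, n_iter=500):
    """Fit ratings from matches given as (team_a, team_b, points_a, points_b)."""
    teams = sorted({t for a, b, _, _ in matches for t in (a, b)})
    ratings = {t: 0.0 for t in teams}
    played = {t: 0 for t in teams}          # total points contested by each team
    for a, b, pa, pb in matches:
        played[a] += pa + pb
        played[b] += pa + pb
    for _ in range(n_iter):
        residual = {t: 0.0 for t in teams}  # observed minus expected points
        for a, b, pa, pb in matches:
            expected_a = (pa + pb) * expected_ratio(ratings[a], ratings[b])
            residual[a] += pa - expected_a
            residual[b] -= pa - expected_a
        for t in teams:
            ratings[t] += residual[t] / played[t]
        mean = sum(ratings.values()) / len(ratings)
        for t in teams:                     # ratings are only defined up to a constant
            ratings[t] -= mean
    return ratings
```

For example, `fit_ratings([("A", "B", 75, 60), ("B", "C", 75, 50), ("A", "C", 75, 40)])` returns one rating per team, and `expected_ratio` applied to two of them gives the predicted share of points in a future match.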

Key words: Sports, volleyball, rating, logistic regression model.

Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 271-286 (2017)

## Quantitative Evaluation of Soccer Players' Movements

(Graduate School of Culture and Information Science, Doshisha University)
(Department of Culture and Information Science, Doshisha University)

Many studies have examined the movement of soccer players with the ball. Some studies have even investigated the movement of soccer players without the ball, but they tend to focus on evaluating the overall movements when executing gameplay strategies. Only a few studies have evaluated the soccer players themselves.
In this paper, we use player mass, estimated with a gravity model, as an indicator to evaluate soccer players based on overall movements. In the gravity model, player mass is a parameter reflecting the movements of all players. Because the estimated player mass is equivalent to a parameter of a log-linear model, it corresponds to a main effect of that model. We calculated the density and distance among players to estimate the player mass from actual tracking data. Applying the estimated player mass and data from recorded gameplay to a Bayesian hierarchical model reveals the relationships between player mass and player movement.
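The link between a gravity model and a log-linear model can be seen from the generic form below. This is the textbook gravity model, written in notation assumed here ($T_{ij}$ for the interaction between players $i$ and $j$, $m_i$ for player mass, $d_{ij}$ for distance, $G$ and $\gamma$ for constants). How the paper builds the interaction term from density and distance is not stated in the abstract.

$$
T_{ij} = G \, \frac{m_i \, m_j}{d_{ij}^{\gamma}}
\quad\Longrightarrow\quad
\log T_{ij} = \log G + \log m_i + \log m_j - \gamma \log d_{ij}.
$$

After taking logarithms, $\log m_i$ and $\log m_j$ enter additively, which is why the player mass plays the role of a main effect in a log-linear model.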

Key words: Bayesian hierarchical model, log-linear model, sports data analysis.

Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 287-298 (2017)

## Tracking Data to Extract Changes in Football Game Situation

(Department of Civil Engineering, The University of Tokyo)
(Department of Civil and Environmental Engineering, Tokyo Institute of Technology)
(Department of Civil Engineering, The University of Tokyo)

In football, the "game situation" changes gradually through the interaction of the teams' attacks and defenses. The ability to automatically extract changes in a football game will facilitate the development of advanced strategies and provide richer information to spectators. In this research, we regard a change in the game situation as a change in the time-series behavior of the players and the ball.
We attempt to extract such changes using ChangeFinder, a statistical change detection method. ChangeFinder can detect changes in nonstationary and multi-noise time-series data via online learning of a two-step VAR model. Five types of indicators are created from the tracking data as input variables: ball position, front line position, compactness, defense vulnerability degree, and attack rate. The experiments confirm that a large fluctuation in the time-series behavior of the VAR model parameters occurs just prior to a detected change point. The nature of the change in game situation inferred from the parameter variation roughly agrees with the actual play, suggesting that the model can detect changes in football game situations.
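The two-step idea behind this kind of detector can be sketched in a heavily simplified form. The code below is not the ChangeFinder implementation used in the study: it assumes a single input variable and an AR(1) model with exponential forgetting instead of a VAR model, and the class name, function names, and default parameters are hypothetical. The first stage scores each observation by its log-loss under the online model, the scores are smoothed with a moving average, and a second stage repeats the procedure on the smoothed scores.

```python
import math
import random
from collections import deque

class DiscountedAR1:
    """Online Gaussian AR(1) model whose statistics forget the past at rate `discount`."""

    def __init__(self, discount):
        self.r = discount
        self.mean, self.var, self.cov1, self.noise = 0.0, 1.0, 0.0, 1.0
        self.prev = None

    def score_and_update(self, x):
        """Return the log-loss of x under the current prediction, then learn from x."""
        coef = self.cov1 / self.var
        pred = self.mean if self.prev is None else self.mean + coef * (self.prev - self.mean)
        err = x - pred
        score = 0.5 * math.log(2 * math.pi * self.noise) + err * err / (2 * self.noise)
        r = self.r
        self.mean = (1 - r) * self.mean + r * x
        self.var = max((1 - r) * self.var + r * (x - self.mean) ** 2, 1e-8)
        if self.prev is not None:
            self.cov1 = (1 - r) * self.cov1 + r * (x - self.mean) * (self.prev - self.mean)
        self.noise = max((1 - r) * self.noise + r * err * err, 1e-8)
        self.prev = x
        return score

def change_scores(series, discount=0.05, window=5):
    """Two-stage scoring: outlier scores, smoothing, then scores of the smoothed scores."""
    stage1, stage2 = DiscountedAR1(discount), DiscountedAR1(discount)
    buf1, buf2 = deque(maxlen=window), deque(maxlen=window)
    out = []
    for x in series:
        buf1.append(stage1.score_and_update(x))
        smoothed = sum(buf1) / len(buf1)
        buf2.append(stage2.score_and_update(smoothed))
        out.append(sum(buf2) / len(buf2))
    return out

def step_series(n=200, shift=5.0, seed=0):
    """Hypothetical test signal: Gaussian noise whose mean jumps by `shift` at index n."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.5) for _ in range(n)] + [rng.gauss(shift, 0.5) for _ in range(n)]
```

Calling `change_scores(step_series())` yields one score per time step, and the largest scores should cluster just after the jump. The smoothing window trades detection delay against sensitivity to isolated outliers.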

Key words: Football, game situation, change detection, time series analysis, ChangeFinder.

Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 299-307 (2017)

## Characterization of the Formation Structure in Team Sports

(Department of Physics, Faculty of Science and Engineering, Chuo University)
(Department of Physics, School of Advanced Science and Engineering, Waseda University)

In team sports, whether to maintain or rearrange a team formation is an essential strategic decision, but there is no established method to analyze the influence of different formations. We propose a method to identify the formation structure based on Delaunay triangulation. The adjacency matrix obtained from the Delaunay triangulation for each player is regarded as the formation pattern. Our method allows time-series analysis and a quantitative comparison of formations. A classification algorithm for formations is proposed by combining our method with hierarchical clustering.
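The construction of an adjacency matrix from a Delaunay triangulation can be sketched as follows for one frame of player positions. This is a brute-force illustration suited to a handful of players, not the authors' implementation: it tests every triple of points with the empty-circumcircle criterion, and it assumes positions in general position (ties on a circle are resolved arbitrarily). The function names are hypothetical.

```python
from itertools import combinations

def orientation(a, b, c):
    """Twice the signed area of triangle abc; positive when counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, d):
    """True if d lies strictly inside the circumcircle of counter-clockwise triangle abc."""
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    return det > 1e-12

def delaunay_adjacency(points):
    """Adjacency matrix whose entry [i][j] is 1 when players i and j share a Delaunay edge."""
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i, j, k in combinations(range(n), 3):
        a, b, c = points[i], points[j], points[k]
        turn = orientation(a, b, c)
        if abs(turn) < 1e-12:
            continue                      # collinear points do not form a triangle
        if turn < 0:
            b, c = c, b                   # make the triangle counter-clockwise
        others = (points[m] for m in range(n) if m not in (i, j, k))
        if any(in_circumcircle(a, b, c, p) for p in others):
            continue                      # not a Delaunay triangle
        for u, v in ((i, j), (j, k), (i, k)):
            adj[u][v] = adj[v][u] = 1
    return adj
```

Computing this matrix for each frame gives a sequence of formation patterns; two frames can then be compared entry by entry, which is one simple way to feed a clustering step.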

Key words: Formation, Delaunay triangulation, hierarchical clustering.

Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 309-321 (2017)

## Development of Optimization Algorithm for Attack Play in Football

(Junior and Senior High School at Komaba, University of Tsukuba/Doctoral Program in Physical Education, Health and Sport Sciences, University of Tsukuba)
(Japan Sports Council)
(Doctoral Program in Physical Education, Health and Sport Sciences, University of Tsukuba)
(Nitobebunkagakuen)
(Department of Health and Sports Science, Juntendo University)
(Faculty of Health and Sport Sciences, University of Tsukuba)

Although many analyses of sports performance data have been performed, few studies have worked with big data. The purpose of this study was to develop an optimization algorithm that increases player shot probability using big data. Using attacking data from all 306 matches of the 2013 J. League Division 1 season, supplied by DataStudiam Inc., we converted the raw data to a binary dataset in accordance with the measurement items in a prior study. We created a cooperation probability matrix from the odds ratios between measurement items and devised the "insertion algorithm," which has the following procedure: (1) store the "success" items from an attacking play; (2) sort the success items in descending order of cooperation probability for "shoot"; (3) calculate the probability obtained when a "failure" item is inserted between the success items; (4) if that probability is higher than the probability between the success items, insert the failure item; and (5) continue the insertion in a double loop. Team attack characteristics were compared by calculating the success rate and the improvement rate obtained by applying the algorithm.
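One possible reading of the five steps above is sketched below. The abstract does not define how the probability of a chain of items is computed, so this sketch makes explicit assumptions: the cooperation probability matrix is a nested dictionary `coop[a][b]`, the probability of passing from item `a` to item `b` through a failure item `f` is taken as `coop[a][f] * coop[f][b]`, and each failure item may be inserted at most once. The function name and the item names used to try it out are hypothetical.

```python
def insertion_pass(success_items, failure_items, coop, target="shoot"):
    """Greedy insertion of failure items into a chain of success items (assumed scoring rule)."""
    # Steps 1-2: keep the success items, ordered by cooperation probability with the target.
    chain = sorted(success_items, key=lambda item: coop[item][target], reverse=True)
    remaining = list(failure_items)
    i = 0
    while i < len(chain) - 1:               # outer loop over adjacent pairs (step 5)
        a, b = chain[i], chain[i + 1]
        best_item, best_prob = None, coop[a][b]
        for f in remaining:                 # inner loop over candidate failure items
            prob = coop[a][f] * coop[f][b]  # step 3: probability with f inserted (assumption)
            if prob > best_prob:            # step 4: keep f only if it improves the link
                best_item, best_prob = f, prob
        if best_item is None:
            i += 1
        else:
            chain.insert(i + 1, best_item)
            remaining.remove(best_item)
            i += 2                          # continue with the pair that starts at b
    return chain
```

With a matrix in which linking two success items directly is unlikely but both link well to a given failure item, the pass places that failure item between them. Other scoring rules for step 3 would change which insertions are accepted.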

Key words: Soccer, J. League, attack play, optimization algorithm, big data.

Proceedings of the Institute of Statistical Mathematics Vol.65, No.2, 323-339 (2017)

## Recent Development of Integer-valued Autoregressive Models

(Graduate School of Science and Engineering, Chuo University)
(Faculty of Science and Engineering, Chuo University)
(The Institute of Statistical Mathematics/Department of Statistical Science, SOKENDAI)

Integer-valued autoregressive (INAR) models express the current integer-valued observation in terms of past integer-valued observations. INAR models differ fundamentally from dynamic generalized linear models and general state-space models, which employ an unobservable (latent) process to model integer-valued time series. In INAR models, the choice of the marginal and/or innovation distribution and the definition of the "autoregressive part" are very important to ensure compatibility (in the distributional sense) of both sides of the autoregressive model. Although publications appeared sporadically from the mid-1980s to the early 1990s, notable contributions were not reported for more than a decade. However, in the late 2000s, new results began appearing on a regular basis. In this paper, we introduce recent developments in INAR models, beginning with the Poisson INAR model. For some theorems and propositions stated without proofs in the existing literature, we compile our notes in the appendix. Additionally, we include new results on an INAR model based on the difference of two Poisson distributions and illustrate them with a real data analysis.
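As an illustration of the starting point named above, the Poisson INAR(1) model with binomial thinning, $X_t = \alpha \circ X_{t-1} + \varepsilon_t$ with $\varepsilon_t \sim \mathrm{Poisson}(\lambda)$, can be simulated with a short sketch. The function names and parameter values are hypothetical, and the Poisson sampler uses the simple multiplication method, which is adequate only for small $\lambda$.

```python
import math
import random

def poisson(rng, lam):
    """Poisson draw by multiplying uniforms until the product falls below exp(-lam)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def thin(rng, alpha, x):
    """Binomial thinning: each of the x units survives independently with probability alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)

def simulate_inar1(n, alpha, lam, seed=0):
    """Simulate n steps of X_t = alpha o X_{t-1} + eps_t with Poisson(lam) innovations."""
    rng = random.Random(seed)
    x = poisson(rng, lam / (1.0 - alpha))   # start from the stationary Poisson marginal
    series = []
    for _ in range(n):
        x = thin(rng, alpha, x) + poisson(rng, lam)
        series.append(x)
    return series
```

For this model the stationary marginal is Poisson with mean $\lambda / (1 - \alpha)$ and the lag-one autocorrelation is $\alpha$, so the sample mean and sample autocorrelation of a long simulated series give a quick check on the code.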

Key words: Integer-valued time series data, thinning operator, decomposition of random variables, INAR(1) model, INAR(p) model, method of moments.