Quantifan @rwdvc - Tumblr Blog

Optimizing MLB Lineups with Julia and JuMP

by Robert Del Vicario

I was recently asked by someone on reddit for suggestions on how to update the NBA optimization code in R that I had posted a while back so that it accepts MLB lineups with stacking requirements, defined via the number of teams that can be in a lineup. My suggestion to anyone using R for optimization would be to move to the Julia language and the JuMP package (or Python's pyomo/pulp).

Specifying optimization problems in either language will be much easier and much more flexible. I prefer Julia's JuMP, but it is largely a function of preferring what I know.

The code below is built to support DraftKings MLB, but the same concepts will work for any sport for which you want to create stacked teams. I'm a bit of a Julia noob so I suspect code optimizations abound.

I'd also note that you could wrap the optimization portion in a function to call iteratively to generate a large number of lineups.

using JuMP
using Cbc
using DataFrames

#Optimization rules - Draftkings
salary_cap = 50000
num_teams = 10 #set value to 10 or larger if you don't want stacking requirements
p = 2
c = 1
fb = 1
sb = 1
tb = 1
ss = 1
of = 3

#data load
#you'll need to set your working directory
df = readtable("DKSalaries.csv")

#position encodings
pitcher = ["SP", "RP"]
first_base = ["1B"]
second_base = ["2B"]
third_base = ["3B"]
center = ["C"]
outfield = ["OF"]
short_stop = ["SS"]

#setup model matrices
#split multi-position strings such as "1B/OF" into one array per player
player_positions = [split(pos, "/") for pos in df[:Position]]

#returns a 0/1 vector flagging players eligible for any of pos_encodings
function getPosition(pp, pos_encodings)
    pp_out = zeros(size(pp)[1])
    for i = 1:size(pp)[1]
        push_flag = 0
        for x = 1:size(pp[i])[1]
            for z = 1:size(pos_encodings)[1]
                if pp[i][x] == pos_encodings[z]
                    pp_out[i] = 1
                    push_flag = 1
                    break
                end
            end
            if push_flag == 1
                break
            end
        end
        if push_flag == 0
            pp_out[i] = 0
        end
    end
    return(pp_out)
end

#returns a players x teams 0/1 matrix
function getDummyTeams(teams)
    unique_teams = unique(teams)
    mat = zeros(size(teams)[1], length(unique_teams))
    for i = 1:size(teams)[1]
        for x = 1:length(unique_teams)
            if teams[i] == unique_teams[x]
                mat[i,x] = 1
            end
        end
    end
    #columns follow the order of unique(teams)
    return(mat)
end

pitcher = getPosition(player_positions, pitcher)
first_base = getPosition(player_positions, first_base)
second_base = getPosition(player_positions, second_base)
third_base = getPosition(player_positions, third_base)
center = getPosition(player_positions, center)
outfield = getPosition(player_positions, outfield)
short_stop = getPosition(player_positions, short_stop)
teams = getDummyTeams(df[:teamAbbrev])
unique_teams = unique(df[:teamAbbrev])

#setup optimization model
m = Model(solver = CbcSolver())

#create our binary variable to choose a player
@variable(m, x[1:size(df)[1]], Bin)
@variable(m, opt_team[1:length(unique_teams)], Bin)

#setup salary constraint
@constraint(m, sum(x .* df[:Salary]) <= salary_cap)
@constraint(m, sum(x) == 10)

#setup max teams constraint for stacking
@constraint(m, sum(opt_team) == num_teams) #set constraint for how many teams can be selected

#a player can only be selected if his team is one of the chosen teams
for j = 1:size(teams)[1]
    for k = 1:size(teams)[2]
        if teams[j,k] == 1
            @constraint(m, x[j] * teams[j, k] <= opt_team[k])
        end
    end
end

#setup position constraints
@constraint(m, sum(x .* pitcher) >= p)
@constraint(m, sum(x .* first_base) >= fb)
@constraint(m, sum(x .* center) >= c)
@constraint(m, sum(x .* second_base) >= sb)
@constraint(m, sum(x .* third_base) >= tb)
@constraint(m, sum(x .* short_stop) >= ss)
@constraint(m, sum(x .* outfield) >= of)

#setup objective
@objective(m, Max, sum(x .* df[:AvgPointsPerGame]))

status = solve(m)
println("Objective value: ", getobjectivevalue(m))
selected = getvalue(x)

#print out summary stats
selected_team = df[selected .> 0.5, :] #threshold guards against rounding issues
println("Total salary: ", sum(selected_team[:Salary]))
println("Projected points per game: ", sum(selected_team[:AvgPointsPerGame]))
println("Unique teams: ", length(unique(selected_team[:teamAbbrev])))
println(selected_team)
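To make the effect of the team-count (stacking) constraint concrete, here is a minimal brute-force sketch in Python rather than Julia/JuMP. It is not the model above: the player pool, salaries, projections and the helper name `best_lineup` are all invented for illustration, and enumeration only works for a toy pool this small.

```python
from itertools import combinations

# Toy player pool: (name, team, salary, projected points). All values are invented.
PLAYERS = [
    ("A", "BOS", 5000, 9.0),
    ("B", "BOS", 4000, 7.5),
    ("C", "NYY", 4500, 8.0),
    ("D", "NYY", 3500, 6.0),
    ("E", "LAD", 3000, 6.5),
    ("F", "LAD", 2500, 4.0),
]

def best_lineup(players, size, salary_cap, max_teams):
    """Return the highest-projection lineup that uses at most max_teams teams."""
    best, best_pts = None, float("-inf")
    for combo in combinations(players, size):
        salary = sum(p[2] for p in combo)
        teams = {p[1] for p in combo}
        # Skip lineups that break the salary cap or span too many teams.
        if salary > salary_cap or len(teams) > max_teams:
            continue
        pts = sum(p[3] for p in combo)
        if pts > best_pts:
            best, best_pts = combo, pts
    return best, best_pts

# Without a binding team limit the best lineup spans three teams.
unstacked, unstacked_pts = best_lineup(PLAYERS, size=3, salary_cap=13000, max_teams=3)
# Limiting the lineup to two teams forces a stack and costs some projected points.
stacked, stacked_pts = best_lineup(PLAYERS, size=3, salary_cap=13000, max_teams=2)
```

Tightening `max_teams` plays the role of lowering `num_teams` in the JuMP model: the solver gives up a little projection in exchange for players who share a team.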

#julia #jump #dfs #dfssports #reddit #cbc #mlb #optimization

NBA Player Stats Data

by Robert Del Vicario

For DFS players in need of a good source of data, I'm going to make my daily NBA stats available. Stats should be updated around 10AM Eastern daily. If they aren't, a cron job failed to kick off or my computer rebooted or something, and I take no responsibility for it.

Stats are downloaded from the NBA's API. Those who would like to download the stats on their own can pick up the code from the nodes of the KNIME workflow linked below (KNIME is an open source, workflow-based analytics application).

Hopefully this data can help improve your DFS lineups!

Player data: https://dl.dropboxusercontent.com/u/10684925/reddit_nba_data/nba_basketball_data.csv

Season logs: https://dl.dropboxusercontent.com/u/10684925/reddit_nba_data/season_logs.csv

KNIME Workflow w/ Code: https://dl.dropboxusercontent.com/u/10684925/reddit_nba_data/NBA%20Download.knwf

#NBA DFS

Trading SPY

by Robert Del Vicario

I've always been interested in algorithmic trading strategies. Removing some of the human bias from the trading equation has always seemed a sensible idea given the biases involved in human decision making. To that end I decided to create a simple long only strategy that uses the VIX and TNX to help trade SPY based on the RSI and its derivatives.

As we can see the strategy is trash, but I do believe that it has some learnings that I can build on. The first is the use of time slicing to validate my GBM models. The second is the development of a very simple trade execution that should be easily extendable. As a side note I looked at the Quantstrat package but found it to be a bit overkill for my use (long only equity trades). Code below...

require(plyr)
require(caret)
require(PerformanceAnalytics)
require(TTR)
require(quantmod)
require(dplyr)

setwd("~/Dropbox/Analytics Projects/Quantifan/Stock Analysis")
source("tradingfunctions.r")

# UDV ---------------------------------------------------------------------
symbols <- c("^VIX", "SPY", "^TNX")
train_split <- as.Date("2009-01-01")
from <- as.Date("2001-01-01")
to <- Sys.Date()

# UDF ---------------------------------------------------------------------
sym2DF <- function(x){
  x <- as.data.frame(x)
  x$date <- as.Date(rownames(x))
  x
}

r2 <- function(y, y_hat){
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}

executeTrades <- function(signal, date, close, trade_perc, trans_cost = 5, risk_dollars = 100000){
  require(dplyr)
  #valid signals are long & exit -- will add shorts at a later date
  if(length(signal) == length(close)){
    df <- data.frame(date, signal, close, trade_perc)
    #sort by date, then lag the signal to help us evaluate whether it has changed
    df <- df %>%
      arrange(date) %>%
      mutate(
        signal_1 = lag(signal)
      )
    #setup account tracking
    df$cash <- risk_dollars
    df$account_val <- risk_dollars
    df$shares <- 0
    df$trade <- 0
    #loop through all the trade dates
    for(i in 1:nrow(df)){
      #setup the portfolio on trade 1
      if(i == 1){
        #if signal == exit then we don't enter a trade
        #if signal == long then we enter
        if(df$signal[i] == "exit"){
          df$account_val[i] <- risk_dollars
        } else {
          #calculate the number of shares to purchase
          df$shares[i] <- floor(((df$cash[i] - trans_cost) * df$trade_perc[i]) / df$close[i])
          #subtract purchase price from cash
          df$cash[i] <- df$cash[i] - (df$shares[i] * df$close[i]) - trans_cost
          #calculate account value
          df$account_val[i] <- df$cash[i] + df$shares[i] * df$close[i]
          #log the trade
          df$trade[i] <- 1
        }
      }
      if(i != 1){
        #if nothing has changed calculate portfolio value
        if(df$signal[i] == df$signal_1[i]){
          df$shares[i] <- df$shares[i-1]
          df$cash[i] <- df$cash[i-1]
          df$account_val[i] <- df$cash[i] + df$shares[i] * df$close[i]
        }
        #if going long add in transaction cost and calc value
        if(df$signal[i] != df$signal_1[i] & df$signal[i] == "long"){
          #calculate number of shares to buy based on previous period
          df$shares[i] <- floor(((df$cash[i - 1] - trans_cost) * df$trade_perc[i]) / df$close[i])
          #subtract cost of shares bought
          df$cash[i] <- df$cash[i - 1] - (df$shares[i] * df$close[i]) - trans_cost
          #calculate portfolio value
          df$account_val[i] <- df$cash[i] + df$shares[i] * df$close[i]
          #log the trade
          df$trade[i] <- 1
        }
        if(df$signal[i] != df$signal_1[i] & df$signal[i] == "exit"){
          #sell shares and add back to cash
          df$cash[i] <- df$cash[i - 1] + (df$shares[i-1] * df$close[i]) - trans_cost
          #no shares are held after an exit
          df$shares[i] <- 0
          #calculate portfolio value
          df$account_val[i] <- df$cash[i]
          #log the trade
          df$trade[i] <- 1
        }
      }
    }
    #calculate indexed portfolio returns
    df$returns <- df$account_val / df$account_val[1]
  } else {
    stop("Error: Vector lengths do not match.")
  }
  df
}

# Data Load ---------------------------------------------------------------
getSymbols(symbols, from = from, to = to)
summary(VIX)
#TNX
candleChart(TNX)
addMACD()
addBBands()
#VIX
candleChart(VIX)
addMACD()
addBBands()
#SPY
candleChart(SPY)
addMACD()
addBBands()

tnx <- sym2DF(TNX)
vix <- sym2DF(VIX)
spy <- sym2DF(SPY)
vars <- merge(tnx, vix)
vars <- merge(vars, spy)

# Transform data ----------------------------------------------------------
colnames(vars) <- tolower(colnames(vars))
vars <- vars %>%
  arrange(date) %>%
  mutate(
    #DV
    daily_return = log(spy.adjusted) - lag(log(spy.adjusted)),
    signal = ifelse(daily_return > 0.0, "long", "exit"),
    #spy predictors
    #current
    spy_spread = (spy.high - spy.low) / spy.close,
    spy_vol_chg = log(spy.volume) - lag(log(spy.volume)),
    spy_rsi_fast = RSI(spy.adjusted, 14),
    spy_rsi_med = RSI(spy.adjusted, 50),
    spy_rsi_slw = RSI(spy.adjusted, 200),
    spy_fast_med = spy_rsi_fast - spy_rsi_med,
    spy_fast_slw = spy_rsi_fast - spy_rsi_slw,
    #vix predictors
    #current
    vix_chg = log(vix.adjusted) / lag(log(vix.close)),
    vix_rsi_fast = RSI(vix.adjusted, 14),
    vix_rsi_med = RSI(vix.adjusted, 50),
    vix_rsi_slw = RSI(vix.adjusted, 200),
    vix_fast_med = vix_rsi_fast - vix_rsi_med,
    vix_fast_slw = vix_rsi_fast - vix_rsi_slw,
    #tnx predictors
    #current
    tnx_chg = log(tnx.close) - lag(log(tnx.close)),
    tnx_rsi_fast = RSI(tnx.adjusted, 14),
    tnx_rsi_med = RSI(tnx.adjusted, 50),
    tnx_rsi_slw = RSI(tnx.adjusted, 200),
    tnx_fast_med = tnx_rsi_fast - tnx_rsi_med,
    tnx_fast_slw = tnx_rsi_fast - tnx_rsi_slw
  )
rm(SPY, spy, TNX, tnx, VIX, vix)

# New model ---------------------------------------------------------------
train <- vars[vars$date < train_split, ]
test <- vars[vars$date >= train_split, ]

gbmGrid <- expand.grid(interaction.depth = c(1, 2, 3),
                       n.trees = c(100, 500, 1000),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
#time slicing: train on a rolling window and validate on the periods that follow it
gbmTrain <- trainControl(method = "timeslice", initialWindow = 1000, horizon = 500)

set.seed(1)
mdl <- train(daily_return ~ spy_fast_slw + spy_fast_med + spy_rsi_fast + spy_rsi_med +
               vix_chg + vix_rsi_fast + vix_fast_med + vix_fast_slw +
               tnx_chg + tnx_rsi_fast + tnx_rsi_med + tnx_fast_med + tnx_fast_slw +
               spy_vol_chg,
             data = train,
             method = "gbm",
             tuneGrid = gbmGrid,
             trControl = gbmTrain)
mdl
plot(mdl)
varImp(mdl)

#test <- vars[complete.cases(vars), ]
test$predicted <- predict(mdl, test)
plot(test$predicted)
test$predicted <- ifelse(test$predicted > 0.000, "long", "exit")
table(test$predicted)

result <- executeTrades(signal = test$predicted, date = test$date, close = test$spy.close, trade_perc = 1, trans_cost = 5)
result$spy_returns <- test$spy.adjusted / test$spy.adjusted[1]

# Review Performance ------------------------------------------------------
graph_data <- data.frame(strat = result$account_val, spy = test$spy.adjusted)
row.names(graph_data) <- test$date
graph_data$strat <- log(graph_data$strat) - lag(log(graph_data$strat))
graph_data$spy <- log(graph_data$spy) - lag(log(graph_data$spy))
graph_data <- as.xts(graph_data)
# graph_data <- CalculateReturns(graph_data)
charts.PerformanceSummary(graph_data, begin = "axis")
table.Stats(graph_data)
SharpeRatio.annualized(graph_data)
chart.RiskReturnScatter(graph_data)
charts.RollingPerformance(graph_data$strat)
chart.RelativePerformance(graph_data$strat, graph_data$spy)
table.CAPM(graph_data$strat, graph_data$spy)
chart.RollingCorrelation(graph_data$strat, graph_data$spy)
#downside risks
table.DownsideRisk(graph_data, Rf=.03/12)
table.Drawdowns(graph_data$strat)
table.Drawdowns(graph_data$spy)

Created by Pretty R at inside-R.org

#R Rfinance SPY Trading

NFL Team Rankings 2015-12-21

by Robert Del Vicario

The Bengals retain the #1 spot, but with Dalton out I don't expect that to last. The Patriots are back in the #2 spot! And, importantly, the next two games look to be a lock, as they have a 0.92 probability of beating both the Jets and the Dolphins. Interestingly, it looks as if Carolina will remain unbeaten in the regular season: the probability that they beat the Falcons looks to be about .90 and the probability that they beat Tampa Bay is around .98. As a side note, astute followers (do I have any?) of the rankings will notice that historical ratings vary from week to week. For example, when I posted last week New England was #4 in week 14, and now they are #3. This is because I run the ratings before the Monday night game has been played, so I'm usually missing two teams, which can impact the ratings.

#nfl nflrankings R quantile_regression

NFL Team Rankings 2015-12-14

by Robert Del Vicario

Pats fall to #4 despite a win and the Bengals maintain the #1 position despite a loss.

#nfl #nflrankings

NFL Team Rankings 2015-12-07

by Robert Del Vicario

New team rankings are up and interestingly the Patriots have reclaimed the #1 spot. Why they have reclaimed the #1 spot isn't clear to me yet (I ran the model with both quantile and linear regression with similar results), but I suspect the Pats' schedule is starting to look tougher than it previously did. Despite their #1 rank I suspect the Pats are in trouble unless they start getting some of their injured players back (i.e., Gronk & Edelman). The Bengals' fall to #2 despite a blowout win over Cleveland is a bit weird, and likely reflects a schedule that looks increasingly soft. Anyway, rankings below. Edit: Found a bug in my code where I was dropping the most recent game played. As expected, New England is no longer in the #1 spot and Cincinnati holds the #1 spot.

#nfl #nfl_rankings #quantile_regression

Quantile Regression Team Rankings for 2015-11-30

by Robert Del Vicario

New NFL team rankings are up, and this time in Tableau format. You can select either a view of all team rankings from any point since the third week of the year using the "Team Ranking in Week:" filter, or you can click on an individual team in the "Team Rankings" table and the "Team Ranking Over Time" and "Team Offense & Defense" graphs will filter on that particular team. Additionally, a weighting scheme was implemented such that the first game of the season carries 1/2 the weight of the last game of the season.
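As a rough illustration of such a weighting scheme, the Python sketch below ramps game weights from 0.5 for the first game to 1.0 for the most recent one. The post only states the two endpoints, so the linear ramp and the helper name `game_weights` are assumptions rather than the scheme actually used in the rankings.

```python
def game_weights(n_games, first_weight=0.5, last_weight=1.0):
    """Linearly interpolate weights from the first game to the most recent one.

    Assumption: the ramp between the two endpoints is linear.
    """
    if n_games == 1:
        return [last_weight]
    step = (last_weight - first_weight) / (n_games - 1)
    return [first_weight + step * i for i in range(n_games)]

# Five games played so far: the opener counts half as much as the latest game.
weights = game_weights(5)
```

Weights like these could be passed as observation weights when fitting the regression, so that older games influence the ratings less.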

#nfl quantreg

NFL Quantile Regression Team Rankings for 11/23

by Robert Del Vicario

Rank  Team                  Offense  Defense   Total
   1  New England Patriots      5.0     -3.4     8.4
   2  Carolina Panthers        -0.2     -4.0     3.8
   3  Arizona Cardinals         0.0      0.0     0.0
   4  Cincinnati Bengals       -6.6     -5.8    -0.8
   5  Green Bay Packers        -4.6     -2.6    -2.0
   6  Kansas City Chiefs        0.8      2.8    -2.0
   7  Indianapolis Colts       -7.6     -5.4    -2.2
   8  Denver Broncos           -8.2     -5.2    -3.0
   9  Buffalo Bills             2.2      5.8    -3.6
  10  Miami Dolphins           -4.6      1.2    -5.8
  11  Tennessee Titans         -1.4      5.0    -6.4
  12  Pittsburgh Steelers      -6.6      0.2    -6.8
  13  New York Jets           -10.2     -3.0    -7.2
  14  Atlanta Falcons          -4.6      3.0    -7.6
  15  Oakland Raiders          -2.8      4.8    -7.6
  16  St. Louis Rams           -7.2      1.2    -8.4
  17  Seattle Seahawks         -7.8      1.4    -9.2
  18  Minnesota Vikings       -12.4     -3.0    -9.4
  19  Cleveland Browns         -9.6      1.4   -11.0
  20  Chicago Bears            -1.8     10.4   -12.2
  21  Tampa Bay Buccaneers     -8.0      5.8   -13.8
  22  Baltimore Ravens         -9.4      5.0   -14.4
  23  San Diego Chargers      -11.2      3.6   -14.8
  24  Detroit Lions           -13.2      4.4   -17.6
  25  Jacksonville Jaguars    -11.6      6.2   -17.8
  26  New Orleans Saints        3.0     21.8   -18.8
  27  Washington Redskins     -14.6      4.4   -19.0
  28  New York Giants         -10.4      9.2   -19.6
  29  Dallas Cowboys          -14.8      6.0   -20.8
  30  Houston Texans          -12.8      8.8   -21.6
  31  Philadelphia Eagles     -10.6     15.4   -26.0
  32  San Francisco 49ers     -19.8      9.8   -29.6

Visualizing Lineup Risk & Reward

by Robert Del Vicario

Following up on yesterday's post, I took a crack at creating a number (1,000) of near-optimal lineups and then calculating the expected standard deviation of fantasy points for each lineup. For all those R users out there, lineup standard deviation is calculated as (t(rep(1,9)) %*% (covar %*% rep(1,9)))^(1/2). This expected standard deviation comes right from the finance literature. Additionally, I should note that I assumed players that weren't on the same team had a covariance of 0. Technically this isn't true, but when I looked at the global correlation between fantasy points and opponent team points it was quite low (~0.02). I think the dashboard embedded below is pretty interesting in that there are a surprising number of lineups *very* close to one another in terms of expected total points, but the standard deviation of the portfolios varies significantly. For example, the 279th best lineup has an expectation of 266 fantasy points compared to 271 fantasy points for the best lineup. However, that same portfolio has a standard deviation of 75 fantasy points compared to 84 fantasy points for the lineup with the greatest expected points. That is a tradeoff that I might be willing to make. The opposite holds for high variance lineups: I might be willing to take a little less expectation in GPPs to gain some variance.
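For readers who don't use R, the same calculation can be sketched in plain Python. Multiplying the covariance matrix by a vector of ones on both sides simply adds up every entry of the matrix, so the lineup standard deviation is the square root of that sum. The 3-player covariance matrix below is invented for illustration, and `lineup_std` is a hypothetical helper name.

```python
import math

def lineup_std(covar):
    """Standard deviation of a lineup's total points: sqrt(1' * covar * 1).

    covar is the square covariance matrix of the players in the lineup.
    """
    total_variance = sum(sum(row) for row in covar)
    return math.sqrt(total_variance)

# Invented example: players 1 and 2 are teammates (covariance 4),
# player 3 is on another team (covariance 0 with both).
covar = [
    [16.0, 4.0, 0.0],
    [4.0, 25.0, 0.0],
    [0.0, 0.0, 9.0],
]
std = lineup_std(covar)  # square root of 16 + 25 + 9 + 2 * 4 = 58
```

Positive covariance between teammates raises the lineup's standard deviation above what the individual variances alone would give, which is why stacked lineups tend to be the high-variance ones.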

DFS Lineup Mean Variance Optimization

by Robert Del Vicario

DFS sites with optimizers seem to be popping up left, right and center. However, what many players want isn't simply an optimized team, but rather a team that will generate an expected number of fantasy points for a given level of risk.

Fortunately that is a problem that has been solved (to some extent) in finance. Mean variance optimization, developed by Harry Markowitz, allows an investor to maximize a portfolio's return for a given level of risk, or minimize a portfolio's risk for a given level of return. For many players these are exactly the types of lineups that they would like to be able to optimize against. In a 50/50 you might want to construct a team with an expected FP total while minimizing risk, or in a GPP you might actually want to maximize risk in an attempt to generate high variance teams allowing you to place well.

So what do we need to accomplish this?

Player fantasy points projections

Player positions to set constraints and build our team

Player salaries to set constraints against

Player covariance matrix to understand how player performances covary

The hard part here is the player covariance matrix, and there are a few options for constructing it. The first is that each player covaries with every other player (unlikely), the second is that each player only covaries with every player he plays against (likely but hard to model), and the third is that each player covaries with his teammates and the opposing team (likely and easier to model). With this bit of knowledge we can construct a covariance matrix for each player on a given night. Players who aren't facing each other are expected to have a covariance of 0.
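One way to picture the third option is to start from an estimated covariance matrix and zero out every pair of players who are not in the same game. The Python sketch below does only that masking step; the covariance values, the game identifiers and the helper name `mask_covariance` are made up, and estimating the covariances themselves is not shown.

```python
def mask_covariance(covar, game_ids):
    """Keep covariances for players in the same game; set all other pairs to 0.

    covar    -- square matrix (list of lists) of estimated covariances
    game_ids -- one game identifier per player, in the same order as covar
    """
    n = len(game_ids)
    return [
        [covar[i][j] if game_ids[i] == game_ids[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

# Invented example: players 0 and 1 meet in game "BOS@NYK", player 2 plays elsewhere.
raw = [
    [20.0, 3.0, 1.5],
    [3.0, 30.0, -2.0],
    [1.5, -2.0, 12.0],
]
masked = mask_covariance(raw, ["BOS@NYK", "BOS@NYK", "LAL@DEN"])
```

The diagonal (each player's own variance) always survives, since every player shares a game with himself.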

Once we have created the above pieces we can begin the optimization process. I coded up the following in JuMP using IBM's CPLEX solver.

using DataFrames, JuMP, Gadfly, AmplNLWriter, Cbc, CPLEX

cd("C:\\Users\\blahblah\\Desktop\\Projects\\140427 IJulia Notebooks\\151119 Mean Variance Portfolio Optimization")

#read in data for optimization
covar = readcsv("covar.csv")
c = readcsv("c.csv")
pg = readcsv("pg.csv")
pf = readcsv("pf.csv")
sf = readcsv("sf.csv")
sg = readcsv("sg.csv")
proj = readcsv("proj_pts.csv")
sal = readcsv("sal.csv")
n = size(pf,1)

#m = Model(solver=CouenneNLSolver())
#m = Model(solver=CbcSolver())
m = Model(solver = CplexSolver())

@defVar(m, x[1:n], Bin)

#position constraints
@addConstraint(m, dot(x, c[1:n,1]) == 1)
@addConstraint(m, dot(x, pg[1:n,1]) == 2)
@addConstraint(m, dot(x, pf[1:n,1]) == 2)
@addConstraint(m, dot(x, sg[1:n,1]) == 2)
@addConstraint(m, dot(x, sf[1:n,1]) == 2)

#the salary-cap and minimum-projection constraints were fused into one line in the
#original listing; the two lines below are a reconstruction, and the cap of 60000
#is an assumption borrowed from the FanDuel solver elsewhere on this blog
@addConstraint(m, dot(x, sal[1:n,1]) <= 60000)
@addConstraint(m, dot(x, proj[1:n,1]) >= 125)

#minimize lineup variance
@setObjective(m, Min, sum{covar[i,j] * x[i] * x[j], i = 1:n, j = 1:n})

status = solve(m)
println("Objective value: ", getObjectiveValue(m))

The upside is that the above runs and solves the problem (I assume at some point in time). However, I left it running on my box for 4 hours at 100% CPU utilization and it did little more than chew up 36 GB of RAM and run up my electrical bill. The issue lies in the binary constraint, which turns this optimization problem from an easy model to solve into a very hard one.

I don't think all is lost. Previously I had posted some code that folks can use to generate numerous lineups very quickly (~1k in 6 seconds) using a linear solver. What we can do is take the lineups output from the linear solver and calculate the risk of each lineup pretty quickly in R. This would allow us to create a large set of lineups while assessing the risk associated with each lineup. One could then simply pick the best lineup for their needs (e.g., high risk for a GPP or low risk for a 50/50).

Next time around I'll look to post some code for creating the lineups and assessing lineup risk.

#dfs optimization nba

NFL Team Rankings

by Robert Del Vicario

I was curious to see how my beloved Pats were performing in contrast to the rest of the league so I used the code from my previous post to determine their ranking relative to the rest of the league, and it looks like they are still holding the #1 spot.

Rank  Team                  Offense  Defense   Total
   1  New England Patriots      2.5     -5.5    8.01
   2  Cincinnati Bengals       -3.5     -8.5    5.01
   3  Carolina Panthers       -3.99       -5    1.01
   4  Kansas City Chiefs       -2.5     -3.5       1
   5  Arizona Cardinals           0        0       0
   6  Indianapolis Colts         -1     -0.5   -0.49
   7  Pittsburgh Steelers      -8.5       -8    -0.5
   8  Denver Broncos             -8       -7      -1
   9  Buffalo Bills              -4       -2   -1.99
  10  Green Bay Packers        -3.5     -1.5      -2
  11  New York Giants         -1.99      0.5   -2.49
  12  New York Jets            -8.5       -6   -2.49
  13  Seattle Seahawks         -5.5       -3   -2.49
  14  Minnesota Vikings          -5     -2.5    -2.5
  15  St. Louis Rams          -11.5     -7.5      -4
  16  Baltimore Ravens         -5.5     -0.5   -4.99
  17  New Orleans Saints         -5      0.5    -5.5
  18  Houston Texans            -10       -4   -5.99
  19  Chicago Bears           -10.5       -4    -6.5
  20  San Diego Chargers       -5.5        1    -6.5
  21  Atlanta Falcons          -8.5       -1    -7.5
  22  Philadelphia Eagles        -7      0.5    -7.5
  23  Dallas Cowboys          -12.5    -4.51   -7.99
  24  Oakland Raiders            -7        1      -8
  25  San Francisco 49ers     -15.5     -6.5   -8.99
  26  Washington Redskins       -11       -2      -9
  27  Tampa Bay Buccaneers       -8     1.99   -9.99
  28  Jacksonville Jaguars     -9.5        3  -12.49
  29  Miami Dolphins            -13        0     -13
  30  Detroit Lions           -13.5        1   -14.5
  31  Cleveland Browns        -13.5        2   -15.5
  32  Tennessee Titans        -18.5    -1.01  -17.49

Edit: I forgot to include the Cardinals as they were the intercept team.

require(quantreg)
require(reshape2)
require(sqldf)
require(ggplot2)
require(stringr)
require(data.table)
require(XML)
require(dplyr)

# UDF ---------------------------------------------------------------------
cleanScores % select(date, winner, field, loser, winner_score, loser_score) df % arrange(rank)
write.csv(rankings, paste0(Sys.Date(), "_nfl_rankings.csv"), row.names = F)

Estimating Spreads & Points Totals with Quantile Regression

by Robert Del Vicario

Estimating team strength with spreads and points totals has been of interest to me since I took an analytics course taught by the fantastic Wayne Winston.

In that course he showed us how to estimate team strength with Excel's solver. Estimating team strength was deceptively easy, but a point estimate of team strength doesn't lend itself to estimating points spreads or totals as it requires you to assume that they are normally distributed. Examining the points total distribution for the 2015 season makes it apparent that points totals are not normally distributed.

So the question becomes how can we solve this issue? I've been a big fan of quantile regression to estimate conditional quantiles for a while now, and Roger Koenker has come up with a pretty nifty solution to this exact problem. Additionally, he has been kind enough to post his code for estimating outcomes which can be found at the link above and the paper can be found here.

After looking over his solution I tossed everything he did (except the ideas) and implemented my own version that is 100% inferior, but a bit less complex.

To start we develop a paired comparison model in which each team's points are estimated based on the opponent, whether the team is playing at home, and the interaction of each team's average pace, for the 1st through 99th quantiles.

# Quantile regression model -----------------------------------------------
require(quantreg)

#remove ties in points
df$pts <- df$pts + rnorm(nrow(df), 0, .0001)

#the date comparison operator was lost from the original listing; `<` is assumed
rq_mdl <- rq(pts ~ team + opp + venue + pace_ma_1 * opp_pace_ma_1,
             data = df[df$date < as.Date("2015-02-01"), ],
             tau = seq(.01, .99, .01))

The next step in the process is to estimate outcomes for each team (in points) based on our quantile estimates. We have a couple of options here. First, we could assume that there is no correlation between team performances. Second, we could assume that there is perfect correlation between team performances. Last, we could assume that there is some level of correlation in team performance, which is the more realistic assumption and is borne out by measuring the Spearman rank correlation in points between teams.

With that knowledge we can easily generate a multivariate normal distribution using the between-team points correlations and convert the normal distribution to a uniform distribution that we will use to inform our bootstrap point spread and points totals estimates. Props to the Datah blog for the intro on copulas.
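The idea in the previous paragraph (a Gaussian copula) can be sketched in a few lines of standard-library Python, separate from the R listing that follows. Two correlated standard normals are drawn, each is pushed through the normal CDF to get a uniform value, and the result is rounded to a quantile index between 1 and 99. The correlation of 0.1 and the helper names are illustrative assumptions, not values from the post.

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sample_quantile_indices(rho, n, seed=1):
    """Draw n pairs of correlated quantile indices in 1..99 via a Gaussian copula."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # Correlated standard normals: z2 has correlation rho with z1.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Normal CDF turns each draw into a uniform value on (0, 1).
        u1, u2 = normal_cdf(z1), normal_cdf(z2)
        # Scale to percentiles and clamp to 1..99 so they can index quantile fits.
        q1 = min(99, max(1, round(u1 * 100)))
        q2 = min(99, max(1, round(u2 * 100)))
        pairs.append((q1, q2))
    return pairs

# rho = 0.1 is a made-up stand-in for the measured between-team correlation.
indices = sample_quantile_indices(rho=0.1, n=10000)
```

Each pair can then be read as "team lands on its q1-th percentile score while the opponent lands on its q2-th", which is how the draws get matched to the fitted conditional quantiles.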

# Multivariate uniform distribution ---------------------------------------
require(dplyr)
require(mvtnorm)

df %>%
  summarise(
    kendall = cor(pts, opp_pts, method = 'kendall'),
    pearson = cor(pts, opp_pts),
    spearman = cor(pts, opp_pts, method = 'spearman')
  )

#rank based correlation matrix
cor_mat <- cor(df$pts, df$opp_pts, method = "spearman")
cor_mat <- matrix(c(1, cor_mat, cor_mat, 1), 2, 2)
# cor_mat

#gaussian multivariate distribution
set.seed(1)
ab <- rmvnorm(mean = c(0, 0), sigma = cor_mat, n = 10000)

#uniform distribution
u <- pnorm(ab)

# round this to index quantile regression model outputs
u <- round(u, 2)
u <- u * 100
summary(u)
hist(u[, 1])
hist(u[, 2])

#keep the indices inside the fitted 1st-99th quantile range
u <- ifelse(u == 0, 1, u)
u <- ifelse(u == 100, 99, u)
summary(u)
u <- as.data.frame(u)
rm(ab, cor_mat)

Finally we write a little function that uses our joint uniform distribution to estimate the points totals and spreads of any game conditional on team-specific characteristics. The first thing we notice is that the points total estimates for the Atlanta vs. Chicago game exhibit fat tails with right skew, just as we would have expected based on the season points totals.

calcTotals <- function(team, opp, game_date, x, u, rq_mdl){
  #todo error checking
  #extract team data
  team <- x[x$team == team & x$date == game_date, ]
  opp <- x[x$team == opp & x$date == game_date, ]
  #predict matchup conditional quantiles
  team_pred <- predict(rq_mdl, team)
  opp_pred <- predict(rq_mdl, opp)
  #extract matchup estimates
  team_score <- team_pred[u[, 1]]
  opp_score <- opp_pred[u[, 2]]
  #combine into df
  df_out <- data.frame(team_score, opp_score)
  df_out$spread <- df_out$team_score - df_out$opp_score
  df_out$total <- df_out$team_score + df_out$opp_score
  df_out
}

# Test Matchup ------------------------------------------------------------
matchup <- calcTotals("ATL", "CHI", as.Date("2014-02-25"), df, u, rq_mdl)

#45% chance of a win
nrow(matchup[matchup$spread > 1, ]) / nrow(matchup)

#6% chance of a push.. more or less
#the comparison operator was lost from the original listing; `<=` is assumed
nrow(matchup[abs(matchup$spread) <= 1, ]) / nrow(matchup)
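The last two lines of the listing turn simulated scores into probabilities. A small Python sketch of that final step is shown here on four invented score pairs; the function name `summarize_matchup` and the `push_width` parameter are hypothetical, and treating a margin within one point as a push follows the listing's "more or less" convention.

```python
def summarize_matchup(team_scores, opp_scores, push_width=1.0):
    """Turn paired simulated scores into win / push / loss rates and a mean total."""
    n = len(team_scores)
    spreads = [t - o for t, o in zip(team_scores, opp_scores)]
    totals = [t + o for t, o in zip(team_scores, opp_scores)]
    win = sum(s > push_width for s in spreads) / n
    push = sum(abs(s) <= push_width for s in spreads) / n
    return {
        "win": win,
        "push": push,
        "loss": 1.0 - win - push,
        "mean_total": sum(totals) / n,
    }

# Four invented simulations of one game: spreads are 2, -0.5, 2 and -7.
summary = summarize_matchup([100, 95, 101, 90], [98, 95.5, 99, 97])
```

With the 10,000 copula draws from the earlier step the same counting gives the 45% and 6% figures quoted in the listing's comments.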

The final question here is whether this method is an improvement over your typical linear model, and here the jury is out. Koenker didn't find that the model's performance was statistically better (or worse) than that of a linear model, but to my mind it does seem to be a more realistic, flexible and intuitive modeling approach.

#quantile regression nba points totals

FanDuel NBA Optimal Lineup Solver in R

by Robert Del Vicario

I was playing around with lpSolve in R today and built this Fanduel NBA solver for creating multiple lineups. It is surprisingly quick and will pump out 1,000 lineups in about 6 seconds on my laptop.

https://dl.dropboxusercontent.com/u/10684925/basketball_data.csv

require(lpSolve)
require(data.table)

setwd("C:/Users/blahblah/Desktop")

#setup data
df <- fread("basketball_data.csv")
#constraint matrix: one column per position, plus salary and fantasy points
mm <- cbind(model.matrix(as.formula("FP~Pos"), df)[,2:5], ifelse(df$Pos == "C", 1, 0), df$Salary, df$FP)
colnames(mm) <- c("pf", "pg", "sf", "sg", "c", "salary", 'fp')

#setup solver
mm <- t(mm)
obj <- df$FP
dir <- c('=', '=', '=', '=', '=', '<=', '<=')
x <- 20000 #upper bound on total fantasy points; starts high enough not to bind
vals <- c()

ptm <- proc.time()
for(i in 1:1000){
  rhs <- c(2, 2, 2, 2, 1, 60000, x)
  sol <- lp(direction = 'max', objective.in = obj, all.bin = T, const.rhs = rhs, const.dir = dir, const.mat = mm)
  vals <- c(vals, sol$objval)
  #force the next lineup to score strictly less than this one
  x <- sol$objval - 0.00001
}
proc.time() - ptm
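The trick that makes the loop produce a different lineup each time is the last constraint: after every solve, the cap on total fantasy points is lowered to just below the lineup that was just found. The Python sketch below shows the same idea on a four-player toy pool, with brute-force enumeration standing in for lpSolve; the points values and the helper name `top_lineups` are invented for illustration.

```python
from itertools import combinations

def top_lineups(points, size, n_lineups, eps=1e-5):
    """Find successive best lineups by capping the total just below the previous best."""
    cap = float("inf")
    found = []
    for _ in range(n_lineups):
        best, best_pts = None, float("-inf")
        # Brute force stands in for the LP solve: maximize points subject to pts <= cap.
        for combo in combinations(range(len(points)), size):
            pts = sum(points[i] for i in combo)
            if pts <= cap and pts > best_pts:
                best, best_pts = combo, pts
        if best is None:
            break  # no lineup left under the cap
        found.append((best, best_pts))
        cap = best_pts - eps  # the next lineup must score strictly less
    return found

# Invented projections for four players, choosing lineups of two.
lineups = top_lineups([10.0, 8.0, 7.0, 3.0], size=2, n_lineups=3)
```

One side effect of this design, shared with the R loop, is that lineups tied on projected points are skipped: once the cap drops below a total, every lineup with that exact total is excluded.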

Fanduel NBA Lineup Optimizer

by Robert Del Vicario

Following up on the DraftKings lineup optimizer I shared last week you'll find a FanDuel lineup optimizer at the link below.

https://dl.dropboxusercontent.com/u/10684925/FanDuel%20NBA%20Optimizer.xlsm

The first tab you should examine is the FD Data Sheet. Columns A through L take the CSV file FD outputs. Column M contains the player name, and columns N & O should contain your customized projections. I do two projections, one for if a player starts and one for if a player is not expected to start, hence the two columns.

Next is the FanDuel Lineup Optimizer tab. Here you have the option to choose the Starting or Bench projection, Lock Players and Exclude players (columns G through I). I wouldn't touch anything else on this tab except the Create Lineups button unless you feel like making adjustments and/or adding functionality.

To use the optimizer you will need to install OpenSolver from http://opensolver.org/, which will provide Excel with a linear solver that can handle more than 250 variables.

I would also note that the projections in the spreadsheet for today aren't real so don't use them.

Finally, the VBA code isn't well tested so if you find bugs let me know!

DraftKings Lineup Optimizer in Excel

by Robert Del Vicario

Since the NBA regular season is upon us I figured that I would share with folks my homebrew lineup optimizer for those who don't know how to program. The optimizer resides in Excel and relies on OpenSolver. To use the optimizer you'll need to download OpenSolver from opensolver.org and install it as an Excel add-in.

From there the workbook has three tabs. The Data Sheet tab is where you enter the players for that day with projected points. The Lineup Optimizer tab is where you determine which players you want to lock and how many lineups you want to create. The Team Outputs tab is where the optimized teams are output.

It is pretty simple in terms of functionality and currently only setup for DraftKings, but easily extensible if you know a little VBA.

I hope folks have fun with it or learn something new!

Disclaimer: If I blow up your computer with my hastily written VBA code it isn't my fault...

https://dl.dropboxusercontent.com/u/10684925/Lineup%20Optimizer.xlsm

Basketball Research Data Set

by Robert Del Vicario

I was doing some work today on clustering player types, a player representation that I hope to be more informative than position, and realized that I didn’t have a very good process for creating reference research data sets for my own internal use. Ideally my modeling process should be compartmentalized between scraping, data manipulation, data analysis, model build, points projections, and team optimization. Unfortunately it is a bit of a hodgepodge as all of those pieces are a work in progress. This means that any time I need to transform data for a specific analysis or project I start from scratch, which isn’t a fantastic use of my time.

Well that stops today! I wrote a couple scripts allowing me to decouple all of the different parts of the process. Additionally, because basketball research data is so hard to come by I figured that I might as well share what I’ve scraped with folks.

For players, moving averages start at the season start and are split by whether the player starts or not. For example, a player who Starts, Doesn't Start, Starts, Doesn't Start and scores 4, 5, 4, and 4 points would have two moving averages. After the fourth game they would be averaging 4 points while starting and 4.5 points while not starting. A player's moving average runs through the end of the season. I have strong suspicions that a player's average for any given statistic doesn't change much over the course of a season, thus the long running moving average.
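The split averages in the example can be reproduced with a short Python sketch. It keeps one running sum and count per role (starting or not) and reports both averages after every game. The helper name `expanding_averages` is hypothetical, and including the current game in the average is an assumption, since the post doesn't say whether the data set's averages lag by a game.

```python
def expanding_averages(games):
    """Season-to-date scoring averages, kept separately for starts and non-starts.

    games -- list of (started, points) tuples in chronological order
    Returns one (start_avg, bench_avg) tuple per game; None until a role has a game.
    """
    sums = {True: 0.0, False: 0.0}
    counts = {True: 0, False: 0}

    def average(role):
        return sums[role] / counts[role] if counts[role] else None

    history = []
    for started, points in games:
        sums[started] += points
        counts[started] += 1
        history.append((average(True), average(False)))
    return history

# The example from the text: Start 4, No start 5, Start 4, No start 4.
history = expanding_averages([(True, 4), (False, 5), (True, 4), (False, 4)])
```

After the fourth game the last entry of `history` holds 4.0 for starts and 4.5 for non-starts, matching the figures in the paragraph above.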

For teams moving averages start at the beginning of the season and run through the end of the season. This is simply because I'm too lazy to find some "optimal" moving average number, and if a team tanks for a long period it'll start dragging down their average anyhow.

The data is fairly large ~210K rows, and 100+ variables. I'd start with the variable descriptions. Additionally, I take no responsibility for anything that I happen to have miscalculated.

Variable Descriptions

Data

by Robert Del Vicario

I’ve decided to open source my DFS scripts and tools via github with the hope that I can find some contributors to work on them with me. This will be an R based endeavor with the front end developed in Shiny. I’m also open to learning JavaScript to develop the front end in OpenCPU. Initially the project will be aimed towards developing NBA DFS tools, but could easily expand to other sports.

As I see it the tools will have five major components:

Web scrapers to support the collection of fantasy sports statistics and information. Several have already been developed, but could use exception handling / streamlining.

A few light databases to support DFS player analysis and visualization

Player projections to support DFS lineup decision making and optimization

Player analysis tools to support decision making at the player level

Real-time collection of roster information through automated searching of Twitter feeds

You can view what I’ve put together so far here: https://quantifan.shinyapps.io/Quantifan_Lineup_Optimizer

Code can be found on github: https://github.com/prescient/quantifan_dfs_optimizer

Feel free to pull, fork, contribute, whatever… I’d be thrilled if I had collaborators to help with this project and this forum seems like a good place to find them!
