New NAF Glicko Rankings - Updated 21-Jan-20

CyberedElf · Post by **CyberedElf** » Fri Mar 30, 2018 11:59 pm

mubo wrote: It's not, but it's an approximation, and crucially a better one than an established player starting at 350. It only makes a major difference in cases where coaches have one race they *always* take, then take a second and get an artificially low phi. Again, if anyone wants to suggest an alternative approach, happy to listen. I might add a min so that a new phi is never below 100, for example.

I suggest 350. Mu is the approximation of player skill and phi is the confidence in the value of mu. We agree that a new mu for a coach starting a new race can be some function of the previously established mu(s). I think it is inaccurate to assume that we have ANY confidence in this new, untested value. Why is the maximum of the current phi(s) "crucially a better one?" What are you basing this assumption on?

PM sent

mubo · Post by **mubo** » Sat Mar 31, 2018 7:51 am

CyberedElf wrote: I suggest 350. Mu is the approximation of player skill and phi is the confidence in the value of mu. We agree that a new mu for a coach starting a new race can be some function of the previously established mu(s). I think it is inaccurate to assume that we have ANY confidence in this new, untested value. Why is the maximum of the current phi(s) "crucially a better one?" What are you basing this assumption on?
PM sent

Fair. I haven't tested different starting models based on their ability to predict game outcomes; it has largely been done by feel.

I'm glad you're engaging, but in this case the answer is because *I* think it's more reasonable. I think Straume's Khemri should start at 164 rather than 350, because we have some information about his ranking deviation gleaned from other races. I'm happy to hear alternative heuristics to max; I acknowledge it's far from perfect. I don't think 350 (totally naive player) is worth considering though.

In general, I have made several design decisions that could be argued over (length of ranking period, decay speed, system volatility and others). I welcome thoughts and am trying to be transparent; the next time I revisit the code I'll bear these in mind. Currently I'm working on a couple of more superficial elements.

PS If anyone is interested the code is at https://github.com/hardingnj/NAF

CyberedElf · Post by **CyberedElf** » Sun Apr 01, 2018 5:31 am

I am curious about the other assumptions. I would prefer decay to be much higher or much lower. I don't think people's skill decays much with time, so I don't see why the earned confidence would go down with time. At the other end, a high decay gives you a leaderboard, which can be useful in its own right. The explanation of those assumptions is why I eagerly await the article.

There is an inherent flaw in any starting deviations. Average is not average. I understand and find this loss acceptable compared to the benefit of a better starting mu. Elo, and I think Glicko, are normally zero-sum, so the average of all is always the average, when everyone starts in the middle.

At first I thought anything other than 350 initial phi was an absolute error, but I started considering the extremes. As mentioned before, if a coach plays one team to a high confidence, EVERY new team will start with a high confidence. To me, that is a problem. If a coach has played 25 teams to similar mus, all with high confidence, and the last team starts at phi=350, that is also a problem.

First:
rp = number of previous races played
max = the current phi value of the least confident of the previous races played
tr = total other races (currently 26-1=25, -1 for the team being created)
New phi = ((rp*max)+((tr-rp)*350))/tr
It is the average of all other races (with unplayed races being 350) and assuming all played races have the least confident phi determined so far.

Second:
new phi = the average phi of all other races, with unplayed races counted as 350, then raised to the current highest phi if the average falls below it.

In both ideas, the more races a coach has played the more confident we are that the new mu is reliable, but never more reliable than a previous mu.
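As a rough sketch, the two ideas above could be written like this in Python. The function names are made up for illustration and are not from the NAF code; the 26-race count and the 350 default come from the post. The second function reads "capped" as a floor, so the new phi is never lower (more confident) than the highest phi already on record, which is the behaviour described in the sentence above.

```python
DEFAULT_PHI = 350.0  # phi assumed for any race the coach has never played
TOTAL_RACES = 26     # race count quoted in the post


def new_phi_first(played_phis, total_races=TOTAL_RACES, default_phi=DEFAULT_PHI):
    """First idea: every played race counts as the least confident (largest)
    played phi, every unplayed race counts as the default, then average."""
    tr = total_races - 1        # other races, excluding the one being created
    rp = len(played_phis)       # number of previous races played
    if rp == 0:
        return default_phi      # nothing to borrow confidence from
    worst = max(played_phis)    # the post calls this value "max"
    return (rp * worst + (tr - rp) * default_phi) / tr


def new_phi_second(played_phis, total_races=TOTAL_RACES, default_phi=DEFAULT_PHI):
    """Second idea: plain average over all other races (unplayed = default),
    never allowed to drop below the largest phi already on record."""
    tr = total_races - 1
    rp = len(played_phis)
    if rp == 0:
        return default_phi
    average = (sum(played_phis) + (tr - rp) * default_phi) / tr
    return max(average, max(played_phis))


print(new_phi_first([100.0]))                   # 340.0: one race played, barely below 350
print(new_phi_first([100.0] * 25))              # 100.0: every other race played
print(new_phi_second([50.0] * 24 + [300.0]))    # 300.0: average is 60, floor applies
```

With a single previous race both versions stay close to 350, and they only approach the "max" value once most races have been played.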

I wrote a few other ideas, but then realized they were inherently flawed. They only worked reasonably for coaches that performed above average, not for coaches that performed below average.

Let's continue to use Straume as the example. If he takes Khemri to the next tournament, it would start with a mu of 1662.36 and a phi of 164.7, giving him a rating of 1250.61. This is higher than his current Orc rating of 1214.80. Back to Glicko theory, phi is a measure of confidence and mu-2.5*phi is the lower bound of the confidence interval we have for the estimation of the coach's true skill. That is why that value is used for the rating. Glicko predicts that the player is at least that skilled. I think it is unreasonable to say that a newly initialized coach/race would have a higher lower bound of skill than a previous race. Straume's new Khemri rating should not start better than his Orcs no matter how good he is with Dark Elves. Maybe Orcs are Straume's worst team, but maybe Khemri should be. I think this can be best fixed by adjusting the initialization of phi.
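The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function names are invented here, and the only inputs are the figures quoted in the post (mu 1662.36, phi 164.7, Orc rating 1214.80) and the mu-2.5*phi rating formula. The second function works out how large the starting phi would have to be for the new rating not to exceed an existing one.

```python
def glicko_rating(mu, phi, k=2.5):
    """Displayed rating: lower bound of the interval, mu - k * phi."""
    return mu - k * phi


def min_phi_for_ceiling(mu, ceiling_rating, k=2.5):
    """Smallest phi for which glicko_rating(mu, phi) does not exceed ceiling_rating."""
    return (mu - ceiling_rating) / k


khemri_start = glicko_rating(1662.36, 164.7)
orc_rating = 1214.80

print(round(khemri_start, 2))                               # 1250.61
print(khemri_start > orc_rating)                            # True: the new race starts above the Orcs
print(round(min_phi_for_ceiling(1662.36, orc_rating), 2))   # 179.02
```

So, with the starting mu left as it is, any starting phi of roughly 179 or more would keep the new Khemri rating at or below the Orc rating.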

mubo wrote:350 (totally naive player)

Maybe I don't understand you or maybe this is where our disconnect is. Instead of me trying to guess what you mean and continuing to argue off on a tangent, would you please explain the quote? What does phi tell us about the player?

Straume: if you would prefer not to be used as an example please let me know and I will edit.

mubo · Post by **mubo** » Tue Apr 03, 2018 10:01 am

Hi-
I'd rather we put any technical discussion over on the github as an issue.
I think this may be making the thread too hard to parse/skim for others.

Purplegoo · Post by **Purplegoo** » Fri Apr 06, 2018 9:15 pm

Hi all,

The April Glicko ranking update is live.

Along with the update to your numbers, we have some terrific new features for you this month.

- Dan / Wulfyn has written a lovely blog explaining the Glicko ranking system in more depth. If you have questions about mu, phi or the background of this system, read on!

- Nick / mubo has created a terrific rankings calculator where you can insert your Glicko ranking and see how it would change if / when you beat JimJimany's Wood Elves, or when you play whatever local coach you know is attending your next tournament. This should help you chart your progress in real time, or even before time!

- The rankings themselves now feature a measure of how active rankings have changed compared to last month and where you sit overall for each race for which you have an active ranking. This should be pretty self-explanatory on the ranking page.

As ever, thanks to Nick and Dan for all of their hard work. All feedback appreciated, and enjoy!

Phil.

CyberedElf · Post by **CyberedElf** » Sat Apr 07, 2018 1:37 am

I appreciated the blog article.

Wulfyn wrote:The φ score is given the starting default value, as we have no confidence in what their true ability with that team is. This also means that the ranking will be quite low as the coach has to play enough games to first get their φ with that team to under 100, which is still a 250pt deduction to their ranking.

For a new race, Wulfyn, in the blog, says the initial phi = 350; mubo says the initial phi = maximum of all previous phi. Looking at the code, I think it was done as mubo described. I, personally, think Wulfyn's idea is correct, even if not implemented.

I was hoping for the reasoning of the decay function.

Wulfyn · Post by **Wulfyn** » Mon Apr 09, 2018 2:06 pm

You are totally right, a mistake on my part to write that. It should as mubo said be the max of existing races. Thanks for the spot, I will update the article to clarify!

From a 3rd person perspective I agree with both you and mubo in this approach but for different reasons. Setting it to 350 I think would be technically sound if all teams were independent. Setting it to the max I think would be technically sound if all teams were homogeneous. I don't think either of these is quite true; on the whole I think being good at one team means you will be good at others, but we all know that many people have a blind spot. I prefer control teams, for example, and playing Norse gives me a headache. Also having invested so many games and gained so much experience in Lizards makes me a bit worse when playing other teams as I can fall into Lizard tactics that just don't work elsewhere.

So for me it would be somewhere between the two, but I have no idea where. I suspect max of existing is closer to the truth than resetting to 350. Also I think if we accept that we should set it to 350 then we should not change the mu, as a phi of 350 is equivalent to saying "we just have no idea", and in that case I don't think it is logical to say "we just have no idea, but let's make it 1650 instead of 1500". Also as mubo points out you can get into some weird situations depending on the number of teams played and the number of games played with them as to where that phi ends up. If your max phi team is Khemri on 200, it feels like whether you play Khemri then a new team (say Orcs) in the next 2 tournaments or Orcs then Khemri should not make much of a difference. But if you play Khemri first your phi could reduce to 150 which is the starting point for Orcs, which would then be even lower. Play Orcs first and they will start at 200 and reduce, but not as far as the first scenario. Overall I doubt this will have many real world incidents that are meaningful (any substantial play with a single race will quickly drown out this starting point), but the technical part of me thinks there might be something better. I'm just not sure what.
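The order dependence described above can be shown with a toy model in Python. This is not the real Glicko update: it assumes, purely for illustration, that each tournament multiplies the phi of the race played by 0.75 (which turns the 200 in the example into 150), and that a new race starts at the max of the existing phis. The names `SHRINK` and `play` are hypothetical.

```python
SHRINK = 0.75  # assumed per-tournament reduction in phi; a stand-in for the real update


def play(phis, race):
    """Return a copy of the coach's phis after one tournament with `race`.
    A race not seen before starts at the max of the existing phis."""
    phis = dict(phis)
    if race not in phis:
        phis[race] = max(phis.values())  # max-of-existing initialization
    phis[race] *= SHRINK
    return phis


start = {"Khemri": 200.0}

khemri_first = play(play(start, "Khemri"), "Orcs")
orcs_first = play(play(start, "Orcs"), "Khemri")

print(khemri_first)  # {'Khemri': 150.0, 'Orcs': 112.5}
print(orcs_first)    # {'Khemri': 150.0, 'Orcs': 150.0}
```

Same two tournaments, same results for Khemri, but the Orcs end up at 112.5 in one ordering and 150 in the other.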

On your other point regarding a combined ranking this is something that I am looking to do for an unofficial tacklezone single-player ranking (I've been quite keen on this for some time). There are a number of approaches I think. When mubo and I first started talking about this last summer I was still using an Elo system (oh fool me!). I created a combined ranking based on the weighted racial stats. As mubo said this required 2 run throughs of the data (and there is a discussion to say whether results from 2006 should be part of this weighting given the meta, and even the ruleset, has moved on). But once created it is possible to have either an overall race weighting or a race vs race weighting that can be applied to each game played that would allow us to process the data on a per-player rather than per-player-team basis.

But there were a couple of problems. The main theme of Glicko is that we need to see the distribution, not just the peak of the curve. A racial weighting would be looking at just this peak. Maybe a better approach is to take it on a decile basis to see where among the distribution of all players of a certain team you as an individual lie. That way we can compare the weightings not from the midpoint but from each decile? I mean the ideal is continuous, but continuous maths is a ball ache, and I am not sure I have the energy to try and derive functions for the curves. The thinking is this: take Wood Elves vs. Lizards. I think that for the entire BB community this is a match up where Lizards get shafted, so the win rate will be heavily in favour of Wood Elves. But I think at high level play a competent Lizard coach is not far behind the Wood Elves. There are certain tricks and traps you can use that overcome the natural "leap in and kill a skink" plan the Wardancers have.

Also maybe stunties should be excluded? On the first couple run throughs (again using Elo) I was amazed to find an Ogre player consistently in the top 10. This was because the overall racial match ups were so bad all it took was a couple good results to shoot up the rankings. Glicko will help fix that, but I think it does require some scenario testing.

And then there is the worst one of all - tournaments are not equal. How can I trust Necromantic vs anyone racial match-up stats when these guys consistently get bumped from around 10th best choice to possibly the best choice when the tier package is kind to tier 2? Recent Eurobowls and NAFC, for example, see Necromantic as a popular choice, as they get a package that is near perfect for what they want. I don't think anyone has the time to go back through every tournament over the last 14 years, research the tiering, and try to apply weighting (aka guesses) to the weighting.

But I am going to try anyway, and you're welcome to join my charge into the cannonade if you want to?

edit: re decay function, I will try to get something added this week to explain that. If there's anything else you'd like more detail on, please let me know, as it may be worth adding to the main article.

CyberedElf · Post by **CyberedElf** » Tue Apr 10, 2018 1:34 am

To me, there is a difference between "We just have no idea" and "We have no confidence in an idea." I think phi = 350 is the latter and distinct from the former.

I tried to fork the GitHub repo, but it uses Snakemake, which is not supported on Windows, and I don't feel like installing a VM or building a Linux box.

Wulfyn · Post by **Wulfyn** » Tue Apr 10, 2018 9:08 am

CyberedElf wrote:To me, there is a difference between "We just have no idea" and "We have no confidence in an idea." I think phi = 350 is the latter and distinct from the former.

Although I was using colloquial language which may confuse (sorry!), there really is no difference in these statements.

Confidence (phi) can be measured, and so is on a sliding scale between 0 (absolute certainty, or 100% confidence) and infinity (absolutely no certainty, or 0% confidence). For practical reasons having infinity as a starting phi is not a good idea, so we define our "absolutely no certainty" as a phi of 350. So a phi of 350 is "we just have no idea". Phi 350 is 0% confidence. They are the same thing.

If we are saying that we have absolutely no idea (i.e. a phi of 350) then we have no justification for changing the mu away from the default level. It is a contradiction to say "I have absolutely no idea, but x is more likely than y". This is because to say that x is more likely than y you must have some level of confidence that this statement is true, and that necessarily means having a confidence that is > 0% (i.e. a phi < 350).

Truth be told we probably have the evidence to measure this, by looking at the correlation of win rate for different teams of the same player (weighted by team, minimum number of games played). But, unless you believe that there is zero correlation, there is no justification for a phi of 350 (and if there is zero correlation we should start the mu at 1500 again). Any correlation, no matter how slight, gives us information. And as phi is a measurement of the information, any information we have will reduce the phi below 350.
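A minimal sketch of that measurement in Python, under several assumptions: the record layout (coach, then race, then a wins/games pair) is invented here and is not the NAF data format, draws are ignored, the per-team weighting mentioned above is left out, and the numbers in the example are made up purely to exercise the code.

```python
from math import sqrt


def win_rate_pairs(records, race_a, race_b, min_games=10):
    """Collect (win rate with race_a, win rate with race_b) for every coach
    who has at least `min_games` games with both races."""
    pairs = []
    for by_race in records.values():
        if race_a in by_race and race_b in by_race:
            wins_a, games_a = by_race[race_a]
            wins_b, games_b = by_race[race_b]
            if games_a >= min_games and games_b >= min_games:
                pairs.append((wins_a / games_a, wins_b / games_b))
    return pairs


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two coaches")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("no variation in one of the win rates")
    return sxy / sqrt(sxx * syy)


# Made-up records, only to show the shape of the input.
toy_records = {
    "coach_a": {"Orc": (5, 10), "Khemri": (4, 10)},
    "coach_b": {"Orc": (7, 10), "Khemri": (6, 10)},
    "coach_c": {"Orc": (9, 10), "Khemri": (8, 10)},
    "coach_d": {"Orc": (1, 2), "Khemri": (2, 2)},  # dropped: too few games
}

pairs = win_rate_pairs(toy_records, "Orc", "Khemri")
r = pearson(pairs)
```

A correlation near zero on the real data would support resetting to 350 (and to a mu of 1500); anything clearly above zero would support a lower starting phi.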

Hope that makes sense.

CyberedElf · Post by **CyberedElf** » Tue Apr 10, 2018 4:42 pm

Wulfyn wrote:Any correlation, no matter how slight, gives us information. And as phi is a measurement of the information, any information we have will reduce the phi below 350.

Phi is one measurement of the information. I think mu is the dominant measure of the information gained. I agree, phi can be reasonably initialized other than 350, but I think using the max of previous races is a problem for multiple reasons previously stated. We agree on the newly initialized mu calculation; I just believe that this accounts for the majority (not all) of the information we have.

I think the function for a new phi should include how many previous races were used in the calculation. The more races a coach has played, the more reasonable it is to believe that we have a higher confidence in an average mu. The "max" method behaves in an opposite manner. In that method, the more races a coach has played, the more likely their max phi is to be a higher value. Since the new phi should never be lower than the max method, I think that method actually approaches the best approximation as the number of previous races increases, but is at its most inaccurate with fewer previous races. Since the method is least accurate after only one race and the phi of the second race is determined solely by the first race, the method starts at its least accurate and continues to add incremental error for each new race.

Wulfyn wrote:Although I was using colloquial language which may confuse (sorry!), there really is no difference in these statements.

My language was also colloquial, my apology also. You are correct, due to my colloquial use of language the statements are the same, so I will rephrase.
"To me, there is a difference between "We have no idea" and "We have almost no confidence in an idea." I think phi = 350 is the latter and distinct from the former." I believe that distinction is useful and important.

<Disclaimer>
After writing some of this post, I realized it can easily be perceived as harsh and aggressive, which is not my intent. While I may disagree with some of their statements and have different beliefs, I have a lot of respect for Wulfyn and mubo and greatly appreciate the work they have done. Please accept my apology, if I offend. I did remove almost all antagonistic statements, but the disclaimer might still be needed for what follows.
</Disclaimer>
Wulfyn, you basically said, "We defined phi=350 as 0% confidence, so my statement is correct." You used that definition to justify the rest of your argument. That is not how statistics work. Unless you do set phi to infinity, you ARE setting a minimum confidence greater than 0%. Statistically, that means you DO have an idea. There is a substantive difference between 1750/350 and 1500/350. For example, there is still only a 54% chance that they could represent the same true value with sample sizes as small as n=2. (n=10 is only 12% chance.) 0% confidence means there is 100% chance that they could represent the same value.

mubo · Post by **mubo** » Wed Apr 11, 2018 8:59 am

CyberedElf wrote: I think the function for a new phi should include how many previous races were used in the calculation. The more races a coach has played, the more reasonable it is to believe that we have a higher confidence in an average mu. The "max" method behaves in an opposite manner. In that method, the more races a coach has played, the more likely their max phi is to be a higher value. Since the new phi should never be lower than the max method, I think that method actually approaches the best approximation as the number of previous races increases, but is at its most inaccurate with fewer previous races. Since the method is least accurate after only one race and the phi of the second race is determined solely by the first race, the method starts at its least accurate and continues to add incremental error for each new race.

For clarity, I agree with this (as I think I acknowledged when you first raised this). As you've noted elsewhere, however, it's not a simple or obvious fix. I'm not going to engage in any technical discussion here, as I think it's beyond the remit of this announcement thread, which has already ballooned. Happy to continue on GitHub or elsewhere.

CyberedElf wrote: <Disclaimer>

It's cool. Honestly, it's not the tone that frustrates me, it's the verbosity.

Pipey · Post by **Pipey** » Wed Apr 11, 2018 9:41 am

Separately from this technical discussion which is largely over my head

Is there value in giving the two variables of phi and mu different names to make them easier to understand, less abstract? Something like "aggregate" (mu) and "confidence" (phi)?

mubo · Post by **mubo** » Wed Apr 11, 2018 11:21 am

Pipey wrote:Separately from this technical discussion which is largely over my head

Is there value in giving the two variables of phi and mu different names to make them easier to understand, less abstract? Something like "aggregate" (mu) and "confidence" (phi)?

Very possibly. I think phi is often termed Rating deviation (RD), and mu is rating. Then we could think of something else to call the final thing you are ranked on (rating score?). Interested to hear if there is more feedback on this.

lunchmoney · Post by **lunchmoney** » Wed Apr 11, 2018 11:28 am

Every time I see Mu all I can think of is

Yeah, changing Mu and Phi to catchier words is a good idea.

dode74 · Post by **dode74** » Wed Apr 11, 2018 2:25 pm

mubo wrote:Very possibly. I think phi is often termed Rating deviation (RD), and mu is rating. Then we could think of something else to call the final thing you are ranked on (rating score?). Interested to hear if there is more feedback on this.

I think Trueskill uses Peak (mu) and Spread (sigma). Initial ranking is based on mu - 3*sigma and is called, funnily enough, Trueskill rating. Peak, Spread and Glicko Score might work.
