NTBB: Stats

VoodooMike · Post by **VoodooMike** » Wed Feb 06, 2013 10:57 pm

spubbbba wrote:In R, B and MM then it’s very rare to play matches with big TV differences. So surely those would have to come from league play. By your own standards do we have enough games to show the advantages a higher TV gives, and whether this is mitigated by inducements and if so by how much?

It comes from MM data, since that's the only data that exists in sufficient quantity to perform any serious analysis on. Even in the limited range of TV difference we see a reasonably strong relationship between TV difference and underdog loss... there's neither a data-based nor logic-based reason to assume that as TV difference increases beyond the scope of the existing data, it will suddenly reverse that trend... in fact, even the so-called design goal of inducements doesn't support such an idea.

"By my own standards" yes, we have enough data to support the idea that TV difference progressively favours the team with the higher TV, in any format. That doesn't mean that the underdog always loses, it just means that across all games played, we can expect to see the TV underdog losing more often, and losing more the farther below the other team's TV it is.

Would it be great to have a metric shit-ton of additional data from play at higher TV differences? Absolutely. You'll never get it, though, and we know that... so worrying too much about it is like sitting around jerking off to the thought of supermodels. I prefer to remain grounded in reality.

spubbbba wrote:Also I’m not sure how you’d factor in the differences in the metagame between TV based matching and league play. The crp rules were written for league play and NTBBL is primarily concerned with this too.
The trouble with matching by TV and open play is that each game is effectively a 1 off. If you play a 2000 vs 2000 game and win but lose a bunch of players so end up 1500 TV your next game won’t be against another 2000 team but someone roughly equal to you. That is why many argue that min-maxing isn’t an issue in leagues.

Ok, first off, due to the lack of extensive data on league play, and the additional sources of error involved that mean we'd need comparably more league data to get the same statistical power, we have no actual data on which to compare League and MM play - all of that is based on feelings and suspicions. I fully understand the theory behind what you're saying, but you need to understand that it remains theory until it is supported by something more than anecdotal evidence.

Maybe you can say "well, everybody thinks this".. and then I can point out that "everybody" thought the earth was flat, and everybody thought the sun revolved around the earth. Objective reality is not a democracy.

Another important point is that the difference between environments has to be so big that we'd expect to see a significantly different result in Leagues than in MM. You (and galak and dode) say that BB was designed for leagues, not MM play, and yet MM play shows win%s reasonably close to the tier expectations. Are we seriously saying that we expect that if we actually had enough league data for similar levels of analysis, we'd really be seeing a large enough difference in league win%s to say that MM data is apples to its oranges? It's very, very unlikely, given what we've seen of the data we actually DO have access to.

spubbbba wrote:It’s why I don’t think tabletop tournaments are much use for looking at low TV balance, not only do they reset every game but also have lots of house rules. Things like being able to assign skills really makes a huge difference to some races, Lizardmen instantly spring to mind.

House rules really do a lot to hurt the validity of data, yes. Data from a different ruleset is garbage data if we're trying to examine the specific ruleset and rosters of the game as written. However, the assigning of skills thing is more theory than fact, since we have no data on it. I think it sounds reasonable, but likewise, the idea that the earth is flat is pretty reasonable too, from my personal experience.. luckily we have data that lets us know it isn't.

spubbbba wrote:Another potential issue is that the open leagues have imbalances of teams just like leagues do. If you look at B it is dominated by bashers, so how would that be factored in?

It's not a similar issue at all. Compositional issues in leagues are much more focused on the fact that certain rosters will not be represented at all, and when there's only one or two teams, the coach's own ability is a much bigger factor in the resulting data. In open leagues you have a large number of teams of every roster, and in MM you can't pick and choose which rosters you play against. The random matching results in less error.

There are statistical methods to compensate for the different number of teams of certain rosters. If you go back to older postings you'll see I've previously posted the win%s both in total, and controlling for the roster compositional imbalances. If you had a large enough number of leagues, the same controls could be applied there, but like I said, it's the OTHER sources of error that we can't control for, that will make the league data less statistically powerful.
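
One simple way to picture that kind of control is sketched below in Python. It is a minimal illustration only: the post does not say which method was actually used, so the equal-weight averaging here is an assumption, and the roster names and counts in `records` are hypothetical. Draws are ignored for simplicity.

```python
def raw_win_rate(records):
    """Win rate over all games, whatever the opponent mix."""
    wins = sum(w for w, _ in records.values())
    games = sum(g for _, g in records.values())
    return wins / games

def composition_adjusted_win_rate(records):
    """Average the per-opponent win rates, weighting each roster equally."""
    rates = [w / g for w, g in records.values()]
    return sum(rates) / len(rates)

# Hypothetical record for one roster: opponent -> (wins, games).
records = {"Orc": (60, 100), "Elf": (4, 10), "Dwarf": (3, 10)}
print(round(raw_win_rate(records), 3))
print(round(composition_adjusted_win_rate(records), 3))
```

The raw figure is dominated by whichever opponent happens to be most common, while the adjusted figure treats every opposing roster as equally represented.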

I am not saying that League data is inherently useless. What I am saying is that you'd need, say, 3000 League games to get the same statistical resolution as 1000 MM games.. and we actually have many multiples LESS League data than MM data... and even with the large amount of MM data we have, we find our statistical power gets pretty low as we try to break things down into certain TV ranges and racial combinations. Now, 3000 vs 1000 is just me picking arbitrary numbers (it's not actually 3:1... the ratio is unknown since we don't have enough League data to calculate it) but we do know that there are more sources of error in League data than MM data, and more sources of error always create that sort of scenario.
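
The arithmetic behind "more error means more games" can be sketched as follows. This is a rough Python illustration using the normal approximation; the per-game standard deviations are hypothetical, and the factor of three is just the arbitrary 3:1 figure from the paragraph above, not a measured value.

```python
import math

def games_needed(sd, margin, z=1.96):
    """Games required for a CI half-width of `margin` given per-game SD."""
    return math.ceil((z * sd / margin) ** 2)

# Hypothetical per-game SDs; the real league figure is unknown.
mm_sd = 0.5
league_sd = mm_sd * math.sqrt(3)   # assumption: three times the variance

mm_games = games_needed(mm_sd, 0.02)
league_games = games_needed(league_sd, 0.02)
print(mm_games, league_games, round(league_games / mm_games, 2))
```

The required sample size scales with the variance, so tripling the variance roughly triples the number of games needed for the same margin.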

Koadah wrote:There is probably a little bit more understanding now but even so, I don't think that things have moved on a whole lot from where I came in.

Maybe not, but I'm not trying to move those particular things. If I can get some people to understand better then the community as a whole is better for it. Plenty of people don't really care (including people involved in these arguments) but that's how most debates work... you're not actually expecting the other side to change their mind, you're presenting two sides of an argument so that people who care to listen have more on which to base their own resulting opinions, etc etc.

VoodooMike · Post by **VoodooMike** » Thu Feb 07, 2013 12:01 am

Plasmoid wrote:You honestly thought I was making a comment about school districts? OK...
Well I wasn't.

Sorry, I didn't think you were dicking around rather than being specific. Maybe NTBB itself should have been my hint.

Plasmoid wrote:Me: I'll add up My Data (aka the BBRC data) + FOL + MM.
You: You'll do it wrong, meaning you're on drugs. And if you do it right, it's like 1/1000th of the data and won't make a difference.
Me: I've done it right already. And it's between a 20th and 40th. Depending on the team. Roughly a 30th in total - very far from a 1000th.
You: You don't understand arithmetic.

If you have as much data as you claim then great - run the numbers with and without your data and see what the difference is. If you ever bother to learn how to calculate CIs (which is a prerequisite to being able to actually use the numbers to justify anything) then see if your data has changed any of the results in a statistically significant fashion. Then you'll understand the point.

Plasmoid wrote:The combined data is still there, 3rd post from the bottom of page 1, should you ever wish to check it out.

You mean the averages without confidence intervals? I'm not sure how many times this needs to be said, but those numbers are absolutely irrelevant on their own... in fact, they're misleading on their own. You really, really, really, really need to wrap your head around this fact. Stop talking until you have. I will reiterate:

60% +/- 2% is clearly outside of the 45-55% range.
60% +/- 10% is not.

You report "60%" and say "see?"... no, I don't see and neither do you! You say you don't know how to calculate CIs.. that's peachy. It means your numbers have no meaning even to you at the moment.
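
For anyone who wants to see where a "+/-" figure like that comes from, here is a minimal Python sketch using the normal approximation for a proportion. It assumes every game is simply a win or a non-win (draws are not handled), and the game counts are hypothetical values picked to give margins of roughly 2% and 10%.

```python
import math

def margin_of_error(win_rate, games, z=1.96):
    """Half-width of the normal-approximation CI for a proportion."""
    return z * math.sqrt(win_rate * (1.0 - win_rate) / games)

def clearly_outside_band(win_rate, games, low=0.45, high=0.55):
    """True only if the whole interval lies above or below the band."""
    moe = margin_of_error(win_rate, games)
    return (win_rate - moe) > high or (win_rate + moe) < low

# Hypothetical game counts chosen to give roughly +/-2% and +/-10%.
for games in (2305, 92):
    moe = margin_of_error(0.60, games)
    print(games, round(moe, 3), clearly_outside_band(0.60, games))
```

The same 60% comes out as clearly outside the 45-55% band with a couple of thousand games, and as indistinguishable from it with under a hundred.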

Plasmoid wrote:I'm saying that if I add data from an entirely different population than the one I'm trying to make a conclusion about - and I expect the 'booster' population to have a different mean, then I'll be adding a factor (a confound?) that would give me distorted data.

Actually, the reference was to your demands that the B/R/FOL data be cut down to only games played by teams in their first <X> games.. you're now mixing up topics and replies. Such data will remain data from the environment that you imagine is significantly different than the one you want to examine (granted, you have no data to back up that belief yet), it'll just be cut down and thus, have less ability to find differences and thus, give no support for the need for any aspect of NTBB. As I said before... go wild... it isn't me you'll be screwing in the process.

Plasmoid wrote:Gen. Plasmoid: OK. So how do we figure out the blast radius of the bomb on the plane ahead of time then.
President: Easy, we blow up all the different bombs we have of that size.
Gen. Plasmoid: Really? The same size? So it doesn't matter if it's TNT, C-4 or Nuclear?
President: Hell no, we'll just calculate the average and go from there.
Gen. Plasmoid: Sir, that sounds wrong. The bomb on the plane is TNT. Shouldn't we focus on TNT then?
President: General Plasmoid, you're an idiot.

Uh, actually that's exactly how we know the blast radius of various forms of explosives - we've detonated them many times in the past and thus can say with a certain level of confidence that the next detonation of a comparable amount will be approximately <x>. With TNT, C-4, or whatever, we know they're different because we have data from past experimentation to show that they're different. In this case, however, you have no data to support the idea that the results will be different. Once again you are taking your assumption and pushing it as fact.

So, the last line, at least, is accurate.

Plasmoid wrote:I think it is highly likely. Because the TV-system is flawed.

Is it? Is that a fact supported with data, or just more of what you think? I'm absolutely willing to entertain the idea but not strictly on your opinions and feelings.

Plasmoid wrote:I'd rather collect the data that I know to be relevant, rather than the data you speculate is similar.

Awesome - I fully support the idea of you collecting data. Also of you learning how to analyze data, since your results have no meaning until you do. Until you have actually collected and analyzed this so-called relevant data, we can reiterate that you have no data, which likewise means you have no objective support for anything you have done with NTBB. Typically one finds evidence of a problem before one fixes it, rather than fixing a supposed problem and THEN looking for evidence to support the idea that a fix was needed... but y'know, so long as there's some objective evidence somewhere in the mix, we can probably all be happy. I look forward to it!

What I have said is this: there's data on B/R/FOL, there's no significant amounts of data on anything else. As such, there's also no data to say that the data that exists will be totally different from the data that doesn't. Go use the data that exists to show that rosters need adjusting, and you'll have some form of numeric support for the idea... at the moment, you have no support at all for the idea that you've done anything other than pull things out of your ass at random. Having NO DATA is not helping you out... having data that lacks the power to show statistically significant differences is also not going to help you out. You keep arguing, in essence, in favour of throwing NTBB into the trash can.

So.. y'know... this entire discussion is you arguing in favour of screwing yourself and your work. Maybe that helps you understand why I seem incredulous?

Plasmoid wrote:In fact, since you're the one wanting to add in a second population that is 'long term' to describe something that is 'short term', why don't you show some significant and statistically reliable data that it's all the same?

When you get around to looking up how to calculate CIs, maybe you can look up the idea of demanding negative proof. It's not my problem to prove that two populations are NOT different. The null hypothesis is always that there ISN'T a significant difference.
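
As a rough illustration of how that null hypothesis gets tested, here is a minimal two-proportion z-test in Python. It is a sketch under simplifying assumptions (win or non-win only, normal approximation), and all of the counts are hypothetical.

```python
import math

def two_proportion_z(wins_a, games_a, wins_b, games_b):
    """z statistic for H0: both environments share one win rate."""
    p_a = wins_a / games_a
    p_b = wins_b / games_b
    pooled = (wins_a + wins_b) / (games_a + games_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    return (p_a - p_b) / se

def rejects_null(z, critical=1.96):
    """Two-sided test at roughly the 95% level."""
    return abs(z) > critical

# Hypothetical counts: same 5-point gap, very different sample sizes.
small = two_proportion_z(550, 1000, 50, 100)
large = two_proportion_z(5500, 10000, 5000, 10000)
print(round(small, 2), rejects_null(small))
print(round(large, 2), rejects_null(large))
```

With the small second sample the 5-point gap cannot be told apart from noise, so the null stands; with large samples on both sides the same gap is enough to reject it.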

Plasmoid wrote:Like Hitonagashi I'm not interested in one particular variable, I'm interested in the on-table total performance.

Tiers are a single thing - win%... the NT in NTBB is "Narrow" and "Tiers", though I may lobby to have it changed to "Not Trustworthy". You should be concerned with that single variable pretty heavily. As for sources of error, you obviously didn't understand what was written if this is your reply to that.

Plasmoid wrote:(Coincidentally, I think the impact of inducements can be seriously overstated. With a TV difference of 1-9 the impact of inducements is rather limited - and we'll have a lot of those in short term play.)

It's spiffy that you "think" that... now if only there were some way to deal with this idea in a mathematical fashion.. some way to determine if the effect is something we can be, say, 95% confident is a real effect rather than just caused by random fluctuations. Oh wait, there is... and we used it. Thanks for coming out.

Plasmoid wrote:Either way I most certainly will not use data for teams that have spent 30+ games to morph into a very different version of a low-TV team than what the team looked like when it reached the same TV the first time around.

You can use the mottle patterns on your dog's anus as data if you want.. it'll just result in no support for what you're after, and lots of room for people to shoot down any data you try to use to support your actions. Additionally, you have no idea how to analyze the data, and have declared you're not going to bother looking up how, so... I'm not sure why you say you'll be using or not using any particular data in the first place.

Plasmoid wrote:Even if the data for all 5 überteams were to put them within 45-55 for short term play, there would still be the CRP+10 list of house rules and the changes to the tier 2 and tier 3 teams.

Like I said before, if you want to change "NTBB" to "Stuff Plasmoid Thinks is Better" then it's pretty certain that everyone who has raised the statistical objections will withdraw them. It's because you're claiming, to the point of making it the title of your changes, that they're about dealing with tier imbalances, both in and outside of the tiers, that you've received this criticism. If the data does put everything in acceptable ranges, and all that is left is unrelated houserules, then NTBB will just be increasingly inaccurately named.

Plasmoid wrote:I'd have no problem sticking Plasmoids in the title. Why would I?
After all, these are house rules for anyone whose experience of the game matches mine.

It's not about adding your name, it's about removing the reference to narrowing the tiers, which suggests you're actually using data to make the game more accurate in terms of the game's stated design goals. You're not. Maybe you imagine you are, but you don't have data to support the idea, and don't know how to analyze data to find those effects, which means that the changes you've made are NOT based on data, they're pulled out of your ass... so allusions to being a data-driven set of changes are dishonest.

Plasmoid wrote:Right. You know the total number, and you know that all data is either W, T or L. And you know the mean.
What else do you require then? The specific win, draw and loss numbers? I've got those too.
Something else?

Let me present you with three arbitrary datasets:

Set A: 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
Set B: 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5 1 3 5
Set C: -4 3 10 -4 3 10 -4 3 10 -4 3 10 -4 3 10 -4 3 10

You know the total number of values in each set, and you know the mean.. which is the same for each. The datasets have quite different 95% confidence intervals, however, because they have very different standard deviations. Set A has an SD of ZERO, because there's no variation in the data. Set B has an SD of 1.68, which results in a 95% margin of error (actually, 96% since I'm just using the quick and dirty method) of 3.36... Set C has an SD of 5.88 and a 95% margin of error of 11.76. So, what we can, with 95% confidence, say about the population that each dataset is a sample from is:

Set A: the mean value for the population is exactly 3
Set B: the mean value for the population is somewhere between -0.36 and 6.36
Set C: the mean value for the population is somewhere between -8.76 and 14.76

Why are the ranges so big? Because the more variation you have in the data, the higher the standard deviation (essentially the typical distance datapoints fall from the mean) and the less confident you are of the proximity of the population's mean to the sample's mean. The more data you get, assuming the data continues to cluster around that mean, the narrower those ranges will become.

In each case you had the totals, and the means... but as you can see, without the raw data (or specific variance data calculated from the raw data) you don't actually have any information about the population, you just have information about the sample, and while that's peachy if all you care about is the sample, it means you can draw no conclusions about the population on those pieces of information alone.
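
The figures for the three sets can be checked with a few lines of Python. This is a minimal sketch: Set C is taken as -4, 3, 10 repeated six times (the values that reproduce the quoted mean of 3 and SD of 5.88), and the function and variable names are illustrative. It prints both the quick 2 x SD figure used above and the margin more commonly quoted for a population mean, 2 x SD / sqrt(n), which is narrower at n = 18 because it divides by the square root of the sample size.

```python
import math
import statistics

SETS = {
    "A": [3] * 18,
    "B": [1, 3, 5] * 6,
    "C": [-4, 3, 10] * 6,  # assumption: the values matching mean 3, SD 5.88
}

def summarize(values):
    """Return mean, sample SD, the 2*SD band, and 2*SE for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)        # sample SD (n - 1 denominator)
    two_sd = 2 * sd                      # the quick figure quoted in the post
    two_se = 2 * sd / math.sqrt(n)       # conventional margin for the mean
    return mean, sd, two_sd, two_se

for name, values in SETS.items():
    mean, sd, two_sd, two_se = summarize(values)
    print(name, mean, round(sd, 2), round(two_sd, 2), round(two_se, 2))
```

Either way the ordering is the same: identical means, but the more the values vary, the wider the range you have to allow around that mean.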

Darkson · Post by **Darkson** » Thu Feb 07, 2013 6:33 am

What about those races that are screwed by TV matching (especially in Cyanide), those that are meant to be able to take advantage of being the underdogs i.e. goblins and halflings. How can we be sure they're where they're meant to be without league data?

besters · Post by **besters** » Thu Feb 07, 2013 6:57 am

VoodooMike wrote:In each case you had the totals, and the means... but as you can see, without the raw data (or specific variance data calculated from the raw data) you don't actually have any information about the population, you just have information about the sample, and while that's peachy if all you care about is the sample, it means you can draw no conclusions about the population on those pieces of information alone.

Surely this statement is inaccurate, you can always draw conclusions, it is the chance of the conclusion being incorrect that increases, as long as you are aware of this you can proceed accordingly.

Let's face it, you can get erroneous conclusions from perfectly accurate data, but you can potentially move in the right direction from flawed data.

spubbbba · Post by **spubbbba** » Thu Feb 07, 2013 2:50 pm

VoodooMike wrote: It comes from MM data, since that's the only data that exists in sufficient quantity to perform any serious analysis on. Even in the limited range of TV difference we see a reasonably strong relationship between TV difference and underdog loss... there's neither a data-based nor logic-based reason to assume that as TV difference increases beyond the scope of the existing data, it will suddenly reverse that trend... in fact, even the so-called design goal of inducements doesn't support such an idea.

Some inducements are clearly not as good, for instance extra re-rolls, rookie mercs and wandering apoths. Having 2 rerolls and an apoth is better than 1 re-roll and a wandering apoth for instance. Others like stars, babes and wizards are harder to quantify.
From what Galak’s said, the goal is that the underdog should be at a disadvantage, otherwise why bother building a team. I’d be interested to see how much of a difference gaps in TV make at various ranges. In my experience inducements are less useful to newer teams. A 10% disadvantage seems a lot worse to a rookie team than one at 2000 TV. But it would be good to see if the data bore that out as well as how much of a difference in results the TV advantage made.

VoodooMike wrote: Ok, first off, due to the lack of extensive data on league play, and the additional sources of error involved that mean we'd need comparably more league data to get the same statistical power, we have no actual data on which to compare League and MM play - all of that is based on feelings and suspicions. I fully understand the theory behind what you're saying, but you need to understand that it remains theory until it is supported by something more than anecdotal evidence.

Another important point is that the difference between environments has to be so big that we'd expect to see a significantly different result in Leagues than in MM. You (and galak and dode) say that BB was designed for leagues, not MM play, and yet MM play shows win%s reasonably close to the tier expectations. Are we seriously saying that we expect that if we actually had enough league data for similar levels of analysis, we'd really be seeing a large enough difference in league win%s to say that MM data is apples to its oranges? It's very, very unlikely, given what we've seen of the data we actually DO have access to.

The lack of league games is an issue that is hard to get around, not only that but most have their own house rules which can complicate matters further.
I am still a bit wary of using TV matched open leagues as I question their competitiveness. In R/B/MM then every game is basically a friendly. You want to win and improve your team but the only penalty for losing is to your ranking.
But I guess you could use game theory to factor that in, I’m sure someone will have done a study of the differences between friendly, cup and league games for sports teams or games.

VoodooMike wrote: There are statistical methods to compensate for the different number of teams of certain rosters. If you go back to older postings you'll see I've previously posted the win%s both in total, and controlling for the roster compositional imbalances. If you had a large enough number of leagues, the same controls could be applied there, but like I said, it's the OTHER sources of error that we can't control for, that will make the league data less statistically powerful.

I am not saying that League data is inherently useless. What I am saying is that you'd need, say, 3000 League games to get the same statistical resolution as 1000 MM games.. and we actually have many multiples LESS League data than MM data... and even with the large amount of MM data we have, we find our statistical power gets pretty low as we try to break things down into certain TV ranges and racial combinations. Now, 3000 vs 1000 is just me picking arbitrary numbers (it's not actually 3:1... the ratio is unknown since we don't have enough League data to calculate it) but we do know that there are more sources of error in League data than MM data, and more sources of error always create that sort of scenario.

Using MM, what would be the minimum amount of games needed to get reasonably accurate stats on the 24 “official” teams? Would these need to be broken down still further to Tv ranges and racial match ups? Otherwise risk certain teams being too good in some scenarios and too weak in others.

VoodooMike · Post by **VoodooMike** » Thu Feb 07, 2013 7:53 pm

Darkson wrote:What about those races that are screwed by TV matching (especially in Cyanide), those that are meant to be able to take advantage of being the underdogs i.e. goblins and halflings. How can we be sure they're where they're meant to be without league data?

The point is, again, that we have no data to support the idea that goblins and halflings DO take advantage of being the underdog. When you're working with numbers you need to detach that from what you believe will be true - a concept unsupported by data is no better than any OTHER concept unsupported by data. The theories and design intentions, etc... tell us where to look, but they don't count as data themselves.

Do we know, based on data, that they ARE screwed by TV matching, or are we just assuming? Just how much of an underdog do they have to be? We do, after all, have plenty of data from MM and there isn't ZERO TV difference... there's varying amounts of TV difference, but the range isn't huge. Does this mean that you expect to see win% for goblins and halflings increase when they're the underdog, as TV difference goes up... because... you can still see if that's happening in MM using the games goblins have played, and examining the range of TV differences in those games.
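
A check along those lines could be sketched like this in Python. It is an illustration only: the input format, the bin width and the sample games are all hypothetical, and results are scored from the underdog's side as 1 for a win, 0.5 for a draw and 0 for a loss.

```python
from collections import defaultdict

def win_rate_by_tv_gap(games, bin_width=50):
    """Group underdog results by TV gap; return {bin_start: (n, win_rate)}.

    `games` is a list of (tv_gap, result) pairs, where result is 1 for a
    win, 0.5 for a draw and 0 for a loss from the underdog's side.
    """
    bins = defaultdict(list)
    for tv_gap, result in games:
        bins[(tv_gap // bin_width) * bin_width].append(result)
    return {start: (len(r), sum(r) / len(r)) for start, r in sorted(bins.items())}

# Hypothetical goblin games as the underdog: (TV gap, result).
sample = [(10, 0), (40, 1), (60, 0.5), (90, 0), (120, 1), (140, 0)]
print(win_rate_by_tv_gap(sample))
```

Each bin's win rate would still need its own confidence interval before anything could be read into a rise or fall across bins, which is why the game count is returned alongside it.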

The ultimate point is... when you have to choose between using data that may (or may not, since you need to compare it to another dataset before you can conclusively say) have some confounds, and a total LACK of data... you should go with the former. If you can find an effect with data that some people can voice confound concerns with, then you can simply state that should better data ever become available, you will revisit the results. Taking action based on NO DATA is irresponsible and foolish... the only thing NO DATA supports, is waiting for data before you do anything.

besters wrote:Surely this statement is inaccurate, you can always draw conclusions, it is the chance of the conclusion being incorrect that increases, as long as you are aware of this you can proceed accordingly.

It is not in any way inaccurate. You can draw conclusions from tea leaves and chicken entrails too, but they're horseshit conclusions based on nothing... aka, guessing. It ultimately doesn't matter if you guess right or wrong in the long run - it's unsupported gambling with the truth. In terms of statistical analysis, you cannot draw any conclusion whatsoever from the mean of a sample alone... period... for the reasons demonstrated and explained.

besters wrote:Let's face it, you can get erroneous conclusions from perfectly accurate data, but you can potentially move in the right direction from flawed data.

And sometimes people die in car crashes even if they wear their seatbelt... while sometimes people survive a car crash only because they weren't wearing them. Does that, to you, suggest that wearing or not wearing your seatbelt are of equal value in terms of safety? It sounds like you believe exactly that.

spubbbba wrote:Some inducements are clearly not as good, for instance extra re-rolls, rookie mercs and wandering apoths. Having 2 rerolls and an apoth is better than 1 re-roll and a wandering apoth for instance. Others like stars, babes and wizards are harder to quantify.

"Clearly" by what metric? When you want data driven conclusions you have to take a step back from any feelings and assumptions you might have and let the numbers tell the story, because preconceived notions are more likely to prevent you from being objective than they are to help you. Quantification is what numbers are all about, after all - the only time things are tough to quantify in terms of data is when you don't have enough data to work with.

spubbbba wrote:I am still a bit wary of using TV matched open leagues as I question their competitiveness. In R/B/MM then every game is basically a friendly.

Oh? I'd suggest the opposite. League games are friendlier because you have a lasting relationship with the other coaches which is why houserules are something that happen so often in leagues - people can actually agree on things on occasion. MM play is where strangers are playing one another, and they're not there to socialize... it's where we've seen problems with cheating and exploiting disconnections and bugs in the game. In League play people can just agree not to do certain things and it's done. You don't see any complaints of "power gaming" in leagues (ie, minmaxing in MM) nor a massive focus on teams that can get and stay developed. If you're looking for an environment where people are trying to eke out every bit of power and performance from their team, you probably want MM, not League. Of course, without League data, or even a clear metric for "competitiveness" we can argue our guesses until the sun goes out.

spubbbba wrote:Using MM, what would be the minimum amount of games needed to get reasonably accurate stats on the 24 “official” teams? Would these need to be broken down still further to Tv ranges and racial match ups? Otherwise risk certain teams being too good in some scenarios and too weak in others.

One can guess, but it's impossible to say... also, what do we define as "reasonably accurate"? If we're talking about League games... league composition will be a problem we can't really control for with anything but a larger dataset. Variation in data is what makes our result less "accurate" in terms of confidence interval ranges, as I hope I demonstrated in my previous posting with the 3 datasets. League games will tend to have a higher variation, and thus, take more data to result in similar accuracy.

To get a decent dataset for League play you'd need a large number of LEAGUES (not just games within them...) each of which was playing with pure CRP rules and rosters, no houserules. The assumption would be that with a sufficient number of leagues, the compositional differences would balance themselves out in the long run, and with enough games the TV differences would balance themselves out in the long run too. Having only a few leagues, but a large number of games, would amplify the composition problems, probably to the point of making the data difficult to trust. (note: if anyone doesn't understand what I mean by compositional differences and TV differences adding error, I can explain this in nauseating detail, I'm just going to leave it as is until someone specifically needs elaboration).

In short, since we don't know how much error these things will represent, we can't really estimate the number of games or leagues we'd need. Suffice to say that as the dataset increased, so too would the accuracy... typically. It's certainly possible to add data that makes the results less "accurate" (in terms of widening the CI, rather than narrowing it) if the added data lands far away from the data you were already using. The results would be more "accurate" in the sense that they'd be more reliable (you can't omit data just because it doesn't match up with existing data in terms of numbers!) but they would be of less use in finding effects.
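
The "more data can widen the CI" point can be seen in a small Python sketch. The numbers are hypothetical: the base sample is Set B from the earlier post, and the added values share its mean of 3 but sit far from it.

```python
import math
import statistics

def ci_half_width(values, z=1.96):
    """Half-width of the normal-approximation CI for the mean."""
    return z * statistics.stdev(values) / math.sqrt(len(values))

base = [1, 3, 5] * 6          # tightly clustered around 3
extra = [-9, 15] * 3          # hypothetical new data, same mean, far from it

print(round(ci_half_width(base), 2))
print(round(ci_half_width(base + extra), 2))
```

The sample grows from 18 to 24 values and the mean does not move, yet the interval gets wider because the added spread outweighs the gain from the larger n.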

Does that make sense to you?

spubbbba · Post by **spubbbba** » Fri Feb 08, 2013 10:42 am

VoodooMike wrote: "Clearly" by what metric? When you want data driven conclusions you have to take a step back from any feelings and assumptions you might have and let the numbers tell the story, because preconceived notions are more likely to prevent you from being objective than they are to help you. Quantification is what numbers are all about, after all - the only time things are tough to quantify in terms of data is when you don't have enough data to work with.

Clearly, because for those 3 inducements I listed we know how much they cost, that’s why I separated them. Extra Training, Wandering Apothecary and Rookie Mercenaries all perform the exact same function as their permanent counterparts but cost more, so are not as good. If you take 2 otherwise identical human teams but one has no rerolls and the other 2 then the team with 2 will be 100TV higher. If the underdog takes the extra training inducement then for that game they’ll be equal TV but the 2 re-roll team will have everything the underdog has but an extra re-roll on top.
That’s why I was saying the other inducements are harder to quantify (so you would need data on those) as you can’t have permanent babes or wizards, whilst stars come with loner and have skills you either can’t get or may not need.

VoodooMike wrote: Oh? I'd suggest the opposite. League games are friendlier because you have a lasting relationship with the other coaches which is why houserules are something that happen so often in leagues - people can actually agree on things on occasion. MM play is where strangers are playing one another, and they're not there to socialize... it's where we've seen problems with cheating and exploiting disconnections and bugs in the game. In League play people can just agree not to do certain things and it's done. You don't see any complaints of "power gaming" in leagues (ie, minmaxing in MM) nor a massive focus on teams that can get and stay developed. If you're looking for an environment where people are trying to eke out every bit of power and performance from their team, you probably want MM, not League. Of course, without League data, or even a clear metric for "competitiveness" we can argue our guesses until the sun goes out.

Yeah, that does highlight your earlier point about leagues being more diverse so needing even more data. Even if they follow crp exactly there can be unwritten rules, as illustrated in the other thread about “unwritten rules” such as frowning on fouling or stalling. So even if you follow the rules and are a pleasant opponent to play you can come under external pressures to alter your playstyle or not pick a certain race.
I have seen complaints about powergaming in leagues, but some of those have seemed unreasonable, for instance Slann being too good.
I’ve not seen many complaints from leagues about clawpomb or min-maxing but whether that is because it is seen as an unfriendly tactic so not used or opponents conspire to target teams that do this or they are just less effective in small leagues it is impossible to tell. I guess you could try and test this somewhat by having a MM league with different TV ranges, say 1 that did 0-100 TV and another 0-500TV. One argument for not seeing min-maxing much in leagues is that it is only effective against roughly equal opponents and gives no advantage if you have to face a much stronger team. But then there’s the difficulty of defining where TV efficiency ends and min-maxing begins.

VoodooMike wrote: One can guess, but it's impossible to say... also, what do we define as "reasonably accurate"? If we're talking about League games... league composition will be a problem we can't really control for with anything but a larger dataset. Variation in data is what makes our result less "accurate" in terms of confidence interval ranges, as I hope I demonstrated in my previous posting with the 3 datasets. League games will tend to have a higher variation, and thus, take more data to result in similar accuracy.

To get a decent dataset for League play you'd need a large number of LEAGUES (not just games within them...) each of which was playing with pure CRP rules and rosters, no houserules. The assumption would be that with a sufficient number of leagues, the compositional differences would balance themselves out in the long run, and with enough games the TV differences would balance themselves out in the long run too. Having only a few leagues, but a large number of games, would amplify the composition problems, probably to the point of making the data difficult to trust. (note: if anyone doesn't understand what I mean by compositional differences and TV differences adding error, I can explain this in nauseating detail, I'm just going to leave it as is until someone specifically needs elaboration).

In short, since we don't know how much error these things will represent, we can't really estimate the number of games or leagues we'd need. Suffice to say that as the dataset increased, so too would the accuracy... typically. It's certainly possible to add data that makes the results less "accurate" (in terms of widening the CI, rather than narrowing it) if the added data lands far away from the data you were already using. The results would be more "accurate" in the sense that your ultimate conclusions would be more reliable (you can't omit data just because it doesn't match up with existing data in terms of numbers!) but they would be of less use in finding effects.

Does that make sense to you?

I think I get the gist of it, so it looks like we’d struggle to ever get enough data from leagues then.
It takes a lot of effort to keep a league going and all the variation amongst them complicates things still further. Even very mainstream ones like the WIL have a few minor house rules like enforcing team variety and TV caps for lower divisions.
Plus there is the added issue with tabletop leagues that you have to collect all the data manually, which makes it much harder if you want to compare how a team does against different opponents and at higher TVs.
So it does look like we’d never be able to compile enough league data unless a real concerted effort was made through Cyanide and FUMBBL, which is unlikely to happen.

koadah · Post by **koadah** » Fri Feb 08, 2013 11:19 am

But as these rules are based on CRP+, not CRP, shouldn't the data be based on CRP+ games?

It seems that we are just guessing at the effect that CRP+ would have, then guessing again at a fix to the guessed effect.

VoodooMike · Post by **VoodooMike** » Fri Feb 08, 2013 5:34 pm

spubbbba wrote:Clearly, because for those 3 inducements I listed we know how much they cost, that’s why I separated them. Extra Training, Wandering Apothecary and Rookie Mercenaries all perform the exact same function as their permanent counterparts but cost more, so are not as good. If you take 2 otherwise identical human teams but one has no rerolls and the other 2 then the team with 2 will be 100TV higher. If the underdog takes the extra training inducement then for that game they’ll be equal TV but the 2 re-roll team will have everything the underdog has but an extra re-roll on top.

However, inducements are a per-match thing, and while they may cost more than their permanent counterparts, you are wed to your choices with the permanent ones, and may shift them around with the inducement versions after seeing who and what you will be facing in that particular match. Does that really make them "the same but different in price" or does the ability to assign the traits as necessary, based on the match, change the effect? The point is... we don't really know because we don't have data to base our position on, just speculation.

spubbbba wrote:I think I get the gist of it, so it looks like we’d struggle to ever get enough data from leagues then.

Yes, pretty much. I mean, we can collect league data and such, but we're unlikely to get a clear enough idea of the win%s, based on the limited amount of data, to justify making changes of any sort based on those win%s being too far from expected values, because the CIs will make our ranges too wide due to all the uncontrolled variation. It'd be peachy if we could collect tens of thousands of game results (hundreds of thousands would be better, considering we have 100k MM match results and still have some problems with precision when we break it down into rosters by TV range at higher TVs) but I don't think it is a realistic expectation.

spubbbba wrote:So it does look like we’d never be able to compile enough league data unless a real concerted effort was made through Cyanide and FUMBBL, which is unlikely to happen.

That'll never happen. Add to that the fact that even if we got everyone, everywhere, to report every single match they play in a league, and got every one of them to play pure CRP... you'd still have fewer match reports to work from. MM is a data cow specifically because it's the fastest, easiest place to get a game... you can play as many games as you feel like in an evening, with as many teams as you feel like.

koadah wrote:But as these rules are based on CRP+, not CRP, shouldn't the data be based on CRP+ games?

Again, the point is that unless you have data to support the idea of making a change, you aren't justified in making it... at least in terms of it being a move toward a goal other than "I just wanna". This isn't about finding out what effects NTBB has, it's about NTBB having no data to support the idea that specific changes need to be made to achieve its stated goal, which is to narrow the tiers (both in terms of making teams in a tier closer to one another in performance, and making various tiers closer to one another). Without a basis for making those changes it has no business claiming it is "narrowing the tiers"... it's nothing more than "Plasmoid's houserules" and should be called such, rather than prostituting itself as being a data driven fix for design flaws.

koadah wrote:It seems that we are just guessing at the effect that CRP+ would have, then guessing again at a fix to the guessed effect.

Yes, it is all a lot of guessing... and if you have no data to support the idea that a fix is needed, then you have no justification for making such a "fix". You can say "well I just want to make this change, so there" and that's fine... but it's better that people know that that's what you're doing, rather than thinking that you're somehow basing it on actual analysis.

It's "Measure twice, cut once" not "Cut now, wait and see if the customers call later to say their house is wobbly", which appears to be both Plasmoid and the BBRC's strategy.

dode74 · Post by **dode74** » Fri Feb 08, 2013 6:58 pm

VoodooMike wrote:It'd be peachy if we could collect tens of thousands of game results (hundreds of thousands would be better, considering we have 100k MM match results and still have some problems with precision when we break it down into rosters by TV range at higher TVs) but I don't think it is a realistic expectation.

I have over 12,000 league results from OCC. All biased by the issues of it being a Cyanide league, a single league, and our particular structure, of course, but still: the data might be able to be gathered. I'll have a hunt around.

VoodooMike · Post by **VoodooMike** » Fri Feb 08, 2013 8:05 pm

dode74 wrote:I have over 12,000 league results from OCC. All biased by the issues of it being a Cyanide league, a single league, and our particular structure, of course, but still: the data might be able to be gathered. I'll have a hunt around.

The main issue with Cyanide based results is that the available rosters have been steadily changing over the years - that doesn't mean there's no reason to look at Cyanide data, just that there are likely to be some changes in win% as new rosters are added to the mix (you can probably run a quick t-test on seasonal results for rosters between versions to check). If you can line up FOL and OCC data based on similar time frames, that might help to control for that to some degree.
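As a rough sketch of what such a check might look like, the snippet below computes Welch's t statistic for one roster's results before and after a version change. The data is invented, the function name `welch_t` is hypothetical, and the p-value uses a normal approximation (a simplification that is only reasonable for large samples; a stats package would use the t distribution with the computed degrees of freedom).

```python
import math
import statistics
from statistics import NormalDist

def welch_t(sample_a, sample_b):
    """Welch's t statistic, its degrees of freedom, and an approximate two-sided p-value."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    se = math.sqrt(va / na + vb / nb)
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / se
    # Welch-Satterthwaite degrees of freedom.
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    # Assumption: normal approximation to the t distribution (large samples only).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p_approx

# Invented results for one roster (1 = win, 0.5 = draw, 0 = loss).
old_version = [1.0] * 55 + [0.5] * 20 + [0.0] * 25
new_version = [1.0] * 40 + [0.5] * 20 + [0.0] * 40

t, df, p = welch_t(old_version, new_version)
print(f"t = {t:.2f}, df = {df:.1f}, approx p = {p:.3f}")
```

Welch's version is used here because it does not assume the two periods have equal variance or equal numbers of games.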

OCC is, if I remember correctly, several tiers of play rather than one giant league? That makes it more like a few smaller leagues than one large one. Compositional issues may still be a problem, but that would probably show itself as variance.

No reason not to check for a significant difference between that data and FOL for the same rosters <shrug>. Finding one won't mean that OCC data is immediately usable as a justification to make changes, but it'd certainly be a step in the right direction, and open the door to further examination of the data.

Pluisje · Post by **Pluisje** » Fri Feb 08, 2013 8:24 pm

As a follow up to the statistics lesson, a question.

How do you calculate the CI when there are only three possible results: W/L/D? Or actually just two, because the draws are discarded, otherwise the goal would be 33/33/33, not 50/50. Not a Bell curve in (my) sight.

VoodooMike · Post by **VoodooMike** » Fri Feb 08, 2013 8:56 pm

Pluisje wrote:How do you calculate the CI when there are only three possible results: W/L/D? Or actually just two, because the draws are discarded, otherwise the goal would be 33/33/33, not 50/50. Not a Bell curve in (my) sight.

Hey, I'm glad someone is paying attention, and you're asking a good question (don't worry, there's an answer).

First, draws are not discarded. The win% used by BBRC (and thus, by us for these discussions) treats a draw as half a win... so you have 1/0.5/0 as far as the calculation goes. Draws simply push the mean toward the middle, as do "mirror matches" (including games of the roster versus itself).
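A minimal sketch of that scoring, with a hypothetical function name and an invented record:

```python
def win_percentage(wins, draws, losses):
    """Win% with a draw counted as half a win (1 / 0.5 / 0 scoring)."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    return 100.0 * (wins + 0.5 * draws) / games

# Invented record: 40 wins, 20 draws, 40 losses.
print(win_percentage(40, 20, 40))  # 50.0
```

A record made up entirely of draws also comes out at 50%, which is the sense in which draws push the mean toward the middle.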

Next, you'll notice in the example I used above with the three datasets, each dataset has, at most, 3 possible values... but you can still calculate a mean and standard deviation from them. The CI can (for the most part... we'll probably just be dealing with the gambler's error margin idea in these discussions) be based on the SD. So, if you plot the W/D/L data visually, you're going to see three vertical lines on a page, more or less.
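For illustration, here is one way the mean, SD and a CI could be computed from nothing but 1/0.5/0 results. The record is invented, `win_rate_ci` is a hypothetical name, and the interval is a plain normal-approximation 95% CI, which is an assumption rather than a statement of the exact method used in this thread.

```python
import math
import statistics

def win_rate_ci(results, z=1.96):
    """Mean, sample SD and a normal-approximation 95% CI for 1/0.5/0 results."""
    n = len(results)
    mean = statistics.mean(results)
    sd = statistics.stdev(results)
    margin = z * sd / math.sqrt(n)
    return mean, sd, (mean - margin, mean + margin)

# Invented record: 40 wins, 20 draws, 40 losses.
small = [1.0] * 40 + [0.5] * 20 + [0.0] * 40
# The same proportions with 100 times as many games.
large = small * 100

for label, data in (("100 games", small), ("10,000 games", large)):
    mean, sd, (lo, hi) = win_rate_ci(data)
    print(f"{label}: mean={mean:.3f} sd={sd:.3f} CI=({lo:.3f}, {hi:.3f})")
```

The SD barely moves between the two sets, but the interval shrinks by roughly a factor of ten because the margin scales with one over the square root of the number of games.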

Now, you probably saw me say that all data can be mapped to the normal curve... and that's where you might say "uh... rolls from a six-sided die won't make a curve... it'll make a horizontal line" and that's true... but what I was saying isn't that all graphs of data will make that curve, but that all data can be framed to fit it. Let's say you had... 10,000 rolls of a d6... if you plot it by value then you'll have that line. If, instead, you randomly assign those 10,000 rolls to groups of 10 rolls, take the average within each group of 10, and plot those averages... you'd get something very close to the normal curve. That's the mean of means, and that's what we usually work with for statistical analysis.
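The d6 example is easy to try out. The sketch below simulates 10,000 rolls, groups them randomly into sets of 10, and prints a crude text histogram of the group averages; the seed, bar scaling and function name are arbitrary choices made for illustration.

```python
import random
import statistics
from collections import Counter

def group_means(rolls, group_size=10):
    """Randomly assign rolls to groups of group_size and return each group's average."""
    shuffled = rolls[:]
    random.shuffle(shuffled)
    return [
        sum(shuffled[i:i + group_size]) / group_size
        for i in range(0, len(shuffled) - group_size + 1, group_size)
    ]

random.seed(1)  # arbitrary seed so the run is repeatable
rolls = [random.randint(1, 6) for _ in range(10_000)]
means = group_means(rolls)

# Raw rolls: roughly the same count for every face (the "horizontal line").
print(sorted(Counter(rolls).items()))

# Group averages: counts pile up around 3.5 and thin out toward the tails.
histogram = Counter(round(m, 1) for m in means)
for value in sorted(histogram):
    print(f"{value:4.1f} {'#' * (histogram[value] // 5)}")
```

The raw counts are flat across the six faces, while the averages cluster around 3.5 in the familiar bell shape.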

How you frame the data depends on what you're trying to look at, really. If we work directly with 0, 0.5, and 1 as results... then we'll absolutely need a lot of data in order to get a CI that doesn't cover the entire field (0-100%) simply because of the lack of variation in the variation! It'll still work, but it takes longer. You can also use other things like making groups and creating the mean of means thing, though in those cases you need to make sure that the resulting number of groups is still large enough to analyze with each group counting as only a single datapoint, and so on.

Luckily for us, the math handles most of these things... and the modern software packages know all the correction formulas and can apply them all as necessary.

Pluisje · Post by **Pluisje** » Sun Feb 10, 2013 2:55 pm

koadah wrote:...

Here's some more data though if anyone is interested.

Hi Koadah, I would be interested. However, this file is too big for the programs I have. Could you make a selection of just the MatchID, MatchDate, HomeRating, AwayRating, HomeRaceId, AwayRaceId, HomeTouchdowns, AwayTouchdowns and DivisionCode? Then I'll crunch some numbers.

I didn't find a "games played previously" item in the list. If that is a selectable item, please add that too. Thanks in advance.
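For anyone wanting to do that kind of trimming themselves, here is a minimal sketch. It assumes the data is a CSV file with a header row containing the field names listed above; the real file's format is not stated in this thread, and the sample rows (including the `HomeCoach` and `AwayCoach` columns) are invented.

```python
import csv
import io

# Field names taken from the request above.
WANTED = [
    "MatchID", "MatchDate", "HomeRating", "AwayRating", "HomeRaceId",
    "AwayRaceId", "HomeTouchdowns", "AwayTouchdowns", "DivisionCode",
]

def select_columns(src, dst, wanted=WANTED):
    """Stream rows from src to dst, keeping only the wanted columns."""
    reader = csv.DictReader(src)
    missing = [c for c in wanted if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    writer = csv.DictWriter(dst, fieldnames=wanted, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count

# Invented sample standing in for the real export.
sample = io.StringIO(
    "MatchID,MatchDate,HomeCoach,AwayCoach,HomeRating,AwayRating,"
    "HomeRaceId,AwayRaceId,HomeTouchdowns,AwayTouchdowns,DivisionCode\n"
    "1,2013-01-05,alice,bob,150,148,3,7,2,1,B\n"
    "2,2013-01-06,carol,dave,160,171,5,5,0,0,R\n"
)
out = io.StringIO()
rows = select_columns(sample, out)
print(out.getvalue())
```

Because rows are streamed one at a time, this approach does not need to hold the whole file in memory, which helps when the file is too big for a spreadsheet program.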

koadah · Post by **koadah** » Sun Feb 10, 2013 9:16 pm

Pluisje wrote:
koadah wrote:...

Here's some more data though if anyone is interested.
Hi Koadah, I would be interested. However, this file is too big for the programs I have. Could you make a selection of just the MatchID, MatchDate, HomeRating, AwayRating, HomeRaceId, AwayRaceId, HomeTouchdowns, AwayTouchdowns and DivisionCode? Then I'll crunch some numbers.

I didn't find a "games played previously" item in the list. If that is a selectable item, please add that too. Thanks in advance.

Try this one.

http://94.236.9.52/downloads/fumbbl2013-02-10.zip

The gameNo field may be a little out on some of the very old (early 2005) teams due to not having all their games yet. There should be none of those in the Box.
