Thanks to Wizards' bombshell unban announcement, Modern is on the cusp of a great age of experimentation. Established and rogue archetypes alike will be examining their 75s in efforts to either combat or incorporate the format's latest newcomers, and I fully expect some new archetypes to appear in the wake of these powerful additions to the card pool. As any prospective deckbuilder knows, the first thing one must do after crafting a new list is to test it, which brings some questions with it. How long should I test before I trust my results? How do I know whether the cards I am testing are helping me win? Having done my fair share of testing and tweaking over the years, I'd like to open a discussion on this topic.
This article offers some general thoughts on the deck testing process: how to set guidelines for the length of the testing period, how to evaluate your results, and how to decide when your data is applicable.
Sample Size
As David has shown in his exhaustive testing of a variety of banned (and formerly banned) cards in Modern, it's rather difficult to collect enough matches of Magic for a deck-comparison analysis to stand up to a Student's t-test at a high confidence level. The difference an individual card makes in a deck's winrate is often minute, and the game carries plenty of inherent variance. This means that most experimentation falling short of that threshold will involve some qualitative judgment calls. Chief among these is deciding how much data is enough to constitute a reasonable sample.
Deciding what counts as a reasonable sample can seem somewhat arbitrary at first; after all, what is the real difference between two datasets if a statistical test on either one fails to reject the null hypothesis? My definition of a reasonable sample is one that approximates a statistically valid sample while still being practical for the tester to achieve. This obviously varies with the tester's circumstances, but as a guideline, I would be skeptical of any conclusions about a card's effectiveness backed by a sample of fewer than 75 individual matches, and I would prefer the number to be 100 or greater.
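To put rough numbers behind that intuition, here is a quick sketch (in Python) of a standard two-proportion sample-size estimate, showing how many matches it would take to statistically separate two configurations. The winrates, significance level, and power are illustrative assumptions, not measured figures.

```python
# Rough two-proportion sample-size estimate: how many matches per deck
# configuration are needed to detect a given winrate difference at a
# chosen significance level and power. Winrates below are illustrative.
from statistics import NormalDist

def matches_needed(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate matches per configuration to distinguish winrates p1 and p2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2) + 1

# Telling a 55% deck apart from a 50% deck takes over 1,500 matches per
# version; even a 10-point gap still takes a few hundred.
print(matches_needed(0.50, 0.55))  # ~1562
print(matches_needed(0.50, 0.60))  # ~385
```

Samples in the 75-100 range will not clear that bar, which is exactly why the qualitative notes discussed below matter so much.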
Given the advent of Magic Online as well as several other online platforms which enable easy access to testing, I believe that these sample sizes are relatively accessible, and they result in more stringently collected data. I would also err in favor of a larger sample for a relatively new or previously untested deck, as opposed to a small tweak in an established list. Say you're stoked for the return of Bloodbraid Elf and Jace, the Mind Sculptor, and want to test out a Temur Moon list that jams them both, like this one:
Temur Moon, by Roland F. Rivera Santiago
The rationale behind this 75 is to craft a deck that is well-suited to win the midrange mirror thanks to its ability to generate card advantage, while still holding up against aggro or big mana with removal and land disruption, respectively. Following the guidelines mentioned above, I would look to get 100 matches in before making any major changes to the list.
Data Evaluation
After deciding on a deck to test and a trial period, there are a few things to keep in mind.
- Don't leap to early conclusions. It is natural to notice a card you're testing when you draw it and to register how it performed in that game, but resist the temptation to draw conclusions as you go along. Giving in can create an internal narrative that makes it harder to interpret your results objectively. Instead, write down detailed notes on how the card performed, and take a holistic view of those notes at the end of your trial.
- Keep detailed notes. Note-keeping was alluded to in the previous point, but I would suggest expanding the notes beyond the cards being tested. Good data to record for pretty much any type of testing includes the opposing deck's archetype, whether you were on the play or the draw, any mulligans, mana flood and screw, decisions you regretted after the match, and whether the sideboard was appropriate for the matchup. Other parameters can be annotated, but I consider these the essentials; a sketch of one possible match-record format follows this list.
- Choose a standard of success. People play Magic for a variety of reasons. While I assume that someone interested in documenting their testing in a detailed manner is looking for some degree of "competitive success," there are several ways to define even that.
For example, the requisite winrate for a deck to break even in Magic Online leagues at the time of this writing is 50%, according to this expected value calculator. However, chances are the winrate will have to be quite a bit higher if you want to recoup your investment, consistently post 5-0 results that can be featured in Wizards' database, or have a chance of making Top 8 in a format challenge (a quick calculation after this list shows why). Higher still would be the winrate for a deck capable of taking down large paper events such as SCG Opens, Grand Prix, and Pro Tour Qualifiers. In the case of edits to an existing list, the standard of success may be relative: can this configuration of the deck perform better against the field than the previous one?
To throw a number out there, I consider a 60% winrate in Magic Online leagues a good benchmark of success for any prospective brew, and I would definitely be looking for the proposed Temur Moon list to meet or exceed that standard. My standard of success for any changes to the Merfolk list I piloted to a Top 16 finish at the SCG Classic in Philadelphia is a bit higher, namely because that deck has consistently exceeded that threshold in the past.
- Evaluate the quality of competition. It also matters who you're testing against. Obviously, there is only so much control one can exert over which opponents one faces, especially online. However, certain venues offer stiffer competition than others on average, and you can choose which ones you frequent depending on the standard of success you have set for the deck. Outside of a dedicated testing group of high-level players (such as a pro team), I'd say Magic Online competitive leagues provide the best competition available to the average player, followed by friendly leagues or local events with a disproportionate number of successful players. Next, I'd rank league efforts on other online platforms and your typical FNM-level event. I would avoid drawing major testing conclusions from kitchen-table Magic, two-player queues or the Tournament Practice room on Magic Online, or free play on other platforms; the lack of stakes means opponents' skill levels vary wildly, and it becomes difficult to separate the deck's performance from that of the players involved.
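On the note-keeping point, here is one possible shape for a per-match record, sketched in Python. The field names and the example entry are my own illustrations, not a prescribed format.

```python
# One possible per-match record for testing notes. Field names are
# illustrative; adapt them to whatever you actually want to track.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MatchRecord:
    opponent_archetype: str        # e.g. "Burn", "Affinity"
    result: str                    # "W", "L", or "D"
    on_the_play_games: int         # games in the match where you were on the play
    mulligans: int                 # total mulligans taken across games
    flood_or_screw: bool           # did mana issues decide a game?
    regretted_decisions: List[str] = field(default_factory=list)
    sideboard_felt_right: bool = True
    card_notes: List[str] = field(default_factory=list)  # e.g. "BBE cascaded into Mana Leak"

log: List[MatchRecord] = []
log.append(MatchRecord("Burn", "L", 1, 2, True,
                       ["kept a one-lander"], False,
                       ["Huntmaster too slow on the draw"]))
```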
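On the winrate standards above, a quick binomial calculation shows why consistently trophying demands far more than a break-even winrate. It assumes a league is five independent matches at a fixed per-match winrate, which is a simplification.

```python
# Probability of a 5-0 league run as a function of per-match winrate,
# assuming independent matches. The winrates are example values.
for p in (0.50, 0.60, 0.70):
    print(f"{p:.0%} per match -> {p**5:.1%} chance of going 5-0")
# 50% per match -> 3.1% chance of going 5-0
# 60% per match -> 7.8% chance of going 5-0
# 70% per match -> 16.8% chance of going 5-0
```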
How to Move Forward
After collecting a reasonable sample and deciding on a standard of success, it's time to peruse the data and figure out how testing went. Some major questions and follow-ups to ask:
- How did the testing go? Did the test deck meet or exceed the established standard of success? How did the cards in question figure into meeting (or failing to meet) that standard? This is one of the points where the notes come in handy (a sketch of tallying such notes into per-matchup winrates follows this list). One potential conclusion is that the deck missed the standard because of mana issues, which could point to the manabase needing rework, or to the need for a larger sample to account for a stretch of bad variance. Alternatively, the supporting elements of the strategy may be lacking.
For instance, let's say the Temur Moon list above doesn't quite meet our standard of success, and the chief reason why is that our Bloodbraid Elf hits were somewhat lackluster. Having detailed notes on your cascade hits can help you come to that conclusion, and address it in future configurations by incorporating cards like Vendilion Clique or Savage Knuckleblade at the expense of poor cascade hits like Mana Leak.
- Will the tested changes be kept or abandoned? In the case of a new archetype, this is when you should decide whether you want to tweak it further, or drop the deck altogether. If you do decide to drop the deck, will you abandon the idea? Or move forward with a new take on it?
Going back to Temur Moon, let's say Huntmaster of the Fells is underachieving as a sideboard card, and that is causing our winrate to suffer in the matchups it is supposed to address, such as Burn. Chances are that a more cost-efficient answer for the matchup (like Courser of Kruphix) could solve the problem. On the other hand, if testing reveals that the deck isn't as strong in the midrange mirror as cards like Ancestral Vision would have led you to believe, you're probably better off going back to the drawing board and coming up with a new concept.
- Will further changes be made? For a deck that met or exceeded the standard of success, decide whether you want to keep riding the proverbial wave of good results or tweak it further in an effort to keep climbing. For a deck that failed to meet your preestablished standard, this is when you should decide how extensive the revisions need to be. I would lean heavily on my notes here, as changing even a few cards can have far-reaching consequences on a deck's performance.
For example, suppose the Temur Moon list described above proved somewhat poor at fending off artifact-based decks such as Affinity and Lantern Control, and that this either kept it from reaching my standard of success or left me wanting further changes in pursuit of a higher standard. I would then have to decide how much of my 75 to change in order to shore up those matchups. Some potential wiggle room: trying Abrade in the slots currently occupied by Roast, adding artifact-based sweepers such as Creeping Corrosion or Shatterstorm to the board, or bringing in relevant spot removal like Destructive Revelry to shore up post-board games.
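As promised above, here is a rough sketch of turning match notes into per-matchup winrates and flagging anything below a chosen benchmark. The records and the 60% figure are just the examples used in this article, not real data.

```python
# Tally per-archetype winrates from simple (archetype, result) pairs and
# flag matchups that fall short of a chosen benchmark. Data is made up.
from collections import defaultdict

matches = [("Affinity", "W"), ("Affinity", "L"), ("Burn", "W"),
           ("Lantern Control", "L"), ("Lantern Control", "L"),
           ("Jund", "W"), ("Jund", "W")]

BENCHMARK = 0.60  # the 60% league winrate standard discussed above

tally = defaultdict(lambda: [0, 0])          # archetype -> [wins, total]
for archetype, result in matches:
    tally[archetype][1] += 1
    tally[archetype][0] += result == "W"

for archetype, (wins, total) in sorted(tally.items()):
    rate = wins / total
    flag = "" if rate >= BENCHMARK else "  <- below benchmark"
    print(f"{archetype}: {wins}/{total} ({rate:.0%}){flag}")
```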
Conclusion
I have used this method several times to evaluate changes to my decks, and I have been satisfied with the thoroughness of my conclusions and the improvement they have produced. I also haven't found the process overly time-consuming, which matters given that I have a finite amount of time to dedicate to the game. If you have any comments on this testing method, or a method of your own that you'd like to share, feel free to drop me a line in the comments.
I'm working on a hellishly complicated deckbuilding tool.
The easiest way to describe it is that I use paperstrips to measure individual card performance, and then use evolution to build my decks.
I've been working at this for at least two years and am constantly refining the process to get faster results.
At the moment I'm developing a mill deck and have just begun generation 4.
(It started out as a very classic Esper mill before I let evolution improve upon it.)
https://www.mtgvault.com/wickeddarkman/decks/spotting-the-trends-in-my-mill/
I play 10 games against each of the 20-25 test decks, which means that I play at least 200 games to get my measurements for a single generation. (I proxy all the test decks.)
First I run a trimming phase, where I put paperstrips inside the sleeve of each spell in the deck (I also build the mana using paperstrips). During the trimming phase I only play 5 games against each test deck. Each strip is divided into two sections, keep and cut.
I play all 100 games, and whenever a card is cast it gets marked with a "keep"; if I lose the game, all remaining cards in hand are marked with a "cut". Anything with more cuts than keeps is simply removed from the deck, and I then run an insertion phase.
The paperstrips are used in a slightly more advanced way during the insertion phase.
To begin with, the cards removed in the trimming phase are replaced by sleeves containing blank strips.
If I draw a blank card during a game, it can become any card I choose at the moment I cast it, as long as I write that choice down on the strip. (Blank cards are prime targets for discard and counterspells.) Over the following 200 games I gather lots of data on which cards could go into the deck, and in the end the choices that have been used most often simply replace the blanks as the new cards. (For instance, once I've noted "Talent of the Telepath" on a strip, I give it a smiley whenever the blank is used as that card in a game.)
So all in all a single generation takes 300 games to produce.
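For anyone who would rather mirror the paperstrip bookkeeping in a script or spreadsheet, the core of a generation boils down to two tallies, roughly like this (the card names and counts here are placeholders, not real results):

```python
# Rough digital mirror of the paperstrip bookkeeping: trimming keeps/cuts,
# then counting which write-ins most often replace the blanks. The card
# names and numbers are placeholders.
from collections import Counter

# Trimming phase: +1 keep each time a card is cast, +1 cut for each copy
# left in hand when a game is lost.
keeps = Counter({"Mind Funeral": 14, "Mesmeric Orb": 11, "Mana Leak": 4})
cuts = Counter({"Mind Funeral": 5, "Mesmeric Orb": 6, "Mana Leak": 9})

removed = [card for card in keeps | cuts if cuts[card] > keeps[card]]
print("Cut after trimming:", removed)   # ['Mana Leak']

# Insertion phase: each time a blank is declared as some card, note it;
# the most-noted choices replace the blanks in the next generation.
blank_choices = Counter(["Talent of the Telepath", "Talent of the Telepath",
                         "Visions of Beyond"])
replacements = [card for card, _ in blank_choices.most_common(len(removed))]
print("New cards next generation:", replacements)  # ['Talent of the Telepath']
```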
But if you use the concept on a tier 1 deck you probably only need to run 1 generation.
My mill has been through 4 generations and is close to being a tier 3 deck.
That sounds like a pretty sweet system, but also time-intensive. What’s your ballpark estimate of the extra time invested to track matches in this manner?
Generation 0 was made on 30 September 2017:
https://www.mtgvault.com/wickeddarkman/decks/the-return-of-godikas-mill/
Generation 1 on 16 October 2017.
Generation 2 on 16 November 2017.
Generation 3 on 23 January 2018.
Generation 4 on 12 February 2018.
So each generation takes about a month to work out.
I'm working solo and know that having a second person would cut down a lot on the time.
The best part of the system is that you get to measure the cards any way you can think of. Since this project has been with mill, I have a lot of measurements on how many cards I mill with cards like Mind Funeral, Mesmeric Orb, and many others.
Since I keep notes on almost everything, I can backtrack and try out different things and setups later.
I have been speculating about a system where each "strip" becomes a full sheet on its own, which would store even more information for me.
In essence I've re-invented the way artificial intelligence uses massive amounts of data to figure stuff out, using only paperstrips and my mind as the computational system. I'm hoping to double the processing speed soon.
That's actually a pretty good rate of evolution. I might lift a couple of notes from your method and see how I like them.
It's faster than the rate at which the ordinary metagame changes, which helps keep the results valid.
Here’s a link to the best “step by step” guide I have so far:
https://www.mtgvault.com/wickeddarkman/decks/the-paperstrip-method-v2/
If you have any in-depth questions, I can answer them here or at the linked site.
I’m currently refining a lot of the process.
The major strength of it all is the storage of data on the paperstrips, so the main challenge is to "design/program" the strips in a way that stores the most useful data for whatever you want to research.
For example:
Having a strip for Wall of Omens means that I want to extract data on how the wall affects the game. It may draw me a card, but is it a good card? Does the wall get to block? Does the wall die to removal?
I’d write the strip out like this:
Wall of Omens:
Good Draw:
Blocks:
Is killed:
———————
Bad Draw:
No blocking:
This is a good way to evaluate small changes such as going up or down a land for your manabase.
Manabases in particular benefit from lots of reps in order to land on the refined, optimal build.
Cheers for the article!
Glad you liked it! And I agree, manabases are particularly finicky.