How to Make a Poker AI - Part 3: Monte Carlo CFR

Monte Carlo

Monte Carlo refers to an algorithm that utilises randomisation to gain a sample that is an accurate representation of the full population. The name refers to the Casino de Monte-Carlo; a legendary icon of gambling.

The Casino de Monte-Carlo

Vanilla CFR requires full traversal of the game tree, where every possible choice of every possible node is explored. If we instead could sample parts of this game tree where choices are chosen on their probability of occurring then we can get a representation of the game tree through multiple iterations. This should then converge just as with Vanilla CFR.

External Sampling

There are multiple different approaches to sampling and all will sample different parts of the game tree with different methods. In this article I will be using external sampling, this method of sampling is effective and also simple to implement.

External sampling means all actions that are made by the opponent or chance player are sampled. For example the algorithm is at an opponent decision node in the game tree, the opponents strategy is Raise 30%, Call 60% and Fold 10%, the algorithm will randomly choose one of these actions with probability equal to the action probability and then only explore the game tree of that action. The same thing is done with chance nodes, except with chance nodes each card has an equal probability of being chosen.

Implementation

To implement external sampling will only require minor changes to the Vanilla CFR implementation, we just have to randomly pick an option when at a public node or an opponent node. This results in the following changes to our code from last weeks Vanilla CFR algorithm:

  1. Instead of propagating the game tree with every possible hand combination. The train method will randomly pick a hand for player 1 and player 2 from their respective ranges. As long as these hands do not conflict with each other or the board the algorithm will begin propagation of the game tree with these random hands.

  2. At chance nodes instead of exploring every possible card, the chance node will instead randomly pick a card from the available options.

  3. Decision nodes are broken up into two separate node types; opponent decision nodes and updating player decision nodes. The opponent decision node will randomly pick an action from the available actions with probability equal to the actions probability. The updating player decision node will behave like in Vanilla CFR and explore all action options to calculate node utility.

The changed methods can be seen below.

 public float Train(int numberIterations, List<string> boardArranged, List<string[]> handCombosP1, List<string[]> handCombosP2)
{
    InformationSetMethods = new InformationSetCFRLogic();
    BestResponseUtility = new BestResponseUtility(InfoSetMap, InformationSetMethods);
    float P1Util = 0;
    int utilP1Count = 0;
    List<string> startHistory = new List<string>(boardArranged);
    
    for (int i = 0; i < numberIterations; i++)
    {
        Iteration = i;
        UpdatingPlayer = (Player)(i % 2);

SampleCards:
        //Randomly choose player 1 and player 2 hands
        int indexP1 = rand.Next(0, handCombosP1.Count);
        int indexP2 = rand.Next(0, handCombosP2.Count);

        //Resample hands that conflict with the board or eachother
        if (IsCardConflict(boardArranged, handCombosP1[indexP1], handCombosP2[indexP2]))
        {
            goto SampleCards;
        }

        //Initialise startNode
        GameStateNode startNode = GameStateNode.GetStartingNode(startHistory, handCombosP1[indexP1].ToList(), handCombosP2[indexP2].ToList());

        //Begin the CFR Recursion
        P1Util += CalculateNodeUtilityMC(startNode);
        utilP1Count++;

        //Output exploitability every 100 iterations
        if(i%100==0)
        {
        Console.WriteLine($"Iteration MC {i} complete.");
        Console.WriteLine($"Strategy Exploitability Percentage MC: {BestResponseUtility.TotalDeviation(boardArranged, handCombosP1, handCombosP2)}");
        }
    }
            
    //return average player 1 utility
    return P1Util / utilP1Count;
}
 public float CalculateNodeUtilityMC(GameStateNode gameStateNode)
{
    ///// TERMINAL NODE /////
    if (gameStateNode.ActivePlayer == Player.GameEnd)
    {
        var u = PokerRules.CalculatePayoff(gameStateNode.History, gameStateNode.Player1Cards, gameStateNode.Player2Cards);
        return u;
    }

    float[] actionUtilities = new float[gameStateNode.ActionOptions.Count];
    float nodeUtility;

    ///// CHANCE NODE /////
    if (gameStateNode.ActivePlayer == Player.ChancePublic)
    {
        //Randomly choose card
        int cardChoice = rand.Next(gameStateNode.ActionOptions.Count);
        float actionProbability = (float)1 / gameStateNode.ActionOptions.Count;
        GameStateNode childGameState = new GameStateNode(gameStateNode, gameStateNode.ActionOptions[cardChoice], actionProbability);
        nodeUtility = CalculateNodeUtilityMC(childGameState);
        return nodeUtility;
    }
    
    InformationSet infoSet = GetInformationSet(gameStateNode);
    var strategy = InformationSetMethods.GetStrategy(infoSet);

    ///// UPDATING PLAYER DECISION NODE /////
    if (gameStateNode.ActivePlayer == UpdatingPlayer)
    {
        float activePlayerReachProbability = gameStateNode.ReachProbabiltyP1;
        if (gameStateNode.ActivePlayer == Player.Player2){ activePlayerReachProbability = gameStateNode.ReachProbabiltyP2; }
        InformationSetMethods.AddToStrategySum(infoSet, strategy, activePlayerReachProbability);

        //get utility of each action
        for (int i = 0; i < actionUtilities.Length; i++)
        {
            var actionProbability = strategy[i];
            GameStateNode childGameState = new GameStateNode(gameStateNode, gameStateNode.ActionOptions[i], actionProbability);
            actionUtilities[i] = CalculateNodeUtilityMC(childGameState);
        }
        //average utility for node calculated by Dot product action utilities and action probabilities
        nodeUtility = actionUtilities.Zip(strategy, (x, y) => x * y).Sum();

        InformationSetMethods.AddToCumulativeRegrets(infoSet, gameStateNode, actionUtilities, nodeUtility);
    }
    ///// OPPONENT DECISION NODE /////
    else
    {
        //Randomly choose action
        int strategyChoice = RandomActionSelection(strategy);
        GameStateNode childGameState = new GameStateNode(gameStateNode, gameStateNode.ActionOptions[strategyChoice], strategy[strategyChoice]);
        nodeUtility = CalculateNodeUtilityMC(childGameState);
    }

    return nodeUtility;
}

Monte Carlo vs Vanilla CFR

Comparing these algorithms poses a small problem as Monte Carlo CFR requires far more iterations, but at the same time each iteration takes far less time. In the following charts we use Nodes Touched on the x-axis, this should be approximately proportional to the computational time and allows us to compare the two algorithms.

As can be seen on the following chart, by sampling we have have been able to get the Monte Carlo CFR algorithm to converge much quicker than the Vanilla CFR algorithm.

Sources

[1] Lanctot, M. et.al. Monte Carlo Sampling for Regret Minimization in Extensive Games. 2009. Available from: https://proceedings.neurips.cc/paper/2009/file/00411460f7c92d2124a67ea0f4cb5f85-Paper.pdf

Next
Next

How to Make a Poker AI - Part 2: Vanilla CFR Code and Results