## What is Wordle?

Since I’m never one to pass up an opportunity for a math lesson, it occurs to me that the game Wordle would make a great central example in a lesson on information theory, specifically the concept of entropy. Because the whole algorithm is built around entropy, I got drawn into the puzzle like a lot of other people did, and like a lot of programmers, I got drawn into trying to write an algorithm that would play the game as efficiently as possible.

So, I thought I’d share some of my thought processes and the math involved with you here.

First things first, in case you are unfamiliar with Wordle, let me go over the game’s rules. This kills two birds with one stone, since it also sets up the direction we’re taking: writing a small algorithm that will essentially play the game for us.

I haven’t completed the Wordle for today, February 4, so we’ll see how the bot does.

Wordle gives you six chances to guess a five-letter mystery word.

For instance, Wordlebot suggests that you start with the word crane.

You receive information for each guess you make.

The colors indicate how close your guess is. The gray box tells me there is no c in the answer; the yellow box tells me the secret word does contain an r, but not in that position; and the green box tells me the a is correct, in the third position.

There is also no n or e, so let me continue.
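To make these scoring rules concrete, here is a small Python sketch of how the color pattern for a guess can be computed. The function name and the tuple-of-strings representation are my own, but the two-pass logic mirrors how Wordle resolves repeated letters: greens claim their letters first, then yellows are assigned left to right.

```python
def wordle_pattern(guess: str, answer: str) -> tuple:
    """Return Wordle's feedback for `guess` against `answer` as a tuple
    of 'green', 'yellow', or 'gray'."""
    pattern = ["gray"] * 5
    pool = list(answer)

    # First pass: greens, removing matched letters from the pool.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = "green"
            pool.remove(g)

    # Second pass: yellows for guessed letters still in the pool.
    for i, g in enumerate(guess):
        if pattern[i] == "gray" and g in pool:
            pattern[i] = "yellow"
            pool.remove(g)

    return tuple(pattern)

# The opening example from the text: guessing "crane" when the answer
# turns out to be "shard" gives gray c, yellow r, green a, gray n, gray e.
print(wordle_pattern("crane", "shard"))
```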

Let’s go ahead and plug in the information we obtained after starting with “crane.”

Don’t worry about the data it’s displaying right now; I’ll explain it all later.

Its top recommendation for our second guess is “shtik.”

Although your guess must be an actual five-letter word, you’ll see that it is fairly open-ended in terms of what it will actually allow you to guess.

In this instance, we try “shtik,” and everything appears to be going well.

We hit the s and the h, and since we know there is an r somewhere outside the first three letters, the answer will be something like “s-h-a-something-r” or “s-h-a-r-something.” It appears that Wordle-Bot knows there are only two options left: sharp or shard.

At this point, it’s kind of a toss-up between them, so I’ll just guess “shard,” since it comes first alphabetically.

Which, hooray, is the correct answer, so we figured it out in three.

If you’re unsure whether that’s good: I’ve heard it said that in Wordle, par is four. You have to be on your game to get four consistently, but it’s certainly not crazy, and when you get it in three it just feels great; the birdie analogy is pretty apt.

So here’s what I’d like to do: outline my thought process for how I approached the Wordle bot from the beginning. As I said earlier, this is really just a pretext for a lesson on information theory, and the primary objectives are to define information and describe entropy.

### Initial ideas

I started by considering the relative frequency of the various letters in the English language.

I then asked myself: is there an opening guess, or opening pair of guesses, that hits a lot of these most frequent letters? One that came to mind was “other” followed by “nails.”

The idea is that if you hit one of these letters, you receive a green or a yellow, which always feels good, like you’re getting information. But in this case, even if you miss and get all grays, you’re still getting a lot of information, because it’s uncommon for a word to contain none of these letters.

Despite this, it doesn’t feel very systematic, because, for instance, it takes no account of the order of the letters. Why type “nails” with the letters in that order? I’m not sure whether it’s preferable to type “snail” instead, or whether having the final s of “nails” matters.

Now, a friend of mine mentioned that he preferred to begin with the word “weary,” which surprised me somewhat because it contains some uncommon letters like the w and the y.

However, who knows? Perhaps that is a better opening.

Is there a quantitative score we can assign to evaluate the quality of a potential guess? To set up how we’re going to rank potential guesses, let’s go back and clarify how the game is structured: there is a list of words you can enter that are considered valid guesses.

Although it only has about 13,000 words, many of them are extremely unusual.

The vibe of the game is that the answer will always be a decently common word, and in fact there is another list of roughly 2,315 words that are the possible answers. This is a human-curated list; it’s kind of entertaining that it was curated by the game creator’s girlfriend. What I’d like to do is see if we can write a program that solves Wordle without using prior knowledge of this list.

There are many reasonably common five-letter words you won’t find in that list, so it would be preferable to write a program that’s a little more resilient and could play Wordle against anyone, not just the official website.

Besides, how did we even get this list of potential answers? It’s visible in the site’s source code, so you can always just look up what tomorrow’s answer will be; and the way it appears there is in the precise order in which answers come up from day to day.

There is a sense in which using the list is cheating. What makes for a more engaging puzzle, and a richer information theory lesson, is to use more general data instead, such as relative word frequencies, to capture the intuitive preference for more common words.

For example, if my friend suggests “weary,” how should we evaluate its quality? He said he likes that unlikely w for its long-shot nature: it just feels so good if you do hit that w.

And indeed, if the first pattern revealed looked like this, there would be only 58 words in this huge lexicon that match it, a huge reduction from 13,000.

The point is that a pattern carrying a lot of information is, by its very nature, unlikely to occur; in fact, being unlikely is precisely what it means to be informative.

A much more probable pattern to see with this opening would be something like this, where most of the matching words are very obscure, even questionable.

Of course, they’re not all equally likely to be the answer, but at least for our first pass at all of this, let’s assume they are, and refine that a bit later.

In this case, where there are 1,400 possible matches, the probability of seeing this pattern works out to around 11%; the most likely outcomes are also the least instructive.

Let me demonstrate the full distribution of probabilities across all possible patterns so you can get a more comprehensive picture of this.

As a result, each bar you see corresponds to a potential color pattern that could be revealed.

There are 3^5 possible patterns, and they are arranged from least common to most common, left to right.

When making a guess, you hope to land somewhere in the long tail, like over here, where only 18 words match the revealed pattern.

Here, you get all grays, which occurs about 14% of the time.

Alternatively, if we move a little bit further to the left, we could go all the way over here.

Here’s a good puzzle for you: Name three English words that begin with “w,” end with “y,” and contain the letter “r” at some point.

The answers are, let’s see: wryly, wordy, and wormy.

So, to assess how effective this word is overall, we can go through each pattern and multiply its probability of occurring by something that measures how informative that pattern is. That would give us an objective score: a measure of the expected amount of information you’re going to get from this distribution.

Your initial thought for what that something should be might be the number of matches; however, I’d like to use a more general measurement, one we frequently apply to information, which will be more adaptable once we assign different probabilities to each of these 13,000 words for whether or not they’re the answer.

### Information theory basics

The standard unit of information is the bit. Its formula looks somewhat amusing at first, but it is actually quite simple once we look at examples.

In our example, the space of possibilities is all possible words, and it turns out that approximately half of all five-letter words contain an s; a little less than half, but approximately. An observation that cuts your space of possibilities in half is said to carry one bit of information, so learning that the word has an s would give you one bit.

If a new fact reduces the space of possibilities by a factor of 4, we say that fact carries two bits of information.

For instance, it turns out that about 25% of these words contain a t.

If an observation reduces that space by a factor of eight, we say it carries three bits of information, and so on.

Four bits cut the space to a sixteenth, five bits to a thirty-second.

Now might be a good time for you to pause, think, and ask yourself:

What is the formula for information, for the number of bits, in terms of the probability of an event? What we’re saying is that one-half raised to the number of bits equals the probability: (1/2)^b = p.

If you rearrange this further, you can say that the information is the log base 2 of 1 divided by the probability.

Occasionally, you can see this with yet another rearrangement, where the information is the negative log base 2 of the probability.

Stated this way, it may seem a little strange to those unfamiliar with it, but it really just answers the very simple question: how many times have you cut your possibilities in half? Now, if you’re wondering why logarithms are entering the picture when I thought we were just playing a fun word game, one reason this is a nicer unit is that it’s much simpler to discuss extremely unlikely events: you can say an observation carries 20 bits of information rather than that the probability of such-and-such occurring is about 0.00000095. But a more significant reason this logarithmic expression ends up being such a helpful addition to probability theory is the way that information adds up.

For instance, if one observation gives you two bits of information, reducing your space by a factor of four, and a second observation, such as your second guess in Wordle, gives you another three bits, reducing the space by a further factor of eight, then together they give you five bits of information, in the same way that probabilities like to multiply.

Therefore, the logs make things much easier to handle once we enter the realm of something like an expected value, where we are adding up lots of numbers.
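These facts about bits are easy to check numerically. A minimal sketch (the function name is mine):

```python
import math

def information_bits(p: float) -> float:
    """Bits of information from observing an event of probability p,
    i.e. how many times p cuts the space in half: log2(1/p) = -log2(p)."""
    return -math.log2(p)

print(information_bits(1 / 2))   # 1.0: halving the space is one bit
print(information_bits(1 / 4))   # 2.0
print(information_bits(1 / 32))  # 5.0
# Information adds where probabilities multiply:
print(information_bits(1 / 4 * 1 / 8))  # 5.0 = 2 bits + 3 bits
```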

The main thing I want you to notice is that as we approach those more likely patterns, the higher the probability, the lower the information, and the fewer bits you gain.

Let’s return to our distribution for “weary” and add another small tracker here showing how much information there is for each pattern.

The expected value of this information, where we go through each pattern and say how probable it is, and then multiply that by how many bits of information we receive, is how we will assess the quality of this guess.

And in the case of Weary, that turns out to be 4.9 bits, meaning that, on average, the knowledge you gain from this initial guess is equivalent to cutting your space of possibilities in half roughly five times.
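To make that expected-information calculation concrete, here is a minimal sketch. It assumes we have already counted how many candidate words fall into each feedback pattern; the helper name and counts-based interface are my own, and every word is treated as equally likely, as in this first pass.

```python
import math

def entropy_bits(pattern_counts) -> float:
    """Expected information (entropy) of a guess: for each pattern,
    probability of seeing it times the bits of information it carries."""
    total = sum(pattern_counts)
    entropy = 0.0
    for count in pattern_counts:
        if count == 0:
            continue
        p = count / total                # probability of this pattern
        entropy += p * math.log2(1 / p)  # times its information content
    return entropy

# A guess splitting the words evenly into 16 patterns: log2(16) = 4 bits.
print(entropy_bits([1] * 16))  # 4.0
# A useless guess that lumps everything into one pattern: 0 bits.
print(entropy_bits([100]))     # 0.0
```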

Compare that with something like “slate,” which has a higher expected information value; in this instance, you’ll notice the distribution is much flatter.

In particular, the most likely occurrence of all grays only has a six percent chance of happening, so at the very least, you’re getting 3.9 bits of information, but that’s just a minimum.

Usually, you’d get something better.

It turns out that, when you run the numbers on this one and add up all of the pertinent terms, the average information is actually about 5.8 bits.

Therefore, compared to opening with “weary,” your space of possibilities after “slate” will on average be roughly half as large.

There’s a funny story about the name for this expected-value-of-information quantity. Claude Shannon, who created information theory, was working at Bell Labs at the time, at the beginning of what was becoming computer science, and he was discussing some of his yet-to-be-published ideas with John von Neumann. When Shannon mentioned that he didn’t really have a good name for this expected value of information, von Neumann, so the story goes, said he should call it entropy, for two reasons: first, the same formula already went by that name in statistical mechanics, and second, nobody really knows what entropy is, so in a debate he would always have the advantage. Shannon was dealing with pure probability theory, so if this account is to be believed, the overlap with the physics term is somewhat tongue-in-cheek.

If you’re wondering how it relates to the second law of thermodynamics from physics, there is undoubtedly a connection, but for the purposes of this discussion, I just want you to picture the expected information value of a specific guess when I use the word entropy.

Entropy can be thought of as simultaneously measuring two different things: first, the degree of flatness of the distribution; the closer a distribution is to uniform, the higher the entropy will be.

In our example, where there are 3^5 total patterns, the maximum possible entropy is the log base 2 of 3^5, about 7.9 bits, achieved when every pattern is equally likely.

However, entropy is also somewhat a measure of how many possibilities there are in the first place.

For example, if you happen to have some word where there are only 16 possible patterns and each one is equally likely, the entropy, this expected information, would be four bits.

However, the entropy would be six bits if the second word had 64 potential patterns that could emerge and they were all equally likely to do so.

Therefore, if you observe a distribution in the wild that has an entropy of six bits, it’s as if it’s indicating that there will be as much variation and uncertainty as if there are 64 equally likely outcomes.
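That “effective number of equally likely outcomes” reading is easy to verify: 2 raised to the entropy recovers the count for a uniform distribution, and gives the effective count for a non-uniform one. A small sketch (the helper name is mine):

```python
import math

def entropy(probs) -> float:
    """Entropy in bits of a probability distribution."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

uniform = [1 / 64] * 64
print(entropy(uniform))      # 6.0: 64 equally likely outcomes
# A skewed distribution over 3 outcomes behaves like fewer than 3
# equally likely ones:
skewed = [0.5, 0.25, 0.25]
print(2 ** entropy(skewed))  # about 2.83 "effective" outcomes
```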

For my first attempt at the Wordle bot, I essentially had it do exactly this: for each of the roughly 13,000 allowed guesses, it computes the entropy of the distribution across all patterns you might see, and then it chooses the guess with the highest entropy, since that one is likely to reduce your space of possibilities as much as possible.

Even though I’ve only discussed the first guess here, the process is the same for the subsequent guesses.

For instance, after spotting a pattern on the first guess, which would limit you to a smaller set of words based on what matches with it, you would simply continue playing the same game with regard to that smaller set of words for a proposed second guess.

You would then look at the distribution of all patterns that could arise from that more limited set of words.

The one that maximizes that entropy is the one you find after searching through all 13 000 possibilities.
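Putting the pieces together, version one of the solver can be sketched as below. This is a toy illustration rather than the actual bot: the five-word list stands in for the real 13,000, and every candidate is treated as equally likely.

```python
import math
from collections import Counter

def pattern(guess: str, answer: str) -> tuple:
    """Feedback as a tuple of 2 (green), 1 (yellow), 0 (gray)."""
    result = [0] * 5
    pool = list(answer)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = 2
            pool.remove(g)
    for i, g in enumerate(guess):
        if result[i] == 0 and g in pool:
            result[i] = 1
            pool.remove(g)
    return tuple(result)

def expected_information(guess: str, candidates) -> float:
    """Entropy of the pattern distribution this guess induces over the
    remaining candidate answers (all assumed equally likely)."""
    counts = Counter(pattern(guess, answer) for answer in candidates)
    n = len(candidates)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_guess(allowed, candidates) -> str:
    """Version 1 in a nutshell: maximize expected information."""
    return max(allowed, key=lambda g: expected_information(g, candidates))

# Toy word list standing in for the real lexicon.
words = ["shard", "sharp", "share", "charm", "crane"]
print(best_guess(words, words))
```

After each guess, you would filter `candidates` down to the words consistent with the revealed pattern and repeat the same search over the smaller set.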

Let me pull up a small version of Wordle that I created, which displays the highlights of this analysis in the margins, to demonstrate how this works in practice.

After performing all of its entropy calculations, it shows on the right which guesses have the highest expected information, and the top suggestion, at least for the time being (we’ll refine this later), is “tares.”

(A “tare,” naturally, being a kind of vetch.)

For each guess we make, we can see how much information was expected; and if I disregard its recommendations and choose “slate,” because I like “slate,” we can see to the right of the word how much information we actually received, given this particular pattern.

It appears that we were a little unlucky because the expected score was 5.8, but we actually received a score that was lower, and here on the left side, it is showing us all of the various words that are possible given our current position.

The uncertainty measurement tells us the entropy of the distribution across the possible words, which right now, because it’s a uniform distribution, is just a needlessly complicated way to count the number of possibilities.

For example, if we take 2 to the power of 13.66, we get roughly the number of remaining possibilities; the only reason it’s a little off is that I’m not showing all the decimal places.

The blue bars tell us how likely the program thinks each word is; at the moment it assumes every word is equally likely to occur, but we’ll refine that in a moment.

Tracking both numbers might seem unnecessary right now, like it would complicate matters excessively, but you’ll see in a moment why it’s handy to have both.

It appears to be indicating that “ramen” has the highest entropy for our second guess.

Which, once more, just doesn’t really feel like a word, so in order to maintain my moral high ground, I’m going to go ahead and type in “rains.” Once again, it appears we were a little unlucky: we were expecting 4.3 bits but received only 3.39 bits of information, which reduces our options to 55.

And here, perhaps, I’ll just do what it advises and guess “kombu,” whatever that might be. Okay, this is actually a good chance for a puzzle: it tells us that this pattern provides 4.7 bits of information, but that there were 5.78 bits of uncertainty before we saw the pattern.

As a quiz question: what does that imply about the number of potential answers still possible? It means about one bit of uncertainty remains, which is equivalent to saying there are about two possible answers left.

As currently written, the program just keeps trying to gather information until only one possibility is left, at which point it guesses that answer; so obviously, we need to change it to account for this.

From here, because you and I know which words are more common, we know the answer is probably “abyss,” but as currently written, the program doesn’t know that.

We could devise a more advantageous endgame tactic, but for now let’s take version one of our Wordle solver and run some simulations to see how it performs.

In other words, this method uses the 2,315 words that make up the actual Wordle answers as a testing set, while taking the naive approach of ignoring how common a word is and simply trying to maximize the information at each stage until there is only one remaining option.

By the time the simulation is over, the average score is 4.124, which isn’t too bad, to be honest.

However, Wordle players will tell you that they can typically complete it in four; the real challenge is getting as many threes as possible, since the real game is played between scores of three and four. Incorporating whether a word is common or not is the obvious low-hanging fruit here, but how exactly do we do that?

### Incorporating word frequencies

The method I used was to obtain a list of the relative frequencies for every word in the English language.

I did this by using Mathematica’s WordFrequencyData function, which draws its information from the Google Books English n-gram public dataset.

As an illustration, here’s what happens if we sort the words from most common to least common. Evidently, these are the most frequent five-letter words in the English language, and it makes sense that they would appear so often: “which” comes first, followed by “there.” Amusingly, “first” itself is not first but ninth, and the words following “first” are slightly less common still.

For example, while the word “braid” is in some sense about a thousand times less likely than “which” (given a score of 0.002 in this data set), both are common enough words that they’re almost certainly worth considering. We want to take that into account when using this data to model how likely each word is to be the final answer.

Rather than using these frequencies directly, I wanted something more like a binary cutoff. Imagine taking this entire sorted list of words, placing it on an x-axis, and then applying a sigmoid function, the conventional way to have a function whose output is essentially binary, either zero or one, with smoothing in between for the region of uncertainty.

The probability I assign to each word for being on the final answer list is then the value of the sigmoid above wherever that word sits on the x-axis. Obviously, this depends on a few parameters: how much of the x-axis the transition region spans, which determines how gradually or steeply we step down from one to zero, and where along the axis it’s centered.

I looked through the sorted list and tried to find a window where, eyeballing it, about half of the words seemed more likely than not to be a final answer, and I used that as the cutoff.

To be honest, the way I did this was kind of just licking my finger and sticking it into the wind.
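As a sketch of that prior, with hand-picked placeholder parameters rather than the ones actually used:

```python
import math

def answer_prior(rank: int, center: int = 3000, width: int = 500) -> float:
    """Probability assigned to the word at `rank` in the frequency-sorted
    list (0 = most common) of being a plausible Wordle answer.  A sigmoid
    is near 1 for common words and near 0 for obscure ones, stepping down
    smoothly around `center` with steepness set by `width`.  Both
    parameters here are illustrative placeholders."""
    return 1.0 / (1.0 + math.exp((rank - center) / width))

print(answer_prior(0))      # very common word: close to 1
print(answer_prior(3000))   # at the cutoff: 0.5
print(answer_prior(12000))  # very obscure word: close to 0
```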

Suppose we start with my old openers, “other” and “nails,” and end up in a situation where four possible words match; for now, let’s consider them all equally likely.

This is another situation where entropy becomes a really useful measurement.

Let me ask you: what is this distribution’s entropy? The information associated with each possibility is the log base 2 of four, since each probability is one in four, and that’s two bits: two bits of information, four possibilities, all very tidy.

But what if I told you that there are actually more than four matches? When we examine the entire word list, 16 words match it.

However, suppose our model assigns those additional 12 words a really low probability of being the correct answer, say one in a thousand, because they are extremely obscure.

Now, let me ask you: what is the entropy of this distribution? If entropy just measured the number of matches, you might expect it to be the log base 2 of 16, which is four bits, two more than we already had.

However, the actual uncertainty is not really that different.

Learning the final answer wouldn’t be all that much more surprising than what we already knew, because those 12 extra words are such long shots.

As a result, when you actually perform the calculation here, multiplying the probability of each outcome by the corresponding information, you get 2.11 bits.
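We can verify that 2.11-bit figure directly, assuming for concreteness that the remaining 98.8% of probability is split evenly among the four plausible words:

```python
import math

def entropy(probs) -> float:
    """Entropy in bits: sum of p * log2(1/p) over the outcomes."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Four plausible matches sharing almost all the probability, plus twelve
# obscure matches at one in a thousand each (probabilities sum to 1).
probs = [0.988 / 4] * 4 + [0.001] * 12
print(round(entropy(probs), 2))  # 2.11
```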

Zooming out, this is part of what makes Wordle such a good example for an Information Theory lesson.

We have these two distinct-feeling applications of entropy: the first tells us the expected information we’ll get from a given guess, and the second tells us how much remaining uncertainty there is about the answer.

This pretty much illustrates it.

Here, we have two adjacent patterns that are about equally likely, but one of them has 32 possible words that match it.

If we check what those 32 words are, they are all just extremely unlikely words, and when you scan your eyes over them, it’s difficult to find any that seem like plausible answers, except perhaps yells.

However, if we look at the neighboring pattern in the distribution, considered about as likely, there are a lot more plausible words that match it. To show how we take all of that into account, let me pull up Wordle-Bot version 2.

There are two or three key differences from the first one that we saw.

First of all, as I just mentioned, the way that we compute these entropies and expected values of information now incorporate the probability that a given word would actually be the answer by using more refined distributions across patterns.

As it happens, “tares” is still the top word, though the rankings beneath it have shifted. More importantly, the uncertainty value on the left, measured in bits, is no longer just redundant with the number of possible matches: if we compute 2 to the 8.02, we get a little over 256, around 259, even though there are 526 total matching words.

Once we have a few guesses on the table (again ignoring its recommendations, because we can’t let the machines rule our lives), you could think of it this way: the bot is a little less uncertain than the raw match count suggests, because it knows that obscure entries like “borks,” “yortz,” “zoro,” and “zorus” are almost certainly not the answer.

As I continue playing the game, I’ll pick a few guesses relevant to what I want to explain here.

At this point, there are technically seven possibilities, but the only ones with a meaningful chance are “dorms” and “words.” If you look at its top picks, you can see it prioritizes guessing both of those over other choices that, strictly speaking, would provide more information on the fourth guess.

The first time I did this, I gauged the quality of each guess by simply adding up these two numbers, which worked better than you might think, but it didn’t feel very systematic.

There are other options available, but the approach I chose is to think about what happens after a proposed next guess, where what we really care about is the expected score of our game.

To determine that expected score, we ask: what’s the probability that “words” is the correct answer? At the moment, that’s 58 percent, in which case our score would be four.

However, with the remaining probability of 1 minus 0.58, our score in this game will be more than four.

We are unsure of how much more there will be, but we can estimate it based on how much uncertainty there is likely to be once we reach that point.

There are currently 1.44 bits of uncertainty, specifically.

Specifically, if we guess “words,” the expected information we’ll receive is 1.27 bits, so the difference, 1.44 minus 1.27, represents how much uncertainty we’re likely to be left with.
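That reasoning can be written out as a small sketch. Here `f`, the function linking leftover uncertainty to expected additional guesses, is an assumed helper, with a crude placeholder standing in for any fitted version; the numbers are the ones from this position.

```python
def expected_score(p_correct, guesses_so_far, uncertainty, expected_info, f):
    """Expected final score if we make this guess now.  With probability
    p_correct the guess is right and the game ends at guesses_so_far + 1;
    otherwise we expect f(leftover bits) further guesses beyond that."""
    leftover = uncertainty - expected_info
    if_right = guesses_so_far + 1
    if_wrong = guesses_so_far + 1 + f(leftover)
    return p_correct * if_right + (1 - p_correct) * if_wrong

# "words" as our fourth guess: 58% chance of being right, 1.44 bits of
# current uncertainty, 1.27 expected bits of information.  As a crude
# placeholder for the fitted function, assume each leftover bit costs
# about one extra guess.
print(round(expected_score(0.58, 3, 1.44, 1.27, lambda bits: bits), 2))  # 4.07
```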

Next, we need a function, which I’ll call f, that links this leftover uncertainty to an expected score. The way I built it was to plot a lot of data from prior games played by version one of the bot, asking: what was the actual number of guesses still needed from points with a given measurable amount of uncertainty? For instance, the data points sitting above a value of around 8.7 are saying that in some games where there were 8.7 bits of uncertainty at some point, it took three more guesses to find the answer, while in other games it took four.

If we move over to the left, all of the points above zero say that whenever there are no bits of uncertainty, there is only one possibility, and, comfortingly, only one more guess is ever needed. When there was a sliver of uncertainty, essentially just two options, it was occasionally necessary to make one additional guess, sometimes two, and so on.

A slightly simpler way to visualize this is to group the data and compute averages: one bar indicates the average number of further guesses needed from points with one bit of uncertainty, another the average from points with four bits of uncertainty, which is equivalent to being down to 16 possible outcomes, and so on.

From this point on, I simply performed a regression to fit a function to the data.

### Final performance

With this as version 2.0, how does it perform when we run the same set of simulations against all 2,315 possible Wordle answers? Overall, it performs significantly better than our first version, which is reassuring.

The average score is around 3.6, though, unlike the first version, there are a few instances where it loses and needs more than six guesses, presumably because it sometimes trades information away for a chance at guessing right.

In other words, unless we get more sophisticated than this word-frequency-based prior distribution, an average around 3.43 probably marks a ceiling on how good we could get, or at least on how good I could get. That best performance uses essentially the same ideas I’ve been discussing here, but it goes a little further by searching ahead two guesses for expected information rather than one.

Originally, I was going to talk more about that, but I realize that we have already gone on for quite a while.

However, one thing that stands out is that after doing this two-step search, and then running a few sample simulations on the top candidates, it appears, at least to me, that “crane” is the winner.

If you use the true word list to determine your space of possibilities, the uncertainty you start with is slightly more than 11 bits, and it turns out, from a brute-force search, that the best you can do after your first two guesses is around 10 bits of expected information.

I believe it’s fair, and probably conservative, to say that you could never write an algorithm that gets this average as low as three, because with the words available to you, there simply isn’t room to get enough information after only two guesses to guarantee the answer. Even with perfectly optimal play, you’ll be left with about one bit of uncertainty, which is the same as being down to two possible answers, so you can’t count on getting it on the third guess every time, without fail.
