data – Alex KG Ellis

January 18, 2022

Best Wordle Starting Words

I, like many people, have been enthralled this month by Wordle. Being data-minded, I have been yearning for a statistical analysis of what the best starting word is. I found something very close to what I was looking for in this piece by Bakhtiari, but unfortunately the author drew on a generic dictionary word list rather than one tailored to this game. Luckily a clever person scraped Wordle’s code, identified both the full list of answers and the full list of accepted words, and posted both on GitHub. So I went about replicating Bakhtiari’s work with this data. Here is what I found.

Letter distribution

Across the thousands of programmed answer words, the letters of the alphabet are distributed as follows: the share of words that contain each letter (“contain”), and the share that have it in each of the five letter places (relevant for guessing a letter in the right location):

letter	contain	1st	2nd	3rd	4th	5th
e	46%	3%	10%	8%	14%	18%
a	39%	6%	13%	13%	7%	3%
r	36%	5%	12%	7%	7%	9%
o	29%	2%	12%	11%	6%	3%
t	29%	6%	3%	5%	6%	11%
l	28%	4%	9%	5%	7%	7%
i	28%	1%	9%	11%	7%	0%
s	27%	16%	1%	3%	7%	2%
n	24%	2%	4%	6%	8%	6%
u	20%	1%	8%	7%	4%	0%
c	19%	9%	2%	2%	7%	1%
y	18%	0%	1%	1%	0%	16%
h	16%	3%	6%	0%	1%	6%
d	16%	5%	1%	3%	3%	5%
p	15%	6%	3%	3%	2%	2%
g	13%	5%	1%	3%	3%	2%
m	13%	5%	2%	3%	3%	2%
b	12%	7%	1%	2%	1%	0%
f	9%	6%	0%	1%	2%	1%
k	9%	1%	0%	1%	2%	5%
w	8%	4%	2%	1%	1%	1%
v	6%	2%	1%	2%	2%	0%
x	2%	0%	1%	1%	0%	0%
z	2%	0%	0%	0%	1%	0%
q	1%	1%	0%	0%	0%	0%
j	1%	1%	0%	0%	0%	0%
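
A minimal sketch of how a table like the one above could be computed. It assumes "contain" means the share of answer words that include the letter at least once, and it uses a tiny stand-in word list rather than the scraped answer list; the function and variable names are illustrative only.

```python
from collections import Counter

def letter_distribution(words):
    """Return (contain_share, position_share) for a list of five-letter words."""
    n = len(words)
    contain = Counter()
    by_position = [Counter() for _ in range(5)]
    for word in words:
        word = word.lower()
        contain.update(set(word))          # count each letter once per word
        for i, ch in enumerate(word):
            by_position[i][ch] += 1        # count the letter in its position
    contain_share = {ch: c / n for ch, c in contain.items()}
    position_share = [{ch: c / n for ch, c in pos.items()} for pos in by_position]
    return contain_share, position_share

# Stand-in data (assumption): the real input would be the full answer list.
sample_words = ["irate", "later", "raced", "adieu", "locus"]
contain_share, position_share = letter_distribution(sample_words)
```

Sorting `contain_share` by value, highest first, would give the row order used in the table.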

Best starting words

I put my finger on the scale here a little bit to focus on words that are common enough that it doesn’t feel like cheating to guess them, even if they’re acceptable. Your results may vary if you have different standards.

Another strategy I like to use is treating the second word like a starting word: ignoring what I learned from the first word and hoping to get sufficient information from the first 10 letters to stand a better chance of guessing the final word on the 3rd try.

There are several metrics that one could use to determine what the optimal starting word is, so here are a few options for you:

First word	Rank	Reason	Best 2nd word	Next 5 letters after
IRATE	1	Best score combining probabilities for yellow and green results	LOCUS	NYHDP
LATER	16	Next best score combining probabilities for yellow and green results, only considering common words for each of first 2 words. The differences in probability among the top words are so minor relative to the brain work of solving the puzzle that going with #1 is really not that important unless you feel compelled to.	ICONS	UYHDP
RACED	79	Highest chance of getting a green while only using the top 15 letters by overall frequency (CABER is marginally better but B isn’t that common overall)	TOILS	NUYHP
ADIEU	1295	My old starting word. If you’re going to maximize your vowels on the first word this is still a good one, but that’s not necessarily a good strategy for getting the most information you can.	SCORN	TLYHP
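
One way a combined yellow-and-green score could be approximated is sketched below. This is not necessarily the scoring behind the ranks above: it simply averages how many green and yellow tiles a guess would earn across a list of answers, using a simplified rule that ignores Wordle's exact handling of repeated letters. The names and the sample list are assumptions for illustration.

```python
def expected_hits(guess, answers):
    """Average (green, yellow) tile counts for `guess` over `answers`.

    Simplified: a repeated letter in the guess earns at most one yellow.
    """
    greens = yellows = 0
    for answer in answers:
        seen = set()
        for i, ch in enumerate(guess):
            if answer[i] == ch:
                greens += 1                      # right letter, right place
            elif ch in answer and ch not in seen:
                yellows += 1                     # right letter, wrong place
            seen.add(ch)
    n = len(answers)
    return greens / n, yellows / n

def rank_words(candidates, answers):
    """Sort candidate guesses by average green + yellow tiles, best first."""
    return sorted(candidates, key=lambda w: sum(expected_hits(w, answers)), reverse=True)

# Stand-in data (assumption): the real input would be the scraped lists.
sample_answers = ["irate", "later", "raced"]
avg_green, avg_yellow = expected_hits("irate", sample_answers)
ranked = rank_words(["irate", "adieu"], sample_answers)
```

Weighting greens more heavily than yellows, or scoring only greens as in the RACED row, would be small changes to the sort key.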

Here’s my work if you’re interested. Hope this post gives you what you’re looking for and have fun solving the puzzles!

November 21, 2014

How blue is Rhode Island, by town

Originally posted on Rhode Island Future. They have lots of great stuff, so head over and check it out!

In the sensationally titled “Revenge of the Swamp Yankee: Democratic Disaster in South County,” Will Collette argued emotionally that despite statewide wins for Democrats in Rhode Island two weeks ago, South County was a sad place for the party. He makes a strong case that local South County races, through low turnout and Republican money, had a night more like the rest of the country than the rest of Rhode Island.

Will focuses on General Assembly and Town Council races, but his post made me wonder how different towns around Rhode Island voted compared to the state averages. So I dug into the numbers for statewide races. Here’s what I came up with:

Democratic Lean by Town Population

Democratic Lean by Town Density

This is a little confusing; here’s what I did:

I looked up the percentage of each town’s votes that the Democratic and Republican candidates for each statewide office received.
I subtracted the GOP candidate’s percentage from the Democrat’s for each town, giving the percentage margin the Democrats won (or didn’t) by.
I then averaged together the margins for each statewide race, roughly giving each town’s Democratic lean.
I then subtracted the average statewide Democratic lean from each of those town leans, giving us an idea of how each town compares to Rhode Island as a whole.
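
The four steps above can be sketched in a few lines. The town names and percentages here are made up for illustration, and the statewide lean is assumed to be computed the same way from statewide vote shares; the real numbers live in the spreadsheet.

```python
from statistics import mean

def democratic_lean(results):
    """Average Democratic margin over races.

    `results` maps race name -> (dem_pct, gop_pct) for one town or the state.
    """
    return mean(dem - gop for dem, gop in results.values())

# Hypothetical vote shares (percent), not actual 2014 results.
towns = {
    "Town A": {"Governor": (45.0, 40.0), "Treasurer": (60.0, 38.0)},
    "Town B": {"Governor": (30.0, 55.0), "Treasurer": (45.0, 52.0)},
}
statewide = {"Governor": (40.0, 36.0), "Treasurer": (57.0, 41.0)}

state_lean = democratic_lean(statewide)
# Positive means the town voted more Democratic than the state as a whole.
relative_lean = {name: democratic_lean(r) - state_lean for name, r in towns.items()}
```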

Those are the numbers you see above. Here’s my spreadsheet. A few observations:

Hardly anyone lives in New Shoreham. But we already knew Block Island isn’t a population hub. (These population numbers are from Wikipedia and could be wrong.)
There’s a clear trend of denser and more populous cities voting more Democratic than less populous towns. I ran the correlations between Democratic lean and each measure: 0.55 for population and 0.82 for density. Both are reasonably strong.
Imagine the vaguely logarithmic trendline that would best fit these points. For the density graph the formula for that trendline would be y = 0.084*ln(x) - 0.6147. It’s in relation to that trendline that I’ve made the map at right. Gray towns are those that voted about how you’d expect based on their density; blue towns voted more Democratic than density would suggest, while red towns voted less Democratic.
Remember this is one point in time, November 4, 2014. It can’t tell us a lot about how things are changing or how all those people who didn’t turn out would vote if they did.
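
As a rough sketch, comparing a town to the density trendline amounts to subtracting the trendline's prediction from the town's actual relative lean. The coefficients come from the formula above; the `tolerance` that separates gray from blue or red is a hypothetical value, and the units of density and lean are assumed to match those used in the graph.

```python
import math

def expected_lean(density):
    """Relative Democratic lean predicted by the density trendline."""
    return 0.084 * math.log(density) - 0.6147

def residual(actual_lean, density):
    """How far a town sits above (+) or below (-) the trendline."""
    return actual_lean - expected_lean(density)

def map_color(actual_lean, density, tolerance=0.05):
    """Classify a town for the map; `tolerance` is an illustrative assumption."""
    r = residual(actual_lean, density)
    if r > tolerance:
        return "blue"   # more Democratic than density would suggest
    if r < -tolerance:
        return "red"    # less Democratic than density would suggest
    return "gray"       # about what density would predict
```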

So at the end of the day, what does this tell us? Municipalities with higher population and density tend to vote for Democrats more than towns with lower population and density. This isn’t just true in Rhode Island; it’s true across the country. But what is interesting here is how different areas of the state deviate from that implied trendline.

March 28, 2014

Graph: How Cold Has It Been This Winter?

Last week FiveThirtyEight posted an article looking at weather data nationally to assess the notion that this winter was particularly cold. I am quick to flag such claims as the availability heuristic at work, so I loved that FiveThirtyEight’s analysis used real measurements to settle whether the claim is true.

But even the shiny graphs and rigorous analysis of FiveThirtyEight left me unsatisfied on this issue. That’s because what I really cared about was my local weather. So I dug around and found some data of my own: average and 2013/14 high and low temperatures for my home of Amherst, Massachusetts.

Highs and Lows: Average vs Actual

First a note: I like using highs and lows rather than daily average temperatures because they feel more real to me. The temperature oscillated between these bounds on that day, but how long was it actually at that specific average temperature? That said, my results are going to resemble those I would have gotten had I used averages, so it’s not super important in this case.

The above graph is pretty messy, so it can’t really answer the question. So I made another one:

Moving Weekly Average of Deviation from Temperature Norm

This one can answer the question. Yes, it has been cold: especially since late January, but also several times earlier this winter.

To make this, first I subtracted the average highs and lows for each day from the actuals. For each day, these numbers showed me how many degrees warmer or colder it got than the norm. I then averaged the high and low deviations together to get a composite number (see, doesn’t that look like average temperature?) measuring the mean deviation from the normal high and low temperatures. Finally, I took the moving average for each day and the three days before and after it. I did this because our perceptions of what the weather is like are influenced not only by what’s going on in the moment, but also what’s happened recently and what’s in the forecast.
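
A minimal sketch of that calculation, assuming four equal-length daily series (actual and normal highs and lows, in degrees). The function names are illustrative, and shrinking the window at the start and end of the series is an assumption about edge handling, not something the description above specifies.

```python
def composite_deviation(actual_high, actual_low, normal_high, normal_low):
    """Per-day average of the high and low deviations from normal."""
    return [
        ((ah - nh) + (al - nl)) / 2
        for ah, al, nh, nl in zip(actual_high, actual_low, normal_high, normal_low)
    ]

def centered_moving_average(values, half_window=3):
    """Average each day with up to `half_window` days before and after it."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half_window)                # window shrinks at edges
        end = min(len(values), i + half_window + 1)
        window = values[start:end]
        smoothed.append(sum(window) / len(window))
    return smoothed

# Hypothetical readings for one day: 5 degrees below normal on both measures.
one_day = composite_deviation([30.0], [10.0], [35.0], [15.0])
weekly = centered_moving_average([1, 2, 3, 4, 5, 6, 7])
```

With `half_window=3`, every interior day is averaged over a full seven-day window centered on it.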

The result, as you can see, shows us bizarrely cyclical trends in temperature this winter. Every 2.5 to 3 weeks, we see this measure of temperature deviation cycle back to another peak or valley. I can’t think of any methodological error that would distort the results in this way (except maybe the moving average, but that shouldn’t regulate such long stretches of time), and have no reason to believe the source data are wrong, so my best guess is that it’s coincidental.

Presuming my methodology is sound, this is just the sort of graph I was looking for to explain what the temperature was like this winter. I hope you find it interesting as well.