Counting Letters
One of the things those involved in cryptanalysis do is count letters. Sometimes a lot of letters. But what do all those counts represent? Something the “experts” who write books often fail to do is explain exactly what their counts are and how they arrived at them. Even when they do tell you something, such as “compiled from a 30,000 word corpus”, that still says nothing about how the letters were actually counted. The various ways you can count letters can completely change the results.
Knowing how your statistics were generated and what is likely to happen with a given cipher could be the difference between an easy break and a difficult mess. Below is a small excerpt from several digram counts of the same text sample. The ciphertext samples were encrypted with a digraphic cipher. The high count in each set is for the plaintext pair TH or its ciphertext equivalent. All counts ignore word divisions.
Plaintext strict count with offset (X added before the first letter)
S 6 10 1 1 1 5 2
T 71 21 1 2 3 2 3
Plaintext strict count no offset
S 9 8 0 0 3 1 1
T 88 11 1 0 1 0 3
Plaintext moving window
S 15 18 1 1 4 6 3
T 159 32 2 2 4 2 6
Ciphertext moving window
F 8 14 17 96 4 9 6
Ciphertext strict count no offset
F 3 13 0 88 1 0 0
There are significant differences between the counts of the exact same plaintext/ciphertext. Some statistical difference between the counting methods is to be expected, and the methods do converge on a large enough text. If you add the offset and no-offset counts together you get the moving window count. However, the moving window count of the encrypted text masks its true frequencies, while the ciphertext strict count of the high pair exactly matches the plaintext strict count.
The moving window counter uses a “window” that is N letters wide, where N may be 2 or higher. This type of counter is bad for analyzing N-gram ciphers because it moves only one character at a time, so most of its windows straddle the N-gram boundaries. It averages or otherwise simulates the effects of a large data set and blurs the actual N-gram characteristics if they exist. The moving window counter also inflates the counts by a factor of roughly two for digrams (roughly N in general), since it produces about N times as many windows as there are N-letter groups. As shown above, the moving window count won't match between plaintext and ciphertext.
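The moving window counter described above can be sketched in a few lines of Python. This is a minimal illustration, not the counter used to produce the tables; the function name `moving_window_counts` and the sample sentence are assumptions made for the example.

```python
from collections import Counter

def moving_window_counts(text, n=2):
    """Count every n-letter window, advancing one letter at a time."""
    # Keep letters only, so word divisions are ignored.
    letters = [c for c in text.upper() if c.isalpha()]
    return Counter(
        "".join(letters[i:i + n]) for i in range(len(letters) - n + 1)
    )

# Hypothetical sample text: 19 letters give 18 overlapping digram windows.
counts = moving_window_counts("THE THEME OF THE THESIS")
print(counts["TH"], sum(counts.values()))  # 4 18
```

Note that a text of L letters yields L - N + 1 windows, which is where the inflation relative to a strict count comes from.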
The strict counter counts only non-overlapping groups of N letters. Because it doesn't cross N-gram boundaries, this is the counter you want when looking for N-gram characteristics. You will get exact N-gram counts. If used on a large enough file, the strict count data will converge proportionally on the moving window data. As shown above, the strict count will match known plaintext and ciphertext exactly.
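A strict counter can be sketched the same way. The names `strict_counts` and `offset` are assumptions for this example. One difference from the tables: here the offset count simply skips the first letter instead of adding an X in front of it. Padding with X would produce the same pairs plus one extra pair beginning with X. With that choice, the offset and no-offset strict counts add up exactly to the moving window count.

```python
from collections import Counter

def clean(text):
    """Uppercase the text and keep letters only (word divisions ignored)."""
    return [c for c in text.upper() if c.isalpha()]

def strict_counts(text, n=2, offset=0):
    """Count non-overlapping n-letter groups, starting at `offset`."""
    letters = clean(text)[offset:]
    usable = len(letters) - len(letters) % n  # drop an incomplete final group
    return Counter("".join(letters[i:i + n]) for i in range(0, usable, n))

def moving_window_counts(text, n=2):
    """Count every n-letter window, advancing one letter at a time."""
    letters = clean(text)
    return Counter(
        "".join(letters[i:i + n]) for i in range(len(letters) - n + 1)
    )

sample = "THE THEME OF THE THESIS"  # hypothetical sample text
aligned = strict_counts(sample, offset=0)
shifted = strict_counts(sample, offset=1)

# For digrams, the two strict alignments together cover every window once.
assert aligned + shifted == moving_window_counts(sample)
print(aligned["TH"], shifted["TH"])  # 2 2
```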
The next problem seems to be one of statistics. Using my 1 GB corpus, the moving window and strict counts give 2.605012% and 2.604406% respectively for the letter pair TH. With a large enough corpus, the difference between the two counts is virtually insignificant, and the two should compare proportionally.
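Percentages like these come from dividing each count by the total number of N-grams counted, which is what makes a moving window count and a strict count comparable despite the roughly twofold difference in raw totals. A minimal sketch, with the helper name `relative_frequencies` and the toy counts being assumptions for illustration:

```python
from collections import Counter

def relative_frequencies(counts):
    """Convert raw N-gram counts to percentages of the total counted."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: 100.0 * c / total for gram, c in counts.items()}

# Toy counts, not taken from any real corpus.
toy = Counter({"TH": 3, "HE": 1})
freqs = relative_frequencies(toy)
print(freqs["TH"], freqs["HE"])  # 75.0 25.0
```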
So what did all of this accomplish? I did the work so you don't have to. When counting a ciphertext known to be N-gram based, it is best to use a strict count. For reference purposes, the moving window count of a large corpus is the same as a strict count that is diffused by the text itself. Both are accurate when normalized. However, if you are comparing known plaintext and ciphertext, your best match will be a strict count because it eliminates the diffusion that a sliding window would introduce.
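The claim about matching plaintext and ciphertext can be checked with a toy digraphic cipher. The sketch below is not the cipher used for the tables above: it assumes a hypothetical substitution built from a seeded random permutation of all 676 digrams, and the names `table` and `encrypt` are made up for the example. Any cipher that replaces aligned letter pairs one-for-one should behave the same way under a strict count.

```python
import random
from collections import Counter
from string import ascii_uppercase

def strict_counts(s, n=2):
    """Count non-overlapping n-letter groups of an already-cleaned string."""
    usable = len(s) - len(s) % n
    return Counter(s[i:i + n] for i in range(0, usable, n))

def moving_window_counts(s, n=2):
    """Count every n-letter window, advancing one letter at a time."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

# Hypothetical toy cipher: a fixed random permutation of all 676 digrams.
digrams = [a + b for a in ascii_uppercase for b in ascii_uppercase]
shuffled = digrams[:]
random.Random(1).shuffle(shuffled)
table = dict(zip(digrams, shuffled))

def encrypt(plain):
    """Replace each aligned letter pair with its table entry (even length)."""
    return "".join(table[plain[i:i + 2]] for i in range(0, len(plain), 2))

plain = "".join(c for c in "THE THEME OF THE THESIS".upper()
                if c in ascii_uppercase)
if len(plain) % 2:
    plain += "X"  # pad to a whole number of pairs
cipher = encrypt(plain)

strict_plain = strict_counts(plain)
strict_cipher = strict_counts(cipher)

# Strict counts carry over exactly to the equivalent ciphertext pairs.
assert all(strict_cipher[table[g]] == c for g, c in strict_plain.items())

# The moving window count includes pairs that straddle the cipher's
# boundaries, so it is not guaranteed to carry over.
print(moving_window_counts(plain)["TH"],
      moving_window_counts(cipher)[table["TH"]])
```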









