
Counting Letters

One of the things those involved in cryptanalysis do is count letters. Sometimes a lot of letters. But what do all those counts represent? Something the “experts” who write books often fail to do is explain exactly what their counts are and how they arrived at them. Even when they do tell you something, such as “compiled from a 30,000 word corpus”, it may still hide how the original data was counted. The various ways you can count letters can completely change the results.

Knowing how your statistics were generated and what is likely to happen with a given cipher could be the difference between an easy break and a difficult mess. Below is a small part of several digram counts of the same text sample. The ciphertext samples were encrypted with a digraphic cipher. The high count in each chart is for the plaintext pair TH or its ciphertext equivalent. All counts ignored word divisions.

Plaintext strict count with offset (X added before the first letter)

S 6 10 1 1 1 5 2

T 71 21 1 2 3 2 3

Plaintext strict count no offset

S 9 8 0 0 3 1 1

T 88 11 1 0 1 0 3

Plaintext moving window

S 15 18 1 1 4 6 3

T 159 32 2 2 4 2 6

Ciphertext moving window

F 8 14 17 96 4 9 6

Ciphertext strict count no offset

F 3 13 0 88 1 0 0

There are significant differences in the counts of the exact same plaintext/ciphertext. Some statistical difference is to be expected, and the counts converge with a large enough text. If you add the offset and no offset counts together you get the moving window count. However, the moving window count of the encrypted text masks its true frequencies, while the ciphertext strict count exactly matches the plaintext strict count (88 for TH and for its ciphertext equivalent).
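The effect is easy to reproduce on a toy example. The sketch below is in Python, not the C of the actual counter program, and it makes several assumptions that are not in the post: a 2x2 Hill cipher with the key shown stands in for the unnamed digraphic cipher, the sample sentence is made up, and all function names are hypothetical. A strict count takes the pairs at positions 0-1, 2-3, 4-5 and so on, while a moving window count takes every adjacent pair.

```python
from collections import Counter

# Assumed key for a 2x2 Hill cipher (a stand-in digraphic cipher, not the
# one used for the charts above). Its determinant, 9, is coprime to 26,
# so every plaintext pair maps to exactly one ciphertext pair.
KEY = ((3, 3), (2, 5))


def letters_only(text):
    """Keep A-Z only, ignoring word divisions."""
    return "".join(c for c in text.upper() if "A" <= c <= "Z")


def encrypt_pair(pair):
    """Encrypt one two-letter group with the assumed Hill key."""
    a, b = (ord(ch) - ord("A") for ch in pair)
    (k00, k01), (k10, k11) = KEY
    return (chr((k00 * a + k01 * b) % 26 + ord("A"))
            + chr((k10 * a + k11 * b) % 26 + ord("A")))


def encrypt(letters):
    """Encrypt an even-length string of letters pair by pair."""
    return "".join(encrypt_pair(letters[i:i + 2])
                   for i in range(0, len(letters), 2))


def strict_counts(letters):
    """Count non-overlapping pairs: positions 0-1, 2-3, 4-5, ..."""
    return Counter(letters[i:i + 2] for i in range(0, len(letters) - 1, 2))


def moving_window_counts(letters):
    """Count every adjacent pair: positions 0-1, 1-2, 2-3, ..."""
    return Counter(letters[i:i + 2] for i in range(len(letters) - 1))


plain = letters_only("THE THEME OF THE THESIS IS THE METHOD")
if len(plain) % 2:
    plain += "X"  # pad so the last group is complete
cipher = encrypt(plain)

strict_plain = strict_counts(plain)
strict_cipher = strict_counts(cipher)

# Every plaintext pair's strict count reappears on its ciphertext pair.
strict_matches = all(strict_cipher[encrypt_pair(p)] == n
                     for p, n in strict_plain.items())

print("strict counts match:", strict_matches)
print("TH ->", encrypt_pair("TH"))
print("plaintext moving window: ", moving_window_counts(plain).most_common(3))
print("ciphertext moving window:", moving_window_counts(cipher).most_common(3))
```

With this key, TH encrypts to AV, and both pairs have a strict count of 3. The moving window lines are printed only for comparison: pairs that straddle a group boundary in the ciphertext have no plaintext counterpart, so those counts need not line up.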

The moving window counter uses a “window” that is N letters wide, where N may be 2 or higher, and it moves only one character at a time. That makes it a poor choice for analyzing an N-gram cipher: it averages or otherwise simulates the effects of a large data set and blurs the actual N-gram characteristics if they exist. It also inflates the counts, by a factor of approximately two for digrams, because every letter except the first and last is counted in two pairs. As shown above, the moving window count won't match between plaintext and ciphertext.
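A minimal sketch of such a counter, written in Python for illustration and not taken from the posted C source. The function names and the sample sentence are hypothetical.

```python
from collections import Counter


def letters_only(text):
    """Keep A-Z only, ignoring word divisions."""
    return "".join(c for c in text.upper() if "A" <= c <= "Z")


def moving_window_counts(text, n=2):
    """Slide an n-wide window one letter at a time and count each group."""
    s = letters_only(text)
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def strict_total(text, n=2):
    """Number of complete non-overlapping n-letter groups, for comparison."""
    return len(letters_only(text)) // n


sample = "THE THEME OF THE THESIS"
window = moving_window_counts(sample)

# TH count, total window positions, total strict groups
print(window["TH"], sum(window.values()), strict_total(sample))  # 4 18 9
```

The sample has 19 letters, so the window lands on 18 digrams where a strict count would see only 9 complete pairs.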

The strict counter counts only consecutive, non-overlapping groups of N letters. Because it doesn't cross N-gram boundaries, this is the counter you want when looking for N-gram characteristics. You will get exact N-gram counts. If used on a large enough file, the normalized strict count data will converge on the moving window data. As shown above, the strict counts of a known plaintext and its ciphertext will match exactly.
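A strict counter can be sketched the same way. This is an illustrative Python version, not the posted C code; the `pad` parameter is a hypothetical way to reproduce the "X added before the first letter" offset count. The last lines check the earlier observation that the offset and no offset counts add up to the moving window count, once the one artificial pair that starts with X is removed.

```python
from collections import Counter


def letters_only(text):
    """Keep A-Z only, ignoring word divisions."""
    return "".join(c for c in text.upper() if "A" <= c <= "Z")


def strict_counts(text, n=2, pad=""):
    """Count consecutive non-overlapping n-letter groups.

    `pad` is prepended before grouping, which shifts every group boundary;
    pad="X" mimics the offset count. An incomplete group at the end is
    dropped.
    """
    s = pad + letters_only(text)
    return Counter(s[i:i + n] for i in range(0, len(s) - n + 1, n))


sample = "THE THEME OF THE THESIS"
letters = letters_only(sample)

no_offset = strict_counts(sample)
with_offset = strict_counts(sample, pad="X")

# The offset count contains one artificial pair: "X" plus the first letter.
combined = (no_offset + with_offset) - Counter(["X" + letters[0]])

# Every adjacent pair, counted directly for comparison.
adjacent = Counter(letters[i:i + 2] for i in range(len(letters) - 1))

print(no_offset["TH"], with_offset["TH"], combined == adjacent)  # 2 2 True
```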

The next problem seems to be one of statistics. Using my 1 GB corpus, the moving window and strict counts give 2.605012% and 2.604406% respectively for the letter pair TH, a difference of about 0.0006 percentage points. With a large enough corpus, the difference between the two counts is virtually insignificant, and the two should compare proportionally.
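Comparing the two counters only makes sense after normalizing, since the raw moving window totals are about twice the strict totals. A small Python sketch of that step follows; the function name and the 19-letter sample are made up for illustration, and the corpus figures above were not produced with this code.

```python
from collections import Counter


def to_percent(counts):
    """Convert raw counts to percentages of all groups counted."""
    total = sum(counts.values())
    return {gram: 100.0 * n / total for gram, n in counts.items()}


letters = "THETHEMEOFTHETHESIS"
window = Counter(letters[i:i + 2] for i in range(len(letters) - 1))
strict = Counter(letters[i:i + 2] for i in range(0, len(letters) - 1, 2))

# Raw TH counts differ (4 against 2), but the shares are comparable.
print(window["TH"], strict["TH"])
print(round(to_percent(window)["TH"], 2), round(to_percent(strict)["TH"], 2))
```

In this tiny sample the two shares happen to come out identical (4 of 18 and 2 of 9). On real text they will usually differ somewhat and only draw together as the text grows.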

So what did all of this accomplish? I did the work so you don't have to. When counting ciphertext that is known to be N-gram based, it is best to use a strict count. For reference purposes, the moving window count of a large corpus is the same as a strict count that is diffused by the text itself. Both are accurate when normalized. However, if you are comparing known plaintext and ciphertext, your best match will be a strict count because it eliminates the diffusion that a sliding window introduces.


Letter Frequency

I had this real nice post all ready for this and Tumblr ate my charts. I knew Tumblr hates anything so organized, but even my text-based charts won’t work. So I have to do the planned post at a later time with Tumblr-proof charts and graphs. I was really hoping to avoid making JPG files out of them.

In any case, the frequency counter program counts only letters, and the executable is compiled from the C version. It has been tested on a corpus I have that is over 1 GB in size. It isn’t exactly fast, but nothing that reads 1 GB of data is usually very fast. On the other hand, it isn’t horribly slow either and it should only take a few minutes depending on your hardware. For small ciphertext samples and even up to 1 MB, it will be pretty fast even on older hardware.

Before you try to replicate my efforts on a 5 GB file or maybe a 10 GB file, I must warn you that you’ll need to make some changes and build your own. The posted source code expects 32-bit integers, and the displayed counts are formatted to 8 digits. A 5 GB file of American English text may actually be the limit, though honestly I never expected to have a 1 GB corpus.
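A rough way to estimate where those limits bite is sketched below in Python. It is not part of the posted source, and it rests on assumptions: the 32-bit integers are taken to be signed, the function name is hypothetical, and the TH share quoted in the earlier post is used only as an example of a frequent entry. Single letters have larger shares than any digram and would reach either limit sooner.

```python
INT32_MAX = 2**31 - 1   # largest signed 32-bit integer (signedness assumed)
FIELD_MAX = 10**8 - 1   # largest value that fits in an 8-digit field


def items_until_limit(share_percent, limit):
    """Roughly how many items can be counted before an entry holding
    `share_percent` of the total reaches `limit`."""
    return limit / (share_percent / 100.0)


th_share = 2.605012  # moving window share of TH from the earlier post

print(f"8-digit field: about {items_until_limit(th_share, FIELD_MAX):.2e} digrams")
print(f"signed 32-bit: about {items_until_limit(th_share, INT32_MAX):.2e} digrams")
```

The display width is the tighter of the two constraints: an 8-digit field fills up more than twenty times sooner than a signed 32-bit counter overflows.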

The source code and the executable are available from GitHub.
