A statistical attack on the running key cipher

Bauer, Craig

ABSTRACT: The frequencies of n-graphs, groups of n letters, for n as large as 6 are compiled and used to launch a statistical attack on a running key cipher. Programs written in C++ make the calculations possible. The results of the attack for various values of n are analyzed.

KEYWORDS: Running key cipher, William F. Friedman, n-graphs,

INTRODUCTION

The running key cipher uses the characters in a book as a long non-repeating key. For example, suppose we wish to encipher the message BEGIN THE ATTACK AT DAWN using A Tale of Two Cities[3] as our key. Ignoring spaces, the message is twenty characters long, so we only need to use the first twenty characters of A Tale of Two Cities and the key is given by IT WAS THE BEST OF TIMES I. In order to add the letters of the message and key together, we first convert them to numbers using A=O, B=1,…, Z=25. With these assignments, B+I becomes 1+8=9=J and E+T becomes 4+19=23=X. The next pair, G+W, sums to 28, so we perform the arithmetic mod 26 and get 2=C. Continuing as above, the complete ciphertext is JXCIF MOIBX KTQPT BPEOV. Variations of this cipher exist, but we shall be content to examine this simplest case.

A statistical attack is made possible by the fact that the key letters are not random. If they were, the cipher would be an unbreakable one-time pad. Since the key consists of normal text, the frequencies of individual letters and combinations of letters may be exploited to gain a partial decipherment. The manner in which this can be done is detailed in the following section.

THE ATTACKS

Suppose the letter A appears in the ciphertext. It could have arisen in 14 different ways:1

A+A, B+Z, C+Y, D+X, E+W, F+V, G+U, H+T, I+S, J+R, K+Q, L+P, M+O, N+N

where the first letter appears in the original message and the second letter is from the corresponding position in the key or vice-versa. As will be seen, our method of attack only attempts to find the correct pair of letters from which the ciphertext arises. It does not make any effort to distinguish between message and key characters. For this reason, we do not consider reversals such as Z+B to be separate cases. We are interested in pairs, not ordered pairs. However, since a reversal is another way the same pair can produce a given ciphertext letter, we must account for this when calculating the probability that those two letters will be paired in message and key.

Without consulting a table of frequencies, it is clear that the probability A resulted from A+A is higher than the probability that it arose from F+V, even keeping in mind that the latter probability needs to be doubled, as the two letters could also combine in the order V+F. It is not obvious whether or not the pairing A+A will be more likely than H+T and T+H combined. Let P(X) denote the probability of a randomly chosen letter from a plaintext message being an X. For example, the frequency of E is approximately 12.7% in modern English, so P(E)=.127. To see what combination is most likely to give rise to an A in the ciphertext, we examine the product of the probabilities for the 14 pairs of letters that can combine to give A, doubling where necessary. The pair giving the largest value is the one most likely to be correct. The 14 values that must be calculated are

P(A)P(A), 2P(B)P(Z), 2P(C)P(Y), 2P(D)P(X), 2P(E)P(W), 2P(F)P(V), 2P(G)P(U), 2P(HP(T), 2P(IP(S), 2P(JP(R), 2P(K)P(Q), 2P(L)P(P), 2P(M)P(O), P(N)P(N)

Figure 1, which is based on the table of frequencies given by Beutelspacher [2], provides values for all of the above pairings, as well as their relative rankings. We find that the pair T+H is the one most likely to form the A in the ciphertext.

Similar calculations may be performed to find the pair of letters most likely to give rise to any of the 26 possible distinct letters that may appear in a given ciphertext. The values for the probabilities vary depending on the source from which they are compiled. As our results will show, using a frequency table other than the one provided by Beutelspacher can change the results (which pairs are most likely to give rise to given ciphertext letters).

This is the sort of attack that was first proposed by William Friedman [5]. He listed the most probable combinations yielding each letter of ciphertext and then chose from these possibilities so that words were formed from them when taken over several consecutive letters of ciphertext.

Our attack follows Friedman’s lead, but takes it a few steps further. We have written programs in C++ to eliminate the tedium of determining the most likely message and key values for corresponding positions. The programs also allow us to use a text file for compiling the probabilities of individual letters at runtime. In addition to this, further programs analyze larger groupings of letters from the ciphertext to determine the most probable corresponding blocks of the message and key. These more advanced and computationally intensive attacks will be examined momentarily. We first examine the monographic analysis in greater detail.

To form a sample ciphertext, we used Poe’s The Cask of Amontillado from [71 as the key and enciphered the first 1,000 letters of Dracula [9]. Punctuation, spacing, and case were ignored.

Sticking to the theme, the program compiled single letter frequencies from the complete text of Frankenstein [8].2 After the frequencies were compiled, the pair most likely to yield a given ciphertext letter (based on the product of the probabilities, as detailed above) was assumed to be correct. The calculation was simple. Each of the probabilities was calculated in a loop and the pair of letters yielding the highest value was saved. Comparison with the true solutions showed that we obtained 28.9% accuracy. That is, the pair of letters (message and key) most likely to yield the given cipherletter were the ones actually used 28.9% of the time for our sample cipher message. This is significantly higher than random guessing would produce, but not high enough to make the message readable. We would like to obtain 90% accuracy. Such a decipherment would then appear like a paper with a large number of typos! Of course, we would first have the extra step of determining which half of each pair yielding a given ciphertext letter was from the message and which half was from the key.

The program allowed the option of using default values for single letter frequencies, instead of compiling them at runtime. That is, if the user did not wish to have the frequencies based on a particular file of his or her choice, Beutelspacher’s values were present in the program and could be used instead. The values gave 31% accuracy for our example.

The single letter frequencies used above are also known as monograph frequencies. In a similar manner, an n-graph is defined to be an ordered set of n characters and for any particular value of n, these frequencies may be tabulated. For n = 1 through 6, we call these monographs, digraphs, trigraphs, tetragraphs, pentagraphs, and hexagraphs, respectively. The most frequent digraph in English is TH. It seems reasonable that when MO appears in a ciphertext, it was most likely the result of the letters TH in the message combining with the letters TH in the key, written TH+TH=MO. To confirm this, and determine which pairs of digraphs are most likely to give rise to each of the digraphs appearing in the ciphertext, we wrote another program in C++. The user was given a choice concerning the source of the digraph frequencies. Default values could be used (detailed shortly) or the frequencies could be compiled at runtime, using a text file chosen by the user.

A sample result indicates that if TU appears in the ciphertext, we want to replace it by TH+AN, since these digraphs maximize P(X1X2) P(YIY2) subject to the constraint X^sub 1^X^sub 2^ + Y^sub 1^Y^sub 2^ = TU. Again we must be careful to double the probabilities whenever the two digraphs differ. These second-order results differ from those suggested by the monographic analysis, as U, considered by itself, is not most likely to arise from the combining of H and N. We used Frankenstein again to compile our frequencies and the percentage of correct characters increases to 30.4%. It should be noted that this second program requires far more calculations than the first. There are only 26 distinct monographs, but 26^sup 2^ = 676 digraphs. For each of these digraphs, the program had to calculate the most likely message and key pairing that could give rise to it.

The program was written so that a ciphertext made up of the letters C^sub 1^C^sub 2^C^sub 3^C^sub 4^ is considered to consist of the digraphs C^sub 1^C^sub 2^ and C^sub 3^C^sub 4^. The digraph C^sub 2^C^sub 3^ is not examined.

As with the monographic analysis, the digraph program included an option to use default values for the probabilities. In this case, the values were as stated by Kullback in [6] and resulted in 27.8% correct. Kullback’s digraph frequencies were based on 50,000 letters from government telegrams and were not expected to be as accurate for our sample ciphertext as frequencies compiled from the text of a novel. The language typical of telegrams is much more stilted than the text of novels. Words like THE are used less frequently in telegrams and since our sample cipher message arises from combining the texts of two works of fiction, telegram frequencies do not provide a good model.

We continued on with trigraphs. The most common trigraph in English is THE. When THE appears in the same position in the message and the key, we get THE+THE=MOI. This provides us with a way of recognizing this particular method of encipherment, if it is not already known. The relative frequency of the trigraph MOI will be higher than expected for another method of encipherment. It should be noted that the suggested values for the message and key will continue to vary as we increase the value of n while looking at n-graphs. For example, as seen above, the 0 in MOI is most likely to have arisen from H+H, but the letter O may appear in another context (not between an M and an I), in which case another pair for the message and key would be more likely.

Let f(n) denote the percent of the message deciphered correctly, as a function of n, the size of the groups of letters considered. For example, n=3 for trigraphic analysis. We would expect f(n) to be an increasing function that approaches 100 asymptotically. Imagine for a moment that computing power is unlimited. If frequency tables could be compiled for 100 letter groups, based on all of English literature, decipherment would be a cinch. As almost all 100 character strings are meaningless, the probability that more than one pair of meaningful strings could be found that combine to yield a given ciphertext is exceedingly low. Hence we would have a high degree of confidence in our proposed decipherment. However, we must ask if all meaningful 100 character combinations have even been used. We are, in fact, asking for a frequency table that cannot be compiled for lack of data.

Of course, another problem is that computing power is not unlimited. Our programs were able to grind through trigraphs and tetragraphs without any problems. When we got to pentagraphs, the program was still able to complete a run in a matter of a few minutes, but the hexagraph test required a little over 90 minutes to run on a one gigahertz processor and used over 300 megabytes of RAM. We note that there are 26^sup 6^ = 308,915,776 hexagraphs, although most of them never occur in English. The complete source code for these programs in C++ is available from the authors upon request.

THE RESULTS

The results of the trigraph test gave 32.7% correct, so we improved again, as expected. The frequencies were derived from the text of Frankenstein for this test and default values were no longer present. Many books and websites list the trigraphs with the greatest frequencies, but full lists of all trigraphs with nonzero frequencies are harder to find. The process of compiling the list at runtime from the text file of a novel or a collection of short stories is not a problem, as this part of the program runs very rapidly and the runtime does not increase greatly for larger n-graphs.

For the tetragraph test, the percent correct dropped to 29.8% and neither the pentagraph test nor the hexagraph test were as accurate as the trigraphic analysis. This is explained by the observation that monographic frequencies do not vary as greatly from author to author as do higher order statistics. Bennett [1] has pointed out that even digraph frequencies can sometimes be used to distinguish authors. We cannot expect the hexagraph frequencies from Frankenstein to correspond closely to those used by Stoker or Poe.

The identity of the sender may be known for an intercepted ciphertext, so it is not unreasonable to compile our frequencies based on writings produced by that person (other messages sent in plaintext, for example). To simulate this we compiled frequencies using Stoker’s The Lair of the White Worm [10].

We also tried other texts for the sake of comparison. Cricket on the Hearth [4) by Charles Dickens was among the texts chosen. War and Peace [111 by Leo Tolstoy was used in an English translation. The Works of Edgar Allan Poe in Five Volumes [71 were used after cutting out The Cask of Amontillado (recall that we used this as our key). Last of all, we tried Dracula itself and Poe and Dracula, a file of our own creation, which consisted of an unexpurgated version of The Works of Edgar Allan Poe in Five Volumes and Dracula combined. The results of these programs are summarized in Figure 2. The first column gives the text used for compiling frequencies and the other columns provide the percent correctly deciphered as a function of the size of the n-graphs used. For example, n = I corresponds to monographic analysis and n = 2 corresponds to digraphic analysis.

CONCLUSIONS

For larger character combinations (greater values of n), the frequencies must be based on greater samplings of text. There are too many six-letter strings that do not occur in Frankenstein, although they may be common in the key or message we are trying to uncover. Even when using the entire work from which the sample ciphertext was produced, we do not vary greatly from getting a mere 1/3 of the letters correct. Apparently the frequencies of groups of four, five, and six letters do not remain constant enough, even within a work by a single author, to make pure frequency analysis very successful. A new approach seems warranted. Instead of assuming the cipher letters arise from the most probable pairing of letters that can give rise to it, we could, for each n-graph of ciphertext, store several of the more likely possibilities. However, this approach would greatly increase the amount of memory needed and we would still need some way of determining which of the possibilities to finally select.

We are presently investigating another alternative, namely a dictionary attack. This approach forms all possible combinations yielding a given ciphertext block, subject to the constraint that both the message and the key are made up of valid text, hence, the dictionary. An advantage of this attack is that all possibilities will arise, not just ones represented in a sample text from which frequencies are compiled.

1If the numerical value of the letter is even, it can arise in 14 distinct ways. If the value is odd, there are only 13 distinct possibilities.

2 A great deal of literature is now available in text file form on the Internet. See [12], [13], and [14], for example.

REFERENCES

1. Bennett, William. 1976. Scientific and Engineering Problem-solving with the Computer. Englewood Cliffs NJ: Prentice-Hall, Inc.

2. Beutelspacher, Albrecht. 1994. Cryptology. Washington DC: MAA.

3. Dickens, Charles. 1992. A Tale of Two Cities. Philadelphia PA: Courage Books.

4. Dickens, Charles. 1946. Christmas Stories: A Christmas Carol, The Chimes, The Cricket on the Hearth. Cleveland OH: World Pub. Co.

5. Friedman, William, 1918. Riverbank Publications No. 16. Reprinted (1979) in The Riverbank Publications Volume 1. Laguna Hills CA: Aegean Park Press. 6. Kullback, Solomon. 1976. Statistical Methods in Cryptanalysis. Laguna Hills CA: Aegean Park Press.

7. Poe, Edgar Allan. 1903. The Works of Edgar Allan Poe in Five Volumes. New York NY: Collier.

8. Shelley, Mary. 1990. Frankenstein. Philadelphia PA: Courage Books. 9. Stoker, Bram. 1971. Dracula. New York NY: Scholastic Book Services. 10. Stoker, Bram. 1911. The Lair of the White Worm. London UK: W. Foulsham.

11. Tolstoy, Leo. 1963. War and Peace. New York NY: Scholastic Book Services.

12. http://www.literature.org/. (Source for text files of [8], [9], and [10].) 13.http://onlinebooks.library.upenn.edu/subjects.html. (Source for text file of [71.)

14. http: //promo. net/pg/. (Home Page for Project Gutenberg. Source for text files of [41 and [11].)

Craig Bauer1 and Christian N. S. Tate2

ADDRESS: (1) Physical Sciences Department, York College of Pennsylvania, York PA 17405-7199 USA. cbauer@ycp.edu and (2) P. O. Box 1301 Kinder LA 70648 USA. cnstate@yahoo.com

BIOGRAPHICAL SKETCHES

Craig Bauer is an Assistant Professor of mathematics at York College of Pennsylvania. He was introduced to modern cryptology by Jay Anderson at Franklin & Marshall College. Craig pursued the subject further with Ernest Stitzinger and Robert E. Hartwig at North Carolina State University, where he received his PhD in mathematics under the direction of Mohan Putcha. He created and

taught a course in cryptology at Northwestern State University of Louisiana and hopes to be able to offer it at York soon.

Christian N. S. Tate is a student at Northwestern State University majoring in mathematics. He is also interested in computer science.

Copyright Cryptologia Oct 2002

Provided by ProQuest Information and Learning Company. All rights Reserved