## Breaking cryptograms – part 1

Let’s start with the definitions, but put them in a different, more practical way. This approach allows a more in-depth understanding of the cryptogram-breaking methodology. I wanted to write more about the code I posted some time ago but didn’t find the time, although that would be the perfect occasion to break it. Here are a few notes on the fly.

The first thing to know, and this is important in both cryptography and game theory, is how much “intelligence” we may expect from our opponent, so that we are one step ahead, not two. Could this opponent deliberately make the frequency distribution look good by using trash letters? Is this code expected to be mathematically sophisticated? Do we know how the author expresses his thoughts (any dialogues or conversations)? Do we know any other code of his? What can we expect the result (the encoded message) to be?

Let A be the set of symbols used in the cryptogram. It consists of letters, numbers and other unidentified symbols. It is characterized by its length (the number of symbols) and its geometry (the positioning of the symbols, both relative and global).

It has a meaning. It can be a group of sentences (like a letter), a number, or another cryptogram (e.g. SFK3240SFAKJ may be the solution and the number of a mailbox).

Breaking the code requires finding the elements we’re most certain are correct and deducing the rest from there. For example, we need to know that SFK3240SFAKJ would be the solution, or that the author of the code always creates “letters”, or that a certain code is used to encode the message. We know something for sure, we try it out and we proceed from there. In the case of more complicated codes, we may iteratively verify certain options (assumptions), which simplifies things from a practical point of view.

In the first part of part 1, we’re assuming the encoded message is a letter.

If the encoded message is a letter, we know that, on average, letters contain a certain share of vowels and of specific letters, and we know the frequency of certain words. Every deviation from such a rule is always significant knowledge for us. As I said, this applies when the result of the code (the encoded message) is a “letter” (a letter can also be another cryptogram; the point is that we need to know or assume that what we have is the actual result).

From the message below, we can see that the frequency of letters is quite similar to the normal (typical) distribution of letter frequencies, so we don’t really expect a substitution code here. We assume it’s not a substitution, and therefore expect some sort of code involving the positioning and geometry of the message. Let’s assume this is true. (see: http://en.wikipedia.org/wiki/Letter_frequency)
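A comparison like this can be made a bit less eyeball-based. The following is only a minimal sketch: the table `ENGLISH_PERCENT` holds rounded, approximate English letter frequencies of the kind listed on the Wikipedia page linked above, and the names `letter_frequencies` and `distance_from_english` are made up for this illustration. It also assumes the message is in English, which is itself a conjecture.

```python
from collections import Counter

# Approximate English letter frequencies in percent (rounded values, an
# assumption for this sketch; see the Wikipedia page on letter frequency).
ENGLISH_PERCENT = {
    "E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0, "N": 6.7,
    "S": 6.3, "H": 6.1, "R": 6.0, "D": 4.3, "L": 4.0, "C": 2.8,
    "U": 2.8, "M": 2.4, "W": 2.4, "F": 2.2, "G": 2.0, "Y": 2.0,
    "P": 1.9, "B": 1.5, "V": 1.0, "K": 0.8, "J": 0.15, "X": 0.15,
    "Q": 0.1, "Z": 0.07,
}


def letter_frequencies(text):
    """Return {letter: share} for A-Z, ignoring case, digits and other symbols."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        return {}
    return {letter: count / total for letter, count in counts.items()}


def distance_from_english(text):
    """Sum of absolute differences between observed and English letter shares.

    0 means identical distributions; values close to 2 mean almost no overlap.
    """
    observed = letter_frequencies(text)
    scale = sum(ENGLISH_PERCENT.values())  # normalise the rounded table to 1
    return sum(
        abs(observed.get(letter, 0.0) - percent / scale)
        for letter, percent in ENGLISH_PERCENT.items()
    )
```

A small distance is consistent with a code that only moves letters around, but with a short message the number is noisy, so it should be read as a hint rather than a proof.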

As we can see, certain phrases appear often in the code even though the distribution of the letters is similar to the global one (even given that we don’t have a good sample size).
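Spotting repeated phrases by hand gets tedious, so here is a rough sketch of how one might list them. The function name `repeated_ngrams` and its parameters are hypothetical, and the decision to drop whitespace while keeping case and digits is an assumption about how the cryptogram should be read.

```python
from collections import Counter


def repeated_ngrams(text, n, min_count=2):
    """Return [(ngram, count), ...] for every run of n symbols seen at least min_count times."""
    symbols = "".join(text.split())  # drop whitespace, keep case and digits
    counts = Counter(symbols[i:i + n] for i in range(len(symbols) - n + 1))
    return [(gram, count) for gram, count in counts.most_common() if count >= min_count]
```

Running it for n = 3 and n = 4 over the whole cryptogram would show whether groups of that length really repeat more often than chance suggests.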

Another point to think about is that 71, 74, 75 and 194 are the only numbers that appear in the code. Maybe we could find out their meaning. Moreover, they are written in a specific way, as you can easily see in the document. They might be, for instance, four different operations (or three: 71, 74, 75).

The letters appear next to ncBE, PRSE and TFR, which, as you can see, are close together on the keyboard; that stems from how frequently those letters occur. Given that we assume it’s not a substitution code (or, if it is, that it preserves the frequency distribution, which is an assumption), the combination of TFR, PRSE and ncBE with 71, 74, 75 may be worth further analysis.
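How close a group of letters sits on the keyboard can be measured rather than judged by eye. This is only a sketch under stated assumptions: a standard QWERTY layout, rows staggered by roughly a quarter and three quarters of a key (`ROW_STAGGER` is an approximation), and hypothetical helper names `key_distance` and `group_spread`.

```python
from itertools import combinations
from math import hypot

QWERTY_ROWS = ("QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM")
ROW_STAGGER = (0.0, 0.25, 0.75)  # assumed horizontal offset of each row, in key widths

# Map every letter key to an approximate (row, column) position.
KEY_POSITION = {
    key: (row, col + ROW_STAGGER[row])
    for row, keys in enumerate(QWERTY_ROWS)
    for col, key in enumerate(keys)
}


def key_distance(a, b):
    """Approximate distance between two letter keys, in key widths."""
    (row_a, col_a), (row_b, col_b) = KEY_POSITION[a.upper()], KEY_POSITION[b.upper()]
    return hypot(row_a - row_b, col_a - col_b)


def group_spread(group):
    """Largest pairwise key distance inside a group of letters (0.0 for fewer than two letters)."""
    letters = [c for c in group.upper() if c in KEY_POSITION]
    return max((key_distance(a, b) for a, b in combinations(letters, 2)), default=0.0)
```

Comparing `group_spread` for the groups from the cryptogram with the spread of random groups of the same length would show whether their closeness is remarkable at all.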

Do we think that in this code capital letters have a different meaning than their small equivalents? Capital letters are used very often and only specific letters are small (and we need to allow for potential misreading of the written text), so for the sake of analysis I’d assume that capital letters are used to make the text easier to read, and thus I would not expect very advanced maths behind it. Of course, we always need to check whether the n-th letters (letters mod n, or words mod n) make up the encoded message. We need to do a lot of checks, but these analyses are omitted on purpose.
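The n-th letter check is mechanical, so it is easy to automate. The sketch below is one possible way to do it; the function names and the `max_n` limit are hypothetical, and it simply lists every candidate so that a human can scan them for something readable.

```python
def nth_letter_candidates(text, max_n=5):
    """Return {(n, offset): every n-th symbol starting at offset}, whitespace removed."""
    symbols = "".join(text.split())
    return {
        (n, offset): symbols[offset::n]
        for n in range(2, max_n + 1)
        for offset in range(n)
    }


def nth_word_candidates(text, max_n=5):
    """Return {(n, offset): every n-th word starting at offset, joined by spaces}."""
    words = text.split()
    return {
        (n, offset): " ".join(words[offset::n])
        for n in range(2, max_n + 1)
        for offset in range(n)
    }
```

For example, `nth_letter_candidates("HXEXLXLXO", max_n=2)[(2, 0)]` picks the symbols at even positions and gives `"HELLO"`.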

Does this code use “trash” words or letters to distract us? Given the frequency distribution, I’d conjecture “no”, with the additional assumptions from the previous and the first (or second) paragraphs.

BTW.[**] Does there exist a letter-shifting code that preserves the distribution of letters (*) (where a letter is turned into another letter, not a group of letters!)? Convolution? I would have to think about it for a while.

Another thing is that many of the assumptions are just part of the ‘on the fly’ analysis and, for instance, we cannot say for sure that there’s no substitution code involved here, because we don’t have a good enough sample size and because of [*].

Based on the small sample size, we could conjecture for the sake of simplification that the encoded message is in English (judging by the distribution of the letters). The number of short words suggests that some sort of substitution is used here. Given this and the former arguments against substitution, we could conjecture that the substitutions come from the keyboard. This would potentially mean that we could deal with **, because QWERTY keyboards are constructed so that it’s easier to access certain types of letters that appear often in the language.

I don’t know when I’ll sit down with the code again, so I’m posting my current state here for those who are really looking forward to it.