Cracking the Vigenère Cipher: Predicting Keys with ML.NET

Cracking the Vigenère Cipher: Predicting Keys with ML.NET

·

14 min read

The other day, I watched Zodiac, a movie about the infamous serial killer who sent coded messages to news agencies. The initial message he sent in the film was named Z408 cipher due to its 408 character length. This code utilized a homophonic substitution cipher in which every letter of the original text was replaced with one or more symbols, increasing the difficulty of deciphering it through frequency analysis.

I don't plan on discussing the cipher specifics here, but it was deciphered in a week by an amateur couple through intuition and pattern recognition.

This reminded me of the Vigenère Cipher, a more difficult code that remained unbroken for centuries. I aimed to develop a data model to forecast the key and determine if we can resolve the issue. Needless to say I was going to use C# and ML.NET.

So, let's get started. Here is what I am planning to do:

  1. Describe what a Vigenère Cipher is and why it is hard to break.

  2. Develop an application that decodes the cipher as long as we know the key.

  3. Train an ML.NET model that will predict what the key is.

    1. Create a data model generator so we can generate our training model as big or as small as we wish.
  4. Put it all together in a console project.

So, let's begin.

The Vigenère Cipher: A Quick Overview

The Vigenère cipher encrypts text by using a series of different Caesar ciphers based on the letters of a particular keyword. A Caesar cipher is a simple method of encoding messages using a substitution method where letters in the alphabet are shifted by some fixed number of spaces. For example, a Caesar cipher with a right shift of 1, would encode A as B, B as C and so on. This method is named after the Roman Emperor Julius Caesar who used it in this private correspondence.

A Caesar cipher is very easy to design but also very easy to break. It has a relatively small number of combinations to check (26 letters in the alphabet). Comparatively, a Vigenère Cipher with a key of 5 letters has over 11 million combinations (the longer the key the higher the number of combinations) and the Enigma code has 158,962,555,217,826,360,000 possible combinations.

The Vigenère cipher is based on the Caesar Cipher bit it takes it a step further, making it moderately difficult to decipher for any unintended party. Created by a 16th century Italian cryptographer Giovan Battista Bellaso, this cipher cannot be broken with brute-force or the word pattern attack as there are too many combinations. It is thought to have remained unbroken until Charles Babbage broke it in the 19th century.

Another fun fact about the Vigenère cipher is that it is computationally infeasible to break if the key has length of 100 - even if thousands of characters of plaintext are encrypted. This cipher is perfectly secret if the length of the key is at least equal to the length of the message.

This cipher is based on the Vigenère Square below and it is used together with a keyword to encrypt a message.

The letters in the top row, represent the letters in the message. In order to encode the message, locate the column associated with the letter you're encoding, then identify where it meets the row containing the keyword letter corresponding to the letter in the message. The intersecting letter is the encoded letter for the message.

Let's clarify this by an example:

Suppose we want to encrypt this message: Thank God It Is Friday and we want to use the keyword Apple.

We begin by writing the keyword as many times as necessary above the message and then write down the letter where the message letter meets the key letter.

KeywordA P P L E A P P L E A P P L E A P P
MessageT H A N K G O D I T I S F R I D A Y
EncryptedT W P Y O G D S T X I H U C M D P N

In math terms, you convert each letter to its corresponding position in the alphabet (A=0, B=1...Z=25). Our message becomes:

T=19, H=7, A=0, N=13, K=10, G=6, O=14, D=3, I=8, T=19, I=8, S=18, F=5, R=17, I=8, D=3, A=0, Y=24 A=0, P=15, P=15, L=11, E=4, A=0, P=15, P=15, L=11, E=4, A=0, P=15, P=15, L=11, E=4, A=0, P=15

Then you add the positions of the message and keyword letters, apply mod 26 to find the new position of the encrypted letter:

(19+0) % 26 = 19 (T) (7+15) % 26 = 22 (W) (0+15) % 26 = 15 (P) (13+11) % 26 = 24 (Y) (10+4) % 26 = 14 (O) (6+0) % 26 = 6 (G) (14+15) % 26 = 3 (D) (3+15) % 26 = 18 (S) (8+11) % 26 = 19 (T) (19+4) % 26 = 23 (X) (8+0) % 26 = 8 (I) (18+15) % 26 = 7 (H) (5+15) % 26 = 20 (U) (17+11) % 26 = 2 (C) (8+4) % 26 = 12 (M) (3+0) % 26 = 3 (D) (0+15) % 26 = 15 (P) (24+15) % 26 = 13 (N)

The new letters in parentheses are your encrypted letters.

As you can see it is pretty straight forward and it is relatively easy to decode if you know the key. But what if we do not know the key? Can machine learning help us in predicting the key based on the encrypted text? Can we train a model to accomplish this?

Let's find out.

The Vigenère Cipher Implementation

Let's start by implementing the encryption and decryption functions for the Vigenère cipher in C#.

public class VigenereCipher
{
    // Encrypts text using the Vigenere cipher
    public static string Encrypt(string text, string key)
    {
        string result = string.Empty;
        key = key.ToUpper();
        for (int i = 0, j = 0; i < text.Length; i++)
        {
            char c = text[i];

            if (char.IsLetter(c))
            {
                c = char.ToUpper(c);
                result += (char)((c + key[j % key.Length] - 2 * 'A') % 26 + 'A');
                j++;
            }
            else
            {
                result += c; // Non-letter characters remain unchanged
            }
        }
        return result;
    }

    // Decrypts text using the Vigenere cipher
    public static string Decrypt(string text, string key)
    {
        string result = string.Empty;
        key = key.ToUpper();
        for (int i = 0, j = 0; i < text.Length; i++)
        {
            char c = text[i];

            if (char.IsLetter(c))
            {
                c = char.ToUpper(c);
                result += (char)((c - key[j % key.Length] + 26) % 26 + 'A');
                j++;
            }
            else
            {
                result += c;
            }
        }
        return result;
    }
}

We can test this code with a simple message and key. Let's create a new console app and put the following code in the Program.cs file:

using VigenereCipherApp;

var text = "THANK GOD IT IS FRIDAY";
var key = "APPLE";

var encrypted = VigenereCipher.Encrypt(text, key);
Console.WriteLine($"Encrypted: {encrypted}"); 

var decrypted = VigenereCipher.Decrypt(encrypted, key);
Console.WriteLine($"Decrypted: {decrypted}");

When you run this code, the result is:

We now have a Vigenère cipher application.

Generating Training Data with a Key Generator

To train a machine learning model, we need data - a lot of data. We need pairs of encrypted text and their keys for our model to learn from. Instead of using a static dataset which will be limited, I thought it would be a better idea to create a Key Generator in our app so we can be as flexible as possible and generate dynamic and randomized data for training our model.

Here is the code for the Training Data Generator:

public class TrainingDataGenerator
{
    private static readonly Random Random = new();

    // Generate a random key with a given length
    public static string GenerateRandomKey(int length)
    {
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        var key = new char[length];

        for (var i = 0; i < length; i++) 
            key[i] = chars[Random.Next(chars.Length)];

        return new string(key);
    }

    // Generate a random plaintext of a given length
    public static string GenerateRandomPlainText(int length)
    {
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ";
        var text = new char[length];

        for (var i = 0; i < length; i++) 
            text[i] = chars[Random.Next(chars.Length)];

        return new string(text);
    }

    // Generate training data with specified number of examples, key length, and text length
    public static List<CipherData> GenerateTrainingData(
        int numberOfExamples, 
        int minKeyLength, 
        int maxKeyLength,
        int textLength)
    {
        var trainingData = new List<CipherData>();

        for (var i = 0; i < numberOfExamples; i++)
        {
            // Randomize key length for diversity
            var keyLength = Random.Next(minKeyLength, maxKeyLength + 1);
            var key = GenerateRandomKey(keyLength);
            var plainText = GenerateRandomPlainText(textLength);

            // Encrypt the plaintext using the Vigenere cipher
            var encryptedText = VigenereCipher.Encrypt(plainText, key);

            trainingData.Add(
                new CipherData { EncryptedText = encryptedText, Key = key });
        }

        return trainingData;
    }
}

This generator creates encrypted text using random keys of varying lengths, which we will be feeding into out machine learning model.

Training the Model with ML.NET

Now that we have our data, we need to build our machine learning model. Our goal here is to predict the encryption key based on the encrypted text.

Before we code, we need to add ML.NET to our program by running the following command in the NuGet Package Manager Console:

dotnet add package Microsoft.ML

Once this is done, we can then go ahead and build our training code:

using Microsoft.ML;

namespace VigenereCipherApp;

public class VigenereCipherML
{
    public static void TrainModel(MLContext mlContext, IDataView trainData)
    {
        // Define the pipeline
        var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", nameof(CipherData.EncryptedText))
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", nameof(CipherData.Key))) // Map Key to Label
            .Append(mlContext.Transforms.Concatenate("Features", "Features"))
            .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"))
            .Append(mlContext.Transforms.CopyColumns("PredictedKey", "PredictedLabel"));

        // Train the model
        var model = pipeline.Fit(trainData);

        // Save the model for future use
        mlContext.Model.Save(model, trainData.Schema, "vigenere_model.zip");

        Console.WriteLine("Model training complete.");
    }

    public static void PredictKey(MLContext mlContext, string encryptedText)
    {
        // Load the trained model
        var trainedModel = mlContext.Model.Load("vigenere_model.zip", out _);

        // Create a prediction engine
        var predictionEngine = mlContext.Model.CreatePredictionEngine<CipherData, KeyPrediction>(trainedModel);

        // Make a prediction
        var input = new CipherData { EncryptedText = encryptedText };
        var prediction = predictionEngine.Predict(input);

        return prediction.PredictedKey ?? "Could not predict Key.";
    }
}

Putting It All Together

At this point, we have everything we need: Our encryption functions, training data generator, model training in place - so, now we are ready to run and predict the keys.

using Microsoft.ML;
using VigenereCipherApp;

var mlContext = new MLContext();

// Check if the model already exists
var modelExists = System.IO.File.Exists("vigenere_model.zip");

// Generate training and test data
var (trainDataList, testDataList) = TrainingDataGenerator.GenerateTrainingData(
    numberOfExamples: 500,
    minKeyLength: 3,
    maxKeyLength: 15,
    textLength: 100 
);

var trainData = mlContext.Data.LoadFromEnumerable(trainDataList);

if (!modelExists)
{
    // If model doesn't exist, train the model
    Console.WriteLine("Model not found. Training the model...");

    // Train the model
    VigenereCipherML.TrainModel(mlContext, trainData);
}
else
{
    Console.WriteLine("Trained model found. Skipping training...");
}

// Predict the key for a new encrypted text
var encryptedText = "TWPYOGDSTXIHUCMDPN";
var predictedKey = VigenereCipherML.PredictKey(mlContext, encryptedText);

// Display the predicted key
Console.WriteLine($"Predicted Key: {predictedKey}");

// Decrypt the encrypted text using the predicted key
var decryptedText = VigenereCipher.Decrypt(encryptedText, predictedKey);
Console.WriteLine($"Decrypted Text: {decryptedText}");

Except that it does not work. What we get is gibberish.

We can shorten the texts during training to limit complexity. We can experiment with faster algorithms, such as NaiveBayes. Other options are, SdcaMaximumEntropy, or FastTree.

What you want to use depends on what you want to achieve:

  • NaiveBayes is likely the fastest algorithm you can use, especially if you're working with larger datasets.

  • SdcaMaximumEntropy offers a balance of speed and accuracy, making it a solid choice if you're looking for better generalization.

  • FastTree might also be worth experimenting with if memory usage isn't a constraint.

I use SdcaMaximumEntropy for the balance it offers.

We can increase the number of examples however, training may be extremely slow depending on your machine. To offset this, we can optimize the pipeline by enabling caching of the preprocessed data so it does not need to be recalculated on every iteration during training. Another option is to use parallel processing if you have a multi-core machine. If you do not have GPU capabilities, you can also use multiple CPU cores by configuring the execution options.

If all fails, we may offload the training to Azure AI Services. I think we should cover that in a separate article.

Those recommendations are for the speed of training. How about the accuracy of our predictions?

Let's see how our model performs by developing an evaluation method. First we need to generate separate training and test data (and change the name of the method GenerateTrainingData to GenerateTrainingAndTestData) . Then we create an evaluation method and run our app to see the results.

Here is the evaluation method:

 public static void EvaluateModel(MLContext mlContext, IDataView testData)
    {
        var trainedModel = mlContext.Model.Load("vigenere_model.zip", out _);
        var predictions = trainedModel.Transform(testData);

        var metrics = mlContext.MulticlassClassification.Evaluate(predictions, "Label");

        Console.WriteLine($"Log-loss: {metrics.LogLoss}");
        Console.WriteLine($"Macro accuracy: {metrics.MacroAccuracy}");
        Console.WriteLine($"Micro accuracy: {metrics.MicroAccuracy}");
    }

We add the following code to our main right after we generate training and test data.

// Convert lists to IDataView for ML.NET
var trainData = mlContext.Data.LoadFromEnumerable(trainDataList);
var testData = mlContext.Data.LoadFromEnumerable(testDataList);

if (!modelExists)
{
    // If model doesn't exist, train the model
    Console.WriteLine("Model not found. Training the model...");

    // Train the model
    VigenereCipherML.TrainModel(mlContext, trainData);
}
else
{
    Console.WriteLine("Trained model found. Skipping training...");
}

// Evaluate the model with test data
VigenereCipherML.EvaluateModel(mlContext, testData);

// Predict the key for a new encrypted text
var encryptedText = "TWPYOGDSTXIHUCMDPN";
var predictedKey = VigenereCipherML.PredictKey(mlContext, encryptedText);

// Display the predicted key
Console.WriteLine($"Predicted Key: {predictedKey}");

// Decrypt the encrypted text using the predicted key
var decryptedText = VigenereCipher.Decrypt(encryptedText, predictedKey);
Console.WriteLine($"Decrypted Text: {decryptedText}");

When I run the app now I get this result:

Log-Loss: 34.538 | Macro Accuracy: 0 | Micro Accuracy: 0

Log-loss, also known as cross-entropy loss, measures the performance of a model. The lower the number, the better the performance. 34.538 is pretty high which means the model is not making accurate predictions.

Macro and Micro accuracy measures the overall accuracy of the model. A value of 0 for both means that our model is not predicting anything correctly.

It looks like our model is struggling to learn.

Let's try something else:

  1. Increase Key Variety and Length Distribution: Make sure the generated keys have a wide variety of lengths and patterns (in GenerateTrainingAndTestData method).

  2. Control Text Complexity: The keys need variety but so do the encrypted messages (in GenerateRandomPlainText method).

  3. Ensure Class Balance: It is possible that certain keys are overrepresented in the dataset, which causes the model to become biased toward those keys. We need to make sure that each key length or pattern is equally represented and the number of short, medium and long keys are equal (in GenerateTrainingAndTestData method).

  4. Use Stratified Sampling for Test Data: We need to make sure that both train and test dataset contain a similar distribution of short, medium and long keys (in GenerateTrainingAndTestData method).

  5. Add Some Noise: We are going to add some noise to the training data hoping to improve the model's ability to generalize. For that we are going to create a new method:

     public static string AddNoiseToText(string text, double noiseLevel)
         {
             const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ ";
             char[] noisyText = text.ToCharArray();
    
             for (int i = 0; i < text.Length; i++)
             {
                 if (random.NextDouble() < noiseLevel)
                 {
                     noisyText[i] = chars[random.Next(chars.Length)];
                 }
             }
    
             return new string(noisyText);
         }
    

We then integrate this method into data generation process. Our final data generation code looks like this:

public static (List<CipherData> trainData, List<CipherData> testData) GenerateTrainingAndTestData(
        int numberOfExamples,
        int testDataPercentage,
        int minKeyLength,
        int maxKeyLength,
        int textLength,
        double noiseLevel)
    {
        var allData = new List<CipherData>();

        for (var i = 0; i < numberOfExamples; i++)
        {
            // Generate a key
            var keyLength = random.Next(minKeyLength, maxKeyLength + 1);
            var key = GenerateRandomKey(keyLength);

            // Generate plaintext
            var plainText = GenerateRandomPlainText(textLength);

            // Optionally add noise to plaintext or key
            var noisyPlainText = AddNoiseToText(plainText, noiseLevel);

            // Encrypt the (noisy or original) plaintext with the key
            var encryptedText = VigenereCipher.Encrypt(noisyPlainText, key);

            allData.Add(new CipherData { EncryptedText = encryptedText, Key = key });
        }

        // Split data into training and test data sets
        var testDataSize = (int)(numberOfExamples * (testDataPercentage / 100.0));
        var trainData = allData.Take(numberOfExamples - testDataSize).ToList();
        var testData = allData.Skip(numberOfExamples - testDataSize).ToList();

        return (trainData, testData);
    }

Then we run our updated main (do not forget to delete vigenere_model.zip file usually found in bin/Debug/net8.0).

using Microsoft.ML;
using VigenereCipherApp;

var mlContext = new MLContext();

// Check if the model already exists
var modelExists = System.IO.File.Exists("vigenere_model.zip");

// Generate training and test data
var (trainDataList, testDataList) = TrainingDataGenerator.GenerateTrainingAndTestData(
    numberOfExamples: 2000,
    testDataPercentage: 20,   // 20% of data for testing
    minKeyLength: 3,
    maxKeyLength: 15,
    textLength: 50,
    noiseLevel: 0.1           // Add 10% noise to the plaintext
);

// Convert lists to IDataView for ML.NET
var trainData = mlContext.Data.LoadFromEnumerable(trainDataList);
var testData = mlContext.Data.LoadFromEnumerable(testDataList);

if (!modelExists)
{
    // If model doesn't exist, train the model
    Console.WriteLine("Model not found. Training the model...");

    // Train the model
    VigenereCipherML.TrainModel(mlContext, trainData);
}
else
{
    Console.WriteLine("Trained model found. Skipping training...");
}

// Evaluate the model with test data
VigenereCipherML.EvaluateModel(mlContext, testData);

// Predict the key for a new encrypted text
var encryptedText = "TWPYOGDSTXIHUCMDPN";
var predictedKey = VigenereCipherML.PredictKey(mlContext, encryptedText);

// Display the predicted key
Console.WriteLine($"Predicted Key: {predictedKey}");

// Decrypt the encrypted text using the predicted key
var decryptedText = VigenereCipher.Decrypt(encryptedText, predictedKey);
Console.WriteLine($"Decrypted Text: {decryptedText}");

and drumroll.....

We are still not getting the results we want.

As a last resort, instead of generating random keys, we will use real words. Google has a 10,000 Word Library that has the most used words in English and this may provide us with more realistic data.

After implementing this change, we run our app again:

While it's an improvement, it is still not predicting accurately.

Conclusion

Key prediction for the Vigenère cipher is highly challenging, especially without knowing the key length. It’s not just about finding patterns in the encrypted text—there's a massive search space of possible keys, and without prior knowledge of the key length, it’s almost impossible for a machine learning model to make accurate predictions based purely on patterns in the ciphertext.

So, why is this problem so hard for machine learning?

  • First if all, the number of possible key grows exponentially with the length of the key. For example, if the key is made up of 26 possible characters and is 10 characters long, you have 26^10 possible combinations (~141 trillion keys). Without knowing at least the key length, the model needs to guess not only the characters in the key but also the length of it.

  • Vigenère ciphers are designed to hide letter frequency patterns by using different shifts for each character in the plaintext. This means the patterns the model is trying to learn can vary wildly depending on the key length, making it extremely hard to generalize.

  • Crypto software use techniques like hill climbing and frequency analysis to break Vigenère ciphers. These methods don’t rely on machine learning but instead use heuristics to iteratively refine guesses based on partial information (such as assumptions about the key length or frequency of letters).

Even though we used a very large word library, it still did not improve our accuracy because the model still had no idea about the key length. Without at least narrowing down the key length, even Crypto software would struggle to break the cipher.

In short, while this was a failure, it was a great learning exercise. It also showed that Machine Learning is not just about coding but also about experimenting. It also showed us that the dataset is very important and while very powerful, Machine Language cannot be applied to every problem and it is not the silver bullet people sometimes make it out to be. We can combine machine learning with heuristics to get better results but machine learning alone is not the right tool for this task.

You can find the source code for this project here: https://github.com/tjgokcen/VigenereCipherAppUsingML