Evaluating and Tuning the Model in Machine Learning

Here's a confession: my first machine learning model was a disaster. It was like when I first tried dependency injection - I thought I understood the concept, but the results proved otherwise. I injected something all right, but what exactly, I could not tell.
If you're venturing into ML territory, especially from the comfortable world of C# and .NET, you might feel the same way.
You feel like a stranger in a strange land. Hello World apps and examples are nice, but as soon as you step outside of that comfort zone, things get murky. Building the model is not enough; for it to be useful, you also need to evaluate and tune it.
Let's break down model evaluation and tuning in terms we're familiar with.
It's All About Testing (But Not Unit Testing)
As .NET developers, we're obsessed with testing. We write unit tests, integration tests, and end-to-end tests. Sometimes we write tests to test the tests we have written.
Machine learning has its own version of testing, but it's a bit different from what we're used to.
In our regular .NET world, things are pretty straightforward:
[Fact]
public void UserIsValid_ShouldReturnTrue()
{
    var result = validator.IsValid(user);
    Assert.True(result); // It's either true or false, simple!
}
But in machine learning, we're dealing with probabilities. Our testing looks more like this:
// First, split the data like separating test and production environments
var splitData = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);
// Train the model on the training data
var model = trainingPipeline.Fit(splitData.TrainSet);
// Now evaluate it on the test data
var predictions = model.Transform(splitData.TestSet);
var metrics = mlContext.BinaryClassification.Evaluate(
    predictions,
    labelColumnName: "ActualLabel",
    // The score column holds the raw model output, not the predicted label
    // (assuming the trainer's default output column names)
    scoreColumnName: "Score"
);
// Let's look at all our metrics
Console.WriteLine($"""
Model Performance Report:
------------------------
Accuracy: {metrics.Accuracy:P2}
This means we're right {metrics.Accuracy:P2} of the time
Precision: {metrics.PositivePrecision:P2}
When we predict something is true,
we're right {metrics.PositivePrecision:P2} of the time
Recall: {metrics.PositiveRecall:P2}
We catch {metrics.PositiveRecall:P2} of all actual positive cases
F1 Score: {metrics.F1Score:P2}
This is our balanced performance score
Area Under Curve: {metrics.AreaUnderRocCurve:P2}
The closer to 100%, the better at ranking predictions
""");
OK, great, but how do we know if the model is any good?
Let's try to explain ML model evaluation with something most of us understand: basketball. When you're playing basketball and you evaluate your shooting ability, you look at different aspects of your game. It's similar with machine learning models.
Think about accuracy first. In basketball, this is simply how many shots you make out of all shots taken. If you take 100 shots and make 70, that's 70% accuracy. Simple enough, right? In ML, it is the same. Accuracy tells us how often our model gets it right. But just like in basketball, raw accuracy doesn't tell the whole story.
Let's say you only take shots when you're right under the basket. Sure, your accuracy might be amazing, but are you passing up good opportunities, like an open 3-pointer? This is where precision and recall come in.
Precision is like your shooting percentage when you're confident enough to take the shot. When you decide "yes, I'm taking this shot," how often does it go in? In ML terms, when our model says "yes" (like "this email is spam" or "this transaction is fraudulent"), how often is it right? High precision means you're not crying wolf - when you call something out, people can trust you.
Recall is different - it's about how many good opportunities you take. In basketball, it's like "out of all the good shots you could have taken, how many did you actually take?" This is where passing up those good opportunities comes into play. In ML, recall tells us if our model is catching all the cases it should. For example, if you're detecting fraud, high recall means you're catching most of the actual fraud cases, even if it means you sometimes have false positives.
The F1 Score is like your overall effectiveness as a scorer - it balances between taking good shots (precision) and not missing opportunities (recall). Just like the best basketball players need to balance being selective with their shots while still being aggressive enough to help their team, a good ML model needs to balance precision and recall.
Here's what these metrics look like in code:
var predictions = model.Transform(splitData.TestSet);
var metrics = mlContext.BinaryClassification.Evaluate(predictions);
Console.WriteLine($"""
Shooting Statistics:
-------------------
Overall Accuracy: {metrics.Accuracy:P2}
Like making {metrics.Accuracy:P2} of all shots taken
Precision: {metrics.PositivePrecision:P2}
When you take the shot, it goes in {metrics.PositivePrecision:P2} of the time
Recall: {metrics.PositiveRecall:P2}
You're taking {metrics.PositiveRecall:P2} of good shot opportunities
F1 Score: {metrics.F1Score:P2}
Your overall effectiveness as a scorer
""");
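To see where those four numbers come from, it can help to compute them by hand from raw counts. The sketch below does not use ML.NET at all: the `ConfusionCounts` record is a hypothetical helper, and the fraud numbers are made up for illustration. It compares a lazy "model" that never flags fraud with one that actually tries, on a data set where 10 of 1,000 transactions are fraudulent.

```csharp
using System;

// Hypothetical helper for illustration; not part of ML.NET.
public record ConfusionCounts(int TruePositives, int FalsePositives,
                              int TrueNegatives, int FalseNegatives)
{
    public int Total => TruePositives + FalsePositives + TrueNegatives + FalseNegatives;

    // Share of all predictions that were correct.
    public double Accuracy =>
        Total == 0 ? 0 : (double)(TruePositives + TrueNegatives) / Total;

    // Of everything we flagged as positive, how much really was positive?
    // (Reported as 0 here when nothing was flagged; that is a choice made for this sketch.)
    public double Precision =>
        TruePositives + FalsePositives == 0
            ? 0 : (double)TruePositives / (TruePositives + FalsePositives);

    // Of everything that really was positive, how much did we flag?
    public double Recall =>
        TruePositives + FalseNegatives == 0
            ? 0 : (double)TruePositives / (TruePositives + FalseNegatives);

    // Harmonic mean of precision and recall.
    public double F1 =>
        Precision + Recall == 0 ? 0 : 2 * Precision * Recall / (Precision + Recall);
}

public static class MetricsDemo
{
    public static void Main()
    {
        // 1,000 transactions, 10 of them fraudulent (made-up numbers).
        // The lazy model never flags anything, so it has no true or false positives.
        var lazy = new ConfusionCounts(0, 0, 990, 10);
        // This model flags 12 transactions and catches 8 of the 10 frauds.
        var useful = new ConfusionCounts(8, 4, 986, 2);

        Print("Lazy model", lazy);
        Print("Useful model", useful);
    }

    private static void Print(string name, ConfusionCounts c) =>
        Console.WriteLine(
            $"{name}: accuracy {c.Accuracy:P1}, precision {c.Precision:P1}, " +
            $"recall {c.Recall:P1}, F1 {c.F1:P1}");
}
```

With these made-up counts, the lazy model scores 99% accuracy while catching zero fraud, which is exactly why accuracy alone doesn't tell the whole story.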
Different situations call for different priorities.
Continuing with our basketball analogy, if you're protecting a lead in the final minutes, you want high precision - you only take the sure shots. If you're down by a lot, you might sacrifice some precision for higher recall - you need to take riskier shots, such as 3-pointers, to have a chance at winning.
Similarly, in ML, your priorities depend on your problem. Detecting spam email? You might want high precision to avoid blocking important messages. Detecting fraud? You might want high recall to catch as many cases as possible, even if it means investigating some false alarms.
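One common way to shift that balance is to move the decision threshold, the score above which the model says "yes". The sketch below is only an illustration: the scores and labels are made up rather than taken from a real model, and `PrecisionRecallAt` is a hypothetical helper, not an ML.NET call.

```csharp
using System;
using System.Linq;

public static class ThresholdDemo
{
    // Made-up (score, actual label) pairs for illustration only.
    public static readonly (double Score, bool Actual)[] Samples =
    {
        (0.95, true), (0.90, true), (0.80, false), (0.70, true),
        (0.60, false), (0.40, true), (0.30, false), (0.10, false),
    };

    // Treats every sample scoring at or above the threshold as a "yes"
    // and computes precision and recall for that cutoff.
    public static (double Precision, double Recall) PrecisionRecallAt(
        double threshold, (double Score, bool Actual)[] samples)
    {
        int tp = samples.Count(s => s.Score >= threshold && s.Actual);
        int fp = samples.Count(s => s.Score >= threshold && !s.Actual);
        int fn = samples.Count(s => s.Score < threshold && s.Actual);

        double precision = tp + fp == 0 ? 0 : (double)tp / (tp + fp);
        double recall = tp + fn == 0 ? 0 : (double)tp / (tp + fn);
        return (precision, recall);
    }

    public static void Main()
    {
        // Strict, middle-of-the-road, and lenient cutoffs.
        foreach (var threshold in new[] { 0.85, 0.50, 0.20 })
        {
            var (precision, recall) = PrecisionRecallAt(threshold, Samples);
            Console.WriteLine(
                $"Threshold {threshold:F2}: precision {precision:P0}, recall {recall:P0}");
        }
    }
}
```

With these made-up samples, the strict 0.85 cutoff is never wrong when it says "yes" but finds only half of the positives, while the lenient 0.20 cutoff finds every positive at the cost of three false alarms. That is the "protect the lead" versus "start shooting 3-pointers" trade-off in numbers.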
When Things Go Wrong
Overfitting: The Over-Analyzer
Imagine a basketball player who has watched every single game tape of their next opponent. They've memorized every move, every play, every tendency. During practice against the scout team (who's mimicking the opponent), they're amazing - they anticipate everything perfectly and score at will.
But then the actual game comes, and they look terrible. Why? Because their opponent isn't doing exactly what they did in those game tapes. They've changed things up, adapted their style, and now our over-prepared player is lost.
This is overfitting in machine learning. Your model has essentially memorized the training data instead of learning general patterns, so it fails when it is presented with a different scenario.
It's also like studying for a certification by only memorizing the answers to the questions without understanding them. Sure, you can pass the exam and get the certification, but chances are you will feel lost in your first real-world crisis.
Here's what it looks like in practice:
// Training metrics look amazing
var trainingMetrics = mlContext.BinaryClassification.Evaluate(
    model.Transform(trainData));
Console.WriteLine($"Training accuracy: {trainingMetrics.Accuracy:P2}");
// 99.9%!
// But test metrics tell the real story
var testMetrics = mlContext.BinaryClassification.Evaluate(
    model.Transform(testData));
Console.WriteLine($"Test accuracy: {testMetrics.Accuracy:P2}");
// 65%... ouch
Underfitting: The Novice
Now imagine the opposite - a player who only learned the absolute basics. They know to "put the ball in the hoop" but that's about it. No matter what situation they're in, they do the same thing: get ball, run straight, shoot. No strategy, no adaptation, just the same simple approach every time.
To reuse the certification analogy: a beginner developer is unlikely to pass a senior-level certification, because they would be trying to apply only basic concepts to complex scenarios.
This is underfitting. Your model is too simple to understand the patterns in your data. Here's what that looks like in code:
// Both training and test metrics are poor
Console.WriteLine($"Training accuracy: {trainingMetrics.Accuracy:P2}"); // e.g. 62%
Console.WriteLine($"Test accuracy: {testMetrics.Accuracy:P2}"); // e.g. 60%
Finding the Sweet Spot
So how do we avoid both these extremes? Just like in basketball and professional development, the key is finding the right balance. Good players study their opponents and learn their patterns, but they also know how to adapt when things change. The best players and the best models have a deep understanding of fundamentals that apply to any situation.
In ML terms, we want our model to learn the true patterns in the data without memorizing the noise. Here's how we might check for this balance:
public bool IsModelBalanced(ITransformer model, IDataView trainData, IDataView testData)
{
    var trainMetrics = mlContext.BinaryClassification.Evaluate(
        model.Transform(trainData));
    var testMetrics = mlContext.BinaryClassification.Evaluate(
        model.Transform(testData));
    // Check if training and test scores are reasonably close.
    // The 0.1 gap and 0.7 floor below are rule-of-thumb cutoffs; adjust them for your problem.
    var accuracyDifference = Math.Abs(
        trainMetrics.Accuracy - testMetrics.Accuracy);
    Console.WriteLine($"""
        Model Balance Report:
        --------------------
        Training Accuracy: {trainMetrics.Accuracy:P2}
        Test Accuracy: {testMetrics.Accuracy:P2}
        Difference: {accuracyDifference:P2}
        Diagnosis:
        {(accuracyDifference > 0.1
            ? "Possible overfitting - model performs much better on training data"
            : trainMetrics.Accuracy < 0.7
                ? "Possible underfitting - model performance is too low"
                : "Model seems well-balanced!")}
        """);
    return accuracyDifference <= 0.1 && trainMetrics.Accuracy >= 0.7;
}
Signs you've got a good balance:
Training and test scores are similar (like a player who performs consistently in practice and games)
The model performs well but not suspiciously perfect (like a player with good but realistic stats)
It can handle new, slightly different data (like a player who can adapt to different opponents)
Tuning Your Model: Hyperparameter Optimization
Let's stick with basketball for one final analogy to bring it all together.
When you're trying to improve your game, you might adjust various aspects of your training - most probably based on your weaknesses: how many hours you practice, how you split time between drills and scrimmages, how much you focus on shooting versus other skills. Each of these decisions affects your overall performance.
In machine learning, making these adjustments is called hyperparameter optimization (or hyperparameter tuning). Think of hyperparameters as the "training knobs" you can turn to improve your model's performance.
Tuning a model is like adjusting your shower temperature. First you make big turns of the handle to get from freezing cold to roughly warm. Once you're in the right zone, you make smaller, more precise adjustments to get that perfect temperature. You know, that sweet spot where you're not doing the shower dance between too hot and too cold.
That's exactly how we tune our machine learning models. Let's look at how this works:
// Think of this like adjusting your shower temperature:
// First try: big adjustments to find roughly the right temperature
// Then: smaller, careful adjustments to get it just right

// First round - big adjustments, like turning the shower handle a lot.
// We try three values (high, medium, low) for each setting and compare the results.
// Learning rate
var learningRates = new[] { 0.1f, 0.01f, 0.001f };
// Maximum tree depth (how much mixing of hot/cold)
var maxDepths = new[] { 3, 5, 7 };
// Number of trees (how long you let it run)
var numTrees = new[] { 50, 100, 200 };
Console.WriteLine("First round: Making big adjustments to find the right zone...");

// Let's say we find that a learning rate of 0.01 works best (with depth 5 and 100 trees).
// That's like finding that the handle at roughly halfway gives a good temperature.

// Second round - fine tuning with small handle adjustments around those values
var refinedRates = new[] { 0.005f, 0.01f, 0.02f };
// Fine-tuning the mix
var refinedDepths = new[] { 4, 5, 6 };
// Adjusting the duration
var refinedTrees = new[] { 75, 100, 125 };
Console.WriteLine("Second round: Fine-tuning to get it just right...");
It's really that simple:
First you make big turns of the handle to get from freezing to roughly warm
Then you make tiny adjustments to get that perfect temperature
Once you find the sweet spot, you might make micro-adjustments to keep it perfect
This coarse-to-fine approach is more systematic than guessing values by hand or sticking with the defaults, because each round builds on what the previous one measured.
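The arrays in the snippet above only list candidate values; something still has to loop over them and keep the best combination. Here is a minimal sketch of that two-round grid search. Everything in it is an assumption for illustration: `Search` is a hypothetical helper, and `ToyScore` is an invented stand-in that just makes the example runnable. In a real project the scoring function would train a model with the given settings and return a metric measured on held-out data.

```csharp
using System;

public static class GridSearchDemo
{
    // Tries every combination of the candidate values and returns the best-scoring one.
    public static (float LearningRate, int MaxDepth, int NumTrees, double Score) Search(
        float[] learningRates, int[] maxDepths, int[] numTrees,
        Func<float, int, int, double> score)
    {
        var best = (LearningRate: 0f, MaxDepth: 0, NumTrees: 0,
                    Score: double.NegativeInfinity);

        foreach (var rate in learningRates)
        foreach (var depth in maxDepths)
        foreach (var trees in numTrees)
        {
            var result = score(rate, depth, trees);
            if (result > best.Score)
                best = (rate, depth, trees, result);
        }
        return best;
    }

    // Invented scoring function: pretends the ideal settings are a learning rate
    // of 0.02, a depth of about 5.6, and 120 trees. It trains nothing.
    public static double ToyScore(float rate, int depth, int trees) =>
        1.0
        - 0.1 * Math.Abs(Math.Log10(rate) - Math.Log10(0.02))
        - 0.02 * Math.Abs(depth - 5.6)
        - 0.001 * Math.Abs(trees - 120);

    public static void Main()
    {
        // First round: big steps (3 x 3 x 3 = 27 combinations).
        var coarse = Search(
            new[] { 0.1f, 0.01f, 0.001f }, new[] { 3, 5, 7 }, new[] { 50, 100, 200 },
            ToyScore);
        Console.WriteLine($"Coarse winner: {coarse}");

        // Second round: smaller steps around the first-round winner (0.01, 5, 100).
        var fine = Search(
            new[] { 0.005f, 0.01f, 0.02f }, new[] { 4, 5, 6 }, new[] { 75, 100, 125 },
            ToyScore);
        Console.WriteLine($"Fine winner:   {fine}");
    }
}
```

With this toy scoring function, the first round settles on 0.01 / 5 / 100, and the second round nudges that to 0.02 / 6 / 125 with a higher score. One design note: score each combination on data the model was not trained on, otherwise you end up tuning your way straight into overfitting.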
The Bottom Line
Machine learning isn't magic - it's practice, measurement, and adjustment, just like improving at any skill. We split our data to test properly, measure different aspects of performance to understand our strengths and weaknesses, watch out for overthinking (overfitting) and underthinking (underfitting), and carefully tune our training approach.
Remember: your first model, like your first shot in basketball, probably won't be perfect. That's okay! The key is understanding these fundamentals so you can improve systematically. Start simple, measure everything, and adjust based on what you learn.




