2025-05-20 · 15 min read
Data Science & AI

A Crash Course in Scikit-learn: Your First Steps in Machine Learning

Scikit-learn is where everyone starts their ML journey and where most of us realize we have no idea what we're doing. It's the library that promises you'll be building intelligent systems but delivers the harsh reality that you'll spend most of your time wondering why your model thinks everything is either a flower or spam.
But here's the thing: scikit-learn is actually brilliant. It's comprehensive, well-designed, and surprisingly consistent once you understand the patterns. This crash course will help you build your first model and, more importantly, understand why it's probably wrong.

1. Introduction to Scikit-learn

What is it? Scikit-learn is an open-source library that provides simple and efficient tools for data mining and data analysis. It's built on NumPy, SciPy, and matplotlib, making it a natural fit in the Python scientific computing stack.
Why is it so popular?
  • Simplicity: It features a clean, consistent API. You can swap between different models with minimal code changes, which is great when your first five choices don't work.
  • Comprehensive: It has more algorithms than you'll ever understand. Classification, regression, clustering, dimensionality reduction - it's like a buffet where you recognize maybe three dishes but everything looks important.
  • Excellent Documentation: The official docs actually explain things, unlike that one library we shall not name where the documentation is just the source code with extra steps.

2. Core Concepts in Scikit-learn

At its core, scikit-learn's API is built around the Estimator object. An estimator is any object that learns from data.
  • fit(X, y): This is the training step. You feed it examples (X) and their labels (y), and it attempts to find patterns. Success rate varies wildly.
  • predict(X): After training, your model makes predictions on new data. This is when you discover whether your model actually learned generalizable patterns or just memorized the training data.
  • transform(X): Some estimators modify your data. Scalers normalize it, encoders convert categories to numbers. Essential preprocessing that everyone forgets the first time.
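The consistency of this interface is the whole point. Here's a minimal sketch (with a tiny made-up dataset) showing that the same `fit`/`predict` calls work for two completely different models:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data invented for illustration: 4 samples, 1 feature each
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

for clf in (KNeighborsClassifier(n_neighbors=1), DecisionTreeClassifier()):
    clf.fit(X, y)                   # same training call for every estimator
    print(type(clf).__name__, clf.predict([[0.5], [2.5]]))  # same prediction call
```

Swapping models really is a one-line change, which is why trying five different algorithms takes minutes, not days.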
Data in scikit-learn must be formatted correctly:
  • Features Matrix (X): A 2D array where rows are samples and columns are features. Always 2D. Even if you have one feature. Even if you have one sample. Scikit-learn has strong opinions about shapes.
  • Target Vector (y): A 1D array containing what you're trying to predict. The "answers" to the test, if you will.
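A quick sketch of the shape rules (the numbers here are arbitrary). The most common beginner error is passing a 1D array as `X`; `reshape(-1, 1)` turns it into the 2D column scikit-learn expects:

```python
import numpy as np

heights = np.array([150, 160, 170, 180])  # 1D: shape (4,)

# A single feature still needs to be a 2D column: shape (4, 1)
X = heights.reshape(-1, 1)
y = np.array([0, 0, 1, 1])                # the target stays 1D: shape (4,)

print(X.shape, y.shape)  # (4, 1) (4,)
```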

3. Your First Machine Learning Model: A Step-by-Step Example

Let's build a simple classification model using the famous Iris dataset. Yes, the Iris dataset. Again. It's the "Hello World" of machine learning - everyone's sick of it, but here we are.

Step 1: Load the Data

Scikit-learn comes with several built-in datasets.
```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
```

Step 2: Split Data into Training and Testing Sets

We need to evaluate our model on data it has never seen before. train_test_split is a handy utility for this.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

Step 3: Choose and Train a Model

Let's use a K-Nearest Neighbors (KNN) classifier. It's a simple yet effective algorithm.
```python
from sklearn.neighbors import KNeighborsClassifier

# Create an instance of the model
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)
```

Step 4: Make Predictions

Now, let's see what the model thinks about our test data.
```python
y_pred = knn.predict(X_test)
```

Step 5: Evaluate the Model

How well did our model do? Let's check its accuracy.
```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# Expected Output: Model Accuracy: 1.00
```
An accuracy of 1.0 means our model correctly classified every single flower in the test set! Before you start your ML consultancy based on your 100% accurate model, let me break some news: the Iris dataset is basically the ML equivalent of a participation trophy. It's so well-behaved that getting perfect accuracy is like being proud of solving a children's puzzle. In the real world, if you see 100% accuracy, you probably have a bug.
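One way to get a more honest estimate than a single lucky split is cross-validation, which trains and evaluates on several different splits. A sketch (exact scores depend on scikit-learn version, but expect something below perfect):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: 5 different train/test splits, 5 scores
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(f"Accuracy per fold: {scores}")
print(f"Mean: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```

If the per-fold scores vary a lot, your single train/test accuracy was partly luck.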

Visualizing How KNN Works

Let's see what's happening under the hood. The interactive demo below shows how KNN creates decision boundaries:

[Interactive demo: KNN Classification — lower K = more complex boundaries, higher K = smoother boundaries; classes: Setosa, Versicolor, Virginica]
Try experimenting with:
  • K value: K=1 creates complex, jagged decision boundaries because it only considers the nearest neighbor. Higher K values create smoother boundaries by averaging more neighbors, reducing sensitivity to outliers.
  • Different features: Some feature combinations separate the classes better than others. Sepal dimensions might cluster species more clearly than petal width alone.
  • Decision regions: The colored backgrounds show which class the model would predict for any point in that region.
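You can see the same K trade-off numerically by sweeping K with cross-validation. A sketch (scores will vary slightly; very large K underfits because it averages over too much of the dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Small K: flexible but noisy; large K: smooth but blunt
for k in (1, 3, 5, 15, 50):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:2d}: mean accuracy {scores.mean():.3f}")
```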

4. Solving a Real-World Problem: Spam Detection

Let's apply what we've learned to a common problem: classifying SMS messages as "spam" or "ham" (not spam).

Building a Complete Spam Detection Pipeline

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Create a more realistic dataset
messages = [
    # Ham messages
    "Hey, are you free for lunch tomorrow?",
    "I'll be home late tonight, traffic is terrible",
    "Can you pick up milk on your way home?",
    "Meeting rescheduled to 3pm. See you there!",
    "Thanks for the birthday wishes!",
    "The project deadline has been extended to Friday",
    "Great presentation today!",
    "Let me know when you arrive",
    "Happy anniversary! Love you",
    "Don't forget the team lunch at noon",
    "Your package was delivered",
    "Reminder: Doctor appointment at 2pm",
    "Nice seeing you yesterday!",
    "Can we talk later?",
    "Running 10 minutes late",

    # Spam messages
    "WINNER!! You've won £1000000! This is definitely real! -Nigerian Prince",
    "Hot singles in your area want to discuss your car's extended warranty",
    "Doctors HATE this one weird trick! (It's called eating vegetables)",
    "You've inherited $50M from a long-lost uncle you never knew existed!!!",
    "URGENT: Your computer has virus. Click here to download more virus",
    "Congratulations! You're our 1,000,000th visitor! So is everyone else!",
    "Make $$$ working from home! Requires: time machine, unicorn, PhD in wizardry",
    "Your package is waiting! We don't know what package. Just click the link",
    "Free iPhone 15! Just pay shipping of $999.99",
    "Your bank account will be suspended! -Definitely Your Real Bank (trust us)",
    "Bitcoin opportunity! Turn $1 into $1M! Math professors HATE this!",
    "Free cruise to the Bahamas! Departs from Nebraska. Seems legit",
    "IRS here. We only accept iTunes gift cards for tax payment now",
    "Your crush wants to meet you! They're definitely real and not a bot",
    "Lose 50 pounds in 2 days with this miracle pill! Side effects include levitation"
]

labels = ['ham'] * 15 + ['spam'] * 15

# Create DataFrame
df = pd.DataFrame({'message': messages, 'label': labels})

# Step 1: Create a pipeline with preprocessing and model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=1000)),
    ('classifier', MultinomialNB())
])

# Step 2: Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.3, stratify=df['label'], random_state=42
)

# Step 3: Train the model
pipeline.fit(X_train, y_train)

# Step 4: Make predictions
y_pred = pipeline.predict(X_test)

# Step 5: Evaluate performance
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Step 6: Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=['ham', 'spam'])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# Step 7: Compare different models
# (use a separate variable for the candidate pipelines so we don't
#  clobber the Naive Bayes pipeline that Step 8 relies on)
models = {
    'Naive Bayes': MultinomialNB(),
    'SVM': LinearSVC(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

for name, model in models.items():
    candidate = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english', max_features=1000)),
        ('classifier', model)
    ])
    scores = cross_val_score(candidate, df['message'], df['label'], cv=5, scoring='f1_macro')
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Step 8: Feature importance - what words indicate spam?
pipeline.fit(df['message'], df['label'])
feature_names = pipeline.named_steps['tfidf'].get_feature_names_out()
# classes_ is sorted alphabetically, so index 0 is 'ham' and index 1 is 'spam';
# positive values mean a word is more likely in spam than in ham
coefficients = pipeline.named_steps['classifier'].feature_log_prob_[1] - \
               pipeline.named_steps['classifier'].feature_log_prob_[0]

# Get top spam indicators
top_spam_idx = coefficients.argsort()[-10:][::-1]
top_spam_words = [(feature_names[i], coefficients[i]) for i in top_spam_idx]

print("\nTop spam indicators:")
for word, score in top_spam_words:
    print(f"  {word}: {score:.3f}")

# Step 9: Test on new messages
new_messages = [
    "Meeting at 3pm in conference room",
    "URGENT!!! Your goldfish has won the lottery!!!",
    "Can you send me the report?",
    "Hello, I am a normal human. Please click this legitimate link: totallynotascam.virus"
]

predictions = pipeline.predict(new_messages)
for msg, pred in zip(new_messages, predictions):
    print(f"\nMessage: '{msg[:50]}...'\nPrediction: {pred}")
```
This enhanced example demonstrates:
  • Creating a balanced dataset with realistic examples
  • Using pipelines to combine preprocessing and modeling
  • Proper evaluation with classification reports and confusion matrices
  • Comparing multiple algorithms
  • Analyzing which features (words) are most indicative of spam
  • Making predictions on new, unseen messages

5. A Quick Tour of Other Common Models

Scikit-learn's consistent API makes it easy to try different models. It's like speed dating, but for algorithms.
  • Classification:
    • DecisionTreeClassifier: Makes decisions by asking a series of yes/no questions, like a very judgmental game of 20 questions. "Is petal_length > 2.5? Is sepal_width < 3.0? Congratulations, you're a virginica!"
    • SVC (Support Vector Classifier): Tries to find the best line (or hyperplane if you're fancy) to separate your classes. Imagine drawing a line between cats and dogs, but in math dimensions you can't visualize.
  • Regression (predicting continuous values):
    • LinearRegression: Draws the best straight line through your data points. It's optimistic and believes all relationships are linear. Spoiler: they're not.
  • Clustering (unsupervised learning):
    • KMeans: Groups your data into K clusters by playing an endless game of "musical chairs" with data points until everyone's reasonably happy with their group.
Here's how you might use KMeans to find clusters in the Iris data (without using the labels):
```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
print(kmeans.labels_)
```
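Since clustering never sees the labels, plain accuracy doesn't apply (cluster 0 might correspond to any species). A metric like the adjusted Rand index compares the cluster assignments to the known labels while ignoring how the clusters are numbered. A sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# 1.0 = clusters match the species perfectly, ~0.0 = random assignment
print(f"Adjusted Rand index: {adjusted_rand_score(y, kmeans.labels_):.3f}")
```

On Iris this typically lands well above 0: two species overlap in feature space, so KMeans recovers the structure imperfectly but far better than chance.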

Comparing Different Classifiers

Each algorithm makes different assumptions about your data and creates different types of decision boundaries:

[Interactive demo: Classifier Comparison — KNN creates flexible, non-linear boundaries; SVM finds optimal linear separations; Decision Tree makes axis-aligned rectangular splits; Naive Bayes creates smooth, probabilistic boundaries; classes: Setosa, Versicolor, Virginica]
Notice how each classifier approaches the problem differently:
  • KNN: Makes decisions based on nearby points - simple but can capture complex patterns
  • SVM: Tries to find the optimal separating boundary with maximum margin
  • Decision Trees: Creates hierarchical if-then rules, always axis-aligned
  • Naive Bayes: Assumes features are independent and follow specific distributions
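Because the API is shared, comparing all four takes just a few lines. A sketch using cross-validation on Iris (GaussianNB is the Naive Bayes variant suited to continuous features; exact scores will vary slightly):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Same fit/score interface for every model - swapping is trivial
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:>13}: {scores.mean():.3f}")
```

On a dataset this easy they all score well; the differences only become interesting on messier data.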

6. Putting It All Together with Pipelines

Pipelines are a powerful feature for chaining multiple steps together. For example, it's common practice to scale your data before feeding it to a classifier.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Create a pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: Scale the data
    ('svm', SVC())                 # Step 2: Apply the classifier
])

# The pipeline can be used like any other estimator
pipe.fit(X_train, y_train)
print(f"Pipeline Accuracy: {pipe.score(X_test, y_test):.2f}")
```
Pipelines help prevent data leakage from your test set and simplify your workflow.
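Pipelines also plug straight into GridSearchCV: parameters of a step are addressed with the `stepname__parameter` convention, and the scaler is re-fit inside each cross-validation fold, so there's no leakage during tuning. A sketch on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# "svm__C" targets the C parameter of the step named "svm"
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, f"best CV score: {grid.best_score_:.3f}")
```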

7. Common Pitfalls and Debugging Tips (Or: How I Learned to Stop Worrying and Debug My Models)

When starting with scikit-learn, you'll make these mistakes. We all did. Here's your field guide to ML failures:

1. "Why Does Income Matter 1000x More Than Age?"

Your model thinks salary (in dollars) is way more important than age (in years) because 50,000 > 25. Congratulations, you've created an algorithm that only cares about money. How very Silicon Valley of you.
```python
# ❌ Bad: My model is a gold digger
X_train = [[25, 50000], [30, 60000], [35, 70000]]  # age, salary

# ✅ Good: Teaching your model that all features matter
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use transform, not fit_transform!
```

2. "My Model Can See The Future!" (Data Leakage)

Your model isn't psychic, you're just showing it the test answers. It's like letting students study with the exact exam questions - of course they'll ace it.
```python
# ❌ Bad: "I'll just scale everything together, what could go wrong?"
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# ✅ Good: Keep your test data pure and innocent
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # No peeking at test data stats!
```

3. "I Have 99% Accuracy!" (The Class Imbalance Trap)

Your fraud detection model has 99% accuracy? Amazing! Oh wait, only 1% of transactions are fraudulent? Your model just predicts "not fraud" for everything. You've built a very expensive way to say "no."
```python
from collections import Counter
print(Counter(y_train))  # {0: 9900, 1: 100} - Uh oh

# Your "amazing" model:
print("Accuracy: 99%!")    # Just predicts the majority class
print("Fraud caught: 0%")  # Completely useless

# Fix it with stratified splitting and better metrics:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Look at metrics that actually matter:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))  # Shows the ugly truth
```
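Many classifiers also accept `class_weight='balanced'`, which makes mistakes on the rare class cost more during training. A sketch on synthetic imbalanced data (the dataset here is generated purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Compare minority-class recall with and without class weighting
for weight in (None, "balanced"):
    clf = LogisticRegression(class_weight=weight).fit(X_train, y_train)
    rec = recall_score(y_test, clf.predict(X_test))
    print(f"class_weight={weight}: minority recall {rec:.2f}")
```

Expect the balanced model to catch noticeably more of the minority class, usually at the cost of some false positives.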

4. "Default Settings Are Fine, Right?" (Wrong)

Using default hyperparameters is like cooking everything at 350°F for 30 minutes. Sometimes it works, usually it doesn't, and occasionally you set off the smoke alarm.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# ❌ Bad: The "It'll probably work" approach
model = RandomForestClassifier()  # n_estimators=100, max_depth=None, etc.

# ✅ Good: Actually finding what works
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance']
}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)  # Warning: Your laptop fan will sound like a jet engine

print(f"Best parameters: {grid_search.best_params_}")
print(f"Time spent: 3 hours")
print(f"Improvement: 0.02%")  # But hey, it's optimized!
```

5. "ValueError: Input contains NaN" (The Missing Data Surprise)

Scikit-learn is like a picky eater - it refuses to work if there's anything it doesn't recognize on its plate. One NaN value and it throws a tantrum.
```python
# The error that makes you question your dataset
model.fit(X_train, y_train)  # ValueError: Input contains NaN

# Check the damage:
print(f"Missing values: {X.isna().sum().sum()}")  # 47... oops

# The "good enough" fix:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')  # When in doubt, use the average!
X_imputed = imputer.fit_transform(X)  # Now with 100% made-up values!
```
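To avoid imputing with statistics that include the test set, the imputer belongs inside the pipeline too. A sketch with a tiny made-up dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up data with missing values
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan], [3.0, 1.0]])
y = np.array([0, 1, 0, 1])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fills NaN with column means
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
pipe.fit(X, y)  # imputation statistics are learned from training data only
print(pipe.predict([[1.5, 2.5]]))
```

In cross-validation, the means are recomputed within each training fold, so the test fold never influences the fill values.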

8. ML Beginner's Bingo

Before we wrap up, here's a fun game. Check off all the squares you've experienced:
  • Googled "ValueError: Input contains NaN"
  • Got 100% accuracy (it was a bug)
  • Forgot to scale features
  • Trained on test data
  • Used Iris dataset
  • Laptop overheated during GridSearch
  • Tried to predict the stock market
  • Model predicted everything as one class
  • Forgot train_test_split
  • Thought more features = better model
  • Used accuracy on imbalanced data
  • FREE SPACE: "ML is just if-statements"
  • Tried to use strings as features
  • Model worked great on Monday, failed on Tuesday
  • Spent 3 hours on a typo
  • Asked "Is this AI?"

9. Conclusion & Next Steps

Congratulations! You now know enough scikit-learn to be dangerous. You can load data, train models, make predictions, and most importantly, you understand why your model is probably wrong.
Remember:
  • Your first model will be terrible. Your second one will be slightly less terrible.
  • The Iris dataset is not representative of real-world data, no matter how much we use it.
  • When your model achieves 100% accuracy, you have a bug.
  • Machine learning is 10% choosing algorithms and 90% figuring out why your data is weird.
Next steps for your journey into ML madness:
  • Try the official scikit-learn documentation (it's actually good)
  • Graduate from Iris to a dataset that fights back
  • Build something ridiculous (hot dog/not hot dog classifier, anyone?)
  • Remember: with great computational power comes great responsibility to not predict everything as spam
Now go forth and overfit some data! Just remember to use a validation set. 🚀