2025-05-20 · 15 min read
Data Science & AI

A Crash Course in Scikit-learn: Your First Steps in Machine Learning

Scikit-learn is where everyone starts their ML journey and where most of us realize we have no idea what we're doing. It's the library that promises you'll be building intelligent systems but delivers the harsh reality that you'll spend most of your time wondering why your model thinks everything is either a flower or spam.
But here's the thing: scikit-learn is actually brilliant. It's comprehensive, well-designed, and surprisingly consistent once you understand the patterns. This crash course will help you build your first model and, more importantly, understand why it's probably wrong.

1. Introduction to Scikit-learn

What is it? Scikit-learn is an open-source library that provides simple and efficient tools for data mining and data analysis. It's built on NumPy, SciPy, and matplotlib, making it a natural fit in the Python scientific computing stack.
Why is it so popular?
  • Simplicity: It features a clean, consistent API. You can swap between different models with minimal code changes, which is great when your first five choices don't work.
  • Comprehensive: It has more algorithms than you'll ever understand. Classification, regression, clustering, dimensionality reduction - it's like a buffet where you recognize maybe three dishes but everything looks important.
  • Excellent Documentation: The official docs actually explain things, unlike that one library we shall not name where the documentation is just the source code with extra steps.

2. Core Concepts in Scikit-learn

At its core, scikit-learn's API is built around the Estimator object. An estimator is any object that learns from data.
  • fit(X, y): This is the training step. You feed it examples (X) and their labels (y), and it attempts to find patterns. Success rate varies wildly.
  • predict(X): After training, your model makes predictions on new data. This is when you discover whether your model actually learned generalizable patterns or just memorized the training data.
  • transform(X): Some estimators modify your data. Scalers normalize it, encoders convert categories to numbers. Essential preprocessing that everyone forgets the first time.
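The consistency of this interface is the whole point. Here's a minimal sketch (with a tiny made-up dataset) showing that the same `fit`/`predict` calls work for two completely different models:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data invented for illustration: 4 samples, 1 feature each
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]

for clf in (KNeighborsClassifier(n_neighbors=1), DecisionTreeClassifier()):
    clf.fit(X, y)                   # same training call for every estimator
    print(type(clf).__name__, clf.predict([[0.5], [2.5]]))  # same prediction call
```

Swapping models really is a one-line change, which is why trying five different algorithms takes minutes, not days.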
Data in scikit-learn must be formatted correctly:
  • Features Matrix (X): A 2D array where rows are samples and columns are features. Always 2D. Even if you have one feature. Even if you have one sample. Scikit-learn has strong opinions about shapes.
  • Target Vector (y): A 1D array containing what you're trying to predict. The "answers" to the test, if you will.
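A quick sketch of the shape rules (the numbers here are arbitrary). The most common beginner error is passing a 1D array as `X`; `reshape(-1, 1)` turns it into the 2D column scikit-learn expects:

```python
import numpy as np

heights = np.array([150, 160, 170, 180])  # 1D: shape (4,)

# A single feature still needs to be a 2D column: shape (4, 1)
X = heights.reshape(-1, 1)
y = np.array([0, 0, 1, 1])                # the target stays 1D: shape (4,)

print(X.shape, y.shape)  # (4, 1) (4,)
```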

3. Your First Machine Learning Model: A Step-by-Step Example

Let's build a simple classification model using the famous Iris dataset. Yes, the Iris dataset. Again. It's the "Hello World" of machine learning - everyone's sick of it, but here we are.

Step 1: Load the Data

Scikit-learn comes with several built-in datasets.
```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
```

Step 2: Split Data into Training and Testing Sets

We need to evaluate our model on data it has never seen before. train_test_split is a handy utility for this.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

Step 3: Choose and Train a Model

Let's use a K-Nearest Neighbors (KNN) classifier. It's a simple yet effective algorithm.
```python
from sklearn.neighbors import KNeighborsClassifier

# Create an instance of the model
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)
```

Step 4: Make Predictions

Now, let's see what the model thinks about our test data.
```python
y_pred = knn.predict(X_test)
```

Step 5: Evaluate the Model

How well did our model do? Let's check its accuracy.
```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# Expected Output: Model Accuracy: 1.00
```
An accuracy of 1.0 means our model correctly classified every single flower in the test set! Before you start your ML consultancy based on your 100% accurate model, let me break some news: the Iris dataset is basically the ML equivalent of a participation trophy. It's so well-behaved that getting perfect accuracy is like being proud of solving a children's puzzle. In the real world, if you see 100% accuracy, you probably have a bug.
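One way to get a more honest estimate than a single lucky split is cross-validation, which trains and evaluates on several different splits. A sketch (exact scores depend on scikit-learn version, but expect something below perfect):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: 5 different train/test splits, 5 scores
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(f"Accuracy per fold: {scores}")
print(f"Mean: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```

If the per-fold scores vary a lot, your single train/test accuracy was partly luck.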

Visualizing How KNN Works

Let's see what's happening under the hood. The interactive demo below shows how KNN creates decision boundaries:

[Interactive demo: KNN Classification — lower K = more complex boundaries, higher K = smoother boundaries; classes: Setosa, Versicolor, Virginica]
Try experimenting with:
  • K value: K=1 creates complex, jagged decision boundaries because it only considers the nearest neighbor. Higher K values create smoother boundaries by averaging more neighbors, reducing sensitivity to outliers.
  • Different features: Some feature combinations separate the classes better than others. Sepal dimensions might cluster species more clearly than petal width alone.
  • Decision regions: The colored backgrounds show which class the model would predict for any point in that region.
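You can see the same K trade-off numerically by sweeping K with cross-validation. A sketch (scores will vary slightly; very large K underfits because it averages over too much of the dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Small K: flexible but noisy; large K: smooth but blunt
for k in (1, 3, 5, 15, 50):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:2d}: mean accuracy {scores.mean():.3f}")
```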

4. Solving a Real-World Problem: Spam Detection

Let's apply what we've learned to a common problem: classifying SMS messages as "spam" or "ham" (not spam).

Building a Complete Spam Detection Pipeline

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Create a more realistic dataset
messages = [
    # Ham messages
    "Hey, are you free for lunch tomorrow?",
    "I'll be home late tonight, traffic is terrible",
    "Can you pick up milk on your way home?",
    "Meeting rescheduled to 3pm. See you there!",
    "Thanks for the birthday wishes!",
    "The project deadline has been extended to Friday",
    "Great presentation today!",
    "Let me know when you arrive",
    "Happy anniversary! Love you",
    "Don't forget the team lunch at noon",
    "Your package was delivered",
    "Reminder: Doctor appointment at 2pm",
    "Nice seeing you yesterday!",
    "Can we talk later?",
    "Running 10 minutes late",

    # Spam messages
    "WINNER!! You've won £1000000! This is definitely real! -Nigerian Prince",
    "Hot singles in your area want to discuss your car's extended warranty",
    "Doctors HATE this one weird trick! (It's called eating vegetables)",
    "You've inherited $50M from a long-lost uncle you never knew existed!!!",
    "URGENT: Your computer has virus. Click here to download more virus",
    "Congratulations! You're our 1,000,000th visitor! So is everyone else!",
    "Make $$$ working from home! Requires: time machine, unicorn, PhD in wizardry",
    "Your package is waiting! We don't know what package. Just click the link",
    "Free iPhone 15! Just pay shipping of $999.99",
    "Your bank account will be suspended! -Definitely Your Real Bank (trust us)",
    "Bitcoin opportunity! Turn $1 into $1M! Math professors HATE this!",
    "Free cruise to the Bahamas! Departs from Nebraska. Seems legit",
    "IRS here. We only accept iTunes gift cards for tax payment now",
    "Your crush wants to meet you! They're definitely real and not a bot",
    "Lose 50 pounds in 2 days with this miracle pill! Side effects include levitation"
]

labels = ['ham'] * 15 + ['spam'] * 15

# Create DataFrame
df = pd.DataFrame({'message': messages, 'label': labels})

# Step 1: Create a pipeline with preprocessing and model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=1000)),
    ('classifier', MultinomialNB())
])

# Step 2: Split data
X_train, X_test, y_train, y_test = train_test_split(
    df['message'], df['label'], test_size=0.3, stratify=df['label'], random_state=42
)

# Step 3: Train the model
pipeline.fit(X_train, y_train)

# Step 4: Make predictions
y_pred = pipeline.predict(X_test)

# Step 5: Evaluate performance
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Step 6: Visualize confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=['ham', 'spam'])
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

# Step 7: Compare different models
# (use a separate variable for the candidate pipelines so we don't
#  clobber the Naive Bayes pipeline that Step 8 relies on)
models = {
    'Naive Bayes': MultinomialNB(),
    'SVM': LinearSVC(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

for name, model in models.items():
    candidate = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english', max_features=1000)),
        ('classifier', model)
    ])
    scores = cross_val_score(candidate, df['message'], df['label'], cv=5, scoring='f1_macro')
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

# Step 8: Feature importance - what words indicate spam?
pipeline.fit(df['message'], df['label'])
feature_names = pipeline.named_steps['tfidf'].get_feature_names_out()
# classes_ is sorted alphabetically, so index 0 is 'ham' and index 1 is 'spam';
# positive values mean a word is more likely in spam than in ham
coefficients = pipeline.named_steps['classifier'].feature_log_prob_[1] - \
               pipeline.named_steps['classifier'].feature_log_prob_[0]

# Get top spam indicators
top_spam_idx = coefficients.argsort()[-10:][::-1]
top_spam_words = [(feature_names[i], coefficients[i]) for i in top_spam_idx]

print("\nTop spam indicators:")
for word, score in top_spam_words:
    print(f"  {word}: {score:.3f}")

# Step 9: Test on new messages
new_messages = [
    "Meeting at 3pm in conference room",
    "URGENT!!! Your goldfish has won the lottery!!!",
    "Can you send me the report?",
    "Hello, I am a normal human. Please click this legitimate link: totallynotascam.virus"
]

predictions = pipeline.predict(new_messages)
for msg, pred in zip(new_messages, predictions):
    print(f"\nMessage: '{msg[:50]}...'\nPrediction: {pred}")
```
This enhanced example demonstrates:
  • Creating a balanced dataset with realistic examples
  • Using pipelines to combine preprocessing and modeling
  • Proper evaluation with classification reports and confusion matrices
  • Comparing multiple algorithms
  • Analyzing which features (words) are most indicative of spam
  • Making predictions on new, unseen messages

5. A Quick Tour of Other Common Models

Scikit-learn's consistent API makes it easy to try different models. It's like speed dating, but for algorithms.
  • Classification:
    • DecisionTreeClassifier: Makes decisions by asking a series of yes/no questions, like a very judgmental game of 20 questions. "Is petal_length > 2.5? Is sepal_width < 3.0? Congratulations, you're a virginica!"
    • SVC (Support Vector Classifier): Tries to find the best line (or hyperplane if you're fancy) to separate your classes. Imagine drawing a line between cats and dogs, but in math dimensions you can't visualize.
  • Regression (predicting continuous values):
    • LinearRegression: Draws the best straight line through your data points. It's optimistic and believes all relationships are linear. Spoiler: they're not.
  • Clustering (unsupervised learning):
    • KMeans: Groups your data into K clusters by playing an endless game of "musical chairs" with data points until everyone's reasonably happy with their group.
Here's how you might use KMeans to find clusters in the Iris data (without using the labels):
```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
print(kmeans.labels_)
```
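Since clustering never sees the labels, plain accuracy doesn't apply (cluster 0 might correspond to any species). A metric like the adjusted Rand index compares the cluster assignments to the known labels while ignoring how the clusters are numbered. A sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

# 1.0 = clusters match the species perfectly, ~0.0 = random assignment
print(f"Adjusted Rand index: {adjusted_rand_score(y, kmeans.labels_):.3f}")
```

On Iris this typically lands well above 0: two species overlap in feature space, so KMeans recovers the structure imperfectly but far better than chance.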

Comparing Different Classifiers

Each algorithm makes different assumptions about your data and creates different types of decision boundaries:

[Interactive demo: Classifier Comparison — KNN creates flexible, non-linear boundaries; SVM finds optimal linear separations; Decision Tree makes axis-aligned rectangular splits; Naive Bayes creates smooth, probabilistic boundaries; classes: Setosa, Versicolor, Virginica]
Notice how each classifier approaches the problem differently:
  • KNN: Makes decisions based on nearby points - simple but can capture complex patterns
  • SVM: Tries to find the optimal separating boundary with maximum margin
  • Decision Trees: Creates hierarchical if-then rules, always axis-aligned
  • Naive Bayes: Assumes features are independent and follow specific distributions
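Because the API is shared, comparing all four takes just a few lines. A sketch using cross-validation on Iris (GaussianNB is the Naive Bayes variant suited to continuous features; exact scores will vary slightly):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Same fit/score interface for every model - swapping is trivial
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name:>13}: {scores.mean():.3f}")
```

On a dataset this easy they all score well; the differences only become interesting on messier data.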

6. Putting It All Together with Pipelines

Pipelines are a powerful feature for chaining multiple steps together. For example, it's common practice to scale your data before feeding it to a classifier.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Create a pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: Scale the data
    ('svm', SVC())                 # Step 2: Apply the classifier
])

# The pipeline can be used like any other estimator
pipe.fit(X_train, y_train)
print(f"Pipeline Accuracy: {pipe.score(X_test, y_test):.2f}")
```
Pipelines help prevent data leakage from your test set and simplify your workflow.
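Pipelines also plug straight into GridSearchCV: parameters of a step are addressed with the `stepname__parameter` convention, and the scaler is re-fit inside each cross-validation fold, so there's no leakage during tuning. A sketch on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

# "svm__C" targets the C parameter of the step named "svm"
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, f"best CV score: {grid.best_score_:.3f}")
```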

7. Common Pitfalls and Debugging Tips (Or: How I Learned to Stop Worrying and Debug My Models)

When starting with scikit-learn, you'll make these mistakes. We all did. Here's your field guide to ML failures:

1. "Why Does Income Matter 1000x More Than Age?"

Your model thinks salary (in dollars) is way more important than age (in years) because 50,000 > 25. Congratulations, you've created an algorithm that only cares about money. How very Silicon Valley of you.
```python
# ❌ Bad: My model is a gold digger
X_train = [[25, 50000], [30, 60000], [35, 70000]]  # age, salary

# ✅ Good: Teaching your model that all features matter
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use transform, not fit_transform!
```

2. "My Model Can See The Future!" (Data Leakage)

Your model isn't psychic, you're just showing it the test answers. It's like letting students study with the exact exam questions - of course they'll ace it.
```python
# ❌ Bad: "I'll just scale everything together, what could go wrong?"
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# ✅ Good: Keep your test data pure and innocent
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # No peeking at test data stats!
```

3. "I Have 99% Accuracy!" (The Class Imbalance Trap)

Your fraud detection model has 99% accuracy? Amazing! Oh wait, only 1% of transactions are fraudulent? Your model just predicts "not fraud" for everything. You've built a very expensive way to say "no."
```python
from collections import Counter
print(Counter(y_train))  # {0: 9900, 1: 100} - Uh oh

# Your "amazing" model:
print("Accuracy: 99%!")    # Just predicts the majority class
print("Fraud caught: 0%")  # Completely useless

# Fix it with stratified splitting and better metrics:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Look at metrics that actually matter:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))  # Shows the ugly truth
```
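Many classifiers also accept `class_weight='balanced'`, which makes mistakes on the rare class cost more during training. A sketch on synthetic imbalanced data (the dataset here is generated purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data: roughly 95% class 0, 5% class 1
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Compare minority-class recall with and without class weighting
for weight in (None, "balanced"):
    clf = LogisticRegression(class_weight=weight).fit(X_train, y_train)
    rec = recall_score(y_test, clf.predict(X_test))
    print(f"class_weight={weight}: minority recall {rec:.2f}")
```

Expect the balanced model to catch noticeably more of the minority class, usually at the cost of some false positives.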

4. "Default Settings Are Fine, Right?" (Wrong)

Using default hyperparameters is like cooking everything at 350°F for 30 minutes. Sometimes it works, usually it doesn't, and occasionally you set off the smoke alarm.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# ❌ Bad: The "It'll probably work" approach
model = RandomForestClassifier()  # n_estimators=100, max_depth=None, etc.

# ✅ Good: Actually finding what works
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance']
}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)  # Warning: Your laptop fan will sound like a jet engine

print(f"Best parameters: {grid_search.best_params_}")
print(f"Time spent: 3 hours")
print(f"Improvement: 0.02%")  # But hey, it's optimized!
```

5. "ValueError: Input contains NaN" (The Missing Data Surprise)

Scikit-learn is like a picky eater - it refuses to work if there's anything it doesn't recognize on its plate. One NaN value and it throws a tantrum.
```python
# The error that makes you question your dataset
model.fit(X_train, y_train)  # ValueError: Input contains NaN

# Check the damage:
print(f"Missing values: {X.isna().sum().sum()}")  # 47... oops

# The "good enough" fix:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')  # When in doubt, use the average!
X_imputed = imputer.fit_transform(X)  # Now with 100% made-up values!
```
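To avoid imputing with statistics that include the test set, the imputer belongs inside the pipeline too. A sketch with a tiny made-up dataset:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up data with missing values
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, np.nan], [3.0, 1.0]])
y = np.array([0, 1, 0, 1])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fills NaN with column means
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
pipe.fit(X, y)  # imputation statistics are learned from training data only
print(pipe.predict([[1.5, 2.5]]))
```

In cross-validation, the means are recomputed within each training fold, so the test fold never influences the fill values.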

8. ML Beginner's Bingo

Before we wrap up, here's a fun game. Check off all the squares you've experienced:
  • Googled "ValueError: Input contains NaN"
  • Got 100% accuracy (it was a bug)
  • Forgot to scale features
  • Trained on test data
  • Used Iris dataset
  • Laptop overheated during GridSearch
  • Tried to predict the stock market
  • Model predicted everything as one class
  • Forgot train_test_split
  • Thought more features = better model
  • Used accuracy on imbalanced data
  • FREE SPACE: "ML is just if-statements"
  • Tried to use strings as features
  • Model worked great on Monday, failed on Tuesday
  • Spent 3 hours on a typo
  • Asked "Is this AI?"

9. Conclusion & Next Steps

Congratulations! You now know enough scikit-learn to be dangerous. You can load data, train models, make predictions, and most importantly, you understand why your model is probably wrong.
Remember:
  • Your first model will be terrible. Your second one will be slightly less terrible.
  • The Iris dataset is not representative of real-world data, no matter how much we use it.
  • When your model achieves 100% accuracy, you have a bug.
  • Machine learning is 10% choosing algorithms and 90% figuring out why your data is weird.
Next steps for your journey into ML madness:
  • Try the official scikit-learn documentation (it's actually good)
  • Graduate from Iris to a dataset that fights back
  • Build something ridiculous (hot dog/not hot dog classifier, anyone?)
  • Remember: with great computational power comes great responsibility to not predict everything as spam
Now go forth and overfit some data! Just remember to use a validation set. 🚀