16 minute read

Introduction to unsupervised learning

Emmanuel Ohiri

While supervised learning has dominated the machine learning landscape with its impressive achievements in classification and regression tasks, unsupervised learning tackles a different, yet equally important, challenge: deciphering the hidden structure of data without labels or predefined answers.

Imagine trying to understand a complex puzzle with only the pieces and no picture on the box – that's the essence of unsupervised learning. Unlike supervised learning, where algorithms learn from labeled examples, unsupervised learning algorithms explore the data independently, seeking to identify underlying patterns, structures, and relationships.

This ability to find hidden insights without explicit guidance makes unsupervised learning a powerful tool for various applications, from customer segmentation and anomaly detection to dimensionality reduction and feature extraction.

In this article, we will provide a comprehensive introduction to unsupervised learning, exploring its concepts, algorithms, applications, advantages, and challenges.

What is unsupervised learning?

Unsupervised learning is a branch of machine learning where algorithms find structure in unlabeled data. But what does this mean in practice? Imagine being given a dataset with thousands of customer purchasing records, each containing information about the products bought, the time of purchase, and the customer's location. Your goal is to understand the different types of customers you have.

[Image source: TechVidvan]

An unsupervised learning algorithm could analyze this data and automatically group customers with similar purchasing behaviors, revealing distinct customer segments without you having to specify any categories beforehand. This is a key distinction from supervised learning, where the algorithm is given explicit instructions in the form of labeled data.

In our customer example, supervised learning would require you to pre-label a subset of customers with their segment (e.g., "budget-conscious," "frequent buyer," "impulse shopper"). Unsupervised learning, on the other hand, operates without this guidance, relying solely on the inherent information within the data itself.

The absence of labels allows unsupervised learning to explore the data without bias, uncovering hidden patterns and relationships that might not be apparent through supervised methods.

Key characteristics of unsupervised learning

Unsupervised learning possesses distinct qualities that set it apart from other machine learning paradigms. Let's delve into these key characteristics:

1. Absence of labels

The most defining characteristic of unsupervised learning is the lack of labeled data. In supervised learning, data comes with predefined labels or target values that guide the learning process. For instance, in image recognition, images are labeled as "cat" or "dog," allowing the algorithm to learn the features associated with each category.

In contrast, unsupervised learning algorithms work with raw, unlabeled data, like a collection of images without any species identification. This absence of labels forces the algorithm to discover inherent patterns and structures within the data without external guidance.

2. Discovering hidden structures

Unsupervised learning excels at uncovering hidden structures and relationships within data. These structures can take various forms, such as:

  • Clusters: Groups of similar data points, like customers with similar buying habits.
  • Latent variables: Underlying factors that influence the data, such as hidden topics in a collection of documents.
  • Reduced dimensionality: Simplified representations of complex data that capture essential information.

By identifying these hidden structures, unsupervised learning helps us make sense of complex data and extract meaningful insights.

3. Data exploration and insight generation

Unsupervised learning is a powerful tool for data exploration and insight generation, especially when dealing with unfamiliar datasets or domains with limited prior knowledge. By analyzing unlabeled data, unsupervised learning algorithms can reveal unexpected patterns, anomalies, or trends that might otherwise go unnoticed. This exploratory capability makes it valuable for tasks like:

  • Understanding customer behavior: Identifying distinct customer segments and their preferences.
  • Detecting anomalies: Finding unusual patterns in financial transactions or network traffic that could indicate fraud or security breaches.
  • Generating new features: Creating new variables from existing data that can improve the performance of other machine learning models.

To better understand the unique characteristics of unsupervised learning, let's compare it with supervised learning:

Feature       | Supervised Learning        | Unsupervised Learning
Data          | Labeled                    | Unlabeled
Goal          | Predict specific output    | Discover patterns or structures
Example Task  | Classification, Regression | Clustering, Dimensionality Reduction
Output        | Known and predefined       | Unknown, inferred

This table highlights the fundamental differences between the two approaches. Supervised learning aims to predict outcomes based on labeled examples, while unsupervised learning focuses on exploring unlabeled data to uncover hidden patterns and insights.

Types of unsupervised learning

Unsupervised learning can be broadly categorized into three primary types: clustering, dimensionality reduction, and association rule learning. Each of these types serves different purposes and employs different algorithms to uncover hidden patterns in data.

[Image source: EnjoyAlgorithms]

Clustering

Clustering is a technique in unsupervised learning used to group similar data points into clusters or groups based on specific characteristics. Since it is unsupervised, clustering does not rely on labeled data—meaning it discovers natural groupings in the data without prior knowledge about the categories or labels of the input data points.

Key concepts in clustering

  • Unsupervised learning: In clustering, the model works with raw, unlabeled data, aiming to identify patterns or groupings that emerge organically. The goal is to find clusters whose data points are more similar to each other than to points in other clusters.
  • Distance/similarity measures: Clustering is typically based on the distance or similarity between data points. The most common measures include:
      • Euclidean distance: Measures the straight-line distance between two points in space.
      • Manhattan distance: Measures the absolute distance along axes at right angles.
      • Cosine similarity: Measures the cosine of the angle between two vectors, often applied to high-dimensional data (e.g., text).
    These measures define how "close" or "similar" points are and, thus, whether they should be in the same cluster (a short sketch follows this list).
  • Clusters: The groups formed by clustering methods should have the following characteristics:
      • Homogeneity within clusters: Data points in the same cluster should be very similar.
      • Heterogeneity between clusters: Data points from different clusters should be as dissimilar as possible.
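
To make these measures concrete, here is a minimal sketch that computes all three with NumPy, using two invented feature vectors:

import numpy as np

# Two made-up feature vectors (e.g., two customers' numeric attributes)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 1.0])

# Euclidean distance: straight-line distance between the points
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute differences along each axis
manhattan = np.sum(np.abs(a - b))

# Cosine similarity: cosine of the angle between the vectors
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"Euclidean: {euclidean:.3f}")         # 3.000
print(f"Manhattan: {manhattan:.3f}")         # 5.000
print(f"Cosine similarity: {cosine:.3f}")    # ~0.758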

Dimensionality reduction

Dimensionality reduction is a type of unsupervised learning that becomes especially useful as datasets grow larger and more complex. As the number of variables (features) increases, models can suffer from issues like overfitting, increased computational complexity, and the curse of dimensionality (where the data becomes sparse and less meaningful in higher dimensions).

Dimensionality reduction addresses these challenges by transforming data into a lower-dimensional space while retaining as much of the original information as possible. For example, Principal Component Analysis (PCA), one of the most widely used techniques, reduces dimensions by identifying the directions of maximum variance in the data without using any output labels.
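
As a minimal sketch of what this looks like in practice, here is PCA applied with scikit-learn; the random matrix simply stands in for a real numeric feature matrix:

import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 100 samples, each with 10 features
rng = np.random.RandomState(42)
X = rng.rand(100, 10)

# Project the data onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # fraction of variance each component captures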

[Image source: GeeksforGeeks]

The same holds for techniques like t-SNE, Autoencoders, and Random Projections. These algorithms work purely based on the structure of the input data. While dimensionality reduction is typically unsupervised, there are some supervised variants, like Linear Discriminant Analysis (LDA), where the algorithm considers the labels or categories to maximize the separation between them. However, these cases are exceptions rather than the norm.

Both dimensionality reduction and clustering are commonly classified as unsupervised learning techniques because they uncover patterns or structures in data without relying on labels. In clustering, the goal is to group similar data points, whereas, in dimensionality reduction, the goal is to simplify the dataset by reducing the number of features while retaining the core structure.

Association rule learning

Association rule learning is a form of unsupervised learning widely used in applications like market basket analysis. Its primary purpose is to uncover interesting relationships between variables in large datasets by identifying if-then patterns, or rules, that describe the likelihood of certain items appearing together. This technique is unsupervised because it doesn't require labeled data; instead, it searches for associations or co-occurrences among items in the data.

[Image source: Wikipedia]

Like clustering and dimensionality reduction, association rule learning does not require predefined labels. It explores the data to find patterns and relationships, often expressed as "if-then" rules. For example, in a retail setting, the rule might be: "If a customer buys bread, they are likely to buy butter."

Metrics used in association rule learning:

  • Support: Indicates how frequently a particular itemset appears in the dataset. For example, the support for the itemset {bread, butter} is the proportion of transactions that include bread and butter.
  • Confidence: Measures how often the rule has been found to be true. It is calculated as the ratio of transactions that include both the antecedent and the consequent to those that include only the antecedent. For example, if 80% of transactions that include bread also include butter, the confidence of the rule "If bread, then butter" is 80%.
  • Lift: Indicates the strength of a rule compared to random co-occurrence. A lift greater than 1 suggests that the antecedent's occurrence increases the consequent's likelihood.
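
To make these definitions concrete, here is a small hand-rolled sketch that computes support, confidence, and lift for the rule "If bread, then butter" over a toy set of transactions (the transactions are invented for illustration):

# Toy transactions, invented for illustration
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)

# Support: fraction of transactions containing the itemset
support_bread = sum("bread" in t for t in transactions) / n
support_butter = sum("butter" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n

# Confidence of "If bread, then butter"
confidence = support_both / support_bread

# Lift: confidence relative to butter's baseline frequency
lift = confidence / support_butter

print(f"support({{bread, butter}}) = {support_both:.2f}")    # 0.60
print(f"confidence(bread -> butter) = {confidence:.2f}")     # 0.75
print(f"lift(bread -> butter) = {lift:.2f}")                 # 1.25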

Core algorithms in unsupervised learning

Unsupervised learning employs a variety of algorithms to uncover hidden patterns in data. Here, we'll explore some of the most common and fundamental algorithms used in clustering, dimensionality reduction, and association rule learning.

Clustering algorithms

  • K-means clustering: K-means clustering is perhaps the most well-known clustering algorithm. It partitions data into k clusters, where each data point belongs to the cluster with the nearest mean (centroid). K-means is relatively simple to understand and implement, but it requires pre-defining the number of clusters (k).
  • Hierarchical clustering: This approach builds a hierarchy of clusters. It can be either agglomerative (bottom-up, starting with individual data points and merging them) or divisive (top-down, starting with all data points in one cluster and recursively splitting them). Hierarchical clustering doesn't require specifying the number of clusters beforehand, but interpreting the resulting dendrogram can be subjective.
  • Density-based spatial clustering of applications with noise (DBSCAN): The DBSCAN algorithm groups data points that are closely packed together, marking outliers as noise. DBSCAN handles clusters of varying shapes and sizes well and doesn't require pre-defining the number of clusters, but it can be sensitive to its parameter settings (a short sketch follows this list).
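
As a minimal sketch, here is DBSCAN applied with scikit-learn to a handful of made-up 2D points; the eps and min_samples values are illustrative rather than recommendations:

import numpy as np
from sklearn.cluster import DBSCAN

# Made-up 2D points: two dense groups plus one isolated point
X = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],    # group 1
    [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],    # group 2
    [4.5, 0.0],                             # isolated point
])

# eps: neighborhood radius; min_samples: points needed for a dense region
db = DBSCAN(eps=0.5, min_samples=2).fit(X)

# Points labeled -1 are treated as noise (outliers)
print(db.labels_)    # [ 0  0  0  1  1  1 -1]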

Dimensionality Reduction Algorithms

  • Principal Component Analysis (PCA): A widely used technique that reduces dimensionality by finding the principal components, which are new variables that capture the maximum variance in the data. PCA is effective for visualizing high-dimensional data and can improve the performance of other machine-learning models.
  • t-Distributed stochastic neighbor embedding (t-SNE): This algorithm is well-suited for visualizing high-dimensional data in a low-dimensional space (typically 2D or 3D). t-SNE focuses on preserving local neighborhood structures, making it effective for revealing clusters and patterns.

[Image source: paper]

  • Autoencoders: These are neural networks that learn compressed representations of data by encoding and decoding it. Autoencoders can be used for dimensionality reduction, feature extraction, and anomaly detection.
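
As an illustrative sketch (assuming TensorFlow/Keras is available), a minimal autoencoder that compresses 20-dimensional data down to 3 dimensions might look like this; the data and layer sizes are invented for demonstration:

import numpy as np
from tensorflow.keras import layers, Model

# Made-up data: 500 samples with 20 features, scaled to [0, 1]
rng = np.random.RandomState(0)
X = rng.rand(500, 20)

# Encoder compresses 20 features into a 3-dimensional bottleneck
inputs = layers.Input(shape=(20,))
encoded = layers.Dense(3, activation="relu")(inputs)
# Decoder reconstructs the original 20 features from the bottleneck
decoded = layers.Dense(20, activation="sigmoid")(encoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train the network to reproduce its own input
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)

# The encoder alone yields the reduced representation
encoder = Model(inputs, encoded)
X_reduced = encoder.predict(X)
print(X_reduced.shape)    # (500, 3)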

Association Rule Learning Algorithms

  • Apriori algorithm: Apriori algorithm is a classic algorithm for finding frequent itemsets and generating association rules. It uses a breadth-first search approach and prunes the search space by exploiting the downward closure property (if an itemset is infrequent, all its supersets are also infrequent).
  • FP-Growth algorithm: This algorithm is generally faster than Apriori for large datasets. It uses a frequent pattern tree (FP-tree) structure to store compressed information about frequent itemsets, avoiding candidate generation and making the process more efficient.
  • Eclat algorithm: The Eclat algorithm uses a depth-first search approach and finds frequent itemsets by intersecting tidsets (lists of transaction identifiers). Eclat can be more efficient than Apriori for certain types of datasets.
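
In practice, these algorithms are available in libraries. For instance, a minimal sketch using the mlxtend package (assuming it is installed; exact signatures can vary slightly between versions) might look like this:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Toy transactions, invented for illustration
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Find itemsets appearing in at least 40% of transactions, then derive rules
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)

print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])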

This is not an exhaustive list, but it covers some of the most important algorithms in unsupervised learning. Each algorithm has its own strengths and weaknesses, and the choice of algorithm depends on the specific problem and the characteristics of the data.

How unsupervised learning works

To illustrate how unsupervised learning works, let's walk through a simplified example using K-means clustering applied to text data. Imagine you have a collection of customer reviews about a product and want to understand the general sentiment expressed in these reviews without reading each one individually.

Here's how unsupervised learning can help:

  • Data Preparation: First, we need to prepare the data for analysis. In this case, we have text data, so we'll use a technique called TF-IDF (Term Frequency-Inverse Document Frequency) to convert the text into numerical representations that the algorithm can understand. TF-IDF essentially creates a numerical vector for each review, where each number represents the importance of a word in that review relative to the entire collection of reviews.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data
data = {
    'text': [
        "I love this product! It's amazing.",
        "Terrible experience, I hate it.",
        "It's okay, not the best but not the worst."
    ]
}
df = pd.DataFrame(data)

# Convert text to numerical representation using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])
  • Clustering: Now, we can apply the K-means clustering algorithm. This algorithm aims to group similar data points together based on their distance from each other in the numerical space we created. We need to specify the desired number of clusters (k). In this case, let's say we want to group the reviews into two clusters: positive and negative sentiment.
from sklearn.cluster import KMeans

# Apply K-means clustering with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
  • Labeling: The algorithm assigns each data point (review) to one of the clusters. We can then extract these labels and add them to our original data.
labels = kmeans.labels_
df['cluster'] = labels
print(df[['text', 'cluster']])
  • Interpretation: By analyzing the clusters, we can gain insights into the overall sentiment expressed in the reviews. For example, we might find that one cluster contains reviews with words like "love," "amazing," and "fantastic," indicating positive sentiment, while the other cluster contains reviews with words like "terrible," "hate," and "disappointed," indicating negative sentiment.

Here is the entire code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Step 1: Create sample data
# A small dataset of example review text for clustering
data = {
    'text': [
        "I love this product! It's amazing.",
        "Terrible experience, I hate it.",
        "Absolutely fantastic, will buy again!",
        "Worst purchase ever, very disappointed.",
        "It's okay, not the best but not the worst.",
        "I really enjoy using this, highly recommend!",
        "Not worth the money, very poor quality."
    ]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Step 2: Preprocess the data
# Using TfidfVectorizer to convert text to a TF-IDF representation
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])

# Step 3: Apply K-Means clustering
# Assuming we want to find 2 clusters in the data (positive vs negative sentiment)
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# Step 4: Extract the cluster labels
labels = kmeans.labels_

# Step 5: Inspect the clustering results
# Adding labels to the original DataFrame
df['cluster'] = labels

print("Cluster assignments:")
print(df[['text', 'cluster']])

# Step 6: Test the model with new data
new_text = [
    "I am so happy with this purchase!",
    "This is awful, I regret buying it.",
    "It's just okay, nothing special.",
    "Best product ever, I love it!",
    "Extremely poor quality, very unhappy.",
    "Absolutely love it, fantastic quality!",
    "It's mediocre, nothing impressive.",
    "Horrible, broke within a week."
]
new_X = vectorizer.transform(new_text)
new_labels = kmeans.predict(new_X)

# Print predictions for new data
for text, label in zip(new_text, new_labels):
    print(f"Text: '{text}' -> Cluster: {label}")

Key Takeaways:

  • Unsupervised learning algorithms can analyze unlabeled data to discover hidden patterns and structures.
  • K-means clustering is a simple yet powerful technique for grouping similar data points.
  • TF-IDF is a common method for converting text data into a numerical representation suitable for machine learning algorithms.
  • This example demonstrates how unsupervised learning can be used to analyze customer reviews and gain insights into sentiment without any prior labeling.

This is a simplified illustration, but it captures how unsupervised learning works. In real-world applications, the data and algorithms can be much more complex. Still, the underlying principle remains the same: to uncover hidden patterns and structures in data without explicit guidance.

Challenges in unsupervised learning

Lack of ground truth

Without labeled data, evaluating the success of unsupervised learning models is challenging. Since there's no "right answer" to compare against, choosing the correct number of clusters or optimal data representation becomes subjective.

Model evaluation

How do we determine the quality of clusters or reduced dimensions? Unlike supervised learning, where accuracy or F1-score offers a clear metric, unsupervised learning requires alternative methods:

  • Silhouette score: Measures how similar an object is to its own cluster compared to others.
  • Elbow method: Helps determine the optimal number of clusters in K-means by looking at the sum of squared distances.
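
A brief sketch of both ideas with scikit-learn, using made-up data, might look like this; the candidate values of k are illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up 2D data with two natural groupings
rng = np.random.RandomState(42)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [5, 5]])

# Elbow method: watch where inertia (sum of squared distances) stops dropping sharply;
# silhouette: higher values indicate better-separated clusters
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    score = silhouette_score(X, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={score:.3f}")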

Scalability

Many unsupervised learning algorithms struggle with scalability. As the size of the dataset grows, clustering algorithms like K-means or hierarchical clustering may become computationally expensive. Dimensionality reduction techniques like PCA can also face difficulties when dealing with very large datasets.

Applications of unsupervised learning

Anomaly detection

Unsupervised learning is frequently used for anomaly detection: identifying rare or unusual patterns in data. This is valuable in fraud detection, cybersecurity, and network monitoring, where unexpected behaviors or outliers could signify a potential security breach.
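
As one concrete example, scikit-learn's IsolationForest (an unsupervised anomaly detector) can flag outliers in unlabeled data; the transaction amounts below are invented for illustration:

import numpy as np
from sklearn.ensemble import IsolationForest

# Made-up transaction amounts: mostly small, with two unusually large values
amounts = np.array([[25.0], [30.0], [22.0], [27.0], [24.0], [950.0], [26.0], [880.0]])

# contamination: rough fraction of the data expected to be anomalous
model = IsolationForest(contamination=0.25, random_state=42).fit(amounts)

# predict returns 1 for normal points and -1 for anomalies
print(model.predict(amounts))    # the large amounts should be flagged as -1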


Customer segmentation

In marketing, unsupervised learning algorithms are used to segment customers into distinct groups based on their behavior, preferences, or demographics. Businesses can then tailor their marketing strategies to different customer segments, improving engagement and sales.

Image compression

Dimensionality reduction techniques such as PCA are used in image compression to reduce the storage and computational requirements of high-resolution images while preserving important information.
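
As a rough sketch, PCA-based compression might look like the following; a random array stands in for a real image here, so the retained variance will be modest, but on real images far fewer components capture most of the structure:

import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a 64x64 grayscale image: an array of pixel intensities
rng = np.random.RandomState(0)
image = rng.rand(64, 64)

# Treat each row of pixels as a sample and keep 16 principal components
pca = PCA(n_components=16)
compressed = pca.fit_transform(image)               # shape (64, 16)
reconstructed = pca.inverse_transform(compressed)   # shape (64, 64)

stored = compressed.size + pca.components_.size + pca.mean_.size
print(f"Stored values: {stored} vs original: {image.size}")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")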

Text mining and natural language processing

Unsupervised learning helps with tasks like topic modeling or sentiment analysis in text mining. For example, Latent Dirichlet Allocation (LDA) can identify latent topics within a set of documents, while clustering algorithms can group documents by theme.
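
As a minimal sketch, scikit-learn's LatentDirichletAllocation can surface latent topics in a tiny corpus; the documents below are invented to mix two rough themes:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus mixing two rough themes (sports and cooking)
docs = [
    "the team won the match after a great game",
    "chop the onions and simmer the sauce slowly",
    "the coach praised the players after the game",
    "add garlic and herbs to the simmering sauce",
]

# LDA works on raw word counts rather than TF-IDF
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(X)

# Show the top words the model associates with each discovered topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:]]
    print(f"Topic {i}: {top}")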

Recommender systems

Association rule learning is commonly used in recommender systems. Retailers like Amazon and Netflix use it to suggest products or content to users based on their previous behavior. They often analyze patterns to recommend what a user might want to buy or watch next.

Conclusion

Unsupervised learning is a cornerstone of modern data science, with broad applications across industries from marketing to cybersecurity. While it presents unique challenges, particularly around evaluation and scalability, it remains a powerful tool for discovering hidden structures and patterns within data.

You can build your unsupervised learning projects today using cost-effective cloud resources on CUDO Compute. CUDO Compute offers the best NVIDIA GPUs, like the NVIDIA H100 and H200, at affordable rates. Click here to get started, or contact us.
