Gene Expression Clustering and Visualization using Machine Learning

Overview

This project explores how unsupervised machine learning methods can reveal hidden biological patterns in cancer-related gene expression data. The goal was to identify biologically meaningful clusters of patients based on gene expression similarity and visualize these high-dimensional relationships in a comprehensible, low-dimensional form.

Objectives

Apply K-Means clustering to partition patient samples into groups with similar gene expression profiles.
Use Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction and visualization.
Compare true cancer subtype labels with the clustering results to assess biological interpretability.

Methodology

Dataset Preparation:
The dataset consisted of thousands of genes across multiple cancer patient samples. Data preprocessing included normalization, missing value handling, and log transformation to stabilize variance.
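The preprocessing steps described above could look roughly like the following. This is a minimal sketch rather than the project's actual pipeline: the function name `preprocess_expression`, the per-gene median imputation, the `log2(x + 1)` transform, the z-score normalization, and the toy matrix are all assumptions chosen for illustration.

```python
import numpy as np


def preprocess_expression(counts):
    """Impute, log-transform, and standardize a samples x genes matrix.

    Hypothetical helper: the specific choices (median imputation,
    log2(x + 1), per-gene z-scores) are assumptions, not the project's
    confirmed settings.
    """
    x = np.asarray(counts, dtype=float).copy()

    # Fill each missing value with the median of its gene (column).
    gene_medians = np.nanmedian(x, axis=0)
    missing = np.isnan(x)
    x[missing] = np.take(gene_medians, np.where(missing)[1])

    # log2(x + 1) compresses the dynamic range and stabilizes variance.
    x = np.log2(x + 1.0)

    # Z-score each gene; leave zero-variance genes at zero instead of dividing by 0.
    std = x.std(axis=0)
    std[std == 0] = 1.0
    return (x - x.mean(axis=0)) / std


# Toy data (made up): 6 samples x 4 genes with one missing entry.
rng = np.random.default_rng(0)
toy = rng.poisson(lam=50, size=(6, 4)).astype(float)
toy[2, 1] = np.nan
processed = preprocess_expression(toy)
```

After this step every gene has mean zero across samples, so highly expressed genes do not dominate the distance calculations used later by PCA and K-Means.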
Dimensionality Reduction:
- PCA was used to project the expression matrix onto a small number of components that retain most of the variance in the gene expression values.
- t-SNE was applied for non-linear embedding, helping to visualize local structures and potential subtype boundaries in 2D.
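The two reduction steps above can be chained, with PCA run first and t-SNE applied to the PCA output. The sketch below uses scikit-learn on synthetic data from `make_blobs` as a stand-in for the real expression matrix; the sample count, the number of components, and the perplexity are illustrative assumptions, not the values used in the project.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for a samples x genes matrix (assumption: 90 samples, 200 "genes").
X, _ = make_blobs(n_samples=90, n_features=200, centers=3, random_state=0)

# Linear reduction: keep the first 10 principal components.
pca = PCA(n_components=10, random_state=0)
X_pca = pca.fit_transform(X)

# Cumulative share of variance explained by the first 1, 2, ..., 10 components.
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Non-linear embedding into 2D for plotting; perplexity must be below n_samples.
tsne = TSNE(n_components=2, perplexity=20, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X_pca)
```

`X_tsne` can then be passed to a Matplotlib or Seaborn scatter plot, colored by cluster or subtype. Distances between well-separated groups in a t-SNE plot are not directly interpretable, so the embedding is best read as a view of local neighborhood structure.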
Clustering Analysis:
- The K-Means algorithm was run with a range of k values to find the number of clusters that gave the best separation.
- The Silhouette Score (higher means better-separated clusters) and the Within-Cluster Sum of Squares (WCSS, lower means tighter clusters) were used to compare configurations and choose the best one.
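A scan over k along these lines might be written as follows. This is a sketch on synthetic data, assuming a candidate range of k = 2 to 6 and picking the k with the highest silhouette score; the actual range and selection rule used in the project are not stated. In scikit-learn, WCSS is exposed as the `inertia_` attribute of a fitted `KMeans` model.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for reduced expression data (assumption: 4 underlying groups).
X, _ = make_blobs(n_samples=120, n_features=20, centers=4, random_state=0)

wcss = {}         # k -> within-cluster sum of squares (KMeans.inertia_)
silhouette = {}   # k -> mean silhouette coefficient over all samples

for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    wcss[k] = model.inertia_
    silhouette[k] = silhouette_score(X, labels)

# One simple selection rule: take the k with the highest silhouette score.
best_k = max(silhouette, key=silhouette.get)
```

WCSS generally keeps decreasing as k grows, so it is usually read by looking for an "elbow" in the curve rather than a minimum, while the silhouette score has a meaningful maximum.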
Interpretation:
- Clusters were compared with known biological labels to assess alignment with actual cancer subtypes.
- Visualizations highlighted how machine learning can rediscover biologically relevant structures without explicit supervision.
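One way to make the comparison between clusters and known labels concrete is a contingency table plus an agreement score. The sketch below uses `pandas.crosstab` and scikit-learn's `adjusted_rand_score`; the write-up does not say which comparison method was used, so the metric is an assumption, and the subtype names and cluster assignments are made-up toy values.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for 9 patients: known subtypes and K-Means cluster IDs.
true_subtypes = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
cluster_ids = [2, 2, 2, 0, 0, 1, 1, 1, 1]

# Rows are known subtypes, columns are cluster IDs, cells are patient counts.
contingency = pd.crosstab(
    pd.Series(true_subtypes, name="subtype"),
    pd.Series(cluster_ids, name="cluster"),
)

# Adjusted Rand Index: 1.0 for identical partitions, about 0 for random ones.
# It ignores the arbitrary numbering of clusters.
ari = adjusted_rand_score(true_subtypes, cluster_ids)
```

In this toy example subtype A maps cleanly onto one cluster while one subtype B patient is grouped with subtype C, so the index falls below 1.0.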

Results

The first few principal components captured over 85% of the total variance.
t-SNE provided clear visual separation between groups corresponding to major cancer subtypes.
The resulting clusters overlapped strongly with known patient categories, demonstrating the ability of unsupervised models to extract latent biological signals.

Tech stack

Languages: Python
Libraries: Pandas, NumPy, scikit-learn, Matplotlib, Seaborn
Techniques: K-Means Clustering, PCA, t-SNE, Silhouette Analysis

Reflections

This project deepened my understanding of how machine learning can be used in biomedical research to uncover hidden relationships in complex data. It also strengthened my interest in interpretable models and data-driven discovery, bridging computational and biological sciences.