Özge Karasu

Curiosity leads me. I follow and write.

Gene Expression Clustering and Visualisation using Machine Learning

15.05.2021

Overview

This project explores how unsupervised machine learning methods can reveal hidden biological patterns in cancer-related gene expression data. The goal was to identify biologically meaningful clusters of patients based on gene expression similarity and visualize these high-dimensional relationships in a comprehensible, low-dimensional form.

Objectives

  • Apply K-Means clustering to partition patient samples into groups with similar gene expression profiles.
  • Use Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction and visualization.
  • Compare true cancer subtype labels with the clustering results to assess biological interpretability.

Methodology

  1. Dataset Preparation:
    The dataset contained expression values for thousands of genes measured across multiple cancer patient samples. Preprocessing included normalization, missing value handling, and log transformation to stabilize variance.

  2. Dimensionality Reduction:

    • PCA was used to reduce the number of features while preserving most of the variance in the gene expression values.
    • t-SNE was applied for non-linear embedding, helping to visualize local structures and potential subtype boundaries in 2D.

  3. Clustering Analysis:

    • The K-Means algorithm was run with different values of k to explore optimal cluster separation.
    • The Silhouette Score and Within-Cluster Sum of Squares (WCSS) were compared across values of k to choose the best clustering configuration.

  4. Interpretation:

    • Clusters were compared with known biological labels to assess alignment with actual cancer subtypes.
    • Visualizations highlighted how machine learning can rediscover biologically relevant structures without explicit supervision.
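
The four steps above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the original project code: the data is synthetic (generated with make_blobs as a stand-in for a real expression matrix), and the helper names preprocess, reduce_with_pca and sweep_k, the 85% variance target, the range of k, and the t-SNE perplexity are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix (assumption, not the real data):
# 150 "patients" x 200 "genes", drawn from 3 hidden groups.
X_raw, true_labels = make_blobs(
    n_samples=150, n_features=200, centers=3, cluster_std=2.0, random_state=0
)
counts = np.exp(X_raw / 5.0)  # make the values positive, like expression levels


def preprocess(counts):
    """Log-transform to stabilize variance, then standardize each gene."""
    logged = np.log2(counts + 1.0)
    return StandardScaler().fit_transform(logged)


def reduce_with_pca(X, variance_to_keep=0.85):
    """Keep the smallest number of components reaching the variance target."""
    pca = PCA(n_components=variance_to_keep, svd_solver="full")
    return pca.fit_transform(X), pca


def sweep_k(X, k_values):
    """Fit K-Means for each k and record WCSS (inertia) and silhouette score."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        results[k] = {
            "wcss": km.inertia_,
            "silhouette": silhouette_score(X, km.labels_),
        }
    return results


X = preprocess(counts)
X_pca, pca = reduce_with_pca(X)

scores = sweep_k(X_pca, range(2, 7))
# One possible selection rule: take the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])

# 2D t-SNE embedding of the PCA-reduced data, for plotting only.
embedding = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(X_pca)

print("components kept:", pca.n_components_)
print("best k by silhouette:", best_k)
```

The embedding array can be passed to a Matplotlib or Seaborn scatter plot and colored by cluster or by subtype. Clustering on the PCA output rather than on the t-SNE coordinates is a design choice made here, since t-SNE distorts global distances; the original text does not say which representation was clustered.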

Results

  • The first few principal components captured over 85% of the variance.
  • t-SNE provided clear visual separation between groups corresponding to major cancer subtypes.
  • The resulting clusters overlapped strongly with known patient categories, demonstrating the ability of unsupervised models to extract latent biological signals.
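
One way to quantify the overlap between clusters and known categories is a contingency table together with the adjusted Rand index. The sketch below is illustrative only: the text does not state which agreement measure was used, so the adjusted Rand index is an assumption, and compare_clusters and the toy labels are hypothetical.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score


def compare_clusters(true_subtypes, cluster_ids):
    """Cross-tabulate subtypes against clusters and compute the adjusted Rand index."""
    table = pd.crosstab(
        pd.Series(true_subtypes, name="subtype"),
        pd.Series(cluster_ids, name="cluster"),
    )
    # ARI ignores how clusters are numbered: 1.0 means identical groupings,
    # values near 0 mean agreement no better than chance.
    ari = adjusted_rand_score(true_subtypes, cluster_ids)
    return table, ari


# Toy example (made-up labels): one subtype-B sample lands in the cluster
# that otherwise holds subtype C.
true_subtypes = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
cluster_ids = [2, 2, 2, 0, 0, 1, 1, 1, 1]

table, ari = compare_clusters(true_subtypes, cluster_ids)
print(table)
print("adjusted Rand index:", round(ari, 3))
```

Rows of the table that concentrate in a single column indicate a subtype recovered cleanly by one cluster; rows spread over several columns indicate a subtype the clustering split or merged.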

Tech stack

  • Languages: Python
  • Libraries: Pandas, NumPy, scikit-learn, Matplotlib, Seaborn
  • Techniques: K-Means Clustering, PCA, t-SNE, Silhouette Analysis

Reflections

This project deepened my understanding of how machine learning can be used in biomedical research to uncover hidden relationships in complex data. It also strengthened my interest in interpretable models and data-driven discovery, bridging computational and biological sciences.