Country Clustering Using K-Means and PCA

An unsupervised machine learning project that uses K-Means and PCA to group countries based on socio-economic and demographic indicators for insightful global comparisons.

Project Overview

This project applies K-Means clustering and Principal Component Analysis (PCA) to group countries into distinct clusters based on their socio-economic and development indicators.

By analyzing features such as GDP, literacy rate, population, life expectancy, and more, the project helps reveal patterns among nations and enables data-driven regional comparisons. The interactive web application was built using Streamlit for ease of exploration and visualization.

Methodology

Data Preparation

The dataset was cleaned, missing values were imputed, and numerical features were standardized so that no single indicator dominates the distance calculations used in clustering. Key steps included:

  • Collection of country-level data from World Bank and UN datasets
  • Handling missing values through mean imputation for countries with partial data
  • Feature engineering to create composite indicators
  • Standardization using StandardScaler to normalize features to comparable scales
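
The imputation and scaling steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy values, the country labels, and the column names (gdp_per_capita, life_expectancy, literacy_rate) are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; names and values are illustrative only.
raw = pd.DataFrame(
    {
        "gdp_per_capita": [52000.0, 9800.0, np.nan, 1200.0],
        "life_expectancy": [81.2, 74.5, 68.0, np.nan],
        "literacy_rate": [99.0, 93.0, 78.0, 55.0],
    },
    index=["Country A", "Country B", "Country C", "Country D"],
)

# Fill gaps with the column mean, then rescale each column
# to zero mean and unit variance.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
scaled = pd.DataFrame(
    preprocess.fit_transform(raw), index=raw.index, columns=raw.columns
)

print(scaled.round(2))
```

Chaining both steps in one pipeline keeps the imputation means and the scaling parameters together, so the same transformation can be reapplied to new data.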

Dimensionality Reduction

PCA was used to project the dataset onto two principal components for visualization while retaining as much of the original variance as two components allow.

  • Applied PCA to reduce 15 features to 2 principal components
  • These components captured 73% of the variance in the original dataset
  • Feature importance analysis revealed GDP per capita and life expectancy as the most influential variables

The dimensionality reduction simplified the data while preserving meaningful relationships between countries.
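
A reduction like the one described could look like the sketch below. The data here is synthetic random noise standing in for the standardized 15-feature matrix, so the explained variance it prints will not match the 73% reported above; only the shape of the workflow is meant to carry over.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 150 "countries" x 15 standardized features.
X = StandardScaler().fit_transform(rng.normal(size=(150, 15)))

pca = PCA(n_components=2)
components = pca.fit_transform(X)  # shape (150, 2)

# Share of total variance kept by the two components.
explained = pca.explained_variance_ratio_.sum()

# Absolute loadings show how strongly each original feature
# contributes to each component.
loadings = np.abs(pca.components_)  # shape (2, 15)
top_features_pc1 = np.argsort(loadings[0])[::-1][:3]

print(f"Variance explained by 2 components: {explained:.1%}")
print("Indices of the 3 strongest features on PC1:", top_features_pc1)
```

Inspecting the loadings is one common way to judge which variables drive each component; whether the project used this or another feature importance method is not stated.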

Clustering with K-Means

K-Means clustering was applied to the transformed data. The optimal number of clusters was determined using the Elbow Method and Silhouette Score.

  • Tested K values from 2 to 10 clusters
  • Elbow method indicated K=4 as optimal
  • Silhouette score of 0.68 confirmed good cluster separation
  • Final model implementation with K=4 clusters
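
The search over K described above might be sketched as follows. The input is synthetic blob data from make_blobs, used here only as a placeholder for the country data, and the random_state and n_init values are assumptions chosen for reproducibility.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic placeholder data with four loose groups.
X, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.8, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 11):  # K = 2 .. 10
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares (elbow curve)
    silhouettes[k] = silhouette_score(X, labels)

# The elbow is usually read off a plot of inertia against K;
# the silhouette score gives a numeric cross-check.
best_k = max(silhouettes, key=silhouettes.get)
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)

for k in sorted(inertias):
    print(f"K={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
```

Picking K by maximum silhouette is a simplification for the sketch; in practice the elbow plot and the silhouette score are weighed together, as the steps above describe.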

Cluster Visualization

The final clusters were visualized in a 2D PCA plot, colored by cluster label. The interactive visualization allows users to explore how countries are grouped based on similar traits.

  • Created an interactive scatter plot with Plotly
  • Implemented geographic mapping to visualize global patterns
  • Developed parallel coordinates plot to show feature distribution across clusters
  • Built an interactive Streamlit dashboard for exploration

Project Timeline

Project Initiation

January 2025

Defined project goals and collected country-level data from World Bank and UN datasets.

Data Preprocessing

February 2025

Cleaned data, handled missing values, and standardized features for clustering analysis.

Model Development

March 2025

Applied PCA for dimensionality reduction and implemented K-Means clustering with optimal parameters.

Visualization & Deployment

April 2025

Created interactive visualizations and deployed Streamlit application for public exploration.

Key Findings

Cluster 1: Developed Economies

Countries with high GDP, high literacy rates, and long life expectancies. These are typically Western nations and some East Asian countries.

Cluster 2: Emerging Economies

Countries with rapidly growing economies, improving infrastructure, and increasing quality of life indicators.

Cluster 3: Developing Nations

Countries with moderate development indicators and limited infrastructure, but with potential for economic growth given policy improvements.

Cluster 4: Underdeveloped Regions

Countries with low GDP per capita, lower literacy rates, and shorter life expectancies, requiring significant development assistance.

Global Impact

International Aid Targeting

This analysis helps international organizations optimize aid distribution based on cluster-specific needs rather than regional generalizations.

Policy Development

Countries can identify peer nations in the same cluster to analyze their successful development policies and adapt them locally.

Growth Potential Identification

Investors can use cluster analysis to identify countries poised to transition between development stages for strategic investment.

This project analyzed ten years of historical data to identify countries that have successfully transitioned between clusters, particularly those that moved from Cluster 3 (Developing) to Cluster 2 (Emerging).

Key Success Factors:
  • Investment in education systems (average 15% increase in literacy rates)
  • Healthcare infrastructure improvements (8-year average increase in life expectancy)
  • Economic diversification beyond traditional industries
  • Infrastructure development, particularly digital connectivity
  • Governance reforms focused on transparency and economic participation

These insights provide a roadmap for countries seeking to accelerate their development trajectory using data-driven policy approaches.
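
The cluster transition analysis described above could be approached along these lines. The table of yearly assignments is invented for the example, and the sketch assumes cluster labels have already been aligned across years, since K-Means numbers its clusters arbitrarily on each fit.

```python
import pandas as pd

# Hypothetical cluster assignments at the start and end of the period.
history = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B", "C", "C"],
        "year": [2015, 2024, 2015, 2024, 2015, 2024],
        "cluster": [3, 2, 3, 3, 2, 2],
    }
)

# First and last observed cluster per country, in chronological order.
by_country = history.sort_values("year").groupby("country")["cluster"]
transitions = pd.DataFrame({"start": by_country.first(), "end": by_country.last()})

# Countries that moved from Cluster 3 (Developing) to Cluster 2 (Emerging).
moved_3_to_2 = transitions[
    (transitions["start"] == 3) & (transitions["end"] == 2)
].index.tolist()

print(moved_3_to_2)
```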

Technologies Used

Python, Pandas, Scikit-learn, Seaborn, Matplotlib, Plotly, Streamlit

This project demonstrates proficiency in preprocessing country-level datasets, performing unsupervised learning with K-Means and PCA, and building interactive web apps using Streamlit for global data analysis.
