Wine Clustering Model

Overview

This repository contains a K-Means clustering model trained on the mltrev23/wine-clustering dataset. The model is designed to group wines into clusters based on their chemical properties, helping to uncover underlying patterns and relationships within the data.

Model Details

Algorithm

K-Means Clustering: An unsupervised learning algorithm that partitions the data into K clusters. Each sample is assigned to its nearest centroid, and the centroids are chosen to minimize the within-cluster variance (the sum of squared distances from each sample to its centroid). It works best when the groups are compact and roughly similar in size.
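
To make the objective concrete, here is a minimal sketch that fits K-Means and recomputes the within-cluster sum of squares by hand. It assumes scikit-learn's `KMeans` and uses synthetic data from `make_blobs` as a stand-in, not the wine dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (assumption): 150 points around 3 centers
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Within-cluster sum of squares: squared distance from each point
# to the centroid of the cluster it was assigned to
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss = np.sum((X - assigned_centers) ** 2)

print(f"manual WCSS:     {wcss:.4f}")
print(f"kmeans.inertia_: {kmeans.inertia_:.4f}")
```

The two printed values should agree up to floating-point error, since `inertia_` is the quantity K-Means minimizes.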

Training Data

- Dataset: The model is trained on the mltrev23/wine-clustering dataset.
- Features: The dataset includes the following features: Alcohol, Malic_Acid, Ash, Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins, Color_Intensity, Hue, OD280, and Proline.
- Target: Since this is an unsupervised learning task, there is no explicit target variable. The goal is to find natural groupings within the data.

Model Performance

Number of Clusters (K): The optimal number of clusters was determined using the elbow method or silhouette analysis, resulting in K = [Insert optimal K] clusters.
Cluster Interpretation: Each cluster represents a group of wines with similar chemical properties. The characteristics of each cluster can be analyzed to understand the distinguishing features of the wines within that cluster.

(Replace the placeholder [Insert optimal K] with the actual number of clusters.)
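
The elbow method and silhouette analysis mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for this model: it uses scikit-learn's bundled `load_wine` data as a stand-in (which may differ from mltrev23/wine-clustering), and it standardizes the features, which this document does not confirm was done during training.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled wine data
X = load_wine().data
# Standardizing is an assumption of this sketch, not a documented training step
X_scaled = StandardScaler().fit_transform(X)

results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    results[k] = (km.inertia_, silhouette_score(X_scaled, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"K={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")

# Elbow method: look for the K where inertia stops dropping sharply.
# Silhouette analysis: prefer the K with the highest average score.
best_k = max(results, key=lambda k: results[k][1])
print("K with highest silhouette score:", best_k)
```

The two criteria do not always agree, so it is worth checking both before settling on K.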

Requirements

To run the model, you'll need the following Python libraries:

pip install pandas
pip install numpy
pip install scikit-learn
pip install joblib
pip install matplotlib

Usage

Loading the Model

You can load the trained K-Means model using the following code snippet:

import joblib

# Load the trained model
model = joblib.load('wine_clustering_kmeans.model')
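
For context, a model file like this can be produced with `joblib.dump`. The round trip below is a sketch only: the data is synthetic, and the file name `kmeans_demo.model` is hypothetical and written to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in with 13 columns, matching the number of wine features
X, _ = make_blobs(n_samples=100, centers=3, n_features=13, random_state=0)
original = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Hypothetical path, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "kmeans_demo.model")
joblib.dump(original, path)

# Loading restores a fitted estimator with the same centroids
restored = joblib.load(path)
same_labels = (restored.predict(X) == original.predict(X)).all()
print(type(restored).__name__, restored.n_clusters, same_labels)
```

joblib files are pickle-based, so load only files you trust, and preferably with the same scikit-learn version that saved them.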

Making Predictions

To assign new data points to the existing clusters, use the following code:

import pandas as pd

# Example input data (replace with your actual data)
data = pd.DataFrame({
    'Alcohol': [14.23, 13.20],
    'Malic_Acid': [1.71, 1.78],
    'Ash': [2.43, 2.14],
    'Ash_Alcanity': [15.6, 11.2],
    'Magnesium': [127, 100],
    'Total_Phenols': [2.80, 2.65],
    'Flavanoids': [3.06, 2.76],
    'Nonflavanoid_Phenols': [0.28, 0.26],
    'Proanthocyanins': [2.29, 1.28],
    'Color_Intensity': [5.64, 4.38],
    'Hue': [1.04, 1.05],
    'OD280': [3.92, 3.40],
    'Proline': [1065, 1050]
})

# Predict the cluster for each wine sample
cluster_predictions = model.predict(data)
print(cluster_predictions)
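
`predict` returns only the index of the nearest centroid. To see how close a sample is to every cluster, `KMeans.transform` returns the distances to all centroids. The sketch below assumes a stand-in model fitted on synthetic data; `model_demo` is a hypothetical name, not the released model.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data and model (assumption, not the released model)
X, _ = make_blobs(n_samples=120, centers=3, n_features=13, random_state=1)
model_demo = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

new_samples = X[:5]

# One row per sample, one column per cluster: Euclidean distance to each centroid
distances = model_demo.transform(new_samples)
labels = model_demo.predict(new_samples)

for label, row in zip(labels, distances):
    print(f"cluster {label}, distances to centroids: {row.round(2)}")
```

A sample whose two smallest distances are nearly equal sits near a cluster boundary, so its assignment is less certain.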

Evaluation

You can inspect the clustering visually by projecting the features onto two principal components and coloring each point by its assigned cluster:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Uses `data` and `cluster_predictions` from the prediction example above
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=cluster_predictions)
plt.title("Wine Clusters")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()
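
A PCA plot is a visual check only. For a numeric summary, scikit-learn offers internal validation indices that need no ground-truth labels. The sketch below assumes synthetic stand-in data and is not a report of this model's actual scores.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Synthetic stand-in for the wine features (assumption, not the real data)
X, _ = make_blobs(n_samples=150, centers=3, n_features=13, random_state=3)
labels = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

# Davies-Bouldin index: lower is better, 0 is the minimum
db_index = davies_bouldin_score(X, labels)
# Calinski-Harabasz index: higher is better
ch_index = calinski_harabasz_score(X, labels)

print(f"Davies-Bouldin:    {db_index:.3f}")
print(f"Calinski-Harabasz: {ch_index:.1f}")
```

These indices are most useful for comparing alternative clusterings of the same data, not as absolute quality grades.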

Cluster Profiling

To understand what differentiates each cluster:

# Each centroid holds the mean value of every feature within its cluster
cluster_centers = model.cluster_centers_

for i, center in enumerate(cluster_centers):
    print(f"Cluster {i}:")
    print(dict(zip(data.columns, center)))

Model Interpretability

The K-Means model provides cluster centroids that can be analyzed to understand the typical properties of wines in each cluster. You can use these centroids to profile each cluster and derive insights about the wines grouped together.
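
The link between centroids and per-cluster averages can be checked directly. The sketch below assumes synthetic data with three borrowed column names and sets `tol=0` so the fit runs until the assignments stop changing; under that condition the per-cluster feature means should match `cluster_centers_`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Column names borrowed from the dataset; the values are synthetic (assumption)
feature_names = ["Alcohol", "Hue", "Proline"]
X, _ = make_blobs(n_samples=90, centers=3, n_features=3, random_state=2)
df = pd.DataFrame(X, columns=feature_names)

# tol=0 makes the fit iterate until cluster assignments no longer change
km = KMeans(n_clusters=3, n_init=10, tol=0, random_state=2).fit(df)

# Per-cluster mean of every feature, plus the cluster sizes
profile = df.groupby(km.labels_).mean()
profile.index.name = "cluster"
profile["n_wines"] = np.bincount(km.labels_)
print(profile.round(2))
```

A table like `profile` is often easier to read than raw centroid arrays because it keeps the feature names and shows how many samples back each average.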

References

If you use this model in your research or application, please cite the dataset and the following reference for K-Means clustering:

Dataset: mltrev23/wine-clustering
K-Means Clustering: MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, No. 14, pp. 281-297).