Wine Clustering Model
Overview
This repository contains a K-Means clustering model trained on the mltrev23/wine-clustering dataset. The model groups wines into clusters based on their chemical properties, helping to uncover underlying patterns and relationships within the data.
Model Details
Algorithm
- K-Means Clustering: A popular unsupervised learning algorithm that partitions the data into K clusters by minimizing the within-cluster variance, that is, the sum of squared distances between each point and the centroid of its cluster. It works best when the groups in the input features are compact and reasonably well separated.
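To make the objective concrete, here is a minimal sketch that fits scikit-learn's KMeans and recomputes the within-cluster sum of squares by hand. The data is a synthetic two-group stand-in, not the wine dataset, and the parameter values (`n_init=10`, `random_state=0`) are assumptions chosen only for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two well-separated groups in 2-D (not the wine dataset).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Recompute the objective by hand: the sum of squared distances from each
# point to the centroid of the cluster it was assigned to.
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
wcss = float(((X - assigned_centers) ** 2).sum())

print(f"manual within-cluster sum of squares: {wcss:.4f}")
print(f"KMeans.inertia_:                      {kmeans.inertia_:.4f}")
```

The two printed values should agree up to floating-point error, since `inertia_` is scikit-learn's name for this quantity.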
Training Data
- Dataset: The model is trained on the mltrev23/wine-clustering dataset.
- Features: The dataset includes the following features: Alcohol, Malic_Acid, Ash, Ash_Alcanity, Magnesium, Total_Phenols, Flavanoids, Nonflavanoid_Phenols, Proanthocyanins, Color_Intensity, Hue, OD280, and Proline.
- Target: Since this is an unsupervised learning task, there is no explicit target variable. The goal is to find natural groupings within the data.
Model Performance
- Number of Clusters (K): The optimal number of clusters was determined using the elbow method or silhouette analysis, resulting in K = [Insert optimal K] clusters.
- Cluster Interpretation: Each cluster represents a group of wines with similar chemical properties. The characteristics of each cluster can be analyzed to understand the distinguishing features of the wines within that cluster.
(Replace the placeholder [Insert optimal K] with the actual number of clusters.)
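As an illustration of how K might be chosen, the sketch below runs both the elbow method and silhouette analysis over a small range of K. It is not the procedure used for this model: the data comes from `make_blobs` with three hand-placed centers, and the range of K and all parameter values are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in with three well-separated groups (not the wine dataset).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

inertias = {}
silhouettes = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                          # elbow method: look for the bend
    silhouettes[k] = silhouette_score(X, km.labels_)   # higher is better

# Pick the K with the highest mean silhouette score.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(f"K={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")
print("K with the highest silhouette score:", best_k)
```

On real data the two criteria can disagree, so it is worth reading the inertia curve and the silhouette scores side by side rather than relying on either alone.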
Requirements
To run the model and the snippets in this README, you'll need the following Python libraries:
pip install pandas
pip install numpy
pip install scikit-learn
pip install joblib
pip install matplotlib
Usage
Loading the Model
You can load the trained K-Means model using the following code snippet:
import joblib
# Load the trained model
model = joblib.load('wine_clustering_kmeans.model')
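After loading, it can help to confirm that the object is what you expect before using it. The sketch below shows a save-and-load round trip on a throwaway model and prints a few attributes worth checking. The file name `demo_kmeans.model`, the synthetic 13-feature data, and the choice of three clusters are all hypothetical and say nothing about the published model.

```python
import os
import tempfile

import joblib
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Throwaway model on synthetic 13-feature data (not the wine dataset).
X, _ = make_blobs(n_samples=60, centers=3, n_features=13, random_state=0)
fitted = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_kmeans.model")  # hypothetical file name
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# Basic sanity checks on the restored estimator.
print("type:", type(restored).__name__)
print("number of clusters:", restored.n_clusters)
print("features expected:", restored.n_features_in_)
print("centroid matrix shape:", restored.cluster_centers_.shape)
```

Because joblib files are pickle-based, only load model files from sources you trust, and ideally with the same scikit-learn version that was used to save them.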
Making Predictions
To assign new data points to the existing clusters, use the following code:
import pandas as pd
# Example input data (replace with your actual data)
data = pd.DataFrame({
    'Alcohol': [14.23, 13.20],
    'Malic_Acid': [1.71, 1.78],
    'Ash': [2.43, 2.14],
    'Ash_Alcanity': [15.6, 11.2],
    'Magnesium': [127, 100],
    'Total_Phenols': [2.80, 2.65],
    'Flavanoids': [3.06, 2.76],
    'Nonflavanoid_Phenols': [0.28, 0.26],
    'Proanthocyanins': [2.29, 1.28],
    'Color_Intensity': [5.64, 4.38],
    'Hue': [1.04, 1.05],
    'OD280': [3.92, 3.40],
    'Proline': [1065, 1050]
})
# Predict the cluster for each wine sample
cluster_predictions = model.predict(data)
print(cluster_predictions)
Evaluation
You can inspect the clustering visually by projecting the data onto its first two principal components and coloring each point by its assigned cluster:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Reuses `data` and `cluster_predictions` from the previous section.
# Use your full dataset here; the two example rows give only a trivial plot.
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=cluster_predictions)
plt.title("Wine Clusters")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()
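A two-component PCA plot is only as informative as the variance it retains, and PCA is sensitive to feature scale. In the example input above, Proline is on the order of 1000 while Hue is close to 1, so an unscaled projection may be dominated by a single feature. The sketch below shows the effect on synthetic data with one deliberately inflated column. It does not say whether this model was trained on scaled features, which the README leaves unstated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: four independent features, one on a much larger scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 0] *= 1000.0

# Fraction of total variance captured by each of the two plotted components.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
scaled_ratio = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X)
).explained_variance_ratio_

print("unscaled:    ", np.round(raw_ratio, 4))
print("standardized:", np.round(scaled_ratio, 4))
```

Checking `explained_variance_ratio_` before reading too much into the scatter plot is a cheap way to see how much of the data the two axes actually represent.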
Cluster Profiling
To understand what differentiates each cluster:
import numpy as np
# Each row of cluster_centers_ holds the mean feature values of one cluster.
# This assumes data.columns is in the same order as the training features.
cluster_centers = model.cluster_centers_
for i, center in enumerate(cluster_centers):
    print(f"Cluster {i}:")
    print({feature: value for feature, value in zip(data.columns, center)})
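If you have the data the model was fitted on, the same profile can be built as a table with pandas, which also makes it easy to add cluster sizes. This is a minimal sketch on a hypothetical six-row, two-feature stand-in, not the wine dataset. On a converged model, the per-cluster means should match `cluster_centers_`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in samples with two clearly separated groups.
samples = pd.DataFrame({
    "Alcohol": [12.0, 12.2, 12.1, 14.0, 14.1, 13.9],
    "Hue": [0.60, 0.70, 0.65, 1.10, 1.00, 1.05],
})
demo_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(samples)

# Per-cluster feature means, one row per cluster label.
profile = samples.assign(cluster=demo_model.labels_).groupby("cluster").mean()

# Number of samples assigned to each cluster.
sizes = pd.Series(demo_model.labels_).value_counts().sort_index()

print(profile)
print(sizes)
```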
Model Interpretability
The K-Means model provides cluster centroids that can be analyzed to understand the typical properties of wines in each cluster. You can use these centroids to profile each cluster and derive insights about the wines grouped together.
References
If you use this model in your research or application, please cite the dataset and the following reference for K-Means clustering:
- Dataset: mltrev23/wine-clustering
- K-Means Clustering: MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, No. 14, pp. 281-297).