Hi Alberto,
As a quick solution, you can encode the nominal variables as a set of binary ones (one-hot encoding; some people call them dummy variables). For ordinal variables, assign each level a numeric value that preserves the order. Then you can apply k-means with the Euclidean distance, since all of your inputs are now numeric. Remember to rescale or standardize your data before clustering, because the variables you are working with have quite different scales.
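A minimal sketch of that recipe in plain Python, using the four companies from the original question. The helper names (`one_hot`, `standardize`, `kmeans`) are hypothetical, the k-means is a bare-bones Lloyd's algorithm seeded with the first k rows, and leaving the 0/1 columns unscaled is an assumption; in practice a library implementation would be the usual choice.

```python
import math

# The four companies from the original question: (name, industry, employees, revenue in $M).
companies = [
    ("W", "manufacturer", 300, 100.0),
    ("X", "software developer", 30, 2.0),
    ("Y", "software developer", 40, 4.0),
    ("Z", "manufacturer", 400, 90.0),
]

def one_hot(values):
    """Turn a nominal column into 0/1 indicator columns, one per level."""
    levels = sorted(set(values))
    return [[1.0 if v == level else 0.0 for level in levels] for v in values]

def standardize(column):
    """Rescale a numeric column to mean 0 and (population) standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def kmeans(points, k, iterations=20):
    """Bare-bones Lloyd's algorithm; seeding with the first k points is an assumption."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

names = [c[0] for c in companies]
industry_cols = one_hot([c[1] for c in companies])
employees = standardize([c[2] for c in companies])
revenue = standardize([c[3] for c in companies])

# One row per company: indicator columns followed by the standardized numeric columns.
features = [ind + [e, r] for ind, e, r in zip(industry_cols, employees, revenue)]
labels = kmeans(features, k=2)
print(dict(zip(names, labels)))  # {'W': 0, 'X': 1, 'Y': 1, 'Z': 0}
```

With this toy data, W and Z land in one cluster and X and Y in the other, which is the grouping asked for in the original question.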
The above solution is a naive one, but it can be a good starting point. One possible drawback is that a single nominal variable with many levels turns into a large number of sparse dummy variables. The "industry" variable in your case might be one. You can reduce the number of levels by aggregating them manually, for example by combining related industries.
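To illustrate the manual aggregation idea, here is a small sketch that collapses detailed industry labels into broader groups before one-hot encoding. The industry labels, the `GROUPS` mapping, and the `aggregate` helper are all made up for illustration; none of them come from the actual data set.

```python
# Hypothetical fine-grained industry labels (not taken from the thread's data).
industries = [
    "auto parts maker",
    "steel mill",
    "mobile app studio",
    "SaaS vendor",
    "grocery chain",
]

# Hand-built mapping from detailed industry to a broader group (an assumption).
GROUPS = {
    "auto parts maker": "manufacturing",
    "steel mill": "manufacturing",
    "mobile app studio": "software",
    "SaaS vendor": "software",
    "grocery chain": "retail",
}

def aggregate(labels, mapping, default="other"):
    """Replace each label by its group; unmapped labels fall into a catch-all level."""
    return [mapping.get(label, default) for label in labels]

grouped = aggregate(industries, GROUPS)
print(len(set(industries)), "levels before,", len(set(grouped)), "levels after")
```

Fewer levels means fewer dummy columns, at the cost of treating all industries within a group as identical.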
Clustering categorical data in a more systematic way is still an active research topic. There is recent literature on representation learning that could be helpful, but I have not found code that is widely adopted. If you just need a quick solution, I'd say the one-hot encoding approach is the way to go.
Thanks,
Wenjun Zhou
------------------------------
Wenjun Zhou
Associate Professor
University of Tennessee Knoxville
Knoxville TN
------------------------------
Original Message:
Sent: 01-19-2023 15:25
From: Alberto Aparicio
Subject: Similarity Analysis
Thank you Paul. Your response was quite helpful. I read online there is a method to determine the clusters. Also thank you for suggesting Pro Bono Analytics!
https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/
It would be helpful to know if anyone has advice on how to create distances for nominal data. What has been done before? Are there any practical examples? Thank you.
------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
La Mesa CA
Original Message:
Sent: 01-18-2023 11:24
From: Paul Rubin
Subject: Similarity Analysis
A common clustering technique is k-means clustering, in which k is the (predefined) number of clusters you want. If you don't have a fixed number of clusters in mind, you could run it with varying numbers of clusters and decide which result makes you happiest.
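One way to compare runs with different numbers of clusters is to look at the within-cluster sum of squares for each k and see where it stops dropping sharply. The sketch below does that for the employee and revenue figures from the original question; the helper names are hypothetical, the k-means is a bare-bones Lloyd's algorithm seeded with the first k rows, and this criterion is only one of several that people use.

```python
import math

# Employees and revenue ($M) for companies W, X, Y, Z from the original question.
raw = [(300, 100.0), (30, 2.0), (40, 4.0), (400, 90.0)]

def standardize_columns(rows):
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    cols = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        cols.append([(x - mean) / std for x in col])
    return [list(p) for p in zip(*cols)]

def kmeans_inertia(points, k, iterations=20):
    """Lloyd's algorithm seeded with the first k points (an assumption).

    Returns the labels and the within-cluster sum of squared distances.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, inertia

points = standardize_columns(raw)
inertia_by_k = {k: kmeans_inertia(points, k)[1] for k in range(1, 4)}
for k, value in inertia_by_k.items():
    print(f"k={k}: within-cluster sum of squares = {value:.3f}")
```

On this toy data the sum of squares falls steeply from k=1 to k=2 and barely changes at k=3, which points to two clusters.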
Clustering algorithms tend to use the norm of the difference between attribute vectors of two entities, so you're going to have some adventures. For the ratio-scaled data, you will need to decide how to scale the attributes (e.g., how large a difference in revenue equates to a difference of one employee). For the nominal data (software developer v. manufacturer), I think you will need to invent some "distances" (difference between software developer and manufacturer = 10, difference between manufacturer and retailer = 17, ...). Unfortunately, I don't work with clustering; perhaps someone else knows some references on this.
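A rough sketch of the "invented distances" idea, using the four companies from the original question. The nominal distance of 10 between software developer and manufacturer is the illustrative figure from the paragraph above; the two scaling constants and the function names are assumptions. Because k-means needs coordinates rather than arbitrary pairwise distances, this sketch swaps in single-linkage agglomerative clustering, which works directly from a distance function.

```python
from itertools import combinations

# Companies from the original question: industry, employees, revenue in $M.
companies = {
    "W": ("manufacturer", 300, 100.0),
    "X": ("software developer", 30, 2.0),
    "Y": ("software developer", 40, 4.0),
    "Z": ("manufacturer", 400, 90.0),
}

# Invented distance between industries (illustrative value from the text above).
INDUSTRY_DISTANCE = {frozenset(["software developer", "manufacturer"]): 10.0}

# Hypothetical scaling: 100 employees or $25M in revenue each count as one unit.
EMPLOYEES_PER_UNIT = 100.0
REVENUE_PER_UNIT = 25.0

def distance(a, b):
    """Sum of the nominal distance and the scaled numeric differences."""
    ind_a, emp_a, rev_a = companies[a]
    ind_b, emp_b, rev_b = companies[b]
    nominal = 0.0 if ind_a == ind_b else INDUSTRY_DISTANCE[frozenset([ind_a, ind_b])]
    return (
        nominal
        + abs(emp_a - emp_b) / EMPLOYEES_PER_UNIT
        + abs(rev_a - rev_b) / REVENUE_PER_UNIT
    )

def single_linkage(names, k):
    """Merge the two closest clusters repeatedly until only k remain."""
    clusters = [{n} for n in names]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: min(
                distance(a, b) for a in clusters[pair[0]] for b in clusters[pair[1]]
            ),
        )
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps index i valid
        del clusters[j]
    return clusters

clusters = single_linkage(list(companies), 2)
print(sorted(sorted(c) for c in clusters))  # [['W', 'Z'], ['X', 'Y']]
```

The result depends entirely on the chosen distances and scales, so those numbers deserve more thought than they get here.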
Lastly, from your sig I gather you work for a 501(c)(3), so if this is work related and you want help, you might consider starting a project with INFORMS Pro Bono Analytics.
Cheers,
Paul
------------------------------
Paul Rubin
Professor Emeritus
Michigan State University
East Lansing MI
Original Message:
Sent: 01-17-2023 12:10
From: Alberto Aparicio
Subject: Similarity Analysis
Hello,
What statistical methods could be used to cluster companies that are similar? Say you have four companies W, X, Y, Z. Company W is a manufacturer, has 300 employees, and revenue is $100M. Company X is a software developer, has 30 employees, and revenue is $2M. Company Y is a software developer, has 40 employees, and revenue is $4M. Company Z is a manufacturer, has 400 employees, and revenue is $90M. I am looking for a statistical clustering-type method to conclude Companies W and Z are a cluster (similar) and Companies X and Y are a cluster (similar).
Any advice would be appreciated. Thank you.
Regards, Al
------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
La Mesa, CA
------------------------------