INFORMS Open Forum

• 1.  Similarity Analysis

Posted 19 days ago
Hello,

What statistical methods could be used to cluster companies that are similar? Say you have four companies W, X, Y, Z. Company W is a manufacturer, has 300 employees, and revenue is $100M. Company X is a software developer, has 30 employees, and revenue is $2M. Company Y is a software developer, has 40 employees, and revenue is $4M. Company Z is a manufacturer, has 400 employees, and revenue is $90M. I am looking for a statistical clustering-type method to conclude that Companies W and Z form one cluster (similar) and Companies X and Y form another.

Any advice would be appreciated. Thank you.

Regards, Al

------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
La Mesa, CA
------------------------------

• 2.  RE: Similarity Analysis

Posted 18 days ago

A common clustering technique is k-means clustering, in which k is the (predefined) number of clusters you want. If you don't have a fixed number of clusters in mind, you could run it with varying numbers of clusters and decide which result makes you happiest.
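As a rough illustration of running k-means with several values of k, here is a minimal pure-Python sketch on the four companies from the question. The seeding rule (the first k points), the use of raw units, and the choice of within-cluster sum of squares ("inertia") as the number to compare are all assumptions made for the example; in practice a library implementation such as scikit-learn's KMeans would normally be used.

```python
import math

# Toy data from the original question: (employees, revenue in $M).
# Features are left in raw units here purely to keep the sketch short.
companies = {
    "W": (300.0, 100.0),
    "X": (30.0, 2.0),
    "Y": (40.0, 4.0),
    "Z": (400.0, 90.0),
}


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_labels == labels and new_centroids == centroids:
            break  # nothing changed, so the algorithm has converged
        labels, centroids = new_labels, new_centroids
    # Within-cluster sum of squared distances for the final assignment.
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, inertia


names = list(companies)
points = list(companies.values())
for k in range(1, 4):
    labels, inertia = kmeans(points, k)
    print(k, dict(zip(names, labels)), round(inertia, 1))
```

The inertia always falls as k grows, so the usual approach is to look for the value of k after which the drop becomes small rather than to pick the minimum.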

Clustering algorithms tend to use the norm of the difference between attribute vectors of two entities, so you're going to have some adventures. For the ratio-scaled data, you will need to decide how to scale the attributes (e.g., how large a difference in revenue equates to a difference of one employee). For the nominal data (software developer v. manufacturer), I think you will need to invent some "distances" (difference between software developer and manufacturer = 10, difference between manufacturer and retailer = 17, ...). Unfortunately, I don't work with clustering; perhaps someone else knows some references on this.
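One way to picture the scaling decision and the invented nominal distances described above is the sketch below. Everything in it beyond the four companies is an assumption: the industry distances are made-up numbers on a 0 to 1 scale (including a hypothetical "retailer" entry), numeric differences are divided by each attribute's range, and all three parts are weighted equally.

```python
# Toy data from the original question: (industry, employees, revenue in $M).
companies = {
    "W": ("manufacturer", 300, 100.0),
    "X": ("software developer", 30, 2.0),
    "Y": ("software developer", 40, 4.0),
    "Z": ("manufacturer", 400, 90.0),
}

# Hypothetical, hand-assigned distances between industries (0 = same, 1 = far apart).
# These numbers are invented for illustration; "retailer" is not in the data.
INDUSTRY_DISTANCE = {
    frozenset(["manufacturer", "software developer"]): 1.0,
    frozenset(["manufacturer", "retailer"]): 0.6,
    frozenset(["software developer", "retailer"]): 0.8,
}


def industry_distance(a, b):
    """Look up the hand-assigned distance; identical industries are at distance 0."""
    return 0.0 if a == b else INDUSTRY_DISTANCE[frozenset([a, b])]


# Scale each numeric attribute by its observed range so both land on a 0-1 scale.
employees = [emp for _, emp, _ in companies.values()]
revenues = [rev for _, _, rev in companies.values()]
emp_range = max(employees) - min(employees)
rev_range = max(revenues) - min(revenues)


def distance(a, b, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of industry, employee, and revenue differences (weights are a choice)."""
    ind_a, emp_a, rev_a = companies[a]
    ind_b, emp_b, rev_b = companies[b]
    w_ind, w_emp, w_rev = weights
    return (
        w_ind * industry_distance(ind_a, ind_b)
        + w_emp * abs(emp_a - emp_b) / emp_range
        + w_rev * abs(rev_a - rev_b) / rev_range
    )


# For each company, find the most similar other company under this distance.
nearest = {
    a: min((b for b in companies if b != a), key=lambda b: distance(a, b))
    for a in companies
}
print(nearest)
```

A pairwise distance like this does not give coordinates, so it fits methods that accept a precomputed dissimilarity matrix (for example hierarchical clustering or k-medoids) better than k-means itself.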

Lastly, from your sig I gather you work for a 501(c)(3), so if this is work related and you want help, you might consider starting a project with INFORMS Pro Bono Analytics.

Cheers,

Paul

------------------------------
Paul Rubin
Professor Emeritus
Michigan State University
East Lansing MI
------------------------------

• 3.  RE: Similarity Analysis

Posted 17 days ago
Thank you Paul. Your response was quite helpful. I read online there is a method to determine the clusters. Also thank you for suggesting Pro Bono Analytics!

https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/

It would be helpful to know if anyone has advice on how to create distances for nominal data. What has been done before? Are there any practical examples? Thank you.

------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
La Mesa CA
------------------------------

• 4.  RE: Similarity Analysis

Posted 16 days ago
Hi Alberto,

As a quick solution, you may code the nominal variables into a set of binary ones (i.e., one-hot encoding; some people call them dummy variables). For ordinal variables, assign each level a numeric value. Then you can apply k-means using the Euclidean distance, as you are only dealing with numeric input. Remember to re-scale/standardize your data before clustering, since the variables you are working with have quite different scales.
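The preprocessing described above might look like the following minimal sketch, using only the Python standard library. It one-hot encodes industry, z-scores every column, and prints the pairwise Euclidean distances that k-means would then work with. Standardizing the dummy columns along with the numeric ones, and using the population standard deviation, are assumptions made for the example.

```python
import math
from statistics import mean, pstdev

# Toy data from the original question: (industry, employees, revenue in $M).
companies = {
    "W": ("manufacturer", 300, 100.0),
    "X": ("software developer", 30, 2.0),
    "Y": ("software developer", 40, 4.0),
    "Z": ("manufacturer", 400, 90.0),
}

# One binary column per industry level seen in the data.
industries = sorted({ind for ind, _, _ in companies.values()})


def one_hot(industry):
    """Return a list with 1.0 in the position of the matching industry, 0.0 elsewhere."""
    return [1.0 if industry == level else 0.0 for level in industries]


# Rows of purely numeric features: dummies first, then employees and revenue.
raw = {
    name: one_hot(ind) + [float(emp), rev]
    for name, (ind, emp, rev) in companies.items()
}

# Z-score each column so that no single attribute dominates the distance.
# (A constant column would have zero standard deviation and need special handling.)
columns = list(zip(*raw.values()))
means = [mean(col) for col in columns]
stds = [pstdev(col) for col in columns]
scaled = {
    name: [(v - m) / s for v, m, s in zip(row, means, stds)]
    for name, row in raw.items()
}

# Pairwise Euclidean distances on the standardized rows.
for a in scaled:
    for b in scaled:
        if a < b:
            print(a, b, round(math.dist(scaled[a], scaled[b]), 3))
```

With only two industry levels the two dummy columns mirror each other, so industry is effectively counted twice; dropping one of them or down-weighting the dummies is a reasonable variation.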

The above solution is a naive one and could be a good starting point. It has possible drawbacks, such as needing too many sparse dummy variables to represent the content of a single nominal variable. The "industry" variable in your case might be one. You may aggregate levels manually, for example by combining related industries.
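To make the aggregation idea concrete, here is a small sketch that maps fine-grained industry labels to broader sectors before encoding. The labels and the mapping are hypothetical and chosen only to show how the number of dummy columns shrinks.

```python
# Hypothetical fine-grained industry labels; none of these come from the thread.
industry_labels = [
    "auto parts manufacturer",
    "furniture manufacturer",
    "electronics manufacturer",
    "mobile app developer",
    "enterprise software developer",
    "game developer",
]

# Hand-built mapping from each detailed label to a broader sector (an assumption).
SECTOR = {
    "auto parts manufacturer": "manufacturing",
    "furniture manufacturer": "manufacturing",
    "electronics manufacturer": "manufacturing",
    "mobile app developer": "software",
    "enterprise software developer": "software",
    "game developer": "software",
}


def one_hot_columns(labels):
    """One-hot encoding needs one column per distinct label."""
    return sorted(set(labels))


aggregated = [SECTOR[label] for label in industry_labels]
print(len(one_hot_columns(industry_labels)), "->", len(one_hot_columns(aggregated)))
```

The trade-off is that companies within the same sector become indistinguishable on this variable, so the grouping should reflect distinctions that matter for the analysis.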

Clustering categorical data in a more systematic way is still an active research topic. There is recent literature on representation learning that could be helpful, but I have not found code that is widely adopted. If you just need a quick solution, I would say the one-hot encoding approach is the way to go.

Thanks,
Wenjun Zhou

------------------------------
Wenjun Zhou
Associate Professor
University of Tennessee Knoxville
Knoxville TN
------------------------------

• 5.  RE: Similarity Analysis

Posted 12 days ago
Hi Wenjun,

Splendid! Thank you so much for responding. This was quite helpful. I like the idea of one-hot encoding, but it makes me wonder how I can apply "distance" across industries. For instance, is manufacturing closer to retail or to finance?

This exercise made me wonder why these "similarity" methods have not been more widely adopted. Think about it. There are thousands of private equity investment opportunities. As a preliminary step, an M&A firm could pull all these investment opportunities and determine which firms are most similar, and then, as a next step, determine which companies are the least risky investment opportunities with the greatest yield. Better yet, why not take EDGAR data and grab 10-Ks, 10-Qs, etc.? Anyways, these are just thoughts I am having. If someone figures this out, give me a call.

Thank you Wenjun. Appreciate it.

Regards, Al

------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
La Mesa CA
------------------------------