# INFORMS Open Forum

## Similarity Analysis

• #### 1.  Similarity Analysis

Posted 01-17-2023 12:11
Hello,

What statistical methods could be used to cluster companies that are similar? Say you have four companies W, X, Y, Z. Company W is a manufacturer, has 300 employees, and revenue is $100M. Company X is a software developer, has 30 employees, and revenue is $2M. Company Y is a software developer, has 40 employees, and revenue is $4M. Company Z is a manufacturer, has 400 employees, and revenue is $90M. I am looking for a statistical clustering-type method to conclude Companies W and Z are a cluster (similar) and Companies X and Y are a cluster (similar).

Any advice would be appreciated. Thank you.

Regards, Al

------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
La Mesa, CA
------------------------------

• #### 2.  RE: Similarity Analysis

Posted 01-18-2023 11:25

A common clustering technique is k-means clustering, in which k is the (predefined) number of clusters you want. If you don't have a fixed number of clusters in mind, you could run it with varying numbers of clusters and decide which result makes you happiest.
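A minimal sketch of running k-means for several values of k, under stated assumptions: only the two numeric attributes are used, both are log10-transformed (an illustrative choice, not something from the thread), and the first k points seed the centroids so the run is repeatable. `kmeans` is a hypothetical helper written for this illustration, not a library function.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Within-cluster sum of squared distances.
    wss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, wss

# (employees, revenue in $M) for W, X, Y, Z, log10-transformed (assumption).
raw = [(300, 100.0), (30, 2.0), (40, 4.0), (400, 90.0)]
points = [(math.log10(e), math.log10(r)) for e, r in raw]

results = {k: kmeans(points, k) for k in (1, 2, 3)}
for k, (labels, wss) in results.items():
    print(k, labels, round(wss, 4))
```

The within-cluster sum of squares tends to shrink as k grows, so a common heuristic is to pick the k after which the improvement levels off.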

Clustering algorithms tend to use the norm of the difference between attribute vectors of two entities, so you're going to have some adventures. For the ratio-scaled data, you will need to decide how to scale the attributes (e.g., how large a difference in revenue equates to a difference of one employee). For the nominal data (software developer v. manufacturer), I think you will need to invent some "distances" (difference between software developer and manufacturer = 10, difference between manufacturer and retailer = 17, ...). Unfortunately, I don't work with clustering; perhaps someone else knows some references on this.
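One way to make those two decisions concrete is sketched below. It assumes each numeric attribute is divided by its observed range and the per-attribute distances are averaged, in the spirit of Gower's coefficient; the industry distance table is invented purely for illustration, and every name in the code is hypothetical.

```python
# (industry, employees, revenue in $M) for the four companies in the question.
companies = {
    "W": ("manufacturer", 300, 100.0),
    "X": ("software", 30, 2.0),
    "Y": ("software", 40, 4.0),
    "Z": ("manufacturer", 400, 90.0),
}

# Invented distance between industries (assumption); identical industries are 0.
INDUSTRY_DISTANCE = {frozenset(["manufacturer", "software"]): 1.0}

def industry_distance(a, b):
    return 0.0 if a == b else INDUSTRY_DISTANCE[frozenset([a, b])]

# Range of each numeric attribute, used to put both on a 0-to-1 scale.
emp_values = [c[1] for c in companies.values()]
rev_values = [c[2] for c in companies.values()]
emp_range = max(emp_values) - min(emp_values)
rev_range = max(rev_values) - min(rev_values)

def distance(a, b):
    """Average of the per-attribute distances between companies a and b."""
    ind_a, emp_a, rev_a = companies[a]
    ind_b, emp_b, rev_b = companies[b]
    parts = [
        industry_distance(ind_a, ind_b),
        abs(emp_a - emp_b) / emp_range,
        abs(rev_a - rev_b) / rev_range,
    ]
    return sum(parts) / len(parts)

def nearest(name):
    """Closest other company under the mixed distance."""
    return min((o for o in companies if o != name), key=lambda o: distance(name, o))

for name in companies:
    print(name, "->", nearest(name))
```

A pairwise distance like this suits methods that accept an arbitrary distance matrix (hierarchical clustering, for example) rather than k-means, which needs to average attribute vectors.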

Lastly, from your sig I gather you work for a 501(c)(3), so if this is work related and you want help, you might consider starting a project with INFORMS Pro Bono Analytics.

Cheers,

Paul

------------------------------
Paul Rubin
Professor Emeritus
Michigan State University
East Lansing MI
------------------------------

• #### 3.  RE: Similarity Analysis

Posted 01-19-2023 15:25
Thank you Paul. Your response was quite helpful. I read online there is a method to determine the clusters. Also thank you for suggesting Pro Bono Analytics!

https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/

It would be helpful to know if anyone has advice on how to create distances for nominal data. What has been done before? Are there any practical examples? Thank you.

------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
La Mesa CA
------------------------------

• #### 4.  RE: Similarity Analysis

Posted 01-20-2023 12:36
Hi Alberto,

As a quick solution, you may code the nominal variables into a set of binary ones (i.e., one-hot encoding; some people call them dummy variables). For ordinal variables, assign each level a numeric value. Then you can apply k-means using the Euclidean distance, as you are only dealing with numeric input. Remember to re-scale/standardize your data before clustering, since the variables you are working with have quite different scales.
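A minimal sketch of that preprocessing, using only the Python standard library. It assumes z-score standardization of the numeric columns and leaves the one-hot columns at 0/1, which is one choice among several; the variable and helper names are hypothetical.

```python
import math
from statistics import mean, pstdev

# (name, industry, employees, revenue in $M) from the original question.
rows = [
    ("W", "manufacturer", 300, 100.0),
    ("X", "software", 30, 2.0),
    ("Y", "software", 40, 4.0),
    ("Z", "manufacturer", 400, 90.0),
]

# One binary column per industry level, in a fixed (sorted) order.
industries = sorted({industry for _, industry, _, _ in rows})

def one_hot(value, levels):
    return [1.0 if value == level else 0.0 for level in levels]

def standardize(column):
    """Z-scores: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

employees_z = standardize([r[2] for r in rows])
revenue_z = standardize([r[3] for r in rows])

# Final numeric feature vector per company: one-hot industry + scaled numerics.
features = {
    name: one_hot(industry, industries) + [emp_z, rev_z]
    for (name, industry, _, _), emp_z, rev_z in zip(rows, employees_z, revenue_z)
}

# Euclidean distances on the encoded vectors, as k-means would use them.
for a, b in [("W", "Z"), ("X", "Y"), ("W", "X")]:
    print(a, b, round(math.dist(features[a], features[b]), 3))
```

With 0/1 columns, two companies in different industries differ by sqrt(2) on the industry part regardless of which industries they are, so this encoding treats every pair of distinct industries as equally far apart.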

The above solution is a naive one and could be a good starting point. There are possible drawbacks, such as ending up with many sparse dummy variables to represent a single nominal variable. The "industry" variable in your case might be one. You may aggregate levels manually, for example by combining related industries.

Clustering categorical data in a more systematic way is still an active research topic. There is recent literature on representation learning that could be helpful, but I have not found code that is widely adopted. If you just need a quick solution, I'll say the one-hot encoding approach is the way to go.

Thanks,
Wenjun Zhou

------------------------------
Wenjun Zhou
Associate Professor
University of Tennessee Knoxville
Knoxville TN
------------------------------

• #### 5.  RE: Similarity Analysis

Posted 01-23-2023 17:55
Hi Wenjun,

Splendid! Thank you so much for responding. This was quite helpful. I like the idea of one-hot encoding, but it makes me wonder how I can apply "distance" across industries. For instance, is manufacturing closer to retail or to finance?

This exercise made me wonder why and how it is possible these "similarity" methods have not been widely adopted? Think about it. There are thousands of private equity investment opportunities. As a preliminary step, an M&A firm could pull all these investment opportunities and determine which firms are most similar and as a cascading event determine which companies are the least risky investment opportunities with the greatest yield. Better yet, take EDGAR data, why not grab 10-Ks or 10-Qs, etc. Anyways, these are just thoughts I am having. If someone figures this out, give me a call.

Thank you Wenjun. Appreciate it.

Regards, Al

------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
La Mesa CA
------------------------------

• #### 6.  RE: Similarity Analysis

Posted 05-22-2023 04:38

Hi Alberto,

Depending on how many companies you are trying to cluster, you may have problems with one-hot-encoding. As the number of items in the frontier you're trying to cluster increases, you start to approach a theoretical limit relating to infinite dimensional vector spaces. One-hot-encoding is good for small numbers of items (10^3-10^4), but degrades for large problem sets.

K-means is a non-linear greedy search algorithm, so it tends to converge to highly sub-optimal local minima. You may get better mileage from higher order greedy search that makes use of dynamic programming, and also keep in mind K-means is just a base case of the Expectation-Maximisation algorithm (EM) so you can tune it quite easily if you know enough about EM and non-linear optimisation.
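The sensitivity to starting points can be seen on a toy example. The sketch below is not the dynamic-programming or EM tuning described above; it swaps in the simpler and more common mitigation of rerunning Lloyd's algorithm from several initializations and keeping the best result. The data set and the helper `lloyd_1d` are invented for illustration, and trying every possible set of starting points is only feasible because the example is tiny (random restarts are the usual substitute).

```python
from itertools import combinations

def lloyd_1d(points, start, max_iter=100):
    """Lloyd's algorithm on 1-D data; returns the within-cluster sum of squares."""
    centroids = list(start)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: abs(p - centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return sum((p - centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Three well-separated pairs (invented data).
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

# Run from every choice of three starting points and record the final WSS.
wss_by_start = {start: lloyd_1d(points, start) for start in combinations(points, 3)}
best_start = min(wss_by_start, key=wss_by_start.get)

print("worst:", max(wss_by_start.values()))
print("best: ", wss_by_start[best_start], "from start", best_start)
```

Starting from (0, 1, 10), the algorithm stops with the four rightmost points lumped into one cluster, while a start with one point from each pair recovers the three pairs.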

A family of distance measures to consider is local neighbourhood methods, which are not discussed anywhere of significance in the data mining literature. The best description of local distance measures is in the classic Information Retrieval text Baeza-Yates [1]. If you want me to copy the section for you please contact me directly.

If you do choose to pursue any of the metrics in [1], I'd be very interested to hear about your experiences.

Best Regards,

ap

[1] Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. "5.3 Automatic Local Analysis." In Modern Information Retrieval, 123–31. Addison-Wesley, 1999.

------------------------------
Andrew Prendergast
Principal Researcher
VizDynamics
South Yarra VIC
------------------------------

• #### 7.  RE: Similarity Analysis

Posted 05-24-2023 14:23

Thank you, Andrew.

I used a couple of R packages to run the k-prototypes clustering algorithm. What are your thoughts on this method for a dataset that contains both discrete and continuous data?
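For readers unfamiliar with the method, here is a rough Python sketch of the dissimilarity k-prototypes is usually described as using: squared Euclidean distance on the numeric attributes plus a weight gamma times the number of mismatched categorical attributes, with each prototype holding numeric means and categorical modes. This is not the R packages' code; the standardized values, `gamma = 1.0`, and all function names are assumptions made for illustration.

```python
from statistics import mean, mode

def dissimilarity(record, prototype, gamma):
    """Squared Euclidean part plus gamma times the count of categorical mismatches."""
    (num_a, cat_a), (num_b, cat_b) = record, prototype
    numeric = sum((a - b) ** 2 for a, b in zip(num_a, num_b))
    mismatches = sum(a != b for a, b in zip(cat_a, cat_b))
    return numeric + gamma * mismatches

def make_prototype(members):
    """Numeric attributes are averaged; categorical attributes take the most common level."""
    nums = [m[0] for m in members]
    cats = [m[1] for m in members]
    return (
        [mean(col) for col in zip(*nums)],
        [mode(col) for col in zip(*cats)],
    )

# Hypothetical standardized (employees, revenue) values plus industry.
W = ([0.67, 1.11], ["manufacturer"])
X = ([-1.01, -1.02], ["software"])
Y = ([-0.94, -0.98], ["software"])
Z = ([1.29, 0.89], ["manufacturer"])

proto_wz = make_prototype([W, Z])
proto_xy = make_prototype([X, Y])

gamma = 1.0  # weight of the categorical part (assumed value)
for name, rec in [("W", W), ("X", X), ("Y", Y), ("Z", Z)]:
    d_wz = dissimilarity(rec, proto_wz, gamma)
    d_xy = dissimilarity(rec, proto_xy, gamma)
    print(name, round(d_wz, 3), round(d_xy, 3))
```

The choice of gamma plays the same role as the scaling question raised earlier in the thread: it sets how many units of numeric difference one categorical mismatch is worth.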

------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
San Diego, CA
------------------------------

• #### 8.  RE: Similarity Analysis

Posted 05-26-2023 08:59

There are a number of clustering methods that are well known in the literature, available in software, and easy to program. The key ingredient is the "metric of similarity". Often the starting point is a "distance": differences in employee count and revenue. However, these are on different scales, so you need a scale adjustment. The key question is: what is the purpose of the clustering?

------------------------------
Ken Fordyce
director analytics without borders
Arkieva
Wilmington DE
------------------------------

• #### 9.  RE: Similarity Analysis

Posted 05-26-2023 11:56

Thank you, Ken. Would you kindly list the clustering methods known to you? And I am curious, what software would you recommend?

------------------------------
Alberto Aparicio
Data Analyst
Charitable Adult Rides & Services, Inc.
San Diego, CA
------------------------------

• #### 10.  RE: Similarity Analysis

Posted 05-26-2023 18:16
Let's see if email replies work...

@Alberto I've sent you a PDF on LinkedIn from Baeza-Yates. You won't find these distance measures in the data mining literature.

For a good list of classical distance measures, refer to "Data Mining" by Han & Kamber.

ap