Hi Alberto,

Depending on how many companies you are trying to cluster, you may have problems with one-hot encoding. Every distinct value you encode adds a dimension, so as the number of items you're trying to cluster increases, the vectors become very high-dimensional and sparse, and distances between them become less and less informative (the curse of dimensionality). One-hot encoding is good for small numbers of items (10^3-10^4), but degrades for large problem sets.
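
As a rough sketch of what the encoding looks like for the four companies in your question, here is a small standard-library Python example. The function name `encode` and the log10 scaling of employees and revenue are my own illustrative assumptions, not something your data necessarily calls for. The point is only that the vector length grows by one for every distinct industry value.

```python
import math

# The four companies from the question: (industry, employees, revenue in $M).
companies = {
    "W": ("manufacturer", 300, 100.0),
    "X": ("software developer", 30, 2.0),
    "Y": ("software developer", 40, 4.0),
    "Z": ("manufacturer", 400, 90.0),
}

def encode(companies):
    """One-hot encode the industry and append log10-scaled numeric features.

    The log scaling is an assumption made for this sketch so that revenue
    does not swamp the other features; other scalings are equally valid.
    """
    categories = sorted({industry for industry, _, _ in companies.values()})
    vectors = {}
    for name, (industry, employees, revenue) in companies.items():
        # One slot per distinct industry: 1.0 for the match, 0.0 elsewhere.
        one_hot = [1.0 if industry == c else 0.0 for c in categories]
        vectors[name] = one_hot + [math.log10(employees), math.log10(revenue)]
    return categories, vectors

categories, vectors = encode(companies)
for name, vec in vectors.items():
    print(name, [round(v, 2) for v in vec])

# Each new distinct industry would add one more dimension to every vector.
print("dimensions:", len(categories) + 2)
```

With two industries the vectors have four dimensions; with thousands of distinct category values they would have thousands, almost all of them zero.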

K-means is a greedy local search over a non-convex objective, so it can converge to highly sub-optimal local minima, and the result depends on where it starts. You may get better mileage from higher order greedy search that makes use of dynamic programming. Also keep in mind that K-means is just a special case of the Expectation-Maximisation algorithm (EM), the one you get with hard cluster assignments, so you can tune it quite easily if you know enough about EM and non-linear optimisation.
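
One cheap way to reduce the local-minimum problem is to run K-means from several random starting points and keep the run with the lowest within-cluster sum of squares. The sketch below is a minimal standard-library version of that idea, not a tuned implementation. The helper names `kmeans` and `kmeans_restarts`, the restart count, the seed, and the hard-coded feature vectors (one-hot industry plus log10 of employees and revenue) are all assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from a random initialisation."""
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Within-cluster sum of squared distances for this run.
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, inertia

def kmeans_restarts(points, k, n_restarts=20, seed=0):
    """Run kmeans several times and keep the lowest-inertia result."""
    rng = random.Random(seed)
    runs = (kmeans(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[1])

# Hypothetical feature vectors for W, X, Y, Z:
# [is_manufacturer, is_software_developer, log10(employees), log10(revenue in $M)]
names = ["W", "X", "Y", "Z"]
points = [
    [1.0, 0.0, math.log10(300), math.log10(100)],
    [0.0, 1.0, math.log10(30), math.log10(2)],
    [0.0, 1.0, math.log10(40), math.log10(4)],
    [1.0, 0.0, math.log10(400), math.log10(90)],
]

labels, inertia = kmeans_restarts(points, k=2)
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

On four well-separated points the restarts make little difference; they start to matter once you have many companies and clusters that overlap.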

A family of distance measures to consider is local neighbourhood methods, which are not discussed anywhere of significance in the data mining literature. The best description of local distance measures that I know of is in the classic Information Retrieval text by Baeza-Yates and Ribeiro-Neto [1]. If you would like me to copy the section for you, please contact me directly.

If you do choose to pursue any of the metrics in [1], I'd be very interested to hear about your experiences.

Best Regards,

ap

[1] Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. "5.3 Automatic Local Analysis." In Modern Information Retrieval, 123–31. Pearson, Addison-Wesley, 1999.

------------------------------

Andrew Prendergast

Principal Researcher

VizDynamics

South Yarra VIC

------------------------------

Original Message:

Sent: 01-17-2023 12:10

From: Alberto Aparicio

Subject: Similarity Analysis

Hello,

What statistical methods could be used to cluster companies that are similar? Say you have four companies W, X, Y, Z. Company W is a manufacturer, has 300 employees, and revenue is $100M. Company X is a software developer, has 30 employees, and revenue is $2M. Company Y is a software developer, has 40 employees, and revenue is $4M. Company Z is a manufacturer, has 400 employees, and revenue is $90M. I am looking for a statistical clustering-type method to conclude Companies W and Z are a cluster (similar) and Companies X and Y are a cluster (similar).

Any advice would be appreciated. Thank you.

Regards, Al

------------------------------

Alberto Aparicio

Data Analyst

Charitable Adult Rides & Services, Inc.

La Mesa, CA

------------------------------