A common clustering technique is k-means clustering, in which k is the (predefined) number of clusters you want. If you don't have a fixed number of clusters in mind, you could run it with varying numbers of clusters and decide which result makes you happiest.

Clustering algorithms tend to use the norm of the difference between attribute vectors of two entities, so you're going to have some adventures. For the ratio-scaled data, you will need to decide how to scale the attributes (e.g., how large a difference in revenue equates to a difference of one employee). For the nominal data (software developer vs. manufacturer), I think you will need to invent some "distances" (difference between software developer and manufacturer = 10, difference between manufacturer and retailer = 17, ...). Unfortunately, I don't work with clustering; perhaps someone else knows some references on this.

Lastly, from your sig I gather you work for a 501(c)(3), so if this is work related and you want help, you might consider starting a project with INFORMS Pro Bono Analytics.

Cheers,
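To make the scaling and invented-distance idea concrete, here is a minimal sketch of a mixed-type distance function. The revenue-to-employee conversion rate and the software/retailer distance of 12 are purely hypothetical assumptions for illustration (the 10 and 17 come from the example above):

```python
import math

# Hypothetical scale choice: treat $200,000 of revenue as "one employee" of distance.
REVENUE_PER_EMPLOYEE_UNIT = 200_000

# Invented nominal "distances" between industries; 10 and 17 from the example
# above, 12 is an illustrative made-up value.
INDUSTRY_DIST = {
    frozenset({"software", "manufacturer"}): 10.0,
    frozenset({"manufacturer", "retailer"}): 17.0,
    frozenset({"software", "retailer"}): 12.0,
}

def company_distance(a, b):
    """Distance between two companies given as (employees, revenue, industry)."""
    emp_a, rev_a, ind_a = a
    emp_b, rev_b, ind_b = b
    d_emp = emp_a - emp_b
    d_rev = (rev_a - rev_b) / REVENUE_PER_EMPLOYEE_UNIT  # rescale to "employee units"
    d_ind = 0.0 if ind_a == ind_b else INDUSTRY_DIST[frozenset({ind_a, ind_b})]
    return math.sqrt(d_emp**2 + d_rev**2 + d_ind**2)

print(company_distance((50, 10_000_000, "software"),
                       (45, 9_000_000, "software")))  # sqrt(5^2 + 5^2) ≈ 7.07
```

The whole clustering result hinges on these conversion choices, so it is worth trying a few and checking whether the clusters stay stable.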
Depending on how many companies you are trying to cluster, you may have problems with one-hot encoding. As the number of distinct items you're trying to cluster increases, the dimensionality of the encoded vectors grows with it, and distance measures become less informative in very high-dimensional spaces. One-hot encoding is fine for small numbers of items (10^3-10^4), but degrades for large problem sets.
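For readers unfamiliar with the term, one-hot encoding maps a nominal value to a 0/1 indicator vector with one dimension per category, which is exactly why dimensionality grows with the number of distinct categories. A minimal sketch:

```python
def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector over a fixed category list.

    One dimension per category: a new category means a new dimension,
    which is how the dimensionality blow-up described above arises.
    """
    return [1 if value == c else 0 for c in categories]

industries = ["software", "manufacturer", "retailer"]
print(one_hot("manufacturer", industries))  # [0, 1, 0]
```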
K-means is a non-linear greedy search algorithm, so it tends to converge to highly sub-optimal local minima. You may get better mileage from a higher-order greedy search that makes use of dynamic programming. Also keep in mind that K-means is just a special case of the Expectation-Maximisation (EM) algorithm, so you can tune it quite easily if you know enough about EM and non-linear optimisation.
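To illustrate the local-minimum issue, here is a minimal sketch of plain Lloyd's-algorithm k-means on 1-D data; the result depends on the random initial centers, which is why practitioners usually run it from several seeds and keep the solution with the lowest within-cluster sum of squares:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on 1-D data.

    Alternates assignment (each point to its nearest center) and update
    (each center to the mean of its assigned points) until nothing moves.
    Converges to a local optimum that depends on the initial centers.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)

data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans(data, 2))  # roughly [1.0, 10.0]
```

On badly separated or badly scaled data, different seeds can land on quite different partitions, which is the sub-optimality the post above is warning about.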
A family of distance measures to consider is local neighbourhood methods, which are not discussed anywhere of significance in the data mining literature. The best description of local distance measures is in the classic Information Retrieval text by Baeza-Yates and Ribeiro-Neto (reference below). If you want me to copy the section for you, please contact me directly.
If you do choose to pursue any of the metrics in that chapter, I'd be very interested to hear about your experiences.
Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. "5.3 Automatic Local Analysis." In Modern Information Retrieval, 123–31. Addison-Wesley, 1999.
Thank you, Andrew. I used a couple of R packages to run the k-prototypes clustering algorithm. What are your thoughts on this method for a dataset that contains discrete and continuous data?
There are a number of clustering methods that are well known in the literature, available in software, and easy to program. The key ingredient is the metric of similarity; often the starting point is a "distance" based on differences in employee count, revenue, and so on. However, these attributes are on different scales, so you need a scale adjustment. The key question is: cluster for what purpose?
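A common scale adjustment is z-score standardization: rescale each attribute to mean 0 and standard deviation 1 so that employee counts and revenues contribute comparably to the distance. A minimal sketch (the example figures are made up):

```python
def z_scores(values):
    """Standardize a column to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / sd for v in values]

# Hypothetical columns for four companies.
employees = [10, 50, 200, 1000]
revenue = [1e6, 8e6, 30e6, 2e8]

print(z_scores(employees))
print(z_scores(revenue))
```

After standardization, a one-standard-deviation difference in revenue counts the same as a one-standard-deviation difference in headcount; whether that is the right trade-off depends on the purpose of the clustering, as noted above.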
Thank you, Ken. Would you kindly list the clustering methods known to you? And I am curious, what software would you recommend?