Many violent crimes occur in the United States, and a common belief is that bigger cities and more populous states have higher crime rates. In this project, we analyze a dataset describing 1973 violent crime rates by US state. The crimes considered are assault, murder, and rape; the percentage of the population living in urban areas is also included. Our goal is to use unsupervised machine learning methods, namely cluster heat maps, Principal Component Analysis, k-means clustering, hierarchical clustering, and DBSCAN, to understand how violent crime differs between states.
Jupyter Lab 3.4.4
Python 3.10
USArrests.csv
The dataset is available as USArrests.csv. It has 50 observations (one per state) on 4 variables:

- Murder: murder arrests per 100,000 residents
- Assault: assault arrests per 100,000 residents
- UrbanPop: percent of the population living in urban areas
- Rape: rape arrests per 100,000 residents
You can read more about the dataset here.
Our preliminary analysis shows that Murder and Assault are highly positively correlated, with a correlation of about 0.8. Rape is also correlated with both Assault and Murder, though not as strongly. There is almost no correlation between Urban Population and either Murder or Assault.
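These correlations can be checked directly with pandas. The sketch below is a minimal setup, assuming the data is read from USArrests.csv with the standard column names and that the features are standardized with StandardScaler before the PCA and clustering steps; the later code refers to this data frame as df and to the standardized feature matrix as X.

Python Code:

import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load the data; the first (unnamed) column holds the state names
df = pd.read_csv('USArrests.csv')
features = ['Murder', 'Assault', 'UrbanPop', 'Rape']
# Pairwise correlations between the four variables
print(df[features].corr().round(2))
# Standardized feature matrix used for PCA and clustering below
X = StandardScaler().fit_transform(df[features])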
The cluster heat map makes this visible as well: Urban Population does not group with Murder and Assault, so states with a larger urban population do not necessarily have more murders or assaults. There is some relationship among Rape, Murder, and Assault, and Murder and Assault in particular behave similarly across states.
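A cluster heat map like the one described here can be drawn with seaborn's clustermap, which reorders both rows and columns by similarity. This snippet is illustrative and assumes the df loaded above.

Python Code:

import seaborn as sns
# Cluster heat map of the states: rows and columns are reordered by similarity.
# standard_scale=1 rescales each column to the [0, 1] range so that variables
# with very different units (e.g. Assault vs. Murder) are comparable.
sns.clustermap(df.set_index('Unnamed: 0')[['Murder', 'Assault', 'UrbanPop', 'Rape']],
               standard_scale=1)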
Principal Component Analysis (PCA) is one of the most widely used unsupervised machine learning algorithms, with applications in exploratory data analysis, dimensionality reduction, information compression, and data de-noising. PCA is a dimensionality reduction technique that transforms the features in a dataset into a smaller number of features called principal components while trying to retain as much of the information in the original dataset as possible. Since we have 4 variables, we have a four-dimensional dataset. PCA can take 4 or more variables and produce a two-dimensional PCA plot. It can also tell us which variable is the most valuable for clustering the data, and how faithful the two-dimensional graph is to the original data.
Principal Component Analysis first calculates the average of each variable and uses these averages to find the center of the data. It then shifts the data so that the center is at the origin. From here, it finds the principal components. The principal components are vectors, but they are not chosen at random. The first principal component (PC1) is computed so that it explains the greatest amount of variance in the original features; equivalently, it is the best-fit line through the origin that minimizes the sum of squared perpendicular distances from the data points to the line, and it is a linear combination of the original variables. The unit-length direction vector for PC1 is called an eigenvector (or singular vector), and the sum of squared distances of the projected points from the origin is the eigenvalue for PC1. The second component (PC2) is orthogonal to the first and explains the greatest amount of variance left after PC1. PC3 and PC4 are then found in the same way, each perpendicular to the earlier components and passing through the origin. The number of principal components is the number of variables or the number of samples, whichever is smaller.
Python Code:
# PCA model (imports added for completeness; X is the standardized feature matrix built earlier)
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
pca_model = PCA()
X_PCA = pca_model.fit_transform(X)  # project the data onto the principal components
df_plot = pd.DataFrame(X_PCA, columns=['PC1', 'PC2', 'PC3', 'PC4'])
df_plot.head()
# Plot the states on the first two principal components, labeled by state name
fig, ax1 = plt.subplots()
ax1.set_xlim(X_PCA[:, 0].min() - 1, X_PCA[:, 0].max() + 1)
ax1.set_ylim(X_PCA[:, 1].min() - 1, X_PCA[:, 1].max() + 1)
for i, name in enumerate(df['Unnamed: 0'].values):  # the unnamed first column holds the state names
    ax1.annotate(name, (X_PCA[i, 0], X_PCA[i, 1]), ha='center', fontsize=10)
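To connect this code back to the eigenvector and eigenvalue description above, the explained variances reported by PCA are the eigenvalues of the covariance matrix of the standardized data. The quick check below is purely illustrative.

Python Code:

import numpy as np
# Eigenvalues of the covariance matrix, sorted from largest to smallest,
# should match the variances explained by the fitted principal components
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
print(eigvals)
print(pca_model.explained_variance_)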
Once all the principal components have been computed, the eigenvalues can be used to determine the proportion of variation that each PC accounts for. These percentages can then be shown in a scree plot, which is a graphical representation of how much of the variation each PC explains.
Python Code:
# Scree plot: proportion of the total variance explained by each principal component
var_ratio = pca_model.explained_variance_ratio_
plt.plot([1, 2, 3, 4], var_ratio, '-o')
plt.xlabel('Principal component')
plt.ylabel('Proportion of variance explained')
In this scree plot, we can see that PC1, PC2, and PC3 account for the vast majority of the variation. A three-dimensional graph using just PC1, PC2, and PC3 would therefore be a good approximation of the four-dimensional data, since it accounts for 95.66% of the variation; a two-dimensional graph using PC1 and PC2 would still account for 86.76% of the variation.
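These cumulative percentages can be read off the explained variance ratios directly, for example:

Python Code:

import numpy as np
# Cumulative proportion of variance explained by the first k components;
# the second and third values correspond to the 86.76% and 95.66% quoted above
print(np.cumsum(pca_model.explained_variance_ratio_))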
We will now cluster the states into four clusters using k-means. K-means picks initial cluster centers and assigns every point to its nearest center using Euclidean distance. It then recalculates the mean of each cluster and reassigns the points based on the new means, repeating until the cluster assignments no longer change. The algorithm is restarted with different initial centers as many times as we tell it to, and it returns the clustering with the lowest total within-cluster variation.
Python Code:
from sklearn.cluster import KMeans
# Fit k-means with four clusters and assign a cluster label to each state
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
# Plot the states on the first two (standardized) feature columns, colored by cluster
fig, ax1 = plt.subplots()
ax1.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50)
for i, name in enumerate(df['Unnamed: 0'].values):
    ax1.annotate(name, (X[i, 0], X[i, 1]), ha='center', fontsize=10)
Python Code:
cluster1 = df2[kmeans.labels_ == i].tolist()

Here `i` is the cluster label (an integer from 0 to 3) and `df2` holds the state names.
Clusters | States |
---|---|
1 | Connecticut, Delaware, Hawaii, Indiana, Kansas, Massachusetts, New Jersey, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, Utah, Virginia, Washington, Wyoming |
2 | Alabama, Arkansas, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee |
3 | Alaska, Arizona, California, Colorado, Florida, Illinois, Maryland, Michigan, Missouri, Nevada, New Mexico, New York, Texas |
4 | Idaho, Iowa, Kentucky, Maine, Minnesota, Montana, Nebraska, New Hampshire, North Dakota, South Dakota, Vermont, West Virginia, Wisconsin |
To choose the number of clusters, we can compute the total intra-cluster distance (the within-cluster sum of squared distances, also called the inertia) for a range of k values and look for the point where increasing k stops paying off.
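The elbow plot interpreted below can be produced with a sketch like the following (the exact range of k values tried is an assumption):

Python Code:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Total intra-cluster distance (inertia) for k = 1 through 9
ks = range(1, 10)
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in ks]
plt.plot(list(ks), inertias, '-o')
plt.xlabel('Number of clusters k')
plt.ylabel('Total intra-cluster distance')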
We can see that the total intra-cluster distance is large for k = 1 and decreases as we increase k, until k = 4, after which it tapers off and gets only marginally smaller. The slope becomes constant after k = 4. This indicates that k = 4 is a good choice.
Python Code:
# K-means cluster assignments shown on the first two principal components
fig, ax1 = plt.subplots()
ax1.set_xlim(X_PCA[:, 0].min() - 1, X_PCA[:, 0].max() + 1)
ax1.set_ylim(X_PCA[:, 1].min() - 1, X_PCA[:, 1].max() + 1)
ax1.scatter(X_PCA[:, 0], X_PCA[:, 1], c=y_kmeans, s=50)
for i, name in enumerate(df['Unnamed: 0'].values):
    ax1.annotate(name, (X_PCA[i, 0], X_PCA[i, 1]), ha='center', fontsize=10)
The PCA plot colored by the k-means labels is consistent with the clustering: the points are split cleanly into four regions.
We will now use hierarchical clustering with complete linkage and Euclidean distance to cluster the states into four clusters, and then visualize the results on top of the first two principal components.
Hierarchical clustering is often associated with heat maps: it orders the rows and columns based on similarity, which makes correlations in the data easy to see. Agglomerative clustering builds the hierarchy from the bottom up by repeatedly merging the two closest clusters; with complete linkage, the distance between two clusters is the largest pairwise distance between their members.
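Before cutting the hierarchy into four clusters, a dendrogram gives a useful picture of how the states merge. This is an illustrative sketch using SciPy rather than part of the original analysis:

Python Code:

from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
# Complete-linkage hierarchy on the standardized features, labeled by state name
Z = hierarchy.linkage(X, method='complete', metric='euclidean')
fig, ax = plt.subplots(figsize=(12, 5))
hierarchy.dendrogram(Z, labels=df['Unnamed: 0'].values, leaf_rotation=90, ax=ax)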
Python Code:
from sklearn.cluster import AgglomerativeClustering
# Agglomerative clustering with complete linkage and Euclidean distance
# (newer scikit-learn versions rename the `affinity` argument to `metric`)
agg_cluster_model = AgglomerativeClustering(linkage="complete", affinity='euclidean', n_clusters=4)
y_pred = agg_cluster_model.fit_predict(X)
# Plot the states on the first two (standardized) feature columns, colored by cluster
fig, ax1 = plt.subplots()
ax1.scatter(X[:, 0], X[:, 1], c=y_pred, marker="o")
for i, name in enumerate(df['Unnamed: 0'].values):
    ax1.annotate(name, (X[i, 0], X[i, 1]), ha='center', fontsize=10)
Python Code:
cluster1 = df2[agg_cluster_model.labels_ == i].tolist()

Here `i` is again the cluster label (0 to 3).
Clusters | States |
---|---|
1 | Alabama, Alaska, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee |
2 | Arkansas, Connecticut, Delaware, Hawaii, Indiana, Kansas, Kentucky, Massachusetts, Minnesota, Missouri, New Jersey, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, Utah, Virginia, Washington, Wisconsin, Wyoming |
3 | Arizona, California, Colorado, Florida, Illinois, Maryland, Michigan, Nevada, New Mexico, New York, Texas |
4 | Idaho, Iowa, Maine, Montana, Nebraska, New Hampshire, North Dakota, South Dakota, Vermont, West Virginia |
Python Code:
# Hierarchical cluster assignments shown on the first two principal components
fig, ax1 = plt.subplots()
ax1.set_xlim(X_PCA[:, 0].min() - 1, X_PCA[:, 0].max() + 1)
ax1.set_ylim(X_PCA[:, 1].min() - 1, X_PCA[:, 1].max() + 1)
ax1.scatter(X_PCA[:, 0], X_PCA[:, 1], c=y_pred, s=50)
for i, name in enumerate(df['Unnamed: 0'].values):
    ax1.annotate(name, (X_PCA[i, 0], X_PCA[i, 1]), ha='center', fontsize=10)
The results are slightly different from k-means: the data is still split into four sections, but some states end up in different clusters.
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. It is a non-parametric, density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), marking as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away).
Python Code:
from sklearn.cluster import DBSCAN
# DBSCAN with a neighborhood radius (eps) of 1 and at least 2 points per neighborhood
db_model = DBSCAN(eps=1, min_samples=2)
y_pred = db_model.fit_predict(X)  # fit the model and return a cluster label for each state
labels = db_model.labels_
# Number of clusters found (excluding the noise label -1) and number of noise points
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
fig, ax1 = plt.subplots()
ax1.scatter(X[:, 0], X[:, 1], c=y_pred, marker="o")
Python Code:
cluster1 = df2[db_model.labels_ == i].tolist()

Here `i` is the cluster label; DBSCAN assigns the label -1 to noise points.
Clusters | States |
---|---|
1 | Alabama, Georgia, Louisiana, Mississippi, South Carolina, Tennessee |
2 | Connecticut, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Massachusetts, Minnesota, Missouri, Montana, Nebraska, New Hampshire, New Jersey, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Dakota, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming |
3 | Illinois, New York, Texas |
4 | Maryland, Michigan, New Mexico |
Python Code:
# DBSCAN cluster assignments shown on the first two principal components
fig, ax1 = plt.subplots()
ax1.set_xlim(X_PCA[:, 0].min() - 1, X_PCA[:, 0].max() + 1)
ax1.set_ylim(X_PCA[:, 1].min() - 1, X_PCA[:, 1].max() + 1)
ax1.scatter(X_PCA[:, 0], X_PCA[:, 1], c=y_pred, s=50)
for i, name in enumerate(df['Unnamed: 0'].values):
    ax1.annotate(name, (X_PCA[i, 0], X_PCA[i, 1]), ha='center', fontsize=10)
These results look quite different on the PCA plot than those of the previous methods. DBSCAN is extremely sensitive to the choice of epsilon (the neighborhood radius): small changes in eps can merge or split clusters and change how many points end up labeled as noise.
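That sensitivity is easy to demonstrate by sweeping eps and counting the resulting clusters and noise points; the range of values below is just an illustration.

Python Code:

import numpy as np
from sklearn.cluster import DBSCAN
# How the number of clusters and noise points changes with the neighborhood radius
for eps in np.arange(0.5, 2.01, 0.25):
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f"eps={eps:.2f}: {n_clusters} clusters, {n_noise} noise points")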