Click here for a full-size version of the window below. If you’re interested in seeing the full code, check out the original notebook. In the next article we’ll prototype machine learning models for lyric-based genre classification.

Note: Dashboard may take a minute to load

Summary

Network graphs visualize associations between elements of a dataset. Here they are used to show which heavy metal bands have similar lyrics or genre tags, giving us another way to look for correlations between lyrical content and genre affiliation.

We can treat bands as nodes in a graph and define edges according to what we’re interested in:

  • When visualizing genre association, connect two nodes with an edge if they share at least one genre tag.
  • When visualizing lyrical similarity, connect two nodes with an edge if their vocabularies overlap heavily. Similarity can be quantified by the cosine similarity between TF-IDF vectors, so we set a minimum cosine similarity for two nodes to be connected (a toy sketch follows this list).
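
For illustration, here is a minimal sketch of that lyrical-similarity edge rule on a made-up toy corpus (the band names and the fixed threshold are assumptions for the example; the notebook below derives a per-band threshold instead):

# Toy sketch: connect two "bands" when the cosine similarity of their
# TF-IDF vectors exceeds a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_lyrics = {
    'Band A': 'fire death darkness eternal night',
    'Band B': 'death darkness eternal night fire doom',
    'Band C': 'love summer dancing sunshine',
}
tfidf = TfidfVectorizer().fit_transform(toy_lyrics.values())
sim = cosine_similarity(tfidf)
print(sim.round(2))   # Band A and Band B score high; Band C is dissimilar to both
threshold = 0.5       # illustrative cutoff: only the A-B pair would get an edge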

To distinguish between nodes, we can also cluster them by attribute using K-means, labeling each node with the cluster whose centroid it is nearest to.

  • For genre clustering, we give each band a binary feature vector over the genre tags and perform K-means in this genre space.
  • For vocabulary clustering, we create a TF-IDF vectorization of each band’s lyrics, as before.

Dimensionality reduction is required for efficient clustering in both cases, since the parameter space is rather large (a thousand words for lyrical clustering and some tens of genres for genre clustering). Truncated singular value decomposition (SVD) is preferred over principal component analysis (PCA) here due to the sparsity of the feature matrices.
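
As a quick aside, here is a minimal sketch (using a random sparse stand-in matrix rather than the actual feature matrices) of why TruncatedSVD is convenient here: it operates on a scipy sparse matrix directly, whereas PCA would first require converting it to a dense array.

# Sketch: TruncatedSVD reduces a sparse matrix without densifying it first.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

X_sparse = sparse_random(1000, 1000, density=0.01, random_state=0)  # stand-in for a TF-IDF matrix
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)        # accepts sparse input directly
print(X_reduced.shape)                         # (1000, 2)
print(svd.explained_variance_ratio_.sum())     # fraction of variance retained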

This analysis is applied to the full dataset, but for visualization purposes only the top 350 bands with over 2,500 words of lyrics are shown in the final product, to stay consistent with the feature plots dashboard.

Imports

Show code
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import silhouette_score

import sys
sys.path.append('../scripts/')

import lyrics_utils as utils


Data pre-processing

Show code
def tokenizer(s):
    return [word for word in s.split() if len(word) >= 4]


def clean_words(words, minlen=4):
    stop_words = stopwords.words('english')
    words = [w for w in words if (w not in stop_words) and (len(w) >= minlen)]
    return words


# Load the full band dataset, compute per-band statistics, and build the
# shortened list of top bands used in the final visualization.
df = utils.load_bands('../bands-1pct.csv')
df = utils.get_band_stats(df)
df_short = utils.get_band_words(df, num_words=2500, num_bands=350)
names_short = df_short['name'].values
names = df['name'].values
df['clean_words'] = df['words'].apply(clean_words)


Graph nodes

Vocabulary clustering

The corpus of band lyrics is vectorized with scikit-learn’s TfidfVectorizer, reduced with TruncatedSVD, and clustered with K-means. The choice of \(k=5\) clusters is discussed below. A convenient benefit of K-means is that we usually end up with fairly even clusters.

vectorizer = TfidfVectorizer(
    stop_words=stopwords.words('english'),
    tokenizer=tokenizer,
    min_df=0.01,
    max_df=0.9,
    max_features=1000,
    sublinear_tf=False,
)
corpus = list(df['clean_words'].apply(lambda x: ' '.join(x)))
X = vectorizer.fit_transform(corpus)
vocab_svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = vocab_svd.fit_transform(X)
vocab_kmeans = KMeans(n_clusters=5, random_state=0).fit(X_svd)
vocab_labels = vocab_kmeans.labels_
print("Lyrics clustering results:")
for i in sorted(set(vocab_labels)):
    print(f"{i}: {sum(vocab_labels == i) / len(vocab_labels) * 100:.1f}%")
Lyrics clustering results:
0: 22.4%
1: 19.9%
2: 18.6%
3: 21.1%
4: 18.1%
Show code
plt.figure(figsize=(10, 10))
for i in range(vocab_kmeans.n_clusters):
    cluster_idx = np.where(vocab_labels == i)[0]
    # for more visible plotting, just grab bands in the shortened list, and only the first 30
    cluster_idx = np.array([idx for idx in cluster_idx if df.name[idx] in names_short], dtype=int)
    cluster_idx = cluster_idx[:30]
    # plot x and y values in the reduced parameter space
    x, y = X_svd[cluster_idx].T
    plt.plot(x, y, 'o')
    for j in range(len(cluster_idx)):
        plt.text(x[j], y[j], names[cluster_idx][j])
plt.grid()
plt.show()


[Figure: bands in the 2-D SVD lyric space, colored by K-means vocabulary cluster]

At this point it’s hard to see if this has been effective at all. Thankfully we can use Gephi to better visualize the clustered data, complete with the edges and color-coding. All of the cluster labels and node connections generated in this notebook are exported for processing in Gephi, resulting in the final network graphs.

Choosing ‘k’ for K-means

Deciding on \(k\), the number of clusters, is up to us and can depend on the task at hand. Here I combine two possible ways of choosing \(k\):

  • Compute the within-cluster sum of squared errors (WSS): the sum of squared Euclidean distances between each point and its cluster center. You can’t simply look for a minimum here, since WSS keeps decreasing as \(k\) grows. However, you can look for an “elbow” in the curve, beyond which further improvement is small.
  • Maximize the silhouette score, which measures how similar each point is to the rest of its own cluster compared with the nearest neighboring cluster. A higher silhouette score is better, but it may not be useful if the maximum corresponds to a trivially small \(k\), as is the case here.

I combine these two by dividing the silhouette score by the log of WSS and looking for the earliest (smallest \(k\)) local maximum in this metric. Since WSS drops off roughly exponentially, dividing by \(\log(WSS)\) makes more sense than dividing by WSS itself. This gives a curve whose first local maximum is at \(k=5\); strictly speaking it is at \(k=2\), but I skip that trivial option.

Show code
def test_k_values(points, krange):
    # For each k, run K-means and record the within-cluster sum of squared
    # errors (WSS), the silhouette score, and the cluster labels.
    n_k = len(krange)
    wss = np.zeros(n_k)
    sil = np.zeros(n_k)
    lab = np.zeros((n_k, len(points)))
    for i, k in enumerate(krange):
        kmeans = KMeans(n_clusters=k, random_state=0).fit(points)
        labels = kmeans.labels_
        centroids = kmeans.cluster_centers_
        for j, p in enumerate(points):
            center = centroids[labels[j]]
            wss[i] += (p[0] - center[0])**2 + (p[1] - center[1])**2
        sil[i] = silhouette_score(points, labels, metric='euclidean')
        lab[i, :] = labels
    return wss, sil, lab
kmin = 2
kmax = 10
krange = np.arange(kmin, kmax + 1)
wss, sil, lab = test_k_values(X_svd, krange)

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 8))

ax1.plot(krange, wss)
ax1.set_xlim(kmin, kmax)
ax1.set_ylim(0, wss.max())
ax1.set_xlabel("k")
ax1.set_ylabel("Within-cluster-sum of\nsquared errors (WSS)")

ax2.plot(krange, sil)
ax2.set_xlim(kmin, kmax)
ax2.set_ylim(0, sil.max())
ax2.set_xlabel("k")
ax2.set_ylabel("Silhouette score")

ax3.plot(krange, sil / np.log(wss))
ax3.set_xlim(kmin, kmax)
ax3.set_xlabel("k")
ax3.set_ylabel("Ratio")

plt.show()


[Figure: WSS, silhouette score, and silhouette/log(WSS) ratio versus k for the vocabulary clustering]
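
For completeness, here is a small sketch (not part of the original notebook) of how the earliest local maximum of the combined metric could be picked programmatically from the wss, sil, and krange arrays computed above:

# Find the first interior local maximum of sil / log(WSS). Since k=2 is the
# left endpoint of krange, the trivial option discussed above is skipped.
ratio = sil / np.log(wss)
best_k = None
for i in range(1, len(krange) - 1):
    if ratio[i] >= ratio[i - 1] and ratio[i] >= ratio[i + 1]:
        best_k = krange[i]
        break
print(best_k)  # k=5 for the vocabulary clustering, per the discussion above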

Genre clustering

The process is similar for genres, but here the binary genre columns are used directly as the feature matrix. The value of \(k=6\) is determined in the same way as above.

genre_cols = [c for c in df.columns if 'genre_' in c]
Y = df[genre_cols].values
Y_svd = TruncatedSVD(n_components=2, random_state=0).fit_transform(Y)
genre_kmeans = KMeans(n_clusters=6, random_state=0).fit(Y_svd)
genre_labels = genre_kmeans.labels_
print("Genre clustering results:")
for i in sorted(set(genre_labels)):
    print(f"{i}: {sum(genre_labels == i) / len(genre_labels) * 100:.1f}%")
for i in sorted(set(genre_labels)):
    center_genres = sorted(zip(genre_cols, Y[genre_labels == i].mean(axis=0)), key=lambda x: -x[1])
    print(", ".join([f"{genre.split('_')[1]} {value:.2f}" for genre, value in center_genres if value > 0.1]))
Genre clustering results:
0: 21.7%
1: 10.9%
2: 16.9%
3: 5.9%
4: 19.8%
5: 24.8%
thrash 0.39, progressive 0.37, doom 0.29, power 0.15
death 1.00, doom 0.40, thrash 0.39, progressive 0.26
black 1.00
black 1.00, death 1.00, symphonic 0.12
death 1.00
power 0.40, heavy 0.29, symphonic 0.11

Choosing ‘k’ for genre clustering

Show code
kmin = 2
kmax = 10
krange = np.arange(kmin, kmax + 1)
wss, sil, lab = test_k_values(Y_svd, krange)

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(12, 8))

ax1.plot(krange, wss)
ax1.set_xlim(kmin, kmax)
ax1.set_ylim(0, wss.max())
ax1.set_xlabel("k")
ax1.set_ylabel("Within-cluster-sum of\nsquared errors (WSS)")

ax2.plot(krange, sil)
ax2.set_xlim(kmin, kmax)
ax2.set_ylim(0, sil.max())
ax2.set_xlabel("k")
ax2.set_ylabel("Silhouette score")

ax3.plot(krange, sil / np.log(wss))
ax3.set_xlim(kmin, kmax)
ax3.set_xlabel("k")
ax3.set_ylabel("Ratio")

plt.show()


[Figure: WSS, silhouette score, and silhouette/log(WSS) ratio versus k for the genre clustering]

Create labeled nodes

For the graph nodes, Gephi just needs a table of numeric ids, string labels, and any number of integer columns representing categories for coloring the graph. In this case I include the genre and vocabulary cluster labels as categorical columns.

nodes = pd.DataFrame({'id': range(1, len(names) + 1), 'label': names, 'genre': genre_labels, 'lyrics': vocab_labels})
nodes = nodes[nodes.label.isin(names_short)].reset_index(drop=True)
nodes
id label genre lyrics
0 1 Amorphis 2 0
1 2 Blind Guardian 3 0
2 3 Iced Earth 3 0
3 4 Katatonia 2 3
4 5 Entombed 2 3
... ... ... ... ...
195 3217 Whitechapel 5 3
196 3315 Fleshgod Apocalypse 2 0
197 3328 Alestorm 3 1
198 3682 Ghost 0 2
199 3722 Deafheaven 5 4

200 rows × 4 columns

Edges

Vocabulary similarity

We can quantify the similarity between two bands’ lyrics by measuring the cosine similarity between their respective TF-IDF vectors, which is a fairly efficient calculation even in the large vector space being used. The threshold for drawing an edge between two nodes depends on the cosine similarities of each node to all other nodes: for nodes \(i\) and \(j\) to be connected, the cosine similarity between them, \(C_{ij}\), must exceed the mean of \(C_{ik}\) over all \(k \neq i\) plus two standard deviations of \(C_{ik}\). The factor of two standard deviations is chosen to give roughly the same ratio of edges to nodes as the genre similarity method further below.

cos_matrix = cosine_similarity(X)
cos_vals = cos_matrix[cos_matrix < 1]
vocab_edges_dict = {'Source': [], 'Target': []}
for i, name in enumerate(names):
    row = cos_matrix[i, :]
    vals = row[row < 1]
    thresh = vals.mean() + 2 * vals.std()
    idx = np.where(row > thresh)[0]
    idx = idx[idx != i]
    for j in idx:
        vocab_edges_dict['Source'].append(i + 1)
        vocab_edges_dict['Target'].append(j + 1)
vocab_edges = pd.DataFrame(vocab_edges_dict)
vocab_edges = vocab_edges[vocab_edges.Source.isin(nodes.id) & vocab_edges.Target.isin(nodes.id)]
vocab_edges
Source Target
0 1 2
1 1 4
2 1 19
3 1 20
4 1 24
... ... ...
297923 3722 571
297924 3722 593
297943 3722 1679
297951 3722 2225
297954 3722 2615

5406 rows × 2 columns

num_edges = np.zeros(len(nodes))
for i, row in nodes.iterrows():
    targets = vocab_edges.loc[vocab_edges.Source == row.id, 'Target'].values
    sources = vocab_edges.loc[vocab_edges.Target == row.id, 'Source'].values
    num_edges[i] = len(targets) + len(sources)

print(num_edges.mean())

plt.hist(num_edges)
plt.show()
54.06

[Figure: histogram of the number of vocabulary-similarity edges per node]

Genre similarity

This is simple: draw an edge between two nodes if they share at least one genre tag. Since the genre columns are binary, two bands share a tag exactly when the dot product of their genre vectors is positive.

genre_edges_dict = {'Source': [], 'Target': []}
for i, row1 in nodes.iterrows():
    id_i = row1.id
    for j, row2 in nodes[nodes.id > id_i].iterrows():
        id_j = row2.id
        if np.dot(Y[id_i - 1], Y[id_j - 1]) > 0:
            genre_edges_dict['Source'].append(id_i)
            genre_edges_dict['Target'].append(id_j)
genre_edges = pd.DataFrame(genre_edges_dict)
genre_edges
Source Target
0 1 4
1 1 5
2 1 6
3 1 8
4 1 9
... ... ...
5625 3024 3116
5626 3024 3175
5627 3024 3217
5628 3024 3315
5629 3175 3315

5630 rows × 2 columns

num_genre_edges = np.zeros(len(nodes))
for i, row in nodes.iterrows():
    targets = genre_edges.loc[genre_edges.Source == row.id, 'Target'].values
    sources = genre_edges.loc[genre_edges.Target == row.id, 'Source'].values
    num_genre_edges[i] = len(targets) + len(sources)

print(num_genre_edges.mean())

plt.hist(num_genre_edges)
plt.show()
56.3

[Figure: histogram of the number of genre-similarity edges per node]
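
Finally, the node and edge tables built above can be written out as CSV files that Gephi imports through its data laboratory; a minimal sketch (the file names here are arbitrary choices):

# Export the labeled node table and both edge tables for Gephi.
nodes.to_csv('nodes.csv', index=False)
vocab_edges.to_csv('vocab_edges.csv', index=False)
genre_edges.to_csv('genre_edges.csv', index=False)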