Analysis of Heavy Metal Lyrics - Part 3: Word Clouds | Philippe’s data science adventures

This article is the third part of the analysis of heavy metal lyrics. If you’re interested in seeing the full code, check out the original notebook. In the next article we’ll use clustering and graph methods to visualize the genre and lyrical data as a network.

Word clouds are a fun and often helpful technique for visualizing natural language data. They can show words scaled by any metric, although term frequency (TF) and term frequency-inverse document frequency (TF-IDF) are the most common. For a multi-class or multi-label classification problem, word clouds can highlight the similarities and differences between classes by treating each class as its own document and comparing it with all the others. The word clouds seen here were made with the WordCloud generator by amueller, with pre-processing done via gensim and nltk.

In the case of heavy metal genre classification, term frequency alone would not be very illuminating: the genres visualized here share a lot of common themes. TF-IDF does much better at picking out the words that are unique to a genre: black metal lyrics deal with topics like the occult, religion, and nature; death metal focuses on the obscene and horrifying; heavy metal revolves around themes more familiar to rock and pop; power metal adopts the vocabulary of fantasies and histories; and thrash metal sings of violence and war. The full corpus word cloud shows themes common to all heavy metal genres.
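
To make the difference between the two metrics concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. The three documents and the `tfidf` helper are invented for illustration. The weighting follows the smoothed IDF with L2 normalization that scikit-learn documents as the default for `TfidfVectorizer`, but this is a simplified stand-in, not the library's implementation.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (toy TF-IDF sketch)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so documents of different lengths are comparable.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({term: w / norm for term, w in raw.items()})
    return weights


# Hypothetical "genre documents"; "death" and "fire" appear in all three.
toy_docs = [
    "death fire satan forest",
    "death fire gore corpse",
    "death fire dragon sword",
]
toy_weights = tfidf(toy_docs)
for doc_weights in toy_weights:
    print(sorted(doc_weights, key=doc_weights.get, reverse=True))
```

Under plain term frequency every word in each toy document ties at a count of one. With TF-IDF the words found in only one document outrank the shared ones, which is the effect the genre word clouds rely on.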

Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

plt.rcParams['font.size'] = 18

Data

df = pd.read_csv('../bands-1pct.csv')
genre_cols = [c for c in df.columns if 'genre_' in c]
genres = [c.replace('genre_', '') for c in genre_cols]

Functions for creating and visualizing word clouds

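# Build the regex tokenizer once instead of on every call.
word_tokenizer = RegexpTokenizer('[a-zA-Z]+')


def tokenizer(s):
    # Keep lowercased alphabetic tokens that are at least four characters long.
    return [word.lower() for word in word_tokenizer.tokenize(s) if len(word) >= 4]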


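def get_wordclouds(corpus, names, min_df=1, max_df=1.0, width=800, height=500):
    # `names` maps row positions in `corpus` to the titles used for the clouds.
    # The defaults keep every term. Note that an integer max_df=1 would instead
    # mean "appears in at most one document".
    vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'), tokenizer=tokenizer, min_df=min_df, max_df=max_df)
    X = vectorizer.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2.
    vocabulary = vectorizer.get_feature_names_out()
    out = {}
    for i, name in names.items():
        print(name)
        # Densify only the row for this document, not the whole matrix.
        freqs = X[i].toarray().ravel()
        word_freqs = dict(zip(vocabulary, freqs))
        out[name] = WordCloud(width=width, height=height).fit_words(word_freqs)
    return out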


def plot_wordclouds(clouds):
    names = list(clouds.keys())
    width = clouds[names[0]].width
    height = clouds[names[0]].height
    dpi = plt.rcParams['figure.dpi']
    nrows = int(np.ceil(len(names) / 2))
    ncols = 2
    figsize = (width / dpi * ncols, height / dpi * nrows)
    # squeeze=False keeps the axes array 2-D even when there is a single row.
    fig, subplots = plt.subplots(nrows, ncols, figsize=figsize, facecolor='k', squeeze=False)
    for i in range(subplots.size):
        ax = subplots[i // 2, i % 2]
        ax.set_facecolor('k')
        ax.set_axis_off()
        if i < len(names):
            name = names[i]
            ax.imshow(clouds[name])
            ax.set_title(name, color='w', fontweight='bold', y=1.05)
    plt.show()
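
The sizing logic in `plot_wordclouds` is easiest to follow with concrete numbers. The sketch below repeats the arithmetic for five clouds of 800 by 500 pixels, assuming a `figure.dpi` of 100 (Matplotlib's stock default; notebook backends may use a different value). `grid_figsize` is a hypothetical helper written for this illustration, not part of the original notebook.

```python
import math


def grid_figsize(n_clouds, width_px, height_px, dpi, ncols=2):
    """Rows and figure size (inches) for a grid of equally sized clouds."""
    nrows = math.ceil(n_clouds / ncols)
    # Pixels divided by dots-per-inch gives inches per cloud.
    return nrows, (width_px / dpi * ncols, height_px / dpi * nrows)


nrows, figsize = grid_figsize(n_clouds=5, width_px=800, height_px=500, dpi=100)
print(nrows, figsize)  # 3 (16.0, 15.0)
```

An odd number of clouds leaves the last grid cell unused, which is why the function turns off every axis before it draws anything.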

Word clouds for genres

Here we split the full dataframe by genre, keeping only the bands tagged with exactly one genre, so each document consists of all the lyrics from bands that belong exclusively to that genre.

genre_corpus = []
for genre, col in zip(genres, genre_cols):
    other_cols = [c for c in genre_cols if c != col]
    words = df[(df[col] == 1) & (df[other_cols] == 0).all(axis=1)].words
    genre_corpus.append(' '.join(words))

genre_clouds = get_wordclouds(genre_corpus, dict(enumerate(genres)), min_df=0.5, max_df=0.8)
plot_wordclouds(genre_clouds)
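
The `min_df` and `max_df` arguments decide which terms reach the word clouds at all. Given as floats, they are read as proportions of the number of documents, so with five genre documents (an assumption based on the five genres named earlier) `min_df=0.5` and `max_df=0.8` would keep only terms that occur in three or four genres. The pure-Python sketch below imitates that filter on made-up data; `filter_by_document_frequency` is a hypothetical helper, not scikit-learn code.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Float thresholds are read as proportions of the number of documents,
    integer thresholds as absolute document counts.
    """
    n_docs = len(tokenized_docs)
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    # Count each term once per document, however often it occurs there.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term for term, df in doc_freq.items() if low <= df <= high}


# Five made-up genre documents, already tokenized.
toy_genres = [
    ["blood", "night", "frost"],
    ["blood", "night", "flesh"],
    ["blood", "night", "heart"],
    ["blood", "steel", "heart"],
    ["blood", "steel", "heart"],
]
kept = filter_by_document_frequency(toy_genres, min_df=0.5, max_df=0.8)
print(sorted(kept))  # ['heart', 'night']
```

Here "blood" is dropped for appearing in every document, while "frost", "flesh", and "steel" are dropped for appearing in fewer than half of them. The type of the threshold matters: an integer `1` means one document, whereas the float `1.0` means all documents.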

Word clouds for bands

We can likewise build word clouds for individual bands. Here are word clouds for the top ten bands by number of album reviews. You can see more artist-specific word clouds by clicking on any of the bands included in the lyrics dataset dashboard.

band_corpus = list(df.words)
bands = df.sort_values('reviews', ascending=False)['name'].values[:10]
print(bands)
bands_dict = {i: name for i, name in df.name.items() if name in bands}

band_clouds = get_wordclouds(band_corpus, bands_dict, min_df=0.1, max_df=0.5)
plot_wordclouds(band_clouds)
