How do you visualize whether several lists share common elements? Most people will answer with something between “potatoes” and Venn diagrams. Venn diagrams work well with two lists, but what if you have three or… six?
Scientists do not always aim for the best readability; they tend to focus on comprehensiveness. This works as long as you share your work with patient and interested colleagues or people working in the field. Stakeholder presentations have additional requirements: you have to balance comprehensiveness with readability, because the audience is not made up of specialists.
We will explore how to do this with a simple example: movie genres.
First and foremost, let’s load IMDB movie data into a pandas dataframe and import all the libraries that we will need:
import gzip
import io
import pprint
import upsetplot
import pandas as pd
from collections import defaultdict
from matplotlib_venn import venn2, venn3
from matplotlib import pyplot as plt
from urllib.request import Request, urlopen
def load_movie_data(sample=True):
    """
    Download the IMDB title data and load it into a pandas dataframe.

    Warning: the file is about 130 MB, so depending on connection
    speed the download might take some time.
    """
    req = Request('https://datasets.imdbws.com/title.basics.tsv.gz')
    req.add_header('Accept-Encoding', 'gzip')
    response = urlopen(req)
    content = gzip.decompress(response.read())
    data = pd.read_csv(io.BytesIO(content), encoding='utf8', sep="\t")
    # Keep only titles that are not flagged as adult
    data = data[data.isAdult == 0]
    if sample:
        # Two percent might seem low, but there are approx. 7 million
        # titles without the Adult category.
        return data.sample(frac=0.02)
    else:
        return data
data = load_movie_data()
pp = pprint.PrettyPrinter(indent=4)
print("Data column names: ")
pp.pprint(list(data.columns))
print("Data shape: " + str(data.shape))
print("Genres column examples: ")
pp.pprint(data.genres.sample(5).head())
Output:
Data column names:
[   'tconst',
    'titleType',
    'primaryTitle',
    'originalTitle',
    'isAdult',
    'startYear',
    'endYear',
    'runtimeMinutes',
    'genres']
Data shape: (146135, 9)
Genres column example:
6711046 Music,Short
5159561 News,Talk-Show
2496837 Comedy
6745879 Short
5418258 Comedy,Short
Name: genres, dtype: object
Next, let’s see how many movies are in both “Romance” and “Action” categories:
# Reshape data to have, for every category,
# a list of movies.
genres_movies = defaultdict(list)
for index, row in data.iterrows():
    try:
        for genre in row["genres"].split(','):
            genres_movies[genre].append(row['primaryTitle'])
    except AttributeError:
        # Missing genres are read as NaN (a float), which has no .split()
        pass
pp = pprint.PrettyPrinter(indent=4, depth=1)
print("Data structure: ")
pp.pprint(genres_movies)
# Plot a simple Venn diagram and save it to file
venn2([set(genres_movies['Action']), set(genres_movies['Romance'])], set_labels=('Action', 'Romance'))
plt.savefig("./simple_venn.png")
plt.clf()
Output:
Data structure:
defaultdict(<class 'list'>,
            {   'Action': [...],
                'Adult': [...],
                'Adventure': [...],
                'Animation': [...],
                'Biography': [...],
                'Comedy': [...],
                'Crime': [...],
                'Documentary': [...],
                'Drama': [...],
                'Family': [...],
                'Fantasy': [...],
                'Film-Noir': [...],
                'Game-Show': [...],
                'History': [...],
                'Horror': [...],
                'Music': [...],
                'Musical': [...],
                'Mystery': [...],
                'News': [...],
                'Reality-TV': [...],
                'Romance': [...],
                'Sci-Fi': [...],
                'Short': [...],
                'Sport': [...],
                'Talk-Show': [...],
                'Thriller': [...],
                'War': [...],
                'Western': [...],
                '\\N': [...]})
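As a sanity check, the three numbers displayed in a two-set Venn diagram can be reproduced with plain Python set operations. The sketch below uses made-up titles (not the IMDB sample) and a hypothetical helper, `venn2_regions`, to show what each region counts:

```python
def venn2_regions(left, right):
    """Return the sizes of the three regions of a two-set Venn diagram."""
    left, right = set(left), set(right)
    return {
        "left_only": len(left - right),   # in the first set only
        "both": len(left & right),        # the overlap
        "right_only": len(right - left),  # in the second set only
    }


# Toy data, invented for illustration only
action = {"Movie A", "Movie B", "Movie C"}
romance = {"Movie C", "Movie D"}

regions = venn2_regions(action, romance)
print(regions)
```

With the real data, `len(set(genres_movies['Action']) & set(genres_movies['Romance']))` should give the number shown in the overlap.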
Here is the resulting Venn diagram:
It is perfectly readable; almost anybody can quickly grasp what it means: 443 movies in this sample fall into both categories. Similarly, you can add one category and use the venn3 function:
venn3([set(genres_movies['Action']), set(genres_movies['Romance']), set(genres_movies['Drama'])], set_labels = ('Action', 'Romance', 'Drama'))
plt.savefig("./large_venn.png")
plt.clf()
Which generates something similar to:
And this is still very reasonable. What if you wanted a few extra categories? Well, at least with matplotlib-venn, you can’t! I don’t know how much of a design choice it is, but given the example above, and for the sake of readability, I think it is a brilliant limitation (this is not the case with other libraries/languages, for example with Venn diagrams in R).
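One way to see why readability degrades so quickly: a Venn diagram of n sets has to draw one region for every non-empty combination of sets, which is 2^n - 1 regions. A quick sketch of that count (the helper name is only for illustration):

```python
from itertools import combinations


def venn_regions(categories):
    """List every non-empty combination of categories (one Venn region each)."""
    return [
        combo
        for size in range(1, len(categories) + 1)
        for combo in combinations(categories, size)
    ]


for n in (2, 3, 4, 6):
    names = [f"set{i}" for i in range(n)]
    print(n, "sets ->", len(venn_regions(names)), "regions")
```

Two sets give 3 regions and three sets give 7, but six sets already require 63.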
“But I really need to show 4 categories!”
And there are business cases where this is actually useful, just not with Venn diagrams.
Luckily, some people have thought hard about the problem and came up with upset plots. The initial implementation is in R, but one also exists in Python. Here is how to start:
genres_movies_set = {k: set(v) for k, v in genres_movies.items()}
def plot_upset(genres_movies_set, movie_categories, filename):
    # Keep only the requested categories (exact match on the genre name)
    upset_data_sub = upsetplot.from_contents({k: v for k, v in genres_movies_set.items() if k in movie_categories})
    upsetplot.plot(upset_data_sub)
    plt.savefig(filename)
    # Clear the figure so the next plot starts from a clean canvas
    plt.clf()
plot_upset(genres_movies_set, ('Action', 'Romance'), "./simple_upset.png")
Output:
How do you read it? Start with the dots: every black dot indicates a category concerned by that column. Then shift your eyes to the top bar chart, which tells you how many movies belong to that combination of categories. Two or more connected dots indicate movies that have several categories at once.
If it helps, it is based on the same data as the Venn diagram with two sets, just presented differently. It is a bit unusual, but once your brain processes the novelty, it actually works very well.
What about the bar chart on the left? Take the first row of dots as an example: its bar represents the total of all movies categorized as “Action”, whether they are only in “Action” or also carry the “Romance” tag.
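To make the two bar charts concrete, here is a minimal sketch in plain Python, using invented titles rather than the IMDB sample. It assumes the top bars count exclusive combinations (each movie counted once, under its exact set of categories), which matches the description above; the function names are hypothetical and not part of upsetplot:

```python
from collections import Counter, defaultdict


def exclusive_combinations(contents):
    """Count items by the exact combination of categories they belong to."""
    memberships = defaultdict(set)
    for category, items in contents.items():
        for item in items:
            memberships[item].add(category)
    return Counter(frozenset(cats) for cats in memberships.values())


def category_totals(contents):
    """Count items per category, regardless of their other categories."""
    return {category: len(set(items)) for category, items in contents.items()}


# Toy data, invented for illustration only
toy = {
    "Action": {"Movie A", "Movie B", "Movie C"},
    "Romance": {"Movie C", "Movie D"},
}

top_bars = exclusive_combinations(toy)  # what the top bar chart shows
left_bars = category_totals(toy)        # what the left bar chart shows

for combo, count in top_bars.items():
    print(sorted(combo), count)
print(left_bars)
```

Here "Movie C" is counted once in the top chart, under the connected Action and Romance dots, but contributes to both totals on the left.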
For two categories, it is clearly overkill, but look at what happens with four:
plot_upset(genres_movies_set, ('Action', 'Romance', 'Drama', 'Sci-Fi'), "./large_upset.png")
Output:
There is nothing wrong with Venn diagrams; it is just a matter of knowing how and when to use them. This was a short introduction to explore their limits and point to a less well-known alternative: upset plots. They are not perfect either: they work in certain situations, and the expected readers/viewers should be willing to invest a minimal amount of energy into reading them. The biggest advantage is that when you do need to study complex intersecting sets, you now have a visual tool to do so.
Where to go from here? I only showed the core principles of Venn diagrams and upset plots. If you are interested in readability, you could start thinking about colors, highlighted bars (yes, that is possible), font size and high-resolution outputs.
Every plot type has a working window. As data professionals, I think it is part of our job to know them and choose accordingly.
I wrapped up everything in the following gist that can be run with a simple python3 ./create_movie_venn_upset_charts.py.
Copyright 2023 - Mikael Koutero. All rights reserved.