Who is the Centre of the Movie Universe?

Using Python and NetworkX to Analyse the Social Network of Movie Stars

By Rhyd Lewis, School of Mathematics, Cardiff University, http://www.rhydlewis.eu, 13th February 2020.

This article analyses the social network of movie stars and seeks to identify the most "central" actors in the movie business. The analysis is presented in a step-by-step, tutorial-like fashion and makes use of the Python programming language together with the NetworkX library. An extended pdf version of this article together with full listings of results is available here. Links to code and data sets appear at the bottom of this page.

A social network is made up of a set of nodes (usually people) that have links, or edges between them which describe their relationships. In the social network formed by movie actors, each actor is represented as a node. Pairs of actors are then joined by an edge if they are known to have appeared in a movie together.

For this study we use information taken from the Internet Movie Database IMDb. Specifically, we use a dataset collected by the administrators of the Oracle of Bacon website. Complete and up-to-date versions of this dataset can be downloaded directly from here. Our version of this dataset was downloaded at the start of January 2020 and contains the details of 164,318 different movies. Each movie in this set is stored as a JSON object containing, among other things, the title of the movie, a list of the cast members, and the year of its release.

Input and Preliminary Analysis

In this section we show how the dataset can be read into our program using standard Python commands. We then carry out a preliminary analysis of the data and produce some simple visualisations.

Before proceeding with our analysis, note that is was first necessary to remove a few "dud" movies from the dataset. In our case, we decided to remove the 44,075 movies that had no cast specified. We also deleted a further 5,416 movies that did not include a year of release. This leaves a final "clean" database of 114,827 movies with which to work. In the following Python code we call this file data.json.

We begin by first importing the relevant Python libraries into our program. Next, we transfer the contents of the entire dataset into a Python list called Movies.

In [1]:
import json
import networkx as nx
import matplotlib.pyplot as plt
import collections
import statistics 
import time
import random

Movies = []
with open("data.json", "r", encoding="utf-8") as f:
    for line in f.readlines():
        J = json.loads(line)
        Movies.append(J)

Having parsed the dataset in this way, we are now able to access any of the list's elements using standard Python commands. For example, the statement Movies[0] will return the full record of the first element in the list; the statement Movies[0]["cast"][0] will return the name of the first cast member listed for the first movie; and so on.

Having read the dataset into the Movies list, we can now carry out some basic analysis. Here we will look at the number of movies produced per year, and the sizes of the casts that were used. The code below uses the collections.Counter() method to count the number of movies released per year. This information is written to the variable C, which is then used to produce a bar chart via the plt.bar() command.

In [2]:
C = collections.Counter([d["year"] for d in Movies])
plt.xlabel("Year")
plt.ylabel("Frequency")
plt.title("Number of Movies per Year")
plt.bar(list(C.keys()), list(C.values()))
plt.show()

The resultant bar chart is shown above. As we might expect, we see that nearly all movies in this dataset were released between the early 1900's and 2020, with a general upwards trend in the number of releases per year.

We now take a look at the sizes of casts used in movies. The following code does this by producing a bar chart in the same way as the previous example.

In [3]:
C = collections.Counter([len(d["cast"]) for d in Movies])
plt.xlabel("Cast Size")
plt.ylabel("Frequency")
plt.title("Number of Movies per Cast Size")
plt.bar(list(C.keys()), list(C.values()))
plt.show()

This chart indicates that nearly all movies have casts of between one and fifty actors. However, there are some outliers with much larger casts. To get the names of these movies, the following code reorders the list Movies into descending order of cast size. The first five entries in this list are then written to the screen.

In [4]:
Movies = sorted(Movies, key = lambda i: len(i["cast"]), reverse=True)
for i in range(5):
    print(Movies[i]["title"], "=", len(Movies[i]["cast"]))
Cirque du Soleil: Worlds Away = 268
Hollywood Without Make-Up = 132
The Longest Day (film) = 117
The Founding of a Party = 116
The Founding of a Republic = 106

Forming the Social Network

In this section we now construct the complete social network of actors using tools from the NetworkX library. As mentioned earlier, our network is made up of nodes (actors in this case), with edges connecting actors that have appeared in a movie together. Probably the most appropriate type of network to use here is a multigraph. Multigraphs allow us to define multiple edges between the same pair of nodes, which makes sense here because actors will often appear in multiple movies together. Note also that the edges in this network are not directed. This means that if actor A has appeared with actor B, then B has also appeared with A.

The following code constructs our network G using the Movies list from the previous section. As shown, the code considers each movie in turn. It then goes through each pair of actors that appeared in this movie and adds the appropriate edge to the network. Each edge is also labelled with the corresponding movie title.

In [5]:
G = nx.MultiGraph()        
for movie in Movies:
    for i in range(0, len(movie["cast"]) - 1):
        for j in range(i + 1, len(movie["cast"])):
            G.add_edge(movie["cast"][i], movie["cast"][j], title=movie["title"])

print("Number of nodes in this multigraph =", G.number_of_nodes())
print("Number of edges in this multigraph =", G.number_of_edges())
Number of nodes in this multigraph = 395414
Number of edges in this multigraph = 9968607

As shown in the above output, the resultant network is very large, with a total of 395,414 different nodes (actors) and 9,968,607 different edges.

Analysing Connections in the Network

Having formed our social network of actors, we can now analyse some of its interesting features. In this section we start by calculating the total number of movies that each actor has appeared in. We then determine the most prolific acting partnerships in the movie business by calculating the number of movies that each pair of actors has appeared in.

Movies Per Actor

The following code calculates the total number of movies per actor and lists the top five. For each node in our network, this is achieved by going through its incident edges and forming a set S of all the different labels (movie titles) appearing on these edges. The final results are stored in the dictionary D. For output purposes, the contents of D are then put into a sorted list L, and the first five entries in this list are written to the screen.

In [6]:
D = {}
for v in G.nodes():
    E = list(G.edges(v, data=True))
    S = set()
    for e in E:
        S.add(e[2]["title"])
    D[v] = S
L = sorted(D.items(), key=lambda item: len(item[1]), reverse=True)
for i in range(5):
    print(L[i][0], ":", len(L[i][1]))    
Sukumari : 703
Jagathy Sreekumar : 695
Adoor Bhasi : 579
Brahmanandam : 576
Manorama : 558

The output above indicates that all of the top five positions in this list are occupied stars of Indian cinema, with the great Sukumari (1940-2013) winning the competition with 703 recorded movie appearances.

Acting Partnerships

We now consider the number of collaborations between different pairs of actors - that is, the number of movies that each pair of actors has appeared in together. The following code calculates these figures. It goes through every pair of actors that are known to have appeared in at least one movie together and then counts the total number of edges between the corresponding nodes. Again, the top five collaborations are reported.

In [7]:
D = {}
for e in G.edges():
    D[e[0] + " and " + e[1]] = G.number_of_edges(e[0], e[1])
L = sorted(D.items(), key=lambda kv: kv[1], reverse=True)
for i in range(5):
    print(L[i][0], ":", L[i][1])
Adoor Bhasi and Prem Nazir : 292
Larry Fine and Moe Howard : 216
Adoor Bhasi and Sankaradi : 207
Adoor Bhasi and Bahadoor : 198
Brahmanandam and Ali : 193

This output shows that the most prolific acting partnership in the network is due to the late Indian actors Adoor Bhasi (1927-1990) and Prem Nazir (1926-1991), who appeared in an impressive 292 movies together. Next on the list are Larry Fine and Moe Howard (two of the Three Stooges) who co-starred in 216 different movies.

Calculating Shortest Paths

As we have seen, when two actors have not appeared in a movie together there will be no edge between the corresponding nodes in the social network. However, we can still look for connections between actors by using paths of intermediate actors. This is similar to the so-called Six Degrees of Separation - the idea that all people are six or fewer social connections away from each other.

Connecting actors using chains of intermediate actors is an idea popularised by the Oracle of Bacon website, who provide a simple tool for finding shortest paths between any pair of actors. As mentioned earlier, the Oracle of Bacon is also the source of the dataset used in this work.

Before looking at the techniques used in identifying shortest paths, we first simplify our network slightly by converting it into a "simple graph". Simple graphs allow a maximum of one edge between a pair of nodes; hence, when we have multiple edges between a pair of nodes in our multigraph (because the two actors have appeared in multiple movies together), these will now be represented as a single edge. Note that this conversion maintains the number of nodes in the network but it reduces the number of edges. It will therefore make some of our calculations a little quicker. The following code constructs our simple graph. The final line then checks whether this new network is connected. (A network is connected when it is possible to form a path between every pair of nodes.)

In [8]:
G = nx.Graph()        
for movie in Movies:
    for i in range(0, len(movie["cast"]) - 1):
        for j in range(i + 1, len(movie["cast"])):
            G.add_edge(movie["cast"][i], movie["cast"][j], title=movie["title"])         
print("Number of nodes in simple graph  =", G.number_of_nodes())
print("Number of edges in simple graph  =", G.number_of_edges())
print("Graph Connected?                 =", nx.is_connected(G))
Number of nodes in simple graph  = 395414
Number of edges in simple graph  = 8676962
Graph Connected?                 = False

As shown in the above output, the new network still has 395,414 nodes, but it also contains 8,676,962 edges instead of the original 9,968,607 - a 13% reduction. We also see that the network is not connected; that is, it is composed of more than one distinct connected component.

Shortest paths can be found in our "simple" social network using the following code. This makes use of the NetworkX command nx.shortest_path() which returns a list of nodes P that specifies the path. For example, the following code achieves this.

In [9]:
P = nx.shortest_path(G, source="Anthony Hopkins", target="Samuel L. Jackson")
print(P)
['Anthony Hopkins', 'Brad Pitt', 'Samuel L. Jackson']

While this code does indeed tell us the shortest path between Anthony Hopkins and Samuel L. Jackson, it does not give the names of the movies involved in this path. In addition, if no path exists between the actors, or if we type in a name that is not present in the network, then the program will halt with an exception error. A better alternative is to therefore put the nx.shortest_path statement inside a bespoke function, and then add some code that (a) checks for errors, and (b) writes the output in a more helpful way. The following code does this and shows two examples.

In [10]:
def writePath(G, u, v):
    print("Here is the shortest path from", u, "to", v, ":")
    if not u in G or not v in G:
        print("  Error:", u, "and/or", v, "are not in the network")
        return
    try:
        P = nx.shortest_path(G, source=u, target=v)
        for i in range(len(P) - 1):
            t = G.edges[P[i],P[i+1]]["title"]
            print(" ", P[i], "was in", t, "with", P[i+1])
    except nx.NetworkXNoPath:
        print("  No path exists between", u, "and", v)

writePath(G, "Catherine Zeta-Jones", "Jonathan Pryce")
writePath(G, "Homer Simpson", "Neil Armstrong")
Here is the shortest path from Catherine Zeta-Jones to Jonathan Pryce :
  Catherine Zeta-Jones was in Blue Juice with Sean Pertwee
  Sean Pertwee was in Shopping (1994 film) with Jonathan Pryce
Here is the shortest path from Homer Simpson to Neil Armstrong :
  Error: Homer Simpson and/or Neil Armstrong are not in the network

Connectivity and Centrality Analysis

In this section we now use three techniques from the field of centrality analysis to help us identify the most "central" and "important" actors in our social network.

Recall from the last section that our network is currently not connected. This means that the graph is made up more than one connected component, and that paths between actors in different components do not exist. To investigate these connected components, we can use the command nx.connected_components(G) to construct a list holding the number of nodes in each.

In [11]:
C = [len(c) for c in sorted(nx.connected_components(G), key=len, reverse=True)]
print("Number of components =", len(C))
print("Component sizes      =", C)
Number of components = 2533
Component sizes      = [379859, 72, 46,..., 2, 2, 2]

The output of these statements indicates that our network of actors is actually made up of 2,533 different components; however, the vast majority of the nodes (96%) all occur within the same single connected component, indicating the existence of paths between all of these 379,859 different actors. We also see that the remaining components are very small (in most cases they are composed of actors who all appeared in the same single together movie, but no others).

For the remainder of our analysis we will now focus on the single connected component of 379,859 actors. To do this we first need to isolate the component. We can then use the subgraph() command to form the network represented by this component. The output of the following code confirms that this new network is indeed connected as we would expect.

In [12]:
C = max(nx.connected_components(G), key=len)
G = G.subgraph(C)
print("Number of nodes in simple graph =", G.number_of_nodes())
print("Number of edges in simple graph =", G.number_of_edges())
print("Graph Connected?                =", nx.is_connected(G))
Number of nodes in simple graph = 379859
Number of edges in simple graph = 8612493
Graph Connected?                = True

The following three subsections will now investigate the "centrality" of the nodes appearing in this connected network.

Degree Centrality

In networks, the degree of a node is simply the number of edges that are touching it. For our network of actors the degree represents the number of different people that an actor has worked with. To access the degree of a node the simplest option is to use the command G.degree(). For example, the following code tells us that, according to our dataset, Henry Fonda (1905-1982) appeared in film with 1,325 different actors.

In [13]:
print(G.degree("Henry Fonda"))
1325

A second option is to use the NetworkX function nx.degree_centrality(G), which calculates the degree centrality of every node in the network. The degree centrality of a node is determined by dividing its degree by the maximum possible degree, which in this case is simply the number of nodes in the network minus one (i.e., 379,858). The degree centrality of Henry Fonda, for example, is therefore calculated as 1,325 divided by 379,858, giving 0.00349. This figure can be interpreted as the proportion of actors in the network that Henry Fonda has acted with - in this case, just under 0.35%.

The following code creates a dictionary D that holds the degree centrality of all nodes in the network. As with previous examples, the top five actors according to this measure are then written to the screen.

In [14]:
D = nx.degree_centrality(G)
L = sorted(D.items(), key=lambda item: item[1], reverse=True)
for i in range(5):
    print(L[i][0], ":", L[i][1])
Nassar : 0.007731836633689431
Sukumari : 0.006710402308230971
Manorama : 0.006610364925840709
Brahmanandam : 0.0064761042284222
Vijayakumar : 0.006236541023224468

The top positions in this list are again occupied by actors from Indian cinema. To determine the actual degree of these nodes (i.e., the number of different actors that each actor has worked with), we simply need to multiply these figures by the number of nodes minus one. We then find that Nassar has appeared with a massive 2,937 different actors; Sukumari with 2,549; Manorama with 2,511; Brahmanandam with 2,460; and Vijayakumar with 2,369.

Betweenness Centrality

Betweenness centrality considers the number of shortest paths in a network that pass through a particular node. In social networks this helps to detect the "middlemen" that serve as a links between different parts of a network. It also helps to identify "hubs" in the network that, when removed, will start to disconnect parts of the network from each other.

The formula for calculating the betweenness centrality of an individual node $v$ in a network is as follows: $$ C(v)=\sum_{s,t\in V}\frac{\sigma(s,t|v)}{\sigma(s,t)} $$ where $V$ is the set of nodes in the network, $\sigma(s,t)$ is the number of different shortest paths between two nodes $s$ and $t$, and $\sigma(s,t|v)$ is the number of these $s$-$t$-paths that are seen to pass through $v$. In other words, the betweenness centrality of a node $v$ is the sum of the fraction of all-pairs shortest paths that pass through node $v$.

We can calculate the betweenness centrality of all nodes using the NetworkX command nx.betweenness_centrality(G). However, executing this command on a network as big as ours is infeasible. This is because it involves having to calculate all of the shortest paths between all pairs of nodes, which would take a huge amount of calculation. (In technical terms, it involves using an algorithm that has a complexity of $O(nm)$, where $n$ is the number of nodes and $m$ is the number of edges.) Luckily, we can make some savings in these calculations by using a sample of $k$ nodes to estimate the betweenness centrality of all nodes. This is carried out by the following code using a sample of 1,000 nodes. As before, the top five results are written to the screen. Some statements are also included to show us how long the calculation takes.

In [15]:
start = time.time()
D = nx.betweenness_centrality(G, k=1000)
end = time.time()
print("Time taken =", end-start, "seconds")
L = sorted(D.items(), key=lambda item: item[1], reverse=True)
for i in range(5):
    print(L[i][0], ":", L[i][1])
Time taken = 35078.12739729881 seconds
Christopher Lee : 0.009681571845060294
Om Puri : 0.008096614699536075
Jackie Chan : 0.007916442261041035
Anupam Kher : 0.007905221698069204
Harrison Ford : 0.004888140405090745

The output of this code indicates that Christopher Lee appears on by far the highest number of shortest paths in the network, followed by Om Puri, Jackie Chan, Anupam Kher, and then Harrison Ford. As indicated, this calculation took approximately 10 hours to carry out on our computer (a 3.2 GHtz Windows 10 machines with 8 GB RAM). This suggests that a full calculation using all nodes instead of just a sample would have taken something in the region of 150 days to complete.

Closeness Centrality

Like betweenness centrality, closeness centrality also considers shortest paths in a network. For a given node $v$ it represents the mean shortest path length from $v$ to all other nodes in the network. If an actor is found to be connected to other actors via short paths, they can therefore be considered to be quite central in the network. The formula used for calculating the closeness centrality of a particular node $v$ is as follows: $$ C(v) = \frac{1}{(\sum_{u\in V}d(v,u))/n} $$ where $d(v,u)$ is the length of the shortest path (number of edges) between nodes $v$ and $u$, and $n$ is the number of nodes in the network. Higher values of this measure therefore indicate a higher centrality. Note that we can also calculate the actual mean path length from a node $v$ to all other nodes in the network by simply dividing 1 by $C(v)$.

As with the previous example, calculating the closeness centrality of all nodes in our large network of actors would take too long on a single computer because, once again, it involves calculating the shortest paths between all pairs of nodes. In our case we make things easier by restricting our calculations to the top 1,000 actors according to the betweenness centrality measure from the previous subsection. The following code does this. First, it produces a list V of all actors in the network, ordered according to their betweenness centrality score. The closeness centrality is then calculated for each of the first 1,000 actors in this list. These are then ranked, and the top five are output.

In [16]:
V = [L[i][0] for i in range(len(L))]
D = {}
start = time.time()
for i in range(1000):
    D[V[i]] = nx.closeness_centrality(G, V[i])
end = time.time()
print("Time taken =", end-start, "seconds")
L = sorted(D.items(), key=lambda item: item[1], reverse=True)
for i in range(5):
    print(L[i][0], ":", L[i][1])
Time taken = 29075.51345229149 seconds
Christopher Lee : 0.34724000968978963
Michael Caine : 0.3428649311215694
Harvey Keitel : 0.3422440138966974
Christopher Plummer : 0.34123099284135183
Robert De Niro : 0.34059459561239835

The above output shows that, of these actors, Christopher Lee is again the most central. Amazingly, we find that we can get from Christopher Lee to any other actor in the network in an average of just 2.88 hops. The next best-connected actors are then, respectively, Michael Caine (average of 2.917 hops), Harvey Keitel (2.922), Christopher Plummer (2.931) and Robert De Niro (2.936).

If we want, we can also take a closer look at the number of actors within a certain number of hops from a chosen actor. For example, the following code creates a dictionary D that holds the length of the shortest path between Christopher Lee and all other actors in the network. It then counts the number of actors of distance 0, 1, 2, and so on.

In [17]:
D = nx.shortest_path_length(G, "Christopher Lee")
print(collections.Counter(D.values()))
Counter({0: 1, 1: 2056, 2: 98758, 3: 226350, 4: 48625, 5: 3655,  6: 369, 7: 36, 8: 9})

The output ftells us that all actors can be reached from Christopher Lee using fewer than nine hops. Exactly one actor can be reached in zero hops (Christopher Lee himself); 2,056 actors can be reached with one hop; 98,758 with two hops; and so on. In particular, note that over 86% of actors can be reached from Christopher Lee in fewer than four hops, and 99% in fewer than five hops.

Distribution of Actors' Closeness Centrality Scores

Finally, it is also interesting to look at the distribution of actors' closeness centrality scores. We have already seen that actors like Christopher Lee, Om Puri, and Michael Caine are very central and well connected, but what is the score of a "typical" actor. Once again, the expense of calculating shortest paths between all pairs of actors is prohibitively expensive for a large network like ours. Instead, the code below takes a random sample of 1,000 actors, calculates their closeness centrality scores, works out the mean and standard deviation of this sample, and then uses the bespoke function doHistogramSummary() to plot this information to the screen.

In [18]:
def doHistogramSummary(X, xlabel, ylabel, title):
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.hist(X, bins=20)
    plt.show()

random.shuffle(V)
D = {}
for i in range(1000):
    D[V[i]] = 1 / nx.closeness_centrality(G, V[i])
L = D.values()
print("Mean =", statistics.mean(L))
print("SD   =", statistics.stdev(L))
doHistogramSummary(L, "Average distance to all Actors", "Frequency", "Closeness centrality distribution")
Mean = 4.28898541297011
SD   = 0.497230661969893

The output from this code indicates that the average distance between any two actors in this sample is just under 4.29 hops. As a comparison, this is slightly lower than the six hops hypothesised in the Six Degrees of Separation; however, it is slightly higher than the mean of 3.57 found in a similar analysis of Facebook friendships carried out by Facebook Research in 2016.

Conclusions and Further Discussion

This article has used tools from the NetworkX library to help determine the most important people in the social network of movie actors. Regardless of the measures used, the most central actors are consistently those who have or had very long acting careers, such as Christopher Lee, Nassar, Sukumari, Michael Caine, Om Puri, and Jackie Chan. This is quite natural, because long careers bring more acting opportunities, helping to improve an actor's connectivity in the network.

To contrast these findings, an extended version of this work shows the twenty most central actors by decade. These statistics were found in the same way as above, but only used movies that were released in that particular decade. As we might expect, this causes many new names to crop up. For movies released in the 1950's, for example, actors such as Louis de Funes (1914-1983) and George Thorpe (1891-1961) seem to be very central; for the 1990's on the other hand, the most central actors are people like Samuel L. Jackson, Om Puri, Vijayakumar, Roshan Seth and Frank Welker.

There are other ways in which we might have performed this analysis. Alternative centrality measures are also included in the NetworkX library, such as page rank centrality, Eigenvector centrality, Katz centrality, and current-flow closeness centrality.

All materials from this study are available online:

  • A complete listing of the dataset, source code, and results can be found in this zip file.
  • An extended pdf version of this work, together with comprehensive tables of results can be found here.