A Gentle Introduction to Statistical Mechanics of Graphs
October 10th, 2004
Let’s start with one little datum: the human brain has 10^11 (that is 10 with 11 zeros, or a 100 billion) nodes (neurons in this case) and the world wide web has 10^9 nodes (pages in this case). Both are directed graphs, but they are somewhat different in dynamic evolution: the brain synapses change their activating thresholds over time (using reinforcement mechanisms) while the web changes links and nodes upon external influence (humans or computer programs changing pages).
We could say, for classification, that the brain is an “active graph” while the web is a “passive graph”. Therefore, we will not expect “intelligence” emerging from the topological complexity of web pages in altering its own links directly, but it is widely known how Google was able to make some of this distributed knowledge emerge from the topological complexity (which is, at this point, probably close to the brain of a little animal).
So, humanity was able to create a passive graph that is now reaching the complexities of the inner workings of neural systems in little animals, but its passivity will preclude any form of self-awareness: this is because the act of modifying the graph is not self-induced.
Another datum: is the node size really what measures the complexity of a graph? no, we have to count the edges/arcs too. But is the number of edges good enough? again, no, we need something more.
So, how do we know if two complex graphs are similar in complexity? and how do define complexity similarity?
When graphs are so big, you need to have a way to see their structure at-a-glance. The basic property of every edge is that it always connects two nodes, so there is no much to know there. Way more important are concepts like “in-degree” and “out-degree” of nodes (how many egdes point to it and how many start from it) and a way to have a global view of the graph is to have a (logarithmic, given the size) distribution where each dot in the plot means “these many nodes (x axis) had that many edges (y axis)” either in-ward or out-ward going.
A property of those degree distributions is that, in a log-log scale, and for real-life graphs, they all appear to form straight lines.
When this happens, the distribution is said to follow a “power law” and the “slope” of this straight line is said to be its coefficient. All power law coefficients have been shown to be in range between 2.0 and 3.5 and it was mathematically proven that, for a certain type of graph, below 2.0 there is not enough linkage density to provide a “phase transition” in the graph and allow the entire graph to be connected.
Recently, a theory was put forward that says that graphs obtain their maximal resistance to “giant connected component fragmentation” when they form a power-law distribution: basically, it means that any other distribution would be less efficient in preserving itself, therefore all graphs that are too big for a single entity to modify globally, will exhibit such a distribution shape.
Now: how do the web and the human brain look like? they are less than two orders of magnitude away from each other. Do they follow a power-law distribution? There is a seminal paper that analyzed a dump of the Altavista search engine in 1999 that shows how the web does indeed follow a power law distribution for node degrees. Google does not publish the distribution of their graph (AFAIK, and this is a shame, IMO) so we don’t know better. And there is no way that we can “crawl” a human brain and find out its global topological distribution. Experiments in really small insects (up to 800-neuron brains) has showed that it does follow a power-law but it’s really not enough data and really not that much intelligence there (in the insect, not in the paper, mind you) to have statistical significance.
But what is the meaning of such a power law distribution?
If you take a graph and you connect nodes at random, you don’t get a straight-line distribution, but a bell-shaped one. But if you take a graph, pick nodes at random but connect them with a probability that is proportional to the existing node degree of the other node, you get a power law. In short: if you are rich, you are more likely to get richer. If you know many people, you are more likely to know get to know even more. If your blog is popular, it’s likely to become more popular. In Italy, we say: “always rains where is already wet”. It can be shown how such an effect would generate a power-law distributed water levels in buckets!
But what does this really mean? what properties do power-law graphs exhibit?
- small-world effect: a property of power-law graphs is that they have a very small diameter. The graph diameter is the maximum length of edges that one has to traverse to reach two nodes in a fully connected graph. As an example, the network of social connections in the US was shown to have a diameter of 6! that is every person in the US is a friend of a friend of a friend of a friend of a friend of a friend (or friend^6 if you wish) of another US citizen. The web in 1999 had a diameter of 17.
- strong connectedness: some power-law-distributed graphs exhibit the amazing property that allows them to sustain up to 80% of random node removal without losing their full connectedness. This property was the key in the design of decentralized systems (such as the internet or peer2peer file distribution networks). Another interesting effect (here used by the RIAA to damage those distribution networks) is that these networks are not so resistant at “targetted node removal”: if you know what networks are mostly connected and you remove those first, the amount of nodes you have to take out to break it apart is much smaller.
- ease of percolation: since they are all always connected and have small diameters, anything that flows flows very quickly around the entire graph. This can be good or bad. Good when, for example, queries in a p2p system need only a few seconds to reach all the nodes. Bad when the same ease is used by email worms or flu viruses.
Note how the properties that we used to analyze a graph are very local. We take each node and we ask ourselves how much that node is connected. But we know, from our experience, that graphs form clusters, the abstract equivalent of a community of nodes.
So, let’s introduce another property of a node, called “clustering coefficient”, as the ratio between the number of edges between the nodes that this node is connected to and the number of all possible edges between them. In short: how many of your friends know each other, how “densly connected” is your group of friends. The clustering coefficient is a real number that ranges between 0.0 and 1.0.
Now, it has been shown how two graphs that have both power-law distributions can have totally different distributions of their node’s clustering coefficients. I find this very fascinating because it indicates how the degree distribution of nodes (or any node-local distribution, for that matter) does not yield enough information to categorize a graph.
Let’s think about two types of nodes for a second: one is a person that moves a lot might end up with lots of friends, but their friends are unlikely to know each-other directly. This node’s in and out degrees will be big but the clustering coefficient will be close to zero. On the other hand, a person that is born and dies in the same place will likely have a small number of in and out degrees but a clustering coefficient close to one.
Now, how does the clustering coefficient distribution changes the properties of a graph?
Good question! There is not much literature about this, yet.
Googling around for this, I found this very interesting thesis about the statistical analysis of the graphs of collaborations between sourceforge projects. Amazingly enough, it shows that the clustering coefficient of the graph decreases over time! What does this mean I honestly don’t know, but my fingers are itching to start to analyze the ASF data to find out where we stand. Oh, wait, look what I found googling around “applying social network analysis in the information in CVS repositories“.
I’m sure at this point you are wondering what statistical mechanics has to do with all this.
The biggest challenge in complex graph analysis is to understand how it operates. Since real-life graphs are always changing, the challenge is to discover what is the process that makes them evolve.
The holy grail of statistical mechanics of graphs is to understand which processes operate on the graph (and hopefully finding out that there are just a finite number of such processes!), simply by looking at its statistical properties (both locally an globally, both in space and time).
Personally, I don’t know where to stand in this search for the “theory of everything” for graphs, since it reminds me of the complexity of 21-dimensional super-string theory to make all force fields unify. But I have to admit there is appeal in the fact that graphs obtained from incredibly different processes do, indeed, exhibit the same sort of distributions.
One thing I already noticed with my work on virtual community graphs created out of mail replies and on my work in transforming digital libraries into RDF models is that graphs do exhibit incredibly different distributions depending on the fact that centralized control was present or not. This is not easy to tell if you just browse around the graph, with a space locality of one or two nodes, but if you start drawing a picture of a graph that loses the details on the single node and focuses on the clustering properties, then you see amazing differences, even just by looking at it. It’s like distinguishing families of fractals.
So, what’s next?
We have seen how the simple introduction of one circle of friends around a node identified a “less local” property and opened the door for new speculations: what is the property of a network with lost of small villages compared with the network with lots of moving people? How is percolation effected? How is resistance to attacks? Research is underway in these areas.
But are there more global properties of a graph? A property has been proposed, called “modularity”, that is a single number for every graph and basically indicates the ratio between the density of connectivity between the graph and the density of connectivity that the same graph would have if the edge were randomly distributed. It basically indicates how similar the graph is to a random graph. It has been used in algorithms that isolate network clusters by subsequent removal of the edges with maximum “betweenness” (that is, the how many of the minimum-distance paths between two nodes in the graph pass thru that edge) and finding where to stop by looking at the local maxima of the modularity function obtained after each edge removal (incredibly clever algorithm!).
But moreover: are there more global properties of the network? and what about time-global properties? what about the distribution of time-based clustering coefficients, such as “friends that became friends with others because they met them thru me”?
The hunt of the understanding of complex graphs just started and it’s so fascinating and challenging (both theoretically and practically) that I can hardly stop myself from spending all my free time (and even my non free-time, eheh) reading about it