Thursday, July 16, 2015

Lindy Effect on source code

Recently I had a talk with a manager who, after looking through the source code, discovered that the old-and-dirty code had not changed during the year my team worked on the project.

My first thought and reply was that no one ever refactored it because it simply works, and the business would hardly pay for that.

But I was wondering whether the Lindy effect applies here.

Can you expect that newly committed source code is (in general) less likely to survive the next year than code committed a long time ago?

I'll try to explain it with another example.
Writing software is like writing a book. You write the first version, then show it to your spouse and rewrite it, then show it to the editor and rewrite it again, and again and again. So at version X you have some text that was added in version 2, some that comes from version 1, and some that was written in the current version. Which text is more likely to be removed: text added in version X, in version X-1, or in version X-C (C > 1)?

I checked this on one of the most critical projects (in my opinion): Mercurial SCM, one of the two most widely used source control systems. I chose it because:
  1. it is widely used
  2. the source code of the project is kept in Mercurial itself
  3. at the time of publication it had 25000+ commits made by 250+ authors (see stat)

Figure 1 shows that 97% of the source lines committed at least 5000 commits ago will survive the next 1000 commits*.

You can also see that lines committed 20000 commits ago have a 99.5% probability of surviving the next 1000 commits.
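
To make the statistic concrete, here is a minimal sketch of how such a number could be computed. It is not the code behind the figures (that one is on GitHub). It assumes the history of every line has already been reduced to a hypothetical pair (born, died): the index of the commit that added the line and the index of the commit that removed it, or None if the line still exists. The function name and parameters are made up for illustration.

```python
def survival_by_min_age(lines, snapshot, horizon, min_ages):
    """Fraction of lines that survive `horizon` more commits, per minimum age.

    lines    -- iterable of (born, died) commit indices; died is None
                when the line was never removed (assumed input format)
    snapshot -- commit index at which ages are measured
    horizon  -- how many commits ahead we check for survival
    min_ages -- age thresholds, in commits ("at least N commits ago")
    """
    lines = list(lines)
    result = {}
    for min_age in min_ages:
        alive = 0
        survived = 0
        for born, died in lines:
            # Skip lines younger than the threshold (or not written yet).
            if born > snapshot - min_age:
                continue
            # Skip lines already removed at or before the snapshot.
            if died is not None and died <= snapshot:
                continue
            alive += 1
            # The line survives if it is still there after `horizon` commits.
            if died is None or died > snapshot + horizon:
                survived += 1
        # None marks thresholds with no data, so they are not read as 0%.
        result[min_age] = survived / alive if alive else None
    return result
```

With real data one would take the snapshot at least `horizon` commits before the latest commit, so that the survival window is fully observed.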

Figure 2 gives the same statistics for the Mozilla project.

Figure 3 gives the statistics for the Pidgin chat client.

Figure 3 gives the data for GNU Octave.

Figure 4 gives the data for the CLisp programming language.
Figure 5 gives the data for the Python programming language project.

Figure 6 gives the data for the nginx web server project.
Figure 7 gives the data for the GNU Multi-Precision Library project.

The source code for this analysis can be found on GitHub.

* I chose to measure age in commits rather than in dates, which reduces the effect of activity peaks.

** The higher the age, the less input data there was, so the figures are a bit more volatile on the right.
