Recently I talked to a manager who, after looking through the source code, discovered that the old and dirty code had not changed during the year my team had worked on the project.
My first thought, and my reply, was that no one had ever refactored it because it simply works and the business would hardly pay for that.
But then I wondered whether the Lindy effect applies here.
Can you expect that newly committed source code is (in general) less likely to survive the next year than source code committed a long time ago?
I'll try to explain it with another example.
Writing software is like writing a book. You write the first version, show it to your spouse and rewrite it, then show it to the editor and rewrite it again, and again and again. So at version X you have some text that was added in version 2, some that comes from version 1, and some that is new in the current version. Which text is more likely to be removed: text added in version X, in version X-1, or in version X-C (C > 1)?
I have checked this on one of the most critical projects (in my opinion):
Mercurial SCM (one of the two most used source control systems). It was chosen because:
- it is widely used
- the source code of the project is kept in Mercurial itself
- at the time of publication there were 25000+ commits made by 250+ authors (see the ohloh.net stats)
In figure 1 you can see that 97% of the source code that was committed at least 5000 commits ago will survive the next 1000 commits*.
You can also see that lines committed 20000 commits ago have a 99.5% probability of surviving the next 1000 commits.
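To make the statistic concrete, here is a minimal Python sketch of how such a survival rate could be computed. It is not the code behind the figures: the function name `survival_rate`, the `(born, died)` representation of a line and the exact definition of "survives" are my assumptions for illustration only. A line counts if it was still present once it reached the given age, and it survives if it is still present `horizon` commits later.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical representation of one source line: the index of the commit
# that introduced it and the index of the commit that removed it
# (None if the line is still present at the latest commit).
Line = Tuple[int, Optional[int]]


def survival_rate(lines: Iterable[Line], head: int,
                  min_age: int, horizon: int) -> Optional[float]:
    """Fraction of lines that, having reached `min_age` commits of age,
    were still present `horizon` commits later.

    `head` is the index of the latest commit. Lines whose observation
    window would run past `head` are skipped, because their fate is
    not known yet. Returns None when no line qualifies.
    """
    at_risk = 0
    survived = 0
    for born, died in lines:
        checkpoint = born + min_age  # commit at which the line reaches min_age
        if checkpoint + horizon > head:
            continue  # not enough history after the checkpoint
        if died is not None and died <= checkpoint:
            continue  # removed before reaching min_age
        at_risk += 1
        if died is None or died > checkpoint + horizon:
            survived += 1
    return survived / at_risk if at_risk else None


# Toy data, not taken from any real repository.
toy_lines = [(0, None), (0, 7), (0, 3), (5, None), (2, 20)]
print(survival_rate(toy_lines, head=20, min_age=5, horizon=5))  # prints 0.75
```

With real data, `min_age=5000` and `horizon=1000` would correspond to the 97% point quoted above, under the assumptions stated.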
Figure 2 shows the same statistics for the Mozilla project.
Figure 3 shows the statistics for the Pidgin chat client.
Figure 3 shows the data for GNU Octave.
Figure 4 shows the data for the CLisp programming language.
Figure 5 shows the data for the Python programming language project.
Figure 6 shows the data for the nginx web server project.
Figure 7 shows the data for the GNU Multi-Precision Library project.
The source code showing how I did this can be found on GitHub.
* I chose to compare by commits rather than by dates, which reduces the effect of activity peaks.
** The higher the age, the less input data there was, so the figures are a bit more volatile on the right.