Tuesday, November 3, 2015

Who broke the build



Two developers work on the project: Rafik and Ellochka. Ellochka is the less experienced developer and makes twice as many mistakes as Rafik.

They commit sequentially, and they have a CI server that does not let a commit in until it:
  1. has compiled and run the tests on the latest commit;
  2. moreover, if the latest commit "breaks" the build, the server reverts it automatically;
  3. so at any moment in time there can be only one culprit.
Rafik breaks the build in 40% of his commits.
At some arbitrary moment you glance at the CI screen and see that "the build is broken".

Who is to blame?
With probability >85%, Rafik is not to blame (see Bayes' law).

But if Rafik breaks the build in 10% of his commits, then the probability that Rafik is not to blame is <70%.
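
The post appeals to Bayes' law without showing the calculation, so here is a minimal sketch of it. The commit shares and per-commit break rates below are illustrative assumptions (the post does not spell out its exact model), and the resulting posterior changes with them.

```java
// A minimal sketch of the Bayes calculation behind "who broke the build".
// The inputs in main() are illustrative assumptions, not the post's exact model:
// how "twice as many mistakes" maps to a per-commit break rate, and how often
// each developer commits, both change the answer.
public class WhoBrokeTheBuild {

    /**
     * P(Rafik is the culprit | the build is broken), assuming the culprit is
     * always the author of the last commit and commits are independent.
     */
    static double pRafikGuilty(double rafikCommitShare,
                               double pRafikBreaks,
                               double pOtherBreaks) {
        double otherCommitShare = 1.0 - rafikCommitShare;
        double pBrokenByRafik = rafikCommitShare * pRafikBreaks;
        double pBrokenByOther = otherCommitShare * pOtherBreaks;
        return pBrokenByRafik / (pBrokenByRafik + pBrokenByOther);
    }

    public static void main(String[] args) {
        double rafikCommitShare = 0.5;   // assumption: both commit equally often
        double pRafikBreaks = 0.40;      // from the post
        // The post does not state the colleague's per-commit break rate, only that
        // she makes twice as many mistakes; show how the posterior depends on it.
        for (double pOtherBreaks : new double[] {0.4, 0.6, 0.8, 1.0}) {
            double guilty = pRafikGuilty(rafikCommitShare, pRafikBreaks, pOtherBreaks);
            System.out.printf("colleague breaks %.0f%% -> P(Rafik not guilty) = %.1f%%%n",
                    100 * pOtherBreaks, 100 * (1 - guilty));
        }
    }
}
```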













Tuesday, August 4, 2015

An example of choosing the order of subtasks when a reply from the customer is needed


   Recently I had the following case. One developer had a task that we split into 3 unrelated subtasks:
  • subtask 1 — takes half a day (`t_{1} = 1/2 d = 4h`);
  • subtask 2 — takes 1 hour (`t_{2} = 1h`);
  • subtask 3 — has two incompatible solution options:
    • option A — half a day (`t_{3}^A = 1/2 d = 4h`);
    • option B — 3 days (`t_{3}^B = 3d = 24h`).
   The choice is up to the customer. The customer may reply within an hour (`t_{reply}^{fast} = 1h`) or may take a day (`t_{reply}^{slow} = 1d = 8h`). By our (mine and the developer's) estimate, the customer will choose option B with 90% probability (`P_{3}^B = 90%`). Formulating the question for him takes about half an hour (`t_{email} = 1/2 h`).

   I would have worked according to the following plan:
  • send the email to the customer;
  • subtask 1;
  • subtask 2;
  • if no reply has been received yet, start subtask 3, option B;
  • once the reply arrives, either continue with option B or switch to option A.

   The developer, however, decided to do it this way:
  • send the email to the customer;
  • subtask 3, option B (the 3-day one);
  • receive the reply from the customer (he guessed right, i.e. option B);
  • subtask 1;
  • subtask 2.
   Frankly, it had not even occurred to me that approaches to planning other than "mine" exist.

    Now, for those who like the math:

    First, my case.
    The maximum time to complete the whole task (worst case) with my approach:

`T_{my}^{max} = t_{email,1,2} + max( (t_{reply}^{slow} - t_{email,1,2} + t_3^A), t_3^B ) =`
`= 1/2h + 4h + 1h + max( (1d - 1/2h - 4h - 1h + 1/2d), 3d ) =`
`= 5 1/2 h + max( 6 1/2 h, 3d ) = 29 1/2 h`,
   where `t_{email,1,2} = t_{email} + t_1 + t_2`;

   The minimum time to complete the whole task (best case) with my approach:

`T_{my}^{min} = t_{email,1,2} + min( (t_{reply}^{slow} - t_{email,1,2} + t_3^A), t_3^B ) = 5 1/2 h + min( 6 1/2 h, 3d ) = 12 h`

   The probability of the worst case with my approach:

`P_{my}^{max} = P_3^B = 90%`
    The expected completion time with my approach:

`E[T_{my}] = P_{my}^{max} * T_{my}^{max} + (1 - P_{my}^{max}) * T_{my}^{min} =`
`= 90% * 29 1/2 h + 10% * 12 h = 27.75 h`


    Now the developer's approach.
    The maximum time to complete the whole task (worst case) with the developer's approach:
`T_{dev}^{max} = t_{email} + max( t_{reply}^{slow} + t_3^A, t_3^B ) + t_{1,2} =`
`= 1/2h + max( 1d + 1/2d, 3d ) + 4h + 1h = 29 1/2 h`, where `t_{1,2} = t_1 + t_2`;

    The minimum time to complete the whole task (best case) with the developer's approach:

`T_{dev}^{min} = t_{email} + min( t_{reply}^{slow} + t_3^A, t_3^B ) + t_{1,2} = 1/2h + min( 1d + 1/2d, 3d ) + 4h + 1h = 17 1/2 h`

   The probability of the worst case with the developer's approach (assuming fast and slow replies are equally likely):

`P_{dev}^{max} = P_{reply}^{slow} = 50%`

The expected completion time with the developer's approach:
`E[T_{dev}] = P_{dev}^{max} * T_{dev}^{max} + (1 - P_{dev}^{max}) * T_{dev}^{min} =`
`= 50% * 29 1/2 h + 50% * 17 1/2 h = 23.5 h`


So the developer's approach gives a lower expected completion time, even though in the best case it takes longer than mine.
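
A small sketch that recomputes the figures above from the inputs (all durations in working hours, 1 day = 8 hours; it keeps the simplifications used above, e.g. only the slow reply is considered in my plan):

```java
// Recomputes the expected completion times above from the post's inputs.
// All durations are in working hours (1 day = 8 hours).
public class SubtaskOrdering {
    public static void main(String[] args) {
        double tEmail = 0.5, t1 = 4, t2 = 1;          // email, subtask 1, subtask 2
        double t3A = 4, t3B = 24;                     // subtask 3, options A and B
        double tReplySlow = 8;                        // slow customer reply
        double pB = 0.9;                              // P(customer picks option B)
        double pSlowReply = 0.5;                      // assumed 50/50 fast vs slow reply

        // "My" plan: email, subtasks 1 and 2, then option B until the reply arrives.
        double tEmail12 = tEmail + t1 + t2;
        double myIfA = tEmail12 + (tReplySlow - tEmail12 + t3A);  // customer picks A
        double myIfB = tEmail12 + t3B;                            // customer picks B
        double myExpected = pB * myIfB + (1 - pB) * myIfA;

        // The developer's plan: email, option B right away, then subtasks 1 and 2.
        double devWorst = tEmail + Math.max(tReplySlow + t3A, t3B) + t1 + t2;
        double devBest  = tEmail + Math.min(tReplySlow + t3A, t3B) + t1 + t2;
        double devExpected = pSlowReply * devWorst + (1 - pSlowReply) * devBest;

        System.out.printf("my plan:  worst %.1fh, best %.1fh, expected %.2fh%n",
                Math.max(myIfA, myIfB), Math.min(myIfA, myIfB), myExpected);
        System.out.printf("dev plan: worst %.1fh, best %.1fh, expected %.2fh%n",
                devWorst, devBest, devExpected);
    }
}
```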

Thursday, July 16, 2015

Lindy Effect on source code


Recently I talked to a manager who, after looking through the source code, discovered that some old and dirty code had not been changed during the year in which my team worked on the project.

My first thought, and my reply, was that no one had ever refactored it because it simply works and the business would hardly pay for that.

But I wondered whether the Lindy effect applies here.

Should you expect that newly committed source code will (in general) have a lower probability of remaining after a year than code committed a long time ago?

I'll try to explain it with another example.
Writing software is like writing a book. You write the first version, then show it to your spouse and rewrite it, then show it to the editor and rewrite it again, and again and again. So at version X you have some text that was added in version 2, some that comes from version 1, and some that was added within the current version. Which text has more chances of being removed: text added in version X, in version X-1, or in version X-C (C > 1)?

I checked this on one of the projects most critical for me: Mercurial SCM (one of the two most widely used source control systems). It was chosen because:
  1. it is widely used;
  2. the source code of the project is kept in Mercurial itself;
  3. at the time of publication there were 25000+ commits made by 250+ authors (see the ohloh.net stats).

Figure 1 shows that 97% of the source code lines that were committed at least 5000 commits ago will remain over the next 1000 commits*.

You can also see that lines committed 20000 commits ago have a 99.5% probability of surviving the next 1000 commits.


Figure 2 gives the same statistics for the Mozilla project.





Figure 3 gives the statistics for the Pidgin chat client.


Figure 4 gives the data for GNU Octave.


Figure 5 gives the data for the CLisp project.
Figure 6 gives the data for the Python programming language project.


Figure 7 gives the data for the nginx web server project.
Figure 8 gives the data for the GNU Multi-Precision Library (GMP) project.

The source code showing how I did this can be found on GitHub.
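
For illustration only, here is a rough sketch of the idea for a single file, assuming a local Mercurial clone and the `hg annotate` command; the real analysis walks the whole repository. The repository path and file name below are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Rough sketch of the survival statistic for a single file of a local Mercurial
// clone: count lines attributed to sufficiently old revisions at revision REV,
// then count how many of them are still present 1000 commits later.
public class LineSurvivalSketch {

    /** Counts lines of `file` at revision `rev` whose originating revision is <= maxOrigin. */
    static long oldLines(String repo, String file, int rev, int maxOrigin) throws Exception {
        Process p = new ProcessBuilder("hg", "annotate", "-n", "-r", String.valueOf(rev), file)
                .directory(new java.io.File(repo))
                .redirectErrorStream(true)
                .start();
        long count = 0;
        try (BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                int colon = line.indexOf(':');
                if (colon <= 0) continue;
                String prefix = line.substring(0, colon).trim();
                if (!prefix.matches("\\d+")) continue;      // skip non-annotation output
                if (Integer.parseInt(prefix) <= maxOrigin) count++;
            }
        }
        p.waitFor();
        return count;
    }

    public static void main(String[] args) throws Exception {
        String repo = "/path/to/hg/clone";       // placeholder path to the clone
        String file = "mercurial/commands.py";   // one file of the repository
        int rev = 20000, window = 1000, minAge = 5000;
        long before = oldLines(repo, file, rev, rev - minAge);
        long after  = oldLines(repo, file, rev + window, rev - minAge);
        System.out.printf("lines >= %d commits old at rev %d: %d; still present %d commits later: %d (%.1f%%)%n",
                minAge, rev, before, window, after, 100.0 * after / before);
    }
}
```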

* I chose to measure age in commits rather than in dates, thus reducing the effect of activity peaks.

** The higher the age, the less input data there was, so the figures are a bit more volatile on the right.

Web Services Interoperability. Java client identified via certificates

Based on http://webservices20.blogspot.com/search/label/Web%20Services%20Interoperability

The problem: call a WCF endpoint from a Java client, with both parties identified via certificates.

  1. Set up the environment: you should have NetBeans 7 (or higher).
  2. Download Metro version 2.1 (or higher) from http://metro.java.net and extract it somewhere.
  3. Establish a VPN connection to quansis if needed.
  4. Install the "SOAP Web Services" plugin (Tools -> Plugins -> Available Plugins).
  5. Create a new project of type Java Application:


  6. Right-click the package in the project view and add a new "Web Service Client":

  7. Paste the WSDL URL in the opened window and click "Finish".




  8. The quansis development server uses self-signed certificates, so the NetBeans IDE (like any other Java application) will complain; just accept the certificate in the warning window. Sometimes, to force it to accept the certificate, you may need to open the WSDL URL in a web browser.
  9. On success, some new elements will appear in the META-INF folder of your project.
  10. Unfortunately, NetBeans uses the default Metro library from GlassFish, which is 2.0. That version has some bugs with establishing secure connections to WCF IIS endpoints. To overcome this obstacle you need to:

  • Open the web service quality attributes form as you did in step 6. Uncheck the "Use development defaults" check box. Now configure the keystore and truststore.

  11. Write the client code (a minimal sketch is given below, after this list).
  12. Now run the application (F6).
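
The client code in step 11 mostly calls the classes that NetBeans generated from the WSDL. The service and port class names below are hypothetical placeholders for whatever your WSDL import produced; the SSL system properties are just one way to point the JVM at the keystore and truststore, as an alternative to the quality-of-service dialog from the previous step.

```java
// Hypothetical example of step 11: MyService, IMyService and the echo operation
// are placeholders for the classes generated from your WSDL.
public class Main {
    public static void main(String[] args) {
        // Optional: point the JVM at the keystore/truststore that hold the
        // self-signed certificates (an alternative to the dialog in step 10).
        System.setProperty("javax.net.ssl.keyStore", "client-keystore.jks");
        System.setProperty("javax.net.ssl.keyStorePassword", "changeit");
        System.setProperty("javax.net.ssl.trustStore", "client-truststore.jks");
        System.setProperty("javax.net.ssl.trustStorePassword", "changeit");

        // Generated service wrapper (names depend on the WSDL).
        MyService service = new MyService();
        IMyService port = service.getWsHttpBindingIMyService();

        // Call an operation defined by the WSDL.
        String result = port.echo("hello from java");
        System.out.println(result);
    }
}
```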

Wednesday, May 20, 2015

Lindy Effect: Java Garbage Collector


This post is another one in the series on the Lindy Effect in software engineering.

The Lindy Effect posits that, for a certain class of non-perishable things like a technology or an idea, every additional day of life may imply a longer remaining life expectancy.


In software, memory usage is one of the most important concerns. Most data does not need to stay in memory forever and should be deleted once it is no longer needed.

Up to the 70s, in most programming languages developers had to clean up (remove data that is no longer needed) explicitly.

In many more recent languages, a form of automatic memory management, garbage collection, can be used instead.

With this technology developers don't need to keep track of whether some data is no longer needed: in most cases the system will remove it automatically.

In Java the garbage collector works by roughly the following procedure (a bit simplified):
  1. the heap is divided into 3 spaces: the young generation (eden and survivor spaces) and the old generation (tenured space);
  2. new data is allocated in the eden space;
  3. whenever free memory runs low, the eden space is checked: unreferenced data is automatically deleted and the rest is moved to a survivor space;
  4. whenever free memory is very low, the survivor space is checked: unreferenced data is automatically removed, and the rest either remains in the survivor space or is moved to the tenured space, depending on its age;
  5. whenever free memory is extremely low, the tenured space is checked and unreferenced data is removed from there.
 Steps 3-5 were designed because "empirical analysis of applications has shown that most objects are short lived" (Java Garbage Collection Basics: Why Generational Garbage Collection).
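
A toy program to watch this behaviour: it allocates lots of short-lived garbage plus a small amount of long-lived data, so if you run it with GC logging enabled (e.g. `java -verbose:gc`, or `-Xlog:gc` on JDK 9+) you should see frequent young collections while the retained data gets promoted and is rarely touched.

```java
import java.util.ArrayList;
import java.util.List;

// Toy program for watching generational collection; run with GC logging enabled,
// e.g. "java -verbose:gc GcDemo" (or -Xlog:gc on JDK 9 and later).
public class GcDemo {
    public static void main(String[] args) {
        List<byte[]> longLived = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            // Short-lived garbage: dies in eden and is reclaimed by frequent young GCs.
            byte[] temp = new byte[1024];
            temp[0] = (byte) i;

            // A small fraction stays referenced: it survives young collections
            // and is eventually promoted to the tenured (old) generation.
            if (i % 10_000 == 0) {
                longLived.add(new byte[1024]);
            }
        }
        System.out.println("retained blocks: " + longLived.size());
    }
}
```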


But that procedure can be seen as an application of the Lindy Effect as well: data that has survived longer has a higher probability of surviving even longer.



Monday, April 20, 2015

Lindy effect: check on openhub.net


Recently I came across the idea that "The longer a technology has been around, the longer it's likely to stay around." [J. D. Cook's blog].

I've found that some software developers consider it interesting (see the comments on my post on a Ukrainian developers' forum).

Unfortunately, I haven't found any statistics to support that assumption.

There is the openhub.net resource, which collects various statistics on open source software projects.

I've scanned the first 34000 projects from it. The results of my scanning are available in the [google docs report file]. In that file, createdAt stands for the date when openhub originally scanned the project, and updatedAt stands for the date when openhub recorded the last modifications (in general it rescans a project biweekly).

The summarized statistics are given in Table-1. The obtained statistics are not that reliable due to the limited input data set etc., but they can be considered a starting point.

Note: due to the limited time range, only the first several rows should be considered.

Please also note that most openhub projects are closed automatically on certain scheduled dates, so the summarized statistics are close to misleading; you'd better use the [google docs report file] and do the calculations yourself.


Table-1: summarized statistics (percentage of projects of a given age that survived an additional N years)

Age (years) | Projects with lifetime >= age | +0.5 y | +1 y | +1.5 y | +2 y | +2.5 y
0           | 34050                         | 48%    | 46%  | 45%    | 42%  | 41%
0.05        | 20270                         | 80%    | 77%  | 75%    | 71%  | 69%
0.25        | 17013                         | 94%    | 91%  | 89%    | 84%  | 81%
0.5         | 16211                         | 97%    | 94%  | 89%    | 87%  | 83%
0.75        | 15915                         | 97%    | 95%  | 89%    | 86%  | 21%
1           | 15706                         | 97%    | 91%  | 90%    | 86%  | 6%
1.25        | 15490                         | 98%    | 92%  | 88%    | 21%  |
1.5         | 15305                         | 94%    | 92%  | 88%    | 6%   |
1.75        | 15143                         | 94%    | 90%  | 22%    |      |
2           | 14354                         | 98%    | 94%  | 6%     |      |
2.25        | 14214                         | 96%    | 23%  |        |      |
2.5         | 14070                         | 96%    | 7%   |        |      |
2.75        | 13703                         | 24%    |      |        |      |
3           | 13523                         | 7%     |      |        |      |
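
One way such survival percentages could be derived from the report file, assuming it is exported as a CSV with ISO dates in createdAt and updatedAt columns, and taking a project's lifetime as updatedAt minus createdAt (with all the caveats about automatic closing noted above). The file name is a placeholder.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Sketch of how a survival table like Table-1 could be derived, assuming the
// report is exported as a CSV with two ISO-date columns: createdAt,updatedAt.
public class OpenhubSurvival {
    public static void main(String[] args) throws Exception {
        List<Double> lifetimesYears = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader("openhub-report.csv"))) {
            String line = in.readLine();                       // skip the header row
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                LocalDate created = LocalDate.parse(cols[0].trim());
                LocalDate updated = LocalDate.parse(cols[1].trim());
                lifetimesYears.add(ChronoUnit.DAYS.between(created, updated) / 365.25);
            }
        }
        double[] ages = {0, 0.25, 0.5, 1, 1.5, 2};
        double extra = 1.0;                                    // "survived an additional N years"
        for (double age : ages) {
            long reached  = lifetimesYears.stream().filter(l -> l >= age).count();
            long survived = lifetimesYears.stream().filter(l -> l >= age + extra).count();
            System.out.printf("age >= %.2f y: %d projects, %.0f%% survived another %.1f y%n",
                    age, reached, 100.0 * survived / reached, extra);
        }
    }
}
```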

Tuesday, March 31, 2015

Lindy effect maven analyzer tool

From time to time I see that developers tend to prefer the newest technologies and the newest libraries they can find. On the other hand, managers prefer libraries with "big names" behind them.

But for me there is another point to consider: the longer a library has been around, the longer it is likely to remain around. For example, the pen has survived for thousands of years, while typewriters probably won't stay around for another decade.


Fortunately for me, I've found J.D. Cook's "The Lindy effect" blog post, which puts it in mathematical terms:
"The longer a technology has been around, the longer it's likely to stay around."

Since getting the full statistics is a bit too difficult, I've created a tool for estimating your project's dependencies (the libraries being used): https://github.com/bogdartysh/lindy-stat/

With that tool you can find which of the libraries used by your Maven project are the newest ones, so you will be aware of them as the most risky ones.
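
I won't reproduce lindy-stat here, but the first step of such an analysis, listing the dependencies declared in a project's pom.xml, could look roughly like the sketch below; the age of each groupId:artifactId would then be looked up, e.g. in the Maven Central release history.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// A rough sketch of the first step of such an analysis: list the dependencies
// declared in a pom.xml. (lindy-stat itself may work differently; the age of
// each groupId:artifactId would then be looked up, e.g. on Maven Central.)
public class PomDependencies {
    public static void main(String[] args) throws Exception {
        Document pom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("pom.xml"));
        NodeList deps = pom.getElementsByTagName("dependency");
        for (int i = 0; i < deps.getLength(); i++) {
            Element dep = (Element) deps.item(i);
            String groupId = text(dep, "groupId");
            String artifactId = text(dep, "artifactId");
            String version = text(dep, "version");
            System.out.println(groupId + ":" + artifactId + ":" + version);
        }
    }

    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
    }
}
```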