
Why AI or ML Software Projects need Heroes

 - By Suvodeep Majumder, IEEE Member, Joymallya Chakraborty, Amritanshu Agrawal, Tim Menzies, IEEE Fellow


I felt that executing an ML or AI project is not purely technical; clear communication between team members and the size of the team also matter. Hence I picked up this topic.

Heroes are those who participate in 80% (or more) of the communications associated with a commit.

Abstract

A “hero” project is one where 80% or more of the contributions are made by 20% of the developers. In the literature, such projects are deprecated since they might cause bottlenecks in development and communication. However, there is little empirical evidence on this matter. Further, recent studies show that such hero projects are very prevalent. Accordingly, this paper explores the effect of having heroes in projects from a code quality perspective. We identify hero developer communities in 1100+ open-source GitHub projects. Based on that analysis, we find that (a) the overwhelming majority of projects are hero projects, and (b) commits from “hero developers” (those who contribute most to the code) result in far fewer bugs than commits from other developers. That is, contrary to the literature, heroes are a standard and very useful part of modern open-source projects.


A “hero” project is one where 80% or more of the contributions come from 20% of the developers. In the literature, such hero projects are deprecated since, it is said, they are bottlenecks that slow project development and cause information loss. Recent studies have motivated a re-examination of the implications of heroes. In 2018, Agrawal et al. studied 661 open-source projects and 171 in-house proprietary projects. In that sample, over 89% of all projects were hero-based. Only in small open-source projects (with under 15 core developers) were non-hero projects more prevalent. To say the least, this widespread prevalence of heroes is at odds with established wisdom in the SE literature. Hence, it is now an open and pressing issue to understand why so many projects are hero-based. To that end, this paper checks the Agrawal et al. result. All of the project data was collected from scratch, from double the number of open-source projects (over 1100) used by Agrawal et al. Also, we use a different method for recognizing a hero project: Agrawal et al. just counted the number of commits made by each developer, whereas in this study we say heroes are those who participate in 80% (or more) of the communications associated with a commit.


  1. We clearly demonstrate the benefits of hero-based development, which is contrary to much prior pessimism. 
  2. Our conclusions come from 1100+ projects, whereas prior work commented on heroes using data from just a handful of projects. 
  3. Our conclusions come from very recent projects instead of decades-old data. 
  4. We show curves that precisely illustrate the effects on code quality of different levels of communication. This differs from prior work, which only offered general qualitative principles. 
  5. This paper makes its conclusions using more metrics than prior work. Not only do we observe an effect (using process and resource metrics to report the frequency of developer contribution), but we also report the consequence of that effect (by joining these to product metrics that reveal software quality). 
  6. Instead of just reporting an effect (that heroes are common, as done by Agrawal et al.), we can explain that effect (heroes are those who communicate more, and that communication leads to fewer bugs).
  7. As a service to other researchers, all the scripts and data of this study can be downloaded from https://github.com/ai-se/Git_miner



Firstly, when we say 1100+ projects, that is shorthand for the following: our results used the intersection of a code interaction graph (of who writes what code) from 1327 projects with a social interaction graph (of who discusses what commits) from 1173 projects. Secondly, by code interaction graphs and social interaction graphs, we mean the following. Each graph has its own nodes and edges {N, E}.

For code interaction graphs:

  • Individual developers have their own node Na; 
  • An edge Eb connects two nodes and indicates whether one developer has ever changed another developer’s code. 

For social interaction graphs (like Figure 1):

  • A node Nc is created for each individual who has created or commented on an issue; 
  • An edge Ed indicates communication between two individuals (as recorded in the issue-tracking system). If this happens N times, then the weight Wd = N. 

Thirdly, our definition of “hero” is not “writes 80% of the software”, since such a definition is hard to operationalize for modern agile projects (where many people might lend a hand to much of the code). Instead, we say heroes are those who “participate in 80% of the discussions prior to the commits”.






Table 1


Software Quality Metrics


Table 1 shows that most papers do not use a wide range of metrics. Xenos distinguishes these kinds of metrics as follows. Product metrics are directly related to the product itself (such as code statements, delivered executables, and manuals) and strive to measure product quality, or attributes of the product that can be related to product quality. Process metrics focus on the process of software development and measure process characteristics, aiming to detect problems or to push forward successful practices. Lastly, personnel metrics (a.k.a. resource metrics) are those related to the resources required for software development and their performance. The capability and experience of each programmer, and the communication among all the programmers, are related to product quality.


  • The code interaction graph is a process metric; 
  • The social interaction graph is a personnel metric; 
  • Defect counts are product metrics.


This paper combines all three kinds of metrics and applies the combination to exploring the effects of heroism on software development. There are many previous studies that explore one or two of these types of metrics. Fig 2 summarizes Table 1 and shows that, in that sample, very few papers on software metrics and code quality combine insights from product, process, and personnel metrics.


Fig 2




We applied some of our own engineering judgement to filter our data, as follows (a sketch of these filters appears after this list):

  • Collaboration: refers to the number of pull requests. This is indicative of how many other peripheral developers work on this project. We required all projects to have at least one pull request. 
  • Commits: The project must contain more than 20 commits. 
  • Duration: The project must contain software development activity of at least 50 weeks. 
  • Issues: The project must contain more than 10 issues. 
  • Personal Purpose: The project must not be used and maintained by one person. The project must have at least eight contributors. 
  • Software Development: The repository must contain software development source code. 
  • Project Documentation Followed: The project should follow proper documentation standards, logging commit comments and issue events so that commits can be linked to issues. 
  • Social Network Validation: The social network that is built should have at least 8 connected nodes in both the communication graph and the code interaction graph. 
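
To make the filters concrete, here is a minimal Python sketch. The dictionary keys (pull_requests, commits, duration_weeks, issues, contributors) are hypothetical names for project metadata mined from the GitHub API beforehand; they are not part of the original study's scripts.

```python
# Minimal sketch of the sanity filters above. All key names are
# hypothetical; real metadata would be mined from the GitHub API.
def passes_filters(project: dict) -> bool:
    return (
        project["pull_requests"] >= 1        # Collaboration
        and project["commits"] > 20          # Commits
        and project["duration_weeks"] >= 50  # Duration
        and project["issues"] > 10           # Issues
        and project["contributors"] >= 8     # Personal Purpose
    )

example = {"pull_requests": 12, "commits": 340, "duration_weeks": 120,
           "issues": 45, "contributors": 19}
print(passes_filters(example))  # True
```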


Target project selection


  • Release: Releases (based on Git tags) mark specific points in a repository’s history. The number of releases indicates how many versions have been published, which signifies a considerable amount of change between versions. 
  • Duration: The length of the project from its inception to the current date (or the project’s archive date). This signifies how long a project has been running and in an active development phase. 
  • Stars: The number of people who like a project, or who bookmark it so they can follow what is going on with the project later. 
  • Forks: A fork is a copy of a repository. Forking a repository allows users to freely experiment with changes without affecting the original project. This signifies how many people are interested in the repository and actively thinking about modifying the original version. 
  • Watchers: Watchers are GitHub users who have asked to be notified of activity in a repository but have not become collaborators. This represents people actively monitoring a project, because of possible interest or dependency. 
  • Developers: Developers are the contributors to a project, who work on the code and submit it to the codebase via commits. The number of developers signifies the interest of developers in actively participating in the project, and the volume of the work.



Fig 4


Process Metrics


  • Project commits were extracted from each branch in the git history. 
  • Commits are extracted from the git log and stored in a file system.
  • To access the file changes in each commit, we recreate the files that were modified in each commit by continuously moving the git head chronologically along each branch. Changes were then identified by running git diff on two consecutive commits. 
  • The graph is created by going through each commit and adding a node for the committer. Then we use git blame on the changed lines to find the previous commits, following a process similar to the SZZ algorithm. We identify all the developers of those commits from git blame and add them as nodes as well. 
  • After the nodes are created, edges are drawn between the developer who changed the code and the developer whose code was changed. These edges are weighted by the size of the changes between the two people (see the sketch after this list).
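
The construction of the code interaction graph can be sketched as follows, assuming the (changer, original author, lines changed) triples have already been extracted with git diff and git blame. This is a minimal illustration using networkx, not the study's actual mining script:

```python
import networkx as nx

# Each tuple is assumed to have been extracted beforehand with
# `git diff` + `git blame`:
# (developer who changed the code, developer whose code was changed, lines changed)
changes = [
    ("alice", "bob", 12),
    ("alice", "bob", 3),
    ("carol", "alice", 7),
]

G = nx.Graph()
for changer, original_author, n_lines in changes:
    if G.has_edge(changer, original_author):
        # accumulate the size of changes between the two people
        G[changer][original_author]["weight"] += n_lines
    else:
        G.add_edge(changer, original_author, weight=n_lines)

print(G.edges(data=True))
```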



Personnel Metrics


  • A node is created for the person who created the issue; then another set of nodes is created for each person who commented on the issue. So, essentially, each node in the social interaction graph is any person (developer or non-developer) who ever created an issue or commented on one. 
  • The nodes are connected by edges, which are created by (a) connecting the person who created the issue to all the people who commented on that issue and (b) creating edges between all the people who commented on the issue, including the person who created it. 
  • The edges are weighted by the number of comments between two people. 
  • The weights are updated using the entire history of the project. The creation and weight updates are similar to Figure 5 (see the sketch after this list).
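
A minimal sketch of the social interaction graph construction, assuming each issue has already been mined into a (creator, list-of-commenters) pair; the data here is illustrative only:

```python
import itertools
import networkx as nx

# Each issue is assumed to be a (creator, [commenters...]) pair
# mined from the issue tracker; names are illustrative.
issues = [
    ("alice", ["bob", "carol", "bob"]),
    ("bob",   ["alice"]),
]

G = nx.Graph()
for creator, commenters in issues:
    participants = [creator] + commenters
    # connect every pair of participants; edge weight counts the
    # comment interactions between the two people
    for u, v in itertools.combinations(participants, 2):
        if u == v:
            continue
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1
        else:
            G.add_edge(u, v, weight=1)

print(G.edges(data=True))
```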


Fig 5

Product Metrics


• It starts with all the commits from the git log and extracts the commit messages, as these are often an excellent source of information about what a commit is for.
• Then, to use the commit messages for labelling, it applies a natural-language processor, which includes stemming and other nltk preprocessors, to normalize the commit messages.
• Then, to identify commit messages that represent bug/issue-fixing commits, a list of words and phrases extracted from previous studies of 1000+ projects (open source and enterprise) is used. The system checks for these words and phrases in the commit messages and, if found, marks those commits as ones that fixed bugs (a sketch of this labelling step appears after this list).
• As a sanity check, a portion of the commits was manually verified using random sampling from different projects.
• These labeled commits are then processed to extract the file changes, using the process described under process metrics.
• Next, git blame is used to trace each changed line in each file back through the git history to identify the commit where that line was last created or changed.
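
The labelling step can be sketched as below. The keyword set is a small illustrative subset (already in stemmed form); the study used words and phrases drawn from earlier analyses of 1000+ projects:

```python
import re
from nltk.stem import PorterStemmer

# Illustrative subset of bug-fix keywords, in stemmed form;
# the study's list was extracted from prior work on 1000+ projects.
BUG_KEYWORDS = {"bug", "fix", "error", "issu", "fault", "defect", "patch"}

stemmer = PorterStemmer()

def is_bug_fix(commit_message: str) -> bool:
    """Normalize the message (lowercase, tokenize, stem) and look for bug-fix words."""
    tokens = re.findall(r"[a-z]+", commit_message.lower())
    stems = {stemmer.stem(t) for t in tokens}
    return bool(stems & BUG_KEYWORDS)

print(is_bug_fix("Fixed a null-pointer bug in the parser"))  # True
print(is_bug_fix("Add README badges"))                       # False
```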

Finally, top contributors (or heroes) and non-heroes were defined as:

$$D(N_i) = \sum_{j=1}^{n} a_{ij} \tag{1}$$

$$\text{Hero}: \ \text{Rank}(D(N_i)) > \frac{P}{100}(N + 1) \tag{2}$$

$$\text{Non-Hero}: \ \text{Rank}(D(N_i)) < \frac{P}{100}(N + 1) \tag{3}$$

where:

  • N = number of developers; 
  • P = percentile (here, 95); 
  • Rank() = the percentile rank of a score, i.e., the percentage of scores in its frequency distribution that are equal to or lower than it; 
  • a = the adjacency matrix of the graph, where $a_{ij} > 0$ denotes a connection.


Categorization of the developers into two groups (see the sketch after this list):

• The hero developers: the core group of developers of a project, who make regular changes to the codebase. In this study, these are the developers whose node degree is above the 95th percentile of node degrees (in the developer communication and code interaction graphs).
• The non-hero developers: all other developers, i.e., developers associated with nodes whose degree is below the 95th percentile. This study compares the performance of these two sets of developers using the percentage of bugs they introduced into the codebase.
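
Equations (1)-(3) and the categorization above can be sketched as follows, using networkx's weighted degree for D(N_i) and numpy for the percentile cut; the graph here is a toy example:

```python
import numpy as np
import networkx as nx

# Toy graph standing in for either the code interaction or the
# social interaction graph built earlier.
G = nx.Graph()
G.add_weighted_edges_from([
    ("alice", "bob", 5), ("alice", "carol", 3), ("alice", "dave", 9),
    ("bob", "carol", 1), ("dave", "erin", 2),
])

degrees = dict(G.degree(weight="weight"))          # D(N_i) = sum_j a_ij
threshold = np.percentile(list(degrees.values()), 95)  # 95th percentile cut

heroes     = {d for d, deg in degrees.items() if deg > threshold}
non_heroes = set(degrees) - heroes
print(heroes, non_heroes)
```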


Analysis


RQ1 How common are hero projects? 

We say a project is a “hero project” if, when we isolate the developers who handle 95% (or more) of the interactions, we see only 5% (or less) of the developers. By “interaction” we mean the weighted in-degree count of each vertex. The top group comprises all vertices with a count above min + 0.2 * (max − min) (where min and max are the smallest and largest counts). This definition can be applied to either the code interaction graph or the social interaction graph; regardless of the source, the observed pattern is the same. Measured in terms of either code or social interaction, hero projects comprise over 80% of our sample. A small sketch of this test follows.
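
A minimal sketch of this test, with illustrative in-degree counts (a real project has many more developers, so a single top developer can fall under the 5% cutoff):

```python
# Minimal sketch of the RQ1 hero-project test. `indegree` maps each
# developer to their weighted in-degree in one interaction graph;
# the numbers here are illustrative.
indegree = {"alice": 120, "bob": 4, "carol": 7, "dave": 2, "erin": 3}

lo, hi = min(indegree.values()), max(indegree.values())
cutoff = lo + 0.2 * (hi - lo)              # min + 0.2 * (max - min)

top = [d for d, c in indegree.items() if c > cutoff]
is_hero_project = len(top) / len(indegree) <= 0.05
print(top, is_hero_project)                # ['alice'] False (only 5 devs here)
```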

RQ2 What impact does heroism have on code quality? 


RQ2 explores what kind of effect heroism has on code quality. To explore this, we created the developer social interaction graph and the developer code interaction graph, then identified the developers responsible for introducing bugs into the codebase. We then compute the percentage of buggy commits introduced by those developers by checking
(a) the number of buggy commits introduced by those developers and
(b) their total number of commits.
A sketch of this measurement follows.
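
A minimal sketch of this computation; the commit labels and the hero set are assumed to come from the earlier labelling and categorization steps:

```python
# (developer, introduced_bug) pairs, assumed to come from the
# SZZ-style blame step above; data here is illustrative.
commits = [
    ("alice", False), ("alice", True), ("alice", False),
    ("bob", True), ("bob", True),
]
heroes = {"alice"}  # assumed output of the 95th-percentile categorization

def buggy_percentage(group):
    owned = [bug for dev, bug in commits if dev in group]
    return 100.0 * sum(owned) / len(owned)

all_devs = {dev for dev, _ in commits}
print(buggy_percentage(heroes))             # hero bug rate
print(buggy_percentage(all_devs - heroes))  # non-hero bug rate
```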

RQ3: Does team size alter the above results?


Projects are sectioned into three categories (see the sketch after this list):
• Small: A project is considered small if its number of developers is greater than 8 but less than 15.
• Medium: A project is considered medium if its number of developers is greater than 15 but less than 30.
• Large: A project is considered large if its number of developers is greater than 30.
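
As a small sketch (the boundary cases of exactly 15 or 30 developers are not assigned by the prose above; here they fall into the larger bucket, as an assumption):

```python
def team_size_category(n_developers: int) -> str:
    # Boundary values (exactly 15 or 30) are assigned to the larger
    # bucket here as an assumption; the prose leaves them undefined.
    if 8 < n_developers < 15:
        return "small"
    if 15 <= n_developers < 30:
        return "medium"
    if n_developers >= 30:
        return "large"
    return "excluded"  # projects with 8 or fewer developers were filtered out

print(team_size_category(12), team_size_category(22), team_size_category(40))
```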

Critical for success of projects


Chief Programmer 


One strange feature of our results is that what is old is now new. Our results (that heroes are important) echo a decades-old concept. In 1975, Fred Brooks wrote of “surgical teams” and the “chief programmer” [108]. He argued that:


  • much as a surgical team during surgery is led by one surgeon performing the most critical work, while directing the team to assist with the less critical parts, 
  • software projects should be led by one “chief programmer” who develops the critical system components, while the rest of the team provides what is needed at the right time.


Brooks conjectured that “good” programmers are generally much more productive than mediocre ones. This can be seen in our results: hero programmers are much more productive and less likely to introduce bugs into the codebase. Heroes are born when developers become so skilled at what they do that they assume a central position in a project. In our view, organizations need to acknowledge their dependence on such heroes, perhaps altering their human resource policies to manage these people more efficiently and retain them.


 CONCLUSION 


The established wisdom in the literature is to deprecate “heroes”, i.e., the small percentage of the staff who are responsible for most of the progress on a project. But, based on a study of 1100+ open-source GitHub projects, we assert:


  • Overwhelmingly, most projects are hero projects. This result holds true for small, medium, and large projects. 
  • Hero developers are far less likely to introduce bugs into the codebase than their non-hero counterparts. Thus, having heroes in projects significantly affects code quality. 

Our empirical results call for a revision of a long-held truism in software engineering: software heroes are far more common and valuable than suggested by the literature, particularly from a code quality perspective. Organizations should reflect on better ways to find and retain more of these software heroes.

