
Computing average life expectancy of a line of code

Posted at 2020-08-03

Motivation

Some time ago I asked one of my colleagues what his main motivation for becoming a programmer was. He replied, in essence, that he wanted his code to be part of software that stays useful for many years.

Since my colleague worked as a Web Developer at that time, I was pretty skeptical of this answer. In my experience, the front-end part of an average project tends to be rewritten every three months or so, so it is not a very good place to look for "stable" and "unchanging" code.

Later, however, I became more curious and decided to check how long, on average, a line of code lives in our company's repository. An additional benefit was that I got a great excuse to play around with the GitPython package during work time!

If you have not heard the name before, GitPython provides a Python interface to git. Since we will use it heavily below, basic familiarity with it is assumed. I will also make heavy use of pandas.

Finally, one last word before we embark on the journey. While working on this, I set myself the explicit goal of NOT using anything except GitPython and pandas (remember, one of my goals was to learn the former). However, if you are looking for something more user-friendly, there are other packages built on top of GitPython that provide a richer and friendlier interface. In particular, PyDriller popped up during my searches, and there are probably plenty of others.

Main work

OK, here we go. First, we initialize a Repo object, which represents our repository. Make sure you have cloned the repository and checked out the newest version. Replace PATH_TO_REPO below with the path to your git repository on disk.

from git import Repo

PATH_TO_REPO = "/Users/nailbiter/Documents/datawise/dtws-rdemo"
repo = Repo(PATH_TO_REPO)
assert not repo.bare

Next, we pick the branch we want to investigate (see the variable BRANCH below) and walk back through its history, starting from its tip. In your case the branch will probably be master, but our main branch is called development for some reason.

import pandas as pd

BRANCH = "development"
head = repo.commit(f"origin/{BRANCH}")  # tip commit of the remote branch

def head_to_record(head):
    """Summarize a commit as a dict (one row of the table below)."""
    return {"sha":head.hexsha[:7],
            "parent(s)":[sha.hexsha[:7] for sha in head.parents],
            "name":head.message.strip(),
            "commit_obj":head
           }

# Walk back from the tip, always following the first parent,
# until we reach the root commit (which has no parents).
records = []
while head.parents:
    records.append(head_to_record(head))
    head = head.parents[0]
records.append(head_to_record(head))

pd.DataFrame(records)

	sha	parent(s)	name	commit_obj
0	31ad850	[c77bfb0]	docs	31ad850b08014bbf299e534e28cdfee32be90654
1	c77bfb0	[d4935dc]	stash	c77bfb02b6aa0992be7d51ddc09c295a9b25d4d1
2	d4935dc	[f294f04]	rename owner	d4935dc2157c6f968db8bae7d68868955c06f6ea
3	f294f04	[c51257b]	stash	f294f049161ac9c8c2215f33b8d0bc25f49f88b3
4	c51257b	[b684146]	stash	c51257b1b89ea10bd213ed5ba575033fd0514e89
...	...	...	...	...
298	e636c9f	[0f9ad8d]	[Task] Ran Prettier	e636c9ff7f9125064c2f3d367680107baebae250
299	0f9ad8d	[ec1c72f]	[Task] Setup Prettier	0f9ad8d85d18010098228cabae36e674b4402686
300	ec1c72f	[e4da3e5]	[Feat.] Can Set Initial State	ec1c72fe382a5b9dfa42a8810a740f96eb72c05c
301	e4da3e5	[b17eb26]	Initial DTWS commit	e4da3e5ae67322feae1b93e2180b219054339182
302	b17eb26	[]	Initial commit from Create React App	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82

303 rows × 4 columns

Now the variable records in the code above holds all the commits on the development branch, from the current tip back to the very first commit (following the first parent at each step).
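One caveat about the loop above: because it always follows head.parents[0], it visits only the first-parent chain, so commits that came in through the second parent of a merge would not end up in records. A minimal sketch on a toy history (the parents dictionary and the commit names are made up for illustration; no GitPython is involved):

```python
# Toy commit graph: each commit maps to the list of its parents.
# "M" is a merge commit whose second parent "F" comes from a feature branch.
parents = {
    "A": [],
    "B": ["A"],
    "F": ["B"],
    "C": ["B"],
    "M": ["C", "F"],
}

def first_parent_chain(tip, parents):
    """Walk back from `tip`, always following the first parent."""
    chain = [tip]
    while parents[chain[-1]]:
        chain.append(parents[chain[-1]][0])
    return chain

chain = first_parent_chain("M", parents)
print(chain)  # "F" is never visited
```

GitPython's Repo.iter_commits would list every reachable commit instead, but the diffs between adjacent commits used below assume a linear chain, so the first-parent walk is a reasonable fit here.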

Next, we need to walk through these commits and collect information about every line that appeared or disappeared in each of them. This information will later help us determine the lifetime of every line that ever appeared in our repository.

To do so, we create the variable res, which is a dictionary. Its keys are tuples of the form (<line_content>,<commit>,<filename>), where <commit> is the commit that git blame attributes the line to, and its values are sets containing the hashes of all the commits in which this line appeared. It is a rather big structure, and computing it takes some time.

Therefore, be prepared for the code below to take a while (around 100 seconds on my reasonably new MacBook Pro, for a repository with only ~300 commits).

I guess there should be a much more efficient and elegant way to collect this data, but I have not come up with one yet. Suggestions are welcome.

import pandas as pd
from tqdm import tqdm

def collect_filestates(end,start=None):
    """this procedure collects names of all files which changed from commit `start`
    till commit `end` (these assumed to be adjacent)"""
    if start is not None:
        diffs = start.diff(other=end)
        fns = [diff.b_path for diff in diffs]
        change_types = [diff.change_type for diff in diffs]
        res = [{"filename":t[0],"status":t[1]} for t in zip(fns,change_types)]
        return res
    else: 
        fns = end.stats.files.keys()
        return [{"filename":f,"status":"C"} for f in fns]
    
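def collect_lines(end,start=None):
    """collects all lines of every file that changed from `start` to `end`,
    as {(line, blamed commit sha, filename): sha of `end`}"""
    # deleted files no longer exist in `end`, so there is nothing to blame
    filestates = [r for r in collect_filestates(end,start) if r["status"] != "D"]
    res = {}
    for fs in filestates:
        fn = fs["filename"]
        # blame yields (commit, lines) pairs: `lines` are attributed to `commit`
        blame = repo.blame(end,file=fn)
        for k,v in blame:
            for vv in v:
                res[(vv,k.hexsha,fn)] = end.hexsha
    return res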

res = {}
for i in tqdm(range(len(records))):
    _res = collect_lines(end=records[i]["commit_obj"],start=None if (i+1)==len(records) else records[i+1]["commit_obj"])
    for k,v in _res.items():
        if k in res:
            res[k].add(v)
        else:
            res[k] = {v}
            
{k:v for k,v in list(res.items())[:5]}

100%|██████████| 303/303 [01:41<00:00,  2.99it/s]





{('*.swo',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386',
  'pipeline/.gitignore'): {'31ad850b08014bbf299e534e28cdfee32be90654',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386'},
 ('*.swp',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386',
  'pipeline/.gitignore'): {'31ad850b08014bbf299e534e28cdfee32be90654',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386'},
 ('.pulled_data.json',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386',
  'pipeline/.gitignore'): {'31ad850b08014bbf299e534e28cdfee32be90654',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386'},
 ('.config.custom.json',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386',
  'pipeline/.gitignore'): {'31ad850b08014bbf299e534e28cdfee32be90654',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386'},
 ('.stderr.txt',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386',
  'pipeline/.gitignore'): {'31ad850b08014bbf299e534e28cdfee32be90654',
  'a73ec421cfc05cc3816cb3b2b2505d228e60c386'}}
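The merging loop above builds a dictionary of sets by hand. As a sketch of the same idea, collections.defaultdict(set) removes the need for the "if k in res" branch. The per-commit dictionaries below are hand-made stand-ins for the output of collect_lines, with shortened hashes used purely for illustration:

```python
from collections import defaultdict

# Stand-ins for what collect_lines returns for two commits:
# {(line, blamed_commit, filename): commit_in_which_the_line_was_seen}
per_commit = [
    {("*.swo", "a73ec42", "pipeline/.gitignore"): "a73ec42"},
    {("*.swo", "a73ec42", "pipeline/.gitignore"): "31ad850",
     ("*.swp", "31ad850", "pipeline/.gitignore"): "31ad850"},
]

merged = defaultdict(set)
for lines in per_commit:
    for key, seen_in in lines.items():
        merged[key].add(seen_in)  # a missing key starts out as an empty set

print(dict(merged))
```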

Now that we have our marvelous res structure, we can easily compute the lifetime of every line that ever appeared in our repository: for every key in res, we simply compute the time between the oldest and the newest commit in its value set.

But again, this may take some time (around 6 minutes on my machine).

from datetime import datetime
import pandas as pd
from tqdm import tqdm

_records = []
for k in tqdm(res):
    dates = [datetime.fromtimestamp(repo.commit(sha).committed_date) for sha in res[k]]
    _records.append(dict(line=k[0],commit=k[1],file=k[2],lifetime=max(dates)-min(dates)))
lines_df = pd.DataFrame(_records)
lines_df

100%|██████████| 1806272/1806272 [05:52<00:00, 5123.11it/s]

	line	commit	file	lifetime
0	*.swo	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29
1	*.swp	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29
2	.pulled_data.json	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29
3	.config.custom.json	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29
4	.stderr.txt	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29
...	...	...	...	...
1806267	supports-color@^5.3.0:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00
1806268	through@^2.3.6:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00
1806269	typedarray@^0.0.6:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00
1806270	whatwg-fetch@3.0.0:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00
1806271	xtend@^4.0.0, xtend@~4.0.1:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00

1806272 rows × 4 columns
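The loop above calls repo.commit(sha) for every hash of every line, although there are only about 300 distinct commits, so caching the dates would probably cut the running time (I have not measured this). A sketch of the same max-minus-min computation with each commit date resolved only once; FAKE_COMMITTED_DATE and the hashes in it are hypothetical stand-ins for repo.commit(sha).committed_date:

```python
from datetime import datetime, timezone
from functools import lru_cache

# Hypothetical stand-in for repo.commit(sha).committed_date (Unix seconds).
FAKE_COMMITTED_DATE = {"aaa": 0, "bbb": 86_400, "ccc": 3 * 86_400}

@lru_cache(maxsize=None)
def commit_date(sha):
    """Resolve a commit's date once; repeated calls hit the cache."""
    return datetime.fromtimestamp(FAKE_COMMITTED_DATE[sha], tz=timezone.utc)

def lifetime(shas):
    """Time between the oldest and the newest commit in `shas`."""
    dates = [commit_date(sha) for sha in shas]
    return max(dates) - min(dates)

# Same shape as res: {(line, commit, filename): {hashes of commits}}
toy_res = {
    ("line 1", "aaa", "a.txt"): {"aaa", "bbb", "ccc"},
    ("line 2", "bbb", "a.txt"): {"bbb"},
}
lifetimes = {key: lifetime(shas) for key, shas in toy_res.items()}
print(lifetimes)
```

A line that was seen in a single commit only gets a lifetime of zero, which matches the "0 days 00:00:00" rows in the table above.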

The table lines_df we assembled above contains the following columns:

line -- the line's content
commit -- the first commit in which this line appeared
file -- the name of the file in which this line appears
lifetime -- the lifetime of the line

In the code below we add two more columns to this table:

author -- author of the line (to protect their privacy, I do not list real names, but rather one-letter nicknames)
ext -- file extension of filename

from os.path import splitext, isfile
import json

# optional mapping from real author names to nicknames
if not isfile("author_masks.json"):
    to_author = lambda s:s
else:
    with open("author_masks.json") as f:
        d = json.load(f)
        to_author = lambda s:d[s]

lines_df["author"] = [to_author(str(repo.commit(sha).author)) for sha in lines_df["commit"]]
lines_df["ext"] = [splitext(fn)[1] for fn in lines_df["file"]]
lines_df

	line	commit	file	lifetime	author	ext
0	*.swo	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29	L
1	*.swp	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29	L
2	.pulled_data.json	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29	L
3	.config.custom.json	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29	L
4	.stderr.txt	a73ec421cfc05cc3816cb3b2b2505d228e60c386	pipeline/.gitignore	144 days 19:07:29	L
...	...	...	...	...	...	...
1806267	supports-color@^5.3.0:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00	J	.lock
1806268	through@^2.3.6:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00	J	.lock
1806269	typedarray@^0.0.6:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00	J	.lock
1806270	whatwg-fetch@3.0.0:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00	J	.lock
1806271	xtend@^4.0.0, xtend@~4.0.1:	b17eb263f8cf0adc7eb8cda7e0a0ae4aee8aff82	yarn.lock	0 days 00:00:00	J	.lock

1806272 rows × 6 columns
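The empty ext values in the first rows are not a bug: os.path.splitext treats a leading dot as part of the file name rather than as the start of an extension. A quick check on file names that appear in this article:

```python
from os.path import splitext

paths = ["pipeline/.gitignore", "yarn.lock", "index.html"]
exts = {p: splitext(p)[1] for p in paths}
print(exts)  # the .gitignore entry maps to an empty string
```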

Analysis

Finally, with this information in hand, we can group the lines by various parameters and average their lifetimes.

For example, below we see the average lifetime of a line, grouped by file extension:

from datetime import timedelta
from functools import reduce

def averager(key, df=lines_df):
    ltk = "lifetime (days)"
    return pd.DataFrame([
        {key:ext,
         ltk:(reduce(lambda t1,t2:t1+t2.to_pytimedelta(),slc["lifetime"],timedelta())/len(slc)).days
        } 
        for ext,slc 
        in df.groupby(key)
    ]).set_index(key).sort_values(by=ltk,ascending=False)
averager("ext").plot.barh(figsize=(16,10),logx=True)

You can see that, ironically, the lines that stay unchanged the longest belong to "insignificant" files like .lock (the various yarn.lock files), .rules (Firebase rules), .html (index.html; since our project uses React, the main index.html receives almost no changes) and others. In particular, files with an empty extension are the .gitignore files.

And finally, we can see the average lifetime of a line of code, conditional on author.

averager("author").plot.barh(figsize=(16,10))

We can see that the colleague I mentioned at the beginning (he goes by the nickname "J" here) indeed authored the longest-surviving lines in the whole repository. Good for him.

However, let's look more closely at the secret of his success:

_df = lines_df[[ext=="J" for ext in lines_df["author"]]].loc[:,["line","file","lifetime","ext"]]
_df = averager(df=_df,key="ext")
_df[[x>0 for x in _df["lifetime (days)"]]]

	lifetime (days)
ext
.html	181
.lock	165
	154
.rules	91
.yml	54
.md	39
.ts	38
.js	26
.json	18
.snap	11
.css	5
.tsx	4

As the founder of the repository under consideration, he authored most of the aforementioned index.html, yarn.lock and *.rules files. As I explained above, these received almost no changes during subsequent development.

Further work

Since we store the filenames as well, we can also compute the averages by folder, and thus see which parts of the project are more "stable" than others.
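I have not done this for our repository, but a rough sketch of the idea could look as follows. It groups by the top-level folder only; the helper top_folder, the file pipeline/run.py and all the lifetimes are invented for illustration:

```python
from collections import defaultdict
from datetime import timedelta
from pathlib import PurePosixPath

def top_folder(filename):
    """First path component, or "." for files in the repository root."""
    parts = PurePosixPath(filename).parts
    return parts[0] if len(parts) > 1 else "."

# Invented (file, lifetime) pairs standing in for rows of lines_df.
rows = [
    ("pipeline/.gitignore", timedelta(days=144)),
    ("pipeline/run.py", timedelta(days=6)),
    ("yarn.lock", timedelta(days=165)),
]

by_folder = defaultdict(list)
for filename, lifetime in rows:
    by_folder[top_folder(filename)].append(lifetime)

# Mean lifetime per folder; .days drops the fractional part of a day.
folder_days = {
    folder: (sum(lifetimes, timedelta()) / len(lifetimes)).days
    for folder, lifetimes in by_folder.items()
}
print(folder_days)
```

With the real data, the same helper could fill a new column, e.g. lines_df["folder"] = [top_folder(fn) for fn in lines_df["file"]], after which averager("folder") should work just like the calls above.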
