Hofstadter’s Law: It always takes longer than you expect, even when you take into account Hofstadter’s Law.
— Douglas Hofstadter, Gödel, Escher, Bach
Writing prose and writing code have a lot in common, but perhaps the biggest similarity is that neither writers nor programmers can get things done on time. Writers are notorious for missing deadlines. Programmers are notorious for being wildly off with estimates. The question is, why?
Today, I had an idea for how to answer this question. What I found was eye opening.
Looking into my books
I wrote both of my books, Hello, Startup and Terraform: Up & Running, using Atlas, which manages all the content in Git. That means that every line of text, every edit, and every change was captured in the Git commit log.
So what does it really take to write two books?
Let’s start with my first book, Hello, Startup, which is 602 pages long and contains roughly 190,000 words. I ran cloc in the Hello, Startup Git repo and got the following output (truncated for readability):
So the 602 pages comes from 26,571 lines of text. The vast majority of those lines are in AsciiDoc, which is the Markdown-like language used in Atlas to write almost all the content. The rest consists of HTML and CSS, which are used in Atlas to define the layout and structure of the book, plus a whole bunch of other programming languages (Java, Ruby, Python, etc.) which are used in the many code examples throughout the book.
But the 602 pages and 26,571 lines we see are just the final result. They don’t capture the roughly 10 months of writing, rewriting, editing, proofreading, copyediting, researches, notes, and other work it took to get there. To get more insight, I used git-quick-stats to analyze the entire commit log for the book:
So, I added 163,756 lines and deleted 131,425 lines, for a total of 295,181 lines of code churn. That is, I wrote and deleted 295,181 lines of to produce a final output of 26,571 lines. That’s a ratio of more than 10:1! For every 1 line that got published, I actually had to write 10!
I admit that lines added and removed in Git are not a perfect measure of the editing process, but if anything, this data understates the work involved, as much of the editing process isn’t reflected in the Git commit log at all. For example, I wrote the first few chapters in Google Docs before switching to Atlas and I made many rounds of edits on my computer without commits in between.
The data is far from perfect, but I suspect the order of magnitude is correct: a 10:1 ratio of raw text to published text.
Terraform: Up & Running
Let’s see if the numbers are similar for my second book,
Terraform: Up & Running, which
is 206 pages and contains roughly 52,000 words. Here’s the (truncated) output from
Those 206 pages come from 8,410 lines of text. Again, the majority of the text is AsciiDoc, though you can see this book has even more code examples, primarily in HCL, which is the underlying language used in Terraform. There’s also lots of Markdown, which I used to document those HCL examples.
git-quick-stats to check the edit history for this book:
Over roughly 5 months, I added 32,209 lines and deleted 22,402 lines, for a total of 54,611 lines of code churn. In this case, the editing process is even more understated, as Terraform: Up & Running started as a blog post series, which went through considerable churn before I moved it over to Atlas and Git. The blog post series is about half the length of the book, so it seems reasonable to increase the total churn by 50%. That gives us 54,611 * 1.5 = 81,916 lines of churn to produce 8,410 lines in the final result.
Again, we see a ratio of roughly 10:1! No wonder writers miss deadlines. We’re being held to a schedule for a 250 page book, but to write such a book, we actually have to write 2,500 pages.
What about programming?
So how does writing a book compare to writing code? I decided to take a look at a few open source Git repos of various levels of maturity, from a few months old, all the way up to 23 years old.
terraform-aws-couchbase is a set of modules open sourced
in 2018 to deploy and manage Couchbase on AWS. Here’s the (truncated)
And here are the
That’s 37,693 lines of code churn to produce 7,481 final lines of code, or a 5:1 ratio. Even on a repo that’s less than 5 months old, we’ve already rewritten every line 5 times! No wonder software estimation is hard. We come up with a time estimate for ~7,500 lines of code, not realizing that we’ll have to write ~35,000 lines of code to get there!
Let’s see what happens when we look at a slightly older codebase.
Terratest is an open source library that was created in 2016 for testing
infrastructure code. Here’s the (truncated)
And here are the
That’s 49,126 lines of code churn to produce 6,140 final lines of code, or an 8:1 ratio for this ~2 year old repo. But Terratest is still fairly young, so let’s go back in time a bit more.
Terraform is an open source library released in 2014 for managing
infrastructure as code. Here’s the (truncated)
And here are the
That’s 12,945,966 lines of code churn to produce 1,371,718 final lines of code, or an 9:1 ratio. Terraform is about 4 years old now, but it hasn’t hit a 1.0 release yet, so even with a 9:1 ratio, it’s not a fully “mature” codebase. Let’s go back a few more years.
in 2010. Here’s the (truncated)
And here are the
That’s 224,211 lines of code churn to produce 15,325 final lines of code, or a 14:1 ratio. At the time of writing, Express is ~8 years old, on version 4.x, and the most popular and battle-tested web framework for Node.js. It seems that once we’re north of a 10:1 ratio, we can confidently say that a codebase is “mature.” Let’s see what happens if we go back even further in time.
And here are the
That’s 730,146 lines of code churn to produce 47,559 final lines of code, or a 15:1 ratio for this ~12 year old repo. Let’s go back another 10 years and see what we find.
MySQL is a popular open source relational database that came out in 1995.
Here’s the (truncated)
And here are the
That’s 58,562,999 lines of code churn to produce 3,662,869 final lines of code, or a 16:1 ratio for this ~23 year old repo. Wow! Roughly speaking, every single line of MySQL has been rewritten 16 times.
Here’s a summary of the data we’ve seen for books:
|Terraform: Up & Running||81,916||8,410||10:1|
And here’s a summary of what we’ve seen for programming:
So what do all these numbers mean?
The 10:1 rule of writing and programming
Give that my data set is limited, I can only draw a few preliminary conclusions:
The ratio of “raw materials” to “finished product” in a book is roughly 10:1. Keep this in mind the next time an editor asks you for a timeline! If you want to write a 300 page book, you’ll probably have to write around 3,000 pages.
Similarly, the ratio of “code churn” to “lines of code” in mature and non-trivial software is also at least 10:1. Keep this in mind the next time a manager or customer asks you for a time estimate! To build a 10,000 line app, expect to write roughly 100,000 lines.
These can be summarized as what I shall dub the 10:1 rule of writing and programming:
Good software and good writing requires that every line has been rewritten, on average, at least 10 times.
Of course, lines of code and lines of writing are an imperfect measure, but I think given enough data, we may be able to determine if the 10:1 rule actually holds up, and if it’s useful to help improve estimation.
Some questions I’d love to know the answer to:
- Could we use the ratio of code churn to lines of code as a quick measure of the maturity of a piece of software? For example, for critical pieces of infrastructure, such as databases, programming languages, and operating systems, should we only trust code that has at least a 10:1 ratio?
- Does the amount of churn depend on the type of software? For example, Bill Scott found that at Netflix, only about 10% of the UI code lasted more than a year, and the other 90% had to be thrown away. What are the rates of churn in backend code, databases, CLI tools, and so on?
- What percentage of code churn comes after the initial release? That is, what percentage of work can be classified as “software maintenance?”
If you have written a book and can do a similar analysis, I’d love to hear what numbers you found! And if anyone has time to automate this analysis, I’d love to see what the ratios are across a variety of open source projects. Share your thoughts in the comments!
Two quick notes from those discussions:
First, it looks like similar 10:1 rules show up in film, journalism, music, and photography! How cool is that?
Second, a common response is that even a single character change may show up in Git as an “inserted line” or “deleted line”, so when you see 100,000 lines were changed, it doesn’t mean that all the text in those lines was rewritten. This is true, but as I wrote above, there are also many types of changes missing from the data:
- I don’t do a commit for every single line that I change. In fact, I may change a line 10 times, and commit only once.
- This is actually even more pronounced for code. While doing a code-test cycle, I may change a few lines of code 50 times over, but only do one commit.
- For my books, a lot of edit rounds and writing happened outside of Git (e.g., I wrote some of the chapters in Google Docs or Medium and O’Reilly does copyediting in a PDF).
My guess is that these two factors roughly cancel out. It won’t be exact, of course, and the actual ratio may be 8:1 or 12:1, but the order of magnitude is probably correct, and 10:1 is easier to remember.
GitHub user Decagon created a repo called hofs-churn that contains a Bash script to easily calculate code churn for your Git repos. He also ran it across a variety of open source repos such as React.js, Vue, Angular, RxJava, and many others, and the results are pretty interesting!