For most of my career as a data scientist (and also throughout my PhD years), I’ve set my own work goals day to day. At the end of the previous day, I would write a note to my future self to say “this is where you are, and this is what you should do next.” Then in the morning I’d re-assess my goals for that day. Sometimes I would agree with yesterday-self; often, after subconsciously mulling it over overnight, I’d arrive at a better approach. After that I would execute. I considered this ability to make my own decisions about direction to be the crux of what a PhD education was all about.
For the last year, however, I’ve been a part of something totally different: a data scientist in an agile development team. I had heard about this working style before, often practiced by teams made up mostly of software engineers who put code into production while I did the data science research. I was always of the opinion that the agile software development model, in which you plan and estimate work in 2-3 week intervals (called “sprints”), isn’t a good fit for data science (or any research in general) because in research it’s impossible to predict the complexity and effort required. Any researcher will tell you that in most cases, what started out as a “simple” experiment taking 5 steps took closer to 50 steps after accounting for the plethora of mini-roadblocks along the way.
Nevertheless, I found a way to make it work. Here are the key lessons I learned:
1. Write definitions of done that are more generic than ones for software engineering tasks
In agile development, the ticket author writes a definition of done. This describes what the end result of the task looks like. For data science, the research output is not yet known, so the best definition of done that can be provided is that the research itself was done (and perhaps documented). Examples include:
- One or more Jupyter notebooks describing basic statistics of the newly available text data
- Continued development of an alternative model that might improve performance metrics over our current model
- A wiki page describing the opportunity (or lack thereof) of developing a model to solve a new initiative that management thinks looks promising based on their experience
- Document changes in metrics of performance after attempting to tune parameters in the existing model
Notice how these definitions do not describe how the work will be done, nor do they bind us to delivering explicit results before the ticket is considered done. The flexible nature of the first bullet, for example, can mean 1 notebook with a few stats or 3 notebooks complete with fancy charts. The basic idea is that some research was done.
2. Assign a medium amount of points and use that to time box the research
Agile teams use tools like Jira to create “tickets” that are assigned “points,” which are estimates of complexity. Philosophically, points are not supposed to correlate directly to hours spent, but for data scientists we can use them as a proxy to time box research. Going back to the example above, once new data is made available we can spend an infinite amount of time researching its characteristics. Once a chart or stats are produced, we’re naturally inclined (in fact, trained) to wonder “what would the same charts look like if we sliced or filtered the data in a slightly different way?” or “this chart doesn’t look right, I need to make a new one with a better normalization scheme.” At some point, we need to time box our research, and points are an excellent way to do this. For example, if I intend to spend 2 days on a bit of research I’ll assign 3 points to the ticket. For 4 days I’ll assign 5 points. Notice that since there’s no direct correlation between time and points, the amount of time I spend is flexible.
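If it helps to make the proxy concrete, the days-to-points rule above could be sketched as a small helper. This is purely illustrative: only the 2-day → 3-point and 4-day → 5-point anchors come from my own practice; the Fibonacci-style scale and the ~0.8 days-per-point ratio are assumptions you’d tune to your own team.

```python
# Hypothetical helper mapping intended research days to story points.
# The 2-day -> 3-point and 4-day -> 5-point anchors come from the text;
# the rest of the Fibonacci-style scale is an assumed extrapolation.
POINT_SCALE = [1, 2, 3, 5, 8, 13]

def points_for_days(days: float) -> int:
    """Pick the smallest point value whose rough day-equivalent covers the estimate."""
    # Assume roughly 0.8 days of focused work per point,
    # so 2 days -> 3 points and 4 days -> 5 points.
    for pts in POINT_SCALE:
        if days <= pts * 0.8:
            return pts
    return POINT_SCALE[-1]

print(points_for_days(2))  # -> 3
print(points_for_days(4))  # -> 5
```

The loose mapping is the point: it gives a reviewer a shared sense of scope without pretending the estimate is a deadline.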
3. Use ticket reviews to your advantage
One of my favorite parts of the agile framework is the idea of ticket reviews. Your colleague will review the ticket by comparing the work done with the definition of done (described above). This works well for software engineers to ensure code meets coding standards and doesn’t break other code. It also works well for data scientists. Having to explain your work to somebody else before it’s considered done often makes you notice flaws (even if your reviewer doesn’t say a word) or sparks new ideas to consider for the next sprint. It also allows a fellow data scientist who might be working on another aspect of the project to chime in on how you two can share analysis and build off each other.
It doesn’t have to be a data scientist who reviews your tickets. Sometimes it’s a software engineer or occasionally the business owner who reviews your ticket. In these cases, you get to exercise your important data science communication skills.
Lastly, you’ll be reviewing others’ work as well! Instead of just hunkering down and announcing fabulous results every few months, the peer review process forces you to learn and add to other parts of the codebase. You might be reviewing a small tweak in some software engineering if/then statements and be able to comment on how next year your model will be taking over this set of business rules. Or you might review the data engineer’s latest efforts to optimize the SQL tables for faster reads and realize her optimizations and your needs for faster research queries aren’t completely in line. All these actions make for a well-integrated team.
The Bottom Line
The world changes, technology changes, and the way we work changes. After a year in this agile style, I’ve come to appreciate it and see it as a good option for data science going forward. Companies are beginning to realize that months-long research in data science, followed by throw-it-over-the-wall implementation by the software organization, isn’t always the way to go, and that integrated, agile teams delivering value in small increments are easier to stomach for today’s impatient executives anxious to prove to their bosses that they’re getting a return on their data science investments sooner. As much of a misfit as you might initially judge it to be, don’t close the door on doing data science in an agile team too soon.