Why Do 90% of Data Scientists Hate Their Own Code After 6 Months?
The documentation habit that separates professionals from amateurs, and that your future self will thank you for
Eight months after rolling out a critical ML model, I received the fateful Teams message: "Hey, can you tell me how features were created in the estimation pipeline?" I opened my own code. Stared at it for half an hour. And realized something terrifying.
I didn't know what past-me was thinking either. The logic was there. The code worked. But the why of every choice? Gone. Poof, right out of my memory.
I spent the next two days reverse-engineering my own thought process. That was when documentation stopped being a chore and became a survival skill.
Why Your Brain Can’t Save You
Hate to break it to you, but your memory is a dirty liar.
Have you ever been sure of a fact only to discover it's wrong? Ask a data scientist about a transformation flow or model they built months ago, and they'll say, "Oh yeah, I remember how this works."
But when they revisit it, chances are they won't. It's human. We can only recall so much, especially with all the information we're bombarded with every day.
Cognitive science tells us that each time we remember something, the memory is imperfectly re-formed: we add new information in the act of remembering, subtly changing it with every recall. Within weeks, the reasoning behind your model architecture becomes fuzzy. Within months, it's vanished from your memory.
Documentation isn’t about proving that you did good work. It’s about creating a conversation between present-you and future-you. And trust me, future-you will have questions that present-you forgot to ask.
And this habit doesn't just save you. It saves everyone who touches your work after you move on or get promoted.
So, What Actually Deserves Documentation?
Not everything needs documenting. Too much documentation is as useless as too little. It becomes noise nobody reads. The key is capturing the right things at the right depth.
1. Code and Algorithms
Writing comments that say “#Loop through data” is like writing a recipe that says “cook the food.” We can see you’re looping. What we need to know is why.
Functionality First
Each function or script needs a clear narrative. What problem does it solve? Not in abstract terms, but in a concrete direct way. If you can’t explain it to a colleague in one sentence, you probably don’t understand it well enough yet.
Document the inputs it expects and the outputs it produces. Be specific about data types and formats. If your function expects a pandas DataFrame with specific columns, say so. If it writes files to disk or modifies global state, mention that too.
The Edge Cases Nobody Remembers
Six months from now, you won’t remember that edge case you spent two hours debugging. The one where empty strings break everything, or where negative values cause silent failures. Document it. Future you (or your teammate) will save hours by reading these.
Dependencies:
List every library, package, and external tool your code relies on. Include version numbers. I know, it feels tedious. But “it worked on my machine” has destroyed more projects than bad algorithms ever did. A simple requirements.txt or environment file isn’t documentation, it’s insurance.
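For instance, a pinned requirements file might look something like this (the packages and versions below are purely illustrative; pin whatever your project actually uses):

# requirements.txt - pinned so "it worked on my machine" becomes "it works on every machine"
pandas==2.1.4
scikit-learn==1.3.2
xgboost==2.0.3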
And here's a docstring example that works:
def calculate_risk_score(transaction_data, user_history):
    """
    Combines transaction patterns with user behavior to flag potential fraud.

    We weight recent transactions 3x more heavily because fraudsters typically
    act quickly after account compromise. User history older than 90 days is
    excluded - testing showed it added noise without improving detection.

    Args:
        transaction_data: Last 30 days of activity (pandas DataFrame)
        user_history: User's behavioral baseline (dict with 'avg_transaction', 'location_patterns')

    Returns:
        Risk score between 0-100, where >75 triggers manual review

    Edge cases:
        - New users get baseline score of 50 (neutral)
        - Missing location data defaults to last known good location
    """

See the difference? Now future-you understands not just what but why and how.
2. Data
In my twelve years in the data science industry, I've seen poor data documentation cause more production failures than poor code.
Sources and Provenance:
Where does your data come from? Document not just the database name, but also which tables you pulled from and what credentials or access patterns you used. If you're using APIs, note the endpoints, the authentication method, and any rate limits. This is how you prevent that 2 AM panic when something breaks and nobody has a clue where the data lives anymore.
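Even a small, runnable provenance block next to your loading code beats a wiki page nobody updates. A minimal sketch, with made-up table names, endpoints, and limits:

# Provenance of every dataset the pipeline touches (all names here are illustrative).
DATA_SOURCES = {
    "transactions": {
        "where": "warehouse table analytics.raw_transactions",
        "access": "read-only service account svc_fraud_ro",
        "refreshed": "daily at 03:00 UTC",
    },
    "user_profiles": {
        "where": "REST API https://api.example.com/v2/users",
        "access": "OAuth2 client credentials",
        "rate_limit": "100 requests per minute",
    },
}
# Keep this next to the loaders so nobody has to hunt for it at 2 AM.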
Preprocessing:
Data in its raw form is messy. We scrub it, convert it, filter it. Without documentation, someone who picks up the cleaned data might assume it's the raw dataset, skip your preprocessing entirely, and watch everything break.
Keep a record of every change. Not only what you did, but why you did it. "Outliers beyond 3 standard deviations were removed because initial EDA indicated data entry errors in the upper tail." That one sentence just saved someone a week of confusion.
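In code, that can be as simple as a short, well-commented cleaning step. A sketch, assuming the 3-standard-deviation example above (not a universal rule):

import pandas as pd

def remove_outliers(df: pd.DataFrame, column: str, n_std: float = 3.0) -> pd.DataFrame:
    """Drop rows more than n_std standard deviations from the column mean.

    Why: initial EDA showed data entry errors in the upper tail; values beyond
    3 standard deviations were almost always typos, not real behavior.
    """
    mean, std = df[column].mean(), df[column].std()
    return df[(df[column] - mean).abs() <= n_std * std]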
Schemas and Structure:
List your columns. Describe what they represent. Note their data types. If “customer_id” is actually a string that looks like a number, say so. If “date” is formatted as YYYY-MM-DD in one table but MM/DD/YYYY in another, warn people.
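A lightweight way to capture this is a column dictionary that lives in the repo next to the code. The columns below are illustrative:

# Column dictionary for the transactions table (columns are illustrative).
COLUMNS = {
    "customer_id": "string; looks numeric, but leading zeros matter - never cast to int",
    "amount": "float, transaction value in USD",
    "created_at": "string, YYYY-MM-DD in the warehouse but MM/DD/YYYY in the legacy export",
    "is_fraud": "int, 0/1 label assigned by the manual review team",
}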
Transformation Logic:
When you engineer features, you're encoding assumptions about how the world works. Maybe you created transaction_velocity by dividing spend by time. Document why you used that formula. Why did you log-transform a feature? Why did you cap another at the 99th percentile?
These decisions embed your worldview into the model. Make that worldview visible.
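For example, a feature-engineering step where every non-obvious choice carries its why (the formulas and cutoffs here are illustrative, not recommendations):

import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Fraudsters burst-spend, so the rate of spending matters more than the raw total.
    out["transaction_velocity"] = out["total_spend"] / out["active_days"].clip(lower=1)

    # Spend is heavily right-skewed; log1p tames the tail and keeps zeros valid.
    out["log_spend"] = np.log1p(out["total_spend"])

    # Cap velocity at the 99th percentile: a handful of corporate accounts
    # dominated the range and drowned out the signal from normal users.
    cap = out["transaction_velocity"].quantile(0.99)
    out["transaction_velocity"] = out["transaction_velocity"].clip(upper=cap)

    return out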
3. Models
Model architecture documents often feel like IKEA assembly instructions: technically complete, practically useless.
Instead, document your model like you’re explaining it to yourself six months from now, when the metrics have drifted and someone’s asking “why did we build it this way?”
Capture the architecture decisions:
What’s the actual structure? Layers, attention heads, embedding dimensions
Why this activation function and not another?
How did you choose your loss function? (MSE vs MAE vs Huber vs custom)
Don't just list learning_rate=0.001. Explain that you started at 0.01, the model diverged, so you dropped to 0.001 and added a scheduler that lowers it further after a plateau.
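One habit that helps is annotating the training config itself, so the reasoning travels with the numbers. A sketch with illustrative values:

TRAINING_CONFIG = {
    # Started at 0.01, but the loss diverged after a few epochs; 0.001 trains stably.
    "learning_rate": 0.001,
    # Drop the learning rate further once validation loss plateaus.
    "lr_scheduler": {"type": "reduce_on_plateau", "factor": 0.5, "patience": 3},
    # Huber instead of MSE: the target has heavy-tailed outliers we don't want to chase.
    "loss": "huber",
    # Largest batch that fits on the training GPU without gradient accumulation.
    "batch_size": 256,
}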
Training procedures:
How you split the data matters enormously. Record why you chose a particular split: a time-based split for time-series data, a stratified split for imbalanced classes.
If you used early stopping, record the criterion and threshold you chose and the rationale behind them.
These decisions shape what your model learns and how it learns it.
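For instance, a time-based split with the reasoning written right where the split happens (the column name and cutoff date are illustrative):

import pandas as pd

def time_based_split(df: pd.DataFrame, cutoff: str = "2024-01-01"):
    """Train on everything before the cutoff, validate on everything after.

    Why: a random split would let the model peek at the future and inflate
    validation scores; in production it only ever sees past data.
    """
    train = df[df["created_at"] < cutoff]
    valid = df[df["created_at"] >= cutoff]
    return train, valid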
Evaluation metrics:
Say you chose the F1 score over accuracy as your key evaluation metric. Document why: were your classes imbalanced, or do false negatives cost more than false positives?
Document the business logic behind your metrics, too. When people ask why the model is "only 85%," you need to be able to explain that an 85% F1 in fraud detection with a 1% base rate is extremely good.
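A tiny synthetic illustration of why the metric choice matters at a 1% base rate:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1% fraud. A model that predicts "not fraud" for everyone looks great on
# accuracy and useless on F1 - which is exactly why F1 is the metric here.
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0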
4. Decisions and Assumptions
Every ML project involves hundreds of judgment calls. You chose logistic regression over decision trees. You decided to exclude outliers above the 99th percentile. You assumed that missing values meant "no" rather than "unknown." Every one of those calls deserves documentation.
Document your reasoning:
What alternatives did you consider?
What trade-offs did you make? (Speed vs accuracy, interpretability vs performance)
What assumptions are you making about the world?
1. Assumptions
Assumptions are dangerous because they're invisible. Every project rests on them: about the data, about the problem, about the deployment environment. Write them down.
Example assumptions to document:
“User location data is accurate within 50 miles” (what happens if GPS is spoofed?)
“Product prices don’t change mid-session” (what if they do during flash sales?)
“Training data represents future data” (what about black swan events?)
2. Known Limitations
Document what your model can’t do. Where it might fail. What edge cases it doesn’t handle.
Example limitations to document:
“Model performs poorly on transactions below $5 due to limited training examples in that range, recommend manual review for small transactions.”
“This fraud detector works well for credit card transactions but hasn’t been tested on wire transfers.”
Future-you will thank past-you for being clear about where the dragons are.
5. Processes and Workflows
A brilliant model that nobody can deploy is just expensive research. Only when you document how to deploy the model and run the workflows can other team members do it without you.
Deployment Steps
Document how your code gets from development to production. What infrastructure does it need? AWS, GCP, on-premises servers? What environment variables must be set? What secrets or credentials are required?
Create a deployment checklist. Make it so clear that someone who has never seen your project before can follow it successfully.
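One way to make that checklist partly self-enforcing is a startup check that fails loudly when configuration is missing. A sketch; the variable names are placeholders:

import os
import sys

REQUIRED_ENV_VARS = ["MODEL_BUCKET", "DB_CONNECTION_STRING", "FRAUD_API_KEY"]

missing = [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Deployment aborted - missing environment variables: {', '.join(missing)}")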
Monitoring and Maintenance
Tracking model performance after deployment is just as important. Document which metrics should trigger alerts. How do you detect data drift or model degradation?
Document the monitoring tools, the thresholds that matter, and the procedures for when things go wrong. Because they will go wrong, and you won't always be around to handle it.
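Even a crude drift check beats none, as long as the threshold and its rationale are written down. A minimal sketch; the 3-standard-error alert level is an assumption you'd tune for your own data:

import numpy as np

def mean_shift_alert(reference: np.ndarray, live: np.ndarray, n_std_err: float = 3.0) -> bool:
    """Alert when the live feature mean drifts too far from the training-time mean.

    Deliberately simple: this catches gross pipeline breakage (wrong units,
    dropped filters), not subtle distribution shifts.
    """
    ref_mean = reference.mean()
    std_err = reference.std(ddof=1) / np.sqrt(len(reference))
    return abs(live.mean() - ref_mean) > n_std_err * std_err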
Workflow Diagrams
Humans process visual information faster than text. A single flowchart showing how data moves from source to model to output can save hours of explanation. Your teammates, and future-you, will thank you for it.
6. Collaboration and Onboarding
Projects outlive their creators. Teams change. People move on. Your documentation is how knowledge survives those transitions.
Onboarding Guides:
Create a quick-start guide for new team members. What do they need to install? How do they configure the project environment? Where are the important files? What should they read first?
Make it possible for someone to be productive in hours, not weeks.
Team Standards
Document your coding standards, review processes, and best practices. When everyone follows the same patterns, everyone moves faster.
These standards shouldn't feel like rules. Treat them as agreements that let everyone collaborate without constantly negotiating style.
The Documentation Habit That Actually Works
Here’s what I learned after years of failed documentation attempts:
1. Document continuously, not retroactively.
Don’t wait until the project ends. Document as you go, when decisions are fresh and reasoning is clear. Five minutes now saves five hours later.
2. Write for confused-you, not smart-you.
The person reading your docs isn't at their best. They're debugging at 3 PM on a Friday before a long weekend. Write for that person.
3. Make it searchable and skimmable.
Use headers and examples. Add formatting. Nobody reads documentation linearly; they search for the one thing they need to fix the problem in front of them.
4. Update when you break your own assumptions.
When you discover your documentation was wrong or you break your own assumptions, update it immediately.
Final Thoughts
Time spent documenting feels like time stolen from real work. Documentation is real work. It’s the work that makes all future work possible. It’s the work that lets your team move fast without breaking things. It’s the work that makes your code outlive your employment.
Your future self will thank you. Your teammates will thank you. And the person who inherits your project after you leave will think you’re a genius, not for the code you wrote, but for the map you left behind.
That’s the habit worth building.

