90% of the work as a data scientist is cleaning the data
I've heard this - or similar statements - many times, and last weekend I fought with the perfect example, which I'd like to share.
Back in May (Oh God, it's already so long ago??) I started a fun project analysing exercise data I tracked with my Garmin fitness tracker. A few weeks in, I realised that, if I wanted to continue doing this every week until the middle of September, having a .csv file for each week to analyse could be... annoying. Incidentally, I also took a course in SQL database management, which inspired me to set up a little database for my exercise data.
It took me ages, being busy with other things, to even get started on that, but I've had the last two weeks off and finally got started. Last weekend, I wanted to try to upload a first batch of data. Said data was downloaded directly from the Garmin Connect website as .csv, over the format of which I have no control (even getting out a specific time range is difficult/impossible).
Of one problem I already knew from my previous analyses: the duration of running workouts is measured in minutes, seconds, and milliseconds, but not in MM:SS.mmm, but in MM:SS:mm. All other data is displayed either as MM:SS or HH:MM:SS, if applicable. I hope you see that this is awkward and terribly annoying. My previous solution was changing this manually each week, but I'm so far behind now, and it's uncomfortable anyway, that I switched to doing it in my Python script. The simple, first thought solution was to split the time string at the colons, add a "00:" at the beginning, and remove the millisecond value. This will of course stop working if I ever go for a run that's longer than an hour, but I don't plan to do that any time soon. I hope to find someone who does that with Garmin, though, to see what the format would look like then, and change my script accordingly.
The second problem I only discovered last weekend, and I'm honestly a bit shocked that anyone would do that.
The table of course contains workout speed data, called "avg pace" and "max pace". Online, when viewing the table, the unit of this value is given in something like small print, for each workout. That's because Garmin is using very different values for different types of exercises. Running and walking are both provided in min/km, so I get values in MM:SS format here again. Cycling and uncategorised exercises, on the other hand, are listed with kph values, meaning they're simple decimals.
Now, when I set up the database I forgot to check on this and decided that the "avg pace" and "max pace" columns should contain numeric values, and of course that blew up in my face. As far as I can see, I have three choices now:
Stick to this - what my gut calls - horrible concept of having values of different units in the same column, and just upload everything as strings for later parsing.
Create extra columns - to have two each for pace (in min/km) and speed (in kph), and only use the ones that work for that specific exercise. This means not only extra columns (two of which will always be NA while the other two are filled) but also extra work, making sure that everything is sorted correctly.
Convert either of the two formats into the other to have consistent data. This could then be converted back to the original format for analysis, if needed.
Oh, and there's also a fourth option, which is "Oh, f*** this, I didn't need speed/pace in my analysis so far, I'll just ignore that." I think I'll go with option three, though. ;-)
You would think that a widely used product that tracks workouts would provide you with suitable data to download, but I hope I made you see that this is not necessarily the case. Personally, I believe that saving data with different units (and formats!) in the same column goes against all common sense, but even here I was apparently wrong.