Discover Top Posts Tagged with #cleaning data

Concerns with Dirty Data

With the increase in data, most of your in-house data professionals are turned into data janitors who spend hours cleaning data instead of analyzing it for strategy and business insights. Try to accept the fact that the volume of data is growing beyond what people can clean by hand. https://technofaq.org/posts/2018/12/preventive-measures-to-dodge-dirty-data/

#cleaning data

GDPR: a common sense approach.

Brad is cleaning our data to ensure we are GDPR compliant.

GDPR: A common-sense approach. In reality, under GDPR, explaining how and why you deleted data will probably be more important than identifying the data you hold.

A pub chain deleted all its customer data and deleted all its social feeds. Whether or not you know who they are is irrelevant. Either way, it is a simple but effective way of dealing with GDPR and not ending up with the data wooden spoon.

For many companies, deleting the customer data would be akin to committing business suicide. Other companies would do well to consider whether the time they spend on social platforms is money well spent…

So GDPR is the art of being single-minded on only keeping the data you need to keep your customers happy.

Asking your customer to opt-in will elicit the same response as asking turkeys to opt-in for Christmas. They might know or not know about Christmas, but they will wonder what they are letting themselves in for.

Avoid the ‘opt-in’ approach except as a last resort.

Your existing customers have effectively done a ‘soft opt-in’ as a result of the business relationship they have already voluntarily entered into: they understand you need some basic data to fulfil their order. You still need to reach out to them though and let them know about the new rules of the game: you only hold the data necessary to fulfil their order. You will delete the data when you do not need it anymore to honour the warranty, for example. They have not ‘opted in’ as such, they understand that, in order to fulfil their order, you need specific information. It would be difficult to explain why you need their age or why you have their card details on your system…

All the customer data associated with orders that are no longer current must be deleted. If you manage to get customers to subscribe to a newsletter, then their data automatically becomes clean: they have voluntarily engaged in a business relationship by asking you to provide a service.

So, are you going to be compliant and avoid the fine? These are two totally different questions. Are you going to be compliant? No. The rules are so complex that it will be easy for any ICO auditor to find non-compliance if they put their mind to it. Will you get fined? If you can demonstrate that you have taken GDPR seriously and that you have put in place processes to not only meet GDPR but also check whether these processes are effective (PIAs anyone?), then you are unlikely to be fined… unless you demonstrate a totally incompetent approach. If you say that all your customer data is held in two databases and an auditor finds boxes of data lying around, then a fine will likely go your way.

If you follow the spirit of the rules, make decisions when decisions have to be made, are able to explain the rationale behind the decision and ensure that an auditor cannot point at data you have either not plainly identified or obviously ‘not thought about’, then you are good to go.

From then on, carry out regular Privacy Impact Assessments (the famous PIAs), report clearly by RAG-rating them (red, amber, green), demonstrate progress over time and you will be able to sustain the GDPR drive and have room to focus on other things…

Looking for a 100% foolproof solution? Deploy the ISO 9001:2015 standard.

If you are looking for an original way to clean your data, Brad is your man

He will just argue that his world is more about app design...

Question or challenge? Just reach out: [email protected]

#GDPR #Common-sense #CPR Global Tech #Easy GDPR #Cleaning data

90% of the work as a data scientist is cleaning the data

I've heard this - or similar statements - many times, and last weekend I fought with the perfect example, which I'd like to share.

Back in May (Oh God, it's already so long ago??) I started a fun project analysing exercise data I tracked with my Garmin fitness tracker. A few weeks in, I realised that, if I wanted to continue doing this every week until the middle of September, having a .csv file for each week to analyse could be... annoying. Incidentally, I also took a course in SQL database management, which inspired me to set up a little database for my exercise data.

It took me ages, being busy with other things, to even get started on that, but I've had the last two weeks off and finally got started. Last weekend, I wanted to try to upload a first batch of data. Said data was downloaded directly from the Garmin Connect website as .csv, over the format of which I have no control (even getting out a specific time range is difficult/impossible).

One problem I already knew about from my previous analyses: the duration of running workouts is measured in minutes, seconds, and milliseconds, but it is written as MM:SS:mm rather than MM:SS.mmm. All other durations are displayed as either MM:SS or HH:MM:SS, as applicable. I hope you see that this is awkward and terribly annoying. My previous solution was to change this manually each week, but I'm so far behind now, and it's so tedious anyway, that I switched to doing it in my Python script. The simple, first-thought solution was to split the time string at the colons, add a "00:" at the beginning, and remove the millisecond value. This will of course stop working if I ever go for a run that's longer than an hour, but I don't plan to do that any time soon. I hope to find someone who does that with Garmin, though, to see what the format would look like then, and change my script accordingly.
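The split-and-rebuild fix described above could look roughly like this. It is only a sketch, not the script from the post: the function name is made up, and it assumes the input really is a run duration in MM:SS:mm form and shorter than an hour. An HH:MM:SS value also has three parts, so the function has to be applied to running rows only.

```python
def normalize_run_duration(raw: str) -> str:
    """Turn a run duration written as 'MM:SS:mm' into 'HH:MM:SS'.

    Assumes the run is shorter than one hour, as in the post.
    """
    parts = raw.strip().split(":")
    if len(parts) != 3:
        raise ValueError(f"expected MM:SS:mm, got {raw!r}")
    minutes, seconds, _fraction = parts  # the sub-second part is dropped
    return f"00:{int(minutes):02d}:{int(seconds):02d}"


print(normalize_run_duration("32:15:40"))  # prints 00:32:15
```

Raising an error on anything that does not have exactly three parts makes the script fail loudly if the export format ever changes.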

The second problem I only discovered last weekend, and I'm honestly a bit shocked that anyone would do that. The table of course contains workout speed data, called "avg pace" and "max pace". Online, when viewing the table, the unit of this value is given in something like small print, for each workout. That's because Garmin is using very different values for different types of exercises. Running and walking are both provided in min/km, so I get values in MM:SS format here again. Cycling and uncategorised exercises, on the other hand, are listed with kph values, meaning they're simple decimals. Now, when I set up the database I forgot to check on this and decided that the "avg pace" and "max pace" columns should contain numeric values, and of course that blew up in my face. As far as I can see, I have three choices now:

1. Stick to this - what my gut calls - horrible concept of having values of different units in the same column, and just upload everything as strings for later parsing.

2. Create extra columns - to have two each for pace (in min/km) and speed (in kph), and only use the ones that work for that specific exercise. This means not only extra columns (two of which will always be NA while the other two are filled) but also extra work, making sure that everything is sorted correctly.

3. Convert either of the two formats into the other to have consistent data. This could then be converted back to the original format for analysis, if needed.

Oh, and there's also a fourth option, which is "Oh, f*** this, I didn't need speed/pace in my analysis so far, I'll just ignore that." I think I'll go with option three, though. ;-)
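For the conversion option, here is a minimal sketch that turns every value into kph. The post does not say which direction the conversion will go, so converting pace to speed is an assumption, as are the function names and the activity labels "running" and "walking" (check what the export actually calls them).

```python
def pace_to_kph(pace: str) -> float:
    """Convert a pace written as 'MM:SS' (min/km) into a speed in km/h."""
    minutes, seconds = pace.strip().split(":")
    total_seconds = int(minutes) * 60 + int(seconds)
    if total_seconds == 0:
        raise ValueError("a pace of 0:00 has no defined speed")
    return 3600 / total_seconds  # seconds per km -> km per hour


def to_kph(value: str, activity_type: str) -> float:
    """Return the speed in km/h regardless of the unit used in the export.

    The activity labels below are assumptions, not taken from the export.
    """
    if activity_type.lower() in {"running", "walking"}:
        return pace_to_kph(value)
    return float(value)  # cycling and others are already decimal kph


print(to_kph("6:00", "Running"))   # prints 10.0
print(to_kph("24.5", "Cycling"))   # prints 24.5
```

With a single numeric unit the "avg pace" and "max pace" columns can stay numeric in the database, and a pace in min/km can be recovered later as 60 divided by the kph value.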

Conclusion

You would think that a widely used product that tracks workouts would provide you with suitable data to download, but I hope I made you see that this is not necessarily the case. Personally, I believe that saving data with different units (and formats!) in the same column goes against all common sense, but even here I was apparently wrong.

#SummerPain #data analysis #cleaning data #garmin

REFERENCE: Cleaning up your data

Types of Data

Categorical data - Has a set number of categories, like race, gender and eye color. In Excel, you would format it as ‘text’ or ‘general.’

If there are multiple versions of the same category (for example, because one is spelled incorrectly), go to Data > Filter, select all the entries that should be the same, and make them uniform.
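The note above uses Excel's filter. If the data lives in a script instead, the same idea can be sketched in Python. The eye-color values and the alias table below are invented for illustration; a real table would come from looking at the distinct values in your own column.

```python
def normalize_category(value: str, aliases: dict) -> str:
    """Trim and lowercase a category label, then map known variants to one spelling."""
    cleaned = " ".join(value.split()).lower()
    return aliases.get(cleaned, cleaned)


# Hypothetical misspellings mapped to the spelling we want to keep.
ALIASES = {"blu": "blue", "bleu": "blue", "grn": "green"}

raw = ["Blue", " blue ", "blu", "GREEN", "grn"]
clean = [normalize_category(v, ALIASES) for v in raw]
print(sorted(set(clean)))  # prints ['blue', 'green']
```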

Ordinal data - Categories that fall in a natural order, like ratings on a scale.

Numerical data - Numbers you can count or measure, like age (but not age ranges). You would use formulas or an equation to do an analysis (like the average, total or most common value).

Data Cleaning Checklist

The purpose of cleaning data is to make sure that you can understand the data categories and also the individual data points.

Did you check for spaces?

Did you look up identifiers you don’t understand in the metadata?

Example: the DC Number was condensed, so you had to format the column as a number.

Make sure all your columns just have one data point.

Go through every categorical column with the filter tool to make sure everything is clean
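Two of the checks above (stray spaces, and eyeballing every categorical column) can also be done in a few lines of Python rather than with the filter tool. This is a sketch under assumptions: the rows are plain dictionaries, and the column names and values are made up.

```python
def find_cells_with_stray_spaces(rows):
    """Return (row_index, column) for text cells with leading, trailing or doubled spaces."""
    problems = []
    for i, row in enumerate(rows):
        for column, value in row.items():
            if isinstance(value, str) and value != " ".join(value.split()):
                problems.append((i, column))
    return problems


def distinct_values(rows, column):
    """List the distinct values in one column, like scanning Excel's filter drop-down."""
    return sorted({row[column] for row in rows})


# Hypothetical data for illustration only.
rows = [
    {"name": "Ann ", "city": "Leeds"},
    {"name": "Bob", "city": "New  York"},
]
print(find_cells_with_stray_spaces(rows))  # prints [(0, 'name'), (1, 'city')]
print(distinct_values(rows, "city"))
```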

#Reference #Cleaning Data

Machine Learning in Spark - LDA : A Complete example for clustering.


In this blog we will demonstrate how to apply the full ML pipeline to a set of documents, in this case 10 books from the internet.

So let's start with first things first.

What is Clustering ?

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to…


A very clear and simple guide that demonstrates WHY regular expressions are useful when you need to clean some data (that is, make it consistent).

You use a text-editor program to do this (e.g. TextWrangler on the Mac). There is no programming involved.

#cleaning data #tools #learn

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

Recommended by Nathan, page 42.

The video intro is really worth watching!

Read: How journalists can use Google Refine to clean ‘dirty’ data sets, by Matt Wynn

#tools #formats #cleaning data

