Data Munging: Identify Columns in CSVs
"Data munging or data wrangling is loosely the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools."
So now I have tens of thousands of CSV files with data about a lot of people and I need to identify the columns in the CSV.
The first method that occurred to me is to look at the headers and try to figure out what they could mean. But even looking for something like email is not straightforward - there could be a header called email, email id, email_id and oftentimes the header is simply missing. So, unless your data is very structured (which it seldom is) this does not work.
Emails and Phones
I identified email columns by saying that if more than 50% values in a column are valid emails then that column is Email. Similarly for phones.
Names
This is more interesting. There is no regular expression for identifying a valid name. So here is what I did - I took a list of say 100 most common first name in the geography that the data was from and then said, if more than 10% values in a column are found in this list then that column is the first name and similarly for last name. Then, if the ratio between number of first names matched and number of last names matched is between 0.2 to 5 then it is simply the full name. Of course, this doesn't give me 100% accuracy but the results were pretty good and adequate for my needs. And it definitely beats going through all the files and manually identifying the columns!
This technique works even better for identifying cities and states and is useful in many other cases.















