Stata code to correct typos and heterogeneous gender codings in Colombian databases
Imagine you have the following situation:
Then, with regular expressions, you can map those heterogeneous responses onto a well-defined dictionary with the following:
Now your data will look like this:
There were several tricks here:
- First, notice that before doing any search and replacement, we applied basic transformations to the variable to reduce the number of possible cases. For example, “Male ” and “MALE” are the same value, except that the first has a trailing white space and the second is in upper case. If we remove these white spaces and convert everything to upper case, both become a single case.
- Second, notice the special characters ^ and $. In regex (regular expression) syntax, ^ anchors the match to the start of the string and $ anchors it to the end, so ^MALE$ means “the string starts with MALE and nothing else follows”, that is, the whole value is exactly MALE. This prevents the program from matching the MALE inside FEMALE. Without this precaution, replacing MALE with MASCULINO would turn FEMALE into FEMASCULINO, which is clearly wrong. Homogenizing the dictionary of a column/variable that has this many variants because information was recorded in a non-standard way is full of cases like this, so you need to be really careful before running searches and replacements.
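Both tricks can be sketched in a few lines of Stata. This is only an illustration: the variable name gender, the sample values, and the alternatives listed inside each pattern are assumptions made for the example, not the original data or code of this post.

```stata
* Hypothetical sample data (assumed values, for illustration only)
clear
input str20 gender
"Male "
"MALE"
"male"
" Female"
"FEMALE"
"M"
"F"
"Hombre"
"mujer"
"masculino"
"Femenino"
end

* Trick 1: trim surrounding white space and convert to upper case,
* so that "Male ", "MALE" and "male" collapse into one case
replace gender = strupper(strtrim(gender))

* Trick 2: anchored patterns only match when the whole value is one
* of the listed alternatives, so MALE never matches inside FEMALE
replace gender = "MASCULINO" if regexm(gender, "^(M|MALE|HOMBRE|MASCULINO)$")
replace gender = "FEMENINO"  if regexm(gender, "^(F|FEMALE|MUJER|FEMENINO)$")

* Inspect the result; anything left outside the two categories needs its own rule
tabulate gender, missing

* The pitfall from the text: an unanchored pattern rewrites part of FEMALE
* (this line displays FEMASCULINO)
display regexr("FEMALE", "MALE", "MASCULINO")
* With the anchors there is no match and the string is returned unchanged
* (this line displays FEMALE)
display regexr("FEMALE", "^MALE$", "MASCULINO")
```

Checking the tabulation after each round of replacements is a cheap way to spot values that no pattern has covered yet, such as new misspellings.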
Equivalent code in Python and R is available in other posts of this blog.











