Calculation of growth rates between consecutive periods by cross-section unit (individuals, groups) in panel data in R project
@armature57
Knowledge can be either a dangerous or a useless thing for those who don't have the emotional skills to handle it. I don't say this to promote paternalistic censorship of how information is spread, an argument totalitarian states often use as an excuse to attack freedom of speech. That will be a matter for another post. Here, I can tell you about some groups of people for whom knowledge is a double-edged sword: people who, like me, face mental illness and are forced to learn in an environment that doesn't take into account the way our brains work. The possibility of understanding the role emotional intelligence plays in learning, and of taking into account in teaching which aspects of our brains are involved and how their failure can lead to failures in decision making and, moreover, in the perception of reality, sounds like science fiction, like the most naive utopia. And yet I think neuroscience is beginning to get us there for the first time in history.
I think many people can secretly relate, to some extent, to what this guy went through. In my case, the start of my career was somewhat similar to what he experienced, with the difference that I had no brain injury but another mental illness. I know self-diagnosis is something we should avoid at all costs, either on purpose, when you go around the internet trying to find out what is wrong with you, or when you come across something you can relate to by accident and fall into the temptation of digging deeper until you convince yourself that's what you have. However, I think one of the most direct and efficient ways to introduce neurodiversity and mental health into the learning process is to actually teach kids about some situations they might relate to. Cases they can use as mere analogies, to say "Hey, I am not qualified to say I have that, but I can sincerely relate to it in this and this and this". I can see a course on neurodiversity in the school curriculum, just as happened with sex education and will eventually happen with gender and minority studies. An education system that gives students the inputs to discover their own skills and limitations, with the help of a professional, can be a fairer one, and in a feasible way.
In my case, I will definitely be telling my therapist about this case, as I feel it is useful for him to know that I relate to it to a great extent. But I know I must trust his judgement on what to do with this information, and I won't be taking the comparison literally. Here again, I defend the power and the limitations of analogies. And God knows one of the best things of this year has been finding the YouTube channel Antroporama. Go check it out.
Yesterday I was reading in a Pulzo article that the Catholic Church is facing financial problems in Colombia and could go bankrupt by August. Those who have followed this blog know that I am (or try to be) an atheist, or at least a non-theist. Nevertheless, I know that in a country like Colombia the Church still provides useful services to the community, especially some that the State does not yet dare to provide, whether for lack of financial capacity or political will, or because it simply cannot be present in some areas in the middle of the armed conflict. So this is really not good news for someone who is an atheist out of conviction in rationality and practical thinking, rather than someone who merely replaced superstition with ideology. Whoever sees it that way may think that the Church's presence in matters that normally belong to the State is still a necessary evil.
On the other hand, this is just another manifestation of a more general phenomenon. The quarantine, violence and other social challenges are destroying ties that take years to build and through which information flows in a society. They are as important as physical infrastructure. In fact, they are called "social capital", to reflect that they share several features with physical capital. That is, it accumulates slowly through the investment of time and resources (and also emotions and attachment); it depletes gradually under normal conditions if not enough is invested (how many human relationships have we lost this way); it can be destroyed suddenly in situations such as natural disasters; and the amount of services it provides is directly proportional to the current stock (the value of networking), and so on. Without denying that the quarantine is necessary to preserve individual lives, the challenge is to do so without destroying more "aggregate" forms of life.
That is the real challenge right now: the balance, the trade-off, between preserving individual life and social life. When it is presented as a balance between the Economy and Life, I think a rhetorical device hides the particular character this trade-off has in a situation that challenges humanity as a species. Some of the aspects we understand as the economy are really instrumental, and can be interrupted for brief periods without later losing the market's capacity to organize resources or compromising economic freedom. This matters for reaching a social consensus, since the latter is the greatest fear of people who defend capitalism because of its relationship with individual liberty, and not only because of its capacity to satisfy the material needs that sustain life.
To see how to preserve the social relationships that were the foundations of the free market, even while its normal functioning is interrupted for a time, I like, for example, what the US did to avoid destroying its productive apparatus during the Second World War. Nothing better than this VisualPolitik video to illustrate how the US during this period really behaved like a socialist economy, even if this was not acknowledged in public. An apparent pragmatism meant that US leaders, instead of making people quit their jobs, move to other parts of the country to new government factories, and destroy their normal social and productive relationships, gave private companies the order, and the resources, to adapt their physical capital and train their human capital to produce goods and services for the war. As when Ford stopped producing private cars and started producing tanks and military vehicles.
The most fascinating thing is how the US has been able to enter and leave that state as needed, without getting trapped in it afterwards under a dictatorship. There certainly was a cost to pay, insofar as certain behaviors remained that move it away from the libertarian ideal of the founding fathers. I will go deeper into that in another post, as I research the matter further. Nevertheless, this certainly helped preserve some of the social relationships underlying productive relationships, which later allowed it to seize the opportunities for economic growth that the postwar period brought.
Article on the risk of bankruptcy of the Catholic Church in Colombia: https://www.pulzo.com/economia/iglesia-catolica-punto-quebrar-colombia-por-covid-19-PP921830?utm_source=notificacion&utm_medium=onesignal
Very good video, this one from Esquizofrenia Natural. It makes me think of Richard Dawkins's suggestion that life may be undergoing a paradigm shift. The mechanism of natural selection that created life itself through the competition of genes, and that gave rise to man, who is self-aware life, may be creating a new paradigm of life: the competition of "memes", loose ideas that can self-replicate and mutate. Perhaps those memes are evolving and creating a complex intelligence. Think, for example, of the crowdsourcing of ideas, of self-organizing communities that build complicated ideas out of the simple ideas of thousands of contributors. With the help of artificial intelligence to process all that information, perhaps this new paradigm will also have an evolutionary pinnacle, a super-conscious being that at some point will be able to pose questions so existential and abstract that we humans do not even aspire to ask them yet. What happens in Asimov's The Last Question could become a reality this way.
“How to build character without becoming a prick. A full-of-insecurities-guy’s handbook for everyday life”. I swear one day I will write this book.
Richard Dawkins in The Selfish Gene, on the value of science communication and how it is a mistake to separate it from science. Truly inspiring for me at moments when I doubt whether this blog will serve any relevant purpose.
Can you imagine using simulation methods to teach history and social sciences to children? Agent-based modeling has huge potential for teaching people how small, simple pieces with well-known behavior can relate in unexpected and not so simple ways. This is usually what we think happens in history, with towns, empires and leaders making decisions that take into account some of the information in their environment but cannot see the whole picture, leading to unexpected consequences. To see this big picture in retrospect, scientists usually build models that require some skill to solve, like systems of equations. However, as the models grow more complex, scientists are having to turn back to numerical methods.
This has the potential to make science communication easier for ordinary people. Don't misunderstand me, I will always love the elegance of analytical models. But it is quite a challenge for scientists to translate them into simple language when you are interested not only in explaining the inputs and outputs, but also the intuition behind the mechanics involved. If programming becomes a widespread skill even among the young, science communication might turn into simply helping people make toy models that mimic some important aspects of the real simulations scientists run, with all the advantages this might bring, like gamifying learning in interactive ways. For now, I am really curious about how these anthropologists and sociologists modeled these historical phenomena. I would love to learn to do things like that in Python. For more on this project, follow the hashtag #SimulationForDivulgation.
Text quoted in the image taken from: Macal, C.; North, M. (2005). Tutorial on Agent-Based Modeling and Simulation. Proceedings of the 2005 Winter Simulation Conference.
On how my amateur interest for the mind’s functioning started: From mental illness to machine learning. Part I.
My interests often follow strange life paths. It is my mind trying to find some order in a messy environment, I guess. An endogenously messy one, probably. It's not my intention to blame anyone else. The point here is that my curiosity has been a kind friend in difficult times, and it always finds a way to make me see the relief of novelty even in troubling times.
In a difficult period of my life, a deep curiosity about how the human mind works was awakened. It started when I noticed that everyone in my family had mental illnesses that had caused them many difficulties in life, particularly in developing a life project, being functional and working. For many, even in their attempts to just be happy and kind to one another. A dysfunctional family was the saddest result of this. It intensified when I discovered that this legacy had passed to me, although at that moment the curiosity might also have been a method of protection. A wrong one, as not only did I try to fix myself with my reasoning skills, but I also used it to shield myself from the possibility that I might need external help. And finally, it came back strongly after my grandfather's stroke.
My grandfather was always a strong man in every sense: mentally, spiritually and physically. That is, until he wasn't. In just one day, his cognitive skills were reduced to little more than a baby's. His memories lost their structure in time and place, as if someone had taken a puzzle and shuffled all the pieces. Some retained their local structure, others lost important pieces in the middle, and others were connected to pieces from other parts of the puzzle. Even his dreams seemed to have taken the place of actual facts, some fears materialized as things that had not happened, and some painful things were removed from his memory, so he eventually had to deal with them again, like the fact that my grandmother was not there anymore. In the saddest of pictures, however, I was able to see the marvel of the human mind, as I saw my grandfather start learning everything again.
To make clear why what I am going to tell you was new to me, I first have to tell you that I never grew up around very young children. I have no nephews or little cousins, or little brothers or sisters. So, seeing the development of a mind was something entirely new to me. To see him relearning basic skills like swallowing properly, walking and using his hands, by trial and error, was intriguing. You usually take those skills for granted, as if they came from common sense embedded in your genes. As if they were instinct. But there was the powerful machine of the brain, making little associations, local models of how its own body should behave, and dealing with the complex emotions that appeared in the process. It was an experience that connected us all to our most human side, as my family have always been proud people, concerned only with the romantic idea of being intellectuals who occupy themselves with understanding the most important phenomena of the world and of the spirit. A touching and humbling experience indeed, strengthened by the power of, and desire for, self-exploration.
Follow me for the continuation of this post, coming soon.
Continuing with my new obsession with tea ceremonies, I really liked this video about Gong Fu Cha tea. For starters, it makes really clear that this is a discipline, not a ritual, like the Chinese ceremony. It is a craft designed to enhance your capacity to taste tea, rather than a way to make good tea just to drink. It is a philosophy completely different from the one I described as a comment in the video below. So, thus far, there are drinkers of tea, tasters of tea, and ritualists of tea, even though they might appear so similar that you might think they are the same. But that similarity is just like many others in other contexts, where things share some history up to a point but then separate and serve different purposes. My brainy side loves to make this analogy: compare the people who use least squares to triangulate several measurements (like cartographers and physicists) with the people who are trying to fit an equation for prediction as a black box (like machine learning practitioners), and with the people who use it for causal explanation of phenomena, like scientists. Same tool, very different purposes, each of which requires an additional "something else". A something else that, if forgotten or mixed up, will give you a bad result. Intention is always tied to purpose and to the suitability of the tool, and in these subtle distinctions the devil lives in the details. I think this analogy will serve the students in my private data science classes more than Gong Fu tea lovers. Or who am I kidding, probably just me. Anyway, it's always entertaining to make analogies out of the blue. For now, enjoy this video.
One of the things that calms me the most is watching these tea ceremonies. I have always felt respect for the honor that Eastern cultures show to daily things, daily details, daily tasks. Especially as a person for whom doing little things can sometimes be the beginning of an internal conflict over whether I am being smart enough while doing them, practical, orderly, whether everything in my logic of doing things is clear to me and to others, whether they approve of it or whether I am being complaisant enough towards others (and whether being so is a good idea), whether it is useful for the world's welfare and for society, whether it is productive effort, and so on. Yes, it's that bad, I'm afraid; you might know that horrible monologue yourself. Honestly, to see that someone, somewhere in the world, pays this much attention to something like brewing tea makes me feel the world is not that oppressive. It makes me feel that it's OK not to be in a hurry all the time, and that art and method can be in everything we do. It is also fascinating to become aware, to really notice, that there are so many details in many of the things we do, so many micro-tasks involved even in simple acts. Those are the forgotten and taken-for-granted marvels of the human brain that Minsky mentions in his book "The Society of Mind". Those that we as adults tend to underestimate, but that once, as children, took us so much effort to master, and that those who work with robots know can be a real miracle to achieve when you are trying to program something to do them. And there can be beauty in each step if you look carefully. One day, when this quarantine is over, I would love to do things like this. For now, I have started reading The Book of Tea by Kakuzo Okakura.
Failing at being an atheist
One would believe that not believing, or in general "not doing" something, would be easy. Unless, that is, you are an addict and the act is "not consuming". But sometimes it is difficult, in weird ways. Like when you try to control something inside your mind. The thought of not thinking suddenly, and out of nowhere, brings up the most creative thoughts about the topic you want to avoid. In my case, I think I am failing at being a full atheist. This is no surprise or sudden epiphany, as long ago I gave up a little on this wish and, in the spirit of honesty with others and with myself, I actually define myself as a non-theist. As Buddhists and Taoists define themselves, and I am particularly fond of the first of those spiritual paths. There is something in the back of my head that can't stop acting as in that ABBA song, "I Have a Dream". Believing in angels, and in something good in everything one sees. That, even if rationally I think there is nothing supernatural in the idea of good. It is only a construct of humanity that reflects its wish to live a good life, for oneself and for others. And yet I am even curious, in an intellectual way, to understand people's beliefs about the unknown, and how they look to such beliefs for answers to their daily problems.
I am constantly tempted to tell myself that it's only my nature, or even that this is human nature, to shield my subconscious belief from any self-recrimination about this failure. But the truth is I really don't know that. All I know is that a part of me wants to believe, another is a little ashamed of doing so, and that last part is the one that wants to take control. A little tired of trying to avoid that internal conflict, of pretending it does not exist and acting as if I were a real full atheist, I decided that perhaps atheism is not just the "absence" of a system of beliefs, but a system in itself, designed to somehow deal with some of those wishes. Therefore, I decided to read a little about that. That's how I am now reading "Goodbye God" by Richard Dawkins, one of the most famous atheists in the world. His conviction when talking about the subject is great. He believes that it is the responsibility of all scientists to become militant atheists. So, if there is some guidance to look for, I guess it would be in one of his books. Follow me with the hashtag #GoodbyeGod to see my progress on this book and its implications.
Did you know there are people trying to measure economic development with machine learning in parts of the world with little or no access to quality data?
From people measuring poverty in North Korea with satellite images taken at night, to see which parts of the country have low power grid coverage, to organizations measuring poverty in Africa by inspecting the concentration of houses with roofs made of low-quality materials. These proxy measures of poverty are also very useful as a source of validation for official data where it exists, although, like all proxy methods, they have limitations. However, their potential for allowing citizens to help build independent assessments of official data is huge. Imagine the possibilities they open up when citizens suspect the government might have an interest in hiding information about a problematic situation, as usually happens in totalitarian states. More on this in posts with the hashtags #RemoteSensoringofPoverty and #CitizenScientists.
Calculating group averages of several variables/columns in Python when a group is a period of time between random events
Imagine you have the following kind of database:
Imagine you want to create group averages of the variables "Valor1" and "Valor2" by period of time. Each period is the span between the years in which the variable "Evento" takes the value 1. Think of this as a random event with no fixed pattern (you can't just take averages every four years, for example). The way to proceed is first to create a variable that says, for instance, that Period 1 runs from 1991 to 1994, Period 2 is the year 1995 alone, and so on. The cumsum command (cumulative sum) applied to the dummy variable "Evento" will do the trick:
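The original snippet is not reproduced here, so what follows is only a minimal sketch of the cumulative-sum idea in plain Python. The years and event flags are made-up illustration data, chosen so that an event in 1991 opens Period 1 and an event in 1995 opens Period 2. With a pandas DataFrame, the same step would presumably be written as `df["Period"] = df["Evento"].cumsum()`.

```python
from itertools import accumulate

# Hypothetical toy data: one row per year; evento == 1 marks a year
# in which the random event happens and a new period begins.
years = [1991, 1992, 1993, 1994, 1995, 1996, 1997]
evento = [1, 0, 0, 0, 1, 1, 0]

# The running (cumulative) sum grows by one at every event, so each
# year ends up labelled with the number of the period it belongs to.
period = list(accumulate(evento))

for year, p in zip(years, period):
    print(year, p)
```

Years before the first event would get the label 0, which is worth checking for in real data.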
After this, it is only a matter of calculating the group means of Valor1 and Valor2 by the variable "Period". For that we can do:
Follow me to see in a forthcoming post how to do this in R and Stata.
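As a rough illustration of the group-mean step, here is a sketch using only the standard library. The rows are invented for the example, and the period labels are assumed to have been computed already. In pandas, the usual equivalent would be `df.groupby("Period")[["Valor1", "Valor2"]].mean()`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows of (Period, Valor1, Valor2); the period label is
# assumed to come from the cumulative sum of the event dummy.
rows = [
    (1, 10.0, 1.0),
    (1, 20.0, 3.0),
    (2, 30.0, 5.0),
    (3, 40.0, 7.0),
    (3, 60.0, 9.0),
]

# Collect the observations of each period in its own list.
groups = defaultdict(list)
for period, valor1, valor2 in rows:
    groups[period].append((valor1, valor2))

# Average each variable within each period.
group_means = {
    period: (mean(v1 for v1, _ in obs), mean(v2 for _, v2 in obs))
    for period, obs in groups.items()
}

for period, (m1, m2) in sorted(group_means.items()):
    print(period, m1, m2)
```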
Agent based models and machine learning integration : The future of scientific models based on observational data ?
The integration discussed here is one of those things I cannot wait to see in action. Agent-based models are great at modelling the dynamics of individuals that interact with their environment and with each other, and that learn from the feedback of their actions. Methodologically speaking, they are attractive because they allow you to use data at different levels of aggregation (for example, some will be micro-data, while some will be market-level or city-level data). These models fill the gaps between levels of aggregation by setting relationships among elements of the simulated environment. But not by modelling them analytically, given that such relationships can be highly complex and difficult to capture through exact equations or problems with closed-form solutions. Instead, the gaps in information are filled with simulations based on assumptions about the unobservables and/or the correlations among variables that enter the system's equations of motion. This way, the only things that have to be set analytically are the behavioural equations (like the utility function of individuals in economic models) and the laws of motion of the agents in their environments, but not the solution to the system of (non-linear) equations. Numerical methods will do the rest. It is also nice that this forces one to make explicit the distributional assumptions about the unknowns.
Methods like this one, which use analytical modeling up to a point and then use "brute force" for the rest, have been used at several moments in science, long before computers. As far as I know, it goes back to the numerical integration solution proposed by Crommelin to solve the multi-body gravity problem that made it hard to calculate accurately the trajectory of Halley's Comet in 1909. In this, he used as the "structural equations" of the model (as economists would call them) the differential equations that governed the movement of the comet and the planets Jupiter, Saturn, Neptune and Uranus, along with the Moon and the Sun. Instead of trying to work out explicitly how these forces would perturb the elliptical orbit of the comet (which would be equivalent to trying to solve a general equilibrium model in economics analytically), he did it numerically, just as an agent-based model would. The elliptical model was just a benchmark model. Such models were valuable for science as ceteris paribus clauses that taught us the laws of motion of the celestial bodies, thanks to Newton's contributions. But as the sky was not a controlled experiment, adjusting the ellipse to account for external forces was really difficult, and implied making further assumptions just to make calculations easier. Numerical integration instead substituted some of the intellectual effort that scientists would have had to put into solving the problem with the effort of less qualified, but more abundant, workers. Nonetheless, think how hard it was to do this back then, when tens of "human computers" had to do all the brute force over months. (If you are interested in learning more about this, I bet you would enjoy the book When Computers Were Human by David Alan Grier. A very entertaining read.)
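To give a flavour of what "doing it numerically" means, here is a toy sketch, not Crommelin's actual procedure. It integrates a single body around a fixed sun with a simple step-by-step rule (semi-implicit Euler), in made-up units where the gravitational parameter is 1 and the exact solution is a circle of radius 1. Extra bodies would just add more terms to the acceleration.

```python
import math

GM = 1.0    # gravitational parameter in arbitrary units (an assumption)
DT = 0.001  # size of each time step
STEPS = int(2 * math.pi / DT)  # roughly one full orbit when the radius is 1

# Start on a circular orbit: position (1, 0), velocity (0, 1).
x, y = 1.0, 0.0
vx, vy = 0.0, 1.0

max_radius_error = 0.0
for _ in range(STEPS):
    r = math.hypot(x, y)
    # Structural equation: Newtonian attraction towards the origin.
    ax = -GM * x / r**3
    ay = -GM * y / r**3
    # Brute force: update the velocity first, then the position.
    vx += ax * DT
    vy += ay * DT
    x += vx * DT
    y += vy * DT
    max_radius_error = max(max_radius_error, abs(math.hypot(x, y) - 1.0))

print(f"largest deviation from the circular orbit: {max_radius_error:.5f}")
```

Each pass of the loop is the kind of small, mechanical calculation that a team of human computers would once have done by hand.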
Machine learning is another tool based on brute force, and it has several similarities with agent-based models. The analogy between how the model itself learns from its environment, in a way that is not analytically tractable even though the equation of the model being trained exists, and how agent-based models are calibrated, is very well explained in this video by Bob Trenwith. I would like to congratulate him and the other people who noticed this similarity for their brilliance, and for the way they show that powerful innovations come from creative minds willing to make clever analogies and test them to their limits. I will discuss what I think this integration implies in a following post. For now, enjoy the video, and I really recommend Mr. Trenwith's channel if you are interested in topics from machine learning to Bayesian statistics and applications in biology.
Microtasking for crowdsourcing and anonymization of data in Python: an idea for handing sensitive data to external parties for pre-processing without revealing the true identity of the individual. Case of addresses and names in Colombian databases
I have often come across this situation. There is some raw data in really bad shape, in the hands of an institution that has the duty to protect the identity of the individuals in the database. Just knowing that someone is in the data might itself be problematic (think, for instance, of a database about loan defaults or about retaking an entrance test. People usually don't want this to be publicly known). On the other hand, an external institution needs the data and could do wonders with it, because it has technical staff and access to techniques that are beyond the reach of the institution that owns the data. Or simply, the institution that owns it would be overwhelmed by data requests that would imply a lot of pre-processing. If there were no disclosure problem, the institution would want to give free access to the data so that each user could transform it to suit their own needs. But that is just out of the question.
One possibility would be for the owning institution to anonymize the data and give the user only the variables not related to identity. However, sometimes it is precisely the sensitive information that needs to be processed somehow. For example, think about when you want to merge several databases from several sources through fuzzy matching on the names of the people in them, because the document ID is missing or has problems. In Colombia, matching the ID of underage people with their IDs as adults is nearly impossible, and the matching has to be done by name, which has a lot of typos or appears written in different ways in the different sources of information one wants to merge. Another case is when you have to convert addresses into longitude and latitude coordinates for georeferencing. In such cases, the address often has a lot of typos, or the taxonomy is written with abbreviations or in very heterogeneous ways. Again, revealing the entire address is not an option, but you would like to hand over the data in some form, so that the homogenization of addresses can be done elsewhere.
Nowadays, dividing such work into microtasks given to a crowd of volunteers or people receiving micropayments (the practice of crowdsourcing, which I have talked about elsewhere in this blog) could handle such a task, if we find a way to make the data available to them. Here, I propose a way to do so. First, let's consider the following column of addresses, in the taxonomy used in Colombia.
As you see, sometimes the address is incomplete. Other times, the same kind of term is written in different ways (like Calle and Cl.). Now imagine a column like this, but with thousands of entries. Using regular expressions, or simple search and replace procedures, would mean correcting this through many lines of code (the regex inferno, as they call it) and would require a lot of work. Now, imagine we want others to do it for us, so we do this instead: first, we split the different parts of the address into several columns, splitting every time a white space appears. Like this:
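The original code is not shown here, so the following is only a sketch of the splitting step in plain Python, on invented addresses. In pandas, the usual way to get one column per component would be `df["address"].str.split(expand=True)`, where `address` is a hypothetical column name.

```python
# Hypothetical addresses in the Colombian style; None stands for a missing value.
addresses = ["Calle 10 # 15-20", "Cl. 5 # 8-12", "Carrera 7", None]


def split_address(address):
    """Split an address on white space; a missing address gives no parts."""
    return address.split() if address else []


parts = [split_address(a) for a in addresses]

# The number of columns is not known in advance: it is the length of
# the longest split address.
n_cols = max(len(p) for p in parts)

# Pad shorter rows with None so every row has the same number of columns.
table = [p + [None] * (n_cols - len(p)) for p in parts]

for row in table:
    print(row)
```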
The good thing about Python is that it automatically creates the right number of columns, which we did not know when we ran the command (the longest address had 4 components, while the shortest had none, as it was a missing value). After checking that the number of columns was limited, we can do something like this:
Now, let's create a new data set that only has the split addresses.
Now, imagine we could randomly sort each column, giving something like this:
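A minimal sketch of the column-wise shuffle, assuming the split addresses are held as a list of rows (the data is invented). Each column is shuffled independently, so the pieces of any one address no longer sit on the same row.

```python
import random

# Hypothetical table of split addresses: one row per address.
table = [
    ["Calle", "10", "#", "15-20"],
    ["Cl.", "5", "#", "8-12"],
    ["Carrera", "7", "#", "3-45"],
    ["Cra", "21", "#", "40-02"],
]

# A fixed seed is used only to make this sketch reproducible; for real
# anonymization the shuffle should not be reproducible by outsiders.
rng = random.Random(0)

# Turn rows into columns, shuffle every column on its own, and go back.
columns = [list(col) for col in zip(*table)]
for col in columns:
    rng.shuffle(col)
shuffled = [list(row) for row in zip(*columns)]

for row in shuffled:
    print(row)
```

With very few rows, or with rare components, a row can still end up intact or recognizable by chance, so this is better suited to large columns.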
Notice that this database has all the information of the original addresses, but sorted in a way that no single original address can be traced. The identity and location of the person are protected, and yet all the inputs needed to write code that will do the necessary pre-processing are there. Therefore, we could safely give this to an external party to do the rest of the job, so they can use as many regular expressions (or other methods, like fuzzy matching through phonetic algorithms) as needed to shape the addresses the way the GIS requires. Something like this:
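As an illustration of the kind of cleaning code the external party might write, here is a sketch that maps a few street-type variants to one spelling. The list of variants and the canonical names are assumptions made for the example, not an official taxonomy.

```python
import re

# Hypothetical patterns: each one lists variants of a street type and
# the canonical spelling to use instead.
STREET_TYPES = {
    r"^(CL|CLL|CALLE)\.?$": "CALLE",
    r"^(CR|CRA|KR|CARRERA)\.?$": "CARRERA",
}


def normalize_street_type(token):
    """Return the canonical street type, or the cleaned token if none matches."""
    if token is None:
        return None
    cleaned = token.strip().upper()
    for pattern, canonical in STREET_TYPES.items():
        if re.match(pattern, cleaned):
            return canonical
    return cleaned


# First column of a (shuffled) table of split addresses.
first_column = ["Cl.", "Calle", "cra", None, "Carrera"]
normalized = [normalize_street_type(t) for t in first_column]
print(normalized)
```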
And so on ....
Once finished, this very same code, applied not to the randomized database but to the original, will correct the problem. Therefore, once the external party has finished the code, they can hand it over to you, so you can run it on your server and then do the georeferencing. To use crowdsourcing, the only additional step needed would be to give each person only a sample of rows of the big data frame, so that each can work on a subset of the problems. You could even give several of them duplicated rows, to cross-validate their work.
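One possible way to hand out the rows could look like the sketch below. The function name `make_batches` and its parameters are hypothetical; it gives every worker a few shared rows for cross-checking and splits the remaining rows among them.

```python
import random


def make_batches(n_rows, n_workers, n_shared, seed=0):
    """Split row indices among workers, with some rows shared by everyone."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    shared = indices[:n_shared]  # rows every worker receives
    rest = indices[n_shared:]    # rows given to exactly one worker

    batches = [list(shared) for _ in range(n_workers)]
    for i, idx in enumerate(rest):
        batches[i % n_workers].append(idx)
    return batches, shared


batches, shared = make_batches(n_rows=20, n_workers=3, n_shared=2)
for worker, batch in enumerate(batches):
    print(worker, sorted(batch))
```

Comparing how different workers treated the shared rows gives a rough check on the consistency of their corrections.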
R project code to split a variable with several values into separate dummy columns. Case of population group on Colombian databases
Sometimes you will have a column variable that holds several values in each cell. For example, the information may have been collected through a web survey whose form allowed the person to check several options for an answer. In that case, the database engineers might have decided that, instead of creating sparse columns for each possible category, they would just put all categories in the same column and leave to the analyst the work of extracting the relevant information and converting it into separate columns. For example, think of the following column:
In this case, the population group includes several options. Some people might belong to two or more. The following code separates them all without having to know exactly how many categories there are, provided one is sure that all possible options were standardized.
The result would be:
If the field was not homogenized, you could combine this code with regular expressions in the pre-processing stage, so that no redundant columns are created.
Follow my blog to see in another post how this would be done in Python and Stata.
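The R code of this post is not reproduced here. As a rough Python sketch of the same idea, the example below assumes the options are separated by a semicolon and uses invented population groups. In pandas, `Series.str.get_dummies(sep=";")` does essentially this in one call.

```python
# Hypothetical multi-valued column: each cell lists the groups a person
# belongs to, separated by ";" (the separator is an assumption).
values = [
    "Indigena;Desplazado",
    "Afrocolombiano",
    "Desplazado",
    "",
    "Indigena;Afrocolombiano;Desplazado",
]

# Split each cell into its categories, dropping empty pieces.
split_values = [
    [v.strip() for v in cell.split(";") if v.strip()] for cell in values
]

# Discover the categories from the data instead of listing them by hand.
categories = sorted({v for cell in split_values for v in cell})

# One dummy (0/1) per category and per row.
dummies = [{c: int(c in cell) for c in categories} for cell in split_values]

for row in dummies:
    print(row)
```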
Stata code to correct typos and heterogeneous ways to code gender in Colombian data bases
Imagine you have the following situation:
Then, with regular expressions, you could replace those very heterogeneous response options with a well-defined dictionary, using the following:
Now your data will look like this:
There were several tricks here:
- First, notice that before doing any search and replace, we applied basic transformations to the variable to reduce the number of possible cases. For example, "Male " and "MALE" are the same, except that the first one has a white space at the end of the string and the second one is in upper case. We can remove all of these white spaces and convert everything to upper case, and then they become a single case.
- Second, notice the use of the special characters ^ and $. In regex (regular expression) syntax, the first one means "starts with" and the $ means "ends with". So ^MALE$ means "starts with MALE and there is nothing else after". This ensures that, for example, when the program finds the MALE part of FEMALE, it does not replace anything. If we didn't take this precaution, when we ask to replace MALE with MASCULINO, FEMALE would become FEMASCULINO, which would be a clear mistake. The work of homogenizing the dictionary of a column/variable that has this many possibilities, due to a non-standard way of recording information, is full of cases like this, which requires one to be really careful before doing searches and replacements.
Equivalent code in Python and R can be found in other posts of this blog.
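The two tricks above can be sketched in Python as follows. This is not the Stata code of the post; the raw values and the one-letter variants "M" and "F" are invented for the example.

```python
import re

# Hypothetical raw answers with trailing spaces and mixed case.
raw = ["Male ", "MALE", "female", " Female", "M", "f"]

# Anchored patterns: ^ and $ force the whole string to match.
RULES = [
    (r"^(MALE|M)$", "MASCULINO"),
    (r"^(FEMALE|F)$", "FEMENINO"),
]


def normalize_gender(value):
    """Trim and upper-case first, then apply the anchored replacements."""
    cleaned = value.strip().upper()
    for pattern, label in RULES:
        if re.search(pattern, cleaned):
            return label
    return cleaned


normalized = [normalize_gender(v) for v in raw]
print(normalized)

# Without the anchors, the MALE inside FEMALE is replaced as well.
unanchored = re.sub(r"MALE", "MASCULINO", "FEMALE")
print(unanchored)
```

Values that match no rule are returned cleaned but otherwise unchanged, so they can be inspected and added to the rules later.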