krishg200 @krishg2000 - Tumblr Blog

Bar graph

Univariate graph of animal phobia.

data[ANIMALS_MAP[CODE]] = data[ANIMALS_MAP[CODE]].astype('category')
data[ANIMALS_MAP[CODE]] = data[ANIMALS_MAP[CODE]].cat.rename_categories(ANIMALS_MAP[VALUES])
seaborn.countplot(x=ANIMALS_MAP[CODE], data=data)
plt.xlabel('Ever had fear/avoidance of insects, snakes, birds, other animals')
plt.title('Distribution of Animal Phobia Cases')
plt.show()

This graph shows that the number of cases with animal phobia is rather small, but still considerable (about 21% of the respondents).

Univariate graph of respondents’ distribution by the region of origin.

# Create subset for the 20 origins that have more than 400 occurrences (calculated in Assignment 3)
condition_origin = data[ORIGIN_MAP[CODE]].isin(REGIONS)
subset_origin = data[condition_origin].copy()
# Add a new column with regions based on origin values
subset_origin['REGION'] = subset_origin.apply(lambda row: assign_region(row), axis=1)
# Pass the column by keyword so that newer seaborn versions accept the call
seaborn.countplot(x='REGION', data=subset_origin)
plt.xlabel('Regions of Origin')
plt.xticks(rotation=20)
plt.title('Distribution by Regions of Origin')
plt.tight_layout()
plt.show()

This graph is the result of grouping the many original origin values into ‘bins’ so that they fit better into the image. We see that the predominant region of origin in this dataset is Western Europe, followed by Africa, Latin America and Central Europe. Note that the graph only represents the distribution for the respondents whose origin had 400 or more occurrences in the dataset.

Univariate graph of respondents’ distribution by the health perception.

data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].astype('category')
data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].cat.rename_categories(HEALTH_VALUES)
# Pass the column by keyword so that newer seaborn versions accept the call
seaborn.countplot(x=HEALTH_MAP[CODE], data=data)
plt.xlabel('Self-Perceived Current Health')
plt.title('Distribution by Health Perception')
plt.show()

This graph shows that most respondents tend to describe their health as good, very good or excellent.

Bivariate graph, categorical -> categorical (region of origin [explanatory] -> type of animal phobia [response]).

condition_ap_for_origin = subset_origin[ANIMALS_MAP[CODE]] == 1
subset_origin_ap = subset_origin[condition_ap_for_origin].copy()
subset_origin_ap[APPUREMIXED] = pd.to_numeric(subset_origin_ap[APPUREMIXED], errors='coerce')
seaborn.catplot(x='REGION', y=APPUREMIXED, kind='bar', ci=None, data=subset_origin_ap)
plt.xlabel('Regions of Origin')
plt.ylabel('Proportion of Pure Animal Phobia')
plt.title('Pure Animal Phobia vs. Region of Origin')
plt.xticks(rotation=20)
plt.show()

Again the explanations regarding Native Americans are below.

That said, we see in this graph that Native Americans and the respondents of African origins demonstrate the lowest proportion of pure animal phobia. Third lowest are Latin Americans. That is interesting, as these particular groups demonstrate the highest rates of animal phobia in general.

Details

Animal phobia

As it turned out in the earlier assignment, there may be a considerable difference between pure and mixed animal phobia. It was particularly striking when I compared perceived health for all respondents with animal phobia (pure and mixed together) with perceived health for pure animal phobia alone. Briefly, those with pure animal phobia tend to rate their health better than those with mixed animal phobia (whose results were even slightly worse than for the whole dataset).

So, I decided to focus on this distinction. Here is the distribution of the types of animal phobia only for the respondents who had this experience.

condition_ap = data[ANIMALS_MAP[CODE]] == 1
subset_ap = data[condition_ap].copy()  # Make a subset of those with animal phobia
subset_ap[APPUREMIXED] = subset_ap[APPUREMIXED].astype('category')
subset_ap[APPUREMIXED] = subset_ap[APPUREMIXED].cat.rename_categories(['Mixed', 'Pure'])
seaborn.countplot(x=APPUREMIXED, data=subset_ap)
plt.xlabel('Types of Animal Phobia')
plt.title('Distribution of Pure and Mixed Animal Phobia')
plt.show()

The share of pure cases among the respondents with animal phobia (about 25%) is actually similar to the share of animal phobia in the whole dataset (about 21%).

Health perception -> Type of AP

So, getting to the health perception. If a person with animal phobia perceives their health as good, is it reasonable to expect that their AP is pure?

condition_ap = data[ANIMALS_MAP[CODE]] == 1
subset_ap = data[condition_ap].copy()  # Make a subset of those with animal phobia
# group health categories into two
subset_ap['HEALTHBINARY'] = subset_ap.apply(lambda row: sort_health(row), axis=1)
seaborn.catplot(x=APPUREMIXED, y='HEALTHBINARY', kind='bar', ci=None, data=subset_ap)
plt.xlabel('Type of Animal Phobia')
plt.ylabel('Proportion of Good Perceived Health')
plt.title('Perceived Health -> Type of AP')
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()

Well, not exactly. The share of good perceived health is somewhat higher for pure AP, but it is high for mixed AP as well: about 85% versus 73%, nothing impressive.

By the way, to produce this graph I had to sort all the health categories into two. I chose Good (Excellent, Very good, Good) and Not good (Fair, Poor).
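The two-way grouping and the bar heights can also be reproduced without a row-wise function. Below is a minimal sketch on made-up data; the column names `health`, `ap_type` and `health_binary` are placeholders of mine, not the NESARC variable codes, and I assume the usual coding of 1 = Excellent through 5 = Poor.

```python
import pandas as pd

# Toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "health": [1, 2, 3, 4, 5, 3, 1, 4],   # assumed coding: 1=Excellent ... 5=Poor
    "ap_type": ["Pure", "Pure", "Pure", "Pure",
                "Mixed", "Mixed", "Mixed", "Mixed"],
})

# Good = Excellent, Very good, Good (codes 1-3); Not good = Fair, Poor.
# Missing values would also end up as 0 here, because isin() is False for NaN.
toy["health_binary"] = toy["health"].isin([1, 2, 3]).astype(int)

# The mean of a 0/1 column is the proportion of ones in each group,
# which is what the bar heights in a bar-type catplot represent.
good_share = toy.groupby("ap_type")["health_binary"].mean()
print(good_share)
```

Printing the grouped means next to the plot makes it easy to quote exact proportions instead of reading them off the bars.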

Origin / Descent

This variable turned out to be challenging again. Out of 60 original categories I used only 20 (those that had 400 or more occurrences). Still, they were too many to properly fit them into a plot. So, presumably, I had to group them into some ‘bins’, or generalized categories, based on some principle. I chose regional classification as this principle. On the way, I found out that this principle was not exactly robust, because there are really numerous versions of such a classification. As a result, my sorting was extremely approximate, but I think it is good enough for this course’s purposes.

There was one origin, however, that I could not fit into this regional classification: the Native Americans. As I initially chose this variable as a potential marker of different cultural backgrounds, I wanted to keep this cultural distinction within my regional groupings as well. In most cases, I think, it is roughly reflected in political geography. But definitely not in this case, because otherwise Native Americans would be merged with Latin or Northern Americans. My solution was to actually keep them as is.
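The binning itself boils down to a dictionary lookup. Here is a small sketch with a hypothetical lookup table: the region assignments below are only examples of mine, while the real `REGIONS` dictionary lives in the reference module and maps numeric origin codes.

```python
import pandas as pd

# Hypothetical origin-to-region lookup, for illustration only.
region_by_origin = {
    "Italian": "Southern Europe",
    "German": "Western Europe",
    "Mexican": "Latin America",
    # Kept as its own group instead of being merged into a geographic region
    "American Indian (Native American)": "Native Americans",
}

origins = pd.Series(["Italian", "German", "Mexican",
                     "American Indian (Native American)", "Turkish"])

# Series.map looks each value up in the dictionary; origins that are
# missing from the lookup (here "Turkish") become NaN.
regions = origins.map(region_by_origin)
print(regions.value_counts(dropna=False))
```

Origins left out of the lookup stay visible as NaN, which is a quick way to check that nothing was dropped by accident.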

So, getting to the proportions. In my previous assignment, I found considerable variability in animal phobia rate depending on origin. In my new regionalized version this variability is still in place.

subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].replace(9, np.nan)
subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].replace(2, 0)
subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].astype('category')
print(subset_origin[ANIMALS_MAP[CODE]].describe())
subset_origin[ANIMALS_MAP[CODE]] = pd.to_numeric(subset_origin[ANIMALS_MAP[CODE]], errors='coerce')
seaborn.catplot(x='REGION', y=ANIMALS_MAP[CODE], kind='bar', ci=None, data=subset_origin)
plt.xlabel('Regions of Origin')
plt.ylabel('Proportion of Animal Phobia')
plt.title('Animal Phobia vs. Region of Origin')
plt.xticks(rotation=20)
plt.show()

We see the highest rates of animal phobia for Africa and Native Americans, followed by Southern Europe and Latin America.

This picture, however, drastically changes, if we look at the proportions of pure animal phobia by region. That graph was already posted above, in the Summary section, but I will reproduce it here again.

Here we see that the proportion of pure animal phobia is actually the lowest exactly for those who had the highest rates in the general animal phobia. Namely, such groups as Native Americans, Africa and Latin America show the smallest proportions of pure animal phobia compared to the others.

The full code:

import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
import time
from pprint import pprint

from global_vars import *
from reference import *


def sort_health(row):
    '''
    Recoding values for perceived health
    :param row: row of a dataset
    :return: code value (int)
    '''
    # Good = Excellent (1), Very good (2), Good (3); Fair and Poor count as not good
    good = [1, 2, 3]
    if row[HEALTH_MAP[CODE]] in good:
        return 1
    else:
        return 0


def assign_region(row):
    '''
    Recoding values for origins to group them by region
    :param row: row of a dataset
    :return: value (region name, str)
    '''
    return REGIONS[row[ORIGIN_MAP[CODE]]]


def main():
    '''
    This function contains the main flow of the process of operating the data
    :return: None
    '''
    # Load data
    data = pd.read_csv(DATA_SOURCE, low_memory=False)

    # Convert necessary values to numbers
    data[ANIMALS_MAP[CODE]] = pd.to_numeric(data[ANIMALS_MAP[CODE]], errors='coerce')
    data[ORIGIN_MAP[CODE]] = pd.to_numeric(data[ORIGIN_MAP[CODE]], errors='coerce')
    data[HEALTH_MAP[CODE]] = pd.to_numeric(data[HEALTH_MAP[CODE]], errors='coerce')

    # Convert Unknown and Other to NaN
    data[ORIGIN_MAP[CODE]] = data[ORIGIN_MAP[CODE]].replace([98, 99], np.nan)
    data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].replace(9, np.nan)

    # Create mixed vs. pure variable
    # Create new columns to store recoded values for different kinds of specific phobias
    for phobia in ALL_SPECIFIC_PHOBIAS:
        data[phobia[CODE] + '_NEW'] = data[phobia[CODE]].replace([2, 9], 0)
    sp_new_list = [entry[CODE] + '_NEW' for entry in ALL_SPECIFIC_PHOBIAS]  # creating a list of names for the new columns

    # Sum up all values for phobias in new columns and store the result in a new column 'APPUREMIXED'
    data[APPUREMIXED] = data.loc[:, sp_new_list].sum(axis=1)
    condition_for_replace = data[APPUREMIXED] > 1
    data.loc[condition_for_replace, APPUREMIXED] = 0  # replace values > 1 with 0
    appuremixed_freq = data[data[ANIMALS_MAP[CODE]] == 1][APPUREMIXED].value_counts(sort=False, dropna=False)
    appuremixed_percent = data[data[ANIMALS_MAP[CODE]] == 1][APPUREMIXED].value_counts(sort=False, dropna=False, normalize=True)
    print('\nFrequencies, percentages for pure and mixed animal phobia')
    print(pd.concat(dict(Frequencies=appuremixed_freq.rename({1: 'Pure', 0: 'Mixed'}),
                         Percentages=appuremixed_percent.rename({1: 'Pure', 0: 'Mixed'})), axis=1))

    ### VISUALIZATIONS
    ## UNIVARIATE
    # Univariate 1: Plotting APPUREMIXED variable
    # Read APPUREMIXED data as categorical
    condition_ap = data[ANIMALS_MAP[CODE]] == 1
    subset_ap = data[condition_ap].copy()  # Make a subset of those with animal phobia
    subset_ap[APPUREMIXED] = subset_ap[APPUREMIXED].astype('category')
    subset_ap[APPUREMIXED] = subset_ap[APPUREMIXED].cat.rename_categories(['Mixed', 'Pure'])
    # seaborn.countplot(x=APPUREMIXED, data=subset_ap)
    # plt.xlabel('Types of Animal Phobia')
    # plt.title('Distribution of Pure and Mixed Animal Phobia')
    # plt.show()

    # Univariate 1.1: Plotting original animal phobia variable
    # Read animal phobia data as categorical
    data[ANIMALS_MAP[CODE]] = data[ANIMALS_MAP[CODE]].replace(9, np.nan)
    data[ANIMALS_MAP[CODE]] = data[ANIMALS_MAP[CODE]].astype('category')
    data[ANIMALS_MAP[CODE]] = data[ANIMALS_MAP[CODE]].cat.rename_categories(ANIMALS_MAP[VALUES])
    # seaborn.countplot(x=ANIMALS_MAP[CODE], data=data)
    # plt.xlabel('Ever had fear/avoidance of insects, snakes, birds, other animals')
    # plt.title('Distribution of Animal Phobia Cases')
    # plt.show()

    # Univariate 2: Plotting ORIGINS variable
    # Create subset for the 20 origins that have more than 400 occurrences (calculated in Assignment 3)
    condition_origin = data[ORIGIN_MAP[CODE]].isin(REGIONS)
    subset_origin = data[condition_origin].copy()
    # Add a new column with regions based on origin values
    # (this line was missing from the posted script; the plots below need the REGION column)
    subset_origin['REGION'] = subset_origin.apply(lambda row: assign_region(row), axis=1)
    # seaborn.countplot(x='REGION', data=subset_origin)
    # plt.xlabel('Regions of Origin')
    # plt.xticks(rotation=20)
    # plt.title('Distribution by Regions of Origin')
    # plt.tight_layout()
    # plt.show()

    # Univariate 3: Plotting PERCEIVED HEALTH variable
    data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].astype('category')
    data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].cat.rename_categories(HEALTH_VALUES)
    # seaborn.countplot(x=HEALTH_MAP[CODE], data=data)
    # plt.xlabel('Self-Perceived Current Health')
    # plt.title('Distribution by Health Perception')
    # plt.show()

    ## BIVARIATE
    # Bivariate 1 - Health
    # If a person with animal phobia perceives their health as good, is it reasonable to expect that their AP is pure?
    # group health categories into two
    subset_ap['HEALTHBINARY'] = subset_ap.apply(lambda row: sort_health(row), axis=1)
    # seaborn.catplot(x=APPUREMIXED, y='HEALTHBINARY', kind='bar', ci=None, data=subset_ap)
    # plt.xlabel('Type of Animal Phobia')
    # plt.ylabel('Proportion of Good Perceived Health')
    # plt.title('Perceived Health -> Type of AP')
    # plt.xticks(rotation=20)
    # plt.tight_layout()
    # plt.show()

    # Bivariate 2 - Region vs. Health Perception
    # Create a column with regions (as 'bins')
    # Is there any association between region and health perception? SPOILER: NO
    # (computed from subset_origin itself, not from subset_ap as in the posted script)
    subset_origin['HEALTHBINARY'] = subset_origin.apply(lambda row: sort_health(row), axis=1)
    # seaborn.catplot(x='REGION', y='HEALTHBINARY', kind='bar', ci=None, data=subset_origin)
    # plt.xlabel('Region of Origin')
    # plt.ylabel('Proportion of Good Perceived Health')
    # plt.title('Region of Origin -> Perceived Health')
    # plt.xticks(rotation=20)
    # plt.tight_layout()
    # plt.show()

    # Bivariate 3: region -> AP (general)
    subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].replace(9, np.nan)
    subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].replace(2, 0)
    subset_origin[ANIMALS_MAP[CODE]] = subset_origin[ANIMALS_MAP[CODE]].astype('category')
    print(subset_origin[ANIMALS_MAP[CODE]].describe())
    subset_origin[ANIMALS_MAP[CODE]] = pd.to_numeric(subset_origin[ANIMALS_MAP[CODE]], errors='coerce')
    seaborn.catplot(x='REGION', y=ANIMALS_MAP[CODE], kind='bar', ci=None, data=subset_origin)
    plt.xlabel('Regions of Origin')
    plt.ylabel('Proportion of Animal Phobia')
    plt.title('Animal Phobia vs. Region of Origin')
    plt.xticks(rotation=20)
    plt.show()

    # Bivariate 4: region -> pure AP
    condition_ap_for_origin = subset_origin[ANIMALS_MAP[CODE]] == 1
    subset_origin_ap = subset_origin[condition_ap_for_origin].copy()
    subset_origin_ap[APPUREMIXED] = pd.to_numeric(subset_origin_ap[APPUREMIXED], errors='coerce')
    # seaborn.catplot(x='REGION', y=APPUREMIXED, kind='bar', ci=None, data=subset_origin_ap)
    # plt.xlabel('Regions of Origin')
    # plt.ylabel('Proportion of Pure Animal Phobia')
    # plt.title('Pure Animal Phobia vs. Region of Origin')
    # plt.xticks(rotation=20)
    # plt.show()


def distribute_origins_by_regions():
    '''
    Using a hand-made dictionary REGIONS_ORIGINS ({origin_name: region_name}),
    and ORIGINS_VALUES mapper, create a new dictionary {origin_category_code: region_name}
    '''
    result = dict()
    for entry in REGIONS_ORIGINS:
        for code in ORIGINS_VALUES:
            if entry == ORIGINS_VALUES[code]:
                region = REGIONS_ORIGINS[entry]
                result[code] = region
    print(len(result))
    pprint(result)
    unique_regions = set(list(result.values()))
    print('Num unique regions: {}'.format(len(unique_regions)))
    for reg in unique_regions:
        print(reg)
    return result


if __name__ == '__main__':
    start = time.time()
    main()
    # distribute_origins_by_regions()
    stop = time.time()
    print('Running time: {}'.format(stop - start))



Assignment 3 DATA MANAGEMENT AND VISUALIZATION

I also found the instructions regarding the blog post a bit vague. In fact, the presentation requirements were just like last time:

The script;

The output (“that displays at least 3 of your data managed variables as frequency distributions”);

Some comments (“describing these frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.”).

And below is what I have done about my variables so far.

Like previously, I stick to my three basic variables from NESARC dataset, which are:

The experience of animal phobia;

The origin (or descent);

Perceived health.

But this time I decided to use some other variables for comparison.

Animal phobia

In my literature review I mentioned a study that made a distinction between pure and mixed (that is, combined with some other specific phobias) animal phobia. So I thought it might be instructive to also look at these two types separately. This required a number of tricks, which are supposedly called data management. These steps were:

Take the values from all specific phobia variables (there are 11 of them including animal phobia) and store them in new columns having replaced all ‘No-s’ and ‘Unknowns’ with 0 (so that they only have 1 if there was this phobia experience and 0 if there was not or it is unknown);

Create a new column and fill it with the result of summing up all rows within those recoded special phobia columns;

Take a subset of my dataset, which only includes respondents with animal phobia experience (that is all values in the correspondent column == 1);

Recode the new column with summed results so that it only keeps 1 values and those greater than 1 (indicating there are other phobias apart from animals) are 0;

Use this recoded column to distinguish pure and mixed cases of animal phobia.

Here is the code snippet:

# Create new columns to store recoded values for different kinds of specific phobia
for phobia in ALL_SPECIFIC_PHOBIAS:
    data[phobia[CODE] + '_NEW'] = data[phobia[CODE]].replace([2, 9], 0)
sp_new_list = [entry[CODE] + '_NEW' for entry in ALL_SPECIFIC_PHOBIAS]  # names of the new columns

# Sum up all values for phobias in new columns and store the result in a new column 'APPUREMIXED'
data[APPUREMIXED] = data.loc[:, sp_new_list].sum(axis=1)
condition_for_replace = data[APPUREMIXED] > 1
data.loc[condition_for_replace, APPUREMIXED] = 0  # replace values > 1 with 0
appuremixed_freq = data[data[ANIMALS_MAP[CODE]] == 1][APPUREMIXED].value_counts(sort=False, dropna=False)
appuremixed_percent = data[data[ANIMALS_MAP[CODE]] == 1][APPUREMIXED].value_counts(sort=False, dropna=False, normalize=True)
print('\nFrequencies, percentages for pure and mixed animal phobia')
print(pd.concat(dict(Frequencies=appuremixed_freq.rename({1: 'Pure', 0: 'Mixed'}),
                     Percentages=appuremixed_percent.rename({1: 'Pure', 0: 'Mixed'})), axis=1))

The resulting frequency distributions:

Frequencies, percentages for pure and mixed animal phobia
       Frequencies  Percentages
Mixed         6836     0.751787
Pure          2257     0.248213

I wonder if it could be done in a simpler way.
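One possibly simpler route is to count the "yes" answers directly instead of creating recoded copies of every column. This is only a sketch on made-up data: the column names are placeholders of mine, and I assume the coding 1 = yes, 2 = no, 9 = unknown described above.

```python
import pandas as pd

# Toy data: 1 = yes, 2 = no, 9 = unknown. Column names are made up.
toy = pd.DataFrame({
    "animals": [1, 1, 1, 2, 1],
    "heights": [2, 1, 9, 1, 2],
    "flying":  [2, 2, 2, 2, 1],
})
phobia_cols = ["animals", "heights", "flying"]

# Count the phobias answered "yes"; "no" and "unknown" both count as absent.
n_phobias = toy[phobia_cols].eq(1).sum(axis=1)

# Pure animal phobia: animal phobia is present and it is the only one.
toy["ap_pure"] = ((toy["animals"] == 1) & (n_phobias == 1)).astype(int)

# Shares of pure (1) and mixed (0) among those with animal phobia
has_ap = toy[toy["animals"] == 1]
print(has_ap["ap_pure"].value_counts(normalize=True))
```

Unlike the summed column in the snippet above, this flag is 0 for anyone without animal phobia, so it does not rely on subsetting first.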

Origin or descent

With origins it was even trickier. I wanted to see the percentages of the respondents with animal phobia for each kind of origin separately. But I failed to find a way to do it based on this dataset. I am sure there are better ways to handle this (maybe via grouping? but again, I still do not quite understand the mechanics). So I just created a new dataframe to store all necessary data to calculate these percentages. Here are the steps:

Get frequencies for origins;

Get frequencies for origins for the respondents with animal phobia;

Combine these two results into a new dataframe with origin names as indices and two frequencies variables as columns;

Calculate percentages and store them in a new column.

I also replaced unknown and other origins with NaN values and dropped them when creating the new dataframe.

Code snippet:

# Convert Unknown and Other to NaN
data[ORIGIN_MAP[CODE]] = data[ORIGIN_MAP[CODE]].replace([98, 99], np.nan)

# Get frequencies by origin
origins = data[ORIGIN_MAP[CODE]].value_counts(sort=False, dropna=True)

# Get origin frequencies based on the condition that respondents have animal phobia
condition = data[ANIMALS_MAP[CODE]] == 1
origins_with_ap = data[condition][ORIGIN_MAP[CODE]].value_counts(sort=False, dropna=True)

# Create a new dataframe out of these two frequency series
origins_df = origins.rename(ORIGIN_MAP[VALUES]).to_frame(name='ORIGCOUNTS')
origins_df['ORIGAPCOUNTS'] = origins_with_ap.rename(ORIGIN_MAP[VALUES])

# Create a new column in this new df to store percentages
origins_df['APPERCENT'] = origins_df['ORIGAPCOUNTS'] / origins_df['ORIGCOUNTS']

And here are the top and the bottom of the sorted output (the printed column with percentages):

Turkish                                                   0.315789
African American (Black, Negro, or Afro-American)         0.301406
Other Caribbean or West Indian (Spanish Speaking)         0.291667
Filipino                                                  0.269058
African (e.g., Egyptian, Nigerian, Algerian)              0.264706
Guamanian                                                 0.263158
Vietnamese                                                0.257426
Other Spanish                                             0.253623
Other Caribbean or West Indian (Non-Spanish Speaking)     0.252475
Canadian                                                  0.250000
...
Israeli                                                   0.148936
Russian                                                   0.138756
Indonesian                                                0.137931
Chinese                                                   0.133987
Other Eastern European (Romanian, Bulgarian, Albanian)    0.114035
Iranian                                                   0.106383
Iraqi                                                     0.100000
Samoan                                                    0.100000
Jordanian                                                 0.090909
Australian, New Zealander                                 0.078947
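As for doing this "via grouping": the same three columns can probably be obtained in one step with `groupby`. The sketch below uses made-up data and placeholder column names (`origin`, `animals`, `has_ap`); only the output column names follow the snippet above.

```python
import pandas as pd

# Toy data; origins and answers are illustrative only.
toy = pd.DataFrame({
    "origin": ["Irish", "Irish", "Irish", "Irish", "Dutch", "Dutch"],
    "animals": [1, 2, 2, 2, 1, 2],   # 1 = yes, 2 = no
})

# Flag respondents with animal phobia, then aggregate the flag per origin:
# size = all respondents, sum = those with animal phobia, mean = their share.
toy["has_ap"] = toy["animals"].eq(1)
summary = toy.groupby("origin")["has_ap"].agg(ORIGCOUNTS="size",
                                              ORIGAPCOUNTS="sum",
                                              APPERCENT="mean")
print(summary.sort_values("APPERCENT", ascending=False))
```

Rows with a missing origin are left out by `groupby` by default, which matches dropping the NaN origins in the original steps.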

As in the previous assignment, the overall percentage of animal phobia was 21%. On the origin level though these percentages demonstrate considerable variety. There are much lower values for some (like Australian, New Zealander, about 8%) and higher values for others (e.g. African American, 30%).

However, these results may have different weight, so to speak. For example, we see that the Turkish origin is at the very top with an animal phobia rate of about 32%. But there are only 19 respondents with this origin in the whole dataset, and 6 of them had this animal fear experience. One might doubt that a result based on such a tiny sample is trustworthy. On the other hand, there are African Americans, who are really numerous (7684).

That is why I decided to work only with a subset of those origins, which have 400 or more occurrences in the dataset. I chose 400 as a threshold, because it is kind of a magic number in the research area (based on sample size calculations and confidence intervals).
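One common way to read this "magic number" is through the margin of error of a proportion: with about 400 respondents, the 95% margin is roughly 5 percentage points in the worst case. The sketch below uses the standard normal-approximation formula; the function name is my own, and the sample sizes are the ones mentioned above.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the normal-approximation 95% interval for a proportion.

    p=0.5 is the worst case (widest interval); z=1.96 corresponds to 95%.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Turkish (19), the chosen threshold (400), African American (7684)
for n in (19, 400, 7684):
    print(n, round(margin_of_error(n), 3))
```

For 19 respondents the worst-case margin is above 20 percentage points, which is why a rate like 32% for such a small group is hard to interpret.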

Here is the code for the subset:

subset_orig_gte_400 = origins_df[origins_df['ORIGCOUNTS'] >= 400].copy()
print('Origins subset (gte 400 respondents)')
print(subset_orig_gte_400.sort_values(by=['APPERCENT'], ascending=False))

As a result I got a smaller subset (20 rows instead of 60) with the following animal phobia shares:

African American (Black, Negro, or Afro-American)       0.301406
American Indian (Native American)                       0.241026
South American (e.g., Brazilian, Chilean, Columbian)    0.232323
Central American (e.g., Nicaraguan, Guatemalan)         0.228361
Puerto Rican                                            0.220662
Dutch                                                   0.203980
French                                                  0.201336
Spanish (Spain) , Portugese                             0.198819
Italian                                                 0.198071
Irish                                                   0.187867
Scottish                                                0.187335
English                                                 0.185410
Cuban                                                   0.184444
Mexican-American                                        0.184300
Norwegian                                               0.184211
Mexican                                                 0.183088
German                                                  0.179607
Swedish                                                 0.178654
Polish                                                  0.176768
Russian                                                 0.138756

Here I recall a valuable input by a peer who commented on my Assignment 2 and, among all, mentioned that some Native Americans may have higher animal fear rate.

It is also worth noting that none of the origins showed any extraordinary animal phobia rate, like 50% or higher.

Perceived health

For the perceived health variable I also recoded all Unknowns into NaN, just in case, and then dropped them.

What is more impressive, I had a look at perceived health distribution for those with pure animal phobia. As in my previous assignment, I compared the distribution across the whole dataset with the distribution for those with animal phobia. There was some difference (in particular, the percentage of those whose perceived health is poor, was slightly higher, 7% vs. 5%).

So, this time I calculated perceived health distribution for pure animal phobia and compared it with previous calculations. Code:

data[HEALTH_MAP[CODE]] = data[HEALTH_MAP[CODE]].replace(9, np.nan)

# Boolean masks used below (assumed definitions; they were not shown in the original snippet)
has_ap = data[ANIMALS_MAP[CODE]] == 1
has_pure_ap = data[APPUREMIXED] == 1

# Get percentages for perceived health distribution (for all)
health_percent = data[HEALTH_MAP[CODE]].value_counts(sort=False, dropna=True, normalize=True)

# health perception vs. animal phobia
health_ap_percent = data[has_ap][HEALTH_MAP[CODE]].value_counts(sort=False, dropna=True, normalize=True)

# health perception vs. pure animal phobia
health_pure_ap_percent = data[(has_ap & has_pure_ap)][HEALTH_MAP[CODE]].value_counts(sort=False, dropna=True, normalize=True)

print('\nCompared distribution percentages for Perceived Health')
print(pd.concat(dict(Dataset=health_percent.rename(HEALTH_MAP[VALUES]),
                     AnimalPhobia=health_ap_percent.rename(HEALTH_MAP[VALUES]),
                     PureAnimalPhobia=health_pure_ap_percent.rename(HEALTH_MAP[VALUES])), axis=1))

Result:

Compared distribution percentages for Perceived Health
           AnimalPhobia   Dataset  PureAnimalPhobia
Excellent      0.228515  0.287576          0.299777
Fair           0.159652  0.121862          0.095323
Good           0.271926  0.248652          0.249889
Poor           0.072829  0.051813          0.044989
Very good      0.267078  0.290097          0.310022

As we can see, the share of those who perceive their health as poor is the smallest in the case of pure animal phobia. These respondents also most often perceive their health as excellent or very good. Actually this reminds me of the study “Pure animal phobia is more specific than other specific phobias” by Vladeta Ajdacic-Gross et al., which states that “Pure animal phobia showed no associations with any included risk factors and comorbid disorders, in contrast to numerous associations found in the mixed subtype and in other specific phobias”.

Wrap up

The attempt to distinguish pure and mixed animal phobia showed a proportion of 25% (pure) vs. 75% (mixed). While processing the data on other specific phobias I faced a dilemma on how to treat missing values (Unknown). I saw two ways:

Code all Unknowns as NaN to make sure I only count the results for those cases that are definitely true. This approach would imply that pure animal phobia is the one, about which we are absolutely sure that it is not combined with any other specific phobias.

Code all No-s and Unknowns as 0 and treat them equally. This would imply that pure animal phobia is the one, about which we have no evidence that it is combined with any other specific phobias.

I chose the latter. First, the approach with NaN would lead to a messier picture with lots of uncertainties to be taken into consideration. Second, and even more important, I cannot be sure that this dataset lists all possible specific phobias. By the way, I failed to find something like pyrophobia there. So, even if I try to clean out all Unknowns at the dataset level, there will still be huge unknowns outside its scope. That is why I decided not to make any distinction between No and Unknown when recoding these variables.
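To make the difference between the two options concrete, here is a minimal sketch on toy data. It is not the code I ran on the dataset: the column names SP_B and SP_C are made-up stand-ins for other specific phobia variables, and the coding 1 = Yes, 2 = No, 9 = Unknown is assumed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three respondents who all report animal phobia.
# 'SP_B' and 'SP_C' are made-up stand-ins for other specific phobia columns;
# the coding 1 = Yes, 2 = No, 9 = Unknown is an assumption.
others = pd.DataFrame({
    'SP_B': [2, 9, 1],
    'SP_C': [2, 2, 9],
})

# Option 1: Unknown -> NaN. A case is pure only if every other phobia is a definite No.
strict = others.replace(9, np.nan)
pure_strict = strict.eq(2).all(axis=1)

# Option 2: No and Unknown -> 0. A case is pure if no other phobia is a Yes.
lenient = others.eq(1).astype(int)
pure_lenient = lenient.sum(axis=1).eq(0)

print(pd.concat(dict(Strict=pure_strict, Lenient=pure_lenient), axis=1))
```

The second respondent (one Unknown, one No) counts as pure only under the second option, which is exactly the kind of case the choice affects.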

With the origins variable I ended up with a subset of those with 400 or more occurrences as the most representative. The top three origins by animal phobia percentage were African American (30%), American Indian (24%) and South American (23%). The lowest percentages were for Russian (14%), Polish (18%) and Swedish (18%). It looks like there may indeed be some geographic pattern, although Cuban, Mexican and Mexican-American origins (that is, southern ones) are somewhere in the middle. I might also want to look later at the distribution of pure animal phobia across the origins.

As to the perceived health variable, I compared the results for the whole dataset with the results for those with animal phobia and with pure animal phobia. I was surprised to see that those with pure animal phobia tend to estimate their health better than the others: Poor: 7% for the respondents with animal phobia, including mixed cases; 5% for the whole dataset; 4% for those with pure animal phobia. And excellent: 23% (animal phobia including mixed); 29% (whole dataset); 30% (pure animal phobia).

#database #visualization

Assignment 3 data management and visualization.

In fact, the presentation requirements were just like last time:

The script;

The output (“that displays at least 3 of your data managed variables as frequency distributions”);

Some comments (“describing these frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.”).

And below is what I have done about my variables so far.

As before, I worked with my three basic variables from the NESARC dataset, which are:

The experience of animal phobia;

The origin;

Perceived health.

But this time I decided to use some other variables for comparison.

Animal phobia

In my literature review in Assignment 1, I mentioned a study that made a distinction between pure and mixed animal phobia. So I thought it might be instructive to also look at these two types separately. This required a number of tricks, which are supposedly called data management. These steps were:

Create a new column and fill it with the row-wise sum of the recoded specific phobia columns;

Take a subset of my dataset, which only includes respondents with animal phobia experience (that is, all values in the corresponding column == 1);

Recode the new column with the summed results so that values of 1 are kept and values greater than 1 (indicating other phobias besides animal phobia) become 0;

Use this recoded column to distinguish pure and mixed cases of animal phobia.
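The steps above can be sketched on toy data roughly as follows. This is only an illustration under assumptions, not my original script: SP_B and SP_C are made-up stand-ins for the other specific phobia columns, SP_SUM and PURE are hypothetical column names, and the coding 1 = Yes, 2 = No, 9 = Unknown is assumed.

```python
import pandas as pd

# Hypothetical toy data; 1 = Yes, 2 = No, 9 = Unknown is an assumed coding.
# 'S8Q1A1' is the animal phobia variable; 'SP_B' and 'SP_C' are made-up
# stand-ins for the other specific phobia columns.
toy = pd.DataFrame({
    'S8Q1A1': [1, 1, 1, 2, 1, 9],
    'SP_B':   [2, 1, 9, 1, 2, 2],
    'SP_C':   [2, 2, 2, 1, 1, 2],
})
phobia_cols = ['S8Q1A1', 'SP_B', 'SP_C']

# Recode every phobia column to 1 (Yes) / 0 (No or Unknown).
recoded = toy[phobia_cols].eq(1).astype(int)

# Step 1: sum the recoded columns row by row.
toy['SP_SUM'] = recoded.sum(axis=1)

# Step 2: keep only respondents with animal phobia.
ap_subset = toy[toy['S8Q1A1'] == 1].copy()

# Step 3: a sum of exactly 1 means animal phobia is the only phobia reported.
ap_subset['PURE'] = (ap_subset['SP_SUM'] == 1).astype(int)

# Step 4: frequencies and percentages for pure vs. mixed cases.
labels = {1: 'Pure', 0: 'Mixed'}
pure_freq = ap_subset['PURE'].value_counts().rename(labels)
pure_percent = ap_subset['PURE'].value_counts(normalize=True).rename(labels)
print(pd.concat(dict(Frequencies=pure_freq, Percentages=pure_percent), axis=1))
```

In this sketch the recoding to 0/1 happens before the summing, so Unknown never contributes to the sum.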

Here is the code:

The resulting frequency distributions:

Frequencies, percentages for pure and mixed animal phobia
       Frequencies  Percentages
Mixed         6836     0.751787
Pure          2257     0.248213

I wonder if it could be done in a simpler way.

Origin or descent

With origins it was even trickier. I wanted to see the percentages of the respondents with animal phobia for each kind of origin separately. But I failed to find a way to do it based on this dataset. I am sure there are better ways to handle this. So, I just created a new dataframe to store all necessary data to calculate these percentages. Here are the steps:

Get frequencies for origins;

Get frequencies for origins for the respondents with animal phobia;

Combine these two results into a new dataframe with origin names as indices and two frequencies variables as columns;

Calculate percentages and store them in a new column.

I also replaced the Unknown and Other origins with NaN values and dropped them when creating the new dataframe.
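A rough sketch of these steps on toy data might look like this. It is an illustration under assumptions rather than my actual snippet: the origin codes in the toy data, the codes 98 and 99 for Other and Unknown, and the column name APCOUNTS are all assumed (ORIGCOUNTS and APPERCENT are the names used further below).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the origin codes and the 98/99 codes for
# Other/Unknown are assumptions made for illustration only.
toy = pd.DataFrame({
    'S1Q1E':  [1, 1, 1, 1, 19, 19, 98, 99],
    'S8Q1A1': [1, 2, 2, 2, 1, 2, 1, 1],
})

# Replace Other/Unknown origins with NaN so they drop out of the counts.
origin = toy['S1Q1E'].replace([98, 99], np.nan)

# Steps 1 and 2: frequencies for all respondents and for those with animal phobia.
orig_counts = origin.value_counts()
ap_counts = origin[toy['S8Q1A1'] == 1].value_counts()

# Step 3: combine both into one dataframe indexed by origin.
origins_df = pd.concat(dict(ORIGCOUNTS=orig_counts, APCOUNTS=ap_counts), axis=1)
origins_df['APCOUNTS'] = origins_df['APCOUNTS'].fillna(0)

# Step 4: share of respondents with animal phobia within each origin.
origins_df['APPERCENT'] = origins_df['APCOUNTS'] / origins_df['ORIGCOUNTS']
print(origins_df.sort_values(by='APPERCENT', ascending=False))
```

Since value_counts skips NaN by default, the Other and Unknown rows never reach the combined dataframe.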

Code snippet:

And here is the top and the bottom of the sorted output:

As was shown in my previous assignment, the overall percentage of animal phobia was 21%. At the origin level, though, these percentages vary considerably. There are much lower values for some (like Australian, New Zealander, about 8%) and higher values for others (e.g. African American, 30%).

Here is the code:

subset_orig_gte_400 = origins_df[origins_df['ORIGCOUNTS'] >= 400].copy()
print('Origins subset (gte 400 respondents)')
print(subset_orig_gte_400.sort_values(by=['APPERCENT'], ascending=False))

As a result I got a smaller subset (20 rows instead of 60) with the following animal phobia shares:

Here I recall valuable input from a peer who commented on my Assignment 2 and, among other things, mentioned that some Native Americans may have a higher rate of animal fear.

It is also worth noting that none of the origins showed any extraordinary animal phobia rate, like 50% or higher.

Perceived health

For the perceived health variable I also recoded all Unknowns into NaN, just in case, and then dropped them.

#datamangement

Data management and visualization

For assignment 2 we have to write some code to process the chosen variables and present:

the script;

the output that displays three of your variables as frequency tables;

a few sentences describing your frequency distributions in terms of the values the variables take, how often they take them, the presence of missing data, etc.

My selected variables were:

the experience of animal phobia;

the origin of the respondent;

the respondent’s perceived health state.

Below are some parts of my script to provide an outline. To print out all my output for frequencies and percentages, I wrote the following function:

import pandas as pd
def label_print(series_freq, series_percent, series_map, rename=True, ind_sort=False):
    print(TITLE.format(series_map[CODE], series_map[MEANING]))
    if rename:
        # Use labels for the output
        print(pd.concat(dict(Frequencies=series_freq.rename(series_map[VALUES]),
                             Percentages=series_percent.rename(series_map[VALUES])), axis=1))
    if ind_sort:
        # Sort numeric labels for the output
        print(pd.concat(dict(Frequencies=series_freq.sort_index(),
                             Percentages=series_percent.sort_index()), axis=1))
    elif not rename and not ind_sort:
        # print(series)
        print(pd.concat(dict(Frequencies=series_freq, Percentages=series_percent), axis=1))

I use global variables to store all the necessary string values. I also stored all the necessary code meanings as dictionaries, which I import into my script and use to label the output.
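For illustration, such constants and one of the dictionaries could look roughly like this. The key names, the TITLE template and the numeric codes are guesses made for this sketch; only the variable code, its meaning and the Yes/No/Unknown labels are taken from the output shown in this post.

```python
# Hypothetical layout of the constants and one code-meaning dictionary.
# The key names and the TITLE template are assumptions for this sketch.
CODE = 'code'
MEANING = 'meaning'
VALUES = 'values'
TITLE = 'Results for {} - {}'

ANIMALS_MAP = {
    CODE: 'S8Q1A1',
    MEANING: 'EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS',
    VALUES: {1: 'Yes', 2: 'No', 9: 'Unknown'},  # numeric codes are assumed
}

# Build the heading line that precedes a frequency table.
title = TITLE.format(ANIMALS_MAP[CODE], ANIMALS_MAP[MEANING])
print(title)
```

Keeping the labels in one place means the same dictionary can be passed both to the printing function and to any later recoding.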

So, here is the piece of code to get the frequencies and percentages for my core variable regarding animal phobia:

animals_freq = data[ANIMALS_MAP[CODE]].value_counts(sort=False)
animals_percent = data[ANIMALS_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(animals_freq, animals_percent, ANIMALS_MAP)

And here is the output:

Results for S8Q1A1 - EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS
         Frequencies  Percentages
Yes             9093     0.211009
No             32585     0.756155
Unknown         1415     0.032836

From this we see that a considerable share of the respondents (21%) have had some uneasy experience with animals.

My next variable was origin or descent. The thing is that there are quite a number of distinct values (60 in total). By the way, I counted the unique values like this:

unique_origins = data[ORIGIN_MAP[CODE]].unique()
print('num distinct origins:', len(unique_origins))

Some are really numerous (like African American, 7684 occurrences, about 18%); others are very few (like Malaysian, 11 occurrences, 0.025%). So, here I will provide just the top by frequency (over 900 occurrences). The code was:

origins_freq = data[ORIGIN_MAP[CODE]].value_counts()
origins_percent = data[ORIGIN_MAP[CODE]].value_counts(normalize=True)
label_print(origins_freq, origins_percent, ORIGIN_MAP)

And this is the top of the output:

Results for S1Q1E - ORIGIN OR DESCENT
                                                   Frequencies  Percentages
African American (Black, Negro, or Afro-American)         7684     0.178312
German                                                    5345     0.124034
English                                                   4455     0.103381
Irish                                                     3066     0.071148
Mexican                                                   2578     0.059824
Unknown                                                   1855     0.043046
Mexican-American                                          1758     0.040795
Other                                                     1739     0.040355
Italian                                                   1555     0.036085
French                                                    1048     0.024319
Puerto Rican                                               997     0.023136
American Indian (Native American)                          975     0.022625
...

Last there is the health variable. Code snippet, nothing new:

health_freq = data[HEALTH_MAP[CODE]].value_counts(sort=False)
health_percent = data[HEALTH_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(health_freq, health_percent, HEALTH_MAP)

The output:

Results for S1Q16 - SELF-PERCEIVED CURRENT HEALTH
           Frequencies  Percentages
Very good        12424     0.288307
Excellent        12316     0.285800
Good             10649     0.247117
Fair              5219     0.121110
Poor              2219     0.051493
Unknown            266     0.006173

And I also played with subsetting. I made a subset based on three conditions:

The respondent should have experienced animal fear

The respondent should be of one of the top origins (excluding Other and Unknown)

The respondent should perceive their health as poor.

Here is the code snippet:

condition_ap = data[ANIMALS_MAP[CODE]] == 1
condition_health = data[HEALTH_MAP[CODE]] == 5
condition_origin = data[ORIGIN_MAP[CODE]].isin([1, 19, 15, 18, 27, 29, 36, 35, 39, 3])
raw_subset = data[(condition_ap & condition_health & condition_origin)]
subset = raw_subset.copy()
print('Subset: top origins + poor perceived health + have AP')
origins_ap_freq = subset[ORIGIN_MAP[CODE]].value_counts(sort=False)
origins_ap_percent = subset[ORIGIN_MAP[CODE]].value_counts(sort=False, normalize=True)
label_print(origins_ap_freq, origins_ap_percent, ORIGIN_MAP)

Here is the output:

Subset: top origins + poor perceived health + have AP
Results for S1Q1E - ORIGIN OR DESCENT
                                                   Frequencies  Percentages
African American (Black, Negro, or Afro-American)          217     0.430556
American Indian (Native American)                           27     0.053571
English                                                     67     0.132937
French                                                      10     0.019841
German                                                      54     0.107143
Irish                                                       35     0.069444
Italian                                                     17     0.033730
Mexican                                                     22     0.043651
Mexican-American                                            20     0.039683
Puerto Rican                                                35     0.069444

To wrap up: I now have a frequency table for each of my three variables. For animal fear/avoidance there is a fair share of those who have experienced this (21%). It may be instructive to have a look at some other specific phobias for comparison, but for now I can just note that this share is by no means small.

For origin or descent, we see that the leaders among the respondents’ origins (60 in total) are:

African American (Black, Negro, or Afro-American) (18%)

German (12%)

English (10%)

And the fewest are:

Jordanian and Malaysian (0.025% each)

Samoan (0.02%)

As to health, most of the respondents (82%) find their health state good, very good or excellent. And only 5% estimated it as poor. I wonder if this distribution is going to change for the subset of those with animal phobia. So I even had a look. Here is the code snippet:

raw_subset = data[(condition_ap)]
subset_health_ap = raw_subset.copy()
print('\nSubset: perceived health + have AP')
health_ap_freq = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False)
health_ap_percent = subset_health_ap[HEALTH_MAP[CODE]].value_counts(sort=True, dropna=False, normalize=True)
label_print(health_ap_freq, health_ap_percent, HEALTH_MAP)

And here is the output:

Subset: perceived health + have AP
Results for S1Q16 - SELF-PERCEIVED CURRENT HEALTH
           Frequencies  Percentages
Good              2468     0.271418
Very good         2424     0.266579
Excellent         2074     0.228088
Fair              1449     0.159353
Poor               661     0.072693
Unknown             17     0.001870

So in the subset of those with animal fears, the distribution is indeed a bit different. Still, the view that one's health is at least good is predominant (about 77%). But now the order of the top three has changed: Good is the most frequent of the three (it used to be the least frequent), and Excellent is the least frequent (it used to be in the middle). The share of those who think their health is poor has also increased to 7%. It is too early to jump to any conclusions though. I do not even know whether this difference is significant.

As to the subset (above) based on origin, poor perceived health and animal phobia – well, it was purely technical. There is little to be concluded or observed based on the frequency table. Some other approach will be necessary. For now, I am just glad I discovered that nice .isin method for subsetting.

NESARC STUDY

I chose the NESARC dataset to explore. The main reasons were: – the size of the dataset (over 40K rows, which makes it more interesting to operate on programmatically); – the high level of detail in its parameters, which provides great opportunities for asking questions.

I am going to focus on specific phobias (SP), particularly on animal phobia (AP). This kind of phobia appears to be rather widespread and, according to some studies (see below), has rather specific distinctions from other kinds of SP, such as fear of heights, water, dentists, etc. So, it might be a good reason to single out just one phobia in order to narrow down my analysis. The variable is S8Q1A1 (EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS). My original question is whether there is any particular association between this AP and the origin of a person (I would cautiously suggest that different origins probably mean different cultural backgrounds). The variable is S1Q1E (ORIGIN OR DESCENT). Next I would like to have a look at whether there is any association between having AP and the self-perception of health. The variable is S1Q16 (SELF-PERCEIVED CURRENT HEALTH).

So here are the basic questions: – Is there any association between animal phobia and the origin? – Is there any association between animal phobia and self-perceived current health description?

There is one more additional question (just in case I have more time for that). After taking a look at the national dimension of AP, it might be interesting to also consider possible cultural/perception changes over time and check out the percentage of people with AP across different age groups.

Based on the literature review (below), my hypothesis is: – AP shows some association with national / cultural background or context and may be associated with the perception of health condition.

LITERATURE REVIEW

AP in the course of other SPs or vs. other SPs, anxiety and various disorders/mental conditions.

[1] Vladeta Ajdacic-Gross, Stephanie Rodgers, Mario Müller, Michael P. Hengartner, Aleksandra Aleksandrowicz, Wolfram Kawohl, Karsten Heekeren, Wulf Rössler, Jules Angst, Enrique Castelao, Caroline Vandeleur, Martin Preisig Pure animal phobia is more specific than other specific phobias: epidemiological evidence from the Zurich Study, the ZInEP and the PsyCoLaus European Archives of Psychiatry and Clinical Neuroscience, September 2016, Volume 266, Issue 6, pp 567–577 https://link.springer.com/article/10.1007/s00406-016-0687-4 This study states that pure animal phobia is principally different from other kinds of SP: “Pure animal phobia and mixed animal/other specific phobias consistently displayed a low age at onset of first symptoms (8–12 years) and clear preponderance of females (OR > 3). Meanwhile, other specific phobias started up to 10 years later and displayed almost a balanced sex ratio. Pure animal phobia showed no associations with any included risk factors and comorbid disorders, in contrast to numerous associations found in the mixed subtype and in other specific phobias. Across the whole range of epidemiological parameters examined in three different samples, pure animal phobia seems to represent a different entity compared to other specific phobias. The etiopathogenetic mechanisms and risk factors associated with pure animal phobias appear less clear than ever”. Based on this, I should probably also take into account the distinction between ‘pure’ (not combined with other SPs) and ‘mixed’ (goes in combination with other SPs) animal phobia. So I may need to see the proportion of those who have had only AP symptoms (variable S8Q1A1, EVER HAD FEAR/AVOIDANCE OF INSECTS, SNAKES, BIRDS, OTHER ANIMALS) and those combining AP with other SP episodes.

[2] Kevin Hilbert, Ricard Evens, Nina Isabel Maslowski, Hans-Ulrich Wittchen, Ulrike Lueken Neurostructural correlates of two subtypes of specific phobia: A voxel-based morphometry study Psychiatry Research: Neuroimaging Volume 231, Issue 2, 28 February 2015, Pages 168-175 https://www.sciencedirect.com/science/article/pii/S0925492714003308 Abstract: “The animal and blood-injection-injury (BII) subtypes of specific phobia are both characterized by subjective fear but distinct autonomic reactions to threat. Previous functional neuroimaging studies have related these characteristic responses to shared and non-shared neural underpinnings. However, no comparative structural data are available. This study aims to fill this gap by comparing the two subtypes and also comparing them with a non-phobic control group“. This study shows more complicated dependencies in the comparative analysis of SPs. To be taken into consideration while comparing. Particularly variable S8Q1A8 (EVER HAD FEAR/AVOIDANCE OF SEEING BLOOD/GETTING AN INJECTION) may be of interest.

[3] K. J. Wardenaar, C. C. W. Lim, A. O. Al-Hamzawi, J. Alonso et al. The cross-national epidemiology of specific phobia in the World Mental Health Surveys Psychological Medicine, Volume 47, Issue 10 July 2017 , pp. 1744-1760 https://www.cambridge.org/core/journals/psychological-medicine/article/crossnational-epidemiology-of-specific-phobia-in-the-world-mental-health-surveys/A0EDD4B22E19CDB63269D7A34F2C21AA Results: “The cross-national lifetime and 12-month prevalence rates of specific phobia were, respectively, 7.4% and 5.5%, being higher in females (9.8 and 7.7%) than in males (4.9% and 3.3%) and higher in high- and higher-middle-income countries than in low-/lower-middle-income countries. The median age of onset was young (8 years). Of the 12-month patients, 18.7% reported severe role impairment (13.3–21.9% across income groups) and 23.1% reported any treatment (9.6–30.1% across income groups). Lifetime co-morbidity was observed in 60.5% of those with lifetime specific phobia, with the onset of specific phobia preceding the other disorder in most cases (72.6%). Interestingly, rates of impairment, treatment use and co-morbidity increased with the number of fear subtypes“. This study indicates some association with age and sex. It also states associations with other disorders. This means that variables, such as sex and probably age as well should be taken into consideration. Luckily, the dataset provides SEX and AGE parameters.

AP in the context of culture / nationality

[4] Cultural Clinical Psychology Study Group, W.A. Arrindell, Martin Eisemann et al. Phobic anxiety in 11 nations: Part I: Dimensional constancy of the five-factor model Behaviour Research and Therapy, Volume 41, Issue 4, April 2003, Pages 461-479 https://www.sciencedirect.com/science/article/abs/pii/S0005796702000475 (and Part 2 here https://www.sciencedirect.com/science/article/pii/S0191886903004057) Abstract: “The Fear Survey Schedule-III (FSS-III) was administered to a total of 5491 students in Australia, East Germany, Great Britain, Greece, Guatemala, Hungary, Italy, Japan, Spain, Sweden, and Venezuela, and submitted to the multiple group method of confirmatory analysis (MGM) in order to determine the cross-national dimensional constancy of the five-factor model of self-assessed fears originally established in Dutch, British, and Canadian samples. The model comprises fears of bodily injury–illness–death, agoraphobic fears, social fears, fears of sexual and aggressive scenes, and harmless animals fears. Close correspondence between the factors was demonstrated across national samples. In each country, the corresponding scales were internally consistent, were intercorrelated at magnitudes comparable to those yielded in the original samples, and yielded (in 93% of the total number of 55 comparisons) sex differences in line with the usual finding (higher scores for females). In each country, the relatively largest sex differences were obtained on harmless animals fears. The organization of self-assessed fears is sufficiently similar across nations to warrant the use of the same weight matrix (scoring key) for the FSS-III in the different countries and to make cross-national comparisons feasible. 
This opens the way to further studies that attempt to predict (on an a priori basis) cross-national variations in fear levels with dimensions of national cultures.” And quoting the abstract for the other part: “Hofstede’s dimensions of national cultures termed Masculinity–Femininity (MAS) and Uncertainty Avoidance (UAI) (Hofstede, 2001) are proposed to be of relevance for understanding national-level differences in self-assessed fears. The potential predictive role of national MAS was based on the classical work of Fodor (Fodor, 1974). Following Fodor, it was predicted that masculine (or tough) societies in which clearer differentiations are made between gender roles (high MAS) would report higher national levels of fears than feminine (or soft/modest) societies in which such differentiations are made to a clearly lesser extent (low MAS). In addition, it was anticipated that nervous-stressful-emotionally-expressive nations (high UAI) would report higher national levels of fears than calm-happy and low-emotional countries (low UAI), and that countries high on both MAS and UAI would report the highest national levels of fears“.

So, to summarize:

National / cultural differences show up when it comes to animal fears (particularly harmless)

Such fears are more common in ‘masculine’ cultures with more rigid gender roles, and also more typical of ‘nervous/emotionally expressive’ countries.

So there is some cultural association with such animal fears. And here is where I am going to rely on S1Q1E (ORIGIN OR DESCENT) parameter.

[5] Eva Landová, Natavan Bakhshaliyeva et al. Association Between Fear and Beauty Evaluation of Snakes: Cross-Cultural Findings Front. Psychol., 16 March 2018 https://www.frontiersin.org/articles/10.3389/fpsyg.2018.00333/full The study states that the fear of snakes has evolutionary reasons and is particularly connected to the geographical and natural conditions in which a country’s culture was formed. Well, just another case to show that researchers do establish some cultural association with fears (and ultimately phobias).

SPs (including AP) and physical conditions

[6] Cornelia Witthauer, Vladeta Ajdacic-Gross, et al. Associations of specific phobia and its subtypes with physical diseases: an adult community study BMC Psychiatry, 2016 https://bmcpsychiatry.biomedcentral.com/articles/10.1186/s12888-016-0863-0 Results: “Specific phobia was associated with cardiac diseases, gastrointestinal diseases, respiratory diseases, arthritic conditions, migraine, and thyroid diseases (odds ratios between 1.49 and 2.53). Among the subtypes, different patterns of associations with physical diseases were established“.

[7] Ella L.Oar, Lara J.Farrell et al. Blood-Injection-Injury Phobia and Dog Phobia in Youth: Psychological Characteristics and Associated Features in a Clinical Sample Behavior Therapy, Volume 47, Issue 3, May 2016, Pages 312-324 https://www.sciencedirect.com/science/article/abs/pii/S0005789416000058 Abstract: “Blood-Injection-Injury (BII) phobia is a particularly debilitating condition that has been largely ignored in the child literature. The present study examined the clinical phenomenology of BII phobia in 27 youths, relative to 25 youths with dog phobia—one of the most common and well-studied phobia subtypes in youth. Children were compared on measures of phobia severity, functional impairment, comorbidity, threat appraisals (danger expectancies and coping), focus of fear, and physiological responding, as well as vulnerability factors including disgust sensitivity and family history. Children and adolescents with BII phobia had greater diagnostic severity. In addition, they were more likely to have a comorbid diagnosis of a physical health condition, to report more exaggerated danger expectancies, and to report fears that focused more on physical symptoms (e.g., faintness and nausea) in comparison to youth with dog phobia. The present study advances knowledge relating to this poorly understood condition in youth“. Here I can note that Blood-Injection-Injury phobia is often mentioned (and explored) in combination with animal phobia (like here and in [2] for instance, but I have come across other cases as well).

To summarize:

SP (AP among them) may be an indicator of some physical conditions.

Which makes me think there might be some reflection in self-perception.

Unfortunately, I failed to find any studies of association between SP and hypochondria, which would be more appropriate for my intention to check exactly subjective perception of health.

NESARC dataset

I chose the NESARC dataset to explore. The main reasons were: – the size of the dataset (over 40K rows, which makes it more interesting to operate on programmatically); – the high level of detail in its parameters, which provides great opportunities for asking questions.