Discover Top Posts Tagged with #frequency distribution

Running First Program Assignment

1) My Program

2) Output

3) Summary

Life Expectancy:

22 countries have a missing life expectancy value, which is slightly more than 10% of all the countries in the dataset. The values that are not missing are all unique, i.e., each country has a unique life expectancy value.

HIV rate:

66 countries have a missing HIV rate value, which is slightly less than 31% of all the countries in the dataset. The most common HIV rate, for countries that have one, is 0.1. 28 countries have an HIV rate of 0.1, which is slightly more than 13% of countries in the dataset.

breast cancer per 100th:

40 countries have a missing breast cancer per 100th value, which is slightly less than 19% of all the countries in the dataset. The most common breast cancer per 100th value is 91.9, and only 2 countries have this value; every other country that is not missing a value has a unique breast cancer per 100th value.
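The figures in this summary (number of missing values, their share of all countries, and the most common value) can be computed directly rather than read off the printed tables. The sketch below is only an illustration: the series `hivrate` holds made-up values, not the gapminder data, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for one gapminder column; None marks a missing value.
hivrate = pd.Series([0.1, 0.1, 0.5, None, 2.0, None, 0.1, 0.5])

n_total = len(hivrate)
n_missing = int(hivrate.isna().sum())
pct_missing = 100 * n_missing / n_total

# value_counts() ignores NaN by default and sorts by descending count,
# so the first entry is the most common non-missing value.
counts = hivrate.value_counts()
most_common = counts.index[0]
most_common_n = int(counts.iloc[0])
pct_most_common = 100 * most_common_n / n_total

print(f"missing: {n_missing} of {n_total} ({pct_missing:.1f}%)")
print(f"most common: {most_common} in {most_common_n} rows ({pct_most_common:.1f}%)")
```

Dividing by the total number of rows, missing ones included, matches how the percentages in the summary are expressed ("of all the countries in the dataset").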

#data analytics #datascience #frequency distribution

Frequency Distribution table for chest pain and THALL attribute

First, the pandas and numpy libraries were imported. Next, the required dataset was loaded: a CSV file from Kaggle.com, read with the read_csv() function. The dataset contains 14 attributes: age, sex, chest pain type (4 values), resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar > 120 mg/dl, resting electrocardiographic results (values 0, 1, 2), maximum heart rate achieved, exercise-induced angina, old peak (ST depression induced by exercise relative to rest), the slope of the peak exercise ST segment, number of major vessels (0-3) colored by fluoroscopy, thal (0 = normal; 1 = fixed defect; 2 = reversible defect) and output (0 = less chance of heart attack; 1 = more chance of heart attack).

Using the len function we can check the number of observations (rows) and variables (columns). The dataset has 14 variables and 302 observations. Of the 14 variables, only 3 were used for the frequency distribution tables.

Using the dtype attribute we can check the data type of a variable.

To convert a variable to numeric format we use the to_numeric function from the pandas library, which was imported earlier.
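As a compact alternative to converting each column on its own line, the conversion can be written as a loop. This is a minimal sketch on a made-up two-column frame, not the heart dataset. The `errors="coerce"` argument is an addition that the original code does not use: it turns unparseable entries into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical frame whose columns were read as text (object dtype).
data = pd.DataFrame({"age": ["63", "37", "41"], "THALL": ["1", "2", "?"]})

for col in data.columns:
    # errors="coerce" converts anything that is not a number to NaN.
    data[col] = pd.to_numeric(data[col], errors="coerce")

print(data.dtypes)
```

Whether coercing is appropriate depends on the data. Without `errors="coerce"`, `pd.to_numeric` raises on the first bad value, which can be the safer default when no missing codes are expected.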

    # setting variables you will be working with to numeric
    data['age'] = pd.to_numeric(data['age'])
    data['sex'] = pd.to_numeric(data['sex'])
    data['chest pain'] = pd.to_numeric(data['chest pain'])
    data['resting blood pressure'] = pd.to_numeric(data['resting blood pressure'])
    data['cholestrol'] = pd.to_numeric(data['cholestrol'])
    data['fasting blood sugar'] = pd.to_numeric(data['fasting blood sugar'])
    data['resting ecg'] = pd.to_numeric(data['resting ecg'])
    data['max heart rate'] = pd.to_numeric(data['max heart rate'])
    data['excercise included'] = pd.to_numeric(data['excercise included'])
    data['old peak'] = pd.to_numeric(data['old peak'])
    data['slp'] = pd.to_numeric(data['slp'])
    data['caa'] = pd.to_numeric(data['caa'])
    data['THALL'] = pd.to_numeric(data['THALL'])
    data['output'] = pd.to_numeric(data['output'])

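Next, we calculated the counts and percentages of observations for each variable using the value_counts function. To get relative frequencies instead of counts, the normalize parameter is set to True; the function then returns proportions (fractions of 1), which correspond to the percentages quoted below.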

    # counts and percentages (i.e. frequency distributions) for each variable
    print("Counts for sex: ")
    c1 = data['sex'].value_counts(sort=False)
    print(c1)

    print("Percentages for sex: ")
    p1 = data['sex'].value_counts(sort=False, normalize=True)
    print(p1)

    print("Counts for chest pain: ")
    c2 = data['chest pain'].value_counts(sort=False)
    print(c2)

    print("Percentages for chest pain: ")
    p2 = data['chest pain'].value_counts(sort=False, normalize=True)
    print(p2)

    print("Counts for THALL: ")
    c3 = data['THALL'].value_counts(sort=False)
    print(c3)

    print("Percentages for THALL: ")
    p3 = data['THALL'].value_counts(sort=False, normalize=True)
    print(p3)
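The count and percentage pairs can also be collected into one table per variable. The helper below is a sketch, not part of the original program: the name `freq_table` and the sample codes are hypothetical, and it multiplies by 100 so the second column holds true percentages rather than proportions.

```python
import pandas as pd

def freq_table(series):
    """Return counts and percentages for one column, ordered by value."""
    counts = series.value_counts(dropna=False).sort_index()
    percent = (100 * counts / counts.sum()).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})

# Hypothetical chest pain codes, not the real data.
chest_pain = pd.Series([0, 0, 2, 1, 0, 3, 2, 0])
table = freq_table(chest_pain)
print(table)
```

Sorting by index lists the levels in order (0, 1, 2, 3), which makes two tables easier to compare than the arbitrary order that `sort=False` can produce.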

For the first question, "On a scale of 0-3 (from lowest to highest), what level of chest pain did you experience?", 47.35% of the total chose level 0, 16.55% chose level 1, 28.8% chose level 2 and 7.28% chose level 3.

For the second question, "On a scale of 0-3, what was the level of blood flow of the patient observed while doing exercise and resting?":

0 - This maps to null in the original dataset.

2 - This means that the blood flow was normal.

3 - This means that a reversible defect was found.

Of the total number, 0.66% were at level 0, 5.62% at level 1, 54.96% at level 2 and 38.74% at level 3. From this frequency distribution, we observe that the majority of individuals had normal blood flow while exercising and resting.

Next, we calculated the frequency distributions for individuals aged 45 or older. This can be achieved by subsetting the dataset as follows.

sub1=data[(data['age']>=45)]
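The comparison operator decides whether people who are exactly 45 fall into the subset. A minimal sketch with made-up ages (the frame and variable names here are hypothetical) shows the difference between `>=` and `>`:

```python
import pandas as pd

# Hypothetical ages chosen to include the boundary value 45.
data = pd.DataFrame({"age": [44, 45, 46, 60]})

at_least_45 = data[data["age"] >= 45].copy()  # includes age 45
over_45 = data[data["age"] > 45].copy()       # excludes age 45

print(len(at_least_45), len(over_45))
```

Calling `.copy()` on the filtered frame gives an independent DataFrame, so later column assignments on the subset do not raise a SettingWithCopyWarning.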

For the first question, "On a scale of 0-3 (from lowest to highest), what level of chest pain did you experience?", 50.40% of the subset chose level 0, 15.04% chose level 1, 27.23% chose level 2 and 7.31% chose level 3.

For the second question, "On a scale of 0-3, what was the level of blood flow of the patient observed while doing exercise and resting?":

0 - This maps to null in the original dataset.

2 - This means that the blood flow was normal.

3 - This means that a reversible defect was found.

Of the subset, 0.81% were at level 0, 5.69% at level 1, 51.62% at level 2 and 41.86% at level 3. From this frequency distribution, we observe that the majority of individuals had normal blood flow while exercising and resting.

We also observe that there is not much difference between the frequency distribution tables for the two cases.
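One way to make the comparison between the two cases explicit is to place both percentage columns side by side and subtract them. The sketch below uses made-up codes rather than the heart data, and the names `full`, `subset` and `comparison` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical codes for the full sample and a subset of it.
full = pd.Series([0, 0, 1, 2, 2, 3, 0, 2])
subset = full.iloc[:4]

comparison = pd.concat(
    {
        "all": full.value_counts(normalize=True),
        "subset": subset.value_counts(normalize=True),
    },
    axis=1,
).sort_index().fillna(0) * 100

# Positive values mean the level is more common in the subset.
comparison["difference"] = comparison["subset"] - comparison["all"]
print(comparison.round(2))
```

A level that never occurs in one of the groups would show up as NaN after the concatenation, so it is filled with 0 before the subtraction.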

THE WHOLE CODE:

    import pandas as pd
    import numpy as np

    # any additional libraries would be imported here

    column_names = ['age', 'sex', 'chest pain', 'resting blood pressure', 'cholestrol',
                    'fasting blood sugar', 'resting ecg', 'max heart rate',
                    'excercise included', 'old peak', 'slp', 'caa', 'THALL', 'output']
    data = pd.read_csv("heart.csv", header=None, names=column_names)
    data = data.iloc[1:, :]  # removes the first row of dataframe (In this case, )

    print(len(data))          # number of observations (rows)
    print(len(data.columns))  # number of variables (columns)

    # checking the format of your variables
    data['resting blood pressure'].dtype

    # setting variables you will be working with to numeric
    data['age'] = pd.to_numeric(data['age'])
    data['sex'] = pd.to_numeric(data['sex'])
    data['chest pain'] = pd.to_numeric(data['chest pain'])
    data['resting blood pressure'] = pd.to_numeric(data['resting blood pressure'])
    data['cholestrol'] = pd.to_numeric(data['cholestrol'])
    data['fasting blood sugar'] = pd.to_numeric(data['fasting blood sugar'])
    data['resting ecg'] = pd.to_numeric(data['resting ecg'])
    data['max heart rate'] = pd.to_numeric(data['max heart rate'])
    data['excercise included'] = pd.to_numeric(data['excercise included'])
    data['old peak'] = pd.to_numeric(data['old peak'])
    data['slp'] = pd.to_numeric(data['slp'])
    data['caa'] = pd.to_numeric(data['caa'])
    data['THALL'] = pd.to_numeric(data['THALL'])
    data['output'] = pd.to_numeric(data['output'])

    # counts and percentages (i.e. frequency distributions) for each variable
    print("Counts for sex: ")
    c1 = data['sex'].value_counts(sort=False)
    print(c1)

    print("Percentages for sex: ")
    p1 = data['sex'].value_counts(sort=False, normalize=True)
    print(p1)

    print("Counts for chest pain: ")
    c2 = data['chest pain'].value_counts(sort=False)
    print(c2)

    print("Percentages for chest pain: ")
    p2 = data['chest pain'].value_counts(sort=False, normalize=True)
    print(p2)

    print("Counts for THALL: ")
    c3 = data['THALL'].value_counts(sort=False)
    print(c3)

    print("Percentages for THALL: ")
    p3 = data['THALL'].value_counts(sort=False, normalize=True)
    print(p3)

    # subset data to adults aged 45 or older
    sub1 = data[(data['age'] >= 45)]

    # make a copy of my new subsetted data
    sub2 = sub1.copy()

    # counts and percentages (i.e. frequency distributions) for each variable
    print("Counts for sex: ")
    c1 = sub2['sex'].value_counts(sort=False)
    print(c1)

    print("Percentages for sex: ")
    p1 = sub2['sex'].value_counts(sort=False, normalize=True)
    print(p1)

    print("Counts for chest pain: ")
    c2 = sub2['chest pain'].value_counts(sort=False)
    print(c2)

    print("Percentages for chest pain: ")
    p2 = sub2['chest pain'].value_counts(sort=False, normalize=True)
    print(p2)

    print("Counts for THALL: ")
    c3 = sub2['THALL'].value_counts(sort=False)
    print(c3)

    print("Percentages for THALL: ")
    p3 = sub2['THALL'].value_counts(sort=False, normalize=True)
    print(p3)

#frequency distribution #python #data analysis

Running your first program assignment.

Reading the gapminder data set and choosing the subset that consists of the selected features.

Load the data set into a DataFrame object.

Check the data set by displaying the first five rows.

Check the data type of each feature (column).

Check the number of features and observations in the gapminder data set.

Create a subset of the data that includes the variables selected for the hypothesis.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    Created on Fri Feb 7 02:03:34 2020

    @author: mo-akel
    """

    # load the libraries that will be used for analysis.
    import pandas as pd
    import numpy as np

    # read the "gapminder.csv" file and load it into a "pandas.DataFrame()" object.
    gapminder_df = pd.read_csv("gapminder.csv", low_memory=False)

    # print the DataFrame dimensions "(n_rows, n_columns)".
    print("gapminder dataframe dimension:", gapminder_df.shape)
    print("While ", gapminder_df.shape[0], "number of rows and", gapminder_df.shape[1], "number of columns.")

    gapminder dataframe dimension: (213, 16)
    While 213 number of rows and 16 number of columns.

    # display the data type of each column.
    print("columns or features data types")
    print("************------************")
    print(gapminder_df.dtypes)

    columns or features data types
    ************------************
    country                 object
    incomeperperson         object
    alcconsumption          object
    armedforcesrate         object
    breastcancerper100th    object
    co2emissions            object
    femaleemployrate        object
    hivrate                 object
    internetuserate         object
    lifeexpectancy          object
    oilperperson            object
    polityscore             object
    relectricperperson      object
    suicideper100th         object
    employrate              object
    urbanrate               object
    dtype: object

    # display the first five rows in the gapminder dataset.
    gapminder_df.head(5)


country incomeperperson alcconsumption armedforcesrate breastcancerper100th co2emissions femaleemployrate hivrate internetuserate lifeexpectancy oilperperson polityscore relectricperperson suicideper100th employrate urbanrate 0 Afghanistan .03 .5696534 26.8 75944000 25.6000003814697 3.65412162280064 48.673 0 6.68438529968262 55.7000007629394 24.04 1 Albania 1914.99655094922 7.29 1.0247361 57.4 223747333.333333 42.0999984741211 44.9899469578783 76.918 9 636.341383366604 7.69932985305786 51.4000015258789 46.72 2 Algeria 2231.99333515006 .69 2.306817 23.5 2932108666.66667 31.7000007629394 .1 12.5000733055148 73.131 .42009452521537 2 590.509814347428 4.8487696647644 50.5 65.22 3 Andorra 21943.3398976022 10.17 81 5.36217880249023 88.92 4 Angola 1381.00426770244 5.57 1.4613288 23.1 248358000 69.4000015258789 2 9.99995388324075 51.093 -2 172.999227388199 14.5546770095825 75.6999969482422 56.7

Create a subset of the gapminder data set that includes all the variables selected for the hypothesis.

    # for my hypothesis I have chosen these variables:
    # incomeperperson, hivrate and urbanrate, and their association with lifeexpectancy.
    sub_gapminder = gapminder_df[['incomeperperson', 'hivrate', 'urbanrate', 'lifeexpectancy']]

    # check the hypothesis data set by printing the first five rows.
    sub_gapminder.head(5)


        incomeperperson   hivrate  urbanrate  lifeexpectancy
    0                                  24.04          48.673
    1  1914.99655094922                46.72          76.918
    2  2231.99333515006        .1      65.22          73.131
    3  21943.3398976022                88.92
    4  1381.00426770244         2       56.7          51.093

    # check each column type.
    print("data type of each column or feature in the subset of selected variables.")
    print(sub_gapminder.dtypes)

    data type of each column or feature in the subset of selected variables.
    incomeperperson    object
    hivrate            object
    urbanrate          object
    lifeexpectancy     object
    dtype: object

Deprecation warning

Running this code:

    sub_gapminder["incomeperperson"] = gapminder_df["incomeperperson"].convert_objects(convert_numeric=True)
    sub_gapminder["hivrate"] = gapminder_df["hivrate"].convert_objects(convert_numeric=True)
    sub_gapminder["urbanrate"] = gapminder_df["urbanrate"].convert_objects(convert_numeric=True)
    sub_gapminder["lifeexpectancy"] = sub_gapminder["lifeexpectancy"].convert_objects(convert_numeric=True)
    print(sub_gapminder.dtypes)

raises a warning that convert_objects is deprecated: to re-infer data dtypes for object columns, use Series.infer_objects(). The conversion should therefore be done another way, for example with pd.to_numeric, which the warning itself recommends. The run below still uses convert_objects and shows the warnings in full.

    # convert the columns of object type to numeric type.
    sub_gapminder["incomeperperson"] = gapminder_df["incomeperperson"].convert_objects(convert_numeric=True)
    sub_gapminder["hivrate"] = gapminder_df["hivrate"].convert_objects(convert_numeric=True)
    sub_gapminder["urbanrate"] = gapminder_df["urbanrate"].convert_objects(convert_numeric=True)
    sub_gapminder["lifeexpectancy"] = sub_gapminder["lifeexpectancy"].convert_objects(convert_numeric=True)

    # check each column type.
    print(sub_gapminder.dtypes)

    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:2: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects()
    For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.

    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:3: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects()
    For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
      This is separate from the ipykernel package so we can avoid doing imports until

    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      This is separate from the ipykernel package so we can avoid doing imports until

    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:4: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects()
    For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
      after removing the cwd from sys.path.

    incomeperperson    float64
    hivrate            float64
    urbanrate          float64
    lifeexpectancy     float64
    dtype: object

    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      after removing the cwd from sys.path.

    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:5: FutureWarning: convert_objects is deprecated. To re-infer data dtypes for object columns, use Series.infer_objects()
    For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
      """

    /usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
      """
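Both kinds of warning above can be avoided. The following is a minimal sketch, assuming the blank entries in the CSV arrive as whitespace strings; the small frame is a made-up stand-in for gapminder.csv and `sub` is a hypothetical name. It uses `pd.to_numeric`, which the FutureWarning itself recommends, and takes an explicit `.copy()` of the column selection so the assignment does not act on a slice.

```python
import pandas as pd

# Hypothetical stand-in for gapminder.csv: numbers stored as text, blanks for missing.
gapminder_df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "hivrate": ["0.1", " ", "2"],
    "lifeexpectancy": ["48.673", "76.918", " "],
})

# .copy() makes the subset an independent DataFrame, so assigning to it
# does not trigger SettingWithCopyWarning.
sub = gapminder_df[["hivrate", "lifeexpectancy"]].copy()

for col in sub.columns:
    # errors="coerce" converts blank or non-numeric strings to NaN.
    sub[col] = pd.to_numeric(sub[col], errors="coerce")

print(sub.dtypes)
```

The result has float64 columns with NaN where the source had blanks, the same outcome that `convert_objects(convert_numeric=True)` produced in the run above.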

Calculate the frequency distribution for each column (feature), missing values included:

income per person frequency and percentage.

HIV rate frequency and percentage.

urban rate frequency and percentage.

life expectancy frequency and percentage.

Displaying the frequency distribution table for each feature or variable.

Counts and percentages for incomeperperson

    print("counts for incomeperperson.")
    income_count = sub_gapminder["incomeperperson"].value_counts(sort=False, dropna=False)
    print(income_count)

    income_percent = sub_gapminder["incomeperperson"].value_counts(sort=False, dropna=False, normalize=True)
    print("percentages for incomeperperson.")
    print(income_percent)

counts for incomeperperson. NaN 23 8614.120219 1 39972.352768 1 279.180453 1 161.317137 1 11894.464075 1 1036.830725 1 9106.327234 1 11744.834167 1 2231.993335 1 1392.411829 1 2146.358593 1 10480.817203 1 595.874535 1 5348.597192 1 10749.419238 1 12729.454400 1 19630.540547 1 554.879840 1 9175.796015 1 2025.282665 1 1975.551906 1 268.331790 1 5634.003948 1 37662.751250 1 25249.986061 1 558.062877 1 1728.020976 1 2923.144355 1 6105.280743 1 .. 1784.071284 1 557.947513 1 103.775857 1 1144.102193 1 285.224449 1 2636.787800 1 17092.460004 1 31993.200694 1 22275.751661 1 1326.741757 1 4189.436587 1 5332.238591 1 1232.794137 1 338.266391 1 268.259450 1 495.734247 1 6334.105194 1 12505.212545 1 269.892881 1 6238.537506 1 1324.194906 1 6243.571318 1 27595.091347 1 5011.219456 1 1258.762596 1 377.421113 1 2344.896916 1 25306.187193 1 4180.765821 1 25575.352623 1 Name: incomeperperson, Length: 191, dtype: int64 percentages for incomeperperson. NaN 0.107981 8614.120219 0.004695 39972.352768 0.004695 279.180453 0.004695 161.317137 0.004695 11894.464075 0.004695 1036.830725 0.004695 9106.327234 0.004695 11744.834167 0.004695 2231.993335 0.004695 1392.411829 0.004695 2146.358593 0.004695 10480.817203 0.004695 595.874535 0.004695 5348.597192 0.004695 10749.419238 0.004695 12729.454400 0.004695 19630.540547 0.004695 554.879840 0.004695 9175.796015 0.004695 2025.282665 0.004695 1975.551906 0.004695 268.331790 0.004695 5634.003948 0.004695 37662.751250 0.004695 25249.986061 0.004695 558.062877 0.004695 1728.020976 0.004695 2923.144355 0.004695 6105.280743 0.004695 ... 
1784.071284 0.004695 557.947513 0.004695 103.775857 0.004695 1144.102193 0.004695 285.224449 0.004695 2636.787800 0.004695 17092.460004 0.004695 31993.200694 0.004695 22275.751661 0.004695 1326.741757 0.004695 4189.436587 0.004695 5332.238591 0.004695 1232.794137 0.004695 338.266391 0.004695 268.259450 0.004695 495.734247 0.004695 6334.105194 0.004695 12505.212545 0.004695 269.892881 0.004695 6238.537506 0.004695 1324.194906 0.004695 6243.571318 0.004695 27595.091347 0.004695 5011.219456 0.004695 1258.762596 0.004695 377.421113 0.004695 2344.896916 0.004695 25306.187193 0.004695 4180.765821 0.004695 25575.352623 0.004695 Name: incomeperperson, Length: 191, dtype: float64

HIV rate frequency and percentages.

    # HIV rate frequency and percentages.
    print("counts for hivrate..")
    hiv_count = sub_gapminder["hivrate"].value_counts(sort=False, dropna=False)
    print(hiv_count)

    print("percentages for hivrate.")
    hiv_percent = sub_gapminder["hivrate"].value_counts(sort=False, dropna=False, normalize=True)
    print(hiv_percent)

counts for hivrate.. NaN 66 2.00 2 0.50 5 2.50 2 5.00 1 1.50 2 1.30 2 11.00 1 1.00 4 11.50 1 6.50 1 13.50 1 3.60 1 17.80 1 3.20 1 14.30 1 1.40 1 0.10 28 2.30 1 3.30 1 1.90 1 0.70 3 25.90 1 5.60 1 0.20 15 0.40 9 0.06 16 0.80 5 0.30 10 3.10 1 1.20 4 5.30 1 24.80 1 3.40 3 1.70 1 23.60 1 6.30 1 0.45 1 0.60 3 13.10 1 4.70 1 2.90 1 1.60 1 0.90 4 5.20 1 1.10 2 1.80 1 Name: hivrate, dtype: int64 percentages for hivrate. NaN 0.309859 2.00 0.009390 0.50 0.023474 2.50 0.009390 5.00 0.004695 1.50 0.009390 1.30 0.009390 11.00 0.004695 1.00 0.018779 11.50 0.004695 6.50 0.004695 13.50 0.004695 3.60 0.004695 17.80 0.004695 3.20 0.004695 14.30 0.004695 1.40 0.004695 0.10 0.131455 2.30 0.004695 3.30 0.004695 1.90 0.004695 0.70 0.014085 25.90 0.004695 5.60 0.004695 0.20 0.070423 0.40 0.042254 0.06 0.075117 0.80 0.023474 0.30 0.046948 3.10 0.004695 1.20 0.018779 5.30 0.004695 24.80 0.004695 3.40 0.014085 1.70 0.004695 23.60 0.004695 6.30 0.004695 0.45 0.004695 0.60 0.014085 13.10 0.004695 4.70 0.004695 2.90 0.004695 1.60 0.004695 0.90 0.018779 5.20 0.004695 1.10 0.009390 1.80 0.004695 Name: hivrate, dtype: float64

urban rate frequency and percentage.

    # urban rate frequency and percentage.
    print("counts for urbanrate.")
    urban_count = sub_gapminder["urbanrate"].value_counts(sort=False, dropna=False)
    print(urban_count)

    print("percentages for urbanrate.")
    urban_percent = sub_gapminder["urbanrate"].value_counts(sort=False, dropna=False, normalize=True)
    print(urban_percent)

counts for urbanrate. 92.00 1 100.00 6 74.50 1 NaN 10 73.50 1 17.00 1 61.00 1 67.50 1 56.42 1 57.94 1 65.58 2 88.92 1 41.00 1 41.42 1 26.68 1 92.26 1 23.00 1 33.32 1 51.92 1 66.90 1 24.04 1 41.76 1 42.00 1 30.46 1 86.68 1 66.50 1 42.72 1 30.64 1 60.74 1 36.52 1 .. 63.86 1 59.58 1 86.96 1 43.44 1 60.70 1 77.20 1 72.84 1 43.84 1 64.78 1 24.76 1 36.28 1 66.48 1 50.02 1 17.24 1 32.32 1 46.84 1 77.36 1 78.42 1 48.78 1 91.66 1 63.26 1 60.56 1 28.08 1 67.16 1 21.60 1 56.02 1 57.18 1 73.92 1 25.46 1 28.38 1 Name: urbanrate, Length: 195, dtype: int64 percentages for urbanrate. 92.00 0.004695 100.00 0.028169 74.50 0.004695 NaN 0.046948 73.50 0.004695 17.00 0.004695 61.00 0.004695 67.50 0.004695 56.42 0.004695 57.94 0.004695 65.58 0.009390 88.92 0.004695 41.00 0.004695 41.42 0.004695 26.68 0.004695 92.26 0.004695 23.00 0.004695 33.32 0.004695 51.92 0.004695 66.90 0.004695 24.04 0.004695 41.76 0.004695 42.00 0.004695 30.46 0.004695 86.68 0.004695 66.50 0.004695 42.72 0.004695 30.64 0.004695 60.74 0.004695 36.52 0.004695 ... 63.86 0.004695 59.58 0.004695 86.96 0.004695 43.44 0.004695 60.70 0.004695 77.20 0.004695 72.84 0.004695 43.84 0.004695 64.78 0.004695 24.76 0.004695 36.28 0.004695 66.48 0.004695 50.02 0.004695 17.24 0.004695 32.32 0.004695 46.84 0.004695 77.36 0.004695 78.42 0.004695 48.78 0.004695 91.66 0.004695 63.26 0.004695 60.56 0.004695 28.08 0.004695 67.16 0.004695 21.60 0.004695 56.02 0.004695 57.18 0.004695 73.92 0.004695 25.46 0.004695 28.38 0.004695 Name: urbanrate, Length: 195, dtype: float64

life expectancy frequency and percentage.

    # life expectancy frequency and percentage.
    print("counts for lifeexpectancy.")
    life_count = sub_gapminder["lifeexpectancy"].value_counts(sort=False, dropna=False)
    print(life_count)

    print("percentages for lifeexpectancy.")
    life_percent = sub_gapminder["lifeexpectancy"].value_counts(sort=False, dropna=False, normalize=True)
    print(life_percent)

counts for lifeexpectancy. NaN 22 63.125 1 74.576 1 62.475 1 74.414 1 79.977 1 58.199 1 75.670 1 81.012 1 72.283 1 55.442 1 81.855 1 48.398 1 68.944 1 75.133 1 76.126 1 69.317 1 65.193 1 75.057 1 77.685 1 68.498 1 62.465 1 79.634 1 73.911 1 80.499 1 61.597 1 79.591 1 71.017 1 82.759 1 68.978 1 .. 76.954 1 73.703 1 79.839 1 48.718 1 71.172 1 73.456 1 48.397 1 81.439 1 75.246 1 55.377 1 74.788 1 74.402 1 82.338 1 79.499 1 81.539 1 54.210 1 67.017 1 61.452 1 73.373 1 73.127 1 69.245 1 68.795 1 79.341 1 76.918 1 57.937 1 73.126 1 64.666 1 75.956 1 57.379 1 50.239 1 Name: lifeexpectancy, Length: 190, dtype: int64 percentages for lifeexpectancy. NaN 0.103286 63.125 0.004695 74.576 0.004695 62.475 0.004695 74.414 0.004695 79.977 0.004695 58.199 0.004695 75.670 0.004695 81.012 0.004695 72.283 0.004695 55.442 0.004695 81.855 0.004695 48.398 0.004695 68.944 0.004695 75.133 0.004695 76.126 0.004695 69.317 0.004695 65.193 0.004695 75.057 0.004695 77.685 0.004695 68.498 0.004695 62.465 0.004695 79.634 0.004695 73.911 0.004695 80.499 0.004695 61.597 0.004695 79.591 0.004695 71.017 0.004695 82.759 0.004695 68.978 0.004695 ... 76.954 0.004695 73.703 0.004695 79.839 0.004695 48.718 0.004695 71.172 0.004695 73.456 0.004695 48.397 0.004695 81.439 0.004695 75.246 0.004695 55.377 0.004695 74.788 0.004695 74.402 0.004695 82.338 0.004695 79.499 0.004695 81.539 0.004695 54.210 0.004695 67.017 0.004695 61.452 0.004695 73.373 0.004695 73.127 0.004695 69.245 0.004695 68.795 0.004695 79.341 0.004695 76.918 0.004695 57.937 0.004695 73.126 0.004695 64.666 0.004695 75.956 0.004695 57.379 0.004695 50.239 0.004695 Name: lifeexpectancy, Length: 190, dtype: float64
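Because almost every life expectancy value is unique, the table above consists almost entirely of counts of 1. One optional way to get a more informative distribution, which the original post does not use, is to group the values into class intervals with `pandas.cut` before counting. The values and the ten-year bin edges below are made up for illustration.

```python
import pandas as pd

# Hypothetical life expectancy values; None marks a missing country.
life = pd.Series([48.7, 51.1, 63.1, 68.9, 73.1, 76.9, 81.0, None])

# Right-closed ten-year bins: (40, 50], (50, 60], ... (80, 90].
bins = [40, 50, 60, 70, 80, 90]
binned = pd.cut(life, bins=bins)

# sort=False keeps the intervals in order; dropna=False keeps the missing row.
counts = binned.value_counts(sort=False, dropna=False)
print(counts)
```

The choice of bin edges is a judgment call; wider or narrower intervals change how the same data reads.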

Create the frequency distribution by using the "groupby()" function:

income per person frequency and percentage.

HIV rate frequency and percentage.

urban rate frequency and percentage.

life expectancy frequency and percentage.

Displaying the frequency distribution table for each feature or variable.

Counts and percentages for incomeperperson

    # frequency distribution for incomeperperson.
    print("Frequency distribution by using \"groupby()\" function")
    print("****************************\n")
    print("\n counts for incomeperperson.")
    income_count2 = sub_gapminder.groupby("incomeperperson").size()
    print(income_count2)

    print("\n percentages for incomeperperson.")
    income_percent2 = sub_gapminder.groupby("incomeperperson").size() * 100 / len(sub_gapminder)
    print(income_percent2)

    Frequency distribution by using "groupby()" function
    ****************************

     counts for incomeperperson.
    incomeperperson
    103.775857       1
    115.305996       1
    131.796207       1
    155.033231       1
    161.317137       1
    180.083376       1
    184.141797       1
    220.891248       1
    239.518749       1
    242.677534       1
    268.259450       1
    268.331790       1
    269.892881       1
    275.884287       1
    276.200413       1
    279.180453       1
    285.224449       1
    320.771890       1
    336.368749       1
    338.266391       1
    354.599726       1
    358.979540       1
    369.572954       1
    371.424198       1
    372.728414       1
    377.039699       1
    377.421113       1
    389.763634       1
    411.501447       1
    432.226337       1
                    ..
    20751.893424     1
    21087.394125     1
    21943.339898     1
    22275.751661     1
    22878.466567     1
    24496.048264     1
    25249.986061     1
    25306.187193     1
    25575.352623     1
    26551.844238     1
    26692.984107     1
    27110.731591     1
    27595.091347     1
    28033.489283     1
    30532.277044     1
    31993.200694     1
    32292.482984     1
    32535.832512     1
    33923.313868     1
    33931.832079     1
    33945.314422     1
    35536.072471     1
    37491.179523     1
    37662.751250     1
    39309.478859     1
    39972.352768     1
    52301.587179     1
    62682.147006     1
    81647.100031     1
    105147.437697    1
    Length: 190, dtype: int64

     percentages for incomeperperson.
    incomeperperson
    103.775857       0.469484
    115.305996       0.469484
    131.796207       0.469484
    155.033231       0.469484
    161.317137       0.469484
    180.083376       0.469484
    184.141797       0.469484
    220.891248       0.469484
    239.518749       0.469484
    242.677534       0.469484
    268.259450       0.469484
    268.331790       0.469484
    269.892881       0.469484
    275.884287       0.469484
    276.200413       0.469484
    279.180453       0.469484
    285.224449       0.469484
    320.771890       0.469484
    336.368749       0.469484
    338.266391       0.469484
    354.599726       0.469484
    358.979540       0.469484
    369.572954       0.469484
    371.424198       0.469484
    372.728414       0.469484
    377.039699       0.469484
    377.421113       0.469484
    389.763634       0.469484
    411.501447       0.469484
    432.226337       0.469484
                       ...
    20751.893424     0.469484
    21087.394125     0.469484
    21943.339898     0.469484
    22275.751661     0.469484
    22878.466567     0.469484
    24496.048264     0.469484
    25249.986061     0.469484
    25306.187193     0.469484
    25575.352623     0.469484
    26551.844238     0.469484
    26692.984107     0.469484
    27110.731591     0.469484
    27595.091347     0.469484
    28033.489283     0.469484
    30532.277044     0.469484
    31993.200694     0.469484
    32292.482984     0.469484
    32535.832512     0.469484
    33923.313868     0.469484
    33931.832079     0.469484
    33945.314422     0.469484
    35536.072471     0.469484
    37491.179523     0.469484
    37662.751250     0.469484
    39309.478859     0.469484
    39972.352768     0.469484
    52301.587179     0.469484
    62682.147006     0.469484
    81647.100031     0.469484
    105147.437697    0.469484
    Length: 190, dtype: float64

HIV rate frequency and percentages.

    # frequency distribution for hivrate.
    print("\n counts for hivrate.")
    hiv_counts2 = sub_gapminder.groupby("hivrate").size()
    print(hiv_counts2)

    print("\n percentages for hivrate.")
    hiv_percent2 = sub_gapminder.groupby("hivrate").size() * 100 / len(sub_gapminder)
    print(hiv_percent2)

     counts for hivrate.
    hivrate
    0.06     16
    0.10     28
    0.20     15
    0.30     10
    0.40      9
    0.45      1
    0.50      5
    0.60      3
    0.70      3
    0.80      5
    0.90      4
    1.00      4
    1.10      2
    1.20      4
    1.30      2
    1.40      1
    1.50      2
    1.60      1
    1.70      1
    1.80      1
    1.90      1
    2.00      2
    2.30      1
    2.50      2
    2.90      1
    3.10      1
    3.20      1
    3.30      1
    3.40      3
    3.60      1
    4.70      1
    5.00      1
    5.20      1
    5.30      1
    5.60      1
    6.30      1
    6.50      1
    11.00     1
    11.50     1
    13.10     1
    13.50     1
    14.30     1
    17.80     1
    23.60     1
    24.80     1
    25.90     1
    dtype: int64

     percentages for hivrate.
    hivrate
    0.06      7.511737
    0.10     13.145540
    0.20      7.042254
    0.30      4.694836
    0.40      4.225352
    0.45      0.469484
    0.50      2.347418
    0.60      1.408451
    0.70      1.408451
    0.80      2.347418
    0.90      1.877934
    1.00      1.877934
    1.10      0.938967
    1.20      1.877934
    1.30      0.938967
    1.40      0.469484
    1.50      0.938967
    1.60      0.469484
    1.70      0.469484
    1.80      0.469484
    1.90      0.469484
    2.00      0.938967
    2.30      0.469484
    2.50      0.938967
    2.90      0.469484
    3.10      0.469484
    3.20      0.469484
    3.30      0.469484
    3.40      1.408451
    3.60      0.469484
    4.70      0.469484
    5.00      0.469484
    5.20      0.469484
    5.30      0.469484
    5.60      0.469484
    6.30      0.469484
    6.50      0.469484
    11.00     0.469484
    11.50     0.469484
    13.10     0.469484
    13.50     0.469484
    14.30     0.469484
    17.80     0.469484
    23.60     0.469484
    24.80     0.469484
    25.90     0.469484
    dtype: float64

urban rate frequency and percentage.

    # frequency distribution for urbanrate.
    print("\n counts for urabnrate.")
    urban_counts2 = sub_gapminder.groupby("urbanrate").size()
    print(urban_counts2)

    print("\n percentages for urbanrate.")
    urban_percent = sub_gapminder.groupby("urbanrate").size() * 100 / len(sub_gapminder)
    print(urban_percent)

counts for urabnrate. urbanrate 10.40 1 12.54 1 12.98 1 13.22 1 14.32 1 15.10 1 16.54 1 17.00 1 17.24 1 17.96 1 18.34 1 18.80 1 19.56 1 20.72 1 21.56 1 21.60 1 22.54 1 23.00 1 24.04 1 24.76 1 24.78 1 24.94 1 25.46 1 25.52 1 26.46 1 26.68 1 27.14 1 27.30 1 27.84 2 28.08 1 .. 82.42 1 82.44 1 83.52 1 83.70 1 84.54 1 85.04 1 85.58 1 86.56 1 86.68 1 86.96 1 87.30 1 88.44 1 88.52 1 88.74 1 88.92 1 89.94 1 91.66 1 92.00 1 92.26 1 92.30 1 92.68 1 93.16 1 93.32 1 94.22 1 94.26 1 95.64 1 97.36 1 98.32 1 98.36 1 100.00 6 Length: 194, dtype: int64 percentages for urbanrate. urbanrate 10.40 0.469484 12.54 0.469484 12.98 0.469484 13.22 0.469484 14.32 0.469484 15.10 0.469484 16.54 0.469484 17.00 0.469484 17.24 0.469484 17.96 0.469484 18.34 0.469484 18.80 0.469484 19.56 0.469484 20.72 0.469484 21.56 0.469484 21.60 0.469484 22.54 0.469484 23.00 0.469484 24.04 0.469484 24.76 0.469484 24.78 0.469484 24.94 0.469484 25.46 0.469484 25.52 0.469484 26.46 0.469484 26.68 0.469484 27.14 0.469484 27.30 0.469484 27.84 0.938967 28.08 0.469484 ... 82.42 0.469484 82.44 0.469484 83.52 0.469484 83.70 0.469484 84.54 0.469484 85.04 0.469484 85.58 0.469484 86.56 0.469484 86.68 0.469484 86.96 0.469484 87.30 0.469484 88.44 0.469484 88.52 0.469484 88.74 0.469484 88.92 0.469484 89.94 0.469484 91.66 0.469484 92.00 0.469484 92.26 0.469484 92.30 0.469484 92.68 0.469484 93.16 0.469484 93.32 0.469484 94.22 0.469484 94.26 0.469484 95.64 0.469484 97.36 0.469484 98.32 0.469484 98.36 0.469484 100.00 2.816901 Length: 194, dtype: float64

life expectancy frequency and percentage.

    # frequency distribution for lifeexpectancy.
    print("\n counts for lifeexpectancy.")
    life_counts2 = sub_gapminder.groupby("lifeexpectancy").size()
    print(life_counts2)

    print("\n percentages for lifeexpectancy.")
    life_percent2 = sub_gapminder.groupby("lifeexpectancy").size() * 100 / len(sub_gapminder)
    print(life_percent2)

counts for lifeexpectancy. lifeexpectancy 47.794 1 48.132 1 48.196 1 48.397 1 48.398 1 48.673 1 48.718 1 49.025 1 49.553 1 50.239 1 50.411 1 51.088 1 51.093 1 51.219 1 51.384 1 51.444 1 51.610 1 51.879 1 52.797 1 53.183 1 54.097 1 54.116 1 54.210 1 54.675 1 55.377 1 55.439 1 55.442 1 56.081 1 56.786 1 57.062 1 .. 79.499 1 79.591 1 79.634 1 79.839 1 79.915 1 79.963 1 79.977 1 80.009 1 80.170 1 80.414 1 80.499 1 80.557 1 80.642 1 80.654 1 80.734 1 80.854 1 80.934 1 81.012 1 81.097 1 81.126 1 81.404 1 81.439 1 81.539 1 81.618 1 81.804 1 81.855 1 81.907 1 82.338 1 82.759 1 83.394 1 Length: 189, dtype: int64 percentages for lifeexpectancy. lifeexpectancy 47.794 0.469484 48.132 0.469484 48.196 0.469484 48.397 0.469484 48.398 0.469484 48.673 0.469484 48.718 0.469484 49.025 0.469484 49.553 0.469484 50.239 0.469484 50.411 0.469484 51.088 0.469484 51.093 0.469484 51.219 0.469484 51.384 0.469484 51.444 0.469484 51.610 0.469484 51.879 0.469484 52.797 0.469484 53.183 0.469484 54.097 0.469484 54.116 0.469484 54.210 0.469484 54.675 0.469484 55.377 0.469484 55.439 0.469484 55.442 0.469484 56.081 0.469484 56.786 0.469484 57.062 0.469484 ... 79.499 0.469484 79.591 0.469484 79.634 0.469484 79.839 0.469484 79.915 0.469484 79.963 0.469484 79.977 0.469484 80.009 0.469484 80.170 0.469484 80.414 0.469484 80.499 0.469484 80.557 0.469484 80.642 0.469484 80.654 0.469484 80.734 0.469484 80.854 0.469484 80.934 0.469484 81.012 0.469484 81.097 0.469484 81.126 0.469484 81.404 0.469484 81.439 0.469484 81.539 0.469484 81.618 0.469484 81.804 0.469484 81.855 0.469484 81.907 0.469484 82.338 0.469484 82.759 0.469484 83.394 0.469484 Length: 189, dtype: float64
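One difference between the two approaches is worth noting: by default `groupby()` leaves out rows whose key is NaN, whereas `value_counts(dropna=False)` keeps them. When the group sizes are divided by `len()` of the whole frame, the groupby percentages therefore add up to less than 100 whenever values are missing. A small sketch with made-up data (the frame `df` is hypothetical):

```python
import pandas as pd

# Hypothetical column with two missing values out of six rows.
df = pd.DataFrame({"hivrate": [0.1, 0.1, 0.5, None, 2.0, None]})

by_group = df.groupby("hivrate").size() * 100 / len(df)
by_counts = df["hivrate"].value_counts(dropna=False, normalize=True) * 100

print(round(by_group.sum(), 2))   # share of rows with a non-missing value
print(round(by_counts.sum(), 2))  # all rows, missing ones included
```

In recent pandas versions (1.1 and later), passing `dropna=False` to `groupby()` keeps the NaN group as well.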

#data analysis #data management #frequency distribution

Shows how to utilize the FREQUENCY function to construct a frequency distribution in Microsoft Excel. An example data set is used to show how to apply the function to separate frequencies by class intervals.

Frequency Distribution in Excel
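For readers without Excel at hand, the same kind of class-interval count can be sketched in Python. This is only an illustration: the scores and the interval upper bounds are made up, and the helper name excel_style_frequency is hypothetical. It assumes Excel's documented FREQUENCY behaviour, where each bin value is an inclusive upper bound and one extra count is returned for values above the last bin.

```python
import numpy as np

def excel_style_frequency(data, bins):
    """Count values per class interval, treating each bin as an inclusive upper bound."""
    data = np.asarray(data, dtype=float)
    bins = np.asarray(bins, dtype=float)  # upper bounds, assumed sorted ascending
    # side="left" returns the index of the first bin whose upper bound is >= the value
    idx = np.searchsorted(bins, data, side="left")
    # One extra slot collects everything above the last upper bound
    return np.bincount(idx, minlength=len(bins) + 1)

scores = [55, 61, 70, 70, 79, 85, 88, 92, 97]  # made-up example data
upper_bounds = [60, 70, 80, 90]                # made-up class intervals
counts = excel_style_frequency(scores, upper_bounds)
print(counts)  # [1 3 1 2 2]
```

The five counts correspond to the intervals up to 60, 61 to 70, 71 to 80, 81 to 90, and above 90.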

#Excel #microsoft excel #frequency distribution #How to

Disrupted Works

New Post has been published on http://www.expertdelayanalysis.com/disrupted-works/

Disrupted Works

I have developed a method of measuring disruption by comparing the As Planned labour down time with the As Built actual labour disruption.

I call it the Measured Gaps Method.

You need to start with a detailed level 4 baseline programme that has been resource loaded and modeled.

If there are no resources then these can be added provided it is done properly and skillfully.

From the resourced As Planned baseline generate a histogram showing the planned resource deployment. This will show peaks and troughs and some gaps when no labour is deployed at all.

This will be the measure of the down time inherent in the tender price. Estimators rarely seek this information at the outset, so it will already be a loss, but it is your measurement base.

This can be leveled as an average per working day.

Now you need the detailed As Built data which you use to generate the As Built programme.

The As Built programme will have the same resource allowance as the original. Do not include VOs or Delay Events; you only want the simple As Built dates.

Now generate the labour histogram and you will see a different set of peaks and troughs and gaps.

Again this can be expressed as an average per working day.

The difference between the planned average and the as built average is the calculation of the labour disruption over the tender allowance.

This can be converted to money by using the Contract day rates.

#Frequency distribution #Histogram #Image processing

“(a) The frequency distribution of the number of records for 1012 acacia species, and (b) the relationship between the number of records and the convex hull estimates of the geographical range sizes.“

#transparent #cursing the gif tree #plot #frequency distribution #invasive plants #acacia #tree #convex hull estimate #fruits edit

Data analysis and interpretation-Week 2 - Writing your first program

I have chosen Python to start with my analysis.

Initially, I ran into this known issue with Spyder: https://github.com/spyder-ide/spyder/issues/2984. Because it prevented me from rendering any image, I decided to use IPython notebooks instead. IPython notebooks are very convenient because they let me add text between code snippets and figures, so I can more easily go back to them in the future.

The first thing I noticed when loading the dataset in Python is that the data file seems to contain some malformed lines, so I need to skip them by passing “error_bad_lines=False”:

data = pandas.read_csv('gapminder.csv', low_memory=False, error_bad_lines=False )

I first tried forcing the delimiter to a comma, with no luck. Since the file would not load otherwise, I decided to skip the malformed lines with the “error_bad_lines=False” parameter of the read_csv function.

This produces the following warning for 9 lines (one country each), a consideration I will need to remember:

'Skipping line 43: expected 16 fields, saw 17\n Skipping line 44: expected 16 fields, saw 17\n Skipping line 85: expected 16 fields, saw 17\n Skipping line 101: expected 16 fields, saw 17\n Skipping line 102: expected 16 fields, saw 17\n Skipping line 114: expected 16 fields, saw 17\n Skipping line 115: expected 16 fields, saw 17\n Skipping line 127: expected 16 fields, saw 17\n Skipping line 212: expected 16 fields, saw 17\n'

These lines correspond to the following countries, which are missing from the analysis:

Congo Rep., Costa Rica, Iceland, Kuwait, Kyrgyzstan, Madagascar, Malawi, Monaco and Zimbabwe.
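A note for anyone repeating this with a recent pandas: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0, and the on_bad_lines parameter takes its place. The sketch below is not the code used for this post; it runs on a small made-up CSV string (not gapminder.csv) just to show the skipping behaviour.

```python
import io
import pandas

# Made-up CSV text: the row for "C" has one field too many.
raw = io.StringIO(
    "country,lifeexpectancy\n"
    "A,70.1\n"
    "B,65.2\n"
    "C,80.3,extra\n"
    "D,75.4\n"
)

# on_bad_lines="skip" drops rows with too many fields instead of raising an error
data_skip = pandas.read_csv(raw, on_bad_lines="skip")
print(len(data_skip))  # 3
```

Passing on_bad_lines="warn" instead keeps the skipping behaviour but also reports each dropped line, which is closer to the warning shown above.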

After printing some general information about our dataset, these are the main stats:

- 202 observations (countries)

- 16 columns, whose headers are:

Index(['country', 'incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate'], dtype='object') 202

STARTING THE UNIVARIATE ANALYSIS

Next, we start with the UNIVARIATE analysis, that is, examining each variable separately.

First, we look at the distribution of the variables EMPLOYRATE, URBANRATE and INTERNETUSERATE, that is, which values they take and how often they take those values:

These frequency tables tell us about the three observed variables: they are continuous variables that take values from 0 to 100 on a continuum. In some cases a value is repeated across countries, but most values occur only once. Each variable also has a different number of NaN (missing) values.

In the case of EMPLOYRATE, there are 33 missing values. In the case of URBANRATE there are 9 missing values. In the case of INTERNETUSERATE there are 19 missing values. We will need to account for this during our analysis. If you pass a variable with many unique values, such as a numeric variable, to value_counts(), it will still produce a table of counts for each unique value, but the counts may not be particularly meaningful:

1) Variable EMPLOYRATE:

c1 = data["EMPLOYRATE"].value_counts(sort=False, dropna=False)
print(c1)

employment rate nan 33 50.500000 1 61.500000 3 64.500000 1 63.500000 1 56.500000 1 53.500000 3 81.500000 1 60.500000 1 42.500000 1 54.500000 2 49.500000 1 71.300003 1 52.500000 1 58.500000 1 57.500000 2 73.199997 1 63.799999 2 60.700001 1 51.400002 1 41.099998 1 75.199997 1 56.400002 1 70.400002 2 51.299999 1 50.700001 1 78.199997 2 46.200001 1 42.000000 1 71.800003 1 43.099998 1 46.000000 2 32.000000 1 51.000000 2 65.699997 1 56.900002 1 56.000000 2 46.799999 1 59.000000 1 61.000000 2 62.299999 1 65.000000 3 47.099998 1 64.300003 1 56.299999 2 71.000000 1 72.000000 1 66.900002 1 40.099998 1 76.000000 1 77.000000 1 68.300003 1 78.900002 1 53.099998 1 83.000000 1 73.099998 1 55.900002 3 54.599998 1 66.000000 2 48.700001 2 56.799999 2 68.099998 1 38.900002 1 74.699997 1 47.799999 1 68.000000 1 62.700001 1 46.900002 1 44.700001 1 42.799999 1 59.799999 1 41.200001 1 57.599998 1 46.400002 1 55.700001 1 71.599998 1 45.700001 1 49.599998 1 58.400002 2 66.800003 1 80.699997 1 60.900002 1 63.900002 1 58.200001 2 55.599998 1 65.599998 1 57.900002 1 62.400002 1 59.099998 2 57.200001 1 53.400002 2 50.900002 2 57.099998 1 59.299999 1 61.799999 1 72.800003 1 66.599998 1 63.200001 1 51.200001 2 55.099998 1 42.400002 2 57.299999 1 63.099998 1 55.400002 1 61.700001 1 44.200001 1 73.599998 1 83.199997 2 64.900002 1 81.300003 1 52.099998 1 47.299999 3 59.700001 1 59.900002 3 54.400002 1 58.799999 2 75.699997 1 65.900002 1 44.799999 1 58.900002 2 67.300003 1 71.699997 1 52.700001 1 63.700001 1 65.099998 1 44.299999 1 48.599998 2 60.400002 2 79.800003 1 68.900002 1 41.599998 1 61.299999 1 37.400002 1 dtype: int64

p1 = data["EMPLOYRATE"].value_counts(sort=False, normalize=True, dropna=False)
print(p1)

percentage employment rate nan 0.163366 50.500000 0.004950 61.500000 0.014851 64.500000 0.004950 63.500000 0.004950 56.500000 0.004950 53.500000 0.014851 81.500000 0.004950 60.500000 0.004950 42.500000 0.004950 54.500000 0.009901 49.500000 0.004950 71.300003 0.004950 52.500000 0.004950 58.500000 0.004950 57.500000 0.009901 73.199997 0.004950 63.799999 0.009901 60.700001 0.004950 51.400002 0.004950 41.099998 0.004950 75.199997 0.004950 56.400002 0.004950 70.400002 0.009901 51.299999 0.004950 50.700001 0.004950 78.199997 0.009901 46.200001 0.004950 42.000000 0.004950 71.800003 0.004950 43.099998 0.004950 46.000000 0.009901 32.000000 0.004950 51.000000 0.009901 65.699997 0.004950 56.900002 0.004950 56.000000 0.009901 46.799999 0.004950 59.000000 0.004950 61.000000 0.009901 62.299999 0.004950 65.000000 0.014851 47.099998 0.004950 64.300003 0.004950 56.299999 0.009901 71.000000 0.004950 72.000000 0.004950 66.900002 0.004950 40.099998 0.004950 76.000000 0.004950 77.000000 0.004950 68.300003 0.004950 78.900002 0.004950 53.099998 0.004950 83.000000 0.004950 73.099998 0.004950 55.900002 0.014851 54.599998 0.004950

...

2) Variable URBANRATE

urbanization rate

urbanization rate nan 9 74.500000 1 73.500000 1 67.500000 1 26.460000 1 66.500000 1 87.300000 1 52.040000 1 71.100000 1 85.580000 1 73.480000 1 41.000000 1 17.000000 1 60.180000 1 92.680000 1 51.640000 1 32.580000 1 23.000000 1 52.740000 1 61.000000 1 86.960000 1 12.980000 1 73.920000 1 29.540000 1 82.440000 1 33.320000 1 15.100000 1 64.920000 1 57.180000 1 39.380000 1 14.320000 1 93.160000 1 42.000000 1 73.200000 1 77.540000 1 77.480000 1 48.360000 1 18.800000 1 54.340000 1 37.760000 1 69.460000 1 29.520000 1 70.360000 1 81.820000 1 43.840000 1 48.780000 1 56.420000 1 59.620000 1 95.640000 1 69.900000 1 88.920000 1 92.000000 1 77.360000 1 37.860000 1 68.460000 1 35.420000 1 12.540000 1 85.040000 1 24.760000 1 47.440000 1 38.580000 1 24.940000 1 21.600000 1 39.840000 1 71.620000 1 93.320000 1 29.840000 1 34.440000 1 47.880000 1 46.720000 1 50.020000 1 56.700000 1 34.480000 1 27.300000 1 60.560000 1 26.680000 1 27.840000 2 30.880000 1 16.540000 1 77.120000 1 51.460000 1 25.460000 1 77.200000 1 74.920000 1 19.560000 1 53.300000 1 10.400000 1 36.820000 1 24.780000 1 68.080000 1 36.280000 1 73.460000 1 77.880000 1 46.840000 1 73.640000 1 65.220000 1 84.540000 1 41.760000 1 74.820000 1 72.840000 1 42.380000 1 18.340000 1 86.560000 1 69.020000 1 36.160000 1 100.000000 4 80.460000 1 36.520000 1 92.300000 1 61.340000 1 88.520000 1 42.720000 1 51.920000 1 25.520000 1 65.580000 2 81.700000 1 21.560000 1 88.740000 1 48.600000 1 98.320000 1 71.900000 1 17.240000 1 13.220000 1 68.680000 1 32.180000 1 91.660000 1 32.320000 1 83.700000 1 27.140000 1 47.040000 1 36.840000 2 52.360000 1 59.460000 1 28.380000 1 71.400000 1 56.740000 1 94.260000 1 97.360000 1 42.480000 1 75.660000 1 24.040000 1 41.200000 1 46.780000 1 37.340000 1 30.460000 1 51.700000 1 83.520000 1 48.580000 1 61.320000 1 60.740000 1 63.860000 1 30.840000 1 68.120000 1 20.720000 1 88.440000 1 48.620000 1 86.680000 1 64.780000 1 82.420000 1 80.400000 1 28.080000 1 67.980000 1 57.940000 1 59.580000 1 60.300000 1 
60.700000 1 17.960000 1 56.560000 1 56.020000 1 66.480000 1 43.440000 1 67.160000 1 66.600000 1 57.280000 1 66.960000 1 54.220000 1 43.100000 1 92.260000 1 71.080000 1 98.360000 1 63.300000 1 60.140000 1 94.220000 1 41.420000 1 89.940000 1 78.420000 1 54.240000 1 56.760000 1 dtype: int64

percentage urbanization rate nan 0.044554 74.500000 0.004950 73.500000 0.004950 67.500000 0.004950 26.460000 0.004950 66.500000 0.004950 87.300000 0.004950 52.040000 0.004950 71.100000 0.004950 85.580000 0.004950 73.480000 0.004950

3) Variable INTERNETUSERATE

internet use rate nan 19 1.400061 1 39.820178 1 74.163040 1 2.199998 1 56.300034 1 9.196775 1 8.370207 1 90.016190 1 69.339971 1 2.699966 1 77.638535 1 13.000111 1 43.055067 1 51.280478 1 9.549931 1 40.772851 1 62.471230 1 2.450362 1

...

percentage internet use rate nan 0.094059 1.400061 0.004950 39.820178 0.004950 74.163040 0.004950 2.199998 0.004950 56.300034 0.004950 9.196775 0.004950 8.370207 0.004950 90.016190 0.004950 69.339971 0.004950 2.699966 0.004950 77.638535 0.004950 13.000111 0.004950 43.055067 0.004950 51.280478 0.004950 9.549931 0.004950 40.772851 0.004950 62.471230 0.004950
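Because almost every value in the tables above occurs only once, a frequency table over class intervals is usually easier to read. The following is a minimal sketch of that idea with pandas.cut; the series is a small made-up stand-in for a column such as URBANRATE, and the bin edges are arbitrary assumptions, not values used in this analysis.

```python
import pandas

# Made-up stand-in for a continuous 0-100 column, including one missing value
rates = pandas.Series([12.5, 26.46, 41.0, 52.04, 67.5, 73.5, 87.3, 100.0, None])

# Arbitrary class intervals; pandas.cut makes them right-closed, e.g. (0, 25]
bins = [0, 25, 50, 75, 100]
groups = pandas.cut(rates, bins=bins)

# dropna=False keeps the missing value as its own row in the table
binned_counts = groups.value_counts(sort=False, dropna=False)
print(binned_counts)
```

With intervals, the counts describe how many countries fall into each range rather than listing every distinct value.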

Main stats for these three variables:

data_valid_employrate.describe()

count 169.000000
mean 58.746746
std 10.490075
min 32.000000
25% 51.200001
50% 58.400002
75% 65.000000
max 83.199997

data_valid_urbanrate.describe()

count 193.000000
mean 56.483938
std 23.707742
min 10.400000
25% 36.840000
50% 57.180000
75% 73.920000
max 100.000000

data_valid_interentuserate.describe()

count 183.000000
mean 35.540207
std 27.758392
min 0.210066
25% 9.999254
50% 31.568098
75% 55.646421
max 95.638113

Derived variables (this is a preview of week 3):

We can see that there are no further values we need to exclude from our set (the missing countries and the NA values are already accounted for), unlike in the example from the class videos.

The only thing we could try is to create a secondary variable that subdivides the urban rate into different tiers or levels of urbanization.

Now, we will proceed to create a secondary derived variable called "URBANIZATIONGROUP" for categorizing the level of urbanization of a country. We will assign 3 groups, using custom thresholds on the variable URBANRATE (the bin edges 0, 29, 74 and 100 passed to pandas.cut in the code below):

UrbLevel1: Countries having 29 percent or less of their population living in urban areas.

UrbLevel2: Countries having more than 29 and up to 74 percent of their population living in urban areas.

UrbLevel3: Countries having more than 74 percent of their population living in urban areas.

These are the counts for each of these three levels in our dataset:

UrbLevel1 31 UrbLevel2 114 UrbLevel3 48

or, as proportions:

UrbLevel1 0.160622 UrbLevel2 0.590674 UrbLevel3 0.248705

To see if these levels have been assigned correctly, we execute the crosstab function with these two variables:

print(pandas.crosstab(x2_valid_data_urban['URBANIZATIONGROUP'], x2_valid_data_urban['URBANRATE']))
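The crosstab above has one column per distinct URBANRATE value, so it is very wide. A more compact way to check the assignment, sketched here as an alternative and not taken from the original program, is to look at the minimum and maximum URBANRATE inside each group. The small data frame is made up for illustration; only the bin edges and labels match the ones used in this post.

```python
import pandas

# Made-up urban rates chosen to sit on and around the thresholds 29 and 74
urban = pandas.DataFrame({"URBANRATE": [10.4, 29.0, 29.5, 57.18, 74.0, 74.5, 100.0]})

urban["URBANIZATIONGROUP"] = pandas.cut(
    urban["URBANRATE"],
    [0, 29, 74, 100],
    labels=["UrbLevel1", "UrbLevel2", "UrbLevel3"],
)

# Per group: smallest value, largest value, and number of rows
check = urban.groupby("URBANIZATIONGROUP", observed=True)["URBANRATE"].agg(
    ["min", "max", "count"]
)
print(check)
```

If the grouping is right, each group's maximum stays at or below its upper threshold and the next group's minimum lies above it.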

THIS IS THE PYTHON CODE (I am using Python 2.7):

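#convert variables to numeric
#(pandas.to_numeric replaces convert_objects, which newer pandas versions no longer provide;
# errors="coerce" turns unparseable entries into NaN)

data["EMPLOYRATE"] = pandas.to_numeric(data["EMPLOYRATE"], errors="coerce")

data["URBANRATE"] = pandas.to_numeric(data["URBANRATE"], errors="coerce")

data["INTERNETUSERATE"] = pandas.to_numeric(data["INTERNETUSERATE"], errors="coerce")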

#get frequency counts (tables):

print("employment rate")
c1 = data["EMPLOYRATE"].value_counts(sort=False, dropna=False)
print(c1)

#in order to ask for percentages of each value based on those counts:
print("percentage employment rate")
p1 = data["EMPLOYRATE"].value_counts(sort=False, normalize=True, dropna=False)
print(p1)

print("urbanization rate")
c2 = data["URBANRATE"].value_counts(sort=False, dropna=False)
print(c2)

#in order to ask for percentages of each value based on those counts:
print("percentage urbanization rate")
p2 = data["URBANRATE"].value_counts(sort=False, normalize=True, dropna=False)
print(p2)

print("internet use rate")
c3 = data["INTERNETUSERATE"].value_counts(sort=False, dropna=False)
print(c3)

#in order to ask for percentages of each value based on those counts:
print("percentage internet use rate")
p3 = data["INTERNETUSERATE"].value_counts(sort=False, normalize=True, dropna=False)
print(p3)

#get frequency tables using crosstab functions:

freq_table_employrate = pandas.crosstab(index=data["EMPLOYRATE"], columns="count")
freq_table_employrate

freq_table_urbanrate = pandas.crosstab(index=data["URBANRATE"], columns="count")
freq_table_urbanrate

freq_table_internetuserate = pandas.crosstab(index=data["INTERNETUSERATE"], columns="count")
freq_table_internetuserate

freq_table_employrate/freq_table_employrate.sum()

freq_table_urbanrate/freq_table_urbanrate.sum()

freq_table_internetuserate/freq_table_internetuserate.sum()


#Next, we create new variable URBANIZATIONGROUP to categorize countries based on their percentage of urban population
print('URBANIZATIONGROUP - 3 categories - custom groups based on thresholds')

x_valid_data_urban.loc[:,('URBANIZATIONGROUP')]= pandas.cut(x_valid_data_urban['URBANRATE'], [0, 29, 74, 100 ], labels=["UrbLevel1", "UrbLevel2", "UrbLevel3"])

x2_valid_data_urban=x_valid_data_urban.copy()

c5 = x2_valid_data_urban['URBANIZATIONGROUP'].value_counts(sort=False, dropna=True)
print(c5)

p5 = x2_valid_data_urban['URBANIZATIONGROUP'].value_counts(sort=False, dropna=True, normalize=True)
print(p5)

x2_valid_data_urban.describe()

#data_analysis #exploratory data analysis #frequency distribution #stats

Median Frequency Distribution

Introduction to median frequency distribution: A graded arrangement of the data showing the frequency with which each successive value of the variable occurs is called a frequency distribution. Frequency distributions are classified into two types: continuous distributions and discrete distributions. The median is defined as follows: if the list has an odd number of entries, the median is the middle entry in the list after sorting the list into increasing order. Here, we are going to study the median in a frequency distribution.

Median in Types of Frequency Distribution

It has two types; they are

Median in discrete distribution, and Median in continuous distribution.

Median in Discrete Frequency Distribution:

In any discrete distribution, if the whole range of scores can be arranged in order of magnitude, preferably with the highest score at the top, the middle point of the array gives the position of the median, and the score at that point is called the median or the median score. There will be an equal number of items above and below the median.

Ex: Suppose seven boys of different heights are made to stand in a row, the tallest first, the next tallest next, and so on. Solution: The median height is that of the fourth boy from either end. If there is an even number of boys, say eight, it would be natural to take as the median the height midway between that of the fourth and that of the fifth boy. Its position is given by `(4+5)/2`, or the 4.5th from either end. In this way, whether n is odd or even, the position of the median is given by the number `(n+1)/2`, and the score at that point is called the median.
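The position rule `(n+1)/2` can be sketched in a few lines of Python. The helper name median_by_position and the heights are made up for illustration, and the sketch assumes a non-empty list. When the position falls halfway between two ranks, the two neighbouring scores are averaged, as in the example with eight boys.

```python
import math

def median_by_position(values):
    """Return the 1-based median position (n+1)/2 and the score at that position."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    # For odd n both indices coincide; for even n they are the two middle ranks
    low = ordered[math.floor(position) - 1]
    high = ordered[math.ceil(position) - 1]
    return position, (low + high) / 2

seven = [150, 142, 160, 155, 147, 158, 151]  # made-up heights in cm
eight = seven + [164]

pos7, med7 = median_by_position(seven)
pos8, med8 = median_by_position(eight)
print(pos7, med7)  # 4.0 151.0
print(pos8, med8)  # 4.5 153.0
```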

Median in Continuous Frequency Distribution

We know that a continuous distribution can be represented by a histogram. We also know that the total frequency (n) of a continuous distribution is equal to the area of the histogram drawn with frequency density. The median of a continuous distribution is that value of the variable at which the ordinate divides the histogram into two parts of equal area. Hence the position of the median in the case of a continuous distribution is given by `n/2` (and not `(n+1)/2`). The unit of the median is the same as that of the variable.
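One common way to turn the equal-area idea into a number is linear interpolation inside the class that contains the `n/2` point: median = L + ((n/2 - cf) / f) * h, where L is the lower boundary of that class, cf the cumulative frequency before it, f its frequency and h its width. The article does not state this formula, so the sketch below is an assumption about how the rule would be applied; the function name and the class data are made up.

```python
def grouped_median(lower_bounds, width, freqs):
    """Median of a grouped (continuous) distribution by interpolation at n/2."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no observations")
    half = n / 2
    cumulative = 0  # frequency accumulated before the current class
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cumulative + f >= half:
            # Fraction of this class needed to reach half the total area
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("inconsistent frequencies")

# Made-up classes 100-110, 110-120, 120-130 with frequencies 5, 9, 3
m = grouped_median([100, 110, 120], 10, [5, 9, 3])
print(round(m, 2))  # 113.89
```

Here n = 17, so the `n/2` point is 8.5; five observations lie below 110, and the remaining 3.5 out of 9 are taken from the 110-120 class.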

Ex: Consider a class of 17 students with the following heights (in cm): 106, 110, 123, 125, 117, 120, 112, 115, 110, 120, 115, 102, 115, 115, 109, 115, 101.

Solution: First arrange the data in ascending or descending order: 101, 102, 106, 109, 110, 110, 112, 115, 115, 115, 115, 115, 117, 120, 120, 123, 125

Now the median of the given set of data is the middle value, that is, 115.

So the median of this distribution is 115.

The mean of this distribution is

`(106+110+123+125+117+120+112+115+110+120+115+102+115+115+109+115+101)/17 = 1930/17`, which is approximately 113.53.
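The worked example can be checked with Python's standard statistics module. This is just a verification sketch using the 17 heights listed above.

```python
import statistics

heights = [106, 110, 123, 125, 117, 120, 112, 115, 110,
           120, 115, 102, 115, 115, 109, 115, 101]

ordered = sorted(heights)
n = len(ordered)

# With n = 17 the median sits at position (n + 1) / 2 = 9 (1-based)
middle = ordered[(n + 1) // 2 - 1]

print(n, sum(heights))                      # 17 1930
print(middle, statistics.median(heights))   # 115 115
print(round(statistics.mean(heights), 2))   # 113.53
```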

#frequency #two types #discrete frequency distribution #frequency distribution #distribution #continuos distribution #discrete frequency #median score #continuous distribution #discrete distribution #continuous #median height #median frequency #frequency density #total frequency #median #continuous frequency