An Analyst's Diary @anuvarshini-rk - Tumblr Blog

Activation functions for Artificial Neural Networks (ANN)

An Artificial Neural Network (ANN) is a machine learning model loosely based on the human brain. Like the brain, a neural network consists of a large number of connected processing units (nodes) that mimic neurons, each powered by an activation function.

An ANN consists of three kinds of layers: the input layer, one or more hidden layers and the output layer, each made up of multiple interconnected nodes. The number of nodes in a hidden layer is our choice (more nodes add capacity, but also increase training cost and the risk of overfitting); for the input layer it depends on the dimensions (features) of x(i), and for the output layer of a classifier it represents the number of classes. Each connection between two nodes has a weight associated with it that represents the strength of that connection.

When the neural network receives an input, each input value is multiplied by the weight of its connection and then passed to the next node. For each node in the second layer, a weighted sum (here, x1.w1 + x2.w2) is computed, a bias term (b1) is added to it, and the resulting sum is passed through the activation function, which performs the transformation.

y1 = f (x1.w1 + x2.w2 + b1)

The ultimate goal of an activation function is to convert the input and generate the output, which will be used as the input for the next layer. It decides whether the neuron should be activated or not.
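The computation y1 = f (x1.w1 + x2.w2 + b1) can be sketched in a few lines of Python. The input, weight and bias values below are made up for illustration, and sigmoid is only one possible choice of f.

```python
import math

def sigmoid(z):
    """Logistic activation, used here only as an example choice of f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Compute f(x1.w1 + x2.w2 + ... + b) for a single node."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)

# Hypothetical values for x1, x2, w1, w2 and b1
y1 = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)
print(y1)
```

The value y1 would then serve as one of the inputs to the nodes of the next layer.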

Why use an activation function?

Including an activation function increases the complexity and it is introduced as an additional step at each neuron of the neural network. So can we do without an activation function?

Well, if we had to remove the activation function, then the output of the neural network would simply be a linear function that is unable to learn the complex patterns of the data. And if we had only linear layers in a neural network, all the layers would essentially collapse into a single linear layer and thus a ‘deep’ neural network architecture would cease to exist and it would just be a linear classifier!

Without activation functions,

y = W1.W2.W3.x = W.x

where W(i) represents the weight-bias matrix of layer i, so the product W = W1.W2.W3 is itself just a single matrix.

Including activation functions, on the other hand, introduces non-linearity, which makes the network capable of finding complex patterns and prevents the layers from collapsing.

With activation functions,

y = f1 (W1.f2 (W2.f3 (W3.x)))

where f(i) represents the activation function of layer i.
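The collapse of purely linear layers can be checked numerically. This is a small sketch with made-up 2x2 weight matrices; it uses ReLU between the layers and leaves the output activation as the identity, both choices made only for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu_matrix(m):
    """Apply max(0, v) to every entry."""
    return [[max(0.0, v) for v in row] for row in m]

# Hypothetical weights and input
W1 = [[1.0, -1.0], [0.5, 2.0]]
W2 = [[2.0, 0.0], [-1.0, 1.0]]
W3 = [[1.0, 1.0], [0.0, -1.0]]
x = [[1.0], [2.0]]

# Without activations, applying the layers one by one ...
layered = matmul(W1, matmul(W2, matmul(W3, x)))
# ... gives the same result as a single layer with W = W1.W2.W3
W = matmul(matmul(W1, W2), W3)
collapsed = matmul(W, x)

# With a non-linearity between the layers the result differs
nonlinear = matmul(W1, relu_matrix(matmul(W2, relu_matrix(matmul(W3, x)))))

print(layered, collapsed, nonlinear)
```

Here the layered and collapsed results agree, while the non-linear network produces a different output that no single matrix W reproduces for every input.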

Choosing an activation function

The choice of activation function is critical, as it has a large impact on the performance and capability of the model. Typically, all the hidden layers use the same activation function. The output layer, however, may use a different one, depending on the type of prediction the model makes.

1. Sigmoid (or Logistic ) Activation function

It’s one of the most widely used non-linear activation functions. The curve is S-shaped and its values lie in the range (0,1). It is given by

f(z) = 1/(1+e^(-z))

where z(i) = W(i).x + b

It is mostly used in the output layer of models for binary and multi-label classification, where we have to predict a probability. The function is continuously differentiable, but its gradient is small everywhere (at most 0.25) and close to zero for inputs of large magnitude, which can cause the vanishing gradient problem. The output isn’t zero-centred.

2. Binary Step Function

This activation function is threshold-based: the neuron is activated only when the input is greater than or equal to 0.

f(x) = 0 if x<0

f(x) = 1 if x>=0

The binary step function can be used as an activation function when creating a binary classifier, but not for a multiclass classifier. Its derivative is zero everywhere (except at 0, where it is undefined), which makes gradient-based optimisation impossible.

3. Linear (or Identity ) Activation Function

The linear activation function returns a value proportional to the weighted sum of the inputs. With a = 1 it is also called the identity activation function, because it returns the weighted sum unchanged.

f(x) = ax

where, a = constant

When non-linearities exist, this activation function alone is insufficient, though it may still be employed as the activation function on the final output nodes for regression tasks.

4. Tanh (or Hyperbolic Tangent) Activation function

Like sigmoid, tanh is S-shaped, but it is rescaled and shifted so that it ranges from -1 to +1.

f(x)= (e^x - e^(-x)) / (e^x + e^(-x))

Since it's differentiable everywhere, centred at zero and anti-symmetric, it is often favoured over the sigmoid function in classification tasks and RNNs. To mitigate slow learning and/or vanishing gradients, flatter variations on this function (log-log, softsign, symmetrical sigmoid, etc.) can be employed.

5. ReLU (Rectified Linear Unit) Activation function

ReLU is the most popular function used in the hidden layers of deep learning models. It mitigates the vanishing gradient limitation of sigmoid and tanh because its gradient is exactly 1 for positive inputs, and it also introduces sparsity into the model (it doesn’t activate all the neurons at the same time). The range is 0 to infinity.

f(x) = max(0, x)

As the derivative is 0 for negative inputs, ReLU may suffer from slow learning or even dead neurons: neurons whose inputs are consistently negative cannot update their weights because of the zero-valued gradients, which can leave them silent for the remainder of the training phase.

6. Leaky ReLU Activation Function

Leaky ReLU is a widely used improved version of ReLU that attempts to solve the dying neurons problem. It uses a very small slope for negative values (as opposed to 0 in ReLU) so that there are no dead neurons in that region.

f(x) = 0.01(x) if x < 0

f(x) = x if x >= 0

Another variant of ReLU is the Parametric Rectified Linear Unit (PReLU), where the slope a for negative inputs is learned during model training (as opposed to the fixed slope of 0.01 in Leaky ReLU):

f(x) = a(x) if x < 0

f(x) = x if x >= 0
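The formulas above translate almost directly into code. The following is a minimal sketch using only Python's math module; the function names are chosen here for illustration, and PReLU has the same form as leaky_relu with the slope treated as a learned parameter instead of a constant.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_step(x):
    return 1.0 if x >= 0 else 0.0

def linear(x, a=1.0):
    return a * x

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # For PReLU, `slope` would be learned during training
    return x if x >= 0 else slope * x

# Evaluate every function at a negative, a zero and a positive input
for f in (sigmoid, binary_step, linear, tanh, relu, leaky_relu):
    print(f.__name__, [round(f(v), 4) for v in (-2.0, 0.0, 2.0)])
```

Comparing the outputs at -2.0 shows the key difference between ReLU and Leaky ReLU: the former returns 0, the latter a small negative value.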

Tips:

Always start with ReLU as an activation function for hidden layers and move to other functions such as Leaky ReLU, Parametric ReLU or Randomized ReLU in case of dead neurons.

Generally, avoid Sigmoid and Tanh in hidden layers due to the vanishing gradients problem.

For Multilayer Perceptrons (MLP) and Convolutional Neural Networks (CNN) use ReLU.

Recurrent Neural Networks like LSTM commonly use the Sigmoid activation for recurrent connections and the Tanh activation for output.

Activation functions in the output layer include Linear, Sigmoid and Softmax.

For Regression problems use linear activation function in the output layer.

For Binary and Multilabel classification use sigmoid activation function in the output layer and for Multiclass classification use softmax activation function.
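Softmax is mentioned in the tips but not defined in this post. Its standard definition maps a list of raw scores to probabilities, softmax(z)(i) = e^z(i) / sum over j of e^z(j). A minimal sketch follows; the scores are made up.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(probs, sum(probs))
```

The class with the largest score receives the largest probability, which is why softmax suits multiclass outputs where exactly one class is correct.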

#machine learning #deep learning #artificial intelligence #neural network

Unmasking Multicollinearity

Multicollinearity is a common problem when estimating linear or generalized linear models. It occurs when there are high correlations among predictor variables (apart from the target variable), leading to unreliable and unstable estimates of regression coefficients.

Multicollinearity violates one of the assumptions of regression analysis, namely that the predictor variables are not strongly linearly dependent on one another. It becomes a roadblock when we want to distinguish the individual effects of the predictor variables on the target variable.

Consider the following equation,

y = W0 + W1.X1 + W2.X2

W1 is the increase in y for a one-unit increase in X1 while keeping X2 constant. But when X1 and X2 are highly correlated, a change in one comes with a change in the other, and we might not be able to see their individual effects on y.

How do we detect multicollinearity?

The severity of the problem can be assessed with a statistic called the Variance Inflation Factor (VIF). The VIF can be calculated for each predictor by running a linear regression of that predictor on all the other predictors and then obtaining R² from that regression.

VIF can be captured by,

VIF(i) = 1 / (1 - R²(i))

VIF estimates how much the variance of a coefficient is inflated because of linear dependence with other predictors.

It has a lower bound of 1 but no upper bound. However, a rule of thumb is that VIF greater than 5 indicates the presence of high multicollinearity. Usually, a VIF of up to 10 can be tolerated only when the dataset has more categorical features or many features in general.
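To make the definition concrete, here is a minimal sketch for the simplest case of only two predictors, where the R² of regressing one on the other is all that is needed. The data values are made up, and the function names are chosen for this illustration; with more predictors each regression would involve all the remaining predictors.

```python
def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors both get the same value."""
    return 1.0 / (1.0 - r_squared(x2, x1))

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]             # hypothetical predictor
x2_related = [2.1, 3.9, 6.2, 7.8, 10.1]    # almost a multiple of x1
x2_unrelated = [1.0, -1.0, 0.0, -1.0, 1.0] # uncorrelated with x1

print(vif_two_predictors(x1, x2_related))
print(vif_two_predictors(x1, x2_unrelated))
```

The uncorrelated pair sits at the lower bound of 1, while the nearly collinear pair lands far above the rule-of-thumb threshold of 5. Perfectly collinear predictors would give R² = 1 and a division by zero, which this sketch does not guard against.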

Let's take a look at the blood pressure data obtained from 20 Individuals with high BP

blood pressure (y = BP, in mm Hg)

age (x1 = Age, in years)

weight (x2 = Weight, in kg)

body surface area (x3 = BSA, in sq m)

duration of hypertension (x4 = Dur, in years)

basal pulse (x5 = Pulse, in beats per minute)

stress index (x6 = Stress)

Now let's calculate the VIF using variance_inflation_factor from the statsmodels package. After removing the target variable (BP) and the index, the VIF is,

The VIF values arrived at here are far too high because we have not scaled the data: variance_inflation_factor does not add an intercept on its own, so uncentred features inflate the values. Let’s fix this by transforming the dataset using StandardScaler(), which centres each feature, and then recalculating the VIF.

Now, this looks reasonable. The VIF of variables Weight and BSA is higher than 5, which makes sense because as the body surface area(BSA) increases or decreases, so does the weight.

The correlation matrix also allows us to investigate the dependence between multiple variables at the same time. The result is a table containing the Pearson correlation coefficients between each variable and the others.

Our correlation plot confirms the same, as the variables Weight and BSA have a high correlation of 0.88. We can fix this by eliminating one of them (either Weight or BSA) from the feature list. However, which feature to retain depends purely on the problem we are trying to solve. For now, we can go ahead and eliminate the BSA variable from the dataset and calculate the VIF of the remaining features.

Impressive! Just removing the BSA variable has reduced the VIF of the remaining predictors to a great extent.

So how do we avoid multicollinearity?

Avoid creating new variables that are directly dependent on a variable already present. For example, when creating an ‘Age’ variable from ‘DOB’, one of the two should be dropped.

Check for identical variables with different measuring units. For example, the Height variable can be present in both inches and cms.

When creating dummy variables for a categorical feature, set drop_first=True so that one category is left out, or simply encode a two-valued feature directly. For example, for the Result variable containing Pass and Fail as values, we can just encode them as 1 and 0 instead of creating dummy variables; if we do create them, we can set the drop_first parameter to True.

When is it safe to ignore multicollinearity?

Multicollinearity inflates the standard error of the coefficients resulting in their instability. This instability creates a risk in coefficient interpretation and can mislead the hypothesis testing of the coefficients.

However, Multicollinearity does not reduce the predictive power or reliability of the model as a whole. It affects only calculations regarding individual predictors.

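Dummy variables that represent a categorical variable with three or more categories often have a high VIF, especially when the proportion of cases in the reference category is very small. This is completely fine, because it does not signal a problem with the other predictors.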

#machine learning #data #collinearity #correlation #multicollinearity #vif #data science

Choosing Performance metric for Imbalanced Classification Problem

#machine learning #classified #imbalanced dataset #class imbalance #accuracy #metrics

Fixing Imbalanced Classification Problem

Imbalanced datasets pose a challenging problem where the classes are represented unequally. For an imbalanced dataset consisting of two classes, the ratio of training examples may be 1:100, and in scenarios such as fraud detection in claims, click-through prediction for an ad-serving company and predicting airplane crashes or failures the imbalance might be even more severe, say 1:1000 or 1:5000.

So how do we fix this?

1. Resampling the dataset

Resampling is one of the most straightforward ways of dealing with highly imbalanced datasets: it evens out the classes by changing the number of training examples in each.

1.1 Under Sampling:

Undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.

Pros:

It can help improve run time and storage problems by reducing the number of training data samples when the training data set is huge.

Cons:

It can discard potentially useful information which could be important for building rule classifiers.

The sample chosen by random under-sampling may be biased and may not be an accurate representation of the population, resulting in inaccurate results on the actual test data set.

Note: Under Sampling should only be done when we have a huge number of records.

1.2 Over Sampling:

Over Sampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset.

Pros:

Unlike under-sampling, this method leads to no information loss.

Often outperforms under-sampling

Cons:

It increases the likelihood of overfitting since it replicates the minority class events.

Note: Oversampling can be considered when we have fewer records
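Random under-sampling and over-sampling as described above can be sketched with Python's random module. The records are made up, the function names are hypothetical, and balancing the classes to exactly equal sizes is an assumption of this sketch (other target ratios are possible).

```python
import random

def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class, the size of the minority."""
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, rng):
    """Draw minority examples with replacement until the classes match."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + extra

rng = random.Random(42)                      # fixed seed for repeatability
majority = [("neg", i) for i in range(100)]  # hypothetical records
minority = [("pos", i) for i in range(5)]

under_maj, under_min = random_undersample(majority, minority, rng)
over_maj, over_min = random_oversample(majority, minority, rng)
print(len(under_maj), len(under_min))
print(len(over_maj), len(over_min))
```

Under-sampling discards 95 of the 100 majority records here, which is the information loss listed among its cons, while over-sampling only repeats the same 5 minority records, which is why it can encourage overfitting.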

1.2.1 SMOTE

Synthetic Minority Oversampling TEchnique, or SMOTE for short, is an oversampling method. It works by creating synthetic samples from the minority class instead of creating copies.

SMOTE first selects a minority class instance at random (X1) and finds its k (here, k=4) nearest minority class neighbours (X11, X12, X13, X14). One of the k nearest neighbours, say X11, is then chosen at random and connected to X1 to form a line segment in the feature space. The synthetic instance is created at a randomly chosen point on that segment.
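The neighbour selection and interpolation can be sketched as follows. This is a simplified illustration with made-up 2-D points and a hypothetical helper name, not the implementation of any particular library.

```python
import math
import random

def smote_sample(point, minority, k, rng):
    """Create one synthetic point between `point` and one of its k nearest
    minority-class neighbours."""
    others = [p for p in minority if p is not point]
    neighbours = sorted(others, key=lambda p: math.dist(point, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(point, neighbour))

rng = random.Random(0)  # fixed seed for repeatability
# Hypothetical minority-class instances in a 2-D feature space
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
            (1.1, 1.3), (0.9, 0.7), (5.0, 5.0)]

x1 = rng.choice(minority)
synthetic = smote_sample(x1, minority, k=4, rng=rng)
print(x1, synthetic)
```

Because the synthetic point always lies on a segment between two existing minority instances, it stays inside the region the minority class already occupies, whatever the other classes look like nearby.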

Now let's consider our dataset, where there are 9900 instances of class 0 and 100 instances of class 1.

After over sampling the minority class using SMOTE, the transformed dataset can be visualized as below:

Here we have 9900 instances for both class 0 and class 1.

Pros:

Mitigates the problem of overfitting caused by random oversampling as synthetic examples are generated rather than a replication of instances

Cons:

While generating synthetic examples SMOTE does not take into consideration neighbouring examples from other classes. This can result in an increase in the overlapping of classes and can introduce additional noise

SMOTE is not very effective for high dimensional data

1.3 Hybrid Approach ( Under Sampling + Over Sampling)

The paper "SMOTE: Synthetic Minority Over-sampling Technique" (2002) suggested a hybrid approach of combining SMOTE with random under-sampling of the majority class.

Here, we first oversample the minority class to have 10 percent of the number of examples of the majority class (e.g. about 1,000), then use random under-sampling to reduce the majority class until the minority class is 50 percent of its size, i.e. the majority class has twice as many examples (e.g. about 2,000). The final class distribution after this sequence of transforms matches our expectations with a 1:2 ratio: 1980 (approx. 2000) examples in the majority class and 990 (approx. 1000) examples in the minority class.

2. Cost-Sensitive Learning

In cost-sensitive learning instead of each instance being either correctly or incorrectly classified, each class (or instance) is given a misclassification cost. Thus, instead of trying to optimize the accuracy, the problem is then to minimize the total misclassification cost. Here the penalty is associated with an incorrect prediction.

Many scikit-learn models provide the class_weight parameter, where we can specify a higher weight for the minority class using a dictionary.

For the logistic regression, we calculate the loss per instance using binary cross-entropy.

Loss= −y log(p) − (1−y)log(1−p).

However, suppose we set the class weights as {0:1,1:10}. The loss then becomes

NewLoss = −10*y log(p) − 1*(1−y)log(1−p).

So what happens here is that if our model gives a probability of 0.3 and we misclassify a positive example, the NewLoss takes a value of -10log(0.3) = 5.2287, and if our model gives a probability of 0.7 and we misclassify a negative example, the NewLoss takes a value of -log(0.3) = 0.52 (both computed with base-10 logarithms).

That means we penalize our model around ten times more when it misclassifies a positive minority example in this case.
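The weighted loss can be checked with a few lines of Python. This sketch uses natural logarithms, as is conventional for cross-entropy; a different base rescales both losses but leaves the tenfold ratio between them unchanged. The function name is chosen here for illustration.

```python
import math

def weighted_log_loss(y, p, class_weight):
    """Binary cross-entropy for one instance, scaled by the class weights."""
    return -(class_weight[1] * y * math.log(p)
             + class_weight[0] * (1 - y) * math.log(1 - p))

weights = {0: 1, 1: 10}
loss_pos = weighted_log_loss(1, 0.3, weights)  # misclassified positive
loss_neg = weighted_log_loss(0, 0.7, weights)  # misclassified negative
print(loss_pos, loss_neg, loss_pos / loss_neg)
```

Both cases are equally "wrong" in terms of predicted probability, so the whole difference between the two losses comes from the class weights.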

There is no fixed rule for picking the right class weights, so they are a hyperparameter to be tuned. However, if we want to derive the class weights from the distribution of the y variable, we can use compute_class_weight from sklearn.utils.class_weight.
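For reference, the "balanced" heuristic that scikit-learn documents for class weights is n_samples / (n_classes * count of each class). The sketch below reproduces that formula in plain Python rather than calling the library; the helper name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """n_samples / (n_classes * count_of_class) for every class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

y = [0] * 9900 + [1] * 100  # same class counts as the example dataset
class_weights = balanced_class_weights(y)
print(class_weights)
```

With 9900 and 100 instances, the minority class ends up weighted 99 times more heavily than the majority class, mirroring the 99:1 imbalance.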

Cost-sensitive algorithms include Logistic Regression, Decision Trees, Support Vector Machines, Artificial Neural Networks, Bagged decision trees, Random Forest, Stochastic Gradient Boosting.

3. Ensemble Models

3.1 Bagging:

Bagging is an abbreviation of Bootstrap Aggregating. The conventional bagging algorithm involves generating ‘n’ different bootstrap training samples with replacement, training the algorithm on each bootstrapped sample separately and then aggregating the predictions at the end.

Bagging is used for reducing overfitting in order to create strong learners for generating accurate predictions. Unlike boosting, bagging allows replacement in the bootstrapped sample.

Pros:

In noisy data environments, bagging outperforms boosting

Improved misclassification rate of the bagged classifier

Reduces overfitting

Cons:

Bagging works only if the base classifiers are not bad to begin with. Bagging bad classifiers can further degrade performance

3.2 Boosting( AdaBoost):

Boosting is an ensemble technique to combine weak learners to create a strong learner that can make accurate predictions. Boosting starts out with a base classifier / weak classifier that is prepared on the training data.

For example, consider a data set containing 1000 observations, of which 20 are labelled fraudulent. Equal weights w1 are assigned to all observations, and the base classifier correctly classifies 400 observations.

The weight of each of the 600 misclassified observations is increased to w2 and the weight of each of the correctly classified observations is reduced to w3.

In each iteration, these updated weighted observations are fed to the weak classifier to improve its performance. This process continues till the misclassification rate significantly decreases thereby resulting in a strong classifier.

Pros:

Good generalization- suited for any kind of classification problem

Very simple to implement

Cons:

Sensitive to noisy data and outliers

3.3 Gradient Boosting

Adaboost either requires the users to specify a set of weak learners or randomly generates the weak learners before the actual learning process. The weight of each learner is adjusted at every step depending on whether it predicts a sample correctly.

Gradient Boosting, in contrast, builds the first learner on the training dataset to predict the samples and calculates the loss (the difference between the real value and the output of the first learner). It then uses this loss to build an improved learner in the second stage.

At every step, the residual of the loss function is calculated using the Gradient Descent Method and the new residual becomes a target variable for the subsequent iteration.
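The residual-fitting loop can be illustrated with a deliberately small sketch. It assumes a regression setting with squared-error loss (where the negative gradient is simply the residual), one-dimensional inputs and single-split "stumps" as weak learners; the data, function names, number of rounds and learning rate are all made up for the illustration.

```python
def fit_stump(x, residuals):
    """Best single-threshold regression stump for 1-D inputs (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds it in."""
    base = sum(y) / len(y)  # the first learner: predict the mean
    learners = []

    def predict(xi):
        return base + learning_rate * sum(h(xi) for h in learners)

    for _ in range(n_rounds):
        # The residuals become the target of the next learner
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        learners.append(fit_stump(x, residuals))
    return predict

def mse(predict, x, y):
    return sum((yi - predict(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]  # made-up regression targets
baseline_error = mse(lambda xi: sum(y) / len(y), x, y)
boosted = gradient_boost(x, y)
boosted_error = mse(boosted, x, y)
print(baseline_error, boosted_error)
```

Each added stump only has to explain what the previous learners got wrong, which is why the training error keeps shrinking from round to round.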

Cons:

Gradient Boosted trees are harder to fit than Random forests

Might lead to overfitting if parameters are not tuned properly

3.3.1 Extreme Gradient Boosting(XGBoost)

Pros:

It is reported to be around 10 times faster than normal Gradient Boosting, as it implements parallel processing. It is highly flexible, as users can define custom optimization objectives and evaluation criteria, and it has an inbuilt mechanism to handle missing values.

Unlike gradient boosting which stops splitting a node as soon as it encounters a negative loss, XG Boost splits up to the maximum depth specified and prunes the tree backwards and removes splits beyond which there is only negative loss.

In many cases, synthetic techniques such as SMOTE and MSMOTE outperform the conventional oversampling and undersampling techniques. For better performance, we can use SMOTE or MSMOTE along with advanced boosting methods such as Gradient Boosting or XGBoost.

#machine learning #classification #SMOTE #Under-sampling #Over-Sampling #Ensemble #xgboost #class imbalance #imbalanced dataset

Class Imbalance in ML

Imbalanced classification refers to the classification predictive modelling problem where the number of examples in the training dataset for each class label is not balanced.

If there is a dataset consisting of 10000 genuine and 10 fraudulent transactions, the classifier will tend to classify fraudulent transactions as genuine transactions. The reason can be easily explained by the numbers. Suppose the machine learning algorithm has two possible outputs as follows:

Model 1 classified 7 out of 10 fraudulent transactions as genuine transactions and 10 out of 10000 genuine transactions as fraudulent transactions.

Model 2 classified 2 out of 10 fraudulent transactions as genuine transactions and 100 out of 10000 genuine transactions as fraudulent transactions.

If we take the number of mistakes as the measure of a model's performance, Model 1 has only 17 errors but Model 2 has 102 errors. However, if we want to minimize the fraudulent transactions that slip through, we should use Model 2. An algorithm that simply minimizes the number of errors will nevertheless pick Model 1, letting a lot of fraudulent transactions pass unrestricted.

Better Metrics

We can use better metrics than just counting the errors. They are built from the following counts:

True Positive (TP) – An example that is positive and is classified correctly as positive

True Negative (TN) – An example that is negative and is classified correctly as negative

False Positive (FP) – An example that is negative but is classified wrongly as positive

False Negative (FN) – An example that is positive but is classified wrongly as negative

Now let's find the performance of our models with respect to our new metrics.

In our case, the primary focus is to let as few fraudulent transactions through as possible, i.e. to have fewer false negatives. So, let's calculate the False Negative Rate, FNR = FN / (FN + TP), for both our models,

Model 1:

FNR_M1 = 7/ (7+3)

FNR_M1 = 0.7

Model 2:

FNR_M2 = 2/ (2+8)

FNR_M2 = 0.2

Now we see that the False Negative rate of Model 1 is at 70% while the False Negative rate of Model 2 is just 20% which makes it a better classifier.
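The same comparison can be reproduced in code from the four counts of each model, treating fraud as the positive class. The counts come from the example above; the helper names are chosen here for illustration.

```python
def false_negative_rate(fn, tp):
    return fn / (fn + tp)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Counts taken from the two models above (fraud = positive class)
model_1 = {"tp": 3, "fn": 7, "fp": 10, "tn": 9990}
model_2 = {"tp": 8, "fn": 2, "fp": 100, "tn": 9900}

for name, m in (("Model 1", model_1), ("Model 2", model_2)):
    print(name,
          "accuracy:", round(accuracy(**m), 4),
          "FNR:", false_negative_rate(m["fn"], m["tp"]))
```

Model 1 wins on accuracy while missing most of the fraud, which is exactly why accuracy alone is misleading on imbalanced data.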

#machine learning #classification #class imbalance #imbalanced dataset #metrics

Storing XML in Relational Database using Python

A database is an organised collection of structured data stored in a computer system. There are many database systems available, such as Oracle, PostgreSQL, MySQL and SQLite. Here we will be using SQLite, as support for it is built into Python.

Consider an XML file containing a CD catalog. I have taken the file from https://www.w3schools.com/xml/cd_catalog.xml . Let's start by designing the data model for the given XML.

Since SQLite is embedded into python, we can access it just by including the statement import sqlite3 . The connect() operation makes a connection to the database stored in the file cddb.sqlite in the current directory. A cursor() is like a file handle using which we can perform operations on the data stored in the database. To execute the commands we use the execute() method.

The first execute statement drops any existing tables for Artist, Company, Country and Track and creates a new table for each of them. According to our data model, we have set each id as an integer primary key that is unique, not null and automatically incremented.

Then we read the XML file that holds the data. The all variable contains the list of CDs. Using the for loop, we assign the values from the list to the fields of the respective tables. We insert the data with the SQL INSERT command, writing the values as question marks to indicate that the actual values are passed in as a tuple. Finally, we call commit() to force the data to be written to the database file.

Once the program is executed, cddb.sqlite file gets generated. In the file, under each table we can find the data that has been inserted from the XML.
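The steps described above can be sketched as follows. To keep the sketch self-contained it parses a small inline sample in the same shape as cd_catalog.xml and uses an in-memory database instead of cddb.sqlite; the column names and the get_id helper are assumptions made for this illustration, not necessarily the original data model.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Inline sample shaped like cd_catalog.xml (assumed, to avoid a download)
SAMPLE_XML = """<CATALOG>
  <CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST>
      <COUNTRY>USA</COUNTRY><COMPANY>Columbia</COMPANY></CD>
  <CD><TITLE>Hide your heart</TITLE><ARTIST>Bonnie Tyler</ARTIST>
      <COUNTRY>UK</COUNTRY><COMPANY>CBS Records</COMPANY></CD>
</CATALOG>"""

conn = sqlite3.connect(":memory:")  # the post uses the file cddb.sqlite
cur = conn.cursor()

# Assumed schema: one lookup table per entity, Track links them together
for table in ("Artist", "Company", "Country"):
    cur.execute(f"DROP TABLE IF EXISTS {table}")
    cur.execute(f"""CREATE TABLE {table} (
        id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
        name TEXT UNIQUE)""")
cur.execute("DROP TABLE IF EXISTS Track")
cur.execute("""CREATE TABLE Track (
    id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    title TEXT, artist_id INTEGER, company_id INTEGER, country_id INTEGER)""")

def get_id(table, name):
    """Insert `name` if it is new and return its row id."""
    cur.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    cur.execute(f"SELECT id FROM {table} WHERE name = ?", (name,))
    return cur.fetchone()[0]

for cd in ET.fromstring(SAMPLE_XML).findall("CD"):
    artist_id = get_id("Artist", cd.find("ARTIST").text)
    company_id = get_id("Company", cd.find("COMPANY").text)
    country_id = get_id("Country", cd.find("COUNTRY").text)
    cur.execute(
        "INSERT INTO Track (title, artist_id, company_id, country_id) "
        "VALUES (?, ?, ?, ?)",
        (cd.find("TITLE").text, artist_id, company_id, country_id))
conn.commit()

rows = cur.execute("""SELECT Track.title, Artist.name, Country.name
    FROM Track JOIN Artist ON Track.artist_id = Artist.id
               JOIN Country ON Track.country_id = Country.id
    ORDER BY Track.id""").fetchall()
print(rows)
```

Storing each artist, company and country once and referring to it by id is what makes the model relational: a name that appears on many CDs is written to the database only a single time.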

#XML #python #database #sqlite #relational database

De-serializing JSON

JSON (JavaScript Object Notation) is one of the text formats for serialization of structured data, and it is generally easier to parse than XML. It is one of the most popular formats and is used widely in modern applications. Data in this format looks like a Python dictionary with key-value pairs. Here is a simple program to parse data from JSON.

The loads (load string) function converts the data into a list of dictionaries. We can then access a value the same way we would look up a value in a dictionary.
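A minimal sketch of such a program, using the json module from the standard library; the records in the string are made up for illustration.

```python
import json

# Hypothetical JSON text: a list of objects
data = '''[
  {"id": "001", "name": "Asha", "x": "2"},
  {"id": "009", "name": "Ravi", "x": "7"}
]'''

info = json.loads(data)  # becomes a list of dictionaries
print("User count:", len(info))
for item in info:
    print(item["name"], item["id"], item["x"])
```

A JSON array maps to a Python list and a JSON object to a dictionary, so no custom parsing code is needed once loads has run.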

Some real-life examples include Google Maps, Twitter, YouTube and WordPress. Here we are going to retrieve data using the Google Maps API. The API defines the URL pattern, i.e. the syntax of the data we should send and the data we will receive. For the Google API, we take the URL given in the documentation and add the key parameters to it using urllib.parse.urlencode(). This encodes a space as “+” and a comma as “%2C”.

The above code prints the latitude, longitude and the formatted address of the given location.
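A rough sketch of building the request URL and reading the result is shown below. It makes no network call: the response text is a hand-written stand-in that follows the general shape of a geocoding reply, and the address and API key are placeholders.

```python
import json
import urllib.parse

SERVICE_URL = "https://maps.googleapis.com/maps/api/geocode/json?"
params = {"address": "Ann Arbor, MI", "key": "YOUR_API_KEY"}  # placeholders
url = SERVICE_URL + urllib.parse.urlencode(params)
print(url)

# Hand-written stand-in for the response body; a real run would fetch it
# with urllib.request.urlopen(url).read().decode()
response_text = '''{"status": "OK", "results": [{
  "formatted_address": "Ann Arbor, MI, USA",
  "geometry": {"location": {"lat": 42.28, "lng": -83.74}}}]}'''

js = json.loads(response_text)
if js.get("status") == "OK":
    location = js["results"][0]["geometry"]["location"]
    lat, lng = location["lat"], location["lng"]
    address = js["results"][0]["formatted_address"]
    print(lat, lng, address)
```

Checking the status field before indexing into results avoids a confusing KeyError or IndexError when the service returns an error payload instead of a match.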

#json #python #api #google maps

De-serializing XML

The main purpose of eXtensible Markup Language (XML) is to share structured data. Its syntax largely resembles HTML: it has tags that mark the start and end of an element, and attributes, which are key-value pairs.

At the time of data exchange, we must make sure that the sender and the receiver agree on the format of the data. To validate this we have a contract called an XML Schema. It specifies whether an element is simple or complex and whether its type is integer, string, decimal or date. Here is the sample XML

In the program we are parsing the xml data to retrieve the name of the food and the description.

The XML data from the URL is read and stored as a string in the data variable. The lst variable contains the list of food elements. By iterating through the list, we can retrieve the name and description of each food using the find method.
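A minimal sketch of that parsing step with xml.etree.ElementTree is given below. Instead of reading from a URL it uses a short inline menu, which is an assumption about the shape of the data (food elements with name and description children).

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the XML that the post reads from a URL
data = """<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price>
        <description>Two waffles with maple syrup</description></food>
  <food><name>French Toast</name><price>$4.50</price>
        <description>Thick slices of sourdough bread</description></food>
</breakfast_menu>"""

tree = ET.fromstring(data)
lst = tree.findall("food")  # list of food elements
foods = [(food.find("name").text, food.find("description").text)
         for food in lst]
for name, description in foods:
    print(name, "-", description)
```

Note that find returns None when a child element is missing, so real data may need a check before reading the text attribute.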

#xml #serialize #deserialize #python

Web crawling using BeautifulSoup

Web scraping means using a program to extract information from web pages. For example, we can write a program to retrieve the links and other info from web pages. This is also called spidering the web, and it may be illegal or against a site's terms of use when done on copyrighted websites. HTML from the web is usually inconsistent, and scraping data from it can be challenging.

To overcome this challenge, we use a library called BeautifulSoup, which we can install with pip. So here is a simple program of a web crawler,

Firstly, we import the urllib library and then BeautifulSoup from bs4. In this program we have scraped data from Wikipedia. All the data from the particular webpage is available in the html variable. soup contains the parsed HTML. tags contains the anchor tags in list format, so we can iterate through the list and print all the links from that website.
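The idea of collecting every anchor tag and printing its link can also be shown without any third-party package. The sketch below deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs as-is, and it parses a small made-up HTML string instead of fetching a Wikipedia page.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every anchor tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

# Made-up page content standing in for a downloaded web page
html = ('<html><body><a href="/wiki/Python">Python</a><p>text</p>'
        '<a href="https://example.org/">Example</a>'
        '<a name="top">no link</a></body></html>')

collector = LinkCollector()
collector.feed(html)
for link in collector.links:
    print(link)
```

BeautifulSoup remains the more forgiving choice for messy real-world HTML; this version only shows what iterating over the anchor tags amounts to.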

Below is the output of the program:

#web crawling #web scraping #python #beautifulsoup #html

Graphical Visualization of Quantitative Variables

My Hypothesis: Alcohol consumption is one of the high risk factors for acquiring Breast cancer and it influences Life Expectancy.

Below is the SAS code for creating univariate and bivariate graphs to analyse the relationship between the variables, alcconsumption, breastcancerper100TH, and life expectancy from GAPMINDER data set.

The univariate graph of alcohol consumption rate

This graph is bimodal, with its highest peak at the category of more than 8 L of alcohol consumption. The 0-2 category is almost as high, at approximately 32%.

This is the univariate graph of new cases of breast cancer

This is a non symmetric, skewed right graph, with its highest peak at the category 0-25 with 50% and the percentage of new cases decreases constantly.

This is the univariate graph of life expectancy,

This is a unimodal, symmetric, non-skewed graph, with its peak at the category greater than 50 and less than 75, indicated by ‘<75′.

The above bivariate graph shows a weak positive relationship between the variables alcconsumption and breastcancerper100th. However, it is evident that the data form a cluster for values 0-40 of breastcancerper100th and below 10 L of alcconsumption. Hence there exists a relationship between these variables within the above specified range.

The above graph shows the relationship between the variables alcconsumption and lifeexpectancy. It is non-linear and it's neither positive nor negative. Since the variables do not show any clear relationship, we can conclude that there is a null relationship between them.

Finally, from the graphical analysis, there exists a relationship between alcohol consumption rate and breast cancer, whereas we could not find any such trend between alcohol consumption rate and life expectancy.

#data analysis #secondary variables gapminder #data visualization #graph #barchart #correlation

Data Management for Quantitative Variables

My Hypothesis: Alcohol consumption is one of the high risk factors for acquiring Breast cancer and both these factors influence Life Expectancy.

For the analysis I have taken three variables from the Gapminder dataset: alcconsumption (2008 alcohol consumption per adult, age 15+), breastcancerper100TH (2002 breast cancer new cases per 100,000 females) and life expectancy (2011 life expectancy at birth, in years).

Since the variables of the gapminder data set are quantitative variables , data management for these variables is required. Due to large number of unique values that these variables take, I have grouped them into secondary variables level, new_cases, and span respectively.

For alcohol consumption rate, anything below 2 L falls under level 1 (low level), less than 4 L under level 2 (medium level), less than 8 L under level 3 (high level) and anything above 8 L under level 4 (very high level). For new cases of breast cancer, fewer than 25 cases fall under new_cases 1 (low risk), fewer than 50 under new_cases 2 (medium risk), fewer than 75 under new_cases 3 (high risk) and anything above 75 under new_cases 4 (very high risk). For life expectancy, anything below 50 falls under span 1 (low span), less than 75 under span 2 (normal span) and anything above 75 under span 3 (high span).
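The grouping rules can be expressed compactly with cut points. The post itself uses SAS; the sketch below is a Python rendering of the same rules, with hypothetical names, and it assumes that a value exactly equal to a cut point (for example exactly 8 L) goes into the higher category.

```python
from bisect import bisect_right

def to_category(value, cut_points):
    """Return 1 for values below the first cut point, 2 below the second, ..."""
    return bisect_right(cut_points, value) + 1

LEVEL_CUTS = [2, 4, 8]          # alcohol consumption, in litres
NEW_CASES_CUTS = [25, 50, 75]   # breast cancer cases per 100,000
SPAN_CUTS = [50, 75]            # life expectancy, in years

# Hypothetical country values
print(to_category(6.5, LEVEL_CUTS))       # level
print(to_category(30.0, NEW_CASES_CUTS))  # new_cases
print(to_category(80.2, SPAN_CUTS))       # span
```

Keeping the cut points in lists makes it easy to try different groupings without rewriting the categorisation logic.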

The following is the SAS code for data management of the variables.

From the results , we can infer that in majority of the countries the rate of alcohol consumption is higher than the normal level as indicated in the first table. From the second table we can see that very minimal percent of countries have shorter lifespan and more than 50% of the countries have a normal life span. From the last table we can see that around 50% of the new cases are low risk and only 12.68% and 8.45% are classified under high and very high risk.

#data management #data analysis #data analytics #secondary variables gapminder

Examining Frequency Distribution of Alcohol Consumption Rate, Breast Cancer and Life Expectancy

My Hypothesis: Alcohol consumption is one of the high risk factors for acquiring Breast cancer and both these factors influence Life Expectancy.

For the analysis I have taken three variables from the Gapminder data set: alcconsumption (2008 alcohol consumption per adult, age 15+), breastcancerper100TH (2002 breast cancer new cases per 100,000 females) and life expectancy (2011 life expectancy at birth, in years).

Below is the SAS code for analyzing frequency distribution:

The unique identifier for this data set is country. All the variables of this data set are quantitative variables. As they contain a large number of unique values, determining the frequency of each value was of no use. So, I have categorized the data using the PROC FORMAT statement in SAS, thereby forming groups that contain a range of values. The following are the results:

I also wanted to check the missing values and their frequency and percent, so I have used the following code: TABLES alcconsumption/missing, breastcancerper100TH/missing, life expectancy/missing. If we don’t want to include missing values in the calculation, we can replace the missing option with missprint.

From the results we can infer that in the majority of countries alcohol consumption is higher than 8.04 litres, while the standard average is 6.04 litres of pure alcohol. The largest number of new cases of breast cancer falls between low and medium risk, with values from 0.0 to 50.9 per 100,000 women. As for life expectancy, more than half of the countries have a normal life span.

#data analysis #data #frequency #distribution #visualization #sas #analytics

Exploration of interdependence among Alcohol consumption rate, Breast cancer and Life expectancy

The main aim is to interpret whether the rate of alcohol consumption is one of the high risk factors for developing breast cancer. For the analysis, I have taken the datasets from Gapminder.

I have further derived a subset consisting of three variables: alcohol consumption rate in litres in 2008 (alcconsumption), new cases of breast cancer per 100,000 females in 2002 (breastcancerper100TH) and life expectancy at birth in years in 2011 (lifeexpectancy).

By introducing a third variable I am extending my research to further identify if the life span/life expectancy is affected by consumption of alcohol and the risk of acquiring breast cancer.

[1] Alcohol Intake and Breast Cancer Risk: Weighing the Overall Evidence Jasmine A. McDonald, PhD, Abhishek Goyal, MD, MPH, and Mary Beth Terry, PhD

"The overall estimated association is an approximate 30-50% increase in breast cancer risk from 15-30 grams/day of alcohol consumption (about 1-2 drinks/day).Despite variability in defining light, moderate and heavy alcohol intake, studies have found a consistent modest association between higher alcohol intake and increased breast cancer risk."

[2] Alcohol policy impact case study: the effects of alcohol control measures on mortality and life expectancy in the Russian Federation (2019)

"Alcohol consumption has long been recognized as one of the main driving factors of mortality.Since 2003, both alcohol consumption and mortality were declining in parallel. In the period 2003–2018, all-cause mortality decreased by 39% in men and by 36% in women – a trend that was mirrored by an increase in life expectancy."

[3] National Cancer Institute’s Surveillance Epidemiology, and End Results (SEER) Program.

"Breast cancer is the most frequent cancer among women, impacting 2.1 million women each year, and also causes the greatest number of cancer-related deaths among women. In 2018, it is estimated that 627,000 women died from breast cancer – that is approximately 15% of all cancer deaths among women."

Based on this review of the literature, my inference is that alcohol consumption is one of the high risk factors for acquiring breast cancer, and both have an influence on mortality and life expectancy.

#data analysis #visualization #data #alcohol #breast cancer #lifespan
