The GAN objective, from practice to theory and back again
2018/1/26 Feed Summary
No Free Hunch
Another Tutorial For GAN: Describes a current challenge of GANs, mode collapse, where the Generator learns only one class of image pattern at a time and moves on once the Discriminator learns how to counter that learned behavior. The problem is said to be alleviated by backpropagating through the Discriminator together with the Generator. The post also links to ICCV material for more GAN details.
Other resource about mode collapse: Mode Collapse in GANs: Mode collapse happens when the target probability distribution is multi-modal (often true in real-life cases) and the Generator-Discriminator pair gets stuck in a cat-and-mouse cycle. In this never-ending cycle, the Generator first learns one mode until the Discriminator learns how to counter it, then switches to another mode, and so on. (This might imply that GANs work worse when the dataset consists of distinct distributions rather than one big mixture.) Solutions in the current literature can be roughly divided into four (or three) categories:
Produce more diverse samples within one batch. This can be done by studying the difference between minibatches of true images and of fake images, which leads to methods such as minibatch discrimination and feature matching.
Use a minimax objective instead: currently, the loss function used for GANs in Ian Goodfellow's 2016 paper does not take the whole play history into consideration, as most AI game agents do. Incorporating more simulation runs might mitigate this. Cons: time-consuming to train.
Randomly insert old batches into the current run: similar to category 2, this forces the Discriminator to keep a longer memory of what the Generator has produced before.
Boosting multiple GANs (the more accurate word is stacking): each GAN in the ensemble should learn one mode of the whole sample. Cons: time-consuming to train.
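To make category 3 concrete, here is a minimal sketch of a buffer that keeps old generated samples and mixes some of them into the current fake batch before it is shown to the Discriminator. The class name `ReplayBuffer`, the `replay_fraction` parameter and the replace-at-random eviction rule are my own assumptions for illustration, not something taken from the linked post.

```python
import random


class ReplayBuffer:
    """Keeps a bounded pool of previously generated (fake) samples.

    Hypothetical helper: names and eviction policy are illustrative only.
    """

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.samples = []
        self.rng = random.Random(seed)

    def add(self, batch):
        # Fill the pool first; once full, overwrite a random old entry.
        for sample in batch:
            if len(self.samples) < self.capacity:
                self.samples.append(sample)
            else:
                self.samples[self.rng.randrange(self.capacity)] = sample

    def mix(self, fresh_batch, replay_fraction=0.5):
        # Replace the tail of the fresh batch with randomly chosen old samples.
        n_old = min(int(len(fresh_batch) * replay_fraction), len(self.samples))
        old = self.rng.sample(self.samples, n_old)
        return fresh_batch[: len(fresh_batch) - n_old] + old


buffer = ReplayBuffer(capacity=4, seed=0)
first_fakes = ["fake_%d" % i for i in range(6)]   # stand-ins for generated images
buffer.add(first_fakes)

new_fakes = ["new_%d" % i for i in range(4)]
discriminator_batch = buffer.mix(new_fakes, replay_fraction=0.5)
buffer.add(new_fakes)
```

In a real training loop the strings would be image tensors, and `discriminator_batch` would be what the Discriminator sees as the "fake" half of its input, so it keeps being reminded of modes the Generator visited earlier.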
The author also suggests looking at the f-divergence used in the GAN objective function, and finds that the original GAN objective tends to settle into mode-seeking behavior. This problem should be alleviated when more information (as mentioned above) is included (see below).
The GAN objective, from practice to theory and back again: In this post, the author carefully examines the loss function used in DCGAN and experiments with different f-divergences, including KL divergence and JS divergence. The author contends that in theory the minimum is achieved when the proposed probability distribution equals the target distribution no matter which divergence is used; in practice, however, the choice of divergence does matter. In terms of how the proposed distribution handles the modes of the target distribution, the behavior of a divergence can be classified as "mode-seeking" (find the prominent mode) or "mode-covering" (try to average over all the modes). The divergence used in DCGAN is mode-seeking, as shown in the GAN improvement paper (which explains the mode collapse problem). The author further tries three other divergences: f-GAN, reverse KL divergence and JS divergence. The generated images are shown, but not the loss behavior during training.
PS: Recall the loss function used in DCGAN: it runs two feed-forward passes through the Discriminator, one with real images and the other with fake images. However, the backpropagation update is done only on the Generator.
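The mode-seeking versus mode-covering distinction can be checked on a toy problem. The sketch below is my own illustration, not the experiment from the post: it fits a single Gaussian q to a two-mode target p on a discretized grid by brute-force search, once minimizing KL(p || q) and once minimizing the reverse KL(q || p). The grid, the mixture parameters and all function names are assumptions chosen for the example.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def kl(a, b, eps=1e-300):
    # Discrete KL(a || b); eps guards against log of a zero probability.
    return sum(ai * math.log(ai / max(bi, eps)) for ai, bi in zip(a, b) if ai > 0.0)


# Grid on [-10, 10] and a target with two well-separated modes at -3 and +3.
XS = [-10.0 + 0.1 * i for i in range(201)]
TARGET = normalize(
    [0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5) for x in XS]
)


def best_fit(divergence):
    # Brute-force search over a small grid of (mu, sigma) candidates.
    best = None
    for mu in [-4.0 + 0.5 * i for i in range(17)]:
        for sigma in [0.5 * j for j in range(1, 9)]:
            q = normalize([normal_pdf(x, mu, sigma) for x in XS])
            score = divergence(q)
            if best is None or score < best[0]:
                best = (score, mu, sigma)
    return best[1], best[2]


# Forward KL(p || q): q is punished for missing any mass of p -> mode-covering.
mu_fwd, sigma_fwd = best_fit(lambda q: kl(TARGET, q))
# Reverse KL(q || p): q is punished for putting mass where p has none -> mode-seeking.
mu_rev, sigma_rev = best_fit(lambda q: kl(q, TARGET))

print("forward KL fit:", mu_fwd, sigma_fwd)
print("reverse KL fit:", mu_rev, sigma_rev)
```

Under these assumptions the forward-KL fit sits between the two modes with a wide spread, while the reverse-KL fit locks onto one of the two modes with a narrow spread, which is the single-mode behavior the notes associate with mode collapse.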
Algorithmia
Racial bias in facial recognition: the article discusses racial bias in face recognition, and how OpenFace serves as the "MNIST" of face recognition, a training set meant to standardize face recognition training.
Serverless Microservices: Introduction and [extend AWS Alexa]: Algorithmia features a fast deployment cycle by combining several development stages (from a monolithic architecture) into a single stage via microservices. Because the structure is serverless (FaaS), development can focus on the code rather than the environment (server). Cloud platforms currently providing serverless microservices include Google Cloud Functions, AWS Lambda, Azure Functions and the Algorithmia Serverless AI Cloud.
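As a rough picture of what "focus on the code, not the server" means, here is a minimal sketch of a function written in the style of an AWS Lambda Python handler, which receives an `event` and a `context` argument. The `name` field in the event and the greeting payload are hypothetical, and the `statusCode`/`body` response shape assumes the function sits behind an HTTP proxy integration; none of this comes from the Algorithmia articles.

```python
import json


def lambda_handler(event, context):
    # The platform calls this function once per request; there is no server
    # setup code here. "name" is a hypothetical field of the incoming event.
    name = (event or {}).get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello, " + name}),
    }


# Local smoke test: the handler is a plain function, so it can be called directly.
response = lambda_handler({"name": "GAN"}, None)
print(response["body"])
```

Deployment, scaling and routing are left to the platform; only the handler body is application code.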
Doing Bayesian Data Analysis
Problem of treating Ordinal Ranking as Metrics in Movie Rating: An ordinal variable is not appropriate to express as a distance or metric measure. It should be modeled with an ordered-probit model (similar to a latent model / Dirichlet process: each scale has its own descriptive distribution parameters, but the means of the scale distributions are forced to follow the pre-defined order). For example, in the movie rating problem, the 5-star rating system cannot be treated as a measurement because the exact scale is unknown. The only information we have is the ordinal scale given. With the assumption that the ordinal scale is the same for everyone (and therefore the same across all movies), one can model each scale as having its own distribution. In this post, the author also shows that the mean does not increase consistently or monotonically with the movie rating when the rating is treated as a metric (computing the average rating for one movie). The author also shows that two movies can have very similar average ratings (treated as metric) but very different means in the ordered-probit model, or that one movie can have a slightly worse average rating than another yet end up with a higher mean in the ordered-probit model. RW: However, the assumption that everyone has the same ordinal scale is not always true. But modeling sentiment labels as a multinomial distribution is probably not correct either. Hence, quoting the author's reply to a comment about the same-ordinal-scale assumption:
"there is no guarantee that a useful model is a correct model"
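The claim that the average star rating and the latent mean can disagree is easy to reproduce with a small forward simulation. This is only a sketch under my own assumptions: one latent normal distribution per movie, cut points shared by all movies, and made-up parameter values. It does not fit anything to data and is not the author's model code.

```python
import math

# Hypothetical cut points between 1|2, 2|3, 3|4 and 4|5 stars, shared by all movies.
THRESHOLDS = [1.5, 2.5, 3.5, 4.5]


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def star_probabilities(mu, sigma, thresholds=THRESHOLDS):
    # Probability of each star level = latent normal mass between adjacent cut points.
    cdf = [0.0] + [normal_cdf(t, mu, sigma) for t in thresholds] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]


def average_stars(probs):
    # The "metric" treatment: stars are taken as the numbers 1..5 and averaged.
    return sum((k + 1) * p for k, p in enumerate(probs))


# Movie A: high latent mean but very polarized ratings (made-up values).
movie_a = star_probabilities(mu=5.5, sigma=4.0)
# Movie B: lower latent mean but ratings tightly clustered around 4 stars.
movie_b = star_probabilities(mu=4.2, sigma=0.5)

avg_a = average_stars(movie_a)   # roughly 3.9
avg_b = average_stars(movie_b)   # roughly 4.2
print(round(avg_a, 2), round(avg_b, 2))
```

With these values movie A has the higher latent mean (5.5 versus 4.2) but the lower average star rating, because the top and bottom categories truncate its wide latent distribution.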
Supplementary reading: beyond one-hot encoding: in this post the author tries 7 encoding methods for categorical variables, which are:
One-hot encoding (or effects coding) is used to handle unordered categorical variables. It is often used to compare all groups to the grand mean.
Dummy coding is similar to one-hot encoding but with a control group (encoded as all zeros). It is often used to compare non-control groups to the control group.
Contrast coding is used when a priori knowledge of the distribution is available, where the between-group difference is known to be significantly larger than the within-group difference. It requires assigning orthogonal codes in a regression setting, and the assigned codes should be constrained to sum to zero. Other variants of contrast coding include polynomial coding (a variant of ordinal coding for when the spacing is all equal; not really understood), backward difference coding and Helmert coding.
Ordinal coding can be applied to categorical variables whose values are ordered; encoding this way directly takes the integer level as the value.
Binary: further turn the integer encoding into its binary representation; this often results in a shorter code length than one-hot.
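A few of the encodings above are short enough to write out by hand. The sketch below covers one-hot, dummy, ordinal and binary coding for a single variable; the function names, the example levels and the choice of the first level as the control group are my own assumptions, not code from the linked post.

```python
def one_hot(levels, value):
    # One column per level; exactly one column is 1.
    return [1 if value == level else 0 for level in levels]


def dummy(levels, value):
    # Like one-hot, but the first level is the control group and gets all zeros.
    return [1 if value == level else 0 for level in levels[1:]]


def ordinal(levels, value):
    # Levels are assumed to be listed in order; the code is the 1-based position.
    return levels.index(value) + 1


def binary(levels, value):
    # Write the 0-based position in binary, padded to a fixed width.
    width = max(1, (len(levels) - 1).bit_length())
    return [int(bit) for bit in format(levels.index(value), "0%db" % width)]


LEVELS = ["low", "medium", "high", "very high", "extreme"]  # hypothetical ordered levels

print(one_hot(LEVELS, "high"))   # [0, 0, 1, 0, 0]
print(dummy(LEVELS, "low"))      # [0, 0, 0, 0]
print(ordinal(LEVELS, "high"))   # 3
print(binary(LEVELS, "high"))    # [0, 1, 0]
```

For five levels the binary code needs 3 columns where one-hot needs 5, which is the "shorter code length" mentioned above.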
Today's Paper
Neural Machine Translation by Jointly Learning to Align and Translate: the original attention mechanism paper