Cloth Swapping with Deep Learning: Implement Conditional Analogy GAN in Keras

The Conditional Analogy GAN: Swapping Fashion Articles on People Images (link)

Given three input images: a human wearing cloth A, stand-alone cloth A, and stand-alone cloth B, the Conditional Analogy GAN (CAGAN) generates an image of the human wearing cloth B. See the figures below.

In my experiment, CAGAN was able to swap clothes across categories, for example long- and short-sleeve t-shirts (which is not shown in the original paper). In other words, CAGAN not only changed the color of clothes but also had to generate human body parts to transfer from the long-sleeve to the short-sleeve domain.



We gathered about 2,000 crawled human/article image pairs as training data.
(*Due to copyright concern, some images in this post are replaced by illustrations or cropped.)


  • Optimizer: Adam
  • Learning rate: 2e-4
  • Batch size: 16 (CAGAN) or 8 (CAGAN+StackGAN-v2)
  • Data augmentation: random cropping and flipping

GitHub repo.

A Keras implementation of CAGAN can be found here.
Notes about the implementation:

  • In the CAGAN paper, a note on implementation details reads: “In addition, we use always the last 6 channels of any intermediate layer (in both G and D) to store downsampled copies of the inputs x_i,y_i“. I did not fully understand what this means, so what I did was concatenate x_i and y_i to every intermediate layer. However, when concatenating x_i and y_i to the discriminator I could not get any successful result, only saturated noise. Thus this concatenation is applied only in the generator.
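The concatenation trick described above can be sketched in plain NumPy. This is a minimal illustration, not the actual Keras code from the repo: `downsample` and `concat_inputs` are hypothetical helper names, and real code would use average pooling rather than strided slicing.

```python
import numpy as np

def downsample(img, factor):
    """Naive strided downsampling (a stand-in for average pooling)."""
    return img[::factor, ::factor, :]

def concat_inputs(features, x_i, y_i):
    """Concatenate downsampled copies of the inputs x_i and y_i to an
    intermediate feature map along the channel axis."""
    factor = x_i.shape[0] // features.shape[0]
    return np.concatenate(
        [features, downsample(x_i, factor), downsample(y_i, factor)], axis=-1)

# A 32x24 feature map with 64 channels; inputs are 128x96 RGB images.
feats = np.zeros((32, 24, 64))
x_i = np.zeros((128, 96, 3))
y_i = np.zeros((128, 96, 3))
out = concat_inputs(feats, x_i, y_i)
print(out.shape)  # (32, 24, 70)
```

In Keras this would be a `Concatenate` layer fed by an `AveragePooling2D` (or similar) applied to the input tensors at each resolution.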

I. CycleGAN as our first try

Why CycleGAN? It’s my go-to solution for image-to-image translation, and there are already Keras implementations on GitHub.
Did it work? Yes, but the results lacked diversity and fidelity.

The Keras implementation of CycleGAN is borrowed from here.


Given stand-alone article images as input, the figure above shows generated human images after training for ~10k iterations. CycleGAN failed to generate human faces, and the body shapes are far from real. There is also mode collapse (similar human poses) in the bottom row.

II. Reimplement CAGAN

Why CAGAN? We want to generate realistic human images with different poses and take full advantage of the input human images.
Does it work? Yes.



Given three input images: x_i, a human wearing cloth A; y_i, stand-alone cloth A; and y_j, stand-alone cloth B, the Conditional Analogy GAN (CAGAN) generates a human image x_{ij} in which cloth A is swapped for cloth B. A discriminator helps improve the quality of the generated result by classifying three example pairs as true/false.


Architecture of generator G and discriminator D.

The generator is a typical U-Net, which concatenates early-layer features to later layers. The output of the generator is a four-channel tensor \left [\alpha , \hat{x}_{ij}^R, \hat{x}_{ij}^G, \hat{x}_{ij}^B \right ], where the superscripts R, G and B denote the corresponding color channels. This tensor is then merged with x_i into one RGB image by x_{ij}=\alpha \odot \hat{x}_{ij}^{RGB} + (1-\alpha) \odot x_i, where the symbol \odot represents pixel-wise multiplication.
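The alpha-compositing step can be written in a few lines. This is a minimal NumPy sketch of the merge equation above (`alpha_composite` is a hypothetical name; in the actual model this is a Keras `Lambda`/merge layer operating on tensors):

```python
import numpy as np

def alpha_composite(gen_out, x_i):
    """Merge the 4-channel generator output [alpha, R, G, B] with the
    original human image x_i: x_ij = alpha * x_rgb + (1 - alpha) * x_i."""
    alpha = gen_out[..., :1]   # (H, W, 1) mask, broadcast over the 3 color channels
    x_rgb = gen_out[..., 1:]   # (H, W, 3) generated RGB
    return alpha * x_rgb + (1.0 - alpha) * x_i

# With alpha == 0 everywhere, the original image passes through untouched.
x_i = np.ones((4, 4, 3)) * 0.5
gen_out = np.concatenate([np.zeros((4, 4, 1)), np.ones((4, 4, 3))], axis=-1)
print(np.allclose(alpha_composite(gen_out, x_i), x_i))  # True
```

This formulation lets the generator copy unchanged regions (background, face) directly from x_i and only synthesize pixels where the alpha mask is active.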

The discriminator is composed of several Conv2D layers, and its output is an 8x8x1 sigmoid map (for a 128x96x3 input), the so-called PatchGAN approach.

Training loss function:


In CAGAN training, three losses are applied. First, an adversarial loss L_{cGAN}(G,D), where \lambda and \mu are indices over the spatial dimensions of the 8x8x1 sigmoid output. Second, a regularization of the alpha mask, L_{id}(G), where ||\cdot|| is the L1 norm; L_{id}(G) “regularizes the outputs of G to change as little as possible from the original human image”. Third, the cycle loss L_{cyc}(G), also called reconstruction loss in some papers, which “force[s] consistent results when swapping clothes”.
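The loss equations appeared as images in the original post; as a rough sketch of the two non-adversarial terms described above (helper names `id_loss` and `cycle_loss` are my own, and the exact weighting follows the CAGAN paper rather than this code):

```python
import numpy as np

def l1(a, b=0.0):
    """Mean absolute difference (L1 norm, averaged over elements)."""
    return np.mean(np.abs(a - b))

def id_loss(alpha):
    """L_id: L1 penalty on the alpha mask, so G changes x_i as little as possible."""
    return l1(alpha)

def cycle_loss(x_i, x_cycled):
    """L_cyc: swapping cloth A -> B and then B -> A should reproduce x_i,
    i.e. x_cycled = G([G([x_i, y_i, y_j]), y_j, y_i])."""
    return l1(x_i, x_cycled)

x = np.ones((8, 6, 3))
print(cycle_loss(x, x))  # 0.0 when the cycle reconstructs x_i exactly
```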

Cherry-picked Results:


So, what’s the problem?
After training for over 3000 updates, repetitive artifacts appeared in the generated images, and human faces were sometimes distorted, which looks like this:

These artifacts can also be found in figure 6(c) and (d) of original CAGAN paper.
I think this is caused by the small bottleneck dimension (1/16 of the input size), so the architectures and approaches used in super-resolution-related tasks may be helpful, which leads to the next section.

III. CAGAN + StackGAN-v2

Why combine CAGAN with StackGAN-v2? To generate high-quality textures/graphics and to stabilize training.
Does it work? Kind of: it generates successful results more often, and its training is more stable.

Any tricks being used during training?
1. Add Gaussian noise to discriminator inputs.
2. Use mixup technique on discriminator inputs.
3. Change Conv2D kernel size to (4,3).
4. Add an identity loss to generator loss. See [Experiment Notes] 2 below.
5. The 64×48 cyclic output is merged as x_{ij(cyclic)}=\alpha \odot x_{ij(cyclic)}^{RGB} + (1-\alpha) \odot \hat{x}_{ij}^{RGB}, while \hat{x}_{ij}^{RGB} is replaced by x_{ij} in the 128×96 and 256×192 cyclic outputs.
6. Concatenate [x_i, y_j] instead of [x_i, y_i] (as in the CAGAN paper) to every intermediate layer.
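Tricks 1 and 2 above can be sketched in NumPy. This is an illustrative sketch, not the training code itself; `noisy` and `mixup` are hypothetical helper names, and `alpha=0.2` is a common mixup default rather than a value stated in the post:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy(batch, stddev=0.05):
    """Trick 1: add Gaussian noise to the discriminator inputs."""
    return batch + rng.normal(0.0, stddev, batch.shape)

def mixup(real, fake, alpha=0.2):
    """Trick 2: mixup -- blend a real and a fake batch (and their 1/0 labels)
    with a Beta(alpha, alpha)-distributed coefficient."""
    lam = rng.beta(alpha, alpha)
    mixed = lam * real + (1.0 - lam) * fake
    labels = np.full(len(real), lam)   # lam * 1 + (1 - lam) * 0
    return mixed, labels

real = np.ones((8, 128, 96, 3))
fake = np.zeros((8, 128, 96, 3))
mixed, labels = mixup(real, fake)
print(mixed.shape, labels.shape)  # (8, 128, 96, 3) (8,)
```

Both tricks soften the discriminator: input noise blurs the real/fake decision boundary, while mixup forces the discriminator to behave linearly between real and fake samples.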

The figure above shows the model architecture in detail. The model takes three images as input and generates three human images (with alpha masks) at different sizes. We made some modifications to the CAGAN architecture. First, inspired by this paper on image completion, we substituted some of the stride-2 Conv2D layers with dilated Conv2D layers so that the feature-map resolution halves only twice; this prevents output images from losing details. Second, a refiner network is introduced in the decoder. The refiner network consists of two stacks of residual blocks and learns to add detail as well as improve the realism of the output image. Furthermore, we apply a squeeze-and-excitation module on top of the residual blocks that (hopefully) learns to increase sensitivity to informative features.
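The squeeze-and-excitation module mentioned above can be sketched with plain NumPy. This is a conceptual sketch under assumed shapes, with random weights standing in for trained `Dense` layers; a reduction ratio of 16 is the common default from the SE-Net paper, not a value stated in this post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(feats, w1, w2):
    """Squeeze-and-excitation: global-average-pool each channel, pass the
    result through a two-layer bottleneck, and rescale channels by the gates."""
    s = feats.mean(axis=(0, 1))                   # squeeze: (C,)
    gates = sigmoid(np.maximum(s @ w1, 0) @ w2)   # excite: (C,) gates in (0, 1)
    return feats * gates                          # channel-wise rescaling

C, r = 64, 16
rng = np.random.default_rng(1)
w1 = rng.normal(size=(C, C // r))   # bottleneck weights (random, for illustration)
w2 = rng.normal(size=(C // r, C))
feats = rng.normal(size=(8, 6, C))
out = squeeze_excite(feats, w1, w2)
print(out.shape)  # (8, 6, 64)
```

In Keras this is a `GlobalAveragePooling2D` followed by two `Dense` layers (ReLU then sigmoid) whose output multiplies the residual block's feature map.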

Another backbone of our model is StackGAN-v2 (StackGAN++), which consists of multiple generators and discriminators in a tree-like structure. In our model, we use three stages of generators at different scales: 256 x 192, 128 x 96 and 64 x 48, with the deepest one generating the final output image. The StackGAN architecture helped stabilize training and improve output color authenticity (e.g., skin color).

Notice that we feed the input images x_i and y_j to almost every intermediate layer. This further improved training stability. We also found in experiments that concatenating x_i and y_j (instead of x_i and y_i as suggested in the CAGAN paper) preserves more detail, such as the graphics and textures of the target article, throughout the forward pass.

(The discriminator has the same structure as in CAGAN. It is purposely kept simple.)

Training loss function:
In addition to the loss functions used in CAGAN, we introduce two more: an identity loss L_{ident} and a color consistency loss L_{color} (from StackGAN-v2). They are defined as:

L_{ident}=\left \| x_{ident}-x_i \right \|, where x_{ident}=G([x_i, y_i, y_i]),
L_{color}=\lambda_1\left \| m_l - m_{l-1} \right \|_2^2+ \lambda_2\left \| \Sigma _l - \Sigma_{l-1} \right \|_F^2,

where m=\sum _{k} x_k/N  and \Sigma =\sum_{k}(x_k-m)(x_k-m)^T/N are the mean and covariance of the given image, x_k=(R,G,B)^T represents a pixel in a generated image, and N is the number of pixels.
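The color consistency loss above is easy to compute directly. A minimal NumPy sketch (the helper names and the weight values lam1/lam2 are assumptions for illustration, not taken from the post):

```python
import numpy as np

def color_stats(img):
    """Per-image mean m and covariance Sigma over pixels x_k = (R, G, B)^T."""
    pixels = img.reshape(-1, 3)
    m = pixels.mean(axis=0)
    d = pixels - m
    return m, d.T @ d / len(pixels)

def color_consistency_loss(img_l, img_prev, lam1=1.0, lam2=5.0):
    """L_color between outputs of adjacent generator stages (in practice the
    larger output would be downsampled to the smaller scale before comparing)."""
    m_l, s_l = color_stats(img_l)
    m_p, s_p = color_stats(img_prev)
    return lam1 * np.sum((m_l - m_p) ** 2) + lam2 * np.sum((s_l - s_p) ** 2)

img = np.random.default_rng(0).random((64, 48, 3))
print(color_consistency_loss(img, img))  # 0.0 for identical stage outputs
```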

The identity loss encourages the model to focus on the difference between y_i  and y_j. The color consistency loss is introduced to “keep samples generated from the same input at different generators more consistent in color and thus to improve the quality of the generated images”.

Cherry-picked Results:


The input images are shown as the first three images, followed by the corresponding generated human image(s) on the right. (Input images are replaced by illustrations.)



Long sleeves to short sleeves

The figure above shows the refinement of the output image at each stage: the upper example shows refinement in skin color and the lower one shows refinement in graphic color.


Our model is able to generate clearer graphics of the target article than the original CAGAN.



We can see from the result images that no repetitive artifacts are generated. (Although not shown, the artifacts on human faces are also reduced in our model.)

Great, what now?
Overall, I’m OK with the result, since our model is trained on only ~2000 image pairs (<1/7 of CAGAN’s). But it generates higher-quality images at the cost of increased model complexity, in other words, longer training time. What’s more, the generated images are still far from perfect. For example, our model cannot learn the distortion of worn clothing (relative to the stand-alone article). Most of the swapping results look like the target clothes were simply copy-pasted onto the human images and then refined, so the placement of graphics is usually off-position. We tried a Spatial Transformer Layer (with Thin Plate Spline transform) but unfortunately failed to obtain good results. Beyond that, there are plenty of defects we can think of: blurry edges, a low success rate, unawareness of necklines, etc.

Also, we did not conduct any quantitative evaluation; we judged performance only by visual quality. Here I would like to quote from Generative Adversarial Networks: An Overview (as an excuse for my lack of knowledge of GANs): “How can one gauge the fidelity of samples synthesized by a generative model? Should we use a likelihood estimation? Can a GAN trained using one methodology be compared to another (model comparison)? These are open-ended questions that are not only relevant for GANs, but also for probabilistic models, in general.”

What I’ve learnt from implementing CAGAN:

  1. Understanding the concepts/insights behind an architecture is more important than the architecture itself.
  2. Spending too much time tuning hyper-parameters, like kernel sizes and loss weighting factors, is unwise, since it usually leads to only trivial improvement.
  3. Intuition rarely works on neural networks: 99% of the things I thought would improve the results failed.
  4. Assign each layer a proper name so you can Ctrl+F for it in model.summary().
  5. Write a unit test that checks whether weights are updated after an iteration.
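The weight-update check from the last point can be sketched without any framework. This toy uses a linear model and one SGD step in NumPy; in Keras the same idea is to compare `model.get_weights()` before and after a `train_on_batch` call:

```python
import numpy as np

# Toy stand-in for a training framework: one SGD step on a linear model.
rng = np.random.default_rng(0)
w = rng.normal(size=(3,))
w_before = w.copy()

x, y = rng.normal(size=(16, 3)), rng.normal(size=(16,))
grad = 2.0 * x.T @ (x @ w - y) / len(x)   # gradient of the mean squared error
w -= 0.01 * grad                          # one training iteration

# The unit test itself: every trainable weight should move after an update;
# frozen or accidentally disconnected layers will fail this check.
assert not np.allclose(w, w_before), "weights were not updated!"
print("weights updated")
```

This catches a surprisingly common GAN bug: forgetting to toggle `trainable` flags correctly when alternating generator and discriminator updates.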

Update 25 Nov., 2017:
A new paper titled “VITON: An Image-based Virtual Try-on Network” from UMD (Larry Davis’ lab) presents impressive results on cloth swapping, basically making this post worthless LOL. I believe the TPS transform part of the paper could be replaced by a spatial transformer network.

Update 18 Feb., 2018:
Generative Adversarial Network-Based Virtual Try-On with Clothing Region: An ICLR workshop paper based on CAGAN, in which a human parsing network is introduced to segment out the clothing region; i.e., the alpha mask is no longer generated by the generator but by a pre-trained network.

Update 20 Nov., 2018:
SwapNet: Garment Transfer in Single View Images: An ECCV 2018 paper in which the authors propose a framework for “transferring garments across images of people with arbitrary body pose, shape, and clothing”. The network leverages pose and cloth segmentation as prior information. It also uses warping (as in VITON) to improve the texture details of the generated clothing.

[Experiment Notes]
1. Substituting Conv2DTranspose with nearest-neighbour up-sampling did not give better result images, since the cyclic x_i is too blurry and the alpha mask can’t learn well.
2. Adding an identity loss reduces checkerboard artifacts and also stabilizes the network against mode collapse. The identity loss is an L1 loss defined as loss_{L1}(idt, x_i), where idt=G([x_i, y_i, y_i]). (Further investigation is needed; I also modified the loss function a little in the same experiment.)
3. Perceptual loss (MobileNet in my experiment) did not help either. Perhaps this is because the cycle loss does not have much impact on the result image: the cycle-loss weighting factor \lambda is 10 in CycleGAN (tjwei’s Keras implementation) but 1 in CAGAN. I also tried a perceptual adversarial loss (substituting the discriminator with pre-trained CNNs), which led to mode collapse. Updated 11 Nov.: After examining the output feature maps from layers of MobileNet, I found that we cannot judge whether two humans are wearing similar clothes by the feature-map distance between the two human images. The difference is also affected by the color of the clothes and the human poses. For example, the feature-map difference between two humans wearing similar plain red t-shirts can be larger than the difference between one human wearing a black t-shirt and another wearing gray.
4. The StackGAN-v2 architecture might help. Still under experiment (tuning hyper-parameters); only got results similar to the original CAGAN so far. Done.
5. Models trained on images in Lab color space did not perform well on white articles.
6. Training did not converge when using a least-squares loss.
7. Concatenating the inputs [x_i, y_i] to every intermediate layer is crucial for generating better images. I wondered whether there’s another way to do this, e.g., using residual blocks instead, but I guess the authors had already tried that.
8. Using dilated convolution improves texture (e.g., graphics on t-shirts) quality a little bit.
9. Add an auxiliary local-context discriminator (inspired by this paper). I did not find a good way to plug this into CAGAN.
10. Use more dilated Conv2D layers and fewer stride-2 Conv2D layers in the generator (same paper as in 9).
11. Using a cyclic loss in a StackGAN-v2 architecture is similar in concept to the semantic consistency loss in XGAN (a recently published paper from Google Brain): both use intermediate-layer feature distances as a loss to encourage content consistency.
12. I’ve been wondering for a while whether WGAN-GP would bring better results. After skimming this paper, which compares GAN and WGAN(-GP) on a super-resolution task, I decided to postpone experimenting with WGAN.

4 thoughts on “Cloth Swapping with Deep Learning: Implement Conditional Analogy GAN in Keras”

  1. Hi, I am Xintong Han. I tried spatial transformer network (for at least two weeks) when working on the VITON paper. STN is very hard to converge even given the ground truth TPS parameters as part of the supervision, which makes it very hard to outperform shape context matching. If you make any progress on getting the STN work in this scenario, I am very glad to hear about that.


    1. Hi, Xintong. I tried STN on CAGAN and didn’t get successful results either.
      Regarding training STN in a supervised manner, did you mean that given two binary masks as inputs, the cloth mask M and the masked target clothing C (similar to the WarpNet cited in the VITON paper), the STN cannot learn the TPS parameters well?

