# Cloth Swapping with Deep Learning: Implement Conditional Analogy GAN in Keras

Paper: The Conditional Analogy GAN: Swapping Fashion Articles on People Images

Given three input images (a person wearing cloth A, a standalone image of cloth A, and a standalone image of cloth B), the Conditional Analogy GAN (CAGAN) generates an image of the same person wearing cloth B. See figures below.

In my experiments, CAGAN was able to swap clothes across categories, for example between long-sleeve and short-sleeve t-shirts (which is not shown in the original paper). In other words, CAGAN did not only change the color of the clothes; it also had to generate human body parts to go from the long-sleeve to the short-sleeve domain.

Preview:

## Dataset:

Images were crawled from zalora.com.tw. We gathered about 2000 human/article pairs as training data.
(*Due to copyright concerns, some images in this post have been replaced by illustrations or cropped.)

## Configuration

• Learning rate: 2e-4
• Batch size: 16 (CAGAN) or 8 (CAGAN+StackGAN-v2)
• Data augmentation: random cropping and flipping

## GitHub repo.

A Keras implementation of CAGAN can be found here.

• The CAGAN paper describes one implementation detail as follows: “In addition, we use always the last 6 channels of any intermediate layer (in both G and D) to store downsampled copies of the inputs $x_i,y_i$”. I do not fully understand what this means, so I simply concatenated $x_i$ and $y_i$ to every intermediate layer. However, when I applied this concatenation to the discriminator, I got nothing but saturated noise. The concatenation is therefore applied only to the generator.

## I. CycleGAN as our first try

Why CycleGAN? It’s my personal go-to solution for image-to-image generation, and there are already Keras implementations on GitHub.
Did it work? Yes, but the results lack diversity and fidelity.

The Keras implementation of CycleGAN is borrowed from here.

Results:
Given standalone article images as input, the figure above shows the generated human images after training for ~10k iterations. CycleGAN failed to generate human faces, and the body shapes are far from realistic. There is also mode collapse (similar human poses) in the bottom row.

## II. Reimplement CAGAN

Why CAGAN? We want to generate realistic human images in different poses and to make full use of the input human images.
Does it work? Yes.

Overview:

Given three input images ($x_i$, a human wearing cloth A; $y_i$, standalone cloth A; and $y_j$, standalone cloth B), the Conditional Analogy GAN (CAGAN) generates a human image $x_{ij}$ in which the cloth is swapped from A to B. A discriminator helps improve the quality of the generated results by classifying three example pairs as True/False.

Architecture:

The generator is a typical U-Net, which concatenates early-layer features to later layers. The output of the generator is a four-channel tensor containing $\left [\alpha , \hat{x}_{ij}^R, \hat{x}_{ij}^G, \hat{x}_{ij}^B \right ]$, where the superscripts R, G and B denote the corresponding color channels. This tensor is then merged with $x_i$ into one RGB image by $x_{ij}=\alpha \odot \hat{x}_{ij}^{RGB} + (1-\alpha) \odot x_i$. The symbol $\odot$ represents pixel-wise multiplication.

The discriminator is composed of several Conv2D layers, and its output is an 8x8x1 sigmoid map (input size 128x96x3), a so-called PatchGAN approach.
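The merge step can be sketched in a few lines of NumPy. This is only an illustration of the blending formula, assuming a channels-last layout with the alpha mask stored in the first channel; the function name `blend_with_alpha` is hypothetical and not taken from the repo.

```python
import numpy as np

def blend_with_alpha(generator_output, x_i):
    """Merge a 4-channel generator output [alpha, R, G, B] with the input image x_i.

    generator_output: (..., H, W, 4), alpha assumed to lie in [0, 1]
    x_i:              (..., H, W, 3), the original human image
    """
    alpha = generator_output[..., :1]       # (..., H, W, 1), broadcasts over RGB
    x_hat_rgb = generator_output[..., 1:]   # (..., H, W, 3)
    return alpha * x_hat_rgb + (1.0 - alpha) * x_i

# Toy example with random data at the 128x96 resolution used in this post.
rng = np.random.default_rng(0)
x_i = rng.uniform(-1.0, 1.0, size=(128, 96, 3))
g_out = rng.uniform(0.0, 1.0, size=(128, 96, 4))
x_ij = blend_with_alpha(g_out, x_i)
```

Where alpha is 0 the output pixel is copied from $x_i$, and where alpha is 1 it comes entirely from the generated RGB channels.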

Training loss function:

Three losses are applied in CAGAN training. First, an adversarial loss $L_{cGAN}(G,D)$, which is summed over $\lambda$ and $\mu$, the spatial indices of the 8x8x1 sigmoid output. Second, a regularization of the alpha mask, $L_{id}(G)$, where ||$\cdot$|| is the L1 norm; $L_{id}(G)$ “regularizes the outputs of G to change as little as possible from the original human image”. Third, the cycle loss $L_{cyc}(G)$, also called reconstruction loss in some papers, which is meant to “force consistent results when swapping clothes”.
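As a rough sketch of how the three terms could be computed, here is a NumPy version. It rests on several assumptions of mine rather than on the repo code: the three discriminator pairs are taken to be $(x_i, y_i)$ as real and $(x_{ij}, y_j)$ and $(x_i, y_j)$ as fake, the adversarial term is written as a binary cross-entropy summed over the patch map, and the generator uses the common non-saturating form. All function names are hypothetical.

```python
import numpy as np

EPS = 1e-7  # keeps log() away from zero

def patch_bce(pred, target):
    """Binary cross-entropy summed over the patch map, averaged over the batch.

    pred: (batch, H', W', 1) sigmoid outputs; target: scalar 0.0 or 1.0.
    """
    pred = np.clip(pred, EPS, 1.0 - EPS)
    bce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    return bce.sum(axis=(1, 2, 3)).mean()

def discriminator_loss(d_real, d_swapped, d_mismatch):
    """Assumed pairs: D(x_i, y_i) -> True, D(x_ij, y_j) -> False, D(x_i, y_j) -> False."""
    return (patch_bce(d_real, 1.0)
            + patch_bce(d_swapped, 0.0)
            + patch_bce(d_mismatch, 0.0))

def generator_adversarial_loss(d_swapped):
    """Non-saturating generator loss (an assumption): make D(x_ij, y_j) look True."""
    return patch_bce(d_swapped, 1.0)

def alpha_regularization(alpha):
    """L_id: mean L1 norm of the alpha mask, pushing G to change x_i as little as possible."""
    return np.abs(alpha).mean()

def cycle_loss(x_i, x_cycled):
    """L_cyc: mean L1 distance between x_i and the image obtained by swapping back to cloth A."""
    return np.abs(x_i - x_cycled).mean()
```

The generator objective would then be a weighted sum of `generator_adversarial_loss`, `alpha_regularization` and `cycle_loss`; the weights are left out here because this section does not state them.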

Cherry-picked Results:

So, what’s the problem?
When training for over 3000 updates, there were repetitive artifacts in the generated images, and sometimes human faces were distorted, which looks like this:

These artifacts can also be found in figure 6(c) and (d) of the original CAGAN paper.
I think they are caused by the small bottleneck dimension (1/16 of the input size), so architectures and approaches used in super-resolution tasks may help, which leads to the next section.

## III. CAGAN + StackGAN-v2

Why combine CAGAN with StackGAN-v2? We want to generate high-quality textures/graphics and also stabilize training.
Does it work? Kind of. It generated successful results more often, and its training is more stable.

Any tricks being used during training?
1. Add Gaussian noise to discriminator inputs.
2. Use mixup technique on discriminator inputs.
3. Change Conv2D kernel size to (4,3).
4. Add an identity loss to generator loss. See [Experiment Notes] 2 below.
5. The 64×48 cyclic output is merged as $x_{ij(cyclic)}=\alpha \odot x_{ij(cyclic)}^{RGB} + (1-\alpha) \odot \hat{x}_{ij}^{RGB}$, whereas in the 128×96 and 256×192 cyclic outputs $\hat{x}_{ij}^{RGB}$ is replaced by $x_{ij}$.
6. Concatenate $[x_i, y_j]$ instead of $[x_i, y_i]$ (as in the CAGAN paper) to every intermediate layer.
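Tricks 1 and 2 can be sketched as follows. This is a minimal NumPy illustration, not the code from the repo: the noise standard deviation and the Beta parameter are placeholder arguments I introduce, and the function names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(images, stddev, rng):
    """Trick 1: add zero-mean Gaussian noise to a batch of discriminator inputs."""
    return images + rng.normal(0.0, stddev, size=images.shape)

def mixup_discriminator_batch(real, fake, beta_alpha, rng):
    """Trick 2: mixup between real and fake discriminator inputs.

    Each sample is lam * real + (1 - lam) * fake with lam ~ Beta(beta_alpha, beta_alpha),
    and the soft label is lam (1 = real, 0 = fake).
    """
    lam = rng.beta(beta_alpha, beta_alpha, size=(real.shape[0], 1, 1, 1))
    mixed = lam * real + (1.0 - lam) * fake
    labels = lam.reshape(-1)  # one soft label per sample
    return mixed, labels

rng = np.random.default_rng(0)
real = rng.uniform(-1.0, 1.0, size=(8, 128, 96, 6))  # e.g. an image pair stacked on channels
fake = rng.uniform(-1.0, 1.0, size=(8, 128, 96, 6))
mixed, labels = mixup_discriminator_batch(real, fake, beta_alpha=0.2, rng=rng)
noisy = add_gaussian_noise(mixed, stddev=0.05, rng=rng)
```

With a PatchGAN discriminator, each per-sample label would be broadcast to the shape of the patch output before computing the loss.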

Architecture:
The figure above shows the model architecture in detail. The model takes three images as input and generates three human images (with alpha masks) of different sizes at its end. We made some modifications to the CAGAN architecture. First, inspired by this paper on image completion, we substituted some of the stride-2 Conv2D layers with dilated Conv2D layers so that the feature map resolution is halved only twice. This prevents the output images from losing detail. Second, a refiner network is introduced in the decoder. The refiner network consists of two stacks of residual blocks; it learns to add detail and to improve the realism of the output image. Furthermore, we apply a squeeze-and-excitation module on top of the residual blocks, which (hopefully) learns to increase sensitivity to informative features.

Another backbone of our model is StackGAN-v2 (StackGAN++). StackGAN-v2 consists of multiple generators and discriminators in a tree-like structure. In our model, we use three stages of generators at different scales: 256 x 192, 128 x 96 and 64 x 48, with the deepest one generating the final output image. The StackGAN architecture helped to stabilize training and improve the color authenticity of the output (e.g., skin color).

Notice that we feed the input images, $x_i$ and $y_j$, to almost every intermediate layer. This further improved training stability. We also found in experiments that concatenating $x_i$ and $y_j$ (instead of $x_i$ and $y_i$ as suggested in the CAGAN paper) preserves more detail, such as the graphics and textures of the target article, throughout the forward pass.

(The discriminator has the same structure as in CAGAN. It is purposely kept simple.)
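For readers unfamiliar with squeeze-and-excitation, the idea can be sketched in NumPy for a single feature map. This shows the generic SE computation (global average pooling, a two-layer bottleneck, sigmoid gates), not the exact module used in our model; the weight shapes and the reduction ratio are assumptions for illustration.

```python
import numpy as np

def squeeze_excite(features, w1, b1, w2, b2):
    """Generic squeeze-and-excitation on one feature map.

    features: (H, W, C)
    w1, b1:   (C, C // r), (C // r,)  bottleneck layer followed by ReLU
    w2, b2:   (C // r, C), (C,)       expansion layer followed by sigmoid
    """
    squeezed = features.mean(axis=(0, 1))                # squeeze: global average pool -> (C,)
    hidden = np.maximum(0.0, squeezed @ w1 + b1)         # ReLU bottleneck -> (C // r,)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))    # per-channel gates in (0, 1)
    return features * gates                              # excite: rescale each channel

rng = np.random.default_rng(0)
C, r = 16, 4                                             # r is an assumed reduction ratio
features = rng.normal(size=(32, 24, C))
w1, b1 = rng.normal(size=(C, C // r)) * 0.1, np.zeros(C // r)
w2, b2 = rng.normal(size=(C // r, C)) * 0.1, np.zeros(C)
out = squeeze_excite(features, w1, b1, w2, b2)
```

In a residual block, the rescaled features would be added back to the block input as usual.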

Training loss function:
In addition to the loss functions used in CAGAN, we introduced two more: an identity loss $L_{ident}$ and a color consistency loss $L_{color}$ (from StackGAN-v2). They are defined as:

$L_{ident}=\left \| x_{ident}-x_i \right \|_1$, where $x_{ident}=G([x_i, y_i, y_i]),$
$L_{color}=\lambda_1\left \| m_l - m_{l-1} \right \|_2^2+ \lambda_2\left \| \Sigma _l - \Sigma_{l-1} \right \|_F^2$,

where $m=\sum _{k} x_k/N$ and $\Sigma =\sum_{k}(x_k-m)(x_k-m)^T/N$ are the mean and covariance of the pixels of a given image, $x_k=(R,G,B)^T$ represents a pixel in a generated image, $N$ is the number of pixels, and the subscripts $l$ and $l-1$ denote outputs of consecutive generator stages.
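The two extra losses can be sketched in NumPy as below. This is an illustration under assumptions: the covariance is normalized by the number of pixels $N$, `lambda_2` is an assumed weight on the covariance term (set it to 1 to match a formula without it), and the function names are hypothetical.

```python
import numpy as np

def pixel_mean_and_cov(image):
    """Mean (3,) and covariance (3, 3) of the RGB pixels of one image of shape (H, W, 3)."""
    pixels = image.reshape(-1, 3)                       # N x 3, one row per pixel
    m = pixels.mean(axis=0)
    centered = pixels - m
    cov = centered.T @ centered / pixels.shape[0]       # normalized by N (assumption)
    return m, cov

def color_consistency_loss(image_l, image_prev, lambda_1, lambda_2):
    """Squared L2 distance of means plus squared Frobenius distance of covariances."""
    m_l, cov_l = pixel_mean_and_cov(image_l)
    m_p, cov_p = pixel_mean_and_cov(image_prev)
    return (lambda_1 * np.sum((m_l - m_p) ** 2)
            + lambda_2 * np.sum((cov_l - cov_p) ** 2))

def identity_loss(x_ident, x_i):
    """Mean L1 distance between G([x_i, y_i, y_i]) and the original x_i."""
    return np.abs(x_ident - x_i).mean()

rng = np.random.default_rng(0)
out_128 = rng.uniform(-1.0, 1.0, size=(128, 96, 3))     # stand-in for a stage output
out_64 = out_128[::2, ::2]                               # crude 64x48 stand-in for the previous stage
loss = color_consistency_loss(out_128, out_64, lambda_1=1.0, lambda_2=1.0)
```

Because the statistics are computed per image over all pixels, the two images do not need to have the same resolution.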

The identity loss encourages the model to focus on the difference between $y_i$ and $y_j$. The color consistency loss is introduced to “keep samples generated from the same input at different generators more consistent in color and thus to improve the quality of the generated images”.

Cherry-picked Results:

Input images are shown as the first three images, followed by the corresponding generated human image(s) on the right. (Input images are replaced by illustrations.)

Intra-category

Long sleeves to short sleeves

The above figure shows refinement of output image in each stage: the upper image shows refinement in skin color and the lower one shows refinement in graphic color.

Graphics

Our model is able to generate clearer graphics of the target article than the original CAGAN.

Others

We can see from the result images that no repetitive artifacts are generated. (Although not shown here, the artifacts on human faces are also reduced in our model.)

Great, what now?
Overall, I’m OK with the results, given that our model was trained on only ~2000 image pairs (less than 1/7 of what CAGAN used). However, the higher image quality comes at the cost of increased model complexity, in other words, longer training time. What’s more, the generated images are still far from perfect. For example, our model cannot learn how a cloth deforms when it is worn (compared with the standalone cloth). Most of the swapping results look as if the target cloth was simply copy-pasted onto the human image and then refined, so the graphics are usually placed off position. We tried a Spatial Transformer Layer (with a Thin Plate Spline transform) but unfortunately failed to obtain good results. There are plenty of other defects we can think of: blunt edges, a low success rate, unawareness of necklines, etc.

Also, we did not conduct any quantitative evaluation. We only judged performance by visual quality. Here I would like to quote from Generative Adversarial Networks: An Overview (as an excuse for my lack of knowledge of GANs): “How can one gauge the fidelity of samples synthesized by a generative models? Should we use a likelihood estimation? Can a GAN trained using one methodology be compared to another (model comparison)? These are open-ended questions that are not only relevant for GANs, but also for probabilistic models, in general.”

## What I’ve learnt from implementing CAGAN:

1. Understanding the concepts/insights behind an architecture is more important than the architecture itself.
2. Spending too much time tuning hyper-parameters, such as kernel size or the weighting factors of the loss function, is unwise, since it always leads to only trivial improvement.
3. Intuition NEVER worked on neural networks. What I thought would improve the results failed 99% of the time.
4. Assign each layer a proper name so I can Ctrl+F for it in model.summary().
5. Write a unit test that checks whether weights are updated after an iteration.
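The check in item 5 can be sketched as a small framework-agnostic helper. The helper name and the toy model below are hypothetical; with Keras one would presumably pass `model.get_weights` and a function that calls `model.train_on_batch(...)` on one batch.

```python
import numpy as np

def find_unchanged_weights(get_weights, train_step):
    """Run one training step and return the indices of weight arrays that did not change.

    get_weights: callable returning a list of NumPy arrays (e.g. model.get_weights in Keras)
    train_step:  callable performing exactly one update (e.g. a train_on_batch call)
    """
    before = [np.array(w, copy=True) for w in get_weights()]  # snapshot before the update
    train_step()
    after = get_weights()
    return [i for i, (b, a) in enumerate(zip(before, after)) if np.array_equal(b, a)]

class ToyModel:
    """Stand-in for a real model: only the first weight array receives an update."""
    def __init__(self):
        self.weights = [np.ones((2, 2)), np.zeros(2)]

    def get_weights(self):
        return self.weights

    def train_step(self):
        self.weights[0] = self.weights[0] - 0.1  # pretend gradient step on weight 0 only

toy = ToyModel()
unchanged = find_unchanged_weights(toy.get_weights, toy.train_step)
print(unchanged)  # indices of weights that were not updated
```

An empty list means every weight array moved. Layers that are frozen on purpose will legitimately show up in the list, so the test should exclude them.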

Update 25 Nov., 2017:
A new paper titled “VITON: An Image-based Virtual Try-on Network” from UMD (Larry Davis’ lab) presents impressive results on cloth swapping, basically making this post worthless LOL. I believe the TPS transform part of the paper could be replaced by a spatial transformer network.

Update 18 Feb., 2018:
Generative Adversarial Network-Based Virtual Try-On with Clothing Region: An ICLR workshop paper based on CAGAN, in which a human parsing network is introduced to segment out the clothing region. That is, the alpha mask is no longer generated by the generator but by a pre-trained network.

Update 20 Nov., 2018:
SwapNet: Garment Transfer in Single View Images: An ECCV 2018 paper in which the authors propose a framework for “transferring garments across images of people with arbitrary body pose, shape, and clothing”. The network leverages pose and cloth segmentation as prior information. It also uses warping (as in VITON) to improve the texture details of the generated clothing.

[Experiment Notes]
1. Substituting Conv2DTranspose with nearest-neighbour up-sampling did not give better result images, since the cyclic $x_i$ is too blurry and the alpha mask can’t learn well.
2. Adding an identity loss reduces checkerboard artifacts and also stabilizes the network against mode collapse. The identity loss is an L1 loss defined as $loss_{L1}(idt, x_i)$, where $idt=G([x_i, y_i, y_i])$. (Further investigation is needed; I also modified the loss function a little in the same experiment.)
3. Perceptual loss (MobileNet in my experiment) did not help either. Perhaps this is because the cycle loss does not have much impact on the result image: the cycle loss weighting factor $\lambda$ is 10 in CycleGAN (tjwei’s Keras implementation) and 1 in CAGAN. I also tried a perceptual adversarial loss (which substitutes the discriminator with pre-trained CNNs), which led to mode collapse. Updated 11 Nov.: After examining the output feature maps from layers of MobileNet, I found that we cannot judge whether two humans are wearing similar clothes by the feature-map distance between the two human images. The difference is also affected by the color of the clothes and by human pose. For example, the feature-map difference between two humans wearing similar plain red t-shirts is larger than the difference between one human wearing a black t-shirt and another wearing a gray one.
4. The StackGAN-v2 architecture might help. Still under experiment (tuning hyper-parameters); so far it only gives results similar to the original CAGAN. Done.
5. Models trained on images in Lab color space did not perform well on white articles.
6. Training didn’t converge when using a least-squares loss.
7. Concatenating the input [$x_i, y_i$] to every intermediate layer is crucial for generating better images. I was wondering if there’s another way to do this, e.g., using residual blocks instead. But I guess the authors had tried it.
8. Using dilated convolution improves texture quality (e.g., graphics on t-shirts) a little bit.
9. Add an auxiliary local context discriminator (inspired by this paper): did not find a good way to plug this into CAGAN.
10. Use more dilated Conv2D and fewer stride-2 Conv2D layers in the generator (same paper as in 9).
11. Using cyclic loss in a StackGAN-v2 architecture is a concept similar to the semantic consistency loss in XGAN (a recently published paper from Google Brain): both use the distance between intermediate layers’ features as a loss to encourage content consistency.
12. I’ve been wondering for a while whether WGAN-GP would bring better results. After skimming through this paper, which compares GAN and WGAN(-GP) on a super-resolution task, I decided to postpone experimenting with WGAN.

## 4 thoughts on “Cloth Swapping with Deep Learning: Implement Conditional Analogy GAN in Keras”

1. Xintong Han says:

Hi, I am Xintong Han. I tried spatial transformer network (for at least two weeks) when working on the VITON paper. STN is very hard to converge even given the ground truth TPS parameters as part of the supervision, which makes it very hard to outperform shape context matching. If you make any progress on getting the STN work in this scenario, I am very glad to hear about that.

1. Hi, Xintong. I tried STN on CAGAN and didn’t get successful results either.
Regarding training STN in a supervised manner, did you mean that given two binary masks as inputs, the cloth mask M and the masked target clothing C (similar to the WarpNet cited in the VITON paper), the STN cannot learn the TPS parameters well?

2. tfxsuccess says:

i am struggling with virtual dressing room app. did any one know any such site having coode and detail technical stuff
tfxsuccess@gmail.com

3. Anonymous says:

Long sleeves to short sleeves,it seems worse than others.
