Some jottings on the baseline paper for my MAI project

It has been over a semester since my MAI programme started here at KU Leuven! What an exciting new start after flying 30 hours from coastal Auckland in the Southern Hemisphere to the historic university town of Leuven! As my final semester commenced today, I thought it was probably time to write down some bits and pieces of my thoughts on the baseline paper for my MAI project.

After doing my Honours research in unsupervised machine learning, specifically data stream mining, I decided to try my hand at some other trendy fields of AI, which led to the project I am currently working on: image-text alignment for artworks. This project sits at the overlap of NLP and CV and requires knowledge of both areas.

Why do we need to dig into the alignment between text and images? Andrej Karpathy and Li Fei-Fei addressed this question in their CVPR 2015 paper [1].

Since Karpathy and Li proposed visual-semantic alignments, the topic has been revisited many times, and a lot of great work has come out that significantly improves performance on the visual-semantic alignment task.

For my thesis project, the image dataset we chose focuses mainly on artworks. Unlike most images of real-life scenes, artworks may contain more subtle features and objects, which requires our model to achieve alignment at a very fine-grained level. The baseline model I adopted is Stacked Cross Attention (SCAN) [2].

Lee et al. [2] point out several drawbacks of existing mainstream image-text alignment models, such as:

This paper aims to combine the above aspects (the attention mechanism and the use of image-sentence pairs): it first extracts features from the image and the sentence, then applies attention between each region and each word, and finally computes the similarity, so that attention yields a finer-grained alignment.

The paper also adopts some existing optimisation techniques, such as hard negatives and a triplet ranking loss [3].

The proposed model has the following parts:

In simple terms:

The text-image formulation follows exactly the same steps, except that each word attends over the image regions and the similarity is computed between the word and its attended image vector, so I will not repeat the details here.
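To make the attend-then-compare idea concrete, here is a minimal sketch in plain Python. It is my own simplification, not the authors' implementation: the function names, the temperature value of 9.0 and the average pooling are assumptions, and I leave out the extra normalisation of the similarity matrix that the paper applies. Each query vector attends over all context vectors, and the relevance between the query and its attended vector is averaged. Passing regions as queries and words as contexts gives the image-text direction; swapping them gives the text-image direction.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b, eps=1e-8):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def softmax(scores, temperature):
    # Numerically stable softmax over temperature-scaled scores.
    scaled = [temperature * s for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_similarity(queries, contexts, temperature=9.0):
    """Each query attends over all contexts; return the mean relevance.

    queries, contexts: lists of equal-length feature vectors (hypothetical
    stand-ins for region and word features). temperature is an assumed value.
    """
    dim = len(contexts[0])
    relevances = []
    for q in queries:
        # Similarity of this query to every context, negatives clipped to zero.
        sims = [max(cosine(q, c), 0.0) for c in contexts]
        weights = softmax(sims, temperature)
        # Attended vector: attention-weighted sum of the context vectors.
        attended = [sum(w * c[k] for w, c in zip(weights, contexts))
                    for k in range(dim)]
        relevances.append(cosine(q, attended))
    return sum(relevances) / len(relevances)
```

For example, with `regions` and `words` as two lists of feature vectors in the same embedding space, `cross_attention_similarity(words, regions)` scores the pair in the text-image direction described above.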

The alignment objective essentially comes down to the choice of loss function. The authors use a triplet ranking loss with the hardest negatives. That is, for a positive pair (I, T), the hardest negative image is the image in the mini-batch, other than I, that has the highest similarity to T; the hardest negative sentence is defined in the same way with respect to I. The formula is as follows:
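The definition can be written out as below. This is my reconstruction from the description and from what I recall of the SCAN paper, so the symbol names are assumptions: S is the image-sentence similarity, and m and d range over the images and sentences of the mini-batch.

```latex
\hat{I}_h = \arg\max_{m \neq I} S(m, T), \qquad
\hat{T}_h = \arg\max_{d \neq T} S(I, d)
```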

The final loss:
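As I understand it, the loss for a positive pair (I, T) has the form max(0, alpha - S(I, T) + S(I, T_h)) + max(0, alpha - S(I, T) + S(I_h, T)), where alpha is the margin and T_h and I_h are the hardest negative sentence and image. Below is a minimal sketch of that computation over a mini-batch. The function name, the margin of 0.2 and summing over the batch are my assumptions, not values taken from the paper.

```python
def hardest_negative_triplet_loss(sim, margin=0.2):
    """Triplet ranking loss using only the hardest negatives in the batch.

    sim[i][j] is the similarity between image i and sentence j, with the
    matched pairs on the diagonal. Needs at least two pairs in the batch.
    margin=0.2 is an assumed value for illustration.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        # Hardest negative sentence: the most similar non-matching sentence for image i.
        hardest_sentence = max(sim[i][j] for j in range(n) if j != i)
        # Hardest negative image: the most similar non-matching image for sentence i.
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin - positive + hardest_sentence)
        total += max(0.0, margin - positive + hardest_image)
    return total
```

Only the single worst violator in each direction contributes to the loss, rather than the sum over all negatives in the batch.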

The most notable contribution of this paper is applying attention to alignment at the word and region level, which considerably improves interpretability. This mutual attention between words and regions, followed by the similarity computation, is also the reason for the name Stacked Cross Attention.

[1] Karpathy, Andrej & Fei-Fei, Li. “Deep visual-semantic alignments for generating image descriptions.” CVPR (2015).

[2] Lee, Kuang-Huei & Chen, Xi & Hua, Gang & Hu, Houdong & He, Xiaodong. “Stacked Cross Attention for Image-Text Matching.” ECCV (2018).

[3] Faghri, Fartash & Fleet, David & Kiros, Jamie & Fidler, Sanja. “VSE++: Improved Visual-Semantic Embeddings.” (2017).

[4] Anderson, Peter & He, Xiaodong & Buehler, Chris & Teney, Damien & Johnson, Mark & Gould, Stephen & Zhang, Lei. “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.” CVPR (2018), 6077–6086. 10.1109/CVPR.2018.00636.
