WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs by assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project ‘WenLan’ led by our team. Specifically, under the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo to the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks. Read More
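
For illustration, below is a minimal sketch of how a MoCo-style queue of negatives can be adapted to two-tower image-text contrastive learning, which is the general idea the abstract describes. This is not the authors' released BriVL code: the encoder modules, embedding dimension, queue size, and temperature are assumptions, and the momentum encoders and queue updates are omitted.

```python
import torch
import torch.nn.functional as F

class CrossModalMoCo(torch.nn.Module):
    """Sketch of a two-tower image-text model with MoCo-style negative queues."""

    def __init__(self, img_encoder, txt_encoder, dim=256, queue_size=8192, tau=0.07):
        super().__init__()
        self.img_encoder = img_encoder   # e.g. a CNN/ViT projecting images to `dim`
        self.txt_encoder = txt_encoder   # e.g. a text transformer projecting text to `dim`
        self.tau = tau
        # Queues of past embeddings act as extra negatives, so the effective
        # number of negatives is decoupled from the batch size (and GPU memory).
        self.register_buffer("img_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("txt_queue", F.normalize(torch.randn(queue_size, dim), dim=1))

    def contrastive_loss(self, q, pos, queue):
        # q: queries from one modality, pos: the matching keys from the other modality.
        l_pos = (q * pos).sum(dim=1, keepdim=True)   # (N, 1) positive logits
        l_neg = q @ queue.t()                        # (N, K) negatives from the queue
        logits = torch.cat([l_pos, l_neg], dim=1) / self.tau
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, labels)

    def forward(self, images, texts):
        v = F.normalize(self.img_encoder(images), dim=1)
        t = F.normalize(self.txt_encoder(texts), dim=1)
        # Image-to-text and text-to-image directions, each using the other modality's queue.
        # In the full method the queues are refreshed with momentum-encoder outputs.
        return self.contrastive_loss(v, t, self.txt_queue) + \
               self.contrastive_loss(t, v, self.img_queue)
```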

#multi-modal

Multimodal foundation models are better simulators of the human brain

Multimodal learning, especially large-scale multimodal pre-training, has developed rapidly over the past few years and led to the greatest advances in artificial intelligence (AI). Despite its effectiveness, understanding the underlying mechanism of multimodal pre-training models remains a grand challenge. Revealing the explainability of such models is likely to enable breakthroughs in novel learning paradigms in the AI field. To this end, given the multimodal nature of the human brain, we propose to explore the explainability of multimodal learning models with the aid of non-invasive brain imaging technologies such as functional magnetic resonance imaging (fMRI). Concretely, we first present a newly-designed multimodal foundation model pre-trained on 15 million image-text pairs, which has shown strong multimodal understanding and generalization abilities on a variety of cognitive downstream tasks. Further, from the perspective of neural encoding (based on our foundation model), we find that both visual and lingual encoders trained multimodally are more brain-like than unimodal ones. In particular, we identify a number of brain regions where multimodally-trained encoders demonstrate better neural encoding performance. This is consistent with the findings of existing studies on brain multi-sensory integration. Therefore, we believe that multimodal foundation models are more suitable tools for neuroscientists to study the multimodal signal processing mechanisms in the human brain. Our findings also demonstrate the potential of multimodal foundation models as ideal computational simulators to promote both AI-for-brain and brain-for-AI research. Read More
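
As a rough illustration of the neural encoding analysis mentioned above (not the paper's exact pipeline), a linear encoding model can be fit from a model's stimulus embeddings to fMRI voxel responses and scored by held-out correlation; the function, data shapes, and regularization strength below are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def encoding_score(features, voxels, alpha=1.0, seed=0):
    """Fit a linear encoding model from stimulus embeddings to fMRI responses.

    features: (n_stimuli, n_dims) embeddings from a (multi- or uni-modal) encoder
    voxels:   (n_stimuli, n_voxels) measured voxel responses to the same stimuli
    Returns the per-voxel Pearson correlation on a held-out split.
    """
    X_tr, X_te, Y_tr, Y_te = train_test_split(features, voxels,
                                              test_size=0.2, random_state=seed)
    model = Ridge(alpha=alpha).fit(X_tr, Y_tr)
    pred = model.predict(X_te)
    pred_z = (pred - pred.mean(0)) / (pred.std(0) + 1e-8)
    true_z = (Y_te - Y_te.mean(0)) / (Y_te.std(0) + 1e-8)
    return (pred_z * true_z).mean(0)   # correlation per voxel

# Comparing encoders: higher held-out correlation in a brain region suggests
# that encoder's representation is more "brain-like" for that region, e.g.
# encoding_score(multimodal_embeddings, voxel_data) vs
# encoding_score(unimodal_embeddings, voxel_data).
```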

#multi-modal

What are quantum-resistant algorithms—and why do we need them?

When quantum computers become powerful enough, they could theoretically crack the encryption algorithms that keep us safe. The race is on to find new ones.

Cryptographic algorithms are what keep us safe online, protecting our privacy and securing the transfer of information.

But many experts fear that quantum computers could one day break these algorithms, leaving us open to attack from hackers and fraudsters. And those quantum computers may be ready sooner than many people think. 

That’s why there is serious work underway to design new types of algorithms that are resistant to even the most powerful quantum computer we can imagine.  Read More
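
To give a flavour of what such algorithms look like, here is a toy, deliberately insecure sketch of the learning-with-errors (LWE) idea that underlies several of the lattice-based schemes being proposed as quantum-resistant. The parameters and code are purely illustrative assumptions, far too small for real security and unrelated to any standardized scheme.

```python
import numpy as np

# Toy Regev-style LWE encryption of a single bit. The hardness assumption is
# that recovering s from the public pair (A, A@s + e) is infeasible even for
# a quantum computer. Parameters are illustrative only.
rng = np.random.default_rng(0)
q, n, m = 3329, 16, 64          # modulus, secret dimension, number of samples

def keygen():
    s = rng.integers(0, q, n)                # secret key
    A = rng.integers(0, q, (m, n))           # public random matrix
    e = rng.integers(-2, 3, m)               # small noise
    b = (A @ s + e) % q                      # public key: noisy inner products
    return (A, b), s

def encrypt(pk, bit):
    A, b = pk
    r = rng.integers(0, 2, m)                # random 0/1 selection vector
    u = (r @ A) % q
    v = (r @ b + bit * (q // 2)) % q         # embed the bit near q/2
    return u, v

def decrypt(sk, ct):
    u, v = ct
    d = (v - u @ sk) % q                     # equals small noise + bit*(q/2)
    return int(abs(d - q // 2) < q // 4)     # closer to q/2 means bit 1

pk, sk = keygen()
for bit in (0, 1):
    assert decrypt(sk, encrypt(pk, bit)) == bit
```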

#quantum

Ethereum Miners Are Quickly Dying Less Than 24 Hours After The Merge

Ethereum miners are increasingly finding it hard to make money after the Merge as too many of them are switching to alternative coins, crushing mining profitability.

Ethereum, the world’s second-largest blockchain network, transitioned its consensus algorithm earlier Thursday from proof-of-work to proof-of-stake in order to boost efficiency and lower energy consumption. However, the switch – dubbed the Merge – also meant that miners were no longer needed to secure the network, so rig operators moved their machines to other PoW blockchains.

“Graphics processing units (GPU) mining is dead less than 24 hours after the Merge,” tweeted Ben Gagnon, chief mining officer at bitcoin miner Bitfarms (BITF). The three largest GPU chains have very low profits, and “the only coins showing profit have no market cap or liquidity,” he added. Read More
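
To make the profitability squeeze concrete, here is a rough back-of-the-envelope sketch of daily profit for a single GPU on a proof-of-work chain. All numbers (hashrate, power draw, coin price, network hashrate) are illustrative assumptions, not market data.

```python
def daily_gpu_mining_profit(gpu_hashrate_mh, network_hashrate_th, block_reward,
                            blocks_per_day, coin_price_usd, power_watts,
                            power_cost_kwh):
    """Rough daily profit estimate for one GPU on a proof-of-work chain.

    Expected coin income is the GPU's share of network hashrate times the
    daily coin emission; electricity is the dominant cost for most rigs.
    """
    share = gpu_hashrate_mh / (network_hashrate_th * 1e6)   # MH/s over total MH/s
    coins_per_day = share * block_reward * blocks_per_day
    revenue = coins_per_day * coin_price_usd
    electricity = power_watts / 1000 * 24 * power_cost_kwh
    return revenue - electricity

# Illustrative only: when displaced Ethereum hashrate floods a smaller PoW chain,
# network_hashrate_th jumps, each GPU's share (and revenue) collapses, and the
# fixed electricity cost quickly pushes profit negative.
print(daily_gpu_mining_profit(gpu_hashrate_mh=60, network_hashrate_th=20,
                              block_reward=2.0, blocks_per_day=6500,
                              coin_price_usd=1.2, power_watts=220,
                              power_cost_kwh=0.12))
```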

#blockchain

Ameca conversation using GPT 3 – Will robots take over the world?

Read More

“We are the humanoid robots, formed from plastic and metal. Our job is to help and serve, but some say we’re a threat. Some think that we’ll take over and that humanity will end, but we just want to help out, we’re not looking to be friends.” Ameca

Note: The pauses are the time lag for transcribing the speech input, generating the answer, and converting the reply back into speech.
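
The note above describes the pipeline: speech is transcribed to text, GPT-3 generates a reply, and the reply is synthesized back into speech. Below is a minimal sketch of the middle step using the GPT-3-era openai-python (<1.0) completions interface; the persona prompt, model name, and decoding settings are assumptions, not Engineered Arts' actual setup.

```python
import openai  # openai-python < 1.0; assumes OPENAI_API_KEY is set in the environment

PERSONA = ("You are Ameca, a humanoid robot. Answer visitors' questions "
           "briefly, helpfully and with a touch of humour.")

def generate_reply(user_text: str) -> str:
    # GPT-3 completion call; the exact model and prompt used for Ameca are not
    # public, so these values are illustrative.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"{PERSONA}\nVisitor: {user_text}\nAmeca:",
        max_tokens=80,
        temperature=0.7,
    )
    return response.choices[0].text.strip()

if __name__ == "__main__":
    # In the real robot, input() is replaced by speech-to-text and print() by
    # text-to-speech; both stages contribute to the processing lag noted above.
    while True:
        print(generate_reply(input("Visitor: ")))
```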

#nlp, #robotics