We [Alibaba] introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks including zero-shot captioning, visual or document visual question answering, and grounding. We demonstrate the Qwen-VL outperforms existing Large Vision Language Models (LVLMs). We present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL. — Read More
Recent Updates Page 168
AIColor: Colorize your old Photos with the power of AI
If you’re looking to colorize old black and white photos, our AI photo colorizer can help you bring your memories to life. — Read More
Introducing Code Llama, a state-of-the-art large language model for coding
Today, we are releasing Code Llama, a large language model (LLM) that can use text prompts to generate code. Code Llama is state-of-the-art for publicly available LLMs on code tasks, and has the potential to make workflows faster and more efficient for current developers and lower the barrier to entry for people who are learning to code. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more robust, well-documented software. — Read More
Read the paper
Access the code
LLMStack
LLMStack is a no-code platform for building generative AI applications, chatbots, agents and connecting them to your data and business processes.
Build tailor-made generative AI applications, chatbots and agents that cater to your unique needs by chaining multiple LLMs. Seamlessly integrate your own data and GPT-powered models without any coding experience using LLMStack’s no-code builder. Trigger your AI chains from Slack or Discord. Deploy to the cloud or on-premise. — Read More
An analog-AI chip for energy-efficient speech recognition and transcription
Models of artificial intelligence (AI) that have billions of parameters can achieve high accuracy across a range of tasks1,2, but they exacerbate the poor energy efficiency of conventional general-purpose processors, such as graphics processing units or central processing units. Analog in-memory computing (analog-AI)3,4,5,6,7 can provide better energy efficiency by performing matrix–vector multiplications in parallel on ‘memory tiles’. However, analog-AI has yet to demonstrate software-equivalent (SWeq) accuracy on models that require many such tiles and efficient communication of neural-network activations between the tiles. Here we present an analog-AI chip that combines 35 million phase-change memory devices across 34 tiles, massively parallel inter-tile communication and analog, low-power peripheral circuitry that can achieve up to 12.4 tera-operations per second per watt (TOPS/W) chip-sustained performance. We demonstrate fully end-to-end SWeq accuracy for a small keyword-spotting network and near-SWeq accuracy on the much larger MLPerf8 recurrent neural-network transducer (RNNT), with more than 45 million weights mapped onto more than 140 million phase-change memory devices across five chips. — Read More
#nvidia, #humanA high-performance speech neuroprosthesis
Speech brain–computer interfaces (BCIs) have the potential to restore rapid communication to people with paralysis by decoding neural activity evoked by attempted speech into text1,2 or sound3,4. Early demonstrations, although promising, have not yet achieved accuracies sufficiently high for communication of unconstrained sentences from a large vocabulary1,2,3,4,5,6,7. Here we demonstrate a speech-to-text BCI that records spiking activity from intracortical microelectrode arrays. Enabled by these high-resolution recordings, our study participant—who can no longer speak intelligibly owing to amyotrophic lateral sclerosis—achieved a 9.1% word error rate on a 50-word vocabulary (2.7 times fewer errors than the previous state-of-the-art speech BCI2) and a 23.8% word error rate on a 125,000-word vocabulary (the first successful demonstration, to our knowledge, of large-vocabulary decoding). Our participant’s attempted speech was decoded at 62 words per minute, which is 3.4 times as fast as the previous record8 and begins to approach the speed of natural conversation (160 words per minute9). Finally, we highlight two aspects of the neural code for speech that are encouraging for speech BCIs: spatially intermixed tuning to speech articulators that makes accurate decoding possible from only a small region of cortex, and a detailed articulatory representation of phonemes that persists years after paralysis. These results show a feasible path forward for restoring rapid communication to people with paralysis who can no longer speak. — Read More
Analyzing an Expert Proposal for China’s Artificial Intelligence Law
A few months after the introduction of OpenAI’s ChatGPT captured imaginations around the world, China’s State Council quietly announced that it would work toward drafting an Artificial Intelligence Law. The government had already acted relatively quickly, drafting, significantly revising, and finally implementing on August 15 rules on generative AI that build on existing laws. Still, broader questions about AI’s role in society remain, and the May announcement signaled that more holistic legislative thinking was on the horizon.
… In the case of this scholars’ draft of an AI Law, the accompanying explanation notes that it is to serve as a reference for legislative work and is expected to be revised in a 2.0 version. Although the connection between this text and any eventual Chinese AI Law is uncertain, its publication from a team led by Zhou Hui, deputy director of the CASS Cyber and Information Law Research Office and chair of a research project on AI ethics and regulation, makes it an early indication of how some influential policy thinkers are approaching the State Council-announced AI Law effort.
We invited DigiChina community members to share their analysis of the scholars’ draft, a translation of which was led by Concordia AI and is published here. Their responses are below. — Read More
FlexiViT: One Model for All Patch Sizes
Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at this https URL — Read More
This paper convinced me LLMs are not just “applied statistics”, but learn world models and structure
This paper convinced me LLMs are not just “applied statistics”, but learn world models and structure: https://thegradient.pub/othello/
You can look at an LLM trained on Othello moves, and extract from its internal state the current state of the board after each move you tell it. In other words, an LLM trained on only moves, like “E3, D3,..” contains within it a model of a 8×8 board grid and the current state of each square. — Read More