Transformers and other AI breakthroughs have shown state-of-the-art performance across different modalities.
The next frontier in AI is combining these modalities in interesting ways. Explain what’s happening in a photo. Debug a program with your voice. Generate music from an image. Technical work remains in combining these modalities, but the greatest challenge is one of user experience, not technology.
What is the right UX for these use cases?
Daily Archives: September 28, 2023
GPT-4V(ision) System Card — Safety Properties of GPT-4V
GPT-4 with vision (GPT-4V) enables users to instruct GPT-4 to analyze image inputs provided by the user, and is the latest capability we are making broadly available. Incorporating additional modalities (such as image inputs) into large language models (LLMs) is viewed by some as a key frontier in artificial intelligence research and development [1, 2, 3]. Multimodal LLMs offer the possibility of expanding the impact of language-only systems with novel interfaces and capabilities, enabling them to solve new tasks and provide novel experiences for their users.
In this system card [4, 5], we analyze the safety properties of GPT-4V. Our work on safety for GPT-4V builds on the work done for GPT-4 [7], and here we dive deeper into the evaluations, preparation, and mitigation work done specifically for image inputs.