Have you heard of the 80/20 rule, or the Pareto Principle? It says that roughly 80% of the effects come from 20% of the causes.
In most cases, a small share of the effort drives most of the results. Let’s apply this rule to data analysis, and work smarter, not harder!
Why is the 80/20 rule useful? It lets you focus on the few tasks that generate the most value for you and your organization. This saves time, increases efficiency, and makes you more useful at work. — Read More
Monthly Archives: December 2024
Superhuman performance of a large language model on the reasoning tasks of a physician
Performance of large language models (LLMs) on medical tasks has traditionally been evaluated using multiple choice question benchmarks. However, such benchmarks are highly constrained, saturated with repeated impressive performance by LLMs, and have an unclear relationship to performance in real clinical scenarios. Clinical reasoning, the process by which physicians employ critical thinking to gather and synthesize clinical data to diagnose and manage medical problems, remains an attractive benchmark for model performance. Prior LLMs have shown promise in outperforming clinicians in routine and complex diagnostic scenarios. We sought to evaluate OpenAI’s o1-preview model, a model developed to increase run-time via chain of thought processes prior to generating a response. We characterize the performance of o1-preview with five experiments including differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, adjudicated by physician experts with validated psychometrics. Our primary outcome was comparison of the o1-preview output to identical prior experiments that have historical human controls and benchmarks of previous LLMs. Significant improvements were observed with differential diagnosis generation and quality of diagnostic and management reasoning. No improvements were observed with probabilistic reasoning or triage differential diagnosis. This study highlights o1-preview’s ability to perform strongly on tasks that require complex critical thinking such as diagnosis and management while its performance on probabilistic reasoning tasks was similar to past models. New robust benchmarks and scalable evaluation of LLM capabilities compared to human physicians are needed along with trials evaluating AI in real clinical settings. — Read More
It’s AI Versus the World’s Largest Tuberculosis Epidemic
The scourge of tuberculosis (TB) may be largely a distant memory for most Americans and Europeans, but it killed roughly 1.25 million people last year around the world. A non-profit based in India, which accounts for more than a quarter of all cases, is developing AI tools that could boost efforts to eradicate the disease.
Roughly 10 million people a year fall ill with TB, making it one of the world’s most prevalent infectious diseases. In 2018, Indian Prime Minister Narendra Modi made an ambitious pledge to eliminate TB in India by 2025. With 2.5 million cases recorded in India last year, that goal clearly won’t be met; still, the country has invested hundreds of millions of dollars in a vast national TB program, and has reduced the disease’s incidence by about 18 percent between 2015 and 2023.
… Indian non-profit Wadhwani AI has developed a suite of AI-powered tools to help health workers detect undiagnosed cases, decide on treatment plans, and prevent people from dropping out of treatment. Working with the Indian government and the U.S. Agency for International Development, the organization is currently piloting these tools across the country. And Wadhwani’s director of solutions, Nakul Jain, says 2025 could see several incorporated into India’s national TB patient management system, Nikshay. — Read More
Transparency assessment of 15 Chinese large models: only 4 allow users to withdraw voiceprint data
None of the 15 large model products tested disclosed the source of its training data. Citing technical limitations, every company stated that it could not fully guarantee the authenticity and accuracy of AI-generated content. The vast majority of the products stated that the content and prompts users enter would be used for model training, and only 4 allowed users to revoke authorization for voice data.
…. The three AI products with the highest transparency scores are Tencent Yuanbao (72 points), iFlytek’s SparkDesk (69 points), and Zhipu’s Qingyan (67 points); the three lowest-ranked are Baichuan’s Baixiaoying (54 points), ModelBest’s Luca (51 points), and Metaso [秘塔] (43 points).
The report calls for greater transparency in large model services. Transparency directly affects whether a model can be trusted, and it helps users evaluate the accuracy and reliability of AI-generated content and better identify potential AI risks. — Read More
Is OpenAI o3 Really AGI?
The world may have changed, and we might not have realized it yet.
Yesterday, OpenAI shocked (and this is not hyperbole) everyone with the announcement of OpenAI o3 and o3-mini, the brand new models of the ‘o’ family (they skipped ‘o2’ for trademark reasons).
o3 results are so astonishing that some people are actually convinced that it is AGI, as it destroys some of the so-called ‘impossible’ benchmarks for current models. — Read More
The AI Trillion-Dollar Product
In a very recent interview, Satya Nadella, Microsoft’s CEO, claimed that current business applications will “collapse in the agent era.” Notably, he is referring to the very same apps his company currently sells. In other words, he is predicting the death of Microsoft’s own current business model in favor of AI agents.
But this vision implies a much more powerful change, one that Satya is less keen on mentioning because it directly impacts Microsoft’s raison d’être: the introduction of AI as a structural part of general-purpose computing. This is the end game of ChatGPT: the LLM Operating System, or LLM OS.
This vision is so powerful that it is unequivocally OpenAI’s grand plan. Today, we are distilling their vision into simple words. I believe this is one of my most didactic articles on the future of AI. — Read More
1-800-ChatGPT – Calling and Messaging ChatGPT with your phone
1-800-ChatGPT is an experimental new launch to enable wider access to ChatGPT. You can now talk to ChatGPT via phone call or message ChatGPT via WhatsApp at 1-800-ChatGPT without needing an account.
… You can talk to 1-800-ChatGPT for 15 minutes per month for free, with a daily limit on WhatsApp messages. We may adjust usage limits based on capacity if needed. — Read More
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome. — Read More
HunyuanVideo
We present HunyuanVideo, a novel open-source video foundation model that exhibits performance in video generation that is comparable to, if not superior to, leading closed-source models. To train the HunyuanVideo model, we adopt several key technologies for model learning, including data curation, image-video joint model training, and an efficient infrastructure designed to facilitate large-scale model training and inference. Additionally, through an effective strategy for scaling model architecture and dataset, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. — Read More
Microsoft’s smaller AI model beats the big guys: Meet Phi-4, the efficiency king
Microsoft launched a new artificial intelligence model today that achieves remarkable mathematical reasoning capabilities while using far fewer computational resources than its larger competitors. The 14-billion-parameter Phi-4 frequently outperforms much larger models like Google’s Gemini Pro 1.5, marking a significant shift in how tech companies might approach AI development.
The breakthrough directly challenges the AI industry’s “bigger is better” philosophy, where companies have raced to build increasingly massive models. While competitors like OpenAI’s GPT-4o and Google’s Gemini Ultra operate with hundreds of billions or possibly trillions of parameters, Phi-4’s streamlined architecture delivers superior performance in complex mathematical reasoning. — Read More