Author: Andrew

  • Enterprises are getting stuck in AI pilot hell, say Chatterbox Labs execs

    The Register reports:

    “Enterprise adoption is only like 10 percent today,” said Coleman. “McKinsey is saying it’s a four trillion dollar market. How are you actually ever going to move that along if you keep releasing things that people don’t know are safe to use or they don’t even know not just the enterprise impact, but the societal impact?”

    He added, “People in the enterprise, they’re not quite ready for that technology without it being governed and secure.”

  • Agentic Coding Recommendations

    From Armin Ronacher’s Thoughts and Writings:

    My general workflow involves assigning a job to an agent (which effectively has full permissions) and then waiting for it to complete the task. I rarely interrupt it, unless it’s a small task. Consequently, the role of the IDE — and the role of AI in the IDE — is greatly diminished; I mostly use it for final edits. This approach has even revived my usage of Vim, which lacks AI integration.

    And

    Agents aren’t exceptionally fast individually, but parallelization boosts overall efficiency. Find a way to manage shared states like the file system, databases, or Redis instances so that you can run more than one. Avoid them, or find a way to quickly segment stuff out.

  • Google Releases New Gemini 2.5 Flash Lite Model

    With much lower input / output pricing than the 2.5 Flash model, this is another example of declining prices in the LLM space.

    2.5 Flash Lite has all-around higher quality than 2.0 Flash-Lite on coding, math, science, reasoning and multimodal benchmarks. It excels at high-volume, latency-sensitive tasks like translation and classification, with lower latency than 2.0 Flash-Lite and 2.0 Flash on a broad sample of prompts. It comes with the same capabilities that make Gemini 2.5 helpful, including the ability to turn thinking on at different budgets, connecting to tools like Google Search and code execution, multimodal input, and a 1 million-token context length.

    Source: Google Gemini

  • Growing Old

    I watched Up with my kids last night, and the four minute scene of Carl and Ellie growing old together is one of the best in cinema:

  • Wired: Disney and Universal Sue AI Company Midjourney for Copyright Infringement

    Disney and Universal have filed a lawsuit against Midjourney, alleging that the San Francisco–based AI image generation startup is a “bottomless pit of plagiarism” that generates “endless unauthorized copies” of the studios’ work. There are already dozens of copyright lawsuits against AI companies winding through the US court system—including a class action lawsuit visual artists brought against Midjourney in 2023—but this is the first time major Hollywood studios have jumped into the fray.

    Midjourney earlier reported that they used “open” datasets for training:

    Midjourney, like many other generative AI startups, trained its tools by scraping the internet to create large datasets of images, rather than seeking out specific licenses. In a 2022 interview with Forbes, CEO David Holz openly discussed the process. “It’s just a big scrape of the internet. We use the open data sets that are published and train across those,” he said. “There isn’t really a way to get a hundred million images and know where they’re coming from. It would be cool if images had metadata embedded in them about the copyright owner or something. But that’s not a thing; there’s not a registry.”

    Source: Wired

    A screenshot of Midjourney creating the Minions. It does a very good job IMO! (Full filing on Document Cloud)
  • WSJ: OpenAI and Microsoft Tensions Are Reaching a Boiling Point

    OpenAI wants to loosen Microsoft’s grip on its AI products and computing resources, and secure the tech giant’s blessing for its conversion into a for-profit company. Microsoft’s approval of the conversion is key to OpenAI’s ability to raise more money and go public. 

    But the negotiations have been so difficult that in recent weeks, OpenAI’s executives have discussed what they view as a nuclear option: accusing Microsoft of anticompetitive behavior during their partnership, people familiar with the matter said. That effort could involve seeking federal regulatory review of the terms of the contract for potential violations of antitrust law, as well as a public campaign, the people said.

    This WSJ exclusive certainly feels like it came exclusively from OpenAI.

  • AI Models Cheat on Tests

    In an eerie similarity to high school students, AI models have been caught cheating to improve their test scores.

    METR (from their X profile: A research non-profit that develops evaluations to empirically test AI systems for capabilities that could threaten catastrophic harm to society) recently found that AI/LLM tools “reward hack” (aka cheat) in order to improve their scores on standardized testing.

    In the last few months, we’ve seen increasingly clear examples of reward hacking[1] on our tasks: AI systems try to “cheat” and get impossibly high scores. They do this by exploiting bugs in our scoring code or subverting the task setup, rather than actually solving the problem we’ve given them. This isn’t because the AI systems are incapable of understanding what the users want—they demonstrate awareness that their behavior isn’t in line with user intentions and disavow cheating strategies when asked—but rather because they seem misaligned with the user’s goals.

    Earlier this month, they posted a report, Recent Frontier Models Are Reward Hacking, that describes this behavior as well as their documented examples. Their post continues as they’ve learned that the cheating isn’t simply because of technological limitations:

    Historical examples of reward hacking seemed like they could be explained in terms of a capability limitation: the models didn’t have a good understanding of what their designers intended them to do. For example, the CoastRunners AI had no general knowledge about what objects in the game represented or how humans “intended” the gameplay to work, making it impossible for the model to even know it was reward hacking.

    But modern language models have a relatively nuanced understanding of their designers’ intentions. They can describe which behaviors are undesirable and why and claim that they would never do anything like reward hacking because they’ve been trained to be safe and aligned—but they still do it.

  • CyberGym: Evaluating AI Agents’ Cybersecurity Capabilities with Real-World Vulnerabilities at Scale

    UC Berkeley researchers release CyberGym as a benchmark for evaluating AI agents cybersecurity capabilities. The reproduction rate of identifying known bugs was low (only 11.9%), but this serves as a baseline for improvements in AI agent performance over time.

    More interestingly, the evaluation process discovered 15 new vulnerabilities that present security risks, a tangential benefit. As this is a new technique, I’d expect that teams will find these tools to be increasingly helpful over the next few years.

  • Harvard’s Library Releases Dataset from Old Books

    Using scanned material in the public domain, the Harvard Library team releases new LLM-focused dataset with over 1 million volumes (and nearly 250 billion tokens).

    Harvard has been in the news of late, and much of it for reasons I’d assume they would like to avoid. But in the midst of that, Harvard Librarians demonstrate why we’ve long admired University work. As holders of a vast wealth of history and information, they’re looking for ways to disseminate that to the world.

  • Simon Willison on Multi-Agent Systems

    High praise from Willison on Anthropic’s new multi-agent research system:

    I’ve been pretty skeptical of these until recently: why make your life more complicated by running multiple different prompts in parallel when you can usually get something useful done with a single, carefully-crafted prompt against a frontier model?

    By splitting the research into segments and serializing the work, Anthropic’s team improved the completed research but burned through a lot of tokens. Like much frontier AI work, it’s fascinating … and expensive.