Our Feed

UX Collective - Medium30/07/2026, 05:24
The accessibility paradox
Design / UXopen article
UX Collective - Medium29/07/2026, 21:58
Your people get AI. Get out of their way.
Design / UXopen article
UX Collective - Medium29/07/2026, 21:57
Battling AI fatigue as a designer and developer: a practical guide
Design / UXopen article
UX Collective - Medium29/07/2026, 21:57
The UX of Contrast: Lessons from Indika’s Game Design
Design / UXopen article
Towards Data Science29/07/2026, 16:33
Prompt Engineering Is Solved—Prompt Management Isn’t
Prompt engineering helps you write better prompts—but it doesn’t help you change them safely. This article explores a common production failure where a simple variable rename breaks every live call, and introduces a lightweight static analysis tool that treats prompts like contracts, catching breaking changes before they ship.
AI / MLopen article
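The rename failure described in the summary above can be illustrated with a small check. This is a minimal sketch, not the tool the article introduces: it assumes prompts are plain `str.format`-style templates, and the function names `template_fields` and `breaking_changes`, as well as the two example prompts, are hypothetical.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the named placeholders used in a str.format-style template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for pure literal text and "" for positional "{}".
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def breaking_changes(old: str, new: str) -> dict[str, set[str]]:
    """Compare two versions of a prompt and report placeholder differences."""
    old_fields = template_fields(old)
    new_fields = template_fields(new)
    return {
        "removed": old_fields - new_fields,  # existing callers still pass these
        "added": new_fields - old_fields,    # existing callers do not pass these
    }


# Hypothetical prompts: "document" was renamed to "doc" between versions.
old_prompt = "Summarize {document} for {audience}."
new_prompt = "Summarize {doc} for {audience}."
report = breaking_changes(old_prompt, new_prompt)

if report["removed"] or report["added"]:
    print("Prompt contract changed:", report)
```

Run in CI against the previous version of each template, a non-empty report would flag the rename before any caller fails at runtime with a `KeyError`.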
Towards Data Science29/07/2026, 15:00
Why Your Best Predictive Model Gives the Wrong Treatment Effect
Why prediction-driven variable selection misses confounders and how Bayesian Adjustment for Confounding attempts to fix it.
AI / MLopen article
Towards Data Science29/07/2026, 13:30
Los Movimientos, Part II: Solving Large Pickup-and-Delivery Problems with Adaptive Large Neighborhood Search
Building an ALNS heuristic in Python for vehicle routing, time windows, capacity constraints, and mandatory driver breaks.
AI / MLopen article
Articles on Smashing Magazine — For Web Designers And Developers29/07/2026, 13:00
The Bull And Bear Case For Digital Design In The Age Of AI
As AI reshapes product design, it could give designers greater autonomy or expose the gaps that autonomy makes harder to hide. Exploring both the bull and bear cases, Andy Budd examines what happens when designers need less permission to act.
Frontendopen article
Codrops29/07/2026, 12:41
Studio Freight: Moving Missions Forward
A look inside Studio Freight’s mission driven approach to strategy, design, and digital experiences, and the bold work they create with teams shaping what comes next.
Frontendopen article
MachineLearningMastery.com29/07/2026, 12:00
Ollama vs. LM Studio vs. llama.cpp: Which Local AI Runtime Should You Use in 2026?
In this article, you will learn how Ollama, LM Studio, and llama.cpp differ across the dimensions that matter most to practitioners, and how to choose...
AI / MLopen article
Towards Data Science29/07/2026, 12:00
Avoiding Entity Key Drift in a Data Lake: Step 1, Normalization
This is the opening piece of a four-part deep-dive series on building a high-frequency streaming pipeline against a live public API. The data source is openSenseMap, a citizen-science IoT network used for climate research, mostly in Germany. A live public API is what makes it useful: it produces data-quality problems and edge cases that clean sample datasets never show. This article focuses on step 1, normalization; later pieces cover matching algorithms, adaptive polling and noise filtering, and a vendor-agnostic Apache Iceberg pipeline with Terraform that runs locally in Docker and moves to AWS or GCP with minimal change.
AI / MLopen article
UX Collective - Medium28/07/2026, 22:21
The MOS 6502: the people’s princess
The impact of blending arcade game UI concepts with the power of affordable 1970s calculator chips
Design / UXopen article
Towards Data Science28/07/2026, 16:30
How Much Does a Local LLM Actually Cost to Run? I Measured Every Watt on Apple Silicon
Five models, sustained generation, real wall-socket energy at $0.31/kWh — and the surprise the RTX-3090 numbers predicted, only bigger.
AI / MLopen article
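The arithmetic behind a per-kWh figure like the one in the summary above is simple to sketch. Only the $0.31/kWh rate comes from the summary; the function names are hypothetical, and any wattage or throughput passed in is the caller's own measurement, not a number from the article.

```python
def energy_cost_usd(avg_watts: float, seconds: float, usd_per_kwh: float = 0.31) -> float:
    """Cost of drawing avg_watts from the wall for the given number of seconds."""
    kwh = avg_watts * seconds / 3600 / 1000  # watt-seconds -> watt-hours -> kWh
    return kwh * usd_per_kwh


def cost_per_million_tokens(avg_watts: float, tokens_per_second: float,
                            usd_per_kwh: float = 0.31) -> float:
    """Energy cost of generating one million tokens at a sustained rate."""
    seconds = 1_000_000 / tokens_per_second
    return energy_cost_usd(avg_watts, seconds, usd_per_kwh)


# Hypothetical inputs for illustration only: 60 W average draw, 25 tokens/s.
print(f"${cost_per_million_tokens(60, 25):.4f} per million tokens")
```

Measuring at the wall socket, as the summary says the author did, matters because the average draw then includes power-supply losses and idle components that a software power counter may not report.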
Hugging Face - Blog28/07/2026, 16:27
The OlmoEarth Platform: Geospatial inference at planetary scale
AI / MLopen article
Hugging Face - Blog28/07/2026, 15:01
LFM2.5-Encoders for Fast Long-Context Inference on CPU
AI / MLopen article
Towards Data Science28/07/2026, 15:00
MCP Explained: How Modern AI Agents Connect to the Real World
From custom integrations to a universal standard for tool access.
AI / MLopen article
Towards Data Science28/07/2026, 13:30
Don’t Just “Throw Adam at It”: Misunderstanding Adam Will Cost You
You "vibe coded" the import. Understand Adam's optimization dynamics, why it fails spectacularly, and how to fix it.
AI / MLopen article
Martin Fowler28/07/2026, 13:10
The Orchestrator's Tax
Subagents get justified by time saved and parallel execution, but Rahul Garg explains that's not what matters most. Every token in the orchestrator's context is competing for its attention, and the real value of a subagent is what it keeps out of that context. Subagents should be treated as a tool for protecting the orchestrator's working memory, offloading reasoning it doesn't need to hold onto. Doing this well means giving the orchestrator explicit ground rules for when and how to delegate.
Backendopen article
Martin Fowler28/07/2026, 12:32
Why I’m Writing Rachel’s Ramblings
TL;DR I have ideas. I haven’t been writing them. That’s about to change. I promise… myself. I’ve been thinking a lot about talent. Actually, I’ve been thinking a lot about thinking. And writing. Or more specifically, not writing. This really hit me earlier this year at the Future of Software conference. I was surrounded by people sharing their latest ideas and I had a slightly uncomfortable realization: I have my own. Not just opinions. Actual patterns. Hypotheses. Things I’m seeing across clients, across teams, across the industry that feel new or at least not well articulated yet in a way that a leader can think about and act upon in some way that can influence how they strategise and plan for the future. Because helping clients and other leaders internal and external to Thoughtworks do this is actually a big part of what I do and without letting my northern humbleness get in my own way, I’m actually pretty good at it. If I wasn’t I wouldn’t be the global CTO of a future thinking tech org, you know the kind that has Martin Fowler as its Chief Scientist. A title I know he loves… Martin, by the way, is one of the people pushing me to do this, which is weird because on paper I’m his boss but I don’t believe in the traditional idea of a boss anyway. I’m a strong believer in the servant leadership type but I’ll save that for when I write about that. Anyway the point is for all the ideas I have and discussions I have I don’t do a good job of writing them down. At best I’ll stick it in a presentation deck when I’m forced to communicate with them in some forum or another. I hate decks and love writing so I’m obviously doing something wrong. So why haven’t I been writing? It’s easy to say I’ve been too busy. I don’t have an easy job. It’s a fun one but not easy. I also have two small children, 5 and 8. In case you are interested, I attempt to give as much time as possible to this busy job. And then I try to have a life.
I’m also writing an epic world building sci-fi fantasy book which is a huge passion project I may also share more about so I am definitely busy. But that’s not actually the real reason I haven’t been writing this down and pushing it out publicly. The real reasons… I overthink it. I move too fast to the next idea. I’ve convinced myself it needs to be more polished than it does. So this is an experiment. Rachel’s Ramblings is exactly what it sounds like. Fast, imperfect, thinking out loud. Naming ideas early rather than waiting until they’re fully formed. Because the reality is, most of what I do day to day isn’t answering known questions. It’s spotting patterns and asking questions we haven’t quite figured out yet. My brain works a bit like a knowledge graph. Constant associations, constant pattern matching. That’s useful in conversations, in client work, in strategy. It’s less useful if it never gets written down. So this is me fixing that. I’ll write about: what is the future of software; how software development is changing in the age of AI and what that means for engineers, leaders, and organizations; how platforms, agents, and people actually work together; and occasionally, how I manage the reality of doing this job with all the other things I have going on. Some of it will be wrong. Some of it will evolve. That’s the point. If nothing else, this is a forcing function to turn thinking into something that exists outside my head. Let’s see where it goes.
Backendopen article
Towards Data Science28/07/2026, 12:00
Backpropagation Explained for Beginners (Part 2): There Has to Be a Better Way
The idea that makes backpropagation possible.
AI / MLopen article
Articles on Smashing Magazine — For Web Designers And Developers28/07/2026, 10:00
Thinking Outside The Box: Digital Design In The AI Era
Many of the AI tools we interact with take the form of text boxes. But what if there was a different way to interact with AI? Oleksii Hrzhehorzhevskyi explores a different approach to creating a new AI assistant and how designers can navigate the field as AI continues to change it.
Frontendopen article
UX Collective - Medium27/07/2026, 22:00
From frictionless to meaningful
Design / UXopen article
Towards Data Science27/07/2026, 16:30
“Los Movimientos”: The Routing Problem That Nearly Broke My Spirit
Using mathematical optimization to solve a pickup-and-delivery problem with time windows.
AI / MLopen article
Towards Data Science27/07/2026, 15:00
Reducing Human Annotation with ML Active Learning
In a world where human time is expensive, learn how to use it only when really necessary.
AI / MLopen article
Codrops27/07/2026, 13:18
Between Print and Digital: The Making of MERSI’s Website
FLOT NOIR reimagines MERSI’s digital presence as a quiet, immersive space between architectural print and interactive web, combining a tailored Webflow build with precise motion and editorial storytelling.
Frontendopen article
LogRocket Blog27/07/2026, 13:00
Tailwind CSS vs. StyleX: A real migration with 20 components
I migrated 20 production-style components from Tailwind to StyleX. Here's what the data showed about LOC, CSS bundle size, build time, and type safety.
Frontendopen article
MachineLearningMastery.com27/07/2026, 12:00
5 Architectural Patterns for Persistent Memory and State in AI Agents
Building an AI agent can be tricky. Keeping it on track over a six-month deployment is incredibly hard. LLMs...
AI / MLopen article
UX Collective - Medium27/07/2026, 11:20
The consciousness mirage in AI design
We know it’s not human. We feel heard anyway. What mind attribution does to your users, and why it’s now a UX design dial.
Design / UXopen article
UX Collective - Medium27/07/2026, 11:17
AI can fake your portfolio, Lower latency, 5-to-9, Monochrome dataviz
Design / UXopen article
Hugging Face - Blog27/07/2026, 09:32
NVIDIA Cosmos-H-Dreams: Bringing Real-Time Generative Simulation to Surgical Robotics
AI / MLopen article
Hugging Face - Blog27/07/2026, 00:00
Anatomy of a Frontier Lab Agent Intrusion: A Technical Timeline of the July 2026 Incident
AI / MLopen article
UX Collective - Medium26/07/2026, 19:28
Information architecture is the foundation AI is starving for
Design / UXopen article
UX Collective - Medium26/07/2026, 14:19
Your website is boring (but that might just set it free)
Design / UXopen article
NN/g latest articles and announcements24/07/2026, 17:00
UX-Context Design: Using UX Knowledge to Inform AI-Generated Design
As more interface work is AI-generated, the output of research and design shifts from documents written for humans to curated context that guides AI.
Design / UXopen article
NN/g latest articles and announcements24/07/2026, 17:00
A Concrete Definition of “Product Sense” (and How to Build It)
Product sense means predicting which product decisions will succeed based on patterns learned through experimentation — and knowing when those patterns apply.
Design / UXopen article
Codrops24/07/2026, 14:06
The Art of Continuous Transformation: How Garden Eight Blends Integrity with Play
Inside Garden Eight's approach to creating websites that are not only visually striking, but thoughtfully crafted to evolve, engage, and endure.
Frontendopen article
MachineLearningMastery.com24/07/2026, 12:44
Stateful vs. Stateless Agent Design: Tradeoffs for Scalable Agentic Systems
In this article, you will learn how an agent's approach to managing state — stateless or stateful — shapes both its implementation and the deployment...
AI / MLopen article
LogRocket Blog23/07/2026, 17:00
JWT authentication: Best practices and when to use it
A guide for using JWT authentication to prevent basic security issues while understanding the shortcomings of JWTs.
Frontendopen article
Codrops23/07/2026, 14:09
Building Cerebrium: Making Serverless Infrastructure Tangible
Go behind the scenes of Cerebrium to explore the design, 3D, and WebGL techniques that turned complex AI infrastructure into an intuitive interactive experience.
Frontendopen article
LogRocket Blog23/07/2026, 13:00
How to replace screen recordings with Remotion
Discover how to build, render, and automate product demo videos with Remotion, replacing traditional screen recordings with reusable React code.
Frontendopen article
MachineLearningMastery.com23/07/2026, 12:00
An Introduction to Loop Engineering
It's tempting to treat loop engineering as something invented in a single week in June, but the mechanics behind it are closer to five years old, and knowing the lineage is what separates a real understanding of the idea from just repeating the trend piece.
AI / MLopen article
Hugging Face - Blog23/07/2026, 00:00
Bringing Nunchaku 4-bit Diffusion Inference to Diffusers
AI / MLopen article
Codrops22/07/2026, 13:52
Building Ridgeline: Engineering a Real-Time 3D Experience in Webflow
A behind-the-scenes look at the engineering behind Ridgeline, covering the architectural decisions, Webflow integration, real-time terrain rendering, animation techniques, and performance optimizations that brought a cinematic 3D experience to life.
Frontendopen article
LogRocket Blog22/07/2026, 13:00
How to use Chrome’s Modern Web Guidance to prevent AI agents from writing legacy frontend code
Chrome's Modern Web Guidance embeds modern web platform skills into AI coding agents, helping them choose native HTML, CSS, and browser APIs over legacy patterns.
Frontendopen article
LogRocket Blog22/07/2026, 12:30
How to choose and adapt product management frameworks
Learn how to choose and adapt product management frameworks based on your product stage, constraints, problem type, and business context.
Frontendopen article
LogRocket Blog21/07/2026, 13:30
Essential GUI design principles for creating intuitive interfaces
Explore the core principles of GUI design and learn how consistency, simplicity, feedback, accessibility, and user testing contribute to better digital experiences.
Frontendopen article
Martin Fowler21/07/2026, 13:13
Fragments: July 21
With this post, I’ll wrap up my notes from the second Future of Software Development Retreat. But before I do, I should note that the full Thoughtworks report on the retreat is now available. They have five headline findings: Code generation is no longer the bottleneck — verification is. ‘Harness engineering’ is emerging as a distinct, ownable discipline. Organizations are colliding with a real apprenticeship crisis. The executive/engineer expectation gap is a bigger risk than any technical limitation. Legacy modernization is the clearest, most defensible near-term value pool. ❄ ❄ A session convened around the mismatch of views about using LLMs between engineers using it and the C-suite and boards that were calling for it. The concern is that boards are looking at promised productivity gains, and not concerned enough about the risks, particularly about security. This was illustrated by one tale of a company that used ML-trained software to optimize the replacement of air filters on their field equipment. They were pleased to see that they were able to change the air filters less frequently, saving them $50 million. But the problem was the ML models were trained on equipment used in the desert, while their equipment was used in the arctic. Air filters in the desert deal with dust, but in the arctic the thing to remove is mosquitoes. There’s an important difference here, mosquitoes rot, and enough decaying mosquitoes is a serious fire risk. Fires from such dead mosquitoes around infrequently replaced air filters cost the company $100 billion. Now such a tale could be told of many situations without AI in the mix. Plenty of human situations have gone wrong when solutions are applied in a new context (which is why context is such a key word among pattern-writers). But the tale does remind us to be wary of an AI’s suggestions, and to always think of how to build sensors to provide rapid feedback.
Engineers particularly worry about the risks when citizen developers start vibe coding. In many ways, of course, this isn’t new. I.T. folks often worry about how many important business decisions are based on spreadsheets, that are built with little control, testing, or assessment of data quality. Vibe-coding amplifies these concerns, so companies need a range of controls to guard against security breaches. Some folks have made a point of raising issues at board level, running threat modeling sessions with board members to introduce them to the risks. Vibe-coded applications need to be put in separate infrastructure, with deterministic controls over data access to tame the lethal trifecta. One company encouraged widespread vibe-coding from citizen developers but recoiled from the problems of the huge shadow IT that emerged - they are now looking to build a platform to help control this work without stifling the useful tools that were produced. Part of the problem here may be simple experience with LLMs. Many in management find LLMs do a decent job of preparing management reports. Or summarizing management reports prepared by other LLMs. Given this they naturally think LLMs must do a decent job of programming too. My anti-management self has to mention Kelsey Hightower’s observation: “The less busy work you have the less appealing these AI tools are.” One possible antidote to this: get the legal department involved. They see LLMs doing a poor job, and appreciate the risks involved. ❄ ❄ Most folks I talk to, both at the retreat and outside, recognize we are in some form of bubble. Technological advances like this almost always come with economic bubbles, and in the future we will all look back at this, and shake our heads saying we knew there was so much froth. But while it’s easy to see that there is a bubble, it’s hard to see how long it will run or what will emerge after the pop. After all the dotcom bubble was clearly recognized as such… in 1995.
We can happily point at those companies that failed (Webvan, pets.com) but need to then acknowledge those that survived (Amazon). Most of those at the retreat were old enough to have lived through the dotcom bubble and crash, but one such grey-hair pointed out an interesting difference. Back then we were excited about what the future would bring, and we saw lots of new things being built. There’s much less of that, this time around. Most people are wary of what the AI bubble is creating. Partly this may stem from the reality that followed the dotcom hope. Social media may be everywhere, but do we think it’s actually improved our lives that much, even if (especially if?) we use so much of it? We hear so much about the incredibly productive things we can do with agentic programming, but has anyone noticed a flood of wonderful applications built with it? Or have we noticed a significant improvement in common applications from the big AI boosters such as Google or Microsoft? This may be another factor in the board-vs-engineer divide. Most of what’s driving adoption of AI at the moment is cost-cutting, and it’s mostly the boards that get excited by cost-cutting. Perhaps the increasing concerns about token costs will temper the eagerness. ❄ ❄ Folks are finding LLMs helpful in operations: with a good event stream from observability tools, an agent finds anomalies much faster. One of the problems with citizen-developer apps is that they often don’t provide good observability, since the citizen-developers don’t think to ask for it. The agent’s ability to look at the event stream does pose governance questions, as often such event streams contain a lot of sensitive information. Reinforcing what I’d heard in Utah, more people agreed that LLMs are valuable for operations folks to help them understand what the code does. Cross matching code and event traces helps them assist humans to find what happened when things go wrong.
Agents are particularly handy with repeated incidents, as they can collate lots of information from different cases and present it to the human teams. Getting agents to auto-remediate moves us to the next level of capabilities and concerns. It’s vital that agents carefully document all their actions when they do fixes. We also need to ensure there is feedback to the development team so they can learn. Agents don’t learn, the best they can do is update the context. There was a sense that many people over-estimate the capability of agents to deal with incidents. Such people think of incident resolution as a simple, linear process. But it’s rarely that, instead there’s a lot of surprises and adaptation needed. Humans are good with that, but LLMs are not. One of the perils of agent-developed code is their habit of inserting features that were never asked for. One team spent three days trying to figure out such an unrequested feature, trying to figure out who had requested it and if anyone wanted to keep it. ❄ ❄ ❄ ❄ ❄ A group of law professors carried out an interesting experiment to judge how well an LLM can provide short answers to student questions. They created a batch of forty questions in contract law and asked the professors, plus a couple of LLMs, to provide answers. To evaluate the LLM answers they showed professors pairs of answers - one human, one LLM - and asked them which response they would prefer to deliver to a student. Professors rated LLMs far higher than their peers (average win rate = 75.33%), with models performing similarly to the best instructor. LLM responses were also rarely flagged as harmful (3.53%, vs 12.06% for professors). This reminds me of the distinction I mentioned in a recent fragment between interactional and contributory expertise. ❄ ❄ ❄ ❄ ❄ A few days ago Unmesh Joshi published an article here about his experiences using DSLs to enable more reliable use of LLMs.
Responses to this included a pointer to an article by Spender Nelson that related similar impressions. DSLs like this hit a lot of sweet spots for LLMs. You can make them extremely token efficient, and enforce hard security boundaries. You can translate high-level LLM intent into a ton of deterministic code, ensuring good behavior and guardrails at the (custom) compiler level. And Large Language Models are very good at learning and working with DSLs. Maybe this shouldn’t come as a surprise; they are language models after all. A small bit of documentation generally is enough to set them off and running, and reasonable error messages let them course-correct even when they go wrong. He describes a couple of examples from their use: a query language for data lakes that takes into account security and authorization issues, and a little expression language to make it easier to create safe SQL where clauses. One of the biggest barriers to using DSLs, particularly external DSLs, is building a parser and tooling. LLMs make this much easier. That said, my sense is that it’s the semantic model that underpins the DSL that really matters, and the DSL is one projection of that model. LLMs may help us explore other ways to project that model in interesting ways. ❄ ❄ ❄ ❄ ❄ In recent weeks I’ve been noticing the stench of LLM-speak more and more. It’s not just the common tells, it’s a sense of LLM miasma that pervades the prose. I’ve noticed it’s increasingly eliciting a visceral reaction, after a couple of paragraphs I just want to dismiss the entire article out of hand. For some of these, it was necessary for me to hold my nose and wade through the whole text, but it was with an intellectual nausea which obscured the content, even increasing my desire to indulge in such an awful distraction as checking social media. I wonder - is this just me that’s reacting so negatively to LLM-speak?
Or do other people have a reaction that leads them to toss aside any prose that sets off their LLM-alarm? One indicator that it’s not just me is this post from Jason Koebler that I highlighted a couple of months ago, where he observed how AI was breaking his brain: People think things that are fake are real, things that are real are fake. Much has been written about “AI psychosis,” the nonspecific, nonscientific diagnosis given to people who have lost themselves to AI. Less has been said about the cognitive load of what other people’s AI use is doing to the rest of us, and the insidious nature of having to navigate an internet and a world where lazy AI has infiltrated everything. Our brains are now performing untold numbers of calculations per day: Is this AI? Do I care if it’s AI? Why does this sound or look or read so weird? Does this person just write like this? Is this a person at all? A while ago, I was thinking that it was reasonable for folks who aren’t as committed to writing as I am to use an AI to help polish their prose. Now I’m turning to encouraging writers to reject it. That pervasive LLM-voice is just so common now, my sense is that it discredits the writing even before the reader has a chance to try to understand what is being said. I don’t think it’s good enough to ask the LLM to write a first draft and then tweak it. I’m not sure writers can edit the LLM-ness out of prose once it’s in there. I even worry about asking an LLM to suggest improvements, I think it’s just too easy to accept an LLM’s suggestions, and in the process trigger your readers’ LLM-antibodies. Of course like most problems, it’s also an opportunity. Those who can get a distinctive human voice will get more visibility and credibility. But the question remains of how we can coach people to let out their true personality into their writing. Academic and corporate writing both tended to stifle engaging prose, LLMs are good amplifiers, and they will amplify this stifling. 
This is an even greater challenge for those for whom English is their second language (or indeed for many of my colleagues, their third or fourth). It’s too easy for me to neglect to think about a difficulty that I’ve never been able to face. The most immediate advice I can give is something I learned many years ago and shared last year - Say Your Writing. Once you’ve got a reasonable draft, read it out loud. By doing this you’ll find bits that don’t sound right, and need fixing. I always suggested this to help people get past sluggish prose, especially if they had spent too much time around academic or corporate writing. But now I think the need to Say Your Writing is even more important, in order to combat the insidious impact of AI. For most people, their speech patterns get closer to their real self, so verbalizing writing is the way to fight those forces that try to smooth away a writer’s individuality.
Backendopen article
LogRocket Blog21/07/2026, 13:00
How to clean up AI-generated code with Fallow
Learn how to use Fallow to analyze AI-generated code, detect dead code, duplicate logic, and complexity issues, and integrate automated code quality checks into your AI-assisted development workflow.
Frontendopen article
Codrops21/07/2026, 12:50
Magnetic Commerce: Building the Dash Creative Website
How a simple branding concept shaped the interaction design, animation, and WebGL implementation of the Dash Creative website.
Frontendopen article
MachineLearningMastery.com21/07/2026, 12:33
The Current State of Agentic AI
In this article, you will learn how agentic AI architecture has evolved by mid-2026, including the shift away from orchestrated reasoning loops, the rise of...
AI / MLopen article
Articles on Smashing Magazine — For Web Designers And Developers21/07/2026, 10:00
Weaponizing And Defending The React Flight Protocol: Deserialization Sinks In RSCs
While React Server Components rely on the custom Flight protocol to stream interactive UIs, this same mechanism introduces powerful deserialization sinks that attackers can exploit. Durgesh Pawar breaks down the mechanics behind the CVSS 10.0 “React2Shell” vulnerability to show how protocol manipulation can lead to remote code execution.
Frontendopen article
LogRocket Blog21/07/2026, 03:10
Why I joined the Daily UI challenge after 5 years in UX, and what it taught me
After five years in UX, I revisited the Daily UI challenge to reconnect with hands-on design. Along the way, I learned that great interfaces come from thoughtful briefs, sound judgment, and user feedback, not just better AI tools.
Frontendopen article
Stripe Blog21/07/2026, 00:00
Analyzing the evidence that helps businesses win “product not received” disputes
To understand what can influence win rates, we analyzed evidence packets from one million disputes over a 16-week period. Here’s what the data shows and what it means for how you mitigate disputes.
Backendopen article
Hugging Face - Blog21/07/2026, 00:00
Grabette: an open system to record robot-manipulation data
AI / MLopen article
MachineLearningMastery.com20/07/2026, 11:27
Building Agentic Workflows in Python with LangGraph
In this article, you will learn how to build a complete agentic workflow in Python with LangGraph, from a single model call to a tool-using...
AI / MLopen article
Codrops20/07/2026, 09:45
The Craft Behind Memorable Digital Experiences: Inside Unseen Studio
A look inside Unseen Studio's approach to transforming abstract ideas, hidden stories, and complex concepts into experiences people can explore and remember.
Frontendopen article
Netflix TechBlog - Medium17/07/2026, 21:32
In-House LLM Serving at Netflix
Backendopen article
NN/g latest articles and announcements17/07/2026, 17:00
Does Your Form Really Need a Dropdown List?
Dropdown lists can be used in narrow, specific scenarios, but misuse can cause more harm than alternatives.
Design / UXopen article
NN/g latest articles and announcements17/07/2026, 17:00
Don’t Outsource the Learning: Why Human-Led Research Still Matters in the Age of AI
Even if AI matches researcher-output quality, human-led research will remain essential — the team learning from observing users can't be outsourced.
Design / UXopen article
Codrops17/07/2026, 14:05
ZERO: The Engineering Behind a Defiant Interactive Narrative
A technical breakdown of the pipeline, rendering techniques, and performance optimizations behind ZERO, an immersive scroll driven WebGL experience built for desktop and mobile.
Frontendopen article
MachineLearningMastery.com17/07/2026, 12:00
Agentic AI Security: Defending Against Prompt Injection and Tool Misuse
In this article, you will learn what prompt injection and tool misuse are in the context of agentic AI systems, and which defense strategies experts...
AI / MLopen article
Articles on Smashing Magazine — For Web Designers And Developers17/07/2026, 08:00
When It Makes Sense To “Block” The Main Thread
The common rule of thumb is to never “block” the browser’s main thread when running JavaScript tasks. But is this a hard rule? Victor Ayomipo describes a use case he encountered involving a screenshot extension where he made an exception to the rule and decided that blocking the main thread was absolutely the right thing to do.
Frontendopen article
LogRocket Blog16/07/2026, 15:30
3 examples of great login screen designs
Learn what makes a great login screen through real-world examples and UX best practices for creating secure, accessible, and low-friction authentication flows.
Frontendopen article
Codrops16/07/2026, 14:45
Meet the Speakers of the First Three.js Conference
Before they take the stage, we asked a handful of speakers to share what they're building, exploring, and excited to bring to Paris.
Frontendopen article
Martin Fowler16/07/2026, 13:25
The Archaeologist’s Copilot
When people think of legacy modernization, most folks aren't imagining the target environment will be Java 8. But this was the challenge facing Nik Malykhin when he needed to run a Java 1.5 codebase on today's hardware. His early use of LLMs gave plausible answers that did not hold up in the codebase. Progress came when he grounded the process in evidence, using AI to support analysis, validation in a stable Docker environment, and gradual refactoring protected by tests. The main takeaway is practical: AI was most useful when constrained by evidence, clear roles, and a step-by-step modernization strategy.
Backendopen article
LogRocket Blog16/07/2026, 13:00
How to secure full-stack projects from NPM attacks
Learn how to protect full-stack projects from NPM supply chain attacks with a practical security checklist.
Frontendopen article
MachineLearningMastery.com16/07/2026, 12:25
Run a Local AI Model with Ollama in 15 Minutes
In this article, you will learn how to get a small language model running locally on your own machine in under 15 minutes using Ollama....
AI / MLopen article
Hugging Face - Blog16/07/2026, 11:49
Newer Models, Same Advantage
AI / MLopen article
Hugging Face - Blog16/07/2026, 00:00
Security incident disclosure — July 2026
AI / MLopen article
Hugging Face - Blog15/07/2026, 17:27
Model Routing Is Simple. Until It Isn’t.
AI / MLopen article
Codrops15/07/2026, 14:33
The Architecture Behind Trionn: Coordinating GSAP, Three.js, Lenis, and Web Audio
A behind-the-scenes look at how multiple animation, rendering, and interaction layers were unified into one performant web experience.
Frontendopen article
MachineLearningMastery.com15/07/2026, 12:00
Scikit-Ollama for Scikit-LLM/Ollama Integration
In this article, you will learn how scikit-ollama bridges the scikit-learn interface with locally running Ollama models to perform zero-shot text classification; no cloud API...
AI / MLopen article
Articles on Smashing Magazine — For Web Designers And Developers15/07/2026, 10:00
No, People Don’t Want More AI In Their Life
Many companies assume everyone craves new AI features. But the reality is that most people don't want more AI — at least not in the way most AI leaders envision it. Brought to you by Design Patterns For AI Interfaces, friendly video courses on UX and design patterns by Vitaly.
Frontendopen article
NN/g latest articles and announcements15/07/2026, 06:59
UX Conference October Announced (Oct 5 - Oct 16)
Take up to 5 in-depth training courses, teaching user experience best practices for successful design. Training focused on long-lasting skills for UX professionals. October 5 - October 16, 2026.
Design / UXopen article
Hugging Face - Blog15/07/2026, 00:00
Welcome Inkling by Thinking Machines
AI / MLopen article
Martin Fowler14/07/2026, 12:51
DSLs Enable Reliable Use of LLMs
LLMs generate code incredibly fast, but to ensure they generate exactly what is intended, they need clear boundaries. Abstractions and Domain-Specific Languages (DSLs) provide a strong harness that guides LLMs right from the start. Unmesh Joshi describes how the example of Tickloom - a domain model and DSL for illustrating distributed system behavior - shows how we can use an LLM as a partner to iteratively build a DSL and as a natural language interface to use it. Such a DSL can act as the key source of truth for software systems in the world of LLMs.
Backendopen article
MachineLearningMastery.com14/07/2026, 12:00
LLM Evaluation Frameworks Compared: How to Actually Measure What Your Model Does
In this article, you will learn how to evaluate LLM applications using the three dominant open-source frameworks — RAGAS, DeepEval, and Promptfoo — and why...
AI / MLopen article
Netflix TechBlog - Medium13/07/2026, 22:44
Building Service Topology at Scale: Architecture, Challenges, and Lessons Learned
Backendopen article
Martin Fowler13/07/2026, 12:51
Fragments: July 13
Some more of my notes from Thoughtworks Future of Software Development Retreat. When we had our first retreat in Utah early this year, nobody had heard of Harness Engineering. This time we had a whole session on it. When it comes to the guide side of harnesses, most of the discussion is about context management. While context windows have increased in size as models get more sophisticated, that doesn’t mean that models will properly focus on the right bits. Models typically only focus attention on part of the context, and to get the best behavior, we need to manage that focus. One attendee keeps their context small, limiting the agents.md file to less than 200 lines. On the sensor side, we see more attention on computational sensors. Two patterns from one participant were shifting to languages with greater controls (eg Rust rather than Python) and “leveling up” validation approaches, using more property-based testing and techniques from formal methods. One commented that while they aren’t smart enough to write specifications in a formal specification language, they are smart enough to read it and check it makes sense for their domain. Will our attention on harnesses last long enough for our next retreat? Will the models just get so good that harnesses become unnecessary? Those with some mechanical sympathy for LLMs seem to think not - but are they overly coupled to the current state of technology? I find such speculation tends not to lead anywhere useful. I’ve not seen much success in guessing the future in the past, and with technology as radical as this, I don’t see it being any easier. So for the moment, attention to harnesses pays off. We find it reduces token usage, and also allows weaker models to be useful, supporting such things as local hosting of open-weight models. ❄ ❄ Which naturally segues me to a session on self-hosted models. 
Increasing token costs have made hosting an open-weight model more attractive, particularly due to the decreasing time for open-weight models to catch up with frontier models. Cost isn’t the only factor, however; many folks find a desire to be independent of the frontier model firms to be the driving force. After all we’ve seen the U.S. government intervene to deny access to models, increasing the desire for greater model sovereignty. Information security is also something to consider; some attendees just can’t give models necessary data for critical work. Even without that, if someone else hosts the model then their model learns rather than your model. And although recent events have increased interest, several participants worked with companies that had been self-hosting for up to a couple of years. Is this trudging down the same path of self-hosted clouds, which led to lots of folks spending excessive funds on half-arsed private clouds? The answer hinges upon whether it ends up being simpler to host a model than a cloud, perhaps due to a simpler interaction protocol. The hard part of this may be the talent required to efficiently use the GPUs; managing an inference data center currently isn’t a widely available skill. Even self-hosted models are a cost to operate: capital costs in GPUs, ongoing costs in electricity. The physical design of a data center can affect optimal usage. There’s an opportunity here for professional services firms to help companies manage this. Cost control also involves teaching people to pick the right model for the job. Can we teach engineers, or indeed other users, to pick a less-powerful model? This, of course, could be a job for the model itself, acting as a broker, deciding which model is the best choice to tackle certain jobs. Self-hosting may lead to a greater use of fine-tuning. 
Currently that’s a niche activity, but over time we could well find that models that are fine-tuned to a particular domain need less reasoning, consume fewer tokens, and thus are cheaper to operate. We are seeing models trained specifically to support programming. As with any topic with this degree of uncertainty, the big win isn’t finding the right answer, but coming up with a strategy that will cope with the inevitable and unpredictable changes. ❄ ❄ After an event like this, many people come up to me and ask me to make some grand summing up. I hate this, because I rarely leave these kinds of events with some grand narrative. Even after mulling on it afterwards (in writing the above notes) I still usually don’t have one, and distrust one that forms, as my skepticism includes attempts to make coherent narratives of an event that’s naturally rather jumbled. However my failings are irrelevant this time, because Kief Morris has put together such a narrative, and it’s a convincing one, even to a narrative-denier like me. The sessions had different titles and different casts, and on the surface they were about different problems. But they weren’t. Nearly every one of them was a different facet of the same argument. How much do we let an agent decide, and how do we stay confident in what it does? He looks at code review, questions whether it matters, but sees that the rigor that many associate with code review shifts to other forms. He describes the disagreements about how much we should trust an agent to identify and fix production incidents. He sees that the contrast between how much leeway teams give to agents depends on the context they are operating in. Underneath all of these sessions, the operations debate, the wide-remit team, the dark-factory spectrum, the argument about who’s allowed to steer the model, people were making the same handful of choices over and over about a single thing: the unit of work they were prepared to hand to an agent. How big it is. 
How much of the job it covers. What you do to get it ready to hand over. How you check what comes back. What you put around the agent to keep it inside the lines. Different rooms set those differently, but they were setting the same controls. ❄ ❄ Sam Ruby convened a session called “Bring me a Rock”. The name evokes a particular kind of management dysfunction. The manager tells his underlings to bring him a rock, and then starts rejecting the results without explaining why (“no not that one”, “no not that one”) until eventually one rock matches the unstated expectation. It names a manager who substitutes serial rejection for the work of saying what they want, and makes you pay for their unfinished thinking one rock at a time. Sam had already written why he thought that, with LLMs, this changed from a slur to a defensible way to work. When it’s a bunch of tireless machines with endless patience, that return new rocks in minutes rather than days, then an approach like this (using the brainstorming register) becomes a defensible way to work. Sam described the discussion: The room pulled it somewhere narrower than I’d framed, and the narrower place was the more interesting one: not how to explore by elimination but who should even be allowed to. Product managers, increasingly people managers, are reaching for these models directly, and seasoned engineers get measurably better results from them than untrained people do — so the worry followed. If expertise is what separates a good outcome from slop, should non-engineers be steering the model at all? It’s a fair question, and I think it’s the wrong one, because it mistakes the act. When a manager reaches for an LLM instead of routing the work to the team that reports to them, they didn’t pick up a tool — they made a hire. And you don’t ask permission to manage your own team; a manager who decides a piece of work is better given to a new participant than to the existing one is doing the most ordinary thing a manager does. 
Framed that way, the permission question dissolves into an older, better-understood one — the one Drucker named in 1959: when the worker knows more about the specifics than the manager does, you manage by objective, not by method. The non-engineer steering an agent is exactly that manager, out-known by the thing they’re directing, and the slop the room feared is the old danger of managing by method when you should be managing by objective. The question isn’t may they hire? It’s do they know how to manage by objective? — which you can teach, hire for, and hold people to without anyone first becoming an engineer. Sam’s article explores managing an LLM by objective, giving it a goal rather than a task. And Kief’s earlier point about the essence of the discussion still holds: how confident can we be that it’s done the right thing? We can outsource many things, but not the acceptance criteria; at some point there’s a human request and a human judgment on whether that request was properly executed. But the danger lies in important unstated objectives, unstated perhaps because they weren’t even imagined. It’s easy to state objectives around desired functionality. Give me an application that will examine my emails and form a todo list for today. But behind that simple statement is a thicket of unstated assumptions. We tend to assume The Genie won’t include any undesired functionality, perhaps deleting emails it thinks are unworthy of our attention. We assume it won’t let an email tell it to send private information to villain@evil.com. We have some hope here - we hear more experiences that suggest that recent models can do an excellent job of finding (and hopefully fixing) security holes. The careful precision of the machine outruns the sloppy if imaginative thinking in squishyware. Perhaps we can assume the genie can take care of some of our unstated objectives. 
Conformance tests (sensors) are more valuable than specifications (guides), but it’s hard to imagine all the conformance tests that are needed to say what shouldn’t happen. Furthermore, building software is about exploration, finding out how a workflow can evolve as machines are embedded in the process. For a human to guide that process, we need some understanding of it. My sense is that model building is still important, and while I agree that the genie can take an active role in that construction, I don’t think the human can entirely outsource it. Even if the genie builds the model itself, it needs to teach us that model, because the model helps us imagine and communicate the goals, the objectives that we give to the machine. ❄ ❄ ❄ ❄ ❄ If you follow my feeds (which you probably do if you’re reading this), then you’ll know that Birgitta Böckeler has written a couple of memos on working with local models. She first looked at the factors that influence how viable they are for programming, and then related some of her recent experiences evaluating such models. As a nice, if accidental, complement to these, Sebastian Raschka wrote a detailed guide to his local model environment. Like Birgitta, he’s found the Qwen 3.6 model to be the current sweet spot for local agentic programming. ❄ ❄ ❄ ❄ ❄ Simon Willison shares a useful tip to save money while using the latest Anthropic Fable model: Tell Fable to use other models for smaller tasks, applying its own judgement about which model to use. ❄ ❄ ❄ ❄ ❄ Josh Comeau writes a blog and online courses for developer education, primarily front-end web material. He’s been successful for most of this decade but has found his online courses have had only ⅓ the sales this year. He attributes this to AI, partly as people worry if it’s worth spending money on a job that may not have a future, but also because AI can provide personalized tutoring. ideally, it shouldn’t cost any money to learn stuff. 
But I sorta worry about how this is supposed to work, going forwards, if there’s no incentive for people to make high-quality free content. I’ve spoken to a few course creators now, and we’re all seeing the same trend. Revenue down 50%+. Fewer people engaging with our content. People switching to LLMs, which slurp up all of our work and regurgitate it, without consent or compensation. It feels pretty bleak. 😅 ❄ ❄ ❄ ❄ ❄ John Gruber is annoyed that Claude’s desktop app for macOS uses Electron. Electron guarantees that an app feels just as wrong on all platforms. He has some tasty invective for the folks at Anthropic with ties to the Electron platform. Finding out that one guy — who is a senior Electron maintainer — has led the teams for the desktop clients for Slack, Notion, and now Claude is like discovering that it was one guy — whose family business was a distillery — who helmed the Titanic, piloted the Hindenburg, and then served as air traffic controller for Amelia Earhart. The deeper question here is whether there should be a future for cross-platform front-ends in the world of agentic programming. There’s lots of evidence that coding agents do a great job of building the same thing in multiple languages and platform ecosystems. That should mean that the days of least-denominator cross-platform UIs are numbered - and that number is small. ❄ ❄ ❄ ❄ ❄ Dan Davies tries to draw a distinction between interactional and contributory expertise. Contributory expertise is that held by people who are doing the work to advance a field of study; interactional expertise is held by folks that spend time talking to contributory experts, building up a decent store of knowledge themselves, but not steeped in the day-to-day of the work. 
it seems to me that there is an important distinction here, which is not any less important because the dividing line might be difficult to establish empirically, or even if that line turns out to be in a different place from where we guessed it was. As well as difficult cases where it’s not clear, I think we could also come up with cases where the distinction between interactional and contributory expertise would suddenly become very clear and important indeed – the ones where someone who was faking it got “found out”. And so the question that I think is quite important is whether there is a similar kind of distinction between the kind of expertise that it’s possible for a machine to get by industrialised consumption and interaction with a much larger corpus of literature than any human being could inhale, and genuine contributory expertise that could apply to entirely new situations outside that literature. As a human, I’d like to think I’m more of a contributor than an interactor (especially given my increasing introversion), and thus relatively safe from being forced into obsolescence by silicon. But I’m also aware that my career is devoid of any original ideas, my skill is only that of someone who is good at selecting and explaining the ideas of others. (As Brian Foote put it more memorably: “an intellectual jackal with good taste in carrion”.) But there’s skill in being a good jackal too - and we don’t really know yet where the real boundaries of the LLMs will lie.
Backendopen article
NN/g latest articles and announcements10/07/2026, 17:00
The 5 Qualities of Site-Specific AI Chatbots
Handoff willingness, flexibility, proactivity, emotional responsiveness, and transparency help you build trustworthy AI chatbots that guide users well.
Design / UXopen article
NN/g latest articles and announcements10/07/2026, 17:00
Design-System Maturity: A 6-Dimension Framework
Use this maturity model to assess your design system’s health and identify where to focus next.
Design / UXopen article
Articles on Smashing Magazine — For Web Designers And Developers10/07/2026, 13:00
From Kickoff To First Concept: How To Turn Brand Strategy Into Visual Direction
The strongest visual concepts don’t start in Figma. They start with the right questions. Explore the pre-concept phase of brand identity design, where teams research brand context, uncover hidden assumptions with stakeholders, and turn shared direction into a visual foundation before a single concept is created.
Frontendopen article
Articles on Smashing Magazine — For Web Designers And Developers09/07/2026, 15:00
Designing For Distressed Users: Why Mental Health Apps Shouldn’t Follow Every UI Fashion
Many UI trends are designed to capture attention and signal innovation, but those goals often conflict with the needs of mental health apps: reducing cognitive strain, fostering trust, and providing a sense of refuge. Kat Homan introduces an evaluation framework that helps designers assess whether trendy visual and interaction patterns support or undermine the unique goals of mental health experiences.
Frontendopen article
Martin Fowler08/07/2026, 11:57
Experiences with local models for coding
Birgitta Böckeler now reports on her recent experiences trying local LLMs for coding. She compares them using two standard tasks, and tries out the most promising model for day-to-day use.
Backendopen article
Martin Fowler07/07/2026, 12:34
Viability of local models for coding
Birgitta Böckeler recently spent some time trying out running local LLMs for some programming tasks. In this memo she outlines the factors that influence how viable they are for the job.
Backendopen article
Articles on Smashing Magazine — For Web Designers And Developers07/07/2026, 10:00
Meet Kirki: WordPress’s First Visual Builder With An Infinite Canvas
We have been building websites inside boxes for years on WordPress. Let’s take a closer look at Kirki (https://kirki.com/), the first freeform visual builder with an infinite canvas, and explore how it redefines the experience with cleaner performance, full design freedom, and zero plugin dependency.
Frontendopen article
Martin Fowler06/07/2026, 12:53
Fragments: July 6
Last week, Thoughtworks ran a second Future of Software Development Retreat, this time in Europe. As with the previous event, I’ll be sharing some fragmentary thoughts on this. There were five parallel streams, so I could, at best, only attend ⅕ of sessions. This isn’t an event that forms conclusions, rather one that allows those exploring to share what they’ve found, and their visions for the future. The bliki post lists all the writing I’ve run into on this, by myself and others. I’ll be updating it as more posts appear. Giles Edwards-Alexander “noticed a real difference between the retreats”: Where Deer Valley had hesitancy and a belief that there was something here even if we weren’t yet sure what it was, Engelberg had confidence: the value is here. As I explained to a colleague today, this was not a conference for true believers: the evidence is in. What does the evidence say? Well, that was less clear. Some patterns and practices are emerging (one attendee had catalogued dozens of agentic engineering pattern libraries) but they are emerging. There is more work to do to truly establish what is effective, and when. Greg Herlein felt similarly: Reading the reports of the February event, when a lot of these same folks last got together, the conversation was about what agentic development might look like. Aspirational. More about what was coming. This time? Everybody in the room was doing it. Shipping it. Not slides - production. The whole debate about whether this changes software engineering is over. People have stopped arguing about whether a while ago. They’re arguing about how, and the how is getting real. On a more micro level, I noted two other things. Firstly, there was much talk now about harness engineering, when that wasn’t even a term in Utah - an example of how rapidly things are moving. Secondly people are now worrying about the cost of tokens, where before folks were wanting to do almost anything to incentivize people to talk to The Genie. 
❄ ❄ A question that continued from Utah was whether architecture and design are still important. There seem to be two landmark hypotheses here. One is that The Genie has such a Galaxy Brain that we no longer need to care about such matters; it will handle as much spaghetti as we can throw at it. The other is, in Laura Tacho’s memorable phrase: “the Venn Diagram of Developer Experience and Agent Experience is a circle”. The point being that The Genie uses the same constructs to understand a code base that humans do, so things like good modularity and naming help it as much as they help humans. Adam Tornhill’s writing is a good example of this viewpoint. Tidbits from our session on this:
- to evaluate the value of architecture we need to focus on desirable outcomes. Internal design quality boils down to ease of change. The question is whether the lessons we’ve learned so far will continue for agents.
- a way to measure design quality is to look at token costs. If the same change requires fewer tokens, that indicates a better architecture.
- a good architecture only shows its quality over time; we can’t easily measure it in the short term
- why did 3GL languages continue when things like 4GLs, UML etc did not take hold? It’s because these programming languages hit a sweet spot of human comprehension of computation
- we’re at the first time ever where the computers care about code quality
- will future models write machine code directly? If so what will humans review or specify?
- we should beware of speculating about what LLMs may do in the future. Instead we need mechanical sympathy for our LLMs, so we can gain a sense of how they work and how best to use them.
- One workflow:
  - take story from backlog
  - talk it over with an agent
  - once you get an agreement, make an ADR for persistent record of spec
  - generate a task list
  - get agent to complete it
- we need abstractions to communicate with agents (echoing Unmesh Joshi’s thoughts on building conceptual models)
- we often find duplication in LLM generated code, together with mixing of concerns (eg intermingled domain and display logic) - even with a good harness
- get agents to generate explanatory documentation at the end of a session
- overnight quality checks with a report for humans to act on in the morning
- LLMs look at existing code, so if that code has problems, the LLM will amplify them
- we should be wary of drawing too many conclusions comparing LLM code with human code - human code varies enormously from team to team.
❄ ❄ In his account of the retreat, Mathias Verraes goes into the details of his perspective of these issues of software design. He adds another concern: we need good design as a hedge against the risk of dependence on AI. After all, we don’t know how high the costs may rise. We see governments blocking access to models. We see popular opposition to AI campaigning against data centers and calling for regulation. How much can we rely on AI tools being available to maintain and extend our software in the future? ❄ ❄ ❄ ❄ ❄ Charity Majors has a post on the ethics of working with AI and does an excellent job of articulating how I feel about this topic. She outlines the harms inherent in AI, both in the creation of its models (training on stolen data) and in inference (slop, lack of accountability, skill atrophy). Her conclusion however, like mine, is that there’s no ethical gain from renouncing the use of AI and castigating those who use it. Such purity provides little practical help with a technology that is so powerful and so useful. The way you show care is by showing up. 
The way you make the world a better place is by getting down in the muck and building it, using whatever skills and resources you have on hand. The way you drive change is you engage. Yes, we are all complicit. Yes, we are all compromised. No argument. But what are you going to do with that feeling of conviction? Will you channel your discomfort into solidarity and action, or try to ease your conscience by removing yourself from the system? Which does more to help those being harmed? Her suggestions on how to engage aren’t striking, but that’s hardly unusual. At the Future of Software Development Retreat I convened a session on this question, and nothing striking turned up there either. That said, I’ve never been much of an activist, so my imagination may be limited. ❄ ❄ ❄ ❄ ❄ Gergely Orosz has run into a case where an article of his was erased from Google search by a clearly fraudulent DMCA claim. It seems that anyone can file a bogus copyright claim to get an article they don’t like removed from Google’s search index. This happened in this case. I have no information on who filed the copyright claim. Even less so on who claims to be the copyright owner? Because I am the only possible copyright owner! He was able to find the DMCA complaint; it was made by “Ellie Piee” whose profile listed them as living on Bouvet Island, an uninhabited Norwegian dependent territory near Antarctica. It claimed Gergely’s article copied a New York Post article entitled “Band Leader Hits Winning Chord”. But Gergely’s article is “Inside Pollen’s Collapse: “$200M Raised” but Staff Unpaid”, and the two do not share a single sentence. There’s an obvious motivation for folks connected with Pollen to have done this, and I hope the resulting Streisand effect bites them where it hurts. ❄ ❄ ❄ ❄ ❄ 404 Media have a bunch of (paywalled) reports on the impact of companies realizing that token costs are getting out of control. 
They’ve acquired leaked Slack chats, internal dashboards, emails and other material from companies including Citi and Amazon. Companies are urging staff to use less powerful models, or cutting off frontier models entirely. A dashboard indicates that one company has seen its token bill rise from $5 million in August 2025 to $15 million in May 2026, on track to spend over $120 million in the fiscal year. 404 earlier reported about Accenture taking steps to reduce token usage. The biggest problem wasn’t software engineering using agentic programming, but rather staff “chewing tokens” by using AI to do things like turning PDFs into presentation slides. They saw themselves, and their clients, grappling with exponential increases in token costs. Inevitably, after consulting firms spent time urging their clients to use AI heavily, they are now offering services to control these costs. Another post says it appears that one way to reduce token costs is to get AI tools to speak like cavemen, using a skill/plugin. There’s a good summary of all this on 404’s freely available podcast: The AI Tokenpocalypse Is Here. ❄ ❄ ❄ ❄ ❄ I share these thoughts just after the July 4th weekend here in America, indeed the Semiquincentennial. Historian Bret Devereaux celebrated this event with a careful reading of the Declaration of Independence, a document often talked about more than it’s read. Which is a shame since it is hardly very long, and its impact was remarkable, and not just in what is now the United States. The Declaration of Independence was recognized as a radical, potentially explosive document at the time of its issuance, as we’ll see. And it was explosive: the world of 1775 was one dominated by monarchies with just a tiny handful of traditional republics (which we should not ignore!). 
It took a long time for the seeds of the declaration to spread, but the world it helped create is one where liberal democracies, while hardly universal (more people have always lived in unfree societies than free ones) represent the most economically and culturally dominant bloc in world affairs – something that had never happened before. The Declaration, in its way, remade not just the Thirteen Colonies, but slowly, surely, as water seeps through the cracks of rocks (or my floorboards, alas), it remade the whole world.
Devereaux shines a light onto the world of this text, illuminating its historic context, a world that is very different to the one anyone reading this grew up in. Its assertions of a natural law that there is equality of rights among men and that governments ought to derive their powers from the consent of the governed would seem hardly worth stating now, yet were deeply radical in 1776. I’ve found that reading history like this has helped me understand how the world is, and gives me a broader perspective on the drama of current affairs.
Backendopen article
NN/g latest articles and announcements03/07/2026, 17:00
Stop Reporting UX Activity and Report Business Outcomes
UX teams should report business outcomes — not activity or UX metrics — to show impact on revenue, cost, risk, speed, retention, and to secure resources.
Design / UXopen article
NN/g latest articles and announcements03/07/2026, 17:00
Crafting AI Explanations for Every Role in Your Enterprise
Different enterprise roles need different types of explanations for AI outputs.
Design / UXopen article
Articles on Smashing Magazine — For Web Designers And Developers03/07/2026, 13:00
Users Don’t Need More Tools: They Need Seamless Integrations
A closer look at why users don’t need more tools in their daily lives. What they need are seamless integrations of useful features to match already existing, established mental models. Brought to you by Design Patterns For AI Interfaces, a friendly video course on UX and design patterns by Vitaly.
Frontendopen article
Articles on Smashing Magazine — For Web Designers And Developers02/07/2026, 10:00
Matching AI Modality To User Intent: Designing The Right Interface
We’ve fallen into conversational tunnel vision, defaulting every AI capability into a chat-based interface simply because LLMs are trained on dialogue data. But great UX is about matching modality to users’ context, intent, and cognitive load, so the interface adapts to the user, not the other way around.
Frontendopen article
Netflix TechBlog - Medium29/06/2026, 13:01
GenPage: Towards End-to-End Generative Homepage Construction at Netflix
Backendopen article
NN/g latest articles and announcements26/06/2026, 17:00
Kick the Bots Out of Your Survey Data
Learn to spot and filter out survey bots’ responses before analysis so fake data doesn’t distort your findings.
Design / UXopen article
Netflix TechBlog - Medium23/06/2026, 00:31
Toward More Controllable AI Video Editing: An Early Research Exploration at Netflix
Backendopen article
Stripe Blog23/06/2026, 00:00
Four travel and hospitality trends from HITEC 2026
More than 6,000 hospitality executives and operators gathered in San Antonio last week for the HITEC conference. The big topic: whether the industry’s AI investment is actually working. Across four days and over 50 meetings, four trends stood out.
Backendopen article
Netflix TechBlog - Medium22/06/2026, 21:35
How Netflix Simplified Batch Compute with Kueue
Backendopen article
Netflix TechBlog - Medium19/06/2026, 23:54
The Data Canary: How Netflix Validates Catalog Metadata
Backendopen article
Netflix TechBlog - Medium19/06/2026, 23:54
Data Projects: Managing Data Assets at Netflix Scale
Backendopen article
Netflix TechBlog - Medium19/06/2026, 23:53
Predicting Risk in Content Launches: How Data-Driven Insights can Transform Launch Planning
Backendopen article
Netflix TechBlog - Medium19/06/2026, 23:53
The Evolution of Cassandra Data Movement at Netflix
Backendopen article
Netflix TechBlog - Medium19/06/2026, 23:53
Thinking Fast & Slow for a Personalized Notification System
Backendopen article
Stripe Blog18/06/2026, 00:00
What Link data tells us about AI spending
We analyzed spending patterns across the 250 million customers paying with Link. We found that Link customers are spending more on AI than they were three months prior, investing heavily in platforms that let them build with AI.
Backendopen article
Martin Fowler16/06/2026, 12:11
Building Reliable Agentic AI Systems
One of the most interesting projects my colleagues have done with LLMs has been building a system with Bayer to allow pharmaceutical researchers to query decades of information about studies buried in PDF reports. Sarang Sanjay Kulkarni describes its evolution from keyword-based search to an intelligent research assistant capable of answering complex questions and drafting regulatory documents.
Backendopen article
Stripe Blog11/06/2026, 00:00
Stripe Projects adds new agent integrations, more providers, and custom developer controls
Our data shows that agents are now fully capable of independently writing code and integrating with APIs like Stripe’s. And yet, many of the steps adjacent to writing code are still too hard for agents to do on their own. We’re expanding Stripe Projects to solve this.
Backendopen article
Stripe Blog04/06/2026, 00:00
New ways to turn global demand into revenue
At Sessions 2026, Stripe unveiled dozens of products and capabilities to help businesses turn global demand into revenue. See how to go global faster with localized checkout and Adaptive Pricing, smarter fraud tools, multicurrency treasury support, and automated tax compliance.
Backendopen article
Stripe Blog04/06/2026, 00:00
The future of agentic commerce is here
Explore how AI agents are transforming commerce at Stripe’s Agentic Commerce Next roadshow. Reserve your spot in Seattle.
Backendopen article
Stripe Blog04/06/2026, 00:00
Rethinking risk in the age of AI
Join senior risk and payments leaders in Seattle to explore how AI is reshaping fraud strategy. Seats are limited.
Backendopen article
Stripe Blog03/06/2026, 00:00
Helping businesses optimize network costs with the Visa Digital Commerce Authentication Program (DCAP)
We moved quickly to help Stripe businesses take advantage of DCAP and capture interchange savings while protecting authorization rates. Here’s what we did.
Backendopen article
Stripe Blog28/05/2026, 00:00
Solo founding is at an all-time high: Top performers have these traits in common
In 2025, solo founders in the top decile generated 61 times the revenue of the median solo founder in their first six months. We analyzed the data to understand what drives that gap.
Backendopen article
Stripe Blog27/05/2026, 00:00
Expanding Stripe Radar to protect more of your business
Radar now blocks high-risk transactions across all supported payment methods; defends against new fraud types like multi-account abuse and pay-as-you-go abuse, regardless of which payment processor you use; and gives platforms new tools to evaluate and mitigate merchant risk on and off Stripe.
Backendopen article
Google AI Blog29/03/2024, 18:03
Generative AI to quantify uncertainty in weather forecasting
Posted by Lizao (Larry) Li, Software Engineer, and Rob Carver, Research Scientist, Google Research
Accurate weather forecasts can have a direct impact on people’s lives, from helping make routine decisions, like what to pack for a day’s activities, to informing urgent actions, for example, protecting people in the face of hazardous weather conditions. The importance of accurate and timely weather forecasts will only increase as the climate changes. Recognizing this, we at Google have been investing in weather and climate research to help ensure that the forecasting technology of tomorrow can meet the demand for reliable weather information. Some of our recent innovations include MetNet-3, Google's high-resolution forecasts up to 24 hours into the future, and GraphCast, a weather model that can predict weather up to 10 days ahead.
Weather is inherently stochastic. To quantify the uncertainty, traditional methods rely on physics-based simulation to generate an ensemble of forecasts. However, it is computationally costly to generate a large ensemble so that rare and extreme weather events can be discerned and characterized accurately. With that in mind, we are excited to announce our latest innovation designed to accelerate progress in weather forecasting, Scalable Ensemble Envelope Diffusion Sampler (SEEDS), recently published in Science Advances. SEEDS is a generative AI model that can efficiently generate ensembles of weather forecasts at scale at a small fraction of the cost of traditional physics-based forecasting models. This technology opens up novel opportunities for weather and climate science, and it represents one of the first applications to weather and climate forecasting of probabilistic diffusion models, a generative AI technology behind recent advances in media generation. 
The need for probabilistic forecasts: the butterfly effect
In December 1972, at the American Association for the Advancement of Science meeting in Washington, D.C., MIT meteorology professor Ed Lorenz gave a talk entitled, “Does the Flap of a Butterfly's Wings in Brazil Set Off a Tornado in Texas?” which contributed to the term “butterfly effect”. He was building on his earlier, landmark 1963 paper where he examined the feasibility of “very-long-range weather prediction” and described how errors in initial conditions grow exponentially when integrated in time with numerical weather prediction models. This exponential error growth, known as chaos, results in a deterministic predictability limit that restricts the use of individual forecasts in decision making, because they do not quantify the inherent uncertainty of weather conditions. This is particularly problematic when forecasting extreme weather events, such as hurricanes, heatwaves, or floods.
Recognizing the limitations of deterministic forecasts, weather agencies around the world issue probabilistic forecasts. Such forecasts are based on ensembles of deterministic forecasts, each of which is generated by including synthetic noise in the initial conditions and stochasticity in the physical processes. Leveraging the fast error growth rate in weather models, the forecasts in an ensemble are purposefully different: the initial uncertainties are tuned to generate runs that are as different as possible and the stochastic processes in the weather model introduce additional differences during the model run. The error growth is mitigated by averaging all the forecasts in the ensemble and the variability in the ensemble of forecasts quantifies the uncertainty of the weather conditions.
While effective, generating these probabilistic forecasts is computationally costly. They require running highly complex numerical weather models on massive supercomputers multiple times. 
Consequently, many operational weather forecasts can only afford to generate ~10–50 ensemble members for each forecast cycle. This is a problem for users concerned with the likelihood of rare but high-impact weather events, which typically require much larger ensembles to assess beyond a few days. For instance, one would need a 10,000-member ensemble to forecast the likelihood of events with 1% probability of occurrence with a relative error less than 10%. Quantifying the probability of such extreme events could be useful, for example, for emergency management preparation or for energy traders.
SEEDS: AI-enabled advances
In the paper, we present the Scalable Ensemble Envelope Diffusion Sampler (SEEDS), a generative AI technology for weather forecast ensemble generation. SEEDS is based on denoising diffusion probabilistic models, a state-of-the-art generative AI method pioneered in part by Google Research. SEEDS can generate a large ensemble conditioned on as few as one or two forecasts from an operational numerical weather prediction system. The generated ensembles not only yield plausible real-weather–like forecasts but also match or exceed physics-based ensembles in skill metrics such as the rank histogram, the root-mean-squared error (RMSE), and the continuous ranked probability score (CRPS). In particular, the generated ensembles assign more accurate likelihoods to the tail of the forecast distribution, such as ±2σ and ±3σ weather events. Most importantly, the computational cost of the model is negligible when compared to the hours of computational time needed by supercomputers to make a forecast. It has a throughput of 256 ensemble members (at 2° resolution) per 3 minutes on Google Cloud TPUv3-32 instances and can easily scale to higher throughput by deploying more accelerators.
SEEDS generates an order-of-magnitude more samples to in-fill distributions of weather patterns. 
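The 10,000-member figure quoted in this section is consistent with a simple sampling argument. The sketch below is not taken from the SEEDS paper: it assumes each ensemble member is an independent draw and that the event probability is estimated as the fraction of members in which the event occurs, so the estimate follows a binomial distribution. The function names are illustrative.

```python
import math


def relative_standard_error(p: float, n_members: int) -> float:
    """Relative standard error of the frequency estimate of probability p
    from n_members independent members (binomial assumption)."""
    return math.sqrt((1.0 - p) / (n_members * p))


def members_needed(p: float, target_rel_error: float) -> int:
    """Smallest ensemble size whose relative standard error is at or below
    the target, obtained by solving the formula above for n_members."""
    return math.ceil((1.0 - p) / (p * target_rel_error ** 2))


if __name__ == "__main__":
    # Compare a typical operational ensemble with much larger ones
    # for an event that occurs with 1% probability.
    for n in (50, 1_000, 10_000):
        print(n, round(relative_standard_error(0.01, n), 3))
    print("members for 10% relative error:", members_needed(0.01, 0.10))
```

Under these assumptions a 50-member ensemble leaves a relative error above 100% for a 1% event, while roughly ten thousand members bring it just under 10%.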
Generating plausible weather forecasts
We compare ensembles from SEEDS with those from the operational U.S. weather prediction system (Global Ensemble Forecast System, GEFS) for a particular date during the 2022 European heat waves. We also compare the results to the forecasts from a Gaussian model that predicts the univariate mean and standard deviation of each atmospheric field at each location, a common and computationally efficient but less sophisticated data-driven approach. This Gaussian model is meant to characterize the output of pointwise post-processing, which ignores correlations and treats each grid point as an independent random variable. In contrast, a real weather map would have detailed correlational structures.
Because SEEDS directly models the joint distribution of the atmospheric state, it realistically captures both the spatial covariance and the correlation between mid-tropospheric geopotential and mean sea level pressure, both of which are closely related and are commonly used by weather forecasters for evaluation and verification of forecasts. Gradients in the mean sea level pressure are what drive winds at the surface, while gradients in mid-tropospheric geopotential create upper-level winds that move large-scale weather patterns.
The generated samples from SEEDS shown in the figure below (frames Ca–Ch) display a geopotential trough west of Portugal with spatial structure similar to that found in the operational U.S. forecasts or the reanalysis based on observations. Although the Gaussian model predicts the marginal univariate distributions adequately, it fails to capture cross-field or spatial correlations. This hinders the assessment of the effects that these anomalies may have on hot air intrusions from North Africa, which can exacerbate heat waves over Europe.
Stamp maps over Europe on 2022/07/14 at 0:00 UTC. The contours are for the mean sea level pressure (dashed lines mark isobars below 1010 hPa) while the heatmap depicts the geopotential height at the 500 hPa pressure level. 
(A) The ERA5 reanalysis, a proxy for real observations. (Ba-Bb) 2 members from the 7-day U.S. operational forecasts used as seeds to our model. (Ca-Ch) 8 samples drawn from SEEDS. (Da-Dh) 8 non-seeding members from the 7-day U.S. operational ensemble forecast. (Ea-Ed) 4 samples from a pointwise Gaussian model parameterized by the mean and variance of the entire U.S. operational ensemble.
Covering extreme events more accurately
SEEDS provides better statistical coverage of the 2022/07/14 European extreme heat event, denoted by the brown star. Each plot shows the values of the total column-integrated water vapor (TCVW) vs. temperature over a grid point near Lisbon, Portugal from 16,384 samples generated by our models, shown as green dots, conditioned on 2 seeds (blue squares) taken from the 7-day U.S. operational ensemble forecasts (denoted by the sparser brown triangles). The valid forecast time is 1:00 local time. The solid contour levels correspond to iso-proportions of the kernel density of SEEDS, with the outermost one encircling 95% of the mass and 11.875% between each level.
Conclusion and future outlook
Acknowledgements
All SEEDS authors, Lizao Li, Rob Carver, Ignacio Lopez-Gomez, Fei Sha and John Anderson, co-authored this blog post, with Carla Bromberg as Program Lead. We also thank Tom Small who designed the animation. Our colleagues at Google Research have provided invaluable advice to the SEEDS work. Among them, we thank Leonardo Zepeda-Núñez, Zhong Yi Wan, Stephan Rasp, Stephan Hoyer, and Tapio Schneider for their inputs and useful discussion. We thank Tyler Russell for additional technical program management, as well as Alex Merose for data coordination and support. We also thank Cenk Gazen, Shreya Agrawal, and Jason Hickey for discussions in the early stage of the SEEDS work.
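The contrast the post draws between a pointwise Gaussian model and a model of the joint distribution can be illustrated with a toy example. This is only a sketch, not the SEEDS method: two correlated standard-normal variables stand in for two related fields at one grid point, and the "pointwise" sampler fits a mean and standard deviation to each variable and then samples them independently. All names and the correlation value are assumptions made for illustration.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def sample_joint(rng, rho, n):
    """Draw n pairs from a bivariate standard normal with correlation rho."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho ** 2) * z2))
    return pairs


def sample_pointwise(rng, pairs):
    """Fit a mean and standard deviation to each variable separately,
    then sample each variable independently of the other."""
    n = len(pairs)
    columns = []
    for col in zip(*pairs):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        columns.append([rng.gauss(mean, std) for _ in range(n)])
    return list(zip(*columns))


if __name__ == "__main__":
    rng = random.Random(0)
    joint = sample_joint(rng, rho=0.9, n=5000)
    pointwise = sample_pointwise(rng, joint)
    print("joint correlation:", round(correlation(*zip(*joint)), 2))
    print("pointwise correlation:", round(correlation(*zip(*pointwise)), 2))
```

The pointwise samples reproduce each variable's marginal mean and spread but lose the relationship between the two variables, which is the limitation the post attributes to pointwise post-processing.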
AI / MLopen article
Google AI Blog28/03/2024, 20:53
AutoBNN: Probabilistic time series forecasting with compositional bayesian neural networks
Posted by Urs Köster, Software Engineer, Google Research
Time series problems are ubiquitous, from forecasting weather and traffic patterns to understanding economic trends. Bayesian approaches start with an assumption about the data's patterns (prior probability), collect evidence (e.g., new time series data), and continuously update that assumption to form a posterior probability distribution. Traditional Bayesian approaches like Gaussian processes (GPs) and Structural Time Series are extensively used for modeling time series data, e.g., the commonly used Mauna Loa CO2 dataset. However, they often rely on domain experts to painstakingly select appropriate model components and may be computationally expensive. Alternatives such as neural networks lack interpretability, making it difficult to understand how they generate forecasts, and don't produce reliable confidence intervals.
To that end, we introduce AutoBNN, a new open-source package written in JAX. AutoBNN automates the discovery of interpretable time series forecasting models, provides high-quality uncertainty estimates, and scales effectively for use on large datasets. We describe how AutoBNN combines the interpretability of traditional probabilistic approaches with the scalability and flexibility of neural networks.
AutoBNN
AutoBNN builds on a line of research that over the past decade has yielded improved predictive accuracy by modeling time series using GPs with learned kernel structures. The kernel function of a GP encodes assumptions about the function being modeled, such as the presence of trends, periodicity or noise. With learned GP kernels, the kernel function is defined compositionally: it is either a base kernel (such as Linear, Quadratic, Periodic, Matérn or ExponentiatedQuadratic) or a composite that combines two or more kernel functions using operators such as Addition, Multiplication, or ChangePoint. This compositional kernel structure serves two related purposes. 
First, it is simple enough that a user who is an expert about their data, but not necessarily about GPs, can construct a reasonable prior for their time series. Second, techniques like Sequential Monte Carlo can be used for discrete searches over small structures and can output interpretable results.
AutoBNN replaces the GP with Bayesian neural networks (BNNs) while retaining the compositional kernel structure. A BNN is a neural network with a probability distribution over weights rather than a fixed set of weights. This induces a distribution over outputs, capturing uncertainty in the predictions. BNNs bring the following advantages over GPs: First, training large GPs is computationally expensive, and traditional training algorithms scale as the cube of the number of data points in the time series. In contrast, for a fixed width, training a BNN will often be approximately linear in the number of data points. Second, BNNs lend themselves better to GPU and TPU hardware acceleration than GP training operations. Third, compositional BNNs can be easily combined with traditional deep BNNs, which have the ability to do feature discovery. One could imagine "hybrid" architectures, in which users specify a top-level structure of Add(Linear, Periodic, Deep), and the deep BNN is left to learn the contributions from potentially high-dimensional covariate information.
How might one translate a GP with compositional kernels into a BNN then? A single layer neural network will typically converge to a GP as the number of neurons (or "width") goes to infinity. More recently, researchers have discovered a correspondence in the other direction: many popular GP kernels (such as Matern, ExponentiatedQuadratic, Polynomial or Periodic) can be obtained as infinite-width BNNs with appropriately chosen activation functions and weight distributions. Furthermore, these BNNs remain close to the corresponding GP even when the width is very much less than infinite. 
For example, the figures below show the difference in the covariance between pairs of observations, and regression results of the true GPs and their corresponding width-10 neural network versions.
Comparison of Gram matrices between true GP kernels (top row) and their width 10 neural network approximations (bottom row).
Comparison of regression results between true GP kernels (top row) and their width 10 neural network approximations (bottom row).
The translation is completed with BNN analogues of the Addition and Multiplication operators over GPs, and input warping to produce periodic kernels. BNN addition is straightforwardly given by adding the outputs of the component BNNs. BNN multiplication is achieved by multiplying the activations of the hidden layers of the BNNs and then applying a shared dense layer. We are therefore limited to only multiplying BNNs with the same hidden width.
Using AutoBNN
The AutoBNN package is available within TensorFlow Probability. It is implemented in JAX and uses the flax.linen neural network library. It implements all of the base kernels and operators discussed so far (Linear, Quadratic, Matern, ExponentiatedQuadratic, Periodic, Addition, Multiplication) plus one new kernel and three new operators: a OneLayer kernel, which is a single hidden layer ReLU BNN; a ChangePoint operator that allows smoothly switching between two kernels; a LearnableChangePoint operator, which is the same as ChangePoint except position and slope are given prior distributions and can be learnt from the data; and a WeightedSum operator. WeightedSum combines two or more BNNs with learnable mixing weights, where the learnable weights follow a Dirichlet prior. By default, a flat Dirichlet distribution with concentration 1.0 is used. WeightedSums allow a "soft" version of structure discovery, i.e., training a linear combination of many possible models at once. 
In contrast to structure discovery with discrete structures, such as in AutoGP, this allows us to use standard gradient methods to learn structures, rather than using expensive discrete optimization. Instead of evaluating potential combinatorial structures in series, WeightedSum allows us to evaluate them in parallel. To easily enable exploration, AutoBNN defines a number of model structures that contain either top-level or internal WeightedSums. The names of these models can be used as the first parameter in any of the estimator constructors, and include things like sum_of_stumps (the WeightedSum over all the base kernels) and sum_of_shallow (which adds all possible combinations of base kernels with all operators).
Illustration of the sum_of_stumps model. The bars in the top row show the amount by which each base kernel contributes, and the bottom row shows the function represented by the base kernel. The resulting weighted sum is shown on the right.
The figure below shows structure discovery on the N374 series from the M3 dataset. The six base structures were ExponentiatedQuadratic (which is the same as the Radial Basis Function kernel, or RBF for short), Matern, Linear, Quadratic, OneLayer and Periodic kernels. The figure shows the MAP estimates of their weights over an ensemble of 32 particles. All of the high likelihood particles gave a large weight to the Periodic component, low weights to Linear, Quadratic and OneLayer, and a large weight to either RBF or Matern.
Parallel coordinates plot of the MAP estimates of the base kernel weights over 32 particles. The sum_of_stumps model was trained on the N374 series from the M3 dataset (insert in blue). Darker lines correspond to particles with higher likelihoods.
By using WeightedSums as the inputs to other operators, it is possible to express rich combinatorial structures, while keeping models compact and the number of learnable weights small. 
As an example, we include the sum_of_products model (illustrated in the figure below) which first creates a pairwise product of two WeightedSums, and then a sum of the two products. By setting some of the weights to zero, we can create many different discrete structures. The total number of possible structures in this model is 2^16 (65,536), since there are 16 base kernels that can be turned on or off. All these structures are explored implicitly by training just this one model.
Illustration of the "sum_of_products" model. Each of the four WeightedSums has the same structure as the "sum_of_stumps" model.
Certain combinations of kernels (e.g., the product of Periodic and either the Matern or ExponentiatedQuadratic) lead to overfitting on many datasets. To prevent this, we have defined model classes like sum_of_safe_shallow that exclude such products when performing structure discovery with WeightedSums.
For training, AutoBNN provides AutoBnnMapEstimator and AutoBnnMCMCEstimator to perform MAP and MCMC inference, respectively. Either estimator can be combined with any of the six likelihood functions, including four based on normal distributions with different noise characteristics for continuous data and two based on the negative binomial distribution for count data.
Result from running AutoBNN on the Mauna Loa CO2 dataset in our example colab. The model captures the trend and seasonal component in the data. Extrapolating into the future, the mean prediction slightly underestimates the actual trend, while the 95% confidence interval gradually increases. 
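The idea of "soft" structure discovery with mixing weights can be sketched in a few lines of plain Python. This is not AutoBNN code: the component functions below are fixed toy functions standing in for component BNNs, the weights are a single draw from a flat Dirichlet prior rather than learned values, and the 0.05 threshold used to read off a discrete structure is an arbitrary assumption.

```python
import math
import random


def sample_flat_dirichlet(rng, k, concentration=1.0):
    """Draw k mixing weights from a symmetric Dirichlet distribution by
    normalizing independent Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def weighted_sum(components, weights, x):
    """Mix the component outputs at input x using the given weights."""
    return sum(w * f(x) for f, w in zip(components, weights))


def active_structure(names, weights, threshold=0.05):
    """Read off a discrete structure: components whose weight is non-negligible."""
    return [name for name, w in zip(names, weights) if w > threshold]


def count_on_off_structures(n_kernels):
    """Number of discrete structures when each kernel is either on or off."""
    return 2 ** n_kernels


if __name__ == "__main__":
    names = ["linear", "quadratic", "periodic"]
    components = [lambda x: x, lambda x: x * x, lambda x: math.sin(2 * math.pi * x)]
    weights = sample_flat_dirichlet(random.Random(0), len(components))
    print("weights:", [round(w, 3) for w in weights])
    print("mixture at x=0.25:", weighted_sum(components, weights, 0.25))
    print("active components:", active_structure(names, weights))
    print("structures with 16 on/off kernels:", count_on_off_structures(16))
```

Because the weights are continuous, a gradient-based optimizer can adjust them directly; a discrete structure only appears afterwards, when small weights are treated as zero.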
AutoBNN exposes a scikit-learn–inspired estimator interface:
    import jax
    import autobnn as ab

    # Compose a model as the sum of three base-kernel BNNs.
    model = ab.operators.Add(
        bnns=(ab.kernels.PeriodicBNN(width=50),
              ab.kernels.LinearBNN(width=50),
              ab.kernels.MaternBNN(width=50)))

    # MAP estimator with a named likelihood, a PRNG key, and the period(s).
    estimator = ab.estimators.AutoBnnMapEstimator(
        model, 'normal_likelihood_logistic_noise', jax.random.PRNGKey(42),
        periods=[12])

    # Fit on the training data, then predict quantiles.
    estimator.fit(my_training_data_xs, my_training_data_ys)
    low, mid, high = estimator.predict_quantiles(my_training_data_xs)
Conclusion
AutoBNN provides a powerful and flexible framework for building sophisticated time series prediction models. By combining the strengths of BNNs and GPs with compositional kernels, AutoBNN opens a world of possibilities for understanding and forecasting complex data. We invite the community to try the colab, and leverage this library to innovate and solve real-world challenges.
Acknowledgements
AutoBNN was written by Colin Carroll, Thomas Colthurst, Urs Köster and Srinivas Vasudevan. We would like to thank Kevin Murphy, Brian Patton and Feras Saad for their advice and feedback.
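The post describes BNN addition as summing the outputs of the component networks, and BNN multiplication as multiplying their hidden activations before a shared dense layer. The following is a minimal plain-Python sketch of those two operations, not the AutoBNN implementation: it uses scalar inputs, a tanh activation, and one fixed set of weights (a single draw, where a BNN would carry a distribution over them). The network layout `(in_weights, in_biases, out_weights)` and all function names are assumptions for illustration.

```python
import math


def hidden_layer(x, in_weights, in_biases):
    """tanh activations of a single hidden layer for a scalar input x."""
    return [math.tanh(w * x + b) for w, b in zip(in_weights, in_biases)]


def dense(activations, out_weights):
    """Linear readout: weighted sum of the activations."""
    return sum(a * w for a, w in zip(activations, out_weights))


def forward(x, net):
    """Output of one network given as (in_weights, in_biases, out_weights)."""
    in_weights, in_biases, out_weights = net
    return dense(hidden_layer(x, in_weights, in_biases), out_weights)


def bnn_add(x, net_a, net_b):
    """Addition: add the outputs of the two component networks."""
    return forward(x, net_a) + forward(x, net_b)


def bnn_multiply(x, net_a, net_b, shared_out_weights):
    """Multiplication: multiply hidden activations elementwise, then apply
    one shared dense layer. Requires equal hidden widths."""
    h_a = hidden_layer(x, net_a[0], net_a[1])
    h_b = hidden_layer(x, net_b[0], net_b[1])
    if len(h_a) != len(h_b):
        raise ValueError("multiplied networks must share the same hidden width")
    return dense([a * b for a, b in zip(h_a, h_b)], shared_out_weights)


if __name__ == "__main__":
    net_a = ([1.0, -1.0], [0.0, 0.5], [0.5, 0.25])
    net_b = ([2.0, 0.3], [0.1, -0.2], [1.0, -1.0])
    print("sum:", bnn_add(0.7, net_a, net_b))
    print("product:", bnn_multiply(0.7, net_a, net_b, [1.0, 2.0]))
```

The width check makes the stated restriction concrete: an elementwise product of hidden activations is only defined when both networks have the same number of hidden units.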
AI / MLopen article
Google AI Blog20/03/2024, 20:54
Computer-aided diagnosis for lung cancer screening
Posted by Atilla Kiraly, Software Engineer, and Rory Pilgrim, Product Manager, Google Research
Lung cancer is the leading cause of cancer-related deaths globally with 1.8 million deaths reported in 2020. Late diagnosis dramatically reduces the chances of survival. Lung cancer screening via computed tomography (CT), which provides a detailed 3D image of the lungs, has been shown to reduce mortality in high-risk populations by at least 20% by detecting potential signs of cancers earlier. In the US, screening involves annual scans, with some countries or cases recommending more or less frequent scans.
The United States Preventive Services Task Force recently expanded lung cancer screening recommendations by roughly 80%, which is expected to increase screening access for women and racial and ethnic minority groups. However, false positives (i.e., incorrectly reporting a potential cancer in a cancer-free patient) can cause anxiety and lead to unnecessary procedures for patients while increasing costs for the healthcare system. Moreover, efficiency in screening a large number of individuals can be challenging depending on healthcare infrastructure and radiologist availability.
At Google we have previously developed machine learning (ML) models for lung cancer detection, and have evaluated their ability to automatically detect and classify regions that show signs of potential cancer. Performance has been shown to be comparable to that of specialists in detecting possible cancer. While they have achieved high performance, effectively communicating findings in realistic environments is necessary to realize their full potential.
To that end, in “Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the US and Japan”, published in Radiology AI, we investigate how ML models can effectively communicate findings to radiologists. We also introduce a generalizable user-centric interface to help radiologists leverage such models for lung cancer screening. 
The system takes CT imaging as input and outputs a cancer suspicion rating using four categories (no suspicion, probably benign, suspicious, highly suspicious) along with the corresponding regions of interest. We evaluate the system’s utility in improving clinician performance through randomized reader studies in both the US and Japan, using the local cancer scoring systems (Lung-RADS V1.1 and Sendai Score) and image viewers that mimic realistic settings. We found that reader specificity increases with model assistance in both reader studies. To accelerate progress in conducting similar studies with ML models, we have open-sourced code to process CT images and generate images compatible with the picture archiving and communication system (PACS) used by radiologists.
Developing an interface to communicate model results
Scoring systems such as Lung-RADS use an alpha-numeric score to indicate the lung cancer risk and follow-up recommendations. When assessing patients, radiologists load the CT in their workstation to read the case, find lung nodules or lesions, and apply set guidelines to determine follow-up decisions.
Our first step was to improve the previously developed ML models through additional training data and architectural improvements, including self-attention. Then, instead of targeting specific guidelines, we experimented with a complementary way of communicating AI results independent of guidelines or their particular versions. Specifically, the system output offers a suspicion rating and localization (regions of interest) for the user to consider in conjunction with their own specific guidelines. The interface produces output images directly associated with the CT study, requiring no changes to the user’s workstation. The radiologist only needs to review a small set of additional images. There is no other change to their system or interaction with the system.
Example of the assistive lung cancer screening system outputs. 
Results for the radiologist’s evaluation are visualized on the location of the CT volume where the suspicious lesion is found. The overall suspicion is displayed at the top of the CT images. Circles highlight the suspicious lesions while squares show a rendering of the same lesion from a different perspective, called a sagittal view.
The system builds on prior work. The models coordinate with each other to first segment the lungs, obtain an overall assessment, locate three suspicious regions, then use the information to assign a suspicion rating to each region. The system was deployed on Google Cloud using Google Kubernetes Engine (GKE), which pulled the images, ran the ML models, and provided results. This allows scalability and directly connects to servers where the images are stored in DICOM stores.
Outline of the Google Cloud deployment of the assistive lung cancer screening system and the directional calling flow for the individual components that serve the images and compute results. Images are served to the viewer and to the system using Google Cloud services. The system is run on Google Kubernetes Engine, which pulls the images, processes them, and writes them back into the DICOM store.
Reader studies
Reader performance was measured using area under the ROC curve (AUC) values. These were compared with and without assistance.
A multi-case multi-reader study involves each case being reviewed by each reader twice, once with ML system assistance and once without. In this visualization one reader first reviews Set A without assistance (blue) and then with assistance (orange) after a wash-out period. A second reader group follows the opposite path by reading the same set of cases Set A with assistance first. Readers are randomized to these groups to remove the effect of ordering.
With model assistance, readers more often correctly identified cases without actionable lung cancer findings (i.e., specificity), improving by an absolute 5–7% compared to when they didn’t use the assistive system. 
This potentially means that for every 15–20 patients screened, one may be able to avoid unnecessary follow-up procedures, thus reducing their anxiety and the burden on the health care system. This can, in turn, help improve the sustainability of lung cancer screening programs, particularly as more people become eligible for screening. Reader specificity increases with ML model assistance in both the US-based and Japan-based reader studies. Specificity values were derived from reader scores from actionable findings (something suspicious was found) versus no actionable findings, compared against the true cancer outcome of the individual. Under model assistance, readers flagged fewer cancer-negative individuals for follow-up visits. Sensitivity for cancer positive individuals remained the same. Translating this into real-world impact through partnership DeepHealth, a leading AI-powered health informatics provider; and Apollo Radiology International a leading provider of Radiology services in India to explore paths for incorporating this system into future products. In addition, we are looking to help other researchers studying how best to integrate ML model results into clinical workflows by open sourcing code used for the reader study and incorporating the insights described in this blog. We hope that this will help accelerate medical imaging researchers looking to conduct reader studies for their AI models, and catalyze translational research in the field. Acknowledgements Key contributors to this project include Corbin Cunningham, Zaid Nabulsi, Ryan Najafi, Jie Yang, Charles Lau, Joseph R. Ledsam, Wenxing Ye, Diego Ardila, Scott M. McKinney, Rory Pilgrim, Hiroaki Saito, Yasuteru Shimamura, Mozziyar Etemadi, Yun Liu, David Melnick, Sunny Jansen, Nadia Harhen, David P. Nadich, Mikhail Fomitchev, Ziyad Helali, Shabir Adeel, Greg S. Corrado, Lily Peng, Daniel Tse, Shravya Shetty, Shruthi Prabhakara, Neeral Beladia, and Krish Eswaran. 
Thanks to Arnav Agharwal and Andrew Sellergren for their open sourcing support and Vivek Natarajan and Michael D. Howell for their feedback. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study, and Jonny Wong and Carli Sampson for coordinating the reader studies.
AI / MLopen article
Google AI Blog20/03/2024, 16:06
Using AI to expand global access to reliable flood forecasts
Posted by Yossi Matias, VP Engineering & Research, and Grey Nearing, Research Scientist, Google Research
Floods are the most common natural disaster, and are responsible for roughly $50 billion in annual financial damages worldwide. The rate of flood-related disasters has more than doubled since the year 2000, partly due to climate change. Nearly 1.5 billion people, making up 19% of the world’s population, are exposed to substantial risks from severe flood events. Upgrading early warning systems to make accurate and timely information accessible to these populations can save thousands of lives per year. Driven by the potential impact of reliable flood forecasting on people’s lives globally, we started our flood forecasting effort in 2017. Over this multi-year journey, we advanced research hand-in-hand with building a real-time operational flood forecasting system that provides alerts on Google Search, Maps, Android notifications and through the Flood Hub. However, in order to scale globally, especially in places where accurate local data is not available, more research advances were required. In “Global prediction of extreme floods in ungauged watersheds”, published in Nature, we demonstrate how machine learning (ML) technologies can significantly improve global-scale flood forecasting relative to the current state-of-the-art for countries where flood-related data is scarce. With these AI-based technologies we extended the reliability of currently-available global nowcasts, on average, from zero to five days, and improved forecasts across regions in Africa and Asia to be similar to what are currently available in Europe. The evaluation of the models was conducted in collaboration with the European Centre for Medium-Range Weather Forecasts (ECMWF). These technologies also enable Flood Hub to provide real-time river forecasts up to seven days in advance, covering river reaches across over 80 countries.
This information can be used by people, communities, governments and international organizations to take anticipatory action to help protect vulnerable populations. Flood forecasting at Google launched a pilot early warning system in the Ganges-Brahmaputra river basin in India, with the hypothesis that ML could help address the challenging problem of reliable flood forecasting at scale. The pilot was further expanded the following year via the combination of an inundation model, real-time water level measurements, the creation of an elevation map and hydrologic modeling. In collaboration with academics, and, in particular, with the JKU Institute for Machine Learning we explored ML-based hydrologic models, showing that LSTM-based models could produce more accurate simulations than traditional conceptual and physics-based hydrology models. This research led to flood forecasting improvements that enabled the expansion of our forecasting coverage to include all of India and Bangladesh. We also worked with researchers at Yale University to test technological interventions that increase the reach and impact of flood warnings. Our hydrological models predict river floods by processing publicly available weather data like precipitation and physical watershed information. Such models must be calibrated to long data records from streamflow gauging stations in individual rivers. A low percentage of global river watersheds (basins) have streamflow gauges, which are expensive but necessary to supply relevant data, and it’s challenging for hydrological simulation and forecasting to provide predictions in basins that lack this infrastructure. Lower gross domestic product (GDP) is correlated with increased vulnerability to flood risks, and there is an inverse correlation between national GDP and the amount of publicly available data in a country. 
ML helps to address this problem by allowing a single model to be trained on all available river data and to be applied to ungauged basins where no data are available. In this way, models can be trained globally, and can make predictions for any river location. There is an inverse (log-log) correlation between the amount of publicly available streamflow data in a country and national GDP. Streamflow data from the Global Runoff Data Center. estimate uncertainty in river forecasts and showed how ML river forecast models synthesize information from multiple data sources. They demonstrated that these models can simulate extreme events reliably, even when those events are not part of the training data. In an effort to contribute to open science, in 2023 we open-sourced a community-driven dataset for large-sample hydrology in Nature Scientific Data. The river forecast model LSTMs perform well on the task of river forecasting. A diagram of the LSTM, which is a neural network that operates sequentially in time. An accessible primer can be found here. mixture density networks to produce a probabilistic forecast (i.e., predicted parameters of a probability distribution over streamflow). Specifically, the model predicts the parameters of a mixture of heavy-tailed probability density functions, called asymmetric Laplacian distributions, at each forecast time step. The result is a mixture density function, called a Countable Mixture of Asymmetric Laplacians (CMAL) distribution, which represents a probabilistic prediction of the volumetric flow rate in a particular river at a particular time. LSTM-based river forecast model architecture. Two LSTMs are applied in sequence, one ingesting historical weather data and one ingesting forecasted weather data. The model outputs are the parameters of a probability distribution over streamflow at each forecasted timestep. 
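The mixture-of-asymmetric-Laplacians output described above can be illustrated with a small density function. This is a minimal sketch, not the forecast model's code: it assumes the quantile-style parametrization of the asymmetric Laplace distribution (location `mu`, scale `scale`, asymmetry `tau` in (0, 1)), and the function names and example parameter values are hypothetical. In the model itself, these parameters would be produced by the network at each forecast time step.

```python
import math


def asymmetric_laplace_pdf(y, mu, scale, tau):
    """Density of an asymmetric Laplace distribution.

    Assumed parametrization (not taken from the post): location mu,
    scale > 0, and asymmetry tau in (0, 1); tau = 0.5 is symmetric.
    """
    u = (y - mu) / scale
    # Pinball ("check") function: penalizes the two sides of mu differently.
    rho = u * (tau - (1.0 if u < 0 else 0.0))
    return tau * (1.0 - tau) / scale * math.exp(-rho)


def mixture_pdf(y, weights, mus, scales, taus):
    """Density of a finite mixture of asymmetric Laplace components.

    weights must be non-negative and sum to 1.
    """
    return sum(
        w * asymmetric_laplace_pdf(y, m, s, t)
        for w, m, s, t in zip(weights, mus, scales, taus)
    )


if __name__ == "__main__":
    # Hypothetical parameters for one forecast time step (streamflow units).
    weights = [0.4, 0.6]
    mus = [10.0, 40.0]
    scales = [2.0, 8.0]
    taus = [0.3, 0.6]
    for flow in (5.0, 10.0, 40.0, 80.0):
        print(flow, mixture_pdf(flow, weights, mus, scales, taus))
```

Because each component is a proper density and the weights sum to one, the mixture integrates to one, so exceedance probabilities for a given flow threshold can be read off by integrating the upper tail.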
Input and training data
Static watershed attributes representing geographical and geophysical variables: From the HydroATLAS project, including data like long-term climate indexes (precipitation, temperature, snow fractions), land cover, and anthropogenic attributes (e.g., a nighttime lights index as a proxy for human development).
Historical meteorological time-series data: Used to spin up the model for one year prior to the issue time of a forecast. The data comes from NASA IMERG, NOAA CPC Global Unified Gauge-Based Analysis of Daily Precipitation, and the ECMWF ERA5-land reanalysis. Variables include daily total precipitation, air temperature, solar and thermal radiation, snowfall, and surface pressure.
Forecasted meteorological time series over a seven-day forecast horizon: Used as input for the forecast LSTM. These data are the same meteorological variables listed above, and come from the ECMWF HRES atmospheric model.
Training data are daily streamflow values from the Global Runoff Data Center over the time period 1980–2023. A single streamflow forecast model is trained using data from 5,680 diverse watershed streamflow gauges (shown below) to improve accuracy. Location of 5,680 streamflow gauges that supply training data for the river forecast model from the Global Runoff Data Center. Improving on the current state-of-the-art GloFAS version 4, the current state-of-the-art global flood forecasting system. These experiments showed that ML can provide accurate warnings earlier and over larger and more impactful events. The figure below shows the distribution of F1 scores when predicting different severity events at river locations around the world, with a tolerance of plus or minus one day. F1 scores are the harmonic mean of precision and recall, and event severity is measured by return period. For example, a 2-year return period event is a volume of streamflow that is expected to be exceeded on average once every two years.
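To make the metric concrete, the sketch below computes an F1 score (the harmonic mean of precision and recall) for event forecasts, counting a predicted event as correct when it falls within one day of an observed event. The matching rule and the function name are assumptions for illustration; the post does not spell out the exact matching procedure used in the evaluation.

```python
def f1_with_tolerance(observed_days, predicted_days, tolerance=1):
    """F1 score for event timing with a +/- tolerance (in days).

    observed_days / predicted_days: day indices on which the streamflow
    exceeded a return-period threshold (observed vs. forecast).
    Assumption: a prediction is a hit if any observed event lies within
    `tolerance` days of it, and vice versa for recall.
    """
    observed = sorted(set(observed_days))
    predicted = sorted(set(predicted_days))

    # Precision: fraction of predicted events that match an observed event.
    matched_predictions = sum(
        any(abs(p - o) <= tolerance for o in observed) for p in predicted
    )
    precision = matched_predictions / len(predicted) if predicted else 0.0

    # Recall: fraction of observed events that were predicted.
    matched_observations = sum(
        any(abs(o - p) <= tolerance for p in predicted) for o in observed
    )
    recall = matched_observations / len(observed) if observed else 0.0

    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    observed = [10, 50, 90]     # hypothetical observed event days
    forecast = [11, 52, 200]    # hypothetical forecast event days
    print(f1_with_tolerance(observed, forecast))
```

The harmonic mean penalizes an imbalance between precision and recall more strongly than an arithmetic mean would, which matters for rare events such as floods.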
Our model achieves reliability scores at up to 4-day or 5-day lead times that are similar to or better, on average, than the reliability of GloFAS nowcasts (0-day lead time). Distributions of F1 scores over 2-year return period events in 2,092 watersheds globally during the time period 2014-2023 from GloFAS (blue) and our model (orange) at different lead times. On average, our model is statistically as accurate as GloFAS nowcasts (0–day lead time) up to 5 days in advance over 2-year (shown) and 1-year, 5-year, and 10-year events (not shown). paper for more information. Looking into the future Adaptation and Resilience efforts and reflects Google's commitment to address climate change while helping global communities become more resilient. We believe that AI and ML will continue to play a critical role in helping advance science and research towards climate action. We actively collaborate with several international aid organizations (e.g., the Centre for Humanitarian Data and the Red Cross) to provide actionable flood forecasts. Additionally, in an ongoing collaboration with the World Meteorological Organization (WMO) to support early warning systems for climate hazards, we are conducting a study to help understand how AI can help address real-world challenges faced by national flood forecasting agencies. While the work presented here demonstrates a significant step forward in flood forecasting, future work is needed to further expand flood forecasting coverage to more locations globally and other types of flood-related events and disasters, including flash floods and urban floods. We are looking forward to continuing collaborations with our partners in the academic and expert communities, local governments and the industry to reach these goals.
AI / MLopen article
Google AI Blog19/03/2024, 20:15
ScreenAI: A visual language model for UI and visually-situated language understanding
Posted by Srinivas Sunkara and Gilles Baechler, Software Engineers, Google Research Screen user interfaces (UIs) and infographics, such as charts, diagrams and tables, play important roles in human communication and human-machine interaction as they facilitate rich and interactive user experiences. UIs and infographics share similar design principles and visual language (e.g., icons and layouts), that offer an opportunity to build a single model that can understand, reason, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs present a unique modeling challenge. To that end, we introduce “ScreenAI: A Vision-Language Model for UI and Infographics Understanding”. ScreenAI improves upon the PaLI architecture with the flexible patching strategy from pix2struct. We train ScreenAI on a unique mixture of datasets and tasks, including a novel Screen Annotation task that requires the model to identify UI element information (i.e., type, location and description) on a screen. These text annotations provide large language models (LLMs) with screen descriptions, enabling them to automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. At only 5B parameters, ScreenAI achieves state-of-the-art results on UI- and infographic-based tasks (WebSRC and MoTIF), and best-in-class performance on Chart QA, DocVQA, and InfographicVQA compared to models of similar size. We are also releasing three new datasets: Screen Annotation to evaluate the layout understanding capability of the model, as well as ScreenQA Short and Complex ScreenQA for a more comprehensive evaluation of its QA capability. ScreenAI PaLI, composed of a multimodal encoder block and an autoregressive decoder. The PaLI encoder uses a vision transformer (ViT) that creates image embeddings and a multimodal encoder that takes the concatenation of the image and text embeddings as input. 
This flexible architecture allows ScreenAI to solve vision tasks that can be recast as text+image-to-text problems. On top of the PaLI architecture, we employ a flexible patching strategy introduced in pix2struct. Instead of using a fixed-grid pattern, the grid dimensions are selected such that they preserve the native aspect ratio of the input image. This enables ScreenAI to work well across images of various aspect ratios. The ScreenAI model is trained in two stages: a pre-training stage followed by a fine-tuning stage. First, self-supervised learning is applied to automatically generate data labels, which are then used to train ViT and the language model. ViT is frozen during the fine-tuning stage, where most data used is manually labeled by human raters. ScreenAI model architecture. Data generation publicly accessible web pages and following the programmatic exploration approach used for the RICO dataset for mobile apps. We then apply a layout annotator, based on the DETR model, that identifies and labels a wide range of UI elements (e.g., image, pictogram, button, text) and their spatial relationships. Pictograms undergo further analysis using an icon classifier capable of distinguishing 77 different icon types. This detailed classification is essential for interpreting the subtle information conveyed through icons. For icons that are not covered by the classifier, and for infographics and images, we use the PaLI image captioning model to generate descriptive captions that provide contextual information. We also apply an optical character recognition (OCR) engine to extract and annotate textual content on screen. We combine the OCR text with the previous annotations to create a detailed description of each screen. A mobile app screenshot with generated annotations that include UI elements and their descriptions, e.g., TEXT elements also contain the text content from OCR, IMAGE elements contain image captions, LIST_ITEMs contain all their child elements. 
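The aspect-ratio-preserving patching mentioned above can be sketched as a small grid-selection routine. This is an illustration of the general idea rather than ScreenAI's implementation: the function name, the 16-pixel patch size and the 1,024-patch budget are assumptions. The grid is chosen so that rows and columns scale with the image's height and width while the total number of patches stays within the budget.

```python
import math


def aspect_preserving_grid(image_height, image_width, patch_size=16, max_patches=1024):
    """Pick (rows, cols) of a patch grid that follows the image's aspect ratio.

    Assumed inputs: image size in pixels, a square patch size, and a
    maximum number of patches the encoder can accept.
    """
    # Scale factor such that (scaled_h / patch) * (scaled_w / patch) == max_patches.
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    return rows, cols


if __name__ == "__main__":
    # A tall phone screenshot and a wide desktop screenshot (hypothetical sizes).
    for height, width in ((2000, 900), (1080, 1920)):
        rows, cols = aspect_preserving_grid(height, width)
        print((height, width), "->", (rows, cols), "patches:", rows * cols)
```

The image would then be resized to `rows * patch_size` by `cols * patch_size` pixels before being cut into patches, so a tall screenshot gets more rows than columns instead of being squashed into a square grid.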
LLM-based data generation PaLM 2 to generate input-output pairs in a two-step process. First, screen annotations are generated using the technique outlined above, then we craft a prompt around this schema for the LLM to create synthetic data. This process requires prompt engineering and iterative refinement to find an effective prompt. We assess the generated data's quality through human validation against a quality threshold. You only speak JSON. Do not write text that isn’t JSON. You are given the following mobile screenshot, described in words. Can you generate 5 questions regarding the content of the screenshot as well as the corresponding short answers to them? The answer should be as short as possible, containing only the necessary information. Your answer should be structured as follows: questions: [ {{question: the question, answer: the answer }}, ... ] {THE SCREEN SCHEMA} A sample prompt for QA data generation. Question answering: The model is asked to answer questions regarding the content of the screenshots, e.g., “When does the restaurant open?” Screen navigation: The model is asked to convert a natural language utterance into an executable action on a screen, e.g., “Click the search button.” Screen summarization: The model is asked to summarize the screen content in one or two sentences. Block diagram of our workflow for generating data for QA, summarization and navigation tasks using existing ScreenAI models and LLMs. Each task uses a custom prompt to emphasize desired aspects, like questions related to counting, involving reasoning, etc. LLM-generated data. Examples for screen QA, navigation and summarization. For navigation, the action bounding box is displayed in red on the screenshot. Experiments and results ChartQA, DocVQA, Multi page DocVQA, InfographicVQA, OCR VQA, Web SRC and ScreenQA. For navigation, datasets used include Referring Expressions, MoTIF, Mug, and Android in the Wild. 
Finally, we use Screen2Words for screen summarization and Widget Captioning for describing specific UI elements. Along with the fine-tuning datasets, we evaluate the fine-tuned ScreenAI model using three novel benchmarks: Screen Annotation: Enables evaluation of the model’s layout annotation and spatial understanding capabilities. ScreenQA Short: A variation of ScreenQA whose ground-truth answers have been shortened to contain only the relevant information, which better aligns with other QA tasks. Complex ScreenQA: Complements ScreenQA Short with more difficult questions (counting, arithmetic, comparison, and non-answerable questions) and contains screens with various aspect ratios. The fine-tuned ScreenAI model achieves state-of-the-art results on various UI and infographic-based tasks (WebSRC and MoTIF) and best-in-class performance on Chart QA, DocVQA, and InfographicVQA compared to models of similar size. ScreenAI achieves competitive performance on Screen2Words and OCR-VQA. Additionally, we report results on the new benchmark datasets introduced to serve as a baseline for further research. Comparing model performance of ScreenAI with state-of-the-art (SOTA) models of similar size. Model performance increases with size, and the performance has not saturated even at the largest size of 5B params. Conclusion Acknowledgements This project is the result of joint work with Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Carbune, Jason Lin, Jindong Chen and Abhanshu Sharma. We thank Fangyu Liu, Xi Chen, Efi Kokiopoulou, Jesse Berent, Gabriel Barcik, Lukas Zilka, Oriana Riva, Gang Li, Yang Li, Radu Soricut, and Tania Bedrax-Weiss for their insightful feedback and discussions, along with Rahul Aralikatte, Hao Cheng and Daniel Kim for their support in data preparation. We also thank Jay Yagnik, Blaise Aguera y Arcas, Ewa Dominowska, David Petrou, and Matt Sharifi for their leadership, vision and support. 
We are very grateful to Tom Small for helping us create the animation in this post.
AI / MLopen article
Google AI Blog19/03/2024, 15:00
SCIN: A new resource for representative dermatology images
Posted by Pooja Rao, Research Scientist, Google Research Health datasets play a crucial role in research and medical education, but it can be challenging to create a dataset that represents the real world. For example, dermatology conditions are diverse in their appearance and severity and manifest differently across skin tones. Yet, existing dermatology image datasets often lack representation of everyday conditions (like rashes, allergies and infections) and skew towards lighter skin tones. Furthermore, race and ethnicity information is frequently missing, hindering our ability to assess disparities or create solutions. To address these limitations, we are releasing the Skin Condition Image Network (SCIN) dataset in collaboration with physicians at Stanford Medicine. We designed SCIN to reflect the broad range of concerns that people search for online, supplementing the types of conditions typically found in clinical datasets. It contains images across various skin tones and body parts, helping to ensure that future AI tools work effectively for all. We've made the SCIN dataset freely available as an open-access resource for researchers, educators, and developers, and have taken careful steps to protect contributor privacy. Example set of images and metadata from the SCIN dataset. Dataset composition tanning propensity (self-reported Fitzpatrick Skin Type, i.e., sFST), and to describe the texture, duration and symptoms related to their concern. One to three dermatologists labeled each contribution with up to five dermatology conditions, along with a confidence score for each label. The SCIN dataset contains these individual labels, as well as an aggregated and weighted differential diagnosis derived from them that could be useful for model testing or training. These labels were assigned retrospectively and are not equivalent to a clinical diagnosis, but they allow us to compare the distribution of dermatology conditions in the SCIN dataset with existing datasets. 
The SCIN dataset contains largely allergic, inflammatory and infectious conditions while datasets from clinical sources focus on benign and malignant neoplasms. Monk Skin Tone (eMST) for the images. This allowed comparison of the skin condition and skin type distributions to those in existing dermatology datasets. Although we did not selectively target any skin types or skin tones, the SCIN dataset has a balanced Fitzpatrick skin type distribution (with more of Types 3, 4, 5, and 6) compared to similar datasets from clinical sources. Self-reported and dermatologist-estimated Fitzpatrick Skin Type distribution in the SCIN dataset compared with existing un-enriched dermatology datasets (Fitzpatrick17k, PH², SKINL2, and PAD-UFES-20). Fitzpatrick Skin Type scale was originally developed as a photo-typing scale to measure the response of skin types to UV radiation, and it is widely used in dermatology research. The Monk Skin Tone scale is a newer 10-shade scale that measures skin tone rather than skin phototype, capturing more nuanced differences between the darker skin tones. While neither scale was intended for retrospective estimation using images, the inclusion of these labels is intended to enable future research into skin type and tone representation in dermatology. For example, the SCIN dataset provides an initial benchmark for the distribution of these skin types and tones in the US population. The SCIN dataset has a high representation of women and younger individuals, likely reflecting a combination of factors. These could include differences in skin condition incidence, propensity to seek health information online, and variations in willingness to contribute to research across demographics. Crowdsourcing method research paper co-authored with investigators at Stanford Medicine. This approach empowers individuals to play an active role in healthcare research. 
It allows us to reach people at earlier stages of their health concerns, potentially before they seek formal care. Crucially, this method uses advertisements on web search result pages — the starting point for many people’s health journey — to connect with participants. Our results demonstrate that crowdsourcing can yield a high-quality dataset with a low spam rate. Over 97.5% of contributions were genuine images of skin conditions. After performing further filtering steps to exclude images that were out of scope for the SCIN dataset and to remove duplicates, we were able to release nearly 90% of the contributions received over the 8-month study period. Most images were sharp and well-exposed. Approximately half of the contributions include self-reported demographics, and 80% contain self-reported information relating to the skin condition, such as texture, duration, or other symptoms. We found that dermatologists’ ability to retrospectively assign a differential diagnosis depended more on the availability of self-reported information than on image quality. Dermatologist confidence in their labels (scale from 1-5) depended on the availability of self-reported demographic and symptom information. Data Use License prohibits attempts to re-identify contributors. We hope the SCIN dataset will be a helpful resource for those working to advance inclusive dermatology research, education, and AI tool development. By demonstrating an alternative to traditional dataset creation methods, SCIN paves the way for more representative datasets in areas where self-reported data or retrospective labeling is feasible. Acknowledgements We are grateful to all our co-authors Abbi Ward, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley Carrick, Bilson Campana, Jay Hartford, Pradeep Kumar S, Tiya Tiyasirisokchai, Sunny Virmani, Renee Wong, Yossi Matias, Greg S. Corrado, Dale R. 
Webster, Dawn Siegel (Stanford Medicine), Steven Lin (Stanford Medicine), Justin Ko (Stanford Medicine), Alan Karthikesalingam and Christopher Semturs. We also thank Yetunde Ibitoye, Sami Lachgar, Lisa Lehmann, Javier Perez, Margaret Ann Smith (Stanford Medicine), Rachelle Sico, Amit Talreja, Annisah Um’rani and Wayne Westerlind for their essential contributions to this work. Finally, we are grateful to Heather Cole-Lewis, Naama Hammel, Ivor Horn, Michael Howell, Yun Liu, and Eric Teasley for their insightful comments on the study design and manuscript.
AI / MLopen article
Google AI Blog18/03/2024, 18:41
MELON: Reconstructing 3D objects from images with unknown poses
Posted by Mark Matthews, Senior Software Engineer, and Dmitry Lagun, Research Scientist, Google Research A person's prior experience and understanding of the world generally enables them to easily infer what an object looks like in whole, even if only looking at a few 2D pictures of it. Yet the capacity for a computer to reconstruct the shape of an object in 3D given only a few images has remained a difficult algorithmic problem for years. This fundamental computer vision task has applications ranging from the creation of e-commerce 3D models to autonomous vehicle navigation. A key part of the problem is how to determine the exact positions from which images were taken, known as pose inference. If camera poses are known, a range of successful techniques — such as neural radiance fields (NeRF) or 3D Gaussian Splatting — can reconstruct an object in 3D. But if these poses are not available, then we face a difficult “chicken and egg” problem where we could determine the poses if we knew the 3D object, but we can’t reconstruct the 3D object until we know the camera poses. The problem is made harder by pseudo-symmetries — i.e., many objects look similar when viewed from different angles. For example, square objects like a chair tend to look similar every 90° rotation. Pseudo-symmetries of an object can be revealed by rendering it on a turntable from various angles and plotting its photometric self-similarity map. Self-Similarity map of a toy truck model. Left: The model is rendered on a turntable from various azimuthal angles, θ. Right: The average L2 RGB similarity of a rendering from θ with that of θ*. The pseudo-similarities are indicated by the dashed red lines. ill-posed, with naïve approaches often converging to local minima. In practice, such an approach might mistake the back view as the front view of an object, because they share a similar silhouette. 
Previous techniques (such as BARF or SAMURAI) side-step this problem by relying on an initial pose estimate that starts close to the global minimum. But how can we approach this if such estimates aren’t available? Methods such as GNeRF and VMRF leverage generative adversarial networks (GANs) to overcome the problem. These techniques have the ability to artificially “amplify” a limited number of training views, aiding reconstruction. GAN techniques, however, often have complex, sometimes unstable, training processes, making robust and reliable convergence difficult to achieve in practice. A range of other successful methods, such as SparsePose or RUST, can infer poses from a limited number of views, but require pre-training on a large dataset of posed images, which isn’t always available, and can suffer from “domain-gap” issues when inferring poses for different types of images. In “MELON: NeRF with Unposed Images in SO(3)”, spotlighted at 3DV 2024, we present a technique that can determine object-centric camera poses entirely from scratch while reconstructing the object in 3D. MELON (Modulo Equivalent Latent Optimization of NeRF) is one of the first techniques that can do this without initial camera pose estimates, complex training schemes or pre-training on labeled data. MELON is a relatively simple technique that can easily be integrated into existing NeRF methods. We demonstrate that MELON can reconstruct a NeRF from unposed images with state-of-the-art accuracy while requiring as few as 4–6 images of an object. MELON convolutional neural network (CNN) encoder that regresses camera poses from training images. We pass a downscaled training image to a four-layer CNN that infers the camera pose. This CNN is initialized from noise and requires no pre-training. Its capacity is so small that it forces similar-looking images to map to similar poses, providing an implicit regularization that greatly aids convergence. 
The second technique is a modulo loss that simultaneously considers pseudo symmetries of an object. We render the object from a fixed set of viewpoints for each training image, backpropagating the loss only through the view that best fits the training image. This effectively considers the plausibility of multiple views for each image. In practice, we find N=2 views (viewing an object from the other side) is all that’s required in most cases, but sometimes get better results with N=4 for square objects. These two techniques are integrated into standard NeRF training, except that instead of fixed camera poses, poses are inferred by the CNN and duplicated by the modulo loss. Photometric gradients back-propagate through the best-fitting cameras into the CNN. We observe that cameras generally converge quickly to globally optimal poses (see animation below). After training of the neural field, MELON can synthesize novel views using standard NeRF rendering methods. We simplify the problem by using the NeRF-Synthetic dataset, a popular benchmark for NeRF research and common in the pose-inference literature. This synthetic dataset has cameras at precisely fixed distances and a consistent “up” orientation, requiring us to infer only the polar coordinates of the camera. This is the same as an object at the center of a globe with a camera always pointing at it, moving along the surface. We then only need the latitude and longitude (2 degrees of freedom) to specify the camera pose. MELON uses a dynamically trained lightweight CNN encoder that predicts a pose for each image. Predicted poses are replicated by the modulo loss, which only penalizes the smallest L2 distance from the ground truth color. At evaluation time, the neural field can be used to generate novel views. Results peak signal-to-noise ratio (PSNR) against held out test views. 
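The modulo loss described above (replicate the predicted pose into N candidate views, then penalize only the best-fitting one) can be sketched in a few lines. This is a toy illustration under stated assumptions, not MELON's implementation: the replicas are assumed to be evenly spaced in azimuth, `toy_render` is a hypothetical stand-in for NeRF rendering, and no gradients are computed. In actual training, the loss of the selected view would be back-propagated into both the neural field and the pose encoder.

```python
import math


def replicate_azimuths(azimuth_deg, n_views):
    """Replicate one predicted azimuth into n_views evenly spaced candidates."""
    step = 360.0 / n_views
    return [(azimuth_deg + k * step) % 360.0 for k in range(n_views)]


def l2_loss(rendered, target):
    """Mean squared difference between rendered and target pixel values."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)


def modulo_loss(render_fn, azimuth_deg, target_pixels, n_views=2):
    """Return (loss, azimuth) of the replica that best explains the target.

    Only this minimum would be used for the training update, so a pose that
    is wrong by a pseudo-symmetry (e.g. 180 degrees) is not penalized.
    """
    candidates = replicate_azimuths(azimuth_deg, n_views)
    losses = [l2_loss(render_fn(a), target_pixels) for a in candidates]
    best = min(range(n_views), key=losses.__getitem__)
    return losses[best], candidates[best]


def toy_render(azimuth_deg):
    """Hypothetical two-pixel 'renderer' used only to exercise the loss."""
    angle = math.radians(azimuth_deg)
    return [math.cos(angle), math.sin(angle)]


if __name__ == "__main__":
    target = toy_render(200.0)  # the image was really taken at 200 degrees
    loss, chosen = modulo_loss(toy_render, 20.0, target, n_views=2)
    print("selected azimuth:", chosen, "loss:", loss)
```

With `n_views=2` the candidates are the predicted view and the view from the opposite side, matching the N=2 case mentioned in the text; `n_views=4` would add the 90° and 270° replicas for square objects.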
We see that MELON quickly converges to the approximate poses of most cameras within the first 1,000 steps of training, and achieves a competitive PSNR of 27.5 dB after 50k steps. Convergence of MELON on a toy truck model during optimization. Left: Rendering of the NeRF. Right: Polar plot of predicted (blue x) and ground truth (red dot) cameras. Reconstruction quality comparison between ground-truth (GT) and MELON on NeRF-Synthetic scenes after 100k training steps. Noisy images: MELON also supports novel view synthesis from extremely noisy, unposed images. We add varying amounts, σ, of white Gaussian noise to the training images. For example, the object in σ=1.0 below is impossible to make out, yet MELON can determine the pose and generate novel views of the object. Novel view synthesis from noisy unposed 128×128 images. Top: Example of noise level present in training views. Bottom: Reconstructed model from noisy training views and mean angular pose error. Methods such as RawNeRF have demonstrated NeRF’s excellent de-noising capabilities with known camera poses. The fact that MELON works so robustly for noisy images with unknown camera poses was unexpected. Conclusion: See the paper and MELON site to learn more. Acknowledgements We would like to thank our paper co-authors Axel Levy, Matan Sela, and Gordon Wetzstein, as well as Florian Schroff and Hartwig Adam for continuous help in building this technology. We also thank Matthew Brown, Ricardo Martin-Brualla and Frederic Poitevin for their helpful feedback on the paper draft. We also acknowledge the use of the computational resources at the SLAC Shared Scientific Data Facility (SDF).
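The modulo loss described in the MELON post can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the N replicated poses are spaced evenly in azimuth (the post only says N=2 means viewing the object from the other side), and `toy_render` is a hypothetical stand-in for NeRF rendering. The `psnr` helper uses the standard definition for pixel values in [0, max_value].

```python
import math


def replicate_azimuth(azimuth_deg, n_views):
    """Return n_views candidate azimuths spaced evenly around the object (assumed scheme)."""
    step = 360.0 / n_views
    return [(azimuth_deg + k * step) % 360.0 for k in range(n_views)]


def mean_squared_error(rendered, target):
    """Mean squared difference between two equally sized pixel lists."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(target)


def modulo_loss(render_fn, elevation_deg, azimuth_deg, target, n_views=2):
    """Score every replicated pose and keep only the best-fitting one."""
    candidates = replicate_azimuth(azimuth_deg, n_views)
    errors = [mean_squared_error(render_fn(elevation_deg, az), target) for az in candidates]
    best = min(range(n_views), key=lambda k: errors[k])
    return errors[best], candidates[best]


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for pixel values in [0, max_value]."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def toy_render(elevation_deg, azimuth_deg):
    """Hypothetical renderer: three 'pixels' that depend only on the camera pose."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return [0.5 + 0.5 * math.cos(az), 0.5 + 0.5 * math.sin(az), 0.5 + 0.5 * math.sin(el)]


# The true azimuth is 200 degrees, but the predicted pose is off by half a turn.
target = toy_render(30.0, 200.0)
loss, chosen_azimuth = modulo_loss(toy_render, 30.0, 20.0, target, n_views=2)
print(chosen_azimuth, loss)
```

In this toy case the prediction is wrong by 180 degrees, and the replica on the other side of the object is the one that matches the target. In real training only that best-fitting view would receive gradients.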
AI / MLopen article
Google AI Blog15/03/2024, 18:22
HEAL: A framework for health equity assessment of machine learning performance
Posted by Mike Schaekermann, Research Scientist, Google Research, and Ivor Horn, Chief Health Equity Officer & Director, Google Core Health equity is a major societal concern worldwide with disparities having many causes. These causes include limitations in access to healthcare, differences in clinical treatment, and even fundamental differences in the diagnostic technology. In dermatology for example, skin cancer outcomes are worse for populations such as minorities, those with lower socioeconomic status, or individuals with limited healthcare access. While there is great promise in recent advances in machine learning (ML) and artificial intelligence (AI) to help improve healthcare, this transition from research to bedside must be accompanied by a careful understanding of whether and how these technologies impact health equity. Health equity is defined by public health organizations as fairness of opportunity for everyone to be as healthy as possible. Importantly, equity may be different from equality. For example, people with greater barriers to improving their health may require more or different effort to experience this fair opportunity. Similarly, equity is not fairness as defined in the AI for healthcare literature. Whereas AI fairness often strives for equal performance of the AI technology across different patient populations, this does not center the goal of prioritizing performance with respect to pre-existing health disparities. Health equity considerations. An intervention (e.g., an ML-based tool, indicated in dark blue) promotes health equity if it helps reduce existing disparities in health outcomes (indicated in lighter blue). In “Health Equity Assessment of machine Learning performance (HEAL): a framework and dermatology AI model case study”, published in The Lancet eClinicalMedicine, we propose a methodology to quantitatively assess whether ML-based health technologies perform equitably.
In other words, does the ML model perform well for those with the worst health outcomes for the condition(s) the model is meant to address? This goal anchors on the principle that health equity should prioritize and measure model performance with respect to disparate health outcomes, which may be due to a number of factors that include structural inequities (e.g., demographic, social, cultural, political, economic, environmental and geographic). The health equity framework (HEAL): Framework for Health Equity Assessment of machine Learning performance (HEAL). Our guiding principle is to avoid exacerbating health inequities, and these steps help us identify disparities and assess for inequitable model performance to move towards better outcomes for all. Case study on a dermatology model from prior work. This example dermatology model was trained to classify 288 skin conditions using a development dataset of 29k cases. The input to the model consists of three photos of a skin concern along with demographic information and a brief structured medical history. The output consists of a ranked list of possible matching skin conditions. Using the HEAL framework, we evaluated this model by assessing whether it prioritized performance with respect to pre-existing health outcomes. The model was designed to predict possible dermatologic conditions (from a list of hundreds) based on photos of a skin concern and patient metadata. Evaluation of the model is done using a top-3 agreement metric, which quantifies how often the top 3 output conditions match the most likely condition as suggested by a dermatologist panel. The HEAL metric is computed via the anticorrelation of this top-3 agreement with health outcome rankings. We used a dataset of 5,420 teledermatology cases, enriched for diversity in age, sex and race/ethnicity, to retrospectively evaluate the model’s HEAL metric.
The dataset consisted of “store-and-forward” cases from patients of 20 years or older from primary care providers in the USA and skin cancer clinics in Australia. Based on a review of the literature, we decided to explore race/ethnicity, sex and age as potential factors of inequity, and used sampling techniques to ensure that our evaluation dataset had sufficient representation of all race/ethnicity, sex and age groups. To quantify pre-existing health outcomes for each subgroup we relied on measurements from public databases endorsed by the World Health Organization, such as Years of Life Lost (YLLs) and Disability-Adjusted Life Years (DALYs; years of life lost plus years lived with disability). HEAL metric for all dermatologic conditions across race/ethnicity subpopulations, including health outcomes (YLLs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* Higher is better; measures the likelihood the model performs equitably with respect to the axes in this table.) HEAL metric for all dermatologic conditions across sexes, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.) HEAL metrics for all cancer and non-cancer dermatologic conditions across age groups, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.) Putting things in context Pareto condition (discussed further in the paper), which restricts model changes so that outcomes for each subpopulation are either unchanged or improved compared to the status quo, and performance does not worsen for any subpopulation. The HEAL framework, in its current form, assesses the likelihood that an ML-based model prioritizes performance for subpopulations with respect to pre-existing health disparities for specific subpopulations. 
This differs from the goal of understanding whether ML will reduce disparities in outcomes across subpopulations in reality. Specifically, modeling improvements in outcomes requires a causal understanding of steps in the care journey that happen both before and after use of any given model. Future research is needed to address this gap. Conclusion Acknowledgements The research described here is joint work across many teams at Google. We are grateful to all our co-authors: Terry Spitz, Malcolm Pyles, Heather Cole-Lewis, Ellery Wulczyn, Stephen R. Pfohl, Donald Martin, Jr., Ronnachai Jaroensri, Geoff Keeling, Yuan Liu, Stephanie Farquhar, Qinghan Xue, Jenna Lester, Cían Hughes, Patricia Strachan, Fraser Tan, Peggy Bui, Craig H. Mermel, Lily H. Peng, Yossi Matias, Greg S. Corrado, Dale R. Webster, Sunny Virmani, Christopher Semturs, Yun Liu, and Po-Hsuan Cameron Chen. We also thank Lauren Winer, Sami Lachgar, Ting-An Lin, Aaron Loh, Morgan Du, Jenny Rizk, Renee Wong, Ashley Carrick, Preeti Singh, Annisah Um'rani, Jessica Schrouff, Alexander Brown, and Anna Iurchenko for their support of this project.
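The anticorrelation step behind the HEAL metric can be illustrated with a short sketch. It covers only that step: the post describes the final metric as a likelihood, and how that likelihood is estimated is not covered here. The sketch assumes a Spearman-style rank correlation, that a higher burden (YLLs or DALYs per 100,000) means a worse health outcome, and that there are no ties. The numbers are made up for illustration and are not results from the paper.

```python
import math


def ranks(values, higher_is_better):
    """Assign rank 1 to the best value. Assumes no ties (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def anticorrelation(burden_per_100k, top3_agreement):
    """Positive when the model performs better for subgroups with worse outcomes."""
    outcome_rank = ranks(burden_per_100k, higher_is_better=False)  # low burden = best outcome
    performance_rank = ranks(top3_agreement, higher_is_better=True)
    return -pearson(outcome_rank, performance_rank)


# Illustrative values only: three hypothetical subgroups.
burden = [100.0, 250.0, 400.0]
equitable = anticorrelation(burden, [0.70, 0.75, 0.80])
inequitable = anticorrelation(burden, [0.80, 0.75, 0.70])
print(equitable, inequitable)
```

A value near 1 means the performance ranking mirrors the outcome ranking in reverse, so the subgroup with the worst outcomes sees the best model performance. A value near -1 means the opposite.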
AI / MLopen article
Google AI Blog14/03/2024, 19:38
Cappy: Outperforming and boosting large multi-task language models with a small scorer
Posted by Yun Zhu and Lijuan Liu, Software Engineers, Google Research Large language model (LLM) advancements have led to a new paradigm that unifies various natural language processing (NLP) tasks within an instruction-following framework. This paradigm is exemplified by recent multi-task LLMs, such as T0, FLAN, and OPT-IML. First, multi-task data is gathered with each task following a task-specific template, where each labeled example is converted into an instruction (e.g., "Put the concepts together to form a sentence: ski, mountain, skier") paired with a corresponding response (e.g., "Skier skis down the mountain"). These instruction-response pairs are used to train the LLM, resulting in a conditional generation model that takes an instruction as input and generates a response. Moreover, multi-task LLMs have exhibited remarkable task-wise generalization capabilities as they can address unseen tasks by understanding and solving brand-new instructions. The demonstration of the instruction-following pre-training of multi-task LLMs, e.g., FLAN. Pre-training on tasks under this paradigm improves the performance on unseen tasks. These multi-task LLMs are large, with billions of parameters (e.g., FLAN-11B, T0-11B and OPT-IML-175B). As a result, operating such sizable models poses significant challenges because they demand considerable computational power and impose substantial requirements on the memory capacities of GPUs and TPUs, making their training and inference expensive and inefficient. Extensive storage is required to maintain a unique LLM copy for each downstream task. Moreover, the most powerful multi-task LLMs (e.g., FLAN-PaLM-540B) are closed-source, which makes them impossible to adapt. However, in practical applications, harnessing a single multi-task LLM to manage all conceivable tasks in a zero-shot manner remains difficult, particularly when dealing with complex tasks, personalized tasks and those that cannot be succinctly defined using instructions.
On the other hand, the size of downstream training data is usually insufficient to train a model well without incorporating rich prior knowledge. Hence, it is long desired to adapt LLMs with downstream supervision while bypassing storage, memory, and access issues. Certain parameter-efficient tuning strategies, including prompt tuning and adapters, substantially diminish storage requirements, but they still perform back-propagation through LLM parameters during the tuning process, thereby keeping their memory demands high. Additionally, some in-context learning techniques circumvent parameter tuning by integrating a limited number of supervised examples into the instruction. However, these techniques are constrained by the model's maximum input length, which permits only a few samples to guide task resolution. In “Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer”, presented at NeurIPS 2023, we propose a novel approach that enhances the performance and efficiency of multi-task LLMs. We introduce a lightweight pre-trained scorer, Cappy, based on continual pre-training on top of RoBERTa with merely 360 million parameters. Cappy takes in an instruction and a candidate response as input, and produces a score between 0 and 1, indicating an estimated correctness of the response with respect to the instruction. Cappy functions either independently on classification tasks or serves as an auxiliary component for LLMs, boosting their performance. Moreover, Cappy efficiently enables downstream supervision without requiring any finetuning, which avoids the need for back-propagation through LLM parameters and reduces memory requirements. Finally, adaptation with Cappy doesn’t require access to LLM parameters as it is compatible with closed-source multi-task LLMs, such as those only accessible via WebAPIs. 
Cappy takes an instruction and response pair as input and outputs a score ranging from 0 to 1, indicating an estimation of the correctness of the response with respect to the instruction. Pre-training: We begin with the collection of datasets from PromptSource that were used to train T0. This collection encompasses a wide range of task types, such as question answering, sentiment analysis, and summarization. Each dataset is associated with one or more templates that convert each instance from the original datasets into an instruction paired with its ground truth response. Cappy's regression modeling requires each pre-training data instance to include an instruction-response pair along with a correctness annotation for the response, so we produce a dataset with correctness annotations that range from 0 to 1. For every instance within a generation task, we leverage an existing multi-task LLM to generate multiple responses by sampling, conditioned on the given instruction. Subsequently, we assign an annotation to the pair formed by the instruction and every response, using the similarity between the response and the ground truth response of the instance. Specifically, we employ Rouge-L, a commonly used metric for measuring overall multi-task performance that has demonstrated a strong alignment with human evaluation, to calculate this similarity as a form of weak supervision. As a result, we obtain an effective regression dataset of 160 million instances paired with correctness score annotations. The final Cappy model is the result of continual pre-training using the regression dataset on top of the RoBERTa model. The pre-training of Cappy is conducted on Google's TPU-v4, with RedCoast, a lightweight toolkit for automating distributed training. Data augmentation with a multi-task LLM to construct a weakly supervised regression dataset for Cappy’s pre-training and fine-tuning.
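The weak-supervision step above, which labels each sampled response by its Rouge-L similarity to the ground truth, can be sketched as follows. This is a simplified illustration, not the pipeline used for Cappy: it assumes lowercase whitespace tokenization and the F1 form of Rouge-L, and the sampled responses are invented for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(candidate, reference):
    """Rouge-L F1 between two strings, in [0, 1]."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def build_regression_examples(instruction, ground_truth, sampled_responses):
    """Pair the instruction with each response and a 0-to-1 correctness label."""
    return [(instruction, r, rouge_l_f1(r, ground_truth)) for r in sampled_responses]


instruction = "Put the concepts together to form a sentence: ski, mountain, skier"
ground_truth = "Skier skis down the mountain"
# Invented samples standing in for responses generated by a multi-task LLM.
samples = ["Skier skis down the mountain", "A skier skis down the mountain", "The dog barks"]
examples = build_regression_examples(instruction, ground_truth, samples)
for _, response, score in examples:
    print(f"{score:.3f}  {response}")
```

Each tuple is one regression instance: an instruction, a candidate response, and a score that a scorer like Cappy could be trained to predict.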
Applying Cappy. Adapting multi-task LLMs with Cappy. Downstream adaptation comparison between Cappy and approaches that rely on an LLM’s parameters, such as fine-tuning and prompt tuning. Cappy’s application enhances multi-task LLMs. Results: We first evaluate Cappy on test tasks from PromptSource. We demonstrate that Cappy, with 360M parameters, outperforms OPT-175B and OPT-IML-30B, and matches the accuracy of the best existing multi-task LLMs (T0-11B and OPT-IML-175B). These findings highlight Cappy’s capabilities and parameter efficiency, which can be credited to its scoring-based pre-training strategy that integrates contrastive information by differentiating between high-quality and low-quality responses. In contrast, previous multi-task LLMs depend exclusively on teacher-forcing training that utilizes only the ground truth responses. The overall accuracy averaged over eleven test tasks from PromptSource. “RM” refers to a pre-trained RLHF reward model. Cappy matches the best ones among existing multi-task LLMs. We also examine Cappy on complex tasks from BIG-Bench, a set of manually curated tasks that are considered beyond the capability of many LLMs. We focus on all the 45 generation BIG-Bench tasks, specifically those that do not offer pre-established answer choices. We evaluate the performance using the Rouge-L score (representing the overall similarity between model generations and corresponding ground truths) on every test set, reporting the average score across 45 tests. In this experiment, all variants of FLAN-T5 serve as the backbone LLMs, and the foundational FLAN-T5 models are frozen. These results, shown below, suggest that Cappy enhances the performance of FLAN-T5 models by a large margin, consistently outperforming the most effective baseline achieved through sample selection using self-scoring of the LLM itself. The averaged Rouge-L score over 45 complex tasks within BIG-Bench. The x-axis refers to FLAN-T5 models of different sizes. Every dashed line represents an approach working on FLAN-T5s.
Self-scoring refers to using the cross-entropy of LLM to select responses. Cappy enhances the performance of FLAN-T5 models by a large margin. Conclusion Acknowledgments Thanks to Bowen Tan, Jindong Chen, Lei Meng, Abhanshu Sharma and Ewa Dominowska for their valuable feedback. We would also like to thank Eric Xing and Zhiting Hu for their suggestions.
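One way to read Cappy's role as an auxiliary component is as sample selection: a frozen LLM proposes several candidate responses and the scorer picks the one it rates highest. The sketch below assumes that reading. `toy_scorer` is a hypothetical stand-in that only counts concept words; it is not Cappy, and the candidate list is invented.

```python
def select_best(instruction, candidates, scorer):
    """Return the candidate that the scorer rates highest for this instruction."""
    return max(candidates, key=lambda response: scorer(instruction, response))


def toy_scorer(instruction, response):
    """Hypothetical scorer in [0, 1]: fraction of listed concepts found in the response."""
    concepts = [c.strip().lower() for c in instruction.split(":", 1)[1].split(",")]
    text = response.lower()
    return sum(1 for c in concepts if c in text) / len(concepts)


instruction = "Put the concepts together to form a sentence: ski, mountain, skier"
# Invented candidates standing in for samples from a frozen multi-task LLM.
candidates = ["The mountain is tall", "Skier skis down the mountain"]
best = select_best(instruction, candidates, toy_scorer)
print(best)
```

Because the selection needs only the generated text and a score, it requires no access to the LLM's parameters, which is consistent with the post's point about closed-source models reachable only through an API.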
AI / MLopen article
Google AI Blog12/03/2024, 21:15
Talk like a graph: Encoding graphs for large language models
Posted by Bahare Fatemi and Bryan Perozzi, Research Scientists, Google Research Imagine all the things around you — your friends, tools in your kitchen, or even the parts of your bike. They are all connected in different ways. In computer science, the term graph is used to describe connections between objects. Graphs consist of nodes (the objects themselves) and edges (connections between two nodes, indicating a relationship between them). Graphs are everywhere now. The internet itself is a giant graph of websites linked together. Even the knowledge search engines use is organized in a graph-like way. Furthermore, consider the remarkable advancements in artificial intelligence — such as chatbots that can write stories in seconds, and even software that can interpret medical reports. This exciting progress is largely thanks to large language models (LLMs). New LLM technology is constantly being developed for different uses. Since graphs are everywhere and LLM technology is on the rise, in “Talk like a Graph: Encoding Graphs for Large Language Models”, presented at ICLR 2024, we present a way to teach powerful LLMs how to better reason with graph information. Graphs are a useful way to organize information, but LLMs are mostly trained on regular text. The objective is to test different techniques to see what works best and gain practical insights. Translating graphs into text that LLMs can understand is a remarkably complex task. The difficulty stems from the inherent complexity of graph structures with multiple nodes and the intricate web of edges that connect them. Our work studies how to take a graph and translate it into a format that an LLM can understand. We also design a benchmark called GraphQA to study different approaches on different graph reasoning problems and show how to phrase a graph-related problem in a way that enables the LLM to solve the graph problem. 
We show that LLM performance on graph reasoning tasks varies on three fundamental levels: 1) the graph encoding method, 2) the nature of the graph task itself, and 3) interestingly, the very structure of the graph considered. These findings give us clues on how to best represent graphs for LLMs. Picking the right method can make the LLM up to 60% better at graph tasks! Pictured, the process of encoding a graph as text using two different approaches and feeding the text and a question about the graph to the LLM. Graphs as text: To study this systematically, we designed a benchmark called GraphQA. Think of GraphQA as an exam designed to evaluate powerful LLMs on graph-specific problems. We want to see how well LLMs can understand and solve problems that involve graphs in different setups. To create a comprehensive and realistic exam for LLMs, we don’t just use one type of graph; we use a mix of graphs, ensuring breadth in the number of connections. This is mainly because different graph types make solving such problems easier or harder. This way, GraphQA can help expose biases in how an LLM thinks about the graphs, and the whole exam gets closer to a realistic setup that LLMs might encounter in the real world. Overview of our framework for reasoning with graphs using LLMs. GraphQA graphs come from a variety of generators, including Erdős–Rényi, scale-free networks, the Barabási–Albert model, and the stochastic block model, as well as simpler graph structures like paths, complete graphs, and star graphs, providing a diverse set of data for training. When working with graphs, we also need to find ways to ask graph-related questions that LLMs can understand. Prompting heuristics are different strategies for doing this. Let's break down the common ones: Zero-shot: simply describe the task ("Is there a cycle in this graph?") and tell the LLM to go for it. No examples provided. Few-shot: This is like giving the LLM a mini practice test before the real deal. We provide a few example graph questions and their correct answers.
Chain-of-Thought: Here, we show the LLM how to break down a problem step-by-step with examples. The goal is to teach it to generate its own "thought process" when faced with new graphs. Zero-CoT: Similar to CoT, but instead of training examples, we give the LLM a simple prompt, like "Let's think step-by-step," to trigger its own problem-solving breakdown. BAG (build a graph): This is specifically for graph tasks. We add the phrase "Let's build a graph..." to the description, helping the LLM focus on the graph structure. We explored different ways to translate graphs into text that LLMs can work with. Our key questions were: Node encoding: How do we represent individual nodes? Options tested include simple integers, common names (people, characters), and letters. Edge encoding: How do we describe the relationships between nodes? Methods involved parenthesis notation, phrases like "are friends", and symbolic representations like arrows. Various node and edge encodings were combined systematically. This led to functions like the ones in the following figure: Examples of graph encoding functions used to encode graphs via text. Analysis and results How LLMs handle graph tasks LLMs struggle: On most of these basic tasks, LLMs did not do much better than a random guess. Encoding matters significantly: How we represent the graph as text has a great effect on LLM performance. The "incident" encoding excelled for most of the tasks in general. Our results are summarized in the following chart. Comparison of various graph encoder functions based on their accuracy on different graph tasks. The main conclusion from this figure is that the graph encoding functions matter significantly. Bigger is (usually) better PaLM 2. Here is a summary of our findings: In general, bigger models did better on graph reasoning tasks. It seems like the extra parameters gave them space to learn more complex patterns. 
Oddly, size didn't matter as much for the “edge existence” task (finding out if two nodes in a graph are connected). Even the biggest LLM couldn't consistently beat a simple baseline solution on the cycle check problem (finding out if a graph contains a cycle or not). This shows LLMs still have room to improve with certain graph tasks. Effect of model capacity on graph reasoning task for PaLM 2-XXS, XS, S, and L. Do different graph shapes confuse LLMs? Samples of graphs generated with different graph generators from GraphQA. ER, BA, SBM, and SFN refer to Erdős–Rényi, Barabási–Albert, Stochastic Block Model, and Scale-Free Network respectively. Comparing different graph generators on different graph tasks. The main observation here is that graph structure has a significant impact on the LLM’s performance. ER, BA, SBM, and SFN refer to Erdős–Rényi, Barabási–Albert, Stochastic Block Model, and Scale-Free Network respectively. Conclusion How to translate the graph to text: how we represent the graph as text significantly influences LLM performance. The incident encoding excelled for most of the tasks in general. Task type: Certain types of graph questions tend to be harder for LLMs, even with a good translation from graph to text. Graph structure: Surprisingly, the "shape" of the graph on which we do inference (dense with connections, sparse, etc.) influences how well an LLM does. This study revealed key insights about how to prepare graphs for LLMs. The right encoding techniques can significantly boost an LLM's accuracy on graph problems (ranging from around 5% to over 60% improvement). Our new benchmark, GraphQA, will help drive further research in this area. Acknowledgements We would like to express our gratitude to our co-author, Jonathan Halcrow, for his valuable contributions to this work.
We express our sincere gratitude to Anton Tsitsulin, Dustin Zelle, Silvio Lattanzi, Vahab Mirrokni, and the entire graph mining team at Google Research, for their insightful comments, thorough proofreading, and constructive feedback which greatly enhanced the quality of our work. We would also like to extend special thanks to Tom Small for creating the animation used in this post.
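The node and edge encodings discussed in the graph post can be sketched as small text-generating functions. The function names, the sentence wording and the example names are assumptions made for illustration; the exact phrasing used by the GraphQA encoders is not given in the post. The last helper builds a zero-shot prompt by appending the question to the encoded graph.

```python
def encode_parenthesis(num_nodes, edges):
    """Integer nodes with parenthesis notation for edges."""
    nodes = ", ".join(str(n) for n in range(num_nodes))
    pairs = ", ".join(f"({u}, {v})" for u, v in edges)
    return f"The nodes are {nodes}. The edges are {pairs}."


def encode_friendship(names, edges):
    """Named nodes with edges phrased as friendships."""
    return " ".join(f"{names[u]} and {names[v]} are friends." for u, v in edges)


def encode_incident(num_nodes, edges):
    """Incident-style encoding: list each node together with its neighbors."""
    neighbors = {n: [] for n in range(num_nodes)}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    sentences = []
    for node in range(num_nodes):
        if neighbors[node]:
            listed = ", ".join(str(n) for n in sorted(neighbors[node]))
            sentences.append(f"Node {node} is connected to nodes {listed}.")
    return " ".join(sentences)


def zero_shot_prompt(graph_text, question):
    """Zero-shot prompting: describe the graph, ask the question, give no examples."""
    return f"{graph_text}\nQ: {question}\nA:"


edges = [(0, 1), (1, 2)]
incident_text = encode_incident(3, edges)
friendship_text = encode_friendship(["Alice", "Bob", "Carol"], edges)
print(zero_shot_prompt(incident_text, "Is there a cycle in this graph?"))
```

Keeping the encoders as separate functions makes it straightforward to run the same question over every encoding and compare accuracy, which is the kind of comparison the post reports.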
AI / MLopen article

The accessibility paradox

Your people get AI. Get out of their way.

Battling AI fatigue as a designer and developer: a practical guide

The UX of Contrast: Lessons from Indika’s Game Design

Prompt Engineering Is Solved—Prompt Management Isn’t

Why Your Best Predictive Model Gives the Wrong Treatment Effect

Los Movimientos, Part II: Solving Large Pickup-and-Delivery Problems with Adaptive Large Neighborhood Search

The Bull And Bear Case For Digital Design In The Age Of AI

Studio Freight: Moving Missions Forward

Ollama vs. LM Studio vs. llama.cpp: Which Local AI Runtime Should You Use in 2026?

Avoiding Entity Key Drift in a Data Lake: Step 1, Normalization

The MOS 6502: the people’s princess

How Much Does a Local LLM Actually Cost to Run? I Measured Every Watt on Apple Silicon

The OlmoEarth Platform: Geospatial inference at planetary scale

LFM2.5-Encoders for Fast Long-Context Inference on CPU

MCP Explained: How Modern AI Agents Connect to the Real World

Don’t Just “Throw Adam at It”: Misunderstanding Adam Will Cost You

The Orchestrator's Tax

Why I’m Writing Rachel’s Ramblings

Backpropagation Explained for Beginners (Part 2): There Has to Be a Better Way

Thinking Outside The Box: Digital Design In The AI Era

From frictionless to meaningful

“Los Movimientos”: The Routing Problem That Nearly Broke My Spirit

Reducing Human Annotation with ML Active Learning

Between Print and Digital: The Making of MERSI’s Website

Tailwind CSS vs. StyleX: A real migration with 20 components

5 Architectural Patterns for Persistent Memory and State in AI Agents

The consciousness mirage in AI design

AI can fake your portfolio, Lower latency, 5-to-9, Monochrome dataviz

NVIDIA Cosmos-H-Dreams: Bringing Real-Time Generative Simulation to Surgical Robotics

Anatomy of a Frontier Lab Agent Intrusion: A Technical Timeline of the July 2026 Incident

Information architecture is the foundation AI is starving for

Your website is boring (but that might just set it free)

UX-Context Design: Using UX Knowledge to Inform AI-Generated Design

A Concrete Definition of “Product Sense” (and How to Build It)

The Art of Continuous Transformation: How Garden Eight Blends Integrity with Play

Stateful vs. Stateless Agent Design: Tradeoffs for Scalable Agentic Systems

JWT authentication: Best practices and when to use it

Building Cerebrium: Making Serverless Infrastructure Tangible

How to replace screen recordings with Remotion

An Introduction to Loop Engineering

Bringing Nunchaku 4-bit Diffusion Inference to Diffusers

Building Ridgeline: Engineering a Real-Time 3D Experience in Webflow

How to use Chrome’s Modern Web Guidance to prevent AI agents from writing legacy frontend code

How to choose and adapt product management frameworks

Essential GUI design principles for creating intuitive interfaces

Fragments: July 21

How to clean up AI-generated code with Fallow

Magnetic Commerce: Building the Dash Creative Website

The Current State of Agentic AI

Weaponizing And Defending The React Flight Protocol: Deserialization Sinks In RSCs

Why I joined the Daily UI challenge after 5 years in UX, and what it taught me

Analyzing the evidence that helps businesses win “product not received” disputes

Grabette: an open system to record robot-manipulation data

Building Agentic Workflows in Python with LangGraph

The Craft Behind Memorable Digital Experiences: Inside Unseen Studio

In-House LLM Serving at Netflix

Does Your Form Really Need a Dropdown List?

Don’t Outsource the Learning: Why Human-Led Research Still Matters in the Age of AI

ZERO: The Engineering Behind a Defiant Interactive Narrative

Agentic AI Security: Defending Against Prompt Injection and Tool Misuse

When It Makes Sense To “Block” The Main Thread

3 examples of great login screen designs

Meet the Speakers of the First Three.js Conference

The Archaeologist’s Copilot

How to secure full-stack projects from NPM attacks

Run a Local AI Model with Ollama in 15 Minutes

Newer Models, Same Advantage

Security incident disclosure — July 2026

Model Routing Is Simple. Until It Isn’t.

The Architecture Behind Trionn: Coordinating GSAP, Three.js, Lenis, and Web Audio

Scikit-Ollama for Scikit-LLM/Ollama Integration

No, People Don’t Want More AI In Their Life

UX Conference October Announced (Oct 5 - Oct 16)

Welcome Inkling by Thinking Machines

DSLs Enable Reliable Use of LLMs

LLM Evaluation Frameworks Compared: How to Actually Measure What Your Model Does

Building Service Topology at Scale: Architecture, Challenges, and Lessons Learned

Fragments: July 13

The 5 Qualities of Site-Specific AI Chatbots