Real-time streaming for the agentic era with NVIDIA
NVIDIA Vera is launching with Redpanda as part of the ecosystem, aiming to deliver 5.5x lower latencies for agents running in mission-critical environments.
619 articles curated by AI from 15+ sources. Updated every 6 hours.
Streambed is a tool that streams Postgres data to Iceberg on S3 and supports the Postgres wire protocol.
The author presents a demo of rendering 1 billion rows (14 billion cells) in the browser using DuckDB and Glide Data Grid.
This article discusses the limitations of embeddings in RAG systems, noting that while they handle synonyms and paraphrasing well, they can fail on negations, exact identifiers, and company-specific acronyms. It suggests alternative approaches for when these failures occur.
The article presents a cost control layer for RAG systems that combines semantic caching, query routing, token budgeting, and circuit breaking. The approach reportedly achieves an 85% reduction in LLM costs without significantly impacting answer quality.
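Of the four techniques named in this summary, circuit breaking is the easiest to show in isolation. The following is a minimal sketch of the generic pattern, not the article's implementation: the class name, the thresholds, and the `fallback` callable (for example, a cached answer) are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker; names and defaults are hypothetical."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable so the breaker is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: skip the paid call entirely
            self.opened_at = None         # cool-down elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # (re)open the circuit
            return fallback()
        self.failures = 0                 # any success fully closes the circuit
        return result
```

While the circuit is open, no call reaches the model, so a failing upstream stops costing money until the cool-down expires.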
A 10 TB events table partitioned by month was the right call two years ago. Now your data volume has grown tenfold, your team runs daily SLA dashboards, and every query that touches "last 7 days" is scanning an entire month's worth of files. In a traditional Hive-style warehouse, fixing this means a
This blog post demonstrates the new features available in DuckDB v1.5.3 for the DuckDB-Iceberg extension, even while the team focuses on DuckLake v1.0 and Quack.
The article discusses the infrastructure required to build a fast and reliable scientific agent using local open-weight models, vLLM, and long-context infrastructure.
Here’s how we built Town Lake, Cloudflare's unified analytics platform, alongside Skipper, an internal AI agent running on top of it.
The traditional data lakehouse was designed for human analysts. Every architectural decision, from how performance is tuned to how business context is stored, assumed that a person would be sitting at the end of the pipeline, writing queries, interpreting results, and carrying those results into dec
Catalog governance is the biggest bottleneck in building a multi-engine lakehouse. When you query the same Apache Iceberg tables with Spark, Flink, and Dremio, synchronizing permissions and access credentials across different engines is traditionally a manual, error-prone chore. Apache Polaris solve
In 2025, solo founders in the top decile generated 61 times the revenue of the median solo founder in their first six months. We analyzed the data to understand what drives that gap.
Choosing the right concept is only half the job. Plenty of teams have adopted the lakehouse model, picked open formats, and still built systems that fail when AI agents start querying them at scale. The Agentic Lakehouse architecture solves a specific problem: how do you structure a data platform so
The post explains how to set an appropriate value for max.poll.records when using Kafka share groups, revealing the relationship between group.share.partition.max.record.locks and the number of consumers per partition.
Redpanda SQL is generally available, allowing direct SQL queries against live Kafka topics and Iceberg history without requiring a separate ETL pipeline.
A fintech company automated its compliance reporting using agentic AI to extract data from multiple sources and synthesize reports, reducing weekly reporting time from 2 days to 2 hours.
WarpStream details their approach to safe shipping across 24 control plane regions, discussing deploy trains, shadow services, zonal Kubernetes clusters, and async reconciliation loops.
A mid-tier SaaS provider automates cloud support triage using a 5-agent workflow, improving ticket validation, routing, and SLA compliance.
We’re introducing SilverTorch, a reimagining of recommendation systems that unifies all retrieval components for user generated content under a unified architecture. SilverTorch shows up to 23.7x higher throughput compared to the state-of-the-art approaches. It’s also showing 20.9x more compute cos
This is Part 3 of a 15-part Apache Iceberg Masterclass. Part 2 covered the metadata structures of all five table formats. This article focuses on exactly how query engines use Iceberg's metadata to avoid reading data they don't need. The single biggest performance advantage of Iceberg over raw data
The post examines the impact of tuning max.poll.records in Kafka share groups, using Kafka 4.2.0 and Dimster for testing.
The article contrasts Retrieval-Augmented Generation (RAG) and agents as solutions for accessing company data with LLMs, highlighting their distinct problem-solving approaches.
Apache Iceberg is not a static format. The spec version number stamped into every table's metadata controls which features that table can use, which engines can read it, and how efficiently row-level changes are handled. The jump from Apache Iceberg V2 to V3 introduces deletion.
Models can lose accuracy after retraining, and reproducing the exact training dataset from months ago can be difficult due to data lake changes. Apache Iceberg solves this by providing data versioning capabilities, allowing you to track and reproduce specific datasets used for training.
The post presents a benchmark comparing the overhead of Kafka Consumer Groups and Share Groups using Dimster, a performance benchmarking tool for Kafka.
DuckDB users can now query Lance datasets using SQL through the CLI or SDKs, enabling AI and retrieval workload capabilities; this post highlights Lance as a good option for vector storage and querying.
Use Weaviate's built-in MCP server to give Claude Code, Cursor, and VS Code hybrid search over your codebase and docs. No glue code.
DuckDB v1.5.3, while a patch release, includes several important new features; the complete release notes are available on GitHub, with installation instructions provided.
The post details how Redpanda Cloud Topics manages the lifecycle of temporary L0 objects and safely deletes them without data loss or excessive storage costs.
Docusign reports reducing dbt unit test authoring time from 5 hours to 30 minutes using a structured AI-assisted framework.
How ChatFeatured cut analytics query times from 2.5 minutes to under a second by migrating from PlanetScale Postgres to Postgres managed by ClickHouse — in just 30 minutes.
The paper proposes a novel ontology for relational database data lineage to address the challenges of modeling lineage, especially with incomplete or missing dependencies between database objects.
The paper explores the potential of language models and graph neural networks to serve as foundation models for relational databases, aiming to enhance deep learning applications on relational data.
The paper presents a gradient-based approach for join ordering, which is a computationally complex problem that critically impacts query execution performance in databases.
Digital sovereignty in streaming demands architectural guarantees, not policy promises. The post discusses how BYOC, schema controls, and open protocols satisfy global regulations.
The D. E. Shaw group replaced its previous observability platform with ClickHouse to handle high-cardinality metrics at scale, achieving 7x better query performance and enabling multi-year capacity planning across millions of compute workloads.
This guide covers three critical decisions for production RAG systems: chunk shaping, embedding selection, and ANN index scaling, bridging the gap between demo retrieval and real-scale deployments.
Learn how to design governance directly into event streaming systems. The post covers schemas, lineage, security, retention, and compliance patterns for reliable data platforms.
A deep dive into how pg_clickhouse's Foreign Data Wrapper decides what SQL to push down to ClickHouse versus execute locally in Postgres.
The article argues that enterprise AI systems are entering a phase where inference design matters as much as model capability itself.
Why does high cardinality break Prometheus but not ClickHouse? In Part 1, we explore the architectural tradeoffs of Prometheus and other series-based systems, showing how cardinality impacts memory, ingestion, querying, and operational stability at scale.
The post presents a critical analysis of MRC's three counterintuitive design decisions behind OpenAI's 131,000-GPU training fabric, including the networking mathematics that make them work and their implications for the AI infrastructure community.
Cloudflare's billing pipeline slowed down due to lock contention in ClickHouse's query planner after a partitioning change. The post details how they identified and fixed the bottleneck.
Learn how ClickStack’s new SQL-powered charting and alerting unlock anomaly detection, rolling baselines, and advanced observability workflows directly on top of ClickHouse, without relying on external tooling.
This article breaks down exactly how each format organizes its metadata, which determines how fast queries start planning and how efficiently concurrent writes occur.
The article explores Databricks' implementation of rate limiting at scale, focusing on shrinking the critical path and the necessary accuracy tradeoffs.
Meta's engineering teams revamped their data ingestion system to enhance reliability at scale, migrating from a legacy system to a new architecture.
Learn how ClickHouse Cloud's Join table engine enables fast, updatable in-memory lookups for dimensional modeling — with automatic upserts, deduplication, and data compaction powered by ReplacingMergeTree under the hood.
The article describes how Figma's engineering team evolved their data pipeline to handle growth, reducing latency from multiple days to real-time.
Cloudflare investigated a performance issue caused by CUBIC's congestion window getting stuck at its minimum, identifying the root cause as incorrect measurement of idle periods. The fix involved accurately distinguishing RTT wait times from application idleness.
The post discusses techniques to cut inference cold starts by 40x using LP, FUSE, C/R, and cuda-checkpoint.
ClickHouse 26.4 is here! In this release, more features become SQL compatible, COUNT DISTINCT gets faster, EXPLAIN gets even prettier, and more.
This post introduces Quack, the new client-server protocol for DuckDB. It explains the motivation for a client-server architecture and outlines the design considerations for the Quack protocol, including security, efficiency, and extensibility.
The title suggests a discussion on managing data assets at Netflix's scale.
Avride replaced Apache Iceberg with ClickHouse Cloud, cutting index lookup latency from 20 seconds to under 100ms and ingestion from hours to seconds.
The article focuses on the design and implementation of Pinterest's MCP ecosystem, outlining the key elements required for its successful operation.
The database industry is repeating a historical cycle where specialized systems create fragmentation that demands convergence. As AI agents become primary data consumers, organizations face a new challenge: context silos, where information exists but cannot be retrieved fast enough for autonomous sy
The title suggests a discussion on scaling ArchUnit using Nebula ArchRules.
Apache Doris 4.1 introduces mature spill-to-disk capabilities, enabling Hash Join, Aggregation, and Sort operators to write intermediate state to disk when memory pressure rises so that memory-intensive analytical queries complete without OOM errors.
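The general idea behind spilling a Sort operator can be sketched as a textbook external merge sort. This is an illustration only and does not reflect how Doris implements it: the function name `spill_sort`, the `max_in_memory` limit, and the one-integer-per-line run files are assumptions.

```python
import heapq
import os
import tempfile


def spill_sort(values, max_in_memory):
    """Yield integers in sorted order, buffering at most max_in_memory at once."""
    run_paths = []
    buffer = []

    def spill():
        # Sort the full buffer and write it to disk as one sorted run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:  # stand-in for "memory pressure rises"
            spill()

    buffer.sort()  # whatever is left stays in memory as the final run
    files = [open(p) for p in run_paths]
    try:
        runs = [(int(line) for line in f) for f in files]
        # Stream a k-way merge over the on-disk runs and the in-memory tail.
        yield from heapq.merge(*runs, buffer)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

For example, `list(spill_sort([5, 3, 9, 1, 7, 2, 8], 3))` writes two sorted runs to disk and merges them with the remaining in-memory value.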
The program for DuckCon #7 Amsterdam, a DuckDB user conference, has been announced. The event will be held on June 24, 2026, and will run from 15:00 to 20:00 CEST.
This is Part 1 of a 15-part Apache Iceberg Masterclass. This article covers the fundamental question: what problem do table formats solve, and why does the choice between them matter? A data lake without a table format is a collection of files. It has no concept of a transaction, no mechanism to pre
This article presents container design patterns categorized by their coordination scope, providing a structured overview of common practices for distributed systems.
The article posits that many agent reliability issues stem from underlying data engineering problems.
The post discusses using ClickHouse as a Kafka sink, focusing on how async insert mode helps with high message rates but has buffering and dedupe behaviors that aren't always obvious.
Parloa leverages OpenAI models to power scalable, voice-driven AI customer service agents, enabling enterprises to design, simulate, and deploy reliable, real-time interactions.
This Anyscale blog post discusses the architecture of AI agents on Ray Serve, covering both single-agent and multi-agent architectures. It explores how to build and deploy AI agents using Ray.
Neon decoupled storage and compute to deliver up to a 5x performance increase on write-heavy workloads by disabling full-page writes.
DuckDB's Delta Lake and Unity Catalog extensions are no longer experimental. The post details the progress of these extensions.
Adding a column to a large production table used to require a plan involving migration scripts, maintenance windows, and backfill jobs that rewrite every data file to include the new column. Iceberg default column values eliminate the need for backfills during schema evolution.
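A toy model shows why a default recorded in the schema removes the backfill: the reader supplies the value for old files at read time, and the files themselves are never rewritten. This sketch does not use any Iceberg API. `Column`, `read_rows`, and the list-of-dicts "files" are hypothetical stand-ins, and `initial_default` is named after the spec's `initial-default` field.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Column:
    name: str
    # Value assumed for rows in files written before the column existed.
    initial_default: Any = None


def read_rows(data_file, schema):
    """Project stored rows onto the current schema, filling absent columns."""
    for stored in data_file:
        yield {col.name: stored.get(col.name, col.initial_default) for col in schema}


# old_file predates the "region" column; new_file was written after it was added.
old_file = [{"id": 1}, {"id": 2}]
new_file = [{"id": 3, "region": "eu"}]
schema = [Column("id"), Column("region", initial_default="unknown")]

rows = [row for f in (old_file, new_file) for row in read_rows(f, schema)]
```

Adding the column here is a change to `schema` alone; `old_file` still contains only `id`.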
The post details improvements to vLLM, focusing on correctness before corrections in reinforcement learning.
On May 5, 2026, DENIC published broken DNSSEC signatures for the .de TLD, making millions of domains unreachable. Here's what 1.1.1.1 saw, how serve stale cushioned the impact, and how we restored resolution.
How ClickStack and Odigos eliminate observability gaps with zero-code eBPF instrumentation and full-fidelity distributed tracing at scale.
This post links to a video demonstrating the DuckLake workflow using Rosetta DBT Studio. The presentation covers creating and importing lakehouse instances, exploring metadata, running SQL queries, and building reusable SQL Notebooks.
Agentic analytics makes query-readiness a write-side cost problem. This post compares Snowflake and ClickHouse under continuous ingest, showing how ClickHouse obtains query-ready data at 22× lower cost and delivers 31× better write-side cost-performance.
Singular Bank built Singularity, an internal assistant using ChatGPT and Codex to help bankers save 60–90 minutes daily on meeting prep, portfolio analysis, and follow-up.
Uber uses OpenAI to power AI assistants and voice features that help drivers earn smarter and riders book faster across a global real-time marketplace.
Simon Willison describes how he used AI agents to launch and run a cafe in Stockholm, detailing the architecture and lessons learned.
The article compares three patterns for integrating AI inference with Apache Kafka: external RPC, embedded, and sidecar, focusing on avoiding consumer rebalances, cutting costs, and scaling LLM pipelines.
Jikkou 1.0 is out, featuring Apache Iceberg integration for declarative management of namespaces, tables, and views. It also includes multi-cluster orchestration and Confluent Cloud RBAC.
StarTree claims a 5-37x performance improvement for queries on Iceberg tables compared to Trino and ClickHouse, based on their benchmark.
This article presents a self-healing layer designed to detect and correct hallucinations in RAG systems before they reach users by addressing issues in reasoning.
We use clickhousectl to spin up multiple ClickHouse versions side by side and benchmark two recent performance improvements.
OpenAI introduces MRC (Multipath Reliable Connection), a new supercomputer networking protocol released via OCP to improve resilience and performance in large-scale AI training clusters.
From spinning disks to CPUs to cloud object storage, shifting bottlenecks have shaped Redpanda's architecture. Here’s what Cloud Topics revealed about today’s demand for high-latency storage.
OpenAI and PwC are collaborating to help businesses automate finance workflows using AI agents, with the aim of improving forecasting, strengthening controls, and modernizing the CFO function.
Figma details the architecture and implementation of PGKeeper, their custom Postgres connection pooler, explaining why existing solutions like PgBouncer didn't meet their needs.
How Qonto uses ClickHouse Cloud to power observability at scale — replacing sampling and hour-capped queries with two-week query windows, 99.84% compression on high-cardinality data, and an AI incident companion built on the ClickHouse MCP server.
This blog post discusses the simplicity of the DuckLake specification for dataframes.
OpenAI describes how it rebuilt its WebRTC stack to power real-time Voice AI with low latency, global scale, and seamless conversational turn-taking.
Icestream is a project enabling efficient streaming writes in Apache Iceberg.
A developer is building a no-code agent builder that uses DuckDB to create conversational analytics agents. These agents can respond to queries with interactive charts and UI, allowing users to query databases more easily.
This post proposes a framework for leveraging AI, emphasizing context as infrastructure, taste as configuration, verification for autonomy, scaling through delegation, and closing feedback loops for continuous improvement.
A user shares a community extension for DuckDB called `ducksmiles` that enables processing chemistry data, including SMILES, InChI, and PDB formats, directly within DuckDB. The extension supports functions like `mol_formula` and `mol_weight` and integrates with `read_csv_auto`, `read_text`, and `htt
Cloudflare completed an engineering effort to make its infrastructure more resilient using tools like Snapstone and the Engineering Codex. They implemented safer configuration changes and automated best practices to prevent future incidents.
Fivetran accelerated the transpilation of SQL dialects in SQLGlot by compiling it with mypyc, resulting in faster translation between different SQL dialects for query engines.
This arXiv paper introduces a method to improve Text-to-SQL accuracy by using template-constrained decoding, particularly for recurring questions. It addresses challenges in real-world deployment of Text-to-SQL models, especially in complex or unseen schemas.
DuckDB infers NULL-only columns as the generic JSON type, causing staging issues when real values appear later. The solution involves using a synthetic canonical sample to ensure correct type inference from the outset.
This post discusses the shift of AI engineers from using frameworks like LangChain to building native agent architectures for production LLM applications.
The post discusses the timeline, root cause, and fixes behind "goblin outputs," which are personality-driven quirks in GPT-5 behavior.
This post links to a DuckDB extension for vector search indexes with pluggable quantization.
Stripe introduces Link’s wallet for agents, offering programmatic access to generate one-time-use cards or Shared Payment Tokens, built on Stripe’s new Issuing for agents.
A user shares a practical troubleshooting guide derived from recurring Kafka production failures, covering issues like consumer lag, producers writing without consumers reading, and duplicate processing after restart due to offset commit problems.
This article explores Stripe's Radar system for detecting fraudulent transactions within 100 ms, detailing the architectural decisions behind its effectiveness.
The post discusses how Iceberg deletion vectors offer a more efficient way to handle row deletions in data lakehouses, where deleting rows can be an expensive operation due to the immutable nature of Parquet files.
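The mechanism can be illustrated with a toy model in which a delete records row positions next to the data file instead of rewriting it. This is a sketch under stated assumptions, not Iceberg's on-disk format: a Python set stands in for the compressed bitmap a real deletion vector would use, and `DataFile`, `delete_where`, and `scan` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    rows: tuple  # immutable, standing in for a Parquet file
    deleted: set = field(default_factory=set)  # positions marked as deleted


def delete_where(data_file, predicate):
    """Mark matching row positions as deleted without touching the rows."""
    for pos, row in enumerate(data_file.rows):
        if predicate(row):
            data_file.deleted.add(pos)


def scan(data_file):
    """Return the live rows, skipping positions in the deletion vector."""
    return [row for pos, row in enumerate(data_file.rows)
            if pos not in data_file.deleted]


events = DataFile(rows=(
    {"id": 1, "user": "a"},
    {"id": 2, "user": "b"},
    {"id": 3, "user": "a"},
))
delete_where(events, lambda r: r["user"] == "a")
live = scan(events)
```

The delete touches only the small position set; the cost of filtering is paid by readers until a later compaction rewrites the file.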
Choco used OpenAI APIs to streamline food distribution, boost productivity, and unlock growth, providing a customer story on the real-world impact of AI.
The article outlines a strategy for modernizing data platforms by migrating to an Apache Iceberg lakehouse. It suggests avoiding long ETL pipeline builds and focusing on faster time-to-value for analysts.
Pgrx is a framework for building PostgreSQL extensions using Rust, enabling developers to leverage Rust's safety and performance features within the Postgres environment.
Recent pg_clickhouse releases introduce JSONB, date/time, and array function pushdown, plus HTTP result set streaming for lower memory usage.
The article discusses using whisper.cpp within DuckDB to translate speech to text.
The paper introduces SQLyzr, a benchmark for evaluating text-to-SQL models, which have improved due to large language models. The platform addresses shortcomings in existing benchmarks.
The paper argues that the dominant approach in agentic AI, where large language models orchestrate information access by dynamically selecting tools, is misguided. It proposes an alternative architecture focused on data.
The paper studies the efficiency of loading and storing data in Apache Hudi, Apache Iceberg, and Delta Lake using Apache Spark.
The post discusses DeepSeek-V4, a model with a million-token context that agents can use.
ClickHouse positions itself as an alternative to Elasticsearch for log analytics by combining full-text search and large-scale analytics. The post alludes to a performance benchmark.
The article highlights a significant security vulnerability where misconfigured RAG pipelines are exposing vector databases to the public internet. A live map visualizes the scale of the leak, emphasizing the failure of perimeter security in the AI space.
The article describes how Spotify used Honk, Backstage, and Fleet Management to ease the pain of migrating thousands of datasets.
The article introduces the new GizmoSQL iOS app, which uses DuckDB.
DuckDB 1.5.2 is a new release of the SQL database that runs on laptops, servers, and in the browser.
A solution architect explains the architecture decisions critical for successful dbt implementation on Databricks, focusing on potential pitfalls and solutions.
The article discusses the need for both semantic and operational context for AI agents to trust data. It presents how Dagster and Atlan can provide both halves of the context layer.
ClickHouse reports a 30–55% speedup in ClickBench queries and ~15% reduction in compute costs on Google Cloud by migrating to Axion C4A instances.
The author compares different methods for efficiently modifying DuckDB from Java, highlighting performance improvements with the new UDF feature of the Java Drivers.
The article discusses how using dbt's semantic layer provides a foundation for building reliable and governed agentic analytics workflows.
Transient is a CLI tool to provide a governance layer for AI agents, including permission policies and auditing. It helps answer the question of what an agent did, whether it was authorized, and if it can be proven. The tool wraps the agent process and installs quickly.
ClickHouse Cloud now shards indexes across replicas for petabyte-scale workloads. Distributing the indexes reduces memory usage, speeds up index analysis, and improves performance.
Apache Arrow version 24.0.0 has been released with 259 resolved issues from 57 contributors. The announcement provides a link to the installation page.
This Redpanda blog post introduces Shadow Linking, a feature designed to simplify disaster recovery through real-time replication.
Lakeflow Jobs cannot see across workspace boundaries. This post explains how Dagster unifies multiple Databricks workspaces into a single asset graph with real dependencies.
This article examines how GitHub built a security architecture that assumes the agent is already compromised.
Nava migrated ELO's payments monitoring platform from Elasticsearch to ClickHouse, cutting storage from 12 TB to 2 TB, slashing annual infrastructure costs by 87%, and delivering sub-2-second end-to-end latency across 300 real-time dashboards.
This article discusses lineage at the row level for Iceberg tables to track how specific rows were affected.
The article presents EvoRAG, a Knowledge Graph-based Retrieval-Augmented Generation framework designed to improve LLM reasoning by retrieving multi-hop paths from knowledge graphs. The framework aims to address the underperformance of existing KG-RAG solutions in real-world scenarios through feedbac
The article explores agentic visual analytics systems, where LLM-driven agents autonomously manage the full visual analytics pipeline. This approach seeks to shift users away from low-level tool manipulation towards higher-level, task-oriented interactions.
Explore the end-to-end pipeline of TurboQuant, a novel KV cache quantization framework. This overview breaks down how multi-stage compression achieves near-lossless storage through PolarQuant and QJL residuals, enabling massive context windows with minimal memory overhead.
Hey folks. I find Online Machine Learning (OML) particularly appealing in data streaming environments, even though it hasn't yet seen widespread application across many domains. I wanted to build a complete Event-Driven Architecture that applies stateful stream processing to a real-world physical pr
The article discusses a scenario where a RAG system retrieves the right data but still produces wrong answers and offers suggestions to fix this issue.
The article suggests using Git worktrees to provide AI agents with isolated workspaces for parallel coding sessions, discussing the benefits and setup considerations.
An autonomous driving company consolidated fragmented data platforms by adopting Apache Doris as a unified analytics engine, enabling seamless search across text, vectors, labels, and metadata while reducing query times from minutes to seconds.
Brazilian fintech Trio cut storage by 88% and achieved a "generational leap" in speed by building a unified payment analytics platform on ClickHouse Cloud, handling 243M+ payments and 1B+ daily events with a sliding window approach for late and duplicate
pgvector is an open-source PostgreSQL extension that adds the ability to store, index, and search over vector embeddings, enabling similarity search and other vector-based operations directly within Postgres.
Agents of the Alley is a Context Engineering OS for Claude Code Agents, available on GitHub.
The article details five practices for reducing Kafka Streams rebalancing issues when running stateful processors with RocksDB on Kubernetes, including static membership configuration and session timeout tuning.
A Go package allows mounting fs.FS as a virtual file system in DuckDB.
Meta's Capacity Efficiency Program uses an AI agent platform to automate the identification and resolution of performance issues across its infrastructure. The platform leverages encoded domain expertise and a standardized tool interface to improve efficiency.
The WarpStream MCP Server allows AI assistants to connect to WarpStream clusters for querying logs, diagnosing issues, and inspecting ACL events directly from the IDE.
Meta shares lessons learned from their post-quantum cryptography (PQC) migration to assist other organizations in strengthening their resilience during the transition to post-quantum cryptography standards. They propose the idea of PQC Migration Levels to help teams manage the complex migration proc
The article introduces memweave, a system for agent memory that uses Markdown and SQLite, eliminating the need for a vector database.
Cloudflare's Artifacts provides Git-compatible versioned storage for code and data, designed for agents, developers, and automations. It supports creating millions of repos and forking from any remote.
The article proposes disaggregated LLM inference, where prefill (compute-bound) and decode (memory-bound) operations are handled by different GPUs, potentially leading to 2-4x cost reductions.
Learn how ClickHouse uses primary indexes, lightweight projections, and skip indexes to prune data before reading it. Demonstrated on a 243 million row UK property sales dataset.
Most RAG tutorials focus on retrieval or prompting, but the real problem starts when context grows. This article outlines a context engineering system built in pure Python that controls memory, compression, re-ranking, and token budgets to keep LLMs stable under realistic constraints.
An iOS app, GizmoSQL, runs DuckDB as a server with Arrow Flight SQL and supports mounting a DuckLake, enabling a "data lake in your pocket." The app runs the TPC-H 1GB benchmark in under 2 seconds on an iPhone 16 Pro Max.
This article discusses the evolving landscape of AI agent frameworks and harnesses, suggesting that while frameworks might be becoming cheaper, the underlying need for structured agent orchestration remains.
This article explains how to optimize GPU efficiency, covering GPU architecture, performance bottlenecks, and optimization strategies using PyTorch and custom kernels.
The article discusses context graphs as a crucial layer between human decisions and AI agents.
Mintlify replaced PostHog with ClickHouse Cloud, resulting in faster dashboard load times, no rate limit errors, a 30% NPS improvement, and a 60% cost reduction.
A developer built a browser-based spreadsheet diff tool using DuckDB WASM to compare Excel/CSV files, highlighting differences in 42k rows × 14 cols in approximately 3 seconds without a server. The tool handles date formats, floating point noise, and case inconsistencies.
This blog post details how agentic AI can be used to automate demand forecasting in CPG supply chains, potentially improving accuracy and efficiency while reducing manual work.
This Reddit post discusses DuckDB's architecture, comparing DuckLake to Snowflake and highlighting potential drawbacks of using object stores for metadata, as Iceberg does. The post explores what differentiates MotherDuck's technical architecture from Snowflake's.
In this article, we will look at how the LinkedIn engineering team rebuilt the Feed and the challenges they faced.
DuckLake v1.0 has been released.
Cloudflare introduces Durable Object Facets, which allows Dynamic Workers to instantiate Durable Objects with isolated SQLite databases, enabling platforms that run persistent, stateful code generated on-the-fly.
Cloudflare Sandboxes provide AI agents with persistent, isolated environments, including a shell, filesystem, and background processes.
Outbound Workers for Sandboxes provide a programmable, zero-trust egress proxy for AI agents. This allows developers to inject credentials and enforce dynamic security policies without exposing sensitive tokens to untrusted code.
DuckDB v1.5.2 is a patch release. DuckLake is released.
The article announces the release of DuckLake 1.0.
This post discusses a custom-built solution for Kafka Streams state store recovery during disaster recovery, addressing a previously unsolved problem in the ecosystem.
Most ReAct-style agents are silently wasting their retry budget on errors that can never succeed. In a 200-task benchmark, 90.8% of retries were spent on hallucinated tool calls — not model mistakes, but architectural flaws. This article shows why prompt tuning won’t fix it, and the three structural
Walkthrough of querying data lake files with DuckDB, covering Parquet, Iceberg, and S3 integration patterns.
This LangChain blog post discusses the growing importance of agent harnesses in building AI agents and their connection to agent memory. It highlights the potential drawbacks of using closed harnesses, particularly those behind proprietary APIs, which can limit control over the agent.
LineageScope is a static analyzer for SQL, dbt, Airflow, Spark, and data contracts. It aims to provide data lineage and enforce data contracts.
The article argues that AI coding assistants require a persistent memory layer to overcome the limitations of stateless LLMs. This memory layer improves code quality by providing systematic context across sessions.
This Netflix tech blog post is about evaluating show synopses with LLM-as-a-Judge.
This post announces the open-sourcing of a formally verified TLA+ specification for a leaderless log protocol for Kafka, highlighting the discovery of a design bug through verification. It also mentions using Claude Code to generate a working Rust implementation from the specification, demonstrating
This article from the DuckDB website discusses the design and implementation of DuckDB internals, which is useful for understanding its architecture and performance characteristics.
This article discusses context engineering techniques for AI coding agents, focusing on Claude Code sub-agents. It explores how to structure prompts and context to improve the performance of AI coding assistants.
This article outlines golden rules for agent-first product engineering. It explores considerations for designing products around AI agents and their capabilities.
Meta shares its approach to modernizing WebRTC, the technology powering real-time audio and video across their platforms. The article highlights the challenges of forking a large open-source project and how Meta addressed them to stay aligned with community upgrades.
Deep Agents deploy is launching in beta as a production-ready way to deploy a model-agnostic, open-source agent harness. It is built on Deep Agents and designed for an open world.
This Reddit post links to a DuckDB tutorial demonstrating how to build a text-to-SQL agent that can automatically recover from bad SQL queries. The approach leverages a tool-calling agent loop to inspect and correct errors.
This paper presents LASER, a data-centric method for low-cost and efficient SQL rewriting based on SQL-GRPO, aimed at transforming queries into more efficient variants for database optimization.
This paper introduces AV-SQL, a method for decomposing complex Text-to-SQL queries using agentic views, aimed at improving the accuracy and efficiency of natural language to SQL translation.
This paper introduces SQLStructEval, a framework that examines the structural behavior of LLM-generated SQL queries to evaluate whether they are structurally reliable.
Redpanda Connect now offers native CDC for Oracle, enabling real-time data access without requiring rearchitecting. The solution eliminates the need for a JVM, middleware, and related operational overhead.
This post recaps ClickHouse's involvement at FOSDEM 2026 in Brussels. It highlights the community's activities during the event.
This post highlights GlassFlow's work on achieving high-throughput (500k+ events/sec) transformations for ClickHouse ingestion, particularly in observability and real-time analytics pipelines. It addresses challenges related to scaling throughput.
This Confluent blog post details how Agent Taskflow built a production-grade AI orchestration platform using Confluent and AWS, avoiding self-managed Kafka. The article focuses on scalability and the benefits of using managed services.
This blog post argues that retention limits, sampling, and metric roll-ups hinder AI-driven observability workflows. It suggests that these practices are inadequate for handling the demands of full-fidelity data required by modern AI.
Cloudflare's blog post details how they automated the generation of malware trigger packets using symbolic execution on BPF bytecode. By leveraging the Z3 theorem prover, they significantly reduced analysis time, improving their ability to detect and respond to threats.
This blog post analyzes DuckLake's claim of a 900x speed improvement by inlining data into the catalog, a topic of interest for data engineers optimizing query performance.
This paper proposes a "Black-Hole Attack" against vector databases, where malicious vectors injected near the geometric center can compromise retrieval. It highlights the need for security considerations in vector database design.
This paper introduces Cortex AISQL, a production SQL engine from Snowflake that integrates native semantic operations directly into SQL. This allows users to combine relational operations with semantic reasoning for querying unstructured data.
The article discusses effective strategies for managing the context window in AI agents. It emphasizes improving performance, reducing costs, and maintaining relevant outputs, which is valuable for optimizing AI systems.
This post describes how to combine Fivetran, DuckDB, and Claude to enable conversational analytics on a data lake, highlighting the potential for interacting with data through natural language queries using an MCP server.
The PFC-JSONL extension, which enables block-level timestamp filtering for compressed JSONL logs, has been merged into the DuckDB Community Hub. This allows users to query compressed JSONL files more efficiently using DuckDB.
The post describes a hybrid PyMuPDF and GPT-4 Vision pipeline that significantly reduced the time required for document extraction from PDFs, replacing manual effort, and discusses why the latest models weren't always the optimal solution.
This article delves into the design of an AI agent for navigating large-scale event data, focusing on transforming query patterns into intelligent tools and crafting an effective agent architecture, which offers practical insights into building agents for complex data environments.
This article discusses techniques for optimizing context, a finite resource, when designing AI agents, focusing on how to best utilize available information to enhance agent performance.
ClickHouse version 26.3 introduces async inserts by default, improved JOIN reordering, and materialized CTEs. These features could improve query performance and data management for users.
The Apache Arrow team announced version 23 of the Apache Arrow ADBC libraries, which resolves 41 issues from 20 contributors. The version number applies to the libraries; the API specification is versioned separately.
This Netflix tech blog post discusses interval-aware caching for Druid at scale.
Enterprise AI SaaS automates customer enablement with a 5-agent workflow to close adoption gaps, reduce churn, and scale training across industries.
Meta shares how they used AI to map tribal knowledge within large-scale data pipelines, highlighting the challenges and limitations of AI coding assistants' understanding of complex codebases.
This ByteByteGo article explores context engineering for LLMs, explaining how LLMs process information and outlining strategies to improve context utilization.
Respan AI migrated to ClickHouse Cloud for high-throughput LLM observability after outgrowing Postgres. The company now processes 50 million daily events using incremental materialized views.
This article recounts the author's three months of building with AI after eight years of wanting to.
This LangChain blog post discusses continual learning for AI agents, highlighting that learning occurs at the model, harness, and context layers, not just model weight updates. Understanding these distinctions is crucial for building systems that improve over time.
The article introduces a Syntaqlite Playground, which is related to dbxlite and Metastax.
Netflix details how it uses multimodal intelligence to improve video search.
This dbt blog post discusses how to operationalize analytics agents by building context for LLM models using dbt and MCP servers. It explores practical ways to integrate LLMs into analytics workflows.
This article describes how a glibc update in 2018 silently invalidated Postgres text indexes, leading to incorrect query results. It highlights the importance of understanding collation settings for maintaining data integrity in Postgres.
This post details a self-healing deployment pipeline for a GTM Agent. The system automatically detects regressions after each deploy, determines if the change caused the regression, and uses an agent to create a pull request with a fix, minimizing manual intervention.
Atlan describes their migration of workflow orchestration from Argo to Temporal. The post details the reasons for the migration, the crossover architecture used during the transition, and lessons learned from rebuilding orchestration in production.
This paper critiques the use of average recall as the dominant metric for evaluating vector databases, which are crucial in AI systems. It argues that relying solely on average recall can be problematic for users and researchers optimizing these systems.
This paper discusses DocETL, a declarative system for LLM-powered data processing that has gained traction across various domains. DocETL allows users to define complex data processing pipelines using LLMs, enabling tasks like information extraction and data transformation from unstructured document
The agent security market is rapidly developing with companies offering runtime identity enforcement and permission revocation for autonomous agents.
ClickHouse describes its work on agentic coding.
Netflix details the implementation of Variable Bit Rate (VBR) for all their live events to improve streaming quality.
This article describes how to use Dagster's dbt integration to run and monitor dbt models as part of a larger asset-driven pipeline, focusing on lineage and scheduling improvements.
Learn step-by-step how to debug Dagster pipelines directly inside Docker, bridging development and deployment environments with practical tools.
Warm Docker containers reduce cold starts. Learn how Dagster Cloud deploys 5x faster using PEX-based builds.
Use proven tips to make your Python code faster and more efficient, especially for data engineering and pipeline-heavy workloads.
Meta details KernelEvolve, a Ranking Engineer Agent used to optimize AI infrastructure for ads ranking, focusing on autonomous design, execution, and analysis of ranking model experiments.
LangChain reports that open models like GLM-5 and MiniMax M2.7 are now comparable to closed frontier models on agent tasks like file operations and tool use. The article presents evaluation results and instructions for using these open models.
Cloudflare discusses the challenges and opportunities in cache design presented by the explosion of AI-bot traffic, detailing the differences between AI bot traffic and human traffic and providing some early ideas for system design.
This article discusses how data leaders should design the interface between data platforms and the teams that rely on them. It emphasizes the importance of clear boundaries and well-defined responsibilities in data platform engineering.
Redpanda details how they used profile-guided optimization to improve performance and reduce latency in Redpanda 26.1. The article offers a behind-the-scenes look at their optimization process.
This blog post from the DuckDB team introduces data inlining in DuckLake to enable streaming for data lakes. It details the motivation, implementation, and benefits of this approach, including improved performance and reduced latency.
Rill leverages ClickHouse to deliver real-time operational BI for over 100 billion daily events. The integration offers instant data exploration and conversational analytics via a declarative, BI-as-code workflow.
MotherDuck explores the performance and challenges of working with large datasets in Redshift.
Dagster 1.12 introduces a redesigned UI, Components GA, streamlined deployment workflows, and major orchestration upgrades. These enhancements aim to make data orchestration faster, simpler, and more reliable for users.
Gradient Labs is deploying AI account managers for banks using GPT-4.1 and GPT-5.4 models. These agents aim to automate banking support with low latency and high reliability.
This blog post explains multimodal embeddings for searching across different data types (text, images, audio, video) in RAG systems. It provides practical implementations using Weaviate and Gemini.
This DuckDB blog post humorously explores an alternate reality where Dutch, not English, became the dominant language for SQL. It poses the question of how this linguistic shift might have shaped the development and standardization of SQL.
The Airbyte blog discusses why LLM agents often struggle to move beyond basic text-to-SQL functionality. It argues that a missing context layer is crucial for enabling truly intelligent, action-driven AI systems.
The article introduces Dux, distributed DuckDB-backed dataframes on Apache Beam.
ClickHouse has announced the general availability of its Bring Your Own Cloud (BYOC) offering on Google Cloud. This allows users to run ClickHouse within their own Google Cloud account while maintaining full data sovereignty and zero-trust networking.
Meta is scaling its Ads Recommender runtime models to LLM-scale and complexity using the Meta Adaptive Ranking Model. This article discusses how they are bending the inference scaling curve to deliver better experiences for people and better results for advertisers using AI recommendation systems.
This blog post from AWS covers how to give Kafka clients in different VPCs and AWS accounts secure private access to MSK Serverless clusters. The solution addresses the limitation of MSK Serverless supporting PrivateLink connectivity for up to 5 VPCs in the same account.
This article details the data schema, embeddings, and graph design for an agentic query engine on ApertureDB, focusing on managing mixed data types.
This paper introduces Exqutor, an extended query optimizer designed for vector-augmented analytical queries, particularly in Retrieval-Augmented Generation (RAG) pipelines. It aims to improve the efficiency of retrieving relevant external knowledge for large language model inference.
Padlet leverages ClickHouse Cloud to provide real-time analytics for classrooms, processing billions of events per month with sub-second query performance without requiring a dedicated data team, showcasing ClickHouse's scalability and ease of use.
This post introduces kernel-anvil, a tool for auto-tuning llama.cpp kernels on AMD GPUs, resulting in a 2x decode speedup by optimizing kernel configurations per model shape. The tool profiles GGUF model layer shapes and generates optimal kernel configs loaded at runtime, without recompilation.
This article describes the architecture of Redpanda Cloud Topics, a new replication mechanism that uses object storage to reduce costs. The discussion of internals is valuable for engineers working with streaming data.
This article explains how to use HNSW indexes effectively with JOINs and WHERE clauses in DuckDB, demonstrating how to combine approximate nearest neighbor search with standard SQL operations for efficient data retrieval.
The article describes a technique for creating self-healing neural networks using PyTorch. It demonstrates how to detect model drift and adapt in real time without retraining.
This WarpStream blog post explains Kafka transactions and contrasts them with WarpStream's implementation, which eliminates stateful brokers. It highlights how WarpStream achieves the same protocol guarantees with different internal architecture.
WarpStream demonstrates a 4x reduction in cloud infrastructure costs compared to self-hosted Kafka in their benchmarks. The savings are attributed to eliminating inter-AZ fees and replacing EBS with object storage.
This Neon blog post discusses their approach to zero-downtime patching using prewarming techniques to ensure continuous availability of customer databases. It details their system's redundancy and failover mechanisms.
The LangChain blog post offers a checklist for evaluating AI agents, covering error analysis, dataset construction, grader design, and offline/online evaluation. The checklist is intended to help ensure production readiness.
This Databricks blog post discusses techniques for ensuring database availability during patching in Lakebase. It focuses on prewarming as a method to minimize downtime during updates, which is crucial for maintaining service reliability in data platforms.
This post shares the configurations used to push Qwen 3.5 27B to 1,103,941 tok/s on 12 nodes with 96 B200 GPUs using vLLM. The improvements came from changes to DP, context window, FP8 KV cache, and MTP-1 speculative decoding.
This paper introduces the concept of Agentic Context Engineering for self-improving language models. It is focused on improving the performance of LMs through dynamic context management, which is relevant to AI agents and context-aware data systems.
Vizier is a physical design advisor for DuckDB. It analyzes queries and recommends changes to the database layout, such as sort orders and indexes, to improve performance.
This article presents ten best practices for ClickHouse, covering topics like primary key design, data types, materialized views, and join optimization. Benchmarks on a 150M row dataset illustrate the impact of these practices.
Cloudflare describes a Kubernetes fix involving fsGroupChangePolicy that reduced Atlantis instance restart times from 30 minutes to 30 seconds by addressing a bottleneck in volume permission handling.
ClickHouse now supports direct querying of Iceberg and Delta Lake formats across major cloud catalogs. This feature eliminates the need for data migration, improving data lake accessibility.
A physical design advisor called Vizier has been developed for DuckDB. It analyzes queries and suggests changes to the database's physical layout, such as sort orders and indexes, to improve query performance.
This paper details ByteHouse, ByteDance's cloud-native data warehouse designed for real-time multimodal data analytics. It addresses the need for efficient and cost-effective data analytics infrastructures in handling the demands of intelligent data services within ByteDance's production environment
This blog post from Firetiger explores strategies for handling large tool results within AI agent workflows. It discusses approaches like summarization, pagination, and streaming to manage the volume of data returned by tools used by agents.
Pgsemantic is a new open-source project that enables instant vector search capabilities within a Postgres database. The tool aims to simplify the process of integrating vector embeddings for semantic search applications directly within Postgres.
Shared-nothing made sense when storage was slow, but shared storage flips that tradeoff. The architectural case for building Kafka directly on object storage.
Bento is the MIT-licensed open source fork of Benthos — stateless stream processing, no feature gating, no license traps, full connector ecosystem intact.
Kafka assumes stateful, partition-owning brokers. How WarpStream reverse-engineered it for stateless Agents. A deep dive into diskless Kafka load balancing.
Object storage does not GC itself; compaction and retention create orphaned data at scale. Five GC strategies for S3-native distributed systems compared.
When pprof hit its limits, WarpStream used gcore and viewcore to trace a goroutine leak in its control plane. A Go debugging guide for distributed systems.
Grafana Labs needed tens of GiB/s with zero inter-AZ costs for Cloud Metrics. How WarpStream diskless architecture met their scale without Kafka ops overhead.
Build real-time security threat monitoring with WarpStream, RisingWave, and Grafana: one materialized view per metric, no complex pipelines, no extra infra.
Cloud block storage costs up to 24x more per GiB than S3. The exact numbers behind why disk-based Kafka clusters drain your budget and what diskless changes.
WarpStream uses Antithesis to deterministically simulate its full SaaS, from signup to Kafka workloads, surfacing correctness bugs random testing misses.
WarpStream Orbit replicates any Kafka-compatible cluster offset-for-offset, preserving consumer groups, ACLs, and configs for zero-gap migrations and DR.
WarpStream BYOC needs zero cross-account IAM access. Raw data stays in your VPC, only metadata touches WarpStream. Secure by design for sensitive workloads.
Cluster quotas and mirroring do not fix Kafka noisy neighbors; they shift the cost. How WarpStream Agent Groups isolate workloads without dedicated clusters.
WarpStream Agents validate records against AWS Glue Schema Registry, blocking malformed data at ingest without dead-letter queues or extra infrastructure.
WarpStream natively embeds Bento inside Agents — Kafka Connect-style integrations and stream processing with zero additional infrastructure inside your VPC.
S3 API call costs can silently inflate your streaming bill. How WarpStream uses distributed memory-mapped caching across Agents to slash GET request volume.
How WarpStream built usage-based billing for its core product, separating events from metrics to keep pricing logic auditable and updatable post-facto.
WarpStream Managed Data Pipelines: fully-managed Bento inside your VPC — SaaS UX, data stays in your account, YAML-driven pipelines with rollback support.
How WarpStream implemented Kafka compacted topics with only 128 MiB RAM — tracking millions of dedup keys without traditional broker memory overhead.
How Goldsky scaled blockchain data to 100 PiB and 100K+ partitions on WarpStream for 10x cheaper than their previous Kafka vendor with zero bottlenecks. This production war story shows the benefits of WarpStream for high-volume streaming data.
Run WarpStream on Tigris for globally distributed, durable Kafka streaming. This setup eliminates region-specific bucket planning and hidden data transfer fees, offering a streamlined approach to managing streaming infrastructure.
WarpStream Multi-Region Clusters deliver RPO=0, ensuring that every acknowledged write survives full regional cloud outages without manual failover or tuning. This post highlights the robustness of WarpStream for mission-critical streaming applications.
Learn how to send structured .NET logs directly to ClickHouse using Serilog — with full schema control, full-text search, and SQL queries over your log data. This post provides a step-by-step guide for setting up and using the integration.
Learn how OpenAI’s Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance. This post details the considerations and mechanisms used to ensure responsible AI deployment.
This article details an experience using AI for scientific research where the AI hallucinated data. It underscores the importance of verifying AI outputs, especially in data-driven fields.
This paper presents flexvec, a retrieval kernel that exposes the embedding matrix and score array as a programmable surface. It is designed for AI agents and offers opportunities to expose more of the retrieval pipeline to the caller.
A developer implemented ACORN for prefiltered approximate nearest neighbors on a fork of the DuckDB VSS extension, showing significant speedups over brute-force filtering on high-dimensional vector workloads.
OpenAI launches a Safety Bug Bounty program to identify AI abuse and safety risks, including agentic vulnerabilities, prompt injection, and data exfiltration. This program encourages community participation in enhancing the security and robustness of AI models.
Trino is adding support for the NUMBER data type to handle high-precision numeric types beyond the existing DECIMAL limit. This will allow Trino to query data from sources that use these types without loss of precision, improving interoperability.
ClickHouse Cloud's two-window recommender and target-tracking CPU algorithm cut scale-down latency from 30 hours to 3 hours while eliminating oscillations and reducing infrastructure costs. The post details the algorithm and its impact on autoscaling performance.
WarpStream Schema Linking mirrors Confluent-compatible schema registries into a BYOC instance, preserving schema IDs and compatibility rules for disaster recovery.
WarpStream's diskless, S3-native Kafka implementation aims to replace traditional broker-based streaming by eliminating disks, brokers, and inter-AZ costs.
This article answers questions about WarpStream's architecture, BYOC vs. Serverless options, pricing, Kafka compatibility, performance trade-offs, and zero-disk streaming.
After the 85% price drop for S3 Express One Zone (S3EOZ), WarpStream benchmarks show 3x better latency vs. standard S3 at just 15% higher TCO. The post presents the full methodology and real workload numbers, offering a practical comparison for those considering cloud-based streaming solutions.
WarpStream Diagnostics continuously scans clusters for cost inefficiencies and health issues, surfacing actionable fixes before they become incidents. This proactive approach helps data engineers maintain optimal performance and prevent costly disruptions in their streaming data pipelines.
WarpStream + Materialize centralize streaming business logic in SQL with dbt-style version control, enabling operational data products without brittle ETL pipelines. This combination offers a streamlined approach for building and managing real-time data applications.
Kafka idempotent producers without stateful brokers require rethinking deduplication. WarpStream uses retroactive tombstones to separate data from metadata, providing a technical solution for ensuring data integrity in streaming applications.
Tiered storage still runs stateful brokers with expensive disks and inter-AZ replication, so it does not solve the real cost problem at the heart of Kafka. The article offers a critical analysis of this common architectural pattern.
OSS big data tools like Kafka were built for hyper-scalers, then given to everyone. The article discusses why on-prem assumptions in open source infra cause pain in the cloud, offering a high-level perspective on cloud infrastructure design.
Kafka offset-based consumer lag is misleading when message sizes vary. The post shows how to instrument time-based lag metrics for an accurate view of consumer group health, offering a practical solution for monitoring streaming applications.
The article provides a practical guide to Kafka disaster recovery and multi-region data sharing. It discusses how WarpStream Active-Active clusters achieve RPO=0 with no tuning required, showcasing an approach to building highly available streaming infrastructure.
Moda leverages a multi-agent system built on Deep Agents and LangSmith to enable non-designers to create professional-grade visuals. The article highlights a specific use case of AI agents in a design context.
This article outlines the architecture that allows Netflix to live stream to 100 million devices in 60 seconds. It focuses on the challenges and solutions involved in building a large-scale live streaming system.
This article focuses on offline evaluation frameworks for production-ready LLM agents. It tackles the challenge of proving that these complex systems work reliably before deployment.
Cloudflare introduces Dynamic Workers for executing AI-generated code in secure, lightweight isolates. This technique achieves millisecond startup times, significantly faster than traditional container-based sandboxing for AI agents.
The ClickHouse blog details the design of their new text index for high-performance full-text search, especially when data is stored in object storage. The post explains how the design maintains speed at scale.
Cogent Security explains how they built an AI-native vulnerability management platform using ClickHouse. They emphasize ClickHouse's speed as crucial for countering AI-enabled attacks.
This paper introduces BubbleRAG, an evidence-driven retrieval-augmented generation approach for black-box knowledge graphs. It aims to address the limitations of existing graph-based RAG approaches related to recall and precision.
Stripe Engineering details how Radar uses machine learning to prevent free trial abuse. The system predicts abusive behavior with 90% accuracy, based on common violations of trial terms.
In this article, we will look at how agentic RAG works, how it improves upon standard RAG, and the trade-offs that should be considered.
Cloudflare's Gen 13 servers introduce AMD EPYC™ Turin 9965 processors and a transition to 100 GbE networking to meet growing traffic demands. In this technical deep dive, we explain the engineering rationale behind each major component selection.
Cloudflare’s Gen 13 servers double our compute throughput by rethinking the balance between cache and cores. Moving to high-core-count AMD EPYC™ Turin CPUs, we traded large L3 cache for raw compute density. By running our new Rust-based FL2 stack, we completely mitigated the latency penalty to unlo
This arXiv paper presents a novel approach to stream processing by exploring functional isolation to reduce infrastructure costs. It discusses how concurrent workloads can extract insights from real-time data streams while optimizing resource utilization.
The paper introduces ReViSQL, an approach to translating natural language to SQL, aiming to achieve human-level performance. The research focuses on enhancing SQL reasoning by utilizing large language models and AI agents to decompose complex queries.
This paper provides a comprehensive survey on vector databases, covering storage and retrieval techniques, as well as challenges in the field. Vector databases are increasingly important due to their integration with large language models and applications in machine learning.
DuckDB 1.5.1 is released, including fixes and Lance support. The release notes are available on GitHub, and the new version can be installed from the installation page.
Trigger.dev describes their architecture for giving every user SQL access to a shared ClickHouse cluster. This pattern can be useful for connecting agents to platforms and building ETL systems.
This article uses compound probability to illustrate how seemingly accurate AI agents can fail in multi-step tasks. It also proposes a pre-deployment framework to mitigate such failures in production.
A Kafka Streams application with a large RocksDB state store (300M+ keys) experiences slow DR rebuild times (45+ minutes to 2 hours). The author is looking for solutions to improve rebuild performance, which is a common challenge in Kafka Streams.
WarpStream has added a built-in observability layer that stores structured operational events directly as Kafka topics in object storage, allowing users to query those topics.
This blog post describes WarpStream Events, an observability layer that captures Agent logs, ACL decisions, and pipeline execution logs, and allows users to search and visualize them with zero ops.
This article discusses potential failure modes in agentic RAG systems, including retrieval thrash, tool storms, and context bloat, and suggests methods for early detection to avoid high cloud costs.
Kwai, a short-video platform, unified its advertising analytics by migrating from ClickHouse and Elasticsearch to Apache Doris. This resulted in up to 90% latency reduction and a 3x increase in write throughput.
DuckDB has a flexible extension mechanism that allows extensions to be loaded dynamically at runtime, and this post shows how to build them in C#. This extension mechanism can add support for new file formats, introduce custom types, or provide specialized analytical functions.
OpenAI details their approach to monitoring internal coding agents using chain-of-thought analysis to detect and mitigate risks of misalignment, focusing on AI safety.
PondDB is presented as a self-hosted agent memory database built on DuckDB.
The article introduces Blobsearch, an Elasticsearch alternative based on object storage (like S3) and DuckDB for querying logs rapidly. It focuses on using a durable storage solution (S3 with Parquet) combined with the analytical capabilities of DuckDB for cost-effective log analysis and monitoring
A late-adopter walkthrough of using dbt with DuckDB for local development, covering the setup, tradeoffs, and workflow differences compared to cloud warehouses.
Friend bubbles in Facebook Reels highlight Reels your friends have liked or reacted to, helping you discover new content and making it easier to connect over shared interests. This article explains the technical architecture behind friend bubbles, including how machine learning estimates relationshi
This article explores using Apache Beam to ingest metrics into ClickHouse; this provides insights into how to leverage a data processing framework for efficient metric storage and analysis in a columnar database.
WarpStream eliminates disks, brokers, and inter-AZ costs by using an S3-native architecture. This article makes the case for this new approach to streaming.
Ingelt is a Rust/Axum gateway that parses 33 legacy industrial protocols inside 64MB WebAssembly sandboxes. The WASM isolation approach prevents malformed input from crashing the host process.
This post details how ClickStack integrates with ClickHouse to optimize queries for observability workloads. It covers techniques like progressive time window pagination, chunked charts, and automated use of materialized views, offering insights into performance tuning.
This Postgres Weekly article discusses how a badly written query triggered the OOM (out-of-memory) killer, even with ample RAM. The culprit was `work_mem` exceeding expectations; this is a cautionary tale regarding resource allocation and query optimization in Postgres.
Redpanda is open-sourcing their AI SDK for Go, designed for observable, resilient, and production-grade AI tooling.
Nemotron 3 Nano 4B is presented as a compact LLM suitable for local AI, offering an efficient option for running inference on resource-constrained devices. Staff+ ML engineers working on edge deployment or low-latency applications should investigate this model's architecture and performance characte
This article analyzes the characteristics of well-written corporate engineering blogs. It provides valuable insights for creating engaging and informative technical content within organizations.
ProtoScience is a deterministic pipeline designed to autonomously discover governing equations from raw numerical data. The system uses sparse regression and statistical validation without relying on LLMs.
The Get Shit Done framework uses meta-prompting and spec-driven development to structure LLM-powered system builds. It emphasizes generating detailed specifications before code, reducing iteration cycles.
This dbt blog post discusses how a semantic layer provides a consistent and governed foundation for AI systems used in analytics.
Meta’s Ranking Engineer Agent (REA) autonomously executes key steps across the end-to-end machine learning (ML) lifecycle for ads ranking models. This post covers REA’s ML experimentation capabilities: autonomously generating hypotheses, launching training jobs, debugging failures, and iterating on
The post discusses a shift towards autonomous, continuously evolving AI agents and integrations with tools like NVIDIA NemoClaw.
The author implemented a single binary solution to replace both Postgres and ClickHouse for web analytics.
This article explores context engineering, a topic critical for building AI-ready data systems. The post discusses designing data systems for AI consumption, machine-readable metadata, and contextual memory, providing insights into creating effective data pipelines for AI applications.
An introduction to data-centric query compilation, covering how modern engines like HyPer and Umbra generate machine code from query plans by pushing data through tight loops rather than pulling through iterator trees.
This article highlights the importance of extended statistics in Postgres for query optimization. It likely covers how to create and use extended statistics to improve query performance, especially for complex queries or datasets with skewed data distributions.
Reddit's engineering team migrated their petabyte-scale Kafka deployment from EC2 to Kubernetes. The article details the challenges they faced and the solutions implemented for a successful migration.
This Dagster blog post details CI/CD workflows using branch deployments, automatic retries, and backfill strategies; it also covers data quality via asset checks and monitoring with Dagster Insights, offering actionable advice for managing production data pipelines.
This article explores using DuckDB transpilation to reduce warehouse costs. It could involve techniques for rewriting SQL queries to leverage DuckDB's efficient execution or using DuckDB as a local processing layer before data warehousing, offering a practical method for cost optimization.
Covers subagent patterns for building composable AI agents that delegate tasks to specialized sub-agents, with practical implementation details.
Socialpruf replaced Neon with Postgres managed by ClickHouse, resulting in up to 5x faster query performance. The migration eliminated network transfer costs and improved the speed of real-time social analytics dashboards.
This article explores a neuro-symbolic system where a neural network learns fraud rules automatically, extracting IF-THEN rules during training; the experiment uses a hybrid neural network with a differentiable rule-learning module on the Kaggle dataset.
This article discusses building a product analytics warehouse directly on Postgres. The article likely details schema design choices, performance optimization strategies (indexing, partitioning), and extension usage (like pgvector) relevant for those using Postgres beyond traditional transactional w
The article compares Exasol, ClickHouse, StarRocks, Trino, and DuckDB across concurrency, data volume, and node scaling. The comparison may highlight architectural differences, performance trade-offs, and suitability for different analytical workloads across these popular SQL engines.
Flock v0.7.0 is a DuckDB extension that allows users to run RAG, Claude, and LLM metrics directly in SQL. The extension aims to eliminate the need to move data into Python scripts for semantic tasks.
Avalon: Synthetic clinical data pipeline. Generate realistic FHIR R4 patient data, normalize it through Forge, and query it as OMOP CDM 5.4 views. What is Avalon? Avalon is an end-to-end pipeline that turns Synthea-generated FHIR bundles into clean, documented, queryable tables in BigQuery, then la
Article URL: https://mistral.ai/news/leanstral Comments URL: https://news.ycombinator.com/item?id=47404796 Points: 286 # Comments: 53
Article URL: https://gluey.sh/ Comments URL: https://news.ycombinator.com/item?id=47404438 Points: 1 # Comments: 0
Serverless JARs and Databricks Connect for Scala. Serverless JARs enable teams to build..
Stripe uses internal coding agents called 'Minions' to generate over 1,300 automated pull requests per week. The article likely describes the architecture and implementation of these agents.
With the launch of real-time mode (RTM) in Apache Spark 4.1, Structured Streaming..
ClickHouse-connect v0.12.0 introduces a new async-native Python client built using the half-sync/half-async pattern. Benchmarks show a 1.16x improvement in throughput and more stable tail latency under high concurrency.
We generated ~1,100 synthetic patients with Synthea, processed the FHIR R4 output through our normalization engine (Forge), and published it as a free public dataset on BigQuery Analytics Hub. 8 resource types: Patient, Encounter, Observation, Condition, Procedure, Immunization, MedicationRequest, D
We just open-sourced the core data sync engine behind Yeahchain. The problem we solved: traditional databases were hitting performance bottlenecks during high-frequency sync operations. For Yeahchain, we moved to a custom, lock-free architecture that maps shared memory regions directly to our proces
Article URL: https://simonwillison.net/guides/agentic-engineering-patterns/what-is-agentic-engineering/ Comments URL: https://news.ycombinator.com/item?id=47393908 Points: 127 # Comments: 76
https://github.com/seanwevans/lockstep I want to share my work-in-progress systems language with a v0.1.0 release of Lockstep. It is a data-oriented systems programming language designed for high-throughput, deterministic compute pipelines. I built Lockstep to bridge the gap between the productivity
This article reports on performance improvements in Redpanda using NVIDIA Vera, showing latency reductions and throughput gains compared to CPU models.
I spent a few years working at a company where all our microservices backed into MongoDB instances. We were constantly under top-down pressure to deliver fast, and because MongoDB is schemaless, it felt very easy to just add fields to our documents whenever we needed to expose data to another servic
Article URL: https://zzk273.github.io/LATENT/ Comments URL: https://news.ycombinator.com/item?id=47388273 Points: 143 # Comments: 30
Even seemingly simple engineering tasks — like updating an API — can become monumental undertakings when you’re dealing with millions of lines of code and thousands of engineers, especially if the changes are security-related. Meta uses AI codemods to automate security-related changes in their Andro
This ClickHouse blog post explains how to effectively query datetime columns, including examples for hourly bucketing and rush hour analysis using real taxi data.
This Dagster blog post introduces a course on AI-Driven Data Engineering, focusing on building production ELT pipelines with AI coding agents; staff+ engineers should be aware of emerging trends in AI-assisted data engineering, and this course might provide practical workflows for integrating AI int
Overview: Since we released the Databricks AI Security Framework (DASF) in 2024, the..
NBIM cut runtimes 30-40% in 3 months with the dbt Fusion engine and State-Aware Orchestration, without heavy optimization.
OpenAI details how ChatGPT is designed to resist prompt injection and social engineering by constraining risky actions and protecting sensitive data within agent workflows, offering insights into security measures.
OpenAI explains how they built an agent runtime using the Responses API, shell tool, and hosted containers to enable secure and scalable agents with access to files, tools, and state.
This post discusses the practical limits of DuckDB when running on commodity hardware. Understanding these limitations is crucial for optimizing performance and resource allocation in real-world deployments.
This post discusses a proposed patch set for Postgres 19 called pg_plan_advice, which would allow admins to control query planner behavior and improve plan stability.
Apple released the MacBook Neo today and there is no shortage of tech reviews explaining whether it's the right device for you if you are a student, a photographer or a writer. What they don't tell you is whether it fits into our Big Data on Your Laptop ethos. We wanted to answer this using a data-d
Learn how to build model-agnostic infrastructure for AI agents that works across multiple LLMs, enabling flexibility, scalability, and future-proof AI systems.
ETL, ELT, batch, CDC, reverse ETL: learn the key data movement patterns and when to use each.
Airbnb rolled out 20+ local payment methods in 360 days, and this article discusses the underlying technical architecture and engineering decisions.
This post demonstrates a lightweight metrics layer for ClickHouse using MooseStack, an open-source developer agent harness. It explains how to define metrics once and use them across different applications.
As an AI Architect, I spend my days designing AI systems and agents for others. I optimize workflows, fine-tune context windows, and architect serverless solutions to solve complex business problems..
This article shares the technical details behind how Advanced Browsing Protection (ABP) in Messenger protects the privacy of the links clicked on within chats while still warning people about malicious links. It illuminates some of the engineering challenges and infrastructure required to implement
Replo uses ClickHouse to analyze over 100 billion events in real time. This enables fast analytics dashboards for over 4,000 Shopify merchants.
Netflix discusses how they are scaling global storytelling. The article dives into how they modernized localization analytics.
The article highlights the inefficiency of agentic systems that start from zero every run and proposes using cognitive memory to improve performance.
The article highlights an engineer at Cloudflare who rewrote most of Next.js in one week using AI agents; this example suggests a future where AI can rapidly disrupt existing software moats and business models, raising important questions about the evolving role of software engineers.
Explore Apache Iceberg v3 support in Snowflake public preview, including row lineage for CDC, variant data, and enhanced interoperability across open table formats.
The article introduces Iceberg output for Redpanda Connect, enabling users to land data directly into Apache Iceberg tables; it highlights advantages such as automated schema evolution and scalable routing, making it useful for those integrating streaming data with data lakes.
Discover how AI accelerates data lineage with automated docs, testing, and scalable governance.
Explore the decline of traditional RAG in the era of agentic AI, and how autonomous agents are reshaping retrieval, reasoning, and knowledge workflows.
We are entering the golden age of AI Coding. Every day, I see colleagues, both technical and non-technical, marveling at how agents are rewriting the rules of software construction. The..
Netflix discusses optimizing recommendation systems. They use JDK’s Vector API to achieve this.
AI is reshaping data pipelines, driving real-time data, automation, and AI-assisted modeling at scale.
FFmpeg is a multi-tool for media processing, supporting a wide variety of audio and video codecs and container formats. It can also orchestrate complex chains of filters for media editing and manipulation. For the people who use our apps, FFmpeg plays an important role in ensuring that our videos lo
Meta recognizes the long-term benefits of jemalloc, a high-performance memory allocator, in its software infrastructure. Meta is renewing focus on jemalloc, aiming to reduce maintenance needs and modernize the codebase while continuing to evolve the allocator to adapt to the latest hardware and work
Stripe built a benchmark to evaluate whether AI agents can create real Stripe integrations. It tests how well agents can autonomously manage full software engineering projects, as opposed to solving narrowly scoped coding problems.
Unify metrics across BI and AI with Omni + the dbt Semantic Layer. Define once, query everywhere, eliminate semantic drift.
This Apache Doris blog post describes how Xiaomi built a unified data platform using Apache Doris, running 40+ clusters and serving 50 million queries per day across petabytes of data.
This article discusses the unique challenges of monitoring LLM agents due to their non-deterministic nature and infinite input possibilities; it proposes focusing on conversation quality and using production traces for continuous improvement, highlighting the shift from traditional software monitori
This article covers agent development with CockroachDB using the LangChain framework, highlighting the integration's support for building production-ready agentic AI applications.
Zscaler built an AI-powered, multi-agent PR review system that uses dbt’s structured context.
A surprising edge case involving row locks with joins in Postgres: non-null foreign keys and valid constraints do not guarantee an inner join will return a row under concurrent modifications. The post traces the exact sequence of operations that triggers the bug.
Discover why AI agents need ontology to structure knowledge, improve reasoning, enable semantic understanding, and make better autonomous decisions.
Netflix discusses DataJunction. The article explains how they use it as their answer to the missing piece of the modern data stack.
Meta is open-sourcing the initial version of RCCLX – an enhanced version of RCCL that we developed and tested on Meta’s internal workloads. RCCLX is fully integrated with Torchcomms and aims to empower researchers and developers to accelerate innovation, regardless of their chosen backend. Communica
This article delves into the architecture and implementation of StarRocks' high-performance vectorized engine, explaining how it enhances query processing speed and efficiency.
A Kubernetes beginner roadmap that goes through all k8s concepts with links to external documentation and exercises. TL;DR For the past few years, I’ve worked in startup environments where learning..
Learn how to clone massive tables in ClickHouse instantly, without copying a single byte. Discover how immutable data parts and part-level copy-on-write make safe experimentation and migrations effortless.
Netflix introduces MediaFM. The article explains how they use it as a multimodal AI foundation for media understanding.
The post delves into the design and implementation of Agent Builder's memory system, discussing the prioritization of memory, technical architecture, and future enhancements; it offers valuable insight into building persistent memory systems for AI agents and their impact on performance.
The author investigates a map-reduce solution for querying 3 billion vectors, inspired by a discussion with Jeff Dean. The article delves into the implementation details of this solution, exploring the challenges and potential optimizations.
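A minimal sketch of the general map-reduce pattern for top-k vector search, not the author's implementation. It assumes vectors are split into in-memory shards and scored by dot product; the names `map_shard`, `reduce_topk`, and `search` are hypothetical.

```python
import heapq
from typing import Iterable, Sequence

Scored = tuple[float, str]  # (similarity score, vector id)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def map_shard(shard: dict[str, Sequence[float]],
              query: Sequence[float], k: int) -> list[Scored]:
    """Map step: score every vector in one shard, keep only its local top-k."""
    scored = ((dot(vec, query), vec_id) for vec_id, vec in shard.items())
    return heapq.nlargest(k, scored)


def reduce_topk(partials: Iterable[list[Scored]], k: int) -> list[Scored]:
    """Reduce step: merge the per-shard candidates into a global top-k."""
    return heapq.nlargest(k, (item for part in partials for item in part))


def search(shards: Iterable[dict[str, Sequence[float]]],
           query: Sequence[float], k: int) -> list[Scored]:
    """Run the map step over each shard (sequentially here), then reduce."""
    return reduce_topk((map_shard(s, query, k) for s in shards), k)


if __name__ == "__main__":
    # Tiny hypothetical dataset split across two shards.
    shards = [
        {"a": (1.0, 0.0), "b": (0.0, 1.0)},
        {"c": (0.9, 0.1), "d": (-1.0, 0.0)},
    ]
    for score, vec_id in search(shards, query=(1.0, 0.0), k=2):
        print(vec_id, score)
```

Because each shard returns at most k candidates, the reduce step handles only k times the number of shards, regardless of how many vectors each shard holds. In a real deployment the map step would run in parallel across workers.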
InMobi adopted ClickHouse and achieved 20x faster queries and 80% cost savings, building a fast, reliable, and cost-effective platform.
The Apache Iceberg community has finalized the File Format API, a significant architectural enhancement that enables pluggable, consistent, and engine-agnostic file formats within the Iceberg Java codebase.
This Spotify Engineering blog post discusses their multi-agent architecture for smarter advertising. The article likely details the challenges, solutions, and benefits of using a multi-agent approach to improve advertising effectiveness.
Netflix discusses scaling LLM post-training. The article explains their approach and challenges.
Supabase provides a detailed account of the February 12 outage in us-east-2, explaining the root cause and the steps taken to prevent it from happening again. The article provides insight into the incident and the measures implemented to improve system reliability.
With intelligent orchestration and optimization, we achieved a 64% reduction in compute costs and simplified our job architecture.
This Netflix Tech Blog post discusses automating the migration of RDS Postgres to Aurora Postgres. The article likely details the challenges, solutions, and lessons learned during this process, offering insights for others undertaking similar migrations.
This Apache Doris blog post covers how Apache Doris and Apache Paimon can be used to build a unified lakehouse for Web3 on-chain analytics, claiming 5x faster ETL than Spark and 2x faster data lake queries than Trino.
A practical guide to diagnosing GPU memory issues instead of randomly changing hyperparameters until something works. Last week, I was building a reinforcement learning model for a customer using GRPO..
Discover practical lessons from building OAuth flows with MCP apps, including OAuth 2.0 patterns, security issues, and implementation tips.
This Netflix Tech Blog post covers high-throughput graph abstraction. The article likely describes the architecture, implementation, and performance considerations of their graph abstraction system, offering practical insights for building similar systems.
This article shares details of the role backend aggregation (BAG) plays in building Meta’s gigawatt-scale AI clusters like Prometheus. BAG allows Meta to seamlessly connect thousands of GPUs across multiple data centers and regions. Their BAG implementation is connecting two different network fabric
This Netflix Tech Blog post details how Netflix validates catalog metadata using a 'Data Canary' system. The article likely explains the architecture, implementation, and benefits of this system for ensuring data quality and reliability.
This Apache Doris blog post explains how to build a real-time Web3 analytics platform using Apache Flink and Apache Doris, enabling sub-second queries on billions of blockchain transactions.
This article discusses how context graphs can improve AI agent performance, emphasizing the shift from simple state management to incorporating semantic understanding of the data.
This Netflix Tech Blog post discusses 'Data Bridge', a system Netflix uses to simplify data movement. The article likely explains the architecture, implementation, and benefits of this system for improving data pipeline efficiency and reducing complexity.
This post explores how an individual replaced a paid SaaS subscription with LLM-generated code in just 20 minutes; this highlights the potential for LLMs to disrupt simple SaaS business models, especially for products that are not actively maintained.
This Netflix Tech Blog post covers the AI evolution of graph search at Netflix. The article likely describes how they're using AI to improve graph search capabilities, offering insights into building intelligent search systems.
The article shares lessons learned from observing billions of agentic workflows, focusing on the challenges of moving from a working demo to a production system.
I think it is worth starting this intro by talking a little bit about the established format for columnar data. Parquet has done some amazing things for analytics. If you go back to the times where CSV was the better alternative, then you know how important Parquet is. However, even if the specific
The article argues that human-in-the-loop is a missing layer in agentic systems.
This StarRocks blog post dives into the details of join optimization within the StarRocks database, explaining why joins can perform faster than expected. The author is a StarRocks committer and engineer at Celerdata.
Building Semantic Tool Selection with Multi-Component Embeddings. Last quarter, our enterprise AI platform hit a wall. We had built an impressive suite of 70+ automated tools covering everything from database..
Apache Doris now supports native hybrid search for AI workloads. The new functionality allows vector search, full-text search, and structured analytics within a single SQL engine, enabling AI-powered applications to leverage a unified data platform.
Deep technical guide to dbt-expectations covering regex validation, freshness/SLA checks, completeness validation within time windows, JSON schema validation, statistical distribution checks, and cross-column logic. Shows integration with production monitoring.
Examines how the EU AI Act, Cyber Resilience Act, and Data Act turn messy data from a performance tax into a legal liability. Covers the August 2026 deadline for High-Risk AI system compliance and argues governance must shift from reactive cleanup to embedded-by-design architecture.
The velocity of Generative AI has been nothing short of relentless. In the span of just 24 months, the industry has shifted paradigms three times. We started with the raw..
This article presents a critique of the Iceberg REST catalog, arguing that a semantically correct API can become operationally unreliable at scale.
This Spotify Engineering blog post explains the technical and practical rationale for using separate tech stacks for personalization and experimentation. The article likely details the benefits of this separation, such as improved agility and scalability.
This post outlines building a real-time lakehouse architecture using Redpanda's Iceberg Topics and Databricks Unity Catalog for analytics-ready tables, eliminating the need for batch processing and orchestration.
Weaviate 1.35 introduces Object Time-to-Live (TTL), zstd compression support, flat index RQ quantization, multimodal support with Weaviate Embeddings, and runtime configurable OIDC certificates.
Apache Doris introduces the VARIANT data type for high-performance JSON analytics. Optimizations such as dynamic subcolumns, sparse columns, schema templates, lazy materialization, and path-based indexing allow it to outperform PostgreSQL and MongoDB in JSON handling.
This StarRocks blog post provides 10 tips for optimizing OLAP query performance. The author is a StarRocks TSC member and query engine team lead at CelerData.
Learn how to build secure, observable agentic AI on Redpanda using Connect for AI tools and data access, plus broker audit logs to capture every agent action.
ByteDance uses Apache Doris 4.0 to solve billion-scale vector search problems. The company leverages Doris's hybrid search capabilities to build a system that balances accuracy, low latency, and cost-efficiency when handling over 1 billion vectors.
In this post, we describe the current patterns for interacting with Iceberg Catalogs, and pose the question: could it be done from a browser? After elaborating on the DuckDB ecosystem changes required to unlock this capability, we demonstrate our approach to interacting with an Iceberg REST Catalog.
The article identifies architecture as the key challenge in building production multi-agent systems, based on insights from 1.7 billion workflows.
Browser-native SQL workbench built on DuckDB WASM. Query CSV, Parquet, Excel, JSON files locally with zero install. Supports AI SQL assistant, BigQuery connector, and shareable URLs. Your data never leaves your machine.
This article proposes a model extending generic durable functions into three forms: stateless functions, stateful function objects, and linear function chains. It aims to standardize terminology in durable execution engines by linking concepts like 'workflows' and 'activities' to underlying executio
This article discusses context engineering, focusing on how AI agents manage LLM memory by selecting, retrieving, and organizing context from short-term and long-term memory. Context engineering is important for improving the reliability of AI agents in production.
This post delves into the architecture of durable function trees, exploring their integration within larger systems and the advantages they offer for durable execution.
This article explores constructing workflows using durable function calls arranged in trees, built on durable promises and continuations.
Over the past several months, the DuckDB Labs team has been hard at work on the DuckDB-Iceberg extension, with full read support and initial write support released in v1.4.0. Today, we are happy to announce delete and update support for Iceberg v2 tables is available in v1.4.2! The Iceberg open tabl
This article explains the concept of determinism within durable execution frameworks, focusing on identifying code sections that must be deterministic.
This article explains how to leverage the Dify and Weaviate integration for building Retrieval Augmented Generation (RAG) applications. This integration can be valuable for enhancing LLM applications with external knowledge.
Technical overview of Apache Polaris as the emerging open catalog standard for Iceberg. Covers multi-engine interoperability (Spark, Flink, Trino, StarRocks), built-in RBAC with table-level security, short-lived credential vending via cloud provider integrations, and Snowflake's managed Polaris offe
This article introduces Qbeast, an OTree spatial index designed for data lakehouses, and discusses its integration with Apache Iceberg and Delta Lake.
Weaviate 1.34 introduces flat index support with RQ quantization, server-side batching improvements, new client libraries, and Contextual AI integration. These features offer potential performance and functionality improvements for the vector database.
Apache Doris achieves top performance in the JSONBench benchmark, particularly in cold query performance and data quality. The benchmark measures query performance and data handling capabilities when processing JSON data.
This article discusses how stream and batch analytics can be built on Apache Iceberg, highlighting potential conflicts due to their differing requirements.
Apache Doris achieves 70% better price-performance on ARM-based AWS Graviton instances compared to x86. The benchmark results, gathered from standard OLAP tests such as ClickBench, SSB, TPC-H, and TPC-DS, demonstrate the efficiency of Doris on ARM architecture.
This article highlights the emerging trend of developers utilizing multiple AI agents in parallel to generate code. It explores the potential benefits and challenges of this approach to programming.
This article examines the Kafka community's efforts to lower replication costs by discussing KIP-1150, KIP-1176, and KIP-1183.
Comprehensive landscape of open-source data quality tools including Soda Core, Elementary Data, dbt Tests, and DataKitchen TestGen. Explores how the community is democratizing observability capabilities previously locked behind expensive platforms, and how AI is being used to automate test generatio
This article examines query performance optimization techniques in open table formats such as Apache Iceberg, highlighting methods beyond standard indexing.
Apache Doris is shown to be significantly faster than ClickHouse in real-time updates, according to benchmark results. Using ClickBench and SSB (Star Schema Benchmark), Apache Doris outperforms ClickHouse by 18-34x in SSB and 2.5-4.6x in ClickBench.
Argues that most organizations place data contracts in the wrong part of the lifecycle, causing enforcement gaps. Makes the case for contracts closer to the producer, not the consumer, with practical guidance on where they should sit architecturally.
Learn how Search Mode compares against Hybrid Search on the BEIR, LoTTe, BRIGHT, EnronQA, and WixQA Information Retrieval benchmarks.
This article describes training an LLM that can converse in English and item IDs, making recommendations without retrieval or tools.
Apache Doris utilizes various data pruning techniques to optimize query performance by skipping unnecessary data processing. This article dives into the implementation and strategies behind these data pruning techniques within the Doris architecture.
Apache Doris demonstrates superior performance over ClickHouse in various benchmarks including CoffeeBench, TPC-H, and TPC-DS. The benchmarks show that Doris consistently outperforms ClickHouse, showcasing its efficiency and speed in OLAP workloads.
Production-oriented guide showing how to capture column-level lineage in Microsoft Fabric Spark (which ships with OpenLineage pre-installed). Describes a Spark Plugin architecture where a REST API collects lineage events from an OpenLineage Listener, buffering them into Delta Tables for queryable li
Learn how chunking strategies improve LLM RAG pipelines, retrieval quality, and agent memory performance across production AI systems.
This post delves into the internal workings of Apache Fluss, offering a detailed exploration for those interested in data system internals.
Get spun around by our new vector quantization algorithm that utilizes the power of random rotations to improve the speed-quality tradeoff of vector search with Weaviate.
This article introduces a conceptual model for storage unification, designed to present diverse storage systems and formats as a unified resource.
Compares next-generation Iceberg catalogs: Nessie (Git-style branching for data), Apache Polaris, Apache Gravitino, Lakekeeper, and Unity Catalog. Explains how these move beyond simple table-name resolution to provide version control, federated views, fine-grained policies, and multi-engine freedom.
Side-by-side technical comparison of Great Expectations, Soda Core, dbt tests, and Deequ across expressiveness, scalability, integration patterns, and ease of adoption. Provides a decision framework for which tool fits which use case, and discusses layering multiple tools across pipeline stages.
Modal claims that open models can transcribe speech 100 times faster and 100 times cheaper than previous methods.
Builds a complete order processing pipeline with Debezium CDC, Apache Flink transformations, and OpenLineage/Marquez for lineage tracking. Demonstrates how lineage metadata enables root cause analysis when pipeline failures occur, showing practical troubleshooting patterns with end-to-end visibility
Features Russell Spitzer (Apache Iceberg/Polaris PMC) discussing the distinction between business catalogs (discovery/listing) and system catalogs (governing access by understanding table layout). Covers how Polaris vends short-lived credentials scoped to exact table directories.
Constella built a cross-platform thinking tool using Weaviate, RAG, and a multi-tenant architecture. The post details how they implemented vector search and syncing across devices.
This article covers evaluation metrics, how to build eval datasets, evaluation methodology, and a review of several benchmarks for long-context question and answer systems.
Production case study from Whatnot (live shopping marketplace) on combining data contracts with Monte Carlo observability. Their stack uses Snowflake, dbt, and Dagster. Shows how enforcing contracts while layering automated observability kept data incidents flat despite exponential data growth.
This article delves into the difficulties of operating Apache Flink in production environments. It explores the reasons why Flink is considered challenging and provides insights into how to address these operational complexities.
Technical walkthrough of Debezium's built-in OpenLineage integration for automatic CDC lineage tracking. Explains how Debezium Server emits OpenLineage events natively using the Java SDK, modeling run/job/dataset entities without manual instrumentation, with Marquez as a lineage backend.
Covers Unity Catalog announcements: Iceberg catalog federation for governing tables in AWS Glue/Hive/Snowflake without copying data, Unity Catalog Metrics as first-class governed assets, column-level permissions for PII, and the new Discover experience for certified data products with AI-driven reco
Weaviate version 1.31 introduces the MUVERA encoding algorithm for multi-vector embeddings. The post explains the algorithm's details, including its functionality and use cases.
This article discusses how Recsys and search are converging with LLMs via semantic IDs, data augmentation, and unified foundation models.
DuckLake stores data lake metadata in a SQL database instead of files. A 22-table schema replaces manifest files, enabling instant snapshot queries and ACID transactions without file listing overhead.
This article details automating agentic workflows using Amazon Q CLI, Anthropic MCP, and tmux to build news agents for daily news recaps.
Explains the columnLineage dataset facet introduced in OpenLineage 0.9.0 for Spark integration. Covers how column-level lineage tracks which input fields produce each output field, its applications for GDPR/HIPAA/CCPA compliance, and the roadmap for extending support beyond Spark.
This article provides guidance on testing custom Flink jobs on Decodable, focusing on modular implementations to improve testability when dealing with external service dependencies. It addresses a common challenge in Flink development and offers practical solutions.
This article demonstrates integrating Neo4j with Qdrant to enhance RAG pipelines by enabling external vector searches; it guides users through a local setup with preloaded data, illustrating the practical aspects of this integration.
The article explains how to build knowledge graph agents using LlamaIndex workflows, offering a blueprint for constructing Text2Cypher agentic interfaces; this integration provides practical insights into developing agentic data pipelines.
This article identifies common pitfalls encountered when building generative AI applications and provides examples.
This Neo4j Developer Blog post explains how to use Anthropic's Model Context Protocol (MCP) to give LLMs like Claude access to knowledge graphs in Neo4j.
Anthropic's guide to building reliable AI agents: tool use patterns, prompt chaining, evaluation frameworks, error recovery, and when NOT to use agents.
Integrate Neo4j knowledge graphs with LangChain for powerful GraphRAG applications that deliver deeper, more insightful answers.
Explore how GraphRAG can be used to streamline the process of ingesting commercial contract data and building a Q&A Agent.
MCP standardizes how AI models connect to data sources and tools. Client-server architecture with typed resources, tool definitions, and prompts that any LLM application can implement.
Dive into the impact of fine-tuning models for the Text2Cypher task of transforming natural language questions to Cypher queries.
Learn how to build a support agent that relies on information from Stack Overflow using the GenAI Stack – Neo4j, LangChain & Ollama in Docker.
Explore performance benchmarks of LLMs on the Neo4j Text2Cypher (2024) Dataset, comparing foundational vs. fine-tuned models for Cypher query translation.
The Text2CypherRetriever allows users to retrieve data from Neo4j using natural language, simplifying query generation for GenAI applications.
DuckDB ships a built-in web UI for interactive SQL exploration, schema browsing, and result visualization -- no install needed beyond the CLI.
The post describes how to monitor Neo4j in a clustered environment using tools like Dynatrace and Kibana, along with best practices.
This technical blog post explores the importance of Change Data Capture (CDC) for developers. It covers the fundamentals of CDC, its common use cases, and the advantages of log-based CDC compared to other approaches. Understand how CDC can improve operational performance, enable real-time analytics,
The new GraphAcademy course teaches how to convert unstructured data into graphs using GenAI, LLMs, and Python.
Neo4j’s graph database enables real-time analysis to uncover hidden fraud rings and protect financial assets, aiding in bank fraud detection.
The post details how to turn CSV files into graph models using LLMs, simplifying data relationships and enhancing insights.
Learn to build accurate, explainable recommendation systems with minimal code using the Neo4j graph database and the Keymaker framework.
Learn how to build a GraphRAG agent using Neo4j and Milvus, combining graph and vector search for enhanced retrieval, better context, and accurate answers.
Learn how knowledge graphs help overcome the limitations of text embeddings for structured data operations in RAG applications.
Enhance GraphRAG applications by combining hybrid search and graph traversal with Neo4j’s HybridCypherRetriever, improving retrieval for complex queries.
Learn how Debezium, the de-facto standard for open-source change data capture (CDC), has evolved to support deployments without the need for Kafka-related infrastructure.
How enterprises combine knowledge graphs with LLMs: grounding responses in structured facts, reducing hallucinations, enabling explainable AI, and the architectural patterns for graph-augmented generation.
Explore advanced GraphRAG retrieval patterns and how graph structures enhance RAG systems. Learn actionable strategies to implement and optimize GraphRAG.
Prefect 3.0 drops DAGs entirely: Python-native flows with dynamic task creation, automatic retries, event-driven triggers, and a hosted platform that eliminates scheduler management.
Recommend movies to users based on their reading histories and ratings. Learn the setup of Neo4j, mapping data into Java with Neo4j Object Graph Mapper (Neo4j-OGM), and crafting Cypher queries for recommendations.
LLMs generating SQL without a semantic layer produce inconsistent, wrong metrics. How the dbt Semantic Layer provides guardrails: metric definitions, entity relationships, and governed access for AI agents.
Architecture-level comparison: Polars' Rust-based columnar engine with lazy evaluation, query optimization, and Apache Arrow memory vs Pandas' eager, NumPy-backed execution. Benchmarks on real workloads.
A comprehensive guide on troubleshooting and configuring Flink SQL to write to Delta Lake on S3 or MinIO.
Inside Snowflake's Cascades-style query optimizer: join reordering, pruning with micro-partition statistics, adaptive execution, and how they test optimizer correctness at scale.
Ontologies provide the structured backbone that LLMs lack: taxonomies, controlled vocabularies, entity disambiguation, and how combining ontological reasoning with neural approaches produces more reliable AI systems.
DataFusion as a modular query engine: how it powers InfluxDB 3.0, Comet Spark accelerator, and Ballista distributed queries. Extensible optimizer, custom table providers, and user-defined functions in Rust.
This article explores declarative resource management in Decodable, highlighting its benefits for SDLC best practices, resource management, environment migration, and resource cleanup.
The asset-centric paradigm shift: why defining what data should exist (Dagster assets) is better than defining how to compute it (Airflow tasks). Software-defined assets, IO managers, and testability.
Analysis of how Iceberg's catalog-agnostic design, hidden partitioning, and multi-engine support gave it an architectural advantage over Delta Lake and Hudi.
Practical data governance: automated PII detection, column-level lineage, data contracts between teams, freshness SLAs, and how to implement governance incrementally without blocking teams.
Building real-time ML feature pipelines with Flink: window aggregations, CDC ingestion, point-in-time joins, and integration with Feast and Tecton feature stores.
How DuckDB's community extension system works: writing C++ extensions, the extension repository, signed distribution, and examples of spatial, httpfs, and Iceberg extensions.
Microsoft's GraphRAG approach: automatically building knowledge graphs from document corpora, community detection for topic summarization, and how graph-based retrieval answers global questions that vector search cannot.
This post describes a method for building an LLM router that dynamically selects the optimal LLM for a given request based on configurable criteria. It covers techniques for evaluating LLM performance, implementing routing logic, and optimizing for cost-effectiveness.
Using LLMs to extract entities and relationships from documents, resolve coreferences, and populate a Neo4j knowledge graph. Includes schema design, prompt engineering for extraction, and evaluation metrics.
Snowflake Dynamic Tables: define a pipeline as a SQL query and let Snowflake handle scheduling, incremental refresh, and dependency management. Replaces streams + tasks for most use cases.
Stripe's ledger system for financial data: immutable event log, double-entry accounting in the data warehouse, reconciliation pipelines, and how they ensure every cent is accounted for.
DuckDB's relational API as a replacement for Pandas in ETL pipelines, with benchmarks showing 10-100x performance improvements on larger-than-memory datasets.
Databricks open-sources Unity Catalog, providing a universal governance layer across Delta, Iceberg, and Hudi tables with fine-grained access control and lineage tracking.
Honest comparison of ClickHouse and Snowflake architectures for real-time analytics workloads, covering query latency, ingestion throughput, cost models, and operational complexity.
Hard-won lessons from practitioners: prompt engineering diminishing returns, when to fine-tune vs RAG, evaluation beyond vibes, cost optimization, and the reliability gap between demo and production.
Airbnb's Midas data quality framework: automated anomaly detection, lineage-based impact analysis, SLA tracking, and self-healing pipelines at petabyte scale.
How LinkedIn processes 7 trillion events per day: Kafka for event transport, Samza for stream processing, Venice for derived data serving, and Brooklin for cross-DC replication.
Asset-centric vs task-centric orchestration: how Dagster's software-defined assets, type system, and built-in IO managers compare to Airflow's DAG paradigm.
When Postgres analytics hits a wall: column compression, vectorized execution, and approximate query processing in ClickHouse vs row-oriented scans in Postgres. Migration patterns and hybrid architectures.
Why dimensional modeling still matters even though the ELT era made star schemas seem obsolete. The semantic layer as the modern replacement for physical dimension tables.
Why feature stores exist: the training-serving skew problem, online vs offline stores, feature computation patterns, and how Feast, Tecton, and Hopsworks compare architecturally.
Netflix's migration from Hive to Iceberg at exabyte scale, including incremental processing patterns with Maestro orchestrator and Spark.
Why vector similarity alone fails for complex reasoning. Using Neo4j knowledge graphs alongside embeddings: entity extraction, relationship mapping, graph traversal for multi-hop queries, and hybrid retrieval.
Figma's horizontal sharding journey: from a single Postgres instance to 100+ shards using PgBouncer, application-level routing, and their custom migration tooling for zero-downtime resharding.
Why LLM-generated SQL fails in production: schema ambiguity, implicit business logic, multi-table joins, aggregate semantics, and why a semantic layer is the real solution instead of better prompting.
How data contracts formalize the interface between producers and consumers, with practical schema enforcement patterns using protobuf, JSON Schema, and dbt tests.
Advanced RAG patterns: multi-query retrieval, recursive summarization, parent-child chunk linking, self-RAG with reflection, and corrective RAG that verifies its own retrievals.
How Cube's semantic layer sits between databases and consumers: pre-aggregations, access control, caching, and serving consistent metrics to dashboards, notebooks, and LLMs via API.
End-to-end guide for building production RAG systems: chunking strategies, embedding model selection, retrieval metrics (MRR, NDCG), reranking, and hallucination detection.
Practical guide to running Kafka with KRaft consensus, covering migration from ZooKeeper, operational considerations, and performance characteristics.
How the dbt Semantic Layer works: MetricFlow engine, semantic models, dimension/measure definitions, and querying metrics from any BI tool via the JDBC/GraphQL API.
The original modern data stack (Fivetran + Snowflake + dbt + Looker) matured into a commodity. What comes next: embedded analytics, semantic layers, and AI-native data tools.
Production-grade vector search with pgvector: HNSW vs IVFFlat index tradeoffs, optimal dimensionality, bulk loading strategies, and benchmarks against dedicated vector databases.
Spark Connect introduces a thin client protocol that separates Spark applications from cluster infrastructure, enabling remote execution without shipping JARs or managing classpaths.
Production LLM serving: continuous batching vs static batching, PagedAttention (vLLM), speculative decoding, KV cache optimization, and how to maximize GPU utilization.
The semantic web stack (RDF, OWL, SPARQL) is quietly powering enterprise knowledge management. How knowledge graphs, linked data, and ontologies are being integrated with LLMs and modern data architectures.
Spotify's migration from proprietary data warehouse to an open lakehouse stack: partition pruning strategies, compaction scheduling, and how they cut compute costs by 40%.
Using DuckDB as a command-line JSON processor, replacing jq for complex data transformations with SQL syntax.
Production Airflow patterns: KubernetesExecutor vs CeleryExecutor, DAG serialization, connection management, XCom anti-patterns, and monitoring with StatsD and Prometheus.
Flink's roadmap toward true batch-stream unification, materialized tables, and the new ProcessFunction API for stateful event processing.
Deep dive into Postgres internals: shared_buffers vs OS cache, parallel query tuning, JIT compilation tradeoffs, connection pooling with PgBouncer, and VACUUM strategies for write-heavy workloads.
The case for using Postgres as your only database: JSONB for documents, pg_cron for scheduling, pgvector for embeddings, logical replication for CDC, and extensions for everything else.
What a semantic layer actually is beyond marketing: universal metric definitions, entity relationships, access policies, and why it matters more in the age of LLM-generated SQL.
Discord's migration from Cassandra to ScyllaDB for their message store: hot partition detection, consistent hashing, compaction tuning, and achieving P99 reads under 1ms at 2T messages.
Side-by-side comparison of the three major open table formats covering ACID semantics, schema evolution, time travel, compaction, and ecosystem support.
How dbt Mesh enables cross-project dependencies, model contracts, and versioning for organizations running hundreds of dbt projects across teams.
How ClickHouse's MergeTree engine works: LSM-tree-inspired sorted parts, sparse primary index, data skipping indexes, background merges, and why it achieves sub-second queries on billions of rows.
Practical fine-tuning guide using TRL, QLoRA, and Flash Attention 2. Covers dataset preparation, hyperparameter selection, evaluation, and deployment with real cost breakdowns.
Photon is a C++ vectorized execution engine that replaces Spark's JVM-based execution engine for scan-heavy workloads, achieving 3-8x speedups through SIMD, memory-mapped I/O, and adaptive execution.
How the analytics engineer role evolved from a dbt power user to a critical bridge between data engineering and business intelligence, with practical career guidance.
WarpStream's architecture: a Kafka-compatible broker that writes directly to S3 instead of local disks. No inter-broker replication, no partition reassignment, and 80% cheaper than self-hosted Kafka.
Uber's integrated caching layer that combines Docstore with a Redis cache, handling 40M reads/sec with sub-millisecond P99 latency.
Companion post to the O'Reilly book covering Iceberg's hidden partitioning, schema evolution, time travel, and compaction strategies for production lakehouses.
Practical guide to Data Vault 2.0: hub-link-satellite patterns, hash keys for parallelism, point-in-time tables, and when Data Vault makes sense vs One Big Table or dimensional modeling.
How Arrow's in-memory columnar format enables zero-copy data exchange between Spark, DuckDB, Pandas, Polars, and databases via ADBC and Flight SQL.
The mechanics behind Kafka's exactly-once guarantees: idempotent producers, transactional messaging, consumer group coordination, and the performance cost of EOS vs at-least-once.
Martin Kleppmann's reflections on how the landscape has changed since DDIA: new consensus protocols, CRDTs in production, the shift to event streaming, and what he'd write differently today.
DuckDB's multi-database support: attach Postgres, MySQL, and SQLite databases alongside local files, and query across them with standard SQL joins.
Reference architecture for LLM applications covering RAG pipelines, embedding models, vector databases, orchestration frameworks, and evaluation patterns.
Anyscale's blog post describes a technique to reduce the cost of embedding computations in retrieval-augmented generation (RAG) pipelines by 10x using Anyscale and Pinecone.
Quantitative analysis of GPU options for LLM inference and fine-tuning: comparing H100 vs A100 vs consumer GPUs, quantization tradeoffs (GPTQ, AWQ, GGUF), and cost-per-token calculations.
DuckDB's SQL dialect extensions that make queries more readable: GROUP BY ALL, SELECT * EXCLUDE, implicit column aliases, and string slicing.
Deep comparison of Kafka and Pulsar replication models: ISR vs BookKeeper quorum writes, tail latency characteristics, data loss scenarios, and how each handles broker failures.
Visual walkthrough of how Stable Diffusion works: the latent space, the denoising U-Net, CLIP text encoder, classifier-free guidance, and how LoRA fine-tuning adapts the model.
Internal architecture of Snowflake's multi-cluster shared data architecture, covering storage layer, virtual warehouses, metadata store, and query optimization.
The MoE architecture behind Mixtral and Switch Transformer: expert routing, load balancing, training instability, and why sparse models achieve better performance per FLOP than dense models.
The definitive visual explanation of the Transformer architecture: self-attention, multi-head attention, positional encoding, and how information flows through encoder-decoder layers.
Whatnot went from no modern data stack to processing tens of millions of events across hundreds of event types each day. Zack Klein explains how Whatnot leverages data contracts and data observability to achieve high quality data at scale for stakeholders, focusing on a small team's approach to data
Solving the viral 1 Billion Row Challenge using DuckDB SQL instead of Java -- demonstrating that a single SQL query on a laptop can process 1B rows in under 4 seconds.
Jay Kreps' foundational essay on the unified log abstraction, how it connects databases, distributed systems, and real-time data -- the intellectual basis for Kafka.
Vishnu Ram from Credit Karma discusses data reliability for LLMs and offers best practices related to data observability in AI.
Ed Presz from Pie Insurance explains how his team built an incident detection and notification workflow to drive ownership of data quality for business stakeholders.
The article emphasizes that for GenAI, data observability must prioritize resolution, pipeline efficiency, and streaming/vector infrastructures.
Reference, context, and preference-based metrics, self-consistency, and catching hallucinations.
Evals, RAG, fine-tuning, caching, guardrails, defensive UX, and collecting user feedback.
Discover PyTorch's journey in an episode with Soumith Chintala, its Co-Creator and Meta's VP/Fellow. Learn about TensorFlow's impact, community-guided innovation, and the open vs. closed-source debate.
In this article, we explore how to utilize OpenAI's ChatGPT and LangChain to build a Question-Answering bot for Weights & Biases' podcast series, Gradient Dissent.
This article focuses on data contracts for data warehouses, emphasizing programmatic accountability in batch data processing. It outlines the importance of defining and enforcing data contracts to improve data quality and reliability.
This Anyscale blog post dives into data ingest in a third-generation ML architecture, specifically using Ray Data. It provides code samples to illustrate how distributed libraries can improve performance by exploiting distributed memory bandwidth.
XGBoost-Ray is a new backend for distributed XGBoost training that supports multi-node and multi-GPU setups. It includes distributed data loading, fault tolerance with elastic training, and integrates with the Ray Tune hyperparameter optimization framework.