482 articles curated by AI from 15+ sources. Updated every 6 hours.

What is pgvector?

pgvector is an open-source PostgreSQL extension that adds the ability to store, index, and search over vector embeddings, enabling similarity search and other vector-based operations directly within Postgres.
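
The similarity math behind pgvector's distance operators can be illustrated in plain Python. This is a minimal sketch of cosine distance (what pgvector's `<=>` operator computes), not pgvector's actual implementation:

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two embeddings: 1 - cos(theta)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Same direction -> distance 0; orthogonal -> distance 1.
print(cosine_distance([1, 0], [2, 0]))  # -> 0.0
print(cosine_distance([1, 0], [0, 1]))  # -> 1.0
```

In Postgres with pgvector installed, the same comparison is written in SQL, e.g. `ORDER BY embedding <=> '[1,0]' LIMIT 5`.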

databricks.com postgres
2d

Post-Quantum Cryptography Migration at Meta: Framework, Lessons, and Takeaways

Meta shares lessons learned from their post-quantum cryptography (PQC) migration to assist other organizations in strengthening their resilience during the transition to post-quantum cryptography standards. They propose the idea of PQC Migration Levels to help teams manage the complex migration process.

engineering.fb.com engineering
2d

Artifacts: versioned storage that speaks Git

Cloudflare's Artifacts provides Git-compatible versioned storage for code and data, designed for agents, developers, and automations. It supports creating millions of repos and forking from any remote.

blog.cloudflare.com engineering
2d

Index-based pruning in ClickHouse

Learn how ClickHouse uses primary indexes, lightweight projections, and skip indexes to prune data before reading it. Demonstrated on a 243 million row UK property sales dataset.
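
The core idea behind min/max pruning can be sketched in a few lines: keep per-block statistics for a column and skip any block whose range cannot satisfy the predicate. Hypothetical data structures for illustration, not ClickHouse internals:

```python
# Per-block (min, max) statistics for a price column; a range predicate can
# rule out whole blocks before any data is read.

def prune_blocks(block_stats, lo, hi):
    """Return indices of blocks whose [min, max] range overlaps [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(block_stats)
            if bmax >= lo and bmin <= hi]

# Query: price BETWEEN 250_000 AND 400_000 -- blocks 0 and 3 are never read.
stats = [(50_000, 120_000), (100_000, 300_000),
         (280_000, 500_000), (600_000, 900_000)]
print(prune_blocks(stats, 250_000, 400_000))  # -> [1, 2]
```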

clickhouse.com clickhouse
3d

Agent Harnesses Are Dead. Long Live Agent Harnesses.

This article discusses the evolving landscape of AI agent frameworks and harnesses, suggesting that while frameworks may be getting cheaper, the underlying need for structured agent orchestration remains.

blog.crewai.com agents
4d

Ducklake’s architecture makes so much sense, and really highlights the drawbacks of using the object store itself for metadata like Iceberg does. Ducklake+Motherduck seem well positioned to take Snowflake customers. What differentiates motherduck’s technical architecture from Snowflake’s?

This Reddit thread discusses DuckLake's architecture, highlighting the drawbacks of using the object store itself for metadata as Iceberg does, and asks what differentiates MotherDuck's technical architecture from Snowflake's.

reddit.com duckdb
5d

DuckLake v1.0

DuckLake v1.0 has been released.

reddit.com duckdb
5d

Dynamic, identity-aware, and secure Sandbox auth

Outbound Workers for Sandboxes provide a programmable, zero-trust egress proxy for AI agents. This allows developers to inject credentials and enforce dynamic security policies without exposing sensitive tokens to untrusted code.

blog.cloudflare.com agents
5d

DuckLake 1.0

The article announces the release of DuckLake 1.0.

duckdb.org duckdb
6d

Your ReAct Agent Is Wasting 90% of Its Retries — Here’s How to Stop It

Most ReAct-style agents are silently wasting their retry budget on errors that can never succeed. In a 200-task benchmark, 90.8% of retries were spent on hallucinated tool calls — not model mistakes, but architectural flaws. This article shows why prompt tuning won’t fix it, and the structural changes that address it.
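
One way to stop burning retries on calls that can never succeed is to validate a tool call against the registry before re-prompting at all. A minimal sketch with a hypothetical tool registry, not the article's implementation:

```python
# Reject hallucinated tool calls up front: an unknown tool or missing argument
# is a structural error, so retrying the same prompt cannot fix it.

TOOLS = {"search_docs": {"query"}, "read_file": {"path"}}

def validate_call(name, args):
    """Return None if the call is well-formed, else a non-retryable reason."""
    if name not in TOOLS:
        return f"unknown tool: {name}"
    missing = TOOLS[name] - set(args)
    if missing:
        return f"missing args: {sorted(missing)}"
    return None

print(validate_call("read_file", {"path": "a.txt"}))   # -> None (ok to run)
print(validate_call("fetch_url", {"url": "x"}))        # -> unknown tool: fetch_url
```

A failed validation can be fed back to the model as corrective context instead of counting against the retry budget.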

towardsdatascience.com ml
6d

DuckDB Meets Data Lakes [video]

Walkthrough of querying data lake files with DuckDB, covering Parquet, Iceberg, and S3 integration patterns.

youtube.com duckdb
6d

Your harness, your memory

This LangChain blog post discusses the growing importance of agent harnesses in building AI agents and their connection to agent memory. It highlights the potential drawbacks of using closed harnesses, particularly those behind proprietary APIs, which can limit control over the agent.

blog.langchain.com agents
7d

Why Every AI Coding Assistant Needs a Memory Layer

The article argues that AI coding assistants require a persistent memory layer to overcome the limitations of stateless LLMs. This memory layer improves code quality by providing systematic context across sessions.

towardsdatascience.com ml
7d

Show HN: Formally Verified Leaderless Log Protocol for Kafka

This post announces the open-sourcing of a formally verified TLA+ specification for a leaderless log protocol for Kafka, highlighting the discovery of a design bug through verification. It also mentions using Claude Code to generate a working Rust implementation from the specification.

github.com kafka
8d

Design and Implementation of DuckDB Internals

This article from the DuckDB website discusses the design and implementation of DuckDB internals, which is useful for understanding its architecture and performance characteristics.

duckdb.org duckdb
8d

Context Engineering for AI Coding Agents

This article discusses context engineering techniques for AI coding agents, specifically focusing on Claude code sub-agents. It explores how to structure prompts and context to improve the performance of AI coding assistants.

amux.io agents
9d

Escaping the Fork: How Meta Modernized WebRTC Across 50+ Use Cases

Meta shares its approach to modernizing WebRTC, the technology powering real-time audio and video across their platforms. The article highlights the challenges of forking a large open-source project and how Meta addressed them to stay aligned with community upgrades.

engineering.fb.com engineering
9d

Oracle CDC now available in Redpanda Connect

Redpanda Connect now offers native CDC for Oracle, enabling real-time data access without requiring rearchitecting. The solution eliminates the need for a JVM, middleware, and related operational overhead.

redpanda.com streaming
10d

ClickHouse at FOSDEM 2026

This post recaps ClickHouse's involvement at FOSDEM 2026 in Brussels. It highlights the community's activities during the event.

clickhouse.com clickhouse
10d

Show HN: 500k+ events/sec transformations for ClickHouse ingestion

This post highlights GlassFlow's work on achieving high-throughput (500k+ events/sec) transformations for ClickHouse ingestion, particularly in observability and real-time analytics pipelines. It addresses challenges related to scaling throughput.

github.com clickhouse
10d

From bytecode to bytes: automated magic packet generation

Cloudflare's blog post details how they automated the generation of malware trigger packets using symbolic execution on BPF bytecode. By leveraging the Z3 theorem prover, they significantly reduced analysis time, improving their ability to detect and respond to threats.

blog.cloudflare.com engineering
10d

Cortex AISQL: A Production SQL Engine for Unstructured Data

This paper introduces Cortex AISQL, a production SQL engine from Snowflake that integrates native semantic operations directly into SQL. This allows users to combine relational operations with semantic reasoning for querying unstructured data.

arxiv.org snowflake
11d

Managing the Context Window | Airbyte

The article discusses effective strategies for managing the context window in AI agents. It emphasizes improving performance, reducing costs, and maintaining relevant outputs, which is valuable for optimizing AI systems.

airbyte.com agents
11d

Engineering An AI Agent To Navigate Large-scale Event Data – Part 2

This article delves into the design of an AI agent for navigating large-scale event data, focusing on transforming query patterns into intelligent tools and crafting an effective agent architecture, which offers practical insights into building agents for complex data environments.

mlops.community mlops
11d

Context Engineering for AI Agents: A Deep Dive

This article discusses techniques for optimizing context, a finite resource, when designing AI agents, focusing on how to best utilize available information to enhance agent performance.

towardsdatascience.com llm
11d

ClickHouse Release 26.3

ClickHouse version 26.3 introduces async inserts by default, improved JOIN reordering, and materialized CTEs. These features could improve query performance and data management for users.

clickhouse.com clickhouse
11d

Apache Arrow ADBC 23 (Libraries) Release

The Apache Arrow team announced the version 23 release of the Apache Arrow ADBC libraries, which includes 41 resolved issues from 20 contributors. The release covers the libraries only; the API specification is versioned separately.

arrow.apache.org arrow
12d

A Guide to Context Engineering for LLMs

This ByteByteGo article explores context engineering for LLMs, explaining how LLMs process information and outlining strategies to improve context utilization.

blog.bytebytego.com architecture
12d

Continual learning for AI agents

This LangChain blog post discusses continual learning for AI agents, highlighting that learning occurs at the model, harness, and context layers, not just model weight updates. Understanding these distinctions is crucial for building systems that improve over time.

blog.langchain.com llm
13d

Syntaqlite Playground

The article introduces the Syntaqlite Playground, a tool related to dbxlite and Metastax, aimed at data engineers, ML engineers, and analytics practitioners.

simonwillison.net llm
13d

Powering Multimodal Intelligence for Video Search

Netflix details how they are using multimodal intelligence to improve video search capabilities. The article likely covers the engineering challenges and solutions involved in building and deploying such a system at scale.

netflixtechblog.com engineering
15d

How My Agents Self-Heal in Production

This post details a self-healing deployment pipeline for a GTM Agent. The system automatically detects regressions after each deploy, determines if the change caused the regression, and uses an agent to create a pull request with a fix, minimizing manual intervention.

blog.langchain.com llm
15d

Towards Robustness: A Critique of Current Vector Database Assessments

This paper critiques the use of average recall as the dominant metric for evaluating vector databases, which are crucial in AI systems. It argues that relying solely on average recall can be problematic for users and researchers optimizing these systems.
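
The paper's concern can be made concrete with a toy calculation: averaging recall across queries can hide queries that are served terribly. Toy numbers for illustration, not from the paper:

```python
# recall@k per query, then the average -- a healthy-looking mean can mask a
# worst case of zero for some users.

def recall_at_k(retrieved, relevant):
    """Fraction of relevant items that appear in the retrieved set."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

per_query = [
    recall_at_k([1, 2, 3], [1, 2, 3]),    # 1.0
    recall_at_k([4, 5, 6], [4, 5, 6]),    # 1.0
    recall_at_k([7, 8, 9], [10, 11, 12]), # 0.0 -- this query gets nothing useful
]
avg = sum(per_query) / len(per_query)
print(round(avg, 3), min(per_query))  # -> 0.667 0.0
```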

arxiv.org vector-db
16d

Multi-Objective Agentic Rewrites for Unstructured Data Processing

This paper discusses DocETL, a declarative system for LLM-powered data processing that has gained traction across various domains. DocETL allows users to define complex data processing pipelines using LLMs, enabling tasks like information extraction and data transformation from unstructured documents.

arxiv.org llm
16d

Agentic Coding at ClickHouse

ClickHouse describes their work on agentic coding. The article likely covers the practical implementations and potential benefits of this approach within the ClickHouse ecosystem.

clickhouse.com clickhouse
16d

How to Orchestrate dbt with Dagster

This article describes how to use Dagster's dbt integration to run and monitor dbt models as part of a larger asset-driven pipeline, focusing on lineage and scheduling improvements.

dagster.io orchestration
16d

Debug Dagster Code with Docker

Learn step-by-step how to debug Dagster pipelines directly inside Docker, bridging development and deployment environments with practical tools.

dagster.io orchestration
16d

High-Performance Python for Pipelines

Use proven tips to make your Python code faster and more efficient, especially for data engineering and pipeline-heavy workloads.

dagster.io orchestration
16d

Open Models have crossed a threshold

LangChain reports that open models like GLM-5 and MiniMax M2.7 are now comparable to closed frontier models on agent tasks like file operations and tool use. The article presents evaluation results and instructions for using these open models.

blog.langchain.com llm
16d

Why we're rethinking cache for the AI era

Cloudflare discusses the challenges and opportunities in cache design presented by the explosion of AI-bot traffic, detailing the differences between AI bot traffic and human traffic and providing some early ideas for system design.

blog.cloudflare.com engineering
16d

The Missing Interface in Data Platform Engineering

This article discusses how data leaders should design the interface between data platforms and the teams that rely on them. It emphasizes the importance of clear boundaries and well-defined responsibilities in data platform engineering.

dataengineeringweekly.com data-engineering
17d

Data Inlining in DuckLake: Unlocking Streaming for Data Lakes

This blog post from the DuckDB team introduces data inlining in DuckLake to enable streaming for data lakes. It details the motivation, implementation, and benefits of this approach, including improved performance and reduced latency.

duckdb.org duckdb
17d

Dagster 1.12: Refinement and Acceleration

Dagster 1.12 introduces a redesigned UI, Components GA, streamlined deployment workflows, and major orchestration upgrades. These enhancements aim to make data orchestration faster, simpler, and more reliable for users.

dagster.io orchestration
17d

Multimodal Embeddings and RAG: A Practical Guide

This blog post explains multimodal embeddings for searching across different data types (text, images, audio, video) in RAG systems. It provides practical implementations using Weaviate and Gemini.

weaviate.io vector-db
18d

DuckDB Now Speaks Dutch!

This DuckDB blog post humorously explores an alternate reality where Dutch, not English, became the dominant language for SQL. It poses the question of how this linguistic shift might have shaped the development and standardization of SQL.

duckdb.org duckdb
18d

ClickHouse BYOC on Google Cloud now Generally Available

ClickHouse has announced the general availability of its Bring Your Own Cloud (BYOC) offering on Google Cloud. This allows users to run ClickHouse within their own Google Cloud account while maintaining full data sovereignty and zero-trust networking.

clickhouse.com clickhouse
18d

Exqutor: Extended Query Optimizer for Vector-augmented Analytical Queries

This paper introduces Exqutor, an extended query optimizer designed for vector-augmented analytical queries, particularly in Retrieval-Augmented Generation (RAG) pipelines. It aims to improve the efficiency of retrieving relevant external knowledge for large language model inference.

arxiv.org databases
19d

Under the hood: Redpanda Cloud Topics architecture

This article describes the architecture of Redpanda Cloud Topics, a new replication mechanism that uses object storage to reduce costs. The discussion of internals is valuable for engineers working with streaming data.

redpanda.com streaming
20d

Making HNSW Work with JOINs and WHERE Clauses on DuckDB

This article explains how to use HNSW indexes effectively with JOINs and WHERE clauses in DuckDB, demonstrating how to combine approximate nearest neighbor search with standard SQL operations for efficient data retrieval.
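
The pattern of combining a WHERE predicate with nearest-neighbor ranking can be sketched as pre-filter-then-search: apply the predicate first, then rank the survivors by distance. Brute force stands in for the HNSW index here, and the data is made up:

```python
import math

# Hypothetical rows with a filterable column and a 2-d embedding.
rows = [
    {"id": 1, "year": 2020, "vec": (0.0, 1.0)},
    {"id": 2, "year": 2023, "vec": (1.0, 0.0)},
    {"id": 3, "year": 2023, "vec": (0.9, 0.1)},
]

def knn_where(query, predicate, k):
    """Apply the WHERE predicate, then return the k nearest matching ids."""
    matches = [r for r in rows if predicate(r)]
    matches.sort(key=lambda r: math.dist(r["vec"], query))
    return [r["id"] for r in matches[:k]]

print(knn_where((1.0, 0.0), lambda r: r["year"] == 2023, k=1))  # -> [2]
```

The trade-off the article navigates is that an ANN index like HNSW searches the whole index, so filtering can happen before, during, or after the index scan with very different cost and recall profiles.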

cigrainger.com duckdb
20d

Zero-Downtime Patching Part 1: Prewarming

This Neon blog post discusses their approach to zero-downtime patching using prewarming techniques to ensure continuous availability of customer databases. It details their system's redundancy and failover mechanisms.

neon.com postgres
22d

Agent Evaluation Readiness Checklist

The LangChain blog post offers a checklist for evaluating AI agents, covering error analysis, dataset construction, grader design, and offline/online evaluation. The checklist is intended to help ensure production readiness.

blog.langchain.com llm
22d

Zero-Downtime Patching in Lakebase Part 1: Prewarming

This Databricks blog post discusses techniques for ensuring database availability during patching in Lakebase. It focuses on prewarming as a method to minimize downtime during updates, which is crucial for maintaining service reliability in data platforms.

databricks.com databricks
22d

Qwen 3.5 27B at 1.1M tok/s on B200s, all configs on GitHub

This post shares the configurations used to push Qwen 3.5 27B to 1,103,941 tok/s on 12 nodes with 96 B200 GPUs using vLLM. The improvements came from changes to DP, context window, FP8 KV cache, and MTP-1 speculative decoding.

reddit.com llm
23d

Top 10 best practices tips for ClickHouse

This article presents ten best practices for ClickHouse, covering topics like primary key design, data types, materialized views, and join optimization. Benchmarks on a 150M row dataset illustrate the impact of these practices.

clickhouse.com clickhouse
23d

A one-line Kubernetes fix that saved 600 hours a year

Cloudflare describes a Kubernetes fix involving fsGroupChangePolicy that reduced Atlantis instance restart times from 30 minutes to 30 seconds by addressing a bottleneck in volume permission handling.

blog.cloudflare.com engineering
23d

ClickHouse is data lake ready

ClickHouse now supports direct querying of Iceberg and Delta Lake formats across major cloud catalogs. This feature eliminates the need for data migration, improving data lake accessibility.

clickhouse.com clickhouse
23d

A physical design advisor for DuckDB

A physical design advisor called Vizier has been developed for DuckDB. It analyzes queries and suggests changes to the database's physical layout, such as sort orders and indexes, to improve query performance.

reddit.com duckdb
23d

Agent Engineering Patterns: Dealing with large tool results

This blog post from Firetiger explores strategies for handling large tool results within AI agent workflows. It discusses approaches like summarization, pagination, and streaming to manage the volume of data returned by tools used by agents.

blog.firetiger.com agents
24d

The Case for Shared Storage - WarpStream

Shared-nothing made sense when storage was slow, but shared storage flips that tradeoff. The architectural case for building Kafka directly on object storage.

warpstream.com streaming
24d

Hacking the Kafka PRoTocOL - WarpStream

Kafka assumes stateful, partition-owning brokers. How WarpStream reverse-engineered it for stateless Agents. A deep dive into diskless Kafka load balancing.

warpstream.com streaming
24d

Getting started with WarpStream on Tigris - WarpStream

Run WarpStream on Tigris for globally distributed, durable Kafka streaming. This setup eliminates region-specific bucket planning and hidden data transfer fees, offering a streamlined approach to managing streaming infrastructure.

warpstream.com streaming
24d

Structured Logging in .NET with Serilog and ClickHouse

Learn how to send structured .NET logs directly to ClickHouse using Serilog — with full schema control, full-text search, and SQL queries over your log data. This post provides a step-by-step guide for setting up and using the integration.

clickhouse.com clickhouse
24d

Inside our approach to the Model Spec

Learn how OpenAI’s Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance. This post details the considerations and mechanisms used to ensure responsible AI deployment.

openai.com llm
24d

Introducing the OpenAI Safety Bug Bounty program

OpenAI launches a Safety Bug Bounty program to identify AI abuse and safety risks, including agentic vulnerabilities, prompt injection, and data exfiltration. This program encourages community participation in enhancing the security and robustness of AI models.

openai.com llm
25d

Introducing the NUMBER data type

Trino is adding support for the NUMBER data type to handle high-precision numeric types beyond the existing DECIMAL limit. This will allow Trino to query data from sources that use these types without loss of precision, improving interoperability.

trino.io trino
25d

Smarter Auto-Scaling for ClickHouse: The Two-Window Approach

ClickHouse Cloud's two-window recommender and target-tracking CPU algorithm cut scale-down latency from 30 hours to 3 hours while eliminating oscillations and reducing infrastructure costs. The post details the algorithm and its impact on autoscaling performance.
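
The two-window idea can be sketched as: scale up quickly from a short CPU window, but scale down only when a long window agrees, which damps oscillation. Thresholds and window semantics below are illustrative assumptions, not ClickHouse Cloud's actual algorithm:

```python
# Two-window scaling policy sketch: the short window reacts to spikes, the
# long window must confirm before any scale-down.

def decide(short_window_cpu, long_window_cpu, up=0.8, down=0.4):
    if short_window_cpu > up:
        return "scale_up"       # react fast to load spikes
    if short_window_cpu < down and long_window_cpu < down:
        return "scale_down"     # shrink only when low load is sustained
    return "hold"               # windows disagree -> avoid oscillation

print(decide(0.9, 0.5))  # -> scale_up
print(decide(0.3, 0.3))  # -> scale_down
print(decide(0.3, 0.7))  # -> hold (recent dip, but long-term load is high)
```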

clickhouse.com clickhouse
25d

Your WarpStream Questions, Answered - WarpStream

This article answers questions about WarpStream's architecture, BYOC vs. Serverless options, pricing, Kafka compatibility, performance trade-offs, and zero-disk streaming.

warpstream.com streaming
25d

Unlocking Idempotency with Retroactive Tombstones - WarpStream

Kafka idempotent producers without stateful brokers require rethinking deduplication. WarpStream uses retroactive tombstones to separate data from metadata, providing a technical solution for ensuring data integrity in streaming applications.
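
The bookkeeping that idempotent producers rely on — deduplication by producer id and sequence number — can be sketched in a simplified in-memory model. This shows the invariant WarpStream must preserve, not WarpStream's tombstone design:

```python
# A log that drops any write whose sequence number was already accepted for
# that producer, so a retried (duplicated) send is not appended twice.

class Log:
    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence accepted

    def append(self, producer_id, seq, value):
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate of an already-acked write; drop it
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return True

log = Log()
log.append("p1", 0, "a")
log.append("p1", 0, "a")   # network retry of the same send -> deduplicated
log.append("p1", 1, "b")
print(log.records)          # -> ['a', 'b']
```

In a stateless-broker design the `last_seq` map cannot live on a broker, which is the gap the retroactive-tombstone mechanism addresses.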

warpstream.com streaming
25d

Tiered Storage Won’t Fix Kafka - WarpStream

Tiered storage still runs stateful brokers with expensive disks and inter-AZ replication. It does not solve the real cost problem at the heart of Kafka, offering a critical analysis of a common architectural pattern.

warpstream.com streaming
25d

The Original Sin of Cloud Infrastructure - WarpStream

OSS big data tools like Kafka were built for hyper-scalers, then given to everyone. The article discusses why on-prem assumptions in open source infra cause pain in the cloud, offering a high-level perspective on cloud infrastructure design.

warpstream.com streaming
25d

How Netflix Live Streams to 100 Million Devices in 60 Seconds

This article outlines the architecture that allows Netflix to live stream to 100 million devices in 60 seconds. It focuses on the challenges and solutions involved in building a large-scale live streaming system.

blog.bytebytego.com architecture
25d

Sandboxing AI agents, 100x faster

Cloudflare introduces Dynamic Workers for executing AI-generated code in secure, lightweight isolates. This technique achieves millisecond startup times, significantly faster than traditional container-based sandboxing for AI agents.

blog.cloudflare.com engineering
25d

How Stripe Radar helps prevent free trial abuse

Stripe Engineering details how Radar uses machine learning to prevent free trial abuse. The system predicts abusive behavior with 90% accuracy, based on common trial terms violations.

stripe.com engineering
26d

How Agentic RAG Works?

In this article, we will look at how agentic RAG works, how it improves upon standard RAG, and the trade-offs that should be considered.

blog.bytebytego.com architecture
26d

Inside Gen 13: how we built our most powerful server yet

Cloudflare's Gen 13 servers introduce AMD EPYC™ Turin 9965 processors and a transition to 100 GbE networking to meet growing traffic demands. In this technical deep dive, we explain the engineering rationale behind each major component selection.

blog.cloudflare.com engineering
26d

Process Faster, Pay Less: Functional Isolation for Stream Processing

This arXiv paper presents a novel approach to stream processing by exploring functional isolation to reduce infrastructure costs. It discusses how concurrent workloads can extract insights from real-time data streams while optimizing resource utilization.

arxiv.org streaming
27d

ReViSQL: Achieving Human-Level Text-to-SQL

The paper introduces ReViSQL, an approach to translating natural language to SQL, aiming to achieve human-level performance. The research focuses on enhancing SQL reasoning by utilizing large language models and AI agents to decompose complex queries.

arxiv.org semantic-layer
27d

Announcing DuckDB 1.5.1

DuckDB 1.5.1 is released, including fixes and Lance support. The release notes are available on GitHub, and the new version can be installed from the installation page.

duckdb.org duckdb
27d

The Math That’s Killing Your AI Agent

This article uses compound probability to illustrate how seemingly accurate AI agents can fail in multi-step tasks. It also proposes a pre-deployment framework to mitigate such failures in production.
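
The compounding argument fits in one line: with per-step success probability p, an n-step task succeeds with probability p**n. Illustrative numbers, not the article's benchmark:

```python
# A "95% reliable" step looks fine in isolation but fails most 20-step tasks.

def task_success(p_step, n_steps):
    """Probability that all n independent steps succeed."""
    return p_step ** n_steps

print(round(task_success(0.95, 1), 3))   # -> 0.95
print(round(task_success(0.95, 20), 3))  # -> 0.358
```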

towardsdatascience.com ml
29d

DuckDB.ExtensionKit: Building DuckDB Extensions in C#

DuckDB has a flexible extension mechanism that allows extensions to be loaded dynamically at runtime, and this post shows how to build them in C#. This extension mechanism can add support for new file formats, introduce custom types, or provide specialized analytical functions.

duckdb.org duckdb
30d

Show HN: Blobsearch – Object storage and DuckDB based Elasticsearch alternative

The article introduces Blobsearch, an Elasticsearch alternative based on object storage (like S3) and DuckDB for querying logs rapidly. It focuses on using a durable storage solution (S3 with Parquet) combined with the analytical capabilities of DuckDB for cost-effective log analysis and monitoring.

github.com duckdb
31d

Friend Bubbles: Enhancing Social Discovery on Facebook Reels

Friend bubbles in Facebook Reels highlight Reels your friends have liked or reacted to, helping you discover new content and making it easier to connect over shared interests. This article explains the technical architecture behind friend bubbles, including how machine learning estimates relationship strength.

engineering.fb.com engineering
31d

Beam Metrics in ClickHouse

This article explores using Apache Beam to ingest metrics into ClickHouse, providing insights into how to leverage a data processing framework for efficient metric storage and analysis in a columnar database.

andrealeopardi.com clickhouse
31d

How ClickStack makes ClickHouse faster for observability

This post details how ClickStack integrates with ClickHouse to optimize queries for observability workloads. It covers techniques like progressive time window pagination, chunked charts, and automated use of materialized views, offering insights into performance tuning.

clickhouse.com clickhouse
31d

How one query ate 2 TB of RAM

This Postgres Weekly article discusses how a badly written query caused an OOM (Out-Of-Memory) killer issue, even with ample RAM. The culprit was `work_mem` exceeding expectations; this is a cautionary tale regarding resource allocation and query optimization in Postgres.
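
The reason ample RAM doesn't protect you is that `work_mem` is a per-sort/per-hash budget, not a per-query cap, so one query can claim many multiples of it. A rough worst-case estimate with illustrative numbers (the real accounting depends on the plan):

```python
# Each sort/hash node in each backing process may use up to work_mem, so the
# worst case multiplies: work_mem x nodes x (leader + parallel workers).

def worst_case_mem_mb(work_mem_mb, sort_or_hash_nodes, parallel_workers):
    return work_mem_mb * sort_or_hash_nodes * (parallel_workers + 1)

# 256 MB work_mem, 8 hash/sort nodes, 3 parallel workers -> 8 GB for one query.
print(worst_case_mem_mb(256, 8, 3))  # -> 8192
```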

postgresweekly.com postgres
32d

Introducing Redpanda AI SDK for Go

Redpanda is open-sourcing their AI SDK for Go, designed for observable, resilient, and production-grade AI tooling.

redpanda.com streaming
32d

Nemotron 3 Nano 4B: A Compact Hybrid Model for Efficient Local AI

Nemotron 3 Nano 4B is presented as a compact LLM suitable for local AI, offering an efficient option for running inference on resource-constrained devices. Staff+ ML engineers working on edge deployment or low-latency applications should investigate this model's architecture and performance characteristics.

huggingface.co ml
32d

Context Engineering from the Inside Out

This article explores context engineering, a topic critical for building AI-ready data systems. The post discusses designing data systems for AI consumption, machine-readable metadata, and contextual memory, providing insights into creating effective data pipelines for AI applications.

blog.yellowday.day community
32d

Introduction to Data-Centric Query Compilation

An introduction to data-centric query compilation, covering how modern engines like HyPer and Umbra generate machine code from query plans by pushing data through tight loops rather than pulling through iterator trees.

duckul.us duckdb
32d

Underrated Postgres: Create (Extended) Statistics

This article highlights the importance of extended statistics in Postgres for query optimization. It likely covers how to create and use extended statistics to improve query performance, especially for complex queries or datasets with skewed data distributions.
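
The motivation for extended statistics is easy to demonstrate: the planner multiplies per-column selectivities as if the columns were independent, which collapses for correlated columns (say, city and postcode). Toy data for illustration, not the article's example:

```python
# 90 rows where city and postcode are perfectly correlated: the independence
# assumption estimates 0.81 selectivity where the true value is 0.9.

rows = [("london", "E1")] * 90 + [("paris", "75001")] * 10

def selectivity(pred):
    """Fraction of rows matching the predicate."""
    return sum(1 for r in rows if pred(r)) / len(rows)

s_city = selectivity(lambda r: r[0] == "london")        # 0.9
s_zip = selectivity(lambda r: r[1] == "E1")             # 0.9
assumed = s_city * s_zip                                # planner-style estimate
actual = selectivity(lambda r: r == ("london", "E1"))   # true combined value
print(round(assumed, 2), actual)  # -> 0.81 0.9
```

`CREATE STATISTICS` on the column pair gives the planner the dependency information needed to avoid this misestimate.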

vela.simplyblock.io postgres
32d

DataOps Best Practices with Dagster: CI/CD, Monitoring & Data Quality

This Dagster blog post details CI/CD workflows using branch deployments, automatic retries, and backfill strategies; it also covers data quality via asset checks and monitoring with Dagster Insights, offering actionable advice for managing production data pipelines.

dagster.io orchestration
32d

Lower your warehouse costs via DuckDB transpilation

This article explores using DuckDB transpilation to reduce warehouse costs. It could involve techniques for rewriting SQL queries to leverage DuckDB's efficient execution or using DuckDB as a local processing layer before data warehousing, offering a practical method for cost optimization.

maxhalford.github.io duckdb
32d

Subagents

Covers subagent patterns for building composable AI agents that delegate tasks to specialized sub-agents, with practical implementation details.

simonwillison.net llm
32d

Building a product analytics warehouse on vanilla Postgres

This article discusses building a product analytics warehouse directly on Postgres. The article likely details schema design choices, performance optimization strategies (indexing, partitioning), and extension usage (like pgvector) relevant for those using Postgres beyond traditional transactional workloads.

xata.io postgres
32d

How 5 Databases Scale Across Concurrency, Data, and Nodes

The article compares Exasol, ClickHouse, StarRocks, Trino, and DuckDB across concurrency, data volume, and node scaling. The comparison highlights architectural differences, performance trade-offs, and suitability for different analytical workloads across these popular SQL engines.

exasol.com duckdb
32d

Show HN: Avalon - Synthetic FHIR R4 patient data as OMOP CDM 5.4 views

Avalon is a synthetic clinical data pipeline: generate realistic FHIR R4 patient data, normalize it through Forge, and query it as OMOP CDM 5.4 views. It is an end-to-end pipeline that turns Synthea-generated FHIR bundles into clean, documented, queryable tables in BigQuery.

github.com community
33d

How Stripe’s Minions Ship 1,300 PRs a Week

Stripe uses internal coding agents called 'Minions' to generate over 1,300 automated pull requests per week. The article likely describes the architecture and implementation of these agents.

blog.bytebytego.com engineering
33d

Designing the new async-native ClickHouse Python client

ClickHouse-connect v0.12.0 introduces a new async-native Python client built using the half-sync/half-async pattern. Benchmarks show a 1.16x improvement in throughput and more stable tail latency under high concurrency.

clickhouse.com clickhouse
33d

Show HN: Synthea Fhir Data in BigQuery

We generated ~1,100 synthetic patients with Synthea, processed the FHIR R4 output through our normalization engine (Forge), and published it as a free public dataset on BigQuery Analytics Hub. The dataset covers 8 resource types, including Patient, Encounter, Observation, Condition, Procedure, Immunization, and MedicationRequest.

news.ycombinator.com community
34d

Yeahchain, a high-throughput data sync layer

We just open-sourced the core data sync engine behind Yeahchain. The problem we solved: traditional databases were hitting performance bottlenecks during high-frequency sync operations. For Yeahchain, we moved to a custom, lock-free architecture that maps shared memory regions directly into our process.

news.ycombinator.com community
34d

What is agentic engineering?

Article URL: https://simonwillison.net/guides/agentic-engineering-patterns/what-is-agentic-engineering/

simonwillison.net agents
34d

Show HN: Lockstep – A data-oriented programming language

https://github.com/seanwevans/lockstep I want to share my work-in-progress systems language with a v0.1.0 release of Lockstep. It is a data-oriented systems programming language designed for high-throughput, deterministic compute pipelines.

github.com community
34d

Redpanda pushes the envelope on NVIDIA Vera

This article reports on performance improvements in Redpanda using NVIDIA Vera, showing latency reductions and throughput gains compared to CPU models.

redpanda.com streaming
34d

Why sharing domain data across microservices is a silent killer

I spent a few years working at a company where all our microservices backed into MongoDB instances. We were constantly under top-down pressure to deliver fast, and because MongoDB is schemaless, it felt very easy to just add fields to our documents whenever we needed to expose data to another service.

news.ycombinator.com community
34d

Patch Me If You Can: AI Codemods for Secure-by-Default Android Apps

Even seemingly simple engineering tasks — like updating an API — can become monumental undertakings when you’re dealing with millions of lines of code and thousands of engineers, especially if the changes are security-related. Meta uses AI codemods to automate security-related changes in their Android apps.

engineering.fb.com engineering
36d

Querying DateTimes in ClickHouse

This ClickHouse blog post explains how to effectively query datetime columns, including examples for hourly bucketing and rush hour analysis using real taxi data.

clickhouse.com clickhouse
36d

Designing AI agents to resist prompt injection

OpenAI details how ChatGPT is designed to resist prompt injection and social engineering by constraining risky actions and protecting sensitive data within agent workflows, offering insights into security measures.

openai.com llm
38d

The Practical Limits of DuckDB on Commodity Hardware

This post discusses the practical limits of DuckDB when running on commodity hardware. Understanding these limitations is crucial for optimizing performance and resource allocation in real-world deployments.

reddit.com duckdb
38d

Big Data on the Cheapest MacBook

Apple released the MacBook Neo today and there is no shortage of tech reviews explaining whether it's the right device for you if you are a student, a photographer, or a writer. What they don't tell you is whether it fits into our Big Data on Your Laptop ethos. We wanted to answer this using a data-driven approach.

duckdb.org duckdb
39d

How Advanced Browsing Protection Works in Messenger

This article shares the technical details behind how Advanced Browsing Protection (ABP) in Messenger protects the privacy of the links clicked on within chats while still warning people about malicious links. It illuminates some of the engineering challenges and infrastructure required to implement it.

engineering.fb.com engineering
40d

The Pulse: Cloudflare rewrites Next.js as AI rewrites commercial open source

The article highlights an engineer at Cloudflare who rewrote most of Next.js in one week using AI agents; this example suggests a future where AI can rapidly disrupt existing software moats and business models, raising important questions about the evolving role of software engineers.

blog.pragmaticengineer.com engineering
44d

Introducing Iceberg output for Redpanda Connect

The article introduces Iceberg output for Redpanda Connect, enabling users to land data directly into Apache Iceberg tables; it highlights advantages such as automated schema evolution and scalable routing, making it useful for those integrating streaming data with data lakes.

redpanda.com streaming
45d

The Decline of RAG in Agentic AI | Airbyte

Explore the decline of traditional RAG in the era of agentic AI, and how autonomous agents are reshaping retrieval, reasoning, and knowledge workflows.

airbyte.com data-engineering
46d

FFmpeg at Meta: Media Processing at Scale

FFmpeg is a multi-tool for media processing, supporting a wide variety of audio and video codecs and container formats. It can also orchestrate complex chains of filters for media editing and manipulation, and it plays an important role in the videos served across Meta's apps.

engineering.fb.com engineering
47d

Investing in Infrastructure: Meta’s Renewed Commitment to jemalloc

Meta recognizes the long-term benefits of jemalloc, a high-performance memory allocator, in its software infrastructure. Meta is renewing focus on jemalloc, aiming to reduce maintenance needs and modernize the codebase while continuing to evolve the allocator to adapt to the latest hardware and workloads.

engineering.fb.com engineering
47d

You don’t know what your agent will do until it’s in production

This article discusses the unique challenges of monitoring LLM agents due to their non-deterministic nature and infinite input possibilities; it proposes focusing on conversation quality and using production traces for continuous improvement, highlighting the shift from traditional software monitoring.

blog.langchain.com llm
52d

The tale of an unanticipated concurrency and locking gotcha

A surprising edge case involving row locks with joins in Postgres: non-null foreign keys and valid constraints do not guarantee an inner join will return a row under concurrent modifications. The post traces the exact sequence of operations that triggers the bug.

postgresweekly.com postgres
53d

Why Agents Need Ontology | Airbyte

Discover why AI agents need ontology to structure knowledge, improve reasoning, enable semantic understanding, and make better autonomous decisions.

airbyte.com data-engineering
53d

RCCLX: Innovating GPU Communications on AMD Platforms

Meta is open-sourcing the initial version of RCCLX – an enhanced version of RCCL that we developed and tested on Meta’s internal workloads. RCCLX is fully integrated with Torchcomms and aims to empower researchers and developers to accelerate innovation, regardless of their chosen backend.

engineering.fb.com engineering
53d

How To Get Started With Kubernetes: A Practical Guide

A Kubernetes beginner roadmap that goes through all k8s concepts, with links to external documentation and exercises, drawn from the author's past few years working in startup environments.

mlops.community mlops
53d

How we built Agent Builder’s memory system

The post delves into the design and implementation of Agent Builder's memory system, discussing the prioritization of memory, technical architecture, and future enhancements; it offers valuable insight into building persistent memory systems for AI agents and their impact on performance.

blog.langchain.com llm
56d

Querying 3 billion vectors

The author investigates a map-reduce solution for querying 3 billion vectors, inspired by a discussion with Jeff Dean. The article delves into the implementation details of this solution, exploring the challenges and potential optimizations.
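
The map-reduce pattern described there can be sketched in a few lines: each shard scores its own vectors and keeps a local top-k, and a reducer merges the per-shard candidates. This is a minimal illustration of the pattern, not the post's actual implementation; the shard layout and brute-force cosine scoring are assumptions for the example.

```python
import heapq
import math

def cosine(a, b):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def shard_top_k(shard, query, k):
    # "Map" step: each shard scores its own vectors and keeps a local top-k.
    scored = ((cosine(vec, query), vec_id) for vec_id, vec in shard)
    return heapq.nlargest(k, scored)

def merge_top_k(partials, k):
    # "Reduce" step: merge per-shard candidate lists into a global top-k.
    return heapq.nlargest(k, (item for part in partials for item in part))

# Toy example: two shards of (id, vector) pairs, global top-2 for a query.
shards = [
    [("a", [1.0, 0.0]), ("b", [0.0, 1.0])],
    [("c", [0.9, 0.1]), ("d", [-1.0, 0.0])],
]
query = [1.0, 0.0]
partials = [shard_top_k(s, query, k=2) for s in shards]
result = merge_top_k(partials, k=2)
print([vec_id for _, vec_id in result])  # ['a', 'c']
```

Because each shard only ships k candidates to the reducer, the merge cost stays constant in the number of vectors and linear in the number of shards.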

vickiboykis.com ml
57d

Introducing the Apache Iceberg File Format API

The Apache Iceberg community has finalized the File Format API, a significant architectural enhancement that enables pluggable, consistent, and engine-agnostic file formats within the Iceberg Java codebase.

iceberg.apache.org iceberg
58d

Our Multi-Agent Architecture for Smarter Advertising

This Spotify Engineering blog post discusses their multi-agent architecture for smarter advertising. The article likely details the challenges, solutions, and benefits of using a multi-agent approach to improve advertising effectiveness.

engineering.atspotify.com engineering
58d

Supabase incident on February 12, 2026

Supabase provides a detailed account of the February 12 outage in us-east-2, explaining the root cause and the steps taken to prevent it from happening again. The article provides insight into the incident and the measures implemented to improve system reliability.

supabase.com postgres
65d

Automating RDS Postgres to Aurora Postgres Migration

This Netflix Tech Blog post discusses automating the migration of RDS Postgres to Aurora Postgres. The article likely details the challenges, solutions, and lessons learned during this process, offering insights for others undertaking similar migrations.

netflixtechblog.com postgres
65d

High-Throughput Graph Abstraction at Netflix: Part I

This Netflix Tech Blog post covers high-throughput graph abstraction. The article likely describes the architecture, implementation, and performance considerations of their graph abstraction system, offering practical insights for building similar systems.

netflixtechblog.com knowledge-graphs
68d

Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

This article shares details of the role backend aggregation (BAG) plays in building Meta’s gigawatt-scale AI clusters like Prometheus. BAG allows Meta to seamlessly connect thousands of GPUs across multiple data centers and regions; their BAG implementation connects two different network fabrics.

engineering.fb.com engineering
68d

The Data Canary: How Netflix Validates Catalog Metadata

This Netflix Tech Blog post details how Netflix validates catalog metadata using a 'Data Canary' system. The article likely explains the architecture, implementation, and benefits of this system for ensuring data quality and reliability.

netflixtechblog.medium.com data-engineering
71d

The Missing Layer in Your AI Stack: Context, Not Just State

This article discusses how context graphs can improve AI agent performance, emphasizing the shift from simple state management to incorporating semantic understanding of the data.

dataengineeringweekly.com data-engineering
78d

Data Bridge: How Netflix simplifies data movement

This Netflix Tech Blog post discusses 'Data Bridge', a system Netflix uses to simplify data movement. The article likely explains the architecture, implementation, and benefits of this system for improving data pipeline efficiency and reducing complexity.

netflixtechblog.com data-engineering
78d

I replaced a $120/year micro-SaaS in 20 minutes with LLM-generated code

This post explores how an individual replaced a paid SaaS subscription with LLM-generated code in just 20 minutes; this highlights the potential for LLMs to disrupt simple SaaS business models, especially for products that are not actively maintained.

blog.pragmaticengineer.com engineering
79d

The AI Evolution of Graph Search at Netflix

This Netflix Tech Blog post covers the AI evolution of graph search at Netflix. The article likely describes how they're using AI to improve graph search capabilities, offering insights into building intelligent search systems.

netflixtechblog.com knowledge-graphs
82d

Lessons From 2 Billion Agentic Workflows

The article shares lessons learned from observing billions of agentic workflows, focusing on the challenges of moving from a working demo to a production system.

blog.crewai.com agents
84d

Announcing Vortex Support in DuckDB

I think it is worth starting this intro by talking a little bit about the established format for columnar data. Parquet has done some amazing things for analytics. If you go back to the times when CSV was the best alternative, then you know how important Parquet is.

duckdb.org duckdb
86d

Inside StarRocks: Why Joins Are Faster Than You’d Expect

This StarRocks blog post dives into the details of join optimization within the StarRocks database, explaining why joins can perform faster than expected. The author is a StarRocks committer and engineer at Celerdata.

starrocks.io clickhouse
88d

Apache Doris 4.0: Native Hybrid Search for AI Workloads

Apache Doris now supports native hybrid search for AI workloads. The new functionality allows vector search, full-text search, and structured analytics within a single SQL engine, enabling AI-powered applications to leverage a unified data platform.

doris.apache.org analytics
89d

Implement dbt Data Quality Checks with dbt-expectations

Deep technical guide to dbt-expectations covering regex validation, freshness/SLA checks, completeness validation within time windows, JSON schema validation, statistical distribution checks, and cross-column logic. Shows integration with production monitoring.
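
dbt-expectations expresses these checks declaratively in YAML; as a plain-language illustration of the freshness/SLA idea, the check reduces to "fail if the newest row is older than the SLA window." The six-hour window and timestamps below are made up for the example.

```python
from datetime import datetime, timedelta, timezone

SLA = timedelta(hours=6)  # illustrative SLA window

def is_fresh(row_timestamps, now, sla=SLA):
    # A table meets its freshness SLA if its most recent row arrived
    # within `sla` of the evaluation time.
    return bool(row_timestamps) and (now - max(row_timestamps)) <= sla

now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = [datetime(2026, 1, 1, 9, 30, tzinfo=timezone.utc)]
stale = [datetime(2026, 1, 1, 1, 0, tzinfo=timezone.utc)]
print(is_fresh(recent, now))  # True: 2.5 hours old, within the 6-hour SLA
print(is_fresh(stale, now))   # False: 11 hours old
```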

datadoghq.com data-quality
89d

The 2026 Data Mandate: Is Your Governance Architecture a Fortress or a Liability?

Examines how the EU AI Act, Cyber Resilience Act, and Data Act turn messy data from a performance tax into a legal liability. Covers the August 2026 deadline for High-Risk AI system compliance and argues governance must shift from reactive cleanup to embedded-by-design architecture.

towardsdatascience.com governance
94d

Why We Use Separate Tech Stacks for Personalization and Experimentation

This Spotify Engineering blog post explains the technical and practical rationale for using separate tech stacks for personalization and experimentation. The article likely details the benefits of this separation, such as improved agility and scalability.

engineering.atspotify.com engineering
101d

Build a real-time lakehouse architecture with Redpanda and Databricks

This post outlines building a real-time lakehouse architecture using Redpanda's Iceberg Topics and Databricks Unity Catalog for analytics-ready tables, eliminating the need for batch processing and orchestration, which is of interest to practitioners.

redpanda.com streaming
103d

Weaviate 1.35 Release

Weaviate 1.35 introduces Object Time-to-Live (TTL), zstd compression support, flat index RQ quantization, multimodal support with Weaviate Embeddings, and runtime configurable OIDC certificates.

weaviate.io vector-db
111d

Iceberg in the Browser

In this post, we describe the current patterns for interacting with Iceberg Catalogs, and pose the question: could it be done from a browser? After elaborating on the DuckDB ecosystem changes required to unlock this capability, we demonstrate our approach to interacting with an Iceberg REST Catalog.

duckdb.org duckdb
124d

The Three Durable Function Forms

This article proposes a model extending generic durable functions into three forms: stateless functions, stateful function objects, and linear function chains. It aims to standardize terminology in durable execution engines by linking concepts like 'workflows' and 'activities' to their underlying execution forms.

jack-vanlightly.com architecture
129d

Context Engineering - LLM Memory and Retrieval for AI Agents

This article discusses context engineering, focusing on how AI agents manage LLM memory by selecting, retrieving, and organizing context from short-term and long-term memory. Context engineering is important for improving the reliability of AI agents in production.

weaviate.io vector-db
131d

The Durable Function Tree - Part 2

This post delves into the architecture of durable function trees, exploring their integration within larger systems and the advantages they offer for durable execution.

jack-vanlightly.com architecture
135d

The Durable Function Tree - Part 1

This article explores constructing workflows using durable function calls arranged in trees, built on durable promises and continuations.

jack-vanlightly.com architecture
135d

Writes in DuckDB-Iceberg

Over the past several months, the DuckDB Labs team has been hard at work on the DuckDB-Iceberg extension, with full read support and initial write support released in v1.4.0. Today, we are happy to announce that delete and update support for Iceberg v2 tables is available in v1.4.2!

duckdb.org duckdb
142d

Demystifying Determinism in Durable Execution

This article explains the concept of determinism within durable execution frameworks, focusing on identifying code sections that must be deterministic.

jack-vanlightly.com architecture
145d

Bringing RAG to Life with Dify and Weaviate

This article explains how to leverage the Dify and Weaviate integration for building Retrieval Augmented Generation (RAG) applications. This integration can be valuable for enhancing LLM applications with external knowledge.

weaviate.io vector-db
150d

The Growing Apache Polaris Ecosystem: The Iceberg Catalog Standard

Technical overview of Apache Polaris as the emerging open catalog standard for Iceberg. Covers multi-engine interoperability (Spark, Flink, Trino, StarRocks), built-in RBAC with table-level security, short-lived credential vending via cloud provider integrations, and Snowflake's managed Polaris offering.

dremio.com governance
150d

Weaviate 1.34 Release

Weaviate 1.34 introduces flat index support with RQ quantization, server-side batching improvements, new client libraries, and Contextual AI integration. These features offer potential performance and functionality improvements for the vector database.

weaviate.io vector-db
159d

Apache Doris Tops JSONBench in Cold Queries and Data Quality

Apache Doris achieves top performance in the JSONBench benchmark, particularly in cold query performance and data quality. The benchmark measures query performance and data handling capabilities when processing JSON data.

doris.apache.org analytics
164d

New trend: programming by kicking off parallel AI agents

This article highlights the emerging trend of developers utilizing multiple AI agents in parallel to generate code. It explores the potential benefits and challenges of this approach to programming.

blog.pragmaticengineer.com engineering
170d

The 2026 Open-Source Data Quality and Data Observability Landscape

Comprehensive landscape of open-source data quality tools including Soda Core, Elementary Data, dbt Tests, and DataKitchen TestGen. Explores how the community is democratizing observability capabilities previously locked behind expensive platforms, and how AI is being used to automate test generation.

datakitchen.io data-quality
186d

Apache Doris Up to 34x Faster Than ClickHouse in Real-Time Updates

Apache Doris is shown to be significantly faster than ClickHouse in real-time updates, according to benchmark results. Using ClickBench and SSB (Star Schema Benchmark), Apache Doris outperforms ClickHouse by 18-34x in SSB and 2.5-4.6x in ClickBench.

doris.apache.org analytics
200d

Your Data Contracts Are in the Wrong Spot

Argues that most organizations place data contracts in the wrong part of the lifecycle, causing enforcement gaps. Makes the case for contracts closer to the producer, not the consumer, with practical guidance on where they should sit architecturally.

dataproducts.substack.com data-quality
200d

Search Mode Benchmarking

Learn how Search Mode compares against Hybrid Search on the BEIR, LoTTe, BRIGHT, EnronQA, and WixQA Information Retrieval benchmarks.

weaviate.io vector-db
208d

Deep Dive: Data Pruning in Apache Doris

Apache Doris utilizes various data pruning techniques to optimize query performance by skipping unnecessary data processing. This article dives into the implementation and strategies behind these data pruning techniques within the Doris architecture.

doris.apache.org analytics
223d

Apache Doris Up To 40x Faster Than ClickHouse | OLAP Showdown Part 2

Apache Doris demonstrates superior performance over ClickHouse in various benchmarks including CoffeeBench, TPC-H, and TPC-DS. The benchmarks show that Doris consistently outperforms ClickHouse, showcasing its efficiency and speed in OLAP workloads.

doris.apache.org analytics
224d

Column-Level Lineage in Fabric Spark with OpenLineage, Stashed in Delta Lake

Production-oriented guide showing how to capture column-level lineage in Microsoft Fabric Spark (which ships with OpenLineage pre-installed). Describes a Spark Plugin architecture where a REST API collects lineage events from an OpenLineage Listener, buffering them into Delta Tables for queryable lineage.

rakirahman.me lineage
226d

Understanding Apache Fluss

This post delves into the internal workings of Apache Fluss, offering a detailed exploration for those interested in data system internals.

jack-vanlightly.com architecture
228d

A Conceptual Model for Storage Unification

This article introduces a conceptual model for storage unification, designed to present diverse storage systems and formats as a unified resource.

jack-vanlightly.com architecture
240d

Iceberg Catalogs 2025: Exploring Emerging Metadata Solutions

Compares next-generation Iceberg catalogs: Nessie (Git-style branching for data), Apache Polaris, Apache Gravitino, Lakekeeper, and Unity Catalog. Explains how these move beyond simple table-name resolution to provide version control, federated views, fine-grained policies, and multi-engine freedom.

e6data.com governance
247d

Data Quality Frameworks Comparison: Great Expectations, Soda Core, dbt, Deequ

Side-by-side technical comparison of Great Expectations, Soda Core, dbt tests, and Deequ across expressiveness, scalability, integration patterns, and ease of adoption. Provides a decision framework for which tool fits which use case, and discusses layering multiple tools across pipeline stages.

nurbolsakenov.com data-quality
252d

Data Pipeline Troubleshooting: Root Cause Analysis Through Lineage Metadata

Builds a complete order processing pipeline with Debezium CDC, Apache Flink transformations, and OpenLineage/Marquez for lineage tracking. Demonstrates how lineage metadata enables root cause analysis when pipeline failures occur, showing practical troubleshooting patterns with end-to-end visibility.

debezium.io lineage
272d

Apache Iceberg and the Catalog Layer

Features Russell Spitzer (Apache Iceberg/Polaris PMC) discussing the distinction between business catalogs (discovery/listing) and system catalogs (governing access by understanding table layout). Covers how Polaris vends short-lived credentials scoped to exact table directories.

getdbt.com governance
285d

Evaluating Long-Context Question & Answer Systems

This article covers evaluation metrics, how to build eval datasets, evaluation methodology, and a review of several benchmarks for long-context question and answer systems.

eugeneyan.com ml
301d

Data Contracts and Data Observability: Whatnot's Full Circle Journey to Data Trust

Production case study from Whatnot (live shopping marketplace) on combining data contracts with Monte Carlo observability. Their stack uses Snowflake, dbt, and Dagster. Shows how enforcing contracts while layering automated observability kept data incidents flat despite exponential data growth.

montecarlodata.com data-quality
308d

Native Data Lineage in Debezium with OpenLineage

Technical walkthrough of Debezium's built-in OpenLineage integration for automatic CDC lineage tracking. Explains how Debezium Server emits OpenLineage events natively using the Java SDK, modeling run/job/dataset entities without manual instrumentation, with Marquez as a lineage backend.

debezium.io lineage
310d

What's New with Databricks Unity Catalog at Data + AI Summit 2025

Covers Unity Catalog announcements: Iceberg catalog federation for governing tables in AWS Glue/Hive/Snowflake without copying data, Unity Catalog Metrics as first-class governed assets, column-level permissions for PII, and the new Discover experience for certified data products with AI-driven recommendations.

databricks.com governance
313d

More efficient multi-vector embeddings with MUVERA

Weaviate version 1.31 introduces the MUVERA encoding algorithm for multi-vector embeddings. The post explains the algorithm's details, including its functionality and use cases.

weaviate.io vector-db
318d

DuckLake: A Metadata Store for Data Lakes

DuckLake stores data lake metadata in a SQL database instead of files. 22-table schema replaces manifest files, enabling instant snapshot queries and ACID transactions without file listing overhead.
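
The core idea is easy to demonstrate: when snapshot and file metadata live in SQL tables, "which files belong to snapshot N?" becomes a single indexed query instead of listing and parsing manifest files. The two-table sketch below (using SQLite in memory) is a drastic simplification, not DuckLake's actual 22-table schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE snapshot (id INTEGER PRIMARY KEY, created_at TEXT);
    CREATE TABLE data_file (
        path TEXT, begin_snapshot INTEGER, end_snapshot INTEGER
    );
""")
db.execute("INSERT INTO snapshot VALUES (1, '2026-01-01'), (2, '2026-01-02')")
# A file is visible in a snapshot if it was added at or before it and
# not yet removed (end_snapshot IS NULL or later than the snapshot).
db.execute("INSERT INTO data_file VALUES ('part-0.parquet', 1, NULL)")
db.execute("INSERT INTO data_file VALUES ('part-1.parquet', 1, 2)")
db.execute("INSERT INTO data_file VALUES ('part-2.parquet', 2, NULL)")

visible = [
    row[0]
    for row in db.execute(
        "SELECT path FROM data_file "
        "WHERE begin_snapshot <= ? AND (end_snapshot IS NULL OR end_snapshot > ?) "
        "ORDER BY path",
        (2, 2),
    )
]
print(visible)  # ['part-0.parquet', 'part-2.parquet']
```

Snapshot isolation and ACID writes then come for free from the metadata database's own transactions.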

duckdb.org duckdb
347d

The Current State of Column-level Lineage

Explains the columnLineage dataset facet introduced in OpenLineage 0.9.0 for Spark integration. Covers how column-level lineage tracks which input fields produce each output field, its applications for GDPR/HIPAA/CCPA compliance, and the roadmap for extending support beyond Spark.

openlineage.io lineage
353d

Testing Custom Flink Jobs on Decodable

This article provides guidance on testing custom Flink jobs on Decodable, focusing on modular implementations to improve testability when dealing with external service dependencies. It addresses a common challenge in Flink development and offers practical solutions.

decodable.co flink
373d

Integrate Qdrant and Neo4j to Enhance Your RAG Pipeline

This article demonstrates integrating Neo4j with Qdrant to enhance RAG pipelines by enabling external vector searches; it guides users through a local setup with preloaded data, illustrating the practical aspects of this integration.

neo4j.com databases
446d

Building Knowledge Graph Agents With LlamaIndex Workflows

The article explains how to build knowledge graph agents using LlamaIndex workflows, offering a blueprint for constructing Text2Cypher agentic interfaces; this integration provides practical insights into developing agentic data pipelines.

neo4j.com databases
456d

Claude Converses With Neo4j Via MCP

This Neo4j Developer Blog post explains how to use Anthropic's Model Context Protocol (MCP) to give LLMs like Claude access to knowledge graphs in Neo4j.

neo4j.com knowledge-graphs
484d

Building Effective Agents

Anthropic's guide to building reliable AI agents: tool use patterns, prompt chaining, evaluation frameworks, error recovery, and when NOT to use agents.

anthropic.com ml
486d

Model Context Protocol: Open Standard for AI Tool Use

MCP standardizes how AI models connect to data sources and tools. Client-server architecture with typed resources, tool definitions, and prompts that any LLM application can implement.

modelcontextprotocol.io ml
510d

Effortless RAG With Text2CypherRetriever

The Text2CypherRetriever allows users to retrieve data from Neo4j using natural language, simplifying query generation for GenAI applications.

neo4j.com knowledge-graphs
533d

The DuckDB Local UI

DuckDB ships a built-in web UI for interactive SQL exploration, schema browsing, and result visualization -- no install needed beyond the CLI.

duckdb.org duckdb
534d

Why Do I Need CDC?

This technical blog post explores the importance of Change Data Capture (CDC) for developers. It covers the fundamentals of CDC, its common use cases, and the advantages of log-based CDC compared to other approaches. Understand how CDC can improve operational performance and enable real-time analytics.

decodable.co streaming
551d

Turn Your CSVs Into Graphs Using LLMs

The post details how to turn CSV files into graph models using LLMs, simplifying data relationships and enhancing insights.

neo4j.com knowledge-graphs
561d

Building a GraphRAG Agent With Neo4j and Milvus

Learn how to build a GraphRAG agent using Neo4j and Milvus, combining graph and vector search for enhanced retrieval, better context, and accurate answers.

neo4j.com knowledge-graphs
568d

Building Enterprise AI with Knowledge Graphs and LLMs

How enterprises combine knowledge graphs with LLMs: grounding responses in structured facts, reducing hallucinations, enabling explainable AI, and the architectural patterns for graph-augmented generation.

thenewstack.io ml
578d

Prefect 3.0: Workflow Orchestration Without the DAG

Prefect 3.0 drops DAGs entirely: Python-native flows with dynamic task creation, automatic retries, event-driven triggers, and a hosted platform that eliminates scheduler management.

prefect.io orchestration
581d

Building a Movie Recommendation System With Neo4j

Recommend movies to users based on their viewing histories and ratings. Learn how to set up Neo4j, map data into Java with the Neo4j Object Graph Mapper (Neo4j-OGM), and craft Cypher queries for recommendations.

neo4j.com knowledge-graphs
583d

Why Every AI Application Needs a Semantic Layer

LLMs generating SQL without a semantic layer produce inconsistent, wrong metrics. How the dbt Semantic Layer provides guardrails: metric definitions, entity relationships, and governed access for AI agents.

getdbt.com dbt
591d

Why Polars is Faster Than Pandas

Architecture-level comparison: Polars' Rust-based columnar engine with lazy evaluation, query optimization, and Apache Arrow memory vs Pandas' eager NumPy-backed row operations. Benchmarks on real workloads.

blog.jetbrains.com data-engineering
607d

How Snowflake Builds Its Query Optimizer

Inside Snowflake's Cascades-style query optimizer: join reordering, pruning with micro-partition statistics, adaptive execution, and how they test optimizer correctness at scale.

snowflake.com snowflake
612d

Ontologies for AI: Why Structure Still Matters

Ontologies provide the structured backbone that LLMs lack: taxonomies, controlled vocabularies, entity disambiguation, and how combining ontological reasoning with neural approaches produces more reliable AI systems.

poolparty.biz ml
615d

Apache Arrow DataFusion: A Fast Query Engine in Rust

DataFusion as a modular query engine: how it powers InfluxDB 3.0, Comet Spark accelerator, and Ballista distributed queries. Extensible optimizer, custom table providers, and user-defined functions in Rust.

arrow.apache.org arrow
617d

Why We Switched from Airflow to Dagster

The asset-centric paradigm shift: why defining what data should exist (Dagster assets) is better than defining how to compute it (Airflow tasks). Software-defined assets, IO managers, and testability.

dagster.io orchestration
622d

Why Iceberg Won the Table Format War

Analysis of how Iceberg's catalog-agnostic design, hidden partitioning, and multi-engine support gave it an architectural advantage over Delta Lake and Hudi.

blog.det.life iceberg
638d

Data Governance Without the Bureaucracy

Practical data governance: automated PII detection, column-level lineage, data contracts between teams, freshness SLAs, and how to implement governance incrementally without blocking teams.

montecarlodata.com data-engineering
643d

DuckDB Extensions: Building Your Own

How DuckDB's community extension system works: writing C++ extensions, the extension repository, signed distribution, and examples of spatial, httpfs, and Iceberg extensions.

duckdb.org duckdb
653d

GraphRAG: Knowledge Graph-Enhanced Retrieval for LLMs

Microsoft's GraphRAG approach: automatically building knowledge graphs from document corpora, community detection for topic summarization, and how graph-based retrieval answers global questions that vector search cannot.

microsoft.github.io ml
656d

Building an LLM Router for High-Quality and Cost-Effective Responses

This post describes a method for building an LLM router that dynamically selects the optimal LLM for a given request based on configurable criteria. It covers techniques for evaluating LLM performance, implementing routing logic, and optimizing for cost-effectiveness.
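
The routing logic can be reduced to a small sketch: estimate the difficulty of a request, then pick the cheapest model whose capability covers it. The model names, costs, and keyword heuristic below are illustrative assumptions, not the post's method (which evaluates LLM performance empirically).

```python
# Hypothetical model pool: capability tiers and per-1k-token costs.
MODELS = [
    {"name": "small-fast", "capability": 1, "cost_per_1k": 0.1},
    {"name": "medium", "capability": 2, "cost_per_1k": 1.0},
    {"name": "large-frontier", "capability": 3, "cost_per_1k": 10.0},
]

def estimate_difficulty(prompt: str) -> int:
    # Toy heuristic: long prompts or reasoning keywords need a stronger model.
    if any(kw in prompt.lower() for kw in ("prove", "plan", "multi-step")):
        return 3
    return 2 if len(prompt) > 200 else 1

def route(prompt: str) -> str:
    difficulty = estimate_difficulty(prompt)
    eligible = [m for m in MODELS if m["capability"] >= difficulty]
    # Cost-effectiveness: the cheapest model that is still capable enough.
    return min(eligible, key=lambda m: m["cost_per_1k"])["name"]

print(route("Translate 'hello' to French"))       # small-fast
print(route("Plan a multi-step data migration"))  # large-frontier
```

A production router would replace the keyword heuristic with a learned classifier or evaluation scores, but the select-cheapest-capable structure stays the same.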

anyscale.com ml
656d

Dynamic Tables in Snowflake: Declarative Data Pipelines

Snowflake Dynamic Tables: define a pipeline as a SQL query and let Snowflake handle scheduling, incremental refresh, and dependency management. Replaces streams + tasks for most use cases.

docs.snowflake.com snowflake
668d

How Stripe Builds Reliable Data Pipelines

Stripe's ledger system for financial data: immutable event log, double-entry accounting in the data warehouse, reconciliation pipelines, and how they ensure every cent is accounted for.
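
The invariant behind "every cent is accounted for" is simple to state: in double-entry accounting, each transaction's debits and credits net to zero, so reconciliation is a scan for transactions that don't. A minimal sketch, with hypothetical account names and amounts (not Stripe's actual schema):

```python
from collections import defaultdict

# Ledger entries: (txn_id, account, amount_cents);
# positive = debit, negative = credit.
LEDGER = [
    ("t1", "customer_cash", 1000), ("t1", "revenue", -1000),
    ("t2", "fees", 30), ("t2", "customer_cash", -30),
]

def unbalanced_transactions(ledger):
    totals = defaultdict(int)
    for txn_id, _account, amount in ledger:
        totals[txn_id] += amount
    # Reconciliation: any transaction whose entries don't net to zero.
    return {txn_id: total for txn_id, total in totals.items() if total != 0}

print(unbalanced_transactions(LEDGER))  # {} when the ledger balances
```

Because the event log is immutable, the same check can be replayed over any time window in the warehouse to catch drift between systems.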

stripe.com engineering
670d

ClickHouse vs Snowflake: A Practitioner's Perspective

Honest comparison of ClickHouse and Snowflake architectures for real-time analytics workloads, covering query latency, ingestion throughput, cost models, and operational complexity.

clickhouse.com clickhouse
678d

What We Learned from a Year of Building with LLMs

Hard-won lessons from practitioners: prompt engineering diminishing returns, when to fine-tune vs RAG, evaluation beyond vibes, cost optimization, and the reliability gap between demo and production.

oreilly.com ml
683d

Data Quality at Scale: Lessons from Airbnb

Airbnb's Midas data quality framework: automated anomaly detection, lineage-based impact analysis, SLA tracking, and self-healing pipelines at petabyte scale.

medium.com data-engineering
687d

LinkedIn's Real-Time Data Infrastructure

How LinkedIn processes 7 trillion events per day: Kafka for event transport, Samza for stream processing, Venice for derived data serving, and Brooklin for cross-DC replication.

engineering.linkedin.com streaming
697d

Dagster vs Airflow: An Honest Comparison

Asset-centric vs task-centric orchestration: how Dagster's software-defined assets, type system, and built-in IO managers compare to Airflow's DAG paradigm.

dagster.io orchestration
699d

ClickHouse vs PostgreSQL for Analytics: When to Switch

When Postgres analytics hits a wall: column compression, vectorized execution, and approximate query processing in ClickHouse vs row-oriented scans in Postgres. Migration patterns and hybrid architectures.

clickhouse.com clickhouse
701d

Kimball is Dead, Long Live Kimball

Why dimensional modeling still matters even though the ELT era made star schemas seem obsolete. The semantic layer as the modern replacement for physical dimension tables.

benn.substack.com data-engineering
704d

How Netflix Migrated from Hive to Iceberg

Netflix's migration from Hive to Iceberg at exabyte scale, including incremental processing patterns with Maestro orchestrator and Spark.

netflixtechblog.com data-engineering
711d

Knowledge Graphs for RAG: Beyond Vector Search

Why vector similarity alone fails for complex reasoning. Using Neo4j knowledge graphs alongside embeddings: entity extraction, relationship mapping, graph traversal for multi-hop queries, and hybrid retrieval.

blog.langchain.dev llm
711d
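
The multi-hop traversal the summary mentions is essentially a chained relation lookup, which is what pure vector similarity cannot express. A minimal sketch over a hypothetical toy graph (entity and relation names invented for illustration):

```python
# Tiny knowledge graph as an adjacency map: (subject, relation) -> objects.
graph = {
    ("acme_corp", "acquired"): ["widgets_inc"],
    ("widgets_inc", "founded_by"): ["jane_doe"],
    ("jane_doe", "advises"): ["gadget_labs"],
}

def multi_hop(start, relations):
    """Follow a chain of relations from a start entity -- the kind of
    multi-hop query that embedding similarity alone cannot answer."""
    frontier = [start]
    for rel in relations:
        next_frontier = []
        for entity in frontier:
            next_frontier.extend(graph.get((entity, rel), []))
        frontier = next_frontier
    return frontier

# "Who founded the company Acme acquired?"
assert multi_hop("acme_corp", ["acquired", "founded_by"]) == ["jane_doe"]
```

In hybrid retrieval, embeddings find the entry entities and the graph supplies the hops.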

How Figma Scaled to Multiple Databases

Figma's horizontal sharding journey: from a single Postgres instance to 100+ shards using PgBouncer, application-level routing, and their custom migration tooling for zero-downtime resharding.

figma.com postgres
721d

Text-to-SQL is Harder Than You Think

Why LLM-generated SQL fails in production: schema ambiguity, implicit business logic, multi-table joins, aggregate semantics, and why a semantic layer is the real solution instead of better prompting.

numbersstation.ai ml
724d

Data Contracts: The Missing Link in Data Mesh

How data contracts formalize the interface between producers and consumers, with practical schema enforcement patterns using protobuf, JSON Schema, and dbt tests.

dataproducts.substack.com data-engineering
727d
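
A contract check of the kind the article describes reduces to validating records against an agreed schema. A hand-rolled sketch with a hypothetical contract (in practice protobuf, JSON Schema, or dbt tests would enforce this):

```python
# Producer/consumer contract for an (invented) orders event.
CONTRACT = {
    "order_id": str,
    "amount_cents": int,
    "currency": str,
}

def validate(record, contract=CONTRACT):
    """Return a list of contract violations; empty means valid."""
    errors = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"bad type for {field}")
    return errors

assert validate({"order_id": "o1", "amount_cents": 500, "currency": "USD"}) == []
assert validate({"order_id": "o2", "amount_cents": "500"}) == [
    "bad type for amount_cents",
    "missing field: currency",
]
```

The point of the formal contract is that the same check runs on the producer side before publishing, not only downstream.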

Retrieval Augmented Generation: Beyond the Basics

Advanced RAG patterns: multi-query retrieval, recursive summarization, parent-child chunk linking, self-RAG with reflection, and corrective RAG that verifies its own retrievals.

blog.langchain.dev llm
729d

Cube.js: The Headless BI Semantic Layer

How Cube's semantic layer sits between databases and consumers: pre-aggregations, access control, caching, and serving consistent metrics to dashboards, notebooks, and LLMs via API.

cube.dev analytics
731d

Practical Guide to RAG Pipeline Evaluation

End-to-end guide for building production RAG systems: chunking strategies, embedding model selection, retrieval metrics (MRR, NDCG), reranking, and hallucination detection.

anyscale.com ml
734d
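
MRR, one of the retrieval metrics listed, rewards ranking the first relevant document early; a minimal implementation for illustration:

```python
def mean_reciprocal_rank(results):
    """MRR over queries: each item is (ranked_doc_ids, relevant_id).
    Reciprocal rank is 1/position of the first relevant hit, 0 if absent."""
    total = 0.0
    for ranked, relevant in results:
        for pos, doc in enumerate(ranked, start=1):
            if doc == relevant:
                total += 1.0 / pos
                break
    return total / len(results)

queries = [
    (["d3", "d1", "d7"], "d1"),  # relevant doc at rank 2 -> 0.5
    (["d2", "d9", "d4"], "d2"),  # relevant doc at rank 1 -> 1.0
]
assert mean_reciprocal_rank(queries) == 0.75
```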

The dbt Semantic Layer: Metrics as Code

How the dbt Semantic Layer works: MetricFlow engine, semantic models, dimension/measure definitions, and querying metrics from any BI tool via the JDBC/GraphQL API.

getdbt.com dbt
739d

pgvector: Embeddings and Vector Search in Postgres

Production-grade vector search with pgvector: HNSW vs IVFFlat index tradeoffs, optimal dimensionality, bulk loading strategies, and benchmarks against dedicated vector databases.

supabase.com postgres
747d
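
The baseline that HNSW and IVFFlat indexes approximate is an exact scan over every row. A pure-Python sketch of cosine-distance k-NN (pgvector's `<=>` operator computes the same distance in SQL; data here is invented):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, the metric behind pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def nearest(query, rows, k=2):
    """Exact k-NN scan: the brute-force baseline that HNSW/IVFFlat
    indexes approximate to avoid comparing against every row."""
    return sorted(rows, key=lambda r: cosine_distance(query, r[1]))[:k]

rows = [("doc_a", [1.0, 0.0]), ("doc_b", [0.7, 0.7]), ("doc_c", [0.0, 1.0])]
top = nearest([1.0, 0.1], rows, k=2)
assert [doc_id for doc_id, _ in top] == ["doc_a", "doc_b"]
```

The index tradeoff in the article is exactly this: exact scans give perfect recall but scale linearly with table size, while HNSW/IVFFlat trade a little recall for sublinear search.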

RDF, SPARQL, and the Semantic Web in 2024: Still Relevant?

The semantic web stack (RDF, OWL, SPARQL) is quietly powering enterprise knowledge management. How knowledge graphs, linked data, and ontologies are being integrated with LLMs and modern data architectures.

stardog.com databases
752d

DuckDB as the New jq

Using DuckDB as a command-line JSON processor, replacing jq for complex data transformations with SQL syntax.

pgrs.net duckdb
759d

Postgres Performance Tuning: The Definitive 2024 Guide

Deep dive into Postgres internals: shared_buffers vs OS cache, parallel query tuning, JIT compilation tradeoffs, connection pooling with PgBouncer, and VACUUM strategies for write-heavy workloads.

crunchydata.com postgres
765d

Postgres is Enough

The case for using Postgres as your only database: JSONB for documents, pg_cron for scheduling, pgvector for embeddings, logical replication for CDC, and extensions for everything else.

amazingcto.com postgres
768d

The Semantic Layer: A New Foundation for Data and AI

What a semantic layer actually is beyond marketing: universal metric definitions, entity relationships, access policies, and why it matters more in the age of LLM-generated SQL.

atscale.com data-engineering
770d

How Discord Stores Trillions of Messages with ScyllaDB

Discord's migration from Cassandra to ScyllaDB for their message store: hot partition detection, consistent hashing, compaction tuning, and keeping P99 read latencies low across trillions of messages.

discord.com engineering
774d
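
Consistent hashing, one of the techniques named above, maps each partition key to a position on a ring of virtual nodes so that adding or removing a node moves only a fraction of the keys. A toy ring, not Scylla's implementation:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring with virtual nodes: a sketch of how a
    Scylla/Cassandra-style store maps a partition key to an owning node."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node gets many positions for smoother balance.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, partition_key: str) -> str:
        # First ring position clockwise of the key's hash owns the key.
        idx = bisect.bisect(self._keys, _hash(partition_key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("channel:1234")
assert owner in {"node-a", "node-b", "node-c"}
assert ring.node_for("channel:1234") == owner  # placement is deterministic
```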

ClickHouse MergeTree Internals

How ClickHouse's MergeTree engine works: LSM-tree-inspired sorted parts, sparse primary index, data skipping indexes, background merges, and why it achieves sub-second queries on billions of rows.

clickhouse.com clickhouse
779d
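
The sparse primary index can be pictured as one min/max entry per granule (a block of sorted rows), letting the engine skip granules whose key range cannot match the predicate. A toy sketch of the pruning step, not MergeTree's actual data structures:

```python
# One sparse index entry per granule: min/max of the sort key.
granules = [
    {"min": 0,    "max": 999,  "part": "granule-0"},
    {"min": 1000, "max": 1999, "part": "granule-1"},
    {"min": 2000, "max": 2999, "part": "granule-2"},
]

def granules_to_read(lo, hi):
    """Return only granules whose [min, max] range overlaps [lo, hi];
    everything else is pruned without touching the data."""
    return [g["part"] for g in granules if g["max"] >= lo and g["min"] <= hi]

# WHERE key BETWEEN 1500 AND 1600 touches a single granule:
assert granules_to_read(1500, 1600) == ["granule-1"]
assert granules_to_read(900, 1100) == ["granule-0", "granule-1"]
```

Because the rows inside each granule are sorted by the primary key, the index stays tiny (one entry per block rather than per row) yet prunes most of the table for selective queries.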

Fine-tuning LLMs Is Not As Hard As You Think

Practical fine-tuning guide using TRL, QLoRA, and Flash Attention 2. Covers dataset preparation, hyperparameter selection, evaluation, and deployment with real cost breakdowns.

philschmid.de ml
781d

Photon: The Next Generation Spark Engine at Databricks

Photon is a C++ vectorized execution engine that replaces Spark's JVM-based execution engine for scan-heavy workloads, achieving 3-8x speedups through SIMD, memory-mapped I/O, and adaptive execution.

databricks.com databricks
789d

The Rise of the Analytics Engineer

How the analytics engineer role evolved from a dbt power user to a critical bridge between data engineering and business intelligence, with practical career guidance.

getdbt.com dbt
794d

WarpStream: Kafka Without the Disks

WarpStream's architecture: a Kafka-compatible broker that writes directly to S3 instead of local disks. No inter-broker replication, no partition reassignment, and 80% cheaper than self-hosted Kafka.

warpstream.com kafka
794d

Data Vault 2.0 in Practice: When and Why

Practical guide to Data Vault 2.0: hub-link-satellite patterns, hash keys for parallelism, point-in-time tables, and when Data Vault makes sense vs One Big Table or dimensional modeling.

scalefree.com data-engineering
801d

Apache Arrow: The Universal Columnar Format

How Arrow's in-memory columnar format enables zero-copy data exchange between Spark, DuckDB, Pandas, Polars, and databases via ADBC and Flight SQL.

arrow.apache.org duckdb
804d

Designing Data-Intensive Applications in 2024

Martin Kleppmann's reflections on how the landscape has changed since DDIA: new consensus protocols, CRDTs in production, the shift to event streaming, and what he'd write differently today.

martin.kleppmann.com engineering
810d

Querying Parquet Files on S3 with DuckDB

How DuckDB queries Parquet files directly on S3 via the httpfs extension: projection and filter pushdown, partial reads over HTTP range requests, and globbing Hive-partitioned datasets.

duckdb.org duckdb
814d

Emerging Architectures for LLM Applications

Reference architecture for LLM applications covering RAG pipelines, embedding models, vector databases, orchestration frameworks, and evaluation patterns.

a16z.com ml
820d

Friendly SQL in DuckDB

DuckDB's SQL dialect extensions that make queries more readable: GROUP BY ALL, SELECT * EXCLUDE, implicit column aliases, and string slicing.

duckdb.org duckdb
825d

The Illustrated Stable Diffusion

Visual walkthrough of how Stable Diffusion works: the latent space, the denoising U-Net, CLIP text encoder, classifier-free guidance, and how LoRA fine-tuning adapts the model.

jalammar.github.io ml
828d

Snowflake's Architecture: A Deep Dive

A look inside Snowflake's multi-cluster, shared-data architecture, covering the storage layer, virtual warehouses, metadata store, and query optimization.

snowflake.com snowflake
830d

Mixture of Experts: How Sparse Models Scale

The MoE architecture behind Mixtral and Switch Transformer: expert routing, load balancing, training instability, and why sparse models achieve better performance per FLOP than dense models.

huggingface.co ml
832d

Attention Is All You Need (Explained)

The definitive visual explanation of the Transformer architecture: self-attention, multi-head attention, positional encoding, and how information flows through encoder-decoder layers.

jalammar.github.io ml
835d
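
The self-attention step at the heart of the Transformer reduces to softmax(Q K^T / sqrt(d)) V. A pure-Python sketch on tiny matrices, just to make the data flow concrete:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on plain lists:
    softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # Output is the weight-averaged mixture of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
(row,) = attention(Q, K, V)
assert row[0] > row[1]  # the query attends more to the matching key
```

Multi-head attention runs several of these in parallel over learned projections of Q, K, and V.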

Data Contracts and Data Observability: Whatnot’s Full Circle Journey to Data Trust

Whatnot went from no modern data stack to processing tens of millions of events across hundreds of event types each day. Zack Klein explains how Whatnot leverages data contracts and data observability to achieve high-quality data at scale for stakeholders, focusing on a small team's approach to data quality.

montecarlodata.com data-engineering
835d

One Billion Row Challenge in SQL

Solving the viral 1 Billion Row Challenge using DuckDB SQL instead of Java -- demonstrating that a single SQL query on a laptop can process 1B rows in under 4 seconds.

rmoff.net duckdb
837d

Data Contracts for the Warehouse

This article focuses on data contracts for data warehouses, emphasizing programmatic accountability in batch data processing. It outlines the importance of defining and enforcing data contracts to improve data quality and reliability.

dataproducts.substack.com data-quality
1179d

Deep Dive: Data Ingest in a Third Generation ML Architecture

This Anyscale blog post dives into data ingest in a third-generation ML architecture, specifically using Ray Data. It provides code samples to illustrate how distributed libraries can improve performance by exploiting distributed memory bandwidth.

anyscale.com ml
1601d