# Introduction to dpm
Welcome to the dpm documentation!
dpm stands for data package manager.
Using the dpm CLI or the dpm web application, you can define, build, and publish data packages: code packages with a live, embedded connection to a data source. Users can upgrade to data packages for lower latency, higher reliability, stricter access controls, schema evolution, and time travel.
## Why data packages?
As products scale, database workloads evolve. As organizations scale, teams often need to leverage other teams' data to build their products. As a result, engineering teams often fall into one of two traps.
| Sub-optimal use of existing components | Complex web of pipelines, streams, and databases |
|---|---|
| Stretching Postgres to the max | Copying data from one team's database to another's |
| Using MongoDB for analytics | Introducing specialized databases for each incremental use case |
| Querying a data warehouse directly from production | Heavy coordination through schema changes & migrations |
Data packages are designed as a replacement for this kind of low-level data engineering work.
## What is a data package?
Data packages enable engineers to safely query data, no matter where it is stored.
Data packages include:
- Declarative query interfaces with type-safety in the developer’s preferred runtime language, such as Python or TypeScript
- Embedded access policies for federated governance
- Change management through a familiar package versioning workflow
- Performance & consistency configurations for highly reliable & low latency apps
- Metadata, notably a version, maintainer, and description of intended usage and constraints
Data packages replace:
- Pipelines into operational & online analytics stores
- Caches and/or read-replicas
- API & SDK development
## Features & use cases
- Securely distribute and import data products
- Query & enrich data from any source like a data micro-service
- Take analytical workloads off operational databases with no infrastructure setup
- Build apps & services with generated, type-safe query interfaces derived from a dataset schema
- Query data immediately without waiting on direct database access
- Perform time series bucketing, aggregations, grouping, filtering and sorting without writing complicated SQL queries
- Run analytical queries over large datasets with low latency and without hitting the underlying storage system
- Use data package versions to safely update your schemas without impacting downstream consumers
- Leverage data from Snowflake, BigQuery, or Databricks in customer-facing applications
- Look up single-row records with single-digit-millisecond response times
- Turn your dbt models into data packages in minutes
## How do data packages work?
Data producers define a data package by selecting tables from a data source. Then, a client package is generated from the tables' schemas, with configurable query interfaces in popular runtimes like Python and TypeScript.
The generated package can be published to registries like npm and PyPI, so consumers can install it using a familiar `npm` or `pip` workflow. They can also safely upgrade as the schema or other properties of the data package are updated.
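For example, installing a client package follows the standard registry workflow. The package names here are taken from the demo package used in the code examples on this page; your own package names will differ:

```shell
# Install the TypeScript client from npm
npm install snowflake-demo-package

# Or install the Python client from PyPI
pip install snowflake-demo-package-fast
```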
The package is imported as a library dependency into a code project. The client enables users to write queries with type safety and helper functions for common date functions, aggregates, filters, and lookups. The query is routed through an agent process, which translates the query into a source-appropriate dialect.
**Node.js**

```typescript
import { FactsAppEngagement as FactsAppEngagementSnow } from 'snowflake-demo-package';

// Get avg time in app and user counts
// broken down by app and day of week
async function main() {
  let { appTitle, foregroundduration, panelistid, starttimestamp } =
    FactsAppEngagementSnow.fields;
  let query = FactsAppEngagementSnow.select(
    appTitle.as("App Name"),
    foregroundduration.avg().as("Avg Time in App"),
    panelistid.countDistinct().as("User Count"),
    starttimestamp.day.as("Day of Week")
  );
  query.compile().then((data) => console.log("Compiled query: ", data));
  query.execute().then((data) => console.log(data));
}

main().catch(console.error);
```
**Python**

```python
import asyncio
from pprint import pprint

from snowflake_demo_package_fast import FactsAppEngagement

# Get avg time in app and user counts
# broken down by app and day of week
async def query():
    app_title = FactsAppEngagement.fields.app_title
    foregroundduration = FactsAppEngagement.fields.foregroundduration
    panelistid = FactsAppEngagement.fields.panelistid
    starttimestamp = FactsAppEngagement.fields.starttimestamp

    query = FactsAppEngagement.select(
        app_title.with_alias("App_Name"),
        foregroundduration.avg().with_alias("Average_Time_in_App"),
        panelistid.count_distinct().with_alias("User_Count"),
        starttimestamp.day.with_alias("Day_of_Week"),
    ).limit(10)

    compiled_query = await query.compile()
    results = await query.execute()
    print(f"Compiled query:\n{compiled_query}")
    print("Results:")
    pprint(results)

asyncio.run(query())
```
## Learn more
To stay up to date with dpm, be sure to follow @patch_data and @dpminstall on Twitter/X!
If you have questions about anything related to dpm, you're welcome to ask on GitHub Discussions.