Consolidate Prompt Schemas For Data Consistency

by Alex Johnson

Consolidate Prompt Schemas to Achieve a Single Source of Truth

Refactoring prompt schemas is a crucial step toward a more maintainable, consistent, and reliable system. The current architecture scatters schema definitions across multiple locations, which causes schema drift, manual duplication, inconsistent naming conventions, and discrepancies between SDK and API schemas. This article analyzes these problems, proposes a layered architecture built around a single source of truth, and outlines a phased migration plan. It also covers the domain schema, the boundary schemas derived from it (storage, API, and form), and the integration with the optimization studio DSL (Domain-Specific Language).

The Problem: A Scattered Landscape of Prompt Schemas

The current state of the prompt configuration schemas is fragmented, with definitions spread across numerous locations. This lack of a single source of truth leads to the following issues:

  • Schema Drift: Inconsistencies arise when a schema is updated in one location but not in the others, leading to bugs and unexpected behavior. A recent bug illustrates this: the inputs field was validated with .min(1) in one definition and with .default([]) in another.
  • Manual Duplication: Redundant definitions and the need to manually copy and paste schemas increase the risk of errors and make maintenance more complex.
  • Inconsistent Naming Conventions: The use of both snake case and camel case across different parts of the codebase requires ad-hoc transformations, increasing complexity.
  • SDK Divergence: SDK schemas often lag behind or diverge from the API schemas, leading to compatibility issues and frustration for developers.

This fragmented approach creates a maintenance nightmare and hinders the system's overall reliability. The key to fixing these issues is to consolidate the prompt schemas and designate a single source of truth.
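
The drift described above can be reproduced in a few lines. The following is a minimal sketch, assuming Zod and two hypothetical hand-maintained copies of the same schema (apiInputSchema and storageSchema). The field shape is invented for illustration and is not taken from the codebase.

```typescript
import { z } from "zod";

// Hypothetical field shape, for illustration only.
const inputSchema = z.object({ identifier: z.string(), type: z.string() });

// Copy 1 (say, the API boundary): at least one input is required.
const apiInputSchema = z.object({ inputs: z.array(inputSchema).min(1) });

// Copy 2 (say, storage): a missing list silently becomes [].
const storageSchema = z.object({ inputs: z.array(inputSchema).default([]) });

// A prompt saved without inputs passes the storage copy...
const stored = storageSchema.parse({});

// ...but the same record is rejected by the API copy.
const apiResult = apiInputSchema.safeParse(stored);

console.log(stored.inputs.length, apiResult.success);
```

Both copies look reasonable in isolation. The mismatch only shows up when data crosses from one boundary to the other, which is why a single definition is easier to keep correct.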

Current State: A Detailed Look at Schema Locations and Specific Issues

The existing architecture involves several locations where prompt schemas are defined. These locations, each serving a different purpose, contribute to the overall complexity. Here's a breakdown:

  • prompts/schemas/field-schemas.ts: Defines atomic field schemas. This is a good example of reusability.
  • server/prompt-config/repositories/llm-config-version-schema.ts: Defines storage schema, which uses a mixed approach with snake case and versioning.
  • app/api/prompts/[[...route]]/schemas/outputs.ts: Defines the API response. It now derives from the storage schema.
  • app/api/prompts/[[...route]]/schemas/inputs.ts: Defines API input. It is currently manual and uses field schemas.
  • prompts/schemas/form-schema.ts: Defines the UI form, which derives from storage.
  • server/api/routers/prompts/prompts.trpc-router.ts: Contains tRPC input, which currently uses manual inline schemas.
  • optimization_studio/types/dsl.ts: Defines DSL types, creating a parallel type system (Field, LlmPromptConfigComponent).
  • prompts/utils/llmPromptConfigUtils.ts: Contains converters that perform manual mapping between form and DSL.
  • typescript-sdk/.../schema/prompt.schema.ts: Defines SDK types, which duplicates everything.

Specific Issues:

  1. Naming Inconsistency: Storage uses snake case (e.g., max_tokens) while the API uses camel case (e.g., maxTokens), which requires ad-hoc transformations throughout the codebase.
  2. tRPC Router Duplication: prompts.trpc-router.ts defines its input schemas manually and inline; they should be derived from a shared input schema.
  3. TypeScript SDK Duplication: The SDK duplicates schemas defined elsewhere, making it prone to inconsistencies.
  4. Service Type Manually Defined: prompt.service.ts manually defines a VersionedPrompt type with many fields; the type should be inferred from a schema.
  5. Form Schema Different Structure: form-schema.ts has an extra wrapper (configData) not present in storage or the API.
  6. Optimization Studio DSL - Parallel Type System: The DSL has a separate type system that must be kept in sync with prompt schemas via manual converters.
  7. Manual Converters Between Systems: llmPromptConfigUtils.ts contains manual mapping code (e.g., promptConfigFormValuesToOptimizationStudioNodeData), which can easily break if either side changes.

These issues highlight the need for a more structured and unified approach. The goal is to move towards a system where a single source of truth drives all schema definitions.

Proposed Architecture: A Layered Approach to Schema Management

The proposed architecture introduces a layered approach to schema management, with the domain schema at its core. This design promotes a single source of truth, reduces duplication, and improves consistency.

Layer 1: Domain Schema (Single Source of Truth)

This layer defines the core prompt configuration schema. It is the single source of truth for all prompt-related data. The schema uses camel case for consistency. The key files in this layer include:

  • langwatch/src/prompts/schemas/domain/config-data.schema.ts: Defines the core prompt configuration data.
  • langwatch/src/prompts/schemas/domain/version-metadata.schema.ts: Contains version tracking fields.
  • langwatch/src/prompts/schemas/domain/prompt-config.schema.ts: Defines the full versioned prompt.
  • langwatch/src/prompts/schemas/domain/index.ts: Exports the schemas.

// config-data.schema.ts
import { z } from "zod";
// messageSchema, inputSchema, outputSchema, and the other field schemas used below
// are assumed to be imported from the atomic field schemas (prompts/schemas/field-schemas.ts).

export const configDataSchema = z.object({
    prompt: z.string(),
    messages: z.array(messageSchema).default([]),
    inputs: z.array(inputSchema).default([]),
    outputs: z.array(outputSchema).min(1),
    model: z.string().min(1),
    temperature: z.number().optional(),
    maxTokens: z.number().optional(),
    demonstrations: nodeDatasetSchema.optional(),
    promptingTechnique: promptingTechniqueSchema.optional(),
    responseFormat: responseFormatSchema.optional(),
});

export type ConfigData = z.infer<typeof configDataSchema>;
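
To show how such a schema behaves when parsing, here is a reduced, self-contained sketch. The field schemas (messageSchema, inputSchema, outputSchema) are simplified stand-ins whose shapes are assumptions, and the model name is only a placeholder string.

```typescript
import { z } from "zod";

// Stand-ins for the atomic field schemas; the real shapes may differ.
const messageSchema = z.object({
  role: z.enum(["system", "user", "assistant"]),
  content: z.string(),
});
const inputSchema = z.object({ identifier: z.string(), type: z.string() });
const outputSchema = z.object({ identifier: z.string(), type: z.string() });

// Reduced version of the domain schema shown above.
const configDataSchema = z.object({
  prompt: z.string(),
  messages: z.array(messageSchema).default([]),
  inputs: z.array(inputSchema).default([]),
  outputs: z.array(outputSchema).min(1),
  model: z.string().min(1),
  temperature: z.number().optional(),
  maxTokens: z.number().optional(),
});

// The static type is inferred, so it cannot drift from the runtime schema.
type ConfigData = z.infer<typeof configDataSchema>;

// Omitted lists are filled in by the schema defaults.
const parsed: ConfigData = configDataSchema.parse({
  prompt: "You are a helpful assistant.",
  outputs: [{ identifier: "output", type: "str" }],
  model: "openai/gpt-4o-mini",
});

// Violations of min(1) are reported instead of being stored.
const invalid = configDataSchema.safeParse({ prompt: "x", outputs: [], model: "" });

console.log(parsed.messages, parsed.inputs, invalid.success);
```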

Layer 2: Boundary Schemas (Derived)

This layer derives schemas from the domain schema, tailoring them for specific use cases such as storage, API input/output, and UI forms. Key files include:

  • langwatch/src/prompts/schemas/boundaries/storage.schema.ts: Transforms the domain schema to snake case for database storage.
  • langwatch/src/prompts/schemas/boundaries/api-input.schema.ts: Defines the API input schema, using .partial() to make fields optional for create/update operations.
  • langwatch/src/prompts/schemas/boundaries/api-output.schema.ts: Defines the API response schema, including metadata.
  • langwatch/src/prompts/schemas/boundaries/form.schema.ts: Defines the form schema, adding UI-specific concerns.

// storage.schema.ts
import { configDataSchema } from "../domain";
import { camelToSnake } from "./transformers";

// Validate against the camel case domain schema, then emit snake case keys for storage.
export const storageConfigDataSchema = configDataSchema.transform(camelToSnake);

// api-input.schema.ts
import { configDataSchema } from "../domain";
// handleSchema and scopeSchema are assumed to be defined elsewhere.

export const createPromptInputSchema = configDataSchema.partial().extend({
    handle: handleSchema,
    scope: scopeSchema.optional(),
});

// api-output.schema.ts
import { z } from "zod";
import { configDataSchema } from "../domain";

export const apiResponseSchema = z.object({
    id: z.string(),
    handle: z.string().nullable(),
    // ...metadata
}).merge(configDataSchema);
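
The derivations above can be exercised end to end with a reduced schema. This is a sketch under stated assumptions: the domain schema is trimmed to three fields, camelToSnake is a shallow stand-in for the transformer, and the handle rule (a non-empty string) is invented in place of handleSchema.

```typescript
import { z } from "zod";

// Trimmed domain schema, for illustration only.
const configDataSchema = z.object({
  prompt: z.string(),
  model: z.string().min(1),
  maxTokens: z.number().optional(),
});

// Shallow stand-in for the camel case to snake case transformer.
const camelToSnake = (obj: Record<string, unknown>): Record<string, unknown> =>
  Object.fromEntries(
    Object.entries(obj).map(([key, value]) => [
      key.replace(/[A-Z]/g, (c) => `_${c.toLowerCase()}`),
      value,
    ]),
  );

// Storage boundary: same validation rules, snake case output.
const storageConfigDataSchema = configDataSchema.transform(camelToSnake);

// API input boundary: every config field optional, handle required.
const createPromptInputSchema = configDataSchema.partial().extend({
  handle: z.string().min(1), // hypothetical stand-in for handleSchema
});

const stored = storageConfigDataSchema.parse({ prompt: "Hi", model: "gpt-4o", maxTokens: 256 });
const draft = createPromptInputSchema.safeParse({ handle: "support-bot" });
const missingHandle = createPromptInputSchema.safeParse({ prompt: "Hi" });

console.log(stored, draft.success, missingHandle.success);
```

Because both boundary schemas are built from configDataSchema, adding a field to the domain schema changes all of them at once; nothing has to be copied by hand.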

Layer 3: Transform Utilities

This layer provides utilities for transforming data between different formats, such as snake case and camel case. This reduces the need for ad-hoc transformations throughout the codebase. The key file is:

  • langwatch/src/prompts/schemas/boundaries/transformers.ts: Contains functions for converting between snake case and camel case.

// transformers.ts
export function snakeToCamel<T extends Record<string, unknown>>(obj: T) {
    // max_tokens → maxTokens
}

export function camelToSnake<T extends Record<string, unknown>>(obj: T) {
    // maxTokens → max_tokens
}
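
The function bodies are left open above. One possible way to fill them in is sketched below, assuming the conversion should recurse into nested objects and arrays. The helper names (snakeKey, camelKey, mapKeysDeep) are hypothetical, and the sketch only handles plain JSON-like data (no Date or class instances).

```typescript
// Key-level conversions. Keys that start with a capital letter or contain
// consecutive capitals (e.g. "ID") are not handled specially in this sketch.
const snakeKey = (key: string): string =>
  key.replace(/[A-Z]/g, (c) => `_${c.toLowerCase()}`);
const camelKey = (key: string): string =>
  key.replace(/_([a-z])/g, (_match, c: string) => c.toUpperCase());

// Recursively rebuild plain objects and arrays with converted keys.
function mapKeysDeep(value: unknown, mapKey: (key: string) => string): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => mapKeysDeep(item, mapKey));
  }
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, inner]) => [
        mapKey(key),
        mapKeysDeep(inner, mapKey),
      ]),
    );
  }
  return value; // primitives pass through unchanged
}

function snakeToCamel<T extends Record<string, unknown>>(obj: T): Record<string, unknown> {
  // max_tokens → maxTokens
  return mapKeysDeep(obj, camelKey) as Record<string, unknown>;
}

function camelToSnake<T extends Record<string, unknown>>(obj: T): Record<string, unknown> {
  // maxTokens → max_tokens
  return mapKeysDeep(obj, snakeKey) as Record<string, unknown>;
}

const original = {
  maxTokens: 10,
  responseFormat: { jsonSchema: { name: "x" } },
  inputs: [{ valueType: "str" }],
};
const snake = camelToSnake(original);
console.log(JSON.stringify(snake));
```

A deep conversion also rewrites keys inside user-supplied payloads, such as a JSON schema nested under responseFormat. If those payloads must be stored verbatim, the recursion would need to stop at such fields; that choice is left open here.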

Layer 4: DSL Integration

This layer focuses on integrating the domain schema with the DSL used in the optimization studio. The goal is to derive DSL types from the domain schema, eliminating the need for a separate parallel type system. The key file is:

  • optimization_studio/types/dsl.ts: Updated to derive DSL types from the domain schema.

// optimization_studio/types/dsl.ts
import { configDataSchema, type ConfigData } from "~/prompts/schemas/domain";
// Signature and VersionMetadata are assumed to be declared or imported elsewhere in this file.

// Derive DSL types from domain schema instead of parallel definitions
export type LlmPromptConfigComponent = Signature & {
    configId?: string;
    handle?: string | null;
    versionMetadata?: VersionMetadata;
    // Derive from domain - no more manual Field[] definitions
    inputs: ConfigData["inputs"];
    outputs: ConfigData["outputs"];
    configData: ConfigData; // Or flatten parameters into configData
};
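
One detail is worth checking when applying this pattern. If the base type already declares inputs and outputs, a plain TypeScript intersection combines the old and new property types rather than replacing them. The sketch below uses hypothetical stand-ins for Signature, its legacy field type, and ConfigData (the real definitions are not shown in this article) and omits the legacy properties before adding the derived ones.

```typescript
// All shapes below are hypothetical stand-ins, not the real definitions.
type DomainField = { identifier: string; type: string };
type ConfigData = {
  inputs: DomainField[];
  outputs: DomainField[];
  model: string;
  maxTokens?: number;
};

// Stand-in for an existing DSL base type that declares its own field lists.
type LegacyField = { identifier: string; type: string; optional?: boolean };
type Signature = { name: string; inputs: LegacyField[]; outputs: LegacyField[] };

// Omit the legacy lists first so the domain types replace them
// instead of being intersected with them.
type LlmPromptConfigComponent = Omit<Signature, "inputs" | "outputs"> & {
  configId?: string;
  handle?: string | null;
  inputs: ConfigData["inputs"];
  outputs: ConfigData["outputs"];
  configData: ConfigData;
};

const configData: ConfigData = {
  inputs: [{ identifier: "question", type: "str" }],
  outputs: [{ identifier: "answer", type: "str" }],
  model: "gpt-4o",
};

// The node reuses the domain values directly; no converter is involved.
const node: LlmPromptConfigComponent = {
  name: "Answer question",
  inputs: configData.inputs,
  outputs: configData.outputs,
  configData,
};

console.log(node.inputs[0]?.identifier);
```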

Layer 5: SDK Types (Generated)

This layer generates SDK types from the OpenAPI specification, eliminating the need for manually maintained SDK schemas. Key files include:

  • typescript-sdk: Deletes manual schemas and uses only generated OpenAPI types.

// typescript-sdk - DELETE manual schemas
// Import from generated OpenAPI types only
import type { paths } from "@/internal/generated/openapi/api-client";

export type PromptResponse = paths["/api/prompts/{id}"]["get"]["responses"]["200"]["content"]["application/json"];
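
The long indexed-access chain can be wrapped in a small helper type. The sketch below is illustrative only: the paths interface is a hand-written, hypothetical excerpt standing in for the generated file, and OkJson and describePrompt are names invented for this example.

```typescript
// Hypothetical excerpt of what a generated `paths` type might look like.
interface paths {
  "/api/prompts/{id}": {
    get: {
      responses: {
        "200": {
          content: {
            "application/json": {
              id: string;
              handle: string | null;
              model: string;
              maxTokens?: number;
            };
          };
        };
      };
    };
  };
}

// Extracts the JSON body of the 200 response for a given path and method.
type OkJson<P extends keyof paths, M extends keyof paths[P]> =
  paths[P][M] extends {
    responses: { "200": { content: { "application/json": infer Body } } };
  }
    ? Body
    : never;

type PromptResponse = OkJson<"/api/prompts/{id}", "get">;

// SDK code consumes the derived type; no hand-written schema is needed.
function describePrompt(prompt: PromptResponse): string {
  return `${prompt.handle ?? prompt.id} (${prompt.model})`;
}

console.log(describePrompt({ id: "p1", handle: null, model: "gpt-4o" }));
```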

This layered architecture promotes a single source of truth, reduces duplication, and improves consistency across the entire system.

Migration Plan: A Step-by-Step Guide

The migration to the new architecture involves several phases. This section outlines the key steps involved in each phase.

Phase 1: Extract Domain Schema

This phase focuses on creating the domain schema and establishing it as the single source of truth.

  • Create prompts/schemas/domain/config-data.schema.ts.
  • Export configDataSchema as the camel case single source of truth.
  • Add transform utilities for snake case conversion.

Phase 2: Derive Boundary Schemas

This phase involves deriving the boundary schemas from the domain schema.

  • Update api-output.schema.ts to use the domain schema.
  • Update api-input.schema.ts to derive from the domain schema.
  • Update llm-config-version-schema.ts to use the domain schema and snake case transform.
  • Update form-schema.ts to derive from the domain schema, keeping the UI structure.

Phase 3: Fix tRPC Router

This phase addresses the manual schema definitions in the tRPC router.

  • Create a shared input schema in prompts/schemas/boundaries/trpc-input.schema.ts.
  • Update prompts.trpc-router.ts to use the shared schema.

Phase 4: Clean Up Service Types

This phase simplifies service types by inferring them from the API response schema.

  • Delete the manual VersionedPrompt type.
  • Infer it from apiResponseSchema: type VersionedPrompt = z.infer<typeof apiResponseSchema>.

Phase 5: DSL Integration

This phase integrates the domain schema with the DSL in the optimization studio.

  • Update LlmPromptConfigComponent to derive inputs/outputs from the domain schema.
  • Simplify llmPromptConfigUtils.ts converters.
  • Consider flattening DSL parameters into the configData structure.
  • Update SignaturePropertiesPanelForm to use shared types.

Phase 6: SDK Consolidation

This phase cleans up the SDK by using generated OpenAPI types.

  • Delete typescript-sdk/schema/prompt.schema.ts manual schemas.
  • Use only generated OpenAPI types.
  • Ensure Python SDK generation stays in sync.

Phase 7: Add Compatibility Tests

This phase adds tests to ensure schema compatibility.

  • Test: storage schema ⊆ output schema.
  • Test: domain schema → storage → domain schema roundtrip.
  • Test: OpenAPI spec matches actual responses.
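
The first two checks can be sketched without any test framework. This is a minimal illustration, assuming shallow key conversion and hypothetical key lists; with Zod object schemas, the key lists could instead be read from each schema's shape. The helper names (missingFromOutput, roundTrips) are invented for this example.

```typescript
// Hypothetical shallow key conversion helpers.
const toSnake = (key: string): string =>
  key.replace(/[A-Z]/g, (c) => `_${c.toLowerCase()}`);
const toCamel = (key: string): string =>
  key.replace(/_([a-z])/g, (_match, c: string) => c.toUpperCase());
const mapKeys = (
  obj: Record<string, unknown>,
  mapKey: (key: string) => string,
): Record<string, unknown> =>
  Object.fromEntries(Object.entries(obj).map(([key, value]) => [mapKey(key), value]));

// Check 1: every storage key, once camel-cased, must exist in the output schema.
// Returns the keys that are missing, so an empty list means "storage ⊆ output".
function missingFromOutput(storageKeys: string[], outputKeys: string[]): string[] {
  const known = new Set(outputKeys);
  return storageKeys.map(toCamel).filter((key) => !known.has(key));
}

// Check 2: domain → storage → domain must reproduce the original value.
function roundTrips(domain: Record<string, unknown>): boolean {
  const back = mapKeys(mapKeys(domain, toSnake), toCamel);
  return JSON.stringify(back) === JSON.stringify(domain);
}

// Hypothetical fixtures.
const storageKeys = ["prompt", "model", "max_tokens", "response_format"];
const outputKeys = ["id", "handle", "prompt", "model", "maxTokens", "responseFormat"];

console.log(missingFromOutput(storageKeys, outputKeys)); // []
console.log(roundTrips({ prompt: "Hi", model: "gpt-4o", maxTokens: 256 })); // true
```

Returning the list of missing keys rather than a boolean makes a failing test report exactly which field drifted.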

Key Architectural Decision: DSL Relationship

The relationship between the domain schema and the DSL is a critical architectural decision. Two options exist:

Option A: DSL derives from Domain (Recommended)

  • LlmPromptConfigComponent.inputs derives from `ConfigData[