Overview

ManyPi uses JSON Schema to ensure your scraped data is always structured, validated, and type-safe. Define your schema once, and get guaranteed data consistency across all scrapes.
Benefits:
  • Catch data issues early with validation
  • Generate TypeScript types automatically
  • Ensure consistent data structure
  • Document your API responses

JSON Schema basics

Every ManyPi scraper uses a JSON Schema to define the structure of extracted data.

Simple example

{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Product title"
    },
    "price": {
      "type": "number",
      "description": "Price in USD"
    },
    "inStock": {
      "type": "boolean",
      "description": "Availability status"
    }
  },
  "required": ["title", "price"]
}
This schema guarantees:
  • title is always a string
  • price is always a number
  • inStock, when present, is a boolean
  • title and price are always present
  • inStock is optional (it is not in the required array)
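
To make these guarantees concrete, here is a minimal hand-written sketch of the same checks in TypeScript. The names `SimpleProduct` and `isSimpleProduct` are hypothetical and not part of ManyPi; in practice a validator library would derive these checks from the schema.

```typescript
// Hypothetical type mirroring the simple schema above.
interface SimpleProduct {
  title: string;
  price: number;
  inStock?: boolean; // optional: not listed in "required"
}

// Hand-written equivalent of the schema's checks (illustration only).
function isSimpleProduct(value: unknown): value is SimpleProduct {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false; // "type": "object" excludes null and arrays
  }
  const v = value as Record<string, unknown>;
  if (typeof v.title !== "string") return false; // required string
  // JSON has no NaN or Infinity, so a valid number is always finite
  if (typeof v.price !== "number" || !isFinite(v.price)) return false;
  // Optional, but must be a boolean when present
  if (v.inStock !== undefined && typeof v.inStock !== "boolean") return false;
  return true;
}
```

A payload that passes `isSimpleProduct` can be used as a `SimpleProduct` without further casts.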

Supported data types

Primitive types

{
  "type": "string",
  "description": "Any text value",
  "minLength": 1,
  "maxLength": 500,
  "pattern": "^[A-Z].*"
}
minLength, maxLength, and pattern are optional; pattern takes a regular expression.
Examples: Product names, descriptions, URLs, categories
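
The string constraints can be read as three independent checks. The sketch below is a hypothetical helper (`checkString` is not a ManyPi or JSON Schema API) that applies them by hand and reports which ones failed.

```typescript
// Hypothetical subset of the string keywords shown above.
interface StringRules {
  minLength?: number;
  maxLength?: number;
  pattern?: string;
}

// Returns a list of problems; an empty list means the value passed.
function checkString(value: string, rules: StringRules): string[] {
  const problems: string[] = [];
  // Note: value.length counts UTF-16 code units, while JSON Schema counts
  // characters, so results can differ for characters outside the BMP.
  const length = value.length;
  if (rules.minLength !== undefined && length < rules.minLength) {
    problems.push(`shorter than ${rules.minLength}`);
  }
  if (rules.maxLength !== undefined && length > rules.maxLength) {
    problems.push(`longer than ${rules.maxLength}`);
  }
  // A pattern matches anywhere in the string unless it is anchored with ^ or $
  if (rules.pattern !== undefined && !new RegExp(rules.pattern).test(value)) {
    problems.push(`does not match ${rules.pattern}`);
  }
  return problems;
}
```

For example, `checkString("wireless", { pattern: "^[A-Z].*" })` reports a pattern mismatch because the value starts with a lowercase letter.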

Complex types

{
  "type": "array",
  "description": "List of items",
  "items": {
    "type": "string"
  },
  "minItems": 1,
  "maxItems": 10,
  "uniqueItems": true
}
Example response:
{
  "tags": ["electronics", "audio", "wireless"],
  "images": [
    "https://example.com/img1.jpg",
    "https://example.com/img2.jpg"
  ]
}
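
As a rough illustration of what minItems, maxItems, and uniqueItems mean for an array of strings, the following sketch applies the three checks by hand. `checkStringArray` is a hypothetical helper, not a library function.

```typescript
// Hypothetical subset of the array keywords shown above.
interface ArrayRules {
  minItems?: number;
  maxItems?: number;
  uniqueItems?: boolean;
}

// Returns a list of problems; an empty list means the array passed.
function checkStringArray(items: string[], rules: ArrayRules): string[] {
  const problems: string[] = [];
  if (rules.minItems !== undefined && items.length < rules.minItems) {
    problems.push(`fewer than ${rules.minItems} items`);
  }
  if (rules.maxItems !== undefined && items.length > rules.maxItems) {
    problems.push(`more than ${rules.maxItems} items`);
  }
  if (rules.uniqueItems) {
    // An item is a duplicate if its first occurrence is at an earlier index
    const hasDuplicate = items.some((item, i) => items.indexOf(item) !== i);
    if (hasDuplicate) problems.push("contains duplicate items");
  }
  return problems;
}
```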

TypeScript integration

Generate TypeScript types from your JSON Schema for full type safety in your application.

Using json-schema-to-typescript

1. Install the package

npm install json-schema-to-typescript
2. Convert schema to TypeScript

generate-types.ts
import { compile, type JSONSchema } from 'json-schema-to-typescript';
import fs from 'fs';

// Your scraper's JSON Schema
const schema: JSONSchema = {
  title: 'Product',
  type: 'object',
  properties: {
    title: { type: 'string' },
    price: { type: 'number' },
    rating: { type: 'number', minimum: 0, maximum: 5 },
    inStock: { type: 'boolean' },
    tags: {
      type: 'array',
      items: { type: 'string' }
    }
  },
  required: ['title', 'price', 'inStock'],
  // Without this, the generated interface also gets an index signature
  additionalProperties: false
};

// Generate the TypeScript interface and write it to types/product.ts
compile(schema, 'Product').then(ts => {
  fs.mkdirSync('types', { recursive: true }); // make sure the folder exists
  fs.writeFileSync('types/product.ts', ts);
});
3. Generated TypeScript types

types/product.ts
export interface Product {
  title: string;
  price: number;
  rating?: number;
  inStock: boolean;
  tags?: string[];
}
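
The rule behind this output is simple: properties listed in required become plain members, and everything else gets a `?`. The toy generator below illustrates only that rule for flat schemas. It is a sketch under that assumption, not how json-schema-to-typescript is implemented, and `sketchInterface` is a hypothetical name.

```typescript
// Toy illustration: flat objects with primitive or array-of-primitive properties.
type PrimitiveName = "string" | "number" | "integer" | "boolean";

interface ToyProperty {
  type: PrimitiveName | "array";
  items?: { type: PrimitiveName };
}

interface ToySchema {
  properties: Record<string, ToyProperty>;
  required?: string[];
}

// JSON Schema "integer" has no separate TypeScript type; it maps to number.
function tsTypeOf(name: PrimitiveName): string {
  return name === "integer" ? "number" : name;
}

function sketchInterface(name: string, schema: ToySchema): string {
  const required = schema.required ?? [];
  const lines = Object.keys(schema.properties).map((key) => {
    const prop = schema.properties[key];
    const tsType =
      prop.type === "array"
        ? (prop.items ? tsTypeOf(prop.items.type) + "[]" : "unknown[]")
        : tsTypeOf(prop.type);
    // Properties missing from "required" become optional members
    const optional = required.indexOf(key) === -1 ? "?" : "";
    return `  ${key}${optional}: ${tsType};`;
  });
  return [`export interface ${name} {`, ...lines, "}"].join("\n");
}
```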
4. Use in your application

app.ts
import { Product } from './types/product';

async function scrapeProduct(url: string): Promise<Product> {
  const response = await fetch(
    'https://app.manypi.com/api/scrape/YOUR_SCRAPER_ID',
    {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.MANYPI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ url })
    }
  );
  
  const result = await response.json();
  
  if (!result.success) {
    throw new Error(result.error);
  }
  
  // Typed by assertion only: result.data is not validated at runtime here
  const product: Product = result.data;
  
  // TypeScript knows these properties exist
  console.log(product.title);
  console.log(product.price);
  
  // rating is optional; compare with undefined so a rating of 0 still prints
  if (product.rating !== undefined) {
    console.log(`Rating: ${product.rating}/5`);
  }
  
  return product;
}
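
The success check above can be captured once in a small helper. The sketch below assumes the response envelope has exactly the fields used in the example (`success`, `data`, `error`); the full response shape is not shown here, so `ScrapeResult` and `unwrap` are hypothetical names for illustration.

```typescript
// Assumed envelope, based only on the fields used in the example above.
type ScrapeResult<T> =
  | { success: true; data: T }
  | { success: false; error: string };

// Returns the data on success and throws the reported error otherwise.
function unwrap<T>(result: ScrapeResult<T>): T {
  if (!result.success) {
    throw new Error(result.error);
  }
  return result.data; // narrowed to the success branch
}
```

Modeling the envelope as a union lets the compiler reject code that reads `data` before checking `success`.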

Real-world schemas

E-commerce product

{
  "title": "Product",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Product name",
      "minLength": 1
    },
    "brand": {
      "type": "string",
      "description": "Brand name"
    },
    "currentPrice": {
      "type": "number",
      "description": "Current price in USD",
      "minimum": 0
    },
    "originalPrice": {
      "type": ["number", "null"],
      "description": "Original price before discount",
      "minimum": 0
    },
    "discount": {
      "type": ["number", "null"],
      "description": "Discount percentage",
      "minimum": 0,
      "maximum": 100
    },
    "rating": {
      "type": ["number", "null"],
      "description": "Average rating",
      "minimum": 0,
      "maximum": 5
    },
    "reviewCount": {
      "type": "integer",
      "description": "Number of reviews",
      "minimum": 0
    },
    "inStock": {
      "type": "boolean",
      "description": "Availability status"
    },
    "condition": {
      "type": "string",
      "enum": ["new", "used", "refurbished"],
      "description": "Product condition"
    },
    "images": {
      "type": "array",
      "description": "Product image URLs",
      "items": {
        "type": "string",
        "format": "uri"
      },
      "minItems": 1
    },
    "specifications": {
      "type": "array",
      "description": "Product specifications",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "value": { "type": "string" }
        },
        "required": ["name", "value"]
      }
    },
    "shipping": {
      "type": "object",
      "description": "Shipping information",
      "properties": {
        "cost": { "type": "number", "minimum": 0 },
        "estimatedDays": { "type": "integer", "minimum": 0 },
        "freeShipping": { "type": "boolean" }
      }
    }
  },
  "required": [
    "title",
    "currentPrice",
    "inStock"
  ]
}
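
A schema like this validates each field on its own but cannot express relationships between fields. As a sketch, the check below compares the scraped discount with the one implied by the two prices, assuming discount means the percentage off originalPrice. The names and the one-point tolerance are illustrative choices.

```typescript
// The three pricing fields from the schema above.
interface Pricing {
  currentPrice: number;
  originalPrice: number | null;
  discount: number | null;
}

// Discount implied by the two prices, or null when it cannot be computed.
function impliedDiscount(p: Pricing): number | null {
  if (p.originalPrice === null || p.originalPrice <= 0) return null;
  return ((p.originalPrice - p.currentPrice) / p.originalPrice) * 100;
}

// True when the scraped discount is within `tolerance` percentage points
// of the implied one, or when there is nothing to compare.
function discountIsConsistent(p: Pricing, tolerance = 1): boolean {
  const implied = impliedDiscount(p);
  if (implied === null || p.discount === null) return true;
  return Math.abs(implied - p.discount) <= tolerance;
}
```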

Job listing

{
  "title": "JobListing",
  "type": "object",
  "properties": {
    "jobTitle": {
      "type": "string",
      "description": "Job position title"
    },
    "company": {
      "type": "string",
      "description": "Company name"
    },
    "location": {
      "type": "object",
      "properties": {
        "city": { "type": "string" },
        "state": { "type": "string" },
        "country": { "type": "string" },
        "remote": { "type": "boolean" }
      },
      "required": ["city", "country"]
    },
    "salary": {
      "type": "object",
      "properties": {
        "min": { "type": "number", "minimum": 0 },
        "max": { "type": "number", "minimum": 0 },
        "currency": { "type": "string", "default": "USD" },
        "period": {
          "type": "string",
          "enum": ["hourly", "monthly", "yearly"]
        }
      }
    },
    "jobType": {
      "type": "string",
      "enum": ["full-time", "part-time", "contract", "internship"]
    },
    "experienceLevel": {
      "type": "string",
      "enum": ["entry", "mid", "senior", "lead", "executive"]
    },
    "skills": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Required skills"
    },
    "description": {
      "type": "string",
      "description": "Job description"
    },
    "postedDate": {
      "type": "string",
      "format": "date",
      "description": "When the job was posted"
    },
    "applicationUrl": {
      "type": "string",
      "format": "uri",
      "description": "URL to apply"
    }
  },
  "required": [
    "jobTitle",
    "company",
    "location",
    "jobType"
  ]
}
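
Two details of the salary object are worth handling in application code. The schema does not require min to be at most max, and default is an annotation in JSON Schema that validators typically do not apply on their own. The sketch below handles both; `normalizeSalary` is a hypothetical helper.

```typescript
// The salary object from the schema above.
interface Salary {
  min?: number;
  max?: number;
  currency?: string;
  period?: "hourly" | "monthly" | "yearly";
}

// Applies the documented default currency and rejects inverted ranges.
function normalizeSalary(salary: Salary): Salary {
  const currency = salary.currency ?? "USD";
  if (
    salary.min !== undefined &&
    salary.max !== undefined &&
    salary.min > salary.max
  ) {
    throw new RangeError(
      `salary.min (${salary.min}) exceeds salary.max (${salary.max})`
    );
  }
  return { ...salary, currency };
}
```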

Article/Blog post

{
  "title": "Article",
  "type": "object",
  "properties": {
    "headline": {
      "type": "string",
      "description": "Article title"
    },
    "author": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "bio": { "type": "string" },
        "avatarUrl": { "type": "string", "format": "uri" }
      },
      "required": ["name"]
    },
    "publishedDate": {
      "type": "string",
      "format": "date-time",
      "description": "Publication date and time"
    },
    "modifiedDate": {
      "type": ["string", "null"],
      "format": "date-time",
      "description": "Last modified date"
    },
    "category": {
      "type": "string",
      "description": "Article category"
    },
    "tags": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Article tags"
    },
    "content": {
      "type": "string",
      "description": "Full article text"
    },
    "excerpt": {
      "type": "string",
      "description": "Short summary",
      "maxLength": 500
    },
    "featuredImage": {
      "type": "string",
      "format": "uri",
      "description": "Main article image"
    },
    "readingTime": {
      "type": "integer",
      "description": "Estimated reading time in minutes",
      "minimum": 1
    },
    "wordCount": {
      "type": "integer",
      "description": "Article word count",
      "minimum": 0
    }
  },
  "required": [
    "headline",
    "author",
    "publishedDate",
    "content"
  ]
}
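
When a page does not expose a reading time, one option is to derive it from wordCount. The sketch below assumes a reading speed of 200 words per minute, which is an illustrative default rather than anything defined by the schema, and it respects the schema's minimum of 1.

```typescript
// Estimates reading time in whole minutes, never less than 1.
function estimateReadingTime(wordCount: number, wordsPerMinute = 200): number {
  if (wordCount < 0 || wordsPerMinute <= 0) {
    throw new RangeError("wordCount must be >= 0 and wordsPerMinute > 0");
  }
  // Round up so partial minutes count, then apply the schema's minimum of 1
  return Math.max(1, Math.ceil(wordCount / wordsPerMinute));
}
```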

Validation in practice

Client-side validation

Use libraries like Ajv to validate responses:
import Ajv from 'ajv';
import addFormats from 'ajv-formats';

const ajv = new Ajv();
addFormats(ajv);

// Your schema
const schema = {
  type: 'object',
  properties: {
    title: { type: 'string' },
    price: { type: 'number', minimum: 0 }
  },
  required: ['title', 'price']
};

const validate = ajv.compile(schema);

async function scrapeWithValidation(url: string) {
  const response = await fetch(/* ... */);
  const result = await response.json();
  
  if (!result.success) {
    throw new Error(result.error);
  }
  
  // Validate the data
  if (!validate(result.data)) {
    console.error('Validation errors:', validate.errors);
    throw new Error('Invalid data structure');
  }
  
  // Data is guaranteed to match schema
  return result.data;
}
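
Raw validation errors are hard to read in logs. The sketch below turns a list of errors into a single message. `ValidationIssue` is a local stand-in that mirrors two fields found on Ajv v8 error objects (`instancePath` and `message`), and `formatIssues` is a hypothetical helper.

```typescript
// Local stand-in for the two Ajv (v8) error fields used here.
interface ValidationIssue {
  instancePath: string; // JSON pointer to the failing value, "" for the root
  message?: string;
}

// Joins errors into one readable line for logging.
function formatIssues(errors: ValidationIssue[] | null | undefined): string {
  if (!errors || errors.length === 0) return "no validation errors";
  return errors
    .map((e) => `${e.instancePath || "(root)"}: ${e.message ?? "invalid"}`)
    .join("; ");
}
```

With Ajv, this would be called as `formatIssues(validate.errors)` after a failed validation.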

Runtime type checking with Zod

import { z } from 'zod';

// Define schema with Zod
const ProductSchema = z.object({
  title: z.string().min(1),
  price: z.number().nonnegative(), // same as "minimum": 0; positive() would reject 0
  rating: z.number().min(0).max(5).optional(),
  inStock: z.boolean(),
  tags: z.array(z.string()).optional()
});

type Product = z.infer<typeof ProductSchema>;

async function scrapeProduct(url: string): Promise<Product> {
  const response = await fetch(/* ... */);
  const result = await response.json();
  
  if (!result.success) {
    throw new Error(result.error);
  }
  
  // Parse and validate (throws a ZodError on mismatch)
  const product = ProductSchema.parse(result.data);
  
  // Fully typed and validated!
  return product;
}

Best practices

Not all pages have all data. Use nullable types for fields that may have no value:
{
  "salePrice": {
    "type": ["number", "null"],
    "description": "Sale price; null when the item is not on sale"
  }
}
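
In TypeScript, a nullable field and an optional field are different types, and it helps to keep them apart. The sketch below uses a hypothetical `Listing` type to show both, with `??` falling back to the regular price when salePrice is null.

```typescript
// Hypothetical type contrasting nullable and optional fields.
interface Listing {
  price: number;
  salePrice: number | null; // nullable: the key is present, the value may be null
  rating?: number;          // optional: the key may be missing entirely
}

// Uses the sale price when there is one, otherwise the regular price.
function displayPrice(item: Listing): string {
  // ?? only falls back on null or undefined, so a sale price of 0 is kept
  const effective = item.salePrice ?? item.price;
  return `$${effective.toFixed(2)}`;
}
```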
When a field has a limited set of possible values, use enums:
{
  "status": {
    "type": "string",
    "enum": ["active", "pending", "sold", "expired"]
  }
}
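
On the TypeScript side, an enum like this maps naturally to a union of string literals. The sketch below derives the union from a constant array so the allowed values are listed once; `STATUSES` and `isStatus` are hypothetical names.

```typescript
// Single source of truth for the allowed values.
const STATUSES = ["active", "pending", "sold", "expired"] as const;

// "active" | "pending" | "sold" | "expired"
type Status = (typeof STATUSES)[number];

// Runtime guard for values coming from outside the type system.
function isStatus(value: unknown): value is Status {
  return typeof value === "string" && STATUSES.some((s) => s === value);
}
```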
Add validation rules to catch data issues:
{
  "price": {
    "type": "number",
    "minimum": 0,
    "maximum": 1000000
  },
  "title": {
    "type": "string",
    "minLength": 1,
    "maxLength": 500
  }
}
Add descriptions to help future developers:
{
  "rating": {
    "type": "number",
    "minimum": 0,
    "maximum": 5,
    "description": "Average customer rating out of 5 stars"
  }
}
Don't write types by hand; generate them from your schema:
# Add to your build process (assumes a generate-types script in package.json)
npm run generate-types

Common patterns

Handling optional nested objects

{
  "shipping": {
    "type": ["object", "null"],
    "properties": {
      "cost": { "type": "number" },
      "estimatedDays": { "type": "integer" }
    }
  }
}
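
Consuming a nullable nested object takes two levels of care: the object itself may be null, and its properties are not required. The sketch below handles both with optional chaining. The `Shippable` type and `shippingSummary` function are hypothetical.

```typescript
// Mirrors the nullable shipping object above; neither property is required.
interface Shipping {
  cost?: number;
  estimatedDays?: number;
}

interface Shippable {
  shipping: Shipping | null;
}

function shippingSummary(item: Shippable): string {
  // ?. yields undefined when shipping is null or cost is missing
  const cost = item.shipping?.cost;
  if (cost === undefined) return "Shipping cost unavailable";
  return cost === 0 ? "Free shipping" : `Shipping: $${cost.toFixed(2)}`;
}
```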

Arrays with minimum items

{
  "images": {
    "type": "array",
    "items": { "type": "string", "format": "uri" },
    "minItems": 1,
    "description": "At least one image required"
  }
}
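
TypeScript can carry the "at least one item" guarantee in the type itself. The sketch below uses a tuple type with a rest element for that; `NonEmptyArray`, `toNonEmpty`, and `primaryImage` are hypothetical names.

```typescript
// An array type that always has a first element.
type NonEmptyArray<T> = [T, ...T[]];

// Converts a plain array, returning null when it is empty.
function toNonEmpty<T>(items: T[]): NonEmptyArray<T> | null {
  return items.length > 0 ? (items as NonEmptyArray<T>) : null;
}

// Safe without a length check: the type guarantees images[0] exists.
function primaryImage(images: NonEmptyArray<string>): string {
  return images[0];
}
```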

Conditional requirements

{
  "type": "object",
  "properties": {
    "hasDiscount": { "type": "boolean" },
    "discountPercent": { "type": "number" }
  },
  "if": {
    "properties": { "hasDiscount": { "const": true } },
    "required": ["hasDiscount"]
  },
  "then": {
    "required": ["discountPercent"]
  }
}
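
The intended rule here is that discountPercent is required whenever hasDiscount is true. In TypeScript, that rule can be sketched as a discriminated union, so the compiler knows the field exists once the flag has been checked. `Discountable` and `describeDiscount` are hypothetical names.

```typescript
// discountPercent is mandatory only in the branch where hasDiscount is true.
type Discountable =
  | { hasDiscount: true; discountPercent: number }
  | { hasDiscount?: false; discountPercent?: number };

function describeDiscount(item: Discountable): string {
  if (item.hasDiscount) {
    // Narrowed to the first branch: discountPercent is a number here
    return `${item.discountPercent}% off`;
  }
  return "No discount";
}
```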
