Overview

Responsible scraping protects both your application and target websites. This guide covers rate limiting strategies, retry logic, and best practices for production use.
Important: Excessive requests can get your IP blocked and violate website terms of service. Always implement rate limiting.

Why rate limiting matters

Protect target websites

  • Prevents server overload
  • Respects website resources
  • Maintains good standing with site owners

Protect your application

  • Avoids IP bans
  • Prevents credit waste on failed requests
  • Ensures consistent data quality

Stay compliant

  • Respects robots.txt
  • Follows terms of service
  • Demonstrates good faith usage

ManyPi rate limits

API limits

All plans have the same rate limit:
  • Requests per minute: 60
  • Burst limit: 10 concurrent requests
Rate limits are applied per API key. You can create multiple API keys in your dashboard to scale horizontally (e.g., 3 API keys = 180 requests/minute).
While rate limits are the same across plans, your credit allocation varies by plan tier. Higher plans get more monthly credits for more scraping volume.
Pro tip: Create separate API keys for different services or environments (production, staging, batch jobs) to isolate rate limits and improve reliability.
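To distribute requests across several keys, a simple round-robin picker is enough. A minimal sketch (the environment variable names below are placeholders for your own configuration):
// Hypothetical helper: rotate through multiple API keys to scale throughput
const apiKeys = [
  process.env.MANYPI_API_KEY_1,
  process.env.MANYPI_API_KEY_2,
  process.env.MANYPI_API_KEY_3
].filter(Boolean) as string[];

let keyIndex = 0;

function getNextApiKey(): string {
  // Round-robin selection; assumes at least one key is configured
  const key = apiKeys[keyIndex];
  keyIndex = (keyIndex + 1) % apiKeys.length;
  return key;
}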

Rate limit headers

Every API response includes rate limit information:
HTTP/1.1 200 OK
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 55
X-RateLimit-Reset: 1640000000
  • X-RateLimit-Limit: Maximum requests per minute (60)
  • X-RateLimit-Remaining: Requests remaining in current window
  • X-RateLimit-Reset: Unix timestamp when limit resets
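One way to use these headers is to slow down when the remaining quota runs low. A minimal sketch (not an official client helper) that reads them from a fetch Response and waits for the window to reset:
// Sketch: pause when the rate limit window is nearly exhausted
async function throttleIfNeeded(response: Response): Promise<void> {
  const remaining = parseInt(response.headers.get('X-RateLimit-Remaining') ?? '1', 10);
  const reset = parseInt(response.headers.get('X-RateLimit-Reset') ?? '0', 10);

  if (remaining <= 1) {
    // Wait until the window resets (plus a small buffer) before sending more requests
    const waitMs = Math.max(0, reset * 1000 - Date.now()) + 500;
    await new Promise(resolve => setTimeout(resolve, waitMs));
  }
}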

Implementing rate limiting

Simple delay between requests

const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

async function scrapeWithDelay(urls) {
  const results = [];
  
  for (const url of urls) {
    const result = await scrapeUrl(url);
    results.push(result);
    
    // Wait 2 seconds between requests
    await sleep(2000);
  }
  
  return results;
}

Token bucket algorithm

More sophisticated rate limiting that allows bursts:
class RateLimiter {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillRate: number; // tokens per second

  constructor(capacity: number, refillRate: number) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  private refill(): void {
    const now = Date.now();
    const timePassed = (now - this.lastRefill) / 1000;
    const tokensToAdd = timePassed * this.refillRate;
    
    this.tokens = Math.min(this.capacity, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }

  async acquire(): Promise<void> {
    this.refill();
    
    if (this.tokens < 1) {
      const waitTime = (1 - this.tokens) / this.refillRate * 1000;
      await new Promise(resolve => setTimeout(resolve, waitTime));
      this.refill();
    }
    
    this.tokens -= 1;
  }
}

// Usage
const limiter = new RateLimiter(10, 2); // 10 tokens, refill 2 per second

async function scrapeWithRateLimit(urls: string[]) {
  const results = [];
  
  for (const url of urls) {
    await limiter.acquire();
    const result = await scrapeUrl(url);
    results.push(result);
  }
  
  return results;
}

Using p-limit for concurrency control

import pLimit from 'p-limit';

// Allow max 5 concurrent requests
const limit = pLimit(5);

async function scrapeConcurrently(urls) {
  const promises = urls.map(url => 
    limit(() => scrapeUrl(url))
  );
  
  return Promise.all(promises);
}

// With delay between batches
async function scrapeBatches(urls, batchSize = 10) {
  const results = [];
  
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls.slice(i, i + batchSize);
    const batchResults = await scrapeConcurrently(batch);
    results.push(...batchResults);
    
    // Wait between batches
    if (i + batchSize < urls.length) {
      await sleep(5000); // 5 second delay
    }
  }
  
  return results;
}

Retry logic

Exponential backoff

Retry failed requests with increasing delays:
async function scrapeWithRetry(
  url: string,
  maxRetries = 3,
  baseDelay = 1000
): Promise<any> {
  let lastError: Error | undefined;
  
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await fetch(
        'https://app.manypi.com/api/scrape/YOUR_SCRAPER_ID',
        {
          method: 'POST',
          headers: {
            'Authorization': `Bearer ${process.env.MANYPI_API_KEY}`,
            'Content-Type': 'application/json'
          },
          body: JSON.stringify({ url })
        }
      );
      
      if (response.status === 429) {
        // Rate limited - wait and retry
        const retryAfter = response.headers.get('Retry-After');
        const delay = retryAfter 
          ? parseInt(retryAfter) * 1000 
          : baseDelay * Math.pow(2, attempt);
        
        console.log(`Rate limited. Retrying in ${delay}ms...`);
        await sleep(delay);
        continue;
      }
      
      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }
      
      const data = await response.json();
      
      if (!data.success) {
        throw new Error(data.error);
      }
      
      return data;
      
    } catch (error) {
      lastError = error as Error;
      
      if (attempt < maxRetries - 1) {
        const delay = baseDelay * Math.pow(2, attempt);
        console.log(`Attempt ${attempt + 1} failed. Retrying in ${delay}ms...`);
        await sleep(delay);
      }
    }
  }
  
  throw new Error(`Failed after ${maxRetries} attempts: ${lastError?.message ?? 'rate limited'}`);
}

Retry with jitter

Add randomness to prevent thundering herd:
function calculateBackoff(attempt, baseDelay = 1000, maxDelay = 30000) {
  const exponentialDelay = baseDelay * Math.pow(2, attempt);
  const jitter = Math.random() * 1000; // 0-1000ms random jitter
  return Math.min(exponentialDelay + jitter, maxDelay);
}

async function scrapeWithJitter(url, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await scrapeUrl(url);
    } catch (error) {
      if (attempt < maxRetries - 1) {
        const delay = calculateBackoff(attempt);
        await sleep(delay);
      } else {
        throw error;
      }
    }
  }
}

Error handling

Comprehensive error handling

interface ScrapeError {
  type: 'rate_limit' | 'network' | 'validation' | 'server' | 'unknown';
  message: string;
  retryable: boolean;
  retryAfter?: number;
}

async function scrapeWithErrorHandling(url: string): Promise<any> {
  try {
    const response = await fetch(/* ... */);
    const data = await response.json();
    
    if (!data.success) {
      const error: ScrapeError = {
        type: classifyError(data.errorType),
        message: data.error,
        retryable: isRetryable(data.errorType)
      };
      
      throw error;
    }
    
    return data;
    
  } catch (error) {
    if (error instanceof TypeError) {
      // Network error
      throw {
        type: 'network',
        message: 'Network request failed',
        retryable: true
      } as ScrapeError;
    }
    
    throw error;
  }
}

function classifyError(errorType: string): ScrapeError['type'] {
  switch (errorType) {
    case 'rate_limit_error':
      return 'rate_limit';
    case 'validation_error':
      return 'validation';
    case 'internal_error':
      return 'server';
    default:
      return 'unknown';
  }
}

function isRetryable(errorType: string): boolean {
  return ['rate_limit_error', 'internal_error', 'network_error']
    .includes(errorType);
}

Circuit breaker pattern

Prevent cascading failures:
class CircuitBreaker {
  private failures = 0;
  private lastFailureTime = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  
  constructor(
    private threshold: number = 5,
    private timeout: number = 60000 // 1 minute
  ) {}
  
  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.timeout) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit breaker is open');
      }
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  private onSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }
  
  private onFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();
    
    if (this.failures >= this.threshold) {
      this.state = 'open';
      console.log('Circuit breaker opened');
    }
  }
}

// Usage
const breaker = new CircuitBreaker(5, 60000);

async function scrapeWithCircuitBreaker(url: string) {
  return breaker.execute(() => scrapeUrl(url));
}

Production patterns

Queue-based processing

Use a queue for reliable, rate-limited scraping:
import Bull from 'bull';

// Create queue with a built-in rate limiter
const scrapeQueue = new Bull('scraping', {
  redis: { host: 'localhost', port: 6379 },
  limiter: {
    max: 10,        // 10 jobs
    duration: 60000 // per minute
  }
});

// Process up to 5 jobs concurrently; the limiter above caps throughput
scrapeQueue.process(5, async (job) => {
  const { url, scraperId } = job.data;
  
  try {
    const result = await scrapeUrl(url, scraperId);
    return result;
  } catch (error) {
    // Retry logic handled by Bull
    throw error;
  }
});

// Add jobs to queue
async function queueScrape(url: string, scraperId: string) {
  await scrapeQueue.add(
    { url, scraperId },
    {
      attempts: 3,
      backoff: {
        type: 'exponential',
        delay: 2000
      }
    }
  );
}

// Monitor queue
scrapeQueue.on('completed', (job, result) => {
  console.log(`Job ${job.id} completed`);
});

scrapeQueue.on('failed', (job, error) => {
  console.error(`Job ${job.id} failed:`, error.message);
});

Distributed rate limiting with Redis

Share rate limits across multiple servers:
import Redis from 'ioredis';

class DistributedRateLimiter {
  private redis: Redis;
  
  constructor(redisUrl: string) {
    this.redis = new Redis(redisUrl);
  }
  
  async checkLimit(
    key: string,
    limit: number,
    window: number // seconds
  ): Promise<boolean> {
    const now = Date.now();
    const windowStart = now - (window * 1000);
    
    // Remove old entries
    await this.redis.zremrangebyscore(key, 0, windowStart);
    
    // Count requests in current window
    const count = await this.redis.zcard(key);
    
    if (count >= limit) {
      return false;
    }
    
    // Add current request
    await this.redis.zadd(key, now, `${now}`);
    await this.redis.expire(key, window);
    
    return true;
  }
}

// Usage
const limiter = new DistributedRateLimiter('redis://localhost:6379');

async function scrapeWithDistributedLimit(url: string) {
  const allowed = await limiter.checkLimit(
    'scraping:rate-limit',
    100, // 100 requests
    60   // per 60 seconds
  );
  
  if (!allowed) {
    throw new Error('Rate limit exceeded');
  }
  
  return scrapeUrl(url);
}

Best practices

  • Check robots.txt for crawl-delay directives (see the sketch after this list)
  • Start with 2-3 second delays between requests
  • Monitor for 429 (Too Many Requests) responses
  • Adjust delays based on response times
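A minimal sketch of the robots.txt check from the first bullet (real robots.txt parsing handles per-user-agent groups; this only looks for a plain Crawl-delay directive):
// Sketch: read a site's Crawl-delay directive, returned in milliseconds
async function getCrawlDelay(origin: string): Promise<number | null> {
  const response = await fetch(`${origin}/robots.txt`);
  if (!response.ok) return null;

  const text = await response.text();
  const match = text.match(/^crawl-delay:\s*(\d+)/im);

  return match ? parseInt(match[1], 10) * 1000 : null;
}

// Usage: fall back to a 2 second delay if no directive is found
// const delayMs = (await getCrawlDelay('https://example.com')) ?? 2000;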
Schedule heavy scraping during low-traffic periods:
function isOffPeakHours() {
  const hour = new Date().getHours();
  // 2 AM - 6 AM local time
  return hour >= 2 && hour < 6;
}

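// Note: waitUntilOffPeak() and scrapeBatch() are assumed helpers, not shown here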
async function scrapeResponsibly(urls) {
  if (!isOffPeakHours()) {
    console.log('Waiting for off-peak hours...');
    await waitUntilOffPeak();
  }
  
  return scrapeBatch(urls);
}
Don’t re-scrape data that hasn’t changed:
const cache = new Map();

async function scrapeWithCache(url, ttl = 3600000) {
  const cached = cache.get(url);
  
  if (cached && Date.now() - cached.timestamp < ttl) {
    return cached.data;
  }
  
  const data = await scrapeUrl(url);
  cache.set(url, { data, timestamp: Date.now() });
  
  return data;
}
Track success rates and adjust accordingly:
class ScrapeMonitor {
  private stats = {
    total: 0,
    success: 0,
    failed: 0,
    rateLimited: 0
  };
  
  recordSuccess() {
    this.stats.total++;
    this.stats.success++;
  }
  
  recordFailure(type: string) {
    this.stats.total++;
    this.stats.failed++;
    if (type === 'rate_limit') {
      this.stats.rateLimited++;
    }
  }
  
  getSuccessRate() {
    return this.stats.success / this.stats.total;
  }
  
  shouldSlowDown() {
    // Slow down if >10% rate limited
    return this.stats.rateLimited / this.stats.total > 0.1;
  }
}
For high-volume scraping, consider rotating proxies:
const proxies = [
  'http://proxy1.example.com:8080',
  'http://proxy2.example.com:8080',
  'http://proxy3.example.com:8080'
];

let currentProxy = 0;

function getNextProxy() {
  const proxy = proxies[currentProxy];
  currentProxy = (currentProxy + 1) % proxies.length;
  return proxy;
}
Always use legitimate proxy services and respect website terms of service.

Monitoring rate limits

Check remaining quota

async function checkRateLimit() {
  const response = await fetch(
    'https://app.manypi.com/api/scrape/YOUR_SCRAPER_ID',
    {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.MANYPI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ url: 'https://example.com' })
    }
  );
  
  const remaining = response.headers.get('X-RateLimit-Remaining');
  const reset = response.headers.get('X-RateLimit-Reset');
  
  console.log(`Requests remaining: ${remaining}`);
  console.log(`Resets at: ${new Date(parseInt(reset!) * 1000)}`);
  
  return {
    remaining: parseInt(remaining!),
    resetAt: new Date(parseInt(reset!) * 1000)
  };
}

Alert on low quota

async function scrapeWithQuotaCheck(url: string) {
  const { remaining } = await checkRateLimit();
  
  if (remaining < 10) {
    await sendAlert('Low rate limit quota', {
      remaining,
      url
    });
  }
  
  return scrapeUrl(url);
}

Next steps