
DevSecOps : Granular Testing and Logging #17004

Open · emvaldes opened this issue Jan 7, 2025 · 13 comments
Labels: DevSecOps Team Aq (DevSecOps work label) · documentation (Tickets that add documentation on existing features and services) · platform-future (Platform - Future Capabilities) · reportstream

Milestone: todo

emvaldes commented Jan 7, 2025

Objective: This stage focuses on ensuring that your API endpoints are robust, scalable, and properly instrumented to handle "pandemic-level" traffic. Below is a structured guide based on industry standards and best practices.


emvaldes commented Jan 7, 2025

Understanding "Pandemic-Ready" Goals

Definition of Pandemic-Ready Goals

  • Scalability: Ensure APIs can handle sudden and extreme traffic spikes without downtime or performance degradation.
  • Resilience: APIs should continue to function under failure scenarios, such as partial outages, increased latencies, or high loads.
  • Observability: APIs should be instrumented to provide detailed logs and metrics to monitor performance, error rates, and traffic patterns in real time.
  • Accuracy: Ensure APIs return consistent, correct responses even under extreme conditions.


emvaldes commented Jan 7, 2025

Steps for Granular Test Sizing for APIs

Step 1: Define Granular Performance Metrics for Each API

The team must establish measurable goals for "pandemic-ready" performance for each API endpoint; a sketch of encoding these targets in a load-testing tool follows the list. This includes:

  1. Max Throughput:

    • Requests per second (RPS) each endpoint should handle.
    • Define "normal load," "peak load," and "pandemic-level" load.
  2. Latency Targets:

    • Average response time (e.g., <200ms under normal load).
    • Tail-end latency (e.g., 95th percentile <500ms).
  3. Error Budget:

    • Allowable error rate (e.g., <0.1% of requests fail under peak load).
  4. Concurrent Users:

    • Maximum number of concurrent API consumers the system should support.
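As a sketch, these targets could be encoded directly in a load-testing tool such as k6 (introduced below). The options here are illustrative only; the rate, duration, and threshold values are assumptions to be replaced with the figures agreed per endpoint:

    export const options = {
      scenarios: {
        pandemic_level: {
          executor: 'constant-arrival-rate', // drive a fixed request rate (throughput target)
          rate: 1000,                        // e.g., 1000 requests per second
          timeUnit: '1s',
          duration: '10m',
          preAllocatedVUs: 2000,             // upper bound on concurrent virtual users
        },
      },
      thresholds: {
        http_req_duration: ['avg<200', 'p(95)<500'], // latency targets (ms)
        http_req_failed: ['rate<0.001'],             // error budget: <0.1% failed requests
      },
    };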


emvaldes commented Jan 7, 2025

Step 2: Gather Baseline Metrics from Azure

Use Azure Monitor, Application Insights, and Log Analytics to collect real-world data on API performance:

  1. Throughput Metrics:

    • Query Azure Log Analytics for request counts across different API endpoints:
      requests
      | summarize RequestCount = count() by name, bin(timestamp, 1m)
  2. Latency Metrics:

    • Measure average and tail-end latency:
      requests
      | summarize AvgLatency = avg(duration), P95Latency = percentile(duration, 95) by name
  3. Error Rates:

    • Query for failed requests by API endpoint:
      requests
      | where success == false
      | summarize ErrorCount = count() by name, bin(timestamp, 1m)
  4. Traffic Patterns:

    • Identify peak traffic times and request distribution patterns.
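For item 4, a sketch of a Log Analytics query that surfaces peak traffic windows (the one-hour bin is an assumption; adjust to your reporting granularity):

    requests
    | summarize RequestCount = count() by bin(timestamp, 1h)
    | order by RequestCount desc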


emvaldes commented Jan 7, 2025

Step 3: Simulate Granular Load Tests

Use the collected data to design granular load tests that simulate realistic and extreme conditions.

Tools for Granular Load Testing:

  • K6 for API-level load tests:

    import http from 'k6/http';
    
    export let options = {
      stages: [
        { duration: '1m', target: 100 }, // Ramp up to 100 virtual users
        { duration: '5m', target: 500 }, // Ramp up to 500 virtual users
        { duration: '5m', target: 500 }, // Sustain 500 virtual users
        { duration: '1m', target: 0 },   // Ramp down
      ],
      thresholds: {
        http_req_duration: ['p(95)<500'], // 95% of requests must complete within 500ms
      },
    };
    
    export default function () {
      http.get('https://your-api-endpoint.com/data');
    }
  • Apache JMeter for more complex scenarios involving multiple endpoints and workflows.

  • Azure Load Testing for native integration with Azure resources.
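Assuming the k6 script above is saved as load-test.js (a hypothetical filename), it could be run locally and its results exported for later analysis:

    k6 run --out json=results.json load-test.js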


emvaldes commented Jan 7, 2025

Adding Logging and Instrumentation

Step 1: Define Logging Requirements

  1. Standardize Logs:

    • Use a consistent structure for all logs, e.g., JSON format with fields like:
      {
        "timestamp": "2023-01-01T12:00:00Z",
        "level": "INFO",
        "endpoint": "/api/v1/data",
        "method": "GET",
        "responseTime": 120,
        "statusCode": 200,
        "userId": "12345"
      }
  2. Log Categories:

    • Access Logs: Log every request, including metadata like request path, HTTP method, response time, and status code.
    • Application Logs: Log internal events, such as database queries or service calls.
    • Error Logs: Log all exceptions and failed requests with detailed error messages.
  3. Granular Logging:

    • For APIs processing large payloads, log metadata about the payload size and processing steps.
    • Example for batch processing:
      {
        "timestamp": "2023-01-01T12:00:00Z",
        "level": "INFO",
        "endpoint": "/api/v1/batch-process",
        "batchId": "batch-123",
        "batchSize": 2500,
        "processingTime": 3000
      }
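A minimal sketch of an access-log middleware that emits the JSON structure above, assuming an Express application (field names match the example; adapt to your framework):

    const express = require('express');
    const app = express();
    
    // Emit one structured access-log line per request once the response finishes
    app.use((req, res, next) => {
      const start = Date.now();
      res.on('finish', () => {
        console.log(JSON.stringify({
          timestamp: new Date().toISOString(),
          level: 'INFO',
          endpoint: req.path,
          method: req.method,
          responseTime: Date.now() - start,
          statusCode: res.statusCode,
        }));
      });
      next();
    });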


emvaldes commented Jan 7, 2025

Step 2: Add Instrumentation for Metrics Collection

  1. Use Distributed Tracing:

    • Propagate distributed-tracing context across services (e.g., with OpenTelemetry) so API calls can be followed end to end; a setup sketch follows the example below.
  2. Add Custom Metrics:

    • Use client libraries such as prom-client (Node.js) or prometheus_client (Python), or the Application Insights SDK, to capture custom metrics.

    Example of a custom latency metric in Node.js using the Application Insights SDK:

    const express = require('express');
    const appInsights = require('applicationinsights');
    
    appInsights.setup('<connection-string>').start(); // enables defaultClient
    const app = express();
    
    app.get('/api/data', async (req, res) => {
      const start = Date.now();
      const data = await fetchDataFromDB(); // placeholder for your data-access call
      const latency = Date.now() - start;
    
      // Record the elapsed time as a custom metric in Application Insights
      appInsights.defaultClient.trackMetric({ name: "API Data Latency", value: latency });
      res.json(data);
    });
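For item 1, a minimal sketch of enabling distributed tracing in Node.js with OpenTelemetry auto-instrumentation (package names assume the standard @opentelemetry distribution; load this before the rest of the application):

    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
    
    // Auto-instrument common libraries (http, express, etc.) and propagate trace context
    const sdk = new NodeSDK({
      instrumentations: [getNodeAutoInstrumentations()],
    });
    sdk.start();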


emvaldes commented Jan 7, 2025

Step 3: Use Azure-Specific Instrumentation

  1. Enable Application Insights:

    • Integrate Application Insights into your application to automatically capture logs, traces, and performance metrics (a setup sketch follows the example below).
  2. Add Custom Events:

    • Log custom events for specific operations:
      const appInsights = require('applicationinsights');
      // Requires appInsights.setup(...).start() to have run first (see the sketch below)
      appInsights.defaultClient.trackEvent({ name: "BatchProcessed", properties: { batchId: "123", size: 2500 } });
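For item 1, a sketch of the one-time setup that turns on automatic collection, assuming the applicationinsights Node.js SDK (the connection-string placeholder must come from your Application Insights resource):

    const appInsights = require('applicationinsights');
    appInsights.setup('<connection-string>')
      .setAutoCollectRequests(true)     // incoming HTTP requests
      .setAutoCollectDependencies(true) // outbound calls (DB, HTTP)
      .setAutoCollectExceptions(true)   // unhandled exceptions
      .start();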


emvaldes commented Jan 7, 2025

Aligning on Pandemic-Ready Goals

Step 1: Define API-Specific Goals

  1. Set individual goals for each API endpoint:

    • Example:
      • /api/users: 5000 RPS, average latency < 200ms, <1% error rate.
      • /api/orders: 1000 RPS, average latency < 500ms, <0.5% error rate.
  2. Group endpoints into categories based on criticality:

    • High Criticality: User authentication, health data submission.
    • Medium Criticality: Analytics, reporting.
    • Low Criticality: Debugging, auxiliary services.
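These per-endpoint goals could be kept as data so that test tooling and dashboards read from a single source. A sketch, with values copied from the examples above (the structure itself is an assumption):

    const endpointGoals = {
      '/api/users':  { rps: 5000, avgLatencyMs: 200, maxErrorRate: 0.01  },
      '/api/orders': { rps: 1000, avgLatencyMs: 500, maxErrorRate: 0.005 },
    };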


emvaldes commented Jan 7, 2025

Step 2: Define Success Metrics

Use Service Level Objectives (SLOs) and Service Level Indicators (SLIs):

  • SLO: 99.95% uptime for critical endpoints.
  • SLI: Error rate, latency, and throughput metrics for each API.
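A sketch of computing one such SLI (success rate per endpoint) in Log Analytics, using the same requests table as the earlier queries:

    requests
    | summarize SuccessRate = 100.0 * countif(success == true) / count() by name, bin(timestamp, 1h)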


emvaldes commented Jan 7, 2025

Continuous Improvement

A. Real-Time Monitoring

  1. Use dashboards in Azure Monitor to track latency, throughput, and error rates in real time.
  2. Example for a latency dashboard:
    requests
    | summarize AvgLatency = avg(duration) by bin(timestamp, 1m), name


emvaldes commented Jan 7, 2025

B. Feedback Loop

  1. Share test results with both development and DevSecOps teams.
  2. Iterate on API improvements based on observed bottlenecks and logging insights.


emvaldes commented Jan 7, 2025

C. Automate Reporting

  1. Generate automated reports on API performance after every test run.
  2. Use tools like Power BI or Grafana for visualizing test results.


emvaldes commented Jan 7, 2025

Deliverables

  1. Granular performance test results for each API endpoint.
  2. Enhanced logging and instrumentation with Application Insights integration.
  3. Clear "pandemic-ready" performance goals for all APIs.
  4. Dashboards and alerts for continuous monitoring of API health.
