Skip to content

Latest commit

 

History

History
67 lines (50 loc) · 2.38 KB

README.md

File metadata and controls

67 lines (50 loc) · 2.38 KB

Realtime Data Streaming With TCP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch | End-to-End Data Engineering Project

Table of Contents

Introduction

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using TCP/IP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch. It covers each stage from data acquisition, processing, sentiment analysis with ChatGPT, production to kafka topic and connection to elasticsearch.

System Architecture

System_architecture.png

The project is designed with the following components:

  • Data Source: We use yelp.com dataset for our pipeline.
  • TCP/IP Socket: Used to stream data over the network in chunks
  • Apache Spark: For data processing with its master and worker nodes.
  • Confluent Kafka: Our cluster on the cloud
  • Control Center and Schema Registry: Helps in monitoring and schema management of our Kafka streams.
  • Kafka Connect: For connecting to elasticsearch
  • Elasticsearch: For indexing and querying

What You'll Learn

  • Setting up data pipeline with TCP/IP
  • Real-time data streaming with Apache Kafka
  • Data processing techniques with Apache Spark
  • Realtime sentiment analysis with OpenAI ChatGPT
  • Synchronising data from kafka to elasticsearch
  • Indexing and Querying data on elasticsearch

Technologies

  • Python
  • TCP/IP
  • Confluent Kafka
  • Apache Spark
  • Docker
  • Elasticsearch

Getting Started

  1. Clone the repository:

    git clone https://github.com/airscholar/E2EDataEngineering.git
  2. Navigate to the project directory:

    cd E2EDataEngineering
  3. Run Docker Compose to spin up the spark cluster:

    docker-compose up

For more detailed instructions, please check out the video tutorial linked below.

Watch the Video Tutorial

For a complete walkthrough and practical demonstration, check out the video here: Realtime Streaming with TCP IP Spark LLM Kafka Elasticsearch.png