Project Overview

This project is a reusable architectural template that integrates large-scale data processing with API development. It combines Apache Spark for batch processing, Spring Boot for exposing REST services, and MongoDB as the storage backend. Every component is packaged with Docker, so the whole stack can be deployed quickly and reproducibly in a local environment.

Architectural Structure

Backend API and MongoDB (NoSQL)

The backend is built on Spring Boot 3.3.4, following a classic layered architecture: Controllers for the HTTP layer, Services for business logic, Repositories for data access, and DTOs for structuring responses. This design ensures that each component has a single responsibility.
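The layering can be sketched in plain Java. This is a minimal illustration, not the project's code: it leaves out Spring annotations so that it compiles without dependencies, an in-memory map stands in for MongoDB, and every name (Book, BookDto, BookRepository, BookService, BookController) is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Domain object as it would be stored (hypothetical fields).
record Book(String id, String title, String author, int year) {}

// DTO: the shape returned to API clients, decoupled from the stored document.
record BookDto(String id, String title, String author) {}

// Repository layer: data access only. An in-memory map stands in for MongoDB.
class BookRepository {
    private final Map<String, Book> store = new LinkedHashMap<>();

    Book save(Book book) {
        store.put(book.id(), book);
        return book;
    }

    Optional<Book> findById(String id) {
        return Optional.ofNullable(store.get(id));
    }

    List<Book> findAll() {
        return new ArrayList<>(store.values());
    }
}

// Service layer: business logic and mapping from domain objects to DTOs.
class BookService {
    private final BookRepository repository;

    BookService(BookRepository repository) {
        this.repository = repository;
    }

    BookDto create(String id, String title, String author, int year) {
        Book saved = repository.save(new Book(id, title, author, year));
        return toDto(saved);
    }

    Optional<BookDto> findById(String id) {
        return repository.findById(id).map(BookService::toDto);
    }

    private static BookDto toDto(Book book) {
        return new BookDto(book.id(), book.title(), book.author());
    }
}

// Controller layer: turns a request into a service call and a status code.
class BookController {
    private final BookService service;

    BookController(BookService service) {
        this.service = service;
    }

    // Returns an HTTP-like status: 200 if the book exists, 404 otherwise.
    int getStatus(String id) {
        return service.findById(id).isPresent() ? 200 : 404;
    }
}
```

Each class depends only on the layer directly below it, so the repository could be swapped for a real MongoDB-backed implementation without touching the controller.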

The core storage system is MongoDB, a document-oriented NoSQL database. Unlike relational (SQL) databases, which store rows in tables with a fixed schema, MongoDB stores records as flexible, JSON-like documents (encoded as BSON). This suits the project for three reasons: fields can be added to documents without altering a table definition, documents map naturally to Java objects through Spring Data MongoDB, and the database can scale horizontally through sharding as the volume of books grows.

REST API Endpoints

| HTTP Method | Endpoint | Operation Description |
| --- | --- | --- |
| GET | /api/v1/books | Retrieves the complete list of stored books. |
| GET | /api/v1/books/{id} | Fetches the details of a specific book by its unique identifier. |
| POST | /api/v1/books | Creates a new entry. Includes validations (e.g., publication year cannot be in the future) and automatic normalization. |
| PUT | /api/v1/books/{id} | Updates the information of an existing book. |
| DELETE | /api/v1/books/{id} | Permanently deletes the document from the database. |
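
The validation and normalization mentioned for POST could look roughly like the sketch below. It is not the project's code: the class name BookInputRules, its method names, and the normalization rule (trimming the title and collapsing repeated whitespace) are assumptions. The source states only that the publication year cannot be in the future.

```java
import java.time.Year;

// Hypothetical helper; names and the normalization rule are assumptions.
final class BookInputRules {
    private BookInputRules() {}

    // Rejects a publication year later than the given current year.
    static void validateYear(int publicationYear, int currentYear) {
        if (publicationYear > currentYear) {
            throw new IllegalArgumentException(
                "Publication year " + publicationYear + " is in the future");
        }
    }

    // Convenience overload that reads the current year from the system clock.
    static void validateYear(int publicationYear) {
        validateYear(publicationYear, Year.now().getValue());
    }

    // Assumed normalization: trim and collapse internal runs of whitespace.
    static String normalizeTitle(String title) {
        if (title == null || title.isBlank()) {
            throw new IllegalArgumentException("Title must not be blank");
        }
        return title.trim().replaceAll("\\s+", " ");
    }
}
```

Passing the current year as a parameter keeps the rule testable without depending on the system clock.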

Apache Spark: Distributed Processing

Apache Spark is a unified analytics engine for large-scale data processing. Its strength is its distributed architecture: a driver process plans the job and splits it into tasks, a cluster manager (the master node in Spark's standalone mode) allocates resources, and executors on multiple worker nodes run the tasks, each using several cores. To process 50 million records, for example, Spark divides the data into partitions and the workers process them in parallel, which reduces runtime and avoids being bound by the memory of a single machine.
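
The split-and-distribute idea can be illustrated on a single JVM without Spark. The sketch below is only an analogy: a thread pool plays the role of the workers and the calling thread plays the role of the driver. It uses no Spark API, and the names (ChunkedSum, sumInParallel) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkedSum {
    private ChunkedSum() {}

    // Splits the array into chunks, sums each chunk on a pool thread,
    // then combines the partial sums on the calling ("driver") thread.
    static long sumInParallel(int[] values, int chunkSize, int workers) {
        if (chunkSize <= 0 || workers <= 0) {
            throw new IllegalArgumentException("chunkSize and workers must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int start = 0; start < values.length; start += chunkSize) {
                final int from = start;
                final int to = Math.min(start + chunkSize, values.length);
                tasks.add(() -> {
                    long partial = 0;
                    for (int i = from; i < to; i++) {
                        partial += values[i];
                    }
                    return partial;
                });
            }
            long total = 0;
            // invokeAll blocks until every chunk has been processed.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while summing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A chunk failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Unlike this sketch, Spark distributes partitions across separate machines and can recompute a lost partition, which a local thread pool does not do.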

Usage Context: Template vs. Real-World Environment

In the context of this repository, using Spark to read a small CSV file is clearly overkill: far more technology than the problem needs. This is intentional, since the goal is to demonstrate the architecture.

In a real-world environment, data would not come from a local CSV file. A Spark job like this would connect to a Data Lake (such as Amazon S3 or Hadoop HDFS) to ingest terabytes of historical data, or to a real-time event streaming system (like Apache Kafka) to process millions of transactions per second. Spark would clean, normalize, and aggregate these massive datasets in parallel before finally saving the output to MongoDB for the Spring Boot backend to quickly serve to users.

RDDs vs DataFrames in the Project

The Spark module in this project includes two implementations to demonstrate two of the framework's main APIs:

1. RDDs (Resilient Distributed Datasets): The most fundamental and lowest-level data structure in Spark. They allow for very fine-grained control using lambda functions and functional programming in Java, but require the developer to optimize operations manually.

2. DataFrames (Recommended): A higher-level abstraction that organizes data in named columns, similar to a relational table or a pandas DataFrame in Python. The main advantage is that they use Spark's Catalyst optimizer, which automatically rewrites the query plan to make it more efficient before execution. In most modern use cases, DataFrames are the preferred option for performance and ease of use.

Orchestration with Docker Compose

To avoid the classic "works on my machine" problem, the entire ecosystem is orchestrated with docker-compose. This spins up three isolated but interconnected services: the MongoDB server, the Spring Boot backend exposing the API, and the Spark container, which acts as an ephemeral job (it starts up, runs the data ingestion, and then shuts down automatically). This allows any developer to have the complete infrastructure running in a matter of minutes.