Project Overview
This project is a reusable architectural template that integrates large-scale data processing with REST API development. It combines Apache Spark for batch processing, Spring Boot for exposing REST services, and MongoDB as the storage backend. The whole stack is modular and packaged with Docker, so it can be deployed quickly and reproducibly in any local environment.
Backend API and MongoDB (NoSQL)
The backend is built on Spring Boot 3.3.4, following a classic layered architecture: Controllers for the HTTP layer, Services for business logic, Repositories for data access, and DTOs for structuring responses. This design ensures that each component has a single responsibility.
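The layering described above can be sketched in plain Java, without Spring annotations, so the snippet compiles on its own. Every name here (`BookDocument`, `BookResponse`, `BookRepository`, `InMemoryBookRepository`, `BookService`, `BookController`) is hypothetical and not taken from the project. In the real backend the repository would presumably be a Spring Data MongoDB interface and the controller a `@RestController`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical persistence model: the shape stored in the database.
record BookDocument(String id, String title, String author, int publicationYear) {}

// Hypothetical DTO: the shape returned to API clients.
record BookResponse(String id, String title, String author) {}

// Repository layer: data access only.
interface BookRepository {
    BookDocument save(BookDocument book);
    Optional<BookDocument> findById(String id);
    List<BookDocument> findAll();
}

// In-memory stand-in for a MongoDB-backed repository (an assumption for this sketch).
class InMemoryBookRepository implements BookRepository {
    private final Map<String, BookDocument> store = new ConcurrentHashMap<>();

    @Override
    public BookDocument save(BookDocument book) {
        store.put(book.id(), book);
        return book;
    }

    @Override
    public Optional<BookDocument> findById(String id) {
        return Optional.ofNullable(store.get(id));
    }

    @Override
    public List<BookDocument> findAll() {
        return new ArrayList<>(store.values());
    }
}

// Service layer: business logic and mapping from documents to DTOs.
class BookService {
    private final BookRepository repository;

    BookService(BookRepository repository) {
        this.repository = repository;
    }

    Optional<BookResponse> getBook(String id) {
        return repository.findById(id).map(BookService::toResponse);
    }

    List<BookResponse> listBooks() {
        return repository.findAll().stream().map(BookService::toResponse).toList();
    }

    private static BookResponse toResponse(BookDocument d) {
        return new BookResponse(d.id(), d.title(), d.author());
    }
}

// Controller layer: here it only translates a lookup result into an HTTP-style status code.
class BookController {
    private final BookService service;

    BookController(BookService service) {
        this.service = service;
    }

    int statusForGet(String id) {
        return service.getBook(id).isPresent() ? 200 : 404;
    }
}
```

Because the service depends only on the `BookRepository` interface, the in-memory implementation can be swapped for a database-backed one without touching the service or controller.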
The core storage system is MongoDB, a document-oriented NoSQL database. Unlike relational (SQL) databases, which use fixed table schemas, MongoDB stores information as dynamic, JSON-like documents (encoded as BSON). This suits the project because it allows a flexible data schema, maps naturally to Java objects through Spring Data MongoDB, and supports horizontal scaling through sharding as the volume of books grows.
REST API Endpoints
| HTTP Method | Endpoint | Operation Description |
|---|---|---|
| GET | /api/v1/books | Retrieves the complete list of stored books. |
| GET | /api/v1/books/{id} | Fetches the details of a specific book by its unique identifier. |
| POST | /api/v1/books | Creates a new entry. Includes validations (e.g., publication year cannot be in the future) and automatic normalization. |
| PUT | /api/v1/books/{id} | Updates the information of an existing book. |
| DELETE | /api/v1/books/{id} | Permanently deletes the document from the database. |
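As a rough illustration of the checks mentioned for the POST endpoint, the sketch below validates the publication year and normalizes a text field. The table only states that the year cannot be in the future and that normalization is automatic. The specific normalization shown here (trimming and collapsing whitespace) is an assumption, and `BookRules` and its method names are hypothetical.

```java
import java.time.Year;

// Hypothetical helper; the class and method names are not from the project.
final class BookRules {
    private BookRules() {}

    // Rule stated in the endpoint table: the publication year cannot be in the future.
    // The current year is passed in so the rule can be tested deterministically.
    static boolean isValidPublicationYear(int publicationYear, Year currentYear) {
        return publicationYear <= currentYear.getValue();
    }

    // Assumed normalization: trim both ends and collapse inner whitespace runs.
    static String normalizeText(String raw) {
        return raw == null ? "" : raw.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(isValidPublicationYear(1965, Year.now())); // prints true
        System.out.println(normalizeText("  The   Hobbit "));         // prints The Hobbit
    }
}
```

In a Spring Boot service, a check like this would typically run before the document is saved, with a failed validation mapped to a 400 response.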
Apache Spark: Distributed Processing
Apache Spark is a unified analytics engine for large-scale data processing. Its strength is its distributed architecture: a Driver program plans the work and splits it into tasks, a cluster manager (the Master, in standalone mode) allocates resources, and executors running on multiple Worker Nodes carry out the tasks. Each Worker can use several cores for computation. If you need to process 50 million records, for example, the data is divided into partitions that the Workers process simultaneously (in parallel), which can sharply reduce runtime and avoids the memory limits of a single machine.
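The split, process, and combine idea can be illustrated on a single JVM with a thread pool. This is only an analogy and does not use the Spark API: threads stand in for workers, the calling method stands in for the driver, and `PartitionedSum` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Single-JVM analogy of "split, process in parallel, combine". Not Spark API.
final class PartitionedSum {
    private PartitionedSum() {}

    // Split the records into at most `partitions` contiguous chunks.
    static List<List<Integer>> split(List<Integer> records, int partitions) {
        List<List<Integer>> chunks = new ArrayList<>();
        int chunkSize = Math.max(1, (records.size() + partitions - 1) / partitions);
        for (int start = 0; start < records.size(); start += chunkSize) {
            chunks.add(records.subList(start, Math.min(start + chunkSize, records.size())));
        }
        return chunks;
    }

    // Each chunk is summed on its own pool thread; the caller combines the partial sums.
    static long parallelSum(List<Integer> records, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> chunk : split(records, workers)) {
                Callable<Long> task = () -> chunk.stream().mapToLong(Integer::longValue).sum();
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a chunk", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A chunk failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 1; i <= 1_000; i++) {
            records.add(i);
        }
        System.out.println(parallelSum(records, 4)); // prints 500500
    }
}
```

Spark applies the same pattern across machines rather than threads, and additionally handles data locality, shuffles, and recovery from failed tasks, none of which this sketch attempts.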
Usage Context: Template vs. Real-World Environment
In the context of this repository, using Spark to read a small CSV file is clearly overkill: far more technology than the problem needs. This is intentional, however, because the goal is to demonstrate the architecture.
In a real-world environment, data would not come from a local CSV file. A Spark job like this would connect to a Data Lake (such as Amazon S3 or Hadoop HDFS) to ingest terabytes of historical data, or to a real-time event streaming system (like Apache Kafka) to process millions of transactions per second. Spark would clean, normalize, and aggregate these massive datasets in parallel before finally saving the output to MongoDB for the Spring Boot backend to quickly serve to users.
RDDs vs DataFrames in the Project
The Spark module in this project includes two implementations to demonstrate the framework's two main APIs:
1. RDDs (Resilient Distributed Datasets): The most fundamental and lowest-level data structure in Spark. They allow for very fine-grained control using lambda functions and functional programming in Java, but require the developer to optimize operations manually.
2. DataFrames (Recommended): A higher-level abstraction that organizes data in named columns, similar to a relational table or a pandas DataFrame in Python (in the Java API, a DataFrame is a `Dataset<Row>`). The big advantage is that they use Spark's Catalyst optimizer under the hood, which rewrites the query plan into a more efficient one before execution. In the vast majority of modern cases, DataFrames are the preferred option for both performance and ease of use.
Orchestration with Docker Compose
To avoid the classic "works on my machine" problem, the entire ecosystem is orchestrated with docker-compose. This spins up three isolated but interconnected services: the MongoDB server, the Spring Boot backend exposing the API, and the Spark container, which acts as an ephemeral job (it starts up, runs the data ingestion, and then shuts down automatically). This allows any developer to have the complete infrastructure running in a matter of minutes.
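A compose file for this three-service layout might look roughly like the sketch below. It is not the project's actual file: the service names, build paths, ports, and database name are all assumptions made for illustration.

```yaml
# Hypothetical docker-compose.yml; service names, build paths, and ports are assumptions.
services:
  mongo:
    image: mongo:7
    ports:
      - "27017:27017"

  backend:
    build: ./backend              # assumed location of the Spring Boot module
    environment:
      # Maps to the spring.data.mongodb.uri property; "books" is an assumed database name.
      SPRING_DATA_MONGODB_URI: mongodb://mongo:27017/books
    ports:
      - "8080:8080"
    depends_on:
      - mongo

  spark-job:
    build: ./spark-job            # assumed location of the Spark module
    depends_on:
      - mongo
    restart: "no"                 # run once and exit: the job is ephemeral
```

Inside the compose network the services reach each other by service name, which is why the backend's connection string points at `mongo` rather than `localhost`.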


