Project Architecture & Workflow
This project implements an end-to-end content recommendation pipeline. It automatically ingests technical news from the Hacker News (Algolia) platform via web scraping using Selenium. Once the raw text is extracted, the system relies on Natural Language Processing (NLP) to understand the semantic meaning of each title.
Instead of relying purely on exact keyword matching, the system encodes the text into dense vectors (embeddings) and stores them in Qdrant, a high-performance Vector Database. By mapping a user's reading history into the same vector space, the system can perform a semantic search to recommend unread articles that conceptually align with the user's implicit interests.
Text Processing & Keyword Extraction with spaCy
Before vectorization, raw titles undergo rigorous linguistic preprocessing using spaCy. This step is crucial to reduce noise and extract the core subjects of the articles. The NLP pipeline performs tokenization, part-of-speech (POS) tagging, and lemmatization.
By filtering tokens to retain only relevant nouns (and filtering out stop words, punctuation, and verbs), the system extracts 2 to 4 highly representative keywords per article. These keywords serve a dual purpose: they provide human-readable tags for the articles and act as a hard filter (sparse metadata) during the vector retrieval phase to ensure topical relevance.
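The noun filter described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `extract_keywords`, the choice to keep proper nouns (`PROPN`) alongside `NOUN`, and the frequency-then-position ranking are all assumptions. The demo uses hypothetical stand-in tokens (`fake_token`) that mimic the spaCy token attributes `pos_`, `lemma_`, `is_stop`, and `is_punct`, so it runs without downloading a model.

```python
from collections import Counter
from types import SimpleNamespace


def extract_keywords(doc, max_keywords=4):
    """Return up to `max_keywords` noun lemmas from a spaCy-style Doc.

    `doc` can be any iterable of tokens exposing the spaCy attributes
    pos_, lemma_, is_stop and is_punct.
    """
    counts = Counter()
    first_seen = {}
    for position, token in enumerate(doc):
        # Drop stop words and punctuation first.
        if token.is_stop or token.is_punct:
            continue
        # Keep nouns only (including PROPN is an assumption of this sketch).
        if token.pos_ not in {"NOUN", "PROPN"}:
            continue
        lemma = token.lemma_.lower()
        counts[lemma] += 1
        first_seen.setdefault(lemma, position)
    # Most frequent lemma first; ties keep their order of appearance.
    ranked = sorted(counts, key=lambda lemma: (-counts[lemma], first_seen[lemma]))
    return ranked[:max_keywords]


def fake_token(text, pos, lemma=None, is_stop=False, is_punct=False):
    """Hypothetical stand-in for a spaCy Token so the demo needs no model."""
    return SimpleNamespace(pos_=pos, lemma_=lemma or text,
                           is_stop=is_stop, is_punct=is_punct)


# Hand-tagged version of the title "Rust compilers are getting faster."
demo_doc = [
    fake_token("Rust", "PROPN"),
    fake_token("compilers", "NOUN", lemma="compiler"),
    fake_token("are", "AUX", lemma="be", is_stop=True),
    fake_token("getting", "VERB", lemma="get"),
    fake_token("faster", "ADJ", lemma="fast"),
    fake_token(".", "PUNCT", is_punct=True),
]
demo_keywords = extract_keywords(demo_doc)
print(demo_keywords)
```

With a real pipeline, the same function would be called as `extract_keywords(nlp(title))`, where `nlp` is a loaded spaCy model such as `spacy.load("en_core_web_sm")` (the model name is an assumption, since the source does not say which one is used).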
Semantic Vectorization and Qdrant Storage
The core engine uses the average_word_embeddings_komninos model from sentence-transformers to generate 300-dimensional dense vectors for each title. In this latent space, semantically similar concepts (e.g., "Machine Learning" and "Artificial Intelligence") are placed close to each other, even if they share no exact words.
These embeddings are stored persistently on disk using Qdrant. At query time, Qdrant uses an Approximate Nearest Neighbor (ANN) index, specifically a Hierarchical Navigable Small World (HNSW) graph, to find the closest vectors in sub-linear time, so lookups stay fast as the collection grows. The similarity metric is cosine similarity (configured in Qdrant as the Cosine distance), which compares the angle between vectors rather than their magnitude, making it well suited to text comparison.
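To make the "angle, not magnitude" point concrete, here is a small pure-Python sketch of cosine similarity. It is only an illustration with toy 3-dimensional vectors; in the actual pipeline Qdrant computes the metric internally, and the helper name `cosine_similarity` is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]   # same direction, ten times the length
orthogonal = [3.0, 0.0, -1.0]    # dot product with v is zero

same_direction = cosine_similarity(v, scaled)      # close to 1.0
unrelated = cosine_similarity(v, orthogonal)       # close to 0.0
print(same_direction, unrelated)
```

Scaling a vector leaves its score unchanged, so titles whose embeddings differ only in length are treated as equally similar.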
User Profiling & Recommendation Logic
To generate personalized recommendations, the system simulates a user profile based on a historical interaction log (reading history). The profile is constructed using a Mean Pooling strategy: the system calculates the mathematical centroid (average vector) of all the articles the user has read. This centroid represents the user's overall "center of interest" in the 300-dimensional latent space.
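The mean-pooling step amounts to a per-dimension average. The sketch below uses toy 3-dimensional vectors in place of the 300-dimensional embeddings, and the function name `centroid` is an assumption rather than the project's own identifier.

```python
def centroid(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    if not vectors:
        raise ValueError("cannot build a profile from an empty reading history")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimensionality")
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


# Toy embeddings of three articles the user has read.
read_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
user_profile = centroid(read_vectors)
print(user_profile)
```

With NumPy, `numpy.mean(read_vectors, axis=0)` computes the same average.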
Additionally, the system tracks how often each extracted keyword appears in the articles the user has read. When querying Qdrant, it searches for the vectors closest to the user's centroid but applies a payload (metadata) filter so that only articles containing the user's most frequent keywords are returned. This keeps the semantic search from drifting too far from the user's explicit topics.
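The filter-then-rank logic can be imitated in plain Python, which makes the behavior easy to inspect without a running Qdrant instance. Everything here is a hedged sketch: the helper names, the toy 2-dimensional vectors, the choice of the top 2 keywords, and the "match any keyword" semantics are assumptions. In Qdrant itself this would be expressed as a payload filter passed with the search request.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def top_keywords(read_articles, n=2):
    """Most frequent keywords across the reading history."""
    counts = Counter(kw for article in read_articles for kw in article["keywords"])
    return [kw for kw, _ in counts.most_common(n)]


def recommend(profile_vector, articles, allowed_keywords, read_ids, limit=5):
    """Keep unread articles sharing at least one allowed keyword, then rank."""
    allowed = set(allowed_keywords)
    candidates = [
        article for article in articles
        if article["id"] not in read_ids and allowed & set(article["keywords"])
    ]
    candidates.sort(
        key=lambda article: cosine_similarity(profile_vector, article["vector"]),
        reverse=True,
    )
    return [article["id"] for article in candidates[:limit]]


history = [
    {"id": "r1", "vector": [1.0, 0.0], "keywords": ["rust", "compiler"]},
    {"id": "r2", "vector": [1.0, 0.4], "keywords": ["rust", "memory"]},
    {"id": "r3", "vector": [1.0, 0.2], "keywords": ["compiler", "python"]},
]
unread = [
    {"id": "a", "vector": [0.9, 0.1], "keywords": ["rust", "async"]},
    {"id": "b", "vector": [1.0, 0.2], "keywords": ["funding"]},
    {"id": "c", "vector": [0.2, 0.8], "keywords": ["compiler"]},
]

profile_vector = [1.0, 0.2]  # centroid of the three history vectors
predominant = top_keywords(history, n=2)
recommendations = recommend(
    profile_vector, history + unread, predominant, {a["id"] for a in history}
)
print(predominant, recommendations)
```

Article `b` sits exactly on the centroid but is excluded because it carries none of the predominant keywords, which illustrates how a hard filter can trade recall for topical focus.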
Advanced Retrieval: Moving Beyond Dense Search
While the current implementation relies on dense vector search with a hard keyword filter, modern Information Retrieval systems often employ more sophisticated pipelines. As explored in my other project (ai-movie-recommendator), relying solely on dense embeddings can miss exact term matches (e.g., specific acronyms or model names like "GPT-4").
The table below contrasts the current approach with advanced retrieval strategies that could be implemented to improve precision.
Comparison of Information Retrieval Strategies
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Dense Search + Metadata Filter (Current) | Cosine similarity on embeddings, filtered by spaCy keywords. | Fast, highly scalable, captures deep semantic meaning. | Hard filtering can kill recall; misses exact keyword matches if not explicitly tagged. |
| Hybrid Search (Sparse + Dense) | Combines dense vectors with sparse vectors (BM25 / SPLADE). | Best of both worlds: captures semantics AND exact terminology. | Requires maintaining two indexes; memory intensive; needs weight tuning (alpha). |
| Two-Stage Pipeline (Retrieval + Reranking) | Fast ANN fetch (top 100), then a Cross-Encoder rescores those candidates and the top 10 are returned. | State-of-the-art accuracy; considers deep token interactions. | Cross-Encoders are computationally heavy; adds latency to the query. |
To see more advanced retrieval techniques in practice, check out my other project, AI Movie Recommendator (ai-movie-recommendator).
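As a rough illustration of the alpha weighting mentioned in the Hybrid Search row, the sketch below fuses a dense score and a sparse (BM25-style) score per document. It is not part of the current implementation. The min-max normalization, the function names, and the example scores are all assumptions; production systems may use other fusion schemes.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def hybrid_scores(dense, sparse, alpha=0.5):
    """Weighted sum: alpha * dense + (1 - alpha) * sparse.

    Documents missing from one result list contribute 0.0 for that side.
    """
    dense_norm, sparse_norm = min_max(dense), min_max(sparse)
    doc_ids = set(dense_norm) | set(sparse_norm)
    return {
        doc_id: alpha * dense_norm.get(doc_id, 0.0)
        + (1 - alpha) * sparse_norm.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Made-up scores: cosine similarities and BM25 values live on different scales,
# which is why both are normalized before mixing.
dense = {"a": 0.9, "b": 0.8, "c": 0.1}
sparse = {"b": 12.0, "c": 3.0, "d": 7.5}

fused = hybrid_scores(dense, sparse, alpha=0.5)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Document `b` ranks first because it scores well on both sides, while `a` (dense only) and `d` (sparse only) are kept rather than dropped. Raising `alpha` shifts the balance toward semantic similarity; lowering it favors exact term matches.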
Suggested Improvements & Production Readiness
- Model Upgrade: Replace the lightweight `komninos` model with a modern architecture like `all-MiniLM-L6-v2` or OpenAI embeddings for richer semantic capture.
- Data Ingestion: Replace Selenium scraping with the official Hacker News API to improve ingestion speed, reliability, and reduce fragility against DOM changes.
- Temporal Decay: Implement a recency boost. In technical news, a mathematically close vector from 5 years ago is usually less valuable than one from yesterday. Applying a time-decay function to the similarity score should improve relevance for fresh content.
- Diversity Mechanisms: Implement Maximal Marginal Relevance (MMR) in the final recommendation step to avoid recommending 5 articles about the exact same niche topic, ensuring content diversity for the user.
- Production Deployment: Transition Qdrant from local disk to a containerized Docker environment or Qdrant Cloud for replication, and set up Airflow/Cron schedulers for continuous index updates.
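Two of the suggestions above, temporal decay and Maximal Marginal Relevance, are simple enough to sketch in plain Python. Both functions are illustrative assumptions rather than existing project code. The exponential half-life form of the decay and its 30-day default are one possible choice among many, and `lam` is the usual MMR trade-off between relevance (`lam = 1`) and diversity (`lam = 0`).

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def decayed_score(similarity, age_days, half_life_days=30.0):
    """Halve the similarity score every `half_life_days` (assumed decay form)."""
    return similarity * 0.5 ** (age_days / half_life_days)


def mmr(query, candidates, k, lam=0.7):
    """Greedy Maximal Marginal Relevance over {doc_id: vector} candidates."""
    relevance = {doc_id: cosine_similarity(query, vec) for doc_id, vec in candidates.items()}
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(doc_id):
            # Penalize similarity to the closest already-selected document.
            redundancy = max(
                (cosine_similarity(candidates[doc_id], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc_id] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy 2-dimensional vectors: two near-duplicate Rust articles and one Python article.
query = [1.0, 0.2]
candidates = {
    "rust_1": [0.99, 0.05],
    "rust_2": [1.0, 0.0],
    "python": [0.5, 0.85],
}

diverse = mmr(query, candidates, k=2, lam=0.5)
relevance_only = mmr(query, candidates, k=2, lam=1.0)
print(diverse, relevance_only)
```

With `lam=1.0` the two near-duplicate Rust articles are returned together, while `lam=0.5` swaps the second one for the less similar Python article. The decay function could be applied to each relevance score before MMR runs, so that older articles start from a lower baseline.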

