2024–2025 · AI · NLP · PLAGIARISM DETECTION

Vedrix — Enterprise plagiarism detection at 90% accuracy.

YEAR
2024–2025
COMPANY
Personal Project
ROLE
Backend Engineer · NLP Engineer
Django · Python · NLP · LLM · TF-IDF · Google Search API · REST
NLP token streams through TF-IDF similarity graphs
90% · Detection Accuracy
50MB · Max Doc Size
Real-time · Processing
PROBLEM

Academic institutions lacked an affordable, accurate, enterprise-grade plagiarism detection system that could handle large documents, match against live web content, and deliver detailed similarity reports in real time. Existing tools were expensive, inaccurate, or unable to process complex document formats at scale.

APPROACH

Built the full system using Django REST Framework as the API backbone. Custom document parsing algorithms extract clean text from PDF and Word files up to 50MB, handling complex formatting, tables, and multi-column layouts. TF-IDF vectorization with cosine similarity handles fast intra-corpus matching. LLM APIs add semantic similarity detection beyond keyword matching. Google Search API powers real-time web content matching to detect online sources. An automated report generator highlights matching segments with source attribution and similarity percentages.
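
The TF-IDF and cosine similarity step can be illustrated with a minimal, standard-library-only sketch. This is not the Vedrix code: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_matches`), the regex tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration, and a production system would more likely rely on a tested vectorizer library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_matches(submission, corpus):
    """Score the submission against each corpus document, best match first.

    Returns a list of (corpus_index, similarity) pairs.
    """
    # The submission is vectorized together with the corpus so that all
    # vectors share the same IDF weights.
    vectors = tfidf_vectors([submission] + list(corpus))
    query, rest = vectors[0], vectors[1:]
    scores = [(i, cosine(query, v)) for i, v in enumerate(rest)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_matches(submitted_text, reference_texts)` yields corpus indices ordered by similarity. Scores of this kind only capture shared vocabulary, which is why a semantic pass is layered on top for paraphrased content.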

RESULT

90% plagiarism detection accuracy validated across academic datasets. Real-time processing of PDF/Word documents up to 50MB. Live web matching via Google Search API. Automated detailed reports with segment-level highlighting and source attribution.
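
To make segment-level highlighting with source attribution concrete, here is a rough sketch built on Python's standard `difflib.SequenceMatcher`. It is an illustration under stated assumptions, not the actual report generator: the helper names (`matching_segments`, `build_report`), the word-level matching, the `min_words` threshold, and the coverage-based similarity percentage are all hypothetical choices.

```python
from difflib import SequenceMatcher

# Punctuation stripped before comparing words (an assumed normalization).
_PUNCT = ".,;:!?\"'()"


def _normalize(words):
    return [w.lower().strip(_PUNCT) for w in words]


def matching_segments(submission, source, min_words=4):
    """Return (start, end) word offsets in the submission that also appear in source.

    Only runs of at least `min_words` consecutive matching words are kept.
    """
    sub_norm = _normalize(submission.split())
    src_norm = _normalize(source.split())
    matcher = SequenceMatcher(None, sub_norm, src_norm, autojunk=False)
    return [
        (block.a, block.a + block.size)
        for block in matcher.get_matching_blocks()
        if block.size >= min_words
    ]


def build_report(submission, sources, min_words=4):
    """Build a simple report: matched segments per source plus an overall percentage.

    `sources` maps a source label (for example a URL) to its text.
    """
    words = submission.split()
    covered = set()  # word positions in the submission matched by any source
    matches = []
    for name, text in sources.items():
        for start, end in matching_segments(submission, text, min_words):
            matches.append({
                "source": name,
                "start": start,
                "end": end,
                "text": " ".join(words[start:end]),
            })
            covered.update(range(start, end))
    percent = 100.0 * len(covered) / len(words) if words else 0.0
    return {"similarity_percent": round(percent, 1), "matches": matches}
```

Each entry in `matches` carries the source label and the word offsets needed to highlight the segment in the rendered report. The percentage here is simply the share of submission words covered by at least one match, which is one of several reasonable definitions.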

— VISUALS
NLP ENGINE — TF-IDF vectorization with cosine similarity and LLM semantic analysis
REPORT SYSTEM — Automated similarity reports with segment highlighting and source attribution