Data Engineer - Web Scraping & Threat Intelligence
Job Description
This role owns the end-to-end pipeline for continuously discovering, crawling, extracting, indexing, and normalizing millions of new artifacts daily, including documents, chats, forums, leaked datasets, repositories, threat actor communications, hacker marketplaces, unsecured infrastructure, and decentralized networks across the surface web, deep web, dark web, and anonymized networks.
Our Threat Research Team's mission is aggressive: achieve near-total coverage of global breach and leak data with 99%+ automation. Your work directly enables HEROIC's ability to identify exposures before they are weaponized.
What You Will Do
Automated Intelligence Collection & Discovery
- Architect and operate large-scale, distributed crawling and discovery systems across:
- Surface web, deep web, and dark web
- Hacker forums, underground marketplaces, and breach communities
- Chat platforms (Telegram, Discord, IRC, WhatsApp, etc.)
- Paste sites, code repositories, and social platforms used for breach disclosure
- Continuously discover, archive, and download newly released datasets, logs, credentials, and artifacts the moment they appear
- Build automated collectors and archivers for anonymized and decentralized networks including:
- Tor (.onion), I2P, ZeroNet, Freenet, IPFS, GNUnet, Lokinet, Yggdrasil, and similar systems
- Design resilient workflows for unreliable, adversarial, or ephemeral data sources
- Normalize and index data from non-traditional network protocols and formats
- Develop automated scanning systems to identify:
- Unsecured databases (Elasticsearch, MySQL, PostgreSQL, MongoDB, etc.)
- Exposed cloud storage (S3, Azure, GCP, DigitalOcean Spaces)
- Open FTP servers, backups, and misconfigured archives
- Monitor and ingest data from file hosting and distribution platforms commonly used for breach dumps
- Build ETL pipelines to clean, normalize, enrich, and index structured and unstructured data
- Implement advanced anti-bot evasion strategies (proxy rotation, fingerprinting, CAPTCHA mitigation, session management)
- Integrate collected intelligence into centralized databases and search systems
- Design APIs and internal tooling to support downstream analysis and AI/ML workflows
- Automate deployment, scaling, and monitoring using Docker, Kubernetes, and cloud infrastructure
- Continuously optimize performance, reliability, and cost efficiency of crawler clusters
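To illustrate the kind of anti-bot resiliency work described above, here is a minimal Python sketch of proxy rotation with dead-proxy tracking and basic fingerprint variation. The proxy URLs and user-agent strings are placeholders, not real endpoints; a production system would plug this into an HTTP client such as requests or aiohttp.

```python
import itertools
import random


class ProxyRotator:
    """Round-robin proxy pool that skips proxies flagged as dead."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._dead = set()
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self):
        # Advance the cycle, skipping proxies previously marked dead.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._dead:
                return proxy
        raise RuntimeError("all proxies marked dead")

    def mark_dead(self, proxy):
        # Called when a proxy times out or gets blocked.
        self._dead.add(proxy)


def request_headers():
    """Vary the User-Agent header to reduce fingerprint consistency."""
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]
    return {"User-Agent": random.choice(user_agents)}
```

A crawler worker would call `next_proxy()` per request and `mark_dead()` on repeated failures; CAPTCHA mitigation and session handling would layer on top of this same rotation loop.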
Requirements
- Minimum 4 years of hands-on experience in data engineering, intelligence collection, crawling, or distributed data pipelines
- Strong Python expertise and experience with frameworks such as Scrapy, Playwright, Selenium, or custom async systems
- Proven experience operating high-volume, automated data collection systems in production
- Deep understanding of web protocols, HTTP, DOM parsing, and adversarial scraping environments
- Experience with asynchronous, concurrent, and distributed architectures
- Familiarity with SQL and NoSQL databases (PostgreSQL, MongoDB, Elasticsearch, Cassandra)
- Strong Linux/Unix, shell scripting, and Git-based workflows
- Experience deploying and operating systems using Docker, Kubernetes, AWS, or GCP
- Excellent analytical, debugging, and problem-solving skills
- Strong written and verbal communication skills
Preferred Qualifications
- Direct experience with dark web intelligence, breach data, OSINT, or threat research
- Familiarity with Tor, I2P, underground forums, stealer logs, or credential ecosystems
- Experience processing large breach datasets or stealer logs
- Background working in adversarial data environments
- Exposure to AI/ML-driven intelligence platforms
What We Offer
- Position Type: Full-time
- Location: Remote in India. Work from wherever you please! Your home, the beach, our offices, etc.
- Compensation: USD 1300-2000 monthly (depending on experience)
- Professional Growth: Amazing upward mobility in a rapidly expanding company.
- Innovative Culture: Be part of a team that leverages AI and cutting-edge technologies.
Skills
Python, Cloud Infrastructure, ETL, MySQL, Web Intelligence, AI/ML, Data Engineering, SQL
Application Deadline: 04 Apr 26, 04:03 PM IST