Data Engineer - Web Scraping & Threat Intelligence
Job Description
This role owns the end-to-end pipeline for continuously discovering, crawling, extracting, indexing, and normalizing millions of new artifacts daily, including documents, chats, forums, leaked datasets, repositories, threat actor communications, hacker marketplaces, unsecured infrastructure, and decentralized networks across the surface web, deep web, dark web, and anonymized networks.
Our Threat Research Team's mission is aggressive: achieve near-total coverage of global breach and leak data with 99%+ automation. Your work directly enables HEROIC's ability to identify exposures before they are weaponized.
What You Will Do
Automated Intelligence Collection & Discovery
- Architect and operate large-scale, distributed crawling and discovery systems across:
- Surface web, deep web, and dark web
- Hacker forums, underground marketplaces, and breach communities
- Chat platforms (Telegram, Discord, IRC, WhatsApp, etc.)
- Paste sites, code repositories, and social platforms used for breach disclosure
- Continuously discover, archive, and download newly released datasets, logs, credentials, and artifacts the moment they appear
- Build automated collectors and archivers for anonymized and decentralized networks including:
- Tor (.onion), I2P, ZeroNet, Freenet, IPFS, GNUnet, Lokinet, Yggdrasil, and similar systems
- Design resilient workflows for unreliable, adversarial, or ephemeral data sources
- Normalize and index data from non-traditional network protocols and formats
- Develop automated scanning systems to identify:
- Unsecured databases (Elasticsearch, MySQL, PostgreSQL, MongoDB, etc.)
- Exposed cloud storage (S3, Azure, GCP, DigitalOcean Spaces)
- Open FTP servers, backups, and misconfigured archives
- Monitor and ingest data from file hosting and distribution platforms commonly used for breach dumps
- Build ETL pipelines to clean, normalize, enrich, and index structured and unstructured data
- Implement advanced anti-bot evasion strategies (proxy rotation, fingerprinting, CAPTCHA mitigation, session management)
- Integrate collected intelligence into centralized databases and search systems
- Design APIs and internal tooling to support downstream analysis and AI/ML workflows
- Automate deployment, scaling, and monitoring using Docker, Kubernetes, and cloud infrastructure
- Continuously optimize performance, reliability, and cost efficiency of crawler clusters
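To illustrate the kind of anti-bot resiliency work described above, here is a minimal Python sketch of proxy rotation with dead-proxy tracking and basic fingerprint variation. The proxy URLs and user-agent strings are placeholders, not real endpoints; a production system would plug this into an HTTP client such as requests or aiohttp.

```python
import itertools
import random


class ProxyRotator:
    """Round-robin proxy pool that skips proxies flagged as dead."""

    def __init__(self, proxies):
        self._proxies = list(proxies)
        self._dead = set()
        self._cycle = itertools.cycle(self._proxies)

    def next_proxy(self):
        # Advance the cycle, skipping proxies previously marked dead.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._dead:
                return proxy
        raise RuntimeError("all proxies marked dead")

    def mark_dead(self, proxy):
        # Called when a proxy times out or gets blocked.
        self._dead.add(proxy)


def request_headers():
    """Vary the User-Agent header to reduce fingerprint consistency."""
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]
    return {"User-Agent": random.choice(user_agents)}
```

A crawler worker would call `next_proxy()` per request and `mark_dead()` on repeated failures; CAPTCHA mitigation and session handling would layer on top of this same rotation loop.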
Requirements
- Minimum 4 years of hands-on experience in data engineering, intelligence collection, crawling, or distributed data pipelines
- Strong Python expertise and experience with frameworks such as Scrapy, Playwright, Selenium, or custom async systems
- Proven experience operating high-volume, automated data collection systems in production
- Deep understanding of web protocols, HTTP, DOM parsing, and adversarial scraping environments
- Experience with asynchronous, concurrent, and distributed architectures
- Familiarity with SQL and NoSQL databases (PostgreSQL, MongoDB, Elasticsearch, Cassandra)
- Strong Linux/Unix, shell scripting, and Git-based workflows
- Experience deploying and operating systems using Docker, Kubernetes, AWS, or GCP
- Excellent analytical, debugging, and problem-solving skills
- Strong written and verbal communication skills
Preferred Qualifications
- Direct experience with dark web intelligence, breach data, OSINT, or threat research
- Familiarity with Tor, I2P, underground forums, stealer logs, or credential ecosystems
- Experience processing large breach datasets or stealer logs
- Background working in adversarial data environments
- Exposure to AI/ML-driven intelligence platforms
What We Offer
- Position Type: Full-time
- Location: Remote in India. Work from wherever you please! Your home, the beach, our offices, etc.
- Compensation: USD 1300-2000 monthly (depending on experience)
- Professional Growth: Amazing upward mobility in a rapidly expanding company.
- Innovative Culture: Be part of a team that leverages AI and cutting-edge technologies.
Skills
Python, Cloud Infrastructure, ETL, MySQL, Web Intelligence, AI/ML, Data Engineering, SQL
Application Deadline: 04 Apr 26, 04:03 PM IST