Work from Office
The ideal candidate will support the full scope of Data Engineer responsibilities and partner with the organization on strategic initiatives.
Our Company is a leading data platform that enables medical credential verification and empowers organizations with accurate and reliable information. The company's data platform is a shared responsibility of the Data Engineering team, which builds the overall data architecture, and the Data Platform Sources team, which builds integrations with numerous governmental websites and extracts the raw data required to fuel our data pipelines. As a Software Engineer within the Data Platform Sources team, your primary responsibility will be to create software that extracts, cleans, and normalizes data from government websites and various other data sources.
- Develop and maintain software applications for web scraping, data extraction, cleansing, and normalization from government websites and other data sources.
- Implement fast, scalable web scrapers that extract medical practitioner data from web-based search interfaces, and ensure the accuracy and integrity of the extracted data.
- Identify methods to overcome bot detection mechanisms and rate limitations.
- Stay updated with the latest industry trends, techniques, and tools related to web scraping, and proactively apply new technologies to enhance the scraping process.
- Monitor the data quality of over 200 distinct data sources and regularly validate the accuracy, completeness, and consistency of the extracted data.
- Continuously improve scraping strategies and adapt to changes in countermeasures implemented by target websites to maintain data extraction capabilities.
- Troubleshoot and resolve issues related to web scraping, data extraction, and data quality, ensuring timely resolution to minimize any impact on downstream processes.
- Develop and maintain documentation related to data sources, their data characteristics, and the unique behaviors of their websites.
- Collaborate with the Data Engineering team to extend and improve our web scraping frameworks and data architectures.
- Bachelor's degree or equivalent experience in Computer Science, Software Engineering, or a related field.
- Strong programming skills, specifically in Python.
- Excellent written and verbal communication skills.
The ideal candidate should have:
- Experience with web scraping frameworks and libraries such as Playwright, Scrapy, BeautifulSoup, or Selenium.
- Proficiency in using third-party tools and services for web scraping, such as proxies, CAPTCHA solvers, and headless browsers.
- Strong analytical and problem-solving skills, with the ability to identify workarounds for extracting data from highly defensive data sources.
- Familiarity with the US healthcare industry is a strong plus.
- Familiarity with workflow orchestrators such as Airflow, Dagster, or Prefect.
- Proficiency in basic ETL transformations in SQL.