Overview
The Atlas Project was conceived to address a pressing need for our client: the ability to gather and centralize up-to-date information on auctions, commercial offers, tenders, and similar opportunities. By leveraging advanced web scraping and automation techniques, we developed a solution that efficiently compiles relevant data from more than 4,000 sources.
The Challenge
Our client faced several significant obstacles:
- Numerous Sources: More than 4,000 websites had to be monitored, an overwhelming number of data sources.
- Lack of APIs: Most of these websites did not provide APIs, complicating the data extraction process.
- High Volume of Data: Each website publishes numerous offers daily, requiring extensive filtering and compliance checks against specific criteria.
- Initial Data Processing: Valuable information had to be extracted from the raw data before it was useful to end users.
The Solution
To tackle these challenges, we developed an automated scraping script that extracts data rapidly from a continually growing list of websites. Our approach adapts to each type of site (a simplified dispatch sketch follows the list):
- Simple Websites: Capture a snapshot of the webpage and extract necessary data directly.
- Complex Websites: Utilize browser automation libraries to simulate user actions and retrieve precise data.
- Websites with APIs: Integrate with available APIs to streamline data acquisition.
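A minimal sketch of that dispatch, assuming a per-site configuration dict; the keys (`kind`, `url`, `api_url`, `title_xpath`) and function names are illustrative, not the project's actual code:

```python
# Simplified per-site dispatch (illustrative structure, not the project's
# actual code). Each site entry declares how it should be scraped.
import requests
from lxml import html

def scrape_simple(site):
    """Simple site: take a snapshot of the page and extract data via XPath."""
    page = requests.get(site["url"], timeout=30)
    page.raise_for_status()
    tree = html.fromstring(page.content)
    # Assumes the XPath selects text nodes, e.g. ".../h3/text()".
    return [title.strip() for title in tree.xpath(site["title_xpath"])]

def scrape_api(site):
    """Site with an API: query it directly instead of parsing HTML."""
    response = requests.get(site["api_url"], timeout=30)
    response.raise_for_status()
    # The response shape here is a placeholder.
    return [offer["title"] for offer in response.json()["offers"]]

def scrape(site):
    # Complex, JavaScript-heavy sites go through browser automation
    # instead (see the Selenium sketch in the Implementation section).
    strategies = {"simple": scrape_simple, "api": scrape_api}
    return strategies[site["kind"]](site)
```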
We generated a comprehensive list of pertinent offers by filtering the scraped data against targeted keywords, as in the sketch below.
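A minimal version of that keyword filter, with placeholder keywords and offer titles:

```python
# Minimal keyword-matching step; the keyword set and offers are placeholders.
KEYWORDS = {"auction", "tender", "procurement"}

def is_relevant(title: str) -> bool:
    """Keep an offer if its title mentions any target keyword."""
    lowered = title.lower()
    return any(keyword in lowered for keyword in KEYWORDS)

offers = ["Public tender: road maintenance", "Office chairs for sale"]
print([o for o in offers if is_relevant(o)])  # ['Public tender: road maintenance']
```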
Implementation
Our solution comprised several key components and tasks:
1. Developing Python Console Commands
- Maintain the outdated script that gathers data from a fixed set of websites.
- Gradually transition its functionality to Python.
- Extract data from the outdated script’s database for analyst processing.
- Periodically update and extract data from newly added websites for further analyst review (the command layout is sketched below).
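A sketch of how these console commands might be laid out with Click; the command and option names are illustrative assumptions, not the project's actual interface:

```python
# Hypothetical command layout (names are illustrative, not the real interface).
import click

@click.group()
def cli():
    """Atlas data-collection commands."""

@cli.command()
@click.option("--site", required=True, help="Identifier of the website to scrape.")
def scrape(site):
    """Scrape one website and store its raw offers."""
    click.echo(f"Scraping {site}...")

@cli.command()
@click.option("--since", default="1d", help="Export offers newer than this age.")
def export(since):
    """Export offers from the database for analyst review."""
    click.echo(f"Exporting offers from the last {since}")

if __name__ == "__main__":
    cli()
```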
2. Custom Instructions with XPath and Selenium
- Create specific commands to extract information from individual websites.
- Implement Robot Framework and Playwright as an alternative way of authoring these instructions (a Selenium example follows).
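As an illustration, a site-specific instruction in Selenium might look like the following; the URL and XPath expressions are placeholders for a real site's values:

```python
# Sketch of a site-specific Selenium instruction (URL and XPaths are
# placeholders, not values from the project).
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run without a visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/tenders")
    # Simulate the user action needed to reveal the listings.
    driver.find_element(By.XPATH, "//button[contains(., 'Show all')]").click()
    titles = [
        el.text for el in driver.find_elements(By.XPATH, "//div[@class='offer']/h3")
    ]
finally:
    driver.quit()
```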
3. Artificial Intelligence Integration
- Utilize AI to process files related to offers and extract useful information (sketched below).
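A hedged sketch of this step using the OpenAI Python client; the model name, prompt, and output format are assumptions rather than the project's actual configuration:

```python
# Illustrative AI extraction step (model and prompt are assumptions).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_fields(document_text: str) -> str:
    """Ask the model to pull key fields out of an offer document."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Extract the title, deadline, and budget from the offer "
                        "document as JSON. Use null for missing fields."},
            {"role": "user", "content": document_text},
        ],
    )
    return response.choices[0].message.content
```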
4. Website-Specific Instructions
- Develop and maintain tailored instructions for extracting useful data from each website (one possible structure is sketched below).
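One plausible way to keep these per-site instructions maintainable is a declarative registry that feeds the dispatcher shown earlier; the structure and entries below are assumptions for illustration:

```python
# Hypothetical registry of per-site instructions (entries are placeholders).
# Each entry names the scraping strategy and the expressions its pages need,
# so adding a website means adding data rather than code.
SITE_INSTRUCTIONS = {
    "example-tenders": {
        "kind": "simple",
        "url": "https://example.com/tenders",
        "title_xpath": "//div[@class='offer']/h3/text()",
    },
    "example-auctions": {
        "kind": "api",
        "api_url": "https://example.org/api/v1/auctions",
    },
}
```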
5. CI/CD Process Development
- Establish a continuous integration and deployment pipeline for code testing, analysis, and deployment.
6. Containerization
- Migrate from a constantly running virtual machine to an on-demand container-based solution.
Business Value
The Atlas Project delivered substantial value to our client:
- Automation: Streamlined the search process according to predefined criteria, reducing human error.
- Centralization: Consolidated data from various sources into an external system for easier access and management.
- Efficiency: Automated initial data processing to extract and highlight critical information.
- Cost Reduction: Lowered infrastructure maintenance costs through efficient resource utilization.
Technologies Used
- Command Line Interface: Python, Click
- Automation Frameworks: Selenium, Robot Framework, Playwright
- Cloud Services: AWS EC2, Fargate, CloudWatch
- Artificial Intelligence: OpenAI
- Additional Tools: Docker, Git, CI/CD
Conclusion
The Atlas Project stands as a testament to our commitment to innovative solutions and client satisfaction. By leveraging cutting-edge technologies and methodologies, we transformed a complex data aggregation challenge into a streamlined, automated process that delivers reliable and timely information to our client's customers.