From an architectural point of view, the project has three main technical components:
- The ETL component, in charge of loading the RAPEX weekly report, which contains a multi-language product list, into an SQL database using Talend Open Studio.
- The core component, in charge of searching for, scraping and text mining the products added to the cloud database during the ETL, using the Google Custom Search API and Python's scraping utilities.
- The visualisation component, which displays the processed data in a dashboard implemented with the ELK Stack (Elasticsearch, Logstash and Kibana).
The solution architecture is represented in the following diagram, which describes the flow of the process and of the data while the user interacts with the software. Note that the components are marked in red, each with a number that identifies the component:
The project architecture is deployed inside the AWS DIGIT cloud, located in the EU-West-1 (Ireland) region, where Amazon services and other DIGIT products are hosted.
AWS DIGIT cloud
The main black frame, which encloses every component of the solution, is the AWS DIGIT cloud. This section describes the Amazon services that RAPEX Searcher requires. These services are:
- Amazon Cognito: A service that manages credentials and user login to the web application
- Amazon Elasticsearch: A service composed of two applications (Elasticsearch and Kibana) used to visualise the results of the process executed by RAPEX Searcher
RAPEX private cloud
As shown in the architecture diagrams, the RAPEX private cloud sits inside the AWS DIGIT cloud and includes every architecture component that does not depend on Amazon services. This private cloud has two subnets with different scopes: a public one and a private one.
Only components related to the RAPEX Searcher solution are placed in this private cloud.
Public subnet
The public subnet of the RAPEX cloud contains three instances:
- Bastion host
- Production proxy
- Preproduction proxy
Bastion host
This machine acts as the gateway to the virtual machines that make up the private subnet. It increases the security of the environment by restricting access to any instance to a single entry point, the bastion host.
Production and preproduction proxies
These are reverse proxies that retrieve resources from the production and preproduction servers on behalf of users. NGINX is used as the software-based, load-balancing reverse proxy.
Private subnet production and preproduction environment
The private subnets contain the core component that processes product information for the production and preproduction environments. Each one includes a PostgreSQL database and two servers:
- ETL Windows server, where the information in the RAPEX reports is extracted
- Ubuntu server, where the analytic processing is performed
ETL Windows server
This instance runs Windows Server as its operating system because of Talend software requirements. The server loads and transforms the information in the RAPEX reports. The transformation first loads the master XML file, which contains references to the URLs of the weekly product lists. The process then accesses those URLs for each product in four languages (English, French, Greek and Finnish) and incrementally loads the definitive list of products, parsing it into a model implemented in the PostgreSQL database.
This processing is done with Talend Open Studio.
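The load logic itself lives in a Talend job, so the following Python is only a minimal sketch of the flow described above: read the master XML, collect the weekly report URLs, and plan one fetch per language for reports not yet loaded. The element names, the sample URLs, the language codes and the function names are all assumptions made for illustration; the real RAPEX schema is not shown in this document.

```python
import xml.etree.ElementTree as ET

# The four languages named in the text; the ISO 639-1 codes are an assumption.
LANGUAGES = ["en", "fr", "el", "fi"]

# Hypothetical master file. Element names and URLs are illustrative only.
SAMPLE_MASTER_XML = """
<reports>
  <weeklyReport><URL>https://example.org/rapex/report-1.xml</URL></weeklyReport>
  <weeklyReport><URL>https://example.org/rapex/report-2.xml</URL></weeklyReport>
</reports>
"""


def weekly_report_urls(master_xml):
    """Return the weekly-report URLs referenced by the master XML file."""
    root = ET.fromstring(master_xml.strip())
    return [el.text.strip() for el in root.iter("URL") if el.text]


def fetch_plan(master_xml, already_loaded):
    """List the (url, language) pairs still to load.

    Reports whose URL is in `already_loaded` are skipped, which is one
    simple way to make the load incremental.
    """
    return [
        (url, lang)
        for url in weekly_report_urls(master_xml)
        if url not in already_loaded
        for lang in LANGUAGES
    ]


if __name__ == "__main__":
    loaded = {"https://example.org/rapex/report-1.xml"}
    for url, lang in fetch_plan(SAMPLE_MASTER_XML, loaded):
        print(lang, url)
```

Tracking already-loaded URLs is just one possible reading of "incremental"; the Talend job may instead compare product identifiers or report dates.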
Ubuntu server
This instance runs a Linux operating system, specifically Ubuntu. It hosts the core component of the solution and a Logstash installation.
The core component is an application developed in Python that takes the data previously processed by the ETL. It uses this information to search for RAPEX products with the Google Custom Search API and obtain the candidate URLs where they might be sold. These URLs are scraped and, after the appropriate analysis, the results are stored back in the PostgreSQL database.
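As an illustration of the search step, the sketch below builds a request URL for the Google Custom Search JSON API and extracts the candidate links from a response. The endpoint and the `key`, `cx`, `q` and `num` parameters belong to that API; the function names, the placeholder credentials and the sample response are assumptions, and the actual application may structure this step differently.

```python
from urllib.parse import urlencode

# Public endpoint of the Google Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(product_name, api_key, engine_id, num=10):
    """Build the request URL for one product search.

    `api_key` and `engine_id` are placeholders for the project's own
    credentials; `num` is the number of results requested (at most 10).
    """
    params = {"key": api_key, "cx": engine_id, "q": product_name, "num": num}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def candidate_urls(response):
    """Extract result links from a decoded JSON response.

    A response with no results carries no "items" key, so an empty list
    is returned in that case.
    """
    return [item["link"] for item in response.get("items", [])]


if __name__ == "__main__":
    print(build_search_url("toy car", "API_KEY", "ENGINE_ID", num=5))
    # Hypothetical response fragment, reduced to the field used here.
    sample = {"items": [{"link": "https://shop.example.org/toy-car"}]}
    print(candidate_urls(sample))
```

Each returned link would then be passed to the scraping and text-mining steps before the results are written to PostgreSQL.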
Logstash sends this information to a third component (the Amazon Elasticsearch service in this case), where it is indexed for use by the Kibana visualisation software.
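In the deployed solution Logstash performs this transfer, so no custom code is needed. Purely to show what indexing a result involves, the sketch below formats database rows as an Elasticsearch bulk API payload: one action line followed by one document line per row. The index name, the row fields and the function name are hypothetical.

```python
import json


def to_bulk_payload(rows, index_name):
    """Format rows as newline-delimited JSON for the Elasticsearch bulk API.

    Each row produces an "index" action line followed by the document
    itself. The bulk API requires the payload to end with a newline.
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": str(row["id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Hypothetical analysed result, as it might be read from PostgreSQL.
    rows = [{"id": 1, "product": "toy car", "url": "https://shop.example.org/toy-car"}]
    print(to_bulk_payload(rows, "rapex-products"), end="")
```

Using the database identifier as the document `_id` means a row that is sent again overwrites its earlier copy instead of creating a duplicate.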