ETL

The ETL flow has two parts:

  • Analysis: This ETL fully processes the raw feedback received incrementally in AWS DocumentDB into final enriched outputs with key NLP features (sentiment analysis, etc.). The result is stored back in AWS DocumentDB.
  • Reporting: This ETL performs a full load of the newly received and enriched feedback into AWS Elasticsearch, where it is queried by the visualisation tool, Kibana.

Analysis ETL flow

AWS Step Functions, Lambda and Batch orchestrate and scale the processing, while AWS Comprehend is responsible for natural language processing. In addition, the European Commission's custom translation tool, eTranslation, is used for natural language translation. The following diagram shows how these tools fit together in the architecture.

dorisArchitecture-part1

The AWS Step Functions analysis state machine orchestrates the ETL flow, starting the different tools in parallel or sequentially. This serverless orchestrator also lets you track the input, output and status of each service in use, and passes the output of one service on to the next. AWS Lambda is a serverless computing tool that performs short (under 15 minutes), trigger-based computations. Serverless means that administration is fully automated: AWS Lambda manages the computing infrastructure that executes your code. The only configuration required is the code to run and the trigger that starts it. In this case, the function is triggered by the step function and by the translation endpoint of the API Gateway.
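
As an illustration of how a step function passes the output of one service to the next, here is a minimal sketch of a state machine definition in Amazon States Language, built as a Python dictionary. The state names, the job name and the placeholder ARNs are hypothetical and are not taken from the actual Doris+ definition; only the general shape (a Lambda task followed by a Batch task) reflects the description above.

```python
import json


def build_analysis_state_machine(lambda_arn, job_queue_arn, job_definition_arn):
    """Build an illustrative two-state definition: a Lambda task, then a Batch job.

    All state names and the job name are hypothetical.
    """
    return {
        "Comment": "Illustrative analysis flow, not the actual Doris+ definition",
        "StartAt": "PrepareBatches",
        "States": {
            "PrepareBatches": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                # The whole state input is forwarded to the Lambda function.
                "Parameters": {"FunctionName": lambda_arn, "Payload.$": "$"},
                # The Lambda result is attached to the state data for the next state.
                "ResultPath": "$.prepared",
                "Next": "ProcessFeedback",
            },
            "ProcessFeedback": {
                "Type": "Task",
                # ".sync" makes the step function wait until the Batch job finishes.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "feedback-processing",
                    "JobQueue": job_queue_arn,
                    "JobDefinition": job_definition_arn,
                },
                "End": True,
            },
        },
    }


def check_transitions(definition):
    """Return True if StartAt and every Next refer to a state that exists."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    targets += [state["Next"] for state in states.values() if "Next" in state]
    return all(target in states for target in targets)


definition = build_analysis_state_machine(
    "arn:aws:lambda:REGION:ACCOUNT:function:prepare-batches",  # placeholder
    "arn:aws:batch:REGION:ACCOUNT:job-queue/feedback",  # placeholder
    "arn:aws:batch:REGION:ACCOUNT:job-definition/feedback:1",  # placeholder
)
print(json.dumps(definition, indent=2))
```

The `check_transitions` helper is only a local sanity check on the sketch; it does not replace validation by AWS itself.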

The processing takes place in AWS Batch. AWS Batch dynamically provisions and scales Amazon EC2 instances to run multiple containers of the feedback processing (called jobs). Each job runs from a specific ‘job definition’, which specifies the job's environment requirements as well as the command to run upon execution. Each job enters a queue, and queued jobs are executed in parallel within the limit of the available computing power. A batch job can execute several types of container images (such as Python, Ruby, and so on) stored beforehand in AWS ECR. For Doris+, feedback processing is divided into batches of 1000 ids for open questions and 10 consultations for file uploads.
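
The batching rule above can be sketched in a few lines of Python. The helper name `chunked` and the sample identifiers are illustrative assumptions; how Doris+ actually splits its work is not shown in this document.

```python
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical identifiers, for illustration only.
feedback_ids = [f"feedback-{n}" for n in range(2500)]
consultation_ids = [f"consultation-{n}" for n in range(25)]

open_question_batches = list(chunked(feedback_ids, 1000))  # batches of 1000 ids
file_upload_batches = list(chunked(consultation_ids, 10))  # batches of 10 consultations

print([len(batch) for batch in open_question_batches])
print([len(batch) for batch in file_upload_batches])
```

Each resulting batch would then correspond to one job submitted to the queue.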

AWS Comprehend is Amazon's natural language processing service. Although the tool offers a wide variety of features, it fulfils five functions in Doris+: language detection, topic extraction, entity extraction, key phrase extraction and sentiment analysis. Doris+ batch jobs call the synchronous API for each feedback and for each function required.
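
A minimal sketch of the per-feedback synchronous calls might look as follows. It uses four operations of the Comprehend client as exposed by boto3 (`detect_dominant_language`, `detect_sentiment`, `detect_entities`, `detect_key_phrases`); topic extraction is left out of the sketch. The function name, the shape of the returned dictionary and the `StubComprehend` class are assumptions made for illustration. The stub only exists so the example runs without AWS credentials; in practice a client created with `boto3.client("comprehend")` would be passed instead.

```python
def analyse_feedback(comprehend, text):
    """Run the synchronous Comprehend calls for a single feedback text."""
    languages = comprehend.detect_dominant_language(Text=text)["Languages"]
    # Keep the language with the highest confidence score.
    language = max(languages, key=lambda item: item["Score"])["LanguageCode"]

    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode=language)
    entities = comprehend.detect_entities(Text=text, LanguageCode=language)
    key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode=language)

    return {
        "language": language,
        "sentiment": sentiment["Sentiment"],
        "entities": [entity["Text"] for entity in entities["Entities"]],
        "key_phrases": [phrase["Text"] for phrase in key_phrases["KeyPhrases"]],
    }


class StubComprehend:
    """Local stand-in with fixed responses shaped like the real API's (illustration only)."""

    def detect_dominant_language(self, Text):
        return {"Languages": [{"LanguageCode": "en", "Score": 0.99}]}

    def detect_sentiment(self, Text, LanguageCode):
        return {"Sentiment": "POSITIVE"}

    def detect_entities(self, Text, LanguageCode):
        return {"Entities": [{"Text": "Brussels", "Type": "LOCATION"}]}

    def detect_key_phrases(self, Text, LanguageCode):
        return {"KeyPhrases": [{"Text": "the new portal"}]}


result = analyse_feedback(StubComprehend(), "The new portal in Brussels works well.")
print(result)
```

Passing the client in as a parameter keeps the function easy to test and lets a batch job reuse one client for all the feedback in its batch.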

To meet language coverage requirements and integrate with the EC landscape, Doris+ uses the EC's eTranslation tool. eTranslation is an asynchronous translation API that translates documents into all official European languages. In an asynchronous call, the service acknowledges receipt when the translation request is sent, and later sends the finished translation to an HTTP endpoint. To receive and process the translation, a combination of AWS API Gateway and AWS Lambda is used to store it back in the Doris+ AWS DocumentDB.
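
The receiving side of this asynchronous pattern can be sketched as a Lambda handler behind an API Gateway proxy integration. The payload format is an assumption: the field names `request_id` and `translated_text` and the JSON body are hypothetical, and the real eTranslation callback may differ. The `save_translation` callable stands in for the write to DocumentDB, which is not shown.

```python
import json


def make_callback_handler(save_translation):
    """Create a Lambda handler that stores translations delivered by callback.

    `save_translation(request_id, text)` is a hypothetical persistence function.
    """

    def handler(event, context):
        try:
            payload = json.loads(event.get("body") or "{}")
            request_id = payload["request_id"]  # hypothetical field name
            translated_text = payload["translated_text"]  # hypothetical field name
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing fields, or a body that is not a JSON object.
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "invalid callback payload"}),
            }

        save_translation(request_id, translated_text)
        return {"statusCode": 200, "body": json.dumps({"stored": request_id})}

    return handler


# Local demonstration with an in-memory dictionary instead of a database.
store = {}
handler = make_callback_handler(store.__setitem__)
event = {"body": json.dumps({"request_id": "req-1", "translated_text": "Hello"})}
response = handler(event, None)
print(response["statusCode"], store)
```

Returning a 400 response for an unusable payload keeps bad callbacks from being stored, while a 200 response confirms receipt to the caller.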

Reporting ETL flow

dorisArchitecture-part2

AWS Step Functions also orchestrates the last part of the ETL flow, the reporting. It starts the different tools in parallel or sequentially, with the goal of loading the processed feedback into AWS Elasticsearch. The orchestrator combines step functions with AWS Lambda and retrieves the batches of 1000 identifiers loaded into AWS S3 in the previous step. The step function then triggers AWS Batch, which retrieves the processed feedback from DocumentDB and loads it into AWS Elasticsearch, a part of AWS ELK. ELK is an acronym for three open-source tools: Elasticsearch, Logstash, and Kibana.
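
One common way to load a batch of documents into Elasticsearch is its bulk API, which takes newline-delimited JSON: an action line followed by the document itself. Whether Doris+ uses the bulk API is not stated here, so the following is only a sketch under that assumption. The index name `feedback` and the sample fields are hypothetical.

```python
import json


def build_bulk_body(index_name, documents, id_field="_id"):
    """Build a newline-delimited JSON body for the Elasticsearch bulk API.

    The DocumentDB identifier is moved into the action line, because `_id`
    is a metadata field and cannot be part of the indexed document source.
    """
    lines = []
    for document in documents:
        source = {key: value for key, value in document.items() if key != id_field}
        action = {"index": {"_index": index_name, "_id": str(document[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical enriched feedback documents, for illustration only.
processed = [
    {"_id": 1, "text": "Great service", "sentiment": "POSITIVE"},
    {"_id": 2, "text": "Too slow", "sentiment": "NEGATIVE"},
]
body = build_bulk_body("feedback", processed)
print(body)
```

The resulting string would be sent to the cluster's `_bulk` endpoint with the `application/x-ndjson` content type.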

Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene. It has built-in Kibana capabilities and can store, analyse and query data with SQL while supporting schema-free JSON documents. Its main role here is to be the database that Kibana queries quickly for the data required by the visualisations.
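
To give an idea of the kind of request a visualisation relies on, here is a sketch of an Elasticsearch query body that counts feedback per sentiment. The field names `sentiment` and `language` are hypothetical and are assumed to be mapped as keyword fields; the actual Doris+ index mapping is not described in this document.

```python
import json


def sentiment_breakdown_query(language=None):
    """Build a search body that counts documents per sentiment value.

    If `language` is given, only documents in that language are counted.
    """
    if language is None:
        query = {"match_all": {}}
    else:
        query = {"term": {"language": language}}
    return {
        "size": 0,  # return only the aggregation, not the matching documents
        "query": query,
        "aggs": {"by_sentiment": {"terms": {"field": "sentiment"}}},
    }


search_body = sentiment_breakdown_query(language="en")
print(json.dumps(search_body, indent=2))
```

A body like this would be posted to the `_search` endpoint of the index; Kibana builds comparable aggregation requests on the user's behalf.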

Kibana is an open-source data visualisation tool used by DORIS. Kibana provides interactive charts, pre-built aggregations and filters, and geospatial support.