HADES (Data Mining Analysis System of Petrol Stations in Spain)

Anonymous (not verified)
Published on: 27/05/2010 Document Archived

HADES is a dashboard built on a system of indicators that supports decision-making and provides a complete analysis of fuel prices. The scorecard sets out the main indicators derived from the information provided by petrol stations and oil traders, presents them in a clear and useful way for system management, and reports the evolution of the basic parameters of the hydrocarbons sector in Spain.

HADES supports analysis of the data sent by service stations for the two most important products: gasoline and diesel. The system consists of two parts: a dashboard with price information stored by operator and logged by province and day, and a set of reports obtained from a data mining analysis based on clustering techniques.

Policy Context

HADES is a dashboard that facilitates the analysis of petrol station prices. Based on clustering techniques, the system presents the data in groups of similar objects. Each group, called a cluster, consists of a number of petrol stations whose prices are similar to one another and different from those of the stations in other groups.

Information about petrol station prices is considered relevant to citizens under Article 20 of Order ITC 2308/2008, which regulates the transfer of information from petroleum product suppliers to MITYC, the Spanish Ministry of Industry, Tourism and Commerce. Based on the information gathered, the CNE and the Commission on Competition Defence conduct studies and initiate administrative procedures concerning anti-competitive activities in the energy sector.

Description of target users and groups

The system is aimed at all users who need an easy overview of the price data for the different stations; it sets out the main indicators and presents them in a clear and useful way. Based on the information gathered, the CNE (the Spanish National Energy Regulation Authority) and the Commission on Competition Defence initiate administrative procedures concerning competition in the energy sector.

In our case, the study has focused on clustering models to identify the possible price groups in each province. This makes it possible to evaluate price competitiveness between different operators. Every petrol station belongs to exactly one of the K clusters, which are disjoint. The K-means algorithm (MacQueen 1967), the most widely used partitional clustering method, has been used for this purpose.
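The partitioning described above can be illustrated with a minimal sketch. It assumes each station is represented by a pair of prices (gasoline, diesel) in EUR per litre and uses the common batch form of K-means (assign every point to its nearest centroid, then recompute centroids). The station names, the prices and the choice of K = 2 are invented for illustration, and this is not necessarily how KNIME runs the algorithm internally.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iterations=100, seed=0):
    """Partition points into k disjoint clusters (batch K-means iteration)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen input points
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no station changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Hypothetical stations of one province: (gasoline, diesel) in EUR/litre.
station_prices = {
    "station_A": (1.152, 1.051),
    "station_B": (1.149, 1.049),
    "station_C": (1.155, 1.055),
    "station_D": (1.199, 1.095),
    "station_E": (1.205, 1.099),
    "station_F": (1.201, 1.097),
}

centroids, assignment = kmeans(list(station_prices.values()), k=2)
clusters = dict(zip(station_prices, assignment))
for station, cluster in clusters.items():
    print(station, "-> cluster", cluster)
```

Each station receives exactly one cluster label, which is what makes the clusters disjoint. The numeric labels themselves are arbitrary; only the grouping carries meaning.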

Description of the way to implement the initiative

Data mining is the process of extracting useful, understandable and previously unknown information from large quantities of stored data. It takes data as input and produces models as output, applying techniques and algorithms that extract patterns from the data. These models, built on the information extracted from the data, support strategic decisions. In HADES we use a clustering algorithm (the k-means algorithm introduced by MacQueen, 1967) to explain similar prices among petrol stations.

The project followed CRISP-DM (Cross Industry Standard Process for Data Mining), a standard methodology for carrying out data mining projects that reduces the time invested. Its process model describes the life cycle of a project: the phases, their tasks and the relationships between those tasks. The life cycle of a data mining process consists of six phases, and the sequence followed depends on the outcome of the phase just carried out.

  • Business understanding: the initial phase focuses on understanding the objectives and project requirements provided by the final customer, and on designing a preliminary plan that defines the problem and the objectives to be met. In this phase it was clarified that managers provide the most relevant information to be included in the control panel;
  • Data understanding: this phase begins with an initial collection of data, followed by activities aimed at understanding the data and evaluating their quality, with the goal of proposing initial ideas for discovering hidden trends. Reports based on tables and graphics can give an immediate picture of the data in the system;
  • Data preparation: in this phase a set of processing steps was carried out to obtain the final data set that feeds the algorithms used to generate models. DTS stored procedures were prepared to load the new tables of the multidimensional model;
  • Modeling: this phase applies various algorithms to the data, calibrating their parameters to optimal values. It is normal to review and change what was set out in the initial plan in view of the new information that emerges during this phase. Using KNIME (an open source tool), it was possible to exploit the information contained in the database with multidimensional clustering techniques;
  • Evaluation: by this stage there is usually at least one model that is valid in terms of data analysis. Each model is evaluated in depth, reviewing the steps taken to verify that the objectives have been achieved and that nothing has been omitted, in order to reach the best possible decision;
  • Implementation: the generated models are applied in a standard production environment, so that results are organized and presented in a way that is useful for the customer. This is often considered the last phase of the project, although the data obtained in this phase may provide new feedback that improves the evaluation.

Data mining fits into a much larger process known as KDD (Knowledge Discovery in Databases).

Technology solution

The data mining was implemented with the open source tool KNIME. KNIME is a modular platform, built on Java and the Eclipse framework, that performs data mining tasks using the different algorithms it implements.

The HADES model, together with the cluster data generated by KNIME, is exported as XML and loaded into the database, which provides separate tables for each province holding the price data per station for each execution of the data mining project.

The information is organized into multidimensional cubes, or hypercubes, supported by a SQL Server 2008 database. DTS stored procedures load the multidimensional database in an ETL (extraction, transformation and load) process. KNIME models use this information to produce the clustering models shown in reports to the users of the system. The dashboard was designed using PerformancePoint Server 2007. Users access the predefined reports, based on Reporting Services 2005, via their browser.

Technology choice: Standards-based technology, Open source software

Main results, benefits and impacts

Thanks to the dashboard, it is easy to consult the evolution of prices at both national and local level. Information can be consulted at any granularity, from a particular day of the year to monthly or annual averages.

Based on the information it contains, it is possible to prepare studies and reports on competitiveness in the supply of petroleum products. Users can navigate through the different dimensions of the model: geographic (province, municipality, town), product (gasoline, diesel), time (year, month, day) and operator.
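Navigating along these dimensions amounts to grouping the same price facts at different levels of detail. The following sketch shows the idea in plain Python rather than in the SQL Server cubes the system actually uses; the `PriceFact` record, the operator names and all prices are assumptions made up for the example.

```python
from collections import defaultdict, namedtuple
from datetime import date

# Hypothetical fact record: one reported price with its dimension values.
PriceFact = namedtuple("PriceFact", "province product day operator price")

facts = [
    PriceFact("Madrid", "gasoline", date(2010, 3, 1), "OperatorX", 1.150),
    PriceFact("Madrid", "gasoline", date(2010, 3, 2), "OperatorY", 1.160),
    PriceFact("Madrid", "diesel", date(2010, 3, 1), "OperatorX", 1.050),
    PriceFact("Sevilla", "gasoline", date(2010, 3, 1), "OperatorX", 1.140),
    PriceFact("Sevilla", "gasoline", date(2010, 4, 1), "OperatorY", 1.180),
]

def roll_up(rows, group_key):
    """Average price per group; group_key maps a fact to its group."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of prices, count]
    for row in rows:
        bucket = totals[group_key(row)]
        bucket[0] += row.price
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}

# Geographic + product + time (month level).
by_province_month = roll_up(
    facts, lambda f: (f.province, f.product, f.day.year, f.day.month)
)
# Operator dimension only.
by_operator = roll_up(facts, lambda f: f.operator)
# National view: product + time (year level).
national_yearly = roll_up(facts, lambda f: (f.product, f.day.year))

for group, average in sorted(by_province_month.items()):
    print(group, round(average, 4))
```

Changing the grouping function is the only thing needed to move between a daily, monthly or annual view, or between a provincial and a national one.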

The system aims to speed up queries over large amounts of data by providing quick references, taking into account the large volume of data available thanks to the information submitted by operators and owners of individual petrol stations. There are currently over 16 million records in the operational system that collects price information from the different petrol stations. Querying this database to retrieve large amounts of information received from different sources (service station operators) slowed down the production of reports. To work around this problem, a datamart was created with consolidated petrol station price information, which makes searching possible and speeds up the exploitation of the system's information.
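The consolidation idea can be sketched as follows: the detailed operational records are aggregated once into a smaller summary table, and reports then query that table instead of the raw data. The example uses Python's built-in `sqlite3` module as a stand-in for SQL Server; the table names (`price_reports`, `datamart_daily`), the columns and the sample rows are assumptions for illustration, not the actual HADES schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical operational table: one row per price report from a station.
conn.execute(
    """CREATE TABLE price_reports (
           station_id TEXT, province TEXT, product TEXT, day TEXT, price REAL
       )"""
)
conn.executemany(
    "INSERT INTO price_reports VALUES (?, ?, ?, ?, ?)",
    [
        ("S1", "Madrid", "gasoline", "2010-03-01", 1.150),
        ("S2", "Madrid", "gasoline", "2010-03-01", 1.170),
        ("S1", "Madrid", "diesel", "2010-03-01", 1.050),
        ("S3", "Sevilla", "gasoline", "2010-03-01", 1.140),
        ("S3", "Sevilla", "gasoline", "2010-03-02", 1.145),
    ],
)

# Consolidation step: aggregate once into a compact summary table.
conn.execute(
    """CREATE TABLE datamart_daily AS
       SELECT province, product, day,
              AVG(price) AS avg_price,
              MIN(price) AS min_price,
              MAX(price) AS max_price,
              COUNT(*)   AS n_reports
       FROM price_reports
       GROUP BY province, product, day"""
)

# Reports read the consolidated table rather than the raw records.
madrid = conn.execute(
    """SELECT product, day, avg_price, n_reports
       FROM datamart_daily
       WHERE province = ?
       ORDER BY product, day""",
    ("Madrid",),
).fetchall()

for row in madrid:
    print(row)
```

The trade-off in this kind of design is that the summary table must be refreshed when new price reports arrive, in exchange for much cheaper report queries.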

Return on investment

Return on investment: Not applicable / Not available

Track record of sharing

16 million records are available in the system from more than 8,000 petrol stations; each petrol station changes its prices 3 times a week.

Lessons learnt

The use of open source software has proved to be a good choice for this project. The KNIME software allows data mining techniques to be applied to the data in our system. Descriptive models identify patterns that explain or summarise the data; clustering techniques allow homogeneous cases to be grouped. In our case, the study has focused on clustering models to examine the possible groups of petrol prices in each province.

By studying price competition between different operators using data mining techniques, we can offer citizens useful information about petrol station prices and increase transparency in this sector.
 
HADES was selected as a success case at TECNIMAP 2010:
http://www.tecnimap.es/sites/default/files/webform/TECNIMAP%202010%20HADES.pdf

Scope: National