Data Analytics for Energy Efficiency in Industrial Areas

Industry is the sector with the highest energy consumption in the Basque Country: in 2015 it accounted for 40% of the total (5,034 ktoe), ahead of the transport sector with 38.2%. Within industry, electricity accounted for 36.7% of energy consumption that year and natural gas for 45.3%. Manufacturing companies are therefore major energy consumers. The progressive incorporation of Information and Communication Technologies (ICT) in industrial environments, under an Industry 4.0 scheme, is making it easier to learn how industrial operation processes affect energy consumption. Plant managers, energy efficiency experts, energy service companies, systems integration engineers, suppliers of industrial equipment and other actors in the industrial sector are challenged to make the most of the Industry 4.0 paradigm now being deployed by applying energy efficiency strategies, driven by regulation such as the Energy Efficiency Directive and by the deployment of smart electric and thermal meters.

In this regard, ICT-based solutions aimed at reducing energy consumption are increasingly being developed and deployed, with the consequent reduction of fixed and variable production costs, improved performance of equipment and processes, technological innovation, optimization and allocation of resources, and increased competitiveness. The objective is to integrate the full potential of data analytics into production processes in order to promote energy efficiency in industrial environments, and to acquire knowledge on how energy is used by consumers, whether industrial or household, so that energy demand can be estimated and planned better by identifying load patterns, load peaks, peak shiftable power, … The application of advanced data analytics techniques (real-time time series prediction on GPU, TensorFlow, Spark, Docker containers), advanced visualization techniques (Dataviz, D3.js) and other emerging tools enables a wide range of new value-added services for the entire energy demand and supply value chain. The resulting data insights will unlock savings opportunities and help businesses provide greater value.
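
As a rough illustration of what identifying load patterns and load peaks can mean in practice, the following minimal Python sketch averages meter readings per hour of day and flags the hours close to the maximum. The function names, the 90% threshold and the sample readings are assumptions made for this example only; they are not taken from any specific project or dataset.

```python
from statistics import mean

def daily_load_profile(readings):
    """Average load (kW) per hour of day from (hour_of_day, kW) pairs."""
    by_hour = {}
    for hour, kw in readings:
        by_hour.setdefault(hour, []).append(kw)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}

def peak_hours(profile, threshold_ratio=0.9):
    """Hours whose average load reaches threshold_ratio of the maximum.

    threshold_ratio is an illustrative assumption, not a standard value.
    """
    peak = max(profile.values())
    return [hour for hour, kw in profile.items() if kw >= threshold_ratio * peak]

# Hypothetical two-day sample: (hour of day, load in kW).
readings = [(8, 120.0), (12, 200.0), (18, 150.0),
            (8, 100.0), (12, 220.0), (18, 190.0)]

profile = daily_load_profile(readings)
peaks = peak_hours(profile)
print(profile, peaks)
```

A real deployment would work on full smart-meter time series and would likely separate weekdays, shifts and seasons before looking for peaks.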

Online Learning and Concept Drift

The increasing number of applications that generate data streams (such as mobile phones, sensor networks and, in general, all scenarios under the so-called Internet of Things paradigm) has created the need for new approaches capable of dealing with fast incoming information flows. In these practical situations it is often assumed that the process generating such data streams is stationary, i.e. the statistical properties of the underlying phenomena that produce the information to be processed do not vary over time. Unfortunately, in many real scenarios this assumption does not hold, since the data generation process is affected by nonstationary events (such as changes in users’ habits, seasonality, periodicity, sensor errors or similar factors). Under these circumstances the statistical distribution of the data may change (concept drift), so that models trained on these data sources become obsolete and no longer fit the new distribution of the data. Therefore, in the context of data mining in such nonstationary environments, the construction of learning models requires adaptive approaches that ease their adaptation to drifts in the distribution of the data, either from an active perspective (i.e. drift detection, which triggers a subsequent model adaptation) or from a passive one (namely, the blind adaptation of the model whenever new data arrive).
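
The active perspective can be sketched with a simple error-rate monitor. The Python example below follows the general idea of the Drift Detection Method (DDM): it tracks the running error rate of a model and signals a drift when that rate rises well above the best level seen so far. It is an illustrative sketch, not the detector used in this research line; the class name, the 30-sample warm-up, the threshold of 3 standard deviations and the synthetic error stream are all assumptions.

```python
import math

class ErrorRateDriftDetector:
    """DDM-style detector fed with one prediction outcome at a time."""

    def __init__(self, min_samples=30, drift_level=3.0):
        self.min_samples = min_samples   # warm-up before any decision
        self.drift_level = drift_level   # number of std deviations tolerated
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")        # error rate at the best point so far
        self.s_min = float("inf")        # its standard deviation

    def update(self, is_error):
        """Register one outcome; return True when a drift is signalled."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n < self.min_samples:
            return False
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()                 # the model would be adapted here
            return True
        return False

# Synthetic stream (assumption): 10% errors, then 50% errors from index 200.
stream = [i % 10 == 0 for i in range(200)] + [i % 2 == 0 for i in range(200)]
detector = ErrorRateDriftDetector()
detections = [i for i, err in enumerate(stream) if detector.update(err)]
print(detections)
```

In a passive scheme there would be no detector at all: the model would simply be updated with every new batch, whether or not the distribution has changed.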

Ensembles are one of the most useful approaches for dealing with nonstationary environments and have been successfully used to improve the accuracy of single classifiers in incremental learning. Recent studies have revealed that different levels of diversity in an ensemble of learning machines are required to ensure good generalization of the overall model, both in the presence and in the absence of concept drift. Diversity among the constituent learners of an ensemble has been empirically shown to be crucial when dealing with concept drift. Specifically, these studies provide evidence that diversity plays an important role before and after a concept drift, and that its importance also depends on the severity of the drift. On some occasions the amount of data is not enough to build diverse ensembles, and therefore good performance cannot be achieved. In those cases, the generation of synthetic samples can help to enrich a limited dataset and obtain better classification or prediction results. Generating synthetic samples could be an effective technique for constructing diverse hypotheses using additional artificially constructed training examples. Moreover, these synthetic samples could be combined with traditional diversity generation schemes, such as class-switching, bagging or boosting, to train diverse ensembles that reach a trade-off between diversity and performance through an optimization technique running in online mode.
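
One common way to obtain diversity in an online ensemble is online bagging, where each base learner sees every incoming example a Poisson-distributed number of times. The Python sketch below illustrates that scheme only; it is not the ensemble method developed in this research line. The nearest-centroid base learner, the class names, the rate parameter lam and the toy data are assumptions chosen to keep the example short.

```python
import math
import random
from collections import Counter

def poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

class CentroidLearner:
    """Incremental nearest-centroid classifier (hypothetical base learner)."""

    def __init__(self):
        self.sums, self.counts = {}, {}

    def learn(self, x, y, weight=1):
        if y not in self.sums:
            self.sums[y], self.counts[y] = [0.0] * len(x), 0
        self.sums[y] = [s + weight * v for s, v in zip(self.sums[y], x)]
        self.counts[y] += weight

    def predict(self, x):
        if not self.counts:
            return None
        def sq_dist(label):
            n = self.counts[label]
            return sum((s / n - v) ** 2 for s, v in zip(self.sums[label], x))
        return min(self.counts, key=sq_dist)

class OnlineBagging:
    """Each learner is updated k ~ Poisson(lam) times per example."""

    def __init__(self, n_learners=10, lam=1.0, seed=0):
        self.learners = [CentroidLearner() for _ in range(n_learners)]
        self.lam = lam
        self.rng = random.Random(seed)

    def learn(self, x, y):
        for learner in self.learners:
            k = poisson(self.lam, self.rng)
            if k > 0:                    # k == 0: this learner skips the example
                learner.learn(x, y, k)

    def predict(self, x):
        votes = Counter(learner.predict(x) for learner in self.learners)
        votes.pop(None, None)            # ignore learners with no data yet
        return votes.most_common(1)[0][0] if votes else None

# Toy stream (assumption): two well-separated classes arriving interleaved.
ensemble = OnlineBagging()
for i in range(50):
    ensemble.learn((i % 3, i % 2), "a")
    ensemble.learn((10 + i % 3, 10 + i % 2), "b")
print(ensemble.predict((1, 1)), ensemble.predict((11, 10)))
```

The rate lam controls how differently the learners are trained: with lam below 1 each learner sees fewer examples, so the members of the ensemble differ more from one another.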

Time Series Analysis

Time series, understood as sequences of data points sorted in time order, are present in many different fields such as telecommunications, finance and biomedicine, among others. In such areas time series are often assigned a category or label (e.g. whether a customer is likely to churn from a telecommunications company, based on the record of transactions), which is of interest for the underlying application (e.g. customer retention).

In order to predict the label associated with new time series, supervised learning aims at building classification models based on a record of past labeled time series. The most common time series classification method is the k-nearest neighbour (k-NN) scheme: when this model is queried for the label of a new item, the distance to each sample in the training set is computed, and the predicted label is the majority class among the labels of the k closest training examples. In parallel to the more traditional approach of building these models on features extracted from the time series, a very active research trend in the literature focuses on tailored distances between time series and their exploitation in learning models that rely on pairwise similarity measures. Hence, to compute the distance between two time series, not only feature-based similarity measures can be used, but also model-based and raw-data-based distances. Model- and feature-based approaches assume a priori knowledge of the properties of the sources that generated the time series.
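
The k-NN scheme described above can be sketched in a few lines of Python. The example takes the distance as a parameter, so that a tailored time series distance could be plugged in instead of the Euclidean one used here by default. The function names, the labels and the toy training set are assumptions made for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Lock-step distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3, distance=euclidean):
    """Majority label among the k training series closest to the query.

    train is a list of (series, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy labeled series (assumption).
train = [
    ([1, 1, 1, 1], "flat"), ([1, 2, 1, 1], "flat"), ([0, 1, 1, 1], "flat"),
    ([1, 2, 3, 4], "rising"), ([0, 2, 3, 5], "rising"), ([1, 3, 4, 4], "rising"),
]
print(knn_predict(train, [1, 2, 3, 5]))   # closest series are the rising ones
print(knn_predict(train, [1, 1, 2, 1]))   # closest series are the flat ones
```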

From a theoretical point of view, when novel distances are presented, the scientific community assumes that a good similarity measure should be consistent with human intuition and robust to time series pathologies such as time axis distortions, noise and/or outliers, and amplitude differences. Elastic similarity measures such as Dynamic Time Warping (DTW) and Edit Distance on Real sequence (EDR) have proven adequate to mitigate time shifts. Although the motivation for seeking new similarity measures is the need for robust functions free of limitations, both their consolidation and their acceptance depend on experimental results. This research line aims at developing novel univariate and multivariate time series similarity measures for distance-based classifiers and/or clustering algorithms, with applications to Energy and Industrial Engineering.
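
To make the idea of an elastic measure concrete, the following Python sketch implements the textbook dynamic programming formulation of DTW for univariate series, with the absolute difference as local cost and no warping window. It is a baseline illustration, not one of the novel measures this research line aims at; the function names and the sample series are assumptions.

```python
def dtw(a, b):
    """Dynamic Time Warping distance between two univariate series."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] repeats a match
                                     cost[i][j - 1],      # b[j-1] repeats a match
                                     cost[i - 1][j - 1])  # both series advance
    return cost[n][m]

def lockstep(a, b):
    """Point-by-point absolute difference, for comparison."""
    return sum(abs(x - y) for x, y in zip(a, b))

# The same bump, shifted by one time step (assumption).
a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
print(dtw(a, b), lockstep(a, b))
```

Because DTW may stretch the time axis, the shifted bump is matched at no cost, whereas the lock-step comparison penalizes every misaligned point.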

Bio-Inspired Computation for Optimization

Optimization problems currently receive much attention in the artificial intelligence field. Several kinds of optimization can be found, such as continuous, linear, combinatorial or numerical optimization. Solving optimization problems usually requires great intellectual and computational effort. Additionally, many optimization problems are easily applicable to, or directly drawn from, real-world situations. For these reasons, many different methods have been proposed to date to tackle these problems.

Among the different approaches that can be found in the literature to address optimization problems efficiently, some of the most successful are heuristics, metaheuristics and hyperheuristics. On the one hand, a heuristic is an optimization method whose objective is to solve a problem using specific knowledge of that problem. The main philosophy of a heuristic is thus to explore the solution space, intensifying the search in the areas considered most promising. The reasoning behind this behavior is to achieve good optimization results quickly. On the other hand, a metaheuristic is a technique whose objective is to solve a problem using only general information and knowledge common to a wide variety of optimization problems. In this sense, metaheuristics explore the space of solutions with the aim of achieving good solutions regardless of the problem they are tackling. Finally, hyperheuristics are another kind of technique that is gaining great momentum in the current scientific community. A hyperheuristic works with a range of simple heuristic functions and manages the choice of which of these simple operators should be applied at each moment, depending on conditions such as the stage of the search process, the performance of each heuristic function, and the characteristics of the region of the solution space currently under analysis.
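
The selection mechanism of a hyperheuristic can be sketched as follows. In this minimal Python example, a controller keeps a score per low-level heuristic based on the improvement it recently produced, and mostly applies the best-scoring one while occasionally trying another. The toy objective, the three low-level heuristics, the epsilon-greedy rule and the 0.9/0.1 score update are all assumptions made for illustration; real hyperheuristics use richer selection and acceptance criteria.

```python
import random

def sphere(x):
    """Toy objective (assumption): sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)

def small_step(x, rng):
    y = list(x)
    y[rng.randrange(len(y))] += rng.choice((-1, 1))
    return y

def random_reset(x, rng):
    y = list(x)
    y[rng.randrange(len(y))] = rng.randint(-10, 10)
    return y

def shrink(x, rng):
    y = list(x)
    i = rng.randrange(len(y))
    y[i] = int(y[i] / 2)                 # halve one coordinate toward zero
    return y

def hyper_heuristic(cost, start, heuristics, iterations=500, epsilon=0.2, seed=0):
    """Pick a low-level heuristic per step according to its recent gains."""
    rng = random.Random(seed)
    scores = {h: 0.0 for h in heuristics}
    current, current_cost = list(start), cost(start)
    for _ in range(iterations):
        if rng.random() < epsilon:       # explore: any heuristic
            chosen = rng.choice(heuristics)
        else:                            # exploit: best recent performer
            chosen = max(heuristics, key=scores.get)
        candidate = chosen(current, rng)
        gain = current_cost - cost(candidate)
        scores[chosen] = 0.9 * scores[chosen] + 0.1 * gain
        if gain > 0:                     # accept improving moves only
            current, current_cost = candidate, cost(candidate)
    return current, current_cost, scores

start = [9, -7, 5, -8]
best, best_cost, scores = hyper_heuristic(
    sphere, start, [small_step, random_reset, shrink])
print(best, best_cost)
```

Note that the controller never looks at the structure of the problem itself: it only observes how well each operator has been doing, which is what separates it from the low-level heuristics it manages.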

The team that makes up the JRL has wide experience in this field, specifically in the development of bio-inspired techniques for solving complex optimization problems. These methods are very popular in the current scientific community and are the focus of many research contributions in the literature every year. The momentum gained by this broad family of methods stems from the outstanding performance they have shown in hundreds of research fields and problem instances. Many different sources of inspiration can be found for these solvers, such as the behavioral patterns of bats, fireflies, corals, bees or cuckoos, as well as the mechanisms behind genetic inheritance, musical harmony composition or bacterial foraging.
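
As an example of a solver inspired by genetic inheritance, the sketch below shows a minimal genetic algorithm in Python with tournament selection, one-point crossover, bit-flip mutation and elitism. It is a generic textbook illustration, not a method developed by the JRL; the OneMax objective (maximize the number of ones in a bit string) and all parameter values are assumptions.

```python
import random

def one_max(bits):
    """Toy fitness (assumption): number of ones in the bit string."""
    return sum(bits)

def tournament(population, fitness, rng, size=3):
    """Return the fittest of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []                          # best fitness per generation
    for _ in range(generations):
        elite = max(population, key=fitness)
        history.append(fitness(elite))
        offspring = [elite[:]]            # elitism: keep the best unchanged
        while len(offspring) < pop_size:
            mother = tournament(population, fitness, rng)
            father = tournament(population, fitness, rng)
            cut = rng.randrange(1, n_bits)            # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]                  # bit-flip mutation
            offspring.append(child)
        population = offspring
    best = max(population, key=fitness)
    return best, history + [fitness(best)]

best, history = genetic_algorithm(one_max)
print(one_max(best), history[0], history[-1])
```

Thanks to elitism the best fitness never decreases from one generation to the next; other bio-inspired solvers replace the crossover and mutation operators with moves modeled on the behavior that inspires them.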

Optimal Design of Big Data Infrastructures

Big Data is another hot topic in the current scientific community, generating many publications annually and arousing interest in the business market. Big Data can be defined as collections of data whose size or complexity makes the use of conventional data processing techniques infeasible. In line with this, Big Data is characterized by five crucial aspects, usually known as the five V’s: Volume, Velocity, Variety, Veracity and Value.

The growing interest in this field is a consequence of the continuous advance of information technologies, both for collecting data and for storing it. This rapid evolution makes the management of these large amounts of data more demanding.

Activities related to Big Data include, among others, data acquisition, feature selection, size reduction, data extraction, analysis and evaluation, as well as possible predictions. Many techniques in the literature have been used in the different fields that make up Big Data. The team that makes up the JRL has proven experience in this field, having applied different techniques and approaches to the different aspects of Big Data.