Industrial and Information Engineering
Browsing Industrial and Information Engineering by Author "Bartoli, Alberto"
Now showing 1 - 6 of 6
- Publication: Genetic Programming Techniques in Engineering Applications (Università degli studi di Trieste, 2014-04-01)
De Lorenzo, Andrea; Bartoli, Alberto; Medvet, Eric

Machine learning is a suite of techniques that allow the development of algorithms able to perform tasks by generalizing from examples. Machine learning systems may thus automatically synthesize programs from data. This approach is often feasible and cost-effective where manual programming or manual algorithm design is not, and in the last decade techniques based on machine learning have spread across a broad range of application domains. In this thesis, we will present several novel applications of a specific machine learning technique, Genetic Programming, to a wide set of engineering problems grounded in the real world. The problems treated in this work range from the automatic synthesis of regular expressions, to electricity price forecasting, to the synthesis of a model of the tracheal pressure in mechanical ventilation. The results demonstrate that Genetic Programming is indeed a suitable tool for solving complex problems of practical interest; furthermore, several results constitute a significant improvement over the existing state of the art.

The main contribution of this thesis is the design and implementation of a framework, based on Genetic Programming, for the automatic inference of regular expressions from examples. First, we will show the ability of the framework to generate regular expressions that solve text-extraction tasks specified by examples. We will experimentally assess our proposal, comparing our results with previous proposals on a collection of real-world datasets; the results demonstrate a clear superiority of our approach. We have implemented the approach in a web application that has gained considerable interest and has reached peaks of more than 10,000 daily accesses. Then, we will apply the framework to a popular "regex golf" challenge, a competition in which human players are required to generate the shortest regular expression solving a given set of problems. Our results rank in the top 10 list of human players worldwide and outperform those generated by the only existing algorithm specialized for this purpose. Next, we will perform an extensive experimental evaluation comparing our proposal with the state of the art in a closely related and long-established research field: the generation of Deterministic Finite Automata (DFA) from a labelled set of examples. Our results demonstrate that the existing state of the art in DFA learning is not suitable for text-extraction tasks. We will also show a variant of our framework designed for solving text-processing tasks of the search-and-replace form. A common way to automate search-and-replace is to describe the region to be modified and the desired changes through a regular expression and a replacement expression; we will propose a solution that automatically produces both expressions based only on examples provided by the user, and we will experimentally assess it on real-world search-and-replace tasks. The results indicate that our proposal is indeed feasible. Finally, we will study the applicability of our framework to the generation of schemas from samples of eXtensible Markup Language (XML) documents. XML documents are widely used in machine-to-machine interactions, and such interactions often require that constraints be applied to the contents of the documents; these constraints are usually specified in a separate document which is often unavailable or missing.
In order to generate a missing schema, we will apply our framework to this problem and evaluate it experimentally.

In the final part of this thesis we will describe two significant applications from different domains. We will describe a forecasting system that produces estimates of the next-day electricity price. The system is based on the combination of a predictor based on Genetic Programming and a classifier based on Neural Networks; a key feature of this system is its ability to handle outliers, i.e., values rarely seen during the learning phase. We will compare our results with a challenging baseline representative of the state of the art and show that our proposal exhibits a smaller prediction error than the baseline. Finally, we will move to a biomedical problem: estimating the tracheal pressure in a patient treated with high-frequency percussive ventilation, a new and promising non-conventional mechanical ventilatory strategy. In order to avoid barotrauma and volutrauma in patients, the pressure of the insufflated air must be monitored carefully; since measuring the tracheal pressure directly is difficult, a model for accurately estimating it is required. We will propose the synthesis of such a model by means of Genetic Programming and compare our results with the state of the art.
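As a rough illustration of the central idea, the sketch below evolves a regular expression from labelled extraction examples. It is a deliberately minimal toy, not the thesis system: the building-block grammar, the length penalty, and the mutation-only truncation search are all illustrative assumptions.

```python
import random
import re

# Toy task: learn a regex that extracts identifiers like "A-1932".
EXAMPLES = [
    ("order id A-1932 shipped", "A-1932"),
    ("ref B-07 received", "B-07"),
    ("ticket C-4455 closed", "C-4455"),
]

# Building blocks for candidate regexes (an illustrative grammar).
PARTS = ["[A-Z]", "[A-Z]+", "[0-9]+", "-", "\\w", "\\d"]

def fitness(parts):
    """Count exact extractions; penalize length to favor concise regexes."""
    pattern = "".join(parts)
    try:
        compiled = re.compile(pattern)
    except re.error:
        return float("-inf")  # syntactically invalid individuals die out
    hits = sum(1 for text, target in EXAMPLES
               if (m := compiled.search(text)) and m.group() == target)
    return hits - 0.01 * len(pattern)

def mutate(parts):
    """Replace, insert, or drop one building block."""
    parts = parts[:]
    i = random.randrange(len(parts))
    r = random.random()
    if r < 0.5:
        parts[i] = random.choice(PARTS)
    elif r < 0.8 or len(parts) < 2:
        parts.insert(i, random.choice(PARTS))
    else:
        del parts[i]
    return parts

def evolve(pop_size=100, generations=80):
    population = [[random.choice(PARTS) for _ in range(3)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 5]          # truncation selection
        population = elite + [mutate(random.choice(elite))
                              for _ in range(pop_size - len(elite))]
    return "".join(max(population, key=fitness))

random.seed(0)
print(evolve())  # e.g. "[A-Z]-[0-9]+" when the search converges
```

The actual framework operates on syntax trees with crossover and much richer operators; the toy above only conveys the examples-in, regex-out loop.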
- Publication: Machine learning in engineering applications (Università degli studi di Trieste, 2011-03-31)
Davanzo, Giorgio; Bartoli, Alberto; Medvet, Eric

Nowadays, the available computing and information-storage resources have grown to a level that allows huge amounts of data to be collected and preserved easily. However, many organizations still lack the knowledge or the tools to process these data into useful information. In this thesis we will investigate several problems that can be solved effectively by means of machine learning techniques, ranging from web defacement detection to electricity price forecasting, and from Support Vector Machines to Genetic Programming.

We will investigate a framework for web defacement detection meant to allow any organization to join the service by simply providing the URLs of the resources to be monitored along with the contact point of an administrator. Our approach is based on anomaly detection and allows monitoring the integrity of many remote web resources automatically while remaining fully decoupled from them, in particular without requiring any prior knowledge about those resources; the system is thus unsupervised. Furthermore, we will test on the web defacement detection problem several machine learning algorithms normally used for anomaly detection.

We will present a scrolling system to be used on mobile devices to provide a more natural and effective user experience on small screens. We detect device motion by analyzing the video stream generated by the camera and then transform that motion into scrolling of the content rendered on the screen. This way, the user experiences the device screen as a small movable window on a larger virtual view, without requiring any dedicated motion-detection hardware.

As regards information retrieval, we will present an approach to information extraction from multi-page printed documents; the approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the sought information. A key step in the understanding of printed documents is their classification based on the nature of the information they contain and on their layout; we will consider both a static scenario, in which document classes are known a priori and no new class may appear, and a dynamic one, in which new classes may appear at any time.

Finally, we will move to the edge of machine learning: Genetic Programming. The electric power market increasingly relies on competitive mechanisms taking the form of day-ahead auctions, in which buyers and sellers submit their bids in terms of prices and quantities for each hour of the next day. We propose a novel forecasting method based on Genetic Programming; a key feature of our proposal is the handling of outliers, i.e., regions of the input space rarely seen during learning.
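The camera-driven scrolling idea lends itself to a short sketch: estimate the dominant frame-to-frame motion with dense optical flow and map it to a scroll offset. The pipeline below (OpenCV's Farnebäck flow, a median motion vector, a fixed gain and dead zone) is one plausible guess, not the thesis implementation, and it needs a webcam to run.

```python
import cv2
import numpy as np

SCROLL_GAIN = 4.0  # pixels of content scroll per pixel of image motion (assumed)
DEAD_ZONE = 0.3    # ignore tiny motions such as hand tremor (assumed)

def scroll_loop():
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    if not ok:
        raise RuntimeError("no camera frame")
    prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    offset_y = 0.0  # current vertical position in the larger virtual view
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames.
        flow = cv2.calcOpticalFlowFarneback(
            prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # Median is more robust than mean against objects moving in the scene.
        dy = float(np.median(flow[..., 1]))
        if abs(dy) > DEAD_ZONE:
            offset_y += SCROLL_GAIN * dy  # a real UI would re-render here
        prev = gray
        print(f"scroll offset: {offset_y:.1f}")

if __name__ == "__main__":
    scroll_loop()
```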
- Publication: Machine Learning Techniques for Document Processing and Web Security (Università degli studi di Trieste, 2013-03-13)
Sorio, Enrico; Medvet, Eric; Bartoli, Alberto

The task of extracting structured information from documents that are unstructured or whose structure is unknown is of utmost importance in many application domains, e.g., office automation, knowledge management, and machine-to-machine interactions. In practice, this information extraction task can be automated only to a very limited extent or subject to strong assumptions and constraints on the execution environment. In this thesis I will present several novel applications of machine learning techniques aimed at extending the scope and opportunities for automation of information extraction from documents of different types, ranging from printed invoices, to structured XML documents, to potentially malicious documents exposed on the web.

The main result of this thesis consists in the design, development and experimental evaluation of a system for information extraction from printed documents. My approach is designed for scenarios in which the set of possible document layouts is unknown and may evolve over time. The system uses the layout information to define layout-specific extraction rules that can be used to extract information from a document. As far as I know, this is the first information extraction system able to detect that the document under analysis has an unseen layout and hence needs new extraction rules; in that case, it uses a probability-based machine learning algorithm to build those extraction rules using just the document under analysis. Another novel contribution is that the system continuously exploits feedback from human operators to improve its extraction ability.

I investigate a method for the automatic detection and correction of OCR errors. The algorithm uses domain knowledge about possible misrecognitions of characters and about the type of the extracted information to propose and validate corrections. I also propose a system for the automatic generation of regular expressions for text-extraction tasks; the system is based on genetic programming and uses a set of user-provided labelled examples to drive the evolutionary search for a regular expression suitable for the specified task. As regards information extraction from structured documents, I present an approach, based on genetic programming, for schema synthesis starting from a set of sample XML documents: the tool takes as input one or more XML documents and automatically produces a schema, in the DTD language, which describes the structure of the input documents.

Finally, I will move to web security. I assess the ability of Italian public administrations to remain in full control of their web sites, and I develop a technique for detecting certain types of fraudulent intrusions that are becoming of practical interest on a large scale.
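The OCR-correction idea lends itself to a compact sketch: propose candidate strings by swapping commonly confused characters and accept the first candidate that the expected data type validates. The confusion pairs and the validator below are illustrative assumptions, not the thesis algorithm.

```python
import itertools
import re

# Characters that OCR engines commonly confuse (illustrative subset).
CONFUSIONS = {"O": "0", "0": "O", "l": "1", "1": "l",
              "S": "5", "5": "S", "B": "8", "8": "B"}

# Expected type of the extracted field, e.g. an 11-digit VAT number (assumed).
VALID = re.compile(r"^\d{11}$")

def corrections(text, max_edits=2):
    """Yield candidate strings obtained by swapping confusable characters."""
    positions = [i for i, c in enumerate(text) if c in CONFUSIONS]
    for k in range(1, max_edits + 1):
        for combo in itertools.combinations(positions, k):
            chars = list(text)
            for i in combo:
                chars[i] = CONFUSIONS[chars[i]]
            yield "".join(chars)

def correct(text):
    """Return the OCR string unchanged if valid, else the first valid fix."""
    if VALID.match(text):
        return text
    return next((c for c in corrections(text) if VALID.match(c)), text)

print(correct("0123456789O"))  # -> "01234567890" (trailing letter O fixed)
```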
- Publication: New strategies for efficient and practical genetic programming (Università degli studi di Trieste, 2008-03-18)
Fillon, Cyril; Bartoli, Alberto

In recent decades, engineers and decision makers have expressed a growing interest in the development of effective modeling and simulation methods to understand or predict the behavior of many phenomena in science and engineering. Many of these phenomena are translated into mathematical models for convenience and ease of interpretation. Methods commonly employed for this purpose include, for example, Neural Networks, Simulated Annealing, Genetic Algorithms and Tabu Search. These methods all seek optimal or near-optimal values for a predefined set of parameters of a model built a priori, so a suitable model must be known beforehand. When the form of this model cannot be found, the problem can be approached at another level, where the goal is to find a program or a mathematical representation that solves the problem. In this view, the modeling step is performed automatically, driven by a quality criterion that guides the building process.

In this thesis, we focus on the Genetic Programming (GP) approach as an automatic method for creating computer programs by means of artificial evolution, based upon the original contributions of Darwin and Mendel. While GP has proven to be a powerful means for coping with problems in which finding a solution and its representation is difficult, its practical applicability is still severely limited by several factors. First, the GP approach is inherently a stochastic process: there is no guarantee of obtaining a satisfactory solution at the end of the evolutionary loop. Second, performance on a given problem may depend strongly on a broad range of parameters, including the number of variables involved, the quantity of data for each variable, the size and composition of the initial population, the number of generations, and so on. Yet, when one uses Genetic Programming to solve a problem, one has two expectations: on the one hand, to maximize the probability of obtaining an acceptable solution; on the other hand, to minimize the amount of computational resources needed to get that solution.

Initially, we present innovative and challenging applications in several scientific fields (computer science and mechanical engineering) that contributed greatly to the experience gained in the GP field. Then we propose new strategies for improving the performance of the GP approach in terms of efficiency and accuracy, and we assess them on a large set of benchmark problems in three different domains. Furthermore, we introduce a new GP-based approach dedicated to symbolic regression of multivariate data sets in which the underlying phenomenon is best characterized by a discontinuous function. These contributions aim to provide a better understanding of the key features and the underlying relationships that make enhancements successful in improving the original algorithm.
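To make the symbolic-regression setting concrete, the sketch below evaluates random expression trees with a protected division operator against a discontinuous target. Random search stands in for the full evolutionary loop, and every design choice here (grammar, tree depth, error measure, the target function) is an illustrative assumption.

```python
import operator
import random

# Protected division keeps every candidate program numerically evaluable,
# a standard GP precaution against division by zero.
def pdiv(a, b):
    return a / b if abs(b) > 1e-9 else 1.0

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": pdiv}

def random_tree(depth=3):
    """Grow a random expression tree over the variable 'x' and constants."""
    if depth == 0 or random.random() < 0.3:
        return "x" if random.random() < 0.5 else random.uniform(-2, 2)
    return (random.choice(list(OPS)),
            random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

# A discontinuous target: a single smooth expression fits it poorly, which
# is exactly the difficulty the thesis addresses.
def target(x):
    return x * x if x < 0 else x + 3

def mse(tree, xs):
    return sum((evaluate(tree, x) - target(x)) ** 2 for x in xs) / len(xs)

random.seed(0)
xs = [i / 10 for i in range(-20, 21)]
best = min((random_tree() for _ in range(5000)), key=lambda t: mse(t, xs))
print(best, mse(best, xs))  # residual error stays high near the jump at x = 0
```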
- Publication: Performance control of internet-based engineering applications (Università degli studi di Trieste, 2008-03-18)
Vercesi, Paolo; Bartoli, Alberto

Thanks to technologies that simplify the integration of remote programs hosted by different organizations, the scientific and engineering communities are adopting service-oriented architectures to aggregate, share and distribute their computing resources, to manage large amounts of data, and to run simulations over the Internet. Web Services, for example, allow an organization to expose the functionality of its systems on the Internet and to make it discoverable and accessible in a controlled way. This technological progress enables new applications in the area of design optimization as well. Current design optimization systems are usually confined within a single organization or department; modern manufactured products, on the other hand, are assemblies of components coming from several organizations. By composing the services of the organizations involved, one can create a workflow that describes the model of the composite product, and this composite service can in turn be used by an inter-organization optimization system.

The design trade-offs implicitly embedded in local architectures must be reconsidered when these systems are deployed on a global scale over the Internet. For example: i) the quality of the connections between nodes may vary unpredictably; ii) third-party nodes retain full control of their resources, including, for example, the right to reduce them temporarily and unpredictably. From the point of view of the system as a single entity, one would like to maximize performance, i.e., the throughput, intended as the number of candidate designs evaluated per unit of time. From the point of view of the organizations participating in the workflow, one would instead like to minimize the cost associated with each evaluation. This cost may be an obstacle to the adoption of the distributed paradigm, because the participating organizations share their resources (i.e., CPUs, connections, bandwidth and software licenses) with other, potentially unknown, organizations. Minimizing this cost while keeping the performance delivered to clients at an acceptable level can be a powerful factor in encouraging organizations to actually share their resources. The scheduling of workflow instances, i.e., deciding when and where to execute a given workflow, has a strong impact on performance in such a multi-organization, multi-tier, geographically dispersed environment.

This work investigates some of the essential performance and cost problems arising in this new scenario. To address the problems identified, it proposes an adaptive admission control system, placed in front of the workflow engine, that limits the number of concurrent executions. This proposal can be implemented very simply: it treats services as black boxes and requires no interaction from the participating organizations. The technique has been evaluated in a wide range of scenarios through discrete-event simulation. The experimental results suggest that it can provide significant benefits, ensuring high throughput and low cost.
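The admission-control idea can be pictured with a minimal sketch: a gate in front of a black-box workflow engine caps concurrent executions and adapts the cap to observed completion latency. The AIMD-style adaptation rule and all parameters below are assumptions made for illustration, not the policy evaluated in the thesis.

```python
import random
import threading
import time

class AdaptiveAdmissionController:
    """Cap concurrent workflow executions; adapt the cap to observed latency."""

    def __init__(self, limit=4, target_latency=1.0):
        self.limit = limit            # current concurrency cap
        self.target = target_latency  # acceptable completion time (assumed)
        self.running = 0
        self.cond = threading.Condition()

    def submit(self, workflow):
        with self.cond:
            while self.running >= self.limit:
                self.cond.wait()      # admission control: queue the request
            self.running += 1
        start = time.time()
        try:
            workflow()                # the engine is treated as a black box
        finally:
            latency = time.time() - start
            with self.cond:
                self.running -= 1
                # Additive increase when fast, multiplicative decrease when slow.
                if latency < self.target:
                    self.limit += 1
                else:
                    self.limit = max(1, self.limit // 2)
                self.cond.notify_all()

def fake_workflow():
    time.sleep(random.uniform(0.2, 1.5))  # stand-in for a remote composite service

controller = AdaptiveAdmissionController()
threads = [threading.Thread(target=controller.submit, args=(fake_workflow,))
           for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final concurrency limit:", controller.limit)
```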
- Publication: Techniques for large-scale automatic detection of web site defacements (Università degli studi di Trieste, 2008-03-18)
Medvet, Eric; Bartoli, Alberto

Web site defacement, the process of introducing unauthorized modifications to a web site, is a very common form of attack. This thesis describes the design and experimental evaluation of a framework that may constitute the basis for a defacement detection service capable of monitoring thousands of remote web sites systematically and automatically. With this framework, an organization may join the service by simply providing the URL of the resource to be monitored along with the contact point of an administrator; the monitored organization may thus take advantage of the service with just a few mouse clicks, without installing any software locally or changing its daily operational processes.

The main proposed approach is based on anomaly detection and allows monitoring the integrity of many remote web resources automatically while remaining fully decoupled from them, in particular without requiring any prior knowledge about those resources. During a preliminary learning phase, a profile of the monitored resource is built automatically; then, while monitoring, the remote resource is retrieved periodically and an alert is generated whenever something "unusual" shows up. The thesis discusses the effectiveness of the approach in terms of detection accuracy, i.e., missed detections and false alarms.

The thesis also considers the problem of misclassified readings in the learning set. The effectiveness of the anomaly detection approach, and hence of the proposed framework, rests on the assumption that the profile is computed from a learning set that is not corrupted by attacks, an assumption often taken for granted. The influence of learning set corruption on the effectiveness of the framework is assessed, and a procedure for discovering whether a given unknown learning set is corrupted by positive readings is proposed and evaluated experimentally. An approach to automatic defacement detection based on Genetic Programming (GP), an automatic method for creating computer programs by means of artificial evolution, is also proposed and evaluated experimentally. Moreover, a set of techniques that have been used in the literature for designing host-based and network-based Intrusion Detection Systems is considered and evaluated experimentally in comparison with the proposed approach.

Finally, the thesis presents the findings of a large-scale study on reaction time to web site defacement. Several statistics indicate the number of incidents of this sort, but a crucial piece of information is still lacking: the typical duration of a defacement. A two-month monitoring activity was performed over more than 62,000 defacements in order to determine whether and when a reaction to the defacement takes place. It is shown that the reaction time tends to be unacceptably long, on the order of several days, and with a long-tailed distribution.
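A minimal sketch of the learn-then-monitor scheme: build a per-feature statistical profile from clean snapshots of a page, then flag a new reading when too many features drift from that profile. The features, thresholds, and z-score test below are illustrative assumptions, not the framework's actual sensors.

```python
import statistics

def features(html):
    """A few cheap integrity features of a page (illustrative choice)."""
    return {
        "length": len(html),
        "n_links": html.count("<a "),
        "n_scripts": html.count("<script"),
        "n_imgs": html.count("<img"),
    }

def build_profile(learning_set):
    """Learning phase: per-feature mean and standard deviation."""
    rows = [features(h) for h in learning_set]
    return {k: (statistics.mean(r[k] for r in rows),
                statistics.stdev(r[k] for r in rows) or 1.0)  # avoid sd = 0
            for k in rows[0]}

def is_anomalous(profile, html, z_thresh=3.0, max_deviations=1):
    """Monitoring phase: alert when too many features drift from the profile."""
    f = features(html)
    deviations = sum(1 for k, (mu, sd) in profile.items()
                     if abs(f[k] - mu) / sd > z_thresh)
    return deviations > max_deviations

# Usage: profile a page from clean snapshots, then test a suspicious reading.
snapshots = ["<a href=x></a>" * n + "<img>" * 3 for n in (10, 11, 12, 10)]
profile = build_profile(snapshots)
defaced = "<script>evil()</script>" * 50
print(is_anomalous(profile, defaced))  # -> True
```

In the actual framework the retrieval is periodic, the feature set is far richer, and the profile is refined over time; the sketch only conveys the profile-and-alert mechanism.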