Machine Learning Techniques for Document Processing and Web Security
Tecniche di machine learning per la catalogazione automatica di documenti e sicurezza web (Machine learning techniques for automatic document cataloguing and web security)
The task of extracting structured information from documents that are unstructured or whose structure is unknown is of utmost importance in many application domains, e.g., office automation, knowledge management, and machine-to-machine interactions. In practice, this information extraction task can be automated only to a very limited extent or subject to strong assumptions and constraints on the execution environment. In this thesis I present several novel applications of machine learning techniques aimed at extending the scope and opportunities for automation of information extraction from documents of different types, ranging from printed invoices to structured XML documents to potentially malicious documents exposed on the web. The main result of this thesis is the design, development and experimental evaluation of a system for information extraction from printed documents. My approach is designed for scenarios in which the set of possible document layouts is unknown and may evolve over time. The system uses layout information to define layout-specific extraction rules that can be used to extract information from a document. As far as I know, this is the first information extraction system that is able to detect whether the document under analysis has an unseen layout and hence needs new extraction rules. In such a case, it uses a probability-based machine learning algorithm to build those extraction rules using just the document under analysis. Another novel contribution of this system is that it continuously exploits feedback from human operators to improve its extraction ability. I also investigate a method for the automatic detection and correction of OCR errors. The algorithm uses domain knowledge about possible misrecognition of characters and about the type of the extracted information to propose and validate corrections. I then propose a system for the automatic generation of regular expressions for text-extraction tasks.
The system is based on genetic programming and uses a set of user-provided labelled examples to drive the evolutionary search for a regular expression suitable for the specified task. As regards information extraction from structured documents, I present an approach, based on genetic programming, for schema synthesis starting from a set of XML sample documents. The tool takes as input one or more XML documents and automatically produces a schema, in the DTD language, that describes the structure of the input documents. Finally, I turn to web security. I attempt to assess the ability of Italian public administrations to be in full control of their respective web sites. Moreover, I develop a technique for the detection of certain types of fraudulent intrusions that are becoming of practical interest on a large scale.
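
To make the OCR correction idea mentioned above more concrete, the following is a minimal Python sketch of the general principle only: candidate corrections are proposed from knowledge of which characters are commonly confused, and validated against the expected type of the extracted field. It is not the algorithm developed in the thesis. The confusion table `CONFUSIONS`, the field patterns in `VALIDATORS`, and the function names `candidates` and `correct` are all assumptions introduced for illustration.

```python
import re
from itertools import product

# Hypothetical confusion table: each OCR glyph maps to the characters it
# may actually stand for (the glyph itself is always listed first).
CONFUSIONS = {
    "O": "O0",
    "o": "o0",
    "l": "l1",
    "I": "I1",
    "S": "S5",
    "B": "B8",
    "Z": "Z2",
}

# Hypothetical field types, each described by a pattern a valid value must match.
VALIDATORS = {
    "date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "amount": re.compile(r"\d+\.\d{2}"),
}


def candidates(text):
    """Yield every string obtained by replacing confusable characters."""
    options = [CONFUSIONS.get(ch, ch) for ch in text]
    for combo in product(*options):
        yield "".join(combo)


def correct(text, field_type):
    """Return a validated correction of `text`, or None if none is unambiguous."""
    pattern = VALIDATORS[field_type]
    if pattern.fullmatch(text):
        return text  # already a valid value for this field type
    valid = [c for c in candidates(text) if pattern.fullmatch(c)]
    # Accept a correction only when exactly one candidate passes validation.
    return valid[0] if len(valid) == 1 else None


print(correct("l2/O5/2Ol3", "date"))   # 12/05/2013
print(correct("1S.OO", "amount"))      # 15.00
print(correct("ab/cd/efgh", "date"))   # None
```

The number of candidates grows exponentially with the number of confusable characters, so a sketch like this is only reasonable for short fields such as dates, amounts, or codes. Rejecting ambiguous cases rather than guessing is a design choice made here for simplicity.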