Title: Machine Learning Techniques for Document Processing and Web Security
Other Titles: Tecniche di machine learning per la catalogazione automatica di documenti e sicurezza web
Authors: Sorio, Enrico
Supervisor/Tutor: Medvet, Eric
Bartoli, Alberto
Issue Date: 13-Mar-2013
Publisher: Università degli studi di Trieste
The task of extracting structured information from documents that are unstructured or whose structure is unknown is of utmost importance in many application domains, e.g., office automation, knowledge management, and machine-to-machine interactions. In practice, this information extraction task can be automated only to a very limited extent or subject to strong assumptions and constraints on the execution environment.

In this thesis I present several novel applications of machine learning techniques aimed at extending the scope and opportunities for automation of information extraction from documents of different types, ranging from printed invoices to structured XML documents to potentially malicious documents exposed on the web.

The main results of this thesis consist in the design, development and experimental evaluation of a system for information extraction from printed documents. My approach is designed for scenarios in which the set of possible document layouts is unknown and may evolve over time. The system uses layout information to define layout-specific extraction rules, which are then used to extract information from a document.
As far as I know, this is the first information extraction system that is able to detect whether the document under analysis has an unseen layout and hence needs new extraction rules.
In such a case, it uses a probability-based machine learning algorithm to build those extraction rules using just the document under analysis.
Another novel contribution of the system is that it continuously exploits feedback from human operators to improve its extraction ability.
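
The unseen-layout check described above can be pictured with a minimal sketch. It is not the method used in the thesis, which the abstract does not detail. It assumes, purely for illustration, that each document is summarised as a numeric feature vector and that a layout counts as "unseen" when no known layout lies within a distance threshold. The names `match_layout`, `known_layouts` and `threshold` are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_layout(features, known_layouts, threshold):
    """Return the name of the closest known layout, or None if unseen.

    known_layouts maps a layout name to a reference feature vector.
    Both the features and the threshold are illustrative assumptions.
    """
    best_name, best_dist = None, math.inf
    for name, reference in known_layouts.items():
        d = distance(features, reference)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None


# Hypothetical reference vectors for two already known layouts.
KNOWN = {"vendor_a": (0.1, 0.2), "vendor_b": (0.8, 0.9)}
```

With these values, `match_layout((0.12, 0.22), KNOWN, 0.1)` selects `vendor_a`, while `match_layout((0.5, 0.5), KNOWN, 0.1)` returns `None`, the case in which new extraction rules would have to be built.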

I investigate a method for the automatic detection and correction of OCR errors. The algorithm uses domain knowledge about possible character misrecognitions and about the type of the extracted information to propose and validate corrections.
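
A rough illustration of the propose-and-validate idea follows. It is a sketch under stated assumptions, not the algorithm from the thesis: the confusion table `CONFUSIONS` and the per-type patterns in `VALIDATORS` are invented for the example, and candidates are enumerated exhaustively, which is only practical for short strings.

```python
import itertools
import re

# Hypothetical confusion table: OCR output character -> plausible intended character.
CONFUSIONS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

# Hypothetical validators, one per type of extracted information.
VALIDATORS = {
    "date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "amount": re.compile(r"\d+\.\d{2}"),
}


def propose(text):
    """Yield candidate strings, starting with the unmodified text."""
    options = [(ch, CONFUSIONS[ch]) if ch in CONFUSIONS else (ch,) for ch in text]
    for combo in itertools.product(*options):
        yield "".join(combo)


def correct(text, field_type):
    """Return the first candidate valid for the field type, or None."""
    pattern = VALIDATORS[field_type]
    for candidate in propose(text):
        if pattern.fullmatch(candidate):
            return candidate
    return None
```

For instance, `correct("l3/O3/2O13", "date")` yields `13/03/2013`, while a string that cannot be repaired into a valid date yields `None`.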

I propose a system for the automatic generation of regular expressions for text-extraction tasks. The system is based on genetic programming and uses a set of user-provided labelled examples to drive the evolutionary search for a regular expression suitable for the specified task.
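
To make the role of the labelled examples concrete, the sketch below scores candidate regular expressions against examples and keeps the best one. It covers only fitness evaluation and selection; the genetic operators (crossover, mutation) of a real genetic programming search are omitted. The fitness definition, the examples and the candidate patterns are assumptions made for illustration.

```python
import re


def fitness(pattern, examples):
    """Fraction of examples whose first match equals the labelled substring."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return 0.0  # syntactically invalid candidates get the worst score
    hits = 0
    for text, expected in examples:
        match = compiled.search(text)
        if match and match.group(0) == expected:
            hits += 1
    return hits / len(examples)


# Hypothetical labelled examples: (text, substring to be extracted).
EXAMPLES = [
    ("Invoice date: 13/03/2013", "13/03/2013"),
    ("Due 01/04/2013 net", "01/04/2013"),
]

# A hypothetical population of candidate regular expressions.
CANDIDATES = [r"\d+", r"\d{2}/\d{2}", r"\d{2}/\d{2}/\d{4}"]

best = max(CANDIDATES, key=lambda p: fitness(p, EXAMPLES))
```

Here `best` is the full date pattern, the only candidate that extracts the labelled substring from every example.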

As regards information extraction from structured documents, I present an approach, based on genetic programming, for schema synthesis from a set of sample XML documents.
The tool takes as input one or more XML documents and automatically produces a schema, in the DTD language, that describes the structure of the input documents.
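
The input/output contract of such a tool can be illustrated with a deliberately naive baseline. It does not use genetic programming and is not the approach of the thesis: it simply records which child elements appear under each element and emits a permissive DTD. Attributes are ignored, and the function name `infer_dtd` is hypothetical.

```python
import xml.etree.ElementTree as ET


def infer_dtd(xml_documents):
    """Build a permissive DTD from a list of XML document strings."""
    children = {}  # element name -> distinct child names, in first-seen order
    for source in xml_documents:
        root = ET.fromstring(source)
        for element in root.iter():
            seen = children.setdefault(element.tag, [])
            for child in element:
                if child.tag not in seen:
                    seen.append(child.tag)
    lines = []
    for name, kids in children.items():
        if kids:
            # Any number of the observed children, in any order.
            content = "(" + " | ".join(kids) + ")*"
        else:
            content = "(#PCDATA)"
        lines.append(f"<!ELEMENT {name} {content}>")
    return "\n".join(lines)


SAMPLE = "<invoice><date>13/03/2013</date><total>12.50</total></invoice>"
dtd = infer_dtd([SAMPLE])
```

A schema produced this way accepts the samples but is much looser than necessary; a search-based approach has room to trade generality against tightness.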

Finally, I turn to web security.
I attempt to assess the ability of Italian public administrations to be in full control of their respective web sites.
Moreover, I developed a technique for detecting certain types of fraudulent intrusions that are becoming of practical interest on a large scale.
Doctoral cycle: XXV cycle
Keywords: machine learning; document understanding; web security; cloud computing
Type: Doctoral Thesis
Language: en
NBN: urn:nbn:it:units-9913
Appears in Collections: Ingegneria industriale e dell'informazione

Files in This Item:
File: sorio_phd.pdf
Size: 1.85 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.