Machine Learning Techniques for Document Processing and Web Security

Tecniche di machine learning per la catalogazione automatica di documenti e sicurezza web
Sorio, Enrico
2013-03-13
http://hdl.handle.net/10077/8533
  • Doctoral Thesis

Abstract
The task of extracting structured information from documents that are unstructured or whose structure is unknown is of utmost importance in many application domains, e.g., office automation, knowledge management, and machine-to-machine interactions. In practice, this information extraction task can be automated only to a very limited extent, or only subject to strong assumptions and constraints on the execution environment. In this thesis I present several novel applications of machine learning techniques aimed at extending the scope and opportunities for automation of information extraction from documents of different types, ranging from printed invoices to structured XML documents to potentially malicious documents exposed on the web.

The main result of this thesis is the design, development and experimental evaluation of a system for information extraction from printed documents. My approach is designed for scenarios in which the set of possible document layouts is unknown and may evolve over time. The system uses layout information to define layout-specific extraction rules that can be used to extract information from a document. As far as I know, this is the first information extraction system that is able to detect whether the document under analysis has an unseen layout and hence needs new extraction rules. In such a case, it uses a probability-based machine learning algorithm to build those extraction rules using just the document under analysis. Another novel contribution of my system is that it continuously exploits feedback from human operators to improve its extraction ability.

I investigate a method for the automatic detection and correction of OCR errors. The algorithm uses domain knowledge about possible misrecognitions of characters and about the type of the extracted information to propose and validate corrections.

I propose a system for the automatic generation of regular expressions for text-extraction tasks. The system is based on genetic programming and uses a set of user-provided labelled examples to drive the evolutionary search for a regular expression suitable for the specified task.

As regards information extraction from structured documents, I present an approach, based on genetic programming, for schema synthesis starting from a set of XML sample documents. The tool takes as input one or more XML documents and automatically produces a schema, in the DTD language, which describes the structure of the input documents.

Finally, I turn to web security. I attempt to assess the ability of Italian public administrations to be in full control of their respective web sites. Moreover, I developed a technique for the detection of certain types of fraudulent intrusions that are becoming of practical interest on a large scale.
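
As an illustration of the OCR correction idea mentioned in the abstract (propose corrections from known character misrecognitions, then validate them against the type of the extracted field), a minimal sketch follows. It is not the algorithm from the thesis: the confusion table `CONFUSIONS`, the validators in `VALIDATORS`, and the function names `propose` and `correct` are all hypothetical and chosen only for illustration.

```python
import itertools
import re

# Hypothetical confusion table: a character the OCR engine may have emitted,
# mapped to the digit it may have been meant to be.
CONFUSIONS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8", "Z": "2"}

# Hypothetical validators: one pattern per type of extracted field.
VALIDATORS = {
    "date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "amount": re.compile(r"\d+\.\d{2}"),
}


def propose(text):
    """Yield every string obtained by keeping or replacing each confusable character."""
    options = [(ch, CONFUSIONS[ch]) if ch in CONFUSIONS else (ch,) for ch in text]
    for combo in itertools.product(*options):
        yield "".join(combo)


def correct(text, field_type):
    """Return the proposed corrections that are valid for the given field type."""
    pattern = VALIDATORS[field_type]
    return [candidate for candidate in propose(text) if pattern.fullmatch(candidate)]


if __name__ == "__main__":
    print(correct("l3/O3/2O13", "date"))
    print(correct("1O5.S0", "amount"))
```

The candidate set grows exponentially with the number of confusable characters, so a sketch like this is only practical for short fields such as dates or amounts.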
Subjects
  • machine learning
  • document understandin...
  • web security
  • cloud computing

Publisher
Università degli studi di Trieste
Languages
en
Licence
http://www.openstarts.units.it/dspace/default-license.jsp
File(s)
Name
sorio_phd.pdf
Format
Adobe PDF
Size
1.81 MB