Differences

This shows you the differences between two versions of the page.

public:gsoc:poormantextract [2020/02/26 20:55]
kdrag0n
public:gsoc:poormantextract [2020/02/29 17:42] (current)
thealphadollar styling
Line 7:
 First, you'll need to find lots of documents to process around the internet. We will provide you some, but you need to build your own corpus. A few ideas:
  
-- Tax documents.
-- Multiple choice exams.
-- Immigration forms.
-- Resumes.
-- Contracts.
-- Blueprints.
-- Order forms.
-- Invoices.
+- Tax documents.\\
+- Multiple-choice exams.\\
+- Immigration forms.\\
+- Resumes.\\
+- Contracts.\\
+- Blueprints.\\
+- Order forms.\\
+- Invoices.\\
 
-Etc
+... and many more.
 
-Then create a system that is able to identify the parts of all those models, write some output (for example, a JSON file) that contains the coordinates of each of the parts, writes each part to a separate file, and OCRs whatever information can be OCR'ed and writes it to a database.
+Then create a system that is able to identify the parts of all those models, write some output (for example, a JSON file) that contains the coordinates of each of the parts, writes each part to a separate file, and OCRs whatever information can be OCR'ed and writes it to a database (which can be as simple as document storage or a full-fledged SQL instance of your preferred flavor).
  
 We strongly recommend you play a bit with Amazon's Textract (for online tests it's free) to get an idea of what to expect.
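
The output stage described in the paragraph above (a JSON file with the coordinates of each part, plus OCR text written to a database) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Region` schema, the `regions` table layout, and the `ocr` callable are hypothetical and not prescribed by the project page. Region detection and cropping each part to its own file are left out, since they depend on the image library you choose.

```python
import json
import sqlite3
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class Region:
    """One detected part of a page (hypothetical schema)."""
    label: str      # e.g. "header", "table", "signature"
    x: int          # left edge, in pixels
    y: int          # top edge, in pixels
    width: int
    height: int
    text: str = ""  # OCR result; stays empty if nothing could be read


def regions_to_json(document_id: str, regions: List[Region]) -> str:
    """Serialize the coordinates (and text) of every region to JSON."""
    payload = {"document": document_id,
               "regions": [asdict(r) for r in regions]}
    return json.dumps(payload, indent=2)


def store_regions(conn: sqlite3.Connection, document_id: str,
                  regions: List[Region]) -> None:
    """Write one row per region to a SQLite table (assumed layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS regions ("
        "document TEXT, label TEXT, x INTEGER, y INTEGER, "
        "width INTEGER, height INTEGER, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO regions VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(document_id, r.label, r.x, r.y, r.width, r.height, r.text)
         for r in regions],
    )
    conn.commit()


def process(document_id: str, regions: List[Region],
            ocr: Callable[[Region], str],
            conn: sqlite3.Connection) -> str:
    """Run the supplied OCR callable on each region, store the rows,
    and return the JSON description of the document."""
    for region in regions:
        region.text = ocr(region)
    store_regions(conn, document_id, regions)
    return regions_to_json(document_id, regions)
```

In this sketch `ocr` is any function that takes a `Region` and returns a string, so a real OCR engine can be plugged in later without changing the storage code. SQLite is used only because it ships with Python; the page leaves the choice of database open.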
  • public/gsoc/poormantextract.txt
  • Last modified: 2020/02/29 17:42
  • by thealphadollar