Differences

This shows you the differences between two versions of the page.

public:gsoc:poormantextract [2020/02/06 04:00]
cfsmp3 created
public:gsoc:poormantextract [2020/02/29 17:42] (current)
thealphadollar styling
Line 1: Line 1:
-====== Poor's man Textract ======
+====== Poor Man's Textract ======
  
 **Introduction**
Line 7: Line 7:
 First, you'll need to find lots of documents to process around the internet. We will provide you some, but you need to build your own corpus. A few ideas:
  
-- Tax documents.
-- Multiple choice exams.
-- Immigration forms.
-- Resumes.
-- Contracts.
-- Blueprints.
-- Order forms.
-- Invoices.
+- Tax documents.\\
+- Multiple-choice exams.\\
+- Immigration forms.\\
+- Resumes.\\
+- Contracts.\\
+- Blueprints.\\
+- Order forms.\\
+- Invoices.\\
  
-Etc
+... and many more.
  
-Then create a system that is able to identify the parts of all those models, write some output (for example, a JSON file) that contains the coordinates of each of the parts, writes each part to a separate file, and OCRs whatever information can be OCR'ed and writes it to a database.
+Then create a system that is able to identify the parts of all those models, write some output (for example, a JSON file) that contains the coordinates of each of the parts, writes each part to a separate file, and OCRs whatever information can be OCR'ed and writes it to a database (which can be as simple as document storage or a full-fledged SQL instance of your preferred flavor).
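
The output stage described above (a JSON file with the coordinates of each part, plus a database holding the OCR'ed text) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Region` structure, the field names, the `parts` table, and the choice of SQLite are not specified by the project page, and layout detection and OCR themselves are left out entirely.

```python
import json
import sqlite3
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Region:
    """One detected part of a page (hypothetical structure, not from the project)."""
    label: str      # e.g. "header", "table", "signature"
    x: int          # left edge, in pixels
    y: int          # top edge, in pixels
    width: int
    height: int
    text: str = ""  # OCR result; empty when the part has no recognizable text


def regions_to_json(document_id: str, regions: List[Region]) -> str:
    """Serialize the coordinates (and any text) of every part of one document."""
    payload = {"document": document_id, "regions": [asdict(r) for r in regions]}
    return json.dumps(payload, indent=2)


def store_regions(conn: sqlite3.Connection, document_id: str,
                  regions: List[Region]) -> int:
    """Write the parts that carry OCR text to a table; return the rows stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parts ("
        "document TEXT, label TEXT, x INTEGER, y INTEGER, "
        "width INTEGER, height INTEGER, text TEXT)"
    )
    rows = [
        (document_id, r.label, r.x, r.y, r.width, r.height, r.text)
        for r in regions
        if r.text  # skip parts where OCR produced nothing
    ]
    conn.executemany("INSERT INTO parts VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    # Made-up sample regions standing in for real detector output.
    sample = [
        Region("header", 40, 30, 520, 60, text="INVOICE 0001"),
        Region("logo", 600, 20, 120, 120),
    ]
    print(regions_to_json("invoice-0001", sample))
    store_regions(sqlite3.connect(":memory:"), "invoice-0001", sample)
```

SQLite is used here only because it ships with Python; the same rows could just as well go to a document store or another SQL flavor.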
  
 We strongly recommend you play a bit with Amazon's Textract (for online tests it's free) to get an idea of what to expect.
Line 28: Line 28:
 [[https://github.com/knightron0/exam-analyzer|knightron0's exam analyzer]]
  
+**Qualification tasks**\\
+Take a look at [[https://ccextractor.org/public:gsoc:takehome|this page]].
  
  • public/gsoc/poormantextract.1580961643.txt.gz
  • Last modified: 2020/02/06 04:00
  • by cfsmp3