dockkvm.blogg.se

Extracting Information from Cats by Jacobus H Jessurun

In my personal opinion (and I don't mean this as criticism), using word embeddings or neural-network-based models everywhere is not feasible. But it depends on how you are going to use the extracted semantic information, which you didn't mention in your post. Since your ultimate goal is to extract semantic meaning from invoices, why do you need syntactic information? Is it going to help you in this task? If you have a large dataset, you can generate a dictionary for the specific target domain. SyntaxNet, however, is really meant for dependency parsing. Your second concern is not clear to me. Usually, numeric values are replaced by some label, say NUM, during the pre-processing step of an information extraction task.
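The NUM replacement mentioned above could look roughly like the following. This is a minimal sketch, not a fixed recipe: the regular expression, the `mask_numbers` name, and the sample invoice line are all assumptions made for illustration.

```python
import re

# Assumed pattern: integers and decimals, optionally with "," or "." separators
# between digit groups (e.g. 12345 or 1,250.00).
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(text, label="NUM"):
    """Replace every numeric token in `text` with a placeholder label."""
    return NUMBER_PATTERN.sub(label, text)

# Hypothetical invoice line, used only to show the effect.
print(mask_numbers("Order number: 12345, total due 1,250.00 EUR"))
# prints: Order number: NUM, total due NUM EUR
```

Note that masking is useful for learning which words surround a number, but the original values have to be kept somewhere if they are to be extracted later.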

You can also consider a simple language model (bigram or trigram), since you have a very specific domain to work on. You can also use one-hot encoding as an alternative to tf-idf weights.
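As a rough illustration of the bigram idea, the sketch below counts which token follows which and turns the counts into maximum-likelihood probabilities. The function names, the `<s>`/`</s>` boundary markers, and the three sample lines are assumptions for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each token, how often each other token follows it."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        # Sentence boundary markers are an assumption of this sketch.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            followers[prev][curr] += 1
    return followers

def bigram_probability(model, prev, curr):
    """Maximum-likelihood estimate of P(curr | prev); 0.0 for unseen contexts."""
    total = sum(model[prev].values())
    return model[prev][curr] / total if total else 0.0

# Hypothetical invoice lines with numbers already masked as NUM.
model = train_bigram_model([
    "invoice number NUM",
    "invoice date NUM",
    "order number NUM",
])
print(bigram_probability(model, "invoice", "number"))  # 0.5
print(bigram_probability(model, "number", "num"))      # 1.0
```

On a narrow domain such as invoices, even counts like these can show which labels tend to precede a value.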

To construct the corpus, you can remove common stop words (such as "a" and "the") and then use the tf-idf weight of each word to represent a document before feeding it to a skip-gram or CBOW model. The dataset has an obvious impact on how the word embeddings turn out. If you have a big dataset of invoices, it's better to use that.
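A minimal sketch of the stop-word removal and tf-idf weighting described above, using only the standard library. The stop-word list, the whitespace tokenizer, and the unsmoothed `log(N / df)` form of idf are assumptions; the sketch stops at the weights and leaves open how they would be combined with a skip-gram or CBOW model.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "to", "is", "for"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical two-document corpus.
for doc_weights in tf_idf(["the invoice total", "the invoice date"]):
    print(doc_weights)
```

In this toy corpus, "invoice" appears in every document and so gets weight 0, while "total" and "date" get positive weights, which is the behaviour that makes tf-idf useful for picking out distinguishing terms.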

The reason I'm trying to understand the semantic information in an invoice is ultimately to be able to extract values from it. So, for instance, if I present an unseen invoice to my neural network, it should find the invoice's number (whatever its label is called) and extract its value. The very high-level approach listed at the end of this post seems quite alright to me, but I would love to be corrected if anything looks wrong.

Is it sufficient to use a generic corpus such as Wikipedia, or should I use a specialized corpus for invoices? If it's the latter, how can I generate this corpus? I do have a big dataset of invoices that I can use.

Let's say the approach works and I'm able to understand semantic information from a new, unseen invoice. How do I go about extracting specific pieces of information? For instance, say we introduce a new invoice that has "order number: 12345". Assuming "order number" is understood to be the invoice number (or whatever vectors lie in its vicinity), how do I extract the value 12345? One area I was looking at that could help here is SyntaxNet.
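To make the "order number: 12345" question concrete, here is a deliberately simple rule-based sketch of the extraction step. It is not the neural approach the question has in mind: the hard-coded label list stands in for whatever labels an embedding model would place near "invoice number", and the function name and pattern are assumptions.

```python
import re

# Hypothetical label variants treated as the same field. In the approach
# discussed in this thread, these would come from embedding neighbours.
INVOICE_NUMBER_LABELS = ["invoice number", "order number"]

def extract_invoice_number(text):
    """Return the value following the first matching label, or None."""
    for label in INVOICE_NUMBER_LABELS:
        # Label, optional ":" or "#", then an alphanumeric value.
        pattern = re.compile(
            re.escape(label) + r"\s*[:#]?\s*(\w[\w-]*)", re.IGNORECASE
        )
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None

print(extract_invoice_number("Order Number: 12345"))  # 12345
```

Once a label has been recognized, taking the token that follows it is often enough for single-line key/value fields; layouts where the value sits in a different column or line would need positional information as well.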

I'm trying to train a couple of neural networks (using TensorFlow) to extract semantic information from invoices. After a lot of reading, I came up with this:

  • Use word2vec to generate word embeddings (more on the corpus above).
  • Feed the output of word2vec to a CNN, since vectors that are close together share similar semantic meanings.
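The premise that vectors lying close together share similar meanings can be illustrated with cosine similarity. The sketch below uses made-up three-dimensional vectors rather than real word2vec output, and the label names and `nearest_label` helper are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors purely for illustration; real embeddings would come
# from a trained model and have far more dimensions.
toy_embeddings = {
    "invoice_number": [0.9, 0.1, 0.0],
    "order_number": [0.8, 0.2, 0.1],
    "due_date": [0.0, 0.3, 0.9],
}

def nearest_label(query, embeddings):
    """Return the other label whose vector is most similar to `query`'s."""
    return max(
        (label for label in embeddings if label != query),
        key=lambda label: cosine_similarity(embeddings[query], embeddings[label]),
    )

print(nearest_label("order_number", toy_embeddings))  # invoice_number
```

With real embeddings, a lookup like this is one way an unfamiliar label such as "order number" could be mapped to a known field before any value is extracted.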










