Blog Post #5
A lot of thinking and coding happened this week. So let's get into it...
two_models
This is a branch that incorporates an off-the-shelf Coreference Resolution model during pre-processing and post-processing of the data. I'll go through what this branch of code does and then explain why I made these decisions.
The Coreference Resolution model that I'm using is from this repo, which utilizes spaCy and neural networks to identify pronouns and the nouns they refer to. Using this model, I run all of the data used to train the Pointer-Generator model through it; specifically, the reference summary (the human-written summary of the given document). By running the training data through the Coreference Resolution model, I can utilize spaCy to identify clusters within each document (each individual data point). A cluster is essentially a dictionary mapping a noun to the pronouns/nouns that refer to it (e.g. {Wayne: [he, his, him, student]}). These clusters are saved as a text file in a specified log directory, with the document's first 75 characters as the file name.
After saving the clusters to a file, the document is purged of pronouns; meaning, every pronoun is replaced by the noun it refers to (e.g. the text "Wayne saw himself." becomes "Wayne saw Wayne."). This purged document is then fed to the Pointer-Generator model for training. Because the Pointer-Generator model uses the reference summary (now the purged document) during training to judge the summaries it produces, the model starts generating summaries with no pronouns, since the reference summary no longer contains any. During decoding, the model takes the generated summary, finds the file where the reference clusters were saved using the first 75 characters of the document, and then replaces the repeated nouns with pronouns.
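To make this concrete, here's a minimal sketch of the pre-processing step. It assumes the coref model is the spaCy-based neuralcoref extension (which matches the repo's description); LOG_DIR and preprocess_summary are my stand-in names, not the actual branch code:

    import json
    import os

    import spacy
    import neuralcoref  # assumption: the spaCy-based coref model linked above

    nlp = spacy.load('en_core_web_sm')
    neuralcoref.add_to_pipe(nlp)

    LOG_DIR = 'coref_clusters'  # stand-in for the specified log directory


    def preprocess_summary(summary_text):
        """Save coref clusters keyed by the first 75 characters, then
        return the summary with every pronoun replaced by its noun."""
        doc = nlp(summary_text)
        # e.g. {'Wayne': ['he', 'his', 'him', 'student']}
        clusters = {c.main.text: [m.text for m in c.mentions
                                  if m.text != c.main.text]
                    for c in doc._.coref_clusters}
        os.makedirs(LOG_DIR, exist_ok=True)
        with open(os.path.join(LOG_DIR, summary_text[:75]), 'w') as f:
            json.dump(clusters, f)
        # 'Wayne saw himself.' -> 'Wayne saw Wayne.'
        return doc._.coref_resolved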
The main problem with a model that pre-processes and then post-processes the data is the difficulty of the post-processing. I tried to find models, apps, papers, and research on replacing nouns with pronouns (i.e. changing "Wayne saw Wayne." back to "Wayne saw himself.") but had zero luck finding anything. This led to the idea of saving the clusters in some way and using the saved data to rewrite the generated summary. Because around 300k articles are used during training, I decided it would be best to save the clusters in files instead of in memory. Figuring out how to store the data was also a challenge. I considered multiple methods, such as building one big JSON file and appending to it every time a file is read, but because multiple threads read the files during training, using a single file was impossible. The clusters also have to be accessed quickly later during decoding, so there had to be some way of searching and finding them fast. This led me to the decision of naming each file after the document's body. I found that 75 characters struck the right balance between avoiding duplicate file names and not spending too long creating each file.
The final step of this model caused a great deal of headaches. In the perfect situation, this branch would work like this: we strip all the pronouns during training, the model generates an exact copy of the reference summary with the pronouns replaced by the actual nouns, and the decoder then looks at the saved clusters file and replaces the repeated nouns with pronouns, producing an exact copy of the reference summary. But because this isn't the case, "replacing the nouns with the pronouns" doesn't really work well. The number of nouns in the generated summary is not always aligned with the reference summary, and the placement/context of the pronouns may also differ from the reference summary. This makes the saved references useless other than for recording which pronouns and nouns appeared in the reference summary.
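To illustrate just how naive the post-processing has to be, here's a sketch of the "replace repeated nouns with pronouns" step (replace_repeats and the pronoun list are my own stand-ins; the clusters dict is what the pre-processing step saved to file):

    PRONOUNS = {'he', 'his', 'him', 'himself', 'she', 'her', 'hers',
                'herself', 'it', 'its', 'itself', 'they', 'them', 'their'}


    def replace_repeats(summary, clusters):
        """Keep the first occurrence of each noun; swap later repeats
        for a pronoun saved in that noun's cluster."""
        for noun, mentions in clusters.items():
            pronouns = [m for m in mentions if m.lower() in PRONOUNS]
            pieces = summary.split(noun)
            if len(pieces) < 3 or not pronouns:  # needs 2+ occurrences
                continue
            rebuilt = pieces[0] + noun + pieces[1]
            for i, piece in enumerate(pieces[2:]):
                rebuilt += pronouns[i % len(pronouns)] + piece
            summary = rebuilt
        return summary


    print(replace_repeats('Wayne saw Wayne.', {'Wayne': ['he', 'his', 'him']}))
    # -> 'Wayne saw he.'

Even this tiny example shows the alignment problem: nothing in the saved cluster says the reference used "himself" in that exact position, so the naive swap picks the wrong pronoun.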
coreference_resolution_pipeline
This branch incorporates a "pipeline" within the Pointer-Generator model that deals with coreference resolution. This was a difficult branch to work on because of all the theoretical and mathematical knowledge involved in tweaking the inner parts of an already working model to address a problem that is still an open area of research. After scrutinizing the Pointer-Generator model and trying to understand every little detail within it, I decided to add a third switch. As mentioned in the previous blog post, the Pointer-Generator model has a switch that determines whether the model should output a word from the vocabulary or a word from the input document. Having this switch implies that there are two pipelines of data that are trained from the input document and then sent down to the switch: one pipeline that trains on and draws its output from the document, and one that trains on and outputs from the vocabulary. Here I added a third pipeline that trains on a pre-processed reference summary. The pre-processing involves running the reference summary through the Coreference Resolution model above and blanking out every single word that is not a referenced noun. This is done by replacing every word's ID with 0 except for the referenced nouns (e.g. "Wayne saw himself." becomes [84123, 1512, 634], which gets processed into [84123, 0, 0]). This puts more weight on the referenced nouns so that they are highly likely to be included in the generated summary. By doing so, we can prevent a pronoun from being used before its noun has been introduced, and if an ambiguous pronoun/reference is about to be used, instead of making the summary confusing with it, the model would generate the noun instead (hopefully).
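Here's a minimal sketch of that masking step (mask_non_coref_ids and its arguments are hypothetical names; in the actual branch this happens on the batched ID tensors inside the model):

    def mask_non_coref_ids(token_ids, tokens, referenced_nouns):
        """Keep the ID of each referenced noun; zero out all other words."""
        return [tid if tok in referenced_nouns else 0
                for tok, tid in zip(tokens, token_ids)]


    # 'Wayne saw himself.' -> [84123, 1512, 634] -> [84123, 0, 0]
    print(mask_non_coref_ids([84123, 1512, 634],
                             ['Wayne', 'saw', 'himself'],
                             {'Wayne'}))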
Diagrams and results will be posted very soon...