Blog Post #3

I realize that I have not yet explained why I chose to work on automatic text summarization. To keep things short, I've never really enjoyed reading for academic purposes. I like reading to learn about topics I am interested in (i.e., computer science) and reading fiction with no academic value, but when it comes to reading for classes I am taking just for the credits, I find it hard to stay fully engaged with the material. From talking to my peers, I realized that this is the general consensus. With the rise of machine learning and deep learning, I figured it would be interesting to look into developing an automatic abstractive text summarization tool for students such as myself.

Picking up from the last blog post...

A lot has happened. I decided to explore Google/Stanford's model, which uses a pointer mechanism and a coverage mechanism to address some of the problems that general text summarization models face. Specifically, the pointers deal with out-of-vocabulary (OOV) words: instead of generating <unk> tags, the model can point to a word in the input and copy it straight to the output. This produces more fluent and understandable summaries without the <unk> tags that almost always create confusion in other models' output. The coverage mechanism, on the other hand, deals with the repetition problem. Summaries generated by common summarization models tend to contain ideas and sentences that are repeated multiple times. The coverage mechanism counters this by letting the attention layer base its decisions on the attention it has paid in previous steps. By incorporating that history into each new decision, the model is discouraged from outputting anything it has already covered, which largely solves the repetition problem.
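To make the two mechanisms concrete, here is a toy sketch of the math in plain Python (the function names, variable names, and the tiny example are my own illustrations, not code from the paper):

```python
def final_distribution(p_gen, vocab_dist, attention, source_words):
    """Pointer mechanism: mix the generator's vocabulary distribution
    with the copy distribution given by attention over the source:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention mass on w).
    An OOV source word gets copy probability of its own, so no <unk>."""
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for attn, word in zip(attention, source_words):
        final[word] = final.get(word, 0.0) + (1 - p_gen) * attn
    return final

def coverage_penalty(attentions):
    """Coverage mechanism: keep a running sum of past attention
    (the coverage vector) and penalize re-attending to positions
    that are already covered, via sum(min(a_t, c_t)) at each step."""
    coverage = [0.0] * len(attentions[0])
    penalty = 0.0
    for attn in attentions:
        penalty += sum(min(a, c) for a, c in zip(attn, coverage))
        coverage = [c + a for c, a in zip(coverage, attn)]
    return penalty

# "zyzzyva" is OOV for the generator, but the pointer can still copy it
# from the source, so it appears in the final distribution.
dist = final_distribution(
    p_gen=0.6,
    vocab_dist={"the": 0.5, "insect": 0.3, "ate": 0.2},
    attention=[0.1, 0.7, 0.2],
    source_words=["the", "zyzzyva", "ate"],
)
```

The key point is visible in the toy example: the OOV word ends up with real probability mass from the copy side, and the coverage penalty is zero only when each step attends to source positions that earlier steps ignored.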

A problem I've noticed in the Google/Stanford model, and that other models also face, is pronouns. Many of the generated summaries use pronouns without first introducing the entity, or use them in ways that leave the reader unsure what object the pronoun is referring to. This problem is called reference (or coreference) resolution, and it is a common challenge for many Natural Language Processing models.

Before going deeper into that problem, I've been trying to run the Google/Stanford model on my own system. Because the code available on GitHub is outdated (Python 2), I had to use someone's fork that had been converted to Python 3. Another repo, with the code that tokenizes the training data, was also outdated, so I had to find a fork of it as well and research the framework it uses (Stanford CoreNLP), since the fork didn't solve every problem. Once the code ran, I had to tinker with the parameters, mostly decreasing max_enc_steps and max_dec_steps, which cap the number of tokens used for encoding and decoding. Although I have a GTX 1080, it still hit Out of Memory errors with the default parameters (Google trained on a much better system). Decreasing the steps a little (max_enc_steps from 400 to 300; max_dec_steps from 100 to 75) let the code run without exhausting graphics memory. Each step took about 0.75 seconds, and training took about four days to complete. There were multiple times when the program stopped with an error saying the loss was infinite, but that is just a bug that comes up here and there, and thanks to the checkpoints saved throughout training I was able to recover my place quickly and resume.
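As a sanity check on that training budget, the step time and wall-clock time roughly line up (a back-of-the-envelope calculation; the exact step count is my own estimate, since I only logged approximate figures):

```python
# Rough training-budget arithmetic: at ~0.75 s per step, four days of
# wall-clock time corresponds to roughly 460k training steps.
seconds_per_step = 0.75
seconds_per_day = 24 * 60 * 60
steps_in_four_days = int(4 * seconds_per_day / seconds_per_step)
```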

Going back to reference resolution: I was able to find a lot of research on the topic, but the background knowledge required to understand it is substantial. However, I found two related papers (Neural Coreference Resolution and Improving Coreference Resolution by Learning Entity-Level Distributed Representations) that combine machine learning with reference resolution, so I am going to see what I can learn from them and whether I can add a layer or two, or even another pipeline, to the Google/Stanford model so that it produces summaries with no confusing pronouns. I am also planning to compare three different outputs: one from the regular model I currently have, one with pre-processed data, and one with post-processed data. The current model's output will contain a few pronoun errors, while the pre-processed and post-processed versions will likely have no unresolved pronouns at all.
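The three-way comparison could be wired up roughly like this (a sketch only: resolve_pronouns and summarize are hypothetical stand-ins, not real library calls; a real version would plug in an actual coreference resolver and the trained model):

```python
def resolve_pronouns(text):
    # Stand-in for a real coreference resolver: replace each pronoun
    # with its antecedent. A toy lookup fakes the resolution here.
    toy_antecedents = {"He": "Obama", "he": "Obama"}
    return " ".join(toy_antecedents.get(tok, tok) for tok in text.split())

def summarize(text):
    # Stand-in for the trained pointer-generator summarization model.
    return text

article = "Obama spoke on Tuesday . He thanked the crowd ."

baseline = summarize(article)                # may contain unresolved pronouns
pre = summarize(resolve_pronouns(article))   # resolve pronouns before summarizing
post = resolve_pronouns(summarize(article))  # resolve pronouns in the summary itself
```

The interesting trade-off to measure is pre- vs. post-processing: resolving before summarization gives the model cleaner input, while resolving afterward only has to fix the (much shorter) summary.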
