Posts

Showing posts with the label collection summary

2019-09-09: Introducing sumgram, a tool for generating the most frequent conjoined ngrams

Image
Comparison of top 20 (first column) bigrams, top 20 (second column) six-grams, and top 20 (third column) sumgrams (conjoined ngrams) generated by sumgram for a collection of documents about the 2014 Ebola Virus Outbreak . Proper nouns of more than two words (e.g., "centers for disease control and prevention") are split when generating bigrams, sumgram strives to remedy this. Generating six-grams surfaces non-salient six-grams. Click image to expand. A Web archive collection consists of groups of webpages that share a common topic e.g., “Ebola virus” or “Hurricane Harvey.” One of the most common tasks involved in understanding the "aboutness" of a collection is generating the top k (e.g., k = 20) ngrams. For example, given a collection about Ebola Virus , we could generate the top 20 bigrams as presented in Fig. 1. This simple operation of calculating the most frequent bigrams unveils useful bigrams that help us understand the focus of the collection, and m...