Posts

Showing posts with the label sumgram

2021-01-20: 366 dots in 2020 - top news stories of 2020

Fig. 1: 366 dots in 2020 - top news stories for the 366 days of 2020. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all 144 story graphs for a given day. The x-axis represents time; the y-axis represents the average degree of the GCC. The annotations (and legend), represented by colored dots, were assigned semi-automatically.

I join the chorus in saying that 2020 was a year like no other, shaped by three historic events: the Coronavirus pandemic, the protests surrounding the Black Lives Matter movement, and the US Presidential elections. According to StoryGraph, the top news story of 2018 was the Kavanaugh hearings. In 2019, it was the Mueller Report. As we did for 2018 and 2019, we analyzed all news stories collected by StoryGraph at 10-minute intervals every day in 2020 to identify the top news stories of 2020. Recall how we identify top news stories, explained briefly in 365 dots in 201...
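The quantity behind each dot can be sketched in a few lines of Python. This is a minimal illustration, not StoryGraph's actual code: it assumes each story graph is a simple undirected graph given as a list of node pairs, it takes the GCC to be the connected component with the most nodes, and the function names `gcc_average_degree` and `top_dot_for_day` are hypothetical. Collecting a graph every 10 minutes yields 24 × 6 = 144 graphs per day, so the daily dot would be the maximum over those 144 values.

```python
from collections import defaultdict


def gcc_average_degree(edges):
    """Average degree of the largest connected component of an undirected graph."""
    # Build an adjacency map; sets make duplicate edges harmless.
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    best_component, seen = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > len(best_component):
            best_component = component

    if not best_component:
        return 0.0
    # Average degree = sum of node degrees / number of nodes in the component.
    return sum(len(adjacency[n]) for n in best_component) / len(best_component)


def top_dot_for_day(story_graphs):
    """Largest GCC average degree across one day's story graphs."""
    return max((gcc_average_degree(edges) for edges in story_graphs), default=0.0)


# Toy data standing in for a day's graphs (assumed, not real StoryGraph output).
day = [
    [("a", "b"), ("b", "c"), ("a", "c"), ("x", "y")],  # triangle plus a stray edge
    [("p", "q"), ("q", "r")],                           # three-node path
]
print(top_dot_for_day(day))
```

In this toy example the triangle is the GCC of the first graph, and every node in it has degree 2, so it outranks the three-node path in the second graph.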

2020-01-04: 365 dots in 2019 - top news stories of 2019

Fig. 1: 365 dots in 2019 - news stories for the 365 days of 2019. Each dot represents the average degree of the Giant Connected Component (GCC) with the largest average degree across all 144 story graphs for a given day. The x-axis represents time; the y-axis represents the average degree of the GCC.

In March 2019 I published "365 dots in 2018", where I presented the top stories for each day in 2018 according to StoryGraph. Now that 2019 is over, it is natural to ask: what were the top news stories of 2019? News organizations often publish "the year's top stories" or "year in review" lists (e.g., CNN, CBS, FoxNews), but the selection criteria are not always made explicit. The closest thing to a selection criterion I have seen from news organizations is the presentation of their most viewed (or most popular) news stories. But this criterion is not accessible to ordinary users who cannot access the private traffic sta...

2019-09-09: Introducing sumgram, a tool for generating the most frequent conjoined ngrams

Comparison of the top 20 bigrams (first column), top 20 six-grams (second column), and top 20 sumgrams (conjoined ngrams, third column) generated by sumgram for a collection of documents about the 2014 Ebola Virus Outbreak. Proper nouns of more than two words (e.g., "centers for disease control and prevention") are split when generating bigrams; sumgram strives to remedy this. Generating six-grams surfaces non-salient six-grams.

A Web archive collection consists of groups of webpages that share a common topic, e.g., “Ebola virus” or “Hurricane Harvey.” One of the most common tasks involved in understanding the "aboutness" of a collection is generating the top k (e.g., k = 20) ngrams. For example, given a collection about Ebola Virus, we could generate the top 20 bigrams as presented in Fig. 1. This simple operation of calculating the most frequent bigrams unveils useful bigrams that help us understand the focus of the collection, and m...
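The baseline top-k ngram operation described above can be sketched as follows. This is a rough illustration only, not sumgram's implementation: the function name `top_k_ngrams` is hypothetical, tokenization is a naive lowercase regex split, no stopwords are removed, and raw occurrences are counted across all documents.

```python
import re
from collections import Counter


def top_k_ngrams(documents, n=2, k=20):
    """Return the k most frequent n-grams across a list of document strings."""
    counts = Counter()
    for text in documents:
        # Naive tokenization: lowercase alphanumeric runs (an assumption).
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Slide a window of n tokens over the document.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)


# Toy documents, made up for illustration.
docs = [
    "The Centers for Disease Control and Prevention issued Ebola virus guidance.",
    "Ebola virus cases were confirmed by the Centers for Disease Control and Prevention.",
]
print(top_k_ngrams(docs, n=2, k=5))
print(top_k_ngrams(docs, n=6, k=5))
```

With n = 2, the long proper noun shows up only as fragments such as "disease control" and "control and", which is the splitting problem the caption mentions. With n = 6 it survives intact, but so do incidental six-word windows around it.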