Posts

Showing posts with the label Indexing

2022-07-22: Summary of "Web Archiving and Search Personalized"

The Web Archiving and Search Personalized system automatically captures, archives, and indexes pages for both full-text search and replay. (Source: Kiesel et al., Figure 1a)

According to a 2007 study by Teevan et al., 39% of search queries represent users trying to re-find previously viewed pages [1]. One approach to supporting users in this task is automatic personal web archiving: each page the user visits is saved so that it can be found later, much like an automated version of the "bookmark as archive" feature in Mabe et al.’s Memento-aware browser prototype [2]. However, creating a system that can save web pages as they are viewed, index them for full-text search, and replay them later is an ambitious goal. Johannes Kiesel (@KieselJohannes), Arjen P. de Vries (@arjenpdevries), Matthias Hagen (@matthias_hagen), Benno Stein (@bennostein), and Martin Potthast (@martinpotthast) created a prototype system for this purpose in their paper “Web Arc...

2022-05-03: Summary of "Temporal Search in Web Archives"

In this figure, nine document versions (grey) are shown ordered with respect to time. The versions have different term frequencies, represented by the score axis. The nine original versions are coalesced into three versions (black) in the inverted index. (Source: Berberich, Figure 3.3)

According to a user survey conducted by the National Library of the Netherlands in 2007, full-text search is the top requested feature for web archives. Some web archives allow the public to perform full-text searches, while others only allow the public to search by website address. For web archives that have implemented full-text search, every version of a document may be indexed, regardless of the similarity between consecutive document versions. In “Temporal Search in Web Archives” (2010), Berberich develops time-travel text search to improve searching in web archives. What extensions to existing architecture will support full-text search with respect to time across vers...

2016-05-31: Can I find this story? API: Yes, Google: Maybe, Native Search: No

A story on Storify titled "Lecture on Academic Freedom" (capture date: 2016-05-31)
The story on Storify titled "Lecture on Academic Freedom" could not be found on Google (capture date: 2016-05-31)
The story on Storify titled "Lecture on Academic Freedom" could not be found on Storify native search (capture date: 2016-05-31)

Part of our research (funded by IMLS) to build collections for stories or events involves exploring content curation sites like Storify to determine whether they hold quality (newsworthy, timely, etc.) content. Storify is a social network service used to create stories that consist of text and multimedia content, as well as content from other social media sites like Twitter, Facebook, and Instagram. Our exploration involved collecting stories from Storify over a period of time in order to manually inspect them and determine their newsworthiness. This exploration was dual-natured: we col...

2014-09-25: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

The Internet Archive (IA) and Open Library offer over 6 million fully accessible public domain eBooks. While casually browsing the scanned book collection, I searched for the term "dictionary" to see how many dictionaries they have. I found several dictionaries in various languages. I randomly picked A Dictionary of the English Language (1828) - Samuel Johnson, John Walker, Robert S. Jameson from the search results. I opened the dictionary in fullscreen mode using IA's open-source online BookReader application. This book reader application has common tools for browsing an image-based book, such as flipping pages, seeking a page, zooming, and changing the layout. Its toolbar also has some interesting features, like reading aloud and full-text searching. I wondered how it could possibly perform text searching and read aloud a scanned, raster-image-based book. I peeked into the page source code, which pointed me to some documentation pages. I realized it is using ...