Yahoo Releases User Interaction Data for Machine Learning

Image: Earth and server (Suresh Subbaiah, Wikimedia Commons)

14 January 2016. The online company Yahoo is releasing an extensive data set of individual user interactions with some of its popular services to the academic community as raw material for studies of machine learning. The de-identified data sets will be part of Yahoo’s Webscope reference library offered to academic researchers.

The data sets cover individual interactions with Yahoo’s news, sports, finance, movies, and real estate sections, as well as its home page. The collection, says the company, has some 110 billion items accessed by 20 million users from February to May 2015. The entire uncompressed file is estimated at 13.5 terabytes.
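For a rough sense of scale, the reported totals can be turned into per-item and per-user averages. The short Python sketch below uses only the figures quoted above; it assumes decimal terabytes (10^12 bytes) and treats the 110 billion figure as a count of individual records, neither of which the company's statement spells out. The variable names are illustrative.

```python
# Back-of-the-envelope averages from the figures reported in the article.
# Assumptions (not confirmed by the source): 1 terabyte = 10**12 bytes,
# and the "110 billion items" figure counts individual records.
TOTAL_BYTES = 13.5e12   # 13.5 terabytes, uncompressed
ITEMS = 110e9           # about 110 billion items
USERS = 20e6            # about 20 million users

bytes_per_item = TOTAL_BYTES / ITEMS
items_per_user = ITEMS / USERS

print(f"Average size per item: {bytes_per_item:.0f} bytes")
print(f"Average items per user: {items_per_user:.0f}")
```

These are averages only; the article gives no information on how records or activity are distributed across users.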

Items in the data set are identified by title, summary, and key phrases. Data on individuals accessing those items give their gender, age range, and generalized geographic location. Each interaction record shows the user’s local date and time, along with some data about the device used.

Yahoo’s Webscope program provides data sets for academic researchers and students covering computer systems, languages, images, graph and social data, ratings and classification data, as well as competition, advertising, and marketing data. The Webscope databases are part of Yahoo Labs, which conducts research in a range of fields related to the company’s business and services, including advertising, computer science, information and knowledge management, human-computer interactions, and machine learning.

In a statement, the company cites computer scientists planning to use the data sets. Gert Lanckriet at the University of California, San Diego, says, “Access to data sets of this size is essential to design and develop machine learning algorithms and technology that scales to truly ‘big’ data.”

Tom Mitchell at Carnegie Mellon University adds, “Academic researchers everywhere will finally have access to realistic scale data to study how to automatically discover which news articles are of interest to which users, and will be able to compare their methods using this as a shared test case.”
