There’s a thing in journalism now where news is very often reframed in terms of personal anecdote and/or hot take. In an effort to have something new and clickable to say, we reach for the easiest, closest thing at hand, which is, well, ourselves—our opinions and experiences.
I worry about this a lot! I do it (and am doing it right now), and I think it’s not always for ill. But in a broader sense it’s worth wondering to what degree the news feed is being diluted by stories that are not “content dense.” That is, what’s the real ratio between signal and noise, objectively speaking? To start, we’d need a reasonably objective metric of content density and a reasonably objective mechanism for evaluating news stories in terms of that metric. Read More
Daily Archives: May 8, 2019
Detecting Content-dense News Texts
Combining Lexical and Syntactic Features for Detecting Content-dense Texts in News
Content-dense news reports convey important factual information about an event in a direct, succinct manner. Information-seeking applications such as information extraction, question answering, and summarization normally assume that all the text they deal with is content-dense. Here we empirically test this assumption on news articles from the business, U.S. international relations, sports, and science journalism domains. Our findings clearly indicate that about half of the news texts in our study are in fact not content-dense, and motivate the development of a supervised content-density detector. We heuristically label a large training corpus for the task and train a two-layer classification model based on lexical and unlexicalized syntactic features. On manually annotated data, we compare the performance of domain-specific classifiers, trained on data only from a given news domain, with a general classifier in which data from all four domains is pooled together. Our annotation and prediction experiments demonstrate that the concept of content density varies depending on the domain, and that naive annotators provide judgements biased toward the stereotypical domain label. Domain-specific classifiers are more accurate for domains in which content-dense texts are typically fewer. Domain-independent classifiers better reproduce naive crowdsourced judgements. Classification accuracy is high across all conditions, around 80%. Read More
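To give a feel for what “lexical features” for content density might look like, here is a minimal, hypothetical sketch. It is not the paper’s model (which is a supervised two-layer classifier combining lexical and unlexicalized syntactic features); it just computes a few hand-picked lexical cues—digit tokens, capitalized mid-sentence tokens, and attribution verbs—and thresholds their sum. The feature names and threshold are illustrative assumptions.

```python
# Hypothetical lexical-cue sketch of content-density scoring.
# NOT the paper's trained model; a toy illustration of lexical features.
import re

# Assumed cue list: verbs that often mark factual reporting.
ATTRIBUTION = {"said", "reported", "announced", "according"}

def lexical_features(text: str) -> dict:
    """Compute a few simple lexical cue ratios over word tokens."""
    tokens = re.findall(r"\w+", text)
    n = max(len(tokens), 1)
    return {
        # Numbers often signal concrete factual content.
        "digit_ratio": sum(t.isdigit() for t in tokens) / n,
        # Capitalized tokens after the first word approximate named entities.
        "cap_ratio": sum(t[0].isupper() for t in tokens[1:]) / n,
        # Attribution verbs mark direct reporting.
        "attribution": sum(t.lower() in ATTRIBUTION for t in tokens) / n,
    }

def is_content_dense(text: str, threshold: float = 0.05) -> bool:
    """Threshold the summed cues; the 0.05 cutoff is an arbitrary assumption."""
    f = lexical_features(text)
    return f["digit_ratio"] + f["cap_ratio"] + f["attribution"] >= threshold

print(is_content_dense("Apple reported revenue of 91 billion dollars, said CEO Tim Cook."))
# -> True
print(is_content_dense("it was a quiet afternoon and nothing much happened at all"))
# -> False
```

A real detector would replace the hand-set threshold with a classifier trained on heuristically labeled articles, and would add the syntactic features the abstract describes.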
Detecting (Un)Important Content for Single-Document News Summarization
We present a robust approach for detecting intrinsic sentence importance in news, by training on two corpora of document-summary pairs. When used for single-document summarization, our approach, combined with the “beginning of document” heuristic, outperforms a state-of-the-art summarizer and the beginning-of-article baseline in both automatic and manual evaluations. These results represent an important advance because, in the absence of cross-document repetition, single-document summarizers for news have not been able to consistently outperform the strong beginning-of-article baseline. Read More
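The combination the abstract describes—a learned sentence-importance score blended with a position prior favoring the beginning of the document—can be sketched roughly as follows. This is a stand-in, not the paper’s method: the trained importance model is replaced here with simple word-frequency scoring, and the blend weight is an arbitrary assumption.

```python
# Hypothetical sketch: sentence importance + "beginning of document" prior.
# The paper trains its scorer on document-summary pairs; this toy version
# substitutes word-frequency scoring for the learned importance model.
from collections import Counter
import re

def summarize(sentences, k=2, position_weight=0.5):
    """Pick the k highest-scoring sentences, returned in document order."""
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    freq = Counter(w for ws in words for w in ws)
    n = len(sentences)
    scores = []
    for i, ws in enumerate(words):
        # Stand-in importance: mean corpus frequency of the sentence's words.
        content = sum(freq[w] for w in ws) / max(len(ws), 1)
        # Beginning-of-document prior: earlier sentences score higher.
        position = 1.0 - i / n
        scores.append(content + position_weight * position)
    top = sorted(sorted(range(n), key=lambda i: -scores[i])[:k])
    return [sentences[i] for i in top]

doc = [
    "The city council approved the new transit budget on Monday.",
    "The budget allocates 40 million dollars to bus service.",
    "Critics argued during the long public meeting.",
    "Weather for the meeting day was unseasonably warm.",
]
print(summarize(doc, k=2))
```

The point of the blend is visible even in the toy: the position prior keeps the lead sentence in the summary, while the content score lets a later sentence displace a weak early one—which is exactly the balance the paper tunes against the beginning-of-article baseline.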