Humans are good at distinguishing content from navigational text, advertisements, related articles, and so on. They are so good at it that publishers try to disguise advertisements and links as content.
This paper analyses some techniques used for boilerplate detection and shows that it is possible to achieve a high level of accuracy using a few simple features.
Extraction levels that have external requirements are discarded, either for being computationally too expensive or for requiring separate work for each website.
Site-specific signals are avoided, as they may lead to overfitting to the layout and content of a particular subset of sites. This includes the use of CSS classes and HTML tags.
Token-based evaluation is also discarded, as it inspects the text at a topical level and may skew results towards a particular domain.
Instead we examine shallow text features, densitometric features, the absolute and relative text position, and a few heuristic features.
Some features discussed in the field of Quantitative Linguistics are average word length, average sentence length, and the absolute number of words.
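As a rough illustration of these three features, here is a minimal Python sketch. The function name `shallow_text_features`, the regex-based word tokenisation, and the sentence splitting on `.`, `!` and `?` are all assumptions made for this example; the paper may tokenise differently.

```python
import re

def shallow_text_features(text):
    """Compute three shallow features for one block of text.

    Assumptions for this sketch: a word is any run of word characters,
    and a sentence ends at one or more of '.', '!' or '?'.
    """
    words = re.findall(r"\w+", text)
    # Drop empty fragments left over after the final punctuation mark.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    num_words = len(words)
    avg_word_length = sum(len(w) for w in words) / num_words if num_words else 0.0
    avg_sentence_length = num_words / len(sentences) if sentences else 0.0

    return {
        "num_words": num_words,
        "avg_word_length": avg_word_length,
        "avg_sentence_length": avg_sentence_length,
    }

if __name__ == "__main__":
    print(shallow_text_features("The cat sat. It ran away."))
```

A navigation label such as "Home" yields one word and one very short sentence, while a paragraph of article text yields much larger values for the word count and the average sentence length.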
In Block Fusion, text blocks are classified by density and fused together when they belong to the same class.
When the text is highly segmented it is likely that content is followed by more content and template by template.
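The fusion idea can be sketched in a few lines of Python. This is a simplified illustration and not the paper's exact algorithm: density is approximated here as words per line after wrapping the block at a fixed width, and both `WRAP_WIDTH` and `DENSITY_THRESHOLD` are hypothetical values chosen for the example.

```python
import textwrap

WRAP_WIDTH = 80            # assumed wrapping width for the density measure
DENSITY_THRESHOLD = 10.0   # hypothetical cut-off between template and content

def text_density(block, width=WRAP_WIDTH):
    """Approximate density as words per wrapped line (a simplification)."""
    lines = textwrap.wrap(block, width)
    if not lines:
        return 0.0
    return len(block.split()) / len(lines)

def classify(block, threshold=DENSITY_THRESHOLD):
    """Label a block as 'content' or 'template' by its density."""
    return "content" if text_density(block) >= threshold else "template"

def fuse_blocks(blocks, threshold=DENSITY_THRESHOLD):
    """Merge runs of adjacent blocks that share the same class."""
    fused = []
    for block in blocks:
        label = classify(block, threshold)
        if fused and fused[-1][0] == label:
            # Same class as the previous block: append to it.
            fused[-1] = (label, fused[-1][1] + "\n" + block)
        else:
            fused.append((label, block))
    return fused

if __name__ == "__main__":
    paragraph = " ".join(["word"] * 40)
    for label, text in fuse_blocks(["Home", "About us", paragraph, paragraph, "Contact"]):
        print(label, len(text.split()))
```

Fusing only adjacent blocks of the same class reflects the observation that content tends to be followed by more content, and template by more template.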
The heuristic features examined are the number of words that start with a capital letter, the number of words written entirely in capital letters, the ratio of each to the absolute number of words, and link density.
Average word length and average sentence length indicate the complexity of statements and may point to the use of grammatical constructs and longer explanations.
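A minimal Python sketch of these heuristic features follows. The function name and its inputs are assumptions for illustration: `words` is an already tokenised block, and `num_linked_words` is the number of those words that sit inside a link, which is assumed to have been counted while parsing the HTML.

```python
def heuristic_features(words, num_linked_words=0):
    """Compute capitalisation and link features for one block.

    `words` is a list of tokens; `num_linked_words` is how many of them
    appear inside a link (assumed to be counted elsewhere).
    """
    total = len(words)
    # Words starting with a capital letter (this includes all-caps words).
    capitalised = sum(1 for w in words if w[:1].isupper())
    # Words whose cased characters are all upper case.
    all_caps = sum(1 for w in words if w.isupper())

    def ratio(count):
        return count / total if total else 0.0

    return {
        "capitalised": capitalised,
        "all_caps": all_caps,
        "capitalised_ratio": ratio(capitalised),
        "all_caps_ratio": ratio(all_caps),
        "link_density": ratio(num_linked_words),
    }

if __name__ == "__main__":
    print(heuristic_features(["HOME", "About", "Us", "contact"], num_linked_words=4))
```

A navigation bar typically scores high on both the capitalised ratio and link density, whereas running article text scores low on both.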
Long text is assumed to have a descriptive nature as it supplies the reader with the subject matter’s details at the cost of higher syntactic complexity.
Short text is then assumed to be grammatically incomplete or simple, consisting of only a few words. It is used for quick and economical coding when the audience is expected to perceive and understand the information without much effort. Such text is used for headlines and navigational text and is therefore regarded as functional.
A strong correlation is then noted between long text and content, and between short text and template.
It is observed that the use of block fusion closely resembles manual segmentation.
Some of the heuristic assumptions point to a high rate of capitalised words in headlines and navigational text, and a high rate of linked words in navigational text and related-links sections.
If this interests you, go read the full paper on Boilerplate Detection using Shallow Text Features.