Why can paragraph detection help improve the user experience?

Text is one of the major points of attention for the reader and user of software applications. It is of primary importance to pay attention to its readability when analyzing the UI-UX of an interface. The font and the text color are of interest for accessibility aspects and quite well-known in the web community but there also exist specifications to ease the text paragraph readability on a screen.

The first part of this blog focuses on this aspect:

  • Recall the reasons for having a good accessibility on textual paragraphs.
  • Present the guidelines based on the norm ISO 9241-300.

The second part of this blog addresses the issue of evaluating the accessibility standards of a textual paragraph. We speak about:

  • the importance of using a screenshot instead of the HTML/CSS structure,
  • the history of text detection techniques on images,
  • the explanation of techniques coming from computer vision for automatic paragraph detection.


The readability of a text has two aspects. The first aspect, related to visual perception, is linked to the ability to distinguish the shapes that characterize the letters and their combinations in words, sentences and paragraphs. The second aspect is related to the simplicity of understanding its content. The first will either help or hinder the second: a difficulty in distinguishing characters and words will make it difficult for the reader to understand the meaning of what he/she is reading. 

The good news is that each of these two aspects is the subject of studies and standards describing rules and good practices which, when respected, contribute to greater ease of reading (1st aspect) and better understanding (2nd aspect). 

In particular, ease of reading is governed by ISO standards, which promote the readability and accessibility of a software interface and its visual comfort [ref]. 

These factors of good readability can be partly analyzed in an automatic way and revealed to the interface designers. 

Text paragraph and accessibility guidelines

The font plays an important role in the design of the human-computer interface. Several good practices are to be taken into account when textual data is used as information support (newspapers, articles, information etc.). 

In addition to the classical rules related to the color and size of the font, the following rules and standards improve reading speed and reduce the fatigue or effort of reading on screen:

  • Use dark letters on a light background. This choice guarantees an optimal contrast between the text and its support, as well as minimal visual fatigue thanks to the use of a light background.
  • Prefer a few long lines to many short ones. Lines that are too short slow down reading and make it tiring, since they require more eye movements. Reading is faster when a line contains more than 26 characters. The recommendation is to use between 50 and 55 characters per line, or 30 to 35 in double columns.
  • Use white space in the layout. Spacing makes it easier to read. It is recommended to insert a blank line every 5 lines and leave a space between the columns.

The ISO 9241-3 standard defines the exact constraints on fonts so that texts can be read effortlessly on the screen. The table below summarizes the accessibility guidelines for a text paragraph and the associated font:

Feature of the paragraph | Readability constraint | Illustration
Minimum x-height of a character (xh) according to reading distance (d) | xh > 2.8 mm (d = 50 cm); xh > 3.5 mm (d = 60 cm); xh > 4.0 mm (d = 70 cm) | ref: MF
Space between lines (sp) | h < sp < 1.5h |
Number of characters in a line (nb) | 50 < nb < 55 (1 column); 30 < nb < 35 (2 columns) |

Feature of the font | Readability constraint
Font [ref] | Sans serif font (illustration: sans serif vs. serif)
Italic font | To avoid. Maximum inclination 45°
Line thickness (thk) | h/12 < thk < h/6
Character width (w) | 0.7 < w < 0.9
Space between characters (sp) | sp > thk
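
To give an idea of how the paragraph constraints of the table could be checked once the measurements are available, here is a minimal Python sketch. The `Paragraph` class, its field names and the `check_paragraph` function are hypothetical names chosen for illustration, and `h` is read here as the character height, which is an assumption since the table does not define it.

```python
from dataclasses import dataclass

# Minimum x-height (mm) for each reading distance (cm), as listed in the table.
MIN_X_HEIGHT_MM = {50: 2.8, 60: 3.5, 70: 4.0}


@dataclass
class Paragraph:
    chars_per_line: float  # nb: average number of characters per line
    line_spacing: float    # sp: space between lines, same unit as char_height
    char_height: float     # h: assumed to be the character height
    x_height_mm: float     # xh: x-height of the font, in millimetres
    columns: int = 1


def check_paragraph(p, distance_cm=50):
    """Return the names of the table constraints that the paragraph violates."""
    issues = []
    low, high = (50, 55) if p.columns == 1 else (30, 35)
    if not low < p.chars_per_line < high:
        issues.append("characters per line")
    if not p.char_height < p.line_spacing < 1.5 * p.char_height:
        issues.append("line spacing")
    if not p.x_height_mm > MIN_X_HEIGHT_MM[distance_cm]:
        issues.append("x-height")
    return issues
```

An empty list means that the three paragraph constraints are satisfied. The font constraints of the second table could be added in the same way.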

Why use images ?

Maybe the most straightforward way to evaluate the text of a website is to work on the structure of the web page, based on the HTML/CSS. This would make it possible to check compliance with the accessibility guidelines on text colors and font size. However, this code analysis technique guarantees neither the rendering of the text paragraph on different kinds and sizes of screens nor its location on the screen. Therefore, the paragraph rules cannot be evaluated against the accessibility guidelines (see previous table).

Let us note that some attempts have been made to produce a segmentation of the page from the html code [0], but it remains a complicated task. 

Moreover, the development of an HCI is not only website based. There is no international consensus for the creation of software interfaces or mobile applications as opposed to websites and the use of html/css.

In light of these limitations, the use of screenshots of the page rendering seems to be a good option to evaluate the accessibility of a text paragraph.

Screenshots and computer vision

So, how can a paragraph be detected automatically from a screenshot of an application, a website or a piece of software?

The recipe is quite simple: take a screenshot of the screen and apply some computer vision techniques to it.

Let us note that the visual analysis of HCI has certain particular features (compared to other fields) which make the analysis of the image easier. In particular:

  • noise, glare and differences in lighting are absent,
  • the angle of view is perfect,
  • there is no need to consider movement, and
  • there is no notion of perspective in the image.

All these aspects ease the analysis of the image and make traditional computer vision techniques a natural choice for paragraph detection.

Technical evolution of text detection through paragraph detection

Technically, how can a paragraph be detected in an image? Actually, no specific research has been published on this question for now, as it has not been a clear objective.

Let’s contextualize its history before explaining different approaches.

Timeline for text detection and its application

The idea of a reading machine dates back to the middle of the 19th century. Nonetheless, the first efficient machine able to detect printed text arrived in 1929, when Gustav Tauschek invented a mechanical machine that highlighted some specific words on printed paper. The first commercial Optical Character Recognition (OCR) machine was developed in 1951, mainly by David Shepard, in his free time. This machine, called “Gismo”, converted printed messages into machine language, reading letter by letter. It was used by banks to read checks, at a rate of 100 checks per minute. Since then, many OCR systems have been marketed, including by IBM. Every generation attempts to outperform the previous one by covering a wider range of fonts and contexts on the one hand, and by reducing the number of false positives on the other. At the beginning of the 21st century, much research was still being published on the detection of text areas in images, to eliminate non-textual zones before applying OCR. I will go into more detail about these text detection methods later.
OCR software was mainly licensed, marketed for specific uses. This changed radically when HP open-sourced the Tesseract research program in 2005. It has mainly been developed by Google since then and is still one of the most common OCR engines. It democratised text extraction for data analysis, fostering the extraction of word meaning and, more generally, Natural Language Processing (NLP).
The original objective of digitizing text has been largely surpassed, leaving room for more and more intelligent reading. Thanks to the increase in computing resources and the rise of neural networks, there has been a new wave of publications on text detection. For instance, here are three recent examples of applications that combine text detection and NLP:

  • Facebook has developed Rosetta which allows, among other things, Internet memes to be contextualized with their text. Using the NLP semantics of the text in an image to improve its classification is called “image captioning”. How can it be useful? A well-known example is the following meme of baby Sammy Griner tasting sand, originally called “I hate sandcastles”. The same photo has been used in really different contexts, and the image does not illustrate the same idea at all:
  • The “Amazon Rekognition” text recognition tool is an online service able to detect and recognize objects, people, celebrities, … in images, but also in videos. It detects text efficiently, even if the font is unusual and the image is complex. This deep-learning based program is really easy to integrate into an application, but requires using the AWS API, that is, the Amazon development platform.

  • The third example concerns Apple: the company offers an implementation of “text-recognition” in the “Vision” framework, to extract text in photos in real time. 

To complete this overview of the GAFAM, note that Google and Microsoft also offer their own OCR, included in their APIs, respectively Cloud Vision API and Read API.

These tools seem to have a common approach: 1) detect the letters and run OCR with Tesseract or an improved version of it, 2) classify texts that talk about the same ideas, using Natural Language Processing. The algorithms seem to perform very well, but they go much further than what we are interested in, which is to detect a paragraph. It really seems like bringing in an elephant to kill a mouse. The computational cost of training is surely really high, and using these tools means relying on proprietary software and on a biased training set.

Text and paragraph detection in practice

Now that the evolution of text detection is clearer, let’s go back to our technical issue, that is, detecting a paragraph in images. Before going further, it is necessary to define the terms more clearly. First, there are two types of text in images:

  • caption text = all the text that is added onto an image afterwards. I will expand this definition a bit: it is the digital text that is generated as a print on the image. So when you look at a screenshot of a website, the vast majority of text is of this type.
  • scene text = all the characters included in the image, like on traffic signs or on storefronts. Most of the research concerns this type of text [1] [2]. And this article [3] gives a clear explanation: “Scene texts overlap with the background therefore scene text detection and extraction are difficult as compared to the detection of caption text.”

It is easy to detect caption text, whereas detecting scene text is still a rich research topic. Let’s give an idea about how text detection in images can be computed.

At the end of the 1990s, two different types of methods emerged to detect text in images. One uses edges whereas the other is based on texture. Since 2015, the use of deep learning methods for text detection has been widely developed. Let’s give a small insight into these different methods.

Text detection with edge 

The first category is called “edge detection based” or “connected-component analysis-based” methods. Edge detection is a very rich topic in computer vision. It consists in detecting the strong differences between consecutive pixels.

Connected component is just a big word for a one-piece section:

In this image from bibmath (fr), A is a connected component and B is not, as it has 2 pieces. 

For text-detection, the first step of these methods consists in extracting the edges in the images. For that, numerous methods of edge detection exist but the simplest example is gradient filtering.
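
As an illustration of gradient filtering, here is a minimal sketch in plain Python, assuming the image is a 2D list of grey levels. The function names `gradient_magnitude` and `edge_mask` and the `threshold` parameter are illustrative choices, and central differences are only one of several possible gradient filters.

```python
def gradient_magnitude(img):
    """Approximate the gradient norm with central differences.

    img is a 2D list of grey levels; border pixels are left at 0.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y][x + 1] - img[y][x - 1]) / 2  # horizontal variation
            gy = (img[y + 1][x] - img[y - 1][x]) / 2  # vertical variation
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


def edge_mask(img, threshold):
    """Binary mask: 1 where the gradient is sharper than the threshold."""
    return [[1 if g > threshold else 0 for g in row]
            for row in gradient_magnitude(img)]
```

On a screenshot, the pixels on the outline of each letter get a high gradient value, so the mask mostly keeps the contours of the characters.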

All these methods have threshold parameters, corresponding to the sharpness of the rupture between pixels, and to the number of neighboring pixels that are taken into account. In this illustration, the considered neighborhood is a square with a larger and larger side. As this parameter grows, the breaks are better detected and the outlines of the letters stand out more, thus the area around the letters/words gets bigger, until it overlaps the area of a neighboring letter/word.

Then, those pseudo-images are segmented into one-piece parts, the connected components. As the area increases, neighboring components merge and their number decreases, until the right threshold is reached, that is, when text zones are detected.

{from [4]}
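
The segmentation into connected components can be sketched as follows, assuming the edge image has already been turned into a binary mask (a 2D list of 0 and 1). The function name `connected_components` and the choice of 4-connectivity are assumptions made for this example.

```python
from collections import deque


def connected_components(mask):
    """Return one bounding box (x0, y0, x1, y1) per 4-connected group of 1-pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first search starting from an unvisited 1-pixel.
            seen[sy][sx] = True
            queue = deque([(sx, sy)])
            x0 = x1 = sx
            y0 = y1 = sy
            while queue:
                x, y = queue.popleft()
                x0, x1 = min(x0, x), max(x1, x)
                y0, y1 = min(y0, y), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Each returned box is a candidate zone that the second step can then keep or reject.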

When the size is optimal, a second step discriminates text areas from non-text areas, by analysing the specific structures of text. It uses for example the very particular shape of the packets of letters, with a strong contrast-ratio between the text and the background.

Text detection with texture

The second category of text-detection methods is based on texture [5]. The idea is to use the textural characteristics of text zones, assuming that “(1) Text possesses a certain frequency and orientation information; (2) Text shows spatial cohesion – characters of the same text string are of similar heights, orientation and spacing.”, to quote this article [6]. To reveal the textures, the image is binarized and passed through a filter such as Gabor [7] or Gaussian derivative.

Just to go a little further, let’s illustrate what the Gabor filter does. It filters lines with a certain rotation and a certain width. The Gabor filter can be used in many more cases than the one presented; I just want to give an intuition of what texture detection can be. Keep in mind that it can tackle more complex shapes than lines, and that there are other parameters to choose.
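
For readers who want to experiment, here is a minimal sketch of a Gabor kernel (real part) written in plain Python. Apart from `k_size`, which reuses the name of the parameter mentioned in this article, the parameter names (`sigma`, `theta`, `wavelength`, `gamma`, `psi`) and the helper `filter_response` are assumptions made for this example, and the sketch assumes an odd `k_size`.

```python
import math


def gabor_kernel(k_size, sigma, theta, wavelength, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter, as a k_size x k_size list of lists (odd k_size)."""
    half = k_size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope multiplied by a cosine wave along xr.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_response(img, kernel):
    """Correlate a 2D list image with the kernel; borders are left at 0."""
    n = len(kernel)
    half = n // 2
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(half, h - half):
        for x in range(half, w - half):
            out[y][x] = sum(kernel[j][i] * img[y + j - half][x + i - half]
                            for j in range(n) for i in range(n))
    return out
```

With `theta = 0` the cosine wave varies along the horizontal axis, so the filter responds mostly to vertical strokes whose spacing is close to `wavelength`. Changing `theta` rotates the strokes that are selected.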

Let’s first try the filter on this single word image and then on a mobile website screenshot.

When the image has a part where such a line is recognized, it darkens the result. The width parameter corresponds to the number of neighboring pixels taken into account, that is, to filtering with the following patterns:

As the width of the filter decreases, more parts of the letters fit the line and are kept. The first line corresponds to the zones masked by the filter. In the second line, the pink mask is overlaid on the image. The third line represents the different filters. On a global image these Gabor filters give:

So, this first parameter allows control of the thickness of the detected lines.

Then, still with the same examples, and fixing the previous parameter to “k_size = 30”, let’s change the rotation, that is, filter with the following patterns:

The letters are not recognizable through these filters, but the horizontal and vertical lines make it possible to pick rectangle candidates for text zones.

The generic name of this operation is “Texture segmentation”. Authors then apply different heuristics to select good candidates for text zones, for example relying on the line structure. These heuristics are often hard-coded, but some authors learn them from datasets. The main problem of these methods is that they are really sensitive to font type and font size, but they are more robust to complex backgrounds.

Once the text zone is detected, it is passed through an OCR that translates the text into machine-readable text. The presented detection methods were mostly introduced as a preprocess to reduce the number of false positives and get the scene texts.

Text detection with deep learning

Although these geometric approaches are tangible, nowadays, deep neural networks outperform them. 

Many networks have been trained to recognize people and animal species, but also objects, notably on the COCO dataset. Generally, the two main constraints of deep learning are the access to a sufficiently large set of training images, and the computational capacity.

For the first point, a lot of work has been done to annotate text areas in images. For example, in 2019, researchers at IBM used the source code of PDF files, matched with their visual rendering, to label the text areas of 360,000 articles [8]. This initiative was taken up in 2021 by researchers at Microsoft, who used the source code of .doc documents [9].

From [9]

Thus, databases exist to detect text areas. However, these training data can be biased. For example, in scientific articles, the background is neutral, the font homogeneous, the text structured in one or two columns. This is for example not at all the case on websites or mobile applications.

Secondly, the computational capacity is not really a constraint nowadays, even for small structures, since it is possible to rent computing clusters. Moreover, many networks have already been trained, and their weights are available under the Apache license. Note that transfer learning will surely be necessary to get correct results from trained networks, because of the bias in the input data mentioned just before.

While some networks are trained to recognize text areas, others [detect words one by one][10]. Reconstructing a paragraph from the words then becomes a geometry problem.
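
To illustrate this geometry problem, here is a minimal sketch that merges word bounding boxes into larger blocks when they are close enough. The function name `group_boxes` and the thresholds `max_dx` and `max_dy` are hypothetical, and the greedy merging strategy is only one possible heuristic, not the method of a specific paper.

```python
def group_boxes(boxes, max_dx, max_dy):
    """Greedily merge boxes (x0, y0, x1, y1) separated by small gaps.

    Two boxes are merged when their horizontal gap is at most max_dx and
    their vertical gap is at most max_dy (a negative gap means an overlap).
    """
    groups = [tuple(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                a, b = groups[i], groups[j]
                gap_x = max(a[0], b[0]) - min(a[2], b[2])
                gap_y = max(a[1], b[1]) - min(a[3], b[3])
                if gap_x <= max_dx and gap_y <= max_dy:
                    # Replace the pair by its common bounding box and restart.
                    groups[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                 max(a[2], b[2]), max(a[3], b[3]))
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups
```

In practice, `max_dx` would be of the order of a space between words and `max_dy` of the order of the space between lines, so that words first form lines and lines then form paragraphs.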

And what about paragraphs?

Going back to paragraph detection, we do not really need to run OCR on the text to achieve this, as websites are almost structured data and their text is caption text. The methods presented above can easily be adapted to detect paragraphs directly. For example, paragraphs can be detected by dilating the component zones introduced just above:

Idea of code from this source.

And, in the same way as the text detection methods, a filtering step will keep only the correct rectangles. Indeed, paragraphs can easily be detected with the same ideas as text detection, adapting the heuristics. 
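
The dilation step can be sketched in plain Python as follows, assuming the text pixels are given as a binary mask (a 2D list of 0 and 1). The function name `dilate` and the radii `rx` and `ry` are illustrative choices; an image processing library would normally be used for this on real screenshots.

```python
def dilate(mask, rx, ry):
    """Binary dilation with a (2*rx + 1) x (2*ry + 1) rectangular element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Switch on every pixel of the rectangle centred on (x, y).
                for ny in range(max(0, y - ry), min(h, y + ry + 1)):
                    for nx in range(max(0, x - rx), min(w, x + rx + 1)):
                        out[ny][nx] = 1
    return out
```

With radii slightly larger than half the space between words and between lines, the letters of a paragraph merge into a single blob, while two distinct paragraphs stay separated. A connected-component labelling of the dilated mask then gives one rectangle per paragraph candidate.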


Detecting paragraphs is a task that may be necessary to automate UX/UI evaluation, if only to check whether the text is readable. The “video/image” format, although it removes any dependency on a programming language, brings new difficulties, in particular the semantic segmentation of the image. Thus, although paragraph detection is an easy task for a human, it is challenging in terms of computer vision. We have seen how text detection has evolved over time, from detecting printed letters to using the semantics of text embedded in images, such as traffic signs or text in memes. The main ideas behind the methods used to detect scene text can then be reused to detect paragraphs.


[0] Block-o-Matic: A web page segmentation framework, Sanoja (2014)
[1] Text Detection in Natural Image by Connected Component Labeling, Zamen A. Ramadhan, Dhia Alzubaydi (2019)
[2] A Robust Algorithm for Text Detection in Images, Julinda Gllavata, Ralph Ewerth and Bernd Freisleben (2003)
[3] Detecting Text in Natural Scenes with Connected Component Clustering and Nontext Filtering, Mr. Mule S. S. and Mr. Holambe S. N. (2016)
[5] Gabor Filter Based Text Extraction From Digital Document Images, Yu-Long Qiao et al. (2006)
[6] Finding Text In Images, Victor Wu, R. Manmatha, Edward M. Riseman (1997)
[7] Text Detection in Natural Scene Images by Stroke Gabor Words, Yi Chucai and Tian Yingli (2011)
[8] PubLayNet: largest dataset ever for document layout analysis, Xu Zhong et al. (2019)
[9] LayoutReader: Pre-training of Text and Layout for Reading Order Detection, Zilong Wang et al. (2021)
[10] EAST: An Efficient and Accurate Scene Text Detector, Xinyu Zhou et al. (2017)

Personalities/timeline not cited but used to construct the timeline: 
