Can artificial intelligence read paintings?

“A man on a motorbike with a dog,” read one of the first automated captions of Raphael’s painting of Saint George killing the dragon. The funny caption became the name of the Barcelona-based project Saint George on a Bike.

In cities, artificial intelligence (AI) is best known for analysing imagery from road cameras to recognise and count vehicle types, identify congestion, and predict traffic flows. Local governments also use it to build digital twins of cities to improve their policies and services. The Barcelona Supercomputing Center is expanding this range of applications by testing AI that detects objects and actions in paintings and generates descriptions of them.

Existing tools for automated image captioning use datasets based on recent photos, not artworks. “For many reasons that have to do with the variety in style, anachronisms, symbols or imaginary beings, these don’t work well on paintings,” explains Maria-Cristina Marinescu, computer scientist at the Barcelona Supercomputing Center and principal investigator of the project.

From images to paintings

Even one of the best-known computer vision datasets in use today, the Microsoft COCO dataset, can’t rely exclusively on automation. Its annotations for object detection, segmentation, and captioning were produced through crowdsourcing.

And paintings are not photos. Some depict items that are no longer in use or have changed shape over time. An old silhouette can resemble a different, far more common modern object. “For example, if you show a monk reading a book, an AI model trained over photographs will produce a caption like ‘person on the phone’,” illustrates Marinescu.

Paintings also include fictional characters, like Saint George’s dragon. “Existing models give poor results because they’re trained on a different type of data,” explains Marinescu. “But we can still reuse their knowledge as a basis for training AI for artworks, in conjunction with other techniques.”

How does it work right now?

The Barcelona team gathered 16,000 high-resolution European paintings dating from the 12th to the 18th centuries to develop its model. Before the model can learn, it needs to be taught, so the team uses manual annotation to identify objects. “Based on aligned image to annotations, you can train an AI model to caption unseen images automatically,” says Marinescu.

In short, people draw bounding boxes around each element in a painting and label them. This type of annotation trains the model to recognise objects. So, citizens also have a role in the project by contributing to identifying and explaining cultural heritage in cities.
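As a rough illustration of what one such annotation might look like as data, here is a minimal Python sketch. The class names, field names, pixel coordinates and painting identifier are all hypothetical and are not taken from the project's actual tools or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoxAnnotation:
    """One labelled rectangle drawn by an annotator (hypothetical format)."""
    label: str     # object name chosen by the annotator
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels (y grows downward)
    width: float
    height: float


@dataclass
class PaintingAnnotation:
    """All boxes drawn on a single painting."""
    painting_id: str
    boxes: List[BoxAnnotation]

    def labels(self) -> List[str]:
        # Distinct object labels present in the painting, sorted by name.
        return sorted({box.label for box in self.boxes})


# Invented example values, for illustration only.
example = PaintingAnnotation(
    painting_id="hypothetical-0001",
    boxes=[
        BoxAnnotation("knight", 120, 40, 90, 160),
        BoxAnnotation("horse", 80, 110, 220, 200),
        BoxAnnotation("dragon", 60, 300, 180, 90),
    ],
)

print(example.labels())  # ['dragon', 'horse', 'knight']
```

Pairs of image regions and labels like these are the kind of aligned data an object-recognition model can be trained on.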

Another module then generates likely relationships between the bounding boxes based on heuristics and parameters such as the boxes’ dimension and position. For example, “a bounding box for a person placed above and overlapping with a horse makes it likely that the person is riding the horse rather than feeding it if proportions are right,” explains Marinescu.
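To make the riding example concrete, the sketch below shows one way such a rule could be written in Python. It is not the project's code: the `Box` class, the `likely_riding` function and every threshold in it are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels (y grows downward)
    width: float
    height: float

    @property
    def center_y(self) -> float:
        return self.y + self.height / 2


def overlap_1d(start_a: float, len_a: float, start_b: float, len_b: float) -> float:
    """Length of the overlap between two intervals (0 if they are disjoint)."""
    return max(0.0, min(start_a + len_a, start_b + len_b) - max(start_a, start_b))


def likely_riding(person: Box, animal: Box,
                  min_x_overlap: float = 0.5,   # assumed threshold
                  min_ratio: float = 0.4,       # assumed threshold
                  max_ratio: float = 1.5) -> bool:  # assumed threshold
    """Guess whether the person is riding the animal, from geometry alone."""
    x_overlap = overlap_1d(person.x, person.width, animal.x, animal.width)
    y_overlap = overlap_1d(person.y, person.height, animal.y, animal.height)
    boxes_overlap = x_overlap > 0 and y_overlap > 0
    # "Above": the person's centre is higher in the image than the animal's.
    person_above = person.center_y < animal.center_y
    # Most of the person's width should sit over the animal.
    mostly_over = x_overlap / person.width >= min_x_overlap
    # "Proportions are right": the two heights are of comparable size.
    ratio = person.height / animal.height
    return boxes_overlap and person_above and mostly_over and min_ratio <= ratio <= max_ratio


horse = Box(x=80, y=110, width=220, height=200)
rider = Box(x=120, y=40, width=90, height=160)       # sits over the horse
bystander = Box(x=290, y=120, width=70, height=190)  # stands beside the horse

print(likely_riding(rider, horse))      # True
print(likely_riding(bystander, horse))  # False
```

A real module would presumably combine several such rules and weigh them against each other; this sketch only shows how box position, overlap and proportion can each be turned into a simple test.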

These are only the first steps toward generating descriptions in natural language (language that has evolved naturally in humans through use and repetition without conscious planning or premeditation, Ed.).

Why bother?

Most of the metadata associated with works of art has to do with the form, the style, the material, the historical context, the author’s life, the critique, etc. “Very little is about the actual visual content,” says Marinescu. “Why? Because there’s an assumption that you see it.”

This is not always the case, the most obvious example being visually impaired people. Automatically generated captions would improve web accessibility for them by associating paintings with more detailed descriptions rather than relying only on a short title. Fundación ONCE, the oldest and biggest Spanish organisation for people with disabilities, expressed interest in the project’s output.

Making sure digital tools are inclusive is one of the principles of the Cities Coalition for Digital Rights, which published a guide for cities to self-assess the accessibility of the ethical AI they are developing and procuring. The guide is part of the Global Observatory of Urban AI initiative.

At the same time, good metadata gathered from paintings makes it possible for computers to improve indexing and enable rich search and browsing. Programs can also find similarities between the content of the images and help people navigate connections in the art world, as they do with hyperlinks within a standard web page.

Rich annotations can be used to make recommendations or build virtual tours based on many different criteria. They can highlight connections between artworks, or between the past represented in paintings and the present captured by photographs.

Uses and open data

“The metadata we generate will be published on the European Data Portal and the Europeana Foundation aggregator platform,” explains Marinescu. “The models will also be available for download and use by anyone interested in cultural heritage research. Using our tools, museums could develop personalised or virtual tours with an educational vocation.” Who knows, maybe it will even be considered for the Open Data Initiative High-Value Datasets – an EU initiative that wants to collect datasets whose re-use can benefit society and the economy.

Cities could use the information to connect the content in museums and other cultural organisations. “For instance, depending on what street you are on, an app can suggest what else you can find out about that city’s heritage. It will tell you that the house you see is the birthplace of a famous architect who built the residence of a nobleman who sponsored a painter whose work is showcased in a city gallery,” says Marinescu.

Cities that are already wondering whether there are legal considerations to implementing such an app can be reassured. Under the current proposal of the Artificial Intelligence Act, which regulates the legal and ethical use of AI, it would probably qualify as a low-risk system, placing few obligations or responsibilities on its users.

Counting on you for the future

To go further, the Saint George on a Bike project must collect enough manual annotations. The team hopes to reach its goal through a crowdsourcing campaign on Zooniverse. “We would ideally have four or five descriptions for each image because everybody describes things differently. And we need enough variation for the model to generalise well,” says Marinescu.

So far, they have collected enough annotations for about 3,000 images. “They are certainly not enough, though,” says Marinescu. “We have built an initial model based on the existing annotations, and part of the results are already reasonable. We’re hopeful we can get results closer to fluent English.”

Once the team achieves this goal, you will be able to run an unknown painting through the model and get a description of its content automatically. The caption for Raphael’s painting will correctly read “A knight killing a dragon”, and the model may even be able to work out that the knight is likely to be Saint George.

The Saint George on a Bike team includes Joaquim Moré, Natural Language Processing; Artem Reshetnikov, Deep Learning; Cedric Bhihe, Machine Learning; Sergio Mendoza, Software Engineer; and Maria-Cristina Marinescu, Semantic Technologies.

Wilma Dragonetti, Eurocities Writer