Dutch autosummarization

Nov 10, 2020

Application Track:

Ready Made

Code:

REACH-2020-READYMADE-VRT_2

Domain:

Proposed by:

De Vlaamse Radio- en Televisieomroeporganisatie nv (VRT)

Summary of the entity:

VRT is the public broadcaster of the Flemish Community in Belgium. Its mission is to inform, inspire and unite, and in doing so to strengthen Flemish society. As a service-providing organization, VRT wants to take up a special position in society.

Summary of the challenge:

The main goal is to build a system that automatically summarizes text.

Description of the global challenge:

In addition to enormous amounts of audio and video content, VRT has a large collection of written text, most notably articles from its various websites. The news website, for example, publishes almost one hundred articles every day. No human being can keep up with all that written content. In recent years, automatic text summarization systems have become reliable. The challenge aims to create such a system for VRT news articles, which are written in standard Dutch. The user should be able to define the length of the summaries.

For this challenge, VRT expects working algorithms that automatically summarize Dutch news articles. Two types of summary are desired. First, VRT wants a subset of the most relevant information in the original article, where the extracted content is not modified in any way, i.e. key phrases or key sentences. This process is called extractive auto-summarization. Second, VRT wants an algorithm that extracts the most relevant information from the original article and builds a semantic representation of that information. This semantic representation is then used to write a new summary of the article, resulting in a more readable summary. This process is called abstractive auto-summarization. VRT predominantly uses AWS as its cloud service, and its articles are written in Adobe Experience Manager.

Sub-challenges composing this experiment:

This challenge is composed of 2 sub-challenges:

  • Dutch Extractive Auto Summarization (REACH-2020-READYMADE-VRT_2.1)
  • Dutch Abstractive Auto Summarization (REACH-2020-READYMADE-VRT_2.2)

Expected global results:

At the end of the project, VRT expects the following results:

  • The trained, working models (the model files)
  • Results of the evaluation of the models: how they were evaluated, on which part of the data, which metrics were used to measure performance…
  • The code to train and evaluate the models.
  • A pipeline (or at least an idea) for automatically summarizing news articles after they are written by our journalists.

DUTCH EXTRACTIVE AUTO SUMMARIZATION

Code:

REACH-2020-READYMADE-VRT_2.1

Summary of the sub-challenge:

The objective is to obtain algorithms that automatically summarize Dutch news articles by selecting a subset of the most relevant information in the original article, without modifying the content in any way.

Description of the challenge:

In addition to enormous amounts of audio and video content, VRT has a large collection of written text, most notably articles from its various websites. The news website, for example, publishes almost one hundred articles every day. No human being can keep up with all that written content. In recent years, automatic text summarization systems have become reliable. The challenge we propose is to create such a system for our news articles, which are written in standard Dutch. The user should be able to define the length of the summaries.

For this challenge, VRT expects working algorithms that automatically summarize Dutch news articles. Specifically, VRT wants a subset of the most relevant information in the original article, where the extracted content is not modified in any way, i.e. key phrases or key sentences. This process is called extractive auto-summarization. Additionally, the key phrases should be sorted by relevance/importance, so that the user can decide how long the summary should be (e.g. based on the amount of time they have to read the news).
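The ranking requirement can be illustrated with a minimal sketch. It assumes a simple word-frequency heuristic, a naive sentence splitter and a tiny stopword list, none of which are prescribed by VRT; the function names are hypothetical. It shows the two properties the challenge asks for: sentences are returned unmodified, and they are ranked so that the summary length can be chosen afterwards.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real system needs a fuller Dutch list.
DUTCH_STOPWORDS = {"de", "het", "een", "en", "van", "in", "is", "op", "te",
                   "dat", "die", "voor", "met", "zijn", "er", "aan", "ook", "om"}


def split_sentences(text):
    # Naive splitter: break on whitespace that follows '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased word tokens with stopwords removed.
    return [w for w in re.findall(r"\w+", sentence.lower())
            if w not in DUTCH_STOPWORDS]


def rank_sentences(text):
    """Return (score, position, sentence) tuples, most important first."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    scored = []
    for position, sentence in enumerate(sentences):
        words = tokenize(sentence)
        # Average corpus frequency of the sentence's content words.
        score = sum(freq[w] for w in words) / len(words) if words else 0.0
        scored.append((score, position, sentence))
    # Highest score first; ties go to the earlier sentence.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


def summarize(text, n_sentences):
    """Pick the n highest-ranked sentences, shown in their original order."""
    top = rank_sentences(text)[:n_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]


article = ("De regering kondigt nieuwe maatregelen aan. "
           "De maatregelen gelden vanaf maandag. "
           "Het weer blijft zonnig.")
print(summarize(article, 2))
```

Because the ranking is computed once and the cut-off is applied afterwards, the same ranked list can serve every summary length a reader asks for.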

Expected outcomes:

At the end of this sub-challenge, VRT expects the following results:

  • The trained, working extractive models (the model files)
  • Results of the evaluation of the extractive models: how they were evaluated, on which part of the data, which metrics were used to measure performance…
  • The code to train and evaluate the extractive models.
  • Extractive summaries of the articles we provided, where the phrases are sorted according to importance.
  • A pipeline (or at least an idea) for automatically summarizing news articles after they are written by our journalists.
  • A way for journalists to adjust the key phrases selected for the summary when the summary is not good enough.
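The evaluation results listed above depend on the choice of metric, which the challenge leaves open. As one possible example, the sketch below computes ROUGE-1 (unigram overlap between a generated summary and a human-written reference). The choice of ROUGE-1 and the function names are assumptions for illustration, not a VRT requirement.

```python
import re
from collections import Counter


def unigrams(text):
    # Bag of lowercased word tokens.
    return Counter(re.findall(r"\w+", text.lower()))


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand, ref = unigrams(candidate), unigrams(reference)
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("de kat zit", "de kat zit op de mat")
print(scores)
```

In this toy example every candidate word appears in the reference, so precision is 1.0, while only three of the six reference words are covered, so recall is 0.5.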

DUTCH ABSTRACTIVE AUTO SUMMARIZATION

Code:

REACH-2020-READYMADE-VRT_2.2

Summary of the sub-challenge:

The main goal is to obtain algorithms that automatically summarize Dutch news articles by extracting the most relevant information from the original article and making a semantic representation of that information.

Description of the challenge:

In addition to enormous amounts of audio and video content, VRT has a large collection of written text, most notably articles from its various websites. The news website, for example, publishes almost one hundred articles every day. No human being can keep up with all that written content. In recent years, automatic text summarization systems have become reliable. The challenge we propose is to create such a system for our news articles, which are written in standard Dutch. The user should be able to define the length of the summaries.

For this challenge, VRT expects working algorithms that automatically summarize Dutch news articles. Specifically, VRT wants to use an algorithm that extracts the most relevant information from the original article and makes a semantic representation of that information. This semantic representation is then used to write a new summary of the article, resulting in a more readable summary. This process is called abstractive auto-summarization. Additionally, the semantic information on which the summary is based should be sorted by relevance/importance, so that the user can decide how long the summary should be (e.g. based on the amount of time they have to read the news).
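The link between available reading time and summary length can be sketched as follows. The example assumes the information units are already sorted by importance, and it uses a hypothetical reading speed of 200 words per minute; both the function name and that default are illustrative assumptions. It only covers the selection step, not the generation of new text from the selected units.

```python
def select_units(ranked_units, minutes, words_per_minute=200):
    """Keep the most important units that fit in the reader's time budget.

    ranked_units: strings sorted from most to least important.
    words_per_minute: assumed reading speed (hypothetical default).
    """
    budget = int(minutes * words_per_minute)
    chosen, used = [], 0
    for unit in ranked_units:
        n_words = len(unit.split())
        # Stop at the first unit that does not fit, so a longer budget
        # always extends the selection made for a shorter one.
        if used + n_words > budget:
            break
        chosen.append(unit)
        used += n_words
    return chosen


units = [
    "regering kondigt maatregelen aan",
    "maatregelen gelden vanaf maandag",
    "oppositie reageert kritisch op het plan",
]
print(select_units(units, minutes=1, words_per_minute=8))
```

Stopping at the first unit that exceeds the budget, rather than skipping it and trying smaller ones, keeps the selections nested: a two-minute summary contains everything in the one-minute summary.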

Expected outcomes:

At the end of this sub-challenge, VRT expects the following results:

  • The trained, working abstractive models (the model files)
  • Results of the evaluation of the abstractive models: how they were evaluated, on which part of the data, which metrics were used to measure performance…
  • The code to train and evaluate the abstractive models.
  • Abstractive summaries of the articles we provided, with multiple summaries per article that differ in length/amount of important information.
  • A pipeline (or at least an idea) for automatically summarizing news articles after they are written by our journalists.
  • A way for journalists to adjust the key semantic information used for the summary when the summary is not good enough.

How do we apply?

Read the Guidelines for Applicants

Doubts or questions? Read more about REACH on the About Us page, have a look at our FAQ section or drop us an email at opencall@reach-incubator.eu.