[Turing-Southampton] Turing Defence & Security opportunity - responses by 20 December
Susan Davies
sdd1 at soton.ac.uk
Wed Dec 4 14:27:09 GMT 2019
***apologies if you receive this more than once***
The Turing’s Defence & Security programme would like to request expressions of interest for a 3-6 month project on Cross-Lingual Information Retrieval, with a budget of £50-60k. Initially they are looking for names of interested individuals, and a short profile on their suitability for the project.
Please see the statement of requirement below:
========
There is a strong requirement within defence and national security to triage large volumes of documents or other textual content in a range of languages. Typical techniques include either human-dependent approaches such as foreign language analysts (FLAs), or technology approaches such as bulk machine translation and keyword searching (such as CLASE, developed by the MIT Lincoln Laboratory's Human Language Technology Group for the FBI). However, both of these methods have drawbacks:
1. FLAs are rare and spread thinly, especially those with expertise in either low-resource or high-demand languages.
2. Despite large amounts of research, machine translation is still far from perfect, and if a single important word is mistranslated, this will not be found using keyword searching.
An alternative approach to this problem is to view the task as 'Cross Language Information Retrieval' (CLIR) rather than 'machine translation'. In this way, performance can be more usefully measured in terms of retrieval of documents of interest, rather than harder-to-quantify BLEU scores. A fairly novel approach to CLIR is to learn a multilingual embedding space into which documents can be projected, and then to carry out tasks such as classification, named entity recognition (NER), and sentiment analysis in that space. All of these tasks help towards retrieving documents of interest in multiple languages.
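As a minimal sketch of the shared embedding space idea, the Python below ranks documents against an English query by cosine similarity. It assumes some multilingual encoder has already mapped the query and the documents into the same vector space; the three-dimensional vectors and the document identifiers are hand-made placeholders rather than real model output, and cosine similarity is only one possible scoring choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors standing in for the output of a multilingual encoder.
doc_vecs = {
    "doc_ar": [0.9, 0.1, 0.0],
    "doc_ru": [0.1, 0.9, 0.1],
    "doc_fa": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # placeholder embedding of an English query

ranking = rank_documents(query_vec, doc_vecs)
print([doc_id for doc_id, _ in ranking])
```

Because the query and the documents live in one space, no translation step appears in the ranking itself; the quality of the result depends entirely on how well the encoder aligns the languages.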
One such approach, Multilingual BERT, has shown strong results for machine translation tasks, but its performance on CLIR hasn't been assessed in detail. It is proposed that a piece of work is undertaken with the following broad research aims:
1. Devise a suitable metric for determining the performance of a CLIR system.
2. Create or source a representative corpus of multilingual test data.
3. Explore techniques for finding documents of interest in foreign language corpora, such as multilingual document classification, topic detection, NER, and emotion detection.
4. Create a performant CLIR system able to take English language queries as input, and find documents of interest in many languages, to include, but not limited to, Arabic, Mandarin, Russian, and Farsi.
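To illustrate the first aim, here is a hedged sketch of two standard retrieval measures, recall at k and average precision, which could serve as a starting point for a CLIR metric. The statement of requirement does not prescribe any particular metric; the function names and the document identifiers in the example are illustrative assumptions.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document occurs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def mean_average_precision(runs):
    """Average the per-query average precision over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical system output for one query, with analyst-judged relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(ranked, relevant, k=2))   # 0.5
print(average_precision(ranked, relevant))  # 0.5
```

Measures of this kind score the system on whether documents of interest reach the analyst, which is the framing the statement above prefers over translation quality scores.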
The impact of such a system would be as follows:
1. English language analysts can begin to triage large foreign language corpora, increasing the volume of data that can be analysed.
2. Scarce FLA resource can be better prioritised towards documents which are more likely to contain information of interest.
=======
_____________________________________________
Susan Davies
Coordination Manager, Web Science Institute<https://www.southampton.ac.uk/wsi/index.page?>
University Liaison Manager, The Alan Turing Institute<https://www.southampton.ac.uk/wsi/alan-turing-institute/alan-turing-institute.page>
Room 3041, Building 32
Web Science Institute
University of Southampton
Southampton SO17 1BJ
023 8059 3523 | 07768 266464