Google has unveiled a new multilingual text vectorizer called RETVec (short for Resilient and Efficient Text Vectorizer) to help detect potentially harmful content such as spam and malicious emails in Gmail.
“RETVec is trained to be robust against character-level manipulations including insertions, deletions, typos, homoglyphs, LEET substitutions, etc.,” according to the project’s description on GitHub.
“The RETVec model is trained on top of a novel character encoder that can encode all UTF-8 characters and words efficiently.”
While major platforms like Gmail and YouTube rely on text classification models to detect phishing attacks, inappropriate comments, and scams, threat actors are known to devise counter-strategies to bypass these defensive measures.
They have been observed resorting to adversarial text manipulations, which range from the use of homoglyphs to keyword stuffing to invisible characters.
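To make these manipulations concrete, here is a minimal Python sketch of how they can slip past a naive keyword filter. The blocked-word list and the `naive_filter` function are hypothetical illustrations; they do not represent how Gmail or RETVec work.

```python
# Hypothetical keyword list, for illustration only.
BLOCKED = {"free", "prize"}

def naive_filter(text: str) -> bool:
    """Return True if any whitespace-separated word exactly matches a blocked keyword."""
    return any(word in BLOCKED for word in text.lower().split())

clean = "claim your free prize"
# Homoglyphs: Cyrillic 'е' (U+0435) and 'і' (U+0456) look like Latin 'e' and 'i'.
homoglyph = "claim your fr\u0435\u0435 pr\u0456ze"
# Invisible characters: a zero-width space (U+200B) inserted inside each keyword.
invisible = "claim your fr\u200bee pri\u200bze"
# LEET substitution: digits standing in for letters.
leet = "claim your fr33 pr1ze"

for label, sample in [("clean", clean), ("homoglyph", homoglyph),
                      ("invisible", invisible), ("leet", leet)]:
    print(label, naive_filter(sample))
```

Only the unmodified message is flagged. The three manipulated variants look nearly identical to a human reader but no longer match the keywords exactly, so the filter misses them.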
RETVec, which works in over 100 languages out of the box, aims to help build server-side and on-device text classifiers that are more resilient and efficient.
Vectorization is a natural language processing (NLP) technique that maps words or phrases from a vocabulary to corresponding numerical representations so that further analyses, such as sentiment analysis, text classification, and named entity recognition, can be performed.
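As a rough illustration of the general idea, the toy Python sketch below turns each word into a fixed-length list of Unicode code points. It is not RETVec's character encoder or embedding model; the function names and the `max_chars` parameter are assumptions made for this example.

```python
from typing import List

def vectorize_word(word: str, max_chars: int = 8) -> List[int]:
    """Map a word to a fixed-length vector of code points, zero-padded on the right."""
    codes = [ord(ch) for ch in word[:max_chars]]  # truncate long words
    return codes + [0] * (max_chars - len(codes))

def vectorize_text(text: str, max_chars: int = 8) -> List[List[int]]:
    """Split text on whitespace and vectorize each word independently."""
    return [vectorize_word(word, max_chars) for word in text.split()]

vectors = vectorize_text("spam alert", max_chars=6)
print(vectors)  # [[115, 112, 97, 109, 0, 0], [97, 108, 101, 114, 116, 0]]
```

Working at the character level means any string can be encoded, not only words from a fixed vocabulary. A practical vectorizer would then pass such raw encodings through a learned model to produce the representations a classifier consumes.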
“Due to its novel architecture, RETVec works out-of-the-box with every language and all UTF-8 characters without the need for text preprocessing, making it an ideal candidate for on-device, web, and large-scale text classification deployments,” Google’s Elie Bursztein and Marina Zhang said.
The tech giant said that integrating the vectorizer into Gmail improved the spam detection rate over the baseline by 38% and reduced the false positive rate by 19.4%. It also lowered the model’s Tensor Processing Unit (TPU) usage by 83%.
“Models trained with RETVec show faster inference speed due to its compact representation. Having smaller models reduces computational costs and reduces latency, which is important for large applications and on-device models,” added Bursztein and Zhang.