KDD2019 Applied Data Science Track Session ADS2: Language Models and Text Mining

Posted at 2019-08-12 (last updated at 2019-08-12)

Automatic Dialogue Summary Generation for Customer Service

Chunyi Liu (AI Labs, Didi Chuxing); Peng Wang (AI Labs, Didi Chuxing); Jiang Xu (AI Labs, Didi Chuxing); Zang Li (AI Labs, Didi Chuxing); Jieping Ye (AI Labs, Didi Chuxing)

Background

Agents must understand users' questions and provide ...
Each agent needs to handle 100+ dialogues per day
(Manually written summaries are often poor)

(Extractive text summarization summarizes the documents by dropping/keeping a subset of these..)
Abstractive text summarization

Challenges

Quality requirement

Integrity
Logic
Correctness

(With respect to the traditional metrics: cross-entropy loss, BLEU, ROUGE-L)
Leader-Writer net
Hierarchical Encoder:
Leader writer decoder:
Loss layer with reinforcement loss:

Hierarchical encoder
Token-level encoder
Utterance-level encoder
Writer net: given a key point, decodes the corresponding sub-summary.
Preprocessing:

Normalize the samples by replacing phone numbers and plate numbers with special symbols
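
A minimal sketch of this normalization step. The regular expressions and the placeholder tokens `<PHONE>` and `<PLATE>` are assumptions for illustration; the talk did not state the exact rules or symbols used.

```python
import re

# Assumed patterns, not taken from the talk:
# an 11-digit mobile number starting with 1, and a simplified plate format.
PHONE_RE = re.compile(r"\b1\d{10}\b")
PLATE_RE = re.compile(r"\b[A-Z]{1,2}\d{4,5}\b")

def normalize(utterance: str) -> str:
    """Replace phone numbers and plate numbers with special symbols."""
    utterance = PHONE_RE.sub("<PHONE>", utterance)
    utterance = PLATE_RE.sub("<PLATE>", utterance)
    return utterance

example = normalize("Call me at 13812345678 about car AB12345")
```

Replacing such values with a shared symbol keeps the vocabulary small, so the model does not have to learn every individual number.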

Performance in practice
Eval metrics
Accept ratio: 67%(e)
ACW: 12s

Summary

Auxiliary key point sequence to guarantee the integrity, logic and correctness of automatically generated summaries

Detection of Review Abuse via Semi-Supervised Binary Multi-Target Tensor Decomposition

Anil R Yelundur (Amazon); Vineet Chaoji (Amazon); Bamdev Mishra (Microsoft India)

Seller incentivization of reviewers
In e-commerce, shoppers usually depend on online user reviews to obtain detailed product insights and make an informed purchase
Agencies/internet groups connect sellers with reviewers.
Key signals

Entities soliciting fake reviews form dense bipartite cores with their fake reviewers
Fake reviews have similar rating
Fake reviews are temporally clustered

Unsupervised binary tensor decomposition model
Sentinel multi-target learning
Natural gradient learning: challenges
Sentinel partial natural gradient learning: Fisher information matrix (FIM)
Computation (continued)
Experimental results: abusive reviewer detection
Experimental results: seller abuse evidence
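
A rough sketch of what a binary (logistic) CP tensor decomposition looks like. This is plain maximum-likelihood gradient descent, not the Bayesian model with Pólya-Gamma augmentation and natural gradients described in the talk. The tensor layout (reviewer x product x time bin), the toy data, and all function names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cp_logits(A, B, C):
    # logit[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def bernoulli_nll(X, A, B, C):
    z = cp_logits(A, B, C)
    # -log p(X | z) = log(1 + exp(z)) - X * z, summed over all cells
    return float(np.sum(np.logaddexp(0.0, z) - X * z))

def fit_logistic_cp(X, rank=2, steps=500, lr=0.02, seed=0):
    """Fit factor matrices by gradient descent on the Bernoulli loss."""
    rng = np.random.default_rng(seed)
    A, B, C = (0.3 * rng.standard_normal((n, rank)) for n in X.shape)
    for _ in range(steps):
        G = sigmoid(cp_logits(A, B, C)) - X  # d(loss) / d(logit)
        gA = np.einsum("ijk,jr,kr->ir", G, B, C)
        gB = np.einsum("ijk,ir,kr->jr", G, A, C)
        gC = np.einsum("ijk,ir,jr->kr", G, A, B)
        A, B, C = A - lr * gA, B - lr * gB, C - lr * gC
    return A, B, C

# Toy data: reviewers 0-2 all review products 0-1 in time bin 1,
# i.e. a dense, temporally clustered block like the key signals above.
X = np.zeros((6, 5, 4))
X[:3, :2, 1] = 1.0
A, B, C = fit_logistic_cp(X)
final_loss = bernoulli_nll(X, A, B, C)
```

Reviewers that take part in the same dense block end up with similar rows in `A`, which is the kind of signal a downstream detector could use.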

Natural gradient vs stochastic gradient & sufficient statistics

ROC-AUC abusive reviewer detection
Impact: early detection of abusive reviewers
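
For reference, ROC-AUC can be read as the probability that a randomly chosen abusive reviewer gets a higher score than a randomly chosen benign one. A small sketch of that definition follows; the labels and scores are made-up illustration values, not results from the talk.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: ties between a positive and a negative count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = abusive) and model scores
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```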

Conclusion

Applied tensor decomposition to identify abusive entities in e-commerce data
Pólya-Gamma data augmentation simplifies inference
Impact: early detection of abusive reviewers by sentinel
Currently investigating application of GCNN to our data in supervised setting

Unsupervised Clinical Language Translation

Wei-Hung Weng (Massachusetts Institute of Technology); Yu-An Chung (Massachusetts Institute of Technology); Peter Szolovits (Massachusetts Institute of Technology)
Good communication leads to better clinical outcomes
Affect clinical decision making

Great invasive ductal carcinoma / cancer / abnormal cells
(Polycystic ovary syndrome)

Automated sentence translation to fill the gap

Goal: professional-to-consumer translation
Ontology / dictionary
Pattern-based mining with Wikipedia corpus

Challenge

Out-of-vocabulary words, abbreviations
Still not understandable after replacement/explanation
Not reliable; not even a good dictionary exists
No previous sentence-level work

Method: unsupervised
Non-parallel data

Word-level translation
Learning word vector representation
(Matrix transformation to approximate ..)
Key: capturing semantics in a language based on the distributional hypothesis
Unsupervised word vector representation
Matrix approximation for embedding alignment
Assumption: the embedding spaces of the two languages should be similar
Identical English words as the anchors for alignment
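
One common way to realize this kind of embedding alignment is orthogonal Procrustes: stack the vectors of the anchor words (here, words spelled identically in both corpora) and solve for the rotation that maps one space onto the other. The sketch below assumes that choice; the notes do not say which solver the authors used, and the data is synthetic.

```python
import numpy as np

def procrustes_align(X_src, Y_tgt):
    """Orthogonal W minimizing ||X_src @ W - Y_tgt||_F (rows are anchor words)."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Synthetic check: the "consumer" space is a rotated copy of the "professional" one.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))                  # anchor vectors, professional space
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # hidden orthogonal map
Y = X @ Q                                         # same anchors, consumer space
W = procrustes_align(X, Y)
```

After alignment, any professional word vector `v` can be mapped with `v @ W` and compared against consumer word vectors.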

Evaluation

Deciding mutual nearest neighbors
K-nearest neighbors: hubness problem
Word-level translation alone is not enough
Sentence translation
Careful initialization by embedding alignment (word translation)
Language model fo
Sentence translation: statistical machine translation
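
Returning to the word-level step: the mutual nearest neighbor criterion keeps a translation pair only when each word is the other's best match, which filters out "hub" words that are the nearest neighbor of many others. A minimal sketch with a made-up similarity matrix:

```python
import numpy as np

def mutual_nearest_neighbors(S):
    """S[i, j] = similarity between source word i and target word j."""
    fwd = S.argmax(axis=1)  # best target for each source word
    bwd = S.argmax(axis=0)  # best source for each target word
    return [(i, int(j)) for i, j in enumerate(fwd) if bwd[j] == i]

# Hypothetical similarities: target word 0 is a hub (best match for every source word).
S = np.array([
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.1],
    [0.6, 0.1, 0.5],
])
pairs = mutual_nearest_neighbors(S)
```

Only the pair (0, 0) survives here; source words 1 and 2 also pick the hub, but the hub does not pick them back.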

Back-translation: leveraging target-to-source information
Translated sentence evaluation without reference

data

MIMIC
professional language
Consumer language
…
Word-level translation
Experiments: word translation model / language model
sentence evaluation from clinicians
Sentence-level translation
Unsupervised clinical language translation
Fully unsupervised bilingual dictionary

Gmail Smart Compose: Real-Time Assisted Writing

Mia Xu Chen (Google); Benjamin N. Lee (Google); Gagan Bansal (Google); Yuan Cao (Google); Shuyuan Zhang (Google); Justin Lu (Google); Jackie Tsay (Google); Yinan Wang (Google); Andrew M. Dai (Google); Zhifeng Chen (Google); Timothy Sohn (Google); Yonghui Wu (Google)

Smart Compose saves users from typing over 2 billion characters each week

Challenges

Latency
Scale triggering
Metrics design
Personalization
Privacy & fairness

Finding the right model
Compare and understand state-of-the-art
Data
User-composed emails
~8B messages in English

Preprocessing

Tokenization, normalization, quotation removal, ...

Evaluation metrics: Log perplexity, exact match
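
A small sketch of how these two metrics can be computed. The exact definitions used in the talk (for example, how exact match is bucketed by suggestion length) were not captured in these notes, so treat this as one reasonable reading.

```python
import math

def log_perplexity(token_probs):
    """Average negative log-probability the model assigned to the true tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def exact_match(predictions, references):
    """Fraction of suggested phrases identical to what the user actually typed."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical values for illustration
lp = log_perplexity([0.5, 0.5])
em = exact_match(["thank you", "see you soon"], ["thank you", "see you later"])
```

Log perplexity scores every token, while exact match only rewards a suggestion that is right in full, which is closer to what the user sees.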

Language model A: context encoder and language model
Contextual information helps improve quality
Transformer achieves better quality.
Language model B: concatenation of subject, previous email, and current email prefix
Seq2seq model
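
To make the Language model B input concrete, here is a sketch of packing the three fields into one token sequence. The separator token `<sep>` and whitespace tokenization are assumptions; the notes only say that the fields are concatenated.

```python
def build_model_b_input(subject, previous_email, prefix, sep="<sep>"):
    """Concatenate subject, previous email and current prefix into one sequence."""
    tokens = []
    for i, text in enumerate([subject, previous_email, prefix]):
        if i > 0:
            tokens.append(sep)  # hypothetical field separator
        tokens.extend(text.split())
    return tokens

seq = build_model_b_input("Lunch", "Are you free", "Yes I")
```

The language model then continues this single sequence, so the context is handled by the same network that predicts the next words.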

Attention helps modeling contextual information
SOTA machine translation models perform well

Life of a smart compose request
Prefix, beam search
Perplexity differences in general translate to exact-match differences
While Transformer shows clear quality advantages in perplexity, the advantage is less evident in exact match.
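
A toy sketch of the prefix plus beam search step mentioned above. The next-token table `TOY_LM` is invented and stands in for the real model, and conditioning on only the last token is a simplification.

```python
import math

# Hypothetical next-token distribution keyed by the last token.
TOY_LM = {
    "<s>": {"thank": 0.6, "see": 0.4},
    "thank": {"you": 0.9, "</s>": 0.1},
    "you": {"</s>": 0.7, "soon": 0.3},
    "see": {"you": 1.0},
    "soon": {"</s>": 1.0},
}

def beam_search(prefix, beam_size=2, max_len=4):
    """Extend the typed prefix; return finished candidates, best first."""
    beams = [(0.0, list(prefix))]  # (log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, p in TOY_LM.get(seq[-1], {}).items():
                cand = (score + math.log(p), seq + [tok])
                (finished if tok == "</s>" else candidates).append(cand)
        # Keep only the beam_size most probable unfinished sequences.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if not beams:
            break
    return sorted(finished, key=lambda c: c[0], reverse=True)

best_score, best_tokens = beam_search(["<s>"])[0]
```

Each extra decoding step adds one more model call per beam, which is why latency grows with suggestion length.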

Transformer decoding latency higher than LSTM
Growing latency gap between Transformer and LSTM models as the suggestions get longer

The personalized model outperforms the global model when α is properly chosen
Deployed multilingual model in production

Naranjo Question Answering using End-to-End Multi-task Learning Model

Bhanu Pratap Singh Rawat (University of Massachusetts Amherst); Fei Li (University of Massachusetts Lowell); Hong Yu (University of Massachusetts Lowell)

Naranjo questions are used to infer the causal relation between drugs and adverse drug reactions (ADRs)
Provide some insights regarding the relevance

Data collection: 584 discharge summaries, annotated by 4 trained annotators.
annotators meticulously annotated
