# Applied Data Science Track Session ADS2: Language Models and Text Mining
Automatic Dialogue Summary Generation for Customer Service
Chunyi Liu (AI Labs, Didi Chuxing); Peng Wang (AI Labs, Didi Chuxing); Jiang Xu (AI Labs, Didi Chuxing); Zang Li (AI Labs, Didi Chuxing); Jieping Ye (AI Labs, Didi Chuxing)
Background
- Understanding users' questions and providing
- Agents need to handle 100+ dialogues each day
- (Poorly written summaries)
(Extractive text summarization summarizes documents by dropping/keeping a subset of these..)
Abstractive text
Challenges
Quality requirement
- Integrity
- Logic
- Correctness
(With respect to the traditional metrics: cross-entropy loss, BLEU, ROUGE-L)
Leader-writer net
Hierarchical encoder:
Leader-writer decoder:
Loss layer with reinforcement loss:
Hierarchical encoder
Token-level encoder
Utterance-level encoder
Writer net: given a key point, decodes the corresponding sub-summary.
Preprocessing:
- Normalize the samples by replacing phone numbers and plate numbers with special symbols
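The normalization step above can be sketched with two regular expressions. This is a minimal illustration, not the speakers' pipeline: the patterns (an 11-digit mobile number starting with 1, and a plate made of one CJK character, one Latin letter, and five alphanumerics) and the placeholder symbols `<PHONE>` and `<PLATE>` are assumptions chosen for the example.

```python
import re

# Hypothetical patterns, chosen only for illustration.
# Phone: 11 digits starting with 1, not embedded in a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)1\d{10}(?!\d)")
# Plate: one CJK character, one uppercase letter, five uppercase alphanumerics.
PLATE_RE = re.compile(r"[\u4e00-\u9fff][A-Z][A-Z0-9]{5}")


def normalize_utterance(text: str) -> str:
    """Replace phone numbers and plate numbers with placeholder symbols."""
    text = PHONE_RE.sub("<PHONE>", text)
    text = PLATE_RE.sub("<PLATE>", text)
    return text


example = "My number is 13812345678 and the plate is 京A12345."
print(normalize_utterance(example))
```

Replacing such values with a single symbol keeps the vocabulary small and stops the model from memorizing individual numbers.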
Performance in practice
Eval metrics
Accept ratio: 67%(e)
ACW: 12s
Summary
Auxiliary key point sequence to guarantee the integrity, logic, and correctness of automatically generated summaries
Detection of Review Abuse via Semi-Supervised Binary Multi-Target Tensor Decomposition
Anil R Yelundur (Amazon); Vineet Chaoji (Amazon); Bamdev Mishra (Microsoft India)
Seller incentivization of reviewers
In e-commerce, shoppers usually depend on online user reviews to obtain detailed product insights and make an informed purchase
Agencies/internet groups connect sellers with reviewers.
Key signals
- Entities soliciting fake reviews form dense bipartite cores with their fake reviewers
- Fake reviews have similar ratings
- Fake reviews are temporally clustered
Unsupervised binary tensor decomposition model
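As a rough illustration of what a binary tensor decomposition model computes, the sketch below uses a CP (rank-R) factorization with a logistic link: each binary entry is modeled as a Bernoulli variable whose logit is the sum over components of the product of the factor values. The three modes (reviewer, product, time bucket), the rank, and all function names are assumptions for the example; the talk's actual tensor layout and inference procedure are not reproduced here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cp_logistic_probs(U, V, W):
    """P[i, j, k] = sigmoid(sum_r U[i, r] * V[j, r] * W[k, r])."""
    logits = np.einsum("ir,jr,kr->ijk", U, V, W)
    return sigmoid(logits)


def bernoulli_log_likelihood(X, P, eps=1e-12):
    """Log-likelihood of a binary tensor X under entry-wise probabilities P."""
    return float(np.sum(X * np.log(P + eps) + (1 - X) * np.log(1 - P + eps)))


rng = np.random.default_rng(0)
rank = 3
U = rng.normal(size=(5, rank))  # hypothetical reviewer factors
V = rng.normal(size=(4, rank))  # hypothetical product factors
W = rng.normal(size=(6, rank))  # hypothetical time-bucket factors

P = cp_logistic_probs(U, V, W)
X = (rng.random(P.shape) < P).astype(float)  # sample a binary tensor from the model
print(P.shape, bernoulli_log_likelihood(X, P))
```

Fitting the model means choosing the factor matrices that maximize this likelihood (or a posterior over them) for the observed tensor.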
Sentinel multi-target learning
Natural gradient learning: Challenges
Sentinel partial natural gradient learning: FIM
Computation continued
Experimental results: abusive reviewer detection
Experimental results: seller abuse evidence
Natural gradient vs stochastic gradient & sufficient statistics
ROC-AUC abusive reviewer detection
Impact: early detection of abusive reviewers
Conclusion
- Applied tensor decomposition to identify abusive entities in e-commerce data
- Pólya-Gamma data augmentation simplifies inference
- Impact: early detection of abusive reviewers by sentinel
- Currently investigating application of GCNN to our data in a supervised setting
Unsupervised Clinical Language Translation
Wei-Hung Weng (Massachusetts Institute of Technology); Yu-An Chung (Massachusetts Institute of Technology); Peter Szolovits (Massachusetts Institute of Technology)
Good communication > better clinical outcomes
Affect clinical decision making
- Great invasive ductal carcinoma/cancer/ abnormal cells
(Polycystic ovary syndrome)
Automated sentence translation to fill the gap
Goal: professional-to-consumer translation
Ontology / dictionary
Pattern-based mining with Wikipedia corpus
Challenge
- Out-of-vocabulary words, abbreviations
- Still not understandable after replacement/explanation
- Not reliable; not even a good dictionary exists
- No previous sentence-level work
Method: unsupervised
Non-parallel data
Word-level translation
Learning word vector representation
(Matrix transformation to approximate ..)
Key: capturing semantics in a language based on distributional hypothesis
Unsupervised word vector representation
Matrix approximation for embedding alignment
Assumption: language embedding should be similar
Identical English words as the anchors for alignment
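One standard way to align two embedding spaces with anchor words is orthogonal Procrustes: find the orthogonal matrix that best maps the anchor vectors of one space onto the other. The sketch below assumes that technique and uses words spelled identically in both vocabularies as anchors; whether the talk used exactly this solver is an assumption. The toy vocabularies, the dimension, and the names `prof` and `cons` are hypothetical.

```python
import numpy as np


def procrustes_align(src_vecs, tgt_vecs):
    """Return the orthogonal W minimizing ||src_vecs @ W - tgt_vecs||_F."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


rng = np.random.default_rng(0)
dim = 4
# Hidden rotation relating the two toy spaces (unknown to the aligner).
true_rot, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

shared = ["pain", "heart", "blood", "fever", "cancer"]
prof = {w: rng.normal(size=dim) for w in shared + ["dyspnea"]}  # professional space
cons = {w: prof[w] @ true_rot for w in shared}                  # consumer space
cons["breathless"] = rng.normal(size=dim)

# Anchors: words that appear with the same spelling in both vocabularies.
anchors = sorted(set(prof) & set(cons))
X = np.stack([prof[w] for w in anchors])
Y = np.stack([cons[w] for w in anchors])

W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))
```

After alignment, any professional-side vector can be mapped with `vec @ W` and compared against consumer-side vectors by cosine similarity.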
Evaluation
Deciding mutual nearest neighbors
K-nearest neighbors hubness problem
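The sketch below illustrates mutual nearest neighbors and the hubness problem on a small, hand-written similarity matrix (rows are source words, columns are target words; the values are made up). Target 2 acts as a hub that is close to every source word, so plain nearest-neighbor retrieval sends two source words to it. CSLS (cross-domain similarity local scaling) is shown as one known correction for hubness; it is named here as an illustrative choice, not as the method confirmed by the talk.

```python
import numpy as np


def mutual_nearest_neighbors(sim):
    """Pairs (i, j) where j is i's best target and i is j's best source."""
    best_tgt = sim.argmax(axis=1)
    best_src = sim.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(best_tgt) if best_src[j] == i]


def csls(sim, k=2):
    """CSLS score: 2*sim minus the mean top-k similarity of each row and column."""
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sim - r_src - r_tgt


# Hypothetical cosine similarities; column 2 is a hub.
sim = np.array([
    [0.90, 0.20, 0.85],
    [0.30, 0.80, 0.82],
    [0.10, 0.20, 0.90],
])

print(mutual_nearest_neighbors(sim))           # source 1 loses its match to the hub
print(mutual_nearest_neighbors(csls(sim, 2)))  # CSLS penalizes the hub column
```

Keeping only mutual pairs trades recall for precision, which is useful when the induced dictionary is later used to initialize sentence-level translation.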
Only word-level translation is not enough
Sentence translation
Careful initialization by embedding alignment (word translation)
Language model fo
Sentence translation: statistical machine translation
Back-translation: leveraging target-to-source information
Translated sentence evaluation without reference
data
MIMIC
Professional language
Consumer language
…
Word-level translation
Experiments: word translation model / language model
Sentence evaluation from clinicians
Sentence-level translation
Unsupervised clinical language translation
Fully unsupervised bilingual dictionary
Gmail Smart Compose: Real-Time Assisted Writing
Mia Xu Chen (Google); Benjamin N. Lee (Google); Gagan Bansal (Google); Yuan Cao (Google); Shuyuan Zhang (Google); Justin Lu (Google); Jackie Tsay (Google); Yinan Wang (Google); Andrew M. Dai (Google); Zhifeng Chen (Google); Timothy Sohn (Google); Yonghui Wu (Google)
Smart Compose saves users from typing over 2 billion characters each week
Challenges
- Latency
- Scale triggering
- Metrics design
- Personalization
- Privacy & fairness
Finding the right model
Compare and understand state-of-the-art
Data
User-composed emails
~8B messages in English
Preprocessing
Tokenization, normalization, quotation,
Evaluation metrics: Log perplexity, exact match
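The two metrics can be sketched as follows. Log perplexity is taken here as the average negative log-probability the model assigns to the reference tokens, and exact match as the fraction of suggested phrases identical to the reference phrase. The function names and the toy inputs are assumptions for illustration; the exact evaluation protocol from the talk (for example, which phrase lengths are scored) is not reproduced.

```python
import math


def log_perplexity(token_probs):
    """Average negative log-probability assigned to the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def exact_match(predictions, references):
    """Fraction of predicted phrases identical to the reference phrase."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical probabilities the model gave to three reference tokens.
probs = [0.5, 0.25, 0.125]
print(log_perplexity(probs), math.exp(log_perplexity(probs)))

preds = ["thank you", "see you soon"]
refs = ["thank you", "see you later"]
print(exact_match(preds, refs))
```

Log perplexity rewards a model for every token, while exact match only counts fully correct suggestions, so the two metrics can rank models differently.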
Language model A: context encoder and language model
Contextual information helps improve quality
Transformer achieves better quality.
Language model B: concatenation of subject, previous email, and current email prefix
Seq2seq model
Attention helps modeling contextual information
SOTA machine translation models perform well
Life of a smart compose request
Prefix, beam search
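A minimal sketch of prefix-conditioned beam search: the typed prefix seeds the beam, each step extends every hypothesis with candidate next tokens, and only the highest-scoring hypotheses are kept. The toy bigram table `TOY_MODEL`, the end token `</s>`, and the default beam size are hypothetical stand-ins for the real neural language model and serving configuration.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by the previous token.
TOY_MODEL = {
    "thanks": {"for": 0.6, "a": 0.3, "</s>": 0.1},
    "for": {"your": 0.7, "the": 0.3},
    "a": {"lot": 0.9, "</s>": 0.1},
    "your": {"help": 0.8, "</s>": 0.2},
    "the": {"update": 0.9, "</s>": 0.1},
    "help": {"</s>": 1.0},
    "lot": {"</s>": 1.0},
    "update": {"</s>": 1.0},
}


def beam_search(prefix, beam_size=2, max_steps=4):
    """Return the best continuation of a non-empty token prefix."""
    beams = [(0.0, list(prefix))]  # (log-probability, tokens)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            next_probs = TOY_MODEL.get(tokens[-1], {"</s>": 1.0})
            for tok, p in next_probs.items():
                hyp = (score + math.log(p), tokens + [tok])
                # Hypotheses that emit the end token leave the beam.
                (finished if tok == "</s>" else candidates).append(hyp)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    # Fall back to unfinished beams if nothing reached the end token.
    _, tokens = max(finished or beams, key=lambda c: c[0])
    return [t for t in tokens[len(prefix):] if t != "</s>"]


print(beam_search(["thanks"]))
```

Each extra decoding step adds one more model call per hypothesis, which is why longer suggestions cost more latency.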
Perplexity differences in general translate to exact-match differences
While the Transformer shows clear quality advantages in perplexity, the advantage is less evident in exact match.
Transformer decoding latency is higher than LSTM
Growing latency gap between Transformer and LSTM models as the suggestions get longer
Personalized model outperforms the global model when α is properly chosen
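A common way to combine a personal and a global language model is linear interpolation with a mixing weight. The sketch below assumes that setup; the function name, the weight name `alpha`, and the toy distributions are hypothetical, and the tuning procedure used in production is not shown.

```python
def interpolate(p_personal, p_global, alpha):
    """Mix two next-word distributions: alpha * personal + (1 - alpha) * global."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vocab = set(p_personal) | set(p_global)
    return {
        w: alpha * p_personal.get(w, 0.0) + (1.0 - alpha) * p_global.get(w, 0.0)
        for w in vocab
    }


# Hypothetical next-word distributions after the prefix "Best".
p_global = {"regards": 0.6, "wishes": 0.4}
p_personal = {"regards": 0.1, "wishes": 0.9}

for alpha in (0.0, 0.5, 1.0):
    mixed = interpolate(p_personal, p_global, alpha)
    print(alpha, max(mixed, key=mixed.get))
```

With `alpha` at 0 the suggestion follows the global model only; raising it lets a user's own writing habits override the global preference.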
Deployed multilingual model in production
## Naranjo Question Answering using End-to-End Multi-task Learning Model
Bhanu Pratap Singh Rawat (University of Massachusetts Amherst); Fei Li (University of Massachusetts Lowell); Hong Yu (University of Massachusetts Lowell)
Naranjo questions are used to infer the causal relation between drugs and adverse drug reactions (ADRs)
Provide some insights regarding the relevance
Data collection: 584 discharge summaries, annotated by 4 trained annotators.
annotators meticulously annotated