KDD2019 Applied Data Science Track Session ADS2: Language Models and Text Mining

Posted at 2019-08-12

Link to the summary page

Applied Data Science Track Session ADS2: Language Models and Text Mining

Automatic Dialogue Summary Generation for Customer Service

Chunyi Liu (AI Labs, Didi Chuxing); Peng Wang (AI Labs, Didi Chuxing); Jiang Xu (AI Labs, Didi Chuxing); Zang Li (AI Labs, Didi Chuxing); Jieping Ye (AI Labs, Didi Chuxing)

Background
  • Understanding user’s questions and provide
  • Agent needs to handle 100+ dialogues each day
  • (Poorly written summaries)

(Extractive text summarization: summarize the document by dropping/keeping a subset of its original text ..)
Abstractive text summarization

Challenges

Quality requirement

  • Integrity
  • Logic
  • Correctness

(With respect to the traditional metrics: cross-entropy loss, BLEU, ROUGE-L)
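
Since ROUGE-L is named among the traditional metrics, here is a minimal sketch of how it can be computed from the longest common subsequence (LCS) of a candidate and a reference summary. It assumes whitespace tokenization and the balanced F1 variant; the function names are hypothetical, and the talk did not say which ROUGE-L variant was used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed variant)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A score like this only measures word overlap, which is presumably why the notes contrast it with the integrity, logic and correctness requirements.
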
Leader-Writer net
Hierarchical Encoder:
Leader writer decoder:
Loss layer with reinforcement loss:

Hierarchical encoder
Token-level encoder
Utterance-level encoder
Writer net: given a key point, decoding the corresponding sub-summaries.
Preprocessing:

- Normalize the samples by replacing phone numbers and plate numbers with special symbols
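
The normalization step could look roughly like the sketch below. The regular expressions and the placeholder symbols `<PHONE>` and `<PLATE>` are assumptions made for illustration only; the notes do not give the actual patterns or symbols.

```python
import re

# Hypothetical patterns: real phone and plate formats depend on the locale
# and are not specified in the notes.
PHONE_RE = re.compile(r"\b\d{11}\b")
PLATE_RE = re.compile(r"\b[A-Z]{1,2}[A-Z0-9]{5}\b")


def normalize(utterance: str) -> str:
    """Replace phone-like and plate-like tokens with placeholder symbols."""
    utterance = PHONE_RE.sub("<PHONE>", utterance)
    utterance = PLATE_RE.sub("<PLATE>", utterance)
    return utterance
```

Replacing such values with a shared symbol keeps rare, user-specific strings out of the vocabulary.
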

Performance in practice
Eval metrics
Accept ratio: 67%(e)
ACW: 12s

Summary

Auxiliary key point sequence to guarantee the integrity, logic and correctness of automatically generated summaries

Detection of Review Abuse via Semi-Supervised Binary Multi-Target Tensor Decomposition

Anil R Yelundur (Amazon); Vineet Chaoji (Amazon); Bamdev Mishra (Microsoft India)

Seller incentivization of reviewers
In e-commerce, shoppers usually depend on online user reviews to obtain detailed product insights and make an informed purchase.
Agencies/internet groups connect sellers with reviewers.
Key signals

  • Entities soliciting fake reviews form dense bipartite cores with their fake
  • Fake reviews have similar ratings
  • Fake reviews are temporally clustered

Unsupervised binary tensor decomposition model
Sentinel multi-target learning
Natural gradient learning: Challenges
Sentinel partial natural gradient learning: FIM
Computation continued
Experimental results: abusive reviewer detection
Experimental results: seller abuse evidence

Natural gradient vs stochastic gradient & sufficient statistics

ROC-AUC abusive reviewer detection
Impact: early detection of abusive reviewers

Conclusion
  • Applied tensor decomposition to identify abusive entities in e-commerce data
  • Pólya-Gamma data augmentation simplifies inference
  • Impact: early detection of abusive reviewers by sentinel
  • Currently investigating application of GCNN to our data in supervised setting

Unsupervised Clinical Language Translation

Wei-Hung Weng (Massachusetts Institute of Technology); Yu-An Chung (Massachusetts Institute of Technology); Peter Szolovits (Massachusetts Institute of Technology)
Good communication > better clinical outcomes
Affect clinical decision making
* Great invasive ductal carcinoma/cancer/ abnormal cells
(Polycystic ovary syndrome)

Automated sentence translation to fill gap

Goal: professional-to-consumer translation
Ontology / dictionary
Pattern-based mining with Wikipedia corpus

Challenge
  • Out-of-vocabulary words, abbreviations
  • Still not understandable after replacement/explanation
  • Not reliable; there is not even a good dictionary
  • No previous work at the sentence level

Method: unsupervised
Non-parallel data

Word-level translation
Learning word vector representation
(Matrix transformation to approximate ..)
Key: capturing semantics in a language based on distributional hypothesis
Unsupervised word vector representation
Matrix approximation for embedding alignment
Assumption: language embedding should be similar
Identical English words as the anchors for alignment
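
One common way to write the matrix approximation for embedding alignment is the orthogonal Procrustes problem. This formulation is an assumption; the notes do not state the exact objective used in the talk. Here the columns of $X$ and $Y$ hold the professional-language and consumer-language embeddings of the anchor words (the identical English words), and $W$ is the linear map being learned:

```latex
W^{\ast} = \arg\min_{W \in O_d(\mathbb{R})} \lVert W X - Y \rVert_F = U V^{\top},
\qquad \text{where } U \Sigma V^{\top} = \operatorname{SVD}\!\left(Y X^{\top}\right)
```

Constraining $W$ to be orthogonal preserves distances within the source space, which fits the assumption that the two embedding spaces have a similar shape.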

Evaluation

Deciding mutual nearest neighbors
K-nearest neighbors hubness problem
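
A minimal sketch of the mutual nearest neighbor check, assuming cosine similarity over already aligned embeddings. The function names and the dictionary-of-vectors layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, candidates):
    """Key of the candidate vector most similar to `query`."""
    return max(candidates, key=lambda k: cosine(query, candidates[k]))


def mutual_nearest_neighbors(src, tgt):
    """Keep a pair only if each word is the other's nearest neighbor."""
    pairs = {}
    for s_word, s_vec in src.items():
        t_word = nearest(s_vec, tgt)
        if nearest(tgt[t_word], src) == s_word:
            pairs[s_word] = t_word
    return pairs
```

With plain nearest neighbors, a "hub" word in the target space can be the nearest neighbor of many source words; requiring the match in both directions filters those one-sided pairs out.
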
Only word-level translation is not enough
Sentence translation
Careful initialization by embedding alignment (word translation)
Language model fo
Sentence translation: statistical machine translation

Back-translation: leveraging target-to-source information
Translated sentence evaluation without reference

data

MIMIC
professional language
Consumer language

Word-level translation
Experiments: word translation model / language model
Sentence evaluation from clinicians
Sentence-level translation
Unsupervised clinical language translation
Fully unsupervised bilingual dictionary

Gmail Smart Compose: Real-Time Assisted Writing

Mia Xu Chen (Google); Benjamin N. Lee (Google); Gagan Bansal (Google); Yuan Cao (Google); Shuyuan Zhang (Google); Justin Lu (Google); Jackie Tsay (Google); Yinan Wang (Google); Andrew M. Dai (Google); Zhifeng Chen (Google); Timothy Sohn (Google); Yonghui Wu (Google)

Smart Compose saves users from typing over 2 billion characters each week

Challenges
  • Latency
  • Scale triggering
  • Metrics design
  • Personalization
  • Privacy & fairness

Finding the right model
Compare and understand state-of-the-art
Data
User-composed emails
~8B messages in English

Preprocessing

Tokenization, normalization, quotation,

Evaluation metrics: Log perplexity, exact match
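
As a rough illustration of the two metrics, the sketch below computes log perplexity from the probabilities a model assigned to the true tokens, and exact match as the share of suggestions identical to the reference text. The function names are hypothetical, and the exact-match definition here is a simplification; the talk's definition may differ in how suggestion length is handled.

```python
import math


def log_perplexity(token_probs):
    """Average negative log-probability assigned to the true tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def exact_match(predictions, references):
    """Fraction of predicted suggestions identical to the reference text."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)
```

Lower log perplexity means the model is less surprised by what users actually typed, while exact match is closer to what a user sees: a suggestion is either right or not.
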

Language model A: context encoder and language model
Contextual information helps improve quality
Transformer achieves better quality.
Language model B: concatenation of subject, previous email, and current email prefix
Seq2seq model

Attention helps modeling contextual information
SOTA machine translation models perform well

Life of a smart compose request
Prefix, beam search
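
A minimal sketch of beam search over a typed prefix. `next_token_probs` stands in for the language model and `toy_model` is a made-up lookup table; both are assumptions for illustration, not the production system. Scores are summed log-probabilities with no length normalization.

```python
import math


def beam_search(next_token_probs, prefix, beam_size=2, max_len=3, eos="</s>"):
    """Return the highest-scoring continuation of `prefix` found by beam search.

    `next_token_probs(tokens)` is a hypothetical callable returning a dict
    {token: probability} with strictly positive probabilities.
    """
    beams = [(0.0, list(prefix))]  # (log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, p in next_token_probs(tokens).items():
                candidates.append((score + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        # Keep only the top `beam_size` hypotheses; set aside the ended ones.
        for score, tokens in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((score, tokens))
        if not beams:
            break
    pool = (finished + beams) or [(0.0, list(prefix))]
    best = max(pool, key=lambda c: c[0])
    return best[1][len(prefix):]


def toy_model(tokens):
    """Made-up next-token table keyed on the last token only."""
    table = {
        "you": {"soon": 0.6, "later": 0.4},
        "soon": {"</s>": 0.9, "then": 0.1},
        "later": {"</s>": 1.0},
    }
    return table.get(tokens[-1], {"</s>": 1.0})


suggestion = beam_search(toy_model, ["see", "you"])
```

Each decoding step calls the model once per live hypothesis, so the cost grows with suggestion length.
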
Perplexity differences generally translate into exact-match differences.
Transformer shows a clear quality advantage in perplexity, but the advantage is less evident in exact match.

Transformer decoding latency higher than LSTM
Growing latency gap between Transformer and LSTM models as the suggestions get longer

Personalization model outperforms the global model when a is properly chosen
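
One plausible reading of the line above is that "a" is a mixing weight between a per-user model and the global model. Under that assumption (the notes do not confirm it), a linear interpolation of the two next-token probabilities could be sketched as follows; the function name and the parameter `alpha` are hypothetical.

```python
def interpolate(p_personal: float, p_global: float, alpha: float) -> float:
    """Mix personal and global next-token probabilities with weight alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_personal + (1.0 - alpha) * p_global
```

With `alpha = 0` this falls back to the global model, so a poorly chosen weight can only hurt by over-trusting the sparse personal data.
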
Deployed multilingual model in production

Naranjo Question Answering using End-to-End Multi-task Learning Model

Bhanu Pratap Singh Rawat (University of Massachusetts Amherst); Fei Li (University of Massachusetts Lowell); Hong Yu (University of Massachusetts Lowell)

Naranjo questions are used to infer the causal relation between drugs and adverse drug reactions (ADRs)
Provide some insights regarding the relevance

Data collection: 584 discharge summaries, annotated by 4 trained annotators.
annotators meticulously annotated
