
Computing the difference of two lists in Python while preserving order

Posted at 2019-09-27

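When would you use this?

The usual way to compute the difference of two lists is to convert them to the set type.
However, a set is unordered and holds each value only once, so as soon as a list is converted to a set its elements come back in an arbitrary order (small integers often appear sorted in ascending order) and duplicates are dropped.
If the order of the list carries meaning, for example IDs sorted by descending score, important information is lost.
The techniques in this article are meant for that kind of data.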
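
As a quick illustration of the point above, the following sketch converts a list to a set and back again. The variable names are chosen only for this example.

```python
# A list of IDs ordered by some score, with one ID (4) appearing twice
scored_ids = [5, 4, 3, 4, 2, 1]

as_set = set(scored_ids)

# The duplicate is gone: six elements in the list, five in the set
print(len(scored_ids), len(as_set))

# Sets are unordered, so a round trip through set() cannot be relied on
# to give back the original list
round_trip = list(as_set)
print(round_trip == scored_ids)  # False: at minimum, the duplicate 4 was dropped
```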

Illustration

list1 = [5, 4, 3, 2, 1]
list2 = [2, 4]

result = [5, 3, 1]

Conceptually, the operation is list1 - list2 (Python lists do not actually support the - operator).

Removing every occurrence of list2's values from list1

Python version

sample.py

list1 = [5, 4, 3, 4, 2, 1]
list2 = [2, 4]

result = [i for i in list1 if i not in list2]
print(result)
[5, 3, 1]
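
When list2 is long, the `not in list2` test scans list2 once for every element of list1. A common variation, sketched here under the assumption that the elements are hashable, builds a set once for the membership test while still iterating over list1 in its original order. The helper name `ordered_diff` is made up for this example.

```python
def ordered_diff(items, to_remove):
    """Return the elements of items that are not in to_remove, keeping order and duplicates."""
    # Build the lookup set once; membership tests on a set are O(1) on average
    remove_set = set(to_remove)
    return [x for x in items if x not in remove_set]


list1 = [5, 4, 3, 4, 2, 1]
list2 = [2, 4]

result = ordered_diff(list1, list2)
print(result)  # [5, 3, 1]
```

Only the lookup uses a set; the result is still built by walking list1 from front to back, so the order and any duplicates among the surviving elements are kept.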

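PySpark version

sample.py

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType

@F.udf(returnType=ArrayType(IntegerType()))
def udf_list_diff(l1, l2):
    # Keep every element of l1 that does not appear in l2, in the original order
    return [i for i in l1 if i not in l2]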

Removing only the first occurrence of each of list2's values from list1

Python version

sample.py

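def list_diff(l1, l2):
    result = list(l1)  # work on a copy so the caller's list is not modified
    for i in l2:
        if i in result:  # list.remove raises ValueError when the value is absent
            result.remove(i)  # removes only the first occurrence
    return result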

list1 = [5, 4, 3, 4, 2, 1]
list2 = [2, 4]

result = list_diff(list1, list2)
print(result)
[5, 3, 4, 1]
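
If list2 can itself contain repeated values, or the lists are large, one possible alternative is to count how many times each value still has to be removed with `collections.Counter` and make a single pass over list1. This is only a sketch, not the method used in this article; it assumes the elements are hashable, and `list_diff_counted` is a hypothetical name.

```python
from collections import Counter


def list_diff_counted(items, to_remove):
    """Remove each value in to_remove from items once per occurrence, starting from the front."""
    # How many times each value still has to be removed
    pending = Counter(to_remove)
    result = []
    for x in items:
        if pending[x] > 0:
            pending[x] -= 1  # skip this occurrence and use up one removal
        else:
            result.append(x)
    return result


list1 = [5, 4, 3, 4, 2, 1]
list2 = [2, 4]

print(list_diff_counted(list1, list2))      # [5, 3, 4, 1]
print(list_diff_counted(list1, [2, 4, 4]))  # [5, 3, 1]
```

Values in the second list that never occur in the first are simply ignored, and the input list is left unchanged.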

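PySpark version

sample.py

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, IntegerType

@F.udf(returnType=ArrayType(IntegerType()))
def udf_list_diff(l1, l2):
    result = list(l1)  # copy so the input array is not mutated
    for i in l2:
        if i in result:
            result.remove(i)  # removes only the first occurrence
    return result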

Appendix

When the order does not need to be preserved

sample.py

list1 = [5, 4, 3, 4, 2, 1]
list2 = [2, 4]

# Set iteration order is not guaranteed, so the order of the result may differ
result = list(set(list1) - set(list2))
print(result)
[1, 3, 5]
