
Trying CodeKata in Kotlin: Data Munging

Last updated at 2021-01-17, posted at 2020-12-30

This time, too, I'm going to work through a CodeKata in Kotlin.
If you're wondering what CodeKata is in the first place, please see "Trying CodeKata in Kotlin: Karate Chop".

Giving It a Try

This time's challenge is "Data Munging", the fourth of the 21 katas. It is split into three parts.

Part One: Weather Data
Part Two: Soccer League Table
Part Three: DRY Fusion

Part One: Weather Data

In weather.dat you’ll find daily weather data for Morristown, NJ for June 2002. Download this text file, then write a program to output the day number (column one) with the smallest temperature spread (the maximum temperature is the second column, the minimum the third column).

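In other words, Part One asks us to read the text file weather.dat and output the day with the smallest difference between maximum and minimum temperature, together with that difference.
For reference, weather.dat is formatted as follows.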

 Dy MxT   MnT   AvT   HDDay  AvDP 1HrP TPcpn WxType PDir AvSp Dir MxS SkyC MxR MnR AvSLP

   1  88    59    74          53.8       0.00 F       280  9.6 270  17  1.6  93 23 1004.5
   2  79    63    71          46.5       0.00         330  8.7 340  23  3.3  70 28 1004.5
   3  77    55    66          39.6       0.00         350  5.0 350   9  2.8  59 24 1016.8
   4  77    59    68          51.1       0.00         110  9.1 130  12  8.6  62 40 1021.1
   5  90    66    78          68.3       0.00 TFH     220  8.3 260  12  6.9  84 55 1014.4
   6  81    61    71          63.7       0.00 RFH     030  6.2 030  13  9.7  93 60 1012.7
   7  73    57    65          53.0       0.00 RF      050  9.5 050  17  5.3  90 48 1021.8
   8  75    54    65          50.0       0.00 FH      160  4.2 150  10  2.6  93 41 1026.3
   9  86    32*   59       6  61.5       0.00         240  7.6 220  12  6.0  78 46 1018.6
  10  84    64    74          57.5       0.00 F       210  6.6 050   9  3.4  84 40 1019.0
  11  91    59    75          66.3       0.00 H       250  7.1 230  12  2.5  93 45 1012.6
  12  88    73    81          68.7       0.00 RTH     250  8.1 270  21  7.9  94 51 1007.0
  13  70    59    65          55.0       0.00 H       150  3.0 150   8 10.0  83 59 1012.6
  14  61    59    60       5  55.9       0.00 RF      060  6.7 080   9 10.0  93 87 1008.6
  15  64    55    60       5  54.9       0.00 F       040  4.3 200   7  9.6  96 70 1006.1
  16  79    59    69          56.7       0.00 F       250  7.6 240  21  7.8  87 44 1007.0
  17  81    57    69          51.7       0.00 T       260  9.1 270  29* 5.2  90 34 1012.5
  18  82    52    67          52.6       0.00         230  4.0 190  12  5.0  93 34 1021.3
  19  81    61    71          58.9       0.00 H       250  5.2 230  12  5.3  87 44 1028.5
  20  84    57    71          58.9       0.00 FH      150  6.3 160  13  3.6  90 43 1032.5
  21  86    59    73          57.7       0.00 F       240  6.1 250  12  1.0  87 35 1030.7
  22  90    64    77          61.1       0.00 H       250  6.4 230   9  0.2  78 38 1026.4
  23  90    68    79          63.1       0.00 H       240  8.3 230  12  0.2  68 42 1021.3
  24  90    77    84          67.5       0.00 H       350  8.5 010  14  6.9  74 48 1018.2
  25  90    72    81          61.3       0.00         190  4.9 230   9  5.6  81 29 1019.6
  26  97*   64    81          70.4       0.00 H       050  5.1 200  12  4.0 107 45 1014.9
  27  91    72    82          69.7       0.00 RTH     250 12.1 230  17  7.1  90 47 1009.0
  28  84    68    76          65.6       0.00 RTFH    280  7.6 340  16  7.0 100 51 1011.0
  29  88    66    77          59.7       0.00         040  5.4 020   9  5.3  84 33 1020.6
  30  90    45    68          63.6       0.00 H       240  6.0 220  17  4.8 200 41 1022.7
  mo  82.9  60.5  71.7    16  58.8       0.00              6.9          5.3

Implementation

import java.io.File

class DataMunging {
    companion object {
        /**
         * Calculates the smallest temperature spread and returns it together with its date.
         */
        private fun File.calculateSmallestSpread(): Map<String, String> {
            return this.readLines()
                    // drop the header row and the monthly summary row
                    .drop(1).dropLast(1)
                    .mapNotNull { row ->
                        // turn the raw string into a cleansed list of column values
                        val line = row.cleanseDataFormat()

                        // the file contains an empty row that holds no weather info;
                        // return null for it so that mapNotNull skips it
                        if (line.isNotEmpty()) {
                            val date = line[0]
                            // extract the temperatures needed to calculate the spread
                            val maxTemp = line[1].toInt()
                            val minTemp = line[2].toInt()
                            // build a map with "date" and "spread" as keys
                            mapOf("date" to date, "spread" to (maxTemp - minTemp).toString())
                        } else null
                    }
                    // sort so that the smallest spread comes first, then pick that element
                    .sortedBy { it.getValue("spread").toInt() }[0]
        }

        /**
         * Cleans a raw string row into a neat list of column values.
         */
        private fun String.cleanseDataFormat(): List<String> {
            return this.replace("\\s+".toRegex(), " ")
                    .replace("*", "")
                    .split(" ")
                    .drop(1)
        }
    }
}

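The file handling lives in an extension function, File.calculateSmallestSpread(). Right after readLines(), the list still contains rows that are useless for the calculation (the header row at the top and the "mo" summary row at the bottom), so .drop(1).dropLast(1) removes them first.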
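To make the effect of .drop(1).dropLast(1) concrete, here is a minimal sketch on a hand-made list of lines. The helper name stripHeaderAndFooter and the shortened sample rows are made up for illustration and are not part of the code above.

```kotlin
// Hypothetical helper, for illustration only: removes the first and the last line.
fun stripHeaderAndFooter(lines: List<String>): List<String> =
    lines.drop(1).dropLast(1)

// Shortened stand-in for weather.dat: header, blank line, two days, monthly summary.
val sampleLines = listOf(
    " Dy MxT   MnT   AvT",
    "",
    "   1  88    59    74",
    "   2  79    63    71",
    "  mo  82.9  60.5  71.7"
)

// Keeps the blank line and the two day rows; header and "mo" row are gone.
val remainingLines = stripHeaderAndFooter(sampleLines)
```

Note that the blank line after the header survives this step, which is why the per-row processing still has to skip empty rows.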

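Next, the mapNotNull step actually computes the difference between the maximum and minimum temperature, but for that the values have to be accessible by position. That is the job of String.cleanseDataFormat(), another extension function, which tidies up the format of each row: it normalizes every column separator to a single half-width space and removes the "*" markers, which are irrelevant to the calculation.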
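As a small illustration of what the cleansing does to one row, the sketch below applies the same steps as a top-level function. The name cleanse is hypothetical, and the sample row is day 9 of weather.dat cut down to its first four columns.

```kotlin
// Same cleansing steps as String.cleanseDataFormat(), written as a top-level function.
fun cleanse(row: String): List<String> =
    row.replace("\\s+".toRegex(), " ") // collapse each run of whitespace into one space
        .replace("*", "")              // strip the "*" markers attached to some values
        .split(" ")
        .drop(1)                       // the leading space yields an empty first token; discard it

// Day 9 of weather.dat, shortened to four columns for illustration.
val day9 = cleanse("   9  86    32*   59")

// An empty row becomes an empty list, which the caller can then skip.
val emptyRow = cleanse("")
```

Here day9 ends up as the list ["9", "86", "32", "59"], so the "*" on the minimum temperature no longer gets in the way of toInt().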

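The resulting variable line is a list whose elements are the values of one row of weather.dat (something like ["date", "max temperature", "min temperature", ...]). From it we pick the date and the max/min temperatures needed for the calculation, build a map with "date" and "spread" as keys, and pass it on to the next step.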

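Finally, sorting by the value under the "spread" key brings the element with the smallest spread to the front, and that element is returned; the calling code (not shown here) stores it in the variable smallestSpread. The actual output looks like this.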

{date=14, spread=2}

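So day 14 had the smallest difference between maximum and minimum temperature, at 2 degrees. In the raw data the row for day 14 reads 14 61 59...., so the difference was indeed 2 degrees (in the raw data, column 1 = day, column 2 = maximum temperature, column 3 = minimum temperature).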
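Sorting the whole list just to take its first element does more work than strictly needed. As an alternative sketch (not what the code above does), minByOrNull from the Kotlin standard library (available since Kotlin 1.4) finds the same element in a single pass and returns null for an empty list instead of throwing. The helper name smallestSpreadOf is hypothetical.

```kotlin
// Hypothetical helper: picks the entry with the smallest "spread" without sorting.
// Returns null when the list is empty, where sortedBy { ... }[0] would throw.
fun smallestSpreadOf(entries: List<Map<String, String>>): Map<String, String>? =
    entries.minByOrNull { it.getValue("spread").toInt() }

// Three entries in the same shape the mapNotNull step produces (days 13 to 15).
val sampleEntries = listOf(
    mapOf("date" to "13", "spread" to "11"),
    mapOf("date" to "14", "spread" to "2"),
    mapOf("date" to "15", "spread" to "9")
)

val smallestEntry = smallestSpreadOf(sampleEntries)
```

For these three days smallestEntry is the day 14 entry with a spread of 2, matching the output above.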

Part Two: Soccer League Table

The file football.dat contains the results from the English Premier League for 2001/2. The columns labeled ‘F’ and ‘A’ contain the total number of goals scored for and against each team in that season (so Arsenal scored 79 goals against opponents, and had 36 goals scored against them). Write a program to print the name of the team with the smallest difference in ‘for’ and ‘against’ goals.

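In other words, Part Two asks us to read the text file football.dat and output the team with the smallest difference between goals scored and goals conceded, together with that difference.
For reference, football.dat is formatted as follows (it is the Premier League table for the 2001/2 season, the year Arsenal won the title!).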

       Team            P     W    L   D    F      A     Pts
    1. Arsenal         38    26   9   3    79  -  36    87
    2. Liverpool       38    24   8   6    67  -  30    80
    3. Manchester_U    38    24   5   9    87  -  45    77
    4. Newcastle       38    21   8   9    74  -  52    71
    5. Leeds           38    18  12   8    53  -  37    66
    6. Chelsea         38    17  13   8    66  -  38    64
    7. West_Ham        38    15   8  15    48  -  57    53
    8. Aston_Villa     38    12  14  12    46  -  47    50
    9. Tottenham       38    14   8  16    49  -  53    50
   10. Blackburn       38    12  10  16    55  -  51    46
   11. Southampton     38    12   9  17    46  -  54    45
   12. Middlesbrough   38    12   9  17    35  -  47    45
   13. Fulham          38    10  14  14    36  -  44    44
   14. Charlton        38    10  14  14    38  -  49    44
   15. Everton         38    11  10  17    45  -  57    43
   16. Bolton          38     9  13  16    44  -  62    40
   17. Sunderland      38    10  10  18    29  -  51    40
   -------------------------------------------------------
   18. Ipswich         38     9   9  20    41  -  64    36
   19. Derby           38     8   6  24    33  -  63    30
   20. Leicester       38     5  13  20    30  -  64    28

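Please allow me to skip the implementation details here, for the following two reasons.

The processing is basically the same as in Part One; the only difference is the code inside mapNotNull
The next section, Part Three, rewrites Part One and Part Two in a DRY way, so the relevant code is covered again there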
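To show where a football.dat row differs from a weather.dat row, the sketch below runs the same cleansing steps on two rows from the table above. The function name cleanseRow is hypothetical; the point is which index each value lands on.

```kotlin
// Same cleansing steps as String.cleanseDataFormat() in Part One.
fun cleanseRow(row: String): List<String> =
    row.replace("\\s+".toRegex(), " ")
        .replace("*", "")
        .split(" ")
        .drop(1)

val arsenal = cleanseRow("    1. Arsenal         38    26   9   3    79  -  36    87")
// index:  0     1        2   3   4  5  6   7  8   9
// token:  "1."  Arsenal  38  26  9  3  79  -  36  87

// The separator row collapses into a single token, so it is easy to tell apart.
val separator = cleanseRow("   -------------------------------------------------------")
```

The team name sits at index 1, goals for at index 6 and goals against at index 8, because the "-" between F and A takes up index 7.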

Part Three: DRY Fusion

Take the two programs written previously and factor out as much common code as possible, leaving you with two smaller programs and some kind of shared functionality.

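In other words, as mentioned above, Part Three asks us to pull the common parts of the Part One and Part Two code together and write them as DRY as possible.

I revised the code as follows (if you know a better way, please do point it out in the comments).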

import java.io.File
import kotlin.math.absoluteValue

class DataMunging {
    companion object {
        /**
         * Cleans a raw string row into a neat list of column values.
         */
        private fun String.cleanseDataFormat(): List<String> {
            return this.replace("\\s+".toRegex(), " ")
                    .replace("*", "")
                    .split(" ")
                    .drop(1)
        }

        /**
         * Reads the file and processes each row in mapNotNull with the function passed as an argument.
         */
        fun File.readFileAndProcessDataWith(operation: (String) -> Map<String, String>?): Map<String, String> {
            return this.bufferedReader()
                    .readLines()
                    .drop(1).dropLast(1)
                    .mapNotNull {
                        operation(it)
                    }
                    // sort so that the smallest spread comes first, then pick that element to return
                    .sortedBy { it.getValue("spread").toInt() }[0]
        }

        /**
         * Calculates the temperature spread of one row and returns it with the date,
         * or null for a row without weather info.
         */
        fun calculateTemperatureSpread(row: String): Map<String, String>? {
            // turn the raw string into a cleansed list of column values
            val line = row.cleanseDataFormat()

            // the file contains an empty row that holds no weather info;
            // return null for it instead of processing it
            return if (line.isNotEmpty()) {
                val date = line[0]
                // extract the temperatures needed to calculate the spread
                val maxTemp = line[1].toInt()
                val minTemp = line[2].toInt()

                mapOf("date" to date, "spread" to (maxTemp - minTemp).toString())
            } else null
        }

        /**
         * Calculates the goal spread of one row and returns it with the team name,
         * or null for the separator row.
         */
        fun calculateScoreSpread(row: String): Map<String, String>? {
            // turn the raw string into a cleansed list of column values
            val line = row.cleanseDataFormat()

            // the file contains a pure separator row "--------.." without team info;
            // return null for it instead of processing it
            return if (line.size != 1) {
                val team = line[1]
                // extract the goal counts needed to calculate the spread
                val scoreFor = line[6].toInt()
                val scoreAgainst = line[8].toInt()

                mapOf("team" to team, "spread" to (scoreFor - scoreAgainst).absoluteValue.toString())
            } else null
        }
    }
}

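First, I extracted the processing that calculateTemperatureSpread and calculateScoreSpread had in common before the refactoring into readFileAndProcessDataWith, an extension function on File. As mentioned above, the two differed only in the code inside mapNotNull, so everything else (reading the file, dropping the first and last rows, sorting by the spread value, and so on) moved into this function. One caveat: if football.dat ends with the Leicester row as listed above, dropLast(1) discards that row as well. The result is unaffected here, because Leicester's difference of 34 is nowhere near the smallest.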

The parts that differed are defined in calculateTemperatureSpread and calculateScoreSpread respectively, and whichever of the two functions is passed in runs as operation inside mapNotNull.
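As one more possible take on the same idea, here is a sketch that keeps the "shared pipeline plus pluggable row parser" structure but swaps the Map<String, String> for a small data class and skips unusable rows by checking whether the numbers parse, rather than by position. Spread, smallestSpread, tokens, parseWeather and parseFootball are all hypothetical names introduced for this example, and minByOrNull assumes Kotlin 1.4 or later.

```kotlin
import kotlin.math.abs

// Hypothetical result type replacing the Map<String, String> with typed fields.
data class Spread(val label: String, val spread: Int)

// Shared part: parse every line, skip the ones the parser rejects, keep the minimum.
fun smallestSpread(lines: List<String>, parse: (String) -> Spread?): Spread? =
    lines.mapNotNull(parse).minByOrNull { it.spread }

// Strips the "*" markers and splits a row on runs of whitespace.
fun tokens(row: String): List<String> =
    row.replace("*", "").trim().split("\\s+".toRegex())

// weather.dat: day, max temperature, min temperature are tokens 0, 1, 2.
// Header, blank and "mo" rows return null because their values are not integers.
fun parseWeather(row: String): Spread? {
    val t = tokens(row)
    val max = t.getOrNull(1)?.toIntOrNull() ?: return null
    val min = t.getOrNull(2)?.toIntOrNull() ?: return null
    return Spread(t[0], max - min)
}

// football.dat: team name, goals for, goals against are tokens 1, 6, 8.
// Header and separator rows return null for the same reason.
fun parseFootball(row: String): Spread? {
    val t = tokens(row)
    val goalsFor = t.getOrNull(6)?.toIntOrNull() ?: return null
    val goalsAgainst = t.getOrNull(8)?.toIntOrNull() ?: return null
    return Spread(t[1], abs(goalsFor - goalsAgainst))
}
```

With this shape, a call such as smallestSpread(File("resources/weather.dat").readLines(), ::parseWeather) needs no drop(1).dropLast(1), so no data row is lost regardless of how the file starts or ends. The trade-off is that a malformed data row is silently skipped instead of raising an exception.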

Let's check with test code that it works as intended.

class DataMungingTest {
    @Test
    fun testCalculateTemperatureSpread() {
        val temperatureSpread = File("resources/weather.dat")
                .readFileAndProcessDataWith(::calculateTemperatureSpread)

        assertEquals("14", temperatureSpread["date"])
        assertEquals("2", temperatureSpread["spread"])
    }

    @Test
    fun testCalculateScoreSpread() {
        val scoreSpread = File("resources/football.dat")
                .readFileAndProcessDataWith(::calculateScoreSpread)

        assertEquals("Aston_Villa", scoreSpread["team"])
        assertEquals("1", scoreSpread["spread"])
    }
}

Both test cases pass, confirming that the code works as intended!

Summary

There are many possible approaches, especially for the shared processing in Part Three. I worked it out in my own way, but if you know a cooler way to do it, please do let me know!

Thank you for reading!

Related Articles

Trying CodeKata in Kotlin: Karate Chop
Trying CodeKata in Kotlin: Data Munging
[Trying CodeKata in Kotlin: Bloom Filters](https://qiita.com/Takuyaaaa/items/eaa3848bce3bccdd946f)
[Trying CodeKata in Kotlin: Anagrams](https://qiita.com/Takuyaaaa/items/df06e24e6f2c7f8ced35)
[Trying CodeKata in Kotlin: Checkout](https://qiita.com/Takuyaaaa/items/0a4b82e30c977444c0bc)
[Trying CodeKata in Kotlin: Sorting it Out](https://qiita.com/Takuyaaaa/items/b5210c53bff3ff5f0512)
[Trying CodeKata in Kotlin: Counting Code Lines](https://qiita.com/Takuyaaaa/items/cb9143fbcb9e0b2a7822)
[Trying CodeKata in Kotlin: Tom Swift Under the Milkwood](https://qiita.com/Takuyaaaa/items/feccf69d2b9d95196a72)
[Trying CodeKata in Kotlin: Transitive Dependencies](https://qiita.com/Takuyaaaa/items/9b43473b8feffe1ce9f7)
[Trying CodeKata in Kotlin: Word Chains](https://qiita.com/Takuyaaaa/items/2539338252ad7e19ba18)
[Trying CodeKata in Kotlin: Simple Lists](https://qiita.com/Takuyaaaa/items/36ef73522bfe8d054448)
