ビッグデータ分析-SQLの概要

SQLは、構造化照会言語の略です。これは、従来のデータウェアハウスおよびビッグデータテクノロジーのデータベースからデータを抽出するために最も広く使用されている言語の1つです。 SQLの基本を示すために、例を使って作業します。言語自体に焦点を合わせるために、R内でSQLを使用します。 SQLコードの記述に関しては、これはデータベースで行われるのとまったく同じです。

SQLのコアは、SELECT、FROM、およびWHEREの3つのステートメントです。次の例では、SQLの最も一般的な使用例を使用しています。 bda/part2/SQL_introduction フォルダーに移動し、 SQL_introduction.Rproj ファイルを開きます。次に、01_select.Rスクリプトを開きます。 RでSQLコードを作成するには、次のコードに示すように sqldf パッケージをインストールする必要があります。

# Install the sqldf package
install.packages('sqldf')

# load the library
library('sqldf')
library(nycflights13)

# We will be working with the fligths dataset in order to introduce SQL

# Let’s take a look at the table
str(flights)
# Classes 'tbl_d', 'tbl' and 'data.frame': 336776 obs. of  16 variables:

# $ year     : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
# $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
# $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
# $ dep_time : int  517 533 542 544 554 554 555 557 557 558 ...
# $ dep_delay: num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
# $ arr_time : int  830 850 923 1004 812 740 913 709 838 753 ...
# $ arr_delay: num  11 20 33 -18 -25 12 19 -14 -8 8 ...
# $ carrier  : chr  "UA" "UA" "AA" "B6" ...

# $ tailnum  : chr  "N14228" "N24211" "N619AA" "N804JB" ...
# $ flight   : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
# $ origin   : chr  "EWR" "LGA" "JFK" "JFK" ...
# $ dest     : chr  "IAH" "IAH" "MIA" "BQN" ...
# $ air_time : num  227 227 160 183 116 150 158 53 140 138 ...
# $ distance : num  1400 1416 1089 1576 762 ...
# $ hour     : num  5 5 5 5 5 5 5 5 5 5 ...
# $ minute   : num  17 33 42 44 54 54 55 57 57 58 ...

selectステートメントは、テーブルから列を取得し、それらに対して計算を行うために使用されます。最も簡単なSELECTステートメントは、 ej1 で示されています。 ej2 に示すように、新しい変数を作成することもできます。

### SELECT statement
ej1 = sqldf("
   SELECT
   dep_time
   ,dep_delay
   ,arr_time
   ,carrier
   ,tailnum
   FROM
   flights
")

head(ej1)
#    dep_time   dep_delay  arr_time  carrier  tailnum
# 1      517         2      830      UA       N14228
# 2      533         4      850      UA       N24211
# 3      542         2      923      AA       N619AA
# 4      544        -1     1004      B6       N804JB
# 5      554        -6      812      DL       N668DN
# 6      554        -4      740      UA       N39463

# In R we can use SQL with the sqldf function. It works exactly the same as in
a database

# The data.frame (in this case flights) represents the table we are querying
and goes in the FROM statement
# We can also compute new variables in the select statement using the syntax:

# old_variables as new_variable
ej2 = sqldf("
   SELECT
   arr_delay - dep_delay as gain,
   carrier
   FROM
   flights
")

ej2[1:5, ]
#    gain   carrier
# 1    9      UA
# 2   16      UA
# 3   31      AA
# 4  -17      B6
# 5  -19      DL

SQLで最もよく使用される機能の1つは、group byステートメントです。これにより、別の変数の異なるグループの数値を計算できます。スクリプト02_group_by.Rを開きます。

### GROUP BY

# Computing the average
ej3 = sqldf("
  SELECT
   avg(arr_delay) as mean_arr_delay,
   avg(dep_delay) as mean_dep_delay,
   carrier
   FROM
   flights
   GROUP BY
   carrier
")

#    mean_arr_delay   mean_dep_delay carrier
# 1       7.3796692      16.725769      9E
# 2       0.3642909       8.586016      AA
# 3      -9.9308886       5.804775      AS
# 4       9.4579733      13.022522      B6
# 5       1.6443409       9.264505      DL
# 6      15.7964311      19.955390      EV
# 7      21.9207048      20.215543      F9
# 8      20.1159055      18.726075      FL
# 9      -6.9152047       4.900585      HA
# 10     10.7747334      10.552041      MQ
# 11     11.9310345      12.586207      OO
# 12      3.5580111      12.106073      UA
# 13      2.1295951       3.782418      US
# 14      1.7644644      12.869421      VX
# 15      9.6491199      17.711744      WN
# 16     15.5569853      18.996330      YV

# Other aggregations
ej4 = sqldf("
   SELECT
   avg(arr_delay) as mean_arr_delay,
   min(dep_delay) as min_dep_delay,
   max(dep_delay) as max_dep_delay,
   carrier
   FROM
   flights
   GROUP BY
   carrier
")

# We can compute the minimun, mean, and maximum values of a numeric value
ej4
#      mean_arr_delay    min_dep_delay   max_dep_delay   carrier
# 1       7.3796692           -24           747          9E
# 2       0.3642909           -24          1014          AA
# 3      -9.9308886           -21           225          AS
# 4       9.4579733           -43           502          B6
# 5       1.6443409           -33           960         DL
# 6      15.7964311           -32           548         EV
# 7      21.9207048           -27           853         F9
# 8      20.1159055           -22           602         FL
# 9      -6.9152047           -16          1301         HA
# 10     10.7747334           -26          1137         MQ
# 11     11.9310345           -14           154         OO
# 12      3.5580111           -20           483         UA
# 13      2.1295951           -19           500         US
# 14      1.7644644           -20           653         VX
# 15      9.6491199           -13           471         WN
# 16     15.5569853           -16           387         YV

### We could be also interested in knowing how many observations each carrier has
ej5 = sqldf("
   SELECT
   carrier, count(*) as count
   FROM
   flights
   GROUP BY
   carrier
")

ej5
#      carrier  count
# 1       9E    18460
# 2       AA   32729
# 3       AS   714
# 4       B6   54635
# 5       DL   48110
# 6       EV   54173
# 7       F9   685
# 8       FL   3260
# 9       HA   342
# 10      MQ   26397
# 11      OO   32
# 12      UA   58665
# 13      US   20536
# 14      VX   5162
# 15      WN   12275
# 16      YV   601

SQLの最も便利な機能は結合です。結合とは、1つの列を使用して両方のテーブルの値を一致させる1つのテーブルでテーブルAとテーブルBを結合することを意味します。実際には、さまざまな種類の結合がありますが、開始するにはこれらが最も有用なものになります。内部結合と左外部結合です。

# Let’s create two tables: A and B to demonstrate joins.
A = data.frame(c1 = 1:4, c2 = letters[1:4])
B = data.frame(c1 = c(2,4,5,6), c2 = letters[c(2:5)])

A
# c1 c2
# 1  a
# 2  b
# 3  c
# 4  d

B
# c1 c2
# 2  b
# 4  c
# 5  d
# 6  e

### INNER JOIN
# This means to match the observations of the column we would join the tables by.
inner = sqldf("
   SELECT
   A.c1, B.c2
   FROM
   A INNER JOIN B
   ON A.c1 = B.c1
")

# Only the rows that match c1 in both A and B are returned
inner
# c1 c2
#  2  b
#  4  c

### LEFT OUTER JOIN
# the left outer join, sometimes just called left join will return the
# first all the values of the column used from the A table
left = sqldf("
  SELECT
   A.c1, B.c2
  FROM
   A LEFT OUTER JOIN B
   ON A.c1 = B.c1
")

# Only the rows that match c1 in both A and B are returned
left
#   c1    c2
#    1  <NA>
#    2    b
#    3  <NA>
#    4    c

Big-data-analytics-introduction-to-sql

ビッグデータ分析-SQLの概要