Big Data Analytics - Data Exploration

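Exploratory data analysis (EDA) is a concept developed by John Tukey (1977) that rests on a new perspective on statistics. Tukey's view was that in traditional statistics, data was not explored graphically and was used only to test hypotheses. The first attempt to build a tool for this took place at Stanford, in a project called http://stat-graphics.org/movies/prim9l[prim9]. The tool could visualize data in nine dimensions, so it offered a multivariate view of the data.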

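Today, exploratory data analysis is a must and is part of the big data analytics life cycle. Strong EDA capabilities improve the ability to find insights and to communicate them effectively within an organization.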

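Building on Tukey's ideas, Bell Labs developed the S programming language to provide an interactive interface for doing statistics. The idea behind S was to offer extensive graphical capabilities in an easy-to-use language. Today, in the big data context, R, which is based on the S programming language, is one of the most popular tools for analytics.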

Figure: Top analytics packages

The following program shows an example of exploratory data analysis. The code is also available in the part1/eda/exploratory_data_analysis.R file.

library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)

# Using the code from the previous section
# This computes the mean arrival and departure delays by carrier.
DT <- as.data.table(flights)
mean2 = DT[, list(mean_departure_delay = mean(dep_delay, na.rm = TRUE),
   mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
   by = carrier]

# To plot data in R using ggplot, it is normally necessary to reshape the data
# We want to have the data in long format for plotting with ggplot
dt = melt(mean2, id.vars = 'carrier')

# Take a look at the first rows
print(head(dt))

# Take a look at the help for ?geom_point and geom_line to find similar examples
# Here we take the carrier code as the x axis
# the value from the dt data.table goes in the y axis

# The variable column represents the color
p = ggplot(dt, aes(x = carrier, y = value, color = variable, group = variable)) +
   geom_point() + # Plots points
   geom_line() + # Plots lines
   theme_bw() + # Uses a white background
   labs(title = 'Mean arrival and departure delay by carrier',
      x = 'Carrier', y = 'Mean delay')
print(p)

# Save the plot to disk
ggsave('mean_delay_by_carrier.png', p,
   width = 10.4, height = 5.07)

The code should produce an image like the following:

Figure: Mean delay
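The reshape step in the program above is the part that tends to be least obvious, so here is a minimal sketch of it on a toy table. The carrier codes and delay values are made up for illustration and are not taken from nycflights13. The sketch assumes only the data.table package, whose own melt and dcast functions are used here instead of reshape2.

```r
library(data.table)

# Toy wide table: one row per carrier, one column per delay measure.
# These numbers are invented for illustration only.
wide <- data.table(
  carrier = c("AA", "UA"),
  mean_departure_delay = c(8.5, 12.0),
  mean_arrival_delay = c(0.5, 3.5)
)

# melt() stacks the measure columns into a 'variable'/'value' pair,
# giving one row per (carrier, measure) combination.
long <- melt(wide, id.vars = "carrier")
print(long)

# dcast() reverses the reshape, going back to one column per measure.
wide_again <- dcast(long, carrier ~ variable, value.var = "value")
print(wide_again)
```

In the long table, each measure becomes a level of the variable column. That is what lets ggplot map color = variable and draw one line per measure from a single y = value aesthetic.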
