数据库作业 - chapter 3 #
使用了On_Time_Reporting_Carrier_On_Time_Performance_1987_present_2022_2.csv中0~999条数据作为样本
各机场与航班延误的关系 #
with a as(with delay1 as (select Origin from airline where DepDelay > 0)
select *, count(*) as cnt1 from delay1 group by Origin),
b as (with delay2 as (select Dest from airline where ArrDelay > 0)
select *, count(*) as cnt2 from delay2 group by Dest)
select * from a, b where Origin = Dest order by cnt1 desc;

注意到机场产生的航班延误中,产生终点延误多的机场常常也作为始发地具有较多的延误。
其中,JFK、ATL、LGA延误次数最多。除了JFK、CHS、EWR、SBN以外,大部分机场延误都是终点延误。
占比关系:
航班目的地与始发地对延误的影响 #

注意到纽约、亚特兰大、伯灵顿都具有很高的航班延误次数。另外,纽瓦克与夏洛茨维尔在终点延误次数发生上也很多。
航班航行距离与平均延误时间的关系 #
with a as(with delay1 as (select * from airline where DepDelay > 0)
select Distance, avg(DepDelayMinutes) as t1 from delay1 group by Distance),
b as (with delay2 as (select * from airline where ArrDelay > 0)
select Distance, avg(ArrDelayMinutes) as t2 from delay2 group by Distance)
select * from a, b where a.Distance = b.Distance order by t1 desc;


航班延误与距离没有明显线性相关关系 ,以出发延误为例,在距离200~1200的范围内,大部分航班在延误在1小时以内。而延误较长的航班通常都超过2小时。
导致延误次数最多的是哪一个已知因素 #
以出发延误为例:
with delay1 as (select * from airline where DepDelay > 0)
select Year,
count(Year),
sum(case when CarrierDelay>0 then 1 else 0 END)/count(Year),
sum(case when WeatherDelay>0 then 1 else 0 END)/count(Year),
sum(case when NASDelay>0 then 1 else 0 END)/count(Year),
sum(case when SecurityDelay>0 then 1 else 0 END)/count(Year),
sum(case when LateAircraftDelay>0 then 1 else 0 END)/count(Year)
from delay1 group by Year;

注意到影响大小的先后顺序为NAS>LateAircraft>Carrier>Weather>Security。
with delay1 as (select * from airline where DepDelay > 0)
select OriginCityName,
sum(case when CarrierDelay>0 then 1 else 0 END),
sum(case when WeatherDelay>0 then 1 else 0 END),
sum(case when NASDelay>0 then 1 else 0 END),
sum(case when SecurityDelay>0 then 1 else 0 END),
sum(case when LateAircraftDelay>0 then 1 else 0 END)
from delay1 group by OriginCityName;

对于不同城市,它们延误情况的主要影响因素各不相同。
航班的日期与延误的影响 #
with a as(with delay1 as (select DayOfWeek from airline where DepDelay > 0)
select *, count(*) as cnt1 from delay1 group by DayOfWeek),
b as (with delay2 as (select DayOfWeek from airline where ArrDelay > 0)
select *, count(*) as cnt2 from delay2 group by DayOfWeek)
select a.DayOfWeek, (cnt1+cnt2) as cnt from a, b where a.DayOfWeek = b.DayOfWeek order by cnt desc;

显然从周日、周一,周四、周五航班延误次数最多,周三延误最少。这可能和大多数人的出差/通勤习惯相关。
with a as(with delay1 as (select DayOfMonth from airline where DepDelay > 0)
select *, count(*) as cnt1 from delay1 group by DayOfMonth),
b as (with delay2 as (select DayOfMonth from airline where ArrDelay > 0)
select *, count(*) as cnt2 from delay2 group by DayOfMonth)
select a.DayOfMonth, (cnt1+cnt2) as cnt from a, b where a.DayOfMonth = b.DayOfMonth order by cnt desc;

从每月的情况来看,可以观察到大致以一周七天为一个周期的延误次数变化。越靠近月末,延误次数就越多。