使用sqlite / PHP获取两行/列中两个数字的方差和标准差


Get variance and standard deviation of two numbers in two different rows/columns with sqlite / PHP

我有一个SQLite数据库,结构如下:

rowid       ID                  startTimestamp   endTimestamp   subject
1           00:50:c2:63:10:1a   1000             1090           entrance
2           00:50:c2:63:10:1a   1100             1270           entrance
3           00:50:c2:63:10:1a   1300             1310           door1
4           00:50:c2:63:10:1a   1370             1400           entrance
.
.
.

我准备了一个sqlfiddle: http://sqlfiddle.com/#!2/fe8c6/2

使用这个SQL-Query,我可以得到一行和下一行之间的endTime和startTime的平均差值,按主题和ID排序:

SELECT
    id,
    ( MAX(endtimestamp) - MIN(startTimestamp)
    - SUM(endtimestamp-startTimestamp)
    ) / (COUNT(*)-1) AS averageDifference
FROM
    table1
WHERE ID = '00:50:c2:63:10:1a'
AND subject = 'entrance'
GROUP BY id;

我的问题是:计算平均值是没有问题的,这就是这个查询。但我怎么能得到这些值的标准差和方差?

首先通过将表连接到自身并按ID分组来找到感兴趣的时差,然后找到平均值,方差为V(x) = E(x^2) - (E(x))^2,标准差为sqrt(V),得出

SELECT ID, AVG(diff) AS average,
       AVG(diff*diff) - AVG(diff)*AVG(diff) AS variance,
       SQRT(AVG(diff*diff) - AVG(diff)*AVG(diff)) AS stdev
FROM
    (SELECT t1.id, t1.endTimestamp,
            min(t2.startTimeStamp) - t1.endTimestamp AS diff
    FROM table1 t1
    INNER JOIN table1 t2
    ON t2.ID = t1.ID AND t2.subject = t1.subject
    AND t2.startTimestamp > t1.startTimestamp  -- consider only later startTimestamps
    WHERE t1.subject = 'entrance'
    GROUP BY t1.id, t1.endTimestamp) AS diffs
GROUP BY ID

对于比简单求和更复杂的公式,您必须通过查找相应的下一个开始时间来计算每个记录的实际差值,如下所示:

SELECT (SELECT MIN(startTimestamp)
        FROM table1 AS next
        WHERE next.startTimestamp > table1.startTimestamp
          AND ID = '...'
       ) - endTimestamp AS timeDifference
FROM table1
WHERE nextStartTimestamp IS NOT NULL
  AND ID = '...'

然后你可以使用所有的不同值来进行计算:

SELECT SUM(timeDifference) / COUNT(*) AS average,
       AVG(timeDifference)            AS moreEfficientAverage,
       SUM(timeDifference * timeDifference) / COUNT(*) -
       AVG(timeDifference) * AVG(timeDifference) AS variance
FROM (SELECT (SELECT MIN(startTimestamp)
              FROM table1 AS next
              WHERE next.startTimestamp > table1.startTimestamp
                AND next.ID = '...'
             ) - endTimestamp AS timeDifference
      FROM table1
      WHERE nextStartTimestamp IS NOT NULL
        AND ID = '...')

若干点:

    你的平均值公式是错误的,正确的公式是SUM(endtimestamp-starttimestamp)/COUNT(endtimestamp)。我不知道为什么有MIN/MAX项。COUNT(*)将计算NULL行,并将给出错误的结果。
  1. sqlite有一个avg函数来查找平均值。
  2. 方差公式为SUM((endtimestamp-starttimestamp)*(endtimestamp-starttimestamp)) - AVG(endtimestamp-starttimestamp)*AVG(endtimestamp-starttimestamp)
  3. 标准差是方差的平方根。

作为对问题作者评论的回应,为了计算方差,开始和结束时间必须通过自连接相互配对。

由于SQL lite中没有row_number函数,这有点不美观。

SELECT id,
       AVG(startTimestamp-endTimestamp) as mean,
       SUM((startTimestamp-endTimestamp)^2) - AVG(startTimestamp-endTimestamp)^2 as variance,
       SQRT(SUM((startTimestamp-endTimestamp)^2) - AVG(startTimestamp-endTimestamp)^2) as stDev
FROM
    (SELECT
        t1.id,
        t1.endTimestamp,
        MIN(t2.startTimestamp) as starttimestamp
    FROM table1 t1
         INNER JOIN
         table1 t2 ON t1.endTimestamp<=t2.startTimestamp
    GROUP BY t1.id, t1.endTimestamp) t
GROUP BY id;

参见SQL Fiddle