How do I get box plot key numbers from an array in PHP?

Say I have an array with values like this:

$values = array(48,30,97,61,34,40,51,33,1);

From these values I want to get the numbers needed to draw a box plot, like this:

$box_plot_values = array(
    'lower_outlier'  => 1,
    'min'            => 8,
    'q1'             => 32,
    'median'         => 40,
    'q3'             => 56,
    'max'            => 80,
    'higher_outlier' => 97,
);

How would I go about doing this in PHP?

function box_plot_values($array)
{
    $return = array(
        'lower_outlier'  => 0,
        'min'            => 0,
        'q1'             => 0,
        'median'         => 0,
        'q3'             => 0,
        'max'            => 0,
        'higher_outlier' => 0,
    );
    $array_count = count($array);
    sort($array, SORT_NUMERIC);
    $return['min']            = $array[0];
    $return['lower_outlier']  = $return['min'];
    $return['max']            = $array[$array_count - 1];
    $return['higher_outlier'] = $return['max'];
    $middle_index             = floor($array_count / 2);
    $return['median']         = $array[$middle_index]; // Assume an odd # of items
    $lower_values             = array();
    $higher_values            = array();
    // If we have an even number of values, we need some special rules
    if ($array_count % 2 == 0)
    {
        // Handle the even case by averaging the middle 2 items
        $return['median'] = round(($return['median'] + $array[$middle_index - 1]) / 2);
        foreach ($array as $idx => $value)
        {
            if ($idx < ($middle_index - 1)) $lower_values[]  = $value; // We need to remove both of the values we used for the median from the lower values
            elseif ($idx > $middle_index)   $higher_values[] = $value;
        }
    }
    else
    {
        foreach ($array as $idx => $value)
        {
            if ($idx < $middle_index)     $lower_values[]  = $value;
            elseif ($idx > $middle_index) $higher_values[] = $value;
        }
    }
    $lower_values_count = count($lower_values);
    $lower_middle_index = floor($lower_values_count / 2);
    $return['q1']       = $lower_values[$lower_middle_index];
    if ($lower_values_count % 2 == 0)
        $return['q1'] = round(($return['q1'] + $lower_values[$lower_middle_index - 1]) / 2);
    $higher_values_count = count($higher_values);
    $higher_middle_index = floor($higher_values_count / 2);
    $return['q3']        = $higher_values[$higher_middle_index];
    if ($higher_values_count % 2 == 0)
        $return['q3'] = round(($return['q3'] + $higher_values[$higher_middle_index - 1]) / 2);
    // Check if min and max should be capped
    $iqr = $return['q3'] - $return['q1']; // Calculate the interquartile range (IQR)
    if ($return['q1'] > $iqr)                  $return['min'] = $return['q1'] - $iqr;
    if ($return['max'] - $return['q3'] > $iqr) $return['max'] = $return['q3'] + $iqr;
    return $return;
}
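
As a check, the quartile steps can be traced by hand on the array from the question. The sketch below is not part of the answer above: `median_of_sorted` is a hypothetical helper that applies the same rounding, and the halves are sliced for the odd-length case only (for an even count the function above also drops both middle values).

```php
<?php
// Hypothetical helper: median of an already sorted, non-empty list,
// rounded the same way as in the function above.
function median_of_sorted(array $sorted)
{
    $count  = count($sorted);
    $middle = intdiv($count, 2);
    if ($count % 2 === 0) {
        return round(($sorted[$middle - 1] + $sorted[$middle]) / 2);
    }
    return $sorted[$middle];
}

$values = array(48, 30, 97, 61, 34, 40, 51, 33, 1);
sort($values, SORT_NUMERIC);            // 1, 30, 33, 34, 40, 48, 51, 61, 97

$middle = intdiv(count($values), 2);    // index 4; odd count, so the median sits in neither half
$median = median_of_sorted($values);                            // 40
$q1     = median_of_sorted(array_slice($values, 0, $middle));   // round(31.5) = 32
$q3     = median_of_sorted(array_slice($values, $middle + 1));  // 56
$iqr    = $q3 - $q1;                                            // 24

// Same capping rule as the function above: one IQR below q1 and above q3.
$last = $values[count($values) - 1];                  // 97
$min  = ($q1 > $iqr) ? $q1 - $iqr : $values[0];       // 32 - 24 = 8
$max  = ($last - $q3 > $iqr) ? $q3 + $iqr : $last;    // 56 + 24 = 80
```

These are the same numbers as in the question's expected array. Note that 8 and 80 are computed limits, not values that occur in the data.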

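Lilleman's code is excellent. I really appreciate the way he handles the median and q1/q3. Had I answered this first, I would have dealt with odd and even counts in a harder and unnecessary way, namely four separate ifs for the four cases of mod( count( values ), 4 ). His way is simply neat, and I really admire his work.

I would like to improve how max, min, higher_outlier and lower_outlier are computed. q1 - 1.5*IQR is only the lower bound, so the 'min' should be the smallest value that is not below that bound. The same goes for 'max' and the upper bound q3 + 1.5*IQR. There may also be more than one outlier on each side. So I made a few changes based on Lilleman's work. Thanks.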

function box_plot_values($array)
{
    $return = array(
        'lower_outlier'  => 0,
        'min'            => 0,
        'q1'             => 0,
        'median'         => 0,
        'q3'             => 0,
        'max'            => 0,
        'higher_outlier' => 0,
    );
    $array_count = count($array);
    sort($array, SORT_NUMERIC);
    $return['min']            = $array[0];
    $return['lower_outlier']  = array();
    $return['max']            = $array[$array_count - 1];
    $return['higher_outlier'] = array();
    $middle_index             = floor($array_count / 2);
    $return['median']         = $array[$middle_index]; // Assume an odd # of items
    $lower_values             = array();
    $higher_values            = array();
    // If we have an even number of values, we need some special rules
    if ($array_count % 2 == 0)
    {
        // Handle the even case by averaging the middle 2 items
        $return['median'] = round(($return['median'] + $array[$middle_index - 1]) / 2);
        foreach ($array as $idx => $value)
        {
            if ($idx < ($middle_index - 1)) $lower_values[]  = $value; // We need to remove both of the values we used for the median from the lower values
            elseif ($idx > $middle_index)   $higher_values[] = $value;
        }
    }
    else
    {
        foreach ($array as $idx => $value)
        {
            if ($idx < $middle_index)     $lower_values[]  = $value;
            elseif ($idx > $middle_index) $higher_values[] = $value;
        }
    }
    $lower_values_count = count($lower_values);
    $lower_middle_index = floor($lower_values_count / 2);
    $return['q1']       = $lower_values[$lower_middle_index];
    if ($lower_values_count % 2 == 0)
        $return['q1'] = round(($return['q1'] + $lower_values[$lower_middle_index - 1]) / 2);
    $higher_values_count = count($higher_values);
    $higher_middle_index = floor($higher_values_count / 2);
    $return['q3']        = $higher_values[$higher_middle_index];
    if ($higher_values_count % 2 == 0)
        $return['q3'] = round(($return['q3'] + $higher_values[$higher_middle_index - 1]) / 2);
    // Check if min and max should be capped
    $iqr = $return['q3'] - $return['q1']; // Calculate the interquartile range (IQR)
    $return['min'] = $return['q1'] - 1.5 * $iqr; // This (q1 - 1.5*IQR) is actually the lower bound.
                                                 // Every value in the lower half is compared to it:
                                                 // values below the bound are outliers, and the
                                                 // smallest value that is not below the bound is
                                                 // the 'min' of the box plot.
    foreach ($lower_values as $idx => $value)
    {
        if ($value < $return['min']) // the value is below the bound
        {
            $return['lower_outlier'][$idx] = $value; // Keeping the index may seem unnecessary, but it is
                                                     // the value's position in the sorted array, so it
                                                     // can be used to identify the outliers later.
        }
        else
        {
            $return['min'] = $value; // first value that is not below the bound
            break;                   // stop here so 'min' stays the smallest such value
        }
    }
    $return['max'] = $return['q3'] + 1.5 * $iqr; // This (q3 + 1.5*IQR) is the upper bound; same idea as above.
    // array_reverse() renumbers the keys, so $idx counts from the largest value (0) downwards.
    foreach (array_reverse($higher_values) as $idx => $value)
    {
        if ($value > $return['max'])
        {
            $return['higher_outlier'][$idx] = $value;
        }
        else
        {
            $return['max'] = $value;
            break;
        }
    }
    return $return;
}
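
To make the difference between a bound and a whisker end concrete, here is a minimal trace on the question's data. It hard-codes q1 = 32 and q3 = 56, which is what the quartile steps above give for that array; the variable names are only illustrative.

```php
<?php
$sorted = array(1, 30, 33, 34, 40, 48, 51, 61, 97); // the question's values, sorted
$q1     = 32;                                       // assumed: quartiles for this data
$q3     = 56;
$iqr    = $q3 - $q1;                                // 24

$lower_bound = $q1 - 1.5 * $iqr;                    // 32 - 36 = -4
$upper_bound = $q3 + 1.5 * $iqr;                    // 56 + 36 = 92

$inside   = array();
$outliers = array();
foreach ($sorted as $value) {
    if ($value < $lower_bound || $value > $upper_bound) {
        $outliers[] = $value;   // beyond a bound: reported as an outlier
    } else {
        $inside[] = $value;     // within the bounds: candidate for a whisker end
    }
}

$min = $inside[0];                   // 1, the smallest value not below -4
$max = $inside[count($inside) - 1];  // 61, the largest value not above 92
// $outliers now holds only 97
```

So for this data the whiskers end at 1 and 61, both real data points, and 97 is the only outlier.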

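I hope this is helpful for anyone interested in this problem. If there is a better way to find out which values are outliers, please leave me a comment. Thanks!

I have a different solution for calculating the lower and upper whiskers. Like ShaoE's solution, it finds the smallest value that is greater than or equal to the lower bound (Q1 - 1.5 * IQR), and vice versa for the upper bound.

I use array_filter, which iterates over an array, passes each value to a callback function, and returns an array containing only the values for which the callback returns true (see the array_filter manual on php.net). In this case, the values greater than or equal to the lower bound are returned and used as input for min, which in turn returns the smallest of them.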

// get lower whisker
$whiskerMin = min(array_filter($array, function($value) use($quartile1, $iqr) {
        return $value >= $quartile1 - 1.5 * $iqr;
    } ));
// get higher whisker vice versa
$whiskerMax = max(array_filter($array, function($value) use($quartile3, $iqr) {
        return $value <= $quartile3 + 1.5 * $iqr;
    } ));

Note that this ignores the outliers, and I have only tested it with positive values.
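
If the outliers are needed as well, one possible extension is sketched below. It is an assumption layered on the snippet above, not part of it: `whiskers_and_outliers` is a hypothetical name, and the quartiles are passed in by the caller just as `$quartile1` and `$quartile3` are assumed to exist above. It relies on array_filter preserving keys, so array_diff_key can recover exactly the values that were filtered out.

```php
<?php
// Hypothetical wrapper around the array_filter approach; not from the answer itself.
function whiskers_and_outliers(array $array, $quartile1, $quartile3)
{
    $iqr   = $quartile3 - $quartile1;
    $lower = $quartile1 - 1.5 * $iqr;
    $upper = $quartile3 + 1.5 * $iqr;

    // Values within both bounds; array_filter keeps the original keys.
    $inside = array_filter($array, function ($value) use ($lower, $upper) {
        return $value >= $lower && $value <= $upper;
    });

    // Everything array_filter dropped is an outlier.
    $outliers = array_values(array_diff_key($array, $inside));

    // Assumes at least one value lies within the bounds; min()/max() fail on an empty array.
    return array(
        'whiskerMin' => min($inside),
        'whiskerMax' => max($inside),
        'outliers'   => $outliers,
    );
}

// Question's data, with the quartiles computed earlier in this thread (32 and 56).
$result = whiskers_and_outliers(array(48, 30, 97, 61, 34, 40, 51, 33, 1), 32, 56);
```

For this input `$result['whiskerMin']` is 1, `$result['whiskerMax']` is 61, and `$result['outliers']` contains only 97.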