Box plot frequently asked questions

What is a box plot?
A box plot (also known as a box-and-whisker plot) is a graphical representation of the distribution of a dataset. It shows the minimum, first quartile (Q1), median, third quartile (Q3), and maximum, providing a summary of the data's range, central tendency, and variability.
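As an illustration, the five values can be computed with Python's standard library. This is a minimal sketch, not the tool's own calculation: the function name `five_number_summary` is hypothetical, the data are the six Machine 1 diameters from the bolt example later in this article, and the `inclusive` quartile method is an assumption (quartile conventions differ between software packages, so Q1 and Q3 may not match another tool exactly).

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # Assumption: the "inclusive" quartile convention. Other conventions
    # give slightly different Q1 and Q3 for the same data.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Machine 1 diameters from the six sample rows in the bolt example.
machine_1 = [1.512637086, 1.514703275, 1.536790136,
             1.509811708, 1.527360168, 1.517283088]

summary = five_number_summary(machine_1)
for label, value in zip(("min", "Q1", "median", "Q3", "max"), summary):
    print(f"{label:>6}: {value:.6f}")
```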
When should I use a box plot?
Box plots are useful when you want to:
  • Compare the distribution of multiple datasets.
  • Identify the spread and outliers of a dataset.
  • Visualize the symmetry or skewness of the data.
  • Compare central tendencies (e.g., medians) across different categories or groups.
How do I interpret a box plot?
  • Range: The whiskers show the range of the data, excluding outliers.
  • Central Tendency: The median line inside the box shows the central value of the data.
  • Spread: The box spans the interquartile range (IQR), which contains the middle 50% of the data. A longer box indicates more variability, while a shorter box suggests less variability.
  • Outliers: Points outside the whiskers represent potential outliers, indicating unusual data points.
What are the different types of box plots we can create?
    Let's take an example of a manufacturing process that makes bolts. Two machine tools, Machine 1 and Machine 2, manufacture these bolts. The bolt diameter is a critical dimension that must be kept within tolerance. Different operators run the machines over two daily shifts (Shift 1 and Shift 2). A total of 1000 rows of data were collected for this process; the first few rows are shown below:

    Shift   Machine 1     Machine 2
    1       1.512637086   1.461893716
    1       1.514703275   1.499922138
    1       1.536790136   1.454054748
    1       1.509811708   1.502964807
    1       1.527360168   1.538022745
    1       1.517283088   1.358204676

    We want to display the bolt diameters graphically using box plots. The following types of box plots can be created:
    1. Box plot of diameter for all bolts created on Machine 1
    2. Box plot of diameters of bolts comparing Machine 1 and Machine 2
    3. Box plot of diameters of bolts of Machine 1 comparing Shift 1 and Shift 2
    4. Box plot of diameters of bolts comparing Machine 1 and Machine 2 across shifts Shift 1 and Shift 2
    Note that we can have up to 20 categorical variables for this plot. If you have more than 20 categorical variables, you must split the data into multiple groups and create a box plot for each group. You can use the by-variable option for this exercise.
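The four plot types listed above differ only in how the diameters are grouped before the boxes are drawn. The sketch below illustrates that grouping outside the tool, using the six sample rows shown above. The function `group_diameters` and its parameters are hypothetical names introduced for illustration, and the sample rows all belong to Shift 1, so grouping by shift yields only one shift here.

```python
from collections import defaultdict
from statistics import median

# The six sample rows shown above: (shift, Machine 1 diameter, Machine 2 diameter).
rows = [
    (1, 1.512637086, 1.461893716),
    (1, 1.514703275, 1.499922138),
    (1, 1.536790136, 1.454054748),
    (1, 1.509811708, 1.502964807),
    (1, 1.527360168, 1.538022745),
    (1, 1.517283088, 1.358204676),
]

def group_diameters(rows, by_machine=True, by_shift=True):
    """Collect diameters into groups; each group corresponds to one box."""
    groups = defaultdict(list)
    for shift, d1, d2 in rows:
        for machine, diameter in (("Machine 1", d1), ("Machine 2", d2)):
            key = (machine if by_machine else "All machines",
                   f"Shift {shift}" if by_shift else "All shifts")
            groups[key].append(diameter)
    return dict(groups)

# Type 2 in the list above: one box per machine, with the shifts pooled.
per_machine = group_diameters(rows, by_shift=False)
for key, values in per_machine.items():
    print(key, "n =", len(values), "median =", median(values))
```

Calling `group_diameters(rows)` with both flags left on gives one group per machine and shift combination, which corresponds to type 4.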

    Case 1: Box plot of diameters for Machine 1.

    For this case, click on Analysis Setup and specify the Data options as shown below:


    The graphical output is shown below.
    Case 2: Comparison of bolt diameters across Machine 1 and Machine 2

    For this case, click on Analysis Setup and specify the Data options as shown below:


    The graphical output is shown below.

    Case 3: Comparison of bolt diameters of Machine 1 across shifts.

    For this case, click on Analysis Setup and specify the Data options as shown below:


    The graphical output is shown below.


    Case 4: Comparison of bolt diameters across machines and shifts.

    For this case, click on Analysis Setup and specify the Data options as shown below:


    The graphical output is shown below.

    How do we change the order in which the categories (x-axis) appear on the box plot?
    Case 1: When you are plotting multiple Y variables.

    The boxes are drawn in the order in which the variables are entered on the Data tab. For example, if the Data tab is specified as follows, Machine 1 will appear before Machine 2.


    An example output is shown below:


    If we change the order in which the variables are specified, as below, the boxes appear in the new order:




    Case 2: When plotting using a categorical variable

    In this case, the categories appear on the x-axis in alphabetical order. The order cannot be set directly; to change it, rename the categories so that their alphabetical order matches the order you want.
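The renaming workaround can be sketched as follows. This is only an illustration under stated assumptions: the categories `Low`, `Medium`, and `High` are hypothetical, the helper `with_order_prefix` is not part of the tool, and the sketch assumes the labels are sorted as plain text.

```python
def with_order_prefix(categories_in_desired_order):
    """Map each category to a renamed label that sorts in the desired order."""
    # Zero-pad the prefix so that, e.g., "10" does not sort before "2".
    width = len(str(len(categories_in_desired_order)))
    return {
        name: f"{index:0{width}d} {name}"
        for index, name in enumerate(categories_in_desired_order, start=1)
    }

# Hypothetical categories, listed in the order we want on the x-axis.
desired = ["Low", "Medium", "High"]

print(sorted(desired))           # alphabetical order: ['High', 'Low', 'Medium']
renamed = with_order_prefix(desired)
print(sorted(renamed.values()))  # ['1 Low', '2 Medium', '3 High']
```

The renamed labels would replace the original category values in the data column before the chart is generated.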

    How do we generate two-level categories on the box plot?
    Case 1: One Y variable and two X variables
    Click on Analysis Setup > Data and select two categories to plot the box plot. Here, we are creating a box plot based on procedure and department. Drag and drop both categories under Categorical Variables.



    An example output is shown below.


    If we reverse the order of the categories:



    We get the following chart.


    Case 2: Two Y variables and one X variable
    We can also have a case where we have two Y variables and one X variable, as shown in the figure below. Click on Analysis Setup > Data and specify the following:


    A sample output is shown below.


    How do we add a horizontal reference line on the box plot?
    To add a horizontal reference line on a chart, click on Analysis Setup > Charts and specify the value in the Horizontal Ref Lines as shown below:


    The resulting chart will have a reference line added to the chart, as shown below:

    Notes
    If you need more than one horizontal line, specify multiple values separated by semicolons.

    Can we show the raw data points on a box plot?
    You can superimpose the raw data points on the box plot by clicking on Analysis Setup > Setup > Add Points and selecting Show Data. The following dialog box is an example of the setup tab.



    An example output is shown below.


    Notes
    Note that the raw data points are randomly dispersed on the plot, and each time you click on Compute Outputs, you may get a different dispersion of the data.
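The changing dispersion is typical of random jitter, where each point is shifted sideways by a small random amount so that overlapping values stay visible. The sketch below shows the general technique only; it is not the tool's implementation, and `jitter_positions`, `half_width`, and `seed` are hypothetical names. It also shows that fixing the seed of the random number generator makes the dispersion repeatable in code you control.

```python
import random

def jitter_positions(n_points, center, half_width=0.1, seed=None):
    """Return n_points x-positions scattered uniformly around center."""
    rng = random.Random(seed)  # seed=None gives a different scatter each run
    return [center + rng.uniform(-half_width, half_width)
            for _ in range(n_points)]

# Unseeded: the positions change on every call.
print(jitter_positions(3, center=1.0))

# Seeded: the same positions are produced every time.
print(jitter_positions(3, center=1.0, seed=42) ==
      jitter_positions(3, center=1.0, seed=42))
```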
    Can we show or hide the outliers on the box plot?
    We can control the display of the outliers on a box plot by clicking on Analysis Setup > Setup > Outliers and specifying either Yes or No.


    The following figure shows the box plot without outliers.


    The following figure shows the same box plot with outliers. Note the outliers shown in red.

    Notes
    The conclusion in the session window displays the number of outliers found on the chart. To display the value of an outlier, place your cursor over it on the chart.
    How do I get a list of all outlier values from a box plot?
    You can unhide the temporary columns that store the values used for creating the graphs (typically columns D:AD). The outliers are listed in one of these columns. An example of one of the charts is shown below.
    Can I connect the two boxes displayed on the box plot?
    You can connect the mean values of the boxes by setting Analysis Setup > Setup > Connector to Yes. You can display the mean values by selecting Show Mean under Add Points.



    What does each part of a box plot represent?
  • Box (Interquartile Range - IQR): The central box represents the middle 50% of the data, from Q1 (25th percentile) to Q3 (75th percentile).
  • Median Line: A line inside the box represents the median (50th percentile) of the data.
  • Whiskers: The lines extending from the box (whiskers) show the range of the data, excluding outliers. Each whisker extends from the box to the most extreme data point that lies within 1.5 times the IQR of the box, that is, no lower than Q1 - 1.5 × IQR and no higher than Q3 + 1.5 × IQR.
  • Outliers: Data points that fall beyond the whiskers (usually more than 1.5 times the IQR below Q1 or above Q3) are considered outliers and are often plotted as individual points.
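The parts described above can be computed as in the following sketch. It assumes the common 1.5 × IQR rule and Python's `inclusive` quartile method, which may differ from the quartile convention used by a particular software package. The function name `box_plot_stats` is hypothetical, and the data are the six Machine 2 diameters from the sample rows earlier in this article.

```python
import statistics

def box_plot_stats(values, k=1.5):
    """Compute the box, whisker ends, and outliers using the k * IQR rule."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions shift Q1 and Q3.
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        # Whiskers end at the most extreme points that are inside the fences.
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }

# Machine 2 diameters from the six sample rows.
machine_2 = [1.461893716, 1.499922138, 1.454054748,
             1.502964807, 1.538022745, 1.358204676]

stats = box_plot_stats(machine_2)
print("whiskers:", stats["whisker_low"], "to", stats["whisker_high"])
print("outliers:", stats["outliers"])
```

With these six values, the smallest diameter (1.358204676) falls below the lower fence and is reported as an outlier, so the lower whisker ends at the next smallest value. With only six points this is purely illustrative.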
    Can a box plot show multiple datasets at once?
    Yes! Box plots can be used to compare multiple datasets. When comparing multiple groups, each dataset will have its own box plot, and you can visually compare the median, spread, and outliers across them.
    What are the advantages of using a box plot over other charts?
  • Outlier detection: Box plots make it easy to spot outliers in the data.
  • Data distribution summary: They provide a clear summary of the data's distribution, including the range and variability.
  • Comparing groups: Box plots are ideal for comparing distributions between multiple groups or categories.
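    How do I handle skewed data in a box plot?
    Skewed data will show an asymmetrical box plot. On a vertical box plot, if the upper whisker is longer than the lower whisker, the data is positively skewed (skewed right); if the lower whisker is longer, the data is negatively skewed (skewed left). On a horizontal box plot, the same applies to the right and left whiskers. The position of the median line within the box will also indicate the skew: a median closer to Q1 suggests positive skew, and a median closer to Q3 suggests negative skew.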
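The position of the median within the box can be turned into a number with the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2 × median) / (Q3 - Q1), which ranges from -1 to +1. The sketch below is a rough heuristic, not a formal test of skewness: the function name `skew_direction`, the `tolerance` threshold of 0.1, and the sample datasets are all assumptions chosen for illustration.

```python
import statistics

def skew_direction(values, tolerance=0.1):
    """Classify skew from where the median sits between Q1 and Q3."""
    q1, med, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return "undetermined (zero IQR)"
    # Bowley coefficient: positive when the median sits closer to Q1.
    coefficient = (q3 + q1 - 2 * med) / iqr
    if coefficient > tolerance:
        return "right (positive) skew"
    if coefficient < -tolerance:
        return "left (negative) skew"
    return "roughly symmetric"

# Hypothetical datasets for illustration only.
print(skew_direction([1, 2, 3, 4, 5, 6, 7, 8, 9]))    # roughly symmetric
print(skew_direction([1, 2, 2, 3, 3, 4, 6, 9, 15]))   # right (positive) skew
```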
    Can a box plot show multiple variables or categories?
    Yes, box plots can show multiple variables or categories side by side for comparison. This is especially useful when comparing different groups or subgroups of data (e.g., box plots for sales data by region).
    What’s the difference between a box plot and a histogram?
    A box plot summarizes the data distribution with a focus on key percentiles (minimum, Q1, median, Q3, maximum), outliers, and spread. A histogram shows the frequency distribution of data, dividing it into bins. Box plots are more useful for comparing groups or datasets, while histograms are better for understanding the overall distribution of a single dataset.
    How do I identify outliers in a box plot?
    Outliers are represented as points outside the whiskers. A common rule is that data points more than 1.5 times the IQR above Q3 or below Q1 are considered outliers.
    What are some limitations of box plots?
  • Loss of detailed data: Box plots provide a high-level summary of the data but do not show individual data points (unless outliers are displayed).
  • Can be misleading with small datasets: Box plots may not accurately represent data with small sample sizes, as they can exaggerate the appearance of spread or variability.
  • Hidden distribution shape: Box plots reduce the data to a few summary statistics, so features such as multiple peaks (bimodality) are not visible, and two quite different distributions can produce similar box plots.
    Can I create a box plot for categorical data?
    Box plots are typically used for continuous data, but you can use a box plot to compare distributions across categories. For example, you might create a box plot to compare exam scores (continuous data) across different schools or regions (categorical data).
     Reference: Some of the text in this article has been generated using AI tools such as ChatGPT and edited for content and accuracy.
     
     
     

