Tables with summary data are quite common and naturally occur with alarming regularity. We can use gt to produce such summary components. These summary rows are automatically inserted within the relevant row groups, where each summary row is the result of a different aggregation function.

Preparing the Input Data Table

To most easily insert the necessary summary rows, we need to ensure that some preparatory work is done. The key change to the input data table should be to have a groupname column present, with categorical group names for each grouping. The following example data table contains both groupname and rowname columns, along with two data columns (named value_1 and value_2). Having the magic column names groupname and rowname will signal to gt that they should be used to form a table stub where groupname collects rows into a row group (furnishing a group heading row above), and rowname provides row labels within each row group. Here is a diagram showing the restructuring:

The incoming data table for the purpose of the first example will be created using dplyr::tribble():

There are three groups in this data table: A, B, and C. The presence of the groupname column will tell gt to partition the stub into three row groups. This is how the gt display table appears after simply providing tbl to gt():

value_1 value_2
A
1 235.6 260.1
2 184.3 84.4
3 342.3 126.3
4 234.9 83.0
B
1 190.9 832.5
2 743.3 281.2
3 252.3 732.5
4 344.7 546.1
C
1 197.2 818.0
2 284.3 394.4

Generating Summary Rows

Summaries by row group can be generated by using the summary_rows() function. There’s a lot of control in how the summaries are conducted and formatted. First of all, we can choose which groups will receive summary rows (the rows appear at the bottom of each row group). This optional choice of groups can be made using the groups argument (providing a vector of group names). Secondly, we can choose which columns should be included in the summary with the columns argument (in this case, wrapping column names in vars()). If we provide nothing to groups, then all groups will receive summary data for the selected columns.

To make any sort of summary, we need to use functions that will perform the aggregation. We can provide base functions such as mean(), sum(), min(), max(), and more, within a list() or c(). Each function provided will result in a summary row for each group.

Because each function will yield a row, we need to be able to identify that row. So, each summary row will receive a summary row label. We can provide our preferred names by naming the functions within the list or c() (e.g, list(average = "mean", total = "sum", SD = "sd") or c("mean", total = "sum", stdev = "sd")). If names aren’t supplied, then gt will derive names from the functions supplied and make the names unique.

Now that the summary_rows() function has been somewhat explained, let’s look at how we can get summary rows for the tbl table:

value_1 value_2
A
1 235.6 260.1
2 184.3 84.4
3 342.3 126.3
4 234.9 83.0
mean 249.28
sum 997.10
sd 66.51
B
1 190.9 832.5
2 743.3 281.2
3 252.3 732.5
4 344.7 546.1
mean 382.80
sum 1,531.20
sd 248.51
C
1 197.2 818.0
2 284.3 394.4

Here we see that summary rows were created for the A and B groups (group C has no summary lines in its row group). It was specified in columns that only the value_1 column should be evaluated and, indeed, the value_2 column shows no values whatsoever within the summary rows.

There is some flexibility in how we supply aggregation functions to summary_rows(). It’s sometimes sufficient to supply function names enclosed in quotes (e.g., list("mean", "sum", "sd")) but the next section will demonstrate a more advanced method for supplying these functions (where we don’t have to rely on function defaults). If we already have a vector of function names that use all of the default arguments within those functions, we can pass this directly to fns. This is nice if we want to build a common named vector of function names and summary labels. Here’s an example that does just that:

value_1 value_2
A
1 235.6 260.1
2 184.3 84.4
3 342.3 126.3
4 234.9 83.0
average 249.28
total 997.10
std dev 66.51
B
1 190.9 832.5
2 743.3 281.2
3 252.3 732.5
4 344.7 546.1
average 382.80
total 1,531.20
std dev 248.51
C
1 197.2 818.0
2 284.3 394.4
average 240.75
total 481.50
std dev 61.59

Providing Calls to Functions with a Formula Interface

Sometimes we want to use functions with some parameters set. The earlier example using list("mean", "sum", "sd") will break down if we have NA values in our data and we want to exclude those by using na.rm = TRUE. In this case we can provide a call to a function with a right-hand side (RHS) formula shortcut, having . serve as the data values. Here is an example where we modify our tbl to include some NA values (calling it tbl_na) and then changing the contents of the list()—using c() will work just as well—as calls to functions:

value_1 value_2
A
1 235.6 260.1
2 184.3 84.4
3 NA 126.3
4 234.9 83.0
average 218.27 138.45
sum 654.80 553.80
std dev 29.42 83.55
B
1 190.9 NA
2 NA 281.2
3 252.3 NA
4 NA 546.1
average 221.60 413.65
sum 443.20 827.30
std dev 43.42 187.31
C
1 197.2 NA
2 284.3 394.4

The input to fns is very permissive in regard to how the functions are defined. It is entirely valid to provide functions in the various forms shown earlier such that list("sum", ~mean(., na.rm = TRUE), SD = "sd") will be correctly interpreted.

Formatting the Summary Output Values

Using summary_rows() to generate a summary is, in effect, creating new data and thus we need some means to format the new summary row data. We can use the formatter option to supply a formatting function to handle the formatting of the summary row data. Thankfully, we have a collection of formatting functions available in the package (i.e., all of the fmt_*() functions). The default for formatter is set to fmt_number which is a sensible default for many scenarios. The setting of argument values for a particular formatter can be done in the ... area of the function call.

Let’s start with an example where we format the summary row data using fmt_number as the formatter. The fmt_number() function several arguments to which we can pass values (decimals, drop_trailing_zeros, negative_val, locale, etc.). Here, we will provide decimals = 3 to augment fmt_number():

value_1 value_2
A
1 235.6 260.1
2 184.3 84.4
3 342.3 126.3
4 234.9 83.0
mean 249.275 138.450
sum 997.100 553.800
min 184.300 83.000
max 342.300 260.100
B
1 190.9 832.5
2 743.3 281.2
3 252.3 732.5
4 344.7 546.1
C
1 197.2 818.0
2 284.3 394.4

We can also store these argument values as local variables and pass them in both separate fmt_number() calls and then to a summary_rows() calls with fmt_number() used as the formatter. This is useful for standardizing formatting parameters across different table cell types.

value_1 value_2
A
1 235,60 260,10
2 184,30 84,40
3 342,30 126,30
4 234,90 83,00
mean 249,28 138,45
sum 997,10 553,80
min 184,30 83,00
max 342,30 260,10
B
1 190,90 832,50
2 743,30 281,20
3 252,30 732,50
4 344,70 546,10
C
1 197,20 818,00
2 284,30 394,40

Extracting the Summary Data from the gt Table Object

For a reproducible workflow, we do not want to have situations where any data created or modified cannot be accessed. While having summarized values be created in a gt pipeline presents advantages to readability and intent of analysis, it is recognized that the output table itself is essentially ‘read only’, as the input data undergoes processing and movement to an entirely different format.

However, the object created still contains data and we can obtain the summary data from a gt table object using the extract_summary() function. Taking the gt_summary object, we get a data frame containing the summary data while preserving the correct data types:

The output data frames within the list always contain the groupname and rowname columns. The groupname column is filled with the name of the stub group given in summary_rows(). The rowname column contains the descriptive stub labels for the summary rows (recall that values are either supplied explicitly in summary_rows(), or, are generated from the function names). The remaining columns are from the original dataset.

The output data frame from extract_summary() can be reintroduced to a reproducible workflow and serve as downstream inputs or undergo validation. Perhaps interestingly, the output data frame is structured in a way that facilitates direct input to gt() (i.e., has the magic groupname and rowname columns). This can produce a new, standalone summary table where the summary rows are now data rows:

value_1 value_2
A
mean 249.275 138.450
sum 997.100 553.800
sd 66.506 83.551
B
mean 382.800 598.075
sum 1,531.200 2,392.300
sd 248.507 242.306

Providing Our Own Aggregation Functions to Generate Summary Rows

While many of the functions available in base R and within packages are useful as aggregate functions, we may occasionally have need to create our own custom functions. The only things to keep in mind are to ensure that a vector of values is the main input, and, a single value is returned. The return value can be most any class (e.g., numeric, character, logical) and it’s the formatter function that will handle any custom formatting while also converting to character.

Here, we’ll define a function that takes a vector of numeric values and outputs the two highest values (sorted low to high) above a threshold value. The output from this function is always a formatted character string.

Because this is character value that’s returned, we don’t need formatting functions like fmt_number(), fmt_percent(), etc. However, a useful formatter (and we do need some formatter) is the fmt_passthrough() function. Like the name suggests, it to great extent passes values through but formats as character (like all the fmt_*() function do) and it provides the option to decorate the output with a pattern. Let’s have a look at how the agg_highest_two_above_value() function can be used with the fmt_passthrough() formatter function.

value_1 value_2
A
1 235.6 260.1
2 184.3 84.4
3 126.3
4 234.9 83.0
high (234.90, 235.60) (260.10)
B
1 190.9
2 281.2
3 252.3
4 546.1
high (190.90, 252.30) (281.20, 546.10)
C
1 197.2
2 284.3 394.4
high (197.20, 284.30) (394.40)

We can extract the summary data from the summary_tbl object. Note that columns value_1 and value_2 are classed as character since it was character outputs that were generated by the agg_highest_two_above_value() function.

Using Different Formatters for Different Columns

Suppose we have data across columns that should be summarized in the same way (i.e., common aggregate functions) but formatted differently (e.g., decimal notation, scientific notation, percentages, etc.). For each type of formatting, we need to call summary_rows(), however, we can employ certain strategies to make the code a bit more succinct while doing so.

Let’s start with a new data table. This table contains groupname and rowname columns, which making using summary_rows() possible, along with large, medium, and small data columns, all containing numeric values.

Here, we would like to apply a different format (both to the data rows and to the summary rows) across the three columns. We can store formatting options to three different list objects (large_fmts, medium_fmts, and small_fmts) and use the stored values in the gt pipeline.

large medium small
A
1 2.35 × 105 342.20 34.00%
2 1.84 × 105 971.70 23.00%
3 3.43 × 105 392.60 73.00%
4 2.34 × 105 684.50 93.00%
sum 9.96 × 105 2,391.00 223.00%
mean 2.49 × 105 597.75 55.75%
sd 6.68 × 104 291.41 32.82%
B
1 1.91 × 105 328.00 35.00%
2 7.48 × 105 671.30 98.00%
3 2.53 × 105 934.20 74.00%
4 3.44 × 105 673.30 27.00%
sum 1.54 × 106 2,606.80 234.00%
mean 3.84 × 105 651.70 58.50%
sd 2.51 × 105 248.62 33.39%

Passing in parameters like this is useful, especially if there are larger numbers of columns. Additionally, we can store formatting parameters outside of the gt() pipeline and separate our concerns between data structuring and data formatting. The separation of styles and options into objects becomes more important if we intend to centralize formatting options for reuse.

Another thing to note in the above example is that even though multiple independent calls of summary_rows() were made, summary data within common summary row names were ‘squashed’ together, thus avoiding the fragmentation of summary rows. Put another way, we don’t create additional summary rows from a group across separate calls if we are referencing the same summary row labels. If the summary row labels provided in fns were to be different across columns, however, additional summary rows would be produced even if the types of data aggregations were to be functionally equivalent.