---
title: "Introduction to queryup"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to queryup}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r}
library(queryup)
```


The purpose of `queryup` is to retrieve protein information using queries to the [UniProtKB REST API](https://www.uniprot.org/help/api_queries).


## Queries
Queries combine different fields to identify matching database entries. Here, queries are submitted using the function `query_uniprot()`. In the `queryup` R package, a query must be formatted as a list containing character vectors named after existing UniProt fields (available query fields can be found in the [API documentation](https://www.uniprot.org/help/query-fields) or in the package data `query_fields$field`). Different query fields must be matched simultaneously. For instance, the following query uses the fields *gene_exact* to return the UniProt entries of all proteins encoded by gene *Pik3r1* :


```{r message=FALSE, warnings=FALSE}
query <- list("gene_exact" = "Pik3r1")
df <- query_uniprot(query, show_progress = FALSE)
head(df)
```

Available query fields can be listed using the package data `query_fields`:
```{r}
query_fields$field
```

## Columns

By default, `query_uniprot()` returns a data.frame with  UniProt accession IDs, gene names, organism and Swiss-Prot review status. You can choose which data columns to retrieve using the `columns` parameter. 

```{r message=FALSE, warnings=FALSE}
df <- query_uniprot(query, 
                    columns = c("id", "sequence", "keyword", "gene_primary"),
                    show_progress = FALSE)
```

See the [API documentation](https://www.uniprot.org/help/return_fields) or the package data `return_fields` for all available columns.
Available returned fields can be listed using the package data `return_fields`:
```{r}
head(return_fields)
```

Note that the parameter `columns` and the name of the corresponding column in the output data frame do not necessarily match (they correspond to columns "field" and "label" respectively in the package data `return_fields`).

```{r}
names(df)
```

Let's check the sequence and the UniProt keywords corresponding to the first entry :

```{r}
as.character(df$Sequence[1])
as.character(df$Keywords[1])
```

## Combining query fields

Our first query returned many matches. We can build more specific queries by using more than one query field. By default, matching entries must satisfy all query fields simultaneously. Let's retrieve the only Swiss-Prot reviewed protein entry encoded by gene *Pik3r1* in *Homo sapiens* (taxon: 9606):

```{r message=FALSE, warnings=FALSE}
query <- list("gene_exact" = "Pik3r1", 
              "reviewed" = "true", 
              "organism_id" = "9606")
df <- query_uniprot(query, show_progress = FALSE)
print(df)
```

## Multiple items per query field

It is also possible to look for entries that match different items within a single query field. Items from a given query field are looked for independently. Hence, the following query will return all Swiss-Prot reviewed proteins encoded by either *Pik3r1* or *Pik3r2* in either *Mus musculus* (taxon: 10090) or *Homo sapiens* (taxon: 9606): 

```{r message=FALSE, warnings=FALSE}
query <- list("gene_exact" = c("Pik3r1", "Pik3r2"), 
              "reviewed" = "true", 
              "organism_id" = c("9606", "10090"))
df <- query_uniprot(query, show_progress = FALSE)
print(df)
```

## Queries with invalid entries

If a query containing invalid entries is sent to the UniProt REST API, an error message is returned and no information about the other potentially  valid entries can be retrieved. To overcome this limitation, `queryup` parses the error messages and remove invalid entries from the query. Hence, `query_uniprot()` will return information for valid entries only :

```{r}
invalid_ids <- c("P226", "CON_P22682", "REV_P47941")
valid_ids <- c("A0A0U1ZFN5", "P22682")
ids <- c(invalid_ids, valid_ids)
query <- list("accession_id" = ids)
query_uniprot(query)
```

## Long queries

Because UniProt REST API limits the size of queries, long queries containing more than a few hundreds entries cannot be passed in a single request. To overcome this limitation, the `queryup` package splits long queries into smaller ones. For instance, the dataset `uniprot_entries` that is bundled with the `queryup` package contains information for 1000 UniProt entries. We could retrieve the ENSEMBL ids corresponding to these entries using :

```{r message=FALSE, warning=FALSE}
ids <- uniprot_entries$Entry
query <- list("accession_id" = ids)
columns <- c("gene_names", "xref_ensembl")
df <- query_uniprot(query, columns = columns, show_progress = FALSE)
head(df)
```

## Protein-protein interactions

Another usage could be to retrieve protein-protein interactions among a set of UniProt entries:
```{r message=FALSE, warning=FALSE}
ids <- sample(uniprot_entries$Entry, 400)
query <- list("accession_id" = ids, 
              "interactor" = ids)
columns <- "cc_interaction"
df <- query_uniprot(query = query, columns = columns, show_progress = FALSE)
head(df)
```