Complex regions and calculation of homology
The human genome
In recent years, clinical genetics has advanced considerably in the field of precision medicine thanks to the incorporation of massively parallel sequencing into clinical routine, which has significantly increased the diagnostic yield.
One of the most widely used sequencing strategies is second-generation sequencing. It sequences millions of DNA fragments at the same time and produces highly accurate reads that are an almost faithful image of the genomic sequence being read. However, despite its advantages, this methodology has limitations associated with the length of the reads it generates, which usually ranges between 75 and 300 base pairs.
At Igenomix we take advantage of the speed and reliability of the data generated with Illumina, Inc. technology for molecular diagnosis in a clinical environment.
The main purpose of this page is to provide useful information to our customers regarding the limitations associated with the technology used in our massively parallel sequencing studies, especially with respect to homology within the exome used.
Drawbacks of short-read sequencing
Although short-read sequencing has brought major advances to the molecular diagnosis field through accurate evaluation of genomes, the drawbacks associated with the technology should not be forgotten, as they may have an impact on the molecular diagnosis of a patient.
The main drawbacks, or limitations, of second-generation sequencing are:
- Limitation in homopolymer regions: the repetition of a nucleotide more than 5-6 times in the genome makes the evaluation of that position almost impossible due to loss of polymerase synchronization during amplification and sequencing.
- Secondary structures: secondary structures formed during library preparation and sequencing can result in distinct types of bias in the final results.
- Homologous regions: regions in the genome with high sequence similarity to other genomic locations, which can lead to bioinformatic mapping issues and may cause variant calling errors.
- Repetitive regions: regions that are difficult to map due to the repetitiveness of their sequence, such as centromeric and telomeric regions.
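As an illustration of the homopolymer limitation listed above, the following minimal sketch scans a sequence for runs of a single nucleotide. The function name `homopolymer_runs`, the default threshold of 6 bases, and the example sequence are assumptions made for this illustration only; they are not part of the Igenomix pipeline.

```python
import re

def homopolymer_runs(sequence, min_length=6):
    """Return (start, end, base, length) for every run of a single
    nucleotide that is at least min_length bases long.
    Coordinates are 0-based and end-exclusive."""
    sequence = sequence.upper()
    # One base followed by at least (min_length - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    runs = []
    for match in pattern.finditer(sequence):
        length = match.end() - match.start()
        runs.append((match.start(), match.end(), match.group(1), length))
    return runs

# Made-up sequence: a run of 7 A's (reported) and a run of 5 T's (ignored).
example = "ACGT" + "A" * 7 + "GC" + "T" * 5 + "GC"
print(homopolymer_runs(example))
```

Positions reported by such a scan are candidates for closer review rather than confirmed errors.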
Bioinformatic calculation of homologous regions
In order to obtain accurate and confident results, homologous regions must be identified before the analysis.
To obtain a list of transcripts and exons that may present issues in the mapping and variant calling steps, our assessment process used the following information:
- Tables genomicSuperDups and getRmNgsProblemHigh from the UCSC database (2022-07-11). These tables were built from in silico data to determine the mapping quality of each region of the genome.
- RefSeq regions (v2021-03-24).
- Transcripts indicated by the MANE database (v1.0).
- OMIM database (2022-08-31).
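The page does not describe how these sources were combined, so the following is only a hedged sketch of one plausible approach: flag an exon when it overlaps a duplicated region whose sequence identity exceeds a threshold. The `Exon` and `Dup` records, the `flag_exons` function, the use of a `frac_match` value (modelled on the `fracMatch` column of genomicSuperDups) as the homology measure, and all coordinates are assumptions for illustration.

```python
from collections import namedtuple

# Hypothetical records; coordinates are 0-based, end-exclusive.
Exon = namedtuple("Exon", "gene transcript number chrom start end")
Dup = namedtuple("Dup", "chrom start end frac_match")

def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end

def flag_exons(exons, dups, threshold):
    """Return the exon numbers that overlap a duplicated region whose
    identity (frac_match) is above the given threshold."""
    flagged = []
    for exon in exons:
        for dup in dups:
            if (dup.chrom == exon.chrom
                    and dup.frac_match > threshold
                    and overlaps(exon.start, exon.end, dup.start, dup.end)):
                flagged.append(exon.number)
                break  # one qualifying overlap is enough
    return flagged

# Made-up data for a hypothetical gene.
exons = [
    Exon("GENE1", "TX1", 1, "chr1", 100, 200),
    Exon("GENE1", "TX1", 2, "chr1", 500, 600),
    Exon("GENE1", "TX1", 3, "chr2", 100, 200),
]
dups = [Dup("chr1", 150, 550, 0.93), Dup("chr2", 50, 150, 0.97)]

print(flag_exons(exons, dups, 0.90))  # exons above 90% homology
print(flag_exons(exons, dups, 0.95))  # exons above 95% homology
```

A nested loop keeps the sketch readable; for genome-scale inputs an interval tree or a sorted sweep would be the more usual choice.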
Using the table below, health professionals can evaluate whether the genes targeted by the study could be affected during the analysis process, which might cause relevant information for the patient's diagnosis to be missed.

id | Gene | Transcript | Exons with low average mappability | Exons with >90% homology | Exons with >95% homology |
---|---|---|---|---|---|
1 | ABCA3 | NM_001089 | 16, 31 | | |
2 | ABCA7 | NM_019112 | 18 | | |
3 | ABCC6 | NM_001171 | 1-9 | 1-9 | 1-9 |
4 | ABCD1 | NM_000033 | 7-10 | 7-10 | 7-10 |
5 | ACAN | NM_001369268 | 12 | | |
6 | ACR | NM_001097 | 4-5 | | |
7 | ACTB | NM_001101 | 2-6 | | |
8 | ACTG1 | NM_001614 | 2-6 | | |
9 | ADAMTSL2 | NM_014694 | 10-19 | 10-19 | 10-19 |
10 | ADGRE2 | NM_013447 | 3-10 | | |
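
The exon columns of the table mix single numbers, comma-separated lists and ranges. For readers who want to check programmatically whether a given exon is affected, a minimal sketch of expanding such a cell into individual exon numbers could look like this; the helper name `parse_exons` is an assumption made for this example.

```python
def parse_exons(cell):
    """Expand an exon cell such as "16, 31" or "7-10" into a sorted
    list of individual exon numbers. An empty cell gives an empty list."""
    exons = set()
    for part in cell.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range such as "7-10" includes both end points.
            first, last = (int(value) for value in part.split("-", 1))
            exons.update(range(first, last + 1))
        else:
            exons.add(int(part))
    return sorted(exons)

print(parse_exons("16, 31"))  # [16, 31]
print(parse_exons("7-10"))    # [7, 8, 9, 10]
```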

