Skip to content

pyHLAMSA: Python Utility for Star Alleles Sequence Alignment

pyHLAMSA is a Python tool for handling Multiple Sequence Alignments (MSA) of genes with star alleles, specifically focusing on IMGT-HLA, IMGT-KIR, and CYP datasets. Key features include:

  • MSA Download: Fetch MSAs for latest/specific versions.
  • Gene and Allele Management: Add, delete, list, and select genes and alleles.
  • Intron/Exon Operations: Manipulate specific intron/exon segments.
  • Position and Region Selection: Choose targeted positions or regions.
  • Alignment Operations: Add, view, concatenate, crop, and find differences.
  • Genomic and Nucleotide Merging: Seamlessly merge sequences.
  • Variant Operations: Calculate consensus, gather statistics, and perform variant-related tasks.
  • Biopython Compatibility: Transform alignments into Biopython MultipleSeqAlignment.
  • Format Conversion: Load and save MSAs in various formats (VCF, MSF, TXT, BAM, GFF).

pyHLAMSA streamlines complex gene sequence analysis, offering efficient tools for researchers.

Run in command line

You can simply use this package by command line.

If you want to use more powerful function, try the APIs written in below sections.

pip3 install git+https://github.com/linnil1/pyHLAMSA
# show help
pyhlamsa -h
# download kir
pyhlamsa download --family kir --db-folder tmpdir/tmp_kir_db --version 2100  tmpdir/kir --include-genes KIR2DL1 KIR2DL2
# view the msa
pyhlamsa view tmpdir/kir.KIR2DL1 --position 3-100 --include-alleles KIR2DL1*consensus KIR2DL1*063
# save the intron1+exon1 region to kir1.*
pyhlamsa view tmpdir/kir.KIR2DL1 --region intron1 exon1 --name tmpdir/kir1 --save --bam --gff --vcf --fasta-gapless --fasta-msa

Features

1. Read sequences from database in one line

HLAmsa provided a simple way to read HLA data

It can automatically download and read the sequences

>>> from pyhlamsa import HLAmsa

>>> hla = HLAmsa(["A", "B"], filetype="gen",
                 version="3470")

>>> print(hla.list_genes())
['A', 'B']

>>> print(hla["A"])
<A gen alleles=4101 block=5UTR(301) exon1(80) intron1(130) exon2(335) intron2(273) exon3(413) intron3(666) exon4(287) intron4(119) exon5(117) intron5(444) exon6(33) intron6(143) exon7(48) intron7(170) exon8(5) 3UTR(302)>

KIRmsa can also read sequences of KIR

>>> from pyhlamsa import KIRmsa

# If don't specific the genes, it will read all genes.
>>> kir = KIRmsa(ipd_folder="KIR_v2100", version="2100")

>>> print(kir.list_genes())
['KIR2DL1', 'KIR2DL2', 'KIR2DL3', 'KIR2DL4', 'KIR2DL5', 'KIR2DP1', 'KIR2DS1', 'KIR2DS2', 'KIR2DS3', 'KIR2DS4', 'KIR2DS5', 'KIR3DL1', 'KIR3DL2', 'KIR3DL3', 'KIR3DP1', 'KIR3DS1']

>>> print(kir["KIR2DL1"])
<KIR2DL1 gen alleles=173 block=5UTR(268) exon1(34) intron1(964) exon2(36) intron2(728) exon3(282) intron3(1441) exon4(300) intron4(1534) exon5(294) intron5(3157) exon6(51) intron6(4270) exon7(102) intron7(462) exon8(53) intron8(98) exon9(177) 3UTR(510)>

Kind note: In our modules, exon is actually CDS, so it doesn't include UTR.

2. Merge gene and nuc MSA is quiet simple

This main features give us a chance to use genomic MSA and nucleotide MSA at the same time.

The nucleotide MSA is a exon-only sequence, thus we fill the intron with E after merged.

# merge gen and nuc sequences when loading
>>> hla = HLAmsa(["A"], filetype=["gen", "nuc"],
                 imgt_alignment_folder="alignment_v3470")
>>> print(hla["A"])
<A gen alleles=7349 block=5UTR(301) exon1(80) intron1(130) exon2(351) intron2(273) exon3(436) intron3(666) exon4(361) intron4(119) exon5(117) intron5(444) exon6(33) intron6(143) exon7(48) intron7(170) exon8(5) 3UTR(302)>


# or manually
>>> a_gen = HLAmsa("A", filetype="gen", 
>>>                imgt_alignment_folder="alignment_v3470")["A"]

>>> print(a_gen)
<A gen alleles=4101 block=5UTR(301) exon1(80) intron1(130) exon2(335) intron2(273) exon3(413) intron3(666) exon4(287) intron4(119) exon5(117) intron5(444) exon6(33) intron6(143) exon7(48) intron7(170) exon8(5) 3UTR(302)>

>>> a_nuc = HLAmsa("A", filetype="nuc", 
>>>                imgt_alignment_folder="alignment_v3470")["A"]
>>> print(a_nuc)
<A nuc alleles=7353 block=exon1(80) exon2(351) exon3(436) exon4(361) exon5(117) exon6(33) exon7(48) exon8(5)>

>>> a_gen = a_gen.remove('A*03:437Q')
>>> print(a_gen.merge_exon(a_nuc))
<A gen alleles=7349 block=5UTR(301) exon1(80) intron1(130) exon2(351) intron2(273) exon3(436) intron3(666) exon4(361) intron4(119) exon5(117) intron5(444) exon6(33) intron6(143) exon7(48) intron7(170) exon8(5) 3UTR(302)>

3. Block-based selection: You can select ANY intron/exon.

# select exon2 and exon3
>>> exon23 = a_gen.select_exon([2,3])  # 1-base
>>> print(exon23)
<A nuc alleles=4100 block=exon2(335) exon3(413)>


# select exon2 + intron2 + exon3
>>> e2i2e3 = a_gen.select_block([3,4,5])  # 0-base
>>> print(e2i2e3)
<A  alleles=4100 block=exon2(335) intron2(273) exon3(413)>

4. Easy to compare alleles: select and print

# select first 10 alleles
>> exon23_10 = exon23.select_allele(exon23.get_sequence_names()[:10])
>>> print(exon23_10)
<A nuc alleles=10 block=exon2(335) exon3(413)>

# print it
>>> exon23_10.print_alignment()
                  812                                                      1136
                    |                                                         |
 A*01:01:01:01      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------
 A*01:01:01:02N     ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------
 A*01:01:01:03      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------
 A*01:01:01:04      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------
 A*01:01:01:05      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------
 A*01:01:01:06      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------
 A*01:01:01:07      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------
 A*01:01:01:08      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------
 A*01:01:01:09      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------
 A*01:01:01:10      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC-ATCCAGA TAATGTATGG CTGCGACG-- ----------


# using regex to select
# "|" indicate the border of block, in this case, it's the border of exon2 and exon3
>>> exon23_1field = exon23.select_allele(r"A\*.*:01:01:01$")
>>> exon23_1field.print_alignment_diff()
                  812                                                      1136
                    |                                                         |
 A*01:01:01:01      ACCGAGCGAA CCTGGGGACC CTGCGCGGCT ACTACAACCA GAGCGAGGAC G| GTTCTCACA CC.ATCCAGA TAATGTATGG CTGCGACG.. ..........
 A*02:01:01:01      ------T-G- ---------- ---------- ---------- --------C- -| --------- --.G------ GG-------- --------.. ..........
 A*03:01:01:01      ------T-G- ---------- ---------- ---------- --------C- -| --------- --.------- ---------- --------.. ..........
 A*11:01:01:01      ------T-G- ---------- ---------- ---------- ---------- -| --------- --.------- ---------- --------.. ..........
 A*23:01:01:01      ------A--- ----C---T- GC--T-C--- ---------- --------C- -| --------- --.C------ -G----T--- --------.. ..........
 A*25:01:01:01      ------A--G ----C---T- GC--T-C--- ---------- ---------- -| --------- --.------- GG-------- --------.. ..........
 A*26:01:01:01      ---------- ---------- ---------- ---------- ---------- -| --------- --.------- GG-------- --------.. ..........
 A*29:01:01:01      ---------- ---------- ---------- ---------- --------C- -| --------- --.------- -G-------- ----C---.. ..........
 A*30:01:01:01      ------T-G- ---------- ---------- ---------- --------C- -| --------- --.------- ---------- --------.. ..........
 A*32:01:01:01      ------A--G ----C---T- GC--T-C--- ---------- --------C- -| --------- --.------- -G-------- --------.. ..........
 A*33:01:01:01      ------T-G- ---------- ---------- ---------- --------C- -| --------- --.------- -G-------- --------.. ..........
 A*34:01:01:01      ------T-G- ---------- ---------- ---------- ---------- -| --------- --.------- GG-------- --------.. ..........
 A*36:01:01:01      ---------- ---------- ---------- ---------- ---------- -| --------- --.------- ---------- --------.. ..........
 A*66:01:01:01      ------T-G- ---------- ---------- ---------- ---------- -| --------- --.------- GG-------- --------.. ..........
 A*68:01:01:01      ------T-G- ---------- ---------- ---------- --------C- -| --------- --.------- -G-------- --------.. ..........
 A*69:01:01:01      ------T-G- ---------- ---------- ---------- --------C- -| --------- --.G------ GG-------- --------.. ..........
 A*74:01:01:01      ------T-G- ---------- ---------- ---------- --------C- -| --------- --.------- -G-------- --------.. ..........
 A*80:01:01:01      ---------- ---------- ---------- ---------- ---------- -| --------- --.------- ---------- --------.. ..........


# show only variantiation
>>> exon23_1field.print_snv()
Total variantion: 71
                         536 537  541          565 567 570          593          640          654 658          684          721      
                           |   |    |            |   |   |            |            |            |   |            |            |      
 A*01:01:01:01      | ATTTCT   TCAC ATCCG| CCGGC CG  CGG GGA.G | ATCGCCGTGG | .G.ACACG.C | CGTGCGGTTC GACA| AGA..A GATG| CGGGCG CCGT|
 A*02:01:01:01      | ------   ---- -----| ----- --  --- ---.- | -----A---- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*03:01:01:01      | ------   ---- -----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*11:01:01:01      | ------   A--- C----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*23:01:01:01      | ------   C--- -----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*25:01:01:01      | ------   A--- C----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*26:01:01:01      | ------   A--- C----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*29:01:01:01      | -----A   C--- -----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------T ----| ---..G ----| -----A ----|
 A*30:01:01:01      | ------   C--- -----| ----- A-  T-- A--.- | -----A---- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*32:01:01:01      | ------   ---- -----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------T ----| ---..G ----| ------ ----|
 A*33:01:01:01      | -----A   C--- -----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*34:01:01:01      | ------   A--- C----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*36:01:01:01      | ------   ---- -----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..- ----| ------ ----|
 A*66:01:01:01      | ------   A--- C----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*68:01:01:01      | ------   A--- C----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*69:01:01:01      | ------   A--- C----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------- ----| ---..G ----| ------ ----|
 A*74:01:01:01      | ------   ---- -----| ----- --  --- ---.- | ---------- | .-.-----.- | ---------T ----| ---..G ----| ------ ----|
 A*80:01:01:01      | ------   ---- -----| ----- --  --- ---.- | -----A---- | .-.--T--.- | -----A---- ----| ---..G ----| ------ ----|

5. Some useful operation

# Calculate variant frequency of ( A T C G - ) per base
>>> print(exon23.calculate_frequency()[:10])
[[2, 3, 0, 4095, 0], [2, 1, 4097, 0, 0], [0, 4098, 2, 0, 0], [0, 3, 4095, 2, 0], [0, 911, 3188, 0, 1], [0, 1, 4097, 1, 1], [0, 0, 1, 0, 4099], [4097, 0, 0, 2, 1], [4, 3, 4090, 3, 0], [1, 4097, 1, 1, 0]]


# Get consensus(The largest one in frequency) among the msa
>>> exon23.get_consensus(include_gap=True)
'GCTCCC-ACTCCATGAGGTATTTCTTCACATCCGTGTCCCGGCCCGGCCGCGGGGA----GCCCCGCTTCATCGCCGTGGGC-----------------------TACGTGGACG-ACACG-CAGTTCGTGCGGTTCGACAGCGACGCCGCGAGCCAGAGGATGGAGCCG--------------------CGGGCGCCGTGGATA-GAGCAGGAGGGGCCGGAGTATTGGGACCAGGAGACACGGA-------------A-TGTGAAGGCCCACTCACAGACTGACCGAGTGGACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGCCGGTTCTCACACC-ATCCAGATGATGTATGGCTGCGACG--------------TGGGG-TCGGACGGGCGCTTCCTCCGCGGGTACCAGCA---GGACGCCTACGACGGCAAGGATTAC---ATCGCCCTGAAC------------------------GAGGACCTGCGCTCTTGGACCGCGGCGGAC--------ATGGCGGCTCAGATCACCAAGCGC-AAGT----GGGAGG--CGGCCC-ATGT------------------------------------------GGCGG-AGCAGTTGAGAGCCTACCTGGAGGGCACG--------TGCGTG----GAGTGGCTCCG--CAGATA-CCTGGAGAACGGGAAGGAGACGCTGCAGC-----------------GCACGG'


# Add sequence into MSA
# include_gap=False in get_consensus will ignore the frequency of gap. i.e. choose one of ATCG
>>> a_merged = hla["A"]
>>> consensus_seq = a_merged.get_consensus(include_gap=False)
>>> a_merged.append("A*consensus", consensus_seq)
>>> a_merged.fill_imcomplete("A*consensus")


# Shrink: remove gap if all bases in the column are gap
>>> exon23_10.shrink().print_snv()
 gDNA               200
                    |
 A*01:01:01:01      AAGGCCCACT CACAGACTGA CCGAGCGAAC CTGGGGACCC TGCGCGGCTA CTACAACCAG AGCGAGGACG| GTTCTCACAC CATCCAGATA ATGTATGGCT
 A*01:01:01:02N     ---------- ---------- ---------- ---------- ---------- ---------- ----------| ---------- ---------- ----------
 A*01:01:01:03      ---------- ---------- ---------- ---------- ---------- ---------- ----------| ---------- ---------- ----------
 A*01:01:01:04      ---------- ---------- ---------- ---------- ---------- ---------- ----------| ---------- ---------- ----------
 A*01:01:01:05      ---------- ---------- ---------- ---------- ---------- ---------- ----------| ---------- ---------- ----------
 A*01:01:01:06      ---------- ---------- ---------- ---------- ---------- ---------- ----------| ---------- ---------- ----------
 A*01:01:01:07      ---------- ---------- ---------- ---------- ---------- ---------- ----------| ---------- ---------- ----------
 A*01:01:01:08      ---------- ---------- ---------- ---------- ---------- ---------- ----------| ---------- ---------- ----------
 A*01:01:01:09      ---------- ---------- ---------- ---------- ---------- ---------- ----------| ---------- ---------- ----------
 A*01:01:01:10      ---------- ---------- ---------- ---------- ---------- ---------- ----------| ---------- ---------- ----------


# select specific column 
>>> a_gen[12:100]
<A  alleles=4100 block=(88)>

>>> a_gen[[12,100]].print_alignment_diff()
 gDNA               0
                    |
 A*01:01:01:01      GG
 A*01:01:01:02N     -G
 A*01:01:01:03      GG
 A*01:01:01:04      --
 A*01:01:01:05      --
 A*01:01:01:06      --
 A*01:01:01:07      --
 A*01:01:01:08      --
 A*01:01:01:09      --
 A*01:01:01:10      --
 A*01:01:01:11      GG
 A*01:01:01:12      GG


# concat
>>> print(a_gen.select_exon([2]) + a_gen.select_exon([3]))
<A nuc alleles=4100 block=exon2(335) exon3(413)>

6. Write and export the MSA

Causion: If you run merge_exon, or the msa is from filetype=['gen', 'nuc'], You should fill the E BEFORE save it.

You can fill it by consensus_seq shown before.

  • MultipleSeqAlignment

    Transfer to MultipleSeqAlignment

    >>> print(a_gen.to_MultipleSeqAlignment())
    Alignment with 4100 rows and 3866 columns
    CAGGAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTG...AAA A*01:01:01:01
    --------------------------------------------...--- A*01:01:01:02N
    CAGGAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTG...AAA A*01:01:01:03
    --------------------------------------------...--- A*01:01:01:04
    --------------------------------------------...AAA A*01:01:01:05
    --------------------------------------------...--- A*01:01:01:06
    --------------------------------------------...AAA A*01:01:01:07
    --------------------------------------------...--- A*01:01:01:08
    --------------------------------------------...AAA A*01:01:01:09
    --------------------------------------------...--- A*01:01:01:10
    CAGGAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTG...--- A*01:01:01:11
    CAGGAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTG...AAA A*01:01:01:12
    --------------------------------------------...AAA A*01:01:01:13
    --------------------------------------------...AAA A*01:01:01:14
    --------------------------------------------...--- A*01:01:01:15
    CAGGAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTG...--- A*01:01:01:16
    CAGGAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTG...--- A*01:01:01:17
    CAGGAGCAGAGGGGTCAGGGCGAAGTCCCAGGGCCCCAGGCGTG...AAA A*01:01:01:18
    ...
    --------------------------------------------...--- A*80:07
    
  • list of SeqRecord

    # Save as MSA
    SeqIO.write(a_gen.to_records(gap=True), "filename.msa.fa", "fasta")
    # Save as no-gapped sequences
    SeqIO.write(a_gen.to_records(gap=False), "filename.fa", "fasta")
    

  • fasta

    a_gen.to_fasta("filename.msa.fa", gap=True)
    a_gen.to_fasta("filename.fa", gap=False)
    

  • bam

    a_gen.to_bam("filename.bam")
    

  • gff

    a_gen.to_gff("filename.gff")
    

    After save the MSA as bam and gff, you can show the alignments on IGV msa_igv_example

  • vcf

    a_gen.to_vcf("filename.vcf.gz")
    

  • IMGT MSA format (xx_gen.txt)

    a_gen.to_imgt_alignment("A_gen.txt")
    a_gen.to_imgt_alignment("A_nuc.txt", seq_type="nuc")
    

  • save/load

    Save our model in fasta and json, where json contains block, index information

    a_gen.save_msa("a_gen.fa", "a_gen.json")
    a_gen = Genemsa.load_msa("a_gen.fa", "a_gen.json")
    
  • load msa from other format

    pyHLAMSA only support reading from MultipleSeqAlignment, which is very useful object, can be generate by reading MSA by Bio.AlignIO.

    Checkout https://biopython.org/wiki/AlignIO#file-formats for format supporting.

    For example

    from pyhlamsa import Genemsa
    msa = Genemsa.from_MultipleSeqAlignment(AlignIO.read(your_data_path, your_data_format))
    

TODO

  • [x] Testing
  • [x] Main function
  • [x] exon-only sequence handling
  • [x] Reading from file
  • [x] Some useful function: copy, remove, get_sequence_names, __len__, size, split, concat
  • [x] Cannot handle pseudo exon
  • [x] merge blocks and labels
  • [x] Fix KIR when merge gen and nuc
  • [x] Use index to trace orignal position
  • [x] CYP
  • [x] Set reference
  • [x] Download latest version of IMGT or IPD
  • [x] Remove seqtype
  • [x] Split code
  • [x] save to VCF
  • [x] Add command line usage
  • [x] Remove msaio
  • [x] Rewrite gene Mixins (Too complicated)
  • [x] Fix selction and removeing allele by regex
  • [x] Change to ACGT order
  • [x] Rename: split -> split_block, remove -> remove_allele
  • [ ] CDS != exon, (rename it?)
  • [ ] Rename: pyHLAMSA -> py_star_msa (MAYBE, becuase the project is originally written for HLA)

Requirement

  • python3.9
  • biopython
  • pysam
  • wget
  • git

Installation

pip3 install git+https://github.com/linnil1/pyHLAMSA
# or
git clone https://github.com/linnil1/pyHLAMSA
pip3 install -e pyHLAMSA

Appendix: CYP

Steps:

  1. Download fasta from https://www.pharmvar.org/download
  2. unzip to ./pharmvar-5.1.10
  3. Read it by pyhlamsa
    # 4. Read it
    from pyhlamsa import CYPmsa
    cyp = CYPmsa(pharmvar_folder="./pharmvar-5.1.10")
    
    # 5. Test it
    >>> print(cyp['CYP26A1'].format_variantion_base())
    
                           6407         6448         8142
                              |            |            |
     CYP26A1*1.001       GCGAGCGCGG | ATGTTCCGAA | TCGGGTGTGT
     CYP26A1*3.001       ---------- | -----A---- | ----------
     CYP26A1*2.001       -----A---- | ---------- | ----------
     CYP26A1*4.001       ---------- | ---------- | -----C----
    

Setup Document

pip3 install mkdocs-material mkdocstrings[python-lagacy]==0.18
mkdocs serve

I use python-lagacy 0.18 because inherited_members is not support now

Test

pip3 install -e .
pip3 install pytest black mypy
pytest
black pyhlamsa
mypy pyhlamsa

Tested Version

  • CYP: 5.2.2
  • HLA: 3.49.0
  • KIR: 2.10.0 (The DB in 2.11.0 contains bugs in KIR2DL4/5)

Some QAs

Why not inherit Bio.AlignIO.MultipleSeqAlignment?

The class does't support lot of functions than I expected.

And it's tidious to overwrite most of the functions to maintain our blocks information

Why not use numpy to save the sequence?

Performance issue is not my bottle-neck yet.

Citations

  • HLA Robinson J, Barker DJ, Georgiou X, Cooper MA, Flicek P, Marsh SGE: IPD-IMGT/HLA Database. Nucleic Acids Research (2020), 48:D948-55
  • IMGT Robinson J, Barker DJ, Georgiou X, Cooper MA, Flicek P, Marsh SGE IPD-IMGT/HLA Database Nucleic Acids Research (2020) 48:D948-55
  • CYP(PharmVar) Pharmacogene Variation Consortium (PharmVar) at www.PharmVar.org using the following references: Gaedigk et al. 2018, CPT 103(3):399-401 (PMID 29134625); Gaedigk et al. 2020, CPT 107(1):43-46 (PMID 31758698) and Gaedigk et al. 2021, CPT 110(3):542-545 (PMID 34091888)
  • This github

Document

See https://linnil1.github.io/pyHLAMSA

Gene Module.

All the msa operation are merged into Genemsa

Genemsa (GenemsaIO, GenemsaTextOp, GenemsaBase)

Genemsa: core object in pyhlamsa

More details written in GenemsaBase

Basic usage:

from pyhlamsa import Genemsa
msa = Genemsa("test1")
msa.set_blocks([3,7])
msa.assume_label("other")
msa.append("A", "GGAGGA--CA")
msa.append("B", "GGGGGAA--A")
msa.print_alignment()
                  1   4     9
                  |   |     |
A                 GGA|GGA--|CA
B                 GGG|GGAA-|-A

Source code in pyhlamsa/gene/genemsa.py
class Genemsa(
    GenemsaIO,
    GenemsaTextOp,
    GenemsaBase,
):
    """
    Genemsa: core object in pyhlamsa

    More details written in GenemsaBase

    Basic usage:
    ``` python
    from pyhlamsa import Genemsa
    msa = Genemsa("test1")
    msa.set_blocks([3,7])
    msa.assume_label("other")
    msa.append("A", "GGAGGA--CA")
    msa.append("B", "GGGGGAA--A")
    msa.print_alignment()
                      1   4     9
                      |   |     |
    A                 GGA|GGA--|CA
    B                 GGG|GGAA-|-A
    ```
    """

__add__(self: ~GenemsaType, msa: ~GenemsaType) -> ~GenemsaType inherited special

Source code in pyhlamsa/gene/genemsa.py
def __add__(self: GenemsaType, msa: GenemsaType) -> GenemsaType:
    """
    Concat 2 MSA

    Example:
      >>> print(a_gen.select_exon([2]) + a_gen.select_exon([3]))
      <A nuc alleles=4100 block=exon2(335) exon3(413)>
    """
    names0 = set(self.get_sequence_names())
    names1 = set(msa.get_sequence_names())
    if names0 != names1:
        raise ValueError(
            "Can not concat because some allele is miss: "
            + str(names0.symmetric_difference(names1))
        )
    new_msa = self.copy()
    new_msa.blocks.extend(copy.deepcopy(msa.blocks))
    new_msa.index.extend(copy.deepcopy(msa.index))
    for name, seq in msa.items():
        new_msa.alleles[name] += seq
    return new_msa

__getitem__(self: ~GenemsaType, index: Any = None) -> ~GenemsaType inherited special

Source code in pyhlamsa/gene/genemsa.py
def __getitem__(self: GenemsaType, index: Any = None) -> GenemsaType:
    """
    Extract the region of the sequences by index (start from 0),
    but block information will not preserved

    Example:
      >>> msa = Genemsa("A", "gen")
      >>> msa.read_alignment_file("A_gen.txt")
      >>> # Inspect 50-100bp in the MSA
      >>> extract_msa = msa[50:100]
      >>> print(extract_msa)

      >>> # Inspect 2nd 3rd 5th bp in the MSA
      >>> extract_msa = msa[[1,2,4]]
      >>> print(extract_msa)
    """
    if not index:
        return self.copy()
    if not self:
        raise ValueError("MSA is empty")

    # Extract specific region in alignment
    if isinstance(index, int):
        index = [index]
    if isinstance(index, slice):
        alleles = {allele: seq[index] for allele, seq in self.items()}
        index = self.index[index]
    elif isinstance(index, (tuple, list)):
        alleles = {
            allele: "".join([seq[i] for i in index]) for allele, seq in self.items()
        }
        index = [self.index[i] for i in index]
    # Fail
    else:
        raise TypeError("Bad usage")

    new_msa = self.copy(copy_allele=False)
    new_msa.set_blocks([len(next(iter(alleles.values())))])
    new_msa.index = copy.deepcopy(index)
    new_msa.extend(alleles)
    return new_msa

__init__(self, gene_name: str = 'Unamed', alleles: dict = {}, blocks: Iterable = [], index: Iterable = [], reference: str = '') inherited special

Source code in pyhlamsa/gene/genemsa.py
def __init__(
    self,
    gene_name: str = "Unamed",
    alleles: dict[str, str] = {},
    blocks: Iterable[BlockInfo] = [],
    index: Iterable[IndexInfo] = [],
    reference: str = "",
):
    """
    Attributes:
        gene_name: The name of the gene

        alleles: MSA data.

            Allele name as the key and the sequence string as the value.


        blocks: list of block information
        index: list of index(position) information
        reference: allele of the msa (Optional)
    """
    self.gene_name = gene_name
    self.alleles: dict[str, str] = {}
    self.blocks = copy.deepcopy(list(blocks or []))  # intron exon length
    self.index = copy.deepcopy(list(index or []))
    self.logger = logging.getLogger(__name__)
    self.reference = reference

    if alleles:
        self.extend(alleles)

__iter__(self) -> Iterator inherited special

Source code in pyhlamsa/gene/genemsa.py
def __iter__(self) -> Iterator[str]:
    """Iter allele like iter(dict)"""
    return iter(self.alleles)

__len__(self) -> int inherited special

Source code in pyhlamsa/gene/genemsa.py
def __len__(self) -> int:
    """
    Get the number of alleles in the MSA

    Example:
      >>> len(a_gen)
      4100
    """
    return len(self.alleles)

__repr__(self) -> str inherited special

Source code in pyhlamsa/gene/genemsa.py
def __repr__(self) -> str:
    """Show name, number of alleles and block infos in this MSA"""
    block_info = " ".join([f"{b.name}({b.length})" for b in self.list_blocks()])
    return f"<{self.gene_name} alleles={len(self)} block={block_info}>"

append(self: ~GenemsaType, name: str, seq: str) -> ~GenemsaType inherited

Add a sequence into MSA (inplace)

Make sure the sequence length is same as in MSA

Source code in pyhlamsa/gene/genemsa.py
def append(self: GenemsaType, name: str, seq: str) -> GenemsaType:
    """
    Add a sequence into MSA (inplace)

    Make sure the sequence length is same as in MSA
    """
    if len(seq) == 0:  # OK to add 0 length string
        self.alleles[name] = seq
        return self
    if len(seq) != self.get_length():
        raise ValueError("Length not match to alignments")
    if name in self:
        raise ValueError(f"{name} already exist")

    self.alleles[name] = seq
    return self

assume_label(self: ~GenemsaType, seq_type: str = 'gen') -> ~GenemsaType inherited

It will automatically generate the block's label according on seq_type. (inplace)

seq_type: * gen: 5UTR-exon1-intron1-exon2-...-exon9-3UTR * nuc: exon1-exon2-...-exon9 * other: block1-block2-block3-...

!!! block_length If manually assign the block_length, the old block will be cleared.

Source code in pyhlamsa/gene/genemsa.py
def assume_label(self: GenemsaType, seq_type: str = "gen") -> GenemsaType:
    """
    It will automatically generate the block's label
    according on `seq_type`. (inplace)

    seq_type:
      * gen: 5UTR-exon1-intron1-exon2-...-exon9-3UTR
      * nuc: exon1-exon2-...-exon9
      * other: block1-block2-block3-...

    block_length:
        If manually assign the block_length, the old block will be cleared.
    """
    if not self.get_block_length():
        raise ValueError("This msa doesn't have any blocks")

    if seq_type == "gen":
        assert self.get_block_length() % 2 == 1 and self.get_block_length() >= 3
        for i, blk in enumerate(self.blocks):
            if i == 0:
                blk.type = "five_prime_UTR"
                blk.name = "5UTR"
            elif i == self.get_block_length() - 1:
                blk.type = "three_prime_UTR"
                blk.name = "3UTR"
            elif i % 2:
                blk.type = "exon"
                blk.name = f"exon{i // 2 + 1}"
            else:
                blk.type = "intron"
                blk.name = f"intron{i // 2}"

    elif seq_type == "nuc":
        for i, blk in enumerate(self.blocks):
            blk.type = "exon"
            blk.name = f"exon{i+1}"
    else:
        for i, blk in enumerate(self.blocks):
            blk.type = "gene_fragment"
            blk.name = f"block{i+1}"

    # inplace to reset the index
    self.index = self.reset_index().index
    return self

calculate_frequency(self) -> list inherited

Calculate ATCG and gap frequency of each bp in MSA

Returns:

Type Description
frequency (list[list[int]])

Each items contains the number of ATCG and gap.

Source code in pyhlamsa/gene/genemsa.py
def calculate_frequency(self) -> list[list[int]]:
    """
    Calculate ATCG and gap frequency of each bp in MSA

    Returns:
      frequency (list[list[int]]):
        Each items contains the number of ATCG and gap.
    """
    freqs = []
    for i in zip(*self.alleles.values()):
        freqs.append(
            [i.count("A"), i.count("C"), i.count("G"), i.count("T"), i.count("-")]
        )
    return freqs

contains(self, allele: str) -> bool inherited

Implement in operator

Examples:

>>> msa_a = HLAmsa("A")["A"]
>>> "A*01:01:01:01" in msa_a
True
Source code in pyhlamsa/gene/genemsa.py
def contains(self, allele: str) -> bool:
    """
    Implement `in` operator

    Example:
        >>> msa_a = HLAmsa("A")["A"]
        >>> "A*01:01:01:01" in msa_a
        True
    """
    return allele in self.alleles

copy(self: ~GenemsaType, copy_allele: bool = True) -> ~GenemsaType inherited

Clone a new MSA.

Parameters:

Name Type Description Default
copy_allele bool

Copy the sequences as well

True
Source code in pyhlamsa/gene/genemsa.py
def copy(self: GenemsaType, copy_allele: bool = True) -> GenemsaType:
    """
    Clone a new MSA.

    Args:
      copy_allele: Copy the sequences as well
    """
    # Child's Type
    Genemsa = type(self)
    new_msa = Genemsa(
        self.gene_name,
        blocks=self.blocks,
        index=self.index,
        reference=self.reference,
        alleles=self.alleles if copy_allele else {},
    )
    return new_msa

extend(self: ~GenemsaType, msa: Union[~GenemsaType, dict[str, str]]) -> ~GenemsaType inherited

Add MSA's alleles into this MSA (inplace)

Source code in pyhlamsa/gene/genemsa.py
def extend(
    self: GenemsaType, msa: Union[GenemsaType, dict[str, str]]
) -> GenemsaType:
    """Add MSA's alleles into this MSA (inplace)"""
    if isinstance(msa, GenemsaBase):
        if [b.length for b in self.list_blocks()] != [
            b.length for b in msa.list_blocks()
        ]:
            raise ValueError("Block length is different")
    for name, seq in msa.items():
        self.append(name, seq)
    return self

fill_incomplete(self: ~GenemsaType, ref_allele: str) -> ~GenemsaType inherited

Fill the E in exon-only sequences with ref_allele sequence (inplace)

Source code in pyhlamsa/gene/genemsa.py
def fill_incomplete(self: GenemsaType, ref_allele: str) -> GenemsaType:
    """Fill the `E` in exon-only sequences with ref_allele sequence (inplace)"""
    if ref_allele not in self:
        raise KeyError(f"{ref_allele} not found")

    ref_seq = self.get(ref_allele)
    for allele, seq in self.items():
        if "E" in seq:
            # replace it
            self.alleles[allele] = "".join(
                [seq[i] if seq[i] != "E" else ref_seq[i] for i in range(len(seq))]
            )
    return self

format_alignment(self, **kwargs_format: Any) -> Iterator inherited

Transfer MSA to string(generator) To the things in step3 and step4.

Parameters:

Name Type Description Default
kwargs_format Any
  • show_position_set(set[int]): Force to show the position
  • max_page_width(int): The max width of each line (Default = 140char)
{}
Source code in pyhlamsa/gene/genemsa.py
def format_alignment(self, **kwargs_format: Any) -> Iterator[str]:
    """
    Transfer MSA to string(generator)
    To the things in step3 and step4.

    Args:
      kwargs_format:
        * show_position_set(set[int]): Force to show the position
        * max_page_width(int): The max width of each line (Default = 140char)
    """
    if not self:
        raise ValueError("MSA is empty")
    pages = self._format_page([self.index], **kwargs_format)
    yield from chain.from_iterable(map(self._apply_page, pages))

get(self, allele: str) -> str inherited

Get the sequence by allele name

Source code in pyhlamsa/gene/genemsa.py
def get(self, allele: str) -> str:
    """Get the sequence by allele name"""
    return self.alleles[allele]

get_allele_or_error(self, allele: str = '') -> tuple inherited

Get the sequence by allele name

Parameters:

Name Type Description Default
allele str

Allele name. If not provided, reference allele are used

''

Returns:

Type Description
allele_name (str) and its sequence (str)
Source code in pyhlamsa/gene/genemsa.py
def get_allele_or_error(self, allele: str = "") -> tuple[str, str]:
    """
    Get the sequence by allele name

    Args:
      allele (str): Allele name. If not provided, reference allele are used

    Returns:
      allele_name (str) and its sequence (str):
    """
    if not allele:
        allele = self.get_reference()[0]
    if allele not in self:
        raise ValueError(f"{allele} not found")
    return allele, self.get(allele)

get_block(self, block: Union[int, str, pyhlamsa.gene.base.BlockInfo]) -> BlockInfo inherited

Get block by str or id

Source code in pyhlamsa/gene/genemsa.py
def get_block(self, block: BlockInput) -> BlockInfo:
    """Get block by str or id"""
    return self.blocks[self.get_block_index(block)]

get_block_index(self, block: Union[int, str, pyhlamsa.gene.base.BlockInfo]) -> int inherited

Find the index of the block

Source code in pyhlamsa/gene/genemsa.py
def get_block_index(self, block: BlockInput) -> int:
    """Find the index of the block"""
    if isinstance(block, str):
        for i, b in enumerate(self.list_blocks()):
            if b.name == block:
                return i
    elif isinstance(block, BlockInfo):
        for i, b in enumerate(self.list_blocks()):
            if b.name == block.name:
                return i
    elif isinstance(block, int):
        id = block
        if id < 0:
            id = self.get_block_length() + id
        if 0 <= id < self.get_block_length():
            return id
    else:
        raise NotImplementedError(f"Type of {block} not work now")
    raise IndexError(f"{block} not found or out of index")

get_block_interval(self, block: Union[int, str, pyhlamsa.gene.base.BlockInfo]) -> tuple inherited

Calculate the start(included) and end index (excluded) of the block

Source code in pyhlamsa/gene/genemsa.py
def get_block_interval(self, block: BlockInput) -> tuple[int, int]:
    """Calculate the start(included) and end index (excluded) of the block"""
    index = self.get_block_index(block)
    start = sum(self.blocks[i].length for i in range(index))
    return start, start + self.blocks[index].length

get_block_length(self) -> int inherited

Return list of blocks

Source code in pyhlamsa/gene/genemsa.py
def get_block_length(self) -> int:
    """Return list of blocks"""
    return len(self.list_blocks())

get_cigar(self, target_allele: str, ref_allele: str = '') -> list inherited

Get the cigar string of target_allele from ref_allele

If ref_allele not set, it will automatically find the reference by get_reference

Returns:

Type Description
cigar(list[op_str, num])

The list of operator and number of bases

Exmaple: cigar = [(M, 1), (X, 1), (D, 2), (M, 2)]

Source code in pyhlamsa/gene/genemsa.py
def get_cigar(
    self, target_allele: str, ref_allele: str = ""
) -> list[tuple[str, int]]:
    """
    Get the cigar string of target_allele from ref_allele

    If ref_allele not set,
    it will automatically find the reference by get_reference

    Return:
      cigar(list[op_str, num]): The list of operator and number of bases
    Exmaple:
      `cigar = [(M, 1), (X, 1), (D, 2), (M, 2)]`
    """
    if target_allele not in self:
        raise KeyError(f"{target_allele} not found")
    ref_allele, _ = self.get_allele_or_error(ref_allele)
    return cigar.calculate_cigar(self.get(ref_allele), self.get(target_allele))

get_consensus(self, include_gap: bool = False) -> str inherited

Generate the consensus sequence by choosing maximum frequency base

Parameters:

Name Type Description Default
include_gap bool

Allow consensus contains gap if gap is the maximum item.

If include_gap=False and all the base on that position is gap (not shrinked before), it will warning and fill with A.

E will be ignored.

False

Examples:

a0                 CCATT-GGT--GTCGGGTTTCCAG
a1                 CCACTGGGT--ATCGGGTTTCCAG
c2                 CAATTGGGT--GTCGGGT---AAG
consensus          CCATTGGGT--GTCGGGTTTCCAG
consensus(no-gap)  CCATTGGGTAAGTCGGGTTTCCAG
Source code in pyhlamsa/gene/genemsa.py
def get_consensus(self, include_gap: bool = False) -> str:
    """
    Generate the consensus sequence by choosing maximum frequency base

    Args:
      include_gap (bool):
        Allow consensus contains gap if gap is the maximum item.

        If include_gap=False and all the base on that position is gap
        (not shrinked before),
        it will warning and fill with A.

        `E` will be ignored.

    Example:
    ```
    a0                 CCATT-GGT--GTCGGGTTTCCAG
    a1                 CCACTGGGT--ATCGGGTTTCCAG
    c2                 CAATTGGGT--GTCGGGT---AAG
    consensus          CCATTGGGT--GTCGGGTTTCCAG
    consensus(no-gap)  CCATTGGGTAAGTCGGGTTTCCAG
    ```
    """
    freqs = self.calculate_frequency()
    if not include_gap:
        if any(sum(f[:4]) == 0 for f in freqs):
            self.logger.warning(
                "MSA contains gap, try .shrink() before .get_consensus()"
            )
        max_ind = [max(range(4), key=lambda i: f[i]) for f in freqs]
    else:
        max_ind = [max(range(5), key=lambda i: f[i]) for f in freqs]
    return "".join(map(lambda i: "ACGT-"[i], max_ind))

get_length(self) -> int inherited

Get the length of MSA

Source code in pyhlamsa/gene/genemsa.py
def get_length(self) -> int:
    """Get the length of MSA"""
    # 0 sequences is allow
    if not self:
        # another way to calculate length
        return sum(i.length for i in self.list_blocks())
    else:
        # when extend, reference allele may not exists here
        # return len(self.get_reference()[1])
        return len(next(iter(self.items()))[1])

get_reference(self) -> tuple inherited

Get the reference in MSA, if not existed, output the first one

Returns:

Type Description
tuple

(allele_name, allele_seq)

Source code in pyhlamsa/gene/genemsa.py
def get_reference(self) -> tuple[str, str]:
    """
    Get the reference in MSA, if not existed, output the first one

    Returns:
      (allele_name, allele_seq)
    """
    if self.reference in self:
        return (self.reference, self.get(self.reference))
    if not self:
        raise ValueError("MSA is empty")
    self.logger.warning(
        f"Reference {self.reference} not existed in MSA."
        " Using the first allele instead."
    )
    allele, seq = next(iter(self.items()))
    self.reference = allele  # set it
    return allele, seq

get_sequence_names(self) -> KeysView inherited

Same as list_alleles

Source code in pyhlamsa/gene/genemsa.py
def get_sequence_names(self) -> KeysView[str]:
    """Same as list_alleles"""
    return self.list_alleles()

get_variantion_base(self) -> list inherited

Get the base positions where variation occurs

Examples:

msa:
  s0: AAT
  s1: AAC
  s2: CAC
>>> msa.get_variantion_base()
[0, 2]

Returns:

Type Description
positions

Each integer represent the position of variation

Source code in pyhlamsa/gene/genemsa.py
def get_variantion_base(self) -> list[int]:
    """
    Get the base positions where variation occurs

    Example:
      ```
      msa:
        s0: AAT
        s1: AAC
        s2: CAC
      >>> msa.get_variantion_base()
      [0, 2]
      ```

    Returns:
      positions:
        Each integer represent the position of variation
    """
    freqs = self.calculate_frequency()
    num = len(self)
    base = []
    for i, freq in enumerate(freqs):
        if num not in freq:
            base.append(i)
    return base

items(self) -> Iterable inherited

list allele name along with allele sequence like dict.items()

Source code in pyhlamsa/gene/genemsa.py
def items(self) -> Iterable[tuple[str, str]]:
    """list allele name along with allele sequence like dict.items()"""
    return self.alleles.items()

list_alleles(self) -> KeysView inherited

List all the allele's sequence name in MSA

Examples:

>>> a_gen.list_alleles()[:3]
['A*01:01:01:01', 'A*01:01:01:02N', 'A*01:01:01:03']
Source code in pyhlamsa/gene/genemsa.py
def list_alleles(self) -> KeysView[str]:
    """
    List all the allele's sequence name in MSA

    Example:
       >>> a_gen.list_alleles()[:3]
       ['A*01:01:01:01', 'A*01:01:01:02N', 'A*01:01:01:03']
    """
    return self.alleles.keys()

list_blocks(self) -> list inherited

Return list of blocks

Source code in pyhlamsa/gene/genemsa.py
def list_blocks(self) -> list[BlockInfo]:
    """Return list of blocks"""
    return self.blocks

list_index(self) -> list inherited

Return list of index

Source code in pyhlamsa/gene/genemsa.py
def list_index(self) -> list[IndexInfo]:
    """Return list of index"""
    return self.index

merge_exon(self: ~GenemsaType, msa_nuc: ~GenemsaType) -> ~GenemsaType inherited

Merge nuc MSA into gen MSA

It's allow that nuc MSA has new allele name than gen MSA, Genemsa will add the sequence in MSA, and the intron will fill by E

If the exon part of gen MSA is differnet (e.g. less gapped) from nuc MSA, Genemsa will try to merge if it can

Note that the index will be reset

Examples:

# case1
msa_gen:
  1: "AA|TT|CC",
  2: "AA|TC|CC",
msa_nuc:
  3: "TC",
After merge:
  1: "AA|TT|CC",
  2: "AA|TC|CC",
  3: "EE|TC|EE"
# case2
msa_gen:
  1: "AA|TT|CC",
  2: "AA|TC|CC",
msa_nuc:
  1: "TT-",
  2: "T-C",
  4: "TTC",
After merge:
  1: "AA|TT-|CC",
  2: "AA|T-C|CC",
  4: "EE|TTC|EE"
Source code in pyhlamsa/gene/genemsa.py
def merge_exon(self: GenemsaType, msa_nuc: GenemsaType) -> GenemsaType:
    """
    Merge nuc MSA into gen MSA

    It's allow that nuc MSA has new allele name than gen MSA,
    Genemsa will add the sequence in MSA, and the intron will fill by `E`

    If the exon part of gen MSA is differnet (e.g. less gapped) from nuc MSA,
    Genemsa will try to merge if it can

    Note that the index will be reset

    Example:
      ```
      # case1
      msa_gen:
        1: "AA|TT|CC",
        2: "AA|TC|CC",
      msa_nuc:
        3: "TC",
      After merge:
        1: "AA|TT|CC",
        2: "AA|TC|CC",
        3: "EE|TC|EE"
      ```

      ```
      # case2
      msa_gen:
        1: "AA|TT|CC",
        2: "AA|TC|CC",
      msa_nuc:
        1: "TT-",
        2: "T-C",
        4: "TTC",
      After merge:
        1: "AA|TT-|CC",
        2: "AA|T-C|CC",
        4: "EE|TTC|EE"
      ```
    """
    # A mapping from gen name to nuc index
    nuc_name_index = {
        b.name: i for i, b in enumerate(msa_nuc.list_blocks()) if b.type == "exon"
    }

    # check it's one-to-one mapping
    exon_set = set(b.name for b in self.list_blocks() if b.type == "exon")
    if set(nuc_name_index.keys()) != exon_set:
        raise ValueError(
            f"Cannot match blocks: " f"gen={exon_set} nuc={nuc_name_index.keys()}"
        )

    # create new msa and make sure the order of alleles
    new_msa = self.copy(copy_allele=False)
    new_msa.set_blocks([])
    new_msa.index = []
    new_msa.extend(
        {
            name: ""
            for name in self.get_sequence_names() | msa_nuc.get_sequence_names()
        }
    )

    # allele names
    gen_names = set(self.get_sequence_names())
    nuc_names = set(msa_nuc.get_sequence_names())
    exclude_name: set[str] = set()

    # block-wise
    msas_gen = self.split_block()
    msas_nuc = msa_nuc.split_block()
    for i_gen, blk_gen in enumerate(self.blocks):
        # intron -> fill with E
        if blk_gen.name not in nuc_name_index:
            for name in nuc_names - gen_names:
                msas_gen[i_gen].append(name, "E" * blk_gen.length)
            new_msa += msas_gen[i_gen].remove_allele(
                list(exclude_name), inplace=True
            )
        # exon -> check before merge
        else:
            i_nuc = nuc_name_index[blk_gen.name]
            # content change or length change
            if msas_nuc[i_nuc].get_length() != msas_gen[i_gen].get_length() or any(
                msas_nuc[i_nuc].get(name) != msas_gen[i_gen].get(name)
                for name in (nuc_names & gen_names)
            ):
                # check before merge
                if len(gen_names - nuc_names):
                    raise ValueError(
                        f"Some alleles doesn't exist in nuc MSA: "
                        f"{gen_names - nuc_names}"
                    )

                diff_name = filter(
                    lambda name: msas_nuc[i_nuc].get(name).replace("-", "")
                    != msas_gen[i_gen].get(name).replace("-", ""),
                    gen_names,
                )
                diff_names = list(diff_name)
                if diff_names:
                    self.logger.warning(
                        f"Some exon sequences in gen MSA "
                        f"is not same as in nuc MSA "
                        f"{blk_gen.name}: {diff_names}"
                    )
                    new_msa.remove_allele(diff_names)
                    exclude_name.update(diff_names)
            new_msa += msas_nuc[i_nuc].remove_allele(list(exclude_name))
    return new_msa.reset_index()

meta_to_json(self) -> dict[str, Any] inherited

Extract all meta information about this msa into dict(json)

Source code in pyhlamsa/gene/genemsa.py
def meta_to_json(self) -> dict[str, Any]:
    """Extract all meta information about this msa into dict(json)"""
    meta = {
        "index": [dataclasses.asdict(i) for i in self.index],
        "blocks": [dataclasses.asdict(b) for b in self.blocks],
        "name": self.gene_name,
        "reference": self.reference,
    }
    return meta

print_alignment(self, **kwargs_format: Any) -> None inherited

Print the MSA

Note that the index shown on output header is 1-base absolute position.

Also, the index is absolute index, if you you want continuous positiont, use .reset_index() first.

  • * indicate deletion
  • ATCG indicate the SNV
  • E indicate the null base in exon-only sequence
  • | is intron and exon boundary

Examples:

>>> a = msa.select_allele(r"A\*.*:01:01:01$").select_exon([6,7])
>>> print(''.join(a.format_alignment()))
# or simply use
>>> a.print_alignment()
                    3166                                  3342
                       |                                     |
   A*01:01:01:01       ATAGAAAAGG AGGGAGTTAC ACTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
   A*02:01:01:01       ATAGAAAAGG AGGGAGCTAC TCTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
   A*03:01:01:01       ATAGAAAAGG AGGGAGTTAC ACTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
   A*11:01:01:01       ATAGAAAAGG AGGGAGTTAC ACTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
   A*23:01:01:01       ATAGAAAAGG AGGGAGCTAC TCTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
   A*25:01:01:01       ATAGAAAAGG AGGGAGCTAC TCTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
Source code in pyhlamsa/gene/genemsa.py
def print_alignment(self, **kwargs_format: Any) -> None:
    """
    Print the MSA

    Note that the index shown on output header is 1-base absolute position.

    Also, the index is absolute index, if you you want continuous positiont,
    use `.reset_index()` first.

    * `*` indicate deletion
    * `ATCG` indicate the SNV
    * `E` indicate the null base in exon-only sequence
    * `|` is intron and exon boundary

    Examples:
      >>> a = msa.select_allele(r"A\\*.*:01:01:01$").select_exon([6,7])
      >>> print(''.join(a.format_alignment()))
      # or simply use
      >>> a.print_alignment()
                          3166                                  3342
                             |                                     |
         A*01:01:01:01       ATAGAAAAGG AGGGAGTTAC ACTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
         A*02:01:01:01       ATAGAAAAGG AGGGAGCTAC TCTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
         A*03:01:01:01       ATAGAAAAGG AGGGAGTTAC ACTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
         A*11:01:01:01       ATAGAAAAGG AGGGAGTTAC ACTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
         A*23:01:01:01       ATAGAAAAGG AGGGAGCTAC TCTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
         A*25:01:01:01       ATAGAAAAGG AGGGAGCTAC TCTCAGGCTG CAA| GCAGTGA CAGTGCCCAG
    """
    page_str = self.format_alignment(**kwargs_format)
    for i in page_str:
        print(i, end="")

print_alignment_diff(self, ref_allele: str = '', **kwargs_format: Any) -> None inherited

Print the sequences of all alleles diff from ref_allele sequence.

Same as print_alignment, but once the base of alleles is same as the base in reference sequence, - will be used

Returns:

Type Description
str

A formatted string

Examples:

>>> a = msa.select_allele(r"A\*.*:01:01:01$").select_exon([6,7])
>>> print(a.format_alignment_diff())
                    3166                                  3342
                       |                                     |
   A*01:01:01:01       ATAGAAAAGG AGGGAGTTAC ACTCAGGCTG CAA| GCAGTGA CAGTGCC
   A*02:01:01:01       ---------- ------C--- T--------- ---| ------- -------
   A*03:01:01:01       ---------- ---------- ---------- ---| ------- -------
   A*11:01:01:01       ---------- ---------- ---------- ---| ------- -------
   A*23:01:01:01       ---------- ------C--- T--------- ---| ------- -------
   A*25:01:01:01       ---------- ------C--- T--------- ---| ------- -------
   A*26:01:01:01       ---------- ------C--- T--------- ---| ------- -------
   A*29:01:01:01       ---------- ------C--- T--------- ---| ------- -------
Source code in pyhlamsa/gene/genemsa.py
def print_alignment_diff(self, ref_allele: str = "", **kwargs_format: Any) -> None:
    """
    Print the sequences of all alleles diff from `ref_allele` sequence.

    Same as print_alignment, but
    once the base of alleles is same as the base in reference sequence,
    `-` will be used

    Returns:
      str: A formatted string

    Examples:
      ```
      >>> a = msa.select_allele(r"A\\*.*:01:01:01$").select_exon([6,7])
      >>> print(a.format_alignment_diff())
                          3166                                  3342
                             |                                     |
         A*01:01:01:01       ATAGAAAAGG AGGGAGTTAC ACTCAGGCTG CAA| GCAGTGA CAGTGCC
         A*02:01:01:01       ---------- ------C--- T--------- ---| ------- -------
         A*03:01:01:01       ---------- ---------- ---------- ---| ------- -------
         A*11:01:01:01       ---------- ---------- ---------- ---| ------- -------
         A*23:01:01:01       ---------- ------C--- T--------- ---| ------- -------
         A*25:01:01:01       ---------- ------C--- T--------- ---| ------- -------
         A*26:01:01:01       ---------- ------C--- T--------- ---| ------- -------
         A*29:01:01:01       ---------- ------C--- T--------- ---| ------- -------
      ```
    """
    new_msa = self._calc_diff_msa(ref_allele)
    new_msa.print_alignment(**kwargs_format)

print_alignment_from_center(self: ~GenemsaType, pos: Union[int, collections.abc.Iterable[int]], left: int = 5, right: int = 5, **kwargs_format: Any) -> None inherited

Print all alleles sequences from the center of specific position

Check .format_alignment() for output format detail

Parameters:

Name Type Description Default
pos int or list of int

The (0-base relative) positions

required
left int

How many base shall print at the left of the center

5
right int

How many base shall print at the right of the center

5

Returns:

Type Description
str

A formatted string

Examples:

 gDNA                    3022
                         |
 HLA-A*03:02        AGAGAAAAAT
 HLA-A*01:40        -----G----
 HLA-A*03:20        -----G----
Source code in pyhlamsa/gene/genemsa.py
def print_alignment_from_center(
    self: GenemsaType,
    pos: Union[int, Iterable[int]],
    left: int = 5,
    right: int = 5,
    **kwargs_format: Any,
) -> None:
    """
    Print all alleles sequences from the center of specific position

    Check `.format_alignment()` for output format detail

    Args:
      pos (int or list of int): The (0-base relative) positions
      left (int): How many base shall print at the left of the center
      right (int): How many base shall print at the right of the center

    Returns:
      str: A formatted string

    Examples:
        ```
         gDNA                    3022
                                 |
         HLA-A*03:02        AGAGAAAAAT
         HLA-A*01:40        -----G----
         HLA-A*03:20        -----G----
        ```
    """
    if isinstance(pos, int):
        want_pos = [pos]
    else:
        want_pos = list(pos)
    if not want_pos:
        return
    new_msa = self._calc_center_msa(want_pos, left, right)
    show_position_set = set(self.index[i].pos for i in want_pos)
    new_msa.print_alignment(show_position_set=show_position_set, **kwargs_format)

print_snv(self, **kwargs_format: Any) -> None inherited

A handy function to show all the variation between the alleles

Note: the | in the output string is NOT intron and exon boundary

Check .format_alignment() for output format detail

Returns:

Type Description
str

A formatted string

Examples:

Total variantion: 71
                        308    314          329          342    349
                          |      |            |            |      |
 A*01:01:01:01       TGGCCGTCAT GGCGCC| CCCT CCTCCT| ACTC TCGGGGGCCC TGG
 A*02:01:01:01       ---------- ------| ---- -G----| ---- --------T- ---
 A*03:01:01:01       ---------- ------| ---- ------| ---- ---------- ---
 A*11:01:01:01       ---------- ------| ---- ------| ---- ---------- ---
 A*23:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
 A*25:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
 A*26:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
 A*29:01:01:01       ---------- ------| ---- ------| ---- -T-------- ---
 A*30:01:01:01       ---------- ------| ---- ------| ---- ---------- ---
 A*32:01:01:01       ---------- ------| ---- ------| ---- -T-------- ---
 A*33:01:01:01       ---------- ------| ---- ------| ---- -T-------- ---
 A*34:01:01:01       -----A---- ------| ---- -G----| ---- ---------- ---
 A*36:01:01:01       ---------- ------| ---- ------| ---- ---------- ---
 A*66:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
 A*68:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
 A*69:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
 A*74:01:01:01       ---------- ------| ---- ------| ---- -T-------- ---
 A*80:01:01:01       ---------- -C----| ---- ------| ---- ---------- ---
Source code in pyhlamsa/gene/genemsa.py
def print_snv(self, **kwargs_format: Any) -> None:
    """
    A handy function to show all the variation between the alleles

    Note: the `|` in the output string is NOT intron and exon boundary

    Check `.format_alignment()` for output format detail

    Returns:
      str: A formatted string
    Example:
        ```
        Total variantion: 71
                                308    314          329          342    349
                                  |      |            |            |      |
         A*01:01:01:01       TGGCCGTCAT GGCGCC| CCCT CCTCCT| ACTC TCGGGGGCCC TGG
         A*02:01:01:01       ---------- ------| ---- -G----| ---- --------T- ---
         A*03:01:01:01       ---------- ------| ---- ------| ---- ---------- ---
         A*11:01:01:01       ---------- ------| ---- ------| ---- ---------- ---
         A*23:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
         A*25:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
         A*26:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
         A*29:01:01:01       ---------- ------| ---- ------| ---- -T-------- ---
         A*30:01:01:01       ---------- ------| ---- ------| ---- ---------- ---
         A*32:01:01:01       ---------- ------| ---- ------| ---- -T-------- ---
         A*33:01:01:01       ---------- ------| ---- ------| ---- -T-------- ---
         A*34:01:01:01       -----A---- ------| ---- -G----| ---- ---------- ---
         A*36:01:01:01       ---------- ------| ---- ------| ---- ---------- ---
         A*66:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
         A*68:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
         A*69:01:01:01       ---------- ------| ---- -G----| ---- ---------- ---
         A*74:01:01:01       ---------- ------| ---- ------| ---- -T-------- ---
         A*80:01:01:01       ---------- -C----| ---- ------| ---- ---------- ---
        ```
    """
    msa, grouped_bases = self._calc_snp_msa()
    show_position_set = set(
        self.index[i].pos for bases in grouped_bases for i in bases
    )
    print(f"#Total variantion: {sum(map(len, grouped_bases))}\n")
    msa = msa._calc_diff_msa()
    msa.print_alignment(show_position_set=show_position_set, **kwargs_format)

remove_allele(self: ~GenemsaType, query: Union[str, collections.abc.Iterable[str]], inplace: bool = True) -> ~GenemsaType inherited

Remove allele sequences by regex(when query is string) or by exactly deleting (when query is a list)

Source code in pyhlamsa/gene/genemsa.py
def remove_allele(
    self: GenemsaType, query: Union[str, Iterable[str]], inplace: bool = True
) -> GenemsaType:
    """
    Remove allele sequences by regex(when query is string)
    or by exactly deleting (when query is a list)
    """
    if inplace:
        new_msa = self
    else:
        new_msa = self.copy()
    if isinstance(query, str):
        for allele in list(self):
            if re.match(query, allele):
                del new_msa.alleles[allele]
    elif isinstance(query, Iterable):
        for allele in query:
            del new_msa.alleles[allele]
    else:
        raise NotImplementedError
    return new_msa

reset_index(self: ~GenemsaType) -> ~GenemsaType inherited

Reset index: The old position information will be discard.

Each position information will be counted from 0 and the label and name will copy from its block information

Examples:

>>> print(a.format_alignment_diff())
                   101 103 105 107 109 111 113 115 117 119
                     |   |   |   |   |   |   |   |   |   |
 A*01:01:01:01       G   G   T   C   C   A   C   C   G   A
 A*01:01:01:02N      -   -   -   -   -   -   -   -   -   -

>>> print(a.reset_index().format_alignment_diff())
                    1
                    |
 A*01:01:01:01      GGTCCACCGA
 A*01:01:01:02N     ----------
Source code in pyhlamsa/gene/genemsa.py
def reset_index(self: GenemsaType) -> GenemsaType:
    """
    Reset index:
    The old position information will be discard.

    Each position information will be counted from 0 and
    the label and name will copy from its block information

    Example:
    ``` python
    >>> print(a.format_alignment_diff())
                       101 103 105 107 109 111 113 115 117 119
                         |   |   |   |   |   |   |   |   |   |
     A*01:01:01:01       G   G   T   C   C   A   C   C   G   A
     A*01:01:01:02N      -   -   -   -   -   -   -   -   -   -

    >>> print(a.reset_index().format_alignment_diff())
                        1
                        |
     A*01:01:01:01      GGTCCACCGA
     A*01:01:01:02N     ----------
    ```
    """
    new_msa = self.copy()
    start = 0
    new_msa.index = []
    for block in self.list_blocks():
        for _ in range(block.length):
            new_msa.index.append(
                IndexInfo(
                    pos=start,
                    type=block.type,
                    name=block.name,
                )
            )
            start += 1
    assert start == self.get_length()
    return new_msa

reverse_complement(self: ~GenemsaType) -> ~GenemsaType inherited

Reverse the sequences

Source code in pyhlamsa/gene/genemsa.py
def reverse_complement(self: GenemsaType) -> GenemsaType:
    """Reverse the sequences"""
    new_msa = self.copy(copy_allele=False)
    new_msa.blocks = copy.deepcopy(list(reversed(self.blocks)))
    new_msa.index = copy.deepcopy(list(reversed(self.index)))
    new_msa.extend(
        {allele: str(Seq(seq).reverse_complement()) for allele, seq in self.items()}
    )
    return new_msa

save_msa(self, file_fasta: FileType, file_json: FileType) -> None inherited

Save Genemsa to pyhlamsa format (fasta and json)

Source code in pyhlamsa/gene/genemsa.py
def save_msa(self, file_fasta: FileType, file_json: FileType) -> None:
    """Save Genemsa to pyhlamsa format (fasta and json)"""
    self.to_fasta(file_fasta, gap=True)
    f_json, require_close = getFileHandle(file_json)
    json.dump(self.meta_to_json(), f_json)
    if require_close:
        f_json.close()

select_allele(self: ~GenemsaType, query: Union[str, collections.abc.Iterable[str]]) -> ~GenemsaType inherited

Select allele name by regex(when query is string) or by exactly selecting (when query is a list)

Examples:

>>> # select allele name start with A*01:01
>>> msa.select_allele(r"A\*01:01.*")
>>> # select allele by list of string
>>> msa.select_allele(["A*01:01", "A*02:03"])
Source code in pyhlamsa/gene/genemsa.py
def select_allele(
    self: GenemsaType, query: Union[str, Iterable[str]]
) -> GenemsaType:
    """
    Select allele name by regex(when query is string)
    or by exactly selecting (when query is a list)

    Examples:
      >>> # select allele name start with A*01:01
      >>> msa.select_allele(r"A\\*01:01.*")
      >>> # select allele by list of string
      >>> msa.select_allele(["A*01:01", "A*02:03"])
    """
    new_msa = self.copy(copy_allele=False)
    if isinstance(query, str):
        new_msa.extend(
            {allele: seq for allele, seq in self.items() if re.match(query, allele)}
        )
    elif isinstance(query, Iterable):
        new_msa.extend({allele: self.get(allele) for allele in query})
    else:
        raise NotImplementedError
    return new_msa

select_block(self: ~GenemsaType, index: Iterable) -> ~GenemsaType inherited

Extract blocks by index

Parameters:

Name Type Description Default
index list of int

Leave empty if you want all the blocks.

Index start from 0. e.g.

  • 0 for 5-UTR
  • 1 for exon1
  • 2 for intron1
  • 3 for exon2
  • 4 for 3-UTR(for two exons gene)
  • -1 for last block

or you can use list of string, it will select by block name

required

Examples:

>>> a_gen.select_block([-1])
<A  alleles=4101 block=3UTR(302)>
>>> a_gen.select_block([2, 3])
<A  alleles=4101 block=intron1(130) exon2(335)>
>>> a_gen.select_block(["5UTR", "exon3"])
<A  alleles=4101 block=5UTR(301) exon3(413)>
Source code in pyhlamsa/gene/genemsa.py
def select_block(self: GenemsaType, index: Iterable[BlockInput]) -> GenemsaType:
    """
    Extract blocks by index

    Args:
      index (list of int): Leave empty if you want all the blocks.

        Index start from 0.  e.g.

        * 0 for 5-UTR
        * 1 for exon1
        * 2 for intron1
        * 3 for exon2
        * 4 for 3-UTR(for two exons gene)
        * -1 for last block

        or you can use list of string,
        it will select by block name

    Example:
      >>> a_gen.select_block([-1])
      <A  alleles=4101 block=3UTR(302)>

      >>> a_gen.select_block([2, 3])
      <A  alleles=4101 block=intron1(130) exon2(335)>

      >>> a_gen.select_block(["5UTR", "exon3"])
      <A  alleles=4101 block=5UTR(301) exon3(413)>
    """
    new_msa = self.copy(copy_allele=False)

    # choose block index by index
    new_block = []
    new_index = []
    all_pos = []
    index_ids = [self.get_block_index(i) for i in index]
    for i in index_ids:
        new_block.append(self.blocks[i])
        start, end = self.get_block_interval(i)
        all_pos.append((start, end))
        new_index.extend(self.index[start:end])
    new_msa.blocks = new_block
    new_msa.index = new_index

    # extract the sequences inside block region
    for allele, gen_seq in self.items():
        new_seq = "".join([gen_seq[start:end] for start, end in all_pos])
        new_msa.append(allele, new_seq)
    return new_msa.copy()

select_complete(self: ~GenemsaType) -> ~GenemsaType inherited

Select non exon-only sequences (No E in the sequence)

Source code in pyhlamsa/gene/genemsa.py
def select_complete(self: GenemsaType) -> GenemsaType:
    """Select non exon-only sequences (No `E` in the sequence)"""
    new_msa = self.copy(copy_allele=False)
    new_msa.extend({allele: seq for allele, seq in self.items() if "E" not in seq})
    return new_msa

select_exon(self: ~GenemsaType, exon_index: Iterable = []) -> ~GenemsaType inherited

Extract the exon by index.

Parameters:

Name Type Description Default
exon_index list[str|int]

Index start from 1. i.e.

  • 1 for exon1
  • 2 for exon2

Leave empty if you want all the exons

If the exon_index contains list of string, it will select by name

[]

Examples:

>>> a_gen.select_exon([2]))
<A gen alleles=7350 block=exon2(351)>
>>> a_gen.select_exon([2, 3]))
<A nuc alleles=7350 block=exon2(351) exon3(436)>
>>> a_gen.select_exon(["exon2", "exon3"]))
<A nuc alleles=7350 block=exon2(351) exon3(436)>
Source code in pyhlamsa/gene/genemsa.py
def select_exon(
    self: GenemsaType, exon_index: Iterable[BlockInput] = []
) -> GenemsaType:
    """
    Extract the exon by index.

    Args:
      exon_index (list[str|int]): Index start from 1. i.e.

        * 1 for exon1
        * 2 for exon2

        Leave empty if you want all the exons

        If the exon_index contains list of string,
        it will select by name

    Example:
      >>> a_gen.select_exon([2]))
      <A gen alleles=7350 block=exon2(351)>

      >>> a_gen.select_exon([2, 3]))
      <A nuc alleles=7350 block=exon2(351) exon3(436)>

      >>> a_gen.select_exon(["exon2", "exon3"]))
      <A nuc alleles=7350 block=exon2(351) exon3(436)>
    """
    exons = [b for b in self.list_blocks() if b.type == "exon"]

    # If not specific the index, extract all exons
    exon_list: list[BlockInput] = []
    if not exon_index:
        exon_list = exons  # type: ignore
    else:
        for i in exon_index:
            if isinstance(i, int):
                # exon -> blocks position
                if i < 1 or i > len(exons):
                    raise IndexError(f"{i} is out of exon index")
                i = exons[i - 1]
            exon_list.append(i)

    # check
    for i in exon_list:
        block = self.get_block(i)
        if block.type != "exon":
            raise IndexError(f"{block} is not exon: input={i}")
    return self.select_block(exon_list)

select_incomplete(self: ~GenemsaType) -> ~GenemsaType inherited

Select exon-only sequences (E exists in the sequence)

Source code in pyhlamsa/gene/genemsa.py
def select_incomplete(self: GenemsaType) -> GenemsaType:
    """Select exon-only sequences (`E` exists in the sequence)"""
    new_msa = self.copy(copy_allele=False)
    new_msa.extend({allele: seq for allele, seq in self.items() if "E" in seq})
    return new_msa

set_blocks(self: ~GenemsaType, blocks: Iterable) -> ~GenemsaType inherited

Set blocks (inplace)

Source code in pyhlamsa/gene/genemsa.py
def set_blocks(
    self: GenemsaType, blocks: Iterable[Union[int, BlockInfo]]
) -> GenemsaType:
    """Set blocks (inplace)"""
    new_blocks = []
    for i in blocks:
        if isinstance(i, int):
            new_blocks.append(BlockInfo(length=i))
        else:
            new_blocks.append(copy.deepcopy(i))

    if len(self) and self.get_length() != sum(blk.length for blk in new_blocks):
        raise ValueError("Total block length not match to alignments")

    self.blocks = new_blocks
    return self

set_reference(self: ~GenemsaType, allele: str) -> ~GenemsaType inherited

Set the reference in msa (inplace)

Source code in pyhlamsa/gene/genemsa.py
def set_reference(self: GenemsaType, allele: str) -> GenemsaType:
    """Set the reference in msa (inplace)"""
    if allele not in self:
        self.logger.warning(f"Cannot find {allele} in MSA")
    self.reference = allele
    return self

shrink(self: ~GenemsaType) -> ~GenemsaType inherited

Remove empty base if all bases in that column is gap

Source code in pyhlamsa/gene/genemsa.py
def shrink(self: GenemsaType) -> GenemsaType:
    """Remove empty base if all bases in that column is gap"""
    # index to delete
    freqs = self.calculate_frequency()
    masks = [f[4] != sum(f) for f in freqs]
    new_msa = self.copy(copy_allele=False)

    # recalcuate blocks
    for i in range(len(self.blocks)):
        start, end = self.get_block_interval(i)
        new_msa.blocks[i].length = sum(masks[start:end])
    assert sum(masks) == new_msa.get_length()
    new_msa.index = [new_msa.index[i] for i in range(len(masks)) if masks[i]]

    # remove base in allele
    for allele, seq in self.items():
        new_msa.append(allele, "".join(seq[i] for i in range(len(seq)) if masks[i]))

    return new_msa

size(self) -> tuple inherited

Get the size (num_of_allele, length_of_sequence)

Source code in pyhlamsa/gene/genemsa.py
def size(self) -> tuple[int, int]:
    """Get the size (num_of_allele, length_of_sequence)"""
    return (len(self), self.get_length())

sort(self: ~GenemsaType) -> ~GenemsaType inherited

Sort the allele by their sequences (inplace)

Source code in pyhlamsa/gene/genemsa.py
def sort(self: GenemsaType) -> GenemsaType:
    """Sort the allele by their sequences (inplace)"""
    self.alleles = dict(sorted(self.items(), key=lambda i: i[1]))
    return self

sort_name(self: ~GenemsaType) -> ~GenemsaType inherited

Sort the allele by alelle name (inplace)

Source code in pyhlamsa/gene/genemsa.py
def sort_name(self: GenemsaType) -> GenemsaType:
    """Sort the allele by alelle name (inplace)"""
    self.alleles = dict(sorted(self.items(), key=lambda i: i[0]))
    return self

split_block(self: ~GenemsaType) -> list inherited

Split the msa by blocks

Source code in pyhlamsa/gene/genemsa.py
def split_block(self: GenemsaType) -> list[GenemsaType]:
    """Split the msa by blocks"""
    return [self.select_block([i]) for i in range(len(self.list_blocks()))]

to_MultipleSeqAlignment(self) -> MultipleSeqAlignment inherited

Transfer this object to MultipleSeqAlignment(biopython)

Source code in pyhlamsa/gene/genemsa.py
def to_MultipleSeqAlignment(self) -> MultipleSeqAlignment:
    """Transfer this object to MultipleSeqAlignment(biopython)"""
    return MultipleSeqAlignment(
        self.to_records(gap=True), annotations=self.meta_to_json()
    )

to_bam(self, fname: str, ref_allele: str = '', save_ref: bool = True) -> None inherited

Save the MSA into bam

All alleles will seen as reads aligned on ref_allele

Parameters:

Name Type Description Default
fname str

The name of bamfile

required
ref_allele str

The reference allele. If the ref_allele is empty, the first allele will be reference.

''
save_ref bool

The reference allele will also be saved in the bam file

True
Source code in pyhlamsa/gene/genemsa.py
def to_bam(self, fname: str, ref_allele: str = "", save_ref: bool = True) -> None:
    """
    Save the MSA into bam

    All alleles will seen as reads aligned on `ref_allele`

    Args:
      fname (str): The name of bamfile
      ref_allele (str): The reference allele.
          If the ref_allele is empty, the first allele will be reference.
      save_ref (bool): The reference allele will also be saved in the bam file
    """
    ref_allele, ref_seq = self.get_allele_or_error(ref_allele)

    # setup reference and header
    header = {
        "HD": {"VN": "1.0"},
        "SQ": [
            {"LN": len(ref_seq.replace("-", "").replace("E", "")), "SN": ref_allele}
        ],
    }

    # write bam file
    with pysam.AlignmentFile(fname, "wb", header=header) as outf:
        for allele, seq in self.items():
            # skip
            if not save_ref and allele == ref_allele:
                continue

            # init bam record
            a = pysam.AlignedSegment()
            a.query_name = allele
            a.query_sequence = seq.replace("E", "").replace("-", "")
            a.cigar = cigar.cigar_to_pysam(self.get_cigar(allele, ref_allele))  # type: ignore

            # set to default
            a.mapping_quality = 60
            a.reference_id = 0
            a.reference_start = 0
            # a.template_length = 0
            # a.query_qualities = [30] * len(a.query_sequence)
            # a.flag = 0
            # a.next_reference_id = 0
            # a.next_reference_start = 0
            outf.write(a)

    pysam.sort("-o", fname, fname)  # type: ignore
    pysam.index(fname)  # type: ignore

to_fasta(self, fname: FileType, gap: bool = True, ref_only: bool = False) -> None inherited

Save the MSA into fasta

Parameters:

Name Type Description Default
gap bool

The sequence included gap or not

True
ref_only bool

Save reference sequence only

False
Source code in pyhlamsa/gene/genemsa.py
def to_fasta(
    self, fname: FileType, gap: bool = True, ref_only: bool = False
) -> None:
    """
    Save the MSA into fasta

    Args:
        gap (bool): The sequence included gap or not
        ref_only (bool): Save reference sequence only
    """
    if ref_only:
        msa = self.select_allele([self.get_reference()[0]])
    else:
        msa = self
    SeqIO.write(msa.to_records(gap=gap), fname, "fasta")

to_gff(self, fname: FileType, strand: str = '+', ref_allele: str = '', igv_show_label: bool = False, save_all: bool = False) -> None inherited

Save to GFF3 format

Parameters:

Name Type Description Default
fname str

The file name of gff3

required
strand str

Must be "+" or "-".

If the strand is "-", it will add strand in GFF file, if you want to reverse the sequence, please use reverse_complement first.

'+'
ref_allele str

The name of allele (Must be the same in save_bam)

''
igv_show_label bool

If it's false, it will generate proper GFF3. Set it for True as default for easiler label reading in IGV.

False
save_all bool

Set it True if you want to create gff records for all alleles in msa Note that this is not very fast.

False
Source code in pyhlamsa/gene/genemsa.py
def to_gff(
    self,
    fname: FileType,
    strand: str = "+",
    ref_allele: str = "",
    igv_show_label: bool = False,
    save_all: bool = False,
) -> None:
    """
    Save to GFF3 format

    Args:
      fname (str): The file name of gff3
      strand (str): Must be "+" or "-".

          If the strand is "-", it will add `strand` in GFF file,
          if you want to reverse the sequence, please use `reverse_complement` first.

      ref_allele (str): The name of allele (Must be the same in save_bam)
      igv_show_label (bool): If it's false, it will generate proper GFF3.
          Set it for True as default for easiler label reading in IGV.
      save_all (bool):
          Set it True if you want to create gff records for all alleles in msa
          Note that this is not very fast.
    """
    # TODO: should I save strand == '-' in model?
    ref_allele = self.get_allele_or_error(ref_allele)[0]

    # labels
    if not all(b.type for b in self.blocks):
        self.logger.warning(
            "You should assign block's label. (We assume seq_type='other')"
        )
        self.assume_label("other")

    if save_all:
        alleles = list(self.list_alleles())
    else:
        alleles = [ref_allele]

    records = []
    for allele in alleles:
        self.logger.debug(f"{allele} to gff")

        # remove gap
        msa = self.select_allele([allele]).shrink()
        records.extend(
            _block_to_gff(
                msa.blocks, allele, strand=strand, igv_show_label=igv_show_label
            )
        )

    # save
    f_gff, require_close = getFileHandle(fname)
    f_gff.write("##gff-version 3\n")
    f_gff.write(_table_to_string(records))
    if require_close:
        f_gff.close()

to_imgt_alignment(self, file: FileType, seq_type: str = 'gen', index_start_from: str = 'exon1') -> None inherited

Export to IMGT-alignment-like format.

But some features do not implemented: * AA codon position: So the values in the header always 1 * position: So the values in the header always 1

Parameters:

Name Type Description Default
seq_type str

gen or nuc format

'gen'
index_start_from str

set the start position (1) starts from.

'exon1'
Source code in pyhlamsa/gene/genemsa.py
def to_imgt_alignment(
    self,
    file: FileType,
    seq_type: str = "gen",
    index_start_from: str = "exon1",
) -> None:
    """
    Export to IMGT-alignment-like format.

    But some features do not implemented:
    * AA codon position: So the values in the header always 1
    * position: So the values in the header always 1

    Args:
        seq_type: gen or nuc format
        index_start_from: set the start position (1) starts from.
    """
    # This function should be located in io_operation
    new_msa = self._calc_imgt_alignment_msa()
    start = new_msa.get_block_interval(index_start_from)[0]

    # The position is calculated by reference sequence not MSA itself
    ref_no_gap_pos = -1
    seq = new_msa.get_reference()[1]
    for i, ind in enumerate(new_msa.index):
        if seq[i] != ".":
            ref_no_gap_pos += 1
        ind.pos = ref_no_gap_pos

    if seq_type == "gen":
        # force modify the index
        # IMGT ends with -1 and start with 1
        for ind in new_msa.index:
            ind.pos -= start - 1
            if ind.pos < 0:
                ind.pos -= 1
        pages = self._format_page(
            [new_msa.index],
            header_text_align="left",
            index_header=["gDNA"],
            show_position_when_same_value=False,
        )
    elif seq_type == "nuc":
        fake_index = [IndexInfo(pos=0)] * len(new_msa.index)
        pages = self._format_page(
            [new_msa.index, fake_index],
            header_text_align="left",
            index_header=["cDNA", "AA Codon"],
            show_position_when_same_value=False,
        )
    else:
        raise NotImplementedError(f"Not implement {seq_type=}")

    # write
    txt, require_close = getFileHandle(file)
    txt.writelines(chain.from_iterable(map(new_msa._apply_page, pages)))
    if require_close:
        txt.close()

to_records(self, gap: bool = True) -> list[SeqRecord] inherited

Transfer MSA to list of SeqRecord.

If the base is E which indicate the bases are not existed for exon-only sequence, those beses are treated as gap.

Parameters:

Name Type Description Default
gap bool

The sequence included gap or not

True
Source code in pyhlamsa/gene/genemsa.py
def to_records(self, gap: bool = True) -> list[SeqRecord]:
    """
    Transfer MSA to list of SeqRecord.

    If the base is `E`
    which indicate the bases are not existed for exon-only sequence,
    those beses are treated as gap.

    Args:
      gap (bool): The sequence included gap or not
    """
    if gap:
        return [
            SeqRecord(Seq(seq.replace("E", "-")), id=allele, description="")
            for allele, seq in self.items()
        ]
    else:
        return [
            SeqRecord(
                Seq(seq.replace("E", "").replace("-", "")),
                id=allele,
                description="",
            )
            for allele, seq in self.items()
        ]

to_vcf(self, file_vcf: str, ref_allele: str = '', save_ref: bool = True, plain_text: bool = False) -> None inherited

Save Genemsa into vcf format

It will save msa into sorted and normalized vcf file (xxx.vcf.gz) along with vcf's index (xxx.vcf.gz.tbi) (You can still output plain vcf without any manipulation by set plain_text=True)

Note that vcf-format discard the per-base alignment especially when there are many snps/indels mixed together in that position. Also, after vcf is normalized, the msa may not be the same MSA alignment as before. (But sequences are still the same)

Parameters:

Name Type Description Default
ref_allele str

The name of reference allele. I recommend the reference allele contains no gaps.

''
save_ref bool

The reference allele will also be saved in the vcf file

True
plain_text bool

Disable sort vcf and index vcf

False
Source code in pyhlamsa/gene/genemsa.py
def to_vcf(
    self,
    file_vcf: str,
    ref_allele: str = "",
    save_ref: bool = True,
    plain_text: bool = False,
) -> None:
    """
    Save Genemsa into vcf format

    It will save msa into sorted and normalized vcf file (xxx.vcf.gz)
    along with vcf's index (xxx.vcf.gz.tbi)
    (You can still output plain vcf without any manipulation by set plain_text=True)

    Note that vcf-format discard the per-base alignment especially
    when there are many snps/indels mixed together in that position.
    Also, after vcf is normalized, the msa may not be the same MSA alignment as before.
    (But sequences are still the same)

    Args:
      ref_allele (str):
        The name of reference allele.
        I recommend the reference allele contains no gaps.
      save_ref (bool): The reference allele will also be saved in the vcf file
      plain_text (bool): Disable sort vcf and index vcf
    """
    ref_allele, ref_seq = self.get_allele_or_error(ref_allele)

    # extract all variants from all alleles
    allele_variants = {}
    for allele in self.list_alleles():
        if not save_ref and allele == ref_allele:
            continue
        variants = vcf.extract_variants(ref_seq, self.get(allele))
        for v in variants:
            v.chrom = ref_allele
        allele_variants[allele] = variants

    # remove suffix
    if file_vcf.endswith(".vcf.gz"):
        basename = file_vcf[:-7]
    elif file_vcf.endswith(".vcf"):
        basename = file_vcf[:-4]
    else:
        basename = file_vcf
    tmpname = f"{basename}.tmp"

    # to vcf
    with open(f"{tmpname}.vcf", "w") as f_vcf:
        f_vcf.write(vcf.get_vcf_header(ref_allele, ref_seq))
        f_vcf.write("\n")
        f_vcf.write(_table_to_string(vcf.variants_to_table(allele_variants)))
    if plain_text:
        os.rename(f"{tmpname}.vcf", f"{basename}.vcf")
        return

    # sort, normalize and index
    self.select_allele([ref_allele]).to_fasta(f"{tmpname}.fa", gap=False)
    with open(f"{tmpname}.norm.vcf.gz", "wb") as f_vcf:
        f_vcf.write(
            bcftools.norm(  # type: ignore
                "-f", f"{tmpname}.fa", f"{tmpname}.vcf", "-O", "z"
            )
        )
    with open(f"{basename}.vcf.gz", "wb") as f_vcf:
        f_vcf.write(bcftools.sort(f"{tmpname}.norm.vcf.gz", "-O", "z"))  # type: ignore
    bcftools.index(f"{basename}.vcf.gz", "-t", "-f")  # type: ignore
    os.remove(f"{tmpname}.fa")
    os.remove(f"{tmpname}.fa.fai")
    os.remove(f"{tmpname}.vcf")
    os.remove(f"{tmpname}.norm.vcf.gz")

truth(self) -> bool inherited

If msa has 0 alleles return False else return Implement if(self) function

Source code in pyhlamsa/gene/genemsa.py
def truth(self) -> bool:
    """
    If msa has 0 alleles return False
    else return Implement if(self) function
    """
    return bool(self.alleles)

HLAmsa (Familymsa)

A HLA interface.

This module read the HLA MSA from alignments/*.txt files, which use | to separate the intron/exon/UTR regions, but the labels are assumed with the order: 5UTR exon1 intron1 exon2 ... exonN 3UTR

Attributes:

Name Type Description
genes dict[str, Genemsa]

The dictionary use gene_name as key and msa object as value

Source code in pyhlamsa/gene_family/hla.py
class HLAmsa(Familymsa):
    """
    A HLA interface.

    This module read the HLA MSA from alignments/*.txt files,
    which use `|` to separate the intron/exon/UTR regions, but
    the labels are assumed with the order:
    `5UTR exon1 intron1 exon2 ... exonN 3UTR`

    Attributes:
      genes (dict[str, Genemsa]):
        The dictionary use gene_name as key and msa object as value
    """

    def __init__(
        self,
        genes: GeneSet = None,
        filetype: TypeSet = ["gen", "nuc"],
        imgt_alignment_folder: str = "",
        version: str = "Latest",
    ):
        """
        Args:
            genes (str | list[str]): A list of genes you want to read.

                Set None if you want read all gene in HLA

            filetype (str | list[str] | set[str]): A list of filetype.

                If both `gen` and `nuc` are given, it will merge them automatically.

            imgt_alignment_folder (str): Path to your IMGT-formatted MSA alignment file

                You can manually download the folder from
                <https://github.com/ANHIG/IMGTHLA/tree/Latest/alignments>
                or <http://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/>
                and unzip Alignments_Rel_3470.zip

                Otherwise it will automatically download the database to
                `imgt_alignment_folder`. Default is `./alignment_v{verion}`

            version (str): IMGT version you want to download (e.g. 3470 for 3.47.0)

                If `imgt_alignment_folder` is existed, this value will be ignored.
                Use `Latest` to get latest version.
        """
        if not imgt_alignment_folder:
            imgt_alignment_folder = f"alignment_v{version}"
        super().__init__(
            genes, filetype, db_folder=imgt_alignment_folder, version=version
        )

    def _download_db(self, version: str = "Latest") -> None:
        """
        Download the IMGTHLA alignments folder to `db_folder`

        I get the alignments folder from git instead of
        <http://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/>
        for better version controlling
        """
        with TemporaryDirectory() as tmp_dir:
            self._run_shell(
                "git",
                "clone",
                "--branch",
                version,
                "--single-branch",
                "https://github.com/ANHIG/IMGTHLA.git",
                tmp_dir,
            )
            self._run_shell("mv", tmp_dir + "/alignments", self.db_folder)

    def _get_name(self, search_name: str) -> set[str]:
        """Handy function to list names from file pattern"""
        arr_files = glob(search_name)
        return set([f.split("/")[-1].split("_")[0] for f in arr_files])

    def list_db_gene(self, filetype: TypeSet) -> list[str]:
        """List the gene in folder"""
        drb = set(["DRB1", "DRB3", "DRB4", "DRB5"])
        if "gen" in filetype:
            names = names_gen = self._get_name(f"{self.db_folder}/*_gen.txt") | drb
        if "nuc" in filetype:
            names = names_nuc = self._get_name(f"{self.db_folder}/*_nuc.txt") | drb
        if "gen" in filetype and "nuc" in filetype:
            names = names_gen & names_nuc
            # * E exon7 is ambiguous, exon8 is gone (Also exon8 is not pseudo exon)
            #    <E gen alleles=258 block=5UTR(301) exon1(64) intron1(130)
            #        exon2(270) intron2(244) exon3(276) intron3(621) exon4(276)
            #        intron4(124) exon5(117) intron5(751) exon6(33) intron6(104)
            #        exon7(43) intron7(165) exon8(5) 3UTR(356)>
            #    <E nuc alleles=262 block=exon1(64) exon2(270) exon3(276)
            #        exon4(276) exon5(117) exon6(33) exon7(41)>
            if "gen" in filetype and "nuc" in filetype:
                names = names - set(["E"])
        return list(sorted(names))

    def read_db_gene(self, gene: str, filetype: TypeSet) -> Genemsa:
        """
        Read `{gene}_{filetype}.txt`.

        If both `gen` and `nuc` are given, it will merge them.
        """
        if "gen" in filetype:
            msa_gen = Genemsa.read_imgt_alignment(f"{self.db_folder}/{gene}_gen.txt")
            msa_gen.gene_name = gene
            if gene != "P":
                # P: special case: has even block
                # <P gen alleles=5 block=(475) (261) (589) (276) (124) (117)
                #                        (412) (33) (150) (48) (163) (297)>
                msa_gen.assume_label("gen")
            self.logger.debug(f"Gen {msa_gen}")
        if "nuc" in filetype:
            # Special Case: DRB* nuc are in DRB_nuc.txt
            if gene.startswith("DRB"):
                msa_nuc = Genemsa.read_imgt_alignment(f"{self.db_folder}/DRB_nuc.txt")
                msa_nuc = msa_nuc.select_allele(gene + ".*")
            else:
                msa_nuc = Genemsa.read_imgt_alignment(
                    f"{self.db_folder}/{gene}_nuc.txt"
                )

            msa_nuc.gene_name = gene
            msa_nuc.assume_label("nuc")
            self.logger.debug(f"Nuc {msa_nuc}")

        if "gen" in filetype and "nuc" in filetype:
            # remove some gene
            diff_name = list(
                set(msa_gen.get_sequence_names()) - set(msa_nuc.get_sequence_names())
            )
            if diff_name:
                self.logger.warning(
                    f"Remove alleles doesn't exist in gen and nuc either: {diff_name}"
                )
            msa_gen = msa_gen.remove_allele(diff_name)

            # merge
            msa_merged = msa_gen.merge_exon(msa_nuc)
            self.logger.debug(f"Merged {msa_merged}")
            return msa_merged

        elif "gen" in filetype:
            return msa_gen
        elif "nuc" in filetype:
            return msa_nuc
        raise ValueError("gen or nuc are not exist in filetype")

__getitem__(self, index: str) -> Genemsa inherited special

Source code in pyhlamsa/gene_family/hla.py
def __getitem__(self, index: str) -> Genemsa:
    """Get specific gene's msa"""
    return self.genes[index]

__init__(self, genes: Union[str, Iterable[str]] = None, filetype: Union[str, Iterable[str]] = ['gen', 'nuc'], imgt_alignment_folder: str = '', version: str = 'Latest') special

Parameters:

Name Type Description Default
genes str | list[str]

A list of genes you want to read.

Set None if you want read all gene in HLA

None
filetype str | list[str] | set[str]

A list of filetype.

If both gen and nuc are given, it will merge them automatically.

['gen', 'nuc']
imgt_alignment_folder str

Path to your IMGT-formatted MSA alignment file

You can manually download the folder from https://github.com/ANHIG/IMGTHLA/tree/Latest/alignments or http://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/ and unzip Alignments_Rel_3470.zip

Otherwise it will automatically download the database to imgt_alignment_folder. Default is ./alignment_v{verion}

''
version str

IMGT version you want to download (e.g. 3470 for 3.47.0)

If imgt_alignment_folder is existed, this value will be ignored. Use Latest to get latest version.

'Latest'
Source code in pyhlamsa/gene_family/hla.py
def __init__(
    self,
    genes: GeneSet = None,
    filetype: TypeSet = ["gen", "nuc"],
    imgt_alignment_folder: str = "",
    version: str = "Latest",
):
    """
    Args:
        genes (str | list[str]): A list of genes you want to read.

            Set None if you want read all gene in HLA

        filetype (str | list[str] | set[str]): A list of filetype.

            If both `gen` and `nuc` are given, it will merge them automatically.

        imgt_alignment_folder (str): Path to your IMGT-formatted MSA alignment file

            You can manually download the folder from
            <https://github.com/ANHIG/IMGTHLA/tree/Latest/alignments>
            or <http://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/>
            and unzip Alignments_Rel_3470.zip

            Otherwise it will automatically download the database to
            `imgt_alignment_folder`. Default is `./alignment_v{verion}`

        version (str): IMGT version you want to download (e.g. 3470 for 3.47.0)

            If `imgt_alignment_folder` is existed, this value will be ignored.
            Use `Latest` to get latest version.
    """
    if not imgt_alignment_folder:
        imgt_alignment_folder = f"alignment_v{version}"
    super().__init__(
        genes, filetype, db_folder=imgt_alignment_folder, version=version
    )

__iter__(self) -> Iterable[str] inherited special

Source code in pyhlamsa/gene_family/hla.py
def __iter__(self) -> Iterable[str]:
    """Iter gene name like iter(dict)"""
    return iter(self.genes)

items(self) -> Iterable[tuple[str, pyhlamsa.gene.genemsa.Genemsa]] inherited

list gene name and msa like dict.items()

Source code in pyhlamsa/gene_family/hla.py
def items(self) -> Iterable[tuple[str, Genemsa]]:
    """list gene name and msa like dict.items()"""
    return self.genes.items()

list_db_gene(self, filetype: Union[str, Iterable[str]]) -> list

List the gene in folder

Source code in pyhlamsa/gene_family/hla.py
def list_db_gene(self, filetype: TypeSet) -> list[str]:
    """List the gene in folder"""
    drb = set(["DRB1", "DRB3", "DRB4", "DRB5"])
    if "gen" in filetype:
        names = names_gen = self._get_name(f"{self.db_folder}/*_gen.txt") | drb
    if "nuc" in filetype:
        names = names_nuc = self._get_name(f"{self.db_folder}/*_nuc.txt") | drb
    if "gen" in filetype and "nuc" in filetype:
        names = names_gen & names_nuc
        # * E exon7 is ambiguous, exon8 is gone (Also exon8 is not pseudo exon)
        #    <E gen alleles=258 block=5UTR(301) exon1(64) intron1(130)
        #        exon2(270) intron2(244) exon3(276) intron3(621) exon4(276)
        #        intron4(124) exon5(117) intron5(751) exon6(33) intron6(104)
        #        exon7(43) intron7(165) exon8(5) 3UTR(356)>
        #    <E nuc alleles=262 block=exon1(64) exon2(270) exon3(276)
        #        exon4(276) exon5(117) exon6(33) exon7(41)>
        if "gen" in filetype and "nuc" in filetype:
            names = names - set(["E"])
    return list(sorted(names))

list_genes(self) -> list inherited

List all the gene's name in this family

Source code in pyhlamsa/gene_family/hla.py
def list_genes(self) -> list[str]:
    """List all the gene's name in this family"""
    return list(self.genes.keys())

read_db_gene(self, gene: str, filetype: Union[str, Iterable[str]]) -> Genemsa

Read {gene}_{filetype}.txt.

If both gen and nuc are given, it will merge them.

Source code in pyhlamsa/gene_family/hla.py
def read_db_gene(self, gene: str, filetype: TypeSet) -> Genemsa:
    """
    Read `{gene}_{filetype}.txt`.

    If both `gen` and `nuc` are given, it will merge them.
    """
    if "gen" in filetype:
        msa_gen = Genemsa.read_imgt_alignment(f"{self.db_folder}/{gene}_gen.txt")
        msa_gen.gene_name = gene
        if gene != "P":
            # P: special case: has even block
            # <P gen alleles=5 block=(475) (261) (589) (276) (124) (117)
            #                        (412) (33) (150) (48) (163) (297)>
            msa_gen.assume_label("gen")
        self.logger.debug(f"Gen {msa_gen}")
    if "nuc" in filetype:
        # Special Case: DRB* nuc are in DRB_nuc.txt
        if gene.startswith("DRB"):
            msa_nuc = Genemsa.read_imgt_alignment(f"{self.db_folder}/DRB_nuc.txt")
            msa_nuc = msa_nuc.select_allele(gene + ".*")
        else:
            msa_nuc = Genemsa.read_imgt_alignment(
                f"{self.db_folder}/{gene}_nuc.txt"
            )

        msa_nuc.gene_name = gene
        msa_nuc.assume_label("nuc")
        self.logger.debug(f"Nuc {msa_nuc}")

    if "gen" in filetype and "nuc" in filetype:
        # remove some gene
        diff_name = list(
            set(msa_gen.get_sequence_names()) - set(msa_nuc.get_sequence_names())
        )
        if diff_name:
            self.logger.warning(
                f"Remove alleles doesn't exist in gen and nuc either: {diff_name}"
            )
        msa_gen = msa_gen.remove_allele(diff_name)

        # merge
        msa_merged = msa_gen.merge_exon(msa_nuc)
        self.logger.debug(f"Merged {msa_merged}")
        return msa_merged

    elif "gen" in filetype:
        return msa_gen
    elif "nuc" in filetype:
        return msa_nuc
    raise ValueError("gen or nuc are not exist in filetype")

HLAmsaEX (Familymsa)

A HLA interface but read gene from MSF and read intron/exon information from hla.dat

I think this one is more reliable

Attributes:

Name Type Description
genes dict[str, Genemsa]

The dictionary use gene_name as key and msa object as value

Source code in pyhlamsa/gene_family/hla_ex.py
class HLAmsaEX(Familymsa):
    """
    A HLA interface but read gene from MSF and
    read intron/exon information from hla.dat

    I think this one is more reliable

    Attributes:
      genes (dict[str, Genemsa]):
        The dictionary use gene_name as key and msa object as value
    """

    def __init__(
        self,
        genes: GeneSet = None,
        filetype: TypeSet = ["gen", "nuc"],
        imgt_folder: str = "",
        version: str = "Latest",
    ):
        """
        Args:
            genes (str | list[str]): A list of genes you want to read.

                Set None if you want read all gene in HLA

            filetype (str | list[str] | set[str]): A list of filetype.

                If both `gen` and `nuc` are given, it will merge them automatically.

            imgt_folder (str): Path to your IMGT/HLA root folder

                You can manually download <https://github.com/ANHIG/IMGTHLA/> and
                checkout to specific branch

                Or it will automatically download the database with assigned `version` to
                `imgt_folder`. Default is `./IMGT_v{verion}`

            version (str): IMGT version you want to download (e.g. 3470 for 3.47.0)

                If `imgt_folder` is existed, this value will be ignored.

                You can use `Latest` to get the latest version
        """
        if not imgt_folder:
            imgt_folder = f"IMGT_v{version}"
        super().__init__(genes, filetype, db_folder=imgt_folder, version=version)

    def _download_db(self, version: str = "Latest") -> None:
        """
        Download the IMGT/HLA msf and hla.dat to folder `imgt_folder`
        """
        try:
            self._run_shell("git", "lfs")
        except subprocess.CalledProcessError:
            self.logger.error(f"git lfs not installed (Try apt install git-lfs)")
            exit()

        self._run_shell(
            "git",
            "clone",
            "--branch",
            version,
            "--single-branch",
            "https://github.com/ANHIG/IMGTHLA.git",
            self.db_folder,
        )
        self._run_shell("git", "lfs", "pull", cwd=self.db_folder)

    def _get_name(self, search_name: str) -> set[str]:
        """Extract name from file pattern"""
        arr_files = glob(search_name)
        return set([f.split("/")[-1].split("_")[0] for f in arr_files])

    def list_db_gene(self, filetype: TypeSet) -> list[str]:
        """List the gene in folder"""
        drb = set(["DRB1", "DRB3", "DRB4", "DRB5"])
        if "gen" in filetype:
            names = names_gen = self._get_name(f"{self.db_folder}/msf/*_gen.msf") | drb
        if "nuc" in filetype:
            names = names_nuc = self._get_name(f"{self.db_folder}/msf/*_nuc.msf") | drb
        if "gen" in filetype and "nuc" in filetype:
            # HLA-E nuc has 7 bases differences from exon7 and exon8
            names = (names_gen & names_nuc) - set("E")
        return sorted(names)

    def read_db_gene(self, gene: str, filetype: TypeSet) -> Genemsa:
        """
        Read `msf/{gene}_{filetype}.msf` and hla.dat.

        If both `gen` and `nuc` are given, it will merge them.
        """
        if not hasattr(self, "dat"):
            self.logger.debug(f"Reading hla.dat")
            self.dat = dat.read_dat_block(f"{self.db_folder}/hla.dat")

        if "gen" in filetype:
            msa_gen = Genemsa.read_msf_file(f"{self.db_folder}/msf/{gene}_gen.msf")
            msa_gen.gene_name = gene
            msa_gen = dat.apply_dat_info_on_msa(msa_gen, self.dat, seq_type="gen")
            self.logger.debug(f"{msa_gen}")

        if "nuc" in filetype:
            # Special Case: DRB* nuc are in DRB_nuc.txt
            if gene.startswith("DRB"):
                msa_nuc = Genemsa.read_msf_file(f"{self.db_folder}/msf/DRB_nuc.msf")
                msa_nuc = msa_nuc.select_allele(gene + ".*")
            else:
                msa_nuc = Genemsa.read_msf_file(f"{self.db_folder}/msf/{gene}_nuc.msf")
            msa_nuc.gene_name = gene
            msa_nuc = dat.apply_dat_info_on_msa(msa_nuc, self.dat, seq_type="nuc")
            self.logger.debug(f"{msa_nuc}")

        if "gen" in filetype and "nuc" in filetype:
            # remove some gen not included in nuc
            diff_name = list(
                set(msa_gen.get_sequence_names()) - set(msa_nuc.get_sequence_names())
            )
            if diff_name:
                self.logger.warning(
                    f"Remove alleles existed in gen but not in nuc: {diff_name}"
                )
            msa_gen = msa_gen.remove_allele(diff_name)

            # merge
            msa_merged = msa_gen.merge_exon(msa_nuc)
            self.logger.debug(f"{msa_merged}")
            return msa_merged
        elif "gen" in filetype:
            return msa_gen
        elif "nuc" in filetype:
            return msa_nuc
        raise ValueError("gen or nuc are not exist in filetype")

__getitem__(self, index: str) -> Genemsa inherited special

Source code in pyhlamsa/gene_family/hla_ex.py
def __getitem__(self, index: str) -> Genemsa:
    """Get specific gene's msa"""
    return self.genes[index]

__init__(self, genes: Union[str, Iterable[str]] = None, filetype: Union[str, Iterable[str]] = ['gen', 'nuc'], imgt_folder: str = '', version: str = 'Latest') special

Parameters:

Name Type Description Default
genes str | list[str]

A list of genes you want to read.

Set None if you want read all gene in HLA

None
filetype str | list[str] | set[str]

A list of filetype.

If both gen and nuc are given, it will merge them automatically.

['gen', 'nuc']
imgt_folder str

Path to your IMGT/HLA root folder

You can manually download https://github.com/ANHIG/IMGTHLA/ and checkout to specific branch

Or it will automatically download the database with assigned version to imgt_folder. Default is ./IMGT_v{verion}

''
version str

IMGT version you want to download (e.g. 3470 for 3.47.0)

If imgt_folder is existed, this value will be ignored.

You can use Latest to get the latest version

'Latest'
Source code in pyhlamsa/gene_family/hla_ex.py
def __init__(
    self,
    genes: GeneSet = None,
    filetype: TypeSet = ["gen", "nuc"],
    imgt_folder: str = "",
    version: str = "Latest",
):
    """
    Args:
        genes (str | list[str]): A list of genes you want to read.

            Set None if you want read all gene in HLA

        filetype (str | list[str] | set[str]): A list of filetype.

            If both `gen` and `nuc` are given, it will merge them automatically.

        imgt_folder (str): Path to your IMGT/HLA root folder

            You can manually download <https://github.com/ANHIG/IMGTHLA/> and
            checkout to specific branch

            Or it will automatically download the database with assigned `version` to
            `imgt_folder`. Default is `./IMGT_v{verion}`

        version (str): IMGT version you want to download (e.g. 3470 for 3.47.0)

            If `imgt_folder` is existed, this value will be ignored.

            You can use `Latest` to get the latest version
    """
    if not imgt_folder:
        imgt_folder = f"IMGT_v{version}"
    super().__init__(genes, filetype, db_folder=imgt_folder, version=version)

__iter__(self) -> Iterable[str] inherited special

Source code in pyhlamsa/gene_family/hla_ex.py
def __iter__(self) -> Iterable[str]:
    """Iter gene name like iter(dict)"""
    return iter(self.genes)

items(self) -> Iterable[tuple[str, pyhlamsa.gene.genemsa.Genemsa]] inherited

list gene name and msa like dict.items()

Source code in pyhlamsa/gene_family/hla_ex.py
def items(self) -> Iterable[tuple[str, Genemsa]]:
    """list gene name and msa like dict.items()"""
    return self.genes.items()

list_db_gene(self, filetype: Union[str, Iterable[str]]) -> list

List the gene in folder

Source code in pyhlamsa/gene_family/hla_ex.py
def list_db_gene(self, filetype: TypeSet) -> list[str]:
    """List the gene in folder"""
    drb = set(["DRB1", "DRB3", "DRB4", "DRB5"])
    if "gen" in filetype:
        names = names_gen = self._get_name(f"{self.db_folder}/msf/*_gen.msf") | drb
    if "nuc" in filetype:
        names = names_nuc = self._get_name(f"{self.db_folder}/msf/*_nuc.msf") | drb
    if "gen" in filetype and "nuc" in filetype:
        # HLA-E nuc has 7 bases differences from exon7 and exon8
        names = (names_gen & names_nuc) - set("E")
    return sorted(names)

list_genes(self) -> list inherited

List all the gene's name in this family

Source code in pyhlamsa/gene_family/hla_ex.py
def list_genes(self) -> list[str]:
    """List all the gene's name in this family"""
    return list(self.genes.keys())

read_db_gene(self, gene: str, filetype: Union[str, Iterable[str]]) -> Genemsa

Read msf/{gene}_{filetype}.msf and hla.dat.

If both gen and nuc are given, it will merge them.

Source code in pyhlamsa/gene_family/hla_ex.py
def read_db_gene(self, gene: str, filetype: TypeSet) -> Genemsa:
    """
    Read `msf/{gene}_{filetype}.msf` and hla.dat.

    If both `gen` and `nuc` are given, it will merge them.
    """
    if not hasattr(self, "dat"):
        self.logger.debug(f"Reading hla.dat")
        self.dat = dat.read_dat_block(f"{self.db_folder}/hla.dat")

    if "gen" in filetype:
        msa_gen = Genemsa.read_msf_file(f"{self.db_folder}/msf/{gene}_gen.msf")
        msa_gen.gene_name = gene
        msa_gen = dat.apply_dat_info_on_msa(msa_gen, self.dat, seq_type="gen")
        self.logger.debug(f"{msa_gen}")

    if "nuc" in filetype:
        # Special Case: DRB* nuc are in DRB_nuc.txt
        if gene.startswith("DRB"):
            msa_nuc = Genemsa.read_msf_file(f"{self.db_folder}/msf/DRB_nuc.msf")
            msa_nuc = msa_nuc.select_allele(gene + ".*")
        else:
            msa_nuc = Genemsa.read_msf_file(f"{self.db_folder}/msf/{gene}_nuc.msf")
        msa_nuc.gene_name = gene
        msa_nuc = dat.apply_dat_info_on_msa(msa_nuc, self.dat, seq_type="nuc")
        self.logger.debug(f"{msa_nuc}")

    if "gen" in filetype and "nuc" in filetype:
        # remove some gen not included in nuc
        diff_name = list(
            set(msa_gen.get_sequence_names()) - set(msa_nuc.get_sequence_names())
        )
        if diff_name:
            self.logger.warning(
                f"Remove alleles existed in gen but not in nuc: {diff_name}"
            )
        msa_gen = msa_gen.remove_allele(diff_name)

        # merge
        msa_merged = msa_gen.merge_exon(msa_nuc)
        self.logger.debug(f"{msa_merged}")
        return msa_merged
    elif "gen" in filetype:
        return msa_gen
    elif "nuc" in filetype:
        return msa_nuc
    raise ValueError("gen or nuc are not exist in filetype")

KIRmsa (Familymsa)

A KIR interface that read MSA from MSF and KIR.dat

Attributes:

Name Type Description
genes dict[str, Genemsa]

The msa object for each gene

Source code in pyhlamsa/gene_family/kir.py
class KIRmsa(Familymsa):
    """
    A KIR interface that read MSA from MSF and KIR.dat

    Attributes:
        genes (dict[str, Genemsa]): The msa object for each gene
    """

    def __init__(
        self,
        genes: GeneSet = None,
        filetype: TypeSet = ["gen", "nuc"],
        ipd_folder: str = "",
        version: str = "Latest",
    ):
        """
        Args:
            genes (str | list[str]): A list of genes you want to read.

                Set None if you want read all gene in HLA

            filetype (str | list[str] | set[str]): A list of filetype.

                If both `gen` and `nuc` are given, it will merge them automatically.

            ipd_folder (str): Path to your IPD/KIR folder

                You can manually download <https://github.com/ANHIG/IPDKIR> and
                checkout to specific branch

                Or it will automatically download the database with assigned `version` to
                `ipd_folder`. Default is `./kIR_v{verion}`

            version (str): IMGT version you want to download (2110 for version 2.11.0)

                If `ipd_folder` is existed, this value will be ignored.

                version='Latest' to get latest version, however sometime it cannot work
                because database may change the format, or contains bugs
                (e.g. 2.11.0 https://github.com/ANHIG/IPDKIR/issues/44)
        """
        # Why not version 2110 -> 2DL4,2DL5 has exon 4
        if not ipd_folder:
            ipd_folder = f"KIR_v{version}"
        super().__init__(genes, filetype, db_folder=ipd_folder, version=version)

    def _download_db(self, version: str = "Latest") -> None:
        """
        Download the KIR to `IPDKIR`
        """
        self._run_shell(
            "git",
            "clone",
            "--branch",
            version,
            "--single-branch",
            "https://github.com/ANHIG/IPDKIR",
            self.db_folder,
        )

    def _get_name(self, search_name: str) -> set[str]:
        """Extract name from file pattern"""
        arr_files = glob(search_name)
        return set([f.split("/")[-1].split("_")[0] for f in arr_files])

    def list_db_gene(self, filetype: TypeSet) -> list[str]:
        """List the gene in folder"""
        if "gen" in filetype:
            names = names_gen = self._get_name(f"{self.db_folder}/msf/*_gen.msf")
        if "nuc" in filetype:
            # Most's of E nuc is different from dat record
            names = names_nuc = self._get_name(f"{self.db_folder}/msf/*_nuc.msf")
        if "gen" in filetype and "nuc" in filetype:
            names = names_gen & names_nuc
        return sorted(names)

    def read_db_gene(self, gene: str, filetype: TypeSet) -> Genemsa:
        """
        Read `{gene}_{filetype}.msf and kir.dat`.

        If both `gen` and `nuc` are given, it will merge them.
        """
        if not hasattr(self, "dat"):
            self.logger.debug(f"Reading kir.dat")
            if os.path.exists(f"{self.db_folder}/KIR.dat"):
                self.dat = dat.read_dat_block(f"{self.db_folder}/KIR.dat")
            else:
                self.dat = dat.read_dat_block(f"{self.db_folder}/kir.dat")

        if "gen" in filetype:
            msa_gen = Genemsa.read_msf_file(f"{self.db_folder}/msf/{gene}_gen.msf")
            msa_gen.gene_name = gene
            msa_gen = dat.apply_dat_info_on_msa(msa_gen, self.dat, seq_type="gen")
            self.logger.debug(f"Gen {msa_gen}")

        if "nuc" in filetype:
            msa_nuc = Genemsa.read_msf_file(f"{self.db_folder}/msf/{gene}_nuc.msf")
            msa_nuc.gene_name = gene
            msa_nuc = dat.apply_dat_info_on_msa(msa_nuc, self.dat, seq_type="nuc")
            self.logger.debug(f"Nuc {msa_nuc}")

        if "gen" in filetype and "nuc" in filetype:
            # remove some gen not included in nuc
            diff_name = list(
                set(msa_gen.get_sequence_names()) - set(msa_nuc.get_sequence_names())
            )
            if diff_name:
                self.logger.warning(
                    f"Remove alleles doesn't exist in gen and nuc either: {diff_name}"
                )
            msa_gen = msa_gen.remove_allele(diff_name)

            # specical case
            # exon 3 is pseudo exon
            # so, fill with gene's exon3
            gene_has_pseudo_exon3 = [
                "KIR2DL1",
                "KIR2DL2",
                "KIR2DL3",
                "KIR2DP1",
                "KIR2DS1",
                "KIR2DS2",
                "KIR2DS3",
                "KIR2DS4",
                "KIR2DS5",
            ]
            if gene in gene_has_pseudo_exon3:
                exon3 = msa_gen.select_block([5])
                for name in set(msa_nuc) - set(msa_gen):
                    exon3.append(name, "-" * exon3.get_length())
                msa_nuc = (
                    msa_nuc.select_block(list(range(0, 2)))
                    + exon3
                    + msa_nuc.select_block(list(range(3, len(msa_nuc.blocks))))
                )

            # merge
            msa_merged = msa_gen.merge_exon(msa_nuc)
            self.logger.debug(f"Merged {msa_merged}")
            return msa_merged

        elif "gen" in filetype:
            return msa_gen
        elif "nuc" in filetype:
            return msa_nuc
        raise ValueError("gen or nuc are not exist in filetype")

__getitem__(self, index: str) -> Genemsa inherited special

Source code in pyhlamsa/gene_family/kir.py
def __getitem__(self, index: str) -> Genemsa:
    """Get specific gene's msa"""
    return self.genes[index]

__init__(self, genes: Union[str, Iterable[str]] = None, filetype: Union[str, Iterable[str]] = ['gen', 'nuc'], ipd_folder: str = '', version: str = 'Latest') special

Parameters:

Name Type Description Default
genes str | list[str]

A list of genes you want to read.

Set None if you want read all gene in HLA

None
filetype str | list[str] | set[str]

A list of filetype.

If both gen and nuc are given, it will merge them automatically.

['gen', 'nuc']
ipd_folder str

Path to your IPD/KIR folder

You can manually download https://github.com/ANHIG/IPDKIR and checkout to specific branch

Or it will automatically download the database with assigned version to ipd_folder. Default is ./kIR_v{verion}

''
version str

IMGT version you want to download (2110 for version 2.11.0)

If ipd_folder is existed, this value will be ignored.

version='Latest' to get latest version, however sometime it cannot work because database may change the format, or contains bugs (e.g. 2.11.0 https://github.com/ANHIG/IPDKIR/issues/44)

'Latest'
Source code in pyhlamsa/gene_family/kir.py
def __init__(
    self,
    genes: GeneSet = None,
    filetype: TypeSet = ["gen", "nuc"],
    ipd_folder: str = "",
    version: str = "Latest",
):
    """
    Args:
        genes (str | list[str]): A list of genes you want to read.

            Set None if you want read all gene in HLA

        filetype (str | list[str] | set[str]): A list of filetype.

            If both `gen` and `nuc` are given, it will merge them automatically.

        ipd_folder (str): Path to your IPD/KIR folder

            You can manually download <https://github.com/ANHIG/IPDKIR> and
            checkout to specific branch

            Or it will automatically download the database with assigned `version` to
            `ipd_folder`. Default is `./kIR_v{verion}`

        version (str): IMGT version you want to download (2110 for version 2.11.0)

            If `ipd_folder` is existed, this value will be ignored.

            version='Latest' to get latest version, however sometime it cannot work
            because database may change the format, or contains bugs
            (e.g. 2.11.0 https://github.com/ANHIG/IPDKIR/issues/44)
    """
    # Why not version 2110 -> 2DL4,2DL5 has exon 4
    if not ipd_folder:
        ipd_folder = f"KIR_v{version}"
    super().__init__(genes, filetype, db_folder=ipd_folder, version=version)

__iter__(self) -> Iterable[str] inherited special

Source code in pyhlamsa/gene_family/kir.py
def __iter__(self) -> Iterable[str]:
    """Iter gene name like iter(dict)"""
    return iter(self.genes)

items(self) -> Iterable[tuple[str, pyhlamsa.gene.genemsa.Genemsa]] inherited

list gene name and msa like dict.items()

Source code in pyhlamsa/gene_family/kir.py
def items(self) -> Iterable[tuple[str, Genemsa]]:
    """list gene name and msa like dict.items()"""
    return self.genes.items()

list_db_gene(self, filetype: Union[str, Iterable[str]]) -> list

List the gene in folder

Source code in pyhlamsa/gene_family/kir.py
def list_db_gene(self, filetype: TypeSet) -> list[str]:
    """List the gene in folder"""
    if "gen" in filetype:
        names = names_gen = self._get_name(f"{self.db_folder}/msf/*_gen.msf")
    if "nuc" in filetype:
        # Most's of E nuc is different from dat record
        names = names_nuc = self._get_name(f"{self.db_folder}/msf/*_nuc.msf")
    if "gen" in filetype and "nuc" in filetype:
        names = names_gen & names_nuc
    return sorted(names)

list_genes(self) -> list inherited

List all the gene's name in this family

Source code in pyhlamsa/gene_family/kir.py
def list_genes(self) -> list[str]:
    """List all the gene's name in this family"""
    return list(self.genes.keys())

read_db_gene(self, gene: str, filetype: Union[str, Iterable[str]]) -> Genemsa

Read {gene}_{filetype}.msf and kir.dat.

If both gen and nuc are given, it will merge them.

Source code in pyhlamsa/gene_family/kir.py
def read_db_gene(self, gene: str, filetype: TypeSet) -> Genemsa:
    """
    Read `{gene}_{filetype}.msf and kir.dat`.

    If both `gen` and `nuc` are given, it will merge them.
    """
    if not hasattr(self, "dat"):
        self.logger.debug(f"Reading kir.dat")
        if os.path.exists(f"{self.db_folder}/KIR.dat"):
            self.dat = dat.read_dat_block(f"{self.db_folder}/KIR.dat")
        else:
            self.dat = dat.read_dat_block(f"{self.db_folder}/kir.dat")

    if "gen" in filetype:
        msa_gen = Genemsa.read_msf_file(f"{self.db_folder}/msf/{gene}_gen.msf")
        msa_gen.gene_name = gene
        msa_gen = dat.apply_dat_info_on_msa(msa_gen, self.dat, seq_type="gen")
        self.logger.debug(f"Gen {msa_gen}")

    if "nuc" in filetype:
        msa_nuc = Genemsa.read_msf_file(f"{self.db_folder}/msf/{gene}_nuc.msf")
        msa_nuc.gene_name = gene
        msa_nuc = dat.apply_dat_info_on_msa(msa_nuc, self.dat, seq_type="nuc")
        self.logger.debug(f"Nuc {msa_nuc}")

    if "gen" in filetype and "nuc" in filetype:
        # remove some gen not included in nuc
        diff_name = list(
            set(msa_gen.get_sequence_names()) - set(msa_nuc.get_sequence_names())
        )
        if diff_name:
            self.logger.warning(
                f"Remove alleles doesn't exist in gen and nuc either: {diff_name}"
            )
        msa_gen = msa_gen.remove_allele(diff_name)

        # specical case
        # exon 3 is pseudo exon
        # so, fill with gene's exon3
        gene_has_pseudo_exon3 = [
            "KIR2DL1",
            "KIR2DL2",
            "KIR2DL3",
            "KIR2DP1",
            "KIR2DS1",
            "KIR2DS2",
            "KIR2DS3",
            "KIR2DS4",
            "KIR2DS5",
        ]
        if gene in gene_has_pseudo_exon3:
            exon3 = msa_gen.select_block([5])
            for name in set(msa_nuc) - set(msa_gen):
                exon3.append(name, "-" * exon3.get_length())
            msa_nuc = (
                msa_nuc.select_block(list(range(0, 2)))
                + exon3
                + msa_nuc.select_block(list(range(3, len(msa_nuc.blocks))))
            )

        # merge
        msa_merged = msa_gen.merge_exon(msa_nuc)
        self.logger.debug(f"Merged {msa_merged}")
        return msa_merged

    elif "gen" in filetype:
        return msa_gen
    elif "nuc" in filetype:
        return msa_nuc
    raise ValueError("gen or nuc are not exist in filetype")

CYPmsa (Familymsa)

A CYP interface that read MSA from fasta and haplotypes.tsv

Attributes:

Name Type Description
genes dict[str, Genemsa]

The msa object for each gene

Source code in pyhlamsa/gene_family/cyp.py
class CYPmsa(Familymsa):
    """
    A CYP interface that read MSA from fasta and haplotypes.tsv

    Attributes:
        genes (dict[str, Genemsa]): The msa object for each gene
    """

    def __init__(
        self,
        genes: GeneSet = None,
        filetype: TypeSet = [],
        pharmvar_folder: str = "",
        version: str = "5.2.2",
    ):
        """
        Args:
            genes (str | list[str]): A list of genes you want to read.

                Set None if you want read all gene in HLA

            filetype: Ignore

            pharmvar_folder (str): Path to your pharmvar folder

                You should manually download <https://www.pharmvar.org/download>
                and unzip it.

            version (str): PharmVar version you want to download

                Not works now
        """
        if not pharmvar_folder:
            pharmvar_folder = f"pharmvar-{version}"
        super().__init__(genes, filetype, db_folder=pharmvar_folder, version=version)

    def _download_db(self, version: str = "") -> None:
        """
        Download the CYP from https://www.pharmvar.org/download
        """
        raise ValueError(
            "You should download CYP genes from https://www.pharmvar.org/download"
        )

    def list_db_gene(self, filetype: TypeSet = []) -> list[str]:
        """List the gene in folder"""
        return sorted(os.listdir(self.db_folder))

    def read_db_gene(self, gene: str, filetype: TypeSet = []) -> Genemsa:
        """
        Read `{gene}/{gene}.haplotypes.fasta` and `{gene}/{gene}.haplotypes.tsv`

        `filetype` will be ignored now
        """
        # read fasta
        ref_seqs = {}
        for seq in SeqIO.parse(
            f"{self.db_folder}/{gene}/{gene}.haplotypes.fasta", "fasta"
        ):
            # split allele name
            # e.g.  rs75017182, rs56038477 PV01077 NG_008807.2 PharmVar Version:5.1.10
            for name in seq.description.replace(", ", ",").split(" ")[0].split(","):
                ref_seqs[name.strip()] = str(seq.seq)
        self.logger.debug(f"Read sequence {ref_seqs.keys()}")

        # read tsv
        # split by '\t' and ignore header
        var_text = open(glob(f"{self.db_folder}/{gene}/RefSeqGene/*.haplotypes.tsv")[0])
        table = [i.strip().split("\t") for i in var_text if not i.startswith("#")][1:]

        # Get Reference sequence
        reference_seq = None
        reference_name = None
        for i in table:
            if i[0] not in ref_seqs:
                continue
            if i[3] == "REFERENCE" and ref_seqs[i[0]][0]:
                if reference_seq:
                    assert reference_seq == ref_seqs[i[0]]
                    continue
                reference_name = i[0]
                reference_seq = ref_seqs[reference_name]
        if not reference_seq:
            # hard-coded for DPYD, because it's reference `NG_008807.2` is missing
            if gene == "DPYD":
                reference_name = "rs111858276"
                reference_seq = ref_seqs[reference_name]
                reference_seq = (
                    reference_seq[: 376460 - 1] + "A" + reference_seq[376460:]
                )
            else:
                raise ValueError(f"Not reference found in {gene}")

        assert reference_name is not None
        # Fill with Reference and reference is the first one
        alleles = {reference_name: ""}
        alleles.update(
            {allele_name: reference_seq for allele_name in set(i[0] for i in table)}
        )
        self.logger.debug(f"Read vcf {alleles.keys()}")

        # Remove duplciate
        # in 5.1.10, there exist two same row
        # CYP2D6*149  CYP2D6  rs59421388  NG_008376.4 8203    8203    G   A   substitution
        # CYP2D6*149  CYP2D6  rs59421388  NG_008376.4 8203    8203    G   A   substitution
        table_unique1 = filter(lambda i: i[3] != "REFERENCE", table)
        table_unique2 = cast(tuple, map(tuple, table_unique1))  # type: ignore
        table_unique = cast(list, map(list, set(list(table_unique2))))  # type: ignore

        # Reconstruct msa from VCF variant
        # VCF type: substitution and deleteion, insertion
        # Table Header: ['Haplotype Name', 'Gene', 'rsID', 'ReferenceSequence',
        #   'Variant Start', 'Variant Stop', 'Reference Allele', 'Variant Allele', 'Type']
        # Insertion will move the index, so insert from last position first
        for i in sorted(table_unique, key=lambda i: -int(i[4])):
            if i[8] == "substitution":
                pos = int(i[4]) - 1
                end = pos + 1
            elif i[8] == "deletion":
                pos = int(i[4]) - 1
                end = pos + len(i[6])
                i[7] = "-" * len(i[6])
            elif i[8] == "insertion":
                # hard coded bugs
                if i[0].startswith("CYP2A6*31"):
                    # CYP2A6*31.001   4530    4552    CCCCTTCCTGAGACCCTTAACCC -   deletion
                    # CYP2A6*31.001   4530    4531    -   AATCCATATGTGGAATCTG insertion
                    continue
                if i[0].startswith("CYP2A6*27"):
                    # CYP2A6*27.001   7183    7184    -   A   insertion
                    continue

                pos = int(i[4])
                end = int(i[4]) + len(i[7])
                i[6] = "-" * len(i[7])
                # check the gap is already inserted by others
                # if false: insert '-' for all alleles
                # else: replace the '-' with insertion seq
                if set(alleles[i[0]][pos:end]) != set("-"):
                    for allele_name in alleles:
                        alleles[allele_name] = (
                            alleles[allele_name][:pos]
                            + i[6]
                            + alleles[allele_name][pos:]
                        )
            else:
                raise ValueError(f"Unknown Type {i[8]}")

            assert alleles[i[0]][pos:end] == i[6]
            alleles[i[0]] = alleles[i[0]][:pos] + i[7] + alleles[i[0]][end:]

        # split allele name (alleles.key() will change)
        # e.g. rs75017182, rs56038477  DPYD    rs75017182  NG_008807.2 346167
        for allele_name in list(alleles.keys()):
            if ", " in allele_name:
                for an in allele_name.split(","):
                    alleles[an.strip()] = alleles[allele_name]
                del alleles[allele_name]

        # double check
        length = None
        for allele_name in alleles:
            if length is None:
                length = len(alleles[allele_name])
            else:
                assert length == len(alleles[allele_name])
            if alleles[allele_name].replace("-", "") != ref_seqs[allele_name]:
                raise ValueError(f"{allele_name} is not same as reference")

        assert length
        msa = Genemsa(gene)
        msa.alleles = alleles
        msa.blocks = [BlockInfo(length=length)]
        return msa.assume_label("other")

__getitem__(self, index: str) -> Genemsa inherited special

Source code in pyhlamsa/gene_family/cyp.py
def __getitem__(self, index: str) -> Genemsa:
    """Get specific gene's msa"""
    return self.genes[index]

__init__(self, genes: Union[str, Iterable[str]] = None, filetype: Union[str, Iterable[str]] = [], pharmvar_folder: str = '', version: str = '5.2.2') special

Parameters:

Name Type Description Default
genes str | list[str]

A list of genes you want to read.

Set None if you want read all gene in HLA

None
filetype Union[str, Iterable[str]]

Ignore

[]
pharmvar_folder str

Path to your pharmvar folder

You should manually download https://www.pharmvar.org/download and unzip it.

''
version str

PharmVar version you want to download

Not works now

'5.2.2'
Source code in pyhlamsa/gene_family/cyp.py
def __init__(
    self,
    genes: GeneSet = None,
    filetype: TypeSet = [],
    pharmvar_folder: str = "",
    version: str = "5.2.2",
):
    """
    Args:
        genes (str | list[str]): A list of genes you want to read.

            Set None if you want read all gene in HLA

        filetype: Ignore

        pharmvar_folder (str): Path to your pharmvar folder

            You should manually download <https://www.pharmvar.org/download>
            and unzip it.

        version (str): PharmVar version you want to download

            Not works now
    """
    if not pharmvar_folder:
        pharmvar_folder = f"pharmvar-{version}"
    super().__init__(genes, filetype, db_folder=pharmvar_folder, version=version)

__iter__(self) -> Iterable[str] inherited special

Source code in pyhlamsa/gene_family/cyp.py
def __iter__(self) -> Iterable[str]:
    """Iter gene name like iter(dict)"""
    return iter(self.genes)

items(self) -> Iterable[tuple[str, pyhlamsa.gene.genemsa.Genemsa]] inherited

list gene name and msa like dict.items()

Source code in pyhlamsa/gene_family/cyp.py
def items(self) -> Iterable[tuple[str, Genemsa]]:
    """list gene name and msa like dict.items()"""
    return self.genes.items()

list_db_gene(self, filetype: Union[str, Iterable[str]] = []) -> list

List the gene in folder

Source code in pyhlamsa/gene_family/cyp.py
def list_db_gene(self, filetype: TypeSet = []) -> list[str]:
    """List the gene in folder"""
    return sorted(os.listdir(self.db_folder))

list_genes(self) -> list inherited

List all the gene's name in this family

Source code in pyhlamsa/gene_family/cyp.py
def list_genes(self) -> list[str]:
    """List all the gene's name in this family"""
    return list(self.genes.keys())

read_db_gene(self, gene: str, filetype: Union[str, Iterable[str]] = []) -> Genemsa

Read {gene}/{gene}.haplotypes.fasta and {gene}/{gene}.haplotypes.tsv

filetype will be ignored now

Source code in pyhlamsa/gene_family/cyp.py
def read_db_gene(self, gene: str, filetype: TypeSet = []) -> Genemsa:
    """
    Read `{gene}/{gene}.haplotypes.fasta` and `{gene}/{gene}.haplotypes.tsv`

    `filetype` will be ignored now
    """
    # read fasta
    ref_seqs = {}
    for seq in SeqIO.parse(
        f"{self.db_folder}/{gene}/{gene}.haplotypes.fasta", "fasta"
    ):
        # split allele name
        # e.g.  rs75017182, rs56038477 PV01077 NG_008807.2 PharmVar Version:5.1.10
        for name in seq.description.replace(", ", ",").split(" ")[0].split(","):
            ref_seqs[name.strip()] = str(seq.seq)
    self.logger.debug(f"Read sequence {ref_seqs.keys()}")

    # read tsv
    # split by '\t' and ignore header
    var_text = open(glob(f"{self.db_folder}/{gene}/RefSeqGene/*.haplotypes.tsv")[0])
    table = [i.strip().split("\t") for i in var_text if not i.startswith("#")][1:]

    # Get Reference sequence
    reference_seq = None
    reference_name = None
    for i in table:
        if i[0] not in ref_seqs:
            continue
        if i[3] == "REFERENCE" and ref_seqs[i[0]][0]:
            if reference_seq:
                assert reference_seq == ref_seqs[i[0]]
                continue
            reference_name = i[0]
            reference_seq = ref_seqs[reference_name]
    if not reference_seq:
        # hard-coded for DPYD, because it's reference `NG_008807.2` is missing
        if gene == "DPYD":
            reference_name = "rs111858276"
            reference_seq = ref_seqs[reference_name]
            reference_seq = (
                reference_seq[: 376460 - 1] + "A" + reference_seq[376460:]
            )
        else:
            raise ValueError(f"Not reference found in {gene}")

    assert reference_name is not None
    # Fill with Reference and reference is the first one
    alleles = {reference_name: ""}
    alleles.update(
        {allele_name: reference_seq for allele_name in set(i[0] for i in table)}
    )
    self.logger.debug(f"Read vcf {alleles.keys()}")

    # Remove duplciate
    # in 5.1.10, there exist two same row
    # CYP2D6*149  CYP2D6  rs59421388  NG_008376.4 8203    8203    G   A   substitution
    # CYP2D6*149  CYP2D6  rs59421388  NG_008376.4 8203    8203    G   A   substitution
    table_unique1 = filter(lambda i: i[3] != "REFERENCE", table)
    table_unique2 = cast(tuple, map(tuple, table_unique1))  # type: ignore
    table_unique = cast(list, map(list, set(list(table_unique2))))  # type: ignore

    # Reconstruct msa from VCF variant
    # VCF type: substitution and deleteion, insertion
    # Table Header: ['Haplotype Name', 'Gene', 'rsID', 'ReferenceSequence',
    #   'Variant Start', 'Variant Stop', 'Reference Allele', 'Variant Allele', 'Type']
    # Insertion will move the index, so insert from last position first
    for i in sorted(table_unique, key=lambda i: -int(i[4])):
        if i[8] == "substitution":
            pos = int(i[4]) - 1
            end = pos + 1
        elif i[8] == "deletion":
            pos = int(i[4]) - 1
            end = pos + len(i[6])
            i[7] = "-" * len(i[6])
        elif i[8] == "insertion":
            # hard coded bugs
            if i[0].startswith("CYP2A6*31"):
                # CYP2A6*31.001   4530    4552    CCCCTTCCTGAGACCCTTAACCC -   deletion
                # CYP2A6*31.001   4530    4531    -   AATCCATATGTGGAATCTG insertion
                continue
            if i[0].startswith("CYP2A6*27"):
                # CYP2A6*27.001   7183    7184    -   A   insertion
                continue

            pos = int(i[4])
            end = int(i[4]) + len(i[7])
            i[6] = "-" * len(i[7])
            # check the gap is already inserted by others
            # if false: insert '-' for all alleles
            # else: replace the '-' with insertion seq
            if set(alleles[i[0]][pos:end]) != set("-"):
                for allele_name in alleles:
                    alleles[allele_name] = (
                        alleles[allele_name][:pos]
                        + i[6]
                        + alleles[allele_name][pos:]
                    )
        else:
            raise ValueError(f"Unknown Type {i[8]}")

        assert alleles[i[0]][pos:end] == i[6]
        alleles[i[0]] = alleles[i[0]][:pos] + i[7] + alleles[i[0]][end:]

    # split allele name (alleles.key() will change)
    # e.g. rs75017182, rs56038477  DPYD    rs75017182  NG_008807.2 346167
    for allele_name in list(alleles.keys()):
        if ", " in allele_name:
            for an in allele_name.split(","):
                alleles[an.strip()] = alleles[allele_name]
            del alleles[allele_name]

    # double check
    length = None
    for allele_name in alleles:
        if length is None:
            length = len(alleles[allele_name])
        else:
            assert length == len(alleles[allele_name])
        if alleles[allele_name].replace("-", "") != ref_seqs[allele_name]:
            raise ValueError(f"{allele_name} is not same as reference")

    assert length
    msa = Genemsa(gene)
    msa.alleles = alleles
    msa.blocks = [BlockInfo(length=length)]
    return msa.assume_label("other")