anzaika/parse_flybase_annotation
Folders and files
| Name | Name | Last commit date | ||
|---|---|---|---|---|
Repository files navigation
Purpose
=======
Parse flybase gene annotation format.
Source Data
===========
Is a multiline text file, where each line has 11 tab separated columns.
0 | unknown | 585 |
1 | mrna_id | CG11023-RA |
2 | chromosome | chr2L |
3 | strand | + |
4 | mrna_start | 7528 | \ -- with UTRs
5 | mrna_stop | 9491 | /
6 | mrna_c_start | 7679 | \ -- without UTRs
7 | mrna_c_stop | 9276 | /
8 | seg_count | 3 |
9 | seg_starts | 7528,8228,8667, |
10 | seg_stops | 8116,8589,9491, |
gene_id can be inferred from mrna_id: CG11023-RA --> 11023.
Output
======
Two collections: mrnas and segments. Segments have their UTRs cut out.
Blueprints:
mrna:
* mrna_id (String)
* gene_id (String)
* strand (String) -- '+' or '-'
* chromosome (String)
* segments (Array[Array]) -- coordinates of segments, e.g.: [[1,2],[3,4]]
sorted 5' --> 3'
segment:
* start (Fixnum)
* stop (Fixnum)
* chromosome (String)
* mrnas (Array[String]) -- array of mrna_ids
* splicing (String) -- 'const' or 'alt'
Usage
=====
require 'parse_flybase_annotation'
result = ParseFlybaseAnnotation.parse('path_to_annotation_file')
segments_collection = result['segments']
mrnas_collection = result['mrnas']