Experience: 3-5 years
Employment: Full-time
Work Format: Hybrid, Remote, Onsite
Grade: Middle
Specialization: Backend
Industry: IT & Tech
Company Type: Corporation
Developer of a storage format for data in YTsaurus dynamic tables
YTsaurus is a software product for building large data lakes, where data can be processed under different paradigms: both MapReduce (background processing) and NewSQL (real-time). YTsaurus has its own data storage layer and its own implementations of storage formats, designed to be efficient on Yandex's real-world data and volumes.
You will work on the data storage layer and adapt it for fast analytics tasks.
Hierarchical data format
One of the key tasks is to develop a compression format for hierarchical data that supports both efficient reads of large ranges and fast retrieval of a single document or part of one.
This task involves working with a variety of compression mechanisms as well as low-level engineering at the CPU and memory-access level. You will use SIMD instructions and adapt code to the processor's memory hierarchy. We expect you to love algorithms and efficient C++ programming!
Historical data format for analytics tasks
Dynamic tables (as we call the NewSQL component of YTsaurus) traditionally use a data format tailored for transaction processing. In this format, the history of each value is stored together with timestamps, which makes it possible to provide snapshot isolation. That history is redundant for analytics tasks, for which simpler formats are better suited. You will need to find a compromise and adapt history storage in dynamic tables so that they suit mixed transactional-analytical workloads.
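To make the trade-off concrete, here is a minimal C++ sketch of how timestamped history can serve a snapshot read. It is an illustration under stated assumptions, not the actual YTsaurus format: the names Timestamp, VersionedValue and ReadAt are hypothetical, and each cell is assumed to keep its versions sorted by commit timestamp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not the YTsaurus on-disk layout.
using Timestamp = std::uint64_t;

struct VersionedValue {
    Timestamp commit_ts;  // timestamp at which this version was committed
    std::string data;     // value written by that commit
};

// Returns the newest version visible at read_ts (commit_ts <= read_ts),
// or std::nullopt if the cell had no committed value yet.
// Precondition: versions are sorted by ascending commit_ts.
std::optional<std::string> ReadAt(const std::vector<VersionedValue>& versions,
                                  Timestamp read_ts) {
    assert(std::is_sorted(
        versions.begin(), versions.end(),
        [](const VersionedValue& a, const VersionedValue& b) {
            return a.commit_ts < b.commit_ts;
        }));

    // Find the first version committed strictly after read_ts.
    auto it = std::upper_bound(
        versions.begin(), versions.end(), read_ts,
        [](Timestamp ts, const VersionedValue& v) { return ts < v.commit_ts; });

    if (it == versions.begin()) {
        return std::nullopt;  // nothing was committed at or before read_ts
    }
    return std::prev(it)->data;  // last version with commit_ts <= read_ts
}
```

For a history {10: "a", 20: "b", 30: "c"}, a reader at timestamp 25 sees "b". An analytical scan that only needs the latest state pays for storing and skipping the older versions, which is the redundancy described above.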
Analytical indexes
Analytics uses its own kinds of indexes: SMA and star-tree. You will need to add them to the data formats, implement their construction, and use them in queries. This task will require diving into the entire SQL query processing cycle.
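As a rough illustration of the idea behind an SMA index (read here as "small materialized aggregates", that is, per-block summaries such as min and max), the C++ sketch below builds min/max summaries over fixed-size blocks of one column and uses them to skip blocks during a range filter. The names BlockSma, BuildSma and CountInRange, the in-memory layout and the block size are assumptions made for this example, not part of YTsaurus.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-block summary: min and max of the values in [begin, end).
struct BlockSma {
    std::int64_t min;
    std::int64_t max;
    std::size_t begin;
    std::size_t end;
};

// Builds one min/max summary per block of block_size consecutive values.
std::vector<BlockSma> BuildSma(const std::vector<std::int64_t>& column,
                               std::size_t block_size) {
    std::vector<BlockSma> result;
    if (block_size == 0) {
        return result;
    }
    for (std::size_t begin = 0; begin < column.size(); begin += block_size) {
        const std::size_t end = std::min(begin + block_size, column.size());
        const auto first = column.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last = column.begin() + static_cast<std::ptrdiff_t>(end);
        const auto [lo, hi] = std::minmax_element(first, last);
        result.push_back({*lo, *hi, begin, end});
    }
    return result;
}

struct ScanResult {
    std::size_t matches;         // values v with lo <= v <= hi
    std::size_t blocks_scanned;  // blocks that could not be skipped
};

// Counts values in [lo, hi], reading only blocks whose [min, max] overlaps it.
ScanResult CountInRange(const std::vector<std::int64_t>& column,
                        const std::vector<BlockSma>& sma,
                        std::int64_t lo, std::int64_t hi) {
    ScanResult result{0, 0};
    for (const BlockSma& block : sma) {
        if (block.max < lo || block.min > hi) {
            continue;  // the summary proves no value in this block can match
        }
        ++result.blocks_scanned;
        for (std::size_t i = block.begin; i < block.end; ++i) {
            if (column[i] >= lo && column[i] <= hi) {
                ++result.matches;
            }
        }
    }
    return result;
}
```

Skipping is most effective when the column is sorted or clustered, because block ranges then rarely overlap. On randomly ordered data most blocks overlap the filter and the summaries save little.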
More about backend development at Yandex is available in the Yandex for Backend channel.