How To Prepare Data
There are a couple of regular steps to prepare data:
1. Use a simple text replacement tool for the first pass of cleanup. Text replacement is an underrated method, especially when combined with regular expressions.
2. Use text processing tools in Linux to build stream pipelines. They handle large files well, and the pipelines can be reused.
3. Then, use a Python script for more complex calculations.
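
The steps above can be sketched in a single Python script. This is only an illustration, not a fixed recipe: the two cleanup rules (dropping HTML-like tags and collapsing whitespace) and the names `clean_lines` and `mean_length` are assumptions made up for the example. The generator handles one line at a time, so it behaves like a stage in a stream pipeline and does not load the whole file into memory.

```python
import re
from typing import Iterable, Iterator

# Hypothetical cleanup rules, chosen only for illustration:
# drop HTML-like tags, then collapse runs of whitespace.
TAG = re.compile(r"<[^>]+>")
SPACES = re.compile(r"\s+")


def clean_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned, non-empty lines one at a time (streaming)."""
    for line in lines:
        text = SPACES.sub(" ", TAG.sub(" ", line)).strip()
        if text:
            yield text


def mean_length(lines: Iterable[str]) -> float:
    """A stand-in for a 'more complex calculation': average line length."""
    total = 0
    count = 0
    for line in lines:
        total += len(line)
        count += 1
    return total / count if count else 0.0
```

Because `clean_lines` accepts any iterable of strings, it can read from an open file or from `sys.stdin`, which lets the script sit at the end of a shell pipeline.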
The most useful tricks:
- save revisions so that every result can be reproduced
- show a progress bar whenever you start processing, so you can estimate when it will finish
- add a cursor argument if possible, so processing can resume where it stopped; a pipe can break at any time
- use file caching to save responses from external services; this avoids repeated requests and helps you avoid getting blocked
- JSON is a convenient format to serialize/deserialize temporary structured data
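
A minimal sketch of how the cursor, file cache, JSON, and progress tricks can fit together, using only the standard library. The names `cached_call` and `process`, the cache layout (one JSON file per key, named by a SHA-256 hash), and the `fetch` callback standing in for an external service are all assumptions for illustration.

```python
import hashlib
import json
import sys
from pathlib import Path
from typing import Callable, Dict, List


def cached_call(cache_dir: Path, key: str, fetch: Callable[[str], Dict]) -> Dict:
    """Return the cached JSON response for key, calling fetch only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(key.encode("utf-8")).hexdigest() + ".json"
    path = cache_dir / name
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = fetch(key)  # hypothetical call to an external service
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


def process(items: List[str], cursor: int, cache_dir: Path,
            fetch: Callable[[str], Dict]) -> List[Dict]:
    """Process items starting at index cursor, printing simple progress."""
    results = []
    total = len(items)
    for index in range(cursor, total):
        results.append(cached_call(cache_dir, items[index], fetch))
        # Progress goes to stderr so stdout stays clean for piping.
        print(f"\r{index + 1}/{total}", end="", file=sys.stderr)
    print(file=sys.stderr)
    return results
```

If a run breaks at item 500, restarting with `cursor=500` skips the work already done, and anything fetched before the break is served from the cache instead of hitting the service again.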
20 Mar, 2025